
DATA HANDLING USING PANDAS-1
Intro to Python libraries
-By Shashi Bhushan
PANDAS:
 Pandas is an open-source library built mainly for working with relational or labeled
data easily and intuitively. It provides data structures and operations for
manipulating numerical data and time series. The library is built on top of the NumPy
library. Pandas is fast and offers high performance and productivity for users.

 Pandas provides two main data structures for manipulating data:

 Series
 DataFrame
SERIES:
A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer,
string, float, Python objects, etc.). The axis labels are collectively called the index. A Series can
be thought of as a single column in an Excel sheet. Labels need not be unique but must be of a
hashable type. The object supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index.
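The points about non-unique labels and the two styles of indexing can be illustrated with a minimal sketch. The data and labels below are made up for illustration only.

```python
import pandas as pd

# Hypothetical data: labels need not be unique ('a' appears twice).
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

# Label-based access with .loc; a repeated label returns every match.
both_a = s.loc["a"]   # a Series with two rows
only_b = s.loc["b"]   # a single value

# Position-based (integer) access with .iloc.
first = s.iloc[0]

print(both_a)
print(only_b, first)
```

Using .loc for labels and .iloc for positions keeps the two kinds of lookup explicit, which avoids ambiguity when the labels themselves are integers.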
CREATING A PANDAS SERIES
A pandas Series can be created using the following constructor −

pandas.Series(data, index, dtype, copy)

1. data: takes various forms such as an ndarray, list, dict, or constant (scalar).
2. index: labels must be hashable and the same length as data; they need not be unique.
Defaults to np.arange(n) if no index is passed.
3. dtype: the data type. If None, the data type is inferred.
4. copy: whether to copy the input data. Default False.
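As a rough sketch of how the four constructor parameters behave, the example below builds Series from a small NumPy array. The variable names and values are illustrative assumptions, not part of the original slides.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3])

# data + index + dtype: values are converted to float64 on construction.
s = pd.Series(data=values, index=["x", "y", "z"], dtype="float64")

# No index passed: labels default to 0..n-1.
t = pd.Series(values)

# copy=True asks pandas for its own copy, so editing the source array
# afterwards does not change this Series.
u = pd.Series(values, copy=True)
values[0] = 99

print(s.dtype)
print(list(t.index))
print(u.iloc[0])
```

Passing copy=True is the safe choice whenever the source array may be modified later.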

 In practice, a Pandas Series is usually created by loading a dataset from existing storage,
such as a SQL database, a CSV file, or an Excel file.
 A Pandas Series can also be created from a list, a dictionary, a scalar value, etc. Here are
some of the ways to create a Series:
A SERIES CAN BE CREATED USING VARIOUS INPUTS LIKE

 Array
 Dict
 Scalar value or constant

 A basic Series that can be created is an empty Series.

 Create an Empty Series

 Example
 # import the pandas library, aliased as pd
 import pandas as pd
 # an explicit dtype avoids relying on the version-dependent default
 s = pd.Series(dtype="float64")
 print(s)

 Its output is as follows −
 Series([], dtype: float64)
CREATE A SERIES FROM NDARRAY
 An ndarray is a (usually fixed-size) multidimensional container of items of the same type and
size.
 If data is an ndarray, then the index passed must be of the same length. If no index is passed, the
default index is range(n), where n is the array length, i.e., [0, 1, 2, ..., len(array)-1].

 Example

 import pandas as pd
 import numpy as np
 data = np.array(['a','b','c','d'])
 s = pd.Series(data)
 print(s)
ITS OUTPUT IS AS FOLLOWS −
0 a
1 b
2 c
3 d
dtype: object
 We did not pass any index, so by default, it assigned the indexes ranging from 0 to len(data)-1,
i.e., 0 to 3.
EXAMPLE
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data,index=[100,101,102,103])
print(s)

 Its output is as follows −

100 a
101 b
102 c
103 d
dtype: object
 We passed the index values here. Now we can see the customized indexed values in the output.
CREATE A SERIES FROM DICT
 A dict can be passed as input. If no index is specified, the dictionary keys are used to
construct the index, in the dict's insertion order (older pandas versions sorted the keys
instead). If an index is passed, the values in data corresponding to the labels in the index
are pulled out.

 Example
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

 O/P:-
a 0.0
b 1.0
c 2.0
dtype: float64
 Observe − Dictionary keys are used to construct index
EXAMPLE
 import pandas as pd
 import numpy as np
 data = {'a' : 0., 'b' : 1., 'c' : 2.}
 s = pd.Series(data,index=['b','c','d','a'])
 print(s)

 Observe −
 The index order is preserved and the missing element is filled with NaN (Not a Number).
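To make the NaN behaviour concrete, here is a small sketch that reuses the same dictionary and index, then finds and drops the missing entry. The isna() and dropna() steps are an addition for illustration and are not part of the original example.

```python
import pandas as pd

data = {"a": 0.0, "b": 1.0, "c": 2.0}

# 'd' is not a key in data, so its value becomes NaN.
s = pd.Series(data, index=["b", "c", "d", "a"])

# isna() flags the missing entries; dropna() removes them.
missing_labels = list(s[s.isna()].index)
cleaned = s.dropna()

print(missing_labels)
print(list(cleaned.index))
```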
CREATE A SERIES FROM SCALAR
 If data is a scalar value, an index must be provided. The value will be repeated to match the
length of the index.

 import pandas as pd
 import numpy as np
 s = pd.Series(5, index=[0, 1, 2, 3])
 print(s)
MATH OPERATIONS :
 There are some important math operations that can be performed on a pandas series to
simplify data analysis using Python and save a lot of time.
CONTINUE:
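 s.median() Returns the median of all values
 s.mode() Returns the mode(s) of the Series
 s.value_counts() Returns a Series with the frequency of each value
 s.describe() Returns a Series of summary statistics, depending on the dtype of the data:
count, mean, std, min, quartiles, and max for numeric data; count, unique, top (the most
frequent value), and freq for object data.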
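The methods listed above can be tried on a small in-memory Series, with no file needed. The numbers below are made-up sample values used only for illustration.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration.
s = pd.Series([10, 20, 20, 30, 40])

print(s.median())         # middle value of the sorted data
print(s.mode())           # most frequent value(s), returned as a Series
print(s.value_counts())   # frequency of each distinct value
print(s.describe())       # count, mean, std, min, quartiles, max
```

Note that mode() returns a Series rather than a single number, because several values can tie for the highest frequency.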
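EX:
 import pandas as pd
 # reading the CSV file; squeeze("columns") turns a one-column DataFrame into a Series
 # (the squeeze=True argument of read_csv was removed in pandas 2.0)
 s = pd.read_csv("stock.csv").squeeze("columns")
 # using the count function
 print(s.count())
 # using the sum function
 print(s.sum())
 # using the mean function
 print(s.mean())
 # calculating the average manually
 print(s.sum()/s.count())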
HEAD & TAIL
 To view a small sample of a Series or DataFrame object, use the head() and tail() methods.
 head() returns the first n rows (observe the index values). The default number of elements to display
is five, but you may pass a custom number.

 import pandas as pd
 import numpy as np
 # Create a series with 4 random numbers
 s = pd.Series(np.random.randn(4))
 print("The original series is:")
 print(s)

 print("The first two rows of the data series:")
 print(s.head(2))
ITS OUTPUT IS AS FOLLOWS −
 The original series is:
 0 0.720876
 1 -0.765898
 2 0.479221
 3 -0.139547
 dtype: float64

 The first two rows of the data series:


 0 0.720876
 1 -0.765898
 dtype: float64
TAIL()
 tail() returns the last n rows (observe the index values). The default number of elements to
display is five, but you may pass a custom number.

 import pandas as pd
 import numpy as np

 # Create a series with 4 random numbers
 s = pd.Series(np.random.randn(4))
 print("The original series is:")
 print(s)

 print("The last two rows of the data series:")
 print(s.tail(2))
PANDAS SERIES.SELECT()
 A Pandas Series is a one-dimensional ndarray with axis labels. The labels need not be unique but
must be of a hashable type. The object supports both integer- and label-based indexing and
provides a host of methods for performing operations involving the index.

 The Pandas Series.select() function returns the data whose axis labels match a given criterion.
We pass a function as the argument; it is applied to every index label, and the labels for
which it returns True are selected. Note that Series.select() was deprecated in pandas 0.21 and
removed in pandas 1.0; in current versions the same result is obtained with
sr.loc[sr.index.map(function)].
EX:
 import pandas as pd
 # Creating the Series
 sr = pd.Series(['New York', 'Chicago', 'Toronto', 'Lisbon', 'Rio', 'Moscow'])

 # Create the index labels
 index_ = ['City 1', 'City 2', 'City 3', 'City 4', 'City 5', 'City 6']

 # set the index
 sr.index = index_

 # Print the series
 print(sr)
O/P:
 Now we will select the names of all cities whose index label ends with an even integer.
 # Define a function to select those cities whose index
 # label's last character is an even integer
 def city_even(city):
     # True if the last character is an even digit
     return int(city[-1]) % 2 == 0

 # Series.select() was removed in pandas 1.0, so apply the function
 # to every index label and select the matching rows with .loc
 selected_cities = sr.loc[sr.index.map(city_even)]

 # Print the returned Series object
 print(selected_cities)
 As we can see in the output, the selection has returned all the cities that satisfy the
given criterion.
 Select the sales of ‘Coca Cola’ and ‘Sprite’ from the given Series object.
 # importing pandas as pd
 import pandas as pd
 # Creating the Series
 sr = pd.Series([100, 25, 32, 118, 24, 65])

 # Create the Index
 index_ = ['Coca Cola', 'Sprite', 'Coke', 'Fanta', 'Dew', 'ThumbsUp']

 # set the index
 sr.index = index_

 # Print the series
 print(sr)
 Now we will select the sales of the listed beverages from the given Series object.
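The slides stop before showing the selection step, so the following is only a sketch of one way it could look. It assumes a current pandas version, where Series.select() is no longer available, and uses a hypothetical helper named is_listed to filter the index labels.

```python
import pandas as pd

sr = pd.Series(
    [100, 25, 32, 118, 24, 65],
    index=["Coca Cola", "Sprite", "Coke", "Fanta", "Dew", "ThumbsUp"],
)

# Hypothetical criterion: keep only the listed beverages.
def is_listed(label):
    return label in ("Coca Cola", "Sprite")

# Build a boolean mask from the index labels and select with .loc.
mask = sr.index.map(is_listed)
selected_sales = sr.loc[mask]

print(selected_sales)
```

When the wanted labels are known up front, sr.loc[["Coca Cola", "Sprite"]] gives the same rows more directly; the function-based form is useful when the criterion has to be computed from each label.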
INDEXING IN PANDAS :
 Indexing in pandas simply means selecting particular rows and columns of data from a
DataFrame. Indexing could mean selecting all of the rows and some of the columns, some of the
rows and all of the columns, or some of the rows and some of the columns. Indexing is also
known as subset selection.
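The three kinds of subset selection described above can be sketched on a small DataFrame. The table, column names, and row labels here are invented for illustration.

```python
import pandas as pd

# Hypothetical table, used only for illustration.
df = pd.DataFrame(
    {
        "city": ["New York", "Chicago", "Toronto"],
        "sales": [100, 25, 32],
        "year": [2020, 2021, 2022],
    },
    index=["r1", "r2", "r3"],
)

all_rows_some_cols = df[["city", "sales"]]       # all rows, some columns
some_rows_all_cols = df.loc[["r1", "r3"]]        # some rows, all columns
some_of_each = df.loc[["r1", "r2"], ["sales"]]   # some rows, some columns
by_position = df.iloc[0:2, 0:2]                  # same idea, by integer position

print(some_of_each)
```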
