You are on page 1of 29
MODULE 1 | ={ole) eee Module Handbook Oi. ea aa Data Science with Python - 1 Objectives ‘At the end of this module, you should be able to understand: Mathematical Computation with numpy Getting started with Pandas Data Import/export with pandas ‘Working with Data Frames ~ selection, Filtering, manipulation and statistical analysis Data Aggregation with Pandas Data Cleaning with Pandas vw typ 3 Boot Numpy NumPy is the fundamental package for scientific computing with Python. It contains among other things: + A powerful N-dimensional array object *+ Sophisticated (broadcasting) functions * Tools for integrating C/C++ and Fortran code * Useful linear algebra, Fourier transform, and random number capabilities Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases. Broadcasting Broadcasting is a powerful mechanism that allows numpy to work with arrays of different shapes when performing arithmetic operations. Frequently we have a smaller array and a larger array, and we want to use the smaller array multiple times to perform some operation on the larger array. Broadcasting two arrays together follows these rules: *+ If the arrays do not have the same rank, prepend the shape of the lower rank array with 1s until both shapes have the same length. The two arrays are said to be compatible in a dimension if they have the same size in the dimension, or if one of the arrays has size 1 in that dimension. + The arrays can be broadcast together if they are compatible in all dimensions. + After broadcasting, each array behaves as if it had shape equal to the elementwise maximum of shapes of the two input arrays. + In any dimension where one array had size 1 and the other array had size greater than 1, the first array behaves as if it were copied along that dimension vw typ 3 Boot Numpy In general, any numerical data that is stored in an array-like container can be converted to an ndarray through use of the array() function. The most obvious examples are sequence types like lists and tuples. # Import moduli mport numpy as np # Create an ar civilian birth 4352, 233, 3245, 256, 2394]) civilia Create ndarrays from = [l, 2, # list np. array (datal) # 1d array data2 = [range(1, 5), range(5, 9)] # list of lists arr2 = np.array(data2) # 2a array arr2.tolist () # convert array back to list vw typ 3 Boot Numpy create special arrays np. zeros (10) np.zeros((3, 6)) np.ones (10) np.linspace(0, 1, 5) # 0 to 1 (inclusive) with 5 points np.logspace(0, 3, 4) # 10*0 to 10*3 (inclusive) with 4 points Reshaping arrays arr = np.arange(10, dtype=float) .reshape((2, 5)) print (arr. shape) print (arr. re. (5, 2)) Stack arrays a = np.array([0, 1]) b = np.array((2, 31) ab = np.stack((a, b)). print (ab) # or np.hstack((a[:, None], b[:, None])) vw typ 3 Boot Numpy Generating Random Numbers With NumPy import numpy as np #Generate A R: np.random.normal () #Generate Four Random Numbers From 1 np. xandom.normal (size=4) #Generate Four Random Numbers From np. random. uniform (size=4) 4Generate Four Random Integers Between 1 and 100 np.random.randint (low=1, high=100, size=4) vw typ 3 dom Number From The Normal Distribution Distribution e Uniform Distribution Boot Linear Algebra with Numpy Transpose A Vector Or Matrix import numpy as np # Create vector ector = np.array([1, 2, 3, 4, 5, 6]) # Create matrix matrix = np.matrix([[1, 2, 3], 4, 5, 61, 7, 8, 911) vector.T # Tranpose vector matrix.T # Transpose matrix Flatten A Matrix port numpy as np # Create matrix matrix = np.m Zee [4, 5, 61, 7, 8, 911) # Flatten matrix matrix. flatten () vw typ 3 Boot Numpy Difference between numpy dot() and inner() a=np.array([[1,2], (3,411) b=np.array({(11,12], (13,1411) np.dot (a,b) output array({[37, 401, (85, 92]]) nner (a,b) #otuput array(((35, 41], [81, 95]]) vw typ 3 Boot Linear Algebra with Numpy Linear Algebra with matrices # Return determinant of matrix np. linalg.det (matrix) # Calculate the tracre of the matrix matrix.diagonal ().sum() # Calculate inverse of matrix np. linalg. inv (matrix) # Add two matrices np. add (ma ix_a, ma’ # Subtract two matrices np. subtract (matrix_a, matrix b) Flatten A Matrix A = np.array(([1,2], [3,4]]) numpy.linalg.eig(A) foutput (array ([-0.37228132, 5.37228132]), array({[-0.82456484, -0.41597356], [ 0.56576746, -0.90937671]])) vw typ 3 Boot Numpy Linear Algebra - Solving Equations 4x + 5y = 23 6x —3y =3 mport numpy as np from numpy.linalg import solve A = np.array({(4,5],[6,-3]]) b = np.array([23, 31) x = solve(A,b) foutput - value of x array({ 2, 3]) vw typ 3 Boot Pandas pandas is a Python package pr ding fast, flexi le, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal. [55 [ber rchpedapeint renee nla een Rees rages eer cd pandas is well suited for many. Se Fes one egies of data Any other form of observational / statistical data Poe neg nee cee to be placed into a pandas data structure The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd party libraries. "1 ups! Boot Pandas Here are just a few of the things that pandas does well: + Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data + Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects + Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations + Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data + Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects + Intelligent label-based slicing, fancy indexing, and subsetting of large data sets + Intuitive merging and joining data sets + Flexible reshaping and pivoting of data sets + Hierarchical labeling of axes (possible to have multiple labels per tick) + Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading data from the ultrafast HDFS format + Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc. 12 vw typ 3 Pandas Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific research environments. For data scientists, working with data is typically divided into multiple stages: munging and cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or tabular display. pandas is the ideal tool for all of these tasks. PUSS Cu uo mgc tt Mur urar et) TC Cameo Ree Mi ee cl anaes 13 scoop Boot Pandas Pandas Series Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). import pandas as pd import numpy as np data = np.array(['a','b','c', s = pd.Series (da print (s) import pandas as pd s = pd. vw typ 3 14 Boot Pandas Pandas DataFrame A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. # Create an dataframe data = {'name': ['Jason', ‘Molly’ na', 'Jake', 'Amy'], ‘year’: [2012, 2012, 2013, 2014, 2014], ‘reports': df - pd.DataFrame(data, index - ['Cochice', ‘Pim: cruz', 'Maricopa', ‘Yuma']) raw_data ={'first_name':['Jason', 'Molly', 'Tina','Jake', 'Amy'], ,".",'Milner', 'Cooze'], "last_name':['Miller', 'Jacobs tage': [42, 52, 36, 24, 73], ‘preTestScore': [4, 24, 31, ". "postTestScore': ["25,000", "94,000", 57, 62, 70]} import pandas as pd mport numpy as np df = pd.DataFrame(raw_data, colun ast_name', ‘age', 'preTe: 15 vw D2up 3 Boot Data Import and Export using Pandas Loading A CSV Into pandas # Load a csv df = com/anshupandey/datasets/ pd. read_csv('https://raw.githubuserconte master/datawh.csv’) # Load a csv with no headers af ent .com/anshupandey/datasets/m pd. read_csv (https: //raw.githubuserco! aster/datanh.csv', header=None) Export DataFrame # Save dataframe as csv in the working director df.to_csv('pandas_dataframe_importing_csv/example.csv’) Load A JSON File Into Pandas mport pandas as pd url = “https: //raw. githubusercontent .com/anshupandey/datasets/master/serve r-metrics.json’ woad the first sheet of the JSON file into a data frame "columns") | json(url, orie 16 vw typ 3 Boot Pandas Descriptive Statistics For pandas Dataframe mport pandas as pd data = {'name': [‘Jason', 'Molly', ‘Tina’, tage’: [42, 52, 36, 24, 73], 'preTes 31, "postTestScore': (25, 94, 57, 62, df = pd.DataFrame(data, columns = ['name', ‘postTestScore')) #Summary statistics on preTestScore df ['preTestScore"] .describe() # The sum of all the ages df['age'].sum() # Mean preTestScore re'].mean() # Count the number of non-NA valu df['preTestScore'] .count () vw typ 3 ‘Jake’, rece ‘amy'), 'preTestScore', 7 Boot Pandas Descriptive Statistics For pandas Dataframe # Median value of preTest dé ['pr tScore'] .median () # Sample standard deviation of preTestScore values df['preTestScore'].std() # Skewness of preTestScore values df ['preTestScore'] .skew() # Kurtosis of preTestScore values df ['preTestScore'] .kurt () vw typ 3 18 Boot Data Manipulation with Pandas Dropping Rows & Columns in Dataframe Import pandas as pd data = {'name': ['Jason', 'Molly', ‘Tina’, 'Jake', ‘Amy'], tyear': , 2014), ‘reports': [4, 24, 31, 2, 3]} df - pd.DataFrame(data, index - ['Cochice', 'Pima', , ‘Maricopa’, 'Yuma']) # Drop an observation (row) d£.drop({'Cochice', 'Pima’]) Drop a variable (column) axis=1 denotes that we a row df.drop('reports', axis # Drop a row if it contains a ce: df[d£.name != 'Tina’ # Drop a row by row number (in this case, row 3) df .drop (d£.index[2]) # can be extended to dropping a range df .drop (df. ndex {[2,3]1) 19 vw typ 3 Boot Data Manipulation with Pandas Join And Merge Pandas Dataframe mport pandas as pd #first Dateframe raw_data = { 'subject_id’: ['1', '2', '3', "4", '5'], name': ['Alex', 'Amy', ‘Allen’, ‘Alice’, 'Ayoung'], ‘Anderson', 'Ackerman', 'Ali', 'Aoni', df_a = pd.DataFrame(raw data, columns = ['subject_id', 'first_name', ‘1ast_name! ]) Create a second dataframe ral ‘subject_id': ['4", 7 1 18], name': ['Bill', 'B 'Bran', 'Bryce', ‘Betty'], ["Bonder', '"Black', 'Balwner', ‘Brice’, "Btisan']) df_b = pd.DataFrame(raw_data, columns = ['subject_id', 'first_name', "last_name']) 20 vw typ 3 Boot Data Manipulation with Pandas Join And Merge Pandas Dataframe # Create a third dataframe raw data = ‘subject_id': ['1', 12", 13%, 14", '5', 17%, "81, 19", "10", nay, ‘test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]} df_n = pd.DataFrame(raw_data, columns # Join the two dataframes along rows df_new = pd.concat ([df_a, df_b]) Join two dataframes along columns pd.concat([df_a, df_b], axis=1) # Merge two dataframes along the subject. pd.merge(df_new, df_n, 24 vw typ 3 Boot Data Manipulation with Pandas Find Unique Values In Pandas Dataframes raw_data = {'regiment': ['5lst', ‘29th’, '2nd', ‘19th’, '12th', "101st', '90th', '30th', '193th', ‘1st’, '94th', "trucks': ['MAZ7310', np.nan, 'MAZ7310', 'MAZ7310", 0’, 'Tatral0', 'Tatral0', 'Tatral0', 'ZIS150', 'Tatral0d", "218150", 'ZIS150'], tanks': ['Merkava Mark 4", 'Merkava Mark 4", 'Merkava Mark 4', ‘Leopard 2A6M', 'Leor » ‘Leopard 2A6M', "Arjun MBT', ‘Leopard 2A6M', ‘Arjun MBT', ‘Arjun MBT', ‘Arjun MBT’, ‘Arjun MBI"), ‘aircraft': ['none', 'none', 'none', ‘Harbin 2-9", ‘Harbin Z-9', 'none', ‘Harbin Z-9', 'SH-60B Seahawk', 'SH-60B Seahawk', ' 60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk’]} df = pd.DataFrame(raw data, columns = [regiment’, ‘trucks’, ‘tanks', ‘aircrafts']) # Create @ list of unique values by turning the # pandas column list (set (df. trucks) ) # Create a list of unique values in df.trucks (d£['trucks"] .unique()) 22 vw typ 3 Boot Data Aggregation with Pandas Groupby operation raw_data = {'regiment': ['Nighthawks', 'Nighthawks', ‘Night ‘Nighthawks’, 'Dragoons', 'Dragoons', ‘Dragoons’, ‘Dragoons’, 'Scouts’, ‘Scouts’, ‘Scouts’, 'Scouts'], "company": PaLIst Om 2nd Um oncl eUlst) wll stu 2nd ‘and', ‘1st’, "1st", '2nd', '2nd'l, ‘name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze’, 'Jacon’, "Ryaner’,' ‘ali'), ‘preTest [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 31, "postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]} Import pandas as pd # Create dataframe df = pd.DataFrame(raw_data, columns = ['regiment’, ‘company’, 'name', 'preTestScore', 'postTestScore’]) # Create a groupby variable that groups preTestScores by regiment groupby_regiment ‘preTestScore'] .groupby (df ['regiment’ ]) # Mean of each regiment’s preTestScore groupby_regiment .mean () 23 vw typ 3 Boot Data Aggregation with Pandas Groupby operation View a grouping - Use list() to show what a grouping looks like ist (df['preTestScore'] .groupby (df [' regiment’ ])) # Descriptive statistics by group TestScore'] .groupby (df['regiment']) .describe () # Mean preTestScores grouped by reg company df ['preTestScore'] .gro: by ([df['regiment'], df['company']]) .mean() # Mean preTestScores grouped by regiment and company without # heirarchical indexing df ['preTestScore'] .groupby ([df['regiment'], df['company']]) .mean() .unstack () # Group t ent dataframe by regiment and company df.groupby(['regiment', ‘company']) .mean() # Number of observations in each regiment and company d£.groupby(['regiment', 'company']).size() 24 vw typ 3 Boot Data Cleaning with Pandas Delete Duplicates In pandas mport pandas as pd # Create dataframe with duplicates raw_data = {'first_name': ['Jason’, ‘Jason’, 'Jason','Tina', 'Jake', *aAmy"), ‘last_name': ['Miller', 'Miller', er’, 'Ali', ‘Milner’, ‘Cooze'l], taget: (42, 42, 24, 731, ‘preTestScore': [4, 4, 4, 31, 2, 3], 25, 25, 25, 57, 62, 70]} df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', ‘age’, 'preTestScore', 'postTestScore’]) # Identify which observations are duplicates df.duplicated() df.drop_duplicates () # Drop duplicates in the name column, but take the last obs in the duplicated s df.drop_duplicates(['first_name'], keep='last') 25 vw typ 3 Boot Data Cleaning with Pandas Missing Data In pandas Dataframes # Drop missing observations df_no_missing = df.dropna() # Create a new column full of missing values df [ "Loc on'] = np.nan if # Drop column ssing values df .dropna (axis: how='all’) # Drop rows that contain less than five observations # This is really mostly useful for time series d£.dropna (thresh: # Fill in missing data with zeros df. fillna(0) # inplace-True means that the changes are saved to the df right away True) df ["preTestScore"] .fillna(df["preTestScore"] .mean(), inplace: 26 vw typ 3 Boot Summary ‘This Module covered following topics + Mathematical Computation with numpy = Getting started with Pandas = Data import/export with pandas = Working with Data Frames — selection, Filtering, manipulation and statistical analysis + Data Aggregation with Pandas = Data Cleaning with Pandas vw typ 3

You might also like