Chapter 25: Data Science Using Python

Data plays an important role in our lives. For example, a chain of hospitals holds data related to the medical reports and prescriptions of its patients. A bank holds thousands of customers' transaction details. Share market data represents minute-to-minute changes in the values of the shares. In this way, the entire world revolves around huge amounts of data.

Every piece of data is precious as it may affect the business organization that is using it. So, we need some mechanism to store that data. Moreover, data may come from various sources. For example, in a business organization, we may get data from the Sales department, Purchase department, Production department, etc. Such data is stored in a system called a 'data warehouse'. We can imagine a data warehouse as a central repository of integrated data from different sources.

Once the data is stored, we should be able to retrieve it based on some prerequisites. A business company may want to know how much it spent on purchases in the last 6 months, or how many items were found defective in its production unit. Such data cannot be easily retrieved from the huge data available in the data warehouse. We have to retrieve the data as per the needs of the business organization. This is called 'data analysis' or 'data analytics', where the retrieved data is analyzed to answer the questions raised by the management of the organization. A person who does data analysis is called a 'data analyst'. Please see Figure 25.1.

[Figure 25.1: Data analysis and data visualization]

Once the data is analyzed, it is the duty of the IT professional to present the results in the form of pictures or graphs so that the management will be able to understand them easily. Such graphs will also help them forecast the future of their company. This is called data visualization.
The primary goal of data visualization is to communicate information clearly and efficiently using statistical graphs, plots and diagrams.

Data science is a term used for the techniques to extract information from the data warehouse, analyze it and present the necessary data to the business organization in order to arrive at important conclusions and decisions. A person who is involved in this work is called a 'data scientist'. We can find important differences between the roles of data scientist and data analyst as shown in Table 25.1:

Table 25.1: Differences Between Data Scientist and Data Analyst

1. Data scientist: Formulates the questions that will help a business organization and then proceeds to solve them.
   Data analyst: Receives questions from the business team and provides answers to them.

2. Data scientist: Has strong data visualization skills and the ability to convert data into a business story.
   Data analyst: Simply analyzes the data and provides the information requested by the team.

3. Data scientist: Needs proficiency in mathematics, statistics and programming languages like Python and R.
   Data analyst: Needs proficiency in data warehousing, big data concepts, SQL and business intelligence.

4. Data scientist: Estimates the unknown information from the known data.
   Data analyst: Looks at the known data from a new perspective.

Data Frame

A data frame is an object that is useful in representing data in the form of rows and columns. For example, the data may come from a file or an Excel spreadsheet or from a Python sequence like a list or tuple. We can represent that data in the form of a data frame. Once the data is stored in the data frame, we can perform various operations that are useful in analyzing and understanding the data. Data frames are generally created from .csv (comma-separated values) files,
Excel spreadsheet files, Python dictionaries, lists of tuples or lists of dictionaries. pandas is a Python package useful for data analysis and manipulation. xlrd is a package for retrieving data from Excel files; note that current versions of xlrd read only the older .xls format, and pandas relies on the openpyxl package for .xlsx files. We should download these packages separately as they are developed by third parties. You can see Chapter 2 to know how to download and install these packages. DataFrame is the main class in the pandas package. We will first discuss various ways of creating data frame objects.

Creating Data Frame from an Excel Spreadsheet

Let us assume that we have a large volume of data present in an Excel spreadsheet file by the name 'empdata.xlsx', as shown in Figure 25.2:

[Figure 25.2: The 'empdata.xlsx' spreadsheet with the columns empid, ename, sal and doj]

This file contains employee data: employee id, name, salary and date of joining the company. We can create this file on our own by opening Microsoft Office Excel, typing the data and saving it with the file name 'empdata'.

To create the data frame, we should first import the pandas package. pandas loads the Excel reader package on its own, so that package only has to be installed. To read the data from an Excel file, we should use the read_excel() function of the pandas package in the following format: read_excel('file path', sheet_name='sheet name'). If our Excel file is available in the 'F:' drive and 'python\PANDAS' subdirectory, open the Python IDLE window and type the commands as shown below. The path is written as a raw string (r"...") so that the backslashes are not treated as escape characters:

>>> import pandas as pd
>>> df = pd.read_excel(r"F:\python\PANDAS\empdata.xlsx", sheet_name="Sheet1")
>>> df
   empid           ename       sal        doj
0   1001      Ganesh Rao  10000.00 2000-10-10
1   1002      Anil Kumar  23000.50 2002-03-20
2   1003    Gaurav Gupta  18000.33 2002-03-03
3   1004    Hema Chandra  16500.50 2000-09-10
4   1005  Laxmi Prasanna  12000.75 2000-10-08
5   1006       Anant Nag   9999.99 1999-09-09

Thus, we created the data frame by the name 'df'. Please observe the first column having numbers from 0 to 5. This additional column is called the 'index column' and is added by the data frame.

Creating Data Frame from .csv Files

In many cases, the data will be in the form of .csv files. A .csv file is a comma-separated values file that is similar to an Excel file but takes less memory. We can create the .csv file by saving the Excel file using the option File -> Save As and typing the following:

File name: empdata
Save as type: CSV (comma delimited)

We can read data from a .csv file using the read_csv() function that takes the file path as shown below:

>>> import pandas as pd
>>> df = pd.read_csv(r"F:\python\PANDAS\empdata.csv")
>>> df
   empid           ename       sal        doj
0   1001      Ganesh Rao  10000.00   10-10-00
1   1002      Anil Kumar  23000.50  3-20-2002
2   1003    Gaurav Gupta  18000.33   03-03-02
3   1004    Hema Chandra  16500.50   10-09-00
4   1005  Laxmi Prasanna  12000.75   08-10-00
5   1006       Anant Nag   9999.99   09-09-99

Creating Data Frame from a Python Dictionary

It is possible to create a Python dictionary that contains employee data. Let us remember that a dictionary stores data in the form of key-value pairs. In this case, we take 'empid', 'ename', 'sal', 'doj' as keys and the corresponding lists as values. Let us first create a dictionary by the name 'empdata' as shown below:

>>> empdata = {"empid": [1001, 1002, 1003, 1004, 1005, 1006],
...     "ename": ["Ganesh Rao", "Anil Kumar", "Gaurav Gupta", "Hema Chandra", "Laxmi Prasanna", "Anant Nag"],
...     "sal": [10000, 23000.50, 18000.33, 16500.50, 12000.75, 9999.99],
...     "doj": ["10-10-2000", "3-20-2002", "3-3-2002", "9-10-2000", "10-8-2000", "9-9-1999"]}

Now, let us convert this empdata dictionary into a data frame by passing this dictionary to the DataFrame class as:

>>> df = pd.DataFrame(empdata)
>>> df
   empid           ename       sal         doj
0   1001      Ganesh Rao  10000.00  10-10-2000
1   1002      Anil Kumar  23000.50   3-20-2002
2   1003    Gaurav Gupta  18000.33    3-3-2002
3   1004    Hema Chandra  16500.50   9-10-2000
4   1005  Laxmi Prasanna  12000.75   10-8-2000
5   1006       Anant Nag   9999.99    9-9-1999

The columns appear in the order in which the keys were written in the dictionary. (Older versions of pandas listed the columns in alphabetical order instead: doj, empid, ename, sal.)

Creating Data Frame from Python List of Tuples

It is possible to create a list of tuples that contains employee data. A tuple can be treated as a row of data. Suppose we want to store the data of 6 employees, we have to create 6 tuples. Let us first create a list of 6 tuples by the name 'empdata' as shown below:

>>> empdata = [(1001, 'Ganesh Rao', 10000.00, '10-10-2000'),
...     (1002, 'Anil Kumar', 23000.50, '3-20-2002'),
...     (1003, 'Gaurav Gupta', 18000.33, '3-3-2002'),
...     (1004, 'Hema Chandra', 16500.50, '9-10-2000'),
...     (1005, 'Laxmi Prasanna', 12000.75, '10-8-2000'),
...     (1006, 'Anant Nag', 9999.99, '9-9-1999')]

Now, let us convert this list of tuples into a data frame by passing this list to the DataFrame class as:

>>> df = pd.DataFrame(empdata, columns=["eno", "ename", "sal", "doj"])

Since the original list of tuples does not have column names, we have to include the column names as a list in the preceding statement. Now let us display the data frame as:

>>> df
    eno           ename       sal         doj
0  1001      Ganesh Rao  10000.00  10-10-2000
1  1002      Anil Kumar  23000.50   3-20-2002
2  1003    Gaurav Gupta  18000.33    3-3-2002
3  1004    Hema Chandra  16500.50   9-10-2000
4  1005  Laxmi Prasanna  12000.75   10-8-2000
5  1006       Anant Nag   9999.99    9-9-1999

Operations on Data Frames

Once we create a data frame, we can do various operations on it. These operations help us in analyzing the data or manipulating the data. The reader is advised to refer to the list of all operations available in pandas at the following link:

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

First we will create a data frame from a .csv file using the read_csv() function as shown below. This data frame will be the basis for our operations.

>>> df = pd.read_csv(r"F:\python\PANDAS\empdata.csv")
>>> df
   empid           ename       sal        doj
0   1001      Ganesh Rao  10000.00   10-10-00
1   1002      Anil Kumar  23000.50  3-20-2002
2   1003    Gaurav Gupta  18000.33   03-03-02
3   1004    Hema Chandra  16500.50   10-09-00
4   1005  Laxmi Prasanna  12000.75   08-10-00
5   1006       Anant Nag   9999.99   09-09-99

Knowing Number of Rows and Columns

To know the number of rows and columns available in the data frame, we can use the shape attribute. It returns a tuple that contains the number of rows and columns as:

>>> df.shape
(6, 4)

Suppose we want to retrieve only the number of rows or columns, we can read that number from the tuple as:

>>> r, c = df.shape
>>> print(r)   # display only no. of rows
6

Retrieving Rows from Data Frame

The method head() gives the first 5 rows and the method tail() returns the last 5 rows, as shown below:

>>> df.head()
   empid           ename       sal        doj
0   1001      Ganesh Rao  10000.00   10-10-00
1   1002      Anil Kumar  23000.50  3-20-2002
2   1003    Gaurav Gupta  18000.33   03-03-02
3   1004    Hema Chandra  16500.50   10-09-00
4   1005  Laxmi Prasanna  12000.75   08-10-00
>>> df.tail()
   empid           ename       sal        doj
1   1002      Anil Kumar  23000.50  3-20-2002
2   1003    Gaurav Gupta  18000.33   03-03-02
3   1004    Hema Chandra  16500.50   10-09-00
4   1005  Laxmi Prasanna  12000.75   08-10-00
5   1006       Anant Nag   9999.99   09-09-99

We can also pass the number of rows to these methods. For example, to display the last 2 rows:

>>> df.tail(2)
   empid           ename       sal       doj
4   1005  Laxmi Prasanna  12000.75  08-10-00
5   1006       Anant Nag   9999.99  09-09-99

Retrieving a Range of Rows

We can treat the data frame as an object and retrieve the rows from it using slicing. For example, if we write df[2:5], we get the rows from index 2 to index 4
(the row at index 5 is excluded):

>>> df[2:5]
   empid           ename       sal       doj
2   1003    Gaurav Gupta  18000.33  03-03-02
3   1004    Hema Chandra  16500.50  10-09-00
4   1005  Laxmi Prasanna  12000.75  08-10-00

Similarly, to display alternate rows, we can use df[0::2] or df[::2] as shown below:

>>> df[0::2]
   empid           ename       sal       doj
0   1001      Ganesh Rao  10000.00  10-10-00
2   1003    Gaurav Gupta  18000.33  03-03-02
4   1005  Laxmi Prasanna  12000.75  08-10-00

To display the rows in reverse order, we can use a negative step size in slicing as:

>>> df[5:0:-1]
   empid           ename       sal        doj
5   1006       Anant Nag   9999.99   09-09-99
4   1005  Laxmi Prasanna  12000.75   08-10-00
3   1004    Hema Chandra  16500.50   10-09-00
2   1003    Gaurav Gupta  18000.33   03-03-02
1   1002      Anil Kumar  23000.50  3-20-2002

Please observe that the stop value 0 is excluded, so the row at index 0 is not displayed. To get all the rows in reverse order, we can write df[::-1].

To Retrieve Column Names

To retrieve the column names from the data frame, we can use the columns attribute as:

>>> df.columns
Index(['empid', 'ename', 'sal', 'doj'], dtype='object')

To Retrieve Column Data

To get the column data, we can mention the column name as subscript. For example, df.empid will display all employee id numbers. This can also be done using df['empid'], as shown below:

>>> df.empid
0    1001
1    1002
2    1003
3    1004
4    1005
5    1006
Name: empid, dtype: int64
>>> df['empid']
0    1001
1    1002
2    1003
3    1004
4    1005
5    1006
Name: empid, dtype: int64

Similarly, to display employee names, we can use df.ename or df['ename'].

Retrieving Data from Multiple Columns

To retrieve the data of multiple columns, we can provide the list of column names as subscript to the data frame object as df[[list of column names]]. For example, to display the employee ids and their names, we can write:

>>> df[['empid', 'ename']]
   empid           ename
0   1001      Ganesh Rao
1   1002      Anil Kumar
2   1003    Gaurav Gupta
3   1004    Hema Chandra
4   1005  Laxmi Prasanna
5   1006       Anant Nag

Finding Maximum and Minimum Values

It is possible to find the highest value using the max() method and the least value using the min() method. These methods are applied to columns containing numerical data.
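The two methods can be tried without any file on disk. The following is a minimal sketch, assuming the chapter's employee data is rebuilt from a dictionary; the variable names highest, lowest and per_column are chosen here for illustration only. It also shows that, on a selection of several numeric columns, max() returns one value per column.

```python
import pandas as pd

# Employee data from this chapter, built from a dictionary so that
# the sketch does not depend on any file on disk.
empdata = {
    "empid": [1001, 1002, 1003, 1004, 1005, 1006],
    "ename": ["Ganesh Rao", "Anil Kumar", "Gaurav Gupta",
              "Hema Chandra", "Laxmi Prasanna", "Anant Nag"],
    "sal": [10000, 23000.50, 18000.33, 16500.50, 12000.75, 9999.99],
}
df = pd.DataFrame(empdata)

# On a single numeric column, max() and min() each return one value.
highest = df["sal"].max()
lowest = df["sal"].min()
print(highest, lowest)   # 23000.5 9999.99

# On several numeric columns, max() returns one value per column.
per_column = df[["empid", "sal"]].max()
print(per_column)
```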
For example, to know the highest salary and the least salary, we can use:

>>> df['sal'].max()
23000.5
>>> df['sal'].min()
9999.99

Displaying Statistical Information

The describe() method displays information like the number of values, average, standard deviation, minimum, maximum, and the 25%, 50% and 75% quartiles of the values. This information is highly useful for statistical analysis.

>>> df.describe()
             empid           sal
count     6.000000      6.000000
mean   1003.500000  14917.011667
std       1.870829   5181.037711
min    1001.000000   9999.990000
25%    1002.250000  10500.187500
50%    1003.500000  14250.625000
75%    1004.750000  17625.372500
max    1006.000000  23000.500000

Performing Queries on Data

We can retrieve rows based on a query. The query should be given as subscript in the data frame object. For example, to retrieve all the rows where salary is greater than Rs. 10000, we can write:

>>> df[df.sal>10000]
   empid           ename       sal        doj
1   1002      Anil Kumar  23000.50  3-20-2002
2   1003    Gaurav Gupta  18000.33   03-03-02
3   1004    Hema Chandra  16500.50   10-09-00
4   1005  Laxmi Prasanna  12000.75   08-10-00

To retrieve the row where salary is maximum, we can write:

>>> df[df.sal == df.sal.max()]
   empid       ename      sal        doj
1   1002  Anil Kumar  23000.5  3-20-2002

Suppose we want to show data from some columns based on a query, we can mention the list of columns and then the query as: df[[column names]][query]. For example, to display only id numbers and names where the salary is greater than Rs. 10000, we can write:

>>> df[['empid', 'ename']][df.sal>10000]
   empid           ename
1   1002      Anil Kumar
2   1003    Gaurav Gupta
3   1004    Hema Chandra
4   1005  Laxmi Prasanna

The same result can be obtained in one step with df.loc[df.sal > 10000, ['empid', 'ename']].

Knowing the Index Range

The first column is called the index column and it is generated in the data frame automatically. We can retrieve the index information using the index attribute as:

>>> df.index
RangeIndex(start=0, stop=6, step=1)

Setting a Column as Index

We know the index column is automatically generated.
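The automatically generated index labels stay attached to their rows, even after a query. As a minimal sketch, assuming a small frame built from a dictionary rather than from the .csv file (the name high_paid is illustrative), the labels before and after a query look like this:

```python
import pandas as pd

# Only the columns needed for the illustration, built from a dictionary.
df = pd.DataFrame({
    "empid": [1001, 1002, 1003, 1004, 1005, 1006],
    "sal": [10000, 23000.50, 18000.33, 16500.50, 12000.75, 9999.99],
})

# The index is generated automatically as 0, 1, 2, ...
print(list(df.index))          # [0, 1, 2, 3, 4, 5]

# A query keeps the original index label of every selected row.
high_paid = df[df.sal > 10000]
print(list(high_paid.index))   # [1, 2, 3, 4]
```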
If we do not want this column and want to set a column from our data as the index column, that is possible using the set_index() method. A column with unique values can be set as the index column. For example, to make the 'empid' column the index column for our data frame, we can write:

>>> df1 = df.set_index('empid')

The above statement creates another data frame 'df1' that uses 'empid' as index column. We can verify this by displaying df1 as:

>>> df1
                ename       sal        doj
empid
1001       Ganesh Rao  10000.00   10-10-00
1002       Anil Kumar  23000.50  3-20-2002
1003     Gaurav Gupta  18000.33   03-03-02
1004     Hema Chandra  16500.50   10-09-00
1005   Laxmi Prasanna  12000.75   08-10-00
1006        Anant Nag   9999.99   09-09-99

We can find the empid being used as index column in the new data frame 'df1' as above. However, the original data frame 'df' in this case is not modified and it still uses the automatically generated index column. If we want to modify the original 'df' and set empid as index column, we should add 'inplace=True' as shown below:

>>> df.set_index('empid', inplace=True)
>>> df
                ename       sal        doj
empid
1001       Ganesh Rao  10000.00   10-10-00
1002       Anil Kumar  23000.50  3-20-2002
1003     Gaurav Gupta  18000.33   03-03-02
1004     Hema Chandra  16500.50   10-09-00
1005   Laxmi Prasanna  12000.75   08-10-00
1006        Anant Nag   9999.99   09-09-99

Once we set 'empid' as index, it is possible to locate the data of any employee by passing the employee id number to the loc attribute as:

>>> df.loc[1004]
ename    Hema Chandra
sal           16500.5
doj          10-09-00
Name: 1004, dtype: object

Resetting the Index

To go back from 'empid' to the automatically generated index, we can use the reset_index() method as:

>>> df.reset_index(inplace=True)
>>> df
   empid           ename       sal        doj
0   1001      Ganesh Rao  10000.00   10-10-00
1   1002      Anil Kumar  23000.50  3-20-2002
2   1003    Gaurav Gupta  18000.33   03-03-02
3   1004    Hema Chandra  16500.50   10-09-00
4   1005  Laxmi Prasanna  12000.75   08-10-00
5   1006       Anant Nag   9999.99   09-09-99

Sorting the Data

To sort the data coming from a .csv file, first let us read the data from the file into a data frame using the read_csv() function as:

>>> df = pd.read_csv(r"F:\python\PANDAS\empdata2.csv", parse_dates=['doj'])

Here, we are loading the data from the empdata2.csv file and also asking pandas to treat 'doj' as a date type field using the parse_dates option. Now let us display the data frame as:

>>> print(df)
   empid           ename       sal        doj
0   1001      Ganesh Rao  10000.00 2000-10-10
1   1002      Anil Kumar  23000.50 2002-03-03
2   1003    Gaurav Gupta  18000.33 2002-03-03
3   1004    Hema Chandra  16500.50 2002-03-03
4   1005  Laxmi Prasanna  12000.75 2000-08-10
5   1006       Anant Nag   9999.99 1999-09-09

To sort the rows on the 'doj' column into ascending order, we can use the sort_values() method as:

>>> df1 = df.sort_values('doj')
>>> df1
   empid           ename       sal        doj
5   1006       Anant Nag   9999.99 1999-09-09
4   1005  Laxmi Prasanna  12000.75 2000-08-10
0   1001      Ganesh Rao  10000.00 2000-10-10
1   1002      Anil Kumar  23000.50 2002-03-03
2   1003    Gaurav Gupta  18000.33 2002-03-03
3   1004    Hema Chandra  16500.50 2002-03-03

To sort in descending order, we should use an additional option 'ascending=False' as:

>>> df1 = df.sort_values('doj', ascending=False)

Sorting on multiple columns is also possible. This can be done using the option 'by' in the sort_values() method. For example, we want to sort on 'doj' in descending order and, within that, on 'sal' in ascending order. That means, when two employees have the same 'doj', their salaries will be sorted into ascending order.

>>> df1 = df.sort_values(by=['doj', 'sal'], ascending=[False, True])
>>> df1
   empid           ename       sal        doj
3   1004    Hema Chandra  16500.50 2002-03-03
2   1003    Gaurav Gupta  18000.33 2002-03-03
1   1002      Anil Kumar  23000.50 2002-03-03
0   1001      Ganesh Rao  10000.00 2000-10-10
4   1005  Laxmi Prasanna  12000.75 2000-08-10
5   1006       Anant Nag   9999.99 1999-09-09

Handling Missing Data

In many cases, the data that we receive from various sources may not be perfect. That means there may be some missing data.
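To see how such gaps can be detected, the following minimal sketch reads a small CSV text held in memory and counts the empty cells per column. The CSV text is a hypothetical stand-in for a real file, and isna() is a pandas method that this chapter does not otherwise cover.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file; two cells are left empty.
csv_text = """empid,ename,sal
1001,Ganesh Rao,10000
1002,,23000.5
1003,Gaurav Gupta,
"""
df = pd.read_csv(io.StringIO(csv_text))

# isna() marks every missing cell with True; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)
```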
For example, the 'empdata1.csv' file contains the following data, where the employee name is missing in one row and the salary and date of joining are missing in another row. Please see Figure 25.3:

[Figure 25.3: A csv file with missing data]

When we convert the data into a data frame, the missing data is represented by NaN (Not a Number). NaN is the default marker for a missing value. Please observe the following data frame:

>>> import pandas as pd
>>> df = pd.read_csv(r"F:\python\PANDAS\empdata1.csv")
>>> df
   empid           ename       sal       doj
0   1001      Ganesh Rao  10000.00  10-10-00
1   1002      Anil Kumar  23000.50  03-03-02
2   1003             NaN  18000.33  03-03-02
3   1004    Hema Chandra       NaN       NaN
4   1005  Laxmi Prasanna  12000.75  10-08-00
5   1006       Anant Nag   9999.99  09-09-99

We can use the fillna() method to replace the NA or NaN values by a specified value. For example, to fill the NaN values by 0, we can use:

>>> df1 = df.fillna(0)
>>> df1
   empid           ename       sal       doj
0   1001      Ganesh Rao  10000.00  10-10-00
1   1002      Anil Kumar  23000.50  03-03-02
2   1003               0  18000.33  03-03-02
3   1004    Hema Chandra      0.00         0
4   1005  Laxmi Prasanna  12000.75  10-08-00
5   1006       Anant Nag   9999.99  09-09-99

But this is not so useful as it fills every type of column with zero. We can fill each column with a different value by passing the column names and the value to be used to fill in each column. For example, to fill the 'ename' column with 'Name missing', 'sal' with 0.0 and 'doj' with '00-00-00', we should supply these values as a dictionary to the fillna() method as shown below:

>>> df1 = df.fillna({'ename': 'Name missing', 'sal': 0.0, 'doj': '00-00-00'})
>>> df1
   empid           ename       sal       doj
0   1001      Ganesh Rao  10000.00  10-10-00
1   1002      Anil Kumar  23000.50  03-03-02
2   1003    Name missing  18000.33  03-03-02
3   1004    Hema Chandra      0.00  00-00-00
4   1005  Laxmi Prasanna  12000.75  10-08-00
5   1006       Anant Nag   9999.99  09-09-99

If we do not want the missing data and want to remove the rows having NA or NaN values, then we can use the dropna() method as:

>>> df1 = df.dropna()
>>> df1
   empid           ename       sal       doj
0   1001      Ganesh Rao  10000.00  10-10-00
1   1002      Anil Kumar  23000.50  03-03-02
4   1005  Laxmi Prasanna  12000.75  10-08-00
5   1006       Anant Nag   9999.99  09-09-99

In this way, filling in the necessary data or eliminating the missing data is called 'data cleansing'.

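The two data cleansing approaches can be compared side by side without the file on disk. The following is a minimal sketch, assuming the content of the missing-data file is reproduced as CSV text in memory; the names csv_text, filled and dropped are illustrative and not part of the original example.

```python
import io
import pandas as pd

# In-memory stand-in for the .csv file with missing data (an assumption:
# one name is missing, and one row lacks both salary and date of joining).
csv_text = """empid,ename,sal,doj
1001,Ganesh Rao,10000,10-10-00
1002,Anil Kumar,23000.5,03-03-02
1003,,18000.33,03-03-02
1004,Hema Chandra,,
1005,Laxmi Prasanna,12000.75,10-08-00
1006,Anant Nag,9999.99,09-09-99
"""
df = pd.read_csv(io.StringIO(csv_text))

# Approach 1: fill each column with its own replacement value.
filled = df.fillna({"ename": "Name missing", "sal": 0.0, "doj": "00-00-00"})

# Approach 2: drop every row that has at least one missing value.
dropped = df.dropna()

print(filled)
print(dropped)
```

Filling keeps all six rows but introduces placeholder values that later calculations (such as an average salary) would have to account for; dropping keeps only complete rows at the cost of losing two employees.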