
Preparing your Data using Python

Samuel G. Mori, CISA


Managing Partner, Analytics & Advisory Services
Spyrion LLC

April 12, 2018

1 Spyrion LLC™ Proprietary and confidential


Background

Samuel G. Mori, CISA, Six Sigma Green Belt


 Managing Partner, Analytics and Advisory Services

 Software Quality Assurance, Internal/External Audit, Business Intelligence
and Reporting, Advisory Services (GRC and Analytics)
 Subject matter expertise within commercial, manufacturing, healthcare,
biomedical and entertainment sectors
 B.S. Cognitive Science – Human Computer Interaction (UC San Diego)
 M.S. Accountancy – Accounting Information Systems (San Diego State)
 M.S. Data Science – Analytics & Modeling (Northwestern)



Agenda
 Learning Objectives

 Why should I prepare my data?

 What types of data might I encounter?

 How can Python help me?



Learning Objectives
 Understand the importance of preparing your data for analysis

 Understand different types of data formats you may encounter

 Understand what Python is and why you should use it

 Understand strategies and techniques for importing, preparing, and saving
your data using Python



Why should I prepare my data?
 Garbage in, garbage out

 Reduce errors

 Remove duplicate records

 Fix missing values

 Correct out-of-range values

 Fix formatting (e.g., date, text, number)
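A minimal sketch of three of the clean-up steps listed above (duplicates, formatting, range checks), using pandas. The column names, the sample records, and the assumed valid rating range of 1 to 5 are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical survey records; column names are made up for illustration.
raw = pd.DataFrame({
    "respondent_id": [1, 2, 2, 3],
    "survey_date": ["2018-04-01", "2018-04-02", "2018-04-02", "2018-04-03"],
    "rating": [4, 5, 5, 9],  # valid ratings are assumed to run from 1 to 5
})

# Remove exact duplicate records (the second row for respondent 2).
clean = raw.drop_duplicates().copy()

# Fix formatting: convert date text into a real datetime column.
clean["survey_date"] = pd.to_datetime(clean["survey_date"])

# Flag values outside the expected range rather than silently keeping them.
clean["rating_in_range"] = clean["rating"].between(1, 5)

print(clean)
```

Flagging out-of-range values, rather than overwriting them, keeps the original value available for follow-up with the data owner.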



Experience Check
 How many people have experience with Python?

 What types of data formats do you use in your organizations?

− CSV, Excel, PDF, JSON, XML, SQL databases, etc.

 What types of tools do you use?

− Excel, ACL, IDEA, SQL Server, Python, R, SAS, Cognos, etc.



What types of data formats might I encounter?
 Comma Separated Value (CSV)

 Excel

 JavaScript Object Notation (JSON)

 Structured Query Language (SQL)

 And more!

Python can help with these!
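As a rough illustration of how pandas reads several of the formats listed above, the sketch below loads a small CSV string, a JSON string, and a SQL table into DataFrames. All sample data, table names, and column names are invented so that the example runs without any external files; the SQL part uses an in-memory SQLite database as a stand-in for a real database server. Excel files are read in a similar way with pd.read_excel, which typically needs an extra engine package such as openpyxl.

```python
import io
import sqlite3

import pandas as pd

# CSV: one record per line, fields separated by commas.
csv_text = "airline,rating\nAlpha Air,4\nBeta Air,5\n"
csv_df = pd.read_csv(io.StringIO(csv_text))

# JSON: a list of records, each a set of key/value pairs.
json_text = (
    '[{"business": "Cafe One", "stars": 4.5},'
    ' {"business": "Cafe Two", "stars": 3.0}]'
)
json_df = pd.read_json(io.StringIO(json_text))

# SQL: run a query against a database table (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
sql_df = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

print(csv_df.shape, json_df.shape, sql_df.shape)
```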



CSV Example
 SFO Airport Survey Results



Excel Example
 SFO Airport Survey Results



JSON Examples

 TripAdvisor JSON file

 Yelp JSON file



SQL Example

 Sample Customer Data



What is Python?
Definition

 Object-oriented, high-level programming language

 Used as a scripting or glue language to connect existing
components together

 Simple, easy to learn syntax emphasizes readability

 Supports modules and packages

 Python interpreter and the extensive standard library are FREE!



What is Python? (cont.)
Key Python Package:

 Pandas

− Open source library that lets you read CSV, Excel, and JSON files and SQL
database tables into tables (called DataFrames) and apply various
data analysis techniques.



Coding Basics
Some basic Python syntax to keep in mind:

 Assigning a variable (the variable name always goes to the left of the equals sign)

 File names are text strings (can use " " or ' ', as long as the opening and closing quotes match)

− dataframe = pd.read_excel('file_name.xlsx', 'sheet_name')

Or

− file_name = 'file_name.xlsx'

− sheet_name = 'sheet_name'

− dataframe = pd.read_excel(file_name, sheet_name)



Coding Basics (cont.)
Some basic Python syntax to keep in mind:

 Using library packages

− import pandas as pd  # load the pandas library and create the short reference 'pd'

− dataframe = pd.read_excel('file_name.xlsx', 'sheet_name')

Or

− import pandas  # load the library under its full name

− dataframe = pandas.read_excel('file_name.xlsx', 'sheet_name')
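The two import styles on this slide can be checked side by side. The short sketch below imports pandas both ways and confirms that the alias and the full name refer to the same library; the file name is the placeholder from the slide and is only assigned to a variable, not opened.

```python
import pandas
import pandas as pd

# Both names point at the same module, so either prefix works.
same_module = pd is pandas

# Assignment: the variable name goes on the left, the value on the right.
file_name = "file_name.xlsx"  # placeholder name, no file is read here

print(same_module, file_name)
```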



Case Study
SFO Airport Customer Survey Data – Excel & CSV files



Importing the Data
How do I import an Excel file?
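One possible way to import an Excel worksheet is sketched below. The helper name load_survey is hypothetical, and reading .xlsx files usually requires an engine package such as openpyxl to be installed alongside pandas.

```python
import pandas as pd


def load_survey(file_name, sheet_name=0):
    """Read one worksheet of an Excel workbook into a DataFrame.

    sheet_name may be a sheet label or a zero-based position
    (0 means the first sheet in the workbook).
    """
    return pd.read_excel(file_name, sheet_name=sheet_name)
```

With the placeholder names from the Coding Basics slides, a call would look like load_survey('file_name.xlsx', 'sheet_name').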



Data Characteristics
What columns do we have?
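A small sketch of listing column names. The DataFrame below is a hypothetical stand-in for the survey data, with invented column names, so the example runs without the original file.

```python
import pandas as pd

# Hypothetical stand-in for the survey data.
survey = pd.DataFrame({
    "respondent_id": [101, 102, 103],
    "airline": ["Alpha Air", "Beta Air", "Alpha Air"],
    "overall_rating": [4, 5, 3],
})

# .columns holds the column labels; list() turns them into a plain Python list.
column_names = list(survey.columns)
print(column_names)
```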



Data Characteristics
What if I just want a subset of these columns?
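Selecting a subset of columns can be sketched as follows. The data and column names are hypothetical; passing a list of names inside the square brackets returns a DataFrame containing only those columns.

```python
import pandas as pd

# Hypothetical stand-in for the survey data.
survey = pd.DataFrame({
    "respondent_id": [101, 102, 103],
    "airline": ["Alpha Air", "Beta Air", "Alpha Air"],
    "overall_rating": [4, 5, 3],
})

# Keep only the columns of interest; .copy() makes the subset independent
# of the original table so later edits do not affect it.
subset = survey[["respondent_id", "overall_rating"]].copy()
print(subset)
```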



Data Characteristics
What columns do I have and what are their data types?
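Column names and data types can be inspected together through the dtypes attribute, as in this sketch. The data is hypothetical; survey.info() prints similar information along with non-null counts.

```python
import pandas as pd

# Hypothetical stand-in for the survey data.
survey = pd.DataFrame({
    "respondent_id": [101, 102, 103],
    "airline": ["Alpha Air", "Beta Air", "Alpha Air"],
    "overall_rating": [4.0, 5.0, 3.5],
})

# One entry per column: the column name and the type pandas inferred for it.
column_types = survey.dtypes
print(column_types)
```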



Data Characteristics
How many columns and records do I have?

Can I do a count of different values within a column?
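Both questions can be answered in a couple of lines, sketched here on hypothetical data: shape gives the record and column counts, and value_counts tallies how often each distinct value appears in a column.

```python
import pandas as pd

# Hypothetical stand-in for the survey data.
survey = pd.DataFrame({
    "respondent_id": [101, 102, 103],
    "airline": ["Alpha Air", "Beta Air", "Alpha Air"],
    "overall_rating": [4, 5, 3],
})

# shape is a (rows, columns) pair.
n_records, n_columns = survey.shape

# Count how many times each airline appears.
airline_counts = survey["airline"].value_counts()

print(n_records, n_columns)
print(airline_counts)
```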



Modifying Data Values
Let's look at the data dictionary

How do I replace values to make them meaningful?
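A sketch of turning coded values into readable labels with a lookup table. The codes and labels below are hypothetical and are not taken from the actual SFO data dictionary; in practice the mapping would be copied from that dictionary.

```python
import pandas as pd

# Hypothetical coded responses.
survey = pd.DataFrame({"overall_rating": [5, 3, 6, 1]})

# Hypothetical data dictionary: code -> meaning.
rating_labels = {
    1: "Unacceptable",
    2: "Below average",
    3: "Average",
    4: "Good",
    5: "Outstanding",
    6: "Not applicable",
}

# map() looks up each code; codes missing from the dictionary become NaN.
survey["overall_rating_label"] = survey["overall_rating"].map(rating_labels)
print(survey)
```

Writing the labels to a new column keeps the original codes available for checking. Series.replace is an alternative that leaves unmapped values unchanged instead of turning them into NaN.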



Saving to Excel
How do I save this new file?

What does my file look like?
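Saving to Excel can be sketched with DataFrame.to_excel. The helper name save_survey is hypothetical, and writing .xlsx files usually requires an engine package such as openpyxl.

```python
import pandas as pd


def save_survey(dataframe, file_name, sheet_name="Sheet1"):
    """Write a DataFrame to an Excel workbook.

    index=False leaves out the row-number column that pandas
    would otherwise write as the first column of the sheet.
    """
    dataframe.to_excel(file_name, sheet_name=sheet_name, index=False)
```

To see what the saved file looks like, it can be opened in Excel or read back with pd.read_excel and compared against the original DataFrame.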



Importing the Data
How do I import a CSV file?

What is NaN?
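The sketch below reads CSV text with pd.read_csv and shows where NaN comes from: NaN ("Not a Number") is the marker pandas uses for a missing value, for example an empty field in the file. The CSV content is invented and supplied as an in-memory string so the example runs without a file; with a real file, the file name is passed to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content with two empty fields.
csv_text = (
    "respondent_id,airline,overall_rating\n"
    "101,Alpha Air,4\n"
    "102,,5\n"
    "103,Beta Air,\n"
)

survey = pd.read_csv(io.StringIO(csv_text))

# Empty fields are loaded as NaN; count them per column.
missing_per_column = survey.isna().sum()

print(survey)
print(missing_per_column)
```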



Fixing Error Values
How do I fix NaN values?
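Two common ways to handle NaN values are sketched below on hypothetical data: filling them with a placeholder, or dropping the affected records. The fill values ("Unknown" and 0) are assumptions for illustration; the right choice depends on what a missing value means in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values.
survey = pd.DataFrame({
    "airline": ["Alpha Air", np.nan, "Beta Air"],
    "overall_rating": [4.0, 5.0, np.nan],
})

# Option 1: fill missing values, with a different placeholder per column.
filled = survey.fillna({"airline": "Unknown", "overall_rating": 0})

# Option 2: drop every record that contains at least one missing value.
dropped = survey.dropna()

print(filled)
print(dropped)
```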



Adding Custom Columns
What if I want to add the Year in a column?
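Adding a year column can be sketched in two ways, depending on where the year comes from. Both the data and the column names are hypothetical: the first approach assigns one known year to every record (useful when each file covers a single survey year), the second extracts the year from a date column.

```python
import pandas as pd

# Hypothetical survey records.
survey = pd.DataFrame({
    "respondent_id": [101, 102],
    "survey_date": ["2015-05-04", "2015-05-11"],
})

# Approach 1: assign the same value to every record.
survey["year"] = 2015

# Approach 2: parse the date text and pull out the year part.
survey["year_from_date"] = pd.to_datetime(survey["survey_date"]).dt.year

print(survey)
```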



Identifying Value Ranges
How do I look at the data value ranges for multiple columns?
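One way to check value ranges across several columns at once is to compute the minimum and maximum of each, as sketched here on hypothetical data. survey.describe() gives a broader summary (count, mean, quartiles) if more detail is needed.

```python
import pandas as pd

# Hypothetical ratings; the 9 is a deliberately suspicious value.
survey = pd.DataFrame({
    "overall_rating": [4, 5, 3, 9],
    "cleanliness_rating": [2, 4, 5, 1],
})

# One row per statistic, one column per survey column.
ranges = survey[["overall_rating", "cleanliness_rating"]].agg(["min", "max"])
print(ranges)
```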



Saving to CSV
 How do I save this new file?

 What does my file look like?
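Saving to CSV and checking the result can be sketched as follows. The data and the output file name are hypothetical, and the file is written to a temporary folder so the example does not overwrite anything.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned survey data.
survey = pd.DataFrame({
    "respondent_id": [101, 102],
    "overall_rating": [4, 5],
})

# Write to a temporary folder; index=False leaves out the row-number column.
out_path = os.path.join(tempfile.mkdtemp(), "survey_clean.csv")
survey.to_csv(out_path, index=False)

# Read the file back to see what was saved.
check = pd.read_csv(out_path)
print(check)
```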



Appendix



Additional Information
Python Development Environments
 Enthought Canopy
− https://www.enthought.com/product/canopy/
 Anaconda/Spyder
− https://www.anaconda.com/download/

Python Libraries
 Pandas
− http://pandas.pydata.org/



Questions?

