Data Set Exploration in Python - v1 - Students

Data Exploration with Python
Andrew Michelson, MD
Pulmonary/Critical Care
Institute for Informatics
Washington University School of Medicine in St. Louis
February 17, 2020
Institute for Informatics (I 2)

Disclosures
No relevant financial disclosures.
Many topics could be their own courses, so this will be a brief overview
The best techniques to analyze and clean your data will depend on the question you're
asking and the data you have

Class Structure

Objectives
1. Learn how to import data into Python
2. Discuss variable identification
3. Explore missing data and discuss its management
4. Explore univariate & bivariate analyses
5. Discuss outlier assessment and management
6. Explore data transformation

The Data
Source: MIMIC-III Demo Data
Contents:
• Vital Signs: Blood pressure, heart rate, respiratory rate, etc…
• Laboratory Values: White Blood Cell Count, Potassium, etc…
• And more, but we won’t use any of that today

The Working Environment
1. Python
2. jupyter-notebook
3. Import libraries
A. Pandas
B. Numpy
C. Seaborn
D. Datetime
E. Matplotlib
F. Scipy.stats

Importing Data Into Python
1. Python is a versatile and powerful language that can accept data from
many formats
2. In this class we import CSV documents from the MIMIC-III demo data
3. Use: dfNAME = pd.read_csv('filepath/filename', sep = ',')
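A minimal sketch of the import step. To keep it self-contained, an in-memory string stands in for a CSV file; the column names, the values, and the `data/` folder in the comment are made up for illustration and are not the real MIMIC-III demo contents.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk (illustrative columns only).
csv_text = "subject_id,gender\n1,F\n2,M\n3,F\n"

# pd.read_csv accepts a file path or any file-like object.
# With a real file this would look like: pd.read_csv("data/PATIENTS.csv", sep=",")
# where "data/" is an assumed folder name.
dfPatients = pd.read_csv(io.StringIO(csv_text), sep=",")
print(dfPatients)
```

The comma is already the default separator, so `sep` only needs to be set for other delimiters such as tabs.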

Importing Data Into Python
Jupyter-Notebook
• Open Jupyter-Notebook
• Run Section 2: Import Libraries for DataSet Exploration
• Fill in the blank to import the following files:
• ICUSTAYS.csv
• PATIENTS.csv
• D_ITEMS.csv
• D_LABITEMS.csv

Variable Identification
Variable Name: Variable name
Variable type:
• Continuous (ex, age)
• Categorical (ex, sex)
Data Type:
• String
• Category
• Integer
• Float
• ManyString
Independent vs Dependent:

Identify your variables:
>> DataFrame.head( )
Patients dataframe
Note: you can use >> DataFrame.tail( ) to view the last rows of the data frame
By adding a number within the parentheses you can specify how many rows to view

View your data frame
ICU Stays

How do we know how many rows and columns we have in total?
>> DataFrame.shape
How do we know the data type of each column?
>> DataFrame.info()
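The inspection calls from the last few slides, sketched on a toy DataFrame. The column names and values are invented for illustration.

```python
import pandas as pd

# Toy data; column names and values are invented for illustration.
dfDemo = pd.DataFrame({
    "subject_id": [1, 2, 3, 4, 5, 6],
    "gender": ["F", "M", "F", "M", "M", "F"],
    "age": [65.0, 72.5, 48.0, 81.0, 59.5, 33.0],
})

first_rows = dfDemo.head(3)    # first 3 rows (the default is 5)
last_rows = dfDemo.tail(2)     # last 2 rows
n_rows, n_cols = dfDemo.shape  # shape is an attribute, so no parentheses

print(first_rows)
print(last_rows)
print(n_rows, n_cols)
dfDemo.info()                  # prints column names, non-null counts and dtypes
```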

Remove extraneous information that takes up space (visible and memory)
>> DataFrame.drop(labels, axis = 1, inplace = True)
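A short sketch of dropping columns, assuming a toy DataFrame with a bookkeeping column called `row_id` (an illustrative name).

```python
import pandas as pd

# Toy data; column names are illustrative.
dfDemo = pd.DataFrame({
    "row_id": [10, 11, 12],
    "subject_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
})

# axis=1 drops columns (axis=0 would drop rows by index label).
# inplace=True modifies dfDemo directly and returns None.
dfDemo.drop(["row_id"], axis=1, inplace=True)

# Equivalent style without inplace: keep the returned copy instead.
dfSmaller = dfDemo.drop(columns=["gender"])

print(list(dfDemo.columns))
print(list(dfSmaller.columns))
```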

Variable identification in Python
Go to section 3.0.1 and fill in the *** to start identifying your variables
Complete until section 3.2: Merge Patients & ICU Data to Create a single DataFrame

Manipulating Data in Python
Often data is collected from different sources and then
merged together for analysis.
>> DataFrame1.merge(DataFrame2, how = "left/right", on = ['key_column'])
After a merge, double check the shape to make sure you merged correctly
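A minimal sketch of a left merge followed by the shape check. The two toy tables and their column names are assumptions for illustration.

```python
import pandas as pd

# Toy tables; names and values are illustrative.
dfPatients = pd.DataFrame({"subject_id": [1, 2, 3], "gender": ["F", "M", "F"]})
dfStays = pd.DataFrame({
    "subject_id": [1, 1, 2, 4],
    "icustay_id": [100, 101, 102, 103],
})

# Left merge: keep every row of dfStays and attach patient columns
# wherever subject_id matches.
dfMerged = dfStays.merge(dfPatients, how="left", on=["subject_id"])

# The row count should equal the left table's when the key is unique on the right.
print(dfStays.shape, dfMerged.shape)

# Left rows with no match on the right get NaN in the right-hand columns.
print(dfMerged["gender"].isnull().sum())
```

If the merged row count is larger than expected, the key is probably duplicated in the right-hand table.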

Variable identification in Python
Go to section 3.2: Merge Patients & ICU Data to create a single DataFrame
Check the size of the new DataFrame to confirm a successful merge

Missing Data
Very Common in clinical data
Why is data missing?

• Data extraction
• Data collection

Missing Data Categorization
1. Missing completely at random:
• The propensity for a data point to be missing is completely
random and not dependent on observed or unobserved data
2. Missing at random:
• Systematic differences between the missing and observed values,
but these can be entirely explained by other observed variables

Missing Data Categorization
3. Missing not at random
• There is a relationship between the propensity of a value to be
missing and its value

Missing Data Treatment
Adapted from: https://medium.com/ibm-data-science-experience/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87

Missing Data: Case Deletion
• List Wise: Delete all data where any missing value is present
• Pair Wise: Analyze all cases where data is available
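One way to see the difference between the two deletion strategies in pandas, sketched on invented vital-sign values. `dropna()` performs listwise deletion, while `DataFrame.corr()` excludes missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column (values are invented).
dfVitals = pd.DataFrame({
    "heart_rate": [80.0, 95.0, np.nan, 110.0, 72.0],
    "resp_rate": [16.0, np.nan, 20.0, 24.0, 14.0],
    "temperature": [37.0, 38.1, 36.8, np.nan, 37.2],
})

# Listwise deletion: drop every row that has any missing value.
dfListwise = dfVitals.dropna()

# Pairwise deletion: each statistic uses all rows available for the
# columns involved. DataFrame.corr() excludes missing values pair by pair.
pairwise_corr = dfVitals.corr()

# Number of rows available to each pair of columns.
observed = dfVitals.notnull().astype(int)
pairwise_n = observed.T.dot(observed)

print(len(dfListwise))
print(pairwise_n)
```

Listwise deletion keeps 2 of the 5 rows here, while each pairwise correlation still uses 3.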

Missing Data: Imputation
Goal is to fill missing data with estimated values
Most common methods: mean/median/mode:

• Population-wide
• Cohort-wide
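A sketch of median imputation at both levels, assuming a toy lab table in which gender defines the cohort. The column names and values are invented.

```python
import numpy as np
import pandas as pd

# Toy data; gender stands in for whatever defines the cohort.
dfLabs = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "potassium": [4.0, np.nan, 4.4, 3.6, 3.8, np.nan],
})

# Population-wide: one median for everybody.
pop_median = dfLabs["potassium"].median()
dfLabs["k_population"] = dfLabs["potassium"].fillna(pop_median)

# Cohort-wide: median within each group, aligned back to the original rows.
cohort_median = dfLabs.groupby("gender")["potassium"].transform("median")
dfLabs["k_cohort"] = dfLabs["potassium"].fillna(cohort_median)

print(dfLabs)
```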

Missing Data: Statistical-Model Imputation
Linear Regression
• Limitations:
• Reduces variability
• Overestimates the model fit and correlation coefficient
K-nearest Neighbor Imputation

• Limitations:
• The choice of k is critical in getting desired results
• Very slow
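A minimal sketch of the linear regression approach using `np.polyfit`, assuming heart rate is used to predict a missing respiratory rate. The variables and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; resp_rate is missing in two rows.
dfVitals = pd.DataFrame({
    "heart_rate": [60.0, 70.0, 80.0, 90.0, 100.0, 110.0],
    "resp_rate": [12.0, 14.0, np.nan, 18.0, np.nan, 22.0],
})

observed = dfVitals["resp_rate"].notnull()

# Fit resp_rate = slope * heart_rate + intercept on the complete rows only.
slope, intercept = np.polyfit(
    dfVitals.loc[observed, "heart_rate"],
    dfVitals.loc[observed, "resp_rate"],
    deg=1,
)

# Predict from the fitted line and fill only the missing entries.
predicted = slope * dfVitals["heart_rate"] + intercept
dfVitals["resp_rate_imputed"] = dfVitals["resp_rate"].fillna(predicted)

print(dfVitals)
```

Every imputed value lies exactly on the fitted line, which is why this method reduces variability and overstates the strength of the relationship.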

Missing Data: Statistical-Model Imputation
Multiple Imputation by Chained Equations (MICE)
• Assumes data is missing at random
• Runs multiple regression models
• Each value is modeled conditionally
• Multiple data sets are made (usually at least 10)

Assessing Missing data in Python
Look for null entries
>> DataFrame.isnull( ).sum( )
Look for non-null entries
>> DataFrame.notnull( ).sum( )
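A short sketch of counting missing entries per column on a toy table. The column names and values are invented. Note that `sum` needs parentheses to be called as a method.

```python
import numpy as np
import pandas as pd

# Toy data; both np.nan and None count as missing.
dfDemo = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "dod": [np.nan, "2150-01-01", np.nan, np.nan],
    "gender": ["F", "M", None, "F"],
})

null_counts = dfDemo.isnull().sum()      # missing entries per column
nonnull_counts = dfDemo.notnull().sum()  # present entries per column
null_fraction = dfDemo.isnull().mean()   # share of missing entries per column

print(null_counts)
print(null_fraction)
```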

Assessing Missing Data
Go to section 3.3: Assess Missing Data in NEW Patients DataFrame and complete UP TO, but not including Import Vital Signs

Data Mapping
Process of extracting and unifying data for further analysis
Measurements of interest could be mixed with measurements not of interest
The same value can have different names
• Sometimes the difference in names is important, other times it's not
Occurs in many data sets, including MIMIC-III

Data Mapping
Vital Signs:
• Blood Pressure (systolic/diastolic)
• Heart Rate
• Respiratory Rate
• Oxygen saturation (%)
• Temperature
In MIMIC-III vital signs are mixed with other measurements in CHARTEVENTS.csv

Data Mapping with Vital Signs
Systolic Blood Pressure Synonyms in THIS dataset:
• 'Non Invasive Blood Pressure systolic',
• 'Arterial Blood Pressure systolic',
• 'Manual Blood Pressure Systolic Left',
• 'Manual Blood Pressure Systolic Right'

Count variable frequency
>> DataFrame.column.value_counts( )

Data Mapping with Dictionaries
Dictionaries are data structures that consist of a collection of key-value pairs
that can be changed (insertion order is preserved since Python 3.7)
Dictionary = {
<key>: <value>
}

To accommodate synonyms, or extract items of interest from a
larger data set, you can use a dictionary
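A sketch of dictionary-based mapping with `Series.map`, assuming a toy table with `label` and `value` columns. The unified names ("Systolic BP"), the values, and the "Bed Alarm" label are illustrative choices, not part of the dataset's own vocabulary.

```python
import pandas as pd

# Toy chart events; values are invented.
dfChart = pd.DataFrame({
    "label": [
        "Non Invasive Blood Pressure systolic",
        "Arterial Blood Pressure systolic",
        "Heart Rate",
        "Manual Blood Pressure Systolic Left",
        "Bed Alarm",  # hypothetical label standing in for a measurement not of interest
    ],
    "value": [118, 124, 82, 121, 1],
})

# Keys are labels as they appear in the data; values are the unified names.
vitals_map = {
    "Non Invasive Blood Pressure systolic": "Systolic BP",
    "Arterial Blood Pressure systolic": "Systolic BP",
    "Manual Blood Pressure Systolic Left": "Systolic BP",
    "Manual Blood Pressure Systolic Right": "Systolic BP",
    "Heart Rate": "Heart Rate",
}

# Labels missing from the dictionary map to NaN, so dropping those rows
# keeps only the items of interest.
dfChart["vital"] = dfChart["label"].map(vitals_map)
dfVitals = dfChart.dropna(subset=["vital"])

print(dfVitals["vital"].value_counts())
```

Running `value_counts()` on the raw labels first is a quick way to discover which synonyms need a dictionary entry.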

Import the remaining data and assess missingness
Go to section 4.2 Import Vital Signs and complete up to section 5: Univariate & Bivariate Analysis

Univariate Analysis
Explore variables individually
Basic descriptive analysis
• Central Tendency: Mean, Median, Mode, Min, Max
• Measures of Dispersion: Interquartile Range, Standard Deviation/Variance, Skewness, Kurtosis
• Visualization: Histogram, Box plot

Univariate Analysis: Skewness
Measure of the asymmetry of the probability distribution of a variable
• Positive or Right
• Negative or Left
Grading Skewness Severity

• Minimal: between -0.5 and 0.5
• Moderate: between -1 and -0.5 or between 0.5 and 1
• Severe: < -1 or > 1
https://en.wikipedia.org/wiki/Skewness
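A small sketch that computes skewness with pandas and grades it with the cut-offs listed above. The helper `grade_skewness` is a hypothetical name, and treating values of exactly 0.5 and 1 as the lower grade is an assumption.

```python
import pandas as pd

def grade_skewness(skew):
    """Grade skewness severity; boundary handling at 0.5 and 1 is an assumption."""
    if abs(skew) <= 0.5:
        return "minimal"
    if abs(skew) <= 1:
        return "moderate"
    return "severe"

# Invented samples: one symmetric, one with a long right tail.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 50])

for name, series in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    s = series.skew()  # sample skewness; positive means a longer right tail
    print(name, round(s, 2), grade_skewness(s))
```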

Univariate Analysis: Kurtosis
“The kurtosis parameter is a measure of the combined weight of the tails relative to the rest
of the distribution.”
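Kurtosis > 3: Positive (heavier tails than a normal distribution)
Kurtosis = 3: No excess kurtosis (normal distribution)
Kurtosis < 3: Negative (lighter tails than a normal distribution)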
https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics#kurtosis
https://bishalbanksonfinance.wordpress.com/tag/probabality-distribution/
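A sketch of computing kurtosis with pandas on two invented samples. One caveat: `Series.kurt()` reports excess kurtosis, where a normal distribution scores 0, so 3 is added here to get back to the scale used above.

```python
import numpy as np
import pandas as pd

# Invented samples with light and heavy tails.
flat = pd.Series(np.linspace(0, 1, 1001))      # evenly spread values: light tails
heavy = pd.Series([0.0] * 98 + [-10.0, 10.0])  # mostly identical values, two extremes

for name, series in [("flat", flat), ("heavy", heavy)]:
    excess = series.kurt()  # pandas reports excess kurtosis (normal distribution = 0)
    pearson = excess + 3    # shift to the scale where a normal distribution = 3
    print(name, round(pearson, 2))
```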

Bivariate Analysis
A method to determine the relationship between 2 variables
1. Visualization: Scatter plots
2. Regression analysis: Find the equation for the line or curve that best fits the data
3. Correlation coefficients: A measure of association between two variables
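A sketch of the regression and correlation steps on invented paired measurements. It assumes a straight line is an adequate fit and uses Pearson's correlation, which is the pandas default.

```python
import numpy as np
import pandas as pd

# Invented paired measurements.
dfVitals = pd.DataFrame({
    "heart_rate": [62.0, 70.0, 75.0, 88.0, 96.0, 104.0],
    "resp_rate": [12.0, 14.0, 15.0, 18.0, 19.0, 22.0],
})

# Regression: least-squares line resp_rate = slope * heart_rate + intercept.
slope, intercept = np.polyfit(dfVitals["heart_rate"], dfVitals["resp_rate"], deg=1)

# Correlation: Series.corr computes Pearson's r by default.
pearson_r = dfVitals["heart_rate"].corr(dfVitals["resp_rate"])

print(round(slope, 3), round(intercept, 3), round(pearson_r, 3))
```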

Outliers
What is an outlier?
• A data point that appears far away and diverges from the overall pattern in a sample
• Can be univariate or bivariate

Outliers
How do outliers occur?
• Natural
• Sampling error
• Data entry error
• Data processing error
• Measurement error
• Intentional outlier
• Experimental error

Outliers
Why are they important?
• Alters population variance, leading to non-normal data distributions

• Alters performance of downstream analyses
• Biases results
How do you detect outliers?

• Visualization
• Bar charts
• Box plots
• Scatter plots (looking for bivariate outliers)
• There are many, many ways, but we will focus on visualization today!
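As a numeric companion to the box plot, the sketch below flags points beyond 1.5 times the interquartile range, which is the convention box plots commonly use to draw their whiskers. This rule is an addition for illustration, and the heart-rate values are invented.

```python
import pandas as pd

# Invented values with one implausible reading.
heart_rate = pd.Series([72, 75, 78, 80, 82, 85, 88, 90, 250])

q1 = heart_rate.quantile(0.25)
q3 = heart_rate.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = heart_rate[(heart_rate < lower_fence) | (heart_rate > upper_fence)]
print(lower_fence, upper_fence)
print(outliers.tolist())
```

Whether a flagged point is an error or a natural outlier still has to be judged from clinical context.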

Outliers: Univariate

Outliers: Univariate

Outliers: Bivariate

Outliers
How do you treat outliers? (Subject for an entire course!)
• Delete observations:
• Data entry error
• Data processing error
• Very few (subjective)
• Transform values
• Log conversion
• Binning
• Differential observation weights
• Impute
• Would avoid with natural outliers
• Treat outliers as a separate category

Assessing Data in Python: Pivot Tables
DataFrames must be properly structured before they can be plotted
Before pivoting:
Patient      Label              Value
John Smith   Heart Rate         75
John Smith   Respiratory Rate   15

After pivoting:
Patient      Heart Rate   Respiratory Rate
John Smith   75           15

>> DataFrame.pivot_table(values = 'value', index = ['columns'], columns = 'label')
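The John Smith example as a runnable sketch. The lower-case column names are assumptions chosen to match the `values` and `columns` arguments above.

```python
import pandas as pd

# One row per measurement.
dfLong = pd.DataFrame({
    "patient": ["John Smith", "John Smith"],
    "label": ["Heart Rate", "Respiratory Rate"],
    "value": [75, 15],
})

# One row per patient, one column per label.
# reset_index() turns the patient index back into an ordinary column.
dfWide = dfLong.pivot_table(
    values="value", index=["patient"], columns="label"
).reset_index()

print(dfWide)
```

`pivot_table` averages by default, so repeated measurements of the same label for one patient collapse into their mean unless `aggfunc` is set.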

Visualize Data Within Python
Declare the graph properties
>> fig, ax = plt.subplots(rows, columns, figsize = (width, height))
Locate a subset of data from within the larger dataframe
>> DataFrame.loc[DataFrame.column == 'value', 'return column name']
Use Seaborn to make distribution and boxplots
>> sns.histplot(data, kde = True, ax = ax[ X ])  (sns.distplot in seaborn versions before 0.11)
>> sns.boxplot(x = data, ax = ax[ X ])
Pivot your dfce
>> DataFrame.pivot_table(values = 'value', index = ['columns'], columns = 'label').reset_index()
Use Seaborn to plot bivariate data
>> sns.pairplot(pivoted table)

Visualize Data Within Python
Seaborn can make a heatmap to help you more rapidly identify correlations
>> sns.heatmap(dflabs.corr(), vmax = 1)

Univariate & Bivariate Visualization with Vital Signs
Go to section 5: Univariate & Bivariate Analysis and complete until section 6: Data Transformation

Data Transformation
Skewed data
• Skewed data can violate model assumptions (logistic regression)
• Skewed data can amplify a class imbalance, degrading model performance towards the tail of the
distribution
Heteroskedasticity
• The relationship between two variables shows increasing scatter (non-constant standard
error) at extremes of measurement of the dependent variable
• Two forms:
• Conditional: Unpredictable volatility
• Unconditional: Predictable volatility

Data Transformation: Heteroskedasticity
Conditional

Data Transformation: Heteroskedasticity
Unconditional

Data Transformation
One way to reduce skewness and heteroskedasticity is to normalize your data
• Remove/manage outliers
• Log
• Cube Root
• Binning
• Normalization
• Sigmoid
• Hyperbolic tangent
• Etc…
Again, there are many different ways to do this and the best way will depend on your
planned analyses and the question you are answering

Data Transformation
To apply the log function to data, pass a Pandas Series to np.log:
>> DataFrame['column'] = np.log(DataFrame['column'])
To take the cube root of a value
>> DataFrame['column'] = DataFrame['column']**(1/3)
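A sketch comparing skewness before and after both transformations, assuming a toy right-skewed column named `wbc` with invented values. It uses `np.cbrt` for the cube root because `**(1/3)` yields NaN for negative floats.

```python
import numpy as np
import pandas as pd

# Invented right-skewed values.
dfLabs = pd.DataFrame({"wbc": [4.0, 6.5, 8.0, 9.5, 12.0, 45.0, 110.0]})

dfLabs["wbc_log"] = np.log(dfLabs["wbc"])    # natural log; requires values > 0
dfLabs["wbc_cbrt"] = np.cbrt(dfLabs["wbc"])  # cube root; also defined for negative values

# Lower skewness after transformation means a more symmetric distribution.
print(dfLabs.skew().round(2))
```

Storing the result in a new column keeps the original values available for comparison.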

Data Transformation
Go to section 6: Data Transformation and go until the end!

Questions?
Thank you!

References:
1. Grus, Joel. Data Science from Scratch. O’Reilly Media;2015.
2. Marcelino, P. Comprehensive data exploration with python.
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python. 2/2018. Accessed:
2/12/2020.
3. Sheridan, E. Un-bottling the data. 12/2/2019.
https://towardsdatascience.com/un-bottling-the-data-2da3187fb186. Accessed: 2/12/2020.
4. Ojeda, T. Data exploration with python, part 3.
https://www.districtdatalabs.com/data-exploration-with-python-3. Accessed: 2/12/20.
5. Sunil, R. A comprehensive guide to data exploration.
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/#two. Accessed: 2/12/2020.
6. Bratkovics, C. Exploratory data analysis tutorial in Python.
https://towardsdatascience.com/exploratory-data-analysis-tutorial-in-python-15602b417445. 6/16/19.
Accessed: 2/12/20.
7. Sunil, R. Ultimate guide for data exploration in Python using Numpy, Matplotlib and Pandas.
https://www.analyticsvidhya.com/blog/2015/04/comprehensive-guide-data-exploration-sas-using-python-numpy-scipy-matplotlib-pandas/. 4/9/2015. Accessed: 2/12/2020.
8. Akinfaderin, W. Missing data conundrum: exploration and imputation techniques.
https://medium.com/ibm-data-science-experience/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87. 9/11/2017. Accessed: 2/12/20.
9. Wade, C. Transforming skewed data. https://towardsdatascience.com/transforming-skewed-data-73da4c2d0d16.
8/21/2019. Accessed: 2/20/20.
10. Chow, J. Log transformation base for data linearization does not matter.
https://towardsdatascience.com/log-transformation-base-for-data-linearization-does-not-matter-22eb3c1463d0.
6/27/2019. Accessed: 2/12/20.
11. Azur MJ, Stuart EA, Frangakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it