
Data Exploration with Python
Andrew Michelson, MD
Pulmonary/Critical Care
Institute for Informatics
Washington University School of Medicine in St. Louis

February 17, 2020

Institute for Informatics (I 2)


Disclosures
No relevant financial disclosures.

Many topics could be their own courses, so this will be a brief overview

The best techniques to analyze and clean your data will depend on the question you're
asking and the data you have



Class Structure



Objectives
1. Learn how to import data into Python

2. Discuss variable identification

3. Explore missing data and discuss its management

4. Explore univariate & bivariate analyses

5. Discuss outlier assessment and management

6. Explore data transformation



The Data
Source: MIMIC-III Demo Data

Contents:
• Vital Signs: Blood pressure, heart rate, respiratory rate, etc…

• Laboratory Values: White Blood Cell Count, Potassium, etc…

• And more, but we won’t use any of that today



The Working Environment
1. Python

2. jupyter-notebook

3. Import libraries
A. Pandas
B. Numpy
C. Seaborn
D. Datetime
E. Matplotlib
F. Scipy.stats



Importing Data Into Python
1. Python is a versatile and powerful language that can accept data from
many formats

2. In this class we import CSV documents from the MIMIC-III demo data

3. Use: dfNAME = pd.read_csv('filepath/filename', sep = ',')
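
A minimal sketch of the call above, assuming pandas is installed. To stay runnable without the MIMIC-III files it reads a small in-memory CSV; the column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (values are made up).
csv_text = "subject_id,gender\n1,F\n2,M\n3,F\n"

# With a real file, pass its path instead of the StringIO object,
# e.g. pd.read_csv("PATIENTS.csv", sep=",").
df_patients = pd.read_csv(io.StringIO(csv_text), sep=",")

print(df_patients.shape)  # (number of rows, number of columns)
```

The comma is already the default separator for pd.read_csv, so sep only needs to be set for other delimiters such as tabs.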



Importing Data Into Python

Jupyter-Notebook
• Open Jupyter-Notebook
• Run Section 2: Import Libraries for DataSet Exploration
• Fill in the blank to import the following files:
• ICUSTAYS.csv
• PATIENTS.csv
• D_ITEMS.csv
• D_LABITEMS.csv



Variable Identification
Variable Name: Variable name

Variable type:
• Continuous (ex, age)
• Categorical (ex, sex)

Data Type:
• String
• Category
• Integer
• Float
• ManyString

Independent vs Dependent:



Variable Identification
Identify your variables:

>> DataFrame.head( )

Patients dataframe

Note: you can use >> DataFrame.tail( ) to view the last rows of the data frame

Adding a number inside the parentheses specifies how many rows to view



Variable Identification
View your data frame

ICU Stays



Variable Identification
How do we know how many rows and columns we have in total?

>> DataFrame.shape

How do we know the data type of each column?

>> DataFrame.info()
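
A short sketch of the inspection calls on these slides, using a small made-up DataFrame (the column names and values are hypothetical).

```python
import pandas as pd

# Hypothetical data standing in for the patients table.
df = pd.DataFrame({
    "subject_id": [1, 2, 3, 4, 5, 6],
    "gender": ["F", "M", "F", "M", "F", "M"],
    "age": [65.0, 72.5, 58.0, 81.0, 49.5, 77.0],
})

first_two = df.head(2)      # first 2 rows (the default is 5)
last_three = df.tail(3)     # last 3 rows
n_rows, n_cols = df.shape   # shape is an attribute, so no parentheses

df.info()                   # prints column names, non-null counts and dtypes
print(df.dtypes)            # dtypes alone, one entry per column
```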



Variable Identification
Remove extraneous information that takes up space (both on screen and in memory)

>> DataFrame.drop(items, axis, inplace)
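
A minimal sketch of dropping a column, assuming a made-up DataFrame with a hypothetical row_id column that is not needed for the analysis.

```python
import pandas as pd

df = pd.DataFrame({
    "subject_id": [1, 2],
    "row_id": [10, 11],      # hypothetical column we do not need
    "gender": ["F", "M"],
})

# axis=1 drops columns (axis=0 would drop rows by index label).
# Without inplace, drop returns a new DataFrame and leaves df untouched.
df_small = df.drop(["row_id"], axis=1)

# inplace=True modifies df itself and returns None.
df.drop(["row_id"], axis=1, inplace=True)

print(list(df.columns))
```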



Variable identification in Python

Go to section 3.0.1 and fill in the *** to start identifying your variables

Complete until section 3.2: Merge Patients & ICU Data to Create a single DataFrame



Manipulating Data in Python
Often data is collected from different sources and then
merged together for analysis.

>> DataFrame1.merge(DataFrame2, how = 'left/right', on = [''])

After a merge, double check the shape to make sure you merged correctly
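
A minimal sketch of a left merge followed by the shape check, using two made-up tables. The column names (subject_id, icustay_id, los) are chosen for illustration and the values are invented.

```python
import pandas as pd

df_patients = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "gender": ["F", "M", "F"],
})
df_icu = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "icustay_id": [100, 101, 102],
    "los": [1.5, 3.0, 2.2],
})

# A left merge keeps every row of the left table (df_icu) and adds the
# matching patient columns; patient 3 has no ICU stay and does not appear.
df_merged = df_icu.merge(df_patients, how="left", on=["subject_id"])

# Row count should match the left table; column count grows by the new columns.
print(df_icu.shape, df_merged.shape)
```

If the merged row count is larger than the left table, the key is probably duplicated in the right table.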



Variable identification in Python

Go to section 3.2: Merge Patients & ICU Data to create a single DataFrame

Check the size of the new DataFrame to confirm a successful merge



Missing Data
Very Common in clinical data

Why is data missing?


• Data extraction
• Data collection



Missing Data Categorization
1. Missing completely at random:
• The propensity for a data point to be missing is completely
random and not dependent on observed or unobserved data

2. Missing at random:
• Systematic differences between the missing and observed values,
but these can be entirely explained by other observed variables



Missing Data Categorization
3. Missing not at random
• There is a relationship between the propensity of a value to be
missing and its value



Missing Data Treatment

Adapted from: https://medium.com/ibm-data-science-experience/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87



Missing Data: Case Deletion

List Wise
• Delete all data where any missing value is present

Pair Wise
• Analyze all cases where data is available
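
A rough sketch of the two deletion strategies in pandas, with made-up vital-sign values. It assumes DataFrame.dropna for list-wise deletion and relies on DataFrame.corr excluding missing values pair by pair.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with scattered missing values.
df = pd.DataFrame({
    "heart_rate": [80, 95, np.nan, 110, 70],
    "resp_rate": [16, np.nan, 20, 24, 14],
    "temp_c": [36.8, 37.2, 38.1, np.nan, 36.5],
})

# List-wise: drop every row that has at least one missing value.
listwise = df.dropna()

# Pair-wise: for each pair of columns, corr() uses all rows where
# both values are present, so different pairs can use different rows.
pairwise_corr = df.corr()

# Number of rows available to the heart_rate / resp_rate pair.
n_pair = df[["heart_rate", "resp_rate"]].dropna().shape[0]

print(len(listwise), n_pair)
```

Here list-wise deletion keeps 2 of 5 rows, while the heart_rate / resp_rate pair still has 3 usable rows.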



Missing Data: Imputation
Goal is to fill missing data with estimated values

Most common methods: mean/median/mode:


• Population-wide
• Cohort-wide
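
A minimal sketch of median imputation at both levels, assuming the cohort is defined by a hypothetical gender column. The values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "heart_rate": [70.0, np.nan, 90.0, 100.0, 110.0, np.nan],
})

# Population-wide: fill with the median of all observed values.
pop_median = df["heart_rate"].median()
df["hr_population"] = df["heart_rate"].fillna(pop_median)

# Cohort-wide: fill with the median of the patient's own group.
cohort_median = df.groupby("gender")["heart_rate"].transform("median")
df["hr_cohort"] = df["heart_rate"].fillna(cohort_median)

print(df)
```

Swapping "median" for "mean" (or using Series.mode for categorical data) gives the other variants listed above.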



Missing Data: Statistical-Model Imputation
Linear Regression
• Limitations:
• Reduces variability
• Overestimates the model fit and correlation coefficient

K-nearest Neighbor Imputation
• Limitations:
• The choice of k is critical in getting desired results
• Very slow
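
A rough sketch of linear-regression imputation on simulated data, meant only to illustrate the "reduces variability" limitation. It fits a straight line with np.polyfit; the variables x and y and all parameters are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)
y = 2.0 * x + rng.normal(0, 15, 200)
df = pd.DataFrame({"x": x, "y": y})

# Hide the first 60 y values to simulate missing data.
df.loc[df.index[:60], "y"] = np.nan
obs = df["y"].notna()

# Fit y ~ x on the observed rows only (polyfit returns slope, then intercept).
slope, intercept = np.polyfit(df.loc[obs, "x"], df.loc[obs, "y"], deg=1)

# Fill each missing y with its predicted value.
df["y_imputed"] = df["y"].fillna(intercept + slope * df["x"])

# Imputed points sit exactly on the fitted line, so the scatter around
# the line is smaller after imputation than in the observed data.
resid_obs = df.loc[obs, "y"] - (intercept + slope * df.loc[obs, "x"])
resid_all = df["y_imputed"] - (intercept + slope * df["x"])
print(resid_obs.std(), resid_all.std())
```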



Missing Data: Statistical-Model Imputation
Multiple Imputation by Chained Equations (MICE)
• Assumes data is missing at random
• Runs multiple regression models
• Each value is modeled conditionally
• Multiple data sets are made (usually at least 10)



Assessing Missing data in Python
Look for null entries
>> DataFrame.isnull( ).sum( )

Look for non-null entries
>> DataFrame.notnull( ).sum( )
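
A short sketch of counting missing values per column, with a made-up table. The column names are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dob": ["2100-01-01", "2090-05-05", "2080-03-03"],
    "dod": ["2180-01-01", None, None],
    "los": [1.5, np.nan, 2.0],
})

n_missing = df.isnull().sum()          # missing values per column
n_present = df.notnull().sum()         # non-missing values per column
pct_missing = df.isnull().mean() * 100 # share of missing values, in percent

print(n_missing)
print(pct_missing.round(1))
```

Note the parentheses after sum: without them pandas returns the method object instead of the counts.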



Assessing Missing Data

Go to section 3.3: Assess Missing Data in NEW Patients DataFrame and complete UP TO, but not
including Import Vital Signs



Data Mapping
Process of extracting and unifying data for further analysis

Measurements of interest could be mixed with measurements not of interest

The same value can have different names
• Sometimes the differences in names are important, other times they are not

Occurs in many data sets, including MIMIC-III



Data Mapping
Vital Signs:
• Blood Pressure (systolic/diastolic)
• Heart Rate
• Respiratory Rate
• Oxygen saturation (%)
• Temperature

In MIMIC-III, vital signs are mixed with other measurements in the CHARTEVENTS.csv file



Data Mapping with Vital Signs
Systolic Blood Pressure Synonyms in THIS dataset:
• 'Non Invasive Blood Pressure systolic',
• 'Arterial Blood Pressure systolic',
• 'Manual Blood Pressure Systolic Left',
• 'Manual Blood Pressure Systolic Right'



Data Mapping with Vital Signs
Count variable frequency
>> DataFrame.series.value_counts( )



Data Mapping with Dictionaries

Dictionaries are data structures that consist of a collection of key-value pairs
that can be changed (since Python 3.7 they preserve insertion order)

Dictionary = {
<key>: <value>
}



Data Mapping with Vital Signs
To accommodate synonyms, or extract items of interest from a
larger data set, you can use a dictionary
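
A minimal sketch of mapping synonyms with a dictionary. The systolic labels come from the earlier slide; the short names (SBP, HR), the extra rows and all values are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical slice of chart data: several names for the same vital sign,
# plus one measurement we are not interested in.
df = pd.DataFrame({
    "label": [
        "Non Invasive Blood Pressure systolic",
        "Arterial Blood Pressure systolic",
        "Heart Rate",
        "Manual Blood Pressure Systolic Left",
        "Some Other Measurement",
    ],
    "value": [118, 124, 82, 120, 3],
})

# Keys are the names found in the data, values are the unified names.
vital_map = {
    "Non Invasive Blood Pressure systolic": "SBP",
    "Arterial Blood Pressure systolic": "SBP",
    "Manual Blood Pressure Systolic Left": "SBP",
    "Manual Blood Pressure Systolic Right": "SBP",
    "Heart Rate": "HR",
}

print(df["label"].value_counts())         # how often each raw name occurs

df["vital"] = df["label"].map(vital_map)  # labels not in the dictionary become NaN
df_vitals = df.dropna(subset=["vital"])   # keep only the mapped measurements

print(df_vitals["vital"].value_counts())
```

Whether arterial and non-invasive pressures should share one name depends on the analysis; the dictionary makes that decision explicit and easy to change.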



Import the remaining data and assess missingness

Go to section 4.2: Import Vital Signs and complete up to section 5: Univariate & Bivariate Analysis



Univariate Analysis
Explore variables individually

Basic descriptive analysis

Central Tendency    Measure of Dispersion          Visualization
Mean                Interquartile Range            Histogram
Median              Standard Deviation/Variance    Box plot
Mode                Skewness
Min                 Kurtosis
Max
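
A short sketch computing the descriptive measures in the table with pandas, on a made-up heart-rate Series.

```python
import pandas as pd

# Hypothetical heart-rate values with one high reading.
hr = pd.Series([60, 72, 75, 75, 80, 88, 140], name="heart_rate")

summary = {
    "mean": hr.mean(),
    "median": hr.median(),
    "mode": hr.mode().iloc[0],   # mode() returns a Series; take the first mode
    "min": hr.min(),
    "max": hr.max(),
    "iqr": hr.quantile(0.75) - hr.quantile(0.25),
    "std": hr.std(),
    "var": hr.var(),
    "skew": hr.skew(),
    "kurtosis": hr.kurt(),       # pandas reports excess kurtosis (normal = 0)
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

hr.describe() prints several of these at once (count, mean, std, min, quartiles, max).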



Univariate Analysis: Skewness
Measure of the asymmetry of the probability distribution of a variable
• Positive or Right
• Negative or Left

Grading Skewness Severity
• Minimal: between -0.5 and 0.5
• Moderate: between -1 and -0.5 or between 0.5 and 1
• Severe: < -1 or > 1
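
A small sketch that applies the grading above to pandas skewness values. The helper name grade_skewness is hypothetical, and treating the boundaries (exactly 0.5 and 1) as belonging to the milder grade is an assumption, since the slide does not say.

```python
import pandas as pd


def grade_skewness(skew):
    """Grade skewness by absolute value, using the cut-offs from the slide."""
    size = abs(skew)
    if size <= 0.5:
        return "minimal"
    if size <= 1:
        return "moderate"
    return "severe"


symmetric = pd.Series([1, 2, 3, 4, 5])
right_tailed = pd.Series([1, 1, 1, 2, 2, 3, 50])  # one large value pulls the tail right

for name, series in [("symmetric", symmetric), ("right_tailed", right_tailed)]:
    skew = series.skew()
    direction = "right (positive)" if skew > 0 else "left (negative) or none"
    print(name, round(skew, 2), grade_skewness(skew), direction)
```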
https://en.wikipedia.org/wiki/Skewness



Univariate Analysis: Kurtosis
“The kurtosis parameter is a measure of the combined weight of the tails relative to the rest
of the distribution.”

Kurtosis > 3: Positive

Kurtosis = 3: No Kurtosis/Normal

Kurtosis < 3: Negative
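
One practical caveat, sketched on simulated data: pandas Series.kurt() reports excess kurtosis (Fisher's definition), where a normal distribution scores about 0 rather than 3. Adding 3 gives a value comparable to the thresholds above. The distributions and sample sizes here are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
normal = pd.Series(rng.normal(size=100_000))
heavy_tailed = pd.Series(rng.standard_t(df=5, size=100_000))  # heavier tails than normal
flat = pd.Series(rng.uniform(size=100_000))                   # lighter tails than normal

for name, series in [("normal", normal), ("heavy_tailed", heavy_tailed), ("flat", flat)]:
    excess = series.kurt()   # excess kurtosis: normal is close to 0
    pearson = excess + 3     # same scale as the slide: normal is close to 3
    print(f"{name}: excess={excess:.2f}, kurtosis={pearson:.2f}")
```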

https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics#kurtosis
https://bishalbanksonfinance.wordpress.com/tag/probabality-distribution/



Bivariate Analysis
A method to determine the relationship between 2 variables

1. Visualization: Scatter plots

2. Regression analysis: Find the equation for the line or curve that best fits the data

3. Correlation coefficients: A measure of association between two variables
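
A minimal sketch of items 2 and 3 on simulated data: a Pearson correlation coefficient and a straight-line fit with np.polyfit. The variable names and the simulated relationship are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
heart_rate = rng.normal(85, 12, 300)
# Simulated relationship: respiratory rate rises with heart rate, plus noise.
resp_rate = 0.15 * heart_rate + rng.normal(5, 2, 300)
df = pd.DataFrame({"heart_rate": heart_rate, "resp_rate": resp_rate})

# Correlation coefficient (Pearson by default), between -1 and 1.
pearson_r = df["heart_rate"].corr(df["resp_rate"])

# Regression: slope and intercept of the best-fitting straight line.
slope, intercept = np.polyfit(df["heart_rate"], df["resp_rate"], deg=1)

print(f"r = {pearson_r:.2f}")
print(f"resp_rate = {slope:.3f} * heart_rate + {intercept:.2f}")
```

For the scatter plot itself, df.plot.scatter(x="heart_rate", y="resp_rate") is one option when matplotlib is available.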



Outliers
What is an outlier?

• A data point that appears far away and diverges from the overall pattern in a sample
• Can be univariate or bivariate



Outliers
How do outliers occur?
• Natural
• Sampling error
• Data entry error
• Data processing error
• Measurement error
• Intentional outlier
• Experimental error



Outliers
Why are they important?

• Alter population variance, leading to non-normal data distributions
• Alter performance of downstream analyses
• Bias results

How do you detect outliers?
• Visualization
• Bar charts
• Box plots
• Scatter plots (looking for bivariate outliers)
• There are many, many ways, but we will focus on visualization today!
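
The slides focus on visual detection. As a numeric companion, here is a sketch of the 1.5 × IQR rule, which is the default whisker rule that box plots draw in matplotlib and seaborn. It is one convention among many, and the heart-rate values are made up.

```python
import pandas as pd

# Hypothetical heart rates with one implausible entry.
hr = pd.Series([62, 70, 74, 78, 80, 83, 85, 90, 96, 300])

q1, q3 = hr.quantile(0.25), hr.quantile(0.75)
iqr = q3 - q1

# Points beyond the whiskers of a default box plot.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (hr < lower) | (hr > upper)

print(f"fences: [{lower:.2f}, {upper:.2f}]")
print(hr[is_outlier])
```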



Outliers: Univariate



Outliers: Univariate



Outliers: Bivariate



Outliers
How do you treat outliers? (Subject for an entire course!)

• Delete observations:
• Data entry error
• Data processing error
• Very few (subjective)

• Transform values
• Log conversion
• Binning
• Differential observation weights

• Impute
• Would avoid with natural outliers

• Treat outliers as a separate category
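
A rough sketch of two of the transformation options above, log conversion and binning, on made-up white blood cell counts. The bin edges and labels are arbitrary illustrations, not clinical reference ranges.

```python
import numpy as np
import pandas as pd

# Hypothetical white blood cell counts with one extreme value.
wbc = pd.Series([4.5, 7.2, 9.8, 11.0, 15.5, 98.0], name="wbc")

# Log conversion pulls the extreme value toward the rest of the data
# (requires strictly positive values).
wbc_log = np.log(wbc)

# Binning replaces each value with a category; the extreme value simply
# lands in the top bin. Intervals are (0, 4], (4, 11], (11, inf).
bins = [0, 4, 11, np.inf]           # illustrative cut points only
labels = ["low", "normal", "high"]
wbc_binned = pd.cut(wbc, bins=bins, labels=labels)

print(pd.DataFrame({"wbc": wbc, "log": wbc_log.round(2), "bin": wbc_binned}))
```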



Assessing Data in Python: Pivot Tables
DataFrames must be properly structured before they can be plotted

Long format (one row per measurement):

Patient       Label               Value
John Smith    Heart Rate          75
John Smith    Respiratory Rate    15

Wide format (one row per patient):

Patient       Heart Rate    Respiratory Rate
John Smith    75            15

>> DataFrame.pivot_table(values = 'value', index = ['columns'], columns = 'label')
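
A minimal sketch of the pivot above. The John Smith rows follow the slide; the second patient and the lowercase column names are made up for illustration.

```python
import pandas as pd

# Long format: one row per measurement.
df_long = pd.DataFrame({
    "patient": ["John Smith", "John Smith", "Jane Doe", "Jane Doe"],
    "label": ["Heart Rate", "Respiratory Rate", "Heart Rate", "Respiratory Rate"],
    "value": [75, 15, 88, 20],
})

# Wide format: one row per patient, one column per label.
# pivot_table averages the values when a patient has several rows for the
# same label (the default aggfunc is the mean).
df_wide = df_long.pivot_table(
    values="value", index=["patient"], columns="label"
).reset_index()

print(df_wide)
```

reset_index() turns the patient index back into an ordinary column, which is convenient for plotting.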



Visualize Data Within Python
Declare the graph properties
>> fig, ax = plt.subplots(rows,columns, figsize = (width,height))

Locate a subset of data from within the larger dataframe
>> DataFrame.loc[DataFrame.column == 'value of interest', 'return column name']

Use Seaborn to make distribution and boxplots
>> sns.distplot(data, ax = ax[ X ])
>> sns.boxplot(x = data, ax = ax[ X ])

Note: sns.distplot is deprecated in seaborn 0.11 and later; sns.histplot or sns.displot replace it

Pivot your dfce
>> DataFrame.pivot_table(values = 'value', index = ['columns'], columns = 'label').reset_index()

Use Seaborn to plot bivariate data
>> sns.pairplot(pivoted table)



Visualize Data Within Python
Seaborn can make a heatmap to help you more rapidly identify correlations
>> sns.heatmap(dflabs.corr(), vmax = 1)
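
A sketch of the correlation matrix that the heatmap displays, using simulated lab values (the column names and the relationship between them are made up). Selecting the numeric columns first is a precaution: recent pandas versions raise an error if corr() meets text columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 200
sodium = rng.normal(140, 4, n)
dflabs = pd.DataFrame({
    "sodium": sodium,
    "chloride": sodium - 38 + rng.normal(0, 2, n),  # simulated to track sodium
    "potassium": rng.normal(4.0, 0.5, n),           # simulated as independent
    "flag": ["normal"] * n,                         # non-numeric column
})

# Correlation matrix of the numeric columns only.
corr = dflabs.select_dtypes(include="number").corr()

# Blank out the diagonal (each variable correlates perfectly with itself)
# and find the strongest remaining pair.
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
strongest_pair = off_diag.abs().stack().idxmax()

print(corr.round(2))
print("strongest pair:", strongest_pair)
# With seaborn available: sns.heatmap(corr, vmax=1)
```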



Univariate & Bivariate Visualization with Vital Signs

Go to section 5: Univariate & Bivariate Analysis and complete until section 6: Data Transformation



Data Transformation
Skewed data
• Skewed data can violate model assumptions (logistic regression)
• Skewed data can amplify a class imbalance, degrading model performance toward the tail of the
distribution

Heteroskedasticity
• The relationship between two variables shows increasing scatter (non-constant standard
error) at extremes of measurement of the dependent variable
• Two forms:
• Conditional: Unpredictable volatility
• Unconditional: Predictable volatility
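
A rough numerical check for heteroskedasticity on simulated data, where the noise is made to grow with x. Comparing residual spread between the lower and upper halves of the data is an informal illustration, not a formal statistical test.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(1, 10, 500)
# Simulated data: the noise standard deviation grows with x.
y = 3.0 * x + rng.normal(0, 0.5 * x)

# Fit a straight line and compute the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Under constant variance these two spreads would be similar.
low_spread = residuals[x < np.median(x)].std()
high_spread = residuals[x >= np.median(x)].std()

print(f"residual spread, lower half: {low_spread:.2f}")
print(f"residual spread, upper half: {high_spread:.2f}")
```

A scatter plot of residuals against x shows the same pattern as a widening fan.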



Data Transformation: Heteroskedasticity
Conditional



Data Transformation: Heteroskedasticity
Unconditional



Data Transformation
One way to reduce skewness and heteroskedasticity is to transform (normalize) your data
• Remove/manage outliers
• Log
• Cube Root
• Binning
• Normalization
• Sigmoid
• Hyperbolic tangent
• Etc…

Again, there are many different ways to do this and the best way will depend on your
planned analyses and the question you are answering



Data Transformation
To log-transform your data, apply np.log to a Pandas Series:
>> DataFrame['column'] = np.log(DataFrame['column'])

To take the cube root of each value:
>> DataFrame['column'] = DataFrame['column']**(1/3)
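
A sketch comparing skewness before and after the two transformations, on a simulated right-skewed column (the name los and the lognormal parameters are made up). np.cbrt is used for the cube root because, unlike **(1/3), it also handles negative values.

```python
import numpy as np
import pandas as pd

# Simulated right-skewed data, e.g. a length-of-stay-like variable.
rng = np.random.default_rng(0)
df = pd.DataFrame({"los": rng.lognormal(mean=1.0, sigma=0.8, size=1000)})

# Log transform: values must be strictly positive
# (np.log1p, which computes log(1 + x), is a common choice when zeros occur).
df["los_log"] = np.log(df["los"])

# Cube root: milder than the log, and defined for zero and negative values.
df["los_cbrt"] = np.cbrt(df["los"])

# Skewness of each column; closer to 0 means more symmetric.
print(df.skew().round(2))
```

Storing the result in a new column, rather than overwriting the original, keeps the raw values available for comparison.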



Data Transformation

Go to section 6: Data Transformation and go until the end!



Questions?
Thank you!



References:
1. Grus, Joel. Data Science from Scratch. O’Reilly Media;2015.
2. Marcelino, P. Comprehensive data exploration with python.
https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python. 2/2018. Accessed:
2/12/2020.
3. Sheridan, E. Un-bottling the data. 12/2/2019.
https://towardsdatascience.com/un-bottling-the-data-2da3187fb186. Accessed: 2/12/2020.
4. Ojeda, T. Data exploration with python, part 3.
https://www.districtdatalabs.com/data-exploration-with-python-3. Accessed: 2/12/20.
5. Sunil, R. A comprehensive guide to data exploration.
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/#two. Accessed: 2/12/2020.
6. Bratkovics, C. Exploratory data analysis tutorial in Python.
https://towardsdatascience.com/exploratory-data-analysis-tutorial-in-python-15602b417445. 6/16/19.
Accessed: 2/12/20.
7. Sunil, R. Ultimate guide for data exploration in Python using Numpy, Matplotlib and Pandas.
https://www.analyticsvidhya.com/blog/2015/04/comprehensive-guide-data-exploration-sas-using-python-numpy-scipy-matplotlib-pandas/. 4/9/2015. Accessed: 2/12/2020.
8. Akinfaderin, W. Missing data conundrum: exploration and imputation techniques.
https://medium.com/ibm-data-science-experience/missing-data-conundrum-exploration-and-imputation-techniques-9f40abe0fd87. 9/11/2017. Accessed: 2/12/20.
9. Wade, C. Transforming skewed data. https://towardsdatascience.com/transforming-skewed-data-73da4c2d0d16.
8/21/2019. Accessed: 2/20/20.
10. Chow, J. Log transformation base for data linearization does not matter.
https://towardsdatascience.com/log-transformation-base-for-data-linearization-does-not-matter-22eb3c1463d0.
6/27/2019. Accessed: 2/12/20.
11. Azur MJ, Stuart EA, Franggakis C, Leaf PJ. Multiple imputation by chained equations: what is it and how does it