JIGSAW ACADEMY
Analytics for Professionals

EXPLORATORY DATA ANALYSIS

EDA - OVERVIEW

Exploratory Data Analysis is the assessment of the quality and characteristics of the data available to tackle business problems.

We will cover the following in the session:
- What is data?
- I have a data set, can I start modeling?
- Good or bad: how do I assess the quality of data?
- Data characteristics: how to summarize information
- Data not conforming to expectations: a big problem?

DATA EXPLORATION CASE STUDY

A mobile service provider has noticed a lot of attrition in customers subscribing to their services. It wants to understand what is driving attrition, and identify potential options to retain customers.

Data available includes:
- Subscriber information, including age and location
- Service start date
- Service end date
- Usage by month in minutes
- Plan details
- Promotion details

DATA EXPLORATION CASE STUDY

[Table: sample rows from the telecom dataset; cell values not recoverable from the source]

EDA - PRELIMINARIES

- Analytical approaches to business problems focus on data-driven analysis and conclusions. So, what is "data"?
  * Thesaurus definition: "information"
  * Visibility: collected and compiled, accessible
  * Neutral, factual: always
- How to collect data? Why is it important to understand the mode of data collection?
  * Direct observation
  * Indirect collection
  * Active vs passive
  * Census vs sample

EDA - DATA TYPES

Data classification:
- Qualitative
  * Name
  * Gender
- Quantitative
  * Discrete, e.g. 0/1, subscriber number
    - Nominal: area codes
    - Cardinal: [example illegible in source]
  * Continuous, e.g. price: $2.93, income: $30,000
  * Dates

Other classification schemes:
- Primary vs secondary
- Based on usefulness
- Actual vs derived

EDA CASE STUDY - DATA CLASSIFICATION

Data contained is of three types:
a. Qualitative: type of plan, type of promotion
b. Quantitative: subscriber IDs, zip codes, number of minutes used in a month
c. Date: dates are denoted separately. A date is quantitative data, but needs to be treated with care.

What other characteristics are evident?

DATA EXPLORATION - STEPS

- Data integrity and usefulness cannot be assumed
- Basic sanity checks: do I see what I expect to see?
  * Should I always see what I expect to see?
  * Anomalies: always noise? What is an anomaly?
  * Domain knowledge
- Fundamental steps of data exploration
  * Basic data transformations: qualitative to quantitative
  * Summary statistics
  * Graphical analysis
  * Investigate anomalies, exceptions

EDA CASE STUDY - PRELIMINARIES

- What data is contained in the telecom dataset?
- How was it collected? Is it the universe?
- Active or passive?
- What are the values in each variable? Do you understand the values?
- Is there missing data?
- Do you see unexpected values?
- Is the data enough? Do you need more variables?

Once we review the data, we can formulate an approach to answer the problems posed by the telecom service provider. Without a thorough review, we run the danger of applying techniques that may not be appropriate, leading to incorrect results.

EDA CASE STUDY - BASIC CHECKS

Transform qualitative data to quantitative, for example:
- Plan type currently has: 200 for 10, Nights and Weekends, and Coast to Coast.
  Create a variable Plan_Type1 = 1 if the field has "200 for 10"

Generate derived values, for example:
- Lost customer? If a service end date exists, the customer has left. Create a variable that lists customer left = 1 if service end date > 0

Sanity checks:
- What % of customers have left?
- What is the earliest start date? Latest end date?
- If a customer has left, is there usage data post leaving?
- What is the maximum monthly usage in minutes? Minimum? Are there negative numbers?
- Is there a correlation between age and usage? What do you expect it to be? Is it borne out by data?
- Correlation between "Nights and Weekends" and age?

DATA EXPLORATION - SUMMARY STATISTICS

First steps:
- Summary statistics to assess
  * Completeness of data
  * Missing data
  * Outliers
- Why "summary" stats?
  * Large data sets
- Which summary stats?
  * Most common
    - Min, max
    - Mean, median, mode
    - Std. deviation
    - # Missing observations
    - Skewness

EDA CASE STUDY - BASIC CHECKS

[Table: summary statistics for the telecom variables; values not recoverable from the source]

Initial findings:
1. Unexpected values: negative for minutes used
2. Very high monthly minutes?
3. Should all zip codes have a standard length?

What do the means of the promotion variables reveal?

EDA CASE STUDY - BASIC CHECKS

- Only one negative number. Potential data entry error?
- Only one huge number. Potential data entry error?
- Many instances of codes with fewer than 5 digits

[Table: codes by number of digits (3, 4, 5); values not recoverable from the source]

Need further investigation.

EDA CASE STUDY - QUALITATIVE VARIABLES

Plan type:
1. 200 for 10 most popular
2. Coast to Coast least popular
3. Transform to numeric variable:
   * In Excel: find and replace
   * In SAS: create variable

New cell:

  Value   Count   Percent
  N         672      5.38
  U        9875     79
  Y        1953     15.62

1. Three values for new cell; is that expected?
2. "U" is most frequent, so does "U" denote No?
3. Transform to numeric variable:
   * In Excel: find and replace
   * In SAS: create variable

EDA - DERIVED VARIABLES

- No direct information on loss of customer
  * Needs to be derived from existing information
  * If a Service End Date is listed, it implies the customer has left
  * Create a variable that has 1 if Service End Date exists, 0 otherwise
- Another instance: age of customer
  * Has to be calculated from birth date. We cannot use birth date in date format directly in models.
- In general, derived variables are straightforward, but in some cases more judgment is required
  * For example, if we wanted to classify minutes used into Low, Medium, and High instead of a continuous variable, how are the Low, Medium, and High thresholds estimated?

EDA CASE STUDY - DERIVED VARIABLES

[Table: customers retained vs lost. Retained 61.86%, Lost 38.14%; counts and the age-group distribution are not recoverable from the source, where a further value of 19.83 appears without a legible label]

- Deriving actual customers lost reveals that almost 40% of total customers have ended service: a big number
- A distribution of customers by age group shows no skew by age. Is that reasonable or contrary to expectations?

DATA EXPLORATION - CROSS-TABS

- Investigating patterns one variable at a time is a starting point. However, given the many potential relationships between groups of variables, a critical step in data exploration is to look for patterns across groups of variables.
- For example, in order to create more meaningful age buckets, we should look at the proportion of customer attrition by age group to see if there are significant changes by age group.
- The simplest way to look at patterns across groups of variables is to create cross-tabs.

DATA EXPLORATION - CROSS-TABS

  Age group   Retained      Lost   % Retained   % Lost
  <25         [illegible]    992      41%         59%
  25-35       1651          1362      55%         45%
  35-45       1555           984      61%         39%
  45-55       1774           847      68%         32%
  >55         1887           592      76%         24%

- Clearly, high attrition for customers below the age of 25, and much lower for customers above the age of 55
- Potential to create age buckets as less than 25, between 25 and 55, and greater than 55? (Why is it better to have fewer variables?)
- Other relevant cross-tabs based on this dataset?
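The derived-variable and cross-tab steps above can be sketched in a few lines of Python (used here instead of the SAS or Excel shown in the slides). This is only an illustration under stated assumptions: the records, the field names `age` and `service_end`, and the helper names are hypothetical and do not come from the telecom dataset. The age buckets follow the ones suggested on the slide.

```python
from collections import Counter

# Hypothetical subscriber records (not from the telecom dataset).
# service_end is None when the customer is still active.
subscribers = [
    {"age": 22, "service_end": "2009-03-01"},
    {"age": 23, "service_end": None},
    {"age": 31, "service_end": None},
    {"age": 47, "service_end": "2009-06-15"},
    {"age": 52, "service_end": None},
    {"age": 60, "service_end": None},
    {"age": 63, "service_end": None},
    {"age": 19, "service_end": "2008-11-30"},
]


def customer_left(record):
    """Derived variable: 1 if a service end date exists, 0 otherwise."""
    return 1 if record["service_end"] else 0


def age_bucket(age):
    """Age buckets suggested on the slide: <25, 25-55, >55."""
    if age < 25:
        return "<25"
    if age <= 55:
        return "25-55"
    return ">55"


def attrition_crosstab(records):
    """Cross-tab of age bucket against the derived customer_left flag."""
    counts = Counter((age_bucket(r["age"]), customer_left(r)) for r in records)
    table = {}
    for bucket in ("<25", "25-55", ">55"):
        retained = counts[(bucket, 0)]
        lost = counts[(bucket, 1)]
        total = retained + lost
        # Percent lost within the bucket; None when the bucket is empty.
        pct_lost = round(100 * lost / total, 1) if total else None
        table[bucket] = {"retained": retained, "lost": lost, "pct_lost": pct_lost}
    return table


crosstab = attrition_crosstab(subscribers)
for bucket, row in crosstab.items():
    print(bucket, row)
```

Comparing `pct_lost` across buckets is the same reading the slide applies to its age-group table: large differences between buckets suggest that the bucket boundaries carry information about attrition.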
EDA - GRAPHICAL ANALYSIS

- Visualization of data is a powerful method of understanding data and patterns within data
- Multiple techniques are used to look at data to
  * Determine distribution
  * Identify spread
  * Assess bias/skewness
  * Identify outliers

GRAPHICAL DATA EXPLORATION

1. Useful graphical representations include
   I. Simple run charts
   II. Frequency distribution plots
       * Histograms
       * Probability plots
   III. Range charts
       * Box plots
       * Stem and leaf plots
   IV. Joint distribution charts

GRAPHICAL DATA EXPLORATION

- Simplest charts for single variables: run charts
  * Can assess distribution, spread, and outliers

CREATING RUN CHARTS

Excel
1. Select the variable you want to plot
2. Go to Insert/Chart/Line
3. Add label
4. Decide location of chart
5. Finish

GRAPHICAL DATA EXPLORATION

- Histograms
  * Split the range of the data into equal-sized bins
  * Plot frequency of bin values against each bin
  * Allows visualization of mean, variance, distribution, outliers

CREATING HISTOGRAMS

Excel
- Use the Tools/Data Analysis/Histogram add-in
- Choose variable to plot
- Decide on bin size
- Enter inputs
- Add labels

SAS
- Histogram option in PROC UNIVARIATE

    proc univariate data=Steel;
      var Length Width;
      histogram;
    run;

GRAPHICAL DATA EXPLORATION

- Probability plots: assess distribution shape
  * Normal probability plot: widely used to check for normality of distribution

CREATING PROBABILITY PLOTS

SAS
- PROC UNIVARIATE, qqplot option for normal probability plots
- For other distributions, use probplot with the parameters required

    proc univariate data=Measures;
      qqplot Length Width;
    run;

GRAPHICAL EDA - BOX PLOTS

- Box plots allow us to look at measures of central tendency explicitly (quartiles are specified)
- They make it easier to assess distance for potential outliers

CREATING BOX PLOTS

SAS
- PROC UNIVARIATE, plot option

    /* The vardef= value is missing in the source; df (the SAS default) is assumed here. */
    proc univariate data=stuff freq normal plot vardef=df;
      var weight height;
    run;

- Produces stem and leaf plots, box plots, and normal probability plots by default
- PROC BOXPLOT is also
  available

EDA - EXCEPTIONS

- We have looked at examples of obvious exceptions in previous slides (for example, negative minutes used, etc.)
- Based on derived variables, we need to look at other conditions to identify potential anomalies
  * Example: if a service end date exists, do data exist for minutes used post end date?
  * Other examples?
- Exceptions need to be investigated
  * There may be valid explanations for extreme values
  * Danger of making the data set very "general"

EDA - CORRELATIONS

- It is very important to look at the correlations between the independent variables
  * Is correlation expected? Should it be high or low?
    - Correlations can be positive or negative
    - Relationship could be linear or non-linear
  * What happens when correlation between two independent variables is high?
    - Multicollinearity
  * What about correlation between dependent and independent variables?

EDA - CORRELATIONS

Sample SAS correlation output: what do the numbers in the highlighted circle mean?

[SAS output: The CORR Procedure. Simple Statistics table, followed by correlation coefficients with Prob > |r| under H0: Rho=0 for the minutes-used variables; values not recoverable from the source]
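To make the correlation coefficient in output like the above concrete, here is a minimal Python sketch of the Pearson correlation between two variables. It is an illustration, not the SAS procedure itself: the function name `pearson_r` and the age and minutes values are hypothetical, chosen only to show a strong negative linear relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values (not from the telecom dataset): usage falls as age rises.
ages = [22, 35, 47, 58, 63]
minutes = [410, 350, 300, 220, 180]

r_age_minutes = pearson_r(ages, minutes)
print(f"correlation between age and minutes used: {r_age_minutes:.3f}")
```

A value near -1 or +1 indicates a strong linear relationship, and a value near 0 indicates little linear relationship. A non-linear relationship can still exist when the coefficient is small, which is why the slide asks whether the relationship is linear or non-linear.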
EDA SUMMARY

- It is critical to invest time in understanding the data once it has been pulled or received, since it is the starting point for building any solution
- Model results will only be as good as the data that goes in, and model results will be reliable only if the right models are applied to the existing data
- There are many techniques of EDA that can be used to gain a real understanding of available data, relationships between variables, and potential issues with available data

EDA - CASE STUDY

In the attached dataset (telecom), prepare a summary of findings based on initial data exploration, including:
- Size of dataset
- Number of variables
- Distributions
- Correlations among variables
- Potential outliers
- Quantum of missing data

JIGSAW ACADEMY
Analytics for Professionals
www.jigsawacademy.com
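As a starting point for the case study summary, the per-variable checks (size, missing data, range, central tendency, spread, negative values) can be sketched in Python with the standard library. This is a sketch under assumptions: the function name `summarize`, the use of `None` to mark a missing value, and the sample minutes are all hypothetical and are not taken from the telecom dataset.

```python
import statistics


def summarize(values):
    """Summary statistics for one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
        # Negative minutes are impossible, so any count above 0 needs investigation.
        "negatives": sum(1 for v in present if v < 0),
    }


# Hypothetical monthly minutes with two missing values, one negative
# entry, and one very large entry.
minutes_used = [120, 95, None, 310, -40, 150, 88, None, 20000, 132]

summary = summarize(minutes_used)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean far above the median, as in this sample, is a quick signal that a few very large values are pulling the average up, which matches the "one huge number" finding discussed in the basic checks.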
