
Data Sources

 Analytics uses data to improve decisions


 To plan and execute successful data collection, you need to know:
1. What variables might be relevant to the decision?
a. What are the hypotheses about how to improve the decision?
b. What variables are needed to test the hypotheses?
2. What are usable data sources for those variables?
a. What are the possible sources of the necessary information?
b. Are there data sourcing considerations (privacy, cost, standards,
etc.)?
3. What data preparation is needed?
a. What does profiling reveal about the quality of the available data?
b. How should data from different sources be integrated?
c. What additional preparation steps are necessary on the integrated
data to yield variables for analytic modeling?
Hypotheses and Variables
 What are the hypotheses about how to improve the decision?
o Hypotheses = proposed testable explanation
o Example: “The more medications a patient must take after discharge from
the hospital, the higher the likelihood of being readmitted”.
 What variables are needed to test the hypotheses?
o Variable: Measured characteristic that can take on two or more values
 Brainstorming on potential hypotheses with input from review of previous findings
is critical before selecting data sources
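The readmission hypothesis above can be made concrete as a variable comparison. A minimal sketch, using invented patient records (all values below are hypothetical, not real data):

```python
# Hypothetical illustration of turning a hypothesis into variables:
# "More post-discharge medications -> higher readmission likelihood."
# Every record below is invented for the example.
patients = [
    {"med_count": 2, "readmitted": False},
    {"med_count": 9, "readmitted": True},
    {"med_count": 4, "readmitted": False},
    {"med_count": 11, "readmitted": True},
    {"med_count": 3, "readmitted": False},
    {"med_count": 7, "readmitted": True},
]

def mean_meds(group):
    """Mean medication count for the readmitted (True) or not (False) group."""
    counts = [p["med_count"] for p in patients if p["readmitted"] is group]
    return sum(counts) / len(counts)

# A large gap between the group means is weak evidence for the hypothesis;
# a real analysis would apply a significance test to far more records.
print(mean_meds(True), mean_meds(False))
```

Here the variable `med_count` is the measured characteristic, and `readmitted` is the outcome the hypothesis predicts.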
Selecting Data Sources
 Structured vs Unstructured Data
 Internal vs External
 Open vs Proprietary
 Purpose-generated vs Pre-existing

Analyzing Unstructured Data

 Text
o Categorization, clustering, extraction of documents
o Sample achievement: IBM Watson analyzed over 200 million documents
 Graphs and Networks
o Shortest path, link prediction, community detection
o Sample achievement: Computing cliques in 2-billion-edge Twitter networks (Kang, Meeder, and Faloutsos, KDD’11)
 Sequences
o Clickstream path and dwell time analysis
o Time series trend and outlier detection, forecasting
o Genetic subsequence matching, sequence alignment
o Sample achievement: Mining trillions of subsequences in time series (Rakthanmanon et al., KDD’12)
 Multimedia
o Audio: filtering, speech recognition, translation
o Imagery: object and facial recognition
o Video: segmentation, object tracking, summarization
o Sample achievements: Microsoft hit a 5.1% error rate on speech recognition, exceeding human performance (Le Huang, 2017); real-time summarization of surveillance video to show highlights, with up to 40% compression (PanOptus, 2014)
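Shortest-path analysis on a network, as mentioned above, can be sketched with a plain breadth-first search. The toy follower graph below is invented for the example:

```python
from collections import deque

# Toy follower graph; the names and edges are invented for the example.
graph = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan", "eve"],
    "dan": ["eve"],
    "eve": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns one shortest path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(graph, "ann", "eve"))  # ['ann', 'cat', 'eve']
```

Production-scale work on billion-edge graphs (as in the Twitter result cited above) uses distributed frameworks, but the underlying traversal idea is the same.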

Internal Data Sources


 Data internal to the organization doing the analysis
 Types of internal data
o Customer relationship management
o Supply chain management
o Human Resources
o Finance and Accounting
o Order Processing
o Information Technology
 Internal Data Considerations
o May be surprisingly difficult to get or use, depending on the organization
o Often fragmented, with little documentation
o Ownership and access may be a political issue
o Data source may not have been set up for access by others
 Proprietary Data
o Data sold by the source or by an aggregator as a product or service
 Examples
 Financial: stock prices, ratings, earnings
 Legal: laws, court decisions, patents
 Healthcare: Directories, procedures, prescriptions
 Proprietary Data Considerations
o Can be expensive, but support and quality control may be worth it
o Pricing models can be complicated
 Time period
 Fields
 Modules
 Named users/ concurrent users
 Query Volumes
 Geographic restrictions on use
 Commercial vs. non-profit
 Internal vs. external (resale)
o Beware: scraping data in bulk from web interfaces may be prohibited
under the Terms of Service
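One machine-readable signal of a site's scraping policy is its robots.txt file, which Python's standard library can parse. A sketch with invented rules (a real check would fetch the live file, and robots.txt never overrides the Terms of Service):

```python
from urllib.robotparser import RobotFileParser

# Hard-coded robots.txt rules, invented for illustration. robots.txt is only
# one signal: a site's Terms of Service may prohibit bulk scraping even
# where robots.txt allows access.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("my-bot", "https://example.com/data.csv"))    # True
print(parser.can_fetch("my-bot", "https://example.com/private/x"))   # False
```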
 Purpose-Generated Data
o Sometimes the data you need for your project hasn’t been gathered and
you collect it first
 Examples:
 Surveys, polls, questionnaires
 Expert opinions, Delphi method
 “Wisdom of crowds” – type belief aggregation
 Field experiments, A/B testing
 Purpose-Generated Data Considerations
o Survey and experimental design are an art
 Pilot a survey or A/B test before deploying widely
o In surveying experts, diversity of expertise can be as important as depth of
expertise
o Beware of sample bias and groupthink
 Inputs should (at least initially) be independent
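For the A/B testing case above, the usual check for whether a lift is real is a two-proportion z-test, which needs only basic arithmetic. A sketch with invented conversion counts:

```python
import math

# Two-proportion z-test for a hypothetical A/B test; all counts are invented.
def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

z = two_proportion_z(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
# |z| > 1.96 corresponds to p < 0.05 (two-sided)
print(round(z, 2))
```

Note the test assumes independent observations in each arm, which is exactly why the "inputs should be independent" caution above matters for experimental design.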
 Data Privacy
o Demographic data, identity data, relationships, government records
o Search patterns, purchase habits, preferences, opinion, browsing habits
on the web
o Individual’s network, communities, and preferences expressed in social
media
o ISPs that treat packets differently based on application, content,
source, and cookies
o Location data: travel patterns, zip code
o TV viewing behavior
o Speed, time and location visited in a car
o Numbers dialed, cellphone usage behavior
o Energy usage, home appliances
 Potential Consequences of Privacy Violations
o Identity theft, fraud, crime
o Third-party information sharing without consent
o Intrusive and insensitive handling of personal data (such as in the
teenage pregnancy case, or showing dating ads to people in a certain
age range)
o Incorrect interpretation or profiling, hence annoying targeting
o Damage to reputation
o Unsolicited marketing and promotion calls
o Being discriminated against for personal traits and preferences
o Unknown future use of the data
o Targeted ads getting too intrusive and diluting browsing experience
o Use of data for profiling customer for differential pricing
o Possible blurring of personal, social, and work boundaries
 Questions to Ask about Data Quality
o Where are the data stored?
o Who owns the data?
o Who can access the data and how is data accessed?
o What fields are available?
o Are data fields defined consistently?
o What type is each field?
o How accurate is the data?
o How often and how are the data refreshed?
 Challenges in Data Sources
o Limited Availability:
 Legal/privacy restrictions
 Location/connectivity/timing
o Incompleteness:
 Records Missing
 Fields missing or not populated
o Inconsistency
 Same field means different things in different sources
 Attributes in one source are relations in another source
 Different data formats used in different source
 Different values represent the same object
o Inaccuracy
 Data is unintentionally or intentionally corrupted (or wrong to begin
with)
 Data was originally accurate, but now out of date
 Data has insufficient precision/granularity
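Two of the inconsistency problems listed above — different values representing the same object, and different data formats across sources — are often handled with alias tables and format normalization. A minimal sketch with invented source values:

```python
from datetime import datetime

# Sketch of reconciling two inconsistency problems from the list above.
# All source values are invented for the example.

# 1. Different values represent the same object: map aliases to one canonical name.
COMPANY_ALIASES = {"IBM": "IBM", "I.B.M.": "IBM", "Intl Business Machines": "IBM"}

# 2. Different date formats used in different sources: try each known format.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"]

def normalize_date(raw):
    """Convert a date string in any known source format to ISO 8601."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual review rather than guessing

print(COMPANY_ALIASES["I.B.M."])        # IBM
print(normalize_date("03/15/2024"))     # 2024-03-15
print(normalize_date("15 Mar 2024"))    # 2024-03-15
```

Returning `None` for unrecognized formats, rather than guessing, keeps inaccuracy from being silently introduced during integration.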
 Goals Of Exploratory Data Analysis
o Test your problem framing
 Test assumptions given the available data
 Suggest new questions to ask from data
o Understand the available data
 Know the level of detail available at the atomic level for each field
 Understand the metadata
 Identify patterns of errors or missing values
o Prepare for model-building
 Investigate correlations amongst the candidate predictors
 Explore dimension reduction techniques
 Take note of the variables’ relative predictive power
 Typical Steps in Exploratory Data Analysis
o Formulate your question
o Read in your data
o Check the “packaging” (number of rows and columns, etc.)
o Look at the “top” and “bottom” of your data
o Check your n’s (check counts against known landmarks)
o Validate with at least one external data source
o Make a plot to create expectations and check deviations
o Try the easy solution first
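The "check the packaging" and "top and bottom" steps above can be sketched in a few lines. The CSV content is inlined here (and invented) so the example is self-contained:

```python
import csv
import io

# "Check the packaging" and "look at the top and bottom" on a small CSV;
# the data is inlined and invented so the sketch is self-contained.
raw = """id,age,income
1,34,52000
2,29,48000
3,41,61000
4,38,59000
"""

rows = list(csv.DictReader(io.StringIO(raw)))

print(len(rows), len(rows[0]))   # row count, column count ("packaging")
print(rows[0])                   # "top" of the data
print(rows[-1])                  # "bottom" of the data

# Check your n's: compare the count against an expected landmark
assert len(rows) == 4, "unexpected record count"
```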
 Understand the Available Data
o Know the level of detail available at the atomic level for each field
o Understand the metadata (origin of the data, last updates, etc.)
o Identify patterns of errors or missing values
 Summary Statistics
o Measures
 Mean/Median/Mode
 Min/Max
 Variance/Standard Deviation
 Counts, Histograms
o Use to spot
 Missing values
 Outliers
 Invalid values
 Unexpected data ranges
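The summary measures above are enough to surface most of those problems. A sketch on an invented age column, with an invalid value (-1) and an outlier (214) deliberately planted:

```python
import statistics

# Using simple summary statistics to spot suspicious values; the ages
# below are invented, with -1 (invalid) and 214 (outlier) planted.
ages = [34, 29, 41, 38, -1, 36, 214, 40]

print(statistics.mean(ages), statistics.median(ages))  # mean pulled up by 214
print(min(ages), max(ages))                            # min/max expose both problems

# Flag values outside a plausible range for follow-up; don't silently drop them.
suspect = [a for a in ages if not 0 <= a <= 120]
print(suspect)  # [-1, 214]
```

Comparing the mean against the median is a quick outlier check: here the outlier drags the mean well above the median.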
 Other EDA Methods
o Relationships over time: Trend/ seasonality/ noise, Time Series
differences, autocorrelation plots
o Relationships across space: Mapping geospatial data, distance-based
clustering
o Unsupervised learning: Self-organizing maps, clustering, association
rules, deep learning
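For the "relationships over time" methods above, lag-1 autocorrelation is the simplest numeric check for serial dependence. A sketch on an invented upward-trending series:

```python
# Lag-1 autocorrelation as a quick check for serial dependence;
# the series values are invented and trend upward.
series = [10.0, 11.0, 13.0, 12.5, 14.0, 15.5, 15.0, 17.0]

def autocorr_lag1(xs):
    """Correlation between the series and itself shifted by one step."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t - 1] - mean) for t in range(1, n))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

r1 = autocorr_lag1(series)
print(round(r1, 2))  # near 1 -> strong trend/persistence; near 0 -> noise
```

High autocorrelation suggests trend or seasonality worth modeling; near-zero suggests the series behaves like noise.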
 Missing Values
o Values can be missing at random or missing systematically – the type of
“missingness” will affect the choice of imputation method
o Methods:
 Ignore the full record with the missing value
 Fill in the missing value manually
 Fill in the missing value with a global constant (e.g., “Unknown”)
 Substitute the mean or median of known values
 Substitute the mean or median of known values for the record’s class
 Use most probable value (from regression or decision tree) to
predict missing value
 Last resort – separate models for cases with missing information
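Two of the imputation methods above — substituting the global median and substituting the class-conditional median — can be sketched directly. The records are invented, with `None` marking missing income values:

```python
import statistics

# Sketch of two imputation methods from the list above; records are invented.
# None marks a missing income value.
records = [
    {"segment": "A", "income": 52000},
    {"segment": "A", "income": None},
    {"segment": "A", "income": 48000},
    {"segment": "B", "income": 91000},
    {"segment": "B", "income": None},
    {"segment": "B", "income": 95000},
]

def impute_global_median(records):
    """Replace missing incomes with the median over all known values."""
    known = [r["income"] for r in records if r["income"] is not None]
    med = statistics.median(known)
    return [{**r, "income": med if r["income"] is None else r["income"]}
            for r in records]

def impute_class_median(records):
    """Replace missing incomes with the median of the record's own segment."""
    filled = []
    for r in records:
        if r["income"] is None:
            known = [s["income"] for s in records
                     if s["segment"] == r["segment"] and s["income"] is not None]
            r = {**r, "income": statistics.median(known)}
        filled.append(r)
    return filled

print(impute_global_median(records)[1]["income"])  # global median
print(impute_class_median(records)[1]["income"])   # segment-A median
```

The class-conditional version imputes a value much closer to the record's peers, which is why it is usually preferred when a meaningful class variable exists.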
 Conclusion
o Analytics is the use of data to improve decisions
o Exploratory data analysis should be performed before modeling to test
assumptions and understand data limitations
o EDA results can shed light on problem framing, model selection, data
integration, imputation, and scaling before modeling begins
