
Introduction to Business Analytics

By Dr. Kingshuk Srivastava
Changing Life Under the Digital Age

Transformation is critical: companies must shift to a data-driven
business. 72% are vulnerable to disruption within three years.
Why? ... Suddenly!

Why we're all vulnerable to seismic shifts:

Internal threats
• Siloed data and systems
• Gaps in expertise and skills
• Inability to react quickly

External threats
• Born-on-digital companies that steal market share or rewrite customer
expectations (an estimated 274,000 startups launch worldwide each day)
• New business models that reinvent our industry and change the game
altogether
The Shift to a Data-Driven Organization

[Figure: a maturity curve plotting value against uses of data. Stages
progress from operations reporting and data warehousing, through
self-service analytics and data science, to new business models; that
is, from data modernization, to decision efficiency, to data
monetization.]
What is Data Science?

Data science is a "concept to unify statistics, data analysis and their
related methods" in order to "understand and analyze actual phenomena"
with data.
Why Analytics?

• The process of collecting, organizing and analysing large sets of data to discover the useful
information that matters most

• Organizations have far more data than ever before

• Analytics solutions help organizations make better and faster decisions.

• Analytics identifies opportunities for improvement

• Companies are increasingly using Business Analytics to understand their data and develop their
business goals

Analytics is the discovery of meaningful patterns in data rich with
recorded information.
Significance of Analytics

• Convert extensive data into powerful insights that drive efficient decisions

• Base your decisions and strategies on data rather than intuition

• Apply the right analytics to your data to achieve the desired improvements

• Achieve breakthrough results
What is Data Analytics?

Analytics is the use of:

• data,
• information technology,
• statistical analysis,
• quantitative methods, and
• mathematical or computer-based models

to help managers gain improved insight into their business operations and make better, fact-based
decisions.

Business Analytics and Business Intelligence are subsets of Data Analytics.
Components Leading to Analytics

[Figure: data science sits at the intersection of three skill areas.]

• Domain expertise: supply chain, CRM, financials, networking, engineering research
• Computer science: scripting, SQL, Python, R, Scala, data pipelines, big data / Apache Spark
• Math & stats: machine learning, computational mathematics

Data science projects require multiple skills.
[Slides: the understanding-to-wisdom progression; another approach to
differentiating types of analytics by segment (understanding, decision,
action); components of data analytics; application areas, domains and
sector-specific specializations (e.g., graph analytics); types of data;
and the V's of big data.]
Data Collection Techniques

• Observations,
• Tests,
• Surveys,
• Document analysis
(the research literature)
Quantitative Methods

Experiment: a research situation with at least one independent
variable, which is manipulated by the researcher.

Independent variable: the variable manipulated in the study; the cause
of the outcome of the study.

Dependent variable: the variable affected by the independent variable;
the effect of the study.

y = f(x)
Which is which here? (y is the dependent variable; x is the independent
variable.)
Key Factors for High-Quality Experimental Design

• Data should not be contaminated by poor measurement or errors in
procedure.

• Eliminate confounding variables from the study, or minimize their
effects on the variables.

• Representativeness: does your sample represent the population you are
studying? Use random sampling techniques.
What Makes a Good Quantitative Research Design?

Four key elements:

1. Freedom from bias
2. Freedom from confounding
3. Control of extraneous variables
4. Statistical precision to test the hypothesis

Bias: when observations favor some individuals in the population over
others.

Confounding: when the effects of two or more variables cannot be
separated.

Extraneous variables: any variables other than the independent variable
that affect the dependent variable. These need to be identified and
minimized. E.g., when studying erosion potential as a function of clay
content, rainfall intensity, vegetation and duration would be considered
extraneous variables.
Precision versus Accuracy

"Precise" means sharply defined or measured.

"Accurate" means truthful or correct.

A measurement can be: both accurate and precise; accurate but not
precise; precise but not accurate; or neither accurate nor precise.
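The distinction can be made concrete with a small sketch: given repeated measurements of a known true value (the numbers below are illustrative), accuracy is the closeness of the mean to the truth, and precision is the tightness of the spread:

```python
from statistics import mean, stdev

def accuracy_and_precision(measurements, true_value):
    """Accuracy: how close the mean measurement is to the truth (low bias).
    Precision: how tightly the measurements cluster together (low spread)."""
    bias = abs(mean(measurements) - true_value)   # accuracy error
    spread = stdev(measurements)                  # precision error
    return bias, spread

true_value = 10.0
accurate_precise  = [9.9, 10.0, 10.1, 10.0]    # low bias, low spread
precise_not_accur = [12.0, 12.1, 11.9, 12.0]   # high bias, low spread

b1, s1 = accuracy_and_precision(accurate_precise, true_value)
b2, s2 = accuracy_and_precision(precise_not_accur, true_value)
print(b1, s1)  # small bias, small spread: accurate and precise
print(b2, s2)  # large bias, small spread: precise but not accurate
```

The same spread with a shifted mean is what makes the "precise but not accurate" quadrant.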
Sampling
Sampling is the problem of accurately acquiring the necessary
data in order to form a representative view of the problem.

This is much more difficult to do than is generally realized.


Overall Methodology:
* State the objectives of the survey
* Define the target population
* Define the data to be collected
* Define the variables to be determined
* Define the required precision & accuracy
* Define the measurement 'instrument'
* Define the sample size & sampling method, then
select the sample
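The last step of the methodology can be sketched in code; a minimal simple-random-sampling example in Python, with a hypothetical population and sample size:

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw a simple random sample of size n without replacement.

    Every member of the target population has an equal chance of
    selection, which supports representativeness."""
    rng = random.Random(seed)  # seeded for a reproducible sample
    return rng.sample(population, n)

# Hypothetical target population: household incomes (in thousands).
population = list(range(20, 120))
sample = simple_random_sample(population, n=10, seed=42)
print(sample)             # 10 incomes drawn uniformly at random
print(len(sample) == 10)  # sample size matches the design
```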
Data Preprocessing
Data Quality: Why Preprocess the Data?

• Measures for data quality (a multidimensional view):

• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some modified but some not, dangling references, …
• Timeliness: is the data updated in a timely manner?
• Believability: how trustworthy is the data?
• Interpretability: how easily can the data be understood?
Major Tasks in Data Preprocessing

• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
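The normalization step in data transformation can be illustrated with a minimal Python sketch; the two functions below implement the common min-max and z-score rescalings (the salary values are made up for illustration):

```python
from statistics import mean, stdev

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

def z_score_normalize(values):
    """Center on the mean and scale by the standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

salaries = [30_000, 45_000, 60_000, 90_000]
print(min_max_normalize(salaries))  # [0.0, 0.25, 0.5, 1.0]
print(z_score_normalize(salaries))  # centered on 0, unit spread
```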
Data Cleaning

• Data in the real world is dirty: lots of potentially incorrect data, e.g., faulty instruments, human or
computer error, transmission errors
• incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
• e.g., Occupation=“ ” (missing data)
• noisy: containing noise, errors, or outliers
• e.g., Salary=“−10” (an error)
• inconsistent: containing discrepancies in codes or names, e.g.,
• Age=“42”, Birthday=“03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
• Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
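The kinds of dirty data listed above can be caught with simple validation rules. A minimal sketch in Python, where the field names and the reference date are assumed for illustration:

```python
from datetime import date

def find_issues(record, today=date(2024, 1, 1)):
    """Flag the three kinds of dirty data from the slide:
    incomplete, noisy (out-of-range), and inconsistent values."""
    issues = []
    if not record.get("occupation"):            # incomplete: Occupation=""
        issues.append("missing occupation")
    if record.get("salary", 0) < 0:             # noisy: Salary="-10"
        issues.append("negative salary")
    birthday = record.get("birthday")
    age = record.get("age")
    if birthday and age is not None:            # inconsistent: age vs birthday
        implied = today.year - birthday.year
        if abs(implied - age) > 1:
            issues.append("age inconsistent with birthday")
    return issues

bad = {"occupation": "", "salary": -10,
       "age": 42, "birthday": date(2010, 3, 7)}
print(find_issues(bad))  # flags all three problems
```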

Incomplete (Missing) Data

• Data is not always available


• E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
• Missing data may be due to
• equipment malfunction
• deletion because the data was inconsistent with other recorded data
• data not entered due to misunderstanding
• certain data not being considered important at the time of
entry
• failure to register history or changes of the data
• Missing data may need to be inferred
How to Handle Missing Data?

• Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values per
attribute varies considerably
• Fill in the missing value manually: tedious, and often infeasible
• Fill it in automatically with
• a global constant: e.g., “unknown” (which risks forming a new class!)
• the attribute mean
• the attribute mean for all samples belonging to the same class:
smarter
• the most probable value: inference-based such as Bayesian
formula or decision tree
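The automatic fill-in strategies can be sketched in a few lines of Python; the "income" attribute, the class labels and the values below are hypothetical:

```python
from statistics import mean

def impute_income(rows):
    """Fill missing 'income' values: prefer the mean of samples in the
    same class (smarter), falling back to the global attribute mean."""
    known = [r["income"] for r in rows if r["income"] is not None]
    global_mean = mean(known)                       # attribute mean

    by_class = {}                                   # class-conditional means
    for r in rows:
        if r["income"] is not None:
            by_class.setdefault(r["class"], []).append(r["income"])
    class_mean = {c: mean(v) for c, v in by_class.items()}

    for r in rows:
        if r["income"] is None:
            r["income"] = class_mean.get(r["class"], global_mean)
    return rows

rows = [
    {"class": "A", "income": 50}, {"class": "A", "income": 70},
    {"class": "B", "income": 30}, {"class": "A", "income": None},
]
print(impute_income(rows)[-1]["income"])  # 60: mean of class A incomes
```

An inference-based fill (Bayesian formula or a decision tree) would replace the class-mean lookup with a model prediction.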
Noisy Data

• Noise: random error or variance in a measured variable


• Incorrect attribute values may be due to
• faulty data collection instruments
• data entry problems
• data transmission problems
• technology limitation
• inconsistency in naming convention
• Other data problems which require data cleaning
• duplicate records
• incomplete data
• inconsistent data

How to Handle Noisy Data?

• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
• Combined computer and human inspection
• detect suspicious values and check by human (e.g., deal with
possible outliers)
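The binning steps above can be sketched in Python. The price list below is an illustrative example: sort, partition into three equal-frequency bins, then replace each value by its bin mean:

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning, then smooth each value by its bin mean."""
    data = sorted(values)                  # 1. sort the data
    size = len(data) // n_bins             # 2. partition into equal-frequency bins
    bins = [data[i * size:(i + 1) * size] for i in range(n_bins)]
    smoothed = []
    for b in bins:                         # 3. replace each value by its bin mean
        m = sum(b) / len(b)
        smoothed.extend([m] * len(b))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Smoothing by bin medians or bin boundaries follows the same shape, swapping only step 3.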

Data Cleaning as a Process
• Data discrepancy detection
• Use metadata (e.g., domain, range, dependency, distribution)
• Check field overloading
• Check uniqueness rule, consecutive rule and null rule
• Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code,
spell-check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to
detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
• Data migration tools: allow transformations to be specified
• ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
• Iterative and interactive (e.g., Potter’s Wheel)
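Rule-based data auditing can be sketched with a z-score check, a simpler stand-in for the correlation- and clustering-based outlier detection mentioned above (the salary values are illustrative):

```python
from statistics import mean, stdev

def z_score_outliers(values, threshold=2.0):
    """Flag values far from the mean in standard-deviation units,
    a simple auditing rule for spotting suspicious entries."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

salaries = [40, 42, 45, 43, 41, 44, 400]   # one suspicious entry
print(z_score_outliers(salaries))  # [400]
```

As the slide notes, flagged values are best checked by a human rather than removed blindly.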
THANK YOU
