
Session 2 – Data Analytics & Framework
Pretest
• https://www.menti.com
• Join code: 6480 0540
Big Data

Data is created constantly, and at an ever-increasing rate.

Mobile phones, social media, and imaging technologies used to determine a medical diagnosis: all these
and more create new data, which must be stored somewhere for some purpose.

Devices and sensors automatically generate diagnostic information that needs to be stored and
processed in real time.
Data Deluge
HOW BIG?
Attributes of Big Data
Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of
rows and millions of columns.

Complexity of data types and structures: Big Data reflects the variety of new data sources,
formats, and structures, including digital traces being left on the web and other digital
repositories for subsequent analysis.

Speed of new data creation and growth: Big Data can describe high-velocity data, with rapid
data ingestion and near-real-time analysis.
Distinctive
• Big Data cannot be efficiently analyzed
using only traditional databases or
methods.
• Big Data problems require new tools and
technologies to store, manage, and
realize the business benefit.
• These new tools and technologies enable
creation, manipulation, and management
of large datasets and the storage
environments that house them.
• Most Big Data is unstructured or
semi-structured in nature, which requires
different techniques and tools to
process and analyze.
Big Data Ecosystems
DATA ANALYTICS VS BUSINESS INTELLIGENCE
DATA DRIVEN DECISION MAKING
DATA ANALYTICS SKILL SET
Accounting & Data Analytics

• Auditing
• KPMG reports that:
• Audit must better embrace technology.
• Technology will enhance the quality,
transparency, and accuracy of the audit.

• Financial Reporting
• Because Financial Accounting relies on so
many estimates and valuations, some
believe that employing Data Analytics may
substantially improve the quality of those
estimates and valuations.
THE FRAMEWORKS
• Data science projects differ from most traditional Business
Intelligence projects and many data analysis projects in that
data science projects are more exploratory in nature.
• It is critical to have a process to govern them and ensure that
the participants are thorough and rigorous in their approach.
Traditional Theories
• Positivism
• Empirical Evidence
• Data Inquiry
• Analysis
The Frameworks
• The scientific method, in use for centuries, still provides a
solid framework for thinking about and deconstructing
problems into their principal parts. One of the most
valuable ideas of the scientific method relates to
forming hypotheses and finding ways to test ideas.
• CRISP-DM provides useful input on ways to frame
analytics problems and is a popular approach for data
mining.
• Tom Davenport’s DELTA framework: The DELTA
framework offers an approach for data analytics
projects, including the context of the organization’s
skills, datasets, and leadership engagement.
• Doug Hubbard’s Applied Information Economics (AIE)
approach: AIE provides a framework for measuring
intangibles and provides guidance on developing
decision models, calibrating expert estimates, and
deriving the expected value of information.
• “MAD Skills” by Cohen et al. offers input for several of
the techniques that focus on model planning, execution,
and key findings.
• The IMPACT cycle, a Data Analytics model by Isson
and Harriott.
CRISP-DM
Stage 1. Business Understanding

• The Business Understanding stage represents a part of the craft
where the analysts' creativity plays a large role.
• The key to great success is creative problem formulation by
an analyst regarding how to cast the business problem as one
or more data science problems.
• High-level knowledge of the fundamentals helps creative
business analysts see novel formulations.
• Take the time to explore what your organization expects to gain
from data mining. Try to involve as many key people as possible
in these discussions and document the results.
• The final step of this CRISP-DM phase discusses how to
produce a project plan using the information gathered here.
Step 1 - Determining Business Objectives
• Gain as much insight as possible into the business goals for
data mining.
• Task List
• Start gathering background information about the current business
situation.
• Document specific business objectives decided upon by key decision
makers.
• Agree upon criteria used to determine data mining success from a
business perspective.
Step 2 - Assessing the Situation
• This step involves asking questions such as:
• What sort of data are available for analysis?
• Do you have the personnel needed to
complete the project?
• What are the biggest risk factors involved?
• Do you have a contingency plan for each
risk?
Resource Inventory
• Hardware Resources
• Data & Knowledge Sources
• Personnel Resources

Requirements, Assumptions and Constraints
• Determine Requirements: Legal Restrictions, Scheduling, Result Deployment
• Clarify Assumptions: Economic Factors, Data Quality, User Expected Views
• Verify Constraints: Data Access, Data Usage, Budget

Risk & Contingencies
• Scheduling
• Financial
• Data
• Risk

Terminology
• List of terms or jargon: documentation

Cost-Benefit Analysis
• Project cost for data collection, deployment, and operation
• Benefit of objective achievement and insight
Step 3 - Determining Data Mining Goals
• DM Goals
• DM Success Criteria
Step 4 - Producing a Project Plan

• Have you discussed the project tasks and
proposed plan with everyone involved?
• Are time estimates included for all phases
or tasks?
• Have you included the effort and
resources needed to deploy the results or
business solution?
• Are decision points and review requests
highlighted in the plan?
• Have you marked phases where multiple
iterations typically occur, such as
modelling?
Stage 2. Data Understanding

• Collecting Initial Data
• Existing data. This includes a wide variety of
data, such as transactional data, survey data,
Web logs, etc. Consider whether the existing data
are enough to meet your needs.
• Purchased data. Does your organization use
supplemental data, such as demographics? If not,
consider whether it may be needed.
• Additional data. If the above sources don't meet
your needs, you may need to conduct surveys or
begin additional tracking to supplement the
existing data stores.
Describing Data
• Amount of data. For most modelling techniques, there are trade-offs
associated with data size. Large data sets can produce more accurate
models, but they can also lengthen the processing time. Consider
whether using a subset of data is a possibility.
• Value types. Data can take a variety of formats, such as numeric,
categorical (string), or Boolean (true/false).
• Coding schemes. Frequently, values in the database are
representations of characteristics such as gender or product type.
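The three points above (amount of data, value types, coding schemes) can be sketched in a few lines. This is a minimal illustration in plain Python, not a prescribed CRISP-DM tool; the sample records and field names are hypothetical.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"customer_id": 1, "gender": "M", "amount": 120.5, "active": True},
    {"customer_id": 2, "gender": "F", "amount": 80.0, "active": False},
    {"customer_id": 3, "gender": "F", "amount": 42.25, "active": True},
]


def describe(rows):
    """Summarize the row count and, per field, the value type and distinct values."""
    summary = {"rows": len(rows), "fields": {}}
    for field in rows[0]:
        values = [row[field] for row in rows]
        # The most common Python type name stands in for the "value type".
        types = Counter(type(value).__name__ for value in values)
        summary["fields"][field] = {
            "value_type": types.most_common(1)[0][0],
            "distinct": len(set(values)),
        }
    return summary


report = describe(records)
print("rows:", report["rows"])
for name, info in report["fields"].items():
    print(name, info)
```

A field with few distinct values, such as gender here, is a hint that a coding scheme is in use and should be documented.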
Writing a Data Description Report

Data Quantity
• Format?
• Sourcing?
• Size?

Data Quality
• Characteristic
• Types
• Priority
Exploring Data
• Use this phase of CRISP-DM to
explore the data with tables,
charts, and other visualization tools.
Writing a Data Exploration Report

Answer the following questions:


• What sort of hypotheses have you formed
about the data?
• Which attributes seem promising for further
analysis?
• Have your explorations revealed new
characteristics about the data?
• How have these explorations changed your
initial hypothesis?
• Can you identify particular subsets of data for
later use?
Verifying Data Quality
Types of problems:
• Missing data include values that are
blank or coded as a non-response
(such as $null$, ?, or 999).
• Data errors are usually typographical
errors made in entering the data.
• Measurement errors include data that
are entered correctly but are based on
an incorrect measurement scheme.
• Coding inconsistencies typically involve
nonstandard units of measurement or
value inconsistencies, such as the use
of both M and male for gender.
• Bad metadata include mismatches
between the apparent meaning of a
field and the meaning stated in a field
name or definition.
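Two of the problem types above, missing data and coding inconsistencies, lend themselves to a simple automated check. The sketch below is one possible approach in plain Python. The missing-value codes come from the examples in the list; the sample rows and the `GENDER_MAP` lookup are assumptions made for illustration.

```python
# Codes treated as "missing", taken from the examples in the list above.
MISSING_CODES = {"", "$null$", "?", "999"}

# Hypothetical lookup used to standardize gender codes.
GENDER_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}

# Hypothetical raw records, all values read as text.
rows = [
    {"id": "1", "gender": "M", "age": "34"},
    {"id": "2", "gender": "male", "age": "999"},
    {"id": "3", "gender": "F", "age": ""},
    {"id": "4", "gender": "?", "age": "51"},
]


def is_missing(value):
    """True when a value is blank or one of the non-response codes."""
    return value is None or str(value).strip() in MISSING_CODES


def missing_counts(rows):
    """Count missing values per field."""
    return {field: sum(1 for r in rows if is_missing(r[field])) for field in rows[0]}


def coding_inconsistencies(rows, field, mapping):
    """Find standardized codes that appear under more than one raw spelling."""
    spellings = {}
    for r in rows:
        raw = r[field]
        if is_missing(raw):
            continue
        code = mapping.get(raw.strip().lower())
        spellings.setdefault(code, set()).add(raw)
    return {code: sorted(s) for code, s in spellings.items() if len(s) > 1}


missing = missing_counts(rows)
inconsistent = coding_inconsistencies(rows, "gender", GENDER_MAP)
print(missing)
print(inconsistent)
```

A code such as 999 can be a legitimate value in some fields, so in practice the set of missing codes would be chosen per field rather than globally.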
Stage 3. Data Preparation

Data preparation typically involves the
following tasks:
• Merging data sets and/or records
• Selecting a sample subset of data
• Aggregating records
• Deriving new attributes
• Sorting the data for modelling
• Removing or replacing blank or missing
values
• Splitting into training and test data sets
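Two of the tasks listed above, replacing missing values and splitting into training and test sets, can be sketched as follows. This is a minimal plain-Python illustration; the income figures, the mean-fill strategy, and the 80/20 split are assumptions, not recommendations from the slides.

```python
import random
from statistics import mean

# Hypothetical records; None marks a missing income value.
incomes = [40.0, None, 55.0, 61.0, None, 48.0, 52.0, 45.0, 70.0, 39.0]
rows = [{"id": i, "income": value} for i, value in enumerate(incomes)]


def fill_missing(rows, field):
    """Replace missing values in `field` with the mean of the observed values."""
    observed = [r[field] for r in rows if r[field] is not None]
    fill_value = mean(observed)
    return [{**r, field: fill_value if r[field] is None else r[field]} for r in rows]


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


prepared = fill_missing(rows, "income")
train, test = train_test_split(prepared)
print(len(train), "training rows,", len(test), "test rows")
```

To avoid leaking information from the test set, the fill value is often computed from the training rows only; it is computed on all rows here to keep the sketch short.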
Selecting Data

Generally, there are two ways to select
data:
• Selecting items (rows) involves making
decisions such as which accounts,
products, or customers to include.
• Selecting attributes or characteristics
(columns) involves making decisions
about the use of characteristics such
as transaction amount or household
income.
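The two kinds of selection can be illustrated with a small plain-Python sketch. The account records, the "retail" filter, and the chosen columns are hypothetical.

```python
# Hypothetical account records.
accounts = [
    {"account_id": "A1", "segment": "retail", "transaction_amount": 250.0,
     "household_income": 48000, "branch": "North"},
    {"account_id": "A2", "segment": "corporate", "transaction_amount": 9100.0,
     "household_income": None, "branch": "South"},
    {"account_id": "A3", "segment": "retail", "transaction_amount": 75.5,
     "household_income": 62000, "branch": "North"},
]


def select_rows(rows, predicate):
    """Selecting items: keep only the rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]


def select_columns(rows, columns):
    """Selecting attributes: keep only the named columns."""
    return [{c: r[c] for c in columns} for r in rows]


retail = select_rows(accounts, lambda r: r["segment"] == "retail")
modelling_table = select_columns(
    retail, ["account_id", "transaction_amount", "household_income"]
)
print(modelling_table)
```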
Cleaning Data
Constructing New Data

There are two ways to construct
new data:
• Deriving attributes (columns or
characteristics)
• Generating records (rows)
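Both kinds of construction can be shown on a toy purchase table. The sketch below is illustrative only: the customer data, the derived `avg_order` attribute, and the rule of adding zero-purchase rows for customers with no orders are assumptions.

```python
# Hypothetical purchase summary and full customer list.
purchases = [
    {"customer_id": "C1", "total_spent": 300.0, "n_orders": 3},
    {"customer_id": "C2", "total_spent": 90.0, "n_orders": 2},
]
all_customers = ["C1", "C2", "C3"]


def generate_missing_records(rows, customer_ids):
    """Generating records: add a zero-purchase row for each customer without one."""
    seen = {r["customer_id"] for r in rows}
    extra = [
        {"customer_id": cid, "total_spent": 0.0, "n_orders": 0}
        for cid in customer_ids
        if cid not in seen
    ]
    return rows + extra


def derive_avg_order(rows):
    """Deriving attributes: add an average order value computed from two columns."""
    return [
        {**r, "avg_order": r["total_spent"] / r["n_orders"] if r["n_orders"] else 0.0}
        for r in rows
    ]


constructed = derive_avg_order(generate_missing_records(purchases, all_customers))
for row in constructed:
    print(row)
```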
Integrating Data
There are two basic methods of integrating
data:
• Merging data involves merging two data sets
with similar records but different attributes.
The data is merged using the same key
identifier for each record (such as customer
ID). The resulting data increases in columns
or characteristics.
• Appending data involves integrating two or
more data sets with similar attributes but
different records. The data is integrated based
upon similar fields (such as product name or
contract length).
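The difference between merging and appending can be sketched in plain Python. The data sets and field names are hypothetical, and the merge shown is an inner join, which is only one of several possible join behaviours.

```python
# Hypothetical data sets: same customers, different attributes.
demographics = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 51},
]
spending = [
    {"customer_id": 1, "total_spent": 300.0},
    {"customer_id": 2, "total_spent": 90.0},
]

# Hypothetical data sets: same attributes, different records.
contracts_2022 = [{"product_name": "Basic", "contract_length": 12}]
contracts_2023 = [{"product_name": "Premium", "contract_length": 24}]


def merge_on_key(left, right, key):
    """Merging: combine attributes of records that share the same key (inner join)."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left if l[key] in right_by_key]


def append_rows(first, second):
    """Appending: stack records from data sets that share the same attributes."""
    if first and second and set(first[0]) != set(second[0]):
        raise ValueError("data sets must have the same attributes to be appended")
    return first + second


merged = merge_on_key(demographics, spending, "customer_id")
appended = append_rows(contracts_2022, contracts_2023)
print(merged)    # more columns, same number of rows
print(appended)  # same columns, more rows
```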
Formatting Data

As a final step before model building,
it is helpful to check whether certain
techniques require a particular format
or order to the data. Consider the
following questions when formatting
data:
• Which models do you plan to use?
• Do these models require a particular
data format or order?
Stage 4. Modeling

The modelling stage is the primary place where data mining techniques are applied to the data. It
is important to have some understanding of the fundamental ideas of data mining, including the
sorts of techniques and algorithms that exist, because this is the part of the craft where the
most science and technology can be brought to bear.

It is rare for an organization's data mining question to be answered satisfactorily with a
single model and a single execution. This is what makes data mining so interesting: there are many
ways to look at a given problem.

Modelling is usually conducted in multiple iterations. Typically, data miners run several
models using the default parameters and then fine-tune the parameters or revert to the data
preparation phase for manipulations required by their model of choice.

1. Selecting Modeling Techniques
• The data types available for mining.
• Your data mining goals.
• Specific modeling requirements.

2. Generating a Test Design
• Describing the criteria for "goodness" of a model.
• Defining the data on which these criteria will be tested.

3. Building the Models
• Parameter settings include the notes you take on parameters that produce the best results.
• The actual models produced.
• Descriptions of model results.

4. Assessing the Model
• Adjust the parameters of existing models.
• Choose a different model to address your data mining problem.
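The iteration described in this stage (run with default parameters, record the settings and results, then adjust) can be sketched with a deliberately tiny example. The "model" below is a toy threshold classifier standing in for a real modeling technique; the held-out data, the default threshold of 0.5, and the candidate thresholds are all hypothetical.

```python
# Test design: the "goodness" criterion is accuracy, measured on held-out data.
# Each pair is (score, true_label); the values are hypothetical.
held_out = [
    (0.10, 0), (0.25, 0), (0.30, 0), (0.35, 1),
    (0.45, 1), (0.55, 1), (0.80, 1), (0.20, 0),
]

DEFAULT_THRESHOLD = 0.5


def predict(score, threshold):
    """Toy model: predict class 1 when the score reaches the threshold."""
    return 1 if score >= threshold else 0


def accuracy(data, threshold):
    """Fraction of held-out examples the toy model classifies correctly."""
    correct = sum(1 for score, label in data if predict(score, threshold) == label)
    return correct / len(data)


# First iteration: build the model with its default parameter.
default_accuracy = accuracy(held_out, DEFAULT_THRESHOLD)

# Later iterations: adjust the parameter and keep notes on each setting tried.
candidates = [0.2, 0.35, 0.5, 0.65]
results = {t: accuracy(held_out, t) for t in candidates}
best_threshold = max(results, key=results.get)

print("default:", default_accuracy)
print("settings tried:", results)
print("best threshold:", best_threshold)
```

The `results` dictionary plays the role of the parameter-setting notes mentioned under Building the Models. If the same held-out data is used for tuning, a separate test set is normally kept aside for the final assessment.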
Stage 5. Evaluation

• Evaluating the Results


• Ranking the Models.
• New Questions.

• Review Process
• Determining the Next Steps
Stage 6. Deployment

• Planning for Deployment
• Planning Monitoring and Maintenance
• Producing a Final Report
• Conducting a Final Project Review
Q&A
