
CHINHOYI UNIVERSITY OF TECHNOLOGY

Entrepreneurship & Business Sciences


Graduate Business School

MSc. Data Analytics


Big Data Analytics
[MSCDA 6-8]

Eng. N.F Thusabantu - Shoniwa


TOPIC 1 : Analytics
• Names of Big Data

• Case for Big Data

• Big Data Options Team Challenge



Content
1. Introduction
2. What is Big Data
3. Why Big Data
4. Characteristics of Big Data
5. Applications of Big Data
6. Big Data sources
7. Big Data Analytics Lifecycle
8. Tools used in Big Data
9. Risks of Big Data
10. Benefits of Big Data
Introduction
• Big data is generally defined as collections of data sets whose volume, velocity (rate of change over time), or variety is so great that the data is difficult to store, manage, process and analyse using traditional databases and data processing tools.

What is Big Data
• Examining large amounts of data
• Appropriate information
• Identification of hidden patterns, unknown correlations
• Competitive advantage
• Better business decisions: strategic and operational
• Effective marketing, customer satisfaction, increased revenue



Why Big Data
• Facebook generates 10 TB of data daily

• Twitter generates 7 TB of data daily

• IBM claims 90% of today’s stored data was generated in just the last two years.



Characteristics of Big Data
Volume
• Volume is probably the best known characteristic of big data;
this is no surprise, considering more than 90 percent of all today's
data was created in the past couple of years.

• The current amount of data can be quite staggering. Here are some examples:
o 300 hours of video are uploaded to YouTube every minute.
o An estimated 1.1 trillion photos were taken in 2016, and that number rose by 9 percent in 2017. Because the same photo usually has multiple copies stored across different devices, photo or document sharing services, and social media services, the total number of photos stored grew from 3.9 trillion in 2016 to 4.7 trillion in 2017.
o In 2016, estimated global mobile traffic amounted to 6.2 exabytes per month. That's 6.2 billion gigabytes.
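
The exabyte figure above can be checked with a quick unit conversion. This is a minimal sketch, assuming decimal (SI) units where 1 GB = 10^9 bytes and 1 EB = 10^18 bytes; the function and variable names are illustrative only.

```python
# Decimal (SI) storage units are assumed; binary units (GiB, EiB) differ slightly.
BYTES_PER_GB = 10**9
BYTES_PER_EB = 10**18


def exabytes_to_gigabytes(exabytes: float) -> float:
    """Convert exabytes to gigabytes using decimal units."""
    return exabytes * BYTES_PER_EB / BYTES_PER_GB


# 6.2 EB per month is the 2016 mobile traffic estimate quoted in the slide.
monthly_mobile_traffic_gb = exabytes_to_gigabytes(6.2)
print(f"{monthly_mobile_traffic_gb:,.0f} GB per month")
```

One exabyte is a billion gigabytes, so the two figures on the slide agree.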
[Figure: single user view, combining social media, banking and finance, gaming, entertainment, purchases and our known history]


Variety
• With big data we have to handle not only structured data but also semi-structured and, mostly, unstructured data.

• As the examples above suggest, most big data appears to be unstructured: besides audio, image and video files, social media updates and other text formats, there are also log files, click data, machine and sensor data, etc.

• Variety refers to the forms the data takes. Big data comes in different forms, structured or unstructured, including text, image, audio, video and sensor data.



Velocity (Speed)
• Velocity refers to the speed at which data is generated, produced, created, or refreshed. It is another important characteristic of big data and the primary reason for the exponential growth of data. Modern IT, industrial and other systems generate data at ever higher speeds.

• It sounds impressive that Facebook's data warehouse stores upwards of 300 petabytes of data, but the velocity at which new data is created should also be taken into account. Facebook claims 600 terabytes of incoming data per day.

• Google alone processes on average more than "40,000 search queries every second," which roughly translates to more than 3.5 billion searches per day.

• Data is being generated fast and needs to be processed fast

• Online Data Analytics
• Late decisions → missing opportunities
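
The velocity figures above can be sanity-checked with simple rate arithmetic. This is a minimal sketch; the function name is illustrative, and decimal units (1 TB = 1,000 GB) are an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def events_per_day(events_per_second: int) -> int:
    """Scale a per-second event rate up to a full day."""
    return events_per_second * SECONDS_PER_DAY


# Exactly 40,000 searches per second, the lower bound quoted for Google.
searches_per_day = events_per_day(40_000)

# 600 TB of incoming data per day (Facebook's claim) as a per-second rate in GB.
facebook_gb_per_second = 600 * 1_000 / SECONDS_PER_DAY

print(f"{searches_per_day:,} searches per day")
print(f"{facebook_gb_per_second:.2f} GB per second")
```

Exactly 40,000 queries per second gives about 3.46 billion searches per day, so the quoted 3.5 billion implies an average rate slightly above 40,000.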
DATA State


Applications of Big Data



Case Studies in Big Data and
Cyber Threats presented
UNITED STATES OF AMERICA
• In 2012, the Obama administration announced the Big Data Research and
Development Initiative, to explore how big data could be used to address important
problems faced by the government
• Composed of 84 different big data programs spread across six departments
• Big data analysis played a large role in Barack Obama's successful 2012 re-election
campaign
Voter segments (2 × 2 matrix):
• Registered voter + likes Obama
• Not registered voter + likes Obama
• Registered voter + dislikes Obama
• Not registered voter + dislikes Obama
INDIA
• Big data analysis was tried out by the BJP in its successful campaign for the 2014 Indian General Election.

• The Indian government uses numerous techniques to ascertain how the Indian electorate is responding to government action, as well as to gather ideas for policy augmentation.

ISRAEL
• A big data application was designed by Agro Web Lab to aid
irrigation regulation.

• Personalized diabetic treatments can be created through GlucoMe's big data solution.
TARGET MARKETING
• All this is made possible by using client data that is openly available
• A clear example is gathering social media and browser data to build a clear picture of customers, in an evolving field of Big Data called Social Network Analysis (SNA)

• Through SNA:
o A company can now accurately predict when one of its clients is expecting a baby

Insurance companies
• Can actually understand how you drive
eBay.com
• Uses two data warehouses at 7.5 petabytes and 40PB as well as a 40PB
Hadoop cluster for search, consumer recommendations, and merchandising

Amazon.com
• Handles millions of back-end operations every day, as well as queries from
more than half a million third-party sellers. The core technology that keeps
Amazon running is Linux-based and as of 2005 they had the world's three
largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB

FACEBOOK
• Handles over 50 billion photos from its user base

GOOGLE
• Was handling roughly 100 billion searches per month as of August 2012.



Big Data sources



Big Data Analytics Lifecycle
➢ DATA ANALYTICS LIFECYCLE OVERVIEW

➢ Phase 1: Discovery
➢ Phase 2: Data Preparation
➢ Phase 3: Model Planning
➢ Phase 4: Model Building
➢ Phase 5: Communicate Results
➢ Phase 6: Operationalize


Data Analytics Lifecycle Overview
• The data analytics lifecycle is designed for Big Data problems and data science projects
• There are six phases, and project work can occur in several phases simultaneously
• The cycle is iterative, to reflect how a real project unfolds
• Work can return to earlier phases as new information is uncovered


Key Roles for a Successful Analytics Project
• Business User – understands the domain area
• Project Sponsor – provides requirements
• Project Manager – ensures meeting objectives
• Business Intelligence Analyst – provides business domain
expertise based on deep understanding of the data
• Database Administrator (DBA) – creates DB environment
• Data Engineer – provides technical skills, assists data management and extraction, supports analytic sandbox
• Data Scientist – provides analytic techniques and modeling
Background and Overview of Data Analytics Lifecycle
• Data Analytics Lifecycle defines the analytics process and best practices from discovery to project completion

• The Lifecycle employs aspects of
o Scientific method
o Cross Industry Standard Process for Data Mining (CRISP-DM), a process model for data mining
o Davenport’s DELTA framework
o Hubbard’s Applied Information Economics (AIE) approach
o MAD Skills: New Analysis Practices for Big Data by Cohen et al.


Overview of Data Analytics Lifecycle


Phase 1: Discovery

1. Learning the Business Domain
2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources


Phase 2: Data Preparation
• Includes steps to explore, preprocess, and condition data
• Create robust environment – analytics sandbox
• Data preparation tends to be the most labor-intensive step in the analytics lifecycle
o Often at least 50% of the data science project’s time
• The data preparation phase is generally the most iterative and the one that teams tend to underestimate most often
Phase 3: Model Planning
• Activities to consider
o Assess the structure of the data – this dictates the tools and analytic techniques for the next phase
o Ensure the analytic techniques enable the team to meet the business objectives and accept or reject the working hypotheses
o Determine if the situation warrants a single model or a series of techniques as part of a larger analytic workflow
o Research and understand how other analysts have approached this or a similar kind of problem


Phase 4: Model Building
• Execute the models defined in Phase 3
• Develop datasets for training, testing, and production
• Develop analytic model on training data, test on test data
• Questions to consider
o Does the model appear valid and accurate on the test data?
o Does the model output/behavior make sense to the domain experts?
o Do the parameter values make sense in the context of the domain?
o Is the model sufficiently accurate to meet the goal?
o Does the model avoid intolerable mistakes?
o Are more data or inputs needed?
o Will the kind of model chosen support the runtime environment?
o Is a different form of the model required to address the business problem?
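
The train-then-test step above can be sketched as a simple hold-out split. This is a minimal illustration in plain Python, not the course's prescribed method: the data set, the 80/20 split and the toy threshold model are all assumptions.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """Toy model: midpoint between the mean feature value of each class."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, data):
    """Share of records whose predicted class matches the true label."""
    correct = sum((x > threshold) == bool(y) for x, y in data)
    return correct / len(data)


# Hypothetical data: (feature, label) pairs; the label is 1 when the feature is large.
records = [(x, int(x >= 50)) for x in range(100)]
train, test = train_test_split(records)
threshold = fit_threshold(train)
test_accuracy = accuracy(threshold, test)
print(f"threshold={threshold:.1f}, test accuracy={test_accuracy:.2f}")
```

The model only ever sees the training records, so the accuracy on the held-out records is the figure to weigh against the questions listed above.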



Phase 5: Communicate Results
• Determine if the team succeeded or failed in its objectives
• Assess if the results are statistically significant and valid
o If so, identify aspects of the results that present salient findings
o Identify surprising results and those in line with the hypotheses
• Communicate and document the key findings and major
insights derived from the analysis
o This is the most visible portion of the process to the outside stakeholders
and sponsors
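
One way to assess whether a result is statistically significant is a permutation test. The sketch below is an illustration under assumptions: the two groups are hypothetical measurements, the 0.05 cut-off is a common convention rather than a rule from the slides, and other tests may suit a given project better.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign the pooled values to the two groups at random.
        rng.shuffle(pooled)
        a, b = pooled[: len(group_a)], pooled[len(group_a):]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical outcome measurements with and without the model's recommendation.
with_model = [12, 15, 14, 16, 13, 15]
without_model = [9, 8, 10, 7, 9, 8]
p_value = permutation_p_value(with_model, without_model)
print(f"p-value = {p_value:.4f}, significant at 0.05: {p_value < 0.05}")
```

A small p-value means a difference this large rarely appears when group labels are assigned at random.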



Phase 6: Operationalize
• In this last phase, the team communicates the benefits of the project
more broadly and sets up a pilot project to deploy the work in a
controlled way
• Risk is managed effectively by undertaking a small-scope pilot deployment before a wide-scale rollout
• During the pilot project, the team may need to execute the algorithm
more efficiently in the database rather than with in-memory tools like
R, especially with larger datasets
• To test the model in a live setting, consider running the model in a
production environment for a discrete set of products or a single line
of business
• Monitor model accuracy and retrain the model if necessary
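
Monitoring accuracy and deciding when to retrain can be sketched with a rolling window over recent predictions. This is a minimal sketch under assumptions: the class name, window size and 0.8 threshold are hypothetical choices, not values from the course.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions (a rolling window)."""

    def __init__(self, window_size=100, min_accuracy=0.8):
        self.outcomes = deque(maxlen=window_size)  # oldest outcomes drop off
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether one live prediction matched the observed outcome."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the window, or None before any prediction is recorded."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self):
        """Flag the model once windowed accuracy falls below the threshold."""
        acc = self.accuracy()
        return acc is not None and acc < self.min_accuracy


monitor = AccuracyMonitor(window_size=5, min_accuracy=0.8)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (1, 0), (0, 0)]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```

Here three of the five recent predictions were correct, so the monitor flags the model for retraining.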



Phase 6: Operationalize
Key outputs from successful analytics project

• Business user – tries to determine business benefits and
implications
• Project sponsor – wants business impact, risks, ROI
• Project manager – needs to determine if project completed on
time, within budget, goals met
• Business intelligence analyst – needs to know if reports and
dashboards will be impacted and need to change
• Data engineer and DBA – must share code and documentation
• Data scientist – must share code and explain model to peers,
managers, stakeholders



Phase 6: Operationalize
Four main deliverables
• Although the seven roles represent many interests, the interests
overlap and can be met with four main deliverables
1. Presentation for project sponsors – high-level takeaways for executive
level stakeholders
2. Presentation for analysts – describes business process changes and
reporting changes, includes details and technical graphs
3. Code for technical people
4. Technical specifications of implementing the code



Tools used in Big Data
• Where is processing hosted?
– Distributed Servers / Cloud (e.g. Amazon EC2)

• Where is data stored?
– Distributed Storage (e.g. Amazon S3)

• What is the programming model?
– Distributed Processing (e.g. MapReduce)

• How is data stored and indexed?
– High-performance schema-free databases (e.g. MongoDB)

• What operations are performed on data?
– Analytic / Semantic Processing
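
The MapReduce programming model mentioned above can be illustrated with a word count. This is a single-machine sketch of the map, shuffle and reduce steps in plain Python; it does not use Hadoop or any distributed framework, and the function names and sample documents are illustrative assumptions.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


documents = ["big data needs big storage", "data drives decisions"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle_phase(pairs))
print(word_counts)
```

In a real cluster the map and reduce calls run in parallel on different servers, and the framework performs the shuffle between them.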
Risks of Big Data
➢ UNORGANIZED DATA

• Big data is highly versatile.
• It comes from a number of sources and in a number of forms. There’s structured data, there’s unstructured data. There’s data coming from online and offline sources. And all this data keeps piling up each day, each minute.
• It’s overwhelming for enterprises to tackle such unorganized and siloed data sets effectively. A well-planned governance strategy can bring you out of your dark data and help you make sense of it.

➢ DATA STORAGE AND RETENTION
This is one of the most obvious risks associated with big data.
• When data gets accumulated at such a rapid pace and in such
huge volumes, the first concern is its storage.
• Traditional data storage methods and technology are just not
enough to store big data and retain it well.
• Enterprises today need a shift to cloud based data storage
solutions to store, archive and access big data effectively.

➢ COST MANAGEMENT
• The process of storing, archiving, analysing, reporting and
managing big data involves costs.
• Many small and medium enterprises think that big data is only
for big businesses, and they cannot afford it.
• However, with careful budgeting and planning of resources,
big data costs can be mitigated well.
• Once the initial set up, migration and overhauling costs are
taken care of, big data acts as an incredible revenue
generator for digital enterprises.

➢ INCOMPETENT ANALYTICS
• Without proper analytics, big data is just a pile of trash lying
unnecessarily in your organization.
• Analytics is what makes data meaningful, giving management
valuable insights to make business decisions and plan strategies
for growth.
• With data growing at such an alarming rate, there’s obviously a
lack of skilled professionals and technology to analyse big data
efficiently.
• It exposes enterprises to the risk of misinterpretation of data,
and wrong decision making. Hiring the right talent and
applying the right tools is crucial to make relevant decisions
from a big data project.

➢ DATA PRIVACY
• With big data comes its biggest risk: data privacy.
Enterprises worldwide make use of sensitive data, personal
customer information and strategic documents.
• When there’s so much confidential data lying around, the last
thing you want is a data breach at your enterprise.
• A security incident can not only affect critical data and bring
down your reputation; it also leads to legal actions and heavy
penalties.
• Taking measures for data privacy is not just a good initiative
anymore, it’s a compliance necessity.



Benefits of Big Data
➢ Identifying the root causes of failures and issues in real time
➢ Fully understanding the potential of data-driven marketing
➢ Generating customer offers based on their buying habits
➢ Improving customer engagement and increasing customer loyalty
➢ Re-evaluating risk portfolios quickly
➢ Personalizing the customer experience
➢ Adding value to online and offline customer interactions


Group Exercise
➢ RIPTech Consultancy (Pvt) is a tech company that deals in Software Engineering and Cyber Security. A recently appointed director wants to improve the company’s engagement of employees across the global centers of excellence (GCE) to drive innovation, research, and university partnerships

• Explain how you can accomplish the following using the Big Data Analytics Lifecycle (BDAL)
o Store formal and informal data
o Track research from global technologists
o Mine the data for patterns and insights to improve the team’s operations and strategy [25 marks]
SOLUTION



HINT
1. This is a project report so remember the key objectives

2. Show your competences as a BIG DATA ANALYST

3. Tools used must be justified

4. Explain your data sources and how data was mined

5. Report should not be in 3rd person perspective



Phase 1: Discovery
➢ TEAM MEMBERS AND ROLES

Business user, project sponsor, project manager – Vice President from Office of CTO

BI analyst – person from IT

Data engineer and DBA – people from IT

Data scientist – distinguished engineer


Phase 1: Discovery
➢ The data fell into two categories
o Five years of idea submissions from internal innovation contests
o Minutes and notes representing innovation and research activity from around the world
➢ Hypotheses grouped into two categories
o Descriptive analytics of what is happening to spark further creativity, collaboration, and asset generation
o Predictive analytics to advise executive management of where it should be investing in the future


Phase 2: Data Preparation
➢ Set up an analytics sandbox
➢ Discovered that certain data needed conditioning and normalization and that missing datasets were critical
➢ Team recognized that poor quality data could impact subsequent steps
➢ They discovered that many names were misspelled and that some had extra spaces
➢ These seemingly small problems had to be addressed
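
The name problems described above (extra spaces and misspellings) can be conditioned with a few lines of standard-library Python. This is a minimal sketch, not the case-study team's actual method: the reference list, the sample names and the 0.85 similarity cut-off are hypothetical.

```python
import difflib


def normalize_name(raw):
    """Collapse repeated whitespace, trim the ends, and use title case."""
    return " ".join(raw.split()).title()


def correct_name(raw, known_names, cutoff=0.85):
    """Map a cleaned name to the closest known name, if one is similar enough."""
    cleaned = normalize_name(raw)
    matches = difflib.get_close_matches(cleaned, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else cleaned


known = ["Tendai Moyo", "Rudo Chikwanha"]  # hypothetical reference list
raw_names = ["  tendai   moyo ", "Rudo Chikwanah", "New Person"]
cleaned_names = [correct_name(name, known) for name in raw_names]
print(cleaned_names)
```

Names with no close match are kept in their cleaned form rather than forced onto a known person, so a wrong merge is less likely.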


Phase 3: Model Planning
➢ The study included the following considerations
o Identify the right milestones to achieve the goals
o Trace how people move ideas from each milestone toward the
goal
o Trace ideas that die and others that reach the goal
o Compare times and outcomes using a few different methods
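
Tracing which ideas die and which reach the goal can be sketched as a milestone funnel. The milestone names and idea records below are hypothetical; the case study does not list its actual milestones.

```python
MILESTONES = ["submitted", "reviewed", "finalist", "funded"]  # hypothetical stages


def funnel(ideas):
    """Count how many ideas reached each milestone or a later one."""
    counts = {stage: 0 for stage in MILESTONES}
    for idea in ideas:
        reached = MILESTONES.index(idea["last_milestone"])
        # An idea that reached a late stage also passed every earlier stage.
        for stage in MILESTONES[: reached + 1]:
            counts[stage] += 1
    return counts


ideas = [
    {"id": 1, "last_milestone": "funded"},
    {"id": 2, "last_milestone": "reviewed"},
    {"id": 3, "last_milestone": "submitted"},
    {"id": 4, "last_milestone": "finalist"},
]
idea_funnel = funnel(ideas)
print(idea_funnel)
```

The drop between consecutive counts shows where ideas tend to die.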



Phase 4: Model Building
➢ Several analytic methods were employed

o NLP on textual descriptions
o Social network analysis using R and RStudio
o Developed social graphs and visualizations
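
The case study used R and RStudio for social network analysis. The sketch below swaps in plain Python to show one basic SNA measure, degree centrality, on a hypothetical set of links between idea submitters; the names and edges are invented for illustration.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Fraction of the other nodes that each node is directly connected to."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical co-submission links between idea submitters.
edges = [("Ana", "Ben"), ("Ana", "Chipo"), ("Ana", "Dev"), ("Ben", "Chipo")]
centrality = degree_centrality(edges)
top_influencer = max(centrality, key=centrality.get)
print(top_influencer, centrality)
```

People with high centrality are connected to many others and are candidates for the influencer role that a social graph makes visible.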



Phase 4: Model Building
Social graph of data submitters and finalists



Phase 4: Model Building
Social graph of top innovation influencers



Phase 5: Communicate Results
➢ Study was successful in identifying hidden innovators

➢ Found high density of innovators in Chipinge, Manicaland

➢ The CTO office launched longitudinal studies


Phase 6: Operationalize
➢ Deployment was not really discussed

➢ Key findings
o Need more data in future
o Some data were sensitive
o A parallel initiative needs to be created to improve basic BI activities
o A mechanism is needed to continually reevaluate the model after deployment

