
Understanding Big Data

Motivations, Considerations, Basic Concepts

1
The Evolution of Analytics

2
The Evolution of Analytics

3
The Evolution of Analytics

4
Big Data Perspectives

5
Big Data Impacts to Core Business Practices

6
Big Data Market Revenue Forecast

7
Size of Hadoop & Big Data Market

8
Big Data Market by Segment

9
Big Data Revenue by Segment

10
The Fastest Growing Category

11
Big Data Initiatives and Success Rate

12
Big Data in Business & Government

13
Big Data in Business & Government

14
Big Data Practices and Initiatives

15
16
Introducing Big Data
• Big Data is a strategic initiative built upon the premise
that internal data does not hold all the answers.
• Big Data has the ability to change the nature of
business.
• Many organizations' sole existence is based upon their
capability to generate insights that only Big Data can
deliver.
• Big Data is not just about technology; it is also
about how these technologies can propel an
organization forward.

17
Big Data as a Field
• Analysis, processing, and storage of large collections of
data from many sources.
• Big Data = Traditional Statistics + Analytic Algorithms
• Increasingly important because:
  • Datasets continue to become larger, more diverse, more complex, and
    more streaming-centric.
  • Advances in computational science have made it possible to process
    entire datasets, making the sampling used in traditional statistics
    unnecessary.
• Interdisciplinary: Mathematics, Statistics, Computer
Science, and Subject Matter Expertise.

18
Data Within Big Data
• Accumulates within enterprise via
applications, sensors, and external sources.
• Can be processed by a Big Data solution.
• Can be used by enterprise applications directly.
• Can be fed into a data warehouse to enrich the
existing data.

19
Insights and Benefits
• Operational optimization
• Actionable intelligence
• Identification of new markets
• Accurate predictions
• Fault and fraud detection
• More detailed records
• Improved decision-making
• Scientific discoveries

20
Concepts and Terminology
• Datasets
• Data Analysis
• Data Analytics
• Business Intelligence (BI)
• Key Performance Indicators (KPI)

21
Datasets

Figure 1.1 Datasets can be found in many different formats.

• Collections or groups of related data.
• Each member of a group shares the same set of attributes.
• Examples:
• Tweets stored in a flat file.
• Rows from a database table stored in a CSV-formatted file.
• Historical weather observations stored in XML files.
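
To make the idea of shared attributes concrete, here is a minimal sketch that loads the same two weather observations from a CSV string and from an XML string. The sample data, the attribute names (`station`, `date`, `temp_c`) and the helper functions are assumptions made for illustration; they do not come from the slides.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data: the same two observations in two formats.
CSV_TEXT = "station,date,temp_c\nA1,2024-01-01,3.5\nA1,2024-01-02,4.0\n"
XML_TEXT = """<observations>
  <observation station="A1" date="2024-01-01" temp_c="3.5"/>
  <observation station="A1" date="2024-01-02" temp_c="4.0"/>
</observations>"""


def load_csv(text):
    """Return one dict per CSV row, keyed by the header names."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def load_xml(text):
    """Return one dict per <observation> element, keyed by attribute names."""
    root = ET.fromstring(text)
    return [dict(element.attrib) for element in root.iter("observation")]


csv_rows = load_csv(CSV_TEXT)
xml_rows = load_xml(XML_TEXT)

# Every member of the dataset exposes the same attributes, whatever the format.
print(sorted(csv_rows[0]))
print(csv_rows == xml_rows)
```

Both loaders yield records with the same attribute set, which is what makes the group a dataset regardless of how it is stored.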
22
Data Analysis
• The process of examining data to find
facts, relationships, patterns, insights,
and trends.
• Example:
  • Analyzing how the number of ice cream cones sold relates to the
    daily temperature.
  • The results would support decisions on how much ice cream a store
    should order given weather forecast information.

Figure 1.2 The symbol used to represent data analysis.
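
As a rough illustration of the ice cream example, the sketch below computes the Pearson correlation between daily temperature and cones sold. The observations are invented for the example, and Pearson correlation is only one of several ways such a relationship could be examined.

```python
from math import sqrt

# Hypothetical daily observations: (temperature in deg C, cones sold).
observations = [(18, 120), (21, 150), (24, 185), (27, 230), (30, 260)]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    spread_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    spread_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return covariance / sqrt(spread_x * spread_y)


r = pearson(observations)
print(f"correlation between temperature and cones sold: {r:.3f}")
```

A value close to 1 would suggest that warmer days go with higher sales in this sample, which is the kind of relationship a store could use when ordering stock against a forecast.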

23
Data Analytics
• A broader term that encompasses data
analysis.
• Includes the management of the
complete data life cycle:
  • Collecting
  • Cleansing
  • Organizing
  • Storing
  • Analyzing
  • Governing
• Also covers the development of analysis methods,
scientific techniques, and automated
tools.

Figure 1.3 The symbol used to represent data analytics.

24
Data Analytics
• Uses highly scalable distributed
technologies and frameworks that are
capable of analyzing large volumes of
data from different sources.
• Involves identifying, procuring, preparing, and
analyzing large amounts of raw,
unstructured data to extract
meaningful information for:
  • Identifying patterns
  • Enriching existing enterprise data
  • Performing large-scale searches

Figure 1.3 The symbol used to represent data analytics.

25
4 Categories of Analytics

Figure 1.4 Value and complexity increase from descriptive to prescriptive analytics.

26
Descriptive Analytics
• To answer questions about events that have
already occurred.
• An estimated 80% of analytics results are descriptive in nature.
• Often carried out via ad-hoc reporting and
dashboards.
• Examples:
• What was the sales volume over the past 12 months?
• How many support calls were received, categorized by
severity and geographic location?
• What is the monthly commission earned by each
sales agent?
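
A descriptive question such as the support-call example can often be answered with a simple count over existing records. The sketch below assumes a hypothetical in-memory call log of (severity, region) pairs; a real report would query an operational system instead.

```python
from collections import Counter

# Hypothetical support-call log: one (severity, region) pair per call.
calls = [
    ("high", "East"), ("low", "East"), ("high", "West"),
    ("low", "East"), ("medium", "West"), ("high", "East"),
]

# Count calls per (severity, region) combination.
calls_by_group = Counter(calls)

for (severity, region), count in sorted(calls_by_group.items()):
    print(f"{region:5} {severity:7} {count}")
```

The output describes what has already happened; it says nothing about why, which is where diagnostic analytics comes in.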

27
Descriptive Analytics

Figure 1.5 The operational systems, pictured left, are queried via descriptive analytics tools to generate reports or dashboards, pictured right.

28
Diagnostic Analytics
• To determine the cause of a phenomenon that
occurred in the past.
• Focus on the reason behind the event.
• Identify information related to a phenomenon.
• Examples:
• Why were Q2 sales less than Q1 sales?
• Why were there more support calls from the Eastern region
than from the Western region?
• Why was there an increase in patient re-admission rates
over the past three months?

29
Diagnostic Analytics
• Provides more value than descriptive analytics.
• Requires a more advanced skillset.
• Requires collecting data from multiple sources.
• Requires storing data in a structure that facilitates
drill-down and roll-up analysis.
• Results are viewed via interactive visualization tools that
enable users to identify trends and patterns.
• Identifies what information is related to the phenomenon.
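
Roll-up and drill-down can be sketched as the same aggregation run at different levels of detail. The sales facts, the column order and the `aggregate` helper below are hypothetical; the example mirrors the question of why Q2 sales were lower than Q1 sales.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, city, quarter, amount).
sales = [
    ("East", "Boston", "Q1", 120), ("East", "Boston", "Q2", 90),
    ("East", "Albany", "Q1", 80), ("East", "Albany", "Q2", 40),
    ("West", "Denver", "Q1", 100), ("West", "Denver", "Q2", 110),
]


def aggregate(rows, key):
    """Sum the amount column, grouped by whatever `key` extracts from a row."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Roll-up: total sales per quarter across all regions.
by_quarter = aggregate(sales, key=lambda r: r[2])

# Drill-down: split each quarter by region to see where the change occurred.
by_quarter_region = aggregate(sales, key=lambda r: (r[2], r[0]))

# Drill down further: the East region split by city.
east_by_city = aggregate(
    [r for r in sales if r[0] == "East"], key=lambda r: (r[2], r[1])
)

print(by_quarter)
print(by_quarter_region)
print(east_by_city)
```

In this made-up data the quarterly totals show the drop, and each drill-down step narrows down where it came from.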

30
Diagnostic Analytics

Figure 1.6 Diagnostic analytics can result in data that is suitable for performing drill-down and roll-up analysis.

31
Predictive Analytics
• To determine the outcome of an event that might
occur in the future.
• Enhance information with meaning to generate
knowledge (how information is related).
• Generate future predictions based on past events.
• Examples:
• What are the chances that a customer will default on a loan
if they have missed a monthly payment?
• What will the patient survival rate be if Drug B is administered
instead of Drug A?
• If a customer has purchased Products A and B, what are the
chances that they will also purchase Product C?
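
The product example can be approximated with a plain frequency estimate over past purchases. This is a deliberately simple stand-in for a predictive model, and the purchase histories and the `estimate_probability` helper are assumptions made for illustration.

```python
# Hypothetical purchase histories: one set of products per customer.
histories = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "B", "C"},
    {"A", "C"}, {"B"}, {"A", "B", "C"},
]


def estimate_probability(histories, given, target):
    """Share of customers who bought `target` among those who bought all of `given`.

    Returns None when no customer matches the condition.
    """
    matching = [h for h in histories if given <= h]
    if not matching:
        return None
    return sum(target in h for h in matching) / len(matching)


p_c = estimate_probability(histories, given={"A", "B"}, target="C")
print(f"estimated chance of buying C after A and B: {p_c:.0%}")
```

With so few records the estimate is very rough; the large internal and external datasets that predictive analytics relies on are what make such estimates usable.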

32
Predictive Analytics
• Predicts the outcomes of events based on patterns,
trends, and exceptions in historical and current data.
• Enables the identification of both risks and opportunities.
• Involves the use of large datasets consisting of
internal and external data, together with various data analysis
techniques.
• Has greater value and requires a more advanced
skillset than both descriptive and diagnostic
analytics.

33
Predictive Analytics

Figure 1.7 Predictive analytics tools can provide user-friendly front-end interfaces.

34
Prescriptive Analytics
• Builds upon the results of predictive analytics.
• Prescribes actions that should be taken.
• Focuses not only on which option is best to
follow, but also on why.
• Results can be reasoned about because they
embed elements of situational understanding.
• Can be used to gain an advantage or mitigate a risk.
• Examples:
• Among three drugs, which one provides the best
results?
• When is the best time to trade a particular stock?

35
Prescriptive Analytics
• Has more value than any other type of analytics.
• Requires the most advanced skillset, as well as
specialized software and tools.
• Calculates various outcomes.
• Suggests the best course of action for each
outcome.
• Incorporates internal data with external data.
• Internal Data: business rules, historical data,
customer information, product data.
• External Data: social media, weather forecasts,
government-produced demographic data.
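
One way to picture the step from prediction to prescription is to combine predicted outcomes with a business rule and report both the chosen option and the reason. The success rates, costs, cost ceiling and `prescribe` function below are all hypothetical; this is a sketch of the idea, not a description of any specific tool.

```python
# Hypothetical predicted outcomes for three options (e.g. from a predictive model).
options = [
    {"name": "Drug A", "success_rate": 0.72, "cost": 40},
    {"name": "Drug B", "success_rate": 0.81, "cost": 95},
    {"name": "Drug C", "success_rate": 0.78, "cost": 55},
]

MAX_COST = 60  # hypothetical business rule: cost ceiling per treatment


def prescribe(options, max_cost):
    """Pick the allowed option with the best predicted outcome and explain why."""
    allowed = [o for o in options if o["cost"] <= max_cost]
    rejected = [o["name"] for o in options if o["cost"] > max_cost]
    if not allowed:
        return None, "no option satisfies the cost rule"
    best = max(allowed, key=lambda o: o["success_rate"])
    reason = (
        f"{best['name']} has the highest predicted success rate "
        f"({best['success_rate']:.0%}) among options costing at most {max_cost}"
    )
    if rejected:
        reason += f"; excluded by the cost rule: {', '.join(rejected)}"
    return best["name"], reason


choice, reason = prescribe(options, MAX_COST)
print(choice)
print(reason)
```

Returning the reason alongside the choice reflects the point that prescriptive results should explain why an option is recommended, not only which one.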
36
Prescriptive Analytics

Figure 1.8 Prescriptive analytics involves the use of business rules and internal and/or external data to perform an in-depth analysis.

37
Business Intelligence (BI)
• Enables an organization to gain insight into the
performance of an enterprise by analyzing data
generated by its business processes and information
systems.
• The results can be used to correct detected issues or enhance
organizational performance.
• Applies analytics to large amounts of data across the
enterprise, typically consolidated in a data warehouse.

38
Business Intelligence

Figure 1.9 BI can be used to improve business applications, consolidate data in data warehouses and analyze queries via a dashboard.

39
Key Performance Indicators (KPI)
• A metric to gauge success within a particular business context.
• Linked to the enterprise’s overall strategic goals and objectives.
• Used to identify business performance problems.
• Used to demonstrate regulatory compliance.
• Acts as a quantifiable reference point for measuring a specific aspect of
a business’s overall performance.
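
A KPI can be thought of as a measured value compared against a target. The sketch below assumes a hypothetical `Kpi` record with a flag for whether higher or lower values are better; the metric names and numbers are invented.

```python
from dataclasses import dataclass


@dataclass
class Kpi:
    """A hypothetical KPI: a measured value compared against a target."""
    name: str
    actual: float
    target: float
    higher_is_better: bool = True

    def on_target(self) -> bool:
        if self.higher_is_better:
            return self.actual >= self.target
        return self.actual <= self.target


kpis = [
    Kpi("Monthly sales (units)", actual=1180, target=1000),
    Kpi("Avg. support resolution time (hours)", actual=30, target=24,
        higher_is_better=False),
]

# A dashboard would typically render this status per KPI.
status = {kpi.name: kpi.on_target() for kpi in kpis}
for name, ok in status.items():
    print(f"{name}: {'on target' if ok else 'off target'}")
```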

40
Key Performance Indicators (KPI)

Figure 1.10 A KPI dashboard acts as a central reference point for gauging business performance.

41
Big Data Characteristics

Figure 1.11 The Five Vs of Big Data.

42
Volume
• The volume of data is substantial and ever growing.
• High data volumes impose distinct data storage and processing demands,
as well as additional data preparation, curation, and management processes.
• Data sources:
  • Online transactions, such as point-of-sale (POS) and banking.
  • Scientific research experiments.
  • Sensors (GPS, RFID, smart meters, and telematics).
  • Social media (such as Facebook and Twitter).

43
Volume

Figure 1.12 Organizations and users world-wide create over 2.5 EBs of data a day. As a point of comparison, the Library of Congress currently holds more than 300 TBs of data.
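
To put the two figures in the caption on the same scale, the short calculation below converts exabytes to terabytes and divides. It takes the caption's numbers at face value and assumes decimal (SI) units, where 1 EB = 1,000,000 TB.

```python
# Assumption: decimal (SI) units, so 1 EB = 10**6 TB.
TB_PER_EB = 1_000_000

daily_volume_tb = 2.5 * TB_PER_EB   # 2.5 EB created per day, as stated in the caption
library_holdings_tb = 300           # Library of Congress holdings, as stated in the caption

ratio = daily_volume_tb / library_holdings_tb
print(f"daily volume is about {ratio:,.0f} times the quoted library holdings")
```

Under these assumptions, a single day of data creation is several thousand times the quoted holdings.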

44
Velocity
• Data can arrive at fast speeds.
• Enormous datasets can accumulate within a very
short time.
• Demands highly elastic and available data processing
solutions and data storage capability.
• Depending on the data source, velocity may not
always be high. Example: MRI scan images are generated far less
frequently than Internet traffic log entries.

45
Velocity

Figure 1.13 Examples of high-velocity Big Data datasets produced every minute include tweets, video, emails and GBs generated from a jet engine.

46
Variety
• Multiple formats and types of data that need to be
supported by Big Data solutions.
• Brings challenges for enterprises in terms of data
integration, transformation, processing, and storage.

Figure 1.14 Examples of high-variety Big Data datasets include structured, textual, image, video, audio, XML, JSON, sensor data and metadata.

47
Veracity
• Refers to the quality or fidelity of data.
• Leads to data processing activities to resolve invalid
data and remove noise.
• Data can be part of signal or noise.
• Noise is data that cannot be converted into
information and thus has no value.
• Data with a high signal-to-noise ratio has more
veracity.
• Data that is acquired in a controlled manner usually
contains less noise than data acquired from uncontrolled sources
(for example, online customer registrations versus blog postings).
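
The sketch below separates a small batch of records into signal and noise using two validation rules. Treating "fails validation" as "noise" is a simplification made for the example, and the records, the e-mail pattern and the age bounds are all hypothetical.

```python
import re

# Hypothetical raw registration records.
records = [
    {"email": "ana@example.com", "age": "34"},
    {"email": "not-an-email", "age": "29"},
    {"email": "li@example.org", "age": "-5"},
    {"email": "sam@example.net", "age": "41"},
]

# Deliberately loose e-mail pattern, good enough for the illustration only.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid(record):
    """A record counts as signal if its e-mail and age both look plausible."""
    age = record["age"]
    return bool(EMAIL.match(record["email"])) and age.isdigit() and 0 < int(age) < 120


signal = [r for r in records if is_valid(r)]
noise = [r for r in records if not is_valid(r)]

# Signal-to-noise ratio; undefined (None) when there is no noise at all.
snr = len(signal) / len(noise) if noise else None
print(len(signal), len(noise), snr)
```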

48
Value
• The usefulness of data for an enterprise.
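• Related to the veracity characteristic: the higher the data fidelity, the more value it holds.
• Depends on how long data processing takes, because analytics results
have a shelf-life.
• Value and time are inversely related: the longer it takes to turn data into meaningful information, the less value it has.
• Stale results inhibit the quality and speed of informed decision-making.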

49
Value, Veracity and Time

Figure 1.15 Data that has high veracity and can be analyzed quickly has
more value to a business.

50
Value Lifecycle-related Concerns
• How well has the data been stored?
• Were valuable attributes of the data removed during
data cleansing?
• Are the right types of questions being asked during
data analysis?
• Are the results of the analysis being accurately
communicated to the appropriate decision-makers?

51
Data Sources
• Human-generated: the result of human interaction
with systems:
• Online services
• Digital devices

• Machine-generated: the result of software programs
and hardware devices responding to real-world
events.
• A log file recording authorization decisions made by a security service.
• A transaction against inventory generated by a point-of-sale (POS)
system.
• Readings from the numerous sensors in a cellphone (position, signal strength,
etc.).

52
Human-generated Data

Figure 1.16 Examples of human-generated data include social media, blog posts, emails, photo sharing and messaging.

53
Machine-generated Data

Figure 1.17 Examples of machine-generated data include web logs, sensor data, telemetry data, smart meter data and appliance usage data.

54
Data Types
• Structured
• Unstructured
• Semi-structured
• Metadata

55
Structured Data
• Conforms to a data model or data schema.
• Often stored in tabular form.
• Often stored in a relational database.
• Frequently generated by enterprise applications and information systems such as ERP
and CRM systems.
• Rarely requires special consideration in processing or storage.
• Examples:
• Banking transactions
• Invoices
• Customer records

Figure 1.18 The symbol used to represent structured data stored in a tabular form.

56
Unstructured Data
• Does not conform to a data model or data schema.
• Estimated to make up 80% of the data within any given enterprise.
• Has a faster growth rate than structured data.
• Either textual or binary (image, audio, video data).
• Non-relational.

Figure 1.19 Video, image and audio files are all types
of unstructured data.
57
Unstructured Data
• Special-purpose logic is usually required to process and store it (for example,
the correct codec is needed to play a video file).
• Cannot be directly processed or queried using SQL.
• If stored within a relational database, it is stored in a table as a Binary
Large Object (BLOB).
• A Not-only SQL (NoSQL) database is a non-relational database that can
be used to store unstructured data alongside structured data.
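
The BLOB point can be illustrated with Python's built-in `sqlite3` module. The table name, column names and the placeholder bytes below are assumptions for the sketch; the bytes stand in for a real image file.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media (id INTEGER PRIMARY KEY, name TEXT, content BLOB)"
)

# Placeholder bytes standing in for a real image file (hypothetical content).
fake_image = bytes([0x89, 0x50, 0x4E, 0x47]) + b"\x00" * 12

conn.execute(
    "INSERT INTO media (name, content) VALUES (?, ?)", ("logo.png", fake_image)
)
conn.commit()

# SQL can see the structured columns and the size of the BLOB ...
name, size = conn.execute("SELECT name, length(content) FROM media").fetchone()

# ... but the content itself comes back as opaque bytes for other code to decode.
stored = conn.execute(
    "SELECT content FROM media WHERE name = ?", ("logo.png",)
).fetchone()[0]

print(name, size, stored == fake_image)
conn.close()
```

The query can filter on the name or report the byte length, but interpreting the content still needs format-specific logic outside SQL.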

58
Semi-structured Data
• Has a defined level of structure and consistency, but is not relational in
nature.
• Commonly hierarchical or graph-based.
• More easily processed than unstructured data.
• Requires special pre-processing and storage, especially if the underlying
format is not text-based.
• Examples:
• EDI files
• Spreadsheets
• RSS feeds
• Sensor data

Figure 1.20 XML, JSON and sensor data are semi-structured.

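
A short JSON document shows what "some structure, but not relational" looks like in practice: the records nest, and not every record carries the same fields. The sensor feed below is invented for the example.

```python
import json

# Hypothetical sensor feed: nested, and only one reading has a "battery" field.
FEED = """
{"sensor": "t-17", "readings": [
  {"ts": "2024-01-01T00:00:00Z", "temp_c": 21.5},
  {"ts": "2024-01-01T00:01:00Z", "temp_c": 21.7, "battery": {"level": 0.82}}
]}
"""

doc = json.loads(FEED)

# Fields present in every reading can be read directly.
temps = [reading["temp_c"] for reading in doc["readings"]]

# Optional, nested fields have to be checked for before use.
battery_levels = [
    reading["battery"]["level"]
    for reading in doc["readings"]
    if "battery" in reading
]

print(doc["sensor"], temps, battery_levels)
```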
59
Metadata
• Provides information about a dataset’s
characteristics and structure.
• Mostly machine-generated.
• Can be appended to data.
• Tracking of metadata is crucial to Big
Data processing, storage, and analysis.
• Examples:
• XML tags about author and creation date.
• Attributes stating the file size and image
resolution of a photo.

Figure 1.21 The symbol used to represent metadata.
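
File-system attributes are a familiar example of machine-generated metadata. The sketch below writes a throwaway file and reads back its size and modification time; the file name and contents are placeholders, and image resolution is left out because reading it would need an image library.

```python
import datetime
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder file standing in for a photo (hypothetical content).
    path = os.path.join(tmp, "photo.bin")
    with open(path, "wb") as handle:
        handle.write(b"\x00" * 2048)

    # The operating system records this metadata automatically.
    info = os.stat(path)
    metadata = {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": datetime.datetime.fromtimestamp(
            info.st_mtime, tz=datetime.timezone.utc
        ).isoformat(),
    }

print(metadata)
```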

60
