
Introduction to Big Data

Characteristics of Data:
1. Composition: The composition of data deals with the structure of data:
• the sources of data
• the granularity
• the types
• the nature of the data, i.e., whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data:
• Can this data be used as is for analysis?
• Does it require cleansing for further enhancement and enrichment?
3. Context: The context of data deals with questions such as:
• Where was this data generated?
• Why was this data generated?
• How sensitive is this data?
• What events are associated with this data?
Small data (data as it existed prior to the big data revolution) is about certainty:
• fairly well-known data sources
• no major changes to the composition or context of the data.
• Most often we have answers to questions such as:
• why this data was generated
• where and when it was generated
• exactly how we would like to use it
• what questions this data will be able to answer.
Big data is about complexity: complexity in terms of
• multiple and unknown datasets
• exploding volume
• the speed at which the data is being generated
• the speed at which it needs to be processed
• the variety of data (internal or external, behavioral or social) that is being generated.

Evolution of Big Data

What is Big Data?


• Terabytes, petabytes, or zettabytes of data.
• It's about the 3Vs.
• Today's "big" may be tomorrow's "normal."
• Anything beyond the available human and technical infrastructure for storage, processing, and analysis.
Definition of Big Data
Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
The 3Vs concept was proposed by Gartner analyst Doug Laney in a 2001 META Group research publication titled 3D Data Management: Controlling Data Volume, Velocity and Variety.

"big data is high-volume, high-velocity, and high-variety information assets"


• talks about voluminous data (humongous data) that may have great variety (a good mix
of structured, semi-structured, and unstructured data) and will require a good
speed/pace for storage, preparation, processing, and analysis.

"cost effective, innovative forms of information processing"


• talks about embracing new techniques and technologies to capture (ingest), store,
process, persist, integrate, and visualize the high-volume, high-velocity, and high-variety
data.

"enhanced insight and decision making"


• talks about deriving deeper, richer, and meaningful insights and then using these
insights to make faster and better decisions to gain business value and thus a
competitive edge.

Data → Information → Actionable intelligence → Better decisions → Enhanced business value.

WHAT IS BIG DATA?


Big data is data that is big in volume, velocity, and variety.

Volume:
• We have seen it grow from bits to bytes to petabytes and exabytes.

• Bits → Bytes → Kilobytes → Megabytes → Gigabytes → Terabytes → Petabytes → Exabytes → Zettabytes → Yottabytes
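The unit chain above scales by a factor of 1024 at each step (binary convention; the decimal SI convention uses 1000). A tiny illustrative sketch:

```python
# Illustrative sketch of data-volume units: each step up the chain is a
# factor of 1024 (binary convention).
UNITS = ["Bytes", "Kilobytes", "Megabytes", "Gigabytes", "Terabytes",
         "Petabytes", "Exabytes", "Zettabytes", "Yottabytes"]

def bytes_in(unit):
    """Return how many bytes one of `unit` holds."""
    return 1024 ** UNITS.index(unit)

print(bytes_in("Kilobytes"))  # 1024
print(bytes_in("Petabytes"))  # 1125899906842624
```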

Where Does This Data Get Generated?

There are a multitude of sources for big data, and much of it is unstructured:
• an XLS, a DOC, a PDF, and similar files
• a video on YouTube
• a chat conversation on an Internet messenger
• a customer feedback form on an online retail website
• CCTV coverage, a weather forecast report, and so on.
1. Typical internal data sources: Data present within an organization's firewall.

• Data storage: File systems, SQL (RDBMSs: Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
• Archives: Archives of scanned documents, paper archives, customer
correspondence records, patients' health records, students' admission records,
students' assessment records, and so on.
2. External data sources: Data residing outside an organization's firewall.
• Public Web: Wikipedia, weather, regulatory, compliance, census, etc.
3. Both (internal + external data sources)
• Sensor data: Car sensors, smart electric meters, office buildings, air conditioning
units, refrigerators, and so on.
• Machine log data: Event logs, application logs, business process logs, audit logs, clickstream data.
• Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.

• Business apps: ERP, CRM, HR, Google Docs, and so on.


• Media: Audio, Video, Image, Podcast, etc.
• Docs: Comma-separated values (CSV), Word documents, PDF, XLS, PPT, and so on.

Velocity:
• We have moved from the days of batch processing to real-time processing.
• Batch → Periodic → Near real time → Real-time processing
Variety:
Variety deals with a wide range of data types and sources of data.

1. Structured data: From traditional transaction processing systems, RDBMSs, etc.
2. Semi-structured data: For example, Hyper Text Markup Language (HTML) and eXtensible Markup Language (XML).
3. Unstructured data: For example, unstructured text documents, audio, video, emails, photos, PDFs, social media, etc.
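The three categories of variety can be illustrated with a small sketch; all names and values here are made up:

```python
import json

# The same customer fact in the three forms of variety discussed above.
# All identifiers and values are illustrative.

# 1. Structured: fixed schema, rows and columns.
structured_row = ("C101", "Asha", 2999.00)   # (customer_id, name, amount)

# 2. Semi-structured: self-describing tags; the schema travels with the data.
semi_structured = json.loads(
    '{"customer_id": "C101", "name": "Asha", "amount": 2999.0}'
)

# 3. Unstructured: free text; a program cannot read fields out directly.
unstructured = "Asha called to say her order of Rs.2999 arrived damaged."

print(semi_structured["name"])  # Asha
```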

OTHER CHARACTERISTICS OF DATA WHICH ARE NOT DEFINITIONAL TRAITS OF BIG DATA
1. Veracity and validity:
• Veracity refers to biases, noise, and abnormality in data.
• The key question here is: "Is all the data that is being stored, mined, and analyzed
meaningful and pertinent to the problem under consideration?"
• Validity refers to the accuracy and correctness of the data.
• Any data picked up for analysis needs to be accurate.
• This holds for all data, not just big data.
2. Volatility:

• Volatility deals with how long the data remains valid,
• and how long it should be stored.
3. Variability:
• Data flows can be highly inconsistent, with periodic peaks.

Types of Data
Digital data is classified into the following categories:

• Structured data
• Semi-structured data
• Unstructured data
(Figure: Approximate percentage distribution of digital data.)
Structured Data
• This is the data which is in an organized form (e.g., in rows and columns) and can be
easily used by a computer program.
• Relationships exist between entities of data, such as classes and their objects.
• Data stored in databases is an example of structured data.

Sources of Structured Data:
• Databases such as Oracle, DB2, Teradata, MySQL, PostgreSQL, etc.
• Spreadsheets
• OLTP systems

Ease with Structured Data:
• Input / Update / Delete
• Security
• Indexing / Searching
• Scalability
• Transaction processing
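The "ease with structured data" points can be demonstrated with SQLite from Python's standard library; the table and data below are hypothetical:

```python
import sqlite3

# Minimal sketch of why structured data is "easy": SQL gives insert,
# update, delete, indexing/searching, and transactions out of the box.
# Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE students (roll_no INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE INDEX idx_name ON students(name)")    # indexing / searching
cur.execute("INSERT INTO students VALUES (1, 'Ravi')")    # input
cur.execute("UPDATE students SET name = 'Ravi K' WHERE roll_no = 1")  # update
conn.commit()                                             # transaction processing
row = cur.execute("SELECT name FROM students WHERE roll_no = 1").fetchone()
print(row[0])  # Ravi K
```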

Semi-structured Data
• This is the data which does not conform to a data model but has some structure.
• However, it is not in a form which can be used easily by a computer program.

• Example: emails, XML, markup languages like HTML, etc.


• Metadata for this data is available but is not sufficient.
Sources of Semi-structured Data:
• XML (eXtensible Markup Language)
• Other markup languages
• JSON (JavaScript Object Notation)

Characteristics of Semi-structured Data:
• Inconsistent structure
• Self-describing (label/value pairs)
• Schema information is often blended with data values
• Data objects may have different attributes not known beforehand
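A small JSON sketch of these characteristics (field names are made up): each record is self-describing, yet the attributes differ per object and are only discovered by inspecting the data itself.

```python
import json

# Two self-describing (label/value) records whose "schemas" differ and
# are not known beforehand. All field names are illustrative.
records = [
    '{"name": "Meera", "email": "meera@example.com"}',
    '{"name": "John", "phone": "555-0101", "city": "Pune"}',
]
parsed = [json.loads(r) for r in records]

# The set of labels differs per object: schema information is blended
# with the data values and must be discovered from the data.
schemas = [sorted(p.keys()) for p in parsed]
print(schemas[0])                                       # ['email', 'name']
print([k for k in schemas[1] if k not in schemas[0]])   # ['city', 'phone']
```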

Unstructured Data
• This is the data which does not conform to a data model or is not in a form which can be
used easily by a computer program.
• About 80–90% of an organization's data is in this format.
• Examples: memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.
Sources of Unstructured Data:
• Web pages
• Images
• Free-form text
• Audio
• Video
• Body of email
• Text messages
• Chats
• Social media data
• Word documents

Issues with Terminology – Unstructured Data:
• Structure can be implied despite not being formally defined.
• Data with some structure may still be labeled unstructured if the structure does not help with the processing task at hand.
• Data may have some structure or may even be highly structured in ways that are unanticipated or unannounced.

Dealing with Unstructured Data:
• Data mining
• Natural Language Processing (NLP)
• Text analytics
• Noisy text analytics
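As a minimal taste of text analytics, here is a word-frequency count over one free-text snippet (the feedback line is invented); real NLP and noisy-text pipelines add tokenization, stemming, spell correction, and much more:

```python
import re
from collections import Counter

# Turning unstructured free text into something countable.
# The feedback string is illustrative.
feedback = "Great phone. Battery life is great, camera is average."

tokens = re.findall(r"[a-z]+", feedback.lower())   # crude tokenization
counts = Counter(tokens)

print(counts["great"])   # 2
print(counts["camera"])  # 1
```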

Big data applications

Big Data and the New School of Marketing


• New School marketers deliver what today's consumers want: relevant, interactive communication across the digital power channels (email, mobile, social, display, and the web).

The Right Approach: Cross-Channel Lifecycle Marketing


• Cross-Channel Lifecycle Marketing really starts with the capture of customer
permission, contact information, and preferences for multiple channels.
• It also requires marketers to have the right integrated marketing and customer
information systems, so that
(1) they can have complete understanding of customers through stated preferences and
observed behavior at any given time; and
(2) they can automate and optimize their programs and processes throughout the customer
lifecycle.
Social and Affiliate Marketing
• Word-of-mouth marketing has been the most powerful form of marketing.
• Above and beyond the removal of barriers the social web brings to affiliate marketing, it
also brings into the mix the same concepts—product recommendations from a friend
network.
• As many detailed studies have shown, most people trust a recommendation from the
people they know.
• Using the backbone and publication tools created by companies like Facebook and
Twitter, brands will soon find that rewarding their own consumers for their advocacy is
a required piece of their overall digital marketing mix.
• The tools are available to them all and the scale is exponentially larger than ever before.
• Anyone can recommend a product through the click of a mouse. No more parties
needed.
Empowering Marketing with Social Intelligence
• As a result of the growing popularity and use of social media around the world and
across nearly every demographic, the amount of user-generated content - or “big data”
- created is immense, and continues growing exponentially.
• Millions of status updates, blog posts, photographs, and videos are shared every second.
• Successful organizations will need not only to identify the information relevant to their company and products, but also to:
- dissect it and make sense of it,
- respond to it in real time and on a continuous basis,
- draw business intelligence, or insights, that help predict likely future customer behavior.


• Social media is the world's largest and purest focus group.
• Marketers now have the opportunity to mine social conversations for purchase intent
and brand lift through Big Data.

• So, marketers can communicate with consumers when they are emotionally engaged,
regardless of the channel.
Fraud and Big Data
• Fraud is intentional deception made for personal gain or to damage another individual.

• One of the most common forms of fraudulent activity is credit card fraud.
• Even though fraud detection is improving, the rate of incidents is rising.
• This means banks need more proactive approaches to prevent fraud.
• Social media and mobile phones are forming the new frontiers for fraud.
• In order to prevent fraud, credit card transactions are monitored and checked in near real time.
• If the checks identify pattern inconsistencies and suspicious activity, the transaction is
identified for review and escalation.
• The Capgemini Financial Services team believes that due to the nature of data streams
and processing required, Big Data technologies provide an optimal technology solution
based on the following three Vs:
1. High volume. Years of customer records and transactions (150 billion records per year)
2. High velocity. Dynamic transactions and social media information

3. High variety. Social media plus other unstructured data such as customer emails, call center
conversations, as well as transactional structured data
• Capgemini's new fraud Big Data initiative focuses on flagging suspicious credit card transactions to prevent fraud in near real time via multi-attribute monitoring.

• Real-time inputs involving transaction data and customers records are monitored via
validity checks and detection rules.
• Pattern recognition is performed against the data to score and weight individual
transactions across each of the rules and scoring dimensions.
• A cumulative score is then calculated for each transaction record and compared against
thresholds to decide if the transaction is potentially suspicious or not.
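The rule-plus-weight scoring described above can be sketched as follows. The rules, weights, and threshold are illustrative inventions, not Capgemini's actual detection logic:

```python
# Hedged sketch of multi-attribute transaction scoring: each detection
# rule contributes a weighted score; the cumulative score is compared
# against a threshold. All rules, weights, and values are made up.
RULES = [
    ("amount_above_usual", lambda t: t["amount"] > 10 * t["avg_amount"], 5),
    ("foreign_location",   lambda t: t["country"] != t["home_country"],  3),
    ("rapid_repeat",       lambda t: t["seconds_since_last"] < 60,       4),
]
THRESHOLD = 6

def score_transaction(txn):
    """Return (cumulative_score, suspicious?) for one transaction record."""
    score = sum(weight for _, rule, weight in RULES if rule(txn))
    return score, score >= THRESHOLD

txn = {"amount": 95000, "avg_amount": 3000, "country": "FR",
       "home_country": "IN", "seconds_since_last": 4000}
score, flagged = score_transaction(txn)
print(score, flagged)  # 8 True -> escalated for review
```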

• Elasticsearch: a distributed, free/open-source search server.

• Using this tool, large historical data sets can be used in conjunction with real-
time data to identify deviations from typical payment patterns.
• This Big Data component allows overall historical patterns to be compared and
contrasted, and allows the number of attributes and characteristics about
consumer behavior to be very wide, with little impact on overall performance.
• Percolator is a system for incrementally processing updates to large data sets.
• Percolator query can handle both structured and unstructured data.
• This provides scalability to the event processing framework, and allows specific
suspicious transactions to be enriched with additional unstructured information -
phone location/geospatial records, customer travel schedules, and so on.

Social Network Analysis (SNA)


• SNA is the precise analysis of social networks.
• Social network analysis views social relationships as networks of nodes (individuals) and the ties (relationships) between them, and draws inferences from that structure.
• SNA could reveal all individuals involved in fraudulent activity, from perpetrators to
their associates, and understand their relationships and behaviors to identify a bust out
fraud case.
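A minimal SNA sketch (made-up people and links): starting from a known perpetrator, a breadth-first walk over relationship links surfaces the associates in the ring.

```python
from collections import deque

# Illustrative fraud-ring graph: edges represent relationships such as a
# shared address, phone, or device. All names and links are invented.
links = {
    "perp_A": {"assoc_B", "assoc_C"},
    "assoc_B": {"perp_A", "assoc_D"},
    "assoc_C": {"perp_A"},
    "assoc_D": {"assoc_B"},
    "unrelated_E": set(),
}

def ring_members(start):
    """Breadth-first search: everyone reachable from a known fraudster."""
    seen, queue = {start}, deque([start])
    while queue:
        person = queue.popleft()
        for neighbor in links.get(person, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

print(sorted(ring_members("perp_A")))
# ['assoc_B', 'assoc_C', 'assoc_D', 'perp_A']
```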

Risk and Big Data


• Many of the world’s top analytics professionals work in risk management.
• The two most common types of risk management are credit risk management and
market risk management.
• A third type of risk, operational risk management, isn’t as common as credit and market
risk.
• Credit risk analytics focus on past credit behaviors to predict the likelihood that a borrower will default on any type of debt by failing to make the payments they are obligated to make.

• For example, “Is this person likely to default on their Rs.300,000 mortgage?”
• Market risk analytics focus on understanding the likelihood that the value of a portfolio
will decrease due to the change in stock prices, interest rates, foreign exchange rates,
and commodity prices.
• For example, “Should we sell this holding if the price drops another 10 percent?”
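The market-risk question above ("should we sell if the price drops another 10 percent?") can be sketched as a simple stop-loss check; the threshold and prices are illustrative:

```python
# Hedged sketch of a stop-loss review rule: flag a holding for a sell
# review once its price has fallen past a drop threshold. The 10%
# threshold and the prices below are made up for illustration.
STOP_LOSS = 0.10  # review triggered at a 10% drop from purchase price

def should_review(purchase_price, current_price):
    """True if the holding has lost STOP_LOSS or more of its value."""
    drop = (purchase_price - current_price) / purchase_price
    return drop >= STOP_LOSS

print(should_review(200.0, 178.0))  # True  (11% drop)
print(should_review(200.0, 195.0))  # False (2.5% drop)
```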
