
CP 422 Programming for Big Data - Introduction

Jiashu (Jessie) Zhao


At the Beginning of this course
• 9:30am-10:20am, MWF
• Evaluation:
• 3 Assignments: 30%
• Midterm: 30%
• Final: 40%
• Tools can be installed on your personal computer
or run on an Amazon AWS server (potential cost is less
than 50 dollars)
• No Required Textbook
• References
• Data analytics with Hadoop: An introduction for Data Scientists. By
Bengfort, Benjamin, and Jenny Kim. (ISBN-13 : 978-1491913703)
• Hadoop: The Definitive Guide: Storage and Analysis at Internet
Scale. By Tom White. (ISBN-13 : 978-1491901632)
• MapReduce Design Patterns: Building Effective Algorithms and
Analytics for Hadoop and Other Systems. By Donald Miner, and
Adam Shook. (ISBN-13 : 978-1449327170)
• Learning Spark: Lightning-Fast Big Data Analysis. By Holden Karau,
Andy Konwinski, Patrick Wendell, and Matei Zaharia. (ISBN-13
: 978-1449358624)
• Programming Pig: Dataflow Scripting with Hadoop. By Alan
Gates and Daniel Dai. (ISBN-13 : 978-1491937099)
• Programming Hive: Data Warehouse and Query Language for Hadoop. By Edward
Capriolo, Dean Wampler, and Jason Rutherglen. (ISBN-13 : 978-1449319335)
Introduction
• What is big data?

• What are the characteristics of big data?

• Why big data?

• What about big data programming?


What is big data?
• Big data is a field that treats ways to analyze,
systematically extract information from, or
otherwise deal with data sets that are too large or
complex to be dealt with by traditional data-
processing application software.

• Big data challenges include capture, selection, processing,
storage, search, sharing, transfer, analysis, and
visualization.
The five Vs of big data
Volume of big data
• According to market intelligence company IDC, the
‘Global Datasphere’ in 2018 reached 18 zettabytes.
• IDC predicts the world’s data will grow to 221
zettabytes in 2026.
• A zettabyte is 10^21 bytes = one billion terabytes
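The unit conversion in the last bullet can be checked with a few lines of arithmetic. This is a minimal sketch assuming decimal (SI) units, where a terabyte is 10^12 bytes; the constant and function names are illustrative and not part of the course material.

```python
# Decimal (SI) byte units; assumed here, since the slide uses powers of ten.
TERABYTE = 10**12   # bytes
ZETTABYTE = 10**21  # bytes


def zettabytes_to_terabytes(zb):
    """Convert a size in zettabytes to terabytes (hypothetical helper)."""
    return zb * ZETTABYTE / TERABYTE


# One zettabyte expressed in terabytes.
terabytes_per_zettabyte = ZETTABYTE // TERABYTE
print(terabytes_per_zettabyte)  # 1000000000

# The 2018 figure quoted above (18 zettabytes) expressed in terabytes.
print(zettabytes_to_terabytes(18))
```

A billion terabyte-sized disks per zettabyte gives a sense of why a single machine cannot hold data at this scale.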
[Figure: sources of data volume]
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 2+ billion people on the Web by end of 2011
• 76 million smart meters in 2009… 200M by 2014
Variety of big data
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
• You can only scan the data once
• Multimedia Data
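The "scan the data once" constraint on streaming data can be illustrated with a one-pass computation. The sketch below keeps a running mean using only a counter and the current mean, so no past values are stored; the function name and the sample readings are made up for illustration.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, reading each value once."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: no need to keep earlier values in memory.
        mean += (value - mean) / count
        yield mean


# An iterator can be consumed only once, like a real stream.
readings = iter([72.0, 75.0, 71.0, 90.0])  # hypothetical sensor readings
means = list(running_mean(readings))
print(means)
```

The same pattern (constant-size state updated per item) applies to counts, sums, minima, and maxima over a stream.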

To extract knowledge -> link all these types of data together


A Single View to the Customer

[Figure: the customer at the center, linked to social media, banking, finance, gaming, entertainment, purchases, and our known history]
Velocity
• Data is being generated fast and needs to
be processed fast
• Online Data Analytics
• Late decisions -> missing opportunities
• Examples
• E-Promotions: Based on your current
location, your purchase history, and what you
like -> send promotions right now for the store
next to you
• Healthcare monitoring: sensors monitoring
your activities and body -> any abnormal
measurements require immediate reaction
Real-time/Fast Data

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to
collect data
• Instead, they are limited by the ability to manage, analyze, summarize,
visualize, and discover knowledge from the collected data in a timely
manner and in a scalable fashion
Real-Time Analytics/Decision Requirement
[Figure: real-time decisions centered on the customer]
• Product recommendations that are relevant and compelling
• Learning why customers switch to competitors and their offers, in time to counter
• Influence behavior
• Friend invitations to join a game or activity that expands business
• Improving the marketing effectiveness of a promotion while it is still in play
• Preventing fraud as it is occurring, and preventing more proactively
Veracity
• Data veracity is the degree of accuracy or
truthfulness of a data set

• It’s not just the quality of the data that is
important, but how trustworthy the source, the
type, and processing of the data are.

• Traditional vs Big data


Value
• Value is defined as the usefulness of data.
• Data from which business insights are garnered adds
‘value’ to the company
• Aggregation of data doesn’t equal value addition
• Data that has high veracity and can be analyzed
quickly has more value to the business.
Why big data?
• In our daily life, we may come across questions like:

• Which news article should I read out of the millions?

• How do I choose a book from the millions available on
a site or in stores?

• How do I keep myself updated about new events, sports,
inventions, and discoveries taking place across the world?

Solutions can be found by analyzing big data!


• “more data usually beats better algorithms”

• Data Mining
• Machine Learning
• Recommendation
• Finance
• …
What about big data programming?
• If a single computer/server is not big enough for the
large amount of data, what shall we do?
• In this case, how do the computers communicate, and how are
they managed?
• If we would like some results on the entire data set,
how could it work?
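One common answer to these questions, and the idea behind the MapReduce-style tools in the course references, is to split the data into partitions, compute a partial result on each partition, and then merge the partial results. The following is a minimal single-process sketch of that pattern for word counting. It only simulates the partitions in one Python program; on a real cluster each partition would live on a different machine and a framework would handle communication and management. All function names and the sample lines are hypothetical.

```python
from collections import Counter
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into partitions (stand-ins for machines)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """'Map' step: each partition counts its own words independently."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """'Reduce' step: combine two partial results into one."""
    return left + right


lines = ["big data is big", "data beats algorithms", "big data programming"]
partitions = split_into_partitions(lines, num_partitions=2)
partial_counts = [count_words(p) for p in partitions]
total_counts = reduce(merge_counts, partial_counts, Counter())
print(total_counts["big"], total_counts["data"])  # 3 3
```

Because each partition is processed without looking at the others, the per-partition step can run in parallel, and only the small partial results need to be sent over the network to be merged.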
Other References
• https://en.wikipedia.org/wiki/Big_data
• https://www.slideshare.net/hktripathy/lecture1-introduction-to-big-data?from_action=save