
BIG DATA

Lecturer: Vũ Minh Hoàng, M.Sc.

Faculty of Information Technology - TLU


About the Lecturer
Vũ Minh Hoàng
Senior Data Scientist – AI Center, FPT.
Big Data Analysis Lead – Viettel Corp.

Background:
2 years of experience as Senior Analyst in the Data Analysis Department at CONMED, the biggest corporation in the medical industry in the US;

Project Owner & Data Scientist of Big Data Project;

M.S. in Information Technology (Data Science & Analytics), Northeastern University, Boston, MA.
B.S. in International Business & Economics, FTU, HN.
Your turn
• Full name?
• Nickname?
• Favorite subject?
• Future career direction?
• Hobbies?
Objectives
1. Understand the concept, origins, and characteristics of Big Data and explain them with real-world examples.

2. Understand the challenges and significance of Big Data, and from there propose solutions to overcome those challenges.

3. Understand the basic operating principles of the tools and technologies used for processing, and be able to compare them and weigh their advantages and disadvantages in order to apply them effectively.

4. Understand the basic ways of collecting, transmitting, processing, and analyzing big data.
Course topics:
• Data Science, Big Data & the challenges of our time

• Big Data collection

• Big Data storage

• Big Data System Architecture and Cloud

• Big Data Analysis


Data Is Driving Everything

1. Modern data acquisition is inexpensive!


• Smartphones, embedded systems, inexpensive sensors,
• Medical devices, simulators, …
2. Data storage is inexpensive!
3. Parallel (compute cluster) computation is inexpensive
• The Cloud, clusters of computers, GPUs, tensor processors, …

Science only has explanatory and predictive models in a few (mostly physical
sciences-related) domains
... So: can we use algorithms + data to understand phenomena? Build or
augment models? Build detectors? Make diagnoses?

Data Is Driving Everything

“Big data” “Deep learning”


“Data science” “Statistical analysis”
“Data lakes” “Biomedical informatics”
“Visual analytics” “Business analytics”

Lots of trends in pursuit of the same goals!


Discovery, models, decision-making, …

Also, new issues:


“Ethical algorithms”
“Reproducibility”
The Key Question in Big Data Analytics:
How Do We Understand and Predict?
Much of science and engineering derives from physics, which is “model-first”
• Newton’s laws, the theory of relativity, optics, how materials react under stress, etc.
• Here, the basis of prediction – even with stochastic processes – tends to be simulation
• Weather forecasting, simulating water in the movie Moana, etc.

We want predictions where we don’t have good models, e.g., behavior, biology, the brain, whether a product will be a success, what to invest in
• We need to use sampling, statistics, data-first approaches
• The big data revolution is mostly about how to acquire and handle enough data, and ask the
right questions, for these models to be useful!

• Of course, in the real world we often want to combine models and data!
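The contrast between model-first and data-first prediction can be sketched with a falling object. This is only an illustration: the measurements in `observations` are made-up values, and the function names are hypothetical. The model-first path uses the known law d = 0.5 * g * t^2; the data-first path pretends the law is unknown and estimates the coefficient from samples by least squares.

```python
# Model-first: predict fall distance from known physics, d = 0.5 * g * t^2.
G = 9.81  # gravitational acceleration in m/s^2


def model_first_distance(t):
    """Predict distance fallen (m) after t seconds using Newtonian physics."""
    return 0.5 * G * t ** 2


# Data-first: assume we only have measurements, not the law itself.
# These (time, distance) pairs are made-up illustrative observations.
observations = [(0.5, 1.3), (1.0, 4.8), (1.5, 11.2), (2.0, 19.5), (2.5, 30.8)]


def fit_quadratic_coefficient(samples):
    """Least-squares estimate of a in d = a * t^2 (no intercept term)."""
    numerator = sum((t ** 2) * d for t, d in samples)
    denominator = sum(t ** 4 for t, _ in samples)
    return numerator / denominator


a_hat = fit_quadratic_coefficient(observations)


def data_first_distance(t):
    """Predict distance fallen (m) using only the coefficient learned from data."""
    return a_hat * t ** 2


print("model-first at t=3s:", model_first_distance(3.0))
print("data-first  at t=3s:", data_first_distance(3.0))
```

With enough reasonably clean samples the two predictions come out close, which is the point of the slide: where no trustworthy model exists, the data-first route is the only one available, and its quality depends on the data collected.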
Data science:
• Doing science based on data, in order to discover knowledge from data.

• The traditional approach aims to verify hypotheses derived from existing knowledge.
Jim Gray (1944-2007)
Overview of Big Data and Big Data Analytics

• What is Big Data and why does it matter

Source: Gartner
What is Big Data?
• Big data refers to data sets that are very large and/or very complex, beyond the processing capability of traditional IT techniques

• Data

• Information

• Knowledge
What is Big Data?
Originally defined by Gartner (2012): "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization". Veracity was addressed as another distinct characteristic of Big Data.

Source: IBM, McKinsey Global Institute, Twitter, Cisco, EMC, SAS, Gartner, MEPTEC, QAS
Big Data Trivia
• 90% of the data in the world today has been
created in the past two years
• Approximately 100,000 tweets are sent globally
every minute
• Google receives over 2,000,000 search requests
every minute
• Total amount of data generated by 2020 is
predicted to be 5,200GB per person on the
planet
• Retailers’ operating margins could increase as much as 60% with Big Data

Data size measurement unit cheat sheet:
1000 bytes = 1 kB (kilobyte)
1000^2 bytes = 1 MB (megabyte)
1000^3 bytes = 1 GB (gigabyte)
1000^4 bytes = 1 TB (terabyte)
1000^5 bytes = 1 PB (petabyte)
1000^6 bytes = 1 EB (exabyte)
1000^7 bytes = 1 ZB (zettabyte)

Source: The Economist, McKinsey & Company, Gartner, Facebook, IBM, 2015
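The unit cheat sheet and the per-minute figures above can be turned into quick arithmetic. The following is a minimal sketch assuming decimal (SI) prefixes, where each step is a factor of 1000; the helper names `to_bytes` and `convert` are hypothetical, introduced only for this illustration.

```python
# Decimal (SI) prefixes from the cheat sheet: each step is a factor of 1000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * 1000 ** UNITS.index(unit)


def convert(value, from_unit, to_unit):
    """Convert a value between two units in the UNITS list."""
    return to_bytes(value, from_unit) / 1000 ** UNITS.index(to_unit)


# 5,200 GB per person, expressed in terabytes.
per_person_tb = convert(5200, "GB", "TB")

# Scale the per-minute trivia figures up to a full day.
MINUTES_PER_DAY = 24 * 60
tweets_per_day = 100_000 * MINUTES_PER_DAY
searches_per_day = 2_000_000 * MINUTES_PER_DAY

print(per_person_tb, "TB per person")
print(tweets_per_day, "tweets per day")
print(searches_per_day, "searches per day")
```

Under these assumptions, 5,200 GB is 5.2 TB, and the per-minute rates correspond to 144 million tweets and 2.88 billion searches per day.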
An Internet Minute…
Why Does It Matter

Source: AlphaSix, 2015


What Does the Field Say
Survey Summary

• 65% of organizations have increased the number of positions requiring data analysis skills in the past 5 years (and only 4% decreased)

• Among data scientists, 67% said that cleaning and organizing data is one of their most time-consuming tasks

• 77% of jobs that require coding need Python skills, 60% that require statistics require R skills (#1 in each category)

• 52% work with a broad range of languages and platforms; 59% work with a diverse portfolio of problems

• Poor quality data is a common obstacle for 52% of data scientists

Takeaways:
• Data-driven roles become more prevalent
• Data can’t be directly used without pre-processing
• Need data-wrangling tools
• Complex data requires a programmatic approach
• Large scale, reliable data is an asset for companies
• Need multiple skills and approaches

Source: CrowdFlower, 2015; SHRM, 2016
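The point that data cannot be used directly without pre-processing can be made concrete with a small cleaning pass. This is a sketch in plain Python, not a recommended tool: the record layout (`name`, `age`), the sample rows, and the function `clean_records` are all assumptions made for illustration. Real projects would more likely use a data-wrangling library.

```python
def clean_records(records):
    """Return cleaned copies of records: trimmed, normalized, deduplicated."""
    cleaned = []
    seen = set()
    for record in records:
        # Normalize whitespace and capitalization of the name.
        name = (record.get("name") or "").strip().title()
        # Accept ages given as strings or integers; strip stray spaces.
        raw_age = str(record.get("age") or "").strip()
        # Drop rows with a missing name or a non-numeric age.
        if not name or not raw_age.isdigit():
            continue
        key = (name, int(raw_age))
        if key in seen:  # skip duplicates that match after normalization
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": int(raw_age)})
    return cleaned


# Made-up sample rows showing typical quality problems.
raw = [
    {"name": "  alice ", "age": "30"},   # stray whitespace, lowercase
    {"name": "ALICE", "age": "30 "},     # duplicate after normalization
    {"name": "Bob", "age": "unknown"},   # non-numeric age
    {"name": "", "age": "41"},           # missing name
    {"name": "carol", "age": 27},        # age already an integer
]

result = clean_records(raw)
print(result)
```

Of the five raw rows only two survive, which mirrors the survey finding that cleaning and organizing is a large share of the work before any analysis can start.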
An Evolving Path from Traditional Data Analysis
to Big Data Analytics
FANG companies

traditional data analysis
• Data analysis has long been used in various industries and functions, such as accounting, auditing, customer data management, manufacturing quality control

… big data analytics
• Use of Big Data, that is, decision making upon analyzing large volume of complex data, is advancing rapidly only in the last decade or so

… analytics applications
• Marketing analytics
• Retail analytics
• Insurance analytics
• Web analytics
• Social media analytics
• Text analytics
• Geospatial analytics
• …
Exercise (group)
▪ Divide the students into 4 groups

▪ Case study: Robertson, H., & Travaglia, J. (2015). A Politics of Counting – Putting People Back into Big Data. Discover Society, 3(23). Retrieved from https://archive.discoversociety.org/2015/07/30/a-politics-of-counting-putting-people-back-into-big-data/

▪ Based on the 4 characteristics of Big Data (Volume, Velocity, Variety, Veracity), analyze the case study above and submit one essay of 3-4 pages (only the group leader submits). State each member's contribution clearly at the end of the essay.

▪ Submission: via Teams or by email to the lecturer.


Exercise (group)
• The analysis can be based on the following questions:

o What inclusions and exclusions persist from the days of traditional data collection and analytics to today’s era of “big data” collection
and analytics?

o How would you best describe the “no policy as policy” concept in relation to big data collection and analysis?

o What are some emerging data collection and analysis methods that better capture our “analog” selves?

o How should the generalized practices of demography be updated to match the social data systems of today? Is there some benefit to
keeping the current methods?

o The case study states that “marginalized groups have a long history of being excluded from official counting systems only to have a
great deal of, usually negative, attention paid to them by those same systems.”

o What are some examples of how this scenario is taking place today?

o What would be some suggested ideas for better, more inclusive data collection?

o How would a fair and inclusive method of data collection and analysis impact our society and its marginalized groups?

▪ Deadline: Tuesday of next week.
