BigData W1 Introduction HoangVu
Background:
2 years of experience as a Senior Analyst in the Data Analysis Department at CONMED, the biggest corporation in the US medical industry;
2. Understand the challenges and significance of Big Data, and from that propose solutions to overcome those challenges.
3. Understand the basic operating principles of the tools and technologies used to process Big Data, and be able to compare them and state their advantages and disadvantages in order to apply and use them effectively.
4. Understand the basic ways of collecting, transmitting, processing, and analyzing big data.
Topics in the course:
• Data Science, Big Data & the challenges of the era
Science only has explanatory and predictive models in a few (mostly physical
sciences-related) domains
... So: can we use algorithms + data to understand phenomena? Build or
augment models? Build detectors? Make diagnoses?
Data Is Driving Everything
• Of course, in the real world we often want to combine models and data!
Data science:
• Doing science based on data, with the aim of finding knowledge in data.
Source: Gartner
What is big data?
• Big data refers to data sets
• Data
• Information
• Knowledge
What is Big Data?
Originally defined by Gartner (2012) as "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization". Veracity was addressed as another distinct characteristic of Big Data.
Source: IBM, McKinsey Global Institute, Twitter, Cisco, EMC, SAS, Gartner, MEPTEC, QAS
Big Data Trivia
• 90% of the data in the world today has been
created in the past two years
• Approximately 100,000 tweets are sent globally
every minute
• Google receives over 2,000,000 search requests
every minute
• Total amount of data generated by 2020 is
predicted to be 5,200GB per person on the
planet
• Retailers’ operating margins could increase as much as 60% with Big Data
Source: The Economist, McKinsey & Company, Gartner, Facebook, IBM, 2015
Data size measurement unit cheat sheet:
1000 bytes = 1 kB (kilobyte)
1000^2 bytes = 1 MB (megabyte)
1000^3 bytes = 1 GB (gigabyte)
1000^4 bytes = 1 TB (terabyte)
1000^5 bytes = 1 PB (petabyte)
1000^6 bytes = 1 EB (exabyte)
1000^7 bytes = 1 ZB (zettabyte)
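The decimal unit ladder in the cheat sheet, where each step is a factor of 1000, can be sketched as a small conversion helper. This is only an illustrative sketch: the names `UNITS`, `to_bytes`, and `humanize` are hypothetical and not part of the slides. The example input is the 5,200 GB-per-person figure quoted in the trivia list.

```python
# Decimal (SI) data-size units from the cheat sheet; each step is a factor of 1000.
# UNITS, to_bytes and humanize are hypothetical names chosen for this sketch.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB"]


def to_bytes(value, unit):
    """Convert a size expressed in a decimal unit (e.g. 'GB') to bytes."""
    return value * 1000 ** UNITS.index(unit)


def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the last unit available.
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000


# The slide's predicted 5,200 GB per person, re-expressed in a larger unit.
per_person = to_bytes(5200, "GB")
print(humanize(per_person))  # prints "5.2 TB"
```

Keeping the units in an ordered list means the exponent is simply the list index, which mirrors the 1000^1 to 1000^7 layout of the cheat sheet.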
An Internet Minute…
Why Does It Matter
• Among data scientists, 67% said that cleaning and organizing data is one of their most time-consuming tasks
Need data-wrangling tools
• 77% of jobs that require coding need Python skills; 60% of jobs that require statistics need R skills (#1 in each category)
• 52% work with a broad range of languages and platforms; 59% work with a diverse portfolio of problems
• Poor quality data is a common obstacle for 52% of data scientists
Complex data requires a programmatic approach
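As a rough illustration of what a programmatic approach to poor-quality data can look like, the sketch below cleans a handful of made-up records using only the Python standard library. The records, the field names `name` and `age`, and the function `clean_records` are all hypothetical and serve only to show the idea; real projects would typically use a dedicated data-wrangling tool.

```python
def clean_records(rows):
    """Return de-duplicated rows with normalized names and integer ages.

    Rows with a missing name or a non-numeric age are dropped.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Normalize whitespace and capitalization; tolerate missing values.
        name = (row.get("name") or "").strip().title()
        age_text = str(row.get("age") or "").strip()
        if not name or not age_text.isdigit():
            continue  # unusable row: no name or age is not a whole number
        record = (name, int(age_text))
        if record in seen:
            continue  # duplicate once formatting differences are removed
        seen.add(record)
        cleaned.append({"name": name, "age": int(age_text)})
    return cleaned


# Made-up sample data with typical quality problems.
raw = [
    {"name": "  alice ", "age": "34"},   # stray whitespace, lowercase
    {"name": "Alice", "age": 34},        # duplicate of the row above
    {"name": "bob", "age": "n/a"},       # non-numeric age
    {"name": "", "age": "51"},           # missing name
    {"name": "carol", "age": " 29 "},    # padded age string
]
print(clean_records(raw))
```

Five raw rows reduce to two usable ones here, which is a small-scale picture of why cleaning and organizing data takes so much of a data scientist's time.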
… analytics applications
▪ Case study: Robertson, H., & Travaglia, J. (2015). A Politics of Counting – Putting People Back into Big Data. Discover Society, 3(23). Retrieved from https://archive.discoversociety.org/2015/07/30/a-politics-of-counting-putting-people-back-into-big-data/
▪ Based on the four characteristics of Big Data (Volume, Velocity, Variety, Veracity), analyze the case study above and submit one essay of 3-4 pages (only the group leader submits). Clearly state each member's contribution at the end of the essay.
o What inclusions and exclusions persist from the days of traditional data collection and analytics to today’s era of “big data” collection
and analytics?
o How would you best describe the “no policy as policy” concept in relation to big data collection and analysis?
o What are some emerging data collection and analysis methods that better capture our “analog” selves?
o How should the generalized practices of demography be updated to match the social data systems of today? Is there some benefit to
keeping the current methods?
o The case study states that “marginalized groups have a long history of being excluded from official counting systems only to have a
great deal of, usually negative, attention paid to them by those same systems.”
o What are some examples of how this scenario is taking place today?
o What would be some suggested ideas for better, more inclusive data collection?
o How would a fair and inclusive method of data collection and analysis impact our society and its marginalized groups?