
Big Data and Data Science

and some applications on real data

Hugo Alatrista Salas


hugo.alatrista@epnewman.edu.pe
About me...

• Ph.D. in Data Science, Université de Montpellier, France
• MSc in Computability, Algorithmic Networks Management and Security, Université de
Montpellier, France
• Researcher at Renacyt (Carlos Monge 2)
• Part time professor - Escuela de Posgrado Newman
• Member of the Artificial Intelligence team - PUCP
• Chief Data Officer CDO - Ping4All
• Co-founder of the SIMBig International Conference

1 / 43
Agenda

1. Motivation

2. What is Big Data?

3. Data Science

4. KDD process

5. Some (small) examples

6. Final words

2 / 43
Motivation
Papyrus of Oxyrhynchus (ca. 100 A.D.)

Papyrus of Euclid’s geometry with diagram (Oxyrhynchus, Egypt, ca. 100 AD, now at the University of Pennsylvania)

3 / 43
Punch cards (1937)

80 bytes (80 columns × 12 rows)

4 / 43
Floppy disks (1971)

Provided by IBM: 8 inches and 80 KBytes

5 / 43
Prince of Persia (1989)

Developed by Brøderbund and designed by Jordan Mechner (Yale University)

6 / 43
Prince of Persia (1989)

Platform: Apple II

7 / 43
Prince of Persia: The Forgotten Sands (2010)

Platform: PS3

8 / 43
What happens on the Internet?

9 / 43
60 seconds on the Internet

https://vrworld.com/2014/10/27/researchers-create-32-tbs-fiber-cable/

10 / 43
The republic of Facebook

11 / 43
Structured Vs Unstructured data

http://mis587pushkarmaid.blogspot.pe/2016/03/big-unstructured-data-vs-structured.html
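
A minimal sketch of the distinction, assuming hypothetical sample data (the record fields and the sentence below are invented for illustration and do not come from the slide). Structured records follow a fixed schema and can be queried by field name, while free text has to be parsed before the same information can be used.

```python
import re

# Structured: every record follows the same schema (hypothetical fields),
# so values are addressable by name.
structured_rows = [
    {"user_id": 1, "city": "Lima", "purchases": 3},
    {"user_id": 2, "city": "Cusco", "purchases": 5},
]

# Unstructured: free text with no schema; the numbers must be extracted.
unstructured_text = "Ana from Lima bought 3 items; Luis from Cusco bought 5 items."


def total_purchases_structured(rows):
    """Sum a named field directly from each record."""
    return sum(row["purchases"] for row in rows)


def total_purchases_unstructured(text):
    """Recover the same numbers with a regular expression.

    This only works while the wording stays the same, which is the
    practical cost of having no schema.
    """
    return sum(int(n) for n in re.findall(r"bought (\d+) items", text))


if __name__ == "__main__":
    print(total_purchases_structured(structured_rows))
    print(total_purchases_unstructured(unstructured_text))
```

Both functions return the same total here, but the second one depends on the exact phrasing of the text and would need to change whenever the wording does.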

12 / 43
IoT: Smart things

https://www.cisco.com/c/dam/en_us/about/ac79/docs/innov/IoT_IBSG_0411FINAL.pdf

13 / 43
What is Big Data?
What do the dictionaries say? (1)

[Oxford Dictionary] Extremely large data sets that may be analysed
computationally to reveal patterns, trends, and associations, especially
relating to human behaviour and interactions: much IT investment is
going towards managing and maintaining big data

14 / 43
What do the dictionaries say? (2)

[Wikipedia] Big data is a term for data sets that are so large or complex
that traditional data processing application software is inadequate to deal
with them. Big data challenges include capturing data, data storage, data
analysis, search, sharing, transfer, visualization, querying, updating and
information privacy

15 / 43
Big (Rich) Data

Osmar Zaiane, University of Alberta, Canada - UP, Faculty of Engineering, 2015

16 / 43
Big Data and the four Vs

Value + Vulnerability

https://ajaykumarjogawath.wordpress.com/2015/09/03/understanding-big-data/

17 / 43
Turning Big Data into value

https://www.linkedin.com/pulse/understanding-big-data-basics-mohamad-el-charif

18 / 43
Big Data vs Small Data

• Big Data is similar to Small Data, but bigger, with heterogeneous
sources, and arriving quickly
• Consequently, it is necessary to use different approaches, techniques,
tools and architectures
• We must be capable of solving new problems and old problems in a
better way
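
One way to picture the "different approaches" mentioned above is incremental (streaming) computation: instead of loading every value into memory and then averaging, the statistic is updated as each value arrives. The sketch below is only an illustration under that assumption; the `sensor_readings` source and its values are hypothetical.

```python
from typing import Iterable, Iterator


def running_mean(stream: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, one value at a time.

    Only the count and the current mean are kept, so memory use does not
    grow with the amount of data.
    """
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: move the mean towards the new value.
        mean += (value - mean) / count
        yield mean


def sensor_readings() -> Iterator[float]:
    """Hypothetical data source standing in for a fast-arriving stream."""
    for value in (10.0, 20.0, 30.0, 40.0):
        yield value


if __name__ == "__main__":
    for current in running_mean(sensor_readings()):
        print(current)
```

The same pattern extends to other statistics that can be updated incrementally, which is one reason streaming architectures are common when data arrives faster than it can be stored and reprocessed.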

19 / 43
Big Data landscape

https://mattturck.com/data2019/

20 / 43
Methods of the past Vs Current methods (1)

21 / 43
Methods of the past Vs Current methods (2)

22 / 43
Big Data trends

https://www.kdnuggets.com/2017/05/machine-learning-overtaking-big-data.html

23 / 43
Data Science
Data Science

The term Data Science was proposed at the 2nd Franco-Japanese Conference of
Statistics (University of Montpellier II, France, Sep. 1992)

https://link.springer.com/chapter/10.1007/978-3-642-59789-3_52

24 / 43
What do the dictionaries say? (1)

[Wikipedia] Data science, also known as data-driven science, is an
interdisciplinary field of scientific methods, processes, algorithms and
systems to extract knowledge or insights from data in various forms,
either structured or unstructured, similar to data mining.

25 / 43
What do the dictionaries say? (2)

[IBM Research] A data scientist represents an evolution from the business
or data analyst role. The formal training is similar, with a solid
foundation typically in computer science and applications, modeling,
statistics, analytics and math. What sets the data scientist apart is strong
business acumen, coupled with the ability to communicate findings to
both business and IT leaders in a way that can influence how an
organization approaches a business challenge...

26 / 43
Pillars of Data Science

• Business domain
• Statistics and probability
• Mathematical thinking
• Computer science and software programming
• Written and verbal communication

Park City Math Institute, Princeton, USA

27 / 43
Data Scientist skills

http://upxacademy.com/data-scientist/

28 / 43
Data Scientist: the sexiest job

29 / 43
KDD process
KDD Process

Iterative and interactive multi-step process to transform huge databases
into knowledge¹

¹ From data mining to knowledge discovery: an overview. Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth. Advances in
knowledge discovery and data mining, pages 1-34, 1996
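
The multi-step process can be sketched as a chain of small functions, one per step commonly listed for KDD (selection, preprocessing, transformation, data mining, interpretation/evaluation). This is a toy illustration, not the method used in any of the works cited in these slides: the records, the function names and the `min_support` threshold are all hypothetical, and a real project would loop back to earlier steps when the evaluation is unsatisfactory.

```python
from collections import Counter

# Hypothetical raw records: (customer, product, quantity).
# None marks a missing value.
RAW = [
    ("ana", "bread", 2),
    ("ana", "milk", 1),
    ("luis", "bread", None),
    ("luis", "milk", 3),
    ("rosa", "bread", 1),
]


def select(records):
    """Selection: keep only the attributes relevant to the question."""
    return [(product, qty) for _customer, product, qty in records]


def preprocess(records):
    """Preprocessing: drop records with missing values."""
    return [(product, qty) for product, qty in records if qty is not None]


def transform(records):
    """Transformation: aggregate quantities per product."""
    totals = Counter()
    for product, qty in records:
        totals[product] += qty
    return totals


def mine(totals):
    """Data mining: a stand-in 'pattern', here the best-selling product."""
    return totals.most_common(1)[0]


def evaluate(pattern, min_support=2):
    """Interpretation/evaluation: accept the pattern only if it is well supported."""
    _product, total = pattern
    return total >= min_support


def kdd(records):
    """Run the steps once and return the pattern with its evaluation."""
    pattern = mine(transform(preprocess(select(records))))
    return pattern, evaluate(pattern)


if __name__ == "__main__":
    print(kdd(RAW))
```

Keeping each step as a separate function makes the iterative nature of the process easier to act on: if the evaluation rejects a pattern, a single step (for example the preprocessing rule) can be changed and the chain rerun.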

30 / 43
6-step Knowledge Discovery Process

The six-step KDP model²

² Pal, N.R., Jain, L.C., Eds. (2005). Advanced Techniques in Knowledge Discovery and Data Mining, Springer

31 / 43
Knowledge Discovery Process

n-step Knowledge Discovery Process (KDP)³

³ K.J. Cios, W. Pedrycz, and R.J. Swiniarski (2007). The Knowledge Discovery Process. Springer

32 / 43
CRISP-DM

https://www.kdnuggets.com/2017/01/four-problems-crisp-dm-fix.html

33 / 43
Data-driven process for Big data

https://www.researchgate.net/figure/Big-picture-of-the-data-driven-process-for-crowd-management-and-control_fig1_332775847

34 / 43
Some (small) examples
Data summarization

Event summarization using tweets - Work developed in collaboration with Arturo Oncevay - PUCP

35 / 43
Text Mining

Study of the perception of citizen insecurity - Work developed in collaboration with Juandiego Morzan - UP

36 / 43
Analysis of a meteorological phenomenon

Spatio-temporal data mining - Master thesis conducted by Oscar A. Diaz - PUCP

37 / 43
Analysis of epidemiological data

Epidemiological pattern visualization - Work developed in collaboration with Agustín Guevara - PUCP

38 / 43
Studying the resilience in Peru (1)

Work developed in collaboration with Vincent Gauthier, Miguel Nunez-del-Prado

39 / 43
Studying the resilience in Peru (2)

PhD thesis conducted by Manuel Rodríguez (Universidad de Cuenca, Ecuador) - PUCP

40 / 43
Inhabitants mobility in Lima

Work developed in collaboration with Vincent Gauthier, Miguel Nunez-del-Prado

41 / 43
Final words
Final words

• Big Data is real and we should be prepared
• Big Data is only getting worse in terms of volume, speed, availability
of sources, complexity, etc.
• Before proposing a Big Data project, think about the architecture
(hardware and software)
• Visualization is an important strategy to communicate ideas and to
summarize a huge amount of data
• We cannot talk about Big Data without mentioning Data
Science
• Data Science is transversal to many domains

42 / 43
Hugo Alatrista Salas, Ph.D.
Escuela de Posgrado Newman
https://simbig.org/alatrista-salas/
E-mail: hugo.alatrista@epnewman.edu.pe

43 / 43
