Krzysztof Dembczyński

Intelligent Decision Support Systems Laboratory (IDSS)

Poznań University of Technology, Poland

Master studies, second semester

Academic year 2018/19 (winter course)

Goal: understanding data . . .

Goal: . . . to make data analysis efficient.

• Buzzwords: Big Data, Data Science, Machine learning, NoSQL . . .

• How Big Data Changes Everything:

  ◦ Several books showing the impact of the Big Data revolution (e.g., Jeffrey Needham).

• Computerworld (Jul 11, 2007):

  ◦ 12 IT skills that employers can’t say no to:

    1) Machine learning

    ...

• Three priorities of Google announced at BoxDev 2015:

  ◦ Machine learning – speech recognition

  ◦ Machine learning – image understanding

  ◦ Machine learning – preference learning/personalization

company.

Data mining

• But what is a model?

If all you have is a hammer, everything looks like a nail.

How to characterize your data?

select avg(column), std(column) from data

• A statistician might decide that the data comes from a Gaussian distribution and use a formula to compute the most likely parameters of this Gaussian: the mean and standard deviation.

• A machine learner will use the data as training examples and apply a learning algorithm to get a model that predicts future data.

• A data miner will discover the most frequent patterns.
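
To make the contrast concrete, here is a minimal Python sketch of two of the three views, run on a small made-up sample. The function names (gaussian_mle, most_frequent) and the data are illustrative assumptions, not part of the course material; the machine learner's view is left out because it depends on the choice of learning algorithm.

```python
import math
from collections import Counter


def gaussian_mle(values):
    """Statistician's view: maximum-likelihood mean and standard
    deviation of a Gaussian (variance is divided by n, not n - 1)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return mean, math.sqrt(variance)


def most_frequent(values, k=1):
    """Data miner's view: the k most frequent values with their counts."""
    return Counter(values).most_common(k)


# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean, std = gaussian_mle(data)
print(mean, std)            # 5.0 2.0
print(most_frequent(data))  # [(4, 3)]
```

On this sample the statistician reports a mean of 5 and a standard deviation of 2, while the data miner reports that the value 4 occurs most often (three times).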

They all want to understand data and use this knowledge for making better decisions.

Data+ideas vs. statistics+algorithms

It’s often more important to creatively invent new data sources than to implement the latest academic variations on an algorithm.

• WhizBang! Labs tried to use machine learning to locate people’s resumes on the Web: the algorithm was not able to do better than procedures designed by hand, since resumes have a fairly standard layout and phrasing.

Data+computational power

  ◦ Scanning large databases can perform better than the best computer vision algorithms!

• Automatic translation

  ◦ Statistical translation based on large corpora outperforms linguistic models!

Human computation

• ESP game

• Check a lecture given by Luis von Ahn:

http://videolectures.net/iaai09_vonahn_hc/

• Amazon Mechanical Turk

Data+ideas vs. statistics+algorithms

Brad Efron

to extract information that was not supported by the data.

• Bonferroni’s Principle: “if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap”.

• Rhine paradox.
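
A small worked calculation may help here. It follows the usual telling of the Rhine paradox, in which subjects guess the colour (red or blue) of 10 hidden cards; the figure of 1000 subjects and the function names are assumptions made for illustration, not data from the lecture.

```python
def expected_lucky_subjects(num_subjects, num_cards):
    """Expected number of subjects who guess every card correctly
    by pure chance, with two equally likely colours per card."""
    p_all_correct = 0.5 ** num_cards
    return num_subjects * p_all_correct


def prob_at_least_one_lucky(num_subjects, num_cards):
    """Probability that at least one subject guesses every card
    correctly by chance, assuming independent subjects."""
    p_all_correct = 0.5 ** num_cards
    return 1.0 - (1.0 - p_all_correct) ** num_subjects


# Assumed setting: 1000 subjects, 10 cards each.
print(expected_lucky_subjects(1000, 10))            # 0.9765625
print(round(prob_at_least_one_lucky(1000, 10), 2))  # 0.62
```

With pure guessing one should therefore expect about one "perfect" subject among 1000, so finding such a subject is not evidence of any special ability. This is Bonferroni's principle in miniature: the number of places searched is large enough that a striking pattern appears by chance.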

Data+ideas vs. statistics+algorithms

  ◦ Xbox Kinect: object tracking vs. pattern recognition (check: http://videolectures.net/ecmlpkdd2011_bishop_embracing/).

  ◦ Pattern finding: association rules.

  ◦ Netflix: recommender system.

  ◦ Google and PageRank.

  ◦ Clustering of cholera cases in 1854.

  ◦ Win one of Kaggle’s competitions! http://www.kaggle.com/.

  ◦ Autonomous cars.

  ◦ Deep learning.

  ◦ And many others.

Data+ideas+computational power+statistics+algorithms

To be learned in the upcoming semester . . .

The aim and the scope of the course

mining of massive datasets.

• Scope: We will learn about scalable algorithms for:

  ◦ Classification and regression,

  ◦ Searching for similar items,

  ◦ And recommender systems.

• The course is mainly based on the Mining of Massive Datasets book: http://www.mmds.org/

Main information about the course

• Instructors:

  ◦ dr hab. inż. Krzysztof Dembczyński (kdembczynski@cs.put.poznan.pl)

• Website:

  ◦ www.cs.put.poznan.pl/kdembczynski/lectures/mmds

• Time and place:

  ◦ Lecture: Thursday 13:30, room L125 BT.

  ◦ Labs: Wednesday, 13:30 and 16:50, room 45 CW.

  ◦ Office hours: Thursday, 10:00-12:00, room 2 CW.

Lectures

  ◦ Introduction

  ◦ Classification and regression (x6)

  ◦ Finding similar items (x3)

Labs

• Software: Python and Spark.

• List of tasks and exercises for each week (including homework).

• Mainly mini programming projects and short exercises.

• Main topics:

  ◦ Bonferroni’s principle (x2)

  ◦ Classification and regression (Python) (x6)

  ◦ Finding similar items (x2)

Evaluation

• Lecture:

  ◦ Test: 75% (min. 50%)

  ◦ Labs: 25% (min. 50%)

• Labs:

• Scale:

  90% – 5.0    80% – 4.5    70% – 4.0

  60% – 3.5    50% – 3.0    < 50% – 2.0

• Bonus points for all: up to 10 percentage points.

Bibliography

Cambridge University Press, 2011

http://www.mmds.org

Book. Second Edition.

Pearson Prentice Hall, 2009

Morgan and Claypool Publishers, 2010

http://lintool.github.com/MapReduceAlgorithms/

Edition.

Springer, 2009

http://www-stat.stanford.edu/~tibs/ElemStatLearn/

Manning Publications Co., 2011
