
Web Data Mining

Lab 4
Andreas Papadopoulos
andpapad+epl451@gmail.com
andpapad@cs.ucy.ac.cy

Intro to Mahout!

• Mahout
– Scalable Machine Learning and Data Mining
– Built on top of Apache Hadoop
– Uses the map/reduce paradigm
– http://mahout.apache.org/


Who uses Mahout?

What’s in the name?

• A mahout is a person who drives an elephant
• The name Mahout comes from the project’s use of Apache Hadoop (whose logo is an elephant) for scalability and fault tolerance

History

Challenges

• Large amount of input data
– Techniques work better with more input data
– Nature of the deploying context
• Must produce results quickly
• The amount of input is so large that it is not feasible to process it all on one computer, even a powerful one

Scalability

• Mahout core algorithms are implemented on top of Apache Hadoop using the Map/Reduce paradigm
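As a rough illustration of the Map/Reduce paradigm named on this slide, the sketch below counts item occurrences in a single Python process. It only mimics the map, shuffle and reduce phases in memory; it does not use Hadoop or Mahout, and all function names are made up for this example.

```python
from collections import defaultdict

def map_phase(transaction):
    """Emit an (item, 1) pair for every item in one transaction."""
    return [(item, 1) for item in transaction]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum all counts emitted for one key."""
    return key, sum(values)

def count_items(transactions):
    """Run the three phases in sequence on an in-memory list."""
    pairs = [pair for t in transactions for pair in map_phase(t)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

transactions = [["milk", "eggs", "bread"], ["milk", "bread"], ["eggs"]]
counts = count_items(transactions)
print(counts)
```

On a real cluster the map and reduce calls would run on different machines, which is what lets the same logic scale beyond one computer.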

Four main use cases (currently)

• Recommendation
– takes user behavior and tries to predict items users might like
• Clustering
– takes documents and groups topically related documents
• Classification
– tries to assign documents to a correct category based on examples
• Frequent itemset mining
– takes a set of item groups and identifies which items usually appear together

Common use cases

• Recommend friends/dates/products
• Classify content into predefined groups
• Find similar content based on object properties
• Find associations/patterns in actions/behaviors
• Identify key topics in large collections of text
• Detect anomalies in machine output
• Rank search results
• Others?

Mahout

• For a complete list of algorithms visit:
– https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
• Goal: be as fast and efficient as possible given the intrinsic design of the algorithm
– Some algorithms won’t scale to massive machine clusters
– Others fit logically on a Map/Reduce framework like Apache Hadoop
– Still others will need other distributed programming models
– Be pragmatic
• Most Mahout implementations are Map/Reduce enabled

Algorithms and Applications

Clustering

• Call it fuzzy grouping based on a notion of similarity

Mahout Clustering

• Plenty of algorithms: K-Means, Fuzzy K-Means, Mean Shift, Canopy, Dirichlet
• Group similar-looking objects
• Notion of similarity: distance measure
– Euclidean
– Cosine
– Tanimoto
– Manhattan
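To make the four distance measures listed on this slide concrete, here is a minimal plain-Python sketch of each on dense numeric vectors. These are the textbook definitions, not Mahout's own classes; the function names are illustrative, and the cosine and Tanimoto versions assume non-zero input vectors.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 minus the cosine of the angle between the vectors (non-zero input assumed)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def tanimoto_distance(a, b):
    """1 minus the Tanimoto coefficient a.b / (|a|^2 + |b|^2 - a.b) (non-zero input assumed)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - dot / denom

p, q = [1.0, 0.0], [0.0, 1.0]
print(euclidean(p, q), manhattan(p, q), cosine_distance(p, q), tanimoto_distance(p, q))
```

Which measure fits best depends on the data: Euclidean and Manhattan are sensitive to vector magnitude, while cosine compares direction only.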

Classification

• Predicting the type of a new object based on its features
• The types are predetermined (e.g. Dog, Cat)

Mahout Classification

• Plenty of algorithms
– Naïve Bayes
– Complementary Naïve Bayes
– Random Forests
– Logistic Regression (SGD)
– Support Vector Machines
• Learn a model from manually classified data
• Predict the class of a new object based on its features and the learned model
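The learn-then-predict workflow on this slide can be sketched with a tiny multinomial Naïve Bayes classifier in plain Python. This is a toy illustration with add-one smoothing, not Mahout's implementation; the training data, labels and function names are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """Learn label and per-label word counts from (words, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def predict(model, words):
    """Return the label with the highest log-probability for the given words."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        n_words = sum(word_counts[label].values())
        for w in words:
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Manually classified examples (hypothetical features).
training = [
    (["barks", "fetches"], "dog"),
    (["barks", "wags"], "dog"),
    (["meows", "purrs"], "cat"),
    (["meows", "climbs"], "cat"),
]
model = train(training)
print(predict(model, ["barks", "wags"]))
```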

Recommendations

• Predict what the user likes based on
– His/her historical behavior
– Aggregate behavior of people similar to him/her

Mahout Recommenders

• Different types of recommenders
– User based
– Item based
• Full framework for storage, online and offline computation of recommendations
• Like clustering, there is a notion of similarity in users or items
– Cosine, Tanimoto, Pearson and LLR
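A user-based recommender of the kind listed on this slide can be sketched in a few lines of Python. The example assumes boolean preferences (a user either has an item or not) and uses Tanimoto similarity between users; the data and function names are hypothetical, and this is not Mahout's recommender API.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two item sets: |intersection| / |union|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def recommend(user, prefs, top_n=3):
    """Rank items the user has not seen by the summed similarity of users who have them."""
    seen = prefs[user]
    scores = {}
    for other, items in prefs.items():
        if other == user:
            continue
        sim = tanimoto(seen, items)
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + sim
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]

prefs = {
    "alice": {"A", "B", "C"},
    "bob": {"A", "B", "D"},
    "carol": {"C", "E"},
    "dave": {"F"},
}
print(recommend("alice", prefs))
```

An item-based variant would compute the same similarity between items (by the sets of users who have them) instead of between users.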

Frequent Pattern Mining

• Find interesting groups of items based on how they co-occur in a dataset

Mahout Parallel FPGrowth

• Identify the most commonly occurring patterns from
– Sales transactions: buy “Milk, eggs and bread”
– Query logs: ipad -> apple, tablet, iphone
– Spam detection: Yahoo! http://www.slideshare.net/hadoopusergroup/mailantispam
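To show what "commonly occurring patterns" means for the sales-transaction case, the sketch below counts every item combination up to a small size and keeps those that reach a minimum support. This is deliberately a brute-force count, not the FP-Growth algorithm that Mahout runs; the transactions and the names frequent_itemsets, min_support and max_size are made up for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets appearing in at least min_support transactions."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))  # de-duplicate and fix an order for the tuple keys
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}

transactions = [
    ["milk", "eggs", "bread"],
    ["milk", "bread"],
    ["milk", "eggs"],
    ["bread", "butter"],
]
patterns = frequent_itemsets(transactions, min_support=2)
for itemset, support in sorted(patterns.items()):
    print(itemset, support)
```

Enumerating all combinations grows exponentially with transaction length, which is one reason a tree-based method such as FP-Growth is preferred on real datasets.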

Install Mahout

– sudo apt-get install maven2
– wget http://apache.osuosl.org//mahout/0.5/mahout-distribution-0.5-src.tar.gz
– tar xvf mahout-distribution-0.5-src.tar.gz
– sudo mvn install
– cd core/
– mvn compile
– cd ../examples
– mvn compile
– cd ..
– ./bin/mahout

Pattern mining

Frequent Pattern Mining

• Data: http://fimi.cs.helsinki.fi/data/
• Download http://www.cs.ucy.ac.cy/courses/EPL451/lectures/Lab/new_accidents.tar.gz
• Extract data file: tar xvf new_accidents.tar.gz
• Start hadoop
• Upload data: hadoop dfs -copyFromLocal ./new_accidents.dat new_accidents.dat
• <MAHOUT_PATH>/mahout fpg -i new_accidents.dat -o patterns -method mapreduce -g 10 -regex [\ ]
• ./mahout seqdumper --seqFile patterns/fpgrowth/part-r-00000

Frequent Pattern Mining

• mahout fpg
• MapReduce (parallel) implementation of the FP-Growth algorithm for frequent itemset mining
• Runs each stage of PFPGrowth as described in the paper: http://infolab.stanford.edu/~echang/recsys08-69.pdf

Frequent Pattern Mining

From: PFP: Parallel FP-Growth for Query Recommendation, by Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang. In Proceedings of the 2008 ACM Conference on Recommender Systems.

Frequent Pattern Mining Example Output

• seqdumper prints one record per item key (the records shown are for keys 218, 306 and 175)
• Each record has the form Key: <item>: Value: ([item, item, ...], support), ([item, item, ...], support), ...
• …

Sources

• Mahout Hands on!, Ted Dunning, Robin Anil, OSCON, Portland 2011
• Introduction to Scalable Machine Learning with Apache Mahout, Grant Ingersoll, Thinking Lucid, February 15, 2010
• https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
• http://mahout.apache.org/