Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and useful
to the data owner. - Hand, Mannila & Smyth
Data mining functionality: Association, Classification and Prediction, Cluster analysis, Outlier
and exception data analysis, time series analysis (trend and deviation)
Big data analytics should start from a business problem rather than a technology problem
Challenges: capturing data, data storage, searching and querying, sharing, transfer, analysis,
visualizing
Chapter 2: Introduction to Hadoop Distributed File System (HDFS)
Basic features of HDFS:
● Highly fault-tolerant: using replication
● High throughput: using partitioning
● Suitable for applications with large data sets
● Streaming access to file system data: versus random access
Data characteristics:
● Streaming data access
● Batch processing rather than interactive user access
● Large datasets and files: gigabytes and terabytes size
● High aggregate data bandwidth
● Scale to hundreds of nodes in a cluster
● Tens of millions of files in a single instance
● Write-once-read-many: a file once created, written and closed need not be changed -
this assumption simplifies coherency
● A map-reduce application or web-crawler application fits perfectly with this model
Master/slave architecture
An HDFS cluster consists of a single NameNode and a number of DataNodes
HDFS exposes a file system namespace and allows user data to be stored in files
A file is split into one or more blocks, and the blocks are stored in DataNodes
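As a rough illustration of the block-splitting idea, the number of blocks a file occupies can be computed as below (the 128 MB block size and replication factor 3 are the common defaults, assumed here for illustration):

```python
import math

def hdfs_blocks(file_size_bytes, block_size=128 * 1024 * 1024):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / block_size))

# A 1 GB file with 128 MB blocks spans 8 blocks; with replication
# factor 3, the cluster stores 24 block replicas in total.
blocks = hdfs_blocks(1024 * 1024 * 1024)
replicas = blocks * 3
```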
HDFS architecture:
HDFS is designed to store very large files across machines in a large cluster
The HDFS namespace is stored by the NameNode
HDFS Operation:
Create directory foodir: hdfs dfs -mkdir /foodir
List directory: hdfs dfs -ls (add /path to list a specific path)
Create local file to write: nano test.txt, then write
Put file to HDFS: hdfs dfs -put /path/test.txt
Get file from HDFS: hdfs dfs -get /path/test.txt
Delete file from directory: hdfs dfs -rm /path/test.txt
View file content: hdfs dfs -cat /path/test.txt or hdfs dfs -tail /path/test.txt
Get and merge files: hdfs dfs -getmerge /path/ ./result.txt (merges all files in path into result.txt)
Copy file or directory recursively: hadoop distcp /user/dir1 /path/dir2 (copy dir1 to dir2)
Chapter 3: MapReduce
Key ideas:
● Move processing to the data: clusters have limited network bandwidth
● Process data sequentially, avoid random access: seeks are expensive, sequential disk throughput is high
● Seamless scalability: from the mythical man-month to the tradable machine-hour
MapReduce “runtime”:
● Handles scheduling: assigns workers to map and reduce tasks
● Handles “data distribution”: moves processes to data
● Handles synchronization: gathers, sorts, and shuffles intermediate data
● Handles errors and faults: detects worker failures and restarts them
● Everything happens on top of a distributed FS
Summary:
Big ideas:
● Distribute the data → HDFS
● Move processing to the data → MapReduce
MapReduce:
● Map: processes a key-value pair into a new key-value pair (k,v) → (k’,v’)
● Reduce: processes the mapper outputs that share the same key (i.e., a list of values [v’])
● Reduce: (k’,[v’]) into (k’,v’)
The rest is handled by the execution framework
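The (k,v) → (k’,v’) → (k’,[v’]) flow above can be sketched as a word count in plain Python — a local simulation of the map, shuffle, and reduce phases, not the actual Hadoop API:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """Map: (doc_id, text) -> stream of (word, 1) pairs."""
    for word in text.lower().split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (word, total)."""
    return (word, sum(counts))

def run_mapreduce(docs):
    # Shuffle/sort: group intermediate values by key
    # (in real MapReduce the framework does this step).
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for k, v in map_fn(doc_id, text):
            groups[k].append(v)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

result = run_mapreduce({1: "big data big ideas", 2: "big cluster"})
# result["big"] == 3
```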
Chapter 4: Classification & Regression
Classification & Regression are machine learning techniques
Classification: predicts categorical labels
Regression: predicts continuous-valued functions
Artificial Intelligence (AI): any technique which enables computers to mimic human behavior
Machine Learning: an AI technique that lets computers learn without being explicitly programmed
Deep learning: a subset of ML that makes the computation of multi-layer neural networks feasible
Supervised learning:
● Pros:
○ It can make future predictions
○ Can quantify relationship between predictions and response variables
○ It can show us how variables affect each other and how much
● Cons
○ Requires labeled data
Unsupervised learning
● Pros
○ Can find group of data that behave similarly
○ Can use unlabeled data
○ Can be a preprocessing step for supervised learning
● Cons
○ Has zero predictive power
○ Can be hard to determine if we are on the right track
○ Relies much more on human interpretation
Reinforcement learning
● Pros
○ A very complicated reward system can produce a very sophisticated AI system
○ Can learn in almost any environment, including our own earth
● Cons
○ The agent is erratic at first and makes many terrible choices before good ones
○ It can take a while before the agent avoids bad decisions altogether
○ The agent might play it safe
The KNN algorithm assumes that similar things exist in close proximity (classification)
If K is too small, the model is too sensitive to noise; if K is too large, the neighborhood may include majority points from other classes
● Advantages of KNN
○ Simple and easy to implement
○ The algorithm is versatile: it works for both classification and regression
● Disadvantages of KNN
○ The algorithm gets slower as the number of examples increases
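A minimal sketch of the KNN classification idea, in pure Python on 2-D points (the training data here is invented for illustration):

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.
    `train` is a list of ((x, y), label) pairs."""
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))
    labels = [label for _, label in nearest[:k]]
    return Counter(labels).most_common(1)[0][0]

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
knn_predict(train, (0.5, 0.5), k=3)   # the 3 nearest neighbors are all "A"
```

Note the sensitivity to K mentioned above: with k = 6 the vote would span the entire training set and ties between classes become possible.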
Holdout method: split the data into training and test sets (most commonly 80% training, 20% test)
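The holdout split can be sketched as follows (pure Python; the 80/20 ratio and the fixed seed are illustrative choices):

```python
import random

def holdout_split(data, train_frac=0.8, seed=42):
    """Shuffle the data, then split it into train/test by train_frac."""
    data = list(data)
    random.Random(seed).shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

train, test = holdout_split(range(100))
# len(train) == 80, len(test) == 20
```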
Confusion matrix: tabulates true/false positives and negatives to evaluate a classifier
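For binary classification the confusion matrix counts TP, FP, FN, and TN; a small sketch (the labels below are made up for illustration):

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for a binary classification."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # (2 + 2) / 6
```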
Apache Spark is a cluster computing platform designed to be fast and general purpose
Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL
Spark can run in a Hadoop cluster and access any Hadoop data source
Chapter 8: Spark MLlib
MLlib is Spark’s library of machine learning functions
Provides multiple ML algorithms including: classification, regression, clustering, and collaborative
filtering