
Chapter 1: Introduction to Big Data

Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and useful
to the data owner. - Hand, Mannila & Smyth

Data mining functionality: Association, Classification and Prediction, Cluster analysis, Outlier
and exception data analysis, time series analysis (trend and deviation)

Classification of studies:

Observational study: reports an association, does not need a control group

Experimental study: reports a cause and effect, uses a control group, randomizes & generalizes

Bias refers to results that are systematically off the mark


Examples of bias:
● Sample bias: the sample does not represent the environment the model is expected to operate in
● Exclusion bias: caused by excluding some features from the dataset during data cleaning
● Observer/experimenter bias: tendency to see what we expect or want to see
● Prejudice bias: caused by cultural influences or stereotypes
● Measurement bias: the measuring instrument is faulty, skewing the data in a particular direction

Data scientist skills: data analytics, IT/engineering, business/problem solving

Big data: high-Volume, high-Velocity, high-Variety, (extra: Veracity), (extra: Value)


● Volume: Terabytes to Petabytes
● Variety: Structured and Unstructured
● Velocity: Batch and Streaming
● Veracity (extra): inconsistency and incompleteness of data

Big data analytics should start from a business problem rather than a technology problem
Challenges: capturing data, data storage, searching and querying, sharing, transfer, analysis,
visualizing
Chapter 2: Introduction to Hadoop Distributed File System (HDFS)
Basic features of HDFS:
● Highly fault-tolerant: using replication
● High throughput: using partitioning
● Suitable for applications with large data sets
● Streaming access to file system data: versus random access

Data characteristics:
● Streaming data access
● Batch processing rather than interactive user access
● Large datasets and files: gigabytes and terabytes size
● High aggregate data bandwidth
● Scale to hundreds of nodes in a cluster
● Tens of millions of files in a single instance
● Write-once-read-many: a file once created, written and closed need not be changed -
this assumption simplifies coherency
● A map-reduce application or web-crawler application fits perfectly with this model

A block is a contiguous location on the hard drive where data is stored


A file system stores data as a collection of blocks
A block is the unit of transfer between disk and memory

Fragmentation: Internal vs External

Physical tuning: small or large block?


● Smaller block:
○ less internal fragmentation
○ higher chance external fragmentation for large files
● Larger block:
○ Less chance for external fragmentation
○ Internal fragmentation is negligible for large files
○ Fewer I/O operations
○ Suitable for large files

OS block size is determined during format

Master/slave architecture
An HDFS cluster consists of a single NameNode and a number of DataNodes
HDFS exposes a file system namespace and allows user data to be stored in files
A file is split into one or more blocks, and the set of blocks is stored in DataNodes
HDFS architecture:

HDFS is designed to store very large files across machines in a large cluster
The HDFS namespace is stored by the NameNode

A DataNode stores data in files in its local file system

A DataNode has no knowledge of the HDFS file system

HDFS Operations:
Create directory foodir: hdfs dfs -mkdir /foodir
List directory: hdfs dfs -ls (add /path to list a specific path)
Create local file to write: nano test.txt, then write
Put file to HDFS: hdfs dfs -put /path/test.txt
Get file from HDFS: hdfs dfs -get /path/test.txt
Delete file from directory: hdfs dfs -rm /path/test.txt
View file content: hdfs dfs -cat /path/test.txt or hdfs dfs -tail /path/test.txt
Get and merge files: hdfs dfs -getmerge /path/ ./result.txt (merges all files under /path into result.txt)
Copy file or directory recursively: hadoop distcp /user/dir1 /path/dir2 (copy dir1 to dir2)

Chapter 3: Introduction to Map Reduce


Typical large-data problems:
● Iterate over a large number of records
● Extract something of interest from each
● Shuffle and sort intermediate results
● Generate final output
● The problem:
○ Diverse input formats (data diversity & heterogeneity)
○ Large scale: terabytes, petabytes
○ Parallelization: problems arise from communication between workers and access to
shared resources
Current Tools:
● Programming models:
○ Shared memory (pthreads)
○ Message passing (MPI)
● Design patterns:
○ Master-slaves
○ Producer-consumer flows
○ Shared work queues

Key ideas:
● Move processing to the data: clusters have limited network bandwidth
● Process data sequentially, avoid random access: seeks are expensive, disk throughput is reasonable
● Seamless scalability: from the mythical man-month to the tradable machine-hour

Google File System (GFS) for Google’s MapReduce

HDFS for Hadoop

Core Hadoop has 2 main systems:


● YARN/MapReduce: distributed big data processing infrastructure
● HDFS: fault-tolerant, high bandwidth, high availability distributed storage

MapReduce has two functions (see the word-count sketch after this list):


● Map (k,v) -> [(k’,v’)]
● reduce(k’,[v’]) -> [(k’,v’)]
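A minimal in-memory word-count sketch of these two signatures (plain Python, no Hadoop involved; the tiny document set is an illustrative assumption):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # map(k, v) -> [(k', v')]: emit (word, 1) for every word in the document
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce(k', [v']) -> (k', v'): sum the partial counts for one key
    return (word, sum(counts))

documents = {1: "big data big ideas", 2: "data moves to data"}

# "Shuffle and sort": group intermediate values by key
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for key, value in map_fn(doc_id, text):
        grouped[key].append(value)

result = [reduce_fn(word, counts) for word, counts in grouped.items()]
print(sorted(result))  # [('big', 2), ('data', 3), ('ideas', 1), ('moves', 1), ('to', 1)]
```

In real Hadoop, the grouping step and the distribution of map and reduce tasks are handled by the runtime described below.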

MapReduce “runtime”:
● Handles scheduling: assigns workers to map and reduce tasks
● Handles “data distribution”: moves processes to data
● Handles synchronization: gathers, sorts, and shuffles intermediate data
● Handles errors and faults: detects worker failures and restarts
● Everything happens on top of a distributed FS

Summary:
Big ideas:
● Distribute the data → HDFS
● Move processing to data → MapReduce
MapReduce:
● Map: processes a key-value pair into a new key-value pair (k,v) → (k’,v’)
● Reduce: processes mapper outputs that have the same key (i.e., a list of values [v’])
● Reduce: (k’,[v’]) into (k’,v’)
The rest is handled by the execution framework
Chapter 4: Classification & Regression
Classification & regression are machine learning techniques

Classification: categorical
Regression: continuous-valued functions

Artificial Intelligence (AI): any technique which enables computers to mimic human behavior
Machine Learning: AI techniques that enable computers to learn without being explicitly programmed
Deep Learning: subset of ML which makes the computation of multi-layer neural networks feasible

Classic Algorithm: data + algorithm (through machine) = output


Machine learning: data + output (through machine) = algorithm

Types of machine learning: supervised, unsupervised, and reinforcement

Supervised learning:
● Pros:
○ It can make future predictions
○ Can quantify relationship between predictions and response variables
○ It can show us how variables affect each other and how much
● Cons
○ Requires labeled data
Unsupervised learning
● Pros
○ Can find groups of data that behave similarly
○ Can use unlabeled data
○ Can be a preprocessing step for supervised learning
● Cons
○ Has zero predictive power
○ Can be hard to determine if we are on the right track
○ Relies much more on human interpretation
Reinforcement learning
● Pros
○ Very complicated rewards system = very complicated AI system
○ Can learn in almost any environment, including our own earth
● Cons
○ The agent is erratic at first and makes many terrible choices before making good ones
○ It can take a while before the agent avoids decisions altogether
○ The agent might play it safe

Classification has 2 steps: the learning step & the classification step

The KNN algorithm assumes that similar things exist in close proximity (classification)
If K is too small, the model is too sensitive; if K is too large, it may include majority points from other classes (see the sketch after the list below)
● Advantages of KNN
○ Simple and easy to implement
○ The algorithm is versatile: it can do both classification and regression
● Disadvantages of KNN
○ The algorithm gets slower as the number of examples increases
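A minimal scikit-learn sketch of KNN classification; the iris dataset, the 80/20 split, and k=5 are illustrative assumptions, not from the notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_neighbors is K: too small = too sensitive, too large = majority points from other classes leak in
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```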

Naive Bayes (Bayes’ theorem)


Determine P(H|X), i.e., the probability of hypothesis H given data X (the posterior probability): P(H|X) = P(X|H) · P(H) / P(X)
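A minimal sketch using scikit-learn's GaussianNB, which applies Bayes' theorem under a feature-independence assumption (the iris data here is purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.predict_proba(X_test[:3]))  # posterior P(H|X) for each class
print(nb.predict(X_test[:3]))        # class with the highest posterior
```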

Holdout method: split data into training and test sets (most common: 80% training, 20% test)

Cross-validation method: data is randomly partitioned into k subsets (most common: k=10)
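A sketch of both evaluation schemes with scikit-learn, using the 80/20 split and k=10 mentioned above (the dataset and classifier are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Holdout: a single random 80% training / 20% test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = KNeighborsClassifier().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Cross-validation: random partition into k=10 subsets, train/test 10 times
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=10)
print("10-fold mean accuracy:", scores.mean())
```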

Confusion matrix:

F1-measure: weighted average (harmonic mean) of precision and recall
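A sketch of computing the confusion matrix, precision, recall, and F1 from predictions (the label vectors are toy values, purely illustrative):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))            # rows = actual class, columns = predicted class
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))             # 2 * precision * recall / (precision + recall)
```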

Chapter 5: Data Preprocessing


Data could be incomplete, noisy (incorrect attribute values), and inconsistent

Major tasks in data preprocessing:


● Data cleaning
● Data integration
● Data transformation
● Data reduction
● Data discretization

Data transformation: normalization


● Min-max normalization
● Z-score normalization

● Normalization by decimal scaling
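A sketch of the three schemes on a toy 1-D attribute (the values are illustrative; the formulas are the standard definitions):

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [0, 1]: (x - min) / (max - min)
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: (x - mean) / std
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so that max(|x| / 10^j) < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```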

Chapter 6: Clustering Technique


Clustering is a family of unsupervised learning techniques that attempt to group data points into subsets

Unsupervised methods are used when there is no clear response/target variable

Basic clustering method:


● Partitioning methods (k-means clustering)
● Hierarchical methods
● Density-based methods
● Grid based methods

The silhouette coefficient is used to choose the optimal number of clusters k in k-means
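A sketch of choosing k by the silhouette coefficient with scikit-learn (the synthetic blobs and the candidate range 2..6 are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    # the k with the highest average silhouette coefficient is the preferred choice
    print(k, round(silhouette_score(X, labels), 3))
```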

Hierarchical methods are divided into 2 types:


● Agglomerative: bottom-up approach
● Divisive: top-down approach
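A sketch of the agglomerative (bottom-up) variant with scikit-learn (toy blobs and 3 clusters are illustrative assumptions):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Ward linkage merges the pair of clusters that least increases within-cluster variance
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print(labels[:10])
```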

Density-based spatial clustering of applications with noise (DBSCAN)


DBSCAN is a density based clustering algorithm that can neatly handle noise
DBSCAN algorithm:
● Input
○ D: dataset containing n objects
○ E: the radius parameter
○ minPts: the neighborhood density threshold
● Output: a set of density-based clusters
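A sketch with scikit-learn's DBSCAN, where eps plays the role of the radius parameter E and min_samples the role of minPts (the toy data and parameter values are illustrative assumptions):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; label -1 marks noise points
```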

Chapter 7: Apache Spark


Spark → open source

Apache Spark is a cluster computing platform designed to be fast and general purpose
Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL
Spark can run in a Hadoop cluster and access any Hadoop data source
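A minimal PySpark sketch: a local SparkSession reading a text file and counting words (the HDFS path is a placeholder assumption):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro").master("local[*]").getOrCreate()

lines = spark.read.text("hdfs:///path/test.txt")   # could equally be a local file or another Hadoop data source
word_counts = (lines.rdd
               .flatMap(lambda row: row.value.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(word_counts.take(5))
spark.stop()
```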
Chapter 8: Spark MLlib
MLlib is Spark’s library of machine learning functions
Provides multiple ML algorithms, including: classification, regression, clustering, collaborative
filtering

Spark MLlib tools:


● ML Algorithms
● Featurization: feature extraction, transformation, dimensionality reduction, and selection
● Pipelines: tools for constructing, evaluating, and tuning ML pipelines
● Persistence: saving and loading algorithms, models, and pipelines
● Utilities: linear algebra, statistics, data handling
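A sketch of a spark.ml pipeline combining featurization and an ML algorithm (the column names and the tiny DataFrame are illustrative assumptions):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-demo").master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (5.0, 6.0, 1.0), (6.0, 5.0, 1.0)],
    ["x1", "x2", "label"],
)

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")  # featurization
lr = LogisticRegression(featuresCol="features", labelCol="label")          # ML algorithm
model = Pipeline(stages=[assembler, lr]).fit(df)                           # fitted pipeline can be saved/loaded (persistence)
model.transform(df).select("x1", "x2", "prediction").show()
spark.stop()
```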
