
Chapter 1: Introduction to Big Data

Data mining is the analysis of (often large) observational data sets to find unsuspected
relationships and to summarize the data in novel ways that are both understandable and useful
to the data owner. - Hand, Mannila & Smyth

Data mining functionality: Association, Classification and Prediction, Cluster analysis, Outlier
and exception data analysis, time series analysis (trend and deviation)

Classification of studies:

Observational study: reports an association, does not need a control group

Experimental study: reports a cause and effect, uses a control group, randomizes & generalizes

Bias refers to results that are systematically off the mark


Examples of bias:
● Sample bias: the sample does not represent the environment the model is expected to operate in
● Exclusion bias: caused by excluding some features from the dataset during data cleaning
● Observer/experimenter bias: tendency to see what we expect or want to see
● Prejudice bias: caused by cultural influences or stereotypes
● Measurement bias: the measuring instrument is faulty, skewing the data in a particular direction

Data scientist skills: data analytics, IT/engineering, business/problem solving

Big data: high-Volume, high-Velocity, high-Variety, (extra: Veracity), (extra: Value)


● Volume: Terabytes to Petabytes
● Variety: Structured and Unstructured
● Velocity: Batch and Streaming
● Veracity (extra): inconsistency and incompleteness of data

Big data analytics should start from a business problem rather than a technology problem
Challenges: capturing data, data storage, searching and querying, sharing, transfer, analysis,
visualizing
Chapter 2: Introduction to Hadoop Distributed File System (HDFS)
Basic features of HDFS:
● Highly fault-tolerant: using replication
● High throughput: using partitioning
● Suitable for applications with large data sets
● Streaming access to file system data: versus random access

Data characteristics:
● Streaming data access
● Batch processing rather than interactive user access
● Large datasets and files: gigabytes and terabytes size
● High aggregate data bandwidth
● Scale to hundreds of nodes in a cluster
● Tens of millions of files in a single instance
● Write-once-read-many: a file once created, written and closed need not be changed -
this assumption simplifies coherency
● A map-reduce application or web-crawler application fits perfectly with this model

A block is a contiguous location on the hard drive where data is stored


A file system stores data as a collection of blocks
A block is the unit of transfer between disk and memory

Fragmentation: Internal vs External

Physical tuning: small or large block?


● Smaller block:
○ less internal fragmentation
○ higher chance external fragmentation for large files
● Larger block:
○ Less chance for external fragmentation
○ Internal fragmentation is negligible for large files
○ Fewer I/O operations
○ Suitable for large files

OS block size is determined during format

Master/slave architecture
An HDFS cluster consists of a single NameNode and a number of DataNodes
HDFS exposes a file system namespace and allows user data to be stored in files
A file is split into one or more blocks, and the set of blocks is stored in DataNodes
HDFS architecture:

HDFS is designed to store very large files across machines in a large cluster
The HDFS namespace is stored by the NameNode

A DataNode stores data in files in its local file system

A DataNode has no knowledge of the HDFS file system

HDFS Operations:
Create directory foodir: hdfs dfs -mkdir /foodir
List directory: hdfs dfs -ls (add /path to list a specific path)
Create local file to write: nano test.txt, then write
Put file to HDFS: hdfs dfs -put /path/test.txt
Get file from HDFS: hdfs dfs -get /path/test.txt
Delete file from directory: hdfs dfs -rm /path/test.txt
View file content: hdfs dfs -cat /path/test.txt or hdfs dfs -tail /path/test.txt
Get and merge files: hdfs dfs -getmerge /path/ ./result.txt (merges all files under /path into result.txt)
Copy file or directory recursively: hadoop distcp /user/dir1 /path/dir2 (copy dir1 to dir2)

Chapter 3: Introduction to Map Reduce


Typical large-data problems:
● Iterate over a large number of records
● Extract something of interest from each
● Shuffle and sort intermediate results
● Generate final output
● The problem:
○ Diverse input formats (data diversity & heterogeneity)
○ Large scale: terabytes, petabytes
○ Parallelization: problems arise from communication between workers and access to
shared resources
Current Tools:
● Programming models:
○ Shared memory (pthreads)
○ Message passing (MPI)
● Design patterns:
○ Master-slaves
○ Producer-consumer flows
○ Shared work queues

Key ideas:
● Move processing to the data: clusters have limited network bandwidth
● Process data sequentially, avoid random access: seeks are expensive, disk throughput is reasonable
● Seamless scalability: from the mythical man-month to the tradable machine-hour

Google File System (GFS) for Google’s MapReduce

HDFS for Hadoop

Core Hadoop has 2 main systems:


● YARN/MapReduce: distributed big data processing infrastructure
● HDFS: fault-tolerant, high bandwidth, high availability distributed storage

MapReduce has two functions (see the word-count sketch after this list):


● Map (k,v) -> [(k’,v’)]
● reduce(k’,[v’]) -> [(k’,v’)]
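A minimal in-memory word-count sketch of these two signatures (plain Python, no Hadoop involved; the tiny document set is an illustrative assumption):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # map(k, v) -> [(k', v')]: emit (word, 1) for every word in the document
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce(k', [v']) -> (k', v'): sum the partial counts for one key
    return (word, sum(counts))

documents = {1: "big data big ideas", 2: "data moves to data"}

# "Shuffle and sort": group intermediate values by key
grouped = defaultdict(list)
for doc_id, text in documents.items():
    for key, value in map_fn(doc_id, text):
        grouped[key].append(value)

result = [reduce_fn(word, counts) for word, counts in grouped.items()]
print(sorted(result))  # [('big', 2), ('data', 3), ('ideas', 1), ('moves', 1), ('to', 1)]
```

In real Hadoop, the grouping step and the distribution of map and reduce tasks are handled by the runtime described below.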

MapReduce “runtime”:
● Handles scheduling: assigns workers to map and reduce tasks
● Handles “data distribution”: moves processes to data
● Handles synchronization: gathers, sorts, and shuffles intermediate data
● Handles errors and faults: detects worker failures and restarts
● Everything happens on top of a distributed FS

Summary:
Big ideas:
● Distribute the data → HDFS
● Move processing to data → MapReduce
MapReduce:
● Map: processes a key-value pair into a new key-value pair (k,v) → (k’,v’)
● Reduce: processes mapper outputs that have the same key (i.e., a list of values [v’])
● Reduce: (k’,[v’]) into (k’,v’)
The rest is handled by the execution framework
Chapter 4: Classification & Regression
Classification & regression are machine learning techniques

Classification: categorical
Regression: continuous-valued functions

Artificial Intelligence (AI): any technique which enables computers to mimic human behavior
Machine Learning: AI techniques that enable computers to learn without being explicitly programmed
Deep Learning: subset of ML which makes the computation of multi-layer neural networks feasible

Classic Algorithm: data + algorithm (through machine) = output


Machine learning: data + output (through machine) = algorithm

Types of machine learning: supervised, unsupervised, and reinforcement

Supervised learning:
● Pros:
○ It can make future predictions
○ Can quantify relationship between predictions and response variables
○ It can show us how variables affect each other and how much
● Cons
○ Requires labeled data
Unsupervised learning
● Pros
○ Can find groups of data that behave similarly
○ Can use unlabeled data
○ Can be a preprocessing step for supervised learning
● Cons
○ Has zero predictive power
○ Can be hard to determine if we are on the right track
○ Relies much more on human interpretation
Reinforcement learning
● Pros
○ Very complicated rewards system = very complicated AI system
○ Can learn in almost any environment, including our own earth
● Cons
○ The agent is erratic at first and makes many terrible choices before making good ones
○ It can take a while before the agent avoids decisions altogether
○ The agent might play it safe

Classification has 2 steps: the learning step & the classification step

The KNN algorithm assumes that similar things exist in close proximity (classification)
If K is too small, the model is too sensitive; if K is too large, it may include majority points from other classes (see the sketch after the list below)
● Advantages of KNN
○ Simple and easy to implement
○ The algorithm is versatile: it can do both classification and regression
● Disadvantages of KNN
○ The algorithm gets slower as the number of examples increases
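A minimal scikit-learn sketch of KNN classification; the iris dataset, the 80/20 split, and k=5 are illustrative assumptions, not from the notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# n_neighbors is K: too small = too sensitive, too large = majority points from other classes leak in
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```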

Naive Bayes (Bayes’ theorem)


Determine P(H|X), i.e., the probability of hypothesis H given data X (the posterior probability): P(H|X) = P(X|H) · P(H) / P(X)
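A minimal sketch using scikit-learn's GaussianNB, which applies Bayes' theorem under a feature-independence assumption (the iris data here is purely illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print(nb.predict_proba(X_test[:3]))  # posterior P(H|X) for each class
print(nb.predict(X_test[:3]))        # class with the highest posterior
```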

Holdout method: split data into training and test sets (most common: 80% training, 20% test)

Cross-validation method: data is randomly partitioned into k subsets (most common: k=10)
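A sketch of both evaluation schemes with scikit-learn, using the 80/20 split and k=10 mentioned above (the dataset and classifier are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Holdout: a single random 80% training / 20% test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = KNeighborsClassifier().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Cross-validation: random partition into k=10 subsets, train/test 10 times
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=10)
print("10-fold mean accuracy:", scores.mean())
```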

Confusion matrix:

F1-measure: weighted average (harmonic mean) of precision and recall
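A sketch of computing the confusion matrix, precision, recall, and F1 from predictions (the label vectors are toy values, purely illustrative):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))            # rows = actual class, columns = predicted class
print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))             # 2 * precision * recall / (precision + recall)
```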

Chapter 5: Data Preprocessing


Data could be incomplete, noisy (incorrect attribute values), and inconsistent

Major tasks in data preprocessing:


● Data cleaning
● Data integration
● Data transformation
● Data reduction
● Data discretization

Data transformation: normalization


● Min-max normalization
● Z-score normalization

● Normalization by decimal scaling
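A sketch of the three schemes on a toy 1-D attribute (the values are illustrative; the formulas are the standard definitions):

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max normalization to [0, 1]: (x - min) / (max - min)
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: (x - mean) / std
z_score = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so that max(|x| / 10^j) < 1
j = int(np.floor(np.log10(np.abs(x).max()))) + 1
decimal_scaled = x / (10 ** j)

print(min_max, z_score, decimal_scaled, sep="\n")
```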

Chapter 6: Clustering Technique


Clustering is a family of unsupervised learning techniques that attempt to group data points into subsets

Unsupervised methods are used when there is no clear response/target variable

Basic clustering method:


● Partitioning methods (k-means clustering)
● Hierarchical methods
● Density-based methods
● Grid based methods

The silhouette coefficient is used to choose the optimal number of clusters k in k-means
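A sketch of choosing k by the silhouette coefficient with scikit-learn (the synthetic blobs and the candidate range 2..6 are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=7)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    # the k with the highest average silhouette coefficient is the preferred choice
    print(k, round(silhouette_score(X, labels), 3))
```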

Hierarchical methods are divided into 2 types:


● Agglomerative: bottom-up approach
● Divisive: top-down approach
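A sketch of the agglomerative (bottom-up) variant with scikit-learn (toy blobs and 3 clusters are illustrative assumptions):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Ward linkage merges the pair of clusters that least increases within-cluster variance
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)
print(labels[:10])
```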

Density-based spatial clustering of applications with noise (DBSCAN)


DBSCAN is a density based clustering algorithm that can neatly handle noise
DBSCAN algorithm:
● Input
○ D: dataset containing n objects
○ E: the radius parameter
○ minPts: the neighborhood density threshold
● Output: a set of density-based clusters
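A sketch with scikit-learn's DBSCAN, where eps plays the role of the radius parameter E and min_samples the role of minPts (the toy data and parameter values are illustrative assumptions):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(set(labels))  # cluster ids; label -1 marks noise points
```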

Chapter 7: Apache Spark


Spark → open source

Apache Spark is a cluster computing platform designed to be fast and general purpose
Spark is designed to be highly accessible, offering simple APIs in Python, Java, Scala, and SQL
Spark can run in a Hadoop cluster and access any Hadoop data source
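A minimal PySpark sketch: a local SparkSession reading a text file and counting words (the HDFS path is a placeholder assumption):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("intro").master("local[*]").getOrCreate()

lines = spark.read.text("hdfs:///path/test.txt")   # could equally be a local file or another Hadoop data source
word_counts = (lines.rdd
               .flatMap(lambda row: row.value.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
print(word_counts.take(5))
spark.stop()
```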
Chapter 8: Spark MLlib
MLlib is Spark’s library of machine learning functions
Provides multiple ML algorithms, including: classification, regression, clustering, collaborative
filtering

Spark MLlib tools:


● ML Algorithms
● Featurization: feature extraction, transformation, dimensionality reduction, and selection
● Pipelines: tools for constructing, evaluating, and tuning ML pipelines
● Persistence: saving and loading algorithms, models, and pipelines
● Utilities: linear algebra, statistics, data handling
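A sketch of a spark.ml pipeline combining featurization and an ML algorithm (the column names and the tiny DataFrame are illustrative assumptions):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mllib-demo").master("local[*]").getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 0.0), (5.0, 6.0, 1.0), (6.0, 5.0, 1.0)],
    ["x1", "x2", "label"],
)

assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")  # featurization
lr = LogisticRegression(featuresCol="features", labelCol="label")          # ML algorithm
model = Pipeline(stages=[assembler, lr]).fit(df)                           # fitted pipeline can be saved/loaded (persistence)
model.transform(df).select("x1", "x2", "prediction").show()
spark.stop()
```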
