BITS Pilani, Pilani Campus
Email: kanantharaman@wilp.bits-pilani.ac.in
➢ Motivation
– Why do modern enterprises need to work with data?
– What is Big Data and data classification
– Scaling RDBMS
➢ Architecture
– Data warehouse
– High level architecture of Big Data solutions
Source: https://www.guru99.com/what-is-big-data.html
➢ Structured data
➢ Semi-structured data
➢ Unstructured data
[Figure: semi-structured data (e.g., XHTML, SGML and other markup languages) contrasted with unstructured data]
• Volatility
• Volatility of data deals with how long the data is valid and how long it should be stored.
• Variability
• Data flows can be highly inconsistent, with periodic peaks.
• Data mining
✓ Association rule mining, e.g. market basket or affinity analysis
✓ Regression, e.g. predict dependent variable from independent variables
✓ Collaborative filtering, e.g. predict a user preference from group preferences
✓ NLP - e.g. Human to Machine interaction, conversational systems
✓ Text Analytics - e.g. sentiment analysis, search
✓ Noisy text analytics - e.g. spell correction, speech to text
• Network Processing
✓ Analyze data as it streams over network
✓ Enables real-time analytics on remote data
✓ Network latency affects speed
• Hybrid Strategies
✓ Combine in-memory, disk, and network processing
✓ Leverage benefits of each approach
✓ Balance real-time and batch needs
✓ Mitigate individual limitations
✓ Optimal mix depends on:
✓ Data sizes, analytics needs
✓ Infrastructure costs
✓ Real-time vs batch requirements
Disk Access Patterns: In big data systems, when processing a large dataset,
algorithms often read data from disk. These algorithms can benefit from
spatial locality if data on the same disk blocks or in nearby regions are
frequently accessed. This can be observed in data processing frameworks
like Hadoop, which are optimized for processing data with spatial locality.
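To make the idea concrete, here is a minimal Python sketch (illustrative only; the block size and file paths are assumptions, not Hadoop code) contrasting a block-sized sequential scan, which exploits spatial locality, with scattered random reads:

BLOCK_SIZE = 64 * 1024 * 1024  # assumed 64 MB, mirroring a classic HDFS block size

def sequential_scan(path):
    # Reads the file front to back in block-sized chunks, so each disk seek
    # fetches a large run of adjacent data (good spatial locality).
    with open(path, 'rb') as f:
        chunk = f.read(BLOCK_SIZE)
        while chunk:
            yield chunk
            chunk = f.read(BLOCK_SIZE)

def random_reads(path, offsets, size=4096):
    # Jumps to scattered offsets; every read may incur a fresh seek
    # (poor spatial locality).
    with open(path, 'rb') as f:
        for off in offsets:
            f.seek(off)
            yield f.read(size)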
Impact of Latency (2)
• Algorithms and Data Structures: Big data systems leverage algorithms and data
structures that take advantage of locality of reference to optimize performance.
For example, B-trees are used in databases to efficiently access data based on
spatial locality, reducing the number of disk I/O operations.
[Figure: three VM architectures in (b), (c), and (d), compared with the traditional physical machine shown in (a)]
1. What is big data analytics and what it isn't?
a) To understand the significance of big data analytics.
• MS Excel
https://support.office.microsoft.com/en-in/article/Whats-new-in-Excel-2013-1cbc42cd-bfaf-43d7-9031-5688ef1392fd?CorrelationId=1a2171cc-191f-47de-8a55-08a5f2e9c739&ui=en-US&rs=en-IN&ad=IN
• SAS
http://www.sas.com/en_us/home.htm
• IBM SPSS Modeler
http://www-01.ibm.com/software/analytics/spss/products/modeler/
• Should you be storing all of your big data? If “Yes”, where are you going to store it? If “No”, how will you know what to store and what to discard?
• How will you sieve through your massive data to separate the relevant from the irrelevant?
• How long will you store this data?
• How will you accommodate the peaks (variability in terms of data influx) in your data?
• How will you analyze? Will you analyze all the data that is stored, or analyze a sample?
• What will you do with the insights generated from this analysis?
• NoSQL
❖ What is it?
❖ Types of NoSQL Databases
❖ Why NoSQL?
❖ Advantages of NoSQL
❖ NoSQL Vendors
❖ SQL versus NoSQL
❖ NewSQL
❖ Comparison of SQL, NoSQL and NewSQL
• Hadoop
– Features of Hadoop
– Key Advantages of Hadoop
– Versions of Hadoop
For example: Cassandra, HBase, etc.
Pig
It is a high-level scripting platform used with Hadoop. It serves as an alternative to writing MapReduce code directly. It has two parts:
Pig Latin: a SQL-like scripting language.
Pig runtime: the runtime environment in which Pig Latin scripts are executed.
Hive:
Hive is a data warehouse software project built on top of Hadoop. The three main tasks performed by Hive are summarization, querying and analysis.
Impala:
It is a high-performance SQL engine that runs on a Hadoop cluster. It is ideal for interactive analysis and has very low latency, measured in milliseconds. It supports a dialect of SQL called Impala SQL.
Data Connectors: Various connectors and APIs are used to fetch data
from sources like databases, cloud storage, and web services.
• Data ingestion sets the stage for the rest of the Big Data
Lifecycle, making it imperative to gather, clean, and
organize data from different sources for subsequent
storage, processing, and analysis.
• Definition:
• Data storage involves the management of ingested data and its organization into a format that is accessible for analysis. This stage addresses the challenge of managing vast amounts of data efficiently.
§ Example:
§ Some common scenarios where divide-and-conquer is used include sorting, searching, and matrix multiplication; a minimal sorting sketch follows below.
§ It's like a team of chefs in a restaurant kitchen; each chef has a specific job to prepare a delicious meal.
Hadoop is:
• Ever wondered why Hadoop has been, and remains, one of the most sought-after technologies?
• The key consideration (the rationale behind its huge popularity) is:
• its capability to handle massive amounts of data, across different categories of data, fairly quickly.
• The other considerations are:
• Commodity hardware
• Distributed computing
• Scalable horizontally
• Flexible storage
• Failsafe
MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
NameNode:
• Single NameNode per cluster.
• Keeps the metadata details (file system namespace and block locations).
DataNode:
• Multiple DataNodes per cluster.
• Serves block read/write operations.
SecondaryNameNode:
• Housekeeping daemon; periodically merges the namespace image with the edit log.
• As per the Hadoop replica placement strategy, the first replica is placed on the same node as the client. The second replica is placed on a node on a different rack. The third replica is placed on the same rack as the second, but on a different node in that rack. Once the replica locations have been set, a pipeline is built. This strategy provides good reliability.
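The placement policy can be sketched in a few lines of Python (a toy model with made-up node and rack names, not actual HDFS code):

import random

def place_replicas(client_node, racks):
    # racks: dict mapping rack name -> list of node names (toy topology)
    client_rack = next(r for r, nodes in racks.items() if client_node in nodes)
    # 1st replica: on the same node as the client
    first = client_node
    # 2nd replica: on a node in a different rack
    remote_rack = random.choice([r for r in racks if r != client_rack])
    second = random.choice(racks[remote_rack])
    # 3rd replica: same rack as the 2nd, but a different node
    third = random.choice([n for n in racks[remote_rack] if n != second])
    return [first, second, third]

racks = {'rack1': ['n1', 'n2', 'n3'], 'rack2': ['n4', 'n5', 'n6']}
print(place_replicas('n1', racks))  # e.g. ['n1', 'n5', 'n4']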
1 TB = 10^6 MB
Transfer speed = 100 MB/sec
Time taken = 10^6 / 100 = 10^4 sec = 10,000 sec ≈ 167 minutes
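The same calculation in Python, for checking the numbers (the size and transfer speed are the assumed values above):

total_mb = 10**6            # 1 TB expressed in MB
speed_mb_per_sec = 100      # assumed sequential transfer speed
seconds = total_mb / speed_mb_per_sec
print(seconds, 'sec =', round(seconds / 60), 'min')  # 10000.0 sec = 167 min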
Hadoop Design Principles
# Mapper
def mapper(document):
    # Split document string into words
    words = document.split()
    for word in words:
        # Emit each word with a count of 1
        print('%s\t%d' % (word, 1))
• Intermediate results
# Sample data mapped to words and counts
words = {'the': [1, 1],
'cow': [1, 1],
'jumps': [1, 1],
'over': [1, 1],
'moon': [1],
'again': [1]}
Define Reducer function
# Reducer
def reducer(word, counts):
    # Sum the counts for each instance of a word
    total = 0
    for count in counts:
        total += count
    # Emit word with final count
    print('%s\t%d' % (word, total))
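To see the map and reduce steps working together, here is a small end-to-end sketch that simulates the shuffle step Hadoop would normally perform between them (the map and reduce logic is inlined for clarity, and the two toy input documents are chosen to reproduce the sample intermediate data shown above):

from collections import defaultdict

documents = ['the cow jumps over', 'the cow jumps over moon again']

# Map phase: emit (word, 1) pairs from every document
pairs = []
for doc in documents:
    for word in doc.split():
        pairs.append((word, 1))

# Shuffle phase: group the counts by word (done by the framework in Hadoop)
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word
for word, counts in grouped.items():
    print('%s\t%d' % (word, sum(counts)))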
TestDFSIO:
• Benchmark tool to test HDFS throughput for large file reads/writes.
• It measures aggregate read/write throughput across the cluster by testing under varied conditions and file sizes.
Expected Operational Metrics:
• Read throughput (MB/sec)
• Write throughput (MB/sec)
• Average I/O rate
MRBench :
• Helpful for testing incremental changes in the system.
• Micro-benchmark suite to test various aspects of MapReduce operation.
• Includes tests like sort, grep, aggregation, join, statistics metrics, etc.
Expected Operational Metrics:
• Completion time for sort, word count etc
• Processor utilization percentage
• Mapper/reducer task numbers
• Map input/output records
Benchmark Programs (4)
GridMix
• Creates a controllable MapReduce workload mix to
simulate production load.
• Workloads can be customized to match variety of use
cases based on needs.
• Helps test cluster performance for production
environments.
Expected Operational Metrics:
• MapReduce job latency distribution
• Overall execution time
• Network utilization
• CPU/memory usage per node
Benchmark Programs (5)
YARN benchmarking
Not a failover node
• Let us assume that the file “Sample.txt” is of size 192 MB. As per the default block size (64 MB), it will be split into three blocks and replicated across the nodes of the cluster based on the default replication factor.
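A quick Python check of the block arithmetic (the block size and replication factor are the defaults assumed above):

import math

file_size_mb = 192
block_size_mb = 64      # default block size assumed above
replication = 3         # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
print(blocks, 'blocks,', blocks * replication, 'stored block replicas')  # 3 blocks, 9 stored block replicas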
HDFS Replication (Tunable)
Replication factor = 3
File size is 3.2 MB
Ø help command:
Ø hadoop dfs -help | more
HDFS 2 Features
1. Horizontal scalability.
2. High availability.
• HDFS Federation uses multiple independent
NameNodes for horizontal scalability. NameNodes are
independent of each other.
➢ Running Pig
➢ Execution Modes of Pig
➢ Relational Operators
➢ Eval Function
➢ Piggy Bank
➢ When to use Pig?
➢ When NOT to use Pig?
➢ Pig versus Hive
Source: analyticsvidhya.
✓ LimitOptimizer: If the limit operator is applied just after a load or sort operator, then Pig converts these operators into a limit-sensitive implementation, which avoids processing the whole data set (see the sketch after this list).
✓ ColumnPruner: This function omits the columns that are never used, and hence reduces the size of the record. It can be applied after each operator to prune the fields aggressively and frequently.
✓ MapKeyPruner: This function omits the map keys that are never used, hence reducing the size of the record.
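The effect of LimitOptimizer can be mimicked in Python with lazy evaluation (an illustrative sketch, not Pig internals; the file name is made up): because records are produced on demand, applying the limit right after the load means only the first few records are ever read.

from itertools import islice

def load(path):
    # Lazily yield one record at a time instead of materializing the data set
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')

# Limit applied directly after load: only 10 records are read from disk;
# the rest of the file is never processed.
first_ten = list(islice(load('input.txt'), 10))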
Source: analyticsvidhya.
Source: Medium
1. Interactive Mode.
2. Batch Mode.
➢ Once you get the grunt prompt, you can type the Pig Latin statement as
shown below.
➢ Syntax
✓ pig filename
1. Pig Latin statements are the basic constructs used to process data in Pig.
2. A Pig Latin statement is an operator.
3. An operator in Pig Latin takes a relation as input and yields another relation as output.
4. Pig Latin statements include schemas and expressions.
5. Pig Latin statements end with a semicolon (;).
Reading Data
To read data in Pig, we need to move the data from the local file system into Hadoop. Let's see the steps:
Step 1: Create a file using the cat command in the local file system.
Step 2: Transfer the file into HDFS using the put command.
Step 3: Read the data from HDFS into Pig using the load command.
Relation − We have to provide the name of the relation into which the file content will be loaded.
Input file path − We have to provide the path of the HDFS directory where the file is stored.
load_function − Apache Pig provides a variety of load functions like BinStorage, JsonLoader, PigStorage and TextLoader; we need to choose one from this set. PigStorage is the most commonly used function, as it is suited to loading structured text files.
Schema − We need to define the schema of the data in the passed file in parentheses.
Source: Edureka.
Piggy Bank
2. When there is a time constraint because Pig is slower than MapReduce jobs.
Source: geeksforgeeks
Introduction to Hive
➢ What is Hive?
➢ Hive Architecture
➢ Hive Data Types
➢ Primitive Data Types
➢ Collection Data Types
➢ Hive File Format
➢ Text File
➢ Sequential File
➢ RCFile (Record Columnar File)
➢ SerDe
• ETL and Data warehousing tool developed on top of Hadoop Distributed File
System (HDFS)
• Provides an SQL-like interface between the user and the Hadoop distributed
file system (HDFS)
• Hive makes job easy for performing operations like
✓ Data encapsulation
✓ Ad-hoc queries
✓ Analysis of huge datasets
✓ Support for data query and analysis using SQL
✓ Processing of structured and semi-structured data
• HQL translates SQL-like queries into MapReduce jobs, much as Pig Latin does, and uses HDFS for storage
✓ No need to learn Java to work with Hadoop Hive
Source : datadog
Source : davidscoding
Applications of Apache Hive
You can use Apache Hive mainly for
✓Data Warehousing
✓Ad-hoc Analysis
Source: geeksforgeeks
History of Hive and Recent Releases of Hive
RPC Protocol
Source : guru99
Embedded metastore
✓ is a simple way to get started with Hive
✓ only one embedded Derby database can access the database files on disk at any
one time
✓ means you can only have one Hive session open at a time that shares the same
metastore
Local Metastore
✓ The solution to supporting multiple sessions (and therefore multiple users) is to use a
standalone database
✓ Metastore service still runs in the same process as the Hive service, but connects to
a database running in a separate process
✓ either on the same machine or on a remote machine.
Remote metastore
✓ One or more metastore servers run in separate processes to the Hive service
✓ This brings better manageability and security, since the database tier can be
completely firewalled off
Major Components of Hive Architecture (4)
Driver
✓ Receives HiveQL statements and works like a controller
✓ Monitors the progress and life cycle of various executions by creating sessions
✓ Stores the metadata that is generated while executing the HiveQL statement
✓ Collects the data points and query results when the reduce operation is completed by the MapReduce job
Source : AnalyticsVidhya
Source : geeksforgeeks
Request Flow - Detailed
Explanation of the workflow: Hourly Log Data can be stored directly into HDFS and then data cleansing
is performed on the log file. Finally, Hive table(s) can be created to query the log file.
[Figure: Hive data model: Database → Table → Partition → Bucket]
• Databases
–The namespace for tables.
• Tables
–Set of records that have similar schema.
• Partitions
–Logical separations of data based on classification of the given information as per specific attributes. Once Hive has partitioned the data based on a specified key, it assembles the records into specific folders as the records are inserted (e.g., partition by year, country).
• Buckets (Clusters)
– Similar to partitions but uses hash function to segregate data and determines the cluster or bucket into which
the record should be placed
– “CLUSTERED BY (customer_id) INTO XX BUCKETS”;
DYNAMIC PARTITION: The user simply states the column on the basis of which partitioning will take place. Hive then creates partitions based on the unique values in that column.
• This can be avoided using Bucketing in which you can limit the number of buckets that will be created.
• A bucket is a file whereas a partition is a directory.
• Bucketing works well when the field has high cardinality (cardinality is the
number of values a column or field can have) and data is evenly distributed
among buckets.
• Partitioning works best when the cardinality of the partitioning field is not too
high.
FROM STUDENT
INSERT OVERWRITE TABLE STUDENT_BUCKET
SELECT rollno, name, grade;
String Types
STRING
VARCHAR Only available starting with Hive 0.12.0
CHAR Only available starting with Hive 0.13.0
Strings can be expressed in either single quotes (‘) or double quotes (“)
Miscellaneous Types
BOOLEAN
BINARY Only available starting with Hive
Hive Datatypes (2)
• Text File
• Sequential file
• RCFile (Record Columnar File)
• The default file format is text file. In this format, each record is a line in the file. In a text file, different control characters are used as delimiters (e.g., “,”, ‘\t’, ^A (\001)).
• The supported text files are CSV and TSV. JSON or XML documents, too, can be stored as text files.
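For instance, a row stored in Hive's default text format uses \001 (^A) between fields; a small Python sketch (with a made-up record) shows how such a line splits:

# A made-up text-format row with ^A (\x01) as the field delimiter
row = '101\x01Alice\x01A'
fields = row.split('\x01')
print(fields)  # ['101', 'Alice', 'A']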
• Sequential files are flat files that store binary key–value pairs. They include compression support, which reduces the CPU and I/O requirements.
• RCFile stores the data in a column-oriented manner, so it is efficient for column-based queries.
• IF NOT EXISTS: It is an optional clause. The CREATE DATABASE statement with the IF NOT EXISTS clause creates a database only if it does not already exist. If a database with the same name already exists, the statement notifies the user and does not raise an error.
• We have not specified the location where the Hive database will be created.
• By default, all Hive databases are created under the default warehouse directory (set by the property hive.metastore.warehouse.dir) as /user/hive/warehouse/database_name.db.
• But if we want to specify our own location, then the optional LOCATION clause can be used.
➢ Show Databases
➢ Objective: To display a list of all databases.
To describe a database.
➢ Shows only DB name, comment, and DB directory.
To drop a database.
• The DROP DATABASE command in Hive is used to delete an existing database along with all its tables, partitions and data.