
Unit-I

Introduction to Big Data


What is Data
Data: Data is a collection of raw facts and figures. Data is unprocessed, which is why it is
called a collection of raw facts and figures. We collect data from different sources. After
collection, data is entered into a machine for processing. Data may be a collection of
words, numbers, pictures, sounds, etc.
Examples of Data:
• Student data on admission forms - a bundle of admission forms contains each student's
name, father's name, address, photograph, etc.
• Student's examination data - in the examination system of a college/school, data about
the marks obtained in different subjects by all students is collected, along with the exam
schedule, etc.
• Survey data - data can be collected through surveys to learn people's opinions about a
company's products, i.e., whether they like or dislike them. Companies also collect data
about their competitors in a particular area.

Information: Processed data is called information. When raw facts and figures are
processed and arranged in some proper order, they become information. Information has
proper meaning. Information is useful in decision-making. In other words, information is
data that has been processed in such a way as to be meaningful to the person who
receives it.
Examples of information:
• Student's address labels - stored data of students can be used to print address labels,
which are used to send any intimation/information to students at their home addresses.
• Student's examination results - in the examination system, the collected data (marks
obtained in each subject) is processed to get the total marks obtained by a student. The
total marks obtained are information. They are also used to prepare the student's result
card.
• Survey report - survey data is summarized into reports/information to present to the
management of the company. The management takes important decisions on the basis of
the data collected through surveys.

Units of data: When dealing with big data, we express sizes in units such as megabytes,
gigabytes, terabytes, etc. Here is the system of units used to represent data:
 The Bit
 The Byte (8 Bits)
 Kilobyte (1,024 Bytes)
 Megabyte (1,024 Kilobytes)
 Gigabyte (1,024 Megabytes, or 1,048,576 Kilobytes)
 Terabyte (1,024 Gigabytes)
 Petabyte (1,024 Terabytes, or 1,048,576 Gigabytes)
 Exabyte (1,024 Petabytes)
 Zettabyte (1,024 Exabytes)
 Yottabyte (1,024 Zettabytes)
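As a quick illustration of these binary units, the short Python sketch below (a hypothetical
helper, not part of any specific tool) converts a raw byte count into a human-readable size
using the 1,024 factor described above.

def human_readable(num_bytes: float) -> str:
    """Convert a byte count into the largest sensible binary unit (factor 1024)."""
    units = ["Bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]
    for unit in units:
        if num_bytes < 1024 or unit == units[-1]:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024  # move to the next larger unit

# Example: 500+ terabytes of new Facebook data per day (see Examples of Big Data below)
print(human_readable(500 * 1024**4))  # -> "500.00 TB"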
BIG DATA
 The term has been in use since the 1990s, with some giving credit to John Mashey for
popularizing the term.
 Big Data is also data, but of a huge size.
 Big data is a term that describes large volumes of data.
 “Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.”
 Big data is a collection of data sets so large and complex that it becomes difficult to
process them using on-hand database system tools or traditional data-processing
applications.

Examples of Big Data

 Social Media: Statistics show that 500+ terabytes of new data get ingested into the
databases of the social media site Facebook every day. This data is mainly generated
through photo and video uploads, message exchanges, posting of comments, etc.

 A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, the data generated reaches many petabytes.

Types of Data (or) Categories of Data:


The type of data is one of the important aspects that determines the kind of analysis that
has to be performed on it. The different types of data that need to be processed are:

Big data can be found in three forms:


•Structured
•Unstructured
•Semi-structured
 Structured Data:
Any data that can be stored, accessed and processed in a fixed format is termed
'structured' data.
An 'Employee' table in a database is an example of structured data:

Employee_ID Employee_Name Gender Department Salary_In_lacs


2365 Rajesh Kulkarni Male Finance 650000
3398 Pratibha Joshi Female Admin 650000
7465 Shushil Roy Male Admin 500000
7500 Shubhojit Das Male Finance 500000

In a structured format you have a proper schema for your data, so you know in advance
which columns will be there; it is essentially a tabular format.
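As an illustration of the fixed schema behind structured data, here is a minimal sketch
using Python's built-in sqlite3 module; the table and column names simply mirror the
Employee example above and are not tied to any particular system.

import sqlite3

# A structured store: the schema (columns and types) is declared before any data is written.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Employee (
           Employee_ID INTEGER PRIMARY KEY,
           Employee_Name TEXT,
           Gender TEXT,
           Department TEXT,
           Salary_In_lacs INTEGER
       )"""
)
conn.execute("INSERT INTO Employee VALUES (2365, 'Rajesh Kulkarni', 'Male', 'Finance', 650000)")

# Every query can rely on the known, fixed set of columns.
for row in conn.execute("SELECT Employee_Name, Department FROM Employee"):
    print(row)  # ('Rajesh Kulkarni', 'Finance')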

 Semi-Structured Data:
Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is not actually defined with, e.g., a table definition as in a
relational DBMS.
Examples of semi-structured data are data represented in an eXtensible Markup Language
(XML) file, JavaScript Object Notation (JSON), Comma-Separated Values (CSV) files, and
e-mail, where a schema is not rigidly defined.

Personal data stored in an XML file:


<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>

Example JSON document:


{
  "Id": 9,
  "Book Title": "Fundamentals of Business Analytics"
}
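Below is a minimal Python sketch of how such semi-structured records can be read, with
the structure interpreted at parse time rather than enforced by a predefined table; the
field names are taken from the XML and JSON snippets above.

import json
import xml.etree.ElementTree as ET

xml_data = "<people><rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec></people>"
for rec in ET.fromstring(xml_data).findall("rec"):
    # The "schema" lives inside each record's tags, not in a table definition.
    print(rec.findtext("name"), rec.findtext("sex"), rec.findtext("age"))

doc = json.loads('{"Id": 9, "Book Title": "Fundamentals of Business Analytics"}')
print(doc["Book Title"])  # fields can vary from document to document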
 Unstructured Data:
Unstructured data does not follow any schema definition; for example, written text such as
the content of this unit is unstructured. You may add certain headings or metadata to
unstructured data. In fact, the growth of the internet has resulted in the generation of
zettabytes of unstructured data. Some examples of unstructured data are listed below:
 Large written textual data such as email data, Social media etc.
 Unprocessed audio and video data
 Image data and mobile data
 Unprocessed natural speech data
 Unprocessed geographical data.
In general, this data requires huge storage space, newer processing methods and faster
processing capabilities.

DIFFERENCE BETWEEN STRUCTURED, SEMI-STRUCTURED AND UNSTRUCTURED DATA

Key            Structured                   Semi-Structured               Unstructured

Technology     Relational database          XML / RDF                     Character and binary
               tables                                                     data

Transaction    Matured transaction          Transaction management        No transaction
Management     management, various          adapted from RDBMS,           management, no
               concurrency techniques       not matured                   concurrency

Version        Versioning over tuples,      Not very common;              Versioned as a whole
Management     rows, tables, etc.           versioning over tuples
                                            or graphs is possible

Flexibility    Schema-dependent,            Flexible, tolerant            Very flexible, absence
               rigorous schema              schema                        of schema

Scalability    Scaling a DB schema is       Schema scaling is             Very scalable
               difficult                    simple
CHARACTERISTICS OF BIG DATA:
To classify data as big data, we consider the 5 V's:

 Volume

 Variety

 Velocity

 Value

 Veracity

Volume:

Volume refers to the unimaginable amounts of information generated every second from
social media, cell phones, cars, credit cards, M2M sensors, images, video, and so on. We
currently use distributed systems to store data in several locations, brought together by a
software framework like Hadoop.
Facebook alone generates billions of messages each day, the "like" button is recorded
about 4.5 billion times, and over 350 million new posts are uploaded daily. Such a huge
amount of data can only be handled by big data technologies.
The volume of data is rising exponentially. In 2016 the data created was only 8 ZB, and it
was expected that by 2020 the data would rise to 40 ZB, which is extremely large.

Variety:
As discussed before, big data is generated in multiple varieties. Compared to traditional
data like phone numbers and addresses, the latest trend is data in the form of photos,
videos, audio, and much more, making about 80% of the data completely unstructured.

Velocity:
Velocity is the speed at which data is generated and processed. At first, mainframes were
used and few people used computers. Then came the client/server model, and more and
more computers evolved. After this, web applications came into the picture and started
spreading over the Internet. Then everyone began using these applications. These
applications were then accessed from more and more devices such as mobiles, as they
were very easy to access. Hence, a lot of data! Every 60 seconds, an enormous amount of
data is generated.

Value:
Here our fourth V comes in, which deals with the mechanism to bring out the correct
meaning of data. First of all, you need to mine the data, i.e., turn raw data into useful data.
Then an analysis is done on the data that you have cleaned or retrieved out of the raw
data. Then you need to make sure that whatever analysis you have done benefits your
business, for example by finding insights and results that were not possible earlier.
You need to make sure that whatever raw data you are given, you have cleaned it so it can
be used for deriving business insights. After you have cleaned the data, a challenge pops
up: during the process of dumping a huge amount of data, some packages might be lost.
To resolve this issue, our next V comes into the picture.

Veracity:
Since packages can get lost during execution, we may need to start again from the stage
of mining raw data in order to convert it into valuable data, and this process goes on.
There will also be uncertainties and inconsistencies in the data. To overcome this, our last
V comes into place, i.e., Veracity. Veracity means the trustworthiness and quality of data.
It is necessary that the veracity of the data is maintained. For example, think about
Facebook posts with hashtags, abbreviations, images, videos, etc., which can make them
unreliable and hamper the quality of their content. Collecting loads and loads of data is of
no use if the quality and trustworthiness of the data is not up to the mark.
Importance of Big Data:

Big Data analytics is indeed a revolution in the field of Information Technology. The use
of data analytics by companies is increasing every year. Big data has the properties of
high variety, volume, and velocity. Big data analytics involves the use of techniques like
machine learning, data mining, natural language processing, and statistics. With the help
of big data, multiple operations can be performed on a single platform: you can store
terabytes of data, pre-process it, analyze it, and visualize it with the help of a couple of
big data tools.

Data is extracted, prepared and blended to provide analysis for the businesses. Large
enterprises and multinational organizations use these techniques widely these days in
different ways.

Big data analytics helps organizations work with their data efficiently and use that data to
identify new opportunities. Different techniques and algorithms can be applied to make
predictions from data. Multiple business strategies can be applied for the future success of
the company, which leads to smarter business moves, more efficient operations and higher
profits.

 Cost Savings: Big data tools like Hadoop and cloud-based analytics can bring cost
advantages to a business when large amounts of data need to be stored, and they can
identify more efficient ways of doing business.
 Time Reductions: The high speed of tools like Hadoop and in-memory analytics can
easily identify new sources of data, which helps businesses analyze data immediately and
make quick decisions based on the learnings.
 Understand the market conditions: By analyzing big data you can get a better
understanding of current market conditions. For example, by analyzing customers’
purchasing behaviors, a company can find out the products that are sold the most and
produce products according to this trend. By this, it can get ahead of its competitors.
 Control online reputation: Big data tools can do sentiment analysis. Therefore, you can
get feedback about who is saying what about your company. If you want to monitor and
improve the online presence of your business, then, big data tools can help in all this.

 Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing
Insights: Big data analytics can help change all business operations. This includes the
ability to match customer expectations, change the company's product line and, of course,
ensure that the marketing campaigns are powerful.

The marketing and advertising sector is able to perform more sophisticated analysis. This
involves observing online activity, monitoring point-of-sale transactions, and ensuring
on-the-fly detection of dynamic changes in customer trends. Gaining insights into
customer behavior requires collecting and analyzing the customer's data, using a similar
approach to that of marketers and advertisers. This results in the capability to run focused
and targeted campaigns.

A more targeted and personalized campaign means that businesses can save money and
ensure efficiency. This is because they target high potential clients with the right products.
Big data analytics is good for advertisers since the companies can use this data to
understand customers purchasing behavior. Through predictive analytics, it is possible for
the organizations to define their target clients.

Example:

Netflix is a good example of a big brand that uses big data analytics for targeted
advertising. With over 100 million subscribers, the company collects huge amounts of
data, which is the key to achieving the industry status Netflix boasts. If you are a
subscriber, you are familiar with how they send you suggestions for the next movie you
should watch. Basically, this is done using your past search and watch data. This data
gives them insights into what interests the subscriber most.

 Big Data Analytics as a Driver of Innovations and Product Development: Another
huge advantage of big data is its ability to help companies innovate and redevelop their
products. Basically, big data has become an avenue for creating additional revenue
streams by enabling innovations and product improvements. Organizations begin by
collecting as much data as is technically possible before designing new product lines and
re-designing existing products.

Example:
You have probably heard of Amazon Fresh and Whole Foods. This is a perfect example of
how big data can help improve innovation and product development. Amazon leverages
big data analytics to move into a large market. Data-driven logistics gives Amazon the
required expertise to enable the creation and achievement of greater value. By focusing on
big data analytics, Amazon Whole Foods is able to understand how customers buy
groceries and how suppliers interact with the grocer. This data gives insights whenever
there is a need to implement further changes.
Architecture for Handling Big Data:

A Big data architecture is designed to handle the ingestion, processing, and


analysis of data that is too large or complex for traditional database systems.

Big data solutions typically involve one or more of the following types of workload:
 Batch processing of big data sources at rest.
 Real-time processing of big data in motion.
 Interactive exploration of big data.
 Predictive analytics and machine learning.

Most big data architectures include some or all of the following components:
Data sources: All big data solutions start with one or more data sources. Examples
include:
 Application data stores, such as relational databases.
 Static files produced by applications, such as web server log files.
 Real-time data sources, such as IoT devices.
Data storage: Data for batch processing operations is typically stored in a distributed file
store that can hold high volumes of large files in various formats. This kind of store is
often called a data lake.
Batch processing: Because the data sets are so large, often a big data solution must
process data files using long-running batch jobs to filter, aggregate, and otherwise prepare
the data for analysis. Usually these jobs involve reading source files, processing them, and
writing the output to new files.
Real-time message ingestion: If the solution includes real-time sources, the architecture
must include a way to capture and store real-time messages for stream processing. This
might be a simple data store, where incoming messages are dropped into a folder for
processing. However, many solutions need a message ingestion store to act as a buffer for
messages, and to support scale-out processing, reliable delivery, and other message
queuing semantics.
Stream processing: After capturing real-time messages, the solution must process them by
filtering, aggregating, and otherwise preparing the data for analysis. The processed stream
data is then written to an output sink. Azure Stream Analytics provides a managed stream
processing service based on perpetually running SQL queries that operate on unbounded
streams.
Analytical data store: Many big data solutions prepare data for analysis and then serve
the processed data in a structured format that can be queried using analytical tools. The
analytical data store used to serve these queries can be a Kimball-style relational data
warehouse, as seen in most traditional business intelligence (BI) solutions. Alternatively,
the data could be presented through a low-latency NoSQL technology such as HBase, or an
interactive Hive database that provides a metadata abstraction over data files in the
distributed data store.
Analysis and reporting: The goal of most big data solutions is to provide insights into the
data through analysis and reporting. To empower users to analyze the data, the architecture
may include a data modeling layer, such as a multidimensional OLAP cube or tabular data
model in Azure Analysis Services. It might also support self-service BI, using the
modeling and visualization technologies in Microsoft Power BI or Microsoft Excel.
Analysis and reporting can also take the form of interactive data exploration by data
scientists or data analysts.
Orchestration: Most big data solutions consist of repeated data processing operations,
encapsulated in workflows that transform source data, move data between multiple sources
and sinks, load the processed data into an analytical data store, or push the results straight
to a report or dashboard. To automate these workflows, you can use an orchestration
technology such as Azure Data Factory or Apache Oozie and Sqoop.

BIG DATA CHALLENGES

Big data challenges include storing, analyzing, and visualizing extremely large and
fast-growing data.

Some of the Big Data challenges are:


 Sharing and Accessing Data
 Privacy and Security
 Analytical Challenges
 Technical challenges:
o Quality of data
o Fault tolerance
o Scalability
BIG DATA PLATFORM

What is a big data platform?

The constant stream of information from various sources is becoming more intense,
especially with the advance in technology. And this is where big data platforms come
in to store and analyze the ever-increasing mass of information.

A big data platform is an integrated computing solution that combines numerous


software systems, tools, and hardware for big data management. It is a one-stop
architecture that solves all the data needs of a business regardless of the volume and
size of the data at hand. Due to their efficiency in data management, enterprises are
increasingly adopting big data platforms to gather tons of data and convert them into
structured, actionable business insights.

Currently, the marketplace is flooded with numerous open-source and commercially
available big data platforms. They boast different features and capabilities for use in
a big data environment.

Characteristics of a Big data platform


Any good big data platform should have the following important features:
• Ability to accommodate new applications and tools depending on the evolving
business needs
• Support several data formats
• Ability to accommodate large volumes of streaming or at-rest data
• Have a wide variety of conversion tools to transform data to different preferred
formats
• Capacity to accommodate data at any speed
• Provide the tools for scouring the data through massive data sets
• Support linear scaling
• The ability for quick deployment
• Have the tools for data analysis and reporting requirements

Big data platform examples


Here are some big data platforms that can help manage petabytes of data and provide
actionable insights:

Big Data Platforms are mainly available in two categories:


• Commercial or Proprietary
• Open Source

Hadoop provides a set of tools and software that form the backbone of a Big Data
analytics system. The Hadoop ecosystem provides the necessary tools and software for
handling and analyzing Big Data. On top of the Hadoop system, many applications can be
developed and plugged in to provide an ideal solution for Big Data needs.

• Top Hadoop based Commercial Big Data Analytics Platform


Cloudera, Amazon Web Services, Hortonworks, MapR, IBM Open Platform,
Microsoft HDInsight, Intel Distribution for Apache Hadoop, Datastax Enterprise
Analytics, Teradata Enterprise Access for Hadoop.

• Open Source Big Data Platforms:


Apache Hadoop, MapReduce, GridGain, HPCC Systems, Apache Storm, Apache
Spark, SAMOA (Scalable Advanced Massive Online Analysis).

Apache Hadoop
Hadoop is an open-source programming architecture and server software. It is employed to
store and analyze large data sets very fast with the assistance of thousands of commodity
servers in a clustered computing environment. In the case of a single server or hardware
failure, it replicates the data so that no data is lost.

This big data platform provides important tools and software for big data management.
Many applications can also run on top of the Hadoop platform. While it can run on OS X,
Linux, and Windows, it is commonly deployed on Ubuntu and other variants of Linux.

CLOUDERA

Cloudera is a big data platform based on Apache’s Hadoop system. It can handle huge
volumes of data. Enterprises regularly store over 50 petabytes in this platform’s Data
Warehouse, which handles data such as text, machine logs, and more. Cloudera’s
DataFlow also enables real-time data processing.
AMAZON WEB SERVICES
Popularly known as AWS, this is another Hadoop-based big data platform from Amazon.
AWS is hosted in the cloud environment. Thus, businesses can employ AWS to manage
their big data analytics in the cloud. And through Amazon EMR(Elastic MapReduce),
enterprises can set up and effortlessly scale other big data platforms like Spark, Apache
Hadoop, and Presto.

ORACLE
Oracle is another big data platform with a cloud hosting environment. It can automatically
send data in different formats to cloud servers without downtime. It can also run on-
premise and in hybrid environments. This allows for data transformation and enrichment,
whether it’s live streaming or stored in a data lake. The platform offers a free tier as well.

SNOWFLAKE
This big data platform acts as a data warehouse for storing, processing, and analyzing data.
It is designed similarly to a SaaS product. This is because everything about its framework
is run and managed in the cloud. It runs fully atop public cloud hosting frameworks and
integrates with a new SQL query engine.

MapR
MapR is another Big Data platform that uses the Unix file system for handling data. It does
not use HDFS, and the system is easy to learn for anyone familiar with Unix. This solution
integrates Hadoop, Spark, and Apache Drill with a real-time data processing feature.
APACHE STORM
Apache Storm is a brainchild of the Apache Software Foundation. This big data platform is
used for real-time data analytics and distributed processing. It supports virtually all
programming languages and offers high scalability and fault tolerance. Big data giants
such as Yelp, Twitter, Yahoo, and Spotify use Apache Storm.

APACHE SPARK
Apache Spark is software that runs on top of Hadoop and provides APIs for real-time,
in-memory processing and analysis of large data sets stored in HDFS. It keeps the data in
memory for faster processing. Apache Spark runs programs up to 100 times faster in
memory and 10 times faster on disk compared to MapReduce. Apache Spark speeds up the
processing and analysis of big data sets in a Big Data environment, and it is being adopted
very quickly by businesses to analyze their data sets and get real value from their data.

Challenges of Conventional Systems

RDBMS vs HADOOP

S.No  RDBMS                                        Hadoop

1     Row-column based database                    No specific structure is followed
2     Schema on write                              Schema on read
3     It processes structured data only            It processes structured, unstructured as
                                                   well as semi-structured data
4     Best suited for small data and Online        Best suited for handling Big Data and
      Transaction Processing (OLTP) operations     Online Analytical Processing (OLAP)
                                                   operations
5     Less scalable                                Scalability is very high
6     Data normalization is a must                 Not mandatory
7     Stores transformed and aggregated data       Stores huge volumes of data (video,
                                                   audio, images, text, numeric, etc.)
8     No latency in response                       Complexity leads to some latency in
                                                   response
9     Schema of RDBMS is static                    No predefined schema
10    Write many times and read many times         Write once and read many times
11    Relational DB                                Key-value pairs
12    Throughput is low                            Throughput is high
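To make the schema-on-write vs. schema-on-read distinction (row 2 above) concrete, here
is a minimal Python sketch; it uses sqlite3 for the RDBMS side and a plain text file for the
Hadoop-style side, purely as an illustration rather than an actual Hadoop API.

import json
import sqlite3

# Schema on write: the RDBMS rejects data that does not fit the declared schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (item TEXT, amount REAL)")
db.execute("INSERT INTO sales VALUES (?, ?)", ("pen", 12.5))  # must match the schema now

# Schema on read: raw records are stored as-is; structure is imposed only when reading.
with open("raw_events.txt", "w") as f:
    f.write('{"item": "pen", "amount": 12.5}\n')
    f.write('{"item": "book", "amount": 40, "store": "A1"}\n')  # extra field is fine

with open("raw_events.txt") as f:
    for line in f:
        event = json.loads(line)              # interpretation happens at read time
        print(event["item"], event.get("store", "unknown"))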
DWDM vs HADOOP

Data warehouse:
A data warehouse essentially combines information from several sources into one
comprehensive database. Let's summarize what a data warehouse is:

Subject oriented

A data warehouse can be used to analyze a particular subject area like sales, finance, and
inventory. Each subject area contains detailed data. For example, to learn more about your
company's sales data, you can build a warehouse that concentrates on sales.
Using this warehouse, you can answer questions like "Who was our best customer for this
item last year?" This ability to define a data warehouse by subject matter, sales in this
case, makes the data warehouse subject oriented.
Integrated

A data warehouse integrates various heterogeneous data sources like RDBMS, flat files,
and online transaction records. It requires performing data cleaning and integration during
data warehousing to ensure consistency in naming conventions, attribute types, etc.,
among the different data sources.

Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the
source operational RDBMS. The operational updates of data do not occur in the data
warehouse, i.e., update, insert, and delete operations are not performed. It usually requires
only two procedures in data accessing: Initial loading of data and access to data.
Therefore, the DW does not require transaction processing, recovery, and concurrency
capabilities, which allows for a substantial speedup of data retrieval. Non-volatile means
that once data has entered the warehouse, it should not change.
Time-variant:

Historical information is kept in a data warehouse. For example, one can retrieve data from
3 months, 6 months, 12 months, or even older periods from a data warehouse.
Hadoop
The Apache Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple programming
models. It is designed to scale up from single servers to thousands of machines, each
offering local computation and storage.

Hadoop is made up of 4 modules –

1. MapReduce
MapReduce is a software framework and programming model used for processing huge
amounts of data. MapReduce programs work in two phases, namely Map and Reduce.
Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and
reduce the data.

Hadoop is capable of running MapReduce programs written in various languages: Java,


Ruby, Python, and C++.

Let's understand this with an example.

Consider you have the following input data for your MapReduce program:

Welcome to Hadoop Class

Hadoop is good
Hadoop is bad
MapReduce Working Process

The data goes through the following phases:

Input Splits:

An input to a MapReduce job is divided into fixed-size pieces called input splits. An input
split is a chunk of the input that is consumed by a single map task.

Mapping
This is the very first phase in the execution of a MapReduce program. In this phase, the
data in each split is passed to a mapping function to produce output values. In our
example, the job of the mapping phase is to count the number of occurrences of each word
in the input splits and to prepare a list in the form of <word, frequency>.
Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate the
relevant records from the Mapping phase output. In our example, the same words are
grouped together along with their respective frequencies.

Reducing

In this phase, output values from the Shuffling phase are aggregated. This phase combines
values from the Shuffling phase and returns a single output value. In short, this phase
summarizes the complete dataset.
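A minimal Python sketch of these three phases applied to the example input above follows;
it simulates the MapReduce flow in a single process, whereas a real Hadoop job would
express the same logic as Mapper and Reducer classes, typically in Java.

from collections import defaultdict

lines = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

# Mapping: emit a (word, 1) pair for every word in every input split.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffling: group all pairs with the same key (word) together.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reducing: aggregate the grouped values into a single count per word.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)  # {'Welcome': 1, 'to': 1, 'Hadoop': 3, 'Class': 1, 'is': 2, 'good': 1, 'bad': 1}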

2. HDFS (Hadoop Distributed File System):


The Hadoop Distributed File System (HDFS) is based on the Google File System(GFS)
and provides a distributed file system that is designed to run on commodity hardware.

HDFS ARCHITECTURE

[Figure: HDFS architecture showing the NameNode, Secondary NameNode, and DataNodes]

Let's understand this with an example: we need to read 1 TB of data and we have one
machine with 4 I/O channels, each channel having a speed of 100 MB/s; it takes about 45
minutes to read the entire data. Now the same amount of data is read by 10 machines, each
with 4 I/O channels of 100 MB/s. Guess the amount of time it takes to read the data? About
4.3 minutes. HDFS solves the problem of storing big data. The two main components of
HDFS are the NAME NODE and the DATA NODE. The name node is the master; we may
also have a secondary name node, so that in case the primary name node stops working the
secondary name node acts as a backup. The name node basically maintains and manages
the data nodes by storing metadata. The data node is the slave, which is basically low-cost
commodity hardware. We can have multiple data nodes. The data node stores the actual
data. The data node supports a replication factor: if one data node goes down, the data can
still be accessed from another replicated data node; therefore, the accessibility of data is
improved and loss of data is prevented.
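The timings quoted above follow directly from the arithmetic, as this small sketch checks
(it treats 1 TB as 1,048,576 MB, consistent with the binary units listed earlier).

total_mb = 1024 * 1024          # 1 TB expressed in MB
channel_speed = 100             # MB/s per I/O channel
channels_per_machine = 4

def read_time_minutes(machines: int) -> float:
    """Total read time assuming the data is split evenly across all machines."""
    throughput = machines * channels_per_machine * channel_speed  # MB/s overall
    return total_mb / throughput / 60

print(round(read_time_minutes(1), 1))   # ~43.7 minutes on one machine (roughly the 45 min quoted)
print(round(read_time_minutes(10), 1))  # ~4.4 minutes on ten machines (roughly the 4.3 min quoted)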

3. YARN (Yet Another Resource Negotiator):

The initial version of Hadoop had just two components: MapReduce and HDFS. Later it
was realized that MapReduce alone couldn't solve a lot of big data problems. The idea was
to take the resource management and job scheduling responsibilities away from the old
MapReduce engine and give them to a new component. This is how YARN came into the
picture. It is the middle layer between HDFS and MapReduce which is responsible for
managing cluster resources.

Apache Hadoop YARN decentralizes the execution and monitoring of processing jobs by
separating the various responsibilities into these components:
 A global ResourceManager that accepts job submissions from users, schedules the jobs
and allocates resources to them

 A NodeManager slave that's installed at each node and functions as a monitoring and
reporting agent of the ResourceManager

 An ApplicationMaster that's created for each application to negotiate for resources and
work with the NodeManager to execute and monitor tasks

 Resource containers that are controlled by Node Managers and assigned the system
resources allocated to individual applications

YARN ARCHITECTURE

4. Common:

Also called Hadoop Common. These are the Java libraries, files, scripts, and utilities that
are required by the other Hadoop components in order to run.
Comparisons between Data Warehouse and Hadoop

Basis for      Data Warehouse                      Hadoop
Comparison

Data           In a data warehouse we can          In Hadoop, we can process any kind of
               process structured data.            data, including structured, unstructured,
                                                   semi-structured and raw data.

Processing     Its processing is based on          Its processing is based on
               schema-on-write concepts.           schema-on-read concepts.

Storage        Suitable for data with small        It works well with large data sets having
               volume; it is too expensive         huge volume, velocity, and variety.
               for large volumes of data.

Agility        It is less agile and of fixed       It is highly agile; configure and
               configuration.                      reconfigure as needed.
Data Warehouse vs Hadoop – Which One to Use?
 If you have raw unstructured data, you should go for Hadoop, because Hadoop works
well with unstructured/raw data while a data warehouse works only with structured data.
 For low latency and interactive reports, you should go for a data warehouse.
 For OLTP/real-time/point queries, you should go for a data warehouse, because Hadoop
works best with batch data.
 For large-volume data sets, you should go for Hadoop, because Hadoop is designed to
solve big data problems.

Intelligent Data Analytics (IDA)

Intelligent data analysis discloses hidden facts that were not previously known and
provides potentially important information or facts from large quantities of data.

Phases of IDA
Nature of Data

Data are known facts or things used as a basis for inference or reckoning. We can find data
in all the situations of the world around us, whether structured or unstructured, in
continuous or discrete form: in weather records, stock market logs, photo albums, music
playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material of
any kind of human activity.

Data can be seen in two distinct ways: categorical and numerical.

Categorical data are values or observations that can be sorted into groups or categories.
There are two types of categorical values: nominal and ordinal.

• A nominal variable has no intrinsic ordering to its categories. For example, housing is a
categorical variable having two categories (own and rent).
Ex: 1) own and rent 2) male and female 3) Federal, Democratic, Republican
• An ordinal variable has an established ordering. For example, age as a variable with three
ordered categories (young, adult, and elder).
Ex: 1) young, adult, and elder 2) poor, satisfactory, good, very good, excellent

Numerical data are values or observations that can be measured. There are two kinds of
numerical values: discrete and continuous.

• Discrete (countable) data are values or observations that can be counted and are distinct
and separate. For example, the number of lines in a piece of code.
• Continuous (Measurable) data are values or observations that may take on any value within
a finite or infinite interval. For example, an economic time series such as historic gold
prices.
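As a small illustration of the categorical/numerical distinction, the sketch below (plain
Python, with made-up example values) tags each variable with its kind of data.

# Hypothetical example records: each field is labelled with its kind of data.
variables = {
    "housing":        ("categorical/nominal",  ["own", "rent"]),
    "satisfaction":   ("categorical/ordinal",  ["poor", "satisfactory", "good", "very good", "excellent"]),
    "lines_of_code":  ("numerical/discrete",   [120, 87, 401]),
    "gold_price_usd": ("numerical/continuous", [1823.40, 1830.15, 1817.92]),
}

for name, (kind, values) in variables.items():
    print(f"{name:15s} -> {kind:22s} e.g. {values[0]}")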

The kinds of datasets considered here are as follows:

 E-mails (unstructured, discrete)


 Digital images (unstructured, discrete)
 Stock market logs (structured, continuous)
 Historic gold prices (structured, continuous)
 Credit approval records (structured, discrete)
 Social media friends and relationships (unstructured, discrete)
 Tweets and trending topics (unstructured, continuous)
 Sales records (structured, continuous)

It is very important to examine the data thoroughly before undertaking any formal analysis.
Traditionally, data analysts have been taught to "familiarise themselves with their data"
before beginning to model it or test it against algorithms.
The different issues that need to be considered while handling big data are as follows:
• Missing data
• Misrecorded data
• Sampling data
• Distortions due to contamination
• Anomalous data, or data with hidden peculiarities
• Curse of dimensionality (high-dimensional spaces)

Analytic Processes and Tools


Big Data Analytics is the process of collecting large chunks of structured/unstructured
data, segregating and analyzing it, and discovering patterns and other useful business
insights from it.
There are 6 analytic processes:
• Deployment
• Business Understanding
o Problem Statement
• Data Exploration
o first-party data might include customer satisfaction surveys, focus groups, interviews,
or direct observation
o second-party data include website, app or social media activity, like online purchase
histories, or shipping data
o Open data repositories and government portals are sources of third-party data
• Data Preparation (a short data-cleaning sketch follows this list)
o Removing major errors, duplicates, and outliers—all of which are inevitable problems
when aggregating data from numerous sources.
o Removing unwanted data points—extracting irrelevant observations that have no
bearing on your intended analysis.
o Bringing structure to your data—general ‘housekeeping’, i.e. fixing typos or layout
issues, which will help you map and manipulate your data more easily.
o Filling in major gaps—as you’re tidying up, you might notice that important data are
missing. Once you’ve identified gaps, you can go about filling them.
• Data Modeling
o Descriptive analysis: identifies what has already happened.
o Diagnostic analysis: focuses on understanding why something has happened.
o Predictive analysis: allows you to identify future trends based on historical data.
o Prescriptive analysis: allows you to make recommendations for the future.
• Data Evaluation
o Sharing of results and validation
It's very important that the insights you present are clear and unambiguous. For this
reason, data analysts commonly use reports, dashboards, and interactive visualizations
to support their findings. You may also find that the results of your core analyses are
misleading or erroneous; this might be caused by mistakes in the data or by human error
earlier in the process.
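As referenced under Data Preparation above, here is a minimal data-cleaning sketch; it
assumes the pandas library and an invented example table, with column names that are
purely illustrative.

import pandas as pd

# Invented example data with a duplicate row, an obvious outlier and a missing value.
df = pd.DataFrame({
    "customer": ["A", "B", "B", "C", "D"],
    "age":      [34, 29, 29, None, 310],     # None = gap, 310 = obvious error
    "spend":    [120.0, 80.5, 80.5, 45.0, 60.0],
})

df = df.drop_duplicates()                          # remove duplicate records
df = df[df["age"].isna() | (df["age"] < 120)]      # drop impossible ages (keep missing for now)
df["age"] = df["age"].fillna(df["age"].median())   # fill the remaining gap
print(df)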
Analysis Vs Reporting
A report involves organizing data into summaries, while analysis involves inspecting,
cleaning, transforming, and modeling that data to gain insights for a specific purpose.
Reporting :
 Once data is collected, it will be organized using tools such as graphs and tables.
 The process of organizing this data is called reporting.
 Reporting translates raw data into information.
 Reporting helps companies to monitor their online business and be alerted when
data falls outside of expected ranges.
 Good reporting should raise questions about the business from its end users.
Examples: canned reports, dashboards, and alerts that push information to users.

To build a report, the steps involved broadly include:


 Identifying the business need
 Collecting and gathering relevant data
 Translating the technical data
 Understanding the data context
 Creating reporting dashboards
 Enabling real-time reporting
 Offering the ability to drill down into reports

Analysis :
 Analytics is the process of taking the organized data and analysing it.
 This helps users to gain valuable insights on how businesses can improve their
performance.
 Analysis transforms data and information into insights.
 The goal of the analysis is to answer questions by interpreting the data at a deeper
level and providing actionable recommendations.
Example: ad hoc responses, insights, recommended actions, or a forecast

A canned report will show a company’s revenue and whether it is lower or higher
than expected; an ad-hoc drill-down can be used by financial and business analysts to
understand why this occurred.
For data analytics, the steps involved include:
 Creating a data hypothesis
 Gathering and transforming data
 Building analytical models to ingest data, process it and offer insights
 Use tools for data visualization, trend analysis, deep dives, etc.
 Using data and insights for making decisions
Conclusion :
 Reporting shows us “what is happening”.
 The analysis focuses on explaining “why it is happening” and “what we can do about
it”.
