
1.0 Introduction to Big Data

 Big Data is a collection of large datasets that cannot be adequately processed
using traditional processing techniques. Big data is not just data; it has become a
complete subject, which involves various tools, techniques, and frameworks.
 The term big data describes the large volume of data, both structured and
unstructured, that inundates a business environment on a day-to-day basis. What
matters is not the amount of data itself but what organizations do with the data.

 Big data helps to perform in-depth analysis so that better decisions and
strategic moves can be made for the development of the organization.
 Big Data includes huge volume, high velocity, and extensible variety of data.
 The data in it will be of three types.
 Structured data − Relational data.
 Semi Structured data − XML data.
 Unstructured data − Word, PDF, Text, Media Logs.

Benefits of Big Data


 Using the information kept in social networks like Facebook, marketing
agencies are learning about the response to their campaigns, promotions, and
other advertising media.

 Using information in social media, such as the preferences and product
perception of their consumers, product companies and retail organizations are
planning their production.

 Using data regarding the previous medical history of patients, hospitals are
providing better and quicker service.
Categories Of 'Big Data'
Big data comes in three forms:
1) Structured 2) Unstructured 3) Semi-structured

1) Structured
 Any data that can be stored, accessed, and processed in the form of a fixed
format is termed 'structured' data.

 Over a period of time, talent in computer science has achieved great success
in developing techniques for working with such data (where the format
is well known in advance) and in deriving value out of it.

 The size of such data can grow to a huge extent; typical sizes are in the range
of multiple zettabytes.

Employee_ID  Employee_Name  Gender  Department  Salary_In_lacs
1            Xyz            M       Finance     750000
2            ABC            F       Admin       150000
3            MNP            M       Admin       550000
4            NXP            M       Finance     600000
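Because structured data has a format that is known in advance, it can be loaded into a relational store and queried directly. The following is a minimal sketch using the employee table above, with Python's sqlite3 module standing in for a traditional relational DBMS (the table and column names simply mirror the example):

```python
import sqlite3

# Load the fixed-format employee table into an in-memory relational store.
# sqlite3 is used here purely for illustration of "structured" data.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee (
    Employee_ID INTEGER, Employee_Name TEXT, Gender TEXT,
    Department TEXT, Salary_In_lacs INTEGER)""")
rows = [
    (1, "Xyz", "M", "Finance", 750000),
    (2, "ABC", "F", "Admin", 150000),
    (3, "MNP", "M", "Admin", 550000),
    (4, "NXP", "M", "Finance", 600000),
]
conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?, ?)", rows)

# Because the schema is fixed, value can be derived with a simple query.
avg = conn.execute(
    "SELECT Department, AVG(Salary_In_lacs) FROM employee GROUP BY Department"
).fetchall()
print(avg)  # [('Admin', 350000.0), ('Finance', 675000.0)]
```

This is exactly the kind of processing that becomes difficult once the data no longer fits a fixed schema.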

2) Unstructured
 Any data with unknown form or structure is classified as unstructured data.
 In addition to the size being huge, unstructured data poses multiple challenges
in terms of processing it to derive value out of it.
 A typical example of unstructured data is a heterogeneous data source containing
a combination of simple text files, images, videos, etc.
 Nowadays organizations have a wealth of data available with them but,
unfortunately, they don't know how to derive value out of it, since this data is in
its raw, unstructured form.
Examples of Unstructured Data :- Typical human-generated unstructured data
includes:
 Text files: Word processing, spreadsheets, presentations, email, logs.
 Email: Email has some internal structure thanks to its metadata, so we
sometimes refer to it as semi-structured. However, its message field is
unstructured, and traditional analytics tools cannot parse it.
 Social media: Data from Facebook, Twitter, LinkedIn.
 Websites: YouTube, Instagram, photo-sharing sites.
 Mobile data: Text messages, locations.
 Communications: Chat, IM, phone recordings, collaboration software.
 Media: MP3, digital photos, audio and video files.
 Business applications: MS Office documents, productivity applications.
Typical machine-generated unstructured data includes:
 Satellite imagery: Weather data, land forms, military movements.
 Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.
 Digital surveillance: Surveillance photos and video.
 Sensor data: Traffic, weather, oceanographic sensors.
3) Semi-structured
 Semi-structured data can contain both forms of data.
 Users can see semi-structured data as structured in form, but it is actually not
defined with, e.g., a table definition as in a relational DBMS.

Examples of Semi-structured Data :- Personal data stored in an XML file:


<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
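Records like these carry their own tags but no relational schema, so processing them means navigating the markup rather than issuing a table query. A small sketch using Python's standard xml.etree module (the records are wrapped in a root element, which the fragment above omits):

```python
import xml.etree.ElementTree as ET

# Semi-structured data: each <rec> element carries tags but no fixed
# relational schema. We wrap the fragment in a root element to parse it.
xml_data = """<people>
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>"""

root = ET.fromstring(xml_data)
# Pull fields out by tag name, converting types as we go.
people = [
    {"name": rec.findtext("name"), "sex": rec.findtext("sex"),
     "age": int(rec.findtext("age"))}
    for rec in root.findall("rec")
]
print(people[0])  # {'name': 'Prashant Rao', 'sex': 'Male', 'age': 35}
```

The structure is recoverable, but only because the processing code knows which tags to look for.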
 Basic requirements for working with big data are the same as the requirements
for working with datasets of any size.
 However, the massive scale, the speed of ingesting and processing, and the
characteristics of the data that must be dealt with at each stage of the process
present significant new challenges when designing solutions.
 In 2001, analyst Doug Laney (then at META Group, later part of Gartner) first
presented what became known as the "three Vs of big data" to describe some of
the characteristics that make big data different from other data processing:

Volume
 The sheer scale of the information processed helps define big data systems.
 These datasets can be orders of magnitude larger than traditional datasets, which
demands more thought at each stage of the processing and storage life cycle.
 Often, because the work requirements exceed the capabilities of a single
computer, this becomes a challenge of pooling, allocating, and coordinating
resources from groups of computers.
 Cluster management and algorithms capable of breaking tasks into smaller
pieces become increasingly important.
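The idea of breaking a task into smaller pieces and pooling machines can be sketched in miniature. The following toy example splits a dataset into chunks, counts each chunk independently (as separate worker machines would), and merges the partial results; the two-"node" split is illustrative only:

```python
from collections import Counter

# Toy sketch of divide-and-conquer at volume: count words per chunk
# independently, then merge the partial counts.
def count_chunk(lines):
    c = Counter()
    for line in lines:
        c.update(line.split())
    return c

data = ["big data big systems", "data flows fast", "big clusters"]
chunks = [data[i::2] for i in range(2)]              # pretend: 2 worker nodes
partials = [count_chunk(chunk) for chunk in chunks]  # each runs in isolation
total = sum(partials, Counter())                     # merge the partial results
print(total["big"])  # 3
```

Real cluster frameworks add scheduling, data movement, and fault handling on top, but the split-count-merge shape is the same.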

Velocity
 Another way in which big data differs significantly from other data systems is
the speed at which information moves through the system.
 Data is frequently flowing into the system from multiple sources and is often
expected to be processed in real time to gain insights and update the current
understanding of the system.
 This focus on near instant feedback has driven many big data practitioners away
from a batch-oriented approach and closer to a real-time streaming system.
 Data is constantly being added, massaged, processed, and analyzed in order to
keep up with the influx of new information and to surface valuable information
early when it is most relevant.
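The contrast between batch and streaming can be sketched with a running aggregate: instead of waiting for the full dataset, each arriving value updates the result immediately, so an up-to-date answer is available after every event. The sensor feed below is a stand-in for a live source:

```python
# Minimal sketch of stream-style processing: a running mean that yields an
# updated value as each reading arrives, rather than after a batch completes.
def streaming_mean(readings):
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count  # insight available after every event

sensor_feed = [10.0, 20.0, 30.0, 40.0]  # stand-in for a live data source
means = list(streaming_mean(sensor_feed))
print(means)  # [10.0, 15.0, 20.0, 25.0]
```

A batch system would emit only the final 25.0 once all data had landed; the streaming version surfaces each intermediate value while it is still relevant.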

Variety
 Big data problems are often unique because of the wide range of both the sources
being processed and their relative quality.
 Data can be ingested from internal systems like application and server logs, from
social media feeds and other external APIs, from physical device sensors, and
from other providers.

 Big data seeks to handle potentially useful data regardless of where it's coming
from by consolidating all information into a single system.
 The formats and types of media can vary significantly as well. Rich media like
images, video files, and audio recordings are ingested alongside text
files, structured logs, etc.
 While more traditional data processing systems might expect data to enter the
pipeline already labeled, formatted, and organized, big data systems usually
accept and store data closer to its raw state.
 Ideally, any transformations or changes to the raw data will happen in memory
at the time of processing.
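The "store raw, transform at processing time" pattern can be sketched as follows: records arrive in different shapes and are kept as-is, and a light in-memory transformation normalizes them only when they are read. The two record formats here are invented for illustration:

```python
import json

# Raw store: records kept in their original, heterogeneous formats.
raw_store = [
    '{"user": "alice", "clicks": 3}',   # JSON from an external API
    "bob,5",                            # delimited text from a server log
]

def normalize(record):
    """Transform a raw record into a (user, clicks) pair at read time."""
    if record.lstrip().startswith("{"):
        d = json.loads(record)
        return d["user"], int(d["clicks"])
    user, clicks = record.split(",")
    return user, int(clicks)

# The transformation happens in memory at processing time; the raw
# records themselves are never rewritten.
events = [normalize(r) for r in raw_store]
print(events)  # [('alice', 3), ('bob', 5)]
```

Keeping the raw form preserves information that an eager, lossy normalization at ingest time might have thrown away.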

2. Distributed File System


 A distributed file system (DFS) is a file system with data stored on a server. The
data is accessed and processed as if it were stored on the local client machine.
 The DFS makes it convenient to share information and files among users on a
network in a controlled and authorized way.
 The server allows the client users to share files and store data just like they are
storing the information locally.
 The Hadoop Distributed File System (HDFS) is
the primary data storage system used by Hadoop applications.
 It employs a NameNode and DataNode architecture to implement a distributed
file system that provides high-performance access to data across highly scalable
Hadoop clusters.
 HDFS is a key part of many Hadoop ecosystem technologies, as it provides a
reliable means for managing pools of big data and supporting related big data
analytics applications.
 HDFS supports the rapid transfer of data between compute nodes. At its outset,
it was closely coupled with MapReduce, a programmatic framework for data
processing.
 When HDFS takes in data, it breaks the information down into separate blocks
and distributes them to different nodes in a cluster, thus enabling highly efficient
parallel processing.
 HDFS holds very large amounts of data and provides easy access. To store such
huge data, the files are stored across multiple machines.
 These files are stored in a redundant fashion to rescue the system from possible
data losses in case of failure. HDFS also makes applications available for parallel
processing.
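The redundant-storage idea can be sketched as a toy placement policy: each block is assigned to several distinct nodes, so losing any single machine loses no data. The node names and the round-robin policy below are illustrative only, not how HDFS actually chooses replica locations:

```python
# Toy sketch of redundant block placement: every block is stored on
# `replication` distinct nodes (names and policy are illustrative only).
def place_replicas(num_blocks, nodes, replication=3):
    placement = {}
    for b in range(num_blocks):
        # Round-robin over the cluster, never repeating a node per block.
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
plan = place_replicas(num_blocks=2, nodes=nodes)
print(plan)  # {0: ['node1', 'node2', 'node3'], 1: ['node2', 'node3', 'node4']}
```

With three replicas per block, any one of the four nodes can fail and every block still survives on at least two others.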

 Features of HDFS
1) It is suitable for the distributed storage and processing.
2) Hadoop provides a command interface to interact with HDFS.
3) The built-in servers of namenode and datanode help users to easily check the
status of the cluster.
4) Streaming access to file system data.
5) HDFS provides file permissions and authentication.
HDFS Architecture
 HDFS follows the master-slave architecture and it has the following elements.
1) Namenode
 The namenode is the commodity hardware that contains the GNU/Linux
operating system and the namenode software.
 It is software that can be run on commodity hardware.
 The system having the namenode acts as the master server and it does the
following tasks:
 Manages the file system namespace.
 Regulates clients' access to files.
 Executes file system operations such as
renaming, closing, and opening files and directories.

2) Datanode
 The datanode is commodity hardware having the GNU/Linux operating
system and datanode software. For every node (commodity hardware/system)
in a cluster, there will be a datanode.
 These nodes manage the data storage of their system. Datanodes perform read-
write operations on the file systems, as per client request.
 They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.

3) Block
 The file in a file system will be divided into one or more segments and/or stored
in individual data nodes.
 These file segments are called as blocks. In other words, the minimum amount of
data that HDFS can read or write is called a Block.
 The default block size is 64MB, but it can be increased as per the need to change
in HDFS configuration.
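How a file is carved into blocks can be made concrete with a small sketch. The arithmetic below mirrors the splitting described above (fixed-size blocks, with a smaller final block); it is an illustration, not an actual HDFS client:

```python
# Sketch of HDFS-style block splitting with the 64 MB default block size.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the file; the last block
    may be smaller than block_size."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 200 MB file needs 4 blocks: three full 64 MB blocks plus an 8 MB tail.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks), blocks[-1][1] // (1024 * 1024))  # 4 8
```

Each of these blocks would then be distributed (and replicated) across different datanodes, which is what enables the parallel processing described earlier.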

Goals of HDFS

1) Fault detection and recovery − Since HDFS includes a large amount of
commodity hardware, failure of components is frequent.
Therefore, HDFS should have mechanisms for quick and automatic fault
detection and recovery.

2) Huge datasets − HDFS should have hundreds of nodes per cluster to manage
the applications having huge datasets.

3) Hardware at data − A requested task can be done efficiently when the
computation takes place near the data. Especially where huge datasets are
involved, this reduces network traffic and increases throughput.

Big Data And its importance


 Complex or massive data sets which are quite impractical to manage using
traditional database systems and software tools are referred to as big data.
 Big data is utilized by organizations in one way or another; it is technology
that makes it possible to realize big data's value.
 It is a voluminous amount of both multi-structured and unstructured data.

Advantages of Big Data


 Due to the limitations present in traditional data systems (used by
organizations), big data was developed. The following table (Table 1) marks the
difference between traditional data and big data:

Factors            Big Data                               Traditional Data
Data Architecture  Distributed architecture               Centralized architecture
Type of Data       Semi-structured/unstructured data      Structured data
Volume of Data     Consists of 2^50 – 2^60 bytes of data  Consists of ~2^40 bytes of data
Schema             No schema                              Based on a fixed schema
Data Relationship  Complex relationship between data      Known relationship between the data
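Reading the volume figures as powers of two (a common characterization: traditional systems handle on the order of 2^40 bytes, while big data spans roughly 2^50 to 2^60 bytes), the scales can be made concrete:

```python
# 2^40 bytes is a terabyte-scale quantity (one tebibyte); 2^50 and 2^60
# correspond to pebibyte and exbibyte scale respectively.
for exp, name in [(40, "TiB"), (50, "PiB"), (60, "EiB")]:
    print(f"2^{exp} bytes = 1 {name} = {2**exp:,} bytes")
```

So the gap between "traditional" and "big" data volumes in the table is a factor of roughly a thousand to a million.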
4. Drivers for Big Data
1) Media & communications
 In media and communications, big data is used to analyze the personal and
behavioral data of customers in order to create customer profiles.
 It creates content for different customers, recommends that content on the
basis of demand, and measures its performance (Tankard 2012).
 In the media and communication industry big data is used by companies like
Spotify and Amazon Prime. Spotify uses big data analytics to analyze the data
and give music recommendations to its customers individually.

2) Banking
 In banking, big data is used to manage large amounts of financial data.
 The SEC (Securities and Exchange Commission) uses big data in order to
monitor market and finance-related data of banks, and network analytics in
order to track illegal activities in finance.

 Big data is also used in the trading sector for trade analytics and decision support
analytics.

3) Healthcare
 Big data is used in the healthcare sector in order to manage the large amount of
data related to patients, doctors, and other staff members.
 It helps to eliminate failures like errors, invalid or inappropriate data, and
system faults that arise while using the system, and provides benefits like
managing customer, staff, and doctor information related to healthcare (Bughin
et al. 2010).
 According to Gartner (2013), 43% of healthcare industries have invested in
big data.
4) Communications, Media and Entertainment
 Consumers expect rich media on demand in different formats and on a variety
of devices. Some big data challenges in the communications, media, and
entertainment industry include:
 Collecting, analyzing, and utilizing consumer insights.
 Leveraging mobile and social media content.
 Understanding patterns of real-time media content usage.
Organizations in this industry simultaneously analyze customer data along with
behavioral data to create detailed customer profiles that can be used to:
 Create content for different target audiences.
 Recommend content on demand.
 Measure content performance.

5) Education
 A major challenge in the education industry is to incorporate big data from
different sources and vendors and to utilize it on platforms that were not
designed for the varying data.
 Big data is used quite significantly in higher education. For example, the
University of Tasmania, an Australian university with over 26,000 students, has
deployed a Learning and Management System that tracks, among other things,
when a student logs onto the system, how much time is spent on different pages
in the system, and the overall progress of a student over time.
 In a different use case of big data in education, it is also used to
measure teachers' effectiveness to ensure a good experience for both students
and teachers.
 Teachers' performance can be fine-tuned and measured against student numbers,
subject matter, student demographics, student aspirations, behavioral
classification, and several other variables.
 On a governmental level, the Office of Educational Technology in the U.S.
Department of Education is using big data to develop analytics to help course-
correct students who are going astray while using online big data courses. Click
patterns are also being used to detect boredom.

6) Manufacturing and Natural Resources


 Increasing demand for natural resources including oil, agricultural products,
minerals, gas, metals, and so on has led to an increase in the volume, complexity,
and velocity of data that is a challenge to handle.

 Similarly, large volumes of data from the manufacturing industry are untapped.
The underutilization of this information prevents improved quality of products,
energy efficiency, reliability, and better profit margins.

7) Government
 In government, the biggest challenges are the integration and interoperability of
big data across different government departments and affiliated organizations.
 In public services, big data has a very wide range of applications including:
energy exploration, financial market analysis, fraud detection, health related
research and environmental protection.
 Some more specific examples are as follows:
 Big data is being used in the analysis of large amounts of social disability claims,
made to the Social Security Administration (SSA), that arrive in the form of
unstructured data. The analytics are used to process medical information rapidly
and efficiently for faster decision making and to detect suspicious or fraudulent
claims.
 The Food and Drug Administration (FDA) is using big data to detect and study
patterns of food-related illnesses and diseases. This allows for faster response,
which has led to faster treatment and fewer deaths.
8) Insurance
 Lack of personalized services, lack of personalized pricing, and the
lack of targeted services for new segments and for specific market
segments are some of the main challenges.
 In a survey conducted by Marketforce, challenges identified by
professionals in the insurance industry include the underutilization of
data gathered by loss adjusters and a hunger for better insight.
 Applications of big data in the insurance industry:
 Big data has been used in the industry to provide customer insights
for transparent and simpler products, by analyzing and predicting
customer behavior through data derived from social media, GPS-
enabled devices, and CCTV footage. Big data also allows for better
customer retention by insurance companies.
