Big data helps organizations analyze information in depth so that they can make better decisions and set strategy for their development.
Big data is characterized by huge volume, high velocity, and an extensible variety of data.
The data involved is of three types:
Structured data − Relational data.
Semi Structured data − XML data.
Unstructured data − Word, PDF, Text, Media Logs.
1) Structured
Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
Over time, computer science has achieved great success in developing techniques for working with such data (where the format is known in advance) and deriving value from it.
Today, however, the size of such data is growing to a huge extent, with typical sizes in the range of multiple zettabytes.
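The fixed, well-known format is what makes structured data easy to query. A minimal sketch using an in-memory SQLite table (the table name, columns, and rows are illustrative, not from the text):

```python
# Structured data: a fixed schema known in advance, queried with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Asha", "Sales"), (2, "Ben", "HR"), (3, "Chen", "Sales")],
)

# Because the format is fixed, deriving value is a single query away.
rows = conn.execute(
    "SELECT dept, COUNT(*) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()
print(rows)  # [('HR', 1), ('Sales', 2)]
```

Relational data (the first type above) is the canonical case: every row conforms to the same table definition.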
2) Unstructured
Any data with an unknown form or structure is classified as unstructured data.
In addition to its sheer size, unstructured data poses multiple challenges when it comes to processing it and deriving value from it.
A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
Nowadays, organizations have a wealth of data available to them, but unfortunately they do not know how to derive value from it, since the data is in its raw, unstructured form.
Examples of unstructured data − Typical human-generated unstructured data includes:
Text files: Word processing, spreadsheets, presentations, email, logs.
Email: Email has some internal structure thanks to its metadata, and we
sometimes refer to it as semi-structured. However, its message field is
unstructured and traditional analytics tools cannot parse it.
Social Media: Data from Facebook, Twitter, LinkedIn.
Website: YouTube, Instagram, photo sharing sites.
Mobile data: Text messages, locations.
Communications: Chat, IM, phone recordings, collaboration software.
Media: MP3, digital photos, audio and video files.
Business applications: MS Office documents, productivity applications.
Typical machine-generated unstructured data includes:
Satellite imagery: Weather data, land forms, military movements.
Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.
Digital surveillance: Surveillance photos and video.
Sensor data: Traffic, weather, oceanographic sensors.
3) Semi-structured
Semi-structured data can contain both forms of data.
Users may see semi-structured data as structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS.
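XML (mentioned above) and JSON are typical carriers of semi-structured data: records are self-describing but follow no table definition. A small sketch with illustrative JSON records, where fields may be missing or vary per record:

```python
# Semi-structured data: tagged, self-describing records with no fixed
# schema -- code must tolerate absent or differently shaped fields.
import json

raw = """
[{"id": 1, "name": "Asha", "email": "asha@example.com"},
 {"id": 2, "name": "Ben", "phones": ["555-0100", "555-0101"]}]
"""
records = json.loads(raw)

# One record has an email, the other only phone numbers.
contacts = [
    rec.get("email") or rec.get("phones", ["unknown"])[0]
    for rec in records
]
print(contacts)  # ['asha@example.com', '555-0100']
```

A relational table would force both records into one column layout; here each record carries its own structure.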
Volume
The sheer scale of the information processed helps define big data systems.
These datasets can be orders of magnitude larger than traditional datasets, which
demands more thought at each stage of the processing and storage life cycle.
Often, because the work requirements exceed the capabilities of a single
computer, this becomes a challenge of pooling, allocating, and coordinating
resources from groups of computers.
Cluster management and algorithms capable of breaking tasks into smaller
pieces become increasingly important.
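The "break tasks into smaller pieces and coordinate workers" idea can be sketched in miniature: split a dataset into chunks, process each chunk in a worker pool, then combine the partial results. The chunk size and pool size are illustrative.

```python
# Volume: divide work across a pool of workers, then combine results.
from multiprocessing.dummy import Pool  # thread pool with the Pool API

data = list(range(1_000_000))
chunk_size = 100_000
chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

with Pool(4) as pool:
    partials = pool.map(sum, chunks)  # each worker sums one piece

total = sum(partials)                 # coordinate: combine partial sums
print(total)  # 499999500000
```

Real big data systems do the same thing across machines rather than threads, which is why cluster management becomes central.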
Velocity
Another way in which big data differs significantly from other data systems is the speed at which information moves through the system.
Data is frequently flowing into the system from multiple sources and is often
expected to be processed in real time to gain insights and update the current
understanding of the system.
This focus on near instant feedback has driven many big data practitioners away
from a batch-oriented approach and closer to a real-time streaming system.
Data is constantly being added, massaged, processed, and analyzed in order to
keep up with the influx of new information and to surface valuable information
early when it is most relevant.
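The contrast with batch processing can be sketched with a generator that updates a running average as each event arrives, so an insight is available after every event rather than only after the whole batch:

```python
# Velocity: stream-style processing -- update state per event instead
# of collecting everything first. Event values are illustrative.
def running_mean(stream):
    count, total = 0, 0.0
    for value in stream:
        count += 1
        total += value
        yield total / count  # fresh insight after every event

events = [10, 20, 30, 40]
means = list(running_mean(events))
print(means)  # [10.0, 15.0, 20.0, 25.0]
```

A batch approach would yield only the final 25.0; the streaming version surfaces intermediate values while they are still relevant.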
Variety
Big data problems are often unique because of the wide range of both the sources
being processed and their relative quality.
Data can be ingested from internal systems like application and server logs, from
social media feeds and other external APIs, from physical device sensors, and
from other providers.
Big data seeks to handle potentially useful data regardless of where it's coming
from by consolidating all information into a single system.
The formats and types of media can vary significantly as well. Rich media like images, video files, and audio recordings are ingested alongside text files, structured logs, etc.
While more traditional data processing systems might expect data to enter the
pipeline already labeled, formatted, and organized, big data systems usually
accept and store data closer to its raw state.
Ideally, any transformations or changes to the raw data will happen in memory
at the time of processing.
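"Accepting data closer to its raw state" can be sketched as an ingest step that only tags each item with its source and media type and keeps the payload untouched, deferring any transformation to read time. The file names and type map are illustrative.

```python
# Variety: ingest heterogeneous items raw, tag them, transform later.
import os

TYPE_MAP = {".txt": "text", ".log": "structured-log",
            ".jpg": "image", ".mp4": "video"}

def ingest(name, payload):
    ext = os.path.splitext(name)[1].lower()
    return {"name": name,
            "media_type": TYPE_MAP.get(ext, "unknown"),
            "raw": payload}  # payload kept as-is; parse at read time

items = [ingest("server.log", b"GET /"), ingest("cat.jpg", b"\xff\xd8")]
print([i["media_type"] for i in items])  # ['structured-log', 'image']
```

Consolidating everything into one store this way is what lets later in-memory transformations choose the right parser per item.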
Features of HDFS
1) It is suitable for distributed storage and processing.
2) Hadoop provides a command interface to interact with HDFS.
3) The built-in servers of the namenode and datanode help users easily check the status of the cluster.
4) It offers streaming access to file system data.
5) HDFS provides file permissions and authentication.
HDFS Architecture
HDFS follows a master-slave architecture and has the following elements.
1) Namenode
The namenode is commodity hardware that contains the GNU/Linux operating system and the namenode software.
It is software that can run on commodity hardware.
The system hosting the namenode acts as the master server, and it performs the following tasks:
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as renaming, closing, and opening files and directories.
2) Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there is a datanode.
These nodes manage the data storage of their system. Datanodes perform read-
write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
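The division of labor above can be modeled with a toy in-memory sketch: the namenode holds only metadata (which blocks make up a file, and which datanodes hold replicas of each block), while datanodes hold the bytes. All names and the 3-replica layout are illustrative assumptions, not taken from the text.

```python
# Toy model of namenode metadata: file -> blocks, block -> replicas.
namenode_files = {
    "/logs/app.log": ["blk_1", "blk_2"],  # file split into two blocks
}
block_locations = {
    "blk_1": ["datanode-1", "datanode-3", "datanode-4"],  # 3 replicas
    "blk_2": ["datanode-2", "datanode-3", "datanode-5"],
}

def locate(path):
    """The kind of lookup a client asks the namenode for before it
    reads a file: each block with the datanodes holding its replicas."""
    return [(b, block_locations[b]) for b in namenode_files[path]]

print(locate("/logs/app.log")[0])
```

The client then contacts the listed datanodes directly for the block data; the namenode never serves file contents itself.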
3) Block
A file in the file system is divided into one or more segments, which are stored in individual datanodes.
These file segments are called blocks. In other words, a block is the minimum amount of data that HDFS can read or write.
The default block size is 64 MB (128 MB in Hadoop 2.x and later), but it can be increased as needed by changing the HDFS configuration.
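A quick worked check of the splitting: with a 64 MB block size, a 200 MB file occupies four blocks (three full 64 MB blocks plus one 8 MB remainder). The file size used here is illustrative.

```python
# How many HDFS blocks does a file of a given size need?
import math

def num_blocks(file_size_mb, block_size_mb=64):
    # The last block may be partially filled, hence the ceiling.
    return math.ceil(file_size_mb / block_size_mb)

print(num_blocks(200))       # 4 blocks at the 64 MB default
print(num_blocks(200, 128))  # 2 blocks at the Hadoop 2.x default
```

Larger blocks mean less namenode metadata per file, which is one reason the default is so much bigger than a typical disk block.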
Goals of HDFS
Huge datasets − HDFS should have hundreds of nodes per cluster to manage applications having huge datasets.
2) Banking
In banking, big data is used to manage large volumes of financial data.
The SEC (Securities and Exchange Commission) uses big data to monitor market and finance-related data from banks, and network analytics to track illegal activity in finance.
Big data is also used in the trading sector for trade analytics and decision support
analytics.
3) Healthcare
Big data is used in the healthcare sector to manage the large amount of data related to patients, doctors, and other staff members.
It helps to eliminate failures such as errors, invalid or inappropriate data, and system faults that arise while using the system, and provides benefits like managing customer, staff, and doctor information related to healthcare (Bughin et al. 2010).
According to Gartner (2013), 43% of healthcare organizations have invested in big data.
4) Communications, Media and Entertainment
Consumers expect rich media on demand, in different formats and on a variety of devices. Big data challenges in the communications, media, and entertainment industry include:
Collecting, analyzing, and utilizing consumer insights.
Leveraging mobile and social media content.
Understanding patterns of real-time media content usage.
Organizations in this industry simultaneously analyze customer data along with behavioral data to create detailed customer profiles that can be used to:
Create content for different target audiences.
Recommend content on demand.
Measure content performance.
5) Education
A major challenge in the education industry is to incorporate big data from
different sources and vendors and to utilize it on platforms that were not
designed for the varying data.
Big data is used quite significantly in higher education. For example, the University of Tasmania, an Australian university with over 26,000 students, has deployed a learning management system that tracks, among other things, when a student logs onto the system, how much time is spent on different pages in the system, and the overall progress of the student over time.
In another use case of big data in education, it is used to measure teachers' effectiveness to ensure a good experience for both students and teachers.
Teachers' performance can be fine-tuned and measured against student numbers, subject matter, student demographics, student aspirations, behavioral classification, and several other variables.
On a governmental level, the Office of Educational Technology in the U.S. Department of Education is using big data to develop analytics that help course-correct students who are going astray while taking online big data courses. Click patterns are also being used to detect boredom.
6) Manufacturing
Large volumes of data from the manufacturing industry remain untapped.
The underutilization of this information prevents improved product quality, energy efficiency, reliability, and better profit margins.
7) Government
In government, the biggest challenges are the integration and interoperability of big data across different government departments and affiliated organizations.
In public services, big data has a very wide range of applications including:
energy exploration, financial market analysis, fraud detection, health related
research and environmental protection.
Some more specific examples are as follows:
Big data is being used in the analysis of large amounts of social disability claims,
made to the Social Security Administration (SSA), that arrive in the form of
unstructured data. The analytics are used to process medical information rapidly
and efficiently for faster decision making and to detect suspicious or fraudulent
claims.
The Food and Drug Administration (FDA) is using big data to detect and study patterns of food-related illnesses and diseases. This allows for faster response, which has led to faster treatment and fewer deaths.
8) Insurance
Lack of personalized services, lack of personalized pricing and the
lack of targeted services to new segments and to specific market
segments are some of the main challenges.
In a survey conducted by Marketforce, challenges identified by professionals in the insurance industry include the underutilization of data gathered by loss adjusters and a hunger for better insight.
Applications of big data in the insurance industry:
Big data has been used in the industry to provide customer insights for transparent and simpler products, by analyzing and predicting customer behavior through data derived from social media, GPS-enabled devices, and CCTV footage. Big data also allows insurance companies to achieve better customer retention.