
Big Data

KCS061
UNIT-I
Syllabus
Data
• The smallest unit of information is called data.

• It can be numbers, alphabets or special symbols.

• It is information that has been translated into a form that is efficient for movement or processing.
Big Data
• Data that is very large in size is called Big Data.

• Big Data is a collection of data that is huge in size and yet growing exponentially with time.

• Big Data is difficult to store, collect, maintain, analyze, and visualize with traditional data
management tools.
• Big Data is a concept that deals with storing, processing and analyzing large amounts of data.

• There is no single definition; here is one from Wikipedia:

“Big data is the term for a collection of data sets so large and complex that it becomes difficult to
Sources of Big Data
History of Big Data
• 1989 and 1990
• Tim Berners-Lee and Robert Cailliau found the World Wide Web and develop HTML, URLs and HTTP.
• The internet age with widespread and easy access to data begins.
• 1996
• Digital data storage becomes more cost-effective than storing information on paper for the first time in
1996, as reported by R.J.T. Morris and B.J. Truskowski in their 2003 IBM Systems Journal paper, "The
Evolution of Storage Systems."
• 1997
• The domain google.com is registered a year before launching, starting the search engine's climb to
dominance and development of numerous other technological innovations, including in the areas of
machine learning, big data and analytics.
• 1998
• Carlo Strozzi develops NoSQL, a lightweight open source relational database that does not use SQL as its
query language. The name is later reused for databases that store and retrieve data modeled differently
from the traditional tabular methods found in relational databases.
• 1999
• Based on data from 1999, the first edition of the influential book, How Much Information, by Hal R.
Varian and Peter Lyman (published in 2000), attempts to quantify the amount of digital information
available in the world to date.


• 2001
• Doug Laney, then an analyst at META Group (later acquired by Gartner), coins the 3Vs (volume, variety and
velocity), defining the dimensions and properties of big data. The Vs encapsulate the true definition of
big data and usher in a new period where big data can be viewed as a dominant feature of the 21st century.
Additional Vs, such as veracity, value and variability, have since been added to the list.
• 2005
• Computer scientists Doug Cutting and Mike Cafarella create Apache Hadoop, the open source framework
used to store and process large data sets, with a team of engineers spun off from Yahoo.

• 2006
• Amazon Web Services (AWS) starts offering web-based computing infrastructure services, now
known as cloud computing. Currently, AWS dominates the cloud services industry with roughly one-third
of the global market share.
Types of Digital Data

• Digital data can be classified into three forms:


1. Structured
2. Semi-structured
3. Unstructured
1. Structured Data
• Data that is stored in the form of rows and columns (Excel, databases).
• It is organized, so a computer can use this data easily.
• Relationships exist between entities of data.

• A certain schema binds it, so all the data has the same set of properties. Structured data is also called
relational data. It is split into multiple tables to enhance the integrity of the data by creating a single
record to depict an entity. Relationships are enforced by the application of table constraints.
• Ex: Data stored in a database
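The idea that a schema binds structured data and that relationships are enforced by table constraints can be illustrated with a small sketch. It uses Python's built-in sqlite3 module; the table names, column names and values are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; all table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

# The schema fixes the properties (columns) every record must have.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE purchase ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(id),"
    " amount REAL NOT NULL)"
)

conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Asha')")
conn.execute("INSERT INTO purchase (customer_id, amount) VALUES (1, 250.0)")

# A purchase that points at a non-existent customer violates the constraint.
try:
    conn.execute("INSERT INTO purchase (customer_id, amount) VALUES (99, 10.0)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The relationship between the two tables is followed with a join.
rows = conn.execute(
    "SELECT c.name, p.amount FROM purchase p JOIN customer c ON c.id = p.customer_id"
).fetchall()
print(rows, rejected)
```

Each customer is stored once and purchases refer to it by key, which is the sense in which splitting data into tables protects integrity.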
Sources of Structured Data
2. Semi-structured Data

• Hybrid schema (JSON, HTML, XML, email and so on)

• Semi-structured data is data that does not conform to a rigid data model but has some structure. It lacks
a fixed or rigid schema.
• Because semi-structured data doesn’t need a structured query language, it is commonly called NoSQL data.
• Semi-structured content is often used to store metadata about a business process, but it can also include
files containing machine instructions for computer programs.
• This type of information typically comes from external sources such as social media platforms or other
web-based data feeds.
• A few examples of semi-structured data sources are HTML code, graphs and tables, e-mails, and XML
documents.
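A minimal sketch of what "some structure but no fixed schema" looks like in practice, using JSON and Python's standard json module. The records and field names are hypothetical.

```python
import json

# Two hypothetical records: the same kind of entity, but different sets of fields.
raw = """
[
  {"id": 1, "name": "Asha", "email": "asha@example.com"},
  {"id": 2, "name": "Ravi", "tags": ["premium"], "address": {"city": "Lucknow"}}
]
"""

records = json.loads(raw)

# The keys are self-describing, so the fields can be discovered at read time.
all_fields = sorted({key for record in records for key in record})

# Missing fields are handled by the reader rather than rejected by a schema.
cities = [record.get("address", {}).get("city") for record in records]

print(all_fields)
print(cities)
```

Both records are valid even though they carry different fields, which a fixed relational schema would not allow without nullable columns or extra tables.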
3. Unstructured Data
• Unstructured data is information that is not arranged according to a preset data model or schema, and
therefore cannot be stored in a traditional relational database or RDBMS.
• Photos, videos, text documents, and log files can be generally considered unstructured data.
• It is sometimes called “dark data” because it cannot be analyzed without the proper software tools.
• Structured data is usually easier to search and use, while unstructured data involves more complex search
and analysis
• Structured data is quantitative and is often displayed as numbers, dates, values, and strings. Unstructured
data is qualitative data and includes text, video, audio, images, and more.
• Structured data is stored in rows and columns. Unstructured data is stored as audio, text, and video files, or
in NoSQL databases.
Types of Digital Data in short
• Application data can be classified as structured, semi-structured, and unstructured data.
• Structured data is neatly organized and obeys a fixed set of rules.
• Semi-structured data doesn’t obey a fixed schema, but it has certain discernible features for
organization. Data serialization languages are used to convert data objects into a byte stream. These
include XML, JSON, and YAML.
• Unstructured data doesn’t have any predefined structure. All three kinds of data are present in an
application, and all three play equally important roles in developing resourceful and attractive
applications.
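The serialization step mentioned above (data object to byte stream and back) can be sketched with JSON, one of the listed languages. The `order` object is a made-up example, and UTF-8 is an assumed encoding.

```python
import json

# A hypothetical in-memory data object.
order = {"order_id": 42, "items": ["pen", "notebook"], "paid": True}

# Serialize: object -> JSON text -> bytes (the byte stream that is stored or sent).
payload = json.dumps(order).encode("utf-8")

# Deserialize: bytes -> JSON text -> object.
restored = json.loads(payload.decode("utf-8"))

print(type(payload).__name__, restored == order)
```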
Application or Uses of Big-Data
• Healthcare Sector
➢ It reduces the cost of treatment, since there are fewer chances of having to perform unnecessary
diagnoses.
➢ It helps in research on past medical results so that patients can be provided with better services and
medicines.
• Banking Sector
➢ Big data applications have eased customers’ difficulties and helped banks generate more revenue, and
their insights are more transparent and comprehensible than before.
➢ Big Data supports a range of applications: detecting fraud, analyzing and streamlining transaction
processing, improving understanding of users, perfecting trade execution, and promoting an exceptional
user experience.
• Education Sector
➢ Organizations that conduct online educational courses utilize big data to find candidates interested in a
course.
➢ If someone searches for a YouTube tutorial video on a subject, an online or offline course provider
for that subject shows that person an online ad about its course.
• Government Sector
➢ Governments of every nation come across a huge amount of data on an everyday basis, as they have to
maintain various records and databases of their citizens, growth, geographical surveys, and so on.
➢ Primarily, the government utilizes this data in two areas: its developmental plans and
cybersecurity.
• Retail Sector
➢ In retail, big data plays an important part in forecasting rising trends, targeting the right customers at
the relevant time, reducing marketing expenses, and improving the quality of customer service.

• Transportation Sector
➢ It helps in mapping out routes as per the requirements of the user, efficiently managing wait
times, and identifying accident-prone areas to increase traffic safety.
➢ Good examples of Big Data’s use in the transportation industry are Uber and Google Maps.
➢ Such a platform creates and uses a huge range of data on vehicles, drivers, locations, and the trips made by
each vehicle, which is then analyzed and used for forecasting demand, supply, accurate driver locations, and
trip fares.
• Insurance Sector
➢ As in retail, big data plays an important part in forecasting rising trends, targeting the right customers at
the relevant time, reducing marketing expenses, and improving the quality of customer service.

• Other applications or uses
➢ IoT
➢ Self-driving cars
➢ Smart traffic systems
➢ Communications, media and entertainment sector
➢ Virtual personal assistant tools
Big Data Platform
Big Data Platform: An Introduction
• Big data platforms are comprehensive frameworks that enable organizations to store, process, and
analyze vast amounts of structured and unstructured data.
• A big data platform is a type of IT solution.
• It combines the features and capabilities of several big data applications and utilities within a single solution.
• It enables organizations to develop, deploy, operate and manage a big data infrastructure
environment.
Features of Big Data Platforms
• Data storage and management
• Distributed processing
• Fault Tolerance
• Data Analytics and Visualization
How do big data platforms work?

a. Data collection
b. Data storage
c. Data processing
d. Data analysis
e. Data quality assurance
f. Data management
Big Data Platforms
Factors to consider when choosing big data platforms

a. Scalability
b. Performance
c. Security and compliance
d. Ease of usage
e. Integration capabilities
Characteristics of Big Data, or the 5 Vs of Big Data
• Big data was first characterized by the three V’s: volume, velocity, and variety (coined by Doug Laney in 2001).
• Other characteristics, such as veracity and value, have been added to the definition by other
researchers.
• Each of the five is discussed in detail below.
1. Volume
• The volume of data obviously refers to the size of data managed by the system.
• Data that is somewhat automatically generated tends to be voluminous.
• Examples include sensor data, such as the data generated by sensors in manufacturing or processing
plants; data from scanning equipment, such as smart card and
credit card readers; and data from measurement devices, such as smart meters or
environmental recording devices.
2. Velocity
• The definition of big data goes beyond the dimension of volume; it includes the types and frequency
of data that are disruptive to traditional database management tools.
• The McKinsey report on big data described velocity as the speed at which data is created,
accumulated, ingested, and processed.
• Example:
o High velocity is attributed to data when we consider the typical speed of transactions on
stock exchanges; this speed reaches billions of transactions per day on certain days.
o Real-time data and streaming data are accumulated by the likes of Twitter and Facebook at a very
high velocity
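To get a feel for what a daily total means as a sustained rate, the arithmetic can be sketched as follows. The figure of one billion transactions per day is an assumed round number for illustration, not a measured value for any particular exchange.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def average_rate_per_second(events_per_day: int) -> float:
    """Average number of events per second for a given daily total."""
    return events_per_day / SECONDS_PER_DAY

# Assumed figure for illustration: one billion transactions in a day.
rate = average_rate_per_second(1_000_000_000)
print(f"{rate:,.0f} transactions per second on average")
```

This is only an average over a full day; trading is concentrated in market hours, so peak rates would be higher.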
3. Variety
• Variety is the diversity and range of different data types.
• Big Data can be structured, unstructured, or semi-structured, and it is collected from
different sources.
• The types of sources have expanded dramatically and include Internet data (e.g., clickstream and
social media), research data, location data (e.g., mobile device data and geospatial data), images,
e-mails, supply chain, signal data (e.g., sensors and RFID devices), and videos.
• The data comes in an array of forms, e.g., PDFs, emails, audio, social media posts, photos,
videos, etc.
4. Value
• Raw data is processed to obtain meaningful data.
• Value is an essential characteristic of big data.
• What matters is not simply the data that we process or store, but the valuable and reliable data that we
store, process and analyze.
5. Veracity
• The veracity dimension is a more recent addition to the characterization of big data.
• Veracity has two built-in features: the credibility of the source, and the suitability of data for its target
audience.
• Many sources of data generate data that is uncertain, incomplete, and inaccurate, therefore making
its veracity questionable.
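A minimal sketch of how incomplete or inaccurate records might be flagged before analysis. The sensor readings, field names and plausible temperature range are all assumptions made for illustration.

```python
# Hypothetical sensor readings: None marks a missing value, and the
# plausible temperature range below is an assumption for illustration.
readings = [
    {"sensor_id": "s1", "temp_c": 21.5},
    {"sensor_id": "s2", "temp_c": None},    # incomplete
    {"sensor_id": "s3", "temp_c": 480.0},   # implausible, likely inaccurate
    {"sensor_id": "s4", "temp_c": 19.8},
]
PLAUSIBLE_RANGE = (-50.0, 60.0)

def classify(reading):
    """Label a reading as 'ok', 'incomplete' or 'implausible'."""
    value = reading["temp_c"]
    if value is None:
        return "incomplete"
    low, high = PLAUSIBLE_RANGE
    if not low <= value <= high:
        return "implausible"
    return "ok"

labels = [classify(r) for r in readings]
usable_share = labels.count("ok") / len(labels)
print(labels, usable_share)
```

A check like this covers only completeness and plausibility; the credibility of the source has to be judged separately.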
Big data architecture

• A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too
large or complex for traditional database systems
• Consider big data architectures when you need to:
1. Store and process data in volumes too large for a traditional database.
2. Transform unstructured data for analysis and reporting.
3. Capture, process, and analyze unbounded streams of data in real time, or with low latency.
Benefits of Big data architecture
• High performance
• Elastic scalability
• Freedom of choice
• Interoperability with related systems
Big data Importance
• Time Saving
• Cost Saving
• Understand the Market condition
• Social Media Listening
• Provide better customer service
• To improve operations
• Create personalized marketing, which can increase revenue and profits
Big Data Analytics
• Big data analytics is the process of collecting, examining, and analyzing large amounts of data to discover
market trends, insights and patterns that can help companies make better business decisions and improve
business process management.

• Big data analytics uses advanced analytics on large collections of both structured and unstructured data to
produce valuable insights for businesses.

• It is a form of advanced analytics, which involves complex applications with elements such as predictive
models, statistical algorithms and what-if analysis powered by analytics systems.
Challenges of Big Data Analytics
• Integrating data from a variety of sources
• Low-quality and inaccurate data
• Complex to store and manage
• Complex to analyze
• Hardware failure
• Combining data
• Searching
• Sharing
• Transfer
• Presentation
Big Data Analytics Importance

• Big data analytics is an essential part of our day to day life


• Showing of only relevant data
• Targeted advertising
• Personalized customer experience
• Research and technical analysis
Big Data Analytics uses and examples

• Today almost every major commercial organization is using Big data Analytics.
• This has had a direct impact on:
o Profit
o Customer retention
o Targeted publicity and ads
o Understanding their customers
Big Data Analytics uses and examples

• Price optimization. Retailers may opt for pricing models that use and model data from a variety of data
sources to maximize revenues.

• Supply chain and channel analytics. Predictive analytical models can help with B2B supplier networks,
inventory management, route optimizations and the notification of potential delays to deliveries.

• Risk management. Big data analytics can identify new risks from data patterns for effective risk
management strategies.

• Improved decision-making. Insights business users extract from relevant data can help organizations make
quicker and better decisions.
Tools for Big Data Analytics
• Data storage and management tools
• Data cleansing
• Data analysis
• Data visualization
Stages in Big Data Analytics
• The following stages are involved in the Big Data Analytics process:
• Identifying the problem: find the problem that we need to solve.
• Designing data requirements: decide what kind of data is required for analyzing the problem.
• Pre-processing data: prepare the data before the actual analysis begins.
• Performing analytics over data: analyze the processed data using various methods.
• Visualizing data: data visualization is the representation of data or information in a graph, chart, or other
visual format.
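The last three stages can be sketched on a toy example. The problem ("which product category draws the most complaints?"), the records and the function names are all hypothetical; a real pipeline would use far larger data and dedicated tools.

```python
from collections import Counter

# Hypothetical raw feedback records gathered for the question
# "which product category draws the most complaints?"
raw_records = [
    {"category": " Electronics ", "complaint": "yes"},
    {"category": "grocery", "complaint": "no"},
    {"category": "electronics", "complaint": "YES"},
    {"category": None, "complaint": "yes"},  # unusable: no category
    {"category": "Grocery", "complaint": "yes"},
]

def preprocess(records):
    """Pre-processing: drop incomplete rows and normalize text fields."""
    cleaned = []
    for record in records:
        if record["category"] is None:
            continue
        cleaned.append({
            "category": record["category"].strip().lower(),
            "complaint": record["complaint"].strip().lower() == "yes",
        })
    return cleaned

def analyze(records):
    """Analytics: count complaints per category."""
    return Counter(r["category"] for r in records if r["complaint"])

def visualize(counts):
    """Visualization: a plain-text bar chart, one '#' per complaint."""
    return [f"{category:<12}{'#' * n}" for category, n in counts.most_common()]

complaints = analyze(preprocess(raw_records))
for line in visualize(complaints):
    print(line)
```

Keeping each stage as its own function mirrors the list above and lets any one stage be replaced without touching the others.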

Big Data Analytics benefits

• Quickly analyzing large amounts of data from different sources, in many different formats and types.
• Cost savings, which can result from new business process efficiencies and optimizations.
• A better understanding of customer needs, behavior and sentiment, which can lead to better marketing
insights, as well as provide information for product development.
• Improved, better informed risk management strategies that draw from large sample sizes of data
• Rapidly making better-informed decisions for effective strategizing, which can benefit and improve the
supply chain, operations and other areas of strategic decision-making.
What Comes Under Big Data?

• Search Engine Data − Search engines retrieve lots of data from different databases

• Stock Exchange Data − The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions
made by customers on shares of different companies.

• Social Media Data − Social media such as Facebook and Twitter hold information and the views posted by
millions of people across the globe.

• Black Box Data − The black box is a component of helicopters, airplanes, jets, etc. It captures voices of the
flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
• Power Grid Data − The power grid data holds information about the power consumed by a particular node
with respect to a base station.
• Transport Data − Transport data includes model, capacity, distance and availability of a vehicle.
• Big data involves the data produced by different devices and applications. The items above are some of the
fields that come under the umbrella of Big Data.
BIG DATA FEATURES – SECURITY, COMPLIANCE, AUDITING AND PROTECTION
Big Data Features
• It should support a variety of data formats.
• It should provide data analysis and reporting tools.
• It should provide real-time data analysis software.
• It should have tools for searching through large data sets.
• It should have the capability for rapid development.
Big Data Security
• Big data security is the collective term for all the measures and tools used to guard both the data and
analytics processes from attacks, theft, or other malicious activities that could harm or negatively affect
them.
• For companies that operate on the cloud, big data security challenges are multi-faceted.
• When customers give their personal information to companies, they trust them with personal data which
can be used against them if it falls into the wrong hands
Big Data Compliance
• Data compliance is the practice of ensuring that sensitive data is organized and managed in such a way as
to enable organizations to meet enterprise business rules along with legal and governmental regulations.
• Organizations that don’t implement these regulations can be fined up to tens of millions of dollars and
even receive a 20-year penalty
Big Data Auditing
• Auditors can use big data to expand the scope of their projects and draw comparisons over larger
populations of data.
• Big data also helps financial auditors to streamline the reporting process and detect fraud.
• These professionals can identify business risks in time and conduct more relevant and accurate audits
Big Data Protection
• When customers give their personal information to companies, they trust them with personal data which
can be used against them if it falls into the wrong hands.
• That’s why data privacy is there: to protect those customers, but also companies and their employees, from
security breaches.
• Data protection is also important because organizations that don’t implement these regulations can be fined
up to tens of millions of dollars and even receive a 20-year penalty.
Reporting Vs Analysis
Reporting
• Reporting shows us “what is happening”.
• Once data is collected, it will be organized using tools such as graphs and tables.
• The process of organizing this data is called reporting.
• Reporting translates raw data into information.
• Reporting helps companies to monitor their online business and be alerted when data falls outside of
expected ranges.
• Good reporting should raise questions about the business from its end users.
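A minimal sketch of reporting as described above: raw values are organized into rows, and values outside an expected range are flagged so that someone is alerted. The daily order counts and the expected range are made-up numbers.

```python
# Hypothetical daily order counts and an assumed expected range.
daily_orders = {"Mon": 120, "Tue": 135, "Wed": 40, "Thu": 128, "Fri": 310}
EXPECTED_RANGE = (100, 200)

def build_report(metrics, expected_range):
    """Organize raw values into report rows and flag out-of-range values."""
    low, high = expected_range
    return [
        {"day": day, "orders": value, "alert": not low <= value <= high}
        for day, value in metrics.items()
    ]

report = build_report(daily_orders, EXPECTED_RANGE)
alerts = [row["day"] for row in report if row["alert"]]
print(alerts)
```

The report only shows that Wednesday and Friday are unusual; explaining why they are unusual is the job of analysis.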
Analysis
• The analysis focuses on explaining “why it is happening” and “what we can do about it”
• Analysis is the process of taking the organized data and examining it.
• This helps users to gain valuable insights on how businesses can improve their performance.
• Analysis transforms data and information into insights.
• The goal of the analysis is to answer questions by interpreting the data at a deeper level and providing
actionable recommendations.
IDA-Intelligent Data Analysis
• Intelligent Data Analysis (IDA) is one of the most important approaches in the field of
data mining. It discloses hidden facts that were not previously known and provides
potentially important information from large quantities of data.
• It also helps in making decisions.
• Based on the basic principles of IDA and the features of datasets that IDA handles, the
development of IDA is briefly summarized from three aspects :
1. Algorithm principle
2. The scale
3. Type of the dataset
• IDA is one of the major issues in artificial intelligence and information.
• Based on machine learning, artificial intelligence, pattern recognition, and records and visualization
technology, IDA helps to obtain useful information, necessary data and interesting models from the large
amount of data available online in order to make the right choices.
• IDA includes three stages:
(1) Preparation of data
(2) Data mining
(3) Data validation and Explanation
Drivers for Big data
• The digitization of society;
• The drop in technology costs;
• Connectivity through cloud computing;
• Increased knowledge about data science;
• Social media applications;
• The upcoming Internet-of-Things (IoT).
