
Nature and Components of Big Data
P.SRIDEVI
DEPT OF CSE-IT
UNIT 1
Introduction to Big Data Platform
Challenges of Conventional Systems
Intelligent data analysis
Nature of Data
Analytic Processes and Tools
Analysis vs Reporting
What is Big Data?

"Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze."
Big Data technology components
1. Ingestion

• The ingestion layer is the very first step of pulling in raw data.
• The data comes from a variety of internal and external sources:
• relational databases,
• non-relational databases,
• social media,
• emails,
• phone calls/mobile apps, etc.
Types of ingestion
• There are two kinds of ingestion:
• Batch, in which large groups of data are gathered and delivered together.
• A batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on the data. The result of this processing is stored as a batch view.
• Streaming, which is a continuous flow of data. This is necessary for real-time data analytics.
• A speed layer (hot path) analyzes data in real time. This layer is designed for low latency (minimum delay), at the expense of accuracy.
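The two paths can be contrasted with a small sketch in plain Python; the events and batch size here are illustrative assumptions, not tied to any platform. Batch ingestion groups events before delivering them, while streaming hands each event onward immediately.

```python
def batch_ingest(events, batch_size):
    """Cold path: collect events into groups and deliver each group whole."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch          # one delivered batch per group
            batch = []
    if batch:
        yield batch              # deliver the final partial batch

def stream_ingest(events):
    """Hot path: pass each event onward immediately, for lowest latency."""
    for event in events:
        yield event

events = [10, 20, 30, 40, 50]
print(list(batch_ingest(events, 2)))  # [[10, 20], [30, 40], [50]]
print(list(stream_ingest(events)))    # [10, 20, 30, 40, 50]
```

The trade-off is visible even at this scale: the batch path cannot emit anything until a group is full, while the streaming path never waits.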
Data sources
• All big data solutions start with one or more data sources.
• Examples include:
• relational databases -- application data stores,
• web server log files -- static files produced by applications,
• IoT devices -- real-time data sources.
Batch processing (open-source software commonly used: Spark)
• Because the data sets are so large, a big data solution often must process data files using batch jobs to filter, aggregate, and otherwise prepare the data for analysis.
• Usually these jobs involve reading source files, processing them, and writing the output to new files.
• Options include running U-SQL jobs (one language to process data in any format) in Azure Data Lake Analytics, running jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
• Azure HDInsight is a fully managed cloud service that makes it easy, fast, and cost-effective to process massive amounts of data. It uses popular open-source frameworks such as Hadoop, Spark, Hive, Kafka, Storm, HBase, Microsoft ML Server, and more.
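The read–filter–aggregate–write pattern these batch jobs follow can be sketched in plain Python as a stand-in for a Spark or U-SQL job; the in-memory "files", column names, and threshold are assumptions for illustration only.

```python
# Toy batch job: read source records, filter, aggregate, write a batch view.
import csv
import io

source = io.StringIO("user,bytes\nalice,100\nbob,250\nalice,50\n")

# 1. Read the source records.
rows = list(csv.DictReader(source))

# 2. Filter: keep only records above an (assumed) 60-byte threshold.
rows = [r for r in rows if int(r["bytes"]) > 60]

# 3. Aggregate: total bytes per user.
totals = {}
for r in rows:
    totals[r["user"]] = totals.get(r["user"], 0) + int(r["bytes"])

# 4. Write the resulting batch view to an output "file".
out = io.StringIO()
for user, total in sorted(totals.items()):
    out.write(f"{user},{total}\n")
print(out.getvalue())  # alice,100  /  bob,250
```

A real Spark job expresses the same three middle steps as distributed transformations over partitioned files instead of an in-memory list.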
Real-time message ingestion.
• If the solution includes real-time sources, the architecture must include a way to
capture and store real-time messages for stream processing.
• This might be a simple data store, where incoming messages are dropped into a
folder for processing.
• However, many solutions need a message ingestion store to act as a buffer for
messages, and to support scale-out processing, reliable delivery, and other
message queuing semantics.
• This streaming architecture is often referred to as stream buffering. Options
include Azure Event Hubs.
• (Azure Event Hubs is a big data streaming platform and event ingestion service.
It can receive and process millions of events per second. Data sent to an event
hub can be transformed and stored by using any real-time analytics provider or
batching/storage).
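The buffering role such an ingestion store plays can be sketched with Python's thread-safe `queue.Queue`; this is a minimal single-producer, single-consumer stand-in for a service like Event Hubs, and the message shape and sentinel convention are assumptions.

```python
# Minimal sketch of a message-ingestion buffer decoupling a real-time
# producer from a stream-processing consumer.
import queue
import threading

buffer = queue.Queue(maxsize=100)   # the stream buffer
results = []

def producer():
    for i in range(5):
        buffer.put({"event_id": i})  # real-time source pushes messages
    buffer.put(None)                 # sentinel: end of stream

def consumer():
    while True:
        msg = buffer.get()
        if msg is None:
            break
        results.append(msg["event_id"])  # stream processing happens here

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 1, 2, 3, 4]
```

The buffer is what gives the pipeline reliable delivery under bursty load: the producer can run ahead of the consumer up to the queue's capacity without dropping messages.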
2. Data storage (data warehouse vs. data lake)
• Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats.
• This kind of store is often called a data lake. Options for implementing this storage include Azure Data Lake Store in Azure Storage.
• Storage is where the converted data is kept in a data lake or warehouse until it is eventually processed.
• The data lake/warehouse is the most essential component of a big data ecosystem.
• It should contain only thorough, relevant data, to make insights as valuable as possible.
• It must be efficient, with as little redundancy as possible, to allow for quicker processing.
Azure Data Lake
• Azure Data Lake is a big data solution based on multiple cloud
services in the Microsoft Azure ecosystem.
• It allows organizations to ingest multiple data sets, including
structured, unstructured, and semi-structured data, into an infinitely
scalable data lake enabling storage, processing, and analytics.
DWH vs Data MART vs Data Lake
• Data warehouses, data lakes, and data marts are different cloud
storage solutions.
• A data warehouse stores data in a structured format. It is a central
repository of pre-processed data for analytics and business
intelligence.
• A data mart is a data warehouse that serves the needs of a specific
business unit, like a company’s finance, marketing, or sales
department.
• A data lake is a central repository for raw and unstructured data. You can store the data first and process it later on.
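The key design difference between the two stores can be sketched in a few lines: a warehouse validates and shapes data when it is written (schema-on-write), while a lake stores raw payloads and applies structure only when the data is read (schema-on-read). The field names and payloads are illustrative assumptions.

```python
import json

warehouse, lake = [], []

def warehouse_insert(record):
    # Schema-on-write: pre-process and validate before storing.
    if not {"id", "amount"} <= record.keys():
        raise ValueError("record does not match warehouse schema")
    warehouse.append({"id": int(record["id"]), "amount": float(record["amount"])})

def lake_insert(raw_payload):
    lake.append(raw_payload)        # store first, exactly as received

def lake_read():
    # Schema-on-read: structure is applied only when the data is used.
    return [json.loads(p) for p in lake]

warehouse_insert({"id": "1", "amount": "9.99"})
lake_insert('{"id": 1, "note": "any shape is accepted"}')
print(warehouse[0])            # {'id': 1, 'amount': 9.99}
print(lake_read()[0]["note"])  # any shape is accepted
```

This is why the warehouse suits reporting (clean, predictable rows) while the lake suits exploratory analytics over data whose shape was not known in advance.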
3. Big data analytics

• In the analysis layer, data gets passed through several tools, shaping it into actionable insights.
• There are four types of analytics on big data:
• Descriptive: describes the current state of a business through historical data.
• Diagnostic: explains why a problem is happening.
• Predictive: projects future results based on historical data.
• Prescriptive: takes predictive analytics a step further by recommending the best future actions.
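The difference between the descriptive and predictive types can be made concrete with a toy revenue series; the numbers and the straight-line trend forecast are assumptions for demonstration only, not a real forecasting method.

```python
from statistics import mean

revenue = [100, 110, 120, 130]   # historical data

# Descriptive: summarize the past/current state.
average = mean(revenue)          # 115

# Predictive: project the next value from the historical trend.
step = mean(b - a for a, b in zip(revenue, revenue[1:]))  # average change = 10
forecast = revenue[-1] + step    # 140

print(average, forecast)  # 115 140
```

Diagnostic analytics would then ask *why* revenue grew by 10 each period, and prescriptive analytics would recommend the action most likely to keep it growing.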
Big data solutions typically deal with the following types of workload:

• Batch processing of big data sources at rest (distributed processing).
• Real-time processing of big data in motion (stream processing).
• Interactive exploration of big data (analytics).
• Predictive analytics and machine learning.
4. Consumption (end user)

• The final big data component is presenting the information in a format digestible to the end user.
• This can take the form of:
• tables,
• advanced visualizations, and even single numbers if requested.
• The most important thing in this layer is making sure the intent and meaning of the output are understandable.
BIG DATA MANAGEMENT TOOLS

• These days, organizations are realising the value they get out of big data analytics, and hence they are deploying big data tools and processes to bring more efficiency to their work environment (e.g., Secoda, Collibra).
• Collibra is a data catalog platform and tool that helps organizations better understand and manage their data assets. Collibra helps create an inventory of data assets, capture information (metadata) about them, and govern these assets.
• Secoda is a tool for writing queries to search company data.
Challenges of conventional systems
• Big data is the storage and analysis of large data sets.
• These are complex data sets that can be either structured or unstructured.
• They are so large that it is not possible to work on them with traditional analytical tools.
• One of the major challenges of conventional systems was the uncertainty of data management.
• Big data is continuously expanding; new companies and technologies are being developed every day (Google, Amazon, Netflix).
• Trusting the quality of data is difficult, and data security and privacy are a challenge.
• Conventional systems were not designed to be user-friendly for data extraction.
• A big challenge for companies is to find out which technology works best for them without introducing new risks and problems.
BIG DATA AS A SERVICE

• Big Data has created a demand for scalable, flexible and affordable data
management platforms to meet modern compute requirements.
• Big Data as a Service (BDaaS) integrates many of the functionalities and
benefits of SaaS, IaaS, PaaS and DaaS, and leverages additional resources
in the market for analyzing Big Data.
• Big Data as a Service encompasses the software, data warehousing,
infrastructure and platform service models in order to deliver advanced
analysis of large data sets, generally through a cloud-based network.
• It is a solution-based system designed to provide organizations with the
wide-ranging capabilities to gain insights from data.
Some widely used data analytics tools
• Data analytics tools not only report the results of the data but also explain
why the results occurred to help identify weaknesses, fix potential problem
areas, alert decision-makers to unforeseen events and even forecast future
results based on decisions the company might make.
• R Programming (Leading Analytics Tool in the industry)
• Python
• Excel
• SAS
• Apache Spark
• Splunk
• RapidMiner
• Tableau Public
Orchestration
• Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data and move data between multiple sources and sinks.
• These workflows then load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, an orchestration technology such as Azure Data Factory or Apache Oozie with Sqoop is used.
• Workflow:
source data -> move data between sources and sinks -> load processed data for analytics -> display the results on a dashboard
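This workflow can be sketched as a linear pipeline of steps in plain Python, standing in for what a tool such as Azure Data Factory or Oozie schedules; the step implementations and data are hypothetical placeholders.

```python
def extract():
    return [3, 1, 2]                      # pull source data

def move(data):
    return sorted(data)                   # move/transform between sources and sinks

def load(data):
    return {"rows": len(data), "max": max(data)}  # load into an analytical store

def display(summary):
    return f"dashboard: {summary['rows']} rows, max={summary['max']}"

# The orchestrator's job is just to run each step in order,
# passing each result to the next step.
steps = [extract, move, load, display]
result = None
for step in steps:
    result = step(result) if result is not None else step()
print(result)  # dashboard: 3 rows, max=3
```

Real orchestrators add what this sketch omits: scheduling, retries on failure, and dependency tracking between steps.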
A big data architecture is designed to handle the ingestion, processing,
and analysis of data that is too large or complex for traditional database
systems.
Stream processing.
• After capturing real-time messages, the solution must process them by
filtering, aggregating, and otherwise preparing the data for analysis.
• The processed stream data is then written to an output sink. Azure
Stream Analytics provides a managed stream processing service based
on perpetually running SQL queries that operate on unbounded streams.
• Open-source Apache streaming technologies like Storm and Spark Streaming can also be used in an HDInsight cluster.
• Azure HDInsight is a service offered by Microsoft that enables us to use open-source frameworks for big data analytics.
• Azure HDInsight allows the use of frameworks like Hadoop, Apache
Spark, Apache Hive, LLAP, Apache Kafka, Apache Storm, R, etc., for
processing large volumes of data.
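The core operation such perpetual queries perform over an unbounded stream is windowed aggregation. Below is a sketch of a tumbling-window sum in plain Python; the events and the 10-second window size are assumptions for illustration.

```python
from collections import defaultdict

events = [  # (timestamp_in_seconds, value)
    (1, 5), (4, 7), (12, 3), (15, 1), (23, 9),
]

WINDOW = 10  # seconds per tumbling window
windows = defaultdict(int)
for ts, value in events:
    windows[ts // WINDOW] += value   # each event lands in exactly one window

# In a stream processor, each window's sum would be written to the
# output sink as soon as that window closes.
print(dict(windows))  # {0: 12, 1: 4, 2: 9}
```

A tumbling window partitions time into fixed, non-overlapping segments, which is why every event is counted exactly once; hopping and sliding windows relax that property.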
Analysis and reporting.
• The goal of most big data solutions is to provide insights into the data
through analysis and reporting. To empower users to analyze the data,
the architecture may include a data modelling layer, such as a
multidimensional OLAP cube or tabular data model in Azure Analysis
Services.
• It might also support self-service BI, using the modelling and
visualization technologies in Microsoft Power BI or Microsoft Excel.
• Analysis and reporting can also take the form of interactive data
exploration by data scientists or data analysts. For these scenarios,
many Azure services support analytical notebooks, such as Jupyter,
enabling these users to leverage their existing skills with Python or R.
Reporting vs Analytics
• Reporting is built on a data warehouse (DWH); analysis is built on a data lake.
• A DWH contains structured data; a data lake holds unstructured data.
• A DWH collects data from many sources (flat files, spreadsheets, DBs, apps, etc.) in one place; a data lake collects all kinds of data (structured and unstructured).
• A DWH is designed for report generation; a data lake is designed for big data analytics.
• Reporting is used to provide facts, which stakeholders can use for presentations; analytics offers pre-analyzed conclusions that a company can use to solve problems and improve its performance.
• Reporting presents the actual data to end users, after collecting, sorting, and summarizing it to make it easy to understand; analytics does not present the data but instead draws information from the available data and uses it to generate insights, forecasts, and recommended actions.
• Reporting is mostly done by automated tools; analytics requires skill sets.
Intelligent data analysis(IDA)
Data Intelligence Definition:
• Data intelligence is the use of various tools and methods to analyze
and transform data into information from which valuable insight can
be drawn.
• Data intelligence refers to the practice of using artificial intelligence
and machine learning tools to analyze and transform massive datasets
into intelligent data insights, which can then be used to improve
services and investments.
• The application of data intelligence tools and techniques can help
decision makers develop a better understanding of collected
information with the goal of developing better business processes.
What is Intelligent Data Analysis?

• Intelligent data analysis refers to the use of analysis, classification, conversion, extraction, organization, and reasoning methods to extract useful knowledge from data.
• This process generally consists of four stages:
1. the data preparation stage,
2. the data mining stage,
3. the result validation stage,
4. and the explanation stage.
• Data preparation involves the integration of required data into a dataset that will be used for data mining;
• data mining involves examining large databases in order to generate new information;
• result validation involves the verification of patterns produced by data mining algorithms.
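These stages can be sketched as a tiny pipeline; the frequency count standing in for a real mining algorithm, the sample shopping-basket data, and the support threshold are all assumptions for illustration.

```python
from collections import Counter

def prepare(raw_sources):
    # Stage 1: integrate required data from several sources into one
    # cleaned dataset (here: strip whitespace, normalize case).
    return [item.strip().lower() for source in raw_sources for item in source]

def mine(dataset):
    # Stage 2: examine the data to generate new information (patterns).
    return Counter(dataset).most_common()

def validate(patterns, min_support=2):
    # Stage 3: keep only patterns with enough supporting evidence.
    return [(item, n) for item, n in patterns if n >= min_support]

raw = [["Milk", "bread"], ["milk ", "eggs"], ["MILK", "bread"]]
patterns = validate(mine(prepare(raw)))
# Stage 4: explain the validated result to the user.
print(patterns)  # [('milk', 3), ('bread', 2)]
```

Note how validation discards the one-off 'eggs' pattern: without that stage, noise from the mining step would flow straight into the explanation given to decision makers.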
Five major components of IDA
1. descriptive data,
2. prescriptive data,
3. diagnostic data,
4. decisive data, and
5. predictive data.
These disciplines focus on understanding data, developing alternative
knowledge, resolving issues, and analyzing historical data to predict
future trends.
Some industries with the greatest need for data intelligence include
1)cybersecurity, 2) finance,3) health, 4)insurance, and 5) law
enforcement. Intelligent data capture technology is a valuable
application in these industries for transforming print documents or
images into meaningful data.
Big Data Tools & Software for Analytics (2022)
• Tableau
• Apache Hadoop
• Apache Spark -- a unified analytics engine for large-scale data processing
• Zoho Analytics
• MongoDB
• Xplenty
Modern data analytic tools
• These days, organizations are realising the value they get out of big data
analytics and hence they are deploying big data tools and processes to bring
more efficiency to their work environment.
• Many big data tools and processes are being utilised by companies these
days in the processes of discovering insights and supporting decision making.
• Data Analytics tools are types of application software that retrieve data from
one or more systems and combine it in a repository, such as a data
warehouse, to be reviewed and analysed.
• Most organizations use more than one analytics tool including spreadsheets
with statistical functions, statistical software packages, data mining tools,
and predictive modelling tools.
• Together, these Data Analytics Tools give the organization a complete
overview of the company to provide key insights and understanding of the
market/business so smarter decisions may be made.
Microsoft Azure
• Microsoft Azure provides robust services for analyzing big data. One
of the most effective ways is to store your data in Azure Data Lake
Storage Gen2 and then process it using Spark on Azure Databricks.
• Azure Stream Analytics (ASA) is Microsoft’s service for real-time data
analytics.
• Examples: stock trading analysis,
• fraud detection,
• embedded sensor analysis,
• and web clickstream analytics.
ASA uses Stream Analytics Query Language, which is a variant of T-SQL.
That means anyone who knows SQL will have a fairly easy time learning
how to write jobs for Stream Analytics.
