
Unit 2

Data Analytics Process


Zikra Shaikh
2.1 Domain Specific Examples of Big Data

Web, Financial, Healthcare, Internet of Things, Environment, Logistics & Transportation, Industry, Retail

2.2 Analytics Flow for Big Data

Data Collection, Data Preparation, Analysis Types, Analysis Modes, Visualizations

2.3 Big Data Stack

Raw Data Sources, Data Access Connectors, Data Storage, Batch Analytics, Real-time Analytics, Interactive Querying, Serving Databases, Web & Visualization Frameworks

2.4 Mapping Analytics Flow to Big Data Stack

2.5 Case Study: Genome Data Analysis

2.6 Case Study: Weather Data Analysis

2.7 Analytics Patterns


Big data is everywhere, and here are some domain-specific examples to illustrate how it's used:

1. Web: In the web domain, big data is used by online platforms and e-commerce websites to
analyze user behavior. This includes tracking user clicks, page views, and interactions. For
instance, platforms like Amazon use big data to personalize recommendations based on a
user's browsing and purchase history, enhancing the overall shopping experience.

2. Financial: Financial institutions leverage big data for risk management, fraud detection, and
customer insights. Credit card companies analyze transaction data in real-time to identify
unusual patterns that may indicate fraudulent activities. Additionally, big data analytics is
employed for predicting market trends and optimizing investment strategies.

3. Healthcare: In healthcare, big data is applied to enhance patient care and optimize
healthcare processes. Electronic health records (EHRs) are analyzed to identify patterns,
improve treatment plans, and predict disease outbreaks. Big data analytics also plays a crucial
role in genomics, helping researchers and clinicians analyze large-scale genomic data for
personalized medicine.
4. Internet of Things (IoT): IoT devices generate massive amounts of data, and big data
analytics is essential for extracting meaningful insights. In smart cities, sensors on traffic
lights, waste management systems, and public transportation are interconnected. Big data
is used to analyze this data in real-time, optimizing traffic flow, reducing energy
consumption, and improving overall city management.

5. Environment: Big data is instrumental in environmental studies, particularly in climate
monitoring and prediction. Climate scientists analyze vast datasets from satellites, weather
stations, and ocean buoys to model climate patterns. This information is crucial for
understanding climate change, predicting natural disasters, and making informed decisions
for environmental conservation.

6. Logistics & Transportation: In logistics and transportation, big data is used for route
optimization, predictive maintenance, and supply chain management. Companies like UPS
use big data analytics to optimize delivery routes, reduce fuel consumption, and enhance
overall operational efficiency. Predictive maintenance helps prevent breakdowns, ensuring
continuous and reliable transportation services.
7. Industry: Manufacturing industries leverage big data for quality control, process
optimization, and predictive maintenance. Sensors on production lines generate vast
amounts of data, which is analyzed in real-time to identify defects, optimize
production processes, and predict when machinery requires maintenance. This
improves overall efficiency and reduces downtime.

8. Retail: Retailers use big data to analyze customer purchasing patterns, optimize
inventory management, and personalize marketing strategies. For instance,
supermarkets analyze customer purchase data to optimize inventory levels, ensuring
products are always available. Online retailers use big data to personalize
recommendations and promotions based on customer preferences and browsing
history.
Analytics flow for big data:
1. *Data Collection*: This is where we gather all the relevant data from various sources
such as databases, sensors, or social media platforms. Think of it as collecting pieces of
a puzzle.

2. *Data Preparation*: After collecting the data, we need to clean and organize it. This
step involves removing any errors or inconsistencies and formatting the data in a way
that's suitable for analysis. It's like sorting and arranging the puzzle pieces so they fit
together neatly.

3. *Analysis Types*: There are different ways we can analyze the data depending on
what we want to find out. For example, we might use descriptive analysis to summarize
the data, predictive analysis to forecast future trends, or prescriptive analysis to
recommend actions based on the data.
4. *Analysis Modes*: Once we know what type of analysis we want to perform,
we choose the mode of analysis. This could be batch processing, where we
analyze a large amount of data at once, or real-time processing, where we
analyze data as it's generated. It's like deciding whether to solve the puzzle all
at once or piece by piece as we go.

5. *Visualizations*: Finally, we present the results of our analysis in a visual
format, such as charts, graphs, or dashboards. This makes it easier for people
to understand the insights gained from the data. Think of it as putting together
the completed puzzle so others can see the big picture.
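To make the flow concrete, here is a minimal end-to-end sketch in Python using pandas and matplotlib. The file name sales.csv and its columns (region, amount) are hypothetical stand-ins for whatever data is actually collected; the steps map one-to-one to the five stages above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# 1. Data Collection: load raw records from a (hypothetical) CSV export.
df = pd.read_csv("sales.csv")

# 2. Data Preparation: remove errors and inconsistencies.
df = df.dropna(subset=["amount"])                     # drop records with missing values
df = df[df["amount"] > 0]                             # discard inconsistent (negative) amounts
df["region"] = df["region"].str.strip().str.title()   # normalise a text column

# 3. Analysis Type (descriptive): summarise revenue per region.
summary = df.groupby("region")["amount"].agg(["count", "sum", "mean"])
print(summary)

# 4. Analysis Mode: this is a batch run over the whole file; a real-time
#    variant would apply the same logic to records as they arrive.

# 5. Visualization: a simple bar chart of total revenue per region.
summary["sum"].plot(kind="bar", title="Revenue by region")
plt.tight_layout()
plt.show()
```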
Big data stack:

1. *Raw Data Sources*: These are the original sources where data is generated or
collected, such as databases, sensors, or social media platforms. It's like the starting
point where all the data comes from.

2. *Data Access Connectors*: These are tools or interfaces that allow us to access and
retrieve data from different sources. They act as bridges between the raw data sources
and the rest of the big data stack, ensuring smooth data flow. Think of them as
connectors that link the raw data sources to the rest of the system.

3. *Data Storage*: This is where the collected data is stored for future use and analysis.
It could be in traditional databases, data lakes, or distributed file systems like Hadoop
Distributed File System (HDFS). It's like the storage room where we keep all the puzzle
pieces safe and organized.
4. *Batch Analytics*: Batch analytics involves processing and analyzing large volumes
of data in batches or chunks. It's useful for tasks that don't require immediate results,
such as historical analysis or periodic reporting. Think of it as solving the puzzle piece by
piece, but not necessarily in real-time.

5. *Real-time Analytics*: Real-time analytics, on the other hand, involves processing
and analyzing data as it's generated, providing immediate insights and responses. It's
like solving the puzzle as soon as you receive each piece, allowing for quick reactions
and decision-making.

6. *Interactive Querying*: This refers to the ability to interactively query and explore the
data stored in the system. It allows users to ask ad-hoc questions and receive instant
responses, facilitating exploratory data analysis and troubleshooting. Think of it as being
able to search for specific puzzle pieces and get instant answers.
7. *Serving Databases*: Databases designed for serving data quickly and
efficiently, such as NoSQL databases (e.g., MongoDB or Apache Cassandra),
store processed results for easy retrieval. It's like having a well-organized
library where you can easily find the book you need.

8. *Web & Visualization Frameworks*: These are the tools and frameworks
used to serve the analyzed data to end-users, whether through databases, web
applications, or visualization tools like Tableau or Power BI. They make the
insights gained from the data accessible and understandable to non-technical
users. It's like putting the puzzle together in a way that others can see and
understand the complete picture.
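As a rough illustration of how the storage, batch-analytics, and serving layers fit together, the sketch below uses PySpark. The HDFS paths and column names (timestamp, user_id, value) are assumptions for illustration only; in practice the aggregated result would typically be loaded into a serving database such as MongoDB or Cassandra for the web and visualization layer to query.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-analytics-sketch").getOrCreate()

# Data Storage: raw events previously landed in HDFS (path is hypothetical).
events = spark.read.csv("hdfs:///data/raw/events.csv", header=True, inferSchema=True)

# Batch Analytics: process a large historical dataset in one job,
# e.g. daily activity per user (column names are assumptions).
daily_activity = (
    events
    .withColumn("day", F.to_date("timestamp"))
    .groupBy("day", "user_id")
    .agg(F.count("*").alias("events"), F.sum("value").alias("total_value"))
)

# Serving-layer hand-off: write a compact result set that a serving
# database or web/visualization framework can load and query quickly.
daily_activity.write.mode("overwrite").parquet("hdfs:///data/serving/daily_activity")

spark.stop()
```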
Mapping the analytics flow to the Big Data Stack means aligning the stages and
processes involved in data analytics with the various components of a Big Data
technology stack. It involves understanding how different tools and technologies within
the Big Data ecosystem can be employed to handle the various aspects of data
processing, storage, and analysis.

Here's a simplified breakdown:

Raw Data Sources:
● Identify where the data is coming from, such as sensors, databases, logs, or
any other sources.
Data Access Connectors:
● Determine how to connect and collect data from these sources efficiently. Use
connectors or pipelines to move data to the next stages.
Data Storage:
● Choose appropriate storage systems to house the data, considering factors like
volume, velocity, and variety of the data. This could involve distributed file
systems, databases, or cloud storage.
Batch Analytics:
● Decide how to process large volumes of data in batches to gain insights over
time. Utilize technologies like Apache Spark, Hadoop, or other batch
processing frameworks.
Real-time Analytics:
● Address the need for immediate insights by implementing real-time analytics
using technologies such as Apache Flink, Apache Kafka Streams, or other
stream processing frameworks.
Interactive Querying:
● Provide a way for users to interactively query and explore the data. Use
databases optimized for quick querying, like Apache Cassandra, or other
interactive query systems.
Serving Databases:
● Store processed and analyzed data in serving databases for quick and efficient
retrieval. This could involve NoSQL databases like MongoDB or traditional
relational databases.
Web & Visualization Frameworks:
● Present the results to end-users using web and visualization
frameworks. Utilize tools like Tableau, Power BI, or custom-built
dashboards with frameworks like D3.js for effective data visualization.

By mapping the analytics flow to the Big Data Stack, organizations can optimize
their data processing and analysis workflows, making use of the capabilities
offered by different components of the Big Data ecosystem. This ensures
efficient handling of large datasets and facilitates the extraction of valuable
insights from the data.
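One lightweight way to record such a mapping is as a simple configuration structure. The sketch below is illustrative only; the stage names and the candidate technologies are taken from this section and are examples, not prescriptions.

```python
# A hypothetical mapping of analytics-flow stages to candidate stack components,
# using the example technologies mentioned in this section.
flow_to_stack = {
    "raw_data_sources":   ["sensors", "databases", "logs"],
    "data_access":        ["connectors / ingestion pipelines"],
    "data_storage":       ["distributed file systems (HDFS)", "databases", "cloud storage"],
    "batch_analytics":    ["Apache Spark", "Hadoop MapReduce"],
    "realtime_analytics": ["Apache Flink", "Apache Kafka Streams"],
    "interactive_query":  ["Apache Cassandra", "other interactive query systems"],
    "serving_databases":  ["MongoDB", "relational databases"],
    "visualization":      ["Tableau", "Power BI", "D3.js dashboards"],
}

for stage, options in flow_to_stack.items():
    print(f"{stage:<20} -> {', '.join(options)}")
```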
Case study on Genome Data Analysis:

A research institution is conducting a study to understand the genetic factors influencing a
rare disease. They have collected genomic data from a large group of individuals and aim
to analyze this data to identify potential genetic markers associated with the disease.

Big Data Stack Implementation:

Raw Data Sources:
● Genomic data is collected from various sources, including DNA sequencing
machines and public genetic databases. This data is incredibly large and
complex, consisting of millions of DNA sequences.
Data Access Connectors:
● Data connectors are used to aggregate genomic data from different sources
and ensure compatibility for further processing. These connectors help in
dealing with the diverse formats and structures of genomic data.
Data Storage:
● The raw genomic data is stored in a distributed file system or a specialized genomic
database. This storage system is designed to handle the massive volume of data
and provide efficient retrieval.
Batch Analytics:
● Batch analytics processes involve running complex algorithms on the entire genomic
dataset. This may include identifying variations, mutations, and patterns in the
genetic code that could be linked to the rare disease. Technologies like Apache
Spark or Hadoop MapReduce could be employed for this.
Real-time Analytics:
● Real-time analytics could be applied for immediate analysis of newly sequenced
genomes. This is particularly useful for identifying urgent insights or adjusting the
analysis approach based on ongoing findings.
Interactive Querying:
● Researchers may need to interactively query specific genes, regions, or mutations.
Interactive querying tools, possibly built on top of distributed databases like Apache
HBase, allow researchers to explore specific aspects of the genomic data.
Serving Databases:
● Processed and analyzed genomic data is stored in databases optimized for quick
access. This allows researchers to efficiently retrieve relevant genetic information
during the study. NoSQL databases like MongoDB could be utilized for this purpose.
Web & Visualization Frameworks:
● Results from the genomic analysis are presented using web-based visualization
frameworks. Researchers can use tools like GenomeBrowse or custom-built
visualizations to explore and interpret genetic variations, making it easier to identify
potential genetic markers associated with the rare disease.
Outcomes:
● The research institution can uncover potential genetic markers associated with the rare
disease.
● Insights gained can contribute to a better understanding of the disease's genetic basis.
● This information may lead to the development of targeted treatments or interventions.

In this case study, the Big Data Stack plays a crucial role in handling the vast and complex
genomic data, performing in-depth analysis, and presenting the findings in a way that aids
researchers in understanding the genetic factors influencing the rare disease.
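The PySpark sketch below gives a rough idea of what a batch-analytics step in this case study might look like. The input path, the tab-separated variant table, its columns (sample_id, gene, variant, status), and the case-versus-control filter are all assumptions made for illustration; real pipelines usually start from VCF files and apply proper statistical tests rather than a simple count comparison.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("genome-batch-sketch").getOrCreate()

# Assumed input: a tab-separated table of called variants with columns
# sample_id, gene, variant, and status ('case' or 'control').
variants = spark.read.csv(
    "hdfs:///genomics/variants.tsv", sep="\t", header=True, inferSchema=True
)

# Batch Analytics: count how many affected (case) versus unaffected (control)
# individuals carry each variant -- a rough first pass for spotting candidate
# markers that deserve deeper statistical follow-up.
counts = (
    variants
    .groupBy("gene", "variant", "status")
    .agg(F.countDistinct("sample_id").alias("carriers"))
)

pivoted = (
    counts.groupBy("gene", "variant")
    .pivot("status", ["case", "control"])
    .sum("carriers")
    .fillna(0)
)

# Keep variants that are noticeably more common in cases (illustrative threshold).
candidates = pivoted.filter(F.col("case") > F.col("control") * 2)
candidates.write.mode("overwrite").parquet("hdfs:///genomics/candidate_markers")

spark.stop()
```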
Case study on Weather Data Analysis:

A meteorological agency wants to improve its weather prediction models by analyzing
historical and real-time weather data. The goal is to enhance accuracy in forecasting and
provide more timely and precise information to the public.

Big Data Stack Implementation:

Raw Data Sources:
● Meteorological stations, satellites, and other sensors collect vast amounts of
weather data, including temperature, humidity, wind speed, and atmospheric
pressure. Real-time data is continuously streamed from these sources.
Data Access Connectors:
● Data connectors are used to aggregate data from various sources, ensuring a
smooth flow of information from meteorological stations, satellites, and other
data-producing devices to the central data processing system.
Data Storage:
● Raw weather data is stored in a distributed and scalable data storage system. This
could be a combination of cloud-based storage solutions and on-premise databases,
capable of handling large volumes of historical and real-time data.
Batch Analytics:
● Batch analytics processes historical weather data to identify long-term trends,
patterns, and seasonal variations. This analysis helps in refining predictive models
and improving the accuracy of long-term weather forecasts. Technologies like
Apache Spark or Hadoop can be employed for this purpose.
Real-time Analytics:
● Real-time analytics processes the continuously streaming data from various sensors.
This analysis provides insights into the current weather conditions, enabling more
accurate short-term predictions. Stream processing frameworks like Apache Flink or
Apache Kafka Streams may be utilized for real-time analytics.
Interactive Querying:
● Meteorologists and weather analysts may need to interactively query specific
regions, time periods, or meteorological parameters. Interactive querying tools built
on top of databases like Apache Cassandra or Amazon DynamoDB allow for quick
and flexible access to specific weather data.
Serving Databases:
● Processed and analyzed weather data is stored in serving databases optimized for quick
access. This enables the meteorological agency to provide timely and accurate information
to the public, as well as other stakeholders. NoSQL databases like MongoDB or traditional
relational databases could be used for this purpose.
Web & Visualization Frameworks:
● Weather forecasts and predictions are presented to the public through web and
visualization frameworks. Interactive maps, charts, and dashboards, possibly built using
tools like D3.js or Leaflet, help convey the weather information in an easily understandable
format.
Outcomes:
● Improved accuracy in long-term weather forecasts.
● Enhanced ability to provide real-time weather updates and warnings.
● Better-informed decision-making for various sectors, including agriculture, transportation, and
emergency response.

In this case study, the Big Data Stack facilitates the efficient handling of vast and dynamic weather
data, enabling comprehensive analysis, accurate predictions, and timely communication of weather
information to the public and other stakeholders.
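A minimal real-time analytics sketch for this case study, using Spark Structured Streaming, is shown below. The Kafka broker address, the topic name weather-readings, and the JSON message schema are assumptions, and the example presumes the Spark Kafka connector is available; a production system would also handle late data, persistence, and alerting.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("weather-stream-sketch").getOrCreate()

# Assumed JSON schema for readings published by stations and sensors.
schema = StructType([
    StructField("station_id", StringType()),
    StructField("observed_at", TimestampType()),
    StructField("temperature", DoubleType()),
    StructField("humidity", DoubleType()),
    StructField("wind_speed", DoubleType()),
])

# Real-time Analytics: consume the stream of readings from Kafka
# (broker address and topic name are hypothetical).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "weather-readings")
    .load()
)

readings = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .select("r.*")
)

# Sliding 10-minute averages per station, tolerating 5 minutes of late data.
averages = (
    readings
    .withWatermark("observed_at", "5 minutes")
    .groupBy(F.window("observed_at", "10 minutes"), "station_id")
    .agg(F.avg("temperature").alias("avg_temp"), F.avg("wind_speed").alias("avg_wind"))
)

# For the sketch, print updates to the console; a real deployment would
# write to a serving database or dashboard instead.
query = averages.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```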
Analytics patterns refer to recurring approaches or methodologies used in
data analysis to solve common problems or achieve specific goals. These
patterns provide guidance on how to structure and conduct data analysis tasks
efficiently and effectively. Here are some common analytics patterns:

1. *Descriptive Analytics*: This pattern involves summarizing historical data to
gain insights into past events or trends. It focuses on answering questions like
"What happened?" and includes techniques such as data aggregation,
summarization, and visualization.

2. *Diagnostic Analytics*: Diagnostic analytics aims to understand why certain
events occurred by identifying patterns, correlations, or relationships in the
data. It helps uncover root causes behind observed phenomena and supports
troubleshooting and problem-solving efforts.
3. *Predictive Analytics*: Predictive analytics uses historical data to forecast
future outcomes or trends. It involves building predictive models using
statistical techniques, machine learning algorithms, or other predictive
modeling approaches to make educated predictions based on past patterns.

4. *Prescriptive Analytics*: Prescriptive analytics goes beyond predicting
future outcomes by recommending specific actions or interventions to achieve
desired objectives. It combines predictive models with optimization algorithms
or decision-making frameworks to provide actionable insights and guidance.

5. *Text Analytics*: Text analytics focuses on extracting insights and patterns
from unstructured text data, such as emails, social media posts, or customer
reviews. It includes techniques like natural language processing (NLP),
sentiment analysis, and topic modeling to analyze and interpret textual data.
6. *Spatial Analytics*: Spatial analytics deals with analyzing data that has a
geographic or spatial component, such as maps, GPS coordinates, or spatial
databases. It involves techniques like spatial clustering, spatial interpolation,
and spatial regression to understand spatial relationships and patterns in the
data.

7. *Temporal Analytics*: Temporal analytics focuses on analyzing data over
time to identify temporal trends, patterns, or seasonality. It involves time series
analysis, event sequence analysis, and trend detection techniques to uncover
insights related to temporal changes in the data.

These analytics patterns serve as building blocks for designing and
implementing data analysis workflows and can be combined or adapted to
address specific analytical challenges or business objectives effectively.
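To ground a few of these patterns, here is a small sketch combining descriptive, diagnostic-style, and predictive analytics with pandas and scikit-learn. The monthly sales data is synthetic and the linear model is deliberately simple; the point is only to show how the patterns differ in intent.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic monthly sales data (illustrative only).
rng = np.random.default_rng(42)
months = np.arange(1, 25)
sales = 100 + 5 * months + rng.normal(0, 10, size=months.size)
df = pd.DataFrame({"month": months, "sales": sales})

# Descriptive analytics: "What happened?" -- summarise the history.
print(df["sales"].describe())

# Diagnostic-style check: how strongly is time correlated with sales?
print("correlation(month, sales) =", round(df["month"].corr(df["sales"]), 2))

# Predictive analytics: fit a simple trend and forecast the next quarter.
model = LinearRegression().fit(df[["month"]], df["sales"])
future = pd.DataFrame({"month": [25, 26, 27]})
print("forecast:", model.predict(future).round(1))

# Prescriptive analytics would go one step further, e.g. recommending stock
# levels or staffing that optimise an objective given these forecasts.
```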
