STREAMING
KIT-601
INTRODUCTION
We generate and transmit vast amounts of digital data every second. It is fair to say that massive data surrounds us. Such continuously generated and transmitted data is called a Data Stream. However, extracting valuable knowledge from this big data is a major task: it takes a great deal of time, effort, and skill to mine insights from massive data.
Mining data streams can be considered a subset of the general concepts of machine learning, knowledge extraction, and data mining.
TYPES OF DATA STREAMS
A data stream is a (possibly unbounded) sequence of tuples. Each tuple comprises a set of attributes, similar to a row in a database table.
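The definition above can be sketched in code. Below is a minimal illustration of a stream as a sequence of attribute tuples; the `Reading` type and its field names are hypothetical, chosen only for the example:

```python
from typing import Iterator, NamedTuple

class Reading(NamedTuple):
    # Each tuple carries a set of attributes, like a row in a database table.
    sensor_id: str
    timestamp: int
    value: float

def sensor_stream(n: int) -> Iterator[Reading]:
    # A real data stream may be unbounded; here it is bounded to n tuples for the demo.
    for t in range(n):
        yield Reading(sensor_id="s1", timestamp=t, value=20.0 + t * 0.5)

first = next(sensor_stream(3))
```

A consumer would iterate over such a generator one tuple at a time rather than loading the whole sequence at once.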
2. IMAGE DATA
Satellites frequently send down to earth streams containing many terabytes of images per day. Surveillance cameras generate images with lower resolution than satellites, but there can be numerous cameras, each producing a stream of images at intervals of one second.
• Time Sensitive: Data streams are time-sensitive, and elements of a data stream carry timestamps with them. A data stream is relevant only for a certain period, after which it loses its significance.
• Data Volatility: No data is stored in data streaming, as it is volatile. Once the data mining and analysis are done, the information is summarized or discarded.
• Concept Drift: Data streams are very unpredictable. The data changes or evolves with time, for in this dynamic world nothing is constant.
DATA STREAMING PROCESS
Data streaming refers to the practice of sending, receiving, and processing information as a continuous stream rather than in discrete batches. It involves six main steps.
1. DATA PRODUCTION
• The first step in the process is when data is generated and sent from various sources such as IoT devices, web applications, or social media platforms. The data can be in different formats, such as JSON or CSV, and can have different characteristics: structured or unstructured, static or dynamic, etc.
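As a concrete sketch of this step, a producer might emit JSON-encoded events; the event fields here are illustrative, not tied to any particular platform:

```python
import json

def produce_event(user_id: str, action: str, ts: int) -> str:
    # Serialize one application event to JSON, a common wire format for streams.
    return json.dumps({"user_id": user_id, "action": action, "ts": ts})

event = produce_event("u42", "click", 1700000000)
```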
2. DATA INGESTION
• In the second step, the data is received and stored by consumers such as streaming platforms or message brokers. These consumers can use various technologies and streaming data architectures, such as Kafka or Estuary Flow streaming pipelines, to handle the volume, velocity, and variety of the data. The consumers can also perform some basic operations on the data, such as validation or enrichment, before passing it to a stream processor.
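A minimal sketch of ingestion-time validation and enrichment; the required `user_id` field and the `source` tag are assumptions made for the example, not part of any real broker's API:

```python
import json

def ingest(raw_messages):
    # Validate each raw JSON message and enrich it before passing it downstream.
    accepted = []
    for raw in raw_messages:
        try:
            msg = json.loads(raw)
        except json.JSONDecodeError:
            continue  # validation: drop malformed messages
        if "user_id" not in msg:
            continue  # validation: require a key field
        msg["source"] = "web"  # enrichment: tag the message with its provenance
        accepted.append(msg)
    return accepted

batch = ingest(['{"user_id": "u1"}', 'not json', '{"other": 1}'])
```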
3. DATA PROCESSING
• Next, the data is analyzed and acted on by data processing tools. These tools can use various frameworks to perform complex operations on the data, such as filtering, aggregation, transformation, machine learning, etc. Usually, the processing is tightly integrated with the platform that ingested the streaming data.
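The filtering and aggregation operations named above can be sketched as a toy in-memory version; real engines such as Spark or Flink distribute the same kind of work across a cluster:

```python
def process(events):
    # Filter out invalid readings, then aggregate a running total per key.
    totals = {}
    for key, value in events:
        if value < 0:  # filtering: discard negative (invalid) values
            continue
        totals[key] = totals.get(key, 0) + value  # aggregation: sum per key
    return totals

result = process([("a", 3), ("b", -1), ("a", 2), ("b", 5)])
```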
4. STREAMING DATA ANALYTICS
• This is the step where the data is further explored and interpreted by analysts or data scientists. The analysts can use various techniques and methods to discover patterns and trends in the data, such as descriptive analytics, predictive analytics, prescriptive analytics, etc. They can also produce outputs based on the analysis, such as dashboards, charts, maps, alerts, etc.
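Descriptive analytics, the simplest of the techniques listed, can be illustrated with a summary over one window of values; this is purely a sketch using the standard library:

```python
import statistics

def describe(values):
    # Descriptive analytics: summarize one window of observations.
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }

summary = describe([10, 12, 11, 15])
```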
5. DATA REPORTING
• To make sense of the analyzed data, it is summarized and communicated by reporting tools. The reports can be generated in various formats, such as documents, emails, slideshows, webinars, etc.
• This step can also use various metrics and indicators to measure and monitor performance against goals and objectives.
6. DATA VISUALIZATION & DECISION MAKING
• In the final step, data is displayed and acted upon by the decision-makers such as leaders or
customers. The decision-makers can use various types and styles of visualizations to understand and
explore the data such as tables, charts, maps, graphs, etc.
• The decision-makers can make timely decisions such as optimizing processes, improving products,
enhancing customer experience, etc.
STREAMING PROCESS ARCHITECTURE
DATA SOURCES
Data streams originate from a range of sources, in numerous formats and at varying volumes. These sources can be apps, networked devices, server log files, online activities of various kinds, and location-based data. All of these sources can be collected in real time to form a single main source for real-time analytics and information.
One example of streaming data is a ride-sharing app. If you make a booking on Uber or Ola, you are matched with a driver in real time, and the app can tell you how far away the driver is and how long it will take to reach your destination based on real-time traffic data.
MESSAGE BROKER
• Message brokers act as buffers between the data sources and the stream processing engine. They collect data from various sources, convert it to a standard message format (such as JSON or Avro), and then stream it continuously for consumption by other components.
• A message broker also provides features such as scalability, fault tolerance, load balancing,
partitioning, etc. Some examples of message brokers are Apache Kafka, Amazon Kinesis
Streams, etc.
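To make the buffering role concrete, here is a toy in-memory broker; this is a deliberately simplified stand-in, since real brokers like Kafka add persistence, partitioning, replication, and fault tolerance:

```python
from collections import defaultdict, deque

class MiniBroker:
    # Buffers messages per topic between producers and consumers.
    def __init__(self):
        self._topics = defaultdict(deque)

    def publish(self, topic, message):
        self._topics[topic].append(message)

    def consume(self, topic):
        # Return the oldest buffered message, or None if the topic is empty.
        queue = self._topics[topic]
        return queue.popleft() if queue else None

broker = MiniBroker()
broker.publish("rides", {"driver": "d7"})
msg = broker.consume("rides")
```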
STREAM PROCESSING ENGINE
• This is the core component that processes streaming data. It can perform
various operations such as filtering, aggregation, transformation, enrichment,
windowing, etc.
• A stream processing engine can also support Complex Event Processing (CEP) which is
the ability to detect patterns in streaming data and trigger actions accordingly.
• Some popular stream processing tools are Apache Spark Streaming, Apache
Flink, Apache Kafka Streams, etc.
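Windowing, one of the operations listed above, can be sketched with tumbling (fixed-size, non-overlapping) windows; the bucket arithmetic below is a simplified version of what engines like Flink or Kafka Streams perform:

```python
def tumbling_window_counts(timestamps, width):
    # Count events per fixed-size (tumbling) time window.
    counts = {}
    for ts in timestamps:
        window_start = (ts // width) * width  # bucket each event by its window
        counts[window_start] = counts.get(window_start, 0) + 1
    return counts

counts = tumbling_window_counts([1, 2, 5, 6, 7, 11], width=5)
```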
DATA STORAGE
This component stores the processed or raw streaming data for later use. Data storage can be relational or non-relational, structured or unstructured, etc. Because of the large volume and diverse formats of event streams, many organizations opt to store their streaming event data in cloud object stores as an operational data lake. A standard method of loading data into the data storage is an ETL pipeline.
Some examples of data storage systems are Amazon S3, Hadoop Distributed
File System (HDFS), Apache Cassandra, Elasticsearch, etc.
APPLICATIONS
The utilization of location data
Fraud detection
Predictive analytics