
DATA STREAMING
KIT-601

INTRODUCTION
We generate and transmit vast amounts of digital data every second. It is fair to say that massive data surrounds us, and this continuously generated and transmitted data is called a data stream. Extracting valuable knowledge from such big data, however, is a major task: it takes considerable time, effort, and skill to mine insights from it.

Therefore, we need to apply data mining techniques to data streams so that valuable insights can be delivered from the data to the receiver's end. This article explains the data stream and its mining techniques in a simple, helpful way.
DATA STREAMING
Data streaming is the process of continuously collecting data as it's generated and
moving it to a destination. This data is usually handled by stream processing software
to analyze, store, and act on this information. Data streaming combined with stream
processing produces real-time intelligence.
A data stream is a continuous, fast-changing, and ordered chain of data transmitted at very high speed. Data leaves the sender's side and appears almost immediately at the receiver's side. Streaming does not mean downloading the data or storing it on storage devices.

Data streams in data mining can be considered a subset of the general concepts of machine learning, knowledge extraction, and data mining.
TYPES OF DATA STREAMS
A data stream is a (possibly unbounded) sequence of tuples. Each tuple comprises a set of attributes, similar to a row in a database table (a small sketch of this tuple view follows the examples below).

• Transactional data streams –
These log interactions between entities:
1. Credit card – purchases by consumers from producers
2. Telecommunications – phone calls by callers to the dialed parties
3. Web – accesses by clients of information at servers

• Measurement data streams –
These record measurements of ongoing phenomena:
1. Sensor networks – physical or natural phenomena, road traffic
2. IP network – traffic at router interfaces
3. Earth climate – temperature and humidity levels at weather stations
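As a concrete illustration of the tuple view above, here is a minimal Python sketch of a transactional (credit card) stream consumed lazily; the Purchase schema and all field names are illustrative assumptions, not part of any standard.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Hypothetical schema for a transactional (credit card) stream:
# each tuple is like a database row with a fixed set of attributes.
Purchase = namedtuple("Purchase", ["card_id", "merchant", "amount", "timestamp"])

def purchase_stream(feed):
    """Yield one tuple at a time: the stream is consumed as it
    arrives and is never materialized as a whole."""
    for card_id, merchant, amount in feed:
        yield Purchase(card_id, merchant, amount, datetime.now(timezone.utc))

# A finite stand-in for what would really be an unbounded feed.
feed = [("c-001", "grocer", 23.50), ("c-002", "fuel", 41.00)]
for record in purchase_stream(feed):
    print(record)
```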
SOURCES OF DATA STREAM
• Internet traffic
• Sensor data
• Real-time ATM transactions
• Live event data
• Call records
• Satellite data
• Audio streaming
• Video streaming
• Real-time surveillance systems
• Online transactions
EXAMPLES OF STREAM SOURCES
1. SENSOR DATA
In navigation systems, sensor data is used. Imagine a temperature sensor floating in the ocean, sending a reading of the surface temperature back to the base station every hour. The data generated by this sensor is a stream of real numbers. With enough such sensors, we could have 3.5 terabytes arriving every day, and we certainly need to think about what can be kept available for continuous processing and what can only be archived.
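The 3.5-terabyte figure is a back-of-the-envelope estimate; the sketch below reproduces it under assumed, purely illustrative parameters: one million sensors, each sending ten 4-byte readings per second.

```python
# Back-of-the-envelope: how quickly sensor streams accumulate.
# All figures below are illustrative assumptions.
sensors = 1_000_000          # number of deployed sensors
readings_per_second = 10     # readings each sensor sends per second
bytes_per_reading = 4        # one real number, e.g. a float32
seconds_per_day = 86_400

bytes_per_day = sensors * readings_per_second * bytes_per_reading * seconds_per_day
print(f"{bytes_per_day / 1e12:.1f} TB/day")  # prints: 3.5 TB/day
```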

2. IMAGE DATA
Satellites frequently send down to Earth streams containing many terabytes of images per day. Surveillance cameras produce images with lower resolution than satellites, but there can be very many of them, each producing a stream of images at intervals of about one second.

3. INTERNET AND WEB TRAFFIC


A switching node in the middle of the Internet receives streams of IP packets from many inputs and routes them to its outputs. Websites receive streams of heterogeneous types; for example, Google receives on the order of a hundred million search queries per day.
CHARACTERISTICS OF DATA STREAMING
• Continuous Stream of Data: The data stream is an infinite continuous stream resulting
in big data. In data streaming, multiple data streams are passed simultaneously.

• Time Sensitive: Data streams are time-sensitive, and their elements carry timestamps. A data stream is relevant only for a certain period; after that time, it loses its significance.

• Data Volatility: No data is stored in data streaming, as it is volatile. Once the data mining and analysis are done, the information is summarized or discarded, as sketched after this list.

• Concept Drifting: Data streams are very unpredictable. The data changes or evolves with time; in this dynamic world, nothing is constant.
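The time-sensitivity and volatility properties can be made concrete with a small sketch: keep only elements younger than an assumed relevance window, summarize what is inside it, and discard everything else. The 60-second window is an arbitrary choice for illustration.

```python
import time
from collections import deque

WINDOW_SECONDS = 60          # assumed relevance period (illustrative)
window = deque()             # (timestamp, value) pairs, oldest first

def ingest(value, now=None):
    """Append a timestamped element and evict expired ones;
    nothing outside the window is retained (volatility)."""
    now = time.time() if now is None else now
    window.append((now, value))
    while window and now - window[0][0] > WINDOW_SECONDS:
        window.popleft()     # expired elements are simply discarded

def summary():
    """Descriptive summary of the still-relevant elements."""
    values = [v for _, v in window]
    return {"count": len(values), "mean": sum(values) / len(values)} if values else {}
```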
DATA STREAMING PROCESS
Data streaming refers to the practice of sending, receiving, and processing information in a stream rather than in discrete batches. It involves six main steps:
1. DATA PRODUCTION
• The first step in the process is when data is generated and sent from various sources such as IoT devices, web applications, or social media platforms. The data can be in different formats, such as JSON or CSV, and can have different characteristics: structured or unstructured, static or dynamic, etc.
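A minimal sketch of this production step, assuming an IoT-style temperature sensor that emits one JSON event per second; the device name and all field names are invented for illustration.

```python
import json
import random
import time
from datetime import datetime, timezone

def produce_events(device_id="sensor-42"):
    """Generate one JSON-encoded reading per second (an unbounded stream)."""
    while True:
        event = {
            "device": device_id,                                 # illustrative fields
            "temperature_c": round(random.uniform(18.0, 26.0), 2),
            "ts": datetime.now(timezone.utc).isoformat(),
        }
        yield json.dumps(event)
        time.sleep(1)
```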
2. DATA INGESTION
• In the second step, the data is received and stored by consumers such as streaming platforms or message brokers. These consumers can use various technologies and streaming-data architectures, such as Kafka or Estuary Flow pipelines, to handle the volume, velocity, and variety of the data. They can also perform basic operations on the data, such as validation or enrichment, before passing it to a stream processor.
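Since Kafka is named above, here is a hedged sketch of the ingestion hand-off using the kafka-python client; it assumes a broker at localhost:9092, and the topic name and enrichment field are invented for illustration.

```python
# Requires: pip install kafka-python; assumes a broker at localhost:9092.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def ingest(event: dict):
    """Validate and enrich an event before handing it to the processor."""
    if "device" not in event:           # validation: drop malformed events
        return
    event["ingested_by"] = "gateway-1"  # enrichment (illustrative field)
    producer.send("sensor-readings", value=event)  # topic name is an assumption
```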
3. DATA PROCESSING
• Next, the data is analyzed and acted on by data processing tools. These tools can use various frameworks to perform complex operations on the data, such as filtering, aggregation, transformation, machine learning, etc. Usually, the processing is tightly integrated with the platform that ingested the streaming data.
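A pure-Python stand-in for this processing step, reusing the hypothetical event shape from the production sketch above: filter out implausible readings, then keep a running per-device average. An engine such as Spark or Flink would perform the same operations at scale.

```python
from statistics import mean

def process(events):
    """Filter, aggregate, transform: drop implausible readings and
    emit a running average per device."""
    state = {}                            # per-device aggregation state
    for event in events:
        t = event["temperature_c"]
        if not -40.0 <= t <= 60.0:        # filtering out bad readings
            continue
        state.setdefault(event["device"], []).append(t)
        yield event["device"], mean(state[event["device"]])  # running mean
```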
4. STREAMING DATA ANALYTICS
• This is the step where data is further explored and interpreted by analysts or data scientists. They can use various techniques and methods, such as descriptive, predictive, and prescriptive analytics, to discover patterns and trends in the data. They can also produce outputs based on the analysis, such as dashboards, charts, maps, alerts, etc.
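As a toy example of descriptive analytics feeding an alert, the sketch below keeps a running mean and standard deviation and flags values that sit far outside them; the threshold and warm-up length are arbitrary illustrative choices.

```python
import statistics

def alert_on_anomalies(values, threshold=3.0, warmup=30):
    """Flag values more than `threshold` standard deviations from the
    running mean, after a warm-up period (all parameters illustrative)."""
    seen = []
    for v in values:
        if len(seen) >= warmup:
            mu = statistics.mean(seen)
            sigma = statistics.stdev(seen)
            if sigma > 0 and abs(v - mu) / sigma > threshold:
                yield {"alert": "anomaly", "value": v, "mean": round(mu, 2)}
        seen.append(v)
```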
5. DATA REPORTING
• To make sense of the analyzed data, it is summarized and communicated by reporting tools. The reports can be generated in various formats, such as documents, emails, slideshows, or webinars.
• This step can also use various metrics and indicators to measure and monitor performance against goals and objectives.
6. DATA VISUALIZATION & DECISION MAKING
• In the final step, data is displayed and acted upon by the decision-makers such as leaders or
customers. The decision-makers can use various types and styles of visualizations to understand and
explore the data such as tables, charts, maps, graphs, etc.
• The decision-makers can make timely decisions such as optimizing processes, improving products,
enhancing customer experience, etc.
STREAMING PROCESS ARCHITECTURE
DATA SOURCES
Data streams originate from a range of sources in numerous formats and volume
intensities. These sources can be apps, networked devices, server log files, online
activities of various kinds, as well as location-based data. All of these sources can
be collected in real-time to form a single main source for real-time analytics and
information.
One example of streaming data is a ride-sharing app. If you make a booking on Uber or Ola, you are matched with a driver in real time, and the app can tell you how far away the driver is and how long it will take to reach your destination, based on real-time traffic data.
MESSAGE BROKER
• Message brokers act as buffers between the data sources and the stream processing engine. A broker collects data from various sources, converts it to a standard message format (such as JSON or Avro), and then streams it continuously for consumption by other components.

• A message broker also provides features such as scalability, fault tolerance, load balancing,
partitioning, etc. Some examples of message brokers are Apache Kafka, Amazon Kinesis
Streams, etc.
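To show the consuming side of a broker, here is a sketch using the kafka-python client again; it assumes the same broker and topic as the ingestion sketch above, and the consumer-group name is illustrative. Consumer groups are one way Kafka provides the load balancing mentioned above.

```python
# Requires: pip install kafka-python; assumes the broker/topic from earlier sketches.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-readings",                        # illustrative topic name
    bootstrap_servers="localhost:9092",
    group_id="stream-processors",             # consumer group enables load balancing
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    # Each message carries its partition, so partitioned consumption is visible.
    print(message.partition, message.value)
```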
STREAM PROCESSING ENGINE
• This is the core component that processes streaming data. It can perform
various operations such as filtering, aggregation, transformation, enrichment,
windowing, etc.

• A stream processing engine can also support Complex Event Processing (CEP) which is
the ability to detect patterns in streaming data and trigger actions accordingly.

• Some popular stream processing tools are Apache Spark Streaming, Apache
Flink, Apache Kafka Streams, etc.
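Windowing can be sketched without any engine at all: the function below groups events into fixed, non-overlapping (tumbling) windows and counts events per device in each. The epoch_ts field is an assumed event attribute, and real engines provide this operation natively.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per device within fixed, non-overlapping time windows,
    the basic windowing operation stream engines provide natively."""
    windows = defaultdict(lambda: defaultdict(int))
    for event in events:                  # event: {"device": ..., "epoch_ts": ...}
        start = int(event["epoch_ts"] // window_seconds) * window_seconds
        windows[start][event["device"]] += 1
    return windows
```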
DATA STORAGE
This component stores the processed or raw streaming data for later use. Data storage can be relational or non-relational, structured or unstructured, etc. Because of the large volume and diverse formats of event streams, many organizations opt to store their streaming event data in cloud object stores as an operational data lake. A standard method of loading data into the data storage is using ETL pipelines.

Some examples of data storage systems are Amazon S3, Hadoop Distributed
File System (HDFS), Apache Cassandra, Elasticsearch, etc.
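Since Amazon S3 is listed above, here is a hedged sketch of a minimal load step using boto3; it assumes configured AWS credentials and an existing bucket whose name ("stream-lake") is invented for illustration.

```python
# Requires: pip install boto3; assumes AWS credentials are configured
# and that an S3 bucket named "stream-lake" already exists (illustrative).
import json
import time
import boto3

s3 = boto3.client("s3")

def archive_batch(events):
    """Write a micro-batch of processed events to object storage as
    JSON Lines, keyed by arrival time: a minimal ETL-style load step."""
    key = f"events/{int(time.time())}.jsonl"
    body = "\n".join(json.dumps(e) for e in events)
    s3.put_object(Bucket="stream-lake", Key=key, Body=body.encode("utf-8"))
```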
APPLICATIONS
• The utilization of location data
• Fraud detection
• Marketing, sales, and business analytics
• Monitoring and analyzing customer or user activity
• Security Information and Event Management (SIEM)
• Retail and warehouse inventory across multiple channels
• Enhancing rideshare matching
• Combining data for use in machine learning and AI-based analysis
• Customer journey mapping
• Predictive analytics
