
Sensor Data Aggregation

By
Dr. Pradnya Ghare
Characteristics of sensor data
The Streaming Nature of Data

● Sensor data is generated automatically and arrives as multiple, continuous, time-varying streams.
The volume of sensor data therefore grows with time, and the total volume of data is potentially
unlimited.

Existence of High Tempo-Spatial Correlations

● Sensors are usually deployed at a certain density so that they can cover the entire monitoring field.

● As a result, most sensor networks are likely to exhibit temporal and spatial correlations among node
readings: the readings observed at one time instant are highly indicative of the readings observed at
the next time instant.
● High tempo-spatial correlation can be used to estimate missing or corrupted data, provide data
suppression, reduce data transmission in the network and thus reduce energy consumption.
Generation of Redundant Data
● Significant data redundancy in a database can result from the strong spatial and
temporal correlations typically present in sensor data.
● However, redundancy can be used to predict missing values and to detect outliers and
a certain level of redundancy can improve the accuracy of database query results.
Sensor Data Contains ‘Noise’
● Sensor data often contains errors (due to sensor malfunction) and noise (due to
environmental or other interference).
● These characteristics indicate that sensor data should be cleaned before being stored
in any database.
Data Fusion

● In general, the design criteria for data-gathering applications in sensor networks are:
(1) scalability, (2) autonomy, (3) robustness, and (4) energy efficiency.

● Instead of transmitting all the highly correlated information to subscribers, it may be more
effective for some intermediate sensor node(s) to digest the information received and produce
a concise digest, reducing the amount of raw data to be transmitted (and hence the power and
bandwidth consumed in transmission).

● This technique is termed data fusion (also called data aggregation).


Sensor Data Fusion Techniques
● Sensor data fusion consists of three stages: pre-processing, data mining and
post-processing.
● Sensor Data Pre-processing
● Data pre-processing includes data cleaning, outlier detection, missing values
recovery, data reduction, dimension reduction, and data prediction, etc.
● Various approaches have been used for sensor data cleaning, including
Bayesian Theory, Neural Network, Wavelets, Kalman Filter and Weighted
Moving Average.
● Because of the limits imposed by the sensor’s computation capability, it is
hard to implement the Bayesian Theory, Neural Network, and Wavelets
methods.
● A smart weighted-moving-average-based sensor data cleaning approach
consists of three steps:
● Step 1: Locate important values by range prediction.
● Step 2: Increase confidence for important values by sensor testing and
neighbour testing at individual sensors.
● Step 3: Perform a weighted moving average at the sink.

● This approach uses Kalman Filter and Linear Regression for range prediction.
● Values outside the predicted range would be considered as ‘‘important’’
values, and their confidences would be calculated in Step 2.
● Finally, the weighted moving average at the sink node combines the
temporal average and the spatial average.
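The three steps above can be sketched in Python. The linear extrapolation below is a simple stand-in for the Kalman-filter and linear-regression range predictors mentioned in the text, and all tolerances and weights are illustrative assumptions:

```python
def predict_range(history, tolerance=2.0):
    """Step 1: predict the next value by linear extrapolation (a simple
    stand-in for the Kalman filter / linear regression predictors) and
    return an acceptance range around it."""
    if len(history) < 2:
        return (float("-inf"), float("inf"))
    predicted = history[-1] + (history[-1] - history[-2])
    return (predicted - tolerance, predicted + tolerance)

def confidence(value, neighbour_values, agree_tol=1.0):
    """Step 2: raise the confidence of an 'important' (out-of-range) value
    when neighbouring sensors report similar readings."""
    if not neighbour_values:
        return 0.0
    agreeing = sum(1 for v in neighbour_values if abs(v - value) <= agree_tol)
    return agreeing / len(neighbour_values)

def weighted_moving_average(values, weights):
    """Step 3: confidence-weighted moving average performed at the sink."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total if total else 0.0
```

For example, a reading of 25.0 after the run 20.0, 20.1, 20.2 falls outside the predicted range (roughly 18.3 to 22.3) and would be flagged as important, then weighted by neighbour agreement in Step 2.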
Missing Values Recovery

● With traditional approaches, energy consumption and delays are higher.

● Estimation methods such as belief propagation and expectation maximization can
be used instead.
Sensor Data In-network Aggregation

● Having a large amount of redundant data may slow down or confuse the
knowledge discovery process.
● In-network aggregation of redundant data can reduce the total data flow over
the sensor network and thus extract the most representative data using
minimum resources, which effectively reduces power consumption.
● Getting an average of the raw data and reporting the average when it is
greater than a predefined threshold is the simplest case.
● Another approach is a weighted in-network sampling algorithm to obtain a
deterministic (much smaller but representative) sample instead of raw
redundant data.
● Compared with random sampling, the advantage of the weighted sampling
algorithm is that “it can guarantee that each node’s data has the same chance to
belong to the final sample, independent from its provenance in the network”.
● Instead of selectively sampling the network nodes, a prediction-based data
reduction strategy can be used.
● Here prediction methods are deployed both at the sensor and sink-level, so
that sensors only need to send data that deviates from the predicted value.
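This dual-prediction idea can be sketched as follows; the last-value predictor and the deviation threshold are illustrative assumptions:

```python
class DualPredictionLink:
    """Sketch of prediction-based data reduction: the sensor and the sink run
    the same last-value predictor, and the sensor transmits only readings
    that deviate from the prediction by more than `threshold` (an assumed
    parameter)."""

    def __init__(self, threshold=0.5, initial=0.0):
        self.threshold = threshold
        self.sensor_model = initial   # predictor state at the sensor
        self.sink_model = initial     # identical predictor state at the sink
        self.transmissions = 0

    def sample(self, reading):
        """Process one reading at the sensor; return the value the sink sees."""
        if abs(reading - self.sensor_model) > self.threshold:
            # Deviation too large: transmit and resynchronise both models.
            self.transmissions += 1
            self.sensor_model = reading
            self.sink_model = reading
        # Otherwise the sink silently reuses its own prediction.
        return self.sink_model
```

On the stream 20.0, 20.1, 20.2, 23.0, 23.1 with a threshold of 0.5, only two readings are actually transmitted while the sink still tracks the signal within the threshold.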
Sensor Data Mining

● Data mining aims to extract patterns from data.


● Traditional data mining technologies, often referred to as data miners, include
Decision Trees, Rule-based Classifiers, Artificial Neural Networks, Nearest
Neighbour, Naive Bayes, Support Vector Machines, Logistic regression, etc.
● Most of these were initially developed to be applied in central data
warehouse.
● Sensor data mining mainly focuses on distributed in-network data mining.
Most authors suggest a form of hierarchical network topology for sensor
data mining.
Two-level architecture for sensor data mining
Sensor Data Post-processing

● Data post-processing includes pattern evaluation, model evaluation, data
visualization/presentation, etc.
● This step can link the result of sensor data mining to specific applications.
● Data visualization can be based on computer graphics or statistical methods.
Event Detection
● Event detection can be formally described as follows:
● Given a set of measured data arriving over time, denoted
D = {z_t | t = 1, 2, ..., n}, event detection is to find the time t when an event of interest
occurs,
i.e. when the data differs from the normal pattern of behaviour.
● Therefore, the common goals of event detection are:
• To identify whether an event of interest has occurred;
• To characterize the event (e.g., the time, the affected area, the type and the severity
of the event).
● Two categories of event detection approaches have been identified in sensor
network applications:

1) threshold-based event detection, and

2) tempo-spatial pattern-based event detection.


● The threshold-based event detection method is based on the underlying intuition that an
occurring event will result in changes in the sensor readings; e.g., a moving object will
result in an increased acceleration reading, and a fire in an increased temperature
reading.
● Therefore, normal behaviour can be defined by a threshold (e.g., maximum values,
rates of increase, or combinations thereof from multiple sensors) based on statistics
from historical data (or domain knowledge).
● Alarms can be raised if the predefined threshold is exceeded.
Advantages
● Simplicity of its implementation and low computation complexity.
Drawback
● Lower accuracy.
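A minimal Python sketch of threshold-based detection, combining an absolute threshold with a rate-of-increase check; both limits are assumed values that would in practice come from historical data or domain knowledge:

```python
def detect_events(readings, max_value=50.0, max_rate=5.0):
    """Raise an alarm at time t when the reading exceeds an absolute
    threshold or rises faster than `max_rate` per time step."""
    alarms = []
    for t in range(len(readings)):
        too_high = readings[t] > max_value
        too_fast = t > 0 and readings[t] - readings[t - 1] > max_rate
        if too_high or too_fast:
            alarms.append(t)
    return alarms
```

For the temperature trace 21, 22, 30, 55 this flags t = 2 (rate jump of 8 per step) and t = 3 (absolute value above 50).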
Tempo-spatial pattern based event detection

● In contrast to threshold-based event detection, the underlying intuition of
tempo-spatial pattern-based event detection is that an event occurring in the
monitoring field usually results in certain tempo-spatial patterns in the sensor
readings of network nodes.
● For instance, a gas leakage event can be characterized as a spatial
distribution of sensor readings following a gradually decreasing trend from the
source of the leakage to the surrounding area.
● The event of interest can be defined as temporal, spatial, or tempo-spatial
patterns, and then the event detection problem is converted into a pattern
matching problem.
Location of sources for aggregation

● The actual benefits of data aggregation depend on the location of the data
sources, relative to the data sink.
● Intuitively, when all data sources are spread out, the paths to the sinks do not
intersect, and there is little if any opportunity to aggregate data at some
intermediate nodes.
● If, on the other hand, the data sources are all nearby – for example, when
they all observe an event at a certain place – and they are located far away
from the sink and their paths to the sink merge early on, the expected benefits
of aggregation are large.
Different cases of data aggregation
Metrics to judge efficacy of data aggregation

● Accuracy: the difference between the resulting value at the sink and the true
value
● Latency: Aggregation can also increase the latency of reporting as
intermediate nodes might have to wait for data.
● Message overhead: The main advantage of aggregation lies, of course, in the
reduced message overhead, which should result in an improved energy
efficiency and network lifetime.
A database interface to describe aggregation operations

Syntax of an SQL query to aggregate data from a sensor network:

SELECT {agg(expr), attributes } FROM sensors

WHERE {selectionPredicates}

GROUP BY {attributes}

HAVING {havingPredicates }

EPOCH DURATION i
● the phrase agg(expr) denotes the aggregation function, applied to a given
expression;
● an example would be AVG(temperature) denoting that the average of all
temperature readings is to be determined.
● The WHERE clause acts as a filter on the measured values before they enter the
aggregation process; usually, these predicates are intended to be evaluated
locally by each node.
● The GROUP BY clause partitions the data into subsets and the HAVING clause
further filters these groups.
● An example would be to compute average temperature values (SELECT
AVG(temperature)) separately for each floor in a building (GROUP BY
floor)
● but only from the fifth floor upward (HAVING floor > 5);
● the floor number for each temperature average can be obtained by
SELECT AVG(temperature), floor
● The EPOCH DURATION indicates repeated interactions.
● Nodes periodically measure, transmit, and aggregate information and
the epoch duration marks the period for these repetitions.
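The clauses described above can be assembled into one complete query. The relation name sensors and the attributes temperature and floor follow the example in the text; the WHERE predicate and the epoch duration of 30 s are assumed values for illustration:

```sql
-- Average temperature per floor, fifth floor and above, recomputed every 30 s.
SELECT AVG(temperature), floor
FROM sensors
WHERE temperature > 0        -- local filter at each node (assumed predicate)
GROUP BY floor
HAVING floor > 5
EPOCH DURATION 30s
```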
Aggregation functions

● Given two partial state records <x> and <y>, either received from a
neighboring node or locally measured, an aggregation function f computes a
new state record <z> = f(<x>, <y>).
● Properties of f that determine whether it is usable as an aggregation function:
● Duplicate sensitivity
● Examples of duplicate-sensitive aggregations are the sum of measured
values (SUM), the count of certain instances (e.g. the number of
sensors that have raised an alarm; COUNT for short), the average
(AVG), the median of a set of values (MEDIAN), and the histogram
of values (HISTOGRAM).

● Minimum and maximum (MIN and MAX), on the other hand, are not sensitive
to duplicates.
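The state-record formulation can be made concrete for AVG, which must carry both a sum and a count so that intermediate nodes can merge records correctly; the function names here are illustrative:

```python
def init_record(reading):
    """Wrap a local measurement into a partial state record <sum, count>."""
    return (reading, 1)

def merge(x, y):
    """The aggregation function f(<x>, <y>) -> <z> for AVG: records from
    children and the local reading merge into one record to forward."""
    return (x[0] + y[0], x[1] + y[1])

def evaluate(record):
    """At the sink, extract the final average from the fully merged record."""
    total, count = record
    return total / count

# Duplicate sensitivity in action: merging the same MAX value twice is
# harmless (max(x, x) == x), but merging the same <sum, count> record
# twice would double-count it and skew the average.
```

For three readings 10, 20, 30 merged along the tree, the sink receives <60, 3> and evaluates it to 20.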
Summary or exemplary

● An exemplary aggregate is a single, in some sense representative, value out
of a set of values.
● A summary aggregate is a function of the entire set and, typically, does not
strongly depend on individual values.
● MAX and MEDIAN are typical exemplary aggregates; SUM is, as expected, a
summary aggregate.
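The distinction can be seen in a few lines of Python; the sample readings are illustrative:

```python
import statistics

# An exemplary aggregate returns one representative member of the set;
# a summary aggregate is a function of the whole set and is normally not
# itself a member of it.
readings = [3.0, 9.0, 1.0]

median = statistics.median(readings)  # exemplary: equals a measured value
total = sum(readings)                 # summary: not one of the readings
```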
Placement of aggregation points

● When collecting data toward a sink along a tree or along a routing structure
such as the one resulting from directed diffusion, the aggregation points have
to be well placed for maximum benefit.
● Aggregation should happen close to the sources, and as many sources as
possible should be aggregated as early as possible – the tree should have,
figuratively, long trunks and bushy leaves.
● Directed diffusion does not necessarily result in a tree, but it is well suited to
aggregation.
