By
Dr. Pradnya Ghare
Characteristics of sensor data
The Streaming Nature of Data
● Sensor data is generated automatically and arrives as multiple continuous, time-varying streams.
The volume of sensor data therefore grows with time, and the total volume of data is potentially
unbounded.
● Sensors are usually deployed at a certain density so that they can cover the entire monitoring field.
● As a result, ‘‘most sensor networks likely exhibit temporal and spatial correlations among node
readings. The readings observed at one time instant are highly indicative of the readings observed at
the next time instant’’.
● High tempo-spatial correlation can be used to estimate missing or corrupted data, provide data
suppression, reduce data transmission in the network and thus reduce energy consumption.
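As a rough illustration of how correlation can fill a gap, the sketch below estimates a missing reading by blending a temporal estimate (the midpoint of the node's previous and next readings) with a spatial estimate (the mean of neighboring nodes at the same instant). The function name, the equal weighting, and the sample values are assumptions made for this example only.

```python
from statistics import mean


def estimate_missing(prev_value, next_value, neighbor_values, w_temporal=0.5):
    """Estimate a missing reading from temporal and spatial neighbors."""
    # Temporal estimate: midpoint of the readings just before and after the gap.
    temporal = (prev_value + next_value) / 2
    # Spatial estimate: mean of nearby nodes' readings at the same instant.
    spatial = mean(neighbor_values)
    # Blend the two estimates; w_temporal is an assumed tuning weight.
    return w_temporal * temporal + (1 - w_temporal) * spatial


estimate = estimate_missing(20.0, 22.0, [21.5, 20.5, 21.0])
print(estimate)  # 21.0
```

In a purely streaming setting the next reading is not yet available, so the temporal part would have to rely on past readings only.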
Generation of Redundant Data
● Significant data redundancy in a database can result from the strong spatial and
temporal correlations typically present in sensor data.
● However, redundancy can also be used to predict missing values and to detect outliers,
and a certain level of redundancy can improve the accuracy of database query results.
Sensor Data Contains ‘Noise’
● Sensor data often contains errors (due to sensor malfunction) and noise (due to
environmental interference).
● These characteristics indicate that sensor data should be cleaned before being stored
in any database.
Data Fusion
● This approach uses a Kalman filter and linear regression for range prediction.
● Values outside the predicted range are treated as ‘‘important’’
values, and their confidences are calculated in Step 2.
● Finally, the weighted moving average at the sink node is combined with the
temporal average and the spatial average together.
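The range-prediction step can be illustrated with a small sketch. It covers only the linear-regression half of the approach (no Kalman filter) and assumes a fixed margin around the predicted value; the function names, the window size, and the margin are hypothetical choices, not taken from the slides.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = a + b*t for t = 0..n-1 (needs n >= 2)."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    b = sxy / sxx
    a = y_mean - b * t_mean
    return a, b


def predicted_range(window, margin):
    """Extrapolate the fitted line one step ahead and widen it by +/- margin."""
    a, b = fit_line(window)
    center = a + b * len(window)
    return center - margin, center + margin


def is_important(window, new_value, margin=1.0):
    """A reading is 'important' when it falls outside the predicted range."""
    low, high = predicted_range(window, margin)
    return not (low <= new_value <= high)


window = [20.0, 20.5, 21.0, 21.5]      # recent readings (assumed values)
print(predicted_range(window, 1.0))    # (21.0, 23.0)
print(is_important(window, 22.1))      # False: inside the range
print(is_important(window, 25.0))      # True: outside the range
```

In practice the margin would more likely be derived from the residuals of the fit or from the filter's error covariance than fixed by hand.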
Missing Values Recovery
● Having a large amount of redundant data may slow down or confuse the
knowledge discovery process.
● In-network aggregation of redundant data can reduce the total data flow over
the sensor network and extract the most representative data using
minimum resources, which effectively reduces power consumption.
● Getting an average of the raw data and reporting the average when it is
greater than a predefined threshold is the simplest case.
● Another approach is a weighted in-network sampling algorithm to obtain a
deterministic (much smaller but representative) sample instead of raw
redundant data.
● Compared with random sampling, the advantage of the weighted sampling
algorithm is that ‘‘it can guarantee that each node’s data has the same chance to
belong to the final sample, independent from its provenance in the Network’’.
● Instead of selectively sampling the network nodes, a prediction-based data
reduction strategy can be used.
● Here prediction methods are deployed both at the sensor and sink-level, so
that sensors only need to send data that deviates from the predicted value.
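The prediction-based strategy can be sketched as follows. Sensor and sink run the same predictor, and the sensor transmits only when its reading deviates from the prediction by more than a threshold. The predictor here is the simplest possible one (the last transmitted value); the class name, the threshold, and the readings are assumptions for illustration.

```python
class LastValuePredictor:
    """Predicts that the next reading equals the last value both sides agreed on."""

    def __init__(self, initial):
        self.value = initial

    def predict(self):
        return self.value

    def update(self, value):
        self.value = value


def run(readings, threshold):
    """Simulate one sensor and the sink; return (messages sent, sink's view)."""
    sensor_model = LastValuePredictor(readings[0])
    sink_model = LastValuePredictor(readings[0])  # the first reading is always sent
    sent = 1
    reconstructed = [readings[0]]
    for r in readings[1:]:
        if abs(r - sensor_model.predict()) > threshold:
            sensor_model.update(r)
            sink_model.update(r)  # stands in for a radio transmission
            sent += 1
        # Without a message, the sink falls back on its own prediction.
        reconstructed.append(sink_model.predict())
    return sent, reconstructed


readings = [20.0, 20.1, 20.2, 21.0, 21.1, 19.0]
sent, reconstructed = run(readings, threshold=0.5)
print(sent)           # 3 of 6 readings were transmitted
print(reconstructed)  # [20.0, 20.0, 20.0, 21.0, 21.0, 19.0]
```

The sink's reconstruction is never off by more than the threshold, which is the accuracy given up in exchange for fewer transmissions.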
Sensor Data Mining
network applications:
● The actual benefits of data aggregation depend on the location of the data
sources, relative to the data sink.
● Intuitively, when all data sources are spread out, the paths to the sinks do not
intersect, and there is little if any opportunity to aggregate data at some
intermediate nodes.
● If, on the other hand, the data sources are all nearby – for example, when
they all observe an event at a certain place – and they are located far away
from the sink and their paths to the sink merge early on, the expected benefits
of aggregation are large.
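The effect of source placement can be checked with a small count of transmissions. The sketch assumes a routing tree given as a child-to-parent map and perfect aggregation (any number of incoming packets is merged into one outgoing packet); the node names and both topologies are hypothetical.

```python
def path_to_sink(node, parent):
    """Return the list of (child, parent) links from node up to the sink."""
    links = []
    while node in parent:
        links.append((node, parent[node]))
        node = parent[node]
    return links


def transmissions(sources, parent):
    """Count transmissions per round without and with perfect aggregation."""
    paths = [path_to_sink(s, parent) for s in sources]
    # Without aggregation every packet travels its full path separately.
    without_agg = sum(len(p) for p in paths)
    # With perfect aggregation each link in the union of paths is used once.
    with_agg = len({link for p in paths for link in p})
    return without_agg, with_agg


# Clustered sources: a, b, c merge at node m, which is three hops from the sink.
clustered = {"a": "m", "b": "m", "c": "m", "m": "r1", "r1": "r2", "r2": "sink"}

# Spread-out sources: each source has its own disjoint four-hop path.
spread = {}
for s in ("a", "b", "c"):
    spread.update({s: s + "1", s + "1": s + "2", s + "2": s + "3", s + "3": "sink"})

print(transmissions(["a", "b", "c"], clustered))  # (12, 6)
print(transmissions(["a", "b", "c"], spread))     # (12, 12)
```

With paths that merge early, aggregation halves the number of transmissions in this example; with disjoint paths it saves nothing.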
Different cases of data aggregation
Metrics to judge efficacy of data aggregation
● Accuracy: the difference between the resulting value at the sink and the true
value
● Latency: Aggregation can also increase the latency of reporting as
intermediate nodes might have to wait for data.
● Message overhead: The main advantage of aggregation lies, of course, in the
reduced message overhead, which should result in an improved energy
efficiency and network lifetime.
A database interface to describe aggregation operations
SELECT {agg(expr), attributes} FROM sensors
WHERE {selectionPredicates}
GROUP BY {attributes}
HAVING {havingPredicates}
EPOCH DURATION i
● the phrase agg(expr) denotes the aggregation function, applied to a given
expression;
● an example would be AVG(temperature) denoting that the average of all
temperature readings is to be determined.
● The WHERE clause acts as a filter on the measured values before they enter the
aggregation process; usually, these predicates are intended to be locally
evaluated by each node
● The GROUP BY clause partitions the data into subsets and the HAVING clause
further filters these groups.
● An example would be to compute average temperature values (SELECT
AVG(temperature)) separately for each floor in a building (GROUP BY
floor), but only for floors above the fifth (HAVING floor > 5).
● The floor number for each temperature average can be obtained by
SELECT AVG(temperature), floor.
● The EPOCH DURATION indicates repeated interactions.
● Nodes periodically measure, transmit, and aggregate information and
the epoch duration marks the period for these repetitions.
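To make the roles of the clauses concrete, the sketch below emulates one epoch of the floor-temperature query in plain Python. The readings, the field names, and the WHERE predicate (discarding implausible values) are assumptions made for illustration; a real sensor network would evaluate the query in a distributed fashion rather than over a single list.

```python
from collections import defaultdict

# Hypothetical readings collected during one epoch.
readings = [
    {"node": 1, "floor": 4, "temperature": 21.0},
    {"node": 2, "floor": 6, "temperature": 22.0},
    {"node": 3, "floor": 6, "temperature": 24.0},
    {"node": 4, "floor": 7, "temperature": 25.0},
    {"node": 5, "floor": 7, "temperature": -999.0},  # faulty sensor
]


def run_epoch(readings):
    """Emulate SELECT AVG(temperature), floor ... GROUP BY floor HAVING floor > 5."""
    # WHERE: each node filters its own reading before it enters aggregation
    # (assumed predicate: temperature > -100).
    selected = [r for r in readings if r["temperature"] > -100.0]
    # GROUP BY floor: partition the remaining readings by floor.
    groups = defaultdict(list)
    for r in selected:
        groups[r["floor"]].append(r["temperature"])
    # SELECT AVG(temperature), floor with HAVING floor > 5 filtering the groups.
    return {floor: sum(v) / len(v) for floor, v in groups.items() if floor > 5}


result = run_epoch(readings)
print(result)  # {6: 23.0, 7: 25.0}
```

Under EPOCH DURATION i, a function like run_epoch would be invoked once every i time units on the readings of that epoch.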
Aggregation functions
● Given two partial state records <x> and <y>, either received from a
neighboring node or locally measured, an aggregation function f computes a
new state record <z> = f(<x>, <y>).
● Properties of f that matter for its use as an aggregation function:
● Duplicate sensitive
● Examples for duplicate-sensitive aggregations are the sum of measured
values (SUM), counting the number of certain instances (e.g. number of
sensors that have raised an alarm, COUNT for short), as are the average
(AVG), the median of a set of values (MEDIAN), and computing the histogram
of values (HISTOGRAM).
● Minimum and maximum (MIN and MAX), on the other hand, are not sensitive
to duplicates.
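A short sketch of partial state records shows why AVG is duplicate sensitive while MAX is not. AVG is represented here as a (sum, count) pair and MAX as a single value; the function names and the sample readings are assumptions for illustration.

```python
def merge_avg(x, y):
    """Merge two AVG partial state records of the form (sum, count)."""
    return (x[0] + y[0], x[1] + y[1])


def evaluate_avg(state):
    """Turn an AVG partial state record into the final average."""
    return state[0] / state[1]


def merge_max(x, y):
    """Merge two MAX partial state records (each is just a value)."""
    return max(x, y)


a, b = (20.0, 1), (30.0, 1)  # state records for the readings 20.0 and 30.0

avg_once = evaluate_avg(merge_avg(a, b))               # 25.0
avg_dup = evaluate_avg(merge_avg(merge_avg(a, b), b))  # b counted twice

max_once = merge_max(20.0, 30.0)                       # 30.0
max_dup = merge_max(merge_max(20.0, 30.0), 30.0)       # still 30.0

print(avg_once, avg_dup)  # the duplicate shifts the average upward
print(max_once, max_dup)  # the duplicate has no effect on the maximum
```

This is why duplicate-sensitive aggregates need routing structures that deliver each reading exactly once, whereas MIN and MAX tolerate packets arriving over several paths.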
Summary or exemplary
● When collecting data toward a sink along a tree or along a routing structure
such as the one resulting from directed diffusion, the aggregation points have
to be well placed for maximum benefit.
● Aggregation should happen close to the sources, and as many sources as possible
should be aggregated as early as possible: the tree should have, figuratively, long
trunks and bushy leaves.
● Directed diffusion does not necessarily result in a tree, but it is well suited to
aggregation