
Data Stream Management

Patrick Martin
School of Computing, Queen’s University
Calisto Zuzarte
IBM Toronto Lab
Goals of the Workshop

• Highlight application areas for data stream
management systems (DSMSs)
• Examine the key issues that differentiate DSMSs
and data stream mining from their standard
counterparts
• Propose interesting questions and topics for
research in the area of DSMSs
Workshop Agenda

1:00 – 1:15: Welcome (Pat Martin)


1:15 – 2:45 Session 1
DSMSs – Overview and Issues
Pat Martin (Queen’s U)
Aggregating Social Media
Nick Koudas (U of Toronto)
2:45 – 3:15 Break
3:15 – 4:45 Session 2
Mining Dynamic Data Streams
Yingying Tao (U of Waterloo)
Window Query Approximation for
Joining Data Streams with
Relations Based on Importance Semantics
Qiang Zhu (U of Michigan)
Data Stream Management
Systems – Overview and Issues

Patrick Martin
School of Computing
Queen’s University
Outline of Talk
• Where is the need for DSMSs?
  • New class of applications with continuous streams of data
• What is a DSMS?
• What are the issues?

CASCON 2008 DSMS Workshop 2


Applications – Financial Analysis
• Electronic trading is now commonplace
• Trading volume continues to increase rapidly
• Algorithmic trading: detect advantageous market conditions, automatically execute trades
  • E.g., compute a 5-minute rolling average or volume-weighted average price (VWAP)
• Latency is key
• Real-time visualization
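The rolling VWAP mentioned above can be maintained incrementally, which matters when latency is key. The following is a minimal sketch, not taken from the talk: the class name `RollingVWAP`, the tuple layout and the assumption that trades arrive in non-decreasing timestamp order (in seconds) are all illustrative.

```python
from collections import deque


class RollingVWAP:
    """Volume-weighted average price over a time-based rolling window (hypothetical helper)."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self.trades = deque()     # (timestamp, price, volume), oldest first
        self.price_volume = 0.0   # running sum of price * volume inside the window
        self.volume = 0.0         # running sum of volume inside the window

    def add(self, timestamp, price, volume):
        """Add one trade and return the VWAP of the current window."""
        self.trades.append((timestamp, price, volume))
        self.price_volume += price * volume
        self.volume += volume
        # Evict trades that are window_seconds old or older.
        cutoff = timestamp - self.window_seconds
        while self.trades and self.trades[0][0] <= cutoff:
            _, old_price, old_volume = self.trades.popleft()
            self.price_volume -= old_price * old_volume
            self.volume -= old_volume
        return self.price_volume / self.volume if self.volume else None
```

Each trade is added once and evicted once, so the per-trade cost is constant on average; only the trades inside the window are kept in memory.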
Applications – Sensor Networks
• Filter, aggregate and join streams from multiple sensors
• Military command and control, healthcare, manufacturing, climate analysis
• E.g., join streams of temperature readings from weather stations with static tables of
  geographic data to produce temperature contours on a weather map
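The weather-station example is a join between a stream and a static relation. A minimal sketch of that pattern, assuming the static table fits in memory as a dictionary keyed by a hypothetical `station_id` and that readings from unknown stations are simply dropped:

```python
def join_with_stations(readings, stations):
    """Join a stream of (station_id, temperature) with a static station table.

    `stations` maps station_id -> (latitude, longitude); both names are
    illustrative assumptions, not part of the original example.
    """
    for station_id, temperature in readings:
        location = stations.get(station_id)
        if location is None:
            continue  # assumption: drop readings from unknown stations
        latitude, longitude = location
        yield latitude, longitude, temperature
```

Because the function is a generator, it emits one joined tuple per arriving reading and never needs to see the end of the stream.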
Applications – System Monitoring
• Large volumes of data produced in real time
• Network traffic analysis
  • E.g., determine bandwidth used for each source-destination pair, grouped by protocol type
• Problem determination
  • E.g., determine top-k tables used by workloads over a span of several hours
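Both monitoring examples reduce to a grouped aggregate that is updated as each record arrives. A minimal sketch with exact counts, assuming each packet is described by source, destination, protocol and size in bytes (the class and field names are hypothetical):

```python
from collections import Counter


class BandwidthMonitor:
    """Running byte totals per (protocol, source, destination)."""

    def __init__(self):
        self.bytes_by_key = Counter()

    def observe(self, source, destination, protocol, size_bytes):
        """Fold one packet into the running totals."""
        self.bytes_by_key[(protocol, source, destination)] += size_bytes

    def top_k(self, k):
        """Return the k heaviest (key, total_bytes) pairs seen so far."""
        return self.bytes_by_key.most_common(k)
```

State grows with the number of distinct keys, so on very high-cardinality streams an approximate summary may be a better fit than exact counters.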



Why not a DBMS?
            Data Streams                    DBMS

Data        Continuous, transient data      Static, persistent data

Queries     Continuous, result              One-time execution on
            incrementally updated           snapshot of data

Execution   Approximation & adaptability    Precise results, stable query plans



Data Stream Model
• A data stream is a continuous, ordered sequence of data items
• Data elements arrive in real time
• The system has no control over the order in which data elements arrive
• Data streams are potentially unbounded
• Once a data item is processed it is discarded or archived => processed in 1 pass!
• Can be modeled as virtual relations or data objects
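The one-pass constraint means any statistic has to be maintained from a small amount of state, since items cannot be revisited. As an illustration (Welford's method, chosen here as an example and not named in the talk), mean and variance can be tracked with three numbers:

```python
class RunningStats:
    """One-pass mean and variance using Welford's method."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Fold one stream item into the summary; the item can then be discarded."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def variance(self):
        """Population variance of the items seen so far."""
        return self.m2 / self.count if self.count else 0.0
```

The memory footprint is constant regardless of how many items have been seen, which is what makes it usable on an unbounded stream.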



DSMS Reference Architecture [1]

[Figure: DSMS reference architecture. Components shown: streaming inputs, input buffer, working storage, summary storage, static storage, query processor, query repository, user queries, output buffer, streaming outputs.]



DSMS Issues – Query Languages
• Different types proposed
  • Relation-based – STREAM, TelegraphCQ
  • Object-based – COUGAR, Tribeca
  • Procedural – Aurora
• Support for
  • Streams and static sources (relations)
  • Continuous queries
  • Windows

Example query:

    SELECT RSTREAM(item_id, bid_price)
    FROM bid [RANGE '10 Minutes' SLIDE '90 Seconds']
    WHERE bid_price =
        (SELECT MAX(bid_price)
         FROM bid [RANGE '10 Minutes' SLIDE '90 Seconds'])
DSMS Issues – Non-Blocking
Operators
• Streams are infinite, so a query plan cannot contain blocking operators
• 3 approaches to unblocking
  • Windowing
  • Incremental evaluation
  • Exploiting stream constraints (e.g., use of punctuations)
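To illustrate the third approach, here is a minimal sketch of a group-by count that is unblocked by punctuations. It assumes a punctuation is a marker in the stream stating that no further tuples with a given key will arrive; the `Punctuation` class and the `(key, payload)` tuple layout are hypothetical.

```python
class Punctuation:
    """Marker stating that no further tuples with this key will arrive."""

    def __init__(self, key):
        self.key = key


def count_by_key(stream):
    """Group-by count that emits a group's result when its punctuation arrives."""
    counts = {}
    for element in stream:
        if isinstance(element, Punctuation):
            if element.key in counts:
                # The group is complete: emit it and free its state.
                yield element.key, counts.pop(element.key)
        else:
            key, _payload = element
            counts[key] = counts.get(key, 0) + 1
```

Without the punctuations the operator could never emit a final count, because on an infinite stream it cannot know that a group is finished.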



DSMS Issues – Approximate
Algorithms
• Compact stream summaries may be stored and approximate queries posed over the summaries
• Methods of generating summaries
  • Counting methods
  • Hashing methods
  • Sampling methods
  • Sketches
  • Wavelet transforms
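As one concrete instance of a sketch-based summary, the following is a minimal Count-Min sketch for approximate frequency counts. The technique is chosen here as an illustration and is not named in the slides; the width, depth and SHA-256-based hashing are arbitrary assumptions.

```python
import hashlib


class CountMinSketch:
    """Fixed-size summary for approximate frequency counts."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One hash per row, derived by salting the item with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, column in self._columns(item):
            self.table[row][column] += count

    def estimate(self, item):
        """Estimated count: never below the true count, possibly above it."""
        return min(self.table[row][column] for row, column in self._columns(item))
```

Memory use is fixed at width × depth counters no matter how many distinct items appear; the price is that hash collisions can inflate an estimate.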



DSMS Issues – Sliding Windows

Produce an approximate answer to a query by evaluating it over a
sliding window of recent data from the stream.

The window operator periodically produces visible sets of tuples. Sets
are defined by range (width in time or tuples), slide (how often to emit
a set) and start (when to start emitting sets).
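The range, slide and start parameters can be made concrete with a small time-based window operator. This is a sketch under stated assumptions: tuples are `(timestamp, value)` pairs arriving in non-decreasing timestamp order, and a window emitted at time t contains the tuples with t - range < timestamp <= t. The function name and these boundary conventions are illustrative.

```python
from collections import deque


def sliding_windows(stream, range_seconds, slide_seconds, start):
    """Yield (emit_time, values) every slide_seconds, beginning at start."""
    buffer = deque()  # (timestamp, value), oldest first
    next_emit = start
    for timestamp, value in stream:
        # Emit every window that closes before this tuple's timestamp.
        while timestamp > next_emit:
            # Drop tuples that are too old for the window ending at next_emit.
            while buffer and buffer[0][0] <= next_emit - range_seconds:
                buffer.popleft()
            yield next_emit, [v for _, v in buffer]
            next_emit += slide_seconds
        buffer.append((timestamp, value))
```

A window is emitted only once a later tuple shows that its end time has passed, so the operator never waits for the end of the stream. With a range of 5, a slide of 3 and a start of 3, consecutive windows overlap by two time units.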



DSMS Issues – Adaptability
• Stream operators need to be push-based
• Cost of query plan may change => need a flexible query plan!
• Tuple routing: push tuples one at a time through the operator graph;
  choose order of operators at runtime
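A heavily simplified illustration of choosing operator order at runtime: a chain of filter predicates that routes each tuple through the predicates in order of their observed pass rate, lowest first, so that the most selective filter tends to run first. This is an assumption-laden sketch, not the routing policy of any particular system; the class name and the neutral prior of 0.5 are hypothetical.

```python
class AdaptiveFilterChain:
    """Apply a conjunction of predicates, reordering them per tuple."""

    def __init__(self, predicates):
        # Per-predicate statistics: [predicate, tuples seen, tuples passed]
        self.stats = [[predicate, 0, 0] for predicate in predicates]

    def _pass_rate(self, entry):
        _, seen, passed = entry
        return passed / seen if seen else 0.5  # assumption: neutral prior

    def process(self, item):
        """Return True if the item satisfies every predicate."""
        # Order is recomputed for each tuple from the statistics so far.
        for entry in sorted(self.stats, key=self._pass_rate):
            entry[1] += 1
            if not entry[0](item):
                return False  # short-circuit: remaining predicates are skipped
            entry[2] += 1
        return True
```

The result is the same whatever the order, because the predicates form a conjunction; only the amount of work per tuple changes as the statistics evolve.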



DSMS Issues – Data Stream
Mining
• Process of applying well-known static mining techniques on data streams
• Poses new challenges:
  • Algorithms to mine the data with only one pass
  • Understanding the relationship between accuracy and amount of data seen
  • Concept drift and evolving models
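One simple way to notice concept drift is to compare a model's recent error rate with its long-run error rate. The heuristic below is a sketch for illustration only, not a method from the talk; the class name, window size and threshold are arbitrary assumptions.

```python
from collections import deque


class DriftMonitor:
    """Flag suspected drift when recent errors exceed the long-run rate."""

    def __init__(self, window_size=50, threshold=0.2):
        self.recent = deque(maxlen=window_size)  # last outcomes, 1 = error
        self.errors = 0
        self.count = 0
        self.threshold = threshold

    def update(self, was_error):
        """Record one prediction outcome; return True if drift is suspected."""
        outcome = 1 if was_error else 0
        self.recent.append(outcome)
        self.errors += outcome
        self.count += 1
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent history yet
        recent_rate = sum(self.recent) / len(self.recent)
        overall_rate = self.errors / self.count
        return recent_rate - overall_rate > self.threshold
```

When drift is flagged, a stream miner would typically rebuild or adapt its model on recent data; that step is outside this sketch.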
Summary
• DSMSs required to support new class of data-intensive applications
  • Continuous, unbounded data streams
• Key differentiating features of DSMS:
  • One pass of the data
  • Approximation
  • Continuous queries
  • Adaptability
References
1. Golab, L., and Ozsu, M. T., “Issues in data stream management”, ACM
SIGMOD Record, Vol. 32, No. 2, pp. 5-14, June 2003.
2. Babcock, B., Babu, S., Datar, M., Motwani, R., and Widom, J., “Models and Issues
in Data Stream Systems”, in Proceedings of the 21st ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems (PODS), Madison,
Wisconsin, June 2002, pp. 1-16.
3. Abadi, D., Carney, D., Cetintemel, U., Cherniack, M., Convey, C., Lee, S.,
Stonebraker, M., Tatbul, N., and Zdonik, S., “Aurora: A New Model and
Architecture for Data Stream Management”, Journal of Very Large Data Bases
(VLDB), Vol. 12, No. 2, pp. 120-139, August 2003.
4. Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Nishizawa, I., Rosenstein,
J., and Widom, J., “STREAM: The Stanford Data Stream Management
System”, IEEE Data Engineering Bulletin, Vol. 26, No. 1, 2003.
5. Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M. J., Hellerstein,
J. M., Hong, W., Krishnamurthy, S., Madden, S., Raman, V., Reiss, F., and
Shah, M., “TelegraphCQ: Continuous Dataflow Processing for an Uncertain
World”, Proc. Conf. on Innovative Data Syst. Res., 2003, pp. 269-280.
Thank you!



Disclaimer

The IBM Center for Advanced Studies (CAS) regularly
publishes technical documents that are aimed at
facilitating an exchange of information. These reports are
written by individuals or groups of authors and represent
their opinions and do not necessarily reflect those of
IBM. In the event of questions or concerns regarding
individual reports or presentations, email addresses have
been provided for most authors. Alternatively, please feel
free to contact CAS at casinfo@ca.ibm.com. CAS is
dedicated to providing a forum for IBM employees,
research affiliates, and university students to publish
results of their work, their views on technical issues,
book reviews, and workshop reports.
Trademarks

• Company, product, or service names identified in
the text may be trademarks or service marks of
IBM or other companies. Information on the
trademarks of International Business Machines
Corporation in the United States, other countries,
or both is located at
http://www.ibm.com/legal/copytrade.shtml.

• Other company, product or service names may be
trademarks or service marks of others.
