
Supporting Data Stream Mining Applications in DBMS & DSMS

Carlo Zaniolo, UCLA CSD

04/26/12

Data Stream Mining and DSMS


Mining data streams: an emerging area of important applications, e.g., intrusion detection, click-stream analysis, credit-card fraud. Knowledge Discovery from Data Streams (KDDS) represents a vibrant area of research.

Many fast & light algorithms have been developed for mining data streams: Ensembles, Moment, SWIM, etc.

These cannot be deployed as stand-alone tasks: they need to manage bursty arrivals, windows, scheduling, and QoS, as DSMS do. Analysts want to focus on high-level mining tasks, leaving the lower-level issues to the system.

Even the much easier task of managing KDD applications on stored data has produced KDD workbenches; thus what is needed is a data stream mining workbench built on & integrated with a DSMS! But this faces difficult research challenges, and SQL is part of the problem, as illustrated by the DBMS experience with KDD.



http://wis.cs.ucla.edu

DM Experience for DBMS: Dreams vs. Reality. Decision Support and Business Intelligence

OLAP & data warehouses: a resounding success for DBMS vendors, via simple extensions of SQL (aggregates & analytics).

Relational DBMS extensions for DM queries: a flop. OR-DBMS do not fare much better [Sarawagi 98].

A high-road approach was proposed by Imielinski & Mannila [CACM96], who called for a quantum leap in functionality based on:

- Simple declarative extensions of SQL for Data Mining (DM)
- Efficiency through DM query optimization techniques (yet to be invented)

The research area of Inductive DBMS was thus born, producing interesting language work: DMQL, Mine Rule, MSQL. But the implementation technology lacks generality and suffers performance limitations, and there are real questions whether optimizers will ever take us there.


DM Experience for DBMS: Dreams vs. Reality. The Low-Road Approach by Commercial DBMS

Approaches largely based on:
- A cache: mining outside the DBMS, hence data transfer delays
- Stored procedures and virtual mining views

No move toward standardization.

IBM DB2 (http://www-306.ibm.com/software/data/iminer/): Intelligent Miner no longer supported.


Oracle Data Miner

Algorithms: Adaptive Naïve Bayes, SVM regression, K-means clustering, association rules, text mining, etc.

PL/SQL with extensions for mining; models as first-class objects: Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.

http://www.oracle.com/technology/products/bi/odm/index.html


MS: OLE DB for DM (DMX): 3 steps

Model creation:

CREATE MINING MODEL MemCard_Pred (
    CustomerId  LONG  KEY,
    Age         LONG  CONTINUOUS,
    Profession  TEXT  DISCRETE,
    Income      LONG  CONTINUOUS,
    Risk        TEXT  DISCRETE  PREDICT
) USING Microsoft_Decision_Tree;

Training:

INSERT INTO MemCard_Pred
OPENROWSET('sqloledb', 'sa', 'mypass',
    'SELECT CustomerId, Age, Profession, Income, Risk FROM Customers')

Prediction Join:

SELECT C.Id, C.Risk, PredictProbability(MP.Risk)
FROM MemCard_Pred AS MP PREDICTION JOIN Customers AS C
ON MP.Profession = C.Profession
   AND MP.Income = C.Income
   AND MP.Age = C.Age;


MS: Defining a Mining Model

E.g., a model to predict students' plans to attend college. The model specifies:
- The format of training cases (top-level entity)
- Attributes: input/output type, distribution
- Algorithms and parameters

Example:

CREATE MINING MODEL CollegePlanModel (
    StudentID      LONG  KEY,
    Gender         TEXT  DISCRETE,
    ParentIncome   LONG  NORMAL CONTINUOUS,
    Encouragement  TEXT  DISCRETE,
    CollegePlans   TEXT  DISCRETE PREDICT
) USING Microsoft_Decision_Trees


Training

INSERT INTO CollegePlanModel
    (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
OPENROWSET(<provider>, <connection>,
    'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
     FROM CollegePlansTrainData')

Prediction Join

SELECT t.ID, CPModel.Plan
FROM CPModel PREDICTION JOIN
     OPENQUERY(<provider>, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ
[Diagram: CPModel columns (ID, Gender, IQ, Plan) matched against NewStudents columns (ID, Gender, IQ)]
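In relational terms, a prediction join maps each row of the input query through the trained model and returns the predicted column alongside the row's key. A minimal Python sketch of that semantics (the model class, column names, and threshold here are illustrative stand-ins, not the DMX API):

```python
def prediction_join(model, rows, key="ID"):
    """Pair each input row's key with the model's prediction --
    the relational reading of DMX's PREDICTION JOIN."""
    return [(row[key], model.classify(row)) for row in rows]


class ThresholdModel:
    """Toy stand-in for a trained mining model (invented for this sketch)."""
    def classify(self, row):
        return "attends" if row["IQ"] > 100 else "does not attend"
```

For example, `prediction_join(ThresholdModel(), [{"ID": 1, "IQ": 110}])` returns `[(1, "attends")]`.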


OLE DB for DM (DMX) (cont.)

Mining objects as first-class objects. Schema rowsets:
- Mining_Models
- Mining_Model_Content
- Mining_Functions

Other features:
- Column value distribution
- Nested cases

http://research.microsoft.com/dmx/DataMining/


Summary of Approaches

Vendors:
- Built-in library of mining methods
- Script language or GUI tools

Limitations:
- Closed systems (internals hidden from users)
- Adding new algorithms or customizing old ones is difficult
- Poor integration with SQL
- Limited interoperability across DBMSs

Predictive Model Markup Language (PMML) as a palliative.


PMML: Predictive Model Markup Language

- XML-based language for vendor-independent definition of statistical and data mining models
- Share models among PMML-compliant products
- A descriptive language
- Supported by all major vendors


PMML Example
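A minimal PMML fragment for a decision-tree model gives the flavor (a hand-written sketch: element names follow the PMML specification, but the field names, split value, and scores are invented for illustration):

```xml
<PMML version="2.0" xmlns="http://www.dmg.org/PMML-2_0">
  <DataDictionary numberOfFields="3">
    <DataField name="Gender" optype="categorical"/>
    <DataField name="ParentIncome" optype="continuous"/>
    <DataField name="CollegePlans" optype="categorical"/>
  </DataDictionary>
  <TreeModel modelName="CollegePlanModel" functionName="classification">
    <MiningSchema>
      <MiningField name="Gender"/>
      <MiningField name="ParentIncome"/>
      <MiningField name="CollegePlans" usageType="predicted"/>
    </MiningSchema>
    <!-- root node with one illustrative split -->
    <Node score="does not attend">
      <True/>
      <Node score="attends">
        <SimplePredicate field="ParentIncome"
                         operator="greaterThan" value="40000"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
```

Because the model is plain XML, any PMML-compliant product can import it regardless of which vendor's tool trained it.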


The Data Mining World According to the Data Mining Software Vendors: Market Competition

Disclaimer

This presentation contains preliminary information that may be changed substantially prior to final commercial release of the software described herein. The information contained in this presentation represents the current view of Microsoft Corporation on the issues discussed as of the date of the presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of the presentation. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this presentation. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this information does not give you any license to these patents, trademarks, copyrights, or other intellectual property. © 2005 Microsoft Corporation. All rights reserved.

Major Data Mining Vendors

Platforms: IBM, Oracle, SAS
Tools: SPSS, Angoss, KXEN, Megaputer, FairIsaac, Insightful
Competition: SQL Server 2005, Oracle 10g, IBM, SAS

Feature comparison (columns: SQL Server Analysis Services | Oracle Data Mining | DB2 Intelligent Miner, WebSphere | SAS Enterprise Miner)

Product links:
- Oracle Data Mining: http://otn.oracle.com/products/bi/odm/odmining.html
- DB2 Intelligent Miner: http://www-306.ibm.com/software/data/iminer/
- SAS Enterprise Miner: http://www.sas.com/technologies/analytics/datamining/miner/factsheet.pdf

API: OLEDB/DM, DMX, XMLA, ADOMD.Net | Java DM, PL/SQL | SQL MM/6 based on UDF, SQL SPROC | SAS Script
Algorithms: 7 (+2) | 8 | 6 | 8+
Text mining: Yes | Yes | Yes | Yes
Marketing pages: N/A | 18 | 10 | Dozens
Client tools: Embeddable viewers, Reporting Services | Discoverer, Excel Add-In | Analysis tools, WebSphere Portal (vertical-based targeted reports solution), IM Visualization | None
Distribution: Included | Additional package | Additional packages | Separate product
Target: Developers | Developers | Developers (DB2 IM Scoring module); analysts (other modules) | Analysts

Strengths:
- SQL Server: powerful yet simple API; integration with other BI technologies; new GUI
- Oracle: good credibility with enterprise customers; new GUI; leader of the JDM API; CRM integration
- IBM: mature product (6 years); good service model; scoring inside the relational engine; strong partnership with SAS
- SAS: mature, market leader; extensive customization and modelling abilities; robust, industry-tested and accepted algorithms and methodologies; export to DB2 Scoring

Weaknesses:
- SQL Server: not in-process with the relational engine; lacking statistical functions; poor analyst experience
- Oracle: API overly complex; inconsistent
- IBM: high price; standard functionality; poor API (SQL MM); confusing product line
- SAS: expensive; proprietary; customer relations range from congenial to hostile

Major DM Vendors

- SAS Institute (Enterprise Miner)
- IBM (DB2 Intelligent Miner for Data)
- Oracle (ODM option to Oracle 10g)
- SPSS (Clementine)
- Unica Technologies, Inc. (Pattern Recognition Workbench)
- Insightful (Insightful Miner)
- KXEN (Analytic Framework)
- Prudsys (Discoverer and its family)
- Microsoft (SQL Server 2005)
- Angoss (KnowledgeServer and its family)
- DBMiner (DBMiner)
- etc.

ORACLE Strengths

- Oracle Data Mining (ODM) integrated into the relational engine: performance benefits, management integration, SQL language integration
- ODM Client: walks through the data mining process; data-mining-tailored data preparation; generates code
- Integration into Oracle CRM: EZ data mining for customer churn and other applications
- Full suite of algorithms: typical algorithms, plus text mining and bioinformatics
- Nice marketing/user education

ORACLE Weaknesses

- Additional licensing fees (base $400/user, $20K/processor)
- Confusing API story: certain features only work with the Java API, certain features only work with the PL/SQL API, and the same features work differently with different APIs
- Difficult to use: different modeling concepts for each algorithm
- Poor connectivity: ORACLE only

SAS

- Entrenched data mining leader: market share, mind share
- Best of breed: will always attract the top ?% of customers
- Overall poor product: only for the expert user (SAS philosophy); integration of results generally involves source code
- Integrated with ETL and other SAS tools
- Partnership with IBM: model in SAS, deploy in DB2

My View ...

DBMS pachyderms have made some progress toward high-level data models and integration with SQL, but they remain closed systems, lacking in coverage and user-extensibility, and not as popular as dedicated, stand-alone, open-source DM systems such as Weka. The open-source experience again?


Weka

- A comprehensive set of DM algorithms and tools
- Generic algorithms over arbitrary data sets: independent of the number of columns in tables
- Open and extensible system based on Java

* These are the desiderata for a DSMS (or a CEP system) that supports the data stream mining task.


References

- [Imielinski 96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39(11):58-64, 1996.
- [Sarawagi 98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD, 1998.
- T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and Knowledge Discovery, 3:373-408, 1999.
- J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A data mining query language for relational databases. In Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD), pages 27-33, Montreal, Canada, June 1996.
- R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In VLDB, pages 122-133, Bombay, India, 1996.
- Marco Botta, Jean-Francois Boulicaut, Cyrille Masson, and Rosa Meo. Query languages supporting descriptive rule mining: A comparative study. In Database Support for Data Mining Applications, pages 24-51, 2004.


Road Map for Next Three Weeks

- Algorithms for mining data streams, fast & light: classifiers and classifier ensembles, clustering methods, association rules, time series
- Data mining query languages and support for the mining process
- Toward a data stream mining workbench: supporting the mining task in a DSMS


References

- IBM. DB2 Intelligent Miner. http://www-306.ibm.com/software/data/iminer
- ORACLE. Oracle Data Miner, Release 10gR2. http://www.oracle.com/technology/products/bi/odm
- Z. Tang, J. Maclennan, and P. Kim. Building data mining solutions with OLE DB for DM and XML analysis. SIGMOD Record, 34(2):80-85, 2005.
- Data Mining Group (DMG). Predictive Model Markup Language (PMML). http://sourceforge.net/projects/pmml
- Carlo Zaniolo. Mining databases and data streams with query languages and rules. Invited talk, Fourth International Workshop on Knowledge Discovery in Inductive Databases (KDID), 2005.
- Hetal Thakkar, Barzan Mozafari, and Carlo Zaniolo. Designing an inductive data stream management system: the Stream Mill experience. Second International Workshop on Scalable Stream Processing Systems (SSPS), March 29, 2008, Nantes, France.


Thank you!


Supporting DM Tasks and the Process in a DSMS or a CEP System

I had a dream: WEKA for data streams! But with a DSMS we have to start from SQL rather than Java.

Case study: Naïve Bayesian Classifiers, arguably the simplest mining algorithm. It is doable in SQL/DBMS. Is it also doable in SQL/DSMS? What about the various CEP systems, which claim to be powerful (e.g., they support rules): can they support NBC? In general, can they be extended to support generic versions of NBC, and perhaps other data stream mining methods?

Assignment (due on Monday 5/3/10)

Download a DSMS or a CEP system of your choice and (after explaining why you have selected this one and not the others) explore how you can implement the following tasks:

1. Testing of a Naïve Bayesian Classifier: you can assume that the NBC has already been trained and that you can read it from the input, a DB, a file, or memory.
2. Assume now that you also have a stream of pre-classified samples. Use this to determine the accuracy of your current classifier at periodic intervals. Output the accuracy, and if it falls below a certain threshold, repeat Step 1.
3. Periodically retrain a new NBC from the stream of pre-classified tuples; then use the newly built classifier to predict the class of unclassified tuples (Step 1).
4. See if you can generalize your software and, e.g., design/develop generic NBCs, ensemble methods, other classifiers, etc.

It is understood that the limitations of DSMS and CEP systems will probably prevent you from completing all these tasks (listed in order of increasing difficulty). So you should make sure that you (1) download a good system (but not Stream Mill), and (2) write a clear report explaining your efforts and the reasons that prevented you from going further. (For test sets, refer to CS240A.)
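Tasks 2-3 above can be sketched outside any particular DSMS as a generic monitoring loop; everything here (the function and parameter names, the `retrain` callback, the window size) is an illustrative assumption, not tied to any system:

```python
def monitor(stream, model, window=100, threshold=0.8, retrain=None):
    """Consume (attrs, true_class) pairs from `stream`. Every `window`
    tuples, yield the accuracy over that window (task 2) and, if it
    falls below `threshold`, rebuild the model from the buffered
    pre-classified tuples via the `retrain` callback (task 3)."""
    correct, buffer = 0, []
    for attrs, cls in stream:
        correct += (model.classify(attrs) == cls)   # score before counting
        buffer.append((attrs, cls))
        if len(buffer) == window:
            acc = correct / window
            yield acc
            if acc < threshold and retrain is not None:
                model = retrain(buffer)             # replace the classifier
            correct, buffer = 0, []
```

The loop is generic over any object exposing `classify`, so the same skeleton serves NBC, ensembles, or other classifiers (task 4).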

