Supporting Data Stream Mining Applications in DBMS & DSMS

Carlo Zaniolo, UCLA CSD

04/26/12

1

Data Stream Mining and DSMS
Mining data streams: an emerging area of important applications, e.g., intrusion detection, click-stream analysis, credit-card fraud. Knowledge Discovery from Data Streams (KDDS) represents a vibrant area of research.

Many fast & light algorithms developed for mining data streams: Ensembles, Moment, SWIM, etc.

These cannot be deployed as stand-alone tasks: they need to manage bursty arrivals, windows, scheduling, and QoS, as DSMS do. Analysts want to focus on high-level mining tasks, leaving the lower-level issues to the system.

Even the much easier task of managing KDD applications on stored data has produced KDD workbenches; thus what is needed is:

A data stream mining workbench built on and integrated with a DSMS!

But this faces difficult research challenges, and SQL is part of the problem, as illustrated by the DBMS experience with KDD.

http://wis.cs.ucla.edu

DM Experience for DBMS: Dreams vs. Reality

Decision support and business intelligence:
 OLAP & data warehouses: a resounding success for DBMS vendors, via simple extensions of SQL (aggregates & analytics)
 Relational DBMS extensions for DM queries: a flop

A 'high-road' approach was suggested by Imielinski & Mannila [CACM'96], who called for a quantum leap in functionality based on:
 Simple declarative extensions of SQL for Data Mining (DM)
 Efficiency through DM query optimization techniques (yet to be invented)

The research area of Inductive DBMS was thus born, producing:
 Interesting language work: DMQL, MSQL, Mine Rule, ...
 ... where the implementation technology lacks generality & suffers performance limitations
 Real questions whether optimizers will ever take us there
 OR-DBMS do not fare much better [Sarawagi'98]

DM Experience for DBMS: Dreams vs. Reality
The Low-Road Approach by Commercial DBMS

 Approaches
   Largely based on a Cache Mining approach
   Stored procedures and virtual mining views
 Outside the DBMS
   Data transfer delays
 No move toward standardization
 IBM DB2 Intelligent Miner, http://www-306.ibm.com/software/data/iminer/: no longer supported

etc.   PL/SQL with extensions for mining Models as first class objects  Create_Model. etc.oracle. text.  http://www.edu 5 .cs.html 04/26/12 http://wis. Prediction. Prediction_Cost. mining. Prediction_Details. etc.ucla.com/technology/products/bi/odm/index.Oracle Data Miner  Algorithms     Adaptive Naïve Bayes SVM regression K-means clustering Association rules..

MS: OLE DB for DM (DMX): 3 steps

 Model creation
    Create mining model MemCard_Pred (
      CustomerId  long key,
      Age         long continuous,
      Income      long continuous,
      Profession  text discrete,
      Risk        text discrete predict )
    Using Microsoft_Decision_Tree

 Training
    Insert into MemCard_Pred
    OpenRowSet( 'sqloledb', 'sa', 'mypass',
      'SELECT CustomerId, Age, Income, Profession, Risk from Customers' )

 Prediction Join
    Select C.Id, PredictProbability(MemCard_Pred.Risk)
    From MemCard_Pred AS MP Prediction Join Customers AS C
    Where MP.Age = C.Age AND MP.Income = C.Income
      AND MP.Profession = C.Profession

MS: Defining a Mining Model
E.g., to predict students' plans to attend college

Defining a model specifies:
 The format of 'training cases' (top-level entity)
 Attributes: input/output type, distribution
 Algorithms and parameters

CREATE MINING MODEL CollegePlanModel (
  StudentID      LONG  KEY,
  Gender         TEXT  DISCRETE,
  ParentIncome   LONG  NORMAL CONTINUOUS,
  Encouragement  TEXT  DISCRETE,
  CollegePlans   TEXT  DISCRETE PREDICT )
USING Microsoft_Decision_Trees

Training

INSERT INTO CollegePlanModel
  (StudentID, Gender, ParentIncome, Encouragement, CollegePlans)
OPENROWSET( '<provider>', '<connection>',
  'SELECT StudentID, Gender, ParentIncome, Encouragement, CollegePlans
   FROM CollegePlansTrainData' )

Prediction Join

SELECT t.ID, CPModel.Plan
FROM CPModel PREDICTION JOIN
  OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ

(Diagram: CPModel has columns Gender, IQ, Plan; NewStudents has columns ID, Gender, IQ.)
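Operationally, a PREDICTION JOIN simply applies the trained model to each row of the joined input table. As an illustration only, here is a rough Python analogue of the query on this slide (the model function and row layout are hypothetical stand-ins, not part of DMX):

```python
# Rough analogue of:
#   SELECT t.ID, CPModel.Plan FROM CPModel PREDICTION JOIN ... AS t
#   ON CPModel.Gender = t.Gender AND CPModel.IQ = t.IQ
def prediction_join(model, new_students):
    # `model` maps the ON-clause inputs (Gender, IQ) to a predicted Plan
    return [(t["ID"], model(t["Gender"], t["IQ"])) for t in new_students]

# A stand-in "model": any callable works; here a trivial hand-written rule
toy_model = lambda gender, iq: "Yes" if iq >= 110 else "No"

rows = [{"ID": 1, "Gender": "F", "IQ": 120},
        {"ID": 2, "Gender": "M", "IQ": 95}]
print(prediction_join(toy_model, rows))  # [(1, 'Yes'), (2, 'No')]
```

The point is that the model participates in the query like a table, while behaving like a function of the join columns.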

OLE DB for DM (DMX) (cont.edu 10 .ucla.microsoft.com/dmx/DataMining/ 04/26/12 http://wis.cs.)  Mining objects as first class objects  Schema rowsets    Mining_Models Mining_Model_Content Mining_Functions  Other features   Column value distribution Nested cases  http://research.

Summary of Approaches

 Vendors' approaches
   Built-in library of mining methods
   Script language or GUI tools
 Limitations
   Closed systems (internals hidden from users)
   Adding new algorithms or customizing old ones: difficult
   Poor integration with SQL
   Limited interoperability across DBMSs
 Predictive Model Markup Language (PMML) as a palliative

PMML: Predictive Model Markup Language

 XML-based language for vendor-independent definition of statistical and data mining models
 Share models among PMML-compliant products
 A descriptive language
 Supported by all major vendors

PMML Example
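The example on this slide was an image and did not survive extraction. As an illustrative stand-in, here is a minimal hand-written PMML 3.0 fragment in the spirit of the models defined earlier in the deck (a data dictionary plus a tiny decision tree; the split value is invented for illustration):

```xml
<PMML version="3.0" xmlns="http://www.dmg.org/PMML-3_0">
  <Header copyright="example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="ParentIncome" optype="continuous" dataType="double"/>
    <DataField name="CollegePlans" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="CollegePlanModel" functionName="classification">
    <MiningSchema>
      <MiningField name="ParentIncome"/>
      <MiningField name="CollegePlans" usageType="predicted"/>
    </MiningSchema>
    <!-- Root node: default prediction "No", with one child split -->
    <Node score="No">
      <True/>
      <Node score="Yes">
        <SimplePredicate field="ParentIncome" operator="greaterThan" value="50000"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
```

Any PMML-compliant product can, in principle, import such a model regardless of which vendor's tool trained it.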

The Data Mining World According to the Data Mining Software Vendors: Market Competition

Disclaimer

This presentation contains preliminary information that may be changed substantially prior to final commercial release of the software described herein. The information contained in this presentation represents the current view of Microsoft Corporation on the issues discussed as of the date of the presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of the presentation. This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION. Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this presentation. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this information does not give you any license to these patents, trademarks, copyrights, or other intellectual property. © 2005 Microsoft Corporation. All rights reserved.

Major Data Mining Vendors

• Platforms
   IBM
   Oracle
   SAS
• Tools
   SPSS
   Angoss
   KXEN
   Megaputer
   FairIsaac
   Insightful

Competition: SQL Server 2005 vs. Oracle 10g vs. IBM vs. SAS
(Originally a comparison table; reconstructed below as per-vendor summaries of the recoverable entries.)

SQL Server 2005: SQL Server Analysis Services. API: OLE DB/DM (DMX), XMLA, ADOMD.Net. Included with the product; targeted at developers. Client tools: embeddable viewers, Reporting Services, Excel Add-In. Strengths: powerful yet simple API; integration with other BI technologies. Weaknesses: not in-process with the relational engine; lacking statistical functions; poor analyst experience.

Oracle 10g: Oracle Data Mining, http://otn.oracle.com/products/bi/odm/odmining.html. API: Java DM, PL/SQL. Separate product; targeted at analysts. Client tools: Discoverer. Strengths: scoring inside the relational engine; new GUI; leader of the JDM API; CRM integration. Weaknesses: API overly complex and inconsistent; expensive.

IBM: DB2 Intelligent Miner and WebSphere, http://www-306.ibm.com/software/data/iminer/. API: SQL MM, based on UDFs and stored procedures. The DB2 IM Scoring module is for developers; the other modules are for analysts; export to DB2 Scoring. Client tools: IM Visualization, WebSphere Portal (vertical targeted-reports solution). Strengths: good credibility with enterprise customers; strong partnership with SAS. Weaknesses: poor API (SQL MM); confusing product line.

SAS: Enterprise Miner, http://www.sas.com/technologies/analytics/datamining/miner/factsheet.pdf. API: SAS Script. Additional packages; targeted at analysts. Strengths: market leader; mature product; robust, industry-tested and accepted algorithms and methodologies; extensive customization and modeling abilities; analysis tools; good service model. Weaknesses: high price; proprietary; customer relations range from congenial to hostile.

Major DM Vendors

 SAS Institute (Enterprise Miner)
 IBM (DB2 Intelligent Miner for Data)
 Oracle (ODM option to Oracle 10g)
 SPSS (Clementine)
 Unica Technologies, Inc. (Pattern Recognition Workbench)
 Insightful (Insightful Miner)
 KXEN (Analytic Framework)
 Prudsys (Discoverer and its family)
 Microsoft (SQL Server 2005)
 Angoss (KnowledgeServer and its family)
 DBMiner (DBMiner)
 etc.

ORACLE Strengths

 Oracle Data Mining (ODM) integrated into the relational engine
   Performance benefits
   Management integration
   SQL language integration
 ODM Client
   "Walks through" the data mining process
   Data-mining-tailored data preparation
   Generates code
 Integration into Oracle CRM
   "EZ" data mining for customer churn, other applications
 Full suite of algorithms
   Typical algorithms, plus text mining and bioinformatics
 Nice marketing/user education

ORACLE Weaknesses

 Additional licensing fees (base $400/user, $20K/processor)
 Confusing API story
   Certain features only work with the Java API
   Certain features only work with the PL/SQL API
   The same features work differently with different APIs
 Difficult to use
   Different modeling concepts for each algorithm
 Poor connectivity: ORACLE only

SAS

• Entrenched data mining leader
   Market share
   Mind share
• "Best of breed"
   Will always attract the top ?% of customers
• Overall poor product
   Only for the expert user (the SAS philosophy)
   Integration of results generally involves source code
• Integrated with ETL and other SAS tools
• Partnership with IBM
   Model in SAS, deploy in DB2

My View

 The DBMS pachyderms have made some progress toward high-level data models and integration with SQL, but:
 Closed systems, lacking in coverage and user-extensibility
 Not as popular as dedicated, stand-alone, open-software DM systems such as Weka
 The OS experience again?

Weka

 A comprehensive set of DM algorithms and tools
 Generic algorithms over arbitrary data sets: independent of the number of columns in tables (*)
 An open and extensible system based on Java

(*) These are the desiderata for a DSMS (or a CEP system) that supports the data stream mining task.

References

 Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39(11):58–64, 1996.
 J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. DMQL: A data mining query language for relational databases. In Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD), pages 27–33, Montreal, Canada, 1996.
 T. Imielinski and A. Virmani. MSQL: A query language for database mining. Data Mining and Knowledge Discovery, 3:373–408, 1999.
 R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In VLDB, Bombay, India, 1996.
 S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD, pages 122–133, 1998.
 Marco Botta, Jean-Francois Boulicaut, Cyrille Masson, and Rosa Meo. Query languages supporting descriptive rule mining: A comparative study. In Database Support for Data Mining Applications, pages 24–51, 2004.

Road Map for the Next Three Weeks

 Algorithms for mining data streams: fast & light
   Classifiers and classifier ensembles
   Association rules
   Clustering methods
   Time series
 Data mining query languages and support for the mining process
 Toward a data stream mining workbench
   Supporting the mining task in a DSMS

References

 IBM. DB2 Intelligent Miner. www-306.ibm.com/software/data/iminer
 ORACLE. Oracle Data Miner Release 10gR2. http://www.oracle.com/technology/products/bi/odm
 Data Mining Group (DMG). Predictive Model Markup Language (PMML). http://sourceforge.net/projects/pmml
 Z. Tang, J. Maclennan, and P. Kim. Building data mining solutions with OLE DB for DM and XML analysis. SIGMOD Record, 34(2):80–85, 2005.
 Hetal Thakkar, Barzan Mozafari, and Carlo Zaniolo. Designing an Inductive Data Stream Management System: the Stream Mill Experience. In the Second International Workshop on Scalable Stream Processing Systems, Nantes, France, March 29, 2008.
 Carlo Zaniolo. Mining Databases and Data Streams with Query Languages and Rules. Invited talk at the Fourth International Workshop on Knowledge Discovery in Inductive Databases (KDID 2005).

Thank you!

Supporting DM Tasks and the Process

 I had a dream: WEKA for data streams! But in a DSMS or a CEP system
 With a DSMS we have to start from SQL rather than Java!
 Case study: Naïve Bayesian Classifiers (NBC), arguably the simplest mining algorithm. It is doable in SQL/DBMS; is it also doable in SQL/DSMS?
 What about the various CEP systems, which claim to be powerful (e.g., support rules)? Can they support NBC? In general, can they be extended to support generic versions of NBC, and perhaps other data stream mining methods?
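To fix intuitions before attempting the SQL/DSMS version: a minimal categorical Naïve Bayes trainer/scorer in plain Python (class and attribute names are illustrative, not from any system). Note how little state it needs: per-class counts and per-(class, attribute, value) counts, which is exactly why NBC maps naturally onto SQL aggregates.

```python
from collections import defaultdict

class NaiveBayes:
    """Categorical Naive Bayes with add-one (Laplace) smoothing."""
    def __init__(self):
        self.class_counts = defaultdict(int)   # class -> number of training tuples
        self.value_counts = defaultdict(int)   # (class, attr, value) -> count
        self.values = defaultdict(set)         # attr -> distinct values seen

    def train(self, features, label):
        self.class_counts[label] += 1
        for attr, value in features.items():
            self.value_counts[(label, attr, value)] += 1
            self.values[attr].add(value)

    def predict(self, features):
        total = sum(self.class_counts.values())
        best, best_p = None, 0.0
        for c, nc in self.class_counts.items():
            p = nc / total                     # prior P(c)
            for attr, value in features.items():
                # likelihood P(value | c), smoothed so unseen values get mass > 0
                p *= (self.value_counts[(c, attr, value)] + 1) / (nc + len(self.values[attr]))
            if best is None or p > best_p:
                best, best_p = c, p
        return best

nbc = NaiveBayes()
nbc.train({"Profession": "engineer", "AgeBand": "young"}, "low")
nbc.train({"Profession": "engineer", "AgeBand": "old"}, "low")
nbc.train({"Profession": "artist", "AgeBand": "young"}, "high")
print(nbc.predict({"Profession": "engineer", "AgeBand": "young"}))  # low
```

Both `train` and `predict` are expressible as grouped counts and a join, which is the crux of the "doable in SQL" claim on this slide.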

Assignment: due on Monday 5/3/10

Download a DSMS or a CEP system of your choice and (after explaining why you have selected this one and not the others) explore how you can implement the following tasks, listed in order of increasing difficulty:
1. Testing of a Naïve Bayesian Classifier: you can assume that the NBC has already been trained and that you can read it from the input, a file, a DB, or memory. (For test sets, refer to CS240A.)
2. Assume now that you also have a stream of pre-classified samples. Use this, at periodic intervals, to determine the accuracy of your current classifier. Output the accuracy, and if it falls below a certain threshold repeat Step 1.
3. Periodically retrain a new NBC from the stream of pre-classified tuples, then use the newly built classifier to predict the class of unclassified tuples (Step 1).
4. See if you can generalize your software, e.g., design/develop generic NBCs, other classifiers, ensemble methods, etc.
It is understood that the limitations of DSMS and CEP systems will probably prevent you from completing all these tasks. So you should make sure that you (1) download a good system (but not Stream Mill), and (2) write a clear report explaining your efforts and the reasons that prevented you from going further.
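The accuracy-monitoring part of the assignment can be prototyped before touching any DSMS. A minimal Python sketch (names and the window/threshold values are arbitrary choices, not prescribed by the assignment): keep a count-based window over the most recent pre-classified tuples and raise a retraining flag when accuracy in that window drops below the threshold.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks classifier accuracy over the last `window` pre-classified
    tuples and signals when it falls below `threshold`."""
    def __init__(self, window=100, threshold=0.8):
        self.outcomes = deque(maxlen=window)   # 1 = correct, 0 = wrong
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def needs_retraining(self):
        # Only trigger once the window is full, to avoid noisy early readings
        return (len(self.outcomes) == self.outcomes.maxlen
                and self.accuracy() < self.threshold)

mon = AccuracyMonitor(window=4, threshold=0.75)
for pred, actual in [("a", "a"), ("a", "a"), ("b", "a"), ("b", "b")]:
    mon.record(pred, actual)
print(mon.accuracy(), mon.needs_retraining())  # 0.75 False
```

In a DSMS the same logic would be a windowed aggregate over the stream of (prediction, label) pairs; the interesting question the assignment poses is whether your chosen system lets you express the retraining trigger at all.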
