
US008538988B2

(12) United States Patent
     Lang et al.

(10) Patent No.: US 8,538,988 B2
(45) Date of Patent: Sep. 17, 2013

(54) SELECTIVE STORING OF MINING MODELS FOR ENABLING INTERACTIVE DATA MINING

(75) Inventors: Alexander Lang, Stuttgart (DE); Bernhard Mitschang, Boblingen (DE); Ruben Pulido de los Reyes, Stuttgart (DE); Christoph Sieb, Schoenaich (DE); Michael Wurst, Stuttgart (DE)

(73) Assignee: International Business Machines Corporation, Armonk, NY (US)

(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 0 days.

(21) Appl. No.: 13/616,783

(22) Filed: Sep. 14, 2012

(65) Prior Publication Data
     US 2013/0018917 A1    Jan. 17, 2013

Related U.S. Application Data

(63) Continuation of application No. 12/951,542, filed on Nov. 22, 2010, now Pat. No. 8,380,740.

(30) Foreign Application Priority Data
     Dec. 22, 2009 (EP) 09180322

(51) Int. Cl.
     G06F 7/00 (2006.01)

(52) U.S. Cl.
     USPC 707/776

(58) Field of Classification Search
     USPC 707/776
     See application file for complete search history.

(56) References Cited

U.S. PATENT DOCUMENTS

6,820,089 B2       11/2004   Vishnubhotla
7,370,316 B2       5/2008    Kraiss et al.
7,512,623 B2       3/2009    Apps et al.
2003/0212678 A1    11/2003   Bloom et al.
2003/0212855 A1    11/2003   Sakaguchi et al.
2004/0010505 A1    1/2004    Vishnubhotla
2005/0102303 A1    5/2005    Russell et al.
2007/0214135 A1    9/2007    Crivat et al.
2008/0077544 A1    3/2008    Sureka
2009/0094174 A1    4/2009    Kussmaul
2009/0144276 A1    6/2009    Russell et al.

OTHER PUBLICATIONS

T. Morzy et al., "Fast Discovery of Sequential Patterns Using Materialized Data Mining Views", In: International Symposium on Computer and Information Sciences, 2000.
B. Nag et al., "Caching for Multi-Dimensional Data Mining Queries", Proceedings of the 2001 SCI Conference, 2001.
S. Dar et al., "Semantic Data Caching and Replacement", Proceedings of the Third International Conference on Information and Knowledge Management, 1994.

(Continued)

Primary Examiner: Jensen Hu

(74) Attorney, Agent, or Firm: Susan Murray; Edell, Shapiro & Finnan, LLC

(57) ABSTRACT

A new data mining model (DMM) is created having at least one of the following characteristics: quality and complexity. The new DMM is handled as a candidate for storing in a storage device if a predefined criterion for the characteristics is met. The sum of the sizes of the new DMM and already stored DMMs is determined. In response to the sum falling below a storage limit, the new DMM is stored in the storage device. In response to the sum exceeding the storage limit, a decision is taken based on priorities of the DMMs which DMMs to store in the storage device.

20 Claims, 11 Drawing Sheets

[Representative drawing: flow chart of FIG. 1. Create a first DMM having at least one of the characteristics quality and complexity; test whether the criterion for the characteristics is met; determine the sum of the size of the first DMM and the sizes of further DMMs already stored in the storage means; decide, based on priorities that depend at least on access frequencies, which DMMs to store; store the first DMM in the storage means.]

(56) References Cited (continued)

OTHER PUBLICATIONS

Y. Arens, C.A. Knoblock, "Intelligent Caching: Selecting, Representing, and Reusing data in an Information Server", In Proceedings of the Third International Conference on Information and Knowledge Management, 1994.
B. Czejdo, M. Morzy, M. Wojciecowski, M. Zakrzewicz, "Materialized Views in Data Mining", Database and Expert Systems Applications, 2002.
F. Bonchi, F. Giannotti, G. Manco, C. Resno, M. Nanni, D. Pedreschi, S. Ruggieri, "Data Mining for Intelligent Web Caching", International Conference on Information Technology: Coding and Computing, 2001.
"CRoss Industry Standard Process for Data Mining", http://www.crisp-dm.org/.
M.J. Carey, M.J. Franklin, M. Livny, E.J. Shekita: "Data Caching Tradeoffs in Client-Server DBMS Architectures", SIGMOD Rec. 20, Feb. 1991.

[Drawing sheets 1 to 11; figure content not reproduced.]

Sheet 1 of 11, FIG. 1: flow chart of the first method for storing data mining models.
Sheet 2 of 11, FIG. 2: first flow chart of the second method (storing DMM identification and DMM quality information, then storing DMMs whose quality fulfills a first predefined criterion).
Sheet 3 of 11, FIG. 3: second flow chart of the second method (handling a new DM request, including the request for user confirmation).
Sheet 4 of 11, FIG. 4: example for storing clustering data mining models, showing requests R1 to R4 (411 to 414), tasks T1 to T3 (421 to 423), model characteristics of M1 to M3 (431 to 433), and materialization states (441 to 443).
Sheets 5 to 7 of 11, FIGS. 5 to 7: flow charts describing selective and interactive storing of data mining models.
Sheet 8 of 11, FIG. 8: sample scenario for association rule models, showing request table 810, task table 820, and model table 830.
Sheet 9 of 11, FIGS. 9 and 10: materialization steps 910 to 915 for models M0 to M3, and model 1010 with rules 1011 to 1013 and average quality 1014.
Sheet 10 of 11, FIG. 11: block diagram of the first data processing system 1100.
Sheet 11 of 11, FIG. 12: block diagram of the second data processing system 1200.

SELECTIVE STORING OF MINING MODELS FOR ENABLING INTERACTIVE DATA MINING

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 12/951,542, entitled "Selective Storing of Mining Models for Enabling Interactive Data Mining" and filed Nov. 22, 2010, the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND

1. Technical Field

The present invention relates to computerized methods, data processing systems, and computer program products for storing of data mining models.

2. Discussion of the Related Art

Data mining refers in general to data-driven approaches for extracting hidden information from input data. The extracted information depends on a type of data mining and is put together in data mining models. This model information can be further analyzed or verified against further process information.

Data mining techniques typically need to consider how to effectively process large amounts of data. Consider manufacturing of products as an example. There, the input data may include various pieces of data relating to origin and features of components. The aim of data mining in the context of manufacturing may be to resolve problems relating to quality analysis and quality assurance. Data mining may be used, for example, for root cause analysis, for early warning systems within the manufacture plant, and for reducing warranty claims. As a second example, consider various information technology systems. There, data mining may further be used for intrusion detection, system monitoring, and problem analyses. Data mining has also various other uses, for example, in retail and services, where typical customer behavior can be analyzed, and in medicine and life sciences for finding causal relations in clinical studies.

Three discovery methods of data mining are clustering, association rules, and sequences. Clustering data mining seeks to find distinct clusters or groups of data records having similar attributes. The records of one cluster should be homogeneous, and the records of two different clusters should be as heterogeneous as possible. Association rules are patterns describing which items occur frequently within transactions. Sequences data mining finds typical time-ordered sequences of items in given input data.

Traditional data mining scenarios assume that a single user designs and executes a data mining task once or few times. Iterative data mining processes such as the Cross Industry Standard Process for Data Mining (CRISP-DM) process have been designed for such scenarios. These processes are usually performed offline as a background task. However, many current and future data mining scenarios show rather different characteristics. First, data mining is not a planned task, but is rather invoked ad hoc in the course of an interactive analytical process. Interactive data mining puts much higher demands on response times than offline data mining. And second, data mining is invoked by many different users in parallel on the same datasets with partially overlapping tasks. If each user invokes data mining independently of each other, many resources may be wasted. This is a serious problem especially in the cases where the datasets that are analyzed are rather large.

One way to address this problem is to use a general purpose caching algorithm for data mining models, that is, if a data mining model is built, it is not only returned to the user who issued a query, but also stored in a cache. Such a cache would then allow several users to share the same model. If a model was already built by another user, it does not need to be rebuilt again later on. However, users can specify a broad variety of parameters and storage space is limited to store all built data mining models.

In the field of databases, there are known techniques of caching. For example, semantic caches were developed, which are aware of how queries are related to each other and exploit this information to allow for a more intelligent caching strategy. These database query caches are not suitable to support caching of data mining models.

In the field of association rule data mining, a chunk-based cache can be provided that stores the results of association rule mining queries along with semantic information. However, this method is limited to item sets. Other methods are known for mapping new data mining queries onto existing materialized, cached, data mining views of previous data mining queries. These approaches are also limited to association rule mining. The existing approaches do not address the questions which models should be cached and how a further reduction of response times can be effectively realized.

BRIEF SUMMARY

The present invention provides a computerized method, a data processing system, and a computer program product for storing of data mining models for further use in face of limited storage space.

In an example embodiment of the present invention, a computer-implemented method for storing of data mining models comprises creating a first data mining model having at least one of the characteristics of quality and complexity, determining whether the first data mining model is a candidate for storing in a storage means or storage device in response to a criterion for the characteristics being met, determining whether the sum of the size of the first data mining model and sizes of further data mining models already stored in the storage means exceeds a storage limit, in response to the storage limit not being exceeded, storing the first data mining model in the storage means, and in response to the storage limit being exceeded, determining, based upon priorities of the first data mining model and the further data mining models, which data mining models to store in the storage means, the priorities being dependent at least on access frequencies of the respective data mining models.

For each stored data mining model, a respective task description may be stored. A data mining task may have an associated data mining model. There may be provided a surjective (N-to-1) mapping from data mining requests onto data mining tasks and a bijective (1-to-1) mapping from data mining tasks onto data mining models. Upon receiving a new data mining request, a determination may be performed based on the surjective and bijective mappings whether a data mining model associated with the new data mining request is already stored in the storage means. Post-processing of the data mining model associated with the new data mining request may be performed.

For the first data mining model, information identifying the first data mining model and information describing the first data mining model characteristics may be stored even if the first data mining model is not stored for further use.
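The storing decision summarized above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the claimed implementation: the class `Model`, the functions `is_candidate` and `decide_storage`, the threshold parameters, and the greedy "keep highest priority first" selection are all hypothetical. The text only requires that priorities depend at least on access frequencies, so the default priority used here is the access frequency itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Model:
    name: str
    quality: float            # assumed scale 0..1
    complexity: float         # assumed proxy: build time in minutes
    size: float               # storage size, e.g. in MB
    access_frequency: float   # how often the model has been requested


def is_candidate(m: Model, min_quality: float, min_complexity: float) -> bool:
    # Assumed criterion: both characteristics reach their thresholds.
    return m.quality >= min_quality and m.complexity >= min_complexity


def decide_storage(
    new: Model,
    stored: List[Model],
    limit: float,
    min_quality: float,
    min_complexity: float,
    priority: Callable[[Model], float] = lambda m: m.access_frequency,
) -> List[Model]:
    """Return the models that should be in the storage means afterwards."""
    if not is_candidate(new, min_quality, min_complexity):
        return list(stored)                      # new model is not a candidate
    if new.size + sum(m.size for m in stored) <= limit:
        return list(stored) + [new]              # storage limit not exceeded
    # Storage limit exceeded: keep the highest-priority models that still fit
    # (one possible policy, chosen here for illustration).
    kept, used = [], 0.0
    for m in sorted(list(stored) + [new], key=priority, reverse=True):
        if used + m.size <= limit:
            kept.append(m)
            used += m.size
    return kept
```

With made-up numbers, a 3 MB candidate added to a store holding a 5 MB and a 2 MB model is simply appended when the limit is 10 MB; with a limit of 8 MB the least frequently accessed model is the one left out.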

The surjective mapping may comprise comparing data mining requests and data mining tasks based on data selection constraints and data mining constraints.

A data mining model may comprise a set of data patterns. A data pattern may have an associated pattern property determined by the data mining task. The data mining quality of the data mining model may be calculated based on the pattern properties. The data mining model complexity may be determined based on a build time of the data mining model.

A priority of a data mining model may be calculated based at least on one of the following characteristics: access frequency, build time, storage size, and quality. From the storage means, already stored data mining models may be removed based at least on their priorities.

Another example of the present invention provides a computer-implemented method for storing of data mining models comprising the following steps. For data mining models, information identifying a data mining model and information describing data mining model quality is stored in storage means. Data mining models having data mining model quality fulfilling a first predefined criterion are stored in the storage means. Upon receiving a new data mining request, a data mining model is determined for the new data mining request and checking is performed when information describing data mining model quality for the data mining model has been stored in the storage means. A user is requested to confirm that data mining is to proceed if data mining model quality of the data mining model does not fulfill a second predefined criterion.

A data mining model may comprise a set of data patterns. A data pattern may have an associated pattern property determined by the data mining task. The data mining quality of the data mining model may be calculated based on the pattern properties.

A further example embodiment of the present invention comprises a data processing system for storing of data mining models. The system comprises storage means for storing data mining models and data processor or data processing means. The data processing means creates a first data mining model, which has at least one of the following characteristics: quality and complexity. The data processing means determines whether the first data mining model is a candidate for storing in the storage means if a criterion for the characteristics is met. The data processing means determines whether the sum of the size of the first data mining model and the sizes of further data mining models already stored in the storage means exceeds a storage limit. When the storage limit is not exceeded, the data processing means stores the first data mining model in the storage means. When the storage limit is exceeded, the data processing means decides, based upon priorities of the first data mining model and the further data mining models, which data mining models to store in the storage means. The priorities are dependent at least on access frequencies of the respective data mining models.

A still further example embodiment of the present invention comprises a data processing system for storing of data mining models. The system comprises a storage device or storage means, a data processor or data processing means, an input device or input means, and an output device or output means. The storage means stores for data mining models information identifying a data mining model and information describing data mining model quality, and the storage means stores data mining models. The data processing means stores in the storage means the information identifying a data mining model and the information describing data mining model quality. The data processing means stores in the storage means data mining models having data mining model quality fulfilling a first predefined criterion. The input means receives a new data mining request. The data processing means determines a data mining model for the new data mining request and determines whether information describing data mining model quality for the data mining model has been stored in the storage means. The data processing means initiates a request to a user to confirm that data mining is to proceed when data mining model quality of the data mining model does not fulfill a second predefined criterion. The output means submits the request to the user.

Another example embodiment of the present invention comprises a computer program product for storing of data mining models. The computer program product comprises a computer usable medium having computer usable program code embodied therewith. The computer usable program code is configured to perform the processing steps of creating a first data mining model having at least one of the characteristics of quality and complexity, determining whether the first data mining model is a candidate for storing in storage means in response to a criterion for the characteristics being met, determining whether the sum of the size of the first data mining model and sizes of further data mining models already stored in the storage means exceeds a storage limit, in response to the storage limit not being exceeded, storing the first data mining model in the storage means, and in response to the storage limit being exceeded, determining, based upon priorities of the first data mining model and the further data mining models, which data mining models to store in the storage means, the priorities being dependent at least on access frequencies of the respective data mining models.

An additional example embodiment of the present invention comprises a computer program product for storing of data mining models. The computer program product comprises a computer usable medium having computer usable program code embodied therewith. The computer usable program code is configured to store in storage means for data mining models information identifying a data mining model and information describing data mining model quality, store in the storage means data mining models having data mining model quality fulfilling a first predefined criterion, upon receiving a new data mining request, determine a data mining model for the new data mining request and determine whether information describing data mining model quality for the data mining model has been stored in the storage means, and request a user to confirm that data mining is to proceed in response to a determination that data mining model quality of the data mining model does not fulfill a second predefined criterion.

The above and still further features and advantages of embodiments of the present invention will become apparent upon consideration of the following detailed description thereof, particularly when taken in conjunction with the accompanying drawings wherein like reference numerals in the various figures are utilized to designate like components.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a flow chart of a first method for storing of data mining models according to an embodiment of the present invention.

FIGS. 2 and 3 illustrate flow charts of a second method for storing of data mining models according to an embodiment of the present invention.

FIG. 4 illustrates an example for storing of clustering data mining models according to an embodiment of the present invention.

FIGS. 5 to 7 illustrate flow charts describing selective and interactive storing of data mining models according to an embodiment of the present invention.

FIGS. 8 to 10 illustrate a sample scenario for storing of association rule data mining models according to an embodiment of the present invention.

FIG. 11 illustrates a block diagram of a first data processing system for storing of data mining models according to an embodiment of the present invention.

FIG. 12 illustrates a block diagram of a second data processing system for storing of data mining models according to an embodiment of the present invention.

DETAILED DESCRIPTION

It is noted that, in FIG. 1 and also the other figures, the term DM refers to data mining and the term DMM refers to data mining model.

Caching or materialization means storing data mining models information for further use in portions of transient memory or in portions of a persistent storage device, for example, a hard disk. When the cached information is requested a second time, it can be faster returned to the user because the data mining model information does not have to be re-computed. The percentage of user requests that can be directly responded from information stored in the cache is called cache hit rate. A semantic cache in accordance with embodiments of the present invention takes data mining model characteristics into account for selectively storing the data mining models. This allows for short response times for difficult to compute and frequently used data mining models and does not consume unnecessarily much storage space in the memory or in other storage devices. When data mining models are requested, the contents of the cache are dynamically updated based on re-computed priorities, which are calculated based at least on access frequencies of the data mining models.

FIG. 1 illustrates a flow chart of a first example method for storing of data mining models. In step 101, a first data mining model is created, which has at least one of the following characteristics: quality and complexity. In step 102, a test is performed to determine whether a criterion for the characteristics is met. If yes, the first data mining model is handled as a candidate for storing in a storage means or storage device in step 103. Otherwise, processing ends. In step 104, the sum of the size of the first data mining model and the size of further data mining models already stored in the storage means is determined. In step 105, a test is performed if this sum exceeds a pre-defined storage limit. If the storage limit is not exceeded, the first data mining model is stored in the storage means in step 107. If the storage limit is exceeded, a decision is taken based on priorities of the first data mining model and the further data mining models which data mining models to store in the storage means in step 106. The priorities are dependent at least on access frequencies of the respective data mining models.

FIG. 2 illustrates a first flow chart of a second example method for storing of data mining models. In the second method, information identifying a data mining model and information describing data mining model quality are stored in storage means for previously created data mining models, as illustrated by step 201. As illustrated by step 202, the previously created data mining models are tested if their data mining model qualities fulfill a first predefined criterion. If yes, these previously created data mining models are stored in the storage means, as illustrated by step 203. Otherwise, they are not stored.

FIG. 3 illustrates a second flow chart of the second example method for storing of data mining models. A new data mining request is received in step 301. A data mining model for the new data mining request is determined in step 302. Step 303 checks if information describing data mining model quality for this data mining model has already been stored in the storage means in step 201 of FIG. 2. If the check of step 303 fails, the new data mining request is not further processed. If the check of step 303 is successful, step 304 tests if the data mining model quality fulfills a second predefined criterion. If the test of step 304 is successful, processing ends for the new data mining request. If the test of step 304 fails, a user is requested in step 305 to confirm that data mining is to proceed. If the user confirms in step 306, data mining proceeds for the new data mining request in step 307. Otherwise, processing on this data mining request is not continued.

A data mining model is calculated based on a data mining task. A data mining task description specifies a data mining type, data selection constraints for data on which data mining is to be applied, and data mining constraints which depend on the data mining type. A data mining request is the actual invocation of a data mining task by a user. The request comprises parameters for a surjective many-to-one (N-to-1) mapping onto a corresponding data mining task. This surjective mapping allows different data mining requests to be associated with the same data mining task even if the requests and tasks have different data mining constraints and other different parameters. A bijective one-to-one (1-to-1) mapping associates a data mining task with a data mining model. The mappings may be performed based on program code portions and configuration parameters.

FIG. 4 illustrates an example for storing of clustering data mining models. The example comprises four data mining requests 411, 412, 413, and 414, with respective identifiers R1, R2, R3, and R4. The data mining type is always Cluster. Each request specifies data selection on one of the tables A, B, and C. The surjective mapping procedure copies the data mining type and the data selection constraints from the data mining request to a corresponding data mining task, which is one of the data mining tasks 421, 422, and 423 with respective identifiers T1, T2, and T3. The data mining constraints of the request are mapped onto the data mining constraints of the task as follows: A user may specify a maximum number of clusters for a data mining model calculation, for example, MaxNum=10 in the data mining request 412 and MaxNum=2 in the data mining request 414. A data mining request may be mapped onto an already existing data mining task with the same or a higher maximum number MaxNum of clusters. When the user does not specify the maximum number of clusters, for example, MaxNum=auto in the data mining requests 411 and 413, the clustering data mining system tries to map such requests to already stored data mining tasks with same data mining type and data selection constraints. If this is not possible, a new task description with a predefined value for MaxNum is stored. Then, the data mining system creates the data mining model associated with the data mining task.

A first user submits a data mining request 411 to cluster data of a table A without specifying a maximum number of clusters. Assume that the data mining system maps the first request 411 with identifier R1 onto a new data mining task T1, sets the maximum number of clusters for the data mining task T1 to a predefined value of MaxNum=10 and computes the associated data mining model M1. A second request 412 with identifier R2 specifies the same table A and the value MaxNum=10 or lower. The data mining system is aware that the automatically determined parameter was actually evaluated to MaxNum=10. Then, the second request would be mapped onto the same data mining task T1.
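The request-to-task mapping for the clustering example can be sketched as follows. This is a minimal illustration under assumptions: the names `Task` and `map_request`, the use of `None` to stand for MaxNum=auto, and the way task identifiers are generated are all hypothetical; only the matching rule (same data mining type and data selection, existing task with the same or a higher MaxNum, predefined MaxNum for a new task created from an "auto" request) follows the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    task_id: str
    mining_type: str
    table: str
    max_num: int


def map_request(
    mining_type: str,
    table: str,
    max_num: Optional[int],      # None stands for MaxNum=auto (assumption)
    tasks: List[Task],
    predefined_max_num: int = 10,
) -> Task:
    """Map a request onto an existing task if possible, else create a new task."""
    for t in tasks:
        same_selection = t.mining_type == mining_type and t.table == table
        # An existing task is reusable if it has the same or a higher MaxNum;
        # an "auto" request accepts any task on the same data selection.
        if same_selection and (max_num is None or t.max_num >= max_num):
            return t
    new_task = Task(
        task_id=f"T{len(tasks) + 1}",            # hypothetical naming scheme
        mining_type=mining_type,
        table=table,
        max_num=predefined_max_num if max_num is None else max_num,
    )
    tasks.append(new_task)
    return new_task
```

Because several requests may return the same `Task` object, the mapping is many-to-one; keeping exactly one model per task would then give the one-to-one mapping from tasks onto models.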

complexity. such as details more general data mining task description that yields the data mining model M1. The materialiZation states 441 and 443 for the respective models M1 and M3 are MaterialiZeIYES and MAYBE. “MateriaIiZeINO” indicates that a calculated data min ters of the data mining task 422 is set to a prede?ned value.988 B2 7 data mining task T1.538. The model M3 for the task T3 may also be . Instead of computing a neW data mining model for the second request R2. “MaterialiZeIYES” indicates that a calculated data min ciated data mining model M6 is built. The access count or materialiZed because it has a higher quality and complexity than the prede?ned thresholds and the cache storage limit is not yet exceeded. 8 access frequency is a fraction of hits of the data mining model that occur in a time WindoW. corresponding values may also be estimated based on other properties.3). A further neW data mining request 414 With identi?er R4 speci?es data selection on table C. M1. the maximum number of clusters is not speci?ed. l]. MaxNum:5. such as for In the case of the folloWing threshold values (namely Mini example. for example. MaxNum:2. The second reason means that the model build time is smaller than a minimum build time. The post processing comprises aggregating some of the clusters con tained in the data mining model M1 until M1A contains exactly the number of clusters speci?ed in the data mining request R5. the model M2 for the task T2 is not materialiZed. 433. The materialiZation criterion may be expressed using the folloWing formula: mining task 423 is built. This data mining task is independent from the task T1 because the data selection is based on a different table B. Based on the data mining model characteristics. The evaluation results in a materialiZation state for each data 20 mining model. When the data mining model M3 associated With the data 40 storage limit of the cache is exceeded. 
where the homogeneity of a cluster is a mean similarity between two records of the cluster. T3. specified in data mining task T3. Since the Clustering model is the overall homogeneity of the clusters. If a clustering data mining request R5 has a lower MaxNum value than the actual number of clusters in from the data mining task. a new data mining task T6 is created with the requested MaxNum value and an asso mated access frequency. This Clustering. The new request R3 is mapped onto a new ing model should be stored in the cache if the data mining model quality is not too low and the build time of the data data mining task 422 with identifier T2. post-processing is needed to compute a data mining model M1A. Materialization means storing a complete data mining model for further use. 432. the model quality number is the overall homogeneity of the clusters as measured by the mean distance between data records of the cluster and the best matching cluster center. the model M1 for the task T1 is The data mining model complexity is determined based on a build time of the data mining model. a hard disk.

    IF ( ModelQuality < MinimumQuality OR ModelComplexity < MinimumComplexity )
        Materialize = NO
    ELSEIF ( CacheStorageLimitNotExceeded )
        Materialize = YES
    ELSE
        Materialize = MAYBE
    ENDIF

431. T2. Multiple data mining methods are known. T1. that is. Both methods result in clustering models. In the case of the Kohonen data mining model can be computed easily on the fly. the cached data mining model M1 can be returned to the user. Therefore. The determined data mining model characteristics are stored in the cache independent of the decisions on storing the complete information of the data mining model. the data mining model may be re-computed and eventually stored in the cache when the cache has sufficient free storage space. and storage size of the data mining model. The model quality number Q of the Demographic mumQuality=0. Then.
This is the time to compute the data mining model on the fly. The clustering model may contain a model quality number Q in the range [0 . The request 414 with identifier R4 specifies a maximum number of clusters. The materialization state variable may have one of three values: 1 . Therefore. This task ing model should not be stored in the cache if the data mining model quality is too low or the build time of the data mining model is too short. it may have fewer clusters than the maximum number of clusters. access count or access frequency. The execution of each data mining task. “Materialize=MAYBE” indicates that a data mining model should not be stored after a computation for a first data mining request even when the data mining model quality is not too low and the build time is not too short because the is independent from the tasks T1 and T2 because data mining is performed on another table C. demographic clustering and Kohonen neural clustering. Each data mining model has a set of data mining model characteristics. If a clustering data mining request R6 has a higher MaxNum value than an existing data mining task T1 with the same data selection constraints. M2. and is mapped onto the task 423 with identifier T3 with the same maximum number of clusters. that is. the calculation of the materialization state 442 returns Materialize=NO for the model M2 because its quality is lower than the first threshold and its complexity is lower than the second threshold. At subsequent requests. and an estimated data mining model size. MaxNum=2. depends on whether the storage limit for the cache is already exceeded or not. where each model comprises a discrete number of clusters. This estimation may provide an estimated data mining model quality. an estimated data mining model complexity. results in an associated data mining model. which comprise information about quality. MaxNum=auto. All characteristics are measurable.
In a neW data mining request 413 With identi?er R3. This request is mapped onto a neW data mining task 423 With identi?er T3.5 and MinimumComplexity:0. a materi aliZation criterion may be evaluated. The cache can be implemented as transient memory or as Any clustering data mining request that has the same data selection constraints as a data mining task T1 and speci?es an equal or loWer MaxNum value than the data mining task T1 may be mapped onto the data mining task T1. they are expressed in terms of numerical values. this data mining task T1 may be considered as an equivalent or persistent storage. An associated data mining model M2 is created. The storage siZe is an amount of storage that the data mining model consumes in the cache. The maximum number of clus mining model is not too short and if a storage limit of the cache is not exceeded. M3. an estimated access count or an esti the data mining model M1 and the data mining request R5 has been mapped onto the data mining task T1. 2. a mapping of the data mining request R6 may be impossible. Which has 5 or feWer clusters. MaxNum:2. Instead of having real measured values for any of the four data mining model characteristics. 3.
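The three-valued materialization criterion described in this passage can be sketched in Python as follows. This is a minimal illustration only, not the claimed implementation: the function name, the parameter names, and the use of an enumeration are assumptions introduced here, and the thresholds are passed in as configuration values rather than fixed.

```python
from enum import Enum


class Materialize(Enum):
    """The three materialization states named in the description."""
    NO = "NO"
    YES = "YES"
    MAYBE = "MAYBE"


def materialization_state(quality, complexity, min_quality, min_complexity,
                          cache_limit_not_exceeded):
    """Evaluate the materialization criterion for one data mining model.

    quality and complexity are the measured (or estimated) model
    characteristics; min_quality and min_complexity are the predefined
    thresholds. All names are hypothetical.
    """
    # A model of too low quality, or one cheap enough to rebuild on the fly,
    # is not a candidate for the cache.
    if quality < min_quality or complexity < min_complexity:
        return Materialize.NO
    # A qualifying model is stored while the cache still has room.
    if cache_limit_not_exceeded:
        return Materialize.YES
    # Otherwise it stays a candidate that may be stored later.
    return Materialize.MAYBE
```

With a minimum quality of 0.5 and a minimum build time of 10 minutes, a model of quality 0.4 would evaluate to NO regardless of its build time, while a model of quality 0.9 and build time 20 minutes would evaluate to YES or MAYBE depending on the cache state.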

the built data mining model Will not be materialiZed and the process continues at step 508. When an already stored data mining model has loWest priority. The built data mining model is returned to the user in step 508. Constraints of data mining requests and data mining tasks are represented as conjunctions of parameters. If for the received data mining request no suitable data mining task can be found in the cache in step 502. DataSource:“Table A” AND Max Num:“l0”. the data mining system receives a data mining request. The access count or access 30 ciently by simple logical entailment. 7. is a binary relation among a set of frequency measurement is typically limited to a time WindoW. because the second con straint is more general than the ?rst one. and its quality. When the storage limit is not puted on the ?y. its build time or its computation time. Then. storage siZe. If the corresponding data mining model characteristics are already stored in the cache. Data mining task descriptions and data mining model characteristics can be stored in the cache for data mining models that themselves are not stored in the cache. the data mining system checks if the data mining model quality is loWer than a prede?ned minimum quality or if the build time is loWer than a prede?ned mini mum build time. The data mining model characteristics may com prise a model identi?er. In step 502. The data mining system may update the corresponding task description With information from the data mining model. The data mining priority may be the product of the storage siZe and the access count of the data mining model. this model is removed from the cache in step 603. the data mining system calculates priori ties of the built data mining model and the already stored data mining models. 5. 5. When a request ful?lls more speci?c constraints. in step 501. requests and a set of tasks. the step 502 can be performed very e?i corresponding data mining model. In FIG. the model is given a priority. 6. 
that is. the built data mining model is returned to the user. processing continues via connection element 1 in FIG. the user may decide not to use the model. Again. 5 to 7 illustrate ?oW charts describing selective and result of step 505 is “no”. When the data mining quality is higher than the minimum quality threshold in step 701. When a data mining task is found in the cache that is equivalent to the data mining request or that is more general than the data mining request. a one-to-one mapping. 6. processing enters via connection element 3 from 20 As illustrated in FIG. 5 to store the built data mining model in the cache. the 65 data mining system tests in step 702 if the associated data mining model is stored in the cache. The preferred embodiment employs tWo mechanisms to actually determine this relation: The ?rst mechanism is applied When a parameter is not speci In step 602. the data mining system determines the sum of the siZe of the built data mining model and the siZes interactive storing of data mining models. in step 504. tics in the cache. the data mining system continues via the connection element 2 at step 504 in . MaxNum:auto. In step 502 of FIG. a 40 data mining request With a data mining constraint Max Num:2 can be mapped onto a data mining task With a data mining constraint MaxNum:l0. 60 When the user con?rms to proceed With data mining in step 705. processing continues via connection element 3 in FIG. processing con tinues at step 602 to test if an already stored data mining model has loWest priority. The 25 frequency. Back in FIG. The mapping betWeen the task description and the data mining model to be built is bijective. the data mining system issues a Warning to a user in step 704 and requests the user to con?rm that data mining is to proceed. In the second case. When the materialiZation state of a model is set to 10 In step 505. 5. its access count or access task. it can be mapped onto a task having more general constraints. 
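The subsumption test between a data mining request and a cached data mining task can be sketched as a comparison of two conjunctions of parameters. The following Python fragment is a hedged illustration: the dictionary representation, the function name can_map, and the split into upper-bound and lower-bound parameters are assumptions made for this sketch. It does not cover automatically determined parameters such as MaxNum=auto.

```python
def can_map(request, task, upper_bounds=("MaxNum",), lower_bounds=("MinSupp",)):
    """Return True if the task is equivalent to or more general than the request.

    request and task are dicts of constraint name -> value (hypothetical
    representation of a conjunction of parameters).
    """
    # Both conjunctions must constrain the same parameters.
    if request.keys() != task.keys():
        return False
    for name, task_value in task.items():
        request_value = request[name]
        if name in upper_bounds:
            # e.g. a request with MaxNum=2 fits a task built with MaxNum=10
            if request_value > task_value:
                return False
        elif name in lower_bounds:
            # e.g. a request with MinSupp=20% fits a task built with MinSupp=10%
            if request_value < task_value:
                return False
        elif request_value != task_value:
            # all other parameters (data source, filter, type) must match
            return False
    return True
```

For instance, a request {"DataSource": "Table A", "MaxNum": 2} would map onto a task {"DataSource": "Table A", "MaxNum": 10}, but not the other way round.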
the data mining system tests if already stored data mining models have loWest priorities. Processing continues via connection element 5 in FIG. Preferably. The data mining system checks if these data mining tasks ful?ll the materialiZation criterion for pre-com putation of the associated data mining models and for storing them in the cache. In step 705.538. access count and build time. For example. the data mining system builds the data mining model associated With the data mining task determined in step 502. a many-to-one mapping. If the storage limit is not exceeded any more. that is. 5. 3. When the data 45 mining request has been successfully mapped onto an exist ing data mining task. A surjective mapping from such data mining requests onto data mining tasks. the data mining system builds the data mining model With corresponding data mining model char acteristics and may store the data mining model characteris loWer than a minimum quality threshold. MaxNum:l0. In step 601. In step 508. the data mining system checks if a data mining task that is equivalent to the received data mining request or a more general data mining task than the received data mining request are stored in the cache. the process goes via connection element 4 to step 507 in FIG. For example. and the processing of the data mining request is terminated. the built data mining model gets an access fre quency value of one. The data mining system tests in step 701 Whether the data sumption relation betWeen different constraints.988 B2 9 materialized in the future When its access frequency increases. The data mining system automatically determines this unspeci?ed parameter for the data mining task. exceeded. the data mining system stores the built data mining model in the cache in step 507. 2. mining model quality of the associated data mining model is 55 Then. the data mining system continues processing via connection element 1 in FIG. 
Which is independent from storing the data mining model in the cache or not. 5. in step 504. If no. Which may typically depend on the storage siZe of the data mining model siZe and its access count. If yes. the data mining system tests in step 604 Whether the sum of the siZe of the built data mining model and siZes of the already stored data mining models exceeds the prede?ned storage limit If yes. If the quality is too loW. 4. an admin istrator user may specify several data mining tasks that are considered to be relevant for many users and cannot be com of further data mining models already stored in the cache and tests if this sum exceeds a prede?ned storage limit This means that the cache Would run out of storage space if the built data mining model Was stored. the received data mining request is more speci?c or more restricted than the cached data mining FIG. information about model quality. Which may be divided by a total number of user requests. the data 50 mining system may store in step 503 in the cache parameters of the received data mining request as a task description. 7. When the YES or MAYBE. and 5. A data mining priority may be determined based at least on one of the folloWing characteristics: the storage siZe of a data mining model. These How charts are interlinked by connector elements 1. The access count is a number of user requests to provide the parameters may be literals that only contain simple binary relations. the built data 35 ?ed in the request description.US 8. The second mechanism makes use of a sub mining model has loWest priority and the data mining system decides not to store it in the cache. FIGS. for example. they may be updated. If the storage limit is exceeded in step 506. If no. When the data mining system is used a ?rst time. for example. the data mining system continues via connection ele ment 2.
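The storage-limit test and the priority-based replacement of steps 505 to 507 and 601 to 604 can be sketched as follows. This is an illustrative reading of the flow charts, not the patented implementation: the class Model, the function admit, the handling of priority ties, and the decision to leave the cache untouched when the built model is rejected are all assumptions. The priority is taken as the product of storage size and access count, which the description names as one option.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    size_mb: float
    access_count: int

    @property
    def priority(self):
        # One option named in the description: storage size times access count.
        return self.size_mb * self.access_count


def admit(built, cache, limit_mb):
    """Try to store `built`; return (resulting cache list, stored flag)."""
    remaining = list(cache)
    # Evict stored models while the built model would overflow the cache.
    while built.size_mb + sum(m.size_mb for m in remaining) > limit_mb:
        if not remaining:
            return list(cache), False  # the built model alone does not fit
        lowest = min(remaining, key=lambda m: m.priority)
        # Assumption: on a tie the built model counts as lowest priority.
        if built.priority <= lowest.priority:
            return list(cache), False  # built model is not stored
        remaining.remove(lowest)
    remaining.append(built)
    return remaining, True
```

As a usage example with hypothetical values: with a 9 MB limit, a cache holding a 5 MB model accessed 5 times and a 2 MB model accessed twice, and a newly built 3 MB model accessed twice, the 2 MB model (priority 4) would be evicted and the new model (priority 6) stored.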

R2. The storage siZe and build time of the data mining model. When the data set on Which data mining is performed requests. and M3. This may help him to bundle parts together for delivery. caching of analysis results can signi?cantly reduce response times and thus increase productivity. A task description table 820 contains data mining tasks.538. In step 911 of table 910. RlB. for example. . M0. When the corresponding data mining task is equivalent to the data mining request. and the cached data mining model can be directly returned to the user in step 508 in FIG. and R3B. should groW With the amount of data taken into account. he submits an association rule mining request RlA against the defect parts table DEFECTS for the dealers in Munich and speci?es a minimum support of 10%.I2. Let D 30 be a set of transactions. The materialiZation criterion. M0. The support supp(A) of an itemset A is de?ned as the proportion of transactions in the data set Which contain the itemset A. However. When the corresponding data mining task is more general than the data mining request. Where each transaction T is a subset of (MATERIALIZE:“NO”). These employees can execute analysis 65 task description T1 is stored in the cache. that is. Since there is no same or more general task description in the cache. the data mining model M1 is built based on data from the table DEFECTS With FilteFMunich. The characteristics of the clustering data mining items belonging to I. requests against the central data Warehouse containing the data. MinimumQuality:0. The data selection is performed on the customers table CUST With ?lter condition FilterIall. M1. . {I1. A0 the data mining model can be divided by the total number of user requests. . The data mining constraint is that the maximum number of clusters in the data mining model is MaxNum:5. Which are stored in the cache When processing data mining mining type. The company has a large licensed dealer netWork. 5. 
8 to 10 illustrate a sample scenario for storing of association rule data mining models. and Mininimum BuildTime:l0 minutes. If the associated data mining model is stored in the cache. Therefore. if c % of the transactions in D that contain the subset A of items also contain the subset B of items. the data 12 Therefore. is speci?ed in minutes. and M3. The build time of the data mining model. Iq} ofitems in I. . This means that the association rule A—>B holds true in the transaction set D With a con?dence c. Alternatively. IF} ofitems ml and may also contain a further subset B:{IP+1. The priority is calculated as a product of the data mining model siZe and the access count of user requests for 50 the data mining model to be “25”. The data mining type of the task T0 is Clustering. is based on the cache storage limit. In the case of clustering data mining. FIG. The best-knoWn constraints are minimum thresholds on sup ing model quality is 0. Among such job roles are persons supporting the dealers or ensuring quality standards. The materialiZation state of each data mining model indicates Whether the materialiZation state has not yet been determined FIGS. all data mining models based on the changed data set are removed from the cache. the decision proce dure When the data mining model is Written to the cache. 8 shoWs a user request table 810 of data mining mining request. and the build time of the data mining model. . P(B|A). T0. The dealers also offer repair shop services. . Iq}. in table 830. as very rare or loosely correlated events may not be of importance for some applications. user requests. M1. The access count speci?es hoW many times the data mining model has been hit by user requests. the probability of the subset B occurring in a transac tion T given the subset A occurs in the same transaction T. . the con?dence is the conditional probability. 5 MB or 5 Mega bytes. 
table 910 shoWs materialiZation states for the 25 data mining models at various steps of the sample scenario. 20 minutes. the regularly Work With the data and perform analysis. The 45 access count can also be considered as an absolute request port and con?dence. Their characteristics are stored in table 830 and are a basis for selecting data mining models to be stored in the cache.9 or 90%. The system administrator speci?ed for mining system may perform post-processing on the associ ated data mining model in step 703. 5. that is. A transaction T may contain a subset A:{Il. Whether the data mining model could be stored in the cache (MATERIALIZE:“YES” or “MAYBE”) dependent on available storage space. 5 times. Consider a set of items I:{I1. T2. The subsetA of items is called the body and the subset B of items the head of the rule. An association rule is an implication of the form AQB. IP+2. and has been de?ned above. Where S is the union set of items of the subset A and the subset B occurring in a transaction T in the transaction set D. . I2. . the descriptions of the data mining tasks. the data mining model quality. Often employees request same or similar analysis. Association rules are (MATERIALIZE:“-”). The siZe speci?es the storage used in the cache to store the data mining model. To select model M0 comprise the folloWing properties: The data min 40 interesting rules from the set of all possible rules. . 9. T2. . have no common ele ments. for example. . In other frequency. T0.thatis. .IP+Z.US 8. the number of clusters may be reduced by aggregation of existing clusters of a cached data mining model. that is. The con?dence of a rule A—>B is de?ned conf(AQB):supp(S)/supp(A). The folloWing example describes an automotive company. or Whether the data mining model could not be stored in the cache patterns describing Which items occur frequently Within transactions. a detailed example is described beloW. the access count of Words. for example. 
Where the subsetA and the subset B are disjunct. 5 and builds the associated data mining model.988 B2 11 FIG. . Which is a measure of the data mining model complexity. In FIG. I2. For association rule mining. rules. .IP}Q{IP+I. no post-processing may be required. These properties are expected to be dependent of the nature of the data mining method applied on the selected data itself. A sample scenario starts With only a data mining task descriptionT0 and an associated data mining model M0 in the cache. The creation of the sub-model may depend on the data requests R1A. T1. Then. T1. M2. and T3. M2. In response to a sequence of 20 essentially changes. Via connection element 5. a sub-model may be created from the cached data mining model according to the data the data mining system the folloWing parameters: CacheStor ageLimit:9 MB. this sub-model may be returned to the user in step 508 in FIG. The priority preferably has a unity measurement unit. The data concerning sales and defects are consolidated and transferred regularly to the auto 60 motive company. Different persons With different job roles A delivery support team member Wants to analyZe defect parts replaced by his dealers in the Munich region to ?nd frequent defect parts occurring together in customer orders. The data mining tasks have associated data mining models. R3A. in table 820 are subsequently stored in the cache together With the characteristics of associated data mining models. Im}. and T3. infor mation about access count and quality may still be kept in the cache.5. the materialiZa user may de?ne a minimum support or con?dence for the 55 tion state for the data mining model M0 is set to MATERIALIZEIYES. hoWever. Which results in a relative access frequency and a modi?ed priority. . constraints on various measures of signi?cance and interest can be used.betWeenthe 35 subset A and the subset B.
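The support and confidence measures defined for association rules can be computed directly from a set of transactions. The following Python sketch assumes that transactions are given as sets of item names; the function names and the sample items are hypothetical and serve only to illustrate the definitions supp(A) and conf(A→B) = supp(A ∪ B) / supp(A).

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= frozenset(t))
    return hits / len(transactions)


def confidence(body, head, transactions):
    """conf(body -> head) = supp(body | head) / supp(body)."""
    body_support = support(body, transactions)
    if body_support == 0:
        raise ValueError("rule body never occurs in the transactions")
    union = frozenset(body) | frozenset(head)
    return support(union, transactions) / body_support


# Hypothetical spare part orders, one set of defect parts per order.
orders = [
    {"gasket", "plug"},
    {"gasket", "plug"},
    {"gasket"},
    {"shaft"},
]
print(support({"gasket", "plug"}, orders))       # support of the rule
print(confidence({"gasket"}, {"plug"}, orders))  # P(plug | gasket)
```

A minimum support or minimum confidence threshold then selects the rules whose values returned by these functions reach the threshold.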

According to third program code portions 1115. 30 investigated and possibly redesigned. the data mining system can extract a sub-model from the data mining model M1. the data mining model is temporarily stored in the cache. 11 illustrates a block diagram of a data processing With the examination of the results. He Wants to carry out a similar analysis on defect parts. HoWever. According to second program code portions 1114. that is. Since the 5 Megabytes of the model M0 and the 2 Megabytes of the model M1 do not exceed the cache storage limit of 9 Megabytes. In step 912. 45 Finally. The sub -model comprises all rules Where minimum support is 20%. Finally. Even though the table DEFECTS is the same as for the data mining task T1. The determined sub -model is returned to the requesting user Who bene?ts from a shorter response time and can right aWay start 40 its priority is re-calculated to become “6”.US 8. the ?lter condition is different and the stored data mining model M1 cannot be re-used in this case. For the 14 description T2 is not stored in the cache at all or removed from the cache When the data mining model is not stored. a neW data mining task description T2 may be stored in the cache and the associated data mining model M2 is built. the model M3 is returned to the requesting user. The data mining model quality is calculated as the con?dence average of rules found. The data processing . A neW data mining task de?nition T3 is stored in the cache and the data mining model M3 is built. Therefore. The corresponding data mining task description T3 and the characteristics of the data mining model M3 may hoWever remain in the cache. the data mining models With the loWest priorities are removed from the cache. The insight from the rules helps him to offer useful combinations of associated spare parts to the dealers. Which already exists in the cache With a mini mum support. 
The data 50 leagues to investigate problems regarding the cylinder head gasket and the steering shaft in order to avoid respective problems With the ignition plug and the steering skeW gear. in 55 FilteFStuttgart and submits a corresponding data mining request R2. the data mining model M3 is not stored in the cache and its materialiZation status is updated to 20 The data mining model M1 is returned to the support team member. the model M1 With priority “4” is removed from the cache. The data mining request R1B is mapped onto the data mining task description T1. A yet further quality assurance team member also Wants to carry out a similar analysis R3B on defect parts for dealers in the Berlin area. the materialiZation state of the model M1 is set to MATERIALIZEIYES. if the storage limit is not exceeded. the processor creates a ?rst data mining model 1118 having at least one of the folloWing data mining model characteristics: quality and complexity. Since both the quality and the response time are greater than the thresholds de?ned by the system administrator.538. 1122. The remaining rules in lines 1012 and 1013 correlate other combinations of defect parts. The determined data min ing model M2 is not stored for further use. and M3. It is possible that the data mining task a storage limit According to forth program code portions 1116. For this purpose. . and 3 for M3). and thus is more general than quality and build time of the data mining model M3 usually 35 remain unchanged. MinSupp:20%. but for a different purpose. based on the same defect parts data. . that is. a more general data mining task description T3 can be found in the cache With MinSupp:10%. only Rule 1 and Rule 2 in lines 1011 and 1012. the priorities of all models in the cache are com pared (namely 25 for M0. In this case. only returned to the requesting user. Rule 1 in line 1011 describes an associa tion between tWo defect parts of a car. 
the processor handles 60 the ?rst data mining model as a candidate for storing in the storage means if a criterion for the characteristics is met. . 1117. The mate rialiZation status of the model M2 is set to respective elements of the memory and executes these pro gram code portions as folloWs: According to ?rst program code portions 1113. The models With the loWest priorities are removed from the cache until the sum of the siZes falls beloW the maximum storage limit In this case. 1123. In this case. He may suggest his col system 1100 for storing of data mining models. 76%. Since the data mining model quality of M2 is 40% and loWer than the minimum quality of 50% de?ned by the system administrator. M1. The support of Rule 3 in line 1013 is beloW 20% and is excluded from the sub-model. a cylinder head gasket and an ignition plug. In step 915. The stored data mining quality and build time are suf?ciently high for alloWing storing the data mining model M3 in the cache. For the request R1B. so they can be Supp:20%. He expects to recogniZe the root parts causing most defects. this quality and the build time of 15 minutes are both greater than the thresholds de?ned by the system administrator. FIG. 1113. A further quality assurance team member Wants to carry out an analysis R3A for dealers in the Berlin area as his colleague previously did for the dealers in the Munich region. the data mining model M3 is planned to be stored in the cache. Where three rules are found in table 1010 With a support higher than 10%. This The already existing data mining task description T3 is used for the request R3B. the sum (namely data mining model M1. . . the determined association rules data mining model M3 is returned to the requesting user. as shoWn in line 1014. he is interested in defect parts occurring together in spare part orders. 
the materialization status for the data mining model M1 is set to MATERIALIZE=MAYBE and for the data mining model M3 the materialization status is set to MATERIALIZE=YES. The access count for the data mining model M1 is set to “1” and the priority is calculated to be “2”. but with a higher minimum support Min similar analysis. 1121. This rule has a support of 24% and a confidence of 68%. For example. no data mining models have to be deleted from the cache. the processor determines whether the sum of the size of the first data mining model 1118 and the sizes of further data mining models already stored in the storage means. For this request. 10 MB) of the sizes of the data mining models. he also submits an association rule data mining request R1B against the defect parts table DEFECTS for the Munich dealers and specifies a minimum support. Its access count is incremented to “2” and the data mining request. But the associated data mining model M3 does not exist in the cache and needs to be rebuilt. 1122 and 1123. A quality assurance team member needs to carry out a MATERIALIZE=MAYBE in step 914. but with a filter condition processing system comprises a data processor or data processing means 1110 and storage means 1120 for storing of data mining models. 10 shows a possible data mining model M1. A further support team member is responsible for dealers in the region of Stuttgart. FIG. To investigate whether a defect in a certain part causes another part to break. the model M2 is not cached. M2. exceeds the maximum storage limit (namely 9 MB). The access count of the data mining model M1 is incremented to “2” and the priority value is re-calculated to be “4”. The data mining model M1 stored in the cache is associated with the data mining task T1. whereas the model M3 with priority “6” is stored in the cache. The data processing means includes a processor 1111 and a memory 1112. exceeds MATERIALIZE=NO in step 913. MinSupp=10%.
The processor stores program code portions. . Therefore. . Since the storage limit would again be exceeded because 10 MB is greater than 9 MB. 4 for M1.
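The extraction of a sub-model from a cached association rule model, as in the request with MinSupp=20% served from a model built with MinSupp=10%, can be sketched as a filter over the stored rules. The Python fragment below is an assumption-laden illustration: the Rule record and both function names are hypothetical, and only the support of 24% and confidence of 68% for the first rule are taken from the description; all other values are invented for the example. Model quality is computed as the average confidence of the rules, as stated for the scenario.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    body: frozenset
    head: frozenset
    support: float
    confidence: float


def extract_sub_model(rules, min_support):
    """Keep only the rules that satisfy the stricter minimum support."""
    return [r for r in rules if r.support >= min_support]


def model_quality(rules):
    """Quality of an association rule model as the average rule confidence."""
    return sum(r.confidence for r in rules) / len(rules)


# Cached model built with MinSupp=10%; values other than Rule 1 are made up.
cached_model = [
    Rule(frozenset({"cylinder head gasket"}), frozenset({"ignition plug"}), 0.24, 0.68),
    Rule(frozenset({"steering shaft"}), frozenset({"steering skew gear"}), 0.22, 0.80),
    Rule(frozenset({"part X"}), frozenset({"part Y"}), 0.12, 0.80),
]

sub_model = extract_sub_model(cached_model, min_support=0.20)
print(len(sub_model))               # number of rules kept for MinSupp=20%
print(model_quality(cached_model))  # average confidence of the cached model
```

Because the sub-model is obtained by filtering, no new mining run is needed for the more specific request, which is the source of the shorter response time described for the requesting user.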