In order to effectively represent data mining models inrelational databases, we need to capture
creation
of datamining models using
arbitrary
mining algorithms,
browsing
of such models (examining their structure orcontents), and
application
of a selected model to an ad-hoc data set for analysis tasks such as
prediction
.Furthermore, for a column that is the result of theprediction, sufficient meta-data must be available with thepredicted column so that analysis tools can interpretproperties of prediction, e.g., its accuracy.Relational database systems understand and support onlyrelations as first class objects and so if we are to representa data mining model in databases, it must be viewed as a“table-like” structure. However, at a first glance, a modelis more like a graph, with a complex interpretation of itsstructure, e.g., a decision tree classifier. Thus, trying torepresent a mining model as a table (or a set of rows)appears unnatural. Fortunately, this need not be thestumbling block. The key steps in the lifecycle of amining model are to
create and populate
a model via analgorithm on a training data source, and to be able to usethe mining model to
predict
values for data sets. If we cancapture these steps using SQL metaphors then it wouldensure that database developers are able to leverage datamining functionality without a shift in their paradigm forapplication development. In the next section, we outlinean approach to accomplish this goal.
3. OLE-DB for Data Mining
OLE-DB is a well-known programming interface thatallows an application to connect to and consumeinformation from any relational data source (need not be aSQL database). This generality makes OLE-DB an idealcandidate API to extend with data mining capabilitiesbased on the design philosophies outlined in the previoussection. Microsoft has proposed OLE-DB for DM toenable data mining functionality on OLE-DB providers.The details of this specification is available online [4] andwe only provide a short overview here.The OLE DB for DM interface allows client applicationsto explore and manipulate the existing mining models andtheir properties through an interface similar to that usedfor exploring tables, views and other first class relationalobjects. All mining models are represented as table-likeobjects with columns, data types and extended meta-information that is needed for data mining algorithms.Before we discuss how one can view mining models astables, we need to recognize the data representation needsrelated to mining. Traditional statistical learningalgorithms prefer to view a data set to consist of (attribute, value) pairs representing “cases” (observations)on a certain entity, e.g., customer. Effective data miningoften requires that these cases in the training consolidateall the information relevant to the entity. Since inrelational databases data is often scattered over multiplenormalized tables, this creates a conceptual mismatch. Inparticular, cases are far better represented as nestedrecords than flat records
1
. Traditional SQL representationalso falls short is in capturing metadata on columns. Toeffectively derive and use a mining model, we must beable to identify properties of an attribute (e.g., discrete vs.continuous) and relationships among attributes.Accordingly, we have enabled a column in a rowset(representing a set of cases) of the four
content types
:
key, attribute, relationship, and qualifier.
The notion of key and attributes are traditional, i.e., a key uniquelyidentifies the case and an attribute is a property of thecase. More interesting are the column types relationshipand qualifier. A relationship column qualifies/classifies anattribute, e.g., a column “product type” classifies theattribute “product name”. Thus, the column “producttype” can be recorded as
related to
“product type”. Aqualifier represents a column that provides additionalinformation on value of another attribute that can beinterpreted by an OLE-DB provider. For example, acolumn may represent the probability associated with apredicted attribute, distribution information, domain types(discrete/continuous) etc. Such information can beinvaluable in data analysis. We describe how the keysteps on a mining model are supported in OLE-DB forDM:
CREATE:
A client application can create (define) a newmining model simply by executing a
CREATE MINING MODEL
statement. This statement is very similar in styleand semantics to that of a CREATE TABLE statementbut contains more meta-information about the columns asindicated above. An example is presented below:
CREATE MINING MODEL [Age Prediction]([Customer ID]LONG KEY,[Gender] TEXT DISCRETE,[Age] DOUBLE DISCRETE PREDICT,[Product Purchases] TABLE([Product Name] TEXT KEY,[Quantity]DOUBLE NORMAL CONTINUOUS,[Product Type] TEXT DISCRETE RELATEDTO [Product Name]))USING [Decision_Trees_A]
Note that each column has an additional meta-dataspecification that identifies its column type. For example,Product Type is marked as related to Product Name. Thetraining algorithm is indicated using the USING clause.The columns that are being predicted are indicated by useof the keyword PREDICT.
INSERT:
Once the mining model was created, itfunctions as an empty table, i.e., its structure could be
1
Microsoft Data Access Components support hierarchicalrowsets using the Shape Provider. See [4] for details.
720
Leave a Comment