1. Data Pre-processing
Data preprocessing describes any type of processing performed on raw data to prepare it for
another processing procedure. Commonly used as a preliminary data mining practice, data
preprocessing transforms the data into a format that will be more easily and effectively processed
for the purpose of the user.
Why Preprocessing?
Real-world data is generally incomplete, noisy and inconsistent; cleaning, transforming and reducing it first makes the subsequent mining step more accurate and efficient.
Data Cleaning
1. Fill in missing values (attribute or class value):
Ignore the tuple: usually done when class label is missing.
Use the attribute mean (or majority nominal value) to fill in the missing value.
Use the attribute mean (or majority nominal value) for all samples belonging to the
same class.
Predict the missing value by using a learning algorithm: consider the attribute with
the missing value as a dependent (class) variable and run a learning algorithm
(usually Bayes or decision tree) to predict the missing value.
2. Identify outliers and smooth out noisy data:
Binning: sort the attribute values and partition them into bins (see "Unsupervised discretization" below), then smooth by bin means, bin medians, or bin boundaries.
Clustering: group values into clusters and then detect and remove outliers (automatically or manually).
Regression: smooth by fitting the data to regression functions. (A short C++ sketch of mean imputation and bin smoothing follows this list.)
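The following is a minimal C++ sketch of two of these cleaning steps, filling missing values with the attribute mean and smoothing by replacing each value with the mean of its equal-width bin. The attribute values and the bin count are made up for illustration.
// Hypothetical sketch: mean imputation and smoothing by bin means.
// Missing values are marked with NAN; the data and bin count are illustrative.
#include <iostream>
#include <vector>
#include <algorithm>
#include <cmath>

int main() {
    std::vector<double> salary = {20000, NAN, 35000, 42000, NAN, 28000, 51000, 39000};

    // 1. Fill in missing values with the attribute mean.
    double sum = 0; int known = 0;
    for (double v : salary)
        if (!std::isnan(v)) { sum += v; ++known; }
    double mean = sum / known;
    for (double &v : salary)
        if (std::isnan(v)) v = mean;

    // 2. Smooth noisy values: sort, partition into equal-width bins,
    //    and replace each value by its bin mean.
    std::vector<double> sorted = salary;
    std::sort(sorted.begin(), sorted.end());
    const int bins = 3;
    double lo = sorted.front(), width = (sorted.back() - lo) / bins;
    std::vector<double> binSum(bins, 0); std::vector<int> binCnt(bins, 0);
    for (double v : sorted) {
        int b = std::min(bins - 1, (int)((v - lo) / width));
        binSum[b] += v; binCnt[b]++;
    }
    for (double v : sorted) {
        int b = std::min(bins - 1, (int)((v - lo) / width));
        std::cout << v << " -> bin " << b
                  << " (smoothed to " << binSum[b] / binCnt[b] << ")\n";
    }
    return 0;
}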
Data Transformation
1. Normalization:
Min-Max: Scaling attribute values to fall within a specified range.
Z-score: scaling using the mean and standard deviation (useful when the minimum and maximum are
unknown or when there are outliers): V' = (V − Mean) / StdDev. (A small numeric example follows.)
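For min-max normalization to a new range [new_min, new_max], V' = (V − min) / (max − min) × (new_max − new_min) + new_min. As an illustration with made-up numbers: if salaries range from 20000 to 50000 and are scaled to [0, 1], a salary of 35000 maps to (35000 − 20000) / (50000 − 20000) = 0.5; if the attribute mean is 38000 with a standard deviation of 6000, the same salary has a z-score of (35000 − 38000) / 6000 = −0.5.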
Data Reduction
1. Reducing the number of attributes
Data cube aggregation: applying roll-up, slice or dice operations.
2. Reducing the number of attribute values (discretization):
If there are too many intervals, merge intervals with equal or similar class distributions.
Entropy (information)-based discretization.
3. Generating concept hierarchies: recursively applying a partitioning or discretization method.
Example:
Original Data:
Preprocessed Data:
Low: Salary<=20000
Medium: 20000<Salary<=40000
High: Salary>40000
2. Data Warehouse Schemas
A schema is a collection of database objects, including tables, views, indexes, and synonyms. There is
a variety of ways of arranging schema objects in the schema models designed for data warehousing.
Star Schema
The star schema is perhaps the simplest data warehouse schema. It is called a star schema because
the entity-relationship diagram of this schema resembles a star, with points radiating from a central
table. The centre of the star consists of a large fact table and the points of the star are the dimension
tables.
A star schema is characterized by one or more very large fact tables that contain the primary
information in the data warehouse, and a number of much smaller dimension tables (or lookup
tables), each of which contains information about the entries for a particular attribute in the fact
table.
A star query is a join between a fact table and a number of dimension tables. Each dimension table is
joined to the fact table using a primary key to foreign key join, but the dimension tables are not
joined to each other.
Star schemas are used for both simple data marts and very large data warehouses.
Snowflake Schema
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has
been grouped into multiple tables instead of one large table. While this saves space, it increases the
number of dimension tables and requires more foreign key joins. The result is more complex queries
and reduced query performance.
Figure: Snowflake schema example — the fact table Fact_Sales (Store_ID, Product_ID, Geography_ID, Units_Sold) is joined to Dim_Store (ID), Dim_Product (ID, Product_Name, Brand_ID) and Dim_Geography (ID, State, Country), with Dim_Brand (ID, Brand) normalized out of Dim_Product.
Fact Constellation
For each star schema it is possible to construct a fact constellation schema. The fact constellation
architecture contains multiple fact tables that share many dimension tables. This schema is more
complex than the star or snowflake architecture because it contains multiple fact tables, which
allows dimension tables to be shared amongst many fact tables. The main disadvantage of the fact
constellation schema is its more complicated design, because many variants of aggregation must be
considered.
In a fact constellation schema, the different fact tables are explicitly assigned to the dimensions that
are relevant for the given facts. This may be useful in cases when some facts are associated with a
given dimension level and other facts with a deeper dimension level.
3. OLAP (OnLine Analytical Processing) Cube
An OLAP (Online analytical processing) cube is a data structure that allows fast analysis of data. It
can also be defined as the capability of manipulating and analysing data from multiple perspectives.
The arrangement of data into cubes overcomes some limitations of relational databases.
OLAP cubes can be thought of as extensions to the two-dimensional array of a spreadsheet. For
example a company might wish to analyse some financial data by product, by time-period, by city, by
type of revenue and cost, and by comparing actual data with a budget. These additional methods of
analysing the data are known as dimensions. Because there can be more than three dimensions in
an OLAP system the term hypercube is sometimes used.
The OLAP cube consists of numeric facts called measures which are categorized by dimensions. The
cube metadata (structure) may be created from a star schema or snowflake schema of tables in a
relational database. Measures are derived from the records in the fact table and dimensions are
derived from the dimension tables.
OLAP Operations:
The analyst can understand the meaning contained in the databases using multi-dimensional
analysis. By aligning the data content with the analyst's mental model, the chances of confusion and
erroneous interpretations are reduced. The analyst can navigate through the database and screen
for a particular subset of the data, changing the data's orientations and defining analytical
calculations. The user-initiated process of navigating by calling for page displays interactively,
through the specification of slices via rotations and drill down/up is sometimes called "slice and
dice". Common operations include slice and dice, drill down, roll up, and pivot.
Slice: A slice is a subset of a multi-dimensional array corresponding to a single value for one or more
members of the dimensions not in the subset.
Dice: The dice operation is a slice on more than two dimensions of a data cube (or more than two
consecutive slices).
Drill Down/Up: Drilling down or up is a specific analytical technique whereby the user navigates
among levels of data ranging from the most summarized (up) to the most detailed (down).
Roll-up: A roll-up involves computing all of the data relationships for one or more dimensions. To do
this, a computational relationship or formula might be defined.
Pivot: This operation is also called rotate operation. It rotates the data in order to provide an
alternative presentation of data - the report or page display takes a different dimensional
orientation.
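As an illustration using the sales cube built in the example below (dimensions ITEMS, LOCATION and TIME): slicing at LOCATION = MUMBAI yields a two-dimensional ITEMS × TIME table for that city; dicing on LOCATION ∈ {MUMBAI, VIZAG} and TIME ∈ {Q1, Q2} yields a smaller sub-cube; rolling TIME up from quarters to the whole year sums the four quarterly values in each cell, and drilling down reverses that step; pivoting simply swaps, say, the ITEMS and LOCATION axes of the displayed page.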
Example:
Relational Data:
Supplier-1:
Supplier-2:
Figure: Three-dimensional sales cube for suppliers S1 and S2 — dimensions ITEMS (TV, DVD, Computer, Laptop, Refrigerator), LOCATION (Mumbai, Vijayawada, Vizag, Hyderabad) and TIME (quarters Q1–Q4).
Figure: ETL flow — Extract Data from OLTPs → Transform & Standardize Data → Import to OLAP Database → Build Cubes → Produce Reports
Example:
Various Representations of the data after Building Cube:
2. Pie Graph Representation
3. Tabular Representation
5. ETL (Extract, Transform & Load) in Oracle 10g
External Tables:
The external tables feature is a complement to the existing SQL*Loader functionality. It enables you
to access data in external sources as if it were in a table in the database.
Prior to Oracle Database 10g, external tables were read-only. However, as of Oracle Database 10g,
external tables can also be written to. Note that SQL*Loader may be the better choice in data
loading situations that require additional indexing of the staging table. To use the external tables
feature, you must have some knowledge of the file format and record format of the data files on
your platform if the ORACLE_LOADER access driver is used and the data files are in text format. You
must also know enough about SQL to be able to create an external table and perform queries
against it.
You can, for example, select, join, or sort external table data. You can also create views and
synonyms for external tables. However, no DML operations (UPDATE, INSERT, or DELETE) are
possible, and no indexes can be created, on external tables.
External tables provide a framework to unload the result of an arbitrary SELECT statement into a
platform-independent Oracle-proprietary format that can be used by Oracle Data Pump. External
tables provide a valuable means for performing basic extraction, transformation, and loading (ETL)
tasks that are common for data warehousing.
The means of defining the metadata for external tables is through the CREATE
TABLE...ORGANIZATION EXTERNAL statement. This external table definition can be thought of as a
view that allows running any SQL query against external data without requiring that the external
data first be loaded into the database. An access driver is the actual mechanism used to read the
external data in the table. When you use external tables to unload data, the metadata is
automatically created based on the data types in the SELECT statement.
Oracle Database provides two access drivers for external tables. The default access driver is
ORACLE_LOADER, which allows the reading of data from external files using the Oracle loader
technology. The ORACLE_LOADER access driver provides data mapping capabilities which are a
subset of the control file syntax of SQL*Loader utility. The second access driver,
ORACLE_DATAPUMP, lets you unload data—that is, read data from the database and insert it into an
external table, represented by one or more external files—and then reload it into an Oracle
Database.
External Table Restrictions:
The following are restrictions on external tables:
An external table does not describe any data that is stored in the database.
An external table does not describe how data is stored in the external source. This is the
function of the access parameters.
Virtual columns are not supported.
In this example, the data for the external table resides in a text file “colleges.dat”.
The contents of the file are:
cbit, 600
mgit, 450
ou, 3000
The following SQL creates an external table named colleges_exttable over that file; the data can then
be queried from it or loaded from the external table into a regular colleges table.
SQL> CREATE TABLE colleges_exttable
     ( college_name VARCHAR2(20),   -- column datatypes are illustrative
       intake       NUMBER )
     ORGANIZATION EXTERNAL
     ( TYPE ORACLE_LOADER
       DEFAULT DIRECTORY ext_dir    -- directory object for the folder holding colleges.dat (name assumed)
       ACCESS PARAMETERS
       ( RECORDS DELIMITED BY NEWLINE
         FIELDS TERMINATED BY ','
         ( college_name, intake ) )
       LOCATION ('colleges.dat') )
     PARALLEL
     REJECT LIMIT UNLIMITED;
Table Created.
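The rows of the flat file can then be read straight from the external table; the listing below is consistent with a simple query such as (shown here for illustration):
SQL> SELECT * FROM colleges_exttable;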
COLLEGE_NAME INTAKE
Cbit 600
Mgit 450
Ou 3000
The TYPE specification indicates the access driver of the external table. The access driver is the API
that interprets the external data for the database. If you omit the TYPE specification,
ORACLE_LOADER is the default access driver. You must specify the ORACLE_DATAPUMP access
driver if you specify the AS sub-query clause to unload data from one Oracle Database and reload it
into the same or a different Oracle Database.
The access parameters, specified in the ACCESS PARAMETERS clause, are opaque to the database.
These access parameters are defined by the access driver, and are provided to the access driver by
the database when the external table is accessed.
The PARALLEL clause enables parallel query on the data sources. The granule of parallelism is by
default a data source, but parallel access within a data source is implemented whenever possible.
Parallel access within a data source is provided by the access driver only if all of the following
conditions are met:
The media allows random positioning within a data source.
It is possible to find a record boundary from a random position.
The data files are large enough to make it worthwhile to break up into multiple chunks.
Note:
Specifying a PARALLEL clause is of value only when dealing with large amounts of data. Otherwise, it
is not advisable to specify a PARALLEL clause, and doing so can be detrimental.
The REJECT LIMIT UNLIMITED clause specifies that there is no limit on the number of errors that can
occur during a query of the external data. For parallel access, this limit applies to each parallel
execution server independently. For example, if REJECT LIMIT 10 is specified, each parallel query
process is allowed 10 rejections. Hence, the only precisely enforced values for REJECT LIMIT on
parallel query are 0 and UNLIMITED.
6. Data Pump: Import (impdp) and Export (expdp)
Oracle introduced Data Pump in Oracle Database 10g Release 1. This technology enables very
high-speed transfer of data from one database to another. Oracle Data Pump provides two utilities, namely:
Data Pump Export, which is invoked with the expdp command.
Data Pump Import, which is invoked with the impdp command.
The two utilities have a similar look and feel to the pre-Oracle 10g import and export utilities
(imp and exp) but are completely separate. This means that dump files generated by the original
export utility (exp) cannot be imported by the new Data Pump import utility (impdp), and vice versa.
Data Pump Export (expdp) and Data Pump Import (impdp) are server-based rather than client-based,
as is the case for the original export (exp) and import (imp). Because of this, dump files, log files, and
SQL files are accessed relative to server-based directory paths. Data Pump requires that directory
objects mapped to a file-system directory be specified in the invocation of the Data Pump import or
export.
You can invoke the data pump export or import using a command line. Export and Import parameters
can be specified directly in the command line or in a parameter (.par) file.
Example:
Table created.
1 row created.
1 row created.
NAME ROLLNO
-------------------- ----------
ABCD 1
WXYZ 2
If you want to export to a file, the first thing that you must do is create a database DIRECTORY
object for the output directory, and grant access to users who will be doing exports and imports:
SQL>CREATE DIRECTORY csecbit AS 'C:\8110';
Directory created.
Now, you can export a user's objects from the command line. The export parameters are specified
in a parameter (.par) file, as shown below; the command that launches the export follows the listing:
TABLES=new
DUMPFILE=csecbit:dumpfile.dmp
LOGFILE=csecbit:logfile.dmp
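With the parameter file saved (the job header in the log below shows it at C:\8110\parameter.par), the export is started from the operating-system prompt; a command of roughly the following form matches that job header (the password is masked exactly as Data Pump masks it in its log):
expdp system/********@orcl parfile=C:\8110\parameter.par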
Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 – Production With the
Partitioning, OLAP and Data Mining options
Starting "SYSTEM"."SYS_EXPORT_TABLE_01": system/********@orcl
parfile=C:\8110\parameter.par
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TBL_TABLE_DATA/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 64 KB
Processing object type TABLE_EXPORT/TABLE/TABLE
. . exported "SYSTEM"."NEW" 5.234 KB 2 rows
Master table "SYSTEM"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SYSTEM.SYS_EXPORT_TABLE_01 is:
C:\8110\DUMPFILE.DMP
Job "SYSTEM"."SYS_EXPORT_TABLE_01" successfully completed at 14:47
Now, drop the table “new” and import the previously exported dump file with Data Pump Import.
SQL> drop table new;
Table dropped.
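The import is then run against the same dump file; a command of roughly the following form matches the job header in the import log below:
impdp system/********@orcl parfile=C:\8110\parameter.par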
Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 – Production With the
Partitioning, OLAP and Data Mining options
Master table "SYSTEM"."SYS_IMPORT_TABLE_01" successfully loaded/unloaded
Starting "SYSTEM"."SYS_IMPORT_TABLE_01": system/********@orcl
parfile=C:\8110\parameter.par
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/TBL_TABLE_DATA/TABLE/TABLE_DATA
. . imported "SYSTEM"."NEW" 5.234 KB 2 rows
Job "SYSTEM"."SYS_IMPORT_TABLE_01" successfully completed at 14:51
NAME ROLLNO
-------------------- ----------
ABCD 1
WXYZ 2
7. Using Apriori technique, generate association rules
Association Rules:
Association Rules are used for discovering regularities between products in big transactional
databases. A transaction is an event involving one or more of the products (items) in the business or
domain; for example, buying of goods by a consumer in a supermarket is a transaction. A set of
items is usually referred to as an "itemset", and an itemset with "k" items is called a "k-itemset".
The general form of an association rule is X => Y, where X and Y are two disjoint itemsets. The
"support" of an itemset is the number of transactions that contain all the items of that itemset;
whereas the support of an association rule is the number of transactions that contain all items of
both X and Y. The "confidence" of an association rule is the ratio between its support and the
support of X.
A given association rule X => Y is considered significant and useful, if it has high support and
confidence values. The user will specify a threshold value for support and confidence, so that
different degrees of significance can be observed based on these threshold values.
Apriori is the best-known algorithm to mine association rules. It uses a breadth-first search strategy
to count the support of itemsets and uses a candidate generation function which exploits the
downward closure property of support. The Apriori algorithm relies on the principle "Every non-
empty subset of a large itemset must itself be a large itemset".
The algorithm attempts to find subsets which are common to at least a minimum number C of the
itemsets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a
time (a step known as candidate generation), and groups of candidates are tested against the data.
The algorithm terminates when no further successful extensions are found.
Apriori uses breadth-first search and a tree structure to count candidate item sets efficiently. It
generates candidate item sets of length k from item sets of length k − 1. Then it prunes the
candidates which have an infrequent sub pattern. According to the downward closure lemma, the
candidate set contains all frequent k-length item sets. After that, it scans the transaction database to
determine frequent item sets among the candidates.
Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have
spawned other algorithms. Candidate generation produces large numbers of subsets, and bottom-up
subset exploration finds any maximal subset S only after all 2^|S| − 1 of its proper subsets have been found.
Algorithm:
Find frequent itemsets using an iterative level-wise approach based on candidate generation.
Input:
    D, a database of transactions;
    min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.
Method:
    L1 = find_frequent_1-itemsets(D);
    for (k = 2; Lk−1 ≠ ∅; k++) {
        Ck = apriori_gen(Lk−1);
        for each transaction t ∈ D {      // scan D for counts
            Ct = subset(Ck, t);           // get the subsets of t that are candidates
            for each candidate c ∈ Ct
                c.count++;
        }
        Lk = {c ∈ Ck | c.count >= min_sup};
    }
    return L = ∪k Lk;
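A minimal C++ sketch of this level-wise loop is given below; it is an illustration rather than the exact apriori_gen procedure above. Transactions are hard-coded as the seven example transactions used later in this section, items are integer IDs, and the minimum support count is 3. Candidates of size k are formed by joining pairs of frequent (k − 1)-itemsets; the explicit subset-pruning step is omitted, and infrequent candidates are simply discarded by a support scan.
// Minimal Apriori sketch: transactions are sets of integer item IDs.
#include <iostream>
#include <map>
#include <set>
#include <vector>
#include <algorithm>

using Itemset = std::set<int>;

// Count how many transactions contain every item of the candidate.
static int support(const Itemset &c, const std::vector<Itemset> &db) {
    int count = 0;
    for (const auto &t : db)
        if (std::includes(t.begin(), t.end(), c.begin(), c.end())) ++count;
    return count;
}

int main() {
    std::vector<Itemset> db = {{1,2,3,4}, {1,2}, {2,3,4}, {2,3}, {1,2,4}, {3,4}, {2,4}};
    const int min_sup = 3;

    // L1: frequent 1-itemsets.
    std::map<int,int> item_count;
    for (const auto &t : db) for (int i : t) ++item_count[i];
    std::vector<Itemset> Lk;
    for (auto &p : item_count)
        if (p.second >= min_sup) Lk.push_back({p.first});

    std::vector<Itemset> all_frequent = Lk;
    for (int k = 2; !Lk.empty(); ++k) {
        // Candidate generation: union of two frequent (k-1)-itemsets that
        // differ in exactly one item gives a size-k candidate.
        std::set<Itemset> Ck;
        for (size_t i = 0; i < Lk.size(); ++i)
            for (size_t j = i + 1; j < Lk.size(); ++j) {
                Itemset u = Lk[i];
                u.insert(Lk[j].begin(), Lk[j].end());
                if ((int)u.size() == k) Ck.insert(u);
            }
        // Keep only the candidates that reach the minimum support count.
        std::vector<Itemset> next;
        for (const auto &c : Ck)
            if (support(c, db) >= min_sup) next.push_back(c);
        Lk = next;
        all_frequent.insert(all_frequent.end(), Lk.begin(), Lk.end());
    }

    for (const auto &s : all_frequent) {
        std::cout << "{ ";
        for (int i : s) std::cout << i << " ";
        std::cout << "} support=" << support(s, db) << "\n";
    }
    return 0;
}
Run as written, the sketch reports the same frequent itemsets as the hand-worked tables that follow: the four single items and the pairs {1,2}, {2,3}, {2,4} and {3,4}.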
Generating Association Rules from Frequent Itemsets:
Once the frequent itemsets from transactions in a database D have been found, it is straightforward
to generate strong association rules (where strong association rules satisfy both minimum support
and minimum confidence) from them. This can be done using the following equation for confidence:
confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)
Example:
Let the database of super-market transactions consist of the sets {1,2,3,4}, {1,2}, {2,3,4}, {2,3},
{1,2,4}, {3,4}, and {2,4}. Each number corresponds to a product such as "butter" or "water".
The first step of Apriori is to count up the frequencies, called the supports, of each member item
separately:
Item Support
1 3
2 6
3 4
4 5
We can define a minimum support level to qualify as "frequent," which depends on the context. For
this case, let min support = 3. Therefore, all are frequent.
The next step is to generate a list of all 2-item pairs of the frequent items. Had any of the above items
not been frequent, it would not have been included as a possible member of a 2-item pair; in this
way, Apriori prunes the tree of all possible itemsets. In the next step we again keep only those pairs
which are frequent:
Item Support
{1,2} 3
{2,3} 3
{2,4} 4
{3,4} 3
We then generate a list of all 3-item triples of the frequent items (by extending a frequent pair with a
frequent single item) and count their support:
Item Support
{1,2,3} 1
{1,2,4} 2
{2,3,4} 2
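Since none of the 3-itemsets reaches the minimum support of 3, the frequent itemsets in this example are the four single items and the four pairs above. Strong rules are then generated from the frequent pairs using the confidence formula. For instance, for the pair {2,4}: confidence(2 ⇒ 4) = support({2,4}) / support({2}) = 4/6 ≈ 67%, while confidence(4 ⇒ 2) = support({2,4}) / support({4}) = 4/5 = 80%. With an illustrative minimum confidence of 75%, only 4 ⇒ 2 would be reported as a strong rule from this pair.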
Example:
Opening the .csv file in WEKA Explorer:
8. Decision tree classification using the WEKA tool
Classification is a data mining technique used to predict group membership for data instances.
Classification is the task of generalizing known structure to apply to new data. For example, an email
program might attempt to classify an email as legitimate or spam. Common algorithms include
decision tree learning, nearest neighbour, naive Bayesian classification, neural networks and support
vector machines.
Data classification is a two-step process. In the first step, a classifier is built describing a
predetermined set of data classes or concepts. This is the learning step (or training phase), where a
classification algorithm builds the classifier by analysing or “learning from” a training set made up of
database tuples and their associated class labels. The class label attribute is discrete-valued and
unordered. It is categorical in that each value serves as a category or class. The individual tuples
making up the training set are referred to as training tuples and are selected from the database
under analysis. In the context of classification, data tuples can be referred to as samples, examples,
instances, data points, or objects. Because the class label of each training tuple is provided, this step
is also known as supervised learning. Finally, the classifier is represented in the form of classification
rules.
The second step is known as Classification and in this step, test data is used to estimate the accuracy
of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the
classification of new data tuples.
Example:
Opening the .csv file in WEKA Explorer:
Classifying using J48 Classification selected from the “Trees” folder in the “Choose” menu:
Visualizing the resultant Tree by right clicking on the option in the “Result list” & selecting
“Visualize Tree”:
Clustering is also called data segmentation in some applications because clustering partitions large
data sets into groups according to their similarity.
The most well-known and commonly used partitioning methods are k-means, k-medoids and their
variations.
The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which
initially represents a cluster mean or centre. For each of the remaining objects, an object is assigned
to the cluster to which it is the most similar, based on the distance between the object and the
cluster mean. It then computes the new mean for each cluster. This process iterates until the
criterion function converges. Typically, the square-error criterion is used, defined as
E = Σ (i = 1..k) Σ (p ∈ Ci) |p − mi|²,
where E is the sum of the square error for all objects in the data set; p is the point in space
representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional). In
other words, for each object in each cluster, the distance from the object to its cluster centre is
squared, and the distances are summed. This criterion tries to make the resulting k clusters as
compact and as separate as possible.
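For instance, if one cluster holds the one-dimensional points {2, 4, 6}, its mean is 4 and its contribution to E is (2 − 4)² + (4 − 4)² + (6 − 4)² = 8; each reassignment and mean update reduces (or leaves unchanged) the total of these contributions over all k clusters.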
Example:
Attribute Data: Supplied as Comma Separated (.csv) file
Outlook temperature humidity windy play
Sunny 85 85 FALSE no
Sunny 80 90 TRUE no
Overcast 83 86 FALSE yes
Rainy 70 96 FALSE yes
Rainy 68 80 FALSE yes
Rainy 65 70 TRUE no
Overcast 64 65 TRUE yes
Sunny 72 95 FALSE no
Sunny 69 70 FALSE yes
Rainy 75 80 FALSE yes
Sunny 75 70 TRUE yes
Overcast 72 90 TRUE yes
Overcast 81 75 FALSE yes
Rainy 71 91 TRUE no
Clustering using Simple K Means selected from the “Choose” menu in the “Cluster” tab:
Visualizing by right clicking on the option in the “Result list” & selecting “Visualize”:
Final Output:
Instance_number outlook temperature humidity windy play Cluster
0 sunny 85 85 FALSE no cluster2
1 sunny 80 90 TRUE no cluster2
2 overcast 83 86 FALSE yes cluster0
3 rainy 70 96 FALSE yes cluster0
4 rainy 68 80 FALSE yes cluster0
5 rainy 65 70 TRUE no cluster2
6 overcast 64 65 TRUE yes cluster1
7 sunny 72 95 FALSE no cluster2
8 sunny 69 70 FALSE yes cluster0
9 rainy 75 80 FALSE yes cluster0
10 sunny 75 70 TRUE yes cluster1
11 overcast 72 90 TRUE yes cluster1
12 overcast 81 75 FALSE yes cluster0
13 rainy 71 91 TRUE no cluster2
/* output:
enter how many elements
6
enter the x set& y set
1
8
2
13
3
18
4
23
5
28
6
33
sum_x:21.0sum_y:123.0
mean_x:3.5,mean_y:20.5
num:87.5,din:17.5
b:5.0,a:3.0
enter the x value
2
corresponding y is:13.0*/
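The run above comes from the single-variable linear regression program, which fits y = a + b·x by least squares; with the sample data it finds b = 5 and a = 3, so x = 2 predicts y = 13. The listing below extends the same idea to two predictors, fitting y = a + b1·x1 + b2·x2, with each slope estimated by a separate one-variable least-squares fit.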
#include<iostream.h>
#include<conio.h>
void main()
{
int n;
cout<<"enter how many elements";
cin>>n;
double mean_x1,mean_x2,mean_y;
double sum_x1,sum_x2,sum_y;
double num1,din1,num2,din2;
num1=0.0;din1=0.0;
num2=0.0;din2=0.0;
sum_x1=0;
sum_x2=0;
sum_y=0;
int x1[20];
int x2[20];
int y[20];
cout<<"enter the x1,x2 sets& y set";
for(int i=0;i<n;i++)
{
cin>>x1[i];
sum_x1=sum_x1+x1[i];
cin>>x2[i];
sum_x2=sum_x2+x2[i];
cin>>y[i];
sum_y=sum_y+y[i];
}
cout<<"sum_x1:"<<sum_x1<<"sum_y:"<<sum_y;
mean_x1=sum_x1/n;
mean_x2=sum_x2/n;
mean_y=sum_y/n;
cout<<"mean_x1:"<<mean_x1<<",mean_x2:"<<mean_x2<<",mean_y:"<<mean_y;
for(i=0;i<n;i++)
{
num1=num1+((x1[i]-mean_x1)*(y[i]-mean_y));
din1= din1+((x1[i]-mean_x1)*(x1[i]-mean_x1));
num2=num2+((x2[i]-mean_x2)*(y[i]-mean_y));
din2= din2+((x2[i]-mean_x2)*(x2[i]-mean_x2));
}
cout<<"num1:"<<num1<<",din1:"<<din1;
cout<<"num1:"<<num1<<",din1:"<<din1;
double b1=num1/din1;
double b2=num2/din2; double a=mean_y-(b1*mean_x1)-(b2*mean_x2);
cout<<"b1:"<<b1<<",b2:"<<b2<<"a:"<<a;
cout<<"enter the x1,x2 values";
int c1,c2;
cin>>c1;
cin>>c2;
double r=a+b1*c1+b2*c2;
cout<<endl<<"corresponding y is:"<<r<<endl;
getch();
}
/*output:
enter how many elements
6
enter the x1,x2 sets& y set
1
1
11
2
2
19
3
3
27
4
4
35
5
5
43
6
6
51
sum_x1:21.0sum_y:186.0
mean_x1:3.5,mean_x2:3.5,mean_y:31.0
num1:140.0,din1:17.5
num2:140.0,din2:17.5
b1:8.0,b2:8.0a:-25.0
enter the x1,x2 values
7
7
corresponding y is:87.0*/
K-Means Schema
Aim:
Understand and write a program to implement K-Means partitioning algorithm
Theory:
The k-means algorithm is an algorithm to cluster n objects based on attributes into k
partitions, k < n. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in
that they both attempt to find the centers of natural clusters in the data. It assumes that the object
attributes form a vector space. The objective it tries to achieve is to minimize the total intra-cluster
variance, or the squared-error function J = Σ (i = 1..k) Σ (xj ∈ Si) ||xj − µi||²,
where there are k clusters Si, i = 1, 2, ..., k, and µi is the centroid or mean point of all the points xj ∈
Si.
The most common form of the algorithm uses an iterative refinement heuristic known as
Lloyd's algorithm. Lloyd's algorithm starts by partitioning the input points into k initial sets, either at
random or using some heuristic. It then calculates the mean point, or centroid, of each set and
constructs a new partition by associating each point with the closest centroid. The centroids are then
recalculated for the new clusters, and the algorithm repeats these two steps until convergence, which
is obtained when the points no longer switch clusters (or, alternatively, the centroids no longer
change).
Lloyd's algorithm and k-means are often used synonymously, but in reality Lloyd's algorithm
is a heuristic for solving the k-means problem; with certain combinations of starting points and
centroids it can converge to the wrong answer (i.e. a different answer, better with respect to the
minimization function above, exists). Other variations exist, but Lloyd's algorithm has remained
popular because it converges extremely quickly in practice; in fact, many have observed that the
number of iterations is typically much less than the number of points. Recently, however, David
Arthur and Sergei Vassilvitskii showed that there exist certain point sets on which k-means takes
superpolynomial time, 2^Ω(√n), to converge. Approximate k-means algorithms have been designed
that make use of coresets: small subsets of the original data.
Program:
//K-means Partitioning (one-dimensional objects)
#include<stdio.h>
#include<stdlib.h>
#include<conio.h>
void main()
{
int obj[20],label[20],mean[20],sum[20],count[20];
int n,nc,i,k,changed,best,dist,bestdist;
clrscr();
printf("\n Enter the No. of Items: ");
scanf("%d",&n);
printf("\n Enter the %d items: ",n);
for(i=0;i<n;i++)
scanf("%d",&obj[i]);
printf("\n Enter the No. of clusters: ");
scanf("%d",&nc);
/* take the first nc objects as the initial cluster means */
for(k=0;k<nc;k++)
mean[k]=obj[k];
for(i=0;i<n;i++)
label[i]=0;
do
{
changed=0;
/* assignment step: attach each object to its nearest cluster mean */
for(i=0;i<n;i++)
{
best=0;
bestdist=abs(obj[i]-mean[0]);
for(k=1;k<nc;k++)
{
dist=abs(obj[i]-mean[k]);
if(dist<bestdist)
{
bestdist=dist;
best=k;
}
}
if(label[i]!=best)
{
label[i]=best;
changed=1;
}
}
/* update step: recompute the mean of every cluster */
for(k=0;k<nc;k++)
{
sum[k]=0;
count[k]=0;
}
for(i=0;i<n;i++)
{
sum[label[i]]+=obj[i];
count[label[i]]++;
}
for(k=0;k<nc;k++)
if(count[k]>0)
mean[k]=sum[k]/count[k];
}while(changed); /* stop when no object switches cluster */
for(k=0;k<nc;k++)
{
printf("\n Cluster %d (mean %d): ",k+1,mean[k]);
for(i=0;i<n;i++)
if(label[i]==k)
printf("%d, ",obj[i]);
}
getch();
}
Output:
Enter the No. of Items: 10
Enter the 10 items: 1 2 5 7 9 10 14 17 20 25
Enter the No. of clusters: 3
Cluster 1 (mean 1): 1, 2,
Cluster 2 (mean 7): 5, 7, 9, 10,
Cluster 3 (mean 19): 14, 17, 20, 25,
K-Medoids Schema
Aim:
Understand and write a program to implement the K-Medoids partitioning algorithm
Theory:
The K-medoids algorithm is a clustering algorithm related to the K-means algorithm. Both
algorithms are partitional (breaking the dataset up into groups) and both attempt to minimize
squared error, the distance between points labeled to be in a cluster and a point designated as the
center of that cluster. In contrast to the K-means algorithm K-medoids chooses datapoints as centers
(medoids or exemplars).
K-medoids is a classical partitioning technique of clustering that clusters a data set of n
objects into k clusters known a priori. It is more robust to noise and outliers than k-means.
A medoid can be defined as that object of a cluster whose average dissimilarity to all the
objects in the cluster is minimal, i.e., it is the most centrally located point in the given data set.
The k-medoids clustering algorithm is as follows:
1) The algorithm begins with an arbitrary selection of k objects as medoid points out of the n data
points (n > k).
2) After selecting the k medoid points, associate each data object in the given data set with the most
similar medoid. The similarity here is defined using a distance measure, which can be Euclidean,
Manhattan or Minkowski distance.
3) Randomly select a non-medoid object O′.
4) Compute the total cost S of swapping an initial medoid object with O′ (see the small numeric
illustration after this list).
5) If S < 0, then swap the initial medoid with the new one (there will be a new set of medoids).
6) Repeat steps 2 to 5 until there is no change in the medoids.
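For example, if the total distance of all objects to their nearest medoid is 20 with the current medoid set and would be 18 after replacing one medoid with the candidate O′, then S = 18 − 20 = −2 < 0 and the swap is accepted; if the new total were 23 instead, S = +3 and the current medoids are kept. (The numbers are purely illustrative.)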
Program:
// K-medoids Partitioning: compare the current values xi with a candidate set xj
#include<stdio.h>
#include<conio.h>
void main()
{
int xi[10],xj[10];
int i,n,dij=0,tc[20];
clrscr();
printf("\n Enter n value:");
scanf("%d",&n);
printf("Enter %d numbers into xi",n);
for(i=0;i<n;i++)
scanf("%d",&xi[i]);
printf("Enter %d numbers into xj",n);
for(i=0;i<n;i++)
scanf("%d",&xj[i]);
/* dij: total Manhattan distance between the xi and xj values */
for(i=0;i<n;i++)
{
if(xi[i]>xj[i])
dij+=xi[i]-xj[i];
else
dij+=xj[i]-xi[i];
}
/* element-wise comparison: where the candidate value is larger, accept it into xi */
for(i=0;i<n;i++)
{
tc[i]=xi[i]-xj[i];
if(tc[i]<0)
xi[i]=xj[i];
}
printf("\nElements of xi:\n");
for(i=0;i<n;i++)
printf("%d ",xi[i]);
printf("\nElements of xj:\n");
for(i=0;i<n;i++)
printf("%d ",xj[i]);
getch();
}
Output:
Enter n value: 4
Enter 4 numbers into xi
1
2
3
4
Enter 4 numbers into xj
5
4
3
2
Elements of xi:
5 4 3 4
Elements of xj:
5 4 3 2
Government organizations are analyzing current and historic data to identify useful patterns from their
large databases so that they can support their business strategy. Their main emphasis is on complex,
interactive, exploratory analysis of very large datasets created by the integration of data from across
all parts of the organization, and that data is fairly static. Three complementary trends are there.
Governments deal with enormous amounts of data. In order that such data is put to effective
use in facilitating decision-making, a data warehouse is constructed over the historical data.
It permits several types of queries requiring complex analysis of data to be addressed by decision-
makers. In spite of taking many initiatives for computerization, government decision makers
currently have difficulty obtaining meaningful information in a timely manner, because they have
to request and depend on IT staff for special reports, which often take a long time to
generate. An Information Warehouse can deliver strategic intelligence to the decision makers and
provide an insight into the overall situation. By organizing person- and land-related data into a
meaningful Information Warehouse, government decision makers can be empowered with a
flexible tool that enables them to make informed policy decisions for citizen facilitation and to
assess their impact on the intended section of the population.
It is well known that in an Information Technology (IT) driven society, knowledge is one of the
most significant assets of any organization. Knowledge discovery in databases is a well-defined process
consisting of several distinct steps. Data mining is the core step, which results in the discovery of
hidden but useful knowledge from massive databases. A formal definition of knowledge discovery in
databases is given as: “Data mining is the non-trivial extraction of implicit, previously unknown and
potentially useful information about data”. Data mining technology provides a user-oriented
approach to novel and hidden patterns in the data. The discovered knowledge can be used by E-
governance administrators to improve the quality of service. Traditionally, decision making in E-
governance is based on ground information, lessons learnt in the past, and resource and fund
constraints. However, data mining techniques and knowledge management technology can be
applied to create a knowledge-rich environment. An organization may implement Knowledge
Discovery in Databases (KDD) with the help of a skilled employee who has a good understanding of
the organization. KDD can be effective at working with large volumes of data to determine meaningful
patterns and to develop strategic solutions. Analysts and policy makers can also learn lessons from the
use of KDD in other industries. E-governance data is massive: it includes centric data, resource
management data and transformed data, and E-governance organizations must have the ability to
analyze it. Treatment records of millions of patients can be stored and computerized, and data mining
techniques may help in answering several important and critical questions related to the organization.
Data mining is an essential step of knowledge discovery. In recent years it has attracted a
great deal of interest in the information industry. The knowledge discovery process consists of an
iterative sequence of data cleaning, data integration, data selection, data mining, pattern recognition
and knowledge presentation. In particular, data mining may accomplish class description, association,
classification, clustering, prediction and time-series analysis. Data mining, in contrast to traditional
data analysis, is discovery driven. Data mining is a young interdisciplinary field closely connected to
data warehousing, statistics, machine learning, neural networks and inductive logic programming.
Data mining provides automatic pattern recognition and attempts to uncover patterns in data that
are difficult to detect with traditional statistical methods. Without data mining it is difficult to realize
the full potential of data collected within a healthcare organization, as the data under analysis is
massive, highly dimensional, distributed and uncertain.
Figure: Measure — analyze the effectiveness of the discovered knowledge.
For a government organization to succeed, it must have the ability to capture, store and analyze data.
Online analytical processing (OLAP) provides one way for data to be analyzed in a multi-dimensional
capacity. With the adoption of data warehousing and data analysis/OLAP tools, an organization can
make strides in leveraging data for better decision making. Many organizations struggle with the
utilization of data collected through the organization's online transaction processing (OLTP) system,
because it is not integrated for decision making and pattern analysis. For a successful E-governance
organization it is important to empower the management and staff with data warehousing based on
critical thinking and knowledge management tools for strategic decision making. Data warehousing
can be supported by decision-support tools such as data marts, OLAP and data mining tools. A data
mart is a subset of a data warehouse that focuses on selected subjects. An online analytical processing
(OLAP) solution provides a multi-dimensional view of the data found in relational databases. With the
data stored in two-dimensional (relational) format, OLAP makes it possible to analyze potentially large
amounts of data with very fast response times, and provides the ability for users to go through the
data and drill down or roll up through the various dimensions defined by the data structure.
Traditional manual data analysis has become insufficient, and methods for efficient computer-assisted
analysis are indispensable. A data warehouse is a semantically consistent data store that serves as a
physical implementation of a decision-support data model and stores the information on which an
enterprise needs to make strategic decisions. A data warehouse is also often viewed as an architecture,
constructed by integrating data from multiple heterogeneous sources, to support structured and/or
ad-hoc queries, analytical reporting and decision making.
Data mining efforts associated with the Web, called Web mining, can be broadly divided into three
classes, i.e. content mining, usage mining, and structure mining.
Web usage mining is the application of data mining techniques to discover usage patterns from Web
data, in order to understand and better serve the needs of Web-based applications. Web usage
mining consists of three phases, namely pre-processing, pattern discovery, and pattern analysis.
Figure: Sample Web Server Log file
There are many kinds of data that can be used in Web Usage Mining. They can be classified as
follows:
Content: The real data in the Web pages, i.e. the data the Web page was designed to convey
to the users.
Structure: Data which describes the organization of the content. Intra-page structure
information includes the arrangement of various HTML or XML tags within a given page. The
principal kind of inter-page structure information is hyper-links connecting one page to
another.
Usage: Data that describes the pattern of usage of Web pages, such as IP addresses, page
references, and the date and time of accesses.
User Profile: Data that provides demographic information about users of the Web site. This
includes registration data and customer profile information.
Main Tasks in Web Usage Mining:
1. Preprocessing
Preprocessing consists of converting the usage, content, and structure information contained in the
various available data sources into the data abstractions necessary for pattern discovery.
Assuming each user has now been identified (through cookies, logins, or IP/agent/path analysis), the
click-stream for each user must be divided into sessions. Since page requests from other servers are
not typically available, it is difficult to know when a user has left a Web site. A thirty minute timeout
is often used as the default method of breaking a user's click-stream into sessions. When a session ID
is embedded in each URL, the definition of a session is set by the content server.
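As a rough illustration of this timeout-based sessionization (a sketch with hypothetical, simplified log records, not a parser for any particular log format), the following C++ program splits one user's time-ordered click-stream into sessions whenever the gap between consecutive requests exceeds 30 minutes:
// Sketch: break a user's time-ordered click-stream into sessions
// using a 30-minute inactivity timeout. Timestamps are UNIX seconds.
#include <iostream>
#include <string>
#include <vector>

struct Request {            // hypothetical, simplified log record
    long timestamp;         // seconds since epoch
    std::string url;
};

int main() {
    const long TIMEOUT = 30 * 60;   // 30 minutes, the common default
    std::vector<Request> clicks = { // one user's requests, already sorted by time
        {1000, "/index.html"}, {1400, "/products.html"},
        {1700, "/cart.html"},  {9000, "/index.html"}, {9500, "/contact.html"}
    };

    int session = 1;
    std::cout << "Session " << session << ":\n";
    for (size_t i = 0; i < clicks.size(); ++i) {
        if (i > 0 && clicks[i].timestamp - clicks[i - 1].timestamp > TIMEOUT) {
            ++session;                      // long gap -> start a new session
            std::cout << "Session " << session << ":\n";
        }
        std::cout << "  " << clicks[i].url << "\n";
    }
    return 0;
}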
1.2 Content Preprocessing
Content preprocessing consists of converting the text, image, scripts, and other files such as
multimedia into forms that are useful for the Web Usage Mining process. Often, this consists of
performing content mining such as classification or clustering. While applying data mining to the
content of Web sites is an interesting area of research in its own right, in the context of Web Usage
Mining the content of a site can be used to filter the input to, or output from the pattern discovery
algorithms. In addition to classifying or clustering page views based on topics, page views can also be
classified according to their intended use. Page views can be intended to convey information, gather
information from the user, allow navigation, or some combination of these uses. The intended use of a page
view can also filter the sessions before or after pattern discovery.
2. Pattern Discovery
Pattern discovery draws upon methods and algorithms developed from several fields such as
statistics, data mining, machine learning and pattern recognition.
2.3 Clustering
Clustering is a technique to group together a set of items having similar characteristics. In the Web
Usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page
clusters.
Clustering of users tends to establish groups of users exhibiting similar browsing patterns.
On the other hand, clustering of pages will discover groups of pages having related content.
2.4 Classification
Classification is the task of mapping a data item into one of several predefined classes. In the Web
domain, one is interested in developing a profile of users belonging to a particular class or category.
This requires extraction and selection of features that best describe the properties of a given class or
category. Classification can be done by using supervised inductive learning algorithms such as
decision tree classifiers, naive Bayesian classifiers, k-nearest neighbour classifiers, Support Vector
Machines etc.
3. Pattern Analysis
Pattern analysis is the last step in the overall Web Usage mining process. The motivation behind
pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern
discovery phase. The exact analysis methodology is usually governed by the application for which
Web mining is done. The most common form of pattern analysis consists of a knowledge query
mechanism such as SQL. Another method is to load usage data into a data cube in order to perform
OLAP operations. Content and structure information can be used to filter out patterns containing
pages of a certain usage type, content type, or pages that match a certain hyperlink structure.
Figure: Web Usage Mining Process