
1. Data Pre-processing

Data preprocessing describes any type of processing performed on raw data to prepare it for
another processing procedure. Commonly used as a preliminary data mining practice, data
preprocessing transforms the data into a format that will be more easily and effectively processed
for the purpose of the user.

Why Preprocessing?

Real-world data are generally:


 Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only
aggregate data
 Noisy: containing errors or outliers
 Inconsistent: containing discrepancies in codes or names

Tasks in data preprocessing


 Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies.
 Data integration: using multiple databases, data cubes, or files.
 Data transformation: normalization and aggregation.
 Data reduction: reducing the volume but producing the same or similar analytical results.
 Data discretization: part of data reduction, replacing numerical attributes with nominal
ones.

Data Cleaning
1. Fill in missing values (attribute or class value):
 Ignore the tuple: usually done when class label is missing.

 Use the attribute mean (or majority nominal value) to fill in the missing value.

 Use the attribute mean (or majority nominal value) for all samples belonging to the
same class.

 Predict the missing value by using a learning algorithm: consider the attribute with
the missing value as a dependent (class) variable and run a learning algorithm
(usually Bayes or decision tree) to predict the missing value.

2. Identify outliers and smooth out noisy data:

 Binning

 Sort the attribute values and partition them into bins (see "Unsupervised
discretization" below);

 Then smooth by bin means, bin medians, or bin boundaries (a short code sketch of this
step appears at the end of this section).

 Clustering: group values in clusters and then detect and remove outliers (automatic
or manual)

 Regression: smooth by fitting the data into regression functions.

3. Correct inconsistent data: use domain knowledge or expert decision.
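The binning step referred to above can be made concrete with a small sketch. The following C++ fragment is illustrative only (it is not part of the original manual); the attribute values and the number of bins are invented for the example. It partitions the values into equal-width bins and replaces each bin by its mean.

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical attribute values and bin count (not taken from the manual).
    std::vector<double> values = {4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34};
    int k = 3;                                   // number of equal-width bins

    double lo = *std::min_element(values.begin(), values.end());
    double hi = *std::max_element(values.begin(), values.end());
    double width = (hi - lo) / k;

    // Partition the values into bins.
    std::vector<std::vector<double>> bins(k);
    for (double v : values) {
        int b = std::min(k - 1, static_cast<int>((v - lo) / width));
        bins[b].push_back(v);
    }

    // Smooth each bin by its mean.
    for (int b = 0; b < k; ++b) {
        if (bins[b].empty()) continue;
        double sum = 0;
        for (double v : bins[b]) sum += v;
        std::cout << "bin " << b << " [" << lo + b * width << ", " << lo + (b + 1) * width
                  << "]: smoothed value = " << sum / bins[b].size() << '\n';
    }
    return 0;
}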

Data Transformation
1. Normalization (sketched in code after this list):
 Min-Max: Scaling attribute values to fall within a specified range.

 Example: to transform V in [Min, Max] to V' in [0, 1], apply V' = (V - Min) / (Max - Min).

 Z-Scale: Scaling by using the mean and standard deviation (useful when Min and Max are
unknown or when there are outliers): V' = (V - Mean) / StdDev

 Decimal Scale: Scaling by using the nth power of 10: V' = V / 10^n

2. Aggregation: moving up in the concept hierarchy on numeric attributes.

3. Generalization: moving up in the concept hierarchy on nominal attributes.

4. Attribute construction: replacing or adding new attributes inferred by existing attributes.
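As a rough illustration of the three normalization schemes in item 1 above (a sketch added here, not part of the original manual; the sample salary values are invented):

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

int main() {
    // Hypothetical salary values used only to demonstrate the three scalings.
    std::vector<double> salary = {13000, 20000, 34000, 58000, 11000};

    double lo = salary[0], hi = salary[0], sum = 0;
    for (double v : salary) {
        lo = std::min(lo, v);
        hi = std::max(hi, v);
        sum += v;
    }
    double mean = sum / salary.size();

    double sq = 0;
    for (double v : salary) sq += (v - mean) * (v - mean);
    double stddev = std::sqrt(sq / salary.size());

    // Decimal scaling: smallest power of 10 that brings every value below 1.
    double scale = std::pow(10.0, std::ceil(std::log10(hi)));

    for (double v : salary) {
        std::cout << v << "  min-max: " << (v - lo) / (hi - lo)
                  << "  z-scale: " << (v - mean) / stddev
                  << "  decimal: " << v / scale << '\n';
    }
    return 0;
}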

Data Reduction
1. Reducing the number of attributes
 Data cube aggregation: applying roll-up, slice or dice operations.

 Removing irrelevant attributes: attribute selection (filtering and wrapper methods),
searching the attribute space (see Lecture 5: Attribute-oriented analysis).

 Principal component analysis (numeric attributes only): searching for a lower-
dimensional space that can best represent the data.

2. Reducing the number of attribute values

 Binning (histograms): reducing the number of attribute values by grouping them into
intervals (bins).
 Clustering: grouping values in clusters.
 Aggregation or generalization
3. Reducing the number of tuples
 Sampling

Discretization and generating Concept Hierarchies


1. Unsupervised discretization - class variable is not used.
 Equal-interval (equiwidth) binning: split the whole range of numbers in intervals
with equal size.
 Equal-frequency (equidepth) binning: use intervals containing equal number of
values.
2. Supervised discretization - uses the values of the class variable.
 Using class boundaries. Three steps:
 Sort values.
 Place breakpoints between values belonging to different classes.

 If too many intervals, merge intervals with equal or similar class
distributions.
 Entropy (information)-based discretization (the splitting criterion is sketched after this list).
3. Generating concept hierarchies: recursively applying a partitioning or discretization method
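For reference, the entropy-based criterion usually takes the following standard form (this formula is not spelled out in the original notes): a candidate boundary T splits a value set S into S1 and S2, and the boundary minimizing the weighted class entropy is chosen,

E(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Ent}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Ent}(S_2),
\qquad
\mathrm{Ent}(S) = -\sum_{i=1}^{m} p_i \log_2 p_i

where p_i is the proportion of class i in S.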

Example:

Original Data:

Emp_ID Emp_Name Date of Birth Designation Age Salary


100011 surya 05-02-1989 developer 22 13000
100012 sukanya 05-03-1987 developer 24 20000
100013 suma 05-04-1985 developer 26 34000
100014 anusha 05-02-1986 developer 25 58000
100015 vikram 05-03-1983 developer 28 11000
100016 sateesh 05-04-1987 developer 24 14000
100017 mahesh 08-04-1984 developer 27 25000
100018 priya 09-04-1981 admin 30 36000
100019 mahi 10-04-1978 admin 33 32000
100020 vikram 08-04-1986 admin 25 31000
100021 sateesh 09-04-1983 admin 28 22000
100022 mahesh 10-04-1980 admin 31 16500
100023 sateesh 09-04-1977 admin 34 17800
100024 mahesh 10-04-1989 admin 22 41000
100025 priya 08-04-1981 admin 30 51000
100026 mahi 09-04-1978 admin 33 31000
100027 sateesh 10-04-1986 admin 25 14678
100028 mahesh 09-04-1983 admin 28 45000
100029 priya 05-03-1980 admin 31 53000
100030 mahi 05-04-1977 admin 34 55000
100031 vikram 05-02-1989 admin 22 19500

Preprocessed Data:

(Columns: Age after binning (n = 5), smoothed by bin mean and by bin boundary; a concept-hierarchy
label for Salary; and Salary transformed by Min-Max, Z-Scale and Decimal scaling.)

Emp_ID  Emp_Name  Date of Birth  Age (Bin Mean)  Age (Bin Boundary)  Concept Hierarchy (Salary)  Min-Max  Z-Scale  Decimal
100011 surya 05-02-1989 25 22 Low 0.04 -1.16 0.13
100012 sukanya 05-03-1987 25 22 Low 0.19 -0.70 0.20
100013 suma 05-04-1985 25 28 Average 0.49 0.23 0.34
100014 anusha 05-02-1986 25 22 High 1.00 1.82 0.58
100015 vikram 05-03-1983 25 28 Low 0.00 -1.29 0.11
100016 sateesh 05-04-1987 28 24 Low 0.06 -1.09 0.14
100017 mahesh 08-04-1984 28 24 Average 0.30 -0.36 0.25
100018 priya 09-04-1981 28 33 Average 0.53 0.36 0.36
100019 mahi 10-04-1978 28 33 Average 0.45 0.10 0.32
100020 vikram 08-04-1986 28 24 Average 0.43 0.03 0.31
100021 sateesh 09-04-1983 29 22 Average 0.23 -0.56 0.22
100022 mahesh 10-04-1980 29 34 Low 0.12 -0.93 0.17
100023 sateesh 09-04-1977 29 34 Low 0.14 -0.84 0.18
100024 mahesh 10-04-1989 29 22 High 0.64 0.70 0.41
100025 priya 08-04-1981 29 34 High 0.85 1.36 0.51
100026 mahi 09-04-1978 29 34 Average 0.43 0.03 0.31
100027 sateesh 10-04-1986 29 22 Low 0.08 -1.05 0.15
100028 mahesh 09-04-1983 29 22 High 0.72 0.96 0.45
100029 priya 05-03-1980 29 34 High 0.89 1.49 0.53
100030 mahi 05-04-1977 29 34 High 0.94 1.62 0.55
100031 vikram 05-02-1989 29 22 Low 0.18 -0.73 0.20

Low: Salary <= 20000
Average: 20000 < Salary <= 40000
High: Salary > 40000
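To see how the first row of the table above was obtained (a worked example added here for clarity; the mean and standard deviation are approximate values inferred from the table), take the salary 13000 with Min = 11000, Max = 58000, Mean ≈ 30500 and StdDev ≈ 15100:

V'_{min-max} = (13000 - 11000) / (58000 - 11000) ≈ 0.04
V'_{z-scale} = (13000 - 30500) / 15100 ≈ -1.16
V'_{decimal} = 13000 / 10^5 = 0.13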

2. Data Warehouse Schemas

A data warehouse is an organized collection of large amounts of structured data. It is a database


designed and intended to support decision making in organizations.

A schema is a collection of database objects, including tables, views, indexes, and synonyms. There is
a variety of ways of arranging schema objects in the schema models designed for data warehousing.

Star Schema
The star schema is perhaps the simplest data warehouse schema. It is called a star schema because
the entity-relationship diagram of this schema resembles a star, with points radiating from a central
table. The centre of the star consists of a large fact table and the points of the star are the dimension
tables.

A star schema is characterized by one or more very large fact tables that contain the primary
information in the data warehouse, and a number of much smaller dimension tables (or lookup
tables), each of which contains information about the entries for a particular attribute in the fact
table.

A star query is a join between a fact table and a number of dimension tables. Each dimension table is
joined to the fact table using a primary key to foreign key join, but the dimension tables are not
joined to each other.

The main advantages of star schemas are that they:


1. Provide a direct and intuitive mapping between the business entities being analysed by end
users and the schema design.
2. Provide highly optimized performance for typical star queries.
3. Are widely supported by a large number of business intelligence tools, which may anticipate
or even require that the data-warehouse schema contain dimension tables.

Star schemas are used for both simple data marts and very large data warehouses.

Example: Star Schema for a Sales Data Warehouse

Dim_Product          Fact_Sales          Dim_Store
ID                   Store_ID            ID
Product_Name         Product_ID          State
Brand                Units_Sold          Country

Snowflake Schema
The snowflake schema is a more complex data warehouse model than a star schema, and is a type of
star schema. It is called a snowflake schema because the diagram of the schema resembles a
snowflake.

Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has
been grouped into multiple tables instead of one large table. While this saves space, it increases the
number of dimension tables and requires more foreign key joins. The result is more complex queries
and reduced query performance.

Example: Snowflake Schema for a Sales Data Warehouse

Dim_Product          Fact_Sales          Dim_Store
ID                   Store_ID            ID
Product_Name         Product_ID          Geography_ID
Brand_ID             Units_Sold

Dim_Brand            Dim_Geography
ID                   ID
Brand                State
                     Country

Fact Constellation

For each star schema it is possible to construct a fact constellation schema. The fact constellation
architecture contains multiple fact tables that share many dimension tables. This schema is more
complex than the star or snowflake architecture because it contains multiple fact tables, which
allows dimension tables to be shared among them. The main disadvantage of the fact constellation
schema is its more complicated design, because many variants of aggregation must be considered.

In a fact constellation schema, each fact table is explicitly assigned to the dimensions that are
relevant to it. This may be useful when some facts are associated with a given dimension level and
other facts with a deeper dimension level.

3. OLAP (OnLine Analytical Processing) Cube

An OLAP (Online analytical processing) cube is a data structure that allows fast analysis of data. It
can also be defined as the capability of manipulating and analysing data from multiple perspectives.
The arrangement of data into cubes overcomes some limitations of relational databases.

OLAP cubes can be thought of as extensions to the two-dimensional array of a spreadsheet. For
example a company might wish to analyse some financial data by product, by time-period, by city, by
type of revenue and cost, and by comparing actual data with a budget. These additional methods of
analysing the data are known as dimensions. Because there can be more than three dimensions in
an OLAP system the term hypercube is sometimes used.

The OLAP cube consists of numeric facts called measures which are categorized by dimensions. The
cube metadata (structure) may be created from a star schema or snowflake schema of tables in a
relational database. Measures are derived from the records in the fact table and dimensions are
derived from the dimension tables.

OLAP Operations:
The analyst can understand the meaning contained in the databases using multi-dimensional
analysis. By aligning the data content with the analyst's mental model, the chances of confusion and
erroneous interpretations are reduced. The analyst can navigate through the database and screen
for a particular subset of the data, changing the data's orientations and defining analytical
calculations. The user-initiated process of navigating by calling for page displays interactively,
through the specification of slices via rotations and drill down/up is sometimes called "slice and
dice". Common operations include slice and dice, drill down, roll up, and pivot.

Slice: A slice is a subset of a multi-dimensional array corresponding to a single value for one or more
members of the dimensions not in the subset.

Dice: The dice operation is a slice on more than two dimensions of a data cube (or more than two
consecutive slices).

Drill Down/Up: Drilling down or up is a specific analytical technique whereby the user navigates
among levels of data ranging from the most summarized (up) to the most detailed (down).

Roll-up: A roll-up involves computing all of the data relationships for one or more dimensions. To do
this, a computational relationship or formula might be defined.

Pivot: This operation is also called rotate operation. It rotates the data in order to provide an
alternative presentation of data - the report or page display takes a different dimensional
orientation.
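A tiny sketch of what a roll-up does over a fact table (illustrative only; the rows and figures below are invented and are not the supplier data of the following example): the Product dimension is aggregated away, leaving one total per quarter.

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct FactRow {
    std::string product;
    std::string quarter;
    int units_sold;
};

int main() {
    // Hypothetical fact rows at (product, quarter) granularity.
    std::vector<FactRow> fact = {
        {"TV", "Q1", 120}, {"TV", "Q2", 90},
        {"DVD", "Q1", 60}, {"DVD", "Q2", 75},
    };

    // Roll-up: sum Units_Sold over the Product dimension, keeping Quarter.
    std::map<std::string, int> by_quarter;
    for (const auto& row : fact) by_quarter[row.quarter] += row.units_sold;

    for (const auto& [quarter, total] : by_quarter)
        std::cout << quarter << ": " << total << " units\n";

    // A slice would instead fix one member, e.g. keep only rows with product == "TV".
    return 0;
}

Drill-down is the reverse operation: going back from the quarter totals to the (product, quarter) detail rows.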

Example:

Relational Data:
Supplier-1:

Supplier-2:

Figure: Cube view of the supplier data, with dimensions ITEMS (TV, DVD, Computer, Laptop,
Refrigerator for suppliers S1 and S2), LOCATION (Mumbai, Vijayawada, Vizag, Hyderabad) and
TIME (quarters Q1-Q4).

4. Generating OLAP cube using OlapCube Tool

Extract Data from OLTPs -> Transform & Standardize Data -> Import to OLAP Database -> Build Cubes -> Produce Reports

Steps involved in the creation of an OLAP cube

Example:

Relational Data in CSV Format:

one Hyderabad 23000 22


one Hyderabad 25000 21
one Hyderabad 21000 22
one Hyderabad 20000 20
one Hyderabad 23666 19
one Hyderabad 25000 25
one Hyderabad 24500 26
one Hyderabad 22650 18
two Bangalore 16500 27
two Bangalore 15000 24
two Bangalore 14678 22
two Bangalore 13200 25
two Bangalore 14500 24
two Bangalore 15300 23
two Bangalore 14323 22
three Chennai 11200 21
three Chennai 12000 24
three Chennai 13000 22
four Cochin 8500 20
four Cochin 9700 21

Data after exporting and standardizing in OlapCube Tool:

Various Representations of the data after Building Cube:

1. Bar Graph Representation:

2. Pie Graph Representation

3. Tabular Representation

5. ETL (Extract, Transform & Load) in Oracle 10g

External Tables:
The external tables feature is a complement to the existing SQL*Loader functionality. It enables you
to access data in external sources as if it were in a table in the database.

Prior to Oracle Database 10g, external tables were read-only. However, as of Oracle Database 10g,
external tables can also be written to. Note that SQL*Loader may be the better choice in data
loading situations that require additional indexing of the staging table. To use the external tables
feature, you must have some knowledge of the file format and record format of the data files on
your platform if the ORACLE_LOADER access driver is used and the data files are in text format. You
must also know enough about SQL to be able to create an external table and perform queries
against it.

How Are External Tables Created?


Oracle Database allows you read-only access to data in external tables. External tables are defined as
tables that do not reside in the database, and can be in any format for which an access driver is
provided. By providing the database with metadata describing an external table, the database is able
to expose the data in the external table as if it were data residing in a regular database table. The
external data can be queried directly and in parallel using SQL.

You can, for example, select, join, or sort external table data. You can also create views and
synonyms for external tables. However, no DML operations (UPDATE, INSERT, or DELETE) are
possible, and no indexes can be created, on external tables.

External tables provide a framework to unload the result of an arbitrary SELECT statement into a
platform-independent Oracle-proprietary format that can be used by Oracle Data Pump. External
tables provide a valuable means for performing basic extraction, transformation, and loading (ETL)
tasks that are common for data warehousing.

The means of defining the metadata for external tables is through the CREATE
TABLE...ORGANIZATION EXTERNAL statement. This external table definition can be thought of as a
view that allows running any SQL query against external data without requiring that the external
data first be loaded into the database. An access driver is the actual mechanism used to read the
external data in the table. When you use external tables to unload data, the metadata is
automatically created based on the data types in the SELECT statement.

Oracle Database provides two access drivers for external tables. The default access driver is
ORACLE_LOADER, which allows the reading of data from external files using the Oracle loader
technology. The ORACLE_LOADER access driver provides data mapping capabilities which are a
subset of the control file syntax of SQL*Loader utility. The second access driver,
ORACLE_DATAPUMP, lets you unload data—that is, read data from the database and insert it into an
external table, represented by one or more external files—and then reload it into an Oracle
Database.

External Table Restrictions:
The following are restrictions on external tables:
 An external table does not describe any data that is stored in the database.
 An external table does not describe how data is stored in the external source. This is the
function of the access parameters.
 Virtual columns are not supported.

Creating External Tables:


External tables can be created using the CREATE TABLE statement with an ORGANIZATION EXTERNAL
clause. This statement creates only the metadata in the data dictionary.

Example: Creating an External Table & Loading Data


The following example creates an external table and then uploads the data to a database table. We
can unload data through the external table framework by specifying the AS sub-query clause of the
CREATE TABLE statement.

In this example, the data for the external table resides in a text file “colleges.dat”.
The contents of the file are:
cbit, 600
mgit, 450
ou, 3000

The following SQL statements create an external table named colleges_exttable and load data from
the external table into the colleges table.

SQL> CREATE OR REPLACE DIRECTORY colleges AS 'C:\8110';


Directory Created.

SQL> CREATE TABLE colleges_exttable


(
college_name varchar2(25) ,
intake number
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_LOADER
DEFAULT DIRECTORY colleges
ACCESS PARAMETERS
(
records delimited by newline
badfile colleges:'empxt%a_%p.bad'
logfile colleges:'empxt%a_%p.log'
fields terminated by ','
missing field values are null

(
college_name, intake
)
)
LOCATION ('colleges.dat')
)
PARALLEL
REJECT LIMIT UNLIMITED;

Table Created.

SQL> SELECT * FROM colleges_exttable;

COLLEGE_NAME INTAKE
Cbit 600
Mgit 450
Ou 3000

SQL> CREATE TABLE colleges AS SELECT * FROM colleges_exttable;


Table Created.

The TYPE specification indicates the access driver of the external table. The access driver is the API
that interprets the external data for the database. If you omit the TYPE specification,
ORACLE_LOADER is the default access driver. You must specify the ORACLE_DATAPUMP access
driver if you specify the AS sub-query clause to unload data from one Oracle Database and reload it
into the same or a different Oracle Database.

The access parameters, specified in the ACCESS PARAMETERS clause, are opaque to the database.
These access parameters are defined by the access driver, and are provided to the access driver by
the database when the external table is accessed.

The PARALLEL clause enables parallel query on the data sources. The granule of parallelism is by
default a data source, but parallel access within a data source is implemented whenever possible.
Parallel access within a data source is provided by the access driver only if all of the following
conditions are met:
 The media allows random positioning within a data source.
 It is possible to find a record boundary from a random position.
 The data files are large enough to make it worthwhile to break up into multiple chunks.
Note:
Specifying a PARALLEL clause is of value only when dealing with large amounts of data. Otherwise, it
is not advisable to specify a PARALLEL clause, and doing so can be detrimental.

The REJECT LIMIT UNLIMITED clause specifies that there is no limit on the number of errors that can
occur during a query of the external data. For parallel access, this limit applies to each parallel
execution server independently. For example, if REJECT LIMIT 10 is specified, each parallel query
process is allowed 10 rejections. Hence, the only precisely enforced values for REJECT LIMIT on
parallel query are 0 and UNLIMITED.

6. Data Pump: Import (impdp) and Export (expdp)

Oracle introduced Data Pump in Oracle Database 10g Release 1. This new Oracle technology
enables very high-speed transfer of data from one database to another. Oracle Data Pump provides
two utilities, namely:
 Data Pump Export which is invoked with the expdp command.
 Data Pump Import which is invoked with the impdp command.

The above two utilities have a similar look and feel to the pre-Oracle 10g import and export utilities
(imp and exp) but are completely separate. This means that dump files generated by the original
export utility (exp) cannot be imported by the new Data Pump import utility (impdp), and vice versa.

Data Pump Export (expdp) and Data Pump Import (impdp) are server-based rather than client-based
as is the case for the original export (exp) and import (imp). Because of this, dump files, log files, and
sql files are accessed relative to the server-based directory paths. Data Pump requires that directory
objects mapped to a file system directory be specified in the invocation of the data pump import or
export.

You can invoke the data pump export or import using a command line. Export and Import parameters
can be specified directly in the command line or in a parameter (.par) file.

Example:

Create a table with sample data, to be exported


SQL> CREATE TABLE new(
2 name VARCHAR(20),
3 rollno NUMBER);

Table created.

SQL> INSERT INTO new VALUES('ABCD',1);

1 row created.

SQL> INSERT INTO new VALUES('WXYZ',2);

1 row created.

SQL> SELECT * FROM new;

NAME ROLLNO
-------------------- ----------
ABCD 1
WXYZ 2
If you want to export to a file, the first thing that you must do is create a database DIRECTORY
object for the output directory, and grant access to users who will be doing exports and imports:
SQL>CREATE DIRECTORY csecbit AS 'C:\8110';

Directory created.

Now, you can export a user's object using the command line. Export parameters are to be
specified in a parameter (.par) file as shown below:
TABLES=new
DUMPFILE=csecbit:dumpfile.dmp
LOGFILE=csecbit:logfile.dmp

Invoke Data Pump Export


C:\Documents and Settings\admin>expdp system/cbit@orcl parfile=C:\8110\parameter.par

Export: Release 10.1.0.2.0 - Production on Tuesday, 01 March, 2011 14:46

Copyright (c) 2003, Oracle. All rights reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 – Production With the
Partitioning, OLAP and Data Mining options
Starting "SYSTEM"."SYS_EXPORT_TABLE_01": system/********@orcl
parfile=C:\8110\parameter.par
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TBL_TABLE_DATA/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 64 KB
Processing object type TABLE_EXPORT/TABLE/TABLE
. . exported "SYSTEM"."NEW" 5.234 KB 2 rows
Master table "SYSTEM"."SYS_EXPORT_TABLE_01" successfully loaded/unloaded
******************************************************************************
Dump file set for SYSTEM.SYS_EXPORT_TABLE_01 is:
C:\8110\DUMPFILE.DMP
Job "SYSTEM"."SYS_EXPORT_TABLE_01" successfully completed at 14:47

Now, drop the table “new” and Import the previously Exported Data Pump
SQL> drop table new;

Table dropped.

SQL> select * from new;


select * from new
*
ERROR at line 1:
ORA-00942: table or view does not exist
Invoke Data Pump Import
C:\Documents and Settings\admin>impdp system/cbit@orcl parfile=C:\8110\parameter.par

Import: Release 10.1.0.2.0 - Production on Tuesday, 01 March, 2011 14:51

Copyright (c) 2003, Oracle. All rights reserved.

Connected to: Oracle Database 10g Enterprise Edition Release 10.1.0.2.0 – Production With the
Partitioning, OLAP and Data Mining options
Master table "SYSTEM"."SYS_IMPORT_TABLE_01" successfully loaded/unloaded
Starting "SYSTEM"."SYS_IMPORT_TABLE_01": system/********@orcl
parfile=C:\8110\parameter.par
Processing object type TABLE_EXPORT/TABLE/TABLE
Processing object type TABLE_EXPORT/TABLE/TBL_TABLE_DATA/TABLE/TABLE_DATA
. . imported "SYSTEM"."NEW" 5.234 KB 2 rows
Job "SYSTEM"."SYS_IMPORT_TABLE_01" successfully completed at 14:51

Check the contents of the imported table “new”


SQL> SELECT * FROM new;

NAME ROLLNO
-------------------- ----------
ABCD 1
WXYZ 2

7. Using the Apriori technique, generate association rules

Association Rules:
Association Rules are used for discovering regularities between products in big transactional
databases. A transaction is an event involving one or more of the products (items) in the business or
domain; for example, buying of goods by a consumer in a supermarket is a transaction. A set of
items is usually referred to as an "itemset", and an itemset with k items is called a "k-itemset".

The general form of an association rule is X => Y, where X and Y are two disjoint itemsets. The
"support" of an itemset is the number of transactions that contain all the items of that itemset;
whereas the support of an association rule is the number of transactions that contain all items of
both X and Y. The "confidence" of an association rule is the ratio between its support and the
support of X.

A given association rule X => Y is considered significant and useful, if it has high support and
confidence values. The user will specify a threshold value for support and confidence, so that
different degrees of significance can be observed based on these threshold values.

Apriori Algorithm- Generation of Frequent Itemsets:


The first step in the generation of association rules is the identification of large itemsets. An itemset
is "large" if its support is greater than a threshold, specified by the user. A commonly used algorithm
for this purpose is the Apriori algorithm.

Apriori is the best-known algorithm to mine association rules. It uses a breadth-first search strategy
to count the support of itemsets, and a candidate generation function which exploits the
downward closure property of support. The Apriori algorithm relies on the principle "Every non-
empty subset of a large itemset must itself be a large itemset".

The algorithm attempts to find subsets which are common to at least a minimum number C of the
itemsets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a
time (a step known as candidate generation), and groups of candidates are tested against the data.
The algorithm terminates when no further successful extensions are found.

Apriori uses breadth-first search and a tree structure to count candidate item sets efficiently. It
generates candidate item sets of length k from item sets of length k − 1. Then it prunes the
candidates which have an infrequent sub pattern. According to the downward closure lemma, the
candidate set contains all frequent k-length item sets. After that, it scans the transaction database to
determine frequent item sets among the candidates.

Apriori, while historically significant, suffers from a number of inefficiencies or trade-offs, which have
spawned other algorithms. Candidate generation generates large numbers of subsets, and bottom-up
subset exploration finds any maximal itemset S only after all 2^|S| − 1 of its proper subsets have been found.

Algorithm:
Find frequent itemsets using an iterative level-wise approach based on candidate generation.

Input:
 D, a database of transactions;
 min_sup, the minimum support count threshold.
Output: L, frequent itemsets in D.

Method:
L1 = find_frequent_1-itemsets(D);
for (k = 2; Lk-1 ≠ ∅; k++) {
    Ck = apriori_gen(Lk-1);
    for each transaction t ∈ D {        // scan D for counts
        Ct = subset(Ck, t);             // get the subsets of t that are candidates
        for each candidate c ∈ Ct
            c.count++;
    }
    Lk = {c ∈ Ck | c.count >= min_sup}
}
return L = ∪k Lk;

procedure apriori_gen(Lk-1: frequent (k-1)-itemsets)

for each itemset l1 ∈ Lk-1
    for each itemset l2 ∈ Lk-1
        if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then {
            c = l1 ⋈ l2;                // join step: generate candidates
            if has_infrequent_subset(c, Lk-1) then
                delete c;               // prune step: remove unfruitful candidate
            else add c to Ck;
        }
return Ck;

procedure has_infrequent_subset(c: candidate k-itemset; Lk-1: frequent (k-1)-itemsets)

// use prior knowledge
for each (k-1)-subset s of c
    if s ∉ Lk-1 then
        return TRUE;
return FALSE;
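The pseudocode above can be exercised on the toy transaction database used in the example below. The following C++ fragment is a minimal sketch added for illustration (it is not part of the original manual): it hard-codes the seven transactions and a minimum support of 3, and for brevity it folds the prune step into the support check.

#include <algorithm>
#include <iostream>
#include <set>
#include <vector>

using Itemset = std::vector<int>;

// Count how many transactions contain every item of the candidate itemset.
int support(const Itemset& c, const std::vector<std::set<int>>& db) {
    int s = 0;
    for (const auto& t : db) {
        bool contained = true;
        for (int item : c)
            if (t.count(item) == 0) { contained = false; break; }
        if (contained) ++s;
    }
    return s;
}

int main() {
    std::vector<std::set<int>> db = {{1, 2, 3, 4}, {1, 2}, {2, 3, 4}, {2, 3},
                                     {1, 2, 4}, {3, 4}, {2, 4}};
    const int min_sup = 3;

    // L1: frequent 1-itemsets (items 1..4 in this toy database).
    std::vector<Itemset> L;
    for (int item = 1; item <= 4; ++item)
        if (support({item}, db) >= min_sup) L.push_back({item});

    while (!L.empty()) {
        for (const auto& f : L) {
            for (int item : f) std::cout << item << ' ';
            std::cout << "(support " << support(f, db) << ")\n";
        }
        // Join step: merge frequent (k-1)-itemsets that share their first k-2 items.
        std::vector<Itemset> C;
        for (std::size_t i = 0; i < L.size(); ++i)
            for (std::size_t j = i + 1; j < L.size(); ++j)
                if (std::equal(L[i].begin(), L[i].end() - 1, L[j].begin())) {
                    Itemset c = L[i];
                    c.push_back(L[j].back());
                    // The prune step (checking all (k-1)-subsets) is subsumed here
                    // by counting support directly, which is fine for this tiny database.
                    if (support(c, db) >= min_sup) C.push_back(c);
                }
        L = C;
    }
    return 0;
}

Running it prints the frequent 1-itemsets and 2-itemsets with the same support counts as the tables in the example below.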

Generating Association Rules from Frequent Itemsets:
Once the frequent itemsets from transactions in a database D have been found, it is straightforward
to generate strong association rules (where strong association rules satisfy both minimum support
and minimum confidence) from them. This can be done using the following equation for confidence:
confidence(A => B) = P(B|A) = support_count(A ∪ B) / support_count(A)

The conditional probability is expressed in terms of itemset support count, where
support_count(A ∪ B) is the number of transactions containing the itemset A ∪ B, and
support_count(A) is the number of transactions containing the itemset A. Based on this equation,
association rules can be generated as follows:
 For each frequent itemset l, generate all nonempty subsets of l.
 For every nonempty subset s of l, output the rule "s => (l - s)" if
support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

Example:
Let the database of super-market transactions consist of the sets {1,2,3,4}, {1,2}, {2,3,4}, {2,3},
{1,2,4}, {3,4}, and {2,4}. Each number corresponds to a product such as "butter" or "water".

The first step of Apriori is to count up the frequencies, called the supports, of each member item
separately:

Item Support
1 3
2 6
3 4
4 5

We can define a minimum support level to qualify as "frequent," which depends on the context. For
this case, let min support = 3. Therefore, all are frequent.

The next step is to generate a list of all 2-item pairs of the frequent items. Had any of the above items not
been frequent, it would not have been included as a possible member of a 2-item pair; in this way,
Apriori prunes the tree of all possible sets. In the next step we again keep only those itemsets
(now 2-item pairs) that are frequent:

Item Support
{1,2} 3
{2,3} 3
{2,4} 4
{3,4} 3

We then generate a list of all 3-itemsets from the frequent items (by joining a frequent pair with a
frequent single item):
Item Support
{1,2,3} 1
{1,2,4} 2
{2,3,4} 2

None of these 3-itemsets is frequent, so the algorithm terminates with the frequent 1-itemsets and 2-itemsets found above.
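To illustrate the rule-generation step on this data (a worked example added here; it does not appear in the original text), take the frequent 2-itemset {2,4}, whose support is 4, while items 2 and 4 have supports 6 and 5:

confidence(2 => 4) = support_count({2,4}) / support_count({2}) = 4 / 6 ≈ 0.67
confidence(4 => 2) = support_count({2,4}) / support_count({4}) = 4 / 5 = 0.80

With a minimum confidence threshold of, say, 0.7, only the rule 4 => 2 would be reported as strong.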

Generating Association Rules using WEKA Tool:


Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either
be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-
processing, classification, regression, clustering, association rules, and visualization. It is also well
suited for developing new machine learning schemes.

Example:

Transaction Data: Supplied as Comma Separated (.csv) file


TV DVD_PLAYER FLASH_DRIVE COMPUTER ANTIVIRUS WASHING_MACHINE
y y
Y y y y
Y y y y y
y y y
y y y y
y y y y
y y y y
Y y y y y
Y y y y y
Y y y y y
Y y y y y
Y y y y
Y y y y y
Y y y y
Y y y y
Y y
Y y y
Y y y y y
Y y

Opening the .csv file in WEKA Explorer:

Generating Association Rules by clicking on “Start” in the “Associate” tab:

8. Decision tree classification using WEKA tool

Classification is a data mining technique used to predict group membership for data instances.
Classification is the task of generalizing known structure to apply to new data. For example, an email
program might attempt to classify an email as legitimate or spam. Common algorithms include
decision tree learning, nearest neighbour, naive Bayesian classification, neural networks and support
vector machines.

Data classification is a two-step process. In the first step, a classifier is built describing a
predetermined set of data classes or concepts. This is the learning step (or training phase), where a
classification algorithm builds the classifier by analysing or “learning from” a training set made up of
database tuples and their associated class labels. The class label attribute is discrete-valued and
unordered. It is categorical in that each value serves as a category or class. The individual tuples
making up the training set are referred to as training tuples and are selected from the database
under analysis. In the context of classification, data tuples can be referred to as samples, examples,
instances, data points, or objects. Because the class label of each training tuple is provided, this step
is also known as supervised learning. Finally, the classifier is represented in the form of classification
rules.

The second step is known as Classification and in this step, test data is used to estimate the accuracy
of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the
classification of new data tuples.

Example:

Attribute Data: Supplied as Comma Separated (.csv) file


R age income rating student buys?
1 y e y y
2 y g y n
3 m a n y
4 m e n n
5 m g n y
6 y a y y
7 y e y y
8 y g y y
9 s a n n
10 s e n n
11 m g n y
12 m a n y
13 y a y y
14 y g y n

Opening the .csv file in WEKA Explorer:

Classifying using J48 Classification selected from the “Trees” folder in the “Choose” menu:

Visualizing the resultant Tree by right clicking on the option in the “Result list” & selecting
“Visualize Tree”:

9. Clustering in WEKA tool


The process of grouping a set of physical or abstract objects into classes of similar objects is called
clustering. A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters.

Although classification is an effective means for distinguishing groups or classes of objects, it


requires the often costly collection and labelling of a large set of training tuples or patterns, which
the classifier uses to model each group.

Clustering is also called data segmentation in some applications because clustering partitions large
data sets into groups according to their similarity.

The most well-known and commonly used partitioning methods are k-means, k-medoids and their
variations.

Centroid-Based Technique- The k-Means Method:


The k-means algorithm takes the input parameter, k, and partitions a set of n objects into k clusters
so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster
similarity is measured in regard to the mean value of the objects in a cluster, which can be viewed as
the cluster’s centroid or centre of gravity.

The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which
initially represents a cluster mean or centre. For each of the remaining objects, an object is assigned
to the cluster to which it is the most similar, based on the distance between the object and the
cluster mean. It then computes the new mean for each cluster. This process iterates until the
criterion function converges. Typically, the square-error criterion is used, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2

where E is the sum of the square error for all objects in the data set; p is the point in space
representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional). In
other words, for each object in each cluster, the distance from the object to its cluster centre is
squared, and the distances are summed. This criterion tries to make the resulting k clusters as
compact and as separate as possible.

Example:

Attribute Data: Supplied as Comma Separated (.csv) file
Outlook temperature humidity windy play
Sunny 85 85 FALSE no
Sunny 80 90 TRUE no
Overcast 83 86 FALSE yes
Rainy 70 96 FALSE yes
Rainy 68 80 FALSE yes
Rainy 65 70 TRUE no
Overcast 64 65 TRUE yes
Sunny 72 95 FALSE no
Sunny 69 70 FALSE yes
Rainy 75 80 FALSE yes
Sunny 75 70 TRUE yes
Overcast 72 90 TRUE yes
Overcast 81 75 FALSE yes
Rainy 71 91 TRUE no

Opening the .csv file in WEKA Explorer:

Clustering using Simple K Means selected from the “Choose” menu in the “Cluster” tab:

Visualizing by right clicking on the option in the “Result list” & selecting “Visualize”:

Final Output:
Instance_number  outlook  temperature  humidity  windy  play  Cluster
0 sunny 85 85 FALSE no cluster2
1 sunny 80 90 TRUE no cluster2
2 overcast 83 86 FALSE yes cluster0
3 rainy 70 96 FALSE yes cluster0
4 rainy 68 80 FALSE yes cluster0
5 rainy 65 70 TRUE no cluster2
6 overcast 64 65 TRUE yes cluster1
7 sunny 72 95 FALSE no cluster2
8 sunny 69 70 FALSE yes cluster0
9 rainy 75 80 FALSE yes cluster0
10 sunny 75 70 TRUE yes cluster1
11 overcast 72 90 TRUE yes cluster1
12 overcast 81 75 FALSE yes cluster0
13 rainy 71 91 TRUE no cluster2

10. Programs for linear and multiple regression techniques

Linear regression technique:
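For reference, the program below fits y = a + b*x by least squares, using

b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2},
\qquad
a = \bar{y} - b\,\bar{x}

which is exactly what the variables num, din, b and a in the code compute.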


#include <iostream>
using namespace std;

int main()
{
    cout << "enter how many elements";
    int n;
    cin >> n;
    double mean_x, mean_y;
    double sum_x = 0, sum_y = 0;
    double num = 0.0, din = 0.0;
    int x[20];
    int y[20];
    cout << "enter the x set & y set";
    for (int i = 0; i < n; i++)
    {
        cin >> x[i];
        sum_x = sum_x + x[i];
        cin >> y[i];
        sum_y = sum_y + y[i];
    }
    cout << "sum_x:" << sum_x << "sum_y:" << sum_y;
    mean_x = sum_x / n;
    mean_y = sum_y / n;
    cout << endl << "mean_x:" << mean_x << ",mean_y:" << mean_y;
    // slope b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    for (int i = 0; i < n; i++)
    {
        num = num + ((x[i] - mean_x) * (y[i] - mean_y));
        din = din + ((x[i] - mean_x) * (x[i] - mean_x));
    }
    cout << endl << "num:" << num << ",din:" << din;
    double b = num / din;
    double a = mean_y - (b * mean_x);   // intercept
    cout << endl << "b:" << b << "," << "a:" << a;
    cout << endl << "enter the x value";
    int c;
    cin >> c;
    double r = a + b * c;               // predicted y for the given x
    cout << endl << "corresponding y is:" << r;
    return 0;
}

/* output:
enter how many elements
6
enter the x set& y set
1
8
2
13
3
18
4
23
5
28
6
33
sum_x:21.0sum_y:123.0
mean_x:3.5,mean_y:20.5
num:87.5,din:17.5
b:5.0,a:3.0
enter the x value
2
corresponding y is:13.0*/

Multiple regression technique:
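A note on the method (added for clarity, not part of the original program): the program below estimates b1 from x1 alone and b2 from x2 alone. The full two-predictor least-squares fit y = a + b1*x1 + b2*x2 instead solves the normal equations

S_{11} b_1 + S_{12} b_2 = S_{1y}
S_{12} b_1 + S_{22} b_2 = S_{2y}

with S_{jk} = \sum_i (x_{ji} - \bar{x}_j)(x_{ki} - \bar{x}_k),  S_{jy} = \sum_i (x_{ji} - \bar{x}_j)(y_i - \bar{y}),  and a = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2.

The two approaches coincide when x1 and x2 are uncorrelated; in the sample run below x1 and x2 are identical, so the separate estimates should be read as an approximation.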

#include <iostream>
using namespace std;

int main()
{
    int n;
    cout << "enter how many elements";
    cin >> n;
    double mean_x1, mean_x2, mean_y;
    double sum_x1 = 0, sum_x2 = 0, sum_y = 0;
    double num1 = 0.0, din1 = 0.0, num2 = 0.0, din2 = 0.0;
    int x1[20];
    int x2[20];
    int y[20];
    cout << "enter the x1,x2 sets & y set";
    for (int i = 0; i < n; i++)
    {
        cin >> x1[i];
        sum_x1 = sum_x1 + x1[i];
        cin >> x2[i];
        sum_x2 = sum_x2 + x2[i];
        cin >> y[i];
        sum_y = sum_y + y[i];
    }
    cout << "sum_x1:" << sum_x1 << "sum_y:" << sum_y;
    mean_x1 = sum_x1 / n;
    mean_x2 = sum_x2 / n;
    mean_y = sum_y / n;
    cout << "mean_x1:" << mean_x1 << ",mean_x2:" << mean_x2 << ",mean_y:" << mean_y;
    // each slope is estimated from its own predictor only
    for (int i = 0; i < n; i++)
    {
        num1 = num1 + ((x1[i] - mean_x1) * (y[i] - mean_y));
        din1 = din1 + ((x1[i] - mean_x1) * (x1[i] - mean_x1));
        num2 = num2 + ((x2[i] - mean_x2) * (y[i] - mean_y));
        din2 = din2 + ((x2[i] - mean_x2) * (x2[i] - mean_x2));
    }
    cout << "num1:" << num1 << ",din1:" << din1;
    cout << "num2:" << num2 << ",din2:" << din2;
    double b1 = num1 / din1;
    double b2 = num2 / din2;
    double a = mean_y - (b1 * mean_x1) - (b2 * mean_x2);
    cout << "b1:" << b1 << ",b2:" << b2 << "a:" << a;
    cout << "enter the x1,x2 values";
    int c1, c2;
    cin >> c1;
    cin >> c2;
    double r = a + b1 * c1 + b2 * c2;   // predicted y for the given (x1, x2)
    cout << endl << "corresponding y is:" << r << endl;
    return 0;
}

/*output:
enter how many elements
6
enter the x1,x2 sets& y set
1
1
11
2
2
19
3
3
27
4
4
35
5
5
43
6
6
51
sum_x1:21.0sum_y:186.0
mean_x1:3.5,mean_x2:3.5,mean_y:31.0
num1:140.0,din1:17.5
num2:140.0,din2:17.5
b1:8.0,b2:8.0a:-25.0
enter the x1,x2 values
7
7
corresponding y is:87.0*/

11. Programs for k-means and k-medoids clustering techniques

K-Means Schema

Aim:
Understand and write a program to implement K-Means partitioning algorithm

Theory:
The k-means algorithm is an algorithm to cluster n objects based on attributes into k
partitions, k < n. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in
that they both attempt to find the centers of natural clusters in the data. It assumes that the object
attributes form a vector space. The objective it tries to achieve is to minimize the total intra-cluster
variance, or the squared error function

V = \sum_{i=1}^{k} \sum_{x_j \in S_i} |x_j - \mu_i|^2

where there are k clusters Si, i = 1, 2, ..., k, and µi is the centroid or mean point of all the points xj ∈ Si.

The most common form of the algorithm uses an iterative refinement heuristic known as
Lloyd's algorithm. Lloyd's algorithm starts by partitioning the input points into k initial sets, either at
random or using some heuristic data. It then calculates the mean point, or centroid, of each set. It
constructs a new partition by associating each point with the closest centroid. The centroids are then
recalculated for the new clusters, and the algorithm is repeated by alternate application of these two
steps until convergence, which is obtained when the points no longer switch clusters (or, alternatively,
when the centroids are no longer changed).

Lloyd's algorithm and k-means are often used synonymously, but in reality Lloyd's algorithm
is a heuristic for solving the k-means problem; with certain combinations of starting points and
centroids, Lloyd's algorithm can in fact converge to the wrong answer (i.e. a different and better
answer to the minimization function above exists). Other variations exist, but Lloyd's algorithm has
remained popular because it converges extremely quickly in practice. In fact, many have observed
that the number of iterations is typically much less than the number of points. Recently, however,
David Arthur and Sergei Vassilvitskii showed that there exist certain point sets on which k-means
takes superpolynomial time, 2^Ω(√n), to converge. Approximate k-means algorithms have been
designed that make use of coresets: small subsets of the original data.

Program:

//K-means Schema

#include<stdio.h>
#include<conio.h>

void main()
{
int n,j,s=0;
int obj[20],c[20][20],mean[20];
int i,nc,k,m;
clrscr();
printf("\n Enter the No.of Items:");
scanf("%d",&n);
printf("\n enter the n items");
for(i=0;i<n;i++)
scanf("%d",&obj[i]);
printf("\n Enter the No.of clusters:");
scanf("%d",&nc);

for(i=0;i<nc;i++)
for(j=0;j<n;j++)
c[i][j]=0;

for(i=0;i<nc;i++)
{
c[i][0]=obj[i];
mean[i]=obj[i];
}

for(i=0;i<nc;i++)
for(j=0;j<n;j++)

if (c[i][j]>0)
printf("I: %d",c[i][j]);

j=3;

for(i=0;i<n;i++)
{
if(j<n)
{
if((obj[j]-mean[0])<(obj[j]-mean[1]))
if((obj[j]-mean[0])<(obj[j]-mean[2]))
c[0][i]=obj[j];
if((obj[j]-mean[1])<(obj[j]-mean[0]))
if((obj[j]-mean[1])<(obj[j]-mean[2]))
c[1][i]=obj[j];
if((obj[j]-mean[2])<(obj[j]-mean[0]))
if((obj[j]-mean[2])<(obj[j]-mean[1]))
c[2][i]=obj[j];
for(k=0;k<nc;k++)
{
for(m=0;m<n;m++)
{
s=s+c[k][m];
mean[k]=s/n;
}
}
j++;
}
}
for(i=0;i<nc;i++)
{
printf("\n");
for(j=0;j<n;j++)
{
if(c[i][j]>0)
printf("%d,",c[i][j]);
}
}

getch();
}

Output:
Enter the no. of objects: 10
Enter 10 objects: 1 2 5 7 9 10 14 17 20 25
Enter the no. of clusters: 3
I: 1 I: 2 I: 5
1, 9, 10, 20
2, 5, 17
7, 14, 25

K-Medoids Schema

Aim:

Understand and write a program to implement K-Medoids partitioning algorithm.

Theory:

The K-medoids algorithm is a clustering algorithm related to the K-means algorithm. Both
algorithms are partitional (breaking the dataset up into groups) and both attempt to minimize
squared error, the distance between points labeled to be in a cluster and a point designated as the
center of that cluster. In contrast to the K-means algorithm, K-medoids chooses data points as centers
(medoids or exemplars).

K-medoid is a classical partitioning technique of clustering that clusters the data set of n
objects into k clusters known a priori. It is more robust to noise and outliers than k-means.
A medoid can be defined as that object of a cluster whose average dissimilarity to all the
objects in the cluster is minimal, i.e. it is the most centrally located point in the given data set.
The k-medoid clustering algorithm is as follows (a small sketch of the swap test in steps 3-5 appears after this list):
1) The algorithm begins with an arbitrary selection of k objects as medoid points out of the n data
points (n > k).
2) After selection of the k medoid points, associate each data object in the given data set with the most
similar medoid. The similarity here is defined using a distance measure that can be Euclidean,
Manhattan or Minkowski distance.
3) Randomly select a non-medoid object O'.
4) Compute the total cost S of swapping an initial medoid object with O'.
5) If S < 0, then swap the initial medoid with the new one (if S < 0 there will be a new set of medoids).
6) Repeat steps 2 to 5 until there is no change in the medoids.
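The swap test in steps 3-5 can be sketched in a few lines of C++ (an illustration added here, not part of the original manual; the one-dimensional points and the medoid choices are invented):

#include <algorithm>
#include <cmath>
#include <iostream>
#include <vector>

// Total cost: each point contributes its distance to the nearest medoid.
double totalCost(const std::vector<double>& pts, const std::vector<double>& medoids) {
    double cost = 0;
    for (double p : pts) {
        double best = std::abs(p - medoids[0]);
        for (double m : medoids) best = std::min(best, std::abs(p - m));
        cost += best;
    }
    return cost;
}

int main() {
    std::vector<double> pts = {2, 3, 4, 10, 11, 12, 20, 25};
    std::vector<double> medoids = {2, 20};   // arbitrary initial medoids (step 1)
    std::vector<double> trial = {3, 20};     // medoid 2 swapped with non-medoid 3 (step 3)

    double before = totalCost(pts, medoids); // cost with the current medoids
    double after = totalCost(pts, trial);    // cost if the swap is made (step 4)
    std::cout << "cost before swap: " << before << ", after swap: " << after << '\n';

    if (after - before < 0)                  // S < 0: accept the swap (step 5)
        medoids = trial;
    return 0;
}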

Program:

// K-medoids Partitioning

#include<stdio.h>
#include<conio.h>

void main()
{
int xi[10],xj[10];
int i,n,dij=0,tc[20];
clrscr();
printf("\n Enter n Values:");
scanf("%d",&n);
printf("Enter %d numbers into xi",n);

for(i=0;i<n;i++)
scanf("%d",&xi[i]);
printf("Enter %d numbers into xj",n);

for(i=0;i<n;i++)
scanf("%d",&xj[i]);

for(i=0;i<n;i++)
{
if(xi[i]>xj[i])
dij+=xi[i]-xj[i];
else
dij+=xj[i]-xi[i];
}

for(i=0;i<n;i++)
{

tc[i]=xi[i]-xj[i];
if(tc[i]<0)
xi[i]=xj[i];
}

printf("\n Elements of xi:\n");


for(i=0;i<n;i++)
printf("\t %d",xi[i]);

printf("\n Elements of xj:\n");


for(i=0;i<n;i++)
printf("\t %d",xj[i]);

getch();
}

Output:

Enter n value: 4
Enter 4 numbers into xi
1
2
3
4

Enter 4 numbers into xj


5
4
3
2

Elements of xi:
5 4 3 4
Elements of xj:
5 4 3 2

12. Case study: Customer Information System

E-governance involves the application of Information and Communication Technologies by


government agencies for information and service delivery to citizens, business and government
employees. It is an emerging field, faced with various implementation problems related to
technology, employees, flexibility and change related issues, to mention a few. With the increase in
Internet and mobile connections, citizens are learning to exploit this new mode of access in
wide-ranging ways. Increasingly, government organizations are analyzing current and historical data
to identify useful patterns in their large databases so that they can support their business strategy.
Their main emphasis is on complex, interactive, exploratory analysis of very large datasets created by
the integration of data from across all parts of the organization, and that data is fairly static. Three
complementary trends here are:

1) Data warehouse  2) OLAP  3) Data Mining

ROLE OF DATA WAREHOUSE IN E-GOVERNANCE (Citizen Information Systems)

Need for data warehouse

Governments deal with enormous amounts of data. In order that such data is put to
effective use in facilitating decision-making, a data warehouse is constructed over the historical data.
It permits several types of queries requiring complex analysis of data to be addressed by decision-
makers. In spite of taking many initiatives for computerization, government decision makers
currently have difficulty in obtaining meaningful information in a timely manner because they have
to request and depend on IT staff for special reports, which often take a long time to
generate. An Information Warehouse can deliver strategic intelligence to the decision makers and
provide an insight into the overall situation. By organizing person- and land-related data into a
meaningful Information Warehouse, government decision makers can be empowered with a
flexible tool that enables them to make informed policy decisions for citizen facilitation and to
assess their impact over the intended section of the population.

ROLE OF DATA MINING IN E-GOVERNANCE

It is well known that in an Information Technology (IT) driven society, knowledge is one of the
most significant assets of any organization. Knowledge discovery in databases is a well-defined process
consisting of several distinct steps. Data mining is the core step, which results in the discovery of
hidden but useful knowledge from massive databases. A formal definition of knowledge discovery in
databases is given as: "Data mining is the non-trivial extraction of implicit, previously unknown and
potentially useful information about data". Data mining technology provides a user-oriented
approach to novel and hidden patterns in the data. The discovered knowledge can be used by the
E-governance administrators to improve the quality of service. Traditionally, decision making in
E-governance is based on ground information, lessons learnt in the past, and resource and fund
constraints. However, data mining techniques and knowledge management technology can be
applied to create a knowledge-rich environment. An organization may implement Knowledge
Discovery in Databases (KDD) with the help of a skilled employee who has a good understanding of the
organization. KDD can be effective at working with large volumes of data to determine meaningful
patterns and to develop strategic solutions. Analysts and policy makers can learn lessons from the use
of KDD in other industries. E-governance data is massive. It includes centric data, resource
management data and transformed data. E-governance organizations must have the ability to analyze
data. Treatment records of millions of patients can be stored and computerized, and data mining
techniques may help in answering several important and critical questions related to the organization.

Knowledge Discovery in E-governance

Data mining is an essential step of knowledge discovery. In recent years it has attracted a
great deal of interest in the information industry. The knowledge discovery process consists of an
iterative sequence of data cleaning, data integration, data selection, data mining, pattern recognition
and knowledge presentation. In particular, data mining may accomplish class description, association,
classification, clustering, prediction and time series analysis. Data mining, in contrast to traditional
data analysis, is discovery driven. Data mining is a young interdisciplinary field closely connected to
data warehousing, statistics, machine learning, neural networks and inductive logic programming.
Data mining provides automatic pattern recognition and attempts to uncover patterns in data that
are difficult to detect with traditional statistical methods. Without data mining it is difficult to realize
the full potential of data collected within a healthcare organization, as the data under analysis is
massive, highly dimensional, distributed and uncertain.

Data Mining Cycle: identify the problem in the government -> employ data mining techniques to
extract knowledge -> analyze the discovered knowledge -> measure the effectiveness of the
discovered knowledge.

For government organizations to succeed, they must have the ability to capture, store and analyze data.
Online analytical processing (OLAP) provides one way for data to be analyzed in a multi-dimensional
capacity. With the adoption of data warehousing and data analysis/OLAP tools, an organization can
make strides in leveraging data for better decision making. Many organizations struggle with the
utilization of data collected through an organization's online transaction processing (OLTP) system
that is not integrated for decision making and pattern analysis. For a successful E-governance
organization it is important to empower the management and staff with data warehousing based on
critical thinking and knowledge management tools for strategic decision making. Data warehousing
can be supported by decision support tools such as data marts, OLAP and data mining tools. A data
mart is a subset of a data warehouse; it focuses on selected subjects. An online analytical processing
(OLAP) solution provides a multi-dimensional view of the data found in relational databases. While
relational databases store data in a two-dimensional format, OLAP makes it possible to analyze
potentially large amounts of data with very fast response times and provides the ability for users to go
through the data and drill down or roll up through various dimensions as defined by the data
structure. Traditional manual data analysis has become insufficient, and methods for efficient
computer-assisted analysis are indispensable. A data warehouse is a semantically consistent data store
that serves as a physical implementation of a decision support data model and stores the information
on which an enterprise needs to make strategic decisions. A data warehouse is also often viewed as an
architecture constructed by integrating data from multiple heterogeneous sources to support
structured and/or ad-hoc queries, analytical reporting and decision making.

13. Case Study: Web Usage Mining

Data mining efforts associated with the Web, called Web mining, can be broadly divided into three
classes, i.e. content mining, usage mining, and structure mining.

Web usage mining is the application of data mining techniques to discover usage patterns from Web
data, in order to understand and better serve the needs of Web-based applications. Web usage
mining consists of three phases, namely pre-processing, pattern discovery, and pattern analysis.

Figure: Sample Web Server Log file

There are many kinds of data that can be used in Web Usage Mining. They can be classified as
follows:
 Content: The real data in the Web pages, i.e. the data the Web page was designed to convey
to the users.
 Structure: Data which describes the organization of the content. Intra-page structure
information includes the arrangement of various HTML or XML tags within a given page. The
principal kind of inter-page structure information is hyper-links connecting one page to
another.
 Usage: Data that describes the pattern of usage of Web pages, such as IP addresses, page
references, and the date and time of accesses.
 User Profile: Data that provides demographic information about users of the Web site. This
includes registration data and customer profile information.

Main Tasks in Web Usage Mining:

1. Preprocessing

Preprocessing consists of converting the usage, content, and structure information contained in the
various available data sources into the data abstractions necessary for pattern discovery.

1.1 Usage Preprocessing


Usage preprocessing is arguably the most difficult task in the Web Usage Mining process due to the
incompleteness of the available data. Unless a client side tracking mechanism is used, only the IP
address, agent, and server side click-stream are available to identify users and server sessions. Some
of the typically encountered problems are:
 Single IP address/Multiple Server Sessions
 Multiple IP address/Single Server Session
 Multiple IP address/Single User
 Multiple Agent/Single User

Assuming each user has now been identified (through cookies, logins, or IP/agent/path analysis), the
click-stream for each user must be divided into sessions. Since page requests from other servers are
not typically available, it is difficult to know when a user has left a Web site. A thirty minute timeout
is often used as the default method of breaking a user's click-stream into sessions. When a session ID
is embedded in each URL, the definition of a session is set by the content server.

1.2 Content Preprocessing
Content preprocessing consists of converting the text, image, scripts, and other files such as
multimedia into forms that are useful for the Web Usage Mining process. Often, this consists of
performing content mining such as classification or clustering. While applying data mining to the
content of Web sites is an interesting area of research in its own right, in the context of Web Usage
Mining the content of a site can be used to filter the input to, or output from the pattern discovery
algorithms. In addition to classifying or clustering page views based on topics, page views can also be
classified according to their intended use. Page views can be intended to convey information, gather
information from the user, allow navigation, or some combination of these uses. The intended use of a page
view can also filter the sessions before or after pattern discovery.

1.3 Structure Preprocessing


The structure of a site is created by the hypertext links between page views. The structure can be
obtained and preprocessed in the same manner as the content of a site. Again, dynamic content
poses more problems than static page views. A different site structure may have to be constructed
for each server session.

2. Pattern Discovery
Pattern discovery draws upon methods and algorithms developed from several fields such as
statistics, data mining, machine learning and pattern recognition.

2.1 Statistical Analysis


Statistical techniques are the most common method to extract knowledge about visitors to a Web
site. By analysing the session file, one can perform different kinds of descriptive statistical analyses
on variables such as page views, viewing time and length of a navigational path. Many Web traffic
analysis tools produce a periodic report containing statistical information such as the most
frequently accessed pages, average view time of a page or average length of a path through a site.

2.2 Association Rules


Association rule generation can be used to relate pages that are most often referenced together in a
single server session. In the context of Web Usage Mining, association rules refer to sets of pages
that are accessed together with a support value exceeding some specified threshold. These pages
may not be directly connected to one another via hyperlinks. Aside from being applicable for
business and marketing applications, the presence or absence of such rules can help Web designers
to restructure their Web site.

2.3 Clustering
Clustering is a technique to group together a set of items having similar characteristics. In the Web
Usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page
clusters.
Clustering of users tends to establish groups of users exhibiting similar browsing patterns.
On the other hand, clustering of pages will discover groups of pages having related content.

2.4 Classification
Classification is the task of mapping a data item into one of several predefined classes. In the Web
domain, one is interested in developing a profile of users belonging to a particular class or category.
This requires extraction and selection of features that best describe the properties of a given class or
category. Classification can be done by using supervised inductive learning algorithms such as
decision tree classifiers, naive Bayesian classifiers, k-nearest neighbour classifiers, Support Vector
Machines etc.

2.5 Sequential Patterns


The technique of sequential pattern discovery attempts to find inter-session patterns such that the
presence of a set of items is followed by another item in a time-ordered set of sessions or episodes.
By using this approach, Web marketers can predict future visit patterns which will be helpful in
placing advertisements aimed at certain user groups.

2.6 Dependency Modeling


Dependency modeling is another useful pattern discovery task in Web Mining. The goal here is to
develop a model capable of representing significant dependencies among the various variables in
the Web domain. There are several probabilistic learning techniques that can be employed to model
the browsing behaviour of users. Such techniques include Hidden Markov Models and Bayesian
Belief Networks. Modeling of Web usage patterns will not only provide a theoretical framework for
analysing the behaviour of users but is potentially useful for predicting future Web resource
consumption.

3. Pattern Analysis
Pattern analysis is the last step in the overall Web Usage mining process. The motivation behind
pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern
discovery phase. The exact analysis methodology is usually governed by the application for which
Web mining is done. The most common form of pattern analysis consists of a knowledge query
mechanism such as SQL. Another method is to load usage data into a data cube in order to perform
OLAP operations. Content and structure information can be used to filter out patterns containing
pages of a certain usage type, content type, or pages that match a certain hyperlink structure.

Figure: Web Usage Mining Process

