
CHAPTER 1

INTRODUCTION

This chapter provides an overview of the research project and discusses the
research background, problem statement, objectives, scope and significance of
the research.

1.1 Research Background

E-filing provides access to a large database that contains a list of
electronic files. According to Olson, Edwards and Monty (2003), e-filing is a
highly secure and reliable method for sending, receiving and managing legal
documents. Finding files manually takes time, whereas e-filing provides
secure access that lets users identify the files they need without searching
large shelves by hand. Olson et al. (2003) also stated that state courts,
federal courts and law firms across the country are increasingly using
e-filing to improve access to documents, maximize resources and streamline
filing and service activities. With e-filing, it is much easier to check the
status of a file and identify its location before retrieving the physical
file.

The purpose of this research is to develop a prototype of a web-based
e-filing system for Majlis Daerah Kerian. Majlis Daerah Kerian, Parit Buntar,
Perak acts as a local government, the unit of government that is closest to
the citizens; local governments include municipalities, local authorities,
town councils and city councils. There are eight departments in Majlis Daerah
Kerian: the Law and Administration Unit, Assessment Unit, Information
Technology Unit, Account and Finance Unit, License and Parking Unit, Town
Service Unit, Garden and Recreation Unit, and Building Unit.

Within the web-based e-filing system, staff can easily gather information
about the status of files and identify the files that meet their
requirements. The system is developed using a data mining technique,
specifically clustering. According to Phyu (2009), data mining involves the
use of sophisticated data analysis tools to discover previously unknown,
valid patterns and relationships in large data sets. Data mining therefore
consists of more than collecting and managing data; it also includes analysis
and prediction. Garofalakis, Rastogi, Seshadri and Shim (1999) stated that
there are three popular data mining techniques: association rules,
classification and clustering. This research identifies a suitable searching
method from among these three techniques in order to develop a prototype of
the web-based e-filing system.

1.2 Problem Statement

The staff of Majlis Daerah Kerian face difficulties in managing and
identifying the files that meet their requirements, because searching for
files manually is difficult. According to Mrs. Shalina, Administrative
Assistant of Majlis Daerah Kerian, a manual search involves several steps:
a. Look up the number of the required file in a log book.
b. Determine the file name from the file number.
c. Look for the file on many large shelves, which takes a long time.
d. If the file is not on the shelf, check each staff member’s desk or the
other departments in Majlis Daerah Kerian.

All these steps are barriers to responding promptly to each request. With
this system, staff can find the files that satisfy their needs, which creates
an interactive environment for them.

1.3 Aim

The aim of this research project is to provide a suitable searching method
using data mining techniques for a web-based e-filing system.

1.4 Objective of the Research

To achieve the aim of the project, the research has four objectives:

a. To identify the requirements for e-filing at Majlis Daerah Kerian.
b. To identify a searching method based on data mining techniques.
c. To design a web-based e-filing system.
d. To demonstrate the web-based e-filing system using the identified data
mining technique.

1.5 Significance of Research

The significance of this development is that the system can be used by the
staff of Majlis Daerah Kerian. E-filing will act as an information center
where staff can gather information about the status of files. It also
provides staff with an interactive environment for choosing the files that
meet their requirements.

1.6 Scope of Study

The web-based e-filing system is developed using PHP with a MySQL database.
The development is for Majlis Daerah Kerian, Parit Buntar, Perak and focuses
on filing management only. It is a web-based application that can be accessed
via a browser and will be used internally by employees of Majlis Daerah
Kerian.

1.7 Limitation

An important task in this study is gathering information from staff of Majlis
Daerah Kerian who are involved in filing management. The information is
gathered through interviews, which require arranging schedules and finding
the right interviewees in order to hold proper and effective interview
sessions.

Interview time is the main constraint, because the researcher has to
reschedule whenever an interviewee cancels a session. This makes it difficult
for the researcher to gather all of the information, and some important
information may be missed. The interview sessions were conducted at Majlis
Daerah Kerian, Parit Buntar, Perak.

Another limitation is that there are three different data mining techniques,
and the researcher must select the one that best suits the objective. The
researcher needs to study each technique properly and find related journals
that support the findings.

Next, a large number of data mining tools are available, but not every tool
supports every kind of data mining technique. The researcher therefore needs
to study the tools based on their functions and their usability with the
selected technique. Furthermore, the tool used in this research is new to the
researcher, so time is required to become familiar with it.

The experience of the researcher is another limiting factor, as this is the
researcher’s first research project. However, the researcher can learn and
obtain proper guidance from the research plan and from the instructions of
the supervisor and examiner.

1.8 Outcomes/Deliverables

The outcome of the research project is a suitable searching method using a
data mining technique for a web-based e-filing system.

1.9 Layout of Dissertation

This research project has both a theoretical and a practical part. The
theoretical part describes the concepts and reviews the literature on
e-filing and data mining techniques. The practical part consists of an
analysis of the data gathered from the interview sessions and of secondary
data from the literature review.

The remaining chapters of this research are:

• Chapter 2 reviews the literature on e-filing and data mining techniques.
This literature acts as a reference for the research project.
• Chapter 3 describes the research approach and methodology used in this
research project. The choice of method, how data is gathered and the
strategy used to perform an analysis of the data are explained.
• Chapter 4 discusses the construction of the system’s prototype.
• Chapter 5 discusses the findings and the analysis from the interview
sessions and secondary data.
• Chapter 6 provides the conclusion and recommendations for further
research.

1.10 Summary

This chapter explains the background of the problem and the proposed
solution, together with a brief explanation of that solution. It covers the
important aspects of the project, such as the research background,
objectives, scope and significance. The methodology diagram shown in Figure
3.1 in Chapter 3 and the contents of this chapter serve as the basis for
direction in the following chapters.

The next chapter discusses the literature review for the research project.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

This chapter describes in detail the literature that supports the research
project. The literature review also clarifies the relationship between this
study and previous work on the topic. The chapter gives an overview of
e-filing and data mining, a brief explanation of each data mining technique
and the steps in selecting data mining tools.

2.2 E-Filing

2.2.1 Introduction to E-Filing

E-filing provides access to a large database that contains a list of
electronic files. According to Olson et al. (2003), e-filing is a highly
secure and reliable method for sending, receiving and managing legal
documents and case information. However, the rules for implementing e-filing
need to be fully understood in order to achieve the best filing.

2.2.2 Purposes of the Rules in E-Filing

According to Olson et al. (2003), there are several reasons why rules are
important for electronic filing:

• To define the electronic filing system: Electronic filing and service can
mean many different things, so the type of files covered must be clearly
defined in order to provide guidance on where and how to access the files.

• To authorize electronic filing and service: Rules of procedure are very
specific when it comes to defining the mechanical rules of filing. The valid
methods for delivering documents into the right files need to be identified
for the best filing.

• To clearly specify the procedural mechanics: The rules should state how to
file electronically, how security, service and filing deadlines are handled,
and how to sign documents electronically, for simplicity and to avoid
complexity.

• To encourage use of electronic filing: Electronic filing is new to some
people, and training is the way to encourage them to use the system.

2.2.3 Proposed Model Rules for E-Filing

According to Olson et al. (2003), the rules below may be cited as “e-filing
rules”:

• Short title
• Clear definitions of files
• Give authority
• Determine authorized users
• Give effective date
• Signature to identify responsible user

2.3 What is Data Mining?

2.3.1 Definition of Data Mining

According to Phyu (2009), data mining is the use of sophisticated data
analysis tools to discover previously unknown, valid patterns and
relationships in large data sets.

According to Chen, Han and Yu (1996), data mining, which is also referred to
as knowledge discovery in databases, means a process of nontrivial extraction
of implicit, previously unknown and potentially useful information (such as
knowledge rules, constraints, regularities) from data in databases.

Tang, Steinbach and Kumar (2006) stated that data mining is the process of
automatically discovering useful information in large data repositories.
Data mining techniques are deployed to scour large databases in order to find
novel and useful patterns that might otherwise remain unknown.

Many other terms found in articles and journals carry a similar or slightly
different meaning, such as knowledge mining from databases, knowledge
extraction, data archaeology, data dredging or data analysis.

2.3.2 Data Mining and Knowledge Discovery

Data mining is an integral part of knowledge discovery in databases, which is
the overall process of converting raw data into useful information, as shown
in Figure 2.1. This process consists of a series of transformation steps,
from data preprocessing to postprocessing of data mining results (Tang et
al., 2006).

Figure 2.1 : The Process of knowledge discovery in database.

Tang et al. (2006) stated that the input data can be stored in a variety
of formats (flat files, spreadsheets, or relational tables) and may
reside in a centralized data repository or be distributed across multiple
sites. The purpose of preprocessing is to transform the raw input data
into an appropriate format for subsequent analysis. The steps involved
in data preprocessing include fusing data from multiple sources,
cleaning data to remove noise and duplicate observations, and
selecting records and features that are relevant to the data mining task
at hand. Because of the many ways data can be collected and stored,
data preprocessing is perhaps the most laborious and time-consuming
step in the overall knowledge discovery process.
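
As an illustration only, the three preprocessing steps described above
(fusing sources, cleaning noise and duplicates, and selecting relevant
features) can be sketched in Python as follows. The function name
`preprocess`, the field names and the sample records are hypothetical; they
are not taken from Tang et al. (2006) or from Majlis Daerah Kerian data.
Treating a record with an empty field as noise is likewise a simplifying
assumption.

```python
def preprocess(sources, wanted_fields):
    """Fuse record lists, drop noisy and duplicate rows, keep chosen fields."""
    # Step 1: fuse data from multiple sources into one list.
    fused = [record for source in sources for record in source]

    seen = set()
    cleaned = []
    for record in fused:
        # Step 2a: treat a record with a missing wanted field as noise (assumption).
        if any(record.get(field) in (None, "") for field in wanted_fields):
            continue
        # Step 3: keep only the features relevant to the task.
        selected = tuple(record[field] for field in wanted_fields)
        # Step 2b: skip duplicate observations.
        if selected in seen:
            continue
        seen.add(selected)
        cleaned.append(dict(zip(wanted_fields, selected)))
    return cleaned


# Hypothetical sample data: two sources describing the same paper files.
log_book = [
    {"file_no": "A-001", "title": "Parking licence", "clerk": "x"},
    {"file_no": "A-002", "title": "", "clerk": "y"},
]
shelf_list = [
    {"file_no": "A-001", "title": "Parking licence", "shelf": 3},
    {"file_no": "B-010", "title": "Building plan", "shelf": 7},
]

records = preprocess([log_book, shelf_list], ["file_no", "title"])
print(records)
```

In this sample, the record with the empty title is discarded as noise and the
second copy of file A-001 is discarded as a duplicate, leaving two records.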

“Closing the loop” is the phrase often used to refer to the process of
integrating data mining results into decision support systems. For
example, in business applications, the insights offered by data mining
results can be integrated with campaign management tools so that
effective marketing promotions can be conducted and tested. Such
integration requires a postprocessing step that ensures that only valid
and useful results are incorporated into the decision support system.
Statistical measures or hypothesis testing methods can also be applied
during postprocessing to eliminate spurious data mining results.

According to Shyu, Chen and Haruechaiyasak (2005), data mining or knowledge
discovery in databases has emerged recently as an active research area for
extracting implicit, previously unknown, and potentially useful information
from large databases. They apply data mining techniques in the information
retrieval (IR) context, specifically as information filtering tools for the
recommender system framework.

The overall process for designing and implementing a recommender system is
illustrated in Figure 2.2. The process involves the following five steps.

Figure 2.2 : Process for designing and implementing a recommender system
(Shyu et al., 2005)

Data Collection: This initial step involves the collection of data sets
for executing the data mining algorithms. Three data components are
considered: (a) textual content (i.e., index terms or keywords), (b) link
structure (embedded hyperlinks within Web pages), and (c) user log
records.

Data Preprocessing: This step is required to clean and transform the
collected data sets into formats that are suitable for the data mining
algorithms. This step includes data reduction and selection techniques to
improve the efficiency of the data mining algorithms.

Information Filtering via Data Mining: This step is the core process of the
recommender system framework, where the data sets are analyzed and the data
mining algorithms are applied as the information filtering tools to generate
and discover any useful and interesting recommended outputs.

Database Design and Implementation: This step aims to improve the efficiency
of data and information access and retrieval.

User Interface Design and Implementation: The user interface acts as an
intermediary between the users and the recommender system. This step involves
the design and implementation of a Web (i.e., HTTP) server which receives the
users’ requests via the WWW, processes the requests by accessing the
database, and responds by returning the results to the users. The user
interface provides a recommendation function with user personalization by
requiring each user to log into the system so that their preferences can be
tracked.

2.3.3 Challenges of Data Mining

According to Tang et al. (2006), traditional data analysis techniques have
often encountered practical difficulties in meeting the challenges posed by
new data sets.

Chen et al. (1996) stated that it is important to examine what kinds of
features an applied knowledge discovery system is expected to have and what
kinds of challenges may be faced in the development of data mining
techniques.

Chen et al. (1996) also list the challenges faced during the development of
data mining techniques:

a. Handling of different types of data.
There are many kinds of data and databases used in different applications,
so a knowledge discovery system should be able to perform effective data
mining on different kinds of data. Since most available databases are
relational, it is crucial that a data mining system performs effective
knowledge discovery on relational data. In addition, most databases contain
complex data types, such as structured data and complex data objects,
hypertext and multimedia data, spatial and temporal data, transaction data,
legacy data and so on, so a powerful system should be able to perform
efficient data mining on complex types of data as well. However, to cope
with the diversity of data types, a data mining system can be dedicated to
specific kinds of data, such as systems for knowledge mining in relational
databases, transaction databases, spatial databases, multimedia databases
and so on.

b. Efficiency and scalability of data mining algorithms.
In order to extract information from a large amount of data in databases,
the knowledge discovery algorithms must be efficient and scalable. That is,
the running time of a data mining algorithm must be predictable and
acceptable for large databases.

c. Usefulness, certainty, and expressiveness of data mining results.
The discovered knowledge should accurately portray the contents of the
database and be useful for certain applications. This also encourages a
systematic study of measuring the quality of the discovered knowledge,
including interestingness and reliability, by the construction of
statistical, analytical and simulative models and tools.

d. Expression of various kinds of data mining requests and results.
Different kinds of knowledge can be discovered from a large amount of data.
It is important to examine the discovered knowledge from different views and
present it in different forms. This requires both the data mining requests
and the discovered knowledge to be expressed in high-level languages or
graphical user interfaces, so that the data mining task can be specified by
non-experts and the discovered knowledge is understandable and directly
usable by users.

e. Interactive mining of knowledge at multiple abstraction levels.
A high-level data mining query should be treated as a probe which may
disclose some interesting traces for further exploration. Interactive
discovery allows users to interactively refine a data mining request,
dynamically change data focusing, progressively deepen a data mining process
and flexibly view the data and data mining results at multiple abstraction
levels and from different angles.

f. Mining information from different sources of data.
Many sources of data are available through local and wide-area computer
networks, including the Internet. Mining knowledge from different sources,
whether formatted or unformatted and with diverse data, poses new challenges
to data mining. Data mining may help by providing simple query systems.

g. Protection of privacy and data security.
Protecting data security and guarding against the invasion of privacy are
important when data are viewed from many different angles and at different
abstraction levels. Security measures can prevent the disclosure of
sensitive information.

However, these requirements may conflict. For example, protection of data
security may conflict with the requirements of interactive mining of
multiple-level knowledge from different angles.

2.4 Data Mining Techniques

2.4.1 Overview of Data Mining Techniques

Garofalakis et al. (1999) describe the key data mining algorithms that have
been developed for large databases. They identify association rules,
classification and clustering as the popular data mining techniques.

2.4.2 Classifying Data Mining Techniques

Chen et al. (1996) stated that data mining techniques can be classified
according to the following criteria:

• Type of databases to work on
A data mining system can be classified according to the kinds of databases
on which the data mining is performed. Identifying the data type is
important in order to specify the area in which the system will operate. For
example, a system is a relational data miner if it discovers knowledge from
relational data, or an object-oriented one if it mines knowledge from
object-oriented databases. In general, a data miner can be classified
according to its mining of knowledge from the following different kinds of
databases: relational databases, transaction databases, object-oriented
databases, deductive databases, spatial databases, temporal databases,
multimedia databases, heterogeneous databases, active databases, legacy
databases, and the Internet information-base.

• Type of knowledge to be mined
Data miners can identify several kinds of knowledge, including association
rules, characteristic rules, classification rules, clustering and deviation
analysis. However, this knowledge depends on the abstraction level of the
databases.

• Type of techniques to be utilized
Data miners can be categorized according to their underlying data mining
techniques and approach. For example, a data miner can be categorized
according to its driving method as an autonomous knowledge miner, a
data-driven miner, a query-driven miner or an interactive data miner. It can
also be categorized according to its underlying data mining approach as
generalization-based mining, pattern-based mining, mining based on
statistics or mathematical theories, integrated approaches, etc.

2.4.3 Association Rules

Association rules provide a useful mechanism for discovering correlations
among items belonging to customer transactions in a market basket database
(Garofalakis et al., 1999). For example, given a database of sales
transactions, it is desirable to discover the important associations among
items such that the presence of some items in a transaction will imply the
presence of other items in the same transaction.

Chen et al. (1996) stated that the problem of mining association rules can
be decomposed into the following two steps:

a. Discover the large item sets.
b. Use the large item sets to generate the association rules for the
database.

It is noted that the overall performance of mining association rules is
determined by the first step. After the large item sets are identified, the
corresponding association rules can be derived in a straightforward manner.
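
The two steps can be illustrated with a minimal Python sketch. It enumerates
candidate item sets by brute force rather than using an optimized algorithm,
and the function names, the thresholds and the sample transactions are
assumptions made for illustration; they do not come from Chen et al. (1996).
Here an item set is treated as large when it appears in at least
`min_support` transactions, and a rule is kept when its confidence (the share
of transactions containing the left-hand side that also contain the
right-hand side) reaches `min_confidence`.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Step a: find every item set contained in at least min_support transactions."""
    items = sorted({item for t in transactions for item in t})
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:
                frequent[frozenset(candidate)] = count
                found = True
        if not found:
            # No large item set of this size, so none of a larger size exists.
            break
    return frequent


def association_rules(frequent, min_confidence):
    """Step b: derive rules lhs -> rhs from the large item sets."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a large item set is itself large,
                # so its count is already stored in `frequent`.
                confidence = count / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), confidence))
    return rules


# Hypothetical market basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

frequent = frequent_itemsets(transactions, min_support=2)
rules = association_rules(frequent, min_confidence=0.9)
print(rules)
```

With these sample transactions, the only rule that reaches the 0.9 confidence
threshold is {butter} -> {bread}. The sketch also shows why the first step
dominates the cost: it scans the transactions once per candidate, while the
second step only reads the stored counts.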

Figure 2.3 : The general architecture of Mining Association Rule model
(Defit & Md Sap, 2001)

Figure 2.3 represents the general architecture of the Mining Association
Rule (MAR) model. The MAR model consists of two main modules, pre-processing
and processing. The pre-processing module is used to transform data and to
identify and remove inconsistent data from databases. The processing module
is then executed to generate rules and evaluate the generated rules.

2.4.4 Classification

Data classification is the process which finds the common properties among a
set of objects in a database and classifies them into different classes,
according to a classification model (Chen et al., 1996).

Chen et al. (1996) also stated the objectives of classification:
a. Analyze the training data.
b. Develop an accurate description or a model for each class using the
features available in the data.
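
The two objectives can be sketched in a few lines of Python. The sketch uses
a nearest-centroid model, chosen here only for brevity; Chen et al. (1996) do
not prescribe it. Each class is described by the mean of its training
feature vectors, and a new object is assigned to the class with the closest
mean. The features (number of pages, age of the file in years) and the class
labels are hypothetical.

```python
import math


def train(training_data):
    """Analyze labelled examples and describe each class by its mean feature vector."""
    sums, counts = {}, {}
    for features, label in training_data:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in total] for label, total in sums.items()}


def classify(model, features):
    """Assign the object to the class whose description (mean vector) is nearest."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Hypothetical training data: (pages, age in years) -> class label.
training_data = [
    ((2, 1), "active"),
    ((4, 1), "active"),
    ((30, 9), "archived"),
    ((40, 11), "archived"),
]

model = train(training_data)
print(model)
print(classify(model, (5, 2)))
print(classify(model, (28, 8)))
```

A decision tree classifier would replace the mean vectors with a tree of
feature tests, but the overall flow (learn a description per class from the
training data, then apply it to new objects) is the same.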

Garofalakis et al. (1999) stated that classification is useful in the Web
context to build taxonomies and topic hierarchies on Web pages, and
subsequently perform context-based searches for Web pages relating to a
specific topic. Decision tree classifiers are popular since they are easily
interpreted by humans and are efficient to build.

Figure 2.4 : Hierarchical Classification Process
(Khodra & Widyantoro, 2007)

Figure 2.4 shows the hierarchical classification process, which consists of
two stages: an offline stage and an online stage. The offline stage encodes
classification scheme metadata for each web page. In the online stage, all
search results are hierarchically categorized using the classification
scheme provided in the metadata of the retrieved documents. A classification
scheme is a total ordering of classes from the most general (i.e. the root of
the ontology) to the most specific (i.e. a leaf of the ontology). Khodra and
Widyantoro (2007) use Lucene as the search engine, combined with an
interactive navigation interface generator that uses this hierarchical
structure to present the list of search results hierarchically.

2.4.5 Clustering

Visnick (2003) defined clustering as a technique to achieve high data
density. She classifies clustering into different techniques, namely isolate
index, object pooling and object modeling, which perform different
functions.

Chen et al. (1996) defined clustering as the process of grouping physical or
abstract objects into classes of similar objects. It helps data miners to
construct a meaningful partitioning of a large set of objects based on a
“divide and conquer” methodology, which decomposes a large-scale system into
smaller components to simplify design and implementation.

Garofalakis et al. (1999) defined clustering as a useful technique for
discovering interesting data distributions and patterns in the underlying
data.

Qiu, Davis and Ikem (2004) stated that clustering techniques are
heuristic in nature. Almost all techniques have a number of arbitrary
parameters that can be “adjusted” to improve results.

Clustering techniques fall into the following broad categories :

a. Hierarchical vs partitional : Hierarchical techniques produce a nested
sequence of partitions, with a single, all-inclusive cluster at the top and
singleton clusters of individual instances at the bottom. Each intermediate
level can be viewed as a combination of two clusters from the next lower
level, or a split of a cluster from the next higher level into two.
Partitional (or non-nested) techniques create a one-level partitioning of
the data instances. After the user specifies the desired number of clusters,
a partitional approach typically finds all clusters at once. This is in
contrast to traditional hierarchical schemes, which bisect a cluster to get
two clusters or merge two clusters to get one.

b. Divisive vs agglomerative : Hierarchical clustering techniques proceed
either from the top to the bottom or from the bottom to the top, i.e.
clustering starts with one large cluster and splits it, or starts with
clusters each containing a single point and then merges them.

c. Incremental vs non-incremental : Some clustering techniques work with one
instance at a time and decide how to place it into an appropriate cluster,
but most clustering techniques are non-incremental, using information about
all the instances at once to form clusters.
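
To make the agglomerative (bottom-up) case concrete, the following Python
sketch starts with one cluster per point and repeatedly merges the two
closest clusters until the requested number remains. Measuring the distance
between clusters by their closest pair of members (single linkage) and
stopping at a fixed number of clusters are illustrative assumptions, as are
the function name and the sample points.

```python
import math


def agglomerative(points, target_clusters):
    """Bottom-up clustering: start with singletons, merge the two closest clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > max(target_clusters, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i; each pass removes exactly one cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical two-dimensional data instances.
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
result = agglomerative(points, target_clusters=2)
print(result)
```

Because every pass merges two clusters into one, recording the intermediate
states would yield the nested sequence of partitions described in item a. The
sketch is also non-incremental in the sense of item c, since each merge
decision looks at all instances.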

Typical pattern clustering activity involves the following steps (Jain,
Murty and Flynn, 2000) :
• Pattern representation (optionally including feature extraction and/or
selection)
• Definition of a pattern proximity measure appropriate to the data domain
• Clustering or grouping
• Data abstraction (if needed), and
• Assessment of output (if needed).

Figure 2.5 : Stages in clustering (Jain et al., 2000)

Figure 2.5 depicts a typical sequencing of the first three of these steps,
including a feedback path where the grouping process output could
affect subsequent feature extraction and similarity computations.
Pattern representation refers to the number of classes, the number of
available patterns, and the number, type, and scale of the features
available to the clustering algorithm. Some of this information may
not be controllable by the practitioner. Feature selection is the process
of identifying the most effective subset of the original features to use
in clustering. Feature extraction is the use of one or more
transformations of the input features to produce new salient features.
Either or both of these techniques can be used to obtain an appropriate
set of features to use in clustering. Pattern proximity is usually
measured by a distance function defined on pairs of patterns. A
variety of distance measures are in use in the various communities. A
simple distance measure like Euclidean distance can often be used to
reflect dissimilarity between two patterns, whereas other similarity
measures can be used to characterize the conceptual similarity
between patterns. The grouping step can be performed in a number of
ways. The output clustering (or clusterings) can be hard (a partition of
the data into groups) or fuzzy (where each pattern has a variable
degree of membership in each of the output clusters).
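
The difference between a hard and a fuzzy output can be shown with a short
Python sketch based on Euclidean distance. The fuzzy membership rule used
here (degrees inversely proportional to the distance from each cluster
centre, normalized to sum to 1) is one simple illustrative choice and is not
taken from Jain et al. (2000); the function names and sample values are
likewise hypothetical.

```python
import math


def hard_assignment(pattern, centres):
    """Hard clustering: the pattern belongs to exactly one cluster, the nearest."""
    distances = [math.dist(pattern, c) for c in centres]
    return distances.index(min(distances))


def fuzzy_membership(pattern, centres):
    """Fuzzy clustering: a degree of membership in every cluster, summing to 1."""
    distances = [math.dist(pattern, c) for c in centres]
    zeros = distances.count(0.0)
    if zeros:
        # The pattern coincides with one or more centres.
        return [1.0 / zeros if d == 0.0 else 0.0 for d in distances]
    weights = [1.0 / d for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical cluster centres and pattern.
centres = [(0, 0), (4, 0)]
pattern = (1, 0)

print(hard_assignment(pattern, centres))
print(fuzzy_membership(pattern, centres))
```

For this pattern, which lies at distance 1 from the first centre and 3 from
the second, the hard assignment picks the first cluster, while the fuzzy
memberships come out at roughly 0.75 and 0.25.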

Hierarchical clustering algorithms produce a nested series of partitions
based on a criterion for merging or splitting clusters based on similarity.
Partitional clustering algorithms identify the partition that optimizes
(usually locally) a clustering criterion.
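
As a minimal sketch of a partitional algorithm, the Python code below
implements k-means, used here as a familiar example that the source does not
single out. Its clustering criterion is the sum of squared distances between
each point and its cluster centre, and the result depends on the initial
centres, which is why the optimum found is usually only local. The function
name, the iteration limit and the sample data are assumptions.

```python
import math


def k_means(points, centres, max_iterations=100):
    """Partitional clustering: alternate assignment and centre update until stable."""
    centres = [tuple(c) for c in centres]
    clusters = [[] for _ in centres]
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centre.
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        # Update step: move each centre to the mean of its members
        # (an empty cluster keeps its previous centre).
        new_centres = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centre
            for members, centre in zip(clusters, centres)
        ]
        if new_centres == centres:
            break  # no centre moved, so the partition is stable
        centres = new_centres
    return centres, clusters


# Hypothetical data instances and initial centres.
points = [(1, 1), (1, 2), (8, 8), (9, 8)]
final_centres, final_clusters = k_means(points, centres=[(0, 0), (10, 10)])
print(final_centres)
print(final_clusters)
```

Unlike the hierarchical schemes, this approach needs the number of clusters
in advance (here, the number of initial centres) and produces a single
one-level partition.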

2.5 Selecting Data Mining Techniques

It is important for the researcher to select a suitable searching method
using data mining techniques in order to accomplish the objective. The
researcher decided to review the three main data mining techniques:
classification, association and clustering. These techniques serve the same
overall objective of data mining but differ in their function and their
suitability for the system.

According to Tang et al. (2006), data mining is a technology that blends
traditional data analysis methods with sophisticated algorithms for
processing large volumes of data. It has also opened up exciting
opportunities for exploring and analyzing new types of data and for
analyzing old types of data in new ways.

Classification, which is the task of assigning objects to one of several
predefined categories, is a pervasive problem that encompasses many diverse
applications. Examples include detecting spam email messages based upon the
message header and content, categorizing cells as malignant or benign based
upon the results of MRI scans and classifying galaxies based upon their
shapes (Tang et al., 2006).

Association is useful for discovering interesting relationships hidden in
large data sets. The uncovered relationships can be represented in the form
of association rules or sets of frequent items. Many business enterprises
accumulate large quantities of data from their day-to-day operations. For
example, huge amounts of customer purchase data are collected daily at the
checkout counters of grocery stores. Retailers are interested in analyzing
the data to learn about the purchasing behavior of their customers. Such
valuable information can be used to support a variety of business-related
applications such as marketing promotions, inventory management and customer
relationship management. Association techniques discover patterns in large
transaction data and evaluate the discovered patterns in order to prevent
the generation of spurious results (Tang et al., 2006).

Clustering divides data into groups (clusters) that are meaningful, useful, or
both. If meaningful groups are the goal, then the clusters should capture the
natural structure of the data. The concept of clustering has been around for a
long time. It has several applications, particularly in the context of
information retrieval and in organizing web resources. The main purpose of
clustering is to locate information and, in the present-day context, to locate
the most relevant electronic resources. In database management, data clustering
is a technique in which information that is logically similar is physically
stored together. To increase the efficiency of search and retrieval in
database management, the number of disk accesses must be minimized. In
clustering, since objects with similar properties are placed in one class of
objects, a single access to the disk can retrieve the entire class. If the
clustering takes place in some abstract algorithmic space, we may group a
population into subsets with similar characteristics, and then reduce the
problem space by acting on only a representative from each subset.
Clustering is ultimately a process of reducing a mountain of data to
manageable piles. Examples include analyzing the large amounts of genetic
information that are now available, grouping search results into a small
number of clusters, identifying different types of depression and segmenting
customers into a small number of groups for additional analysis and
marketing activities. (Ravichandra, 2003)
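
The idea of grouping similar items and acting on one representative per group can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the file titles are hypothetical, similarity is measured with the Jaccard coefficient on words, and the greedy single-pass grouping is a deliberately simple stand-in rather than the algorithm of any particular tool.

```python
def tokens(text):
    """Lower-cased set of words in a title."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(records, threshold=0.3):
    """Greedy single-pass clustering.

    Each record joins the first cluster whose representative (its first
    member) is similar enough; otherwise it starts a new cluster.
    The threshold value is an assumption chosen for this example.
    """
    clusters = []  # list of (representative_tokens, members)
    for record in records:
        record_tokens = tokens(record)
        for representative, members in clusters:
            if jaccard(record_tokens, representative) >= threshold:
                members.append(record)
                break
        else:
            clusters.append((record_tokens, [record]))
    return [members for _, members in clusters]


# Hypothetical file titles, for illustration only.
FILES = [
    "parking license renewal 2009",
    "parking license application",
    "building permit application form",
    "building permit inspection report",
]
```

Calling `cluster(FILES)` places the two parking titles in one group and the two building titles in another, so a user scanning the result needs to look at two groups instead of four separate records.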

However, it is important for the researcher to identify the suitability of each
technique in order to implement a good searching method. The researcher
reviewed the techniques based on their definition, concept, functions,
suitability and examples given in several journals (refer to Table 2.1).

Table 2.1 : Differences of Classification, Association and Clustering
techniques

Definition
Classification   Data classification is the process which finds the common
                 properties among a set of objects in a database and classifies
                 them into different classes, according to a classification
                 model. (Chen et al., 1996)
Association      Association rules provide a useful mechanism for discovering
                 correlations among items belonging to customer transactions in
                 a market basket database. (Garofalakis et al., 1999)
Clustering       Clustering is the process of grouping physical or abstract
                 objects into classes of similar objects. (Chen et al., 1996)

Concept
Classification   Classification, which is the task of assigning objects to one
                 of several predefined categories, is a pervasive problem that
                 encompasses many diverse applications. (Tang et al., 2006)
Association      Association is useful for discovering interesting relationships
                 hidden in large data sets. (Tang et al., 2006)
Clustering       Clustering divides data into groups (clusters) that are
                 meaningful, useful, or both. (Ravichandra, 2003)

Functions
Classification   Classification is useful in the Web context to build taxonomies
                 and topic hierarchies on Web pages, and subsequently perform
                 context-based searches for Web pages relating to a specific
                 topic. (Garofalakis et al., 1999)
Association      It discovers patterns in large transaction data and evaluates
                 the discovered patterns in order to prevent the generation of
                 spurious results. (Tang et al., 2006)
Clustering       It helps the data miner to construct a meaningful partitioning
                 of a large set of objects based on a "divide and conquer"
                 methodology which decomposes a large scale system into smaller
                 components to simplify design and implementation.
                 (Chen et al., 1996)

Suitability
Classification   Develop an accurate description or a model for each class using
                 the features available in the data. (Chen et al., 1996)
Association      Discovering correlations among items belonging to customer
                 transactions in a market basket database for market analysis.
                 (Garofalakis et al., 1999)
Clustering       Increase the efficiency of search and retrieval in database
                 management. (Ravichandra, 2003)

Examples
Classification   Detecting spam email messages based upon the message header and
                 content, categorizing cells as malignant or benign based upon
                 the results of MRI scans and classifying galaxies based upon
                 their shapes. (Tang et al., 2006)
Association      Huge amounts of customer purchase data are collected daily at
                 the checkout counters of grocery stores. Retailers are
                 interested in analyzing the data to learn about the purchasing
                 behavior of their customers. (Tang et al., 2006)
Clustering       Analyze the large amounts of genetic information that are now
                 available, group the search result into a small number of
                 clusters, identify different types of depression and segment
                 customers into a small number of groups for additional analysis
                 and marketing activities. (Ravichandra, 2003)

According to the comparison above, after reviewing each technique based on
its definition, concept, functions, suitability and examples given in several
journals, the researcher found that clustering is the most suitable searching
method for the e-filing web-based system.

Although classification, association and clustering are similar in terms
of information retrieval, they differ in how the information is retrieved,
analyzed and delivered. Classification assigns objects to several predefined
categories in order to develop a model for each class using the features
available in the data. Association is useful for discovering correlations
among data in order to identify interesting relationships hidden in large
data sets, especially for market analysis. Clustering, however, groups
physical or abstract objects into classes of similar objects to provide a
simplified list of data. In other words, it divides data into groups that are
similar, meaningful and useful.

This is because partitioning a large set of data by clustering decomposes a
large search result into smaller components to simplify the content. It helps
users to review accurate search results that fulfill their needs and
expectations.

In terms of suitability, clustering increases the efficiency of search and the
retrieval of information in database management (Ravichandra, 2003). It
analyzes the search results to identify similarities between them and
provides a simplified list of results. Further information on why clustering
is suitable as the searching method in the e-filing web-based system is
discussed in Chapter 5 (Result and Findings).

2.6 Selecting Data Mining Tools

Data mining tools are used widely to solve real-world problems in
engineering, science and business. (Abbott, Matkovsky & Elder, 1998)

Nowadays, the number of data mining tools is increasing, and selecting an
effective tool has become more challenging. The data mining tool market has
become more crowded in recent years, with more than 50 commercial data mining
tools listed on the KDnuggets website (http://www.kdnuggets.com).
KDnuggets.com has been a leading resource for the data mining community since
1997, covering data mining and analytics news, tools, jobs, courses, data and
more.

Collier, Carey, Sautter and Marjaniemi (1999) proposed four categories of
criteria for selecting from among the assortment of commercially available
data mining tools, which are:

a. Performance
Performance (Table 2.2) is the ability to handle a variety of data
sources in an efficient manner. From a computational perspective,
hardware configuration has a major impact on tool performance.
Besides, some data mining algorithms are more efficient than others.
However, this category focuses on the qualitative aspects of a
tool's ability to easily handle data under a variety of hardware
configurations. The criteria that should be considered in this category
are platform variety, software architecture, heterogeneous data access,
data size, efficiency, interoperability and robustness.

Table 2.2 : Computational Performance Criteria (Collier et al., 1999)
Criteria                   Description
Platform Variety           Does the software run on a wide variety of computer
                           platforms? More importantly, does it run on typical
                           business user platforms?
Software Architecture      Does the software use client-server architecture or a
                           stand-alone architecture? Does the user have a choice
                           of architectures?
Heterogeneous Data Access  How well does the software interface with a variety of
                           data sources (RDBMS, ODBC, CORBA, etc.)? Does it
                           require any auxiliary software to do so? Is the
                           interface seamless?
Data Size                  How well does the software scale to large data sets?
                           Is performance linear or exponential?
Efficiency                 Does the software produce results in a reasonable
                           amount of time relative to the data size, the
                           limitations of the algorithm, and other variables?
Interoperability           Does the tool interface with other KDD support tools
                           easily? If so, does it use a standard architecture
                           such as CORBA or some other proprietary API?
Robustness                 Does the tool run consistently without crashing? If
                           the tool cannot handle a data mining analysis, does it
                           fail early or when the analysis appears to be nearly
                           complete? Does the tool require monitoring and
                           intervention or can it be left to run on its own?

b. Functionality
Functionality (Table 2.3) is the variety of capabilities, techniques
and methodologies that a tool provides for data mining. Software
functionality helps to assess how well the tool adapts to different
data mining problems. The criteria in this category are algorithmic
variety, prescribed methodology, model validation, data type
flexibility, algorithm modifiability, data sampling, reporting and
model exporting.

Table 2.3 : Functionality Criteria (Collier et al., 1999)
Criteria                   Description
Algorithmic Variety        Does the software provide an adequate variety of
                           mining techniques and algorithms including neural
                           networks, rule induction, decision trees, clustering,
                           etc.?
Prescribed Methodology     Does the software aid the user by presenting a sound,
                           step-by-step mining methodology to help avoid spurious
                           results?
Model Validation           Does the tool support model validation in addition to
                           model creation? Does the tool encourage validation as
                           part of the methodology?
Data Type Flexibility      Does the implementation of the supported algorithms
                           handle a wide variety of data types, continuous data
                           without binning, etc.?
Algorithm Modifiability    Does the user have the ability to modify and fine-tune
                           the modeling algorithms?
Data Sampling              Does the tool allow random sampling of data for
                           predictive modeling?
Reporting                  Are the results of a mining analysis reported in a
                           variety of ways? Does the tool provide summary results
                           as well as detailed results? Does the tool select
                           actual data records that fit a target profile?
Model Exporting            After a model is validated does the tool provide a
                           variety of ways to export the tool for ongoing use
                           (e.g., C program, SQL, etc.)?

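c. Usability
Usability (Table 2.4) concerns how well a tool accommodates different
levels and types of users. One problem with easy-to-use mining tools
is their potential misuse. The criteria that should be considered are
user interface, learning curve, user types, data visualization, error
reporting, action history and domain variety.

d. Ancillary Task Support
The fourth category covers the tasks that support data mining, such as
preparing the data. The criteria that should be considered are data
cleansing, value substitution, data filtering, binning, deriving
attributes, randomization, record deletion, handling blanks, metadata
manipulation and result feedback.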

Table 2.4 : Usability Criteria (Collier et al., 1999)
Criteria                   Description
User Interface             Is the user interface easy to navigate and
                           uncomplicated? Does the interface present results in a
                           meaningful way?
Learning Curve             Is the tool easy to learn? Is the tool easy to use
                           correctly?
User Types                 Is the tool designed for beginning, intermediate,
                           advanced users or a combination of user types? How
                           well suited is the tool for its target user type? How
                           easy is the tool for analysts to use? How easy is the
                           tool for business (end) users to use?
Data Visualization         How well does the tool present the data? How well does
                           the tool present the modeling results? Are there a
                           variety of graphical methods used to communicate
                           information?
Error Reporting            How meaningful is the error reporting? How well do
                           error messages help the user debug problems? How well
                           does the tool accommodate errors or spurious model
                           building?
Action History             Does the tool maintain a history of actions taken in
                           the mining process? Can the user modify parts of this
                           history and re-execute the script?
Domain Variety             Can the tool be used in a variety of different
                           industries to help solve a variety of different kinds
                           of business problems? How well does the tool focus on
                           one problem domain? How well does it focus on a
                           variety of domains?

Data mining tools are costly and are generally accompanied by a moderately
steep learning curve. Selecting the wrong tool is expensive in terms of both
wasted money and time. These categories for selecting data mining tools will
help practitioners avoid spending much time only to discover that a particular
tool does not provide the necessary solution. (Collier et al., 1999)
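
One way to apply such criteria in practice is to rate each candidate tool per category and combine the ratings with weights. The sketch below is a minimal illustration in Python; the tool names, weights and ratings are hypothetical values invented for this example, and the weighted average is an assumed scoring rule rather than a procedure quoted from the source.

```python
# Hypothetical weights (relative importance of each category).
WEIGHTS = {"performance": 2, "functionality": 4, "usability": 4}

# Hypothetical ratings on a 1 (poor) to 5 (excellent) scale.
RATINGS = {
    "Tool A": {"performance": 3, "functionality": 4, "usability": 5},
    "Tool B": {"performance": 5, "functionality": 3, "usability": 3},
}


def weighted_score(ratings, weights):
    """Weighted average of the category ratings for one tool."""
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


def rank_tools(all_ratings, weights):
    """Tools sorted from highest to lowest weighted score."""
    scores = {tool: weighted_score(r, weights) for tool, r in all_ratings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With these assumed numbers, `rank_tools(RATINGS, WEIGHTS)` places Tool A first, because usability and functionality carry more weight than raw performance. Changing the weights to reflect a different priority can reverse the ranking, which is why the weights should be agreed on before the tools are rated.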

Bialynicka (2008) stated that the data mining tools that suit clustering
include:
• Scatter
• Grouper
• Carrot²
• Vivisimo

Scatter is designed for browsing and supports online clustering based on two
novel clustering algorithms, buckshot and fractionation. Buckshot is fast for
online clustering, while fractionation is accurate for the offline initial
clustering of the entire set. (Bialynicka, 2008)

Grouper is suitable for online purposes and operates on query result snippets.
It clusters together documents with large common subphrases.
(Bialynicka, 2008)
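
The idea of clustering documents by common subphrases can be illustrated with a simplified sketch. The Python code below is an assumption-laden stand-in, not Grouper's actual algorithm: it treats every run of two consecutive words as a phrase and groups the snippets that share a phrase. The snippets and function names are hypothetical.

```python
from collections import defaultdict


def phrases(snippet, n=2):
    """All runs of n consecutive words in a snippet, lower-cased."""
    words = snippet.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_phrase_clusters(snippets, n=2, min_docs=2):
    """Map each phrase shared by at least min_docs snippets to their indexes."""
    index = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        for phrase in phrases(snippet, n):
            index[phrase].add(doc_id)
    return {p: sorted(ids) for p, ids in index.items() if len(ids) >= min_docs}


# Hypothetical result snippets, for illustration only.
SNIPPETS = [
    "parking license renewal form",
    "renewal of parking license",
    "building permit application",
    "new building permit",
]
```

Here `shared_phrase_clusters(SNIPPETS)` groups snippets 0 and 1 under "parking license" and snippets 2 and 3 under "building permit". A useful side effect of phrase-based grouping is that the shared phrase doubles as a readable label for the cluster.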

Carrot² is a component framework that allows components to be substituted for
input (from other search engines), filtering (stemming, distance measure and
clustering) and output of the results. (Bialynicka, 2008)

Vivisimo is a commercial online clustering tool that supports hierarchical and
conceptual clustering techniques. (Bialynicka, 2008)

However, for this research project, the researcher used Carrot², a free tool
that is available for learning purposes.

Carrot² is an open source search results clustering engine. It can
automatically organize small collections of documents, e.g. search results,
into thematic categories. (Carrot², 2010)

Apart from two specialized document clustering algorithms, Carrot² offers
ready-to-use components for fetching search results from various sources
including YahooAPI, GoogleAPI, MSN Live API, eTools Meta Search,
Lucene, SOLR, Google Desktop and more. Carrot² is implemented in Java,
but it easily integrates with non-Java software, such as PHP, Ruby or C#.
(Carrot², 2010)

2.7 Summary

This chapter provides an overview of e-filing and data mining techniques based
on a literature review of several journals. Rules in e-filing, an overview of
data mining and challenges in data mining are discussed. The researcher also
reviewed three basic data mining techniques, which are classification,
association and clustering. After that, the researcher compared them and
selected the suitable data mining technique for the searching method in the
e-filing web-based system (refer to Table 2.1). Based on the comparison in
Table 2.1, the researcher found that clustering is the most suitable searching
method for the e-filing web-based system. In addition, after reviewing several
journals regarding data mining tools, the researcher chose Carrot² (an open
source search results clustering engine), a free tool that is available for
learning purposes, for this research project.

The next chapter discusses the research approach and the methodology for the
research project.

CHAPTER 3

RESEARCH APPROACH AND METHODOLOGY

3.1 Introduction

This chapter describes the methodology and approaches that were used in the
research, from problem identification until development of the system. To
achieve the objectives of this project, the right approach must be applied to
reach the best conclusions. This research used five major steps to develop a
prototype of the e-filing web-based system using data mining techniques:
problem identification and planning, requirement gathering, requirement
analysis, design model and develop prototype. An overview of this methodology
is shown in Figure 3.1.

Figure 3.1 : Overview of Research Approach and Methodology

3.2 Problem Identification and Planning

This phase identifies the goal, scope, budget, schedule, technology and
system development process, methods and tools to ensure that everything is
in the right place. However, it depends on what the researcher wants to plan
according to the stakeholders' requirements.

Before starting to plan the project, the researcher should know the current
situation and the problems that the old system has. An understanding of
potential problems is essential to make the development successful. After the
researcher identifies the problems, the scope of the project is defined. The
goal must be determined and the objectives of the project must address the
problems that have been identified. After analyzing all the problems and
identifying what tasks need to be done, a measurable and achievable project
plan is scheduled using Microsoft Project.

For this research, Microsoft Project is used to produce a Gantt chart (refer
to Appendix A - Project Planning) as a guideline for the researcher in order
to finish the project. This phase involves the following steps:

a. Discuss the current problem with staff at Majlis Daerah Kerian
The current problems need to be identified in order to solve them in
the next task.

b. Identify goal, objective, scope, and significance of research
The goal, objective, scope and significance of the research need to be
clearly defined.

c. Plan related task
Plan the related tasks using Microsoft Project to schedule all the
planning. Time must be allocated carefully and every task must be
stated to ensure the completion of the research.

3.3 Requirement Gathering

Requirement gathering is the process of gathering all the information that is
needed to develop the system. In this phase, a method of data collection has
been applied. This phase identifies the concepts and requirements that will
be applied in developing the e-filing web-based system. For this research,
there are two types of data collection, which are:

3.3.1 Primary Data

Primary data is gathered from original sources such as interviews,
questionnaires and observation. For this research, the researcher used
data from interviews with staff at Majlis Daerah Kerian. Interviewing
is a technique used to gain detailed information regarding the subject
of interest of this research. This includes the software and hardware
used and also the problems that arise in the current system, so that
the requirements can be identified.

Table 3.1 below shows the people involved in the interview sessions for
gathering the requirements of the e-filing web-based system.

Table 3.1 : Information of people involved in the interview
Respondent Name                    Department
Mr. Gobibaskaran A/L Govindaraju   Head of IT Department, Majlis Daerah Kerian.
Puan Shalina Mat Piah              Administrative Assistant, Majlis Daerah Kerian.

The main advantage of interviews is that the answers of the
interviewees are more spontaneous, without extended reflection. This
can be done by using a top-down approach where the interviewer
starts with a general question and progresses to specific questions about
the task. Interviews should be planned in advance by defining a set of
interview questions to be asked. This not only assists in ensuring
consistency between interviews conducted with different interviewees
but also helps to focus on the purpose of the interview session.

The deliverable of this activity is the set of identified requirements
needed for the e-filing web-based system.

3.3.2 Secondary Data

The secondary data for this research is collected from many resources
such as articles, journals, books and other related academic
publications about e-filing and data mining. It is important for
gaining a deeper understanding of e-filing and data mining.

3.4 Requirement Analysis

This is the next stage after all data has been collected in the requirement
gathering phase. The primary data collected needs to be analyzed to define
the system requirements for developing the e-filing web-based system. The
collected data needs to be studied and analyzed properly in order to have
accurate, reliable and relevant information during the development. These
requirements helped the researcher to identify the use cases that produce the
system functions, and finally the researcher produced the Software
Requirement Specification (SRS) documentation.

Besides, the secondary data collected during the requirement gathering phase
is useful for identifying a suitable searching method using data mining
techniques. The researcher compared three popular data mining techniques
(association, classification and clustering) in order to identify a suitable
searching method. The tool used during this phase is Rational Rose.

3.5 Design Model

The model is designed and determined before proceeding with the actual
construction of the database and system. The system interface, classes,
objects and their relations are designed using Rational Rose. All the
diagrams related to this research, including the class diagram, use case
diagram and sequence diagram, are designed based on the results of the
requirement analysis phase.

After all the objects and classes are illustrated clearly with their
attributes and methods, the database is developed. This activity is
accomplished using the MySQL database. At the end of this activity, a
detailed design (database model) is produced. The deliverable of this phase
is documented in the Software Design Document (SDD).
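
To illustrate how classes from a class diagram can be carried over into a database model, the following is a minimal sketch in Python. It is not the project's actual schema: the table names, column names and sample rows are hypothetical, and the standard-library sqlite3 module stands in for MySQL so that the example runs without a database server.

```python
import sqlite3

# sqlite3 stands in for MySQL here; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE file (
    id            INTEGER PRIMARY KEY,
    title         TEXT NOT NULL,
    status        TEXT NOT NULL,
    location      TEXT,
    department_id INTEGER NOT NULL REFERENCES department(id)
);
""")

# Illustrative sample data only.
conn.execute("INSERT INTO department (id, name) VALUES (1, 'Building Unit')")
conn.execute(
    "INSERT INTO file (title, status, location, department_id) VALUES (?, ?, ?, ?)",
    ("Building permit application", "available", "Shelf B2", 1),
)


def find_files(connection, keyword):
    """Return (title, status, location, department) rows whose title contains keyword."""
    rows = connection.execute(
        "SELECT f.title, f.status, f.location, d.name "
        "FROM file f JOIN department d ON d.id = f.department_id "
        "WHERE f.title LIKE ? ORDER BY f.title",
        (f"%{keyword}%",),
    )
    return rows.fetchall()
```

In this sketch each class becomes a table, each attribute becomes a column, and the association between a file and its department becomes a foreign key. A call such as `find_files(conn, "permit")` then returns the status and location of matching files without a manual search.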

3.6 Develop Prototype

The develop prototype phase is concerned with building the system using the
appropriate development technologies. In this phase, the researcher develops
the prototype of the e-filing web-based system using data mining techniques.
Apache is used as the web server, MySQL as the database server, and PHP as the
programming language for development. In order to write programming code,
Dreamweaver is used as a workspace and Carrot² as a data mining tool. At the
end of this phase, the e-filing prototype system using data mining techniques
is produced.

3.7 Summary

The research methodology describes the research strategy that is used in this
research project. For this research project, a plan of action is laid out that
shows how the problem will be investigated, what information will be
collected using which method and how this information will be analyzed to
come to the conclusion. It consists of problem identification and planning,
requirement gathering, requirement analysis, design model and develop
prototype.

The methodology stated above was followed to develop the e-filing web-based
system in order to achieve the project's objectives as well as to fulfill the
requirements specified by the user. With an understandable and achievable
methodology, the project was carried out in a proper manner and consequently
completed effectively.

The next chapter discusses the prototype construction for the research project.

CHAPTER 4

PROTOTYPE CONSTRUCTION

4.1 Introduction

This chapter explains in depth and in detail the construction of the prototype
of the e-filing web-based system. It explains the results and the ways the
prototype achieves the project objectives.

4.2 Software Requirements

Specified below is the list of software tools that were selected for the
development process. These include the operating system and other applications
that are required for the system to be developed and deployed.

4.2.1 Software Tools

Table 4.1 : Software Tools Specifications
No.  Software                          Type
1.   Windows XP SP2                    Operating System (OS)
2.   MySQL                             Database Server
3.   PHP                               Programming Platform
4.   Apache                            Web Server
5.   Rational Rose Enterprise Edition  Unified Modeling Language Software
6.   Adobe Photoshop CS3               Graphics Design Software
7.   Macromedia Dreamweaver MX 2004    Workspace Software
8.   Carrot²                           Open source framework for building search
                                       clustering engines

4.2.2 Software Tools Installation

Referring to Table 4.1, the main tools, which are Apache, MySQL Server
version 5, Rational Rose Enterprise Edition, Adobe Photoshop CS3,
Macromedia Dreamweaver MX 2004 and Carrot², are explained further
below.

a. Apache

The Apache HTTP Server, commonly referred to as Apache, is
web server software notable for playing a key role in the initial
growth of the World Wide Web. In 2009 it became the first
web server software to surpass the 100 million web site
milestone. Apache was the first viable alternative to the
Netscape Communications Corporation web server (currently
known as Sun Java System Web Server), and has since
evolved to rival other Unix-based web servers in terms of
functionality and performance. Apache supports a variety of
features, many implemented as compiled modules which
extend the core functionality. These can range from server-
side programming language support to authentication schemes.
Some common language interfaces support Perl, Python, Tcl,
and PHP. Apache provides a variety of MultiProcessing
Modules (MPMs) which allow Apache to run in a process-
based, hybrid (process and thread) or event-hybrid mode, to
better match the demands of each particular infrastructure.
This implies that the choice of correct MPM and the correct
configuration is important. Where compromises in
performance need to be made, the design of Apache is to
reduce latency and increase throughput, relative to simply
handling more requests, thus ensuring consistent and reliable
processing of requests within reasonable time-frames. (Apache,
2002)

b. MySQL Version 5

MySQL is the world's most popular open source database
software, with over 100 million copies of its software
downloaded or distributed throughout its history. With its
superior speed, reliability, and ease of use, MySQL has
become the preferred choice for Web, Web 2.0, SaaS, ISV,
Telecom companies and forward-thinking corporate IT
Managers because it eliminates the major problems associated
with downtime, maintenance and administration for modern,
online applications. (MySQL, 2009)

MySQL server was chosen as the storage for the data in the e-filing
web-based system because of its consistency, fast
performance, high reliability and ease of use. The researcher
only needs to follow the instructions in the installation wizard
until the installation process is completed. Once the installation is
completed, MySQL Server Version 5 can be used in the
development of the e-filing web-based system.

c. Rational Rose Enterprise Edition

According to IBM Corporation (2006), Rational Rose enables
the creation of the following types of UML-based diagrams:
activity diagrams, class, component, deployment, sequence,
state chart, use case, collaboration, physical storage and
deployment, and physical data and tables.

The researcher used Rational Rose Enterprise Edition to create
the UML models for the e-filing web-based system, which consist
of use case, sequence and class diagrams.

d. Adobe Photoshop CS3

Photoshop CS3 is part of Adobe's Creative Suite (along with a
host of other products such as Illustrator). It is Adobe's
flagship bitmap editor, and as a professional-level editor for fine
art photography there is no viable alternative. Photoshop is the
industry standard because of its flexibility and extensibility (it
supports a wide range of third-party plug-ins), its support for
color management, and the robustness of its tools. (Levy,
2007)

The researcher used Adobe Photoshop CS3 to design the interface
of the e-filing web-based system, which consists of the header,
logo and system layout.

e. Macromedia Dreamweaver MX 2004

Dreamweaver is a powerful web page creation and web site
management tool. It offers numerous, sophisticated functions
that can be used to create professional quality web sites.
Because of this, it’s one of the most popular web authoring
tools among web designers. (San Diego State University,
2004)

The researcher used Macromedia Dreamweaver MX 2004 as the
workspace software to write the PHP code for the e-filing
web-based system.

f. Carrot²

According to Carrot² (2010), Carrot² is an open source search
results clustering engine. It can automatically organize small
collections of documents, e.g. search results, into thematic
categories.

Apart from two specialized document clustering algorithms,
Carrot² offers ready-to-use components for fetching search
results from various sources including YahooAPI, GoogleAPI,
MSN Live API, eTools Meta Search, Lucene, SOLR, Google
Desktop and more. Besides, Carrot² is implemented in Java,
but it easily integrates with non-Java software, such as PHP,
Ruby or C#.

The researcher used Carrot², which is an open source framework,
to build a search results clustering engine. It organizes the
search results into topics, fully automatically and without
external knowledge such as taxonomies or preclassified content.

4.3 Hardware Requirements

In developing and deploying the e-filing web-based system, the minimum
hardware requirement is a standard personal computer with an Intel or AMD
processor, a standard motherboard, an 80 GB hard disk and 512 MB of DDR RAM.
No additional external device is needed for this project.

4.4 Development Phase

Based on the research methodology depicted in Figure 3.1, the system
construction process involves the last three phases of the research
methodology, which are Requirement Analysis, Design and Development. Each
process involved in these phases is explained further below.

4.4.1 Requirement Analysis Phase

In this construction process, the researcher analyzed the requirements
in more detail. The researcher drew the use case diagram using
Rational Rose, which focuses on a high-level, user-centered view of
the system. This supports the analysis of the class diagram, which is
the primary model for describing the internal structure and behavior
of the system. Furthermore, each use case is described thoroughly,
stating the flows involved within it, and the sequence diagrams are
also produced. As a result, a summary of requirements for the
development of the e-filing web-based system is fully constructed.
For details on the requirements, please refer to Appendix D: Software
Requirement Specification (SRS).

4.4.2 Design Phase

The design phase is concerned with specifying an e-filing web-based
system that will meet the requirements. The design of this project
takes place at two main levels, which are system design and detailed
design.

a. System Design

System design focuses on architectural aspects that affect
the entire system (Bennett, McRobb & Farmer, 2006). The
system design of the e-filing web-based system involved
setting standards, such as the design of the human-computer
interface and the coding standard, and selecting a suitable
database management system for data storage. This project
uses MySQL as the database management system and PHP as
the programming language.

b. Detailed Design

Detailed design addresses the design of classes and the
detailed working of the system. It was based on the
requirements defined in the Software Requirement
Specification (SRS) and follows an object-oriented design
approach. In an object-oriented system, the detailed design is
concerned with the design of objects. Object design is mainly
concerned with the specification of attribute types, how
operations function, and how objects are linked to other objects
(Bennett et al., 2006). For a detailed description of the class
diagram, please refer to Appendix E: Software Design Document (SDD).

4.4.3 Development Phase

In this development phase, a series of development tasks were
performed. They consist of constructing the database, establishing
its connection and coding. These tasks are explained further below.

a. Coding

This task was done concurrently with the enhancement of the
interfaces. The necessary code was added to the programs to
enable the interfaces to function correctly. Figure 4.1 shows
one of the code segments that was constructed during
development using Macromedia Dreamweaver MX 2004.

Figure 4.1 : Coding index.php

b. Data Mining Techniques

This task was done concurrently with the enhancement of the
e-filing web-based system with a searching method using data
mining techniques. Clustering was selected as the suitable data
mining technique for the searching method. The researcher used
Carrot², which is an open source framework for building search
clustering engines. The necessary code was added to the
system to cluster search results.

c. Interface

Figure 4.2 shows the main page of the system. This page
appears after an authorized user (staff) logs in to the system.
This page shows the list of menus for staff to operate the
system.
(Refer to Appendix F - Description of Interface System)

Figure 4.2 : The main page interface of e-filing

4.5 Summary

This chapter explained in detail the construction of the E-Filing web-based
system. The researcher reviewed the software tools selected for the
development process. These include the operating system and the other
applications required for the system to be developed and deployed, namely
Dreamweaver MX 2004, Apache, MySQL, Rational Rose and Carrot². In addition,
the researcher set out the minimum hardware requirements for developing and
deploying the E-Filing web-based system. The researcher also reviewed the
series of development tasks that were performed, which consist of the
requirement analysis, design and development phases.

The next chapter discusses the result and findings for the research project.

CHAPTER 5

RESULT AND FINDINGS

5.1 Introduction

This chapter explains how the collected data was organized, analyzed and
finalized for use in the development phase of the research. The results of
the research are explained in depth, including the findings gathered from
the interviews and discussions.

5.2 Interview Results

In order to generate good interview questions, the researcher followed a
model for navigating interview processes in requirements elicitation (refer
to Figure 5.1).

Figure 5.1 : A Model for Navigating Interview Processes in Requirements Elicitation

To develop a Software Requirements Specification (SRS) of good quality,
it is important to elicit requirements correctly from stakeholders.
Interview sessions were conducted with Encik Gobibaskaran A/L
Govindaraju, the Head of Information Technology at Majlis Daerah Kerian,
and Puan Shalina Mat Piah, the Administrative Assistant at Majlis Daerah
Kerian. The interview questions fall into two categories. The first
category focuses on the current problems faced by staff in Majlis
Daerah Kerian; all the necessary data on the current problems was
collected through this category. The second category focuses on the
functional requirements for the system to be developed. The sample interview
questions can be found in Appendix C.

5.2.1 Current Problems

Interviewee :
Puan Shalina Mat Piah, Administrative Assistant,
Majlis Daerah Kerian.

The results gained from the first category of interview questions
are presented in Table 5.1 below.

Table 5.1 : The problems that have been identified from the interviews.

PQ.1
Researcher  : Is the current manual system easy and comfortable for you?
Interviewee : No.

PQ.2
Researcher  : Please describe the current manual system for managing and
              searching files.
Interviewee : It involves many steps:
              - Search for the number of the required file in the log book.
              - Determine the file name from the file number.
              - Check for the needed file on many big shelves, which takes
                a long time.
              - Survey each staff member's table, or other departments in
                Majlis Daerah Kerian, if the file is not on the shelf.

PQ.3
Researcher  : Is it easy to identify the suitable files manually according
              to your requirement?
Interviewee : No.

PQ.4
Researcher  : Why do you think it is not easy to identify the suitable
              files manually?
Interviewee : - It is difficult to search for the suitable files.
              - It is difficult to know the status of the files.
              - It requires a long time.
              - There are thousands of files on the shelf.
              - Sometimes files are interchanged between departments.

PQ.5
Researcher  : In your opinion, is it important for MDK to have a web-based
              system that will act as an information center for staff to
              gather information about the status of the files?
Interviewee : Yes, of course.

5.2.2 Functional Requirements

Interviewee :
Encik Gobibaskaran A/L Govindaraju, Head of IT Department,
Majlis Daerah Kerian.

The second category of the interview focuses on the functional
requirements of the system. The requirements and suggestions
gathered from the interviews are presented in Table 5.2 below.

Table 5.2 : The requirement and suggestion that had been identified from the
interviews

RQ.1
Researcher  : How many types of user are involved in the system?
Interviewee : Three: Administrator, Manager and Staff.

RQ.2
Researcher  : What do you think the E-Filing web-based system should have?
Interviewee : - Store general staff information.
              - Store file information.
              - Store the status and location of the files.
              - Implement automated searching to identify suitable files.

RQ.3
Researcher  : What is the role of the Administrator, Manager and Staff in
              the system?
Interviewee : - Admin : handle user accounts, view and delete files.
              - Manager : handle user information, maintain files and
                delete staff.
              - Staff : handle user information and maintain files.

RQ.4
Researcher  : What is your suggestion for the language used to develop the
              system?
Interviewee : Use an open source language that suits any platform, such as
              PHP.

RQ.5
Researcher  : What is your suggestion for the database used to develop the
              system?
Interviewee : MySQL database.

Based on Table 5.2, several processes for the system were
identified. These requirements describe the functionality of the
e-filing web-based system. They were collected and analyzed
to produce the new system.

5.3 Use Case Diagram

[Use case diagram. Actors: Admin, Manager, Staff. Use cases: Maintain User
Account, View Files, Delete Files, Validate User, Maintain User Information,
Maintain Files Information, Maintain Customer Information, Delete Staff.]

Figure 5.2 : Use Case Diagram for E-Filing web-based system

Figure 5.2 above shows the use case diagram for the e-filing web-based
system. It illustrates the functionality available to the administrator,
manager and staff. First, the admin, manager and staff must log in to the
system, and they must be registered before they can use it. Once logged in,
the admin can maintain user accounts, view files and delete files. A manager
can maintain user information, file information and customer information,
and delete staff. Staff can maintain user information, file information and
customer information.
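
The role-to-use-case mapping described above can be sketched as a simple
lookup. This is an illustrative sketch only, not the code of the prototype:
the names PERMISSIONS and is_allowed are hypothetical, and the check assumes
that Validate User (login) has already been performed.

```python
# Use cases available to each role, as described for Figure 5.2.
# Every role goes through Validate User (login) before any of these.
PERMISSIONS = {
    "admin": {"Maintain User Account", "View Files", "Delete Files"},
    "manager": {
        "Maintain User Information",
        "Maintain Files Information",
        "Maintain Customer Information",
        "Delete Staff",
    },
    "staff": {
        "Maintain User Information",
        "Maintain Files Information",
        "Maintain Customer Information",
    },
}


def is_allowed(role, use_case, logged_in):
    """Return True if a logged-in user with this role may run the use case."""
    if not logged_in:
        return False
    # Unknown roles get an empty permission set.
    return use_case in PERMISSIONS.get(role, set())
```

For example, is_allowed("manager", "Delete Staff", True) is true, while the
same request from a staff account is refused, which matches the single
difference between the two roles in the diagram.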

The use cases are described in Table 5.3.

Table 5.3 : Description of Use Case diagram

Use Case    : Maintain User Account
Description : Used by the administrator to update and delete the accounts
              of users of the system.

Use Case    : View Files
Description : Used by the administrator to view files from all departments
              in Majlis Daerah Kerian.

Use Case    : Delete Files
Description : Used by the administrator to delete files from all
              departments in Majlis Daerah Kerian.

Use Case    : Validate User
Description : Used by the administrator, manager and staff to log in to the
              system.

Use Case    : Maintain User Information
Description : Used by the manager and staff to register and to update their
              information.

Use Case    : Maintain Files Information
Description : Used by the manager and staff to add new files, update files
              and delete files.

Use Case    : Maintain Customer Information
Description : Used by the manager and staff to add new customers, update
              customers and delete customers.

Use Case    : Delete Staff
Description : Used by a manager to delete staff who do not belong to their
              department.

5.4 Class Diagram

[Class diagram. Five entity classes, each with a boundary (form) class and a
control class:
  advisor  (PK advisor_no) : advisor_ic, advisor_name, advisor_hp,
                             advisor_email, dept_name
  login    (PK user_name)  : user_password, user_level, user_dept
  file     (PK file_id)    : file_name, file_status, file_remark, open_date,
                             update_date, staff_no, dept_name
  staff    (PK staff_no)   : staff_ic, staff_name, staff_add1, staff_add2,
                             staff_city, staff_postcode, staff_state,
                             staff_hp, staff_email, dept_name, advisor_no
  customer (PK cust_id)    : file_id, cust_ic, cust_name, cust_add1,
                             cust_add2, cust_city, cust_postcode, cust_state,
                             cust_phone, staff_no
Control classes (advisor_control, login_control, file_control, staff_control,
customer_control) hold operations such as search_files(), set_file_detail(),
set_file_update(), remove_file() and validate_user(). Form classes hold add,
update, delete and display operations. Associations shown: validate, manage,
have.]

Figure 5.3 : Class Diagram for E-Filing web-based system

Figure 5.3 is the class diagram for the e-filing web-based system. A class
diagram is a type of static structure diagram. It shows the system's
classes, their attributes, and the relationships between the classes.
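
As an illustration of how an entity class and a control operation from the
class diagram relate, the sketch below models the file entity and a search
operation. It is a sketch under assumptions, not the prototype's code: the
diagram gives no attribute types, so text fields are used throughout, and the
matching rule inside search_files is assumed. The sample records are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class File:
    """Mirrors the attributes of the file entity class in Figure 5.3."""
    file_id: str       # primary key (<<PK>>)
    file_name: str
    file_status: str
    file_remark: str
    open_date: str     # kept as text; the diagram gives no attribute types
    update_date: str
    staff_no: str      # assumed to link the file to the staff who manage it
    dept_name: str


def search_files(files: List[File], keyword: str) -> List[File]:
    """Assumed behaviour of file_control.search_files(): a case-insensitive
    match of the keyword against the file name."""
    needle = keyword.lower()
    return [f for f in files if needle in f.file_name.lower()]


# Made-up sample records for illustration.
files = [
    File("F001", "Hawker licence 2009", "Available", "", "2009-01-05",
         "2009-06-01", "S01", "License and Parking Unit"),
    File("F002", "Building plan Lot 12", "On loan", "With Building Unit",
         "2008-11-20", "2010-02-14", "S02", "Building Unit"),
]
matches = search_files(files, "licence")
```

Keeping the status and remark on the entity is what lets a search answer the
question staff raised in the interviews: where a file is and whether it is
available, without going to the shelf.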

5.5 Clustering as the Suitable Searching Method

5.5.1 Introduction

For this research project, it is important for the researcher to select
a suitable searching method based on data mining techniques. The
researcher decided to review three main data mining techniques:
classification, association and clustering. These techniques share the
same overall objective of data mining but differ in their function
and their suitability for the system.

The researcher reviewed the techniques based on their definition, concept,
functions, suitability and the examples given in several journals. (Refer to
Table 2.1 in Chapter 2, Literature Review.)

Based on the comparison in Table 2.1, which covers the definition, concept,
functions, suitability and published examples of each technique, the
researcher found that clustering is the suitable searching method for the
e-filing web-based system.

5.5.2 Why Clustering Search Result

This decision is supported by several journals that identify clustering as
a suitable searching method. According to Zhang, Xie and Wu
(2006), clustering organizes the search results into several clustered
collections, which makes it easy for users to locate the valuable
search results that they really need.

Aliakbary, Khayyamian and Abolhassani (2008) stated that clustering
search results helps the user to get an overview of the returned results and
to focus on the desired clusters. Most search result clustering methods use
the title, URL and snippets returned by a search engine as the source of
information for creating the clusters.

According to Lipai (2008), clustering the results of search tools means
grouping them into object classes that are constructed from the
characteristics of the search results. The purpose is to simplify the
user's work in retrieving the information needed and to help the user find
better quality results faster.

Bialynicka (2008) stated that clustering organizes search results
into groups, so that different groups correspond to different user
needs. This is because a ranked list is not enough: documents
pertaining to different topics cannot be compared. Besides, there are
relationships between the results that can be used to cluster
the search results.
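
One simple way to make use of relationships between results is to measure
how many terms two result snippets share. The sketch below uses Jaccard
similarity and a single-pass grouping; this technique is chosen here only for
illustration and is not taken from the cited works. The threshold value, the
function names and the sample snippets are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two term sets: shared terms over all terms."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_by_similarity(snippets, threshold=0.3):
    """Single-pass grouping: put each snippet in the first group whose
    first member is similar enough, otherwise start a new group.

    Returns a list of groups, each a list of indexes into snippets.
    """
    groups = []
    term_sets = [set(s.lower().split()) for s in snippets]
    for i, terms in enumerate(term_sets):
        for group in groups:
            if jaccard(terms, term_sets[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group was similar enough.
            groups.append([i])
    return groups


# Made-up snippets for illustration.
snippets = [
    "building plan approval shop lot",
    "building plan approval terrace house",
    "assessment tax arrears notice",
    "assessment tax appeal",
]
groups = group_by_similarity(snippets)
```

With these snippets the two building-plan results share three of seven
distinct terms and fall into one group, while the two assessment results form
another, so results on different topics are no longer interleaved in one list.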

5.5.3 Examples of Clustering Search Result

Jasco (2007) gives an example of the usefulness of clustering techniques in
a search result list. Figure 5.4 below shows Google’s one dimensional
result list without clustering techniques. Using the keywords “clustering
search result”, Google returns about 15,500,000 results, a list that is
large and difficult to choose from.

Figure 5.4 : Google’s One Dimensional Result List

Figure 5.5 below shows a good search result list produced with a clustering
technique. Using the same “clustering search result” keywords as in
Figure 5.4 above, it returns only about 194 results, a list that is more
accurate, simple and easy to choose from.

Figure 5.5 : Good clustering result list

Figure 5.6 below shows a search result list with a clustering technique
that is available on the World Wide Web (http://search.carrot2.org).

Figure 5.6 : Good clustering result list from http://search.carrot2.org

5.5.4 Clustering Search Result from e-filing web-based system

Figure 5.7 below shows the search result list with the clustering technique
that is available in the e-filing web-based system.

Figure 5.7 : Good clustering result list from e-filing web-based system

Figure 5.8 below shows the data mining tool provided by Carrot²,
an open source framework for building search clustering
engines. The necessary code was added to the system to cluster
search results.

Figure 5.8 : Data Mining Tool by Carrot²

5.6 Summary

In this chapter, the researcher explained how the collected data was
organized, analyzed and finalized for use in the development phase of the
research. The researcher analyzed the results of interviews with two staff
members of Majlis Daerah Kerian in terms of their current problems and the
functional requirements for the e-filing web-based system. The researcher
also discussed why clustering was selected as the suitable searching method
for the system, citing several journals and examples that support clustering
as a suitable method for grouping search results, and presented the
clustered results from the e-filing web-based system.

The next chapter discusses the conclusion and recommendations for the
research project.

CHAPTER 6

CONCLUSION AND RECOMMENDATIONS

6.1 Introduction

This chapter summarizes what the researcher has done, from
defining the objectives to obtaining the findings through developing
the prototype of the e-filing web-based system using data mining techniques.
It also concludes the report for this project and sets out the limitations
of the software and recommendations for those who wish to pursue
research on the development of the e-filing web-based system.

6.2 Conclusions

In developing the prototype of the e-filing web-based system using data
mining techniques, the researcher achieved all the objectives by following
the defined research approach and methodology, which draws on theoretical
findings (secondary data) and data findings (primary data). Achieving these
objectives is hoped to provide solutions to the current problems in Majlis
Daerah Kerian, Parit Buntar, Perak.

The first objective of the research project is to identify the requirements
for e-filing at Majlis Daerah Kerian. This objective was achieved through
requirement gathering, by conducting interview sessions with
staff in Majlis Daerah Kerian to learn the current problems and the
functional requirements for the e-filing web-based system. The deliverable
for this objective has been documented in Appendix D:
Software Requirement Specification (SRS).

The second objective of the research project is to identify the searching
method based on data mining techniques. For this phase, the researcher
reviewed many resources, such as articles, journals, books and other
academic publications on e-filing and data mining, in order to gain a
deeper understanding of both. This secondary data was used to identify a
suitable searching method based on data mining techniques.
The researcher compared three popular data mining techniques
(association, classification and clustering) to identify the suitable
technique for the searching method in the e-filing web-based system. This
objective was achieved when the researcher found that clustering is the
suitable searching method for the e-filing web-based system.

After the second objective was achieved, the research proceeded with the
third objective of designing the e-filing web-based system. This objective
was achieved through the design stage, which comprises system design and
detailed design. The system design highlights the importance of interface
design and of human computer interface characteristics through the proper
choice of colors, buttons and fonts. In addition, an overall system
structure was produced to illustrate how the system as a whole works. The
detailed design addresses the design of the classes and the detailed
workings of the system, describing the attributes, operations and classes.
The deliverables for the third objective have been documented in Appendix E:
Software Design Document (SDD).

The fourth objective of this project is to demonstrate the e-filing
web-based system using the identified data mining technique. The fourth
objective builds on the three objectives already achieved. It follows the
project methodology, which consists of gathering and analyzing requirements
and then designing a model that follows the user requirements. Finally, the
prototype was developed by translating the design into program code using
the selected programming platform, database server, web server and data
mining technique. Thus, the last objective has been realized.

The e-filing web-based system developed for Majlis Daerah Kerian is
expected to give staff an interactive environment in which to choose
the files that meet their requirements. It is also expected to help staff
identify the files they need more accurately and faster, as a result of
using a suitable searching method based on the selected data mining
technique. The system is further expected to become an information center
where staff in Majlis Daerah Kerian can gather information about the status
of the files.

Although all the objectives have been achieved, the e-filing web-based
system using a data mining technique is far from complete and has its own
limitations. There are still many improvements that can be considered to
enhance this project. The limitations and recommendations for this project
are discussed below.

6.3 Limitations

The project encountered a number of limitations while in progress. The
limitations are as follows :

a. The interview sessions for gathering information about the current
problems and functional requirements were conducted only with the Head
of Information Technology and the Administrative Assistant of Majlis
Daerah Kerian. Interviewing only two people provides limited
information about the requirements.

b. Due to time constraints, the researcher developed only a prototype of the
e-filing web-based system, which is intended for demonstration purposes.

c. There are many journals on data mining techniques, but the
researcher had difficulty understanding each journal because the
subject was unfamiliar.

d. There are three different data mining techniques, and the researcher had
to select the technique that best suits the objective. The researcher
needed to study each data mining technique properly and find related
journals that support the findings.

e. There are a large number of data mining tools available, but not all
tools support every kind of data mining technique. The researcher therefore
needed to study the tools based on their function and usability with the
selected data mining technique. Furthermore, the tool used in this
research was new to the researcher, so time was required to become familiar
with it.

f. The experience of the researcher is another limiting factor of this
research, as this is the researcher's first research project. However, the
researcher was able to learn and was properly guided by the research
plan and by instruction from the supervisor and examiner.

6.4 Recommendations

Several recommendations can be considered to further enhance
the development of the e-filing web-based system:

a. The project scope of the system could be expanded to cover the
contents of the files, not only the status of the files.

b. The system could be used by other local governments, not
only Majlis Daerah Kerian.

c. The project could be put online through the Internet so that it
can be accessed from anywhere at any time. At present, access to this
project is limited to the Local Area Network (LAN).

It is hoped that the implementation of this system will lead to further
enhancements in future projects.

REFERENCES

Abbott, D.W., Matkovsky, I.P., & Elder, J.F. (1998). An Evaluation of High-end
Data Mining Tools for Fraud Detection. IEEE Transaction on Knowledge and
Data Engineering, 2836.

Aliakbary, S., Khayyamian, M., & Abolhassani, H. (2008). Using Social Annotations
for Search Result Clustering. Retrieved February 10, 2010, from
http://www.springerlink.com/index/v770wm385n256p68.pdf

Apache. (2002). Retrieved February 14, 2010, from The Apache Software
Foundation: http://apache.org/

Bennett, S., McRobb, S., & Farmer, R. (2006). Object-Oriented Systems Analysis
and Design Using UML Third Edition. McGraw-Hill Education(UK)
Limited.

Bialynicka, I. (2008). Clustering Web Search Results. Retrieved March 2, 2010, from
http://medialab.di.unipi.it/web/Search+QA/Seminar/Clustering.ppt

Carrot² (2010). Carrot²-Open Source Search Results Clustering Engine. Retrieved
March 1, 2010, from Carrot² Website : http://project.carrot2.org/index.html

Chen, M., Han, J., & Yu, S.Y. (1996). Data Mining : An Overview from a Database
Perspective. IEEE Transaction on Knowledge and Data Engineering, 8, 6.

Collier, K., Carey, B., Sautter, D., & Marjaniemi, C. (1999). A Methodology for
Evaluating and Selecting Data Mining Software. IEEE Transaction on
Knowledge and Data Engineering, 2-4.

Defit, S., & Md Sap, M. N. (2009). Mining Association Rule from Large Databases.
Retrieved October 10, 2009, from http://fsksm.utm.edu.my

Garofalakis, M. N., Rastogi, R., Seshadri, S., & Shim, K. (1999). Data Mining and
the Web : Past, Present and Future. Retrieved July 17, 2009, from
http://www.softnet.tuc.gr/~minos/Papers/widm99.pdf

IBM Corporation. (2006). IBM Rational Rose. Retrieved March 1, 2010, from
http://ftp.software.ibm.com/software/rational/web/datasheets/rose_ds.pdf

Jain, A. K., Murty, M. N., & Flynn, P. J. (2000). Data Clustering: A Review. ACM
Computing.

Jasco, P. (2007). Clustering Search Result, Part 1: Web-wide Search Engines.
Retrieved January 5, 2010, from http://www.emeraldinsight.com/1468-4527.htm

Khodra, M. L., & Widyantoro, D. H. (2007). An Efficient and Effective Algorithm for
Hierarchical Classification of Search Results. Retrieved March 20, 2010,
from http://repository.gunadarma.ac.id:8000/711/1/C-07.pdf

Lee, H. K. (2005). Inductive Clustering : A Technique for Clustering Search Results.
Retrieved July 15, 2009, from
http://sifaka.cs.uiuc.edu/course/598cxz05s/report-hle.pdf

Levy, P. (2007). A Review of Adobe Photoshop CS3. Retrieved February 3, 2010,
from http://www.becs-wa.org/PhotoShop_CS3.pdf

Lipai, A. (2008). World Wide Web Metasearch Clustering Algorithm. Retrieved
March 13, 2010, from http://revistaie.ase.ro/content/46/Adina%20Lipai.pdf

MySQL. (2009). Retrieved December 28, 2009, from MySQL Website:
http://www.mysql.com/

Olson, T., Edwards, M., & Monty, H.A. (2003). A Guide to Model Rules for
Electronic Filing and Service. Retrieved July 15, 2009, from
http://www.ncsconline.org/WC/Publications/External_ElFileModelRulesLexisPub.pdf

Phyu, T.N. (2009). Survey of Classification Techniques in Data Mining. Retrieved
August 5, 2009, from
http://www.iaeng.org/publication/IMECS2009/IMECS2009pp727-731.pdf

Qiu, M., Davis, S., & Ikem, F. (2004). Evaluation of Clustering Techniques in Data
Mining Tools. Retrieved January 5, 2010, from
http://www.iacis.org/iis/2004_iis/PDFfiles/QiuDavisIkem.pdf

Ravichandra, R. (2003). Data Mining and Clustering Techniques. Retrieved April 1,
2010, from
https://drtc.isibang.ac.in/bitstream/handle/1849/121/K_ikr_datamining.PDF?sequence=2

San Diego State University. (2004). Dreamweaver MX 2004 Introduction. San
Diego, Berkeley. Academic Affairs.

Shyu, M. L., Chen, S. C., & Haruechaiyasak, C. (2005). A Data Mining Framework
for Building A Web-Page Recommender System. Retrieved February 12, 2010, from
http://www.hlt.nectec.or.th/Publications/Conferences/A%20Data%20Mining%20Framework%20for%20Building%20A%20Web-Page%20Recommender%20System.pdf

Tang, P. N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining.
Boston : Pearson Education.

Visnick, L. (2003). Clustering Techniques. Retrieved July 30, 2009, from
http://www.progress.com/realtime/docs/whitepapers/clustering_techniques.pdf

Zhang, H., Xie, K., & Wu, H. (2006). An Efficient Algorithm for Clustering Search
Engine Results. Retrieved February 6, 2010, from http://www.ieee.org

APPENDICES

APPENDIX A
PROJECT PLANNING

APPENDIX B
PROGRESS SLIDE PRESENTATION

APPENDIX C
INTERVIEW QUESTION

APPENDIX D
SOFTWARE REQUIREMENT SPECIFICATION (SRS)

APPENDIX E
SOFTWARE DESIGN DOCUMENT (SDD)

APPENDIX F
DESCRIPTION OF SYSTEM INTERFACE

APPENDIX G
IN-PROGRESS ASSESSMENT

UNIVERSITI TEKNOLOGI MARA

DEVELOPMENT OF E-FILING FOR
MAJLIS DAERAH KERIAN
USING DATA MINING TECHNIQUES

MOHAMED SYAHMI BIN MOHAMED ISA

BSc. (Hons)
INFORMATION SYSTEM ENGINEERING

MAY 2010

Universiti Teknologi MARA

Development of E-Filing for Majlis Daerah Kerian
using Data Mining Techniques

Mohamed Syahmi Bin Mohamed Isa

Thesis submitted in fulfillment of the requirements for
Bachelor of Science (Hons)
Information System Engineering
Faculty of Computer and Mathematical Sciences

MAY 2010

DECLARATION

This declaration is to certify that this thesis and all of its submitted contents are
original in its stature, excluding those in which have been acknowledged specifically
in the references. The contents of this thesis are of my own endeavor and any ideas
or quotations from the work of other people; published or otherwise are fully
acknowledged in accordance with the standard referring practices of the discipline.

Name of Candidate : MOHAMED SYAHMI BIN MOHAMED ISA
Candidate’s ID No. : 2008287242
Programme : BACHELOR OF SCIENCE (HONS)
INFORMATION SYSTEM ENGINEERING
(CS 226)
Faculty : FACULTY OF COMPUTER AND
MATHEMATICAL SCIENCES
Project Title : DEVELOPMENT OF E-FILING FOR MAJLIS
DAERAH KERIAN USING DATA MINING
TECHNIQUES
Signature of
candidate :

Date : 24th MAY 2010

APPROVAL

DEVELOPMENT OF E-FILING FOR MAJLIS DAERAH KERIAN
USING DATA MINING TECHNIQUES

By

Mohamed Syahmi Bin Mohamed Isa
2008287242

This thesis is prepared under the direction of thesis coordinators, Assoc. Prof. Wan
Nor Amalina Wan Hariri and Assoc. Prof. Rashidah Md. Rawi, Information System
Engineering Program, and it has been approved by the thesis supervisor, Puan
Norisan Abd Karim. It was submitted to the Faculty of Computer and Mathematical
Sciences and was accepted in partial fulfillment of the requirement for the degree of
Bachelor of Science.

Approved by:

__________________________
Madam Norisan Abd Karim
Thesis Supervisor
Date: 24th May 2010

DEDICATION

“For my mother, Sadiah Binti Harun,
my late father, Mohamed Isa Bin Harun,
and my brothers.”

ACKNOWLEDGEMENT

Praise be to Allah SWT Most Gracious, Most Beneficent

Firstly, I would like to pay my gratitude to Allah S.W.T for giving me strength to be
able to complete this project. Without His blessing and permission, this project could
not have been completed.

I would like to give my sincere appreciation to my supervisor, Puan Norisan Abd
Karim, for her concern, advice, support and encouragement throughout the progress
of this thesis. My gratitude also goes to the coordinators of the Final Year
Project (ITS690), FSKM, UiTM Shah Alam, Assoc. Prof. Wan Nor Amalina Wan Hariri
and Assoc. Prof. Rashidah Md. Rawi, for their valuable guidance in the completion
of this project.

Special thanks to Mr. Gobibaskaran and Puan Shalina for giving the opportunity to
perform the interview session that helped me in gathering the requirements for this
project.

Last but not least, heartfelt thanks to my parents, who gave me an
appreciation of learning and taught me the value of perseverance and resolve. I
would also like to thank my friends for their support, and everyone who directly
or indirectly helped me in this project. Thank you for inspiring me in ways that
cannot be put into words. May Allah SWT bless all of you.

TABLE OF CONTENTS

TITLE PAGE
ACKNOWLEDGEMENT
TABLE OF CONTENT
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1
INTRODUCTION
1.1 Research Background
1.2 Problem Statement
1.3 Aim
1.4 Objective of the Research
1.5 Significance of Research
1.6 Scope of Study
1.7 Limitation
1.8 Outcomes/Deliverables
1.9 Layout of Dissertation
1.10 Summary

CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
2.2 E-Filing
2.2.1 Introduction to E-Filing
2.2.2 Purposes of the Rules in E-Filing
2.2.3 Proposed Model Rules for E-Filing
2.3 What is Data Mining
2.3.1 Definition of Data Mining
2.3.2 Data Mining & Knowledge Discovery
2.3.3 Challenges of Data Mining
2.4 Data Mining Techniques
2.4.1 Overview of Data Mining Techniques
2.4.2 Classifying Data Mining Techniques
2.4.3 Association Rules
2.4.4 Classification
2.4.5 Clustering
2.5 Selecting Data Mining Techniques
2.6 Selecting Data Mining Tools
2.7 Summary

CHAPTER 3
RESEARCH APPROACH AND METHODOLOGY
3.1 Introduction
3.2 Problem Identification and Planning
3.3 Requirement Gathering
3.3.1 Primary Data
3.3.2 Secondary Data
3.4 Requirement Analysis
3.5 Design Model
3.6 Develop Prototype
3.7 Summary

CHAPTER 4
PROTOTYPE CONSTRUCTION
4.1 Introduction
4.2 Software Requirement
4.2.1 Software Tools
4.2.2 Software Tools Installation
4.3 Hardware Requirements
4.4 Development Phase
4.4.1 Requirement Analysis Phase
4.4.2 Design Phase
4.4.3 Development Phase
4.5 Summary

CHAPTER 5
RESULT AND FINDINGS
5.1 Introduction
5.2 Interview Results
5.2.1 Current Problems
5.2.2 Functional Requirements
5.3 Use Case Diagram
5.4 Class Diagram
5.5 Clustering as the Suitable Searching Method
5.5.1 Introduction
5.5.2 Why Clustering Search Result
5.5.3 Examples of Clustering Search Result
5.5.4 Clustering Search Result from e-filing web-based system
5.6 Summary

CHAPTER 6
CONCLUSION AND RECOMMENDATIONS
6.1 Introduction
6.2 Conclusions
6.3 Limitations
6.4 Recommendations

REFERENCES

APPENDICES
APPENDIX A : Project Planning
APPENDIX B : Progress Slide Presentation
APPENDIX C : Interview Question
APPENDIX D : Software Requirements Specification (SRS)
APPENDIX E : Software Design Document (SDD)
APPENDIX F : Description Of System Interface
APPENDIX G : In-Progress Assessment
LIST OF TABLES

Table 2.1 : Differences of Classification, Association and Clustering techniques
Table 2.2 : Computational Performance Criteria (Collier et al., 1999)
Table 2.3 : Functionality Criteria (Collier et al., 1999)
Table 2.4 : Usability Criteria (Collier et al., 1999)
Table 4.1 : Software Tools Specifications
Table 5.1 : The problems that have been identified from the interviews
Table 5.2 : The requirement and suggestion that had been identified
from the interviews
Table 5.3 : Description of Use Case diagram

LIST OF FIGURES

Figure 2.1 : The Process of knowledge discovery in database 10


Figure 2.2 : Process for designing and implementing arecommender
system (Shyu et al., 2005) 11
Figure 2.3 : The general architecture of Mining Association Rule model
(Defit & Md Sap, 2001) 17
Figure 2.4 : Hierarchical Classification Process (Khodra & Widyantoro, 2007) 19
Figure 2.5 : Stages in clustering (Jain et al., 1999) 21
Figure 3.1 : Overview of Research Approach and Methodology 34
Figure 4.1 : Coding index.php 47
Figure 4.2 : The main page interface of e-filing 48
Figure 5.1 : A Model for Navigating Interview Processes in
Requirements Elicitation 49
Figure 5.2 : Use Case Diagram for E-Filing web-based system 54
Figure 5.3 : Class Diagram for E-Filing web-based system 56
Figure 5.4 : Google’s One Dimensional Result List 59
Figure 5.5 : Good clustering result list 59
Figure 5.6 : Good clustering result list from http://search.carrot2.org 60
Figure 5.7 : Good clustering result list from e-filing web-based system 61
Figure 5.8 : Data Mining Tool by Carrot² 62

ABSTRACT

The e-filing web-based system is a development project that uses a data mining
technique called clustering. Different types of data mining techniques are useful
depending on their functions and the conditions under which they are applied.
Majlis Daerah Kerian acts as a local government, the unit of government closest to
citizens; local governments include municipalities, local authorities, town councils
and city councils. The staff of Majlis Daerah Kerian face difficulties in managing
and identifying the files that meet their requirements. Because there are thousands
of files spread across eight departments, searching for a needed file manually is
difficult and involves many steps. This research provides a suitable searching
method for the e-filing web-based system using a data mining technique. The
researcher compares three data mining techniques (association, classification and
clustering) to identify the one most suitable for searching files, and conducts
interview sessions with staff of Majlis Daerah Kerian to gather detailed
requirements. Developing the e-filing web-based system for Majlis Daerah Kerian
will help staff identify the files they need more accurately and more quickly, as a
result of using a searching method based on the selected data mining technique. It
will also provide staff with an interactive environment for choosing the files that
meet their requirements. It is expected that this e-filing web-based system will act
as an information center for staff of Majlis Daerah Kerian to gather information
about the status of files.
