Understanding Data Mining: Basis for Future Planning and Development

What is data mining?

Importance of data mining

Classes of data mining techniques

How does data mining work?

What technological infrastructure is required?

Problems

Data mining is the process of collecting, searching through, and analyzing a large amount of data in a database so as to discover patterns or relationships.

-Bose & Mahapatra (2001): a process of identifying interesting patterns in databases that can then be used in decision making
-Turban et al. (2007): a process that uses statistical, mathematical, artificial intelligence and machine learning techniques to extract and identify useful information
-Frawley et al. (1992): the objective of data mining is to obtain useful, non-explicit information from data stored in large repositories

Data mining helps analyze data from different perspectives, dimensions or angles and summarize it into useful information.
It plays an important role in Financial Fraud Detection (FFD) because it is often applied to extract and uncover the hidden truths behind very large quantities of data.


Data - any facts, numbers or text that can be processed by a computer. This includes:
-operational or transactional data such as sales, cost, inventory, payroll and accounting
-nonoperational data such as industry sales and forecast data
Information - the correlations or patterns among the data can provide useful information
Knowledge - information can be converted into knowledge about historical patterns and future trends

Classification - builds and uses a model to predict the categorical labels of unknown objects, distinguishing between objects of different classes
Clustering - divides objects into conceptually meaningful groups (clusters), with the objects in a group being similar to one another but very dissimilar to the objects in other groups
Prediction - estimates numeric and ordered future values based on the patterns of a data set
Outlier detection - measures the distance between data objects to detect those objects that are grossly different from or inconsistent with the remaining data set
Regression - a statistical methodology used to reveal the relationship between one or more independent variables and a dependent variable
Visualization - refers to the easily understandable presentation of data and to methodology that converts complicated data characteristics into clear patterns or relationships uncovered in the data mining process
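
The distance-based idea behind outlier detection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a method taken from the cited sources: the function names `knn_distance_scores` and `find_outliers`, the choice of k = 2 neighbours, the cutoff factor of 3, and the sample records are all hypothetical.

```python
import math
import statistics

def knn_distance_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        distances = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(distances[:k]) / k)
    return scores

def find_outliers(points, k=2, factor=3.0):
    """Flag points whose score is more than `factor` times the median score."""
    scores = knn_distance_scores(points, k)
    cutoff = factor * statistics.median(scores)
    return [p for p, s in zip(points, scores) if s > cutoff]

# Made-up records: (transaction amount, items purchased).
transactions = [(100, 1), (102, 1), (98, 2), (101, 1), (99, 2), (5000, 9)]
outliers = find_outliers(transactions)
print(outliers)  # [(5000, 9)]
```

A point that sits far from all of its neighbours gets a large score, which is one way to make "grossly different from the remaining data set" measurable. The cutoff is a judgment call, and a poor choice either floods the analyst with flags or misses real anomalies.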

Stages of data mining
-exploration: this stage starts with preparing the data, such as data cleaning, transformation and selecting records
-model building and validation: involves choosing the best model based on predictive performance
-deployment: this stage combines the previous two by implementing the model that was chosen and applying it to the data to generate predictions or patterns
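
The three stages above can be sketched end to end on a toy data set. Everything here is an assumption made for illustration: the records, the two candidate models (a constant mean and a least-squares line), the train/holdout split, and the use of mean absolute error as the measure of predictive performance.

```python
import statistics

# Made-up (x, y) records; one has a missing value.
raw_records = [(1, 12.0), (2, 14.1), (3, None), (4, 18.2),
               (5, 19.9), (6, 22.1), (7, 24.0), (8, 25.8)]

# Stage 1, exploration: clean the data and select records.
clean = [(x, y) for x, y in raw_records if y is not None]
train, holdout = clean[:5], clean[5:]

# Stage 2, model building and validation: fit candidates, compare on holdout.
def fit_mean(data):
    mean_y = statistics.fmean(y for _, y in data)
    return lambda x: mean_y

def fit_line(data):
    mean_x = statistics.fmean(x for x, _ in data)
    mean_y = statistics.fmean(y for _, y in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def mean_absolute_error(model, data):
    return statistics.fmean(abs(model(x) - y) for x, y in data)

candidates = {"mean": fit_mean(train), "line": fit_line(train)}
errors = {name: mean_absolute_error(m, holdout) for name, m in candidates.items()}
best_name = min(errors, key=errors.get)

# Stage 3, deployment: apply the chosen model to new data.
prediction = candidates[best_name](9)
print(best_name, round(prediction, 2))
```

On this data the line model has the lower holdout error, so it is the one deployed. In practice the candidate models and the validation scheme would be chosen to suit the problem.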

Data mining consists of five elements:
-extract, transform, and load transaction data onto the data warehouse system
-store and manage the data in a multidimensional database system
-provide data access to business analysts and information technology professionals
-analyze the data with application software
-present the data in a useful format, such as a graph or table
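
A rough sketch of how the five elements fit together, using Python's built-in `sqlite3` module. An in-memory SQLite table stands in for the data warehouse here (it is relational, not a true multidimensional system), and the table name `sales`, the helper names, and the sample rows are all invented for the example.

```python
import sqlite3

# 1. Extract, transform and load: raw rows arrive as text and are normalized.
raw_sales = [
    ("north", "2024-01", "1,200.50"),
    ("south", "2024-01", "980.00"),
    ("north", "2024-02", "1,310.25"),
    ("south", "2024-02", "1,020.75"),
]

def transform(row):
    region, month, amount = row
    return region.upper(), month, float(amount.replace(",", ""))

# 2. Store and manage: an in-memory database stands in for the warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                      [transform(r) for r in raw_sales])

# 3. Provide access: a small query helper for analysts.
def query(sql, params=()):
    return warehouse.execute(sql, params).fetchall()

# 4. Analyze: total sales per region.
totals = query("SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")

# 5. Present: a plain-text table.
report_lines = [f"{region:<8}{total:>10.2f}" for region, total in totals]
print("\n".join(report_lines))
```

Each numbered comment maps to one of the five elements, so a step can be swapped for a production tool without changing the overall flow.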

Tools used for data mining:

i) artificial neural networks - non-linear predictive models that learn through training and resemble biological neural networks in structure

ii) decision trees - tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset

iii) logistic regression - a type of regression analysis used for predicting the outcome of a categorical dependent variable (a variable that can take on a limited number of categories) based on one or more predictor variables

iv) genetic algorithms - optimization techniques that use processes such as genetic combination, mutation and natural selection in a design based on the concepts of natural evolution

vi) data visualization - the visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships
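
As one concrete instance from the list above, logistic regression with a single predictor can be sketched in plain Python. This is a minimal illustration, not a production implementation: the gradient-descent training loop, the learning rate, the number of epochs, the centering of the predictor, and the sample data are all assumptions chosen to keep the example short.

```python
import math
import statistics

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, labels, learning_rate=0.1, epochs=1000):
    """Fit weight and bias by gradient descent on the average log-loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            error = sigmoid(weight * x + bias) - y
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias

# Made-up data: one predictor and a binary outcome (0 or 1).
xs = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Center the predictor so gradient descent converges quickly.
center = statistics.fmean(xs)
weight, bias = train_logistic([x - center for x in xs], labels)

def predict(x):
    """Return the predicted category for a new predictor value."""
    return 1 if sigmoid(weight * (x - center) + bias) >= 0.5 else 0

print(predict(2.0), predict(8.0))  # 0 1
```

The model outputs a probability between 0 and 1, and the 0.5 threshold turns it into a category. That threshold is itself a design choice that can be moved when one kind of error is more costly than the other.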

As companies started collecting and saving basic data in computers, they were able to start answering detailed questions more quickly and easily.
Internal auditors can use spreadsheets to undertake simple data mining exercises or to produce summary tables. Using the spreadsheet, auditors can review complex data in a simplified format and drill down where necessary to find the underlying assumptions or information.
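
The summary-table and drill-down workflow described for spreadsheets has a close analogue in a few lines of Python. The ledger rows, field names, and the helpers `summarize` and `drill_down` below are invented for illustration. A spreadsheet pivot table would do the same job.

```python
from collections import defaultdict

# Made-up ledger rows an auditor might export from an accounting system.
ledger = [
    {"department": "Sales", "vendor": "Acme", "amount": 1200.0},
    {"department": "Sales", "vendor": "Bolt", "amount": 300.0},
    {"department": "IT", "vendor": "Acme", "amount": 4500.0},
    {"department": "IT", "vendor": "Core", "amount": 250.0},
    {"department": "Sales", "vendor": "Acme", "amount": 800.0},
]

def summarize(rows, key):
    """Summary table: total amount per distinct value of `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row["amount"]
    return dict(totals)

def drill_down(rows, **filters):
    """Return only the detail rows that match every filter."""
    return [row for row in rows if all(row[k] == v for k, v in filters.items())]

by_department = summarize(ledger, "department")
it_rows = drill_down(ledger, department="IT")
it_by_vendor = summarize(it_rows, "vendor")
print(by_department)  # {'Sales': 2300.0, 'IT': 4750.0}
print(it_by_vendor)   # {'Acme': 4500.0, 'Core': 250.0}
```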

Size of database
-the more data being processed and maintained, the more powerful the system required
Query complexity
-the more complex the queries and the greater the number of queries being processed, the more powerful the system required

Limited use of outlier detection and visualization
-outlier detection and visualization have seen limited use, which may be due to the difficulty of detecting outliers
Cost sensitivity
-the cost of misclassification (false positive and false negative errors) differs, with a false negative error (misclassifying fraudulent activity as normal activity) being more costly than a false positive error (misclassifying normal activity as fraudulent activity)
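
The effect of unequal misclassification costs can be illustrated with a small calculation. The 10:1 cost ratio, the fraud scores, and the candidate thresholds below are assumptions made up for the example. Real costs would have to be estimated for the business in question.

```python
# Assumed costs: a missed fraud is taken to be ten times as costly as a false alarm.
COST_FALSE_NEGATIVE = 10.0
COST_FALSE_POSITIVE = 1.0

def classify(scores, threshold):
    """Label a case as fraud (1) when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]

def misclassification_cost(actual, predicted):
    cost = 0.0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 0:      # fraud classified as normal
            cost += COST_FALSE_NEGATIVE
        elif a == 0 and p == 1:    # normal classified as fraud
            cost += COST_FALSE_POSITIVE
    return cost

# Made-up ground truth (1 = fraud) and model scores for eight cases.
actual = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.45, 0.40, 0.55, 0.70, 0.90]

def cost_at(threshold):
    return misclassification_cost(actual, classify(scores, threshold))

print(cost_at(0.5))  # 11.0: one false negative and one false positive
print(cost_at(0.3))  # 2.0: two false positives, no false negatives
best_threshold = min([0.3, 0.5, 0.7], key=cost_at)
```

Both thresholds make two errors, yet the lower threshold costs far less because it trades a missed fraud for an extra false alarm. This is why accuracy alone is a poor guide when the two error types carry different costs.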

The right combination of imaginative human and computer skills can work small wonders on large sets of data.
Large amounts of data can be retrieved from various websites and databases.

