
University Kasdi Merbah Ouargla

Faculty of New Information Technologies and Communication

Department of Computer Science and Information Technology

Theme

Data Mining

Directed by: Chaima Derouiche, Hadjer EL_karbo

Supervised by: Khalil Mezriche

Academic year: 2016-2017

Summary:
Introduction
Definition
Data, information and knowledge
Cause of using data mining
What kind of data can be mined
Origin of data mining
How data mining works
The tasks of data mining
Data mining applications
Advantages and disadvantages of data mining
Conclusion
Bibliography

Introduction:
We are in an age often referred to as the information age. In this information age, because we believe that
information leads to power and success, and thanks to sophisticated technologies such as computers, satellites,
etc., we have been collecting tremendous amounts of information. Initially, with the advent of computers and
means for mass digital storage, we started collecting and storing all sorts of data, counting on the power of
computers to help sort through this amalgam of information. Unfortunately, these massive collections of data
stored on disparate structures very rapidly became overwhelming. This initial chaos has led to the creation of
structured databases and database management systems (DBMS). The efficient database management systems
have been very important assets for management of a large corpus of data and especially for effective and
efficient retrieval of particular information from a large collection whenever needed. The proliferation of
database management systems has also contributed to recent massive gathering of all sorts of information.
Today, we have far more information than we can handle from business transactions and scientific data, to
satellite pictures, text reports and military intelligence. Information retrieval is simply not enough anymore for
decision-making. Confronted with huge collections of data, we have now created new needs to help us make
better managerial choices. These needs are automatic summarization of data, extraction of the “essence” of
information stored, and the discovery of patterns in raw data.

Definition:
Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from
different perspectives and summarizing it into useful information that can be used to increase revenue, cut
costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to
analyze data from many different dimensions or angles, categorize it, and summarize the relationships
identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in
large relational databases. [1]

Example:
For example, one Midwest grocery chain used the data mining capacity of Oracle software to analyze local
buying patterns. They discovered that when men bought diapers on Thursdays and Saturdays, they also tended
to buy beer. Further analysis showed that these shoppers typically did their weekly grocery shopping on
Saturdays. On Thursdays, however, they only bought a few items. The retailer concluded that they purchased
the beer to have it available for the upcoming weekend. The grocery chain could use this newly discovered
information in various ways to increase revenue. For example, they could move the beer display closer to the
diaper display. Moreover, they could make sure beer and diapers were sold at full price on Thursdays. [2]

Data, Information, and Knowledge:


Data:
Data are any facts, numbers, or text that can be processed by a computer. Today, organizations are accumulating
vast and growing amounts of data in different formats and different databases. This includes:

• Operational or transactional data, such as sales, cost, inventory, payroll, and accounting.

• Nonoperational data, such as industry sales, forecast data, and macro-economic data.

• Metadata: data about the data itself, such as logical database design or data dictionary definitions.

Information:
The patterns, associations, or relationships among all this data can provide information. For example, analysis of
retail point of sale transaction data can yield information on which products are selling and when.

Knowledge:

Information can be converted into knowledge about historical patterns and future trends. For example,
summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide
knowledge of consumer buying behavior. Thus, a manufacturer or retailer could determine which items are
most susceptible to promotional efforts.

Cause of using data mining:


Commercial Viewpoint:
● Lots of data is being collected and warehoused
   • Web data, e-commerce
   • Purchases at department/grocery stores
   • Bank/credit card transactions
● Computers have become cheaper and more powerful
● Competitive pressure is strong
   • Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Scientific Viewpoint:
● Data is collected and stored at enormous speeds (GB/hour)
   • Remote sensors on a satellite
   • Telescopes scanning the skies
   • Microarrays generating gene expression data
   • Scientific simulations generating terabytes of data
● Traditional techniques are infeasible for raw data
● Data mining may help scientists
   • in classifying and segmenting data
   • in hypothesis formation

What kind of Data can be mined?


In principle, data mining is not specific to one type of media or data. Data mining should be applicable to any
kind of information repository. However, algorithms and approaches may differ when applied to different types
of data. Indeed, the challenges presented by different types of data vary significantly. Data mining is being put
into use and studied for many kinds of repositories: databases (relational, object-relational and object-oriented),
data warehouses, transactional databases, unstructured and semi-structured repositories such as the World Wide
Web, advanced databases (spatial, multimedia, time-series and textual databases), and even flat files. Here are
some examples in more detail:
• Flat files: Flat files are actually the most common data source for data mining algorithms, especially at the
research level. Flat files are simple data files in text or binary format with a structure known by the data-mining
algorithm to be applied. The data in these files can be transactions, time-series data, scientific measurements,
etc.
• Relational Databases: Briefly, a relational database consists of a set of tables containing either values of entity
attributes, or values of attributes from entity relationships. Tables have columns and rows, where columns
represent attributes and rows represent tuples. A tuple in a relational table corresponds to either an object or a
relationship between objects and is identified by a set of attribute values representing a unique key.[3]
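
The relational structure described above can be sketched with Python's standard sqlite3 module and an in-memory database. The table name `customer`, its columns and the sample rows are hypothetical and chosen only for illustration; they do not come from the report.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"  # unique key identifying each tuple
    " name TEXT,"
    " age INTEGER)"
)
conn.executemany(
    "INSERT INTO customer (customer_id, name, age) VALUES (?, ?, ?)",
    [(1, "Amina", 34), (2, "Karim", 27), (3, "Leila", 45)],
)

# Each returned row is one tuple; each position in it is one attribute.
rows = conn.execute(
    "SELECT customer_id, name, age FROM customer ORDER BY customer_id"
).fetchall()
for row in rows:
    print(row)

# The key value selects exactly one tuple.
one = conn.execute(
    "SELECT name FROM customer WHERE customer_id = ?", (2,)
).fetchone()
print(one[0])
```

A mining algorithm reading from such a table would typically treat each row as one object and each column as one of its attributes.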

Origin of data mining:


● Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

● Traditional techniques may be unsuitable due to:
   • Enormity of data
   • High dimensionality of data
   • Heterogeneous, distributed nature of data [4]

How data mining works?


How exactly is data mining able to tell you important things that you did not know or what is going to happen
next? The technique that is used to perform these feats in data mining is called modeling. Modeling is simply the
act of building a model in one situation where you know the answer and then applying it to another situation
that you do not. For instance, if you were looking for a sunken Spanish galleon on the high seas the first thing
you might do is to research the times when Spanish treasure had been found by others in the past. You might
note that these ships often tend to be found off the coast of Bermuda and that there are certain characteristics
to the ocean currents, and certain routes that have likely been taken by the ship’s captains in that era. You note
these similarities and build a model that includes the characteristics that are common to the locations of these
sunken treasures. With these models in hand you sail off looking for treasure where your model indicates it
most likely might be given a similar situation in the past. Hopefully, if you've got a good model, you find your
treasure.

This act of model building is thus something that people have been doing for a long time, certainly before the
advent of computers or data mining technology. What happens on computers, however, is not much different
than the way people build models. Computers are loaded up with lots of information about a variety of
situations where an answer is known and then the data mining software on the computer must run through that
data and distill the characteristics of the data that should go into the model. Once the model is built, it can then
be used in similar situations where you do not know the answer. For example, say that you are the director of
marketing for a telecommunications company and you would like to acquire some new long distance phone
customers. You could just randomly go out and mail coupons to the general population - just as you could
randomly sail the seas looking for sunken treasure. In neither case would you achieve the results you desired
and of course you have the opportunity to do much better than random - you could use your business
experience stored in your database to build a model.

As the marketing director, you have access to a lot of information about all of your customers: their age, sex,
credit history and long distance calling usage. The good news is that you also have a lot of information about
your prospective customers: their age, sex, credit history etc. Your problem is that you do not know the long
distance calling usage of these prospects (since they are most likely now customers of your competition). You
would like to concentrate on those prospects who have large amounts of long distance usage. You can
accomplish this by building a model. Table 2 illustrates the data used for building a model for new customer
prospecting in a data warehouse.[5]
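
The prospecting idea described above (build a model where the answer is known, then apply it where it is not) can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the records, the age bands and the choice of "average usage per age band" as the model. A real project would use more attributes and a proper learning algorithm.

```python
from collections import defaultdict

# Hypothetical customers: general information (age) plus the proprietary
# target (monthly long distance minutes), which is known only for customers.
customers = [
    {"age": 25, "minutes": 40},
    {"age": 31, "minutes": 55},
    {"age": 47, "minutes": 210},
    {"age": 52, "minutes": 180},
    {"age": 68, "minutes": 90},
]

# Hypothetical prospects: general information is known, the target is not.
prospects = [
    {"name": "P1", "age": 29},
    {"name": "P2", "age": 49},
    {"name": "P3", "age": 70},
]


def age_band(age):
    """Map an age to a coarse band (the cut points are arbitrary)."""
    if age < 40:
        return "under_40"
    if age < 60:
        return "40_to_59"
    return "60_plus"


def build_model(records):
    """Model = average known usage per age band."""
    usage = defaultdict(list)
    for record in records:
        usage[age_band(record["age"])].append(record["minutes"])
    return {band: sum(values) / len(values) for band, values in usage.items()}


model = build_model(customers)

# Apply the model where the answer is unknown, then rank the prospects.
scored = sorted(
    ((p["name"], model[age_band(p["age"])]) for p in prospects),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, expected_minutes in scored:
    print(name, expected_minutes)
```

The prospects at the top of the ranking are the ones the marketing director would contact first.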

                                                      Customers   Prospects
General information (e.g. demographic data)           Know        Know
Proprietary information (e.g. customer transactions)  Know        Target

The tasks of Data mining:
Summarization
Summarization is the generalization or abstraction of data. A set of relevant data is abstracted and summarized,
resulting in a smaller set that gives a general overview of the data. For example, the long distance calls of a
customer can be summarized into total minutes, total calls, total spending, etc. instead of detailed calls.
Similarly, the calls can be summarized into local calls, STD calls, ISD calls, etc.
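
As a minimal sketch of the call summarization just described, the following Python snippet collapses detailed call records into one summary row per customer. The record layout, customer names and amounts are hypothetical; costs are kept in integer cents to avoid floating-point rounding.

```python
from collections import defaultdict

# Hypothetical detailed call records:
# (customer, call type, duration in minutes, cost in cents).
calls = [
    ("alice", "local", 5, 50),
    ("alice", "STD", 12, 360),
    ("alice", "ISD", 3, 450),
    ("bob", "local", 20, 200),
    ("bob", "local", 7, 70),
]


def summarize(records):
    """Collapse detailed calls into one summary row per customer."""
    summary = {}
    for customer, call_type, minutes, cost_cents in records:
        row = summary.setdefault(
            customer,
            {
                "total_calls": 0,
                "total_minutes": 0,
                "total_cost_cents": 0,
                "calls_by_type": defaultdict(int),
            },
        )
        row["total_calls"] += 1
        row["total_minutes"] += minutes
        row["total_cost_cents"] += cost_cents
        row["calls_by_type"][call_type] += 1
    return summary


summary = summarize(calls)
for customer, row in summary.items():
    print(customer, row["total_calls"], row["total_minutes"],
          row["total_cost_cents"], dict(row["calls_by_type"]))
```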

Clustering
Clustering is identifying similar groups in unstructured data. It is the task of grouping a set of objects in
such a way that objects in the same group are more similar to each other than to those in other groups. Once the
clusters are decided, the objects are labelled with their corresponding clusters, and the common features of the
objects in a cluster are summarized to form a class description. For example, a bank may cluster its customers into
several groups based on the similarities of their income, age, sex, residence, etc., and the common characteristics
of the customers in a group can be used to describe that group of customers. This will help the bank understand its
customers better and thus provide customized services.
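
The report does not name a clustering algorithm, so the sketch below uses k-means, one common choice, written in plain Python. The customer data (age, yearly income in thousands) and the two starting centroids are hypothetical, and the features are left unscaled for brevity, which a real analysis would normally correct.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            new_centroids.append(mean(members) if members else centroids[j])
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return labels, centroids


# Hypothetical bank customers: (age, yearly income in thousands).
customers = [(22, 18), (25, 21), (27, 20), (48, 60), (52, 65), (55, 58)]
labels, centroids = k_means(customers, [customers[0], customers[3]])
print(labels)
print(centroids)
```

Each final centroid acts as a simple class description of its group, for example "younger customers with lower income" versus "older customers with higher income".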

Classification

Classification is learning rules that can be applied to new data. It typically includes the following steps:
preprocessing of data, model design, learning/feature selection, and validation/evaluation. Classification
predicts categorical class labels, whereas continuous-valued functions are modeled by numeric prediction methods
such as regression. For example, we can build a classification model to categorize bank loan applications as either
safe or risky. Classification is the derivation of a model that determines the class of an object based on its
attributes. A set of objects is given as a training set, in which every object is represented by a vector of
attributes along with its class. By analyzing the relationship between the attributes and the class of the objects
in the training set, a classification model can be constructed. Such a model can be used to classify future objects
and to develop a better understanding of the classes of the objects in the database. For example, from a set of
loan borrowers (Name, Age, and Income) who serve as the training set, a classification model can be built that
labels a bank loan application as either safe or risky (e.g. if age = Youth then loan decision = risky).
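
To make the loan example concrete, here is a minimal sketch of a very simple classifier: for one chosen attribute, it predicts the majority class observed for each attribute value in the training set. This one-attribute rule learner is a deliberately small stand-in, not the method of the cited paper, and the training rows are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training set: (age group, income level, class label).
training_set = [
    ("youth", "low", "risky"),
    ("youth", "medium", "risky"),
    ("youth", "high", "safe"),
    ("middle_aged", "medium", "safe"),
    ("middle_aged", "high", "safe"),
    ("senior", "low", "risky"),
    ("senior", "high", "safe"),
    ("senior", "medium", "safe"),
]


def learn_rules(rows, attribute_index):
    """For each attribute value, predict the majority class seen in training."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attribute_index]][row[-1]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def accuracy(rules, rows, attribute_index):
    """Fraction of rows whose class matches the rule for their attribute value."""
    correct = sum(1 for row in rows if rules[row[attribute_index]] == row[-1])
    return correct / len(rows)


age_rules = learn_rules(training_set, 0)
income_rules = learn_rules(training_set, 1)

print(age_rules)  # includes the rule: youth -> risky
print(accuracy(age_rules, training_set, 0))
print(accuracy(income_rules, training_set, 1))
```

Accuracy is measured here on the training set itself, which overstates how well the rules generalize; the validation/evaluation step mentioned above would use held-out data instead.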

Regression
Regression is finding a function that models the data with minimal error. It is a statistical methodology that is
most often used for numeric prediction. Regression analysis is widely used for prediction and forecasting, where
its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand
which of the independent variables are related to the dependent variable, and to explore the forms of these
relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between
the independent and dependent variables. However, this can lead to illusions or false relationships, so caution is
advisable [6]; for example, correlation does not imply causation.
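
As a small illustration of "finding a function with minimal error", the sketch below fits a straight line by ordinary least squares, the simplest form of regression. The data points and their interpretation (promotion budget versus sales) are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations: promotion budget (x) and resulting sales (y).
budget = [1, 2, 3, 4, 5]
sales = [2, 4, 5, 4, 5]

slope, intercept = fit_line(budget, sales)
predicted_sales = slope * 6 + intercept  # numeric prediction for a new budget
print(slope, intercept, predicted_sales)
```

A fitted slope only describes how the two variables move together in the data; by itself it does not show that a larger budget causes higher sales.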

Association
Association is looking for relationships between variables or objects. It aims to extract interesting associations,
correlations or causal structures among the objects, i.e. cases where the appearance of one set of objects implies
the appearance of another set of objects [7]. Association rules can be useful for marketing, commodity management,
advertising, etc. Association rule learning is a popular and well-researched method for discovering interesting
relations between variables in large databases. It is intended to identify strong rules discovered in databases
using different measures of interestingness [6]. Based on the concept of strong rules presented in [8], association
rules were introduced for discovering regularities between products in large-scale transaction data recorded by
point-of-sale (POS) systems in supermarkets. For example, the rule {onions, potatoes} → {burger} found in the sales
data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to
also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as
promotional pricing or product placement. In addition to the above example from market basket analysis,
association rules are employed today in many application areas, including Web usage mining, intrusion
detection, continuous production, and bioinformatics.
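
The report does not say which interestingness measures are meant, so the sketch below uses support and confidence, two commonly used ones, to evaluate the rule {onions, potatoes} → {burger}. The five baskets are invented for illustration.

```python
# Hypothetical point-of-sale baskets.
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "beer"},
    {"onions", "potatoes"},
    {"potatoes", "burger"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Among baskets with the antecedent, the fraction also holding the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


rule_support = support({"onions", "potatoes", "burger"}, transactions)
rule_confidence = confidence({"onions", "potatoes"}, {"burger"}, transactions)
print(rule_support, rule_confidence)
```

A rule is usually called strong when both values exceed thresholds chosen by the analyst; the thresholds themselves depend on the application.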

Data Mining Applications:


Data Mining Applications in Sales/Marketing:
Data mining enables businesses to understand the hidden patterns inside historical purchasing transaction data,
thus helping them plan and launch new marketing campaigns in a prompt and cost-effective way. The following
illustrates several data mining applications in sales and marketing.

• Data mining is used for market basket analysis to provide information on which product combinations
were purchased together, when they were bought, and in what sequence. This information helps
businesses promote their most profitable products and maximize profit. In addition, it encourages
customers to purchase related products that they may have missed or overlooked.

• Retail companies use data mining to identify customers' buying behavior patterns.

Data Mining Applications in Banking / Finance


• Several data mining techniques, e.g. distributed data mining, have been researched, modeled and
developed to help detect credit card fraud.

• Data mining is used to identify customer loyalty by analyzing data on customers' purchasing
activities, such as the frequency of purchases in a period of time, the total monetary value of all
purchases, and the date of the last purchase. After analyzing those dimensions, a relative measure is
generated for each customer. The higher the score, the more loyal the customer is.

• Data mining is applied to help banks retain credit card customers. By analyzing past data, data
mining can help banks predict which customers are likely to change their credit card affiliation, so that
they can plan and launch special offers to retain those customers.

• Credit card spending by customer groups can be identified by using data mining.

• Hidden correlations between different financial indicators can be discovered by using data mining.

• From historical market data, data mining makes it possible to identify stock trading rules.
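
The loyalty measure mentioned in the list above combines three dimensions: how recently a customer bought, how often, and how much. This is often called recency, frequency, monetary (RFM) analysis. The sketch below is one possible scoring scheme and not the report's own method: each dimension is turned into rank points and the points are added. The purchase histories and the reference date are hypothetical.

```python
from datetime import date

# Hypothetical purchase histories: customer -> list of (purchase date, amount).
purchases = {
    "c1": [(date(2016, 10, 1), 120), (date(2016, 10, 20), 80), (date(2016, 11, 1), 60)],
    "c2": [(date(2016, 6, 15), 300)],
    "c3": [(date(2016, 9, 5), 40), (date(2016, 10, 28), 50)],
}
as_of = date(2016, 11, 7)  # arbitrary reference date for measuring recency


def rfm(history, reference_date):
    """Return (days since last purchase, number of purchases, total spent)."""
    last_purchase = max(day for day, _ in history)
    recency_days = (reference_date - last_purchase).days
    frequency = len(history)
    monetary = sum(amount for _, amount in history)
    return recency_days, frequency, monetary


def rank_points(metric, higher_is_better):
    """Give 1 point to the worst customer, up to N points to the best."""
    ordered = sorted(metric, key=metric.get, reverse=not higher_is_better)
    return {customer: position + 1 for position, customer in enumerate(ordered)}


metrics = {c: rfm(history, as_of) for c, history in purchases.items()}
recency = rank_points({c: m[0] for c, m in metrics.items()}, higher_is_better=False)
frequency = rank_points({c: m[1] for c, m in metrics.items()}, higher_is_better=True)
monetary = rank_points({c: m[2] for c, m in metrics.items()}, higher_is_better=True)

# Relative loyalty score: the higher, the more loyal.
loyalty = {c: recency[c] + frequency[c] + monetary[c] for c in purchases}
print(loyalty)
```

Equal weighting of the three dimensions is an assumption; a bank could weight them differently depending on what it considers loyal behavior.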

Data Mining Applications in Health Care and Insurance


The growth of the insurance industry depends entirely on the ability to convert data into knowledge,
information or intelligence about customers, competitors, and markets. Data mining has only recently been applied
in the insurance industry, but it has brought tremendous competitive advantages to the companies that have
implemented it successfully. Data mining applications in the insurance industry are listed below:

• Data mining is applied in claims analysis, such as identifying which medical procedures are claimed
together.

• Data mining makes it possible to forecast which customers will potentially purchase new policies.

• Data mining allows insurance companies to detect risky customers' behavior patterns.

• Data mining helps detect fraudulent behavior.

Data Mining Applications in Transportation


• Data mining helps determine the distribution schedules among warehouses and outlets and analyze
loading patterns.

Data Mining Applications in Medicine


• Data mining makes it possible to characterize patient activities in order to anticipate upcoming office visits.

• Data mining helps identify the patterns of successful medical therapies for different illnesses.

Advantages and Disadvantages of Data Mining


Advantages of Data Mining

Marketing / Retail

Data mining helps marketing companies build models based on historical data to predict who will respond to
new marketing campaigns such as direct mail or online marketing campaigns. With these results, marketers
can take an appropriate approach to selling profitable products to targeted customers.

Data mining brings many benefits to retail companies in the same way as to marketing. Through market basket
analysis, a store can arrange its products so that customers can conveniently buy frequently purchased
items together. In addition, it helps retail companies offer discounts on particular products that will
attract more customers.

Finance / Banking

Data mining gives financial institutions information about loans and credit reporting. By building a
model from historical customer data, banks and financial institutions can distinguish good loans from bad ones. In
addition, data mining helps banks detect fraudulent credit card transactions to protect credit card owners.

Manufacturing

By applying data mining to operational engineering data, manufacturers can detect faulty equipment and
determine optimal control parameters. For example, semiconductor manufacturers face the challenge that even
when the conditions of manufacturing environments at different wafer production plants are similar, the quality
of the wafers varies, and some wafers have defects for unknown reasons. Data mining has been applied to
determine the ranges of control parameters that lead to the production of the golden wafer. Those optimal
control parameters are then used to manufacture wafers of the desired quality.

Governments

Data mining helps government agencies by digging through and analyzing financial transaction records to build
patterns that can detect money laundering or criminal activities.

Disadvantages of data mining

Privacy Issues

Concerns about personal privacy have increased enormously, especially as the internet booms with social
networks, e-commerce, forums and blogs. Because of privacy issues, people are afraid that their personal
information will be collected and used in unethical ways that could cause them a lot of trouble. Businesses
collect information about their customers in many ways in order to understand their purchasing behavior and
trends. However, businesses do not last forever; some day they may be acquired by others or go out of business.
At that point, the personal information they own may be sold to others or leaked.

Security issues

Security is a big issue. Businesses own information about their employees and customers, including social
security numbers, birthdays, payroll, etc. However, how well this information is protected is still in question.
There have been many cases in which hackers accessed and stole large amounts of customer data from big
corporations such as Ford Motor Credit Company and Sony. With so much personal and financial information
available, credit card theft and identity theft have become a big problem.

Misuse of information/inaccurate information

Information collected through data mining for ethical purposes can be misused. It may be exploited by
unethical people or businesses to take advantage of vulnerable people or to discriminate against a group of
people. In addition, data mining techniques are not perfectly accurate. Therefore, if inaccurate information
is used for decision-making, it can cause serious consequences.

Conclusion:

Data mining is an important part of the knowledge discovery process: it lets us analyze an
enormous set of data and extract hidden and useful knowledge. Data mining is applied effectively
not only in the business environment but also in other fields such as weather forecasting,
medicine, transportation, healthcare, insurance and government. Data mining has many
advantages when used in a specific industry. Besides those advantages, it also has
disadvantages, e.g. privacy, security and misuse of information.

Bibliography

[1]
[2]
[3]
[4]
[5] http://www.thearling.com/text/dmwhite/dmwhite.htm \ 7_11_2016
[6] R. Kaur, S. Kaur, A. Kaur, R. Kaur, A. Kaur, “An Overview of Database Management System, Data Warehousing
and Data Mining”, IJARCCE, Vol. 2, Issue 7, July 2013.
[7] Y. Fu, Data Mining: Tasks, Techniques and Applications.
[8] Y. Ramamohan, K. Vasantharao, C. Kalyana Chakravarti, and A.S.K. Ratnam, “A Study of Data Mining Tools in
Knowledge
