
Data Mining: Concepts and Techniques
Jiawei Han, Micheline Kamber and Jian Pei

Data Mining: Introductory and Advanced Topics
Margaret H. Dunham and S. Sridhar
DATA
Data represents unorganized and
unprocessed facts.

It can represent a set of discrete facts about events.
Data is a prerequisite to information.
An organization sometimes has to decide
on the nature and volume of data that is
required for creating the necessary
information.

Operational Data


Data on Internet

• TEXT
• IMAGE
• VIDEO
• SOUND
• LOCATION

Web Server Log data

2006-10-04 04:24:39 203.200.95.194 - W3SVC195 NS 69.41.233.13 80 GET /STUDYCENTRE/StudyCentreSearchPage.asp - 200 0 0 568 210 HTTP/1.0 www.mcu.ac.in Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;) - http://www.mcu.ac.in/search_a_study_institute.htm

Date (2006-10-04), time (04:24:39)
Client-side IP (203.200.95.194), server site name (W3SVC195)
Server computer name (NS), server IP (69.41.233.13)
Server port (80), method (GET)
Client URL (/STUDYCENTRE/StudyCentreSearchPage.asp)
Server status code (200), client-side bytes received (568)
Time taken (210)
User agent (Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;))
CS referer (http://www.mcu.ac.in/search_a_study_institute.htm)
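The field breakdown above can be reproduced mechanically. The following is a minimal sketch, assuming the entry follows the W3C extended log format as written by IIS; the field names and their order in FIELDS are an assumption inferred from the breakdown, not something the slide states.

```python
# Field names and order are an ASSUMPTION (W3C extended log format as
# commonly written by IIS), inferred from the slide's field breakdown.
FIELDS = [
    "date", "time", "c-ip", "cs-username", "s-sitename", "s-computername",
    "s-ip", "s-port", "cs-method", "cs-uri-stem", "cs-uri-query",
    "sc-status", "sc-win32-status", "sc-bytes", "cs-bytes", "time-taken",
    "cs-version", "cs-host", "cs(User-Agent)", "cs(Cookie)", "cs(Referer)",
]

# The log entry from the slide, as one line.
LOG_LINE = (
    "2006-10-04 04:24:39 203.200.95.194 - W3SVC195 NS 69.41.233.13 80 GET "
    "/STUDYCENTRE/StudyCentreSearchPage.asp - 200 0 0 568 210 HTTP/1.0 "
    "www.mcu.ac.in Mozilla/4.0+(compatible;+MSIE+5.0;+Windows+98;) - "
    "http://www.mcu.ac.in/search_a_study_institute.htm"
)


def parse_log_line(line):
    """Split one space-delimited log entry into a field-name -> value dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


record = parse_log_line(LOG_LINE)
print(record["c-ip"], record["cs-method"], record["cs-uri-stem"], record["sc-status"])
```

Splitting on whitespace works here because this format encodes spaces inside a value as "+", as the user agent string shows.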

Data Mining works with Warehouse Data

• Data Warehousing provides the Enterprise with a memory
• Data Mining provides the Enterprise with intelligence

DATA CUBE OF WEB LOG

[Figure: data cube of web log data. Dimensions: date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), browser (IE 5.0, Netscape Navigator, Firefox) and pages (Syllabus, Time Table, Result). Measures: total byte transfer and total visits.]

TRANSACTION DATABASES

TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

• A transaction database consists of a file where each record represents a transaction.
• A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.
• The transaction database may have additional tables associated with it, which contain other information regarding the sale, such as the date of the transaction, the customer ID number, the ID number of the salesperson and of the branch at which the sale occurred, and so on.
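As a rough illustration of the structure described above, the TID/Items table can be held as a mapping from transaction ID to item list. This is only a sketch: it assumes each Items value lists single-digit item IDs (so 134 means items 1, 3 and 4), and the helper name item_support is hypothetical.

```python
from collections import Counter

# Transaction database from the slide: trans_ID -> list of items.
# ASSUMPTION: each Items value is a run of single-digit item IDs.
transactions = {
    100: [1, 3, 4],
    200: [2, 3, 5],
    300: [1, 2, 3, 5],
    400: [2, 5],
}


def item_support(db):
    """Count in how many transactions each item occurs."""
    counts = Counter()
    for items in db.values():
        counts.update(set(items))  # set(): count an item once per transaction
    return counts


support = item_support(transactions)
for item in sorted(support):
    print(item, support[item])
```

Counts like these are the usual starting point for frequent-itemset mining over transaction data.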

Temporal Database & Time Series Database

• Temporal databases and time series databases both store time-related data.
• A temporal database usually stores relational data that include time-related attributes. These attributes may involve several timestamps, each having different semantics.
• A time series database stores sequences of values that change with time, such as data collected regarding the stock exchange.

Example
Given two temporal relations:

S: Supplier S# was under contract during the interval During
SP: Supplier S# was able to supply part P# during the interval During

S
S#   During
S1   [d04,d10]
S2   [d02,d04]
S2   [d07,d10]
S3   [d03,d10]
S4   [d04,d10]
S5   [d02,d10]

SP
S#   P#   During
S1   P1   [d04,d10]
S1   P7   [d05,d10]
S1   P3   [d09,d10]
S1   P5   [d06,d10]
S2   P1   [d02,d04]
S2   P9   [d03,d03]
S2   P1   [d08,d10]
S2   P5   [d09,d10]
S3   P1   [d08,d10]
S4   P2   [d06,d09]
S4   P5   [d04,d08]
S4   P7   [d05,d10]
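To make the interval semantics concrete, here is a minimal sketch of a point-in-time query over relation S. It assumes the day labels d01 to d10 map to the integers 1 to 10 and that each interval is closed at both ends; the function name under_contract_on is hypothetical.

```python
# Relation S from the example: (supplier, (start_day, end_day)).
# ASSUMPTION: day labels dNN map to integers and intervals are closed.
S = [
    ("S1", (4, 10)),
    ("S2", (2, 4)),
    ("S2", (7, 10)),
    ("S3", (3, 10)),
    ("S4", (4, 10)),
    ("S5", (2, 10)),
]


def under_contract_on(relation, day):
    """Return the suppliers whose contract interval contains the given day."""
    return sorted({s for s, (start, end) in relation if start <= day <= end})


print(under_contract_on(S, 5))  # S2 is absent: it has a gap between d04 and d07
```

Supplier S2 appears twice in S because its contract was interrupted, which is exactly the kind of fact a single timestamp column could not express.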

Text databases

• Text databases are databases that contain word descriptions for objects. These word descriptions are usually not simple keywords but rather long sentences or paragraphs, such as product specifications, error or bug reports, and warning messages.
• Text databases may be structured, semi-structured or unstructured.
• Examples: e-mail messages and HTML/XML web pages.


Object-Oriented Database

• It is based on object-oriented programming where, in general terms, each entity is considered as an object.
• Data and code relating to an object are encapsulated into a single unit. Each object has associated with it the following:
  ◦ A set of variables
  ◦ A set of messages
  ◦ A set of methods
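The three parts listed above can be sketched with an ordinary class. This is an illustration only: the Account entity and its members are hypothetical, and Python expresses "sending a message" simply as calling a method.

```python
class Account:
    """Hypothetical entity modelled as an object (illustration only)."""

    def __init__(self, owner, balance=0):
        # Variables: the data that describes the object.
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        # Method: the code that implements the response to a "deposit" message.
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.balance += amount
        return self.balance


acct = Account("alice")
acct.deposit(50)  # sending the message "deposit" with argument 50
print(acct.owner, acct.balance)
```

The data (owner, balance) and the code that acts on it (deposit) live in one unit, which is the encapsulation the slide refers to.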

Object-Relational Database

Data mining in object-oriented and object-relational systems shares some similarities. In comparison with relational data mining, techniques need to be developed for handling complex object structures, complex data types, class and subclass hierarchies, property inheritance, and methods and procedures.

Allows nested structuring.

SPATIAL DATA
• Spatial data is about instances located in a physical space.
• Spatial data has location features.
• A spatial database stores a large amount of space-related data, such as map data, preprocessed remote sensing or medical imaging data, and VLSI chip layout data.
• Examples of spatial databases:
  ◦ Cartographic databases (that store maps)
  ◦ Meteorological databases (for weather information)

Heterogeneous Database & Legacy Database
• Objects in one component database may differ greatly from objects in other component databases, making it difficult to assimilate their semantics into the overall heterogeneous database.
• A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, hierarchical databases, network databases, spreadsheets, multimedia databases or file systems.

World Wide Web

• The World Wide Web and its associated distributed information services, such as America Online, Yahoo!, AltaVista and Prodigy, provide worldwide online information services, where data objects are linked together to facilitate interactive access.
• Capturing user access patterns in such distributed information environments is called mining path traversal patterns.
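A very reduced sketch of the idea is to count how often users move from one page to the next. This is a simplification, not the full path traversal mining technique, and the sessions below are hypothetical page sequences invented for illustration.

```python
from collections import Counter

# HYPOTHETICAL sessions: each list is the ordered pages one user visited.
sessions = [
    ["home", "syllabus", "result"],
    ["home", "timetable"],
    ["home", "syllabus", "timetable"],
]


def traversal_counts(all_sessions):
    """Count each consecutive (from_page, to_page) step across all sessions."""
    counts = Counter()
    for pages in all_sessions:
        counts.update(zip(pages, pages[1:]))
    return counts


counts = traversal_counts(sessions)
for (src, dst), n in counts.most_common():
    print(src, "->", dst, n)
```

In practice the sessions would first have to be reconstructed from server logs, for example by grouping entries by client IP and time.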

EMAIL HEADER
Delivered-To: ranjansingh06@gmail.com
Received: by 10.115.55.2 with SMTP id h2cs59002wak;
        Wed, 8 Apr 2009 10:38:05 -0700 (PDT)
Received: by 10.210.53.5 with SMTP id b5mr3667848eba.12.1239212284303;
        Wed, 08 Apr 2009 10:38:04 -0700 (PDT)
Return-Path: <manmohansingh@gmail.com>
Received: from Bumba.profithost.net ([89.248.172.66])
        by mx.google.com with ESMTP id 8si8244998ewy.109.2009.04.08.10.38.03;
        Wed, 08 Apr 2009 10:38:04 -0700 (PDT)
Received-SPF: neutral (google.com: 89.248.172.66 is neither permitted nor denied by domain of manmohansingh@gmail.com) client-ip=89.248.172.66;
Authentication-Results: mx.google.com; spf=neutral (google.com: 89.248.172.66 is neither permitted nor denied by domain of manmohansingh@gmail.com) smtp.mail=manmohansingh@gmail.com
Received: from localhost ([127.0.0.1] helo=fakesend.com)
        by Bumba.profithost.net with esmtp (Exim 4.67)
        (envelope-from <manmohansingh@gmail.com>) id 1Lrcf9-0007hi-8i
        for ranjansingh06@gmail.com; Wed, 08 Apr 2009 13:38:15 -0500
Date: Wed, 8 Apr 2009 13:38:15 -0500
To: ranjansingh06@gmail.com
From: dr manmohan singh <manmohansingh@gmail.com>
Subject: appointment
Message-ID: <ddbe7ca7a7f3766f4b133647a88e0d4b@fakesend.com>
X-Priority: 3
X-Mailer: PHPMailer [version 1.73]
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="iso-8859-1"

congaratulation.............
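A header like this can be read programmatically with Python's standard email package. The sketch below uses an abridged copy of the header above (two of the Received lines and a few other fields) and compares the domain in From with the domain in Message-ID. Treating a mismatch as a warning sign is an assumption made for illustration; on its own it does not prove the message was forged.

```python
from email import message_from_string
from email.utils import parseaddr

# Abridged copy of the header shown above (continuation lines are indented).
RAW = "\n".join([
    "Return-Path: <manmohansingh@gmail.com>",
    "Received: from Bumba.profithost.net ([89.248.172.66])",
    "        by mx.google.com with ESMTP id 8si8244998ewy.109.2009.04.08.10.38.03;",
    "        Wed, 08 Apr 2009 10:38:04 -0700 (PDT)",
    "Received: from localhost ([127.0.0.1] helo=fakesend.com)",
    "        by Bumba.profithost.net with esmtp (Exim 4.67)",
    "        (envelope-from <manmohansingh@gmail.com>) id 1Lrcf9-0007hi-8i",
    "        for ranjansingh06@gmail.com; Wed, 08 Apr 2009 13:38:15 -0500",
    "Date: Wed, 8 Apr 2009 13:38:15 -0500",
    "To: ranjansingh06@gmail.com",
    "From: dr manmohan singh <manmohansingh@gmail.com>",
    "Subject: appointment",
    "Message-ID: <ddbe7ca7a7f3766f4b133647a88e0d4b@fakesend.com>",
    "X-Mailer: PHPMailer [version 1.73]",
    "",
    "congaratulation.............",
])

msg = message_from_string(RAW)

# Domain the sender claims to write from.
sender = parseaddr(msg["From"])[1]
from_domain = sender.split("@")[1]

# Domain of the system that generated the Message-ID.
msgid_domain = msg["Message-ID"].strip("<>").split("@")[1]

# Each Received header is one relay hop, most recent first.
hops = msg.get_all("Received")

print("From domain:      ", from_domain)
print("Message-ID domain:", msgid_domain)
print("Relay hops:       ", len(hops))
print("Domains differ:   ", from_domain != msgid_domain)
```

Reading the Received headers from bottom to top gives the path the message took, starting at the host that first accepted it.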

Make the appropriate changes in the C:\WINDOWS\system32\drivers\etc\Hosts file.

Multimedia databases

• Multimedia databases store image, audio and video data. They are used in applications such as picture content-based retrieval, voice-mail systems and the World Wide Web.

CDR

Call Data Records (CDRs) are similar to itemised phone bills, but they hold much more information. As well as the dates, times and numbers called, CDRs also include IMEI numbers, cell site data and locations.
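The kinds of fields named above can be sketched as a simple record type. Everything here is hypothetical: real CDR layouts differ between operators, the field names are assumptions, and the values are placeholders rather than real numbers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """HYPOTHETICAL CDR layout; real operator formats differ."""
    date: str
    time: str
    caller: str
    called: str
    imei: str
    cell_site: str
    location: str


# Placeholder records invented for illustration only.
records = [
    CallRecord("2020-01-01", "10:38:05", "caller-A", "number-X", "imei-1", "cell-17", "area-1"),
    CallRecord("2020-01-01", "11:02:40", "caller-A", "number-Y", "imei-1", "cell-17", "area-1"),
    CallRecord("2020-01-02", "09:15:00", "caller-A", "number-X", "imei-1", "cell-22", "area-2"),
]


def calls_per_number(recs):
    """Count how often each number was called."""
    return Counter(r.called for r in recs)


def cell_sites_used(recs):
    """List the distinct cell sites the calls were routed through."""
    return sorted({r.cell_site for r in recs})


print(calls_per_number(records).most_common())
print(cell_sites_used(records))
```

Grouping by called number and by cell site are two of the simplest questions an analyst might ask of such records.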

Sample CDR

Mobile Call Detail Record Analysis

Fusion Table in Google

Google Fusion Tables (or simply Fusion Tables) is a Web service provided by Google for data management. Data is stored in multiple tables that Internet users can view and download.

A similar map (again using Google Fusion Tables) will allow users to visually see the route.


https://support.google.com/fusiontables/answer/2527132?hl=en&topic=2573107&ctx=topic


Firewall logs
Firewall logs can provide valuable
information like source and destination
IP addresses, port numbers, and
protocols. You can also use the Windows
Firewall log file to monitor TCP and UDP
connections and packets that are
blocked by the firewall.
Firewall logs reveal a lot of information
about the security threat attempts at the
periphery of the network and on the
nature of traffic coming in and going out
of the firewall.
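As a sketch of how such a log might be summarised, the code below counts blocked connections per source IP. It assumes the space-delimited layout of the Windows Firewall log file, where a "#Fields:" header line names the columns; the sample entries are hypothetical and use documentation-reserved IP addresses.

```python
from collections import Counter

# HYPOTHETICAL sample in the assumed Windows Firewall log layout.
SAMPLE_LOG = """\
#Version: 1.5
#Software: Microsoft Windows Firewall
#Time Format: Local
#Fields: date time action protocol src-ip dst-ip src-port dst-port size tcpflags tcpsyn tcpack tcpwin icmptype icmpcode info path

2024-01-15 10:00:01 DROP TCP 198.51.100.7 192.0.2.10 51515 445 52 S 123456 0 8192 - - - RECEIVE
2024-01-15 10:00:02 ALLOW UDP 192.0.2.10 192.0.2.53 50000 53 0 - - - - - - - SEND
2024-01-15 10:00:05 DROP TCP 198.51.100.7 192.0.2.10 51516 3389 52 S 123999 0 8192 - - - RECEIVE
"""


def parse_firewall_log(text):
    """Turn log text into a list of dicts keyed by the names in '#Fields:'."""
    fields = []
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#"):
            rows.append(dict(zip(fields, line.split())))
    return rows


rows = parse_firewall_log(SAMPLE_LOG)

# Blocked packets grouped by the address that sent them.
blocked = Counter(r["src-ip"] for r in rows if r["action"] == "DROP")
print(blocked.most_common())
```

Because the parser takes its column names from the header line, it keeps working if a log lists the fields in a different order.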

Firewall log
