Bài 2. Dữ liệu

KHAI THÁC DỮ LIỆU & KHAI PHÁ TRI THỨC
DATA MINING & KNOWLEDGE DISCOVERY
Bài 2. Dữ liệu
Data
Mã MH: 505043
TS. HOÀNG Anh
1
Nội dung
n Dữ liệu và các loại thuộc tính/ Data Objects and Attribute Types
n Mô tả thống kê cơ bản/ Basic Statistical Descriptions of Data
n Trực quan hóa dữ liệu/ Data Visualization
n Đo lường sự tương quan/ Measuring Data Similarity_Dissimilarity
n Tổng kết/ Summary
2
Tập dữ liệu/ Data Sets
n Bản ghi/ Record
n Relational records
Data matrix, e.g., numerical matrix, crosstabs
timeout
n
season
coach
game
score
team
ball
lost
pla
wi
n
y
n Document data: text documents: term-
frequency vector
n Transaction data
Document 1 3 0 5 0 2 6 0 2 0 2
n Đồ thị và Mạng/ Graph and network
n World Wide Web Document 2 0 7 0 2 1 0 0 3 0 0
n Social or information networks Document 3 0 1 0 0 1 2 2 0 3 0

n Molecular Structures
n Ordered
n Video data: sequence of images TID Items
n Temporal data: time-series 1 Bread, Coke, Milk
n Sequential Data: transaction sequences 2 Beer, Bread
n Genetic sequence data
3 Beer, Coke, Diaper, Milk
n Dữ liệu không gian/ Spatial, image and multimedia:
4 Beer, Bread, Diaper, Milk
n Spatial data: maps
n Image data: 5 Coke, Diaper, Milk
n Video data:
3
1. Dữ liệu có cấu trúc
n Chiều/ Dimensionality
n Curse of dimensionality
n Sparsity
n Only presence counts
n Resolution
n Các mẫu phụ thuộc vào qui mô
n Phân bổ/ Distribution
n Trung tâm, và phân tán
4
Dữ liệu/ Data Objects
n Tập dữ liệu được tạo thành từ các đối tượng/điểm dữ liệu
n Một đối tượng dữ liệu đại diện cho một thực thể/ entity
n Ví dụ:
n sales database: customers, store items, sales
n medical database: patients, treatments
n university database: students, professors, courses
n Tên gọi khác: samples , examples, instances, data points,
objects, tuples.
n Các đối tượng dữ liệu được mô tả bởi các thuộc tính/
attributes.
n CSDL/ Database: hàng/rows -> data objects; cột/columns ->attributes
5
Thuộc tính/ Attributes
n Attribute (or dimensions, features, variables): đại diện
cho một đặc điểm của một đối tượng dữ liệu.
n E.g., customer _ID, name, address
n Bao gồm/ Types:

n Danh nghĩa/ Nominal
n Nhị phân/ Binary
n Số/ Numeric: quantitative
n Interval-scaled
n Ratio-scaled
6
Các loại thuộc tính
n Định danh/ Nominal: categories, states, or “names of things”
n Hair_color = {auburn, black, blond, brown, grey, red, white}
n marital status, occupation, ID numbers, zip codes
n Nhị phân/ Binary
n Thuộc tính nhị phân với 2 trạng thái (0 và 1)
n Nhị phân đối xứng: cả 2 trạng thái đều quan trọng như nhau
n e.g., gender
n Nhị phân bất đối xứng: cả 2 trạng thái không quan trọng như nhau
n e.g., medical test (positive vs. negative)
n Convention: assign 1 to most important outcome (e.g., HIV
positive)
n Thứ tự/ Ordinal
n Các giá trị có ý nghĩa thứ tự (xếp hang), nhưng độ lớn giữa các giá trị kế
tiếp không được biết
n Size = {small, medium, large}, grades, army rankings
7
Các loại thuộc tính số
n Định lượng/ Quantity (integer or real-valued)
n Khoảng/ Interval
n Measured on a scale of equal-sized units
n Values have order
n E.g., temperature in C˚or F˚, calendar dates
n No true zero-point
n Tỉ lệ/ Ratio
n Inherent zero-point
n We can speak of values as being an order of magnitude
larger than the unit of measurement (10 K˚ is twice as
high as 5 K˚).
n e.g., temperature in Kelvin, length, counts, monetary
quantities
8
Thuộc tính rời rạc, liên tục
n Rời rạc/ Discrete Attribute
n Chỉ có 1 bộ giá trị hữu hạn hoặc vô hạn đếm được
n E.g., zip codes, profession, or the set of words in a
collection of documents
n Đôi khi, được biểu diễn dưới dạng các biến số nguyên
n Lưu ý: Thuộc tính nhị phân là trường hợp đặc biệt của thuộc
tính rời rạc
n Liên tục/ Continuous Attribute
n Gía trị thuộc tính là số thực
n E.g., temperature, height, or weight
n Trên thực tế, các giá trị thực chỉ có thể được đo lường và biểu
diễn bằng 1 số chử số hữu hạn
n Các thuộc tính liên tục thường được biểu diễn dưới dạng các
biến dấu phẩy động
9
2. Mô tả thống kê cơ bản
n Lý do?/ Motivation
n Hiểu rỏ dữ liệu: xu hướng tập trung, biến thể, và giàn trải
n Độ phân tán/ Data dispersion characteristics
n median, max, min, quantiles, outliers, variance, etc.
n Các chiều được số hóa trên các khoảng:
n Dữ liệu phân tán: phân tích trên nhiều chiều
n Boxplot hoặc phân tích rời rạc trên các khoảng (sorted)
n Dispersion analysis on computed measures
n Folding measures into numerical dimensions
n Boxplot or quantile analysis on the transformed cube
10
Các đại lượng trung tâm
n Trung bình/ Mean (algebraic measure):
1 n
x = å xi µ= å x
Note: n is sample size and N is population size.
n i =1 N
n
n Weighted arithmetic mean: åw x i i
n Trimmed mean: chopping extreme values x= i =1

n
n Trung vị/ Median: åw

i =1
i
n Middle value if odd number of values, or average of the

middle two values otherwise
n Estimated by interpolation (for grouped data):
n / 2 - (å freq )l
median = L1 + ( ) width
n Yếu vị/ Mode freqmedian
n Value that occurs most frequently in the data
n Unimodal, bimodal, trimodal
n Empirical formula: mean - mode = 3 ´ (mean - median)
11
Dữ liệu đối xứng vs. lệch
n Median, mean and mode of symmetric, symmetric

positively and negatively skewed data
positively skewed negatively skewed
August 25, 2023 Data Mining: Concepts and Techniques 12

Các đại lượng Độ phân tán
n Quartiles, outliers and boxplots
n Bách phân vị/ Percnetile
n Tứ phân vị/ Quartiles: Q1 (25th percentile), Q3 (75th percentile)
n Khoảng trải giữa/ Inter-quartile range: IQR = Q3 – Q1
n Five number summary: min, Q1, median, Q3, max
n Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and
plot outliers individually
n Outlier: usually, a value higher/lower than 1.5 x IQR
n Phương sai/ Variance, Độ lệch chuyển/ standard deviation (s, σ)
n Variance: (algebraic, scalable computation)
1 n 1 n 2 1 n 1 n
1 n
s =
2
å
n - 1 i =1
( xi - x ) =
2
[å xi - (å xi ) 2 ]
n - 1 i =1 n i =1
s =
2
N
å
i =1
( xi - µ 2
) =
N
åx
i =1
i
2
- µ2
n Standard deviation s (or σ) is the square root of variance s2 (or σ2)

13
Boxplot Analysis
n Five-number summary of a distribution

n Minimum, Q1, Median, Q3, Maximum
n Boxplot
n Data is represented with a box
n The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
n The median is marked by a line within the box
n Whiskers: two lines outside the box extended to
Minimum and Maximum
n Outliers: points beyond a specified outlier
threshold, plotted individually
14
Trực quan hoá phân tán dữ liệu: 3-D Boxplots
August 25, 2023 Data Mining: Concepts and Techniques 15

Những thuộc tính của phân phối chuẩn
n Hình dạnh phân phối chuẩn/ The normal (distribution) curve

n From μ–σ to μ+σ: contains about 68% of the measurements
(μ: mean, σ: standard deviation)

n From μ–2σ to μ+2σ: contains about 95% of it
n From μ–3σ to μ+3σ: contains about 99.7% of it
16
Đồ thị mô tả thống kê cơ bản
n Boxplot: graphic display of five-number summary

n Histogram: x-axis are values, y-axis repres. frequencies
n Quantile plot: each value xi is paired with fi indicating that
approximately 100 fi % of data are £ xi
n Quantile-quantile (q-q) plot: graphs the quantiles of one
univariant distribution against the corresponding quantiles of
another
n Scatter plot: each pair of values is a pair of coordinates and
plotted as points in the plane
17
Biểu đồ Histogram
n Histogram: một dạng biểu đồ cột được
sử dụng để mô tả trực quan sự phân bố 40
tần suất cho tập dữ liệu. 35
30
n Trung tâm về mặt vị trí của tập dữ liệu; 25
n Độ phân tán của tập dữ liệu;
20
n Độ lệch của tập dữ liệu;
15
n Sự hiện diện của các giá trị ngoại lệ
(outliers); 10
n Sự hiện diện của các yếu vị (mode) 5

trong tập dữ liệu. 0
10000 30000 50000 70000 90000
18
Biểu đồ Histogram nhiều thông tin hơn Boxplot
n Hai đồ thị bên trái có thể

có cùng biểu diễn
Boxplot
n Cùng giá trị: min, Q1,
median, Q3, max
n Nhưng chúng có phân

phối dữ liệu khá khác
nhau
19
Biểu đồ rời rạc hóa
n Hiển thị tất cả dữ liệu (cho phép người dùng đánh gía cả hành vi
tổng thể và các sự cố bất thường)
n Thông tin biểu đồ rời rạc hóa/ lượng tử
n For a data xi data sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the
value xi
Data Mining: Concepts and Techniques 20

Biểu đồ Quantile-Quantile (Q-Q)
n Vẽ đồ thị các phân vị của một phân phối đơn biến so với các phân vị tương
ứng của một phân phối khác
n Có sự thay đổi trong việc chuyển từ phân phối này sang phân phối khác
không?
n Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.
21
Biểu đồ phân tán/ Scatter plot
n Cung cấp cái nhìn đầu tiên về dữ liệu 2 biến số để xem các cụm
điểm, giá trị ngoại lệ, v.v.
n Mỗi cặp giá trị được coi là 1 cặp tọa độ, và được vẽ dưới dạng
các điểm trong mặt phẳng
22
Dữ liệu tương quan dương, âm
n The left half fragment is positively

correlated
n The right half is negative correlated
23
Dữ liệu không tương quan
24
3. Trực quan hóa dữ liệu
n Tại sao? Why data visualization?
n Hiểu rỏ hơn về không gian thông tin bằng cách ánh xạ dữ liệu lên các tọa độ
n Cung cấp tổng quan định tính về tập dữ liệu lớn
n Tìm kiếm các mẫu, xu hướng, cấu trúc, bất thường, mối quan hệ giữa dữ liệu
n Giúp tìm các khu vực thú vị và các tham số phù hợp để phân tích định lượng
n Cung cấp bằng chứng trực quan
n Phân loại các phương pháp trực quan hóa dữ liệu:

n Trực quan theo điểm/ Pixel-oriented visualization techniques
n Trưc quan theo phép chiếu hình học/ Geometric projection visualization
techniques
n Trực quan theo biểu tượng/ Icon-based visualization techniques
n Trực quan theo phân cấp/ Hierarchical visualization techniques
n Trực quan hóa dữ liệu và quan hệ phức tạp/ Visualizing complex data and
relations
25
Kỹ thuật trực quan hóa theo định hướng điểm ảnh
n Với dữ liệu m chiều: tạo m cửa sổ trên màn hình, mỗi cửa sổ cho 1 chiều
n Những giá trị m chiều của 1 bản ghi, được ánh xạ tới m pixel tại các vị trí
tương ứng trong cửa sổ
n Màu sắc của các pixel phản ánh các giá trị tương ứng
(a) Income (b) Credit Limit (c) transaction volume (d) age
26
Biểu diễn các pixel trong các đoạn hình tròn
n Để tiết kiệm không gian và hiển thị các kết nối giữa nhiều chiều, việc lấp đầy
khoảng trống thường được thực hiện trong một đoạn hình tròn
(a) Representing a data record in

(b) Laying out pixels in circle segment
circle segment
27
Trực quan hóa dữ liệu theo phép chiếu hình học
n Trực quan hóa các phép biến đổi hình học và phép
chiếu của dữ liệu
n Phương pháp/ Methods
n Trực quan hóa trực tiếp/ Direct visualization
n Biểu đồ phân tán/ Scatterplot and scatterplot matrices
n Landscapes
n Projection pursuit technique: Help users find meaningful
projections of multidimensional data
n Prosection views
n Siêu lát cắt/ Hyperslice
n Tọa độ song song/ Parallel coordinates
28
Ribbons with Twists Based on Vorticity
Trực quan hóa dữ liệu trực tiếp
Data Mining: Concepts and Techniques 29

Ma trận biểu đồ phân tán
Used by ermission of M. Ward, Worcester Polytechnic Institute
Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k2/2-k) scatterplots]

30
Landscapes
Used by permission of B. Wright, Visible Decisions Inc.
news articles
visualized as
a landscape
n Visualization of the data as perspective landscape

n The data needs to be transformed into a (possibly artificial) 2D spatial
representation which preserves the characteristics of the data
31
Tọa độ song song
n n trục cách đều song song với 1 trong các trục màn hình và tương ứng với các
thuộc tính
n Các trục được chia tỉ lệ theo phạm vi [tối thiểu, tối đa] của thuộc tính tương
ứng
n Mỗi mục dữ liệu tương ứng với 1 đường đa giác cắt mỗi trục tại điểm tương
ứng với giá trị của thuộc tính
• • •
Attr. 1 Attr. 2 Attr. 3 Attr. k

32
Tọa độ song song của tập dữ liệu
33
Trực quan hóa dữ liệu dựa trên biểu tượng
n Trực quan hóa các giá trị dữ liệu dưới dạng các tính năng của
biểu tượng
n Phương pháp trực quan điển hình:
n Chernoff Faces
n Stick Figures
n Kỹ thuật chung:
n Mã hóa hình dạng: Sử dụng hình dạng để biểu thị mã hóa
thông tin nhất định
n Biểu tượng màu: Sử dụng biểu tượng màu để mã hóa thêm
thông tin
n Thanh ô xếp: Sử dụng các biểu tượng nhỏ để biểu thị các
véc-tơ đặc trưng có liên quan trong truy xuất tài liệu
34
Chernoff Faces
n A way to display variables on a two-dimensional surface, e.g., let x be

eyebrow slant, y be eye size, z be nose length, etc.
n The figure shows faces produced using 10 characteristics--head eccentricity,
eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size,
mouth shape, mouth size, and mouth opening): Each assigned one of 10
possible values, generated using Mathematica (S. Dickson)
n REFERENCE: Gonick, L. and Smith, W. The

Cartoon Guide to Statistics. New York: Harper
Perennial, p. 212, 1993
n Weisstein, Eric W. "Chernoff Face." From
MathWorld--A Wolfram Web Resource.
mathworld.wolfram.com/ChernoffFace.html
35
Stick Figure
A census data
figure showing
age, income,
used by permission of G. Grinstein, University of Massachusettes at Lowell
gender,
education, etc.
A 5-piece stick
figure (1 body
and 4 limbs w.
different
angle/length)
Two attributes mapped to axes, remaining attributes mapped to angle or length of limbs”. Look at texture pattern 36
Kỹ thuật trực quan hóa phân cấp
n Trực quan hóa dữ liệu sử dụng phân vùng phân cấp

thành các không gian con
n Phương pháp/ Methods

n Dimensional Stacking
n Worlds-within-Worlds
n Tree-Map
n Cone Trees
n InfoCube
37
Xếp chồng chiều/ Dimensional Stacking
attribute 4
attribute 2
attribute 3
attribute 1
n Partitioning of the n-dimensional attribute space in 2-D

subspaces, which are ‘stacked’ into each other
n Partitioning of the attribute value ranges into classes. The
important attributes should be used on the outer levels.
n Adequate for data with ordinal attributes of low cardinality
n But, difficult to display more than nine dimensions
n Important to map dimensions appropriately
38
Dimensional Stacking
Used by permission of M. Ward, Worcester Polytechnic Institute
Visualization of oil mining data with longitude and latitude mapped to the outer
x-, y-axes and ore grade and depth mapped to the inner x-, y-axes
39
Worlds-within-Worlds
n Assign the function and two most important parameters to innermost world
n Fix all other parameters at constant values - draw other (1 or 2 or 3 dimensional
worlds choosing these as the axes)
n Software that uses this paradigm
n N–vision: Dynamic
interaction through data
glove and stereo displays,
including rotation, scaling
(inner) and translation
(inner/outer)
n Auto Visual: Static
interaction by means of
queries
40
Bản đồ cây/ Tree-Map
n Screen-filling method which uses a hierarchical partitioning of the
screen into regions depending on the attribute values
n The x- and y-dimension of the screen are partitioned alternately
according to the attribute values (classes)
MSR Netscan Image
Ack.: http://www.cs.umd.edu/hcil/treemap-history/all102001.jpg 41
Tree-Map of a File System (Schneiderman)
42
InfoCube
n A 3-D visualization technique where hierarchical
information is displayed as nested semi-transparent
cubes
n The outermost cubes correspond to the top level data,
while the subnodes or the lower level data are
represented as smaller cubes inside the outermost cubes,
and so on
43
Three-D Cone Trees
n 3D cone tree visualization technique works
well for up to a thousand nodes or so
n First build a 2D circle tree that arranges its
nodes in concentric circles centered on the root
node
n Cannot avoid overlaps when projected to 2D
n G. Robertson, J. Mackinlay, S. Card. “Cone
Trees: Animated 3D Visualizations of
Hierarchical Information”, ACM SIGCHI'91
n Graph from Nadeau Software Consulting
website: Visualize a social network data set that
models the way an infection spreads from one
person to the next
Ack.: http://nadeausoftware.com/articles/visualization
44
Trực quan hóa dữ liệu và quan hệ phức tạp
n Trực quan hóa dữ liệu phi số: văn bản và mạng xã hội
n Đám mây thẻ/ tag: Trực quan hóa các thẻ do người dùng tạo
n The importance of tag is
represented by font
size/color
n Bên cạnh các dữ liệu

văn bản: còn có các
phương pháp để trực
quan hóa các mối quan
hệ, chẳng hạn như trực
quan hóa các mạng xã
hội
Newsmap: Google News Stories in 2005
4. Tương đồng và không tương đồng
n Tương đồng/ Similarity
n Số đo mức độ giống nhau của 2 đối tượng dữ liệu
n Giá trị cao hơn khi các đối tượng giống nhau hơn
n Thường rơi vào khoảng [0,1]
n Không tương đồng/ Dissimilarity (e.g., khoảng cách/ distance)

n Số đo mức độ khác nhau của 2 đối tượng dữ liệu
n Giá trị thấp hơn khi các đối tượng giống nhau hơn
n Sự khác biệt tối thiểu thường là 0
n Giới hạn trên thường thay đổi
n Tiệm cận/ Proximity đề cập đến sự tương đồng hoặc không

tương đồng
46
Ma trận dữ liệu và ma trận không tương đồng
n Ma trận dữ liệu/Data matrix

n n điểm dữ liệu với p é x11 ... x1f ... x1p ù
ê ú
chiều ê ... ... ... ... ... ú
n 2 modes
êx ... xif ... xip ú
ê i1 ú
ê ... ... ... ... ... ú
êx ... xnf ... xnp úú
êë n1 û
n Ma trận không tương đồng/
Dissimilarity matrix é 0 ù
ê d(2,1) 0 ú
n n điểm dữ liệu ê ú
ê d(3,1) d ( 3,2) 0 ú
n Ma trận tam giác ê ú
ê : : : ú
n 1 mode
êëd ( n,1) d ( n,2) ... ... 0úû
47
Thước đo tiệm cận cho các thuộc tính danh nghĩa
n Có thể có 2 trạng thái trở lên, ví dụ đỏ, vàng, xanh

dương, xanh lá cây (khái quát hóa thuộc tính nhị phân)
n Method 1: Kết hợp đơn giản/ Simple matching
n m: # of matches, p: total # of variables
d (i, j) = p - p
m
n Method 2: Sử dung một số lượng lớn các thuộc tính

nhị phân
n Tạo 1 thuộc tính nhị phân mới cho mỗi trạng thái
danh nghĩa M
48
Thước đo tiệm cận cho các thuộc tính nhị phân
Object j
n Bảng dự phòng cho dữ liệu nhị phân
Object i
n Đo khoảng cách cho các biến nhị phân

đối xứng
n Đo khoảng cách cho các biến nhị phân
không đối xứng
n Jaccard coefficient (similarity measure
for asymmetric binary variables):
n Note: Jaccard coefficient is the same as “coherence”:
49
Sự khác biệt giữa các biến nhị phân
n Ví dụ
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
n Gender là 1 thuộc tính đối xứng

n Các thuộc tính còn lại là nhị phân bất đối xứng
n Đặt các giá trị Y và P là 1, N là 0
0+1
d ( jack , mary ) = = 0.33
2+ 0+1
1+1
d ( jack , jim ) = = 0.67
1+1+1
1+ 2
d ( jim , mary ) = = 0.75
1+1+ 2
50
Chuẩn hóa dữ liệu số
x
z= s - µ
n Z-score:
n x: dữ liệu gốc được chuẩn hóa, μ: giá trị trung bình (của population), σ:
độ lệch chuẩn
n Khoảng cách giữa điểm dữ liệu gốc và giá trị trung bình theo đơn vị của
độ lệch chuẩn
n Giá trị âm khi điểm dữ liệu dưới mức trung bình, và ngược lại
n Cách khác: tính độ lệch tuyệt đối trung bình
s f = 1n (| x1 f - m f | + | x2 f - m f | +...+ | xnf - m f |)
Ở đây, m f = 1n (x1 f + x2 f + ... + xnf )
xif - m f
.
zif = sf
n Chuẩn hóa (z-score):
n Thường sử dụng độ lệch tuyệt đối trung bình nhiều hơn độ lệch chuẩn
51
Ma trận dữ liệu và ma trận không tương đồng
Data Matrix
x2 x4 point attribute1 attribute2
x1 1 2
4 x2 3 5
x3 2 0
x4 4 5
2 x1 Dissimilarity Matrix
(with Euclidean Distance)
x1 x2 x3 x4
x3
x1 0
0 2 4
x2 3.61 0
x3 5.1 5.1 0
x4 4.24 1 5.39 0
52
Khoảng cách trên dữ liệu số: Minkowski Distance
n Minkowski distance: thước đo khoảng cách phổ biến
Ở đây i = (xi1, xi2, …, xip) và j = (xj1, xj2, …, xjp) là 2 điểm dữ

liệu p-chiều, và h là thứ tự (the distance so defined is also
called L-h norm)
n Đặc tính/ Properties
n d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
n d(i, j) = d(j, i) (Symmetry)
n d(i, j) £ d(i, k) + d(k, j) (Triangle Inequality)
n Khoảng cách thỏa mãn các đặc tính trên: metric
53
Các trường hợp đặc biệt: Minkowski Distance
n h = 1: Manhattan (city block, L1 norm) distance

n E.g., the Hamming distance: the number of bits that are different
between two binary vectors

d (i, j) =| x - x | + | x - x | +...+ | x - x |
i1 j1 i2 j 2 ip jp
n h = 2: (L2 norm) Euclidean distance

d (i, j) = (| x - x |2 + | x - x |2 +...+ | x - x |2 )
i1 j1 i2 j 2 ip jp
n h ® ¥. “supremum” (Lmax norm, L¥ norm) distance.

n This is the maximum difference between any component (attribute) of
the vectors
54
Example: Minkowski Distance
Dissimilarity Matrices
point attribute 1 attribute 2 Manhattan (L1)
x1 1 2
L x1 x2 x3 x4
x2 3 5 x1 0
x3 2 0 x2 5 0
x4 4 5 x3 3 6 0
x4 6 1 7 0
Euclidean (L2)
x2 x4
L2 x1 x2 x3 x4
4 x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
2 x1
Supremum
L¥ x1 x2 x3 x4
x1 0
x2 3 0
x3 x3 2 5 0
0 2 4 x4 3 1 5 0
55
Biến thông thường/ Ordinal Variables
n An ordinal variable can be discrete or continuous

n Order is important, e.g., rank
n Can be treated like interval-scaled
n replace xif by their rank rif Î{1,..., M f }
n map the range of each variable onto [0, 1] by replacing i-th
object in the f-th variable by
rif -1
zif =
M f -1
n compute the dissimilarity using methods for interval-scaled

variables
56
Thuộc tính của loại hỗn hợp
n A database may contain all attribute types

n Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
n One may use a weighted formula to combine their effects
S pf = 1d ij( f ) dij( f )
d (i, j) =
S pf = 1d ij( f )
n f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
n f is numeric: use the normalized distance
n f is ordinal
n Compute ranks rif and
zif = rif - 1
n Treat zif as interval-scaled M f -1

57
Cosine Similarity
n A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.
n Other vector objects: gene features in micro-arrays, …

n Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
n Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d
58
Example: Cosine Similarity
n cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d|: the length of vector d
n Example: Find the similarity between documents 1 and 2.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12
cos(d1, d2 ) = 0.94
59
5. Tổng kết
n Các loại thuộc tính dữ lieu: danh nghĩa, nhị phân, thứ tự, chia khoảng, chia tỉ lệ
n Nhiều loại tập dữ lieu: ví dụ, số, văn bản, biểu đồ, web, hình ảnh
n Hiểu rỏ hơn về dữ liệu bằng cách:
n Thống kê cơ bản: xu hướng trung tâm, độ phân tán, hiển thị đồ họa
n Trực quan hóa dữ liệu: ánh xạ dữ liệu lên các tọa độ nguyên mẫu
n Đo độ tương tự của dữ liệu
n Các bước trên bắt đầu cho tiền xử lý dữ liệu

n Nhiều phương pháp đã được phát triển, nhưng vẫn là một lĩnh vực nghiên cứu
tích cực
60
Tham khảo/ References
n W. Cleveland, Visualizing Data, Hobart Press, 1993
n T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley,
2003
n U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and
Knowledge Discovery, Morgan Kaufmann, 2001
n L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster
Analysis. John Wiley & Sons, 1990.
n H. V. Jagadish, et al., Special Issue on Data Reduction Techniques. Bulletin of the
Tech. Committee on Data Eng., 20(4), Dec. 1997
n D. A. Keim. Information visualization and visual data mining, IEEE trans. on
Visualization and Computer Graphics, 8(1), 2002
n D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999
n S. Santini and R. Jain,” Similarity measures”, IEEE Trans. on Pattern Analysis and
Machine Intelligence, 21(9), 1999
n E. R. Tufte. The Visual Display of Quantitative Information, 2nd ed., Graphics Press,
2001
n C. Yu , et al., Visual data mining of multimedia data for social and behavioral studies,
Information Visualization, 8(1), 2009 61

Bài 2. Dữ liệu

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Bài 2. Dữ liệu

Uploaded by

Copyright:

Available Formats

KHAI THÁC DỮ LIỆU & KHAI PHÁ TRI THỨC

DATA MINING & KNOWLEDGE DISCOVERY

n Mô tả thống kê cơ bản/ Basic Statistical Descriptions of Data

n Trực quan hóa dữ liệu/ Data Visualization

n Đo lường sự tương quan/ Measuring Data Similarity_Dissimilarity

n Tổng kết/ Summary

n Social or information networks Document 3 0 1 0 0 1 2 2 0 3 0

n Bao gồm/ Types:

n Nhị phân/ Binary

n Số/ Numeric: quantitative

n E.g., zip codes, profession, or the set of words in a

n E.g., temperature, height, or weight

n Trimmed mean: chopping extreme values x= i =1

n Trung vị/ Median: åw

n Middle value if odd number of values, or average of the

n Median, mean and mode of symmetric, symmetric

positively skewed negatively skewed

August 25, 2023 Data Mining: Concepts and Techniques 12

n Standard deviation s (or σ) is the square root of variance s2 (or σ2)

n Five-number summary of a distribution

August 25, 2023 Data Mining: Concepts and Techniques 15

n Hình dạnh phân phối chuẩn/ The normal (distribution) curve

(μ: mean, σ: standard deviation)

n From μ–3σ to μ+3σ: contains about 99.7% of it

n Boxplot: graphic display of five-number summary

n Sự hiện diện của các yếu vị (mode) 5

n Hai đồ thị bên trái có thể

n Nhưng chúng có phân

Data Mining: Concepts and Techniques 20

n The left half fragment is positively

n Phân loại các phương pháp trực quan hóa dữ liệu:

(a) Representing a data record in

Data Mining: Concepts and Techniques 29

Used by ermission of M. Ward, Worcester Polytechnic Institute

Matrix of scatterplots (x-y-diagrams) of the k-dim. data [total of (k2/2-k) scatterplots]

Used by permission of B. Wright, Visible Decisions Inc.

n Visualization of the data as perspective landscape

Attr. 1 Attr. 2 Attr. 3 Attr. k

n A way to display variables on a two-dimensional surface, e.g., let x be

n REFERENCE: Gonick, L. and Smith, W. The

n Trực quan hóa dữ liệu sử dụng phân vùng phân cấp

n Phương pháp/ Methods

n Partitioning of the n-dimensional attribute space in 2-D

MSR Netscan Image

n Bên cạnh các dữ liệu

n Không tương đồng/ Dissimilarity (e.g., khoảng cách/ distance)

n Tiệm cận/ Proximity đề cập đến sự tương đồng hoặc không

n Ma trận dữ liệu/Data matrix

n Có thể có 2 trạng thái trở lên, ví dụ đỏ, vàng, xanh

n Method 2: Sử dung một số lượng lớn các thuộc tính

n Đo khoảng cách cho các biến nhị phân

n Note: Jaccard coefficient is the same as “coherence”:

n Gender là 1 thuộc tính đối xứng

Ở đây i = (xi1, xi2, …, xip) và j = (xj1, xj2, …, xjp) là 2 điểm dữ

n h = 1: Manhattan (city block, L1 norm) distance

between two binary vectors

n h = 2: (L2 norm) Euclidean distance

n h ® ¥. “supremum” (Lmax norm, L¥ norm) distance.

n An ordinal variable can be discrete or continuous

n compute the dissimilarity using methods for interval-scaled

n A database may contain all attribute types

n Treat zif as interval-scaled M f -1

n Other vector objects: gene features in micro-arrays, …

n Example: Find the similarity between documents 1 and 2.