Data Mining: Introduction

Lecture Notes for Chapter 1, Introduction to Data Mining
by Tan, Steinbach, Kumar

© Tan,Steinbach, Kumar

Introduction to Data Mining



Why Mine Data? Commercial Viewpoint

- Lots of data is being collected and warehoused
    - Web data, e-commerce
    - Purchases at department/grocery stores
    - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
    - Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Introduction to Data Mining 4/18/2004 2



Why Mine Data? Scientific Viewpoint

- Data collected and stored at enormous speeds (GB/hour)
    - Remote sensors on a satellite
    - Telescopes scanning the skies
    - Microarrays generating gene expression data
    - Scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining may help scientists
    - in classifying and segmenting data
    - in hypothesis formation

Mining Large Data Sets - Motivation

- There is often information "hidden" in the data that is not readily evident
- Human analysts may take weeks to discover useful information
- Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995 to 1999.]

From: R. Grossman, C. Kamath, V. Kumar, "Data Mining for Scientific and Engineering Applications"

What is Data Mining?

- Many definitions
    - Non-trivial extraction of implicit, previously unknown and potentially useful information from data
    - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

What is (not) Data Mining?

- What is not Data Mining?
    - Look up a phone number in a phone directory
    - Query a Web search engine for information about "Amazon"
- What is Data Mining?
    - Certain names are more prevalent in certain US locations (O'Brien, O'Rurke, O'Reilly... in the Boston area)
    - Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)

Origins of Data Mining

- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
    - Enormity of data
    - High dimensionality of data
    - Heterogeneous, distributed nature of data

[Figure: Data Mining shown at the intersection of Statistics, Machine Learning/AI, Pattern Recognition, and Database Systems.]

Data Mining Tasks

- Prediction Methods
    - Use some variables to predict unknown or future values of other variables.
- Description Methods
    - Find human-interpretable patterns that describe the data.

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...

- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]

Classification: Definition

- Given a collection of records (training set)
    - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
    - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
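The train/test procedure described in this definition can be sketched in a few lines of Python. This is a minimal illustration, not a method from the slides: the record layout, the `split_records` and `accuracy` helpers, the 30% test fraction, and the majority-class "model" are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter

def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority_model(training_set):
    """Placeholder model (an assumption): always predict the most common training class."""
    majority = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical labeled records of the form (attributes, class).
records = [({"income": 20 + 10 * i}, "Yes" if i % 3 == 0 else "No") for i in range(12)]

training_set, test_set = split_records(records)
model = train_majority_model(training_set)      # built from the training set only
test_accuracy = accuracy(model, test_set)       # validated on records the model never saw
print(len(training_set), len(test_set), test_accuracy)
```

Any real classifier would replace `train_majority_model`; the point of the sketch is only that the model is built from one part of the data and its accuracy is measured on the other.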

Classification Example

Training Set (Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Cheat: class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: Training Set --> Learn Classifier --> Model; the Test Set is then applied to the Model.]
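To make the idea of a "model" concrete, the records of this example can be encoded in Python and run through a small rule set. The rules in `predict_cheat` are a hypothetical, hand-written stand-in for a learned classifier; they are not given in the slides and were chosen only because they happen to agree with all ten training records.

```python
# Training records from the example: (Tid, Refund, Marital Status, Taxable Income in K, Cheat).
TRAINING_SET = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# Test records from the example: (Refund, Marital Status, Taxable Income in K); class unknown.
TEST_SET = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def predict_cheat(refund, marital_status, income_k):
    """Hypothetical hand-written model (an assumption, not learned and not from the slides)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # The 80K threshold is an assumed cut point that separates the training records.
    return "Yes" if income_k >= 80 else "No"

# Count how many training records the rule set gets wrong.
training_errors = sum(
    1
    for _, refund, status, income, cheat in TRAINING_SET
    if predict_cheat(refund, status, income) != cheat
)

# Assign a class to each previously unseen test record.
predictions = [predict_cheat(*record) for record in TEST_SET]
print(training_errors, predictions)
```

Agreement with the training set says nothing about accuracy on unseen records; that is what a labeled test set is for.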

Classification: Application 1

- Direct Marketing
    - Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
    - Approach:
        - Use the data for a similar product introduced before.
        - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
        - Collect various demographic, lifestyle, and company-interaction related information about all such customers (type of business, where they stay, how much they earn, etc.).
        - Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 2

- Fraud Detection
    - Goal: Predict fraudulent cases in credit card transactions.
    - Approach:
        - Use credit card transactions and the information on the account holder as attributes (when does a customer buy, what does he buy, how often does he pay on time, etc.).
        - Label past transactions as fraud or fair transactions. This forms the class attribute.
        - Learn a model for the class of the transactions.
        - Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3

- Customer Attrition/Churn
    - Goal: To predict whether a customer is likely to be lost to a competitor.
    - Approach:
        - Use detailed records of transactions with each of the past and present customers to find attributes (how often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc.).
        - Label the customers as loyal or disloyal.
        - Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 4

- Sky Survey Cataloging
    - Goal: To predict the class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
        - 3000 images with 23,040 x 23,040 pixels per image.
    - Approach:
        - Segment the image.
        - Measure image attributes (features), 40 of them per object.
        - Model the class based on these features.
        - Success story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find!

From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies

Courtesy: http://aps.umn.edu

[Figure: example galaxy images labeled Early, Intermediate, and Late.]

- Class:
    - Stages of formation
- Attributes:
    - Image features
    - Characteristics of light waves received, etc.
- Data Size:
    - 72 million stars, 20 million galaxies
    - Object Catalog: 9 GB
    - Image Database: 150 GB

Clustering Definition

- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
    - Data points in one cluster are more similar to one another.
    - Data points in separate clusters are less similar to one another.
- Similarity Measures:
    - Euclidean distance if attributes are continuous.
    - Other problem-specific measures.
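As a rough illustration of clustering with Euclidean distance, the sketch below implements a bare-bones k-means loop. The slides do not name a clustering algorithm, so k-means, the naive "first k points" initialization, the fixed iteration count, and the sample points are all assumptions made for the example.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def k_means(points, k, iterations=10):
    """Minimal k-means sketch: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [points[i] for i in range(k)]  # naive initialization (an assumption)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical 2-D data with two visibly separate groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
clusters, centroids = k_means(points, k=2)
print(clusters)
print(centroids)
```

On this data the points near (1, 1) end up in one cluster and the points near (8.5, 8.5) in the other, so distances within a cluster are small and distances between clusters are large.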

Illustrating Clustering

- Euclidean distance based clustering in 3-D space.
    - Intracluster distances are minimized.
    - Intercluster distances are maximized.

Clustering: Application 1

- Market Segmentation
    - Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
    - Approach:
        - Collect different attributes of customers based on their geographical and lifestyle related information.
        - Find clusters of similar customers.
        - Measure the clustering quality by observing buying patterns of customers in the same cluster vs. those from different clusters.

Clustering: Application 2

- Document Clustering
    - Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
    - Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
    - Gain: Information Retrieval can utilize the clusters to relate a new document or search term to clustered documents.
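A similarity measure built from term frequencies might look like the following sketch. Cosine similarity over raw term counts is an assumed choice (the slides only say the measure is based on term frequencies), and the three short documents are invented for illustration.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase, whitespace-separated term occurs in a document."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors (1.0 = same direction)."""
    shared_terms = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents, not taken from the slides.
finance_1 = term_frequencies("stocks fell as interest rates rose")
finance_2 = term_frequencies("interest rates and stocks moved together")
sports_1 = term_frequencies("the team won the final match")

sim_finance = cosine_similarity(finance_1, finance_2)   # share "stocks", "interest", "rates"
sim_mixed = cosine_similarity(finance_1, sports_1)      # share no terms
print(sim_finance, sim_mixed)
```

A clustering step would then group documents whose pairwise similarity is high; in practice very common words are usually filtered out first so that they do not dominate the counts.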

Illustrating Document Clustering

- Clustering points: 3204 articles of the Los Angeles Times.
- Similarity measure: How many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278

Clustering of S&P 500 Stock Data

- Observe stock movements every day.
- Clustering points: Stock-{UP/DOWN}
- Similarity measure: Two points are more similar if the events described by them frequently happen together on the same day.
    - We used association rules to quantify a similarity measure.

Discovered Clusters

Cluster 1 (Industry Group: Technology1-DOWN):
    Applied-Matl-DOWN, Bay-Network-Down, 3-COM-DOWN, Cabletron-Sys-DOWN, CISCO-DOWN, HP-DOWN, DSC-Comm-DOWN, INTEL-DOWN, LSI-Logic-DOWN, Micron-Tech-DOWN, Texas-Inst-Down, Tellabs-Inc-Down, Natl-Semiconduct-DOWN, Oracl-DOWN, SGI-DOWN, Sun-DOWN

Cluster 2 (Industry Group: Technology2-DOWN):
    Apple-Comp-DOWN, Autodesk-DOWN, DEC-DOWN, ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN

Cluster 3 (Industry Group: Financial-DOWN):
    Fannie-Mae-DOWN, Fed-Home-Loan-DOWN, MBNA-Corp-DOWN, Morgan-Stanley-DOWN

Cluster 4 (Industry Group: Oil-UP):
    Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP

Association Rule Discovery: Definition

- Given a set of records, each of which contains some number of items from a given collection:
    - Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
    {Milk} --> {Coke}
    {Diaper, Milk} --> {Beer}
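One common way to judge how strongly the five transactions back a rule is to compute its support and confidence. These two measures are not named on this slide, so treating them as the scoring criteria is an assumption; the transactions and the two rules are the ones from the example.

```python
# The five transactions from the example, keyed by TID.
TRANSACTIONS = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count_containing(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for items in TRANSACTIONS.values() if itemset <= items)

def support(antecedent, consequent):
    """Fraction of all transactions that contain both sides of the rule."""
    return count_containing(antecedent | consequent) / len(TRANSACTIONS)

def confidence(antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return count_containing(antecedent | consequent) / count_containing(antecedent)

rule_1 = ({"Milk"}, {"Coke"})
rule_2 = ({"Diaper", "Milk"}, {"Beer"})

for antecedent, consequent in (rule_1, rule_2):
    print(sorted(antecedent), "-->", sorted(consequent),
          "support:", support(antecedent, consequent),
          "confidence:", confidence(antecedent, consequent))
```

For {Milk} --> {Coke}, Milk appears in four transactions and three of them also contain Coke, so the confidence is 3/4. For {Diaper, Milk} --> {Beer}, two of the three transactions containing both Diaper and Milk also contain Beer.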

Association Rule Discovery: Application 1

- Marketing and Sales Promotion
    - Let the rule discovered be {Bagels, ...} --> {Potato Chips}
    - Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
    - Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
    - Bagels in the antecedent and Potato Chips in the consequent => Can be used to see what products should be sold with Bagels to promote the sale of Potato Chips!

Association Rule Discovery: Application 2

- Supermarket shelf management
    - Goal: To identify items that are bought together by sufficiently many customers.
    - Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
    - A classic rule:
        - If a customer buys diapers and milk, then he is very likely to buy beer.
        - So, don't be surprised if you find six-packs stacked next to diapers!

Association Rule Discovery: Application 3

- Inventory Management
    - Goal: A consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep the service vehicles equipped with the right parts, to reduce the number of visits to consumer households.
    - Approach: Process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.

Sequential Pattern Discovery: Definition

- Given is a set of objects, with each object associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

    (A B)  (C)  (D E)

- Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

[Figure: the pattern (A B) (C) (D E) annotated with the timing constraints <= xg, > ng, <= ws, and <= ms.]
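Checking whether a pattern such as (A B) (C) (D E) occurs in one object's timeline can be sketched as below. Only a single maximum-gap constraint between consecutive pattern elements is modeled, and reading the figure's "<= xg" as that maximum gap is an assumption; the other constraints (ng, ws, ms) are not handled, and the timeline is invented.

```python
def occurs(pattern, timeline, max_gap=None, start=0, prev_time=None):
    """Return True if the pattern's elements appear in order in the timeline.

    pattern:  list of sets of events, e.g. [{"A", "B"}, {"C"}, {"D", "E"}]
    timeline: list of (time, set of events) pairs, sorted by time
    max_gap:  optional limit on the time between consecutive matched elements
    """
    if not pattern:
        return True
    for i in range(start, len(timeline)):
        time, events = timeline[i]
        if prev_time is not None and max_gap is not None and time - prev_time > max_gap:
            break  # the timeline is sorted, so every later entry is even further away
        if pattern[0] <= events and occurs(pattern[1:], timeline, max_gap, i + 1, time):
            return True
    return False

# Hypothetical timeline for one object: events A and B at time 1, C at time 2, D and E at time 5.
timeline = [(1, {"A", "B"}), (2, {"C"}), (5, {"D", "E"})]
pattern = [{"A", "B"}, {"C"}, {"D", "E"}]

print(occurs(pattern, timeline))              # no timing constraint
print(occurs(pattern, timeline, max_gap=2))   # the gap from time 2 to time 5 is too large
print(occurs(pattern, timeline, max_gap=3))   # the same gap is now allowed
```

The recursion backtracks over earlier matches, which matters once gaps are constrained: the first place an element matches is not always the one that lets the rest of the pattern fit.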

Sequential Pattern Discovery: Examples

- In telecommunications alarm logs:
    - (Inverter_Problem Excessive_Line_Current) (Rectifier_Alarm) --> (Fire_Alarm)
- In point-of-sale transaction sequences:
    - Computer Bookstore: (Intro_To_Visual_C) (C++_Primer) --> (Perl_for_dummies, Tcl_Tk)
    - Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)

Regression

- Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Greatly studied in the statistics and neural network fields.
- Examples:
    - Predicting sales amounts of a new product based on advertising expenditure.
    - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
    - Time series prediction of stock market indices.
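For the linear case with a single input variable, the dependency can be estimated by ordinary least squares, as in this sketch. The advertising and sales figures are invented, and the choice of a straight-line model is an assumption made to keep the example small.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales, in arbitrary units.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 6.0 + intercept   # prediction for an unseen expenditure of 6.0
print(slope, intercept, predicted_sales)
```

The sample data lies exactly on the line y = 2x + 1, so the fit recovers that slope and intercept; real data would scatter around the fitted line.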

Deviation/Anomaly Detection

- Detect significant deviations from normal behavior
- Applications:
    - Credit card fraud detection
    - Network intrusion detection
        - Typical network traffic at university level may reach over 100 million connections per day
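One simple way to flag a "significant deviation" is to measure how many standard deviations a value lies from the mean (a z-score). The slides do not prescribe a method, so the z-score rule, the threshold of 2, and the transaction amounts below are all assumptions for illustration.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical card transaction amounts; one is far larger than the rest.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]
anomalies = find_anomalies(amounts)
print(anomalies)
```

A single extreme value inflates both the mean and the standard deviation, so on small samples this rule can miss anomalies; more robust statistics such as the median are often preferred in practice.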

Challenges of Data Mining

- Scalability
- Dimensionality
- Complex and Heterogeneous Data
- Data Quality
- Data Ownership and Distribution
- Privacy Preservation
- Streaming Data