DWM Lab Manual 3yr 1

DATA WAREHOUSING AND MINING
(21CS3052R)
STUDENT ID: ACADEMIC YEAR: 2023-24

STUDENT NAME:
Table of Contents
1. Session 01: Basic Statistical Descriptions
2. Session 02: To implement data pre-processing techniques
3. Session 03: To implement principal component analysis
4. Session 04: Classification using Decision Trees
5. Session 05: Classification using K Nearest Neighbor
6. Session 06: Classification using Bayesian Classifiers
7. Session 07: Classification using Back propagation
8. Session 08: Association Rule Mining - Apriori
9. Session 09: Implementation of K-Means Clustering
10. Session 10: Classification: Support Vector Machine (SVM)
11. Session 11: Rule Based Classification
12. Session 12: Outliers Detection
A.Y. 2023-24 LAB/SKILL CONTINUOUS EVALUATION
Columns: S.No | Date | Experiment Name | Pre-Lab (10M) | In-Lab (25M): Program/Procedure (5M), Data and Results (10M), Analysis & Inference (10M) | Post-Lab (10M) | Viva Voce (5M) | Total (50M) | Faculty Signature
1. Basic Statistical Descriptions
2. To implement data pre-processing techniques
3. To implement principal component analysis
4. Classification using Decision Trees
5. Classification using K Nearest Neighbor
6. Classification using Bayesian Classifiers
7. Classification using Back propagation
8. Association Rule Mining - Apriori
9. Implementation of K-Means Clustering
10. Classification: Support Vector Machine (SVM)
11. Rule Based Classification
12. Outliers Detection

Experiment # <TO BE FILLED BY STUDENT> Student ID <TO BE FILLED BY STUDENT>
Date <TO BE FILLED BY STUDENT> Student Name <TO BE FILLED BY STUDENT>
Lab#1: Basic Statistical Descriptions

Date of the Session: ____/____/ Time of the Session ____to __
Pre-lab
BASIC STATISTICAL DESCRIPTIONS OF DATA

Basic statistical descriptions provide the analytical foundation for data pre-processing. They can be used
to identify properties of the data and to highlight which data values should be treated as noise or
outliers.
Answer the following Questions
1. What are the various ways to measure the central tendency of data?
2. What are the several ways of measuring the dispersion of data?
3. What is IQR (inter quartile range)?
Course Title Data Warehousing and Mining ACADEMIC YEAR: 2023-24

Course Code(s) 21CS3052R Page 1 of 67
4. Observe the following diagrams and identify the quantile plot and the q-q plot. How does
the q-q plot differ from the quantile plot?
5. What items make up the five-number summary?
6. Identify the symmetric data, positively skewed data and negatively skewed data from
the graphs below.

In-lab
1. You are given a dataset “cars” for analysis; it includes the variables speed and
distance. (Download the dataset from LMS)
a) What are the average speed and the average distance of the cars?
b) What are the median and midrange of the data?
c) Find the mode of the data and comment on the data modality (i.e., unimodal or bimodal).
d) What are the variance and the standard deviation of the data?
e) Find the five-number summary of the data.
f) Show the histogram and box plot of the data.
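
The measures asked for in parts a) to e) can be sketched with Python's standard `statistics` module. This is only an illustration under stated assumptions: the `speeds` list is hypothetical (it is not the LMS “cars” dataset), the helper name `describe` is made up for this sketch, and the quartiles use the "inclusive" interpolation method, which is one of several common conventions.

```python
import statistics

def describe(values):
    """Return basic statistical descriptions of a list of numbers."""
    ordered = sorted(values)
    # Quartiles with linear interpolation (one common convention among several).
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "midrange": (ordered[0] + ordered[-1]) / 2,
        "modes": statistics.multimode(ordered),    # one mode -> unimodal
        "variance": statistics.variance(ordered),  # sample variance
        "std_dev": statistics.stdev(ordered),      # sample standard deviation
        "five_number": (ordered[0], q1, q2, q3, ordered[-1]),
    }

# Hypothetical speed readings, NOT the LMS "cars" dataset.
speeds = [4, 7, 7, 8, 9, 10, 10, 10, 12, 15]
summary = describe(speeds)
for name, value in summary.items():
    print(name, value)
```

The histogram and box plot in part f) need a plotting library and are left out of this sketch.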

Writing space of the Problem: (For Student’s use only)

Post-lab
1. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with
the following results. (Download the dataset from lms)
a) Find the maximum and the minimum percentage of the fat and age of the adults who
visited the hospital.
b) Calculate mean, median and midrange of the age.
c) Find the first quartile and third quartile of the data.
d) Draw a scatter plot and q-q plot based on these two variables.
Writing space of the Problem: (For Student’s use only)

Viva Voce:-
1. Difference between symmetric data and skewed data.
2. What are the most widely used forms of quartiles?
3. Variance and Standard deviation fall under what category of measuring data?
4. What do low and high standard deviations indicate?
5. Based on what condition, two variables are said to be correlated?
(For Evaluator’s use only)
Comment of the Evaluator (if Any) Evaluator’s Observation

Marks Secured: out of
Full Name of the Evaluator:
Signature of the Evaluator Date of Evaluation:

Lab#2: To implement data pre-processing techniques.

Date of the Session: ____/____/____ Time of the Session ____to_____
Pre Lab
DATA PREPROCESSING:
Databases are highly susceptible to noisy, missing, and inconsistent data because
of their typically huge size (often several gigabytes or more). Low-quality
data will lead to low-quality mining results. Pre-processing helps to obtain quality data.
The steps involved in data pre-processing are data cleaning, data integration,
data reduction, and data transformation.
Match the following:
i. Data cleaning               a. Reduced representation of data
ii. Data integration           b. xold/xmax
iii. Data reduction            c. deal with missing values and noisy data
iv. Normalization              d. works to remove noisy data
v. Data transformation         e. (xold-xmin)/(xmax-xmin)
vi. Decimal scaling            f. merging of data from multiple data stores
vii. Min-max normalization     g. scale the data values in specified range
viii. Z-score normalization    h. convert data into appropriate forms
ix. Smoothing                  i. (xold-mean)/standard deviation
1. Mention any two methods that deal with missing values and noisy data.

2. Mention two techniques that are applied to obtain a reduced data set.
3. Using min-max normalization, transform the value 35 onto the range [0.0, 1.0].
4. Using z-score normalization, transform the value 35, where the standard deviation is
12.94 years.
5. Using normalization by decimal scaling, transform the value 35.
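
The three normalization formulas from the pre-lab can be written as small Python functions. This is a sketch only: the function names are made up, and the minimum (13), maximum (70), mean (30) and standard deviation (10) used in the calls are hypothetical values chosen for illustration, not the values intended by the questions above.

```python
def min_max(x, x_min, x_max, new_min=0.0, new_max=1.0):
    """Min-max normalization onto [new_min, new_max]."""
    return (x - x_min) / (x_max - x_min) * (new_max - new_min) + new_min

def z_score(x, mean, std_dev):
    """Z-score normalization: distance from the mean in standard deviations."""
    return (x - mean) / std_dev

def decimal_scaling(x, max_abs):
    """Divide by the smallest power of 10 that brings max_abs below 1."""
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return x / (10 ** j)

# Hypothetical parameters, chosen only to exercise the formulas.
print(min_max(35, x_min=13, x_max=70))
print(z_score(35, mean=30, std_dev=10))
print(decimal_scaling(35, max_abs=70))
```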

In-lab
1. You are given a data set “data” for analysis; it includes the attributes Country, purchased item,
age, Salary.
a. Identify the number of missing values in the given dataset.

b. Drop the tuples that have missing values in the attributes.
c. Check the data type of age; if it is not an integer, then convert it into an integer.
d. Normalize the salary using simple feature scaling.
e. Categorize the salary into low, medium and high bins.
f. Turn the categorical values into numerical values.
Writing space of the Problem: (For Student’s use only)

2. Suppose John is working as a manager at Nuclear Power Corporation of India and has
been charged with analyzing the nuclear power station construction data. He carefully inspects
the company’s database, identifying and selecting the attributes (cost, date, t1, t2 and cap) to
be included in the analysis. (Download the dataset from LMS)
a. He noticed that several values of the attributes for various tuples have no recorded value.
b. He observed that the data type of year is recorded as float instead of integer.
c. He wants to normalize all the data (variables) with equal weights.
d. Finally, he wants to know if there are any outliers present in the cost of the
construction. You immediately set out to perform this task.
(Hint: missing values can be handled by replacing them with the mean)

Post-lab
1. Data (13, 5, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)
Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps.
Comment on the effect of this technique for the given data. Also plot a histogram.
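
Smoothing by bin means with equal-depth bins can be sketched as follows. The function name `smooth_by_bin_means` is made up for this illustration; the data list is copied as printed in the question and sorted before binning, which is the usual assumption for equal-depth partitioning.

```python
def smooth_by_bin_means(values, depth):
    """Sort the values, split them into equal-depth bins, and replace
    every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Data as printed in the question above.
data = [13, 5, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
smoothed = smooth_by_bin_means(data, depth=3)
print(smoothed)
```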

2. Use the methods below to normalize the following group of data: 200, 300, 400, 600 and
1000.
a. min-max normalization by setting min = 0 and max = 1
b. z-score normalization
c. z-score normalization using the mean absolute deviation instead of the standard deviation
d. Normalization by simple feature scaling.
Writing space of the Problem: (For Student’s use only)

3. a. Normalize the two variables (age, fat) based on z-score normalization.

b. Calculate the correlation matrix. Are these two variables positively or negatively correlated?
Writing space of the Problem: (For Student’s use only)

Viva Voce:-
1. What are the factors that comprise data quality?
2. What do you mean by noise in the dataset?
3. What are outliers in the dataset?
4. What is discretization?
5. What is the difference between lossy and lossless data reduction?
Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: out of
Full Name of the Evaluator:
Signature of the Evaluator Date of Evaluation:

Lab#3: To implement principal component analysis

Date of the Session: ____/____/____ Time of the Session ____to_____
Pre-lab:-
Principal Component Analysis:
Principal Component Analysis is a method of extracting important variables
from a large set of variables available in a data set. Suppose that the data to be reduced consist of tuples or
data vectors described by n attributes or dimensions. Principal components analysis (PCA;
also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors
that can best be used to represent the data, where k ≤ n. The original data are thus projected onto
a much smaller space, resulting in dimensionality reduction.
1. What are principal components?
2. Mention the steps to construct principal components.
In-lab:-
1. Suppose that you are given a small 3x2 matrix. Calculate the
Principal Component Analysis without using the pca() function.
Matrix: ([3, 5], [4, 2], [1, 6])
Writing space of the Problem: (For Student’s use only)
2. Calculate the principal component analysis for the matrix given in Q1 using PCA.
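
For a two-column matrix the PCA steps (center the data, build the covariance matrix, take its eigenvalues and eigenvectors, project) can be carried out by hand, because a 2x2 symmetric matrix has a closed-form eigen-decomposition. The sketch below uses only the standard library; the function name `pca_2d` is made up, the sample covariance (divide by n - 1) is assumed, and the eigenvector formula assumes a non-zero covariance between the two columns.

```python
import math

def pca_2d(rows):
    """PCA for two-column data via the closed-form 2x2 eigen-decomposition."""
    n = len(rows)
    mean_x = sum(r[0] for r in rows) / n
    mean_y = sum(r[1] for r in rows) / n
    centered = [(r[0] - mean_x, r[1] - mean_y) for r in rows]

    # Sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]].
    var_x = sum(x * x for x, _ in centered) / (n - 1)
    var_y = sum(y * y for _, y in centered) / (n - 1)
    cov_xy = sum(x * y for x, y in centered) / (n - 1)

    # Eigenvalues of a symmetric 2x2 matrix.
    half_trace = (var_x + var_y) / 2
    root = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    eigenvalues = (half_trace + root, half_trace - root)

    # Unit eigenvector of the largest eigenvalue (assumes cov_xy != 0).
    vx, vy = cov_xy, eigenvalues[0] - var_x
    norm = math.hypot(vx, vy)
    first_component = (vx / norm, vy / norm)

    # Project the centered rows onto the first principal component.
    scores = [x * first_component[0] + y * first_component[1]
              for x, y in centered]
    return eigenvalues, first_component, scores

matrix = [[3, 5], [4, 2], [1, 6]]
eigenvalues, first_component, scores = pca_2d(matrix)
print(eigenvalues, first_component, scores)
```

The sign of an eigenvector is arbitrary, so a library implementation may return the component and scores with the opposite sign.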
USING PCA


Post-lab:-
1. Pollution has been a concern since industrialization due to its effects on human lives
and the planet. According to WHO, air pollution contributes to 7 million premature deaths
per annum. A report is generated on the quality of air over 5 months. It is found that the data
within the reported dataset are correlated. So, apply a suitable method to reconstruct the
dataset with 2 components. Also visualize a graph between the two components.
(Download the dataset from LMS)
Viva Voce:-
1. Why is PCA preferable? Mention the two primary reasons.

2. Is there any loss of data if we use PCA?
3. PCA is an unsupervised technique. Do you agree? Why?
4. What are the applications of PCA?
5. Define covariance matrix.

Lab#4: Classification using Decision Trees.
Date of the Session: ____/____/____ Time of the Session ____to_____
Pre-lab:-
1. What are the attribute selection measures in modeling a decision tree? Write
the respective equations for each of them.
2. What do you mean by entropy in a decision tree? How is it calculated?
3. What is information gain and how does it matter in a decision tree?
4. List out the parameters involved in DecisionTreeClassifier and export_graphviz
and try to understand the role of each parameter.
5. Match the following:
1. ID3     a. GAIN RATIO
2. CART    b. INFORMATION GAIN
3. C4.5    c. GINI INDEX
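
The attribute selection measures named above can be illustrated with a few lines of Python. This is a sketch under stated assumptions: the function names are made up, the class labels are hypothetical, and only entropy, Gini index and information gain are shown (gain ratio is omitted).

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(parent, partitions):
    """Entropy of the parent minus the weighted entropy of its partitions."""
    total = len(parent)
    weighted = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(parent) - weighted

# Hypothetical labels: 1 = apple, 0 = orange.
parent = [1, 1, 1, 0, 0, 0]
perfect_split = [[1, 1, 1], [0, 0, 0]]
print(entropy(parent), gini(parent), information_gain(parent, perfect_split))
```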

In-lab:-
1. Implement the decision tree algorithm on the given data, which has weight and smoothness
as the segregating criteria for the fruits apple and orange. Apple is represented by the
number ‘1’ and orange by ‘0’. Construct a decision tree and apply the prediction measures for
the given data to obtain the types of fruits.
Weight Smooth Fruit
180 7 ?
140 8 ?
150 5 ?

Fruit dataset
https://drive.google.com/file/d/1qoMDjozHHELVn5tFAJxp8mMw0Ggt-BVX/view?usp=sharing
Convert the trained decision tree classifier into a graphviz object. Later, we use
the converted graphviz object for visualization. To visualize the decision tree, you just need
to open the .txt file and copy the contents of the file to paste in the graphviz web
portal. Graphviz web portal address: http://webgraphviz.com
Writing space of the Problem: (For Student’s use only)
2. Below given is the diabetes dataset.
(Ref: https://drive.google.com/file/d/1PJizP39JPh_T-5dQVcUVCfswrPSxT734/view?usp=sharing)

Make sure to install the scikit-learn package and other required packages.
1. Find the correlation matrix for the diabetes dataset.
2. Split the dataset into train_set and test_set for modeling and prediction. Divide the
dataset in such a way that the training dataset constitutes 70 percent of the original
dataset and the rest belongs to the test dataset.
3. Produce a decision tree model using
    a. Gini index metric
    b. Entropy and Information gain metric
on the training dataset using the DecisionTreeClassifier function.
4. Apply the prediction measures on the test dataset.
5. Define a function named accuracy_score by interpreting the difference between the predicted
values and the test set values. Display the accuracy in terms of
a. Fraction using the accuracy_score function
b. Number of correct predictions.
6. Print the confusion matrix of the test dataset.
7. Calculate the following values manually after obtaining the confusion matrix
a. Accuracy
b. Error rate
c. Precision
d. Recall (sensitivity)
e. F1 Score
f. Specificity
8. Compare the two results (obtained from the two kinds of metrics) and state which method
is more accurate for this dataset.
9. Convert the trained decision tree classifier into a graphviz object.
Later, we use the converted graphviz object for visualization.
10. Plot the ROC curve and calculate AUC.
11. Plot the recall vs precision curve.
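
The manual calculations from the confusion matrix can be checked with a short helper. This is only a sketch: the function name `confusion_metrics` is made up and the four counts are hypothetical, not results from the diabetes dataset.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Derive the usual evaluation measures from binary confusion counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts, chosen only for illustration.
metrics = confusion_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(name, round(value, 4))
```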
Post-lab:-
1. What is the C4.5 algorithm and how does it work? State the differences between ID3 and C4.5.
2. Differentiate between over-fitting, over-fitting and over-fitting loss? Why does it occur during classification?
3. Explain the concept of pruning and why it is important. Differentiate between pre-pruning
and post-pruning.
Viva Voce:-
1. What is the difference between supervised and unsupervised machine learning?
2. What is a confusion matrix?

3. Which of the following is true about training and testing error in such case?
a. The difference between training error and test error increases as the number
of observations increases.
b. The difference between training error and test error decreases as the number
of observations increases.
c. The difference between training error and test error will not change.
4. What is the difference between classification and clustering?
5. What are Recommender Systems?

Lab#5: Classification using K Nearest Neighbour.
Pre-requisite:
In LMS: Find the file named “Concept of k-Nearest-Neighbor.doc”. Read the
specified document and answer the questions below.
Pre-lab:-
1. State whether the given statement is true or false with supporting reasoning.
a. k-Nearest-Neighbor is a simple algorithm that stores all available cases and
classifies the new case based on a dissimilarity measure.
b. The value of ‘k’ in the k-nearest-neighbor algorithm helps to check the number
of training set labels to assign the most common label for the testing set.
2. List the industrial uses of the k-nearest-neighbor algorithm in the real world.

3. Write an algorithm for k-nearest-neighbor classification given k, the number of
nearest neighbors, and n, the number of attributes describing each tuple.
4. Compare the advantages and disadvantages of eager classification (e.g., decision tree,
Bayesian, neural network) versus lazy classification (e.g., k-nearest-neighbor,
case-based reasoning).
5. Give the distance methods that are most commonly used in the k-nearest-neighbor algorithm.

In-lab:-
Perform the following Analysis:
Step-by-step process to compute the k-nearest-neighbor algorithm:
1. Determine parameter k = number of nearest neighbors.
2. Calculate the distance between the test sample and the training samples.
3. Sort the distances and determine the nearest neighbors based on the kth minimum distance.
4. Gather the category of the nearest neighbors.
5. Use the simple majority of the category of the nearest neighbors as the prediction value of the testing sample.
Dataset:
Suppose we have the following “StudentDataSet” dataset which consists of 1st
year CGPA, 2nd year CGPA, Category (C: CRT, NC: Non-CRT) as parameters.
Std.No 1st year CGPA 2nd year CGPA Category
1 8.5 8.5 C
2 8.2 9 C
3 7.5 7.6 C
4 5.5 4.5 NC
5 9.2 9 C
6 7.8 7.3 C
7 7.3 7.4 NC
8 7.9 7 NC
9 10 6 C
10 6.8 7.1 NC
11 6.5 7.1 NC
12 7.2 7.3 NC
When a new student comes with only 1st year CGPA and 2nd year CGPA as
information, predict the category of that new student (whether he belongs to CRT or Non-CRT) by the
Euclidean distance measure, where the Euclidean distance between 2 points or tuples,
say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is
$dist(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$
Test sample:
1st year CGPA and 2nd year CGPA of the new student are 8.4 and 7.1 respectively. (Consider
k=3)
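
The five steps listed earlier can be sketched in plain Python on the StudentDataSet table. The function names are made up for this illustration, ties between equally distant neighbors are broken by table order (an assumption of this sketch), and the `distance` argument is left pluggable so that another measure can be passed in.

```python
import math
from collections import Counter

# (1st year CGPA, 2nd year CGPA, Category) copied from the table above.
students = [
    (8.5, 8.5, "C"), (8.2, 9.0, "C"), (7.5, 7.6, "C"), (5.5, 4.5, "NC"),
    (9.2, 9.0, "C"), (7.8, 7.3, "C"), (7.3, 7.4, "NC"), (7.9, 7.0, "NC"),
    (10.0, 6.0, "C"), (6.8, 7.1, "NC"), (6.5, 7.1, "NC"), (7.2, 7.3, "NC"),
]

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(training, sample, k, distance=euclidean):
    """Rank training rows by distance to the sample and take a majority vote."""
    ranked = sorted(training, key=lambda row: distance(row[:2], sample))
    votes = Counter(row[2] for row in ranked[:k])
    return votes.most_common(1)[0][0]

prediction = knn_predict(students, (8.4, 7.1), k=3)
print(prediction)
```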


Post-lab:-
1. Predict the Category of a student with 1st year CGPA and 2nd year CGPA as 7.3 and 7.1 respectively
using the Manhattan measuring technique formula with
k=3 (manually). Note: The Manhattan distance between two tuples (or points) a and b is defined as
$\sum_i |a_i - b_i|$
2. By considering the above StudentDataSet, predict the Category of the new
student having 1st year CGPA and 2nd year CGPA as 8.4 and 7.1 respectively,
by implementing Python code using the Manhattan distance measure in order to find the nearest
neighbors for k=3, and check whether the output is the same for both
measuring techniques or not.

Viva Voce:-
Refer Page no: 423, 424, 425 in Han J & Kamber M, “Data Mining: Concepts and Techniques”, Third Edition,
Elsevier, 2011
1. k-nearest-neighbor is a lazy learning algorithm.
2. How can the distance be computed for attributes that are not numeric, but nominal (or categorical),
such as color?
3. List some techniques used to speed up the classification time.
4. If the value of a given attribute A is missing in tuple X1 and/or in tuple X2, the difference is always

Lab#6: Classification using Bayesian Classifiers
Pre-lab:-
1. Match the following
Column A                             Column B
a. Naive Bayesian Classification     a. Values are continuous
b. Bayesian belief network           b. Attributes conditionally dependent
c. Gaussian distribution             c. To avoid zero probability
d. Laplace estimator                 d. Attributes conditionally independent
2. Explain Bayes’ theorem and write its derived formulae.
3. Suppose we have continuous values for an attribute in a data set. How do we
calculate the probability?

4. Let us assume
   p(age=youth/buys_car =yes)=0.222, 
   p(income=medium/buys_car)=0.444 and 
   p(buys_car=yes)=0.643 then 
   Find the probability of p(x/buys_car=yes), where x=(income=medium,
age=youth). 
5. While implementing a Naïve Bayesian classifier, suppose we have encountered a zero
probability. We should then add one to each of the counts behind the
probabilities to avoid the zero probability. What is this estimation called?
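
Under the naive (class-conditional independence) assumption, the probability of a tuple given a class is the product of the per-attribute probabilities, and the add-one correction in question 5 adjusts counts before they are turned into probabilities. The sketch below uses the probabilities given in question 4; the function names and the counts passed to `add_one_estimate` are hypothetical.

```python
def class_conditional(probabilities):
    """P(x | class) as the product of per-attribute conditional probabilities."""
    result = 1.0
    for p in probabilities:
        result *= p
    return result

def add_one_estimate(count, class_total, n_values):
    """Add one to every count so that no conditional probability is zero.
    n_values is the number of distinct values the attribute can take."""
    return (count + 1) / (class_total + n_values)

# Probabilities from question 4 above.
likelihood = class_conditional([0.222, 0.444])
# Multiplying by the prior P(buys_car = yes) gives the unnormalized posterior.
unnormalized_posterior = likelihood * 0.643
print(likelihood, unnormalized_posterior)

# Hypothetical counts: a value never seen among 10 tuples, 3 possible values.
print(add_one_estimate(count=0, class_total=10, n_values=3))
```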

In-lab:-
1. Consider the given table named “Weather_cond.csv” consisting of the attributes Temperature, Humidity,
Windy and a class label named “Outcome”. Depending on the weather conditions you
have to choose whether to play cricket or not.
a. Instead of using the conventional library function, write a Python function to split the dataset into
training set and test set. Assume test size as 0.33.
b. Write a Python function to calculate the mean and standard deviation for each
numerical attribute in the data set.
c. Calculate the priors for the given
dataset after splitting into training and test sets using Python.
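
Parts a) to c) could be approached along the following lines. This is a sketch under stated assumptions: the rows are hypothetical (they are not the contents of “Weather_cond.csv”), the column layout (temperature, humidity, windy, outcome) is assumed, and all function names are made up.

```python
import random
import statistics
from collections import Counter

def split_dataset(rows, test_size=0.33, seed=0):
    """Shuffle a copy of the rows and cut off the first part as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def column_stats(rows, numeric_columns):
    """Mean and sample standard deviation for each numeric column index."""
    stats = {}
    for col in numeric_columns:
        values = [row[col] for row in rows]
        stats[col] = (statistics.mean(values), statistics.stdev(values))
    return stats

def class_priors(rows, label_index=-1):
    """Relative frequency of every class label."""
    counts = Counter(row[label_index] for row in rows)
    return {label: c / len(rows) for label, c in counts.items()}

# Hypothetical rows: (temperature, humidity, windy, outcome).
weather = [
    (30, 85, 0, "no"), (27, 90, 1, "no"), (28, 78, 0, "yes"),
    (21, 96, 0, "yes"), (20, 80, 0, "yes"), (18, 70, 1, "no"),
]
train_set, test_set = split_dataset(weather, test_size=0.33)
stats = column_stats(train_set, numeric_columns=[0, 1])
priors = class_priors(train_set)
print(len(train_set), len(test_set), stats, priors)
```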


2. The problem comprises 100 observations of medical details for Pima Indian patients. The
records describe instantaneous measurements taken from the patient such as their age, the
number of times pregnant and blood workup. All patients are women aged 21 or older. All
attributes are numeric, and their units vary from attribute to attribute. Each record has a class
value that indicates whether the patient suffered an onset of diabetes within
5 years of when the measurements were taken (1) or not (0). This is a standard dataset that has been
studied a lot in the machine learning literature. A good prediction accuracy is 70%-76%.

Implement Python code to find the accuracy for the given dataset named
“Diabetes.csv” based on train set and test set. Take test size as 0.4.

Post-lab:-
1. Consider the given table that specifies a loan classification problem.
Tid  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1  Yes  Single  125K  No 
2  No  Married  100K  No 
3  No  Single  70K  No 
4  Yes  Married  120K  No 
5  No  Divorced  95K  Yes 
7  Yes  Divorced  220K  No 
8  No  Single  85K  Yes 
10  No  Single  90K  Yes 
 
a. Compute the class conditional probability for each categorical attribute.
b. Predict the class label value for the test record X = (Home Owner = No, Marital Status = Married,
Income = $120K)
  

Viva Voce:-
1. Explain the difference between a Validation Set and a Test Set.
2. What are the three types of Naïve Bayes classifier?
3. How many terms are required for building a Bayes model?
4. What is a training set and a testing set?
5. What are the advantages of Naive Bayes?

Lab#7: Classification using Backpropagation
Pre-lab:-
In LMS: Find the file named “Han J & Kamber M, Data Mining Concepts
and Techniques.doc”.
Read the specified document from Pg. No: 398–404 and answer the questions below.
1. State whether the given statement is True/False.
a. Backpropagation is a neural network learning algorithm.
b. Backpropagation learns by iteratively processing a data set of training
tuples, comparing the network’s prediction for each tuple with the actual
known target value.
2. What is the objective of Backpropagation?
3. Explain the Multilayer Feed-Forward Neural Network with a diagram.

4. How does Backpropagation work?
5. Consider the following table.

Input  Desired Output  Model Output  Absolute Error  Square Error
0      0
1      2
2      4

Predict the Model Output by considering the initial value of the weight as 3. Find the Absolute
Error and Square Error. Use the Backpropagation algorithm to update the weight and try to
minimize the square error as much as possible.
Hint:
i. Model Output = W * I(x) (W = weight, I = Input, x = index that iterates from 0 to
length(Input))
ii. Absolute Error = mod(Model Output - Desired Output)
iii. Square Error = (Absolute Error)^2
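
For the single-weight model in the hint, the weight update reduces to gradient descent on the squared error, whose derivative with respect to W is 2 * (W * I - Desired) * I. The sketch below starts from W = 3 as in the question; the learning rate of 0.1, the number of epochs and the function names are assumptions made for illustration.

```python
inputs = [0, 1, 2]
desired = [0, 2, 4]

def squared_errors(weight):
    """Square error of Model Output = weight * input for every row."""
    return [(weight * x - t) ** 2 for x, t in zip(inputs, desired)]

def train_weight(weight, learning_rate=0.1, epochs=20):
    """Update the weight after every row using the squared-error gradient."""
    for _ in range(epochs):
        for x, target in zip(inputs, desired):
            output = weight * x
            gradient = 2 * (output - target) * x   # d(error^2) / d(weight)
            weight -= learning_rate * gradient
    return weight

initial_errors = squared_errors(3)
final_weight = train_weight(3)
print(initial_errors, final_weight, squared_errors(final_weight))
```

With these settings the weight moves from 3 toward 2, the value at which every row of the table is reproduced exactly.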

In-lab:-
Analysis:
The following steps will provide the foundation that you need to implement the Backpropagation
algorithm and apply it to your own predictive modelling problems:
1. Initialize Network.
2. Forward Propagate.
i. Neuron Activation.
ii. Neuron Transfer.
iii. Forward Propagation.
3. Back Propagate Error.
i. Transfer Derivative
ii. Error Backpropagation
4. Train Network.
i. Update Weights.
ii. Train Network.
5. Test Network.
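
The steps above can be illustrated on the smallest possible case, a single output neuron: activation, transfer, transfer derivative, error term and weight update. This is a sketch, not a full multilayer network; the weights, bias, inputs (already scaled to roughly 0..1), target and learning rate are hypothetical, and the function names are made up.

```python
import math

def sigmoid(activation):
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-activation))

def sigmoid_derivative(output):
    """Derivative of the sigmoid, expressed through the neuron's output."""
    return output * (1.0 - output)

def tanh_derivative(output):
    """Derivative of tanh, expressed through the neuron's output."""
    return 1.0 - output ** 2

def forward(weights, bias, inputs, transfer=sigmoid):
    """Neuron activation (weighted sum plus bias) followed by the transfer."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return transfer(activation)

def backprop_step(weights, bias, inputs, target, learning_rate):
    """One error-backpropagation update for a single sigmoid output neuron."""
    output = forward(weights, bias, inputs)
    delta = (target - output) * sigmoid_derivative(output)
    new_weights = [w + learning_rate * delta * x
                   for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * delta
    return new_weights, new_bias

# Hypothetical starting point and one training tuple.
weights, bias = [0.5, -0.5], 1.0
sample, target = [0.85, 0.85], 1.0
before = forward(weights, bias, sample)
weights, bias = backprop_step(weights, bias, sample, target, learning_rate=0.5)
after = forward(weights, bias, sample)
print(before, after)
```

After the update the neuron's output for the same tuple lies closer to the target, which is the behavior a full training loop repeats over all tuples and layers.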
Dataset: 
Suppose we have the following “Results Dataset” which consists of GPAs of some students that
they had scored in two internal tests. It also consists of another attribute named
‘Qualified’, which holds a character (Q/NQ), representing the student's qualification for the final
examination.

S.No Test-1 Test-2 Qualified
1 8.5 8.5 Q
2 8.2 9.0 Q
3 3.5 5.0 NQ
4 5.5 4.5 NQ
5 9.2 9.0 Q
6 7.8 7.3 Q
7 8.0 3.1 NQ
8 10 7.0 Q
9 4.5 6.0 NQ
10 6.8 7.1 Q
11 5.1 4.1 NQ
12 4.2 5.3 NQ
Problem: Train a network on the above “Results Dataset” by applying the Backpropagation algorithm.
a. Initialize a network with all weights and biases. (Consider weights in range -0.5 to
+0.5, biases = 1, Learning Rate = {0.5, 0.7, 1})
b. Train the network according to the Dataset. (Consider both activation functions:
Sigmoid Function and Tanh Function)
c. Backpropagate the errors.
Post-lab:-
1. Use the network which is trained on the above “Results Dataset” and test whether it is
trained with 100% accuracy or not. Also predict the result (qualified for final examination or not)
of a new entry which contains 5.9 and 5.9 GPAs for test-1 and test-2 respectively.
Viva Voce:-
1. What are the general tasks that are performed with the backpropagation algorithm?
2. What kind of real-world problems can neural networks solve?
3. What is gradient descent?
4. Why is zero initialization not a recommended weight initialization technique?
5. How are artificial neural networks different from normal networks?

Lab#8: Association Rule Mining - Apriori
Pre-lab:-
1. Define the Apriori algorithm.
2. What is association mining?
3. What is the need for association mining?
4. What are minimum support and minimum confidence?
5. Consider the market basket transactions given in the following table.

Let min_sup = 40% and min_conf = 40%
Transaction ID  Items Bought
T1  A,B,C
T2  A,B,C,D,E
T3  A,C,D
T4  A,C,D,E
T5  A,B,C,D
a. Find all the frequent itemsets using the Apriori algorithm.
b. Obtain significant decision rules.
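
A level-wise Apriori search over the five transactions above can be sketched in plain Python. The function names are made up for this illustration, and the candidate generation (join frequent k-itemsets, then prune candidates that have an infrequent subset) is a compact variant written for readability rather than speed.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those meeting the minimum support.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: unions of frequent k-itemsets that have k + 1 items.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k - 1)-subset must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, k - 1))]
    return frequent

def confidence(frequent, antecedent, consequent):
    """conf(antecedent -> consequent) = sup(both) / sup(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return frequent[both] / frequent[frozenset(antecedent)]

transactions = [
    frozenset("ABC"), frozenset("ABCDE"), frozenset("ACD"),
    frozenset("ACDE"), frozenset("ABCD"),
]
frequent = apriori(transactions, min_support=0.4)
for itemset, count in sorted(frequent.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
print(confidence(frequent, "A", "B"))
```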

In-lab:- For the following given transaction dataset, perform the following operations:
a. Generate rules using the Apriori algorithm on the dataset below (one transaction per line).
shrimp, almonds, avocado, vegetables mix, green grapes, whole wheat flour, yams, cottage cheese
burgers, meatballs, eggs
chutney
turkey, avocado
mineral water, milk, energy bar, whole wheat rice, green tea, eggs
low fat yogurt
whole wheat pasta, french fries
soup, light cream, shallot
frozen vegetables, spaghetti, green tea
french fries
eggs, pet food
cookies
turkey, burgers, mineral water, eggs, cooking oil
spaghetti, champagne, cookies
mineral water, salmon, eggs
mineral water
shrimp, chocolate, chicken, honey, oil, cooking oil, low fat yogurt
turkey, eggs
turkey, fresh tuna, tomatoes, spaghetti, mineral water, black tea, salmon, eggs
meatballs, milk, honey, french fries, protein bar
red wine, shrimp, pasta, pepper, eggs, chocolate, shampoo
rice, sparkling water
spaghetti, mineral water, ham, body spray, pancakes, green tea
burgers, grated cheese, eggs, pasta, avocado, honey, white wine, toothpaste
eggs
parmesan cheese, spaghetti, soup, avocado, milk, fresh bread
ground beef, spaghetti, mineral water, milk, eggs, black tea, salmon, frozen smoothie
sparkling water
mineral water, eggs, chicken, chocolate, french fries
frozen vegetables, spaghetti, yams, mineral water
Post-lab:-
1. As in the In-lab question, generate rules on the dataset below (one transaction per line).
citrus fruit, semi-finished bread, margarine, ready soups
tropical fruit, yogurt, coffee
whole milk
pip fruit, yogurt, cream cheese, meat spreads
other vegetables, whole milk, condensed milk, long life bakery product
whole milk, butter, yogurt, rice, abrasive cleaner
rolls/buns
other vegetables, UHT-milk, rolls/buns, bottled beer, liquor (appetizer)
pot plants
whole milk, cereals
tropical fruit, other vegetables, white bread, bottled water, chocolate
citrus fruit, tropical fruit, whole milk, butter, curd, yogurt, flour, bottled water, dishes
beef
frankfurter, rolls/buns, soda
chicken, tropical fruit
butter, sugar, fruit/vegetable juice, newspapers
fruit/vegetable juice
packaged fruit/vegetables
chocolate
specialty bar
other vegetables
butter milk, pastry
whole milk
tropical fruit, cream cheese, processed cheese, detergent, newspapers
tropical fruit, root vegetables, other vegetables, frozen dessert, rolls/buns, flour, sweet spreads, salty snack, waffles, candy, bathroom cleaner
bottled water, canned beer
yogurt
sausage, rolls/buns, soda, chocolate
other vegetables
brown bread, soda, fruit/vegetable juice, canned beer, newspapers, shopping bags

Viva Voce:-
1. Who proposed the Apriori algorithm, and in which year?
2. What is a frequent itemset?
3. Why do we convert the dataset into a list?
4. What are the formulas for support, confidence and lift?
5. How did the Apriori algorithm get its name?

Lab#9: Implementation of K-Means Clustering
Pre-Requisites:
Data pre-processing
Basics of plotting techniques
Various clustering techniques
Pre-lab:-
1. Match the following.

Parameters      Application
1. pch          a. To set orientation of axis labels
2. col          b. No. of plots per row and column
3. mfrow        c. To set plot color
4. lwd          d. Plotting symbol
5. las          e. To set line width

2. List out various parameters and attributes in K-Means clustering.
3. Into how many types is clustering divided? Name them.
4. List out various applications of clustering.

5. Describe Euclidean distance and Manhattan distance in brief with their formulas.
6. List out the basic steps involved in K-Means clustering.
In-lab:-
1. The given dataset comprises 150 data entries of different countries around
the world. It is a report on world happiness, a landmark survey of the state of
global happiness that ranks 156 countries by how happy their citizens perceive
themselves to be, with a focus on the technologies, social norms, conflicts and
government policies that have driven those changes. The records contain various
attributes of each country, including positive_effect, negative_effect, corruption,
freedom, healthlifeexpectancy etc. The data frame includes categorical variables
and numerical values, and their values vary from country to country.
Implement Python code using scikit-learn to display a K-means
clustering plot for the given data frame named “world_happiness_report.csv”.
2. The given dataset named “Student_performance” consists of 150 data entries of
students in an institution and describes the performance of each student. It consists
of various attributes such as
gender, ethnicity, test_preparation, math_score, reading_score etc. Perform
K-means clustering for the given dataset, taking an appropriate number of centers
based on the mean and standard deviation of the data entries. Analyze the cluster plot
and give a brief note based on the results obtained.

Post-lab:-
1. This lab module aims to build an analysis of the customers of a shopping mall. It consists of
150 observations of customers with details that include gender, age,
annual_income, spending_score etc. Based on the two parameters annual_income and
spending_score, try to build an analysis of the customers through cluster graphs.

Apply K-means clustering on the given data set named “Mall_customers”, choosing the number
of clusters based on the mean and standard deviation of any two attributes of your choice, and run
K-means iteratively till the centroids stabilize.
Viva Voce:-
1. Which type of algorithm is K-means?
2. In the K-means clustering algorithm, what is
the criterion by which data points are separated from one cluster to another?
3. What are the basic steps in K-Means clustering?
4. What does K refer to in the K-means algorithm? (K refers to the number of clusters.)
5. How is the K-means algorithm different from the KNN algorithm?

Lab#10: Classification: Support Vector Machine (SVM)
Pre-lab:-
1. What is SVM?
2. When do we use SVM?
3. What is the maximum marginal hyperplane and what is the equation of the separating
hyperplane?
4. What are the two cases of SVM?
5. What are the equations for a point that lies above the separating hyperplane and below the
separating hyperplane?
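
For the linear case, a separating hyperplane is usually written as W · X + b = 0, and the sign of W · X + b tells on which side a point lies. The short sketch below only evaluates that expression; the weight vector, bias and points are hypothetical, and no training of an SVM takes place.

```python
def decision_value(weights, bias, point):
    """Value of W . X + b for one point."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def side_of_hyperplane(weights, bias, point):
    """+1 for points on or above the hyperplane, -1 for points below it."""
    return 1 if decision_value(weights, bias, point) >= 0 else -1

# Hypothetical hyperplane x1 - x2 = 0 in two dimensions.
weights, bias = (1.0, -1.0), 0.0
print(side_of_hyperplane(weights, bias, (3.0, 1.0)))
print(side_of_hyperplane(weights, bias, (1.0, 3.0)))
```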

In-lab:-
1. Below is the data of the employees in the company. The data shows whether an
employee purchased software or not. Take the x co-ordinate as age and the y co-ordinate
as estimated_salary. Now, consider the following dataset and perform the operations below:
UserID  Gender  Age  EstimatedSalary  Purchased
15624510  Male  19  19000  0 
15810944  Male  35  20000  0 
15668575  Female  26  43000  0 
15603246  Female  27  57000  0 
15804002  Male  19  76000  0 
15728773  Male  27  58000  0 
15598044  Female  27  84000  0 
15694829  Female  32  150000  1 
15600575  Male  25  33000  0 
15727311  Female  35  65000  0 
15570769  Female  26  80000  0 
15606274  Female  26  52000  0 
15746139  Male  20  86000  0 
15704987  Male  32  18000  0 
15628972  Male  18  82000  0 
15697686  Male  29  80000  0 
15733883  Male  47  25000  1 
15617482  Male  45  26000  1 
15704583  Male  46  28000  1 
15621083  Female  48  29000  1 
15649487  Male  45  22000  1 
15736760  Female  47  49000  1 
15714658  Male  48  41000  1 
15599081  Female  45  22000  1 
15705113  Male  46  23000  1 
15631159  Male  47  20000  1 
15792818  Male  49  28000  1 
15633531  Female  47  30000  1 
15744529  Male  29  43000  0 
a. Import the dataset into Python
b. Split the dataset into training and testing sets
c. Apply feature scaling on the training and test sets
d. Fit SVM to the training set
e. Visualize the training set results
f. Visualize the test set results.
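
Steps a to d above might look roughly like the sketch below, assuming scikit-learn and NumPy are installed. The Age, EstimatedSalary and Purchased values are typed in from the table instead of being read from a file. The linear kernel, the 25% test split and random_state=0 are assumptions, not requirements of the exercise. Plotting is left as a comment so that the sketch runs without a display.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# (Age, EstimatedSalary, Purchased) rows copied from the table above.
rows = [
    (19, 19000, 0), (35, 20000, 0), (26, 43000, 0), (27, 57000, 0), (19, 76000, 0),
    (27, 58000, 0), (27, 84000, 0), (32, 150000, 1), (25, 33000, 0), (35, 65000, 0),
    (26, 80000, 0), (26, 52000, 0), (20, 86000, 0), (32, 18000, 0), (18, 82000, 0),
    (29, 80000, 0), (47, 25000, 1), (45, 26000, 1), (46, 28000, 1), (48, 29000, 1),
    (45, 22000, 1), (47, 49000, 1), (48, 41000, 1), (45, 22000, 1), (46, 23000, 1),
    (47, 20000, 1), (49, 28000, 1), (47, 30000, 1), (29, 43000, 0),
]
data = np.array(rows, dtype=float)
X, y = data[:, :2], data[:, 2].astype(int)

# b. Split into training and test sets (split size and seed are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# c. Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# d. Fit the SVM (the linear kernel is an assumed choice).
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print("test accuracy:", accuracy)

# e/f. Predict on a grid over the scaled feature space; a plotting call such as
# matplotlib's contourf(xx, yy, Z) plus a scatter of the points shows the boundary.
xx, yy = np.meshgrid(
    np.linspace(X_train[:, 0].min() - 1, X_train[:, 0].max() + 1, 100),
    np.linspace(X_train[:, 1].min() - 1, X_train[:, 1].max() + 1, 100),
)
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```

Fitting the scaler on the training set alone keeps information from the test set out of the model, which is why the test set only goes through transform.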


Post-lab:-
1. The dataset below represents the bank transactions of KVB bank for an hour. Consider the x
co-ordinate as Balance and the y co-ordinate as Trtn_amt. Perform the following operations
on the given dataset:
S.No  transaction_ID  Balance  Trtn_amt  sucornot 
1  3467  98687.36  500  0 
2  4801  8510.47  100  0 
3  2093  2475.3  200  1 
4  9933  37743.25  1000  0 
5  7178  2705.95  600  0 
6  1093  60314  750  1 
7  3708  812129.5  280  1 
8  3804  8076.25  140  0 
9  3192  42323.14  310  1 
10  3666  47045.25  2500  0 
11  8598  96171.25  6900  0 
12  8743  608581.8  8520  1 
13  9302  586057.3  410  1 
14  6127  4587.5  750  0 
15  7502  43597.75  250  0 
 
a. Import the dataset into Python
b. Split the dataset into training and testing sets
c. Apply feature scaling on the training and test sets
d. Fit SVM to the training set
e. Visualize the training set results
f. Visualize the test set results.

Viva Voce:-
1. What are the advantages of SVM?
2. How many types of machine learning are there, and which type does SVM fall under?
3. What are the tuning parameters in SVM?

Lab#11: Rule Based Classification
Pre-requisite:
Refer to pages 355-363 in Han J & Kamber M, "Data Mining: Concepts and Techniques",
Third Edition, Elsevier, 2011
Pre-lab:-
1. What is rule-based classification in data mining?
2. Briefly explain how classification rules are built.
3. When to stop building a rule?
4. List some aspects of sequential covering.
5. What are the characteristics of a rule-based classifier?
6. Define coverage and accuracy.

In-lab:-
1. Implement simple Python code for rule-based classification on the "AllElectronicsCustomer"
database (download the dataset from LMS).
 
RID  age  income  student  Credit_rating  Class: buys_computer
1  youth  high  no  fair  no
2  youth  high  no  excellent  no
3  middle_aged  high  no  fair  yes 
4  senior  medium  no  fair  yes 
5  senior  low  yes  fair  yes 
6  senior  low  yes  excellent  no 
7  middle_aged  low  yes  excellent  yes 
8  youth  medium  no  fair  no 
9  youth  low  yes  fair  yes 
10  senior  medium  yes  fair  yes 
11  youth  medium  yes  excellent  yes 
12  middle_aged  medium  no  excellent  yes 
13  middle_aged  high  yes  fair  yes 
14  senior  medium  no  excellent  no 
 
a. Calculate accuracy and coverage, and print the RID values, when the
following rules are satisfied:
o Rule R1: if the age of the person is in the category "youth" and he/she is
a student, then the person purchases a computer.
o Rule R2: if the age of the person is in the category "middle_aged", income
is either medium or high, and Credit_rating is excellent, then the person
buys a computer.
o Rule R3: if the age of the person is in the category "senior" and he/she is
a student, then the person purchases a computer.
o Rule R4: if the age of the person is in the category "senior", income is high,
he/she is a student, and Credit_rating is fair, then the person purchases a
computer.
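
One possible way to organise the calculation is sketched below in plain Python. It takes coverage as the fraction of all records that satisfy a rule's antecedent, and accuracy as the fraction of those covered records whose class is "yes". The records are typed in from the table above. The names FIELDS, rules and evaluate are illustrative only, and returning None for a rule that covers no record is an assumed convention.

```python
# (RID, age, income, student, credit_rating, buys_computer) from the table above.
records = [
    (1, "youth", "high", "no", "fair", "no"),
    (2, "youth", "high", "no", "excellent", "no"),
    (3, "middle_aged", "high", "no", "fair", "yes"),
    (4, "senior", "medium", "no", "fair", "yes"),
    (5, "senior", "low", "yes", "fair", "yes"),
    (6, "senior", "low", "yes", "excellent", "no"),
    (7, "middle_aged", "low", "yes", "excellent", "yes"),
    (8, "youth", "medium", "no", "fair", "no"),
    (9, "youth", "low", "yes", "fair", "yes"),
    (10, "senior", "medium", "yes", "fair", "yes"),
    (11, "youth", "medium", "yes", "excellent", "yes"),
    (12, "middle_aged", "medium", "no", "excellent", "yes"),
    (13, "middle_aged", "high", "yes", "fair", "yes"),
    (14, "senior", "medium", "no", "excellent", "no"),
]
FIELDS = ("rid", "age", "income", "student", "credit_rating", "buys_computer")
data = [dict(zip(FIELDS, r)) for r in records]

# Each rule is its antecedent; the consequent is always buys_computer == "yes".
rules = {
    "R1": lambda r: r["age"] == "youth" and r["student"] == "yes",
    "R2": lambda r: (r["age"] == "middle_aged"
                     and r["income"] in ("medium", "high")
                     and r["credit_rating"] == "excellent"),
    "R3": lambda r: r["age"] == "senior" and r["student"] == "yes",
    "R4": lambda r: (r["age"] == "senior" and r["income"] == "high"
                     and r["student"] == "yes" and r["credit_rating"] == "fair"),
}


def evaluate(rule):
    """Return (covered RIDs, coverage, accuracy) for one rule."""
    covered = [r for r in data if rule(r)]
    correct = [r for r in covered if r["buys_computer"] == "yes"]
    coverage = len(covered) / len(data)
    # Accuracy is undefined when the rule covers nothing; None marks that case.
    accuracy = len(correct) / len(covered) if covered else None
    return [r["rid"] for r in covered], coverage, accuracy


results = {name: evaluate(rule) for name, rule in rules.items()}
for name, (rids, coverage, accuracy) in results.items():
    print(name, "RIDs:", rids, "coverage:", round(coverage, 3), "accuracy:", accuracy)
```

On this table R1 covers RIDs 9 and 11, R2 covers RID 12, R3 covers RIDs 5, 6 and 10 (two of which are correctly classified), and R4 covers no record at all.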
  


Post-lab:-
1. Extract possible classification rules from the given decision tree.
2. Write the sequential covering algorithm used in rule induction.
3. Difference between decision tree and rule-based classification.

Viva Voce:-
1. A rule-based classifier classifies records by using a collection of rules.
2. Most rule-based classification systems use which strategy?
3. Difference between class-based ordering and rule-based ordering.
4. Briefly explain the below terms in your own words:
a. Mutually exclusive
b. Exhaustive
5. Name the terms that define the following statements:
a. Fraction of records that satisfy only the antecedent of a rule.
b. Fraction of records that satisfy both the antecedent and consequent of a rule.

Lab#12: Outliers Detection
Pre-lab:-
1. What do you mean by an outlier? What are the main causes of outliers?
2. What are the important methods for outlier detection?
3. Why is outlier detection necessary in data analysis?
4. How do we calculate z-score?
5. Consider the dataset below, which comprises the income (in
thousands) of 15 people in an organisation.
[45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52]
What do you observe from the above data? Is there any significant difference between the
incomes of a few employees? If so, what could be the reason for it?
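
As a rough illustration of questions 2, 4 and 5, the sketch below applies a z-score check and an interquartile-range check to the income list, using only the Python standard library. The z-score threshold of 2, the use of the population standard deviation, and the "inclusive" quartile method are assumptions made for this small sample. With a threshold of 3, no value in this list would be flagged.

```python
import statistics

incomes = [45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52]

# z-score: distance from the mean measured in standard deviations.
mean = statistics.mean(incomes)
std = statistics.pstdev(incomes)  # population standard deviation (assumed choice)
z_scores = [(x - mean) / std for x in incomes]

Z_THRESHOLD = 2  # assumed threshold for this 15-value sample
z_outliers = [x for x, z in zip(incomes, z_scores) if abs(z) > Z_THRESHOLD]

# IQR rule: flag values beyond 1.5 * IQR outside the first and third quartiles.
q1, _, q3 = statistics.quantiles(incomes, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
iqr_outliers = [x for x in incomes if x < lower_bound or x > upper_bound]

print("z-score outliers:", z_outliers)
print("IQR bounds:", lower_bound, upper_bound)
print("IQR outliers:", iqr_outliers)
```

Both checks single out the incomes 2 and 99, which sit far from the other thirteen values.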

In-lab:-
1. The Boston house prices dataset consists of 9 attributes: CRIM, ZN, INDUS, LSTAT, NOX, RM, DIS, RAD,
TAX. The description of each attribute:
o CRIM: per capita crime rate by town
o ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
o INDUS: proportion of non-retail business acres per town
o NOX: nitric oxides concentration (parts per 10 million)
o RM: average number of rooms per dwelling
o DIS: weighted distances to five Boston employment centres
o RAD: index of accessibility to radial highways
o TAX: full-value property-tax rate per $10,000
Boston dataset: https://drive.google.com/file/d/1YVYWQWPKsLX1UM-0XCnGCwD1NIi7_uIv/view?usp=sharing
a. Using boxplot, detect which columns have outliers.
b. Implement a scatter plot between INDUS and TAX and inspect the outliers.
c. Apply the z_score outlier detection method on the Boston dataset considering threshold = 3.
d. Print any five z_score values of the outliers.
e. Remove all the outliers obtained from the dataset and refashion the dataset.
f. Apply interquartile range (IQR) outlier detection on the dataset and print the IQR value of each
column.
g. Calculate lower_bound and upper_bound and print boolean values wherein the outliers
are represented as TRUE.
h. Remove all the outliers produced by the interquartile range method and refashion the dataset.
2. Consider the iris dataset. It includes three iris species with 50 samples each, as well as some properties
about each flower. The columns in this dataset are:

o SepalLengthCm
o SepalWidthCm
o PetalLengthCm
o PetalWidthCm
o Species
https://drive.google.com/file/d/1HEEMrAQqAynHdM5TmK0G-mD5Qr0OW2J8/view?usp=sharing
Import the csv file and use the boxplot method to visualise the outliers considering the 4 properties of
a flower. You will notice that one of the properties has outliers.
1. Considering the range of the outliers from the visualisation, display the observations which have
outliers.
2. Implement a DBSCAN model fitting on the dataset, taking the epsilon value as 0.8 and the minimum samples
value as 19.
3. Print the counter values using the Counter function on the model labels.
4. Considering the values obtained from the model labels, print the outliers of the data.
5. Draw a scatter plot between petal length and sepal width to
visualise the outliers.
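
The DBSCAN steps of exercise 2 could be approached along the lines below, assuming scikit-learn is installed. As a stand-in for the CSV file linked above, the sketch loads the copy of the iris data bundled with scikit-learn; that substitution is an assumption, and the column order (sepal length, sepal width, petal length, petal width) is the one used by load_iris. The scatter plot is omitted.

```python
from collections import Counter

from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

# Stand-in for the lab's CSV: four numeric flower measurements per sample.
X = load_iris().data

# Fit DBSCAN with the epsilon and minimum-samples values given in the exercise.
model = DBSCAN(eps=0.8, min_samples=19).fit(X)

# Count how many samples received each label; DBSCAN marks noise points as -1.
label_counts = Counter(model.labels_.tolist())
print(label_counts)

# Rows labelled -1 are the points DBSCAN treats as outliers.
outliers = X[model.labels_ == -1]
print(outliers)
```

For the final step, the noise rows can be highlighted in a scatter plot of petal length against sepal width by colouring the points with model.labels_.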
Post-lab:-
1. Consider the following student dataset, which consists of student details of two schools in a town:
https://drive.google.com/file/d/1edmKnHjXkTyHT6gSYhwLw9rTpzoy1Cig/view?usp=sharing
i. Find the students who have taken more leaves
than the average number of absences by implementing a z_score function taking
mean and standard deviation into account.
ii. Find the number of students who got the least and highest scores in
the subject G1, considering threshold = 2.5
iii. Apply boxplot for the above two instances.

2. Can we find outliers for categorical values? Explain.
3. A sugar factory weighs every sugar packet on the weighing machine before
packing them into cartons. As per the guidelines of the factory, the standard weight of
each sugar packet should be 60 grams. It has been observed that during the final
weighing of the packets, a few of them gave an anomalous weight due to malfunctioning
of the weighing machines.
Consider the dataset below, which comprises the weights of the packets.
https://drive.google.com/file/d/1JkdkQ3j-J93DCfZa3kUjDycEtRzShk6V/view?usp=sharing
a. Find those anomalous weights by plotting a histogram.
b. In the range 0 to 1, consider lower_bound = 0.1 & upper_bound = 0.9 and find the
outliers using the quantile method.
c. Segregate the outliers from the inliers
using the "loc" method to get the values of "true_index". Also obtain the values of "false_index".
d. Now find the median from the values obtained in "true_index".
e. Replace all the outliers with the median.
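
A minimal pandas sketch of steps b to e is given below. The packet weights are hypothetical values invented for illustration, not the linked dataset. The sketch also assumes one reading of the exercise: "true_index" marks the weights strictly inside the quantile bounds, and "false_index" marks the outliers.

```python
import pandas as pd

# Hypothetical packet weights in grams (illustrative only, not the lab data).
weights = pd.Series([60, 61, 59, 60, 62, 58, 60, 95, 61, 59, 60, 20, 60, 61, 59])

# b. Quantile method: the 0.1 and 0.9 quantiles act as the lower and upper limits.
lower_bound, upper_bound = 0.1, 0.9
res = weights.quantile([lower_bound, upper_bound])

# c. Use loc to read each limit; true_index marks inliers (assumed reading).
true_index = (weights > res.loc[lower_bound]) & (weights < res.loc[upper_bound])
false_index = ~true_index

# d. Median of the values selected by true_index.
median = weights[true_index].median()

# e. Keep the inliers and replace every outlier with that median.
cleaned = weights.where(true_index, median)

print("limits:", res.loc[lower_bound], res.loc[upper_bound])
print("outliers:", weights[false_index].tolist())
print("median of inliers:", median)
print(cleaned.tolist())
```

By construction the quantile method flags roughly the lowest and highest tenth of the values, so it also marks mild deviations such as 58 and 62 here. The histogram from step a helps judge which flagged weights are truly anomalous.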
Viva Voce:-
1. Is it good to remove an outlier from the dataset all the time?
2. What are the applications of outlier detection?
3. What are the different types of outliers?
4. Are outliers just side products of some clustering algorithms?
5. What is the difference between noise and anomaly?