
DATA WAREHOUSING AND MINING

(21CS3052R)

STUDENT ID: ACADEMIC YEAR: 2023-24


STUDENT NAME:
Table of Contents

1. Session 01: Basic Statistical Descriptions
2. Session 02: To implement data pre-processing techniques
3. Session 03: To implement principal component analysis
4. Session 04: Classification using Decision Trees
5. Session 05: Classification using K Nearest Neighbor
6. Session 06: Classification using Bayesian Classifiers
7. Session 07: Classification using Back propagation
8. Session 08: Association Rule Mining - Apriori
9. Session 09: Implementation of K-Means Clustering
10. Session 10: Classification: Support Vector Machine (SVM)
11. Session 11: Rule Based Classification
12. Session 12: Outliers Detection
A.Y. 2023-24 LAB/SKILL CONTINUOUS EVALUATION

Columns: S.No | Date | Experiment Name | Pre-Lab (10M) | In-Lab (25M): Program/Procedure (5M), Data and Results (10M), Analysis & Inference (10M) | Post-Lab (10M) | Viva Voce (5M) | Total (50M) | Faculty Signature

1. Basic Statistical Descriptions
2. To implement data pre-processing techniques
3. To implement principal component analysis
4. Classification using Decision Trees
5. Classification using K Nearest Neighbor
6. Classification using Bayesian Classifiers
7. Classification using Back propagation
8. Association Rule Mining - Apriori
9. Implementation of K-Means Clustering
10. Classification: Support Vector Machine (SVM)
11. Rule Based Classification
12. Outliers detection


Experiment # <TO BE FILLED BY STUDENT> Student ID <TO BE FILLED BY STUDENT>
Date <TO BE FILLED BY STUDENT> Student Name <TO BE FILLED BY STUDENT>

Lab#1: Basic Statistical Descriptions


Date of the Session: ____/____/ Time of the Session ____to __

Pre-lab

BASIC STATISTICAL DESCRIPTIONS OF DATA


Basic statistical descriptions provide the analytical foundation for data preprocessing. They can be used
to identify properties of the data and to highlight which data values should be treated as noise or
outliers.

Answer the following Questions

1. What are the various ways to measure the central tendency of data?

2. What are the several ways of measuring the dispersion of data?

3. What is IQR (inter quartile range)?

Course Title Data Warehousing and Mining ACADEMIC YEAR: 2023-24


Course Code(s) 21CS3052R Page 1 of 67

4. Observe the following diagrams and identify the quantile plot and the q-q plot. How does
the q-q plot differ from the quantile plot?

5. What items make up the five-number summary?

6. Identify the symmetric data, positively skewed data, and negatively skewed data from
the graphs below.


In-lab
1. You are given a dataset “cars” for analysis; it includes the variables speed and
distance. (Download the dataset from the LMS.)

a) What are the average speed and the average distance of the cars?
b) What are the median and midrange of the data?
c) Find the mode of the data and comment on the data modality (i.e., unimodal or bimodal).
d) What are the variance and the standard deviation of the data?
e) Find the five-number summary of the data.
f) Show the histogram and box plot of the data.
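
The measures asked for in (a) to (e) can be sketched with Python's standard `statistics` module. This is a minimal sketch, not the expected lab solution: the `speed` values are hypothetical stand-ins for the column in the cars dataset, the helper names `midrange` and `five_number_summary` are made up for illustration, and the quartiles use the "inclusive" interpolation method, which is only one of several conventions.

```python
import statistics

# Hypothetical speed values; replace with the "speed" column of the cars dataset.
speed = [4, 7, 7, 8, 9, 10, 10, 10, 11, 12]

def midrange(values):
    """Average of the smallest and largest value."""
    return (min(values) + max(values)) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

mean = statistics.mean(speed)
median = statistics.median(speed)
modes = statistics.multimode(speed)    # more than one entry means multimodal data
variance = statistics.variance(speed)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(speed)      # sample standard deviation
summary = five_number_summary(speed)
iqr = summary[3] - summary[1]          # interquartile range = Q3 - Q1

print(mean, median, midrange(speed), modes, variance, std_dev, summary, iqr)
```

The same calls can be repeated for the distance column. Histograms and box plots need a plotting library and are left out of this sketch.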


Writing space of the Problem: (For Student’s use only)


Post-lab
1. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with
the following results. (Download the dataset from lms)

a) Find the maximum and the minimum percentage of the fat and age of the adults who
visited the hospital.
b) Calculate mean, median and midrange of the age.
c) Find the first quartile and third quartile of the data.
d) Draw a scatter plot and q-q plot based on these two variables.

Writing space of the Problem: (For Student’s use only)


Viva Voce:-
1. What is the difference between symmetric data and skewed data?
2. What are the most widely used forms of quartiles?
3. Variance and standard deviation fall under what category of measuring data?
4. What do low and high standard deviations indicate?
5. Under what condition are two variables said to be correlated?

(For Evaluator’s use only)

Comment of the Evaluator (if Any) Evaluator’s Observation


Marks Secured: out of

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab#2: To implement data pre-processing techniques.


Date of the Session: ____/____/____ Time of the Session ____to_____
Pre-lab

DATA PREPROCESSING:
Databases are highly susceptible to noisy, missing, and inconsistent data because
of their typically huge size (often several gigabytes or more). Low-quality
data will lead to low-quality mining results. Pre-processing helps to obtain quality data.
The steps involved in data pre-processing are data cleaning, data integration,
data reduction, and data transformation.

Match the following:
i.    Data cleaning            a. Reduced representation of data
ii.   Data integration         b. xold/xmax
iii.  Data reduction           c. Deal with missing values and noisy data
iv.   Normalization            d. Works to remove noisy data
v.    Data transformation      e. (xold - xmin)/(xmax - xmin)
vi.   Decimal scaling          f. Merging of data from multiple data stores
vii.  Min-max normalization    g. Scale the data values into a specified range
viii. Z-score normalization    h. Convert data into appropriate forms
ix.   Smoothing                i. (xold - mean)/standard deviation

1. Mention any two methods that deal with missing values and noisy data.

2. Mention two techniques that are applied to obtain a reduced data set.
3. Using min-max normalization, transform the value 35 onto the range [0.0, 1.0].
4. Using z-score normalization, transform the value 35, where the standard deviation is
12.94 years.
5. Using normalization by decimal scaling, transform the value 35.

In-lab
1. You are given a data set “data” for analysis; it includes the attributes Country, Purchased item,
Age, and Salary.

a. Identify the number of missing values in the given dataset.
b. Drop the tuples that have missing values in the attributes.
c. Check the data type of age; if it is not an integer, convert it into an integer.
d. Normalize the salary using simple feature scaling.
e. Categorize the salary into low, medium, and high bins.
f. Turn the categorical values into numerical values.
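
Steps (a) to (f) can be sketched in plain Python as below. This is only an illustration under stated assumptions: the rows are hypothetical (the real file comes from the LMS), missing values are assumed to be represented as `None`, the three salary bins are assumed to be equal-width, and the column names `SalaryScaled`, `SalaryBin`, and `CountryCode` are invented for the example. In the lab itself a data-frame library would likely be more convenient.

```python
# Hypothetical rows standing in for the "data" file; None marks a missing value.
rows = [
    {"Country": "India", "Purchased": "Yes", "Age": 44.0, "Salary": 72000.0},
    {"Country": "France", "Purchased": "No", "Age": None, "Salary": 48000.0},
    {"Country": "Spain", "Purchased": "No", "Age": 30.0, "Salary": 54000.0},
    {"Country": "India", "Purchased": "Yes", "Age": 38.0, "Salary": None},
    {"Country": "Spain", "Purchased": "Yes", "Age": 50.0, "Salary": 90000.0},
]

# a. Count missing values per attribute.
missing = {key: sum(row[key] is None for row in rows) for key in rows[0]}

# b. Drop tuples that contain any missing value (copies keep `rows` untouched).
clean = [dict(row) for row in rows if None not in row.values()]

# c. Convert age to an integer type.
for row in clean:
    row["Age"] = int(row["Age"])

# d. Simple feature scaling: divide each salary by the maximum salary.
max_salary = max(row["Salary"] for row in clean)
for row in clean:
    row["SalaryScaled"] = row["Salary"] / max_salary

# e. Three equal-width bins over the observed salary range (assumed binning rule).
min_salary = min(row["Salary"] for row in clean)
width = (max_salary - min_salary) / 3
labels = ["low", "medium", "high"]
for row in clean:
    index = min(int((row["Salary"] - min_salary) / width), 2)
    row["SalaryBin"] = labels[index]

# f. Label-encode the categorical Country attribute (alphabetical order).
countries = sorted({row["Country"] for row in clean})
codes = {name: code for code, name in enumerate(countries)}
for row in clean:
    row["CountryCode"] = codes[row["Country"]]

print(missing)
print(clean)
```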

Writing space of the Problem: (For Student’s use only)

2. Suppose John is working as a manager at the Nuclear Power Corporation of India and has
been charged with analyzing the nuclear power station construction data. He carefully inspects
the company’s database, identifying and selecting the attributes (cost, date, t1, t2 and cap) to
be included in the analysis. (Download the dataset from the LMS.)
a. He noticed that several attributes have no recorded value for various tuples.
b. He observed that the data type of year is recorded as float instead of integer.
c. He wants to normalize all the data (variables) so that they carry equal weight.
d. Finally, he wants to know if there are any outliers present in the cost of the
construction. You immediately set out to perform this task.
(Hint: missing values can be handled by replacing them with the mean.)

Writing space of the Problem: (For Student’s use only)


Post-lab
1. Data (13, 5, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70)
Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps.

Comment on the effect of this technique for the given data. Also plot a histogram.
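
Smoothing by bin means with equal-depth bins can be sketched as follows. The function name `smooth_by_bin_means` is made up for this illustration, the data values are copied as printed in the question, and rounding to two decimals is an arbitrary presentation choice.

```python
def smooth_by_bin_means(values, depth):
    """Sort the values, split them into bins of `depth` items, and replace
    every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        bin_values = ordered[start:start + depth]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([round(mean, 2)] * len(bin_values))
    return smoothed

# Values exactly as listed in the question.
data = [13, 5, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
        33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

smoothed = smooth_by_bin_means(data, depth=3)
print(smoothed)
```

Comparing `sorted(data)` with `smoothed` side by side shows how each group of three neighbouring values collapses onto one representative value.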

Writing space of the Problem: (For Student’s use only)


2. Use the methods below to normalize the following group of data: 200, 300, 400, 600 and 1000.
a. min-max normalization by setting min = 0 and max = 1
b. z-score normalization
c. z-score normalization using the mean absolute deviation instead of the standard deviation
d. Normalization by simple feature scaling.
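
The normalization methods named in this session can be written as small functions, sketched below. The function names are illustrative only, and the z-score variant assumes the population standard deviation (`statistics.pstdev`); using the sample standard deviation instead would change the numbers slightly. Decimal scaling is included because it appears in the pre-lab questions.

```python
import statistics

data = [200, 300, 400, 600, 1000]

def min_max(values, new_min=0.0, new_max=1.0):
    """(x - min) / (max - min), rescaled to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """(x - mean) / standard deviation (population form, an assumption)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def z_score_mad(values):
    """(x - mean) / mean absolute deviation."""
    mean = statistics.mean(values)
    mad = sum(abs(v - mean) for v in values) / len(values)
    return [(v - mean) / mad for v in values]

def simple_feature_scaling(values):
    """x / max."""
    hi = max(values)
    return [v / hi for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

for name, func in [("min-max", min_max), ("z-score", z_score),
                   ("z-score (MAD)", z_score_mad),
                   ("feature scaling", simple_feature_scaling),
                   ("decimal scaling", decimal_scaling)]:
    print(name, func(data))
```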

Writing space of the Problem: (For Student’s use only)


3. a. Normalize the two variables (age, fat) based on z-score normalization.
b. Calculate the correlation matrix. Are these two variables positively or negatively correlated?
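
A correlation matrix for two variables can be built from the Pearson correlation coefficient, as in the sketch below. The `age` and `fat` values here are hypothetical placeholders, not the hospital data from the LMS, and the helper names are chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Pairwise Pearson coefficients; entry [i][j] compares column i with column j."""
    return [[pearson(a, b) for b in columns] for a in columns]

# Hypothetical values; replace with the age and fat columns of the dataset.
age = [20, 30, 40, 50, 60]
fat = [12.0, 18.5, 24.0, 27.5, 33.0]

matrix = correlation_matrix([age, fat])
for row in matrix:
    print(row)
```

A positive off-diagonal entry indicates a positive correlation and a negative one a negative correlation. Because the coefficient is unchanged by shifting and positive rescaling, z-score normalization of the inputs does not alter the matrix.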

Writing space of the Problem: (For Student’s use only)


Viva Voce:-
1. What are the factors that comprise data quality?
2. What do you mean by noise in the dataset?
3. What are outliers in the dataset?
4. What is discretization?
5. What is the difference between lossy and lossless data reduction?

(For Evaluator’s use only)

Comment of the Evaluator (if Any) Evaluator’s Observation

Marks Secured: out of

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab#3: To implement principal component analysis


Date of the Session: ____/____/____ Time of the Session ____to_____

Pre-lab:-

Principal Component Analysis:
Principal Component Analysis is a method of extracting important variables
from a large set of variables available in a data set. Suppose that the data to be reduced consist of tuples or
data vectors described by n attributes or dimensions. Principal components analysis (PCA;
also called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors
that can best be used to represent the data, where k ≤ n. The original data are thus projected onto
a much smaller space, resulting in dimensionality reduction.

1. What are principal components?
2. Mention the steps to construct principal components.

In-lab:-

1. Suppose that you are given a small 3x2 matrix. Calculate the principal
components without using the pca() function.
Matrix: ([3, 5], [4, 2], [1, 6])
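
One way to work through PCA by hand for two-column data is sketched below: centre the data, form the covariance matrix, take its eigenvalues and eigenvectors, and project. The function `pca_2d` is a hypothetical helper written for this illustration. It assumes the sample covariance (division by n - 1) and uses the closed-form eigen-decomposition of a symmetric 2x2 matrix, so it does not generalize to more columns. The sign of each eigenvector is arbitrary, so a library may return the same directions with flipped signs.

```python
import math

X = [[3, 5], [4, 2], [1, 6]]

def pca_2d(rows):
    """PCA for two-column data: returns (eigenvalues, eigenvectors, projected rows)."""
    n = len(rows)
    means = [sum(r[j] for r in rows) / n for j in range(2)]
    centered = [[r[0] - means[0], r[1] - means[1]] for r in rows]

    # Sample covariance matrix [[a, b], [b, d]] (divides by n - 1).
    a = sum(c[0] * c[0] for c in centered) / (n - 1)
    d = sum(c[1] * c[1] for c in centered) / (n - 1)
    b = sum(c[0] * c[1] for c in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    mid = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    eigenvalues = [mid + radius, mid - radius]

    if b == 0:
        # Already diagonal: the coordinate axes are the principal directions.
        eigenvectors = [[1.0, 0.0], [0.0, 1.0]] if a >= d else [[0.0, 1.0], [1.0, 0.0]]
    else:
        eigenvectors = []
        for lam in eigenvalues:
            # (b, lam - a) solves (C - lam * I) v = 0; normalise to unit length.
            norm = math.hypot(b, lam - a)
            eigenvectors.append([b / norm, (lam - a) / norm])

    # Project each centred row onto the principal directions.
    projected = [[c[0] * v[0] + c[1] * v[1] for v in eigenvectors] for c in centered]
    return eigenvalues, eigenvectors, projected

eigenvalues, eigenvectors, projected = pca_2d(X)
print(eigenvalues)
print(eigenvectors)
print(projected)
```

The ratio of each eigenvalue to their sum gives the share of variance explained by the corresponding component.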

Writing space of the Problem: (For Student’s use only)

2. Calculate the principal components for the matrix given in Q1 using the PCA function.


Post-lab:-

1. Pollution has been a concern since industrialization due to its effects on human lives
and the planet. According to the WHO, air pollution contributes to 7 million premature deaths
per annum. A report is generated on the quality of air over 5 months. It is found that the data
within the reported dataset are correlated. So, apply a suitable method to reconstruct the
dataset with 2 components. Also plot a graph of the two components.
(Download the dataset from the LMS.)

Writing space of the Problem: (For Student’s use only)

Viva Voce:-

1. Why is PCA preferable? Mention the two primary reasons.
2. Is there any loss of data if we use PCA?
3. PCA is an unsupervised technique. Do you agree? Why?
4. What are the applications of PCA?
5. Define the covariance matrix.

(For Evaluator’s use only)

Comment of the Evaluator (if Any) Evaluator’s Observation

Marks Secured: out of

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab#4: Classification using Decision Trees.
Date of the Session: ____/____/____ Time of the Session ____to_____

Pre-lab:-

1. What are the attribute selection measures in modeling a decision tree? Write
the respective equations for each of them.
2. What do you mean by entropy in a decision tree? How is it calculated?
3. What is information gain and how does it matter in a decision tree?
4. List the parameters of DecisionTreeClassifier and export_graphviz
and try to understand the role of each parameter.
5. Match the following:
1. ID3     a. GAIN RATIO
2. CART    b. INFORMATION GAIN
3. C4.5    c. GINI INDEX
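
The attribute selection measures mentioned in these questions can be computed directly from class labels. The sketch below uses the standard definitions of entropy, Gini index, and information gain; the function names and the apple/orange split are hypothetical and only serve to show the arithmetic.

```python
import math
from collections import Counter

def entropy(labels):
    """-sum(p_i * log2(p_i)) over the class proportions p_i."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gini(labels):
    """1 - sum(p_i ** 2) over the class proportions p_i."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy of the parent minus the size-weighted entropy of the child groups."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Hypothetical node with 5 apples and 5 oranges, split into two children.
parent = ["apple"] * 5 + ["orange"] * 5
children = [["apple"] * 4 + ["orange"], ["apple"] + ["orange"] * 4]

print(entropy(parent), gini(parent), information_gain(parent, children))
```

A split is more useful the more it reduces impurity: a higher information gain, or a lower weighted Gini index in the children, indicates a better attribute.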

In-lab:-

1. Implement the decision tree algorithm on the given data, which has weight and smoothness
as the segregating criteria for the fruits apple and orange. Apple is represented by the
number ‘1’ and orange by ‘0’. Construct a decision tree and apply the prediction measures to
the given data to obtain the types of fruits.
Weight Smooth Fruit
180 7 ?
140 8 ?
150 5 ?

Fruit dataset
https://drive.google.com/file/d/1qoMDjozHHELVn5tFAJxp8mMw0Ggt-BVX/view?usp=sharing

Convert the trained decision tree classifier into a graphviz object. Later, we use
the converted graphviz object for visualization. To visualize the decision tree, you just need
to open the .txt file, copy the contents of the file, and paste them into the Graphviz web
portal. Graphviz web portal address: http://webgraphviz.com

Writing space of the Problem: (For Student’s use only)
2. The diabetes dataset is given below.

(Ref: https://drive.google.com/file/d/1PJizP39JPh_T-5dQVcUVCfswrPSxT734/view?usp=sharing)

Make sure to install the scikit-learn package and other required packages.
1. Find the correlation matrix for the diabetes dataset.
2. Split the dataset into train_set and test_set for modeling and prediction. Divide the
dataset in such a way that the training set constitutes 70 percent of the original
dataset and the rest belongs to the test set.
3. Produce a decision tree model using
    a. the Gini index metric
    b. the entropy and information gain metric
on the training set using the DecisionTreeClassifier function.
4. Apply the prediction measures on the test dataset.
5. Define a function named accuracy_score by interpreting the difference between the predicted
values and the test set values. Display the accuracy in terms of
a. Fraction, using the accuracy_score function
b. Number of correct predictions.
6. Print the confusion matrix of the test dataset.
7. Calculate the following values manually after obtaining the confusion matrix:
a. Accuracy
b. Error rate
c. Precision
d. Recall (sensitivity)
e. F1 score
f. Specificity
8. Compare the two results (obtained from the two kinds of metrics) and state which method
is more accurate for this dataset.
9. Convert the trained decision tree classifier into a graphviz object.
Later, we use the converted graphviz object for visualization.

10. Plot the ROC curve and calculate the AUC.
11. Plot the recall vs precision curve.
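
The accuracy function and the manual measures derived from a confusion matrix can be sketched as below. The counts and label lists are hypothetical and are not results from the diabetes dataset; `classification_metrics` is an illustrative helper name, and `accuracy_score` here is a hand-written function in the spirit of the task, not the scikit-learn function of the same name.

```python
def accuracy_score(y_true, y_pred):
    """Return (fraction correct, number of correct predictions)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true), correct

def classification_metrics(tp, fp, fn, tn):
    """Derive the usual measures from the four cells of a binary confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # also called sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": specificity,
    }

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
fraction, n_correct = accuracy_score([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(metrics)
print(fraction, n_correct)
```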

Writing space of the Problem: (For Student’s use only)


Post-lab:-

1. What is the C4.5 algorithm and how does it work? State the differences between ID3 and C4.5.
2. Differentiate between over-fitting, over-fitting and over-fitting loss? Why does it occur during classification?
3. Explain the concept of pruning and why it is important. Differentiate between pre-pruning and post-pruning.

Viva Voce:-
1. What is the difference between supervised and unsupervised machine learning?

2. What is a confusion matrix? 


3. Which of the following is true about training and testing error in such a case?
a. The difference between training error and test error increases as the number of observations increases.
b. The difference between training error and test error decreases as the number of observations increases.
c. The difference between training error and test error will not change.
4. What is the difference between classification and clustering?
5. What are recommender systems?

(For Evaluator’s use only)

Comment of the Evaluator (if Any) Evaluator’s Observation

Marks Secured: out of

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab#5: Classification using K Nearest Neighbour.

Date of the Session: ____/____/____ Time of the Session ____to_____

Pre-requisite:
In LMS: Find the file named “Concept of k-Nearest-Neighbor.doc”. Read the
specified document and answer the questions below.

Pre-lab:-

1. State whether the given statement is true or false, with supporting reasoning.

a. k-Nearest-Neighbor is a simple algorithm that stores all available cases and
classifies the new case based on a dissimilarity measure.
b. The value of ‘k’ in the k-nearest-neighbor algorithm helps to check the number
of training set labels used to assign the most common label to the testing set.

2. List the industrial uses of the k-nearest-neighbor algorithm in the real world.




3. Write an algorithm for k-nearest-neighbor classification given k, the number of nearest
neighbors, and n, the number of attributes describing each tuple.

4. Compare the advantages and disadvantages of eager classification (e.g., decision tree,
Bayesian, neural network) versus lazy classification (e.g., k-nearest-neighbor, case-based reasoning).

5. Give the distance methods that are most commonly used in the k-nearest-neighbor algorithm.


In-lab:-
Perform the following analysis:
The step-by-step process to compute the k-nearest-neighbor algorithm is:
1. Determine the parameter k = number of nearest neighbors.
2. Calculate the distance between the test sample and the training samples.
3. Sort the distances and determine the nearest neighbors based on the kth minimum distance.
4. Gather the categories of the nearest neighbors.
5. Use the simple majority of the categories of the nearest neighbors as the prediction value of the testing sample.

Dataset:
Suppose we have the following “StudentDataSet” dataset, which consists of 1st
year CGPA, 2nd year CGPA, and Category (C: CRT, NC: Non-CRT) as parameters.

Std.No   1st year CGPA   2nd year CGPA   Category
1        8.5             8.5             C
2        8.2             9               C
3        7.5             7.6             C
4        5.5             4.5             NC
5        9.2             9               C
6        7.8             7.3             C
7        7.3             7.4             NC
8        7.9             7               NC
9        10              6               C
10       6.8             7.1             NC
11       6.5             7.1             NC
12       7.2             7.3             NC

When a new student comes only with 1st year CGPA and 2nd year CGPA as
information, predict the category of that new student (whether he belongs to CRT or Non-CRT) by
the Euclidean distance measure, where the Euclidean distance between 2 points or tuples,
say X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is

dist(X1, X2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}

Test sample:
The 1st year CGPA and 2nd year CGPA of the new student are 8.4 and 7.1 respectively. (Consider k = 3)
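
The five steps listed earlier can be sketched in Python as follows. The rows are the StudentDataSet from the table above; the names `euclidean` and `knn_predict` are illustrative, and ties in distance or in votes are not handled specially (sorting keeps the original row order for equal distances), which is a simplification.

```python
import math
from collections import Counter

# (1st year CGPA, 2nd year CGPA, category) rows of the StudentDataSet above.
students = [
    (8.5, 8.5, "C"), (8.2, 9.0, "C"), (7.5, 7.6, "C"), (5.5, 4.5, "NC"),
    (9.2, 9.0, "C"), (7.8, 7.3, "C"), (7.3, 7.4, "NC"), (7.9, 7.0, "NC"),
    (10.0, 6.0, "C"), (6.8, 7.1, "NC"), (6.5, 7.1, "NC"), (7.2, 7.3, "NC"),
]

def euclidean(a, b):
    """Square root of the summed squared attribute differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, sample, k, distance=euclidean):
    """Return the majority category among the k training rows nearest to `sample`."""
    # Steps 2-3: distance from the sample to every training row, sorted ascending.
    ranked = sorted(train, key=lambda row: distance(row[:2], sample))
    # Steps 4-5: gather the categories of the k nearest rows and take the majority.
    votes = Counter(row[2] for row in ranked[:k])
    return votes.most_common(1)[0][0]

prediction = knn_predict(students, (8.4, 7.1), k=3)
print(prediction)
```

Because the distance function is a parameter, another measure can be passed in through the `distance` argument without changing the rest of the code.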


Writing space of the Problem: (For Student’s use only)

Post-lab:-

1. Predict the Category of a student with 1st year CGPA and 2nd year CGPA of 7.3 and 7.1 respectively
using the Manhattan distance formula with k = 3 (manually).
Note: The Manhattan distance between two tuples (or points) a and b is defined as ∑i |ai − bi|.
2. Considering the above StudentDataSet, predict the Category of the new
student having 1st year CGPA and 2nd year CGPA of 8.4 and 7.1 respectively
by implementing Python code that uses the Manhattan distance measure to find the nearest
neighbors for k = 3, and check whether the output is the same for both
measuring techniques.


Viva Voce:-
Refer to pages 423, 424, 425 in Han J & Kamber M, “Data Mining: Concepts and Techniques”, Third Edition,
Elsevier, 2011

1. k-nearest-neighbor is a lazy learning algorithm.
2. How can the distance be computed for attributes that are not numeric, but nominal (or categorical), such as color?
3. List some techniques used to speed up the classification time.
4. If the value of a given attribute A is missing in tuple X1 and/or in tuple X2, the difference is always

(For Evaluator’s use only)

Comment of the Evaluator (if Any) Evaluator’s Observation

Marks Secured: out of

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab#6: Classification using Bayesian Classifiers

Date of the Session: ____/____/____ Time of the Session ____to_____

Pre-lab:-
1. Match the following
Column A                             Column B

a. Naive Bayesian Classification     a. Values are continuous
b. Bayesian belief network           b. Attributes conditionally dependent
c. Gaussian distribution             c. To avoid zero probability
d. Laplace estimator                 d. Attributes conditionally independent

2. Explain Bayes’ theorem and write its derived formulae.

3. Suppose we have continuous values for an attribute in a dataset; how do we calculate the probability?


4. Let us assume
   p(age=youth | buys_car=yes) = 0.222,
   p(income=medium | buys_car=yes) = 0.444 and
   p(buys_car=yes) = 0.643.
   Find the probability p(x | buys_car=yes), where x = (income=medium, age=youth).

5. While implementing a Naïve Bayesian classifier, suppose we encounter a zero probability.
We then add one to each of the counts to avoid the zero probability. What is this estimation called?


In-lab:-
1. Consider the given table named “Weather_cond.csv”, consisting of the attributes Temperature, Humidity,
and Windy and a class label named “Outcome”. Depending on the weather conditions, you
have to choose whether to play cricket or not.
a. Instead of using a conventional library function, write a Python function to split the dataset into
a training set and a test set. Assume a test size of 0.33.
b. Write a Python function to calculate the mean and standard deviation for each
numerical attribute in the data set.
c. Calculate the priors for the given dataset after splitting it into training and test sets using Python.
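
The three helper functions asked for here might look like the sketch below. Everything concrete in it is an assumption made for illustration: the rows are hypothetical (temperature, humidity, outcome) tuples rather than the contents of Weather_cond.csv, the function names are invented, the split shuffles with a fixed seed, and the standard deviation falls back to 0.0 when a class has a single row.

```python
import random
import statistics
from collections import Counter, defaultdict

def split_dataset(rows, test_size=0.33, seed=0):
    """Shuffle a copy of `rows` and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

def mean_and_std(column):
    """Mean and sample standard deviation of one numeric attribute."""
    std = statistics.stdev(column) if len(column) > 1 else 0.0
    return statistics.mean(column), std

def summarize_by_class(rows):
    """Map each class label to a list of (mean, std) pairs, one per attribute."""
    grouped = defaultdict(list)
    for *features, label in rows:
        grouped[label].append(features)
    return {label: [mean_and_std(col) for col in zip(*members)]
            for label, members in grouped.items()}

def class_priors(rows):
    """Relative frequency of each class label (the last field of every row)."""
    counts = Counter(row[-1] for row in rows)
    return {label: count / len(rows) for label, count in counts.items()}

# Hypothetical rows: (temperature, humidity, outcome).
weather = [
    (30, 85, "no"), (27, 90, "no"), (28, 78, "yes"), (21, 96, "yes"),
    (20, 80, "yes"), (18, 70, "no"), (17, 65, "yes"), (22, 95, "no"),
    (21, 70, "yes"), (24, 80, "yes"), (23, 70, "yes"), (26, 90, "yes"),
]

train, test = split_dataset(weather)
priors = class_priors(train)
summary = summarize_by_class(train)
print(len(train), len(test), priors, summary)
```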

Writing space of the Problem: (For Student’s use only)


2. The problem comprises 100 observations of medical details for Pima Indian patients. The
records describe instantaneous measurements taken from the patient, such as their age, the
number of times pregnant, and blood workup. All patients are women aged 21 or older. All
attributes are numeric, and their units vary from attribute to attribute. Each record has a class
value that indicates whether the patient suffered an onset of diabetes within
5 years of when the measurements were taken (1) or not (0). This is a standard dataset that has been studied
a lot in machine learning literature. A good prediction accuracy is 70%-76%.

Implement Python code to find the accuracy for the given dataset named
“Diabetes.csv” based on a train set and a test set. Take the test size as 0.4.

Writing space of the Problem: (For Student’s use only)


Post-lab:-
1. Consider the given table, which specifies a loan classification problem.

Tid   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1     Yes          Single           125K            No
2     No           Married          100K            No
3     No           Single           70K             No
4     Yes          Married          120K            No
5     No           Divorced         95K             Yes
6     No           Married          60K             No
7     Yes          Divorced         220K            No
8     No           Single           85K             Yes
9     No           Married          75K             No
10    No           Single           90K             Yes

a. Compute the class conditional probability for each categorical attribute.
b. Predict the class label value for the test record X = (Home Owner = No, Marital Status = Married,
Income = $120K)
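
A naive Bayes calculation for this table can be sketched as below. The sketch assumes that the categorical attributes use plain relative frequencies (no Laplace correction) and that annual income is modelled per class with a Gaussian density using the sample standard deviation; both are modelling choices, and `gaussian_pdf` and `class_score` are illustrative names.

```python
import math
import statistics

# (home owner, marital status, annual income in thousands, defaulted) from the table above.
loans = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

def class_score(rows, label, home_owner, marital, income):
    """P(X | label) * P(label) under the naive conditional-independence assumption."""
    members = [r for r in rows if r[3] == label]
    p_home = sum(r[0] == home_owner for r in members) / len(members)
    p_marital = sum(r[1] == marital for r in members) / len(members)
    incomes = [r[2] for r in members]
    p_income = gaussian_pdf(income, statistics.mean(incomes), statistics.stdev(incomes))
    prior = len(members) / len(rows)
    return p_home * p_marital * p_income * prior

scores = {label: class_score(loans, label, "No", "Married", 120) for label in ("No", "Yes")}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

Any class in which one attribute value never occurs gets a score of exactly zero with this approach, which is the situation the Laplace estimator from the pre-lab is meant to avoid.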
  
Writing space of the Problem: (For Student’s use only)


Viva Voce:-
1. Explain the difference between a validation set and a test set.
2. What are the three types of Naïve Bayes classifier?
3. How many terms are required for building a Bayes model?
4. What are the training set and the testing set?
5. What are the advantages of Naive Bayes?

(For Evaluator’s use only)

Comment of the Evaluator (if Any) Evaluator’s Observation

Marks Secured: out of

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab#7: Classification using Backpropagation
Date of the Session: ____/____/____ Time of the Session ____to_____

Pre-lab:-
In LMS: Find the file named “Han J & Kamber M, Data Mining Concepts
and Techniques.doc”.
Read pages 398–404 of the specified document and answer the questions below.

1. State whether the given statement is True/False.
a. Backpropagation is a neural network learning algorithm.

b. Backpropagation learns by iteratively processing a data set of training
tuples, comparing the network’s prediction for each tuple with the actual
known target value.

2. What is the objective of Backpropagation?

3. Explain the Multilayer Feed-Forward Neural Network with a diagram.


4. How does backpropagation work?

5. Consider the following table.

Input   Desired Output   Model Output   Absolute Error   Square Error
0       0
1       2
2       4

Predict the Model Output by considering the initial value of the weight as 3. Find the Absolute
Error and Square Error. Use the Backpropagation algorithm to update the weight and try to
minimize the square error as much as possible.
Hint:
i. Model Output = W * I(x) (W = weight, I = Input, x = index that iterates from 0 to
length(Input))
ii. Absolute Error = mod(Model Output - Desired Output)
iii. Square Error = (Absolute Error)^2
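
For this single-weight model, the update can be illustrated with plain gradient descent on the summed square error. The learning rate of 0.05 and the 50 epochs are arbitrary choices made for this sketch, and the function names `errors` and `train` are illustrative.

```python
inputs = [0, 1, 2]
desired = [0, 2, 4]

def errors(weight):
    """Return one (model output, absolute error, square error) tuple per input."""
    rows = []
    for x, target in zip(inputs, desired):
        output = weight * x
        absolute = abs(output - target)
        rows.append((output, absolute, absolute ** 2))
    return rows

def train(weight, learning_rate=0.05, epochs=50):
    """Repeatedly move the weight against the gradient of the summed square error."""
    for _ in range(epochs):
        # d/dW of sum((W * x - target)^2) is sum(2 * (W * x - target) * x).
        gradient = sum(2 * (weight * x - t) * x for x, t in zip(inputs, desired))
        weight -= learning_rate * gradient
    return weight

initial_rows = errors(3)
final_weight = train(3.0)
print(initial_rows)
print(final_weight, errors(final_weight))
```

Printing `errors(weight)` after each epoch fills in the table one row set at a time and shows the square error shrinking as the weight approaches the value that reproduces the desired outputs.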
Writing space of the Problem: (For Student’s use only)


In-lab:-
Analysis: 
The following steps provide the foundation that you need to implement the Backpropagation
algorithm and apply it to your own predictive modelling problems:
1. Initialize Network.
2. Forward Propagate.
   i. Neuron Activation.
   ii. Neuron Transfer.
   iii. Forward Propagation.
3. Back Propagate Error.
   i. Transfer Derivative
   ii. Error Backpropagation
4. Train Network.
   i. Update Weights.
   ii. Train Network.
5. Test Network.

Dataset:
Suppose we have the following “Results Dataset”, which consists of the GPAs that some students
scored in two internal tests. It also contains another attribute named
‘Qualified’, which holds a value (Q/NQ) representing whether the student qualified for the final
examination.

S.No   Test-1   Test-2   Qualified
1      8.5      8.5      Q
2      8.2      9.0      Q
3      3.5      5.0      NQ
4      5.5      4.5      NQ
5      9.2      9.0      Q
6      7.8      7.3      Q
7      8.0      3.1      NQ
8      10       7.0      Q
9      4.5      6.0      NQ
10     6.8      7.1      Q
11     5.1      4.1      NQ
12     4.2      5.3      NQ

Problem: Train a network on the above “Results Dataset” by applying the Backpropagation algorithm.
a. Initialize a network with all weights and biases. (Consider weights in the range -0.5 to
+0.5, biases = 1, Learning Rate = {0.5, 0.7, 1})
b. Train the network according to the Dataset. (Consider both activation functions:
Sigmoid function and Tanh function)
c. Backpropagate the errors.

Writing space of the Problem: (For Student’s use only)

Post-lab:-
1. Use the network trained on the above “Results Dataset” and test whether it is
trained with 100% accuracy or not. Then predict the result (qualified for the final examination or not)
of a new entry with GPAs of 5.9 and 5.9 in Test-1 and Test-2 respectively.

Writing space of the Problem: (For Student’s use only)


Viva Voce:-
1. What are the general tasks that are performed with the backpropagation algorithm?
2. What kind of real-world problems can neural networks solve?
3. What is gradient descent?
4. Why is zero initialization not a recommended weight initialization technique?
5. How are artificial neural networks different from normal networks?

(For Evaluator’s use only)

Comment of the Evaluator (if Any)        Evaluator’s Observation
Marks Secured:        out of

Full Name of the Evaluator:

Signature of the Evaluator        Date of Evaluation:


Lab #8: Association Rule Mining - Apriori
Date of the Session:    /    /        Time of the Session:        to

Pre-lab:-
1. Define the Apriori algorithm.

2. What is association mining?

3. What is the need for association mining?

4. What are minimum support and minimum confidence?

5. Consider the market basket transactions given in the following table.


Let min_sup = 40% and min_conf = 40%
Transaction ID  Items Bought
T1 A,B,C
T2 A,B,C,D,E
T3 A,C,D
T4 A,C,D,E
T5 A,B,C,D

a. Find all the frequent itemsets using the Apriori algorithm.
b. Obtain significant decision rules.
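One way to check a hand-worked answer for the five transactions above is a small pure-Python Apriori sketch. The function names apriori and rules are hypothetical helpers written for this illustration (they are not taken from any library), support is measured as a fraction of transactions, and confidence is computed as support(X ∪ Y) / support(X).

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count the fraction of transactions that contain each candidate.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, confidence) with confidence >= min_confidence."""
    for itemset, support in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    yield antecedent, itemset - antecedent, confidence


# T1..T5 from the table above.
transactions = [set("ABC"), set("ABCDE"), set("ACD"), set("ACDE"), set("ABCD")]
frequent = apriori(transactions, min_support=0.4)
found_rules = list(rules(frequent, min_confidence=0.4))
for itemset, support in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

Because min_sup is 40% of 5 transactions, an itemset is frequent here when it appears in at least 2 transactions.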


In-lab:- For the following transaction dataset, perform the following operations:
a. Generate rules using the Apriori algorithm on the dataset below (one transaction per line).
shrimp, almonds, avocado, vegetables mix, green grapes, whole wheat flour, yams, cottage cheese
burgers, meatballs, eggs
chutney
turkey, avocado
mineral water, milk, energy bar, whole wheat rice, green tea, eggs
low fat yogurt
whole wheat pasta, french fries
soup, light cream, shallot
frozen vegetables, spaghetti, green tea
french fries
eggs, pet food
cookies
turkey, burgers, mineral water, eggs, cooking oil
spaghetti, champagne, cookies
mineral water, salmon, eggs
mineral water
shrimp, chocolate, chicken, honey, oil, cooking oil, low fat yogurt
turkey, eggs
turkey, fresh tuna, tomatoes, spaghetti, mineral water, black tea, salmon, eggs
meatballs, milk, honey, french fries, protein bar
red wine, shrimp, pasta, pepper, eggs, chocolate, shampoo
rice, sparkling water
spaghetti, mineral water, ham, body spray, pancakes, green tea
burgers, grated cheese, eggs, pasta, avocado, honey, white wine, toothpaste
eggs
parmesan cheese, spaghetti, soup, avocado, milk, fresh bread
ground beef, spaghetti, mineral water, milk, eggs, black tea, salmon, frozen smoothie
sparkling water
mineral water, eggs, chicken, chocolate, french fries
frozen vegetables, spaghetti, yams, mineral water

Writing space of the Problem: (For Student’s use only)


Post-lab:-
1. As in the In-lab question, generate rules on the dataset below (one transaction per line).
citrus fruit, semi-finished bread, margarine, ready soups
tropical fruit, yogurt, coffee
whole milk
pip fruit, yogurt, cream cheese, meat spreads
other vegetables, whole milk, condensed milk, long life bakery product
whole milk, butter, yogurt, rice, abrasive cleaner
rolls/buns
other vegetables, UHT-milk, rolls/buns, bottled beer, liquor (appetizer)
pot plants
whole milk, cereals
tropical fruit, other vegetables, white bread, bottled water, chocolate
citrus fruit, tropical fruit, whole milk, butter, curd, yogurt, flour, bottled water, dishes
beef
frankfurter, rolls/buns, soda
chicken, tropical fruit
butter, sugar, fruit/vegetable juice, newspapers
fruit/vegetable juice
packaged fruit/vegetables
chocolate
specialty bar
other vegetables
butter milk, pastry
whole milk
tropical fruit, cream cheese, processed cheese, detergent, newspapers
tropical fruit, root vegetables, other vegetables, frozen dessert, rolls/buns, flour, sweet spreads, salty snack, waffles, candy, bathroom cleaner
bottled water, canned beer
yogurt
sausage, rolls/buns, soda, chocolate
other vegetables
brown bread, soda, fruit/vegetable juice, canned beer, newspapers, shopping bags

Writing space of the Problem: (For Student’s use only)


Viva Voce:-
1. Who proposed the Apriori algorithm, and in which year?
2. What is a frequent itemset?
3. Why do we convert the dataset into a list?
4. What are the formulas for support, confidence and lift?
5. How did the Apriori algorithm get its name?

(For Evaluator’s use only)

Comment of the Evaluator (if Any)        Evaluator’s Observation
Marks Secured:        out of

Full Name of the Evaluator:

Signature of the Evaluator        Date of Evaluation:


Lab #9: Implementation of K-Means Clustering

Date of the Session:    /    /        Time of the Session:        to

Pre-Requisites:
Data pre-processing  
Basics of plotting techniques  
Various clustering techniques 

Pre-lab:-
1. Match the following.

   Parameters    Application
1. pch           a. To set the orientation of axis labels
2. col           b. No. of plots per row and column
3. mfrow         c. To set plot color
4. lwd           d. Plotting symbol
5. las           e. To set line width

2. List the various parameters and attributes in K-Means clustering.

3. Into how many types is clustering divided? Name them.

4. List various applications of clustering.


5. Describe Euclidean distance and Manhattan distance in brief, with their formulas.

6. List the basic steps involved in K-Means clustering.


In-lab:-
1. The given dataset comprises 150 data entries for different countries around
the world. It is a report on world happiness, a landmark survey of the state of
global happiness that ranks 156 countries by how happy their citizens perceive
themselves to be, with a focus on the technologies, social norms, conflicts and
government policies that have driven those changes. The records contain various
attributes of each country, including positive_effect, negative_effect, corruption,
freedom, healthlifeexpectancy, etc. The data frame includes categorical variables and
numerical values, and their values vary from country to country.
Implement Python code using scikit-learn to display a K-means clustering plot for the
given data frame named “world_happiness_report.csv”.
Writing space of the Problem: (For Student’s use only)

2. The given dataset named “Student_performance” consists of 150 data entries of
students in an institution and describes the performance of each student. It consists
of various attributes such as gender, ethnicity, test_preparation, math_score,
reading_score, etc. Perform K-means clustering on the given dataset, taking an
appropriate number of centers based on the mean and standard deviation of the data
entries. Analyze the cluster plot and give a brief note based on the results obtained.
Writing space of the Problem: (For Student’s use only)


Post-lab:-
1. This lab module aims to build an analysis of the customers of a shopping mall. The dataset consists of
150 observations of customers, with details that include gender, age,
annual_income, spending_score, etc. Based on the two parameters annual_income and
spending_score, try to build an analysis of the customers through cluster graphs.

Apply K-means clustering on the given dataset named “Mall_customers”, choosing the number
of clusters based on the mean and standard deviation of any two attributes of your choice, and
run K-means iteratively until the centroids stabilize.
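The "iterate until the centroids stabilize" idea can be sketched without any library, as below. This is a minimal illustration under stated assumptions: the six (annual_income, spending_score) points are made-up stand-ins rather than rows of “Mall_customers”, the initial centroids are simply picked by hand, and kmeans is a hypothetical helper name. In the lab itself a library implementation such as scikit-learn's KMeans would normally be used.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Plain K-means: repeat assignment and update until the centroids stop changing."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid (Euclidean distance).
        labels = [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:  # centroids have stabilized
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical (annual_income, spending_score) pairs, for illustration only.
points = [(15, 20), (18, 25), (20, 15), (80, 85), (85, 90), (90, 80)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(labels)
```

The result depends on the initial centroids, which is why library implementations usually run several initializations and keep the best one.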

Writing space of the Problem: (For Student’s use only)

Viva Voce:-
1. Which type of algorithm is K-means?
2. In the K-means clustering algorithm, what criterion is used to assign data points
to one cluster rather than another?
3. What are the basic steps in K-Means clustering?
4. What does K refer to in the K-means algorithm? (K refers to the number of clusters.)
5. How is the K-means algorithm different from the KNN algorithm?

(For Evaluator’s use only)

Comment of the Evaluator (if Any)        Evaluator’s Observation
Marks Secured:        out of

Full Name of the Evaluator:

Signature of the Evaluator        Date of Evaluation:


Lab #10: Classification: Support Vector Machine (SVM)

Date of the Session:    /    /        Time of the Session:        to

Pre-lab:-
1. What is SVM?

2. When do we use SVM?

3. What is the maximum marginal hyperplane, and what is the equation of the separating hyperplane?

4. What are the two cases of SVM?

5. What are the equations for a point that lies above the separating hyperplane and below the
separating hyperplane?


In-lab:-
1. Below is the data of the employees in a company. The data shows whether each
employee purchased the software or not. Take the x co-ordinate as Age and the y co-ordinate
as EstimatedSalary. Now, consider the following dataset and perform the operations below:
UserID  Gender  Age  EstimatedSalary  Purchased 
15624510  Male  19  19000  0 
15810944  Male  35  20000  0 
15668575  Female  26  43000  0 
15603246  Female  27  57000  0 
15804002  Male  19  76000  0 
15728773  Male  27  58000  0 
15598044  Female  27  84000  0 
15694829  Female  32  150000  1 
15600575  Male  25  33000  0 
15727311  Female  35  65000  0 
15570769  Female  26  80000  0 
15606274  Female  26  52000  0 
15746139  Male  20  86000  0 
15704987  Male  32  18000  0 
15628972  Male  18  82000  0 
15697686  Male  29  80000  0 
15733883  Male  47  25000  1 
15617482  Male  45  26000  1 
15704583  Male  46  28000  1 
15621083  Female  48  29000  1 
15649487  Male  45  22000  1 
15736760  Female  47  49000  1 
15714658  Male  48  41000  1 
15599081  Female  45  22000  1 
15705113  Male  46  23000  1 
15631159  Male  47  20000  1 
15792818  Male  49  28000  1 
15633531  Female  47  30000  1 
15744529  Male  29  43000  0 
a. Import the dataset into Python.
b. Split the dataset into training and testing sets.
c. Apply feature scaling on the training and test sets.
d. Fit SVM to the training set.
e. Visualize the training set results.
f. Visualize the test set results.


Post-lab:-
1. The dataset below represents the bank transactions of KVB bank for an hour. Consider the x
co-ordinate as Balance and the y co-ordinate as Trtn_amt. Perform the following operations
on the given dataset:
S.No  transaction_ID  Balance  Trtn_amt  sucornot 
1  3467  98687.36  500  0 
2  4801  8510.47  100  0 
3  2093  2475.3  200  1 
4  9933  37743.25  1000  0 
5  7178  2705.95  600  0 
6  1093  60314  750  1 
7  3708  812129.5  280  1 
8  3804  8076.25  140  0 
9  3192  42323.14  310  1 
10  3666  47045.25  2500  0 
11  8598  96171.25  6900  0 
12  8743  608581.8  8520  1 
13  9302  586057.3  410  1 
14  6127  4587.5  750  0 
15  7502  43597.75  250  0 

a. Import the dataset into Python.
b. Split the dataset into training and testing sets.
c. Apply feature scaling on the training and test sets.
d. Fit SVM to the training set.
e. Visualize the training set results.
f. Visualize the test set results.

Writing space of the Problem: (For Student’s use only)


Viva Voce:-
1. What are the advantages of SVM?
2. How many types of machine learning are there, and under which type does SVM fall?
3. What are the tuning parameters in SVM?

(For Evaluator’s use only)

Comment of the Evaluator (if Any)        Evaluator’s Observation
Marks Secured:        out of

Full Name of the Evaluator:

Signature of the Evaluator        Date of Evaluation:


Lab #11: Rule Based Classification
Date of the Session:    /    /        Time of the Session:        to

Pre-requisite:
Refer to pages 355-363 in Han J & Kamber M, “Data Mining: Concepts and Techniques”,
Third Edition, Elsevier, 2011

Pre-lab:-

1. What is rule-based classification in data mining?

2. Briefly explain how classification rules are built.

3. When should we stop building a rule?

4. List some aspects of sequential covering.

5. What are the characteristics of a rule-based classifier?

6. Define coverage and accuracy.


In-lab:-
1. Implement simple Python code for rule-based classification on the “AllElectronicsCustomer”
database. (Download the dataset from LMS)

RID  age  income  student  Credit_rating  Class: buys computer
1  youth  high  no  fair  no
2  youth  high  no  excellent  no 
3  middle_aged  high  no  fair  yes 
4  senior  medium  no  fair  yes 
5  senior  low  yes  fair  yes 
6  senior  low  yes  excellent  no 
7  middle_aged  low  yes  excellent  yes 
8  youth  medium  no  fair  no 
9  youth  low  yes  fair  yes 
10  senior  medium  yes  fair  yes 
11  youth  medium  yes  excellent  yes 
12  middle_aged  medium  no  excellent  yes 
13  middle_aged  high  yes  fair  yes 
14  senior  medium  no  excellent  no 

a. Calculate accuracy and coverage, and print the RID values, when the
following rules are satisfied:
o Rule R1: if the age of the person is in the category “youth” and he/she is
a student, then the person purchases the computer.
o Rule R2: if the age of the person is in the category “middle_aged”, income
is either medium or high, and Credit_rating is excellent, then the person
buys a computer.
o Rule R3: if the age of the person is in the category “senior” and he/she is
a student, then the person purchases a computer.
o Rule R4: if the age of the person is in the category “senior”, income is high,
he/she is a student, and Credit_rating is fair, then the person purchases a
computer.
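A minimal sketch of how rules R1 to R4 could be evaluated is given below. It assumes the usual definitions, coverage = (records satisfying the antecedent) / (all records) and accuracy = (covered records whose class is "yes") / (covered records), and it types the table in by hand instead of loading the LMS file; the names ROWS, RULES and evaluate are hypothetical.

```python
# Rows of the table above: (RID, age, income, student, credit_rating, buys_computer)
ROWS = [
    (1, "youth", "high", "no", "fair", "no"),
    (2, "youth", "high", "no", "excellent", "no"),
    (3, "middle_aged", "high", "no", "fair", "yes"),
    (4, "senior", "medium", "no", "fair", "yes"),
    (5, "senior", "low", "yes", "fair", "yes"),
    (6, "senior", "low", "yes", "excellent", "no"),
    (7, "middle_aged", "low", "yes", "excellent", "yes"),
    (8, "youth", "medium", "no", "fair", "no"),
    (9, "youth", "low", "yes", "fair", "yes"),
    (10, "senior", "medium", "yes", "fair", "yes"),
    (11, "youth", "medium", "yes", "excellent", "yes"),
    (12, "middle_aged", "medium", "no", "excellent", "yes"),
    (13, "middle_aged", "high", "yes", "fair", "yes"),
    (14, "senior", "medium", "no", "excellent", "no"),
]

# Each rule is an antecedent test; the consequent of every rule is buys_computer = yes.
RULES = {
    "R1": lambda r: r[1] == "youth" and r[3] == "yes",
    "R2": lambda r: r[1] == "middle_aged" and r[2] in ("medium", "high") and r[4] == "excellent",
    "R3": lambda r: r[1] == "senior" and r[3] == "yes",
    "R4": lambda r: r[1] == "senior" and r[2] == "high" and r[3] == "yes" and r[4] == "fair",
}


def evaluate(rule, rows):
    """Return (covered RIDs, coverage, accuracy) for one rule."""
    covered = [r for r in rows if rule(r)]
    correct = [r for r in covered if r[5] == "yes"]
    coverage = len(covered) / len(rows)
    # Accuracy is undefined when the rule covers no record at all.
    accuracy = len(correct) / len(covered) if covered else None
    return [r[0] for r in covered], coverage, accuracy


results = {name: evaluate(rule, ROWS) for name, rule in RULES.items()}
for name, (rids, coverage, accuracy) in results.items():
    print(name, rids, round(coverage, 3), accuracy)
```

Returning None for a rule that covers nothing is a design choice of this sketch; reporting 0 or skipping the rule would be equally defensible.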
  


Writing space of the Problem: (For Student’s use only)


Post-lab:-
1. Extract possible classification rules from the given decision tree.

2. Write the sequential covering algorithm used in rule induction.

3. What is the difference between decision tree and rule-based classification?


Viva Voce:-
1. A rule-based classifier classifies records by using a collection of rules.
2. Which strategy do most rule-based classification systems use?
3. What is the difference between class-based ordering and rule-based ordering?
4. Briefly explain the terms below in your own words:
a. Mutually exclusive
b. Exhaustive
5. Name the terms that define the following statements:
a. Fraction of records that satisfy only the antecedent of a rule.
b. Fraction of records that satisfy both the antecedent and the consequent of a rule.

(For Evaluator’s use only)

Comment of the Evaluator (if Any)        Evaluator’s Observation
Marks Secured:        out of

Full Name of the Evaluator:

Signature of the Evaluator        Date of Evaluation:


Lab #12: Outliers Detection
Date of the Session:    /    /        Time of the Session:        to

Pre-lab:-
1. What do you mean by an outlier? What are the main causes of outliers?

2. What are the important methods for outlier detection?

3. Why is outlier detection necessary in data analysis?

4. How do we calculate the z-score?

5. Consider the dataset below, which comprises the income (in
thousands) of 15 people in an organisation.
[45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52]
What do you observe from the above data? Is there any significant difference between the income of a few
employees? If so, what could be the reason for it?
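As a quick check on the income list above, the z-score of each value can be computed with the standard library. This is a sketch under assumptions: it uses the population standard deviation (statistics.pstdev), and the cut-off of 2 is chosen for this small sample only, since with these 15 values no z-score exceeds the more common threshold of 3 in magnitude.

```python
from statistics import mean, pstdev


def z_scores(values):
    """z = (x - mean) / standard deviation, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / sigma for x in values]


incomes = [45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52]
threshold = 2  # assumed cut-off for this sketch
scores = z_scores(incomes)
outliers = [x for x, z in zip(incomes, scores) if abs(z) > threshold]
print(outliers)
```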


In-lab:-
1. The Boston house prices dataset consists of 9 attributes: CRIM, ZN, INDUS, LSTAT, NOX, RM, DIS, RAD,
TAX. The description of each attribute:
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000

Boston dataset: https://drive.google.com/file/d/1YVYWQWPKsLX1UM-0XCnGCwD1NIi7_uIv/view?usp=sharing

a. Using a boxplot, detect which columns have outliers.
b. Implement a scatter plot between INDUS and TAX and inspect the outliers.
c. Apply the z_score outlier detection method on the Boston dataset, considering threshold = 3.
d. Print any five z_score values of the outliers.
e. Remove all the outliers obtained from the dataset and rebuild the dataset.
f. Apply interquartile range (IQR) outlier detection on the dataset and print the IQR values of each column.
g. Calculate lower_bound and upper_bound and print boolean values in which the outliers
are represented as TRUE.
h. Remove all the outliers produced by the interquartile range method and rebuild the dataset.
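Steps f to h can be sketched for a single column as shown below. The values are a small stand-in list, not a column of the Boston dataset, and iqr_bounds is a hypothetical helper; the conventional multiplier k = 1.5 is assumed, and the quartiles come from statistics.quantiles with its default 'exclusive' method, so they may differ slightly from what pandas or NumPy report for the same data.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower_bound, upper_bound) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Stand-in values for one numeric column (illustration only).
column = [45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52]
lower_bound, upper_bound = iqr_bounds(column)
# True marks an outlier, mirroring the boolean output asked for in step g.
is_outlier = [x < lower_bound or x > upper_bound for x in column]
# Step h: keep only the values that are not flagged.
cleaned = [x for x, flag in zip(column, is_outlier) if not flag]
print(lower_bound, upper_bound)
print(is_outlier)
```

For a full data frame the same bounds would be computed per column and a row dropped if any of its values is flagged.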

Writing space of the Problem: (For Student’s use only)

2. Consider the iris dataset. It includes three iris species with 50 samples each, as well as some properties
about each flower. The columns in this dataset are:

- SepalLengthCm
- SepalWidthCm
- PetalLengthCm
- PetalWidthCm
- Species

https://drive.google.com/file/d/1HEEMrAQqAynHdM5TmK0G-mD5Qr0OW2J8/view?usp=sharing

Import the csv file and use the boxplot method to visualise the outliers, considering the 4 properties of
a flower. You will notice that one of the properties has outliers.

1. Considering the range of the outliers from the visualisation, display the observations which have outliers.
2. Implement a DBSCAN model fitting on the dataset, taking the epsilon value as 0.8 and the minimum samples value as 19.
3. Print the counter values using the counter function on the model labels.
4. Considering the values obtained from the model labels, print the outliers of the data.
5. Draw a scatter plot between petal length and sepal width to visualise the outliers.
Writing space of the Problem: (For Student’s use only)

Post-lab:-
1. Consider the following student dataset
https://drive.google.com/file/d/1edmKnHjXkTyHT6gSYhwLw9rTpzoy1Cig/view?usp=sharing
which consists of student details of two schools in a town.
i. Find the students who have taken more leaves
than the average number of absences by implementing a z_score function, taking
the mean and standard deviation into account.

ii. Find the number of students who got the lowest and highest scores in
the subject G1, considering threshold = 2.5.

iii. Apply a boxplot for the above two instances.

2. Can we find outliers for categorical values? Explain.
3. A sugar factory weighs every sugar packet on the weighing machine before
packing them into cartons. As per the guidelines of the factory, the standard weight of
each sugar packet should be 60 grams. It has been observed that during the final
weighing of the packets, a few of them gave an anomalous weight due to malfunctioning
of the weighing machines.
Consider the dataset below, which comprises the weights of the packets.
https://drive.google.com/file/d/1JkdkQ3j-J93DCfZa3kUjDycEtRzShk6V/view?usp=sharing

a. Find those anomalous weights by plotting a histogram.

b. In the range 0 to 1, consider lower_bound = 0.1 and upper_bound = 0.9 and find the
outliers using the quantile method.

c. Segregate the outliers from the inliers using the “loc” method to get the values of “true_index”. Also obtain the values of “false_index”.

d. Now find the median from the values obtained in “true_index”.

e. Replace all the outliers with the median.

Writing space of the Problem: (For Student’s use only)


Viva Voce:-

1. Is it good to remove an outlier from the dataset all the time?
2. What are the applications of outlier detection?
3. What are the different types of outliers?
4. Are outliers just side products of some clustering algorithms?
5. What is the difference between noise and anomaly?

(For Evaluator’s use only)

Comment of the Evaluator (if Any)        Evaluator’s Observation
Marks Secured:        out of

Full Name of the Evaluator:

Signature of the Evaluator        Date of Evaluation:
