You are on page 1of 25

DATA ANALYTICS

TWOMARKSQUESTIONSANDANSWERS
Unit I

1.Define Data analytics.


Data analytics (DA) is the science of examining raw data with the purpose of drawing
conclusions about that information. Data analytics is used in many industries to allow companies and
organization to make better business decisions and in the sciences to verify or disprove existing
models or theories.

2.Give some alternative terms for data mining.


• Knowledge mining
• Knowledge extraction
• Data/pattern analysis.
• Data Archaeology
• Data dredging
3.What is KDD.
KDD-KnowledgeDiscovery in Databases.
4.What are the stepsinvolved in KDDprocess.
• Data cleaning
• Data Mining
• Pattern Evaluation
• Knowledge Presentation
• Data Integration
• Data Selection
• Data Transformation

5.What is the use of theknowledge base?


Knowledge base is domainknowledge thatis used to guide searchorevaluate theinterestingness
of resulting pattern. Such knowledge can include concepthierarchiesusedto organize attribute
/attribute values in to differentlevels ofabstractionofData Mining.

6.Arcitecture of a typical data mining system.


Knowledge base

7.Mention some of thedata miningtechniques.


• Statistics
• Machine learning
• Decision Tree
• Hiddenmarkovmodels
• Artificial Intelligence
• Genetic Algorithm
• Meta learning
8.Givefew statisticaltechniques.
• Point Estimation
• Data Summarization
• Bayesian Techniques
• Testing Hypothesis
• Correlation
• Regression

9.What ismeta learning.


Concept of combining the predictionsmadefrom multiplemodels of datamining and analyzing
thosepredictions to formulate a new andpreviously unknownprediction.
GUI
Pattern Evaluation
Database orData warehouseserver
DBDW

10.Define Genetic algorithm.


• Search algorithm.
• Enables us to locateoptimal binary string by processing an initialrandompopulation ofbinary
strings by performingoperations such asartificialmutation , crossover and selection.

11.What isthe purpose of Data mining Technique?


It provides a way to usevarious data mining tasks.

12.Define Predictive model.


It is used topredict the values of data by making use of known results from adifferent setof sample
data.

13.Data mining tasks that are belongs to predictivemodel


• Classification
• Regression
• Time series analysis

14.Define descriptive model

• It is used to determine the patterns and relationships in a sample data.Datamining tasks that
belongs to descriptivemodel:
• Clustering
• Summarization
• Association rules
• Sequence discovery

15. Define the term summarization


The summarization ofa large chunk ofdata contained ina web page or adocument.
Summarization = caharcterization=generalization

16. List outthe advanced database systems.


• Extended-relational databases
• Object-oriented databases
• Deductive databases
• Spatial databases
• Temporaldatabases
• Multimedia databases
• Active databases
• Scientificdatabases
• Knowledge databases

17. Define clusteranalysis


Clusteranalyses data objectswithoutconsultinga known class label. Theclasslabels are
notpresent inthe training datasimply because theyare not known to beginwith.

18.Classifications of Data mining systems.


• Based on the kinds of databases mined:
o Accordingto model
_ Relational mining system
_ Transactionalmining system
_ Object-orientedmining system
_ Object-Relationalmining system
_ Data warehousemining systemo Types of Data
_ Spatial data mining system
_ Time series data mining system
_ Text data mining system
_ Multimedia data mining system
• Based on kinds of Knowledgemined
o Accordingto functionalities
_ Characterization
_ Discrimination
_ Association
_ Classification
_ Clustering
_ Outlieranalysis
_ Evolution analysis
o Accordingto levels of abstraction ofthe knowledge mined
_ Generalized knowledge (High level of abstraction)
_ Primitive-level knowledge (Raw data level)
o Accordingto mine data regularities versus mine data irregularities
• Based on kinds of techniquesutilized
o Accordingto user interaction
_ Autonomoussystems
_ Interactiveexploratorysystem
_ Query-driven systems
o Accordingto methods of data analysis
_ Database-oriented
_ Data warehouse-oriented
_ Machinelearning
_ Statistics
_ Visualization
_ Patternrecognition
_ Neural networks
• Based on applications adopted
o Finance
o Telecommunicationo DNA
o Stock markets
o E-mail and so on

19.Describe challengesto data miningregarding data miningmethodology and


userinteractionissues.
• Mining different kindsofknowledge in databases
• Interactive mining ofknowledge atmultiple levels of abstraction
• Incorporation ofbackground knowledge
• Datamining query languages and ad hoc data mining
• Presentation and visualization of data mining results
• Handling noisy or incomplete data
• Pattern evaluation

20.Describechallengesto data miningregarding performance issues.


• Efficiency and scalability of data miningalgorithms
• Parallel, distributed,andincrementalminingalgorithms

21.Describeissues relating to the diversity ofdatabasetypes.


• Handling ofrelational and complex types of data
• Mining information from heterogeneousdatabases and globalinformationSystems

22.What ismeant by pattern?


Patternrepresentsknowledge if it iseasily understood by humans; valid on testdata withsome
degree of certainty;andpotentially useful, novel,or validates a hunchabout whichthe used was curious.
Measures of patterninterestingness, either objective orsubjective, can be used to guide thediscovery
process.

23.Howis a data warehousedifferent from a database?


Data warehouse is a repository of multipleheterogeneous data sources, organizedunder a
unifiedschema at a singlesite in order tofacilitatemanagement decision-making.Database consists of a
collection of interrelateddata.
UNIT II

1. Define Association RuleMining.


Associationrule mining searches for interesting relationships among items ina given data set.

2. When we can say theassociationrules are interesting?


Associationrules are consideredinterestingif they satisfy both a minimumsupport threshold and
a minimumconfidence threshold. Users or domain expertscan set such thresholds.

3. Explain Associationrule in mathematicalnotations.


Let I-{i1,i2,…..,im} bea set of items
Let D, the taskrelevantdata be asetofdatabase transaction Tis a set ofitems
An association ruleis an implication ofthe formA=>B where A C I, B C I,and An B=f. The rule
A=>Bcontains in the transaction set Dwith support s,
where s is thepercentage of transactions in D thatcontainAUB. The Rule A=> Bhas confidence c inthe
transaction set D ifc is the percentage of transactions in Dcontaining A that also contain B.

4. Define support and confidence in Association rule mining.


Support S isthe percentage of transactions in Dthat contain AUB.
Confidence c is thepercentage of transactions in Dcontaining A that alsocontainB.
Support ( A=>B)=P(AUB)Confidence(A=>B)=P(B/A)

5. Howareassociation rules mined from largedatabases?


• I step: Find all frequent itemsets:
• II step: Generatestrong association rules from frequentitemsets
6. Describethe differentclassifications of Association rulemining.
• Based on types of values handledinthe Rule
i. Boolean associationrule
ii. Quantitativeassociation rule
• Based on the dimensions of data involved
i. Single dimensionalassociationrule
ii. Multidimensionalassociationrule
• Based on the levels of abstraction involved
i. Multilevel associationrule
ii. Single level association rule
• Based on various extensions
i. Correlationanalysis
ii. Mining max patterns
7. What is the purpose of Apriori Algorithm?
Apriori algorithmis an influential algorithmformining frequent itemsets forBoolean
associationrules. The nameofthe algorithmis based on the fact that thealgorithmuses priorknowledge
of frequentitem set properties.

8. Define anti-monotone property.


If a set cannot pass a test, all of itssupersets willfail the sametestas well.

9. Howto generate association rules from frequent itemsets?


Associationrules can begenerated as follows
For each frequentitemset1, generate all non empty subsets of1.For every nonempty subsets s of 1,
outputtherule “S=>(1-s)”ifSupport count(1)
=min_conf,Support_count(s)
Wheremin_conf is the minimumconfidencethreshold.

10. Givefewtechniques to improve the efficiency of Apriorialgorithm.


• Hash based technique
• Transaction Reduction
• Portioning
• Sampling
• Dynamic itemcounting

11. What are the thingssuffering the performance of Apriori candidategenerationtechnique.


• Need to generate a hugenumber of candidate sets
• Need to repeatedlyscan the scanthe database and check a large set ofcandidatesby pattern
matching

12. Describe the method of generating frequent item sets without candidategeneration.
Frequent-patterngrowth(or FP Growth) adopts divide-and-conquerstrategy.
Steps:
Compress the database representingfrequent items into a frequent pattern treeor FP tree
Divide the compresseddatabase intoa set of conditionaldatabaseMine each
conditionaldatabaseseparately

13. Define Icebergquery.


It computes an aggregatefunctionover an attribute or set of attributes inorder to findaggregate
valuesabove some specified threshold.
Given relation Rwith attributes a1,a2,…..,an and b, and an aggregate function,
agg_f, an iceberg query is theformSelect R.a1,R.a2,…..R.an,agg_f(R,b)Fromrelation R
Group by R.a1,R.a2,….,R.an
Having agg_f(R.b)>=threshold

14. Mentionfewapproaches to miningMultilevelAssociationRules


• Uniform minimumsupport for all levels(oruniformsupport)
• Using reduced minimum support atlower levels(orreduced support)
• Level-by-level independent
• Level-crossfiltering bysingleitem
• Level-crossfiltering byk-itemset

15. What are multidimensionalassociationrules?


Associationrules that involve two or moredimensions or predicates
• Interdimensionassociationrule:Multidimensionalassociation rule withnorepeated predicate or
dimension
• Hybrid-dimension association rule:Multidimensional associationrule withmultiple occurrences of
some predicates or dimensions.

16. Define constraint-BasedAssociationMining.


Mining is performed under the guidance of various kinds of constraintsprovided bythe user.
The constraintsincludethe following
• Knowledge type constraints
• Data constraints
• Dimension/levelconstraints
• Interestingness constraints
• Rule constraints.

17. Define the conceptof classification.


Twostep process
• A model is built describing a predefined set of data classesor concepts.The model is constructed by
analyzing databasetuples described byattributes.
• The modelis usedfor classification.

18. What isDecision tree?


A decisiontree isa flow chartliketree structures, where each internal
node denotes a test on an attribute, each branch represents anoutcome of the test,and leaf
nodesrepresent classes or classdistributions. The top most in a tree istheroot node.

19. What isAttribute Selection Measure?


The information Gain measure is used to select the test attribute at each nodein the decision
tree. Such a measure is referred toas an attribute selectionmeasureor a measure of the goodness of
split.

20. Describe Tree pruning methods.


When a decision tree is built,many of the branches will reflectanomalies inthe training data
due tonoise or outlier.Treepruningmethods addressthis
problemof over fitting the data.Approaches:
• Pre pruning
• Post pruning

21. DefinePre Pruning


A tree is pruned by halting its construction early.Upon halting, the nodebecomes a leaf. The leaf may
hold themost frequent class among the subsetsamples.

22. DefinePost Pruning.


Post pruning removes branches froma “Fully grown” tree. Atree node is
pruned by removing its branches.Eg: Cost Complexity Algorithm

23. What ismeant by Pattern?


Patternrepresents the knowledge.

24. Define the conceptof prediction.


Predictioncan be viewed as the constructionand use of a model to assess theclass of an
unlabeledsample or to assessthe value or value ranges of an attributethat a given sample is likely to
have.
Unit III

1.Define Clustering?
Clustering is a process ofgrouping the physical or conceptualdata object intoclusters.

2. What do you mean by Cluster Analysis?


A cluster analysis is theprocess of analyzingthe various clusters to organizethedifferentobjects into
meaningful anddescriptive objects.

3. What arethe fields inwhich clusteringtechniques are used?


• Clusteringis used in biology to develop new plants and animal
taxonomies.
• Clusteringis used in business to enablemarketers to developnew
distinctgroups oftheir customers and characterize thecustomer group on basisof purchasing.
• Clusteringis used intheidentification ofgroups of automobiles
Insurancepolicycustomer.
• Clusteringis used intheidentification ofgroups of house in a city on
the basis ofhouse type, theircost and geographical location.
• Clusteringis used toclassify the document on the web forinformation
discovery.

4.What are the requirements of cluster analysis?


The basic requirements of cluster analysis are
• Dealing with differenttypes of attributes.
• Dealing with noisy data.
• Constraints on clustering.
• Dealing with arbitrary shapes.
• High dimensionality
• Ordering of input data
• Interpretability and usability
• Determining input parameter and
• Scalability

5.What are the different types of data used forcluster analysis?


The different types of data used for cluster analysis are interval scaled, binary,nominal,ordinal
and ratio scaled data.

6. What are interval scaled variables?


Intervalscaled variablesare continuousmeasurements of linear scale.
For example, height andweight, weather temperature or coordinatesforany cluster.These
measurements can be calculated using Euclideandistance or Minkowskidistance.
7. Define Binary variables? And what are the two types of binary variables?
Binary variables are understood bytwo states 0 and 1, when state is 0, variable isabsent
andwhen stateis 1, variable is present. There are twotypes of binaryvariables,symmetric and
asymmetricbinaryvariables.Symmetric variables are those variables thathave same state values
andweights. Asymmetricvariablesare those variables thathavenot same state valuesand weights.

8. Define nominal, ordinal and ratio scaled variables?


A nominalvariableis a generalization ofthe binary variable. Nominal variablehas more than
two states,For example, a nominalvariable,colorconsists of four states,red, green, yellow, or black. In
Nominal variablesthe totalnumber ofstates is N and it isdenoted by letters, symbols or integers.
An ordinal variablealso has more thantwo statesbut all thesestates are orderedin a meaningful
sequence.
A ratio scaled variablemakespositivemeasurements on a non-linearscale, suchas exponential scale,
using the formula
AeBt or Ae-Bt
Where A and B are constants.

9. What do u mean by partitioningmethod?


In partitioningmethod a partitioning algorithmarrangesall theobjectsintovarious partitions,
where the totalnumber ofpartitionsis less than thetotal number ofobjects. Here each partitionrepresents
a cluster.The two types ofpartitioningmethod arek-means and k-medoids.

10. DefineCLARA and CLARANS?


Clustering inLARgeApplications is called as CLARA. The efficiency ofCLARA depends upon
the size of the representative data set.CLARA does not workproperlyif any representative dataset
fromthe selected representative datasets does notfind best k-medoids.
To recover thisdrawback a new algorithm, Clustering Large Applicationsbasedupon RANdomized
search (CLARANS) is introduced. The CLARANS works like
CLARA, the only difference betweenCLARA and CLARANS is the clusteringprocessthatis doneafter
selecting the representativedata sets.

11. What is Hierarchical method?


Hierarchical methodgroups allthe objects intoatree of clusters that are arrangedin a
hierarchicalorder.Thismethod works on bottom-up or top-down approaches.

12. DifferentiateAgglomerativeand Divisive Hierarchical Clustering?


AgglomerativeHierarchicalclusteringmethod works on thebottom-up approach.
In Agglomerative hierarchical method, each object createsitsown clusters. The singleClustersare
merged tomake larger clusters and the process ofmerging continues until all the singular clusters are
merged into one big clusterthatconsists of all theobjects.Divisive Hierarchicalclusteringmethod works
on the top-down approach.In thismethod all theobjects are arrangedwithin a big singular cluster
andthe largecluster iscontinuously dividedinto smallerclustersuntil each cluster has a single object.

13. What is CURE?


ClusteringUsing Representativesis called as CURE. The clustering algorithmsgenerallywork
on sphericaland similarsizeclusters.CUREovercomes the problemofspherical and similar size cluster
andis more robust with respect to outliers.
14. DefineChameleon method?
Chameleon is another hierarchicalclusteringmethod that usesdynamicmodeling.Chameleon is
introducedto recover the drawbacks of CUREmethod. In this method twoclusters are merged, if the
interconnectivitybetween two clusters is greater than theinterconnectivitybetween the objects within a
cluster.

15. DefineDensity based method?


Density basedmethod deals with arbitrary shaped clusters. In density-basedmethod, clusters
areformed on the basis of the region where the density of the objectsishigh.

16. What isa DBSCAN?


Density BasedSpatialClustering ofApplicationNoise is called as DBSCAN.DBSCAN isa
density based clusteringmethod that converts the high-density objectsregionsintoclusters
witharbitraryshapes and sizes.DBSCAN defines the cluster asamaximal set of density
connectedpoints.

17. What do you mean by Grid Based Method?


In this method objects arerepresented by the multi resolutiongrid datastructure.
All the objects are quantizedinto a finitenumber of cells andthe collection of cells buildthe grid
structure of objects. The clusteringoperations are performed on that gridstructure.Thismethod is
widely usedbecause its processing time is very fast andthatisindependent of numberofobjects.

18. What isa STING?


Statistical Information Grid is calledas STING; it is a grid basedmulti resolutionclustering method. In
STINGmethod, allthe objects are containedinto rectangular cells,these cells are keptintovarious levels
of resolutions and these levels arearranged inahierarchical structure.

19. DefineWave Cluster?


It is a gridbasedmulti resolution clustering method. In this method all theobjectsare represented by a
multidimensional grid structure and a wavelet transformation is
applied for finding the dense region.Each grid cellcontainstheinformation of the group of objects that
map into a cell. A wavelet transformation is a process of signaling thatproduces the signal of various
frequency sub bands.

20. What isModel based method?


For optimizing a fit between a givendata set anda mathematicalmodel basedmethods are used. This
method uses an assumption thatthe data are distributed byprobability distributions.There are two basic
approaches in this method that are
1. Statistical Approach
2. Neural Network Approach.

21. What isthe use of Regression?


Regression can be used to solvethe classificationproblems but itcan also be used
for applications such as forecasting.Regression can be performed using many differenttypes of
techniques; in actuallyregressiontakes aset of data and fits the data toaformula.

22. What are the reasons for not usingthe linear regressionmodel to estimate theoutput data?
There are many reasonsfor that, Oneis that the data do not fita linear model,It ispossible howeverthat
the data generally do actuallyrepresent a linearmodel, but thelinearmodelgenerated is poor because
noise or outliers exist in the data.
Noise is erroneous data and outliers are data values thatare exceptions to the usual andexpected data.

23. What are the two approaches used by regression to perform classification?
Regression can be used to performclassification using the following approaches
1. Division: The data aredividedintoregions based on class.
2. Prediction:Formulas are generated to predict the output class value.
24. What do u mean by logistic regression?
Instead of fitting a data into a straightlinelogistic regressionuses a logisticcurve.The formulafor the
univariate logistic curveis
P= e (C0+C1X1)1+e (C0+C1X1)
The logistic curve gives a value between 0 and 1so it can be interpretedas theprobability ofclass
membership.

25. What isTime SeriesAnalysis?


A timeseries is a set of attribute values over a period of time.Time Series
Analysis may be viewedas findingpatterns inthe data and predicting futurevalues.

26. What are the various detected patterns?


Detected patternsmay include:
¨Trends : It may be viewed as systematic non-repetitivechanges to the values overtime.
¨Cycles :The observedbehavioris cyclic.
¨Seasonal :The detected patterns may be based on time of year or month or day.
¨Outliers : To assist inpattern detection , techniquesmay be needed toremove orreduce the impact of
outliers.

27. What isSmoothing?


Smoothingis an approach thatis used to remove the nonsystematic behaviors
foundin time series. Itusually takes the formof finding moving averagesof attributevalues. It isused to
filter out noise and outliers.

28. Givethe formulafor Pearson’s r


One standard formula tomeasurecorrelationis
thecorrelationcoefficientr,sometimescalledPearson‟sr.Giventwotimeseries,XandYwithmeansX‟andY‟
,each with n elements, the formulaforr is
S(xi–X‟)(yi–Y‟)
(S(xi–X‟)2S(yi–Y‟)2)1/2

29. What is Autoregression?


Autoregression is a method ofpredicting a futuretime series value by looking atprevious values. Given
a time series X= (x1,x2,….xn) a future value, xn+1, can befound
using
x n+1 = x + j nx n + j n-1x n-1 +……+ e n+1
Here e n+1 represents arandomerror, at time n+1.In addition,each element in thetimeseries can
beviewed as a combination ofa random error anda linear combination ofprevious values.
UNIT-IV

1.Define data warehouse?


A data warehouse is a repository ofmultiple heterogeneous data sourcesorganized under a
unified schema at a single siteto facilitate managementdecisionmaking .
(or)
A data warehouse is a subject-oriented,time-variant and nonvolatile
collectionofdatainsupportofmanagement‟sdecision-makingprocess.

2.What are operationaldatabases?


Organizationsmaintain large databasethat are updated by daily transactions arecalled
operationaldatabases.

3.Define OLTP?
If an on-line operational database systems is used for efficient retrieval, efficientstorageand
managementof large amounts of data, then the systemis said to be on-linetransaction processing.

4.Define OLAP?
Data warehousesystems serves users(or) knowledge workers in the role ofdataanalysis
anddecision-making. Suchsystems can organize and present data in variousformats. These systems are
known as on-lineanalyticalprocessingsystems.

5.Howa database design is represented in OLTPsystems?


Entity-relationmodel

6. Howa database design is represented in OLAPsystems?


Star schemaSnowflakeschema
Fact constellationschema

7.Write short notes onmultidimensional datamodel?


Data warehouses and OLTPtools arebased on a multidimensionaldatamodel.
This model is used for the design of corporate data warehouses and department datamarts. This model
contains a Starschema, Snowflake schema and Fact constellationschemas. The core of the
multidimensionalmodel is the datacube.

8.Define data cube?


It consists ofa large set offacts (or) measures and a numberof dimensions.

9.What are facts?


Facts are numericalmeasures. Factscan also beconsidered asquantities by whichwe can analyze
the relationshipbetweendimensions.
10.What are dimensions?
Dimensions are theentities(or)perspectives withrespect to anorganization forkeeping records
and arehierarchical in nature.

11.Define dimensiontable?
A dimension tableis used fordescribing the dimension.
(e.g.) A dimension table for item may containtheattributesitem_name,brand and type.

12.Define fact table?


Fact table contains the name of facts (or) measures as well as keys to eachof therelated
dimensional tables.

13.What are lattice ofcuboids?


In data warehousingresearchliterature, a cube can also be called as cuboids. Fordifferent (or)
set of dimensions, we canconstructa lattice ofcuboids, eachshowing thedata at different level. The
lattice ofcuboids is also referredtoas data cube.

14.What isapex cuboid?


The 0-D cuboid which holds the highestlevel ofsummarization is calledthe apexcuboid.
Theapex cuboid is typicallydenoted by all.

15.List outthe components of starschema?


A large centraltable(fact table) containing the bulk of datawith noredundancy.
_ A set of smallerattendanttables (dimension tables), one for eachdimension.

16.What is snowflake schema?


The snowflakeschema is a variant of the star schemamodel, where somedimension tables are
normalized thereby furthersplitting the tables in to additionaltables.

17.List outthe components of fact constellation schema?


This requiresmultiple facttables to share dimensiontables.This kind of schemacan be viewed as
a collection of stars and henceit is knownas galaxy schema (or) factconstellationschema.

18.Point out the major differencebetween the star schemaand the snowflakeschema?
The dimension table of the snowflake schemamodel may be kept in normalizedformto reduce
redundancies. Such a tableis easy to maintain and saves storage space.

19.Which is popular in the data warehouse design, star schema model (or)snowflake schema
model?
Star schema model, because the snowflake structure canreduce the effectivenessand more joins
will be needed to execute a query.

20.Define concepthierarchy?
A concept hierarchydefines a sequence of mappings froma set of low-levelconcepts tohigher-
level concepts.
21.Define total order?
If the attributes of a dimension whichforms a concept hierarchy such as
“street<city<province_or_state<country”, thenit is said to be total order.
Country Province orstate City
Street
Fig: Partialorder for location
22.Define partialorder?
If the attributes of a dimension whichforms a latticesuch as“day<{month<quarter; week}<year, then it
is said to be partial order.23.Define schemahierarchy?
A concept hierarchythat is a total (or) partialorderamong attributes in adatabaseschema is called a
schema hierarchy.
24.List outthe OLAP operations in multidimensional data model?
_ Roll-up
_ Drill-down
_ Slice and dice
_ Pivot (or) rotate
25.What isroll-up operation?
The roll-upoperation isalso called drill-upoperation which performs aggregationon a data cube either
byclimbing up a concept hierarchyfor a dimension (or) bydimension reduction.
26.What is drill-down operation?
Drill-downis the reverse of roll-up operation.It navigates fromless detailed datato more detailed
data.Drill-downoperation can be taken place by stepping down aconcept hierarchy for a dimension.
27.What is slice operation?
The slice operationperforms a selection on one dimension ofthe cube resulting ina sub cube.
28.What isdice operation?
The dice operationdefines a sub cube by performing a selection on two(or) moredimensions.

29.What ispivot operation?


This is a visualizationoperation thatrotates thedata axes inan alternativepresentationof the data.

30.List outthe views in the designof a data warehouse?


_ Top-down view
_ Data source view
_ Data warehouse view
_ Business query view

31.What are the methods for developing large software systems?


_ Waterfall method
_ Spiral method

32.Howtheoperation is performed in waterfall method?


The waterfallmethodperforms a structuredand systematicanalysis at each step
before proceeding to thenext, whichis like a waterfall fallingfromone step to the next.

33.Howtheoperation is performed in spiralmethod?


The spiral method involves the rapid generationofincreasinglyfunctional
systems, with short intervalsbetween successivereleases. This is considered as a goodchoice for the
data warehousedevelopmentespeciallyfor data marts,because theturnaround timeis short,
modificationscan be done quickly and new designs andtechnologies can be adapted in a timelymanner.
34.List outthe steps of the data warehouse design process?
_ Choose a business process to model.
_ Choose the grain of the business process
_ Choose the dimensions that willapply to eachfact table record.
_ Choose the measures that will populate each fact table record.

35.Define ROLAP?
The ROLAPmodel is anextended relational DBMS that mapsoperations onmultidimensionaldata
tostandardrelational operations.

36.DefineMOLAP?
The MOLAP model is a
specialpurposeserverthatdirectlyimplementsmultidimensionaldataandoperations.

37.Define HOLAP?
The hybridOLAP approach combinesROLAP and MOLAP technology,benefiting fromthe greater
scalability of ROLAP and the fastercomputation of
MOLAP,(i.e.) a HOLAP server may allow large volumes of detail data tobe storedinarelational
database,while aggregations are kept in a separate MOLAP store.

38.What is enterprise warehouse?


Anenterprisewarehousecollectsalltheinformation‟saboutsubjectsspanningthe
entire organization.It providescorporate-widedataintegration, usually fromone (or)more operational
systems (or) externalinformationproviders. It contains detailed dataaswell as summarized data and can
range in size fromafewgiga bytesto hundreds of gigabytes, tera bytes (or)beyond.

39.What isdata mart?


Data mart isa database that contains a subset of data presentina data warehouse.
Data marts are createdto structure the datain a datawarehouseaccordingto issues suchas hardware
platforms and access controlstrategies.We can divide a data warehouse intodata martsafter the data
warehouse has been created. Data marts are usuallyimplementedon low-cost departmental servers that
are UNIX (or) windows/NT based.

40.What are dependent and independent data marts?


Dependent datamarts are sourced directly fromenterprise data warehouses.Independent data marts are
data captured fromone (or) more operational systems (or)external information providers(or) data
generated locally with in particulardepartment(or) geographic area.

41.What isvirtual warehouse?


A virtual warehouse is a set of viewsover operational databases. For efficient
query processing, only some of the possible summary views may bematerialized. Avirtualwarehouse
is easy to build but requires excesscapability on operationaldatabaseservers.

42.Define indexing?
Indexing isa technique,which is used forefficient dataretrieval (or) accessing
data inafastermanner.When a table grows in volume, the indexes also increase insizerequiring more
storage.
43.What are the typesof indexing?
_ B-Treeindexing
_ Bit map indexing
_ Join indexing

44.Define metadata?
Metadata isused in data warehouse is usedfor describing data about data.
(i.e.) metadata are the datathatdefine warehouse objects. Metadata are createdforthedata names and
definitions of the given warehouse.

45.Define VLDB?
Very LargeData Base. If a databasewhose size is greaterthan 100GB, thenthe database is saidto be
very largedatabase.
UNIT – V

1.What are the classifications of tools for datamining?


• Commercial Tools
• Public domain Tools
• Research prototypes

2.What are commercialtools?


Commercial tools can bedefined as thefollowingproducts and usuallyareassociatedwiththeconsulting
activityby the samecompany:
1. „IntelligentMiner‟fromIBM
2. „SAS‟SystemfromSASInstitute
3. „Thought‟fromRightInformationSystems.etc

3. What are Publicdomain Tools?


Public domain Tools arelargely freeware with justregistration fees:
‟Brute‟fromUniversityofWashington.„MC++‟fromStanforduniversity,Stanford,
California.

4. What are Research prototypes?


Some of the researchproductsmay find their way into commercial
market:„DBMiner‟fromSimonFraserUniversity,BritishColumbia,„MiningKernelSystem‟fromUniversi
tyofUlster,NorthIreland.

5.What is thedifferencebetween generic single-task toolsand generic multi-tasktools?


Generic single-tasktools generally useneuralnetworks or decisiontrees.They coveronly the datamining
part and require extensive pre-processingandpostprocessing
steps.
Generic multi-task tools offermodules for pre-processingandpostprocessingsteps and alsooffer a broad
selectionof several populardata miningalgorithms as clustering.

6. What are the areas inwhich data warehouses are usedin present and in future?
The potential subjectareas in whichdata ware housesmay bedeveloped atpresentandalso in future are
1.Census data:
The registrar general and census commissioner ofIndia decennially
compilesinformation of all individuals,villages, population groups, etc. Thisinformationis wide
ranging such as theindividual slip. A compilation ofinformation of individualhouseholds, ofwhich a
database of 5%sample is maintained for analysis. Adatawarehouse can be built fromthis database
uponwhich OLAP techniquescan be applied,Data mining also can be performed for analysis and
knowledge discovery
2.Prices ofEssentialCommodities
The ministry of food and civil supplies, Government of India complies
daily data for about 300observationcenters in the entire country on the prices ofessential
commoditiessuch as rice,edible oiletc, Adata warehouse canbe builtfor this dataand OLAP techniques
can be appliedfor its analysis

7. What are the other areas for Datawarehousing and data mining?
• Agriculture
• Rural development
• Health
• Planning
• Education
• Commerce and Trade

8. Specify some of thesectors in which data warehousing and data mining are used?
• Tourism
• ProgramImplementation
• Revenue
• EconomicAffairs
• Audit and Accounts

9. Describethe use of DBMiner.


Used to performdataminingfunctions,includingcharacterization,association,classification,
predictionand clustering.

10. Applications of DBMiner.


The DBMiner systemcan be used asa general-purpose online analyticalmining system for both OLAP
and data miningin relationaldatabase anddatawarehouses.
Used in mediumto largerelational databaseswithfast response time.

11. Give some data mining tools.


DBMinerGeoMiner Multimedia minerWeblogMiner

12. Mentionsome ofthe application areas of data mining


DNA analysisFinancial dataanalysisRetailIndustry
Telecommunication industryMarketanalysis
Bankingindustry and Health care analysis.

13. Differentiatedata query and knowledge query


A data query finds concrete datastored in a database and corresponds toabasic retrievalstatement in a
databasesystem.
A knowledge query finds rules, patterns andother kinds of knowledge in adatabase andcorresponds to
querying databaseknowledge includingdeductionrules, integrityconstraints, generalized rules, frequent
patterns andother regularities.

14.Differentiatedirectquery answering and intelligent query answering.Direct queryanswering


means that a query answers by returningexactlywhatis beingasked.
Intelligentquery answering consists ofanalyzingthe intent ofquery andproviding generalized,
neighborhood,or associatedinformationrelevant tothequery.

15. Define visual datamining


Discovers implicit anduseful knowledge fromlarge datasets using dataand/or knowledge visualization
techniques.
Integration ofdata visualization anddata mining.

16. What does audio data miningmean?


Uses audio signalsto indicate patterns ofdata or the features ofdata miningresults.
Patterns aretransformed into sound andmusic.
To identify interestingorunusualpatterns by listening pitches,rhythms, tuneand melody.
Steps involved in DNA analysis
Semanticintegration of heterogeneous,distributedgenome databasesSimilarity search and
comparisonamong DNA sequencesAssociationanalysis: Identification ofco-occuring gene sequences
Path analysis: Linking genes to differentstagesof disease developmentVisualization tools andgenetic
data analysis

17.What are the factors involvedwhile choosing data mining system?


Data types Systemissues Data sources
Data Mining functions and methodologies
Coupling data mining with database and/or datawarehouse systemsScalability
Visualization tools
Data mining query language and graphicaluser interface.

18. DefineDMQL
Data Mining Query Language
It specifies clauses and syntaxes for performingdifferenttypes of data miningtasks for example data
classification, data clustering and miningassociationrules. Alsoit uses SQl-like syntaxesto mine
databases.

19. Define text mining


Extractionofmeaningfulinformation fromlarge amounts free format textualdata.
Useful in Artificial intelligence and patternmatching
Also known as textmining, knowledge discoveryfromtext, or contentanalysis.

20. What does web mining mean


Technique to process informationavailable on web and search for useful data.To discoverweb pages,
text documents , multimediafiles,images, andothertypes of resources fromweb.
Used in several fields such as E-commerce,information filtering, frauddetection and educationand
research.

21.Define spatial datamining.


Extractingundiscoveredand implied spatial information.Spatial data:Data that is associatedwith a
location
Used in several fields such as geography, geology, medicalimaging etc.

22. Explain multimediadata mining.


Mines largedata bases.
Does not retrieveany specificinformation from multimedia databasesDerive newrelationships , trends,
and patterns fromstored multimedia datamining.
Used in medicaldiagnosis, stock markets,Animation industry,Airlineindustry,
Trafficmanagementsystems,Surveillance systems etc.
16 MARKSQUESTIONS AND ANSWERS

UNIT-I

1. Explain the evolutionof Database technology?


_ Data collection and Database creation
_ Database managementsystems
_ Advanced database systems
_ Data warehousing andData Mining
_ Web-basedDatabasesystems
_ New generation of Integratedinformationsystems
2.Explain the steps of knowledgediscovery in databases?
_ Data cleaning
_ Data integration
_ Data selection
_ Data transformation
_ Data mining
_ Patternevaluation
_ Knowledgepresentation
3. Explain the architecture of datamining system?
_ Database, datawarehouse, or otherinformationrepository
_ Databaseor data warehouseserver
_ Knowledge base
_ Data mining engine
_ Patternevaluationmodule
_ Graphicaluser interface
4.Explain various tasks in data mining?
(Or)
Explain the taxonomyofdata miningtasks?
_ Predictive modeling
• Classification
• Regression
• Time series analysis
_ Descriptivemodeling
• Clustering
• Summarization
• Association rules
• Sequence discovery
5.Explain various techniques in data mining?
_ Statistics (or) Statisticalperspectives
_ Point estimation
• Data summarization
• Bayesian techniques
• Hypothesis testing
• Correlation
_ Regression
_ Machinelearning
_ Decisiontrees
_ Hidden markov models
_ Artificial neural networks
_ Genetic algorithms
_ Meta learning

UNIT-II

6.Explain the issues regardingclassification andprediction?


_ Preparingthe data forclassification and predictiono Data cleaning
o Relevance analysiso Data transformation
_ Comparingclassificationmethodso Predictive accuracy
o Speed
o Robustnesso Scalability
o Interpretability
7.Explainclassification by Decision treeinduction?
_ Decisiontree induction
_ Attributeselectionmeasure.
_ Tree pruning
_ Extracting classification rules from decision trees
8.Write short notes onpatterns?
_ Pattern definition
_ Objective measures
_ Subjective measures
_ Can a data mining systemgenerate all of theinterestingpatterns?
_ Can a data mining systemgenerate only interestingpatterns?
9.Explain mining single–dimensionalBoolean associatedrules from transactionaldatabases?
_ The apriorialgorithm:Finding frequentitemsets usingcandidategeneration
_ Mining frequent itemsets without candidate generation
10.Explain apriori algorithm?
_ Apriori property
_ Join steps
_ Prune step
_ Example
_ Algorithm

11.Explain howthe efficiency of apriori is improved?


_ Hash-based technique (hashing item set counts)
_ Transaction reduction (reducingthe number of transactionsscanned in future iteration)
_ Partitioning(Partitioning the data to find candidate itemsets)
_ Sampling(mining on a subset of the given data)
_ Dynamic itemset counting (addingcandidateitemsets atdifferentpoints duringascan)
12.Explain frequent item set withoutcandidatewithout candidate generation?
_ Frequent patternsgrowth (or) FP-growth
_ Frequent patterntree(or) FP-tree
_ Algorithm
13. Explain mining Multi-dimensional Boolean association rules from transactiondatabases?
_ Multi-dimensional(or) Multilevel associationrules
_ Approaches to mining Multilevel association rules
• Using uniform minimum support for all levels
• Using reduced minimum support atlower levels
o Level-by-levelindependent
o Level-cross filtering bysingle
o Level- crossfilteringby k-itemset
_ Checkingfor redundant Multilevel associationrules
14.Explainconstraint-based association mining?
_ Knowledge type constraints
_ Data constraints
_ Dimension/levelconstraints
_ Interestingness constraints
_ Rule constraints
_ Metarule-Guidedmining ofassociation ofassociationrules
_ Mining guided by additionalrule constraints

Unit–III

15.Explainregressionin predictive modeling?


_ Regression definition
_ Linear regression
_ Multiple regression
_ Non-linear regression
_ Other regression models
16.Explainstatisticalperspective indata mining?
_ Point estimation
_ Data summarization
_ Bayesian techniques
_ Hypothesis testing
_ Regression
_ Correlation
17. ExplainBayesian classification.
_ Bayesian theorem
_ Naïve Bayesian classification
_ Bayesianbelief networks
_ Bayesian learning
18. Discuss the requirements of clustering indata mining.
_ Scalability
_ Abilitytodeal with different typesof attributes
_ Discoveryofclusters witharbitrary shape
_ Minimalrequirements fordomain knowledge to determineinput parameters
_ Abilitytodeal with noisy data
_ Insensitivity to theorder of input records
_ High dimensionality
_ Interpretability and usability
_ Intervalscaledvariables
_ Binary variables
o Symmetric binary variableso Asymmetricbinary variables
_ Nominalvariables
_ Ordinal variables
_ Ratio-scaled variables
20. Explain the partitioningmethod of clustering.
K-means clusteringK-medoids clustering
21. ExplainVisualization in data mining.
Various forms of visualizingthe discovered patterns
_ Rules
_ Table
_ Crosstab
_ Pie chart
_ Bar chart
_ Decisiontree
_ Data cube
_ Histogram
_ Quantile plots
_ q-q plots
_ Scatterplots
_ Loess curves

UNIT IV

22. Discuss the components of datawarehouse.


_ Subject-oriented
_ Integrated
_ Time-Variant
_ Non-volatile
23. List outthe differencesbetween OLTPand OLAP.
_ Users andsystemorientation
_ Data contents
_ Databasedesign
_ View
_ Access patterns
24.Discuss the various schematicrepresentations in multidimensional model.
_ Star schema
_ Snow flakeschema
_ Fact constellationschema
25. Explain the OLAP operations I multidimensional model.
_ Roll-up
_ Drill-down
_ Slice and dice
_ Pivot or rotate
26. Explain the design and construction of a datawarehouse.
_ Design of a data warehouse
• Top-down view
• Data source view
• Data warehouse view
• Business query view
_ Process of data warehouse design
27.Expalin the three-tier data warehousearchitecture.
_ Warehouse database server(Bottomtier)
_ OLAP server(middle tier)
_ Client(toptier)
28. Explain indexing.
_ Definition
_ B-Treeindexing
_ Bit-map indexing
_ Join indexing
29.Write notes on metadata repository.
_ Definition
_ Structure ofthe datawarehouse
_ Operationalmetadata
_ Algorithms used for summarization
_ Mapping fromoperationalenvironment to data warehouse
_ Data related to systemperformance
_ Business metadata
30. Write short notes onVLDB.
_ Definition
_ Challenge related to database technologies
_ Issuesin VLDB

UNIT V

31.Explain data miningapplications for Biomedical andDNA data analysis.


_ Semanticintegrationof heterogeneous, distributed genome databases
_ Similaritysearchand comparisonamong DNA sequences
_ Associationanalysis.
_ Path analysis
_ Visualization tools andgenetic data analysis.
32. Explain data miningapplications fro financial data analysis.
_ Loan payment prediction and customer creditpolicy analysis.
_ Classification and clustering of customers fro targeted marketing.
_ Detectionof money laundering and other financial crimes.
33. Explain data miningapplications for retail industry.
_ Multidimensional analysis of sales,customers, products, time and region.
_ Analysis of the effectiveness of sales campaigns.
_ Customerretention-analysis of customer loyalty.
_ Purchase recommendation and cross-reference of items.
34. Explain data miningapplications for Telecommunication industry.
_ Multidimensional analysis of telecommunication data.
_ Fraudulent pattern analysis and theidentification of unusual patterns.
_ Multidimensional association and sequentialpatternanalysis
_ Use of visualization tools in telecommunication data analysis.
35. ExplainDBMiner tool in datamining.
_ Systemarchitecture
_ Input andOutput
_ Data mining tasks supported by the system
_ Support of task and methodselection
_ Support of the KDD process
_ Main applications
_ Currentstatus
36. Explain howdata mining is usedin healthcare analysis.
_ Health care data mining and its aims
_ Health care data miningtechnique
_ Segmentingpatientsinto groups
_ Identifyingpatientsinto groups
_ Identifyingpatients with recurring healthproblems
_ Relation between disease and symptoms
_ Curbing thetreatment costs
_ Predictingmedical diagnosis
_ Medical research
_ Hospital administration
_ Applications of data mining in health care
_ Conclusion
37. Explain howdata mining is usedin banking industry.
_ Data collected by data mining in banking
_ Banking data miningtools
_ Mining customer dataofbank
_ Mining forprediction and forecasting
_ Mining for frauddetection
_ Mining forcross selling bank services
_ Mining for identifyingcustomer preferences
_ Applications of data mining in banking
_ Conclusion
38. Explain the types of datamining.
_ Audio data mining
_ Video data mining
_ Image data mining
_ Scientific and statistical data mining

You might also like