
DATA MINING VIVA QUESTIONS

1. Perform Data Preprocessing using Weka

Data preprocessing analyses the data set and provides the characteristics of the data set, such as the relation name, attributes, and instances, i.e., the number of data entries.

What is Data Cleaning?

A: Data cleaning is the process of detecting and correcting (or removing) incomplete, noisy, and inconsistent data in a data set.

Causes of dirty data:
- Missing values
- Noisy data (human/machine errors)
- Inconsistent data

Data cleaning tasks:
- Handle missing values
- Identify outliers and smooth out noisy data
- Correct inconsistent data
Define metadata.
A: Metadata is simply defined as data about data. In other words, we can say that metadata is the summarized data that leads us to the detailed data.
Explain data mart.
A: A data mart contains a subset of organization-wide data. This subset of data is valuable to specific groups of an organization. In other words, we can say that a data mart contains data specific to a particular group.

2. Preprocessing in Data Mining:

Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:


1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this, data cleaning is done. It involves handling of missing data, noisy data, etc.
(a) Missing Data:
This situation arises when some values are missing in the data set. It can be handled in various ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the data set is quite large and multiple values are missing within a tuple.
2. Fill in the missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by the attribute mean, or by the most probable value.
(b) Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated by faulty data collection, data entry errors, etc. It can be handled in the following ways (a sketch in Python follows this list):
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments of equal size, and each segment is handled separately. All data in a segment can be replaced by its mean, or boundary values can be used to complete the task.
2. Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected, or they will fall outside the clusters.
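
As a rough illustration of (a) filling missing values by the attribute mean and (b) smoothing by bin means, here is a small Python sketch using pandas and NumPy (the column name and values are made up; the lab itself uses the Weka GUI):

import numpy as np
import pandas as pd

# Toy data set with a missing value in a numeric attribute (hypothetical values).
df = pd.DataFrame({"age": [23, 25, np.nan, 40, 22, 61, 24, 38]})

# (a) Handle missing data: fill the missing value with the attribute mean.
df["age"] = df["age"].fillna(df["age"].mean())

# (b) Smooth noisy data by binning: sort the values, split them into
#     equal-size bins, and replace every value in a bin by the bin mean.
values = np.sort(df["age"].to_numpy())
bins = np.array_split(values, 4)                              # 4 equal-size bins
smoothed = np.concatenate([np.full(len(b), b.mean()) for b in bins])
print(smoothed)
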
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0); see the sketch after this list.
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example, the attribute "city" can be converted to "country".
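
For example, min-max normalization to the 0.0 to 1.0 range can be sketched in Python as follows (the attribute values are invented for illustration):

import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # hypothetical attribute values

# Min-max normalization: scale the values into the range [0.0, 1.0].
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)                                       # [0.    0.125 0.25  0.5   1.   ]
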
3. Data Reduction:
Data mining is used to handle huge amounts of data, and analysis becomes harder as the volume of data grows. To deal with this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps of data reduction are:
1. Data Cube Aggregation:
Aggregation operations are applied to the data for the construction of the data cube.
2. Attribute Subset Selection:
Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: attributes having a p-value greater than the significance level can be discarded.
3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models.
4. Dimensionality Reduction:
This reduces the size of the data through encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis); a PCA sketch follows this list.
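
A minimal dimensionality-reduction sketch using PCA from scikit-learn rather than Weka's filters (the random data is purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # 100 instances, 10 numeric attributes

# Project the data onto the 2 principal components that capture the most variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                  # (100, 2)
print(pca.explained_variance_ratio_)    # variance captured by each component
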

3. Perform discretization of data

Discretization converts the type of an attribute from numeric to nominal, which improves the efficiency of the result.
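
In Weka this is done with the Discretize filter on the Preprocess tab. The same idea can be sketched in Python with pandas, where a numeric attribute is replaced by nominal interval labels (the values and labels below are made up):

import pandas as pd

ages = pd.Series([15, 22, 37, 45, 63, 70, 29, 51])

# Discretize the numeric attribute into three nominal interval labels.
age_group = pd.cut(ages, bins=3, labels=["low", "medium", "high"])
print(age_group.value_counts())
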
Perform Classification of data using Weka
Classification is a process of finding a model that describes and distinguishes data classes and concepts. The model can be represented in various forms, such as classification rules, decision trees, or mathematical models. In this exercise we choose the J48 tree classifier to classify the data classes.
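
J48 is Weka's implementation of the C4.5 decision tree. As a rough stand-in (not J48 itself), the sketch below builds scikit-learn's CART-based decision tree on the Iris data set to show the classification workflow:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the decision tree and evaluate it on held-out data.
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
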

4. Apriori Algorithm using Weka


Association in data mining means mining frequent patterns such as frequent itemsets, frequent subsequences, and frequent substructures. Apriori is used for generating the frequent itemsets of a data set.

Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
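
A small pure-Python sketch of this bottom-up, level-wise search (the transactions and the minimum support threshold are made-up examples, not Weka's Apriori implementation):

from itertools import combinations

transactions = [{"milk", "bread"}, {"milk", "bread", "sugar"},
                {"bread", "sugar"}, {"milk", "bread", "sugar"}]
min_support = 2          # minimum number of transactions containing the itemset

def support(itemset):
    # Count how many transactions contain every item of the itemset.
    return sum(itemset <= t for t in transactions)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [{frozenset([i]) for i in items if support({i}) >= min_support}]

# Level k: extend frequent (k-1)-itemsets by one frequent item (candidate generation),
# keep only the candidates that meet the minimum support, stop when none survive.
while frequent[-1]:
    candidates = {a | b for a in frequent[-1] for b in frequent[0] if not b <= a}
    frequent.append({c for c in candidates if support(c) >= min_support})

for level in frequent[:-1]:
    print(sorted(map(sorted, level)))
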

1. What is a Decision Tree Algorithm?


o A decision tree is a tree in which every node is either a leaf node or a decision node. The tree takes an object as input and outputs a decision. All paths from the root node to a leaf node are reached by using AND, OR, or both. The tree is constructed using the regularities of the data. The decision tree is not affected by Automatic Data Preparation.

2. What is Naïve Bayes Algorithm?


o The Naïve Bayes algorithm is used to generate mining models. These models help to identify relationships between the input columns and the predictable columns. The algorithm can be used in the initial stage of exploration. It calculates the probability of every state of each input column given each possible state of the predictable column. After the model is built, the results can be used for exploration and for making predictions.
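
The answer above is phrased in terms of input and predictable columns; as a rough illustration of the underlying idea, here is a scikit-learn sketch that estimates class probabilities from the input columns (Iris is used purely as example data):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Fit class-conditional distributions and predict class probabilities for one instance.
nb = GaussianNB().fit(X, y)
print(nb.predict(X[:1]))          # predicted class
print(nb.predict_proba(X[:1]))    # probability of each class
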
3. Explain Association algorithm in Data mining?
o The Association algorithm is used for recommendation engines based on market basket analysis. Such an engine suggests products to customers based on what they bought earlier. The model is built on a data set containing identifiers, both for individual cases and for the items that the cases contain. These groups of items in a data set are called itemsets. The algorithm traverses the data set to find items that appear together in a case. The MINIMUM_SUPPORT parameter specifies how many cases an itemset must appear in before it is considered frequent.
4. Define support and confidence.
o The support of a rule X->Y is the fraction of all transactions that contain both X and Y.
The confidence of a rule X->Y is the fraction of the transactions containing X that also contain Y, i.e. support(X and Y) / support(X).
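
A small worked example of these two ratios, computed over made-up transactions with X = {milk} and Y = {bread}:

transactions = [{"milk", "bread"}, {"milk"}, {"bread", "sugar"}, {"milk", "bread", "sugar"}]

n_total = len(transactions)
n_x = sum("milk" in t for t in transactions)                # transactions containing X
n_xy = sum({"milk", "bread"} <= t for t in transactions)    # transactions containing X and Y

support = n_xy / n_total        # 2 / 4 = 0.50
confidence = n_xy / n_x         # 2 / 3 = 0.67 (approximately)
print(support, confidence)
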

5. Implement Multidimensional Data Models

5. What is data warehouse?


o A data warehouse is an electronic store of an organization's historical data for the purpose of reporting, analysis, and data mining or knowledge discovery.

6. What are the benefits of a data warehouse?


o A data warehouse helps to integrate data and store it historically so that we can analyze different aspects of the business, including performance analysis, trends, prediction, etc., over a given time frame and use the results of our analysis to improve the efficiency of business processes.

7. Briefly state the difference between a data warehouse and a data mart.
o A data warehouse is made up of many data marts. A DWH contains many subject areas, but a data mart generally focuses on one subject area. For example, if there is a DWH for a bank, there can be one data mart for accounts, one for loans, etc. These are high-level definitions.
Metadata is data about data. For example, if a data mart is receiving a file, the metadata will contain information such as how many columns it has, whether the file is fixed-width or delimited, the ordering of fields, the data types of the fields, etc.
8. Differentiate between Data Mining and Data Warehousing.
Data warehousing is merely extracting data from different sources, cleaning the data, and storing it in the warehouse, whereas data mining aims to examine or explore the data using queries. These queries can be fired on the data warehouse. Exploring the data through data mining helps in reporting, planning strategies, finding meaningful patterns, etc.
E.g. a data warehouse of a company stores all the relevant information about its projects.
9. What is the difference between OLTP and OLAP?
o OLTP is the transaction system that collects business data, whereas OLAP is the reporting and analysis system built on that data.
OLTP systems are optimized for INSERT and UPDATE operations and are therefore highly normalized. On the other hand, OLAP systems are deliberately denormalized for fast data retrieval through SELECT operations.

10. What is data mart?


o Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing, etc. stored in the data warehouse, and each department may have a separate data mart. These data marts can be built on top of the data warehouse.

11. What are the forms of multidimensional model?


o Star schema
o Snowflake schema
o Fact constellation schema

12. What are frequent patterns?


o A set of items that appear frequently together in a transaction data set.
o E.g. milk, bread, sugar

Star Schema
- Each dimension in a star schema is represented with only one dimension table.
- This dimension table contains the set of attributes.
- The following diagram shows the sales data of a company with respect to the four dimensions, namely time, item, branch, and location.
- There is a fact table at the center. It contains the keys to each of the four dimensions.
- The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.

Snowflake Schema
- Some dimension tables in the snowflake schema are normalized.
- The normalization splits up the data into additional tables.
- Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
- Now the item dimension table contains the attributes item_key, item_name, type, brand, and supplier_key.
- The supplier key is linked to the supplier dimension table. The supplier dimension table contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, the redundancy is reduced and therefore it becomes easy to maintain and saves storage space.
Fact Constellation Schema
- A fact constellation has multiple fact tables. It is also known as a galaxy schema.
- The following diagram shows two fact tables, namely sales and shipping.
- The sales fact table is the same as that in the star schema.
- The shipping fact table has five dimensions, namely item_key, time_key, shipper_key, from_location, and to_location.
- The shipping fact table also contains two measures, namely dollars sold and units sold.
- It is also possible to share dimension tables between fact tables. For example, the time, item, and location dimension tables are shared between the sales and shipping fact tables.

Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two
primitives, cube definition and dimension definition, can be used for defining the data
warehouses and data marts.
What is a Dimension Table?
A dimension table is a table which contains attributes of the measurements stored in fact tables. The table consists of hierarchies, categories, and logic that can be used to traverse the nodes.

What is a Fact Table?


A fact table contains the measurements of business processes, and it contains foreign keys for the dimension tables.
Example – If the business process is the manufacturing of bricks, then the average number of bricks produced by one person/machine is a measure of that business process.

How many fact tables are there in a star schema?


There is only one fact table in a star schema.
What is Normalization?
Normalization splits up the data into additional tables.
Out of the star schema and the snowflake schema, whose dimension tables are normalized?
A: The snowflake schema uses the concept of normalization.
What is the benefit of normalization?
Normalization helps in reducing data redundancy.

6. Implement clustering algorithms

1. Explain clustering algorithm.


o A clustering algorithm is used to group sets of data with similar characteristics, also called clusters. These clusters help in making faster decisions and in exploring data. The algorithm first identifies the relationships in a data set, following which it generates a series of clusters based on those relationships. The process of creating clusters is iterative: the algorithm redefines the groupings to create clusters that better represent the data.

Compare the K-means and K-medoids algorithms.


o K-medoids is more robust than K-means in the presence of noise and outliers, but K-medoids can be computationally costly.

What is the K-nearest neighbor algorithm?


o It is one of the lazy learner algorithms used in classification. It finds the k nearest neighbors of the point of interest.

The K-means algorithm defines the centroid of a cluster as the mean value of the points within the cluster. It proceeds as follows:

First it randomly selects k of the objects in D, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is most similar, based on the Euclidean distance between the object and the cluster mean. The algorithm then iteratively improves the within-cluster variation: for each cluster it computes a new mean using the objects assigned to that cluster in the previous iteration, and all the objects are then reassigned using the updated means as the new cluster centers. This is repeated until the clusters become stable.
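
A compact NumPy sketch of this procedure (the random 2-D points and k = 2 are arbitrary choices for illustration, not part of the original exercise):

import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 2))                      # 50 objects with 2 numeric attributes
k = 2

# Randomly select k objects as the initial cluster means.
means = D[rng.choice(len(D), size=k, replace=False)]

for _ in range(100):
    # Assign every object to the cluster with the nearest mean (Euclidean distance).
    dist = np.linalg.norm(D[:, None, :] - means[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    # Recompute each cluster mean from the objects assigned to it.
    new_means = np.array([D[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_means, means):             # stable: stop iterating
        break
    means = new_means

print(means)
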
The K-medoids algorithm, also called Partitioning Around Medoids (PAM), is used for partitioning based on medoids, or central objects. It proceeds as follows:

Arbitrarily choose k of the objects in D as the initial representative objects. Then assign each remaining object to the cluster with the nearest representative object, and randomly select a non-representative object. Compute the total cost of swapping a representative object with the non-representative object. If the total cost is less than zero, perform the swap to form a new set of representative objects.

7. Classification Algorithms

K Nearest Neighbors Algorithm

The nearest neighbor rule determines the classification of an unknown data point on the basis of its closest neighbor, whose class is already known.

Here the nearest neighbors are computed on the basis of the value of k, which indicates how many nearest neighbors are to be considered to characterize the point. Because it uses more than one closest neighbor to determine the class to which the given data point belongs, it is called KNN. The data samples need to be in memory at runtime; hence the approach is referred to as a memory-based technique.

There is a variant of KNN that is based on weights: the training points are assigned weights according to their distances from the sample data point. Even so, computational complexity and memory requirements remain the primary concerns.

To overcome the memory limitation, the size of the data set is reduced. For this, repeated patterns that don't add additional information are eliminated from the training data set. To shrink the data further, points that don't influence the result are also eliminated from the training data set. The NN training data set can be organized using different structures to improve on the memory limits of KNN: the KNN implementation can use a ball tree, a k-d tree, or an orthogonal search tree.
The tree-structured training data is further divided into nodes, and techniques such as NFL and the tunable metric divide the training data set according to planes. Using these structures, we can increase the speed of the basic KNN algorithm. Consider that an object is sampled with a set of different attributes.

Assuming its group can be determined from its attributes, different algorithms can be used to automate the classification process. In pseudocode, the k-nearest neighbor algorithm can be expressed as:

K ← number of nearest neighbors
for each object X in the test set do
    calculate the distance D(X, Y) between X and every object Y in the training set
    neighborhood ← the K neighbors in the training set closest to X
    X.class ← SelectClass(neighborhood)
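
A runnable Python version of this pseudocode, assuming Euclidean distance and a majority vote as SelectClass (the toy training points and labels are invented):

import math
from collections import Counter

def knn_classify(test_point, training_set, k=3):
    # Distance D(X, Y) between the test object and every training object.
    distances = [(math.dist(test_point, y), label) for y, label in training_set]
    # The k training objects closest to X form the neighborhood.
    neighborhood = [label for _, label in sorted(distances)[:k]]
    # SelectClass: majority vote among the neighbors.
    return Counter(neighborhood).most_common(1)[0][0]

training_set = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.2), "B"), ((4.8, 5.1), "B")]
print(knn_classify((1.1, 0.9), training_set, k=3))   # expected: "A"
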

8. Case study
