
5 marks each

1) Explain why data warehouses are needed for developing business solutions from today’s perspective. Discuss the role of data marts.
Importance of Data warehouse:

Potential for high returns on investment and enhanced business intelligence: Implementation of a data warehouse requires a huge investment, often running into lakhs of rupees. But it helps the organization take strategic decisions based on past historical data, and the organization can improve the results of various processes like market segmentation, inventory management and sales.

Competitive advantage: As previously unknown and unavailable data becomes available in the data warehouse, decision makers can access that data to take decisions that gain a competitive advantage.

Saves Time: As the data from multiple sources is available in integrated form,
business users can access data from one place. There is no need to retrieve the
data from multiple sources.

Better enterprise intelligence: It improves customer service and productivity.

High quality data: Data in the data warehouse is cleaned and transformed into the desired format. So, data quality is high.

Data Mart:
 A data mart is oriented to a specific purpose or major data subject and may be distributed to support business needs. It is a subset of the data resource.
 A data mart is a repository of a business organization's data implemented to answer very specific questions for a specific group of data consumers, such as organizational divisions of marketing, sales, operations, collections and others. A data mart is typically established as a dimensional model or star schema, composed of a fact table and multiple dimension tables.
 A data mart is a small warehouse designed for the department level.
 It is often a way to gain entry into data warehousing and provides an opportunity for learning.
 Major problem: if data marts differ from department to department, they can be difficult to integrate enterprise-wide.
2) Explain various features of Data Warehouse?
A data warehouse is a database, which is kept separate from the organization's
operational database. There is no frequent updating done in a data warehouse.
It possesses consolidated historical data, which helps the organization to analyse
its business. A data warehouse helps executives to organize, understand, and use
their data to take strategic decisions.

Data Warehouse Features:

The key features of a data warehouse are discussed below −


 Subject Oriented − A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing
operations. These subjects can be product, customers, suppliers, sales,
revenue, etc. A data warehouse does not focus on the ongoing operations,
rather it focuses on modelling and analysis of data for decision making.
Data warehouses are designed to help analyse data. For example, to learn more
about banking data, a warehouse can be built that concentrates on
transactions, loans, etc.
 Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This
integration enhances the effective analysis of data.
The collected data is cleaned, and then data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attribute measures, etc., among the different data sources.
 Time Variant − The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information
from the historical point of view.
For example, where a customer record has details of his current job, a data warehouse would also maintain all his previous jobs (historical information), whereas a transactional system maintains only the current job, so older records cannot be retrieved from it.
 Non-volatile − Non-volatile means the previous data is not erased when new
data is added to it. Non-volatile means that, once data entered into the
warehouse, it cannot be removed or changed because the purpose of
warehouse is to analyse the data.
A data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.
3) Discuss the application of data warehousing and data mining
A data warehouse helps business executives to organize, analyse, and use their data for decision making. A data warehouse serves as part of a plan-execute-assess "closed-loop" feedback system for enterprise management.
Data warehouses are widely used in the following fields −

Healthcare:
One of the most important sectors which utilizes data warehouses is the
Healthcare sector. All of their financial, clinical, and employee records are fed to
warehouses as it helps them to strategize and predict outcomes, track and analyse
their service feedback, generate patient reports, share data with tie-in insurance
companies, medical aid services, etc.

Government and Education:


The federal government utilizes the warehouses for research in compliance,
whereas the state government uses it for services related to human resources like
recruitment, and accounting like payroll management. The government uses data
warehouses to maintain and analyse tax records, health policy records and their
respective providers, and also their entire criminal law database is connected to
the state’s data warehouse.
Universities use warehouses for extracting of information used for the proposal of
research grants, understanding their student demographics, and human resource
management.

Finance Industry:
The applications here are similar to those seen in banking; they mainly revolve around evaluating customer expense trends, which helps in maximizing the profits earned by their clients.

Consumer Goods Industry:


They are used for prediction of consumer trends, inventory management, market
and advertising research. In-depth analysis of sales and production is also carried
out. Apart from these, information is exchanged with business partners and clientele.

Hospitality Industry:
A major proportion of this industry is dominated by hotel and restaurant services,
car rental services, and holiday home services. They utilize warehouse services to
design and evaluate their advertising and promotion campaigns where they target
customers based on their feedback and travel patterns.
Data Mining Applications:

Here is the list of areas where data mining is widely used −

Scientific Analysis: Scientific simulations are generating bulks of data every day. This includes data collected from nuclear laboratories, data about human psychology, etc. Data mining techniques are capable of analysing these data.
Now we can capture and store more new data faster than we can analyze the old
data already accumulated.

Business Transactions: Every transaction in the business industry is memorized (recorded) for perpetuity. Such transactions are usually time-related and can be inter-business deals or intra-business operations.

Market Basket Analysis: Market Basket Analysis is a technique that involves the careful study of purchases made by a customer in a supermarket. It identifies patterns of items that customers frequently purchase together.

Healthcare and Insurance: The pharmaceutical sector can examine its recent sales force activity and its outcomes to improve the targeting of high-value physicians and determine which promotional activities will have the best effect in the coming months. In the insurance sector, data mining can help to predict which customers will buy new policies, identify behaviour patterns of risky customers and identify fraudulent behaviour of customers.

Financial/Banking Sector: A credit card company can leverage its vast warehouse of customer transaction data to identify customers most likely to be interested in a new credit product.

 Credit card fraud detection.


 Identify ‘Loyal’ customers.
 Extraction of information related to customers.
 Determine credit card spending by customer groups.
4) A data warehouse is a subject-oriented, integrated, time-variant, and non-
volatile collection of data – Justify. Same as Ans - 2

5) Give differences between OLAP and OLTP.

OLAP (Online Analytical Processing) vs OLTP (Online Transaction Processing):
1. OLAP consists of historical data from various databases; OLTP consists of only current operational data.
2. OLAP is subject oriented and is used for data mining, analytics, decision making, etc.; OLTP is application oriented and is used for business tasks.
3. In OLAP, the data is used for planning, problem solving and decision making; in OLTP, the data is used to perform day-to-day fundamental operations.
4. OLAP provides a multi-dimensional view of different business tasks; OLTP reveals a snapshot of present business tasks.
5. OLAP stores large amounts of data, typically in TB or PB; in OLTP the data size is relatively small (MB, GB) as historical data is archived.
6. OLAP is relatively slow as the amount of data involved is large, and queries may take hours; OLTP is very fast as queries operate on about 5% of the data.
7. OLAP only needs backup from time to time; in OLTP the backup and recovery process is maintained religiously.
8. OLAP data is generally managed by the CEO, MD or GM; OLTP data is managed by clerks and managers.
9. OLAP involves only read and rarely write operations; OLTP involves both read and write operations.
6) Explain various OLAP operations
OLAP stands for Online Analytical Processing. It is a software
technology that allows users to analyse information from multiple database
systems at the same time. It is based on multidimensional data model and allows
the user to query on multi-dimensional data (e.g. Delhi -> 2018 -> Sales data).
OLAP databases are divided into one or more cubes and these cubes are known
as Hyper-cubes.

OLAP operations:
There are five basic analytical operations that can be performed on an OLAP
cube:
1. Drill down: In drill-down operation, the less detailed data is converted into
highly detailed data. It can be done by:
 Moving down in the concept hierarchy
 Adding a new dimension
In the cube given in overview section, the drill down operation is performed by
moving down in the concept hierarchy of Time dimension (Quarter -> Month).
2. Roll up: It is just opposite of the drill-down operation. It performs
aggregation on the OLAP cube. It can be done by:
 Climbing up in the concept hierarchy
 Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by
climbing up in the concept hierarchy of Location dimension (City -> Country).

3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. In the cube given in the overview section, a sub-cube is selected
by selecting following dimensions with criteria:
 Location = “Delhi” or “Kolkata”
 Time = “Q1” or “Q2”
 Item = “Car” or “Bus”
4. Slice: It selects a single dimension from the OLAP cube which results in a
new sub-cube creation. In the cube given in the overview section, Slice is
performed on the dimension Time = “Q1”.

5. Pivot: It is also known as the rotation operation as it rotates the current view to get a new view of the representation. In the sub-cube obtained after the slice operation, performing a pivot operation gives a new view of it.
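These operations can be sketched on a toy cube using pandas (a minimal, hedged illustration; the column names, sales figures and the pandas dependency are assumptions for the example, not part of the question):

```python
import pandas as pd

# A tiny hypothetical sales "cube" with Location, Time and Item dimensions.
cube = pd.DataFrame({
    "Location": ["Delhi", "Delhi", "Kolkata", "Kolkata", "Delhi", "Kolkata"],
    "Time":     ["Q1",    "Q2",    "Q1",      "Q2",      "Q1",    "Q1"],
    "Item":     ["Car",   "Bus",   "Car",     "Bus",     "Bus",   "Car"],
    "Sales":    [120,      80,      95,        60,        70,      40],
})

# Roll-up: aggregate away the Item dimension (reduce dimensions / climb the hierarchy).
rollup = cube.groupby(["Location", "Time"], as_index=False)["Sales"].sum()

# Slice: fix a single dimension value, e.g. Time = "Q1".
slice_q1 = cube[cube["Time"] == "Q1"]

# Dice: select a sub-cube using criteria on two or more dimensions.
dice = cube[
    cube["Location"].isin(["Delhi", "Kolkata"])
    & cube["Time"].isin(["Q1", "Q2"])
    & cube["Item"].isin(["Car", "Bus"])
]

# Pivot: rotate the view so that Locations become columns.
pivot = rollup.pivot(index="Time", columns="Location", values="Sales")
print(pivot)
```

Here roll-up corresponds to group-and-aggregate, slice and dice correspond to row selection, and pivot rotates rows into columns to give a new view.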
7) Differentiate Fact table vs. Dimension table

Fact Table vs Dimension Table:
1. A fact table contains the measures (metrics) of a business process; a dimension table contains the attributes along which the measures in the fact table are calculated.
2. A fact table has fewer attributes than a dimension table; a dimension table has more attributes than a fact table.
3. A fact table has more records than a dimension table; a dimension table has fewer records than a fact table.
4. A fact table forms a vertical (deep and narrow) table; a dimension table forms a horizontal (wide and shallow) table.
5. The attributes of a fact table are in numerical and text format; the attributes of a dimension table are in text format.
6. The fact table is created after the dimension tables; the dimension tables are created before the fact table.
7. The number of fact tables in a schema is less than the number of dimension tables.
8. The fact table is used for analysis and decision making; the main task of a dimension table is to store information about a business and its process.
9. The fact table is located at the centre of a star or snowflake schema and is surrounded by dimensions; dimension tables are connected to the fact table and located at the edges of the schema.
10. A fact table is defined by its grain, or its most atomic level; a dimension table should be wordy, descriptive, complete and quality assured.
11. The fact table contains foreign keys that reference the primary keys of the dimension tables; a dimension table has a primary key column that uniquely identifies each dimension record.
12. The fact table loads detailed, atomic data into the dimensional structure; the dimension table stores report labels and filter domain values.
13. A fact table does not contain hierarchies; a dimension table contains hierarchies, for example Location could contain country, state, city, pin code, etc.
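A minimal sketch of this structure using pandas DataFrames may make the contrast concrete (the table names, keys and values are hypothetical; a real warehouse would build these as SQL tables, but the fact/dimension roles are the same):

```python
import pandas as pd

# Dimension tables: wide, descriptive attributes with a primary key each.
dim_item = pd.DataFrame({
    "item_key": [1, 2],
    "item_name": ["Car", "Bus"],
    "category": ["Vehicle", "Vehicle"],
})
dim_location = pd.DataFrame({
    "location_key": [10, 20],
    "city": ["Delhi", "Kolkata"],
    "country": ["India", "India"],
})

# Fact table: many narrow rows of numeric measures plus foreign keys.
fact_sales = pd.DataFrame({
    "item_key":     [1, 2, 1, 2],
    "location_key": [10, 10, 20, 20],
    "units_sold":   [30, 12, 25, 9],
    "revenue":      [90000, 48000, 75000, 36000],
})

# Joining the fact table to its dimensions restores the descriptive context.
report = (fact_sales
          .merge(dim_item, on="item_key")
          .merge(dim_location, on="location_key"))
print(report.groupby(["city", "item_name"])["revenue"].sum())
```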
8) Define the term “data mining”. Discuss the major issues in data mining.

Data mining is the process of analysing massive volumes of data to discover business intelligence that can help companies solve problems, mitigate risks, and seize new opportunities.

Major issues in data mining are:

 Mining Methodology and User Interaction Issues

It refers to the following kinds of issues −


 Mining different kinds of knowledge in databases − Different users may
be interested in different kinds of knowledge. Therefore, it is necessary for
data mining to cover a broad range of knowledge discovery task.
 Interactive mining of knowledge at multiple levels of abstraction − The
data mining process needs to be interactive because it allows users to
focus the search for patterns, providing and refining data mining requests
based on the returned results.
 Incorporation of background knowledge − To guide discovery process
and to express the discovered patterns, the background knowledge can be
used. Background knowledge may be used to express the discovered
patterns not only in concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − Data Mining
Query language that allows the user to describe ad hoc mining tasks,
should be integrated with a data warehouse query language and optimized
for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once the
patterns are discovered it needs to be expressed in high level languages,
and visual representations. These representations should be easily
understandable.
 Handling noisy or incomplete data − The data cleaning methods are
required to handle the noise and incomplete objects while mining the data
regularities. If the data cleaning methods are not there then the accuracy of
the discovered patterns will be poor.
 Pattern evaluation − The patterns discovered may not all be interesting; some may merely represent common knowledge or lack novelty, so measures are needed to evaluate how interesting the discovered patterns really are.
 Performance Issues

There can be performance-related issues such as follows −


 Efficiency and scalability of data mining algorithms − In order to
effectively extract the information from huge amount of data in databases,
data mining algorithm must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − The factors
such as huge size of databases, wide distribution of data, and complexity
of data mining methods motivate the development of parallel and
distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update the mining results incrementally without mining the whole data again from scratch.

 Diverse Data Types Issues

 Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global
information systems − The data is available at different data sources on
LAN or WAN. These data source may be structured, semi structured or
unstructured. Therefore, mining the knowledge from them adds challenges
to data mining.
9) In real-world data, tuples with missing values for some attributes are a
common occurrence. Describe various methods for handling this problem
The various methods for handling the problem of missing values in data tuples
include:

(a) Ignoring the tuple: This is usually done when the class label is missing, and it is suitable only when the dataset is quite large. The method is not very effective unless the tuple contains several attributes with missing values, and it is especially poor when the percentage of missing values per attribute varies considerably.

(b) Manually filling in the missing value: In general, this approach is time-
consuming and may not be a reasonable task for large data sets with many
missing values, especially when the value to be filled in is not easily determined.

(c) Using a global constant to fill in the missing value: Replace all missing attribute values with the same constant, such as a label like "Unknown". When missing values are difficult to predict, a global constant value like "Unknown", "N/A" or minus infinity can be used to fill in all the missing values. For example, consider a student database: if the address attribute is missing for some students, it does not make sense to try to fill in these values, so a global constant can be used instead.

(d) Using the attribute mean for quantitative (numeric) values or attribute mode for categorical (nominal) values: A missing numeric value is replaced with the mean (or median) of the attribute, and a missing categorical value with its mode. For example, suppose that the average income of AllElectronics customers is $28,000; use this value to replace any missing values for income.

(e) Using the attribute mean for all samples belonging to the same class: Instead of replacing missing values with the mean or median of all the rows in the database, we consider the data class-wise and replace a missing value with the mean or median of its own class, which makes the replacement more relevant.
For example, consider a car-pricing database with classes like "luxury" and "low budget" in which missing values need to be filled in; replacing the missing cost of a luxury car with the average cost of all luxury cars makes the data more accurate.

(f) Using the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using Bayesian formalism, or
decision tree induction. For example, using the other customer attributes in your
data set, you may construct a decision tree to predict the missing values for
income.
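Several of these strategies (global constant, attribute mean, class-wise mean) can be expressed in a few lines of pandas; this is only a rough sketch, and the column names and values below are made up for the illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with missing income values.
df = pd.DataFrame({
    "customer_class": ["luxury", "luxury", "budget", "budget", "luxury"],
    "income": [52000, np.nan, 21000, np.nan, 48000],
})

# (c) Global constant: mark the value with a sentinel.
filled_const = df["income"].fillna(-1)

# (d) Attribute mean over all tuples.
filled_mean = df["income"].fillna(df["income"].mean())

# (e) Class-wise mean: replace a missing value with the mean of its own class.
filled_class_mean = df["income"].fillna(
    df.groupby("customer_class")["income"].transform("mean"))

print(filled_class_mean)
```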
10) Explain the following data normalization techniques:
(i) min-max normalization and
Min-Max Normalization –
In this technique of data normalization, a linear transformation is performed on the original data. The minimum and maximum values of the attribute are fetched and each value is replaced according to the following formula:

v’ = ((v - min(A)) / (max(A) - min(A))) * (new_max(A) - new_min(A)) + new_min(A)

Where A is the attribute,
min(A), max(A) are the minimum and maximum values of A respectively,
v is the old value of each entry in the data,
v’ is the new value of each entry in the data,
new_min(A), new_max(A) are the minimum and maximum values of the required range (i.e. the boundary values of the range).
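A small Python sketch of the formula, assuming a plain list of numeric values and a default target range of [0, 1]:

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map each value v into [new_min, new_max] using min-max scaling."""
    v_min, v_max = min(values), max(values)
    return [((v - v_min) / (v_max - v_min)) * (new_max - new_min) + new_min
            for v in values]

# Hypothetical income values rescaled to [0, 1]:
print(min_max_normalize([12000, 35000, 58000, 98000]))
```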

(ii) decimal scaling.


Decimal Scaling Method for Normalization –
It normalizes by moving the decimal point of the data values. To normalize the data by this technique, we divide each value by 10^j, where j is chosen so that the maximum absolute value of the normalized data falls below 1. The data value vi is normalized to vi‘ by using the formula below:

vi‘ = vi / 10^j, where j is the smallest integer such that max(|vi‘|) < 1.

Example – Let the input data be: -10, 201, 301, -401, 501, 601, 701
To normalize the above data,
Step 1: The maximum absolute value in the given data is 701.
Step 2: Divide the given data by 1000 (i.e. j = 3).
Result: The normalized data is: -0.01, 0.201, 0.301, -0.401, 0.501, 0.601, 0.701
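The same steps in a short Python sketch (a minimal illustration assuming non-zero numeric input; the function name is just for the example):

```python
import math

def decimal_scale(values):
    """Normalize by dividing by 10^j, where j is the smallest integer
    that brings every |value| below 1."""
    j = math.ceil(math.log10(max(abs(v) for v in values)))
    return [v / (10 ** j) for v in values], j

data = [-10, 201, 301, -401, 501, 601, 701]
print(decimal_scale(data))   # j = 3, i.e. divide by 1000
```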
11) Describe various methods for handling missing data values Same as - 9

12) What are the limitations of the Apriori approach for mining? Briefly
describe the techniques to improve the efficiency of Apriori algorithm
Limitations of Apriori Algorithm
 The Apriori algorithm can be slow.
 The main limitation is the time required to hold a vast number of candidate sets when there are many frequent itemsets, low minimum support or large itemsets, i.e. it is not an efficient approach for large numbers of datasets. For example, if there are 10^4 frequent 1-itemsets, the algorithm needs to generate more than 10^7 candidate 2-itemsets, which in turn have to be tested and accumulated (a quick check of this figure is sketched below). Furthermore, to detect a frequent pattern of size 100, i.e. {v1, v2, …, v100}, it has to generate 2^100 candidate itemsets, which makes candidate generation costly and time-consuming. So the algorithm checks many candidate itemsets and scans the database many times repeatedly to find frequent itemsets.
 Apriori is very slow and inefficient when memory capacity is limited and the number of transactions is large.
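A quick back-of-the-envelope check of the candidate-count claim above, assuming the candidate 2-itemsets are simply all pairs of the 10^4 frequent 1-itemsets:

```python
from math import comb

# Number of candidate 2-itemsets that can be formed from 10^4 frequent 1-itemsets.
print(comb(10_000, 2))  # 49,995,000 -- well over 10^7 candidates
```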
13) What is market basket analysis? Explain the two measures of rule
interestingness: support and confidence with suitable example.

14) Explain measures for finding rule interestingness (support, confidence) with example. Same as ans - 13
15) Compare association and classification. Briefly explain associative
classification with suitable example.
 Classification rule mining aims to discover a small set of rules in the database
that forms an accurate classifier.
 Association rule mining finds all the rules existing in the database that satisfy
some minimum support and minimum confidence constraints.
 Associative classification (AC) is a mining technique that integrates classification and association rule mining to perform classification on unseen data instances. AC is an effective classification technique that applies the generated class association rules to perform classification. For example, a class association rule such as (age = youth) AND (income = high) => buys_computer = yes, mined from training data, can be applied directly to assign the class label to a new customer who satisfies its conditions.
16) What is an attribute selection measure? Explain different attribute
selection measures with example.
17) Do feature wise comparison between classification and prediction.

Classification vs Prediction:
1. Classification is the process of identifying which category a new observation belongs to, based on a training data set containing observations whose category membership is known; prediction is the process of identifying the missing or unavailable numerical data for a new observation.
2. Classification is a major type of prediction problem where the model is used to predict discrete or nominal values; prediction can be viewed as the construction and use of a model to assess the class of an unlabelled sample.
3. In classification, the accuracy depends on finding the class label correctly; in prediction, the accuracy depends on how well a given predictor can guess the value of the predicted attribute for new data.
4. Classification is the use of prediction to predict class labels; prediction is used to assess the values or value ranges of an attribute that a given sample is likely to have.
5. In classification, the model can be known as the classifier; in prediction, the model can be known as the predictor.
6. A model or classifier is constructed to find the categorical labels; a model or predictor is constructed to predict a continuous-valued function or ordered value.
7. For example, the grouping of patients based on their medical records can be considered classification; predicting the correct treatment for a particular disease for a person can be thought of as prediction.
18) Explain Linear regression with example.
 Linear regression is used to predict the relationship between two variables by
applying a linear equation to observed data.
 There are two types of variable, one variable is called an independent variable,
and the other is a dependent variable.
 Linear regression is commonly used for predictive analysis.
 It is used to explain the relationship between one independent variable and one dependent variable.

The following equation represents the equation of simple linear regression:

Equation: y = a + b*x
y = dependent variable
a = constant (intercept)
b = regression coefficient (slope)
x = independent variable

Example: The weight of the person is linearly related to their height. So, this
shows a linear relationship between the height and weight of the person.
According to this, as we increase the height, the weight of the person will also
increase. It is not necessary that one variable is dependent on others, or one
causes the other, but there is some critical relationship between the two variables.
In such cases, we use a scatter plot to visualize the strength of the relationship
between the variables. If there is no relation or linking between the variables then
the scatter plot does not indicate any increasing or decreasing pattern. In such
cases, the linear regression design is not beneficial to the given data.
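A minimal sketch of fitting such a height-weight line with NumPy least squares (the data points below are made up purely for illustration):

```python
import numpy as np

# Hypothetical height (cm) and weight (kg) observations.
height = np.array([150, 158, 163, 170, 175, 182])
weight = np.array([50, 55, 59, 66, 70, 78])

# Fit y = a + b*x by ordinary least squares (degree-1 polynomial fit).
b, a = np.polyfit(height, weight, deg=1)
print(f"weight = {a:.1f} + {b:.2f} * height")

# Predict the weight for a new height value.
print("predicted weight at 168 cm:", a + b * 168)
```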

19) Explain data mining application for fraud detection.

 Credit Card Fraud Detection: Credit card fraud detection is quite confidential and is not much disclosed in public.
 Computer Intrusion Detection: An intrusion detection system is needed to
automate and perform system monitoring by keeping aggregate audit trail
statistics
 Telecommunication Fraud Detection: Most techniques use Call Detail
Record data to create behaviour profiles for the customer, and detect
deviations from these profiles.
20) Discuss applications of data mining in Banking and Finance.
Applications of Data Mining in Banking:
Banks use data mining to better understand market risks. It is most often used in
banking to determine the likelihood of a loan being repaid by the borrower. It is
also used commonly to detect financial fraud. An example of fraud detection is when some unusually high transactions occur and the bank’s fraud prevention
system is set up to put the account on hold until the account holder confirms that
this was a legitimate purchase.
There are numerous areas in which data mining can be used in the banking
industry, which include customer segmentation and profitability, credit scoring
and approval, predicting payment default, marketing, detecting fraudulent
transactions, cash management and forecasting operations, optimizing stock
portfolios, and ranking investments. In addition, banks may use data mining to
identify their most profitable credit card customers or high-risk loan applicants.
Data mining is also used to help banks retain credit card customers. By analysing past data, data mining can help banks predict which customers are likely to change their credit card affiliation, so they can plan and launch special offers to retain those customers. Credit card spending by customer groups can also be identified using data mining. In banking areas like marketing, fraud detection and risk management, data mining has been effectively utilized.

Applications of Data Mining in Finance:


Finance, which is becoming increasingly conducive to data-driven modeling as
enormous quantities of financial data become available, is one of the most
enticing application areas for these developing technologies. Some of the
applications of data mining in finance are:
 Prediction of the Stock Market
 Portfolio Management
 Bankruptcy Prediction
 Foreign Exchange Market
 Fraud Detection
 Pattern Recognition
 Outlier Detection
 Rule Induction
 Social Network Analysis
 Visualization
21) How K-Mean clustering method differs from K-Medoid clustering
method?
K-means vs K-medoid:
1. K-means complexity is O(i*k*n); K-medoid complexity is O(i*k*(n-k)^2).
2. K-means is more efficient; K-medoid is comparatively less efficient.
3. K-means is sensitive to outliers; K-medoid is not sensitive to outliers.
4. K-means requires clusters of convex shape; K-medoid does not require a convex shape.
5. In both, the number of clusters needs to be specified in advance.
6. K-means is efficient for well-separated clusters; K-medoid is efficient for well-separated clusters and small data sets.
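A small sketch of the difference, assuming scikit-learn and NumPy are available: KMeans places centres at cluster means (so an outlier drags a centroid), whereas a k-medoid-style centre is an actual data point. Core scikit-learn has no k-medoids estimator, so the medoid step is computed manually here:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data with an obvious outlier to illustrate the sensitivity difference.
X = np.array([[1, 1], [1.5, 2], [2, 1.5], [8, 8], [8.5, 9], [25, 25]])

# K-means: centres are means, so the outlier (25, 25) pulls its centroid away.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("k-means centroids:\n", km.cluster_centers_)

# A medoid for each cluster: the actual member point that minimizes the total
# (Manhattan) distance to the other points in its cluster.
for label in np.unique(km.labels_):
    members = X[km.labels_ == label]
    dists = np.abs(members[:, None, :] - members[None, :, :]).sum(axis=(1, 2))
    print("medoid of cluster", label, "->", members[np.argmin(dists)])
```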

22) How FP tree is better than Apriori algorithm- Justify


23) Define information gain, entropy, gini index
Information Gain:
The concept of entropy plays an important role in measuring the information
gain. Information gain is based on information theory.
Information gain is used for determining the best features/attributes that render
maximum information about a class. It follows the concept of entropy while
aiming at decreasing the level of entropy, beginning from the root node to the leaf
nodes. Information gain computes the difference between entropy before and
after split and specifies the impurity in class elements.
Information Gain = Entropy before splitting - Entropy after splitting

Entropy:
It is the information theory metric that measures the impurity or uncertainty in a
group of observations. It determines how a decision tree chooses to split data.
Entropy is the measurement of impurities or randomness in the data points.
Here, if all elements belong to a single class, then it is termed as “Pure”, and if
not, then the distribution is named as “Impurity”.
Its value is usually computed between 0 and 1; however, depending on the number of groups or classes present in the data set, it can be more than 1, while still depicting the same significance, i.e. an extreme level of disorder.
In simpler terms, if a dataset contains homogeneous subsets of observations, then there is no impurity or randomness in the dataset, and if all the observations belong to one class, the entropy of that dataset is zero.

Gini Index:
Gini index or Gini impurity measures the degree or probability of a particular
variable being wrongly classified when it is randomly chosen.
It calculates the probability of a specific feature being classified incorrectly when it is selected randomly.
It is calculated by subtracting the sum of squared probabilities of each class from
one. It favours larger partitions and easy to implement whereas information gain
favours smaller partitions with distinct values.
A feature with a lower Gini index is chosen for a split.
The classic CART algorithm uses the Gini Index for constructing the decision
tree.
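A short Python sketch of these three measures on a hypothetical node with 9 "yes" and 5 "no" tuples and one candidate split (the counts are illustrative, not taken from the question):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    """Gini index: probability of misclassifying a randomly chosen element."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

# Hypothetical parent node: 9 "yes" and 5 "no" tuples.
parent = ["yes"] * 9 + ["no"] * 5

# Hypothetical split into two child nodes.
left = ["yes"] * 6 + ["no"] * 1
right = ["yes"] * 3 + ["no"] * 4

# Weighted entropy after the split.
n = len(parent)
after = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

print("Entropy before split:", round(entropy(parent), 3))
print("Information gain    :", round(entropy(parent) - after, 3))
print("Gini of parent      :", round(gini(parent), 3))
```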
