Saves Time: As the data from multiple sources is available in integrated form,
business users can access data from one place. There is no need to retrieve the
data from multiple sources.
High-quality data: Data in a data warehouse is cleaned and transformed into the
desired format, so the data quality is high.
Data Mart:
A data mart is oriented to a specific purpose or major data subject that may be
distributed to support business needs. It is a subset of the data resource.
A data mart is a repository of a business organization's data implemented to
answer very specific questions for a specific group of data consumers such as
organizational divisions of marketing, sales, operations, collections and
others. A data mart is typically established as a dimensional model or star
schema composed of a fact table and multiple dimension tables.
A data mart is a small warehouse designed for the department level.
It is often a way to gain entry and provides an opportunity for learning.
Major problem: If data marts differ from department to department, they can be
difficult to integrate enterprise-wide.
2) Explain various features of Data Warehouse?
A data warehouse is a database, which is kept separate from the organization's
operational database. There is no frequent updating done in a data warehouse.
It possesses consolidated historical data, which helps the organization to analyse
its business. A data warehouse helps executives to organize, understand, and use
their data to take strategic decisions.
Healthcare:
One of the most important sectors which utilizes data warehouses is the
Healthcare sector. All of their financial, clinical, and employee records are fed to
warehouses as it helps them to strategize and predict outcomes, track and analyse
their service feedback, generate patient reports, share data with tie-in insurance
companies, medical aid services, etc.
Finance Industry:
Applications here, similar to those seen in banking, mainly revolve around
evaluating trends in customer expenses, which aids in maximizing the profits
earned by their clients.
Hospitality Industry:
A major proportion of this industry is dominated by hotel and restaurant services,
car rental services, and holiday home services. They utilize warehouse services to
design and evaluate their advertising and promotion campaigns where they target
customers based on their feedback and travel patterns.
Data Mining Applications:
Market Basket Analysis: Market Basket Analysis is a technique that involves the
careful study of purchases made by a customer in a supermarket. It identifies
the patterns of items frequently purchased together by customers.
Healthcare and Insurance: A pharmaceutical company can examine its recent sales
force activity and its outcomes to improve the targeting of high-value
physicians, and figure out which marketing activities will have the best effect
in the upcoming months. In the insurance sector, data mining can help predict
which customers will buy new policies, identify behaviour patterns of risky
customers, and identify fraudulent behaviour of customers.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP
cube:
1. Drill down: In the drill-down operation, less detailed data is converted into
more detailed data. It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in overview section, the drill down operation is performed by
moving down in the concept hierarchy of Time dimension (Quarter -> Month).
2. Roll up: It is just the opposite of the drill-down operation. It performs
aggregation on the OLAP cube. It can be done by:
Climbing up in the concept hierarchy
Reducing the dimensions
In the cube given in the overview section, the roll-up operation is performed by
climbing up in the concept hierarchy of Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more
dimensions. In the cube given in the overview section, a sub-cube is selected
by selecting following dimensions with criteria:
Location = “Delhi” or “Kolkata”
Time = “Q1” or “Q2”
Item = “Car” or “Bus”
4. Slice: It selects a single dimension from the OLAP cube, which results in
the creation of a new sub-cube. In the cube given in the overview section, a
slice is performed on the dimension Time = “Q1”.
5. Pivot (rotate): It rotates the data axes in view to provide an alternative
presentation of the data, as illustrated in the sketch below.
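For illustration only, here is a minimal Python sketch of these operations
using pandas; the cube, its dimension values, and the Sales measure are
invented stand-ins for the overview cube, which is not reproduced here:

import pandas as pd

# Illustrative mini-cube stored as a flat table (dimensions: Location, Time, Item).
cube = pd.DataFrame({
    "Location": ["Delhi", "Delhi", "Kolkata", "Kolkata"],
    "Time":     ["Q1", "Q2", "Q1", "Q2"],
    "Item":     ["Car", "Bus", "Car", "Bus"],
    "Sales":    [120, 80, 95, 60],
})

# Roll up: aggregate away the Item dimension (reducing the dimensions).
rollup = cube.groupby(["Location", "Time"])["Sales"].sum()

# Slice: fix a single dimension value, Time = "Q1".
slice_q1 = cube[cube["Time"] == "Q1"]

# Dice: select a sub-cube using criteria on two or more dimensions.
dice = cube[cube["Location"].isin(["Delhi", "Kolkata"]) & cube["Time"].isin(["Q1", "Q2"])]

# Pivot: rotate the axes to present Items as columns.
pivoted = cube.pivot_table(index="Location", columns="Item", values="Sales", aggfunc="sum")

print(rollup, slice_q1, dice, pivoted, sep="\n\n")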
Fact Table vs Dimension Table:
7. Fact table: The number of fact tables in a schema is less than the number of
dimension tables.
Dimension table: The number of dimension tables in a schema is more than the
number of fact tables.
8. Fact table: It is used for analysis purposes and decision making.
Dimension table: Its main task is to store information about a business and its
processes.
9. Fact table: Located at the centre of a star or snowflake schema and
surrounded by dimensions.
Dimension table: Connected to the fact table and located at the edges of the
star or snowflake schema.
10. Fact table: Defined by its grain, i.e. its most atomic level.
Dimension table: Should be wordy, descriptive, complete, and quality assured.
11. Fact table: Contains foreign key columns that reference the primary keys of
the dimension tables.
Dimension table: Has a primary key column that uniquely identifies each
dimension row.
12. Fact table: Loads detailed, atomic data into dimensional structures.
Dimension table: Helps to store report labels and filter domain values.
13. Fact table: Does not contain hierarchies.
Dimension table: Contains hierarchies. For example, Location could contain
country, state, city, pin code, etc.
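As a hedged sketch of this relationship, the following Python example joins an
invented fact table to an invented dimension table on the dimension's primary
key (all table and column names are illustrative):

import pandas as pd

# Dimension table: a primary key plus descriptive attributes.
dim_location = pd.DataFrame({
    "location_id": [1, 2],
    "city": ["Delhi", "Kolkata"],
    "country": ["India", "India"],
})

# Fact table: a foreign key to the dimension plus an atomic measure.
fact_sales = pd.DataFrame({
    "location_id": [1, 1, 2],
    "sales": [120, 80, 95],
})

# Join the fact to the dimension, then aggregate by a descriptive attribute.
report = fact_sales.merge(dim_location, on="location_id").groupby("city")["sales"].sum()
print(report)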
8) Define the term “data mining”. Discuss the major issues in data mining.
(a) Ignoring the tuple: This approach is suitable only when the dataset we have
is quite large and multiple values are missing within a tuple. This is usually
done when the class label is missing. The method is not very effective unless
the tuple contains several attributes with missing values, and it is especially
poor when the percentage of missing values per attribute varies considerably.
(b) Manually filling in the missing value: In general, this approach is time-
consuming and may not be a reasonable task for large data sets with many
missing values, especially when the value to be filled in is not easily determined.
(c) Using a global constant to fill in the missing value: Replace all missing
attribute values with the same constant, such as a label like "Unknown". When
missing values are difficult to predict, a global constant like "Unknown",
"N/A", or minus infinity can be used to fill all the missing values.
For example, consider a student database: if the address attribute is missing
for some students, it does not make sense to fill in these values individually,
so a global constant can be used instead.
(d) Using the attribute mean for quantitative (numeric) values or attribute
mode for categorical (nominal) values: Missing values may be replaced by the
mean (or median) of a numeric attribute, or by the mode of a categorical
attribute. For example, suppose that the average income of AllElectronics
customers is $28,000. Use this value to replace any missing values for income.
(e) Using the attribute mean for all samples belonging to the same class:
Instead of replacing missing values with the mean or median over all rows in
the database, we can consider the data class-wise, replacing missing values
with the mean or median of the same class to make the replacement more relevant.
For example, consider a car pricing database with classes like "luxury" and
"low budget" in which missing values need to be filled in: replacing the
missing cost of a luxury car with the average cost of all luxury cars makes the
data more accurate.
(f) Using the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using Bayesian formalism, or
decision tree induction. For example, using the other customer attributes in your
data set, you may construct a decision tree to predict the missing values for
income.
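A minimal Python sketch of strategies (c)-(e) using pandas; the "segment" and
"price" columns and their values are invented for illustration:

import pandas as pd

df = pd.DataFrame({
    "segment": ["luxury", "luxury", "budget", "budget"],
    "price":   [90000, None, 12000, None],
})

# (c) Global constant: fill every missing value with a sentinel label.
df["price_const"] = df["price"].astype(object).fillna("Unknown")

# (d) Attribute mean: fill with the overall column mean.
df["price_mean"] = df["price"].fillna(df["price"].mean())

# (e) Class-wise mean: fill with the mean of the tuple's own class ("segment").
df["price_class_mean"] = df["price"].fillna(
    df.groupby("segment")["price"].transform("mean")
)

print(df)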
10) Explain the following data normalization techniques:
(i) min-max normalization and
Min-Max Normalization:
In this technique of data normalization, a linear transformation is performed
on the original data. The minimum and maximum values of the attribute are
fetched, and each value v of attribute A is replaced according to the following
formula:
v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A
where min_A and max_A are the minimum and maximum values of A, and
[new_min_A, new_max_A] is the target range, commonly [0, 1].
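A short Python sketch of this transformation (the sample values are invented;
the target range defaults to [0, 1]):

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    # Linearly rescale values into [new_min, new_max].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

print(min_max_normalize([20000, 30000, 50000]))  # [0.0, 0.333..., 1.0]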
12) What are the limitations of the Apriori approach for mining? Briefly
describe the techniques to improve the efficiency of Apriori algorithm
Limitations of the Apriori Algorithm
The Apriori algorithm can be slow.
The main limitation is the time required to hold a vast number of candidate
sets when there are many frequent itemsets, low minimum support, or large
itemsets; i.e., it is not an efficient approach for large datasets. For
example, if there are 10^4 frequent 1-itemsets, the algorithm needs to generate
more than 10^7 candidate 2-itemsets, which must in turn be tested and
accumulated. Furthermore, to detect a frequent pattern of size 100, i.e.,
v1, v2, ..., v100, it has to generate 2^100 candidate itemsets, making
candidate generation costly and time-consuming. The algorithm must check many
candidate itemsets, and it scans the database repeatedly to find them, as the
sketch below illustrates.
Apriori becomes very slow and inefficient when memory capacity is limited and
the number of transactions is large.
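A minimal, hedged Python sketch of the level-wise candidate generation that
causes this blow-up (the transaction data and the absolute min_support
threshold are invented; a fuller implementation would also prune candidates
whose k-subsets are infrequent):

def apriori(transactions, min_support):
    # Count support of a candidate itemset with one pass over the database.
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    # Frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    k_sets = {c for c in items if support(c) >= min_support}
    frequent = {}
    while k_sets:
        for s in k_sets:
            frequent[s] = support(s)
        # Join step: build size-(k+1) candidates from frequent k-itemsets;
        # each level triggers another full database scan for support counting.
        k = len(next(iter(k_sets)))
        candidates = {a | b for a in k_sets for b in k_sets if len(a | b) == k + 1}
        k_sets = {c for c in candidates if support(c) >= min_support}
    return frequent

baskets = [frozenset(t) for t in (["bread", "milk"],
                                  ["bread", "butter"],
                                  ["bread", "milk", "butter"])]
print(apriori(baskets, min_support=2))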
13) What is market basket analysis? Explain the two measures of rule
interestingness: support and confidence with suitable example.
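Briefly: for a rule A => B, support is the fraction of transactions containing
both A and B, and confidence is support(A and B) / support(A), i.e. the
conditional probability P(B|A). A small hedged Python sketch with invented
basket data:

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]

def support(itemset):
    # Fraction of transactions that contain every item in the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

# Rule: bread => milk
sup = support({"bread", "milk"})   # 2/4 = 0.50
conf = sup / support({"bread"})    # 0.50 / 0.75 ~ 0.67
print(sup, conf)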
Classification vs Prediction:
Classification is the process of identifying which category a new observation
belongs to, based on a training data set containing observations whose category
membership is known. Prediction is the process of identifying the missing or
unavailable numerical data for a new observation.
Classification is a major type of prediction problem, where classification is
used to predict discrete or nominal values. Prediction can be viewed as the
construction and use of a model to assess the class of an unlabelled sample.
In classification, the accuracy depends on finding the class label correctly.
In prediction, the accuracy depends on how well a given predictor can guess the
value of a predicted attribute for new data.
Classification is the use of prediction to predict class labels. Prediction is
used to assess the values or value ranges of an attribute that a given sample
is likely to have.
Equation: y = a + b*x
y = dependent variable
a = constant (intercept)
b = regression coefficient (slope)
x = independent variable
Example: The weight of the person is linearly related to their height. So, this
shows a linear relationship between the height and weight of the person.
According to this, as we increase the height, the weight of the person will also
increase. It is not necessary that one variable is dependent on others, or one
causes the other, but there is some critical relationship between the two variables.
In such cases, we use a scatter plot to visualize the strength of the
relationship between the variables. If there is no relation or link between the
variables, the scatter plot does not indicate any increasing or decreasing
pattern; in that case, a linear regression model is not suitable for the given
data.
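A small Python sketch of fitting y = a + b*x by least squares, using the
height/weight example from the text (the data points are invented):

# Least-squares fit of y = a + b*x (weight vs height).
heights = [150, 160, 170, 180]   # x, independent variable (illustrative)
weights = [55, 62, 70, 78]       # y, dependent variable (illustrative)

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n
# Slope b = covariance(x, y) / variance(x); intercept a = mean_y - b * mean_x.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights)) \
    / sum((x - mean_x) ** 2 for x in heights)
a = mean_y - b * mean_x
print(f"y = {a:.2f} + {b:.2f}*x")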
Entropy:
Entropy is an information-theoretic metric that measures the impurity or
uncertainty in a group of observations, and it determines how a decision tree
chooses to split data. If all elements belong to a single class, the node is
termed "pure"; otherwise, the distribution is said to have "impurity". For a
node whose classes occur with probabilities p_i, entropy is computed as:
Entropy = -Σ p_i * log2(p_i)
For two classes, the value lies between 0 and 1; with more classes it can
exceed 1 while conveying the same significance, i.e. a higher value indicates a
greater level of disorder. In simpler terms, if a dataset is homogeneous, there
is no impurity or randomness in it, and if all the observations belong to one
class, the entropy of that dataset is zero.
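A short Python sketch of this formula (class probabilities are passed in
directly for illustration):

import math

def entropy(class_probs):
    # Shannon entropy in bits: -sum(p * log2(p)) over nonzero probabilities.
    return -sum(p * math.log2(p) for p in class_probs if p > 0)

print(entropy([1.0]))       # 0.0 -> pure node
print(entropy([0.5, 0.5]))  # 1.0 -> maximum disorder for two classes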
Gini Index:
Gini index or Gini impurity measures the degree or probability of a particular
variable being wrongly classified when it is randomly chosen.
It calculates the probability that a specific feature is classified incorrectly
when selected at random. It is computed by subtracting the sum of the squared
probabilities of each class from one:
Gini = 1 - Σ p_i^2
It favours larger partitions and is easy to implement, whereas information gain
favours smaller partitions with distinct values.
A feature with a lower Gini index is chosen for a split.
The classic CART algorithm uses the Gini Index for constructing the decision
tree.
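For comparison with the entropy sketch above, a matching Python sketch of the
Gini index:

def gini(class_probs):
    # Gini impurity: 1 - sum(p^2) over class probabilities.
    return 1 - sum(p * p for p in class_probs)

print(gini([1.0]))       # 0.0 -> pure node
print(gini([0.5, 0.5]))  # 0.5 -> maximum impurity for two classes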