
Data Mining & Business Intelligence

Practice Set for IA 2: Questions for 5 Marks, Modules IV and V


1. State the difference between clustering and classification.
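Clustering                                     | Classification
-----------------------------------------------+-------------------------------------------------
Unsupervised learning                          | Supervised learning
-----------------------------------------------+-------------------------------------------------
Groups data based on similarities              | Assigns data to pre-defined categories
-----------------------------------------------+-------------------------------------------------
Does not require a pre-defined set of classes  | Requires a pre-defined set of classes or labels
-----------------------------------------------+-------------------------------------------------
Output is a grouping or partitioning of data   | Output is a prediction or label for each data
                                               | point
-----------------------------------------------+-------------------------------------------------
Used for exploratory data analysis, pattern    | Used for predictive modeling and decision
recognition, and anomaly detection             | making
-----------------------------------------------+-------------------------------------------------
Examples: k-means, hierarchical clustering     | Examples: logistic regression, decision trees,
                                               | support vector machines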

2. Define outlier. What are the different types of outliers?

1. An outlier is a data object that deviates significantly from the rest of the
objects, as if it were generated by a different mechanism.
2. Outliers are data objects that do not fit into any given class or cluster.
3. Their behavior differs from the usual behavior of the other data objects.
4. Analyzing this kind of data can be important for mining knowledge.
5. Outliers are divided into three types:
a) Global or point outliers
b) Collective outliers
c) Contextual or conditional outliers
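
A global (point) outlier can be illustrated with a simple z-score check. This is only a minimal sketch: the function name `global_outliers`, the threshold value, and the sample readings are assumptions made for illustration, and a z-score is just one of several possible ways to flag such points.

```python
from statistics import mean, stdev

def global_outliers(values, threshold=3.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sensor readings; 95 lies far from the rest of the data.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
flagged = global_outliers(readings, threshold=2.0)
print(flagged)
```

With very small samples the largest possible z-score is limited, which is why the sketch lowers the threshold to 2.0 for these nine readings.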

3. Explain hierarchical clustering along with its advantages and disadvantages.


1. Hierarchical clustering is an unsupervised learning procedure that
determines successive clusters based on previously defined clusters.
2. It works by grouping data into a tree of clusters.
3. In its agglomerative (bottom-up) form, it starts by treating each data
point as an individual cluster and repeatedly merges the closest clusters.
4. The endpoint is a set of clusters in which each cluster is distinct from
the others and the objects within each cluster are similar to one another.
5. There are two types of hierarchical clustering:
a) Agglomerative hierarchical clustering
b) Divisive hierarchical clustering

Advantages of hierarchical clustering

1) It is simple to implement and gives the best output in some cases.
2) It produces a hierarchy, a structure that carries more information than a
flat partition.
3) It does not require the number of clusters to be specified in advance.

Disadvantages of hierarchical clustering

1) It tends to break large clusters.
2) It has difficulty handling clusters of different sizes and convex shapes.
3) It is sensitive to noise and outliers.
4) A merge or split cannot be undone once it has been made.
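
The bottom-up procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: it works on one-dimensional points, uses single linkage (distance between the closest pair of points), and the names `single_linkage` and `agglomerative` are hypothetical helpers, not part of any library.

```python
def single_linkage(c1, c2):
    """Distance between two clusters: the closest pair of points."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerative(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    Returns the final clusters and the merge history (the hierarchy).
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    history = []
    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: single_linkage(clusters[p[0]], clusters[p[1]]))
        history.append((clusters[i], clusters[j]))
        merged = sorted(clusters[i] + clusters[j])
        # The merge is final: later steps only build on it.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history

clusters, history = agglomerative([1, 2, 6, 7, 15], num_clusters=2)
print(clusters)
print(history)
```

The merge history plays the role of the tree of clusters: reading it from the first entry to the last reproduces the hierarchy from the leaves upward.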

4. Explain various outlier detection methods.


Outlier Detection Methods:

5. What is market basket analysis along with its applications?

1. A market basket is a collection of items that a customer purchases together
in a single transaction.
2. Market basket analysis is a data mining technique that examines transaction
data to find items that are frequently purchased together.
3. The term "market basket" is also used in economics: the market basket for
the Consumer Price Index (CPI) is a collection of items that represents the
cost of living for a typical urban consumer.
4. In that economic sense, a market basket contains a variety of goods and
services that are consistently purchased and sold in an economy.
5. Economists use it to track price changes over time and determine inflation
levels, which is a different use from the data mining technique.

Applications:
a) Retail industry: Market basket analysis can help retailers understand
which products are frequently purchased together, which can be used
to optimize store layout, planogramming, and cross-selling strategies
b) E-commerce: Online retailers can use market basket analysis to
personalize product recommendations for each customer, based on
their past purchases and the purchasing behavior of similar customers
c) Hospitality industry: Market basket analysis can help hotels and
restaurants identify which items are frequently ordered together,
which can be used to optimize menu offerings and suggest menu
pairings
d) Healthcare: Market basket analysis can be used in healthcare to
identify which medical treatments are frequently used together, which
can be used to optimize treatment plans and reduce healthcare costs.

6. Define and explain i) support ii) confidence.

Support: Support is one of the measures of interestingness and reflects the
usefulness of a rule. A support of 5% means that 5% of all transactions in
the database contain both A and B.

Support(A -> B) = Support_count(A ∪ B) / Total number of transactions

Confidence: Confidence reflects the certainty of a rule. A confidence of 60%
for the rule {milk, bread} -> {butter} means that 60% of the customers who
purchased milk and bread also bought butter.

Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
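
The two measures can be checked on a small example. The sketch below assumes a hypothetical list of five transactions and hypothetical helper names (`support_count`, `support`, `confidence`); it is meant only to show the arithmetic behind the definitions.

```python
# Hypothetical transaction database: each transaction is a set of items.
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "eggs"},
    {"milk", "bread", "butter", "eggs"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing both sides of the rule."""
    return support_count(antecedent | consequent, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Fraction of antecedent transactions that also contain the consequent."""
    both = support_count(antecedent | consequent, transactions)
    return both / support_count(antecedent, transactions)

a, b = {"milk", "bread"}, {"butter"}
print(support(a, b, transactions))     # 3 of 5 transactions -> 0.6
print(confidence(a, b, transactions))  # 3 of 4 milk-and-bread transactions -> 0.75
```
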

7. State the limitations of the Apriori algorithm. How can its efficiency be
improved?

The Apriori algorithm is a widely used algorithm for performing market basket
analysis, but it has several limitations, including:

1. High computational cost: The Apriori algorithm generates a large number of
candidate itemsets, which can lead to a high computational cost,
especially for large datasets.
2. Memory requirements: The Apriori algorithm requires a large amount of
memory to store the frequent itemsets and candidate itemsets.
3. Lack of scalability: The Apriori algorithm does not scale well as the
number of items and transactions increases.
4. Difficulty with sparse data: The Apriori algorithm struggles to
generate frequent itemsets for sparse data because few itemsets meet the
minimum support threshold.

To improve the efficiency of the Apriori algorithm, several optimization
techniques have been proposed, including:

1. Hash-based techniques: Hash-based techniques can be used to reduce
the number of itemsets that need to be generated and stored in memory.
2. Sampling: Random sampling can be used to reduce the number of
transactions that need to be processed.
3. Pruning techniques: Pruning techniques can be used to remove candidate
itemsets that cannot be frequent because one of their subsets is not frequent.
4. Parallelization: Parallelization techniques can be used to speed up the
computation of frequent itemsets by distributing the computation across
multiple processors or machines.
5. Using advanced data structures: Advanced data structures such as FP-trees
(the basis of the FP-growth algorithm) can be used to compress the
transaction database and mine frequent itemsets without generating
candidates, which reduces memory requirements.
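
The level-wise candidate generation and pruning discussed above can be sketched as follows. This is a simplified illustration, not an optimized implementation: the function name `apriori`, the use of an absolute support count as the threshold, and the sample baskets are all assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support_count}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        prev = list(current)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # One scan of the database per level to count surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support_count}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical baskets, used only to exercise the sketch.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent = apriori(baskets, min_support_count=3)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The repeated database scan inside the loop is where the computational cost listed among the limitations comes from, and the prune step is the point at which candidates are discarded before counting.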

8. Write short notes on i) frequent itemsets ii) association rules.


i) Frequent Itemsets:

Frequent itemsets are a key concept in market basket analysis, which is used to
identify patterns in customer purchase behavior.
A frequent itemset is a set of items that frequently appear together in a
transaction dataset above a certain minimum support threshold.

For example, a frequent itemset in a grocery store dataset might be {bread,
milk, eggs}, which means that these three items are frequently purchased
together by customers.

ii) Association Rules:

Association rules are rules that describe the relationships between different
items in a frequent itemset.

These rules can be used to identify interesting patterns and relationships
between items that can be leveraged to improve business operations.

An association rule is typically represented as "If itemset A, then itemset B",
where A and B are sets of items.

The strength of an association rule is measured by its support and confidence.

For example, an association rule in a grocery store dataset might be "If a
customer buys bread and milk, they are likely to buy eggs with a confidence of
80%". This rule suggests that customers who buy bread and milk are likely to
also buy eggs, with a high degree of confidence.
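
Rules of the form "If A, then B" can be derived from a single frequent itemset by trying each non-empty proper subset as the antecedent. The sketch below assumes hypothetical transactions chosen so that the bread-and-milk rule has a confidence of 80%; the helper name `rules_from_itemset` is also an assumption.

```python
from itertools import combinations

def rules_from_itemset(itemset, transactions, min_confidence):
    """Return (antecedent, consequent, confidence) rules built from one itemset."""
    itemset = frozenset(itemset)

    def count(s):
        # Number of transactions containing every item in s.
        return sum(1 for t in transactions if s <= t)

    whole = count(itemset)
    if whole == 0:
        return []  # the itemset never occurs, so no rule can be formed
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), size)):
            conf = whole / count(antecedent)
            if conf >= min_confidence:
                rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

# Hypothetical grocery transactions.
grocery = [
    {"bread", "milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"eggs"},
    {"bread"},
]
rules = rules_from_itemset({"bread", "milk", "eggs"}, grocery, min_confidence=0.8)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), conf)
```
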

9. Explain multidimensional association rules with suitable example.


Multidimensional association rules are a generalization of traditional association
rules that consider multiple dimensions or attributes of the data. In traditional
association rules, the focus is on finding relationships between items, whereas in
multidimensional association rules, the focus is on finding relationships between
items and other attributes of the data.

For example, consider a sales dataset for a supermarket that includes
information about customer demographics (age, gender, income) in addition to
their purchases. We can use multidimensional association rules to identify
interesting patterns in the data that can help improve the supermarket's
marketing strategies.

Here's an example of a multidimensional association rule:

"If a customer is a male between the ages of 30-40 and has an income of over
$50,000, they are likely to purchase beer and chips together with a confidence of
90%."

In this example, we are not only considering the relationship between items
(beer and chips), but also the customer attributes (gender, age, income). The
confidence of 90% indicates that 90% of the male customers between the ages of
30-40 with an income over $50,000 purchase beer and chips together.

By analyzing multidimensional association rules, the supermarket can gain
insights into the purchasing behavior of different customer segments and tailor
their marketing strategies accordingly. For example, they may create targeted
promotions for male customers between the ages of 30-40 with high incomes to
increase sales of beer and chips.
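
The confidence of such a rule is the share of customers matching the attribute conditions who also bought the items. A minimal sketch, assuming six made-up customer records and a hypothetical helper `matches_segment`:

```python
# Hypothetical customer records combining demographics and purchases.
customers = [
    {"gender": "M", "age": 34, "income": 62000, "items": {"beer", "chips"}},
    {"gender": "M", "age": 38, "income": 71000, "items": {"beer", "chips", "salsa"}},
    {"gender": "M", "age": 31, "income": 55000, "items": {"beer", "chips"}},
    {"gender": "M", "age": 36, "income": 58000, "items": {"milk"}},
    {"gender": "F", "age": 33, "income": 64000, "items": {"beer", "chips"}},
    {"gender": "M", "age": 52, "income": 80000, "items": {"wine"}},
]

def matches_segment(customer):
    """Antecedent: male, aged 30-40, income over 50,000."""
    return (customer["gender"] == "M"
            and 30 <= customer["age"] <= 40
            and customer["income"] > 50000)

segment = [c for c in customers if matches_segment(c)]
buyers = [c for c in segment if {"beer", "chips"} <= c["items"]]

rule_confidence = len(buyers) / len(segment)   # share of the segment that buys both
rule_support = len(buyers) / len(customers)    # share of all customers covered
print(rule_confidence, rule_support)
```
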

10. Problem based on the K-means algorithm.
11. Problem based on K-medoids.
12. Problem based on the Apriori algorithm.
13. Problem based on the Apriori algorithm to find association rules.

Module VI questions
14. Explain the design of BI system for credit card and fraud detection.
The design of a Business Intelligence (BI) system for credit card and fraud
detection involves several key components:

1. Data sources: The BI system must be designed to gather data from
various sources, including credit card transaction data, customer
demographics, and behavioral data. These data sources should be
integrated and consolidated to provide a comprehensive view of the
customer and transaction data.
2. Data warehousing: The integrated data should be stored in a data
warehouse, which provides a central repository for all the data that is
used in the BI system. The data warehouse should be designed to
support complex queries and analysis.
3. ETL processes: Extract, Transform, and Load (ETL) processes are used to
extract data from various sources, transform the data into a common
format, and load the data into the data warehouse. These processes
should be automated to ensure that the data is always up-to-date.
4. Analytics and reporting: The BI system should include analytical and
reporting tools that enable analysts to perform ad-hoc queries, run
reports, and create visualizations of the data. These tools should be user-
friendly and designed to support both technical and non-technical users.
5. Fraud detection algorithms: The BI system should include algorithms for
fraud detection that use machine learning techniques to analyze
transaction data and identify patterns that indicate fraudulent activity.
These algorithms should be regularly updated to stay ahead of new fraud
trends.
6. Real-time monitoring: The BI system should include real-time monitoring
capabilities that enable fraud detection analysts to monitor transactions
in real-time and take immediate action if fraudulent activity is detected.
7. Governance and security: The BI system should be designed to comply
with regulatory requirements and include appropriate security measures
to protect the sensitive data.

15. Define business intelligence with examples.


Business Intelligence (BI) refers to the process of collecting, analyzing,
and presenting data to provide actionable insights that can help
businesses make informed decisions. BI solutions typically involve the
use of data analytics, reporting tools, and visualizations to enable users
to identify patterns, trends, and relationships in their data.

Here are some examples of how business intelligence is used in
different industries:

1. Retail: A retailer might use BI to analyze sales data, customer
demographics, and inventory levels to optimize their pricing and
promotion strategies, and improve their supply chain
management.
2. Healthcare: A healthcare provider might use BI to analyze patient
data to identify disease patterns, track patient outcomes, and
improve their resource allocation.
3. Financial services: A bank might use BI to analyze transaction
data to identify fraud patterns, monitor risk exposure, and
improve their customer experience.
4. Manufacturing: A manufacturer might use BI to analyze
production data to identify inefficiencies, optimize their supply
chain, and improve their product quality.
5. Education: An educational institution might use BI to analyze
student data to identify trends in academic performance, track
graduation rates, and improve student retention.

16. Define decision support system (DSS).


A Decision Support System (DSS) is an information system that utilizes data,
analytical tools, and models to help individuals and organizations make
informed and effective decisions. The purpose of a DSS is to support the
decision-making process by providing users with access to relevant information,
data analysis tools, and modeling capabilities.

A DSS can be used in a variety of applications and domains, including business,
healthcare, finance, and manufacturing. It typically consists of three main
components: data management, model management, and user interface.

17. Explain the structure of decision support system (DSS) with a diagram.

+-----------------------------------+
|          User Interface           |
+-----------------------------------+
              |       ^
              v       |
+-----------------------------------+
|         Model Management          |
+-----------------------------------+
              |       ^
              v       |
+-----------------------------------+
|          Data Management          |
+-----------------------------------+

As shown in the diagram, a DSS typically consists of three main components:

1. Data Management: This component is responsible for collecting, storing,
and managing the data required by the DSS. The data may be obtained
from various sources such as databases, spreadsheets, or other data
repositories. The data management component includes data cleaning,
transformation, and integration processes.
2. Model Management: This component contains the analytical models and
tools that are used to analyze the data and support decision-making.
These models may include statistical models, optimization models,
simulation models, or other types of models. The model management
component also includes model development, validation, and testing
processes.
3. User Interface: This component provides a means for users to interact
with the DSS and access the results of the analysis. The user interface may
be a graphical user interface (GUI), a command-line interface, or a web-
based interface. The user interface component also includes user training
and support processes.

18. Explain Business Intelligence issues.


1. Data Quality: BI systems rely heavily on accurate and reliable data. If the
data is inaccurate, incomplete, or inconsistent, it can lead to incorrect
conclusions and poor decision-making. Ensuring data quality requires a
comprehensive data management strategy that includes data cleansing,
validation, and monitoring.
2. Data Integration: Organizations often have data spread across different
systems, applications, and departments. Integrating this data into a
unified view can be a complex process. Data integration issues can
include data inconsistency, data duplication, and the need for data
mapping.
3. Data Security: BI systems contain sensitive data that needs to be
protected from unauthorized access. Data security issues include data
breaches, data theft, and data loss. Organizations need to implement
robust security measures such as access controls, encryption, and
monitoring to protect their data.
4. User Adoption: BI systems can be complex and difficult to use, and users
may not fully understand their capabilities. Lack of user adoption can lead
to underutilization of the system, decreased ROI, and missed
opportunities for data-driven decision-making. To ensure user adoption,
organizations need to provide comprehensive training and support for
their users.
5. Cost: Implementing a BI system can be expensive, especially for small and
medium-sized businesses. Costs can include hardware and software
acquisition, data management, system integration, and user training. To
maximize the ROI, organizations need to carefully evaluate the costs and
benefits of their BI investments.

19. Explain BI architecture/components with diagram.


+------------------------------------+
|    BI Front-End Tools & Reports    |
+------------------------------------+
              |       ^
              v       |
+------------------------------------+
|      BI Data Processing Layer      |
+------------------------------------+
              |       ^
              v       |
+------------------------------------+
|           Data Warehouse           |
+------------------------------------+
              |       ^
              v       |
+------------------------------------+
|      Operational Data Sources      |
+------------------------------------+

As shown in the diagram, a BI architecture typically consists of four main
components:

1. Operational Data Sources: These are the various systems and applications
that generate the data used in BI, such as transactional systems, CRM
systems, and ERP systems.
2. Data Warehouse: The data warehouse is a central repository that stores
the data used for BI analysis. The data in the warehouse is structured,
integrated, and optimized for reporting and analysis.
3. BI Data Processing Layer: This layer processes and prepares the data for
analysis. It includes ETL (Extract, Transform, Load) processes, data
modeling, and data aggregation.
4. BI Front-End Tools & Reports: These are the tools and applications that
enable users to interact with the data and generate reports and
visualizations. Examples of front-end tools include dashboards, ad hoc
reporting tools, and data visualization tools.

20. What is the purpose of a Business Intelligence system?

The primary purpose of a Business Intelligence (BI) system is to provide
organizations with the ability to analyze their data and gain valuable insights
that can inform their business decisions. Specifically, BI systems can help
organizations:

1. Gain a better understanding of their business performance: By analyzing
data from various sources, organizations can gain insights into their sales,
marketing, finance, and operations, and identify areas for improvement.
2. Identify trends and patterns: BI systems can help organizations identify
trends and patterns in their data, such as changes in customer behavior,
market trends, or production cycles.
3. Make data-driven decisions: BI systems provide decision-makers with the
information they need to make informed and effective business decisions.
This can include insights into customer behavior, market trends, and
financial performance.
4. Monitor key performance indicators (KPIs): BI systems can help
organizations track their KPIs, such as revenue, profit margins, customer
satisfaction, and operational efficiency.
5. Improve operational efficiency: BI systems can help organizations identify
inefficiencies in their operations and take steps to streamline processes
and reduce costs.

21. What are the characteristics of Business Intelligence Analysis approach?


1. Data-driven: BI analysis is based on data, and the insights are derived
from analyzing and interpreting that data.
2. Multidisciplinary: BI analysis draws on multiple disciplines, including
statistics, data science, computer science, and business management.
3. Holistic: BI analysis takes a holistic view of the business, considering all
relevant data and factors that affect performance and decision-making.
4. Iterative: BI analysis is an iterative process, where insights are refined over
time based on new data and feedback from stakeholders.
5. Goal-oriented: BI analysis is focused on achieving specific business goals,
such as increasing revenue, reducing costs, or improving customer
satisfaction.
6. Collaborative: BI analysis involves collaboration between business
stakeholders, data analysts, and IT professionals to ensure that the
insights generated are relevant, accurate, and actionable.
7. Visual: BI analysis often involves data visualization techniques, such as
charts, graphs, and dashboards, to help stakeholders understand and
interpret the data.

22. Explain Data, Information and knowledge.


1. Data: Data refers to raw facts and figures that have not been organized
or processed in any meaningful way. Data can be quantitative or
qualitative and may be represented in various forms, such as numbers,
text, or images.
2. Information: Information is data that has been organized, processed, and
contextualized to make it meaningful and useful. Information provides
context and meaning to data, helping to answer questions and make
decisions.
3. Knowledge: Knowledge is the application of information and experience
to solve problems or create new insights. Knowledge involves
understanding the relationships between different pieces of information
and being able to use that understanding to inform decision-making.

23. What are the advantages of adopting mathematical models?


1. Improved accuracy: Mathematical models provide a more accurate
representation of complex systems than traditional methods of analysis,
such as trial and error or intuition-based decision-making.
2. Better decision-making: Mathematical models enable decision-makers to
simulate different scenarios and predict outcomes, helping them to make
better-informed decisions.
3. Increased efficiency: Mathematical models can help organizations
optimize their processes and operations, reducing waste and increasing
efficiency.
4. Faster results: Mathematical models can quickly analyze large amounts of
data and provide insights that might take human analysts weeks or
months to generate.

24. Explain business applications where data mining can be used.


1. Customer segmentation: Data mining can be used to segment customers
based on their behavior, preferences, and demographics, allowing
businesses to tailor their marketing and sales strategies to different
customer groups.
2. Fraud detection: Data mining can help identify fraudulent activity by
analyzing patterns and anomalies in financial transactions.
3. Inventory management: Data mining can be used to optimize inventory
management by predicting demand and identifying patterns in customer
buying behavior.
4. Risk assessment: Data mining can help businesses assess risk by analyzing
data from various sources, such as financial records, customer behavior,
and market trends.

25. Explain data mining used in recommendation system.


Data mining is a crucial component in building recommendation systems, which
are used to suggest products or services to customers based on their past
behavior and preferences. There are different types of recommendation systems,
such as collaborative filtering and content-based filtering, but both rely on data
mining techniques to extract insights from large datasets.

In collaborative filtering, data mining is used to identify patterns and similarities
in customer behavior, such as the items they have purchased or rated. The
system then recommends items that customers with similar behavior have also
shown interest in. Collaborative filtering algorithms rely on data such as
customer ratings and purchase histories to build a matrix of user-item
preferences, which is used to identify patterns and generate recommendations.

In content-based filtering, data mining is used to analyze the features and
characteristics of items, such as their genre, price, or other attributes. The system
then recommends items that are similar to those that the customer has shown
interest in based on these characteristics. Content-based filtering algorithms rely
on data such as product descriptions and attributes to identify patterns and
generate recommendations.
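
The collaborative filtering idea described above can be sketched with a tiny user-item rating table. This is an illustrative sketch only: the users, items, and ratings are made up, cosine similarity is one assumed choice of similarity measure, and `cosine` and `recommend` are hypothetical helper names.

```python
from math import sqrt

# Hypothetical user-item ratings (1 to 5).
ratings = {
    "alice": {"laptop": 5, "mouse": 4, "keyboard": 4},
    "bob":   {"laptop": 5, "mouse": 5, "monitor": 4},
    "carol": {"novel": 5, "cookbook": 4, "mouse": 1},
}

def cosine(u, v):
    """Cosine similarity between two users' rating dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if dot else 0.0

def recommend(user, ratings):
    """Suggest items rated by the most similar user that this user has not rated."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    return sorted(item for item in ratings[nearest] if item not in ratings[user])

print(recommend("alice", ratings))
```

Here the user with the most similar ratings to alice is bob, so the sketch suggests the one item bob rated that alice has not.
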

26. Draw and explain the different phases in the development of a Decision
Support System (DSS).

1. Planning: The first phase in developing a DSS is to define the problem or
decision to be addressed and determine the scope and objectives of the
project. This involves identifying stakeholders, gathering requirements,
and determining the resources and constraints of the project.
2. Analysis: The second phase involves gathering and analyzing data
relevant to the decision or problem being addressed. This may involve
collecting data from various sources, such as databases, surveys, or other
sources. Data analysis techniques such as statistical analysis, data mining,
and machine learning may be used to identify patterns and relationships
in the data.
3. Design: In the design phase, the DSS architecture and components are
planned and designed. This includes defining the user interface, data
sources, algorithms, and other components of the system. The design
phase may also involve developing prototypes and conducting usability
testing.
4. Development: The development phase involves building and
implementing the DSS based on the design specifications. This may
involve programming, database development, and integration with other
systems.
5. Testing: In the testing phase, the DSS is tested to ensure that it meets the
requirements and specifications. This may involve testing different
scenarios and use cases, as well as evaluating performance and scalability.
6. Deployment: Once the DSS has been tested and validated, it can be
deployed to users. This may involve training users and providing
documentation and support.
7. Maintenance: The final phase involves ongoing maintenance and support
of the DSS. This includes monitoring performance, addressing issues and
bugs, and making updates and enhancements as needed.

27. Explain Analysis, Design, Planning, Implementation and control phases in the
development of a Business Intelligence system.
1. Planning: The first phase involves defining the scope, objectives, and
requirements of the BI system. This includes identifying stakeholders,
determining data sources, and assessing the resources and constraints of
the project. Planning also involves developing a project plan and timeline,
as well as securing funding and resources.
2. Analysis: The analysis phase involves gathering and analyzing data from
various sources to identify patterns, trends, and insights. This may involve
data mining, statistical analysis, and other techniques to extract useful
information from large datasets. The goal of this phase is to gain a
deeper understanding of the data and identify key metrics and KPIs.
3. Design: In the design phase, the BI system architecture and components
are planned and designed. This includes defining the data warehouse or
data mart structure, developing ETL processes to extract, transform, and
load data, and designing dashboards, reports, and other user interfaces.
The design phase may also involve developing prototypes and
conducting usability testing.
4. Implementation: In the implementation phase, the BI system is built and
deployed based on the design specifications. This may involve
programming, database development, and integration with other
systems. The implementation phase may also include data cleansing,
normalization, and other data quality activities.
5. Testing: In the testing phase, the BI system is tested to ensure that it
meets the requirements and specifications. This may involve testing
different scenarios and use cases, as well as evaluating performance and
scalability. Testing may include both technical testing and user
acceptance testing.
6. Control: Once the BI system is deployed and in use, the control phase
involves monitoring and maintaining the system. This includes
monitoring performance, addressing issues and bugs, and making
updates and enhancements as needed. Ongoing data quality and
governance activities are also critical in the control phase.

28. Describe effectiveness and efficiency of Decision-making process.


The effectiveness and efficiency of the decision-making process are two
important factors in determining the success of a decision. Here's a brief
explanation of each:

Effectiveness: The effectiveness of a decision refers to the extent to which it
achieves the intended goal or objective. An effective decision is one that results
in the desired outcome or solution to the problem at hand. To ensure
effectiveness, it is important to define clear goals and objectives, gather and
analyze relevant information, consider various options, and make a decision that
aligns with the organization's values and priorities.

Efficiency: The efficiency of a decision-making process refers to the amount of
resources used to achieve the intended goal or objective. An efficient decision-
making process uses the minimum amount of resources (time, money, people)
to arrive at a decision. To ensure efficiency, it is important to streamline the
decision-making process, eliminate unnecessary steps, and use the appropriate
tools and techniques to gather and analyze information.
