
SALALE UNIVERSITY

COLLEGE OF NATURAL SCIENCE


DEPARTMENT OF COMPUTER SCIENCE

Data Mining and Data Warehouse

Prepared by: Adugna Gutema


Id No: Ru 0155/13
Email: adug363@gmail.com

Submitted to: Mr. Mesfin. A


MARCH 20, 2024
Fitche, Ethiopia

Defining Data Mining and Data Warehouse

Questions

1. Define Data Mining and discuss the tasks in Data Mining.


2. Define Data Warehouse and compare and contrast On-Line Transaction Processing (OLTP)
and On-Line Analytical Processing (OLAP).
Let me start by answering the first question about data mining and tasks in data mining:
Rapid advances in data collection and storage technology have enabled organizations
to accumulate vast amounts of data. However, extracting useful information from this data
has proven extremely challenging. Because of the massive size of the data, traditional data
analysis tools and techniques cannot be used, and thus new methods need to be developed [1].
Data mining is a technology that blends traditional data analysis methods with
sophisticated algorithms for processing large volumes of data. It has also opened up
exciting opportunities for exploring and analyzing new types of data and for analyzing old
types of data in new ways.
What is Data Mining?
→ Data mining is the process of automatically discovering useful information in large data
repositories.
→ Data mining is the process of extracting useful information, insights, and knowledge from
large datasets using various techniques from statistics, machine learning, and database
systems. It is also the extraction of useful patterns and relationships from data sources, such
as databases, texts, and the web. It uses statistical and pattern-matching techniques [2].
Importance of data mining:
• Some of the areas where data mining helps are detecting fraud, spam filtering, managing
risks, and cybersecurity.
• In the marketing sector, it helps in forecasting customer behavior.
• In the banking sector, it can help in determining fraudulent transactions.
• Data mining is not only beneficial for organizations; from governments to healthcare, it
is used everywhere.


Data Mining and Knowledge Discovery


Data mining is an integral part of knowledge discovery in databases (KDD), which is the
overall process of converting raw data into useful information. This process consists of a
series of transformation steps, from data preprocessing to postprocessing of data mining
results.
Data Mining, also popularly known as Knowledge Discovery in Databases (KDD), refers
to the nontrivial extraction of implicit, previously unknown and potentially useful
information from data in databases. While data mining and knowledge discovery in
databases (or KDD) are frequently treated as synonyms, data mining is actually part of the
knowledge discovery process.
Knowledge discovery in databases (KDD), knowledge extraction, pattern discovery, data
dredging, data archaeology, and data pattern processing are some alternative names for
data mining that reflect the diverse aspects and applications of the field.
Sales and marketing, banking and finance, health care and insurance, the retail industry,
higher education, and the telecommunications industry are some of the application areas
of data mining.
Goals of Data Mining:
• Prediction: Determine how certain attributes will behave in the future. For
example, how much sales volume a store will generate in a given period (a small
illustrative sketch follows this list).
• Identification: Identify patterns in data. For example, newlywed couples tend to
spend more money on furniture.
• Classification: Partition data into classes. For example, customers can be classified
into different categories with different behavior in shopping.
• Optimization: Optimize the use of limited resources such as time, space, money, or
materials. For example, how to best use advertising to maximize profit (sales) [3].
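
To make the prediction goal above concrete, here is a minimal sketch, assuming Python with
NumPy and scikit-learn are available; the monthly sales figures are invented purely for
illustration and do not come from the cited sources.

# A minimal prediction sketch: fit a trend to past monthly sales and
# estimate the sales volume a store will generate in the next period.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([[1], [2], [3], [4], [5], [6]])   # time index (feature)
sales = np.array([120, 135, 150, 160, 172, 185])    # invented units sold per month

model = LinearRegression()
model.fit(months, sales)                             # learn the linear trend

# Predict the sales volume for month 7 (the "given period" in the example above).
print(model.predict(np.array([[7]])))                # roughly 198 units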

Challenging problems in Data Mining:

1. Developing a Unifying Theory of Data Mining.


2. Scaling Up for High-Dimensional Data and High-Speed Data Streams.
3. Mining Sequence Data and Time Series Data.


4. Mining Complex Knowledge from Complex Data.


5. Data Mining in a Network Setting.

Tasks of Data Mining

(Diagram: data mining tasks are grouped into predictive tasks, namely classification,
anomaly detection, and time series forecasting, and descriptive tasks, namely association,
summarization, and clustering.)

Data mining involves six common classes of tasks:

1. Anomaly detection (outlier/change/deviation detection): - the identification of unusual
data records that might be interesting, or of data errors that require further investigation.
2. Association rule learning (dependency modeling): - searches for relationships between
variables. For example, a supermarket might gather data on customer purchasing habits. Using
association rule learning, the supermarket can determine which products are frequently
bought together and use this information for marketing purposes. This is sometimes referred
to as market basket analysis (see the sketch after this list).
3. Clustering: - the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
4. Classification: - the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
5. Regression: - attempts to find a function that models the data with the least error, that is,
for estimating the relationships among data or datasets.
6. Summarization: - providing a more compact representation of the data set, including
visualization and report generation [4].
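
As referenced in item 2 above, here is a minimal market basket sketch, assuming Python with
pandas is available; the baskets and product names are invented solely for illustration.

# Count how often pairs of products are bought together and compute their
# support (the fraction of baskets that contain the pair). Data is invented.
from collections import Counter
from itertools import combinations
import pandas as pd

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "diapers", "beer"},
    {"bread", "butter", "diapers"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

support = pd.Series(pair_counts).div(len(baskets)).sort_values(ascending=False)
print(support.head())   # ("bread", "butter") appears in 3 of 4 baskets, so support = 0.75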


Let me dive into the second question about the definition of a Data Warehouse and
the comparison between OLTP and OLAP:

→ A Data Warehouse is a subject-oriented, integrated, time-variant, and non-volatile
collection of data in support of the management decision-making process. What is meant
by subject-oriented, integrated, time-variant, and non-volatile?
• Subject-oriented: - Stored data targets specific subjects.

For example, it may store data regarding total Sales, Number of Customers, etc.,
and not general data on everyday operations.

• Integrated: - Data may be distributed across heterogeneous sources, which have to
be integrated.

Example: Sales data may be in a relational database (RDB), customer information in flat files, etc.

• Time-variant: - Data stored may not be current; it varies with time, and the data have an
element of time.

Example: Data of sales in last 5 years, etc.

• Non-volatile: - It is separate from the Enterprise Operational Database and hence is
not subject to frequent modification. It generally has only two operations performed
on it: loading of data and access of data [5].

Goals of Data Warehouse:

Make an organization's information easily accessible: The contents of the data warehouse must
be understandable, intuitive, and obvious to the business user.

Present the organization's information consistently: Consistent information means high-quality
information. It means that all the data is accounted for and complete.

Be adaptive and resilient to change: We simply can't avoid change. User needs, business
conditions, data, and technology are all subject to the shifting sands of time. The data warehouse
must be designed to handle this inevitable change.


Be a secure bastion that protects our information assets: The data warehouse must effectively
control access to the organization's confidential information.

Why is a Data Warehouse separated from operational databases? Because:

• An operational database is constructed for well-known tasks and workloads such as
searching for particular records, indexing, etc. In contrast, data warehouse queries are often
complex and present a general form of data.
• An operational database maintains current data. On the other hand, a data warehouse
maintains historical data [6].

Data Warehouse (OLAP) | Operational Database (OLTP)
It involves the historical processing of information. | It involves day-to-day processing.
It focuses on information. | It focuses on data.
It is highly flexible. | It provides high performance.
It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data.
It is based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity Relationship Model.

Data Warehouse Design processes and steps:

Top-Down Approach: Starts with overall design and planning, useful where the technology is
mature and well known.

Bottom-Up Approach: Starts with experiments and prototypes, allows an organization to move
forward at considerably less expense, and allows it to evaluate the benefits of the technology before
making significant commitments.

Combined Approach: Exploits the planned and strategic nature of the top-down approach and
retains the rapid implementation and opportunistic application of the bottom-up approach. These
are processes in Data Warehouse design.


Steps of Data Warehouse Design:

1. Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, and the general ledger.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data
to be represented in the fact table for this process, for example, individual transactions,
individual daily snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are
numeric additive quantities like dollars sold and units sold (a small illustrative sketch
follows these steps).
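
To illustrate these four steps, here is a minimal sketch of a star-schema-style fact table,
assuming pandas is available; the table names, columns, and figures are assumptions invented
for this example and are not taken from the cited sources.

# Tiny star-schema sketch: dimension tables plus a fact table whose grain is
# one row per item per customer per day, carrying additive measures. Invented data.
import pandas as pd

dim_time = pd.DataFrame({"time_id": [1, 2], "date": ["2024-03-01", "2024-03-02"]})
dim_item = pd.DataFrame({"item_id": [10, 11], "item_name": ["pen", "notebook"]})
dim_customer = pd.DataFrame({"customer_id": [100], "customer_name": ["Abebe"]})

fact_sales = pd.DataFrame({
    "time_id":      [1, 1, 2],
    "item_id":      [10, 11, 10],
    "customer_id":  [100, 100, 100],
    "units_sold":   [3, 1, 2],          # measure (step 4)
    "dollars_sold": [9.0, 5.0, 6.0],    # measure (step 4)
})

# Joining the fact table to a dimension answers a typical warehouse question:
# total dollars sold per item.
report = (fact_sales.merge(dim_item, on="item_id")
                    .groupby("item_name")["dollars_sold"].sum())
print(report)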

OLAP Operations

Roll-up: Performs aggregation on a data cube, either by climbing up a concept hierarchy for
a dimension or by dimension reduction.

Drill-down: This can be realized by either stepping down a concept hierarchy for a dimension
or introducing additional dimensions.

Slice and Dice: Slice performs a selection on one dimension of the given cube, resulting in a
sub-cube. Dice defines a sub-cube by performing a selection on two or more dimensions.

Pivot (rotate): It’s a visualization operation that rotates the data axes in view to provide an
alternative presentation of the data.
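
As a rough illustration of these operations, here is a minimal sketch that treats a flat
pandas table as the cube; the dimensions (year, quarter, city) and the figures are
assumptions made up for this example.

# Roll-up, drill-down, slice, dice, and pivot expressed as pandas operations
# over an invented sales table (each row is one cell of the cube).
import pandas as pd

sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2024, 2024, 2024],
    "quarter": ["Q1", "Q2", "Q2", "Q1", "Q1", "Q2"],
    "city":    ["Fitche", "Fitche", "Addis Ababa", "Addis Ababa", "Fitche", "Addis Ababa"],
    "amount":  [100, 150, 200, 120, 80, 220],
})

rollup = sales.groupby("year")["amount"].sum()                       # climb quarter -> year
drilldown = sales.groupby(["year", "quarter"])["amount"].sum()       # step back down to quarter
slice_2024 = sales[sales["year"] == 2024]                            # select on one dimension
dice = sales[(sales["year"] == 2024) & (sales["city"] == "Fitche")]  # select on two dimensions
pivot = sales.pivot_table(index="city", columns="year",
                          values="amount", aggfunc="sum")            # rotate the axes

print(rollup, drilldown, pivot, sep="\n\n")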

Characteristics of Data Warehouse:

• Subject-oriented
• Time-variant
• Integrated
• Stores large amounts of data
• Read-optimized, persistent, and non-volatile

Before diving into comparing and contrasting OLTP and OLAP, let me point out some basics
of both OLTP and OLAP:

On-line Transaction Processing: - Consider a point-of-sale (POS) system in a supermarket


store. You have picked a bar of chocolate and await your chance in the queue for getting it


billed. The cashier scans the chocolate bar's bar code. Consequent to the scanning of the bar
code, some activities take place in the background: the database is accessed; the price and
product information is retrieved and displayed on the computer screen; the cashier feeds in the
quantity purchased; the application then computes the total, generates the bill, and prints it.
You pay the cash and leave. The application has just added a record of your purchase in its
database. This was an On-Line Transaction Processing (OLTP) system designed to support
online transactions and query processing. OLTP systems refer to a class of systems that manage
transaction-oriented applications.

On-line Analytical Processing: - OLAP is a category of software that allows users to analyze
information from multiple database systems at the same time. It is a technology that enables
analysts to extract and view business data from different points of view. Analysts frequently
need to group, aggregate, and join data. These operations in relational databases are resource-
intensive. With OLAP, data can be pre-calculated and pre-aggregated, making analysis faster.

Let’s dive into the comparison:

Parameters | OLTP | OLAP
Process | It is an online transactional system; it manages database modification. | It is an online analysis and data-retrieving process.
Table | Tables in the OLTP database are normalized. | Tables in the OLAP database are not normalized.
Functionality | OLTP is an online database modifying system. | OLAP is an online database query management system.
Method | OLTP uses a traditional DBMS. | OLAP uses the data warehouse.
Query | Insert, Update, and Delete information from the database. | Mostly Select operations.
Response time | Its response time is in milliseconds. | Its response time is in seconds to minutes.


References

[1] M. S. V. K. Pang-Ning Tan, "Center for Earth Observation and Modelling," July 2021. [Online].
Available: https://www.ceom.ou.edu. [Accessed 18 March 2024].

[2] D. P. B. Ms. Aruna J. Chamatkar, "International Journal of Engineering Research and Application,"
May 2014. [Online]. Available: https://www.ijera.com. [Accessed 18 March 2024].

[3] "Studocu," September 2013. [Online]. Available: https://www.studocu.com/. [Accessed 18 March 2024].

[4] M. A. A. S. F. Khan Mohammed and S. Affan, "International Journal for Research Trends and
Innovation," 9 September 2018. [Online]. Available: https://www.ijnrd.org. [Accessed 18 March 2024].

[5] R. Sharma, "Stony Brook University," [Online]. Available: https://www.cs.stonybrook.edu.
[Accessed 19 March 2024].

[6] S. A. RN Prasad, "Punjabi University, Patiala," [Online]. Available:
https://www.punjabiuniversity.ac.in. [Accessed 19 March 2024].

