Subject-Oriented
Provides a concise and straightforward view of a particular subject, such as
customer, product, or sales, rather than of the organization's ongoing operations as a whole.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat
files, and online transaction records.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older periods from a data warehouse.
Non-Volatile
Non-volatile means that once data has entered the warehouse, it should not
change.
Goals of Data Warehousing
o To support reporting as well as analysis
o To maintain the organization's historical information
o To be the foundation for decision making
Internal Data: In each organization, the client keeps their "private" spreadsheets,
reports, customer profiles, and sometimes even department databases.
Archived Data: In every operational system, we periodically take the old data and
store it in archived files.
We will now discuss the three primary functions that take place in
the staging area.
1) Data Extraction: This method has to deal with numerous data sources. We have
to employ the appropriate techniques for each data source.
2) Data Transformation: First, we clean the data extracted from each source.
Cleaning may involve correcting misspellings, providing default values for missing
data elements, or eliminating duplicates when we bring in the same data from
various source systems.
3) Data Loading: Two distinct categories of tasks form the data loading function.
When we complete the structure and construction of the data warehouse and go live
for the first time, we perform the initial load, which moves high volumes of data and
takes a substantial amount of time.
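The three staging-area functions above can be sketched in a few lines of Python. This is a minimal illustration, not a real warehouse pipeline: the source systems, field names, and cleaning rules are all made-up assumptions.

```python
# Minimal sketch of the three staging-area functions: extraction,
# transformation (cleaning), and loading. Source records, field names,
# and cleaning rules are hypothetical.

def extract(sources):
    """Pull raw records from each (hypothetical) source system."""
    for source in sources:
        for record in source:
            yield record

def transform(records):
    """Clean: fix a known misspelling, default missing values, drop duplicates."""
    seen = set()
    for rec in records:
        rec = dict(rec)
        rec["city"] = rec.get("city") or "UNKNOWN"          # default for missing data
        rec["city"] = {"Nwe York": "New York"}.get(rec["city"], rec["city"])
        key = (rec["customer_id"], rec["city"])
        if key not in seen:                                  # eliminate duplicates
            seen.add(key)
            yield rec

def load(records, warehouse):
    """Initial load: move all cleaned rows into the warehouse table."""
    warehouse.extend(records)
    return len(warehouse)

warehouse = []
crm = [{"customer_id": 1, "city": "Nwe York"}, {"customer_id": 2, "city": None}]
erp = [{"customer_id": 1, "city": "New York"}]               # same data, other system
n = load(transform(extract([crm, erp])), warehouse)
print(n, warehouse)
```

Note how the duplicate customer arriving from the second source system is silently dropped during transformation, so only two rows reach the warehouse.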
Metadata Component
Metadata in a data warehouse is analogous to the data dictionary or the data catalogue in
a database management system. In the data dictionary, we keep the data about the
logical data structures, the data about the records and addresses, the information
about the indexes, and so on.
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of
users. Data marts are smaller than data warehouses and usually contain data that
applies to a single part of the organization. The current trend in data warehousing is
to develop a data warehouse with several smaller related data marts for particular
kinds of queries and reports.
It works with the database management system and ensures that data is
correctly saved in the repositories.
It monitors the movement of information into the staging area and from
there into the data warehouse storage itself.
Database: the data is taken as a base and managed to provide fast and
efficient access.
Data warehouse: the application data is handled for analysis and reporting
objectives.
The Operational Database is the source of information for the data warehouse. It
includes detailed information used to run the day-to-day operations of the business.
Flat Files : a system of files in which transactional data is stored, and every file in the
system must have a different name.
Meta Data : A set of data that defines and gives information about other data.
Meta Data summarizes necessary information about data, which can make finding
and working with particular instances of data easier. For example, the author, date
created, date modified, and file size are examples of very basic document
metadata.
Lightly and highly summarized data: The goal of the summarized information is
to speed up query performance. The summarized record is updated continuously as
new information is loaded into the warehouse.
A staging area simplifies data cleansing and consolidation for operational data
coming from multiple source systems.
Load Performance
Data warehouses require incremental loading of new data on a periodic basis within
narrow time windows; performance of the load process should be measured in
hundreds of millions of rows and gigabytes per hour and must not artificially
constrain the volume of data the business requires.
Load Processing
Many steps must be taken to load new or updated data into the data warehouse,
including data conversion, filtering, reformatting, indexing, and metadata update.
Fact-based management demands the highest data quality. The warehouse ensures
local consistency, global consistency, and referential integrity despite "dirty" sources
and massive database size.
Query Performance
Terabyte Scalability
Data warehouse sizes are growing at astonishing rates. Today they range from a few
gigabytes to hundreds of gigabytes, and on to terabyte-sized data warehouses.
ETL stands for Extract, Transform, and Load, while ELT stands for Extract,
Load, and Transform.
In ETL, data flow from the data source to staging to the data destination.
ELT lets the data destination do the transformation, eliminating the need for
data staging.
ETL can help with data privacy and compliance by cleansing sensitive data
before loading it into the data destination, while ELT is simpler and suited to
companies with minor data needs.
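The ELT pattern described above can be sketched with an in-memory SQLite database standing in for the destination warehouse: raw rows are loaded first, untransformed, and the aggregation then runs inside the destination itself. The table and column names here are invented for illustration.

```python
# Sketch of the ELT pattern: load raw rows into the destination first,
# then transform with the destination's own SQL engine. SQLite stands in
# for the target warehouse; table and column names are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_sales (region TEXT, amount TEXT)")

# Load: copy source rows verbatim, no staging area in between.
rows = [("north", "10.5"), ("south", "4.0"), ("north", "2.5")]
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)

# Transform: performed after loading, inside the destination.
conn.execute("""
    CREATE TABLE sales AS
    SELECT region, SUM(CAST(amount AS REAL)) AS total
    FROM raw_sales GROUP BY region
""")
totals = dict(conn.execute("SELECT region, total FROM sales"))
print(totals)
```

In an ETL pipeline, by contrast, the `CAST` and `SUM` work would happen in the staging area before any row reached the destination.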
These are the following areas where data mining is widely used:
Data Mining in Healthcare:
Data mining in healthcare has excellent potential to improve the health system. It
uses data and analytics for better insights and to identify best practices that will
enhance health care services and reduce costs.
This technique may enable the retailer to understand the purchase behavior of a
buyer. This data may assist the retailer in understanding the requirements of the
buyer and altering the store's layout accordingly.
A model is constructed using the data, and the technique is used to identify whether
a document is fraudulent or not.
Apprehending a criminal is one thing, but extracting the truth from them is a very
challenging task. Law enforcement may use data mining techniques to
investigate offenses, monitor suspected terrorist communications, and so on.
The process of extracting useful data from large volumes of data is data mining. The
data in the real-world is heterogeneous, incomplete, and noisy. Data in huge
quantities will usually be inaccurate or unreliable. These problems may occur due to
faulty data measuring instruments or because of human errors. All these consequences
(noisy and incomplete data) make data mining challenging.
Data Distribution:
Complex Data:
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms
and techniques used. If the designed algorithm and techniques are not up to the
mark, then the efficiency of the data mining process will be affected adversely.
Data Privacy and Security:
Data mining usually leads to serious issues in terms of data security, governance, and
privacy. For example, if a retailer analyses the details of the purchased items, then it
reveals data about buying habits and preferences of the customers without their
permission.
Data Visualization:
It is the primary method that shows the output to the user in a presentable way. The
extracted data should convey the exact meaning of what it intends to express. But
many times, representing the information to the end-user in a precise and easy way
is difficult. Highly efficient and effective data visualization processes need to be
implemented to make it successful.
Data Mining Techniques
In recent data mining projects, various major data mining techniques have been
developed and used, including association, classification, clustering, prediction,
sequential patterns, and regression.
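The first technique in the list, association, can be illustrated by the classic market-basket case mentioned earlier for retail: counting how often item pairs are bought together. The baskets and items below are toy data, and this counts only pair support rather than running a full Apriori algorithm.

```python
# Tiny illustration of the association technique: pair co-occurrence
# counts over shopping baskets (market-basket style). Baskets are toy data.
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):  # every item pair in the basket
        pair_counts[pair] += 1

# Support of a pair = fraction of baskets containing both items.
support = {pair: c / len(baskets) for pair, c in pair_counts.items()}
print(support[("bread", "milk")])   # bread and milk co-occur in 2 of 4 baskets
```

A retailer would keep only pairs whose support clears some threshold, and such pairs are exactly the raw material for "customers who bought X also bought Y" style rules.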
Data Mining Architecture
The significant components of data mining systems are a data source, data mining
engine, data warehouse server, the pattern evaluation module, graphical user
interface, and knowledge base.
KDD- Knowledge Discovery in Databases
The term KDD stands for Knowledge Discovery in Databases. It refers to the broad
procedure of discovering knowledge in data and emphasizes the high-level
applications of specific Data Mining techniques. It is a field of interest to researchers
in various fields, including artificial intelligence, machine learning, pattern
recognition, databases, statistics, knowledge acquisition for expert systems, and data
visualization.
The main objective of the KDD process is to extract information from data in the
context of large databases. It does this by using Data Mining algorithms to identify
what is deemed knowledge.
The process begins with determining the KDD objectives and ends with the
implementation of the discovered knowledge.
Learning
Unsupervised Learning: makes use of input data without any labels — in
other words, no teacher (label) telling the child (computer) when it is right
or when it has made a mistake so that it can self-correct. It tries to learn the
basic structure of the data to give us more insight into the data.
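A standard example of unsupervised learning is clustering. The sketch below is a bare-bones 1-D k-means with k=2: no labels are supplied, yet the loop discovers the two groups hiding in the toy data on its own. The data and starting centers are arbitrary assumptions.

```python
# Minimal unsupervised-learning sketch: 1-D k-means with k=2.
# No labels are given; the algorithm finds the cluster structure itself.

def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Move each center to the mean of the points assigned to it.
        centers = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]   # two obvious groups, but unlabeled
found = kmeans_1d(data, centers=[0.0, 5.0])
print(found)
```

The two returned centers land near 1.0 and 9.5, i.e. the means of the two natural clusters, even though nothing told the algorithm those groups existed.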
Classification
A regression problem has a real number (a number with a decimal point) as its
output.
We have an independent variable (or set of independent variables) and a
dependent variable (the thing we are trying to guess given our independent
variables).
Also, each row is typically called an example, observation, or data point, while
each column (not including the label/dependent variable) is often called
a predictor, dimension, independent variable, or feature.
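These terms can be made concrete with the simplest regression there is: fitting a line by ordinary least squares, where `xs` plays the independent variable (feature) and `ys` the dependent variable (label). The numbers are toy data chosen so the answer is easy to check.

```python
# Sketch of a regression problem: one independent variable x, one
# dependent variable y, fit y = a*x + b by ordinary least squares.

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x          # intercept from the means
    return a, b

xs = [1, 2, 3, 4]          # feature / independent variable
ys = [3, 5, 7, 9]          # label / dependent variable (generated as y = 2x + 1)
a, b = fit_line(xs, ys)
print(a, b)                # slope and intercept
```

The output is a real number (the predicted y for any x), which is exactly what distinguishes a regression problem from a classification one.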
Decision Tree
A decision tree is a structure that includes a root node, branches, and leaf
nodes.
Each internal node denotes a test on an attribute, each branch denotes the
outcome of a test, and each leaf node holds a class label.
The topmost node in the tree is the root node.
Can be applied to classification and prediction.
Generate rules: simple interpretation & understanding.
The benefits of having a decision tree are as follows −
It does not require any domain knowledge.
It is easy to comprehend.
The learning and classification steps of a decision tree are simple and
fast.
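A decision tree of this shape can be written out directly as nested tests. The sketch below hand-builds a small play-tennis-style tree (the attributes and labels are invented, not learned from data): the root and internal nodes each test one attribute, each branch is one outcome of the test, and each leaf returns a class label.

```python
# Hand-built decision tree matching the structure described above:
# root and internal nodes test an attribute, branches carry the test
# outcome, leaves hold a class label. Data/attributes are made up.

def classify(example):
    # Root node: test on the 'outlook' attribute.
    if example["outlook"] == "sunny":
        # Internal node: test on 'humidity'.
        if example["humidity"] == "high":
            return "no"            # leaf: class label
        return "yes"               # leaf
    if example["outlook"] == "overcast":
        return "yes"               # leaf
    # Remaining branch: outlook == "rainy"; internal node tests 'windy'.
    return "no" if example["windy"] else "yes"

print(classify({"outlook": "sunny", "humidity": "high"}))   # → no
print(classify({"outlook": "rainy", "windy": False}))       # → yes
```

Reading a prediction off the tree is just walking from the root to a leaf, which is why classification with a decision tree is so fast and why the resulting rules are easy to interpret.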
What is Metadata?
Metadata is simply defined as data about data. The data that is used to represent
other data is known as metadata. For example, the index of a book serves as
metadata for the contents of the book. In other words, we can say that metadata is
the summarized data that leads us to the detailed data. In terms of a data warehouse, we
can define metadata as follows.
Metadata is the road-map to a data warehouse.
Metadata in a data warehouse defines the warehouse objects.
Metadata acts as a directory. This directory helps the decision support
system to locate the contents of a data warehouse.
To ensure high-quality data, it’s crucial to pre-process it. To make the process easier, data
pre-processing is divided into four stages: data cleaning, data integration, data reduction,
and data transformation.
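The four stages can be traced on a single toy record; every field name and rule below is an illustrative assumption rather than a prescribed recipe.

```python
# One toy record run through the four pre-processing stages named above.
# Field names, default values, and kept columns are all illustrative.

raw_a = [{"id": 1, "age": None, "income": "50000"}]   # hypothetical source A
raw_b = [{"id": 1, "city": "Paris"}]                  # hypothetical source B

# 1) Data cleaning: fill the missing age with a default value.
cleaned = [dict(r, age=r["age"] if r["age"] is not None else 30) for r in raw_a]

# 2) Data integration: merge the two sources on the shared key 'id'.
by_id = {r["id"]: dict(r) for r in cleaned}
for r in raw_b:
    by_id.setdefault(r["id"], {}).update(r)

# 3) Data reduction: keep only the attributes the analysis needs.
reduced = [{k: r[k] for k in ("id", "age", "income")} for r in by_id.values()]

# 4) Data transformation: convert income from a string to a number.
final = [dict(r, income=float(r["income"])) for r in reduced]
print(final)
```

Each stage consumes the previous stage's output, which is why the four are usually described as a pipeline rather than independent steps.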