o Data Loading
o Data Access
External Data: For data gathering, most executives and data analysts
rely on external sources for a large share of the information they use.
This includes statistics relevant to their organization that are
published by external sources and departments.
Internal Data: In every organization, users keep their “private”
spreadsheets, reports, client profiles, and sometimes even departmental
databases. This internal data may be useful in a data warehouse.
Operational System data: Operational systems are principally meant
to run the business. In each operational system, old data is periodically
moved out and stored in archived files.
Flat files: A flat file is nothing but a text database that stores data in a
plain text format. Flat files generally are text files that have all data
processing and structure markup removed. A flat file contains a table
with a single record per line.
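As a minimal sketch of reading such a file, assuming a comma-delimited layout with a header line (the column names and sample rows below are invented for illustration):

```python
import csv
import io

# Hypothetical flat-file contents: one record per line, comma-delimited.
FLAT_FILE = """customer_id,name,city
1,Asha,Pune
2,Ben,Leeds
"""

def read_flat_file(text):
    """Return each line of a delimited flat file as a dict keyed by header."""
    return list(csv.DictReader(io.StringIO(text)))

records = read_flat_file(FLAT_FILE)
for record in records:
    print(record["customer_id"], record["name"], record["city"])
```

Every value comes back as a string, which is one reason flat-file data usually needs a transformation step before it is loaded.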
2. Data Staging:
After the data is extracted from various sources, the data files must be
prepared for storage in the data warehouse. The extracted data collected from
various sources must be transformed and made ready in a format that is
suitable to be saved in the data warehouse for querying and analysis.
Data staging comprises three primary functions:
Data Extraction: This stage handles various data sources. Data
analysts should employ suitable techniques for every data source.
4. Data Marts:
Data marts are also part of the storage component in a data warehouse. A data
mart stores the information of a specific function of an organization that is
handled by a single authority. An organization may have any number of data
marts, depending on its functions. In short, data marts contain
subsets of the data stored in the data warehouse.
Now, the users and analysts can use data for various applications like
reporting, analyzing, mining, etc. The data is made available to them
whenever required.
Data Warehousing life Cycle:
1. Structured
2. Semi-structured
3. Unstructured data
The data is processed and transformed so that users and analysts can access
the processed data in the Data Warehouse through Business Intelligence
tools, SQL clients, and spreadsheets. A data warehouse merges all
information coming from various sources into one global and complete
database. By merging all this information in one place, it becomes easier for
an organization to analyze its customers more comprehensively.
1. Amazon Redshift
2. Microsoft Azure
3. Google BigQuery
4. Snowflake
5. Micro Focus Vertica
6. Teradata
7. Amazon DynamoDB
8. PostgreSQL
The ETL process is technically challenging and requires active input from various
stakeholders, including developers, analysts, testers, and top executives.
To maintain its value as a tool for decision-makers, a data warehouse system needs to
change with the business. ETL is a recurring activity (daily, weekly, monthly) of a
data warehouse system and needs to be agile, automated, and well documented.
Cleansing
The cleansing stage is crucial in a data warehouse because it is meant to
improve data quality. The primary data cleansing features found in ETL tools are
rectification and homogenization. They use specific dictionaries to rectify typing
mistakes and to recognize synonyms, as well as rule-based cleansing to enforce
domain-specific rules and define appropriate associations between values.
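A minimal sketch of dictionary-based rectification, synonym homogenization, and one rule-based check. The dictionaries, the field names, and the ZIP rule are all assumptions made up for illustration, not features of any particular ETL tool:

```python
# Hypothetical dictionaries; a real ETL tool would load much larger ones.
TYPO_FIXES = {"Nwe York": "New York", "Londno": "London"}
SYNONYMS = {"NYC": "New York", "New York City": "New York"}

def clean_city(value):
    """Rectify known typos, then map synonyms to one canonical name."""
    value = " ".join(value.split())        # collapse stray whitespace
    value = TYPO_FIXES.get(value, value)   # rectification
    return SYNONYMS.get(value, value)      # homogenization

def violates_rule(record):
    """Rule-based check (assumed rule): a US record needs a 5-digit ZIP."""
    if record["country"] == "US":
        zip_code = record["zip"]
        return not (len(zip_code) == 5 and zip_code.isdigit())
    return False

cleaned = [clean_city(c) for c in ["Nwe York", "NYC", " London "]]
print(cleaned)
```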
If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-
to-date list of contact addresses, email addresses and telephone numbers must be
available.
If a client or supplier calls, the staff responding should be able to find the person
quickly in the enterprise database, which requires that the caller's name or company
name is listed in the database.
If a user appears in the databases with two or more slightly different names or different
account numbers, it becomes difficult to update the customer's information.
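One way to surface such near-duplicate names is approximate string matching. The sketch below uses Python's standard `difflib.SequenceMatcher`; the 0.85 threshold and the sample names are assumptions chosen for illustration and would need tuning on real data:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity score between two names, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_duplicates(names, threshold=0.85):
    """Pair up names whose similarity meets the (assumed) threshold."""
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if similarity(first, second) >= threshold:
                pairs.append((first, second))
    return pairs

customers = ["Jon Smith", "John Smith", "Mary Jones"]
duplicates = likely_duplicates(customers)
print(duplicates)
```

Pairs flagged this way are only candidates; a person or a stricter rule should confirm them before records are merged.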
Transformation
Transformation is the core of the reconciliation phase. It converts records from their
operational source format into a particular data warehouse format. If we implement a
three-layer architecture, this phase outputs our reconciled data layer.
o Loose texts may hide valuable information. For example, XYZ PVT Ltd does not explicitly
show that this is a private limited company.
o Different formats can be used for individual data. For example, a date can be saved as a
string or as three integers.
Following are the main transformation processes aimed at populating the reconciled
data layer:
o Conversion and normalization that operate on both storage formats and units of
measure to make data uniform.
o Matching that associates equivalent fields in different sources.
o Selection that reduces the number of source fields and records.
Cleansing and Transformation processes are often closely linked in ETL tools.
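The three transformation processes can be sketched on two hypothetical source records. The field names, the sample values, and the choice of kilograms as the target unit are assumptions for illustration only:

```python
from datetime import date

# Two hypothetical sources describing the same kind of order.
source_a = {"cust_name": "Asha", "order_date": "2024-03-15",
            "weight_lb": 2.0, "fax": ""}
source_b = {"customer": "Ben", "day": 15, "month": 3, "year": 2024,
            "weight_kg": 1.5}

LB_TO_KG = 0.45359237  # definition of the international pound

def transform_a(rec):
    """Source A stores the date as a string and the weight in pounds."""
    return {
        "customer": rec["cust_name"],                         # matching
        "order_date": date.fromisoformat(rec["order_date"]),  # conversion
        "weight_kg": round(rec["weight_lb"] * LB_TO_KG, 3),   # normalization
    }  # selection: the unused "fax" field is dropped

def transform_b(rec):
    """Source B stores the date as three integers and the weight in kg."""
    return {
        "customer": rec["customer"],
        "order_date": date(rec["year"], rec["month"], rec["day"]),
        "weight_kg": rec["weight_kg"],
    }

reconciled = [transform_a(source_a), transform_b(source_b)]
print(reconciled)
```

After this step both records share one set of field names, one date type, and one unit of measure.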
Loading
Loading is the process of writing the data into the target database. During the load
step, it is necessary to ensure that the load is performed correctly and with as few
resources as possible.
1. Refresh: Data Warehouse data is completely rewritten, which means that the older data
is replaced. Refresh is usually used in combination with static extraction to populate a
data warehouse initially.
2. Update: Only those changes applied to source information are added to the Data
Warehouse. An update is typically carried out without deleting or modifying preexisting
data. This method is used in combination with incremental extraction to update data
warehouses regularly.
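The difference between the two loading modes can be sketched with Python's built-in `sqlite3` module. The `customer` table and its rows are hypothetical, and a production loader would normally use the warehouse's own bulk-load facilities:

```python
import sqlite3

def refresh(conn, rows):
    """Refresh: wipe the target table and rewrite it from a full extract."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM customer")
        conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", rows)

def update(conn, new_rows):
    """Update: append only the rows captured since the last load."""
    with conn:
        conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", new_rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

refresh(conn, [(1, "Asha"), (2, "Ben")])  # initial population
update(conn, [(3, "Chen")])               # later incremental load
row_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
print(row_count)
```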
Selecting an ETL Tool
Selecting an appropriate ETL tool is an important decision in building an ODS or data
warehousing application. ETL tools are
required to provide coordinated access to multiple data sources so that relevant data
may be extracted from them. An ETL tool generally contains tools for data
cleansing, reorganization, transformation, aggregation, calculation, and automatic
loading of information into the target database.
A star schema is a relational schema whose design represents
a multidimensional data model. The star schema is the explicit data warehouse schema.
It is known as a star schema because its entity-relationship diagram
resembles a star, with points diverging from a central table. The center of the schema
consists of a large fact table, and the points of the star are the dimension tables.
Fact Tables
A fact table is a table in a star schema that contains facts and is connected to
dimensions. A fact table has two types of columns: those that contain facts and those
that are foreign keys to the dimension tables. The primary key of a fact table is
generally a composite key that is made up of all of its foreign keys.
A fact table might contain either detail-level facts or facts that have been aggregated
(fact tables that contain aggregated facts are often called summary tables instead). A
fact table generally contains facts at the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that
categorize data. If a dimension has no hierarchies or levels, it is called a flat
dimension or list. The primary key of each dimension table is part of the
composite primary key of the fact table. Dimensional attributes help to describe the
dimensional value. They are generally descriptive, textual values. Dimension tables are
usually smaller than fact tables.
For example, fact tables store data about sales, while dimension tables store data about
geographic regions (markets, cities), clients, products, times, and channels.
In a star schema database design, the dimensions are connected only through the central
fact table. When two dimension tables are used in a query, only one join path,
passing through the fact table, exists between those two tables. This design feature
enforces accurate and consistent query results.
Structural simplicity also decreases the time required to load large batches of records
into a star schema database. By defining facts and dimensions and separating them
into different tables, the impact of a load operation is reduced. Dimension tables can be
populated once and occasionally refreshed. We can add new facts regularly and
selectively by appending records to a fact table.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only
through the fact table. These joins are more meaningful to the end user because they
represent the fundamental relationships between parts of the underlying business.
Users can also browse dimension table attributes before constructing a query.
Example: Suppose a star schema is composed of a fact table, SALES, and several
dimension tables connected to it for time, branch, item, and geographic locations.
The TIME table has a column for each day, month, quarter, and year. The ITEM table has
columns for item_key, item_name, brand, type, and supplier_type. The BRANCH table
has columns for branch_key, branch_name, and branch_type. The LOCATION table has
columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the dimension
tables, TIME, ITEM, BRANCH, and LOCATION, instead of four columns for time data, four
columns for ITEM data, three columns for BRANCH data, and four columns for
LOCATION data. Thus, the size of the fact table is significantly reduced. When we need
to change an item, we need only make a single change in the dimension table, instead
of making many changes in the fact table.
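The example can be sketched with Python's built-in `sqlite3` module. The `_dim` table suffixes, the surrogate key columns for TIME and LOCATION, the `units_sold` measure, and all sample rows are assumptions added for illustration; the source names only the tables and their descriptive columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
                       quarter INTEGER, year INTEGER);
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT,
                       type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim (branch_key INTEGER PRIMARY KEY, branch_name TEXT,
                         branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus an assumed measure.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time_dim(time_key),
    item_key INTEGER REFERENCES item_dim(item_key),
    branch_key INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold INTEGER,
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
INSERT INTO time_dim VALUES (1, 15, 3, 1, 2024), (2, 10, 5, 2, 2024);
INSERT INTO item_dim VALUES (1, 'Laptop', 'Acme', 'electronics', 'wholesale');
INSERT INTO branch_dim VALUES (1, 'Central', 'flagship');
INSERT INTO location_dim VALUES (1, '1 High St', 'Pune', 'MH', 'India');
INSERT INTO sales VALUES (1, 1, 1, 1, 5), (2, 1, 1, 1, 7);
""")

# Two dimensions meet only through the fact table.
QUERY = """
SELECT t.quarter, i.item_name, SUM(s.units_sold)
FROM sales AS s
JOIN time_dim AS t ON t.time_key = s.time_key
JOIN item_dim AS i ON i.item_key = s.item_key
GROUP BY t.quarter, i.item_name
ORDER BY t.quarter
"""
before = conn.execute(QUERY).fetchall()

# Renaming the item touches one dimension row; no fact rows change.
conn.execute("UPDATE item_dim SET item_name = 'Notebook' WHERE item_key = 1")
after = conn.execute(QUERY).fetchall()
print(before, after)
```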
Data Mining
Data mining is one of the most useful techniques that help entrepreneurs, researchers,
and individuals extract valuable information from huge sets of data. Data mining is
also called Knowledge Discovery in Databases (KDD). The knowledge discovery
process includes Data cleaning, Data integration, Data selection, Data transformation,
Data mining, Pattern evaluation, and Knowledge presentation.
In other words, data mining is the process of investigating hidden patterns in
information from various perspectives and categorizing them into useful data. That data
is collected and assembled in areas such as data warehouses, where efficient analysis
and data mining algorithms support decision-making and other data requirements,
ultimately cutting costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find
trends and patterns that go beyond simple analysis procedures. Data mining uses
complex mathematical algorithms to segment data and evaluate the probability of
future events.
Data Mining is a process used by organizations to extract specific data from huge
databases to solve business problems. It primarily turns raw data into useful information.
Data Mining is similar to Data Science carried out by a person, in a specific situation, on
a particular data set, with an objective. This process includes various types of services
such as text mining, web mining, audio and video mining, pictorial data mining, and
social media mining. It is done through software that may be simple or highly
specialized. By outsourcing data mining, the work can be done faster and with lower
operating costs. Specialized firms can also use new technologies to collect data that is
impossible to locate manually. Vast amounts of information are available on various
platforms, but very little knowledge is accessible. The biggest challenge is to analyze
the data to extract important information that can be used to solve a problem or for
company development. There are many powerful instruments and techniques available to
mine data and gain better insight from it.
Relational Database:
A relational database is a collection of multiple data sets formally organized into tables,
records, and columns, from which data can be accessed in various ways without having
to reorganize the database tables. Tables convey and share information, which facilitates
data searchability, reporting, and organization.
Data warehouses:
A Data Warehouse is the technology that collects data from various sources within
the organization to provide meaningful business insights. The huge amount of data
comes from multiple places such as Marketing and Finance. The extracted data is
used for analytical purposes and helps in decision-making for a business
organization. The data warehouse is designed for the analysis of data rather than
transaction processing.
Data Repositories:
The Data Repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within
an IT structure, for example, a group of databases where an organization has kept various
kinds of information.
Object-Relational Database:
One of the primary objectives of the Object-relational data model is to close the gap
between the Relational database and the object-oriented model practices frequently
utilized in many programming languages, for example, C++, Java, C#, and so on.
Transactional Database:
A transactional database refers to a database management system (DBMS) that can
undo a database transaction if it is not performed correctly. Although
this was a unique capability long ago, today most relational
database systems support transactional database activities.
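A minimal sketch of this undo behaviour, using Python's built-in `sqlite3` module. The `account` table, the balances, and the non-negative balance rule are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES (1, 100), (2, 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; undo both steps if either one fails."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 500)  # would overdraw account 1
balances = conn.execute("SELECT balance FROM account ORDER BY id").fetchall()
print(ok, balances)
```

The credit to the second account succeeds first, but the failed debit rolls the whole transaction back, so neither balance changes.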
Data mining in healthcare has excellent potential to improve the health system. It uses
data and analytics for better insights and to identify best practices that will enhance
health care services and reduce costs. Analysts use data mining approaches such as
Machine learning, Multi-dimensional databases, Data visualization, Soft computing, and
statistics. Data Mining can be used to forecast the number of patients in each category.
These procedures help ensure that patients get intensive care at the right place and at
the right time. Data mining also enables healthcare insurers to recognize fraud and abuse.
Educational data mining (EDM) is a newly emerging field, concerned with developing
techniques that explore knowledge from the data generated in educational environments.
EDM objectives include predicting students' future learning behavior, studying the
impact of educational support, and promoting learning science. An organization can use
data mining to make precise decisions and also to predict students' results.
With the results, the institution can concentrate on what to teach and how to teach.
Knowledge is the best asset possessed by a manufacturing company. Data mining tools
can be beneficial for finding patterns in a complex manufacturing process. Data mining can
be used in system-level design to obtain the relationships between product
architecture, product portfolio, and the data needs of customers. It can also be used to
forecast the product development period, cost, and expectations, among other tasks.
Billions of dollars are lost to fraud. Traditional methods of fraud detection
are time-consuming and complex. Data mining provides meaningful
patterns and turns data into information. An ideal fraud detection system should
protect the data of all users. Supervised methods use a collection of sample
records that are classified as fraudulent or non-fraudulent. A model is
constructed from this data and is then used to identify whether a new
record is fraudulent or not.
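The supervised approach can be sketched with a deliberately simple nearest-centroid classifier. This is a stand-in chosen for brevity rather than a method named in the text, and the features (amount, transactions in the last hour), labels, and sample values are all invented:

```python
# Hypothetical labelled sample: (amount, transactions in last hour) -> label.
training = [
    ((20.0, 1), "legit"), ((35.0, 2), "legit"), ((15.0, 1), "legit"),
    ((900.0, 9), "fraud"), ((1200.0, 12), "fraud"), ((950.0, 8), "fraud"),
]

def centroids(samples):
    """Build the model: the mean feature vector of each labelled class."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(model, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: sq_dist(model[label]))

model = centroids(training)
prediction = classify(model, (1000.0, 10))
print(prediction)
```

The features here are left unscaled, so the amount dominates the distance; a realistic system would normalize features and use a far larger labelled sample.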
Apprehending a criminal is often easier than bringing out the truth from them, which is
a very challenging task. Law enforcement may use data mining techniques to investigate
offenses, monitor suspected terrorist communications, etc. This technique also includes
text mining, which seeks meaningful patterns in data that is usually unstructured text.
The information collected from previous investigations is compared, and a model for
lie detection is constructed.
Data mining is the process of extracting useful data from large volumes of data. Data
in the real world is heterogeneous, incomplete, and noisy. Data in huge quantities
will usually be inaccurate or unreliable. These problems may occur because of faulty
measuring instruments or because of human errors. Suppose a retail chain collects phone
numbers of customers who spend more than $500, and the accounting employees enter
the information into their system. An employee may mistype a digit when entering
the phone number, which results in incorrect data. Some customers may not be
willing to disclose their phone numbers, which results in incomplete data. The data
could also get changed due to human or system error. All these problems (noisy and
incomplete data) make data mining challenging.
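A small sketch of how the phone-number example might be audited before mining. The ten-digit format and the sample values are assumptions; real validation depends on the country and on how numbers are captured:

```python
import re

PHONE_PATTERN = re.compile(r"\d{10}")  # assumed format: exactly ten digits

def audit_phone(value):
    """Label a phone field as 'ok', 'noisy' (malformed), or 'incomplete'."""
    if value is None or not value.strip():
        return "incomplete"
    digits = re.sub(r"[\s\-()]", "", value)  # drop common separators
    return "ok" if PHONE_PATTERN.fullmatch(digits) else "noisy"

records = ["555-010-4477", "55501044", None, "(555) 010 9921", ""]
report = [audit_phone(v) for v in records]
print(report)
```

A check like this catches a missing or extra digit but not a wrong digit, which is why noisy data remains hard to eliminate completely.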
Data Distribution:
Real-world data is usually stored on various platforms in a distributed computing
environment. It might be in a database, on individual systems, or even on the internet.
In practice, it is quite difficult to bring all the data into a centralized data repository,
mainly due to organizational and technical concerns. For example, various regional
offices may have their own servers to store their data. It may not be feasible to store
all the data from all the offices on a central server. Therefore, data mining requires the
development of tools and algorithms that allow the mining of distributed data.
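One simple pattern for mining distributed data is to summarize locally and merge only the summaries. The sketch below counts item frequencies this way; the office names and purchase logs are hypothetical, and real distributed mining algorithms are considerably more involved:

```python
from collections import Counter

# Hypothetical per-office purchase logs that stay on each office's server.
office_logs = {
    "north": ["tea", "milk", "tea"],
    "south": ["milk", "bread"],
    "east": ["tea", "bread", "bread"],
}

def local_summary(purchases):
    """Runs at each office: reduce raw records to item counts."""
    return Counter(purchases)

def merge_summaries(summaries):
    """Runs centrally: combine the small summaries, never the raw data."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total

global_counts = merge_summaries(local_summary(log) for log in office_logs.values())
print(global_counts)
```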
Complex Data:
Real-world data is heterogeneous, and it could be multimedia data, including audio and
video, images, complex data, spatial data, time series, and so on. Managing these
various types of data and extracting useful information is a tough task. Most of the time,
new technologies, tools, and methodologies have to be developed or refined to obtain
specific information.
Performance:
The data mining system's performance relies primarily on the efficiency of algorithms
and techniques used. If the designed algorithm and techniques are not up to the mark,
then the efficiency of the data mining process will be affected adversely.
Data mining can lead to serious issues in terms of data security, governance, and
privacy. For example, if a retailer analyzes the details of purchased items, the analysis
reveals data about the buying habits and preferences of customers without their
permission.
Data Visualization:
In data mining, data visualization is a very important process because it is the primary
method that shows the output to the user in a presentable way. The extracted data
should convey the exact meaning of what it intends to express. But many times,
representing the information to the end user in a precise and easy way is difficult.
Because the input data and the output information are complicated, efficient and
effective data visualization processes need to be implemented.
There are many more challenges in data mining in addition to the problems mentioned
above. More problems emerge as the actual data mining process begins, and the
success of data mining relies on overcoming all these difficulties.