Module #10
A. MAIN LESSON
1. Activity 2: Content Notes
Introduction to Data Mining
Data Mining is a set of methods applied to large and complex databases to eliminate randomness and discover hidden patterns. Because these methods are almost always computationally intensive, we rely on data mining tools, methodologies, and theories for revealing patterns in data. Many driving forces are present, and this is the reason why data mining has become such an important area of study.
History
In the 1960s, statisticians used the terms "Data Fishing" or "Data Dredging" to refer to what they considered the bad practice of analyzing data. The term "Data Mining" appeared around 1990 in the database community.
Foundation
Data mining lets organizations sift through their data in real time. We use data mining in the business community because the supporting technologies are now sufficiently mature:
We use it to automate the process of finding predictive information in large databases. Questions that once required extensive hands-on analysis can now be answered directly from the data. Targeted marketing is a typical example of predictive marketing: data mining is applied to past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
We also use data mining tools to sweep through databases and identify previously hidden patterns in one step. A good example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
Types of Data Gathered
• Business transactions: In the business world, every transaction is "memorized" for perpetuity. Many transactions deal with time and can be inter-business deals such as purchases, exchanges, banking, stock, etc. This data needs to be analyzed. Unfortunately, we can capture and store new data faster than we can analyze the old data already accumulated.
• Medical and personal data: From government to commerce and for personal needs, large amounts of information are gathered about individuals and groups. When correlated with other data, this information can shed light on customer behavior.
• Surveillance video and pictures: With the collapse of video camera prices, video cameras are becoming ubiquitous. Surveillance videotapes were once recycled, but it has become a trend to store the tapes and even digitize them for future use and analysis.
• Games: Our societies collect a huge amount of data and statistics about games, players, and athletes. This information is used by commentators and journalists for reporting.
• Digital media: Cheap scanners, desktop video cameras, and digital cameras are among the many causes of the explosion in digital media repositories. Associations such as the NHL and the NBA have already started converting their huge game collections into digital form.
• CAD and software engineering data: Multiple CAD systems are available for architects to design buildings, and these systems generate a huge amount of data. Software engineering is a source of considerable similar data, with code and objects that need powerful tools for management and maintenance.
• Virtual worlds: Nowadays many applications use three-dimensional virtual spaces. These spaces and the objects they contain are described with special languages such as VRML. Ideally, virtual spaces are defined so that they can share objects and places. There is already a remarkable amount of virtual reality objects available.
• Text reports and memos (e-mail messages): In many companies, communications are based on reports and memos in textual form, exchanged by e-mail. These are stored in digital form for future use and reference, creating formidable digital libraries.
• Artificial neural networks
These learn through training and resemble biological neural networks in structure.
• Decision trees
Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
• Genetic algorithms
Optimization techniques that use processes such as genetic combination, mutation, and natural selection.
• Nearest neighbor method
A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most like it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
• Rule induction
The extraction of useful if-then rules from data based on statistical significance.
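The nearest neighbor method above is simple enough to sketch directly. This is a minimal illustration in Python (not taken from the lesson): the historical records, feature values, and class labels are invented, and k is chosen arbitrarily.

```python
from collections import Counter
import math

def knn_classify(record, history, k=3):
    """Classify a record by majority vote among the k most
    similar records in a historical dataset (k >= 1)."""
    # Rank historical records by Euclidean distance to the new record.
    ranked = sorted(history, key=lambda r: math.dist(record, r[0]))
    # The majority class among the k nearest neighbors decides the label.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical historical dataset: (features, class) pairs.
history = [((1.0, 1.0), "low risk"), ((1.2, 0.8), "low risk"),
           ((5.0, 5.0), "high risk"), ((4.8, 5.2), "high risk")]

print(knn_classify((1.1, 0.9), history, k=3))  # low risk
```

In practice, the distance function and the value of k are tuning choices; this sketch simply takes a majority vote among the k closest historical records.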
Now that you have been introduced to data mining, let's try a short activity to see how much you understood our short introduction to the lesson.
There you go! I expect that you learned something today, and I am excited to hear your understanding of today's lesson. Answer the following questions:
Part 1: Identify which type of gathered data each of the following statements describes:
ANSWERS | STATEMENTS
Business transactions | This type of data gathered for data mining deals with time and can involve inter-business deals such as purchases, exchanges, banking, stock, etc.
Virtual worlds | This type of data gathered for data mining uses three-dimensional virtual spaces. These spaces and the objects they contain are described with special languages such as VRML. Ideally, virtual spaces are defined so that they can share objects and places.
Games | This type of data gathered for data mining is collected about games, players, and athletes. This information is used by commentators and journalists for reporting.
Scientific data | This type of data gathered for data mining is captured and stored faster than the old, already accumulated data can be analyzed.
Medical and personal data | This type of data gathered for data mining is required for individuals and groups. When correlated with other data, this information can shed light on customer behavior.
Text reports and memos | This type of data gathered for data mining is based on the reports and memos in textual form in many companies.
CAD and software engineering data | This type of data gathered for data mining is used to generate a huge amount of data. Software engineering is a source of considerable similar data, with code and objects that need powerful tools for management and maintenance.
Surveillance video and pictures | With the collapse of video camera prices, this type of data gathered for data mining is becoming ubiquitous.
A. LESSON WRAP-UP
FAQs:
1. What is VRML?
Answer: Virtual Reality Modeling Language (VRML), pronounced vermal or by its initials, and originally (before 1995) known as the Virtual Reality Markup Language, is a standard file format for representing 3-dimensional (3D) interactive vector graphics, designed particularly with the World Wide Web in mind.
2. What is CAD?
Answer: Computer-aided design (CAD) is the use of computers (or workstations) to aid in the creation,
modification, analysis, or optimization of a design.
Mark the place in the work tracker, which is simply a visual to help you track how much work you have accomplished and how much work there is left to do. This tracker will be part of your activity sheet.
To develop habits of thinking about learning, answer the questions below about your learning experience.
A. MAIN LESSON
1. Activity 2: Content Notes
Why Data Mining?
Data mining has a wide range of applications; thus, it is a young and promising field for the present generation. It has attracted a great deal of attention in the information industry and in society due to the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge.
We use this information and knowledge for applications such as market analysis. This is the reason why data mining is also called knowledge discovery in databases (KDD).
To operate data mining tools, we need extra steps for extracting and importing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results. We can apply the analytic data warehouse to improve business processes, particularly in areas such as promotional campaign management.
The figure below illustrates an architecture for advanced analysis in a large data warehouse.
The data warehouse can be implemented on a variety of relational database systems, such as Sybase, Oracle, Redbrick, and so on, and should be optimized for flexible and fast data access.
An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow the user to analyze the data as they want to view their business, such as summarizing by product line or region.
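The kind of multidimensional summarization described above (by product line or region, for example) can be imitated on a small scale with a GROUP BY query. This is a hedged sketch using an in-memory SQLite table; the table layout and sales figures are invented for illustration.

```python
import sqlite3

# Build a tiny in-memory "warehouse" with hypothetical sales rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product_line TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("bikes", "north", 100.0), ("bikes", "south", 250.0),
    ("parts", "north", 40.0),  ("bikes", "north", 60.0),
])

# Summarize by product line and region, as an analyst navigating
# a multidimensional view would.
rows = con.execute(
    "SELECT product_line, region, SUM(amount) FROM sales "
    "GROUP BY product_line, region ORDER BY product_line, region"
).fetchall()
for line, region, total in rows:
    print(line, region, total)
```

A real OLAP server precomputes and caches such aggregates across many dimensions; the query here only illustrates the kind of view the user navigates.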
Further, the data mining server must be integrated with the data warehouse and the OLAP server to embed ROI-focused business analysis directly into this infrastructure. Integration with the data warehouse also enables operational decisions to be implemented and tracked.
As the warehouse grows with new decisions and results, the organization can mine the best practices and apply them to future decisions.
The OLAP results enhance the metadata by providing a dynamic metadata layer that represents a distilled view of the data. Reporting, visualization, and other tools can then be applied to plan future actions and confirm the impact of those plans.
Data Mining Process
Data mining, also popularly known as Knowledge Discovery in Databases (KDD), is the nontrivial extraction of implicit and potentially useful information from data in databases.
This process comprises a few steps that lead from raw data collections to some form of new knowledge. The iterative process consists of the following steps:
1. Data cleaning: Also called data cleansing; in this phase, noisy and irrelevant data are removed from the collection.
5. Data mining: In this phase, clever techniques are applied to extract potentially useful patterns.
6. Pattern evaluation: In this phase, interesting patterns representing knowledge are identified based on given measures.
7. Knowledge representation: The final phase, in which the discovered knowledge is represented to the user. This essential step uses visualization techniques to help users understand and interpret the results.
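As one way to picture the iterative process above, the following toy sketch chains cleaning, mining, and pattern evaluation over invented transaction data. It illustrates the flow only, not a real KDD system; the cleaning rule and support threshold are arbitrary assumptions.

```python
from itertools import combinations
from collections import Counter

def clean(transactions):
    # Data cleaning: drop empty or single-item records (treated as noise here).
    return [t for t in transactions if len(t) >= 2]

def mine(transactions):
    # Data mining: count how often each pair of items occurs together.
    pairs = Counter()
    for t in transactions:
        pairs.update(combinations(sorted(set(t)), 2))
    return pairs

def evaluate(pairs, min_support=2):
    # Pattern evaluation: keep only patterns meeting a support threshold.
    return {p: n for p, n in pairs.items() if n >= min_support}

raw = [["bread", "milk"], ["bread"], ["bread", "milk", "eggs"], []]
patterns = evaluate(mine(clean(raw)))
print(patterns)  # {('bread', 'milk'): 2}
```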
There are many data mining systems available, and some are dedicated to a specific data source. Data mining systems can be categorized according to various criteria:
• Classification according to the kind of knowledge discovered, such as discrimination, association, classification, clustering, etc.
• Classification according to mining techniques used
Data mining systems can also be classified according to the data analysis technique employed, such as machine learning, neural networks, genetic algorithms, etc.
1. Mining methodology issues: These issues pertain to the data mining approaches applied and their limitations, such as the versatility of the mining approaches, which can dictate mining methodology choices.
2. Performance issues: Many artificial intelligence and statistical methods exist for data analysis. However, these methods were often not designed for the very large datasets data mining deals with today.
This raises the issues of scalability and efficiency of the data mining methods when processing considerably large data. Linear algorithms are usually the norm. In the same theme, sampling may be used instead of the whole dataset; however, issues like completeness and choice of samples may arise. Other topics in the issue of performance are incremental updating and parallel programming. Parallelism can be used to solve the size problem if the dataset can be subdivided and the results merged later. Incremental updating is important for merging results from parallel mining, and for updating results when new data becomes available without re-analyzing the complete dataset from scratch.
3. Data source issues: There are many issues related to data sources. Some are practical, such as the diversity of data types, while others are philosophical, like the data glut problem.
We certainly have an excess of data: we already have more data than we can handle, and we are still collecting it at an even higher rate. The spread of database management systems has helped increase the gathering of information, and the advent of data mining is certainly encouraging more data harvesting. The current practice is to collect as much data as possible now and process it, or try to, later.
Regarding the practical issues related to data sources, there is the subject of heterogeneous databases and the need to focus on diverse complex data types. We are storing different types of data in a variety of repositories. It is difficult to expect a data mining system to achieve good mining results on all kinds of data and sources.
Different kinds of data and sources may require distinct algorithms and methodologies. Currently, there is a focus on relational databases and data warehouses.
A versatile data mining tool that works on all sorts of data may not be realistic. Moreover, the diversity of data sources, at structural and semantic levels, poses important challenges not only to the database community but also to the data mining community.
More applications include:
• A credit card company can leverage its vast warehouse of customer transaction data to identify the customers most likely to be interested in a new credit product.
• Using a small test mailing, the attributes of customers with an affinity for the product can be identified. Recent projects have indicated more than a 20-fold decrease in costs for targeted mailing campaigns over conventional approaches.
• A diversified transportation company applied data mining to identify the best prospects for its services. Applying this segmentation to a general business database, such as those provided by Dun & Bradstreet, can yield a prioritized list of prospects by region.
• A large consumer packaged goods company can apply data mining to improve its sales process to retailers. Data from consumer panels and competitor activity can be applied to understand the reasons for brand and store switching.
• Through this analysis, the manufacturer can select promotional strategies that best reach their target customer segments.
Now that you understand the architecture of data mining, let's try a short activity to see how much you understood.
There you go! I expect that you learned something today, and I am excited to hear your understanding of today's lesson. Answer the following questions:
Try to answer based on your understanding; it is better to construct your own sentences so you can practice your written communication skills.
Part 1: Draw a diagram that shows the Data Mining Process using the following shapes:
Shapes: Database, Data Warehouse, Task-relevant Data, Evaluation
Draw here…
Answer:
Regarding the practical issues related to data sources, there is the subject of heterogeneous databases; thus, we need to focus on diverse complex data types. We are storing different types of data in a variety of repositories.
2. What is OLAP?
Answer:
An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. Multidimensional structures allow the user to analyze the data as they want to view their business.
A. LESSON WRAP-UP
FAQs:
1. What is KDD?
Answer: The term Knowledge Discovery in Databases, or KDD for short, refers to the broad process of finding knowledge in data, and emphasizes the "high-level" application of particular data mining methods.
2. How does data mining help space research?
Answer: Integrated system health management (ISHM) is one of the benefits used in space missions that apply data mining; it will be a key contributor to the safety, reliability, and affordability of future exploration missions. Data mining and complex systems design can be combined with other ISHM approaches to obtain higher ISHM performance at lower cost.
An intrusion detection system (IDS) is a data mining tool used to identify cyber attacks. Besides quickly identifying attacks, it has many other benefits, such as enabling the collection of intrusion information, recording malicious events, generating reports, and alerting system administrators by raising an alarm.
Mark the place in the work tracker which is simply a visual to help you track how much work you have
accomplished and how much work there is left to do. This tracker will be part of your activity sheet.
To develop habits of thinking about learning, answer the questions below about your learning experience.
A. MAIN LESSON
1. Activity 2: Content Notes
With Big Data becoming more prevalent than ever, the demand for mining tools is growing. It is becoming vital to know exactly which tools are capable of successfully dealing with huge amounts of data.
In this article, we will discuss the prospecting algorithms and data visualization libraries that will be your primary tools.
Data Processing
Before we dive deeper into the details, we first need a clear vision of how a very large amount of unorganized information is transformed into an organized and structured set of lists, ready to be used by sales, marketing, or even HR.
1. Find a lead data source. This is the primary place from which all your data will be mined. It can be a popular social media platform like Facebook, LinkedIn, or Twitter. At this point we have bulk data, most of which is of no use to us.
2. Target the relevant data. Here we define the targeted data type and source, suited for our purposes. We can
have multiple associative data types, as well as several sub-sources to extract data from.
3. Preprocess raw data for future processing. This part of the data mining process involves altering the data
from a raw format into one that’s acceptable for further interactions.
4. Convert preprocessed data into a readable format. Your original data language will be determined and
transformed into one your system is able to process.
5. Create Data Patterns/Models. Based on the data you have, you can determine common relationships
between the subtypes of data and identify patterns, or create sets of tables connected by data relationships.
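Steps 3 and 4 above (preprocessing raw data and converting it into a readable format) might look like this on a tiny scale. The raw feed, field names, and normalization rules are invented for illustration.

```python
import csv
import io

RAW = """name , email , company
 Ada Lovelace , ADA@EXAMPLE.COM , Analytical Engines
 Grace Hopper , grace@example.com ,
"""

def preprocess(text):
    # Preprocessing: strip stray whitespace from every field.
    rows = csv.reader(io.StringIO(text))
    return [[field.strip() for field in row] for row in rows if row]

def convert(rows):
    # Conversion: turn cleaned rows into uniform dicts with
    # normalized (lower-cased) e-mail addresses.
    header, *records = rows
    leads = []
    for rec in records:
        lead = dict(zip(header, rec))
        lead["email"] = lead["email"].lower()
        leads.append(lead)
    return leads

leads = convert(preprocess(RAW))
print(leads[0]["email"])  # ada@example.com
```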
Data Visualization
With the relational data patterns identified, we are able to build all sorts of meaningful infographics and visualize
them using third-party services or libraries. These third-party solutions don’t have a high learning curve, however
analyzing the libraries directly would require the assistance of a developer who is familiar with the languages used in
any given library. Here you can see the list of the most commonly used 3rd party tools for data visualization:
• Datawrapper (data tool for journalists and news publishers)
• D3.js (JavaScript library for displaying data on web platforms)
• Google Charts (user-friendly library based on HTML5 and SVG for Android, iOS, and browsers)
With these tools, we can create infographics that show all the data our sales and marketing departments need to create a successful marketing campaign. Moreover, the collected data can be used for outreach to potential prospects. Lead generation cannot exist without a solid data foundation. If you want to generate leads, generate data.
Data mining tools can help you make smarter business decisions and cut costs drastically. They can also help you detect anomalies inside your models and patterns to prevent your system from being exploited by third parties.
With all those features on board, you won't need to implement complex algorithms from the ground up. Moreover, you can adjust those features with some additional tweaking of the code base (if it's an open-source tool) as your demands grow.
Overall, data mining tools were created to define and achieve numerous objectives, helping you generate more profit in the end. Now you see why these tools are genuinely useful. Let's end with one last point: different tools require different approaches. Some require little to no coding experience, while others demand some programming skills, depending on the tool. These tools are generally open source and don't have any paid plans.
Here is a list of the most commonly used data mining tools, from entry level to enterprise grade:
1) RapidMiner
This tool is written in Java and has multiple mining options like pre-processing, conversion, and prediction techniques. It can be used with other tools like WEKA and R to produce models written in the code of those two.
Existing patterns, models and algorithms can be enhanced by the following programming languages:
• R – a programming language used for data mining, extraction, exploration, and analytical tasks;
• Python – a programming language used for rapid prototyping of software solutions.
They are well suited for rapid prototyping and data manipulation. RapidMiner offers all the data analysis features, from the simplest to the most advanced. Plugins from the RapidMiner Marketplace extend the already vast functionality. Moreover, developers and data analysts can use the marketplace to publish their own plugins or algorithms.
2) WEKA
WEKA contains a selection of algorithms and visualization tools for machine learning and data analytics. You can use this tool directly on your datasets. With WEKA you can perform numerous data tasks: regression, clustering, classification, visualization, and data processing.
The main advantages of this software are:
• Completely free
• Portable, can be used on multiple platforms
3) Orange
Orange is a component-based toolkit for machine learning, data mining, analysis, and visualization. These components, also called widgets, help not only with simple tasks like data preprocessing and visualization but also with creating complex algorithms and prediction models.
Orange has visual programming built in for creating a solid workflow by linking user-made widgets. It can also be used as a Python library to modify widgets and manipulate data.
4) R
Data mining tools are an essential part of enriching your leads. With these tools at your disposal, you can create patterns based on user behavior and apply them to your marketing strategies. These patterns can also be used to enrich your leads with new data. There are various techniques to describe data by associations or split it into separate clusters, and to predict changes in data by classifying it or using regression.
Overall, data mining tools help us enrich our leads and make our lead generation campaigns more successful.
Now that we have differentiated the data mining tools and their uses, let's try a short activity to see how much you understood our short introduction to the lesson.
There you go! I expect that you learned something today, and I am excited to hear your understanding of today's lesson. Answer the following questions:
Try to answer based on your understanding; it is better to construct your own sentences so you can practice your written communication skills.
Part 1: Do research: Visit the website of each of the following data mining tools and answer the following questions:
DATA MINING TOOLS | MISSION | FOUNDER/CEO
R PROGRAMMING | It is one of the primary languages used by data scientists and statisticians alike. | Ross Ihaka and Robert Gentleman
INET SOFT | To stay nimble and innovate in order to apply the latest technologies to a productive business intelligence implementation. To design software that is easy to use and easy to deploy. | Luke Liang
H30 | To provide a safe working environment for each of our employees and our clients. To complete each project on time and on budget. | Stephen Howes
ORACLE BI | Our Mission is to Help People See Data in New Ways, Discover Insights, Unlock Endless Possibilities. | Larry Ellison
KNIME | At KNIME, we build software to create and productionize data science using one easy and intuitive environment, enabling every stakeholder in the data science process to focus on what they do best. | Michael Berthold
Answer:
With the relational data patterns identified, we are able to build all sorts of meaningful infographics and visualize them using third-party services or libraries. These third-party solutions don't have a high learning curve; however, analyzing the libraries directly would require the assistance of a developer who is familiar with the languages used in any given library.
Data mining tools will help you generate more revenue by creating informational assets used by both sales and marketing departments. They can study the behavior of your clients, their location and position, and help create solid marketing strategies.
A. LESSON WRAP-UP
FAQs:
Currently, storing data is not enough for an organization to stay competitive; the data must be integrated in a single place so that it ceases to be a cost and becomes a business asset. To achieve this, the organization must carry out an ETL process.
Mark the place in the work tracker which is simply a visual to help you track how much work you have
accomplished and how much work there is left to do. This tracker will be part of your activity sheet.
To develop habits of thinking about learning, answer the questions below about your learning experience.
1. How was the module able to help you learn?
A. MAIN LESSON
1. Activity 2: Content Notes
What is ETL and why does it matter?
ETL is a type of data integration that refers to the three steps (extract,
transform, load) used to blend data from multiple sources. It's often used to
build a data warehouse.
During this process, data is taken (extracted) from a source system, converted (transformed) into a format that can be analyzed, and stored (loaded) into a data warehouse or other system. Extract, load, transform (ELT) is an alternate but related approach designed to push processing down to the database for improved performance.
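The three ETL steps can be sketched with nothing but the Python standard library: extract rows from a CSV source, transform them in code, and load them into a SQLite stand-in for the warehouse. The file layout and field names here are invented, not part of any particular ETL product.

```python
import csv
import io
import sqlite3

# Extract: read raw rows from a source system (a CSV string stands in here).
source = io.StringIO("order_id,amount_cents\n1,1050\n2,250\n")
rows = list(csv.DictReader(source))

# Transform: convert amounts from cents into decimal currency units.
transformed = [(int(r["order_id"]), int(r["amount_cents"]) / 100) for r in rows]

# Load: store the transformed rows in the target warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
warehouse.executemany("INSERT INTO orders VALUES (?, ?)", transformed)

total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)  # 13.0
```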
ETL History
ETL gained popularity in the 1970s when organizations began using multiple data repositories to store different types of business information.
In the late 1980s and early 1990s, data warehouses came onto the scene. A distinct type of database, data warehouses provided integrated access to data from multiple systems – mainframe computers, minicomputers, personal computers and spreadsheets. But different departments often chose different ETL tools to use with different data warehouses. Coupled with mergers and acquisitions, many organizations wound up with several different ETL solutions that were not integrated.
Over time, the number of data formats, sources and systems has expanded tremendously. Extract, transform, load
is now just one of several methods organizations use to collect, import and process data. ETL and ELT are both
important parts of an organization’s broader data integration strategy.
Businesses have relied on the ETL process for many years to get a consolidated view of the data that drives better
business decisions. Today, this method of integrating data from multiple systems and sources is still a core
component of an organization’s data integration toolbox.
• ETL tools make it possible to move data without requiring technical skills to write code or scripts.
• ETL has evolved over time to support emerging integration requirements for things like streaming data.
• Organizations need both ETL and ELT to bring data together, maintain accuracy and provide the auditing typically required for data warehousing, reporting and analytics.
Core ETL and ELT tools work in tandem with other data integration tools, and with various other aspects of data
management – such as data quality, data governance, virtualization and metadata. Popular uses today include:
Think of retailers who need to see sales data regularly, or health care providers looking for an accurate depiction of claims. ETL can combine and surface transaction data from a warehouse or other data store so that it's ready for business people to view in a format they can understand. ETL is also used to migrate data from legacy systems to modern systems with different data formats. It's often used to consolidate data from business mergers, and to collect and join data from external suppliers or partners.
Whoever gets the most data, wins. While that’s not necessarily true, having easy access to a broad
scope of data can give businesses a competitive edge. Today, businesses need access to all sorts of
big data – from videos, social media, the Internet of Things (IoT), server logs, spatial data, open or
crowdsourced data, and more. ETL vendors frequently add new transformations to their tools to support
these emerging requirements and new data sources. Adapters give access to a huge variety of data
sources, and data integration tools interact with these adapters to extract and load data efficiently.
ETL has evolved to support integration across much more than traditional data warehouses. Advanced ETL tools can load and convert structured and unstructured data into Hadoop. These tools read and write multiple files in parallel from and to Hadoop, simplifying how data is merged into a common transformation process. Some solutions incorporate libraries of prebuilt ETL transformations for both the transaction and interaction data that run on Hadoop. ETL also supports integration across transactional systems, operational data stores, BI platforms, master data management (MDM) hubs and the cloud.
Self-service data preparation is a fast-growing trend that puts the power of accessing, blending and
transforming data into the hands of business users and other nontechnical data professionals.
This way, less time is spent on data preparation and more time is spent on generating insights.
Consequently, both business and IT data professionals can improve productivity, and organizations can scale up their use of data to make better decisions.
5. ETL and Data Quality
Now that you have visualized the ___________________, let's try a short activity to see how much you understood our short introduction to the lesson.
There you go! I expect that you learned something today, and I am excited to hear your understanding of today's lesson. Answer the following questions:
Try to answer based on your understanding; it is better to construct your own sentences so you can practice your written communication skills.
Part 1: Identification: Write your answer for each number
________________1. When did data warehouses come onto the scene?
________________2. What does ETL stand for?
________________3. This helps us trace the lineage of data.
________________4. ETL has evolved over time to support emerging integration requirements for things like?
________________7. Advanced ETL tools can load and convert structured and unstructured data into?
OF
________________9. Multiple System and sources and still a core component of organizations?
T Y
1. HADOOP
ANSWER:
PR
3. How is ETL recognized in the industry?
Answer:
A. LESSON WRAP-UP
FAQs:

ETL stands for Extract, Transform, and Load, while ELT stands for Extract, Load, and Transform. In ETL, data flows from the data source to staging to the data destination. ELT lets the data destination do the transformation, eliminating the need for data staging.

ETL tools should be chosen according to your data integration requirements. These tools perform transformation, mapping, and cleansing of data.
Mark the place in the work tracker which is simply a visual to help you track how much work you have
accomplished and how much work there is left to do. This tracker will be part of your activity sheet.
To develop habits of thinking about learning, answer the questions below about your learning experience.

1. How was the module able to help you learn?

2. What did you realize about the topic?
A. MAIN LESSON
1. Activity 2: Content Notes
How It Works
ETL is closely related to a number of other data integration functions, processes and techniques. Understanding these provides a clearer view of how ETL works.

SQL – Structured Query Language (SQL) is the most common method of accessing and transforming data within a database.

Transformations, business rules and adapters – After extracting data, ETL uses business rules to transform the data into new formats. The transformed data is then loaded into the target.
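As a small illustration of SQL as an access-and-transform method, here is a sketch using Python's built-in sqlite3 module; the table and column names are invented for the example:

```python
import sqlite3

# In-memory database standing in for a source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, region TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 120.0, "nc"), (2, 80.0, "NC"), (3, 55.5, "ny")])

# SQL both accesses and transforms the data: UPPER() standardizes the
# region code, SUM() aggregates per region.
rows = conn.execute(
    "SELECT UPPER(region) AS region, SUM(amount) AS total "
    "FROM orders GROUP BY UPPER(region) ORDER BY region"
).fetchall()

print(rows)  # → [('NC', 200.0), ('NY', 55.5)]
```

The same statement would run against any SQL database; sqlite3 is used here only because it ships with Python.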
Data mapping – Data mapping is part of the transformation process. Mapping provides detailed instructions to an application about how to get the data it needs to process. It also describes which source field maps to which destination field.

For example, the third attribute from a data feed of website activity might be the user name, the fourth might be the time stamp of when that activity happened, and the fifth might be the product that the user clicked on. An application or ETL process using that data would have to map these same fields or attributes from the source system (i.e., the website activity data feed) into the format required by the destination system. If the destination system was a customer relationship management system, it might store the user name first and the time stamp fifth; it might not store the selected product at all.

In this case, a transformation to format the date in the expected format (and in the right order) might happen between the time the data is read from the source and written to the target.
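The website-feed example above can be sketched in a few lines of Python. The field positions follow the example in the text; the destination date format and field names are invented assumptions:

```python
from datetime import datetime

# A record from the source feed: the 3rd attribute is the user name,
# the 4th is the time stamp, the 5th is the clicked product.
source_record = ["click", "page-42", "jdoe", "2024-03-01T09:30:00", "widget-7"]

# Mapping: which source position feeds which destination field.
# The CRM does not store the product, so it is simply not mapped.
FIELD_MAP = {"user_name": 2, "timestamp": 3}

def map_to_crm(record):
    # Pull the mapped fields and reformat the date for the destination.
    ts = datetime.fromisoformat(record[FIELD_MAP["timestamp"]])
    return {
        "user_name": record[FIELD_MAP["user_name"]],  # stored first in the CRM
        "timestamp": ts.strftime("%m/%d/%Y %H:%M"),   # destination date format
    }

print(map_to_crm(source_record))
# → {'user_name': 'jdoe', 'timestamp': '03/01/2024 09:30'}
```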
Scripts – ETL is a method of automating the scripts (sets of instructions) that run behind the scenes to move and transform data. Before ETL, scripts were written individually in C or COBOL to transfer data between specific systems. This resulted in multiple databases running numerous scripts. Early ETL tools ran on mainframes as a batch process. ETL later migrated to UNIX and PC platforms. Organizations today still use both scripts and programmatic data movement methods.
ETL versus ELT – In the beginning, there was ETL. Later, organizations added ELT, a complementary method. ELT extracts data from a source system, loads it into a destination system and then uses the processing power of the destination system to conduct the transformations. This speeds data processing because it happens where the data lives.
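The contrast can be sketched with Python and sqlite3 (the table names and the name-standardization rule are invented): in the ETL style the transformation runs in application code before the load, while in the ELT style raw rows are loaded first and the destination database does the transformation in SQL.

```python
import sqlite3

raw = [("matt", 10), ("MATT", 5), ("jo", 7)]
dest = sqlite3.connect(":memory:")

# ETL style: transform first (standardize the name), then load.
dest.execute("CREATE TABLE sales (name TEXT, qty INTEGER)")
etl_rows = [(name.title(), qty) for name, qty in raw]
dest.executemany("INSERT INTO sales VALUES (?, ?)", etl_rows)

# ELT style: load the raw rows, then let the destination transform them.
dest.execute("CREATE TABLE sales_raw (name TEXT, qty INTEGER)")
dest.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw)
dest.execute(
    "CREATE TABLE sales_elt AS "
    "SELECT UPPER(SUBSTR(name,1,1)) || LOWER(SUBSTR(name,2)) AS name, qty "
    "FROM sales_raw")

# Both paths yield the same standardized result.
print(dest.execute(
    "SELECT name, SUM(qty) FROM sales GROUP BY name ORDER BY name").fetchall())
# → [('Jo', 7), ('Matt', 15)]
```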
Data quality – Before data is integrated, a staging area is often created where data can be cleansed, data values can be standardized (NC and North Carolina, Mister and Mr., or Matt and Matthew), addresses can be verified and duplicates can be removed. Many solutions are still standalone, but data quality procedures can now be run as one of the transformations in the data integration process.
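The standardization step described above (NC/North Carolina, Mister/Mr.) can be sketched with a simple lookup table; the function name and the fallback behavior are illustrative assumptions:

```python
# Canonical values keyed by the variants seen in source systems.
STANDARD_VALUES = {
    "north carolina": "NC",
    "nc": "NC",
    "mister": "Mr.",
    "mr.": "Mr.",
    "matthew": "Matt",
    "matt": "Matt",
}

def standardize(value):
    # Fall back to the original value when no rule matches.
    return STANDARD_VALUES.get(value.strip().lower(), value)

print([standardize(v) for v in ["North Carolina", "Mister", "Matthew", "Ohio"]])
# → ['NC', 'Mr.', 'Matt', 'Ohio']
```

Real data quality tools apply the same idea at scale, with rule engines and reference data instead of a hand-built dictionary.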
Scheduling and processing – ETL tools and technologies can provide either batch scheduling or real-time capabilities. They can also process data at high volumes in the server, or they can push down processing to the database level. Processing in the database as opposed to a specialized engine avoids data duplication and prevents the need to use extra capacity on the database platform.
Batch processing – ETL usually refers to a batch process of moving huge volumes of data between two systems during what's called a “batch window.” During this set period of time – say between noon and 1 p.m. – no actions can happen to either the source or target system as data is synchronized. Most banks do a nightly batch process to resolve transactions that occur throughout the day.
Web services – Web services are an internet-based method of providing data or functionality to various applications in near-real time. This method simplifies data integration processes and can deliver more value from data, faster. For example, let's say a customer contacts your call center. You could create a web service that returns the complete customer profile with a subsecond response time simply by passing a phone number to a web service that extracts the data from multiple sources or an MDM hub. With richer knowledge of the customer, the customer service rep can make better decisions about how to interact with the customer.
Master data management – MDM is the process of pulling data together to create a single view of the data across multiple sources. It includes both ETL and data integration capabilities to blend the data together and create a “golden record” or “best record.”
Data virtualization – Virtualization is an agile method of blending data together to create a virtual view of data without moving it. Data virtualization differs from ETL because, even though mapping and joining data still occur, there is no need for a physical staging table to store the results. That's because the view is often stored in memory and cached to improve performance. Some data virtualization solutions, like SAS Federation Server, provide dynamic data masking, randomization and hashing functions to protect sensitive data from specific roles or groups. SAS also provides on-demand data quality while the view is generated.
Event stream processing and ETL – When the speed of data increases to millions of events per second, event stream processing can be used to monitor streams of data, process the data streams and help make more timely decisions. An example in the energy space is using predictive analytics on streams of data to detect when a submersible pump is in need of repair, reducing both downtime and the scope and size of damage to the pump.
• It helps companies to analyze their business data for making critical business decisions.
• Transactional databases cannot answer complex business questions that can be answered by ETL.
• A Data Warehouse provides a common data repository.
• ETL provides a method of moving the data from various sources into a data warehouse.
• As data sources change, the Data Warehouse will automatically update.
• A well-designed and documented ETL system is almost essential to the success of a Data Warehouse project.
• It allows verification of data transformation, aggregation and calculation rules.
• The ETL process allows sample data comparison between the source and the target system.
• The ETL process can perform complex transformations and requires an extra area to store the data.
• ETL helps to migrate data into a Data Warehouse, converting it to various formats and types to adhere to one consistent system.
• ETL is a predefined process for accessing and manipulating source data into the target database.
• ETL offers deep historical context for the business.
• It helps to improve productivity because it codifies and reuses processes without a need for technical skills.
ETL Process in Data Warehouses
Extraction

In this step, data is extracted from the source system into the staging area. Any transformations are done in the staging area so that the performance of the source system is not degraded. Also, if corrupted data were copied directly from the source into the Data Warehouse database, rollback would be a challenge; the staging area gives an opportunity to validate extracted data before it moves into the Data Warehouse.

A Data Warehouse needs to integrate systems that have different DBMS, hardware, operating systems and communication protocols. Sources could include legacy applications like mainframes, customized applications, point-of-contact devices like ATMs and call switches, text files, spreadsheets, and so on. Hence one needs a logical data map before data is extracted and loaded physically. This data map describes the relationship between source and target data.

There are three data extraction methods:
1. Full Extraction
2. Partial Extraction – without update notification
3. Partial Extraction – with update notification

Irrespective of the method used, extraction should not affect the performance and response time of the source systems. These source systems are live production databases. Any slowdown or locking could affect the company's bottom line.
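A partial extraction without update notification is often implemented by tracking a high-water mark, as in this sketch (the table, column names and timestamps are invented; a real source would be a production database, not an in-memory one):

```python
import sqlite3

# In-memory database standing in for the live source system.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
src.executemany("INSERT INTO orders VALUES (?, ?)",
                [(1, "2024-01-01"), (2, "2024-02-15"), (3, "2024-03-10")])

def extract_incremental(conn, last_watermark):
    # Only pull rows changed since the previous run, keeping the load
    # on the live source system small.
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? "
        "ORDER BY updated_at", (last_watermark,)
    ).fetchall()
    # Remember the newest timestamp seen for the next run.
    new_watermark = rows[-1][1] if rows else last_watermark
    return rows, new_watermark

rows, wm = extract_incremental(src, "2024-02-01")
print(rows, wm)  # → [(2, '2024-02-15'), (3, '2024-03-10')] 2024-03-10
```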
Transformation

Data extracted from the source server is raw and not usable in its original form. Therefore it needs to be cleansed, mapped and transformed. In fact, this is the key step where the ETL process adds value and changes data such that insightful BI reports can be generated. In this step, you apply a set of functions on the extracted data. Data that does not require any transformation is called direct move or pass-through data.

Common data integration issues include:
1. There are multiple ways to denote a company name, like Google and Google Inc.
2. Use of different names, like Cleaveland and Cleveland.
3. Different account numbers may be generated by various applications for the same customer.
4. In some data, required fields remain blank.
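A few of the issues listed above can be handled by small transformation functions applied to each extracted record; the rules, field names and error handling here are illustrative assumptions:

```python
def clean_company(name):
    # Issue 1: multiple ways to denote a company name (Google, Google Inc.).
    return name.replace(" Inc.", "").replace(" Inc", "").strip()

def clean_city(city, known={"Cleaveland": "Cleveland"}):
    # Issue 2: different spellings of the same name.
    return known.get(city, city)

def clean_record(rec):
    # Issue 4: a blank required field is flagged rather than loaded.
    if not rec.get("customer"):
        raise ValueError("required field 'customer' is blank")
    return {**rec,
            "company": clean_company(rec["company"]),
            "city": clean_city(rec["city"])}

rec = {"customer": "C-1001", "company": "Google Inc.", "city": "Cleaveland"}
print(clean_record(rec))
# → {'customer': 'C-1001', 'company': 'Google', 'city': 'Cleveland'}
```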
Loading

Loading data into the target data warehouse database is the last step of the ETL process. In a typical Data Warehouse, a huge volume of data needs to be loaded in a relatively short period (nights). Hence, the load process should be optimized for performance.

In case of load failure, recovery mechanisms should be configured to restart from the point of failure without loss of data integrity. Data Warehouse admins need to monitor, resume, or cancel loads according to prevailing server performance.

Types of Loading:
1. Initial Load — populating all the Data Warehouse tables.
2. Incremental Load — applying ongoing changes periodically as needed.
3. Full Refresh — erasing the contents of one or more tables and reloading them with fresh data.

Load verification:
• Ensure that the key field data is neither missing nor null.
• Test modeling views based on the target tables.
• Check the combined values and calculated measures.
• Check data in the dimension tables as well as the history table.
• Check the BI reports on the loaded fact and dimension tables.
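The verification bullets above translate naturally into automated post-load checks, for example (the record layout and messages are invented for the sketch):

```python
def verify_load(rows, key="id"):
    """Run simple post-load checks; return a list of problems found."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        # The key field must be neither missing nor null.
        if row.get(key) is None:
            problems.append(f"row {i}: missing key '{key}'")
        elif row[key] in seen:
            problems.append(f"row {i}: duplicate key {row[key]}")
        else:
            seen.add(row[key])
    return problems

loaded = [{"id": 1, "amt": 10}, {"id": None, "amt": 5}, {"id": 1, "amt": 3}]
print(verify_load(loaded))
# → ["row 1: missing key 'id'", 'row 2: duplicate key 1']
```

An empty result list means the load passed; a real pipeline would extend the same pattern to calculated measures and dimension checks.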
ETL Best Practices

Never try to cleanse all the data: Every organization would like to have all of its data clean, but most are not ready to pay for it, or not ready to wait. Cleaning it all would simply take too long, so it is better not to try to cleanse all the data.

Never cleanse nothing: Always plan to clean something, because the biggest reason for building the Data Warehouse is to offer cleaner and more reliable data.

Determine the cost of cleansing the data: Before cleansing all the dirty data, it is important for you to determine the cleansing cost for every dirty data element.

To speed up query processing, have auxiliary views and indexes: To reduce storage costs, store summarized data on disk tapes. A trade-off between the volume of data to be stored and its detailed usage is also required: trade off the level of granularity of data to decrease the storage costs.
Now that you understand the ETL process, let's try a short activity to see how much you understood this short introduction to our lesson.

There you go! I expect that you learned something today. I am excited to hear your understanding of today's lesson. Answer the following questions.

Try to answer based on your understanding; it is better to construct your own sentences so you can practice your written communication skills.
Part 1: ENUMERATION: Write your answers inside the box.

Data Extraction Methods:
1.
2.
3.

Types of Loading:
1.
2.
3.

Data Integration Issues:
1.
2.
3.
4.
Part 2: Determine whether the following statements are TRUE or FALSE. Write DATA if the statement is TRUE and MINING if it is FALSE:
______________1. If your system encounters corruption of data, it will automatically copy the data into the Data Warehouse.
______________3. The best practice of ETL is to always try to cleanse all the data.
______________4. Extraction should not affect the performance and response time of the source system.
______________5. Test modeling views are not based on the target tables.
2. Why do you apply a set of functions on the extracted data in this step?
Answer:
A. LESSON WRAP-UP

FAQs:

1. What will happen if we do a Full Refresh in loading?
Answer:
Full Refresh is a process where the contents of one or more tables in the storage are erased and reloaded with fresh data.

2. Why do we do extraction staging?
Answer:
Because the staging area gives us an opportunity to validate extracted data before it moves into the Data Warehouse.
Mark the place in the work tracker which is simply a visual to help you track how much work you have
accomplished and how much work there is left to do. This tracker will be part of your activity sheet.
To develop habits on thinking about learning, answer the questions below about your learning
experience.
A. MAIN LESSON
1. Activity 2: Content Notes
Is your company suffering from a case of “Bad Data”? Everyone is following the process and doing their job correctly, but you still face issues with accurate reporting, operational errors, audit anxiety about your data, etc. Good data should be a given, right? Well, it's not that easy. In today's business environment, rapid growth, organizational change, and mergers and acquisitions (M&A) are very difficult to absorb within a fragmented data ecosystem. Multiple disparate IT systems, siloed databases, and deficient master data often result in data which is fragmented and inconsistent.

MDM is considered a cornerstone to a good data governance program and the health of your data ecosystem. Let's take a look at what MDM is and how an MDM program is implemented within an organization.
Master data is the core business data that is shared across business processes and application systems. It's not all the data about a subject area (Data Domain), but the pieces (Data Entities) you care about. These pieces are like the data elements on a web form with an asterisk (*) next to them: the mandatory ones.
Often, externally supplied data is also linked within master data, providing rich extensions to your internal data.
Numerous data sets are available today to expand your data validation, and provide a broader view of your entities.
Address validation, Geospatial data, Risk by Geo data, and Global Legal Entity data are a few of the many data sets
available.
Master data does not include the transactions that happen in the course of doing business. I’ve seen master data
described as “the Nouns of the data but not the Verbs.” The verbs or action words would be the transactional data.
Deciding on what the master data is within your company requires collaboration with all of your internal business partners, as well as IT, to get it right.
Master Data Management (MDM) is the organization of people, processes and technologies to create and maintain
an authoritative, reliable, sustainable, accurate, and secure data environment that represents a “single version of the
truth” for master data and its relationships across the enterprise.
Let's break that definition down a little further:

The People Involved

• MDM Program Executive Sponsor – provides organization and resource support; able to help articulate the importance of MDM in your organization. They would be the champion of change, helping communicate the benefits, the work to be done and how the change fits into corporate goals. They are a vital component of a successful Data Governance and Master Data Program.
Each business unit that shares a dependency on the data being considered for mastering should be represented in the MDM program. These are the people who will help define, prioritize and measure success.
• Data Stewards – The data stewards take ownership of day-to-day maintenance and data governance. They ensure the data meets the quality standards put in place and take corrective action on exceptions.

• System Owners – These are the people knowledgeable about specific business applications and systems within the data ecosystem. They provide subject-matter expertise on the systems, data timing, and the processes which operate within a particular business application. The system owners provide access and technical knowledge to the team.

• MDM SME – These are the navigators of implementing the MDM program. They bring the knowledge and experience to help plan and execute an MDM program. They help define and implement the solutions based on the specific needs of the company.

All of these people should be part of the Master Data Management team and can be considered the steering committee for putting MDM in place and managing the MDM program moving forward.
The Processes

There are two process areas to consider when looking at Master Data Management:

• The business processes that will be affected when implementing an MDM program. These changes are identified during the planning of an MDM initiative and described as a component of a Use Case. Multiple changes can happen to business processes as the MDM program matures.

• The governance processes put in place to maintain the master data program, the health of the data, and the monitoring, error handling and change of data after MDM is put in place.
The Technology

There are several excellent platforms in the market today that provide the tools to help you implement a master data management program. Whether they come as a suite or are selected individually, the functions they perform are similar and together make up your MDM platform.

At the core of your operational MDM platform is a data store, which houses the cleansed “single source of the truth.”

At the front end of your MDM platform is the extraction component, which moves all of your source data into the MDM platform as a one-time load. After this, your data will typically be incrementally loaded as change occurs. This is usually called the Extract, Transform and Load (ETL) component.

Within the MDM platform, the source data goes through a series of cleansing steps in order to become available as the “single source of truth.” When this is done, you're ready to share your mastered data with applications, data warehouses, business intelligence and customer interactions with a common view of cleansed, de-duplicated data.
In order to cleanse the data and make it ready, components within the MDM platform will need to be set up. This requires configuration of rules engines, validation against reference data, and possibly external validation or custom development.

These steps require analysis, design and development to create the cleansing path the data will take (the workflow). A good first step in the analysis is source data profiling. There are good tools out there (data profilers) to help you analyze and diagnose the quality of your data regarding accuracy, completeness and integrity.

This understanding can then be used to help define and automate the rules to put in place to build your cleansing workflow:
Standardization – Standardization involves aligning data from different sources into a common structure, format and definition, such as a customer name coming from two different systems which allow different lengths, case and spelling.

Matching – Once your data is in a comparable state, matching can take place. Matching is the identification of duplicate data based on a set of matching criteria. You may want all the records with the same company name, address and phone number to be considered the same record (a match).

Merging – Once duplicates are identified, they can be merged into that “Single Source of the Truth.” Duplicates may have varying degrees of quality, and the record with the highest quality is the survivor. On the first run of this process, the initial load, this survivor is designated as “Best of Breed.”

Best of Breed – On subsequent incremental processing of additional data, the current Best of Breed record will be challenged and replaced if a better record comes along.
Launching an MDM Program

Now that we have a sense of what Master Data and Master Data Management are, you may ask, “How would you implement a Master Data Management Program?”

Although MDM is much like other software development initiatives in many ways, it does have its peculiarities. MDM needs to be driven by the business leaders in collaboration with Information Technology. Too often, MDM is attempted by IT alone, which can lead to de-prioritization and failure, because IT is not the party that feels the most pain from bad data or sees the greatest gain from MDM.
An MDM implementation should start with a healthy planning phase where your steering committee is assembled, your executive sponsor is identified, the use cases you want to conquer are defined, the return on investment is developed, and a roadmap is established to guide the development based on the priorities of the organization.

I highly recommend appointing a navigator on the team who has traveled these waters before and can help avoid the rocks. They will be especially helpful in use case development, vendor selection and aligning the roadmap effectively.

Once the planning is in place, the development should follow a pattern for fulfilling the roadmap. Each use case, or common set of use cases, follows a standard lifecycle to deliver incremental benefit.

The Discovery phase is the time to do the data analysis and profiling to understand the source data and its condition. This is also the time to refine the use cases into functional requirements to prepare for the Design and Build of the solution. Notice the arrow in the diagram pointing back from Build to Discovery. This alludes to the iterative process typically followed in the development of an MDM solution. Much like peeling an onion, data nuances will be discovered and addressed as you move further through the phases.

This is why an Agile-style project works well with MDM. It lets you move through and adjust to get the most benefit from your efforts.

Whether you implement Master Data Management on a sophisticated graph database or on the back of a napkin (I don't recommend the napkin approach), you should realize that MDM is an ongoing process, not a single project. By its nature, MDM must evolve with the changes in your business. Keep this in mind as you move through the initiation of MDM, and your program will yield ever-increasing value to your organization.
Now that I have introduced MDM to you, let's try a short activity to see how much you understood this short introduction to our lesson.

2) Activity 3: Skill-building Activities

There you go! I expect that you learned something today. I am excited to hear your understanding of today's lesson. Answer the following questions.

Try to answer based on your understanding; it is better to construct your own sentences so you can practice your written communication skills.
Part 1: ILLUSTRATE and DISCUSS: Draw a diagram of the cleansing workflow and discuss each phase.
Draw here…
2. Why does MDM need business leaders?
Answer:
A. LESSON WRAP-UP
FAQs:

1. What is AGILE?
Answer:
In software development, agile practices approach discovering requirements and developing solutions through the collaborative effort of self-organizing and cross-functional teams and their customer/end user.

2. What is a workflow?
Answer:
A workflow is a sequence of tasks that processes a set of data. Workflows occur across every kind of business and industry. Anytime data is passed between humans and/or systems, a workflow is created. Workflows are the paths that describe how something goes from being undone to done, or raw to processed.
Mark the place in the work tracker which is simply a visual to help you track how much work you have
accomplished and how much work there is left to do. This tracker will be part of your activity sheet.
To develop habits of thinking about learning, answer the questions below about your learning experience.

1. How was the module able to help you learn?

2. What did you realize about the topic?