Prepared By : P.K.Chaubey
Data Independence
If a database system is not multi-layered, it becomes difficult to make any changes to it. Database systems are therefore designed in multiple layers, as we learned earlier.
A database system normally contains a lot of data in addition to users’ data. For example, it
stores data about data, known as metadata, to locate and retrieve data easily. It is rather
difficult to modify or update a set of metadata once it is stored in the database. But as a
DBMS expands, it needs to change over time to satisfy the requirements of its users. If all of this data were interdependent, making such changes would become a tedious and highly complex job.
Metadata itself follows a layered architecture, so that when we change data at one layer, it
does not affect the data at another level. This data is independent but mapped to each other.
Logical Data Independence
Logical data is data about the database; it describes how the data is managed inside, for example, a table (relation) stored in the database and all the constraints applied to that relation.
Logical data independence is the ability to change this logical schema without touching the actual data stored on the disk: if we change the table format, it should not change the data residing on the disk.
Physical Data Independence
All the schemas are logical, and the actual data is stored in bit format on the disk. Physical
data independence is the power to change the physical data without impacting the schema or
logical data.
For example, in case we want to change or upgrade the storage system itself − suppose we
want to replace hard-disks with SSD − it should not have any impact on the logical data or
schemas.
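The idea can be sketched with a view acting as the external layer. This is a minimal illustration using Python's built-in sqlite3 module; the `student` table and its columns are invented for the example. The application queries only the view, so a change to the underlying table (here, adding a column) does not affect it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student(id INTEGER, name TEXT, marks INTEGER)")
conn.execute("INSERT INTO student VALUES (1, 'Asha', 82)")

# The application sees only this external view, not the table itself.
conn.execute("CREATE VIEW student_names AS SELECT id, name FROM student")
before = conn.execute("SELECT name FROM student_names").fetchall()

# A logical-schema change: the table gains a column.
conn.execute("ALTER TABLE student ADD COLUMN email TEXT")

# The application's query is unaffected by the change.
after = conn.execute("SELECT name FROM student_names").fetchall()
```

The view is the mapping between layers: the application layer depends on the view's shape, not on the table's.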
Data Redundancy
In computer main memory, auxiliary storage and computer buses, data redundancy is the
existence of data that is additional to the actual data and permits correction of errors in stored
or transmitted data. The additional data can simply be a complete copy of the actual data, or
only select pieces of data that allow detection of errors and reconstruction of lost or damaged
data up to a certain level.
Data redundancy is a condition created within a database or data storage technology in which
the same piece of data is held in two separate places.
This can mean two different fields within a single database, or two different spots in multiple
software environments or platforms. Whenever data is repeated, it basically constitutes data
redundancy.
Data redundancy can occur by accident but is also done deliberately for backup and recovery
purposes
For example, by including additional data checksums, ECC memory is capable of detecting
and correcting single-bit errors within each memory word, while RAID 1 combines two hard
disk drives (HDDs) into a logical storage unit that allows stored data to survive a complete
failure of one drive. Data redundancy can also be used as a measure against silent data
corruption; for example, file systems such as Btrfs and ZFS use data and metadata
checksumming in combination with copies of stored data to detect silent data corruption and
repair its effects.
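The checksum idea mentioned above can be sketched in a few lines. This uses Python's standard `zlib.crc32` as the redundant check value; the payload bytes are purely illustrative. Storing the checksum alongside the data lets us detect later corruption:

```python
import zlib

data = b"important payload"
stored_checksum = zlib.crc32(data)  # redundant check data kept alongside the payload

# Verification: recompute the checksum and compare with the stored one.
ok = zlib.crc32(data) == stored_checksum

corrupted = b"important payl0ad"  # a single corrupted character
detected = zlib.crc32(corrupted) != stored_checksum
```

A CRC detects corruption but cannot repair it; ECC codes, as described above, add enough redundancy to also correct single-bit errors.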
While different in nature, data redundancy also occurs in database systems that have values
repeated unnecessarily in one or more records or fields, within a table, or where the field is
replicated/repeated in two or more tables. Often this is found in unnormalized database designs and results in the complication of database management, introducing the risk of corrupting the data and increasing the required amount of storage. When done on purpose from a previously normalized database schema, it may be considered a form of database denormalization, used to improve the performance of database queries.
For instance, when customer data are duplicated and attached with each product bought, then
redundancy of data is a known source of inconsistency since a given customer might appear
with different values for one or more of their attributes. Data redundancy leads to data
anomalies and corruption and generally should be avoided by design; applying database
normalization prevents redundancy and makes the best possible usage of storage.
Redundancy means having multiple copies of the same data in the database. This problem arises when a database is not normalized. Suppose a table of student details has the attributes: student ID, student name, college name, college rank, and course opted.
ID | Name | Contact | College | Courses | Rank
Insertion Anomaly: If the details of a student whose course has not yet been decided must be inserted, the insertion is not possible until a course is decided for that student.
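The cure for this redundancy is normalization: move the repeated facts into their own table. This is a pure-Python sketch with invented student and college values, showing that after the split the college's rank is stored exactly once, so updating it cannot produce inconsistent copies:

```python
# Denormalized: the college's rank is repeated with every student row.
students_flat = [
    {"id": 1, "name": "Asha", "college": "ABC", "rank": 2, "course": "BSc"},
    {"id": 2, "name": "Ravi", "college": "ABC", "rank": 2, "course": "BCom"},
]

# Normalized: the repeated fact moves to its own table, stored exactly once.
colleges = {"ABC": {"rank": 2}}
students = [
    {"id": 1, "name": "Asha", "college": "ABC", "course": "BSc"},
    {"id": 2, "name": "Ravi", "college": "ABC", "course": "BCom"},
]

# Updating the rank now touches a single row, so no inconsistency can arise.
colleges["ABC"]["rank"] = 1
ranks_seen = {colleges[s["college"]]["rank"] for s in students}
```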
Data Consistency
Data consistency means that copies of the same data kept in different places agree with one another; the data is inconsistent when they do not match.
Point-in-time consistency is an important property of backup files and a critical objective of
software that creates backups. It is also relevant to the design of disk memory systems,
specifically relating to what happens when they are unexpectedly shut down.
As a relevant backup example, consider a website with a database such as the online
encyclopedia, Wikipedia, which needs to be operational around the clock, but also must be
backed up with regularity to protect against disaster. Portions of Wikipedia are constantly
being updated every minute of every day, meanwhile, Wikipedia’s database is stored on
servers in the form of one or several very large files which require minutes or hours to back
up.
These large files, as with any database, contain numerous data structures which reference each other by location. For example, some structures are indexes which permit the database subsystem to quickly find search results. If the data structures cease to reference each other properly, the database can be said to be corrupted.
Application Consistency
Application Consistency is similar to Transaction consistency, but on a grander scale. Instead
of data consistency within the scope of a single transaction, data must be consistent within the
confines of many different transaction streams from one or more applications.
An application may be made up of many different types of data, such as multiple database
components, various types of files, and data feeds from other applications. Application
consistency is the state in which all related files and databases are in sync and represent the true status of the application.
Transaction Consistency
A transaction is a logical unit of work that may include any number of file or database
updates. During normal processing, transaction consistency is present only:
- before any transactions have run,
- following the completion of a successful transaction and before the next transaction begins, and
- when the application ends normally or the database is closed.
Following a failure of some kind, the data will not be transaction-consistent if transactions were in flight at the time of the failure. In most cases, once the application or database is restarted, the incomplete transactions are identified and the updates relating to these transactions are either "backed out", or processing resumes with the next dependent write.
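The "backing out" of an in-flight transaction can be demonstrated with SQLite's rollback. This is a sketch using Python's sqlite3 module with an invented two-account transfer: a failure strikes after the debit but before the credit, the transaction is rolled back, and both balances return to their consistent pre-transaction state:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account(name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 100), ('B', 100)")
conn.commit()

try:
    # A transfer is one logical unit of work: a debit plus a credit.
    conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
    raise RuntimeError("failure while the transaction is in flight")
    conn.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # the incomplete transaction is backed out

balances = dict(conn.execute("SELECT name, balance FROM account"))
```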
Types of DBMS
Four Types of DBMS systems are:
1. Hierarchical DBMS
In the hierarchical database model, data is organized in a tree-like structure and stored in a hierarchical (top-down or bottom-up) format. Data is represented using parent-child relationships: in a hierarchical DBMS a parent may have many children, but each child has only one parent.
2. Network Model
The network database model allows each child to have multiple parents. It helps you to address the need to model more complex relationships, such as the orders/parts many-to-many relationship. In this model, entities are organized in a graph which can be accessed through several paths.
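The structural difference between the two models can be sketched in plain Python; the record names (departments, orders, parts) are invented for illustration. In the hierarchical model each record points to exactly one parent, so there is a unique path to the root; in the network model a record may list several parents:

```python
# Hierarchical model: each record names exactly one parent (a tree).
parent = {"company": None, "sales": "company", "hr": "company", "ravi": "sales"}

def path_to_root(node):
    """Follow the single parent links up to the root."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

# Network model: a record may have several parents (a graph),
# e.g. a part that appears on two different orders.
parents = {"part_42": ["order_1", "order_2"]}
```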
3. Relational model
Relational DBMS is the most widely used DBMS model because it is one of the easiest. This model is based on normalizing data into the rows and columns of tables. Data in the relational model is stored in fixed structures and manipulated using SQL.
4. Object-Oriented Model
In the object-oriented model, data is stored in the form of objects. The structures that hold the data are called classes. This model defines a database as a collection of objects that store both data member values and operations.
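A small sketch of the object-oriented idea, with an invented `Account` class: the class bundles data member values (owner, balance) with operations (deposit), and an object database is then simply a collection of such objects:

```python
class Account:
    """A class bundles data members with the operations on them."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        """An operation stored with the data it acts on."""
        self.balance += amount
        return self.balance

# An "object database" here is simply a collection of such objects.
db = [Account("Asha", 100), Account("Ravi")]
db[1].deposit(50)
```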
Advantages of DBMS
- A DBMS offers a variety of techniques to store and retrieve data.
- It serves as an efficient handler, balancing the needs of multiple applications using the same data.
- It provides uniform administration procedures for data.
- Application programmers are never exposed to the details of data representation and storage.
- It uses various powerful functions to store and retrieve data efficiently.
- It offers data integrity and security: the DBMS applies integrity constraints to achieve a high level of protection against unauthorized access to data.
- It schedules concurrent access to the data so that users do not interfere with one another (for example, only one user can modify a given data item at a time).
- It reduces application development time.
Disadvantage of DBMS
A DBMS offers plenty of advantages, but it has certain flaws:
- The cost of the hardware and software of a DBMS is quite high, which increases an organization's budget.
- Most database management systems are complex, so training is required for users of the DBMS.
- In some organizations, all data is integrated into a single database, which can be damaged by an electrical failure or by corruption of the storage media.
- Use of the same program at the same time by many users can sometimes lead to the loss of some data.
- A DBMS cannot perform sophisticated calculations.
DBMS Architecture
The design of a DBMS depends on its architecture. It can be centralized or decentralized or
hierarchical. The architecture of a DBMS can be seen as either single tier or multi-tier. An n-
tier architecture divides the whole system into related but independent n modules, which can
be independently modified, altered, changed, or replaced.
In 1-tier architecture, the DBMS is the only entity where the user directly sits on the DBMS
and uses it. Any changes done here will directly be done on the DBMS itself. It does not
provide handy tools for end-users. Database designers and programmers normally prefer to
use single-tier architecture.
If the architecture of DBMS is 2-tier, then it must have an application through which the
DBMS can be accessed. Programmers use 2-tier architecture where they access the DBMS by
means of an application. Here the application tier is entirely independent of the database in
terms of operation, design, and programming.
3-tier Architecture
A 3-tier architecture separates its tiers from each other based on the complexity of the users
and how they use the data present in the database. It is the most widely used architecture to
design a DBMS.
Database (Data) Tier − At this tier, the database resides along with its query
processing languages. We also have the relations that define the data and
their constraints at this level.
Application (Middle) Tier − At this tier reside the application server and
the programs that access the database. For a user, this application tier
presents an abstracted view of the database. End-users are unaware of any
existence of the database beyond the application. At the other end, the
database tier is not aware of any other user beyond the application tier.
Hence, the application layer sits in the middle and acts as a mediator
between the end-user and the database.
User (Presentation) Tier − End-users operate on this tier, and they know
nothing about any existence of the database beyond this layer. At this layer,
multiple views of the database can be provided by the application. All views
are generated by applications that reside in the application tier.
Multiple-tier database architecture is highly modifiable, as almost all its components are
independent and can be changed independently.
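The three tiers can be sketched as three independent layers of Python functions; the product table and its values are invented. The presentation layer never touches the data layer directly, only the application layer in the middle:

```python
# Data tier: storage plus the relations that define the data.
_tables = {"product": {1: {"name": "pen", "price": 10}}}

def data_get(table, key):
    """Data tier: the only code that touches the stored relations."""
    return _tables[table][key]

# Application tier: mediates between the end-user and the database,
# presenting an abstracted view of the data.
def app_price_label(product_id):
    p = data_get("product", product_id)
    return f"{p['name']}: Rs. {p['price']}"

# Presentation tier: the only layer the end-user ever sees.
def show(product_id):
    return app_price_label(product_id)
```

Because each layer calls only the one below it, any layer can be replaced (say, swapping the dict for a real database) without changing the others.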
Database Schema
A database schema is the skeleton structure that represents the logical view of the entire
database. It defines how the data is organized and how the relations among them are
associated. It formulates all the constraints that are to be applied on the data.
A database schema defines its entities and the relationship among them. It contains a
descriptive detail of the database, which can be depicted by means of schema diagrams. It’s
the database designers who design the schema to help programmers understand the database
and make it useful.
Database Instance
It is important to distinguish these two terms. The database schema is the skeleton of the database, designed before the database exists. Once the database is operational, it is very difficult to make any changes to the schema. A database schema does not contain any data or information.
A database instance is a state of operational database with data at any given time. It contains
a snapshot of the database. Database instances tend to change with time. A DBMS ensures
that its every instance (state) is in a valid state, by diligently following all the validations,
constraints, and conditions that the database designers have imposed.
Data Warehousing
The term "Data Warehouse" was first coined by Bill Inmon in 1990. According to Inmon, a
data warehouse is a subject oriented, integrated, time-variant, and non-volatile collection of
data. This data helps analysts to take informed decisions in an organization.
An operational database undergoes frequent changes on a daily basis on account of the
transactions that take place. Suppose a business executive wants to analyze previous feedback
on any data such as a product, a supplier, or any consumer data, then the executive will have
no data available to analyze because the previous data has been updated due to transactions.
A data warehouse provides generalized and consolidated data in a multidimensional view. Along with this generalized and consolidated view of data, a data warehouse also provides Online Analytical Processing (OLAP) tools. These tools help us in interactive and effective
analysis of data in a multidimensional space. This analysis results in data generalization and
data mining.
Data mining functions such as association, clustering, classification, and prediction can be integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of abstraction. That is why the data warehouse has now become an important platform for data analysis and online analytical processing.
Understanding a Data Warehouse
- A data warehouse is a database that is kept separate from the organization's operational database.
- There is no frequent updating done in a data warehouse.
- It possesses consolidated historical data, which helps the organization analyze its business.
- A data warehouse helps executives organize, understand, and use their data to make strategic decisions.
- Data warehouse systems help in the integration of a diversity of application systems.
- A data warehouse system helps in consolidated historical data analysis.
Why a Data Warehouse is Separated from Operational Databases
A data warehouse is kept separate from operational databases for the following reasons −
- An operational database is constructed for well-known tasks and workloads, such as searching particular records, indexing, etc. In contrast, data warehouse queries are often complex and present a general form of data.
- Operational databases support concurrent processing of multiple transactions. Concurrency control and recovery mechanisms are required for operational databases to ensure the robustness and consistency of the database.
- An operational database query allows read and modify operations, while an OLAP query needs only read-only access to stored data.
- An operational database maintains current data. On the other hand, a data warehouse maintains historical data.
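The contrast between the two query styles can be sketched in plain Python over an invented sales table: the operational (OLTP) pattern reads and modifies one particular record, while the warehouse (OLAP) pattern is a read-only aggregation across the historical rows:

```python
sales = [
    {"year": 2022, "product": "pen", "amount": 120},
    {"year": 2022, "product": "ink", "amount": 80},
    {"year": 2023, "product": "pen", "amount": 150},
]

# Operational (OLTP) style: look up and modify one current record.
row = next(r for r in sales if r["year"] == 2023 and r["product"] == "pen")
row["amount"] += 10

# Warehouse (OLAP) style: read-only aggregation over historical data.
total_by_year = {}
for r in sales:
    total_by_year[r["year"]] = total_by_year.get(r["year"], 0) + r["amount"]
```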
Note − Data cleaning and data transformation are important steps in improving the quality of
data and data mining results.
Data Mining
There is a huge amount of data available in the Information Industry. This data is of no use
until it is converted into useful information. It is necessary to analyze this huge amount of
data and extract useful information from it.
Extraction of information is not the only process we need to perform; data mining also
involves other processes such as Data Cleaning, Data Integration, Data Transformation, Data
Mining, Pattern Evaluation and Data Presentation. Once all these processes are over, we
would be able to use this information in many applications such as Fraud Detection, Market
Analysis, Production Control, Science Exploration, etc.
Data mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data. The information or knowledge so extracted can be used in any of the following applications:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Data Mining Applications
Data mining is highly useful in the following domains:
Market Analysis and Management
Corporate Analysis & Risk Management
Fraud Detection
Apart from these, data mining can also be used in the areas of production control, customer
retention, science exploration, sports, astrology, and Internet Web Surf-Aid
Market Analysis and Management
Listed below are the various fields of market where data mining is used:
Customer Profiling: Data mining helps determine what kind of people buy
what kind of products.
Identifying Customer Requirements: Data mining helps in identifying the
best products for different customers. It uses prediction to find the factors
that may attract new customers.
Cross-Market Analysis: Data mining performs association/correlation analysis between product sales.
Target Marketing: Data mining helps to find clusters of model customers
who share the same characteristics such as interests, spending habits,
income, etc.
Determining Customer Purchasing Patterns: Data mining helps in determining customers' purchasing patterns.
Providing Summary Information: Data mining provides us various
multidimensional summary reports.
Corporate Analysis and Risk Management
Data mining is used in the following fields of the Corporate Sector:
Mining of Correlations
It is a kind of additional analysis performed to uncover interesting statistical correlations between associated attribute-value pairs or between two item sets, to analyze whether they have a positive, negative, or no effect on each other.
Mining of Clusters
A cluster refers to a group of similar objects. Cluster analysis refers to forming groups of objects that are very similar to each other but highly different from the objects in other clusters.
Classification and Prediction
Classification is the process of finding a model that describes the data classes or concepts.
The purpose is to be able to use this model to predict the class of objects whose class label is
unknown. This derived model is based on the analysis of sets of training data. The derived
model can be presented in the following forms −
Classification (IF-THEN) Rules
Decision Trees
Mathematical Formulae
Neural Networks
The list of functions involved in these processes is as follows −
Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e., data objects whose class labels are well known.
Prediction − It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used for identification of distribution trends based on available data.
Outlier Analysis − Outliers may be defined as the data objects that do not comply with the general behavior or model of the data available.
Evolution Analysis − Evolution analysis refers to the description and modeling of regularities or trends for objects whose behavior changes over time.
Data Mining Task Primitives
We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data
mining system. Here is the list of Data Mining Task Primitives −
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
Set of task relevant data to be mined
This is the portion of database in which the user is interested. This portion includes the
following −
Database Attributes
Therefore, the data analysis task is an example of numeric prediction. In this case, a model or a predictor will be constructed that predicts a continuous-valued function, or ordered value.
With the help of the bank loan application that we have discussed above, let us understand
the working of classification. The Data Classification process includes two steps:
Statistics can help to a great extent in answering questions about the data, such as:
- What are the patterns in the database?
- What is the probability of an event occurring?
- Which patterns are more useful to the business?
- What is the high-level summary that can give a detailed view of what is in the database?
Statistics not only answers these questions; it also helps in summarizing and counting the data and in providing information about the data with ease. Through statistical reports, people can make smart decisions. There are different forms of statistics, but the most important and useful techniques involve the collection and counting of data. Common ways to summarize data include:
- Histogram
- Mean
- Median
- Mode
- Variance
- Max
- Min
- Linear Regression
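Most of the summaries above are available directly in Python's standard `statistics` module; the data values here are an invented sample for illustration:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),           # most frequent value
    "variance": statistics.pvariance(data),  # population variance
    "min": min(data),
    "max": max(data),
}
```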
2. Clustering Technique
Clustering is one of the oldest techniques used in Data Mining. Clustering analysis is the
process of identifying data that are similar to each other. This will help to understand the
differences and similarities between the data. This is sometimes called segmentation and
helps the users to understand what is going on within the database. For example, an insurance
company can group its customers based on their income, age, nature of policy and type of
claims.
There are different types of clustering methods. They are as follows
Partitioning Methods
Hierarchical Agglomerative methods
Density Based Methods
Grid Based Methods
Model Based Methods
The most popular such algorithm is Nearest Neighbour, a technique very similar to clustering. It is a prediction technique: to estimate a value in one record, look for records with similar values in the historical database and use the value from the record nearest to the unclassified record. The technique simply states that objects close to each other will have similar prediction values, so the values of nearby objects can be predicted easily. Nearest Neighbour is one of the easiest techniques to use because it works the way people think. It also works very well in terms of automation and performs complex ROI calculations with ease. Its level of accuracy is as good as that of the other data mining techniques.
In business, the Nearest Neighbour technique is most often used in text retrieval: it finds the documents that share important characteristics with a main document that has been marked as interesting.
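A one-nearest-neighbour predictor fits in a few lines of plain Python; the one-dimensional training pairs below are invented for illustration. The prediction for a query is simply the label of the closest historical record:

```python
def predict_1nn(train, query):
    """train: list of (value, label) pairs; return the label of the closest value."""
    _, label = min(train, key=lambda pair: abs(pair[0] - query))
    return label

# Invented historical records: a value and its known label.
history = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]
```

Real systems use distance over many attributes rather than a single number, but the principle is the same: nearby records supply the prediction.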
3. Visualization
Visualization is a very useful technique for discovering data patterns and is used at the beginning of the data mining process. Much research is ongoing to produce interesting projections of databases, an approach called Projection Pursuit. Many data mining techniques produce useful patterns only from good data; visualization is a technique that converts poor data into good data, letting different kinds of data mining methods be used to discover hidden patterns.
4. Induction Decision Tree Technique
A decision tree is a predictive model and the name itself implies that it looks like a tree. In
this technique, each branch of the tree is viewed as a classification question and the leaves of
the trees are considered as partitions of the dataset related to that particular classification.
This technique can be used for exploration analysis, data pre-processing and prediction work.
A decision tree can be considered a segmentation of the original dataset, where the segmentation is done for a particular reason. The data within each segment share some similarity in the information being predicted. Decision trees provide results that can be easily understood by the user.
Decision tree technique is mostly used by statisticians to find out which database is more
related to the problem of the business. Decision tree technique can be used for Prediction and
Data pre-processing.
The first and foremost step in this technique is growing the tree. The basis of growing the tree is finding the best possible question to ask at each branch of the tree. The decision tree stops growing under any one of the following circumstances:
- The segment contains only one record.
- All the records contain identical features.
- The growth is not enough to make any further split.
CART which stands for Classification and Regression Trees is a data exploration and
prediction algorithm which picks the questions in a more complex way. It tries them all and
then selects one best question which is used to split the data into two or more segments. After
deciding on the segments it again asks questions on each of the new segment individually.
Another popular decision tree technology is CHAID (Chi-Square Automatic Interaction
Detector). It is similar to CART but it differs in one way. CART helps in choosing the best
questions whereas CHAID helps in choosing the splits.
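The "try them all and pick the best question" step of CART can be sketched for one numeric attribute. This is a simplified illustration (real CART handles many attributes and categorical splits): every midpoint between consecutive values is a candidate question, scored by the weighted Gini impurity of the two resulting segments, and the lowest score wins:

```python
def gini(labels):
    """Gini impurity of a set of class labels (0.0 means a pure segment)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(xs, ys):
    """Try every midpoint threshold; return (threshold, weighted impurity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    best = (None, float("inf"))
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        t = (xs[i] + xs[i - 1]) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(xs)
        if score < best[1]:
            best = (t, score)
    return best
```

On clearly separated data, the chosen question lands between the two groups, giving two pure segments.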
5. Neural Network
The neural network is another important technique in use today, most often applied in the early stages of data mining technology. The artificial neural network emerged from the artificial intelligence community.
Neural networks are fairly easy to use, as they are automated to a certain extent, so the user is not expected to have much knowledge about the work or the database. But to make a neural network work efficiently, you need to know:
- How are the nodes connected?
- How many processing units should be used?
- When should the training process be stopped?
This technique has two main parts: the node and the link.
- The node, which loosely corresponds to a neuron in the human brain.
- The link, which loosely corresponds to the connections between neurons in the human brain.
A neural network is a collection of interconnected neurons, which may form a single layer or multiple layers. The arrangement of the neurons and their interconnections is called the architecture of the network. There is a wide variety of neural network models, and each model has its
own advantages and disadvantages. Every neural network model has different architectures
and these architectures use different learning procedures.
Neural networks are a very powerful predictive modelling technique, but they are not easy to understand, even for experts, because they create very complex models that are impossible to interpret fully. To make the technique more approachable, companies are seeking new solutions, and two have already been suggested:
- The neural network is packaged into a complete solution that lets it be used for a single application.
- It is bundled with expert consulting services.
Neural networks have been used in various kinds of applications, for example in business to detect fraud.
6. Association Rule Technique
This technique helps to find associations between two or more items and reveals the relations between different variables in databases. It discovers hidden patterns in the data sets, identifying variables and combinations of variables that appear together with the highest frequencies.
An association rule offers two major pieces of information:
- Support − how often is the rule applicable?
- Confidence − how often is the rule correct?
This technique follows a two-step process:
- Find all the frequently occurring item sets.
- Create strong association rules from the frequent item sets.
There are three types of association rule. They are
Multilevel Association Rule
Multidimensional Association Rule
Quantitative Association Rule
This technique is most often used in the retail industry to find patterns in sales, which helps increase the conversion rate and thus increase profit.
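Support and confidence, as defined above, can be computed directly over a set of transactions. The shopping baskets below are invented for illustration; each transaction is a Python set, and a rule like bread → milk is an antecedent set and a consequent set:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Invented example baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
```

Here the rule bread → milk holds in 2 of 4 baskets (support 0.5) and in 2 of the 3 baskets containing bread (confidence 2/3).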
7. Classification
Classification is the most commonly used data mining technique; it uses a set of pre-classified samples to create a model that can classify a large set of data. This technique helps in deriving important information about data and metadata (data about data). It is closely related to the cluster analysis technique, and it uses decision tree or neural network systems. Two main processes are involved:
- Learning − in this process, the training data are analyzed by the classification algorithm.
- Classification − in this process, test data are used to measure the accuracy of the classification rules.
There are different types of classification models. They are as follows
Classification by decision tree induction
Bayesian Classification
Neural Networks
Support Vector Machines (SVM)
Classification Based on Associations
A good example of the classification technique is an email provider's spam filter.
Record − Records are composed of fields, each of which contains one item of information.
Field − A field is an area in a fixed or known location in a unit of data, such as a record, message header, or computer instruction, that has a purpose and usually a fixed size.
Form − A form is a database object that you can use to enter, edit, or display data from a table or a query.
Query − A query is a request for data or information from a database table or combination of tables.
Table − A table is a data structure that organizes information into rows and columns.
Schema − A schema is the structure of a database described in a formal language supported by the database management system (DBMS). The term refers to the organization of data as a blueprint of how the database is constructed (divided into database tables, in the case of relational databases).
Views − Views in SQL are considered virtual tables. A view also contains rows and columns. To create a view, we can select fields from one or more tables present in the database. A view can contain either specific rows based on a certain condition or all the rows of a table.
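Several of these terms (table, query, view) can be seen together in one short SQLite session; the `student` table and its rows are invented for the example. The view keeps only the rows meeting a condition, exactly as described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A table organizes information into rows and columns.
conn.execute("CREATE TABLE student(id INTEGER, name TEXT, marks INTEGER)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "Asha", 82), (2, "Ravi", 45), (3, "Meena", 91)])

# A query requests data from the table.
rows = conn.execute(
    "SELECT name FROM student WHERE marks > 50 ORDER BY id").fetchall()

# A view is a virtual table: here it keeps only rows meeting a condition.
conn.execute("CREATE VIEW toppers AS SELECT id, name FROM student WHERE marks > 80")
toppers = conn.execute("SELECT name FROM toppers ORDER BY id").fetchall()
```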