Name - Krushitha V.P.   Roll No. - 520791371   Assignment Set 2   Subject - MC0077 Advanced Database Systems
August 2010
Master of Computer Application (MCA) Semester 4
MC0077 Advanced Database Systems
Assignment Set 2
6. Explain the following concepts with respect to Distributed
Database Systems:
A) Data Replication
B) Options for Multi Master Replication
A) Data Replication
Replication is the process of copying and maintaining database objects, such
as tables, in multiple databases that make up a distributed database system.
Changes applied at one site are captured and stored locally before being
forwarded and applied at each of the remote locations. Advanced Replication
is a fully integrated feature of the Oracle server; it is not a separate server.
Replication uses distributed database technology to share data between
multiple sites, but a replicated database and a distributed database are not
the same. In a distributed database, data is available at many locations, but a
particular table resides at only one location. For example, the employees
table resides at only the loc1.world database in a distributed database
system that also includes the loc2.world and loc3.world databases.
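For illustration (Oracle syntax; the database-link name follows the example above), a site that does not own the table reaches it remotely through a database link:

-- distributed (non-replicated) case: employees exists only at
-- loc1.world, so other sites must query it over a database link
SELECT * FROM employees@loc1.world;

With replication, the same query runs against the local copy of employees at each site.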
Replication means that the same data is available at multiple locations. For
example, the employees table is available at loc1.world, loc2.world, and
loc3.world. Some of the most common reasons for using replication are
described as follows:
Availability
Replication provides user applications with alternative data-access options: if one site becomes unavailable, users can continue to query, or even update, data at another replication site. In other words, replication improves the availability of applications that depend on shared data.
Performance
Replication provides fast, local access to shared data because it balances
activity over multiple sites. Some users can access one server while other
users access different servers, thereby reducing the load at all servers. Also,
users can access data from the replication site that has the lowest access
cost, which is typically the site that is geographically closest to them.
Disconnected Computing
A Materialized View is a complete or partial copy (replica) of a target table
from a single point in time. Materialized views enable users to work on a
subset of a database while disconnected from the central database server.
Later, when a connection is established, users can synchronize (refresh)
materialized views on demand. When users refresh materialized views, they
update the central database with all of their changes, and they receive any
changes that may have happened while they were disconnected.
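As a minimal sketch in Oracle syntax (names are illustrative; a fast-refreshable view additionally requires a materialized view log on the master table):

-- local, refresh-on-demand replica of the remote employees table
CREATE MATERIALIZED VIEW employees_mv
  REFRESH FAST ON DEMAND
  AS SELECT * FROM employees@loc1.world;

-- later, once a connection is re-established:
BEGIN
  DBMS_MVIEW.REFRESH('employees_mv');
END;
/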
Network Load Reduction
Replication can be used to distribute data over multiple regional locations.
Then, applications can access various regional servers instead of accessing
one central server. This configuration can reduce network load dramatically.
Mass Deployment
Replication makes it possible to roll out many database environments from a single definition. Deployment templates let an administrator pre-create a materialized view environment and then deploy it, pre-built, to a large number of users, for example a mobile sales force, each of whom works against a local replica.
B) Options for Multi Master Replication
Multi Master Replication (also called peer-to-peer or n-way replication)
enables multiple sites, acting as equal peers, to manage groups of replicated
database objects. Each site in a multi-master replication environment is a
master site, and each site communicates with the other master sites.
Asynchronous replication is the most common way to implement multi-master replication. However, you have two other options: Synchronous Replication and Procedural Replication.
A Multi-Master replication environment can use either asynchronous or
synchronous replication to copy data. With asynchronous replication, changes
made at one master site occur at a later time at all other participating master
sites. With synchronous replication, changes made at one master site occur
immediately at all other participating master sites.
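As a sketch of how a master group might be defined with Oracle's Advanced Replication API, the DBMS_REPCAT package (the hr schema, employees table, and group name are illustrative assumptions):

BEGIN
  -- create a master group and add a table to it
  DBMS_REPCAT.CREATE_MASTER_REPGROUP(gname => 'hr_repg');
  DBMS_REPCAT.CREATE_MASTER_REPOBJECT(
      gname => 'hr_repg', type => 'TABLE',
      sname => 'hr',      oname => 'employees');
  -- generate the triggers and packages that capture and forward changes
  DBMS_REPCAT.GENERATE_REPLICATION_SUPPORT(
      sname => 'hr', oname => 'employees', type => 'TABLE');
  -- start replication activity for the group
  DBMS_REPCAT.RESUME_MASTER_ACTIVITY(gname => 'hr_repg');
END;
/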
5. Explain the following:
A) Need for Fuzzy Databases
B) Techniques for implementation of Fuzziness in Databases
C) Classification of Data
A) Need for Fuzzy Databases
As the application of database technology moves outside the realm of a crisp mathematical world into the realm of the real world, the need to handle imprecise information becomes important. A database that can handle imprecise information stores not only raw data but also related information that allows us to interpret the data in a much deeper context. For example, the query "Which student is young and has sufficiently good grades?" captures the real intention of the user far better than a crisp query such as
SELECT * FROM STUDENT
WHERE AGE < 19 AND GPA > 3.5
Such a technology has wide application in areas such as medical diagnosis, employment, and investment, because in these areas subjective and uncertain information is not only common but also very important.
B) Techniques for implementation of Fuzziness in Databases
One of the major concerns in the design and implementation of fuzzy databases is efficiency, i.e. these systems must be fast enough to make interaction with human users feasible. In general, there are two feasible ways to incorporate fuzziness in databases (a sketch of the first follows the list):
1. Making fuzzy queries to the classical databases
2. Adding fuzzy information to the system
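A minimal sketch of the first approach over an ordinary (crisp) SQL database: the membership degree for the fuzzy term "young" is computed inside the query itself. The student table and the 18/25 breakpoints are illustrative assumptions:

-- degree to which each student is "young": 1 up to age 18,
-- 0 from age 25, falling linearly in between
SELECT s.*,
       CASE
         WHEN s.age <= 18 THEN 1.0
         WHEN s.age >= 25 THEN 0.0
         ELSE (25 - s.age) / 7.0
       END AS young_degree
FROM student s
ORDER BY young_degree DESC;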
C) Classification of Data
The information data can be classified as follows:
1. Crisp: There is no vagueness in the information.
e.g., X = 13
Temperature = 90
2. Fuzzy: There is vagueness in the information, and this can be further divided into two types:
a. Approximate Value: The information is not totally vague; some approximate value is known, and the data lies near that value.
e.g., 10 ≤ X ≤ 15
Temperature ≈ 85
These are considered to have a triangular-shaped possibility distribution.
[Figure omitted: triangular possibility distribution for an approximate value]
b. Linguistic: The information is expressed by a linguistic term, e.g., Temperature is HOT. These are considered to have a trapezoidal-shaped possibility distribution.
[Figure omitted: possibility distribution for the linguistic term SMALL of the linguistic variable HEIGHT]
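For reference, a triangular possibility distribution with support [a, b] and peak m can be written as follows; for the approximate-value example above one might take a = 10 and b = 15:

\[
\pi(x) =
\begin{cases}
(x-a)/(m-a), & a \le x \le m \\
(b-x)/(b-m), & m \le x \le b \\
0, & \text{otherwise}
\end{cases}
\]

A trapezoidal distribution for a linguistic term has the same shape except that membership is 1 over an interval rather than at a single peak.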
4. Describe the following Data Mining Functions:
A) Classification B) Associations
C) Sequential/Temporal patterns D) Clustering/Segmentation
Data mining methods may be classified by the function they perform or
according to the class of application they can be used in.
A) Classification
Data Mining tools have to infer a model from the database, and in the case of
Supervised Learning this requires the user to define one or more classes. The
database contains one or more attributes that denote the class of a tuple and
these are known as predicted attributes whereas the remaining attributes are
called predicting attributes. A combination of values for the predicted
attributes defines a class.
When learning classification rules, the system has to find rules that predict the class from the predicting attributes. First the user defines the conditions for each class; the data mining system then constructs descriptions for the classes.
Once classes are defined, the system should infer the rules that govern the classification; that is, it should be able to find the description of each class. The descriptions should refer only to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do.
A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.
A rule is generally presented as: if the left-hand side (LHS) then the right-hand side (RHS), so that in all instances where LHS is true, RHS is also true or at least very probable. The categories of rules are: exact rules, which permit no exceptions; strong rules, which allow a limited number of exceptions; and probabilistic rules, which relate the conditional probability P(RHS|LHS) to the prior probability P(RHS).
B) Associations
Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on the opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.
A typical application, identified by IBM, that can be built using an association function is Market Basket Analysis. Here a retailer runs an association operator over the point-of-sale transaction log, which contains, among other information, transaction identifiers and product identifiers. The set of product identifiers listed under the same transaction identifier constitutes a record. The output of the association function is, in this case, a list of product affinities. Thus, by invoking an association function, the market basket analysis application can determine affinities such as "20% of the time that a specific brand of toaster is sold, customers also buy a set of kitchen gloves and matching cover sets."
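As an illustrative SQL sketch (the sales table with tx_id and product_id columns is a hypothetical point-of-sale log), the confidence of the rule {toaster} => {gloves} can be computed directly:

-- of the transactions containing a toaster, what percentage
-- also contain gloves?
SELECT 100.0 * COUNT(DISTINCT g.tx_id)
             / COUNT(DISTINCT t.tx_id) AS confidence_pct
FROM sales t
LEFT JOIN sales g
       ON g.tx_id = t.tx_id
      AND g.product_id = 'gloves'
WHERE t.product_id = 'toaster';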
C) Sequential/Temporal patterns
3. Explain:
A) Data Dredging
B) Data Mining Techniques
A) Data Dredging
Data Dredging or Data Fishing are terms one may use to criticize someone's data mining efforts when it is felt that the patterns or causal relationships discovered are unfounded. In this case the pattern suffers from overfitting on the training data.
Data Dredging is the scanning of the data for any relationship and then, when one is found, coming up with an interesting explanation for it. The conclusions may be suspect because data sets with large numbers of variables contain, by chance, some "interesting" relationships. Fred Schwed said:
"There have always been a considerable number of people who busy
themselves examining the last thousand numbers which have appeared on a
roulette wheel, in search of some repeating pattern. Sadly enough, they have
usually found it."
Nevertheless, determining correlations in investment analysis has proven to be very profitable for statistical arbitrage operations (such as pairs-trading strategies), and correlation analysis has been shown to be very useful in risk management. Indeed, finding correlations in the financial markets, when done properly, is not the same as finding false patterns in roulette wheels.
Some exploratory data work is always required in any applied statistical
analysis to get a feel for the data, so sometimes the line between good
statistical practice and data dredging is less than clear. Most data mining
efforts are focused on developing highly detailed models of some large data
set. Other researchers have described an alternate method that involves
finding the minimal differences between elements in a data set, with the goal
of developing simpler models that represent relevant data.
When data sets contain a large number of variables, the required level of statistical significance should be adjusted for the number of patterns tested. For example, if we test 100 random patterns, we expect one of them to appear "interesting" at the 0.01 significance level by chance alone.
Cross-validation is a common approach to evaluating the fitness of a model generated via data mining: the data is divided into a training subset and a test subset, used respectively to build and then test the model. Common cross-validation techniques include the holdout method, k-fold cross-validation, and the leave-one-out method.
B) Data Mining Techniques
Cluster Analysis
In an unsupervised learning environment the system has to discover its own classes, and one way in which it does this is to cluster the data in the database. The first step is to discover subsets of related objects, and then to find descriptions (e.g., D1, D2, D3) that describe each of these subsets.
[Figure omitted: clustering of the database into subsets described by D1, D2, D3]
Induction
A database is a store of information, but more important is the information that can be inferred from it. There are two main inference techniques available: deduction and induction.
Decision Trees
Decision trees are a simple form of knowledge representation that classifies examples into a finite number of classes: the nodes are labelled with attribute names, the edges are labelled with the possible values of the attribute, and the leaves are labelled with the different classes. An object is classified by following a path down the tree, taking the edges corresponding to the values of its attributes.
The objects in the example contain information on the outlook, humidity, etc. Some objects are positive examples, denoted by P, and others are negative, i.e. N. Classification is, in this case, the construction of a tree structure which can be used to classify all the objects correctly.
A data mining system has to infer a model from the database; that is, it may define classes such that the database contains one or more attributes that denote the class of a tuple (the predicted attributes), while the remaining attributes are the predicting attributes. A class can then be defined by a condition on the attributes. When the classes are defined, the system should be able to infer the rules that govern classification; in other words, the system should find the description of each class.
Production rules have been widely used to represent knowledge in expert systems, and they have the advantage of being easily interpreted by human experts because of their modularity: a single rule can be understood in isolation and doesn't need reference to other rules. The propositional structure of such rules has been described earlier but can be summed up as if-then rules.
Neural Networks
Neural Networks are an approach to computing that involves developing
mathematical structures with the ability to learn. The methods are the result
of academic investigations to model nervous system learning. Neural
Networks have the remarkable ability to derive meaning from complicated or
imprecise data and can be used to extract patterns and detect trends that are
too complex to be noticed by either humans or other computer techniques.
Neural Networks have broad applicability to real world business problems and
have already been successfully applied in many industries. Since neural
networks are best at identifying patterns or trends in data, they are well
suited for prediction or forecasting needs including:
Sales Forecasting
Industrial Process Control
Customer Research
Data Validation
Risk Management
Accessibility
Client/Server Architecture
Generic Dimensionality
Multi-User Support
Flexible Reporting
Data Visualization
Data visualization makes it possible for the analyst to gain a deeper, more intuitive understanding of the data and as such can work well alongside data mining. Data mining allows the analyst to focus on certain patterns and trends and to explore them in depth using visualization. On its own, data visualization can be overwhelmed by the volume of data in a database, but in conjunction with data mining it can help with exploration.
6. Id is defined as the primary key. The NOT NULL phrase only ensures that some non-null value is given. The PRIMARY KEY phrase indicates that the DBMS is to guarantee that the set of values for Id is unique.
7. Name has a data-type, PersName, defined as a Row type similar to the one
defined in lines 1-3. BirthDate is a date that can be used as the argument for
the function Age_f defined in line 4.
8. Address is defined using the row type Address_t, defined in lines 1-3.
Picture is defined as a BLOB, or Binary Large Object.
There are no built-in functions for content search, manipulation, or presentation of BLOB data types. These must be defined either by the user as user-defined functions (UDFs) or by the ORDBMS vendor in a supplementary subsystem. In this case, we need functions for image processing.
9. Age is defined as a function, which will be activated each time the attribute
is retrieved. This costs processing time (though this algorithm is very simple),
but gives a correct value each time the attribute is used.
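The listing that these numbered comments refer to is not reproduced in this extract. A minimal sketch consistent with the description (Informix-style row types; exact syntax varies by ORDBMS) might look like:

CREATE ROW TYPE Address_t          -- lines 1-3 of the original listing
  (Street VARCHAR(40),
   City   VARCHAR(25));

CREATE ROW TYPE PersName           -- a row type similar to Address_t
  (First  VARCHAR(20),
   Last   VARCHAR(25));

CREATE FUNCTION Age_f(b DATE)      -- line 4: evaluated on each retrieval
  RETURNS INTEGER
  RETURN EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM b);

CREATE TABLE PERSON
  (Id        INTEGER NOT NULL PRIMARY KEY,
   Name      PersName,
   BirthDate DATE,                 -- argument for Age_f
   Address   Address_t,
   Picture   BLOB);                -- no built-in content functions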
B) Hierarchical Structures
1. Create table STUDENT initiates specification of the implementation of a
subclass entity type.
2. GPA and Level are the attributes for the subclass, here with simple SQL2
data types.
3. under PERSON specifies the table as a subclass of the table PERSON. The
DBM thus knows that when the STUDENT table is requested, all attributes and
functions in PERSON are also relevant. An OR-DBMS will store and use the
primary key of PERSON as the key for STUDENT, and execute a join operation
to retrieve the full set of attributes.
4. Create table COURSE initiates a new table specification, as done for the statements in lines 5 and 10 above.
5. Id, Name, and Level are standard atomic attribute types with SQL2 data
types. Id is defined as requiring a unique, non null value, as specified for
PERSON in line 6 above.
6. Note that attributes must have unique names within their tables, but the
name may be reused, with different data domains in different tables.
Both Id and Name are such attribute-names, appearing in both PERSON and
COURSE, as is Level used in STUDENT and COURSE.
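A sketch of the two table definitions just described (the UNDER clause follows the text; syntax is illustrative):

CREATE TABLE STUDENT
  (GPA   DECIMAL(3,2),
   Level VARCHAR(12))
  UNDER PERSON;                         -- inherits Id, Name, BirthDate, ...

CREATE TABLE COURSE
  (Id    INTEGER NOT NULL PRIMARY KEY,  -- unique, non-null, as for PERSON.Id
   Name  VARCHAR(40),
   Level VARCHAR(12));                  -- same name, different domain than STUDENT.Level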
- {Sid, Cid, Term} form the primary key (PK). Since the key is composite, a separate PRIMARY KEY clause is required (as compared with the single-attribute PK specifications for PERSON.Id and COURSE.Id).
- The two foreign-key attributes in the PK must be defined separately.
- TakenBy.Report is a foreign key to a report entity type, forming a ternary relationship as modeled in Figure 6.7a. The ON DELETE trigger is activated if the Report relation is deleted and ensures that the FK link has a valid value, in this case null (see the sketch after this list).
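Putting these rules together, the relationship table might be declared as follows (a sketch; the REPORT table and the data types are assumptions):

CREATE TABLE TakenBy
  (Sid    INTEGER REFERENCES STUDENT (Id),
   Cid    INTEGER REFERENCES COURSE (Id),
   Term   VARCHAR(8),
   Report INTEGER REFERENCES REPORT (Id)
                  ON DELETE SET NULL,   -- keep the FK valid: set it to null
   PRIMARY KEY (Sid, Cid, Term));       -- composite key needs its own clause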
To manage media data in the database, the DMS will have to have the capability to manage, store, search, retrieve, and manipulate different media types. Object-relational DBMS vendors claim to be able to do this.
Oracle data types for large objects are BLOB, CLOB, NCLOB (fixed-width multi-byte CLOB) and BFILE (a binary file stored outside the DB). Note that the first three are equivalent to the DB2 LOBs, while the last is really not a data type but rather a link to an externally stored media object.
SQL3 has no functions for processing (for example, indexing) the content of a BLOB, and provides only functions to store and retrieve it given an external identifier.
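A sketch of an Oracle table using these types (the table and column names are illustrative):

CREATE TABLE media_doc
  (id      NUMBER PRIMARY KEY,
   summary CLOB,    -- character LOB stored in the DB
   image   BLOB,    -- binary LOB stored in the DB
   source  BFILE);  -- locator for a file kept outside the DB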
20
DBMS vendors who provide differentiated LOB types have also extended the basic SQL string-comparison operators so that they function for LOBs, or at least CLOBs. These operators include the pattern-match function LIKE, which gives a true/false response depending on whether the search string is found in the *LOB attribute.
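For example, against the sketch table above:

-- true/false pattern match on a CLOB column
SELECT id
FROM media_doc
WHERE summary LIKE '%database%';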
E) Storage of LOBs
There are three strategies for storing LOBs in an OR-DB:
1. Embedded in a column of the defining relation, or
2. Stored in a separate table within the DB, linked from the *LOB column
of the defining relation.
3. Stored on an external (local or geographically distant) medium, again
linked from the *LOB column of the defining relation.
Embedded storage in the defining relation closely maps the logical view of the media object to its physical storage. This strategy is best if the other attributes of the table are primarily structural metadata used to specify display characteristics, for example length, language, and format.
The problem with embedded storage is that a DMS must transfer at least a whole tuple, more commonly a block of tuples, from storage for processing. If LOBs are embedded in the tuples, a great deal of data must be transmitted even if the LOB objects are not part of the query selection criteria or the result.
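In Oracle, for instance, the choice between the first two strategies can be declared per LOB column: ENABLE STORAGE IN ROW keeps small LOBs embedded in the tuple, while DISABLE STORAGE IN ROW always places the LOB in a separate segment and leaves only a locator in the row. A sketch:

CREATE TABLE media_doc2
  (id    NUMBER PRIMARY KEY,
   image BLOB)
  LOB (image) STORE AS (DISABLE STORAGE IN ROW);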
Separate table storage gives indirect access via a link in the defining relation
and delays retrieval of the LOB until it is to be part of the query result set.
Though this gives a two-step retrieval, for example when requesting an
image of Joan Nordbotten, it will reduce general or average transfer time for
the query processing system.
A drawback of this storage strategy is a likely fragmentation of the DB area,
as LOBs can be stored anywhere. This will decrease the efficiency of any
algorithm searching the content of a larger set of LOBs.
External storage is useful if the DB data is connected to established media databases, either locally on CD or DVD, or on other computers in a network, as will most likely be the case when sharing media data stored in
autonomous applications, such as cooperating museums, libraries, archives,
or government agencies. This storage structure eliminates the need for
duplication of large quantities of data that are normally offered in read-only
mode. The cost is in access time which may currently be nearly unnoticeable.
A good multimedia DMS should support each of these storage strategies.
DB Components
A database is defined as a logically coherent collection of related data, representing some aspect of the real world, designed, built, and populated for some purpose. In addition to the user data stored in the database proper (in accordance with the above definition), two other data sets are stored within the DB area. These data are necessary to support efficient data storage, retrieval, and management, and include:
that value. In practice, the index elements are ordered in some form of B-tree to minimize access time.
Indexes may be unique or clustered, meaning that an index entry references
only one element or a set of elements, respectively. Unique indexes are
commonly used to enforce the primary key integrity constraint that each
tuple in a relation must be unique. Cluster indexes provide fast access to sets
of data containing the same values as the index term.
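For example (illustrative names):

-- unique index: enforces that each Id occurs only once
CREATE UNIQUE INDEX person_pk_idx ON PERSON (Id);

-- non-unique index: fast access to the set of rows sharing a Level value
CREATE INDEX course_level_idx ON COURSE (Level);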
The method library contains user-defined functions, procedures, assertion statements, integrity rules, and the trigger functions that maintain them. In an ORDBMS, this library can be extended to include user definitions of new (to the DBMS) data types and the functions necessary to manipulate them. A DB schema contains the metadata specified for the database as defined using the DBMS's Data Definition Language (DDL).
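As a sketch of the kind of trigger function such a library holds (Oracle syntax; the REPORT and TakenBy tables are the ones assumed earlier, and the trigger merely restates, for illustration, what the declarative ON DELETE SET NULL clause already does):

CREATE OR REPLACE TRIGGER report_fk_cleanup
AFTER DELETE ON REPORT
FOR EACH ROW
BEGIN
  -- maintain the integrity rule: no dangling references to deleted reports
  UPDATE TakenBy SET Report = NULL WHERE Report = :OLD.Id;
END;
/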