
Write short notes on:

a) N-tier architecture
b) SOAP
a) N-tier architecture
 In an N-tier architecture, the application is distributed among three or more separate computers in a distributed network.
 The most common form of N-tier is the 3-tier application, where the user-interface programming runs on the user's computer, the business logic runs on a centralized computer, and the data resides on the computer that manages the database.
 It is an extension of the client/server programming model.
 If more than three distribution levels or tiers are involved, the additional tiers are usually associated with the business-logic tier.
 N-tier is also described as pulling an application apart into separate layers or finer grains.
 One of the best examples of this architecture among web applications is the shopping-cart web application.
 Here the client tier interacts with the user through a GUI and communicates with the application server.
 In most web applications the client is a web browser.
 The integration tier allows the N-tier architecture to remain vendor independent.
 The business tier is sometimes grouped together with the integration tier.
 Encapsulation allows the application to communicate with the business tier in a way that is intelligible to all the nodes.
 The final application tier is the data tier.
 It mostly consists of database servers. The data is kept neutral and independent of the application servers and the business logic.
 Giving the data its own tier improves scalability and performance, and as the data grows it can easily be moved to another, more powerful machine.
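To make the tier separation concrete, here is a minimal, illustrative Python sketch (the product data, prices and tax rule are invented): each tier is a plain function, and each tier talks only to the tier directly below it.

# Minimal sketch of a 3-tier shopping-cart flow (hypothetical data and names).

# Data tier: manages storage; an in-memory dict stands in for the database server.
PRODUCTS = {"p1": {"name": "Keyboard", "price": 25.0},
            "p2": {"name": "Mouse", "price": 12.5}}

def data_tier_get_product(product_id):
    # Data access only: fetch a record by key.
    return PRODUCTS.get(product_id)

# Business tier: application rules live here, isolated from UI and storage details.
def business_tier_cart_total(product_ids, tax_rate=0.10):
    # Compute a cart total and apply a (made-up) tax rule.
    subtotal = 0.0
    for pid in product_ids:
        row = data_tier_get_product(pid)
        if row:
            subtotal += row["price"]
    return round(subtotal * (1 + tax_rate), 2)

# Presentation tier: in a real web application this is the browser/GUI.
def presentation_tier(product_ids):
    print("Cart total (incl. tax):", business_tier_cart_total(product_ids))

presentation_tier(["p1", "p2"])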
Benefits of N-Tier Architecture
 It helps in improving scalability and supports cost-efficient application building.
 It helps in making applications more readable and reusable.
 The applications that are built are robust because they have no single point of failure; the tiers function with relative independence. Reusability is important for web applications.
 Authentication and authorization are provided for security. This allows the web server to restrict user access based on pre-determined criteria.
 It helps developers build web applications because it allows each developer to apply their specific skill to the part of the program that best suits their skill set.
b) SOAP
 SOAP (Simple Object Access Protocol) is a communication protocol, a way to structure data before transmitting it, based on the XML standard. It was developed to allow communication over the Internet between applications running on different platforms and written in different programming languages.
 It can use a range of protocols such as HTTP, FTP, SMTP and Post Office Protocol 3 (POP3) to carry its documents.
 HTTP GET and HTTP POST work with name/value pairs, which means transferring complex objects is not possible with them, whereas SOAP serializes complex structures, such as ASP.NET DataSets, complex arrays, custom types and XML nodes, before transmitting and thus allows the exchange of complex objects between applications.
 Two components can easily communicate using a Remote Procedure Call (RPC) protocol. But because of compatibility and security issues, most firewalls and proxy servers block this type of message. SOAP uses the HTTP channel for transport, which makes it a widely accepted protocol over the Internet.
Steps taken in the SOAP Processing model
There are different nodes involved, termed SOAP nodes. They act as senders or receivers in the processing of a message and provide access to the messages as well.
The nodes are the following:
 SOAP sender : a node that generates and transmits a SOAP message.
 SOAP receiver : a node that receives and processes a SOAP message.
 SOAP message path : the set of nodes along which a message travels from the initial sender to its final destination.
 Initial SOAP sender : also called the originator, it sends the message at the starting point of the message path.
 SOAP intermediary : a node that sits between the initial sender and the ultimate receiver; it acts as both a SOAP receiver and a SOAP sender. It processes the header blocks addressed to it and forwards the SOAP message towards the ultimate receiver.
 Ultimate SOAP receiver : the node where the message is finally received. It is responsible for processing the contents of the SOAP body as well as any SOAP header blocks addressed to it.
Message format used in SOAP
 The message format is written in XML, which is widely used as a standard message format; this allows an easy transition to SOAP-based implementations.
 The format of the protocol allows easy readability and ease of error detection, and it avoids interoperability problems such as byte ordering.
A sample SOAP request is shown below:
POST /InStock HTTP/1.1
Host: localhost
Content-Type: application/soap+xml; charset=utf-8
Content-Length: 299
SOAPAction: "http://www.abc.org/2003/05/soap-envelope"

<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://www.abc.org/2003/05/soap-envelope">
  <soap:Header>
  </soap:Header>
  <soap:Body>
    <m:CareerName>Careeride</m:CareerName>
  </soap:Body>
</soap:Envelope>

Working of SOAP
 The client object builds a SOAP request describing the method it wants to invoke; the request is sent to the server, where the corresponding server object carries out the call.
 The client-side interface file (or proxy) encodes the name of the server object and of the interface being used.
 It also carries other information such as the method name and its arguments.
 The client uses HTTP to send the XML to the server with the POST method. The server parses the requested method, executes it and sends the result back to the client side.
 The server builds more XML containing the response to the client's request and returns it over HTTP.
 The client can use other transports to send the XML as well; for example, an SMTP server together with the POP3 protocol can be used to pass the messages for request and response.
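As a hedged illustration of this flow, the sketch below posts a SOAP envelope over HTTP with Python's standard urllib; the endpoint URL, SOAPAction value and envelope contents simply mirror the sample message shown earlier and are not a real service.

# Sketch: sending a SOAP request over HTTP POST (hypothetical endpoint and payload).
import urllib.request

envelope = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://www.abc.org/2003/05/soap-envelope">
  <soap:Header></soap:Header>
  <soap:Body>
    <m:CareerName xmlns:m="http://www.abc.org/stock">Careeride</m:CareerName>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    url="http://localhost/InStock",                      # hypothetical service URL
    data=envelope.encode("utf-8"),                       # the XML document is the HTTP body
    headers={
        "Content-Type": "application/soap+xml; charset=utf-8",
        "SOAPAction": '"http://www.abc.org/2003/05/soap-envelope"',
    },
    method="POST",
)

# The server parses the envelope, invokes the named method, and replies with a SOAP response.
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))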
Problems faced by the user by using SOAP
 SOAP is a comparatively new protocol used for cross-platform communication, and it can bypass the firewall.
 This raises security concerns, because the firewall is a security mechanism that normally sits between communicating applications.
 A firewall blocks most ports, leaving open only a few such as HTTP port 80, and it is this HTTP port that SOAP uses to get through the firewall.
 This is a serious concern, as it can pose difficulties for users; however, there are remedies, for example SOAP traffic can be filtered at the firewall.
 Each SOAP message has a distinctive header field that can be used to inspect the SOAP messages passing through the firewall.
Web-related functionalities provided by the SOAP protocol
The following functionalities are provided for the web by the SOAP protocol:

1. HTTPUtils : provides the functionality of the POST method, through which a request can be carried in a secure manner.
2. Parameter : represents an argument to an RPC call, used by both the client and the server.
3. Response : an object that represents an RPC response on both the client and the server; the result becomes available only after the method invocation completes.
4. TCPTunnel : an object that listens on a given port and forwards all traffic to a specified host and port.
5. TypeConverter : converts an object of one type to another type; it is invoked with the target class in the form of an object.

Explain multimedia architecture. Mention the requirements for a mobile database.
 Multimedia is a single, integrated feature that extends the database by storing, managing, and retrieving image, audio, and video data, and by supporting Web technologies for multimedia data.
 The Multimedia architecture defines the framework through
which the media-rich content as well as traditional data are
supported in the database.
 This content and data can then be securely shared across
multiple applications written with popular languages and tools,
easily managed and administered by relational database
management and administration technologies, and offered on a
scalable database that supports thousands of users.
 In the first tier, the database holds a rich content in tables
along with the traditional data. Through a database-embedded
JVM, a server-side media parser is supported as well as an image
processor.
 The media parser has object-oriented and relational
interfaces, supports format and application metadata parsing, and
can be extended to support additional formats.
 The image processor includes JAI and provides image
processing for operations such as producing thumbnail-size
images, converting image formats, and image indexing and
matching.
 Using the Multimedia methods, import and export operations
between the database and operating system files (external file
storage) are possible.
 It also supports special delivery types of servers, such as
streaming content from a database.
 Using the Oracle Multimedia Plug-ins for Real Networks or
Windows Media Services, the Helix Universal Server or Windows
Media Streaming Server can stream multimedia data to a client
directly out of the database using Real-Time Streaming Protocol
(RTSP).
 In addition, third-party media processors such as speech
recognition engines can run external to the database to process
media stored in the database and return results to the database.
 In the second or middle tier, the Application Server provides
access to Multimedia through the Multimedia Java classes, which
enable Java applications on any tier (client, application server, or
database) to access, manipulate, and modify audio, image, and
video data stored in a database.
 In addition, the Multimedia Servlets and JSP Java API
facilitates the upload and retrieval of multimedia data stored in a
database using the Oracle Multimedia OrdAudio, OrdDoc,
OrdImage, and OrdVideo object types.
 The Multimedia Servlets and JSP Java API can access data
stored in the Oracle Multimedia objects or BLOBs or BFILEs
directly.
 Developers can also use JDeveloper and Multimedia to build
media-rich Java applications quickly and easily using the Oracle
Multimedia/ADF Business Components integration package.
 The Multimedia rich content can also be easily and
transparently incorporated into Oracle Portal forms and reports,
which can then be published as portlets.
 SQL developers familiar with the database can develop Web
applications that use Oracle Application Server exclusively, and
Oracle Database using the PL/SQL development environment.
 The steps include using the PL/SQL Gateway (mod_plsql)
feature of the Oracle HTTP Server and the PL/SQL Web Toolkit.
Web application developers can write PL/SQL servlets and PL/SQL
server pages (PSP) that invoke PL/SQL procedures stored in the
database through an Oracle Net connection and OCI.
 In the third or client tier, the ability to perform local
processing is supported through the Multimedia Java classes, JAI,
and JMF. JAI and JMF provide a set of APIs for media processing
on the client, and Oracle Multimedia Java classes supply direct access
to all media types from the client.
Explain Inter-operational and Intra-operational
parallelism with relevant examples.
Intraoperation Parallelism
 Relational operations work on relations containing large sets of tuples, so we can parallelize the operations by executing them in parallel on different subsets of each relation.
 Since the number of tuples in a relation can be large, the degree of parallelism is potentially enormous.
 Hence, we can say that intraoperation parallelism is natural in a database system.
The parallel versions of some common relational operations are as
follows:

Parallel Sort
 For example, suppose we want to sort a relation that resides on n disks D0, D1, ..., Dn-1.
 If the relation is range-partitioned on the sort attributes, then each partition can be sorted separately and the results concatenated to obtain the full sorted relation.
 Because the tuples are partitioned over the n disks, the time required for reading the entire relation is reduced by the parallel access.
 If the relation has been partitioned in any other way, it can be sorted in one of the following ways:
1. Range-partition it on the sort attributes and then sort each partition separately.
2. Use the parallel version of the external sort-merge algorithm.
Range-partitioning sort
 It works in two steps: first, range-partition the relation; second, sort each partition separately.
 When we sort the relation, it is not necessary to range-partition it on the same set of processors or disks as those on which the relation is stored.
 The range partitioning should be done with a good range-partition vector, so that each partition has approximately the same number of tuples.
Parallel External Sort-Merge
 It is an alternative to range partitioning.
 Suppose a relation has already been partitioned among the disks D0, D1, ..., Dn-1.
The parallel sort-merge works in the following manner:
1. Each processor Pi locally sorts the data on disk Di.
2. To get the final sorted output, the system merges the sorted runs from each processor.
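A minimal sketch of both strategies on made-up data is shown below, with plain Python lists standing in for disks and local sorts standing in for the per-processor work.

# Sketch of the two parallel-sort strategies on invented data (lists stand in for disks).
import heapq

data_on_disks = [[7, 3, 9], [1, 8, 2], [6, 4, 5]]            # tuples already spread over n "disks"

# 1) Range-partitioning sort: repartition by value ranges, sort each range, concatenate.
partition_vector = [3, 6]                                     # hypothetical range-partition vector

def range_partition(values, vector):
    parts = [[] for _ in range(len(vector) + 1)]
    for v in values:
        i = sum(v >= bound for bound in vector)               # index of the range v falls into
        parts[i].append(v)
    return parts

ranges = [[] for _ in range(len(partition_vector) + 1)]
for disk in data_on_disks:
    for i, part in enumerate(range_partition(disk, partition_vector)):
        ranges[i].extend(part)

range_sorted = [v for part in ranges for v in sorted(part)]   # each range sorted "locally", then concatenated

# 2) Parallel external sort-merge: sort each disk's data locally, then merge the sorted runs.
local_runs = [sorted(disk) for disk in data_on_disks]
merge_sorted = list(heapq.merge(*local_runs))

print(range_sorted)   # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(merge_sorted)   # [1, 2, 3, 4, 5, 6, 7, 8, 9]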
Parallel Join
 The join operation tests pairs of tuples to see whether they satisfy the join condition, and if they do, the system adds the pair to the join output.
 Parallel join algorithms attempt to split the pairs to be tested over several processors.
 Each processor then computes part of the join locally.
 Afterwards, the system collects the results from each processor to produce the final result; a sketch of a partitioned join follows the list below.
The types of parallel join are:
 Partitioned join
 Fragment-and-replicate join
 Partitioned parallel hash join
 Parallel nested-loop join
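Below is a hedged sketch of the partitioned (hash) join mentioned above: both relations are hash-partitioned on the join attribute, and each "processor" joins only its own pair of partitions. The relations and key values are invented.

# Sketch of a partitioned parallel join (hash partitioning on the join attribute).
N = 3                                                   # number of "processors"

r = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]            # relation r(join_key, payload), made up
s = [(2, "x"), (3, "y"), (3, "z"), (5, "w")]            # relation s(join_key, payload), made up

def hash_partition(relation, n):
    parts = [[] for _ in range(n)]
    for tup in relation:
        parts[hash(tup[0]) % n].append(tup)             # route each tuple by its join key
    return parts

r_parts, s_parts = hash_partition(r, N), hash_partition(s, N)

def local_join(r_part, s_part):
    # Each processor joins only the tuples that hashed to it.
    index = {}
    for key, payload in r_part:
        index.setdefault(key, []).append(payload)
    return [(key, rp, sp) for key, sp in s_part for rp in index.get(key, [])]

result = [row for i in range(N) for row in local_join(r_parts[i], s_parts[i])]
print(result)    # only the matching keys 2 and 3 appear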
Other relational operators
 Selection
 Duplicate elimination
Duplicates can be eliminated by sorting, using either of the parallel sort techniques. Duplicate elimination can also be parallelized by partitioning the tuples and eliminating duplicates locally at each processor.
 Projection
Projection without duplicate elimination can be performed as the tuples are read from the disk in parallel. To eliminate duplicates, either of the above techniques can be used.
 Aggregation
The operation can be parallelized by partitioning the relation on the grouping attributes and computing the aggregate values locally at each processor; either hash partitioning or range partitioning can be used, as sketched below.
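As a small illustration of the aggregation case (rows and group names are made up): hash-partition the rows on the grouping attribute, aggregate locally in each partition, and the local results are already final, because every group lands in exactly one partition.

# Sketch: parallel SUM(sales) GROUP BY region via hash partitioning (invented rows).
rows = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 4)]
N = 2

partitions = [[] for _ in range(N)]
for region, amount in rows:
    partitions[hash(region) % N].append((region, amount))    # all rows of a group go to one partition

def local_aggregate(part):
    totals = {}
    for region, amount in part:
        totals[region] = totals.get(region, 0) + amount
    return totals

results = {}
for part in partitions:
    results.update(local_aggregate(part))                     # groups never overlap across partitions

print(results)    # {'east': 17, 'west': 9, 'north': 3}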
Interoperation parallelism
It has two forms of parallelism:

1. Pipelined Parallelism
 Parallel systems use pipelining mainly for the same reason that sequential systems do.
 Pipelines are a source of parallelism in the same way that instruction pipelines are a source of parallelism in hardware design.
 Two operations can be run simultaneously on different processors, so that one operation consumes tuples in parallel with the operation producing them.
 This form of parallelism is known as pipelined parallelism.
2. Independent Parallelism
 Operations in a query expression that do not depend on one another can be executed in parallel. This is known as independent parallelism.
 Independent parallelism does not provide a high degree of parallelism and is less useful in a highly parallel system, although it is useful with a lower degree of parallelism.

Explain I/O parallelism. Define parallelism on a multicore processor.

 For the purpose of parallel I/O, the data can be partitioned across multiple disks.
 Relational operators such as sort, join and aggregation can be executed in parallel.
 The data can be partitioned in such a manner that each processor works independently on its own partition.
 Queries are expressed in a high-level language, which makes parallelization easier.
 Different queries can be run in parallel with each other; conflicts are taken care of by concurrency control.
 Hence, we can say that databases lend themselves to parallelism.
I/O Parallelism
 It helps in reducing the time required to retrieve relations from disk, by partitioning the relations.
 The relations are maintained on multiple disks.
 Horizontal partitioning is where the tuples of a relation are divided among several disks such that each tuple resides on one disk.
 Assume that the number of disks is n.
The partitioning techniques used in I/O parallelism are as follows:

Round Robin
 It scans the relation in any order and sends the ith tuple to
disk number Di mod n.
 The scheme ensures an even distribution of tuples across
disks; that is, each disk has approximately the same number of
tuples as the others.
Hash partitioning
 It is a declustering strategy that designates one or more attributes from the given relation's schema as the partitioning attributes.
 A hash function is chosen whose range is {0, 1, ..., n - 1}.
 Each tuple of the original relation is hashed on the partitioning attributes.
 If the hash function returns i, then the tuple is placed on disk Di.
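A small sketch of these two placement rules follows, with n lists standing in for the disks and invented tuples; Python's built-in hash plays the role of the chosen hash function.

# Sketch of round-robin and hash partitioning across n "disks" (lists), using made-up tuples.
n = 4
tuples = [("t%d" % i, i * 10) for i in range(10)]

# Round-robin: the i-th tuple (in scan order) goes to disk i mod n.
rr_disks = [[] for _ in range(n)]
for i, t in enumerate(tuples):
    rr_disks[i % n].append(t)

# Hash partitioning: hash the partitioning attribute (here the second field) into {0, ..., n-1}.
hash_disks = [[] for _ in range(n)]
for t in tuples:
    hash_disks[hash(t[1]) % n].append(t)

print(rr_disks)
print(hash_disks)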
Range partitioning
 It distributes the tuples by assigning contiguous attribute-value ranges to each disk.
 It selects a partitioning attribute, A, and a partitioning vector [v0, v1, ..., vn-2], such that, if i < j, then vi < vj.
 The relation is partitioned as follows: consider a tuple t such that t[A] = x. If x < v0, then t goes on disk D0. If x >= vn-2, then t goes on disk Dn-1. If vi <= x < vi+1, then t goes on disk Di+1.
 For example, with three disks numbered 0, 1, and 2, a partitioning vector may assign tuples with values less than 5 to disk 0, values between 5 and 40 to disk 1, and values greater than 40 to disk 2.
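Continuing that example, here is a minimal sketch of the placement rule with the partitioning vector [5, 40] and three disks.

# Sketch of range partitioning on attribute A with vector [v0, v1] = [5, 40] and disks D0..D2.
vector = [5, 40]

def disk_for(x):
    # Return the disk index for a tuple whose partitioning attribute equals x.
    if x < vector[0]:
        return 0                     # x < v0        -> D0
    if x >= vector[-1]:
        return len(vector)           # x >= v_{n-2}  -> D_{n-1}
    for i in range(len(vector) - 1):
        if vector[i] <= x < vector[i + 1]:
            return i + 1             # v_i <= x < v_{i+1} -> D_{i+1}

print([disk_for(x) for x in (3, 5, 17, 40, 90)])   # [0, 1, 1, 2, 2]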
Comparison of Partitioning Techniques
 Once a relation has been partitioned among several disks, it can be retrieved in parallel using all of them.
 Similarly, when a relation is being partitioned, it can be written to multiple disks in parallel.
 The transfer rates for reading or writing an entire relation are much faster with I/O parallelism than without it.
 However, reading an entire relation, or scanning a relation, is only one kind of access to data.
Access to data can be classified as follows:
 The entire relation is scanned.
 A tuple is located associatively (for example, employee name = “Pooja”); such queries, also known as point queries, seek tuples that have a specified value for a specific attribute.
 All tuples are located for which the value of a given attribute lies within a specified range (for example, 10000 < salary < 20000); such queries are called range queries.
Explain 2PC protocols. Discuss its failure &
recovery techniques.

 We first consider how 2PC operates during normal operation, then describe how it handles failures and, finally, how it carries out recovery and concurrency control.
 Consider a transaction T initiated at site S, where the transaction coordinator is C.
 When T completes its execution, that is, when all the sites at which T has executed inform C that T has completed, C starts the 2PC protocol.
Phase 1
 C adds a <prepare T> record to the log and forces the log onto stable storage.
 It then sends a prepare T message to all sites at which T executed.
 On receiving such a message, the transaction manager at that site determines whether it is willing to commit its portion of T.
 If the answer is no, it adds a <no T> record to the log and then responds by sending an abort T message to C.
 If the answer is yes, it adds a <ready T> record to the log and forces the log (with all the log records corresponding to T) onto stable storage.
 The transaction manager then replies with a ready T message to C.
Phase 2
 When C receives responses to the prepare T message from all the sites, or when a prespecified interval of time has elapsed since the prepare T message was sent out, C can determine whether the transaction T can be committed or aborted.
 Transaction T can be committed only if C received a ready T message from all the participating sites; otherwise, transaction T must be aborted.
 Depending on the verdict, either a <commit T> record or an <abort T> record is added to the log and the log is forced onto stable storage.
 At this point, the fate of the transaction has been sealed.
 The coordinator then sends either a commit T or an abort T message to all participating sites.
 When a site receives that message, it records the message in its log.
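A hedged sketch of the coordinator's Phase 2 decision rule follows: commit only if every participating site answered ready; any abort vote or timeout forces an abort. The site names, vote values and in-memory log are simplified stand-ins, not a real implementation.

# Sketch of the 2PC coordinator's commit/abort decision (simplified; votes are made up).
log = []                                    # stand-in for the coordinator's stable-storage log

def force_log(record):
    log.append(record)                      # a real system would force this to stable storage

def run_2pc(transaction, votes):
    # votes maps each participating site to 'ready', 'abort', or None (timeout).
    # Phase 1: record the start of the protocol and ask every site to prepare.
    force_log(("prepare", transaction))

    # Phase 2: commit only if *all* sites voted ready; otherwise abort.
    if all(v == "ready" for v in votes.values()):
        force_log(("commit", transaction))
        decision = "commit"
    else:
        force_log(("abort", transaction))
        decision = "abort"

    # The decision message would now be sent to every participating site.
    return decision

print(run_2pc("T1", {"S1": "ready", "S2": "ready"}))   # commit
print(run_2pc("T2", {"S1": "ready", "S2": None}))      # abort (timeout counts as no)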
Handling of Failures
The types of failures are:

Failure of a participating site


 If the coordinator C detects that a participating site has failed, it takes the following actions:
 If the site fails before responding with a ready T message to C, the coordinator assumes that it responded with an abort T message.
 If the site fails after the coordinator has received the ready T message from the site, the coordinator executes the rest of the commit protocol in the normal fashion, ignoring the failure of the site.
 When a participating site S recovers from a failure, it must examine its log to determine the fate of those transactions that were in the midst of execution when the failure occurred. Let T be one such transaction.
We consider each of the possible cases:
 The site executes redo(T) when the log contains a <commit T> record.
 The site executes undo(T) when the log contains an <abort T> record.
 The log contains a <ready T> record but neither a <commit T> nor an <abort T> record. Here, the site must consult C to determine the fate of T.
Failure of the coordinator
 If the coordinator fails in the midst of the execution of the
commit protocol for transaction T, then the participating sites
must decide the fate of 'T'.
 In certain cases, the participating sites cannot decide
whether to commit or abort T, and therefore these sites must wait
for the recovery of the failed coordinator.
 'T' must be committed if an active site contains a <commit T>
record in its log.
 'T' must be aborted if an active site contains an <abort T>
record in its log.
 If some active site does not contain a <ready T> record in its
log, then the failed coordinator 'C' cannot have decided to commit
'T', because a site that does not have a <ready T> record in its log
cannot have sent a ready 'T' message to 'C'. However, the
coordinator may have decided to abort 'T', but not to commit 'T'.
Rather than wait for 'C' to recover, it is preferable to abort 'T'.
 If none of the preceding cases holds, then all active sites
must have a <ready T> record in their logs, but no additional
control records (such as <abort T> or <commit T>).
 Since the coordinator has failed, it is impossible to determine
whether a decision has been made, and if one has, what that
decision is, until the coordinator recovers. Thus, the active sites
must wait for C to recover. Since the fate of T remains in doubt, T
may continue to hold system resources.
Recovery
 When a failed site restarts, recovery is performed by using the
recovery algorithm.
 The recovery procedure must treat in-doubt transactions
specially while dealing with the distributed commit protocols; the
in-doubt transactions are transactions for which a <ready T> log
record is found, but neither a <commit T> log record nor an <abort
T> log record is found.
 By contacting the other sites, the recovering site must determine the commit–abort status of its in-doubt transactions.
 Even if recovery is done as just described, normal transaction processing at the site cannot begin until all in-doubt transactions have been committed or rolled back.
 Since multiple sites may have to be contacted, finding the status of in-doubt transactions can be slow.
 If the coordinator has failed, and no other site has information
about the commit–abort status of an incomplete transaction,
recovery potentially could become blocked if 2PC is used. Due to
this the site that is performing the restart recovery may remain
unusable for a long period.
 To solve this problem, recovery algorithms provide support for
noting lock information in the log.
 Instead of writing a <ready T> log record, the algorithm writes
a <ready T, L> log record, where 'L' is a list of all write locks held
by the transaction 'T' when the log record is written.
 At recovery time, after performing local recovery actions, for
every in-doubt transaction 'T', all the write locks noted in the
<ready T, L> log record are reacquired.
 After lock reacquisition is complete for all in-doubt
transactions, transaction processing can start at the site, even
before the commit–abort status of the in-doubt transactions is
determined.
 The site recovery is faster and never gets blocked as the
commit or rollback of in-doubt transactions proceeds concurrently
with the execution of new transactions.
 Note that new transactions that have a lock conflict with any
write locks held by in-doubt transactions will be unable to make
progress until the conflicting in-doubt transactions have been
committed or rolled back.

Explain concurrency control in a distributed database.

 It is assumed that each site participates in the execution of a commit protocol to ensure global transaction atomicity.
 The protocols described below require updates to be performed on all replicas of a data item.
 If any site containing a replica of a data item has failed, updates to that data item cannot be processed.
Locking Protocols
 There are various locking protocols that have been used.
 The only change that needs to be done is the way the lock
manager deals with the replicated data.
 There are several possible schemes applicable to an environment where the data can be replicated at several sites.
The types of locking protocols are as follows:

1. Single Lock-Manager Approach


 In this approach the system maintains a single lock manager that resides at a single chosen site.
 All lock and unlock requests are made at this site.
 Whenever a transaction needs to lock a data item, it sends a lock request to this site. The lock manager then determines whether the lock request can be granted; if the lock is granted, a message to that effect is sent to the requesting site.
 If not, the request is delayed until it can be granted.
 The transaction can read the data item from any of the sites at which a replica of the data item resides.
Advantages
 Simple implementation
 Simple deadlock handling
Disadvantages
 Bottleneck
 Vulnerability
2. Distributed Lock Manager
 This approach is a compromise between the advantages and disadvantages of the previous scheme: the lock-manager function is distributed over several sites.
 Each site maintains a local lock manager whose function is to administer the lock and unlock requests for those data items that are stored at that site.
 Once it decides that a lock request can be granted, the lock manager sends a message back to the initiator indicating that the lock has been granted.
Advantages
 Simple implementation.
 Reduces the degree to which the coordinator is a bottleneck.
 Low overhead, requiring two message transfers for handling lock requests and one message transfer for handling unlock requests.
Disadvantages
 Deadlock handling is more complex.
3. Primary Copy
 Whenever a system uses data replication, one of the replicas can be chosen as the primary copy.
 For each data item, the primary copy must reside at precisely one site, which we call the primary site for that data item.
4. Majority Protocol
The working of this protocol is as follows:
 If a data item is replicated at different sites, then a lock request must be sent to more than half of the sites at which the data item is stored.
 Each local lock manager determines whether the lock can be granted immediately at its own site.
 The transaction does not operate on the data item until it has successfully obtained locks on a majority of the replicas of the data item.
Advantages
 Extended to deal with site failures.
 The protocol also deals with replicated data in a decentralized
manner, thus avoiding the drawbacks of central control.
Disadvantages:
 Implementation
 Deadlock Handling
5. Biased Protocol
 In this approach the requests for shared locks are given more
favorable treatment than the requests for the exclusive locks.
 Shared locks - When a transaction needs to lock data item Q,
it simply requests a lock on Q from the lock manager at one site
that contains a replica of Q.
 Exclusive locks - When a transaction needs to lock data item
Q, it requests a lock on Q from the lock manager at all sites that
contain a replica of Q.
Advantages
 Imposing less overhead on read operations than does the
majority protocol. This savings is especially significant in common
cases in which the frequency of read is much greater than the
frequency of write.
Disadvantages
 Additional overhead on writes.
 The biased protocol shares the majority protocol’s
disadvantage of complexity in handling deadlock.
6. Quorum Consensus Protocol
 This protocol is a generalization of the majority protocol.
 It assigns each site a non-negative weight.
 It assigns read and write operations two integers, called the read quorum Qr and the write quorum Qw, which must satisfy Qr + Qw > S and 2 * Qw > S, where S is the total weight of all the sites at which the data item resides.
 To perform a read, enough replicas must be locked that their total weight is at least Qr; to perform a write, enough replicas must be locked that their total weight is at least Qw.
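For illustration (weights assumed): suppose a data item is replicated at three sites, each with weight 1, so the total weight is S = 3. Choosing a read quorum Qr = 2 and a write quorum Qw = 2 satisfies Qr + Qw > S and 2 * Qw > S: any two sites suffice for a read or a write, every read quorum overlaps every write quorum, and two conflicting writes cannot both obtain a quorum at the same time.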
Timestamping
 A timestamping scheme is used to give each transaction a unique timestamp, which helps the system decide the serialization order.
 There are two methods for generating unique timestamps: one centralized and one distributed.
 In the centralized scheme, a single site distributes the timestamps. This site can use a logical counter or its own local clock for this purpose.
 In the distributed scheme, each site generates a unique local timestamp by using a logical counter or the local clock; the unique global timestamp is obtained by concatenating the unique local timestamp with the site identifier.
 The order of concatenation is important.
 The site identifier is used in the least significant position to ensure that the global timestamps generated at one site are not always greater than those generated at another site.
 There can still be a problem if one site generates local timestamps faster than the other sites.
 A mechanism is therefore needed to ensure that local timestamps are generated fairly across the system.
 If a system clock is used for generating the timestamps, they will be assigned fairly, provided that no site has a system clock that runs fast or slow.
 Since clocks cannot be perfectly accurate, a technique similar to that used for logical clocks must be used to ensure that no clock gets far ahead of or behind another.

Compare RDBMS, OODBMS and ORDBMS in detail.

 An ORDBMS is a relational DBMS with certain extensions.
 An OODBMS is a programming language with a type system that supports object-oriented features and allows any data object to be persistent, that is, to survive across different program executions.
 Many current systems conform entirely to neither definition, but are much closer to one or the other and can be classified accordingly.
RDBMS versus ORDBMS

 An RDBMS does not support the different extensions of the database; an ORDBMS supports the different extensions of the database.
 In an RDBMS it is easier to optimize queries for efficient execution; in an ORDBMS it is more difficult to optimize queries for efficient execution.
 An RDBMS is easier to use, as there are fewer features to master; an ORDBMS is more difficult to use, as there are many features that need to be mastered.
 An RDBMS is less versatile than an ORDBMS; an ORDBMS is more versatile than an RDBMS.


OODBMS versus ORDBMS

 An OODBMS tries to add DBMS functionality to a programming language; an ORDBMS tries to add richer data types to a relational DBMS.
 The aim of an OODBMS is to achieve seamless integration with a programming language such as C++, Java or Smalltalk; such integration is not an important aim for an ORDBMS.
 An OODBMS aims at applications where an object-centric viewpoint is appropriate, meaning that a typical user session consists of retrieving a few objects and working on them for a long period, with related objects fetched occasionally. An ORDBMS is optimized for applications where large data collections are the focus, even though the objects may have a rich structure and be fairly large; such applications retrieve data from disk extensively, and optimizing data access is the main concern for efficient execution.
 In an OODBMS the query facilities of SQL are not supported efficiently; in an ORDBMS the query facilities are the centerpiece.

What do you understand by OODBMS? Explain the features of OODBMS.

 Relational database systems support a small, fixed collection of data types, which has proved adequate for traditional application domains such as administrative data processing.
 However, there are many kinds of complex data that need to be handled.
 Object-oriented concepts have strongly influenced efforts to enhance database support for complex data and have led to the development of object-database systems.
Object-Oriented Database Systems
 They were proposed as an alternative to relational systems and are aimed at application domains where complex objects play a central role.
 The approach is heavily influenced by object-oriented programming languages and can be understood as an attempt to add DBMS functionality to a programming language environment.
 Standards such as the Object Data Model (ODM) and Object Query Language (OQL), developed by the Object Database Management Group (ODMG), are the equivalent of the SQL standard for relational database systems.
 Objects in an object-oriented programming language normally exist only during program execution; they are known as transient objects.
 An object database extends the existence of objects so that they are stored permanently; such objects persist beyond program termination, can be retrieved later, and can be shared by other programs.
 Persistent objects allow the sharing of objects among multiple programs and applications.
Features of OODBMS
 Some of the common database features supported are indexing mechanisms, concurrency control and recovery.
 OO databases provide a unique, system-generated object identifier (OID) for each object, to maintain a direct correspondence between real-world and database objects so that objects do not lose their integrity and identity.
 In order to contain all the necessary information that describes the object, objects can have an object structure of arbitrary complexity.
 The internal structure of an object includes the specification of instance variables, which hold the values that define the internal state of the object. An instance variable is similar to the concept of an attribute in the relational model.
 Another feature is encapsulation, which insists that all the operations that a user can apply to an object must be predefined.
 To encourage encapsulation, an operation is defined in two parts:
1. The signature or interface of the operation, which specifies the operation name and the arguments.
2. The method or body, which specifies the implementation of the operation.
 Encapsulation permits modification of the internal structure of an object, as well as its operations, without disturbing the external programs that invoke these operations.
 Inheritance is a feature that permits the specification of new types or classes that inherit much of their structure and/or operations from previously defined types or classes.
 Inverse references is a feature that places the OIDs of related objects within the objects themselves and maintains referential integrity.
 Some systems provide a capability for dealing with multiple versions of the same object; this feature helps in design and engineering applications.
 To permit versioning, databases should allow schema evolution, which occurs when type declarations are changed or when new types or relationships are created.
 Another concept is operator overloading, which refers to an operation's ability to be applied to different types of objects. This feature is also known as operator polymorphism. A small illustrative sketch of these features follows.
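The sketch below is only an illustration of these ideas in Python (the class names are invented, and OIDs and persistence are merely simulated with a counter); it shows an OID per object, encapsulated state, an operation with a signature and a method body, inheritance, and operator polymorphism.

# Illustrative sketch of the OO features mentioned above (hypothetical classes, simulated OIDs).
import itertools

_oid_counter = itertools.count(1)          # stands in for system-generated object identifiers

class PersistentObject:
    def __init__(self):
        self.oid = next(_oid_counter)      # every object gets a unique OID

class Document(PersistentObject):
    def __init__(self, title):
        super().__init__()
        self._title = title                # instance variable: encapsulated internal state

    def describe(self):                    # operation: signature (name, arguments) + method body
        return "Document %d: %s" % (self.oid, self._title)

class Image(Document):                     # inheritance: Image reuses Document's structure and operations
    def __init__(self, title, pixels):
        super().__init__(title)
        self._pixels = pixels

    def describe(self):                    # operator overloading / polymorphism: same name, new behavior
        return "Image %d: %s (%d px)" % (self.oid, self._title, self._pixels)

for obj in (Document("Annual report"), Image("Logo", 2048)):
    print(obj.describe())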

Explain the operations of OLAP.


Introduction
 OLAP stands for Online Analytical Processing, a term coined by E.F. Codd in 1993.
 It refers to a type of application that allows a user to interactively analyze data.
 Before the term OLAP was coined, such systems were referred to as Decision Support Systems.
 They describe a class of applications which require
multidimensional analysis of business data.
 The OLAP systems enable the managers and the analysts to
rapidly and easily examine the key performance data and perform a
powerful comparison and trend analysis on large volumes of data.
 These systems can be used in a wide variety of business
areas like sales and marketing, financial reporting, quality tracking
etc.
 They are used for any management system that requires a flexible, top-down view of an organization.
 It is also a method used for analyzing data in a
multidimensional format often across the multiple time periods
with the aim of uncovering the business information concealed
within the data.
 It helps in enabling the business users to gain an insight in the
business through interactive analysis of different views of the
business data built up from the operation systems.
 OLAP is not a data warehousing methodology but an integral
part of it.
 It provides a facility to analyze the data that is held within the
data warehouse in a flexible manner.
 It can also be defined as a process of converting raw data into
business information through multi-dimensional analysis.
 The logic of an OLAP application includes:
1. Multi-dimensional data selection
2. Sub-setting of data
3. Retrieval of data via the metadata layer
4. Calculation formulas
 The OLAP application is accessed via a front-end tool that
uses the tables and charts to drill down or navigate through the
dimensional data or aggregated measures.
Operations of OLAP
The OLAP servers are based on the multidimensional view of data.

The different operations are

1. Roll-up
Roll-up performs aggregation on a data cube in either of the following ways:
a. By climbing up a concept hierarchy for a dimension.
b. By dimension reduction.

The following diagram illustrates roll-up

 This operation is performed by climbing up a concept hierarchy for the dimension location.
 Initially the concept hierarchy was “street < city < province < country”.
 On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of country.
 The data is grouped into countries rather than cities.
 When roll-up is performed by dimension reduction, one or more dimensions of the cube are removed.
2. Drill-down
It is the reverse of roll-up. It is performed in either of the following ways:
a. By stepping down a concept hierarchy for a dimension.
b. By introducing a new dimension.

The following diagram shows how drill-down works

 Drill-down is performed by stepping down a concept hierarchy for the dimension time.
 Initially the concept hierarchy was “day < month < quarter < year”.
 On drilling down, the time dimension descends from the level of quarter to the level of month.
 While performing this operation, one or more dimensions are added to the data cube.
 The data is navigated from less detailed data to more detailed data.
3. Slice
 This operation produces a new sub-cube by selecting one particular dimension from a given cube.
 The diagram illustrates the working of the slice operation, where the dimension “time” is sliced using the criterion time = “Q1”.
 A new sub-cube is formed by selecting a single dimension.
4. Dice
 This operation produces a new sub-cube by selecting two or more dimensions from a given cube.
 The above diagram illustrates the dice operation on the cube
based on the following selection criteria involving three
dimensions:
(location = “Delhi” or “Mumbai”)
(time = “Q1” or “Q2”)
(item = “Mobile” or “Modem”)

5. Pivot
 This operation is also known as rotation.
 In order to provide an alternative presentation of the data it
rotates the data axes.
The diagram below illustrates the pivot operation.

 In this operation, the item and location axes of a 2-D slice are rotated.
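These operations can be imitated on a toy data cube with pandas (the values and column names below are made up): roll-up and drill-down are GROUP BYs at coarser and finer levels, slice and dice are selections, and pivot rotates the axes.

# Toy illustration of roll-up, drill-down, slice, dice and pivot with pandas (invented data).
import pandas as pd

cube = pd.DataFrame({
    "country": ["India", "India", "India", "India"],
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "item":    ["Mobile", "Modem", "Mobile", "Modem"],
    "sales":   [100, 80, 120, 60],
})

roll_up    = cube.groupby("country")["sales"].sum()                        # city -> country (coarser level)
drill_down = cube.groupby(["country", "city", "quarter"])["sales"].sum()   # descend to a finer level

slice_q1 = cube[cube["quarter"] == "Q1"]                                   # slice: fix one dimension (time = Q1)
dice     = cube[cube["city"].isin(["Delhi", "Mumbai"]) &                   # dice: select on several dimensions
                cube["quarter"].isin(["Q1", "Q2"]) &
                cube["item"].isin(["Mobile", "Modem"])]

pivot = cube.pivot_table(index="item", columns="city", values="sales", aggfunc="sum")  # rotate the axes

print(roll_up)
print(pivot)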
Write short notes on:
a) Snowflake Schema
b) OLAP
a) Snowflake Schema

Introduction
 It is a logical arrangement of the tables in a multidimensional
database in such a manner that the entity relationship diagram
resembles a snowflake shape.
 This schema is represented by centralized fact tables which
are connected to the multiple dimensions.
 “Snowflaking” is a method of normalizing the dimension tables of a star schema.
 When complete normalization takes place along all the
dimension tables, the resultant structure will resemble a
snowflake with the fact table in between.
 The principle behind this schema is the normalization of the
dimension tables by way of removing the low cardinality attributes
and forming the separate tables.
 This schema is similar to that of the star schema.
 A complex shape emerges when the dimensions of this schema are elaborate, have multiple levels of relationships, and the child tables have multiple parent tables.
Diagram

The diagram shows a snowflake schema in which dimension tables are connected to a fact table and the dimension tables are further normalized into other tables.

Use
 They are mostly found in data warehouses and data marts, where the speed of data retrieval is more important than the efficiency of data manipulation.
Benefits
 The snowflake schema is in the same family as the star
schema logical model.
 The star schema can be thought of as a special case of the snowflake schema.
The advantages of snowflake schema over star schema are:
 Snowflake schemas are used by some OLAP multi-dimensional database modeling tools.
 Normalizing the attributes results in storage savings.
Disadvantages
 One of the primary disadvantages of this schema is that the additional level of attribute normalization adds complexity to the source query joins.
 The schema aims at efficient and compact storage of normalized data, but at the cost of poorer performance when browsing the joins required across the dimension tables.
 Compared with a highly normalized transactional schema, the remaining denormalization in this schema removes some of the data-integrity assurances provided by the normalized schema.
 Data loaded into this schema must be highly controlled and managed so as to avoid update and insert anomalies.
b) OLAP

Introduction
 OLAP stands for Online Analytical Processing, a term coined by E.F. Codd in 1993.
 It refers to a type of application that allows a user to interactively analyze data.
 Before the term OLAP was coined, such systems were referred to as Decision Support Systems.
 They describe a class of applications which require
multidimensional analysis of business data.
 The OLAP systems enable the managers and the analysts to
rapidly and easily examine the key performance data and perform a
powerful comparison and trend analysis on large volumes of data.
 These systems can be used in a wide variety of business
areas like sales and marketing, financial reporting, quality tracking
etc.
 They are used for any management system that requires a flexible, top-down view of an organization.
 It is also a method used for analyzing data in a
multidimensional format often across the multiple time periods
with the aim of uncovering the business information concealed
within the data.
 It helps in enabling the business users to gain an insight in the
business through interactive analysis of different views of the
business data built up from the operation systems.
 OLAP is not a data warehousing methodology but an integral
part of it.
 It provides a facility to analyze the data that is held within the
data warehouse in a flexible manner.
 It can also be defined as a process of converting raw data into
business information through multi-dimensional analysis.
 The logic of an OLAP application includes:
1. Multi-dimensional data selection
2. Sub-setting of data
3. Retrieval of data via the metadata layer
4. Calculation formulas.
 The OLAP application is accessed via a front-end tool that
uses the tables and charts to drill down or navigate through the
dimensional data or aggregated measures.
Uses of OLAP
 A lot of organizational functions use the OLAP applications.
 Finance departments use OLAP applications for budgeting, activity-based costing, financial performance analysis and financial modeling.
 Sales departments use them for sales analysis and forecasting.
 Marketing departments use them for market research analysis, sales forecasting, promotions analysis, customer analysis and market/customer segmentation.
 Manufacturing departments use them for production planning and defect analysis.
 OLAP provides “just-in-time” information to managers for effective decision making.
 They enable the managers, analysts, executives to gain an
insight into the data by fast, consistent, interactive access to a
wide variety of possible views of information.
 OLAP transforms the data warehouse into strategic
information.
Benefits of OLAP
 It increases the productivity of the business managers,
developers and the whole organization.
 The systems are flexible, which allows users to be largely self-sufficient.
 Enables the managers to model the problems more easily.
 The developers can deliver the application to business users
faster by providing better services. The faster delivery also
reduces the applications backlog.
 IT gains more self-sufficient users without relinquishing control over the integrity of the data.
 OLAP makes IT operations more efficient, reducing query drag and network traffic on transaction systems.
 It enables the organization as a whole to respond more
quickly to market demands. This in turn will improve the revenue
and profits.
Key Features of OLAP
The features are as follows:

1. Multi-dimensional views of data


 They inherently represent an actual business model.
 They provide the ability to slice and dice, which is the foundation of analytical processing through flexible access to information.
 Managers should not be burdened with understanding complex table layouts, elaborate table joins and summary tables.
 Response times should be consistent.
2. Complex Calculations
 The OLAP database should be able to perform complex
calculations.
 It must provide a rich tool kit of powerful yet succinct
computational methods.
 To make the developers more efficient and users more self-
sufficient the implementation of computational methods must be
clear and non-procedural.
3. Time Intelligence
 Time is the integral component of almost any analytical
application.
 It is a unique dimension because it is sequential in character.
 Time hierarchy is not always used in the same manner as the
other hierarchies.
 An OLAP system must understand the concept of balances over time.

Explain the different schemas in dimensional data modeling.

 The schema is a logical description of the entire database.
 It includes the name and description of records of all record types, including all the associated data items and aggregates.
 Just like a database, a data warehouse also requires a schema.
 An operational database uses a relational model, while a data warehouse uses the star, snowflake or fact constellation schema.
The schemas are explained as below:

Star Schema
 It is the simplest data warehouse schema.
 It is called a star schema because the entity-relationship
diagram of this schema resembles a star, with points radiating
from a central table.
 The center of the star consists of a large fact table and the
points of the star are the dimension tables.
 A star schema is characterized by one or more very large fact
tables that contain the primary information in the data warehouse,
and a number of much smaller dimension tables (or lookup tables),
each of which contains information about the entries for a
particular attribute in the fact table.
 A star query is a join between a fact table and a number of
dimension tables.
 Each dimension table is joined to the fact table using a
primary key to foreign key join, but the dimension tables are not
joined to each other.
 The cost-based optimizer recognizes star queries and
generates efficient execution plans for them.
 A typical fact table contains keys and measures.
 A star join is a primary key to foreign key join of the dimension tables to a fact table; a small illustrative sketch follows.
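As an illustration only (table contents invented), the pandas sketch below has the shape of a star query: the fact table is joined to each dimension table on its foreign key, and the dimension tables are never joined to each other.

# Toy star join: one fact table joined to two dimension tables on foreign keys (made-up data).
import pandas as pd

fact_sales  = pd.DataFrame({"product_id": [1, 2, 1], "store_id": [10, 10, 20], "amount": [5.0, 7.5, 3.0]})
dim_product = pd.DataFrame({"product_id": [1, 2], "product_name": ["Mobile", "Modem"]})
dim_store   = pd.DataFrame({"store_id": [10, 20], "city": ["Delhi", "Mumbai"]})

# Each dimension joins to the fact table only; dimensions are not joined to each other.
star = (fact_sales
        .merge(dim_product, on="product_id")
        .merge(dim_store, on="store_id"))

print(star.groupby(["city", "product_name"])["amount"].sum())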
Advantages of star schemas are:
 They provide a direct and intuitive mapping between the
business entities being analyzed by end users and the schema
design.
 It provides highly optimized performance for typical star
queries.
 They are widely supported by a large number of business
intelligence tools, which may anticipate or even require that the
data-warehouse schema contain dimension tables.
 Star schemas are used for both simple data marts and very
large data warehouses.
Diagram
Snowflake Schema
 The snowflake schema is a more complex data warehouse
model than a star schema, and is a type of star schema.
 It is called a snowflake schema because the diagram of the
schema resembles a snowflake.
 Snowflake schemas normalize dimensions to eliminate
redundancy.
 That is, the dimension data has been grouped into multiple
tables instead of one large table.
 It is a logical arrangement of the tables in a multidimensional
database in such a manner that the entity relationship diagram
resembles a snowflake shape.
 This schema is represented by centralized fact tables which
are connected to the multiple dimensions.
 “Snowflaking” is a method of normalizing the dimension tables of a star schema.
 When complete normalization takes place along all the
dimension tables, the resultant structure will resemble a
snowflake with the fact table in between.
 The principle behind this schema is the normalization of the
dimension tables by way of removing the low cardinality attributes
and forming the separate tables.
 This schema is similar to that of the star schema.
 A complex shape emerges when the dimensions of this schema are elaborate, have multiple levels of relationships, and the child tables have multiple parent tables.
Diagram

Fact Constellation Schema


 It has multiple fact tables.
 It is also known as galaxy schema.
 They contain many fact tables with some common
dimensions.
 It is also a combination of many data marts.
 They are segregated into independent dimensions based on
the level hierarchy.
Diagram
Describe classification. Explain any two
classification algorithms with examples.
Introduction
 Classification is a data-mining function that assigns items in a collection to target categories or classes.
 Once the classification is done, a prediction or decision can be made about the data.
 It generally works from historical data.
 The goal is to construct a model, using the historical data, that accurately predicts the label of unlabeled examples.
 A classification task generally begins with data for which the target values are known.
 There are three different approaches that are followed by the
classification model: discriminative approach, regression approach
and class-conditional approach.
 A classification task begins with a data set in which the class
assignments are known.
 Classifications are discrete and do not imply order.
 Continuous, floating-point values would indicate a numerical,
rather than a categorical, target.
 A predictive model with a numerical target uses a regression
algorithm, not a classification algorithm.
 The simplest type of classification problem is binary
classification.
 In binary classification, the target attribute has only two
possible values: for example, high credit rating or low credit
rating.
 Multiclass targets have more than two values: for
example, low, medium, high, or unknown credit rating.
 In the model build (training) process, a classification
algorithm finds relationships between the values of the predictors
and the values of the target.
 Different classification algorithms use different techniques for
finding relationships.
 These relationships are summarized in a model, which can
then be applied to a different data set in which the class
assignments are unknown.
 Classification models are tested by comparing the predicted
values to known target values in a set of test data.
 The historical data for a classification project is typically
divided into two data sets: one for building the model; the other
for testing the model.
 Scoring a classification model results in class assignments
and probabilities for each case.
 For example, a model that classifies customers as low,
medium, or high value would also predict the probability of each
classification for each customer.
 Classification has many applications in customer
segmentation, business modeling, marketing, credit analysis, and
biomedical and drug response modeling.
Classification Algorithms
Classification provides a number of algorithms; two of them are explained below:

1. Decision Tree
 These are the predictive models that are used to graphically
organize the information about the possible options, consequences
and the end values.
 Each branch of this tree is a classification question and the
leaves of this tree are the partitions of the dataset with the
classification.
 The outcome of the test depends upon the choice of a certain
branch.
 A particular data item is classified by starting at the root node and following the assertions down until a terminal node (or leaf) is reached.
 A decision is taken when the terminal node is reached.
 They are also interpreted as a special form of rule set that are
characterized by their hierarchical organization of rules.
Diagram

The basic algorithm can be summarized as follows:

Input
 A set of training tuples and their associated class labels – data partition D.
 An attribute list of the candidate attributes.
 An attribute selection method: a procedure used to determine the splitting criterion that best partitions the data tuples into individual classes.
Output
 A decision tree is the output for the above input.
Method / Steps for creating a decision tree
1. Create a node N.
2. If the tuples in D are all of the same class C, return N as a leaf node labeled with class C.
3. If the attribute list is empty, return N as a leaf node labeled with the majority class in D.
4. Apply the attribute selection method to (D, attribute list) to find the best splitting criterion.
5. Label node N with the splitting criterion.
6. If the splitting attribute is discrete-valued and multiway splits are allowed, remove the splitting attribute from the attribute list.
7. For each outcome j of the splitting criterion, partition the tuples and grow a subtree for each partition:
 Let Dj be the set of data tuples in D satisfying outcome j.
 If Dj is empty, attach a leaf labeled with the majority class in D to node N.
 Otherwise, attach the node returned by generate decision tree(Dj, attribute list) to node N.
8. Return N.
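For a concrete, hedged illustration, the sketch below trains a decision tree with scikit-learn on invented data (two numeric predictors and a binary credit-rating target); the library's tree-induction procedure follows the same general outline as the steps above.

# Sketch: training and applying a decision tree classifier on made-up data.
from sklearn.tree import DecisionTreeClassifier

# Predictors: [income_in_thousands, years_as_customer]; target: credit-rating class.
X_train = [[20, 1], [35, 3], [60, 5], [80, 10], [25, 2], [90, 8]]
y_train = ["low", "low", "high", "high", "low", "high"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)   # small tree for readability
tree.fit(X_train, y_train)                                   # build (train) the model

X_new = [[30, 2], [70, 6]]                                   # cases with unknown class assignments
print(tree.predict(X_new))                                   # e.g. ['low' 'high']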
2. Bayesian classification
 It is based on the Bayes theorem.
 They are statistical classifiers.
 These classifiers help in predicting about the class
membership probability which means that we can predict about
the particular record to which class it belongs.
 Bayesian classifiers are accurate and perform well on large databases.
 The naive Bayesian classifier assumes class-conditional independence, which means that the effect of an attribute value on a given class is independent of the values of the other attributes.
Bayes Theorem
 The Bayes theorem is named after Thomas Bayes in the 18th
century.
 It involves two types of probabilities:
1. Posterior probability, P(H|X)
2. Prior probability, P(H)
 Here X is a data tuple and H is some hypothesis.
According to Bayes' theorem:
P(H|X) = P(X|H) P(H) / P(X)
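A tiny worked example with assumed numbers: let H be the hypothesis that a tuple belongs to a given class, with prior P(H) = 0.01, likelihood P(X|H) = 0.9 for the observed tuple X, and evidence probability P(X) = 0.05.

# Worked Bayes-theorem computation with assumed probabilities (illustration only).
p_h = 0.01           # prior P(H)
p_x_given_h = 0.9    # likelihood P(X|H)
p_x = 0.05           # evidence P(X)

p_h_given_x = p_x_given_h * p_h / p_x    # posterior P(H|X)
print(p_h_given_x)                        # 0.18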

Bayesian Network
 These networks specify joint probability distributions. They are also known as belief networks, Bayesian networks or probabilistic networks.
 They allow class-conditional independences to be defined between subsets of the variables.
 They provide a graphical model of causal relationships, on which learning can be performed.
 A trained Bayesian network can be used for classification.
A Bayesian Belief Network is defined by two components:

1. Directed acyclic graph
 Each node in this graph represents a random variable.
 The variables can be continuous or discrete valued.
 The variables correspond to the actual attributes given in the data.
Graphical representation of the acyclic graph

 The arcs in the above diagram represent causal knowledge.
 For example, whether a person develops diabetes is influenced by the person's family history and age.
 The variable positive test is independent of whether the patient has a family history of diabetes and of the patient's age, given that we know the patient has diabetes.
2. Set of conditional probability tables
 There is one conditional probability table (CPT) for each variable.
 The CPT for a variable Y specifies the conditional distribution P(Y | Parents(Y)), i.e. the probability of each value of Y for each combination of values of its parents in the graph.
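As a rough sketch only (the network structure and the probability values below are hypothetical, chosen to mirror the diabetes example), the CPTs of a tiny chain FamilyHistory → Diabetes → PositiveTest can be written as Python dictionaries and multiplied to obtain a joint probability:

```python
# Hypothetical two-arc network: FamilyHistory -> Diabetes -> PositiveTest.
# Each CPT maps the parent's value to P(child = True | parent value).
p_family_history = 0.3                          # P(FH = True); a root node has no parents
cpt_diabetes = {True: 0.40, False: 0.05}        # P(Diabetes = True | FH)
cpt_positive_test = {True: 0.90, False: 0.10}   # P(Test = True | Diabetes)

def joint(fh, diabetes, test):
    """The joint probability factorizes as the product of each node given its parents."""
    p = p_family_history if fh else 1 - p_family_history
    p *= cpt_diabetes[fh] if diabetes else 1 - cpt_diabetes[fh]
    p *= cpt_positive_test[diabetes] if test else 1 - cpt_positive_test[diabetes]
    return p

# P(FH = True, Diabetes = True, Test = True) = 0.3 * 0.40 * 0.90
print(joint(True, True, True))   # 0.108
```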
Write short notes on:
a) Text mining
b) Data-visualization.
Introduction
 Text databases consist of huge collections of documents.
 This information is collected by various means such as news articles, books, digital libraries, e-mail messages, web pages, etc.
 The text databases are growing rapidly due to the increasing amount of information.
 The data in many text databases is semi-structured.
 Take the example of a document that contains a few structured fields, like title, author, publishing_date, etc.
 But along with this structured data, the document also contains unstructured text components, like the abstract and contents.
 Without having any knowledge of what could be in the
documents, it becomes difficult to formulate effective queries for
analyzing and extracting useful information from the data.
 Tools are required by the users to compare the documents
and rank their importance and relevance.
 Hence, text mining has become popular and an essential
theme in data mining.
Information Retrieval
 Information retrieval deals with the retrieval of information from a large number of text-based documents.
 Information retrieval systems handle kinds of data and search problems, such as keyword-based retrieval from unstructured documents, that are not usually handled by database systems.
 Examples of information retrieval systems include:
 Online Library Catalog Systems
 Online Document Management Systems
 Web Search Systems, etc.
 The information retrieval system's main problem is to locate relevant documents in a document collection based on a user's query.
 Such a user query consists of some keywords describing an information need.
 In such search problems, the user takes the initiative to pull relevant information out of a collection.
 This is appropriate when the user has an ad-hoc information need, i.e., a short-term need.
 The retrieval system can also take the initiative and push a newly arrived information item to the user if the user has a long-term information need.
 This kind of access to information is called Information Filtering, and the corresponding systems are known as Filtering Systems or Recommender Systems.
Basic Measures for Text Retrieval
 The accuracy of the system is assessed by comparing the set of documents retrieved for a user's query with the set of documents that are actually relevant to it.
 The set of documents relevant to a query is denoted as {Relevant} and the set of retrieved documents as {Retrieved}.
 The set of documents that are both relevant and retrieved is denoted as {Relevant} ∩ {Retrieved}.
This can be shown in the form of a Venn diagram as follows:

The quality of text retrieval can be assessed by using three fundamental measures:

1. Precision
It is the percentage of retrieved documents that are in fact relevant to the query. It can be defined as:

Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

2. Recall
It is the percentage of documents that are relevant to the query and were in fact retrieved. It is defined as:

Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

3. F-score
 It is commonly used as a trade-off measure.
 An information retrieval system often needs to trade off recall for precision or vice versa.
 It is defined as the harmonic mean of recall and precision:

F-score = (2 × recall × precision) / (recall + precision)
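A small illustrative Python snippet computing these measures (the document identifiers below are hypothetical):

```python
# Hypothetical query result: documents the system retrieved vs. documents that are truly relevant.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}

overlap = relevant & retrieved                           # {Relevant} ∩ {Retrieved}
precision = len(overlap) / len(retrieved)                # 2/3 ≈ 0.67
recall = len(overlap) / len(relevant)                    # 2/4 = 0.50
f_score = 2 * precision * recall / (precision + recall)  # harmonic mean ≈ 0.57

print(precision, recall, f_score)
```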

Text mining applications

The applications where text mining is used are as follows:

1. Security applications – It involves the analysis of plain text sources such as internet news. The study of text encryption is also involved.
2. Biomedical applications – One of the best examples of this is PubGene, which combines biomedical text mining with network visualization as an internet service.
3. Marketing applications – It is more specifically used for
analytical customer relationship management.
4. Software applications – Firms like IBM are further trying to
automate the mining and analysis processes in order to improve the
text mining results.
5. Online media applications – It is generally used to clarify
information and provide readers with greater search experiences,
which in turn increases site "stickiness" and revenue. On the back
end, editors are benefiting by being able to share, associate and
package news across properties, significantly increasing
opportunities to monetize content.
6. Sentiment analysis – It involves the analysis of movie reviews for
estimating how favorable a review is for a movie.

b) Data-visualization
 Data visualization is viewed as the modern equivalent of visual communication.
 It involves the creation of visual representations of data.
 Its primary goal is to communicate information clearly and efficiently to users by means of statistical graphics, plots and information graphics such as charts.
 It helps decision makers see the analytics visually, grasp difficult concepts and identify new patterns.
 With the help of interactive visualization, the data behind charts and graphs can be drilled down into for more detail.
Importance
 Data visualization is important because the human brain processes information faster when charts or graphs are used to visualize large amounts of data.
 It is a quick and easy way to convey concepts in a universal manner.
 Different scenarios can be experimented with by making slight adjustments.
 It can also help in identifying the areas which need attention or improvement.
 The factors that influence customer behavior are clarified.
 It helps build an understanding of which products need to be placed where.
 Sales volumes can be predicted.
Use of data visualization
 Comprehend the information quickly
The graphical representation of business information helps one see large amounts of data in a clear, cohesive way and draw conclusions from it. Problems can be sorted out in a timely manner thanks to faster analysis.
 Identifying relationships and patterns
Large amounts of complicated data start making sense when presented graphically. The business can recognize parameters that are highly correlated. Identifying such relationships helps the organization focus on the areas that influence its goals.
 Pinpointing emerging trends
Using data visualization to discover trends can give the business a competitive edge and affect the bottom line. It becomes easy to spot the outliers that affect product quality or customer churn, and to address issues before they turn into bigger problems.
 Communicating the story to others
Once the business has uncovered new insights from the visual analytics, the next step is to communicate them to others. Using charts, graphs and other visually impactful representations makes it easier to get the message across quickly.
Characteristics of an effective graphical display
The graphical display should possess the following characteristics:
 It should show the data.
 It should induce the viewer to think about the substance rather than about the methodology, graphic design, the technology of graphic production or anything else.
 It should avoid distorting what the data has to say.
 It should present many numbers in a small space.
 It should make large data sets coherent.
 It should encourage the eye to compare different pieces of data.
 It should reveal the data at several levels of detail, from a broad overview to the fine structure.
 It should serve a reasonably clear purpose: description, exploration, tabulation or decoration.
 The statistical and verbal descriptions of the data set should be closely integrated.
Diagrams used for data visualization
 Bar chart
 Histogram
 Scatter plot
 Scatter plot (3D)
 Network
 Streamgraph
 Treemap
 Gantt chart
 Heat Map
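As a minimal illustration of producing two of these diagram types (the sample data is hypothetical, and the matplotlib library is assumed to be available even though it is not mentioned in the source):

```python
import matplotlib.pyplot as plt

# Hypothetical monthly sales figures for a bar chart.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

# Hypothetical customer ages for a histogram.
ages = [22, 25, 31, 34, 35, 41, 42, 45, 51, 58, 63]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.bar(months, sales)        # bar chart: comparison across categories
ax1.set_title("Monthly sales")
ax2.hist(ages, bins=5)        # histogram: distribution of a numeric variable
ax2.set_title("Customer ages")
plt.tight_layout()
plt.show()
```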

Elucidate the difference between DTD & XML schema.

XML Schema | DTD
XML Schema is namespace aware. | DTD is not namespace aware.
It is written in XML itself. | It is not written in XML; it has its own separate syntax.
It eliminates the need to learn another language. | It requires learning another language.
XML Schema implements strong typing. | DTD lacks strong typing.
It has derived and built-in data types. | It does not have derived and built-in data types.
An XML Schema cannot be defined inline. | A DTD can be defined inline.
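As an illustrative sketch only (using the third-party lxml library, which is not mentioned in the source), the same tiny document can be validated against both a DTD and an XML Schema; this highlights that the DTD uses its own syntax while the schema is written in XML and carries data types:

```python
from io import StringIO
from lxml import etree

doc = etree.XML("<note><to>Alice</to></note>")

# DTD: its own non-XML syntax, not namespace aware, no built-in data types.
dtd = etree.DTD(StringIO("<!ELEMENT note (to)> <!ELEMENT to (#PCDATA)>"))
print(dtd.validate(doc))   # True

# XML Schema: written in XML itself, namespace aware, typed (xs:string here).
xsd_doc = etree.parse(StringIO("""
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence><xs:element name="to" type="xs:string"/></xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""))
print(etree.XMLSchema(xsd_doc).validate(doc))   # True
```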

Explain XML Parsers.

XML Parsers
 XML parsing refers to going through an XML document in order to access or modify its data in one way or another.
 An XML parser provides a way of accessing or modifying the data that is present in an XML document.
The various types of parsers are as follows:

1. DOM Parser: It parses the document by loading its complete content and creating its complete hierarchical tree in memory.
2. SAX Parser: It parses the document on event-based triggers. The complete document is not loaded into memory.
3. JDOM Parser: It parses the document in the same manner as DOM, but in an easier way.
4. StAX Parser: It parses the document in the same manner as SAX, but in a more efficient way.
5. XPath Parser: It parses the XML based on XPath expressions and is used extensively in conjunction with XSLT.
6. DOM4J Parser: A Java library that parses XML, XPath and XSLT using the Java Collections Framework, with support for DOM, SAX and JAXP.
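To make the DOM-versus-SAX distinction concrete, here is a small illustrative sketch using Python's standard-library parsers (Python is used only for illustration; the JDOM, StAX and DOM4J parsers listed above are Java libraries):

```python
import xml.dom.minidom
import xml.sax

xml_data = "<books><book title='Data Mining'/><book title='XML Basics'/></books>"

# DOM-style parsing: the whole document is loaded into an in-memory tree.
dom = xml.dom.minidom.parseString(xml_data)
for book in dom.getElementsByTagName("book"):
    print("DOM:", book.getAttribute("title"))

# SAX-style parsing: the document is streamed and a callback fires per event;
# the full tree is never built in memory.
class BookHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        if name == "book":
            print("SAX:", attrs["title"])

xml.sax.parseString(xml_data.encode("utf-8"), BookHandler())
```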
