a) N-tier architecture
b) SOAP
a) N-tier architecture
The N-tier architecture distributes an application among three
or more separate computers in a distributed network.
The most common form of N-tier is the 3-tier application, where the
user interface programming runs on the user's computer, the
business logic runs on a centralized computer, and the data that is
needed resides on the computer that manages the database.
It builds on the client/server program model.
If there are more than three distribution levels or tiers
involved, the additional tiers are usually associated with the
business logic tier.
It is also referred to as the pulling apart of an application into
separate layers or finer grains.
One of the best examples of this architecture among web
applications is the shopping cart web application.
Here the client tier interacts with the user via GUIs and
communicates with the application server.
In most web applications the client is a web browser.
The integration tier allows the N-tier architecture to be vendor
independent.
The business tier is also considered the integration tier.
Encapsulation allows the application to communicate with the
business tier in a way that is intelligible to all the nodes.
The final application tier is the data tier.
It mostly consists of the database servers. The data is kept
neutral and independent of the application servers and the
business logic.
Giving the data its own tier improves scalability and
performance, and as the data grows it can easily be moved to
another, more powerful machine.
Benefits of N-Tier Architecture
It helps in improving scalability and supports cost-efficient
application building.
It helps in making applications more readable and reusable.
The applications that are built are robust because they have no
single point of failure. The tiers function with relative
independence. Reusability is important for web applications.
Authentication and authorization are provided for security.
This allows the web server to restrict user access based on
pre-determined criteria.
It helps developers to build web applications, as it allows each
developer to apply their specific skills to the part of the program
that best suits their skill set.
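As an illustration, the tier separation described above can be sketched in Python. All class and method names here are hypothetical, and an in-memory dict stands in for the database server:

```python
# A minimal sketch of 3-tier separation for a shopping-cart app.
# All names are illustrative, not from any framework.

class DataTier:
    """Data tier: stands in for the database server."""
    def __init__(self):
        self._products = {"pen": 10.0, "book": 25.0}  # product -> price

    def get_price(self, name):
        return self._products[name]

class BusinessTier:
    """Business tier: pricing logic, kept independent of storage."""
    def __init__(self, data):
        self.data = data

    def cart_total(self, items):
        # illustrative business rule: 10% discount on totals above 100
        total = sum(self.data.get_price(i) for i in items)
        return total * 0.9 if total > 100 else total

class ClientTier:
    """Client tier: the user interface; here just a thin text front end."""
    def __init__(self, business):
        self.business = business

    def show_total(self, items):
        return f"Total: {self.business.cart_total(items):.2f}"

client = ClientTier(BusinessTier(DataTier()))
print(client.show_total(["pen", "book"]))   # Total: 35.00
```

Because each tier only talks to the one below it, the data tier could be moved to another machine (or swapped for a real database) without touching the client code.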
b) SOAP
SOAP (Simple Object Access Protocol) is a communication
protocol, a way to structure data before transmitting it, based on
the XML standard. It was developed to allow communication between
applications on different platforms and in different programming
languages via the Internet.
It can use a range of protocols such as HTTP, FTP, SMTP and Post
Office Protocol 3 (POP3) to carry documents.
HTTP GET and HTTP POST work with name/value pairs, which means
transferring a complex object is not possible with these methods,
whereas SOAP serializes complex structures, such as ASP.NET
DataSets, complex arrays, custom types and XML nodes, before
transmitting and thus allows the exchange of complex objects
between applications.
Two components can easily communicate using the Remote
Procedure Call (RPC) protocol. But because of its compatibility and
security issues, most firewalls and proxy servers block this type
of message. SOAP uses the HTTP channel for transport, which makes
it a widely accepted protocol over the Internet.
Steps taken in the SOAP Processing model
There are different nodes used, and they are termed SOAP nodes.
They act as receivers in the process and allow access to the
messages as well.
The nodes play the following roles:
SOAP sender : a node that transmits a SOAP message.
SOAP receiver : a node that receives or accepts the message
passed to it.
SOAP message path : the path along which the messages travel
to reach their destination.
Initial SOAP sender : also called the originator, it sends the
message at the starting point of the message path.
SOAP intermediary : a node between the initial SOAP sender and
the ultimate SOAP receiver that handles the SOAP message. It
processes the header blocks and forwards the SOAP message toward
the receiver.
Ultimate SOAP receiver : the node where the message is finally
received. It is responsible for processing the contents of the SOAP
body and the SOAP header included in the message.
Message format used in SOAP
The message format is written using the XML language, a widely
used standard, which allows an easy transition to delivering
SOAP-based implementations.
The format of the protocol allows easy readability and ease of
error detection, and it removes interoperability problems such as
byte order.
A message is given in the format shown below:
POST /InStock HTTP/1.1
Host: localhost
Content-Type: application/soap+xml; charset=utf-8
Content-Length: 299
SOAPAction: "http://www.abc.org/2003/05/soap-envelope"
<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://www.abc.org/2003/05/soap-envelope">
<soap:Header>
</soap:Header>
<soap:Body>
<m:CareerName>Careeride</m:CareerName>
</soap:Body>
</soap:Envelope>
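A sketch of building and parsing such an envelope with Python's standard library follows. The namespace URI mirrors the illustrative example above (the m prefix is given a made-up URI here, since the example leaves it undeclared); real SOAP 1.2 uses http://www.w3.org/2003/05/soap-envelope:

```python
# Building and parsing a SOAP envelope like the one above, using only
# the standard library. Namespace URIs are illustrative.
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.abc.org/2003/05/soap-envelope"
M_NS = "http://www.abc.org/stock"          # made-up URI for the m prefix

envelope = (
    '<?xml version="1.0"?>'
    f'<soap:Envelope xmlns:soap="{SOAP_NS}" xmlns:m="{M_NS}">'
    '<soap:Header></soap:Header>'
    '<soap:Body><m:CareerName>Careeride</m:CareerName></soap:Body>'
    '</soap:Envelope>'
)

root = ET.fromstring(envelope)             # parse the XML envelope
body = root.find(f"{{{SOAP_NS}}}Body")     # locate the soap:Body element
name = body.find(f"{{{M_NS}}}CareerName")  # locate the payload element
print(name.text)  # Careeride
```

In a real exchange, this envelope string would be sent as the HTTP POST body with the Content-Type and SOAPAction headers shown above.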
Working of SOAP
SOAP is used to invoke functionality on a server object from a
client object; the request that the client sends goes to the server,
where it is handled by the server object.
The client creates a request message that names the server
object, the interface and the method to call, along with other
information such as parameters.
It uses HTTP to send the XML to the server using the POST
method. The server parses the request, invokes the method and
sends the result back to the client side.
The server creates further XML containing the response to the
client's request, which is returned over HTTP.
The client can use other transports to send the XML as well. It
can use an SMTP server as well as the POP3 protocol to pass the
messages for request or response queries.
Problems faced by the user by using SOAP
SOAP is a relatively new protocol used for cross-platform
communication, and it can bypass the firewall.
This new protocol has more security vulnerabilities than many
others. There is a problem in using this protocol because a
firewall, a security mechanism, comes in between.
A firewall blocks all ports except a few such as HTTP port 80,
and it is this HTTP port that SOAP uses to bypass the firewall.
This is a serious concern, as it can pose difficulties for users.
There are ways to deal with it; for example, SOAP traffic can be
filtered at the firewall. Each SOAP message has a unique header
field that can be used to check the SOAP messages passing through
the firewall.
Parallel processing of relational operations
The following relational operations can be evaluated in parallel:
Parallel Sort
For example, suppose we want to sort a relation that resides on n
disks D0, D1, ..., Dn-1.
If this relation is range-partitioned on the sort attributes, then
each partition can be sorted separately and the results
concatenated to obtain the fully sorted relation.
As the tuples are partitioned over the n disks, the time required
for reading the entire relation is reduced by the parallel access.
If the relation has been partitioned in any other way, it can be
sorted in one of the following ways:
1. Range-partition it on the sort attributes and then sort each
partition separately.
2. Use the parallel version of the external sort-merge algorithm.
Range-partitioning sort
It works in two steps: first range-partition the relation, and
second sort each partition separately.
When we sort the relation, it is not necessary to range-partition
it on the same set of processors or disks as those on which the
relation is stored.
The range partitioning should be done with a good range-partition
vector, so that each partition has approximately the same number
of tuples.
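As a sketch, the range-partitioning sort can be simulated sequentially in Python; in a real system each partition would be sorted by its own processor in parallel:

```python
# Range-partitioning sort, simulated sequentially: partition on the
# sort attribute with a range-partition vector, sort each partition,
# then concatenate the sorted partitions.
import bisect

def range_partition(tuples, vector):
    # partition i holds values in [vector[i-1], vector[i])
    parts = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        parts[bisect.bisect_right(vector, t)].append(t)
    return parts

def range_partitioning_sort(tuples, vector):
    # each partition would be sorted by its own processor in parallel
    return [x for part in range_partition(tuples, vector) for x in sorted(part)]

data = [42, 7, 19, 3, 88, 55, 21]
print(range_partitioning_sort(data, [10, 50]))  # [3, 7, 19, 21, 42, 55, 88]
```

Because the partitions cover disjoint, ordered ranges, simple concatenation of the locally sorted partitions yields the globally sorted relation.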
Parallel External Sort-Merge
It is an alternative to range partitioning.
Suppose a relation has already been partitioned among the
disks D0,D1,....Dn-1.
The parallel sort-merge will work in the following manner:
1. Each processor Pi will locally sort the data on the disk Di.
2. To get the final sorted output the system merges the sorted runs
on each processor.
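This two-step scheme can be sketched as follows, with heapq.merge standing in for the system's merge of the sorted runs:

```python
# Parallel external sort-merge, simulated: each "processor" sorts its
# local partition, then the sorted runs are merged into one output.
import heapq

def parallel_sort_merge(partitions):
    runs = [sorted(p) for p in partitions]   # step 1: local sorts (in parallel)
    return list(heapq.merge(*runs))          # step 2: merge the sorted runs

disks = [[9, 1, 5], [8, 2], [7, 3, 4, 6]]    # tuples already on D0..D2
print(parallel_sort_merge(disks))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Unlike range-partitioning sort, the merge step here does real work, since the locally sorted runs overlap in value.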
Parallel Join
The join operation tests the pairs of tuples to see whether
they satisfy the join condition and if they do the system adds the
pair to the join output.
The parallel join algorithms attempt to split the pairs that are
to be tested over several processors.
Each processor then checks part of the join locally.
After this, the system collects the results from each of the
processors to produce the final result.
The types of joins are:
Partitioned join
Fragment and Replicate join
Partitioned Parallel Hash join
Parallel Nested-Loop join
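As an illustrative sketch, a partitioned parallel hash join can be simulated sequentially; each loop iteration over a partition pair plays the role of one processor:

```python
# A partitioned parallel hash join, simulated sequentially: both
# relations are hash-partitioned on the join attribute, and each
# partition pair is joined locally, then the results are combined.
def partitioned_hash_join(r, s, n=3):
    # r and s are lists of (key, payload) tuples
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in r:
        r_parts[hash(t[0]) % n].append(t)    # partition r on the join key
    for t in s:
        s_parts[hash(t[0]) % n].append(t)    # partition s the same way

    out = []
    for ri, si in zip(r_parts, s_parts):     # one iteration per "processor"
        index = {}                           # local hash table on ri
        for k, v in ri:
            index.setdefault(k, []).append(v)
        for k, w in si:                      # probe with tuples of si
            for v in index.get(k, []):
                out.append((k, v, w))
    return out

r = [(1, "a"), (2, "b"), (3, "c")]
s = [(2, "x"), (3, "y"), (4, "z")]
print(sorted(partitioned_hash_join(r, s)))  # [(2, 'b', 'x'), (3, 'c', 'y')]
```

Matching tuples always land in the same partition because both relations use the same hash function on the join key, so no cross-partition communication is needed during the local joins.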
Other relational operators
Selection
Duplicate elimination
Duplicates can be eliminated during sorting, using either of the
parallel sort techniques. Duplicate elimination can also be
parallelized by partitioning the tuples and eliminating duplicates
locally at each processor.
Projection
Projection without duplicate elimination can be performed as the
tuples are read from the disk in parallel. To eliminate duplicates,
either of the above techniques can be used.
Aggregation
The operation can be parallelized by partitioning the relation on
the grouping attributes and computing the aggregate values locally
at each processor. Either hash partitioning or range partitioning
can be used.
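A sequential sketch of this partition-then-aggregate scheme, for a SUM over a grouping attribute:

```python
# Parallel aggregation, simulated: hash-partition the relation on the
# grouping attribute, aggregate locally in each partition, then combine.
def parallel_sum_by_group(tuples, n=2):
    # tuples are (group, value) pairs
    parts = [[] for _ in range(n)]
    for g, v in tuples:
        parts[hash(g) % n].append((g, v))   # partition on grouping attribute

    result = {}
    for part in parts:            # each part aggregated by one "processor"
        local = {}
        for g, v in part:
            local[g] = local.get(g, 0) + v
        result.update(local)      # safe: a group never spans partitions
    return result

sales = [("east", 10), ("west", 5), ("east", 7), ("west", 2)]
print(parallel_sum_by_group(sales))
```

Because all tuples of a group hash to the same partition, the local sums are already final and combining them is a simple union.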
Interoperation parallelism
It has two types of parallelism:
1. Pipelined Parallelism
Parallel systems use pipelining mainly for the same reason that
sequential systems do.
Pipelines are a source of parallelism in the same way that
instruction pipelines are a source of parallelism in hardware
design.
Two operations can be run simultaneously on different
processors, so that one operation consumes tuples in parallel with
the other operation producing them.
This form of parallelism is known as pipelined parallelism.
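Pipelined parallelism can be illustrated with Python generators, where the interleaving is cooperative rather than on separate processors: the selection consumes each tuple as soon as the scan produces it:

```python
# A producer/consumer pipeline sketch: scan produces tuples one at a
# time, and select consumes them without waiting for the full scan.
def scan(relation):
    for t in relation:
        yield t                  # produce one tuple at a time

def select(tuples, predicate):
    for t in tuples:
        if predicate(t):         # consume each tuple as it is produced
            yield t

pipeline = select(scan(range(10)), lambda x: x % 2 == 0)
print(list(pipeline))  # [0, 2, 4, 6, 8]
```

In a real parallel system the two operators would run on different processors, with tuples streamed between them instead of interleaved by the generator protocol.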
2. Independent Parallelism
The operations in a query expression that do not depend on
one another can be executed in parallel. This is known as
independent parallelism.
The independent parallelism does not provide a high degree of
parallelism and is less useful in a highly parallel system, even if it
is useful with a lower degree of parallelism.
Round Robin
It scans the relation in any order and sends the ith tuple to
disk D(i mod n).
The scheme ensures an even distribution of tuples across
disks; that is, each disk has approximately the same number of
tuples as the others.
Hash partitioning
It is a declustering strategy that designates one or more
attributes from the given relation’s schema as the partitioning
attributes.
A hash function is chosen whose range is {0, 1, . . . , n - 1}.
Each tuple of the original relation is hashed on the partitioning
attributes.
If 'i' is returned by the hash function, then the tuple is placed
on disk Di.
Range partitioning
It distributes the tuples by assigning contiguous attribute-
value ranges to each disk.
It selects a partitioning attribute, A, and a partitioning vector
[v0, v1, . . . , vn-2], such that, if i < j, then vi < vj.
The relation is partitioned as follows: consider a tuple 't' such
that t[A] = x. If x < v0, then 't' goes on disk D0. If x >= vn-2, then
't' goes on disk Dn-1. If vi <= x < vi+1, then 't' goes on disk Di+1.
Example of this can be with three disks numbered 0, 1, and 2
that may assign tuples with values less than 5 to disk 0, values
between 5 and 40 to disk 1, and values greater than 40 to disk 2.
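As a sketch, the three partitioning strategies can be written as small functions that return the disk number for a tuple (a sequential simulation, not actual parallel I/O):

```python
# The three partitioning strategies side by side; disks numbered 0..n-1.
import bisect

def round_robin(i, n):
    # the ith tuple goes to disk i mod n
    return i % n

def hash_partition(t, n):
    # the tuple goes to disk h(t) mod n for a chosen hash function h
    return hash(t) % n

def range_partition(x, vector):
    # vector [v0, ..., vn-2]: x < v0 -> disk 0,
    # vi <= x < vi+1 -> disk i+1, x >= vn-2 -> disk n-1
    return bisect.bisect_right(vector, x)

# the example from the text: vector [5, 40] over three disks
assert range_partition(3, [5, 40]) == 0    # less than 5 -> disk 0
assert range_partition(20, [5, 40]) == 1   # between 5 and 40 -> disk 1
assert range_partition(50, [5, 40]) == 2   # greater than 40 -> disk 2
```

bisect_right implements exactly the range rule: it counts how many vector entries are less than or equal to x, which is the disk number.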
Comparison of Partitioning Techniques
A relation can be retrieved in parallel by using all the disks
once a relation has been partitioned among several disks.
Similarly, when a relation is being partitioned, it can be
written to multiple disks in parallel.
The transfer rates for reading or writing an entire relation are
much faster with I/O parallelism than without it.
However, reading an entire relation, that is, scanning a
relation, is only one kind of access to data.
Access to data can be classified as follows:
The entire relation is scanned.
A tuple is located associatively (example, employee name =
“Pooja”); these queries, also known as point queries, seek tuples
that have a specified value for a specific attribute.
Locating all tuples for which the value of a given attribute lies
within a specified range (example, 10000 < salary < 20000); these
queries are called range queries.
RDBMS vs ORDBMS
An RDBMS does not support the different extensions of the
database, whereas an ORDBMS supports the different extensions of
the database.
With an RDBMS it is easier to optimize queries for efficient
execution; with an ORDBMS it is difficult to optimize queries for
efficient execution.
An RDBMS is easier to use as there are fewer features to master;
an ORDBMS is more difficult to use as there are many features that
need to be mastered.
OODBMS vs ORDBMS
An OODBMS tries to add DBMS functionality to a programming
language, whereas an ORDBMS tries to add richer data types to a
relational DBMS.
The aim of an OODBMS is to achieve seamless integration with a
programming language like C++, Java or SmallTalk; such integration
is not an important aim for an ORDBMS.
An OODBMS aims at applications where an object-centric viewpoint
is appropriate, meaning that typical user sessions consist of
retrieving a few objects and working on them for a long period,
with related objects fetched occasionally. An ORDBMS is optimized
for applications where large data collections are the focus, even
though the objects may have a rich structure and be fairly large;
such applications retrieve data from disk extensively, and
optimizing data access is the main concern for efficient execution.
The query facilities of SQL are not supported efficiently in an
OODBMS, whereas the query facilities are the centerpiece of an
ORDBMS.
1. Roll-up
It performs aggregation on a data cube in either of the following
ways:
a. By climbing up a concept hierarchy for a dimension.
b. By dimension reduction.
5. Pivot
This operation is also known as rotation.
In order to provide an alternative presentation of the data it
rotates the data axes.
The diagram below illustrates the pivot operation.
Introduction
It is a logical arrangement of the tables in a multidimensional
database in such a manner that the entity relationship diagram
resembles a snowflake shape.
This schema is represented by centralized fact tables which
are connected to the multiple dimensions.
“Snowflaking” is one of the methods of normalizing the
dimension tables in a star schema.
When complete normalization takes place along all the
dimension tables, the resultant structure will resemble a
snowflake with the fact table in between.
The principle behind this schema is the normalization of the
dimension tables by way of removing the low cardinality attributes
and forming the separate tables.
This schema is similar to the star schema.
A complex shape emerges when the dimensions of this schema
elaborate, have multiple levels of relationships, and the child
tables have multiple parent tables.
Diagram
Use
They are mostly found in data warehouses and data marts, where
the speed of data retrieval is more important than the efficiency
of data manipulation.
Benefits
The snowflake schema is in the same family as the star
schema logical model.
The star schema is thought of as a special case of the snowflake
schema.
The advantages of snowflake schema over star schema are:
Snowflake schemas are used by some OLAP multidimensional
database modeling tools.
Normalizing the attributes results in storage savings.
Disadvantages
One of the primary disadvantages of this schema is that the
additional levels of attribute normalization add complexity to the
source query joins.
The goal assumed by this schema is efficient and compact storage
of normalized data, but at the cost of poor performance when
browsing the joins required in the dimensions.
Compared with a highly normalized transactional schema, this
schema's denormalization removes the data integrity assurances
provided by the normalized schema.
Data being loaded into this schema must be highly controlled and
managed so as to avoid update and insert anomalies.
b) OLAP
Introduction
OLAP stands for Online Analytical Processing, a term coined by
E.F. Codd in 1993.
It is used to refer to a type of application that allows a user to
interactively analyze data.
Before the creation of OLAP the systems were referred to as
the Decision Support Systems.
The term describes a class of applications that require
multidimensional analysis of business data.
The OLAP systems enable the managers and the analysts to
rapidly and easily examine the key performance data and perform a
powerful comparison and trend analysis on large volumes of data.
These systems can be used in a wide variety of business
areas like sales and marketing, financial reporting, quality tracking
etc.
They are used for any management system that requires a
flexible top down view of an organization.
It is also a method used for analyzing data in a
multidimensional format often across the multiple time periods
with the aim of uncovering the business information concealed
within the data.
It helps in enabling the business users to gain an insight in the
business through interactive analysis of different views of the
business data built up from the operation systems.
OLAP is not a data warehousing methodology but an integral
part of it.
It provides a facility to analyze the data that is held within the
data warehouse in a flexible manner.
It can also be defined as a process of converting raw data into
business information through multi-dimensional analysis.
The OLAP application contains the logic which include:
1. Multi-dimensional data selection
2. Sub-setting of data
3. Retrieval of data via the metadata layer
4. Calculation formulas.
The OLAP application is accessed via a front-end tool that
uses the tables and charts to drill down or navigate through the
dimensional data or aggregated measures.
Uses of OLAP
A lot of organizational functions use OLAP applications.
Finance departments use OLAP for budgeting, activity-based
costing, financial performance analysis and financial modeling.
Sales departments use it for sales analysis and forecasting.
Marketing departments use it for market research analysis,
sales forecasting, promotions analysis, customer analysis and
market/customer segmentation.
Manufacturing departments use it for production planning and
defect analysis.
It provides “just-in-time” information to managers for
effective decision making.
They enable the managers, analysts, executives to gain an
insight into the data by fast, consistent, interactive access to a
wide variety of possible views of information.
OLAP transforms the data warehouse into strategic
information.
Benefits of OLAP
It increases the productivity of business managers, developers
and the whole organization.
The systems are flexible, which makes their users more
self-sufficient.
Enables the managers to model the problems more easily.
The developers can deliver the application to business users
faster by providing better services. The faster delivery also
reduces the applications backlog.
IT gains more self-sufficient users without relinquishing
control over the integrity of the data.
It provides IT with more efficient operations: OLAP systems
reduce the query drag and network traffic on transaction systems.
It enables the organization as a whole to respond more
quickly to market demands. This in turn will improve the revenue
and profits.
Key Features of OLAP
The features are as follows:
Star Schema
It is the simplest data warehouse schema.
It is called a star schema because the entity-relationship
diagram of this schema resembles a star, with points radiating
from a central table.
The center of the star consists of a large fact table and the
points of the star are the dimension tables.
A star schema is characterized by one or more very large fact
tables that contain the primary information in the data warehouse,
and a number of much smaller dimension tables (or lookup tables),
each of which contains information about the entries for a
particular attribute in the fact table.
A star query is a join between a fact table and a number of
dimension tables.
Each dimension table is joined to the fact table using a
primary key to foreign key join, but the dimension tables are not
joined to each other.
The cost-based optimizer recognizes star queries and
generates efficient execution plans for them.
A typical fact table contains keys and measures.
A star join is a primary key to foreign key join of the dimension
tables to a fact table.
Advantages of star schemas are:
They provide a direct and intuitive mapping between the
business entities being analyzed by end users and the schema
design.
They provide highly optimized performance for typical star
queries.
They are widely supported by a large number of business
intelligence tools, which may anticipate or even require that the
data-warehouse schema contain dimension tables.
Star schemas are used for both simple data marts and very
large data warehouses.
Diagram
Snowflake Schema
The snowflake schema is a more complex data warehouse
model than a star schema, and is a type of star schema.
It is called a snowflake schema because the diagram of the
schema resembles a snowflake.
Snowflake schemas normalize dimensions to eliminate
redundancy.
That is, the dimension data has been grouped into multiple
tables instead of one large table.
Diagram
1. Decision Tree
These are predictive models used to graphically organize
information about possible options, consequences and end values.
Each branch of the tree is a classification question, and the
leaves of the tree are the partitions of the dataset with their
classifications.
The outcome of a test determines the choice of a branch.
A particular data item is classified by starting at the root node
and following the assertions down until a terminal node (or leaf)
is reached.
A decision is taken when a terminal node is reached.
Decision trees can also be interpreted as a special form of rule
set, characterized by a hierarchical organization of rules.
Diagram
Input
A set of training tuples and their associated class labels: the
data partition D.
An attribute list of the candidate attributes.
An attribute selection method: a procedure used to determine the
splitting criterion that best partitions the data tuples into
individual classes.
Output
A decision tree is the output for the above input.
Method/Steps for creating a decision tree
1. Create a node N.
2. If the tuples in D are all of the same class C, then return N as
a leaf node labeled with the class C.
3. If the attribute list is empty, then return N as a leaf node
labeled with the majority class in D.
4. Apply the attribute selection method(D, attribute list) to find
the best splitting criterion.
5. Label node N with the splitting criterion.
6. If the splitting attribute is discrete-valued and multiway splits
are allowed, then remove the splitting attribute from the attribute
list.
7. For each outcome j of the splitting criterion, partition the
tuples and grow a subtree for each partition:
Let Dj be the set of data tuples in D satisfying outcome j.
If Dj is empty, attach a leaf labeled with the majority class in D
to node N.
Otherwise, attach the node returned by generate decision tree(Dj,
attribute list) to node N.
End for.
8. Return N.
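The steps above can be sketched in Python. The attribute selection method used here (choosing the attribute whose splits are purest) is only one illustrative choice; real systems use information gain or the Gini index:

```python
# A sketch of the decision-tree induction steps above.
from collections import Counter

def majority_class(D):
    return Counter(cls for _, cls in D).most_common(1)[0][0]

def select_attribute(D, attrs):
    # illustrative attribute-selection method: pick the attribute whose
    # splits are purest (fewest non-majority tuples per group)
    def impurity(attr):
        groups = {}
        for row, cls in D:
            groups.setdefault(row[attr], []).append(cls)
        return sum(len(g) - g.count(Counter(g).most_common(1)[0][0])
                   for g in groups.values())
    return min(attrs, key=impurity)

def generate_decision_tree(D, attrs):
    classes = {cls for _, cls in D}
    if len(classes) == 1:            # all tuples in the same class C
        return classes.pop()
    if not attrs:                    # attribute list is empty
        return majority_class(D)
    a = select_attribute(D, attrs)   # best splitting criterion
    remaining = [x for x in attrs if x != a]
    node = {"attr": a, "branches": {}}
    # outcomes are taken from values present in D, so Dj is never empty here
    for value in {row[a] for row, _ in D}:
        Dj = [(row, cls) for row, cls in D if row[a] == value]
        node["branches"][value] = generate_decision_tree(Dj, remaining)
    return node

def classify(tree, row):
    while isinstance(tree, dict):    # descend until a leaf (class label)
        tree = tree["branches"][row[tree["attr"]]]
    return tree

# toy training data: (attributes, class)
D = [({"outlook": "sunny", "windy": False}, "yes"),
     ({"outlook": "sunny", "windy": True}, "no"),
     ({"outlook": "rain", "windy": False}, "yes"),
     ({"outlook": "rain", "windy": True}, "no")]
tree = generate_decision_tree(D, ["outlook", "windy"])
print(classify(tree, {"outlook": "sunny", "windy": True}))  # no
```

On this toy data the purity criterion selects the windy attribute, giving a one-level tree that classifies all four training tuples correctly.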
2. Bayesian classification
It is based on the Bayes theorem.
They are statistical classifiers.
These classifiers help in predicting the class membership
probability, that is, the probability that a particular record
belongs to a particular class.
Bayesian classifiers are accurate and give good performance with
larger databases.
The Naive Bayesian classifier assumes class-conditional
independence, which means the effect of an attribute value on a
given class is independent of the values of the other attributes.
Bayes Theorem
The Bayes theorem is named after Thomas Bayes, who formulated it
in the 18th century.
It provides two types of probabilities
1. Posterior probability, P(H|X)
2. Prior probability, P(H)
Here X is a data tuple and H is some hypothesis.
According to the Bayes theorem:
P(H|X) = P(X|H) P(H) / P(X)
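A numeric sketch of the theorem with made-up probabilities, computing P(X) by the law of total probability:

```python
# A numeric sketch of the Bayes theorem with illustrative numbers:
# H = "record belongs to class C", X = the observed attribute values.
p_h = 0.3            # prior P(H)
p_x_given_h = 0.8    # likelihood P(X|H)
p_x_given_not_h = 0.2

# total probability: P(X) = P(X|H)P(H) + P(X|not H)P(not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Bayes theorem: P(H|X) = P(X|H) P(H) / P(X)
posterior = p_x_given_h * p_h / p_x
print(round(posterior, 3))  # 0.632
```

Observing X raises the probability of H from the prior 0.3 to a posterior of about 0.63, because X is four times as likely under H as under not-H.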
Bayesian Network
These networks specify joint probability distributions. They are
also known as Belief Networks, Bayesian Networks or Probabilistic
Networks.
They allow class-conditional independences to be defined between
subsets of the variables.
They provide a graphical model of causal relationships, on which
learning can be performed.
A trained Bayesian network can be used for classification.
A Bayesian belief network is defined by two components: a
directed acyclic graph and a set of conditional probability tables.
Measures of text retrieval quality
1. Precision
It is the percentage of retrieved documents that are in fact
relevant to the query. It is defined as:
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
2. Recall
It is the percentage of documents that are relevant to the query and
were in fact retrieved.
It is defined as:
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
3. F-score
It is commonly used as a trade-off measure, since an information
retrieval system often needs to trade off recall for precision or
vice versa.
It is defined as the harmonic mean of recall and precision:
F-score = 2 × precision × recall / (precision + recall)
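The three measures can be computed directly from sets of document identifiers (made-up data):

```python
# Precision, recall and F-score from sets of document ids.
relevant = {1, 2, 3, 4}    # documents relevant to the query
retrieved = {3, 4, 5}      # documents the system returned

both = relevant & retrieved                     # relevant AND retrieved
precision = len(both) / len(retrieved)          # 2/3
recall = len(both) / len(relevant)              # 2/4
f_score = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f_score, 2))
```

Here the harmonic mean (4/7, about 0.57) sits below the arithmetic mean of precision and recall, penalizing the imbalance between the two.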
b) Data-visualization
Data visualization is viewed as the modern equivalent of visual
communication.
It involves the creation of the visual representation of data.
Its primary goal is to communicate information clearly and
efficiently to users by means of statistical graphics, plots,
information graphics and charts.
It helps decision makers see the analytics, grasp difficult
concepts and identify new patterns.
With the help of interactive visualization, the data can be
drilled down into charts and graphs for more detail.
Importance
Data visualization is important because the human brain processes
information faster when charts or graphs are used to visualize
large amounts of data.
It is a quick and easy way to convey the concepts in a
universal manner.
Different scenarios can be experimented with by just making
slight adjustments.
It can also help in identifying the areas which need attention
or improvement.
The factors that influence customer behavior are clarified.
It helps to build an understanding of what products need to be
placed where.
The sales volume can be predicted.
Use of data visualization
Comprehend the information quickly
The graphical representation of business information helps viewers
see large amounts of data in a clear, cohesive way and draw
conclusions from it. Problems can be sorted out in a timely manner
due to faster analysis.
Identifying the relationships and patterns
Large amounts of complicated data start making sense when
presented graphically. Parameters that are highly correlated are
recognized by the business. Identifying such relationships helps
the organization focus on the areas that influence its goals.
Pinpointing the emerging trends
Using data visualization to discover trends can give the business
a competitive edge and affect the bottom line. It becomes easy to
spot the outliers that affect product quality or customer churn,
and to address issues before they turn into bigger problems.
Communicating the story to others
Once the business has uncovered new insights from the visual
analytics, the next step is to communicate them to others. By using
charts, graphs and other visually impactful representations, it
becomes easier to get the message across quickly.
Characteristics of an effective graphical display
The graphical display should possess the following characteristics:
Show the data.
It should induce the viewer to think about the substance rather
than about the methodology, the graphic design, the technology of
graphic production, or something else.
It should avoid distorting what the data has to say.
Many numbers should be present in a small space.
Large data sets should be made coherent.
The eye should be encouraged to compare different pieces of
data.
The data should be revealed at several levels of detail, from a
broad overview to the fine structure.
A reasonably clear purpose should be served: description,
exploration, tabulation or decoration.
The statistical and verbal descriptions of a data set should be
closely integrated.
Diagrams used for data visualization
Bar chart
Histogram
Scatter plot
Scatter plot (3D)
Network
Streamgraph
Treemap
Gantt chart
Heat Map
XML Schema vs DTD
An XML Schema is written in the XML language itself, whereas a
DTD is not written in XML but uses its own syntax.
An XML Schema has derived and built-in data types, whereas a DTD
does not have derived and built-in data types.