
Internet Technologies

Darrel Ince
Open University


This course is delivered by Professor Darrel Ince of the Open University.


Professor Ince is the author of over 100 papers and 30 books, and has written
extensively on computing matters for the national press. His current research
interests are in the area of security requirements. He is the author of one of the
recommended books for this course, Distributed Applications and Ecommerce,
published by Addison Wesley.

Lecture 1
Introduction (i)


Aims

To examine base Internet facilities


To describe concepts associated with
distributed application development
To explore some commercial models for
Internet application development


What is a distributed system?


Contains a number of computers connected
by some communication technologies
Contains clients
Contains servers
Performance mainly limited by transmission rates
(Even true with broadband)
Wikipedia article on distributed computing.


This course is about distributed systems, in particular those connected by Internet
technology, although, in truth, the vast majority of the content of the course does
not rely on any particular technology.
Distributed systems have a number of advantages which I shall look at in detail
later in the course; for example, they can be designed to be very robust. They also
have disadvantages; for example, they can have poor performance if not designed
properly.
A distributed system contains a number of computers known as servers, which
provide a service (for example a database service), and a number of computers
known as clients, which call on the services provided by servers.
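The client/server relationship described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "service" (returning the request in upper case), the function names run_server, call_service and demo, and the use of the loopback address are all invented for the example.

```python
import socket
import threading


def run_server(listener: socket.socket) -> None:
    """Server side: accept one client and reply with its request in upper case."""
    conn, _addr = listener.accept()
    with conn:
        request = conn.recv(1024)
        conn.sendall(request.upper())


def call_service(host: str, port: int, message: bytes) -> bytes:
    """Client side: connect, send a request and return the server's reply."""
    with socket.create_connection((host, port), timeout=5) as client:
        client.sendall(message)
        return client.recv(1024)


def demo() -> bytes:
    # Port 0 asks the operating system for any free port on the loopback interface.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]
        server = threading.Thread(target=run_server, args=(listener,))
        server.start()
        reply = call_service("127.0.0.1", port, b"hello")
        server.join()
        return reply
```

Here both roles run on one machine for convenience; in a real distributed system the server and the client would normally sit on different computers.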

The context

Application layer
Network functions (only infrequently will I look at network functions)


You can view a distributed system as a two layer onion (in truth they often have
many more layers). The innermost layer contains system software which
implements functions which are concerned with the management and continuing
functioning of the system, for example software which carries out the transfer of
data from one computer to another.
Within this course I shall mainly ignore the internal layer; the only exception
being when I discuss the design of distributed applications where a knowledge of
this layer is necessary to carry out competent design.
The course concentrates on the outside layer: the design, programming and
technologies required to implement a distributed system; for example the
technologies used to program Web servers.

Network functions

Resource management
Naming
Security (will be examined in more detail later)
Transmission
Failure processing


The course concentrates on application functionality. However, it is important to
realise that such functionality draws on network functions such as those
shown above:
Resource management: the management of the hardware and software resources
of the network, for example allocating memory to programs.
Naming: identifying and keeping track of the hardware, people and software
resources in a network.
Security: ensuring that no illegal acts, such as stealing data, are carried out.
Transmission: the process of sending data across a network.
Failure processing: the process of ensuring that a network carries on working
even in the presence of hardware and software failure.

Course content (i)


Introduction-The Internet and its
technologies
Distributed architectural paradigms
Examples of distributed paradigms-RMI
and CORBA
Client/server systems


The first part of the course acts as an introduction to the Internet and its
technologies. In particular it looks at how these technologies are used to support
applications.
The second part of the course looks at the various architectures that are employed
in distributed systems. These range from architectures which lie close to the
network (message passing) to those which abstract away from network details
(Tuple architectures).
The course then concentrates on one particular distributed paradigm, distributed
objects, and uses the Java-based technology RMI and the language-neutral CORBA
as exemplars.
The course also contains material on client/server technology and, for example,
discusses the range of servers that are available and provides a motivation for
using client/server technology.

Course content (ii)

Web servers and associated technologies


Web services
Web 2.0
(Fast emerging topics)
XML
Concurrency
Distributed application design
Security

After looking at clients and servers the course homes in on Web servers. These
are probably the most important type of server in the Internet. I shall be looking
at how they function, the HTTP protocol which lies under the bonnet of a Web
server and how such servers are programmed. I shall also deal with the emerging
topic of Web services: the functionality provided by Web servers in terms of its
component-based implementation. Within the Web services part of the course I
will look at the development of the Web into Web 2.0.
The course also looks at XML. This is a technology which is used to define
markup languages and attempts to overcome a major problem in the Internet: that
of differing formats and standards for data.
The next two topics are related. I shall look at how systems can be regarded as
consisting of a series of concurrent agents interacting with each other. In a
distributed environment this interaction occurs using communication technology.
This discussion is then used to motivate a lecture on distributed application
design, focusing on performance.
A component of the course is security; here I shall look at the wide variety of
threats which a network can come under and the technologies that can be used to
minimise these threats.

Internet history

ARPANET
NSFNET -> TCP/IP
Merging of ARPANET and NSFNET
ANS take over the joint network
The commercialisation of the net and the rise of
the ISP
The birth of the World Wide Web
Internet link: history of the Internet

In the 1960s the American government realised that their command and control
systems were vulnerable to attack by nuclear weapons. From this came a project
which looked at connecting a number of defence computers together using packet
switching technology.
The network that was formed from this was known as ARPANET. It used
primitive protocols and, as a consequence of problems with these protocols, a
more sophisticated protocol known as TCP/IP was developed. By 1983 there
were hundreds of computers connected to the ARPANET. In the eighties a
university network known as NSFNET was developed which was then merged
with the ARPANET to form the early Internet.
Commercial involvement in the Internet came when a company known as ANS
took over the combined network and sold access to companies known as Internet
service providers (ISPs).
Parallel to this development was that of the World Wide Web. Initially this was
intended as an internal document-dispensing system at CERN in Switzerland;
however, it was designed in such a way that it was easily ported to the Internet.
The rest is history!!

A major principle
Network details should be hidden as much as
possible from the application developer
For example, the database programmer should not
have to worry about where system databases are held


A major principle that runs through this course is that, when developing a
distributed system, the developers should not have to worry too much about the
physical details of the network that it is based on. For example, the programming
of a database should be the same irrespective of where it resides on a network.
The ideal technology is one which hides all the details!
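To make the principle concrete, here is a minimal sketch in Python. It uses the standard library's XML-RPC modules purely as a convenient stand-in for a technology that hides the network; that choice is mine, not something the course prescribes. The customer table and the names lookup_customer, report and demo are hypothetical. The point to notice is that report is written once and works unchanged whether the lookup runs locally or on a server.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Hypothetical in-memory "database"; the data is illustrative only.
_CUSTOMERS = {"c1": "Ada", "c2": "Grace"}


def lookup_customer(customer_id: str) -> str:
    """The database operation itself, written with no knowledge of any network."""
    return _CUSTOMERS.get(customer_id, "unknown")


def report(lookup, customer_id: str) -> str:
    """Application code: identical whether `lookup` is local or remote."""
    return f"Customer {customer_id}: {lookup(customer_id)}"


def demo():
    # Expose the same function over XML-RPC on a free loopback port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(lookup_customer)
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        with ServerProxy(f"http://127.0.0.1:{port}") as remote:
            local_result = report(lookup_customer, "c1")
            remote_result = report(remote.lookup_customer, "c1")
    finally:
        server.shutdown()
        server.server_close()
    return local_result, remote_result
```

The two results are the same string; only the wiring in demo knows where the data lives. What such a technology cannot hide is the extra delay and the extra ways of failing that a network introduces.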


Internet facilities

Bulletin boards
FTP
Email
The World Wide Web
Newsgroups
Mailing lists
Slide show: Introduction to the Internet

In this course I will be concerned with the facilities of the Internet and how they
are used to develop distributed applications. Some of these facilities are shown
above.
FTP is used to dispense files, for example software which a customer has
bought.
Email is used as a marketing technology, for example to inform customers of
new offers.
The Web is used as a storefront or as a market for services and products.
Newsgroups and Bulletin boards are used to keep customers in touch with each
other.
Mailing lists are used to dispense information.


Supply chain: an example of an application for the Internet


The slide above shows a typical supply chain. The particular chain is used to
produce mouthwash. At every stage in the chain, network technology can be
used, ranging from the use of a portable computer by a farmer which informs a
wholesaler that maize is available for collection, to the use of a Web client by a
chemist which orders mouthwash from a chemical products wholesaler.
This chain is totally enabled by network technology.


Value chain analysis

Used to make judgements on profits


Developed by Michael Porter
Consists of primary and support activities.
Abstract model defines most industries, see
later slide
IT will have a central position within a
value chain

Model

Support activities: firm infrastructure, human resource management,
technology development, procurement
Primary activities: inbound logistics, operations, outbound logistics,
marketing and sales, service

Categories
Inbound logistics, all activities involved with receiving,
storing and disseminating inputs to the product or service
Operations, all activities involved in transforming the
inputs
Outbound logistics, all activities involving the distribution
of products and services
Marketing and sales, activities which provide
opportunities for the customer to buy a service or a product
Servicing, all activities associated with servicing a product
or service.

Value chain analysis

Defining the strategic business unit


Identifying critical activities
Defining products
Determining the value of an activity
For each element in the value chain it may be possible
to use IT to increase the efficiency of resource usage
and increase organisational efficiency

Business models
A business model abstracts away from details

A business model is a high level description
of an application type which contains all
the common features which can be found
in specific examples of the model

Wikipedia entry: Business model



A business model is a high level description of an application type which
contains all the common features which can be found in specific examples of the
model. For example, one of the most popular business models is the e-shop, which
describes a Web site that sells products. The model is general in that it does not
describe the item that is sold or the mechanisms that are used to carry out the
sales process. The remainder of the lecture describes a series of e-commerce and
e-business models.


E-commerce

The use of Internet technology to support business


Massive hype which has declined somewhat; it
has led to commercial maturity
Needs to be distinguished from E-business
Archetypal application is etailing, for example
Amazon
Some books on e-commerce

Virtually the domain of the conventional retailer

The course is about the use of Internet technology to support business. The most
visible manifestation of this has been e-commerce: the use of Internet technology
(primarily Web technology) to support conventional retailing, often known as
etailing.
In the late nineties it saw a huge de-emphasis. However, the conditions that
led to the decline of the dot.com boom have not affected the area known as
e-business. That area is still booming and, indeed, e-commerce has made a strong
recovery.
This course concentrates on both e-commerce and e-business applications such as
the supply chain example presented on the previous page.


Typical etailing functions

Stock management
Customer payment management
Ware display
Supplier payment management
Delivery
Market analysis

One of the major features of many Internet applications is that if you write down a
system's functions (at least in outline) then you would have major difficulties in
discovering whether the system is, in fact, a networked one. There are a number
of ramifications of this:
For the most part, conventional software engineering methods can be used.
The only additions would be design for reliability and testing at the client/server
level.
You can reuse functions from conventional requirements documents.
The slide above details some system functions; notice that the word Internet
does not appear!


An observation

Much of the functionality in a modern Internet
application is common to that of a conventional
application.

Same QA, same development methods, same tools etc.


It is a myth that networked applications require a radical approach to
development. Much of the functionality in such applications is
common to the single-computer applications that were common during the late
seventies and eighties.


Question
If I had presented the previous slide without
telling you this was an Internet course would
you have guessed the title of the course?


The answer is no. The important point is that in many ways a networked
application does not differ from a conventional DP or MIS application. The only
differences arise from the fact that you have computers connected by
communication technologies. This gives rise to issues about system performance,
system reliability and system level testing at the client/server level. The test of a
good technology is: does it hide the underlying network?


E-Business models

Wide variety of models


Not just etailing via Amazon
In many ways the least well-known models
have proved to be the most effective.
An introduction to e-business models


The next slides look at the variety of models which have been used to describe
e-applications. One of the distinguishing features of the e-business/e-commerce
area is that most people assume that the only model that is available is the
conventional etailing model as exemplified by Amazon. While such a model is
probably the most advanced and functionally static, it is not the only one. The
aim of the next stage of this lecture is to look at a number of other models.
The best work describing electronic commerce business models is Electronic
Commerce, Paul Timmers, John Wiley. This is an excellent description of the
state of the electronic marketplace in 2000.


E-shops
Conventional
model

Web marketing of a company or shop


Electronic ordering and paying for goods
Company benefits are: increased demand,
low cost route to global presence, cost
reductions
Customer benefits include lower retail
prices and larger choice

The e-shop model is the conventional e-commerce model that everyone
associates with the dot.com boom. It is exemplified by sites such as Amazon,
TravelCity and Air Miles. Such a model takes advantage of the global presence
of the Internet to initiate a low-cost entry into a large geographic area.
For the retailer such a model offers a larger customer base, no tie-up in physical
resources, a greater integration of system functions and cost reductions in sales
and marketing.
For the customer the benefits are lower prices, greater choice, better information
and convenience.


E-procurement
Electronic tendering and procurement of goods
Benefits include access to a greater number of
tenderers, reduced cost of procurement
Benefits to suppliers include more tendering
opportunities and cost reductions in submitting a
tender
Some links on e-procurement


This model involves a company transferring its procurement process to the Web.
A typical Web site that implements this business model would provide: a
complete list of products and services to be tendered, downloads of tender forms,
dates for tenders, interaction facilities which enable suppliers to submit and track
their tender and some form of private interaction which enables queries about a
particular tender to be resolved.
The main advantage to both the company that is seeking tenders and the tenderers
is reduced cost within the tendering process. It also has the advantage that it
brings in more potential suppliers, thus driving down the cost of individual
contracts.
Electronic tendering has been a major growth area within the e-business area over
the last five years.


E-malls
Collection of companies selling services and
products
Can be e-commerce or e-business in concept
They often have the same interface
Many e-malls are thematic
Growth in industry marketplace e-malls
Some failures


An e-mall is a collection of retailers usually hosted on the same Web site and
administered by a third-party company. E-malls can be based around a particular
market segment such as fishing or could be based around a particular product
such as a highly popular word processing package.
E-malls can be retail in concept or can be based around a closed business
community.
The benefits to the e-mall operator arise from renting or membership revenues,
sale of Web-based services or sale of systems technology.
The benefits to the individual businesses which make up an e-mall include ease
of use through a common HCI, association with larger branded names and easier
access due to the conglomeration effect.
The benefits to the customer include being able to access common sectors
without carrying out a large amount of surfing and reduced prices because of a
close competition effect.


E-auctions

An area where profits have been very healthy

Uses the conventional auction model
Can be real-time or can be cut-off based
Can have an e-commerce base, for example eBay,
or can have an e-business base such as raw
material auction sites
Income for the provider comes from commission
and technology sales
Seller and buyer reduce costs because there is no
need to physically be at an auction

An e-auction is the electronic analogue of a conventional auction. Such auctions
can be held in real-time or may be based on accepting the highest bid over an
extended time period.
E-auction sites are quite feature rich and include the display of goods and the
financial transaction functionalities normally associated with retailing.
Auction sites make money by commissions, hosting advertising or providing
technology.
Sellers and buyers benefit because they do not need to physically attend an
auction and are able to view and bid for a large item collection. The lowering of
costs for the seller often means that smaller quantities of stock can be offered.
Although most people's perception of online auctions comes from eBay, there
are a number of very successful business-to-business auctions operating on the
net.
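The cut-off form of auction and the commission income of the site can be sketched as follows. This is a minimal illustration in Python, assuming the conventional rule that the highest bid placed before the cut-off wins. The Bid record, the reserve price, the tie-break rule and the 5% commission rate are all assumptions made for the example, not details of any real auction site.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bid:
    bidder: str
    amount: float
    time: float  # seconds since the auction opened (hypothetical unit)


def close_auction(bids: List[Bid], cutoff: float, reserve: float = 0.0) -> Optional[Bid]:
    """Return the winning bid of a cut-off auction, or None if nothing qualifies.

    Only bids placed at or before the cut-off and at or above the reserve count;
    the highest amount wins and the earliest bid wins a tie.
    """
    valid = [b for b in bids if b.time <= cutoff and b.amount >= reserve]
    if not valid:
        return None
    return min(valid, key=lambda b: (-b.amount, b.time))


def commission(winning: Bid, rate: float = 0.05) -> float:
    """Site revenue as a flat percentage of the sale price (the rate is assumed)."""
    return round(winning.amount * rate, 2)
```

A real-time auction would apply the same comparison, but would re-evaluate it as each bid arrives rather than once at the cut-off.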


E-learning
Follows the bust and boom of e-commerce
Transformed by Web 2.0
Supported by a number of virtual learning
environments.
Major investment at all levels of education
Not just putting your lecture notes on the
web

After some disasters, such as the United Kingdom E-learning University, this area
has started to expand again. The availability of blogs, wikis, conferencing and
allied Web 2.0 technologies has meant a resurgence in this area, supported by
software such as virtual learning environments. A typical environment is
the open-source system Moodle.


Virtual communities
The creation of communities of buyers/users
Often an adjunct to another business model such
as e-shop
Uses discussion forums, FAQs, bulletin boards,
closed user groups
An introduction to virtual communities
Controversial area


A virtual community is a group of customers or users who are encouraged to
communicate with each other in order to add information to some marketing
environment.
Probably the best example of such a community is the users of the Amazon
Web site who post book reviews. While this can depress the sales of some
books, it can boost the sales of many others.
Amazon is not the only example. Cisco, for instance, hosts such a community
which discusses basic features and problems associated with its network
products.
Such communities have a number of financial advantages:
They enable a company to carry out market research.
They can reduce the effort needed to support a complex product.
They can bring to the attention of other buyers or users products or services
which are highly valued by others and which may be difficult to market.
Usually virtual communities are an add-on to other business models.


Third party marketplaces


Used by companies who leave Web
marketing to other companies
Often augmented by conventional functions
such as invoicing
Minimum function is offering a user
interface to a company catalogue


This is a model employed by companies who wish to leave marketing and other
functions to a third party. This model is similar to the electronic mall concept.


Value chain integrators


Offers reduction of time, labour or
documentation. Also offers added value
Specialist companies who offer services
which ensure that the entities in a supply
chain or a service chain are as tight as
possible
Example might be a company that sent
parcels and allowed customers to track them

Value added integration is a business model in which the chains in information
flow are integrated and value is created from this integration.
A good example of this is the company that delivers parcels. In the past the
company may have had primitive tracking facilities which enabled a customer to
phone up to see the state of a delivery; often this was just a binary state: in transit
or delivered.
The availability of Internet technology has enabled such companies to provide
much more tracking information which enables customers to determine where a
parcel is and when it was delivered.
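The difference between the old binary state and the richer tracking information can be shown with a small data structure. This is a sketch only, in Python; the Parcel and TrackingEvent classes and the status strings are hypothetical and are not taken from any real carrier's system.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TrackingEvent:
    when: datetime
    location: str
    status: str


@dataclass
class Parcel:
    parcel_id: str
    events: List[TrackingEvent] = field(default_factory=list)

    def record(self, when: datetime, location: str, status: str) -> None:
        """Add one scan of the parcel to its history."""
        self.events.append(TrackingEvent(when, location, status))

    def binary_state(self) -> str:
        """The old telephone-enquiry answer: in transit or delivered."""
        delivered = any(e.status == "delivered" for e in self.events)
        return "delivered" if delivered else "in transit"

    def latest(self) -> TrackingEvent:
        """The richer answer: where the parcel was last seen, and when.

        Assumes at least one event has been recorded.
        """
        return max(self.events, key=lambda e: e.when)
```

The added value comes from keeping the whole event history and exposing it to the customer, rather than collapsing it to a single yes/no answer.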


Information brokerage

Access to information
Increasingly subscription based
Can be based on a per-transaction charge
Major sub-area is financial information

A site for information brokers and potential brokers


Web sites described by this business model offer access to information, usually
business information. For example, a Web site which offers the results of surveys
of customer satisfaction for a product such as a car would be used by car hire
companies, auto companies and consumer organisations. Major providers in this
area provide information derived from financial data such as company
performance figures, pension fund performance figures and financial market
trends such as the growth of different types of mortgages. Companies whose
Internet presence can be described by this business model usually raise revenues
by subscription or by a per-transaction charge.


Trust brokerage

Associated with computer security


Increasing area of business activity
One example is that of a copyright
lodgement company
Another example is an escrow company

This business model describes those companies or organisations who provide
some service connected with security or trust. For example, as you will see later
in the course, copyright is a major issue for the Internet. A company might
develop a sophisticated graphic which could easily be copied by another
company that would then claim that they developed the graphic. A trust company
might offer the facility for companies to register their work with them and then
be able to testify to the date that the work was registered. Other trust brokers are
associated with computer security and, for example, certify that a particular Web
site run by a company is in fact associated with that company.
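One plausible way for a copyright lodgement service to work is to store a cryptographic fingerprint of each work together with the date it was lodged. The following Python sketch illustrates that idea; the LodgementRegistry class and its methods are hypothetical, and a real trust broker would also need secure storage and a trustworthy clock.

```python
import hashlib
from datetime import datetime, timezone
from typing import Dict, Optional


class LodgementRegistry:
    """Hypothetical trust broker that records when a work was first lodged."""

    def __init__(self) -> None:
        self._records: Dict[str, datetime] = {}

    def register(self, work: bytes, when: Optional[datetime] = None) -> str:
        """Store a SHA-256 fingerprint of the work with a lodgement date."""
        fingerprint = hashlib.sha256(work).hexdigest()
        # Keep the first lodgement if the same work is lodged again.
        self._records.setdefault(fingerprint, when or datetime.now(timezone.utc))
        return fingerprint

    def lodged_on(self, work: bytes) -> Optional[datetime]:
        """Return the lodgement date for this exact work, or None if unknown."""
        return self._records.get(hashlib.sha256(work).hexdigest())
```

Because only the fingerprint is stored, the broker can testify to the date without holding a copy of the work itself; the owner must still produce the identical file for the fingerprint to match.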


Dynamic pricing models


Treat the price of a product or service
(primarily a product) as variable and open
to negotiation
Name your price model
Comparison price model
Demand sensitive pricing model
(Auctions are the most common example)


The dynamic pricing model is one which has a number of different instantiations.
Basically, such models treat the price of a product or service (primarily a
product) as variable and open to negotiation.
The name-your-price instantiation of this model is where the customer of a site
offers the price that he or she thinks is reasonable for a product or service. The
administrator of the Web site will then pass on this bid to the provider of the
product or service, who will decide whether to accept it.
The comparison pricing sub-model encompasses Web sites which provide an
interface to e-shops that sell some specific product. They provide the facility for
the customer to interrogate a database of product catalogues to look for the
cheapest price for a particular product such as a book or a CD.
The demand sensitive pricing sub-model is based on the fact that suppliers of a
product will lower the price of a product if a number of units of that product are
included in a single sale. Web sites which employ this model provide facilities
whereby consumers can notify each other of their interest in buying a particular
product such as a freezer. The site keeps a database of current products that have
attracted a number of buyers, together with a predicted price, and allows users to
join the database of buyers who are committed to a sale.
The bartering sub-model allows consumers to barter services or products for
other services or products. A site devoted to this form of economic activity will
keep a structured database of items for sale and allow a buyer to barter with a
seller.
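Two of the sub-models above, comparison pricing and demand sensitive pricing, reduce to simple look-ups. The Python sketch below shows one way they might be expressed; the catalogue layout, the tier table and the function names cheapest_offer and group_price are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def cheapest_offer(
    catalogues: Dict[str, Dict[str, float]], product: str
) -> Optional[Tuple[str, float]]:
    """Comparison pricing: find the shop with the lowest price for a product."""
    offers = [
        (shop, prices[product])
        for shop, prices in catalogues.items()
        if product in prices
    ]
    return min(offers, key=lambda offer: offer[1]) if offers else None


def group_price(tiers: List[Tuple[int, float]], committed_buyers: int) -> Optional[float]:
    """Demand sensitive pricing: unit price for the current number of buyers.

    `tiers` holds (minimum_buyers, unit_price) pairs; the price of the highest
    tier reached applies. Returns None if no tier has been reached yet.
    """
    reached = [tier for tier in tiers if committed_buyers >= tier[0]]
    return max(reached, key=lambda tier: tier[0])[1] if reached else None
```

For example, with tiers of (1, 300.0), (10, 270.0) and (50, 240.0), twelve committed buyers would each pay 270.0, and the site can display the next tier as the predicted price if more buyers join.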


B2B exchanges
Collection of Web sites
Enable business to business transactions
such as procurement to be carried out
efficiently
Enable businesses to buy products from
each other, form temporary alliances etc.
An introduction to B2B exchanges

A B2B exchange is a Web site or collection of Web sites which makes the process
of carrying out business-to-business transactions much easier. Under this banner
come sites which enable multiple companies to procure services and
products from each other, help businesses form temporary alliances to carry
out activities such as joint marketing or project bidding, and enable a marketplace
in raw materials to function.


Application System Providers (ASPs)


Companies that provide functionality using
the Web.
Best known is SalesForce.com.
One component of regarding the Internet as
a whole computer (see later)
One component of systems integration (see
later)

Application System Providers are companies that provide access to functionality
using the Web. In effect they enable the user to dispense with locally installed
software. When I come onto Web 2.0 I will examine how such companies are
moving the Internet towards a model whereby the net becomes a large virtual
computer.


SalesForce.com
Customer relationship management ASP
Functionality includes marketing analytics
and partner relationship management
Currently over 35000 customers
Also includes a platform API


This company is, without a doubt, the best known ASP. It provides a service to
companies that want to access customer relationship functions without doing any
development. It is currently one of the fastest growing CRM companies in the
United States.


Free products and services


Some sites are really free such as the Open
Software Foundation
However, most sites raise revenue indirectly
for example by giving a product away and
then selling services
Free software a major component
Give the razor
away and sell the blades

It might seem paradoxical to include sites which provide free products or services
under the category of business models. Typical sites which come under this
category include gaming sites where users can play computer games using their
browser, sites which run free raffles and sites which offer free software.
Such sites do not earn any revenues from the products or services they offer;
revenue is earned indirectly, for example by means of banner adverts or by
receiving revenue from sites which you have to visit before experiencing a
service or buying a product.
One of the largest free product areas is that of free software. Organisations in this
area include those who raise revenues and those who do not. An example of a
company in the former category is Red Hat. This is a company that provides free
versions of the Linux operating system. You can download Linux from the Red
Hat Web site and install it on your computer without paying a cent to the
company. Red Hat raise their revenues through support, packaging distributions
onto CDs and providing services to companies who employ Linux for application
development. Companies such as Red Hat are the analogue of those companies
who sell a razor for little or no cost but make their profit from selling the razor
blades.
There are a number of sites in the Internet which do not make any money from
issuing software. These are sites associated with open source development. They
are purely altruistic.


What the Internet brings to commerce

Availability
Ubiquity
Global reach
Digitization
Variety of multimedia
Interactivity
Network effects
Integration


The Internet brings major advantages to commerce:
It is available all the time.
It is everywhere: from the workplace to the home.
It is global; ideas of what is far and near disappear.
It brings digital products closer to the customer.
It enables multimedia to take a direct place in marketing and selling.
It can make interaction between the customer and a company much greater and
more flexible.
It enables rapid growth of businesses (and also rapid decline, as with Boo.com).
It can enable value chain components to be integrated.


The Internet

A network of networks
Still growing quickly
Open system (dangers and advantages)
Relies on a number of standard protocols
Large part of the net is the World Wide Web

The Internet is not a network of computers but can be more accurately described
as a network which consists of sub-networks. It is still growing rapidly,
requiring the development of new versions of old protocols to cope with the
increased number of hosts.
A major feature of the Internet is that it is an open system: all the specifications
of the protocols are publicly available. The positive side to this is that anyone can
write Internet software; the negative side is that it enables malicious acts to be
carried out more easily.
The Internet has a major component known as the World Wide Web which is the
main platform for commerce.


Internet protocols (i)

TCP
IP
UDP
ICMP
(Most developers do not need to know about the details of these)
Internet protocols link


Transmission Control Protocol. This is a protocol which enables data to
be reliably passed through a network. A higher level such as an
application program will collect transmission data together and pass it to
this layer which then carries out the function of sending the data to its
destination. This protocol has the facility to retransmit data, for example
if an error prevented an original collection of data arriving at its
destination. This protocol is usually referred to by the acronym TCP.
User Datagram Protocol. This fulfils the same function as TCP.
However, unlike TCP, it does not acknowledge or retransmit data.
This protocol, usually known by the acronym UDP, enables fast transmission
of data; however, because nothing is resent, data sent using the protocol
can be lost under certain network conditions such as heavy traffic.
Internet Protocol. This is usually referred to by its acronym IP. This has
the basic function of moving data which has been created by either TCP
or UDP across a network of networks.
Internet Control Message Protocol. This is usually referred to by its
acronym ICMP. It is responsible for reporting delivery errors and for
checking the status of computers and other devices attached to a network.

40

Internet Protocols (ii)

Telnet
FTP
SMTP
Kerberos
Domain name system

Again, most programmers
do not need to know about
these

Wikipedia on the Domain Name System

Internet Technologies

41

Telnet. This is a protocol which allows users on one computer to access
and log in to other computers on the Internet provided, of course, that the
user has permission to do so.
File Transfer Protocol. This is commonly known as FTP; it allows a user
to transfer files from one computer to another computer.
Simple Mail Transfer Protocol. This is a protocol which enables
electronic mail to be transferred from one computer to another. It is often
referred to as SMTP.
Kerberos. This is a security protocol which authenticates users and
services, allowing highly confidential data to be transferred safely from
one computer to another.
Domain Name System. This is a service which enables computers to be
referred to by a symbolic name rather than an address.

41

Internet protocols (iii)

SNMP
NFS
TFTP
HTTP

All public!
Internet Technologies

42

Simple Network Management Protocol. This uses the User Datagram
Protocol (UDP) described above. It is used to monitor a network for
problems such as a malfunctioning computer issuing spurious data onto
the network. It is often referred to as SNMP.
Network File System. This is a collection of protocols developed by Sun
Microsystems which enable computers on a network to transparently
access files and file directories on other computers in a network. A user
who employs this protocol is usually unaware of the location of a
particular file; the only item of information that the user normally requires
is the name of the file.
Trivial File Transfer Protocol. This is usually referred to by the acronym
TFTP. This is a very simple protocol used for the fast transmission of
files in a network. It lacks security.
The HyperText Transfer Protocol. This is a protocol used by Web
browsers and Web servers to communicate. Much more on this later.

42

The concept of a naming service


The use of symbolic names to identify
resources.
The Domain Name System (DNS) is the
main naming service in the Internet.
Many more naming services, for local
networks.
Maps symbolic name into physical identity
Internet Technologies

43

A naming service is provided by a server in a network. It provides a way of
mapping symbolic names into some physical address or identification. Naming
services can map names into all sorts of identities: files, directories,
computers and users.
The typical function that a naming service offers is that of returning a physical
detail of some object given its symbolic name.
The main naming service on the Internet is the Domain Name System (DNS). This
maps names of computers (hosts) into IP addresses.
It is hierarchic in concept.
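
The lookup function described above can be sketched in a few lines of Java. This is only a toy, in-memory illustration and not how DNS itself is implemented; the class name ToyNamingService, the host name www.example.org and the address 192.0.2.10 (taken from the range reserved for documentation) are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

class ToyNamingService {
    // Symbolic name -> address bindings held by this (hypothetical) name server.
    private final Map<String, String> addressesByName = new HashMap<>();

    void register(String symbolicName, String address) {
        addressesByName.put(symbolicName, address);
    }

    // Returns the address bound to the name, or empty if the name is unknown.
    Optional<String> lookup(String symbolicName) {
        return Optional.ofNullable(addressesByName.get(symbolicName));
    }

    public static void main(String[] args) {
        ToyNamingService names = new ToyNamingService();
        names.register("www.example.org", "192.0.2.10"); // made-up binding
        System.out.println(names.lookup("www.example.org").orElse("unknown"));
        System.out.println(names.lookup("ftp.example.org").orElse("unknown"));
    }
}
```

Running the sketch prints the registered address for the first name and "unknown" for the second, which has no binding.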

43

Hierarchic naming in the Internet

An example
open.ac.uk

Internet Technologies

44

The Internet has a form of hierarchic naming in which a name becomes more
specific as you move to its left-hand side. Each time you move to the right in a
name, the component references a larger collection of computers, until you get to
somewhere near the top of the naming tree, for example com or uk.

44

A name

www.open.ac.uk

www: name of the computer (a Web server)
open: collection of computers at the Open University
ac: collection of computers at academic institutions
uk: collection of computers in the UK

Note: uk does not mean that the site is in the United Kingdom

Internet Technologies

45

The leftmost name in the slide represents a host name, the next name represents
all those computers that are found at the Open University, the designation ac
applies to academic institutions and the designation uk refers to United Kingdom
registered hosts.
Note that uk does not mean that the host is physically situated in the United
Kingdom, although normally this is the case.
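
The right-to-left reading of a name can be made concrete with a short Java sketch. It simply splits a host name on its dots and lists the parts from the most general to the most specific; the class and method names are illustrative assumptions and no DNS lookup is performed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class DomainNameLabels {
    // Splits a host name into its parts, ordered from the most general
    // (rightmost) to the most specific (leftmost).
    static List<String> fromGeneralToSpecific(String hostName) {
        List<String> labels = Arrays.asList(hostName.split("\\."));
        Collections.reverse(labels);
        return labels;
    }

    public static void main(String[] args) {
        System.out.println(fromGeneralToSpecific("www.open.ac.uk"));
    }
}
```

For www.open.ac.uk this prints [uk, ac, open, www], which mirrors the path from near the top of the naming tree down to the individual host.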

45

Case Study

Microsoft and MSMarket


Internet B2B application
Got rid of expensive physical chains
Saved something of the order of 45%
Used Internet technology
Now handles $3 billion of orders
Internet Technologies

46

Microsoft and MSMarket


Microsoft discovered that 70 per cent of its purchases were for relatively
small items which took up something of the order of 3 per cent of its
purchase volume. The company discovered that a large amount of
employee time was spent on the procurement process and, hence,
invested $1.1m in a system known as MSMarket. When a Microsoft
employee wishes to buy some item such as stationery they log into
MSMarket; the system identifies them from their login identity and
consults its database to discern what rules should be applied to purchases
from that employee. The employee informs the system that they require
some stationery and a screen of items and prices negotiated with a
supplier is displayed. The employee purchases what is required and the
order is sent over the Internet; an e-mail is then sent to their manager to
inform them of this and a tracking number is generated which can be used to
query the supplier if the item has not been delivered by a certain time.
The use of MSMarket has increased exponentially since it was deployed
and it now handles more than $3 billion of orders.

46

Size case study

Sydney Olympic games 2000


Web site
Games management system
Games results system
Commentator information system

Internet Technologies

47

The Sydney Olympic games system


A Web site which was publicly accessible and which contained features
on the games, the competitors and the results.
A games management system which administered the logistics of the
games, for example arranging transportation, accreditation and
accommodation for athletes.
A games results system which captured input from all the events in the
games and distributed them to judges, scoreboards, competitors,
commentators and the Web site detailed in the first bullet point above.
A commentator information system which provided real-time information
to journalists and broadcasters; for example, this system would flash up on
a commentator's PC the times achieved by the runners in a race, only a
few seconds after the race was completed.

47

BBC Web site

Media rich
Employs a huge variety of technologies including XML, RSS, HTML,
JavaScript, Cocoon and a variety of web and database servers
A large amount of dynamic content, some real-time
Sound and video downloads
Forums and conferencing
Email updates
SMS updates
Dynamic page generation technologies

BBC Web site

Internet Technologies

48

The BBC site is one of the most famous sites on the Internet and has won a large
number of awards. It employs a huge variety of technologies, including:
XML is used for the storage of base text
RSS is used for generating news feeds
HTML is used for the development of web pages
JavaScript is used for client processing, for example forms processing
Cocoon is used for document processing
A variety of web and database servers are used for dispensing web pages and
storing data.

48

The main user technologies

Web servers and browsers


FTP
Message boards
Email
List servers
Search engines
Wikis
Blogs
Internet Technologies

49

The main technologies used to access commercial sites are:
Web servers. Used as the heart of any commercial application as an interface to
resources.
FTP. Used to download software or documents.
Message boards. Used to coordinate staff or foster user communities.
Email. Used to keep in touch with staff and customers.
List servers. Used to distribute information, software and documents.
Search engines. Used either globally or locally to search for information.
Wikis. Shared areas in which users can create and modify text.
Blogs. Web diaries.

49

Some problems and issues

Legacy technology
Security and privacy
Programming and abstraction
Speed of development
Structure and data
Problems with transactions
Internet Technologies

Interrelated

Major problem

50

There are a number of problems with the Internet which this course will look at;
in particular it will show how these problems have been overcome by the use of
technology. The problems are:
Legacy technology. The base technology used in the Internet is
inadequate for the uses to which it is being put.
Security. Since the Internet is an open system, its specifications are
readily available and intruders can make use of this to carry out malicious acts.
Programming and abstraction. The programming models used in the past for
network development are inadequate given the speed of development of the net.
Structure and data. The Internet contains large amounts of data ranging from
Web pages to structured databases; often such data is in widely differing formats.
Problems with transactions. A transaction can span a considerable time and
a number of hosts situated in separate countries. This gives rise to
considerable synchronisation problems.

50

Legacy problems (examples)

The Internet protocol is not big enough


Web servers use a stateless protocol
Web pages were designed to be static
I shall be looking at these problems later

Many of these problems have been solved


Internet Technologies

51

The Internet was designed to be a small network. A consequence of its huge
expansion is that the original version of the Internet Protocol, IPv4, with its
32-bit addresses, will not cope with huge numbers of hosts in the future. This
has led to the development of IPv6.
Web servers use a stateless protocol in which a transaction has no memory of past
transactions. For many applications this is a serious shortcoming. For example, a
shopping cart requires memory of previous purchases.
Web pages were intended to be static. A page would be stored and would be
returned in its stored form. Many applications require such pages to be modified
before being sent; for example, a Web page which provides information about the
hour by hour weather conditions in a set of cities.
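
One common way round the stateless-protocol problem noted above is for the server to keep state itself, keyed by a session identifier that the client resends with every request. The Java below is a minimal sketch of that idea only; the class name CartSessions, the session identifiers and the items are hypothetical, and a real Web application would normally rely on the session support of its server API rather than a hand-written map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CartSessions {
    // Server-side memory: one shopping cart per session identifier.
    private final Map<String, List<String>> cartsBySession = new HashMap<>();

    // Each request is independent, so the caller must supply its session
    // identifier every time; the cart is created on first use.
    List<String> addItem(String sessionId, String item) {
        List<String> cart = cartsBySession.computeIfAbsent(sessionId, id -> new ArrayList<>());
        cart.add(item);
        return cart;
    }

    public static void main(String[] args) {
        CartSessions server = new CartSessions();
        server.addItem("session-1", "book");
        System.out.println(server.addItem("session-1", "pen"));
    }
}
```

The second request sees the item added by the first, so the sketch prints [book, pen]; a request carrying a different session identifier would start with an empty cart.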

51

Security problems
The Internet is an open system
Users are allowed access from anywhere to,
say, a Web server
The higher availability of access together
with the open system aspects leaves the
Internet open to abuse.
Security is dealt with later in detail
Internet Technologies

52

Because the specifications of Internet protocols are publicly available,
intruders can read them, find weaknesses and then exploit them. This leads to
a greater security problem. The obverse of this is that since there are many more
users, solutions to these problems are often more forthcoming than if the network
were based on proprietary protocols. This is one of the arguments put forward by
the open source movement.
Also, since anyone can access a host to some depth, for example by browsing a
Web page, an intruder will have already gone some way into a system without
deploying very much effort.

52

Programming and abstraction


Past models have depended on the sending
down the wire idea
Need for models which regard a network as
a computer (the network is the computer)
Number of emerging models including
distributed objects and tuple space
I will be looking at these later
Internet Technologies

53

Networked applications have, in the past, been designed and programmed using
the idea that communicating hosts synchronise and communicate using messages.
While this is quite a good model for small applications where speed is of the
utmost importance, it is severely limited.
Because of this a number of new models have been developed. One of the most
popular ones is the use of distributed objects: objects which lie on physically
separate computers but which can be accessed as if they lived on the same
computer. Two popular distributed object schemes are DCOM and CORBA.
Another idea is that of a web service; again this will be dealt with later.
Another paradigm is that of a tuple space, which regards a network of computers
as just a very large store of data.

53

Structure and Data


The Internet is huge and has masses of data
This data (Web pages, flat files, relational
databases) is in a number of disparate forms with
differing degrees of structure
Any application that attempts integration becomes
costly
The main solution to this problem is a
metalanguage known as XML
There is also the increasing problem of search
Internet Technologies

54

The Internet contains data of widely differing structure and format. A good
example of this is Web pages, which are written in a wide variety of versions of
the Web language HTML. As well as being in different formats, the pages do not
have any semantic information which gives a clue to their content; for example, in
a book site a number might be the price of a book or its discount.
Because of this a metalanguage known as XML has been developed. This
metalanguage allows data to be tagged with semantic markers. This data can then be
processed by programs which are XML-aware in order to extract user
information.
XML is an important technology which has the same flavour as HTML and
SGML (the source of inspiration of HTML).
A major problem that has emerged as the Internet has become bigger is that of
searching a large body of unstructured text.
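
To illustrate the semantic tagging described above, here is a small Java sketch which uses the standard XML parser in the Java library to tell a price from a discount. The document content and the tag names book, price and discount are invented for the example; they are not taken from any real book site.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class BookPriceReader {
    // Returns the text inside the first element with the given tag name.
    // The sketch assumes that the tag is present in the document.
    static String firstValue(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName(tagName).item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical data: the tags say which number is which.
        String xml = "<book><price>20.00</price><discount>5.00</discount></book>";
        System.out.println("price = " + firstValue(xml, "price"));
        System.out.println("discount = " + firstValue(xml, "discount"));
    }
}
```

In untagged HTML the two numbers would be indistinguishable to a program; here the tag names carry the meaning.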

54

Speed of development
Internet applications often require high
speed development
The use of design patterns
High level APIs
Fast development methods such as agile
methods

Internet Technologies

55

Because of the pace at which the Internet moves there is often a requirement on firms
to develop software quickly. This has given rise to a number of advances:
Design patterns, which are micro-architectures that can be used time and time again.
High-level APIs, such as the Java Collections API, which reduce the amount of
detailed coding.
Fast development methods such as rapid prototyping and evolutionary
prototyping.

55

Problems with transactions


A transaction is the execution of some code
which updates a network state
The state may be held across a number of
computers which are physically separated
This gives rise to major coordination and
synchronisation problems.
Solution is transaction servers
Internet Technologies

56

A transaction is some execution which alters the state of an application, for
example a relational database. In Internet applications the state is often shared
around a number of physically separated computers. As a consequence of this
there are often serious problems in synchronisation, for example when two
transactions interact with the same state.
A number of solutions have been devised for this, ranging from transaction servers
which maintain state to high-level APIs which enforce state consistency
conditions.
Later in the lecture series I shall look at how to design systems which are
transaction-based and how to use transaction servers.
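
The problem of two transactions interacting with the same state can be seen in miniature inside a single Java program. The sketch below is not a transaction server and involves no network: it only shows two threads updating one shared stock level, with the synchronized keyword making each check-and-update indivisible. The class name SharedStock and the quantities are assumptions made for the illustration.

```java
class SharedStock {
    private int unitsInStock;

    SharedStock(int unitsInStock) {
        this.unitsInStock = unitsInStock;
    }

    // synchronized makes the check and the update one indivisible step,
    // so two buyers can never both take the last unit.
    synchronized boolean reserve(int units) {
        if (units > unitsInStock) {
            return false;
        }
        unitsInStock -= units;
        return true;
    }

    synchronized int unitsInStock() {
        return unitsInStock;
    }

    // Two buyers each try to reserve 1000 single units from a stock of 1000.
    static int runDemo() {
        SharedStock stock = new SharedStock(1000);
        Runnable buyer = () -> {
            for (int i = 0; i < 1000; i++) {
                stock.reserve(1);
            }
        };
        Thread first = new Thread(buyer);
        Thread second = new Thread(buyer);
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for buyers", e);
        }
        return stock.unitsInStock();
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

However the two threads interleave, the stock finishes at 0 and never goes negative. Achieving the same guarantee when the state is spread over several hosts is the much harder problem that transaction servers address.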

56

Open source software and the Internet

Almost anarchic model of distribution


Some very popular products
Provide a form of de-facto standard
Many products oriented towards net
development
Important for integration
Distribution medium is the Internet
Internet Technologies

57

The lecture concludes with a brief description of open source software and its
relevance to the Internet. Open source software is software that has been
developed by designers and programmers and offered as is, for free. It includes
some very popular products such as the Apache web server. The relevance to the
Internet is that the most popular products provide a de facto
standard; this is important for systems integration purposes, as you will see later
in this module. Another important element related to the Internet is that many of
the products are oriented towards Internet systems development. Finally, another
important aspect is that the distribution medium is the Internet; without this it
would be highly unlikely that open source software would have advanced to the
position it occupies today.

57

Some examples from the Apache
Software Foundation

mod_perl
Lucene
Tomcat
Tapestry
Cocoon
XML-graphics

The Open Source Initiative web site

Internet Technologies

58

Here are some examples of open source software maintained by the Apache
Software Foundation:
mod_perl is a technology that allows Perl programs to be interfaced with a web
server.
Lucene is an open source search engine library.
Tomcat is a web server that runs Java servlets and JavaServer Pages.
Tapestry is an open source framework for developing web applications.
Cocoon is a publishing system.
XML-graphics is a project which has a number of strands centred around the use
of XML for graphic formats such as the SVG format for vector graphics.

58

Lecture 2
Introduction (ii)


Aims
To briefly describe the main components of the
Internet
To look at the concept of an open architecture
To describe the low-level mechanism of message
passing
To describe a low-level programming model of
distributed communication


Open systems

Systems whose architecture is not a secret


UNIX an example
Internet protocols are another example
Carried to extreme is exemplified by the open
source movement
A good example is Apache
Wikipedia on open systems
Internet Technologies

61

An open system is one whose architecture is widely available. A good example of
this is UNIX, an operating system whose details and source code are readily
accessible to anyone with a Web browser.
The Internet is also an open system in that the details of its protocols, for example
TCP, are readily available. The idea of an open system has been taken to its
extreme by the proponents of open source development, where everything is
revealed about a system and where developers can take the source code of a
system, modify it and then release it as a variant of the original. A good example
of open source code is the highly popular web server Apache.

61

The advantages of open systems

Wide variety of implementations


Cost is lowered
High level of compatibility
Wide variety of developers
Disadvantage
is security

Internet Technologies

62

An open system results in a wide variety of implementations; see, for example,
UNIX. Since specifications, design and code are often available, the
cost of an open system is often very low; in the case of open source it is free.
Because all software is based on a common specification there is a high degree of
compatibility, and because of the wide availability of specifications and reference
implementations there are often a large number of developers.

62

A bus network

Internet Technologies

63

A bus network is the simplest form of network, where a single communication
pathway is used and where every component is assigned a unique address. When
a message is sent, every computer that is not an intended destination ignores it,
while the recipient reads the message, interprets it and carries out some
processing.

63

Ring network

Internet Technologies

64

A ring network is often displayed as a ring. However, there is no physical ring in
existence. The term ring is used to describe the design of the central unit which
carries out the process of sending and forwarding messages in the network.

64

Hub network

Internet Technologies

65

A hub network uses a main cable like a bus network; the cable is known as a
backplane. From this backplane connections lead to ports into which devices can
be plugged. Hub networks have proved to be very popular: they are easy to set up
and are cheap.

65

Layered networks
A form of architectural description
Relies on layers
Each layer calls on services provided by the
next layer down.
Innermost layers are closest to the base
facilities of a computer or network

Internet Technologies

66

A standard way of displaying and implementing a system is via a layered model.
Such a model consists of a series of layers which might be implemented by
hardware or software. There will be a number of layers: layer n will call on
services provided by layer n-1. The Internet is based on a layered architecture
which was inspired by the OSI reference model, which itself is a layered
architecture.
In such an architecture each layer implements a well-defined set of functions. It
is important to point out that layered architectures are used for a wide variety of
systems, not just networks.
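
The rule that layer n calls on services provided by layer n-1 can be sketched in Java as follows. The three layer names and the bracketed tags are purely illustrative assumptions; real protocol layers add binary headers rather than text, and the "wire" here simply hands back what it is given.

```java
import java.util.function.UnaryOperator;

class LayeredSend {
    // Builds a layer that tags the data with its own name and then calls
    // on the service of the layer below.
    static UnaryOperator<String> layer(String name, UnaryOperator<String> below) {
        return data -> below.apply("[" + name + " " + data + "]");
    }

    static String sendThroughStack(String data) {
        UnaryOperator<String> wire = frame -> frame; // stands in for the hardware
        UnaryOperator<String> network = layer("network", wire);
        UnaryOperator<String> transport = layer("transport", network);
        UnaryOperator<String> application = layer("application", transport);
        return application.apply(data);
    }

    public static void main(String[] args) {
        System.out.println(sendThroughStack("Hello"));
    }
}
```

The output is [network [transport [application Hello]]]: each layer has wrapped what it received from the layer above before passing it down, and no layer needed to know anything about the layers beyond its immediate neighbour.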

66

Layered models and applications

As you proceed outwards in a layered
model you get to the application; ideally
developers should be working at this
level

Internet Technologies

67

The outermost level in most layered architectures is that which allows the
programmer to call on application functions. This is the level that, ideally,
Internet application developers should be working at.

67

The OSI reference model

Link to OSI

Internet Technologies

68

The OSI reference model is a seven-layer model. It ranges from the topmost
layer, which is the level applications communicate with, to the bottom-most layer,
which lies very close to the hardware used to implement the model.
In the Internet architecture a number of these levels have been coalesced into
single layers. This is shown on the next slide.

68

Internet architecture v OSI reference
architecture

Internet Technologies

69

The diagram shows the relationship between the OSI reference model and the
Internet layered model, with the various protocols and layers mapped across. The
block on the right shows the various services and protocols used in the Internet;
not all are shown.

69

Protocols and services (i)

Telnet
File Transfer Protocol (FTP)
Simple Mail Transfer Protocol (SMTP)
Kerberos
Domain name system (DNS)

Internet Technologies

70

Telnet allows users of one computer to log into another computer.
FTP allows a user to transfer files from one computer to another.
Simple Mail Transfer Protocol is a protocol which allows electronic mail to be
transferred from one computer to another.
Kerberos is a security protocol which authenticates users and services, allowing
confidential data to be transferred safely from one computer to another.
The Domain Name System is the Internet naming service and maps symbolic
names into physical addresses.

70

Protocols and services (ii)


Simple Network Management Protocol
(SNMP)
Network File System (NFS)
Trivial File Transfer Protocol (TFTP)
Transmission Control Protocol (TCP)
User Datagram Protocol (UDP)
Internet Technologies

71

SNMP uses UDP to monitor a network for problems such as a malfunctioning
computer issuing spurious data.
NFS is a collection of protocols developed by Sun Microsystems which implements a
distributed file store.
TFTP is a very simple protocol used for the transfer of files in a network.
TCP is a protocol which enables data to be passed reliably through a network.
UDP is similar to TCP. However, it does not contain facilities for the
retransmission of data. It is fast and is normally used when small errors in
transfer can be tolerated.

71

Protocols and services (iii)

Internet Protocol (IP)


Internet Control Message Protocol (ICMP)
HyperText Transfer Protocol (HTTP)
More later
Internet Technologies

72

IP has the basic function of moving data created by TCP or UDP across a
network.
ICMP has the function of checking the status of computers attached to a network.
HTTP, which builds on many of the other protocols, is the protocol used when a
Web client communicates with a Web server. It is used to characterise a request
from a client and to return status information from a Web server. More on this
protocol later.

72

Distributed system
A system which consists of a number of
computers (hosts) which are connected to
each other by some transmission media
Rationale: reliability, efficiency
Consists of computers acting as clients and
as servers
Difficult to design
Internet Technologies

73

Now that computers have dropped drastically in price it has been found
convenient to connect a number together in a network. Such a collection of
computers is known as a distributed system. There are a number of reasons for
doing this: first you can get increased reliability by designing duplication into the
system, for example via replicated databases; second, you can increase efficiency
by ensuring that processing power and data lie close to the user.
The price to pay for this is complexity: for example, designing a distributed
system for performance is quite a difficult task.
A distributed system will consist of a number of computers which offer some
service (servers) and a number of computers (clients) which call on this service.

73

Why client/server?

Openness
Scalability
Specialisation
Reliability
Design flexibility

Internet Technologies

74

There are a number of reasons why systems are organised in a client-server
architecture:
Openness. Using a set of protocols that all computers understand means that a
wide variety of hardware and platforms can be connected together.
Scalability. As a network starts to suffer from performance problems,
new computers can be easily added.
Specialisation. Computers can be dedicated and optimised to one
particular service.
Reliability. Computers can be duplicated to increase reliability.
Design flexibility. Many more design options are available to the
designer, for example where to place a database.

74

Protocols
A form of standardised rules for
conversation
Popular type of protocol is the
request/response protocol (HTTP)
Protocols can be system protocols or
application protocols

Internet Technologies

75

Protocols are the lifeblood of a distributed system. Already you have seen many
examples of system protocols. Many others exist on the Internet. They enable
clients and servers to interact with each other and are used to call on resources,
return resources or return status information.

75

An example of an application
protocol

FindCurrentBid
UpBid bid amount
DropOut
SendCurrentPrice price
Sold
Internet Technologies

76

The set of lines above represents an example of an application protocol which can
be used in an online auction system. Some of the protocol is used by the client,
for example to establish a bid; other elements of the protocol are used by the
auction server to send the result of a bid or the auction state.
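
A program that speaks such a protocol has to turn each line it receives into a command and its value. The Java below is a minimal sketch of that step. Two things in it are assumptions rather than part of the protocol as given: the wire format (a command name, optionally followed by a space and one value) and the split of the five commands between client and server.

```java
import java.util.Arrays;
import java.util.List;

class AuctionMessage {
    // Assumed split of the slide's commands between the two parties.
    static final List<String> CLIENT_COMMANDS = Arrays.asList("FindCurrentBid", "UpBid", "DropOut");
    static final List<String> SERVER_COMMANDS = Arrays.asList("SendCurrentPrice", "Sold");

    final String command;
    final String argument; // empty when the command carries no value

    private AuctionMessage(String command, String argument) {
        this.command = command;
        this.argument = argument;
    }

    // Parses one protocol line of the assumed form "Command [value]".
    static AuctionMessage parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        String command = parts[0];
        if (!CLIENT_COMMANDS.contains(command) && !SERVER_COMMANDS.contains(command)) {
            throw new IllegalArgumentException("Unknown command: " + command);
        }
        return new AuctionMessage(command, parts.length > 1 ? parts[1] : "");
    }

    boolean isFromClient() {
        return CLIENT_COMMANDS.contains(command);
    }

    public static void main(String[] args) {
        AuctionMessage bid = parse("UpBid 120");
        System.out.println(bid.command + " " + bid.argument + " (from client: " + bid.isFromClient() + ")");
    }
}
```

A line that does not start with one of the five known commands is rejected with an exception, which is how a protocol implementation typically enforces the agreed rules of conversation.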

76

POP3 a heavily used protocol

USER
PASS
STAT
DELE
RETR

Details of POP3

Internet Technologies

77

POP3 is a simple but heavily used protocol which is used for retrieving email. A number
of elements of the protocol are shown above:
USER gives the POP3 server the name of the user whose mail is to be retrieved
PASS communicates the user password
STAT retrieves statistics on how many email messages are waiting for the user
DELE marks an email message for deletion
RETR retrieves an email message

77

Ports and sockets


Communication between entities in a
network occurs via ports and sockets
A port is a logical concept identified by an
integer
A socket is a unique connection to a
computer formed from the computers IP
address and a port number
Internet Technologies

78

At the lowest programmatic level communication between entities in a TCP/IP
network occurs via ports. A port is not a hardware concept but a logical one. It
represents a connection into and out of a computer.
When a computer wishes to connect to another one, a socket is
established. A socket is a connection which is unique. It is made up of the IP
address of a computer and a port number. A socket is a programming abstraction.
Ports numbered between 0 and 1023 are reserved for special applications. Ports
above 1023 can be used for applications.
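
The pairing of a host with a port number can be shown using the standard Java class InetSocketAddress. This is a small sketch only: the helper isReservedPort is a hypothetical method written for the example, and createUnresolved is used so that no DNS lookup or network connection takes place.

```java
import java.net.InetSocketAddress;

class SocketAddressDemo {
    // Ports 0 to 1023 are reserved by convention for dedicated services.
    static boolean isReservedPort(int port) {
        return port >= 0 && port <= 1023;
    }

    public static void main(String[] args) {
        // A host name plus a port number identifies one end of a connection.
        InetSocketAddress address = InetSocketAddress.createUnresolved("www.open.ac.uk", 80);
        System.out.println(address.getHostString() + ":" + address.getPort());
        System.out.println("reserved: " + isReservedPort(address.getPort()));
    }
}
```

The sketch prints www.open.ac.uk:80 followed by reserved: true; a port such as 2500 would be reported as not reserved and so is free for an application to use.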

78

Dedicated ports

7 ECHO
13 DAYTIME
21 FTP
23 TELNET
25 SMTP
80 HTTP
110 POP3
150 SQL-NET
443 HTTPS

Internet Technologies

79

Ports between 0 and 1023 are, by convention, reserved for dedicated services. A
list of some of these dedicated ports is shown above. For example, port 80 is used
for communication between a Web client running a browser and a Web server
storing a series of HTML pages.
Most programmers will be unaware of these port numbers and will not need
to know them; for example, programming Web servers requires just a knowledge
of HTTP. However, if you are writing low-level applications it is important to
avoid dedicated ports.
SQL-NET is a network protocol used for communication between relational
databases, and HTTPS, which uses port 443, is the secure version of HTTP.

79

Java and the Internet


Java is a language used for examples in this
course
A knowledge of Java is not required.
Only a relatively small number of slides
contain Java code.
The exam and the long essay question will
not test you on Java
Internet Technologies

80

I shall be using Java as an example to show the type of programming that is
carried out when developing Internet applications. Java is almost certainly the
most complete language in terms of Internet facilities.
The course does not require a detailed knowledge of Java; all that is required is a
basic grasp of object/message sending.
A relatively small number of slides use Java and I have made sure that there is no
complicated programming in the slides.
The examination or long essay question will not test your knowledge of Java.

80

Basic tutorial in Java

status = computer1.getStatus();

This line of Java gets the current status
of computer1, placing the result into status
This is all the Java you need
I will not be examining the language
at the end of the course

Internet Technologies

81

All you need to know about Java for this course is that you can send messages to
objects and that often the result of this message is some data. The example above
shows the message getStatus being sent to the destination object computer1,
with the status of that computer (running, stopped, malfunctioning) being
communicated back and assigned to the variable status.
This is a common pattern in object-oriented programming.
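
For anyone who would like to see the line above in a complete program, here is a minimal sketch. The class Computer is hypothetical and simply stores a status string; it is not part of any Java library.

```java
class Computer {
    private final String status;

    Computer(String status) {
        this.status = status;
    }

    // Responds to the getStatus message by returning the stored status.
    String getStatus() {
        return status;
    }

    public static void main(String[] args) {
        Computer computer1 = new Computer("running");
        String status = computer1.getStatus();
        System.out.println(status);
    }
}
```

The program prints running: the object computer1 receives the message getStatus and the data it sends back is placed in the variable status.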

81

An example
Shows two sets of code: code for a client
and code for a server
Communication via the port 2500
Server located at penny.open.ac.uk
Client is anywhere

Internet Technologies

82

The code on the following pages establishes a connection between a client and a
server. The client sends the message "Hello" to the server and receives the reply
"Connection established". The server is the computer penny.open.ac.uk; the
client can be any computer on the network. All communication is via port
2500.

82

Client
import java.io.*;
import java.net.*;

// Set up the socket to the remote computer penny
Socket pSock = new Socket("penny.open.ac.uk", 2500);
// Obtain the streams
InputStream is = pSock.getInputStream();
OutputStream os = pSock.getOutputStream();
// Set up the BufferedReader which is
// associated with the socket
BufferedReader bf =
    new BufferedReader(new InputStreamReader(is));

Internet Technologies

83

The client code sets up a socket pSock on port 2500 of computer
penny.open.ac.uk. Two streams are set up: an input stream which reads data
from the server and an output stream which sends data to the server. The input
stream is then used to establish a buffered reader which can read data from the
stream using an area of memory known as a buffer.

83

Client code (ii)


// true enables auto-flush, so println sends the line immediately
PrintWriter pw = new PrintWriter(os, true);
// Send message to server
pw.println("Hello");
// Get reply from server
String reply = bf.readLine();
if ("Connection established".equals(reply)) {
    // Process the reply
} else {
    // Carry out some error process
}

Internet Technologies

84

Next a print writer object is set up. In Java a print writer is used to write character
data. In the case of the client this data is written to the server.
Next the client sends the message "Hello" to the server and then reads the reply
that has been sent back. If the message was "Connection established" then the
server and the client are ready to talk to each other.

84

Server code (i)

// Bind a server socket to port 2500
ServerSocket ss = new ServerSocket(2500);
// Wait for a connection
Socket sockS = ss.accept();
// Set up the streams and BufferedReader
InputStream is = sockS.getInputStream();
OutputStream os = sockS.getOutputStream();
BufferedReader bf =
    new BufferedReader(new InputStreamReader(is));


The first part of the server code for penny.open.ac.uk is shown above. It first
establishes an object known as a server socket on the server. This is bound to port
2500. It then blocks, waiting for a connection from a client. When a client connects,
two streams are set up: one is an input stream, the other is an output stream. A
buffered reader is connected to the input stream so that the server can read data
from the client.


Server code (ii)


PrintWriter pw = new PrintWriter(os, true);
//code for sending data not shown
String readString = bf.readLine();
if ("Hello".equals(readString))
    pw.println("Connection established");
//Remaining processing


A print writer object is then set up and a line read from the client. If the client
sent a Hello message then a connection has been established and the server
informs the client of this.
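
The four fragments above can be joined into one program that runs both ends of the exchange on a single machine. What follows is a minimal sketch rather than the course's own listing: the class name HelloHandshake and its method names are my own, and it uses localhost with a free port chosen by the system (port 0) in place of penny.open.ac.uk and port 2500.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical class: combines the client and server fragments.
class HelloHandshake {

    // Server side: accept one client and answer a Hello message.
    static void serveOnce(ServerSocket ss) {
        try (Socket sockS = ss.accept();
             BufferedReader bf = new BufferedReader(
                     new InputStreamReader(sockS.getInputStream()));
             PrintWriter pw = new PrintWriter(sockS.getOutputStream(), true)) {
            if ("Hello".equals(bf.readLine())) {
                pw.println("Connection established");
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Client side: send Hello and return the server's reply.
    static String sayHello(String host, int port) {
        try (Socket pSock = new Socket(host, port);
             BufferedReader bf = new BufferedReader(
                     new InputStreamReader(pSock.getInputStream()));
             PrintWriter pw = new PrintWriter(pSock.getOutputStream(), true)) {
            pw.println("Hello");
            return bf.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Run both ends in one process; port 0 asks for any free port.
    static String demo() {
        try (ServerSocket ss = new ServerSocket(0)) {
            Thread server = new Thread(() -> serveOnce(ss));
            server.start();
            String reply = sayHello("localhost", ss.getLocalPort());
            server.join();
            return reply;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The server has to run on a thread of its own here because accept blocks until the client connects.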


Socket and server programming


Low level
Concurrency needed
Only carried out for fast applications which are
non-standard
Other applications use APIs which are built on top
of the socket and server socket facilities in a
programming language, for example Web server
APIs

The Java code on the previous slides represents a low-level form of
programming. It is only really used when speed is of the essence and where a
normal API cannot be used.
In practice the code would be a lot more complicated, because the version shown
is somewhat inefficient. For example, the accept code waits for a connection, then
services that connection and relinquishes it. While one connection carries out some
lengthy server process such as updating a database, other connections are kept
waiting, even though the server could be doing useful work for them during that
period. In practice the connections are threaded.
Most application programmers use APIs which hide the details of the
connections; for example, the javax.servlet API enables the programmer to
carry out functions such as sending a Web page to a client without worrying
about sockets and server sockets.


Enterprise frameworks
Microsoft

Sets of software for developing commercial
applications on the Internet
Two main ones are .Net and J2EE
Large variety of facilities
Sun

An introduction to J2EE


An enterprise framework, sometimes known as an application framework, is the
term given to a collection of software which enables a programmer to access the
facilities that implement distributed applications. The term enterprise
application is beginning to take over from more venerable terms such as
management information application (American) and data processing application
(British).
Effectively an enterprise framework provides a set of application programming
interfaces (APIs) which shield the programmer from the gory details of protocols
and transport mechanisms.


Facilities of an enterprise
framework (i)

Multi-language working
Support for legacy code
Support for high volume transactions
Support for messaging


An enterprise framework should have the following:
Multi-language support. Ideally the framework should be capable of
being programmed in a number of programming languages with these
languages compiling to code which has a common format. This would
enable compiled code from a number of sources to be integrated together.
Support for legacy code. It should provide interfaces to code written in
older programming languages so that Internet-based systems can be easily
integrated with that code.
Support for high-volume transactions. The framework should be capable
of supporting code which can be used in an application server; the code
should be capable of being efficiently executed, even when the volume
of transactions is high.
Messaging support. Much of the communication between entities in a
distributed application is done via messages. Already you have seen how
POP3 does this. An enterprise framework should support this
communication paradigm.


Facilities of an enterprise
framework (ii)
Web server programming
XML facilities
Interface to standard protocols


Web server programming. When a Web server receives a request for a
Web page such as a form there is a need for a program to be executed, for
example to retrieve data from a database. A good framework should
provide the programmer with the facilities that enable browser requests to
be parsed and unbundled, the processing associated with the request to be
carried out, and some response to be sent back to the browser.
XML facilities. As you will see later in this course XML is a major
Internet technology which is beginning to control the textual chaos of the
Internet. It is a technology which allows developers to define special
purpose languages, for example to describe the wares of an e-shop; such
languages can then be processed by anyone with XML software. An
enterprise framework which does not provide facilities for XML
processing is severely limited.
Interface to standard protocols. An enterprise framework should enable a
programmer to access the individual commands that make up a protocol
such as HTTP. While, increasingly, this is not required for many
applications, there are some applications (usually those that require
run-time efficiency) where access to low-level details of an Internet
technology is required.


Facilities of an enterprise
framework (iii)
Database connectivity
Naming services
Security services


Database connectivity. The vast majority of enterprise applications
require databases to function. An enterprise framework must be capable
of interfacing with the main database types.
Naming services. An enterprise framework should provide facilities
whereby the programmer can write code that consults a naming or
directory service such as LDAP.
Security services. As you will see later in this course security is a major
problem in distributed systems. An enterprise framework should provide
facilities whereby the programmer can send secure data across a network,
authorise access and check that an entity that is accessing an application
is allowed to do so.


Lecture 3
Distributed paradigms: introduction,
message passing and event-based
paradigms

Aims
To detail the various architectural schemes
available for developing distributed
architectures
To describe the main three tier concept used
in the course
To describe the idea that a technology
should hide an underlying network
architecture

Why client/server? (repeat)

Openness
Scalability
Specialisation
Reliability
Design flexibility


There are a number of reasons why systems are organised in a client-server
architecture:
Openness: using a set of protocols that all computers understand means that a
wide variety of hardware and platforms can be connected together.
Scalability: as a network starts to suffer from performance problems,
new computers can be easily added.
Specialisation: computers can be dedicated and optimised to one
particular service.
Reliability: computers can be duplicated to increase reliability.
Design flexibility: many more design options are available to the
designer, for example where to place a database.


Disadvantages

Design complexity
Programming complexity
Performance problems

Depends how far you are from the
application layer

Predicting and achieving performance
can be really difficult


While the client/server paradigm is now the prevalent one, it is worth detailing
some disadvantages. The first is that achieving high performance can make design
and programming an immensely tough task: reconciling a number of servers
with different performance characteristics which are remotely connected via slow
transmission lines requires design skills of the highest order, particularly when
high reliability is required.


Architecture
An arrangement of basic components and
their interaction with each other.
Four architectures studied here.
They are message passing, distributed
object, event-based architectures and tuple
space
Distinguished by their distance from the
network

This shortish lecture describes architectures. When one talks about a system
architecture or a system design what is being referred to is the arrangement of
building blocks and the connections between the blocks. This does not differ
from the conventional use of the term architecture in building.
There are four architectures I discuss in this part of the course. I have chosen
them for two properties: the first is their popularity, the second
is their distance from the network. The term distance characterises the level of
abstraction of the architecture: how much it hides the underlying hardware and
low-level software.
In the next three lectures I will look at two examples of a particular architecture
known as the distributed object architecture.


A statement

The network is the computer

To implement this requires
performance vs complexity
trade-offs


One of the aims of software developers is to hide the details of the network
from the programmer. This is best exemplified by the statement above
which forms part of the network strategy of Sun Microsystems, the original
developers of the Java programming language.
The ideal that is stated there is that there should be no difference between
programming a single computer and programming a number of computers connected
together which implement the same functionality.


The best we can achieve


Programmer

Network
Computers

System configurer
and maintainer

The slide above shows the best that we can achieve in terms of moving towards
the ideal detailed in the previous slide. The programmer is unaware that he/she is
programming a number of computers. There is, however, another role: that of
system configurer who is responsible for ensuring that performance and
reliability goals are met. For large applications where traffic varies as
functionality changes this will be a continuous job.


Sockets and ports and hiding the network
Sockets and ports are a logical idea
They abstract away the transport level stuff
that is associated with TCP/IP
The transfer of data is done in terms of
input and output streams with no
programming of issues such as low-level
error handling

Sockets and ports are examples of abstracting away from the network. A socket is
a logical entity; it does not, for example, correspond to a piece of hardware. When
sockets are established communication is via streams which are normally used
for input and output. At one level socket programming is very similar to
programming sequential file access.
Sockets hide the physical address of a computer (usually a symbolic name is
used). They also hide the transport details: for example, there is no need to worry
about what to do if there is a hardware error, since the TCP/IP software handles
this problem. However, the programmer does have to be aware that problems can
occur and cope with them.
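
As an illustration of coping with such problems, the sketch below tries to open a connection and reports failure instead of letting the exception escape. It is only one possible approach: the class SafeConnect, its method names and the one-second timeout are assumptions of mine, not part of the course code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical helper: shows a programmer coping with connection problems.
class SafeConnect {

    // Returns true if a TCP connection to host:port succeeds within
    // timeoutMillis; any I/O failure is reported as false.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Opens a listening port, checks it, closes it and checks again.
    static boolean[] demo() {
        try {
            ServerSocket ss = new ServerSocket(0);
            int port = ss.getLocalPort();
            boolean whileOpen = canConnect("localhost", port, 1000);
            ss.close();
            boolean afterClose = canConnect("localhost", port, 1000);
            return new boolean[] { whileOpen, afterClose };
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

An unknown host name, a refused connection and a timeout all surface as an IOException, which is why a single catch clause is enough in this sketch.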


Sockets in Java

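// Client side: connect to port 1100 on venus.open.ac.uk
// (a ServerSocket takes only a local port, e.g. new ServerSocket(1100))
Socket s = new Socket("venus.open.ac.uk", 1100);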


Here a socket is opened on port 1100 of the computer
venus.open.ac.uk. No physical details are specified. This is the ideal;
however, the next slide shows an intrusion of reality.


Reality intruding
So far I have been talking about ideals.
Reality in terms of performance and
application level error-handling often
intrudes
An example of this is concurrency and
socket processing


The model presented in the previous slides has been unrealistic: when you have a
number of clients accessing a server you will need to develop the server code as
concurrent code.


A socket programmed server

[Diagram: clients concurrently accessing a server. One client is active while
the others wait in a queue, because the service involves a lengthy wait.]

Here a server carries out some lengthy process such as accessing a relational
database. A client connects to the server, asks for a service and waits for the
service to be carried out. The service may take many milliseconds. In this time
a number of other clients may ask for the same service. In a simple-minded,
high-level program what happens is that the system software queues up
these clients until the client currently accessing the service finishes. This is very
inefficient: in the wait time, while the server processor is idle, more clients could
be processed. In reality concurrency is used.


Performance and the previous server model
In reality servers are not programmed using
a high-level model such as the one shown in
the previous slide
Concurrency is employed
In Java this means using threading.
Here a number of clients are currently
active

As with all high-level abstractions, performance usually intrudes. Since sockets and
server sockets are at a fairly low level, close to the underlying software, this can
be handled quite easily. The programmer writes threaded code for each client.
This code is started every time a client makes a connection and is then
executed in parallel with other threaded code.
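
The threaded approach just described can be sketched as follows. This is an illustrative program under my own assumptions, not the course's code: the class ThreadedServer, the reply text and the 100 millisecond sleep (which stands in for a lengthy service such as a database access) are all invented for the example, and a production server would normally use a thread pool rather than one raw thread per client.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical class: one thread per client connection.
class ThreadedServer {

    // Code run on its own thread for each client.
    static void handle(Socket client) {
        try (Socket s = client;
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            String request = in.readLine();
            Thread.sleep(100); // stands in for a lengthy service
            out.println("Served: " + request);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Accept the given number of connections, one new thread per client.
    static void serve(ServerSocket ss, int clients) {
        try {
            for (int i = 0; i < clients; i++) {
                Socket client = ss.accept();
                new Thread(() -> handle(client)).start();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Client side: send one line and return the one-line reply.
    static String request(int port, String message) {
        try (Socket s = new Socket("localhost", port);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            out.println(message);
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Start the server and several simultaneous clients; return sorted replies.
    static List<String> demo(int clients) {
        List<String> replies = Collections.synchronizedList(new ArrayList<>());
        try (ServerSocket ss = new ServerSocket(0)) {
            int port = ss.getLocalPort();
            Thread acceptor = new Thread(() -> serve(ss, clients));
            acceptor.start();
            List<Thread> callers = new ArrayList<>();
            for (int i = 0; i < clients; i++) {
                String message = "client" + i;
                Thread t = new Thread(() -> replies.add(request(port, message)));
                callers.add(t);
                t.start();
            }
            for (Thread t : callers) {
                t.join();
            }
            acceptor.join();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        List<String> sorted = new ArrayList<>(replies);
        Collections.sort(sorted);
        return sorted;
    }
}
```

Because each client is handed to its own thread, the accept loop returns to waiting at once and the lengthy services overlap instead of queuing.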


The models

Message passing
Distributed objects
Event-based architectures
Tuple or space-based models

Increasing
abstraction

In this lecture I shall be dealing with four basic architectural models. Message
passing is at the lowest level and is usually implemented using sockets and server
sockets. Distributed objects are an attempt to program a distributed system in such
a way that it can be viewed as a collection of interacting objects. Event-based
architectures are based on a broadcaster/listener viewpoint, where listeners in a
distributed system are only activated when an event of interest to them occurs.
Finally, tuple spaces are the highest-level architecture in that they view a
distributed system as a collection of data.
All these models can be implemented in Java.


An overriding architecture

Storage
layer

Business
object
layer

Client layer

Three level architectures link



Whatever architecture is used a distributed system should have the global
application architecture shown above. The storage layer contains data in some
permanent or temporary medium. The business object layer represents objects
from the application domain while the client layer represents code used for client
communication, for example HCI code.


The three-layer architecture

It separates concerns
It isolates application code
It protects a system from changes, for
example a change in the underlying storage
technology
Example of information hiding

There are a number of very good reasons for having a three-layer architecture
similar to that shown in the previous slide.
The first is that it does not intermingle code: database access code is not mixed
with application code, which in turn is not mixed with client code.
The second is that it minimises the effect of change in that, for example, all the
storage code is isolated in one layer so that changes to the underlying database
technology do not unduly affect the maintenance process.
The whole idea of a three-layer architecture is based on information hiding where
details of such things as client events, application objects and database access are
hidden beneath a layer of an API.


Code which hides the storage technology
getPrice(commodity)
getAllatLessPrice(commoditytype, limit)
inStock(commodity)
noInStock(commodity)


The code headers above represent the interface to stored data for an e-commerce
application. When the programmer uses this code he or she has no idea what the
underlying technology used for storage is: it could be a relational database
system, it could be an object-oriented database system or it could be a set of
distributed objects. Moreover, the programmer has no idea where the data is: it
could be on a local computer or one thousands of miles away. This means that
changes can be made to the underlying data without affecting much of the
application code that has been written.
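
The headers on the slide can be written as a Java interface with the storage technology tucked away behind it. The sketch below is an assumption-laden illustration: the slide gives only the method names, so the parameter and return types, the names StockStore and InMemoryStockStore, the add method, prices held as whole pence, and the reading of noInStock as "number in stock" are all mine. An in-memory map stands in for a real database.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// What the application programmer sees; method names follow the slide.
interface StockStore {
    int getPrice(String commodity);
    List<String> getAllatLessPrice(String commodityType, int limit);
    boolean inStock(String commodity);
    int noInStock(String commodity);
}

// One possible storage technology, hidden behind the interface.
class InMemoryStockStore implements StockStore {

    private static final class Item {
        final String type;
        final int price;
        final int quantity;

        Item(String type, int price, int quantity) {
            this.type = type;
            this.price = price;
            this.quantity = quantity;
        }
    }

    // Sorted map, so results come back in name order.
    private final Map<String, Item> items = new TreeMap<>();

    // Hypothetical helper for loading data; not one of the slide's headers.
    void add(String commodity, String type, int price, int quantity) {
        items.put(commodity, new Item(type, price, quantity));
    }

    private Item find(String commodity) {
        Item item = items.get(commodity);
        if (item == null) {
            throw new IllegalArgumentException("Unknown commodity: " + commodity);
        }
        return item;
    }

    public int getPrice(String commodity) {
        return find(commodity).price;
    }

    public List<String> getAllatLessPrice(String commodityType, int limit) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Item> e : items.entrySet()) {
            Item item = e.getValue();
            if (item.type.equals(commodityType) && item.price < limit) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public boolean inStock(String commodity) {
        return find(commodity).quantity > 0;
    }

    public int noInStock(String commodity) {
        return find(commodity).quantity;
    }
}
```

Application code written against StockStore would not change if InMemoryStockStore were replaced by a class that consulted a relational database or a remote server.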


The client layer

Contains code associated with the client


Mostly HCI and code which communicates
client events to a server
Also handles data from the server


The client layer will contain code which processes events such as button clicks or
a text field going out of focus; this is HCI event code.
It will also communicate the fact that an event has happened to a server via the
business object layer. How it does this depends on the technology used within the
system; it could, for example, involve message passing.


The business object layer


Objects representing application objects
reside in this layer
For a theatre booking system typical objects
include customer, booking and performance
Accessed via an application API
Application API communicates with data
layer

The business object layer contains objects which are concerned with the
application, for example in an e-commerce application for selling CDs there
would be objects which represent individual CDs.
They are supported by application APIs which provide access to the underlying
data.
It also contains code which receives data from the server and displays that data.


Some examples of business objects


Sales Note
Customer

A sales application

Supplier

Product

Invoice

Wikipedia on business objects


Some typical business objects are shown in the slide for an application which
sells some set of products which are supplied by another company, for example
an online bookshop.


The storage layer


Often known as the database layer
Contains an implementation of the business
objects in terms of some underlying
database technology.
Usually the mapping is from objects to
relational tables
There is no requirement to use relational
database technology

This is the layer in which raw data resides. Normally it is implemented in terms
of relational database technology. However, there is no reason why other
technologies cannot be used such as OO-based database technologies or even
simple flat files or transient data.
A three layer system should be designed in such a way that the technology is
completely isolated from the other layers in order that change does not impact in
a major way.


Message passing (i)


Implemented using simple messages taken
from an overall protocol
Closest to network
Usually uses socket programming
Used for fast communication or where there
is no other protocol available


This form of architecture is based on sending and receiving messages from those
available in a protocol set. It is closer to the network than the other architectures
that I will describe in that it is based on sending serial data directly across some
transmission medium.
It is a very efficient way of doing things and is used when speed is of the essence.
It can be easily implemented using sockets and streams although, in practice,
concurrency is employed at the server.
It is also used in novel applications when there is no existing protocol available.


Message passing (ii)


Many of the underlying protocols are based
on message passing: SMTP, POP3, HTTP,
FTP etc.
Fixed protocols.
Protocols can be adaptive
Messages can be synchronous or
asynchronous

Message passing is mainly used in system protocols which are heavily used,
protocols such as HTTP and POP3. Most of these are fixed protocols in that the
whole repertoire is available to both the client and the server.
Some protocols can, however, be adaptive and are able to be modified prior to
being used in a communication link between a client and a server.


Adaptive protocols
Used in a number of contexts
Where a command has a variable number of
arguments
Where client and server have to negotiate a
subset of a full protocol set
Where a highly reliable system requires
modification of a protocol on the fly

A protocol can be fixed in that both the client and server use the same set of
commands over their interactions. However, protocols can be adaptive. An
example of this is where commands need to take a varying number of arguments
or a single argument which is of a different type.
Another example is where a client and a server negotiate a subset of a protocol,
for example the client may only be programmed to recognise an early subset of
the full protocol used by the server.
A further example is where a protocol is modified on the fly when it needs to be
changed. This usually happens in highly reliable systems.
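
The second case, negotiating a subset of a protocol, can be sketched very simply: the agreed protocol is the set of commands that both sides understand. The class ProtocolNegotiation and the command names in the usage note are hypothetical; real protocols each define their own negotiation exchange.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper: works out the subset of a protocol both sides share.
class ProtocolNegotiation {

    // Returns the commands understood by both sides, in the server's order.
    static List<String> negotiate(List<String> serverCommands,
                                  Set<String> clientCommands) {
        List<String> agreed = new ArrayList<>();
        for (String command : serverCommands) {
            if (clientCommands.contains(command)) {
                agreed.add(command);
            }
        }
        return agreed;
    }
}
```

For example, if the server offers LIST, GET, PUT and SEARCH and an older client knows only GET and LIST, the two sides would restrict themselves to LIST and GET for the rest of the session.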


Event-based architectures
Based on the concept of a broadcaster and
listeners
Based on the MVC architecture
Listeners register with the system in order
to receive data based on an event such as a
new user being logged on.


Event-based architectures are based on events being generated by a broadcaster,
for example a new user logging on to a chat room. When such an event occurs
listeners who have registered for that event are executed; other listeners who have
not registered are not. There are no programmatic wait loops in which listeners
are programmed to idle.
This idea arose from the Model View Controller architecture which is found in
human-computer interfaces, where an event such as a button being pressed might
affect the look and feel of other user components that are registered to listen to
that event. It is the model used for event processing in Java.
All that is needed are facilities to create listeners, create a broadcaster and
register listeners for a particular event which has emerged from a broadcaster.
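
Those three facilities can be sketched within a single program before any network is involved. The names LogonListener, Broadcaster and ChatRoomDemo are my own, chosen to match the chat-room example; this is not the iBus API that appears on the following slides.

```java
import java.util.ArrayList;
import java.util.List;

// A listener is told when a user logs on.
interface LogonListener {
    void userLoggedOn(String userName);
}

// The broadcaster keeps a list of registered listeners and calls each
// of them when the event occurs; nobody polls or idles in a wait loop.
class Broadcaster {
    private final List<LogonListener> listeners = new ArrayList<>();

    void register(LogonListener listener) {
        listeners.add(listener);
    }

    void broadcastLogon(String userName) {
        for (LogonListener listener : listeners) {
            listener.userLoggedOn(userName);
        }
    }
}

class ChatRoomDemo {
    // Registers two listeners, leaves a third unregistered, fires one event.
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        Broadcaster broadcaster = new Broadcaster();
        broadcaster.register(name -> seen.add("roster saw " + name));
        broadcaster.register(name -> seen.add("greeter saw " + name));
        LogonListener unregistered = name -> seen.add("never called");
        broadcaster.broadcastLogon("alice");
        return seen;
    }
}
```

Only the two registered listeners run when the event is broadcast; the third is created but never registered, so it is never executed.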


iBus

Listeners
Broadcaster

Communication bus
Listeners

The suppliers of iBus



The example of an event-based architecture is based on iBus. This consists of a
central software bus which listeners register with. The bus carries events from a
broadcaster object to the registered listeners.


Code for a commercial implementation: iBus (i)
String message;
//Create a protocol stack, note that
//this is not a java.util.Stack, the stack
//should be reliable and not, say, UDP
Stack st = new Stack("Reliable");
//Open the bus ready for objects to be sent along
//it, it is situated on the computer Venus, note the
//use of the string ibus in the URL
iBusURL url =
    new iBusURL("ibus://Venus/Generator/Text");
st.registerTalker(url);


This code starts up a broadcaster at the specified URL. Listeners can then listen
to events which are broadcast.


Code for a commercial implementation: iBus (ii)
//Construct a posting using the zero argument
//constructor
Posting pst = new Posting();
//There will only be one object in the posting
pst.setLength(1);
//Put the string object to be sent in the 0th position
//in the posting
message = "Hello there";
pst.setObject(0, message);
//Further code here
..
//Now push the posting out on the bus
st.push(url, pst);
}

This code constructs a message (posting) and places it on the bus which listeners
can register with.


Code for a commercial implementation: iBus (iii)
Stack st = new Stack("Reliable");
//Set up an object that will receive messages
ReceiverObject ro =
    new ReceiverObject();
//Create a new bus object, it should match the
//one that has been set up by the transmitter


This code just sets up a receiver object. Note that the definition of a
ReceiverObject is not shown here. It is implemented by inheritance from a class
Receiver.


Code for a commercial implementation: iBus (iv)
iBusURL url =
    new iBusURL("ibus://Venus/Generator/Text");
//Tell the bus that the receiver is now attached to it
//waiting for messages
st.subscribe(url, ro);
..
//Suspend the program and wait for an object,
//if this wasn't done then
//the program would exit immediately
st.waitTillExit();


This shows the subscribing process whereby a listener registers itself with the
bus. What is not shown is the code that is executed when an event occurs. This
would be found in the class Receiver.


An abstraction
mechanism

Distributed objects
Objects spread around a distributed system
Access to the objects is virtually the same
irrespective of where they reside.
A number of popular technologies: RMI, DCOM
and CORBA
Dealt with in much more detail in the next lecture
Introduction to distributed objects


A set of distributed objects are objects which reside on a number of computers in
a distributed environment. The ideal for the programmer is that such objects seem
very little different to those that reside on the programmer's computer.
The rationale behind distributed objects is that since a company has a lot of
investment in object-oriented technology, that investment should be used when a
system is distributed.
There are a number of technologies in existence. The best known are
CORBA (general), RMI (Sun) and DCOM (Microsoft).


Tuple architectures
Almost certainly the highest level view of a
distributed system
Academic roots
Original model Linda
Java implementation known as JavaSpaces
An introduction to JavaSpaces

Tuple architectures are the highest level view of a distributed system that we
have. They regard the system as a collection of data known as spaces; programs can
read and write data to these spaces with the system software providing an
interface from a high level view to the underlying implementation.
The original model on which space architectures were based was a language
called Linda developed at Yale University. Linda was something of a curiosity
until Sun came along and implemented a version known as JavaSpaces as part of
its Jini effort.
Programming JavaSpaces is fairly easy; however, setting up spaces is rather
convoluted.


The JavaSpaces architecture

Distributed system

Three spaces populated by tuples


A JavaSpaces implementation consists of a number of collections of data known
as Spaces with individual chunks of data known as tuples stored in them. Clients
reference spaces symbolically and a small number of operations read and write
data to the spaces.


An example

sp.write(sm, null, Lease.FOREVER)

Writes an object sm to the space sp;
the object will stay there forever


This code shows one of the small number of primitives used in JavaSpaces to
write data to a space. The second argument is null; this argument is part of the
transactional control facilities of JavaSpaces and is out of the scope of this
course.
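
To give a feel for the programming model without the effort of setting up a real space, here is a toy tuple space held in memory. It is emphatically not the JavaSpaces API: the class ToySpace is hypothetical, tuples are plain arrays instead of entry objects, there are no leases or transactions, and read and take return null at once when nothing matches instead of waiting.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical in-memory stand-in for a space; not the JavaSpaces API.
class ToySpace {
    private final List<Object[]> tuples = new ArrayList<>();

    // Store a copy of the tuple in the space.
    synchronized void write(Object... tuple) {
        tuples.add(tuple.clone());
    }

    // A template matches a tuple of the same length whose fields equal
    // the template's non-null fields; a null field acts as a wildcard.
    private static boolean matches(Object[] template, Object[] tuple) {
        if (template.length != tuple.length) {
            return false;
        }
        for (int i = 0; i < template.length; i++) {
            if (template[i] != null && !template[i].equals(tuple[i])) {
                return false;
            }
        }
        return true;
    }

    // Return a copy of the first matching tuple, leaving it in the space.
    synchronized Object[] read(Object... template) {
        for (Object[] tuple : tuples) {
            if (matches(template, tuple)) {
                return tuple.clone();
            }
        }
        return null;
    }

    // Remove and return the first matching tuple.
    synchronized Object[] take(Object... template) {
        Iterator<Object[]> it = tuples.iterator();
        while (it.hasNext()) {
            Object[] tuple = it.next();
            if (matches(template, tuple)) {
                it.remove();
                return tuple;
            }
        }
        return null;
    }
}
```

A program might write the tuple ("price", "CD", 999) and another program might later read it with the template ("price", "CD", null), without either knowing where the other runs.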


Lecture 3
Distributed paradigms (ii) Distributed
objects


Aims
To describe the rationale behind distributed
object systems
To examine RPC as a predecessor
To detail the middleware required for
distributed objects


A brief history of objects


1967

Simula
Smalltalk
C++, CORBA and Eiffel
Java, RMI, DCOM
C#
2001

Object orientation started with the simulation programming language Simula in
the sixties. This was followed by the pure OO language Smalltalk, which tends
to be used as a prototyping medium.
Work on the first industrial OO language began at the end of the seventies. This
was C++, largely a superset of C, which became widely available in the eighties.
It was followed by Eiffel, a language designed from scratch whose syntax is in the
Algol and Pascal tradition.
Both these languages are used in industrial projects with C++ dominating.
The nineties saw the advent of the language Java which has made huge inroads
into the user base of C++, mainly because of its Internet facilities. This was
followed by C#, an amalgam of Java, C++ and Turbo Pascal. This language is a
cornerstone of the Microsoft .Net effort.
At the same time as these languages were being developed, distributed object
technologies such as DCOM, CORBA and RMI were being developed.


What is a distributed object?


It is an object which is capable of being accessed
both by local code and code on remote computers


Languages such as C++, Java and Eiffel allowed the programmer to specify
objects which just resided on a local computer. Distributed object technology
allows objects on remote computers to be accessed in the same way. This leads to
the concept of access transparency, which is described two slides further on.


The key features of object orientation

Information hiding
Aggregation
Inheritance

Key property, used
for reuse

Wikipedia on object-oriented programming


The key features of object orientation are:
Information hiding: the implementation of an object must be hidden from the
programmer that accesses the object, for example a sequence of employees in a
Works object can be stored in a number of ways but the programmer should not
be aware of the storage structures that are used.
Aggregation is where an object consists of a number of other objects. For
example, a plane object might be made up of a fuselage object, some engine
objects, some wing objects and so on.
Inheritance is where the facilities of a class that defines an object are used by
objects from another class, for example a salaried employee object might inherit
from a general employee object.
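
The three features can be seen together in a few lines of Java. The class names follow the examples in the text (a Works object holding employees, a salaried employee inheriting from a general employee), but the methods and fields are my own assumptions, and the Works object is used here to illustrate aggregation in place of the plane example.

```java
import java.util.ArrayList;
import java.util.List;

// A general employee.
class Employee {
    private final String name;

    Employee(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    String describe() {
        return name + " (employee)";
    }
}

// Inheritance: a salaried employee reuses the facilities of Employee.
class SalariedEmployee extends Employee {
    private final int annualSalary;

    SalariedEmployee(String name, int annualSalary) {
        super(name);
        this.annualSalary = annualSalary;
    }

    int getAnnualSalary() {
        return annualSalary;
    }

    @Override
    String describe() {
        return getName() + " (salaried)";
    }
}

// Aggregation: a Works object is made up of Employee objects.
// Information hiding: callers cannot see that a list is used to store them.
class Works {
    private final List<Employee> staff = new ArrayList<>();

    void hire(Employee employee) {
        staff.add(employee);
    }

    int headcount() {
        return staff.size();
    }

    String describeAll() {
        StringBuilder sb = new StringBuilder();
        for (Employee employee : staff) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(employee.describe());
        }
        return sb.toString();
    }
}
```

The list inside Works could be swapped for an array or a database table without any change to the code that calls hire, headcount or describeAll.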


Distributed objects and software engineering
Example of hiding network details
They fit in with current methods such as
UML
Can use OO software engineering
techniques


The major rationale behind having distributed object schemes is that they provide
consistency across applications. Because a system is viewed as a collection of
co-ordinating objects, irrespective of whether the system is distributed or not,
the same techniques, skills and software engineering tools can be
used.
This is an example of distributed technologies hiding the sometimes awful detail
that lurks in a distributed system.


Transparencies (i)
These are all part of the
process of hiding network details

Access transparency
Location transparency
Migration transparency
Replication transparency


A distributed object system must be transparent in a number of ways:
Access transparency means that you interact with a distributed object in the
same way that you would in a local system.
Location transparency means that the programmer should be unaware of the
physical location of an object.
Migration transparency means that when an object is moved from one computer
to another the system that uses these objects should be unaware of this.
Replication transparency means that when multiple copies of objects are kept
this fact should not be of any interest to the programmer.


Transparencies

Concurrency transparency
Scalability transparency
Performance transparency
Failure transparency


Concurrency transparency means that the fact that concurrent code is being
used to access an object should be hidden from both users and programmers.
Scalability transparency means that when the load on a distributed system
increases, more processing power can be added to the system without the user or
the programmer being aware of it.
Performance transparency means that the mechanisms used to maintain performance
are invisible to the user. For example, if load balancing is used this mechanism
should be hidden.
Failure transparency means that when a failure occurs the fact that it has
occurred should be invisible to the user.
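
Failure and replication transparency can be combined in a simple way: keep several replicas of a service and move on quietly to the next one when a call fails. The sketch below is an illustration under my own assumptions; the class Failover is hypothetical, and each replica is modelled as a Supplier that either returns a result or throws an exception.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical helper: hides the failure of individual replicas.
class Failover {

    // Try each replica in turn; the caller only sees a failure
    // if every replica fails.
    static <T> T firstSuccessful(List<Supplier<T>> replicas) {
        RuntimeException last =
                new IllegalStateException("no replicas configured");
        for (Supplier<T> replica : replicas) {
            try {
                return replica.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next replica
            }
        }
        throw last;
    }
}
```

Code that calls firstSuccessful receives an answer as long as at least one replica is working and never learns which replica supplied it.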


The preceding technology


Remote code is not a new concept
A precursor to remote objects was remote
procedure call
DCE is an example
Requires an Interface Definition Language
similar in concept to that used by CORBA
Current incarnation is SOAP

133

Distributed systems are not new: they have been around for twenty years. It is
hardly surprising, then, that remote code has been around for some time, even
before distributed objects were developed.
One of the most sophisticated systems was the Distributed Computing
Environment (DCE), which implemented a technology known as remote
procedure call. This allowed subroutines on remote computers to be executed as
if they were on a local computer. It requires a special language, known as an
Interface Definition Language, to define the facilities offered by a subroutine. The
idea is also used in the CORBA distributed object scheme, which is discussed later.
Remote procedure call has not gone away. It is still alive in the form of SOAP, an
XML-based form.

133

The components of a distributed object technology

Interface definition language


Presentation layer
Session layer
Transport layer


134

The major components of a distributed object technology are shown above. The
interface definition language defines the objects that are to be remotely located.
The presentation layer provides proxy objects on the client and server to which
messages can be sent. The session layer handles multiple objects and maintains
and organises the connections between the objects and the transport layer. The
transport layer uses some base protocol (usually TCP/IP) to carry out the sending
of data to a remote object and the reception of that data back at the calling
computer.

134

The components
Objects defined
by IDL

Conceptual method calls

Server

Client
Real calls

Presentation layer
Session layer
Transport layer

135

The figure shows the relationship between the various components of a
distributed object scheme. The IDL defines the objects and produces template
code which is used in their implementation.
The presentation layer maps the distributed object to some common format and
implements proxy objects which connect with the session layer.
The session layer carries out a large number of housekeeping activities including
activating objects and mapping object references to the hosts on which they
reside.
The transport layer carries the data associated with a method call to the remote
host.

135

Presentation layer

Communicates with session layer


Resolves data heterogeneity
Implements proxy objects
Performs marshalling and unmarshalling


136

The presentation layer communicates with the session layer by passing it uniform
representations of the data used in method calls. For example, an object which
represents an employee would be transformed into a sequence of bytes which
represents the employee. This process is known as marshalling. The reverse
process, which happens at both the server and the client, is known as unmarshalling.
In order for a distributed object system to work a local object acting as a proxy
needs to be maintained at the client end. This object is the one to which method
calls are made. It carries out the marshalling and unmarshalling process,
eventually passing the data associated with a method call to the session layer.
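
The marshalling step can be sketched in a few lines of Java. This is only a minimal illustration: the Employee class, its two fields and the byte layout are assumptions made for the example, not the wire format of any particular middleware.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class MarshallingSketch {
    // Hypothetical employee object, used only for illustration.
    static final class Employee {
        final String name;
        final int payrollNumber;

        Employee(String name, int payrollNumber) {
            this.name = name;
            this.payrollNumber = payrollNumber;
        }
    }

    // Marshalling: turn the object into a uniform sequence of bytes.
    static byte[] marshal(Employee e) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(e.name);          // two-byte length, then the characters
            out.writeInt(e.payrollNumber); // always four bytes, big-endian
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    // Unmarshalling: rebuild an equivalent object from the bytes.
    static Employee unmarshal(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String name = in.readUTF();
            int payrollNumber = in.readInt();
            return new Employee(name, payrollNumber);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        byte[] wire = marshal(new Employee("Jones", 4711));
        Employee copy = unmarshal(wire);
        System.out.println(copy.name + " " + copy.payrollNumber + " in " + wire.length + " bytes");
    }
}
```

In a real scheme the proxy object would call something like marshal before handing the bytes to the session layer, and the server side would call the equivalent of unmarshal.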

136

Session layer
Receives uniform data from presentation
layer and passes it to the transport layer
Maps object references to hosts
Implements activation policies
Provides object adapters
Invokes the requested method
Synchronises client and server objects

137

The session layer maps an object reference, such as a string name, to some
transport layer data used for identification, for example a host name, port number
and object name in TCP/IP. It thus implements a naming service.
It receives data in a uniform form from the presentation layer and processes it and
then instructs the transport layer to send it forward to the server containing the
distributed object.
Activation launches a previously inactive object and deactivation is the reverse: it
terminates the execution of the object. This is one of the facilities offered by a
part of the session layer known as the object adapter.
The layer also invokes the requested method on the remote object and
synchronises the process of message sending between remote objects and clients
and ensures that there are no problems with inconsistent updates and lock-outs.
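
The naming role of the session layer can be sketched as a simple lookup table. This is an illustration under assumptions: the class names, the host name and the port number are all invented for the example, and a real session layer does far more than this.

```java
import java.util.HashMap;
import java.util.Map;

class NamingSketch {
    // Transport-level identification of an object (all fields illustrative).
    static final class Address {
        final String host;
        final int port;
        final String objectName;

        Address(String host, int port, String objectName) {
            this.host = host;
            this.port = port;
            this.objectName = objectName;
        }

        @Override
        public String toString() {
            return host + ":" + port + "/" + objectName;
        }
    }

    private final Map<String, Address> table = new HashMap<>();

    // Associate a symbolic object reference with its transport data.
    void bind(String reference, Address address) {
        table.put(reference, address);
    }

    // Map the symbolic reference back to host, port and object name.
    Address resolve(String reference) {
        Address address = table.get(reference);
        if (address == null) {
            throw new IllegalArgumentException("Unknown object reference: " + reference);
        }
        return address;
    }

    public static void main(String[] args) {
        NamingSketch naming = new NamingSketch();
        naming.bind("Dater", new Address("arthur.example.org", 1099, "Dater"));
        System.out.println(naming.resolve("Dater"));
    }
}
```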

137

Transport layer

Is the layer nearest the transmission medium
It carries out the bulk transfer of byte data.
Can be any communication protocol;
however, the most popular is TCP/IP.


138

This is the lowest level of a distributed object scheme. It is the layer that carries
out the actual transport of the marshalled data representing a method call,
encapsulating data such as the method name and method arguments in a byte stream.
Any communication protocol can be used; in most implementations of remote
object schemes, such as CORBA and RMI, the Internet protocol TCP/IP is
employed.

138

Developing a distributed application


Identify objects and
location

Design
Server code
generation
Server
coding

Client code
generation

Client
coding

Server
name
registration

139

The slide above describes the process of developing a distributed application.
The first step is to design the system. This involves the definition of the classes
involved and the allocation of the remote objects to hosts.
When the objects are defined, server and client code is generated from the
definitions. This is usually done via a tool such as idltojava, which converts the
interface definition language used to specify objects into Java skeletons.
When this is done, the functionality at the server and client is coded using the
generated skeletons from the previous step.
Finally the remote objects are registered with a naming service.

139

Local vs remote objects in design

Life-cycle issues
Object reference issues
Request latency
Object activation
Parallelism
Communication
Failures
Security

140

The remaining slides look at the issues above. Such issues are relatively small
when dealing with local objects communicating among themselves. However,
when distributed objects are involved, major problems ensue.

140

Life-cycle issues

Facilities needed for remote creation.


Required for migration
For ensuring migration transparency
Distributed garbage collection required
Difficult to maintain referential integrity in
a distributed environment

141

Since the normal code for object creation (constructors) cannot be used, code has
to be written which carries out the remote creation; this is then called by the
client.
The creation has to be done in such a way that when an object moves there is no
need to change the code of any applications that use the object.
Local object schemes use garbage collection: when an object is no longer
referenced by other objects its memory is reclaimed. This is very difficult to do
in a distributed environment for performance reasons. Consequently many
distributed object schemes do not guarantee referential integrity, and hence the
client design has to cater for the condition that an object may have disappeared.

141

Object references

Object references are memory intensive, for
example ORBIX requires 40 bytes for a
single object reference
Can be a factor of 100 larger for middleware that
supports security


142

You must always design distributed systems on the assumption that references
are heavy in terms of memory. A typical figure is shown above for ORBIX,
a lightweight implementation of CORBA; for other implementations this can be
much bigger.

142

A warning

A distributed object carries a lot of
baggage around with it. Do not be
tempted to make every object in a
system distributed!


143

In order to implement distributed objects a lot of memory has to be associated
with each remote object so that communication can be enabled. Minimise the
number of remote objects that you create.

143

A design technique
One way to minimise the use of distributed
objects is to have a single distributed object
that acts as a factory object at the server, this
instantiates local objects which carry out the
processing required

Introduction to patterns

An example of a
design pattern


144

When there is a need to create a large number of objects at, say, a server, you
should not create all of them as distributed objects: the traffic to those objects
would swamp the application. One strategy is to have a single distributed object
which is manipulated by clients to produce the objects as local objects. This
object acts both as a gatekeeper and a factory.

144

The object factory


Simple approach

Sophisticated approach
Object factory

Many remote
objects

One remote object shown in light blue

145

A common technique to avoid a proliferation of remote objects is to have one
remote object which acts as a factory for local objects, with all remote
communication going through the single object. This is an example of a design
pattern.
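
The factory idea can be sketched as follows. The account example, the handle strings and all of the names are assumptions made for illustration; in RMI the gateway interface would extend java.rmi.Remote, which is left out here so that the sketch runs without any middleware.

```java
import java.util.HashMap;
import java.util.Map;

class FactorySketch {
    // An ordinary local object: it lives at the server and is never exported.
    static final class Account {
        private long balance;

        void deposit(long amount) { balance += amount; }

        long balance() { return balance; }
    }

    // The single interface that clients would reach remotely.
    interface AccountGateway {
        String open();                             // create a local object, return a handle
        void deposit(String handle, long amount);
        long balance(String handle);
    }

    // The one distributed object: factory and gatekeeper for the local objects.
    static final class AccountFactory implements AccountGateway {
        private final Map<String, Account> accounts = new HashMap<>();
        private int next = 1;

        public String open() {
            String handle = "acct-" + next++;
            accounts.put(handle, new Account());
            return handle;
        }

        public void deposit(String handle, long amount) {
            find(handle).deposit(amount);
        }

        public long balance(String handle) {
            return find(handle).balance();
        }

        private Account find(String handle) {
            Account account = accounts.get(handle);
            if (account == null) {
                throw new IllegalArgumentException("No such account: " + handle);
            }
            return account;
        }
    }

    public static void main(String[] args) {
        AccountGateway gateway = new AccountFactory();
        String handle = gateway.open();
        gateway.deposit(handle, 75);
        System.out.println(handle + " holds " + gateway.balance(handle));
    }
}
```

However many accounts are opened, only the factory needs a remote reference; the clients hold cheap string handles instead.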

145

Request latency

Local method call can take up to 250 ns.


Object requests can take 2000 times as long
Careful design of where to place objects
required: local, LAN or WAN


146

Emmerich has measured the response when two objects communicate between
two ULTRA Sparc servers in a 100 Mbit network as 500 microseconds. This is
2000 times as long as local method calls.
Designers need to be careful in object placement when designing for
performance.
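
The arithmetic behind these figures is worth making explicit. The sketch below takes the 500 microsecond measurement and the factor of 2000 quoted above, derives the implied local call time, and totals the cost of a batch of calls; the batch size of 10,000 is an arbitrary assumption.

```java
class LatencySketch {
    // 500 microseconds per remote request, the figure quoted above.
    static final long REMOTE_CALL_NANOS = 500_000;
    // Remote requests are quoted as 2000 times slower than local calls.
    static final long RATIO = 2000;
    // Implied cost of a local call: 500,000 ns / 2000 = 250 ns.
    static final long LOCAL_CALL_NANOS = REMOTE_CALL_NANOS / RATIO;

    static long totalNanos(long calls, long nanosPerCall) {
        return calls * nanosPerCall;
    }

    public static void main(String[] args) {
        long calls = 10_000; // illustrative batch size
        System.out.printf("local:  %.1f ms%n", totalNanos(calls, LOCAL_CALL_NANOS) / 1e6);
        System.out.printf("remote: %.1f s%n", totalNanos(calls, REMOTE_CALL_NANOS) / 1e9);
    }
}
```

On these figures, ten thousand calls cost about 2.5 milliseconds locally and about 5 seconds remotely, which is why object placement matters so much.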

146

Activation and deactivation


Local objects can reside for ever
Remote objects cannot: hosts need to be
shut down, resources required by all objects
may be more than the server can provide,
objects may be idle for some time and
occupying useful memory.
Activation takes time and should be
transparent

147

Activating an object means making it available to clients; deactivating is the
opposite. Unfortunately activation and deactivation take time. The implications
for the designer are:
Activation should be carefully planned: only do it when necessary in order to
minimise overhead.
State should be stored when an object is deactivated and retrieved when it is
activated.
Do not design idle objects which hang around doing nothing.
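
Storing state on deactivation and retrieving it on activation can be sketched with Java serialization. Everything here is an assumption for illustration: the Counter class, the in-memory map that stands in for permanent storage, and the method names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class ActivationSketch {
    // Hypothetical server object whose state must survive deactivation.
    static final class Counter implements Serializable {
        private static final long serialVersionUID = 1L;
        int hits;
    }

    private final Map<String, byte[]> store = new HashMap<>();   // stands in for disk or a database
    private final Map<String, Counter> active = new HashMap<>(); // objects currently in memory

    // Activation: bring the object into memory, restoring saved state if any.
    Counter activate(String name) {
        Counter counter = active.get(name);
        if (counter == null) {
            byte[] saved = store.get(name);
            counter = (saved == null) ? new Counter() : restore(saved);
            active.put(name, counter);
        }
        return counter;
    }

    // Deactivation: save the state, then release the in-memory object.
    void deactivate(String name) {
        Counter counter = active.remove(name);
        if (counter != null) {
            store.put(name, save(counter));
        }
    }

    boolean isActive(String name) {
        return active.containsKey(name);
    }

    private static byte[] save(Counter counter) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(counter);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    private static Counter restore(byte[] saved) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(saved))) {
            return (Counter) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ActivationSketch host = new ActivationSketch();
        host.activate("counter").hits = 3;
        host.deactivate("counter");
        System.out.println("hits after reactivation: " + host.activate("counter").hits);
    }
}
```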

147

Failures
Distributed systems fail more often than
centralised ones
Failure should be programmed and designed
for
Middleware often imposes an exactly once
condition which is resource heavy
At most once usually implemented

148

You need to design distributed object systems in such a way that they cope with
failure, for example by employing data replication techniques. The middleware is
able to impose an exactly once semantics, where every request is guaranteed to
be executed once and only once.
Unfortunately this gives rise to performance and resource problems, so an at
most once semantics is usually applied. This means that the middleware applies a
request at most once and tells the client when a failure occurs. The client
should therefore be programmed to respond to such failures.
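
An at most once semantics can be sketched as below. This is one possible way of getting the behaviour, not a description of how any particular middleware does it: the request identifiers, the Server class and the simulated network flag are all assumptions.

```java
import java.util.HashSet;
import java.util.Set;

class AtMostOnceSketch {
    static final class Server {
        private final Set<String> seen = new HashSet<>();
        private long balance;

        // Applies the request unless its identifier has been seen before,
        // so a retransmitted request is never executed twice.
        boolean credit(String requestId, long amount) {
            if (!seen.add(requestId)) {
                return false;
            }
            balance += amount;
            return true;
        }

        long balance() {
            return balance;
        }
    }

    // Stand-in for the middleware: it never retries silently, it reports failure
    // and leaves the decision to the client.
    static String send(Server server, String requestId, long amount, boolean networkUp) {
        if (!networkUp) {
            return "FAILED: request " + requestId + " was not delivered";
        }
        return server.credit(requestId, amount) ? "APPLIED" : "DUPLICATE IGNORED";
    }

    public static void main(String[] args) {
        Server server = new Server();
        System.out.println(send(server, "r1", 100, true));
        System.out.println(send(server, "r1", 100, true));  // same request sent again
        System.out.println(send(server, "r2", 100, false)); // simulated network failure
        System.out.println("balance = " + server.balance());
    }
}
```

The client code that receives the FAILED result is where the designer has to decide what to do: retry with the same identifier, give up, or ask the user.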

148

Security
Centralised applications deal with security
at the session level via techniques such as
authentication procedures.
Distributed object systems are prone to
network attacks.
There is a need for a deeper level of security
on a request by request basis

149

Distributed object schemes are prone to security problems. In order to overcome
this the design of such applications must:
Authenticate individual requests.
Ensure that server objects are able to decide whether a client is authorised to
make a request.
Ensure that irrefutable evidence is generated which proves that a transaction has
occurred, for example in order to ask for payment.
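
Authenticating each request individually can be sketched with a keyed hash from the standard Java library. The shared key, the request strings and the method names are assumptions made for the example. Note that a shared-key scheme like this covers only the first requirement; irrefutable evidence would need something stronger, such as digital signatures.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class RequestAuthSketch {
    // The client computes a tag over the request using a key shared with the server.
    static byte[] tag(byte[] key, String request) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(request.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // The server recomputes the tag and accepts the request only if it matches.
    static boolean authentic(byte[] key, String request, byte[] receivedTag) {
        return MessageDigest.isEqual(tag(key, request), receivedTag);
    }

    public static void main(String[] args) {
        byte[] key = "shared-secret".getBytes(StandardCharsets.UTF_8); // illustrative key
        byte[] t = tag(key, "debit:42");
        System.out.println("genuine request accepted:  " + authentic(key, "debit:42", t));
        System.out.println("tampered request accepted: " + authentic(key, "debit:43", t));
    }
}
```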

149

Distributed objects
Objects which reside on any computer in a
distributed system and to which messages
can be sent
A number of schemes: RMI, DCOM and
CORBA are the main ones
Hides the transport mechanisms from the
programmer

150

In normal object-oriented programming the programmer implements processing
by sending messages to objects. These objects reside on the same computer.
Distributed objects are virtually indistinguishable from such objects, the only
difference being that they are spread around a number of computers that
communicate via some transmission medium.
There are three main object schemes: CORBA, which is non-proprietary, RMI,
which is associated with Java, and DCOM, which is a Microsoft technology. This
lecture looks at RMI, the next lecture looks at CORBA.

150

The future
Distributed objects were once very
important, they are less important these
days with the growth of web services.
However, the CORBA technology
described in the next lecture offers major
advantages in interworking.


151

Lecture 4
Distributed paradigms (iii) RMI and
CORBA as examples of a distributed
object technology

152


Aims

To examine the RMI model of distributed objects


To show how RMI programs are developed
To examine the CORBA standards
To describe the rationale behind CORBA
To detail the CORBA development process
To describe criteria that are used to evaluate
distributed object schemes

153


Remote Method Invocation

A Java technology
Lightweight object technology
Initially Java-centric
Now with links in to CORBA
Efficient compared with CORBA

Technical introduction to RMI


154

In this lecture I will look at a pure Java solution to distributed objects known as
RMI. It was an early part of the Java product set. It was a pure Java solution in
that it could only communicate with objects developed using Java; this has
changed in that RMI now has hooks into CORBA.
RMI objects are defined in Java, placed on remote computers and sent
messages by clients. RMI is an efficient technology with a learning curve that is
not too steep.

154

The RMI layered model

Stubs

Skeletons

Remote reference layer


Transport layer

Server

Clients


155

RMI is based on a simple three-layered model. The layers are:
The stub/skeleton layer. This implements local objects which act as proxies for
the remote objects. Messages sent to these proxy objects are forwarded on to the
remote object using the remote reference layer.
The remote reference layer. This is the layer that implements the protocol that is
used for executing remote methods.
The transport layer. This is the layer responsible for sending data across some
transmission medium. The default mechanism for this is TCP/IP; however, there
is no reason why other protocols cannot be used.
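
The role of a stub can be sketched with a Java dynamic proxy. This is not RMI's own stub code: the interface is a cut-down local copy without RemoteException, and the forwarding is a direct call where a real stub would marshal the request and pass it down to the remote reference layer.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

class StubSketch {
    // Cut-down stand-in for the remote interface used later in the lecture.
    public interface SecondGenerator {
        long getMilliSeconds();
    }

    // Record of the calls that passed through the proxy, for demonstration only.
    static final List<String> forwarded = new ArrayList<>();

    // Build a local proxy that looks like the real object to the client.
    static SecondGenerator stubFor(SecondGenerator target) {
        InvocationHandler handler = (proxy, method, args) -> {
            forwarded.add(method.getName());    // a real stub would marshal the call here
            return method.invoke(target, args); // and send it to the remote host
        };
        return (SecondGenerator) Proxy.newProxyInstance(
                SecondGenerator.class.getClassLoader(),
                new Class<?>[] { SecondGenerator.class },
                handler);
    }

    public static void main(String[] args) {
        SecondGenerator stub = stubFor(System::currentTimeMillis);
        System.out.println("Milliseconds are " + stub.getMilliSeconds());
        System.out.println("Calls forwarded: " + forwarded);
    }
}
```

The client only ever sees the SecondGenerator interface, which is the point of the stub/skeleton layer.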

155

Garbage collection and security


RMI contains a distributed garbage
collector
Garbage collection is based on a remote
reference count mechanism
There are a number of security managers
that are used to prevent RMI spoofing.


156

Java has a mechanism which implements automatic garbage collection. RMI
extends this by having a mechanism for distributed garbage collection. It involves
keeping track of the number of references to a remote object and garbage
collecting such an object when the number of references drops to zero.
One problem with a remote object technology is that clients can masquerade as
valid clients even though they have no security permissions. In RMI this is
prevented by mediating access to a distributed RMI object via a security
manager.
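
The reference counting idea behind distributed garbage collection can be sketched as follows. This is a simplified model with invented names; the real RMI collector involves more machinery (for example, references held by clients are leased and must be renewed), which is left out here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RefCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();
    private final Set<String> collected = new HashSet<>();

    // A client has obtained a reference to the named remote object.
    void addReference(String objectName) {
        counts.merge(objectName, 1, Integer::sum);
    }

    // A client has dropped its reference; collect the object at zero.
    void dropReference(String objectName) {
        Integer count = counts.get(objectName);
        if (count == null) {
            return; // nothing is known about this object
        }
        if (count == 1) {
            counts.remove(objectName);
            collected.add(objectName);
        } else {
            counts.put(objectName, count - 1);
        }
    }

    int references(String objectName) {
        return counts.getOrDefault(objectName, 0);
    }

    boolean isCollected(String objectName) {
        return collected.contains(objectName);
    }

    public static void main(String[] args) {
        RefCountSketch gc = new RefCountSketch();
        gc.addReference("Dater");
        gc.addReference("Dater");
        gc.dropReference("Dater");
        gc.dropReference("Dater");
        System.out.println("Dater collected: " + gc.isCollected("Dater"));
    }
}
```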

156

RMI in a three tier architecture


RMI
objects

Data layer

Presentation layer
Business object
layer


157

The slide shows how RMI objects can be used in the middle layer of a three-tier
application. The objects can reside on a number of servers, a single server or even
the database server that implements the data layer.

157

Developing RMI code

Develop the server code


Develop the client code
Deploy client code
Make sure that the RMI registry is running
Start server execution


158

When developing an RMI system you first need to develop the server code. This code
will set up the remote objects and implement the functionality in the methods.
The client code then needs to be developed; this will normally contain visual
objects and the code that sends messages to the remote RMI objects at the server.
The client code will then need to be deployed. Classes will need to be distributed
to client sites. This can be done in a number of ways: statically, by sending them as a
file transfer, or dynamically, using a Web server.
In order for the system to work the RMI naming system needs to be started. This
is known as the RMI registry and enables objects to be symbolically referenced.
When all these steps have been completed the clients can connect to the
server and call on the service it provides.

158

Implementing a remote interface

import java.rmi.*;
public interface SecondGenerator extends Remote
{
long getMilliSeconds() throws RemoteException;
}


159

The code here defines a Java interface. This is a template for the methods
that must be provided by any class that implements the interface.
The interface extends an interface called Remote. This informs the Java runtime
system that the objects generated from this interface are going to be remote.
There is only a single method within the interface for which code needs to be
provided. The method throws an exception if there are any problems accessing the
remote object.

159

Programming the server (i)


import java.rmi.*;
import java.rmi.server.UnicastRemoteObject;
import java.util.Date;

public class SecondGeneratorImpl
        extends UnicastRemoteObject implements SecondGenerator
{
    // Name under which the object is known to clients
    private String objName;

    public SecondGeneratorImpl(String objName)
            throws RemoteException
    {
        super(); // exports the object so that it can receive remote calls
        this.objName = objName;
    }

160

This code carries out a number of functions:


It imports all the libraries that are required.
It implements the SecondGenerator interface and will eventually provide code
for the method in this interface (next slide)
It defines a state which just contains the name of the object.
It defines a constructor which creates the remote object and gives it a string
name. This is the name it will be known as to the clients in the distributed system.

160

Programming the server (ii)


    public long getMilliSeconds()
            throws RemoteException
    {
        // The method getTime returns the time in milliseconds
        return new Date().getTime();
    }


161

This is the implementation of the method getMilliSeconds. It gets the current
date and then sends the message getTime to the date object to give the date in
milliseconds. This implements the method in the interface that was introduced
previously. All it does is return a time in milliseconds.

161

Programming the server (iii)


    public static void main(String[] args)
    {
        String oName = "Dater";
        System.out.println("Loading in security manager");
        RMISecurityManager sManager = new RMISecurityManager();
        System.setSecurityManager(sManager);
        try
        {
            SecondGeneratorImpl remote =
                new SecondGeneratorImpl(oName);
            // Register the object with the RMI registry under its name
            Naming.rebind(oName, remote);
            System.out.println("Object bound to name");
        }
        catch (Exception e)
        {
            System.out.println("Error occurred at server " + e);
        }
    }
} // end of class SecondGeneratorImpl

162

This is the code that sets up the remote object. It consists of a number of
steps:
First the security manager is loaded in and started. Next a remote object is set up
with the name Dater.
The object's name is then communicated to the RMI naming service using
Naming.rebind.
Finally a message is sent to the console window to say that the object is ready to be sent
messages.
The server will now wait for messages.

162

Programming the client


import java.rmi.*;

public class TimeClient
{
    public static void main(String[] args)
    {
        try
        {
            // Obtain a local proxy for the remote object from the RMI registry
            SecondGenerator sgen = (SecondGenerator)
                Naming.lookup("rmi://hostname/Dater");
            System.out.println("Milliseconds are "
                + sgen.getMilliSeconds());
        }
        catch (Exception e)
        {
            System.out.println(
                "Problem encountered accessing remote object " + e);
        }
    }
}


163

The code on this slide shows how the client is programmed:
First, a reference to the remote object is obtained. The object is referenced by using a
URL which has the protocol rmi:.
A getMilliSeconds message is then sent to the remote object.
The result is then displayed on the console window.
An exception is raised and processed if the remote object cannot be
contacted.

163

What betrays the distributed nature of the application
A small amount of server code
One line in the client
This is not very much!


164

In a remote application using RMI not much of the code betrays the fact that
distributed objects will be used. A small amount of server code contains, for
example, a reference to the naming service. However, most of the server code will
be concerned with functionality.
The only code at the client that betrays the remote nature of the application is the
reference to the RMI Registry (the RMI naming service).

164

The client betrays the remote nature of the application

SecondGenerator sgen = (SecondGenerator)
    Naming.lookup("rmi://Arthur/Dater");

I am an RMI object on the computer Arthur and I am called Dater


165

For the client the only code that betrays the fact that an object is remote should
be code which obtains a local reference to the remote object via the naming
service. This is normally a single line. All the code after this line would just send
normal messages to the local proxy object (in the slide above this is sgen)

165

RMI
RMI is a lightweight distributed object
technology
Based on Java
Relatively simple to use
Fairly efficient
Can be used for both the static and dynamic
creation of objects

166

RMI is a comparatively simple technology which was initially confined to Java.
It is quite a simple, lightweight solution for pure Java applications, although it
now has hooks into CORBA.
The example given on the previous slides shows the static creation of objects.
There is no reason why they cannot be created on the fly, for example by using
an object factory. The client would request a new, named object and the server
would create it.

166

CORBA

A standard
Lots of implementations
Multi-language approach to distributed objects
Based on an interface definition language
Mature
Some reliability problems with some products

Introduction to CORBA

Different to RMI

167

CORBA is a distributed object technology, much as RMI is. However, there are
some major differences. The two major differences are that CORBA is a
multi-language approach, in that a wide variety of languages have CORBA interfaces,
and that distributed objects are defined by a special purpose language known as
an interface definition language (IDL).
The technology is mature in that interfaces exist for a wide variety of languages
including older ones such as Ada, LISP and FORTRAN.

167

The use of an interface definition language
A language to describe base facilities
Common denominator language intended to
mirror both object-oriented and pure
procedural languages
Needs a pre-processor
For CORBA looks like C


168

The major difference between CORBA and RMI is the use of an interface
definition language which describes the services that are provided by a
distributed object. For CORBA this language looks very much like C because
it has to mirror both procedural and OO languages. Because of this it is
restricted to declarations: it describes interfaces, with inheritance between
interfaces, but contains no implementation code.

168

CORBA is a standard
Defined by the Object Management Group
The OMG is the largest standards group in
the world
Almost certainly the most academic
The CORBA standard is huge and attempts
to cover everything


169

CORBA is not an implementation; it is a standard which specifies facilities such
as security and naming services. How these are implemented is up to product
vendors and developers. The OMG is a huge standards group which is
continually devising updates to the CORBA standard.

169

An example of the use of CORBA


Network layers
CORBA
objects
C code

Legacy
code
Ada
code

Clients
C++ code

Java code


170

Here CORBA objects are used as gatekeepers. These are objects which act as a
front end to existing code. The diagram shows a system which has been
programmed in four languages, two of which are not object-oriented. In order for
clients to communicate with this code, a CORBA object programmed in the
same language as the code is placed between the client and the code.
The clients see a clean object-based implementation even though the code behind
the CORBA objects is purely procedural.

170

The CORBA architecture


171

The figure above is a logical view of the CORBA architecture. Each of the
components will be detailed in the next few slides. The figure shows a client
interacting with a CORBA object residing on a server.

171

The architecture (i)

Client IDL stubs


The Interface Definition Language
Dynamic Invocation Interface


172

The architecture of CORBA contains a number of components:
Client IDL stubs. This is code and data which is used by the client as a
proxy for the real objects that reside on the server. These stubs are
generated by a simple utility which is provided by a CORBA
implementation. They carry out processes such as collecting data together
ready for dispatch to a remote object.
The Interface Definition Language. This is a language, usually
abbreviated to IDL, which defines interfaces to CORBA distributed
objects; as you will see later it looks a little like C. The IDL defines the
instance variables of a distributed object together with the methods which
the object can respond to. The IDL stubs are, as their name suggests,
generated from files of IDL code.
Dynamic Invocation Interface. Messages can be sent to distributed
objects either statically, where the objects are defined by IDL and where
the type of the object is known at compile time, or dynamically, where the
CORBA run-time system is able to determine the type of an object. The
part of the CORBA architecture which deals with this is known as the
Dynamic Invocation Interface.
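
The difference between static and dynamic invocation can be illustrated by analogy with Java reflection. This is only an analogy and an assumption on my part: it does not use the CORBA Dynamic Invocation Interface itself, and the Clock class and method names are invented.

```java
import java.lang.reflect.Method;

class DynamicInvocationSketch {
    // Hypothetical target object.
    public static final class Clock {
        public long getMilliSeconds() {
            return System.currentTimeMillis();
        }

        public String describe(String prefix) {
            return prefix + ":clock";
        }
    }

    // Static invocation: the type and the operation are known at compile time.
    static String staticCall(Clock clock) {
        return clock.describe("static");
    }

    // Dynamic invocation: the operation is found by name at run time.
    static Object dynamicCall(Object target, String operation, Object... args) {
        try {
            Class<?>[] types = new Class<?>[args.length];
            for (int i = 0; i < args.length; i++) {
                types[i] = args[i].getClass();
            }
            Method method = target.getClass().getMethod(operation, types);
            return method.invoke(target, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot invoke " + operation, e);
        }
    }

    public static void main(String[] args) {
        Clock clock = new Clock();
        System.out.println(staticCall(clock));
        System.out.println(dynamicCall(clock, "describe", "dynamic"));
    }
}
```

The dynamic form is more flexible but gives up compile-time checking: a misspelt operation name is only discovered when the call is made.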

172

The architecture (ii)

Static IDL skeletons


Dynamic skeletons
The interface repository


173

Static IDL skeletons. These are the server side equivalent of the client
IDL stubs. They are code which carries out a number of functions such as
extracting the arguments from a remote method invocation and
carrying out the actual process of sending a message to the remote object.
These skeletons are generated by the same utility that creates the client
IDL stubs.
Dynamic skeletons. These are equivalent to the static IDL skeletons.
However, they enable clients to access remote objects for which the
clients do not have compile-time knowledge.
Interface repository. This is a database of all the object descriptions
expressed in IDL.

173

The architecture (iii)

The object request broker (ORB)


The object adapter


174

Object Request Broker. The ORB is the part of the CORBA architecture
which provides the plumbing between distributed CORBA objects and the
clients that reference them. It is the ORB which carries out the process of
communication between distributed objects and it is the ORB that
communicates with the transport medium used to convey the raw data
used in object communication. There are a number of different ORBs
developed by software vendors. In the early days of CORBA these ORBs
were not compatible with each other: you could not send a message from
a client which had stubs generated by one ORB vendor to a server object
which had skeletons generated by another ORB vendor. However, version
2.0 of the standard specifies that all ORBs should be able to communicate
using the Internet Inter-ORB Protocol, usually abbreviated to IIOP.
Object adapter. This is a layer which enables a remote object to access
the facilities of the ORB.

174

CORBA services (i)

The life-cycle service


The persistence service
The event service
The naming service
The concurrency control service


175

The Life Cycle Service. This provides facilities for creating, copying,
transporting and deleting objects.
The Persistence Service. This provides facilities whereby objects can be
stored on some permanent medium including relational databases,
object-oriented databases and flat files.
The Event Service. This allows objects to register themselves as listeners
to events and respond to events; for example, an object might register
itself as a listener to an event which occurs when another object changes
one of its instance variable values and carry out some processing when
this occurs. This service also allows objects to de-register themselves
from events.
The Naming Service. This allows objects to be given names and located
by other objects which quote the name.
The Concurrency Control Service. This provides facilities which ensure
that concurrent processes are not allowed to access an object in such a
way that the object is left in an inconsistent state.
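
The register, notify and de-register cycle of the Event Service can be sketched in plain Java. The names below are invented for illustration and are not the CORBA Event Service API; the sketch shows only the idea of objects listening for a change in another object's instance variable.

```java
import java.util.ArrayList;
import java.util.List;

class EventSketch {
    // Hypothetical listener interface.
    interface Listener {
        void changed(String variable, int newValue);
    }

    // Stands in for the service that connects event sources to listeners.
    static final class Channel {
        private final List<Listener> listeners = new ArrayList<>();

        void register(Listener listener) {
            listeners.add(listener);
        }

        void deregister(Listener listener) {
            listeners.remove(listener);
        }

        // Called when an object changes one of its instance variable values.
        void publish(String variable, int newValue) {
            // Iterate over a copy so listeners may de-register while being notified.
            for (Listener listener : new ArrayList<>(listeners)) {
                listener.changed(variable, newValue);
            }
        }
    }

    public static void main(String[] args) {
        Channel channel = new Channel();
        Listener printer = (variable, value) -> System.out.println(variable + " is now " + value);
        channel.register(printer);
        channel.publish("balance", 10);
        channel.deregister(printer);
        channel.publish("balance", 20); // nobody is listening any more
    }
}
```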

175

CORBA services (ii)

The relationship service


The externalisation service
The query service
The licensing service
The properties service


176

The Relationship Service. This allows relationships to be established
between remote objects. For example, a book object can be specified to be
related to an author object by virtue of the fact that the author has written
the book.
The Externalisation Service. This enables data to be sent to or read from
a remote object using a technique akin to Java streams.
The Query Service. This allows queries to be sent to remote objects or
collections of remote objects using a syntax which is a superset of SQL.
The Licensing Service. This allows the use of an object to be monitored
in order, for example, to ensure that the user is charged for the use.
The Properties Service. This allows properties to be associated with an
object such as a creation date.

176

CORBA services (iii)

The time service


The security service
The trader service
The collection service


177

The Time Service. One of the problems in a distributed system is
managing time; for example, a transaction may need to be applied in some
temporal order, but the clocks of the servers involved in the transaction
may be inaccurate. The time service manages transactions on objects
within an environment where clocks may be out of synchronisation.
The Security Service. This is a service which ensures that facilities such
as authentication are provided.
The Trader Service. This is a service which is very much like a yellow
pages service in which distributed objects advertise what services they are
capable of providing; for example, a distributed object may advertise the
fact that it is capable of retrieving certain types of data from a database.
The Collection Service. This enables collections of distributed objects to
be associated together using standard collections such as queues and trees.

177

The IDL

Object services specified in the CORBA IDL
Convertor
Object definitions expressed in some base language

178

When a CORBA implementation is developed the first step is to decide on what
services are to be provided and then partition these services among classes. Each
of the classes is written in the IDL.
The CORBA IDL looks very much like an amalgam of C and C++. A software
tool transforms the IDL source into source expressed in some implementation
language such as Java.

178

An example of an IDL definition


// Fragment of IDL
module Tester {
    ..
    interface Single {
        attribute string exname;
        readonly attribute string location;
        string returnsVals(in string point);
    };
};


The IDL fragment above defines a module Tester which contains an interface
Single. The interface provides a single service, returnsVals, and is associated
with two attributes (instance variables, fields) which are strings. One of the
strings, location, can only be read; it cannot be written to.
The service returnsVals has a single argument, point, which is a string that is
only read (an in parameter); it is not written to.
The fragment above is very similar to a class definition and can be easily
translated into such a definition.

179

An IDL translation to Java


package Tester;
public interface Single extends org.omg.CORBA.Object{
String exname();
void exname(String arg);
String location();
String returnsVals(String point);
}
Inheritance


Here the IDL from the last slide has been translated into an interface contained in
a Java package. The exname attribute has been translated into a pair of methods
with the same name: an accessor, which takes no argument and returns the value of
exname, and a mutator (setter), which takes the new value as its argument. The
location attribute is associated only with an accessor, since it is readonly.
Finally, the method returnsVals implements the remaining facility in the IDL
fragment shown in the previous slide.
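
As a rough sketch of what such an interface asks of an implementer, the class below implements the same four methods in plain Java. Several things here are assumptions made for illustration: the local interface SingleOps stands in for Tester.Single and deliberately does not extend org.omg.CORBA.Object so that it compiles without a CORBA library, and SingleImpl and the behaviour of returnsVals are hypothetical.

```java
// Stand-in for the generated Tester.Single interface (hypothetical name;
// the real interface also extends org.omg.CORBA.Object).
interface SingleOps {
    String exname();                   // accessor for the exname attribute
    void exname(String arg);           // mutator for the exname attribute
    String location();                 // readonly attribute: accessor only
    String returnsVals(String point);  // the single service
}

// Hypothetical implementation, for illustration only.
class SingleImpl implements SingleOps {
    private String exname = "";
    private final String location;     // set once, never written again

    SingleImpl(String location) {
        this.location = location;
    }

    @Override public String exname() { return exname; }
    @Override public void exname(String arg) { this.exname = arg; }
    @Override public String location() { return location; }

    // The slides do not say what returnsVals computes; this body is an assumption.
    @Override public String returnsVals(String point) {
        return "values at " + point;
    }
}
```

Note that the accessor and the mutator share one name and are distinguished only by their parameter lists, which is why no getExname or setExname appears.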

180

Attributes in CORBA
Wide variety of attributes in CORBA
Examples include double, float, long, union,
boolean, enum, char
Sometimes the target language does not
support the attributes; in this case some
hack is used


CORBA supports a large number of attribute types; some examples are shown
above and most are self-explanatory. enum is a type which can take one of a number
of distinct values and struct is something akin to a record in that it can contain a
number of attributes.
When the target language does not support a type a hack has to take place.
For example, early versions of Java (before Java 5) did not contain anything akin
to an enum. In this case the IDL is translated into a number of Java constants,
together with methods which access these constants.
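
The constants-plus-methods pattern can be sketched by hand as follows. This is an illustration written for a hypothetical IDL declaration, enum Status { manager, worker, temp }; real IDL compilers generate something broadly similar, but the exact names and members differ and should be checked against the compiler in use.

```java
// Hand-written sketch of the "constants plus accessor methods" pattern for a
// hypothetical IDL declaration:  enum Status { manager, worker, temp };
final class Status {
    // One integer constant per enumerated value.
    static final int _manager = 0;
    static final int _worker = 1;
    static final int _temp = 2;

    // One shared object per enumerated value.
    static final Status manager = new Status(_manager);
    static final Status worker = new Status(_worker);
    static final Status temp = new Status(_temp);

    private final int value;

    // Private, so no values other than the three above can be created.
    private Status(int value) {
        this.value = value;
    }

    // Returns the integer behind this value.
    int value() {
        return value;
    }

    // Maps an integer back to its shared object.
    static Status fromInt(int i) {
        switch (i) {
            case _manager: return manager;
            case _worker:  return worker;
            case _temp:    return temp;
            default: throw new IllegalArgumentException("no such Status: " + i);
        }
    }
}
```

Because each value is a single shared object, two Status values can be compared with ==, which is close to how a built-in enum would behave.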

181

Some CORBA facilities

struct
sequence
array
enum


The slide shows some of the more important IDL facilities:


A struct is a record structure which can contain other IDL entities.
A sequence is very much like a one-dimensional array whose length can vary.
An array has a fixed length; it differs from the sequence in that it can only
accept other arrays which are of the same length through assignment.
An enum is an enumerated set of values, for example the set of statuses that an
employee has in a works management system (manager, worker, temp).

182

An application architecture
[Figure: client code sits on stub code and server code sits on skeleton code; both rest on the ORB, which in turn rests on the transport mechanism.]

The diagram shows a CORBA architecture. Client and server communicate via
stub code and skeleton code. Stub code implements a proxy object on the client
and skeleton code implements the corresponding proxy at the server. The proxy
objects communicate via the Object Request Broker which, in turn,
communicates over some set of transport protocols, normally TCP/IP.
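
The division of labour between stub and skeleton can be sketched in plain Java. Everything below is an assumption made for illustration: the Greeter service, the "operation:argument" message format and the use of a Function object in place of the ORB and the network are all hypothetical, and real stubs and skeletons are generated by an IDL compiler rather than written by hand.

```java
import java.util.function.Function;

// The service that client and server agree on (hypothetical).
interface Greeter {
    String greet(String name);
}

// Server side: the real object that does the work.
class GreeterServant implements Greeter {
    public String greet(String name) {
        return "Hello, " + name;
    }
}

// Server-side proxy: unpacks a request message and calls the servant.
class GreeterSkeleton {
    private final Greeter servant;

    GreeterSkeleton(Greeter servant) {
        this.servant = servant;
    }

    // A request has the form "operation:argument"; the reply is the result.
    String dispatch(String request) {
        int sep = request.indexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException("malformed request: " + request);
        }
        String operation = request.substring(0, sep);
        String argument = request.substring(sep + 1);
        if (operation.equals("greet")) {
            return servant.greet(argument);
        }
        throw new IllegalArgumentException("unknown operation: " + operation);
    }
}

// Client-side proxy: looks like a Greeter but only packs and sends messages.
class GreeterStub implements Greeter {
    private final Function<String, String> transport; // stands in for ORB + network

    GreeterStub(Function<String, String> transport) {
        this.transport = transport;
    }

    public String greet(String name) {
        return transport.apply("greet:" + name);
    }
}
```

A client would wire the two together with new GreeterStub(skeleton::dispatch) and then call greet as if the object were local; the client code never sees the message format.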

183

The previous slide


This shows the basic primitive storage
features of CORBA, very similar to those
found in early seventies languages such as
C


The previous slide exemplifies a point made at the beginning of the lecture: that
because CORBA has to deal with a multitude of target languages it cannot
contain any very sophisticated facilities which might not be easily implementable
in a language such as C or Ada.

184

Developing a CORBA application


Design the classes
Develop the IDL which describes the
classes
Develop the remote object class
Develop the server code
Develop the client code

There are a number of steps involved in developing a CORBA based system:


Carry out a conventional object-oriented analysis in which the classes making up
the application are identified.
Define the classes using the facilities of the IDL.
Generate the files detailed in the previous slide using an IDL compiler.
Use the files that have been generated to produce code for a server.
Use the files that have been generated to produce code for a client.

185

CORBA vs RMI
CORBA is somewhat more complicated to program
CORBA implementations are slower than RMI
CORBA is multi-language
CORBA offers more services
Simple objects

RMI simple to program
RMI is relatively fast
RMI originally Java-based; however, connections to CORBA objects are available
Not many services offered
Complex objects


The slide above describes the main differences between RMI and CORBA.
CORBA is very much more feature rich but is less efficient than RMI.

186

Comparing object schemes

Speed
Programming complexity
Degree of platform independence
The degree of language independence
The degree of complexity of objects created


There are a number of criteria used to judge an object scheme:


Speed. How fast the process of carrying out an action such as updating a
remote object is.
Programming complexity. How much extra programming is required in
order to process a distributed object, over the programming that would be
needed if the object was local.
The degree of platform independence. Whether the technology can be
used over a number of operating systems or is confined to one. For
example, DCOM can only be used with Microsoft operating systems.
The degree of language independence. Whether the technology can be
used simply within a wide variety of programming languages, something
which CORBA, for example, can do.
The degree of complexity of objects created. Some technologies such as
CORBA only allow simple objects made up from simple data types and
disallow subclassing. Other technologies such as RMI allow object-oriented
features such as subclassing.

187

Summary

CORBA is a big project


Based on an IDL
Multi-language
Mature
Offers a panoply of services
based on a lowest common denominator
approach

In summary, CORBA is the biggest object scheme around both in terms of
services offered and in terms of the effort being put into it world-wide.
It is based on an intermediate language for defining classes and is a
multi-language technology in that CORBA objects can be implemented in a variety of
programming languages.
It is a mature standard and offers many more services than other object schemes.
Because it has to target a number of languages it is based on a lowest common
denominator approach so that, for example, CORBA objects are very simple: as
simple as the most primitive language that offers CORBA facilities.

188

Lecture 5
Servers, database servers and
development technologies


189

Aims

Describe the concept of client/server computing


Look at some of the main types of server
Look at middleware
Concentrate on detailing one of the most
important types of server: the database server
Briefly examine some web development
technologies


This lecture will look at client/server computing in a little more detail. It will
look at the rationale for client/server computing. It will detail some of the more
important types of server and look at the role of middleware. Much of the lecture will
concentrate on database servers; next to Web servers they are the most important
type of server used in distributed applications.

190

Clients and servers


Servers provide some service, for example a
Web server dispenses Web files
Clients call on a service, for example a
browser will ask for a Web page
The distinction between server and client is
blurred


A server is a computer (or program) which provides some service. For example a
print server will react to print service requests by initiating print jobs.
A client will ask for a service, for example a news-reader program will ask for a
set of postings from a news server.
The distinction between clients and servers is not clear cut: a server can act as a
client to another server; for example a Web server, in order to carry out its
service, may act as a client to a database server.

191

Some servers

File servers
Web servers
Mail servers
Print servers

Database servers
Groupware servers
Object servers
Application servers

More detail later


The slide shows some of the most important servers:


File servers dispense large files to users.
Web servers dispense Web pages and associated files. Effectively a Web server
is a specialised form of file server.
Print servers initiate printing jobs from clients and notify them when they are
finished.
Database servers process queries on relational databases and return the results of
the queries.
Groupware servers mediate services such as co-ordinating diaries and arranging
meetings.
Object servers contain distributed objects which carry out services such as
responding to requests for functionality embedded in legacy software.
Application servers are dedicated to one or more applications such as airline
ticket booking.

192

The three layer model (revision)

Storage
layer

Business
object
layer

Client layer

Sometimes known as a three-tier model



This is the standard architecture used in client/server computing:


The storage layer contains permanent data stored in some database.
The business object layer contains objects which are found in the application
domain.
The client layer contains visual entities which are used to communicate
functionality.
This is not the only layered architecture: there are many more, ranging from the
two layer architecture upwards. However, it is the most popular.

193

Typical business objects

Customer
Book
Order
Review
Back order

For an online book store


The objects above are some of those which are associated with a book store such
as Amazon. They are distinct entities which are used for data storage and are
accessed and updated by application programs. Typical data which might be
stored in a customer object includes: the name, email address, list of past orders
and credit card details.

194

Two layered architecture


Web an example
Presentation and logic layer and data layer
Used when there is little or no processing
required
Difficult to maintain when there is a lot of
functionality
Thin client, thin server, fat client, fat server

Another quite popular architecture is the two layer architecture which contains a
layer which has all the visual objects and processing embedded in it and a data
layer which contains permanent data. It is used where there is little processing
and simple functionality; the World Wide Web is a good example of a two tier
architecture: it works because there is little processing required at the browser
end, just the display of pages.
Such architectures are problematic when they contain a lot of functionality which
changes: the process of versioning and broadcasting updates becomes very
complex.
The adjectives fat and thin are used to describe the amount of code which resides
at either the client or the server.

195

The network computer

Extreme view of client/server computing


Developed by Sun
Very fat server and very thin client
Makes updates easy
A return to mainframe computing?
Now having a resurgence
So far this approach has not been
hugely successful

An extreme view of a client-server system, where nearly all the
functionality of a system is embedded within a server with the client
acting virtually as a dumb terminal, is the network computer idea which
Sun Microsystems tried to popularise in the late 1990s. Here the idea was
that virtually all the software in a client-server system was embedded in
the server, with clients having minimal hardware. Such clients, known as
network computers, contained only input/output devices and memory, not
mass storage. They would access software such as word processors and
spreadsheets from the server. Network computing has a number of
advantages. The two main ones are that software version control
is easy, since software is executed on a server and so there are no
problems with ensuring that every client has the same version of
software, and that administration of a system containing network
computers is easy. The idea of the network computer has not taken off;
the fact that it looked very much like a return to the days of mainframe
computers and dumb terminals put very many potential customers off.

196

Middleware
Software that is interposed between client
and server
Two types: system and service middleware
An example of middleware is the software
that interfaces a browser and the Web.


Clients and servers do not talk to each other directly: interposed between them is
middleware. There are two types of middleware: system middleware carries out
general, system-level tasks such as transporting raw data around the Internet.
Service middleware is associated with a particular task or service, for example
the middleware which allows a client to query a database or allows a news reader
to interrogate a news server.

197

Examples of general middleware

Software for transporting raw data


Software for keeping replicated files in
synchronisation
Software which maintains a distributed file
store


All the examples above are of general (system) middleware. They are not tied to an
application but are used by applications.

198

Examples of service middleware


Software for querying a database
Software for retrieving postings from a
news group server
Software associated with a distributed
object scheme
Software for collecting a collection of
emails

All the examples above are of service middleware since they are all associated
with a particular service: a database service, a newsgroup service, a distributed
object server service and an email service.

199

Service middleware in action


[Figure: a request to modify the stock file passes through replication middleware to the main file on Computer 1 and to duplicate files on Computer 2 and Computer 3. The programmer sees only one file.]

The slide shows an example of service middleware in action. It shows a request
from an application program to modify a file being transformed into a request to
modify all the replicas of that file.
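
The fan-out from one request to several replicas can be sketched as follows. This is a toy, in-memory illustration under stated assumptions: the class names StockFileCopy and ReplicatedStockFile are hypothetical, each "computer" is just an object in the same program, and failures and network delays are ignored entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One copy of the stock file, as held on one computer (in memory here).
class StockFileCopy {
    private final Map<String, Integer> stock = new HashMap<>();

    void put(String itemId, int quantity) {
        stock.put(itemId, quantity);
    }

    Integer get(String itemId) {
        return stock.get(itemId);
    }
}

// Hypothetical replication layer: the programmer calls modify() once and
// the layer applies the change to the main file and to every duplicate.
class ReplicatedStockFile {
    private final List<StockFileCopy> copies = new ArrayList<>();

    ReplicatedStockFile(int numberOfCopies) {
        for (int i = 0; i < numberOfCopies; i++) {
            copies.add(new StockFileCopy());
        }
    }

    // One logical update becomes one update per copy.
    void modify(String itemId, int quantity) {
        for (StockFileCopy copy : copies) {
            copy.put(itemId, quantity);
        }
    }

    // Reads could be served by any copy; the index is exposed for inspection.
    Integer readFrom(int copyIndex, String itemId) {
        return copies.get(copyIndex).get(itemId);
    }
}
```

The application only ever holds a ReplicatedStockFile, which is the sense in which the programmer sees only one file.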

200

Message-oriented middleware
Manages the transactions that pass between
a client and a server and vice versa
Normally queue based
Model of interaction is small
Often used for connecting to legacy
software


Message-oriented middleware is interposed between clients and servers. Usually the
software administers a series of input and output queues with each queue
containing transactions. Because the underlying data is simple the API for such
software is usually very straightforward.
A major use for such software is to interface Internet-based applications to legacy
software. The interface is very simple and all it requires is programming code for
the Internet application to deposit transactions on a queue and the legacy
application to take transactions from the queue.

201

Queue-based message processing


[Figure: a set of queues sits between client and server. The client adds and removes server messages; the server adds and removes client/server messages.]

Message-oriented middleware is very simple: all it really consists of is a set of
queues which contain messages deposited by clients and servers which other
clients and servers can access.
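
A minimal sketch of the queue idea is given below. It assumes a single process and uses the standard ConcurrentLinkedQueue class in place of a real middleware product; the names MessageQueues, LegacyStockSystem and InternetApplication are hypothetical and no real product's API is being shown.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// The middleware itself: one queue in each direction.
class MessageQueues {
    final Queue<String> toServer = new ConcurrentLinkedQueue<>();
    final Queue<String> toClient = new ConcurrentLinkedQueue<>();
}

// Stand-in for a legacy application on the server side.
class LegacyStockSystem {
    // Takes one request off the queue, if there is one, and queues a reply.
    static void serveOne(MessageQueues queues) {
        String request = queues.toServer.poll();   // null when the queue is empty
        if (request != null) {
            queues.toClient.add("done: " + request);
        }
    }
}

// Stand-in for the Internet application on the client side.
class InternetApplication {
    // Deposits a transaction on the queue and returns immediately.
    static void send(MessageQueues queues, String transaction) {
        queues.toServer.add(transaction);
    }

    // Collects the next reply, or null if none has arrived yet.
    static String receive(MessageQueues queues) {
        return queues.toClient.poll();
    }
}
```

Neither side calls the other directly: each only adds to or removes from a queue, so the two can run at different speeds and need to agree only on the message format.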

202

Case study: MQ Series

Leading middleware product


Now known as Websphere MQ
IBM product
Processes four types of messages
Very simple API

Link to the MQ Series site



Without a doubt the leading message-oriented middleware product is MQSeries,
which has been developed by IBM. It processes four types of messages:
datagrams are one-way messages with no associated reply message to be
returned, for example a signal that a client is online; request messages are used
when the sender expects to get a reply, for example when sending a query to a
database; reply messages are the messages that are sent in response to request
messages; and, finally, report messages are used to signal to a client that some
unexpected event has occurred, for example that a server has malfunctioned. The
API for MQSeries is exceptionally simple and consists of just 11 program calls.

203

Database servers

Administers a set of relational tables


Lie at the heart of any data rich application
Responds to queries in a language known as SQL
Forms the major part of the data layer in a three
tier architecture

Wikipedia on databases


Database servers mediate access to a series of databases. There was once a rich set of
underlying models for database servers; virtually the only model left is known as
the relational model. It holds data in the form of tables.
Tables are queried and updated using a standardised language known as SQL
(Structured Query Language).
A database server lies behind the object layer in a three tier architecture.

204

A relational table
ItemId    Item       NoInStock
Aw222     Washer A   300089
Ntr444    Nut A      2009
Wdt675    Widget Q   300001
Bt56ww    Bolt A     200
Bt5556q   Bolt B     200009


This is an example of a small simple table. It holds records which specify the
stock levels in a warehouse. Each column describes a set of similar data. One
column is designated as a key which uniquely identifies each record.
A relational database consists of a number of these tables interlinked by common
data items.
The tables themselves are stored as files.

205

A simple example of an SQL statement

SELECT EmployeeName, Salary FROM Employee WHERE Salary > 45000

Names of columns: EmployeeName, Salary
Name of table: Employee
Selection condition: Salary > 45000

This is an example of an SQL statement which creates a two column table with
columns EmployeeName and Salary by selecting those rows of the table
Employee which contain employees whose salary is greater than 45000.
This is an example of a retrieval query; queries exist for other processes such as
creating tables, deleting tables and deleting rows.

206

Functions of a database server (i)

Interpret SQL statements


Optimise queries
Prevent concurrent access errors

MySQL an open source database server


A database server has a number of functions; three are shown above:


To interpret SQL statements sent to it by a client, execute them and send
back the results.
To optimise queries. An SQL query can be executed in a number of ways
and the difference in response time can be very large. A good database
server will examine an SQL query, look at how tables are stored and work
out an execution plan which minimises execution time.
To prevent the errors that occur when one user concurrently accesses
data which is being accessed by another user. Concurrent access,
particularly updates, can give rise to errors in a database and a database
server will lock areas of the database so that access is only allowed to one
user at a time.
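
The locking idea can be sketched in a few lines. This is a toy, single-process illustration which assumes locks are taken per row key; real database servers use far more elaborate lock managers, and the names LockTable, acquire and release are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lock table: maps a locked area (here a row key) to its holder.
class LockTable {
    private final Map<String, String> holders = new HashMap<>();

    // Grants the lock if the row is free or already held by this user.
    synchronized boolean acquire(String rowKey, String user) {
        String holder = holders.get(rowKey);
        if (holder == null) {
            holders.put(rowKey, user);
            return true;
        }
        return holder.equals(user);
    }

    // Releases the lock, but only if this user actually holds it.
    synchronized void release(String rowKey, String user) {
        if (user.equals(holders.get(rowKey))) {
            holders.remove(rowKey);
        }
    }
}
```

A user who is refused the lock has to wait or retry, which is how the server ensures that only one user at a time updates the locked area.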

207

Functions of a database server (ii)

To detect and recover from deadlock


To administer security
To administer backup and recovery


To detect and act upon deadlock. Deadlock occurs when one user
transaction has got exclusive access to a resource such as an SQL table
and is waiting for a resource which is held exclusively by another user
transaction; however, this second transaction is unable to proceed because
the first transaction has exclusive use of another resource that the second
transaction needs to proceed. A database server will detect such serious
conditions and remedy them in a drastic way, often by terminating one of
the user transactions; happily the termination is normally followed by the
re-execution of the transaction by the server, when usually the first
transaction has proceeded and has released the resource that the second
transaction was blocked on.
To administer security. A good database server will ensure that no user is
allowed access to a database who has not been authorised.
To administer backup and recovery. There are two aspects to this. A
database server will keep a log of transactions which is used to recover a
database when some large problem occurs such as a gross system failure.
This log keeps details of the transactions against tables and which parts of
them were affected. When a problem occurs the recovery facility of the
server will find a copy of the last saved version of the database and then
reapply all the transactions held in the backup log.
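
The recovery procedure just described (start from the last saved copy, then reapply the logged transactions) can be sketched as follows. It is an in-memory illustration only: the log format, the class names LogEntry and RecoverableStockTable, and the simulateCrash method are all assumptions made for the example, and a real server writes its log to stable storage.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One logged update: set the stock level of an item (hypothetical format).
class LogEntry {
    final String itemId;
    final int newLevel;

    LogEntry(String itemId, int newLevel) {
        this.itemId = itemId;
        this.newLevel = newLevel;
    }
}

class RecoverableStockTable {
    private Map<String, Integer> live = new HashMap<>();
    private Map<String, Integer> lastSaved = new HashMap<>();
    private final List<LogEntry> log = new ArrayList<>();

    // Records the update in the log, then applies it to the live table.
    void update(String itemId, int newLevel) {
        log.add(new LogEntry(itemId, newLevel));
        live.put(itemId, newLevel);
    }

    // Saves a copy of the table; earlier log entries are no longer needed.
    void backup() {
        lastSaved = new HashMap<>(live);
        log.clear();
    }

    // Loses the live table, as a gross system failure might.
    void simulateCrash() {
        live = new HashMap<>();
    }

    // Starts from the last saved copy and reapplies every logged update in order.
    void recover() {
        live = new HashMap<>(lastSaved);
        for (LogEntry entry : log) {
            live.put(entry.itemId, entry.newLevel);
        }
    }

    Integer level(String itemId) {
        return live.get(itemId);
    }
}
```

Reapplying the entries in their original order matters: a later update to the same item must overwrite an earlier one, exactly as it did before the failure.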

208

Referential integrity
Relational tables are consistent with each
other
Implemented via triggers
Associated with business rules which
govern the values that business objects
attain


The term referential integrity refers to the fact that tables in a relational
database are consistent with each other. The following examples are of databases
whose tables do not have referential integrity:
A customer is associated with a transaction which does not occur in a table.
A part stored in a warehouse that has no suppliers associated with it.
A supplier has been given a new reference number, yet the old reference number
of that supplier can still be found in other tables.
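
A check for the last kind of inconsistency can be sketched as below. The example assumes, purely for illustration, that a parts table is reduced to a map from part id to supplier reference number and a supplier table to a set of reference numbers; IntegrityChecker and danglingParts are hypothetical names. In a real database this kind of check is normally declared to the server rather than coded by hand.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class IntegrityChecker {
    // partToSupplier maps each part id to the supplier reference number it uses.
    // Returns the ids of parts whose supplier reference is not in the supplier table.
    static List<String> danglingParts(Map<String, String> partToSupplier,
                                      Set<String> supplierIds) {
        List<String> dangling = new ArrayList<>();
        for (Map.Entry<String, String> row : partToSupplier.entrySet()) {
            if (!supplierIds.contains(row.getValue())) {
                dangling.add(row.getKey());
            }
        }
        return dangling;
    }
}
```

An empty result means that every supplier reference in the parts table can be found in the supplier table, so the two tables are consistent in this respect.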

209

Referential integrity and business rules
Business rules govern the values that rows,
columns and tables have
Code at the database server imposes these
rules.
Example rule is that no student in a
university library can borrow more than five
books

Some examples of business rules:


No account holder is allowed to hold more than 10 accounts.
An order for a product worth more than 400 must be accompanied by an
official order form.
No student can borrow more than five books from the university library.
When a salesperson achieves sales of more than 10 000 in a year they will start
earning commission.
An employee on salary scale C will be allowed 25 paid holidays in a year.
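
The library rule can be sketched as a guard on an update. Here the rule is enforced in plain Java rather than by code at the database server, purely for illustration; the class LibraryLoans and its method borrow are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class LibraryLoans {
    // Business rule: no student can borrow more than five books.
    static final int MAX_BOOKS_PER_STUDENT = 5;

    private final Map<String, Integer> booksOnLoan = new HashMap<>();

    // Records the loan and returns true only if the rule would still hold.
    boolean borrow(String studentId) {
        int current = booksOnLoan.getOrDefault(studentId, 0);
        if (current >= MAX_BOOKS_PER_STUDENT) {
            return false;   // the update would break the rule, so refuse it
        }
        booksOnLoan.put(studentId, current + 1);
        return true;
    }
}
```

The important design point is that the check and the update sit in one place, so no application program can record a sixth loan by accident.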

210

Relational middleware

[Figure: the components of relational middleware: clients, SQL API, driver, protocol stacks, server software and database server.]

The figure above shows the components of relational middleware:


The first is the API (Application Programming Interface). This provides
programming facilities for developers who wish, for example, to embed SQL
code within procedural languages. The API is also used by SQL programs which
make internal calls to the facilities offered by the API.
The second is a database driver. This is usually a small piece of software which
takes SQL statements, formats them and then sends them over to the server.
The third is the protocol stack which is used for communicating between the
client and the server.
The fourth is server software. This includes conversion software. Such software
is used to make other database products look like the product supported by a
database vendor. Also included in this category is bridge software which converts
SQL code into a standard form supported by a number of database products.
A fifth category of software is that associated with remote administration of a
database. Most database products allow you to administer a database, for
example setting and removing security permissions from a remote client, and it is
this software that enables you to do it.

211

Distributed databases

Data spread around a number of servers


Done for performance reasons
Also done for reliability
Also done because of pragmatic migration
reasons

A distributed database is a database which has its data spread over a number of
servers. There are a number of reasons for this:
Performance: for example, by keeping a subset of the data close to its users a
distributed system can reduce transmission delays.
Reliability: by duplicating data across a number of servers, a failure of one of the
servers will only result in a performance degradation rather than loss of service.
Pragmatic legacy reasons: systems often consist of data which has been
gradually added at different locations.

212

Problems with distributed data

Replication problems
Concurrent access
Security
Reliability
Clock synchronisation


There are a number of problems with distributed databases:


If data is replicated then there is the problem of making sure that all the
replicated data is up to date without compromising performance.
Ensuring that concurrent access to a database which is distributed does not
result in the database holding erroneous data.
In a distributed environment where there are a large number of computers,
security is a much bigger headache than with say a single, mainframe
computer.
Reliability is a problem in a distributed environment. For example,
distributed transactions are chunks of processing which result in the state of
a system being changed and where a number of distributed databases are
affected. If there is a failure in a transaction which accesses a number of
distributed databases the process of reacting to this can be very complex,
particularly if each distributed transaction is split into a further number of
distributed transactions.
One of the problems in a distributed system is that clocks in the computers
connected to the system may be out of synchronisation with each other.

213

Date's rules for distributed databases

Continuous operation
Transparency
Replication independence
Mixing servers
Operating system independence
Optimisation of queries

Chris Date, who was one of the pioneers of relational technology, has
devised 12 rules which should be used to judge the effectiveness of a
distributed database technology. Six are shown below:
Continuous operation. A distributed database system should run
continuously; all maintenance operations should be applied to it while it is
running.
Transparency. The programmer or user should not be aware that a
particular database is distributed.
Replication independence. The programmer or user should not be aware
of the fact that a database has been replicated.
Mixing servers. It should make no difference to the development of a
system that a number of disparate servers are used. You should be able to
freely use and interchange database servers.
Operating system independence. The operating system used in a server
should make no difference to the system.
Optimisation of queries. A database server's query engine should be
aware of the distribution of data and be able to make decisions about the
way data is to be retrieved based on the location of the data.

214

Types of distribution

Downloading
Data replication
Horizontal fragmentation
Vertical fragmentation


There are four types of data distribution:


Downloading is the simplest: all it involves is periodically copying data to a
server from another data source.
Data replication is where either full or partial copies of a database are
maintained on a number of servers.
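Horizontal fragmentation is where a table is split by row into sub-tables, each of
which contains all the original columns but only some of the rows.
Vertical fragmentation is where a table is split by column, so that each sub-table
holds a subset of the columns for every row.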
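
The difference between the two kinds of fragmentation can be sketched on an in-memory table. The row layout follows the stock table shown earlier in the lecture (ItemId, Item, NoInStock); the class names, the method names and the choice of stock level as the splitting condition are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// One row of a stock table with columns ItemId, Item and NoInStock.
class StockRow {
    final String itemId;
    final String item;
    final int noInStock;

    StockRow(String itemId, String item, int noInStock) {
        this.itemId = itemId;
        this.item = item;
        this.noInStock = noInStock;
    }
}

class Fragmenter {
    // Horizontal fragment: whole rows, selected by a condition on the row.
    static List<StockRow> rowsWithStockAtLeast(List<StockRow> table, int minimum) {
        List<StockRow> fragment = new ArrayList<>();
        for (StockRow row : table) {
            if (row.noInStock >= minimum) {
                fragment.add(row);
            }
        }
        return fragment;
    }

    // Vertical fragment: every row, but only the key and the Item column.
    static List<String[]> keyAndItemColumns(List<StockRow> table) {
        List<String[]> fragment = new ArrayList<>();
        for (StockRow row : table) {
            fragment.add(new String[] { row.itemId, row.item });
        }
        return fragment;
    }
}
```

The vertical fragment keeps the key column so that its rows can later be matched up with the columns stored elsewhere.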

215

Programming a database
A number of APIs available
Need to provide facilities for retrieving,
updating and deleting data.
Facilities for connecting to a database
Facilities for querying a database
Facilities for processing results from a
query

There are a number of APIs which are available for programming a relational
database. They should provide a means of connecting a program to a database,
issuing queries against the database and processing the resultant
data sent back from the database server.

216

Case study: The Java JDBC API


The Java SQL API is shown above; it contains the following classes and interfaces:
Driver. This is associated with the database driver that is used to communicate with a
database.
Statement. This is used to create and execute SQL statements.
PreparedStatement. This extends Statement and is used to develop SQL statements
which have an increased efficiency when executed a number of times with different
arguments.
CallableStatement. This extends PreparedStatement (and hence Statement) and provides the
programmer with the facilities for calling stored procedures.
Connection. This contains facilities for connecting to a database.
ResultSet. When an SQL statement is executed a result set is usually returned; a ResultSet
gives access to its rows.
ResultSetMetaData. This is one of a collection of metadata classes which provide data about
the main entities that this package manipulates; it describes the columns of a result set.
DatabaseMetaData. This is another metadata class. In this case it provides information about
a database.
DriverManager. This is a class that manages the drivers that are available for connecting to a
database.
DriverPropertyInfo. This class is not used by application programmers. It contains a number
of instance variables which are used by drivers in order to connect into a relational database.

217

Programming database access


Load a driver
Establish a connection to a database
Associate an SQL statement with the
connection
Execute the SQL statement
Process the results from the statement
Close connection

The steps required to program access to a database are:


Load a driver which is compatible with the database that is to be
processed. This can be done programmatically or can be done via an
external reference.
Establish a connection to the database.
Associate an SQL statement with this connection.
Execute the SQL statement.
The SQL statement which has been executed will produce a table which
is stored in a ResultSet object. This object will contain a reference to the
rows of the table that has been formed by the execution of the SQL
statement. The rows are traversed and some processing applied.
Execute further SQL statements as above.
When the processing associated with the database is complete the
database is closed and the connection to the database is also closed.

218

A simple program (i)


import java.sql.*;

public class JDBCCode
{
    public static void main(String args[])
    {
        // Set the name of the file that is to be accessed
        // and the name of the driver
        String fileURL = ...;
        String driverName = ...;
        try
        {
            // Load in the driver programmatically
            Class.forName(driverName);
        }
        catch (ClassNotFoundException cfn)
        {
            // Problem with driver, display error message and
            // return to operating system with status value 1
            System.out.println("Problem loading driver");
            System.exit(1);
        }

The remaining slides detail a simple program which issues a query against a
database.
This part of the program imports the java.sql (JDBC) package and loads a driver. If there are
problems with loading a driver then execution terminates.

219

A simple program (ii)


        try
        {
            // Establish a connection to the database; the second
            // argument is the name of the user and the third
            // argument is a password (blank)
            Connection con =
                DriverManager.getConnection(fileURL, "Darrel", "");
            // Create a statement object
            Statement selectStatement =
                con.createStatement();
            // Execute the SQL select statement
            ResultSet rs =
                selectStatement.executeQuery
                    ("SELECT name, salary FROM " +
                     "employees WHERE salary > 35000");


This part of the program obtains a connection to the database, creates an SQL
statement and executes it. The statement obtains the name and salaries of those
employees in the table employees who have a salary greater than 35000.
The result of the query is placed in a ResultSet object which is traversed in the next
section of program.

220

A simple program (iii)


            String employeeName;
            int employeeSalary;
            while (rs.next())
            {
                employeeName = rs.getString(1);
                employeeSalary = rs.getInt(2);
                System.out.println
                    ("Name = " + employeeName +
                     " Salary = " + employeeSalary);
            }
            // Close down the result set, the SELECT statement
            // and the database connection, in that order
            rs.close();
            selectStatement.close();
            con.close();
        }
        catch (Exception e)

This part of the program traverses the result set formed from the query. It obtains
the first column content (the employee name) and the second column content (the
salary) from each element of the result set and displays them.
Finally the statement, connection and result set objects used are closed down.

221

A simple program (iv)


        {
            System.out.println
                ("Problems with access to database");
            e.printStackTrace();
            System.exit(2);
        }
    }
}

This final piece of code is executed if an error occurs, for example if the server is
down. The method printStackTrace provides extra diagnostic information.

222

Object to database mappings


Needed to map objects to relational tables
Often done by hand
However, there are many tools available
which carry this out automatically
Best known is TopLink, an Oracle product
Link to TopLink


When developing a three tier architecture using a relational database as the third
tier there is a need to map business objects such as Warehouse, Product,
PlaneSeat and Passenger to their relational table equivalents.
This can be done by hand by developing classes such as Product and inserting
retrieval and update SQL code within the methods used. However, there are a
number of modern tools which enable much of the effort to be automated. The
best known is TopLink.
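
The hand-written approach can be sketched as follows. This is only an illustration: the Product fields, the products table and its column names are assumptions made for the example, and a tool such as TopLink would generate or hide this kind of code rather than use it as written.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hand-written mapper between Product objects and rows of an assumed
// "products" table with columns id, name and units_in_stock.
final class ProductMapper {

    // Hypothetical business object; the fields are illustrative only.
    static final class Product {
        final int id;
        final String name;
        final int unitsInStock;

        Product(int id, String name, int unitsInStock) {
            this.id = id;
            this.name = name;
            this.unitsInStock = unitsInStock;
        }
    }

    // Parameterised SQL that a JDBC PreparedStatement could execute.
    static final String INSERT_SQL =
        "INSERT INTO products (id, name, units_in_stock) VALUES (?, ?, ?)";

    // Object -> row: column names mapped to column values, in column order.
    static Map<String, Object> toRow(Product p) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", p.id);
        row.put("name", p.name);
        row.put("units_in_stock", p.unitsInStock);
        return row;
    }

    // Row -> object: the reverse mapping, as would be done for each
    // row of a ResultSet.
    static Product fromRow(Map<String, Object> row) {
        return new Product(
            (Integer) row.get("id"),
            (String) row.get("name"),
            (Integer) row.get("units_in_stock"));
    }

    public static void main(String[] args) {
        Product p = new Product(7, "Widget", 120);
        Map<String, Object> row = toRow(p);
        System.out.println(row);
        System.out.println(fromRow(row).name);
    }
}
```

Every business class needs a mapper of this kind, which is why automating the work is attractive.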

223

Some technologies

The remainder of this lecture looks at some
technologies that are used to implement
web and Internet-based systems employing
some database back-end

The remainder of this lecture will look at technologies such as PHP, ASP.Net,
Ruby, the Ruby web application framework Ruby on Rails, application servers
and integration servers.

224

PHP

Creates dynamic web pages


Employs code inserts into HTML code
Open-source
Usually associated with the MySQL
database product and the Apache web
server


PHP is a language whose code is inserted into HTML documents and carries
out some dynamic processing on the server. It is a venerable technology dating
back to the mid-1990s, the early days of the Web. The next slide shows an example.

225

An example of PHP
<?php
if (strpos($_SERVER['HTTP_USER_AGENT'], 'MSIE') !== FALSE)
{
?>
<h3>strpos() must have returned non-false</h3>
<p>You are using Internet Explorer</p>
<?php
} else {
?>
<h3>strpos() must have returned false</h3>
<p>You are not using Internet Explorer</p>
<?php
}
?>


This example, found on the web site http://uk2.php.net/tut.php*, shows the
interrogation of the browser which is displaying the page containing the HTML.
It determines whether the user is employing the Internet Explorer browser.
*Copyright 2001-2006 The PHP Group

226

ASP.Net
A technology associated with the .Net
framework developed by Microsoft.
Involves embedding processing instructions
within a web page.
Such instructions can access databases,
produce forms, access other web servers.
ASP statements are executed on the web
server.

This is a technology similar to the Java Server Pages technology and PHP. It
allows the programmer to insert processing statements into an HTML page,
effectively embedding processing code within the HTML files.

227

Ruby
Has a longish history
Available for free
Everything is an object in Ruby, looks like an
amalgam of many languages including Smalltalk
Links to many Internet technologies including
XML, XML-RPC, Spaces, message-oriented
middleware etc.
Object-oriented
Dynamically typed language
Interpreted

Ruby is a language that is quite old. It lay dormant for a number of years.
However, the last couple of years have seen a huge increase in interest in the
technology. This has mainly been due to the emergence of the web development
framework Ruby on Rails, which has been reported as having given developers
huge increases in productivity. In some quarters it is seen as a successor to Java,
whose APIs are seen as over-complex and bloated; this is probably
somewhat optimistic.

228

An example of Ruby code


def example
  puts "start of example"
  yield
  puts "end of example"
end

Method

example { puts "middle of the example" }

Method call

This shows one of the strongest features of Ruby: the ability to execute a chunk of
code with another chunk of code as an argument. What the call does is to
effectively execute the code
puts "start of example"
puts "middle of the example"
puts "end of example"

229

Ruby on Rails
Web development framework
Used for client, business object, database
systems
Used for ecommerce, for example systems
which employ a shopping cart
Employs the reflection facilities in Ruby


What has made Ruby an important programming language is the fact that it has
given rise to one of the most productive web application frameworks. This is
known as Ruby on Rails. The framework relies on the fact that Ruby has the
facility to interrogate its own programs, for example a program can discover
the names of the methods that it has. Ruby on Rails is highly productive because it
concentrates on 80% of ecommerce systems: those that can be organised as
three-tier architectures and which involve common e-commerce functions.

230

Application servers
Host application objects
Best known examples are those servers
which host Enterprise Java Beans, for
example the BEA WebLogic server
Enables objects to be exposed to developers
and hide the underlying database details


An application server is one which hosts application objects, for example in a
banking system it would host account objects. Most application servers are
associated with the Java J2EE technologies. They allow concurrent access to the
objects and hide much of the underlying detail from the programmer, for example
the fact that access will be concurrent and that an object may be implemented
using a number of separate tables or even separate tables in separate databases.

231

Integration servers
Act as a hub in an integrated system
Coordinate and orchestrate the flow of data from
one component of an integrated system to another
Carry out activities such as marshalling, data
formatting and data transformation.
Increasingly important as integration becomes a
much more important paradigm.
An example of an integration server


An integration server is a server that sits in the middle of a system that has been
integrated by bringing together a number of pre-developed components. Some of
these components may not be able to work with each other directly, for example
they may have different data specifications, interfaces and protocols. The role of
an integration server is then to carry out the mediation that is necessary to realise
inter-working. Later in the course I shall look at integration in more detail.

232

Lecture 6
Web Servers

Internet Technologies

233

233

Aims

Describe the basic functions of a Web server


Look at the HTTP protocol including problems
Examine some dynamic technologies
Use Apache as a short case study
Describe how Web servers fit in with other technologies
Understand the relationship between a Web service and
implementation-specific and implementation-neutral code
Describe the main features of SOAP
Understand how HTTP can be used as a Web service
mechanism

In this lecture I shall be looking at almost certainly the most important type of
server: the Web server. In it I shall be looking at the basic processing cycle used
by a Web server and examining the HTTP protocol that is used by a client and a
Web server to communicate.
Early Web technologies just dispensed static pages; soon this was regarded as
very limiting and so a number of dynamic page technologies were developed; in
the lecture I shall look at some of these.
Apache is almost certainly one of the most popular Web servers; I shall be briefly
looking at it and using it as a case study.
Finally I shall look at the role of the Web server in distributed architectures.

234

Web server functionality


Simple: just responds to a request for a Web
resource and sends it back or issues an error
The resource could be a Web page or a file
associated with a Web page, for example a video
clip
Dynamic page technology is a slight complication
over and above this simplicity
Introduction to web servers

The functionality of a Web server is basically very simple: all it does is to
dispense files associated with a Web site. These files will be Web pages written
in HTML and any associated files: sound clips, video clips, graphics etc.
As you will see in this lecture there are a number of dynamic page technologies
which enable the page sending process to be interrupted and a page to be
dynamically modified before it is sent back to a browser. This complicates the
basic picture detailed in the previous paragraph only slightly.

235

The Web server processing cycle

1. Wait for a request
2. A request arrives
3. Server parses the request
4. Other information from browser is read, for example information
on the browser and its capabilities
5. Carry out service required and, perhaps, send file
6. Close the file
7. Finish: close file and may close network connection
8. Go back to step 1

The slide details the processes involved in responding to a browser request for
some resource. The program that carries out this process is known as HTTPD
(HTTP Daemon). A browser requests some resource, the daemon parses the
request to find out what is required, gets the resource and then sends it back to
the browser. If there is a problem then an error is signalled to the browser.
When a request is completed the connection can be closed and the processing
cycle continued.
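
The parse, fetch and respond steps of the cycle can be sketched in a few lines. This is a minimal illustration and not how a real HTTPD is written: the class name MiniHttpd is hypothetical, the Web site is stood in for by an in-memory map from paths to page content, and only the GET command is handled.

```java
import java.util.Map;

final class MiniHttpd {

    // Handle one request line such as "GET /index.htm HTTP/1.0".
    // 'site' stands in for the files of a Web site: path -> content.
    static String handle(String requestLine, Map<String, String> site) {
        // Parse the request into command, resource and protocol version
        String[] parts = requestLine.trim().split("\\s+");
        if (parts.length != 3) {
            return "HTTP/1.0 400 Bad Request\r\n\r\n";
        }
        if (!parts[0].equals("GET")) {
            return "HTTP/1.0 501 Not Implemented\r\n\r\n";
        }
        // Get the resource; signal an error if it does not exist
        String body = site.get(parts[1]);
        if (body == null) {
            return "HTTP/1.0 404 Not Found\r\n\r\n";
        }
        // Status line, headers, a blank line and then the resource
        return "HTTP/1.0 200 OK\r\n"
             + "Content-Type: text/html\r\n"
             + "\r\n"
             + body;
    }

    public static void main(String[] args) {
        Map<String, String> site = Map.of("/index.htm", "<H1>Home</H1>");
        System.out.println(handle("GET /index.htm HTTP/1.0", site));
        System.out.println(handle("GET /missing.htm HTTP/1.0", site));
    }
}
```

A real server would wrap this in a loop that accepts network connections, reads the request from the connection and writes the response back before closing it.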

236

HTTP: the Web protocol


Simple protocol, lots of commands and
parameters but basically simple
A request/response protocol
Essentially stateless
Handles requests, the sending of resources
and the signalling of errors
Three versions (0.9, 1.0, 1.1); 0.9 is defunct

HTTP is the protocol used for communication between a client running a browser
and the Web server. It is very simple in concept although it uses quite a number
of commands and arguments.
It is an example of a request response protocol: every request issued by a browser
elicits a response from a Web server.
It suffers from a major problem in that it does not remember state: ecommerce
applications, for example, require state to be tracked. The
best example of this is the shopping cart.
There are three versions of HTTP; no browser uses version 0.9 now.

237

An example of an HTTP request

GET /index.htm HTTP/1.0

Get me the Web page index.htm (usually a
home page); I am using version 1.0 of the
HTTP protocol

This is an example of the GET command, the most popular command in
the HTTP protocol set; it just asks for some resource to be sent back. The
resource in this case is a file containing the HTML for a home page. There may
be a number of files associated with this home page, for example graphics files;
the browser finds the references to these in the HTML and issues a further
request for each of them.

238

Another example of an HTTP request


POST /cgi-bin/findall.cgi HTTP/1.0
Content-Length: 46
query =findall&noviceuser=false&command=submit

Execute a program on the Web server;
the program will read a number of
arguments. The arguments are 46 chars
in length

These are the
arguments

This is an example of a command which is associated with forms. When you fill
in details in a Web form this command is sent to the Web server. It informs the
server that a program in the directory cgi-bin is to be executed and defines the
data that is to be processed by this program. The program will usually access
some file-based resource, for example files holding relational databases
maintained by a database server.

239

An example of an HTTP response


HTTP/1.1 200 OK
Date: Wed, 12 Jul 2006 14:01:08 GMT
Server: Apache/1.2.8 (UNIX) (Red Hat/Linux)
Connection: close
Content-Type: text/html
.. The HTML source

Headers


This informs the browser that its request has been successful (200 is a status code
which indicates this). It then sends back a number of headers which provide
information about the server and the resource that has been sent back. For
example the type of the server and the fact that HTML code expressed in plain
text has been returned.
After a blank line the resource is sent back.

240

Another example of an HTTP response

HTTP/1.1 404 Not found

The famous "Gone to Atlanta" response

This is an example of a simple response which indicates that a problem has
occurred; in this case the resource requested has not been found. There are a
number of error codes that are used by HTTP; 404 indicates that a resource has
not been found (404 is the area code for Atlanta in the United States).

241

HTTP Status codes


100-199 Informational
200-299 Request successful
300-399 Client request redirected
400-499 Client request in error
500-599 Server errors


The slide above contains the ranges of the status codes; some examples are:
200 OK
301 Moved permanently: the resource has been moved to another server
404 Resource not found
503 Service unavailable
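
A client can decide how to react to a response simply by looking at which range the status code falls in. The following sketch shows the idea; the class and method names are hypothetical and the category names are those used on the slide.

```java
final class StatusCodes {

    // Map an HTTP status code to the category given on the slide.
    static String category(int code) {
        if (code >= 100 && code <= 199) return "Informational";
        if (code >= 200 && code <= 299) return "Request successful";
        if (code >= 300 && code <= 399) return "Client request redirected";
        if (code >= 400 && code <= 499) return "Client request in error";
        if (code >= 500 && code <= 599) return "Server errors";
        return "Unknown";
    }

    public static void main(String[] args) {
        int[] examples = {200, 301, 404, 503};
        for (int code : examples) {
            System.out.println(code + ": " + category(code));
        }
    }
}
```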

242

The Common Gateway Interface


Used by Web programs to access
information about the environment and a
request
Web programs held in cgi-bin.
Web programs (scripts) access a number of
environment variables
Programming languages such as Perl have
facilities for interrogating these variables

The Common Gateway Interface is a standard for accessing information
concerned with a server and the requests which are made of it. When a request
comes into a Web server which needs some special execution to occur then the
program which carries out this execution will often require information resident
in the CGI. Normally such programs will be those which process forms.

243

Categories of CGI environment variables
Standard variables
Header variables
Variables associated with a particular server
mod_include variables
Special purpose variables


There are a number of categories of CGI variables:
Standard variables such as the name of the host server and the name of the
method being used, for example GET.
Header variables such as the MIME types that the client will accept or the URL
of the page from which the link was followed.
Variables associated with a particular server, for example Apache has a variable
SERVER_ADMIN which specifies the email address of the server administrator.
The variables set by include technology. This will be discussed later under
dynamic page technology.
Special purpose variables such as downgrade-1.0 which forces a request to be
treated as if it was a version 1.0 HTTP request.
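
A CGI program sees these variables as ordinary environment variables. The sketch below reads three of the standard and header variables (REQUEST_METHOD, SERVER_NAME and HTTP_USER_AGENT); the class name and the summary text are made up for the example, and the environment is passed in as a map so that the method can be tried without a Web server.

```java
import java.util.Map;

final class CgiEnv {

    // Build a one-line summary from a CGI-style environment.
    // Variables that the server has not set are reported as "(unset)".
    static String describe(Map<String, String> env) {
        String method = env.getOrDefault("REQUEST_METHOD", "(unset)");
        String host = env.getOrDefault("SERVER_NAME", "(unset)");
        String agent = env.getOrDefault("HTTP_USER_AGENT", "(unset)");
        return method + " request to " + host + " from " + agent;
    }

    public static void main(String[] args) {
        // When run by a Web server as a CGI program, System.getenv()
        // holds the CGI variables for the current request.
        System.out.println("Content-Type: text/plain");
        System.out.println();
        System.out.println(describe(System.getenv()));
    }
}
```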

244

HTTP is stateless
No memory between transactions
Severe constraint
A number of solutions including cookies
and URL rewriting
Often the solution is transparent to the
programmer
Introduction to cookies

One of the problems with HTTP is that there is no state memory between
transactions. This is a severe problem since many applications require memory of
previous page requests, for example an ecommerce application might need to
keep track of the identity of a user and what actions he/she has carried out.
There are a number of solutions to the problem. A common one is to keep data on
the client. This data is known as a cookie. Another solution is to send modified
URLs between the client and the server with the URL being built up of previous
state changes. Another solution involves invisible pages. This is explained in a
later slide.

245

Cookies
The storage of memory on the client to keep
track of state.
Example is the sequence of transactions
which take place in an e-retailing site.
Cookies can be temporary or permanent.
Browser can refuse cookies


Cookies are a common solution to the problems of state tracking. A cookie is data kept
on a client which keeps track of previous transactions or contains data which can
be reused time and time again over sessions. Cookies can hence be temporary or
permanent: a temporary cookie disappears at the end of a session, while a
permanent one persists across sessions.
Cookies are a security problem and hence browsers can be configured to reject
them.
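
Cookies travel in HTTP headers: the server asks the browser to store one with a Set-Cookie header and the browser returns it in a Cookie header on later requests. The sketch below builds the first and parses the second. The class name, method names and the cookie names used are hypothetical, and a negative lifetime is used here as a convention for a temporary cookie.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CookieSketch {

    // Header sent by a server to ask the browser to store a cookie.
    // A negative maxAgeSeconds gives a temporary (session) cookie;
    // otherwise the cookie is kept for the given number of seconds.
    static String setCookieHeader(String name, String value, long maxAgeSeconds) {
        String header = "Set-Cookie: " + name + "=" + value;
        if (maxAgeSeconds >= 0) {
            header += "; Max-Age=" + maxAgeSeconds;
        }
        return header;
    }

    // Parse the value of a Cookie request header such as "a=1; b=2".
    static Map<String, String> parseCookieHeader(String headerValue) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : headerValue.split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                cookies.put(pair.substring(0, eq).trim(),
                            pair.substring(eq + 1).trim());
            }
        }
        return cookies;
    }

    public static void main(String[] args) {
        System.out.println(setCookieHeader("cart", "abc123", -1));
        System.out.println(setCookieHeader("user", "darrel", 3600));
        System.out.println(parseCookieHeader("cart=abc123; user=darrel"));
    }
}
```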

246

Invisible pages
Involves the user sending forms data to the
server, for example an item bought.
The server sends back HTML containing
hidden elements which represent state
details
Gradually a page is built up which contains
the state

Another technique that is used to track state is to use hidden form elements. The
way that this works is as follows:
The user sends data to the server, for example the name of an item that has been
bought.
The server responds with a page which contains confirmation and asks the user
whether he or she wants to purchase more. This page contains invisible elements
which describe the built up state.
If the user says yes a purchase page with the same invisible elements is sent
back.
The user chooses another item and the server responds with another page with
this item specified as an invisible element
This continues until check-out.

247

An example of what is sent back from the server
<INPUT TYPE = "hidden" NAME = "Hide1"
VALUE = "MickyBook23566">
<INPUT TYPE = "hidden" NAME = "Hide2"
VALUE = "DumboDoll23337">
<INPUT TYPE = "hidden" NAME = "Hide3"
VALUE = "Alysband14556">

Here a page sent back from the Web server is shown. It contains a number of
invisible elements which contain data about three past transactions. This data is
not shown on the page. When the user makes another choice the data is sent back
to the server with the new choice made by the user.

248

What is sent by the client

Hide1=MickyBook23566
Hide2=DumboDoll23337
Hide3=Alysband14556
ItemSelected=BarbyDoll

Past choices

Current choice


This slide shows the data sent back by the client. It has received the current state
via a page similar to that displayed on the previous slide.
The user has augmented the state by making a choice. This has been added to the
current state and has been sent by a POST command.
When the server receives the data it will augment the BarbyDoll item with
numeric data which represents price etc.
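
The two halves of the exchange can be sketched as follows: the server renders the accumulated state as hidden elements, and the browser returns that state together with the new choice. The class and method names are hypothetical, the field names follow the slides, and the sketch assumes the values are plain alphanumeric strings so that no HTML escaping or URL encoding is needed.

```java
import java.util.List;

final class HiddenState {

    // Server side: render the past choices as hidden form elements
    // named Hide1, Hide2 and so on.
    static String hiddenInputs(List<String> pastChoices) {
        StringBuilder html = new StringBuilder();
        for (int i = 0; i < pastChoices.size(); i++) {
            html.append("<INPUT TYPE=\"hidden\" NAME=\"Hide").append(i + 1)
                .append("\" VALUE=\"").append(pastChoices.get(i))
                .append("\">\n");
        }
        return html.toString();
    }

    // Client side: the form data sent back by a POST command, made up of
    // the past choices followed by the current choice.
    static String postBody(List<String> pastChoices, String currentChoice) {
        StringBuilder body = new StringBuilder();
        for (int i = 0; i < pastChoices.size(); i++) {
            body.append("Hide").append(i + 1).append('=')
                .append(pastChoices.get(i)).append('&');
        }
        return body.append("ItemSelected=").append(currentChoice).toString();
    }

    public static void main(String[] args) {
        List<String> past = List.of("MickyBook23566", "DumboDoll23337", "Alysband14556");
        System.out.print(hiddenInputs(past));
        System.out.println(postBody(past, "BarbyDoll"));
    }
}
```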

249

Programming Web servers


Server page technology
Servlets
CGI programming with some language such
as Perl
Applets and active X controls


There are now a host of technologies used to program Web servers. In the next
slides I will look at each of the main technologies:
Server page technology involves embedding code within a Web page.
Servlets are in-memory programs written in Java.
CGI programming with a scripting language involves the programmer accessing
CGI environment variables which hold data on a Web transaction
Applets and Active X controls are code which interact with a browser.

250

Server side includes


Primitive technology: special command inserted in Web
page which the server sees and carries out substitution. It
can also involve the execution of a program which
sends data back to the user
Creation Date: <!--#echo var="LAST_MODIFIED" -->

Server side includes are commands inserted in a Web document which are
interpreted by the Web server which will carry out some action such as
substituting some text. It can also specify that a program is executed and its
output sent back to the user.
Server side includes were a very early and quite primitive attempt to provide
dynamic pages.

251

Servlets
Java technology intended to replace CGI
programming
Employs servlets specified in a Web page.
Efficient solution.
Respects OO conventions
Now mature, superseded to some extent by
JSP (Java Server Pages)

Servlets are a Java solution to Web server programming. They involve the
execution of Java code which is referenced in a Web page. The Java code can
call on all the facilities of the Java API, for example database processing
primitives, remote object primitives and transaction facilities.
It is an efficient solution which enables the programmer to structure his/her code
as an object-oriented system.

252

Some servlet code (i)


public void doPost(HttpServletRequest rq,
                   HttpServletResponse rp)
    throws ServletException, IOException
{
    ..
    String fName = rq.getParameter("FirstName"),
           sName = rq.getParameter("SurName");
    if(fName.equals("") || sName.equals(""))
    {
        //At least one of the text boxes is empty
        //Send an error page back to the user
    }

This and the next slide describe the code from a simple forms servlet. It processes a
request and produces some response. The first thing that the code does is to find
the value of some HTML forms objects, say text boxes; if these are empty then
the user has made an error and some error page is sent back to the user.

253

Some servlet code (ii)


    else
    {
        ..
        PrintWriter out = rp.getWriter();
        out.println
            ("Hello there <P>" + fName + " " + sName + "<P>");
        ..
    }
}

Here the rest of the code is shown. This is the code that is executed if the user has
typed in correct data. A writer is obtained which connects to the client browser.
This writer is then used to send back a simple HTML message involving
paragraph breaks (<P>).
Servlets are a technology which handles state tracking very easily and in a way
that is invisible to the programmer.

254

CGI programming via PERL (i)


use CGI qw(:standard);
open(COUNTREAD, "counter.dat");
$data = <COUNTREAD>;
close(COUNTREAD);
$data++;
open(COUNTWRITE, ">counter.dat");
print COUNTWRITE $data;
close(COUNTWRITE);

Here a simple program is displayed which keeps track of the number of times a
Web page has been accessed.
The first part of the program displayed above will read a data item $data from a
file, increment it and then write it back to the file. The data item represents the
number of accesses made to the Web page.
The program is taken from E-business and E-commerce, How to Program by
Deitel, Deitel and Nieto, Prentice-Hall.

255

CGI programming via PERL (ii)


print header;
print "<CENTER>";
print "<H1> You are visitor number </H1> <BR>";
for($count = 0; $count < length($data); $count++)
{
    $number = substr($data, $count, 1);
    print "<IMG SRC = \"images/counter/$number.jpg\">";
}
print "</CENTER>";

This code prints the count by extracting each digit from it and then displaying a
graphic corresponding to the digit. (Note that to print a " character inside a
string you need to precede it with a \ character.)

256

Applets and Active X


These are a form of server programming in
which code is executed on the client
Applets are a Java technology (Sun)
Active X is a Microsoft technology
Security problems with both, problems
solved in different ways.


Both the technologies here are similar. Both consist of code written in
some language such as Java or Visual Basic. A reference to the code is placed in
a Web page. When the page is downloaded to a browser the code of the applet or
Active X control is also loaded in. The code is then executed and carries out
some processing, for example forms processing.
Because code is loaded into a client computer there are potential security
problems. Java gets over this via a sandbox approach, while Active X controls are
digitally signed. These approaches will be detailed later in the course.

257

Server page technology


Two main examples Active Server Pages
(Microsoft) and Java Server Pages (Sun). Also
PHP.
Involves code being embedded into a Web page.
Code is translated into a program which interacts
with a Web server. In the case of JSP a servlet is
constructed.
JSP is now generally preferred to servlets

Server page technology is comparatively recent. It involves inserting code into a
Web page. A processor takes this code, turns it into an executable program which
then is executed at the server and interacts with the server.
There are two main technologies in use: JSP and ASP. The former is a Java
technology while the latter is a Microsoft proprietary product.
PHP is a less well known UNIX technology which is distinguished by its ability
to easily interface with relational databases.

258

A simple example
<% for (int j = 0; j < 100; j++)
{
if (j % 2 == 0) { %>
<P>
The value of the even integer is <%= j %>
<% } } %>

The code above is expressed in JSP. It just displays all the even integers between
0 and 100 on a Web page.
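
Since a JSP page is turned into a servlet, it can help to see roughly what the generated Java has to do. The sketch below is plain Java with a hypothetical class name, not the code a JSP engine would actually produce; it builds the same HTML in a string rather than writing it to the browser.

```java
final class EvenIntegersPage {

    // Build the HTML that the JSP loop is meant to produce: one
    // paragraph for each even integer from 0 up to, not including, 100.
    static String render() {
        StringBuilder out = new StringBuilder();
        for (int j = 0; j < 100; j++) {
            if (j % 2 == 0) {
                out.append("<P>\n");
                out.append("The value of the even integer is ")
                   .append(j).append("\n");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(render());
    }
}
```

Note that both the paragraph tag and the value sit inside the braces of the if statement; without the braces the test would guard only the first output statement.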

259

A short case study: Apache


Open source product
Certainly the most popular Web server
Name comes from the number of bugs initially
released
Developed by the Apache foundation
Associated with a number of other good open
source products such as Tomcat and Cocoon
Link to Apache Foundation


Apache is a robust, scalable Web server that is free. It was developed by the
Apache Foundation as an open source product. It is certainly the most popular
Web server, with something like 65% of the market.
It is associated with a number of other free products such as Tomcat the JSP
engine and Cocoon the Web publishing system.

260

Apache features (i)


Basic configuration: server name, IP
address etc.
Starting, stopping and restarting
Graphical configuration tools such as
Comanche
Access restriction
Error handling

The bullet points above contain the basic facilities of Apache. They enable the
server to be configured, started, stopped and restarted. Apache contains a number
of graphical tools which enable the administrator to set up a basic configuration
without the need for developing textual scripts.

261

Apache features (ii)


Provision of dynamic content: SSI, JSP and
PHP
Storing and loading CGI scripts
Triggering scripts on events
Hosting more than one Web site
Dynamic virtual hosting

Apache has facilities for easily integrating most dynamic content technologies
which are non-proprietary, for example Java Server pages (JSP).
It is easy to develop, maintain and store CGI scripts in Apache and to configure
Apache in such a way that scripts are executed when a certain event occurs, for
example an event associated with a security violation.
Apache allows the hosting of multiple sites using a technology known as virtual
hosting.

262

Apache features (iii)

Performance improvement
Fault tolerance
Monitoring Apache
Security provision


The Web administrator can improve the performance of Apache via a large
number of options. For example the server uses caching to keep popular pages in
main memory. The administrator can decide what these pages are and how big
the cache is to be.
Apache also contains facilities for fault tolerance, for example via clustering.
There are also a large number of facilities for monitoring the use of an Apache
Web server. For example a log file is constructed which contains details of all
accesses to the server. The Web administrator can look at this file to determine
what pages to cache.
Apache also has facilities for security, for example authenticating users,
integrating the Secure Sockets Layer (SSL) and processing digital certificates.

263

Client side programming


Technologies discussed have been server
side
There are technologies which enable
activity at the browser end, these allow
efficient processing of some web functions
such as data validation
The most important one of these is
JavaScript

The technologies that have been described so far have been technologies where
the programming has been at the server side, for example applets are stored at the
server and downloaded into the browser and executed and Java Server Pages are
stored at the server and executed there.
There are technologies which enable code to be directly embedded in a Web page
and executed by the browser.
The best known of these is JavaScript.

264

JavaScript

Nothing much to do with Java


Simplish scripting language
Also known as JScript and ECMAScript
Statements written inside HTML
An interpreted language
Interface with the Document Object Model

JavaScript was originally developed by Netscape as a scripting language for their
browser. It does not have very much to do with Java; it looks like an amalgam of
C, C++ and a little Java.
It contains all the control statements you would expect to find in C, together with
facilities that enable it to interact with browser pages.

265

A simple JavaScript program


<HEAD>
<TITLE> My example </TITLE>
<SCRIPT LANGUAGE = "JavaScript">
var numberTyped;
//process the int data
numberTyped = window.prompt("Enter a number", "0");
document.writeln("<H3>Test program</H3>");
document.writeln("Number typed is " + numberTyped);
</SCRIPT>
..

This is a very simple example of a JavaScript program. All it does is to create an
input box and prompt the user for a number. The number typed by the user is then
passed back to the user under a bold heading (<H3>).

266

JavaScript is just a simple programming language

Variables
Control structures
Arrays
Functions
Some objects such as strings
Access to browser page

JavaScript is not a complicated programming language; a programmer familiar
with C will be able to learn it very quickly. It contains all the facilities that you
expect in a simple procedural language. However, it also contains facilities for
accessing Web objects and objects such as strings.

267

Object access

charAt(index)
concat(string)
substring(start,end)
toLowerCase()

All messages that can be sent to JavaScript string objects

The four methods above are all associated with JavaScript strings.
charAt returns the character at an integer position in a string.
concat concatenates two strings together.
substring returns the substring designated by its arguments.
toLowerCase converts a string to lower case.

268

The Dynamic HTML object model


Exposes browser elements such as pages as
objects to which messages can be sent
JavaScript contains many methods which
access these objects
It also contains facilities for responding to
user events


JavaScript has access to the Dynamic HTML object model. This consists of a
number of classes which represent Web page elements such as applets and
scripts.
A large number of methods can then be used to interrogate information about an
object. For example the version number of a browser can be found out by sending
a message to a navigator object.

269

An example of access to the DHTML object model
<HEAD>
<TITLE> My example </TITLE>
<SCRIPT LANGUAGE = "JavaScript">
var numberTyped;
//process the int data
numberTyped = window.prompt("Enter a number", "0");
document.writeln("<H3>Test program</H3>");
document.writeln("Number typed is " + numberTyped);
</SCRIPT>
..

Examples

This example taken from a previous slide shows three accesses to DHTML
objects. The object window is a window into which data can be entered and
document is the current page being browsed.

270

Another example

if(navigator.appName == "Microsoft Internet Explorer"){
    if(navigator.appVersion.substring(0,1) == "4"){
    ...

The DHTML object navigator

Here the object navigator is accessed. This object contains information about
the browser being used: the property appName holds the name of the browser, and
the property appVersion holds a string which contains information about
the browser version and the operating system. The first character of the string
represents the browser version.
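The version check can be sketched as a small function. The helper name majorVersion and the sample string are assumptions; in a browser the value of navigator.appVersion would be passed in instead.

```javascript
// Hypothetical helper: returns the first character of an appVersion string,
// which the notes above describe as the browser version.
function majorVersion(appVersion) {
  return appVersion.substring(0, 1);
}

// Sample value made up for illustration so that the sketch runs outside a browser.
const sampleAppVersion = "4.0 (compatible; MSIE 4.01; Windows 98)";
const version = majorVersion(sampleAppVersion);

if (version == "4") {
  console.log("Version 4 browser detected");
}
```

Taking only the first character is as fragile as it looks: it assumes a single-digit version number at the start of the string.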

271

Some IE5 objects

window
document
body
history
navigator
event

anchors
applets
forms
images
links
plugins


272

The bullet points above represent some of the objects in the Internet Explorer 5
object model. Most of them are self-explanatory apart from event and history.
The object history represents a history of all the sites visited by a browser;
event is used as an object when a user event occurs such as a button being
clicked.
The objects in the right-hand column represent collections of objects on a Web
page, for example anchors contains an enumeration of all the anchors in a Web
page. Each element in such a collection can then be accessed by means of a for
loop.
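A minimal sketch of walking such a collection with a for loop. The helper collectHrefs and the stand-in object fakeLinks are assumptions; in a browser the collection document.links could be passed in directly.

```javascript
// Hypothetical helper: visits each element of a collection by index
// and returns the href of each one.
function collectHrefs(collection) {
  const hrefs = [];
  for (let i = 0; i < collection.length; i++) {
    hrefs.push(collection[i].href);
  }
  return hrefs;
}

// Stand-in for document.links so that the sketch runs outside a browser.
const fakeLinks = {
  length: 2,
  0: { href: "http://www.example.com/a.html" },
  1: { href: "http://www.example.com/b.html" }
};

const found = collectHrefs(fakeLinks);
console.log(found);
```

The loop relies only on a length property and numeric indexes, which is what the page collections provide.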

272

Web servers as part of a three tier architecture

Browsers

Web server

Data storage

273

Here a three-tier architecture is displayed with the client layer being implemented
by a browser. The middle layer is implemented by a Web server which contains
business objects. These objects may be implemented in a number of ways, for
example via remote object technology or via a technology such as servlets. The
final layer is that of data storage; normally this is implemented via some database
technology.

273

Developing Web pages

Uses the markup language HTML


Now done using special-purpose editors
Client code inserted
Style sheet facility provided by CSS


274

Web pages are developed using a special purpose markup language known as
HTML. The source of a Web page consists of the text of the page interspersed
with HTML instructions which format the text, for example displaying it in bold.

274

HTML elements

Elements
Tags
Attributes


275

When talking about HTML we refer to three components of an HTML document.


An element is the name given to a markup instruction such as UL, and a tag is the
name of the element enclosed in angle brackets. The final vocabulary term is
attribute. So far I have not introduced attributes. An example of one is shown
below.
<FONT COLOR="blue">
This text will be displayed in blue.
</FONT>
Here the FONT element enables the HTML coder to designate the colour, size and
font used for some text. The attribute COLOR is written within the tag <FONT>
and is set to the value blue. Many HTML elements have a number of options
which correspond to attributes, with defaults which allow some or all of the
attributes to be omitted. For example, the FONT element has a FACE attribute
which displays text in a specific font, so the HTML
<FONT COLOR="blue" FACE="Helvetica">
This text will be displayed in blue and in the Helvetica font.
</FONT>
will display the text between the tags in blue using the Helvetica font.

275

Page design principles (i)

Design for consistency


Concentrate on content
Allow for different browsers
Be sparing with colour

Design principles for the Web



276

Design so that consistency is achieved. This does not mean that Web
pages in the same site should look the same. What it means is that certain
elements should be implemented in the same way, for example the way
that links to other pages in the Web site are displayed should be consistent
across the site. Also pages which carry out the same function should look
the same, for example pages which contain forms should have the same
look and feel and pages which provide information about products that
are to be sold should also look the same.
Concentrate on user content. Do not use lots of space on a page for
navigation.
Make sure that different browsers display pages in similar ways. Even
today, different browsers will display a page in quite different ways. If
you expect users of the site to use a variety of browsers check out what
the pages look like and be prepared to design two or more different pages
which can be read by different browsers.
Be sparing with colour. One of the major errors made by the designers of
the early Web sites was to employ too much colour in a page, particularly
in the text where words were displayed in different colours. In deciding to
use a colour ask yourself why you want to use the colour. For example,
you may want to use red to highlight a very important point. A warning
though: overuse of a colour such as red can lead to readers ignoring the
text.

276

Page design principles (ii)

Do not make pages memory intensive


Make full use of linking
Make hyperlinks small
Use normal conventions for links


277

Do not design pages that take a long time to download. Although this
issue will decrease over the next five to ten years with broadband
connections becoming cheaper it is still a vital issue now. Research has
shown that users are just not prepared to wait for a long time for a page
containing animations and fancy graphics to download. Keep the bulk of a
page devoted to text. If you think that a user wants to look at a specific
graphic then display a small, memory-sparing version of the graphic (known
as a thumbnail) with a link to the large version of the graphic.
Make full use of linking. If a page consists of three sections which are
notionally separate, for example a description of the teaching of a
university department, the research and a list of the staff in the
department, do not place all the text for these sections on the same page;
place each in a separate page and link to them from a summary page.
Make hyperlinks small. If you link large amounts of text in a Web page
the text will become very difficult to read. If the link is short, say no more
than four words, it might not fully describe what is being linked to. If so,
place some text close to the link (say, in brackets) which describes what
resources the link points at.
Use familiar conventions for links. The default within browsers for
displaying links is to underline them in blue. Most users are culturally
used to this; designing a Web site with non-standard ways of
implementing links, for example as orange links against a black
background will confuse users.

277

Page design principles (iii)

Use style sheets


Use printed versions
Keep text short
Make text readable


278

Use style sheets throughout your site. This means that your pages will have a
uniform appearance and, when you decide to change some aspect of the
look and feel of your site, the change will be reflected throughout the site.
Provide printed versions of pages. If you think that a certain page, or
collection of pages, is going to be frequently printed then provide
downloadable versions of the pages in some word processing or document
display format. Browsers are still quite poor at printing pages and often
crop them.
Keep the text on a screen short. Users feel uncomfortable reading text
from a screen. Use hyperlinks to reference other pages which might have
been physically included in the text.
Make your text readable. For example use colours which contrast highly
with the background of a page, use a plain background, use big enough
fonts and do not use moving effects such as flashing text: they make a
page quite unreadable.

278

Page design principles (iv)

Signal the use of multimedia


Make the home page distinctive
Make sure that the user knows where they
are


279

Signal the use of multimedia. If you are going to use a multimedia
presentation, with all its attendant download problems, reference the page
containing the material from another page and warn the user what they
will be getting in terms of download and in terms of what the multimedia
does.
Design the home page in a different way to other pages, albeit using the
same style. The aim of a home page is to encourage users to enter the
Web site so it should succinctly describe what lies in the site and hence
should contain a map of the site.
Make sure the user knows where they are. The user should be aware of
where they are both in terms of the World Wide Web and in terms of the
site they are visiting, where they have been and where they could go.
Where the user is relative to the World Wide Web can be catered for by
having some graphic or noticeable text which tells the visitor what
company or organisation the site is associated with. In terms of where
they have been, adopt the practice of displaying past page titles when the
user is traversing a set of linked pages, for example when registering for
an ISP. These titles, together with the titles of future pages, can be
displayed prominently at the top of the page with the current page being
highlighted in a different colour. In terms of where a user can go, place
some navigation map in the page which shows the map of the site with the
current page highlighted in a prominent colour. If a user is traversing a
series of pages then don't forget to include a link backwards in the page as
well as the link forwards.

279

REST
Representational State Transfer
An alternative to SOAP-based Web services, not to HTTP itself
Uses HTTP and other protocols to
connect with a web server
Involves adorning the HTTP address (URL) with
information required for functionality.
An architectural style, not a standard

280

280

An example
INPUT
http://www.parts-depot.com/parts
OUTPUT
<?xml version="1.0"?>
<p:Parts xmlns:p="http://www.parts-depot.com"
xmlns:xlink="http://www.w3.org/1999/xlink">
<Part id="00345" xlink:href="http://www.parts-depot.com/parts/00345"/>
<Part id="00346" xlink:href="http://www.parts-depot.com/parts/00346"/>
<Part id="00347" xlink:href="http://www.parts-depot.com/parts/00347"/>
<Part id="00348" xlink:href="http://www.parts-depot.com/parts/00348"/>
</p:Parts>

281

This shows a REST request being issued and a REST-compliant server
returning the data required (a list of parts that are stocked).
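A minimal sketch of what a client might do with such a response: pull out the link for each part so that it can be requested in turn. The helper partLinks is an assumption, and a regular expression stands in for a proper XML parser only because the format here is fixed.

```javascript
// The response from the slide above, held as a string so the sketch runs without a network.
const partsXml = `<?xml version="1.0"?>
<p:Parts xmlns:p="http://www.parts-depot.com"
         xmlns:xlink="http://www.w3.org/1999/xlink">
  <Part id="00345" xlink:href="http://www.parts-depot.com/parts/00345"/>
  <Part id="00346" xlink:href="http://www.parts-depot.com/parts/00346"/>
  <Part id="00347" xlink:href="http://www.parts-depot.com/parts/00347"/>
  <Part id="00348" xlink:href="http://www.parts-depot.com/parts/00348"/>
</p:Parts>`;

// Hypothetical helper: collects the id and xlink:href of every Part element.
function partLinks(xml) {
  const links = [];
  const pattern = /<Part\s+id="([^"]+)"\s+xlink:href="([^"]+)"\s*\/>/g;
  for (const match of xml.matchAll(pattern)) {
    links.push({ id: match[1], href: match[2] });
  }
  return links;
}

const parts = partLinks(partsXml);
console.log(parts.map((p) => p.href));
```

Each href is itself an address that can be requested, which is the sense in which the information needed for the next step is carried in the URL.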

281

Web services

A network accessible interface to application functionality... Built using Internet technology


282

There is nothing complicated about Web services. They are just services which
can be accessed over the Internet. They are not associated with a particular
technology such as XML, although the SOAP technology detailed in this lecture
has a link with XML.

282

Web services
Built on top of Internet protocols
Most popular protocol becoming HTTP, although
other protocols can be used, for example POP3
The Web service acts as an abstraction layer
Ongoing effort to build Web services on top of the
Java 2 Enterprise Edition (J2EE)
Messaging is the underlying architecture


283

A Web service uses a standard Internet protocol. Most applications embed their
messages in HTTP since the Web browser and the Web server
already have built-in functionality to handle HTTP. However, other
protocols (even mail protocols such as POP3) can be used.
The Web service acts as a buffer between the application code, for example
database code, and the client (usually a browser).
The area of Web services is an actively growing one with vendors such as IBM
and Microsoft putting large amounts of resources into providing frameworks for
such services.
Since messaging is the main architecture still used on the Internet it is the one
used for implementing Web services.

283

Components of a Web service

Web Services Description Language (WSDL)
Simple Object Access Protocol (SOAP)
Universal Description, Discovery and
Integration (UDDI)


284

There are three components to a web service; all of them are standardised and
implemented using XML.
The Web Services Description Language (WSDL) is a simple language that
describes the service that a Web service provides. In effect it describes the
interactions between a client and a Web service.
SOAP is the protocol that is used to communicate between a client and a Web
service. It defines the format of the service request and the format of the response
by the service. If any errors occur it is also used to send error
details.
UDDI is the part of the web service architecture which is concerned with
defining the various registries that are used to contain documents expressed in
WSDL.

284

A Web service architecture

[Diagram: the service provider publishes its service to the service registry (WSDL, UDDI); the service requestor finds the service in the registry (WSDL, UDDI) and then binds to the service provider (WSDL, SOAP).]

285

This reprise of a previous slide shows where each of the component
parts of a Web service fits within the generic architecture.
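The publish, find and bind steps in the architecture can be sketched with a toy in-memory registry. Every name here is hypothetical: in a real deployment the description would be a WSDL document, the registry a UDDI registry and the bind step a SOAP exchange.

```javascript
// Toy in-memory registry; stands in for a UDDI registry.
const registry = new Map();

// Publish: the service provider registers a description of its service.
function publish(name, description) {
  registry.set(name, description);
}

// Find: the service requestor looks the description up by name.
function find(name) {
  return registry.get(name);
}

// Bind: the requestor uses the description to call the provider.
function bind(description, request) {
  return description.handler(request);
}

publish("bookPrice", {
  endpoint: "http://www.example.com/BookPrice",   // made-up address
  handler: (isbn) => ({ isbn: isbn, price: 20 })  // stands in for a remote call
});

const description = find("bookPrice");
const reply = bind(description, "0-000-00000-0");
console.log(reply);
```

The requestor never refers to the provider directly; it learns everything it needs from the description held in the registry.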

285

Company roles

Service requestor
Service provider
Registry (White pages)
Broker (Yellow pages)
Aggregator (Green pages)


286

There are a number of roles that a company can adopt with respect to Web
services.
A service requestor is a company that makes use of some web service.
A service provider is some company that provides the Web service.
A registry is a company that collects data on what services are provided, usually
in a specific area, for example in commodity broking.
A broker is a registry that offers intelligent search services.
An aggregator is a company that is a broker but also has the ability to describe
policy, business processes and binding descriptions

286

Potential revenue models

Transactional model
Membership or subscription model
The lease or licence model
The business partnership model
Registration model


287

There are a number of revenue raising models that can be used for web services.
The transactional model is the most familiar where the charge for a service is
made based on the number and type of transactions initiated by the client.
The membership (subscription) model is based on paying a fee for access to a
Web service.
The lease (licence) is similar to the membership model but is normally adopted
between two large companies whereby the membership is customised rather than
shrink wrapped.
The business partnership model is based on activities such as bartering of
services, equity, or even a percentage of gross revenue of the requestor.
The registration model is normally applied to the green pages type of access
where the publication of a service by a company implementing such a model is
based on using the company as a sort of shop window.

287

Advantages of Web services

Promotes interoperability
Enables just-in-time integration
Reduces complexity by encapsulation
Enables interoperability of legacy
applications

288

Web services interact via XML protocols. This means that all the agents that
interact in a Web service environment are able to operate independently of things
like platform and operating system.
Just-in-time integration enables an application to interact with services at run
time. For example, a client might want one type of service from one Web service
provider, get it, and then during the running of an application obtain a slightly
different service from another provider.
Encapsulation enables the details of a service to be hidden from the client. All
that the client needs to know is how to call on the service.
A legacy application such as an accounting package can be front-ended by a Web
service and the facilities offered by the package offered to clients. It is irrelevant
to the clients that the package might be some legacy software. This has to do with
the encapsulation advantage in that a Web services architecture hides the
underlying details of an implementation.

288

The abstraction layer


[Diagram: application code (depends on implementation) on the left and the application client (implementation independent) on the right, with the Web service between them as the abstraction layer.]

289

Here we see the idea of the Web service as an abstraction. On the left-hand side
we have code which is specific to a particular platform and programming
language. Note that this code may span a number of computers in a network
where, for example, the network may use a proprietary protocol.
On the right-hand side we see clients communicating with the implementation-specific
functionality using standard non-platform specific technologies and
protocols such as browsers, HTTP or even an email protocol such as POP3.
The service isolates the client from the dirty, machine-specific parts of the
application.

289

Web service layered architecture


Discovery
Description
Packaging
Transport
Network


290

The ideal Web service has a five layered architecture which provides: discovery
functionality, finding out about a service; description functionality, understanding
what a service offers; packaging functionality, sending data around a network in
a way that an application can understand it; transport functionality, using
standard protocols to send data; and network functionality, which carries out the
raw transport of data using low level protocols such as TCP/IP.

290

The five layers and technologies

Discovery, UDDI and WS-Inspection


Description, WSDL, RDF
Packaging, XML and SOAP
Transport, HTTP, SMTP
Network, TCP, IP

Introduction to web services


291

There are a number of projects and ongoing technologies which are used for
implementing Web services. The lower down the hierarchy the more established
the technology.
The UDDI project and the WS-Inspection project are both concerned with
providing a sort of sophisticated directory system for clients.
The WSDL and RDF projects are concerned with providing enough information
to a client so that they can use a service, for example what packaging protocol is
supported by the service.
The main packaging technology SOAP is concerned with providing a language
into which service payloads are embedded.
Transport is provided by standard Internet protocols.
Basic network functionality is provided by the low level Internet protocols.

291

SOAP
Major packaging technology
Defined in XML
Contains message format, value encoding
and exception-reporting mechanisms
Usually uses HTTP as a transport carrier


292

SOAP (Simple Object Access Protocol) has rapidly become a major player in
Web services. It has been defined in XML and contains a number of facilities
such as that for defining the format of a message, specifying the values of
parameters and providing information about exception processing. It normally
uses HTTP as a mechanism for communication.

292

SOAP messages
SOAP header: routing, security and delivery info
SOAP body: payload information such as transaction fields


293

A SOAP message sent, say, from a client running a browser to a server consists of
two items: a header, which specifies administrative information such as security
settings or routing data, and the body, which carries the message itself.

293

A SOAP message (i)


<s:Envelope
xmlns:s="http://www.w3.org/2001/06/soap-envelope">
<s:Header>
<m:transaction xmlns:m="soap-transaction"
s:mustUnderstand="true">
<transactionID>
7788
</transactionID>
</m:transaction>
</s:Header>


294

This shows the header of a SOAP transaction. It uses the definition of SOAP
found on the World Wide Web Consortium Web site. The header just gives a
transaction number and states that the recipient must understand SOAP
transactions. Note that this is expressed in an XML-based language.

294

Must understand
<m:transaction xmlns:m="soap-transaction"
s:mustUnderstand="true">

The recipient of the message must understand SOAP transactions (global
condition). Different types of mustUnderstand can occur


295

A recipient may or may not understand a header (it must understand the message).
The mustUnderstand attribute specifies which part of a transaction the recipient
must understand; in the example above the recipient must be capable of
understanding SOAP transactions.

295

A SOAP message (ii)


<s:Body>
..
<Bookpurchase>
<customer>
Darrel Ince
</customer>
<CreditCard>
Visa
</CreditCard>
<CardNumber>
765433221256
</CardNumber>
<Address>
23, The Laurels, Nottingham...
</Address>
</Bookpurchase>
</s:Body>
</s:Envelope>

296

This is the payload of the SOAP message. It contains data which is used when a
client wants to purchase a book from an online book store such as Amazon. Note
the use of XML conventions.
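The two halves of the message can be put together by a small function. This is only a sketch: the helper buildEnvelope is an assumption, the Header and Body element names are capitalised as in the SOAP specification, and no escaping of the payload is attempted.

```javascript
// Hypothetical helper: wraps a transaction header and a payload in a SOAP envelope.
// The namespace URI is the one used on the earlier slide.
function buildEnvelope(transactionId, payloadXml) {
  return [
    '<s:Envelope xmlns:s="http://www.w3.org/2001/06/soap-envelope">',
    '  <s:Header>',
    '    <m:transaction xmlns:m="soap-transaction" s:mustUnderstand="true">',
    '      <transactionID>' + transactionId + '</transactionID>',
    '    </m:transaction>',
    '  </s:Header>',
    '  <s:Body>',
    payloadXml,
    '  </s:Body>',
    '</s:Envelope>'
  ].join("\n");
}

// A cut-down payload; a real purchase would carry the other fields as well.
const payload = "    <Bookpurchase><customer>Darrel Ince</customer></Bookpurchase>";
const envelope = buildEnvelope(7788, payload);
console.log(envelope);
```

The header always comes before the body, and both sit inside the single Envelope element.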

296

SOAP faults
Special type of message which describes errors

Fault code
Fault string
Fault actor
Fault details


297

When an error occurs in a SOAP transaction, for example the recipient does not
understand the message, a SOAP fault is generated. This contains a fault code
which identifies the type of error, a fault string which is a readable version of the
error, the fault actor which identifies where the error occurred and
application specific details of the error.
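A minimal sketch of a recipient producing the four parts of a fault when the payload cannot be processed. The helper validatePurchase, the field names and the address are assumptions; the code value Client is the one SOAP 1.1 uses for errors caused by the sender.

```javascript
// Hypothetical helper: checks a purchase and returns null when it is acceptable,
// otherwise an object holding the four parts of a SOAP fault.
function validatePurchase(purchase, actor) {
  if (!/^\d+$/.test(purchase.cardNumber)) {
    return {
      faultCode: "Client",                      // type of error
      faultString: "Invalid card number",       // readable version of the error
      faultActor: actor,                        // where the error occurred
      faultDetails: { field: "CardNumber" }     // application specific details
    };
  }
  return null;
}

const actor = "http://www.example.com/BookTrans"; // made-up address
const accepted = validatePurchase({ cardNumber: "765433221256" }, actor);
const fault = validatePurchase({ cardNumber: "7654-XXXX" }, actor);
console.log(accepted, fault);
```

In a real exchange the fault would be serialised as XML and returned in the body of the SOAP response.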

297

SOAP and HTTP


Natural match
Most common transport medium
SOAP request is posted to the server with
the HTTP request
The server replies with the HTTP response


298

Since HTTP is the interaction mechanism between a browser and a Web server it is a
natural choice to use HTTP as the medium for sending SOAP messages. The
mechanism within HTTP is to use the POST command.

298

SOAP HTTP request


POST /BookTrans HTTP/1.1
Content-Type: text/xml
...
<s:Envelope xmlns:s="...">
...
</s:Envelope>


299

This shows how a SOAP transaction can be embedded within a POST HTTP
request.
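From the client side, the same request can be sketched as the pieces a program would hand to an HTTP library. The helper soapRequest and the host name are assumptions; only the path and the content type come from the slide.

```javascript
// Hypothetical helper: builds the pieces of an HTTP POST that carries a SOAP envelope.
function soapRequest(url, envelopeXml) {
  return {
    url: url,
    options: {
      method: "POST",
      headers: { "Content-Type": "text/xml" },
      body: envelopeXml
    }
  };
}

const request = soapRequest(
  "http://www.example.com/BookTrans",  // made-up host; the path is the one on the slide
  "<s:Envelope>...</s:Envelope>"       // placeholder for a complete envelope
);
console.log(request.options.method, request.url);

// In an environment that provides fetch, the request could then be sent with:
//   fetch(request.url, request.options)
```

Nothing here is specific to SOAP as far as HTTP is concerned: the envelope is simply the body of an ordinary POST.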

299

SOAP HTTP response


HTTP/1.1 200 OK
Content-type: text/xml
Content-length: 3400
<s:Envelope xmlns:s="...">
...
</s:Envelope>


300

The response to the request detailed in the previous slide is shown above. It does
not require any changes to the HTTP specification.

300

WSDL
Language used to define the services that
are offered by some enterprise
Expressed in XML (see later)
Requires a lot of detailed XML coding
Defines each service that is offered


301

The Web Services Description Language (WSDL) is a language which is used to
define the services offered by an enterprise on the Web.
It is expressed using a technology known as XML. I will be looking at this
technology in the next lecture.
The listings of WSDL code are long and complicated and it serves no purpose to
show much detail in this lecture.

301

Small example of WSDL


<message name="GetEndorsingBoarderRequest">
<part name="body"
element="esxsd:GetEndorsingBoarder"/>
</message>


302

This forms part of a very large Web site which is concerned with snowboarding. It
forms part of the definition of a service in which clients can ask questions such as
"Which snowboarders endorse snowboard xxxx?"

302

UDDI
Universal Description Discovery and Integration
Industry initiative first started by IBM, Microsoft,
NTT and SAP
Relies on standard technologies such as XML
A framework for describing services, discovering
businesses, and integrating business services using
the Web.
It has been developed as a platform-independent
and open framework

303

UDDI is the remaining part of the trilogy behind Web services (SOAP and WSDL
are the others).
It relies on standard Internet technologies and is used to produce registries which
contain descriptions of services offered by companies on the Web.
It is an open system which does not depend on a hardware or software platform.

303

Other Web Service implementations

Sockets
CORBA or RMI
Servlets
RPC (XML-RPC)
Just using standard HTTP

304

It is important to point out that there are a variety of technologies that can be
used for Web services, ranging from the primitive (sockets) to the complicated
and sophisticated (CORBA). The main principle is that the service acts as a buffer
between implementation-specific code and implementation-neutral code (Internet
code), so that the programmer at the client side is offered
implementation-independent hooks into the application.

304

Summary (i)
A Web service can be implemented in any way.
A variety of technologies can be used
including sockets and servlets
Web services are predominantly message
based
Web services are usually based on HTTP

305

305

Summary (ii)
Increasingly the main technology being
employed for Web services is SOAP
SOAP is a messaging technology
It is defined by XML
Much more work needed on SOAP, for
example tool sets.


306

306

Lecture 7
Web 2.0


307

307

Aims
To place Web 2.0 in the context of Web 1.0
and the semantic web
To look at the concept of the writable web.
To look at a number of technologies
associated with Web 2.0
To look at the idea of the Internet as a
computer

308

The Writable Web

Web 1.0. Traffic mainly in one direction


Web 3.0. The embedding of semantics
within web pages

So what is Web 2.0?



309

We have seen the rise of the World Wide Web since the early nineties. We have
seen lots of talk and research about Web 3.0 or the semantic web (dealt with in
the next lecture), so what is Web 2.0? It is a term popularised by the publisher Tim
O'Reilly to describe the increase in two-way traffic that is occurring now. He also
used it to describe the increase in community-type use of the web.

309

The Writable Web

[Diagram contrasting Web 1.0 and Web 2.0 in their relation to the world: one-way traffic in Web 1.0, two-way traffic in Web 2.0.]


310

This is the major change that has happened to the World Wide Web. From being a
mainly one-way communication medium it has been transformed into a two-way
medium where the client can write to the server.

310

Features of Web 2.0 according to Tim O'Reilly


311

Here is the original concept of Web 2.0 as envisaged by Tim O'Reilly.

311

Three examples
DoubleClick vs AdSense. The former feeds large web sites;
the latter addresses the long tail and deals with cost per click
processing.
Flickr. Photo sharing and display site. Allows users to
collect photos as sets and also allows users to tag photos
(a folksonomy).
Content management systems vs Wikis. The former are
highly structured ways of organising web content, for
example Cocoon. The latter are free form ways of enabling
users to develop large collections of text.

312

The three examples above are examples of the way that Web 2.0 has affected
commercial applications. All employ collective intelligence and large amounts of
user participation. All are capable of being mashed via APIs.

312

The long tail


313

A key work for Web 2.0 companies. Anderson explores the fact that the area
under the tail of a long tail distribution can be greater than the area under its
head: that, for example, sales of the last 100,000 books on Amazon will be greater
than for the top 100.
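The arithmetic behind the claim can be checked with a toy model. The assumption, which is mine and not a measured figure, is that the book at sales rank r sells in proportion to 1 / r.

```javascript
// Assumption: the book at sales rank r sells in proportion to 1 / r.
// Returns the total share of sales for the ranks firstRank..lastRank inclusive.
function salesShare(firstRank, lastRank) {
  let total = 0;
  for (let r = firstRank; r <= lastRank; r++) {
    total += 1 / r;
  }
  return total;
}

const head = salesShare(1, 100);       // the 100 best sellers
const tail = salesShare(101, 100100);  // the next 100,000 titles

console.log("head:", head.toFixed(2), "tail:", tail.toFixed(2));
console.log("tail outsells head:", tail > head);
```

Under this assumption each title in the tail sells very little, yet there are so many of them that together they outweigh the best sellers.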

313

Some technologies and techniques

XML
Ajax
Wikis
Blogs
Frameworks

Platform APIs
Yahoo Pipes
Tagging
Genetic algorithms


314

There are a number of technologies that have enabled this growth. This lecture
will describe many of them.

314

The Internet as the computer


[Diagram: the Internet at the centre, linked to mass computing, desktop, grid computing, RSS, Web based file storage, community sites and collaboration based development.]


315

We are now seeing a number of convergences whereby the Internet is
increasingly looking more and more like a computer, albeit a very large
computer.

315

Grid computing

Dealt with later in the course


Has arisen from early mass computing sites
Shares processors around a number of
clients
The Internet as a massive processor

316

This topic will be dealt with in more detail later. It started with users donating the
spare cycles of their processor to some processing that was required by an
organisation such as SETI. Usually these sites carried out massive computations
which could not be attempted by supercomputers: molecular genetics, drug trial
simulations and massive engineering design.

316

Offline storage
The use of large file servers to store files
created by users of the Internet
Best example is that of S3, the Amazon file
storage facility
Large numbers of advantages: security,
backup facilities, etc.
The Internet as a file storage facility

317

Now that we have high speed access to the Internet it is becoming feasible to
store data offline, that is away from the user's own computer, in secure facilities.
A major growth area has been in companies
offering this service. Amazon is the current leader with its S3 facility.

317

APIs
The availability of APIs that enable a company to
program access to some underlying data
Usually only read access is allowed
Sometimes read and write access is allowed
Examples: Google maps, Amazon, eBay and
PayPal

The Internet as a programmable entity



318

There are now a number of companies that offer access to their underlying data
via APIs in languages such as Java, Perl and Python. Most of these just allow
query facilities; however, some, such as Flickr, offer write access.

318

Platform APIs
Such APIs need a good business model
Amazon's model allows associate
companies to make money at the same time
that Amazon gains sales
The now defunct Google search API did
not.


319

Platform APIs only work when there is some leverage between all the enterprises
that are involved in their use. The Amazon API is a good example of this as it
allows companies to register as Amazon associates, download book details and post
them on their site. Visitors to the site can then buy the books from the associate.
Amazon would then do the standard selling functions, gain profit and share a little
with the associate.
The Google API failed in that it allowed other companies to feature search on
their site without the display of the online ads that make Google so profitable. It
was ditched.

319

Desktop interfaces
Mainly implemented by means of a collective technology
known as a framework
Ajax is currently the only game in town
Based on a combination of JavaScript, XML, CSS and
HTML
Attempts to replicate the desktop as an interface to the
Web.

The Internet as a windows-based computer

320

320

Collaboration-based development
Examples include Apache and Wikipedia.
Collaborative facilities such as Wikis enable
this to happen easily
Platform APIs also enable collaboration
based on integrating (mashing) sites
The Internet as a collaborative development
medium

321

Now that technologies such as those associated with broadband have enabled two-way
communication we have lots of examples of this happening. There are
the standard examples of Apache and Wikipedia where hundreds, if not
thousands, of users have collaborated to produce the most popular web server and
the most popular online encyclopaedia. There are also examples of individual
users mashing together sites in order to create functionality that represents
something greater than the sum of their parts.

321

Blogs

Weblogs
Personal diaries
Now being used as corporate PR
A number of good blogging sites and
systems to support blogging.


322

One area which has a slight connection with Web 2.0 is the blog. A blog is
effectively a Web site that holds text written by one person; the first blogs were
diaries, usually from some pundit or technical guru. They have expanded out over
the last two years to include ordinary users of the Internet including ambulance
drivers, policemen and call girls.
The commercial aspect to blogs is that some companies encourage their staff to
blog in order to give a positive view of the company that would not occur with a
PR site. When the PR department of a company starts blogging derision usually
follows.

322

Platform APIs-Three Examples

The Amazon API


The PayPal API
Google APIs


323

The next three slides look at three APIs that enable programmers to mash
applications together.

323

The Amazon API


Allows the developer access to Amazon resources,
for example it allows queries on books etc.
REST and web service implementation.
Facilities include: detailed product information,
images, customer reviews, shopping cart, extended
search.
Allows companies to set up their own bookshops.


One of the most successful APIs has been developed by Amazon. It provides a
variety of calls in a number of programming languages such as Java and Perl to
the huge database of products that are stocked by that company. This API
supports an archetypal Web 2.0 model: that of an associate delegating the tough
stuff to Amazon (stock control, customer billing) and achieving profit via sales.

324

Google APIs
Number of varieties
Google maps
Google search
Google OpenSocial
Google Reader aggregator


There are a large number of Google APIs that interface to a variety of Google
resources. Many of them are associated with JavaScript, with some having Java
and Perl interfaces. These APIs are among the most widely used on the Internet.

325

PayPal API
SOAP and REST interfaces
Variety of programming languages can be used.
Smallish number of mashups so far, main mashups
are with the eBay API for process handling.
Typical functions are: process a credit card
payment, authorize funds for an order
authorization, issue a refund for a PayPal
transaction

PayPal has a typical API which, like the Amazon API, is based on REST and
web services.

326

RSS
The glue that binds together the Internet
Based on XML
Accessed by a number of mashing
technologies such as Yahoo Pipes.
Initially a news feed technology
Now used for all sorts of data feeds

RSS is a fast-emerging technology that delivers content to the browser or to a
program known as a news reader or news aggregator. It is based on XML and
implements much of the plumbing that is employed in Web 2.0 applications.

327

Mashing
[Diagram: four RSS feeds combined into one mash up]


RSS is used as the basis for mash ups. These combine feeds from a variety of
sources and provide some cross-linked functionality, for example a map showing
high crime levels in a city or a map displaying Flickr photographs.
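
The combining step can be sketched in a few lines of Python. This is only an illustration: the two feeds, their items and the shared "area" field are all invented, and a real mash up would first have to fetch and parse the feeds.

```python
from collections import defaultdict

# Hypothetical items already extracted from two RSS feeds (all data is invented).
crime_feed = [
    {"title": "Burglary reported", "area": "Bletchley"},
    {"title": "Car theft", "area": "Wolverton"},
    {"title": "Vandalism", "area": "Bletchley"},
]
photo_feed = [
    {"title": "Canal at dusk", "area": "Wolverton"},
    {"title": "Station clock", "area": "Bletchley"},
]

def mash(feed_a, feed_b, key):
    """Group the item titles of two feeds by a shared key (here, the area name)."""
    combined = defaultdict(lambda: {"a": [], "b": []})
    for item in feed_a:
        combined[item[key]]["a"].append(item["title"])
    for item in feed_b:
        combined[item[key]]["b"].append(item["title"])
    return dict(combined)

mashup = mash(crime_feed, photo_feed, "area")
```

Each entry of mashup now holds everything known about one area from both sources, which is the cross-linked view that a map-based mash up would display.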

328

An RSS example
<item>
<title>Earth Invaded</title>
<link>
http://news.example.com/2004/12/17/invasion
</link>
<description>
The earth was attacked by an invasion fleet
from halfway across the galaxy; luckily, a fatal
miscalculation of scale resulted in the entire armada
being eaten by a small dog.
</description>
</item>


Here I show a very simple feed item; in general the feed standards, for example
Atom, allow a lot more content.
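
Because RSS is XML, a program can read an item like the one on the slide with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the description text is shortened and the function name parse_item is my own.

```python
import xml.etree.ElementTree as ET

# The item from the slide, with the description shortened.
RSS_ITEM = """<item>
<title>Earth Invaded</title>
<link>
http://news.example.com/2004/12/17/invasion
</link>
<description>
The earth was attacked by an invasion fleet
from halfway across the galaxy.
</description>
</item>"""

def parse_item(xml_text):
    """Return a dict mapping each child tag to its text with whitespace tidied."""
    item = ET.fromstring(xml_text)
    return {child.tag: " ".join(child.text.split()) for child in item}

entry = parse_item(RSS_ITEM)
```

After this runs, entry["title"] and entry["link"] hold the headline and the URL, ready to be fed to a news aggregator or a mash up.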

329

Frameworks
New way of developing systems.
Starts with a skeletal architecture and then
developers produce a number of
instantiations of concrete examples.
Until recently few frameworks were in
existence as public or commercial entities.
Ruby on Rails the first example of a highly
popular framework technology.

Frameworks are the software equivalent to the girders and ties that hold a
building together. They are a high level design of a system which can be moulded
towards a specific set of requirements from a customer. For example a
framework for accounting systems can be transformed into concrete versions of
the type of accounting systems that major companies use to drive their business.
The first highly popular framework has been Ruby on Rails.

330

Design patterns
Revolutionising the process of systems
development
Use architectures.
Implementable in any OO language.
Basis of frameworks
Mainly aimed at maintenance

Design patterns are small sections of pre-fabricated code with hooks to insert
application-specific code. The idea was developed by four industrial researchers known
as the Gang of Four. It relies on object-oriented programming languages, the
original book on the idea being oriented towards C++. Design patterns enable
software maintenance to be a much easier process since many of the patterns
documented are aimed at change.
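
As an illustration of pre-fabricated code with hooks, here is a minimal sketch of one Gang of Four pattern, Template Method, written in Python rather than C++. The class and method names are invented for the example.

```python
class ReportGenerator:
    """Pre-fabricated skeleton: the order of the steps is fixed here."""

    def generate(self, records):
        lines = [self.header()]
        lines.extend(self.format_record(r) for r in records)
        lines.append(self.footer(len(records)))
        return "\n".join(lines)

    def header(self):
        return "REPORT"

    def footer(self, count):
        return f"{count} record(s)"

    def format_record(self, record):
        # The hook: application-specific code is inserted by a subclass.
        raise NotImplementedError("subclasses supply this hook")


class InvoiceReport(ReportGenerator):
    """One concrete instantiation; only the hooks are written by the developer."""

    def header(self):
        return "INVOICES"

    def format_record(self, record):
        customer, amount = record
        return f"{customer}: {amount:.2f}"


report = InvoiceReport().generate([("Jones", 120.0), ("Smith", 75.5)])
```

A change in requirements, say a new kind of report, is handled by writing another small subclass; the skeleton is left untouched, which is why patterns of this kind help maintenance.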

331

The book that started frameworks off


This is the book by the Gang of Four.

332

Ruby on Rails
Concrete system

Templates

Object-database mappings

MVC architecture

Metaprogramming


Here we show the main components of Ruby on Rails. The model-view-controller
architecture (which allows events to be triggered), meta-programming (the ability for a
programming language to find out things about its programs), mappings from
objects to databases and common templates used in web development together produce a
remarkable set of efficiencies.

333

Metaprogramming
Tell me how many variables
does this subroutine have?

Program


This is the idea of meta-programming: it enables a program to interrogate itself
and find out information about itself. This provides Ruby with one of its most
powerful facilities; without this Ruby on Rails would be a much more
cumbersome product.
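
The question on the slide, how many variables a subroutine has, can be answered by a running program in most dynamic languages. The sketch below uses Python rather than Ruby, purely as an illustration; the function monthly_payment is invented.

```python
import inspect

def monthly_payment(principal, annual_rate, months):
    rate = annual_rate / 12
    factor = (1 + rate) ** months
    return principal * rate * factor / (factor - 1)

# The program interrogates one of its own subroutines.
code = monthly_payment.__code__
parameter_names = list(inspect.signature(monthly_payment).parameters)
local_names = code.co_varnames[:code.co_nlocals]  # parameters plus other locals
variable_count = code.co_nlocals
```

Here variable_count covers the three parameters and the two working variables. A framework can use the same kind of information to wire up code automatically instead of asking the developer to describe it by hand.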

334

Ruby on Rails

Based on the interpreted language Ruby


Capable of very rapid development
Relies very heavily on meta-programming
Is a clever framework that contains facilities
for common web-based applications.
Open-source

Ruby on Rails is the only technology that addresses all the properties that one expects of a
framework. It is capable of very rapid development since 90% of web-based
functionality is embedded in its architecture; when a developer wants to produce
a specific web-based application all they need do is write relatively small
amounts of Ruby code. Sophisticated web applications can now be developed in a
matter of days.

335

Implications for System Development


The rise of the application systems
developers such as SalesForce.com
The rise of integration as a development
paradigm.
The perpetual beta
Applications that encompass data
integration

These are some of the important implications that Web 2.0 has for developers.
The main one is in terms of integration. Many of the technologies documented
here, for example RSS, enable the development of systems in terms of bringing
together internet systems and resources. The last four lectures concentrate on this
increasingly important topic. The remainder of this lecture looks at data
integration.

336

Internet and Data


Lots of it, for example there are millions of
Wikipedia entries.
Much of the data is now tagged via
folksonomies.
A key technology is RSS. This feeds such
data to integrated systems.
Many technologies used: Ajax, Perl, Ruby,
PHP and MySQL.

There is a large amount of data on the Internet and there is considerable scope for
mashing it together. The remaining slides look at some particular mashups
detailed by Segaran.

337

Best Book on Data Mashing Yet Published


If you want to see what the capabilities of the Internet are in terms of data
mashing, buy this book.

338

Techniques

Bayesian analysis
Genetic algorithms
Particle swarm optimisation
Ant colony algorithms
Cluster analysis
Neural nets

There is an increasing stress on using advanced techniques within data mashing.
Bayesian statistics is a form of statistics that has found huge use on the Internet,
for example it has been successfully used to spot spam. Genetic algorithms
maintain many candidate solutions to a problem and combine these solutions in order to find
optima. Particle swarm optimisation mimics the flight of birds and other swarms
in order to find optima. Ant colony algorithms use the search strategy perfected by ants
for graph-based problems. Cluster analysis is a mature technique used to detect
similar items in a collection of items. Neural nets are a form of machine
learning that detects patterns.
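
To give a flavour of the Bayesian approach to spam, the sketch below applies Bayes' rule to a single word. The probabilities are invented for illustration; a real filter would estimate them from a corpus of mail and combine the evidence from many words.

```python
def spam_probability(p_word_given_spam, p_word_given_ham, p_spam):
    """Bayes' rule: P(spam | word) from the likelihoods and the prior."""
    p_ham = 1.0 - p_spam
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / evidence

# Invented figures: the word appears in 40% of spam and 1% of legitimate mail,
# and half of all incoming mail is assumed to be spam.
posterior = spam_probability(0.4, 0.01, 0.5)
```

With these figures the posterior is 0.2 / 0.205, so a message containing the word is very likely to be spam.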

339

A Sobering Read


This is a sobering read. Carr proposes a world in which IT has become a
commodity in the same way that electricity has. In his world integration dominates,
with middle-level activities such as programming, detailed design, unit testing and
major parts of system testing greatly reduced or virtually eliminated.

340

Lecture 8
The semantic web


341

Some Applications of Data Mashing


Using Wikipedia for natural language analysis
Using Bayesian stats to detect optimal buy time
for airline tickets
Tracking customer behaviour via ad-clicks
Combining estate agent data with crime statistics
Optimising a number of airline trips for, say, a
wedding


These are all examples of current applications which use advanced search and
optimisation algorithms

342

Aims
To describe the disadvantages of the current
web
To describe the use of semantic information
within the web
To outline the main architectural features of
the semantic web
To outline some current research projects
and case studies
343

The semantic web


Currently the Web is a disparate collection of
pages mainly written in HTML
Pages are meant for humans and are directly
accessible by only one program: the browser.
No semantic information is provided by pages.
While screen scrapers are intellectually easy to
write they are often a pain to develop and are
usually associated with one-off applications
The semantic web is an attempt to embed semantic
information within Web pages

The World Wide Web has been massively successful. However, it has its
problems. The first is that it was meant for a single program: the browser. We
have plenty of tools which enable us to find information from a Web page but
their development is quite painful.
Web pages provide no indication of what a chunk of information means. The
Semantic Web project is a very ambitious project which is attempting to overcome
this.

344

Semantics

Web
page

What does this data mean?

200


The box above shows a Web page. Inside that page there is a number; what does it
mean? It is easy for a screen scraper to extract this number but it is much more
difficult to understand what the data stands for. Is it a serial number, a price, a
house number or the position of a CD in the charts?

345

Things are a bit better

<TITLE>Employee</TITLE>

Web
page

We have more
clues about the
data now

200


We might know that the page represents an employee in a company from the fact
that there might be subsidiary information held on the page, for example the
preamble to the page might contain the string "Employee". However, we are not that
much better off. The 200 might be: the room number of the employee, their
weekly salary in hundreds of pounds, the number of staff they are responsible for
or some internal identity number.

346

Things could be much better

<TITLE>Employee</TITLE>

Web
page

We now have a
pretty good idea

room=200


We now have an excellent idea about the string "200". Since it is closely
associated with the string "room" we are able to say with a lot of confidence that
it is some room number. But is it the number of the room where the employee can
normally be found or the number of the room that they are currently in (the company
might be using an active badge system)?

347

Writing a screen scraper


A screen scraper is a piece of bespoke
software, often written in PERL.
It extracts basic information from a Web
page by processing the page line by line.
Semantics are built into the program rather
than being embedded
Screen scraper programs are susceptible to
change.

A screen scraper is a computer program written in some string processing
language (usually Perl). Its aim is to read a Web page line by line and extract
important information such as the price of a book. The semantics of the page
are embedded within the programming code by the programmer, who looks at the
page and discovers what each of the entities means. Unfortunately, there are
problems with screen scrapers.
First, they are tedious to write. Second, they are susceptible to change: if the
structure of the HTML changes they have to be modified. Third, they are specific
to one particular Web site, for example the structure of a page in the Amazon
Web site is different to that found in Blackwell's Web site.
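
A minimal sketch of a screen scraper, in Python rather than Perl, shows both how the semantics end up inside the program and how fragile the result is. The page, its markup and the price are invented; no real bookseller's layout is implied.

```python
import re

# Hypothetical page; real sites use different, and changing, layouts.
PAGE = """<html><body>
<h1>Distributed Applications and Ecommerce</h1>
<p class="price">Price: &pound;34.99</p>
</body></html>"""

# The knowledge that this paragraph holds "the price" lives here, in the code.
PRICE_PATTERN = re.compile(r'<p class="price">Price: &pound;(\d+\.\d{2})</p>')

def scrape_price(html):
    """Read the page line by line and return the first price found, or None."""
    for line in html.splitlines():
        match = PRICE_PATTERN.search(line)
        if match:
            return float(match.group(1))
    return None

price = scrape_price(PAGE)

# A small redesign of the page is enough to break the scraper.
redesigned = PAGE.replace('<p class="price">', '<span class="cost">')
price_after_redesign = scrape_price(redesigned)
```

The first call finds the price; the second returns None because the markup that the pattern relied on has gone, which is exactly the susceptibility to change described above.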

348

The solution
The solution is to associate a standard semantics
with a Web page. For example a standard semantics
for book publishers, a semantics for book sellers or
a semantics for open source projects.


Because of the problems which I have alluded to with screen scrapers the only
solution to the semantic poverty problem is to have a standard semantics for
types of application, with such semantics being embedded within Web pages. The
collection of Web pages which have this semantic information attached is
known as the semantic web.

349

The components of the Semantic Web

Web Ontology Language (OWL)


Resource Description Framework (RDF)
RDF Schema
Semantic Web tools such as inference
engines


The four bullet points above describe the main components of the Semantic web.
They will be described in detail in subsequent pages.

350

RDF

Triple based
Defined using XML
Long and short versions available
Based on subject/predicate/object
relationship


RDF is the core of the semantic Web. It is used to hold base information about
entities in some world, for example the world of book selling. It is defined using
XML and there are a number of long and short versions of RDF definitions
available. The long version takes a lot of learning

351

Some examples of triples


Employee Jones earns 20000 per year

The head office of Phillips is Eindhoven

Milton Keynes has a postcode which is MK


Here there are three triples. The subject is the thing that is described (Jones,
Phillips, Milton Keynes). The predicate is the aspect of the subject that is being
described (earns, has its head office in, has a postcode). The object is the value of
that aspect for the subject (20000, Eindhoven, MK).
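
In a program a triple is simply a three-part record. The sketch below holds the three triples from the slide in Python; the class name Triple, the predicate wording and the helper objects_of are my own choices.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

triples = [
    Triple("Jones", "earns", "20000 per year"),
    Triple("Phillips", "has head office in", "Eindhoven"),
    Triple("Milton Keynes", "has postcode", "MK"),
]

def objects_of(store, subject, predicate):
    """Return every object recorded for the given subject and predicate."""
    return [t.obj for t in store if t.subject == subject and t.predicate == predicate]

salary = objects_of(triples, "Jones", "earns")
```

Asking a question of the data is then a matter of fixing two parts of the triple and reading off the third.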

352

An example of a triple taken from Wikipedia
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description rdf:about="http://en.wikipedia.org/Tony_Benn">
<dc:title>Tony Benn</dc:title>
<dc:publisher>Wikipedia</dc:publisher>
</rdf:Description>
</rdf:RDF>


Here the text describes the fact that the title of a page is Tony Benn and the
publisher of the page is Wikipedia.

353

The N-triples form of the previous page


<http://en.wikipedia.org/Tony_Benn> <http://purl.org/dc/elements/1.1/title> "Tony Benn" .
<http://en.wikipedia.org/Tony_Benn> <http://purl.org/dc/elements/1.1/publisher> "Wikipedia" .


There are a number of versions of languages used to describe semantics using
RDF; this is an example of one of them in which the previous page is described.
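
One attraction of the N-Triples form is that each line is a complete statement ending in a full stop, so it can be read with very little code. The sketch below handles only the simple subset used on the slide (URI subjects and predicates, URI or plain literal objects); it is not a full N-Triples parser.

```python
import re

# subject URI, predicate URI, then either a URI or a quoted literal, then "."
NT_LINE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.$')

def parse_ntriples(text):
    """Return a list of (subject, predicate, object) tuples."""
    triples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = NT_LINE.match(line)
        if match is None:
            raise ValueError(f"not a simple N-Triples line: {line!r}")
        subject, predicate, uri_obj, literal_obj = match.groups()
        obj = uri_obj if uri_obj is not None else literal_obj
        triples.append((subject, predicate, obj))
    return triples

DOC = '''<http://en.wikipedia.org/Tony_Benn> <http://purl.org/dc/elements/1.1/title> "Tony Benn" .
<http://en.wikipedia.org/Tony_Benn> <http://purl.org/dc/elements/1.1/publisher> "Wikipedia" .'''

parsed = parse_ntriples(DOC)
```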

354

Some examples of RDF vocabularies

Friend Of A Friend (FOAF)


RDF Site Summary (RSS)
KlogMS
Dublin Core


Here are some examples of RDF vocabularies. FOAF is used to define the
relationships between people and to model social networks. It models predicates
such as the fact that one person knows another person.
RSS is used to describe web sites which offer syndication services.
KlogMS is used for embedding knowledge about Web logs.
Dublin Core is used to describe the semantics associated with information objects
such as books, research papers, Web pages etc.

355

A semantic net
Jones

Is in dept

Accounts
Is a dept in

Works in

Acme Company
Owns

Building 2

The slide above shows an example of a semantic net which might be embedded
within a series of Web pages maintained by a company. There is no reason why
further items which are external to the company cannot be added, for example a
town planning web site might reference Building 2 and involve relations such as
Is situated in.
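
The net on the slide can be written down as a list of triples and traversed by a program. In the sketch below the edges are my reading of the diagram (in particular, that Works in joins Jones to Building 2 is an assumption), and the town added by the external site is invented.

```python
# Edges read off the slide; which nodes "Works in" joins is an assumption.
NET = [
    ("Jones", "Is in dept", "Accounts"),
    ("Accounts", "Is a dept in", "Acme Company"),
    ("Jones", "Works in", "Building 2"),
    ("Acme Company", "Owns", "Building 2"),
    # A relation contributed by an external site (the town is invented).
    ("Building 2", "Is situated in", "Milton Keynes"),
]

def match(net, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in net
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

def town_where_person_works(net, person):
    """Follow two links: person -> building -> town."""
    for _, _, building in match(net, subject=person, predicate="Works in"):
        for _, _, town in match(net, subject=building, predicate="Is situated in"):
            return town
    return None

town = town_where_person_works(NET, "Jones")
```

The answer depends on a fact published by the company and a fact published by somebody else, which is the point of allowing external items to be added to the net.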

356

Some queries on a semantic net


Are there any publishers who require
authors for books on undersea diving
Is there a company in existence based in Europe who
sell old versions of the Windows operating system
Where can I find the cheapest delivery
of 1000 tonnes of aluminium ore
Needs conversion into
some structured language

The three queries above are examples of natural language queries which might be
processed by traversing the Semantic Web. It is rather fanciful to think that such
queries would be processed as natural language; they would have to be
transformed into queries in some structured language such as SQL. The
important point to make is that such queries would be enabled by the semantic
web and would enable the instigator of these queries to search the whole of the
Web.
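
As a sketch of what the structured form of the aluminium ore query might look like, the example below stores a few triples in an in-memory SQLite table and asks for the cheapest supplier. The suppliers, predicates and prices are all invented, and a table of triples is only one possible way of holding such data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("SupplierA", "sells", "aluminium ore"),
        ("SupplierA", "price_per_tonne", "310"),
        ("SupplierB", "sells", "aluminium ore"),
        ("SupplierB", "price_per_tonne", "295"),
        ("SupplierC", "sells", "copper ore"),
        ("SupplierC", "price_per_tonne", "120"),
    ],
)

# Join the table to itself: one row says what is sold, the other gives the price.
CHEAPEST = """
SELECT s.subject, CAST(p.object AS REAL) AS price
FROM triples AS s
JOIN triples AS p ON p.subject = s.subject
WHERE s.predicate = 'sells' AND s.object = ?
  AND p.predicate = 'price_per_tonne'
ORDER BY price ASC
LIMIT 1
"""
cheapest = conn.execute(CHEAPEST, ("aluminium ore",)).fetchone()
```

The copper supplier is cheaper per tonne but is excluded because it does not sell the product asked for.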

357

The future of queries

Simple queries such as those in the previous slide are technically easy
to program.
More complicated queries require something called an inference
engine which uses axioms.
Inference and logic have been quite busy areas of computing research
over the last twenty years, mainly in artificial intelligence.
The results from AI inference work have been somewhat disappointing
even when data is held in the same memory store.
This means that practical inference for the semantic web is
still a long way off.


The previous slide detailed some queries which could be easily translated into
SQL and hence are relatively easy to program. However, more complicated
queries in predicate calculus, queries which require deep levels of quantification
(for all, exists), are still a long way off. RDF still does not support such queries.
There is, however, a substantial body of knowledge in computing about inference,
mainly generated by researchers in the artificial intelligence area. Unfortunately
progress in this area has been very slow, with even queries on data stored on
the same computer taking a huge amount of time. With such data spread around a
WAN the implication is that logic, inference and inference engines will not
be seen for some time on the Web.

358

Web Ontology Language (OWL)


Used to link vocabularies together.
For example, a book might be referred to in
one vocabulary used within libraries but
may also be used in book selling
applications and university research
applications.
Used to establish equivalence.

In the future Semantic Web there will be a number of vocabularies which cover a
number of application areas. For example, there may be a vocabulary which is
concerned with applications for book sellers, another for libraries and another
which is used for cataloguing research achievements. In each of these, books will
be defined but may have different names, for example bookforsale, publishedbook
and book. OWL enables linking information to be set up which says that these
three entities are equivalent. In general OWL is used to define the relationships
between vocabularies.
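
The effect of such equivalence statements can be sketched without any OWL tooling: given pairs of names declared equivalent, a program can map every name to one representative. The pairs below use the three names from the text; the function name and the union-find approach are my own choices, not something OWL prescribes.

```python
# Hypothetical equivalence statements of the kind OWL lets a designer publish.
EQUIVALENT = [("bookforsale", "publishedbook"), ("publishedbook", "book")]

def build_canonical_map(pairs):
    """Union-find: map every name to one representative of its equivalence class."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return {name: find(name) for name in list(parent)}

canonical = build_canonical_map(EQUIVALENT)
same_thing = canonical["bookforsale"] == canonical["book"]
```

Note that bookforsale and book were never directly linked; their equivalence follows from the two statements taken together.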

359

RDF Schema
Similar to database schema
Defined using XML
Allows a designer to define and publish the
vocabulary used by an RDF data model
Example: it might define that people have a
phone attribute.
Also defines classes and subclasses

This is an overview description of a vocabulary used by an RDF data model. For
example it might say that in a book retailing model there is an entity book
and an entity customer. It also defines attributes, for example the fact that a book
has a title. It also provides facilities whereby one entity can be tagged as a
subclass of another, for example the fact that CargoPlane is a subclass of Plane.
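
The value of recording subclass information is that a program can draw simple conclusions from it. In the sketch below CargoPlane and Plane come from the text, while Vehicle and the aircraft identifier are invented; the code assumes the subclass links contain no cycles.

```python
# Each class maps to its direct superclass ("Vehicle" is a hypothetical addition).
SUBCLASS_OF = {"CargoPlane": "Plane", "Plane": "Vehicle"}
# Each individual maps to the class it was declared with (identifier invented).
INSTANCES = {"G-ABCD": "CargoPlane"}

def ancestors(cls, subclass_of):
    """Return all superclasses of cls, nearest first (assumes no cycles)."""
    result = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        result.append(cls)
    return result

def is_instance_of(item, cls, instances, subclass_of):
    """True if item was declared as cls or as any subclass of cls."""
    declared = instances[item]
    return declared == cls or cls in ancestors(declared, subclass_of)

cargo_is_plane = is_instance_of("G-ABCD", "Plane", INSTANCES, SUBCLASS_OF)
```

A query for all planes would therefore find the cargo plane even though nobody stated directly that it is a plane.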

360

Research 1 (REWERSE)

EEC network of excellence


27 organisations from 14 countries
Aim 1: develop a coherent and complete, yet minimal, collection of
inter-operable reasoning languages for advanced Web systems and
applications;
Aim 2: test these languages on context-adaptive Web systems and
Web-based decision support systems selected as test-beds for proof-of-concept purposes;
Aim 3: bring the proposed languages to the level of open pre-standards
amenable to submissions to standardisation bodies such as the World
Wide Web Consortium.

One of the main research areas that is being addressed by Western governments
is that of inference. This project, which is a network of excellence project, is
attempting to develop some small languages which enable the communication of
logical queries to the Semantic Web.

361

Research 2 (NRC)
National Research Council Canada Semantic Web lab
Rule-Applying Collaborative Filtering system
Combines multidimensional collaborative ratings with
RuleML-based rules to recommend Web objects such as
research papers
RuleML is a mark-up language for business rules
First example is RACOFI a system for recommending CD
albums for music


This is research which takes semantic web pages and uses collaborative agents to
discover objects which best match certain criteria. It uses a mark-up language
called RuleML.

362

Research 3 (DAML Services)


Development of an ontology for Web services
Based at Stanford University AI Laboratory
Development of a set of mark-up language
constructs for describing the properties and
capabilities of their Web services in unambiguous,
computer-interpretable form
Used for Web service discovery, execution,
interoperation, composition and execution
monitoring

This is a project which is attempting to develop an ontology which can mediate
between a number of vocabularies which describe Web services.

363

Case Study (i)

System known as ITTalks


Developed at the University of Maryland
Uses a mark-up technology known as DAML
Recognises information technology talks at
research institutions and universities
Processes a web page and is able to understand the
content and categorise and present information
about an IT talk

This is an unusual project in that it uses a technology known as DAML (DARPA
Agent Markup Language). This is a technology that has been developed for
agents: software entities that wander the Internet gathering information. The
endpoint of this research project is the dissemination of web pages that detail
IT-related talks and presentations at universities and research institutions.

364

Case Study (ii)


Harpers magazine web site
Conventional magazine web site augmented with
semantic information
Primitive technology used compared with that
normally associated with the Semantic web.
Enables a high degree of linkage within the web
site, for example linking Dolly the sheep to events
associated with this animal

The second case study involves a home-made, semantically augmented web site
for the cultural magazine Harpers. The web developers who produced the web
site augmented the HTML with XML markup that enables a high degree of
linking between entities such as features, feature elements and events to take
place. It might be worth your while to click on
http://www.harpers.org/HarpersIndex2003-10.html and then navigate between
entries to see the power of this technology.

365

Challenges for Web researchers and practitioners

Availability of content
Ontology
Multilinguality
Scalability
Visualisation
Stability of Semantic Web languages

Introduction to the semantic web



There are a number of areas that need to be addressed if the Semantic Web is to
advance.
First, little RDF-based content is available. There needs to be a massive effort to
create new content and convert existing content.
There are few ontologies. A big effort is needed to create new ontologies and also
to create an organisational infrastructure which allows the standardisation and
change management of ontologies.
If the Semantic Web takes off it will be huge; there is a need to ensure that
activities such as searching and inference are carried out efficiently. This
requires research not only on new structures but also on new algorithms.
There is a need for research which ensures that content in a number of different
languages can exist within the Semantic Web and activities such as search can be
carried out on such heterogeneous material.
There is a need for research into visualisation of parts of the Semantic Web.
There will be so much content available that some techniques will need to be
deployed to ensure that users are aware of content and that administrators are
able to see the effect of change on content.
The Semantic Web will not be successful if its facilities, including languages are
not stable. There is a need for a major standardisation effort.

366

Lecture 9
Grid Computing


367

Aims
Show how grid computing evolved from mass
computing and peer-to-peer computing
Describe the main grid computing concepts
Examine the main drivers behind grid computing
Look at the technology and business benefits of
grid computing
Briefly look at some grid computing applications


In this lecture I shall be looking at grid computing. I shall show how it
evolved from mass computing and peer to peer computing, describe the main grid
computing concepts and examine the main drivers behind it.
I shall then look at the technology and business benefits of grid computing and,
finally, briefly look at some grid computing applications.

368

Grid Computing
Initially, the use of large numbers of computers
connected by network technology to carry out
computationally demanding tasks.
Has its roots in peer to peer computing and mass
computing.
Normally employs Internet technology for
connections.
Still in its early days.

This lecture looks at the topic of grid computing. This is the term given to
connecting large numbers of computers (thousands) together in order to share out
the processing of some computationally demanding application, for example
analysing biological databases. Each processor carries out some part of the
computation required and other computers gather together the results.
Computers would be connected using Internet technology. Grid computing is in
its infancy; however, it has its roots in slightly more mature technologies and
ideas, mainly in peer to peer computing and mass computing.

369

The client server paradigm


A hierarchical relationship
Client

Client

Client

Server


The main paradigm that has driven distributed computing during the eighties and
nineties is that of the client server model. This is a hierarchic model where clients
are subservient to servers. It is totally pervasive, for example the World Wide
Web relies on clients (browsers) to access servers (Web servers).

370

Client server computing


Client asks for service (get me a web page)
Server provides the service.
Clients do not provide any service to the
servers or to other clients.
Occasionally some service elements appear
at the client, for example cookies; however,
this is minor.

The client server model is hierarchical in that a client asks for a service and the
server delivers it. Typical servers include application servers, database servers
and Web servers.
In general a client does not provide any services for other entities in an
application. Sometimes a small amount of functionality is found, for example
where a client holds a cookie which the server can
interrogate.
This model has been extant since the eighties.

371

Mass computing
The use of a number of computers to carry out
some tough computational task
SETI project is an example where radio waves are
analysed to check for patterns which indicate extra
terrestrial life
Many mass computing projects in existence
The connection with grid computing is the sharing
of hardware
The SETI project

Mass computing is the use of a number of computers to carry out some
computationally demanding task. Normally in a mass computing project the task
is split into sub-tasks which are then passed to the computers which participate in
the project. Since most computers are idle for most of the time they have a fair
amount of processor resource available which can be used for computing. The
most famous mass computing project is SETI (the Search for Extra Terrestrial
Intelligence). This uses computers to analyse radio waves from distant galaxies in
order to see whether they have been sent by aliens.
There are now hundreds of these projects; most emanate from universities or
research labs.
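
The split, farm out and gather cycle can be sketched on a single machine. In the example below worker threads stand in for the participating computers and finding the largest value stands in for the expensive analysis; the data, chunk size and function names are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def split(samples, chunk_size):
    """Cut the task into sub-tasks of at most chunk_size samples each."""
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]

def analyse(chunk):
    """Stand-in for the expensive work done on one participating computer."""
    return max(chunk)

def strongest_signal(samples, chunk_size=4, workers=3):
    chunks = split(samples, chunk_size)
    # Threads play the role of the remote computers in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(analyse, chunks))
    # A coordinating computer gathers the partial results together.
    return max(partial_results)

signal = strongest_signal([3, 9, 1, 4, 12, 7, 5, 8, 2, 6])
```

This works because the sub-tasks are independent of each other, which is what makes problems such as SETI suitable for mass computing.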

372

Peer to peer computing


Is where the computers in a network or part of a
network do not have a hierarchic relationship.
Usually involves the interchange of information,
often large files.
Has a long history dating from the eighties.
However, only in the last five years has there been
significant growth and interest in peer to peer
computing.
Wikipedia on peer-to-peer computing

Peer to peer computing eliminates the hierarchic relationship between client and
server. Each entity in the network has the same functionality and contributes
equally to the tasks that the network carries out.
Often peer to peer networks involve the transfer of large amounts of data, for
example sound files, and much of the current interest in P to P arises from
applications such as Napster.
There has been a large amount of interest in peer to peer computing in the last
five years; however, the history of P to P starts in the eighties.

373

Growth of P to P

Started with USENET and FidoNet


Requires fast connections, at least broadband
Requires public connection technology (Internet)
P2P applications consist of a number of peers,
each performing a specific role in the P2P
network.

The number of peers is usually large and the
number of different roles is usually small

Peer to peer computing has its roots in the USENET and FidoNet systems of the
eighties. These were the forerunners of newsgroups systems of today. USENET
was a true newsgroup system while FidoNet was associated with bulletin boards.
Both were used to exchange files between participants, often using dial-up lines
overnight.
The conditions for successful P to P are that you have a public networking
infrastructure and fast interconnection. A true P to P system will have a number
of peers each dedicated to a specific role such as sending files. This is the hallmark
of P to P. A system which does not satisfy all these conditions is often known as
peer oriented. For example mass computing is peer oriented.

374

Gnutella and Napster


Both peer to peer applications.
Both associated with the sharing of MP3
(sound) files.
Napster is based on a central server.
Gnutella is based on full peer to peer ideas
with no computer having a dominant
position.

Probably the best known peer to peer applications are Napster and Gnutella. The
former was the first music file-sharing program on the Internet. It allowed users
to access other users' computers via a server and download MP3 files. It gained a
huge amount of notoriety because it was used as a repository of pirated music.
Gnutella is a more recent technology. It is interesting because it employs a true P to P
approach in that no servers are involved in the distribution of files: it happens
directly between Gnutella nodes.
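
The difference between the two approaches to finding a file can be sketched as follows. This is a much simplified model and not either system's real protocol: the peers, files and neighbour links are invented, and the hop limit stands in for the way a search is stopped from spreading for ever.

```python
from collections import deque

# Hypothetical peers: which files each holds and which neighbours it knows.
FILES = {"A": {"song1"}, "B": {"song2"}, "C": {"song3"}, "D": {"song2", "song4"}}
NEIGHBOURS = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

def central_lookup(index, filename):
    """Central-server style: one server holds an index of who has what."""
    return sorted(index.get(filename, []))

def flood_search(start, filename, max_hops):
    """Pure peer to peer style: ask neighbours, who ask theirs, up to max_hops."""
    seen, found = {start}, []
    queue = deque([(start, 0)])
    while queue:
        peer, hops = queue.popleft()
        if filename in FILES[peer]:
            found.append(peer)
        if hops < max_hops:
            for nxt in NEIGHBOURS[peer]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return found

# The central index is built from what every peer reports to the server.
INDEX = {}
for peer, names in FILES.items():
    for name in names:
        INDEX.setdefault(name, []).append(peer)

via_server = central_lookup(INDEX, "song2")
via_flood_near = flood_search("A", "song2", max_hops=1)
via_flood_far = flood_search("A", "song2", max_hops=3)
```

The server sees every copy at once but is a single point of failure; the flooding search needs no server but only finds copies within the number of hops it is allowed to travel.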

375

Grid computing
The use of large numbers of computers which look like
one computer.
Up to a couple of years ago it was a researcher's plaything.
Now there are a number of tools available, many from
IBM.
Number of protocols, most promising being the Open Grid
Services Architecture (OGSA)
Introduction to grid computing

A grid is a collection of computers which acts like a single computer and appears
to both its users and developers as a single computer. The last two years have
seen a number of tools and protocols emerge which make grid computing a
commercial possibility. Most of the standards associated with grid computing are
open standards, including the most popular, OGSA.

Similarities
Like the Web, complexity hidden and all
users have the same interface
Like P to P, allows files to be shared.
Like clusters, brings distributed resources
together
Like virtualisation technologies, creates a
virtual resource.

There are many similarities between grid computing and other technologies.
It is similar to the Web because all the users have the same interface to it in the
same way that browsers have the same interface to Web pages.
It is similar to P to P in that files can be shared between entities in a grid.
It is like clusters in that computers can be brought together using existing
communication technologies to create a single resource.
It is like virtualisation technologies such as replicating databases in that you can
create a virtual interface to a large number of complicated resources.

Slack resources

Most PCs are idle for 95% of the time.
UNIX servers are idle for 10% of the time.
Database servers are idle for 50% of the time.
Mainframe computers do nothing for 40% of the
time.
It has been estimated that a company's computing
system is idle for 85% of the time.

The slide above provides the commercial rationale behind grid computing: that
most of a computing system is wasted.


The initial driver


Although the previous slide talks in commercial
terms the initial rationale for grids was different
They emerged because of the need for solving
hardish computational problems, for example
those associated with nuclear physics and the
genome project
Because of this, most of the research carried out
into grid computing has been high computational
demand research

There is another dimension to grid computing. The initial driver that has
sponsored most of the research into grid computation did not come from the fact
that resource was slack but came from a demand by applied scientists for a way
of increasing the computational speed of the hardware they used. Grid computing
enables large numbers of computers to share computational tasks.


Grid computing: a warning

As a way of solving hard problems it is not a panacea
There are problems for which no efficient algorithm is known
(NP-complete problems)
There are problems which cannot be split up sufficiently
that they can be efficiently solved. Such problems often
have major communication overheads.
NP-complete problems: a link

Grid computing is not capable of solving all computationally difficult problems.
Firstly, there are problems for which every known exact algorithm has a run time
that grows exponentially with the size of the problem; the NP-complete problems
are the best-known examples. Even if you had millions of computers working
together they could not solve large worst-case instances in a realistic time.
There are also some problems where the interaction between sub-tasks is such
that the sub-tasks need to keep in step with each other and communicate. For
large instances of these problems the so-called communication overhead
dominates everything.
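
A rough calculation shows why adding machines does not rescue an exponential algorithm. The sketch below assumes a brute-force search over 2^n candidates, a grid of one million machines and one thousand million steps per second per machine, with the work shared perfectly; all three figures are assumptions chosen purely for illustration.

```java
// Illustrative arithmetic only: the grid size and machine speed are assumptions.
class ExponentialCostSketch {
    static final double MACHINES = 1_000_000;       // assumed number of machines in the grid
    static final double STEPS_PER_SECOND = 1e9;     // assumed speed of each machine
    static final double SECONDS_PER_YEAR = 365.25 * 24 * 3600;

    // Seconds needed to try all 2^n candidates, shared perfectly across the grid.
    static double secondsForBruteForce(int n) {
        return Math.pow(2, n) / (MACHINES * STEPS_PER_SECOND);
    }

    static double yearsForBruteForce(int n) {
        return secondsForBruteForce(n) / SECONDS_PER_YEAR;
    }

    public static void main(String[] args) {
        for (int n : new int[] {40, 60, 80, 100}) {
            System.out.printf("n=%d: %.3g seconds (%.3g years)%n",
                    n, secondsForBruteForce(n), yearsForBruteForce(n));
        }
    }
}
```

Under these assumptions a problem of size 40 takes about a millisecond, while a problem of size 100 takes roughly 40 million years. Doubling the number of machines only buys one extra unit of problem size.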


Technology benefits
Better control of workloads
Increases capacity for high demand
applications
Support for multi-disciplinary collaboration
Workload balancing easier
Recovery easier using replication

The above are the main technological benefits of using a grid approach.


Business benefits
Improves collaboration.
Can solve previously unsolvable problems
Enables virtual departments and virtual
organisations to be easily created
Respond to fluctuations in customer needs
Provides optimal use of resources
Enables faster integration
Eliminates over-provisioning

The above are the main business benefits of using a grid approach.


Some grid applications


DNA matching
Data intensive testing in the car industry
Running long term scenarios in the financial
services industry
Large pictorial database matching.
Meteorological prediction
Geological prediction

These are a few of the current applications of grid computing. As you will see
most of them involve solving problems with high computational demands, the
sort of problems that first motivated grid research.


Standards and the Grid


Up until now there has been little standards work.
However, virtually all grid work is based on
infrastructure Internet standards.
The most promising standard is OGSA, promoted
by the Global Grid Forum, an organisation of over
400 enterprises and organisations in 40 countries.
OGSA

Standards work is in its infancy. However, many of the lower levels of any grid
standard will be defined by existing Internet standards such as IP and TCP.
The most promising standard is OGSA (Open Grid Services Architecture). This
has been developed by an organisation known as the Global Grid Forum.


OGSA

Infrastructure services
Resource management services
Data services
Context services
Information services
Self-management services
Security services
Execution management services

Shown above are the eight levels of service that make up OGSA; the following
slides describe them in more detail.

Infrastructure services
Communication between disparate
resources
Resources include processors, memory and
database storage
Resolves problems associated with shared
access
Close to Internet protocols IP and TCP

The role of the infrastructure services is to co-ordinate the communication
between the hardware (memory, processor, file storage) and software (distributed
operating system, applications). For example, enabling common applications to
communicate effectively without giving rise to huge overheads. This set of
services also removes barriers which would occur when shared resources are
accessed, for example by ensuring that locked access to databases is dealt with
efficiently.

Resource management services


Critical service
Driven by quality of service metrics
Carries out monitoring, reservation,
deployment and configuration of hardware
and software resources so that QoS targets
can be met
Should be carried out dynamically

The aim of the resource management service is to ensure that a level of service
characterised by performance and availability is maintained throughout the
functioning of the grid. This means that some resources will be devoted to
monitoring critical metrics such as system throughput in order to reconfigure
both the hardware and software elements of the grid.


Data services
Concerned with the movement and storage
of data
Provides a data replication service
Provides a P to P service
Provides a format conversion service
Manages the dynamic updating of both
permanent and transient data

Data services are concerned with managing the stored data that is held in a grid
system. They provide services whereby data is moved from one entity to another,
where a replication service is maintained and where data having different formats
is converted. The data services are not only concerned with permanent data but
also with transient data.

Context services
Concerned with keeping customer details
Each customer should be associated with
resource and usage policies
Used by the resource management service
Combined with service requirements to
drive the resource management service


This service keeps track of customer data: what resources they are allowed, what
sort of access they are allowed and what use is made of a resource. This service is
intimately connected with the resource management service, since the latter
needs such information for any optimisations that are required.

Information services
Provides information about entities on the
grid
Example: the current availability of a
resource
Required by the resource management
service in order for it to function


The information service part of OGSA is concerned with efficiently keeping
up-to-date information about the resources in a grid. For example it may keep
details of where replicated data can be found, the degree of replication and when
such data was most recently updated. This service is used by the resource
management service part of OGSA in order for it to dynamically configure the
grid in order to meet QoS targets.

Self management services


Application of automation for the
achievement of QoS levels
Not yet well defined
Aim to reduce the complexity and cost of
managing a grid-based system
Still a research topic

Self management services are associated with the automatic management of a
grid-based system in order for it to achieve desired quality of service levels. For
example an automatic replication service forming part of the self management
services would monitor the usage of databases and files, with the aim being to
dynamically keep copies of data in order to improve local access rates.
This is a part of the standard that is not well defined. Much of the technology and
theory underlying this is still in the research domain.

Security services
Aim is to enforce security policies
Covers all the standard security worries
including authentication, non-repudiation and
authorisation
Based on standard security infrastructures
such as public key systems
Well defined

Security services are just the same type of services you associate with a
distributed system, for example ensuring that users can only access resources that
they are allowed to access, that data being transmitted between nodes on the grid
is secure from reading and that users are authenticated.
Much of the technology for this is in existence and its implementation is
straightforward.

Execution management services


Enables simple workflow actions to be
executed
Enables more complex workflow actions to
be executed
Manages the task life-cycle
Used for communication between executing
entities such as applications and system
programs.

This service is associated with the communication between executing entities on
the grid. It ensures that such entities have enough resources and communicate
with other entities. The entities could be either system or application programs.

Globus toolkit
De facto toolkit for developing grids
Based on OGSA
Implements about 40% of the OGSA
specification
Good Java interface
The Globus toolkit

The de facto implementation for grid building is the Globus toolkit. It is
available in open source form from the Globus alliance. It is mainly oriented
towards Java and contains a number of facilities for interfacing Java technologies
such as entity Java beans with a grid. The software is still in its early days and
implements around 40% of the OGSA specification.

Basic Globus


The architecture of the toolkit contains an implementation of the OGSA-based
specification OGSI. The security infrastructure contains facilities for message
level security. The system level services currently contain a ping service, a
logging management service and a management service. The logging
management service allows the grid to log events that occur during its lifetime.
The management service is used to monitor the current status and load of a grid
service container and to cleanly shut down a container.

Lecture 10
XML (i)

Aims
Detail the problems that XML is intended to
solve
Detail the main elements of XML
Describe some XML-based languages
Describe the concept of a DTD
Introduction to XML

The next two lectures will look extensively at XML. There are a number of
problems associated with distribution which XML is intended to solve. I shall
initially look at these.
I shall also look at some of the main elements of XML and how XML schemas
are developed and processed.
There is a host of software tools that are available for XML, for example tools for
publishing documents in a number of formats. I shall describe a small selection of
these.
I will conclude the lecture by looking at some examples of XML-based languages.

An example of a specific problem


Diagram: an Internet dictionary passes through a Process that produces HTML,
Word and PDA format versions.
How can the translation process be minimised?

In 2000 I developed an Internet dictionary for Oxford University Press. I wanted
the dictionary to exist in a number of forms: on the Web and on paper and even
in a mobile PDA form. I did not want to spend a lot of time writing specialist
translators for the dictionary; consequently I used XML to define dictionary
entries and used standard software tools for converting it into a number of
different forms.
This is one of the best examples of the use of XML: as a medium for a publishing
system. Others are detailed in the next four slides.

XML is used
Where a distributed program has to interact
with a number of different formats of data
When processing has to be shifted onto a
client
When different users have to have different
views of the same data
Applications which use mobile objects

There are a number of areas where XML has found use:
Where there are a number of different formats for data and where some form of
standardisation is required.
Where the client receives data and has to carry out a considerable amount of
processing, for example XML might be used to lay out electronic documents for
clients according to some standard format.
Applications where some users have different display requirements to other
users, for example a manager might only want to see a part of a report that a sales
person sees.
Mobile object applications; in particular those applications which visit Web sites
to gather information. Such applications require the visited sites to be in some
standard form.

Applications which have a number of different data formats
Different formats might include plain text, rtf, a
spreadsheet format.
Such applications might require the data to be
processed uniformly, for example producing sales
reports from both Word and Excel documents.
Problem in non-distributed systems, major
problem in distributed systems

These are applications where there may be a differing set of formats. The slide
details one: that of document processing; however, there are many more in the
area of Electronic Data Interchange where companies basically hold the same
data but with different formats and want to interchange this data effortlessly.
XML can be used to define the basic structure of the data and XML tools are able
to process it in a uniform way.


Applications which shift a lot of processing onto the client
Archetypal example is that of a browser
displaying HTML text
Sometimes the data is non-standard, again
HTML is a good example
XML provides the facility to filter and
convert data to a standard form which can
then be processed by the same software

A number of applications require the client to carry out quite a bit of processing:
the example of HTML being displayed by a browser is a good one. Unfortunately
this data often differs from one source to another. XML can be used to develop
software which interposes itself between the client and the server and which
converts the data to some standard format. This means that only one version of
the client software would be used; this overcomes many of the updating and
version control problems associated with distribution.

Processing a document in a number of ways
An example is a sales report
The report might be formatted in different
ways according to its readership.
Another example is documents intended for
poorly sighted users.
XML can be used to convert documents into
a number of visual versions.

This is one of the most popular uses of XML. Many information system
applications require the same data to be displayed in a variety of ways. For
example sales data may be viewed by accountants, sales managers and individual
sales people. Each of these has different requirements, for example the sales
manager might only want the summary element of the data while a salesperson
might want detailed sales from individual companies.
XML allows a developer to define the format of the data and then easily write
special-purpose programs which can then provide the view that is required.
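
As a minimal sketch of this idea, the Java below produces two views from the same sales data using the XPath support in the standard Java library. The SALES and SALE elements and the COMPANY and AMOUNT attributes are hypothetical names invented for the example; a real sales language would have its own definition.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch: the element and attribute names are hypothetical.
class SalesViewSketch {
    static final String SALES =
        "<SALES>"
        + "<SALE COMPANY=\"Acme\" AMOUNT=\"1200\"/>"
        + "<SALE COMPANY=\"Bolt\" AMOUNT=\"800\"/>"
        + "</SALES>";

    // Manager's view: only the summary figure.
    static double managerView(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Double) xpath.evaluate("sum(/SALES/SALE/@AMOUNT)",
                    new InputSource(new StringReader(xml)), XPathConstants.NUMBER);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not read sales data", e);
        }
    }

    // Salesperson's view: one line of detail per company.
    static List<String> salespersonView(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList sales = (NodeList) xpath.evaluate("/SALES/SALE",
                    new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < sales.getLength(); i++) {
                Element sale = (Element) sales.item(i);
                lines.add(sale.getAttribute("COMPANY") + ": " + sale.getAttribute("AMOUNT"));
            }
            return lines;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("could not read sales data", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(managerView(SALES));
        System.out.println(salespersonView(SALES));
    }
}
```

Both views read the same source; only the small program that selects and presents the data changes.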


Mobile object (agent) applications


Objects that visit a computer in order to
gather information
Such objects have to work hard in order to
extract semantics
XML can be used to define a standard for
similar Web sites


Mobile agents are objects which visit sites in order to gather information. The
best examples of such agents are those which wander the Web looking for low
price items. Programming such agents is tedious because different Web sites use
HTML in different ways to display their catalogues. For example it is very
difficult for an agent to discern whether a number on a page is a price, a quantity
or a date.
By developing similar sites using the same base XML source it is possible to put
some structure on a site so that certain information important to the mobile agent
can be easily discovered.

What is XML?
XML is a language for defining languages.
Such languages are known as meta-languages
and have been used in computing for over 40
years. Recently they have fallen into disuse;
however XML has revived the concept.


XML is not a language in the usual sense; it is a language for defining
languages. It contains facilities for defining the structure of any number of
tag-based languages. For example it has been used for Electronic Data
Interchange for the oil industry to define a language which describes the
transactions which occur when one oil commodity broker communicates with
another broker.
Such languages have been around for many years; however, they have been
something of an academic plaything. Possibly the best example of the use of such
a language in the past has been YACC on the UNIX system which enables
compilers for simple programming languages to be developed based on a
meta-description of the structure of the language.

The concept of a meta-language


Meta-language: A user command will start with US and then have a single
alphabetic set of characters
Language: US Davis

A meta language is a language used to describe another language. The example
in the slide shows English being used to describe line-based user commands in
some system. Another example is the BNF descriptions used to describe
programming languages.
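
The English rule on the slide can also be written in a more formal meta-language. The sketch below gives one possible BNF reading of it as a comment and an equivalent regular expression that a Java program can check. The single space between US and the name is an assumption taken from the example US Davis, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// One reading of the slide's English rule, written as BNF:
//   <user-command> ::= "US" " " <letters>
//   <letters>      ::= <letter> | <letter> <letters>
// The single separating space is an assumption based on the example "US Davis".
class UserCommandSketch {
    private static final Pattern USER_COMMAND = Pattern.compile("US [A-Za-z]+");

    // True when the whole line is a user command in the language described above.
    static boolean isUserCommand(String line) {
        return USER_COMMAND.matcher(line).matches();
    }

    public static void main(String[] args) {
        System.out.println(isUserCommand("US Davis"));
        System.out.println(isUserCommand("US 1234"));
    }
}
```

The rule (in English, BNF or regular expression form) is the meta-language; the line US Davis is a sentence of the language it describes.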


XML processing
Diagram: a Processor takes the XML definition of a language and the language
source, and produces processing outputs.

XML is used to define a language. All the languages defined by XML are based
on the sort of tags that you can find in HTML. A processor will then take source
written in the language defined by the XML and process it in some way. The
outputs from this processing can be:
Error reports
Versions of the documents expressed in certain ways
Protocol commands
Electronic documents
Paper documents
Web documents


History of XML
Based on ideas which were in HTML but
which emanated from SGML
Overseen by the World Wide Web
Consortium
Addressed a number of problems that were
emerging in the Web during the nineties
The importance of XML

Many of the concepts in XML, for example the fact that it should be tag-based,
can be found in HTML which owes very much to the document markup
meta-language SGML.
Development of XML and all its associated standards is now overseen by the
World Wide Web Consortium, a group of universities, companies and
government organisations which looks after standards for the Web.
The XML standard is pretty much static now and should change very little over
the medium term.

Problems originally addressed

Similar HTML source was displayed in different ways
Incompatible browser features
Semantics difficult to discern

The problems originally addressed by XML were:
The fact that a similar HTML source was displayed in different ways
depending on what browser was used. This meant that designers of Web
pages had at least to duplicate their efforts.
Some browser manufacturers developed HTML facilities which were not
recognised by other browsers.
It was almost impossible to discern any semantics within a Web page:
nowhere was this more apparent than in the use of search engines. Even in
the mid-1990s many users were expressing disappointment with these
programs because of the large number of retrieved documents that a
search returned which were, at best, only marginally connected to the
search. The reason for this poor performance was not the search engine
technology (they are in fact highly sophisticated programs which are
testament to the ingenuity of the human developer) but that the data they
process, HTML pages, does not contain many clues about its content.

Design aims

Easy to use
Support a large number of applications
Compatible with SGML
Easy to write programs
Can be prepared easily
Optional facilities low
Easy to understand

There were a number of design aims:
That it should be easy to use in the Internet.
That it should be capable of supporting a large number of applications
ranging from browsers to search engine databases.
That it should be compatible with SGML, the text processing language
which was the inspiration for HTML.
That it should not be a complicated process to develop processors for
documents written in languages defined in XML, for example it should be
easy to write a program to check that a source text reflects its definition.
That the number of optional facilities of the language should be very
low.
That XML documents should be easy to read and understand.
That documents written using a language defined by XML should be
easy to develop using simple editors.

XML and storage media


Diagram: a document in an XML-based language at the centre, linked to:
document in word processor form (viewed by word processor program);
document in spreadsheet format (viewed by spreadsheet program);
document as relational db;
document in HTML (directly viewed by browser);
document indirectly viewed by browser.

The slide shows how a document defined by XML can be stored and viewed in a
number of different ways. Processors which take an XML definition and the
source of the language are able to convert the language into a number of different
forms.
This is an example of a sort of Web publishing system such as Cocoon which is
described later.

XML processing
Diagram: the Parser takes an XML DTD or schema and the source of an
XML-defined language, and produces Errors and Outputs.

The slide shows the operation of an XML software tool known as a parser. This
takes the XML definition of a language and source expressed in the language
and:
Checks the source for correctness
Issues errors if the source does not match the definition
If the source is correct establishes its structure
Carries out the required processing. For example, it may transform the source
into some form such as HTML or rtf.


An example of XML source

<PRODUCT>
<PRODUCTNAME> Coat blue </PRODUCTNAME>
<PRODUCTPRICE> 34000</PRODUCTPRICE>
..
</PRODUCT>
Similar to HTML:
tag, element, attribute


Here a fragment from an XML-defined language is shown. As you can see it
contains a number of elements delineated by start tags and end tags, with two
elements nested within another element. This is exactly the way in which HTML
is written.

An alternative
<PRODUCT>
<PRODUCTNAME PRICE = "3400">
Coat blue
</PRODUCTNAME>
..
</PRODUCT>

Attribute


This is an alternative version of the text displayed in the previous slide. Here the
price of a product is expressed as an attribute.
Whether something is expressed as an element or as an attribute depends on
whether it is a feature of a particular element or an element in its own right. In
the case of the <PRODUCT> example in this and the previous slide the price
would normally be expressed as an attribute.
Note that spaces around the = sign and line breaks between tags do not matter in
the layout. Attribute values must, however, always be enclosed in quotes.
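
The choice between element and attribute changes how a program gets at the value. The following is a minimal sketch using the DOM support in the standard Java library; the class and method names are hypothetical, the two strings are the fragments from this and the previous slide with the elided lines left out, and the attribute value is quoted because XML requires it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Illustrative sketch: reads the price from each of the two layouts on the slides.
class PriceLookupSketch {
    static final String ELEMENT_FORM =
        "<PRODUCT><PRODUCTNAME> Coat blue </PRODUCTNAME>"
        + "<PRODUCTPRICE> 34000</PRODUCTPRICE></PRODUCT>";
    static final String ATTRIBUTE_FORM =
        "<PRODUCT><PRODUCTNAME PRICE = \"3400\"> Coat blue </PRODUCTNAME></PRODUCT>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("source is not well-formed XML", e);
        }
    }

    // Price held as the text of a nested PRODUCTPRICE element.
    static String priceFromElement(String xml) {
        return parse(xml).getElementsByTagName("PRODUCTPRICE").item(0).getTextContent().trim();
    }

    // Price held as the PRICE attribute of the PRODUCTNAME element.
    static String priceFromAttribute(String xml) {
        Element name = (Element) parse(xml).getElementsByTagName("PRODUCTNAME").item(0);
        return name.getAttribute("PRICE");
    }

    public static void main(String[] args) {
        System.out.println(priceFromElement(ELEMENT_FORM));
        System.out.println(priceFromAttribute(ATTRIBUTE_FORM));
    }
}
```

Note that the spaces inside the element text are part of the data and have to be trimmed by the program, whereas the spaces around the = sign are discarded by the parser.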


Some examples of XML-defined languages

Banking Industry Technology Secretariat
Financial Exchange
Bank Internet Payment System
Product Data Markup Language
The Text Encoding Initiative
Schools Interoperability Framework

There are hundreds of XML-defined languages. Some are shown above. The area
is so dynamic that languages are created and buried on a daily basis.


The Document Type Definition (DTD)

An example
<?xml version = "1.0" encoding = "UTF-8"?>
<!DOCTYPE ENTRY [
<!ELEMENT ENTRY (ENTRYPAIR*)>
<!ELEMENT ENTRYPAIR (NAME, DEFINITION)>
<!ELEMENT NAME (#PCDATA)>
<!ELEMENT DEFINITION (#PCDATA)>
]>

The definition of a language is expressed in a document known as a Document
Type Definition (DTD). The slide above shows an example of a short DTD. It
defines what an ENTRY looks like: that it contains a sequence of ENTRYPAIRs.
Each ENTRYPAIR will contain a NAME and a DEFINITION with the NAME
being expressed as a string and the DEFINITION being expressed as a string.

Source defined by the previous DTD
<ENTRY>
  <ENTRYPAIR>
    <NAME>Dodo</NAME>
    <DEFINITION> A dead bird</DEFINITION>
  </ENTRYPAIR>
  <ENTRYPAIR>
    <NAME>Blackbird</NAME>
    <DEFINITION> A thieving bird</DEFINITION>
  </ENTRYPAIR>
  <ENTRYPAIR>
    <NAME>Peacock</NAME>
    <DEFINITION> An attractive bird</DEFINITION>
  </ENTRYPAIR>
</ENTRY>

Here the source described by the DTD in the previous slide is displayed. The
indentation has been provided by me for readability: white space between
elements is ignored by XML processors.

Another example

<!ELEMENT TOWN (COUNTY, POPULATION)>
<!ATTLIST TOWN NAME CDATA #REQUIRED>
..
Attribute list

Here a town is defined. The definition specifies that a town will consist of a
county and a population. The second line details an attribute list for town. The
list contains a single attribute called NAME which is always required.
The remainder of the definition is not shown.

An example of a Town
<TOWN NAME = "Kettering">
<COUNTY> Northamptonshire </COUNTY>
<POPULATION>23000</POPULATION>
</TOWN>


Here is an example of a town defined by the DTD fragment shown on the
previous page. Note the attribute NAME which provides the name of the town.
If this is omitted then an error is reported; how this is done is detailed in some
of the following slides.
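
A minimal sketch of how such an error surfaces, using the validating parser built into the standard Java library. The class and method names are hypothetical. The slide does not show the rest of the TOWN definition, so the two #PCDATA declarations for COUNTY and POPULATION below are assumptions added only to make the DTD complete.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Illustrative sketch: validates a TOWN against a DTD held in the same source.
class TownValidationSketch {
    // COUNTY and POPULATION declarations are assumptions; the slide elides them.
    static final String DTD =
        "<!DOCTYPE TOWN [\n"
        + "<!ELEMENT TOWN (COUNTY, POPULATION)>\n"
        + "<!ATTLIST TOWN NAME CDATA #REQUIRED>\n"
        + "<!ELEMENT COUNTY (#PCDATA)>\n"
        + "<!ELEMENT POPULATION (#PCDATA)>\n"
        + "]>\n";

    // Returns the validation messages reported by the parser (empty if valid).
    static List<String> validationErrors(String townSource) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);   // check the source against the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) { errors.add(e.getMessage()); }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + townSource)));
        } catch (Exception e) {
            errors.add(e.getMessage());    // not well-formed, or the parser could not be set up
        }
        return errors;
    }

    public static void main(String[] args) {
        String body = "<COUNTY> Northamptonshire </COUNTY><POPULATION>23000</POPULATION></TOWN>";
        System.out.println(validationErrors("<TOWN NAME = \"Kettering\">" + body));
        System.out.println(validationErrors("<TOWN>" + body));
    }
}
```

The first call should report nothing; the second should report that the required NAME attribute is missing. The exact wording of the message depends on the parser.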


A full DTD
<?xml version = "1.0" standalone = "yes"?>
<!DOCTYPE BOOKLIST [
  <!ELEMENT BOOKLIST (BOOK)*>
  <!ELEMENT BOOK (TITLE, AUTHORS, PRICE, PUBLISHER)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT AUTHORS (#PCDATA)>
  <!ELEMENT PRICE (#PCDATA)>
  <!ELEMENT PUBLISHER (#PCDATA)>
  <!ATTLIST PRICE
    AMOUNTCURRENCY CDATA #REQUIRED
    DISCOUNT CDATA "0"
  >
]>
Root: BOOKLIST

Here is an example of a full DTD. The first line specifies the version of XML
used and states that the source for the DTD will be found following it in the same
file: the alternative is to place it in another file. The next line details the root
element which lies at the top of the definition hierarchy.
The remainder of the lines specify that a book contains a sequence of title,
author, price and publisher details, with price having two attributes: a currency
designator and a discount which, if it is not present, will be zero.

Example of book source

<BOOKLIST>
  <BOOK>
    <TITLE>The Endless Path</TITLE>
    <AUTHORS>Jones</AUTHORS>
    <PRICE AMOUNTCURRENCY = "Pounds" DISCOUNT = "5">200</PRICE>
    <PUBLISHER>Pearson</PUBLISHER>
  </BOOK>
  <BOOK>
    <TITLE>My Story</TITLE>
    <AUTHORS>Roberts</AUTHORS>
    <PRICE AMOUNTCURRENCY = "SW Francs">500</PRICE>
    <PUBLISHER>McMillan</PUBLISHER>
  </BOOK>
</BOOKLIST>

Here a BOOKLIST is shown containing two books. Note the optional DISCOUNT
attribute on the first PRICE; the second book omits it, so its discount takes the
default value of zero.
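
The DTD on the previous slide declares DISCOUNT as an attribute of PRICE with a default of "0". The sketch below, which uses the DOM support in the standard Java library, shows the default being supplied by the parser when the attribute is left out. The class and method names are hypothetical, and the source string is a compact version of the two books with the discount written as an attribute, as the DTD requires.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative sketch: attribute defaults declared in a DTD are filled in by the parser.
class BookDiscountSketch {
    static final String BOOKLIST =
        "<!DOCTYPE BOOKLIST [\n"
        + "<!ELEMENT BOOKLIST (BOOK)*>\n"
        + "<!ELEMENT BOOK (TITLE, AUTHORS, PRICE, PUBLISHER)>\n"
        + "<!ELEMENT TITLE (#PCDATA)>\n"
        + "<!ELEMENT AUTHORS (#PCDATA)>\n"
        + "<!ELEMENT PRICE (#PCDATA)>\n"
        + "<!ELEMENT PUBLISHER (#PCDATA)>\n"
        + "<!ATTLIST PRICE AMOUNTCURRENCY CDATA #REQUIRED DISCOUNT CDATA \"0\">\n"
        + "]>\n"
        + "<BOOKLIST>"
        + "<BOOK><TITLE>The Endless Path</TITLE><AUTHORS>Jones</AUTHORS>"
        + "<PRICE AMOUNTCURRENCY=\"Pounds\" DISCOUNT=\"5\">200</PRICE>"
        + "<PUBLISHER>Pearson</PUBLISHER></BOOK>"
        + "<BOOK><TITLE>My Story</TITLE><AUTHORS>Roberts</AUTHORS>"
        + "<PRICE AMOUNTCURRENCY=\"SW Francs\">500</PRICE>"
        + "<PUBLISHER>McMillan</PUBLISHER></BOOK>"
        + "</BOOKLIST>";

    // Returns the DISCOUNT attribute of every PRICE element, in document order.
    static List<String> discounts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList prices = doc.getElementsByTagName("PRICE");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < prices.getLength(); i++) {
                result.add(((Element) prices.item(i)).getAttribute("DISCOUNT"));
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("source is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discounts(BOOKLIST));
    }
}
```

The second book does not mention DISCOUNT at all, yet the program still sees a value for it because the parser reads the default from the DTD.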


Checking XML source


Carried out by a parser, validating or non-validating
A number of high quality parsers available,
for example Xerces from the Apache
foundation
APIs integrate with a parser in order to
carry out processing when a particular
element is encountered

The source of an XML-defined language is processed by a program known as a
parser. A parser can be validating or non-validating: the former checks the source
against the DTD definition, while the latter carries out rudimentary checks, for
example that an ending tag matches a starting tag.
There are a number of high quality parsers available. A good one is Xerces which
is provided as open source by the Apache foundation.
These days there are a number of standard interfaces which allow a parser to be
integrated with an API that carries out the processing of the XML source. For
example such APIs contain methods that are executed when an error is found in
the XML source.
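
The rudimentary checks made by a non-validating parser can be seen in a short sketch. It uses the default (non-validating) parser in the standard Java library; the class and method names are hypothetical, and the TOWN fragments are illustrative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch: a non-validating parser only checks that the source is well formed.
class WellFormedSketch {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler());   // rethrows fatal errors, prints nothing
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;                                    // for example a mismatched end tag
        } catch (Exception e) {
            throw new IllegalStateException("parser could not be set up or read", e);
        }
    }

    public static void main(String[] args) {
        // Breaks the DTD (POPULATION is missing) but every tag is properly closed.
        String invalidButWellFormed =
            "<!DOCTYPE TOWN [<!ELEMENT TOWN (COUNTY, POPULATION)>"
            + "<!ELEMENT COUNTY (#PCDATA)><!ELEMENT POPULATION (#PCDATA)>]>"
            + "<TOWN><COUNTY>Northamptonshire</COUNTY></TOWN>";
        System.out.println(isWellFormed(invalidButWellFormed));
        // The COUNTY element is never closed.
        System.out.println(isWellFormed("<TOWN><COUNTY>Northamptonshire</TOWN>"));
    }
}
```

The first source is accepted even though it does not match its DTD; only a validating parser would object to it. The second is rejected by any parser.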


Lecture 11
XML (ii)


Aims
To examine some of the XML processing
models
To look in some detail at XSLT
To look at the application of XML in
document processing


A typical API

Will read XML source sequentially
Detect tags
Detect attributes
Extract out string values
Two main models: event and tree-based
Described later

There are a number of APIs for processing XML source in a variety of languages.
They enable the programmer to process the XML source in two ways: by reading
the source sequentially or by accessing some tree representation.


Parsers

Diagram: an XML language processor takes a DTD for language rules and
language source, and produces output and errors; a programming language
processor takes language source, and produces output (object code) and errors.

This slide shows the comparison between a programming language processor and
a processor for an XML-based language. The major difference is that with a
programming language the syntax rules are hard-wired into the program code of
the processor rather than being stored in some equivalent of the DTD.


Some parsers

Xerces
IBM XML4Java
XP
OpenXML
Lark
Oracle XML Parser

List of parsers

The slide shows a number of parsers; many are free or open source. The most
used is Xerces.

Processing XML

An API carries out the processing
Two styles of API: SAX and DOM
SAX is event based
DOM is in-memory traversal based

If you want to process the source of an XML-based language then you need to
use an API that integrates with a parser. There are two styles of API. SAX is an
event-based API. Methods provided as part of the API are triggered when an
event such as a start tag is encountered. It is based on a callback model of
processing: the parser calls your methods as it reads the source.
The other style of processing is exemplified by the DOM (Document Object
Model). Here the XML-based language source is held in memory and is traversed
by tree processing code.

SAX-based processing
Diagram: the Parser reads the DTD and the XML-based source, reports Errors
and passes Events to Processing code from an API, which produces any outputs.

Here the parser takes the source of an XML-based language, checks it against
the DTD for the language and every time an event occurs it sends notification to
processing code which is written using some event-based API. Typical events
include:
The start of the source
Encountering a start tag
Encountering some character data
Encountering an end tag
Encountering an error such as a missing end tag

Event processing
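public void startElement(String tagName){
    // code executed when a start tag is encountered
}

public void charData(char[] str, int first, int last){
    // code executed when character data is read
}

public void endDocument(){
    // code executed when the end of the document is reached
}

Code executed when events occur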

429

Here three Java methods are shown which respond to three events signalled by a
parser: the start of an element such as TOWN, the reading of character data such
as that found in <TOWN> Kettering</TOWN> and the end of the XML-based document.
Code is placed in these methods which carries out the processing required.
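The event handlers on the slide can be tried out with the SAX API that ships
with Java. The following is a minimal sketch, not the API used on the slide: the
standard SAX method names are startElement, characters, endElement and
endDocument, and the class name TownHandler and its parse helper are
hypothetical names introduced for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records each event the parser signals.
class TownHandler extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("start:" + qName);          // a start tag was encountered
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            events.add("text:" + text);        // character data was read
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);            // an end tag was encountered
    }

    @Override
    public void endDocument() {
        events.add("endDocument");             // the whole source has been read
    }

    // Parses the given XML text and returns the recorded events in order.
    static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TownHandler handler = new TownHandler();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.events;
    }
}
```

Calling TownHandler.parse("<TOWN> Kettering</TOWN>") drives the four handlers
in the order in which the parser meets the corresponding parts of the source.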

429

Tree-based processing

[Diagram: tree representing XML source. The root node BOOKSELLER has a sub node
STOCK, which in turn has the sub nodes BOOKLIST and CDLIST.]

430

A tree representation of some XML source is shown above. It shows the root
element BOOKSELLER which is defined in terms of four other elements (three of
whose details are not shown). One of these elements is STOCK which contains a
BOOKLIST element and a CDLIST element. These, in turn, might be defined in terms
of further elements.
Tree-based processing moves over these elements and carries out actions based
on what it encounters, for example a start tag.

430

Tree-based processing
Parser builds up some tree-based
representation of the source.
The tree is stored in memory
The API provides facilities which enable
the programmer to traverse over the source.


431

In tree-based processing the parser sequentially processes the source and builds
up a representation of the source as a tree with multiple pointers to low level
nodes. The tree is stored in memory which allows easy backtracking. A typical
API which provides facilities for processing this tree will provide code for
traversing a tree and discovering information about a node and its sub-nodes.
This is shown on the next slide.

431

Processing a tree
[Diagram: a Department node with Employees sub-nodes. Traverse the sub-nodes to
extract all the employees for a particular department.]

432

An example of tree processing is shown above, where a department consists of
employees. A tree-based scheme enables the child nodes to be processed
sequentially once access has been gained to the parent node. The parent node in
this case is the department.

432

Facilities for tree processing


Move from one node to a sub-node
Move along the children of a node
Extract information from a node such as a
list of attributes associated with the node
Backtrack to a node


433

A good API for XML-based tree processing should provide a wide variety of
facilities for accessing such trees:
Facilities which extract out information from a node, for example, the element
associated with the node, the attributes of a node and the number of children
parented by the node.
Facilities which enable the programmer to extract out the children of a node and
then process each of them individually.
Facilities which extract out data for a node, for example the string associated
with the node.
Facilities which enable the programmer to move up and down the tree.
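The facilities listed above can be seen in the DOM API that ships with Java. The
following is a minimal sketch of the department example, assuming hypothetical
element names DEPARTMENT and EMPLOYEE and a hypothetical name attribute; the
class DepartmentTree is introduced only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DepartmentTree {
    // Builds the in-memory tree, then walks the children of the root node.
    static List<String> employeeNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Element department = doc.getDocumentElement();       // the parent node
        String deptName = department.getAttribute("name");   // information held on a node
        List<String> names = new ArrayList<>();
        NodeList children = department.getChildNodes();       // move along the children
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && child.getNodeName().equals("EMPLOYEE")) {
                // the string associated with the node
                names.add(deptName + ": " + child.getTextContent().trim());
            }
        }
        return names;
    }
}
```

Because the whole tree is held in memory, the code is free to revisit the
department node or any employee node after the loop has finished.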

433

Event vs tree-based processing


Events are in a sense a more natural
phenomenon to program. Trees take more
programming.
Event-based programs often contain lots of
flags, while tree-based programs do not.
Trees can take up a huge amount of
memory with consequent performance
problems.

434

The programmer is faced with a choice of whether to program an XML
application using events or by processing a tree. Each has its own advantages and
disadvantages. Tree-based processing is often recursive and difficult to program,
while events are simple in concept and simple to program: all that is required is
to put processing code in an event handler.
However, event-based programs are often littered with flags which record
where the processing of the XML source has reached, for example
identifying the element in which the processing is currently embedded.
Trees are placed in memory and for large source files this can use up a lot of
space leading to major performance problems. This problem is not encountered
with event-based programming where small amounts of source are held in
memory.

434

XQuery
Is a form of query language which is used to
interrogate documents in XML
Gradually achieving some importance,
although it still has a long way to go before it
challenges SQL
Often used with XML databases, for
example with XML-BDB

435

Xlinks and Xpointers


XML used for specifying the links between
documents
More flexible than HTML linking
Enables multi-linking
Enables more specific linking, for example
17th para on page 12
Used for footnotes, end notes and indexes

436

Scalable Vector Graphics


Vector graphic standard, not bitmapped graphic
Bitmaps suffer from zooming problems and also
occupy a large amount of storage
Vector graphics contain drawing instructions
SVG an attempt to devise a standard
SVG can be catalogued
Introduction to SVG

437

There are two types of graphics found on the Internet: bit-mapped graphics, where
colour information is held for each dot or pixel, and vector graphics
containing drawing instructions, for example instructions to draw lines or circles.
Vector graphics have a number of advantages compared with bit-mapped
graphics:
They use much less memory
They are interpretable by programs such as search engine spiders
They can be zoomed without losing detail
There has been a major shortage of standards in graphics and SVG is the World
Wide Web Consortium's attempt to remedy it. It is a language based on XML.

437

SVG source
<SVG width = "3in" height = "2in">
<DESC>
This is a sample circle
</DESC>
<G>
<CIRCLE style = "fill: red; stroke: black" cx = "100"
cy = "100" r = "100"/>
</G>
</SVG>


438

Here an example of SVG source is displayed. The first line sets the width and
height of the drawing and the main part of the source draws a circle centred at
position 100, 100 with a radius of 100. <DESC> marks descriptive text.

438

The channel definition format (CDF)


Used for pushing information to clients
rather than the clients pulling
Active channel technology developed by
Microsoft
Based on XML
Allows users to subscribe to a channel


439

The next XML-based language is used to control the channels that make up the
CDF technology. This is a push technology which is used to broadcast
information to subscribers who inform a publisher they wish to receive
information or data. The technology was developed by Microsoft.

439

The source (i)


<CHANNEL>
<TITLE>
News Headlines
</TITLE>
<SCHEDULE>
<INTERVALTIME HOUR = "1"/>
</SCHEDULE>
<CHANNEL>
<TITLE> News </TITLE>
<ITEM HREF = "http://www.newsservice.com/british">
<TITLE> British News </TITLE>
</ITEM>


440

Here a channel is defined which broadcasts every hour. The channel takes two
sources, one of which is shown here (www.newsservice.com). The other is shown
on the next slide.

440

The source (ii)


<ITEM HREF = "http://www.newsservice.com/american">
<TITLE> American News </TITLE>
</ITEM>
..
</CHANNEL>
..
</CHANNEL>


441

Here the remainder of the source is shown. It specifies the second news source
and concludes with all the end tags that are required.

441

ebXML

ebXML is an XML-based language used for implementing electronic
business applications.

Discovering the products and services that are being offered or
required by each party.

Discovering what common information is required to carry out the
business transactions that are required.

Establishing how communication is to be carried out: for example who
is to carry out the communication and the format of the messages that
need to be sent.

Agreeing on the contractual documents and contractual processes.


442

ebXML is an XML-based language used for implementing electronic business
applications.
ebXML provides mechanisms whereby a communication infrastructure can be set
up between the parties to an electronic business transaction. This enables some
form of transport mechanism to be agreed on, the provision of software which
handles incoming and outgoing messages and the provision of software which
interfaces to existing applications.
The ebXML standard also provides facilities for a company to define business
processes such as receiving requests for services such as the purchase of some
bulk commodity, responding to these requests and facilities for defining reusable
business objects.
A major component of the ebXML standard is that it defines how an electronic
marketplace can be set up. Such a marketplace will rely on centrally stored data
which enables companies to register and discover information about each other,
for example what services they offer. Such a marketplace relies heavily on some
directory service.

442

An example
<BusinessTransaction name = "Create Order">
<RequestingBusinessActivity name =""
isNonRepudiationRequired = "true"
timeToAcknowledgeReceipt = "P2D"
timeToAcknowledgeAcceptance = "P3D"
>
<DocumentEnvelope
BusinessDocument = "Purchase Order"/>
</RequestingBusinessActivity>
..


443

Here a purchase order is created for which non-repudiation is required
(isNonRepudiationRequired = "true"), so that the sender cannot later deny having
issued it. The limit on acknowledging receipt is 2 days (P2D), with the time to
acknowledge acceptance of the order being 3 days (P3D).
The code above would be followed by more text which defined how the business
responding to the transaction should act.

443

XSL transformations
So far I have dealt with hand crafting
transformations
There is a technology which allows you to
format documents
The technology is XSL, Extensible Stylesheet
Language
Supported by the WWW consortium

444

So far in this lecture and the previous one I have looked at the use of APIs in
processing XML source. A typical process might be the transformation into some
publishing format such as RTF. Because transformation into some publishing
format is a frequent process a technology has been devised which automates quite
a lot of the process. It means that you do not have to hand craft code using
either DOM or SAX-based APIs.
The technology is known as the Extensible Stylesheet Language (XSL).

444

XSL
Two components: a transformation language
and a formatting language
Both can be used together or separately
The components are known as XSLT and
FOP
Both relatively new


445

When you are going to take some XML source and transform it in some way you
need two different processes. The first is the transformation into a different form
such as HTML, the second is to format the document for visual presentation.
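Different components of XSL carry this process out. XSLT carries out the
transformation while the formatting language (XSL-FO, referred to in these notes
as FOP, the name of the Apache processor for it) carries out the visual display.
Both these technologies are in a developmental state.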

445

Some example source to be formatted

<PLANETS>
<PLANET>
<NAME> Mercury</NAME>
<MASS> .0553</MASS>

</PLANET>

</PLANETS>


446

The slide above gives a simple example of source that is to be formatted using
XSLT. It describes a sequence of planets each planet containing elements which
describe its physical properties. The mass is given as a proportion of the Earth's mass.

446

The transformations (i)


<?xml version = "1.0"?>
<xsl:stylesheet version = "1.0"
xmlns:xsl = "http://www.w3.org/1999/XSL/Transform">
<xsl:template match = "PLANETS">
<HTML>
<xsl:apply-templates/>
</HTML>
</xsl:template>


447

Here is the first part of a transformation which takes source expressed in the
language detailed by the previous slide. The transformation is very simple: it just
takes the source, strips out the planet names and creates an HTML document with
each name in a separate paragraph.
The second and third lines open the style sheet and declare the namespace that
identifies XSLT elements.
The first template carries out a pattern match on the <PLANETS> element; it
issues an HTML start tag, applies the transformations to the rest of the source
and then issues an HTML end tag.
The next slide looks at the actual transformation of a <PLANET> element.
This and the next slide are taken from Inside XML, Holzner, New Riders 2000.

447

The transformations (ii)

<xsl:template match = "PLANET">
<P>
<xsl:value-of select = "NAME"/>
</P>
</xsl:template>
</xsl:stylesheet>


448

Here the <PLANET> element has been matched. The code then selects the value of
the element <NAME>, for example Saturn and then displays it enclosed by the
HTML <P> </P> tags indicating that the text is to be treated as a paragraph.
The two final lines are just the end tags for the transformation.
This slide is taken from Inside XML, Holzner, New Riders 2000.
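The two-template style sheet can be run as a standalone Java program with the
javax.xml.transform API. The following is a minimal sketch: the style sheet text
is the one from the two slides with attribute quotes restored, while the class
name PlanetTransform and its transform helper are hypothetical names used only
for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class PlanetTransform {
    // The style sheet from the slides, held as a string for convenience.
    static final String STYLESHEET =
          "<?xml version=\"1.0\"?>"
        + "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"PLANETS\">"
        + "<HTML><xsl:apply-templates/></HTML>"
        + "</xsl:template>"
        + "<xsl:template match=\"PLANET\">"
        + "<P><xsl:value-of select=\"NAME\"/></P>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the style sheet to the given XML text and returns the result.
    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

Passing the planets source to PlanetTransform.transform yields an HTML document
in which each planet name sits in its own paragraph; the MASS element is dropped
because no template selects it.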

448

XSLT transformations carried out by

A server
A client
Stand alone programs (most advanced)


449

There are three ways that XML source can be transformed:


A server program, for example a servlet or a Perl script can use a style sheet to
transform a document and send it to any client.
A client program such as a browser can carry out the transformation, for
example IE5 can handle many transformations in this way.
A standalone program written in a programming language (usually Java which
has the most advanced facilities for XSLT processing) can carry out the
transformation.

449

Some XSL facilities

xsl:value-of
xsl:for-each
select (an attribute of xsl:value-of)
xsl:sort
xsl:if


450

There are a large number of facilities in XSLT; this slide gives only a small
flavour:
The value-of facility allows the XSLT programmer to select the value of a
particular element.
The for-each facility provides a way of iterating over sequences of elements
with the same name. It is like a for statement.
The select attribute of value-of (and of for-each) holds the pattern which
identifies the elements to be processed.
The sort facility allows nodes to be sorted.
The if facility acts very much like an if statement allowing choices based on
some conditional criterion.

450

XSL formatting objects (FOP)


Usually used in conjunction with XSLT
Defined in XML
Contains facilities for specifying formatting
detail such as font, margins, point size
indentation etc.
Aim to convert into a number of target
formats, currently pdf only mature one

451

XSL formatting objects are usually used as a back end to XSLT in that formatting
objects are referred to in XSLT source.
Like XSLT, FOP is defined in terms of XML.
FOP is a publishing technology and contains facilities for laying out text on a
page or on a screen.
Currently there are a number of efforts aimed at targeting particular formats for
display; however, the only really mature one is that targeted at pdf.

451

An example
<xsl:template match = "PLANET/MASS">
<fo:block font-size = "36pt" line-height = "48pt"
font-family = "sans-serif">
Mass (Earth = 1):
<xsl:apply-templates/>
</fo:block>
</xsl:template>


452

This XSLT source defines what formatting is to be applied when a <MASS>
element is found within a <PLANET> element. It will be displayed in a 36pt
font which is sans serif within a line which is 48pt high.
The text Mass (Earth = 1): is placed before the text which is displayed.
This slide is taken from Inside XML, Holzner, New Riders 2000.

452

Some FOP formatting objects

block
footnote
list-block
table
title
page-number
leader

453

Some of the 56 formatting objects found in FOP are shown above.

block          creates a display block
footnote       creates a footnote citation and the associated text of the footnote
list-block     is used to format a list
table          formats the data found in a table
title          gives a document a title
page-number    holds the number of the page currently being processed
leader         creates a row of repeating characters

453

Some formatting properties

border
font
font-size
column-width
column-gap
margin-bottom
odd-or-even

454

The formatting objects that I described on the previous page all have properties
that can be altered by the XSLT programmer. Some of the large number of
properties are shown above:
border         gives the size of a border
font           gives the font to be used
font-size      gives the size of a font
column-width   gives the width of a column in a table
column-gap     gives the gap between columns in a table
margin-bottom  defines the size of the bottom margin in a page
odd-or-even    designates whether a page is going to be odd or even numbered

454

Using XSLT and FOP

Define the DTD schema for the document


Write the XSLT transformations
Add the formatting statements to the XSLT
Process the source and the definitions to
transform into some output format such as
pdf

455

There are a number of steps that are needed for developing a formatted
document:
First, define the DTD for the documents that are to be processed
Develop the XSLT transformations
Include the formatting statements using FOP.
Process the resulting definitions and the source with a software tool which is
targeted at some format such as pdf; for example the Apache tool FOP (Java
package org.apache.fop.apps) carries out this transformation under UNIX and Linux.

455

Web publishing frameworks


Maintenance of a large number of DTDs
and associated source documents
Handling of the incremental change process
Initiation of transformation process for
various target formats
Deployment of published version of some
base document to the Web or to files or to
paper

456

There are a small number of Web publishing frameworks and tools being
developed. These provide a large degree of automation for the tasks specified on
the previous slide. The user of a Web publishing tool will be able to:
Maintain a large corpus of XML-based source, DTDs, XSLT transformations and
data files. The framework should provide facilities whereby processes such as
modifying families of files are easy.
Enable the transformation process and provide facilities for specifying what is to
be transformed, in what way and where initially the result of the transformation
should be placed.
Enable the process of deploying documents mainly to the Web but also to data
files, relational databases and to paper copies.

456

Cocoon
Open source publishing framework
Easy creation of XML
Easy transformation of XML to a variety of
formats.
Relatively easy deployment of documents to
Web servers


457

Cocoon is a Web publishing tool which contains a good subset of the facilities
detailed in the previous slide. It is part of the Apache open source project and is
gradually being transformed into a pure Java project. It was one of the first Web
publishing frameworks to be deployed commercially.

457

XHTML

Biggest XML application today


W3C's implementation of HTML 4.0 in XML
Allows extension of HTML
XHTML documents can be displayed in today's
browsers
Can be checked for well-formedness
Also XHTML Basic for PDAs etc.
At last a standard for web pages

458

Before leaving XML as a markup technology it is worth saying something about
XHTML. This is the largest of the XML applications and addresses one of the
problems alluded to in the previous lecture: that of incompatibilities in HTML
and browser vagaries in display.
It has been developed by the World Wide Web Consortium as a standard. It has
been implemented using XML and can be extended and extensions viewed in a
browser (to come).
There is also a subset of XHTML intended for hand-held devices and pdas.

458

Lecture 12
Concurrency


459

459

Aims

To outline the nature of local concurrency


To detail the implications of distributed concurrency
To describe the effect that locking has on performance
To show how distributed systems handle problems such as
deadlock


460

460

Concurrency
Fact of life for single computers, also fact of
life for distributed systems
Concurrency involves the execution of
separate chunks of code
Used to take advantage of slack time on a
single computer
Encountered as a series of remote programs
in a distributed system

461

In a distributed system there will be a number of activities being carried out at the
same time. For example, a number of clients may be accessing the same server
asking for a particular service to be provided. Such concurrent activities provide
both opportunities for the developer to optimise a distributed system, but at the
same time pose some sophisticated and tricky problems. This lecture looks at the
topic of concurrency and how it affects a distributed system. It first looks at
concurrency occurring at a single computer and very briefly examines the Java
facilities for defining and controlling concurrent operations. It then looks at how
a distributed system can be viewed as a set of concurrent processes running on a
wide variety of computers.

461

Single computer concurrency


Used to take advantage of slack processor
time
Implemented in Java as threads.
Used, for example, when a network
connection is being established or when a
file is being written to


462

In a single computer concurrency is used in order to take advantage of the fact
that for certain operations there will be a hiatus before they are completed, for
example when writing to a file.
In Java and a number of other programming languages concurrency is
implemented as threads: separate loci of control. In Java, threads are
defined by means of the Thread class.

462

An example of a Java thread class (i)


class ThreadDemo extends Thread
{
private String name;
public ThreadDemo(String name)
{
this.name = name;
}
public void run()
{
int count = 100;


463

There are two ways of implementing threads in Java. The first is to inherit from
the class Thread found in the package java.lang. This class has a method
run which needs to be overridden with the code for a thread. An implementation
of a simple thread is shown above.

463

An example of a Java thread class (ii)
while (count>0)
{
count--;
..
Thread.sleep(100);
System.out.println("Thread " + name);
..
}
}
}


464

This is a simple class which just has a single constructor and a single instance
variable which identifies the thread. The code for the thread can be found within
the method run. This code is straightforward: it loops a hundred times; each
time it moves through the loop it sleeps for 100 milliseconds (the method sleep
takes a long argument which is the number of milliseconds to sleep) and then
displays a message identifying the thread.

464

Creating a thread and starting it

ThreadDemo th1 =
new ThreadDemo("First example");
..
th1.start();

May not start immediately: decision of the scheduler


465

All this code does is to create a thread object and then start it executing. When
there are a number of threads a scheduler will determine which of them is to be
given control of the hardware processor.
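The notes mention two ways of implementing threads but show only inheritance
from Thread. The other standard way is to implement the Runnable interface and
hand the object to a Thread. The following is a minimal sketch of that approach;
the class name RunnableDemo, the shared log and the runTwo helper are
hypothetical names introduced for illustration, and the loop is shortened to
three iterations.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class RunnableDemo implements Runnable {
    private final String name;
    private final List<String> log;   // thread-safe list shared by the threads

    RunnableDemo(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override
    public void run() {
        for (int count = 3; count > 0; count--) {
            try {
                Thread.sleep(10);      // give other threads a chance to run
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            log.add("Thread " + name);
        }
    }

    // Starts two threads, waits for both to finish and returns what they logged.
    static List<String> runTwo() throws InterruptedException {
        List<String> log = new CopyOnWriteArrayList<>();
        Thread th1 = new Thread(new RunnableDemo("First example", log));
        Thread th2 = new Thread(new RunnableDemo("Second example", log));
        th1.start();                   // the scheduler decides when each one runs
        th2.start();
        th1.join();
        th2.join();
        return log;
    }
}
```

The order in which the two threads add their entries is not fixed: it depends on
the scheduler, which is the point made on the slide.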

465

Thread APIs
Most programming languages have thread APIs,
some examples from Java are shown below
destroy
setPriority
getName
setName
start


466

Some examples of thread methods are described above. They are:


destroy. This method destroys a thread.
setPriority. This changes the priority of a thread. The priority of a thread will
determine how much of the processor a thread will gain: the higher the priority of
a thread the more cycles of a processor the thread will be given by the scheduler.
There are three static constants associated with thread priorities:
MAX_PRIORITY which is the maximum priority for a thread, MIN_PRIORITY
which is the minimum priority and NORM_PRIORITY which is the normal
priority of a thread, the one that it is given when it is created.
getName. This returns a string which is the name of the thread. The name of a
thread can be given when it is created as some of its constructors have a string
argument which represents the name of the thread.
setName. This method gives a thread a name. The name is the single string
argument of the method.

466

A major problem with concurrency


Inconsistent updates
Common data

Threads or processes

467

A major problem that is encountered in concurrent processing is that of
inconsistent updates, where a thread takes the value of a shared object, updates it
and writes it back. In this time another thread may have started to access the data
and taken a temporary copy. When the two threads have completed their
processing the second thread may have based its work on an old value of the
shared data.

467

A solution - locking
Locking ensures that inconsistent updates
do not occur
Locking can affect both in-memory and file-based data.
A poor locking strategy can have a major
effect on the performance of a system: both
on a single computer and distributed
Lock can be a read lock or a write lock

468

Locking is the process whereby concurrent code is allowed single access to
shared data: no other code is allowed to access the shared data when the locked
code is accessing it. The access code is known as a critical region.
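In Java a critical region is most simply written with the synchronized keyword,
which locks the object for the duration of the method. The following is a
minimal sketch; the class name SharedCounter and the runThreads helper are
hypothetical names introduced for illustration.

```java
class SharedCounter {
    private int value = 0;

    // Critical region: only one thread at a time can execute this method
    // on a given SharedCounter, so the read-update-write cannot interleave.
    synchronized void increment() {
        value++;
    }

    synchronized int get() {
        return value;
    }

    // Runs several threads which all update the same counter.
    static int runThreads(int threads, int incrementsEach) throws InterruptedException {
        SharedCounter counter = new SharedCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsEach; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();             // wait for every thread to finish
        }
        return counter.get();
    }
}
```

With the lock in place the final total always equals threads multiplied by
incrementsEach. Without synchronized on increment, some updates could be lost
in exactly the way described under inconsistent updates.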

468

Locking queues
[Diagram: one thread is accessing the shared data; the threads waiting for
access to the shared data are held, suspended, in a thread queue.]

469

When a thread is accessing shared data and a number of other threads want
access to that data then they are placed on a lock queue until the current
accessing thread signals that it has finished.
These queued threads are suspended and cannot make any progress.

469

Locking decisions
Operation1    Operation2    Conflict
Read          Read          No
Write         Read          Yes
Write         Write         Yes

Type of lock set    Read lock requested    Write lock requested
None                Allowed                Allowed
Read                Allowed                Wait
Write               Wait                   Wait

470

The tables above show whether conflicts exist when reading and writing occur
and what a good locking scheme should allow.
Looking at the top table it is clear that in order to minimise the holding of locks a
suitable concurrency control scheme should take cognisance of the fact that a
large number of transactions could be simultaneously reading data but that only
one transaction would be able to write data.
Such a scheme implements what is known as the many reader single writer
scheme. The most popular way of implementing this is via two types of lock: a
read lock and a write lock.
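The "Read" row of the second table can be observed with Java's
ReentrantReadWriteLock, which provides a read lock and a write lock of the kind
described. The following is a minimal sketch: the class name ReadWriteDemo and
its helper method are hypothetical, and tryLock is used so that a request which
would have to wait simply reports false instead of blocking.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class ReadWriteDemo {
    // While the calling thread holds a read lock, a second thread asks for a
    // read lock and then a write lock. Returns {readGranted, writeGranted}.
    static boolean[] requestsWhileReadLocked() throws InterruptedException {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        boolean[] granted = new boolean[2];
        lock.readLock().lock();                      // first reader
        try {
            Thread other = new Thread(() -> {
                granted[0] = lock.readLock().tryLock();   // second reader
                if (granted[0]) {
                    lock.readLock().unlock();
                }
                granted[1] = lock.writeLock().tryLock();  // would-be writer
                if (granted[1]) {
                    lock.writeLock().unlock();
                }
            });
            other.start();
            other.join();
        } finally {
            lock.readLock().unlock();
        }
        return granted;
    }
}
```

The read request is granted because read locks can be shared; the write request
is refused because a write lock conflicts with the read lock still held by the
first thread.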

470

Lock management
Many reader-single writer scheme usually
implemented
Strict two phase locking a popular solution
Two phase locking is based on lock
promotion


471

A strict two phase scheme operates in the following way:
If an operation within a transaction accesses a data item and the item has not
been locked then the item is locked and the operation proceeds; when a
transaction accesses a data item and there is a conflicting lock, for example a
write lock, then the transaction which contains the operation must wait.
If the operation accesses a data item which has a non-conflicting lock, for
example the lock is a read lock and the operation is just going to retrieve the data,
then the lock is shared and the transaction containing the operation proceeds.
If the data item has been locked in the same transaction as the one containing the
operation the lock can be promoted and the transaction proceeds. Promotion
refers to the process of making a lock stronger, effectively promoting a read lock
to a write lock. The rule for whether the lock is promoted depends on whether
there are other transactions sharing the read lock. If there are none then the lock
is promoted to a write lock; however, if there is at least one then the transaction is
delayed until all the read locks have been released.
When a transaction commits or aborts all the locks created by the transaction are
released.

471

Deadlock - another problem


Occurs when a transaction T1 is suspended
when it asks for locked data associated with
T2 which is suspended waiting for T1.
Can also happen with a number of
transactions
Wait-for graph is used to describe this


472

One of the major problems that afflict concurrent systems is that of deadlock.
This occurs when there is a contention between two transactions for two items of
data. As an example of this consider a transaction (T1) which requires access to
an item of data (d1), but which has already issued a write lock to an item of data
(d2). Also assume that another transaction (T2) which currently has locked d1
executes code which tries to access the item of data d2. T1 will be unable to
proceed because T2 has locked d1, while T2 will be unable to proceed because it
requires d2 which T1 has locked. The two transactions are in a state of limbo,
each waiting for the other to proceed. This situation is also known as the deadly
embrace. Deadlock can occur in all distributed systems where there is shared access;
however, in those systems where there are a number of clients which hold data
for a long time (the typical interactive system) it is a major occurrence.

472

The wait-for graph


473

A wait-for graph is a directed graph which shows the relationships between
transactions and data. The slide above shows part of a very simple wait-for graph
with data access omitted.
Here transactions are denoted by rectangular boxes labelled with the name of a
transaction; the fact that a transaction is waiting for another transaction is
indicated by an arrowed line from the transaction that is waiting to the transaction
it is waiting for. The wait normally occurs because the second transaction has a
lock on a resource that the first transaction wants to process.
Deadlock occurs when there is a path from a transaction which traverses other
transactions and ends up back at the first transaction. In the slide there is a
deadlock since transaction T1 is waiting for transaction T3 which is waiting for
transaction T2 which, in turn, is waiting for transaction T1.
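Checking a wait-for graph for deadlock amounts to looking for a cycle in a
directed graph. The following is a minimal sketch using a depth-first search; it
is one of several well-known ways of doing this, not necessarily the one a
particular lock manager uses, and the class name WaitForGraph and its methods
are hypothetical.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class WaitForGraph {
    // For each transaction, the transactions it is waiting for.
    private final Map<String, Set<String>> waitsFor = new HashMap<>();

    // Records an arrow from the waiting transaction to the one it waits for.
    void addWait(String waiter, String holder) {
        waitsFor.computeIfAbsent(waiter, k -> new HashSet<>()).add(holder);
    }

    // True if some path leads from a transaction back to itself.
    boolean hasDeadlock() {
        Set<String> onPath = new HashSet<>();     // transactions on the current path
        Set<String> finished = new HashSet<>();   // transactions already cleared
        for (String t : waitsFor.keySet()) {
            if (cycleFrom(t, onPath, finished)) {
                return true;
            }
        }
        return false;
    }

    private boolean cycleFrom(String t, Set<String> onPath, Set<String> finished) {
        if (onPath.contains(t)) {
            return true;                          // came back to a transaction on the path
        }
        if (finished.contains(t)) {
            return false;                         // already explored, no cycle through it
        }
        onPath.add(t);
        for (String next : waitsFor.getOrDefault(t, Collections.emptySet())) {
            if (cycleFrom(next, onPath, finished)) {
                return true;
            }
        }
        onPath.remove(t);
        finished.add(t);
        return false;
    }
}
```

Adding the three waits from the slide (T1 for T3, T3 for T2, T2 for T1) makes
hasDeadlock report a cycle; leaving out the last one does not.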

473

Wait-for graph
Can be built up by a lock manager.
Manager looks for cycles when a lock is
created.
Takes action when a potential cycle is
detected for example by aborting another
transaction.
Detecting cycle is easy, deciding on what to
do then is more difficult

474

A wait-for graph will be built up by the lock manager every time a
transaction asks for a lock and will be reduced whenever a lock is
released. Each time that a lock is requested the lock manager will
examine the graph to see whether a cycle is about to be created. If so, then
it will take action to ensure that a cycle does not occur, for example by
aborting another transaction in the loop.
There are two aspects to lock management based on a wait-for graph.
First, there is the detection of cycles and, second, there is the decision as
to which lock to break when a cycle is detected.
Detecting a cycle in a graph is a well-known problem and there are a
number of algorithms which are available for doing this. The second
aspect is a little more tricky and is based on factors such as how long a
particular transaction has been waiting to complete and how many cycles
a particular transaction is involved in.

474

Using timeouts
Majority of database servers use timeouts
Rough and ready solution: often non-deadlocked
transactions will be aborted and
it penalises long-running transactions
Good DBMS will allow a database
administrator to set the time between lock
examinations

475

The vast majority of database servers use timeouts to eliminate deadlock. Each
lock that is created is given a time period during which it can exist without being
removed. After this time if another transaction wants to use the data that is locked
then the transaction that holds the lock is aborted and the new transaction locks
the data and is allowed to access it.
Using timeouts is a rough and ready solution compared with processing a wait-for graph. It suffers from a number of problems. The first is that transactions can
often be aborted even if they are not deadlocked. A second problem is that long-running transactions can be penalised too heavily.
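
The timeout rule can be sketched as follows. This is an illustration under stated assumptions rather than the behaviour of any real database server: the class TimeoutLockManager is hypothetical, there is one exclusive lock per data item, and the current time is passed in explicitly so that the example is repeatable.

```python
class TimeoutLockManager:
    """Hypothetical sketch: one exclusive lock per data item, with a time limit."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.locks = {}    # item -> (holding transaction, time the lock was granted)
        self.aborted = []  # transactions aborted because their lock timed out

    def request(self, txn, item, now):
        """Return True if txn holds the lock on item after this call."""
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (txn, now)
            return True
        holder, granted = held
        if holder == txn:
            return True
        if now - granted >= self.timeout:
            # The holder is aborted whether or not it was really deadlocked.
            self.aborted.append(holder)
            self.locks[item] = (txn, now)
            return True
        return False  # lock still within its time period; txn must wait


manager = TimeoutLockManager(timeout=5)
granted_t1 = manager.request("T1", "row-42", now=0)
early_t2 = manager.request("T2", "row-42", now=3)  # T1's lock has not expired
late_t2 = manager.request("T2", "row-42", now=6)   # T1 is aborted, T2 takes over
```

Note that T1 is aborted at time 6 simply because it was slow, which is exactly the penalty on long-running transactions described above.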

475

Locking and database servers

Locking strategy not under control of the
programmer or designer. Depends on
Lock size
Type of access
SQL statements involved
The number of items to be locked
What mechanisms are involved, for example
whether an index is being used

Most database systems obtain a read lock when they read data from a table and a
write lock when they write to a table. Unfortunately the developer is unable to
directly influence the locking strategy used for a particular database. It is
managed by the database management system based on the following factors:
The lock size chosen for the tables that make up the database: whether the lock
is, for example, on a page, row or on a whole table.
What sort of access is allowed, for example whether a dirty read is allowed on a
table.
The particular SQL statements involved.
The number of items expected to be locked for a particular transaction.
What mechanism is used to carry out an operation, for example whether an
index is to be used.

476

DBMS lock levels

Row locking
Page locking
Table locking
Database locking


Most database management systems allow a number of lock levels. These
include:
Row locking, where a row of a relational table is locked.
Page locking, where a page of file storage containing part of a table or
table index is locked.
Table locking, where a whole table is locked.
Database locking, where the whole of a database (every table) is locked.
Most database products implement most or all of the above levels with
row, page and table locking being the most common.
Most relational database systems handle deadlock via continual lock
monitoring rather than preventing it via consulting a wait-for graph. The
vast majority do this by periodically examining locks which have been in
existence for some time and aborting the transactions that own these
locks. A good database management system will allow the system
administrator to set the time gap between these lock examinations.

477

Database integrity and performance

A number of DBMS allow different forms
of data integrity which have an effect on
locking and hence performance
Dirty read
Committed read
Cursor stability
Repeatable read

Many database management systems also allow some form of data
integrity which would have an effect on the locking strategy used and
hence can reduce run time. The four main levels are:
Dirty read, where applications may read data which has been updated but
has not yet been committed to a database.
Committed read, where applications may not read dirty data.
Cursor stability, where a row being read by a transaction is not allowed
to be changed by another transaction.
Repeatable read, where all data items read are locked until a transaction
reaches a commit point.

478

Lecture 13
Transactions


479

Aims
To examine the nature of transactions
To look at the essential properties of
transactions
To understand the use of application servers
To detail a case study: Enterprise JavaBeans

480

Transactions

All or nothing

Set of atomic operations which access
stored data
Each of the operations in a transaction must
be atomic
Transactions can be defined as atomic
Transactions should not interfere with each
other

A transaction is a set of atomic operations which carry out some access to stored
data. For example, a typical transaction which processes a database of stock for
an online retailer is shown below:
A customer orders an item.
The system checks that the item is in stock.
If the item is in stock then the customer is allocated the item and the stock total
for the item is reduced by one.
If the stock for the item is dangerously low then an order for new stock is
placed.
An important property of each of the operations that make up a transaction is that
it is atomic.
This means that when a client carries out an operation such as updating a
database then this operation is free from interference by an operation which
belongs to another transaction. In the previous lecture I showed how, by using
locks, this could be achieved.
Transactions can also be atomic. An atomic transaction is one which must either
be totally carried out or not carried out at all. For example, a series of related
credits and debits carried out on a bank account either must have all been carried
out or, if some reason for aborting occurs, none of them must be carried out.
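
A small sketch of an atomic transaction, using Python's standard sqlite3 module with an in-memory database. The account table, the account names and the transfer function are invented for illustration. The point is only that a credit and its matching debit are either both applied or both rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")  # reason for aborting
        return True
    except ValueError:
        return False


ok = transfer(conn, "A", "B", 60)      # succeeds: A = 40, B = 60
failed = transfer(conn, "A", "B", 60)  # aborts: the credit to B is undone too
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

In the second call the credit to B has already been executed when the abort occurs, yet neither balance changes, because the whole transaction is rolled back.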

481

Transactions and ACID


Four properties of a transaction
Atomicity
Consistency
Isolation
Durability


The acronym ACID is often given to the properties of a transaction. It stands for
Atomicity, Consistency, Isolation and Durability. Atomicity means a transaction
must be atomic as defined in the previous slide. Consistency means that a
transaction must leave stored data in a consistent state, for example the balance
of a bank account must reflect the fact that credits have been added and debits
subtracted. Isolation stands for the fact that a transaction must not be interfered
with by other transactions. Durability stands for the fact that after a transaction
has completed its operations the results are stored in permanent storage, usually
some form of disk storage.

482

Serial equivalence
Important property
Means that if a number of concurrent
transactions are applied the effect would be
the same as if they were applied serially
Means that problems such as the lost update
problem do not occur


A major idea behind the design of transactional servers is that of serial
equivalence. This arises from the fact that the vast majority of servers are
concurrent in order to maximise the use of hardware resources. Serial
equivalence, when applied to a transaction, means that when a number of
transactions are applied concurrently the effect of these transactions will be the
same as if they were applied one after the other. This, in effect, provides a hard
requirement on a concurrent server that ensures that effects such as an
inconsistent update should not occur. As an example of this consider one problem
which occurs within concurrent systems: the lost update problem.
Here a client takes a copy of some shared data and carries out a write operation
on it; however, while the copy is being updated another client takes a further
copy and then updates the data. The first client writes back the data followed by
the second client doing the same. This means that the data written by the first
client is lost. However, if the transactions were carried out in a serially equivalent
way then the effect would be of the first client carrying out the update followed
by the second client carrying out an update.
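
The lost update problem can be reproduced without any real concurrency by writing out the interleaving by hand. The stock dictionary and the item name below are invented for the illustration.

```python
stock = {"widget": 10}

# Interleaved execution: both clients take a copy before either writes back.
copy_1 = stock["widget"]       # first client reads 10
copy_2 = stock["widget"]       # second client also reads 10
stock["widget"] = copy_1 - 1   # first client writes back 9
stock["widget"] = copy_2 - 1   # second client writes back 9: first update lost
interleaved_result = stock["widget"]

# Serially equivalent execution: each client reads and writes in turn.
stock["widget"] = 10
for _client in range(2):
    copy = stock["widget"]
    stock["widget"] = copy - 1
serial_result = stock["widget"]
```

Two items were sold in both runs, but only the serial run leaves the stock total at 8. The interleaved run leaves it at 9.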

483

Distributed transactions
Allow more concurrency
Provide more flexible policies for abortion


A transaction may consist of a number of sub-transactions. For example, a
transaction which allocates a product sold to a customer may be structured as
three further transactions: a transaction which associates the customer with the
product that they have bought, a transaction which updates the stock details for
the product and a transaction which creates accounting information such as a
credit card debit. Distributed transactions are useful for two reasons:
They allow more concurrency into a transaction and hence can potentially make
the utilisation of hardware resources more efficient.
They allow more flexible policies for aborting a transaction. In a simple
transaction if an operation fails then all the operations must be cancelled and the
data which is affected rolled back to the state that it was in before the transaction
was started. When transactions are organised hierarchically a sub-transaction can
be allowed to fail and not affect other sub-transactions which have already been
carried out; in such a case the sub-transaction that failed could be started again
without any of the previous sub-transactions consuming hardware resources.

484

Atomic commit protocols

Transactions should be atomic


All operations should be carried out or none
Atomic commit protocols enforce this.
Two phase atomic commit protocol most
common


Because transactions should be atomic there is a requirement on a distributed
transaction that either all the operations associated with it are carried out or none
at all. In order for this to happen some form of protocol is needed. In order to
describe an effective protocol known as a two-phase commit protocol I shall
assume that one server is nominated as a coordinator for the distributed
transaction; in a practical situation this is usually the first server that takes part in
the distributed transaction. In the previous slide this would be server S1.

485

Two phase commit protocol

Client aborts transaction


Server decides to abort a transaction
Two phase commit starts
Servers synchronise from co-ordinator
Details in the notes

The general rule about aborting or committing a transaction is:
1. If the client requests that a transaction is aborted, for example the client
is building up an order in a shopping cart and decides not to complete the
order, then the coordinator will inform all the servers involved in the
transaction that they should abort.
2. If one of the servers decides to abort a transaction, for example in order
to release a lock, then the coordinator informs all the servers involved in
the transaction; they will then all abort.
3. If the client asks for a transaction to be committed then the two-phase
commit protocol starts and steps 3.1 to 3.4 are carried out.
3.1 The coordinator asks each server whether they can commit.
3.2 Each server decides whether it can commit or not and sends back a
reply which indicates the result of its decision. If it cannot commit
then it aborts its transaction.
3.3 If all the servers have voted to commit their transactions
then the coordinator informs all the servers that they can commit.
3.4 If at least one server cannot commit then the coordinator decides to
abort the transaction and sends an abort message to all the servers
involved.
4. Those servers which have agreed to commit to a transaction wait for
the final decision from the coordinator. This will be either a commit
instruction or an abort instruction. They will then act on this.
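
The commit path of the protocol (the voting phase followed by the decision phase) can be sketched in a few lines of Python. This is a single-process illustration only: the Server class, its can_commit flag and the two_phase_commit function are hypothetical, and message loss, timeouts and coordinator failure are all ignored.

```python
class Server:
    """Hypothetical participant in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def vote(self):
        # Phase 1: a server that cannot commit aborts its part straight away.
        if not self.can_commit:
            self.state = "aborted"
            return False
        self.state = "prepared"  # now waits for the coordinator's decision
        return True

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(servers):
    """Coordinator logic: collect votes, then send one decision to everyone."""
    votes = [server.vote() for server in servers]  # phase 1: ask every server
    if all(votes):                                 # phase 2: unanimous yes
        for server in servers:
            server.commit()
        return "committed"
    for server in servers:                         # phase 2: at least one no
        server.abort()
    return "aborted"


willing = [Server("S1"), Server("S2"), Server("S3")]
outcome_ok = two_phase_commit(willing)

mixed = [Server("S1"), Server("S2", can_commit=False)]
outcome_bad = two_phase_commit(mixed)
```

In the second run S1 has voted to commit but still ends up aborted, because the final decision belongs to the coordinator.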

486

Locking rules in a distributed
transaction
Parent transactions do not run at the same time as their
children
Children inherit locks from their parents
If a nested transaction wants a read lock on shared data
then all the holders of the write lock must be its ancestors
If a nested transaction wants a write lock then all the
holders of both write and read locks must be its ancestors
When a transaction commits then all its locks are inherited
by its parent

As I have described in the previous lecture the most popular way of handling
concurrency control is via locks. Each server in a distributed system will have a
lock manager which will decide whether to grant a lock to a transaction; if it does
not then the transaction has to be queued up to wait for the data that it is
accessing to become free. When a transaction is committed or aborted it will
release a lock. The rules for locking for a nested distributed transaction are as
follows:
Parent transactions are prevented from running at the same time as their
children.
Children in a nested transaction will inherit locks from the parent
transaction.
If a nested transaction wants a read lock on a shared data item then all the
holders of the write lock on the item must be its ancestors.
If a nested transaction wants a write lock on a shared data item then all the
holders of both write and read locks must be its ancestors.
When a transaction commits then all its locks are inherited by its parent.
When a transaction aborts all its locks are removed.
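
The two rules about read and write locks can be expressed as a single check. The sketch below is illustrative: the function names, the parent_of map and the transaction labels are hypothetical, and it adds one assumption that is not in the rules above, namely that a transaction is never blocked by a lock it already holds itself.

```python
def ancestors(txn, parent_of):
    """Return every ancestor of txn, following the child -> parent map."""
    found = set()
    while txn in parent_of:
        txn = parent_of[txn]
        found.add(txn)
    return found


def can_grant(txn, mode, read_holders, write_holders, parent_of):
    """Apply the nested-transaction rules for a 'read' or 'write' request."""
    blockers = set(write_holders)       # write locks block reads and writes
    if mode == "write":
        blockers |= set(read_holders)   # read locks block writes only
    blockers.discard(txn)               # assumption: own locks never block
    return blockers <= ancestors(txn, parent_of)


# T1 has two children; T2 is an unrelated top-level transaction.
parent_of = {"T1.1": "T1", "T1.2": "T1"}

inherited = can_grant("T1.1", "read", [], ["T1"], parent_of)     # parent holds write lock
sibling = can_grant("T1.1", "write", ["T1.2"], [], parent_of)    # sibling holds read lock
shared_read = can_grant("T1.1", "read", ["T2"], [], parent_of)   # only a reader elsewhere
```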

487

Distributed deadlock


As with deadlocks in a single server there is always the possibility of deadlocks
occurring between servers in a distributed system; such a deadlock is known as a
distributed deadlock. The slide shows an example of this. Here transactions
labelled from T1 to T3 access shared data on servers labelled from S1 to S4. The
line labelled Waits for indicates that a transaction is waiting for a lock to be
released, while a line labelled Holds indicates that a lock is being held on a
particular data item. It can be seen from the slide that there are two cycles
indicating deadlock. The first just involves servers S1 and S2; the second involves
the servers S1, S2, S3 and S4. In the previous lecture I detailed two ways in which
deadlocks can be eliminated: either by timing out a transaction involved in a
deadlock cycle such as transaction T1 above, or by ensuring that just before a
transaction attempts to access a particular data item a cycle is not created. In that
lecture I described a graph known as a wait-for graph which allowed deadlocks to
be detected before they occur. There is no theoretical reason why such a graph
cannot be constructed for a number of servers participating in distributed
transactions with one of the servers taking on the role of checking for cycles in
the wait-for graph.

488

Distributed deadlock
Can be handled as described in the previous
lecture: by having a central server checking
for cycles
Impractical: what if the server malfunctions
or the transmission medium used breaks
In practice edge chasing algorithms are used


There are two major problems with using a central server. The first is that
if only one server were used and that server malfunctioned then the
system would be in great trouble and deadlocks would build up to the
point where performance would greatly degrade. The second problem is
that scalability cannot be achieved: as a system grows, more and more
pressure would be built up on the server carrying out deadlock detection
to the point where its performance would suffer; this degradation of
performance would affect other servers which contain deadlocked
transactions and hence the system itself, since the remaining servers
would rely on an overloaded server to enable them to remove deadlocks
and proceed. You will remember that earlier in the course I described one
of the major advantages of distributed computing being the fact that a task
can be split up to be executed on a number of servers, thus leading to
some degree of scalability. It is clearly an advantage for the deadlock
detection process to be made distributed and not centralised on one
particular server.
One popular way of distributing deadlock detection is for each server to
send messages to other servers initiating transactions to indicate that they
are waiting for another transaction. Such a distributed algorithm is known
as an edge chasing algorithm.
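
The notes do not give the details of the algorithm, so the following is only a sketch of one common formulation: a probe message is forwarded from server to server along wait-for edges, and a deadlock is reported if the probe arrives back at the transaction that started it. The function name, the two dictionaries and the single-process message queue are all assumptions made for the illustration, and each transaction is assumed to wait for at most one other.

```python
from collections import deque


def edge_chase(initiator, blocked_at, local_waits):
    """Forward a probe along wait-for edges; return the cycle found or None.

    blocked_at:  transaction -> server at which it is blocked
    local_waits: server -> {waiting transaction: transaction holding the lock}
    Each server consults only its own entry in local_waits.
    """
    messages = deque([(blocked_at[initiator], [initiator])])
    while messages:
        server, probe = messages.popleft()
        holder = local_waits[server].get(probe[-1])
        if holder is None:
            continue                      # the chain ends here: no deadlock
        if holder == initiator:
            return probe + [holder]       # the probe came home: deadlock
        if holder in probe:
            continue                      # a cycle that excludes the initiator
        if holder in blocked_at:
            # Forward the extended probe to the server where holder is blocked.
            messages.append((blocked_at[holder], probe + [holder]))
    return None


blocked_at = {"T1": "S1", "T2": "S2", "T3": "S3"}
local_waits = {
    "S1": {"T1": "T2"},   # at S1, T1 waits for T2
    "S2": {"T2": "T3"},   # at S2, T2 waits for T3
    "S3": {"T3": "T1"},   # at S3, T3 waits for T1
}
cycle = edge_chase("T1", blocked_at, local_waits)

# If T3 is not waiting for anything, the probe dies out at S2's edge.
no_cycle = edge_chase("T1", {"T1": "S1", "T2": "S2"},
                      {"S1": {"T1": "T2"}, "S2": {"T2": "T3"}})
```

No server ever sees the whole wait-for graph, which is the property that removes the central point of failure discussed above.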

489

TP Monitors
Originally associated with mainframe
computers
Manage concurrent execution
Ensure ACID properties are maintained
Solves the problem of thousands of users
concurrently accessing databases


A TP (Transaction Processing) monitor is a complex program which manages the
execution of a transaction starting with the client executing the transaction; it will
normally employ a number of servers and then return any results to the client. TP
monitors carry out two important processes: they manage the concurrent
execution of the threads and processes that make up a transaction and ensure that
the ACID properties detailed earlier in the chapter are enforced; for example, a
TP monitor ensures that when a transaction updates a shared item of data when
other transactions wish to access the data then the result of the updating is
consistent.
Originally TP monitors were associated with large mainframe computers and
used in areas such as airline ticketing and banking. However, the technology has
migrated to client-server systems. The effective problem that TP monitors have to
solve is that of potentially hundreds or even thousands of users wanting to access
shared databases concurrently over a period of time. If these users were all
statically allocated enough memory, file connections and threads to carry out
their functions then there would be a massive degradation of any system which
supported them.

490

TP monitor functions

Initiate and destroy threads


Manage resources
React to transaction failure
Schedule threads
Distribute processing load
Enable a distributed system to function even
in the presence of errors

The functions of a TP monitor are shown below; they have been taken
from a description of the IBM CICS monitor:
Initiate and destroy threads to carry out transactional operations. Many
transaction monitors will access a pool of threads which have been set up
when the monitor was started.
Manage the resources that are being accessed, for example ensuring that
updates are carried out in such a way that the resource does not find itself
in an inconsistent state.
Ensure that if a transaction fails then suitable action is taken; this action
can be provided by a programmer as code to be executed. In order to do
this most TP monitors will use a two-phase atomic commit protocol.
Schedule threads so that low-priority transactions, for example batch
transactions, are allocated a smaller share of resources than high-priority
transactions such as online transactions.
Enable the processing load on a distributed system to be shared between
a number of servers.
Enable a distributed system to function even in the presence of the
failure of one or more servers.

491

Case study: Enterprise Java Beans


Middleware part of Java
Component technology
Supports business components such as a
flight schedule which require transactional
support
Provided by an EJB compliant application
server

Enterprise JavaBeans technology forms one part of a package of middleware
software developed by Sun Microsystems and known as the Java 2 Platform Enterprise Edition.
You have already seen some of the components of this package in earlier
chapters including Remote Method Invocation, Java Database Connectivity
(JDBC) and servlets. Other components of the package include JavaMail which
enables a programmer to easily connect to some mailing protocol such as POP3,
the Java Message Service which enables developers to produce message-oriented
middleware and the Java Transaction Service which provides the
programmer with facilities to manage transactions.
Enterprise JavaBeans is a component technology. The model envisaged is that
large-scale components which encompass some business logic are developed by a
component manufacturer. For example, one component that might be developed
for an airline would be one which allocates staff to flights according to a set of
criteria such as the fact that pilots need certain times for resting.
These components, known as business components, will often need heavy
transactional services in order to support them. These are provided by means of a
TP monitor known as an application server. The business components which
reside within an application server provide facilities which, for example, ensure
that ACID properties are maintained.

492

Six parties involved in bean
development

Bean developer
Container provider
Server provider
Application assembler
Deployer
System administrator

There are a number of potential parties involved in bean implementation. Usually
some of these are coalesced together, for example the container and server
providers are usually the same company.
The bean developer. This is a company that has expertise in some domain such as
ticket reservation or airline staff scheduling. This company would produce the
Enterprise JavaBeans that implemented some application-specific logic.
The container provider. This is a company that supplies low-level software which
implements a run-time environment in which Enterprise JavaBeans can execute.
The server provider. A company that sells an Enterprise JavaBeans-compliant
server which provides transactional services. At present the container provider
and the server provider are the same company.
The application assembler. The company that joins beans together.
The deployer. This is some organisation which, given the code for the beans and
the glue code, will deploy the code across a number of servers. A number of
criteria determine how they are deployed; these include performance, security
and reliability.
The system administrator. This is an individual or collection of individuals who
are responsible for ensuring the maintenance of the bean, for example they will
be responsible for tuning the bean for performance if requirements change such
that the original tuning assumptions do not hold.

493

Server requirements for bean hosting

Distributed transaction management


Security
Resource management
Persistence
Multi-client support
Location transparency

The server requirements for a bean server are:


Distributed transaction management. The server should administer
transactions and ensure that phenomena such as phantom updates and
inconsistent retrievals do not occur.
Security. The server should provide facilities which prevent unauthorised
access to Enterprise JavaBeans.
Resource management. The server should provide resource management,
for example it should oversee the creation and deletion of threads and file
connections.
Persistence. Many Enterprise JavaBeans require to be held in some
permanent storage medium. An Enterprise JavaBeans server should
manage this process ensuring that all changes to transient data intended
for permanent storage are carried out.
Multiclient support. The server should manage the process of clients
connecting to Enterprise JavaBeans and mediate their interaction with the
beans.
Location transparency. The server should operate in such a way that
clients should have no knowledge of the physical location of Enterprise
JavaBeans.

494

The Enterprise Java Bean
architecture


The architecture of the Enterprise JavaBeans technology is shown above. An
Enterprise JavaBeans-compliant server can contain a number of containers which
themselves will contain a number of Enterprise JavaBeans. Clients are allowed to
access the Enterprise JavaBeans via a number of method invocations found
within the API for the Enterprise JavaBeans package.

495

Two types of bean


Entity beans represent some stored entity
An entity bean is persistent and mapped into
some file-based form
Session beans perform some business logic
Session beans only last for the duration of a
session


An entity bean represents some stored entity that is used in an application and
which requires permanent storage. Examples include: bank accounts,
warehouses, stock containers, flight plans, stock portfolios, insurance policies and
hotel bookings.
An entity bean will normally be mapped into data stored in a relational database
system, although it is quite possible for them to be mapped into data in an object-oriented database.
An important point to make about entity beans is that since they model long-lived
data an application server will provide facilities whereby, if a server crashes or
some disastrous event occurs, the bean state will not be destroyed.
A session bean is a bean which performs some business logic; they do not model
some stored entity such as a bank account. Typical examples of the type of work
that a session bean carries out are:
Processing a debit on a bank account.
Processing an order for some e-commerce product.
Making a trade for some stock or share.
Querying a warehouse for information about stock which requires replenishing.
A session bean will only last for the period during which a client interacts with
the bean.

496

Lecture 14
Distributed System Design


497

Aims
To examine some performance prediction
methods
To detail some design principles
concentrating on performance
To look at some of the trade-offs involved
in distributed application design


498

There are always trade-offs - the
example of data replication
Data replication used for performance
improvement
Also used for reliability
Uses replicated databases
Overusing replicated databases leads to
performance degradation


This lecture looks mainly at the design of a distributed system for performance; however,
I shall also look at some of the issues involved in ensuring that reliable services can be
maintained in the presence of hardware failure. At this point it is worth stressing that
many of the design decisions that are made do not have a direct effect on factors such as
performance, but an indirect effect. As an example of this consider data replication. This
is where a database in a distributed system is copied a number of times and located at a
number of points in a network. There are a number of reasons for replicating data; a
major one is that it enables data to be close to users. If a database is stored on a local area
network rather than a location which requires a wide area network access, the amount of
time for the wide area network to provide the data is often orders of magnitude higher
than if it were provided by a local area network.
You might think that many of the problems associated with low-speed access to
databases across a network would be solved, at a stroke, by just carrying out large-scale
replication; unfortunately this is not the case: replicated databases will need to coordinate
with each other as each is updated. After a replicated database has been written to it has
to send messages to all the other replicated databases in order that they reflect the changes
that have occurred. This gives rise to two factors which reduce the performance of a
distributed system. The first is that extra traffic is generated in the system; this is often
high-priority traffic and will delay any traffic which has originated from application
transactions. The second is that updates to a replicated database will delay transactions to
that database until it matches the state of the database which they have originally been
applied to.

499

Performance prediction

Using vendor measurements


Rules of thumb
Simulation modelling
Analytical modelling
Projecting from measurements


There are a number of ways of performing prediction measurement for a
distributed system. Each has its own properties, for example some are easy to
carry out but are inaccurate while others take a long time but are more accurate.
The next set of slides look at each of these in turn.

500

Using vendor measurements

Vendors publish this data.


Moderately useful when comparing servers.
Almost useless because the transaction mix
is usually chosen to show off a server and is
often specialised


Many server vendors publish performance data on how their servers perform
against a number of benchmarks. This type of data is moderately useful when
making rough comparisons between servers in a search for the most powerful
server. However, as the sole means of predicting performance in a distributed
system it is almost useless. There are two reasons for this: first, server vendors
will often choose a mix of transactions which make their servers perform well
and, second, these benchmarks are often highly specialised and atypical and do
not match the specific transaction mix of any particular application.

501

Rules of thumb
Experienced developer uses various rules of
thumb
One example is the effect of caching
strategies
Quite useful for optimising design
Less useful for prediction
Not recommended for novel systems

A developer who has implemented a number of small, similar distributed systems
will often use sensible rules of thumb to roughly predict the performance of a
new distributed system. For example, they might have discovered that certain
queuing algorithms or caching strategies work well and give a certain percentage
response time improvement for particular clients. Often such rules of thumb will work
to the point where certain hardware optimisations such as using a faster disk or
cache memory will enable the developer to meet performance criteria. However,
it is not a recommended method for systems which are either novel in terms of
functionality, architecture, the technology that is to be employed or large in size
and rich in functionality.

502

Simulation modelling

Process of building an executable model


Based on some mix of queuing data
A number of modelling languages available
Expensive but moderately accurate
Requires special expertise


This is the process of building an executable model of a system and executing it
with a particular predefined mix of transactions. The simulation is usually based
on some model of queuing where transactions which cannot be immediately
serviced by a server are queued up waiting for the server to become free. There
are a number of general-purpose modelling languages and technologies that can
be used for simulation, together with one or two special-purpose tools which are
oriented towards computer system simulation. In general, simulation modelling
offers a none too expensive approach to performance prediction which is
relatively accurate. However, it does require special expertise which, in small to
medium companies, is often not available and has to be bought in via
consultancy.
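
To give a flavour of what such an executable model looks like, here is a very small queuing simulation of a single server that services transactions in arrival order. It is a sketch only: the function names are hypothetical, and the use of exponentially distributed inter-arrival and service times in random_mix is an assumption chosen for the illustration, not something prescribed by the lecture.

```python
import random


def simulate(arrival_times, service_times):
    """Return the mean response time for one first-come first-served server."""
    server_free_at = 0.0
    total_response = 0.0
    for arrive, service in zip(arrival_times, service_times):
        start = max(arrive, server_free_at)  # queue up if the server is busy
        server_free_at = start + service
        total_response += server_free_at - arrive
    return total_response / len(arrival_times)


def random_mix(count, arrival_rate, service_rate, seed=1):
    """Generate a predefined, repeatable mix of transactions (assumed model)."""
    rng = random.Random(seed)
    clock, arrivals, services = 0.0, [], []
    for _ in range(count):
        clock += rng.expovariate(arrival_rate)
        arrivals.append(clock)
        services.append(rng.expovariate(service_rate))
    return arrivals, services


# Three transactions arriving one second apart, each needing two seconds.
small_case = simulate([0, 1, 2], [2, 2, 2])

# A larger random mix: 5 arrivals per second against 10 services per second.
arrivals, services = random_mix(1000, arrival_rate=5, service_rate=10)
predicted = simulate(arrivals, services)
```

In the small case the response times are 2, 3 and 4 seconds, so the mean is 3 seconds even though each transaction needs only 2 seconds of service. The difference is time spent queuing.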

503

Analytic modelling
Uses applied maths, usually statistics and
probability theory
Relies on a mathematical model
Equations relate queues, processors, devices
and data messages
As accurate as simulation modelling
Expertise is very short on the ground

Analytical modelling uses applied mathematics, usually statistics and
probability theory, to develop a model of a distributed system. The model is
developed in terms of equations that relate entities in a system such as processors,
I/O devices and servers to equations that describe the arrival rate and spread of
transactions in a system. Analytical modelling is similar to simulation modelling
in terms of cost and effectiveness: it is relatively good at prediction and the cost
is not prohibitive. However, expertise in this area can be very thin on the ground:
even large companies might find it difficult to discover someone who has the
degree of expertise, both computing and mathematical, to carry this process
out.
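
As an illustration of the kind of equation involved, consider the simplest textbook case: a single server with random (Poisson) arrivals and exponentially distributed service times, known as the M/M/1 queue. The choice of this model is an assumption made here for the example, and the function name is hypothetical. Real systems usually need richer models.

```python
def mm1_prediction(arrival_rate, service_rate):
    """Standard M/M/1 results for rates given in transactions per second."""
    if arrival_rate >= service_rate:
        raise ValueError("the queue grows without bound at this load")
    utilisation = arrival_rate / service_rate
    mean_response = 1.0 / (service_rate - arrival_rate)   # seconds in system
    mean_in_system = utilisation / (1.0 - utilisation)    # queued plus in service
    return utilisation, mean_response, mean_in_system


# A server that can handle 10 transactions per second, offered 8 per second.
utilisation, mean_response, mean_in_system = mm1_prediction(8, 10)
```

At 80 per cent utilisation the predicted mean response time is 0.5 seconds, five times the 0.1 second service time, which shows how sharply response time rises as a server approaches saturation.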

504

Projecting from measurements


Also known as benchmarking
Simple strategy is to measure critical data
for one client-server with no competing
work. A good strategy for simple systems.
Other strategy is to run a series of
representative processes.
Needs a lot of resources though.

This is often referred to as benchmarking. There are two types of projection that
can be made. The first is to measure critical results such as internal wait times
and the time taken for results to be processed and appear at a client for just one
client-server relationship with no other competing work. This is an excellent way
of predicting performance in small systems, particularly when it is augmented by
a little simulation modelling or analytical modelling in order to cope with the
added complexity of multi-threaded working.
The second type of projection is that made from data gathered by running a
number of representative processes.


Benchmarking from representative processes

Produce working system


Find some programs to replicate load
Find representative data
Run the system
Monitor key parameters
Analyse the data
Vary the workload

The steps below are employed in a full benchmark. Once these processes have
been carried out the developer will have gained a very good idea about how a
target system will perform. Unfortunately there is a major problem: a large
amount of resource needs to be committed for the process, often making it
uneconomic. Only systems which are immensely performance-critical and
mission-critical can be analysed in such a way.
Produce a working distributed system in terms of hardware elements.
Find some programs which replicate the workload that will be
experienced by the servers in the system.
Find and load representative test data.
Run the system with users who will generate a meaningful profile of
transactions.
Monitor key parameters of the system such as wait time.
Analyse the data.
Vary the workload and see how this affects critical parameters.
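
The last three steps (monitor, analyse, vary the workload) can be sketched as a
small timing harness. This is a minimal sketch only: run_benchmark and
fake_server are hypothetical names, and fake_server stands in for a call to the
real system under test.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def run_benchmark(handle_request: Callable[[int], object],
                  workloads: Iterable[int]) -> Dict[int, float]:
    """Return the mean time in seconds per request for each workload size."""
    results = {}
    for n_requests in workloads:          # vary the workload
        timings = []
        for request_id in range(n_requests):
            start = time.perf_counter()   # monitor a key parameter: elapsed time
            handle_request(request_id)
            timings.append(time.perf_counter() - start)
        results[n_requests] = statistics.mean(timings)  # analyse the data
    return results


def fake_server(request_id: int) -> int:
    # Hypothetical stand-in for a request to the real server.
    return sum(range(1000))


if __name__ == "__main__":
    for size, mean_seconds in run_benchmark(fake_server, [10, 100, 1000]).items():
        print(f"{size:5d} requests: {mean_seconds * 1e6:.1f} microseconds per request")
```

A real benchmark would issue the requests concurrently from many clients; the
sequential loop here only shows the shape of the measurement.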


Distributed design principles

Principle of locality
Principle of sharing
The principle of parallelism


There are a number of design principles which should be applied or respected
when developing a distributed system:
The principle of locality. This means that parts of a system which are associated
with each other should be in close proximity. For example, programs which
exchange large amounts of data should, ideally, be on the same computer or, less
ideally, the same local area network.
The principle of sharing. This means that ideally resources (memory, file space,
processor power) should be carefully shared in order to minimise the load on
some of the elements of a distributed system.
The parallelism principle. This means that maximum use should be made of the
main rationale behind distributed systems: the fact that a heavy degree of scaling
up can be achieved by means of the careful deployment of servers sharing
processing load.


Locality principle - an example

Two separate tables joined on two servers


Bring them together
However, trade-off might be involved with
keeping the tables close to the users


As an example of this principle consider a distributed system whose
clients issue a query which results in large amounts of data being
retrieved from two tables located on two separate servers connected via a
wide area network such as the Internet. An example of this would be a
query which involved an SQL join. Let us also assume that a third server
carries out the joining process and the construction of the resulting table.
If the two tables are large then there will be a considerable delay while
data is retrieved from the two servers and sent to the third server. If these
two tables were situated on the same server then, theoretically, a
considerable performance improvement can be achieved.
This decision looks very straightforward and one which should
automatically be taken by the designer of a distributed system. However,
trade-offs also have to be considered. For example, as you will see later,
an excellent way of improving the performance of a distributed system is
to locate data close to the users: usually on the same local area network
on which they are situated. One of the tables that takes part in the joining
process might have been specifically located at a distance from the other
table for this very reason.
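
The size of the delay can be estimated with simple arithmetic. The sketch below
is illustrative only: the function name transfer_seconds, the table size and the
link speeds are all assumed figures, not measurements from the lecture.

```python
def transfer_seconds(size_bytes: float, bandwidth_bits_per_s: float,
                     latency_s: float = 0.0) -> float:
    """Rough time to ship size_bytes over a link, ignoring protocol overhead."""
    return latency_s + (size_bytes * 8) / bandwidth_bits_per_s


if __name__ == "__main__":
    table_bytes = 500e6   # assumed: a 500 MB table
    wan = 10e6            # assumed: 10 Mbit/s wide area link
    lan = 1e9             # assumed: 1 Gbit/s local area network
    print("over the WAN:", transfer_seconds(table_bytes, wan), "seconds")
    print("over the LAN:", transfer_seconds(table_bytes, lan), "seconds")
```

With these assumed figures the wide area transfer is two orders of magnitude
slower, which is why the placement of the two tables matters so much.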


Locality principle - keeping data together
Keep related data close together
An example is compositing separate classes
on the same server
Trade-off with maintenance


Probably the best known example of the locality principle is that related data
should be grouped together. Already in the introduction to
this section I have described one example of this where two tables which were
related by virtue of the fact that they were often accessed together were moved
onto the same server. This whole principle applies to all sorts of groupings of
data: rows in a relational table, columns in a relational table, tables themselves
and attributes of objects. For example, the analysis of an object-oriented system
will produce a series of documents which will describe information such as the
functionality of the system, the classes involved in the implementation of the
functionality and the relationship between the classes. If analysis has been carried
out competently, then there should be little, if any, design information produced,
apart from perhaps specification performance constraints.
The role of design is to take the analysis product and turn it into some form
which is heavily adorned with physical detail. One application of the principle of
keeping data together is to form composite classes which are constructed from
two or more classes identified during the analysis phase. The decision as to
whether classes should be composited is a serious one. It is based on an appraisal
of the workload of a system and the transactions that objects derived from the
classes take part in. If the developer used an object-oriented database system to
store objects then such a compositing decision would have a significant impact
on performance if the objects which were composited were frequently retrieved
together.


Locality principle - keeping code together
If two programs communicate they should
do so over local memory, not a transmission
medium
Statically store programs on the same server
or dynamically load them
However, these programs might be on
separate servers because they need to
communicate with a local database

The idea behind this is that if two programs communicate with each other in a
distributed system then, ideally, they should be located on the same computer or,
at worst, they should be located on the same local area network. The worst case is
where programs communicate by passing data over the slow communication
media used in wide area networks.
There are a number of ways of implementing this design decision, the most
obvious being to statically store programs which communicate together on the
same server; an alternative would be to dynamically load these programs at run
time. However, a word of warning is necessary: many programs that are found on
separate servers will be there because they are communicating with some local
database. Thus, the decision to bring together two programs will often require
data to be moved with an effect on performance ensuing.


Locality principle - bring users and data close together
One previous example is replicated
databases
Another is caching
A number of variants of caching


Replicated data is data that is duplicated at various locations in a distributed
system. The rationale for replicating a file, database, part of a file or part of a
database is simple. By replicating data which is stored on a wide area network
and placing it on a local area network major improvements in performance can be
obtained as local area technology can be orders of magnitude faster than wide
area technology in delivering data.
The other method of ensuring that users are close to data is caching. Caching is
the storing of frequently used data in a fast memory, either at a client or at a
server which is connected to clients via a local area network. Huge increases in
performance can be obtained by storing frequently used data in a local file or in
memory, data which would normally be accessed over a slow wide area network
connection or from some slow file device.


Caching
Stores data on local memory
Dynamic caches are known as write-back or
write-through caches
A number of caching strategies, for example
least recently used strategy
Used in browsers


Caching is an excellent way of speeding up a system for data which is not subject
to much change such as simple Web pages which do not contain dynamic data.
For data that does change, performance gains can still be achieved; however, the
maintenance of the pages in the fast area of memory devoted to storage (known
as the cache) reduces the gains and sometimes requires quite a degree of extra
programming.
Caches which deal with dynamically updated data are known as write-back
caches or write-through caches. For such caches when a transaction updates some
stored data which appears in a cache at a client computer the following must
occur:
The data that is stored at the client's cache must be updated to reflect the
change.
The stored data corresponding to the cached data must also be updated at
its server.
All other caches at other clients must be changed to reflect the changed
data.
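
The three updates listed above can be sketched for the write-through case, where
the server copy is updated at the same time as the cache. This is a minimal
in-memory sketch: the class name WriteThroughCache, the peers list and the use of
a dictionary as the server's store are assumptions for illustration, and a real
system would send these updates over the network.

```python
class WriteThroughCache:
    """A client-side cache that pushes every write through to the server."""

    def __init__(self, server_store: dict):
        self.server_store = server_store  # stand-in for the data held at the server
        self.local = {}                   # this client's cached copies
        self.peers = []                   # caches held by other clients

    def read(self, key):
        # Fetch from the server only on a cache miss.
        if key not in self.local:
            self.local[key] = self.server_store[key]
        return self.local[key]

    def write(self, key, value):
        self.local[key] = value           # 1. update this client's cache
        self.server_store[key] = value    # 2. update the stored data at the server
        for peer in self.peers:           # 3. bring other clients' caches into line
            if key in peer.local:
                peer.local[key] = value


if __name__ == "__main__":
    store = {"price": 10}
    a, b = WriteThroughCache(store), WriteThroughCache(store)
    a.peers, b.peers = [b], [a]
    b.read("price")
    a.write("price", 12)
    print(store["price"], b.read("price"))
```

A write-back cache would differ at step 2 by deferring the server update, which
saves traffic but complicates recovery.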


Caching strategy

Depends on pattern of access


Also, relative size of the cached data to the
overall data
Also, how the cache is to be managed
Plenty of studies published

When designing a caching mechanism there are a number of factors
which should be considered. The most important are:
The pattern of access to the stored data that is to be cached.
The relative size of the cached data as against the size of the full set of
data that the cache forms a part of.
How the cache is to be managed.
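
One common answer to the management question is the least recently used strategy
mentioned on the earlier caching slide. The sketch below is one possible way of
writing it in Python; the class name LRUCache and its methods are illustrative
assumptions, not part of the course material.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry when full."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()   # oldest use first, most recent use last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # reading an entry counts as a use
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")        # "a" is now more recently used than "b"
    cache.put("c", 3)     # the cache is full, so "b" is evicted
    print(cache.get("a"), cache.get("b"), cache.get("c"))
```

Whether this strategy suits a given system depends on the first two factors: the
pattern of access and the size of the cache relative to the full data set.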


Principle of sharing

Resources should be shared


Loads should be balanced
Resources are memory, software, and
processors


This principle is concerned with the sharing of resources (memory, software,
processor), the best example of this being the sharing of a server's processor by
a number of clients which take advantage of spare processor resource in order to
improve the response time of transactions. This takes advantage of the slack
resources that are generated when a lengthy external process, such as accessing
stored data on a file, occurs.


The principle of sharing - sharing among servers
How the servers are to be partitioned in a
system
Target is to avoid bottlenecks
Bottleneck work is partitioned to servers
Application resource usage matrix used


A major decision to be made about the design of a distributed system is how the
servers in a system are going to have the work performed by the system
partitioned among them. The main rationale for sharing work amongst servers is
to avoid bottlenecks where servers are overloaded with work which could be
reallocated to other servers.


The application resource usage matrix
[Figure: application resource usage matrix, with resources as columns and
applications as rows]


An application resource usage matrix can be used in order to aid the
process of allocating servers to applications. An example of this matrix is
shown in the slide.
Each column in the matrix represents some system resource, for example
a relational table, an object, a processor or a server. Each row represents a
particular application. The entries in the matrix represent the utilisation
made of the resource. At the initial stages of development these entries
might just be binary.
In the slide the fact that a resource is used by an application is indicated by
a U and the fact that it isn't is represented by a blank entry.
As the process of analysis proceeds the entries might be replaced by more
detailed figures such as quantitative estimates of the usage made of a
particular resource. The process of designing the sharing part of
distributed systems involves ensuring that each column has an even
spread of utilisation at the same time respecting a number of heuristics
that have emerged over the last decade or so, for example ensuring that
small and large units of work are separated: one server should deal mainly
with small transactions, while another server should deal with longer
transactions.
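
Once the entries are quantitative, checking the spread of utilisation in each
column is a matter of simple arithmetic. In the sketch below the application
names, resource names, utilisation figures and the 0.8 limit are all invented for
illustration; column_totals and overloaded are hypothetical helper names.

```python
# Rows are applications, columns are resources; a missing entry means "not used".
usage_matrix = {
    "order entry":   {"customer table": 0.30, "stock table": 0.25, "server A": 0.40},
    "stock control": {"stock table": 0.35, "server A": 0.45},
    "reporting":     {"customer table": 0.10, "server B": 0.20},
}


def column_totals(matrix: dict) -> dict:
    """Sum the utilisation of each resource over all applications."""
    totals = {}
    for resources in matrix.values():
        for resource, utilisation in resources.items():
            totals[resource] = totals.get(resource, 0.0) + utilisation
    return totals


def overloaded(matrix: dict, limit: float = 0.8) -> list:
    """Resources whose total utilisation exceeds the chosen limit."""
    return sorted(r for r, total in column_totals(matrix).items() if total > limit)


if __name__ == "__main__":
    print(column_totals(usage_matrix))
    print("candidates for rebalancing:", overloaded(usage_matrix))
```

Here server A carries most of the load while server B is lightly used, so moving
one application from A to B would even out the columns.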


Principle of sharing - sharing of data


It is a given in a distributed system that data
should be shared
Design involves choosing locking on tables,
and isolation level
Target is to minimise amount of locks and
deadlocks


Most of the decisions about locking will be made with respect to the database
management system that is used. In general such systems allow locking at one or
more sizes: page locks, table locks, database locks and row locks.
Locks are defined as a property of a table or database when the database designer
defines the structure of the database. Many of the decisions about what sort of
locking to adopt are common sense ones. For example, when a table is going to be
the target of a bulk update it is better, for efficiency reasons, to use a table
lock; and when data is just being read by online transactions and updated by
batch processes it is better to use the largest lock size in order to minimise
the amount of locking that occurs.


Trade-offs with locking
[Diagram relating deadlock occurrence, locking strategy and performance]
Making small areas of a table locked leads to more concurrency but increases
deadlock and the drop in performance needed to cope with this


Although many locking decisions are straightforward, some locking
issues are much more subtle. For instance, the relationship between
deadlock occurrence, the locking strategy adopted and performance can
be a subtle one, where trade-offs have to be considered. For example,
locking at the row level will enable more concurrency to take place at the
cost of increased deadlocking where resources have to be expended in
order to monitor and release deadlocks; conversely locking at the table
level can lead to a major reduction in deadlock occurrence, but at the
expense of efficient concurrent operations.


Minimising deadlock
Monitor where they are occurring and then
modify database code, for example
changing access order
Use data replication
Experiment with the deadlock break interval


There are a number of ways of designing a system to minimise the effect
of deadlocks. The first is to monitor where the deadlocks are occurring
and then modify any database code which creates this situation. For
example, you may find that changing the order in which tables are
accessed will remove some deadlock occurrences. The second is to spread
the occurrence of database locks more evenly across the tables.
Another way of minimising distributed deadlocks is to carry out data
replication. Already I have detailed how this technique can be used to
locate data close to users; it is also quite a powerful technique for
minimising locks since, if there are n clients accessing a replicated set of
d databases, then each database is contended for by roughly n/d clients
rather than n, so the probability that a lock conflict will occur is
correspondingly lower than in the single database example.
Another strategy which is applied post-design is to experiment with the
deadlock break interval. Many database systems do not employ devices
such as wait-for graphs in order to detect deadlocks before they happen.
What they do is regularly examine locks which have been around for
some time and release those which have been in existence for a period
exceeding the deadlock break interval. Many database systems allow this
interval to be set by the database administrator when the database is
specified; in my experience varying the deadlock break interval can have
a drastic effect on performance for many database designs.
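
For contrast with the timeout approach, the wait-for graph mentioned above can be
sketched as follows: each transaction maps to the transactions whose locks it is
waiting for, and a cycle in that graph means a deadlock. The function name
find_deadlock and the dictionary representation are assumptions for illustration,
not a description of how any particular database system does it.

```python
def find_deadlock(wait_for: dict) -> list:
    """Return the transactions forming a cycle in the wait-for graph, or []."""
    visited = set()
    for start in wait_for:
        if start in visited:
            continue
        # Depth-first search, keeping the current path so a cycle can be reported.
        path, on_path = [start], {start}
        visited.add(start)
        stack = [iter(wait_for.get(start, ()))]
        while stack:
            for blocker in stack[-1]:
                if blocker in on_path:
                    # The current path leads back to a transaction already on it.
                    return path[path.index(blocker):]
                if blocker not in visited:
                    visited.add(blocker)
                    path.append(blocker)
                    on_path.add(blocker)
                    stack.append(iter(wait_for.get(blocker, ())))
                    break
            else:
                # Every blocker of the current transaction has been explored.
                stack.pop()
                on_path.discard(path.pop())
    return []


if __name__ == "__main__":
    # T1 waits for T2, T2 waits for T3 and T3 waits for T1: a deadlock.
    print(find_deadlock({"T1": ["T2"], "T2": ["T3"], "T3": ["T1"]}))
    # No cycle, so no deadlock.
    print(find_deadlock({"T1": ["T2"], "T2": []}))
```

A system using this approach would pick one transaction from the returned cycle
as the victim and roll it back, rather than waiting for a break interval to
expire.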


Varying the isolation level

Dirty read
Committed read
Cursor stability
Repeatable read


Another factor which many database systems allow to be varied is the isolation
level of a program. There can be as many as four isolation levels which a DBMS
will allow the database administrator to choose.
Dirty read. This is where a transaction can read data which has been
modified but the changes that have occurred have not been committed.
Committed read. Here a transaction is not allowed to read dirty data or
overwrite another transaction's dirty data.
Cursor stability. Here a row being read by one transaction cannot be
changed by another transaction.
Repeatable read. All items are locked until a commit has been executed.
As you proceed down the list of bullet points above the strength of the isolation
increases; this will increase the number of locks and hence the chance of
deadlock occurring and performance dropping. The design principle here
should be that the isolation level chosen should be the weakest consistent with the
application data integrity and the demands of the application.
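
The design principle can be expressed as a small decision function. This is a
sketch under stated assumptions: the enumeration follows the four levels listed
above, but the three yes/no questions used to choose between them, and the names
Isolation and weakest_acceptable, are simplifications of my own and not a feature
of any particular DBMS.

```python
from enum import IntEnum


class Isolation(IntEnum):
    """The four levels from the list above, weakest first."""
    DIRTY_READ = 1
    COMMITTED_READ = 2
    CURSOR_STABILITY = 3
    REPEATABLE_READ = 4


def weakest_acceptable(tolerates_uncommitted_data: bool,
                       current_row_must_not_change: bool = False,
                       rereads_must_match: bool = False) -> Isolation:
    """Pick the weakest isolation level that still meets the stated needs."""
    if rereads_must_match:
        return Isolation.REPEATABLE_READ
    if current_row_must_not_change:
        return Isolation.CURSOR_STABILITY
    if not tolerates_uncommitted_data:
        return Isolation.COMMITTED_READ
    return Isolation.DIRTY_READ


if __name__ == "__main__":
    # A report on averaged sales figures can tolerate uncommitted data.
    print(weakest_acceptable(tolerates_uncommitted_data=True).name)
    # A funds transfer must see the same values every time it rereads them.
    print(weakest_acceptable(False, rereads_must_match=True).name)
```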


An example

If you have an application which reports on data in a
stored database and where 100 per cent accuracy in the
data is not required, for example a transaction which
reported on some averaged amount such as sales over the
past n months, then such an application can be
specified as having a dirty read isolation level.


The slide here provides an example of a design decision which is based on
considering the isolation level.


The parallelism principle


Work partitioned among n servers running
in parallel
Do not assume that linearity in power is
achieved
Key idea is load balancing
Program splitting, database splitting, I/O
splitting

The key idea behind the parallel principle is that of load balancing:
ensuring that a resource, be it a database, processor, relational table,
memory or object, is subdivided without incurring too many of the
overheads listed above.
For example, one decision that the designer of a distributed system has to
make is what to do about very large programs which could theoretically
execute on the same processor. Should this program be split up into
different threads which execute in parallel, either on a single server or on
a number of distributed servers? The designer here has to make a decision
which is driven by a consideration of the amount of synchronisation and
communication which occurs between the threads. If threads could spend
a large amount of time working away at a particular algorithm with little
access to shared data and with only small amounts of data required for
communication, then splitting the program into a number of component
programs would be a good decision, even over a number of servers.
Unhappily things are never so clear cut as this: very few programs can be
cleanly partitioned and a careful consideration of the operating system
overhead and resources consumed by the various threads involved has to
be carried out before making a partitioning decision.
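
The slide's warning that linearity in power should not be assumed can be
illustrated with Amdahl's law, a standard result which the lecture itself does
not name: if only a fraction p of the work can run in parallel, n servers give a
speed-up of 1 / ((1 - p) + p / n). The function name amdahl_speedup is an
illustrative choice.

```python
def amdahl_speedup(n_servers: int, parallel_fraction: float) -> float:
    """Theoretical speed-up from Amdahl's law.

    parallel_fraction is the share of the work (0 to 1) that can be split
    across servers; the rest is synchronisation and other serial work.
    """
    if n_servers < 1:
        raise ValueError("need at least one server")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_servers)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} servers: speed-up {amdahl_speedup(n, 0.9):.2f}")
```

Even with 90 per cent of the work parallelisable the speed-up can never reach
10, however many servers are added, and this formula ignores communication
costs, which make matters worse.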


Reliability
Key idea in distributed applications
More prone to fail than standalone systems
Server can fail, medium for transmission
can fail
System should be designed so that it can
cope with failure, albeit with perhaps a
degraded performance

The slides have outlined some techniques and principles for designing a
distributed system so that it has an acceptable performance. The other major
worry that a designer has is that of reliability. With the advent of distributed
systems and the increasing incidence of systems which interact with the general
public this has become a much more important design factor than it once was: for
example, in the old days of mainframe computers and minicomputers when, say,
customers phoned in their orders, companies could cope fairly easily with
incidents which caused computers to malfunction: normally such companies
would switch over to some manual form of ordering where order staff would
consult printouts which were generated on a periodic basis, say every two or
three hours.
The advent of client-server computing and the World Wide Web, where
customers can inspect stocks of products and order them online, has not only
changed the importance of reliability (a malfunctioning server could
effectively shut down a retailer for a period) but also, at the same time,
provided many of the tools that can ensure that a system will always be running,
albeit at a lower performance level.


Strategies for increasing reliability

Recovery files
Replicated databases
Mirrored servers
Multi-parallel running
Important to point out that many reliability design decisions are taken
out of the hands of the designer

The first aspect of dealing with failures is that of recovery: ensuring that when
some failure occurs, such as a server malfunctioning, the data stored on that
server can be quickly recovered and reconstituted at a working server. In order
to do this many systems use some form of recovery file. This is a file which
contains a list of changes that have been applied but have not yet been
committed; if there is a malfunction then a program known as a recovery manager
will carry out the process of restoring any files which were in limbo when the
malfunction occurred.
Another technique for achieving reliability, and which has a faster response to
failure than techniques which use a recovery file, is parallel running. Here a
number of servers with replicated databases process the same transaction. Each
time that a transaction is received by the distributed system in which the servers
are located they will all apply that transaction to their databases. In this way, if a
fault occurs, recovery from the fault would be virtually instantaneous.
This is an expensive way of implementing reliability and is only really suited to
systems where a very high degree of reliability is required. There are
intermediate solutions. For example, a server could be designated a primary
server and a collection of other servers designated secondary servers. These
servers would lag behind the primary server in terms of updates to their
databases; however, if the primary server malfunctions one of the secondary
servers would catch up by applying the transactions which distinguished the
difference between the primary server and itself. These transactions would
normally have been written to some recovery file.
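
The primary and secondary arrangement can be sketched as follows. This is a
minimal in-memory sketch: the Replica class, the catch_up function and the
representation of the recovery file as a list of (key, value) updates are all
assumptions made for illustration.

```python
class Replica:
    """A server holding a copy of the database."""

    def __init__(self):
        self.data = {}
        self.applied = 0   # how many entries of the recovery log have been applied

    def apply(self, entry):
        key, value = entry
        self.data[key] = value
        self.applied += 1


def catch_up(secondary: Replica, recovery_log: list) -> None:
    """Replay the transactions the secondary has not yet seen."""
    for entry in recovery_log[secondary.applied:]:
        secondary.apply(entry)


if __name__ == "__main__":
    recovery_log = [("stock:widget", 40), ("stock:gadget", 7), ("stock:widget", 39)]

    primary, secondary = Replica(), Replica()
    for entry in recovery_log:
        primary.apply(entry)
    secondary.apply(recovery_log[0])       # the secondary lags behind the primary

    # The primary fails; the secondary replays what it missed and takes over.
    catch_up(secondary, recovery_log)
    print(secondary.data == primary.data)
```

The further a secondary is allowed to lag, the cheaper normal running becomes and
the longer the catch-up takes after a failure.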


Lecture 15
Security (i)


Aims

To look at some typical security violations


To examine the properties of a security service
To describe the basic elements of cryptography
To examine asymmetric cryptography
To examine some key management techniques
To look at authentication
To describe some uses of cryptography

Problems with Internet security


Public standards
The network is pervasive
Web servers are extensible and can interact with
all sorts of software
Internet software was designed with functionality
in mind with little thought of security
The Internet is complex
The Internet is public

The Internet suffers from a number of security problems:
Internet standards are public and elements of the Internet such as transmission
media are public and can be relatively easily accessed.
A lot of software intended for Internet use was initially designed for
functionality; at best, security was tacked on at the end.
The Internet and many of its products are so complex that security problems
appear almost on a weekly basis.


Security threat category

Integrity threats
Confidentiality threats
Denial of service threats
Authentication threats


Integrity threats involve the retrieval of, and tampering with, important data
such as credit card details.
Confidentiality threats are concerned with the reading of confidential data.
Denial of service threats are concerned with restricting, degrading or removing a
service provided by a host.
Authentication threats involve an intruder pretending to be an authorised user and
carrying out operations that the authorised user is allowed to execute.


Requirements for secure Internet applications

Confidentiality
Authentication
Integrity
Non-repudiation
Access control
Availability

The bullet points above represent requirements for secure Internet applications:
Confidentiality means that information stored on a system cannot be accessed by
unauthorised parties.
Authentication means that the origin of a message or transaction is correctly
identified and the originator is who they claim to be.
Integrity means that only authorised parties are able to change data.
Non-repudiation means that neither the sender nor the receiver of a transaction
can deny that a transaction took place.
Access control means that facilities in a system are controlled so that users are
only allowed to use resources that they are authorised to use.
Availability means that the resources of a system are available to authorised
users when they are needed.


Non-technological attacks (examples)
Guessing someone's password
Stealing a password
Taking advantage of poor physical or
clerical controls
Dumpster diving


There are a number of non-technological attacks that are possible; some
examples are shown above:
Guessing someone's password and then using this password to gain
access to secure files. Often passwords are chosen which are memorable,
for example the name of the password holder's wife, dog or one of their
children.
Stealing a password which is unsecured; for example, the password could
have been written on a white-board, could be written inside a diary or
stored on a sheet of paper in a drawer.
Taking advantage of poor physical or clerical controls. There is a history
of bank employees authorising credit cards or bank guarantee cards to
non-existent customers, keeping the card themselves and using the card
for their own purchases. This type of crime is the furthest away from the
image of the technologically inspired criminal act: it just relies on internal
weaknesses in some organisation.
Looking into dumpsters (rubbish containers) for paper-based
information.


Destructive Attacks (examples)

Email bombs
List linking
Denial of service attacks


Some attacks are aimed at destroying data or a service or denying a service.
An email bomb is an e-mail which either has a large amount of text pasted into it
or has a large file attachment associated with it. Often such devices are sent to
newsgroup participants who the sender has disagreed with. They are a nuisance:
if you have a dial-up connection and someone sends you a substantial e-mail
bomb then you could be waiting quite a long time before it is delivered. There
are, however, examples of large quantities of e-mail bombs being sent to an
organisation and disabling its mail server which becomes overwhelmed by the
load.
Another more serious form of e-mail bombing is list linking. Here the recipient of
a list linking attack is subscribed to a large number of mailing lists by the person
who carries out the attack. Many mailing lists send attachments with the
messages they transmit to the members of the list so that the recipient effectively
gets e-mail bombed continually. Many e-mail providers will allocate a quota of
e-mail space to their customers; when that space is full then no more e-mail will
be accepted.
A denial of service attack is one in which an intruder carries out some action
which either prevents access to a service or degrades the quality of the service.
For example, running a program on a computer which will spawn other programs
which, in turn, spawn further programs is an example of a denial of service
attack.


Viruses
Program that executes on a host and causes serious
problems
A number of categories: executable virus, data
virus, device driver virus, stealth virus,
polymorphic virus
A number of virus kits available on the Internet

Introduction to viruses

A virus is a program that is inserted into a host and which carries out some
destructive process such as deleting the file store.
There are three main types of virus: executable viruses, data viruses and device
driver viruses. An executable virus is a virus which is attached to an executable
file which, when executed, will result in the virus code being run. This code will
then carry out some malicious act such as deleting important files. A data virus is
a virus which infects a file containing data, rather than executable code. Often
this data is associated with some program which requires it in order to carry out
its functions. For example, many programs require a start-up file which
initialises the program and sets up basic parameters for its operation. A data
virus could infect such a file and set the data in it to values such that the
program will crash or fail to carry out its functions. A third class of virus is
the device driver virus. This infects the device drivers of an operating system
which are then used to piggy-back into other parts of a computer such as its
file store.
There is also a further classification of viruses which categorises the ways that
they use to hide their presence on a computer. There are two types of virus which
are categorised in this way: the stealth virus and the polymorphic virus.


Scanner attacks
A scanner is a program which detects
system weaknesses
Poor name
Most famous is SATAN
Example is a scanner which checks that
sendmail is secure
Scanners can be used for intrusion


A scanner is a program which detects security weaknesses. There will be some
controversy about placing this topic in a section devoted to attacks on a computer
system, since scanners have been developed to help system administrators
pinpoint security weaknesses in a system. However, some of the scanners that
have been developed can be used outside a network in order to probe it for
potential ways of intruding or crashing the computers on the network.
A scanner is a software tool that looks at the various components of an operating
system and checks whether they are secure, for example some scanners for the
UNIX operating system are capable of checking whether the popular sendmail
utility is secure enough to prevent intrusion; other scanners can check the
robustness of the ftp facility at a site, for example by checking whether sending
an overlong password will result in an ftp server crashing.


Password cracker attack

Finds password by trying common ones


Also operate by brute force
Originally used by system administrators to
check on user passwords
It can be surprisingly
easy to crack passwords

Intro to passwords

A password cracker is a program which attempts to find out a user's password or
the identity of a number of passwords stored in the password file of a computer.
Crackers were originally used by system administrators to check that the
passwords that users of a system had chosen were not easily detectable. However,
they have also been used for criminal purposes, for example to gain access to a
computer system where the users have chosen easily detectable passwords such
as "system" or "admin".
Most password crackers either attempt to discover a password by consulting a
large corpus of words which users habitually employ for passwords, checking for
characteristics which make passwords easy to detect, such as a short
password of a few alphabetic letters which could be detected by a brute force
attack; or they may attempt a brute force attack on a password file.
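
The administrator's original use of such tools, checking that chosen passwords
are not easily detectable, can be sketched as a simple audit. The word list, the
minimum length of 8 and the names is_weak and search_space are illustrative
assumptions; real audit tools use far larger word lists.

```python
# Illustrative list only; "system" and "admin" are the examples from the text.
COMMON_PASSWORDS = {"system", "admin", "password", "letmein"}


def search_space(length: int, alphabet_size: int = 26) -> int:
    """Number of candidates a brute force attack must try in the worst case."""
    return alphabet_size ** length


def is_weak(password: str, min_length: int = 8) -> bool:
    """Flag passwords that a word list or a short brute force search would find."""
    if password.lower() in COMMON_PASSWORDS:
        return True                      # found by consulting a corpus of common words
    return len(password) < min_length    # short enough for brute force


if __name__ == "__main__":
    for candidate in ("admin", "fido", "correct horse battery"):
        print(candidate, "-> weak" if is_weak(candidate) else "-> passes this check")
    print("4 lower-case letters:", search_space(4), "possibilities")
```

A four-letter alphabetic password has under half a million possibilities, which
is why short passwords fall so quickly to brute force.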


Sniffer attacks
Devices which are used to read packets of
data moving around a network
Used by system administrators for detecting
inefficiencies in a network.
They can be used for siphoning off sensitive
data


These are devices which read the packets of data that travel around a network.
They have a legitimate use for systems administrators since they can be used for
determining the efficiencies and inefficiencies in a network, for example they can
be used for detecting choke points: parts of a system where network traffic is
heavy. They are also used by developers, for example, in order to judge the
design of a distributed system in terms of the traffic it generates.
However, they have often been used for siphoning off sensitive data. An intruder
might install a sniffer at a strategic point in a network such as a gateway and read
the traffic that is passing through the gateway. A successful sniffer can detect
hundreds, if not thousands, of passwords in a matter of hours and send them to a
remote computer where they can be used for unauthorised intrusions.
Sniffer attacks are, surprisingly, not very prevalent; however, when they occur
they can compromise a very large number of computers. For example, a recent
sniffer attack on a number of computers resulted in 268 sites (not computers, but
sites!) having their computers violated.

535

Trojan horse attacks

Code which looks legitimate


Often sent as shareware
Difficult to detect as they masquerade as
useful software and are stored in object
code form

Internet Technologies

536

A Trojan horse is malcode which looks legitimate but attempts to do something
which the user does not expect it to do. For example, a shareware program which
provides a system administrator with information about file usage in a networked
system but which, after a number of uses, destroys many of these files is an
example of a Trojan horse.
Trojan horses can be used for financially criminal purposes such as discovering
passwords and secure network information or they can be used to destroy
resources or carry out a denial of service attack.
The major problem with Trojan horses is that they are very difficult to detect.
They are difficult to detect for two reasons: the first is that they often masquerade
as utilities which would normally be found stored on a computer or would require
installing on a computer, for example in 1997 a Trojan horse was in circulation
masquerading as the popular Stuffit file compression program used in
Macintoshes. This particular Trojan horse deleted important files when it was
installed on a host.
The second reason that they are difficult to detect is that they are stored in a
computer in object code form and it can be difficult to detect whether they
contain malicious code.

536

Spoofing

A user masquerading as another


Common variety is IP spoofing
Another variety is DNS spoofing
Common user spoofing is the piggyback
attack

Internet Technologies

537

This is a jargon term used to describe the fact that an intruder uses a computer to
masquerade as another trusted computer in order to carry out operations that the
user(s) of the trusted computer are allowed to initiate. Spoofing does not require
the in-depth knowledge of passwords and authentication that the previous
intrusion methods do: it just relies on masquerading as a computer that a network
trusts. In order to understand what spoofing involves it is worth looking at one
variety of this technique known as IP spoofing. This attack uses the TCP/IP
protocol to subvert the normal authentication controls in a system by running a
computer which purports to have an address that is trusted.
Another form of spoofing is DNS spoofing. This is less serious than IP spoofing
as it can easily be detected; however, this has not prevented a small number of
such attacks over the last five years. It involves infiltrating a domain name server
and rewriting the files of the server so that a computer which is outside a network
can be given the same name as a trusted computer. This means that clients who
request a service from the trusted computer using a symbolic name would be
routed to the rogue computer which could then involve them in a dialogue in
which important information such as credit card details is elicited.

537

Technology attacks

Take advantage of security flaws in Internet
technology
Two targets have been Java and Active X
In the early days of the Internet browsers
were also insecure
Huge number of attacks on MS Windows
Internet Technologies

538

These are attacks which rely on security flaws in some software, often newly
released software. For example applets are Java programs which are downloaded
onto a client computer running a browser. In the early days of applets a number
of security violations occurred:
Applets could be used for denial of service attacks with certain browsers.
One browser was vulnerable to applets writing data to the system files used in
Windows 95.
An applet has been written which would automatically reboot Windows 95.
On one version of the Netscape Navigator browser an applet can capture a Web
page which acts as a form, read some data entered by the user and then send that
data to a remote server.
With some versions of the Netscape Navigator and Internet Explorer applets can
capture the IP addresses of computers in a closed network.

538

Cryptography
The main technological core on which security is based
It has been around since Roman times
The history of cryptography has been of novel schemes
initially being effective and then being cracked.
Lectures will look at symmetric and asymmetric
cryptography
Introduction to cryptography

Internet Technologies

539

Cryptography is the process of hiding the text in a message by transforming it.
Cryptography has been in existence since Roman times when the Roman army
used simple substitution ciphers to deliver orders.
Much of the remaining lecture and a half will look at cryptography and the
products that use it.

539

A warning

Cryptography is only one component
of an overall security system

Internet Technologies

540

It is worth giving a warning, though: cryptography is only one part of an overall
security system. There are many clerical, procedural and managerial controls
which are needed over and above the provision of utilities and tools based on
cryptography.

540

The basis of cryptography


[Diagram: ENCRYPTION. The plain text and a key are fed into the crypto
algorithm, which produces the cipher text.]

Internet Technologies

541

The term cryptography refers to a collection of techniques which are used to
ensure that data cannot be read by anyone who is not a party to its creation or
dissemination. It involves transforming a collection of data (often known as the
plain text) into a scrambled form known as the cipher text. For example,
scrambling the letters of the message 'I am here' so that it reads 'heIm a er' is
an example of cryptography in action, albeit a very simple and easily cracked
application.
Modern cryptography relies on immensely complex, validated algorithms to
transform a plain text into a cipher text, the transformation being known as
encryption. The algorithm that is used will vary its action according to a key.
This is a set of characters which change the action of the algorithm. A very
simple, easily crackable, example of this is an algorithm which will replace every
character in a plain text by its ASCII equivalent n positions ahead of it in a table
of ASCII codes, where the algorithm is varied by providing it with different
values of n. In this case n is the key albeit a trivial one.
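
The shift scheme just described is small enough to write out in full. The sketch
below assumes the plain text contains only 7 bit ASCII characters and wraps
around at the end of the 128 entry table; the function names are chosen purely
for illustration.

```python
def shift_encrypt(plain_text: str, n: int) -> str:
    """Replace each character by the one n positions ahead in the ASCII table."""
    return "".join(chr((ord(ch) + n) % 128) for ch in plain_text)


def shift_decrypt(cipher_text: str, n: int) -> str:
    """Undo shift_encrypt by moving each character n positions back."""
    return "".join(chr((ord(ch) - n) % 128) for ch in cipher_text)


key = 3  # the key n; anyone who knows it can reverse the transformation
cipher_text = shift_encrypt("I am here", key)
print(cipher_text)
print(shift_decrypt(cipher_text, key))
```

Because there are only 128 possible values of n, an intruder can simply try them
all, which is why this key is a trivial one.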

541

Cryptography
Algorithm changed by key
Symmetric cryptography described in
previous slide
Large number of algorithms available
Key has to be distributed (weakness)
Encryption changes original text and
decryption changes it back
Internet Technologies

542

What is described on the previous slide is symmetric cryptography, where the key
used to vary the algorithm is used by both the agent who is sending and the agent
who is receiving the message. This is a weakness as there must be a secure way of
distributing keys.
There are a number of high-powered, virtually uncrackable algorithms around;
later slides will describe them.

542

DES and AES as encryption
standards
DES was most widely used
Now replaced with AES because of
concerns over its 56 bit key length.
Used here as an example
Both are block ciphers

Internet Technologies

543

DES is something of a cliché among encryption standards; it is included here
because it represents some of the typical processes involved in encryption. It
was a standard for American government applications but is now in the process of
being retired. It has been replaced by the more secure AES standard, although DES
still has a use for medium security applications.

543

DES steps

Takes 64 bit plaintext
Passed through an initial permutation to produce a permuted input.
16 rounds of permutation and substitution are applied.
The output of the last round consists of 64 bits that depend on the plaintext
and the key
The left and right halves are then swapped to produce the preoutput
Finally the preoutput is passed through a permutation function that is
the inverse of the initial permutation function to produce the output

Internet Technologies

544

The processing steps carried out by DES are detailed above. These can be carried
out by either software or hardware.
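
The round structure on the slide can be illustrated with a toy cipher of the
same general shape (a Feistel network). This sketch is not DES: the initial and
final permutations are omitted, and the round function and key schedule, built
here from SHA-256, are stand-ins chosen only to keep the example short. What it
does show is 16 rounds applied to the two 32 bit halves of a 64 bit block,
followed by a swap of the halves, and the fact that decryption is the same
procedure with the round keys taken in reverse order.

```python
import hashlib


def round_function(half: int, subkey: bytes) -> int:
    """Toy mixing function (an assumption); DES uses expansion, S-boxes and a permutation."""
    digest = hashlib.sha256(subkey + half.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big")


def subkeys(key: bytes, rounds: int = 16) -> list[bytes]:
    """Derive one toy subkey per round; DES has its own key schedule."""
    return [hashlib.sha256(key + bytes([i])).digest() for i in range(rounds)]


def feistel(block: int, keys: list[bytes]) -> int:
    """Run a 64 bit block through the rounds, then swap the two halves."""
    left, right = block >> 32, block & 0xFFFFFFFF
    for k in keys:
        left, right = right, left ^ round_function(right, k)
    return (right << 32) | left  # final swap of left and right halves


def encrypt_block(block: int, key: bytes) -> int:
    return feistel(block, subkeys(key))


def decrypt_block(block: int, key: bytes) -> int:
    # Same procedure, round keys in reverse order.
    return feistel(block, subkeys(key)[::-1])


plain = 0x0123456789ABCDEF
cipher = encrypt_block(plain, b"demo key")
print(hex(cipher), hex(decrypt_block(cipher, b"demo key")))
```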

544

DES problems
Key too small
Processes involved lead to known plaintext
attacks being successful
Calculated a $1m machine could crack DES
in 2 hrs
Still useful for commercial and personal use

Internet Technologies

545

DES is becoming something of a historical curiosity: its key size is too small and
the processes involved lead to known plaintext attacks being successful. A
known plaintext attack is one where an intruder has access to a plaintext and
cipher text pair.

545

Some commercial encryption
algorithms

Triple DES
AES
Blowfish
IDEA
RC2
RC4
RC5

An introduction to AES

Internet Technologies

546

Triple DES. As its name suggests this is a variant of the DES scheme. It
involves applying the DES algorithm three times to a plain text. Triple
DES has been used by financial institutions such as banks as a more
secure alternative to DES.
AES. This is the American government's replacement for DES.
Blowfish. This is an algorithm which is capable of using a 448 bit key. It
is unpatented and is available for anyone to use.
IDEA. This is an algorithm developed in Switzerland and published in
1990. It uses a 128 bit key and is patented.
RC2. This is a cipher which was developed by the American security
researcher Ronald Rivest. It transforms blocks of data and relies on a key
which can range from 1 to 128 bytes.
RC4. This is a cipher which transforms data on a character by character
basis. It was originally a trade secret; however, it was published on a
Usenet newsgroup in 1994. It can employ a key which ranges between 1
and 2048 bits. The cipher was developed, like RC2, by the American
researcher Ronald Rivest.
RC5. This is a cipher which encrypts blocks of text and which was
developed in 1994, again by Ronald Rivest.

546

A cheering statement
You do not have to know the innards of a
cryptographic algorithm in order to use it.
There are a number of commercial products which
employ block ciphers and which just require the
programmer to call code methods or subroutines
such as encrypt(string or stream) and decrypt(string
or stream)

Internet Technologies

547

This is a heartening statement for the programmer.

547

Asymmetric cryptography
Also known as public key cryptography
An attempt to overcome the major problem
with symmetric cryptography: the key
distribution problem
Requires the use of two keys: a public key
and a private key
Introduction to public key cryptography
Internet Technologies

548

One of the problems with symmetric key encryption is that both participants are
required to use the same key. This means that the key needs to be distributed to
them both, perhaps over some secure medium. For example the keys could be
maintained in a key server which might be subject to being tampered with.
The Americans Diffie and Hellman developed public key cryptography in 1976. It
requires two keys: a public key and a private key.

548

Encryption and decryption
[Diagram: Alice feeds the plain text and Bob's public key into the encryption
algorithm; the ciphertext is sent to Bob, whose decryption algorithm uses Bob's
private key.]

Internet Technologies

549

Here two agents communicate. Alice wants to communicate with Bob. Bob
publishes a public key which Alice uses to encrypt the plain text. This is then
sent to Bob. Bob then uses the private key which forms part of the (private key,
public key) pair to decrypt the message.
Here there is no need for Alice to know Bob's private key.
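
The exchange between Alice and Bob can be mimicked with a toy version of RSA,
the public key algorithm introduced later in this lecture. The sketch below is
for illustration only: the primes are far too small to be secure, the message is
a single small integer, no padding is applied, and all the names are invented
for the example.

```python
# Bob's toy key pair, built from two tiny primes.
p, q = 61, 53
n = p * q                          # modulus, shared by both keys
e = 17                             # public exponent
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (needs Python 3.8+)

bob_public_key = (e, n)    # published by Bob
bob_private_key = (d, n)   # known only to Bob


def encrypt(message: int, public_key: tuple[int, int]) -> int:
    """Alice encrypts an integer smaller than n with Bob's public key."""
    exponent, modulus = public_key
    return pow(message, exponent, modulus)


def decrypt(cipher: int, private_key: tuple[int, int]) -> int:
    """Bob recovers the integer with his private key."""
    exponent, modulus = private_key
    return pow(cipher, exponent, modulus)


cipher_text = encrypt(65, bob_public_key)
print(cipher_text, decrypt(cipher_text, bob_private_key))
```

Alice needs only bob_public_key; the private exponent never leaves Bob's
computer.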

549

Using public keys for authentication


[Diagram: Alice feeds the plain text and Alice's private key into the encryption
algorithm; the ciphertext is sent to Bob, whose decryption algorithm uses
Alice's public key.]

Internet Technologies

550

Here asymmetric cryptography is used for authentication. Alice sends a message
to Bob. Alice uses her private key to do the encryption. Bob uses Alice's public
key for the decryption. If a valid message is received then he can be sure that
Alice has sent the message and it has not been tampered with.

550

Comparison
Symmetric
Same algorithm
Sender and receiver share key and algorithm
Key is secret
Impossible to decipher if no other info available
Knowledge of algorithm plus cipher text must be insufficient to determine the key

Public key
One algorithm but a pair of keys
Sender and receiver must have one of the matched set of keys
One of the pair must be secret
Knowledge of algorithm plus one of the keys together with samples of cipher
text must be insufficient to determine the other key

Internet Technologies

551

The slide above describes the main differences between each of the two
algorithm approaches.

551

Public key cryptography conditions


It should be easy to generate a pair of keys
It should be easy to generate the cipher text from the public
key and the plain text
It should be easy to decipher a message given the private
key
It should be impossible to find the private key knowing the
public key
It should be impossible to recover a message given a
public key and a cipher text
Encryption and decryption can be applied in any order
Internet Technologies

552

The slide shows the original five conditions detailed by the researchers Diffie and
Hellman for public key cryptography, together with a sixth which is useful but not
necessary.

552

RSA algorithm
Developed by Rivest, Shamir and Adleman
Block cipher in which the plain text and
cipher text are integers between 0 and n, for
some value of n
Algorithm based on factorisation of integers

Internet Technologies

553

The RSA algorithm is the most popular public key algorithm and is virtually
unchallenged.

553

Algorithmic aspects
Encryption and decryption require raising to a
power, quite a slow process although there are
some reasonable algorithms about.
Even so, public key cryptography is not used for
bulk transfer
Key generation involves finding two large prime
numbers. Usually done by randomly selecting odd
numbers and testing for primality
Internet Technologies

554

Public key encryption and decryption is computationally demanding because
numbers have to be raised to a large power. Doing this efficiently has been the
subject of a large amount of research. However, even the efficient algorithms that
have been developed are still so slow that carrying out the bulk transfer of data
using public key cryptography is infeasible. However, it does have major uses,
for example in key exchange.
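
The key generation step on the slide, randomly selecting odd numbers and testing
them for primality, can be sketched as follows. The Miller-Rabin test used here
is one common probabilistic primality test; the slide does not say which test a
given product uses, so treat the choice as an assumption. Python's built-in
three-argument pow carries out the modular exponentiation efficiently. A real
implementation would draw its random numbers from a cryptographic source such as
the secrets module rather than random.

```python
import random


def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin test: False means composite, True means almost certainly prime."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13):
        if n % small == 0:
            return n == small
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)  # modular exponentiation
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def random_prime(bits: int) -> int:
    """Pick random odd numbers of the given size until one passes the test."""
    while True:
        candidate = random.getrandbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate


print(random_prime(64))
```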

554

Attacks on RSA

Brute force: try all possible private keys


Mathematical attacks involving finding
very efficient factorisation algorithms
Timing attacks, equivalent to timing the
movement of a key combination lock

Internet Technologies

555

There are a number of possible attacks on RSA. The first is to try every possible
private key. The solution here is to make the key space large.
The second is to develop efficient factorisation algorithms. Factorising numbers
used to be very hard; it is becoming a little easier and is still the subject of
research. If RSA is ever cracked it will be because of an efficient factorisation
algorithm.
Timing attacks involve the attacker measuring the computation time for
deciphering messages. Happily this can be easily countered by, for example,
inserting random delays into the decipherment process.

555

Key management
Two aspects to this
The distribution of public keys
The distribution of secret keys for
symmetric cryptography

Internet Technologies

556

The next few slides look at the problems that are involved with key distribution.
The first problem is how public keys are notified to users who wish to
communicate with the user who has the public key.
The second is how symmetric keys can be distributed in such a way that there is
no possibility that an intruder can discover these keys. Because public key
cryptography is inefficient many users still use symmetric cryptography.
However, the key distribution problem is still a major drawback. Happily there is
a solution involving public key cryptography.

556

Diffie Hellman key exchange


Early algorithm
Still used in many products
Enables two users to exchange a secret
symmetric key
Relies on the difficulty of computing
discrete logarithms

Internet Technologies

557

A major algorithm that is used for secretly exchanging keys is based on public
key cryptography. It relies on the fact that it is computationally very expensive
to calculate the discrete logarithm of a number. In the description that follows I
shall refer to a, which is a primitive root of an integer. Do not worry how this is
calculated.
The algorithm is due to Diffie and Hellman and was developed in the seventies.
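
A minimal sketch of the exchange, using textbook-sized numbers, is given below.
The values q = 353 and a = 3 are small illustrative choices (a real exchange
uses a prime hundreds of digits long), and the function names are invented for
the example. Each user keeps a private number, publishes a raised to that number
modulo q, and combines the other user's public value with their own private
number.

```python
import secrets

q = 353  # public prime, agreed in advance (toy size)
a = 3    # public primitive root of q


def make_key_pair() -> tuple[int, int]:
    """Return (private, public) where public = a**private mod q."""
    private = secrets.randbelow(q - 2) + 1  # 1 <= private <= q - 2
    return private, pow(a, private, q)


def shared_secret(their_public: int, my_private: int) -> int:
    """Combine the other user's public value with my own private number."""
    return pow(their_public, my_private, q)


alice_private, alice_public = make_key_pair()
bob_private, bob_public = make_key_pair()

# Only the public values travel over the network.
alice_key = shared_secret(bob_public, alice_private)
bob_key = shared_secret(alice_public, bob_private)
print(alice_key == bob_key)
```

An eavesdropper sees q, a and the two public values, and would have to compute a
discrete logarithm to recover either private number.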

557

The management of public keys

Public announcement
Publicly available directories
Public-key authority
Public key certificates

Internet Technologies

558

The next few slides look at the problems involved in maintaining a set of public
keys and some of the solutions to these problems. Each solution is presented in
terms of increasing order of effectiveness.

558

Public announcements

Very easy
How it was intended to work
Major problem is that anyone can announce
a public key and masquerade as a user and
read data intended for the user

Internet Technologies

559

The simplest scheme is for a user to announce his/her public key in some public
forum such as a Web site or a news group. Anyone can then send messages to
that user employing the public key. The major drawback here is that anyone can
purport to be a user and issue a public key and then read data intended for that
user.

559

Publicly available directory


Trusted authority maintains a (name, public
key) store
Subscribers register a public key, either in
person or via some secure communication
Directory published periodically

Internet Technologies

560

Here some trusted authority maintains a key store which is regularly published.
The trusted authority uses security provisions to ensure that the owner of the
public key is who they purport to be.
This is still not 100% secure since someone can still find the private key of the
authority and pass out counterfeit keys.

560

Public key authority

Again a key store is used.


The authority has a private key
Clients of the authority are given the public
key

Internet Technologies

561

The scheme detailed above and in the next slide is much more secure and
involves communication between the subscribers and the authority using public
key cryptography.

561

Public key dispensing


A sends time-stamped message to authority asking
for public key of B
Authority sends encrypted message containing: B's
public key, the original request and a timestamp
A stores the public key of B and encrypts a
message to B containing an identifier of A and a
nonce which is used to identify the transaction
uniquely
B retrieves A's public key as in the previous steps
Internet Technologies

562

The message sent back to A will contain the public key of B, a time stamp so that
A can determine that this is not an old message with an out of date public key,
and the original request so that A can check that it was not tampered with
before it was received by the authority.

562

Public key certificates


Communication without an intervening
authority
Certificate contains a public key and other
information
Certificate created by trusted certificate
authority such as a PTT
Certificates can be verified that they have
been created by a valid authority
Internet Technologies

563

This is another very secure way of communicating keys. A user is given a
certificate issued by some trusted organisation. This certificate contains the
user's public key along with other information.
When a user wishes to send his or her key information, the certificate is
transmitted. The certificate is created by the certification authority, which
also dispenses the public and private keys associated with the certificate.
This topic is dealt with in a little more detail in the next lecture.

563

Requirements for a certification
scheme
Anyone can read a certificate and see the
name and public key of the owner
Anyone can verify that the certificate was
created by the certificate authority
Only the authority can create and update a
certificate

Internet Technologies

564

The bullet points above describe the important criteria used for certificates and
certification authorities. How they work out in practice will be described in the
next lecture.

564

Elliptic curve cryptography


Most products use RSA
As the bit length of keys has increased
processing load on RSA has increased
Elliptic curve technology is more efficient
Arcane maths used
The jury is still out on its security
effectiveness
Internet Technologies

565

The main technology used in public key cryptography is RSA. However, as
computers have become more and more powerful and capable of cracking it, the
key length has had to be increased. This has meant an increased processing load
on products that use RSA. Elliptic curve cryptography is a recent competitor
which is certainly faster than RSA. However, experience with it in the field is
limited and so no definitive statements can yet be made about its effectiveness;
the security community consensus so far is that it is at least as secure as
RSA.

565

Some attacks on messages

Disclosure
Traffic analysis
Masquerade
Content modification
Sequence modification
Replay modification
Repudiation
[Callouts: First two handled by cryptography; Digital signatures]
Internet Technologies

566

The slide above details some of the attacks that can be made on a message
transmitted:
Disclosure means that content is released to a third party.
Traffic analysis involves looking for patterns of data between participants.
Masquerade involves pretending to be someone else.
Content modification involves changing the data in a message before it is
received.
Sequence modification involves inserting, deleting or modifying individual
sequences of messages.
Replay (timing) modification involves delaying or replaying messages.
Repudiation involves denial of receipt or denial of transmission of a message.

566

Symmetric encryption as an
authentication mechanism
Provides confidentiality
Has a degree of authentication
Does not provide a signature facility

Internet Technologies

567

Symmetric encryption can provide a degree of authentication, for example it
provides evidence that a message has not been tampered with.

567

Public key encryption as an
authentication mechanism
Provides confidentiality from A to B using a
public key to encrypt
Does not provide authentication from A to
B using a public key to encrypt
Sending back using a private key to encrypt
provides authentication and signature

Internet Technologies

568

Using public key cryptography increases the amount of authentication and
confidentiality.

568

Message authentication codes


Uses a secret key to generate a
cryptographic checksum
Assumes that two parties share a common
key
Usually two keys involved: key for
encrypting the message and a key for
producing the checksum
Internet Technologies

569

Message authentication codes are used to provide a unique number which
characterises a message. This number is calculated using a function known as a
hash function. A message is sent along with its encrypted checksum and is
checked by the receiver, who decodes the checksum, calculates what the
checksum should be and compares it with the decoded checksum.

569

Hash functions
Mapping f from integer to integer
Computationally easy
Should have the property that if a is not
equal to b then the probability that f(a) is
not equal to f(b) is virtually 1

Internet Technologies

570

The basis of message authentication codes is that a function can be developed
which is relatively easy to calculate but which has a very high probability of
generating different values if the arguments to the function are different. Such
functions have been knocking around computer science for at least four decades.

570

Properties of a hash function


All the input to the function should
determine the output
If a bit is modified in the input then every
bit in the output has a probability of .5 of
changing
It should be computationally infeasible to
find a message which has the same output
as another message
Internet Technologies

571

The slide shows the main properties of a secure hash function. The main security
property is the final one. If the function has this property then a message cannot
be tampered with without the change being detected.
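
The second property on the slide, that changing one input bit changes each
output bit with probability .5, can be observed directly. The sketch below uses
SHA-256 from Python's standard hashlib module as the hash function (my choice
for the illustration; the lecture does not prescribe one) and counts how many of
the 256 output bits differ after a single input bit is flipped. The helper names
are invented for the example.

```python
import hashlib


def digest_bits(data: bytes) -> int:
    """Return the SHA-256 digest of data as one large integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with the bit at the given position inverted."""
    out = bytearray(data)
    out[index // 8] ^= 1 << (index % 8)
    return bytes(out)


def changed_output_bits(data: bytes, index: int) -> int:
    """Count the digest bits that differ after flipping one input bit."""
    difference = digest_bits(data) ^ digest_bits(flip_bit(data, index))
    return bin(difference).count("1")


message = b"I am here"
print(changed_output_bits(message, 0), "of 256 output bits changed")
```

For a good hash function the count should lie close to half of the output bits,
whichever input bit is flipped.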

571

The process
[Diagram: user A sends a message and a checksum to user B.]
User A encrypts message, calculates checksum of message and encrypts that.
Sends both
User B receives both, decrypts message and calculates the checksum. Decrypts
checksum and compares it with what has been calculated

Internet Technologies

572

A user who wishes to send a message encrypts it, calculates the checksum using a
hash function, encrypts that and then sends both to another user.
The second user decrypts both of these. He/she then calculates the checksum of
the sent message and compares it with the decrypted checksum; if they are the
same the message has not been changed.
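
A keyed checksum of this kind can be sketched with Python's standard hmac
module. This is a simplification of the process above rather than a transcript
of it: encryption of the message itself is left out, and instead of encrypting
an ordinary checksum the shared secret key is mixed into the hash calculation,
which is how HMAC works. The key and function names are invented for the
example.

```python
import hashlib
import hmac


def make_checksum(key: bytes, message: bytes) -> bytes:
    """User A: compute a keyed checksum of the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, checksum: bytes) -> bool:
    """User B: recompute the checksum and compare it with the one received."""
    expected = make_checksum(key, message)
    return hmac.compare_digest(expected, checksum)


shared_key = b"key known only to A and B"  # hypothetical shared secret
message = b"Pay 100 pounds to Bob"
checksum = make_checksum(shared_key, message)

print(verify(shared_key, message, checksum))                   # unmodified
print(verify(shared_key, b"Pay 900 pounds to Bob", checksum))  # tampered
```

An intruder who alters the message cannot produce a matching checksum without
knowing the shared key.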

572

Some message digest technologies

HMAC
The MD Series
The SHA series

Internet Technologies

573

There are a number of message digest algorithms available. Three are shown
above. HMAC combines a hash function with a secret key shared by the two
parties. The MD series uses 128 bit digests and SHA-1, from the SHA series
developed by the American National Security Agency, uses a 160 bit digest.

573

Uses for message digest functions

Checking message authentication


Generating pass-phrases
Virus checking
Used in digital signatures

Internet Technologies

574

Four uses for message digest functions are shown above. The first use has already
been discussed.
Pass phrases are phrases that are used to identify a user to a system, for example
'Hello I like potatoes for tea'. A message digest function can take such a phrase
and generate a password from it which is near unique and cannot be cracked by
guessing.
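
As a hedged illustration of turning a pass phrase into a password-like value,
the sketch below uses PBKDF2 from Python's standard hashlib module, which
applies a message digest function repeatedly together with a salt. The lecture
does not name a particular scheme, so PBKDF2, the salt value and the iteration
count are all assumptions made for the example.

```python
import hashlib


def key_from_pass_phrase(pass_phrase: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Derive a 32 byte value from a pass phrase using PBKDF2 with SHA-256."""
    return hashlib.pbkdf2_hmac(
        "sha256", pass_phrase.encode("utf-8"), salt, iterations, dklen=32
    )


salt = b"per-user salt"  # hypothetical; normally random and stored with the account
derived = key_from_pass_phrase("Hello I like potatoes for tea", salt)
print(derived.hex())
```

The same phrase and salt always give the same value, while a slightly different
phrase gives an unrelated one.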
Message digest functions are also used for checking virus infection. Each file in a
system is mapped to its message digest; if a virus has infected a file then it does
not match its digest value. A virus checker would periodically scan the files to
check this.
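
The checking scheme just described can be sketched as follows. To keep the
example self-contained the files are represented as an in-memory dictionary of
name to contents rather than read from disk, and SHA-256 is assumed as the
digest function; the names are invented for the example.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(content).hexdigest()


def record_baseline(files: dict[str, bytes]) -> dict[str, str]:
    """Map every file name to the digest of its current contents."""
    return {name: digest(content) for name, content in files.items()}


def changed_files(files: dict[str, bytes], baseline: dict[str, str]) -> list[str]:
    """Return the names of files whose contents no longer match the baseline."""
    return sorted(
        name for name, content in files.items() if baseline.get(name) != digest(content)
    )


files = {"editor.exe": b"original program", "notes.txt": b"meeting at ten"}
baseline = record_baseline(files)

files["editor.exe"] = b"original program + injected code"  # simulated infection
print(changed_files(files, baseline))
```

A periodic scan would rerun changed_files and report any file whose digest has
changed since the baseline was recorded.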
Message digests are also used in digital signatures (see next lecture).

574

Digital certificates
Already discussed lightly
Issued by a third party organisation such as a
national PTT
Contains name of the user, unique serial number,
the users public key, digital signature of the
certificate issuer.
In order to function the recipient of data needs
access to the issuers public key, often embedded
in packaged software such as a browser
Internet Technologies

575

I have already discussed digital certificates. To complete the story I shall look at
some practical examples of certificates and how they are used

575

The X.509 v3 standard


Standard for digital certificates
Contains all the information on previous
slide together with the ability to hold
name/value pairs which aid authentication,
for example the name of the message digest
function used to develop a digital signature
Most frequently used standard
Internet Technologies

576

The X.509 standard is the most frequently used standard. It uses digital signatures.
These are detailed in the next lecture. An important facility is that it provides the
facilities where extra authentication information can be contained in the
certificate via name/value pairs.

576

Processing digital certificates

Obtain the certificate of a user


Check that the signature provided by the
certification authority is valid using the public key
provided by the authority, often embedded within
a program such as a browser.
Use the public key specified in the digital
certificate to decrypt any data sent by the user
Internet Technologies

577

When you want to read some data which is provided by a user who is associated
with a public key and who is described by a digital certificate the process is as
follows:
Obtain the certificate. This might come bundled in some software provided by
the user or might be found on the users Web site.
Check the digital signature of the certificate issuer using the public key
associated with the certificate issuer; this can often be found in heavily used
software such as a browser.
If the previous step is successful employ the users public key found in the
certificate to decrypt data

577

Types of certificate

Certification authority certificates


Server certificates
Personal certificates
Software developer certificates

Internet Technologies

578

There are a number of different types of certificate. These basically contain the
same data. The only difference lies in the fact that extra data associated with the
particular use is included, for example the IP address of a server might be
included in a server certificate

578

Lecture 16
Security(ii)

Internet Technologies

579

579

Aims
To look at security on the Web
Examine the various types of viruses that can be
encountered.
Examine anti-virus tools
Look at the concept of a digital signature
Examine some architectural issues
Look at the effect that distributed applications
have on the development process
Internet Technologies

580

580

Digital signatures
Uniquely identify a person or organisation
that published a public key
Relies on message digest functions and
public key cryptography
Provides authentication
Needs knowledge of the message digest that
is being used
Internet Technologies

581

A digital signature is a collection of data which ensures that the recipient of a
message can be confident that the person who sends the message is who they
purport to be. The digital signature provides the same facility as an ordinary
written signature although it is much more secure and resistant to forgery.

581

Digital signature process


Agent A calculates message digest of some
message
Encrypt the message digest using a private key.
This creates the digital signature
Message and digital signature are sent to B.
B decrypts digital signature using public key to
obtain message digest
B calculates message digest and compares it with
the decrypted digest.
Internet Technologies

582

A digital signature is some data that uniquely identifies a person or an
organisation. Digital signatures rely on message digest functions and public key
cryptography. In order to describe how they work consider the sending of a
message from one agent (A) to another (B) where agent A publishes a public key.
The steps below assume that B knows what message digest function is being
used. The following steps occur:
Agent A calculates a message digest of the message to be sent.
The message digest is then encrypted using the private key. This is the digital
signature.
The message, along with the digital signature, is sent to B.
B decrypts the digital signature using the public key to obtain the message
digest.
B then calculates the message digest of the sent message using the message
digest function that A used and compares it with the decrypted digest. If they
match then the message has been sent by the owner of the private and public key.
An important point to make is that digital signatures can reveal whether
content has been changed en route. They ensure integrity but not privacy.
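
The steps above can be sketched with a toy RSA key pair and SHA-256 as the
message digest function. This is an illustration under stated assumptions, not a
usable signature scheme: the key is built from two small known primes, the
digest is simply reduced modulo n, and the padding that real products apply is
omitted. All names are invented for the example.

```python
import hashlib

# Agent A's toy key pair, built from two known Mersenne primes.
p, q = 2**61 - 1, 2**31 - 1
n = p * q
e = 65537                          # public exponent, published by A
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent, kept by A (Python 3.8+)


def message_digest(message: bytes) -> int:
    """Digest of the message as an integer smaller than n."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n


def sign(message: bytes) -> int:
    """A encrypts the message digest with the private key."""
    return pow(message_digest(message), d, n)


def verify(message: bytes, signature: int) -> bool:
    """B decrypts the signature with the public key and compares digests."""
    return pow(signature, e, n) == message_digest(message)


message = b"Transfer the files tonight"
signature = sign(message)

print(verify(message, signature))                        # genuine message
print(verify(b"Transfer the files tomorrow", signature)) # altered en route
```

Note that the message itself travels in the clear, which is why the scheme gives
integrity but not privacy.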

582

Electronic mail security

Uses conventional cryptography
technologies such as RSA
Two main technologies PGP and S/MIME
Both on standards tracks

Internet Technologies

583

The next set of slides details how the security technologies that have been
described in previous lectures and slides are employed in email. The major
product I shall examine is PGP since it is commonly available. However, the
other technology, S/MIME, will be standardised first.

583

PGP

Developed by Phil Zimmermann
Uses the best algorithms
Independent of platform
Source freely available
Low cost commercially supported version
available
Explosive growth

584

The product I shall concentrate on is PGP. This is a shareware product which
uses strong, freely available algorithms to secure email. A low-cost
commercial version is available which comes with commercial maintenance
support.

584

PGP services
Digital signature: DSS/SHA or RSA/SHA
Message encryption: CAST or IDEA or
Triple DES with Diffie-Hellman
Compression: Zip
Email compatibility: radix-64 conversion
Segmentation: implemented to
accommodate message size limitations

585

The slide shows all the facilities of PGP. The bulk sending of data employs a
standard symmetric encryption algorithm, for example IDEA, with public key
cryptography used to protect a one-time session key for this bulk transfer.

585

Authentication in PGP
Sender creates an email message
SHA-1 used to generate a hash code
Code is encrypted using RSA with the
sender's private key; this is placed before
the message
Receiver uses the sender's RSA public key to
decrypt the code
Receiver generates a new hash code for the
message and checks that it matches

586

Authentication is achieved in PGP by means of digital signatures. The algorithms
used for this are SHA-1 to generate a message digest and RSA to carry out
encryption and decryption.

586

Confidentiality in PGP
Based on one-time keys used for encrypting
and decrypting a single message
Algorithms used include CAST, IDEA or
triple DES
One-time keys are protected using RSA
(Diffie-Hellman is another option)


587

The bulk transfer of email messages is carried out using one of a number of
algorithms. Each message that is encrypted has a unique key generated for it
known as a session key. This key is a random number which is itself encrypted
using RSA so that only the recipient can recover it.

587

The PGP sending process


Sender generates the email message and a random 128 bit
number to be used as a session key
Message is encrypted with the session key
Session key is encrypted employing RSA and using the
recipient's public key and added to the message
Receiver uses his/her RSA private key to recover the
session key
Receiver uses the session key to decrypt the message


588
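
The sending process on this slide can be sketched as follows. The sketch makes several assumptions that are not PGP's own: the bulk cipher is a simple SHA-256 keystream standing in for CAST, IDEA or triple DES, the RSA key is built from two well-known Mersenne primes purely so that the example runs with the standard library, and the function names are hypothetical. It needs Python 3.8 or later for the modular inverse.

```python
import hashlib
import secrets

# Illustrative RSA key for the recipient, built from two Mersenne primes.
P, Q = 2**127 - 1, 2**89 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (Python 3.8+)


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Stand-in bulk cipher: XOR the data with a SHA-256 derived keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


def pgp_style_send(message: bytes):
    """Sender: random 128 bit session key, encrypt message, wrap the key."""
    session_key = secrets.token_bytes(16)
    body = keystream_xor(session_key, message)
    wrapped_key = pow(int.from_bytes(session_key, "big"), E, N)
    return wrapped_key, body


def pgp_style_receive(wrapped_key: int, body: bytes) -> bytes:
    """Receiver: recover the session key with the private key, then decrypt."""
    session_key = pow(wrapped_key, D, N).to_bytes(16, "big")
    return keystream_xor(session_key, body)
```

The point of the design is that the slow public key operation is applied only to the 16 byte session key, while the message itself, however long, goes through the fast symmetric cipher.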

588

Web security
A number of approaches depending on what
level the technology is used
IP/IPSec
SSL
Kerberos, PGP, SET, S/MIME


589

Web security can be applied at a number of levels in the Internet layered
architecture. At the lowest level it would involve using IPSec; one level above
it would mean employing the Secure Sockets Layer. An even higher level would
involve placing the security almost at the application level.

589

Web security considerations

Internet is bi-directional
Highly visible output for corporate activity
Web server software is complex
Web server can be used as a launching pad
into a business's network
Casual and untrained users employ the Web

590

There are a number of factors which make the Web insecure. First, traffic goes
into a Web server as well as leaving it. Second, Web software is complex and can
hide serious security weaknesses.
A Web server, if infiltrated, can be used as a launching pad from which an
attacker penetrates a network. Finally, many Web users are inexperienced and are
not aware of security risks.

590

Web attacks and countermeasures(i)


Integrity: modification of user data, trojan
horse browser, modification of memory,
modification of messages in transit.
Solution Cryptographic checksums.
Confidentiality: eavesdropping, theft of
server data, theft of client data, network
information. Solution Encryption and
firewalls

591

There are a number of threats which a web site can face. Two of them are
shown above, each followed by the measures that counteract it. In general these
measures work well.

591

Web attacks and countermeasures(ii)


Denial of service: killing of threads,
spawning lots of threads, filling up memory,
isolating computer by DNS attacks.
Solution: none very satisfactory
Authentication: Impersonation of users,
forging data. Solution: cryptographic
techniques

592

The only type of threat that is very difficult to counteract is the denial of service
attack where, for example, a machine resource such as connection threads is
used up to the point where genuine users cannot access the resource.

592

IPSec

RFC 1636
Security features issued for IPv6
Usable within current IP
Many vendors have some IPSec capability
in their products
It encrypts and authenticates traffic at the IP
level. Thus all distributed applications can
be made secure

593

This is a security standard that originated from the Internet Architecture Board. It
has the major advantage that it is based on the IP level in the layered Internet
architecture. Because all applications use this level they can all be secured in a
uniform way.

593

Benefits of IPSec
If implemented in a firewall or a router it
provides strong security for all traffic
passing through.
Makes a firewall stronger and makes it
virtually impossible to bypass.
Transparent to applications
Transparent to end users
Can be tailored to individual groups or users

594

There are major benefits to using IPSec. First, it is transparent to both
applications and users. Second, it strengthens firewalls in that all outside
traffic must use IP (for example when using HTTP) and the firewall is the only
entrance to a corporate network. Within a corporate environment the security
provisions can hence be relaxed.

594

Employing security above IP


Main technology is SSL or its successor
TLS
Uses secure sockets
Originated by Netscape
Transparent to a programmer: it just
involves creating a socket.


595

You can also implement security in levels above the TCP/IP level.
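
As a rough illustration of how little the programmer has to do, the following Python sketch opens a secured socket using the standard `ssl` module. The function names and the host `example.org` are assumptions made for the example; current libraries negotiate TLS, the successor to SSL, rather than SSL itself.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Context with the library defaults: the server certificate is
    verified and its host name is checked."""
    return ssl.create_default_context()


def open_secure_socket(host: str, port: int = 443) -> ssl.SSLSocket:
    """Create an ordinary TCP socket, then wrap it. The handshake and the
    encryption of all later traffic are handled by the library."""
    raw = socket.create_connection((host, port), timeout=10)
    return make_client_context().wrap_socket(raw, server_hostname=host)


# Example use (needs network access, so it is left commented out):
# with open_secure_socket("example.org") as s:
#     print(s.version())
```

Apart from the wrapping call the code is the same as for an unsecured socket, which is what is meant by the technology being transparent to the programmer.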

595

SSL protocols

Record protocol
Change cipher spec protocol
Alert protocol
Handshake protocol


596

SSL consists of four protocols:
The Record Protocol takes a message to be transmitted, fragments it, applies a
MAC (message authentication code) and encrypts it. It is concerned with the bulk
transport of data.
The Change Cipher Spec Protocol updates the current cipher spec that is to be
used for transfer.
The Alert Protocol is used to convey alerts such as a digital certificate not being
valid.
The Handshake Protocol is used to validate all the entities used in the transfer
process. It is the most complex.

596

The SSL transfer process (i)


Client communicates with the server and
sends data such as cipher settings
Server responds with a similar set of data.
The client authenticates the server
A premaster secret is created


597

The client sends the server a number of items of data including the
client's SSL version number, the cipher settings for the client and some
randomly generated data.
The server responds with a burst of similar data and also sends its digital
certificate; if the interchange of data requires the client to provide a digital
certificate then it will ask for this item.
The client authenticates the server; if this fails the user of the client is
informed.
Using the data that has been generated in the handshake the client creates
an item of data known as the premaster secret. This is used later in the
handshake.

597

The SSL transfer process (ii)

Server authenticates the client (optional)


Master secret generated.
Two sets of keys generated
Bulk transfer started based on keys.
When session is complete connection is
severed.
If a new session is required then the process
starts again
598

The server authenticates the client. This only happens if the transaction
requires both parties to be authenticated. SSL is capable of being used
when only the server is authenticated and so this step could be omitted,
and most of the time it is.
If the client and the server have been successfully authenticated then
both sides carry out the process of generating another item of data known
as the master secret; this item is partly generated from the premaster
secret. The master secret is a one-time 48-byte quantity that is used to
create the keys used in the bulk transfer of data between the client and the
server after the handshake has been completed.
At this point the client and the server generate a pair of keys from the
master secret. One key is used for encrypting and decrypting data from
the client to the server; the other key is used for encrypting and decrypting
data from the server to the client.
The handshake is complete and the client and the server can start
exchanging encrypted data employing one of the algorithms which are
built into the version of SSL that is used. Part of the handshake involves
the parties to the transfer of data deciding on which algorithm to use.
Once a session has been completed the connection is severed. If the two
parties wish to communicate again then they have to carry out the
handshake; each time that the handshake takes place a different pair of
encryption keys are generated and a different master secret generated.
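
The key generation steps can be sketched as below. The derivation function here is HMAC-SHA256 used in a counter mode, which is my own stand-in and not the function that SSL actually specifies; the labels, key lengths and function names are likewise assumptions. What the sketch does show is the shape of the process: both sides feed the premaster secret and the exchanged random data into the same function, so both arrive at the same master secret and the same pair of keys.

```python
import hashlib
import hmac


def derive(secret: bytes, label: bytes, seed: bytes, length: int) -> bytes:
    """Stand-in key derivation (not SSL's own): HMAC-SHA256 with a counter."""
    out = b""
    counter = 0
    while len(out) < length:
        counter += 1
        out += hmac.new(secret, label + seed + bytes([counter]),
                        hashlib.sha256).digest()
    return out[:length]


def session_keys(premaster: bytes, client_random: bytes, server_random: bytes):
    """Return (master secret, client-to-server key, server-to-client key)."""
    seed = client_random + server_random
    master = derive(premaster, b"master secret", seed, 48)
    block = derive(master, b"key expansion", seed, 32)
    return master, block[:16], block[16:]
```

Because the random data differs on every handshake, a new handshake yields a different master secret and a different pair of keys even if the same premaster secret were somehow reused.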

598

Implement security close to the application level
Further up the layered architecture than SSL
and IPSec.
Embed security in system or application
software
Good example is PGP
Another example is SET


599

A further approach is to embed security in a topmost layer close to the
applications, for example in system software such as an emailer or even in an
application. A good example of this approach is PGP, the email security
technology previously discussed in this lecture.

599

SET
Secure Electronic Transaction standard
Developed by Visa, MasterCard, RSA,
Microsoft etc.
Set of security protocols and formats
1998 first wave of SET compliant products
became available


600

SET (Secure Electronic Transaction) is an example of placing security high up
the layered architecture. It was developed by a consortium of credit card
companies, software vendors and security companies.

600

Aims of SET
Confidentiality of payment and ordering information
Integrity of transmitted data
Authentication that a user is a legitimate holder of a credit
card
Authentication that a merchant can accept credit card
orders
Best security practices
A protocol that neither depends on transport security
mechanisms nor interferes with them
Independence from hardware and software

601

The bullet points above describe the various criteria that were specified when
developing SET. In general all of them have been satisfied.

601

Key features of SET

Confidentiality of information
Integrity of data
Cardholder authentication
Merchant authentication


602

Confidentiality of information is secured by using DES as a bulk transfer
technology.
Integrity of data employs digital signatures which use SHA-1 hash codes. Certain
messages are also protected against tampering by HMAC using SHA-1.
Card holders are authenticated by means of X.509v3 digital certificates with
RSA signatures.
Merchants are authenticated in the same way as card holders: using digital
certificates.

602

Firewalls
Extra layer of protection placed around a
network
often employs a router which can filter off
certain types of message
A number of configurations, I shall detail
only two


603

One major way of guarding against a number of forms of attack is to design the
topology of your network in such a way that it is difficult for intrusion to occur.
For example, it can be virtually impossible for a sniffer to be placed in a network
if it is highly compartmentalised. One of the most effective ways of using
network topology is by implementing a firewall.
A firewall is an extra layer of protection placed around a network or around a
particular application. A firewall placed around a network will usually employ a
router which can be programmed to deny access to a network, for example it can
be programmed to deny access to any packets of data which have been sent to a
particular dedicated port.
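
A minimal sketch of the kind of rule such a router applies is shown below. The `Packet` class, the rule set and the function name are hypothetical; a real router is configured through its own rule language rather than Python, and the blocked ports are chosen only as an example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    source: str      # address the packet claims to come from
    dest_port: int   # port the packet is addressed to


# Hypothetical policy: ports the router refuses to forward
# (23 is telnet, 139 is the NetBIOS session service).
BLOCKED_PORTS = frozenset({23, 139})


def router_allows(packet: Packet, blocked_ports=BLOCKED_PORTS) -> bool:
    """Deny any packet sent to a blocked port; let everything else through."""
    return packet.dest_port not in blocked_ports
```

A stricter policy would turn the rule around and list only the ports that are allowed, rejecting everything else.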

603

Simple firewall or screened host firewall


604

The configuration above is intended to protect a Web server which dispenses pages to
the public from being compromised and perhaps acting as a starting point for a more
serious intrusion which affects other computers in the internal network. The
configuration involves a programmable router which is able to monitor, re-route and
reject packets of data and a Web server known as a bastion host or a proxy server. The
bastion host acts as a temporary store or cache of pages which have been dispensed by
a real Web server which resides within a closed network.
When a packet of data is processed by the firewall router it will determine what to allow
through to the internal network that it protects. Often the data allowed through will be a
very small subset of the data which could be sent to it: for example, it might only allow
through data which represents e-mails. If the router detects data which is intended for the
Web server it will forward the data to the bastion host. Any other data is rejected.
When the bastion host receives data which accesses Web services it will satisfy that
service. It will first check that the pages required by the request are contained in its cache
of pages; if so, then it will send the pages to the computer that requested them. If the
pages are not contained in the cache then it will request the real Web server, which
resides within the firewall, to send it the pages so that it can satisfy the request.
The use of a bastion host secures Web services because any intruder has to compromise
this computer before they can enter the network in which the real server resides. For
example, a malicious attack on the bastion host which attempted to delete Web pages
would only delete the temporary cached pages.
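
The behaviour described above can be sketched in Python. All the class and function names are hypothetical, the port numbers (80 for Web traffic, 25 for e-mail) are an assumed policy, and the real server is simulated by a dictionary of pages.

```python
class RealWebServer:
    """Stands in for the real Web server inside the closed network."""

    def __init__(self, pages):
        self._pages = dict(pages)
        self.requests_served = 0

    def fetch(self, path):
        self.requests_served += 1
        return self._pages.get(path)


class BastionHost:
    """Serves pages from a cache and asks the real server only on a miss."""

    def __init__(self, real_server):
        self._real = real_server
        self._cache = {}

    def get(self, path):
        if path not in self._cache:
            page = self._real.fetch(path)
            if page is None:
                return None
            self._cache[path] = page
        return self._cache[path]

    def wipe_cache(self):
        """All that a page-deleting attack on the bastion host destroys."""
        self._cache.clear()


def route(dest_port, bastion_ports=(80,), internal_ports=(25,)):
    """Firewall router: Web traffic to the bastion, e-mail inside, rest rejected."""
    if dest_port in bastion_ports:
        return "bastion"
    if dest_port in internal_ports:
        return "internal"
    return "reject"
```

After `wipe_cache()` the next request simply fetches the page again from the real server, which illustrates why deleting the cached copies does no lasting damage.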

604

A screened subnet


605

An even stronger use of a firewall is to employ two layers of protection: a router
which is open to the Internet and a further router which guards the internal
network.
In between these two routers there would be further bastion servers which offer
services that outside users may need to access, services such as a mail service or
an ftp service which enables customers to download company samples or
brochures; again these bastion servers would communicate with the real servers
which are located in an internal network. This form of organisation is known as a
screened subnet; the area in which the public services are located is often known
as a demilitarised zone. This security configuration is shown above.

605

Viruses

Executable virus
Data virus
Polymorphic virus
Startup file virus
Device driver virus
Stealth virus

606

There are three main types of virus: executable viruses, data viruses and device
driver viruses. An executable virus is a virus which is attached to an executable
file which, when executed, will result in the virus code being run. This code will
then carry out some malicious act such as deleting important files. A data virus is
a virus which infects a file containing data, rather than executable code. Often
this data is associated with some program which requires it in
order to carry out its functions. For example, many programs require a startup file
which initialises the program and sets up basic parameters for its operation. A
data virus could infect such a file and set the data in it to values such that the
program will crash or its functions
will be compromised; another type of data virus could add an entry to a password
file that allows access to an intruder. Another example is that of a data virus for a
word processor that can be easily written and which would corrupt every
document opened by the word processor or, even worse, delete every document.
A third class of virus is the device driver virus. This infects the device drivers of
an operating system which are then used to piggy-back into other parts of a
computer such as its file store. Happily this type of virus is usually associated
with older operating systems such as MSDOS.
There is also a further classification of viruses which categorises them by the
ways that they hide their presence on a computer. Two types of virus are
categorised in this way: the stealth virus and the polymorphic virus.

606

Anti virus software

Mainly works by continuous scanning


Often employs hash functions
Often based on known virus signatures
Virus signatures downloaded from Web
sites

607

Anti-virus software works by scanning the file store of a computer looking for
known viruses or for changes in files, for example an operating system file
suddenly becoming larger, even though no update to the operating system has
taken place.
These tools look for unusual changes in the files stored in a
computer and also look for file characteristics which are associated with known
viruses. Many of the tools allow the user to download a database of current virus
signatures; often these databases are only a matter of hours out of date so they
will catch most viruses.
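
Both checks can be sketched in Python. To keep the example self-contained the file store is represented as a dictionary mapping file names to contents; the signature database, its single entry and the function names are all invented for illustration.

```python
import hashlib

# Hypothetical signature database: virus name -> byte pattern it leaves behind.
KNOWN_SIGNATURES = {"demo-virus": b"\xde\xad\xbe\xef-demo"}


def digest(data: bytes) -> str:
    """Hash of a file's contents, used to detect any later change."""
    return hashlib.sha256(data).hexdigest()


def take_baseline(files):
    """Record a hash for every file while the system is known to be clean."""
    return {name: digest(data) for name, data in files.items()}


def scan(files, baseline):
    """Report files that match a known signature or differ from the baseline."""
    findings = []
    for name, data in files.items():
        for virus, pattern in KNOWN_SIGNATURES.items():
            if pattern in data:
                findings.append((name, "matches signature " + virus))
        if name in baseline and baseline[name] != digest(data):
            findings.append((name, "changed since baseline"))
    return findings
```

The hash comparison catches any change to a file, including one made by a virus that is not yet in the signature database, while the signature match identifies which known virus is responsible.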

607

Case study
Large financial institution
Screened host firewalls
Virus checkers both for network incoming
and physical incoming
Huge amount of non-computer security, for
example physical security, visitor security,
document security level specification

608

Lecture 17

Integration (i)


609

609

Conventional system development

Requirements analysis
Requirements specification
Design
Coding
Validation
Implementation


610

The slide shows the conventional way of developing software: requirements
analysis involves the extraction of the requirements for a system, requirements
specification involves writing those requirements down, design is about
developing an architecture, coding is the process of programming a system and
implementation is the process of bringing all the coded elements together.
These activities are accompanied by validation activities such as system testing,
design reviews, integration testing and acceptance testing.

610

The change

Requirements analysis
Requirements specification
Design (radically changed)
Coding (radically changed)
Implementation


611

Distributed application development, which is becoming the norm, has changed
the development practices of many developers, especially those who produce
enterprise systems. It has led to an integration paradigm in which the role of
design and coding has been drastically modified. Before looking at this it is worth
examining some of the technologies that enable an integration paradigm to be
implemented.

611

Aims
To describe four approaches to integration
To describe information oriented integration
To describe business process-oriented
integration
To describe portal-based integration.
To describe service-oriented integration
To introduce the role of some technologies
in integration

612

612

Integration
The process of bringing together chunks of
prewritten software.
Technologies that are involved include
XML, Web services, SOAP, Application
servers and protocols such as HTTP


613

Integration involves mostly bringing together prewritten chunks of software such
as accounting packages and using technologies such as XML and Web services to
carry out the binding process that implements the connection. A little coding is
usually carried out; however, this coding is normally quite small compared with
the amount of code residing in the chunks that are integrated.

613

The effects and non-effects of integration


Requirements analysis, requirements
specification and validation are unchanged.
Validation is still carried out
Functional design is virtually eliminated.
Design for performance and reliability are
the only design processes.
Coding is drastically reduced.

614

Many of the activities associated with the conventional development cycle are
unchanged, for example the developer will always need to know what the system
requirements are going to be. However, other activities are modified or virtually
eliminated. Design just involves looking at performance and reliability since the
prewritten chunks have already been designed. Coding is reduced or even
eliminated because the chunks will have been prewritten.

614

Messaging and integration


Messaging is the main technology used to
join chunks together.
Sometimes distributed objects are
employed.
Messaging is often implemented using
message-oriented middleware.


615

The main technique used to join chunks together is messaging where an entity in
a distributed application communicates its need for a service or the provision of a
service using a message. System software known as message oriented
middleware (MOM) is employed for this.
Sometimes distributed objects are employed for joining the chunks with, say, a
CORBA object being used to front-end a package.

615

XML and integration


XML is used as part of Web services
It is also used as an EDI mechanism
XML imposes syntax and semantics on
messages.
Standard DTDs can be held anywhere on
the Internet


616

XML is an important technology for integration. Messages expressed in XML
can be interchanged by the various chunks of software that are integrated. Using
XML means that the developer can define a standard form for the interchange
protocol which every part of the distributed system can understand. XML is also
used as the messaging mechanism used in Web services. Here messages
requesting a service are defined using the SOAP protocol.
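
As a small illustration, one chunk of software might build an XML message and another might read it as in the Python sketch below. The `order` message format, its element names and the two function names are invented for the example and are not part of any standard.

```python
import xml.etree.ElementTree as ET


def build_order(order_id: str, item: str, quantity: int) -> str:
    """Sending side: express an order as an XML message."""
    order = ET.Element("order", attrib={"id": order_id})
    ET.SubElement(order, "item").text = item
    ET.SubElement(order, "quantity").text = str(quantity)
    return ET.tostring(order, encoding="unicode")


def read_order(xml_text: str) -> dict:
    """Receiving side: recover the fields from the XML message."""
    order = ET.fromstring(xml_text)
    return {
        "id": order.get("id"),
        "item": order.findtext("item"),
        "quantity": int(order.findtext("quantity")),
    }
```

The sender and the receiver need to agree only on the message format, not on a programming language or platform, which is why XML suits integration work.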

616

Integration and application servers


Application server holds reusable
distributed components.
Takes care of the concurrency problems
associated with multiple transactions.
The key to their use in integration is
reusability.
Enterprise Java Beans is a major technology
that is associated with application servers.

617

Application servers are servers into which can be loaded reusable objects, for
example a warehouse object which could be used in a stocktaking application.
Such servers obviate the need to do any detailed programming to cope with
problems such as inconsistent updates or lost updates.
They are important for integration because the objects that are embedded in the
server are reusable and can be easily modified and moved from one application to
another. The modification usually requires very little change to the core code of
the object.

617

Integration and Web services


Web services are sites which provide some set of
services using standard Internet protocols such as
HTTP and a specialised protocol known as SOAP.
A web service sits on the Web waiting for clients
to ask for a particular service which it then
satisfies.
Leads to the concept of an Application Service
Provider (ASP)

618

A Web service is a service provided by some server which relates to an
application. An example of a Web service might be one which a Web mail
provider offers.
Web services are an important integration technology since they are widely
available, offer simple ways of accessing a service and can be quite generic.
They have given rise to the concept of an Application Service Provider (ASP):
a provider which offers some service to all and sundry. For example, an
ASP might provide a contacts database service for salespeople.
This is a step forward from the use of a package in that only one easily
maintainable copy of the service is available.

618

Frameworks and integration


A framework is a collection of classes which
implement the skeleton of an application.
It can then be instantiated for specific applications.
It enables families of applications to be developed.
Important for integration as it enables
functionality to be changed with not a lot of effort.


619

A framework is very much like the empty shell of a building. It contains all the
structural elements necessary to implement an application such as a purchasing
system. However, like the shell of an empty building it does not contain detailed
items that distinguish a specific application from another one.
Its importance with regard to integration is that a system can be developed in
terms of a framework and then instantiated for specific applications. It also
enables maintenance to be carried out efficiently.
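
The idea can be sketched in Python using an abstract class as the shell and a subclass as one instantiation. The purchasing example, the class names and the price list are all invented, and subclassing is only one of several ways in which a framework can be instantiated.

```python
from abc import ABC, abstractmethod


class PurchasingFramework(ABC):
    """Skeleton of a purchasing application: the overall flow is fixed here."""

    def process(self, item: str, quantity: int) -> str:
        if not self.approve(item, quantity):
            return "rejected"
        total = self.price(item) * quantity
        return f"ordered {quantity} x {item} at {total}"

    @abstractmethod
    def approve(self, item: str, quantity: int) -> bool:
        """Application-specific approval rule."""

    @abstractmethod
    def price(self, item: str) -> int:
        """Application-specific price lookup."""


class StationeryPurchasing(PurchasingFramework):
    """One instantiation: fills in the details the shell leaves open."""

    PRICES = {"pen": 2, "pad": 5}

    def approve(self, item, quantity):
        return item in self.PRICES and quantity <= 100

    def price(self, item):
        return self.PRICES[item]
```

A second application in the same family would supply a different subclass and leave `process` untouched, which is where the saving in maintenance effort comes from.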

619

Messaging
The main mechanism for communication
between chunks of an integrated system is
messaging.
Achieved via low level means such as
HTTP or via message-oriented middleware.
Design then becomes a way of tweaking the
messaging system so that response time is
minimised.

620

The main mechanism used for connecting chunks of a system together is
messaging. This might be low level messaging where HTTP is used to
communicate forms data or high level messaging where message oriented
middleware is used to store and forward messages from clients and servers.
The whole process of design with an integrated system is then split into two
areas: performance and reliability. With performance a major process is
modifying queuing parameters such as maximum queue length in order to ensure
that an optimum speed is achieved.

620

Integration servers
Servers which mediate between various
components of an integrated system.
Carries out processes such as merging data,
removing data and transforming data to other
formats.
More sophisticated servers are driven by business
process definitions.
BizTalk is the best example of an integration
server.

621

An integration server is a server that acts as a sort of hub between the individual
components of a system. These components will often be written in different
programming languages and will employ different technologies that use different
variants of languages. For example, there may be a number of different database
products used each employing a different variant of SQL. An integration server
sits in the middle of an integrated system and is programmed to coordinate the
various commands and messages that are exchanged from one component to
another. For example, an integration server will be able to intercept an SQL
command from one database system, decode it and change it into an SQL
command for another database system.
Very sophisticated integration servers can be programmed using some Business
Process Definition language. This effectively puts a business-oriented notation on
top of the code found in the integration server.

621

Scripting languages
Often need for glue code to join together
components of an integrated system. Usually this
is a relatively small amount of code.
Scripting languages such as Perl, Python and Ruby
are often used for this.
Ruby is becoming an increasingly popular
language due to its connection with Ruby on
Rails.

622

Sometimes integrated systems will require some new code to be written, for
example when a new function is to be implemented or when some transformation
cannot be carried out by an integration server. Scripting
languages are interpreted languages, often heavily oriented towards string
processing, and are easy to debug. Perl has often been the language of choice for
this. However, newer languages such as Python and Ruby are coming to the fore,
particularly Ruby, which has a rapidly increasing following due to a package
known as Ruby on Rails used for web site development.

622

JSP and ASP


These are similar in use to scripting
languages.
They allow program code to be interspersed
with HTML.
JSP allows Java code to be interspersed.


623

Scripting languages are used to provide the glue which brings together
components of an integrated system and also develop code required for any extra
functionality. JSP and ASP are similar in use. These are technologies that allow
program code to be interspersed with HTML with the program code providing
the functionality that the HTML displays. JSP (Java Server Pages) allows Java
programming code to be interspersed with HTML and allows virtually any
processing that can be used with Java as a standalone programming language.

623

Message-oriented middleware
Used as buffer between integrated
components
Very simple API
A number of mature products
Supports interrupted running
No reliance on any programming
technology

624

Message-oriented middleware is a systems technology that is used as a buffer
between components of an integrated system. In essence it consists of
a set of queues which hold transactions (messages) sent from one part of an
integrated system to another. These messages are deposited on a queue and then
taken from the queue by another component of the integrated system.
The programming required for this technology is very simple: the APIs only
really include code which deposits and removes transactions from a queue. A
major advantage of this type of technology is that it is language independent, for
example a Java program could deposit a transaction on a queue while a C++
program could remove that transaction.
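
The two operations can be sketched in Python as below. The `MessageQueue` class is a hypothetical, in-process stand-in for a single queue; a real middleware product would hold the queue in a separate server and store it persistently.

```python
import queue


class MessageQueue:
    """Hypothetical in-process stand-in for one middleware queue."""

    def __init__(self):
        self._q = queue.Queue()

    def deposit(self, message: str) -> None:
        """Called by the sending component; returns without waiting."""
        self._q.put(message)

    def remove(self):
        """Called by the receiving component. Returns the oldest message,
        or None if nothing is waiting."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None
```

Because the sender only deposits and the receiver only removes, neither needs the other to be running at the same moment, which is what the slide means by support for interrupted running.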

624

The Internet and integration

What is clear is that there is a rapid
acceleration towards integration powered by
the Internet. The Internet has enabled
communication over large distances using
standard protocols which all Internet-based
software has to implement.

625

The main driver towards integration has been the Internet. Not only has it
spawned standards such as HTTP which every piece of Internet software has to
implement, but it has also given rise to XML, a technology that can be used for
defining industry specific standards. If it wasn't for the standards features of the
Internet I would not be lecturing on integration here today.
Another important part of the drive to integration that the Internet has enabled is
the fact that by using message passing, large systems which are developed using
a variety of technologies can be connected even though these systems might be
many thousands of miles apart.

625

Integration and incremental development


Current from scratch development can be
disastrous, particularly for large systems
Integration allows a gradual development
from existing resources
Initially a rough and ready system is
developed
Then it is gradually improved

626

Integration provides an extra tool in the system developer's toolbox. It enables an
incremental form of development where, say, an existing set of systems which
provide 70% of eventual functionality is built upon by adding functionality.
When 100% functionality is achieved the system is improved by replacing
less optimal parts of the system. This would still employ an integration paradigm
with small steps in improvement being carried out.

626

Types of integration

Four types
Information-oriented
Business process-oriented
Service oriented
Portal-oriented


627

In the remainder of this lecture I shall look at four types of integration:
Information-oriented integration, where a number of relatively mature
technologies are used to combine databases together. This is a very
well-established form of integration which uses tools that have been around for a
couple of decades.
Business process-oriented integration involves driving the process of integration
via a definition of the business processes involved in the new, integrated system.
This is the least mature of the methods.
Service-oriented integration involves using services, usually web services, as the
building blocks of an integration. This form of integration is quite recent and has
started to be considered by enterprises that employ web service
technologies.
Portal-oriented integration is a quick and dirty way of integrating. It involves
mashing together a number of web pages which communicate with the
components of the integrated system using technologies such as scripting
languages.


Features of some of the approaches


Degree of maturity of the technologies.
Ability to work with a diverse range of
technologies.
Amount of human intervention.
Processing and communicational overheads.


There are a number of ways of judging any of the approaches mentioned on the
previous slide. One is how mature the technologies are: business process type
integration, for example, relies on state-of-the-art technology and there have
been some recent disasters. Another criterion is the degree to which each of the
methods is able to work with a wide variety of technologies, not just those bound
to some proprietary set of standards. A further criterion is how much human
intervention is required: for example, portal-oriented integration gives rise to
systems quickly but relies on more human operator intervention when they are
operating than other forms of integration. A final, important criterion is how
much processing and communication overhead is generated.


Portal-oriented integration
Views a number of systems via a single, usually
web-based interface.
Other forms of integration use technologies that
are real-time and user-driven. Portal-oriented
integration involves a human operator carrying out
the coordination.
Can be a very rough and ready approach involving
the development of quite a bit of glue software.

Portal-oriented integration involves bringing together a number of disparate
company systems and front-ending them with an interface, almost invariably a
web-based interface. In order to do this more coding than is normally expected is
carried out and technologies such as scripting languages or server page
technologies are employed.
Sometimes portal-based integrated systems use an application server or
integration server to coordinate the various components that make up the system.


An example
[Figure: a browser reaches a portal server, which fronts an SQL Server based
system, a COBOL-based flat file system and an Internet-based system.]

The figure shows how a portal server brings together all the potentially disparate
components of an integrated system. These components can be as disparate as the
three shown in the figure.


An example
Company wishes to implement a bulk
buying site.
Needs web access to both customers and
staff.
Needs access to the purchasing systems of
wholesalers.


Here is an example of an application which is ripe for a portal-based
implementation. It is of a company that runs a web site which advertises a certain
category of goods, say white goods. Customers are allowed to log into the site
and express an interest in buying a particular item. When the number of
interested customers reaches a certain point the company negotiates a bulk
buying deal. When the deal is finalised the company takes a percentage rake-off.
For this system to work, the application must be open to the customer and mediate
the customer's access to the company's system; the company's system must, in
turn, be able to access the purchasing systems of suppliers.


Categories of portals

Single system portals


Multiple-enterprise system portals
Trading community portals


A single system portal is a portal that is situated within a single enterprise such
as a commercial company or a hospital. It just integrates all the disparate systems
which are found in the enterprise. It effectively takes all the interfaces
that are in existence and unites them into a web interface.
A multiple-enterprise system portal is the most common portal. Here a number
of enterprise systems are connected via some server technology which allows
access from one enterprise system to the resources of another enterprise system.
The different systems that are integrated could be diverse and include SAP-based
systems, legacy systems based on technologies such as COBOL, systems based
on packages such as inventory packages and advanced web-based systems.
A trading community portal is one in which many companies are involved
in the integration, with the portal bringing together all the various systems that
are maintained by the companies.


Mashing
A recent form of amateur portal development.
Here simple tools are used to combine a number
of web sites which are accessed through a single
web site.
An example of this might be a web site which
used mapping technology to provide a search
facility for someone wanting to buy a house.


Mashing is a form of portal-based integration that brings together a number of
web sites and offers services based on compositing the functions of the web sites.
Mashing has come into greater prominence with the arrival of APIs such as the
Google API and the eBay API, which provide programming-language-based
interfaces to the functionality of the web site.
The ChicagoCrime.org Web site is a great intuitive example of what's called a
mapping mashup. One of the first mashups to gain widespread popularity in the
press, the Web site mashes crime data from the Chicago Police Department's
online database with cartography from Google Maps. Users can interact with the
mashup site, for example by instructing it to display a map containing
pushpins that reveal the details of all recent burglaries in South Chicago.
The concept and the presentation are simple, and the composition of crime and
map data is visually powerful.
A number of technologies can be used for mashing, including web services,
screen scraping and scripting languages.


Information-oriented integration
Used for high data usage applications
Involves combining databases
The main processing is that of moving data
between components of an integrated system.
Wide variety of technologies can be employed:
integration servers, custom converters, database
replication software and special purpose code
developed using conventional or scripting
languages.

The second approach to integration that I describe is information-oriented
integration. The type of application this approach suits is an application which is
very data rich and combines a number of databases. For example, an integrated
system which is developed when two financial institutions such as banks are
merged would be the archetypal system for this approach.
Because many systems are data rich and the technologies that are employed are
mature this is currently the most popular form of integration. Technologies that
are used for this form of integration include data replication software which takes
data at one database, replicates it and then deposits it at another, usually remote,
database; custom programs which periodically send data from one database to
another; and convertors which allow parts of a proprietary database to be sent to
another, different proprietary database.


Architecture
[Figure: two groups of databases connected by a transfer medium.]


This figure shows the essential architecture of a system which has been
developed using an information-oriented approach to integration. Here a central
medium employs one or more technologies to transfer data from one database to
another. These databases will usually be different in scope or different in terms of
manufacturer.


Data replication tools


Just one tool in information-oriented integration.
Can be programmed to extract a whole
database or part of a database.
Can be programmed to transform data.
Number of products on market.
High overhead if not used properly.
Various parameters can be set including frequency
of replication, conditions when replication occurs,
targets for replication etc.

One of the most important tools used for information-oriented integration is the
data replication utility. This takes a tabular description of a set of databases and
carries out replication and writing according to some script embedded in the
table. This will determine:
Which databases are to be replicated from.
Which databases are going to receive the data.
What parts of a database are going to be copied.
Which modifications are to be carried out on the data.
When the replication is going to occur.
What conditions must hold for a replication to occur.
What monitoring information is produced.
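
The kind of table entry described above can be pictured as a small data structure. The following is only a minimal sketch under stated assumptions: the class ReplicationRule, its fields and the database names are hypothetical and do not correspond to the format of any real replication product, rows are modelled as in-memory maps, and monitoring output is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical model of one row in a replication table (not a real product's format).
final class ReplicationRule {
    final String sourceDb;                              // database replicated from
    final String targetDb;                              // database that receives the data
    final Predicate<Map<String, String>> selects;       // which part of the database is copied
    final UnaryOperator<Map<String, String>> transform; // modification applied to each row
    final BooleanSupplier condition;                    // must hold for replication to occur
    final int everyMinutes;                             // when the replication occurs

    ReplicationRule(String sourceDb, String targetDb,
                    Predicate<Map<String, String>> selects,
                    UnaryOperator<Map<String, String>> transform,
                    BooleanSupplier condition, int everyMinutes) {
        this.sourceDb = sourceDb;
        this.targetDb = targetDb;
        this.selects = selects;
        this.transform = transform;
        this.condition = condition;
        this.everyMinutes = everyMinutes;
    }

    // Returns the rows that would be written to the target database.
    List<Map<String, String>> replicate(List<Map<String, String>> sourceRows) {
        List<Map<String, String>> out = new ArrayList<>();
        if (!condition.getAsBoolean()) {
            return out; // condition does not hold, so nothing is replicated
        }
        for (Map<String, String> row : sourceRows) {
            if (selects.test(row)) {
                // Transform a copy so that the source row is left untouched.
                out.add(transform.apply(new HashMap<>(row)));
            }
        }
        return out;
    }
}

final class ReplicationDemo {
    static List<Map<String, String>> run() {
        Map<String, String> uk = new HashMap<>();
        uk.put("id", "17");
        uk.put("region", "UK");
        Map<String, String> fr = new HashMap<>();
        fr.put("id", "18");
        fr.put("region", "FR");

        ReplicationRule rule = new ReplicationRule(
            "branchDb", "headOfficeDb",
            row -> "UK".equals(row.get("region")),   // copy UK customers only
            row -> { row.put("id", "CUST-" + row.get("id")); return row; },
            () -> true,                              // always allowed in this sketch
            60);
        return rule.replicate(Arrays.asList(uk, fr));
    }
}
```

A real utility would read such rules from its table and run them on a schedule; here the schedule is only recorded, not acted upon.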


Steps for information-oriented integration


Understand the metadata (database schemas)
Select the data that needs to be transferred
Detail the events that occur that give rise to data
transfers
Determine the frequency of the data movement
Determine any reformatting that is needed
Choose some technology or technologies
Implement the solution using the technology or
technologies

This slide shows the steps that are needed in order to carry out this form of
integration:
First, you will need to understand the structure of the data that is stored in the
databases that will make up the eventual system. Database schemas
are the normal place to look for this.
Once the functionality of the new system has been determined it is necessary to
select the data that is to be transferred.
The frequency with which data is to be transferred needs to be determined; it
could be every few seconds or it could be a daily update, depending on the
application.
If any reformatting of the data is required, for example a customer designator
needs some extra characters, then this needs to be determined.
The technologies used for the transfer then need to be chosen. Normal criteria
such as overhead, communication time, cost etc. would be used here.
Finally the technologies that are employed to carry out the transfer are deployed.
This might involve some programming in something like a scripting language or
the creation of tables such as data replication tables.


An important point

All four of the techniques that are
described here have fuzzy boundaries, and
in many integrated systems you might
find that more than one technique has been
employed.


An important point to be made about the techniques described in this section is
that often an integration might use more than one of them. For example,
information-oriented integration might be combined with portal integration to
develop a system where there are a lot of data transfers but where a central human
operator has to carry out new functions using a web interface.


Service-oriented integration
Based on application services
Connects together individual software
nodes that offer a number of services.
Has come to the fore with the rise of web
service technology.
Some application services are public, for
example those associated with Amazon,
eBay and Google

Service-oriented integration views an integrated system as a series of components
that expose application services to the other components. An example of such a
component might be a stock control system that offers the services:
Find all stock that is close to replenishment.
Find out whether we can satisfy an order for a particular number of items.
The important concepts here are the clean functional interface and the ability of
individual components to interact with each other using standard protocols and
formats such as HTTP, SOAP and XML.


The Google API


An example of a type of application service.
API replicates many of the functions that you employ
when you access Google using a web interface.
Number of languages available.
Typical function is to search for sites using some collection
of keywords.
Lesson: it was withdrawn in 2007.
Other sites have their own API, for example eBay.


As well as services being developed for proprietary systems there are a number of
application services that are associated with public web sites such as eBay and
Google. Such sites offer an API, a set of method or subroutine calls which invoke
the functions of the site.


When to use service-oriented integration


Two or more companies need to share program
logic
Two or more companies want to share
development costs and value of an application
When the application domain is small and a
common application can be developed that a
number of companies are capable of sharing.


There are three major reasons for developing a system using service-oriented
integration.
Where some functionality is to be accessed and that functionality might
change over time.
Where there is a need to share the development costs of a project and where each
of the companies involved wants a clean interface to the functionality of some of
the components.
Where the problem area is small and a common application is to be developed
which companies want to share.


Architecture
[Figure: the programmer sees an interface layer, behind which sit the internal
systems.]


Service-oriented architectures are not reliant on web services; a number of
technologies can be used. They utilise a major principle in software engineering:
that of hiding all the details of a system behind a wall of functionality. For
example, a service associated with a stock control system might be: find all the
stock items which are below their replenishment levels. A programmer can invoke
this function without knowing implementation details, for example the relational
tables that store product details.
All sorts of technologies can be used for the interface layer including XML-RPC,
distributed objects and, of course, web services.
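
The idea of hiding a system behind a wall of functionality can be illustrated with a plain Java interface. This is a minimal sketch, assuming hypothetical names throughout (StockControlService, InMemoryStockControl and the stock items); a real interface layer would be exposed through a technology such as XML-RPC or web services and backed by relational tables rather than an in-memory map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// The "wall of functionality": callers see these two operations and nothing else.
interface StockControlService {
    List<String> findStockNeedingReplenishment();
    boolean canSatisfyOrder(String item, int quantity);
}

// Hypothetical implementation detail that the caller never needs to know about.
final class InMemoryStockControl implements StockControlService {
    // item name -> {quantity on hand, replenishment level}
    private final Map<String, int[]> stock = new LinkedHashMap<>();

    void add(String item, int onHand, int replenishmentLevel) {
        stock.put(item, new int[] {onHand, replenishmentLevel});
    }

    @Override
    public List<String> findStockNeedingReplenishment() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : stock.entrySet()) {
            int[] levels = entry.getValue();
            if (levels[0] < levels[1]) {   // below its replenishment level
                result.add(entry.getKey());
            }
        }
        return result;
    }

    @Override
    public boolean canSatisfyOrder(String item, int quantity) {
        int[] levels = stock.get(item);
        return levels != null && levels[0] >= quantity;
    }
}
```

Code written against StockControlService keeps working if the in-memory map is later replaced by a database or a remote call, which is the point of the interface layer.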


Business process-oriented integration


Can be regarded as an extra layer on top of
a services solution
Employs some business process language
In its most sophisticated form dependent on
some integration server such as BizTalk
Can be implemented without a sophisticated
integration server.

The final method of carrying out integration is that of business process-oriented
integration. Here the first thing that the developer does when producing an
integrated system is to define the business processes that need to be supported.
Once these have been defined some mechanism is then used to execute the
various components in the system so that the processes are carried out.
This form of integration is at present the least popular. This is mainly because
integration servers have yet to make a major impact on the market and also
because most integration projects are quite small and do not require the
sophistication of this approach.


A business process
Some action expressed in business terms
rather than technical terms.
A variety of languages have been defined
Languages include BPEL, XLANG, WSFL
and BPML
Often supported by both a textual notation
and a graphical notation

Business processes are a series of steps which give rise to some business result
such as a credit card being validated. They are expressed in business terms and
hence would use a vocabulary consisting of words such as invoice, bill, picking
list and account.
A language for expressing business processes is usually very simple and consists
of a set of control structures together with some modularisation facility such as a
subroutine. There are a number of languages in existence; however, there are
major signs of a shake-out, with BPEL emerging as the winner.


Business process technology components

Graphical modelling tool


Business process engine
Business process monitoring interface
Business process engine interface
Integration middleware


According to Linthicum* there are a number of components to a business process
modelling system:
A graphic modelling tool. Essentially an editor which allows the user to define
business processes and modify existing processes.
A business process engine that executes the business process, initiates
transactions and maintains system state, often over very long periods.
A business process monitoring interface. This enables the users of an integrated
system to control and monitor the execution of a business process, for example
finding data on the number of transactions that have been initiated by a particular
customer.
A business process engine interface. This allows other software, both
middleware and application software, to interact with the business process engine.
Integration technology, for example an integration server that actually carries
out the process of coordinating the actions of the components of an integrated
system.
* Next Generation Application Integration, D Linthicum, Addison Wesley, 2004


An example of a business process


Take credit card from customer
if it is an allowable card
    check the validity of the card
    if the card is valid
        get the order from the customer
        check that the item ordered is in stock
        ...


This is a simple example of a business process definition, one which is the front
end of a purchasing process. As you can see it is expressed in programming terms
but lacks the detail associated with a programming language.


Architecture
[Figure: BP middleware linking the business processes and the components of the
integrated system.]

The schematic shows the role of the Business process middleware. It consults the
business process definitions and then coordinates the action of the program code
of the components of the integrated system. The next lecture will describe the
technologies used to implement the BP middleware.


Lecture 18

Integration (ii)


Aims
To outline some middleware models
To describe some technologies that are used
to implement the integration systems
software layer.
To examine in a little detail the role of
integration servers.
To look at BizTalk Server as an example of
an integration server.

Integration models

Three major models


Point-to-point
Central hub
Integration hub (with publish and subscribe)


There are three main models used for integration architectures. The point-to-point
model connects every node with all other nodes. The central hub model employs
some technology as a central marshalling point. This technology is usually some
form of integration server. The integration hub is where a bus is used to send
messages and transactions and where individual components of the integrated
system subscribe to certain types of message and certain types of transactions.


Point-to-point architecture
Point-to-point
All connections


Here the components of the integrated system are joined by a multitude of
connections. This is the type of architecture that is used for small systems or for
systems which have grown without much software engineering discipline being
applied. The major problem here is that as new functions are added and new
components integrated the number of connections rises rapidly, as the square of
the number of components. The approach also suffers from a lack of consistency in
that different connections may be implemented using different technologies and
connection paradigms. Such systems often hinder a company in its business
expansion in that the effort to add new components may be prohibitively high.
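
The growth in connections can be made concrete with a little arithmetic. The sketch below assumes that every pair of components shares one two-way connection in the point-to-point model, giving n(n-1)/2 connections for n components, and that each component has a single connection to the hub in the central hub model. The class and method names are illustrative only.

```java
// Illustrative arithmetic only; names are hypothetical.
final class ConnectionCount {
    // Point-to-point: one two-way connection per pair of components, n(n-1)/2.
    static long pointToPoint(int components) {
        return (long) components * (components - 1) / 2;
    }

    // Central hub: each component has exactly one connection, to the hub.
    static long centralHub(int components) {
        return components;
    }
}
```

Under these assumptions, going from 5 to 20 components takes a point-to-point system from 10 connections to 190, whereas a hub-based system goes from 5 to 20.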


Central hub architecture


Hub surrounded by components
Central hub


Here a central hub, often implemented by an integration server, mediates the
communication between the individual components of the integrated system,
carrying out transformation, routing and storage tasks; for example the hub might
change a name so that the forename is converted into an upper case initial.
The advantage of this approach is clear: it gets rid of a large amount of the
functionality associated with multiple connections. All connections are to the hub
and are usually implemented using message passing. The hub is a vital part of
such a system and normally it is replicated so that if any problems occur with it
then a backup server is started.
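
The name transformation mentioned above is the sort of small, mechanical step a hub carries out on messages in transit. A minimal sketch, assuming a name is held as a single string with the forename first and a space before the surname; the class and method names are hypothetical.

```java
// Hypothetical hub transformation; assumes "forename surname" in one string.
final class HubTransform {
    // Replaces the forename with its upper case initial.
    static String forenameToInitial(String fullName) {
        String trimmed = fullName.trim();
        int space = trimmed.indexOf(' ');
        if (space <= 0) {
            return trimmed; // no separate forename to shorten
        }
        // The char is joined to the rest of the name by string concatenation.
        return Character.toUpperCase(trimmed.charAt(0)) + trimmed.substring(space);
    }
}
```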


Integration hub
Central bus with components subscribing


Here a central bus is used for communication between the individual components
of the system. The components subscribe to messages that are of interest to them;
for example, a component carrying out database processing might subscribe to all
those transactions that affect the databases that it holds. Other components
publish messages and transactions to the bus.
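
The publish and subscribe idea can be sketched in a few lines. This is a minimal in-process illustration, not a real integration product: the MessageBus class, the message type names and the use of plain strings for message bodies are all assumptions made for the example, and delivery is synchronous.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-process bus illustrating publish and subscribe.
final class MessageBus {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    // A component registers its interest in one type of message.
    void subscribe(String messageType, Consumer<String> handler) {
        subscribers.computeIfAbsent(messageType, key -> new ArrayList<>()).add(handler);
    }

    // A component publishes a message; returns how many subscribers received it.
    int publish(String messageType, String body) {
        List<Consumer<String>> handlers = subscribers.get(messageType);
        if (handlers == null) {
            return 0; // nobody has subscribed to this type of message
        }
        for (Consumer<String> handler : handlers) {
            handler.accept(body);
        }
        return handlers.size();
    }
}
```

The publisher never names the receiving component; it only names the type of message, which is what allows components to be added without touching existing ones.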


Remote Procedure Call (RPC)


Based on one program on a host calling part
of another program on another host.
Long-lived technology.
Relatively simple programming model.
Recent incarnation being XML-RPC.
Requires some description of the services
offered by a host

One of the most venerable technologies used for integration is that of Remote
Procedure Call (RPC). Here a program on one computer in a distributed system
calls part of another program on another computer. This is one of the first
technologies used for inter-computer communication and is over twenty years
old. It has the major advantage that it is relatively easy to program: all the
programmer has to do is call some code as if it were resident on their local
machine.
In order for RPC to function there is a need to define the interface offered by one
computer. This definition should specify the individual subroutines or methods
available, the arguments that need to be employed and the results that are
returned. In the latest version of RPC (XML-RPC) this is done via an
XML-defined language.


XML-RPC
Most modern version of RPC
Based on standard Internet protocols
Uses XML to define the interface between
entities using RPC.
Becoming a competitor to web services


XML-RPC is a technology that is quite young, around five years old. It enables
programs written in a variety of languages to communicate by sending messages
which invoke program code on a remote computer. It employs standard Internet
protocols such as HTTP.


An example
POST /rpchandler HTTP/1.1
User-agent: DarrelXMLRPC/1.1
Host: XMServer.book.com
Content-Type: text/xml
Content-Length: 238

<?xml version="1.0"?>
<methodCall>
  <methodName>getStaffNames</methodName>
  <params>
    <param>
      <value><string>Part-time</string></value>
    </param>
    <param>
      <value><string>WeeklyPaid</string></value>
    </param>
  </params>
</methodCall>

Here a standard POST command is used to send the payload to a remote
computer. The payload consists of XML-defined code that invokes the method
getStaffNames with two parameters, Part-time and WeeklyPaid, which are both
strings.


The XML-RPC processing cycle


Client makes the RPC call
Call is converted to XML and packaged in a POST
command
The XML is sent
An XML-RPC compliant server receives the code.
The server extracts out the payload, determines the code
that is to be executed and the argument used.
The code is executed and any results created.
The results are sent back to the client in XML form.
The client unpacks them and processes any data that has
been sent back.

A client program makes a software call which calls the XML-RPC library code. The program
specifies the name of the code to be executed, the arguments and the address of the server
which is XML-RPC compliant.

The XML-RPC software on the client packages up the request and converts it into the
HTTP/XML form detailed above and issues a POST request to the server specified in the first
step. The client stops at this point waiting for a response from the server to which the POST
has been sent.

The server receives the POST command, extracts the XML payload out and passes the
content to the XML-RPC software.

The XML-RPC software on the server parses the payload and determines what code is to be
executed and what the arguments are.

The code that has been identified in the previous step is then executed with the arguments
that have been identified.

The XML-RPC software on the server monitors the result of the execution of the code
detailed in the previous step and constructs an HTTP response which contains the XML that
represents the result of the execution.

The server then sends back the response to the client.

The client uses the XML-RPC software to unpack the response and extract out the result of
the remote execution at the server.

Finally the client will take the data returned from the server and restart its execution using
this data.
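
The packaging and unpacking steps of this cycle can be sketched with the XML parser that ships with Java. The sketch below makes several assumptions: the HTTP transport is left out entirely, so the "client" and "server" are just two methods in one class; only string arguments are handled; arguments are assumed to contain no XML special characters; and the class XmlRpcSketch and the result text produced by execute are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical sketch of the XML-RPC cycle with the HTTP transport omitted.
final class XmlRpcSketch {
    // Client side: package a call with string arguments as an XML payload.
    static String buildCall(String method, List<String> args) {
        StringBuilder xml = new StringBuilder("<?xml version=\"1.0\"?>");
        xml.append("<methodCall><methodName>").append(method).append("</methodName><params>");
        for (String arg : args) {
            xml.append("<param><value><string>").append(arg).append("</string></value></param>");
        }
        return xml.append("</params></methodCall>").toString();
    }

    // Server side: parse the payload, find the method and arguments,
    // execute the code and wrap the result as an XML response.
    static String handle(String payload) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(payload.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("malformed XML-RPC payload", e);
        }
        String method = doc.getElementsByTagName("methodName").item(0).getTextContent().trim();
        NodeList strings = doc.getElementsByTagName("string");
        List<String> args = new ArrayList<>();
        for (int i = 0; i < strings.getLength(); i++) {
            args.add(strings.item(i).getTextContent());
        }
        String result = execute(method, args);
        return "<?xml version=\"1.0\"?><methodResponse><params><param><value><string>"
            + result + "</string></value></param></params></methodResponse>";
    }

    // Stand-in for the code that the server would really run.
    static String execute(String method, List<String> args) {
        if ("getStaffNames".equals(method)) {
            return "staff matching " + String.join(", ", args);
        }
        return "unknown method " + method;
    }
}
```

In a real deployment an XML-RPC library would do this work on both sides and the payloads would travel in an HTTP POST request and its response.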


A response
<?xml version="1.0"?>
<methodResponse>
  <params>
    <param>
      <value>Out-of-Stock</value>
    </param>
  </params>
</methodResponse>


Here a response is shown. It is expressed in an XML form and shows that a single
data item having the value Out-of-Stock is returned. It is this value that will be
processed by the client.


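An example of some Java code for XML-RPC

import java.util.Vector;

// RpcClient is the client class supplied by the XML-RPC library in use
RpcClient rpClient = new RpcClient("http://101.125.100.45:4566");
// The arguments are passed to the remote method in order
Vector<String> arguments = new Vector<>();
arguments.addElement("Replenished");
// Invoke the method mesh of the class FindRouter on the remote server
Object result = rpClient.execute("FindRouter.mesh", arguments);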


This is some example code showing how a client might make an XML-RPC call.
Here the computer identified by the IP address 101.125.100.45 and port 4566 is
sent a command to execute the method mesh in the class FindRouter with the
arguments found in the Vector object named arguments.


BizTalk Server
Microsoft product
True integration server
Recent incarnation uses business rules and
business process language
Intended for a central hub role
Built on top of the .Net architecture

BizTalk Server is probably the most advanced integration server available. It
mainly integrates Microsoft technologies, although there are many add-ins for
other proprietary technologies and it provides the technological means for a
developer to hand code such capabilities.
It is intended for a central hub role. The latest version of the server employs a
business process language to carry out the orchestration of the various
components that make up an integrated system.


Features
Sets of adapters to manage the interaction between various
components employing a wide variety of protocols.
Contains a receive pipeline connected via an adapter to a
receive port
Contains a send pipeline which connects with the outside
world via an adapter and a send port.
At the heart of the server there is a message box containing
messages in transit.
A business rule repository carries out orchestration.


The BizTalk MessageBox is a repository for messages. Each message is associated with detail
such as where the message originates from.
Incoming messages arrive at a location known as a receive location. A listener is a component
that monitors the URL of a receive location and introduces the message into the BizTalk server.
Before the message is deposited in the BizTalk MessageBox it may traverse a receive pipeline.
This pipeline might contain processing elements that would transform a message in some way
before depositing it in the MessageBox. For example, the message might be encrypted and a
component of the pipeline would then be responsible for its decryption.
There are a number of send ports associated with a BizTalk server. These ports will have
subscribed to a particular message and will receive those messages that they have subscribed to.
A send pipeline is used to process messages before they are delivered to some component of the
integrated system. As with the receive pipeline the send pipeline will contain components that
could transform the message being sent out; for example adding digital certificate information.
An orchestration component will manage the process of messages being sent, transformed and
delivered. For example, this component might be used when a message is received from a
supplier saying that a particular quantity of a product has been delivered to a warehouse. It would
orchestrate the process of receiving the message, delivering the basic data of the delivery to a
database used by the accounting part of a company and telling any companies that
have ordered the item that it is now available.


Architecture schematic
[Figure: receive port, adapter and receive pipeline feeding the message box;
send pipeline, adapter and send port leading out of it; business rules,
configuration database and tracking database alongside.]


Here you see the architecture of BizTalk Server. Messages come in via receive
ports and are initially processed by an adapter. They can then be transformed by a
pipeline known as the Receive pipeline. They are then stored in a message box
implemented using SQL Server before being transformed and sent to other
components of the integrated system. This is via a send pipeline and an
associated adapter. The processes of transformation and routing are determined
by business rules which are stored centrally.


BizTalk tools

Editor
Mapper
Pipeline designer
Orchestration designer (uses Xlang)
BizTalk explorer

There are a large number of tools associated with BizTalk Server. The slide shows
five of the most important.
The editor defines the format of messages that will be processed by BizTalk and
allows the developer to check whether specific messages meet the specification.
The mapper is a tool which describes how one message can be mapped into
another message. The mapper employs code snippets known as functoids to do
this.
The pipeline designer allows the developer to specify the connections between
the individual transformation elements in a pipeline.
The orchestration designer is used to define orchestrations. It employs the
process language Xlang to do this.
BizTalk explorer is a tool which enables the developer to view entities such as
orchestrations, message formats and transformations.


Message-oriented middleware
Another major technology used for
connecting components in an integrated
system.
Simple APIs
Number of products on the market
Product with the most penetration is
WebSphere MQ

The other major technology employed for integration is message-oriented
middleware. It is effectively a set of queues onto which are deposited messages
and from which messages are removed. Because of this the API that is associated
with a MOM product is normally very simple and involves nothing more than
deposit, remove and query actions.
The best known product and one which has the most penetration is WebSphere
MQ.
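
To show how small such an API can be, here is a minimal sketch of a queue offering only deposit, remove and query actions. It is an in-memory illustration under stated assumptions: the class SimpleMessageQueue and its method names are hypothetical and are not the API of WebSphere MQ or any other MOM product, and messages are plain strings.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical in-memory queue; not the API of any real MOM product.
final class SimpleMessageQueue {
    private final Queue<String> messages = new ArrayDeque<>();

    // Deposit: a sending component places a message on the queue.
    synchronized void deposit(String message) {
        messages.add(message);
    }

    // Remove: a receiving component takes the oldest message, or null if none.
    synchronized String remove() {
        return messages.poll();
    }

    // Query: how many messages are currently waiting.
    synchronized int depth() {
        return messages.size();
    }
}
```

Because the sender only deposits and the receiver only removes, neither needs to know when, or in what language, the other is running.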


MOM schematic
[Figure: message queues sitting between two groups of components.]


The figure shows a number of queues communicating with a number of
components of an integrated system. Each queue is administered by a central
queue manager which makes decisions about what messages are to be placed on
what queue; for example, some messages might be regarded as high priority and
are deposited on a queue which has a higher read frequency than other queues.


MOM and integration


The major reason for the popularity
of MOM as an integration technology
lies in the fact that it is simple and that
it can connect disparate systems easily
because the APIs that are available
are written in a variety of languages

MOM is a very popular solution for integration because it is simple. A
programmer can be trained in using it in a couple of days. The other reason is that
good quality MOM products will have a variety of APIs written in a number of
languages for a number of platforms; hence, for example, an HR system written in
C++ for a Windows platform can interface easily with a salary system written for
Linux and developed using Java. It is the messaging aspect of the technology that
provides this feature of MOM.

666

WebSphere MQ
Hugely popular technology
Variety of platforms: z/OS, UNIX, Linux,
Windows
Important APIs supported include: Java, .NET, C,
COBOL, PL/1, JMS for Java, XMS for C/C++
Also a number of unsupported APIs including one
for the scripting language Perl.
Winner of two major product prizes in 2004.

WebSphere MQ is the most popular MOM technology. It is implemented on a
wide variety of platforms including all the UNIX variants and recent Windows
variants. It has major support for a wide variety of languages including legacy
languages such as COBOL and more modern languages such as Java and C++.
WebSphere MQ also has support for the Java messaging technology JMS. This is
an API that was intended as a common interface to messaging products such as
WebSphere MQ.

667

Major features of WebSphere MQ

Assured one-time delivery of messages


Non-time dependent architecture
Low-level transformations on data
Can be used for triggering and hence the
basis for event driven architectures


There are a number of features of the product which make it robust and scalable.
First, it guarantees that messages will be delivered once, and once only.
It will deliver messages even though a receiving application might not be
running. The message will be stored until the application restarts.
It provides a primitive means of transforming data as it passes from one
component of an integrated system to another. This is achieved through the use of
message data "exits". These are compiled applications which run on the queue
manager host; they are executed by the WebSphere MQ software at the time data
transformation is needed.
Finally, WebSphere MQ has the capability to trigger applications when
predefined messages arrive.
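
The behaviour described above (once-only delivery, storage while the receiver is not running, and triggering) can be imitated in a small Python sketch. It is not the WebSphere MQ API: the class StoreAndForwardQueue, its methods and the use of a message identifier to suppress duplicates are assumptions chosen to make the ideas concrete.

```python
class StoreAndForwardQueue:
    """Hypothetical queue: stores messages until a receiver is attached."""

    def __init__(self):
        self._pending = []       # messages waiting for the receiver
        self._seen_ids = set()   # identifiers already accepted
        self._receiver = None    # callable, or None when not running
        self._triggers = {}      # message type -> callable to start

    def on_message_type(self, msg_type, action):
        # Register an application to be triggered for a message type.
        self._triggers[msg_type] = action

    def put(self, msg_id, msg_type, body):
        # A repeated identifier is a duplicate send and is ignored,
        # so each message is delivered once and once only.
        if msg_id in self._seen_ids:
            return False
        self._seen_ids.add(msg_id)
        if msg_type in self._triggers:
            self._triggers[msg_type](body)
        self._pending.append(body)
        self._flush()
        return True

    def attach(self, receiver):
        # The receiving application has (re)started.
        self._receiver = receiver
        self._flush()

    def detach(self):
        # The receiving application has stopped.
        self._receiver = None

    def _flush(self):
        # Hand over stored messages in arrival order.
        while self._receiver is not None and self._pending:
            self._receiver(self._pending.pop(0))


received, triggered = [], []
queue = StoreAndForwardQueue()
queue.on_message_type("invoice", triggered.append)
accepted = queue.put(1, "invoice", "inv-1")    # receiver not running yet
duplicate = queue.put(1, "invoice", "inv-1")   # same identifier again
queue.attach(received.append)                  # stored message delivered
queue.put(2, "note", "n-2")                    # delivered immediately
```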

668

The Queue Manager


Handles storage, timing issues, triggering, and all
other functions not directly related to actual
movement of data
Connection is either a Bindings connection or a
client connection
Queue managers communicate with each other
using a program known as a channel
The Listener is a program that monitors incoming
queue actions

This slide details the various components of the key part of WebSphere MQ: the
queue manager. This is the central coordinating part of the system and carries out
all the important functions apart from those involved in the transport of data. It
maintains two types of connection: bindings connections and client connections.
The former are faster than the latter; however, the latter allow for a much more
robust design which can be maintained more easily.

669

Major commercial features

Assured delivery
Excellent development facilities
End to end security
Interface with web services
Ability to cluster
Time independent processing
Large number of systems that can be integrated
Support for growth

These are the main commercial claims for this messaging technology:
That it assures delivery using an exactly-once (at least once and at most once)
model.
That it has an excellent set of tools for producing systems using WebSphere
MQ, both tools for code development and tools for administering the queues.
That security is provided by SSL.
That it can now interface with Web services defined by SOAP.
That MQ processing can be distributed across a number of processors.
That it can be used to integrate a wide variety of systems written using a large
number of technologies.
That it can handle messages where the recipient is not executing, for example a
program resident on a portable computer which has gone out of WiFi contact.
That the simple messaging structure allows new applications to be easily added
to an existing system that uses WebSphere MQ.

670

Standards and Integration


One of the biggest drivers behind the
increasing popularity of integration
Standards fuelled mainly by XML
Standards for both systems level and
application level
Hub and spoke model normally employed


One of the main enablers for integration is the availability of standards. If the
components of an integrated system can communicate using shared semantics
then the work involved in integration will be substantially reduced.
Many of the standards that are available have been fuelled by XML. Standards
are either industry specific, such as 1SYNC, or system-level standards such as
SOAP or HTTP.
Often, certainly for the simpler standards, a hub and spoke model is used for
communication, as for example with 1SYNC.

671

1SYNC
American standard
Based on the codification of retail products.
Employs a hub and spoke model with the hub
containing a central database of product
information such as the name of the product, its
unique key and its dimensions.
Used by retailers when they are ordering products
from a supplier, for example to estimate
warehouse space.

1SYNC is a simple but highly effective standard for describing retail goods. A
central database keeps about 70 product attributes live and allows subscribers to
the 1SYNC system to add new products, modify products and be notified when
there is a change to a product.
This standard is unusual in that it does not just encompass a written description of
what product information should look like but also contains a technological
infrastructure which enables subscribers to access a database.
Such a database removes many of the transformation and customisation problems
that are associated with supply chain applications.

672

OpenDocument
A standard for memos, reports, spreadsheets etc,
any document that can be generated in an office.
Based on XML
Supported by OpenOffice and KOffice
In competition with Microsoft's Office Open
XML.
Recently Microsoft have announced that they will
develop a plug-in to save to ODF
Expected to become an ISO standard

This is a significant attempt to standardise office documents. It is in
competition with Microsoft's proprietary office suite of products.
The documents that form part of this standard are all defined using XML DTDs
and schemas. It is supported by both IBM and Sun, the latter via its OpenOffice
set of products.

673

FIXML
Standard for exchanging messages about
financial trading.
Mainly used for securities trading
Implements the FIX protocol which was
based on a delimited tag=value format
Some criticism of FIXML in that it is seen
as lengthening the life of a not too useful
protocol.

FIXML is a standard messaging format based on XML that describes the
messages that are generated when a security is traded; this requires a fixed but
well-defined set of steps between the buyer and the seller. There have been some
criticisms of this standard because it just implements quite an old protocol (FIX)
and does not take advantage of many of the features of XML. The developers
have effectively used XML as a layering technology which fits in with old
FIX-based systems.

674

FIXML


An important feature of FIX that differentiates it from other financial protocols is
that it is a connected, session-based protocol. The FIX protocol consists of two
layers, known as the Session Layer and the Application Layer. The Session Layer
contains all the session-related information; all the business-related information
such as quote and order information resides in the Application Layer.

675

WikiPing
Used to broadcast messages about changes
to a Wiki
Open standard
Idea based on the ping utility used by
Internet users
Provides information about the nature of a
change

WikiPing is a useful open standard defined in XML. A wiki is a collaborative
web site where users can create and edit items; probably the best known wiki is
Wikipedia, which is a global Internet encyclopedia. When a wiki changes its
content, data is sent to a subscriber. The data can include the following items:
tag (required): the name (or title) of the page which has been changed.
url (required): the URL assigned to that page.
wiki (required): the wiki on which the page is hosted.
interwikiname (optional)
history (optional): a URL pointing to the revision list (page history) of the page.
author (optional): the name of the editor who performed the change.
authorpage (optional): the wiki page of the author.
changelog (optional): some wikis provide a text field in which the author can drop
a short note that describes what kind of changes have been made.
language (optional): two-letter code (based on ISO 639-1) of the language used.
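
To make the list of items concrete, here is a minimal Python sketch that builds such a message and checks that the required items are present. The field names come from the list above; the root element name wikiping, the flat layout of child elements and the example URL are assumptions, not taken from the WikiPing definition.

```python
import xml.etree.ElementTree as ET

REQUIRED = ("tag", "url", "wiki")
OPTIONAL = ("interwikiname", "history", "author",
            "authorpage", "changelog", "language")


def build_ping(**fields):
    """Build a change notification as an XML string."""
    missing = [name for name in REQUIRED if name not in fields]
    if missing:
        raise ValueError("missing required items: " + ", ".join(missing))
    unknown = [name for name in fields if name not in REQUIRED + OPTIONAL]
    if unknown:
        raise ValueError("unknown items: " + ", ".join(unknown))
    root = ET.Element("wikiping")  # assumed root element name
    for name in REQUIRED + OPTIONAL:
        if name in fields:
            ET.SubElement(root, name).text = fields[name]
    return ET.tostring(root, encoding="unicode")


ping = build_ping(
    tag="HomePage",
    url="http://wiki.example.org/HomePage",  # illustrative URL
    wiki="ExampleWiki",
    language="en",
)
parsed = ET.fromstring(ping)
```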

676

SMIL
Defined yet again using XML.
Synchronised Multimedia Integration
Language
Defines the properties of a multimedia
entity such as a video clip
Open standard defined by the W3C
MMS based on SMIL for handheld devices

This again is a standard defined in XML. It is used for defining multimedia
presentations. It defines characteristics such as timing, layout, animations and
the transitions between elements of the presentation. The latest version (2.0)
allows SMIL code to be integrated with other descriptive languages such as SVG.
SMIL source looks very much like HTML.

677

RSS a reprise

Dealt with in the Web 2.0 lectures


Vital technology for integration
Used as data glue
Number of standards converging on Atom
Given rise to two user mashing technologies:
Yahoo Pipes and Scratch.

678

Yahoo Pipes
Mashing technology
Relies on a simple graphic programming
language
Combines RSS feeds
Major debate about its power
First steps towards more powerful facilities

679

An example


This is an example of a very simple pipe. All it does is to fetch an RSS feed and
replace items within the feed with other items.

680

Scratch

Children's programming language


Developed at MIT
Ability to integrate data sources
Being touted as a good model for mashing
which avoids problems with Yahoo Pipes.
Based on Smalltalk

681

Lecture 19

Integration (iii)


682

Aims
To outline some of the roles of an
integration server
To examine BizTalk in more detail
To introduce the concept of business
process analysis
To introduce ebXML


683

Integration services
Message-based
Employs standard-based messages, pseudo
standards and application specific standards
However, these standards are usually
hidden by an extra system layer.
Employed in a hub and spoke architecture


Integration services are almost invariably message based. They employ a host of
protocols ranging from standards-based protocols such as HTTP to pseudo
standards such as MQSeries and application standards such as BAPI used for
SAP. A good integration server should hide all the details of these protocols and
enable the developer to employ tools which do not require knowledge of the
protocols. Integration servers are employed in a hub and spoke architecture where
they marshal, coordinate and orchestrate the various components of the system.

684

The main components of an integration server (i)

Transformation
Schema conversion
Data conversion
Routing


The slide above details the first four functions of an integration server.
Transformation involves the transformation of data that passes through the hub
that is implemented by the server, for example removing a first name and
replacing it with an initial. Schema conversion involves the transformation of a
database schema into a form in which it can be used by another component of the
integrated system. Data conversion is the process of carrying out some
transformation so that data which is in one form can be processed in another form
by another component of the integrated system, for example replacing a dollar
amount by a pound or euro amount. Finally, routing involves determining the
destinations of some message or business transaction and sending it.
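
The four functions can be pictured as small steps applied in turn to a message passing through the hub. The Python below is a sketch under stated assumptions: the message layout, the field names, the routing table and the exchange rate of 0.5 are all invented for illustration.

```python
USD_TO_GBP = 0.5  # illustrative rate only, not a real exchange rate


def transform(message):
    # Transformation: replace the first name with an initial.
    out = dict(message)
    out["first_name"] = message["first_name"][:1] + "."
    return out


def convert_schema(message, mapping):
    # Schema conversion: rename fields to the names the target expects.
    return {mapping.get(key, key): value for key, value in message.items()}


def convert_data(message, rate=USD_TO_GBP):
    # Data conversion: replace a dollar amount by a pound amount.
    out = dict(message)
    out["amount"] = round(message["amount"] * rate, 2)
    out["currency"] = "GBP"
    return out


def route(message, routes):
    # Routing: determine the destination from the message type.
    return routes[message["type"]]


def hub(message, mapping, routes):
    # Decide the destination first, then prepare the payload for it.
    destination = route(message, routes)
    payload = convert_schema(convert_data(transform(message)), mapping)
    return destination, payload


incoming = {"type": "salary", "first_name": "Darrel", "surname": "Ince",
            "amount": 100.0, "currency": "USD"}
destination, payload = hub(
    incoming,
    mapping={"surname": "family_name"},
    routes={"salary": "payroll-system"},
)
```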

685

The main components of an integration server (ii)

Rules processing
Message warehousing
Repository maintenance
Directory services


The remaining components of an integration server are shown above. Rules
processing involves the processing of the business rules that determine how a
particular business process is to be carried out. Message warehousing involves
the storage of messages and transactions that take part in some composite
business process; this involves the storage of state which might last for a long
time, say days or even months. Repository maintenance involves the maintenance
of a central database which describes the interaction between two companies. A
directory service is a service which provides details of the physical entities which
make up a series of shared or composite business processes, for example the
servers that take part.

686

Architecture

Business rules
Business processes

Multiple application connectivity


Two applications connectivity


This shows a typical four-layer architecture. At the bottom is functionality that
allows two applications to connect together. The next level draws on this level in
order to implement connectivity between multiple applications. The third level
contains the business process definitions which drive the server; these are usually
written in some simple programming language-like notation. Finally there are the
business rules that determine the detailed functioning of the integrated
application. It is these business rules that describe the more dynamic aspects of
the business processes.

687

Major problems that have to be solved

State management
Transaction management
Correlation
Security


There are a number of problems that a good integration server needs to solve.
The first is state management. State is usually shared between the various parts of
a business process. This is analogous to the problem of maintaining state within a
web server. This is usually solved by having a centralised state container.
Transaction management is also a problem: a transaction which may
involve a number of concurrent updates needs to complete without problems
such as the lost update occurring. Normally transaction management is solved by
employing techniques that are used by application servers and which were
discussed earlier in this lecture course.
A third problem is correlation. This problem arises when a number of instances of
a business process are in execution and a message arrives for one of them. Which
instance should the message be delivered to? There are two solutions: the first is
to use unique identifiers on messages; a second approach associates a message
with a unique business process instance.
Security is the final problem and is handled via standard security techniques such
as employing SSL.
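
The first solution to the correlation problem, unique identifiers on messages, can be sketched as follows. The names Correlator, ProcessInstance and the order_id field are hypothetical; a real integration server would hide this bookkeeping from the developer.

```python
class ProcessInstance:
    """One running instance of a business process."""

    def __init__(self, order_id):
        self.order_id = order_id
        self.inbox = []


class Correlator:
    """Hypothetical lookup from a unique identifier to an instance."""

    def __init__(self):
        self._instances = {}

    def start(self, order_id):
        # Create a new process instance keyed on the identifier.
        instance = ProcessInstance(order_id)
        self._instances[order_id] = instance
        return instance

    def deliver(self, message):
        # Use the identifier carried by the message to find its instance.
        instance = self._instances.get(message["order_id"])
        if instance is None:
            raise LookupError("no process instance for " + message["order_id"])
        instance.inbox.append(message)
        return instance


correlator = Correlator()
first = correlator.start("order-1")
second = correlator.start("order-2")
target = correlator.deliver({"order_id": "order-2", "body": "payment received"})
```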

688

BizTalk (revision)
Main Microsoft product
Based on messaging
Central message store known as the BizTalk
MessageBox
The heart of this server is an SQL Server
database
Heavy use of XML

Probably the best known integration server is BizTalk Server. This is quite a
mature Microsoft product which is based on messaging, where messages are
stored in an SQL Server database known as the MessageBox.
XML plays a very important part in this technology in that all messages are
converted to an XML format. This server is normally employed in a hub and
spoke architecture.

689

The components of BizTalk

Incoming message
Receive locations
Receive port
Receive pipeline
MessageBox
Send pipeline
Send port
Outgoing message

Direction of
travel

690

This slide shows the various components of BizTalk Server. A message is
received at a receive location. This is then processed in a receive pipeline to
transform it into a form in which it can be processed, for example an encrypted
message may need to be decrypted. Messages are then stored in the MessageBox
until a recipient is ready for them. Applications subscribe to types of messages.
The MessageBox software carries out the matching of subscribed components
with the messages that it stores. The orchestration, transformation and translation
aspect of the server is handled by means of a business process language
description. This type of language is dealt with in more detail in later parts of this
lecture.
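
The flow just described (receive pipeline, storage, then matching against subscriptions) can be mimicked in a few lines of Python. This is a toy model, not the BizTalk API: the class below only borrows the MessageBox name, and the two pipeline stages (decoding bytes and stripping whitespace) stand in for real stages such as decryption.

```python
from collections import defaultdict


class MessageBox:
    """Toy publish and subscribe store keyed on message type."""

    def __init__(self):
        self._stored = defaultdict(list)       # type -> waiting messages
        self._subscribers = defaultdict(list)  # type -> handlers

    def receive(self, raw, msg_type, pipeline):
        # Receive pipeline: apply each stage in order, then publish.
        message = raw
        for stage in pipeline:
            message = stage(message)
        self.publish(msg_type, message)

    def publish(self, msg_type, message):
        handlers = self._subscribers.get(msg_type)
        if handlers:
            for handler in handlers:
                handler(message)
        else:
            # No recipient is ready yet, so keep the message.
            self._stored[msg_type].append(message)

    def subscribe(self, msg_type, handler):
        self._subscribers[msg_type].append(handler)
        # Hand over anything stored before the subscription existed.
        for message in self._stored.pop(msg_type, []):
            handler(message)


pipeline = [lambda raw: raw.decode("utf-8"), str.strip]
box = MessageBox()
orders_seen = []
box.receive(b"  order 17 \n", "order", pipeline)  # stored: no subscriber yet
box.subscribe("order", orders_seen.append)        # stored message delivered
box.receive(b"order 18", "order", pipeline)       # delivered straight away
```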

690

An example of a business process


get a customer's itinerary
repeat
get a flight from last arrival place
get a hotel at destination
until the final destination is achieved
calculate the cost
bill the customer


This is a very simple example of a business process specification. It looks a little
like a programming language fragment as it contains a repeat..until loop.
However, it is distinguished from program code by the fact that it is expressed
both at a high level and in non-technical terms. Most business process languages
are either based on this sort of idea or are graphical in nature.
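
For comparison, the same process can be written as program code. The Python below is one possible rendering, with the repeat..until loop expressed as a loop that tests its exit condition at the end. The function name, the shape of the itinerary (a list of places starting with the origin) and the two pricing functions are assumptions for illustration.

```python
def book_trip(customer, itinerary, flight_price, hotel_price):
    """Book a flight and hotel for each leg, then bill the customer."""
    position = itinerary[0]           # get a customer's itinerary
    remaining = list(itinerary[1:])
    bookings = []
    while True:                       # repeat
        destination = remaining.pop(0)
        # get a flight from last arrival place
        bookings.append(("flight", position, destination,
                         flight_price(position, destination)))
        # get a hotel at destination
        bookings.append(("hotel", destination, hotel_price(destination)))
        position = destination
        if not remaining:             # until the final destination is achieved
            break
    total = sum(booking[-1] for booking in bookings)  # calculate the cost
    return {"customer": customer, "amount": total,    # bill the customer
            "bookings": bookings}


invoice = book_trip(
    "A. Customer",
    ["London", "Paris", "Rome"],
    flight_price=lambda origin, destination: 100,
    hotel_price=lambda place: 50,
)
```

Note how much detail the code has to fix (data structures, prices, the exact exit test) that the business process description leaves open.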

691

Some vocabulary
Process definition: the definition of the whole of a
business process
Process instance: one instantiation of a process for
some specified data
Activity: a step in a process, for example check
credit card
Automated activity: an activity carried out by
some computer
Manual activity: an activity carried out by some
human operator

692

Benefits of business process modelling (Havey)

Formalisation of existing processes


Leads to automated process flow
Increases productivity
Allows people to solve difficult problems
Enables compliance with regulatory
requirements

Havey, in his excellent book Essential Business Process Modeling, has described
a number of reasons for carrying out business process modelling:
It enables an enterprise to formalise its processes; this means, for example, that
process descriptions can be given to new staff who can execute these processes
with little training.
It can lead to the automation of a number of the activities; this is the rationale
detailed in this lecture.
There have been a number of studies which show major savings in both staff
numbers and response time when an effective process modelling exercise has
been carried out.
Delegating easy processes to computers means that human operators can
concentrate on those processes which require deeper skills.
It enables a company to easily discover whether it is meeting external
regulations and, moreover, it enables that company to demonstrate to auditing
staff that it is compliant.

693

The rationale and connection with this course
A good BPM architecture...is as elegant an
enterprise architecture as can be. Many
applications fail because they lack an
intelligible initial design; BPM can prevent
this.
Havey, 2005

This quote represents the rationale for including business process modelling
(BPM) within this part of the course. Many integrated applications
have suffered from problems that occurred because the integration was done in a
piecemeal way with little, if any, consideration of the overall enterprise
architecture.

694

Standards for BPM

BPEL
BPML
Web services choreography
BPM
BPSS


There are a number of standards for business process modelling; the most
important are shown above.

695

BPEL

Also known as BPEL4WS


Derived by the OASIS group
Strongest backing
Favourite to win any standards war
This lecture concentrates on this standard

Introduction to BPEL

BPEL is far and away the most popular standard as it has the backing of four very
powerful organisations: BEA, Microsoft, IBM and Oracle. It also has the
advantage that it is closely allied to web service technology; for example, it is
relatively easy to turn a process specification expressed in BPEL into XML code
which describes a web service that reflects the process. You will sometimes see
references to BPEL4WS and BPELJ. The former is an alternative acronym while
the latter is an extension of BPEL that provides a smooth progression to the Java
code that implements a process.

696

BPML
Business Process Modelling Language
XML based definitions moderately similar
to BPEL
Allied with a graphical modelling notation
BPMN
Can be mapped to BPEL


This is a notation that emerged from the Business Process Management Initiative
organisation. It, like BPEL, is based on XML and includes an interface with a
very sophisticated modelling notation known as the Business Process Modelling
Notation (BPMN). In standards terms it lags well behind BPEL.

697

Choreography
Web-service oriented
Describes how web services should work
with each other for multiple participants
Can be used for large-scale specification
Language WS-CDL


Choreography is a modelling notation which concentrates on the interaction of
processes in the presence of multiple participants: essentially a form of
concurrent processing. It is mainly targeted at web services. It is associated
with a notation known as WS-CDL (Web Services Choreography Description
Language). It is a language whose use lags behind that of BPEL even though it
addresses issues where BPEL is weak.

698

WfMC BPM Reference model


Reference model
Based on a central enactment service
The service executes processes designed
using a process design tool.
Based on XML
Not much evidence of penetration

This is a reference model that has been developed by the Workflow Management
Coalition. It is based on a central enactment service which interfaces with other
components: administration and monitoring tools, workflow client applications,
normal applications, process definition tools and other workflow enactment
services.

699

Components of a good design (i)

Some form of choreography infrastructure


Exporter to map notations
Human work list application
Internal code
Runtime engine
Administration and monitoring console
Graphical editor
Code generators

The list above, taken from Havey, shows the main components of a good BPM
architecture. There should be some infrastructure which choreographs the various
components of the process; you should be able to export and import a number of
notations; there should be some tool which creates human work lists for tasks that
are not automatable; there should be the internal code which is executed when a
process is executed, a runtime engine which carries out the coordination of tasks,
a console which allows staff to monitor the business processes that are being
executed, a graphical editor which creates business process descriptions and code
generators which convert business process definitions into working program
code.

700

ebXML purpose
Describe business process and specific interfaces?
Sharing of business process with other
enterprises?
Discovering which business processes a co-company supports?
Description of the business messages for a
particular transaction?
Description of the security policy and technical
configuration employed to implement business
processes?

ebXML is a popular notation used for business process description. The slide
above describes its main functions.

701

The ebXML registry


Contains
XML schemas
specifications of business process
ebXML Core Components
any UML models
co-company information
software components

The core of any ebXML implementation is the ebXML registry. The registry
information model that is used is hierarchic and is organised on a sector basis
(see next slide). When co-companies interact with each other it is this registry
that is accessed: it forms the main interface between business entities.

702

Hierarchical organisation of the ebXML registry
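[Figure: the registry hierarchy, showing Industry at the top, then the sectors
Telecomms, Retail and IT, then European retail and US retail, and finally the
entity ASDA.]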

This shows the hierarchical arrangement of the registry. The first three levels
represent what are known as classifications and contain common entities that are
associated with the classification, for example generic business processes
associated with the retail sector. At the bottom of the hierarchy there are actual
entities; only one is shown in the slide, that of ASDA.
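
One way to picture the hierarchy is as a tree in which an entity inherits the common items of every classification above it. The Python below is a sketch only: the process names are invented, and placing ASDA under European retail is an assumption about the slide rather than something the notes state.

```python
# Each node records its parent and the business processes registered there.
REGISTRY = {
    "Industry": {"parent": None, "processes": ["invoicing"]},
    "Retail": {"parent": "Industry", "processes": ["stock replenishment"]},
    "European retail": {"parent": "Retail", "processes": ["VAT reporting"]},
    "ASDA": {"parent": "European retail", "processes": ["store delivery"]},
}


def processes_for(name):
    """Collect processes from the named node up to the root, root first."""
    collected = []
    while name is not None:
        node = REGISTRY[name]
        collected = node["processes"] + collected
        name = node["parent"]
    return collected


asda_processes = processes_for("ASDA")
retail_processes = processes_for("Retail")
```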

703

The process of connecting two enterprises together

Implementation phase
Discovery and retrieval phase
Run time phase


When two companies want to develop some trading arrangement using ebXML
there are three phases that they have to carry out. The first is the implementation
phase. In this the partners will analyse their business processes and publish them
to the registry. After this an ebXML implementation must be carried out with the
business processes between the co-companies being linked together via some
ebXML framework.
The next phase occurs when the co-companies start working together. Each
company will discover business process information from the other company as a
prelude to the actual transfer of business data and the execution of the code that
implements the business processes.
The final phase is where business transactions and messages are exchanged
between the two companies, which together carry out some composite business
process.

704

Lecture 20

Integration (iv)


705

Aims
Introduces BPEL
Looks at BPEL as a business process
modelling language
Describes the elements of BPEL
Examines the interface between BPEL and
web services
Examines two final standards-based case
studies.

706

BPEL History

Sometimes called BPEL4WS


XML based rather than notational
Written by IBM, Microsoft and BEA
Handed over to OASIS organisation for
standardisation
Arose from an initiative known as the BPM
initiative

BPEL is now the most popular business process language. It was developed by
BEA, IBM and Microsoft before being handed over to the OASIS organisation
for standardisation. Standards are either XML-based or notational. When a
standard is notational it is based on a notation, usually graphical, which is
non-XML, for example a notation based on the software engineering notation UML.
BPEL is firmly based on XML; however, there are a number of convertors
available which can transform BPEL into other notations.

707

An example of a BPEL architecture


[Figure: BPEL and WSDL definitions, a process engine and a purchaser. The steps
shown are Receive order, Invoke purchase, Invoke sending and Invoke
confirmation.]


The diagram above shows the architecture of a system which receives orders
from a customer for some product which is then shipped to the customer, for
example an online store. The BPEL part of the architecture contains details of the
business processes, while the Web Services Description Language (WSDL) part
details how the process is to interact with a web service implementation. The
diagram shows one interaction, that with the purchaser; other interactions, for
example that with a shipping company, are not shown.

708

The two main components

WSDL, used for specifying the interface
between a set of business processes and
their web service implementation
BPEL, the language used to describe the
business processes involved


There are two main components to a BPEL implementation. There is text
expressed using the Web Services Description Language WSDL. This, for example,
will describe the port types, operations and messages that are used when a web
service is invoked. The second component is the BPEL text which describes
business processes; this will contain data that specifies the steps in a business
process, what is to happen when a fault occurs and the variables used to hold
information such as the identity of a customer.

709

Some BPEL objects (i)

Process
Variable
Partner link
Compensation handler


BPEL contains a number of different objects; this slide and the following slide
detail some of them.
A process is the specification of a business process.
A variable is akin to a programming language variable in that it holds data that
is used by a particular business process, for example a variable might hold the
identity of a product that is ordered by a customer from an online store.
A partner link is a specification of the role that a process has, for example a
process might be a buyer of goods and also resell those goods on to another
entity.
A compensation handler is a specification of cancellation logic which is invoked
if a particular transaction needs to be returned to a previous state, for example by
a customer cancelling part of a shopping cart.

710

Some BPEL objects (ii)

Correlation set
Receive
Invoke
Sequence
While


A correlation set is a collection of properties that is used to synchronise any
messages received with the internal state of a process.
A receive is an activity that is invoked when a process receives some message
expressed in the SOAP message language.
Invoke is the opposite of receive in that it uses SOAP to invoke a web service
which forms part of a set of services exposed by a trading partner.
Sequence is a sort of programming construct which runs a set of activities
sequentially.
While is equivalent to the conditional looping found in a conventional
programming language.

711

Some examples of BPEL: partner linking
<partnerLinks>
  <partnerLink name="client"
               partnerLinkType="Customer"
               myRole="customerSupplier"
  />
  ...
</partnerLinks>


Here is the first example of some BPEL. Notice that it is expressed in an XML
format. Here it defines what the role of a partner company is: that of a customer,
and the role that the receiver has: that of the provider of some goods that are
required by the customer.

712

Some examples of BPEL: correlation set

<correlationSets>
  <correlationSet name="AccSet"
                  properties="idprop amountprop"/>
</correlationSets>


This piece of BPEL defines a correlation set named AccSet. Any number of
correlation sets can be specified within such a nested element. The correlation set
has properties (variables) idprop and amountprop. The correlation set can then be
used as the data interface between a receive operation and an invoke operation;
the former receives a request for a service while the latter invokes a service from
another provider.

713

Some examples of BPEL: receiving

<receive ...
    variable="accountDetails" ...>
  <correlations>
    <correlation set="AccSet"
                 initiate="yes"/>
  </correlations>
</receive>


Here the correlation set is assigned the values of the variables associated with the
message that has been received from some partner. The initiate attribute is set to
yes to indicate that the step is the first in a business process.

714

Some examples of BPEL: invoke

<invoke ...
    inputVariable="accountDetails" ...>
  <correlations>
    <correlation set="AccSet"
                 ... />
  </correlations>
</invoke>


Here the process invokes another service using the data held in the correlation
set. This sends data back to the entity that started the receive detailed on the
previous slide.

715

Some examples of BPEL: partner

<partner name="contractor">
  <partnerLink name="consulter"/>
  <partnerLink name="builder"/>
</partner>


Here a partner is defined. This is some enterprise that a user of BPEL interacts
with. The example shows that this partner has two roles: that of a consultancy
company and also a builder.

716

Some examples of BPEL: partner link types
<partnerLinkType name="Reseller">
  <role name="receiver">
    <portType name="receiverPort"/>
  </role>
  <role name="sellerOn">
    <portType name="sellerOnPort"/>
  </role>
  ...
</partnerLinkType>

This BPEL defines the fact that there is a partner link type known as a Reseller
and that there are two roles: that of a receiver and that of someone who receives
goods and then sells them on. The code also maps a role to the web services port
that is involved in the reseller interactions.
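
The mapping from roles to ports can be read straight out of such a definition. The following Python sketch assumes a well-formed version of the fragment with the elided part left out; the helper name role_ports is hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed rendering of the slide's fragment, with the elided
# content omitted. The quoted attribute values are an assumption.
PARTNER_LINK_TYPE = """
<partnerLinkType name="Reseller">
  <role name="receiver">
    <portType name="receiverPort"/>
  </role>
  <role name="sellerOn">
    <portType name="sellerOnPort"/>
  </role>
</partnerLinkType>
"""


def role_ports(xml_text):
    """Return a dictionary mapping each role name to its port type name."""
    root = ET.fromstring(xml_text)
    return {
        role.get("name"): role.find("portType").get("name")
        for role in root.findall("role")
    }


print(role_ports(PARTNER_LINK_TYPE))
# {'receiver': 'receiverPort', 'sellerOn': 'sellerOnPort'}
```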

Some examples of BPEL: exception (fault handling)

<scope name="fault1">
  <faultHandlers>
    <catch faultName="invalidProduct">
      ...
    </catch>
    <catch faultName="invalidCustomer">
      ...
    </catch>
    ...
  </faultHandlers>
</scope>

BPEL also contains a facility for error processing. The code above shows the
code that catches a number of errors, for example the fault that is generated when
an invalid product code is encountered. Each ellipsis inside a catch indicates
the position of the code that is executed when the fault is found. Faults are
often generated by means of the throw statement; an example of this is
<throw faultName="invalidCustomer"/>
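
For readers more used to a conventional language, the scope, catch and throw constructs behave much like try and except. Here is a rough Python analogy; the names BpelFault, run_scope and check_customer and the handler results are invented for illustration and are not part of BPEL.

```python
class BpelFault(Exception):
    """Stands in for a named BPEL fault such as invalidCustomer."""

    def __init__(self, fault_name):
        super().__init__(fault_name)
        self.fault_name = fault_name


def run_scope(activity, handlers):
    """Run an activity; on a fault, dispatch to the handler with that name."""
    try:
        return activity()
    except BpelFault as fault:
        handler = handlers.get(fault.fault_name)
        if handler is None:
            raise  # no matching catch: the fault leaves this scope
        return handler(fault)


def check_customer():
    # Equivalent in spirit to a throw of the invalidCustomer fault.
    raise BpelFault("invalidCustomer")


handlers = {
    "invalidProduct": lambda fault: "product rejected",
    "invalidCustomer": lambda fault: "customer rejected",
}

result = run_scope(check_customer, handlers)
print(result)  # customer rejected
```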

Some examples of BPEL: assign


<variable name="loopCounter"
    type="xsd:integer"/>
...
<assign>
  <copy>
    <from expression="22"/>
    <to variable="loopCounter"/>
  </copy>
</assign>

This is an example of the type of programming that can be carried out in BPEL.
Here the variable loopCounter is set to 22.

Some examples of BPEL: while


<while condition = ...>
<sequence>
...
</sequence>
</while>

This just shows the skeleton for a while loop. This is the only looping structure
currently found in BPEL. This fact makes it quite a difficult language to program
in, as for loops can be really difficult to simulate.
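
One way of simulating a for loop with only a while is to keep a counter variable that is set by an assign before the loop and updated by another assign at the end of the loop body. The Python sketch below mirrors that pattern; it is an analogy rather than BPEL, and the function name bpel_style_for is hypothetical.

```python
def bpel_style_for(start, stop, body):
    """Emulate 'for i in range(start, stop)' using only a while loop."""
    variables = {"loopCounter": start}        # assign before the loop
    while variables["loopCounter"] < stop:    # the while condition
        body(variables["loopCounter"])        # activities in the sequence
        variables["loopCounter"] += 1         # assign at the end of the body
    return variables["loopCounter"]


visited = []
final = bpel_style_for(0, 3, visited.append)
print(visited, final)  # [0, 1, 2] 3
```

All of the bookkeeping (initialising, testing and incrementing the counter) has to be written out by hand, which is where the extra effort comes from.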

BPEL summary
XML-based business process language
In standardisation
Whole implementation involves WSDL specification together with the BPEL specification
Java version of BPEL known as BPELJ
BPELJ can either be Java with XML embedded or XML with Java embedded

BPEL is the currently popular business process definition language; it is
undergoing the standardisation process via the OASIS group. A full specification
involves both BPEL and WSDL, the Web Services Description Language. This
means that business processes can be directly hooked up to a set of web services
which implement the business processes. As well as there being an XML version
of BPEL there is also a specified version of BPEL which is Java-oriented. This is
in its early days with only two rudimentary proprietary implementations
available.

Two example standards


Slides have nothing to do with BPEL, just a convenient point to talk about standards which involve business processes.
Standards enable integration by producing interfaces which can be accessed without recourse to hack ups including the use of scripting languages
Most standards are now XML based
Two standards that I look at are RosettaNet and eTOM.

The final part of this lecture is devoted to two standards which have links to
business process specification. Standards are vitally important in systems
integration because they enable components to be integrated without any special
processing, for example by employing Perl or Python. The two standards that I
examine in this part of the course are eTOM and RosettaNet. The former is a
standard for the telecoms sector while the latter is a standard for supply chain
integration within the high-tech sector.

eTOM
Standard gaining major acceptance in
telecoms industry
Split into three areas:
Strategy, infrastructure and product
Operations
Enterprise management

eTOM is the prime business process standard for the telecommunications
industry. It is one of the most complete standards in existence and handles
virtually everything that any large telecoms company would want to do, ranging
from marketing to operations. It is organization, technology and service
independent. Business processes are decomposed hierarchically so that, for
example, the business process recruitment might be split up into sub-processes
dealing with activities such as situation advertising, short-listing and
interviewing.
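
Hierarchical decomposition of this kind is naturally represented as a tree. The following Python sketch models the recruitment example from the notes; the Process class and its leaves method are illustrative assumptions and are not part of the eTOM standard.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """A business process that may be decomposed into sub-processes."""

    name: str
    children: list = field(default_factory=list)

    def leaves(self):
        """Return the names of the lowest-level processes under this one."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


recruitment = Process("recruitment", [
    Process("situation advertising"),
    Process("short-listing"),
    Process("interviewing"),
])

print(recruitment.leaves())
# ['situation advertising', 'short-listing', 'interviewing']
```

Each sub-process could itself be given children, which is how the decomposition carries on down through further levels.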

eTOM Strategy, Infrastructure and Product

Marketing and offer management
Service development and management
Resource development and management
Supply chain management and development

This part of the standard deals with the management of selling and marketing; for
example, it would deal with the development of product brochures. It deals with
service development and management, for example the development of a helpdesk
service. It deals with resource development and management, for example
the deployment of teams to produce some telecoms product. It deals with supply
chain management, for example the sourcing of components for some telecoms
hardware.

eTOM Operations

Customer relationship management
Service management and operations
Resource management and operations
Supplier/partner relations management

This slide shows the parts of the eTOM standard which deal with operations. It
deals with customer relationship management, for example the process of
keeping product technical information up to date. It deals with service
management and operations, for example running a hardware service department.
It deals with resource management and operations, for example the allocation of
staff to one-off projects. It deals with supplier/partner management, for example
the processing of defect data and the ascribing of defects to bought-in items.

eTOM Enterprise management

Strategic and enterprise planning
Enterprise risk management
Enterprise effectiveness management
Knowledge and research management
Financial and asset management
Stakeholder and external relations management
Human resource management

This details the items which come under the eTOM enterprise management
banner. For example, strategic and enterprise planning would cover
the process of determining new markets or new products. Knowledge and
research management would cover tasks such as the development of horizon
documents which detail important blue sky areas to be investigated. Human
resource management would cover activities such as employee customer
professional updating programmes.

eTOM strengths
Covers both technical and non-technical areas
Is the only player in the sector
Enables business requirements analysis and specification to be closely linked with technical processes
It handles the problem of different processes operating on different times and life-cycles
It integrates technical areas: application, computing and network

These are some of the strengths of eTOM. The major strength of this standard is
the fact that it covers all the activities that a telecoms business needs to carry
out. For example, it integrates business processes such as marketing with product
development, and it integrates technical areas which have in the past existed as
isolated islands in telecoms providers: it integrates the process of
software development with network development and hardware development.

RosettaNet
Developed to standardise e-business in the high-tech industries
Was developed surprisingly easily
Was developed as a response to the failure of Electronic Data Interchange (EDI) in the high-tech sector
XML based
Relies on PIPs (Partner Interface Processes)

RosettaNet is a standard that is employed in the high-tech industry, for example
by computer manufacturers. Since this industry uses the same suppliers, for
example memory suppliers, it was very easy to get a large amount of agreement
on this standard. The sector had suffered from a number of failures in using EDI
(Electronic Data Interchange); it was found to be expensive, proprietary and
not flexible enough for B2B processes. By basing the standard on XML many of
these disadvantages were overcome.

RosettaNet Processes
Business process modelling
Business process analysis
PIP
Dictionaries
Implementation framework

The process of delivering a PIP is shown on the slide. The first process is that
of understanding the business model used and the various processes that make up
the model. This is shown as the top two boxes.
What is also needed is the development of a set of dictionaries that contain both
technical and business properties. The former contains the technical details of a
product, for example the details of a computer processor. The latter describes
business data relevant to partner companies.
The implementation framework consists of documents, for example XML DTDs,
which define the formats used to exchange data and documents.
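
The role of the dictionaries can be pictured as a vocabulary check: every term used in an exchanged document should be defined in either the technical or the business dictionary. The Python sketch below is purely illustrative; the dictionary entries, field names and the function undefined_terms are invented and are not taken from the RosettaNet dictionaries themselves.

```python
# Hypothetical dictionary contents, invented for this example.
TECHNICAL_DICTIONARY = {"processorSpeed", "cacheSize"}
BUSINESS_DICTIONARY = {"partnerId", "orderNumber"}


def undefined_terms(document):
    """Return the field names of a document that neither dictionary defines."""
    known = TECHNICAL_DICTIONARY | BUSINESS_DICTIONARY
    return sorted(set(document) - known)


order = {"partnerId": "P-001", "processorSpeed": "3.2GHz", "colour": "grey"}
print(undefined_terms(order))  # ['colour']
```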

A PIP

XML documents
Class and sequence diagrams
A validation tool
An implementation guide

A PIP is a common business/data model which enables system developers to
produce the code that integrates a company's operations with those of partner
companies such as delivery companies and suppliers.

PIP clusters (An example)


Inventory management
Collaborative forecasting
Inventory allocation
Inventory reporting
Inventory replenishment
Sales reporting
...

PIPs are partitioned into a number of different clusters, for example there are
clusters for partner, product and service review; for product introduction; for
order management; for inventory management; for marketing information
management; and for service and support. The slide shows the groupings within
the inventory management cluster, probably the simplest of the clusters.
