

Prof. Reuven Aviv, Tel Hai Academic College, Department of Computer Science
Topics in Data Communication

World Wide Web

Acknowledgements for slides: A. Tanenbaum, Computer Networks

The World Wide Web
  Architectural Overview
  Static Web Documents
  Dynamic Web Documents
  HTTP: The HyperText Transfer Protocol
  Content Delivery Networks


Architectural Overview

Client with a Web browser program
Server with a Web server and pages (HTML)
Other servers with Web servers and pages
Links between pages


Browser operation when the user clicks on a link
1. The browser (B) picks the URL from the clicked link.
2. B gets the IP address of the Web server from DNS.
3. B opens a TCP connection to (IP, port 80).
4. B sends a request for the page (HTTP message).
5. The Web server (W.S.) sends the linked page (HTTP message). The page is written in HTML.
6. B closes the TCP connection.
7. B interprets the HTML and displays the page to the user.
8. B fetches and presents the images linked from the page.
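The network part of this sequence (DNS lookup, TCP connection, request, reply, close) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: plain HTTP/1.0 so that the server closes the connection after the reply, no query strings, no error handling. The function names build_request and fetch are made up for this sketch.

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Return (host, port, request_bytes) for a plain-HTTP GET of url."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # default HTTP port
    path = parts.path or "/"         # an empty path means the root page
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    return host, port, request.encode("ascii")


def fetch(url):
    """DNS lookup, TCP connect, send the request, read the reply, close."""
    host, port, request = build_request(url)
    ip = socket.gethostbyname(host)                      # ask DNS for the IP
    with socket.create_connection((ip, port)) as conn:   # open TCP connection
        conn.sendall(request)                            # HTTP request
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:                                 # server closed
                break
            chunks.append(data)
    return b"".join(chunks)                              # status line, headers, page
```

Calling fetch on a URL needs network access and a server that still answers plain HTTP on port 80. Interpreting the HTML and fetching linked images would follow as separate steps.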

The Client Side

Non-HTML content in a page: PDF, GIF, JPEG, MP3, MPEG, ...
Plug-ins: code installed as an extension to the browser
  The plug-in code uses browser functions and vice versa, e.g. the browser supplies the data to the plug-in
Helper applications: invoked by the browser as a separate process


Helper Application


Server Side
1. Accepts a TCP connection
2. Gets the name of the requested file (HTTP message)
3. Gets the file (local disk)
4. Sends back the file (HTTP messages)
5. Releases the TCP connection
To improve performance:
  Maintain a cache of files
  Multithreading
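The request-handling core of these steps, together with the file cache, might look roughly like the sketch below. It leaves out the TCP accept/release and works on the request text directly. The dictionary files stands in for the local disk, and the name handle_request is an assumption of this sketch, not part of any real server.

```python
def handle_request(request_text, files, cache):
    """Return the HTTP response bytes for one GET request.

    files stands in for the local disk: a dict mapping path -> bytes.
    cache is a dict of already served files, filled on a cache miss.
    """
    request_line = request_text.split("\r\n", 1)[0]
    method, path, _version = request_line.split(" ")
    if method != "GET":
        return b"HTTP/1.0 501 Not Implemented\r\n\r\n"
    body = cache.get(path)
    if body is None:                      # cache miss: go to the "disk"
        body = files.get(path)
        if body is None:
            return b"HTTP/1.0 404 Not Found\r\n\r\n"
        cache[path] = body                # remember it for the next request
    header = f"HTTP/1.0 200 OK\r\nContent-Length: {len(body)}\r\n\r\n"
    return header.encode("ascii") + body
```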

Multi-threaded Web Server

The front-end thread accepts the request and builds a record
It passes the record to a working thread (WT)
All threads share memory, including the cache
If the page is not in the cache, the WT initiates a disk read


Tasks of a Working Thread
Resolve the name of the file
Authenticate the client (another lecture)
Perform access control on the client
Check the cache
Fetch the file from disk
Determine the MIME type of the file (this will be sent to the client)
Send the reply to the client (construct the HTTP packet(s))
Write in the Web server log
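A toy version of the front-end/working-thread split is sketched below. It is a simplification under several assumptions: the "disk" is a dictionary, authentication and access control are skipped, replies are collected in a queue instead of being sent over TCP, and all names (working_thread, serve, log) are invented for the sketch.

```python
import mimetypes
import queue
import threading

cache = {}                      # shared by all threads
cache_lock = threading.Lock()
log = []                        # stand-in for the Web server log


def working_thread(records, disk, replies):
    while True:
        record = records.get()
        if record is None:      # sentinel from the front end: no more work
            break
        path = record["path"]
        with cache_lock:
            body = cache.get(path)          # check the cache
        if body is None:
            body = disk[path]               # "disk read" on a cache miss
            with cache_lock:
                cache[path] = body
        mime, _ = mimetypes.guess_type(path)    # MIME type sent to the client
        replies.put((path, mime or "application/octet-stream", body))
        with cache_lock:
            log.append(path)                # write in the server log


def serve(paths, disk, n_threads=3):
    """Front end: build one record per request and pass it to a working thread."""
    records, replies = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=working_thread,
                                args=(records, disk, replies))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for path in paths:
        records.put({"path": path})
    for _ in threads:
        records.put(None)
    for t in threads:
        t.join()
    out = []
    while not replies.empty():
        out.append(replies.get())
    return sorted(out)
```

Two threads may occasionally both miss the cache for the same page and both read it; that costs an extra read but is harmless here.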

What if the CPU can't handle the load?

Server Farm on a LAN

Problems
Each Processing Node (P.N.) has its own cache
  P.N.s specialize in certain files
Both requests and replies pass through the front end



TCP Handoff
The front end passes the TCP endpoint (IP, port) to the Processing Node
The Processing Node sends the page directly to the client



URLs: Uniform Resource Locators

Which questions does a URL answer?
What is the name of the page?
What is the location of the page?
How to access the page (which protocol)?
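As a small illustration, Python's standard urllib.parse module splits a URL into exactly these three answers. The helper name describe and the example URL are assumptions made for this sketch.

```python
from urllib.parse import urlsplit


def describe(url):
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,     # how to access the page
        "location": parts.hostname,   # where the page is (DNS name of the server)
        "name": parts.path,           # what the page is called on that server
    }
```

For example, describe("http://www.example.com/products/index.html") separates the protocol http, the server www.example.com, and the page name /products/index.html.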



Statelessness and Cookies

HTTP is request/reply and stateless
But the server needs:
  to recognize users (registered? adapt the home page)
  to keep track of visited items (shopping cart)
Cookies (small text files) keep that information
  Stored at the client, e.g. C:\Documents and Settings\aviv\Cookies
  Identified by the domain name of the sending server

Cookies: Structure
Domain: where the cookie came from
Path: root of the part of the server's file tree that the cookie relates to
Content: variableName=value pairs; can be anything
Expires: if set, the cookie is kept until that time (persistent cookie); if absent, the browser discards it on exit
Secure: if set, the cookie is sent only to a secure server
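To make the fields concrete, the sketch below uses Python's standard http.cookies module to build the Set-Cookie header a server could send. The cookie name, value, and domain in the usage note are hypothetical, as is the helper name make_cookie.

```python
from http.cookies import SimpleCookie


def make_cookie(name, value, domain, path, expires=None, secure=False):
    """Return a Set-Cookie header line carrying the fields listed above."""
    cookie = SimpleCookie()
    cookie[name] = value                    # Content: name=value
    cookie[name]["domain"] = domain         # Domain
    cookie[name]["path"] = path             # Path
    if expires is not None:
        cookie[name]["expires"] = expires   # Expires (persistent cookie)
    if secure:
        cookie[name]["secure"] = True       # Secure
    return cookie.output()
```

A call such as make_cookie("CustomerID", "42", "shop.example", "/cart", secure=True) yields one header line that starts with "Set-Cookie: CustomerID=42" followed by the attributes.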



Using cookies

Casino server chooses which gambling options it presents
Store server puts the items in the cart into the cookie
Web portal server presents stock prices and sports results
A server records the visits of a UserID to certain pages
  The pages include ads/banners/small pictures
  The user is not aware that the browser visited that server
  A user profile is built, maybe with name/password

HTML: Hypertext Markup Language




Text with markup instructions (formatting, links, ...)
Instructions take the form of a pair of tags, e.g. <h2> ... </h2>

Formatted Page Presented by browser


Some HTML Tags

HTML Table



HTML Input: Forms

The browser presents a Web page with a form
The user fills in the form
The browser stores the user's inputs in variables
The browser sends the information via HTTP

HTML Input: Web page with a Form




Browser Response
A possible response from the browser to the server, with the information filled in by the user: a string of name=value pairs

The server passes the string to a back-end script for processing (e.g. a Perl script)
The script writes to the DB and might create a new page
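The name=value string can be illustrated with Python's urllib.parse. The sketch shows both directions: what the browser would send and how a back-end script could decode it. The field names name, age, and city are assumptions for the example, not taken from the form on the slide.

```python
from urllib.parse import parse_qs, urlencode


def encode_form(fields):
    """Browser side: turn the form variables into one name=value string."""
    return urlencode(fields)


def decode_form(body):
    """Back-end side: recover the variables (first value per name)."""
    return {name: values[0] for name, values in parse_qs(body).items()}
```

Spaces and other special characters are escaped on the way out and restored on the way in, so a value such as "Tel Hai" travels as "Tel+Hai".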

Automatic Processing of Web Pages

Need to process HTML Web pages by programs
  E.g. find a book that was published after 2000
A program searches page(s), which have no structure
  Hard for the program to tell whether 2000 is a year or a price
Idea: build documents (pages) with a structure that is useful for programs
Describe a document in XML, a language for defining named structures and sub-structures
XML: eXtensible Markup Language



A simple Web page in XML

Hierarchical structure
We define a structure named book_list
book_list: a list of three structures named book
book: three fields, each with a name and a value


A program can search for a year >= 2002
How will a browser present this page to a user?
Need a processor that creates an HTML page with formatting tags from the XML document
  Instructions for the processor are in another file
  Written in the eXtensible Style Language (XSL)
  Referenced in the XML file (at the top)
Browsers include an XML/XSL processor and do this automatically on given XML/XSL files
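The benefit of named structures for a program can be shown with a short sketch using Python's xml.etree.ElementTree. The document below is a made-up stand-in for the slide's page: the structure names book_list and book come from the slides, while the field names title, author, year and all the values are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; only book_list and book are named in the slides.
DOCUMENT = """\
<book_list>
  <book><title>Title A</title><author>Author A</author><year>1999</year></book>
  <book><title>Title B</title><author>Author B</author><year>2002</year></book>
  <book><title>Title C</title><author>Author C</author><year>2005</year></book>
</book_list>
"""


def titles_from(document, first_year):
    """Return the titles of all books whose year field is >= first_year."""
    root = ET.fromstring(document)
    return [book.findtext("title")
            for book in root.findall("book")
            if int(book.findtext("year")) >= first_year]
```

Because the year sits in its own named field, the program never has to guess whether a number is a year or a price.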



eXtensible Style Language


Pure html

XSL language program

Server Side Dynamic Pages: CGI Script



Dynamic Web Documents

Server-Side Dynamic Pages: Embedded PHP
The Web server calls the PHP interpreter before sending test.php to the client
The Web server maintains information about the browser (OS type, ...) in the variable HTTP_USER_AGENT
PHP rewrites the page, inserting the value of HTTP_USER_AGENT
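The same idea, a page rewritten on the server before it is sent, can be sketched in Python rather than PHP. This is an analogue for illustration only, not how PHP itself works: the template text, the helper name render, and the use of string.Template are all assumptions of the sketch.

```python
import html
from string import Template

# Page stored on the server, with a placeholder instead of embedded PHP.
PAGE = Template(
    "<html><body><p>Your browser is: $HTTP_USER_AGENT</p></body></html>"
)


def render(page, request_headers):
    """Insert the browser description from the request into the page."""
    agent = request_headers.get("User-Agent", "unknown")
    # Escape the value so that it cannot inject markup into the page.
    return page.substitute(HTTP_USER_AGENT=html.escape(agent))
```

The client only ever receives the finished HTML; the placeholder and the substitution step stay on the server.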



Web page with a form; PHP script processing the form data

User input: Barbara, 24

Output from the PHP script: an HTML page

Client-Side Dynamic Pages: Embedded Javascript



Server Side & Client Side Dynamic Pages

Client side is faster; it is used for local interaction with the user

JavaScript is a full-blown language



Various ways to create and Display Content

Embedded Java applets
Downloadable ActiveX controls

HTTP Protocol



HTTP Protocol (1)

Versions 1.0 and 1.1 (1.1 is specified in RFC 2616)
Request/response
Uses TCP (port 80 on the server side)
Persistent connections (HTTP 1.1)
Request: ASCII; response: RFC 822 MIME-like
A general protocol for object-oriented applications
  Accessing the functionality of remote objects
  Many, but not all, methods are Web specific
  E.g. GET an object (not necessarily a file)

HTTP Protocol (2)

Transaction-oriented client/server protocol
  between a Web browser (client) and a Web server
Stateless
  each transaction is treated independently
Flexible format handling
  the client may specify the supported formats



Examples of HTTP Operation

Direct connection

Via Intermediary system(s)


Intermediary systems 1: Proxy process

Usage: clients within an organization must authenticate an external Web server
The proxy sits on the client side of the firewall (FW)
  a. The proxy authenticates the server (e.g. password, certificate)
  b. Replies carry authentication data, e.g. an SSL header (encrypted hash of the message)
The proxy sends requests to the server and replies to the clients
  Acts as a client when interacting with the server
  Acts as a server when interacting with clients



Types of Intermediate HTTP Systems

Intermediary systems 2: Gateway process

1. A server inside the organization must authenticate an external client
   The gateway (GW) sits on the server side of the firewall
   a. The GW authenticates the client (e.g. password, certificate)
   b. Requests carry authentication data, e.g. an SSL header (encrypted hash of the message)
2. A client connects to a non-HTTP server (e.g. FTP)
   The client sends HTTP requests; the GW translates them



Intermediary systems 3: Tunnel

A tunnel performs no operation on HTTP messages
Used if an intermediary is required for the connection but understanding HTTP is not required
E.g. initial authentication of the client and/or the server
  After that, messages are forwarded unchanged

HTTP Operation - Caches

Caching can be done by a client, a server, or an intermediary system
A cache stores previous requests/responses
It may return a stored response to subsequent requests
Not all requests can be cached



HTTP Messages

General Structure of HTTP message

Request line: method (e.g. GET), resource (file name), HTTP version
Response status line: HTTP version; status code (e.g. 200); reason (e.g. OK)
Headers
  General: Date, Upgrade (to a better version)
  Request: Host, Accept-Charset, ...
  Response: Server (software), Accept-Ranges (the server is willing to handle requests for part of a page, with the range expressed in bytes)
  Entity headers: Content-Type, Last-Modified, ...
Entity body: data (e.g. an HTML page)
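The general structure (start line, headers, blank line, entity body) can be made concrete with a small parser. This is a minimal sketch that assumes a well-formed message with CRLF line endings and no repeated or folded headers; the name parse_message is invented here.

```python
def parse_message(raw):
    """Split an HTTP message into (start_line, headers, entity_body)."""
    head, _, body = raw.partition(b"\r\n\r\n")     # blank line ends the headers
    lines = head.decode("ascii").split("\r\n")
    start_line = lines[0]                          # request or status line
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body
```

Header names are lower-cased because HTTP treats them as case-insensitive.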



Request and Reply

Request:
  GET /rfc.html HTTP/1.1                         // Request line
  Host:                                          // Request header

Reply:
  HTTP/1.1 200 OK                                // Status line
  Date: Wed, 08 May 2002 22:54:22 GMT            // General header
  Server: Apache/1.3.20 (Unix) mod_ssl/2.8.4     // Response header
  Last-Modified: Mon, 11 Sep 2000 13:56:29 GMT   // Entity headers
  ETag: 2a79d-c8b-39bce48d                       // page id, used in caching
  Accept-Ranges: bytes                           // express range in bytes
  Content-Length: 3211
  Content-Type: text/html
  X-pad: avoid browser bug                       // non-standard field
  <html> ..

Conditional GET (1)
Request:
  GET /fruit/kiwi.gif HTTP/1.0
  User-agent: Mozilla/4.0
Response:
  HTTP/1.0 200 OK
  Date: Wed, 12 Aug 1998 15:39:29
  Server: Apache/1.3.0 (Unix)
  Last-Modified: Mon, 22 June 1998 09:23:24
  Content-Type: image/gif
  (data)



Conditional GET (2)

One week later
Request:
  GET /fruit/kiwi.gif HTTP/1.0
  User-agent: Mozilla/4.0
  If-Modified-Since: Mon, 22 June 1998 09:23:24
Response:
  HTTP/1.0 304 Not Modified
  Date: Wed, 19 Aug 1998 15:39:29
  Server: Apache/1.3.0 (Unix)
  (empty entity body)
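The server-side decision behind this exchange can be sketched as follows. The code assumes dates in the standard HTTP form with a GMT suffix (e.g. "Mon, 22 Jun 1998 09:23:24 GMT"), which the standard email.utils helpers can parse and produce. The function name respond and its arguments are assumptions of the sketch.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(request_headers, last_modified, body):
    """Return (status_line, headers, entity_body) for a GET of one object.

    last_modified must be a timezone-aware UTC datetime.
    """
    since = request_headers.get("If-Modified-Since")
    if since is not None and last_modified <= parsedate_to_datetime(since):
        # The client's cached copy is still current: send no entity body.
        return "HTTP/1.0 304 Not Modified", {}, b""
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    return "HTTP/1.0 200 OK", headers, body
```

On the first request there is no If-Modified-Since header, so the object is sent with its Last-Modified date. The client stores that date and echoes it back in later requests.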

HTTP 1.1 Methods



Response Status Codes

Request Headers
User-Agent       Info about the browser (OS)
Accept           Type of pages the client can handle
Accept-Charset; Accept-Encoding; Accept-Language
Host             The server's DNS name
Authorization    Client credentials (e.g. password)
Cookie #         Cookie that was received before
Date
Upgrade          Suggests a switch to another version



Response Headers
Server           Info about the server
Content-Encoding; Content-Length; Content-Language; Content-Type (MIME type)
Last-Modified
Location         Commands the client to go elsewhere
Accept-Ranges    The server will accept byte-range requests
Set-Cookie #     Please save the attached cookie with number #
Date
Upgrade

Entity Body
The entity body is an arbitrary sequence of octets
HTTP can transfer any type of data, including text, binary data, audio, images, video
The data is the content of the resource identified by the URL
The interpretation of the data is determined by header fields:
  Content-Type: defines the data interpretation
  Content-Encoding: encoding applied to the data
  Transfer-Encoding: encoding used to form the entity body



More Header Fields

Forwarded: gateways and proxies add this header with their URL
Connection: close, keep-alive, ... (special instructions)
Keep-Alive: if keep-alive was set in Connection, it indicates the maximum time the sender will keep the connection open waiting for the next request, or the maximum number of additional requests that will be allowed on the current persistent connection
Pragma: implementation-specific information relevant to any recipient along the way

HTTP Messages BNF Format

HTTP-Message    = Simple-Request | Simple-Response | Full-Request | Full-Response
Full-Request    = Request-Line *( General-Header | Request-Header | Entity-Header ) CRLF [ Entity-Body ]
Full-Response   = Status-Line *( General-Header | Response-Header | Entity-Header ) CRLF [ Entity-Body ]
Simple-Request  = "GET" SP Request-URL CRLF
Simple-Response = [ Entity-Body ]



Content Delivery Networks

Content Delivery Networks (1)

A Content Provider (CP) has a main page with links to many content items (pictures, music, video, newspapers)
A CDN company (e.g. Akamai) contracts with the Content Provider to deliver the content from its CDN content servers
The CDN also contracts with many (on the order of 10K) ISPs to place CDN content servers holding the content on the ISPs' networks
The CDN redirects the links in the CP's main page to the CDN main server (by changing the href)



Example: The Furry Video Content Provider

Original Web Page Of Content Provider

Web Page Of Content Provider With redirections

Example (cont'd)
1. The user types the URL and gets to the main page of the Content Provider, FurryVideo
2. The user clicks on a content item
3. The client sends a request to the cdn-server
4. The cdn-server identifies, from the file name, which object is required and, from the IP address of the user, which CDN content server is closest to the client
5. The cdn-server sends a response to the client with status code 301 and a Location header giving the file's URL on a content server close to the client
6. The client connects to the CDN content server
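The redirection step at the cdn-server can be sketched as below. How a real CDN decides which server is "closest" is not described in the slides; the sketch simply assumes a static table from client address ranges to content servers. The server names, address ranges, and file name are placeholders chosen for illustration.

```python
import ipaddress

# Hypothetical mapping from client networks to the nearest content server.
CONTENT_SERVERS = {
    ipaddress.ip_network("192.0.2.0/24"): "cdn-a.example",
    ipaddress.ip_network("198.51.100.0/24"): "cdn-b.example",
}
DEFAULT_SERVER = "cdn-main.example"   # used when no range matches


def redirect(client_ip, path):
    """Return the 301 response that sends the client to a nearby server."""
    address = ipaddress.ip_address(client_ip)
    server = DEFAULT_SERVER
    for network, name in CONTENT_SERVERS.items():
        if address in network:        # "closest" = owning address range
            server = name
            break
    return ("HTTP/1.1 301 Moved Permanently\r\n"
            f"Location: http://{server}{path}\r\n\r\n")
```

The client then opens a new connection to the host named in the Location header and repeats the request there.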