
Web Server Technologies

Part I: HTTP & Getting Started

Joe Lima
Director of Product Development
Port80 Software, Inc.
jlima@port80software.com
Web Server Technologies | Part I: HTTP & Getting Started

Tutorial Content

Introduction to HTTP
• TCP/IP and application layer protocols
• URLs, resources and MIME Types
• HTTP request/response cycle and proxies

Setup and deployment


• Planning Web server & site deployments
• Site structure and basic server configuration
• Managing users and hosts

Preliminaries - Recommended Texts

Administrating Web Servers, Security & Maintenance


Larson and Stephens, Prentice Hall

HTTP The Definitive Guide


Gourley and Totty, et al., O’Reilly

Online resources are plentiful and will be cited along the way.

The Role of a Web Server

• Web servers serve various resources


- As file (document) servers
- As application front ends

• Other servers also provide services on the Internet, each speaking its own protocol:

- SMTP, POP, IMAP, NNTP, FTP, etc.

• Web server = HTTP server: a box or a service?

• HTTP servers serve HTTP clients (browsers and other user agents) with the help of HTTP intermediaries (proxies)

An Introduction to HTTP

• Hyper Text Transfer Protocol

• One of the application layer protocols that make up the Internet


- HTTP over TCP/IP
- Like SMTP, POP, IMAP, NNTP, FTP, etc.

• The underlying language of the Web

• Three versions have been used; two are in common use and have been specified:
- RFC 1945 HTTP 1.0 (1996)
- RFC 2616 HTTP 1.1 (1999)

A Brief Digression on TCP/IP

HTTP sits atop the TCP/IP Protocol Stack

Application Layer HTTP

Transport Layer TCP

Network Layer IP

Data Link Layer Network Interfaces



A Brief Digression on TCP/IP, cont.

• IP provides packets that are routed based on source and destination IP addresses

• TCP provides segments that ride inside the IP packets and add connection information based on source and destination ports

The ports let TCP carry multiple protocols that connect services running on default ports:

• HTTP on port 80
• HTTP with SSL (HTTPS) on port 443
• FTP on port 21
• SMTP on port 25
• POP on port 110
• SSH on port 22

A Brief Digression on TCP/IP, cont.

• TCP also provides mechanisms to make the connection a reliable bit pipe

• 3-way handshake, sequence numbers, checksums, control flags

• A data stream is chopped up into chunks that are reassembled, complete and
in correct order on the other endpoint of the connection

• TCP segments, riding inside IP packets, carry the chunks of data

• When HTTP is the Application Layer protocol on top of the stack, these
chunks of data are the contents of the HTTP Message
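
The chunking and reassembly described above can be sketched in a few lines of Python. This is a toy model rather than real TCP: the function names, the 16-byte segment size and the use of a byte offset as the sequence number are all illustrative assumptions, and the handshake, checksums and control flags are left out.

```python
import random

def segment(stream: bytes, max_size: int) -> list[tuple[int, bytes]]:
    """Chop a byte stream into (sequence_number, chunk) pairs."""
    return [(offset, stream[offset:offset + max_size])
            for offset in range(0, len(stream), max_size)]

def reassemble(segments: list[tuple[int, bytes]]) -> bytes:
    """Put the chunks back in order using their sequence numbers."""
    return b"".join(chunk for _, chunk in sorted(segments))

# An HTTP Message is just the payload that gets chopped up.
message = b"GET /index.html HTTP/1.1\r\nHost: www.hostname.com\r\n\r\n"
segments = segment(message, max_size=16)
random.shuffle(segments)            # simulate out-of-order arrival
restored = reassemble(segments)     # complete and in the correct order again
```

The receiving side never needs to know that the payload is HTTP; it only sorts by sequence number and concatenates.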

A Brief Digression on TCP/IP, cont.

How an HTTP Message is delivered over a TCP/IP connection:

• The HTTP Message’s data stream is chopped up into chunks small enough to fit in a TCP segment

GET /index.html HTTP/1.1<CRLF>
Host: www.hostname.com Con…

• The chunks ride inside TCP segments used to reassemble them correctly on the other end of the connection

• The segments are shipped to the right destination inside IP datagrams



A Brief Digression on TCP/IP, cont.

Other application layer protocols use TCP/IP to provide Internet services often found in company with HTTP:

• HTTPS (HTTP + SSL/TLS)
– Although a different protocol, service and port, HTTPS is usually integrated with the Web server
• FTP
– Often run on the same box as the HTTP server to provide file transfer capabilities
• SMTP
– Sometimes run with Web server (email gateways)
• SSH
– Widely used instead of telnet for remote admin

Introduction to HTTP - continued

• HTTP and URLs

• URLs used early on by all Internet protocols,


including various document retrieval protocols

• More specifications (both from 1994):


- Uniform Resource Locators - RFC 1738
- Universal Resource Identifiers - RFC 1630

• Hypertext came to predominate as the most efficient way of


providing access to resources
- Fast, flexible, generic, extensible
- Facilitated searching, collaboration, annotation

• HTTP now the central mechanism for requesting and serving URL
based resources

Introduction to HTTP - continued

A Digression on MIME Types


– URLs point to resources (“content”)
– Resources are represented using different Media Types (MIME Types)
• Multipurpose Internet Mail Extensions (RFC 2045 and RFC 2046)
• Should be registered with IANA (www.iana.org)
– MIME Type tells how content should be handled
• File extensions are mapped to certain MIME Types
– .html usually means a MIME Type of text/html
– .jpg usually means a MIME Type of image/jpeg
• But mapping by file extension is dependent on local software’s
conventions and might not be shared across applications or machines
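
As a rough illustration of extension-to-type mapping, Python's standard `mimetypes` module keeps such a table and may also consult the local system's own mapping files, which is exactly why results can differ between machines. The helper name and file names below are hypothetical.

```python
import mimetypes

def guess_mime(filename: str) -> str:
    """Map a file name to a MIME Type using the local extension table."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    # Fall back to the generic binary type when the extension is unknown.
    return mime_type or "application/octet-stream"

html_type = guess_mime("index.html")
jpeg_type = guess_mime("photo.jpg")
unknown_type = guess_mime("data.unknownext")
```

A Web server does much the same lookup when it chooses a Content-Type for a static file.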

Introduction to HTTP - continued

HTTP allows MIME Type info to be passed between client and


server so both agree about the media type of the resource

• primary-type/sub-type

The most common MIME Types used on the Web come from the
text, image and application top-level groups

• text/html, text/css
• image/gif, image/jpeg, image/png
• application/pdf, application/octet-stream
• application/x-javascript, application/x-shockwave-flash

Introduction to HTTP - continued

HTTP servers turn URLs into resources through a request-response cycle

•User agent (client) issues an HTTP request to a host (server) for a given
resource using its URL

•Server “resolves” the URL, acts on the resource


- Retrieves, but also launches, modifies etc.

•Server sends an HTTP response back to the client


- Usually (not always) a representation of the requested resource
- Can also be info about the resource, its state, etc.

•Each request is discontinuous with all previous requests – HTTP is stateless



Basic HTTP Request/Response Cycle

[Figure: an HTTP Client sends an HTTP Request to the HTTP Server www.foo.com, asking for a resource by its URL (http://www.foo.com/bar.html). The server acts on the resource /bar and returns an HTTP Response.]

An HTTP Request/Response Chain

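[Figure: a request/response chain running from the HTTP Client, through an Egress Proxy, Transparent Proxies and a Reverse Proxy, to the HTTP Server on the network at the hosting provider. Zones are labeled LAN, DMZ and Internet; Local DNS, External DNS Servers and Root DNS Servers are also shown.]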

Types and Uses of Proxy Servers

•Proxies are HTTP Intermediaries

•All act as both clients and servers

•Major types of proxies can be distinguished by

where they live and how they get traffic


- Explicit (e.g., Egress)
- Transparent/Intercepting
- Reverse/Surrogate

•Three primary uses for proxies


- Security
- Performance
- Content Filtering

Looking into HTTP

To really understand Web servers (and clients), study the grammar,


syntax and semantics of HTTP requests and responses:

• Look at the parts of the transaction you


don’t normally see in a browser

• Issue requests manually to understand


how a user agent gets resources from a
server

• Use protocol analyzers to “spy” on the


HTTP conversation

• Learn to troubleshoot problems by


“reading” and “writing” HTTP

Looking into HTTP - continued

HTTP requests and responses are both types of Internet


Messages (RFC 822), and share a general format:
– A Start Line, followed by a CRLF
• Request Line for requests
• Status Line for responses
– Zero or more Message Headers
• field-name “:” [field-value] CRLF
– An empty line
• Two CRLFs mark the end of the Headers
– An optional Message Body if there is a payload
• All or part of the “Entity Body” or “Entity”
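
The general format above can be made concrete with a small parser. This is a minimal sketch under simplifying assumptions: the function name is made up, repeated header fields overwrite each other, and folded (multi-line) header values are not handled.

```python
def parse_message(raw: bytes) -> tuple[str, dict[str, str], bytes]:
    """Split a raw HTTP message into start line, headers and body."""
    # The empty line (two CRLFs in a row) marks the end of the headers.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    start_line, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so normalize them.
        headers[name.strip().lower()] = value.strip()
    return start_line, headers, body

raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/html\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello")
start_line, headers, body = parse_message(raw)
```

The same function works for requests, since only the meaning of the start line differs between the two kinds of message.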

Making a simple HTTP request

• Open a TCP connection to a host


– Can borrow telnet protocol to do this, by pointing it
at the default HTTP port (80)
– C:\>telnet www.google.com 80
• Ask for a resource using a minimal request syntax:
– GET / HTTP/1.1 <CRLF>
– Host: www.google.com <CRLF><CRLF>
• A Host header is required for HTTP 1.1 connections,
though not for HTTP 1.0
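
The telnet session above can also be reproduced from a script. The sketch below uses Python's standard `socket` module; the function names are illustrative, and a `Connection: close` header is added (an assumption beyond the minimal request shown) so that the server closes the connection and the read loop can finish.

```python
import socket

def build_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 GET request as raw bytes."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",          # required for HTTP 1.1
        "Connection: close",      # ask the server to close when it is done
        "",                       # the empty line ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

def fetch(host: str, path: str = "/", port: int = 80) -> bytes:
    """Open a TCP connection, send the request and read the whole response."""
    with socket.create_connection((host, port), timeout=10) as conn:
        conn.sendall(build_request(host, path))
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:          # the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)

# Example (needs network access): show only the Status Line.
# print(fetch("www.google.com").split(b"\r\n", 1)[0])
```

Printing the raw bytes shows the parts of the transaction a browser normally hides: the Status Line, the headers and the empty line before the body.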

A Closer Look at the Request Line

Consists of three major parts


– The Request Method followed by a SP
• GET, POST, HEAD, TRACE, OPTIONS, PUT, DELETE and CONNECT
• Extension methods such as those specified by WebDav (RFC 2518)
– The Request URI followed by a SP
• The URL associated with the resource
• By far the most complex part of any Start Line
• Defined by intension rather than extension
– The HTTP Version followed by the CRLF
• 0.9, 1.0, 1.1

A Closer Look at the Request Methods

• GET
– By far most common method
– Retrieves a resource from the server
– Supports passing of query string arguments

• HEAD
– Retrieves only the Headers associated with a resource but not the entity itself
– Highly useful for protocol analysis, diagnostics

• POST
– Allows passing of data in entity rather than URL
– Can transmit far larger arguments than GET
– Arguments not displayed on the URL

More Request Methods

• OPTIONS
– Shows methods available for use on the resource (if given a path) or the host
(if given a “*”)

• TRACE
– Diagnostic method for assessing the impact of proxies along the request-
response chain

• PUT, DELETE
– Used in HTTP publishing (e.g., WebDav)

• CONNECT
– A common extension method for Tunneling other protocols through HTTP

A Closer Look at the Request URI

• Absolute URI vs. Absolute Path


– Explicit Proxies Require Absolute URIs
• Client is connected directly to the proxy
• Protocol and host name needed to resolve request
– Grammar of the Absolute Path
• Like Absolute URI minus the “http://hostname”
• Initial “/” equivalent of the host’s document root
• In HTTP 1.1 with name-based virtual hosting Host header directs request
to appropriate document root
• Subsequent slashes left-to-right imply less “significant” distinctions
• The “*” form used to query entire host
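
To see how an Absolute URI relates to the Absolute Path plus Host header form, the standard `urllib.parse` module can split one into the other. The helper name is hypothetical, and only plain http URLs are considered.

```python
from urllib.parse import urlsplit

def origin_form(absolute_uri: str) -> tuple[str, str]:
    """Return (Host header value, absolute path) for an absolute http URI."""
    parts = urlsplit(absolute_uri)
    path = parts.path or "/"        # an empty path means the document root
    if parts.query:
        path += "?" + parts.query   # query string arguments stay with the path
    return parts.netloc, path

host, path = origin_form("http://www.foo.com/bar.html")
```

An explicit proxy receives the full Absolute URI and can perform this split itself before contacting the origin server.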

A Closer Look at the Status Line

Consists of three major parts


– The HTTP Version followed by a SP
• Just like third part of Request Line
– Status Code followed by a SP
• 5 groups of 3 digit integers indicating the result of the attempt to satisfy
the request
• 1xx are informational
• 2xx are success codes
• 3xx are for alternate resource locations (redirects)
• 4xx indicate client side errors
• 5xx indicate server side errors
– The Reason Phrase followed by the CRLF
• Short textual description of the status code
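
The three parts of the Status Line and the five code groups can be illustrated with a short parser. The names are illustrative, and the sketch assumes a well-formed line that includes a Reason Phrase.

```python
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirect",
    4: "client error",
    5: "server error",
}

def parse_status_line(line: str) -> tuple[str, int, str, str]:
    """Split a Status Line into version, code, reason phrase and class."""
    # Split on the first two spaces only; the Reason Phrase may contain spaces.
    version, code_text, reason = line.rstrip("\r\n").split(" ", 2)
    code = int(code_text)
    # The first digit of the code selects the group.
    return version, code, reason, STATUS_CLASSES[code // 100]

result = parse_status_line("HTTP/1.1 404 Not Found\r\n")
```

Clients are expected to act on the numeric code; the Reason Phrase is there for human readers.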

A Closer Look at HTTP Headers

Headers come in four major types, some for requests,


some for responses, some for both:

– General Headers
• Provide info about messages of both kinds
– Request Headers
• Provide request-specific info
– Response Headers
• Provide response-specific info
– Entity Headers
• Provide info about request and response
entities
– Extension headers are also possible

A Closer Look at General Headers

• Connection – lets clients and servers manage connection state


– Connection: Keep-Alive (HTTP 1.0)
– Connection: close (HTTP 1.1)
• Date – when the message was created
– Date: Sat, 31 May 2003 15:00:00 GMT
• Via – shows proxies that handled message
– Via: 1.1 www.myproxy.com (Squid/1.4)
• Cache-Control – Among the most complex of headers, enables
caching directives
– Cache-Control: no-cache

A Closer Look at Request Headers

• Host – The hostname (and optionally port) of server to which request is being sent
– Required for name-based virtual hosting
– Host: www.port80software.com
• Referer – The URL of the resource from which the current request URI came
– Misspelled in the specification, so [Sic]
– Referer: http://www.host.com/login.asp
• User-Agent – Name of the requesting application, used in browser sensing
– User-Agent: Mozilla/4.0 (Compatible; MSIE 6.0)

Some More Request Headers

• Accept and its variants – Inform servers of client’s capabilities and preferences
– Enables content negotiation
– Accept: image/gif, image/jpeg;q=0.5
– Accept- variants for Language, Encoding, Charset
• If-Modified-Since and other conditionals
– Frequently used by browsers to manage caches
– If-Modified-Since: Sat, 31 May 2003 15:00:00 GMT
• Cookie – How clients pass cookies back to the servers that set them
– Cookie: id=23432;level=3
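
To show how a server might read the preferences in an Accept header such as the one above, here is a small sketch that extracts the q values and orders the media types. The function name is an assumption, and wildcards and other media type parameters are ignored.

```python
def parse_accept(value: str) -> list[tuple[str, float]]:
    """Return (media type, q) pairs sorted from most to least preferred."""
    prefs = []
    for item in value.split(","):
        media_type, *params = [part.strip() for part in item.split(";")]
        q = 1.0                       # q defaults to 1 when it is omitted
        for param in params:
            name, _, val = param.partition("=")
            if name.strip() == "q":
                q = float(val)
        prefs.append((media_type, q))
    # sorted() is stable, so types with equal q keep their original order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)

prefs = parse_accept("image/gif, image/jpeg;q=0.5")
```

With this input the client is saying it prefers GIF and will accept JPEG at half that preference, which is the basis for content negotiation.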

A Closer Look at Response Headers

• Server – The server’s name and version


– Server: Microsoft-IIS/5.0
– Can be problematic for security reasons
• Vary – Tells client & proxy caches which headers were used for content
negotiation
– Vary: User-Agent, Accept
• Set-Cookie – This is how a server sets a cookie on a client
– Set-Cookie: id=234; path=/shop; expires=Sat, 31-May-03 15:00:00 GMT;
secure

A Closer Look at Entity Headers

• Allow – Lists the request methods that can be used on the entity
– Allow: GET, HEAD, POST
• Location – Gives the alternate or new location of the resource (classified as a Response Header in RFC 2616)
– Used with 3xx response codes (redirects)
– Location: http://www.ibm.com/us/
• Content-Encoding – specifies encoding performed on the body of the response
– Used with HTTP compression
– Corresponds to Accept-Encoding request header
– Content-Encoding: gzip

More Entity Headers

• Content-Length – The size of the entity body in bytes


– Value shrinks when compression is applied
– Content-Length: 24000
• Content-Location – The actual URL of the resource if different than its
request URL
– Often used to show the index or default page
– Content-Location: http://www.foo.com/home.html
• Content-Type – specifies Media (MIME) type of the entity body
– Corresponds to Accept header
– Content-Type: image/png

More Entity Headers

• ETag – Uniquely identifies a particular instance of a given resource


– Used with conditional request headers to validate cached instances of the
resource
• If-Match, If-None-Match
– ETag: "adkskdashjgk07563AF"
• Expires – Gives expiration for the instance of the resource for use in caching
– Expires: Sat, 31 May 2003 19:00:00 GMT
• Last-Modified – Date/time the entity was last changed (or created)
– Last-Modified: Fri, 30 May 2003 09:00:00 GMT
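
The way these validators combine with conditional request headers can be sketched as a single decision function. This is a simplified assumption of server behavior, not a complete implementation: the function name is made up, header names are looked up with exact capitalization, and weak entity tags are not treated specially.

```python
from email.utils import parsedate_to_datetime

def is_not_modified(request_headers: dict, etag: str, last_modified: str) -> bool:
    """Return True when a 304 Not Modified response would be appropriate."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # The entity tag comparison takes precedence when it is present.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        return "*" in candidates or etag in candidates
    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        # Unchanged if the entity is no newer than the client's cached copy.
        return (parsedate_to_datetime(last_modified)
                <= parsedate_to_datetime(if_modified_since))
    return False

etag = '"adkskdashjgk07563AF"'
last_modified = "Fri, 30 May 2003 09:00:00 GMT"
cached = is_not_modified({"If-None-Match": etag}, etag, last_modified)
```

When the function returns True the server can skip the entity body entirely, which is where the bandwidth saving of cache validation comes from.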

Planning Web Server Deployments

• Major issues to consider when planning a Web server or


Web site deployment
– What is the appropriate form of Web hosting?
– What type of server software will be used?
– What are the sizing requirements?
– How will DNS be handled?
• There are no fixed answers to any of these questions
• Planning should be guided by the goals of the
deployment and should harmonize with the related
business processes

Choosing Among the Hosting Options

• Host your own


– Pro: Complete control over the physical box
– Con: Expensive and difficult to maintain well

• Hosting provider schemes


– Dedicated Server
• Pro: Control without the hardware purchase
• Con: Must manage the box – remotely
– Co-located Server
• Pro: Admin control of entire box
• Con: Must purchase box and manage remotely
– Virtual Hosting
• Pro: Cheapest and easiest to maintain solution
• Con: Server is shared, admin access limited

Choosing Server Software

Beware of sectarian quarrels, especially over performance and security


– Apache has the best reputation historically
• OS started out more stable, secure and scalable
• Features rapidly extended & refined via modular and open
development model
• Strong administrator ethos = well managed boxes
– IIS formerly favored mainly for ease of use in less demanding
environments, but 5.0 on Win2K closed most of the remaining quality
gap
– Any modern HTTP server is very solid software that is terribly
vulnerable when deployed & used naively

Choosing Server Software, cont.

In the real world, usually a conditioned choice if not a foregone conclusion

– Biggest single factors are type of deployment and prior commitment to an


underlying OS
– Apache on UNIX and Linux predominates in universities, research institutes
and for virtual hosting setups – has majority of hosted domains
– Netscape/iPlanet used to have large enterprise market almost to itself
– IIS started with smaller companies, often as part of LAN server, but has now
taken over Netscape’s leading role in the enterprise

Sizing a Web Server

• Sizing is the process of determining the physical resources required to meet
anticipated demand
• Processing power and memory are not typically a problem for the Web server
– Basic HTTP server job of fetching files is not processor intensive
– Resource constraints on the box probably an effect of other server-side
mechanisms
• Automated session management by app servers
• Manipulation of large database queries
• Lots of non-optimized code in Web applications

Sizing a Web Server, cont.

Network bottlenecks
– Available bandwidth should accommodate max HTTP operations (“hits”) under
peak load
– Assuming an average file size of 14,000 bytes
• 56K Modem could handle about 0.5 hits/sec
• T1 line (1.5 Mbps) could handle about 13 hits/sec
• T3 (45 Mbps) could handle about 400 hits/sec
• OC3 (155Mbps) could handle about 1380 hits/sec
– Bandwidth sizing should be adjusted based on your actual request frequency
and size
• Assume peaks at triple the average loads
– Also watch out for collisions and overloading of routers, switches, hubs and
NICs on the network
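
The hits/sec figures above follow from dividing the link speed by the size of an average response in bits. The sketch below reproduces that arithmetic; the function names are illustrative, the nominal link speeds are the ones quoted above, and protocol overhead (TCP/IP and HTTP headers) is ignored.

```python
def hits_per_second(link_bits_per_sec: float, avg_file_bytes: int = 14_000) -> float:
    """Upper bound on HTTP operations per second for a given link speed."""
    return link_bits_per_sec / (avg_file_bytes * 8)   # 8 bits per byte

def required_bandwidth(avg_hits_per_sec: float, avg_file_bytes: int = 14_000,
                       peak_factor: float = 3.0) -> float:
    """Bits per second needed if peaks run at peak_factor times the average."""
    return avg_hits_per_sec * peak_factor * avg_file_bytes * 8

links = {
    "56K modem": 56_000,
    "T1": 1_500_000,
    "T3": 45_000_000,
    "OC3": 155_000_000,
}
for name, bps in links.items():
    print(f"{name}: about {hits_per_second(bps):.1f} hits/sec")
```

Substituting your own measured request rate and average response size into `required_bandwidth` gives a first estimate that already includes the triple-the-average peak assumption.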

Dealing with DNS

Making a site available by domain name requires its registration and use of DNS
– A domain name can be registered with many different registrars
– During registration, a DNS server is designated to maintain the domain’s DNS
records
– These records propagate to other DNS servers
– DNS servers use them to resolve a domain such as www.port80software.com
to a four-octet IP address such as 66.45.42.237
– ISPs offer DNS services; you can also maintain your own or use a third-party
service that lets you manage the records without running a DNS box

A Simplistic Model of the DNS System

1. Client asks its ISP’s DNS to resolve foo.com
2. That DNS asks root DNS whom to ask about foo.com
3. Root DNS points to 2nd ISP’s DNS
4. 1st ISP’s DNS asks 2nd ISP’s DNS
5. 2nd ISP’s DNS responds with IP
6. 1st ISP’s DNS replies and caches

[Figure: a Root DNS Server and two ISP DNS Servers, with arrows numbered 1 to 6 matching the steps above.]

Dealing with DNS, cont.

• You should learn to use nslookup to verify your DNS lookups are
working and troubleshoot DNS problems
• Command line utility also built into network analyzers like free
ieHTTPHeaders
– C:\>nslookup google.com
• You can also point nslookup at specific DNS servers to test their ability
to resolve
– C:\>nslookup
– >Server 206.13.30.12
– >google.com

Virtual and Physical Site Structure

Think of a site as having not one structure but two – virtual and physical
– Virtual structure is described by the URLs used to request resources
from the site
• This is the public view of the site – the site as visitors will see it
when they browse to it
– Physical structure is the organization of the files and directories in the
file system on the host machine’s hard disk
• This is the private view of the site seen only by you and those
users you choose to give access
– It will become obvious why this distinction is necessary to keep
things straight

Configuring Virtual-Physical Mappings

The Document Root


– A directory in the file system of the host machine where the Web server
looks for the files that constitute the Web site
• Also called the root directory
– Often given an index or default document that serves as the homepage
of the site.
– Corresponds to the “/” at the end of hostname portion of the URL:
• http://www.foo.com/index.html (virtual)
• /var/www/index.html (physical)
• C:\inetpub\wwwroot\index.html (physical)

Configuring Virtual-Physical Mappings

Notice how the hostname portion of the URL maps to the same place pointed to
by the physical path that lies to the left of the “/” representing the
document root
– The URL is virtual to the left of the document root, but it seems to be
physical to the right of the document root
– In fact, a URL is purely virtual – there is no guarantee that the path to
the right of the document root looks this way on disk
– In this simple case, virtual and physical paths happen to coincide from
the document root down, but such is not always the case

Configuring Virtual-Physical Mappings

• A virtual directory or alias in the URL path preempts the lookup in the document
root
• This extends the virtual structure to the right of (or “below”) the root “/” in the URL
path
– http://www.foo.com/virtual/index2.html
– /htdocs/physical/index2.html
• Here a virtual directory virtual points to a physical directory that is outside of the
document root altogether
• Nested virtual directories are also possible
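
The lookup order described above (alias first, document root second) can be sketched as a small mapping function. This models the idea only and is not how any particular server implements it; the document root value, the alias table and the function name are assumptions built from the example paths on these slides.

```python
import posixpath

DOCUMENT_ROOT = "/var/www"
# Virtual directory in the URL path -> physical directory on disk.
ALIASES = {"/virtual": "/htdocs/physical"}

def to_physical(url_path: str) -> str:
    """Map the path part of a URL to a path in the file system."""
    url_path = posixpath.normpath(url_path)   # collapse "." and ".." segments
    for prefix, target in ALIASES.items():
        if url_path == prefix or url_path.startswith(prefix + "/"):
            # An alias preempts the lookup in the document root.
            return target + url_path[len(prefix):]
    return DOCUMENT_ROOT + url_path

aliased = to_physical("/virtual/index2.html")
plain = to_physical("/index.html")
```

Because the URL only ever names the virtual side, the physical directory can be moved later by editing the alias table, with no change to any link.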

Configuring Virtual-Physical Mappings

• You can (and should) take advantage of this virtual/physical distinction to:
– Preserve the site’s URL scheme even if the physical structure has to
change
• Avoids broken links due to site expansion/revision
– Manage directory and file locations in ways that minimize security risks
and facilitate backup procedures
– Reduce redundant physical directories for supporting files
– Allow developers to keep relative URLs in source code simple

Virtual Hosting

• We know the hostname part of the URL is a virtual locator for files that live
(physically) in a site’s document root
• The idea of virtual hosting takes this a step further by allowing a single
server to host many domains, each with its own document root
• Two methods of virtual hosting
– Old way: multiple IP addresses per server
– New way: name-based using host headers
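
Name-based virtual hosting comes down to a lookup keyed on the Host request header. The sketch below is a simplified model: the host names, directory paths and function name are hypothetical, and IPv6 literals in the Host value are not handled.

```python
# Each hosted domain gets its own document root.
VIRTUAL_HOSTS = {
    "www.foo.com": "/var/www/foo",
    "www.bar.com": "/var/www/bar",
}
DEFAULT_ROOT = "/var/www/default"

def document_root_for(host_header: str) -> str:
    """Pick a document root from the value of the Host request header."""
    # Drop the optional port and ignore case, since host names are case-insensitive.
    hostname = host_header.split(":")[0].strip().lower()
    return VIRTUAL_HOSTS.get(hostname, DEFAULT_ROOT)

root = document_root_for("www.foo.com")
```

All of these sites can share one IP address, which is why the Host header became mandatory in HTTP 1.1.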

Managing Users and Hosts

• Users (developers) will need remote access allowing them to transfer files to and
from the site’s physical structure
• FTP (and other file transfer mechanisms) allow the administrator to restrict this
access
– to sub-sections of the site
– by user account or client IP
• These restrictions should be backed up by access control lists on the directories
that enforce the “principle of least access”

Managing Users and Hosts

• Similar rules apply to managing access to the Web site itself by visitors
– ACLs in the Web site’s physical file structure should be set to the minimum
required by the Web server to serve the resources on the site
• This gets tricky with server side programming
– If the Web site (or part of it) does not need to be available for anonymous
access from everywhere then users, groups, hosts and IPs should be
restricted
– HTTP Authentication can also be employed to make all or part of a site
private and require login

Managing Users and Hosts

• Although HTTP authentication now offers safeguards like checksums and


password encryption, it is not very secure
– Lack of end-to-end encryption of the entire message transmission makes
hijacking, scanning and spoofing easy
• If all or part of the site requires authentication and serious security for user’s login
credentials, form based authentication over SSL is the only choice
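
One way to see why unencrypted transmission matters: in the Basic scheme of HTTP authentication, the credentials travel in an Authorization header as the base64 encoding of `user:password`, which anyone who can read the message can reverse. The helper names and the sample credentials below are made up for illustration.

```python
import base64

def encode_basic(user: str, password: str) -> str:
    """Build the value of an Authorization header for the Basic scheme."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"

def decode_basic(header_value: str) -> tuple[str, str]:
    """Recover user and password from a Basic Authorization header value."""
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic":
        raise ValueError("not a Basic credential")
    # base64 is an encoding, not encryption: no key is needed to reverse it.
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password

header = encode_basic("alice", "s3cret")
recovered = decode_basic(header)
```

Sending the same header over SSL protects it in transit, because the whole HTTP message is then encrypted.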

Basic SSL Configuration

• Initiate an application for a certificate from a recognized Certificate Authority (CA)


– The site (domain) owner will have to prove they are who they say they are
• Create a Certificate Signing Request (CSR)
– Contains the site’s Public Key and matches up with a Private Key that is
created simultaneously and stored on the server
• Submit the request to the CA and pay up
• Retrieve the certificate and install it
• Test the certificate with an HTTPS request

About Port80 Software

Solutions for Microsoft IIS Web Servers


Port80 software exposes control to server-side functionality
for developers, and streamlines tasks for administrators:

• Increase security by locking down what info you


broadcast and blocking intruders with ServerMask and
ServerDefender

• Protect your intellectual property by preventing


hotlinking with LinkDeny

• Improve performance: compress pages and manage


cache controls for faster load time and bandwidth savings
with CacheRight, httpZip, and ZipEnable

• Upgrade Web development tools: Negotiate content


based on device, language, or other parameters with
PageXchanger, and tighten code with w3compiler.

Visit us online @ www.port80software.com
