Lecture 3
HTTP and Working with the Web (Cont.)
• Content negotiation
• Content compression with the Accept-Encoding header and language
selection with the Accept-Language header are examples of content
negotiation
• The following headers can be used for content negotiation:
• Accept: For requesting a preferred file format
• Accept-Charset: For requesting the resource in a preferred character set
Content types
• HTTP can be used as a transport for any type of file or data.
• The server can use the Content-Type header in a response to inform the
client about the type of data that it has sent in the body.
• To view the content type, we inspect the value of the response header, as
shown here:
>>> response = urlopen('http://www.debian.org')
>>> response.getheader('Content-Type')
'text/html'
Content types
• The values in this header are taken from a list which is maintained by IANA.
• These values are variously called content types, Internet media types, or
MIME types (MIME stands for Multipurpose Internet Mail Extensions, the
specification in which the convention was first established).
• The full list can be found at http://www.iana.org/assignments/media-types.
Content types
• Content type values can contain optional additional parameters that provide
further information about the type.
• This is usually used to supply the character set that the data uses. For
example:
Content-Type: text/html; charset=UTF-8.
Content types
• Let's discuss an example, downloading the Python home page and using the
Content-Type value it returns. First, we submit our request:
• Then, we check the Content-Type value of our response, and extract the
character set:
• Lastly, we decode our response content by using the supplied character set:
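A sketch of those three steps, assuming network access and that the server includes a charset parameter in its Content-Type header:

```python
from urllib.request import urlopen

# First, we submit our request (www.python.org is the example site
# from the text)
response = urlopen('http://www.python.org')

# Then, we check the Content-Type value of the response,
# e.g. 'text/html; charset=utf-8', and extract the character set
media_type, params = response.getheader('Content-Type').split(';')
charset = params.strip().split('=')[1]

# Lastly, we decode the response content using the supplied charset
content = response.read().decode(charset)
```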
User agents
• The User-Agent header identifies the client software that was used to make the request.
• We can view the user agent that urllib uses. Perform the following steps:
• This header and its value are included in every request made by urllib, so
every server we make a request to can see that we are using Python 3.4 and
the urllib library.
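A sketch of those steps, assuming network access: urlopen fills in its default headers, including User-Agent, on the Request object as it sends it, so we can read the value back afterwards.

```python
from urllib.request import Request, urlopen

# Create a Request object and send it; urlopen adds its default
# headers to the object as the request goes out
req = Request('http://www.debian.org')
response = urlopen(req)

# Inspect the user agent that urllib used, e.g. 'Python-urllib/3.4'
print(req.get_header('User-agent'))
```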
User agents
• Webmasters can inspect the user agents of requests and then use the
information for various things, including the following:
• Classifying visits for their website statistics
• Blocking clients with certain user agent strings
• Sending alternative versions of resources to user agents with known
problems, such as bugs when interpreting certain languages like CSS, or
no support at all for others, such as JavaScript
User agents
• To work around this, we can try to set our user agent so that it mimics a
well-known browser. This is known as spoofing, as shown here:
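A minimal sketch of spoofing, assuming network access; the Mozilla/Chrome string below is just an example browser identifier, not a significant choice:

```python
from urllib.request import Request, urlopen

# Supply our own User-Agent header so the server sees a browser-like
# identifier instead of the Python-urllib default
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/41.0.2272.89 Safari/537.36'}
req = Request('http://www.debian.org', headers=headers)
response = urlopen(req)
```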
Cookies
• A cookie is a small piece of data that the server sends in a Set-Cookie header
as a part of the response.
• The client stores cookies locally and includes them in any future requests
that are sent to the server.
Cookies
• Cookies are necessary because HTTP is stateless: the server has no other
way of tracking a client between requests.
Cookie handling
• First, we need to create a place for storing the cookies that the server will
send us:
• Next, we build an opener that uses our cookie store, and use it to make a
request:
• Lastly, we can check that the server has sent us some cookies:
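The steps above can be sketched as follows, assuming network access; github.com is just an example of a site known to set cookies:

```python
import http.cookiejar
from urllib.request import build_opener, HTTPCookieProcessor

# First, create a place for storing the cookies
cookie_jar = http.cookiejar.CookieJar()

# Build an opener that routes responses through the cookie handler,
# and use it to make a request
opener = build_opener(HTTPCookieProcessor(cookie_jar))
opener.open('http://www.github.com')

# Check that the server has sent us some cookies
print(len(cookie_jar))
```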
Cookies
• You can see that we have two Cookie objects. Now, let's pull out some
information from the first one:
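A sketch of inspecting a stored cookie, assuming the same cookie-jar setup and network access as before (github.com is just an example site that sets cookies):

```python
import http.cookiejar
from urllib.request import build_opener, HTTPCookieProcessor

cookie_jar = http.cookiejar.CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))
opener.open('http://www.github.com')

# Pull the first Cookie object out of the jar and look at some of
# the information it carries
cookie = list(cookie_jar)[0]
print(cookie.name, cookie.value)
print(cookie.domain, cookie.path)
print(cookie.secure, cookie.expires)
```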
Know your cookies
• The HttpOnly flag indicates that the client should only allow access to a
cookie when the access is part of an HTTP request or response.
• This will protect the client against cross-site scripting (XSS) attacks.
Know your cookies
• If the value is true, the Secure flag indicates that the cookie
should only ever be sent over a secure connection, such as
HTTPS.
• Again, we should honor this flag if it has been set, such that
when our application sends requests containing this cookie, it
only sends them to HTTPS URLs.
Redirects
• The 300 range of HTTP status codes is designed for redirects.
• These codes indicate to the client that further action is required on their part
to complete the request.
• The most commonly encountered action is to retry the request at a different
URL.
Redirects
• We'll learn how this works when using urllib. Let's make a request:
• When we send our first request, the server sends a response with a
redirect status code, one of 301, 302, 303, or 307. All of these indicate a
redirect.
• This response includes a Location header, which contains the new URL.
• The urllib package will submit a new request to that URL, and in the
aforementioned case, it will receive yet another redirect, which will lead
it to the Google login page.
• If we hit too many redirects for a single request (more than 10 for urllib),
then urllib will give up and raise an urllib.error.HTTPError exception.
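The sequence above can be sketched like this, assuming network access; www.gmail.com is the example of a redirecting URL from the discussion, which ultimately leads to the Google login page:

```python
from urllib.request import urlopen

# The server answers with a 30x status and a Location header;
# urllib follows the chain of redirects for us automatically
response = urlopen('http://www.gmail.com')

# The response reflects the URL we finally ended up at
print(response.url)
print(response.status)  # 200 once the final URL succeeds
```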
URLs
• URLs are composed of several sections. Python uses the urllib.parse module for
working with URLs. Let's use Python to break a URL into its component parts:
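A sketch of breaking a URL apart; the URL below is just an example:

```python
from urllib.parse import urlparse

# urlparse() splits a URL into its components
result = urlparse('http://www.python.org/dev/peps')
print(result.scheme)  # 'http'
print(result.netloc)  # 'www.python.org'
print(result.path)    # '/dev/peps'
```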
URLs
• Port numbers can be specified explicitly in a URL by appending them to the host.
They are separated from the host by a colon. Let's see what happens when we
try this with urlparse.
• If we don't supply a port (as is usually the case), then the default port 80 is
used for http, and the default port 443 is used for https. This is usually what
we want, as these are the standard ports for the HTTP and HTTPS protocols
respectively.
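The explicit-port case above might look like this; port 8080 is just an example:

```python
from urllib.parse import urlparse

# The port appears after the host, separated by a colon
result = urlparse('http://www.python.org:8080/')
print(result.netloc)    # 'www.python.org:8080'
print(result.hostname)  # 'www.python.org'
print(result.port)      # 8080
```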
Paths and relative URLs
• The path in a URL is anything that comes after the host and the port. Paths
always start with a forward-slash (/), and when just a slash appears on its own,
it's called the root. We can see this by performing the following:
• When a scheme and a host are included in a URL (as in the previous example),
the URL is called an absolute URL. Conversely, it's possible to have relative
URLs, which contain just a path component, as shown here:
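Both cases can be sketched as follows; the URLs are just examples:

```python
from urllib.parse import urlparse

# An absolute URL whose path is just the root
print(urlparse('http://www.python.org/').path)  # '/'

# A relative URL: only the path component is populated
result = urlparse('../images/tux.png')
print(result.scheme)  # ''
print(result.netloc)  # ''
print(result.path)    # '../images/tux.png'
```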
Paths and relative URLs
• Suppose that we've retrieved the http://www.debian.org URL, and within
the webpage source code we found the relative URL for the 'About' page.
We found that it's a relative URL for intro/about.
• We can create an absolute URL by using the URL for the original page and
the urllib.parse.urljoin() function. Let's see how we can do this:
• By supplying urljoin with a base URL, and a relative URL, we've created a
new absolute URL.
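The join described above can be sketched as:

```python
from urllib.parse import urljoin

# Combine the base URL of the page we retrieved with the relative
# URL we found in its source
absolute = urljoin('http://www.debian.org', 'intro/about')
print(absolute)  # 'http://www.debian.org/intro/about'
```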
Query strings
• URLs can contain additional parameters in the form of key/value pairs that
appear after the path.
• They are separated from the path by a question mark, as shown here:
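A sketch of working with a query string; the search URL below is just an example:

```python
from urllib.parse import urlparse, parse_qs

url = 'http://docs.python.org/3/search.html?q=urlparse&area=default'
result = urlparse(url)
print(result.query)  # 'q=urlparse&area=default'

# parse_qs() turns the query string into a dict of key/value-list pairs
print(parse_qs(result.query))  # {'q': ['urlparse'], 'area': ['default']}
```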
URL encoding
• URLs are restricted to ASCII characters, and within this set, a number of
characters are reserved and need to be escaped in different components of
a URL.
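Escaping can be sketched with urllib.parse's quote functions; the strings below are just examples:

```python
from urllib.parse import quote, unquote

# Reserved characters such as space and '?' are percent-escaped
print(quote('A duck?'))  # 'A%20duck%3F'

# Non-ASCII characters are encoded as UTF-8 bytes, then escaped
print(quote('café'))         # 'caf%C3%A9'
print(unquote('caf%C3%A9'))  # 'café'
```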
URL encoding
• Here, it's worth mentioning that hostnames sent over the wire must be
strictly ASCII.
• However, the socket and http modules support transparent encoding of
Unicode hostnames to an ASCII-compatible encoding (IDNA), so in practice
we don't need to worry about encoding hostnames ourselves.
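If we ever do need the encoding explicitly, Python's 'idna' codec performs the same ASCII-compatible transformation; 'café.example' is just an illustrative hostname:

```python
# Encode a Unicode hostname to its ASCII-compatible (IDNA) form,
# which the socket and http modules otherwise apply for us
encoded = 'café.example'.encode('idna')
print(encoded)  # an ASCII form such as b'xn--caf-dma.example'
```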