
ALL ABOUT PHP IN INTERACTION WITH HTTP & DOM

HTTP Headers for Dummies


Burak Guzel on Dec 2nd 2009 with 115 Comments
Whether you're a programmer or not, you have seen it everywhere on the web. At this moment your
browser's address bar shows something that starts with "http://". Even your first Hello World script sent
HTTP headers without you realizing it. In this article we are going to learn about the basics of HTTP
headers and how we can use them in our web applications.

What are HTTP Headers?


HTTP stands for Hypertext Transfer Protocol. The entire World Wide Web uses this protocol. It was
established in the early 1990s. Almost everything you see in your browser is transmitted to your computer
over HTTP. For example, when you opened this article page, your browser probably sent over 40
HTTP requests and received an HTTP response for each.
HTTP headers are the core part of these HTTP requests and responses, and they carry information about
the client browser, the requested page, the server and more.


Example
When you type a url in your address bar, your browser sends an HTTP request and it may look like this:

GET /tutorials/other/top-20-mysql-best-practices/ HTTP/1.1
Host: net.tutsplus.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Cookie: PHPSESSID=r2t5uvjq435r4q7ib3vtdjq120
Pragma: no-cache
Cache-Control: no-cache

The first line is the "Request Line", which contains some basic info on the request. The rest are the HTTP
headers.
After that request, your browser receives an HTTP response that may look like this:

HTTP/1.x 200 OK
Transfer-Encoding: chunked
Date: Sat, 28 Nov 2009 04:36:25 GMT
Server: LiteSpeed
Connection: close
X-Powered-By: W3 Total Cache/0.8
Pragma: public
Expires: Sat, 28 Nov 2009 05:36:25 GMT
Etag: "pub1259380237;gz"
Cache-Control: max-age=3600, public
Content-Type: text/html; charset=UTF-8
Last-Modified: Sat, 28 Nov 2009 03:50:37 GMT
X-Pingback: http://net.tutsplus.com/xmlrpc.php
Content-Encoding: gzip
Vary: Accept-Encoding, Cookie, User-Agent

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

The first line is the Status Line, followed by HTTP headers, until the blank line. After that, the
content starts (in this case, an HTML output).
When you look at the source code of a web page in your browser, you will only see the HTML portion and
not the HTTP headers, even though they actually have been transmitted together as you see above.
These HTTP requests are also sent and received for other things, such as images, CSS files, JavaScript
files etc. That is why I said earlier that your browser has sent at least 40 or more HTTP requests as you
loaded just this article page.
Now, let's start reviewing the structure in more detail.


How to See HTTP Headers


I use the following Firefox extensions to analyze HTTP headers:
Firebug
Live HTTP Headers

In PHP:
getallheaders() gets the request headers. You can also use the $_SERVER array.
headers_list() gets the response headers.
Further in the article, we will see some code examples in PHP.
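
Since getallheaders() is not available in every PHP setup, here is a minimal sketch of the $_SERVER approach. The helper name request_headers_from_server() is hypothetical, and the sample array only stands in for a real $_SERVER.

```php
<?php
// Hypothetical helper: rebuilds request header names from a $_SERVER-style array.
function request_headers_from_server(array $server): array
{
    $headers = [];
    foreach ($server as $key => $value) {
        // PHP exposes most request headers as HTTP_* keys.
        if (strpos($key, 'HTTP_') === 0) {
            // HTTP_ACCEPT_LANGUAGE becomes Accept-Language
            $name = str_replace('_', ' ', strtolower(substr($key, 5)));
            $name = str_replace(' ', '-', ucwords($name));
            $headers[$name] = $value;
        }
    }
    return $headers;
}

// Inside a web request you would pass $_SERVER; a sample array is used here.
$sample = [
    'HTTP_HOST' => 'net.tutsplus.com',
    'HTTP_ACCEPT_LANGUAGE' => 'en-us,en;q=0.5',
    'REQUEST_METHOD' => 'GET',
];
print_r(request_headers_from_server($sample));
```

Keys that do not start with HTTP_ (such as REQUEST_METHOD) are skipped, because they describe the request rather than carry a header.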


HTTP Request Structure

The first line of the HTTP request is called the request line and consists of 3 parts:
The method indicates what kind of request this is. Most common methods are GET, POST and HEAD.

The path is generally the part of the url that comes after the host (domain). For example, when
requesting http://net.tutsplus.com/tutorials/other/top-20-mysql-best-practices/ , the path portion is
/tutorials/other/top-20-mysql-best-practices/.
The protocol part contains HTTP and the version, which is usually 1.1 in modern browsers.
The remainder of the request contains HTTP headers as Name: Value pairs on each line. These contain
various information about the HTTP request and your browser. For example, the User-Agent line
provides information on the browser version and the Operating System you are using. Accept-Encoding
tells the server if your browser can accept compressed output like gzip.
You may have noticed that the cookie data is also transmitted inside an HTTP header. And if there was a
referring url, that would have been in the header too.
Most of these headers are optional. This HTTP request could have been as small as this:

GET /tutorials/other/top-20-mysql-best-practices/ HTTP/1.1
Host: net.tutsplus.com

And you would still get a valid response from the web server.
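
To make the structure concrete, here is a rough sketch that splits a raw request into the request line parts and the headers. The function parse_http_request() is a hypothetical helper written for this illustration; it is not how PHP or a web server parses requests internally.

```php
<?php
// Hypothetical helper: splits a raw HTTP request head into its parts.
function parse_http_request(string $raw): array
{
    // Header lines end with CRLF on the wire; accept a bare LF as well.
    $lines = preg_split('/\r?\n/', trim($raw));
    // Request line: METHOD, PATH and PROTOCOL separated by spaces.
    [$method, $path, $protocol] = explode(' ', array_shift($lines), 3);

    $headers = [];
    foreach ($lines as $line) {
        if ($line === '') {
            break; // a blank line ends the header section
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[trim($name)] = trim($value);
    }
    return compact('method', 'path', 'protocol', 'headers');
}

$request = "GET /tutorials/other/top-20-mysql-best-practices/ HTTP/1.1\r\n"
         . "Host: net.tutsplus.com\r\n\r\n";
$parsed = parse_http_request($request);
echo $parsed['method'], ' ', $parsed['path'], ' ', $parsed['headers']['Host'], PHP_EOL;
```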

Request Methods
The three most commonly used request methods are: GET, POST and HEAD. You're probably already
familiar with the first two, from writing html forms.

GET: Retrieve a Document


This is the main method used for retrieving html, images, JavaScript, CSS, etc. Most data that loads in
your browser was requested using this method.
For example, when loading a Nettuts+ article, the very first line of the HTTP request looks like so:

GET /tutorials/other/top-20-mysql-best-practices/ HTTP/1.1
...

Once the html loads, the browser will start sending GET requests for images, which may look like this:

GET /wp-content/themes/tuts_theme/images/header_bg_tall.png HTTP/1.1
...

Web forms can be set to use the method GET. Here is an example.

<form method="GET" action="foo.php">
    First Name: <input type="text" name="first_name"> <br />
    Last Name: <input type="text" name="last_name"> <br />
    <input type="submit" name="action" value="Submit" />
</form>

When that form is submitted, the HTTP request begins like this:

GET /foo.php?first_name=John&last_name=Doe&action=Submit HTTP/1.1
...

You can see that each form input was added into the query string.
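
As a small illustration, the same query string can be built and decoded in PHP. The field values are the ones from the form example; the rest is just a sketch.

```php
<?php
// The same fields the form above submits.
$fields = ['first_name' => 'John', 'last_name' => 'Doe', 'action' => 'Submit'];

// http_build_query() URL-encodes each name and value and joins them with "&".
$query = http_build_query($fields);
echo '/foo.php?' . $query, PHP_EOL;

// parse_str() reverses the process, which is roughly how PHP fills $_GET.
parse_str($query, $decoded);
print_r($decoded);
```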

POST: Send Data to the Server


Even though you can send data to the server using GET and the query string, in many cases POST will be
preferable. Sending large amounts of data using GET is not practical, because browsers and servers limit
how long a url can be.
POST requests are most commonly sent by web forms. Let's change the previous form example to a POST
method.

<form method="POST" action="foo.php">
    First Name: <input type="text" name="first_name" /> <br />
    Last Name: <input type="text" name="last_name" /> <br />
    <input type="submit" name="action" value="Submit" />
</form>

Submitting that form creates an HTTP request like this:


POST /foo.php HTTP/1.1
Host: localhost
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Referer: http://localhost/test.php
Content-Type: application/x-www-form-urlencoded
Content-Length: 43

first_name=John&last_name=Doe&action=Submit

There are three important things to note here:

The path in the first line is simply /foo.php and there is no query string anymore.

Content-Type and Content-Length headers have been added, which provide information about the data
being sent.

All the data is now sent after the headers, with the same format as the query string.

POST method requests can also be made via AJAX, applications, cURL, etc. And all file upload forms are
required to use the POST method.
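
To see where the Content-Length value of 43 comes from, here is a sketch that assembles the text of such a request by hand. build_post_request() is a hypothetical helper for illustration only; in practice a browser or an HTTP client does this for you.

```php
<?php
// Hypothetical helper: builds the raw text of a form-encoded POST request.
function build_post_request(string $host, string $path, array $fields): string
{
    $body = http_build_query($fields);
    return "POST $path HTTP/1.1\r\n"
         . "Host: $host\r\n"
         . "Content-Type: application/x-www-form-urlencoded\r\n"
         // Content-Length is the size of the body in bytes.
         . 'Content-Length: ' . strlen($body) . "\r\n"
         . "\r\n"
         . $body;
}

$fields = ['first_name' => 'John', 'last_name' => 'Doe', 'action' => 'Submit'];
echo build_post_request('localhost', '/foo.php', $fields), PHP_EOL;
```

The blank line before the body is what separates the headers from the data.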

HEAD: Retrieve Header Information


HEAD is identical to GET, except the server does not return the content in the HTTP response. When you
send a HEAD request, it means that you are only interested in the response code and the HTTP headers,
not the document itself.

With this method the browser can check if a document has been modified, for caching purposes. It can also
check if the document exists at all.
For example, if you have a lot of links on your website, you can periodically send HEAD requests to all of
them to check for broken links. This will work much faster than using GET.
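
A broken link check along those lines could be sketched as follows. Both function names are hypothetical, the 5 second timeout is an arbitrary choice, and passing a stream context to get_headers() assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: extracts the status code from a status line
// such as "HTTP/1.1 200 OK". Returns null if the line is not recognized.
function status_code_from_line(string $statusLine): ?int
{
    if (preg_match('#^HTTP/\S+\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return null;
}

// Hypothetical helper: sends a HEAD request and reports whether the url
// answered with a status code below 400.
function link_is_alive(string $url): bool
{
    $context = stream_context_create(['http' => ['method' => 'HEAD', 'timeout' => 5]]);
    $headers = @get_headers($url, false, $context);
    if ($headers === false) {
        return false; // network error or unresolvable host
    }
    $code = status_code_from_line($headers[0]);
    return $code !== null && $code < 400;
}

echo status_code_from_line('HTTP/1.1 404 Not Found'), PHP_EOL;
```

You would then call link_is_alive() once for every url you want to check.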

HTTP Response Structure


After the browser sends the HTTP request, the server responds with an HTTP response. Excluding the
content, it looks like this:

The first piece of data is the protocol. This is again usually HTTP/1.x or HTTP/1.1 on modern servers.
The next part is the status code followed by a short message. Code 200 means that our GET request was
successful and the server will return the contents of the requested document, right after the headers.
We have all seen "404" pages. This number actually comes from the status code part of the HTTP
response. If a GET request is made for a path that the server cannot find, it responds with a
404 instead of 200.
The rest of the response contains headers just like the HTTP request. These values can contain information
about the server software, when the page/file was last modified, the mime type, etc.
Again, most of those headers are actually optional.


HTTP Status Codes

200s are used for successful requests.

300s are for redirections.

400s are used if there was a problem with the request.

500s are used if there was a problem with the server.
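
In code, the four groups above can be told apart by the first digit of the status code. This is only a sketch, and status_class() is a hypothetical helper name.

```php
<?php
// Hypothetical helper: maps a status code to the group it belongs to.
function status_class(int $code): string
{
    switch (intdiv($code, 100)) {
        case 2:
            return 'success';
        case 3:
            return 'redirection';
        case 4:
            return 'problem with the request';
        case 5:
            return 'problem with the server';
        default:
            return 'other';
    }
}

foreach ([200, 301, 404, 500] as $code) {
    echo $code, ': ', status_class($code), PHP_EOL;
}
```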

200 OK
As mentioned before, this status code is sent in response to a successful request.

206 Partial Content


If an application requests only a range of the requested file, the 206 code is returned.
It's most commonly used with download managers that can stop and resume a download, or split the
download into pieces.

404 Not Found

When the requested page or file was not found, a 404 response code is sent by the server.

401 Unauthorized
Password protected web pages send this code. If you don't enter a login correctly, you may see the
following in your browser.

Note that this only applies to HTTP password protected pages, that pop up login prompts like this:

403 Forbidden
If you are not allowed to access a page, this code may be sent to your browser. This often happens when
you try to open a url for a folder, that contains no index page. If the server settings do not allow the display
of the folder contents, you will get a 403 error.
For example, on my local server I created an images folder. Inside this folder I put an .htaccess file with
this line: Options -Indexes. Now when I try to open http://localhost/images/ I see this:


There are other ways in which access can be blocked, and 403 can be sent. For example, you can block by
IP address, with the help of some htaccess directives.

order allow,deny
deny from 192.168.44.201
deny from 224.39.163.12
deny from 172.16.7.92
allow from all

302 (or 307) Moved Temporarily & 301 Moved Permanently


These two codes are used for redirecting a browser. For example, when you use a url shortening service,
such as bit.ly, that's exactly how they forward the people who click on their links.
Both 302 and 301 are handled very similarly by the browser, but they can have different meanings to
search engine spiders. For instance, if your website is down for maintenance, you may redirect to another
location using 302. The search engine spider will continue checking your page later in the future. But if
you redirect using 301, it will tell the spider that your website has moved to that location permanently. To
give you a better idea: http://www.nettuts.com redirects to http://net.tutsplus.com/ using a 301 code instead
of 302.

500 Internal Server Error


This code is usually seen when a web script crashes. Most CGI scripts do not output errors directly to the
browser, unlike PHP. If there are any fatal errors, they will just send a 500 status code. The programmer
then needs to search the server error logs to find the error messages.

Complete List
You can find the complete list of HTTP status codes with their explanations here.

HTTP Headers in HTTP Requests


Now, we'll review some of the most common HTTP headers found in HTTP requests.
Almost all of these headers can be found in the $_SERVER array in PHP. You can also use
the getallheaders() function to retrieve all headers at once.

Host
An HTTP request is sent to a specific IP address. But since most servers are capable of hosting multiple
websites under the same IP, they must know which domain name the browser is looking for.

Host: net.tutsplus.com

This is basically the host name, including the domain and the subdomain.
In PHP, it can be found as $_SERVER['HTTP_HOST'] or $_SERVER['SERVER_NAME'].

User-Agent
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)

This header can carry several pieces of information such as:

Browser name and version.

Operating System name and version.

Default language.

This is how websites can collect certain general information about their surfers' systems. For example,
they can detect if the surfer is using a cell phone browser and redirect them to a mobile version of their
website which works better with low resolutions.
In PHP, it can be found with: $_SERVER['HTTP_USER_AGENT'].

if ( strstr($_SERVER['HTTP_USER_AGENT'],'MSIE 6') ) {
    echo "Please stop using IE6!";
}

Accept-Language
Accept-Language: en-us,en;q=0.5

This header carries the default language setting of the user. If a website has different language versions, it
can redirect a new surfer based on this data.
It can carry multiple languages, separated by commas. The first one is the preferred language, and each
other listed language can carry a "q" value, which is an estimate of the user's preference for the language
(min. 0 max. 1).
In PHP, it can be found as: $_SERVER["HTTP_ACCEPT_LANGUAGE"].

if (substr($_SERVER['HTTP_ACCEPT_LANGUAGE'], 0, 2) == 'fr') {
    header('Location: http://french.mydomain.com');
    // stop the script so nothing else is sent after the redirect
    exit;
}
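
Looking only at the first two characters ignores the "q" values. Here is a rough sketch that reads them and orders the languages by preference. parse_accept_language() is a hypothetical helper, it treats a missing "q" as 1, and the order of languages with equal "q" values is not guaranteed before PHP 8.

```php
<?php
// Hypothetical helper: returns language tags ordered by preference.
function parse_accept_language(string $header): array
{
    $weights = [];
    foreach (explode(',', $header) as $part) {
        $pieces = explode(';', trim($part));
        $tag = strtolower(trim($pieces[0]));
        if ($tag === '') {
            continue;
        }
        $q = 1.0; // a missing q value means full preference
        foreach (array_slice($pieces, 1) as $param) {
            if (preg_match('/^\s*q\s*=\s*([0-9.]+)\s*$/i', $param, $m)) {
                $q = (float) $m[1];
            }
        }
        $weights[$tag] = $q;
    }
    arsort($weights); // highest q first
    return array_keys($weights);
}

print_r(parse_accept_language('en-us,en;q=0.5'));
```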

Accept-Encoding

Accept-Encoding: gzip,deflate

Most modern browsers support gzip, and will send this in the header. The web server then can send the
HTML output in a compressed format. This can reduce the size by up to 80% to save bandwidth and time.
In PHP, it can be found as: $_SERVER["HTTP_ACCEPT_ENCODING"]. However, when you use
the ob_gzhandler() callback function, it will check this value automatically, so you don't need to.

// enables output buffering
// and all output is compressed if the browser supports it
ob_start('ob_gzhandler');

If-Modified-Since
If a web document is already cached in your browser, and you visit it again, your browser can check if the
document has been updated by sending this:


If-Modified-Since: Sat, 28 Nov 2009 06:38:19 GMT

If it was not modified since that date, the server will send a "304 Not Modified" response code with no
content, and the browser will load the content from its cache.
In PHP, it can be found as: $_SERVER['HTTP_IF_MODIFIED_SINCE'].

// assume $last_modify_time is the time the output was last updated

// did the browser send an If-Modified-Since header?
if (isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
    // if the browser cache matches the modify time
    if ($last_modify_time == strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE'])) {
        // send a 304 header, and no content
        header("HTTP/1.1 304 Not Modified");
        exit;
    }
}

There is also an HTTP header named Etag, which can be used to make sure the cache is current. We'll talk
about this shortly.

Cookie
As the name suggests, this sends the cookies stored in your browser for that domain.

Cookie: PHPSESSID=r2t5uvjq435r4q7ib3vtdjq120; foo=bar

These are name=value pairs separated by semicolons. Cookies can also contain the session id.
In PHP, individual cookies can be accessed with the $_COOKIE array. You can directly access the session
variables using the $_SESSION array, and if you need the session id, you can use the session_id() function
instead of the cookie.

echo $_COOKIE['foo'];
// output: bar
echo $_COOKIE['PHPSESSID'];
// output: r2t5uvjq435r4q7ib3vtdjq120
session_start();
echo session_id();
// output: r2t5uvjq435r4q7ib3vtdjq120

Referer
As the name suggests, this HTTP header contains the referring url.
For example, if I visit the Nettuts+ homepage, and click on an article link, my browser sends this header
to the server:

Referer: http://net.tutsplus.com/

In PHP, it can be found as $_SERVER['HTTP_REFERER'].



if (isset($_SERVER['HTTP_REFERER'])) {
    $url_info = parse_url($_SERVER['HTTP_REFERER']);
    // is the surfer coming from Google?
    if ($url_info['host'] == 'www.google.com') {
        parse_str($url_info['query'], $vars);
        echo "You searched on Google for this keyword: ". $vars['q'];
    }
}
// if the referring url was:
// http://www.google.com/search?source=ig&hl=en&rlz=&=&q=http+headers&aq=f&oq=&aqi=g-p1g9
// the output will be:
// You searched on Google for this keyword: http headers

You may have noticed that the word "referrer" is misspelled as "referer". Unfortunately, it made it into the
official HTTP specifications like that and stuck.

Authorization
When a web page asks for authorization, the browser opens a login window. When you enter a username
and password in this window, the browser sends another HTTP request, but this time it contains this
header.

Authorization: Basic bXl1c2VyOm15cGFzcw==

The data inside the header is base64 encoded. For example, base64_decode('bXl1c2VyOm15cGFzcw==')
would return 'myuser:mypass'.
In PHP, these values can be found as $_SERVER['PHP_AUTH_USER'] and
$_SERVER['PHP_AUTH_PW'].
More on this when we talk about the WWW-Authenticate header.
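
PHP normally does this decoding for you, but to show what happens, here is a sketch that takes the header value apart by hand. decode_basic_auth() is a hypothetical helper, not a built-in function.

```php
<?php
// Hypothetical helper: decodes a Basic Authorization header value.
// Returns [user, password], or null if the value is not valid Basic auth.
function decode_basic_auth(string $headerValue): ?array
{
    if (!preg_match('/^Basic\s+(\S+)$/i', trim($headerValue), $m)) {
        return null;
    }
    $decoded = base64_decode($m[1], true); // strict mode rejects invalid input
    if ($decoded === false || strpos($decoded, ':') === false) {
        return null;
    }
    // The password may itself contain ":", so split on the first one only.
    return explode(':', $decoded, 2);
}

print_r(decode_basic_auth('Basic bXl1c2VyOm15cGFzcw=='));
```

Keep in mind that base64 is an encoding and not encryption, so anyone who can read the header can read the password.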

HTTP Headers in HTTP Responses


Now we are going to look at some of the most common HTTP headers found in HTTP responses.
In PHP, you can set response headers using the header() function. PHP already sends certain headers
automatically, for loading the content and setting cookies, etc. You can see the headers that are sent, or
will be sent, with the headers_list() function. You can check if the headers have been sent already with
the headers_sent() function.

Cache-Control
Definition from w3.org: "The Cache-Control general-header field is used to specify directives which
MUST be obeyed by all caching mechanisms along the request/response chain." These "caching
mechanisms" include gateways and proxies that your ISP may be using.
Example:

Cache-Control: max-age=3600, public

"public" means that the response may be cached by anyone. "max-age" indicates how many seconds the
cache is valid for. Allowing your website to be cached can reduce server load and bandwidth, and also
improve load times at the browser.
Caching can also be restricted by using the "no-cache" directive, which makes caches check with the
server before reusing a stored copy.

Cache-Control: no-cache

For more detailed info, see w3.org.

Content-Type
This header indicates the mime-type of the document. The browser then decides how to interpret the
contents based on this. For example, an html page (or a PHP script with html output) may return this:

Content-Type: text/html; charset=UTF-8

"text" is the type and "html" is the subtype of the document. The header can also contain more info such as
charset.
For a gif image, this may be sent:

Content-Type: image/gif

The browser can decide to use an external application or browser extension based on the mime-type. For
example this will cause the Adobe Reader to be loaded:

Content-Type: application/pdf

When a file is loaded directly, Apache can usually detect the mime-type of the document and send the
appropriate header. Also, most browsers have some amount of fault tolerance and auto-detection of the
mime-types, in case the headers are wrong or not present.
You can find a list of common mime types here.
In PHP, you can use the finfo_file() function to detect the mime type of a file.

Content-Disposition
This header instructs the browser to open a file download box, instead of trying to parse the content.
Example:

Content-Disposition: attachment; filename="download.zip"

That will cause the browser to do this:


Note that the appropriate Content-Type header should also be sent along with this:

Content-Type: application/zip
Content-Disposition: attachment; filename="download.zip"

Content-Length
When content is going to be transmitted to the browser, the server can indicate the size of it (in bytes)
using this header.

Content-Length: 89123

This is especially useful for file downloads. That's how the browser can determine the progress of the
download.
For example, here is a dummy script I wrote, which simulates a slow download.

// it's a zip file
header('Content-Type: application/zip');
// 1 million bytes (about 1 megabyte)
header('Content-Length: 1000000');
// load a download dialogue, and save it as download.zip
header('Content-Disposition: attachment; filename="download.zip"');
// 1000 times 1000 bytes of data
for ($i = 0; $i < 1000; $i++) {
    echo str_repeat(".",1000);
    // sleep to slow down the download
    usleep(50000);
}

The result is:


Now I am going to comment out the Content-Length header:

// it's a zip file
header('Content-Type: application/zip');
// the browser won't know the size
// header('Content-Length: 1000000');
// load a download dialogue, and save it as download.zip
header('Content-Disposition: attachment; filename="download.zip"');
// 1000 times 1000 bytes of data
for ($i = 0; $i < 1000; $i++) {
    echo str_repeat(".",1000);
    // sleep to slow down the download
    usleep(50000);
}

Now the result is:

The browser can only tell you how many bytes have been downloaded, but it does not know the total
amount. And the progress bar is not showing the progress.


Etag
This is another header that is used for caching purposes. It looks like this:

Etag: "pub1259380237;gz"

The web server may send this header with every document it serves. The value can be based on the last
modify date, file size or even the checksum value of a file. The browser then saves this value as it caches
the document. Next time the browser requests the same file, it sends this in the HTTP request:

If-None-Match: "pub1259380237;gz"

If the Etag value of the document matches that, the server will send a 304 code instead of 200, and no
content. The browser will load the contents from its cache.
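
On the PHP side, that exchange could be sketched like this. Both functions are hypothetical helpers, and deriving the Etag from md5() of the content is only an assumption made for this example; any value that changes when the content changes would do.

```php
<?php
// Hypothetical helper: true when the client's If-None-Match value
// matches the current Etag, so a 304 can be sent.
function etag_matches(string $etag, ?string $ifNoneMatch): bool
{
    if ($ifNoneMatch === null) {
        return false;
    }
    // The header may list several values separated by commas.
    foreach (explode(',', $ifNoneMatch) as $candidate) {
        $candidate = trim($candidate);
        if ($candidate === '*' || $candidate === $etag) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: sends the content, or a 304 if the cache is current.
function send_with_etag(string $content, ?string $ifNoneMatch): void
{
    // Assumption: the Etag is a checksum of the content.
    $etag = '"' . md5($content) . '"';
    if (etag_matches($etag, $ifNoneMatch)) {
        header('HTTP/1.1 304 Not Modified');
        exit;
    }
    header('Etag: ' . $etag);
    echo $content;
}

var_dump(etag_matches('"pub1259380237;gz"', '"pub1259380237;gz"'));
```

Inside a page you would call send_with_etag($content, $_SERVER['HTTP_IF_NONE_MATCH'] ?? null).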

Last-Modified
As the name suggests, this header indicates the last modify date of the document, in GMT format:

Last-Modified: Sat, 28 Nov 2009 03:50:37 GMT

$modify_time = filemtime($file);
header("Last-Modified: " . gmdate("D, d M Y H:i:s", $modify_time) . " GMT");

It offers another way for the browser to cache a document. The browser may send this in the HTTP
request:

If-Modified-Since: Sat, 28 Nov 2009 06:38:19 GMT

We already talked about this earlier in the "If-Modified-Since" section.

Location
This header is used for redirections. If the response code is 301 or 302, the server must also send this
header. For example, when you go to http://www.nettuts.com your browser will receive this:

HTTP/1.x 301 Moved Permanently
...
Location: http://net.tutsplus.com/
...

In PHP, you can redirect a surfer like so:



header('Location: http://net.tutsplus.com/');

By default, that will send a 302 response code. If you want to send 301 instead:

header('Location: http://net.tutsplus.com/', true, 301);

Set-Cookie
When a website wants to set or update a cookie in your browser, it will use this header.

Set-Cookie: skin=noskin; path=/; domain=.amazon.com; expires=Sun, 29-Nov-2009 21:42:28 GMT
Set-Cookie: session-id=120-73335188165026; path=/; domain=.amazon.com; expires=Sat Feb 27 08:00:00 2010 GMT



Each cookie is sent as a separate header. Note that the cookies set via JavaScript do not go through HTTP
headers.
In PHP, you can set cookies using the setcookie() function, and PHP sends the appropriate HTTP headers.

setcookie("TestCookie", "foobar");

Which causes this header to be sent:



Set-Cookie: TestCookie=foobar

If the expiration date is not specified, the cookie is deleted when the browser window is closed.

WWW-Authenticate
A website may send this header to authenticate a user through HTTP. When the browser sees this header, it
will open up a login dialogue window.

WWW-Authenticate: Basic realm="Restricted Area"

Which looks like this:

There is a section in the PHP manual, that has code samples on how to do this in PHP.

if (!isset($_SERVER['PHP_AUTH_USER'])) {
    header('WWW-Authenticate: Basic realm="My Realm"');
    header('HTTP/1.0 401 Unauthorized');
    echo 'Text to send if user hits Cancel button';
    exit;
} else {
    echo "<p>Hello {$_SERVER['PHP_AUTH_USER']}.</p>";
    echo "<p>You entered {$_SERVER['PHP_AUTH_PW']} as your password.</p>";
}


Content-Encoding
This header is usually set when the returned content is compressed.

Content-Encoding: gzip

In PHP, if you use the ob_gzhandler() callback function, it will be set automatically for you.

Conclusion
Thanks for reading. I hope this article was a good starting point to learn about HTTP Headers. Please leave
your comments and questions below, and I will try to respond as much as I can.


PHP tutorial: Parsing html with DOMDocument


By Silver Moon On Sep 17, 2012 3 Comments

Domdocument
The DOMDocument class of PHP is a very handy one that can be used for a number of tasks like
parsing xml, html and creating xml. It is documented here.
In this tutorial we are going to see how to use this class to parse html content. The need to parse
html arises when you are, for example, writing scrapers or similar data extraction scripts.

Sample html

The following is the sample html file that we are going to use with DomDocument.

<html>
<body>
    <div id="mango">
        This is the mango div. It has some text and a form too.
        <form>
            <input type="text" name="first_name" value="Yahoo" />
            <input type="text" name="last_name" value="Bingo" />
        </form>
        <table class="inner">
            <tr><td>Happy</td><td>Sky</td></tr>
        </table>
    </div>
    <table id="data" class="outer">
        <tr><td>Happy</td><td>Sky</td></tr>
        <tr><td>Happy</td><td>Sky</td></tr>
        <tr><td>Happy</td><td>Sky</td></tr>
        <tr><td>Happy</td><td>Sky</td></tr>
        <tr><td>Happy</td><td>Sky</td></tr>
    </table>
</body>
</html>


Loading the html


So the first thing to do would be to construct a DOMDocument object and load the html content in
it. Let's see how to do that.

// a new dom object
$dom = new DOMDocument;
// discard white space (set before loading, so that it can take effect)
$dom->preserveWhiteSpace = false;
// load the html into the object
$dom->loadHTML($html);

Done. The $dom object has loaded the html content and can be used to extract contents from
the whole html structure just like it's done inside javascript. The most common functions are
getElementsByTagName and getElementById.
Now that the html is loaded, it's time to see how nodes and child elements can be accessed.
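
For reference, here is a self-contained sketch of the loading step that can be run as is. The $html string is a shortened stand-in for the sample file, and the libxml_use_internal_errors() call is an addition that is not part of the steps above; it keeps parser warnings about imperfect html from being printed.

```php
<?php
// A shortened stand-in for the sample html file.
$html = '<html><body><div id="mango">This is the mango div.</div>'
      . '<table id="data" class="outer"><tr><td>Happy</td><td>Sky</td></tr></table>'
      . '</body></html>';

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);
libxml_clear_errors();

if (!$loaded) {
    die('Could not parse the html');
}

$div = $dom->getElementById('mango');
echo $div->nodeName, ': ', $div->nodeValue, PHP_EOL;
echo $dom->getElementsByTagName('table')->length, ' table(s) found', PHP_EOL;
```

In a real scraper the $html string would typically come from file_get_contents() or an HTTP client.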

Get an element by its html id

This will get hold of a node/element by using its ID.

//get element by id
$mango_div = $dom->getElementById('mango');
if(!$mango_div)
{
    die("Element not found");
}
echo "element found";

Getting the value/html of a node


The "nodeValue" attribute of a node gives its text value, with all html inside it stripped. For example:

echo $mango_div->nodeValue;

The second method is to use the saveHTML function, that gets out the exact html inside that
particular node.

echo $dom->saveHTML($mango_div);



Note that the function saveHTML is called on the dom object and the node object is passed as a
parameter. The saveHTML function will provide the whole html (outer html) of the node including
the node's own html tags as well.
Another function called C14N gives a similar result. It is called on the node itself and returns the
node in canonical xml form:

//echo the contents of mango_div element
echo $mango_div->C14N();

inner html

To get just the inner html take the following approach. It adds up the html of all of the child
nodes.

$tables = $dom->getElementsByTagName('table');
echo get_inner_html($tables->item(0));

function get_inner_html( $node )
{
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child)
    {
        $innerHTML .= $child->ownerDocument->saveXML( $child );
    }
    return $innerHTML;
}

The function get_inner_html gets the inner html of the html element. Note that we used the
saveXML function instead of the saveHTML function. The property "childNodes" provides the
child nodes of an element. These are the direct children.

Getting elements by tagname

This will get elements by tag name.

$tables = $dom->getElementsByTagName('table');
foreach($tables as $table)
{
    echo $dom->saveHTML($table);
}



The function getElementsByTagName returns an object of type DomNodeList that can be read
as an array of objects of type DomNode. Another way to fetch the nodes of the NodeList is by
using the item function.

$tables = $dom->getElementsByTagName('table');
echo "Found : ".$tables->length. " items";
$i = 0;
while($table = $tables->item($i++))
{
    echo $dom->saveHTML($table);
}

The item function takes the index of the item to be fetched. The length attribute of the
DomNodeList gives the number of objects found.

Get the attributes of an element


Every DomNode has an attribute called "attributes" that is a collection of all the html attributes of
that node.
Here is a quick example

$tables = $dom->getElementsByTagName('table');

$i = 0;

while($table = $tables->item($i++))
{
    foreach($table->attributes as $attr)
    {
        echo $attr->name . " " . $attr->value . "<br />";
    }
}

To get a particular attribute using its name, use the "getNamedItem" function on the attributes
object.

1
2
3
4
5
6
7
8
9
10
11
12

$tables = $dom->getElementsByTagName('table');
$i = 0;
while($table = $tables->item($i++))
{
$class_node = $table->attributes->getNamedItem('class');

if($class_node)
{
echo "Class is : " . $table->attributes->getNamedItem('class')->value . PHP_
}
}
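
Element nodes also expose attributes directly through DOMElement::hasAttribute() and DOMElement::getAttribute(), which avoids going through the attributes collection. The following is a minimal sketch, assuming a small sample document; the markup and the class name are made up for illustration.

```php
<?php
// Hypothetical sample markup; the article's own document is not shown here.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><table class="report"><tr><td>1</td></tr></table><table><tr><td>2</td></tr></table></body></html>');

// Collect the class of every table that actually has one.
$classes = array();
foreach ($dom->getElementsByTagName('table') as $table) {
    if ($table->hasAttribute('class')) {
        $classes[] = $table->getAttribute('class');
    }
}

echo implode(', ', $classes) . PHP_EOL;
```

getAttribute() returns an empty string when the attribute is missing, so the hasAttribute() check is what separates "no class" from "empty class".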

Children of a node

A DOMNode has the following properties that provide access to its children:
1. childNodes
2. firstChild
3. lastChild

$tables = $dom->getElementsByTagName('table');
$table = $tables->item(1);

//get the number of child nodes in the 2nd table
echo $table->childNodes->length;

//content of each child
foreach($table->childNodes as $child)
{
    echo $child->ownerDocument->saveHTML($child);
}

Checking if child nodes exist
The hasChildNodes function can be used to check if a node has any children at all.
Quick example:

if( $table->hasChildNodes() )
{
    //print content of children
    foreach($table->childNodes as $child)
    {
        echo $child->ownerDocument->saveHTML($child);
    }
}
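
One thing to watch for: childNodes contains every kind of child node, including text and comment nodes, so its length is often larger than the number of child elements. Below is a minimal sketch of filtering on nodeType; the sample markup is invented for illustration.

```php
<?php
// Hypothetical sample markup: a div that mixes text and elements.
$dom = new DOMDocument();
$dom->loadHTML('<html><body><div>intro <b>bold</b> middle <i>italic</i> end</div></body></html>');

$div = $dom->getElementsByTagName('div')->item(0);

// Keep only the children that are elements, skipping text and comment nodes.
$elements = array();
foreach ($div->childNodes as $child) {
    if ($child->nodeType === XML_ELEMENT_NODE) {
        $elements[] = $child->nodeName;
    }
}

echo $div->childNodes->length . ' child nodes, ' . count($elements) . ' of them elements' . PHP_EOL;
```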

Comparing 2 elements for equality


Sometimes you need to check whether the element held in one variable is the same as the element in
another variable. The function "isSameNode" is used for this. The function is called on one node, and the
other node is passed as the parameter. If both refer to the same node, then boolean true is returned.

$tables = $dom->getElementsByTagName('table');
$table = $tables->item(1);
$table2 = $dom->getElementById('data');

var_dump($table->isSameNode($table2));

The var_dump would show true , indicating that the tables in both $table and $table2 are the
same.
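
A self-contained sketch of the same comparison follows. The sample markup is an assumption, and the second lookup uses an XPath query instead of getElementById purely to show two independent routes to one node.

```php
<?php
// Hypothetical sample markup with two tables; the second has id="data".
$dom = new DOMDocument();
$dom->loadHTML('<html><body><table id="first"><tr><td>1</td></tr></table><table id="data"><tr><td>2</td></tr></table></body></html>');

$tables = $dom->getElementsByTagName('table');
$xpath  = new DOMXPath($dom);

// Two different lookups that end up at the same node.
$byIndex = $tables->item(1);
$byId    = $xpath->query('//table[@id="data"]')->item(0);

$same      = $byIndex->isSameNode($byId);            // same node reached two ways
$different = $byIndex->isSameNode($tables->item(0)); // a different table

var_dump($same, $different);
```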

Conclusion

The above examples showed how DOMDocument can be used to access elements in an html
document in an object oriented manner. DOMDocument can not only parse html but also
create/modify html and xml. In later articles we shall see how to do that.

HTML Parsing and Screen Scraping with the Simple HTML DOM Library
Erik Wurzer on May 21st 2010 with 99 Comments
If you need to parse HTML, regular expressions aren't the way to go. In this tutorial, you'll learn how to
use an open source, easily learned parser to read, modify, and spit back out HTML from external sources.
Using nettuts as an example, you'll learn how to get a list of all the articles published on the site and
display them.

Step 1. Preparation
The first thing you'll need to do is download a copy of the simpleHTMLdom library, freely available
from sourceforge.
There are several files in the download, but the only one you need is the simple_html_dom.php file; the
rest are examples and documentation.

Step 2. Parsing Basics


This library is very easy to use, but there are some basics you should review before putting it into action.

Loading HTML
$html = new simple_html_dom();

// Load from a string (double-quoted, because the text contains an apostrophe)
$html->load("<html><body><p>Hello World!</p><p>We're here</p></body></html>");

// Load a file
$html->load_file('http://net.tutsplus.com/');

You can create your initial object either by loading HTML from a string, or from a file. Loading a file can
be done either via URL, or via your local file system.
A note of caution: The load_file() method delegates its job to PHP's file_get_contents. If allow_url_fopen
is not set to true in your php.ini file, you may not be able to open a remote file this way. You could always
fall back on the cURL library to load remote pages in this case, then read them in with the load() method.

Accessing Information

Once you have your DOM object, you can start to work with it by using find() and creating collections. A
collection is a group of objects found via a selector; the syntax is quite similar to jQuery.

<html>
<body>
<p>Hello World!</p>
<p>We're Here.</p>
</body>
</html>

In this example HTML, we're going to take a look at how to access the information in the second
paragraph, change it, and then output the results.

# create and load the HTML
include('simple_html_dom.php');
$html = new simple_html_dom();
$html->load("<html><body><p>Hello World!</p><p>We're here</p></body></html>");

# get an element representing the second paragraph
$element = $html->find("p");

# modify it
$element[1]->innertext .= " and we're here to stay.";

# output it!
echo $html->save();

Using the find() method always returns a collection (array) of tags unless you specify that you only
want the nth child, as a second parameter.
The include, new and load() lines: Load the HTML from a string, as explained previously.
The find("p") line: This finds all <p> tags in the HTML, and returns them as an array. The first paragraph will
have an index of 0, and subsequent paragraphs will be indexed accordingly.
The innertext line: This accesses the 2nd item in our collection of paragraphs (index 1), and makes an addition to its
innertext attribute. Innertext represents the contents between the tags, while outertext represents the
contents including the tag. We could replace the tag entirely by using outertext.
We're going to add one more line, and modify the class of our second paragraph tag.

$element[1]->class = "class_name";
echo $html->save();

The resulting HTML of the save command would be:



<html>
<body>
<p>Hello World!</p>
<p class="class_name">We're here and we're here to stay.</p>
</body>
</html>

Other Selectors
Here are some other examples of selectors. If you've used jQuery, these will seem very familiar.

# get the first occurrence of id="foo"
$single = $html->find('#foo', 0);

# get all elements with class="foo"
$collection = $html->find('.foo');

# get all the anchor tags on a page
$collection = $html->find('a');

# get all anchor tags that are inside H1 tags
$collection = $html->find('h1 a');

# get all img tags with a title of 'himom'
$collection = $html->find('img[title=himom]');

The first example isn't entirely intuitive: all queries by default return collections, even an ID query, which
should only return a single result. However, by specifying the second parameter, we are saying "only
return the first item of this collection".
This means $single is a single element, rather than an array of elements with one item.
The rest of the examples are self-explanatory.
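
For comparison, roughly the same lookups can be written as XPath queries against PHP's built-in DOM extension. This is only a sketch: the sample markup is invented, and the XPath expressions are approximate equivalents of the selectors above, not what the Simple HTML DOM library runs internally.

```php
<?php
// Hypothetical sample markup covering each selector from the list above.
$markup = '<html><body>'
        . '<h1><a href="/one">One</a></h1>'
        . '<p id="foo" class="foo bar">Text <a href="/two">Two</a></p>'
        . '<img src="a.png" title="himom" alt="">'
        . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($markup);
$xpath = new DOMXPath($dom);

// Selector => approximate XPath equivalent.
$queries = array(
    '#foo'             => '//*[@id="foo"]',
    '.foo'             => '//*[contains(concat(" ", normalize-space(@class), " "), " foo ")]',
    'a'                => '//a',
    'h1 a'             => '//h1//a',
    'img[title=himom]' => '//img[@title="himom"]',
);

$counts = array();
foreach ($queries as $selector => $query) {
    $counts[$selector] = $xpath->query($query)->length;
    echo $selector . ' => ' . $counts[$selector] . PHP_EOL;
}
```

The class test is wrapped in concat() and normalize-space() so that "foo" matches as a whole word inside a multi-class attribute such as "foo bar".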

Documentation
Complete documentation on the library can be found at the project documentation page.

Step 3. Real World Example


To put this library in action, we're going to write a quick script to scrape the contents of the Nettuts
website, and produce a list of articles present on the site by title and description (only as an example).
Scraping is a tricky area of the web, and shouldn't be performed without permission.

include('simple_html_dom.php');
$articles = array();
getArticles('http://net.tutsplus.com/page/76/');

We start by including the library, and calling the getArticles function with the page we'd like to start
parsing. In this case we're starting near the end and being kind to Nettuts' server.
We're also declaring a global array to make it simple to gather all the article information in one place.
Before we begin parsing, let's take a look at how an article summary is described on Nettuts+.

<div class="preview">
<!-- Post Taxonomies -->
<div class="post_taxonomy"> ... </div>
<!-- Post Title -->
<h1 class="post_title"><a>Title</a></h1>
<!-- Post Meta -->
<div class="post_meta"> ... </div>
<div class="text"><p>Description</p></div>
</div>

This represents a basic post format on the site, including source code comments. Why are the comments
important? They count as nodes to the parser.

Step 4. Starting the Parsing Function


function getArticles($page) {
    global $articles;
    $html = new simple_html_dom();
    $html->load_file($page);
    // ... more ...
}

We begin very simply by claiming our global, creating a new simple_html_dom object, then loading the
page we want to parse. This function is going to be calling itself later, so we're setting it up to accept the
URL as a parameter.

Step 5. Finding the Information We Want

$items = $html->find('div[class=preview]');
foreach($items as $post) {
    # remember comments count as nodes
    $articles[] = array($post->children(3)->outertext,
                        $post->children(6)->first_child()->outertext);
}

This is the meat of the getArticles function. It's going to take a closer look to really understand what's
happening.
The find() call: Creates an array of elements: divs with the class of preview. We now have a collection of articles
stored in $items.
The children(3) call: $post now refers to a single div of class preview. If we look at the original HTML, we can see that
the child at index 3 is the H1 containing the article title. We take that and assign it to $articles[index][0].
Remember to start at 0 and to count comments when trying to determine the proper index of a child node.
The children(6) call: The child of $post at index 6 is <div class="text">. We want the description text from within, so we
grab the first child's outertext; this will include the paragraph tag. A single record in articles now looks
like this:

$articles[0][0] = "My Article Name Here";
$articles[0][1] = "This is my article description";

Step 6. Pagination
The first thing we do is determine how to find our next page. On Nettuts+, the URLs are easy to figure out,
but we're going to pretend they aren't, and get the next link via parsing.
If we look at the HTML, we see the following:

<a href="http://net.tutsplus.com/page/2/" class="nextpostslink"></a>

If there is a next page (and there won't always be), we'll find an anchor with the class of nextpostslink.
Now that information can be put to use.

if($next = $html->find('a[class=nextpostslink]', 0)) {
    $URL = $next->href;
    $html->clear();
    unset($html);
    getArticles($URL);
}

On the first line, we see if we can find an anchor with the class nextpostslink. Take special notice of the
second parameter for find(). This specifies we only want the first element (index 0) of the found collection
returned. $next will only be holding a single element, rather than a group of elements.
Next, we assign the link's HREF to the variable $URL. This is important because we're about to destroy
the HTML object. Due to a php5 circular references memory leak, the current simple_html_dom object
must be cleared and unset before another one is created. Failure to do so could cause you to eat up all your
available memory.
Finally, we call getArticles with the URL of the next page. This recursion ends when there are no more
pages to parse.
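
The same "follow the next link until there is none" idea can also be written as a loop instead of recursion. The sketch below uses PHP's built-in DOM extension and an in-memory array of invented pages in place of real HTTP requests, so the URLs, the $pages array and the collect_pages() helper are all assumptions made for illustration.

```php
<?php
// Hypothetical stand-in for the remote site: URL => HTML.
$pages = array(
    '/page/1/' => '<html><body><h1>First</h1><a class="nextpostslink" href="/page/2/">next</a></body></html>',
    '/page/2/' => '<html><body><h1>Second</h1><a class="nextpostslink" href="/page/3/">next</a></body></html>',
    '/page/3/' => '<html><body><h1>Third</h1></body></html>',
);

// Visit pages in order, following the "nextpostslink" anchor while one exists.
function collect_pages(array $pages, string $url): array
{
    $titles = array();
    while ($url !== null && isset($pages[$url])) {
        $dom = new DOMDocument();
        $dom->loadHTML($pages[$url]);
        $xpath = new DOMXPath($dom);

        $titles[] = $xpath->evaluate('string(//h1)');

        // Read the href before moving on; null ends the loop.
        $next = $xpath->query('//a[@class="nextpostslink"]')->item(0);
        $url  = $next ? $next->getAttribute('href') : null;
    }
    return $titles;
}

$titles = collect_pages($pages, '/page/1/');
print_r($titles);
```

A loop keeps only one parsed document alive at a time and avoids deep call stacks on sites with many pages.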

Step 7. Outputting the Results

First we're going to set up a few basic stylings. This is completely arbitrary; you can make your output
look however you wish.

#main {
    margin:80px auto;
    width:500px;
}
h1 {
    font:bold 40px/38px helvetica, verdana, sans-serif;
    margin:0;
}
h1 a {
    color:#600;
    text-decoration:none;
}
p {
    background: #ECECEC;
    font:10px/14px verdana, sans-serif;
    margin:8px 0 15px;
    border: 1px #CCC solid;
    padding: 15px;
}
.item {
    padding:10px;
}

Next we're going to put a small bit of PHP in the page to output the previously stored information.

<?php
foreach($articles as $item) {
    echo "<div class='item'>";
    echo $item[0];
    echo $item[1];
    echo "</div>";
}
?>

The final result is a single HTML page listing all the articles, starting on the page indicated by the first
getArticles() call.

Step 8. Conclusion
If you're parsing a great many pages (say, the entire site) it may take longer than the max execution time
allowed by your server. For example, running from my local machine it takes about one second per page
(including time to fetch).
On a site like Nettuts, with a current 78 pages of tutorials, this would run over one minute.
This tutorial should get you started with HTML parsing. There are other methods to work with the DOM,
including PHP's built-in one, which lets you work with powerful XPath selectors to find elements. For ease
of use, and quick starts, I find this library to be one of the best. As a closing note, always remember to
obtain permission before scraping a site; this is important. Thanks for reading!

PHP Simple HTML DOM Parser Manual


Quick Start

Get HTML elements


// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';

Modify HTML elements


// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';
$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html; // Output: <div id="hello">foo</div><div id="world" class="bar">World</div>

Extract contents from HTML


// Dump contents (without tags) from HTML

echo file_get_html('http://www.google.com/')->plaintext;

Scraping Slashdot!
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
foreach($html->find('div.article') as $article) {
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);

How to create HTML DOM object?

Quick way
// Create a DOM object from a string
$html = str_get_html('<html><body>Hello!</body></html>');

// Create a DOM object from a URL
$html = file_get_html('http://www.google.com/');

// Create a DOM object from a HTML file
$html = file_get_html('test.htm');

Object-oriented way
// Create a DOM object
$html = new simple_html_dom();

// Load HTML from a string
$html->load('<html><body>Hello!</body></html>');

// Load HTML from a URL
$html->load_file('http://www.google.com/');

// Load HTML from a HTML file
$html->load_file('test.htm');

How to find HTML elements?

Basics
// Find all anchors, returns an array of element objects
$ret = $html->find('a');

// Find (N)th anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', 0);

// Find last anchor, returns element object or null if not found (zero based)
$ret = $html->find('a', -1);

// Find all <div> with the id attribute
$ret = $html->find('div[id]');

// Find all <div> which attribute id=foo
$ret = $html->find('div[id=foo]');

Advanced
// Find all elements which id=foo
$ret = $html->find('#foo');

// Find all elements which class=foo
$ret = $html->find('.foo');

// Find all elements that have the attribute id
$ret = $html->find('*[id]');

// Find all anchors and images
$ret = $html->find('a, img');

// Find all anchors and images with the "title" attribute
$ret = $html->find('a[title], img[title]');

Descendant selectors
// Find all <li> in <ul>
$es = $html->find('ul li');

// Find nested <div> tags
$es = $html->find('div div div');

// Find all <td> in <table> which class=hello
$es = $html->find('table.hello td');

// Find all td tags with attribute align=center in table tags
$es = $html->find('table td[align=center]');

Nested selectors
// Find all <li> in <ul>
foreach($html->find('ul') as $ul)
{
    foreach($ul->find('li') as $li)
    {
        // do something...
    }
}

// Find first <li> in first <ul>
$e = $html->find('ul', 0)->find('li', 0);

Attribute Filters

Supports these operators in attribute selectors:

Filter                Description
[attribute]           Matches elements that have the specified attribute.
[!attribute]          Matches elements that don't have the specified attribute.
[attribute=value]     Matches elements that have the specified attribute with a certain value.
[attribute!=value]    Matches elements that don't have the specified attribute with a certain value.
[attribute^=value]    Matches elements that have the specified attribute and it starts with a certain value.
[attribute$=value]    Matches elements that have the specified attribute and it ends with a certain value.
[attribute*=value]    Matches elements that have the specified attribute and it contains a certain value.

Text & Comments


// Find all text blocks

$es = $html->find('text');
// Find all comment (<!--...-->) blocks

$es = $html->find('comment');

How to access the HTML element's attributes?

Get, Set and Remove attributes


// Get an attribute (if the attribute is a non-value attribute (e.g. checked, selected...), it will return true or false)
$value = $e->href;

// Set an attribute (if the attribute is a non-value attribute (e.g. checked, selected...), set its value as true or false)
$e->href = 'my link';

// Remove an attribute, set its value as null!
$e->href = null;

// Determine whether an attribute exists
if(isset($e->href))
    echo 'href exist!';

Magic attributes
// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);

echo $e->tag;       // Returns: "div"
echo $e->outertext; // Returns: "<div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: "foo <b>bar</b>"
echo $e->plaintext; // Returns: "foo bar"

Attribute Name    Usage
$e->tag           Read or write the tag name of element.
$e->outertext     Read or write the outer HTML text of element.
$e->innertext     Read or write the inner HTML text of element.
$e->plaintext     Read or write the plain text of element.

Tips
// Extract contents from HTML
echo $html->plaintext;

// Wrap an element
$e->outertext = '<div class="wrap">' . $e->outertext . '</div>';

// Remove an element, set its outertext as an empty string
$e->outertext = '';

// Append an element
$e->outertext = $e->outertext . '<div>foo</div>';

// Insert an element
$e->outertext = '<div>foo</div>' . $e->outertext;

How to traverse the DOM tree?

Background Knowledge
// If you are not so familiar with HTML DOM, check this link to learn more...
// Example
echo $html->find("#div1", 0)->children(1)->children(1)->children(2)->id;

// or
echo $html->getElementById("div1")->childNodes(1)->childNodes(1)->childNodes(2)->getAttribute('id');

Traverse the DOM tree


You can also call methods with camel-case naming conventions.

Method                                 Description
mixed $e->children ( [int $index] )    Returns the Nth child object if index is set, otherwise returns an array of children.
element $e->parent ()                  Returns the parent of element.
element $e->first_child ()             Returns the first child of element, or null if not found.
element $e->last_child ()              Returns the last child of element, or null if not found.
element $e->next_sibling ()            Returns the next sibling of element, or null if not found.
element $e->prev_sibling ()            Returns the previous sibling of element, or null if not found.

How to dump contents of DOM object?

Quick way
// Dumps the internal DOM tree back into string
$str = $html;

// Print it!
echo $html;

Object-oriented way
// Dumps the internal DOM tree back into string
$str = $html->save();

// Dumps the internal DOM tree back into a file
$html->save('result.htm');

How to customize the parsing behavior?

Callback function

// Write a function with parameter "$element"
function my_callback($element) {
    // Hide all <b> tags
    if ($element->tag=='b')
        $element->outertext = '';
}

// Register the callback function with its function name
$html->set_callback('my_callback');

// Callback function will be invoked while dumping
echo $html;

PHP DOM: Using XPath

By: Tim Smith | Posted: June 25, 2012 | Intermediate, PHP Tutorials

In a recent article I discussed PHP's implementation of the DOM and introduced various functions to pull data from and
manipulate an XML structure. I also briefly mentioned XPath, but didn't have much space to discuss it. In this article, we'll
look closer at XPath, how it functions, and how it is implemented in PHP. You'll find that XPath can greatly reduce the
amount of code you have to write to query and filter XML data, and will often yield better performance as well.
I'll use the same DTD and XML from the previous article to demonstrate the PHP DOM XPath functionality. To quickly
refresh your memory, here's what the DTD and XML look like:

<!ELEMENT library (book*)>
<!ELEMENT book (title, author, genre, chapter*)>
  <!ATTLIST book isbn ID #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT genre (#PCDATA)>
<!ELEMENT chapter (chaptitle,text)>
  <!ATTLIST chapter position NMTOKEN #REQUIRED>
<!ELEMENT chaptitle (#PCDATA)>
<!ELEMENT text (#PCDATA)>

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE library SYSTEM "library.dtd">
<library>
  <book isbn="isbn1234">
    <title>A Book</title>
    <author>An Author</author>
    <genre>Horror</genre>
    <chapter position="first">
      <chaptitle>chapter one</chaptitle>
      <text><![CDATA[Lorem Ipsum...]]></text>
    </chapter>
  </book>
  <book isbn="isbn1235">
    <title>Another Book</title>
    <author>Another Author</author>
    <genre>Science Fiction</genre>
    <chapter position="first">
      <chaptitle>chapter one</chaptitle>
      <text><![CDATA[<i>Sit Dolor Amet...</i>]]></text>
    </chapter>
  </book>
</library>

Basic XPath Queries


XPath is a syntax available for querying an XML document. In its simplest form, you define a path to the element you want.
Using the XML document above, the following XPath query will return a collection of all the book elements present:

//library/book

That's it. The two leading forward slashes tell XPath to look for library elements anywhere in the document (here library
is the root element), and the single slash indicates book is a child. It's pretty straightforward, no?
But what if you want to specify a particular book? Let's say you want to return any books written by "An Author". The
XPath for that would be:

//library/book/author[text() = "An Author"]/..

You can use text() here in square braces to perform a comparison against the value of a node, and the trailing /..
indicates we want the parent element (i.e. move back up the tree one node).
XPath queries can be executed using one of two functions: query() and evaluate(). Both perform the query, but the
difference lies in the type of result they return. query() will always return a DOMNodeList whereas evaluate() will
return a typed result if possible. For example, if your XPath query is to return the number of books written by a certain
author rather than the actual books themselves, then query() will return an empty DOMNodeList. evaluate() will
simply return the number so you can use it immediately instead of having to pull the data from a node.
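
The difference between the two calls can be seen in a short, self-contained sketch. It assumes a cut-down copy of the library XML (chapters omitted, no DTD) built inline as a string.

```php
<?php
// Cut-down version of the library XML from above (chapters omitted).
$xml = '<library>'
     . '<book isbn="isbn1234"><title>A Book</title><author>An Author</author></book>'
     . '<book isbn="isbn1235"><title>Another Book</title><author>Another Author</author></book>'
     . '</library>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList of matching nodes.
$books = $xpath->query('//library/book/author[text() = "An Author"]/..');
$firstTitle = $books->item(0)->getElementsByTagName('title')->item(0)->nodeValue;
echo $books->length . ' book node(s), first title: ' . $firstTitle . PHP_EOL;

// evaluate() returns a typed value for expressions such as count().
$total = $xpath->evaluate('count(//library/book)');
var_dump($total);
```

Note that XPath numbers come back as PHP floats, so a cast to int may be wanted before using the count as an index or loop bound.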

Code and Speed Benefits with XPath


Let's do a quick demonstration that returns the number of books written by an author. The first method we'll look at will
work, but doesn't make use of XPath. This is to show you how it can be done without XPath and why XPath is so powerful.

<?php
public function getNumberOfBooksByAuthor($author) {
    $total = 0;
    $elements = $this->domDocument->getElementsByTagName("author");
    foreach ($elements as $element) {
        if ($element->nodeValue == $author) {
            $total++;
        }
    }
    return $total;
}

The next method achieves the same result, but uses XPath to select just those books that are written by a specific author:

<?php
public function getNumberOfBooksByAuthor($author) {
    $query = "//library/book/author[text() = '$author']/..";
    $xpath = new DOMXPath($this->domDocument);
    $result = $xpath->query($query);
    return $result->length;
}

Notice how this time we have removed the need for PHP to test against the value of the author. But we can go one step
further still and use the XPath function count() to count the occurrences of this path.

<?php
public function getNumberOfBooksByAuthor($author) {
    $query = "count(//library/book/author[text() = '$author']/..)";
    $xpath = new DOMXPath($this->domDocument);
    return $xpath->evaluate($query);
}

We're able to retrieve the information we needed with only one line of XPath and there is no need to perform laborious
filtering with PHP. Indeed, this is a much simpler and more succinct way to write this functionality!
Notice that evaluate() was used in the last example. This is because the function count() returns a typed result.
Using query() will return a DOMNodeList but you will find that it is an empty list.
Not only does this make your code cleaner, but it also comes with speed benefits. I found that version 1 was 30% faster on
average than version 2 but version 3 was about 10 percent faster than version 2 (about 15% faster than version 1). While
these measurements will vary depending on your server and query, using XPath in its purest form will generally yield a
considerable speed benefit as well as making your code easier to read and maintain.

XPath Functions
There are quite a few functions that can be used with XPath and there are many excellent resources which detail what
functions are available. If you find that you are iterating over DOMNodeLists or comparing nodeValues, you will
probably find an XPath function that can eliminate a lot of the PHP coding.
You've already seen how count() functions. Let's use the id() function to return the titles of the books with the given
ISBNs. The XPath expression you will need to use is:

id("isbn1234 isbn1235")/title

Notice here that the values you are searching for are enclosed within quotes and delimited with a space; there is no need for a
comma to delimit the terms.

<?php
public function findBooksByISBNs(array $isbns) {
    $ids = join(" ", $isbns);
    $query = "id('$ids')/title";

    $xpath = new DOMXPath($this->domDocument);
    $result = $xpath->query($query);

    $books = array();
    foreach ($result as $node) {
        $book = array("title" => $node->nodeValue);
        $books[] = $book;
    }
    return $books;
}

Executing complex functions in XPath is relatively simple; the trick is to become familiar with the functions that are
available.
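
One practical detail: id() can only find elements if the parser knows which attribute is an ID. The sketch below is one way to arrange that, assuming a cut-down library document with the DTD placed inline (rather than in library.dtd) and validateOnParse switched on; the third book is invented to show that unlisted ISBNs are skipped.

```php
<?php
// Cut-down library document with an inline DTD so that isbn is declared as an ID.
$xml = <<<'XML'
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE library [
  <!ELEMENT library (book*)>
  <!ELEMENT book (title)>
  <!ATTLIST book isbn ID #REQUIRED>
  <!ELEMENT title (#PCDATA)>
]>
<library><book isbn="isbn1234"><title>A Book</title></book><book isbn="isbn1235"><title>Another Book</title></book><book isbn="isbn1236"><title>Third Book</title></book></library>
XML;

$doc = new DOMDocument();
$doc->validateOnParse = true; // validate against the DTD while loading
$doc->loadXML($xml);

$xpath  = new DOMXPath($doc);
$titles = array();
foreach ($xpath->query('id("isbn1234 isbn1235")/title') as $node) {
    $titles[] = $node->nodeValue;
}

print_r($titles);
```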

Using PHP Functions With XPath


Sometimes you may find that you need some greater functionality that the standard XPath functions cannot deliver. Luckily,
PHP DOM also allows you to incorporate PHP's own functions into an XPath query.
Let's consider returning the number of words in the title of a book. In its simplest form, we could write the method as
follows:

<?php
public function getNumberOfWords($isbn) {
    $query = "//library/book[@isbn = '$isbn']";

    $xpath = new DOMXPath($this->domDocument);
    $result = $xpath->query($query);

    $title = $result->item(0)->getElementsByTagName("title")
        ->item(0)->nodeValue;

    return str_word_count($title);
}

But we can also incorporate the function str_word_count() directly into the XPath query. There are a few steps that
need to be completed to do this. First of all, we have to register a namespace with the XPath object. PHP functions in XPath
queries are preceded by php:functionString and then the name of the function you want to use is enclosed in
parentheses. Also, the namespace to be defined is http://php.net/xpath. The namespace must be set to this; any other
values will result in errors. We then need to call registerPHPFunctions() which tells PHP that whenever it comes
across a function namespaced with php:, it is PHP that should handle it.
The actual syntax for calling the function is:

php:functionString("nameoffunction", arg, arg...)

Putting this all together results in the following reimplementation of getNumberOfWords():

<?php
public function getNumberOfWords($isbn) {
    $xpath = new DOMXPath($this->domDocument);

    //register the php namespace
    $xpath->registerNamespace("php", "http://php.net/xpath");

    //ensure php functions can be called within xpath
    $xpath->registerPHPFunctions();

    $query = "php:functionString('str_word_count',(//library/book[@isbn = '$isbn']/title))";

    return $xpath->evaluate($query);
}

Notice that you don't need to call the XPath function text() to provide the text of the node.
php:functionString converts the node to its string value automatically. However the following is just as valid:

php:functionString('str_word_count',(//library/book[@isbn =
'$isbn']/title[text()]))
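
A runnable, stand-alone version of the same call is sketched below. The inline XML and its longer second title are assumptions for illustration, and registerPHPFunctions() is given the single function name so that only str_word_count is exposed to the query.

```php
<?php
// Cut-down library XML; the second title is invented to give a clear word count.
$xml = '<library>'
     . '<book isbn="isbn1234"><title>A Book</title></book>'
     . '<book isbn="isbn1235"><title>Another Longer Book Title</title></book>'
     . '</library>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');
// Expose only the one PHP function this query needs.
$xpath->registerPHPFunctions('str_word_count');

$isbn  = 'isbn1235';
$query = "php:functionString('str_word_count', //library/book[@isbn = '$isbn']/title)";

// Cast, because the PHP return value is handed back through XPath.
$words = (int) $xpath->evaluate($query);
echo $words . PHP_EOL;
```

Since $isbn is interpolated straight into the query string, values that come from user input should be validated first.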

Registering PHP functions is not restricted to the functions that come with PHP. You can define your own functions and
provide those within the XPath. The only difference here is that when calling the function, you use php:function rather
than php:functionString. Also, it is only possible to provide either functions on their own or static methods. Calling
instance methods is not supported.
Let's use a regular function that is outside the scope of the class to demonstrate the basic functionality. The function we will
use will return only books by George Orwell. It must return true for every node you wish to include in the query.

<?php
function compare($node) {
    return $node[0]->nodeValue == "George Orwell";
}

The argument passed to the function is an array of DOMElements. It is up to the function to iterate through the array and
determine whether the node being tested should be returned in the DOMNodeList. In this example, the node being tested
is /book and we are using /author to make the determination.
Now we can create the method getGeorgeOrwellBooks() :

<?php
public function getGeorgeOrwellBooks() {
    $xpath = new DOMXPath($this->domDocument);
    $xpath->registerNamespace("php", "http://php.net/xpath");
    $xpath->registerPHPFunctions();

    $query = "//library/book[php:function('compare', author)]";
    $result = $xpath->query($query);

    $books = array();
    foreach($result as $node) {
        $books[] = $node->getElementsByTagName("title")
            ->item(0)->nodeValue;
    }

    return $books;
}
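
The same filter can be tried outside a class with a small stand-alone script. This is a sketch under assumptions: the three-book XML is invented, and the callback is named is_orwell() here (a hypothetical name) to avoid clashing with the compare() function above.

```php
<?php
// Hypothetical callback: receives the matched <author> nodes as an array.
function is_orwell(array $nodes): bool
{
    return isset($nodes[0]) && $nodes[0]->nodeValue === 'George Orwell';
}

// Invented sample data, not the article's library.xml.
$xml = '<library>'
     . '<book><title>1984</title><author>George Orwell</author></book>'
     . '<book><title>Brave New World</title><author>Aldous Huxley</author></book>'
     . '<book><title>Animal Farm</title><author>George Orwell</author></book>'
     . '</library>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('php', 'http://php.net/xpath');
$xpath->registerPHPFunctions('is_orwell');

// The predicate keeps only books for which the callback returns true.
$titles = array();
foreach ($xpath->query("//library/book[php:function('is_orwell', author)]") as $book) {
    $titles[] = $book->getElementsByTagName('title')->item(0)->nodeValue;
}

print_r($titles);
```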

If compare() were a static method, then you would need to amend the XPath query so that it reads:

//library/book[php:function('Library::compare', author)]

In truth, all of this functionality can be easily coded up with just XPath, but the example shows how you can extend XPath
queries to become more complex.
Calling an object method is not possible within XPath. If you find you need to access some object properties or methods to
complete the XPath query, the best solution would be to do what you can with XPath and then work on the
resulting DOMNodeList with any object methods or properties as necessary.

Summary
XPath is a great way of cutting down the amount of code you have to write and of speeding up the execution of the code when
working with XML data. Although not part of the official DOM specification, the additional functionality that the PHP
DOM provides allows you to extend the normal XPath functions with custom functionality. This is a very powerful feature,
and as your familiarity with XPath functions increases you may find that you come to rely on this less and less.


PHP XML DOM


The XML DOM (Document Object Model) is a standard set of objects for accessing and
manipulating XML documents.
The DOM extension allows you to process XML documents in PHP 5+. You can use it to
get, change, add, or delete XML elements.
In the DOM, every element is represented as an object called a node. Each node
can have content, child nodes and a parent node, starting with the root element (tag).

Creating an XML document with PHP


With the PHP DOM functions you can read existing XML content, and also create new XML
documents.
In the following example we create an XML document and save it in a new file.
1. First, we create a new XML object, with the DomDocument class (which is part of PHP).
2. We use DomDocument methods to create and add some elements, an attribute, and
text content (here we build a structure with HTML tags).
3. Then, we write (save) the data on the server, in a .xml file (named "dom_example.xml").
PHP must have write permission for the target directory on the server.
The comments in the script give further explanations.
<?php
$xml_file = 'files/dom_example.xml';           // define the directory and the name of the xml file

$xmlDoc = new DomDocument('1.0', 'utf-8');     // create a new DOM object
$root = $xmlDoc->createElement('html');        // create the root element
$root = $xmlDoc->appendChild($root);           // adds the root element in the DOM object
$body = $xmlDoc->createElement('body');        // create another element, 'body'
$body = $root->appendChild($body);             // adds 'body' as a child element in $root
$body->setAttribute('bgcolor', '#e8e8fe');     // sets an attribute for the 'body'
$div = $xmlDoc->createElement('div');          // create another element, 'div'
$div = $body->appendChild($div);               // adds 'div' as a child element in 'body'
$text = $xmlDoc->createTextNode('coursesweb.net - PHP');   // create a text content
$text = $div->appendChild($text);              // adds the text content in 'div'

// save the xml content stored in $xmlDoc
if($xmlDoc->save($xml_file)) echo 'The dom_example.xml was created';
else echo 'Error: unable to write dom_example.xml';
?>

The content of the XML document is built up in the $xmlDoc variable (a DOM object) by defining and
adding each element one by one.
The save($xml_file) method saves the XML content in the path specified in $xml_file.
If you run the script above and open the "dom_example.xml", you'll see the following result.
<?xml version="1.0" encoding="utf-8"?>
<html>
<body bgcolor="#e8e8fe">
<div>coursesweb.net - PHP</div>
</body>
</html>
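
By default the markup is serialized without indentation, so the file may hold all elements on one line. As a small sketch, assuming an indented string is wanted instead of a file, the formatOutput property can be set before serializing, and saveXML() returns the result as a string:

```php
<?php
$xmlDoc = new DOMDocument('1.0', 'utf-8');
$xmlDoc->formatOutput = true;   // indent nested elements when serializing

$root = $xmlDoc->appendChild($xmlDoc->createElement('html'));
$body = $root->appendChild($xmlDoc->createElement('body'));
$body->setAttribute('bgcolor', '#e8e8fe');
$div = $body->appendChild($xmlDoc->createElement('div'));
$div->appendChild($xmlDoc->createTextNode('coursesweb.net - PHP'));

$xmlString = $xmlDoc->saveXML();   // the same content that save() would write to a file
echo $xmlString;
```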

Reading an XML document


PHP DOM can also be used to read data from an XML document.
In the next example we use the XML document saved in the "dom_example.xml" file (created by the
script above).
1. First, we create a new XML object, with the DomDocument class, and load the content of the
"dom_example.xml" file in it (with the load() method).
2. We get the root (with the documentElement property) and all elements of root (with
getElementsByTagName("*")).
3. Then, we use a foreach() instruction to loop through all these elements and output their
name and value.
<?php
$xml_file = 'files/dom_example.xml';         // define the directory and the name of the xml file

// create a new XML object and load the content of the XML file
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml_file);
$root = $xmlDoc->documentElement;            // get the first child node (the root)
$elms = $root->getElementsByTagName("*");    // gets all elements ("*") of root

// loop through all elements stored in $elms
foreach ($elms as $item) {
  // gets the name and the value of each $item
  $tag = $item->nodeName;
  $value = $item->nodeValue;
  // outputs the $tag and $value, followed by a line break
  echo $tag. ' = '. $value . "\n";
}
?>

This code will output:


body = coursesweb.net - PHP
div = coursesweb.net - PHP
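
Both lines show the same text because the nodeValue of an element is the concatenated text of all its descendants, so 'body' reports the text of the 'div' it contains. A minimal sketch of this, loading the same markup from a string (an assumption made so the snippet runs without the file):

```php
<?php
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML('<html><body bgcolor="#e8e8fe"><div>coursesweb.net - PHP</div></body></html>');

// collect name => value for every element below the root
$pairs = array();
foreach ($xmlDoc->documentElement->getElementsByTagName('*') as $item) {
    $pairs[$item->nodeName] = $item->nodeValue;
}

print_r($pairs);
```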
The DOM loads the entire hierarchical tree into memory before the XML document can be processed,
which makes it unsuitable for XML documents that exceed the memory available to the script.
If you only want to read and output some data from an XML document, you should use
SimpleXML, which is fast and easy to use when performing basic tasks like: reading XML files,
extracting data from XML strings, and editing nodes and attributes.
- SimpleXML is presented in the next lesson.
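
As a brief preview, here is a sketch of the same read done with SimpleXML. It assumes the markup is available as a string; simplexml_load_file() could be used for a file instead.

```php
<?php
$xml = simplexml_load_string(
    '<html><body bgcolor="#e8e8fe"><div>coursesweb.net - PHP</div></body></html>'
);

$text  = (string) $xml->body->div;         // content of the 'div' element
$color = (string) $xml->body['bgcolor'];   // value of the 'bgcolor' attribute

echo $text, ' / ', $color, "\n";
```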

Modify XML documents


PHP DOM can also be used to modify data of an XML document.
In the next example we use the XML document saved in the "dom_example.xml" file.
1. First, we create a new XML object, with the DomDocument class, and load the content of the
"dom_example.xml" file in it (with the load() method).
2. We get the root (with documentElement property) and all elements of root, then we use a
for() instruction to loop through all these elements.
3. When the loop reaches the element that we want to modify, we set another value for it,
create a new DIV element and add it to the XML content.
4. We save the new content in the same XML file and display its structure.
<?php
$xml_file = 'files/dom_example.xml';         // define the directory and the name of the xml file

// create a new XML object and load the content of the XML file
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml_file);
$root = $xmlDoc->documentElement;            // get the first child node (the root)
$elms = $root->getElementsByTagName("*");    // gets all elements ("*") in root
$nr_elms = $elms->length;                    // gets the number of elements

// loop through all elements stored in $elms
for($i = 0; $i<$nr_elms; $i++) {
  $node = $elms->item($i);                   // gets the current node
  // if the name of the current $node is 'div', changes its value
  if($node->nodeName=='div') {
    $node->nodeValue = 'The new text value';
    // sets and add a new DIV, in the same parent node
    $new_elm = $xmlDoc->createElement('div', 'This is the new inserted DIV');
    $node->parentNode->appendChild($new_elm);
  }
}
// save the new xml content in the same file and output its structure
if($xmlDoc->save($xml_file)) {
  echo htmlentities($xmlDoc->saveXML());
}
?>

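- item($i) - retrieves a node specified by index ($i).
- saveXML() - dumps the internal XML tree back into a string.
The example above outputs the modified XML: the existing 'div' now contains "The new text value",
and it is followed by a second 'div' containing "This is the new inserted DIV".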
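
The result can be reproduced without touching the file. The sketch below is a variant, not the original script: it loads the same markup from a string (an assumption for illustration) and iterates over a snapshot taken with iterator_to_array(), because getElementsByTagName() returns a live list that would otherwise grow as new 'div' elements are appended.

```php
<?php
$xmlDoc = new DOMDocument();
$xmlDoc->loadXML('<html><body bgcolor="#e8e8fe"><div>coursesweb.net - PHP</div></body></html>');

// Snapshot of the live list: nodes appended below are not part of $snapshot.
$snapshot = iterator_to_array($xmlDoc->documentElement->getElementsByTagName('*'));

foreach ($snapshot as $node) {
    if ($node->nodeName == 'div') {
        $node->nodeValue = 'The new text value';
        $new_elm = $xmlDoc->createElement('div', 'This is the new inserted DIV');
        $node->parentNode->appendChild($new_elm);
    }
}

// Serialize only the root element, i.e. without the XML declaration.
$result = $xmlDoc->saveXML($xmlDoc->documentElement);
echo htmlentities($result);
```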


The DOM has many properties and methods for working with XML; some of them are used in the
examples of this lesson.
For the complete list of PHP DOM functions, see the Document Object Model reference in the PHP manual.


SmartDOMDocument: A Smarter PHP DOMDocument Class


Table Of Contents

What Is SmartDOMDocument?
What Is DOMDocument?
So What Exactly Does SmartDOMDocument Do Then?
saveHTMLExact()
Encoding Fix
SmartDOMDocument Object As String
Example
Requirements And Prerequisites
Sounds Great, Where Do I Get It?
Download
Check out from SVN
Use as "svn:externals"
Version History
References
How To Report Bugs

What Is SmartDOMDocument?

SmartDOMDocument is an enhanced version of PHP's built-in DOMDocument class.


SmartDOMDocument inherits from DOMDocument, so it's very easy to use: just declare an
object of type SmartDOMDocument instead of DOMDocument and enjoy the new behavior
on top of all existing functionality (see example below).

What Is DOMDocument?

DOMDocument is a native PHP library for using DOM to read, parse, manipulate, and write
HTML and XML.
Instead of using hacky regexes that are prone to breaking as soon as something you haven't
thought of changes, DOMDocument parses HTML/XML using the DOM (Document Object
Model), just like your browser, and creates an easily manipulatable object in memory.
DOMDocument can actually validate and normalize your HTML/XML.
DOMDocument supports namespaces.


So What Exactly Does SmartDOMDocument Do Then?

DOMDocument by itself is good but has a few annoyances, which SmartDOMDocument
tries to correct. Here are some things it does:

saveHTMLExact()
DOMDocument has an extremely badly designed "feature" where if the HTML code you are
loading does not contain <html> and <body> tags, it adds them automatically (yup, there are
no flags to turn this behavior off).
Thus, when you call $doc->saveHTML(), your newly saved content now has <html><body>
and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a
similar problem).
SmartDOMDocument contains a new function called saveHTMLExact() which does exactly
what you would want: it saves HTML without adding that extra garbage that
DOMDocument does.
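
On newer setups there is also a built-in way to get a similar result. The sketch below does not use SmartDOMDocument and is not how saveHTMLExact() is implemented; it assumes PHP 5.4 or later built against libxml 2.7.8 or later, where loadHTML() accepts the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options.

```php
<?php
$fragment = '<div class="class1"><p>Some Text</p></div>';

$doc = new DOMDocument();
// NOIMPLIED: do not add implied <html>/<body> elements.
// NODEFDTD:  do not add a default DOCTYPE.
$doc->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$html = $doc->saveHTML();
echo $html;
```

These options tend to work best when the fragment has a single root element; with several top-level elements the parser may nest them in unexpected ways.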

Encoding Fix
DOMDocument notoriously doesn't handle encoding (at least UTF-8) correctly and garbles
the output.
SmartDOMDocument tries to work around this problem by enhancing loadHTML() to deal
with encoding correctly. This behavior is transparent to you: just use loadHTML() as you
would normally.
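
The text does not show how SmartDOMDocument implements this fix. As an illustration only, one widely used workaround with plain DOMDocument is to declare the charset in the markup itself before calling loadHTML(), so that the parser reads the bytes as UTF-8:

```php
<?php
$html = '<p>Caf' . "\xC3\xA9" . '</p>';   // "Café" encoded as UTF-8

$doc = new DOMDocument();
// Prepend a charset declaration so the HTML parser decodes the input as UTF-8.
$doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);

$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text;
```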

SmartDOMDocument Object As String


You can use a SmartDOMDocument object as a string which will print out its contents.
For example:
echo "Here is the HTML: $smart_dom_doc";
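
String conversion of this kind is typically done with PHP's __toString() magic method. The sketch below shows the general technique only and is an assumption, not SmartDOMDocument's actual source; PrintableDocument is a hypothetical class name.

```php
<?php
// Hypothetical subclass: inherits everything from DOMDocument and adds string conversion.
class PrintableDocument extends DOMDocument {
    public function __toString() {
        return $this->saveHTML();
    }
}

$doc = new PrintableDocument();
$doc->loadHTML('<p>Hello</p>');

// The object is converted to a string when interpolated.
$message = "Here is the HTML: $doc";
echo $message;
```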


Example
This example loads sample HTML using SmartDOMDocument, uses
getElementsByTagName() to find and removeChild() to remove the first <img> tag, then
prints the HTML before and after the removal, followed by the HTML of the removed image.

$content = <<<CONTENT
<div class='class1'>
  <img src='http://www.google.com/favicon.ico' />
  Some Text
  <p></p>
</div>
CONTENT;

print "Before removing the image, the content is: " . htmlspecialchars($content) . "<br>";

$content_doc = new SmartDOMDocument();
$content_doc->loadHTML($content);

$image = '';   // stays empty if no <img> is found

try {
  $first_image = $content_doc->getElementsByTagName("img")->item(0);

  if ($first_image) {
    // detach the image from the document
    $first_image->parentNode->removeChild($first_image);

    $content = $content_doc->saveHTMLExact();

    // copy the removed image into its own document to get its HTML
    $image_doc = new SmartDOMDocument();
    $image_doc->appendChild($image_doc->importNode($first_image, true));
    $image = $image_doc->saveHTMLExact();
  }
} catch(Exception $e) { }

print "After removing the image, the content is: " . htmlspecialchars($content) . "<br>";
print "The image is: " . htmlspecialchars($image);
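
If SmartDOMDocument is not available, a comparable result can be sketched with plain DOMDocument, assuming PHP 5.3.6 or later, where saveHTML() accepts a node argument and serializes only that node. This is an alternative technique, not what saveHTMLExact() does internally.

```php
<?php
$content = "<div class='class1'><img src='http://www.google.com/favicon.ico' />Some Text<p></p></div>";

$doc = new DOMDocument();
$doc->loadHTML($content);

$div = $doc->getElementsByTagName('div')->item(0);
$first_image = $doc->getElementsByTagName('img')->item(0);

$image = '';
if ($first_image) {
    $image = $doc->saveHTML($first_image);   // serialize the image before removing it
    $first_image->parentNode->removeChild($first_image);
}

// Serialize only the <div>, so no <html>/<body> wrapper or DOCTYPE is included.
$after = $doc->saveHTML($div);

print "After removing the image, the content is: " . htmlspecialchars($after) . "<br>";
print "The image is: " . htmlspecialchars($image);
```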

Requirements And Prerequisites

This is no longer a requirement: any version of PHP 5 that has DOMDocument should work
now.
DOMDocument: this should be a built-in class, but I've seen instances of it missing for some
reason. My guess is that 99.9% of the time you will already have it.


Sounds Great, Where Do I Get It?


Download
http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk/SmartDOMDocument.class.php

Check out from SVN


svn co http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk SmartDOMDocument

I highly recommend using SVN (Subversion) because you can easily update to the latest
version by running svn up.

Use as "svn:externals"
If you have an existing project in SVN and you would like to use SmartDOMDocument, you
can set up this library as svn:externals.
svn:externals is kind of like a symlink to another repository from your existing SVN project.
That way, you can still benefit from using SVN commands such as svn up without having to
maintain a local copy of the external code.
You can read more about setting svn:externals here.
Here's how you would do this:
cd YOUR_PROJ_DIR;
svn propset svn:externals 'SmartDOMDocument http://svn.beerpla.net/repos/public/PHP/SmartDOMDocument/trunk' .
svn ci .
svn up
