
CGI Basics Introduction to CGI, CGI building blocks, CGI Scripting in C, CGI Security

Introduction to CGI

CGI stands for Common Gateway Interface. It is a standard for interfacing external application software with an information server, in this case a web server. In other words, CGI is a convention for passing data back and forth between the server at one end and the application at the other. CGI is not part of the Hypertext Transfer Protocol (HTTP) itself; it specifies how a web server hands an HTTP request to an external program and how that program hands back its response.

Rather than simply giving out static pages, web servers are able to generate dynamic pages depending on a user's input. The input can come from an entry form, buttons, pictures, etc. The information is then passed to a program on the web server, which processes the data and returns a page or image. CGI allows for the interactive, dynamic, flexible features that have become standard on many Web sites, such as guestbooks, counters, bulletin boards, chats, mailing lists, searches, shopping carts, surveys, and quizzes. Several newer, faster means for accomplishing these same kinds of tasks have been developed, but CGI is more flexible in a number of ways.

CGI is commonly used whenever one needs a Web server to run a program in real time, take some kind of action, and then send the results back to a user's browser. The program that does this is commonly known as a CGI program or CGI script. It can be written in any language that allows a file to be executed, such as Perl, C, C++, or shell script; the most common language for CGI scripts is Perl.

The diagram below describes the CGI processing architecture.

How CGI Scripts Work

1. The Web surfer fills out a form and clicks Submit. The information in the form is sent over the Internet to the Web server.
2. The Web server grabs the information from the form and passes it to the CGI software.
3. The CGI software performs whatever validation of this information is required. For instance, it might check to see if an e-mail address is valid. If this is a database program, the CGI software prepares a database statement to add, edit, or delete information in the database.
4. The CGI software then executes the prepared database statement, which is passed to the database driver.
5. The database driver acts as a middleman and performs the requested action on the database itself.
6. The results of the database action are then passed back to the database driver.
7. The database driver sends the information from the database to the CGI software.
8. The CGI software takes the information from the database and manipulates it into the desired format.
9. If any static HTML pages need to be created, the CGI program accesses the Web server computer's file system and reads, writes, and/or edits files.
10. The CGI software then sends the result it wants the Web surfer's browser to see back to the Web server.
11. The Web server sends the result it got from the CGI software back to the Web surfer's browser.

Advantages of Using CGI Programs

o CGI programs can be written in any programming language that can read from standard input (STDIN) and environment variables and write to standard output (STDOUT) on a Web server.
o A CGI program can be used with most Web servers and operating systems. In contrast, Active Server Pages run only on a Microsoft Web server.
o CGI programs can be compiled programs. The program source code is converted into machine language once, when you compile the program. In contrast, script commands must be interpreted each time the script is run. For this reason, compiled programs generally execute (run) faster than interpreted scripts.

Disadvantages of Using CGI Programs

o The drawback of CGI programs is that they do not use Web server resources efficiently, because the server typically starts a separate process for each request.
o On a busy Web site using CGI programs, all of the Web server's main memory could be consumed trying to service multiple submissions of the same HTML form, and the Web server would be very slow in sending responses back to users.
o To solve this problem, vendors are developing products that allow a single CGI program to service multiple submissions of the same form.

Servlets vs. CGI scripts

Advantages of servlets:

o Running a servlet doesn't require creating a separate process each time.
o A servlet stays in memory, so it doesn't have to be reloaded each time.
o There is only one instance handling multiple requests, not a separate instance for every request.
o Untrusted servlets can be run in a sandbox.

Disadvantage of servlets:

o Less choice of languages (CGI scripts can be in any language).

Other Useful CGI Environment Variables

CGI scripts have access to 20 or so environment variables, such as QUERY_STRING and CONTENT_LENGTH mentioned on the main page. Here's the complete list (from "").

The following environment variables are not request-specific and are set for all requests:

o SERVER_SOFTWARE: The name and version of the web server software answering the request (and running the gateway). Format: name/version.
o SERVER_NAME: The web server's hostname, DNS alias, or IP address as it would appear in self-referencing URLs.
o GATEWAY_INTERFACE: The revision of the CGI specification to which this server complies. Format: CGI/revision.

The following environment variables are specific to the request being fulfilled by the gateway program:

o SERVER_PROTOCOL: The name and revision of the information protocol this request came in with. Format: protocol/revision.
o SERVER_PORT: The port number to which the request was sent.
o REQUEST_METHOD: The method with which the request was made. For HTTP, this is "GET", "HEAD", "POST", etc.
o PATH_INFO: The extra path information, as given by the client. In other words, scripts can be accessed by their virtual pathname, followed by extra information at the end of this path. The extra information is sent as PATH_INFO. This information should be decoded by the server if it comes from a URL before it is passed to the CGI script.
o PATH_TRANSLATED: The server provides a translated version of PATH_INFO, which takes the path and does any virtual-to-physical mapping to it.
o SCRIPT_NAME: A virtual path to the script being executed, used for self-referencing URLs.
o QUERY_STRING: The information which follows the ? in the URL which referenced this script. This is the query information. It should not be decoded in any fashion. This variable should always be set when there is query information, regardless of command line decoding.
o REMOTE_HOST: The hostname making the request. If the server does not have this information, it should set REMOTE_ADDR and leave this unset.
o REMOTE_ADDR: The IP address of the remote host making the request.
o AUTH_TYPE: If the server supports user authentication, and the script is protected, this is the protocol-specific authentication method used to validate the user.
o REMOTE_USER: If the server supports user authentication, and the script is protected, this is the username they have authenticated as.

o REMOTE_IDENT: If the HTTP server supports RFC 931 identification, then this variable will be set to the remote user name retrieved from the server. Usage of this variable should be limited to logging only.
o CONTENT_TYPE: For queries which have attached information, such as HTTP POST and PUT, this is the content type of the data.
o CONTENT_LENGTH: The length of the said content as given by the client.

In addition to these, the header lines received from the client, if any, are placed into the environment with the prefix HTTP_ followed by the header name. Any - characters in the header name are changed to _ characters. The server may exclude any headers which it has already processed, such as "Authorization", "Content-type", and "Content-length". If necessary, the server may choose to exclude any or all of these headers if including them would exceed any system environment limits. An example of this is the HTTP_ACCEPT variable which was defined in CGI/1.0. Another example is the header User-Agent.

o HTTP_ACCEPT: The MIME types which the client will accept, as given by HTTP headers. Other protocols may need to get this information from elsewhere. Each item in this list should be separated by commas as per the HTTP spec. Format: type/subtype, type/subtype
o HTTP_USER_AGENT: The browser the client is using to send the request. General format: software/version library/version.

A few variables that you may find handy:

o REQUEST_METHOD: The HTTP method this script was called with. Generally "GET", "POST", or "HEAD".
o HTTP_REFERER: The URL of the form that was submitted. This isn't always set, so don't rely on it. Don't go invading people's privacy with it, either.
o PATH_INFO: Extra "path" information. It's possible to pass extra info to your script in the URL, after the filename of the CGI script. For example, calling a URL that has /path/info/here appended after the script's filename will set PATH_INFO to "/path/info/here". Commonly used for path-like data, but you can use it for anything.
o SERVER_NAME: Your Web server's hostname or IP address (at least for this request).
o SERVER_PORT: Your Web server's port (at least for this request).
o SCRIPT_NAME: The local URL of the script being executed. The CGI standard is unclear on whether the leading slash is included, but most servers today include it.
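The variables listed above can be inspected from C with getenv. The following is only a minimal sketch: the function names header_to_env, print_var, and print_cgi_env are made up for this illustration, and the choice of variables to print is arbitrary. It also shows the HTTP_ naming rule described above; the conversion to upper case is the usual CGI convention.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Convert an HTTP header name such as "User-Agent" into the CGI variable
   name "HTTP_USER_AGENT". Returns 0 on success, -1 if dest is too small.
   (Hypothetical helper, not part of any standard library.) */
int header_to_env(const char *header, char *dest, size_t destsize)
{
    static const char prefix[] = "HTTP_";
    size_t i, n = sizeof prefix - 1;

    if (destsize < n + 1)
        return -1;
    for (i = 0; i < n; i++)
        dest[i] = prefix[i];
    for (; *header != '\0'; header++) {
        if (n + 1 >= destsize)          /* need room for this char and NUL */
            return -1;
        dest[n++] = (*header == '-')
                        ? '_'
                        : (char)toupper((unsigned char)*header);
    }
    dest[n] = '\0';
    return 0;
}

/* Print one variable, or a marker when the server did not set it. */
void print_var(FILE *out, const char *name)
{
    const char *value = getenv(name);
    fprintf(out, "%s=%s\n", name, value != NULL ? value : "(not set)");
}

/* Emit a plain-text CGI response listing a few common variables. */
void print_cgi_env(FILE *out)
{
    static const char *names[] = {
        "SERVER_SOFTWARE", "SERVER_NAME", "GATEWAY_INTERFACE",
        "SERVER_PROTOCOL", "SERVER_PORT", "REQUEST_METHOD",
        "PATH_INFO", "SCRIPT_NAME", "QUERY_STRING",
        "REMOTE_ADDR", "CONTENT_TYPE", "CONTENT_LENGTH"
    };
    size_t i;

    fprintf(out, "Content-Type: text/plain;charset=us-ascii\n\n");
    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        print_var(out, names[i]);
}
```

A CGI program built from this sketch would simply call print_cgi_env(stdout) from main. Variables that the server leaves unset show up as "(not set)" rather than crashing the program, since getenv returns NULL for them.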

CGI Scripting in C (Using a C program as a CGI script)

In order to set up a C program as a CGI script, it needs to be turned into a binary executable program. This is often problematic, since people largely work on Windows whereas servers often run some version of UNIX or Linux. The system where you develop your program and the server where it should be installed as a CGI script may have quite different architectures, so that the same executable does not run on both of them. This may create an unsolvable problem: if you are not allowed to log on to the server and you cannot use a binary-compatible system (or a cross-compiler) either, you are out of luck. Many servers, however, allow you to log on and use the server in interactive mode, as a shell user, and contain a C compiler.

You need to compile and load your C program on the server (or, in principle, on a system with the same architecture, so that binaries produced for it are executable on the server too).

Normally, you would proceed as follows:

1. Compile and test the C program in normal interactive use.
2. Make any changes that might be needed for use as a CGI script. The program should read its input according to the intended form submission method. Using the default GET method, the input is to be read from the environment variable QUERY_STRING. (The program may also read data from files, but these must then reside on the server.) It should generate output on the standard output stream (stdout) so that it starts with suitable HTTP headers. Often, the output is in HTML format.
3. Compile and test again. In this testing phase, you might set the environment variable QUERY_STRING so that it contains the test data as it will be sent as form data. E.g., if you intend to use a form where a field named foo contains the input data, you can give the command

       setenv QUERY_STRING "foo=42"

   (when using the tcsh shell) or

       export QUERY_STRING="foo=42"

   (when using the bash shell; the export is needed so that the program you start sees the variable).
4. Check that the compiled version is in a format that works on the server. This may require a recompilation. You may need to log on to the server computer (using Telnet, SSH, or some other terminal emulator) so that you can use a compiler there.
5. Upload the compiled and loaded program, i.e. the executable binary program (and any data files needed), to the server.
6. Set up a simple HTML document that contains a form for testing the script, etc.

You need to put the executable into a suitable directory and name it according to server-specific conventions. Even the compilation commands needed here might differ from what you are used to on your workstation. For example, if the server runs some flavor of Unix and has the GNU C compiler available, you would typically use a compilation command like

    gcc -o mult.cgi mult.c

and then move (mv) mult.cgi to a directory with a name like cgi-bin. Instead of gcc, you might need to use cc. You really need to check local instructions for such issues.

The filename extension .cgi has no fixed meaning in general. However, there can be server-dependent (and operating system dependent) rules for naming executable files. Typical extensions for executables are .cgi and .exe.

The Hello world test

As usual when starting work with some new programming technology, you should probably first make a trivial program work. This avoids fighting with many potential problems at once and lets you concentrate first on the issues specific to the environment, here CGI. You could use the following program, which just prints Hello world but preceded by HTTP headers as required by the CGI interface. Here the header specifies that the data is plain ASCII text.

    #include <stdio.h>

    int main(void)
    {
        /* HTTP header followed by the empty line that ends the headers */
        printf("Content-Type: text/plain;charset=us-ascii\n\n");
        printf("Hello world\n\n");
        return 0;
    }

After compiling, loading, and uploading, you should be able to test the script simply by entering the URL in the browser's address bar. You could also make it the destination of a normal link in an HTML document. The URL of course depends on how you set things up; the URL for my installed Hello world script is the following:

How to process a simple form

For forms that use METHOD="GET" (as our simple example above uses, since this is the default), CGI specifications say that the data is passed to the script or program in an environment variable called QUERY_STRING.
How a program can access the value of an environment variable depends on the scripting or programming language used. In the C language, you would use the library function getenv (declared in the standard header stdlib.h) to access the value as a string. You might then use various techniques to pick up data from the string, convert parts of it to numeric values, etc.

The output from the script or program to the primary output stream (such as stdout in the C language) is handled in a special way. Effectively, it is directed so that it gets sent back to the browser. Thus, by writing a C program that writes an HTML document onto its standard output, you will make that document appear on the user's screen as a response to the form submission. In this case, the source program in C is the following:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        char *data;
        long m, n;

        /* HTTP header, then CR LF and a newline to end the header block */
        printf("%s%c%c\n", "Content-Type:text/html;charset=iso-8859-1", 13, 10);
        printf("<TITLE>Multiplication results</TITLE>\n");
        printf("<H3>Multiplication results</H3>\n");
        data = getenv("QUERY_STRING");
        if (data == NULL)
            printf("<P>Error! Error in passing data from form to script.");
        else if (sscanf(data, "m=%ld&n=%ld", &m, &n) != 2)
            printf("<P>Error! Invalid data. Data must be numeric.");
        else
            printf("<P>The product of %ld and %ld is %ld.", m, n, m * n);
        return 0;
    }

As a disciplined programmer, you have probably noticed that the program makes no check against integer overflow, so it will return bogus results for very large operands. In real life, such checks would be needed, but such considerations would take us too far from our topic.
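The overflow check mentioned above could look roughly like the following. This is a sketch under the assumption that the operands are of type long, as in the multiplication program; the function name checked_mul is made up for this illustration. It tests the operands against LONG_MAX and LONG_MIN before multiplying, so the multiplication itself never overflows.

```c
#include <limits.h>

/* Store a*b in *result and return 1, or return 0 (leaving *result
   untouched) if the product would not fit in a long.
   (Hypothetical helper for illustration.) */
int checked_mul(long a, long b, long *result)
{
    if (a > 0) {
        if (b > 0) {
            if (a > LONG_MAX / b) return 0;   /* positive overflow */
        } else {
            if (b < LONG_MIN / a) return 0;   /* negative overflow */
        }
    } else {
        if (b > 0) {
            if (a < LONG_MIN / b) return 0;   /* negative overflow */
        } else {
            if (a != 0 && b < LONG_MAX / a) return 0;   /* positive overflow */
        }
    }
    *result = a * b;
    return 1;
}
```

In the multiplication script, the final branch could call checked_mul(m, n, &product) and print an error message instead of the product when it returns 0.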

Note: The first printf function call prints out data that will be sent by the server as an HTTP header. This is required for several reasons, including the fact that a CGI script can send any data (such as an image or a plain text file) to the browser, not just HTML documents. For HTML documents, you can just use the printf function call above as such; however, if your character encoding is different from ISO 8859-1 (ISO Latin 1), which is the most common on the Web, you need to replace iso-8859-1 by the registered name of the encoding (charset) you use.

I have compiled this program and saved the executable program under the name mult.cgi in my directory for CGI scripts. This implies that any form with action="" will, when submitted, be processed by that program.

Consequently, anyone could write a form of their own with the same ACTION attribute and pass whatever data they like to my program. Therefore, the program needs to be able to handle any data. Generally, you need to check the data before starting to process it.
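One way to check the data more strictly than sscanf does is to parse each field with strtol and reject anything left over. The following is a sketch only; the function name query_long is made up for this illustration, and it assumes an &-separated query string whose field values are plain decimal integers (no URL decoding is attempted).

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Look up "name=<integer>" in an &-separated query string.
   Returns 1 and stores the value on success, 0 otherwise.
   (Hypothetical helper for illustration.) */
int query_long(const char *query, const char *name, long *value)
{
    size_t namelen = strlen(name);
    const char *p = query;

    while (p != NULL && *p != '\0') {
        if (strncmp(p, name, namelen) == 0 && p[namelen] == '=') {
            const char *start = p + namelen + 1;
            char *end;
            long v;

            errno = 0;
            v = strtol(start, &end, 10);
            if (end == start || errno == ERANGE)
                return 0;               /* no digits, or out of range */
            if (*end != '\0' && *end != '&')
                return 0;               /* trailing junk such as "42abc" */
            *value = v;
            return 1;
        }
        p = strchr(p, '&');             /* move on to the next field */
        if (p != NULL)
            p++;
    }
    return 0;                           /* field not present */
}
```

With this helper, the multiplication script could call query_long(data, "m", &m) and query_long(data, "n", &n) and report an error unless both return 1. Unlike the sscanf format string, this does not depend on the order of the fields.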

The idea of METHOD="POST"
Let us consider next a different processing for form data. Assume that we wish to write a form that takes a line of text as input so that the form data is sent to a CGI script that appends the data to a text file on the server. (That text file could be readable by the author of the form and the script only, or it could be made readable to the world through another script.)

It might seem that the problem is similar to the example considered above; one would just need a different form and a different script (program). In fact, there is a difference. The example above can be regarded as a pure query that does not change the state of the world. In particular, it is idempotent, i.e. the same form data could be submitted as many times as you like without causing any problems (except a minor waste of resources). However, our current task needs to cause such changes: a change in the content of a file that is intended to be more or less permanent. Therefore, one should use METHOD="POST". This is explained in more detail in the document Methods GET and POST in HTML forms - what's the difference? Here we will take it for granted that METHOD="POST" needs to be used and we will consider the technical implications.

For forms that use METHOD="POST", CGI specifications say that the data is passed to the script or program in the standard input stream (stdin), and the length (in bytes, i.e. characters) of the data is passed in an environment variable called CONTENT_LENGTH.

Reading input
Reading from standard input probably sounds simpler than reading from an environment variable, but there are complications. The server is not required to pass the data in such a way that the CGI script gets an end of file indication when it tries to read more data than there is! That is, if you read e.g. using the getchar function in a C program, it is undefined what happens after reading all the data characters; it is not guaranteed that the function will return EOF.

When reading the input, the program must not try to read more than CONTENT_LENGTH characters.
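The rule above can be sketched in C with fread, which is asked for exactly the announced number of bytes and no more. This is an illustration only: the function name read_body and the max_len limit are assumptions of this sketch, not something the CGI specification prescribes.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read exactly the number of bytes announced in content_length from in.
   Returns a NUL-terminated malloc'd buffer (the caller frees it), or NULL
   when the length is missing, malformed, negative, larger than max_len,
   or the stream delivers fewer bytes than promised.
   (Hypothetical helper for illustration.) */
char *read_body(FILE *in, const char *content_length,
                size_t max_len, size_t *out_len)
{
    char *end, *buf;
    long len;
    size_t got;

    if (content_length == NULL)
        return NULL;
    len = strtol(content_length, &end, 10);
    if (end == content_length || *end != '\0'
        || len < 0 || (unsigned long)len > max_len)
        return NULL;

    buf = malloc((size_t)len + 1);
    if (buf == NULL)
        return NULL;
    got = fread(buf, 1, (size_t)len, in);   /* never asks for more than len */
    if (got != (size_t)len) {
        free(buf);
        return NULL;
    }
    buf[got] = '\0';
    *out_len = got;
    return buf;
}
```

A CGI program would call it as read_body(stdin, getenv("CONTENT_LENGTH"), 4096, &n), where 4096 is an arbitrary upper limit chosen for this example. Because the function never reads past the announced length, it does not depend on the server signalling end of file.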

Sample program: accept and append data

A relatively simple C program for accepting input via CGI and METHOD="POST" is the following:
    #include <stdio.h>
    #include <stdlib.h>

    #define MAXLEN 80
    #define EXTRA 5    /* 4 for field name "data", 1 for "=" */
    /* 1 for added line break, 1 for trailing NUL */
    #define MAXINPUT (MAXLEN + EXTRA + 2)
    #define DATAFILE "../data/data.txt"

    /* Decode form-encoded text in [src, last) into dest, then append a
       line break and a terminating NUL. */
    void unencode(char *src, char *last, char *dest)
    {
        for (; src < last; src++, dest++)
            if (*src == '+')
                *dest = ' ';
            else if (*src == '%') {
                unsigned int code;
                if (sscanf(src + 1, "%2x", &code) != 1)
                    code = '?';
                *dest = (char)code;
                src += 2;
            }
            else
                *dest = *src;
        *dest = '\n';
        *++dest = '\0';
    }

    int main(void)
    {
        char *lenstr;
        char input[MAXINPUT], data[MAXINPUT];
        long len;

        printf("%s%c%c\n", "Content-Type:text/html;charset=iso-8859-1", 13, 10);
        printf("<TITLE>Response</TITLE>\n");
        lenstr = getenv("CONTENT_LENGTH");
        /* Reject a missing, malformed, too short, or too long length. */
        if (lenstr == NULL || sscanf(lenstr, "%ld", &len) != 1
            || len < EXTRA || len > MAXLEN)
            printf("<P>Error in invocation - wrong FORM probably.");
        else if (fgets(input, (int)len + 1, stdin) == NULL)
            printf("<P>Error in invocation - no data received.");
        else {
            FILE *f;

            unencode(input + EXTRA, input + len, data);
            f = fopen(DATAFILE, "a");
            if (f == NULL)
                printf("<P>Sorry, cannot store your data.");
            else {
                fputs(data, f);
                fclose(f);    /* close only a file that was actually opened */
                printf("<P>Thank you! Your contribution has been stored.");
            }
        }
        return 0;
    }

Essentially, the program retrieves the information about the number of characters in the input from the value of the CONTENT_LENGTH environment variable. Then it unencodes (decodes) the data, since the data arrives in the specifically encoded format that was already mentioned. The program has been written for a form where the text input field has the name data (actually, just the length of the name matters here). For example, if the user types Hello there! then the data will be passed to the program encoded as data=Hello+there%21 (with space encoded as + and exclamation mark encoded as %21). The unencode routine in the program converts this back to the original format. After that, the data is appended to a file (with a fixed file name), and a confirmation is sent back to the user.

Having compiled the program, I have saved it as collect.cgi in the directory for CGI scripts. Now a form like the following can be used for data submissions:
    <FORM ACTION="" METHOD="POST">
    <DIV>Your input (80 chars max.):<BR>
    <INPUT NAME="data" SIZE="60" MAXLENGTH="80"><BR>
    <INPUT TYPE="SUBMIT" VALUE="Send"></DIV>
    </FORM>
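The decoding step described above (turning Hello+there%21 back into Hello there!) can also be written so that it works on a length-delimited buffer and tolerates malformed input. The following is a sketch, not the routine used by collect.cgi; the names hex_value and url_decode are made up for this illustration, and copying malformed %XX sequences through unchanged is one possible policy among several.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* Decode len bytes of form-encoded text from src into dest.
   dest must have room for len + 1 bytes, since decoding never grows
   the text. Malformed or truncated %XX sequences are copied through
   unchanged. Returns the decoded length.
   (Hypothetical helper for illustration.) */
size_t url_decode(const char *src, size_t len, char *dest)
{
    size_t i = 0, n = 0;

    while (i < len) {
        if (src[i] == '+') {
            dest[n++] = ' ';
            i++;
        } else if (src[i] == '%' && i + 2 < len
                   && hex_value((unsigned char)src[i + 1]) >= 0
                   && hex_value((unsigned char)src[i + 2]) >= 0) {
            dest[n++] = (char)(hex_value((unsigned char)src[i + 1]) * 16
                               + hex_value((unsigned char)src[i + 2]));
            i += 3;
        } else {
            dest[n++] = src[i++];
        }
    }
    dest[n] = '\0';
    return n;
}
```

For the example in the text, url_decode("Hello+there%21", 14, out) fills out with the original string Hello there!. The bounds check on i + 2 is what keeps a stray % at the very end of the input from causing a read past the buffer.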


Sample program: view data stored on a file

Finally, we can write a simple program for viewing the data; it only needs to copy the content of a given text file onto standard output:
    #include <stdio.h>
    #include <stdlib.h>

    #define DATAFILE "../data/data.txt"

    int main(void)
    {
        FILE *f = fopen(DATAFILE, "r");
        int ch;

        if (f == NULL) {
            /* Report the failure as a small HTML page. */
            printf("%s%c%c\n", "Content-Type:text/html;charset=iso-8859-1", 13, 10);
            printf("<TITLE>Failure</TITLE>\n");
            printf("<P><EM>Unable to open data file, sorry!</EM>");
        }
        else {
            /* Copy the file to standard output as plain text. */
            printf("%s%c%c\n", "Content-Type:text/plain;charset=iso-8859-1", 13, 10);
            while ((ch = getc(f)) != EOF)
                putchar(ch);
            fclose(f);
        }
        return 0;
    }

Notice that this program prints (when successful) the data as plain text, preceded by a header that says this, i.e. has text/plain instead of text/html. A form that invokes that program can be very simple, since no input data is needed:
<form action=""> <div><input type="submit" value="View"></div> </form>


The content of the text file to which the submissions are stored will be displayed as plain text.

Even though the output is declared to be plain text, Internet Explorer may interpret it partly as containing HTML markup. Thus, if someone enters data that contains such markup, strange things would happen. The viewdata.c program takes this into account by writing the NUL character ('\0') after each occurrence of the less-than character (<), so that it will not be taken (even by IE) as starting a tag. (This step is not shown in the listing above.)
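The NUL-insertion trick just described is not part of the listing shown earlier. A copy loop that does it could look roughly like this; the function name copy_guarded is made up for this illustration, and the sketch makes no claim about how the installed program is actually written.

```c
#include <stdio.h>

/* Copy in to out, writing a NUL byte after every '<' so that a browser
   that sniffs the content for HTML does not see the start of a tag.
   Returns the number of bytes written.
   (Hypothetical helper for illustration.) */
long copy_guarded(FILE *in, FILE *out)
{
    long written = 0;
    int ch;

    while ((ch = getc(in)) != EOF) {
        putc(ch, out);
        written++;
        if (ch == '<') {
            putc('\0', out);
            written++;
        }
    }
    return written;
}
```

In the viewing program, the loop that calls putchar for each character would be replaced by a call such as copy_guarded(f, stdout).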

CGI Security

Many CGI developers do not take security as seriously as they should. So before we look at how to make CGI scripts more secure, let's look at why we should worry about security in the first place:

1. On the Internet, your web site represents your public image. If your web pages are unavailable or have been vandalized, that affects others' impressions of your organization, even if the focus of your organization has nothing to do with web technology.
2. You may have valuable information on your web server. You may have sensitive or valuable information available in a restricted area that you may wish to keep unauthorized people from accessing. For example, you may have content or services available to paying members, which you would not want non-paying customers or non-members to access. Even files that are not part of your web server's document tree and are thus not available online to anyone (e.g., credit card numbers) could be compromised.
3. Someone who has cracked your web server has easier access to the rest of your network. Even if you have no valuable information on your web server, you probably cannot say that about your entire network. If someone breaks into your web server, it becomes much easier for them to break into another system on your network, especially if your web server is inside your organization's firewall (which, for this reason, is generally a bad idea).
4. You sacrifice potential income when your system is down. If your organization generates revenue directly from your web site, you certainly lose income when your system is unavailable. However, even if you do not fall into this group, you likely offer marketing literature or contact information online. Potential customers who are unable to access this information may look elsewhere when making their decision.
5. You waste time and resources fixing problems. You must perform many tasks when your systems are compromised. First, you must determine the extent of the damage. Then you probably need to restore from backups. You must also determine what went wrong. If a cracker gained access to your web server, then you must determine how the cracker managed this in order to prevent future break-ins. If a CGI script damaged files, then you must locate and fix the bug to prevent future problems.
6. You expose yourself to liability. If you develop CGI scripts for other companies, and one of those CGI scripts is responsible for a large security problem, then you may understandably be liable. However, even if it is your own company for whom you're developing CGI scripts, you may be liable to other parties. For example, if someone cracks your web server, they could use it as a base to stage attacks on other companies. Likewise, if your company stores information that others consider sensitive (e.g., your customers' credit card numbers), you may be liable to them if that information is leaked.

These are only some of the many reasons why web security is so important. You may be able to come up with other reasons yourself. So now that you recognize the importance of creating secure CGI scripts, you may be wondering what makes a CGI script secure.

CGI Security Breach: Origins and Consequences

The first step towards tackling CGI security issues is finding the origins of the problems. This in turn requires identifying all the different components involved in the entire CGI communication process. (Note that not all of these are involved in all cases.) They are:

1. The user: a person, including an intruder (such as a hacker, masquerader, counterfeiter, or eavesdropper), or a program (such as a virus).
2. HTML form or searchable index.
3. HTTP and CGI protocols.
4. The CGI script.
5. Compiler/interpreter that runs the CGI script (which depends on the language the script is written in).
6. External data (that comes from the user in 1 above).
7. External programs that the script calls.
8. Client-side techniques, such as JavaScript, used in conjunction with the CGI.
9. The Web browser.
10. The Web server.

Figure 1 presents a schematic of the CGI communication between the Web client and server. Here are some remarks on the effect of the above components:

o The main sources of CGI security problems are 2, 4, 6, 7 and 10, which result in insecure data, insecure code, or an insecure server.
o 6 can pose a major security problem when it comes in contact with 7.
o It is possible that 3, 5 or 7 themselves could lead to security problems. Discussion of that is beyond the scope of this article.
o 8 and 9 are client-side technologies and don't really have a direct relationship with the CGI security issue. 8, however, can affect 6.

This article is primarily targeted towards developers who write CGI scripts; however, we have also provided a section for those who use pre-built CGI scripts for their purposes.

Figure 1. CGI Communication between the Web Client and Server.

CGI scripts can present security holes in two ways:

1. They may intentionally or unintentionally leak information about the host system that can result in a break-in.
2. Scripts that process remote user input, such as the contents of a form or a "searchable index" command, may be vulnerable to attacks in which the remote user tricks them into executing commands.

Security holes present in CGI scripts on Web sites can be exploited for various malicious purposes, including the following:

o Critical files, particularly those which contain sensitive information (such as passwords), are stolen, modified or erased by unauthorized users.
o Content is sold to a competitor.
o Information about the host machine is obtained which will allow unauthorized users to have access to the system.
o Commands are executed on the server host machine, allowing unauthorized users to modify the system.
o The site is used to launch attacks against other sites.

There are important security issues when mailing or calling any other program from a CGI script. But CGI security is deep magic, far beyond the scope of this tutorial. A collection of documents discussing security is available on the Web. I can only give a brief example here. In general, you should never write scripts that allow a user's form data to be executed on your system. The most obvious example might be something like

    exec "$in{message}";

This would allow anyone with a browser to execute commands on your system: whatever was submitted through the variable named message on your web form would be run. (Perhaps rm -rf?) Perl has a built-in safeguard against this (taint mode, originally taintperl), as do most web servers, though they are not perfect and can sometimes be circumvented by crafty web surfers. As a more devious and realistic example, suppose your mail program is mail and you put the recipient on the command line:

    $recipient = $in{email_address};
    open(MAIL, "|mail $recipient");

If the user supplied the address as "nobody ; rm -rf", the second command might be executed after the mail program completed. (Recent versions of sendmail have safeguards against this sort of spoofing.) So what can you do?

o Realize you have no control over what form data is passed to your script, and anyone can bypass your form and access your script directly. All they need to do is point their own form at it.
o Study the security documents linked above. These are technical issues, but they make for morbidly interesting reading.
o You should be reasonably safe if you don't execute any other scripts (including mail or other CGI scripts) in your code. (This is the kind of sweeping statement that often proves wrong, so I can't guarantee it.)
o Sanitize any form data that you pass to other scripts that you must execute. s/\W//g; will remove all nonalphanumeric characters from a variable, including punctuation (*.;'/). Even better would be to accept only a pre-determined list of possible answers.
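For CGI programs written in C, the two sanitizing approaches in the list above could be sketched as follows. The function names strip_nonword and is_allowed are made up for this illustration. Note that Perl's \W leaves the underscore in place, so the C version keeps it too, and this sketch deliberately treats every non-ASCII byte as unwanted.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Remove, in place, every character that is not an ASCII letter, digit,
   or underscore (roughly what s/\W//g does for ASCII input).
   (Hypothetical helper for illustration.) */
void strip_nonword(char *s)
{
    char *dst = s;

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (c < 128 && (isalnum(c) || c == '_'))
            *dst++ = (char)c;
    }
    *dst = '\0';
}

/* Return 1 only if value is exactly one of the allowed strings.
   (Hypothetical helper for illustration.) */
int is_allowed(const char *value, const char *const allowed[], size_t count)
{
    size_t i;

    for (i = 0; i < count; i++)
        if (strcmp(value, allowed[i]) == 0)
            return 1;
    return 0;
}
```

Of the two, the allow-list check is the stricter one: instead of trying to enumerate dangerous characters, it accepts only values the script already knows about and rejects everything else.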

I believe the code in this document (including sendmail -t, which keeps the email address off the command line) is reasonably secure. No guarantees though, and if you know otherwise, please let me know. If you'd like to look at a script which goes to great lengths to be security conscious (because it's able to write to any file on your web site) see SiteMgr.
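The idea of keeping the address off the command line can be sketched in C as well: the recipient is written into the message headers, which a program such as sendmail -t reads from its standard input, so no shell ever parses the address. The function name write_message and the fixed Subject line are assumptions of this sketch, and the rejection of CR/LF characters is an extra precaution added here so that an address cannot smuggle in additional header lines.

```c
#include <stdio.h>
#include <string.h>

/* Write a minimal message to out, with the recipient in a To: header
   instead of on a command line. Returns 0 on success, -1 if the address
   is empty or contains CR/LF (which could inject extra headers).
   (Hypothetical helper for illustration.) */
int write_message(FILE *out, const char *to, const char *body)
{
    if (to[0] == '\0' || strpbrk(to, "\r\n") != NULL)
        return -1;
    fprintf(out, "To: %s\n", to);
    fprintf(out, "Subject: Form submission\n\n");
    fprintf(out, "%s\n", body);
    return 0;
}
```

On a POSIX system, out would typically be a pipe opened with popen using a fixed command string that contains no user data. With that arrangement, an address such as "nobody ; rm -rf" is just text inside a header and is never interpreted as a command.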