Author(s): Adam Zajac
Last Revised: 2008.12.12

This work is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported License.

Working with Tar Files in Python
1. Introduction
    1.a Background Reading
2. Tutorial
    2.a Adding Files
    2.b File Information
    2.c Extracting Files
3. Examples
    3.a Archiving Select Files from a Directory
4. Extending
    4.a Removing Files

1. Introduction
"Tar" is an archiving format that has become rather popular in the open source world. In essence, it takes several files and bundles them into one file. Originally, the tar format was made for tape archives, hence the name; today it is often used for distributing source code or for making backups of data. Most Linux distributions have tools in the standard installation for creating and unpacking tar files. Python's standard library comes with a module which makes creating and extracting tar files very simple. Examples of when individuals might want such functionality include programming a custom backup script or a script to create a snapshot of other personal projects.

Background Reading
Both tar files and Python's tarfile module are extensively documented. In addition to this document, the following resources are recommended reading:

    Wikipedia: tar file
    Python Library Reference 12.5: tarfile

2. Tutorial
This is a basic tutorial designed to teach three things: how to add files to an archive, how to retrieve information on files in the archive, and how to extract files from the archive.

Adding Files
To begin, import the tarfile module. Then create what is called a "TarFile object" by calling tarfile.open(). This is an object with special functions for interacting with the tar file. In this case, we are opening the file "archive.tar.gz". Note that the mode is "w:gz", which opens the file for writing with gzip compression. As usual, "w" does not preserve the previous contents of the file. If the tar file already exists, use "a" to append files to the end of the archive (n.b.: you cannot use append with a compressed archive - there is no such mode as "a:gz").

Create a TarFile Object

    >>> import tarfile
    >>> tar = tarfile.open("archive.tar.gz", "w:gz")
    >>> tar
    <tarfile.TarFile object at 0x2af77c060990>

Adding files to the archive is very simple. If you want the file to have a different name in the archive, use the arcname option.

Adding a File to the Archive

    >>> tar.add("file.txt")
    >>> tar.add("file.txt", arcname="new.txt")

Adding directories works in the same way. Note that by default a directory will be added recursively: every file and folder under it will be included. This behavior can be changed by setting recursive to False.

Adding a Directory to the Archive

    >>> tar.add("docs/")
    >>> tar.add("financial/", recursive=False)

As with normal file objects, always be sure to close a TarFile object.

Close the TarFile Object

    >>> tar.close()
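The open, add, and close steps described above can also be written with a with statement, which closes the archive automatically. The following is only a minimal sketch: the helper name make_archive and the scratch files it creates are hypothetical and exist purely for illustration.

```python
import os
import tarfile
import tempfile

def make_archive(archive_path, sources):
    """Write each (path, arcname) pair in `sources` into a gzip-compressed tar."""
    # The with statement closes the archive even if add() raises.
    with tarfile.open(archive_path, "w:gz") as tar:
        for path, arcname in sources:
            tar.add(path, arcname=arcname)

# Work in a scratch directory so that no real files are touched.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "file.txt")
with open(source, "w") as handle:
    handle.write("hello\n")

archive = os.path.join(workdir, "archive.tar.gz")
# Add the same file twice: once under its own name, once renamed via arcname.
make_archive(archive, [(source, "file.txt"), (source, "new.txt")])

# Reopen the archive for reading to confirm what was stored.
with tarfile.open(archive, "r:gz") as tar:
    names = tar.getnames()
```

Here names holds the member names in the order they were added.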

File Information
The tarfile module includes the ability to retrieve information about the individual contents of a tar file. Each item is accessed as a "TarInfo object". For example, getmembers() will return a list of all TarInfo objects in a tar file:

Listing TarInfo Objects

    >>> import tarfile
    >>> tar = tarfile.open("archive.tar.gz", "r:gz")
    >>> members = tar.getmembers()
    >>> members
    [<TarInfo 'text.txt' at 0x2b0b73e46a90>, <TarInfo 'text2.txt' at 0x2b0b73e46ad0>]

Each TarInfo object has several attributes and methods associated with it. Some examples are below, and a full list can be found in the tarfile section of the Python Library Reference.

TarInfo information

    >>> members[0].name
    'text.txt'
    >>> members[0].isfile()
    True
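As a rough illustration of how TarInfo objects might be used together, the sketch below summarizes every member of an archive. The helper name describe is hypothetical, and the archive is built in memory only so that the example is self-contained.

```python
import io
import tarfile

def describe(tar):
    """Return a (name, kind, size) tuple for every member of an open TarFile."""
    rows = []
    for member in tar.getmembers():
        if member.isdir():
            kind = "dir"
        elif member.isfile():
            kind = "file"
        else:
            kind = "other"  # links, devices, and so on
        rows.append((member.name, kind, member.size))
    return rows

# Build a one-member archive in memory for demonstration purposes.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    data = b"hello"
    info = tarfile.TarInfo(name="text.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Rewind the buffer and read the archive back.
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    rows = describe(tar)
```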

Extracting Files
Extracting the contents is a very simple process. To extract the entire tar file, simply use extractall(). This will extract the contents into the current working directory. Optionally, a path may be specified to have the tar extract elsewhere.

Extracting an entire tar file

    >>> import tarfile
    >>> tar = tarfile.open("archive.tar.gz", "r:gz")
    >>> tar.extractall()
    >>> tar.extractall("/tmp/")

If only specific files need to be extracted, use extract().

Extracting a single file from a tar file

    >>> import tarfile
    >>> tar = tarfile.open("archive.tar.gz", "r:gz")
    >>> tar.extract("text.txt")

You should be aware that there is at least one security concern to take into account when extracting tar files. Namely, a tar can be designed to overwrite files outside of the current working directory (/etc/passwd, for example) by using absolute paths or ".." components in member names. Never extract a tar as the root user if you do not trust it.
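One way to reduce the risk described above is to inspect member names before extracting anything. The sketch below is an assumption-laden illustration, not a complete defense: the helper names is_within, unsafe_members, and careful_extractall are hypothetical, and the check covers only member paths, not symbolic or hard links stored inside the archive. Recent Python releases also accept a filter argument to extractall() for this purpose.

```python
import io
import os
import tarfile

def is_within(directory, target):
    """Return True if `target` resolves to a location inside `directory`."""
    directory = os.path.realpath(directory)
    target = os.path.realpath(target)
    return os.path.commonpath([directory, target]) == directory

def unsafe_members(tar, destination):
    """Return the names of members that would land outside `destination`."""
    return [member.name for member in tar.getmembers()
            if not is_within(destination, os.path.join(destination, member.name))]

def careful_extractall(archive_path, destination="."):
    """Extract everything, but refuse if any member path escapes `destination`."""
    with tarfile.open(archive_path, "r:*") as tar:
        flagged = unsafe_members(tar, destination)
        if flagged:
            raise ValueError("refusing to extract: {0}".format(flagged))
        tar.extractall(destination)

# Demonstration: an in-memory archive with one harmless and one hostile name.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name in ("ok.txt", "../evil.txt"):
        tar.addfile(tarfile.TarInfo(name))  # empty members are enough here

buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    flagged = unsafe_members(tar, "dest")
```

Only the member whose path climbs out of the destination directory ends up in flagged.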

3. Examples
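Archiving Select Files from a Directory

    import os
    import tarfile

    # Only files ending in one of these extensions are archived.
    whitelist = ['.odt', '.pdf']
    contents = os.listdir(os.getcwd())

    tar = tarfile.open('backup.tar.gz', 'w:gz')
    for item in contents:
        # Compare the last four characters of the name to the whitelist.
        if item[-4:] in whitelist:
            tar.add(item)
    tar.close()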
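The example above compares the last four characters of each name, so it only matches extensions that are exactly three letters long. A possible variation, sketched below, uses os.path.splitext() so that extensions of any length work. The function name archive_by_extension and the scratch files are hypothetical and serve only to illustrate the idea.

```python
import os
import tarfile
import tempfile

def archive_by_extension(directory, archive_path, extensions):
    """Archive regular files in `directory` whose extension is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    added = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            # splitext() returns (root, extension), e.g. ("a", ".odt").
            extension = os.path.splitext(name)[1].lower()
            if os.path.isfile(path) and extension in wanted:
                tar.add(path, arcname=name)
                added.append(name)
    return added

# Demonstration in a scratch directory with a mix of file types.
workdir = tempfile.mkdtemp()
for name in ("a.odt", "b.pdf", "c.txt", "d.jpeg"):
    with open(os.path.join(workdir, name), "w") as handle:
        handle.write(name)

backup = os.path.join(workdir, "backup.tar.gz")
added = archive_by_extension(workdir, backup, [".odt", ".pdf"])
```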

4. Extending
Removing Files
The tarfile module does not contain any function to remove an item from an archive. It is presumed that this is because of the nature of tape drives, which were not designed to move back and forth (consider this post to the Python tutor mailing list). Nevertheless, other programs for creating tar archives do have a delete feature. The following code uses the popular GNU tar program that comes with most Linux distributions. The "--delete" flag is documented in the GNU tar manual; note that the manual warns against using it on an actual tape drive. The reliance on an external program obviously makes the code far less portable, but it is suitable for personal scripts.

Removing an Item from a Tar

    import subprocess

    def remove(archive, unwanted):
        # --delete is a GNU tar feature, so check which tar is installed.
        external = subprocess.getoutput("tar --version")
        if external[:13] != "tar (GNU tar)":
            raise Exception("err: need GNU tar to delete individual files.")
        command = 'tar --delete --file="{0}" "{1}"'.format(archive, unwanted)
        # Return the exit status of the tar command (0 on success).
        output = subprocess.getstatusoutput(command)[0]
        return output
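If depending on an external program is not acceptable, one alternative is to copy every member except the unwanted one into a new archive and then replace the original. This is a different technique from the GNU tar approach above, shown only as a sketch: the function name remove_member is hypothetical, and the code assumes an uncompressed archive on a seekable file.

```python
import io
import os
import tarfile
import tempfile

def remove_member(archive, unwanted):
    """Rewrite `archive` without the member named `unwanted`.

    Returns True if a member was dropped, False if no member matched.
    """
    directory = os.path.dirname(os.path.abspath(archive))
    # Create the temporary file next to the original so os.replace() works.
    handle, temp_path = tempfile.mkstemp(dir=directory, suffix=".tar")
    os.close(handle)
    removed = False
    try:
        with tarfile.open(archive, "r:") as source, \
                tarfile.open(temp_path, "w") as target:
            for member in source:
                if member.name == unwanted:
                    removed = True
                    continue
                # Only regular files carry data; other types are header-only.
                data = source.extractfile(member) if member.isfile() else None
                target.addfile(member, data)
        os.replace(temp_path, archive)
    except Exception:
        os.remove(temp_path)
        raise
    return removed

# Demonstration: build a two-member archive, then drop one member.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "archive.tar")
with tarfile.open(archive, "w") as tar:
    for name in ("keep.txt", "drop.txt"):
        payload = name.encode("ascii")
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

removed = remove_member(archive, "drop.txt")
with tarfile.open(archive, "r:") as tar:
    remaining = tar.getnames()
```

Because the whole archive is rewritten, this approach costs time and disk space proportional to the archive size, which may matter for large backups.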