Choosing The Right File Format/Print version

Table of Contents
1. Introduction
2. Quick Guide to recommended formats
3. Is there a problem?
4. A general look at File Formats
5. Recommendations in detail
   1. Texts and Documents
   2. Web pages
   3. Images (Raster Graphics)
   4. Vector Formats
      1. Vector formats for graphics
      2. Vector Formats for 3D modeling -- lost!
      3. Vector Formats for Architecture, Engineering and Construction industries (CAD)
   5. Databases & Spreadsheets
6. Appendix
Introduction
File formats are the language of a computer's memory. Choosing the right format for the electronic information we want to store is one important step in making good use of computers and minimising problems.
This book tries to help you choose the file format best suited to its use. It concentrates on two purposes of storing your information (data):
• Portability and interoperability
• Digital preservation
Portability and interoperability is the ability of your data to be read (interpreted) by different software and hardware. The most common portable format in use now is PDF (Portable Document Format), used for sending documents over the internet. Somewhat more troublesome is the exchange of address information between email software.
Digital preservation can be defined as long-term, error-free storage of digital information, with means for retrieving and interpreting the needed files for the whole time span that the information is required.
• Preservation
• Portability
Old version of Introduction to merge in
Planning for an unpredictable future is known as future proofing. Although you can never really know whether you are future proof, you can practise risk reduction and learn from the mistakes of history. This article focuses on future proofing of computer files. It gives tips for creating files in a manner which makes them easy to preserve and access later, and for avoiding pitfalls that could make your files difficult to access. Looking after the files you already have is known as digital preservation.
Future proofing and digital preservation deal with the rather ethereal matter of electronic files and their formats. Your next concern is the media on which your information is stored. That is an area of study in itself, and is not the subject of this article.
Future proofing both the information and the media it is stored on is vital in any thorough review of your IT systems. Will you be able to read the files you're working on now in 5 years' time? Do you know if all the old files you have now are still readable?
Whether electronic files will be readable in 5 or 10 years is not something to leave to chance. Active intervention is needed in most cases. Migrating to software and hardware that support openly published standards is the most effective single step in any plan to future-proof.
Quick Guide to recommended formats
For an explanation of terminology read the section "Formats for storing electronic information".
Page 1 of 11 Choosing The Right File Format/Print version - Wikibooks, open books for an open world
8/14/2011 http://en.wikibooks.org/wiki/ChoosingTheRightFileFormat/Printversion
Data type: Documents
Preferred formats:
• Plain text, encoded with either ASCII (limited to the English alphabet) or Unicode UTF-8
• RTF (Rich Text Format) - proprietary, open specification by Microsoft
• PDF (Portable Document Format) - proprietary, open specification by Adobe
• EPS (Encapsulated PostScript) - proprietary, open specification by Adobe
• OOXML aka DOCX (Office Open XML) - controversially non-proprietary, open specification by Ecma International and ISO/IEC
• ODF (OpenDocument Format) - non-proprietary, open specification by OASIS and ISO
Common open-source applications: Notepad++ (http://notepad-plus.sourceforge.net/uk/about.php), gedit (http://www.gnome.org/projects/gedit/), kate, AbiWord, OpenOffice.org Writer, KWord, PDFCreator, Geany, vim, emacs
Common proprietary applications: Notepad, Adobe Acrobat, Corel WordPerfect, Microsoft Word, Star Office

Data type: Spreadsheets
Preferred formats:
• ODS (OpenDocument Spreadsheet) - non-proprietary, open specification by OASIS and ISO
• OOXML aka XLSX (Office Open XML) - controversially non-proprietary, open specification by Ecma International and ISO/IEC
• CSV (Comma Separated Values) - non-standard, conventional format where a comma is used to separate each value
Common open-source applications: OpenOffice.org Calc, Lotus 1-2-3, Gnumeric, KSpread
Common proprietary applications: Microsoft Excel, Corel Quattro Pro

Data type: Web pages
Preferred formats:
• HTML (HyperText Markup Language) or XML (eXtensible Markup Language) - non-proprietary, open specifications by W3C
Common open-source applications: AbiWord, OpenOffice 2.0, Mozilla Composer, Quanta, Geany
Common proprietary applications: Dreamweaver, Adobe GoLive

Data type: Graphics (raster & vector)
Preferred formats:
• PNG (Portable Network Graphics) - non-proprietary, open specification. Released as ISO/IEC 15948:2003 and W3C Recommendation (ref (http://www.libpng.org/pub/png/spec/iso/))
• JPEG/JPG (Joint Photographic Experts Group) - non-proprietary, open specification by the Joint Photographic Experts Group
• DjVuLibre (http://djvulibre.djvuzone.org/) (pronounced "deja vu") - non-proprietary, open specification by LizardTech (http://www.lizardtech.com/)
• CGM (Computer Graphics Metafile) - non-proprietary, open specification by ISO/IEC
• SVG (Scalable Vector Graphics) - non-proprietary, open specification by W3C
• VRML (Virtual Reality Modeling Language) - non-proprietary, open specification by the Web3D Consortium and ISO/IEC
• DXF (Drawing eXchange Format) - proprietary, partially open specification by Autodesk
Common open-source applications: Gimp, TuxPaint, OpenOffice.org Draw, Blender, Inkscape, kolourpaint
Common proprietary applications: Adobe Photoshop, Adobe Illustrator, 3DStudioMax, AutoCAD

Data type: Audio
Audio is stored in a container holding one or more codecs; together they form a format.
Preferred formats:
• Ogg container with Vorbis, FLAC, Speex and others
• FLAC container with FLAC audio
• MP3, which can contain MP3 audio
Common open-source applications: Audacity, Ardour
Common proprietary applications: Adobe Soundbooth

Data type: Video
Video is also stored in a container holding codecs; together they form a format.
Preferred formats:
• Ogg container with Theora
• WebM container with the VP8 video compression format and the Vorbis audio compression format
• MP4 container with the H.264 video compression format and the MP3 audio compression format
Common open-source applications: Openshot, KDEnlive
Common proprietary applications: Adobe Premiere, Windows Movie Maker

Data type: Database
Databases do not use files in the normal sense; however, a good database can output its content structured with SQL (Structured Query Language), an ANSI/ISO standard. It is also important that it supports ODBC (Open Database Connectivity).
Common open-source applications: PostgreSQL, MySQL, Firebird and InterBase, Kexi
Common proprietary applications: Microsoft SQL Server, Oracle and Sybase (http://www.sybase.com)

Data type: Encrypted files
Preferred formats:
• OpenPGP - non-proprietary, open specification, OpenSSL
Common open-source applications: GnuPG (GNU Privacy Guard)
Common proprietary applications: PGP (Pretty Good Privacy)
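The Database entry in the guide above says that a good database can output its content as SQL. As a rough sketch of what such a dump looks like, the following uses Python's built-in sqlite3 module. SQLite is an assumption made here only because it needs no server; it is not one of the products listed in the guide, and the table and its contents are hypothetical.

```python
import sqlite3


def dump_as_sql(connection):
    """Return the whole database as one string of SQL statements."""
    # iterdump() yields the CREATE TABLE and INSERT statements needed
    # to rebuild the database from scratch.
    return "\n".join(connection.iterdump())


# Hypothetical example data: a tiny in-memory address book.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, email TEXT)")
conn.execute("INSERT INTO contacts VALUES ('Taruna', 'taruna@example.org')")
conn.commit()

sql_text = dump_as_sql(conn)
print(sql_text)
```

The result is plain text, so it can be stored and inspected like any other text file, independently of the database software that produced it.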
Is there a problem?
If you are one of the many people who used to use WordPerfect or WordStar and have since switched to a different editor, you may already be familiar with the problem of retrieving your own information from certain types of files. Or perhaps you switched from one operating system to another, from Amiga to Windows, or Windows to Macintosh. Stated simply, the file formats of different software far too often leave your information scrambled in a way you cannot decipher years later.
If this seems a bit theoretical to you, here are some stories to illustrate the issue of choosing the right format for your information.
The English Tourist
A tourist walks into a very nice restaurant in a lovely village in the French countryside and mutters in English, "Are you still serving lunch?" No one reacts, so he says louder, "Do you have a TABLE where I might DINE?" Recognizing a few words and realizing that the tourist must only speak English or isn't interested in trying his French, one of the employees goes off to find someone who might be able to help this ignorant tourist.
After a long delay, someone comes, interprets his request and finds him a seat in the restaurant. The tourist is handed a menu. "I can't read this! It is in French! What are Cervelles anyways?" The helpful interpreter is called back and the tourist has the whole menu explained to him and is finally ready to order a meal. By now our hapless tourist is getting hungry and frustrated and, as everyone does when frustrated and hungry, forgets his manners and blurts, "By the way, I am going to order in English so I can be sure of what I am getting - and for the privilege of taking my order, I demand that you pay the Queen of England a small sum for the use of this language which you should really just learn to use like everyone else!"
After this last sentence is finally translated back to the previously friendly proprietors, the kitchen is closed and the tourist is sent packing.
In terms of file formats, where this tourist has gone wrong is that, although he is happy with the format he is using (unlike the Roman official in the next story), he has forgotten that different people do things differently. In a different context his preferred format (English) is not supported. This is the situation if your favourite software company goes bust or stops supporting the software you bought. The files which once were so convenient can become useless with time.
The Roman Official
An official in ancient Rome by the name of Gallus hires a scribe called Taruna who understands Latin but can only write in a rare (and unrecorded) dialect of Sanskrit. After Taruna has been in the job for some years Gallus finds he is actually too slow and keeps losing important documents. Taruna is turned out into the street and goes back to his family in disgrace.
The following day the official employs a highly regarded new assistant and sends him into the archive. A few minutes later the assistant comes out in tears, explaining that he only knows a few words of Sanskrit, can't find any references to the dialect used and could never hope to make sense of these documents.
Frantically they search for Taruna. When they find him they ask him to come back to work, but he sees their problem. So he says with a smile, "I will happily come back to work, you just need to double my pay and holidays!"
In modern terms, where the Roman official went wrong was in using an unpublished format (an unrecorded dialect of Sanskrit) to store his information. He was then trapped by this format and forced to keep buying the software (the scribe's services) at ever increasing cost. He had lost control of his own information!
In a report written for The National Archives (UK) in 2003, Adrian Brown summarises how to proceed.
The selection of file formats for creating electronic records should ... be determined not only by the immediate and obvious requirements of the situation, but also by longer-term considerations. An electronic record is not fully fit-for-purpose unless it is sustainable throughout its required life cycle. ... It is therefore highly desirable to identify the minimum set of formats which meet both the active business needs and the sustainability criteria below, and restrict data creation to these formats. [1] (http://www.nationalarchives.gov.uk/preservation/advice/pdf/selectingfileformats.pdf) (PDF)
The approach of Project Gutenberg (http://www.gutenberg.org/) to this challenge has been a strict criterion: all of the 15,000 books in their digital repository are stored in plain ASCII text.
Whenever possible, Project Gutenberg distributes a plain text version of an eBook. Other formats, such as HTML, XML, RTF, and others are also welcome, but plain text is the "lowest common denominator." We stress the inclusion of plain text because of its longevity: Project Gutenberg includes numerous text files that are 20-30 years old. In that time, dozens of widely used file formats have come and gone. Text is accessible on all computers, and is also insurance against future obsolescence. [2] (http://www.gutenberg.org/wiki/Gutenberg:PublicDomaineBookSubmissionHow-To#FileFormats)
Does that mean we cannot use word processors if we want long term access to the information in our documents? Well, yes and no. If you want long term file readability (of Latin script languages) as Project Gutenberg does, then ASCII text is the way to go. This might be something to consider for financial records and other valuable information. If, as many people do, you have non-text information, like images and sounds, then this is the article to read. Either way there are a lot of common errors you can avoid, which will at the very least make future migrations to the next generation of file formats much easier.
Let's now take a real world scenario. Many people use the Microsoft Windows operating system and the Microsoft Office package, which includes the document application Microsoft Word (or just MS Word). The default file format of MS Word is DOC. So what's DOC like for long term storage?
MS Word is a proprietary program and .doc is a proprietary format. That means that how the software works and stores your information is secret - only Microsoft knows exactly how it all works.
Formats for storing electronic information
At any one time there is an enormous variety of file formats in use for various purposes, so how do you choose which one is best? There are three types of file formats:
• Proprietary, closed specifications
• Proprietary, open specifications
• Non-proprietary, open specifications
Proprietary, closed specifications are used by some of the most common software; if you don't use them yourself you probably get sent them. However, because these formats are not publicly documented, you are held hostage to the company making the software. If they decide not to support old versions of their own format, suddenly you can't open your old files! Your choice of software then depends heavily on the ability of any new software to second-guess the format used by your old software. Examples of this type of format are the Microsoft Office Word .doc and Excel .xls formats, and Adobe Photoshop's Document (.psd).
Proprietary, open specifications are somewhat better in that, although the format is still legally owned by one company and developed purely for its commercial benefit, that company has undertaken to document the format openly. It can still choose to switch back to a closed specification, or it may make changes it chooses not to document. In other words, a proprietary open specification is only open as long as the company wants to keep it that way. Examples of this type of format are Adobe's Portable Document Format (.pdf) (patented, although most of the patents are licensed on a royalty-free basis), Adobe's TIFF format (.tiff) and Macromedia Shockwave Flash (.swf) (however, the documentation is under a non-disclosure agreement that requires readers not to contribute to any other implementations of Flash, so in practice it is still closed).
Non-proprietary, open specifications have been openly documented by some public body, or released to one by the developers. Once released these formats have a guaranteed reference point. Examples of this type of format are Portable Network Graphics (.png), Joint Photographic Experts Group (.jpg / .jpeg), MPEG-2 (.mpeg2), eXtensible Markup Language (.xml) (the structure, more than the specific format), and Scalable Vector Graphics (.svg).
One special case is the Adobe Portable Document Format for archival (pdf-archive or PDF/A), a restricted application of the proprietary open specification PDF 1.4 (http://partners.adobe.com/public/developer/pdf/indexreference.html#2) format. It is a published ISO International Standard from 2005 [3] (http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER34607&ICS137&ICS2100&ICS399), developed by the PDF-Archive Committee (http://www.aiim.org/standards.asp?ID25013) in close partnership with the Administrative Office of the U.S. Courts [4] (http://www.fcw.com/article82304-03-14-04-Print). I have not found any software which supports this format, so it is possibly only used in organisations where archiving is the main concern.
One family of formats which could solve many issues is collectively called OpenDocument, developed by OpenOffice.org, OASIS, and many others in the industry (but not Microsoft). OpenOffice.org (http://www.openoffice.org/) 2.0, recently released, uses the OpenDocument family of formats. Of the software supporting OpenDocument, OpenOffice, AbiWord and Google Docs are cross-platform, and KOffice will be as of KDE 4.1 (around July 2008). Mac OS X 10.5's TextEdit can understand the format to some degree, and Microsoft states it will add native support for OpenDocument 1.1 (rather than plug-in converters) to MS Office, as of Spring 2009.
Unfortunately, leading software often defaults to a format which is inherently unsuited to later retrieval. An example is Microsoft Word, which defaults to its native .doc format rather than the better documented and more widely supported Rich Text Format (.rtf), though you can change the default format (http://office.microsoft.com/en-us/assistance/HP052372851033.aspx). Microsoft products are also notable for their use of what Marshall Masters of the Independent Book Publishers Association (http://www.pma-online.org/) calls 'upgrade blackmail' and describes as "Someone with a new version of your desktop application edits your file, and now your older version of the application cannot read it, which forces you to pay for an expensive upgrade if you want to continue working and playing well with others." [5] (http://www.pma-online.org/scripts/shownews.cfm?id1093) That's definitely something for anyone with a budget to avoid.
So, what's the next level of future-proofing? Read on...
Criteria in choosing future proof file formats
Formats for future proofing must:
• Be supported by comprehensive, public documentation
• Be stable, not under constant revision
• Be supported by several software providers
• Be supported on various hardware
• Be supported by software on various operating systems (Windows/Macintosh/Unix/Linux)
• Be free of legal restriction in their use (see PNG not GIF (http://www.gnu.org/philosophy/gif.html))
Additional consideration:
• Popular formats are more likely to remain supported
Criteria in choosing suitable software
There are some software implications of the format criteria. Not all software implements open specifications correctly, so some programs only appear to use a particular format. This is most common with Hyper Text Markup Language (HTML) editors and the notorious (http://philip.greenspun.com/wtr/word.html) 'Save as html' option in Microsoft Word. So if you are buying software, make sure to check this. Open Source (http://www.opensource.org/) / Free (http://www.fsf.org/) software inherently tends to have strong support for open standards.
Choosing The Right File Format/Recommendations
Text & Documents
In most types of organisation, text documents are the most important type of electronic information after financial accounts. Depending on what the document contains there are several types of format to choose from.
There are three types of text documents:
• Plain text files - simple text, no formatting, no font choices.
• Text documents - you can choose fonts, colors, text size and backgrounds, and embed images (sounds, video, etc.).
• Documents for presentation - all the options of Text documents, with restrictions on further editing.
For Plain text files the simplest and most durable format is ASCII (American Standard Code for Information Interchange). It has been in development since 1963 and must be the single most supported format ever. However, it is also very limited. The only formatting available is the choice of line breaks. There is no embedding of images or colors, and there is no support for diacritic marks or non-Latin scripts. There are a variety of other encodings based on ASCII which add support for more characters. In the western world Windows-1252 (which is closely related to ISO-8859-1) is the most common of these. Other parts of the world have other conventions. UTF-8, which can represent texts of all languages in real use, is becoming more common and may be the best choice for long term storage of text.
Text files using an encoding based on ASCII usually carry the .txt suffix, but it can be hard to determine automatically which encoding a given file uses. So it is a good idea to find out what encoding you are using and record it. If you are really paranoid you may also want to find and store the authoritative tables for converting that encoding to Unicode (try http://www.iana.org/assignments/character-sets and http://www.unicode.org/Public/MAPPINGS/).
For Windows users, Notepad is the default application for handling TXT files. Current versions of Notepad assume UTF-8 if the file is completely valid UTF-8 or has a UTF-8 byte order mark, UTF-16 if they detect a UTF-16 byte order mark, and the Windows ANSI code page (1252 for western versions) otherwise. In a pinch it is often possible to use Notepad and similar editors to get the raw text out of other types of files, and it can be informative to try this on other files you plan to store.
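The detection order described above (byte order mark first, then a strict UTF-8 check, then a legacy code page) can be sketched in a few lines of Python. This is an illustration of the idea only, not Notepad's actual implementation; the function names are hypothetical, and the fallback to Windows-1252 is an assumption that only suits western systems.

```python
import codecs


def guess_text_encoding(raw, fallback="cp1252"):
    """Guess the encoding of raw bytes, following the order in the text."""
    # 1. A byte order mark settles the question.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. No mark: accept UTF-8 only if every byte sequence is valid.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # 3. Otherwise assume a legacy code page (assumption: cp1252).
        return fallback


def read_text(path):
    """Read a text file and return (text, encoding_used)."""
    with open(path, "rb") as handle:
        raw = handle.read()
    encoding = guess_text_encoding(raw)
    return raw.decode(encoding, errors="replace"), encoding
```

Whatever such a guess returns, it remains a guess; recording the encoding alongside the file, as suggested above, is the more reliable habit.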
Text documents are what you produce most of the time on one of the many commercial or free word processors. Most of the time you probably use one for writing basic text documents: letters to friends and colleagues, project lists and so on. Applications for this type of text are found in popular office suites like Microsoft Office (http://office.microsoft.com/en-us/default.aspx), AppleWorks (http://www.apple.com/appleworks/) and OpenOffice.org (http://openoffice.org).
For the durability of your documents it is important that the document you write today will still be readable next year. For a long time there was no open standard for documents, so compatibility has been a constant problem. People have had different levels of success when migrating from one document editor to another, as each used its own format. The .doc format is now well supported by several editors.
Whichever word processor you use, it should support several formats, and choosing the most durable of them is very important. While work proceeds on the OpenDocument standard (Version 1.0 was approved as an OASIS standard in May 2005), RTF (Rich Text Format) is the most widely supported and documented format available. You should be able to make this your default format so all future documents are in the RTF format. (Tutorial (http://office.microsoft.com/en-us/assistance/HP052372851033.aspx) on changing the default format in Microsoft Word.) If you choose not to do this because RTF does not support some feature you need, you should still consider using RTF as your archival format. Your formatting may not be represented correctly, but at least your content is there for posterity.
If you spend time making Documents for presentation you'll know that word processors are limited in this area. You might be using programs like Adobe Illustrator (http://www.adobe.com/products/illustrator/main.html) / InDesign (http://www.adobe.com/products/indesign/main.html), sodipodi (http://www.sodipodi.com/) or CorelDRAW (http://www.corel.com/servlet/Satellite?pagenameCorel3/Products/Display&pIid1047024307335&pid1047022690654). These programs are great, but their files can be tricky to archive successfully. There are at least two competing options: PDF, especially PDF/A, from Adobe, and XPS from Microsoft.
The Portable Document Format (PDF) is the file format created by Adobe Systems in 1993 for document exchange. PDF is a fixed-layout format used for representing two-dimensional documents in a manner independent of the application software, hardware, and operating system. Each PDF file encapsulates a complete description of a 2-D document (and, with Acrobat 3-D, embedded 3-D documents) that includes the text, fonts, images, and 2-D vector graphics that compose the documents. PDF is an open standard that has been officially published on July 1, 2008 by the ISO as ISO 32000-1:2008. "The Portable Document Format (PDF)" Wikipedia Online Encyclopedia, accessed July 4th, 2008
PDF/A is described in ISO 19005-1:2005 Document Management - Electronic document file format for long term preservation - Part 1: Use of PDF 1.4 (PDF/A-1) that was published on October 1, 2005. This standard defines a format (PDF/A) for the long-term archiving of electronic documents and is based on the PDF Reference Version 1.4 from Adobe Systems Inc. (implemented in Adobe Acrobat 5). PDF/A is in fact a subset of PDF, leaving out PDF features not suited to long-term archiving. This is similar to the definition of the PDF/X subset for the printing and graphic arts. "PDF/A" Wikipedia Online Encyclopedia, accessed July 4th, 2008
The XML Paper Specification (XPS), formerly codenamed "Metro", is a specification for a page description language and a fixed-document format developed by Microsoft. It is an XML-based (more precisely XAML-based) specification, based on a new print path and a color-managed vector-based document format which supports device independence and resolution independence. "The XML Paper Specification (XPS)" Wikipedia Online Encyclopedia, accessed July 4th, 2008
One word of caution for using PDF files: do not use the inbuilt compression of PDF files and, if possible, use the PDF 1.4 specification.
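A PDF file declares its version in its first bytes, for example %PDF-1.4, so the version advice above can be checked on existing files. The following is a minimal sketch with hypothetical function names. It reads only the declared header version; it does not check whether compression was used, and it does not verify PDF/A conformance.

```python
import re

# A PDF file begins with a header such as b"%PDF-1.4".
_PDF_HEADER = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_header_version(first_bytes):
    """Return (major, minor) from a PDF header, or None if absent."""
    match = _PDF_HEADER.match(first_bytes)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pdf_version_of_file(path):
    """Read the start of a file and report its declared PDF version."""
    with open(path, "rb") as handle:
        return pdf_header_version(handle.read(16))
```

For example, pdf_header_version(b"%PDF-1.4\n") returns (1, 4), while a file that does not start with a PDF header yields None.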
Recommendation
• Use plain ASCII text whenever possible
• Use ODT where formatting is important or where graphics need to be included
• Use PDF or XPS for documents which will not need to be edited in the future
References
• OpenDocument in Wikipedia
Web pages
Producing (X)HTML files can be done with a wide variety of software, and in general web browsers are very forgiving of errors in HTML code. In most cases it would still be unwise to use office applications for creating (X)HTML documents. Although some office software applications are now quite good at generating clean (uncluttered) code, there are always some mistakes. Never under any circumstances use Microsoft Word's "Save as HTML" function. The code it produces is full of non-standard, Microsoft-specific extensions, and the files are very large.
The advantage of (X)HTML is that it can always be read with your eye, whether you have a suitable browser or not. For example, the title of an HTML page (if you look at the code of the file) is surrounded by <title> and </title> so it looks like this: <title>A page about me</title>. This makes (X)HTML ideal for storing text files with structure. You are, however, limited by your ability to create (X)HTML files and by (X)HTML's limitations on formatting.
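Because the markup is plain text, even a very small program can recover structured information from it without a browser. As a sketch, the following pulls the title out of a page using Python's standard html.parser module; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TitleFinder(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only keep text that appears while inside the title element.
        if self._in_title:
            self.title += data


def extract_title(html_text):
    """Return the page title, or an empty string if there is none."""
    finder = TitleFinder()
    finder.feed(html_text)
    return finder.title.strip()


print(extract_title("<html><head><title>A page about me</title></head></html>"))
```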
Recommendation
• Set the DTD and (X)HTML flavour your software uses before starting a new file
• Validate your code with W3C's online validator: http://validator.w3.org/
• Group or compress the (X)HTML and CSS files together in a folder so they do not become separated
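The last recommendation, keeping the (X)HTML and CSS files together, can be sketched with Python's standard zipfile module. ZIP is an assumption here (the text does not name an archive format), and the function name and the set of file extensions are hypothetical.

```python
import zipfile
from pathlib import Path


def bundle_site(folder, archive_path):
    """Store every .html, .htm and .css file under folder in one ZIP."""
    folder = Path(folder)
    wanted = {".html", ".htm", ".css"}
    stored = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file() and path.suffix.lower() in wanted:
                # Keep paths relative so links between files still work.
                relative = path.relative_to(folder).as_posix()
                archive.write(path, relative)
                stored.append(relative)
    return stored
```

Storing relative paths means the pages and their stylesheets keep the same layout when the archive is unpacked somewhere else.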
References
• World Wide Web Consortium: http://www.w3.org/
Images (Raster Graphics)
Most of your images are probably raster graphics, as vector graphics are less common. Although there are hundreds of formats to choose from, two of the most popular are GIF (Graphics Interchange Format) and TIFF (Tagged Image File Format). Unfortunately these have been caught up in legal issues (http://lpf.ai.mit.edu/Patents/Gif/Gif.html) since 1994 over a Unisys patent on the LZW compression algorithm which both formats use. These patents have now expired (http://www.unisys.com/aboutunisys/lzw), although there is still an IBM patent (http://patft.uspto.gov/netacgi/nph-Parser?Sect1PTO1&Sect2HITOFF&dPALL&p1&u/netahtml/srchnum.htm&r1&IG&l50&s14,814,746.WKU.&OSPN/4,814,746&RSPN/4,814,746) valid until August 2006.
Thus the GIF and TIFF formats do not currently meet the requirement of being 'free from legal restriction in their use'.
The lossless PNG (Portable Network Graphics) format replaces GIF and has many advantages in quality, size and options (but lacks animation). Most importantly the PNG format is patent free and has been a W3C Recommendation (http://www.w3.org/TR/PNG/) and ISO Standard (http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER29581&ICS135&ICS2140&ICS3) since 2003.
Another popular format for raster graphics is the lossy JPEG/JPG (Joint Photographic Experts Group) format, which is specially suited to storing photographic images.
"...the JPEG committee have always tried to ensure in their standardisation work that the 'baseline' part of their standards should be implementable without payment of either royalty fees (volume related) or license fees (non-volume related)." JPEG Committee (http://www.jpeg.org/faq.phtml?actionshowanswer&questionidq3I042a5e42Id8)
"...there are many patents associated with some optional features of JPEG, namely arithmetic coding and hierarchical storage. For this reason, these optional features should not be used for long-term storage of valuable images." W3C (http://www.w3.org/Graphics/JPEG/)
Remember that JPEG is a lossy format, meaning that each time the image is modified and resaved there is some irretrievable data loss. It should therefore be avoided for archival use unless a shortage of space makes lossy compression necessary.
(Windows users should read Microsoft Security Bulletin MS04-028 (http://www.microsoft.com/technet/security/bulletin/MS04-028.mspx) and this press release (http://www.jpeg.org/newsrel10.html).)
Recommendation
• Use the PNG format.
• Use the JPEG format, but avoid optional features.
• If your software supports it, you can save in TIFF format without compression.
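Before applying these recommendations it helps to know which formats your existing images really use, since a file extension can be wrong. The sketch below identifies the four raster formats discussed in this section by their leading bytes. The function names are hypothetical, and the check is deliberately shallow: it cannot tell whether a JPEG uses the optional features mentioned above or whether a TIFF is compressed.

```python
# Leading bytes ("magic numbers") of the raster formats discussed above.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
]


def identify_image(first_bytes):
    """Return the format name for the given leading bytes, or None."""
    for signature, name in SIGNATURES:
        if first_bytes.startswith(signature):
            return name
    return None


def identify_image_file(path):
    """Identify an image file from its first eight bytes."""
    with open(path, "rb") as handle:
        return identify_image(handle.read(8))
```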
References
• History of the Portable Network Graphics (PNG) Format (http://www.libpng.org/pub/png/pnghist.html)
• Joint Photographic Experts Group (http://www.jpeg.org/index.html)
A mad scientist cartoon
Understanding Vector Formats
Vector formats represent shapes by describing their geometric properties in points, lines, curves, and polygons. The difference between vector and raster images, and when to use each, is important, but outside the scope of this document.
Sadly there are no widely adopted standard formats for vector images. Part of the reason for this is the variety of uses for vector images. At one extreme of complexity is a DWG (pronounced drawing) file from AutoCAD, which can represent a multi-storey building in three dimensions. At the other extreme is an SVG (Scalable Vector Graphics) file for the elegant graphics used in digital cartoons.
So which format you use depends a lot on how long you will need the information to be stored and who will need to have access to it.
As a starting point here are some vector formats and things to consider:
Typical uses: Architecture, engineering and construction

Format: DWG
Status: Proprietary format of Autodesk. The Open Design Alliance asserts that DWG 2004 is partially encrypted (http://www.opendesign.com/about/whtpaper/alwhtpap.htm).

Format: DWG
Status: Also known as OpenDWG. The Open Design Alliance's version of the DWG file format used by Autodesk. Published as an open standard.

Format: DXF (Drawing Interchange Format, or Drawing Exchange Format)
Status: Promoted by Autodesk as the preferred format for interoperability with other CAD software. Partially documented, with partial support for functions contained in DWG files. Its limitations are described in more detail in the white paper "Why Isn't DXF Good Enough?" (http://www.opendesign.com/about/whtpaper/whynot.htm)

Format: DGN (DesiGN file)
Status: Also known as OpenDGN. The native format of MicroStation, a product of Bentley Corp. DGN is
In 2001, the Scalable Vector Graphics (SVG) (http://www.w3.org/Graphics/SVG/) format became a W3C Recommendation.
Single layer images
Another application of 2D vector graphics is for images where the image itself is the final product. Commercial products that do this include Adobe Illustrator and
Macromedia Flash.
Adobe Illustrator uses the proprietary AI format (named after the product). Work is under way on a platform- and product-independent
format called SVG (Scalable Vector Graphics). The SVG format has much of the functionality of Macromedia's SWF format plus further features,
including text that can be indexed by search engines. Although at the time of writing SVG was not a mainstream format, its use is growing
and it seems likely to become a standard in wide use [2]. Development of the SVG format is carried out by the W3C, and at present SVG version 1.0 is a
W3C Recommendation.
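The point about searchable text can be sketched as follows. Because SVG is XML, any text in a drawing is stored as ordinary characters that a program can read. The sample drawing and the `extract_text` helper are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# A hypothetical drawing containing two text labels.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 100">
  <rect x="10" y="10" width="180" height="80" fill="none" stroke="black"/>
  <text x="20" y="40">Ground floor plan</text>
  <text x="20" y="70">Scale 1:100</text>
</svg>"""

def extract_text(svg_source: str) -> list[str]:
    """Collect the contents of every <text> element in the drawing."""
    root = ET.fromstring(svg_source)
    return ["".join(node.itertext()).strip() for node in root.iter(f"{SVG_NS}text")]

labels = extract_text(SAMPLE_SVG)
print(labels)
```

A binary or raster format would need format-specific tools or text recognition to achieve the same result.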
Multiple layer images
For print shops and graphic designers it is often necessary to store the original working files. These may contain many layers of different images, as well as
some record of previous changes made to the file.
The main players in this field are the Photoshop Document (.psd) format and Corel Photo-Paint. The main open source rival is the GIMP Image File (.xcf) format.
Because GIMP is an open source application, the risk of its format becoming unreadable is low; however, the two applications should not really be
directly compared.
TODO: Research availability and licensing of .psd and .xcf format specifications
Adobe Acrobat Reader includes support for SVG. Adobe also has a standalone SVG viewer which can easily be embedded into Internet Explorer browsers.
Recommendation
• Do not use vector graphics to store important information unless unavoidable.
• Keep original files on CD with copies in SVG format. Keep paper copies.
3D vector format
Three-dimensional vector files can be divided into two groups in the same way as 2D file formats. Programmes for producing 3D files include AutoCAD,
ArchiCAD, 3D Studio Max, Rhino etc. The main openly published standard format is VRML (Virtual Reality Modeling Language). Another widely used
format is DXF, which is promoted by Autodesk (the makers of AutoCAD) as the transportable file format for 3D drawings. The DXF format, as implemented by
AutoCAD, is not an ideal solution, as outlined in "Why Isn't DXF Good Enough?" by the Open Design Alliance. It is, however, well supported by many
programmes and is widely used to transport files from one programme to another.
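One reason DXF is so widely supported is that its ASCII variant is plain text: group codes and values alternate line by line, and group code 0 marks the start of a new item. The sketch below reads a hand-written fragment containing a single LINE entity. The sample data and helper names are made up, and a real DXF file contains header, table and block sections that this sketch makes no attempt to handle.

```python
# A hand-written DXF fragment (hypothetical sample): one LINE entity.
SAMPLE_DXF = """0
SECTION
2
ENTITIES
0
LINE
8
0
10
0.0
20
0.0
11
5.0
21
5.0
0
ENDSEC
0
EOF
"""

def read_groups(text: str) -> list[tuple[int, str]]:
    """Pair each group code line with the value line that follows it."""
    lines = [line.strip() for line in text.splitlines()]
    return [(int(code), value) for code, value in zip(lines[0::2], lines[1::2])]

def line_entities(groups: list[tuple[int, str]]) -> list[dict[int, str]]:
    """Collect the groups belonging to each LINE entity."""
    entities, current = [], None
    for code, value in groups:
        if code == 0:  # group code 0 starts a new item
            current = {} if value == "LINE" else None
            if current is not None:
                entities.append(current)
        elif current is not None:
            current[code] = value
    return entities

lines_found = line_entities(read_groups(SAMPLE_DXF))
print(lines_found)
```

In the LINE entity, group code 8 holds the layer name, codes 10 and 20 the start point, and codes 11 and 21 the end point.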
Recommendation
• Do not use vector graphics to store important information unless unavoidable.
• Keep original files on CD with copies in VRML and/or DXF format. Keep paper copies.
Vector files (2D & 3D scalable line drawings)
Vector files present special challenges, partly because they are used for everything from 3-dimensional models of aircraft to scalable desktop icons. To talk
about one format for all the different uses would therefore be misleading. Instead we'll break vector file formats down by their purpose.
File type | Reviewed formats
3 Dimensional models (architectural, animation, engineering...) | DWG, DXF, VRML, IFC
2 Dimensional drawings (architectural, engineering...) | SVG, DXF, DWG
3 Dimensional models
3D computer graphics can model or represent almost anything found in the physical world, and much more. Because of the variety of uses, progress has been
slow on adopting standard formats for exchanging and storing data. Many good programs can import and export in several proprietary formats, although the
quality of the resulting file can be suspect because in many cases developers have had to experiment and guess their way into how the external format works.
In the fields of architecture, engineering and building, some formats (and families of formats) have begun to emerge as contenders for the role of an
industry-wide standard. One such format (are there others?) is the Industry Foundation Classes format (IFC), which is now compulsory for state-supported building
projects in Denmark and many state-supported Finnish projects, ref (http://www.senaatti.fi/document.asp?siteID2&docID517).
Until the adoption of object class formats like IFC, the main contenders for being called a standard are the proprietary but openly documented Drawing Exchange Format
(DXF) from Autodesk and the ISO-standardised Virtual Reality Modeling Language (VRML).
Because the use of AutoCAD (an Autodesk product) is so widespread, there has been a very successful attempt at supporting its format in other products.
The Open Design Alliance has produced the commercially licensed OpenDWG (DWG format) as a format compatible with AutoCAD's own DWG format.
Many competitors of AutoCAD now offer support for DWG via this format.
The open source 3D modeling application Blender can export in both DXF and VRML.
2 Dimensional drawings
Two-dimensional vector graphic formats can be divided into two groups. Commercial products like AutoCAD and ArchiCAD use 2D (and 3D) vector
information to make highly advanced, multi-layered drawings for architects and engineers.
In the graphics and, to some extent, the animation industry, formats like the proprietary Flash format and the open standard SVG are popular. Many
Adobe programs use a variety of formats which may well make very good working files in the parent program, but which can be a problem to open later,
especially when a copy of that program is no longer available.
(this section needs expansion)
Recommendations
• 3 Dimensional Models: Keep the original working file and make copies in the DXF (to protect metadata) and VRML (to secure visual elements)
formats.
• 2 Dimensional Drawings: Keep the original working file and make a copy in the DXF format or SVG format, depending on which is best suited.
References
CAD Standards (not fully incorporated in this section yet)
Databases & Spreadsheets
Databases are inherently good for long-term storage of information. Because of the way they are constructed, it is generally easy to extract information and
restructure it for storage elsewhere.
The backbone of standards-compliant databases is Structured Query Language (SQL). SQL is not a format for storing database information; it is a language for
expressing the requests made to a database. In other words, a database stores your information and SQL is the language for retrieving that information.
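The division of labour between the database and SQL can be shown with a minimal sketch using Python's built-in sqlite3 module. The table, column names and sample rows are made up for illustration; SQLite is used only because it needs no installation.

```python
import sqlite3

# An in-memory database; table and column names are made up for illustration.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE contacts (name TEXT, city TEXT)")
connection.executemany(
    "INSERT INTO contacts (name, city) VALUES (?, ?)",
    [("Ada", "London"), ("Linus", "Helsinki"), ("Grace", "New York")],
)

# The SQL statement is the request; the database holds the information.
rows = connection.execute(
    "SELECT name FROM contacts WHERE city = ? ORDER BY name", ("Helsinki",)
).fetchall()
print(rows)
connection.close()
```

The same SELECT statement could, in principle, be sent to any database product that understands standard SQL.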
For a long time SQL was developed in different places by different people, and so for some time the 1999 version has been widely used as a safe
bet. Now SQL:2003 is an ISO/IEC standard and, very wisely, there are "No changes or conformance requirements - Products conforming to Core SQL:1999
should conform automatically to SQL:2003" [6] (http://www.wiscorp.com/sql/SQL2003Features.pdf).
Part of the beauty of the SQL standard is that you can extract your information together with the structural information needed to put that information into
another database. The resulting file is often called an 'SQL dump'.
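A minimal sketch of producing and reloading an SQL dump, again using Python's built-in sqlite3 module, might look like this. The table and its single row are made up. Note that the statements SQLite writes are its own dialect and are not guaranteed to load unchanged into every other database product.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE books (title TEXT, year INTEGER)")
source.execute("INSERT INTO books VALUES ('Choosing The Right File Format', 2006)")
source.commit()

# iterdump() yields the SQL statements that recreate structure and data.
dump_text = "\n".join(source.iterdump())
print(dump_text)

# Loading the dump into a second, empty database restores the same content.
target = sqlite3.connect(":memory:")
target.executescript(dump_text)
restored = target.execute("SELECT title, year FROM books").fetchall()
print(restored)
```

The dump is ordinary text, so it can be inspected, versioned and archived like any other text file.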
Recommendations
• Do not use MS Access: it is not a fully functional RDBMS and does not support standards-compliant SQL queries.
• Use a database which can import and export/dump in SQL.
• Check that your SQL is standard SQL:1999 or SQL:2003.
• Back up the information in your database regularly as a TXT file or SQL file.
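As an illustration of the plain-text backup recommended above, the following sketch writes a table out as comma-separated text. The `export_table` helper, the table and its rows are all hypothetical, and comma-separated values are only one of several reasonable plain-text layouts.

```python
import csv
import io
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE contacts (name TEXT, city TEXT)")
connection.executemany(
    "INSERT INTO contacts VALUES (?, ?)",
    [("Ada", "London"), ("Linus", "Helsinki")],
)

def export_table(conn: sqlite3.Connection, table: str) -> str:
    """Return the rows of `table` as comma-separated text with a header row."""
    # The table name is inserted directly, so it must come from a trusted source.
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()

backup_text = export_table(connection, "contacts")
print(backup_text)
```

In practice the text would be written to a dated file and stored alongside an SQL dump, since the plain-text export keeps the data but not the table definitions.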
References
• Migrating from Microsoft Access to MySQL (http://www.kitebird.com/articles/access-migrate.html)
• SQL: The Standard and the Language (http://www.opengroup.org/public/tech/datam/sql.htm)
• Databases in the Open Directory Project (http://dmoz.org/Computers/Software/Databases/)
• SQL:1999 validators (http://www.google.com/search?qSQL-99Validator)
• SQL:2003 validator (http://developer.mimer.com/validator/parser200x/index.tml)
Related Reading
• Read
• Recommended Data Formats (http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf) for Preservation Purposes in the FCLA Digital Archive.
• Digital preservation: a time bomb for Digital Libraries (http://www.uky.edu/~kiernan/DL/hedstrom.html) - Margaret Hedstrom
• Digital Preservation in Wikipedia
• Guidelines for the Preservation of Digital Heritage (http://unesdoc.unesco.org/images/0013/001300/130071e.pdf) (PDF) UNESCO, March 2003.
• European Interoperability Framework for pan-European eGovernment Services, pdf (1449 KB) (http://ec.europa.eu/idabc/servlets/Doc?id19528), 2004.
Includes the EU definition of Open Standards (p 9) and outlines reasons for giving strong consideration to open source software (p 10).
• Public Sector Use of Open IT Standards and Open Source Software (http://odin.dep.no/mod/norsk/dok/hoeringer/paahoering/050021-080002/dok-nu.html) in the Norwegian public sector
• The Interoperability Framework (http://standarder.oio.dk/English/) is the Danish e-Government Interoperability Framework for exchange, storage and
availability of electronic information
• Openformats.org (http://www.openformats.org/)
• Wikipedia: Comparison of document markup languages
• "Holding My Data Hostage: Why software licenses should not expire" (http://stereopsis.com/hostage.html) article by Michael Herf 2001-05-08
• "Planning for longevity" (http://embedded.com/showArticle.jhtml?articleID22103292) article by Jack Ganssle 2004-07-01
Standards organisation
• World Wide Web Consortium (W3C) (http://www.w3.org/)
• Organization for the Advancement of Structured Information Standards (OASIS) (http://www.oasis-open.org/)
• International Organization for Standardization (ISO) (http://www.iso.org/)
• American National Standards Institute (ANSI) (http://www.ansi.org/)
• PDF-Archive Committee (http://www.aiim.org/standards.asp?ID25013)
File types for further research
Feel free to expand this list.
• Contact lists (LDAP for external address books, LDIF - LDAP Data Interchange Format)
• Audio files (Ogg Vorbis) and the problems with MP3 as a proprietary format: Fraunhofer patents (http://www.iis.fraunhofer.de/amm/legal/)
• Video files: Moving Picture Experts Group MPEG-2 & MPEG-4 include Fraunhofer patents (http://www.iis.fraunhofer.de/amm/legal/)
• 3D Modeling: ASCII Alias/Wavefront OBJ. What is SQL DDL?
• Financial records
• Calendars: ScheduleWorld (http://www.scheduleworld.com/)
• Diagrams: Dia over Microsoft Visio
File formats for portability
• Address book information: LDIF
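To give an idea of what an LDIF address book entry looks like, the sketch below writes one record as "attribute: value" lines, using base64 (marked by a double colon) for values that are not plain ASCII. The helper names, the directory layout in the `dn` and the sample person are hypothetical. Details of the format such as folding long lines are left out, so this is an approximation rather than a complete LDIF writer.

```python
import base64

def ldif_line(attribute: str, value: str) -> str:
    """Format one attribute; values that are not plain, safe ASCII are base64-encoded."""
    safe = (
        value.isascii()
        and value == value.strip()
        and not value.startswith((":", "<"))
        and "\n" not in value
        and "\r" not in value
    )
    if safe:
        return f"{attribute}: {value}"
    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return f"{attribute}:: {encoded}"

def ldif_record(dn: str, attributes: list[tuple[str, str]]) -> str:
    """Build one record: the distinguished name first, then the attributes."""
    lines = [ldif_line("dn", dn)]
    lines += [ldif_line(name, value) for name, value in attributes]
    return "\n".join(lines) + "\n"

record = ldif_record(
    "cn=Ada Lovelace,ou=people,dc=example,dc=org",
    [
        ("objectClass", "inetOrgPerson"),
        ("cn", "Ada Lovelace"),
        ("sn", "Lovelace"),
        ("mail", "ada@example.org"),
    ],
)
print(record)
```

Because the result is plain text, it can be read by eye as well as imported by email software that supports LDIF.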
Retrieved from "http://en.wikibooks.org/wiki/ChoosingTheRightFileFormat/Printversion"
• This page was last modified on 2 March 2006, at 09:55.
• Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. See Terms of Use for details.