
Easily Extract Images, Text and Embedded Files from an Office 2007/2010 Document

Microsoft Office 2007 introduced a new XML-based file format for the Office suite of
products. Word files, for example, use the extension “.doc” in Office 2003 and earlier
and “.docx” in Office 2007 and later. Most likely, none of this is new to you.

One thing you may not know, however, is that the new XML-based file formats are actually
compressed (zip) archives which you can open with a zip client. In this article, we are
going to dig into the inner contents of a Word 2007 file using 7-Zip to extract images,
text and embedded files.

Viewing the Internal Contents of an Office 2007 Document

Consider the document below.


Right-click on the document and select Open archive from the 7-Zip context menu.
The document contains a folder structure and XML files which contain all the data used
to render the respective Word file.

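
The same folder structure can also be inspected without a zip client. The following is a minimal sketch, assuming Python's standard zipfile module is available; the function name list_parts is a hypothetical helper introduced for illustration, not part of Office or 7-Zip.

```python
import zipfile


def list_parts(docx):
    """Return the names of all files stored inside an Office 2007+ document.

    `docx` may be a path to a .docx file or an open binary file object.
    """
    # A .docx file is an ordinary zip archive, so ZipFile can open it directly.
    with zipfile.ZipFile(docx) as package:
        return package.namelist()
```

Calling list_parts("report.docx") (the file name is only an example) should return entries such as "word/document.xml", which correspond to the folders and XML files shown by 7-Zip.
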
Extracting Images from an Office 2007 Document

You can view all the embedded images inside of the “\word\media” folder.
These image files can be extracted from the document the same way you would extract
files from a standard zip file. For example, you can drag and drop the entire “media”
folder to your desktop to extract all the images in the document.
The extracted files are the original images used by the document. Inside the document,
there may be resizing or other properties set, but the extracted files are the raw images
without these properties applied.
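
The drag-and-drop step could also be scripted. This is a rough sketch rather than a definitive tool: extract_media is a hypothetical helper name, and it assumes the images live under the “word/media” folder described above.

```python
import os
import zipfile


def extract_media(docx, dest_dir):
    """Copy every file under word/media/ into dest_dir and return the new paths."""
    written = []
    with zipfile.ZipFile(docx) as package:
        for name in package.namelist():
            # Skip everything outside the media folder, and skip directory entries.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            os.makedirs(dest_dir, exist_ok=True)
            target = os.path.join(dest_dir, os.path.basename(name))
            with open(target, "wb") as out:
                out.write(package.read(name))  # raw image bytes, unmodified
            written.append(target)
    return written
```

For example, extract_media("report.docx", "media") would write the raw images into a local “media” folder; both names here are placeholders.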

Extracting Text from an Office 2007 Document

The text you see in the Word document comes from the file “\word\document.xml”
inside of the inner contents. By opening this file in an XML viewer such as XML Notepad
2007, you can see all of the copy in plain text regardless of the style and/or formatting
applied in the document itself.

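
If you would rather pull the text out programmatically than read it in an XML viewer, something like the sketch below may help. It assumes the standard WordprocessingML namespace and that visible text sits in w:t elements inside w:p paragraphs; extract_text is a hypothetical helper, and it makes no attempt to handle tables, text boxes or other nested structures specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_text(docx):
    """Return the document text, one line per paragraph, with formatting ignored."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join the pieces back together.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```
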
Extracting Embedded Files from an Office 2007 Document

Extracting embedded OLE objects and/or attached files does not work as seamlessly.
While you can find and extract the respective resources, they all have a “.bin” extension,
leaving it up to you to figure out the correct file type. Typically you can work this out
by trial and error, using the file names shown on the embedded objects' icons in the
document as hints.

Consider this Word document:


The respective names of these embedded files do not help in determining which is
which.
By systematically guessing and checking the extensions suggested by the captions of the
embedded files (in our example .mp3 and .pdf), you can figure out which extension
goes with which file.

When the file extension is assigned correctly, it should open in the respective program.
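
As a possible shortcut to guessing, you could check the first few bytes of each “.bin” file for a known signature. The sketch below is only an illustration under several assumptions: it covers just the two types from this example (PDF files start with “%PDF”, and MP3 files that carry an ID3 tag start with “ID3”), it assumes the embedded resources sit in a “word/embeddings” folder, and it assumes the file begins with its own signature, which may not hold if Word wrapped it in a container. The helper names are hypothetical.

```python
import zipfile

# Leading bytes for the two file types used in this example (an assumption-laden,
# deliberately short list; extend it as needed).
SIGNATURES = [
    (b"%PDF", ".pdf"),
    (b"ID3", ".mp3"),  # only MP3 files that begin with an ID3 tag
]


def guess_extension(data):
    """Return a likely extension for the given bytes, or None if unrecognized."""
    for magic, extension in SIGNATURES:
        if data.startswith(magic):
            return extension
    return None


def guess_embedded_types(docx):
    """Map each file under word/embeddings/ to a guessed extension (or None)."""
    with zipfile.ZipFile(docx) as package:
        return {
            name: guess_extension(package.read(name))
            for name in package.namelist()
            if name.startswith("word/embeddings/") and not name.endswith("/")
        }
```

A result of None simply means the signature was not recognized, in which case the manual guess-and-check approach above still applies.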

Conclusion

As you can see, this process is pretty simple. Using the same methodology, you do not
have to stop at just images and text. You can also view style information, printer setup
and any other properties specific to the document.

While the example illustrated above covers Word documents, you can just as easily
extract the same information from other Office 2007 format documents such as Excel
and PowerPoint. The respective data is stored under logically named files and folders
within the inner contents of the document, but if nothing else you can always extract
resource files you are unsure of and see what they contain.
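
To illustrate how the approach carries over to Excel and PowerPoint, here is a small sketch that looks for a “media” folder regardless of the top-level folder name. It assumes those formats keep images in “xl/media” and “ppt/media” respectively; list_media is a hypothetical helper.

```python
import zipfile


def list_media(office_file):
    """Return the archive paths of all files stored in any media folder."""
    with zipfile.ZipFile(office_file) as package:
        return [
            name
            for name in package.namelist()
            # Match word/media/, xl/media/ or ppt/media/ and skip directory entries.
            if "/media/" in name and not name.endswith("/")
        ]
```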
