
AC 2202

The Enron Corpus is a database of over 600,000 emails generated by 158 employees[1] of the Enron
Corporation in the years leading up to the company's collapse in December 2001. The corpus was
generated from Enron email servers by the Federal Energy Regulatory Commission (FERC) during its
subsequent investigation.[2] A copy of the email database was later purchased for $10,000 by
Andrew McCallum, a computer scientist at the University of Massachusetts Amherst.[3] He released this
copy to researchers, providing a trove of data that has been used for studies on social networking and
computer-mediated communication.

In the legal investigation into Enron's collapse, the discovery process required collecting and preserving
vast amounts of data, for which the FERC hired Aspen Systems (now part of Lockheed Martin). The
emails were collected at Enron Corporation headquarters in Houston during two weeks in May 2002 by
Joe Bartling,[4] a litigation support and data analysis contractor for Aspen. In addition to the Enron
employee emails, all of Enron's enterprise database systems,[5] hosted in Oracle databases on Sun
Microsystems servers, were captured and preserved, including its online energy trading platform,
EnronOnline.

Once collected, the Enron emails were processed and hosted in proprietary electronic discovery
platforms (first Concordance, then iCONECT) for review by investigators from the FERC, Commodity
Futures Trading Commission, and Department of Justice. At the conclusion of the investigation, and
upon the issuance of the FERC staff report,[6] the emails and information collected were deemed to be
in the public domain, to be used for historical research and academic purposes. The email archive was
made publicly available and searchable via the web using iCONECT 24/7, but the sheer volume of email,
over 160 GB, made it impractical to use. Copies of the collected emails and databases were made
available on hard drives.

Jitesh Shetty and Jafar Adibi from the University of Southern California processed the data in 2004 and
released a MySQL version.[7] In 2010, EDRM.net published a revised and expanded version 2 of the
corpus,[8] containing over 1.7 million messages, which has been made available on Amazon S3 for easy
access by researchers.

The corpus is valued as one of the few publicly available mass collections of real emails that can be
readily studied; such collections are typically bound by numerous privacy and legal restrictions, such as
non-disclosure agreements and data sanitization, which render them prohibitively difficult to access.[3]
Shetty and Adibi, based on their MySQL version, published a link analysis of which user accounts
emailed which.[9] Linguistic comparison with more recent email corpora shows changes in the email register of
English. It is also used as test or training data for research in natural language processing and machine
learning.[10]
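
A link analysis of which accounts emailed which can be illustrated with a minimal Python sketch. It is not Shetty and Adibi's method, only an illustration of the general idea: it assumes each message is available as raw RFC 822 text, and the function name count_links, the SAMPLE_MESSAGES list, and the example.com addresses are all hypothetical.

```python
from collections import Counter
from email import message_from_string
from email.utils import getaddresses


def count_links(raw_messages):
    """Count how often each sender address wrote to each recipient address."""
    links = Counter()
    for raw in raw_messages:
        msg = message_from_string(raw)
        # getaddresses() splits header values into (display name, address) pairs.
        senders = [addr.lower()
                   for _, addr in getaddresses(msg.get_all("From", []))
                   if addr]
        recipients = [addr.lower()
                      for _, addr in getaddresses(
                          msg.get_all("To", []) + msg.get_all("Cc", []))
                      if addr]
        for sender in senders:
            for recipient in recipients:
                links[(sender, recipient)] += 1
    return links


# Hypothetical messages for illustration; these are not taken from the corpus.
SAMPLE_MESSAGES = [
    "From: alice@example.com\nTo: bob@example.com, carol@example.com\n"
    "Subject: Q3\n\nBody.\n",
    "From: alice@example.com\nTo: bob@example.com\n"
    "Subject: Re: Q3\n\nBody.\n",
    "From: bob@example.com\nTo: alice@example.com\nCc: carol@example.com\n"
    "Subject: Fwd\n\nBody.\n",
]

if __name__ == "__main__":
    for (sender, recipient), n in count_links(SAMPLE_MESSAGES).most_common():
        print(f"{sender} -> {recipient}: {n}")
```

The resulting counts form a weighted, directed edge list, which is the usual starting point for social network analysis of this kind of data.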
