John Wang, Michael McGuire
EDRM Data Set Project Co-Leads
EDRM VI Kickoff Meeting
May 12, 2010
Minneapolis, MN

Overview
◦ 2009-2010 Accomplishments
◦ Project Trajectory

EDRM DS Projects
◦ Reference ESI Data Sets
◦ Software Reference Data Set
◦ Probabilistic Hash Data Set

Summary

EDRM VI Kickoff Meeting – May 12, 2010

Improve the Quality and Reduce the Costs of ED

By Providing Unencumbered Data Sets
And Best Practices Guidelines

In Partnership with Leading Organizations


Launch of first three ESI Data Sets
◦ Enron PST Files v1.0
◦ File Formats Data Set v1.0
◦ Internationalization Data Set v1.0

Partnership with TREC Legal Track
◦ To provide 2010 TREC Legal Track Data Set

Identify and Articulate New Projects


ESI Data Sets (ESIDS)

Software Reference Data Sets (SRDS, “EDRM List”)

Probabilistic Hash Data Set (PHDS)

Data
Best Practices

Case Studies


Background
◦ To provide multiple reference data sets for testing and benchmarking
 Unit Testing
 System Testing
 Regression Testing
 Acceptance Testing

Goals & Benefits
◦ Improved Quality
◦ Lower Development Costs
◦ Lower Acquisition Costs


Current ESI Data Set Efforts
◦ Enron ESI Data Set
◦ File Formats Data Set
◦ Internationalization Data Set


Background
◦ The email released by FERC via the Enron Western Energy Crisis investigation remains one of the largest collections of email available as ESI
 1+ million before de-duplication*
 100,000s after de-duplication*
◦ Many different versions of the collection have been used in various research projects, including TREC Legal Track
◦ It is hard to correlate the many studies

* There are many more duplicates in the source collection than would exist in Enron’s production email environment. Some collections account for this and others do not. There does not currently exist a way to correlate the various de-duplication efforts.


Goals

◦ To provide a data set that accurately represents Enron’s email environment
◦ That is useful for a wide variety of research and industry use, including the ability to correlate various studies

Accomplishments

◦ EDRM Enron PST files representing 132 custodians with attachments

Upcoming Deliverables

◦ EDRM / TREC 2009 Data Set
◦ EDRM / TREC 2010 Data Set


Data Set             Custodians    Custodian PSTs   EDRM XML   TIFF    Full Headers   SDOC NO [3]
EDRM PST v1.0        132           Y                N          N       Y              N
EDRM TREC 2009 [1]   1 (104) [4]   N                Y          N [5]   N              N
EDRM TREC 2010 [2]   150           Y                Y          Y       Y              Y

1. To be available shortly from EDRM. This set was contributed by Clearwell Systems for the TREC 2009 Legal Track project and will be hosted by EDRM.
2. This data set is still in progress for delivery use by TREC Legal Track 2010.
3. Required for correlating various versions.
4. The EDRM XML specifies a single custodian. As part of the TREC 2009 Legal Track, ZL Technologies identified 104 custodians in this data set.
5. TIFF images can be made available.


File Format Data Set
◦ 381 Files
◦ 200 File Formats

Internationalization Data Set
◦ 23 languages
◦ 724 MB of email in MIME format


Getting Involved
◦ Analyze and report on contents of existing data sets
◦ Promote EDRM Data Set project usage by authoring and working on case studies
◦ Contribute new data sets


Background
◦ The NSRL or “NIST List” is a common way to cull system and application files but file coverage is very limited

Goals & Benefits
◦ To lower the costs of eDiscovery and forensics by complementing the NSRL through a parallel EDRM offering covering more knowable files

 Installed / Uninstalled Online Distributed Software
 Installed Software from CD / DVD Media
 Uninstalled Software from CD / DVD Media
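
Culling against a known-file list can be sketched as follows. This is a minimal illustration, not an EDRM or NSRL tool: the function names are hypothetical, and it assumes the known-file list is held as a set of uppercase hex SHA-1 strings.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the uppercase hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def cull_known_files(paths, known_hashes):
    """Split paths into (kept, culled) by membership in a known-file hash set.

    `known_hashes` is assumed to be a set of uppercase hex SHA-1 strings.
    """
    kept, culled = [], []
    for path in paths:
        if sha1_of_file(path) in known_hashes:
            culled.append(path)  # known system or application file
        else:
            kept.append(path)    # not on the list, so it stays in review
    return kept, culled
```

The broader the known-file list, the more files land in `culled`, which is the cost reduction this project is aiming for.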


Proposed Work Product
◦ Published hash lists
◦ Open source hashing software. We may be able to leverage NSRL open source hashing code:
 http://www.nsrl.nist.gov/perl/

Hashing Targets
◦ Public images, e.g. AWS, Azure VM images

Community Involvement
◦ Have organizations install and use EDRM-provided tools to submit hashes on known hashing targets
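
A hash-list tool of the kind proposed above might look like the following sketch. It is not the NSRL Perl code or an EDRM release. The function names and the CSV column layout are assumptions for illustration only.

```python
import csv
import hashlib
from pathlib import Path

# Column names are illustrative, not a published EDRM or NSRL format.
FIELDS = ["SHA-1", "MD5", "FileName", "FileSize"]


def hash_file(path: Path, chunk_size: int = 1 << 20) -> dict:
    """Compute SHA-1, MD5, and size for one file in a single pass."""
    sha1, md5 = hashlib.sha1(), hashlib.md5()
    size = 0
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha1.update(chunk)
            md5.update(chunk)
            size += len(chunk)
    return {
        "SHA-1": sha1.hexdigest().upper(),
        "MD5": md5.hexdigest().upper(),
        "FileName": path.name,
        "FileSize": size,
    }


def build_hash_list(root: Path) -> list:
    """Hash every regular file under a hashing target, in sorted path order."""
    return [hash_file(p) for p in sorted(root.rglob("*")) if p.is_file()]


def write_hash_list(rows, stream) -> None:
    """Write the rows as CSV with a header line to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Pointing `build_hash_list` at a mounted software image or install directory yields rows that could be submitted or published as a hash list.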


Getting Involved
◦ Establish bootstrap goals for initial v1.0 launch
◦ Identify and access initial software targets
◦ Develop and validate tools for creating hash lists
◦ Release toolsets and hash lists
◦ Partner with NSRL and other organizations


Background

◦ There are many knowable system and application files that can be culled for eDiscovery, e.g. OS, application, help files, etc.; however, there are also many that cannot be pre-determined.
◦ What if there was a way to probabilistically determine if any file was user-generated or not?


Goals and Benefits
◦ To provide a hash database for theoretically all files on a probabilistic basis so organizations can use statistics to help determine whether or not a file may be user generated
◦ Lower ED costs by dramatically enhancing culling of knowable non-user files

ESI Data Sets (test files)
Software Reference Data Sets (knowable files)
Probabilistic Hash Data Set (all files)


1) Community Contribution
◦ Provide anonymous hashes of all collected files to EDRM DS

2) EDRM Aggregated Tracking
◦ Aggregated frequency and histogram analysis

3) eDiscovery Processing
◦ System checking for files not culled via NSRL or ESRDS

By aggregating anonymous hashes across all collections, frequency analysis can be used to assist in determining if any given file is user generated
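
The frequency analysis described here could be sketched as below. This is one possible reading of the idea, not a specified EDRM design. The function names and the 0.5 threshold are hypothetical, and each collection is assumed to arrive as a plain list of hash strings.

```python
from collections import Counter


def collection_frequency(collections):
    """Count, for each hash, how many distinct collections contain it."""
    counts = Counter()
    for hashes in collections:
        counts.update(set(hashes))  # count each hash once per collection
    return counts


def frequency_histogram(counts):
    """Map 'number of collections' to 'number of hashes seen that often'."""
    return Counter(counts.values())


def likely_not_user_generated(file_hash, counts, total_collections, threshold=0.5):
    """Return True when the hash appears in at least `threshold` of collections.

    The threshold is an illustrative assumption. A file seen across many
    unrelated collections is less likely to be user generated.
    """
    if total_collections == 0:
        return False
    return counts[file_hash] / total_collections >= threshold
```

A hash that appears in only one collection gives no evidence either way, so a result of `False` means "not flagged", not "confirmed user generated".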


Getting Involved
◦ Recruiting interested parties
◦ Initial system
 Requirements / Development
 Interfaces (API, XML File Type)
 Systems and development
 POC deployment
 Using known ESIDS data
 Scalability testing


Providing real benefits for the ED community
◦ Multiple ESI data sets have been released
◦ Case studies are in progress
◦ New projects are underway
◦ Discussions and partnerships with other organizations are underway, e.g. TREC, NIST, etc.

Get involved for 2010-2011!
◦ dataset@edrm.net
◦ http://edrm.net/activities/projects/dataset
