
Duke Project for the Advancement of Public Affairs Reporting / The Reporters Lab

Executive summary

The Duke Project for the Advancement of Public Affairs Reporting, dubbed the Reporters Lab, tackles common and difficult problems in the reporting phase of accountability journalism while contributing to the research and transparency communities. The project includes four elements:

1. Adapt existing technological advances in other disciplines to address inefficiencies in public affairs journalism by creating, curating and deploying free, open source tools for reporters, both as software and as hosted services.
2. Produce news and reviews of reporting techniques and advances, including tests of open source and commercial tools for public affairs reporting based on real-life document and data collections.
3. Contribute to research in information and computer science, social sciences and digital humanities, and encourage work on public affairs reporting techniques and technology needs.
4. Research public records practices and consult with journalists and government to reduce the need for technological workarounds.

The project's success will mean that, in the future, one reporter might point the lab's video service at the local county council webcast to get a searchable copy on a desktop widget. Another might use its web scraping and entity extraction service to collect and tag all of the state's regulatory agency actions. A third might install a free piece of software on the newsroom's server to convert government agency sign-in sheets into a searchable and sortable list. These and other services and software, from automatic analysis of redacted e-mails to best-practice analysis of government databases, would be curated and created by the program.

This program fits squarely with the goals of the computational journalism initiative at Duke University to reduce the cost and difficulty of public affairs, accountability and investigative reporting.
The university's DeWitt Wallace Center for Media and Democracy has committed the initial funding, along with ongoing administrative and some staffing support.
Sarah Cohen, DeWitt Wallace Center for Media and Democracy, Sarah.cohen@duke.edu / @reporterslab

The lab's director is Sarah Cohen, the Knight Professor of the Practice at Duke and a Pulitzer Prize-winning reporter and editor who spent most of her career on investigative teams at The Washington Post. Its in-house developer is Charlie Szymanski, formerly of the Sarasota Herald-Tribune and the National Journal.

The need
Social media, data journalism and other technological advances in publishing have profoundly changed the pace of news, the relationships between news producers and consumers, and even the definition of news itself. Breaking news and events that happen in public have been fundamentally transformed by eyewitness accounts and immediate reaction. Data journalists are transforming information that government wants to make public into engaging mobile applications, visualizations and interactive databases. But technological and methodological advances in government, academia, business and even publishing have left a vital precinct of the journalism production line behind: the beat and investigative reporters who mine sources, documents, databases and other records to uncover new and hidden stories of public interest. Their methods are stuck somewhere in the early millennium. The Reporters Lab is focused on this growing digital divide between news producers and their sources and consumers.

In 2010, for example, Paige St. John of the Sarasota Herald-Tribune listened to dozens of hours of hearings and analyst conference calls to document parts of her Pulitzer Prize-winning series on property insurance failures in Florida. The Washington Post spent two years compiling a 1990s-style spreadsheet to record the sprawling intelligence industry for its Top Secret America series. Reporters at The New York Times analyzing WikiLeaks documents turned to Microsoft Word, not a sophisticated text analysis tool, to search for stories.

Strategy
Immediately create and deploy reporting tools

Some envisioned tools have been researched well enough to know they are both needed in newsrooms and possible using existing technology. They may already be available as open source software that requires extensive technical skill, or may be adaptations of research already conducted in intelligence, business, law or other fields that would not be profitable to adapt into a commercial newsroom product. Applications might include:

- Web-scraping, entity extraction and indexing services that allow reporters to gather, on demand, materials published on dispersed government websites such as state regulatory actions, lawmaker press releases or U.S. attorneys' settlement announcements. These services would also be used on searchable government databases that are not easily obtained in other ways. Commercial products, such as Needlebase or Kapow, can cost up to $40,000 for a single license. They have found eager markets among businesses seeking to integrate their information assets and keep up with their clients and competitors. Journalists need a similar tool but can't justify the expense for a task that is not mission-critical to publishers.
- A widget that lets reporters upload audio or video files or point to a webcast, index the contents and return searchable versions. This could be used on interview recordings, public meeting webcasts or court proceeding audio. Open source tools from Carnegie Mellon University are believed to be adequate for this task.
- An on-demand, free optical character recognition service for scanned redacted documents that extracts names, places, dates and other entities and analyzes their interactions. This builds on perhaps the most successful project in reporting tools, DocumentCloud, and is envisioned as an extension of that open source project.
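The core of the scraping-and-extraction service can be shown in miniature. The sketch below uses only Python's standard library; the entity patterns (dates and dollar amounts) and the page layout are illustrative assumptions, and a production service would crawl many agency sites and use trained named-entity models rather than regular expressions:

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text from an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def scrape_entities(html: str) -> dict:
    """Pull simple 'entities' (dates and dollar amounts) out of one page.

    This only demonstrates the strip-tags-and-match step of the
    envisioned service; the patterns are placeholder assumptions.
    """
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    return {
        "dates": re.findall(r"\b\d{1,2}/\d{1,2}/\d{4}\b", text),
        "amounts": re.findall(r"\$[\d,]+(?:\.\d{2})?", text),
    }
```

Run against a hypothetical regulatory-action page, the tagged results could then be indexed and pushed to a reporter's alert queue.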

Curate test document collections and existing tools

There are plenty of existing tools aimed at many of the problems reporters face. However, they have not been tested on the kinds of documents typical of journalism: redacted e-mails, printed forms or large collections of PDFs. Some simply can't be tested in a reporting environment because journalists acquire documents and data in inconvenient forms. For example, most of the document analysis software available assumes that the information is in simple text files, with each file representing a document. But reporters usually get one large PDF combining images of thousands of documents in a single file. Another example is audio and video indexing programs, which expect that a file, not a link to streaming audio or video, will be used as input. The lab has begun curating these types of documents, datasets and tasks and has started reviewing existing tools against those problems. We plan to hire freelancers to do more and to use the library as a source of challenges for software developers.

Identify candidates for further development

Some projects are less well researched and may not yet be technologically

feasible to apply immediately to reporting. Examples include:

- Working with handwritten documents. Digital historians have made considerable strides toward working with handwritten documents, but the technology may not yet be up to the task of converting handwritten forms or government sign-in sheets into searchable databases.
- A tool to convert forms into searchable databases. This could adapt the U.K. ScraperWiki system to curate, collect and distribute scrapers and data from common systems used in federal, state and local governments, such as police incident reports and housing inspections, and provide easier methods of converting paper forms or their electronic cousins into usable data through a simple interface that lets reporters capture the most important elements.
- Exploratory analysis of unstructured documents. Social scientists are beginning to develop methods for working with large document collections, such as comments on proposed regulations and collections of congressional press releases, to find interesting clusters for further analysis. The program would research which of these methods and tools might be useful for reporting. This could also be applied to the overwhelming newsroom task of monitoring scores of individual sources that too often just repeat the same information.
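The exploratory-analysis idea can be shown in miniature. The sketch below groups documents by simple word overlap; the research tools the lab would evaluate use topic models and far richer features, and the overlap threshold here is an arbitrary assumption. Even so, the shape of the task is visible: near-duplicate press releases fall into one bucket, and outliers stand alone for a closer look.

```python
def jaccard(a: set, b: set) -> float:
    """Word-set overlap between two documents (0 to 1)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_docs(docs: list, threshold: float = 0.5) -> list:
    """Greedy single-pass clustering of documents by word overlap.

    Each document joins the first cluster whose representative it
    sufficiently overlaps, otherwise it starts a new cluster.
    Returns lists of document indices.
    """
    clusters = []  # each entry: (representative word set, member indices)
    for i, doc in enumerate(docs):
        words = set(doc.lower().split())
        for rep, members in clusters:
            if jaccard(words, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((words, [i]))
    return [members for _, members in clusters]
```

Applied to, say, a season of congressional press releases, the singleton clusters are often the interesting ones: statements nobody else repeated.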

The goal in each of these cases is to identify the best available technology and research and to move from experimentation to implementation.

Work with reporters and news organizations

The lab will help reporters use its products on live stories, perhaps by working on projects in their own newsrooms, to shoulder the cost of testing and refining the tools it creates and identifies as candidates. This crucial testing will ensure that the project is creating tools and services that are of use in actual reporting scenarios while producing demonstration projects that can be emulated by others and used as training resources.

Steer research in other disciplines toward investigative reporting issues

The DeWitt Wallace Center faculty is already collaborating with researchers in other fields, and this role will continue under the program. Other contributions to research projects might include creating datasets for the NSF's Digging into Data challenge, IEEE's contests on unstructured data visualization and other contests in computational linguistics and information sciences. Getting reportorial datasets into the challenges would help researchers gear their work toward accountability reporting alongside their traditional work on Twitter feeds, published news stories and legal documents. These contest

data and document sets require painstaking preparation to be sure that the entrants can be judged on the outcome.

Research policy on transparency

Many of the difficulties in public affairs reporting come long before documents and data are extracted from government agency file cabinets, electronic document collections or databases. By researching policies on public records and transparency, the center can help reduce the cost and difficulty of obtaining and working with records before they are ever requested. For example, it might develop a series of best practices for the administration of laws governing common records, or provide advice to public officials on extracting the public portions of commonly used software that manages 911 calls, procurement, local government checkbooks and other state and local administrative systems. It could serve as a liaison between public officials and the journalism community on the administration of public records laws, while helping journalism advocates create model policies for transparency.

Impact
In journalism, it's impossible to say which stories hidden in government file cabinets might be exposed. But it's easy to imagine the way a reporter's day might change when the tools are fully developed and deployed.

A reporter would never have to check dozens of RSS feeds and websites just to keep up with existing media on a beat or in a community. In the Triangle area of North Carolina alone, more than 100 standalone news organizations publish at least occasional original content of interest to a reporter covering the region. Following those sources remains an inefficiency that keeps reporters shackled to their computers. Instead, the material would flow into a desktop widget, already organized by story, with the most reliable sources and the most obscure information highlighted, making sure that the hidden, rather than the already known, could prompt further reporting. The reporter could scan the list daily with confidence that no important yet buried story had been missed in the swirl of aggregated news and information overload.

When a government agency delivers a 24,000-page redacted PDF file of e-mails from the governor, reporters would no longer split up the documents and read them in the hope of collectively catching anything important. Instead, they could send the file to a service that would create searchable versions, extract the entities and topics, and map them into geographical, social and timeline visualizations.
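The feed-monitoring widget described above has a small core: fetch each feed, keep only the items the reporter has not seen. The sketch below keeps just that parse-and-deduplicate step and assumes plain RSS 2.0 input; a real version would poll hundreds of feeds, handle Atom and malformed markup, and rank the results rather than list them:

```python
import xml.etree.ElementTree as ET

def new_items(rss_xml: str, seen_links: set) -> list:
    """Return only the not-yet-seen items from one RSS 2.0 feed.

    Deduplication is by item link; seen_links is updated in place so the
    same set can be carried across polling runs.
    """
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link")
        title = item.findtext("title")
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append({"title": title, "link": link})
    return fresh
```

Persisting the seen set between runs (a file or small database) is what turns this from a one-off check into the always-on widget the scenario imagines.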


A reporter searching for a contractor's testimony at a public meeting, a lawmaker's demand for the resignation of an official, or a judge's admonition to a lawyer would no longer have to listen to hours of hearings. Instead, the audio or video could be indexed by the program's service, so a search would lead to the likely spot in the recording for a closer listen.

A local reporter could know immediately, every day, whether any state or federal agency had disciplined anyone in the area, without depending on the agency to send e-mails or notices to the right reporter or editor. Reporters could keep track of indictments, enforcement actions and press releases from hundreds of lawmakers, or any other web-based collection of documents, and immediately be alerted to events in their areas or on their beats.

Obtaining thousands of police incident reports, charge sheets, nursing home inspections or other forms, whether through a web-based system or on paper, would no longer become an exercise in computer programming and data entry. Instead, the reporter could point the service at a collection of forms, identify the elements to be extracted, and come back later to find a spreadsheet that could immediately be used in reporting.
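At its simplest, that forms-to-spreadsheet scenario reduces to matching labeled fields and writing rows. In the sketch below the field labels are hypothetical stand-ins for whatever a reporter would mark up in the envisioned point-and-extract interface, and the input is assumed to be already-OCR'd text:

```python
import csv
import io
import re

# Hypothetical labels from a police incident form; in the envisioned
# tool the reporter would define these interactively.
FIELDS = {
    "case": r"Case No\.?:\s*(\S+)",
    "date": r"Date:\s*([\d/]+)",
    "offense": r"Offense:\s*(.+)",
}

def forms_to_table(reports: list) -> str:
    """Turn a stack of OCR'd incident-report texts into one CSV string.

    Missing fields are left blank rather than failing, since real OCR
    output is rarely complete.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(FIELDS))
    writer.writeheader()
    for text in reports:
        row = {}
        for name, pattern in FIELDS.items():
            m = re.search(pattern, text)
            row[name] = m.group(1).strip() if m else ""
        writer.writerow(row)
    return buf.getvalue()
```

The resulting CSV opens directly in a spreadsheet, which is the "come back later" end of the scenario.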

One example of the kind of story that would have taken less time and effort, had some of the technologies now possible been in place, is The District's Lost Children, a Washington Post series that won the 2002 Pulitzer Prize for Investigative Reporting. The series was based largely on an analysis of about 15 boxes of handwritten, heavily censored forms and other documents. Each document was reviewed by a team of three reporters, who triangulated the circumstances of the deaths of 180 children, one-fifth of whom died after Washington, D.C., agencies failed to follow their own rules. The series also documented the aim of more than 700 recommendations made in the cases and identified children whose lives might have been saved had those earlier recommendations been followed. This analysis alone took more than six months. Electronic tools that made those documents searchable, extracted the little precise information that was not censored and grouped the recommendations might have cut that effort by about a third. If the analysis were easier, more reporters in other cities might have tackled similar projects.

This kind of long-term project isn't the only outcome. In 2010, Salon reporter Mark Benjamin had a seemingly simple problem: a handwritten log of the names of soldiers buried at Arlington Cemetery in the late 1800s, some of whose graves may have been destroyed. The names were listed alphabetically, but he had to walk the cemetery by plot number to determine

which graves were missing. He had to retype the list to get something he could sort, when a reasonably accurate handwriting recognition program would have converted it for him.

These are just a few of the tasks that are now so difficult in newsrooms that many reporters simply skip them. Sometimes they are so time-intensive or expensive that a news organization decides against even trying to do a story. In every case, the goal is to move the difficulty of a story down a notch, increasing the likelihood that more stories will be accomplished and difficult explorations of documents and data undertaken. If nothing else, the work of the Reporters Lab should free reporters from the newsroom to spend more time in their communities finding new and under-covered stories on their beats.
