Frontiers of Computational Journalism
Columbia Journalism School
Week 9: Knowledge Representation

November 10, 2017
This class
• Structured Journalism
• Ontologies and Graphs
• Relations from Text
Structured Journalism
Unstructured data
Structured data
Everyblock.com circa 2009
Connected China. Reuters, 2013
Article Metadata
headline

photo

photo credit
photo caption
byline
publication date
dateline
article body
related articles
Schema.org news markup
Overall type of the object on this page, in HTML head

Headline, dateline, date as additions to div/span properties

Byline expressed as nested object (using itemscope) of type schema.org/Person
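
A minimal sketch of the markup pattern described above, assuming the schema.org NewsArticle and Person types with the headline, dateline, datePublished, author and name properties. The function name, the wrapping div (rather than the HTML head) and the sample date and author are placeholders for illustration, not taken from any real page.

```python
from html import escape

def news_article_microdata(headline, dateline, date_published, author_name):
    """Render a minimal schema.org NewsArticle as HTML microdata."""
    return (
        # itemscope + itemtype declare the overall type of the object.
        '<div itemscope itemtype="http://schema.org/NewsArticle">\n'
        # Headline, dateline and date are itemprop attributes on ordinary tags.
        f'  <h1 itemprop="headline">{escape(headline)}</h1>\n'
        f'  <span itemprop="dateline">{escape(dateline)}</span>\n'
        f'  <time itemprop="datePublished" datetime="{escape(date_published)}">'
        f'{escape(date_published)}</time>\n'
        # The byline is a nested object: a second itemscope of type Person.
        '  <span itemprop="author" itemscope itemtype="http://schema.org/Person">\n'
        f'    <span itemprop="name">{escape(author_name)}</span>\n'
        '  </span>\n'
        '</div>'
    )

# Placeholder values, for illustration only.
markup = news_article_microdata(
    "Example headline", "BEIRUT, Lebanon", "2012-01-01", "A. Reporter"
)
```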
Driving application: “rich snippets”

Schema.org covers not just news but music, restaurants, people,
organizations, reviews, offers...

Snippets, and better search-ability generally, are motivation for
Google, Yahoo, Bing to push schema.org
Additional metadata from indexing team

In database, but doesn't necessarily make it to HTML.
Application: content navigation

Articles about “Syria”
on NYT topic page

More reliable than simple
text search, because the
relevance algorithm knows a
story is "about" Syria.
Application: automatic stories
Wall Street is high on Molson Coors Brewing (TAP), expecting it to report earnings
that are up 17.5% from a year ago when it reports its third quarter earnings on
Wednesday, November 7, 2012. The consensus estimate is $1.34 per share, up
from earnings of $1.14 per share a year ago.

The consensus estimate has dipped over the past month, from $1.35, but it’s still
up from the consensus estimate of $1.19 three months ago. For the fiscal year,
analysts are expecting earnings of $3.89 per share. Revenue is projected to
eclipse the year-earlier total of $954.4 million by 31%, finishing at $1.25 billion for
the quarter. For the year, revenue is projected to roll in at $4.04 billion.

The company’s net income has declined in the last two quarters. The company
posted profit falling by 52.8% in the second quarter. This is after it reported a profit
decline in the first quarter by 4.1%.

Automatic story generation, by Narrative Science
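
A rough sketch of how a sentence like the first one above could be filled in from structured data. This is a simple template, not Narrative Science's actual method; the function names and wording are assumptions, and the two figures are the ones quoted in the story.

```python
def pct_change(new, old):
    """Percentage change from old to new."""
    return (new - old) / old * 100.0

def earnings_preview(company, ticker, est_eps, year_ago_eps):
    """Fill a one-sentence earnings template from structured fields."""
    change = pct_change(est_eps, year_ago_eps)
    direction = "up" if change >= 0 else "down"
    return (
        f"Analysts expect {company} ({ticker}) to report earnings of "
        f"${est_eps:.2f} per share, {direction} {abs(change):.1f}% from "
        f"${year_ago_eps:.2f} per share a year ago."
    )

story = earnings_preview("Molson Coors Brewing", "TAP", 1.34, 1.14)
```

The point is that the text is a function of the database row: change the numbers and the sentence changes with them.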
Ontologies and Graphs
What objects and relations are available?

Often represented as class hierarchy.
Arrows = “is_a” relation
(Part of) a real ontology, from Cyc
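
A toy sketch of a class hierarchy stored as “is_a” arrows. The class names are invented for illustration and are not taken from Cyc; each class is assumed to have a single parent.

```python
# child -> parent, one "is_a" arrow per entry (hypothetical classes).
IS_A = {
    "hurricane": "storm",
    "storm": "weather_event",
    "weather_event": "event",
    "wedding": "social_event",
    "social_event": "event",
    "event": "thing",
}

def ancestors(cls):
    """Follow is_a arrows upward and return every class passed on the way."""
    chain = []
    while cls in IS_A:
        cls = IS_A[cls]
        chain.append(cls)
    return chain

def is_a(cls, candidate):
    """True if cls equals candidate or inherits from it."""
    return cls == candidate or candidate in ancestors(cls)
```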
News as relations between entities
“Alice attended the wedding”
attended(alice, wedding)

“IBM was founded in 1917.”
founded(IBM, 1917)

“Hurricane Sandy hit New York”
hit(hurricane_sandy, New_York)

Encode facts as relation(subject,object)
also written (subject relation object)
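
One possible encoding of the three facts above, as a sketch. The Triple type and the match and as_predicate helpers are hypothetical names chosen for this example.

```python
from collections import namedtuple

Triple = namedtuple("Triple", ["subject", "relation", "object"])

facts = [
    Triple("alice", "attended", "wedding"),
    Triple("IBM", "founded", "1917"),
    Triple("hurricane_sandy", "hit", "New_York"),
]

def match(facts, subject=None, relation=None, obj=None):
    """Return the facts that agree with every field that is not None."""
    return [
        t for t in facts
        if (subject is None or t.subject == subject)
        and (relation is None or t.relation == relation)
        and (obj is None or t.object == obj)
    ]

def as_predicate(t):
    """Write a (subject relation object) triple as relation(subject, object)."""
    return f"{t.relation}({t.subject}, {t.object})"
```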
Things we could do with this
Question answering
“The granddaughter of which actor starred in E.T.?”
(?x acted-in “E.T.”)(?y is-a actor)(?x granddaughter-of ?y)
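
A small sketch of how the three-pattern query above could be evaluated: try each pattern against every fact and carry the variable bindings forward. The fact set is hand-entered for illustration, and the is_var, unify and query helpers are assumptions of this example, not a real query language.

```python
FACTS = {
    ("drew_barrymore", "acted-in", "E.T."),
    ("drew_barrymore", "granddaughter-of", "john_barrymore"),
    ("john_barrymore", "is-a", "actor"),
    ("henry_thomas", "acted-in", "E.T."),
    ("henry_thomas", "is-a", "actor"),
}

def is_var(term):
    """Variables are written with a leading question mark, e.g. ?x."""
    return term.startswith("?")

def unify(pattern, fact, bindings):
    """Extend bindings so that pattern equals fact, or return None."""
    new = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            # Bind the variable, or fail if it is already bound elsewhere.
            if new.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return new

def query(patterns, facts, bindings=None):
    """Return every set of bindings that satisfies all patterns."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        return [bindings]
    results = []
    for fact in facts:
        extended = unify(patterns[0], fact, bindings)
        if extended is not None:
            results.extend(query(patterns[1:], facts, extended))
    return results

answers = query(
    [
        ("?x", "acted-in", "E.T."),
        ("?y", "is-a", "actor"),
        ("?x", "granddaughter-of", "?y"),
    ],
    FACTS,
)
```

With these facts, ?y binds to john_barrymore: the only actor whose granddaughter also appears in the acted-in facts for E.T.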

Inference
(bob brother-of alice)
(alice mother-of lucy) =>
(bob uncle-of lucy)

Answer questions using inference
“How many executives of publicly-traded Canadian companies died
in car crashes?”
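
A sketch of the brother-of / mother-of rule above, applied to a set of triples and then used to answer a question that no stored fact answers directly. The function name and the single hard-coded rule are assumptions for illustration; a real system would take rules as data.

```python
def infer_uncles(facts):
    """Rule: (a brother-of b) and (b mother-of d) => (a uncle-of d)."""
    derived = set()
    for (a, r1, b) in facts:
        if r1 != "brother-of":
            continue
        for (c, r2, d) in facts:
            if r2 == "mother-of" and c == b:
                derived.add((a, "uncle-of", d))
    return derived

facts = {
    ("bob", "brother-of", "alice"),
    ("alice", "mother-of", "lucy"),
}

# Add the inferred facts to the store, then query as usual.
facts |= infer_uncles(facts)
lucys_uncles = [s for (s, r, o) in facts if r == "uncle-of" and o == "lucy"]
```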
Every big news org has their own
big ontology ☹

topics, people, organizations, places...
Yaaay Linked Data!
Triples of (subject relation object), each a URL or literal

<urn:x-states:New%20York>
<http://purl.org/dc/terms/alternative>
"NY"

<http://dbpedia.org/resource/Columbia_University>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://schema.org/CollegeOrUniversity>

Abbreviations possible with many formats...
<http://dbpedia.org/resource/Columbia_University> rdf:type
ns6:CollegeOrUniversity
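
A sketch of the abbreviation step in plain Python, without an RDF library. The prefix table is an assumption: rdf and dcterms are the usual names for those namespaces, and ns6 is guessed to stand for http://schema.org/ because the slide does not define it.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "ns6": "http://schema.org/",  # assumption: not defined on the slide
}

def abbreviate(term):
    """Shorten a URL with a known prefix; leave literals and other URLs alone."""
    if term.startswith('"'):
        return term  # literal, e.g. "NY"
    for prefix, base in PREFIXES.items():
        if term.startswith(base):
            return f"{prefix}:{term[len(base):]}"
    return f"<{term}>"

triples = [
    ("urn:x-states:New%20York", "http://purl.org/dc/terms/alternative", '"NY"'),
    (
        "http://dbpedia.org/resource/Columbia_University",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://schema.org/CollegeOrUniversity",
    ),
]

def to_line(triple):
    """Write one triple per line, terminated by a period."""
    return " ".join(abbreviate(t) for t in triple) + " ."
```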
NYT API can return linked data
{
"title": "Syria's Rebels Open Talks on Forging United Political Front",
"body": "BEIRUT, Lebanon — Syria’s fractious opposition groups began
negotiations in Doha, Qatar, on Sunday to forge a more unified front to reshape
the political landscape in a bloody conflict that claims more than 100 lives
virtually every day. Given the scant prospects that any attempt to restructure
the opposition will succeed — the",
"dbpedia_resource_url": [
"http://dbpedia.org/resource/Hillary_Rodham_Clinton",
"http://dbpedia.org/resource/Bashar_al-Assad"],
"facet_terms": "CLINTON, HILLARY RODHAM ASSAD, BASHAR AL- SYRIA DOHA
(QATAR) SYRIAN NATIONAL COUNCIL STATE DEPARTMENT WAR AND REVOLUTION DEFENSE AND
MILITARY FORCES"
}
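
A sketch of turning a record like this into entities and triples. It parses a trimmed copy of the response above rather than calling the API; the entity_names helper and the "mentions" relation are assumptions made for this example.

```python
import json
from urllib.parse import unquote

# Trimmed copy of the record above (title and linked-data URLs only).
record_json = """
{
  "title": "Syria's Rebels Open Talks on Forging United Political Front",
  "dbpedia_resource_url": [
    "http://dbpedia.org/resource/Hillary_Rodham_Clinton",
    "http://dbpedia.org/resource/Bashar_al-Assad"
  ]
}
"""

def entity_names(record):
    """Turn DBpedia resource URLs into readable entity names."""
    names = []
    for url in record.get("dbpedia_resource_url", []):
        slug = url.rsplit("/", 1)[-1]
        names.append(unquote(slug).replace("_", " "))
    return names

record = json.loads(record_json)

# One (article, relation, entity) triple per linked resource.
mentions = [
    (record["title"], "mentions", url)
    for url in record["dbpedia_resource_url"]
]
```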
Property Graphs in the Panama Papers
Relations from Text
Objects and relations in text?

names, dates, places, verbs.
Named Entity Recognition
Extract subjects and objects from text.
Also, resolve pronouns if possible.

"Gov. Andrew M. Cuomo on Wednesday gave a sea
wall the nod. Because of the recent history of
powerful storms hitting the area, he said, elected
officials have a responsibility to consider new and
innovative plans to prevent similar damage in the
future."
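
A deliberately crude stand-in for entity extraction on the passage above: collect runs of capitalized tokens and drop single capitalized words that merely start a sentence. This is a capitalization heuristic, not a trained NER model; it assigns no entity types and makes no attempt to resolve "he" to Cuomo. The pattern and function name are assumptions for illustration.

```python
import re

TEXT = (
    "Gov. Andrew M. Cuomo on Wednesday gave a sea wall the nod. "
    "Because of the recent history of powerful storms hitting the area, "
    "he said, elected officials have a responsibility to consider new and "
    "innovative plans to prevent similar damage in the future."
)

# A token is a capitalized word (optionally abbreviated) or a single initial.
TOKEN = r"(?:[A-Z][a-z]+\.?|[A-Z]\.)"
CANDIDATE = re.compile(rf"\b{TOKEN}(?:\s+{TOKEN})*")

def candidate_entities(text):
    """Return runs of capitalized tokens that look like names."""
    found = []
    for m in CANDIDATE.finditer(text):
        phrase = m.group()
        before = text[:m.start()].rstrip()
        starts_sentence = not before or before[-1] in ".!?"
        # A lone capitalized word at a sentence start is probably not a name.
        if starts_sentence and " " not in phrase:
            continue
        found.append(phrase)
    return found

entities = candidate_entities(TEXT)
```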
Relations from sentence parsing
“The water that made rivers of Avenues C and D
receded on Tuesday, and the East Village was a
mixture of disaster and nonchalance. A group of
young men in pajama pants and shorts threw a
football on East 12th Street, while workers pumped the
basement of CHP Hardware on Avenue C and Eighth
Street.”

subject verb object
Stanford Open IE
Ontology explosions

(water made rivers of Avenues C and D)
(East Village was a mixture of disaster and nonchalance)
(group of young men in pajama pants and shorts threw
football)
(workers pumped the basement of CHP Hardware)

Do we have all of these in the ontology?
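
A sketch of the check implied by that question: compare the relations in the extracted triples against the relations an ontology already defines. The split of each extraction into (subject, relation, object) and the contents of KNOWN_RELATIONS are assumptions for illustration.

```python
# Relations a hypothetical ontology already knows about.
KNOWN_RELATIONS = {"attended", "founded", "hit"}

# The four extractions above, split by hand into subject, relation, object.
extracted = [
    ("water", "made", "rivers of Avenues C and D"),
    ("East Village", "was", "a mixture of disaster and nonchalance"),
    ("group of young men in pajama pants and shorts", "threw", "football"),
    ("workers", "pumped", "the basement of CHP Hardware"),
]

def unknown_relations(triples, known):
    """Relations that appear in the text but not in the ontology."""
    return sorted({rel for _, rel, _ in triples if rel not in known})

missing = unknown_relations(extracted, KNOWN_RELATIONS)
```

Two sentences of ordinary news copy produce four relations that the ontology lacks, which is the explosion.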
“General Question Answering”

Precision/recall tradeoff. State of the art is IBM’s DeepQA
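
A sketch of the tradeoff with made-up numbers: a system that only answers when its confidence clears a threshold gets more of its answers right (precision) but answers fewer of the answerable questions (recall). The scored list and function names are hypothetical.

```python
def precision_recall(returned, correct):
    """Precision = right answers / answers given; recall = right / all right."""
    returned, correct = set(returned), set(correct)
    true_positives = len(returned & correct)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(correct) if correct else 0.0
    return precision, recall

# (answer id, system confidence, actually correct?) -- invented data.
scored = [
    ("a", 0.9, True),
    ("b", 0.8, True),
    ("c", 0.6, False),
    ("d", 0.4, True),
    ("e", 0.2, False),
]

def at_threshold(threshold):
    """Precision and recall when only confident answers are returned."""
    returned = [name for name, conf, _ in scored if conf >= threshold]
    correct = [name for name, _, ok in scored if ok]
    return precision_recall(returned, correct)

strict = at_threshold(0.7)   # answers only a and b
lenient = at_threshold(0.3)  # answers a, b, c and d
```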
DeepQA use of structured data
“Watson can also use detected relations to query a
triple store and directly generate candidate answers.
Due to the breadth of relations in the Jeopardy
domain and the variety of ways in which they are
expressed, however, Watson’s current ability to
effectively use curated databases to simply “look up”
the answers is limited to fewer than 2 percent of the
clues.”

- Ferrucci et al., “Building Watson”