Frontiers of Computational Journalism

Columbia Journalism School Week 9: Knowledge Representation

November 10, 2017

This class

Structured Journalism

Ontologies and Graphs

Relations from Text

Structured Journalism

Unstructured data


Structured data

Everyblock.com circa 2009
Connected China. Reuters, 2013

Article Metadata

headline

photo

photo credit

photo caption

byline

publication date

dateline

article body

related articles
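A minimal sketch of the article metadata fields above as a typed record. The class names, field types, and sample values are illustrative assumptions, not any newsroom's actual schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Photo:
    # Hypothetical grouping of the three photo-related fields.
    url: str
    credit: str
    caption: str


@dataclass
class Article:
    headline: str
    byline: str
    publication_date: date
    dateline: str
    body: str
    photo: Optional[Photo] = None
    # Related articles stored as identifiers or URLs (an assumption).
    related_articles: List[str] = field(default_factory=list)


# Placeholder values only; not a real article.
article = Article(
    headline="Example headline",
    byline="A. Reporter",
    publication_date=date(2017, 11, 10),
    dateline="NEW YORK",
    body="Example body text.",
    photo=Photo(
        url="https://example.com/photo.jpg",
        credit="Example Agency",
        caption="Example caption",
    ),
)
```

Once every field has a name and a type, the article can be queried and recombined like any other structured data.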

Schema.org news markup

Overall type of the object on this page, in HTML head


Headline, dateline, date as additions to div/span properties


Byline expressed as nested object (using itemscope) of type schema.org/Person
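A rough sketch of what such markup might look like and how a program could read it back, using only Python's standard `html.parser`. The sample HTML is invented for illustration, and placing the overall type on the `<html>` tag is an assumption. The collector handles only simple leaf elements and is not a full microdata parser.

```python
from html.parser import HTMLParser

# Invented sample page; property names follow schema.org/NewsArticle.
SAMPLE_HTML = """
<html itemscope itemtype="http://schema.org/NewsArticle">
<body>
<h1 itemprop="headline">Example headline</h1>
<span itemprop="dateline">NEW YORK</span>
<span itemprop="datePublished">2017-11-10</span>
<div itemprop="author" itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">A. Reporter</span>
</div>
</body>
</html>
"""


class MicrodataCollector(HTMLParser):
    """Collects item types and the text of leaf itemprop elements."""

    def __init__(self):
        super().__init__()
        self.item_types = []
        self.properties = []  # list of (itemprop, text) pairs
        self._prop = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            self.item_types.append(attrs["itemtype"])
        # A nested object (itemscope) has no text value of its own.
        if "itemprop" in attrs and "itemscope" not in attrs:
            self._prop = attrs["itemprop"]
            self._text = []

    def handle_data(self, data):
        if self._prop is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes leaf property elements contain no child tags.
        if self._prop is not None:
            self.properties.append((self._prop, "".join(self._text).strip()))
            self._prop = None


collector = MicrodataCollector()
collector.feed(SAMPLE_HTML)
```

After `feed`, `collector.item_types` holds the NewsArticle and Person types, and `collector.properties` holds the headline, dateline, date, and author name.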

Driving application: “rich snippets”

Schema.org covers not just news but music, restaurants, people, organizations, reviews, offers

Snippets, and better search-ability generally, are motivation for Google, Yahoo, Bing to push schema.org


Additional metadata from indexing team


In database, but doesn't necessarily make it to HTML.


Application: content navigation


Articles about “Syria” on NYT topic page

More reliable than simple text search, because the relevance algorithm knows a story is "about" Syria.


Application: automatic stories

Wall Street is high on Molson Coors Brewing (TAP), expecting it to report earnings that are up 17.5% from a year ago when it reports its third quarter earnings on Wednesday, November 7, 2012. The consensus estimate is $1.34 per share, up from earnings of $1.14 per share a year ago.

The consensus estimate has dipped over the past month, from $1.35, but it’s still up from the consensus estimate of $1.19 three months ago. For the fiscal year, analysts are expecting earnings of $3.89 per share. Revenue is projected to eclipse the year-earlier total of $954.4 million by 31%, finishing at $1.25 billion for the quarter. For the year, revenue is projected to roll in at $4.04 billion.

The company’s net income has declined in the last two quarters. The company posted profit falling by 52.8% in the second quarter. This is after it reported a profit decline in the first quarter by 4.1%.

Automatic story generation, by Narrative Science
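One simple way to produce text like this from structured data is template filling. The sketch below is an illustration only and does not describe Narrative Science's actual system. The function names are hypothetical, and the figures are the two earnings-per-share values quoted in the story above.

```python
def pct_change(new: float, old: float) -> float:
    """Percentage change from old to new."""
    return (new - old) / old * 100.0


def earnings_preview(company: str, ticker: str,
                     estimate_eps: float, year_ago_eps: float) -> str:
    """Fill a fixed sentence template from structured earnings data."""
    change = pct_change(estimate_eps, year_ago_eps)
    direction = "up" if change >= 0 else "down"
    return (
        f"Analysts expect {company} ({ticker}) to report earnings of "
        f"${estimate_eps:.2f} per share, {direction} {abs(change):.1f}% from "
        f"${year_ago_eps:.2f} per share a year ago."
    )


story = earnings_preview("Molson Coors Brewing", "TAP", 1.34, 1.14)
print(story)
```

The 17.5% figure in the story matches the change from $1.14 to $1.34, so the sentence could be computed entirely from two database fields.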

Ontologies and Graphs

What objects and relations are available?


Often represented as class hierarchy. Arrows = “is_a” relation
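A small sketch of a class hierarchy stored as "is_a" arrows, with a lookup that follows the arrows upward. The classes are made up for illustration and are not taken from Cyc or any real ontology. Each class has a single parent here, which real ontologies do not require.

```python
# child -> parent, one "is_a" arrow per entry (hypothetical classes).
IS_A = {
    "dog": "mammal",
    "mammal": "animal",
    "animal": "thing",
    "newspaper": "organization",
    "organization": "thing",
}


def ancestors(cls: str) -> list:
    """Follow is_a arrows upward and return every superclass in order."""
    chain = []
    while cls in IS_A:
        cls = IS_A[cls]
        chain.append(cls)
    return chain


def is_a(cls: str, candidate: str) -> bool:
    """True if cls equals candidate or inherits from it transitively."""
    return cls == candidate or candidate in ancestors(cls)


print(ancestors("dog"))
print(is_a("newspaper", "organization"))
```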


(Part of) a real ontology, from Cyc

News as relations between entities

“Alice attended the wedding”

attended(alice, wedding)

“IBM was founded in 1917.”

founded(IBM, 1917)

“Hurricane Sandy hit New York”

hit(hurricane_sandy, New_York)

Encode facts as relation(subject, object), also written (subject relation object)
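The three example facts can be sketched as (subject, relation, object) tuples, with a lookup that treats `None` as a wildcard. The `match` helper is a hypothetical name used only for this illustration.

```python
# Facts from the examples above, stored as (subject, relation, object).
facts = [
    ("alice", "attended", "wedding"),
    ("IBM", "founded", "1917"),
    ("hurricane_sandy", "hit", "New_York"),
]


def match(facts, subject=None, relation=None, obj=None):
    """Return every fact consistent with the given fields (None = anything)."""
    return [
        (s, r, o)
        for (s, r, o) in facts
        if subject in (None, s) and relation in (None, r) and obj in (None, o)
    ]


print(match(facts, relation="hit"))   # what hit what?
print(match(facts, subject="alice"))  # everything known about alice
```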

Things we could do with this

Question answering

“The granddaughter of which actor starred in E.T.?”

(?x acted-in “E.T.”)(?y is-a actor)(?x granddaughter-of ?y)

Inference

(bob brother-of alice) (alice mother-of lucy) => (bob uncle-of lucy)

Answer questions using inference

“How many executives of publicly-traded Canadian companies died in car crashes?”
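A minimal sketch of both ideas over a tiny triple store: a query is a list of patterns with `?variables`, and an inference rule is a query whose results become new facts. The helper names and the small fact set are assumptions for illustration. A real system would use a triple store and a query language rather than this loop.

```python
def is_var(term):
    return term.startswith("?")


def unify(pattern, fact, bindings):
    """Extend bindings so pattern matches fact, or return None on conflict."""
    bindings = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings


def query(facts, patterns):
    """Return every variable binding that satisfies all patterns."""
    results = [{}]
    for pattern in patterns:
        next_results = []
        for bindings in results:
            for fact in facts:
                extended = unify(pattern, fact, bindings)
                if extended is not None:
                    next_results.append(extended)
        results = next_results
    return results


facts = {
    ("drew_barrymore", "acted-in", "E.T."),
    ("drew_barrymore", "granddaughter-of", "john_barrymore"),
    ("john_barrymore", "is-a", "actor"),
    ("bob", "brother-of", "alice"),
    ("alice", "mother-of", "lucy"),
}

# "The granddaughter of which actor starred in E.T.?"
answers = query(facts, [
    ("?x", "acted-in", "E.T."),
    ("?y", "is-a", "actor"),
    ("?x", "granddaughter-of", "?y"),
])


def infer_uncles(facts):
    """(x brother-of y) and (y mother-of z) => (x uncle-of z)."""
    rule = [("?x", "brother-of", "?y"), ("?y", "mother-of", "?z")]
    return {(b["?x"], "uncle-of", b["?z"]) for b in query(facts, rule)}


print(answers)
print(infer_uncles(facts))
```

Answering questions using inference then amounts to running the rules first and querying the enlarged fact set.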

Every big news org has their own big ontology :(

topics, people, organizations, places


Yaaay Linked Data!

Triples of (subject relation object), each a URL or literal

<urn:x-states:New%20York> <http://purl.org/dc/terms/alternative> "NY"

<http://dbpedia.org/resource/Columbia_University> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/CollegeOrUniversity>

Abbreviations possible with many formats

<http://dbpedia.org/resource/Columbia_University> rdf:type ns6:CollegeOrUniversity
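The abbreviation is plain prefix substitution, sketched below. The `rdf` namespace is the standard one. Mapping `ns6` to `http://schema.org/` is inferred from the two forms of the triple shown above, and the function names are hypothetical.

```python
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "ns6": "http://schema.org/",  # inferred from the example triple
}


def expand(term: str) -> str:
    """Turn <full-uri> or prefix:local into a full URI string."""
    if term.startswith("<") and term.endswith(">"):
        return term[1:-1]
    prefix, _, local = term.partition(":")
    if prefix in PREFIXES:
        return PREFIXES[prefix] + local
    return term  # literal or unknown prefix: leave unchanged


def abbreviate(uri: str) -> str:
    """Use a known prefix if one applies, else keep the full <uri> form."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return f"<{uri}>"


print(expand("rdf:type"))
print(abbreviate("http://schema.org/CollegeOrUniversity"))
```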

NYT API can return linked data

{
  "title": "Syria's Rebels Open Talks on Forging United Political Front",
  "body": "BEIRUT, Lebanon — Syria’s fractious opposition groups began negotiations in Doha, Qatar, on Sunday to forge a more unified front to reshape the political landscape in a bloody conflict that claims more than 100 lives virtually every day. Given the scant prospects that any attempt to restructure the opposition will succeed — the",
  "dbpedia_resource_url": [
    "http://dbpedia.org/resource/Hillary_Rodham_Clinton",
    "http://dbpedia.org/resource/Bashar_al-Assad"
  ],
  "facet_terms": "CLINTON, HILLARY RODHAM ASSAD, BASHAR AL- SYRIA DOHA (QATAR) SYRIAN NATIONAL COUNCIL STATE DEPARTMENT WAR AND REVOLUTION DEFENSE AND MILITARY FORCES"
}
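A short sketch of how a response like this could be turned into triples. It parses a trimmed copy of the JSON above from a string and makes no call to the actual NYT API. The `mentions` relation and the `entity_name` helper are hypothetical.

```python
import json

# Trimmed copy of the example response; body and facet_terms omitted.
response_text = """
{
  "title": "Syria's Rebels Open Talks on Forging United Political Front",
  "dbpedia_resource_url": [
    "http://dbpedia.org/resource/Hillary_Rodham_Clinton",
    "http://dbpedia.org/resource/Bashar_al-Assad"
  ]
}
"""

article = json.loads(response_text)


def entity_name(url: str) -> str:
    """Readable name from the last path segment of a DBpedia URL."""
    return url.rsplit("/", 1)[-1].replace("_", " ")


# One (article, mentions, entity) triple per linked resource.
triples = [
    (article["title"], "mentions", url)
    for url in article["dbpedia_resource_url"]
]
entities = [entity_name(url) for url in article["dbpedia_resource_url"]]
print(entities)
```

Because the entity identifiers are DBpedia URLs, these triples can be joined with any other dataset that uses the same URLs.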

Property Graphs in the Panama Papers

Relations from Text

Objects and relations in text?

names, dates, places, verbs.
Named Entity Recognition

Extract subjects and objects from text. Also resolve pronouns if possible.

"Gov. Andrew M. Cuomo on Wednesday gave a sea wall the nod. Because of the recent history of powerful storms hitting the area, he said, elected officials have a responsibility to consider new and innovative plans to prevent similar damage in the future."

Relations from sentence parsing

“The water that made rivers of Avenues C and D receded on Tuesday, and the East Village was a mixture of disaster and nonchalance. A group of young men in pajama pants and shorts threw a football on East 12th Street, while workers pumped the basement of CHP Hardware on Avenue C and Eighth Street.”

subject verb object

Stanford Open IE

Ontology explosions

(water made rivers of Avenues C and D)

(East Village was a mixture of disaster and nonchalance)

(group of young men in pajama pants and shorts threw football)

(workers pumped the basement of CHP Hardware)

Do we have all of these in the ontology?
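The question can be made concrete by checking each extracted relation against a fixed vocabulary. In this sketch the split of each triple into subject, relation, and object is an assumption, and `KNOWN_RELATIONS` is a made-up stand-in for a real ontology's relation set.

```python
# Open IE style triples from the passage, split by hand for illustration.
extracted = [
    ("water", "made", "rivers of Avenues C and D"),
    ("East Village", "was", "a mixture of disaster and nonchalance"),
    ("group of young men in pajama pants and shorts", "threw", "football"),
    ("workers", "pumped", "the basement of CHP Hardware"),
]

# Hypothetical relation vocabulary of an existing ontology.
KNOWN_RELATIONS = {"was", "hit", "attended", "founded"}

# Relations the extractor produced that the ontology cannot represent.
missing = sorted({rel for _, rel, _ in extracted if rel not in KNOWN_RELATIONS})
print(missing)
```

Every new verb is potentially a new relation, which is why open extraction makes the ontology grow without bound.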


“General Question Answering”


Precision/recall tradeoff. State of the art is IBM’s DeepQA
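A small sketch of the tradeoff: a system that answers only when its confidence clears a threshold gains precision as the threshold rises and loses recall. The confidence scores and correctness labels below are invented for illustration and are not DeepQA results.

```python
# (confidence, answer_is_correct) for six hypothetical questions.
candidates = [
    (0.95, True),
    (0.90, True),
    (0.80, False),
    (0.60, True),
    (0.40, False),
    (0.20, True),
]


def precision_recall(candidates, threshold):
    """Answer only when confidence >= threshold, then score the answers."""
    answered = [ok for conf, ok in candidates if conf >= threshold]
    total_correct = sum(ok for _, ok in candidates)
    true_positives = sum(answered)
    # Convention: precision is 1.0 when nothing is answered.
    precision = true_positives / len(answered) if answered else 1.0
    recall = true_positives / total_correct if total_correct else 0.0
    return precision, recall


for threshold in (0.85, 0.5, 0.1):
    print(threshold, precision_recall(candidates, threshold))
```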


DeepQA use of structured data

“Watson can also use detected relations to query a triple store and directly generate candidate answers. Due to the breadth of relations in the Jeopardy domain and the variety of ways in which they are expressed, however, Watson’s current ability to effectively use curated databases to simply “look up” the answers is limited to fewer than 2 percent of the clues.”

- Ferrucci et al., “Building Watson”
