
Enterprise Information Integration and Semantic Technologies at the World Bank

Denisa Popescu
Enterprise Architecture
World Bank Group
Presentation Outline

• Bank’s Information Challenges


• SAS Teragram Technologies Overview
• How Teragram Works in the Bank
• Key Outcomes & Lessons Learned
Bank’s Information Challenges
World Bank Group
• World Bank Group is an international development
organization providing loans, grants and knowledge and
advisory services to developing countries for a wide array of
purposes that include education, health, public
administration, infrastructure, financial and private sector
development, agriculture and environmental and natural
resource management.

• Office of the Enterprise Architecture is part of the Central IT
Department and is responsible for the Enterprise Architecture
Framework, Enterprise Information and Technology
Standards and Policies, and Shared Enterprise Information
Platforms and Tools.
Bank’s Architectural & Information Challenges

• Numerous repositories that contain large amounts of information


• Most of our information is unstructured (.pdf, .doc, .txt, .ppt, .html)
• Rely on staff to “file” information and add metadata
• Lack of authoritative reference sources

As a result,
• Uneven capture of information and metadata across the
Bank’s institutional repositories
• Similar information resides in multiple repositories
• Multiple representations for same type of “information”
• Staff can’t find related information
Information in Bank’s Environment
[Figure: Knowledge sharing. People create information, and people find information by searching or browsing repositories.
Structured Data: operational/transaction data (Operations, Human Resources, Loans, Financial Mgmt, etc.), described by data entities, attributes, relationships and data definitions (varchar(x), number, character, primary key, foreign key, etc.).
Unstructured Information: content such as email, records, documents, books, multimedia and web pages, described by metadata, i.e. attributes describing unstructured content (Title, Author, Date, Speaker, File format, Topic, Project ID, Country, Business Function, ...).
Example documents shown: TOR, Comments on PCN review meeting (I. Purpose, II. Participants, III. Findings, IV. ...), Conference Proceedings.]

Metadata is the glue


But the problem with metadata is that…

• Quality of metadata is uneven

• It takes too much effort for the user to put it in

• Each individual (creator or searcher) may have
different perspectives on how to describe
information
What are Semantic Technologies?

• Semantic technologies provide an abstraction layer above
existing IT technologies that enables bridging and
interconnection of information, people, and processes.

• In the World Bank, we are using Conceptual Information
Models and SAS Teragram Technologies to create this layer.
A unified Enterprise Information Architecture
[Figure: A unified Enterprise Information Architecture.
Business Process: Designing Structures (Policy & Guidance; Discover/Design Architecture; Build Structural Schemes) and Managing Information (Create, Capture & Catalog; Organize, Manage & Publish Collections; Search, Contextualize & Deliver to Audience).
Enterprise Architecture: Business, Application, Information and Technology architecture, with the Conceptual Information Model and Master Data; Information Structures, Information Access/Usage, Information Distribution/Movement.
Information Management Framework: Metadata; Provide Administrative Reports & Manage Workflow; Governance: Policies, Procedures & Standards; Architecture Governance (Models, Stewards, Data Harmonization).
Create, Capture & Catalog: structured information (SAP, P/Soft, other DBs) and unstructured information (Documents, Email, Multimedia, etc.); Capture Metadata (Automated & Manual).
Organize, Manage & Publish Collections: Records Management (Retention, etc.), Document Management, Portal, Web.
Search, Contextualize & Deliver to Audience: HQ Staff, CO Staff, Partners, Public.]



Corporate Data Model Framework
[Figure: Shared Information Domains / CDM Data Domains.
Domains shown: Party (Employee, Vendor, Client, Consultant, Business Partner), Project, Finance, WB Product & Service, Policy, Organization, Reference Data, Documents & Reports, Identity, Theme, Sector, Geographical, Country.
Annotation shown in the diagram: (Cost Centre, Fund Centre, Chart of Account).]

For each Data Domain, Conceptual & Logical Data Models, Data Dictionary, Data Standards are provided
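Core Metadata Standard for Unstructured Information

[Figure: core metadata attributes and the data domains they link to.
Attributes: Title, Author, Owner, Topics, Keywords, Abstract, Core Business Function, Document Date, Language, Extension, Country, Project ID, ResourceIdentifier.
Linked domains: Project, Client, Party, Identity.]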
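
The attributes named on this slide can be pictured as one record per document. The sketch below is only an illustration: the class name, field names, types and the `missing_fields` helper are assumptions made for this example and are not taken from the Bank's standard.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import List, Optional


@dataclass
class CoreMetadata:
    """Hypothetical record for the core attributes listed on the slide."""
    resource_identifier: str
    title: str
    authors: List[str] = field(default_factory=list)
    owner: Optional[str] = None
    topics: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    abstract: Optional[str] = None
    business_function: Optional[str] = None
    document_date: Optional[date] = None
    language: Optional[str] = None
    extension: Optional[str] = None
    countries: List[str] = field(default_factory=list)
    project_id: Optional[str] = None

    def missing_fields(self) -> List[str]:
        """Return the names of attributes that are still empty."""
        return [
            name
            for name, value in asdict(self).items()
            if value in (None, "", [])
        ]


# Example: a record with only the identifier and title filled in.
record = CoreMetadata(
    resource_identifier="doc-001",  # made-up identifier
    title="Sustainable tourism and cultural heritage",
)
print(record.missing_fields())
```

A check like `missing_fields` is one way to see how unevenly metadata is captured across a collection before and after automated extraction.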
Automatic Metadata Capture using Teragram

• Automatic Metadata Capture to generate consistent
values for core metadata across information
collections

• For high-value information collections, automated
metadata extraction strengthens the information
quality control function (e.g. indexers)
Teragram Technologies Overview
What does SAS / Teragram do?
• SAS Teragram applies natural language processing (NLP) and
advanced linguistic techniques to automatically extract
relevant concepts and categorize large volumes of multilingual
content.

– Rule-based Automatic Categorization


– Entity and Fact/Event Extraction
– Document Summarization
– Document De-duplication
– Noun Phrase extraction
– Clustering
– Language detection
– Tokenization, Stemming, Part-of-speech tagging, …
Why is the Bank using Teragram?

• Teragram will allow the Bank to standardize the description of
information across multiple systems and programmatically generate
metadata:

– Standardization improves the consistency of metadata

– Automatic metadata capture saves time and resources

– Ability to process and describe huge amounts of information

– Will improve “findability” of information by providing data-driven
browsing structures
Case in Point: How Teragram Works in the Bank
Case in Point
At the World Bank, we create so much information that no number of human eyes
could ever effectively categorize all of it in a timely manner.

You might think it would be easy to tell what the document you’re reading is
about. However, this software can tell us not only what you think it’s about,
but what the Bank thinks it’s about.

Fortunately, we can automatically process a great deal of it using Teragram,
which scans documents, recognizes terms and categorizes them for us. This is
often more effective than letting a human being try to figure out what a
document is about.
For example, the Bank produces a working paper on “Sustainable tourism
and cultural heritage”. This report provides an overview of the relation
between cultural heritage preservation and tourism and presents
strategies for promoting sustainability in the tourism industry associated with
cultural heritage sites and natural environments.

[Figure: different staff members describe the same report in different ways]
• “That report obviously belongs under ‘Eco-Tourism’.”
• “I see this mainly as a ‘sustainable development’ project….”
• “This concerns preservation of heritage sites.”
• “Since we don’t have a folder for ‘tourism industry’, I’ll just tag this ‘industry’ for now.”
• “A lot of these projects need to be consistent with the country’s cultural policy.”
Automatic Metadata Capture: Documents & Reports Library
Example of Browsing Structures: Documents & Reports Library

Enterprise Topic Taxonomy
Automatic Metadata Capture: E-Library

Teragram-generated Topics, Keywords, Region
Example of Use of Thesaurus in Search

Teragram-generated (Thesaurus)
Automatic Metadata Extraction and Categorization
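[Figure: processing flow. Raw content is grouped into collections; Teragram profiles are applied to extract metadata; the content and its metadata pass through quality control; content delivery then takes place through Search, Syndication and Browse.]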
Rule-based Categorization: Enterprise Topics
Rule-based Categorization: Business Functions
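
As a rough illustration of rule-based categorization, the sketch below assigns topics to a text by counting occurrences of trigger terms. It is not Teragram's rule language: the `TOPIC_RULES` table, the `categorize` function and the `min_hits` threshold are all hypothetical, and real profiles are likely to use richer rules (weights, proximity, exclusions).

```python
import re
from typing import Dict, List

# Hypothetical rules: topic name -> trigger terms. Not the Bank's taxonomy.
TOPIC_RULES: Dict[str, List[str]] = {
    "Tourism": ["tourism", "tourist", "tourists"],
    "Cultural Heritage": ["cultural heritage", "heritage preservation"],
    "Environment": ["natural environment", "natural environments", "sustainability"],
}


def categorize(text: str,
               rules: Dict[str, List[str]] = TOPIC_RULES,
               min_hits: int = 1) -> Dict[str, int]:
    """Return each topic whose trigger terms occur at least min_hits times."""
    lowered = text.lower()
    scores: Dict[str, int] = {}
    for topic, terms in rules.items():
        # Count whole-word (or whole-phrase) occurrences of every term.
        hits = sum(
            len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
            for term in terms
        )
        if hits >= min_hits:
            scores[topic] = hits
    return scores


sample = ("Strategies for promoting sustainability in the tourism industry "
          "associated with cultural heritage sites and natural environments.")
print(categorize(sample))
```

Because the rules live in a table rather than in code, they can be reviewed and refined by the people who own the taxonomy.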
Grammar-based Concept Extraction: People Names
RegEx-based Concept Extraction: Project Identifier
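
A minimal sketch of regex-based identifier extraction follows. The pattern is an assumption made for this example: it treats a project identifier as the letter "P" followed by six digits. The slides do not state the actual format or the expression used in the Teragram profile, and the function name is hypothetical.

```python
import re
from typing import List

# Assumed format for illustration only: "P" followed by exactly six digits.
PROJECT_ID_PATTERN = re.compile(r"\bP\d{6}\b")


def extract_project_ids(text: str) -> List[str]:
    """Return the distinct project identifiers in order of first appearance."""
    found: List[str] = []
    for match in PROJECT_ID_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


sample = "Project P123456 builds on P654321; see also P123456 and PX12345."
print(extract_project_ids(sample))
```

The word boundaries keep the pattern from matching inside longer tokens, such as an identifier with seven digits.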
Summarizer: Project Identification Document
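
The slides do not describe how the summarizer works. As a stand-in, the sketch below shows one common extractive approach: score each sentence by the average frequency of its non-stopword terms and keep the top-scoring sentences in their original order. The stopword list, function name and scoring are assumptions made for this example.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is",
             "on", "with", "this", "that"}


def _terms(text: str) -> List[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text: str, max_sentences: int = 2) -> str:
    """Keep the max_sentences highest-scoring sentences, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_terms(text))

    def score(sentence: str) -> float:
        tokens = _terms(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


sample = ("The project improves rural roads. "
          "Rural roads connect farmers to markets. "
          "Lunch was served at noon.")
print(summarize(sample))
```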
Other uses

• Find and link similar documents (e.g. versions, parts contained
in other documents)
• Extract Institutions Referenced in the document
• Extract People Referenced in the document
• Extract Document date
• Extract Location (country, region, city)
• Extract Title of document
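
For the first item above, finding and linking similar documents, one well-known technique is to compare sets of overlapping word sequences (shingles) using Jaccard similarity. The sketch below illustrates that idea only; it is not a description of how Teragram detects duplicates, and the function names and shingle size are assumptions.

```python
import re
from typing import Set, Tuple


def shingles(text: str, size: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Share of shingles the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, size), shingles(b, size)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


v1 = "the quick brown fox jumps"
v2 = "the quick brown fox sleeps"
print(jaccard(v1, v2))
```

Pairs that score above a chosen threshold can be linked as probable versions of one another; the threshold itself has to be tuned per collection.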
Key Outcomes & Lessons Learned
Key Outcomes

• Improve quality and consistency of metadata and reference
sources
• Increase productivity of the metadata capture process
– Prior to using Teragram's technology, Bank staff categorized three
electronic documents per hour. Teragram now drives 50,000 PDF
pages per hour through its platform, dramatically improving the
processing rate.
• Improve the availability & quantity of metadata
– Prior to Teragram, Bank editors manually uncovered four to five
keywords per document. Today, the software identifies 70-300
keywords per document.
Lessons Learned

• Understand the characteristics of the information collections and the way
a person decides to categorize the information

• Ensure that you can derive rules from the information and context

• Large initial investment in building the profiles

• Iterative process: use feedback to improve the profiles over time

• Link the initiative to Master Data/Reference Data Program

• Get buy-in from the Business Departments in the organization
