Popescu Denisa

Enterprise Information Integration and
Semantic Technologies at the World Bank
Denisa Popescu
Enterprise Architecture
World Bank Group
Presentation Outline
• Bank’s Information Challenges
• SAS Teragram Technologies Overview
• How Teragram Works in the Bank
• Key Outcomes & Lessons Learned
Bank’s Information Challenges
World Bank Group
• World Bank Group is an international development
organization providing loans, grants and knowledge and
advisory services to developing countries for a wide array of
purposes that include education, health, public
administration, infrastructure, financial and private sector
development, agriculture and environmental and natural
resource management.
• Office of the Enterprise Architecture is part of the Central IT
Department and is responsible for the Enterprise Architecture
Framework, Enterprise Information and Technology
Standards and Policies, and Shared Enterprise Information
Platforms and Tools.
Bank’s Architectural & Information Challenges
• Numerous repositories that contain large amounts of information
• Most of our information is unstructured (.pdf, .doc, .txt, .ppt, .html)
• Rely on staff to “file” information and add metadata
• Lack of authoritative reference sources
As a result,
• Uneven capture of information and metadata across the
Bank’s institutional repositories
• Similar information resides in multiple repositories
• Multiple representations for same type of “information”
• Staff can’t find related information
Information in Bank’s Environment
[Diagram: how information is created, described and found in the Bank]
• Knowledge sharing: people create information; people find information
by searching or browsing repositories
• Structured Data: operational/transaction data (Operations, Human
Resources, Loans, Financial Mgmt, etc.)
• Unstructured Information: content (Email, Records, Documents, Books,
Multimedia, Web Pages)
• Example documents: TOR; Comments on PCN review meeting (I. Purpose,
II. Participants, III. Findings, IV. …); Conference Proceedings
• Metadata: attributes describing unstructured content (Author, Title,
Date, Speaker, File format, Topic, Project ID, Country, Business Function, …)
• Data: Entities, Attributes, Relationships; data definitions such as
varchar(x), number, character, primary key, foreign key, etc.
Metadata is the glue

But the problem with metadata is that…
• Quality of metadata is uneven
• It takes too much effort for the user to put it in
• Each individual (creator or searcher) may have
different perspectives on how to describe
information
What are Semantic Technologies?
• Semantic technologies provide an abstraction layer above
existing IT technologies that enables bridging and
interconnection of information, people, and processes.
• In the World Bank, we are using Conceptual Information
Models and SAS Teragram Technologies to create this layer.
A unified Enterprise Information Architecture
[Diagram: unified Enterprise Information Architecture]
• Business Process
• Designing Structures: Policy & Guidance; Discover/Design Architecture;
Build Structural Schemes
• Managing Information (Information Management Framework):
– Create, Capture & Catalog Workflow, including Capture Metadata
(Automated & Manual)
– Organize, Manage & Publish Collections
– Search, Contextualize & Deliver to Audience
– Provide Administrative Reports & Manage Workflow
• Governance: Policies, Procedures & Standards
• Enterprise Architecture Governance (Models, Stewards, Data Harmonization)
• Architecture layers: Business, Application, Information, Technology;
Conceptual Information Model; Master Data
• Structured information: SAP, P/Soft, Other DBs
• Unstructured information: Documents, Email, Multimedia, etc.
• Information Access/Usage; Information Distribution/Movement
• Systems: Records Management (Retention, etc.), Document Management,
Portal, Web
• Audience: HQ Staff, CO Staff, Partners, Public
Corporate Data Model Framework
[Diagram: Shared Information Domains / CDM Data Domains]
• Party: Employee, Vendor, Client/Consultant, Business Partner
• Finance
• WB Product & Service
• Project
• Organization (Cost Centre, Fund Centre, Chart of Account)
• Policy
• Reference Data
• Identity
• Documents & Reports
• Theme, Sector
• Geographical, Country
For each Data Domain, Conceptual & Logical Data Models, a Data Dictionary
and Data Standards are provided.
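Core Metadata Standard for Unstructured Information
[Diagram: core metadata fields and the data domains they link to]
• Linked data domains: Project, Client, Party, Identity
• Core fields: Title, Topics, Author, Business Function, Owner, Keywords,
Abstract, Document Date, Language, Extension, Country, Project ID,
ResourceIdentifier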
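As an illustration only, the core fields named on this slide could be held in a simple record type. The sketch below uses Python; the field names come from the slide, while the types, optionality and the missing_fields helper are assumptions made for the example, not the Bank's actual standard.

```python
from dataclasses import dataclass, field, asdict
from datetime import date
from typing import Optional


@dataclass
class CoreMetadata:
    """One record per unstructured document (field names from the slide, types assumed)."""
    resource_identifier: str
    title: str
    author: list[str] = field(default_factory=list)
    owner: Optional[str] = None
    document_date: Optional[date] = None
    language: Optional[str] = None
    extension: Optional[str] = None          # file format, e.g. "pdf"
    country: list[str] = field(default_factory=list)
    project_id: Optional[str] = None
    business_function: Optional[str] = None
    topics: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    abstract: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """Hypothetical helper: names of fields that are still empty."""
        return [name for name, value in asdict(self).items()
                if value in (None, "", [])]
```

A record with only a few fields filled in reports the rest through missing_fields(), which is one way a quality control step could spot uneven metadata.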
Automatic Metadata Capture using Teragram
• Automatic Metadata Capture to generate consistent
values for core metadata across information
collections
• For high-value information collections, automated
metadata extraction strengthens the information
quality control function (e.g. indexers)
Teragram Technologies Overview
What does SAS / Teragram do?
• SAS Teragram applies natural language processing (NLP) and
advanced linguistic techniques to automatically extract
relevant concepts and categorize large volumes of multilingual
content.
– Rule-based Automatic Categorization
– Entity and Fact/Event Extraction
– Document Summarization
– Document De-duplication
– Noun Phrase extraction
– Clustering
– Language detection
– Tokenization, Stemming, Part-of-speech tagging, …
Why is the Bank using Teragram?
• Teragram will allow the Bank to standardize description of
information across multiple systems and programmatically generate
metadata:
– Standardization improves the consistency of metadata
– Automatic metadata capture saves time and resources
– Ability to process and describe huge amounts of information
– Will improve “findability” of information by providing data-driven
browsing structures
Case in Point: How Teragram Works in the Bank
Case in Point
At the World Bank, we create so much information that no number of human eyes
could ever effectively categorize all of it in a timely manner.
Fortunately, we can automatically process a great deal of it using Teragram
software that scans documents, recognizes terms and categorizes them for us.
This is often more effective than letting a human being try to figure out
what a document is about.
You might think it would be easy to tell what the document you’re reading is
about. However, this software can tell us not only what you think it’s about,
but what the Bank thinks it’s about.
For example, the Bank produces a working paper on “Sustainable tourism
and cultural heritage”. This report provides an overview of the relation
between cultural heritage preservation and tourism and presents
strategies for promoting sustainability in the tourism industry associated
with cultural heritage sites and natural environments.
[Illustration: five staff members each describe the same report differently]
• “That report obviously belongs under ‘Eco-Tourism’.”
• “I see this mainly as a ‘sustainable development’ project….”
• “This concerns preservation of heritage sites.”
• “Since we don’t have a folder for ‘tourism industry’, I’ll just
tag this ‘industry’ for now.”
• “A lot of these projects need to be consistent with the country’s
cultural policy.”
Automatic Metadata Capture: Documents & Reports Library
Example of Browsing Structures: Documents & Reports Library
[Screenshot callout: Enterprise Topic Taxonomy]
Automatic Metadata Capture: E-Library
[Screenshot callout: Teragram-generated Topics, Keywords, Region]
Example of Use of Thesaurus in Search
[Screenshot callout: Teragram-generated (Thesaurus)]
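One common way a thesaurus is used in search is query expansion: the user's term is matched against preferred and alternative labels, and the related labels are added to the query. The Python sketch below shows that general idea only. The thesaurus entries and the expand_query function are hypothetical, and the slides do not say how the Bank's search engine actually consumes the generated thesaurus.

```python
# Hypothetical thesaurus: preferred term -> alternative labels (illustrative entries only).
THESAURUS = {
    "eco-tourism": ["ecotourism", "sustainable tourism"],
    "cultural heritage": ["heritage sites", "cultural property"],
}


def expand_query(query, thesaurus=THESAURUS):
    """Return the lowercased query followed by the labels of every matching thesaurus entry."""
    q = query.lower().strip()
    expanded = [q]
    for preferred, alternatives in thesaurus.items():
        group = [preferred, *alternatives]
        # An entry matches when any of its labels occurs in the query text.
        if any(term in q for term in group):
            for term in group:
                if term not in expanded:
                    expanded.append(term)
    return expanded
```

A search for "Ecotourism" would then also look for "eco-tourism" and "sustainable tourism", so documents tagged with either label are found.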
Automatic Metadata Extraction and Categorization
[Diagram: stages of automatic metadata extraction and categorization]
• Raw Content
• Apply Teragram Profiles
• Extract Metadata (content + metadata)
• Quality control
• Group into Collections
• Content delivery: Search, Syndication, Browse
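The stages in this diagram can be sketched as a small pipeline. The Python below is a minimal illustration under stated assumptions: a "profile" is modelled as a plain function from text to a list of values, and quality control is reduced to checking that required fields were filled. Neither reflects Teragram's real interfaces, and all names are hypothetical.

```python
def apply_profiles(text, profiles):
    """Run every profile (field name -> extractor function) and keep non-empty results."""
    metadata = {}
    for field_name, extract in profiles.items():
        values = extract(text)
        if values:
            metadata[field_name] = values
    return metadata


def missing_required(metadata, required=("topics",)):
    """Quality control stand-in: list required fields that no profile filled."""
    return [name for name in required if name not in metadata]


def process(raw_texts, profiles):
    """Split raw content into a deliverable collection and a manual review queue."""
    collection, review_queue = [], []
    for text in raw_texts:
        doc = {"text": text, "metadata": apply_profiles(text, profiles)}
        if missing_required(doc["metadata"]):
            review_queue.append(doc)
        else:
            collection.append(doc)
    return collection, review_queue
```

Documents that pass the check land in the collection used for search, syndication and browsing; the rest go to a queue for a human indexer.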
Rule-based Categorization: Enterprise Topics
Rule-based Categorization: Business Functions
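The rule screenshots for these slides are not preserved in the text. As a rough idea of what rule-based categorization means, the Python sketch below scores each category by weighted term matches and keeps those above a threshold. The categories, terms, weights and threshold are all invented for illustration and are not the Bank's rules or Teragram's rule syntax.

```python
import re

# Hypothetical rules: category -> {term: weight}. Values are illustrative only.
RULES = {
    "Tourism": {"tourism": 2, "tourist": 1, "heritage site": 1},
    "Environment": {"natural environment": 2, "sustainability": 1, "conservation": 1},
    "Education": {"school": 2, "curriculum": 1},
}


def categorize(text, rules=RULES, threshold=2):
    """Return categories whose weighted term count reaches the threshold, best first."""
    lowered = text.lower()
    scores = {}
    for category, terms in rules.items():
        # Terms match at a word start, so simple plural forms also count.
        score = sum(
            weight * len(re.findall(r"\b" + re.escape(term), lowered))
            for term, weight in terms.items()
        )
        if score >= threshold:
            scores[category] = score
    # Sort by descending score, then alphabetically for a stable order.
    return sorted(scores, key=lambda c: (-scores[c], c))
```

Because the same rules run over every document, two documents with the same wording always receive the same categories, which is the consistency benefit the slides describe.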
Grammar-based Concept Extraction: People Names
RegEx-based Concept Extraction: Project Identifier
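A regular-expression extractor of this kind can be sketched in a few lines of Python. The pattern below assumes that a project identifier is the letter "P" followed by six digits; that format and the sample identifiers in the usage note are assumptions for the example, since the slide's actual expression is not preserved.

```python
import re

# Assumption: identifiers look like "P" followed by exactly six digits.
PROJECT_ID = re.compile(r"\bP\d{6}\b")


def extract_project_ids(text):
    """Return distinct project identifiers in order of first appearance."""
    seen = []
    for match in PROJECT_ID.findall(text):
        if match not in seen:
            seen.append(match)
    return seen
```

Given the made-up text "See P123456 and P654321; P123456 was revised.", the function returns the two distinct identifiers once each. The word boundaries keep longer strings such as "AP1234567" from matching.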
Summarizer: Project Identification Document
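The slides do not describe how the summarizer works. For orientation, the Python sketch below shows one simple, well-known approach, frequency-based extractive summarization: sentences are scored by how common their words are in the document, and the top few are returned in their original order. It is a stand-in technique, not Teragram's method, and the stopword list is a small illustrative one.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "for", "is", "are",
             "this", "that", "with", "on", "it"}


def content_words(sentence):
    """Lowercased words of a sentence, without stopwords."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Pick the highest-scoring sentences and return them in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favoured automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```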
Other uses
• Find and link similar documents (e.g. versions, parts contained
in other documents)
• Extract Institutions Referenced in the document
• Extract People Referenced in the document
• Extract Document date
• Extract Location (country, region, city)
• Extract Title of document
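Finding versions of a document, or parts contained in other documents, is often done by comparing overlapping word sequences ("shingles"). The Python sketch below illustrates that general technique with Jaccard similarity for versions and a containment ratio for embedded parts. It is an assumption about how such linking could work, not a description of Teragram's de-duplication.

```python
def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Shared shingles divided by all shingles; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def containment(part, whole):
    """Share of the part's shingles that also occur in the whole."""
    return len(part & whole) / len(part) if part else 0.0
```

Two versions of a text that differ in one word score high on jaccard, while an excerpt copied into a longer document scores close to 1.0 on containment even when its Jaccard similarity is low.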
Key Outcomes & Lessons Learned
Key Outcomes
• Improve quality and consistency of metadata and reference
sources
• Increase productivity of the metadata capture process
– Prior to using Teragram's technology, Bank staff categorized three
electronic documents per hour. Teragram now drives 50,000 PDF
pages per hour through its platform, dramatically improving the
processing rate.
• Improve the availability & quantity of metadata
– Prior to Teragram, Bank editors manually uncovered four to five
keywords per document. Today, the software identifies 70-300
keywords per document.
Lessons Learned
• Understand the characteristics of the information collections and the way
a person decides to categorize the information
• Ensure that you can derive rules from the information and context
• Large initial investment in building the profiles
• Iterative process: use feedback to improve the profiles over time
• Link the initiative to Master Data/Reference Data Program
• Get buy-in from the Business Departments in the organization
