
Beyond Data Glossary 101: From Manual to Automated Discovery

Matthew Lawler lawlermj1@gmail.com
17 Sep 2019

Introduction
This case study walks through how I discovered a corpus of 5,000 words
and 1,000 acronyms by parsing 200,000 Data Warehouse (DW) column names.
The manually defined acronym list contained 530 acronyms, so this
roughly doubled the total.
In addition, each Data Glossary term was linked to the schema and the
column names in which it occurs.

Why Listen?
• For Data professionals, creating a Data (Business) Glossary
is an important first step in managing data.
• But...
• Are you confident that your Data Glossary is complete?
• Are your Data Glossary terms used in any database?
• Can you map the Data Glossary terms to your database
columns, to check for usage gaps?
• Can you separate Data Glossary acronyms and words?
• Do you maintain the Data Glossary automatically, or
are you struggling manually?

Who is this for?
• For Staff who need
– to understand common terms, especially new or transferred
staff.
• For Business Analysts who need
– to determine if systems support business goals and terms.
– to resolve confusion between business areas.
– to integrate across business areas.
• For Data Modellers who need
– to enforce more consistent design rules when generating DDL
and SQL.
– to improve design and development productivity.
– to publish metadata for Business Analysts.
– to review business terms against current data models and
databases.

Building a Corpus
• A corpus is the full set of words used in the enterprise.
• It is always specific to the enterprise.
• A corpus is usually built by collecting definitions from
Parliamentary Acts, Manuals, and Data Dictionaries.
• But most words are common and obvious.
• The valuable terms are unique terms, homonyms and
acronyms.
• Acronyms are important because they are shortened forms
of common phrases with shared meaning/semantics.
• Common words are filtered out, so that only the
acronyms and unique words are left.
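
The filtering step in the last bullet can be sketched in Python. This is a minimal illustration under stated assumptions, not the code from the talk: the names `COMMON_WORDS` and `filter_corpus` and the sample words are hypothetical.

```python
# Hypothetical sample data: a real list of common words would come from a
# general-purpose dictionary, and the candidates from enterprise documents.
COMMON_WORDS = {"order", "status", "total", "work"}


def filter_corpus(candidates, common_words=COMMON_WORDS):
    """Return the candidates that are not common words, de-duplicated and sorted."""
    seen = {}
    for word in candidates:
        key = word.lower()
        if key not in common_words:
            seen.setdefault(key, word)  # keep the first spelling encountered
    return sorted(seen.values(), key=str.lower)


candidates = ["Work", "WIP", "Order", "Macaddress", "wip", "Status", "GUID"]
valuable = filter_corpus(candidates)
print(valuable)
```

Comparison is case-insensitive, so "WIP" and "wip" count as one candidate. What remains is the short list of acronyms and unique words that need a definition.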

Parsing
• A parser is a function that takes text and builds a data structure.
• The data structure can be a list of Phrases.
• String -> [Phrase]
– Worksites -> [Worksites]
– Worksite -> [Worksite]
– Workskill -> [Work, Skill]
– Workstatus -> [Work, Status]
– Works -> [Works]
– Work -> [Work]
• Technical
– This is a semi-automated power tool. It grew out of my use of Excel
macros and awk scripts to solve this problem.
– This is a non-technical talk, so I will avoid any code review. Come to
CanFP if you want to see the code.
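
For readers who want something concrete, the String -> [Phrase] idea can be sketched in Python as a greedy longest-match parser. This is an assumption about one workable approach, not the implementation behind the talk. The `CORPUS` dictionary and the `parse` function are hypothetical, and treating Workskill as a MultipleWords phrase with an expansion is assumed by analogy with Workstatus.

```python
# Hypothetical mini-corpus: phrase -> expansion (None means "keep as is").
# Workstatus follows the MultipleWords example in the talk; Workskill is
# assumed to be defined the same way.
CORPUS = {
    "Work": None,
    "Works": None,
    "Worksite": None,
    "Worksites": None,
    "Workskill": ["Work", "Skill"],
    "Workstatus": ["Work", "Status"],
}


def parse(name, corpus=CORPUS):
    """Greedy longest-match parse of a name into a list of phrases."""
    by_upper = {phrase.upper(): phrase for phrase in corpus}
    text = name.upper()
    phrases, pos = [], 0
    while pos < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(text), pos, -1):
            phrase = by_upper.get(text[pos:end])
            if phrase is not None:
                phrases.extend(corpus[phrase] or [phrase])
                pos = end
                break
        else:
            # No known phrase starts here: keep the character unparsed.
            phrases.append(text[pos])
            pos += 1
    return phrases


for name in ["Worksites", "Worksite", "Workskill", "Workstatus", "Works", "Work"]:
    print(name, "->", parse(name))
```

Because the longest known phrase wins, "Works" is returned whole rather than as "Work" plus a stray "S". The quality of the parse therefore depends on how complete the corpus is.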

Data Model
[Data model diagram. Entities: Phrase Type, Authority, Domain, Phrase, Name, Snippet2Phrase, Name2Phrase, Phrase2Name. Legend: Input, Output.]

Phrase Type
A phrase type represents the type of word phrase.
This can be Acronym, Contraction, Letter, Multiple Words, etc.
Type | Definition | Example
Acronym | is any word formed from the initial letters of a group of words | WIP for Work In Progress
AllPhrase | is the default type for a normal word. (AKA Lexeme) | Work
Contraction | is any shortened word with missing letters. | Yr for Year
Letter | is a single alphabetic character. | E
MultipleWords | is a phrase that consists of more than 1 word. | Workstatus for [Work, Status]
Number | is a single numeric character. | 9
PastTense | is a phrase that occurs in the past. | Accrued
Plural | is a phrase that denotes quantity. | Activities
ProperNoun | is any name, such as an organisation, system name, etc. | Oracle
Term | is used for multiple word phrases that are almost a single phrase. | Macaddress
ZRubbish | is for misspellings and non-standard contractions | Iadc


Domain
A Domain is like a namespace.
Within a Domain there should be no homonyms (words with the same
spelling or sound but different meanings).
Homonyms can still occur across different Domains.
Domain Type | Domain Name
PNI | Physical Network Inventory
IT | Information Technology
HR | Human Resources
Finance | Finance
Engineering | Engineering
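
The "no homonyms inside a Domain" rule lends itself to an automated check. The following Python sketch is illustrative only. The function name `homonyms_within_domain` and the tuple layout are assumptions, and the IT meaning of CI is a made-up example of a cross-domain homonym.

```python
from collections import defaultdict


def homonyms_within_domain(entries):
    """Return {(domain, phrase): sorted meanings} for clashes inside one domain."""
    meanings = defaultdict(set)
    for domain, phrase, expansion in entries:
        meanings[(domain, phrase.upper())].add(expansion)
    return {key: sorted(vals) for key, vals in meanings.items() if len(vals) > 1}


entries = [
    ("PNI", "CI", "Configuration Item"),
    ("IT", "CI", "Continuous Integration"),  # hypothetical second meaning
    ("PNI", "NA", "Network Analyser"),
    ("PNI", "NA", "Not Applicable"),
]
clashes = homonyms_within_domain(entries)
print(clashes)
```

CI is not reported, because its two meanings live in different Domains. NA is reported, because both meanings sit inside PNI.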
Authority
Authority Type represents the 'Who' of phrases.
That is, which person or organisation has defined this phrase.
This is very useful for defusing definition wars.
Authority | Authority Type | Comment
Internal | Adhoc | Any term used by the organisation without an external authority.
Wiki | Adhoc | Womb of Ignorance, Kraziness and Incomprehension
Oracle | Commercial Organisation |
Kimball | Expert |
AG | Government |
Water Act 2007 | Parliamentary Act |
ANSI | Standards Organisation |
Phrase
A phrase is a single word, or common multiword
phrase. A set of phrases is a Corpus of words.
Phrase | Phrase Type | Expansion | Domain
WIP | Acronym | Work In Progress | AllDomains
Work | AllPhrase | | AllDomains
Yr | Contraction | Year | AllDomains
E | Letter | | AllDomains
Workstatus | MultipleWords | [Work, Status] | AllDomains
9 | Number | | AllDomains
Accrued | PastTense | | AllDomains
Works | Plural | | AllDomains
Oracle | ProperNoun | | Organisation
Macaddress | Term | | IT
Iadc | ZRubbish | | ZDomain
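
One possible way to hold such rows in code is a small typed record. The sketch below is a Python illustration, not the data model of the actual tool. The class and field names are assumptions, and only the phrase types and sample rows come from the slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PhraseType(Enum):
    ACRONYM = "Acronym"
    ALL_PHRASE = "AllPhrase"
    CONTRACTION = "Contraction"
    LETTER = "Letter"
    MULTIPLE_WORDS = "MultipleWords"
    NUMBER = "Number"
    PAST_TENSE = "PastTense"
    PLURAL = "Plural"
    PROPER_NOUN = "ProperNoun"
    TERM = "Term"
    Z_RUBBISH = "ZRubbish"


@dataclass(frozen=True)
class Phrase:
    phrase: str
    phrase_type: PhraseType
    expansion: Optional[str] = None  # only some types carry an expansion
    domain: str = "AllDomains"       # assumed default, as in most sample rows


corpus = [
    Phrase("WIP", PhraseType.ACRONYM, "Work In Progress"),
    Phrase("Work", PhraseType.ALL_PHRASE),
    Phrase("Yr", PhraseType.CONTRACTION, "Year"),
    Phrase("Macaddress", PhraseType.TERM, domain="IT"),
]

# Separating acronyms from ordinary words becomes a simple filter.
acronyms = [p.phrase for p in corpus if p.phrase_type is PhraseType.ACRONYM]
words = [p.phrase for p in corpus if p.phrase_type is PhraseType.ALL_PHRASE]
print(acronyms, words)
```

Keeping the type as an enumeration rather than free text means a misspelt type fails immediately instead of silently creating a new category.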
Column Name
This is the main input: the database names, including schema,
table and column. They can be extracted using
SQL from the database's metadata tables.
Schema | Table Name | ORD | Column Name
BIA_BA_CAL | A_APPOINTMENT_SUMMARY_T | 1 | REGION_KEY
BIA_BA_CAL | A_MAX_WORK_ORDER_STATUS_HISTORY_V_TABLE | 1 | WORK_ORDER_SK
BIA_BA_CAL | ARR_CONTRACT_VERSION_T | 1 | ARR_CONTRACT_KEY
BIA_BA_CAL | ARR_CONTRACT_VERSION_T | 7 | ROW_NATURAL_ID
BIA_BA_CAL | ARR_CONTRACT_VERSION_T | 8 | EFFECTIVE_FROM_TS
BIA_BA_CAL | ARR_CONTRACT_VERSION_T | 9 | EFFECTIVE_TO_TS
BIA_BA_CAL | ASSR_TASK_T | 1 | TASK_ID
BIA_BA_CAL | ASSR_TASK_T | 10 | INSTANCEID
BIA_BA_CAL | ASSR_TASK_T | 12 | ROOTREQUESTINSTANCEID
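
The extraction step can be tried out with a small Python sketch. It assumes SQLite purely so that the example is self-contained. The talk does not say which DBMS was used, and a real warehouse would more likely expose a catalogue view such as `INFORMATION_SCHEMA.COLUMNS` or Oracle's `ALL_TAB_COLUMNS`. The function name `column_names` and the cut-down tables are hypothetical.

```python
import sqlite3


def column_names(conn):
    """Yield (table, ordinal, column) rows, with ordinals starting at 1."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        for cid, column, *_ in conn.execute(f'PRAGMA table_info("{table}")'):
            yield table, cid + 1, column


# Cut-down stand-ins for two of the tables listed above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ASSR_TASK_T (TASK_ID INTEGER, INSTANCEID TEXT)")
conn.execute("CREATE TABLE ARR_CONTRACT_VERSION_T (ARR_CONTRACT_KEY INTEGER)")
rows = list(column_names(conn))
for row in rows:
    print(row)
```

SQLite has no schema level comparable to BIA_BA_CAL, so the sketch returns only table, ordinal and column. Against a warehouse catalogue the schema name would be one more selected column.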


Snippet2Phrase
This is a simple mapping of Phrases to Snippets.
Each Phrase is a key value defined in Phrase In.
Snippets can be upper or lower case, or some mixed case.
Cardinality = O(Phrase In) * 2
Phrase | Snippet
Accrued | ACCRUED
Activities | Activities
Activities | ACTIVITIES
E | E
Macaddress | Macaddress
Macaddress | MACADDRESS
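
A mapping like this can be generated rather than typed. The Python sketch below assumes that each phrase yields two snippets, the phrase as written and its upper-case form, which is one reading of the "* 2" in the cardinality. The function name `snippet_to_phrase` is hypothetical.

```python
def snippet_to_phrase(phrases):
    """Map each snippet spelling to its phrase: as written and in upper case."""
    mapping = {}
    for phrase in phrases:
        for snippet in (phrase, phrase.upper()):
            mapping[snippet] = phrase
    return mapping


mapping = snippet_to_phrase(["Accrued", "Activities", "E", "Macaddress"])
for snippet, phrase in sorted(mapping.items()):
    print(phrase, snippet)
```

A phrase that is already upper case, such as E, produces a single snippet, so the result holds at most twice as many rows as there are phrases.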
Name2Phrase
For each Name, this shows the phrase list, and any unparsed string.
Cardinality = O(Column Name) (e.g. 200,000).
The table shows examples of both correct parses and false positive parses.
name2PhraseOutName | name2PhraseOutSnippetsFinal | ? | Note
ACTIONWHENCOMPLETE | [Action,When,Complete] | 0 | No Underscore, but still works
ORDER_TOTAL_ELAPSED_DURATION_HOURS_WH | [Order,Total,Elapsed,Duration,Hours,Wh] | 0 | Underscore separator
EFFORTTRACKINGTOTALTIMESPENTHOURS | [Effort,Tracking,Total,Times,PE,NT,Hours] | 1 | Need to add Timespent
INSTANTIATIONNUMBER | [Inst,Anti,At,IO,N,Number] | 1 | Need to add Instantiation
NUMRETRIES | [Num,Ret,R,IES] | 1 | Need to add retries
PARENTSIGNAL | [Parents,I,G,NA,L] | 1 | Need to add parentsignal
SIMULATION_MESSAGE | [SI,M,UL,At,IO,N,Message] | 1 | Need to add Simulation
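
The effect of a missing phrase, and of adding it, can be reproduced in a few lines of Python. This sketch assumes a longest-match strategy built from a regular expression alternation, which may differ from how the real tool parses. The names `build_matcher` and `name_to_phrase` and the tiny phrase lists are hypothetical.

```python
import re


def build_matcher(phrases):
    # Longer phrases first, so the alternation prefers the longest match.
    ordered = sorted(phrases, key=len, reverse=True)
    return re.compile("|".join(re.escape(p.upper()) for p in ordered))


def name_to_phrase(name, phrases):
    """Return (phrase list, unparsed leftovers) for one column name."""
    display = {p.upper(): p for p in phrases}
    found, leftovers, pos = [], [], 0
    for m in build_matcher(phrases).finditer(name.upper()):
        leftovers.append(name[pos:m.start()])
        found.append(display[m.group()])
        pos = m.end()
    leftovers.append(name[pos:])
    # Underscores are separators, not unparsed text.
    return found, "".join(leftovers).replace("_", "")


phrases = ["Parents", "I", "G", "NA", "L"]
before = name_to_phrase("PARENTSIGNAL", phrases)
after = name_to_phrase("PARENTSIGNAL", phrases + ["Parentsignal"])
print(before)
print(after)
```

Before the fix the name is fully consumed, but by the wrong phrases, which is what makes it a false positive. Once Parentsignal is in the corpus, the longer phrase wins and the false parse disappears.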


Phrase2Name
For each Phrase, this shows all Names used.
Unused phrases are filtered.
Cardinality = O(Phrase In) (e.g. 6,000)
Name | Type | Expansion | Domain | Count | Used In Names | Note
CI | Acronym | Configuration Item | PNI | 2 | [TASK_CI,SERVICECI] | parses without underscore _
GUID | Acronym | Globally Unique ID | IT | 2 | [PHASE_GUID,DETAILSAPPGUID] | parses without underscore _
IES | Acronym | | NBN | 1 | [NUMRETRIES] | False parse
IO | Acronym | Input Output | IT | 2 | [SIMULATION_MESSAGE,INSTANTIATIONNUMBER] | False parse
NA | Acronym | Network Analyser/Not Applicable | PNI | 1 | [PARENTSIGNAL] | False parse
PE | Acronym | | WWM | 2 | [EFFORTTRACKINGTOTALTIMESPENTMINUTES,EFFORTTRACKINGTOTALTIMESPENTHOURS] | False parse
SI | Acronym | | NBN | 1 | [SIMULATION_MESSAGE] | False parse
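
Phrase2Name is Name2Phrase turned inside out. A minimal Python sketch of that inversion follows. The function name `invert` and the dictionary layout are assumptions, and the sample parses are taken from the Name2Phrase examples in this talk.

```python
from collections import defaultdict


def invert(name_to_phrases):
    """Turn {name: [phrase, ...]} into {phrase: sorted names using it}."""
    phrase_to_names = defaultdict(set)
    for name, phrases in name_to_phrases.items():
        for phrase in phrases:
            phrase_to_names[phrase].add(name)
    return {phrase: sorted(names) for phrase, names in phrase_to_names.items()}


name_to_phrases = {
    "SIMULATION_MESSAGE": ["SI", "M", "UL", "At", "IO", "N", "Message"],
    "INSTANTIATIONNUMBER": ["Inst", "Anti", "At", "IO", "N", "Number"],
    "NUMRETRIES": ["Num", "Ret", "R", "IES"],
}
phrase_to_names = invert(name_to_phrases)
for phrase in ["IO", "IES", "SI"]:
    names = phrase_to_names[phrase]
    print(phrase, len(names), names)
```

Phrases that never occur in a name simply do not appear in the result, which gives the "unused phrases are filtered" behaviour for free.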


Data Flow Diagram

[Data flow diagram. Data sets: Name, Snippet2Phrase, Name2Snippet, Phrase, Name2Phrase, Phrase2Name. Processes: Parse, Join, Invert. Legend: Input, Output.]

Demo
• PhraseIn: 6,000 Phrases
• ColumnName: 500 Names

Thanks
• The code is open source on GitHub at:
https://github.com/lawlermj1/DBGlossary
• DW Dictionary defines an ISO 11179 that could use this
code. See:
https://www.scribd.com/document/371481026/Datawarehouse-Dictionary

Future?
Extracting words from Documents.
Grammatical rules + lexemes
NLP - Natural Language Processing
