National Institute of Electronics & Information Technology: Gorakhpur Center

National Institute of Electronics & Information Technology
Gorakhpur Center
Ministry of Electronics & Information Technology (MeitY), Government of India
Shubhra Dubey
Project Engineer
http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

Contents to be covered
1. Document Structures
2. Document Understanding Framework Steps
3. Data extraction from Documents

Document Structures
• We can identify 3 main categories:

1. structured,
2. semi-structured and
3. unstructured.
• Each type of document comes with its own set of challenges.
• Documents come in many shapes and forms and are usually
combinations of the three classes above.

Structured Document
Structured documents generally focus on collecting information in a precise format, guiding the person who is filling
them with precise areas where each piece of data needs to be entered. These come in a fixed form and are generally
called forms.
Examples of structured documents: Surveys, questionnaires, Tax forms. These contain exclusively key-value pairs
and tables.

Semi-structured Document
• Semi-structured documents are documents that
do not follow a strict format the way structured
forms do and are not bound to specified data
fields. These don't have a fixed form but follow a
common enough format. They may contain
paragraphs as well, but data is mainly to be
found in key-value pairs.
• Examples of semi-structured
documents: Invoices, Receipts, Purchase Orders,
Healthcare lab reports.

Unstructured Document
• Unstructured documents are

documents in which the
information isn't organized
according to a clear, structured
model. These files are all easily
comprehensible by human beings,
yet much more difficult for a robot.
• Examples of unstructured
documents: Contracts, Annual
Reports. Some may contain key-
value pairs and tables, but much of
the data is in unstructured form
inside the text.

Document Understanding Framework Steps

Taxonomy
In this pre-processing step, you can add multiple document types and the
fields you are interested in extracting. For example, you can work with
Invoices, wanting to extract the vendor and the total amount, and with
medical forms, wanting to extract insured ID number and patient name.

Digitization
• As the documents are processed one by one, they go through the

digitization process. The difference for non-digital (scanned)
documents is that you need to apply the OCR engine of your
choice. The outputs of this step are the Document Object Model
and a string variable containing all the document text and are
passed down to the next steps.

Classification
• After digitization, the document is classified. If you are working

with multiple documents types in the same project, to extract data
properly you need to know what type of document you're working
with. The important thing is that you can use multiple classifiers in
the same scope, you can configure the classifiers and, later in the
framework, train them. The classification results help in applying
the right strategy in extraction.

Extraction
• Extraction is getting just the data you are interested in. For
example, extracting specific data from a 5-page document is quite
troublesome if you want to do it with string manipulation. In this
framework, you can use different extractors, for the different
document structures, in the same scope application. The
extraction results are passed further for validation.

Validation
• The extracted data can be validated by a human user through the

Validation Station. A best practice is to build logic around the
decision of adding or not a human validation step, with rules
depending on the specific use case to be implemented. Validation
results can then be exported and used in further automation
activities.

Export
• Once you have your validated information, you can use it as it is, or
save it in a DataTable format that can be converted very easy into
an Excel file.

Training Classifiers and Extractors
• Classification and Extraction are as efficient as the classifiers and

extractors used are. If a document wasn’t classified properly, it
means it was unknown to the active classifiers. The same way goes
for incorrect data extraction. The Framework provides the
opportunity to train the classifiers and the extractors, to improve
recognition of the documents and fields.

Thank You
Any Query

National Institute of Electronics & Information Technology: Gorakhpur Center

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

National Institute of Electronics & Information Technology: Gorakhpur Center

Uploaded by

Copyright:

Available Formats

National Institute of Electronics & Information Technology

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

• We can identify 3 main categories:

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

• Unstructured documents are

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

• As the documents are processed one by one, they go through the

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

• After digitization, the document is classified. If you are working

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

• The extracted data can be validated by a human user through the

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

• Classification and Extraction are as efficient as the classifiers and

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

http://www.nielit.gov.in/gorakhpur /GKP.NIELIT @GKP_NIELIT /NIELITIndia /school/NIELITIndia

You might also like