Document Separation and Classification

1 CONFIDENTIAL © 2013 Kofax, All rights reserved


1. Document Separation

The Kofax Platform has several means to separate documents.

The Document Separation tab is used to manage how multi-page images are separated into single
documents or how loose pages are grouped into multi-page documents. When this feature is enabled,
Kofax Transformation Modules - Server performs document separation before extraction. The
following options define how the server separates documents and handles unclassified pages.

 No document separation. Select this option to disable document separation for this
project. This option is selected by default.
 Standard Document Separation. Select this option to use the class properties and
project settings to determine how documents are separated. This option is cleared by
default.
 Duplex scan mode (front and back side are never split). This option is disabled unless
Standard Document Separation is selected.

Select this option if you have two-sided pages. The back side is ignored. This option is
cleared by default.

 Unclassified pages should be handled as. This option is disabled unless Standard
Document Separation is selected.

This option defines how document separation handles a page that could not be classified.
The following options are available:

 First page of new document


Used to handle an unclassified page as the first page of a new document. For example,
a multi-page document consists of four pages. Document separation processes the
pages sequentially and the first page belongs to class A, whereas the other pages stay
unclassified. If this option is selected, a new document is created for each
unclassified page, so document separation produces four single-page documents.

 Attachment to previous document


Used to handle an unclassified page as an attachment to the previously classified
document. For example, a multi-page document consists of four pages. Document
separation processes the pages sequentially and the first page belongs to class A,
whereas the other pages stay unclassified. If this option is selected, the unclassified
pages are added to the current document, so document separation produces one
multi-page document that consists of four pages. This option is selected by default.

 Attachment if previous document was unclassified


Used to handle an unclassified page as an attachment to the previous document, if that
document was unclassified. For example, a multi-page document consists of four pages.
Document separation processes the pages sequentially, and the first page is assigned to
class A while the other pages remain unclassified. A document is created for the first
page and the next page is used to create a new document. The subsequent pages are
treated in the same way.
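
The three handling options can be illustrated with a small sketch. This is not Kofax code: the `separate` function, the `UnclassifiedMode` enum, and the rule that every classified page opens a new document are assumptions made for illustration. For the third option, the sketch additionally assumes that later unclassified pages attach to the unclassified document that was just created, which the text above does not spell out.

```python
from enum import Enum


class UnclassifiedMode(Enum):
    # Hypothetical names for the three options described above.
    FIRST_PAGE_OF_NEW_DOCUMENT = 1
    ATTACH_TO_PREVIOUS = 2
    ATTACH_IF_PREVIOUS_UNCLASSIFIED = 3


def separate(page_classes, mode):
    """Group pages into documents.

    page_classes: one entry per page, a class name or None if unclassified.
    Returns a list of documents, each {"cls": ..., "pages": [page numbers]}.
    """
    documents = []
    for number, cls in enumerate(page_classes, start=1):
        # Assumption: a classified page (or the very first page) opens a document.
        if cls is not None or not documents:
            documents.append({"cls": cls, "pages": [number]})
            continue
        previous = documents[-1]
        if mode is UnclassifiedMode.ATTACH_TO_PREVIOUS:
            attach = True
        elif mode is UnclassifiedMode.ATTACH_IF_PREVIOUS_UNCLASSIFIED:
            attach = previous["cls"] is None
        else:
            attach = False
        if attach:
            previous["pages"].append(number)
        else:
            documents.append({"cls": None, "pages": [number]})
    return documents


# The four-page example from the text: page 1 is class A, the rest unclassified.
pages = ["A", None, None, None]
for mode in UnclassifiedMode:
    print(mode.name, [doc["pages"] for doc in separate(pages, mode)])
```

For the four-page example, the sketch yields four single-page documents with the first option and one four-page document with the second, matching the descriptions above.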

 Trainable Document Separation (TDS). Select this option to activate trainable document
separation (TDS) for the project. This option is cleared by default.

One fundamental concept is the automatic splitting or segregation of documents, sometimes
also called Automatic Document Separation. Rather than relying on manual document
separation (e.g. by inserting separator sheets or attaching bar code stickers), Kofax’s
Automatic Document Separation technology uses two separation schemas: the Simple
Schema Algorithm and the Sophisticated Schema Algorithm.

The Simple Schema Algorithm recognizes the first page of each document:

o The Kofax software is trained by showing it the front page of every type of document.
o Whenever it recognizes one of these front pages, it assumes a new document has begun.
o Classification and separation are executed in one operation.


The Sophisticated Schema Algorithm classifies every page by layout and content
into “start page”, “middle page” and “end page” for each class, and then calculates
the likeliest document separation path for the set of pages provided. The algorithm:
o looks at a contiguous string of pages all at the same time
o assesses the probability of each page being the 1st page of document type A, the 2nd
page of document type A, … the 1st page of document type B, etc.
o in other words, calculates the probability of the whole sequence of pages

The best path through the batch is determined as follows:
o Try every path.
o Eliminate steps using built-in rules and constraints.

The global scope yields a higher success rate. For example, suppose a page looks like it
could be either
o the 1st page of a 3-page document, or
o the 1st page of a 2-page document.

If the first option has the higher probability, simple, local separation classifies the page
as the start of the 3-page document. If the 3rd page then has a high probability of being
the 1st page of another document, the second option should have been chosen. Because the
Sophisticated Schema Algorithm scores the whole sequence, it can select the second option.

o Even if the first page has a relatively low recognition probability, it can still be
correctly classified and separated with the Sophisticated Schema Algorithm.
o Pages that were mixed up during document preparation can be reordered.
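
The difference between local and global separation can be shown with a minimal sketch, assuming a Viterbi-style dynamic program over (document type, page position) states. The source does not describe Kofax's actual algorithm; the type names, the fixed document lengths, and the probabilities below are hypothetical and chosen only to mirror the 3-page versus 2-page example above.

```python
import math

# Hypothetical document types with fixed page counts.
DOC_LENGTHS = {"three": 3, "two": 2}


def best_separation(page_scores):
    """Return the most probable sequence of (doc_type, position) states.

    page_scores: one dict per page mapping (doc_type, position) -> probability.
    A path must start at position 1, move forward one position at a time
    within a document, and may only start a new document after the last
    page of the previous one.
    """
    best = {}  # state -> (log probability, path so far)
    for (doc_type, pos), p in page_scores[0].items():
        if pos == 1 and p > 0:
            best[(doc_type, pos)] = (math.log(p), [(doc_type, pos)])

    for scores in page_scores[1:]:
        following = {}
        for (prev_type, prev_pos), (log_p, path) in best.items():
            for (doc_type, pos), p in scores.items():
                if p <= 0:
                    continue
                continues = (doc_type == prev_type and pos == prev_pos + 1
                             and pos <= DOC_LENGTHS[doc_type])
                starts_new = pos == 1 and prev_pos == DOC_LENGTHS[prev_type]
                if not (continues or starts_new):
                    continue  # transition ruled out by the constraints
                candidate = log_p + math.log(p)
                state = (doc_type, pos)
                if state not in following or candidate > following[state][0]:
                    following[state] = (candidate, path + [state])
        best = following

    # Only paths that end on the last page of a document are complete.
    complete = [v for (t, pos), v in best.items() if pos == DOC_LENGTHS[t]]
    if not complete:
        return None
    return max(complete, key=lambda v: v[0])[1]


# Page 1 looks slightly more like the start of a 3-page document,
# but page 3 strongly resembles a first page.
scores = [
    {("three", 1): 0.6, ("two", 1): 0.4},
    {("three", 2): 0.5, ("two", 2): 0.5},
    {("three", 3): 0.1, ("three", 1): 0.8, ("two", 1): 0.1},
    {("three", 2): 0.7, ("two", 1): 0.2, ("two", 2): 0.1},
    {("three", 3): 0.8, ("two", 2): 0.2},
]
print(best_separation(scores))
```

A purely local decision would label page 1 as the start of the 3-page type (0.6 versus 0.4). Scoring the whole sequence instead selects a 2-page document followed by a 3-page document, because that path has the higher overall probability.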

A barcode located within the document can also be used to separate documents.
Using a barcode would ensure that separation results are 100% accurate.

2. Document Classification

In general, the Kofax Platform has three ways to classify a document:

1. Layout (Template based)


Layout Classifier
The Layout Classifier is an image classifier. It performs image-based classification
by analysing the graphical elements of an image without need for OCR. To
enable this classifier for a class, it is normally sufficient to add one or two
representative documents and to train the project with these examples.
Layout classification makes use of the geometrical structure of a document to
determine its class. Kofax Transformation Modules can automatically learn
about the geometrical structure of a class by analysing a number of example
documents that are representative of that class.

Documents with completely different layouts can be associated with a single
class provided you have examples of each. Typically, layout classification is used
to identify documents in a batch. Layout classification can also be utilized to
recognize the sender of a letter if the sender’s document layout is unique. This
might be the case for formal letters or invoices.
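
To make the idea of image-based classification without OCR concrete, the following is a rough sketch, assuming a much simpler technique than the Layout Classifier itself: each page is reduced to the fraction of dark pixels per grid cell and assigned to the class of the nearest training example. The function names, the grid size, and the tiny 4x4 "images" are hypothetical.

```python
def grid_features(image, rows=2, cols=2):
    """Fraction of dark pixels (value 1) in each cell of a rows x cols grid.

    image: list of equal-length rows containing 0 (white) or 1 (dark).
    Assumes the image is at least rows x cols pixels.
    """
    height, width = len(image), len(image[0])
    features = []
    for r in range(rows):
        for c in range(cols):
            top, bottom = r * height // rows, (r + 1) * height // rows
            left, right = c * width // cols, (c + 1) * width // cols
            cell = [image[y][x]
                    for y in range(top, bottom)
                    for x in range(left, right)]
            features.append(sum(cell) / len(cell))
    return features


def classify_layout(image, examples):
    """Return the class of the training example with the closest features.

    examples: list of (class_name, image) pairs.
    """
    target = grid_features(image)

    def distance(example):
        candidate = grid_features(example[1])
        return sum((a - b) ** 2 for a, b in zip(target, candidate))

    return min(examples, key=distance)[0]


# Two hypothetical layouts: a letter with a dark header line,
# and a form with marks in all four corners.
letter = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
form = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1]]
examples = [("Letter", letter), ("Form", form)]

# A slightly degraded scan of the letter layout.
scan = [[1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(classify_layout(scan, examples))
```

Adding a second, differently laid out example under the same class name mirrors the point above that one class can cover several layouts.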

2. Content (Adaptive Feature Classifier)


Adaptive Feature Classifier
The Adaptive Feature Classifier (AFC) is a content-based classifier that uses the
text in a document to identify the class. The AFC is trained by having it analyze
several dozen sample text or XDoc documents per class. It automatically and
adaptively determines the salient features that can be used to define a class.
Because the AFC is fault tolerant and does not rely only on words as features,
text containing OCR or typing errors can still be used to classify the document
accurately. The sample documents are analyzed during AFC training and a
classification pattern is automatically created that can be used during
production.
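
One way to see why features smaller than whole words tolerate OCR errors is a sketch based on character trigrams and cosine similarity. This is an illustrative stand-in, not the AFC's actual feature set or training procedure; the function names and the sample texts are hypothetical.

```python
import math
from collections import Counter


def trigrams(text):
    """Count overlapping 3-character sequences after normalizing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def train(samples):
    """Build one trigram profile per class from (class_name, text) samples."""
    profiles = {}
    for cls, text in samples:
        profiles.setdefault(cls, Counter()).update(trigrams(text))
    return profiles


def classify(text, profiles):
    """Return the class whose profile is most similar to the text."""
    features = trigrams(text)
    return max(profiles, key=lambda cls: cosine(features, profiles[cls]))


profiles = train([
    ("Invoice", "invoice number total amount due payment"),
    ("Contract", "contract agreement party signature terms"),
])

# Simulated OCR errors: "l" for "i", "c" for "e", "0" for "o".
print(classify("lnvoice numbcr total am0unt due", profiles))
```

A word-level comparison would find only "total" and "due" unchanged in the damaged text, whereas many trigrams inside the misread words still match the Invoice profile.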

3. Instruction Classification (Keyword Based)


In addition to the type of self-learning content classification provided by the
Adaptive Feature Classifier, explicit instructions can be added to classes to
handle exceptions. Instructions are defined as words and phrases that can be
combined using Boolean operations. Negative instructions can be used to inhibit
classification into a class.
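
The combination of phrases, Boolean operations, and negative instructions can be sketched as small composable predicates. The helper names (`contains`, `all_of`, `any_of`), the class names, and the phrases are hypothetical; the syntax Kofax uses for instructions is not shown in the source.

```python
def contains(phrase):
    """Rule that matches when the phrase occurs in the text (case-insensitive)."""
    return lambda text: phrase.lower() in text.lower()


def all_of(*rules):
    """Boolean AND over rules."""
    return lambda text: all(rule(text) for rule in rules)


def any_of(*rules):
    """Boolean OR over rules."""
    return lambda text: any(rule(text) for rule in rules)


def never(text):
    """Placeholder for classes without a negative instruction."""
    return False


# Hypothetical instructions: a positive rule and a negative (inhibiting) rule.
INSTRUCTIONS = {
    "Invoice": {
        "positive": any_of(
            contains("invoice"),
            all_of(contains("amount due"), contains("payment terms")),
        ),
        "negative": contains("credit note"),
    },
    "CreditNote": {
        "positive": contains("credit note"),
        "negative": never,
    },
}


def classify_by_instructions(text, instructions):
    """Return every class whose positive rule matches and negative rule does not."""
    matches = []
    for cls, rule in instructions.items():
        if rule["negative"](text):
            continue  # a negative instruction inhibits this class
        if rule["positive"](text):
            matches.append(cls)
    return matches


print(classify_by_instructions("Invoice 4711, amount due in 30 days", INSTRUCTIONS))
print(classify_by_instructions("Credit note for invoice 4711", INSTRUCTIONS))
```

In the second call the word "invoice" matches the positive rule for Invoice, but the negative instruction "credit note" inhibits that class, so only CreditNote is returned.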

3. Clustering of Unstructured Documents

For unstructured documents, Kofax provides an efficient learn-by-example approach to understand,
classify and cluster large numbers of documents. It only requires the user to provide sufficient samples
to start with. The solution can then automatically learn the layout features and content patterns.

Providing samples for clustering

Simple configurations to perform clustering:

 Content or layout?

 OCR on demand

 Cluster size and min. confidence
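
How a cluster size and a minimum confidence might interact can be sketched with a simple greedy, content-based clustering. This is an assumption-laden illustration: the source names these settings but does not define them, so `min_confidence` is treated here as a similarity threshold, `min_cluster_size` as the smallest group that is kept, and word-set Jaccard similarity as the content measure.

```python
def jaccard(a, b):
    """Share of words two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, min_confidence=0.5, min_cluster_size=2):
    """Greedy clustering: join the most similar existing cluster or open a new one.

    Returns (kept, leftover): clusters as lists of text indices that reach
    min_cluster_size, and the indices of texts in clusters that are too small.
    """
    clusters = []  # each cluster is a list of indices; the first is its representative
    for index, text in enumerate(texts):
        words = set(text.lower().split())
        best, best_score = None, 0.0
        for members in clusters:
            representative = set(texts[members[0]].lower().split())
            score = jaccard(words, representative)
            if score > best_score:
                best, best_score = members, score
        if best is not None and best_score >= min_confidence:
            best.append(index)
        else:
            clusters.append([index])
    kept = [c for c in clusters if len(c) >= min_cluster_size]
    leftover = [i for c in clusters if len(c) < min_cluster_size for i in c]
    return kept, leftover


samples = [
    "invoice total amount due",
    "invoice total amount paid",
    "meeting agenda for monday",
    "invoice amount due total",
]
print(cluster(samples))
```

Raising `min_confidence` produces more, tighter clusters, while raising `min_cluster_size` moves small groups into the leftover list for manual review.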

Execution results and review statistics
