
Product: OpenText Content Server

Version: 21.1, 16.0.19


Task/Topic: Indexing and Search
Audience: Administrators, Developers
Platform: All
Document ID: 500434
Updated: January 26, 2021

White Paper
Understanding Search Engine 21
Patrick Pidduck, Director, Product Management

Foreword
Since 2009, I have had the honor of working with the extraordinary software
development teams at OpenText responsible for the OpenText Search Engine.
Search has always been a fundamental component of the OpenText Content Suite
Platform, and OpenText pioneered several key technologies that serve as the
foundation for modern search engines. Our team has built upon more than 25 years
of search innovation, and has contributed to external research initiatives such as
TREC for many years.
OpenText knows search.
In the last few years, our customers have pushed scalability and reliability
requirements to new levels. The OpenText Search Engine has met these demands, and
continues to improve with each quarterly product update. Several billion documents
in a single search index, unthinkable just a few years ago, is a reality today at
customer sites.
This edition of the “Understanding Search” document covers the capabilities of
Search Engine 21. Search Engine 21 is the most recent version, superseding 16.2,
16, 10.5, 10.0 and versions reaching back to Content Server 9.7. We understand our
enterprise customers’ needs, and this latest search engine provides seamless upgrade
paths from all supported versions of Content Server. While protecting your existing
investments, we continue to add incredible new capabilities, such as efficient search
methods optimized for eDiscovery and Classification applications, enhanced
backups, and integrated performance monitoring.
This document would not be possible without the help of our resident search experts.
As always, you have my thanks: Alex and Alex, Ann, Annie, Christine, Dave, Dave,
Dave, Hiral, Jody, Johan, Kyle, Laura, Mariana, Michelle, Mike, Ming, Parmis, Paul,
Ray, Rick, Riston, Ryan, Scott and Stephen.
Patrick.

Contents
Basics........................................................................................................................... 3
Overview ................................................................................................................. 3
Introduction ...................................................................................................... 3
Disclaimer ........................................................................................................ 3
Relative Strengths ............................................................................................ 4
Upgrade Migration: .................................................................................... 4
Transactional Capability: ........................................................................... 4
Metadata Updates: .................................................................................... 4
Search-Driven Update: .............................................................................. 4
Maintenance Commitment: ....................................................................... 4
Data Integrity: ............................................................................................ 4
Scaling: ...................................................................................................... 4
Advanced Queries: .................................................................................... 5
Related Components ....................................................................................... 5
Admin Server ............................................................................................. 5
Document Conversion Server ................................................................... 5
IPool Library .............................................................................................. 5
Content Server Search Administration ...................................................... 5
Query Languages ...................................................................................... 6
Remote Search.......................................................................................... 6
Backwards Compatibility .................................................................................. 6
Installation with Content Server ....................................................................... 6
Search Engine Components .................................................................................. 7
Update Distributor ............................................................................................ 8
Index Engines .................................................................................................. 8
Search Federator ............................................................................................. 8
Search Engines ................................................................................................ 9
Inter-Process Communication ................................................................................ 9
External Socket Connections ........................................................................... 9
Internal Socket Connections .......................................................................... 10
Search Federator Connections ...................................................................... 10
Search Queues........................................................................................ 10
Queue Servicing ...................................................................................... 11
Search Timeouts...................................................................................... 11
Testing Timeouts...................................................................................... 12
File System .................................................................................................... 13
Server Names ................................................................................................ 13
Partitions ............................................................................................................... 13
Basic Concepts .............................................................................................. 13
Large Object Partitions .................................................................................. 16
Regions and Metadata .............................................................................................. 17
Metadata Regions ................................................................................................ 17
Region Names ............................................................................................... 17
Nested Region Names ................................................................................... 18
DROP - Blocking Indexing of Regions ........................................................... 18
Removing Regions from the Index ................................................................ 19
Removing Empty Regions ............................................................................. 19
Renaming Regions ........................................................................................ 20
Merging Regions ............................................................................................ 20
Changing Region Types ................................................................................. 21
LONG Region Conversion ....................................................................... 22
Multiple Values in Regions ............................................................................. 22
Attributes in Text Regions .............................................................................. 23
Region Size Attribute ..................................................................................... 24
Metadata Region Types........................................................................................ 25
Key ................................................................................................................ 25
Text................................................................................................................. 26
Rank ............................................................................................................... 27
Integer ............................................................................................................ 27
Long Integer ................................................................................................... 27
Timestamp ..................................................................................................... 27
Enumerated List ............................................................................................. 29
Boolean .......................................................................................................... 29
Date................................................................................................................ 29
Currency......................................................................................................... 29
Date Time Pair ............................................................................................... 30
User Definition Triplet..................................................................................... 30
Aggregate-Text Regions ................................................................................ 30
CHAIN Regions .................................................................................................... 31
Text Metadata Storage ......................................................................................... 32
Configuring the Storage Modes ..................................................................... 34
Memory Storage (RAM) ................................................................................. 35
Disk Storage (Value Storage) ........................................................................ 35
Low Memory Mode ........................................................................................ 36
Merge File Storage ........................................................................................ 36
Retrieval Storage ........................................................................................... 37
Storage Mode Conversion ............................................................................. 37
Reserved Regions ................................................................................................ 38
OTData - Full Text Region ............................................................................. 38
OTMeta .......................................................................................................... 38
XML Text Regions .......................................................................................... 38
OTObject ........................................................................................................ 39
OTCheckSum ................................................................................................ 39
OTMetadataChecksum .................................................................................. 39
OTContentStatus ........................................................................................... 40
OTTextSize..................................................................................................... 42
OTContentLanguage ..................................................................................... 42
OTPartitionName ........................................................................................... 42
OTPartitionMode ............................................................................................ 42
OTIndexError ................................................................................................. 43
OTScore ......................................................................................................... 43
TimeStamp Regions....................................................................................... 44
OTObjectIndexTime................................................................................. 44
OTContentUpdateTime ........................................................................... 44
OTMetadataUpdateTime ......................................................................... 44
OTObjectUpdateTime .............................................................................. 45
_OTDomain .................................................................................................... 45
_OTShadow ................................................................................................... 45
Regions and Content Server ................................................................................ 45
MIME and File Types ..................................................................................... 46
Extracted Document Properties ..................................................................... 46
Workflow ........................................................................................................ 47
Categories and Attributes............................................................................... 47
Forms ............................................................................................................. 48
Custom Applications ...................................................................................... 48
Default Search Settings ........................................................................................ 48
Indexing and Query .................................................................................................. 49
Indexing ................................................................................................................ 49
Indexing using IPools ..................................................................................... 50
AddOrReplace ............................................................................................... 52
AddOrModify .................................................................................................. 53
Modify............................................................................................................. 53
Delete ............................................................................................................. 53
DeleteByQuery ............................................................................................... 54
ModifyByQuery .............................................................................................. 54
Transactional Indexing ................................................................................... 55
IPool Quarantine...................................................................................... 55
Query Interface ..................................................................................................... 56
Select Command ........................................................................................... 56
Set Cursor Command .................................................................................... 57
Get Results Command................................................................................... 58
Get Facets Command .................................................................................... 60
Date Facets .................................................................................................... 61
FileSize Facets .............................................................................................. 62
Expand Command ......................................................................................... 64
Hit Highlight Command .................................................................................. 64
Get Time......................................................................................................... 65
Set Command ................................................................................................ 66
Get Regions Command ................................................................................. 66
OTSQL Query Language...................................................................................... 68
SELECT Syntax ............................................................................................. 69
FACETS Statement........................................................................................ 70
WHERE Clause ............................................................................................. 70
WHERE Relationships ................................................................................... 71
WHERE Terms ............................................................................................... 72
WHERE Operators ......................................................................................... 73
Proximity - prox operator................................................................................ 76
Proximity - span operator ............................................................................... 77
Proximity – practical considerations .............................................................. 78
WHERE Regions ........................................................................................... 79
Priority Region Chains ................................................................................... 80
Minimum and Maximum Regions................................................................... 81
Any or All Regions .......................................................................................... 82
Regular Expressions ...................................................................................... 82
Relative Date Queries .................................................................................... 85
Matching Lists of Terms ................................................................................. 86
ORDEREDBY ................................................................................................ 88
ORDEREDBY Default ............................................................................. 89
ORDEREDBY Nothing ............................................................................ 89
ORDEREDBY Relevancy ........................................................................ 89
ORDEREDBY RankingExpression.......................................................... 89
ORDEREDBY Region ............................................................................. 89
ORDEREDBY Existence ......................................................................... 90
ORDEREDBY Rawcount ......................................................................... 90
ORDEREDBY Score[N] ........................................................................... 90
Performance Considerations for Sort Order ............................................ 90
Text Locale Sensitivity ............................................................................. 91
Facets ................................................................................................................... 91
Purpose of Facets .......................................................................................... 91
Requesting Facets ......................................................................................... 92
Facet Caching ................................................................................................ 93
Text Region Facets ........................................................................................ 93
Date Facets .................................................................................................... 93
FileSize Facets .............................................................................................. 94
Facet Security Considerations ....................................................................... 94
Facet Configuration Settings.......................................................................... 95
Reserving Facet Memory ..................................................................................... 96
Facet Performance Considerations ............................................................... 96
Protected Facets ............................................................................................ 97
Search Agent Scheduling............................................................................... 98
Interval Execution .................................................................................... 98
Search Agent Configuration ........................................................................... 99
Search Agent Query Syntax......................................................................... 100
New Search Agent Query Files .................................................................... 100
Search Agent iPools..................................................................................... 101
Performance Considerations ....................................................................... 102
Relevance Computation ..................................................................................... 102
Retrieving the Relevance Score .................................................................. 103
Components of Relevance........................................................................... 103
Date Ranking ............................................................................................... 104
Object Type Ranking .................................................................................... 105
Text Region Ranking .................................................................................... 106
Full Text Search Ranking ............................................................................. 106
Relative frequency ................................................................................. 106
Frequency.............................................................................................. 106
Commonality.......................................................................................... 106
Object Ranking ............................................................................................ 107
Relevance Boost Overview .......................................................................... 108
Query Boost ................................................................................................. 108
Date Boost ................................................................................................... 108
Integer Boost ................................................................................................ 109
Multiple Boost Values ................................................................................... 110
Query versus Date / Integer Boost .............................................................. 110
Content Server Relevance Tuning ..................................................................... 110
Date Relevance ........................................................................................... 111
Boosting Object Types ................................................................................. 111
Boosting Text Regions ................................................................................. 115
Default Search Regions ............................................................................... 115
Using Recommender ................................................................................... 116
User Context ................................................................................................ 116
Enforcing Relevancy .................................................................................... 116
Extended Query Concepts ..................................................................................... 118
Thesaurus ........................................................................................................... 118
Overview ...................................................................................................... 118
Thesaurus Files ........................................................................................... 118
Thesaurus Queries ...................................................................................... 119
Creating Thesaurus Files ............................................................................. 119
Content Server Considerations .................................................................... 120
Stemming ........................................................................................................... 121
English Stemming Rules .............................................................................. 122
French Stemming Rules .............................................................................. 122
Spanish Stemming Rules............................................................................. 123
Italian Stemming Rules ................................................................................ 123
German Stemming Rules............................................................................. 123
Alternative Stemming Algorithm ................................................................... 124
Content Server and Stemming..................................................................... 125
Phonetic Matching .............................................................................................. 125
Exact Substring Searching ................................................................................. 126
Configuration ................................................................................................ 126
Substring Performance ................................................................................ 127
Substring Variations ..................................................................................... 127
Included Tokenizers ..................................................................................... 128
Preserving other Query Features ................................................................ 129
Part Numbers and File Names ........................................................................... 129
Problem ........................................................................................................ 129
Like Operator ............................................................................................... 129
Like Defaults ................................................................................................ 130
Shadow Regions .......................................................................................... 130
Token Generation with Like.......................................................................... 131
Limitations .................................................................................................... 132
User Guidance ............................................................................................. 132
Email Domain Search ......................................................................................... 133
Text Operator - Similarity .................................................................................... 135
Top Words........................................................................................................... 136
Stop Words ......................................................................................................... 137
Advanced Feature Configuration .......................................................................... 138
Accumulator ........................................................................................................ 138
Accumulator Chunking ....................................................................................... 139
Reverse Dictionary ............................................................................................. 140
Transaction Logs ................................................................................................ 141
Protection ........................................................................................................... 142
Text Metadata Size ...................................................................................... 142
Text Metadata Values ................................................................................... 142
Incorrect Indexing of Thumbnail Commands ............................................... 142
Cleanup Thread .................................................................................................. 143
Merge Thread ..................................................................................................... 144
Merge Tokens ........................................................................................ 145
Too Many Sub-Indexes .......................................................................... 146
Tokenizer ............................................................................................................ 146
Language Support ....................................................................................... 146
Case Sensitivity ........................................................................................... 146
Standard Tokenizer Behavior ....................................................................... 147
Customizing the Tokenizer ........................................................................... 148
Tokenizer File Syntax ......................................................................................... 148
Tokenizer Character Mapping ...................................................................... 149
Latin Extended-A Character Set Mapping ............................................. 150
Arabic Characters ........................................................................................ 150
Complete List of Character Mappings ......................................................... 152
Tokenizer Ranges ........................................................................................ 152
Tokenizer Regular Expressions .......................................................................... 152
East Asian Characters ................................................................................. 153
Tokenizer Options ........................................................................................ 154
Testing Tokenizer Changes ................................................................................ 154
Sample Tokenizer ............................................................................................... 155
Metadata Tokenizers .......................................................................................... 156
Metadata Tokenizer Example 1.................................................................... 157
Metadata Tokenizer Example 2.................................................................... 157
Administration and Optimization .......................................................................... 160
Index Quality Queries ......................................................................................... 160
Index Error Counts ....................................................................................... 160
Content Quality Assessment ........................................................................ 160
Partition Sizes .............................................................................................. 160
Metadata Corruption .................................................................................... 160
Bad Format Detection .................................................................................. 160
Text Metadata Truncation............................................................................. 160
Text Value Truncation ................................................................................... 161
Search Result Caching ....................................................................................... 161
Query Time Analysis ........................................................................................... 161
Administration API .............................................................................................. 163
getstatustext ................................................................................................. 163
getstatuscode ............................................................................................... 168
registerWithRMIRegistry .............................................................................. 169
checkpoint .................................................................................................... 169
reloadSettings .............................................................................................. 169
getsystemvalue ............................................................................................ 169
addRegionsOrFields .................................................................................... 170
runSearchAgents ......................................................................................... 170
runSearchAgent ........................................................................................... 170
runSearchAgentOnUpdated......................................................................... 171
runSearchAgentsOnUpdated ....................................................................... 171
Server Optimization ............................................................................................ 171
Metadata Region Fragmentation ................................................................. 171
Partition Metadata Memory Sizing ............................................................... 172
Automatic Partition Modes ........................................................................... 174
Memory Usage Mode Switching............................................................ 174
Disk Usage Mode Switching .................................................................. 176
Selecting a Text Metadata Storage Mode .................................................... 176
High Ingestion Environments ....................................................................... 177
Update Distributor Bottlenecks .............................................................. 177
Operation Counts .................................................................................. 179
Percentage Times.................................................................................. 179
Backup Times ........................................................................................ 179
Agent Times........................................................................................... 179
NetIO Stats ............................................................................................ 179
Checkpoint Writing Thresholds.............................................................. 180
Index Batch Sizes .................................................................................. 181
Partition Biasing..................................................................................... 181
Parallel Checkpoints .............................................................................. 182
Testing for Object Ownership ................................................................ 183
Compressed Communications .............................................................. 184
Scanning Long Lists .............................................................................. 184
Ingestion versus Size ............................................................................ 184
Content Server Considerations ............................................................. 185
Ingestion Rate Case Study .......................................................................... 185
Re-Indexing .................................................................................................. 187
Optimize Regions to be Indexed .................................................................. 188
Selecting a Storage System......................................................................... 188
Measuring Network Quality .......................................................................... 190
Measuring Disk Performance....................................................................... 190
Checkpoint Compression ............................................................................. 192
Disk Configuration Settings.......................................................................... 192
Delayed Commit .................................................................................... 192
Chunk Size ............................................................................................ 192
Query Parallelism .................................................................................. 192
Throttling Indexing ................................................................................. 193
Small Read Cache................................................................................. 193
File Retries ............................................................................................ 193
Indexing Large Objects....................................................................................... 194
Servers with Multiple CPUs ................................................................................ 194
Virtual Machines........................................................................................... 195
Garbage Collection ...................................................................................... 196
File Monitoring ............................................................................................. 196
Virus Scanning ............................................................................................. 197
Thread Management.................................................................................... 197
Scalability ........................................................................................................... 197
Query Availability.......................................................................................... 197
Indexing High Availability ............................................................................. 199
Sizing a Search Grid........................................................................................... 200
Minimizing Metadata .................................................................................... 200
Metadata Types............................................................................................ 200
Hot Phrases and Summaries ....................................................................... 200
Partition RAM Size ....................................................................................... 200
Sample Data Point................................................................................. 201
Memory Use .......................................................................................... 201
Redundancy ................................................................................................. 201
Spare Capacity ............................................................................................ 201
Indexing Performance .................................................................................. 202
CPU Requirements ...................................................................................... 202
Maintenance ....................................................................................................... 203
Log Files ............................................................................................................. 203
Log Levels .................................................................................................... 204
Log File Management .................................................................................. 204
RMI Logging ................................................................................................. 205
Backup and Restore .................................................................................... 205
Application Level Index Verification ............................................................. 205
Purging a Partition Index.............................................................................. 205
Step 1 .................................................................................................... 206
Step 2 .................................................................................................... 206
Step 3 .................................................................................................... 206
Step 4 .................................................................................................... 206
Security Considerations...................................................................................... 206
Java Security Policy ..................................................................................... 207
Backup and Restore ........................................................................................... 208
Backup Feature – Method 1......................................................................... 208
Restoring Partitions ............................................................................... 211
Backup – Method 2 ...................................................................................... 211
Backup Utilities – Method 3 ......................................................................... 211
Differential Backup ................................................................................ 212
Backup Process Overview .................................................................... 212
Sample Full.ini File ................................................................................ 212
Sample Lang File................................................................................... 214
Sample Backup.ini File .......................................................................... 215
Running the Backup Utility .................................................................... 217
Restore Process – Method 3 ....................................................................... 217
Preparation ............................................................................................ 217
Analysis ................................................................................................. 217
Copy ...................................................................................................... 217
Validate .................................................................................................. 217
Restore.ini File....................................................................................... 218
Index and Configuration Files ............................................................................... 220
Index Files .......................................................................................................... 220
Signature File ............................................................................................... 221
Accumulator Log File ................................................................................... 221
Metadata Checkpoint Files .......................................................................... 221
Lock File ................................................................................................ 222
Control File ............................................................................................ 222
Top Words ............................................................................................. 222
Config File ............................................................................................. 222
Metalogs....................................................................................................... 222
Index Fragment Folders ............................................................................... 223
Core, Region and Other ........................................................................ 223
Index Files ............................................................................................. 223
Object Files............................................................................................ 224
Offset File .............................................................................................. 224
Skip File ................................................................................................. 224
Map File ................................................................................................. 224
Low Memory Metadata Files ........................................................................ 224
Metadata Merge Files .................................................................................. 224
Configuration Files.............................................................................................. 225
Search.ini ..................................................................................................... 225
Search.ini_override ...................................................................................... 225
Backup.ini..................................................................................................... 226
FieldModeDefinitions.ini ............................................................................... 226
LLFieldDefinitions.txt.................................................................................... 227
SEARCH.INI Summary....................................................................................... 228
General Section ........................................................................................... 228
Partition Section ........................................................................................... 229
DataFlow Section ......................................................................................... 229
Update Distributor Section ........................................................................... 236
Index Engine Section ................................................................................... 237
Search Federator Section ............................................................................ 238
Search Engine Section ................................................................................ 240
DiskRet Section ........................................................................................... 241
Search Agent Section .................................................................................. 241
Field Alias Section ........................................................................................ 241
Index Maker Section .................................................................................... 241
Reloadable Settings ........................................................................................... 242
Common Values ........................................................................................... 242
Search Engines .................................................................................................. 243
Update Distributor............................................................................................... 243
Tokenizer Mapping ............................................................................................. 244
Additional Information ............................................................................................ 256
Version History ................................................................................................... 256
Search Engine 10 ........................................................................................ 256
Search Engine 10 Update 1 ......................................................................... 256
Search Engine 10 Update 2 ......................................................................... 257
Search Engine 10 Update 3 ......................................................................... 257
Search Engine 10 Update 4 ......................................................................... 257
Search Engine 10 Update 5 ......................................................................... 258
Search Engine 10 Update 5 Release 2 ....................................................... 258
Search Engine 10 Update 6 ......................................................................... 258
Search Engine 10 Update 7 ......................................................................... 258
Search Engine 10 Update 8 ......................................................................... 258
Search Engine 10 Update 9 ......................................................................... 259
Search Engine 10 Update 10 ....................................................................... 259
Search Engine 10 Update 11 ....................................................................... 259
Search Engine 10 Update 12 ....................................................................... 260
Search Engine 10.5 ..................................................................................... 260
Search Engine 10.5 Update 2014-03 .......................................................... 260
Search Engine 10.5 Update 2014-03 R2 .................................................... 260
Search Engine 10.5 Update 2014-06 ......................................................... 261
Search Engine 10.5 Update 2014-09 ......................................................... 261
Search Engine 10.5 Update 2014-12 ......................................................... 261
Search Engine 10.5 Update 2015-03 ......................................................... 261
Search Engine 10.5 Update 2015-06 ......................................................... 262
Search Engine 10.5 Update 2015-09 ......................................................... 262
Search Engine 10.5 Update 2015-12 ......................................................... 262
Search Engine 16 Update 2016-03 ............................................................ 262
Search Engine 16.0.1 (June 2016) .............................................................. 263
Search Engine 16.0.2 (September 2016) .................................................... 263
Search Engine 16.0.3 (December 2016) ..................................................... 263
Search Engine 16.2.0 (March 2017) ............................................................ 263
Search Engine 16.2.1 (June 2017) .............................................................. 263
Search Engine 16.2.2 (September 2017) .................................................... 264
Search Engine 16.2.3 (December 2017) ..................................................... 264
Search Engine 16.2.4 (March 2018) ............................................................ 264
Search Engine 16.2.5 (June 2018) .............................................................. 265
Search Engine 16.2.6 (September 2018) .................................................... 265
Search Engine 16.2.7 (December 2018) ..................................................... 265
Search Engine 16.2.8 (March 2019) ............................................................ 265
Search Engine 16.2.9 (June 2019) .............................................................. 265
Search Engine 16.2.10 (September 2019) .................................................. 265
Search Engine 16.2.11 (December 2019) ................................................... 266
Search Engine 20.2 (March 2020) ............................................................... 266
Search Engine 20.3 (July 2020) .................................................................. 266
Search Engine 20.4 (October 2020) ............................................................ 266
Error Codes ............................................................................................................. 268
Update Distributor ........................................................................................ 268
Index Engine ................................................................................................ 269
Search Federator ......................................................................................... 269
Search Engine ............................................................................................. 270
Utilities ................................................................................................................ 270
General Syntax ............................................................................................ 270
Backup ......................................................................................................... 271
Restore......................................................................................................... 271
DumpKeys.................................................................................................... 271
VerifyIndex ................................................................................................... 272
RebuildIndex ................................................................................................ 274
LogInterleaver .............................................................................................. 274
tools.analysis.ConvertDateFormat ............................................................... 274
com.opentext.search.tokenizer.LivelinkTokenizer ....................................... 275
ProfileMetadata ............................................................................................ 275
tools.index.DiskReadWriteSpeed ................................................................ 276
SearchClient................................................................................................. 277
Repair BaseOffset Errors ............................................................................. 278
Problem Illustration ...................................................................................... 278
Repair Option 1 ............................................................................................ 279
Repair Option 2 ............................................................................................ 280
New Base Offset Errors ...................................................................................... 283
Index of Terms ......................................................................................................... 284

Basics
This section is an overview of Search Engine 21, and introduces fundamental
concepts needed to understand some of the later topics.

Overview

Introduction
Search Engine 21 (“OTSE” – OpenText Search Engine) is the search engine
provided as part of OpenText Content Server. This document provides information
about the most common Search Engine 21 features and configuration, suitable for
administrators, application integrators and support staff tasked with maintaining and
tuning a search grid. If you are looking for information on the internal details of the
data structures and algorithms, you won’t find it here.
This document is based upon the features and capabilities of Search Engine 21.1,
which has a release date of January 2021.

Where possible, the discussion of OTSE is isolated from the larger
context of Content Server in which it operates. However, there are
instances where references to Content Server are necessary due to
the tight integration between Content Server and OTSE.
Paragraphs which are specific to Content Server are usually
designated by means of the icon you see at the left.

Occasionally, items of particular interest will be highlighted by
means of a sticky note icon, as seen here.

Disclaimer

DISCLAIMER:
This document is not official OpenText product documentation. Any procedures or sample code are specific to the scenario presented in this White Paper, and are delivered as-is, for educational purposes only. This document is presented as a guide to supplement official OpenText product documentation.
While efforts have been made to ensure correctness, the information here is supplementary to the product documentation and release notes.

Relative Strengths
There are many search engines available on the market, each of which has relative
merits. Search Engine 21 is a product of the ECM market space, developed by
OpenText, with a proven record as part of OpenText Content Server. This search
engine has been in active use and development for many years, and was previously
known by names such as “OT7” and “Search Engine 10”.
Because of the nature of OpenText ECM solutions, OTSE has a feature set oriented
towards enterprise-grade ECM applications. Some of the pertinent features which
make OTSE a preferred solution for these applications include:
Upgrade Migration:
As new features and capabilities are added, you are not required to re-index your
data. OTSE includes transparent conversion of older indexes to newer versions.
Our experience is that customers with large data sets often do not have the time or
infrastructure to re-index their data, so this is a key requirement.
Transactional Capability:
During indexing, objects are committed to the index in much the same way that
databases perform updates. If a catastrophic outage happens in the midst of a
transaction, the system can recover without data corruption. Additionally, logical
groups of objects for indexing can be treated as a single transaction, and the entire
transaction can be rolled back in the event that one object cannot be handled
properly.
Metadata Updates:
The OpenText search technology has the ability to make in-place updates of some or
all of the metadata for an object. This represents a significant performance
improvement over search technology that must delete and add complete objects,
particularly for ECM applications where metadata may be changing frequently.
Search-Driven Update:
OTSE has the ability to perform bulk operations, such as modification and deletion,
on sets of data that match search criteria. This allows for very efficient index updates
for specific types of transactions.
Maintenance Commitment:
OpenText controls the code and release schedules. This way, we can ensure that our
ECM solutions customers will have a supported search solution throughout the life of
their ECM application.
Data Integrity:
OTSE contains a number of features that allow the quality, consistency and integrity
of the search index and the data to be assessed. These features give system
administrators the tools they need to ensure that mission critical applications are
operating within specification.
Scaling:
Not only can OTSE support very large indices (1 billion+ objects), it can be
restructured to add capacity, rebalance the distribution of objects across servers, switch portions from read-write to update-only or read-only, and perform in-place addition or removal of metadata fields. OTSE shelters applications from the
complexity of tracking which objects are indexed into each Search Engine.
Advanced Queries:
Customers engaged in Records Management, Discovery and Classification have
unique query features available optimized for these applications. Examples include
searching for N of M terms in a document, searching for similar information, and
conditional term matching where sparse metadata exists.

Related Components
The scope of this document is constrained to the core OTSE components which are
located within the search JAR file (OTSEARCH.JAR).
There are a number of other components of both the overall search solution and
Content Server which are strongly related to OTSE but are not covered in this
document. In some instances, because of the tight relationship with other
components, references may be made in this document to these other components.
For a complete understanding of the search technology, you may wish to also learn
about the following products and technologies:
Admin Server
The Admin Server is a middleware application which provides control, monitoring and
management of processes for Content Server. The Admin Server performs a number
of services, and is critical to the operation of the search grid when used with Content
Server. As a rule of thumb, there is generally one Admin Server installed on each
physical computer hosting OTSE components.
Document Conversion Server
DCS is a set of processes and services responsible for preparing data prior to
indexing. DCS performs tasks such as managing the data flows and IPools during
ingestion, extracting text and metadata from content, generating hot phrases and
summaries, performing language identification, and more. You should ensure that
DCS is optimally configured for use with your application before indexing objects.
IPool Library
Interchange Pools (IPools) are a mechanism for managing batch-oriented Data Flows
within Content Server. IPools are used to encapsulate data for indexing. OTSE uses
the Java Native Interface (JNI) to leverage OpenText libraries for reading and writing
IPools.
Content Server Search Administration
While most OTSE setup is managed using configuration files, in practice many of
these files are generated and controlled by Content Server. Many of the concepts
and settings described in this document have analogous settings within Content
Server Search Administration pages, and should be managed from those pages
wherever possible.

Query Languages
This document describes the search query language implemented by the OTSE. It is
common for applications to hide the OTSE query language and provide an alternative
query language to end users. The Content Server query language – LQL – is NOT
described in this document.
Remote Search
Content Server Remote Search currently uses code within OTSE to facilitate
obtaining search results from remote instances of Content Server.

Backwards Compatibility
OTSE is capable of reading all indexes and index configuration files from all released
versions of OpenText Search Engine 20, Search Engine 16.2, Search Engine 16,
Search Engine 10.5, Search Engine 10, and OT7. OT7 is the predecessor to SE10.0
that was part of Content Server 9.6 and 9.7. For most of these, an index conversion
will take place. The new index will not be readable by older versions of the search
engines.
Indexes created with OT6 are not directly readable. Search Engine 10 can be used
to convert an OT6 index to a format Search Engine 10 can use, which can then be
upgraded in a second step using OTSE. In practice, given the improvements and
fixes since OT6, you would be best advised to re-index extremely old data sets. You
should consult with OpenText Customer Support if you are considering a migration
from these older search indices.

Installation with Content Server


This update of OTSE is optimized to be run on the OpenJDK 11.x Java platform, as
installed with current OpenText Content Server updates.
If necessary, OTSE can be provided separately from Content Server, and applied as
an update for fixes to older versions of Content Server running within a Java 8
runtime environment. OTSE itself is contained within a Java archive called
“OTSEARCH.JAR”.
There are many services and components of OTSE contained within the OTSEARCH JAR file, which are differentiated at startup by means of command line parameters. When deployed with Content Server, multiple copies of the Java executable (java.exe or javaw.exe on Windows) are made with distinct names in order to help differentiate the various Java processes when using monitoring tools. If you upgrade the version of Java used with OTSE and Content Server, you must also remember to make new copies of these Java wrapper programs. The names for the copies of the Java executable used by Content Server are:
otindexengine.exe
otupdatedistributor.exe
otsearchengine.exe
otsearchfederator.exe
otbackup.exe
otrestore.exe
otsumlog.exe
otcheckndx.exe
llremotesearch.exe
The llremotesearch.exe file is specifically for Content Server Remote Search, and is
not a requirement for other OTSE installations.

Search Engine Components


OTSE comprises a number of logical components. They are logical components because physically they are all located within the same program (contained within the OTSEARCH.JAR file), but started in different modes of operation based upon command line parameters. This section presents an overview
of each component and its purpose.

Update Distributor
The Update Distributor is the front end for indexing. The Update Distributor performs
the following tasks, not necessarily in this order:
• Monitors an input IPool directory to check for indexing requests.
• Reads IPools, unpacks the indexing requests.
• Breaks larger IPools into smaller batches if necessary.
• Determines which Index Engines should service an indexing request.
• Sends indexing requests to Index Engines.
• Rolls back transactions and sets aside the IPool message if indexing of an
object fails.
• Rebalances objects to a new Index Engine during update operations if a partition is too full or retired.
• Manages which Index Engines can write Checkpoints.
• Grants merge tokens to Index Engines that have insufficient disk space.
• Controls the sequence of operations for Index Engines writing backups.

Index Engines
An Index Engine is responsible for adding, removing and updating objects in the
search index. The Index Engines accept requests from the Update Distributor, and
update the index as appropriate. Multiple Index Engines in a system are common,
each one representing a portion of the overall index known as a “partition”.
The search index itself is stored on disk. In operation, portions of the search index
are loaded into memory for performance reasons.
Index Engines are also responsible for tasks such as:
• Converting older versions of the index to newer formats.
• Converting metadata from one type to another.
• Converting metadata between different storage modes.
• Background operations to merge (compact) index files.

Search Federator
The Search Federator is the entry point for search queries. The Search Federator
receives queries from Content Server, sends queries to Search Engines, gathers the
results from all Search Engines together, and responds to the Content Server with
the search results.
The Search Federator performs tasks such as:
• Maintaining the queues for search requests.
• Issuing search queries to the Search Engines.
• Gathering and sorting results from Search Engines.
• Removing duplicate entries from search results.
• Caching search results for long queries.
• Running scheduled Search Agents.

Search Engines
The Search Engines perform the heavy lifting for search queries. They are
responsible for performing searches on a single partition, computing relevance scores,
sorting results, and retrieving metadata regions to return in a query. Every partition
requires a Search Engine to support queries.
The Search Engines keep an in-memory representation of key data that replicates
the memory in the Index Engines. The files on disk are shared with the Index
Engines.
Search Engines read Checkpoint files at startup and incremental Metalog and
AccumLog files during operation to keep their view of the index data current. These
Metalog and AccumLog files are checked every 10 seconds by default, and any time
a search query is run.
Search Engines also perform tasks such as building facets, and computing position
information used for highlighting search results.

Inter-Process Communication
Each component of the search engine exposes APIs for a variety of purposes. This
section outlines the various communication methods used.

External Socket Connections


Each component of OTSE listens on a configurable port number for socket-level
communications. The port number is configured within the search.ini file. These
socket connections are used for tasks such as:
• Search queries
• Configuration queries
• Status monitoring
• Shutdown, restart and reload
• Backup and restore
Within Content Server 16.2, the administration pages allow you to set the IP Address
and Port Numbers used by each OTSE component.
The administrator must ensure that there are no port conflicts for components
installed on the same computer. Socket communication may also occur across
computers in a distributed system. You must ensure that socket communications are
not blocked by firewalls, switches or other networking elements.
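
As a minimal sketch, the listener port for a component is set in that component's section of the search.ini file; the section suffix and port value below are illustrative:

[SearchFederator_xxx]
SearchPort=8500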

Some customers have encountered intermittent problems that can be traced back to the support of sockets. For example, the Windows operating system has a configurable limit on the total number of sockets that can be active, and reserves connections for several minutes. You may need to adjust the maximum number of connections upwards, and the reservation time downwards, within the Operating System.

Internal Socket Connections


The primary socket communications of interest are from the Update Distributor to the
Index Engines, and from the Search Federator to the Search Engines.
The Update Distributor typically initiates transactions with a broadcast message to
determine which (if any) Index Engine owns an object. If an Index Engine responds,
then the index update request is directed to that specific Index Engine, otherwise the
Update Distributor selects a partition to receive a new item.
The Search Federator, on the other hand, typically broadcasts requests to all the
Search Engines, receives responses from all the partitions, then prepares a
consolidated response.
Socket connections consume system resources, which vary depending on the size of
the search grid. The majority of connections are consumed as listeners, with the
single largest number of possible connections allocated to the search engines, where
each possible simultaneous query on each search engine requires a thread.
For example, if you have 1 Admin Server, 1 Search Federator and 20 Search
Engines with 10 simultaneous queries possible, the peak resource consumption for
communication between the Search Federator and Search Engines is as follows:

Sockets:
Threads        210
Connections    200
Ports           21
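
These figures follow from the shape of the example grid: 20 Search Engines × 10 simultaneous queries accounts for the 200 connections, and 20 Search Engine listeners plus 1 Search Federator listener accounts for the 21 ports. The 210 threads are plausibly the 200 engine-side query threads plus the 10 concurrent query threads on the Search Federator itself, although that breakdown is an interpretation rather than a documented accounting.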

The socket connections allocate and hold the threads and connections. Although this
uses the maximum number of resources, there are performance benefits and
predictability since it avoids allocation and re-use overhead that may exist within Java
or the operating system.

Search Federator Connections


Search Queues
A search queue is responsible for listening for search requests on a port, adding
requests to a queue, and allowing a defined number of requests to be executed
concurrently, each on a separate thread.

There are two search queues that may be used, the “normal” queue, and the “low
priority” queue. The low priority queue was first introduced in version 16.2.6, prior
versions supported only a single queue. The motivation for the low priority queue is
based on usage patterns in Content Server. There are background programmatic
operations that perform searches, and there are interactive user searches. The
programmatic searches have the potential to consume all available search capacity,
blocking users from having access. The purpose of having two queues is to allow
specific search capacities to be independently reserved for background searches and
user searches.
Use of both queues is optional. By convention, the “normal” queue is always used.
[SearchFederator_xxx]
SearchPort=8500
WorkerThreads=5
QueueSize=25

The low priority queue is disabled by default (a port value of -1). To activate it, set LowPrioritySearchPort to an unused port and configure the queue, for example:
[SearchFederator_xxx]
LowPrioritySearchPort=8501
LowPriorityWorkerThreads=2
LowPriorityQueueSize=25

Note that using the low priority queue requires an additional port. As a general
recommendation, small values (perhaps 2 or 3) should be used for the threads to
prevent the low priority searches from consuming too many resources.
Queue Servicing
There are three phases to servicing a request to the Search Federator.
Phase 1 – Content Server indicates a desire to start a search query by
opening a connection to the Search Federator. The connection is put on an
operating system / Java queue (not in the search code).
Phase 2 – a dedicated thread takes the connection from Java, and places it
in an internal queue. If the internal queue is full, the request is discarded and
the connection is closed.
Phase 3 – when a search worker thread becomes available, the connection
is removed from the queue and given to the worker. At this point, the worker
responds to Content Server to indicate it is ready to receive the search
request, and Content Server sends the search query for processing.
Note that in versions prior to 20.2, the process around Phase 1 and Phase 2 was
different – the pending requests were left on the operating system queue, and the
internal queue had an effective size of 1.
Search Timeouts
The Search Federator places a limit on how long it will wait for an application which
has opened a search transaction. If the application does not initiate a message in
the available time, then the Search Federator will close the connection and terminate
the transaction.
Keeping a connection and transaction open is expensive from a resource
perspective, and applications that leave connections open and idle can block search
activity by consuming all available threads from the search query pool.
There are two timeout values. The first is the time between the acknowledgement that a worker is ready to receive a query and the first message arriving. This is expected to be a short time, and the default is 10 seconds.
messages – for instance between consecutive “GET RESULTS” messages. This is
longer, with a default of 120 seconds. Both times can be adjusted or disabled in the
search.ini_override file. Bear in mind these are timeouts from the server perspective
– Content Server will also have timeout values from the client perspective.

NOTE: if you are testing search, or using an interactive client for querying search, these timeout values (especially the initial connection timeout) will likely be too short, and you may wish to adjust the timeouts accordingly.

Within the [SearchFederator] section of the search.ini file, you may specify the time
the Federator will wait between a connection being created and the first command
arriving (10 second default):
FirstCommandReadTimeoutInMS=10000
Time the Search Federator will wait between commands (2 minute default):
SubsequentCommandReadTimeoutInMS=120000
In either case, the timeouts can be completely disabled with a value of 0.
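
For interactive testing, a sketch of search.ini_override entries that lengthen both timeouts (the values shown are illustrative):

[SearchFederator_xxx]
FirstCommandReadTimeoutInMS=60000
SubsequentCommandReadTimeoutInMS=300000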
The Search Federator also places a limit on how long it will wait for a response from
a Search Engine with an open search session. If the Search Engine does not reply
within the available time, then the Search Federator will terminate the search
session. For example, if the Search Federator has issued a “SELECT” to a Search
Engine, it will wait a limited amount of time for the reply. This timeout value, in the [DataFlow] section of the search.ini file, has a default value of 2 minutes:
QueryTimeOutInMS=120000
The search session on a Search Engine will regularly ping the Search Federator to
ensure that it is still responding. If the Search Federator does not answer, then the
Search Engine will terminate its search session to recover resources. In addition,
there is a failsafe timeout which is the maximum time that a Search Engine will leave
a session active. In normal operation, even if the Search Federator fails, this is not
typically encountered. Located in the [DataFlow] section of the search.ini file, the
failsafe timeout value is 6 hours:
SessionTimeOutInMS=21600000
Testing Timeouts
In a test environment, search results are often completed too quickly to permit testing
of system behavior for long searches and search timeouts. For test purposes, there
is a configuration setting that will cause all searches to take at least a defined period
of time. In production environments, this value should be 0.
MinSearchTimeInMS=0
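
For example, to force every query in a test grid to take at least five seconds (the value is in milliseconds, and purely illustrative):

MinSearchTimeInMS=5000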

File System
The Index Engines communicate updates to the Search Engines using a shared file
system. At various times, files may be locked to ensure data integrity during updates.
It is important that the Search and Index Engines have accurate file information for
this to work correctly. Some file systems use aggressive caching techniques that can
break this communication method. The Microsoft SMB2 caching is one example, and
it must be disabled for correct operation of OTSE. Microsoft SMB3 reverts to using
the SMB2 protocol in many situations, and so should also be avoided. You must
disable SMB2 caching on the servers running the search processes and on the file
server. Similarly, Microsoft Distributed File System (DFS) is known to have
unpredictable file locking behavior and must not be used.
Some customers have also experienced locking issues with NFS, and have needed
to use the NOLOCK or NOAC parameter in their NFS configuration to ensure correct
operation.
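
As a hedged illustration only, a Linux NFS mount entry in /etc/fstab using the noac option might look like the following (the server, export and mount point are hypothetical; consult your NFS documentation before changing mount options):

fileserver:/export/otsearch /mnt/otsearch nfs defaults,noac 0 0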

Server Names
Java enforces strict adherence to the various IETF standards for URIs and server
naming conventions. RFC 952, RFC 2396 and RFC 2373 are examples. Some
operating systems allow server names that do not meet the criteria for these
standards. When this happens, OTSE will likely fail with exceptions at startup. One
example we have seen is violation of this rule in RFC 952: “The rightmost label of a
domain name consisting of two or more labels, begins with an alpha character”. This
means a domain name such as “zulu.server3.7up” is invalid because the “7” must
instead be an alpha character.

Partitions

Basic Concepts
The concept of partitions is central to how OTSE scales and manages search
indexes. A search index may be broken horizontally into a number of pieces. These
pieces are known as “partitions” in OTSE terminology. The sum of all the partitions
together represents the search index.
Splitting an index into partitions is needed for a number of possible reasons:
• For best query performance, some metadata can be stored in memory.
There are practical limits on the amount of memory that can or should be
used by a single Java process. Using partitions allows these limits to be
overcome.
• OTSE can often provide better indexing or searching performance by
allowing operations to be distributed to multiple partitions. These partitions can be run on separate physical or virtual computers or CPUs to improve performance.
• Indexing and searching are disk-intensive activities. By splitting an index into
partitions, the index can be distributed over multiple physical disks and I/O
connections, improving overall search performance.
Each partition is a self-contained subset of the search grid. Each has its own index
files, a Search Engine, and an Index Engine. The partitions are tied together by the
Update Distributor (for indexing) and by the Search Federator (for queries).
Each partition is relatively independent of the other partitions in the system during
indexing. If one partition is given an object to index, the other partitions are idle. The
Update Distributor can distribute the indexing load across multiple partitions. For
systems with high indexing volumes, using multiple partitions this way can help
achieve higher performance, since partitions can be indexing objects in parallel.
A search query normally is serviced by all partitions. Only partitions containing
matches to the query will return results. The Search Federator will blend results from
multiple partitions into a consolidated set of search results.

Update-Only Partitions
It is possible to place a partition in “Update-Only” mode. In this mode, the partition
will not accept new objects to index, but it will update existing objects or delete
existing objects. If a partition is marked as Update-Only, then the Update Distributor
will not send it new objects.
Update-Only behavior is a legacy feature inherited from OT7, and is still supported
for backwards compatibility. However, it is recommended that you do not use
Update-Only mode for future applications. In normal Read-Write mode, OTSE
contains a dynamic “soft” update-only feature which is generally superior. The use
and configuration of dynamic update-only mode is covered elsewhere in this
document. Beginning with Content Server 16, Update-Only mode is not available as
a configuration option from within Content Server.
The default storage mechanism for text metadata is independently configured for
Update-Only partitions. If your default configuration for Update-Only mode differs
from Read-Write mode, then the Index Engines will convert the index data structures
the first time they restart after the configuration is changed. This default
configuration setting is found in the FieldModeDefinitions.ini file.

Read-Only Partitions
OTSE allows partitions to be placed in a “Read-Only” mode. In this mode, the
partition will respond to search queries, but will not process any indexing requests.
Objects cannot be added to the partition, removed or modified.
In operation, when started, the Index Engines for Read-Only partitions will shut down
once they have verified the index integrity. This means that fewer system resources
are being consumed. It also means that, since there is no Index Engine to respond
to the Update Distributor, a new instance of an object will be created in another
partition if you attempt to replace or update an object in a Read-Only partition.

You should only use Read-Only partitions in very specific cases. Customers will
occasionally get into trouble because they use Read-Only partitions when their
applications are still updating objects. This would happen in an application such as
Records Management – a “hold” is put on an object in a Read-Only partition, and a
duplicate entry is inadvertently created in another partition. Similarly, moving items to
another folder, updating classifications, updating category attributes and other
operations will cause this type of behavior. The search engines then respond to
search queries with multiple copies of objects.
The use of “Retired” mode for partitions avoids these issues, and should be
considered instead of Read-Only mode. Beginning with Content Server 16, Read-Only mode is no longer provided as a configuration option in the Content Server administration interface.
Read-Only partitions also have a distinct default configuration for text metadata
storage in the FieldModeDefinitions.ini file, and changing to or from Read-Only mode
may trigger data conversion on startup.

Retired Partitions
OTSE allows partitions to be placed in a “Retired” mode. This mode of operation is
intended for use when a partition is being replaced. The behavior is close to
partitions in Update-Only mode. It will not accept new items, but it will update
existing objects or delete existing objects. If a partition is marked as Retired, then the
Update Distributor will not send it new objects. The key difference is that when an
object in a Retired partition is re-indexed, it will be deleted from the Retired partition
and added to a Read-Write partition.
Support for Retired Partitions is new starting with Search Engine 10.5. Retired mode
is strongly preferred over Read-Only mode, since Retired mode avoids problems
related to creating duplicate copies of objects in the Index.
Retired partitions are also a key feature for merging many small partitions into a set
of larger partitions. This is typical for customers upgrading older systems that use
RAM mode, and are switching to Low Memory mode. In this case, approximately
65% of the partitions can be marked as “Retired”, and incremental re-indexing of the Retired partitions will move all the objects out of them. When empty, the partitions can be removed from the search grid.
One common strategy for moving items from one partition to another is to place a
partition into Retired Mode, perform a search for all items in the Retired partition, add
them to a Collection, and re-index the Collection. This moves all the items that are
re-indexed from the Retired partition into other partitions. In practice, there are often
items left behind in the Retired partition after this is done. Typically, this is to be
expected. Occasionally, a Content Server object will be deleted but not removed
from the index. When this happens, it cannot be Collected. In other cases, the
Extractor may be set to re-index only recent versions of objects, and will not re-index
older versions. In some cases, when a document was deleted, an associated
Rendition may not have been removed from the index. If unsure about whether a re-
indexed Retired partition can be deleted, the OpenText customer support
organization may be able to provide some guidance.

Note that when objects are deleted from a partition, some of the data structures
remain in place. For example, a dictionary entry for a word may exist, even though
no objects now contain that word. It is normal for a retired partition that has had all
objects removed to show a small non-zero size. The search engine will also mark
items as deleted, but leave them in place until scheduled processes compact and
refresh the data – which may take days depending on the situation.

Read-Write Partitions
For completeness, the normal mode of operation for a partition is “Read-Write” mode.
In this mode, the partition will accept new objects, can delete objects and update
objects.
Read-Write partitions can be configured to automatically behave as Update-Only
partitions as they become full. More information on soft Update-Only configuration is
available in the optimization section.

Large Object Partitions


In typical applications, the full text of objects being indexed is truncated, typically to 5
MB or 10 MB. In most cases, being able to search only the first 5 MB of text in
objects is sufficient. Note that this value applies to the actual text – a 100MB
PowerPoint deck may only contain 20 KB of actual text.
If searching the complete text of very large objects is required, the configuration
settings can be changed to adjust the truncation size to arbitrarily large values.
However, significantly more memory will be needed for every Search Engine and
Index Engine to handle the very large objects. If 4 extra GB of memory are needed
for 100 partitions, that’s 800GB of extra RAM (4 GB x (100 search engines + 100
Index Engines)).
To address this, the Search Engine can reserve specific partitions for very large
objects. Only those specific partitions need additional memory. When an object is
presented for indexing, the Update Distributor will send very large objects to one of
these reserved partitions, and all other objects are sent to traditional partitions.
For more information on configuring sizes, refer to the section “Indexing Large
Objects”.
To reserve a partition for large objects:
[Partition_xxx]
LargeObjectPartition=true

To set the size threshold for determining if an object should be sent to a large object
partition:
[DataFlow_yyyy]
ObjectSizeThresholdInBytes=1000000

Regions and Metadata


OTSE performance and tuning is strongly dependent upon how you configure, index
and query metadata. Everything you need to know about metadata configuration and tuning is covered in this section.

Metadata Regions
A region is OTSE terminology for a metadata field. Using a database analogy, you
can think of a region as being roughly equivalent to a column in a database.
Understanding and optimizing how metadata regions are defined and stored has a
big impact on performance, sizing, usability and search relevance. This section
provides background on the administration of regions to optimize the search
experience.

Defining a Region
Regions are defined in the configuration file “LLFieldDefinitions.txt”. This file is edited
to define the desired regions and their behaviors, and interpreted by the Index
Engines when they start. Currently, Content Server does not provide an interface for
editing and managing this file, so you must do this with a text editor.
Once a region is defined, it is recorded in the search index. Changing the definition
for an existing region in the LLFieldDefinitions.txt file or attempting to index a
metadata value that is incompatible with the defined region type will usually result in
an error. It is possible to redefine the type for existing metadata regions in many
cases as explained under the heading “Changing Region Types”.

Region Names
There are limitations on the labels which can be used for a metadata region. The
rules for acceptable region names are approximately the same as the rules for valid
XML labels.
The simplified explanation is that almost any valid UTF-8 characters can be used in
the name, with some exceptions. White-space characters (various forms of spaces,
nulls and control characters) are not permitted. To remain compliant with XML naming conventions, use of a hyphen ( “-” ), period ( “.” ), a number ( 0-9 ) or various diacritical marks as the first character is discouraged.
The DCS filters often create region names from extracted document properties. In
some cases, DCS will strip white space and punctuation from the property names to
ensure that the region names are comprised of valid characters.
Region names are case sensitive. The region “author” is different from the region
“Author”.

Content Server is often not case sensitive with respect to naming regions, and may derive region names from sources such as Categories and Attributes, or workflow fields. This could potentially lead to name collisions in search, so be alert to possible case sensitivity issues when creating new regions within Content Server applications.
Older versions of the search engine had less error checking on region names. It is
possible that some regions exist in legacy indexes that contain null characters.
There are configuration settings in the search.ini file that will instruct OTSE to report
and delete these incorrectly formed regions (set
“RemoveRegionsWithNulls=true”), which you should only need to use if there
are null character errors reported when trying to load an index.
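
A sketch of the setting, assuming it is placed in the same [Dataflow_] section as the other index-maintenance settings shown in this document (verify the placement against your own search.ini):

[Dataflow_xxx]
RemoveRegionsWithNulls=true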

Nested Region Names


Region names during indexing can be expressed in an “XML-like” nesting. If this
occurs, only the top level region is recognized, and the inner values are indexed
including the nesting tags. For example, if the following is presented for indexing:

<customerName>
<firstName>bob</firstName>
<lastName>smith</lastName>
</customerName>

Then the region “customerName” is indexed, and it will have the value:

<firstName>bob</firstName><lastName>smith</lastName>.
Within the definitions file, you can define hierarchy structures that should be ignored
and flattened when looking for regions to index. In the case above, by declaring
“customerName” as a nested region, the field customerName is ignored and the
regions firstName and lastName would be recognized and indexed. This is not
intended to handle arbitrarily complex nesting structures, but was designed to
accommodate a few specific instances in data presented for indexing by Content
Server. In particular, indexing of Workflow objects within Content Server prior to
Content Server 10 SP2 Update 10 is the only known requirement for the use of
nested region names. Using the above example, a nested value is expressed within
the definitions file like this:

NESTED customerName

DROP - Blocking Indexing of Regions


There is a special operator available for blocking regions from being indexed: the
DROP keyword. When a region is marked for dropping, no values will be indexed for
that region. The DROP operation can only be applied before data for the region is
indexed. Once there is data indexed for a region, DROP is no longer possible, and
an error will be written to the log files.
The DROP operator is “sticky”. Once a region is marked as DROP, this status is
remembered by the index. Deleting the DROP line from the definitions file will not re-enable indexing for that region. For most applications, use of REMOVE is
recommended instead of DROP. In the definitions file:
DROP regionName

Removing Regions from the Index


The definitions file allows you to remove regions entirely from the index. The
REMOVE operator is used to delete the values and index for the named region. Be
cautious using this command, since there is no recovering REMOVED data other
than re-indexing the values. REMOVE is an important operator for eliminating low-
value metadata that may be bloating your search index.
The REMOVE operator will also instruct the Index Engines to discard any indexing
requests for the named region, so the region will not be created. This is not sticky –
once the REMOVE entry is deleted from the definitions file, the Index Engine is free
to create and index this region.
The REMOVE operation has precedence over most of the “sticky” settings. NESTED
and DROP regions can be eliminated from the index using the REMOVE operator.
To eliminate a region from the index, in the definitions file:

REMOVE someRegionName

Special considerations exist for the compound region types DATETIME and USER.
USER regions must be removed together in the same way they were defined, with 3
regions removed:

REMOVE OTCreatedBy OTCreatedByFullName OTCreatedByName

DATETIME regions can also be removed in their entirety by specifying both regions:

REMOVE OTVerMDate OTVerMTime

There is a special case supported for removing the TIME portion of a DATETIME pair
to leave only the DATE field behind. Ensure that you also add a DATE field to
prevent conversion of the DATE field to TEXT. There is no method available to
remove just the date portion of a DATETIME field to leave the time intact.

REMOVE OTVerMTime
DATE OTVerMDate

Removing Empty Regions


By default, OTSE will automatically remove empty regions from the search index on
startup. If empty regions are detected and removed, this will trigger the creation of a
new Checkpoint, which will increase the startup time. Some applications in Content
Server create regions with temporary objects; when the objects are subsequently
deleted, the empty regions remain. This capability removes the administration “noise” of empty regions. This feature can be disabled by adding the following entry
in the [Dataflow_] section of the search.ini file:

RemoveEmptyRegionsOnStartup=false

Renaming Regions
Consider the case where you need to change the name of a metadata field in
Content Server or a custom application. You are now confronted with the problem
that data which is already indexed is using an older name for the region.
OTSE provides a mechanism for handling these situations. Within the region
definitions file, you can rename an existing region like this:

RENAME oldRegionName newRegionName


Renaming of the region occurs at startup of the Index and Search Engines. If the
new region and the old region both already exist, then this represents a conflict, and
the startup will be aborted with an error message.
When a RENAME statement exists in the definitions file, it also affects new data
being indexed. If a region named ‘oldRegionName’ is presented for indexing, it will
be indexed instead as ‘newRegionName’.
If conversions for RENAME are required at startup, this will trigger the writing of
checkpoints.
RENAME works for regions of type enum, integer, long, Boolean, and timestamp.
RENAME also works for text regions with single values stored in RAM (not on disk).

Merging Regions
The merge capability of OTSE is similar to the RENAME capability, but is instead
used to combine two existing regions. Within the definitions file:

MERGE sourceRegion targetRegion


When the engines start, if a region named sourceRegion exists, it will be copied into
a region named targetRegion. Where a conflict exists, the targetRegion has
precedence, and the value in the sourceRegion will be discarded. After the merging
operation is complete, the sourceRegion is deleted.

It is important to note the ability of the MERGE operation to discard


data when a value exists in both the source and target regions.
Use caution.

Once an index is running, any new values for sourceRegion will instead be indexed
within the targetRegion.
If targetRegion does not exist, the effective behavior of the MERGE command is the
same as a RENAME command.


There are limitations. The MERGE operation is NOT capable of merging text
metadata values that contain attributes. For Content Server, this includes the
OTName, OTDescription and OTGUID regions. The attributes will be silently lost
during the merge operation. You must check to ensure that regions being merged do
not incorporate attributes.
If conversions are required for MERGE at startup, this will trigger writing new
checkpoint files.
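
Putting these region-maintenance operators together, a hypothetical LLFieldDefinitions.txt fragment (all region names here are invented for illustration):

RENAME OldProjectCode ProjectCode
MERGE LegacyKeywords ProjectKeywords
REMOVE ScratchRegion
DROP TempDebugRegion

On the next startup of the engines, OldProjectCode would be renamed, LegacyKeywords folded into ProjectKeywords, ScratchRegion deleted from the index, and TempDebugRegion permanently blocked from indexing.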

Changing Region Types


Once a metadata region type definition is made, it is remembered in the search
index. If no explicit type definition was made, the region will have type TEXT. In
theory, type definitions should be made before indexing of objects occurs to ensure
that optimal type definitions are set. In practice, this often does not occur, and leads
to a situation in which metadata regions have type definitions that are incorrect. With
Content Server, it is common for metadata to be indexed as Text, even if it should be
a Date, Boolean or Integer.
It is possible to change the type definition for an existing search region under certain circumstances. For a type conversion to succeed, the values already indexed must be compatible with the format of the target region type. For example, an attempt to convert a TEXT region to INTEGER will work if the values are “123”, but fail for values such as “Harold Smith”.
Assuming value compatibility, the following region type conversions are viable:

From \ To    Boolean  Integer  Long  Enum  Text  Date
Boolean         –        –       –     ✓     ✓     –
Integer         ✓        –       ✓     ✓     ✓     ✓
Long            ✓        ✓       –     ✓     ✓     ✓
Enum            ✓        ✓       ✓     –     ✓     ✓
Text            ✓        ✓       ✓     ✓     –     ✓
Date            –        ✓       ✓     ✓     ✓     –
TimeStamp       –        –       –     –     –     ✓

You cannot change the type of a Text region that has multiple values or uses
attribute/value pairs, since these concepts are only available for Text regions.
The procedure is as follows:
Edit the search.ini (or search.ini_override) file to include the following entry in the
[Dataflow] section: EnableRegionTypeConversionAsADate=YYYYMMDD, where
YYYYMMDD is today’s date. This informs OTSE that type conversion is allowable
today. This is a safety feature to prevent inadvertent region type conversion.


Edit the LLFieldDefinitions.txt file to have the desired region type definitions.
Restart the search processes. On startup, the Index Engines will determine that a
conversion is required, and use the stored values to rebuild the metadata indexes for
the changed regions. This process may require several minutes per partition, longer
if many region types are being defined.
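
As a sketch of the two edits needed for a single conversion, assume a hypothetical TEXT region named PartCount whose values are all numeric. In the search.ini_override file, [Dataflow] section (the date shown is illustrative):

EnableRegionTypeConversionAsADate=20210126

And in LLFieldDefinitions.txt:

INT PartCount

On restart, the Index Engines would rebuild PartCount as an INTEGER region.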
In the event that a given value cannot be converted, the failure is recorded in the log
files and the OTIndexError count for metadata errors is incremented for the affected
object in the index.
You are strongly encouraged to back up an index before converting region types and
ensure that conversion has succeeded, reverting to the backups if there are
problems. In the log files, each failed conversion has an entry along these lines:
Couldn't set field OTFilterMIMEType for object DataId=254417&Version=1 to text/plain:

With a summary of errors for each converted region like:

Total number of errors setting field OTFilterMIMEType=112610:

LONG Region Conversion


Older versions of the default LLFieldDefinitions.txt file specified type INTEGER for a
number of Content Server fields, such as the DataID or ParentID. In current
versions, these are defined as type LONG to accommodate systems that exceed 2
billion objects. If using the old type definitions, these LONG values above 2 billion
are lost.
The Search Engine will force conversion of some of these INTEGER regions to type
LONG during Index Engine startup if encountered. This conversion was introduced
in version 16.2.5 (June 2018). The list of regions to force to type LONG is defined in
a list in the search.ini file, which has a default value of:
FieldsToBeLongCSL=OTCreatedByGroupID, OTDataID, OTOwnerID,
OTParentID, OTUserGroupID, OTVerCreatedByGroupID,
OTWFManagerID, OTWFMapManagerID, OTWFMapTaskPerformerID,
OTWFMapTaskSubMapID, OTWFSubWorkMapID, OTWFTaskPerformerID

Multiple Values in Regions


A text region may be populated with multiple values. For example, your application
may have a region named “OfficeLocation”. If you are indexing a record for a
customer that had several locations, the indexing entry in the IPools might look
something like this:
<OTMeta>
<OfficeLocation>Chicago, Illinois</OfficeLocation>
<OfficeLocation>Toronto, Ontario</OfficeLocation>
<OfficeLocation>New York, New York</OfficeLocation>
</OTMeta>
This would create 3 separate values for the region OfficeLocation attached to this
object. A search for any of “Chicago”, “Ontario” or “New York” would match this
object. Similarly, if the region OfficeLocation is selected for retrieval, the results
would return all three values.
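
As a sketch using the query syntax described later in this section, a where clause matching any one of the values would be:

where [region "OfficeLocation"] "toronto"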
When updating values in regions, you cannot selectively update one specific value of
a multi-value region. If a new value is provided for OfficeLocation for this object, all 3
existing values would be replaced with the new data – which may be a single value or
multiple values.

Attributes in Text Regions


OTSE allows the use of attributes with text regions. As an illustration, consider how
Content Server uses metadata attributes to support multi-language indexing and
searching. Within Content Server, multi-language regions are represented this way
for indexing:

<OTMeta>
<OTName lang="en">My red car</OTName>
<OTName lang="fr">Mon voiture rouge</OTName>
</OTMeta>
In addition to using the multiple value capabilities of OTSE, region attributes are used
by Content Server to tag each metadata value with attribute key/value pairs. In this
example, the key is “lang”, and the values are “en” and “fr”.
When constructing a search query, use of the region attributes is optional. A search
for “red car” or a search for “rouge” will find this object and return the values. When
values are returned, the attributes are included in the results only on request.
It is possible to construct a search query against regions that have specific region
attributes. If you only want to locate objects that contain the term “rouge” in the
French language value for OTName, the where clause would look like this:
where [region "OTName"][attribute "lang"="fr"] "rouge"

The query language has also been extended to permit sorting of results using an
attribute. Consider the case where there are values for both French and English, but
the user preference is French. Sorting based on the French values is therefore
desired. Within the “ORDEREDBY” portion of a SELECT statement, the SEQ
keyword is used to specify the attribute to be used for sort preferences:

SELECT ... ORDEREDBY REGION "OTName" SEQ "fr" ASC


In this example the results are sorted by the values within the OTName region which
have an attribute value of “fr”, in ascending order. Since there is no guarantee that
the desired attribute value exists for an object, the following rules are used:

• Use the specified attribute if it exists (in this example, “fr”);
• Otherwise, if the default attribute for this region exists, use it;
• Otherwise, use the attribute which is first alphabetically;
• If there are no attributes, then use the first value.
The concept of a default attribute is defined in the SystemDefaultSortLanguage
entry of the search.ini file. A list of regions for which default attributes should be used
is first defined, followed by the default attributes key/value pairs for each of these
regions. A priority list can be used if desired:
DefaultMetadataAttributeFieldNames="OTName","OTDescription"
DefaultMetadataAttributeFieldNames_OTName="lang"."en"
DefaultMetadataAttributeFieldNames_OTDescription="lang"."en","orig"."true"

NOTE: The INI entry is derived by appending _RegionName to the base label DefaultMetadataAttributeFieldNames.

The use of attributes with text values for specifying language values is a relatively
simple example. You may index multiple attributes within a single region. You may
also have different attributes for each value. The following example illustrates this
concept for indexing:

<ProductName color="red" origin="china">Cartoon character glass</ProductName>
<ProductName color="blue" size="large">Inflatable Djinni</ProductName>
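
A sketch of a where clause that filters on one of these attributes, following the syntax shown earlier in this section:

where [region "ProductName"][attribute "color"="blue"] "djinni"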

Within Content Server, attributes are used for multi-language regions such as OTName and OTDescription. A multi-value region with attributes is also used to index the object GUID, with the attributes used to differentiate the object GUID from the version GUID.

Region Size Attribute


There is a reserved region attribute that can be used to specify the size of a region in
bytes, the otb (OpenText Bytes) attribute. It can be supplied on any region,
and is used to prevent forgery or corruption of region data. If this attribute is present,
then the Index Engine requires that the size of the region must match the provided
otb value, measured in bytes (not UTF8 characters). If it does not match, then
serious data corruption or potential data injection is assumed and the metadata for
the object is discarded and an OTContentStatus code is used to capture the error.
For example, if an attacker provided metadata in the Description field of an object
that looked like this:
Silly stuff</Description><fakeRegion>Certified
Paid</fakeRegion><Description>nothing to see here

Then this data could be wrapped in a legitimate Description region when extracted for
indexing, resulting in:
<Description>Silly stuff</Description>
<fakeRegion>Certified Paid</fakeRegion>
<Description>nothing to see here</Description>

Which effectively forges a value for fakeRegion. By using the otb attribute,
<Description otb=94>Silly stuff</Description>
<fakeRegion>Certified Paid</fakeRegion>
<Description>nothing to see here</Description>

The Index Engine would notice that the Description region ended after only 11 bytes
instead of 94 bytes, and would prevent the injection of the fakeRegion by flagging the
object metadata as unacceptable. Content Server first began using this otb
protection for regions generated by Document Conversion Server in September
2016, and for regions provided by Content Server metadata in December 2016.
The otb attribute is never stored in the index. There is a search.ini setting that will
disable this capability, which will ignore the otb value. In the [Dataflow_] section:
IgnoreOTBAttribute=true

Metadata Region Types


This section contains a list of the basic data types supported in metadata regions,
and their syntax within the region definitions file LLFieldDefinitions.txt.
The general format of an entry in the file is a keyword, whitespace, then parameters.
Whitespace can be tab or space characters.
Most region definitions are sticky, so changing the definitions file for
an existing installed application will often generate errors. For
upgrades, replacing the LLFieldDefinitions.txt file is therefore
usually not recommended. When you do upgrade, you should
review release notes to see if there are new regions from DCS or
Content Server that should be manually added to existing
definitions files BEFORE indexing new data.

Key
Each object in the index must have a unique identifier, or key. The KEY entry in the
region definitions file identifies which region will be used as this unique identifier. It is
of type text and may not have multiple values. Exactly one must be defined. The
default Key name is OTObject. During indexing, the Key is typically represented by
the entry OTURN within an IPool. To paraphrase, in a default Content Server
installation, the OTURN entry in an IPool is treated as the Key, and populates the
region OTObject.

KEY OTObject

Text
Text, or character strings. Text strings must be defined in UTF-8 encoding. Text
strings can potentially be very large. Because of this, many customers find that the
available space in their search index is consumed quickly by text regions. To help
manage the large potential sizes, there are several methods available for storing text
metadata. This is covered in a separate section.
Text values may contain spaces and special punctuation. When represented in the
input IPools, certain characters may need to be ‘escaped’ to allow them to be
expressed in the IPools. In general, this means placing a backslash (‘\’) character
before “greater than” and “less than” characters (‘<’ and ‘>’).
There are some features available for TEXT regions which are not available for other
data types, and these may affect the decision about which type of region is suitable
for a given metadata field. TEXT regions support multiple values for an object, and
TEXT regions also support attribute keys and values.
It is possible to index numeric information in a text region, but they
are indexed as strings. When using comparison operations – such
as greater than, less than, ranges and sorting – remember that
strings sort differently than numbers. Intuitively, you expect the
number 123 to be greater than the number 50. But text
comparisons consider 123 to be less than 50. For example, in a
TEXT region, a clause of WHERE [region "partnum"] range
"100~200" will match a value of 1245872. If numeric comparisons
are important, a TEXT region is not a good choice.

TEXT is the “default” type for a region which is indexed without an entry in the definitions file. Put another way, TEXT metadata regions are automatically and dynamically created during indexing whenever a new region name is encountered. If your application allows arbitrary creation of metadata regions, this may result in unexpected growth of the search index.
In the definitions file:

TEXT textRegionName
There are default limits on the size and number of values you can place in a text
region. It is possible to configure these limits on a per-region basis. Size is
expressed in Kbytes. These parameters are optional. More details are available in
the “Protection” section of this document.

TEXT textRegionName maxValues=200 maxSize=250

Rank
The rank type region is a special case for modifiers used in computing the relevance
of an object to boost its position in the result list. For example, frequently used
objects may be given a rank of 50. The default is 0. Values in this region must be
between 0 and 100 inclusive. Only one region may be defined with the RANK type. In the
definitions file:

RANK rankRegionName

Integer
An integer is a 32 bit signed value, which can represent an integer value between -2,147,483,648 and 2,147,483,647. Integer values are stored in memory. Search
results can be sorted on an integer field. In the definitions file:
INT integerRegionName

Long Integer
A long integer is a 64 bit signed value, which can represent a number between
−9,223,372,036,854,775,808 and 9,223,372,036,854,775,807 inclusive. LONG
integer values are stored in memory. Existing Integer fields in an index can be
converted to LONG Integer values by changing their definition. Search results can
be sorted on a LONG integer field. In the definitions file:
LONG longRegionName

Timestamp
A TIMESTAMP region encodes a date and time value. TIMESTAMP values are
expressed in a string format that is compatible with the standard ISO 8601 format.
The milliseconds and time zone are optional, but time up to the seconds is
mandatory:
2011-10-21T14:24:17.354+05:00
2011-10-21T14:24:17
Where:

2011 – 4 digit calendar year
10 – 2 digit calendar month
21 – 2 digit calendar day
T – separates date from time
14 – 2 digit hour in 24 hour format [00 to 23]
24 – 2 digit minute [00 to 59]
17 – 2 digit second [00 to 59]
354 – milliseconds [000 to 999]
+05:00 – optional time zone offset preceded by + or –

NOTE: 24 is not accepted for 12 midnight; use 00.

The time zone is always optional. If omitted, the local system time zone will be
assumed. The local system time zone is determined from the operating system, but
can also be explicitly set by means of a search.ini file setting. Internally, timestamp
values are converted to UTC time before being indexed.
During search queries, lower significance time elements can be omitted. For
instance, the following will all be accepted:
2011-05-30T13:20:00
2011-05-30T13:20
2011-05-30
2011
If not fully specified, during indexing the earliest possible time for a value will be
used. For example:
2011-05
Would be interpreted as:
2011-05-01T00:00:00.000

TIMESTAMP values are kept in memory, stored as 64 bit integers. In the definitions
file:
TIMESTAMP timestampRegionName
There are special behaviors for several reserved metadata regions that use
TIMESTAMP definitions for tracking the time when objects are indexed or modified.
See the section on Reserved Regions for more information.


Enumerated List
The enumerated type is ideal for metadata regions which will have one of a defined
set of values. For example, file type identifiers (Word, Excel, etc.) are members of a
set of file types. Enumerated lists use less memory than text if RAM storage is being
used. In the definitions file:
ENUM enumerableRegionName

Boolean
The BOOLEAN type is used for objects which can have a value of true or false.
Fields of type BOOLEAN use memory very efficiently. In order to accommodate the
reality that different applications represent BOOLEAN values in different ways, the
indexing processes will accept BOOLEAN values in any of the following alternate
forms:
true false
yes no
1 0
on off
y n
t f
Boolean values are not case sensitive, so that False, FALSE and false are
equivalent. When retrieved, the values are always presented as true or false,
regardless of which form was used for indexing. If building a new indexing
application, the use of true and false is the preferred form.
BOOLEAN booleanRegionName

Date
A Date region accepts a string that represents a date in the form ‘YYYYMMDD’,
where YYYY is the year, MM the month, and DD the day. For example, 20130208
would represent February 8th 2013. Date values can be presented in search facets, and used in relevance scoring computations. This form of a Date matches the
format for dates used in Content Server. The date portion of a DateTime region is
effectively a Date region. The Date region type is first available in Search Engine 10
Update 10.
DATE dateRegionName

Currency
A region can be defined as a currency, a feature first available with Update 2015-09.
When so declared, the input data will be assumed to be in one of several common
forms that are used to represent currency values. The data is stored internally as a
long integer, with an implied 2 decimal digits. Character strings preceding or trailing
the currency value are discarded, which would typically be a symbol or a country
currency designation. Although some tolerance of poorly formed currency values is
built in, the expectation is that well formed data with 0 or 2 digits after the decimal will
be present. Examples of valid currency representations are:


$1,376,378 → 1376378.00
1456.87 AUD → 1456.87
€ 8.447,75 → 8447.75
$ 4000US → 4000.00

CURRENCY2 ListPrice

Date Time Pair


The DateTime definition is a special case for convenience in Content Server
applications. Content Server represents dates and times for most metadata regions
as integers. This type is a convenience function that declares the relationship
between a given date region and a time region. DATES for indexing must be an
integer of the form YYYYMMDD, and TIME values must be of the form HHMMSS,
where HH is based on a 24 hour clock. There is no time zone adjustment. Both are
stored as integer regions, and can be independently indexed and queried. This type
is not recommended for new applications. In practice, most Content Server
applications only care about the date, not the time. So creating a DATE field and
discarding (REMOVE) the time portion results in smaller index sizes. In the
definitions file:
DATETIME dateRegionName timeRegionName

User Definition Triplet


The User type is a special case for convenience in Content Server applications.
Content Server often uses 3 alternate values to represent a user: a user ID – which is
an integer; a username – which is a text value; and a userFullName – also a text
value. This convenience function declares the triplet as types integer, text, text.
Each region can be separately indexed and queried. In the definitions file:
USER integerRegionName textRegionName textRegionName
This type is not recommended for new applications.

Aggregate-Text Regions
An AGGREGATE-TEXT region has a search index which is the sum of all the regions
it aggregates, but does not store a copy of the values. The values remain within the
original regions. Aggregation only applies to TEXT regions.
Judicious use of AGGREGATE-TEXT regions can improve search performance and
simplify the user experience. Searching many text regions is slower than searching
against an equivalent AGGREGATE-TEXT region. When the AGGREGATE-TEXT
feature is combined with the DISK_RET storage mode for text regions, a significant
reduction in the total memory used to store the index and metadata of the aggregate
is possible if not using Low Memory mode.
AGGREGATE-TEXT regions are constructed using the LLFieldDefinitions.txt file.
Create an entry along these lines:


AGGREGATE-TEXT AggName OTCreatedBy,OTModifiedBy,OTDocAuthor


In this example, a new field is created, “AggName”. The values from the regions
named OTCreatedBy, OTModifiedBy and OTDocAuthor are all placed as separate
values into the AggName field.
There is a special case for defining aggregates: a trailing wildcard character.
AGGREGATE-TEXT DocProperties OTFileName,OTDoc*
This would place the OTFileName region and any text region that starts with OTDoc
into the DocProperties region.
Regions that match the wildcard pattern can be excluded by using an exclamation
mark instead of a comma as the preceding delimiter. The following illustrates
excluding two regions from a pattern match:
AGGREGATE-TEXT DocProperties OTFileName,OTDoc*!OTDocAuthor!OTDocumentUserRating
The exclusions must be exactly specified: they must follow the wildcard operator; they must match the wildcard pattern; and they must not themselves contain wildcards.
When the Index Engines start, if the AGGREGATE-TEXT configuration has been
changed, a one-time conversion of the index takes place. The Aggregate
configuration is then subsequently applied to new objects as they are indexed or
updated.
Deleting the entry for an AGGREGATE-TEXT field within the LLFieldDefinitions.txt file
does not cause the field to be deleted. The REMOVE command in the
LLFieldDefinitions.txt file must be used to remove an AGGREGATE-TEXT region.
REMOVING an AGGREGATE-TEXT region will delete the index for the region, but
does not eliminate the underlying regions that comprise the Aggregate.
If the definition of an AGGREGATE-TEXT field is edited to add or remove regions
from the list of regions which comprise an Aggregate, then when the Index Engines
are next started, the AGGREGATE-TEXT region will be rebuilt. This will take some
time, and results in a new checkpoint being written.
It is possible to combine AGGREGATE-TEXT with any text region storage mode. For
example, if Storage-Only mode (DISK_RET) is used, then only the Aggregate region
can be searched, but each component region can be retrieved.
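
As a sketch of this combination (the region names here are hypothetical), the aggregate is declared in the LLFieldDefinitions.txt file:

AGGREGATE-TEXT OTDocProps OTDocTitle,OTDocSubject

and the component regions are placed in Storage-Only mode in the FieldModeDefinitions.ini file:

[ReadWrite]
OTDocTitle=DISK_RET
OTDocSubject=DISK_RET

Searches would then be issued against OTDocProps, while OTDocTitle and OTDocSubject remain individually retrievable.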

CHAIN Regions
The CHAIN definition can be used to define a synthetic region which is used for
constructing queries against lists of regions. The list is prioritized. The value of the
first region that is defined (not null) is used for evaluating the query. There is no
additional storage or index penalty since the definition is an instruction used at query
execution that directs how the CHAIN region should be evaluated.
CHAIN UserHandle UserID FacebookID TwitterID

A search for [region "UserHandle"] "bsmith" would be interpreted as:

If UserID defined for object
    Match object if UserID=bsmith
Else if FacebookID defined for object
    Match object if FacebookID=bsmith
Else
    Match object if TwitterID=bsmith

CHAIN regions can be used with any region type. Using different region types within
a single CHAIN region is not recommended, since not all search operators are
consistently available or applied to all region types.
The [first "UserID","FacebookID","TwitterID"] syntax in a query is equivalent to a
CHAIN region for queries. However, when a CHAIN region is predefined, the value
of the CHAIN region can also be requested in the search results using the SELECT
statement.

Text Metadata Storage


Text metadata regions usually comprise the bulk of the regions in a search index.
OTSE provides a number of alternate mechanisms for storing text regions. Each of
these alternatives has relative strengths and weaknesses, and the storage modes
should be selected to best meet the needs of your system. This section applies only
to text regions – other region types, such as integers or dates, are always stored in
memory.
Before describing each mode, it is useful to understand the requirements for storage.
Each text metadata region is comprised of an index and values. Value storage is
used to keep an exact copy of the text metadata, allowing it to be retrieved in search
queries. The values usually require significantly more space than the index.
Some modes of operation require a copy of the index or values be kept in memory, in addition to the persistent disk storage. Other modes are designed to use the disk
representation for searching. The tables below outline the common configurations for
Text metadata fields, and illustrate differences between search operations and how
the data is stored on disk.
Index Element Locations used during Search Operations

Mode | Integers, dates, times, etc. | Text Metadata Index | Text Metadata Values | Full Text Index
RAM | Memory | Memory | Memory | Merge fragments + AccumLog
DISK | Memory | Memory | Checkpoint + MetaLog | Merge fragments + AccumLog
Low Memory (+DISK) | Memory | MOD fragments + MODAccumLog | Checkpoint + MetaLog | Merge fragments + AccumLog
Merge Files (+ Low Memory) | Memory | MOD fragments + MODAccumLog | MODCheck + MODCheckLog | Merge fragments + AccumLog


Persistent Storage of Index Elements

Mode | Integers, dates, times, etc. | Text Metadata Index | Text Metadata Values | Full Text Index
RAM | Checkpoint + MetaLog | Checkpoint + MetaLog | Checkpoint + MetaLog | Merge fragments + AccumLog
DISK | Checkpoint + MetaLog | Checkpoint + MetaLog | Checkpoint + MetaLog | Merge fragments + AccumLog
Low Memory (+DISK) | Checkpoint + MetaLog | MOD fragments + MODAccumLog | Checkpoint + MetaLog | Merge fragments + AccumLog
Merge Files (+ Low Memory) | Checkpoint + MetaLog | MOD fragments + MODAccumLog | MODCheck + MODCheckLog | Merge fragments + AccumLog

It is possible to change the text metadata storage modes for an existing index without
re-indexing the content. The Index Engines can perform any necessary storage
mode conversions when they are started.
Content Server exposes control over the storage modes in the search administration
pages. Beginning with Content Server 16, support for several legacy configuration modes has been removed, forcing indexes to use DISK + Low Memory + Merge
Files as the proven best overall configuration. For most applications, the
configuration file settings described here will not need to be directly manipulated.

Configuring the Storage Modes


RAM versus DISK storage modes can be explicitly defined for a text region. If not defined, then a default storage mode is used. Storage modes are specific to the mode of a partition. The storage modes are defined in the FieldModeDefinitions.ini file, which looks like this:
[General]
NoAdd=DISK
ReadOnly=DISK
ReadWrite=RAM
Retired=DISK

[ReadWrite]
SomeRegionName=DISK
OtherRegionName=DISK_RET

[ReadOnly]
ImportantRegionName=RAM


[NoAdd]
HugeRegionName=DISK

[Retired]
HugeRegionName=DISK

The [General] section of this file specifies the default storage mode for text metadata.
The ‘NoAdd’ value is the setting for Update-Only partitions.
You can also specify storage modes for regions which differ from the default settings.
Each partition mode has a section, and a list of regions and their storage modes can
be provided. Note that Low Memory and Merge File storage modes require DISK
configuration as a pre-requisite.
The FieldModeDefinitions.ini file is generated
dynamically by administration interfaces within
Content Server. Normally, you should not edit
this file.
Beginning with Content Server 16, RAM based storage, ReadOnly mode and NoAdd
mode are no longer available through the administrative interfaces.

Memory Storage (RAM)


In this configuration, the text index and values are stored on disk using the
Checkpoint system. A copy of the index and values is kept in memory for use when
searching. This provides the fastest operation when search results must be retrieved,
since it minimizes disk activity. Conversely, memory storage consumes the most
memory in partitions, and is often the limiting factor in how large a partition may be.
Memory storage is selected using the ‘RAM’ keyword in the FieldModeDefinitions.ini
file. This mode of operation has been available for many years.

Disk Storage (Value Storage)


In this configuration, the index is stored on disk in the same manner as the Memory
mode above. The key difference is that a copy of the values for Text metadata
regions is not kept in memory (hence the name Value Storage). If values need to be
retrieved, they are read from Checkpoint files on disk. The index for the Text
metadata is on disk, with a copy in memory for search purposes.
Keyword searches are still fast because the index is in memory, but search queries which need to examine the original data, such as phrase searches, are generally slower. Retrieving values from disk for display is also slower. Disk storage is a good choice if you do not require the fastest possible search performance, or for regions which are not commonly searched or displayed. Disk storage mode is
selected in the FieldModeDefinitions.ini file using the value of “DISK”. This mode of
operation has been available for many years.
Indexing is somewhat slower in Disk storage mode relative to Memory storage. A
typical Content Server installation, which has hundreds of text metadata regions, will typically see a 30% reduction in the indexing performance with Disk storage relative
to Memory storage. For example, in one of the OpenText test cases using a 4-
partition system performing a 1 million+ objects indexing test: 7 hours 24 minutes
with Disk mode versus 5 hours 9 minutes in RAM mode.

Low Memory Mode


Low Memory disk storage leverages the technology used to represent the full text
index to similarly store text metadata indexes. The text metadata values are stored
in the Checkpoint file, and the text metadata index and dictionary is encoded in files
stored on disk. The overall result is a 3 to 4 times increase in the number of typical
Content Server objects that can be managed by a search partition using the same
amount of memory.
The Low Memory mode for disk indexes was introduced in Content Server 10 Update
9. Installations of Content Server 10.5 and later will default to Low Memory mode, overriding the OTSE default of Value Storage mode.
Configuration of Low Memory mode requires DISK mode to be configured in the
FieldModeDefinitions.ini file as a pre-requisite. Once DISK mode is defined, Low
Memory mode is enabled in the [DataFlow_] section:
MODDeflateMode=1

Switching between Value Storage and Low Memory disk modes will trigger a
conversion of the index format when the Index Engines are next started. Typically,
conversion of a partition should be less than 20 minutes. Value Storage mode is
backwards compatible with versions of Search Engine 10.0 back to Update 2. Low
Memory mode is new beginning with Update 9, and partitions in Low Memory mode
cannot be read by earlier versions of Search Engine 10.5.

Merge File Storage


The Merge File storage method uses a dedicated set of files to persist the Text
metadata values. These operate much like the index files – using background merge
processes to consolidate recently changed values into larger compacted files.
Compared to the alternative of storing the Text metadata values in Checkpoint files,
this is a major advantage since the size of the Checkpoint files is significantly smaller.
This means that the time required to write Checkpoints is reduced, resulting in higher
potential indexing throughput.
The Index Engines support converting existing indexes into and out of Merge File
storage mode for text values when started. The conversion time is approximately the
time needed to start the search grid, write new checkpoints, plus possibly a few
minutes of conversion time.
DISK configuration in the FieldModeDefinitions.ini file is a required prerequisite. Use
of Low Memory mode for Text Metadata index storage is strongly encouraged as a
prerequisite, since this is the tested variation. The configuration settings are located
in the [Dataflow_] section of the search.ini file. By default, Merge File storage is
disabled for backwards compatibility. The key settings are:
MODCheckMode=0


The Merge File storage mode is first available in Content Server 10.5 Update 2015-03.

Retrieval Storage
This mode of storage is optimized for text metadata regions which need to be
retrieved and displayed, but do not need to be searchable. In this mode, the text
values are stored on disk within the Checkpoint file, and there is no dictionary or
index at all. This mode of operation is recommended for regions such as Hot
Phrases and Summaries. These regions do not need to be searchable since they
are subsets of the full text content (you can search the full body text instead). Typical
ECM applications see a savings of 25% of metadata memory using Retrieval Storage
mode instead of Memory Storage for these two fields.
Retrieval Storage mode can be configured in the FieldModeDefinitions.ini file using
the value DISK_RET.

[DataFlow_DFname0]
DiskRetSection=DISK_RET

[DISK_RET]
RegionsOnReadWritePartitions=OTSummary,OTHP
RegionsOnNoAddPartitions=OTSummary,OTHP
RegionsOnReadOnlyPartitions=OTSummary,OTHP

Storage Mode Conversion


When the engines are started, any changes to the storage modes are applied to the
existing index. This requires index conversion, and creation of new Checkpoint files.
This process adds time to the startup. How long? It depends; the size of the index,
the number of fields to convert, the CPU, memory and disk properties are all factors.
In an appropriately scaled hardware environment, this would typically be 10 minutes
per million items in a partition or less, but this time can vary widely.
In general, you can convert between storage modes with impunity. If you put a
region into Retrieval-Only mode and later discover that it needs to be searchable,
simply change the appropriate settings in the FieldModeDefinitions.ini file, restart the
search grid, and everything is wonderful.

In practice, you cannot always convert between storage modes. If you are close to the limit of available RAM for your partition, then converting to a more RAM-intensive storage mode may result in the partition exceeding the available memory. Converting from Low Memory to Value Storage mode is one example. If you have memory available, then simply increasing the memory limits can solve this. Otherwise, you may need to use other tricks, such as rebalancing partitions to make more room, or deleting or moving other less-important regions to disk to make space available. If you are uncertain about whether you will need to convert regions, using a more conservative partition memory setting may be advisable in order to ensure you have memory available for future metadata region tuning.

Reserved Regions
There are a number of region names which are reserved by OTSE, and application
developers must be aware of the restrictions on their use. In most scenarios, the
Document Conversion Server is part of the indexing process, and DCS will also add
a number of metadata regions that are not described here.

OTData - Full Text Region


In some cases, the full text (or body of the content) can be considered to be a region.
The region name “OTData” is reserved for this purpose. A query constructed to look
for a term in the region OTData will search the full text body.

OTMeta
The OTMeta region is reserved for use in two ways. In the first case, the region
OTMeta is reserved to indicate the collection of all metadata regions defined in the
Default Metadata List. This list is described in the search.ini file by the entry
DefaultMetadataFieldNamesCSL. A query against the OTMeta region will
search this entire list of regions. Where possible, this should be discouraged since
searches of this form may be relatively slow compared to searching in a specific
region, particularly if there are many regions included in the default search region list.
The second application is using OTMeta as the prefix for a region in a search query.
A query with a WHERE clause of [region "someRegion"] "term" is
equivalent to [region "OTMeta": "someRegion"] "term".

XML Text Regions


The full text search engine has the ability to treat indexed XML files as if they were
regions for query purposes. No type definition is required; all data is considered to be of type text. Consider the following XML fragment which gets indexed as part
of the text content:

<furniture>
<chairs>
4
<chairColor>red</chairColor>
</chairs>
</furniture>

You can construct a query to locate objects where the chair color is red. The
WHERE clause of the search query would look something like this:


[region "OTData":"furniture":"chairs":"chairColor"] "red"

The XML search capability does not require a complete XML path specification. The
following WHERE clauses would also match this result, but would potentially also
match other results that are less specific:
[region "OTData":"chairs":"chairColor"] "red"
[region "OTData":"chairs"] "red"

To be a candidate for XML search matching, the XML document must have been
assigned the value text/xml in the OTFilterMIMEType region, which is typically
the responsibility of the Document Conversion Server. The metadata region and the
value for allowing XML content search are configurable in the DataFlow section of the
search.ini file:
ContentRegionFieldName=OTFilterMIMEType
ContentRegionFieldValue=text/xml

OTObject
Each index must specify a unique key region which functions as the master reference
identifier for an object. The region which represents the key is declared in the region
definitions file, but by convention and by default, the region OTObject is almost
always used as the key. During indexing, the unique key is defined in the OTURN
entry for an IPool object.
In practice, Content Server uses strings that begin with “DataId=” for the unique
identifier of managed objects. There are special cases in the code that rely on this
form of the OTObject field to determine when certain optimizations can be applied,
such as Bloom Filters for membership within a partition. If you are creating
alternative or custom unique object identifiers, ensure that the string “DataId” is not
present in the identifier to avoid unexpected behaviors.
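
For illustration only (the identifier scheme is hypothetical), a custom key that avoids the reserved pattern might appear in the OTURN entry of an IPool, following the entry form shown in the Indexing section later in this document:

<Entry>
<Key>OTURN</Key>
<Value>
<Size>17</Size>
<Raw>MyApp-00042;ver=1</Raw>
</Value>
</Entry>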

OTCheckSum
This region contains a checksum for the full text content indexed for an object. The
value is generated by the Index Engines. Attempts to provide an OTCheckSum value
when indexing an object will increment the metadata error count for the object, and
be ignored. You can search and retrieve this region.
Internally, the Index Engines use this field to optimize re-indexing operations by
skipping content that is unchanged. This value is also used by index verification
utilities to verify that data has not been corrupted.

OTMetadataChecksum
This region has several purposes related to checksums for metadata. You cannot
index this region, but you can query against it and retrieve the values. Internally, this
value is used to verify the correctness of the metadata. Errors in the checksum
generally indicate severe hardware errors.


When a new object is indexed, a checksum of each metadata value is made. These
values are combined to create an aggregate checksum value, and the checksum is
stored in the region OTMetadataChecksum.
A background process is then scheduled which runs at a low priority. This process
traverses all objects in the index and recalculates the metadata checksum. If the
recalculated value does not match the stored value, a message is logged, and an
error code (-1) is placed in the OTMetadataChecksum region for that object.
Applications can find objects with metadata checksum errors by searching for a value
of -1 in this region.
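
As a sketch, using the WHERE clause syntax shown elsewhere in this document, such a query could look like:

WHERE [region "OTMetadataChecksum"] "-1"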
If an existing index does NOT have checksums computed, then the background
process will populate checksum values. When objects are re-indexed, changes to
the metadata will be reflected in the new checksum. Transactional integrity for
metadata regions that were not changed is preserved.
There are configuration settings in the Index Engine section of the search.ini file that allow the feature to be ON, OFF or IDLE. When IDLE, new indexing operations will still create checksums, but the background process will not validate them. The following entry controls the mode; acceptable values are ON, OFF and IDLE, and the default value is OFF for backwards compatibility:
MetadataIntegrityMode=OFF (IDLE | ON)
By default, the engines will wake up once every two seconds and verify 100 objects:
MetadataIntegrityBatchSize=100
MetadataIntegrityBatchIntervalinMS=2000
Metadata regions stored on disk are excluded from this processing by default, since
disk files have other checksum validation mechanisms. It is possible to include
checksum validation for regions stored on disk, as indicated below, but the
processing is considerably slower in this mode:
TestMetadataIntegrityOnDisk=OFF (ON)

OTContentStatus
This region is used to record an indicator of the quality of the full text index for each
object. This data can assist applications with assessing the quality of the indexed
data, and taking corrective action when necessary. The status codes are roughly
grouped into 4 levels of severity – level 100, 200, 300 and 400 codes, where 100
level codes indicate good indexed content, and level 400 codes represent significant
problems with the content.
Applications can provide a status code as part of the indexing process. If the
Indexing Engines encounter a more serious content quality condition (a higher
number code) then the higher value is used. In other words, the most serious code is
recorded if multiple status conditions exist.
The majority of the codes are generated within DCS. Based upon Content Server 16,
the defined codes are:


100 There is no content indexed, only metadata. This is expected behavior, since
no content was provided as part of the indexing request.

103 This is the value for a normal, successful extraction and indexing of a single
document, both text and metadata.

104 One or more metadata regions contained non-UTF8 data. The non-UTF8 bytes
were removed and best-attempt indexing of the region performed. This
behavior only exists when region forgery detection is disabled.

120 The full text content of the indexing request was correctly processed, and is
comprised of multiple objects. The metadata of only the top or parent object
was extracted. The full text content of all objects is concatenated together. An
example is when multiple documents within a single ZIP file are indexed.

125 There were multiple objects provided for indexing, but some of them were
intentionally discarded because of configuration settings, such as Excluded
MIME Types. The metadata of only the top or parent object was extracted. The
full text content of all objects that were not discarded are concatenated
together. A typical example would be when a Word document and JPEG photo
are attached to an email object, and the JPEG was discarded as an excluded
file type.

130 There were one or more content objects provided for indexing, but all were
intentionally discarded because of configuration settings, such as Excluded
MIME Types. There is no full text content.

150 During indexing, the statistical analyzer in the Index Engine identified that the
content has a relatively high degree of randomness. This is a warning, the data
was accepted and indexed.

300 During indexing, the text required more memory than is allowed by the
Accumulator memory settings that are currently configured. The text has been
truncated, and only the first portion of the text that fit in the available memory
has been indexed.

305 Multiple content objects were provided, and at least one but not all of them are
an unsupported file format. There is some full text content, but the content of
the unsupported files have not been indexed.

310 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects consists of an unsupported file
format.

320 Multiple content objects were provided, and at least one but not all of them
timed out while trying to extract the full text content. There is some full text
content, but the content of the objects which timed out have not been indexed.

360 Multiple content objects were provided, and at least one but not all of them
could not be read. There is some full text content, but the content of the objects
exhibiting read problems have not been indexed.

365 One or more content objects were provided, and the full text of at least one but
not all of them could be indexed. At least one of these objects was rejected
because of a serious internal or code error while preparing the content. This error may or may not recur if you re-index this object.

401 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because of
unsupported character encoding.

405 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because the
process timed out while trying to extract the full text content from a file.

406 Non-UTF8 data was found in metadata regions with region forgery detection
enabled. The metadata was discarded.

408 One or more content objects were provided, and the full text of none of them
could be indexed. At least one of these objects was rejected because of a
serious internal or code error while preparing the content. This error may or
may not recur if you re-index this object.

410 DCS was unable to read the contents of the IPool message or the file
containing the content. No full text content has been indexed.
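
As a sketch (assuming the status codes are indexed such that range comparison behaves as expected; all defined codes are three digits, so string comparison also orders correctly), a query clause along these lines could locate objects whose content indexing encountered serious problems:

WHERE [region "OTContentStatus"] range "300~410"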

OTTextSize
This region captures the size of the indexed full text content in bytes. Note that for many languages there may be fewer characters than bytes. This value reflects the size of the text extracted by DCS and filters, and can be significantly different from the OTFileSize region defined by Content Server. The region should be declared as type INTEGER, and is first available in update 21.1.
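
In the definitions file, following the INT syntax described earlier:

INT OTTextSize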

OTContentLanguage
This region is optionally generated by the Document Conversion Server. DCS can
assess the full text content of an object to determine the language in which the
content is written. The language code is then typically represented in this region.

OTPartitionName
This is a synthetic region, generated when results are selected. You may not provide
this value for indexing. This region returns the name of the partition which contains
the object. In a search query, OTPartitionName supports equals and not equals, for
either an exact value or a specific list of range values. Operations like regular
expressions or wildcards are not supported. This limited query set is intended to help
administrators with system management tasks, such as locating all the objects in a
given partition. In Content Server, partition names usually start with the text
“Partition_”.
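
For example, a sketch of a query clause to locate the objects in a specific partition (the partition name is hypothetical):

WHERE [region "OTPartitionName"] "Partition_3"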

OTPartitionMode
This is a synthetic region, generated when results are selected. You may not provide
this value for indexing. This region returns the operating mode of the partition which contains the object. In a search query, OTPartitionMode supports equals and not
equals, for either an exact value or a specific list of range values. Operations like
regular expressions or wildcards are not supported. This limited query set is
intended to help administrators with system management tasks, such as locating all
the objects in a retired partition. The mode will be one of:

ReadWrite – Normal configuration, including partitions in rebalancing or soft update only mode.

NoAdd – The partition is configured for updates only.

ReadOnly – The partition is configured for read-only mode.

Retired – The partition is configured in retired mode.

OTIndexError
This field is used to contain a count of metadata indexing errors associated with an
object. Metadata indexing errors occur for situations such as:
• An improperly formatted metadata value. A string value within an integer or date field would be an example of this.
• An improperly formed region name.
• Attempts to provide values for reserved and protected region names.
For each such instance, the OTIndexError count region is incremented. Applications
providing objects for indexing may provide an initial value. For example, DCS may
have found that a date or integer value it attempted to extract was incorrect, and
therefore could determine that there is already a metadata error before the Index
Engine is provided with the object.
The error counts are incremental. Updates to objects which contain metadata errors
can cause this value to become artificially inflated. For example, if an object is added
with a date error, and then 10 updates include the same date error, then the error count may be 11.
Applications can query and retrieve this field to help assess the quality of the search
index.

OTScore
This synthetic region usually contains the computed relevance score for a search
result as an integer value. With the default configurations, a relevance score is
between 0 and 100. It is important to understand that the relevance score as
computed does NOT have any measurable correlation with the relevance of an object
as assessed by a user. These scores at best must be considered relative. For most
applications, displaying the OTScore (or computed relevance) is not normally
appropriate.


Although a simple integer is presented in the OTScore, internally the relevance differences between objects may be very small fractions. The sorting of objects internally for relevance is based on the floating point value.

In hindsight, a better name for this region would have been OTSortRegion. If the results are not ordered by relevance, this region will not contain a relevance score, but will instead contain the values which represent the sort key. If results are not sorted (ORDEREDBY NOTHING) then OTScore will be populated with a value of 1.

TimeStamp Regions
During indexing operations, the Index Engine can mark objects with the time that
objects are created or updated. This behavior is enabled by including the appropriate
definitions in the LLFieldDefinitions.txt file as described below. When enabled, by
default these timestamps are added on all objects. If trying to minimize the index
size, you might want to add timestamps to only a subset of objects. For example,
with Content Server, you might want to add timestamps to only the Content Server
“Index Tracer” objects. For stamping only limited object types, ensure the
TimeStamp fields are defined in LLFieldDefinitions.txt, and add the list of object types
to the [DataFlow_] section of the search.ini file. Only objects that contain an
OTSubType value in the list will have the time stamp values added:
IndexTimestampOnlyCSL=147
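
For reference, a sketch of the corresponding declarations in the LLFieldDefinitions.txt file, using the TIMESTAMP syntax described earlier (the four regions are described below):

TIMESTAMP OTObjectIndexTime
TIMESTAMP OTContentUpdateTime
TIMESTAMP OTMetadataUpdateTime
TIMESTAMP OTObjectUpdateTime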

OTObjectIndexTime
When an object is created, this field will be populated with the current time, as
determined by the system clock. This field has the type TIMESTAMP, and must
be declared in the LLFieldDefinitions.txt file to function.
OTContentUpdateTime
When the text content of an object is updated, this value records the current time
for the update. Only actual changes to the content will trigger a change. If an
object is re-indexed, but the text content is identical, then this value will not be
updated. This region has the type TIMESTAMP, and must be declared in the
LLFieldDefinitions.txt file to function.
The definition of “identical” is based upon the text as interpreted by the index
engine. Changes in the tokenizer or file format filters may result in the text being
declared “different”, even if the master object content is unchanged.
OTMetadataUpdateTime
This field records the time at which the metadata for an object was last modified.
If an object is re-indexed and no metadata changes, then this value is not
updated. This region has the type TIMESTAMP, and must be declared in the
LLFieldDefinitions.txt file to function.


OTMetadataUpdateTime leverages the Metadata Integrity Checksum feature. Metadata Integrity checking must be set to ON or IDLE for the OTMetadataUpdateTime to function.

OTObjectUpdateTime
This field is updated any time the metadata OR the content is changed. You
should normally not remove this field, since it is required for correct operation of
Search Agents.

_OTDomain
The searchable email domain feature generates synthetic regions by appending this
suffix to the email region name. For instance, if your region that contains email is
OTEmailSender, then the region OTEmailSender_OTDomain will be created to
support the email domain search capability.
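
A sketch of a query clause using this synthetic region (the domain value is hypothetical):

WHERE [region "OTEmailSender_OTDomain"] "example.com"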

_OTShadow
Regions ending with the string _OTShadow are created when the LIKE operator is
configured. If the Content Server region OTName is configured for use with LIKE,
then the region OTName_OTShadow contains the extended indexing information
required by the LIKE feature.

Regions and Content Server


The purpose of indexing metadata into regions is to simplify the user task of locating
information. The quality of the search experience therefore depends on which
Content Server metadata is indexed, and which regions the user queries when
looking for objects. There is a clear tradeoff here – more metadata regions require a
larger search index and higher hardware expenditures.
Content Server is a very flexible platform, supporting a wide range of possible
applications that may include Document Management, Records Management, Data
Archival, Web Site Management, Workflow applications, Litigation Support, and
many other OpenText, custom and 3rd party solutions. The choice of which Content
Server metadata should be added to the index is therefore an important decision.
When shipped, Content Server has a default configuration for metadata indexing.
For many applications, the default configuration is acceptable, and often indexes
more regions than necessary. On the other hand, some applications such as
eDiscovery may have a higher expectation of searchable metadata than the default.
Either way, it is strongly recommended that an assessment of Content Server
metadata indexing be undertaken as part of installing Content Server.
Although this document is focused on OTSE, the choice of metadata to be indexed is
very important. Hence, we will briefly touch on Content Server metadata topics, with
the understanding that you will need to look elsewhere for details.


MIME and File Types


There are several regions typically used to identify the type of a file or object. There
can sometimes be confusion around the purposes and differences of these regions.
The OTLLMIMEType region is basic Content Server “system” metadata. The intent is
that Content Server has set the MIME type, typically based on browser properties or
file name extensions when the document is added to Content Server.
The OTFilterMIMEType region is added by the Document Conversion Server during
indexing, and is based on an assessment of a document by format filter technology,
usually the OpenText Document Filters.
Perhaps the most useful standard region is OTFileType. This region is added by the
Document Conversion Server, but uses a combination of file format analysis, MIME
types, OTSubType and file format extensions to provide better coverage. More
importantly, OTFileType by default has values that are more user friendly, such as
“Microsoft Word” or “Adobe PDF”. The disadvantage is that OTFileType was introduced with Content Server 10.5, so indexes from older systems will need to be re-indexed to populate OTFileType values.

Extracted Document Properties


The Document Conversion Server (DCS) is responsible for extracting properties from
documents and transforming them into metadata regions prior to indexing. There are
a number of configuration settings that affect the number of document properties that
will be extracted.
The settings that have the biggest impact relate to extracted properties of Microsoft
Office documents, or EXIF / XMP data extracted from media files. In addition to a
number of standard pre-defined properties, users (or custom applications) have the
ability to add arbitrary properties to any document. If the DCS settings permit it, each
of these properties becomes a region in the search index. It is not uncommon for
customers with this feature enabled to have thousands of search regions defined this
way. These regions could represent a significant portion of the search index size and
memory requirements.
For new applications, the default DCS behavior is to extract a common subset of the
more useful standard properties for indexing, and discard the rest. This list of the
“useful” regions can be edited within DCS. Other configuration settings are available
to index all Microsoft Office document properties, or disable indexing any Microsoft
Office document properties, or to extract and index all EXIF/XMP metadata fields. Be
sure to review the DCS documentation for your version of Content Server, as the
control over extracted properties may vary based upon the version of Content Server
and the types of format filters being used.
Litigation support or eDiscovery applications may require all these regions to be
searchable. In these scenarios, you may also want to consider the use of
AGGREGATE-TEXT configuration in conjunction with DISK_RET storage modes to
make these values searchable with the minimum index sizing requirements.


NOTE: Legacy installations of Content Server often have indexing of Microsoft Office document properties enabled. You may wish to review these settings, and perhaps even remove some of the existing Microsoft Office document properties from your current index.

In the index, these types of regions are typically prefixed with OTDocXXXX or
OTXMP_XXXX. Be careful if you choose to remove these, since it is possible that
region names from other sources might match this naming convention. For example,
the Content Server ‘User Rating’ metadata fields OTDocSynopsis and
OTDocUserRating also have this form.

Workflow
Indexing of Workflow metadata from Content Server has been problematic
historically, but is considerably better since Content Server 10.0 Update 10.
Firstly, the default Workflow configuration indexes all the internal Workflow metadata
to the search engine. In most applications, many of these regions have no value for
user search. The default region definitions file has DROP or REMOVE instructions in
place to prevent this data from being indexed. If you need to make these metadata
fields searchable, edit the definitions file appropriately.

NOTE: Older Content Server systems defaulted to indexing all the Workflow metadata as text regions. You may wish to consider removing these regions or changing their type where possible.

The other aspect is Workflow Map attributes. These are presented as regions for
indexing in the form WFAttr_xxxx, where xxxx is text that represents the name of the
Workflow attribute. It is possible for a very large number of these WFAttr_ regions to
exist, especially in older versions of Content Server where the default setting was to
always index these regions. This increases the size of the index. If you do not need
to search on these fields, you might consider DROP or REMOVE in the definitions
file.
If searching the aggregate value of these fields is sufficient, you might also want to
consider using AGGREGATE-TEXT for queries against these regions, in conjunction
with DISK_RET for storing the values.
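
A sketch of that approach, using the wildcard aggregate syntax described earlier (the aggregate name is hypothetical):

AGGREGATE-TEXT WFAllAttributes WFAttr_*

with the individual WFAttr_ regions then listed for DISK_RET storage, as shown in the Retrieval Storage section.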

Categories and Attributes


For search indexing purposes, the metadata fields for Categories and Attributes are
presented to the Index Engines in the form Attr_1234567_12. Depending on the
Attribute type, this is sometimes also appended with an additional underscore
character and text.
Often, Category and Attribute data is comprised of defined values, which are
optimally represented within the search index as enumerated data types (ENUM
within the definitions file), or as integer values. If you want to optimize the search
index to minimize the memory consumed by metadata, you will need to modify the region definitions file and restart the search grid BEFORE these values are indexed.
Once indexed, they will be marked as type ‘TEXT’, and cannot be changed short of
removing the entire region and re-indexing the objects, or using the region type
conversion features.
This is an optimization consideration only. Leaving the Category and Attribute values
as TEXT within the index does not affect feature availability, although differences in
behavior between integer and text values may be a concern.
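
For example, a sketch of definitions file entries that declare attribute regions before they are first indexed (the region identifiers are hypothetical):

ENUM Attr_1234567_12
INT Attr_1234567_13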

Forms
Within Content Server 10, the Forms module permits users to create arbitrary labels
for form fields. The region names are generated directly from these labels.
Unfortunately, this can result in conflicts with other search regions in the index. It is
recommended that you enforce a business practice of prefixing all form names with a
unique value, such as OTForm_. This will provide two major benefits: it will minimize
the chance of name conflicts, and it allows use of AGGREGATE-TEXT regions to
improve search usability.
Content Server 10.5 or later will generate region names that follow a well defined
syntax, along the lines of OTForm_1234_5678. This change makes it much easier to
identify regions associated with forms, and simplifies selecting them for REMOVE or
aggregation purposes.

Custom Applications
It is common for OpenText customers to create their own solutions using Content
Server as a platform. Often, the considerations for metadata indexing and search are
overlooked. If you have custom applications that index metadata fields, you should consider the impact on search index size and performance; a sketch of suitable region definitions follows the list below.
• Only index object subtypes that are of interest to users
• Only extract metadata fields that are useful for search
• Ensure that the region definition file has optimal configuration for each region
• Provide a unique prefix so that the custom metadata will not conflict with
other region names
• If appropriate, add the custom regions to the default Content Server search
regions.
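
As a sketch of these recommendations in a region definitions file (the application prefix and region names are hypothetical):

ENUM MyApp_Status
INT MyApp_CaseNumber
TEXT MyApp_Notes maxValues=10 maxSize=50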

Default Search Settings


Content Server ships with a set of default search regions and default search
relevance ranking settings. Review these defaults against your application
requirements, and change them as appropriate. These settings really do have an
impact on relevance computation and object findability. Refer to the section which
describes the search relevance computation for more information.


Indexing and Query


There are two fundamental tasks for any search engine – put data into it, and
formulate queries to find it. This section explores how OTSE exposes these features.

Indexing
Updating the search index is performed by preparing files containing indexing
commands in a defined location. The input files and structures are in the OpenText
“IPool” format. The Update Distributor watches for these files, and initiates indexing
when IPools arrive.
A single IPool may contain many indexing commands and objects. Updates to the index from an IPool are only “committed” once all of the messages within the IPool are successfully handled. If either the Update Distributor or one of the Index Engines is unable to process a message, then the indexing process will halt, and all the changes from the IPool are rolled back when the Index Engines are restarted. This behavior applies to serious indexing IPool errors, such as malformed IPool messages. An object that is too large, for example, is not an IPool error.

When a serious indexing problem occurs, one or more elements of the indexing grid will have stopped with exceptions. The offending IPool needs to be removed from the input queue; otherwise the problem will simply recur when the indexing grid is restarted. On the 3rd restart/attempt, the offending IPool will be moved to quarantine.

If multiple partitions exist for an index, the Update Distributor chooses which partition will index an object. Some operations, such as Modify By Query, are broadcast to all the Index Engines. Most operations are specific to a single partition, and the first step in deciding which partition to use is to ask if any of the existing Index Engines already have an entry with the same object identifier (the “Key” value). If one of the Index Engines responds affirmatively, then the object is given to that Index Engine to add, modify or remove.
If no partition already has the object, the Update Distributor will make a selection based upon the Read-Write or Update-Only mode of the partitions, and whether they are full.
Partitions which are in “Update-Only” or “Retired” mode are never given new objects
to index. Partitions which are in “Read-Only” mode do not have Index Engines
running, and are not given any indexing tasks.

The order of processing is not guaranteed within an IPool. Placing multiple operations for the same object in a single IPool may generate unexpected results. For example, when multiple types of operations exist in a single IPool (adds, deletes and modifies), the Update Distributor may batch similar operations together to obtain performance improvements.

NOTE: As long as we are discussing IPools, some trivia: although IPools look very much like XML, they aren’t quite XML. IPool syntax evolved over the years at OpenText from earlier versions of our search technology - which were developed by a gentleman named Tim Bray, among others. Tim leveraged his OpenText search and SGML experience to later guide the specification of XML.

Indexing using IPools


Interchange Pools (IPools) are used for many purposes within Content Server, and
can contain many objects or operations. IPools are used as the mechanism for
providing input into the Update Distributor for indexing. The discussion of IPools in
this section is strictly limited to an overview of IPools for the purpose of indexing
objects.
IPools are not typically constructed directly by an application. OpenText provides
linkable libraries that provide utilities for reading and writing IPools. These libraries
are used by applications creating IPools, and also used through the Java Native
Interface (JNI) by the Update Distributor to read the IPools. However, when
diagnosing search indexing issues, a basic understanding of the IPool structures can
be useful.
An indexing object has the basic form shown below. Only a single object is
displayed, although an IPool may contain many objects. Note that within an IPool no
white space (new lines or indentation) is provided for formatting – it has been added
here for readability.

<Object>
<Entry>
<Key>OTURN</Key>
<Value>
<Size>16</Size>
<Raw>8273908620;ver=1</Raw>
</Value>
</Entry>
<Entry>
<Key>Operation</Key>
<Value>
<Size>12</Size>
<Raw>AddOrReplace</Raw>
</Value>
</Entry>
<Entry>
<Key>MetaData</Key>
<Value>
<Size>187</Size>
<Raw>

<FileName>/MyContentInstances/testhtml.html</FileName>
<ObjectTitle>Things that go bump</ObjectTitle>
<OTName>Cars</OTName>
<OTName lang="fr">Voitures</OTName>
<OTCurrentVersion>true</OTCurrentVersion>
</Raw>
</Value>
</Entry>
<Entry>
<Key>ContentReferenceTemp</Key>
<Value>
<Size>20</Size>
<Raw>C:/dev/testhtml.html</Raw>
</Value>
</Entry>
<Entry>
<Key>Content</Key>
<Value>
<Size>28</Size>
<Raw>full text to be indexed here</Raw>
</Value>
</Entry>
</Object>

The <Size> value reports the number of characters contained within a <Raw>
section. The <Raw> section contains the actual values. The <Raw> section can
contain arbitrary data expressed in UTF-8 encoding, and does not require character
escaping because the <Size> is known, although for metadata regions this data is
expected to be structured much like XML. The <Key> value specifies the top level
purpose for each entry, sometimes processed by DCS, sometimes by the Index
Engines. This object contains 5 entries – the OTURN, Operation, Metadata, and
content referenced in two different ways.
Every object to be indexed requires a unique identifier. For typical Content Server
applications, the unique identifier is provided in the region “OTURN”, as shown in this
example. The value for the OTURN is “8273908620;ver=1” – different Content
Server modules may provide OTURN values in different forms. Operations such as
ModifyByQuery would use a query “where clause” as the OTURN.
The Operation entry instructs the Index Engines how the object should be interpreted
as explained in the sections below.
The Metadata entry is used to provide the regions names and values that are
provided for indexing. In the example above, metadata for the regions FileName,
ObjectTitle, OTName and OTCurrentVersion are provided. You can specify multiple
values for one region. The OTName region, for example, has two values, and one of
them also uses the attribute key/value feature of OTSE to specify that “voitures” is
the French language value.
The entry for ContentReferenceTemp is used to identify that the content data is
located at the specified file location. The IPool libraries would normally delete the file
after processing, since by convention ContentReferenceTemp is used when a
temporary copy of a file was made. A permanent copy can also be specified using
ContentReference as the key, which does not delete the original. IPools given to the
Index Engines normally should NOT have either ContentReferenceTemp or
ContentReference entries, since extraction and preprocessing of files should already
have occurred to extract the raw text data. These modes are common for earlier
steps in the DCS process.
The entry for Content in the example indicates that the data in question is contained
within the IPool, in the <Raw> section. This is the normal expected use case for
IPools being consumed by the Update Distributor. Note that having both Content and
ContentReferenceTemp entries, as in this artificial example, is atypical.
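To make the structure concrete, the following minimal Python sketch assembles one
indexing object in the form shown above. It is illustrative only: real applications
use the OpenText IPool libraries, and the helper names here are hypothetical.

# Illustrative sketch only: real applications use the OpenText IPool
# libraries. Assembles one <Object> in the structure shown above;
# the helper names are hypothetical.

def ipool_entry(key, raw):
    # <Size> reports the number of characters in the <Raw> section.
    return ("<Entry><Key>%s</Key><Value><Size>%d</Size><Raw>%s</Raw>"
            "</Value></Entry>" % (key, len(raw), raw))

def ipool_object(oturn, operation, metadata, content):
    return ("<Object>"
            + ipool_entry("OTURN", oturn)
            + ipool_entry("Operation", operation)
            + ipool_entry("MetaData", metadata)
            + ipool_entry("Content", content)
            + "</Object>")

print(ipool_object("8273908620;ver=1", "AddOrReplace",
                   "<ObjectTitle>Things that go bump</ObjectTitle>",
                   "full text to be indexed here"))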

AddOrReplace
This is the primary indexing operation used to create new objects in the index. If the
object does not exist, it will be created. If an entry with the same OTURN exists in
either a Read-Write or Update-Only partition, then it will be completely replaced with
the new data, equivalent to a delete and add.
The AddOrReplace function distinguishes between content and metadata. If an
object already exists, and metadata only is provided, the existing full text content is
retained. However, the line between content and metadata is somewhat blurred.
The DCS processes will typically extract metadata from content and insert this
metadata into regions for indexing. There is a list of metadata regions which are
therefore considered to be “content”, and not replaced or deleted if content is not
provided in a replace operation.
The list of metadata considered to be content for this purpose is defined in the
[DataFlow_] section of the search.ini file by:

ExtraDCSRegionNames=OTSummary,OTHP,OTFilterMIMEType,
OTContentLanguage,OTConversionError,OTFileName,OTFileType
ExtraDCSStartsWithNames=OTDoc,OTCA.OTXMP_,OTCount_,OTMeta_
DCSStartsWithNameExemptions=OTDocumentUserComment,
OTDocumentUserExplanation
ExtrasWillOverride=false

The ExtrasWillOverride setting is used to disable this feature, which would cause the
regions to be deleted if content is not indexed in an AddOrReplace operation. The
DCSStartsWith entry is used to capture the dynamic regions that DCS extracts from
document properties.
The Exemptions list identifies regions that should not be treated as part of the full text
content, despite matching the DCS “starts with” pattern.
The AddOrReplace function can also trigger "rebalancing" operations. If the target
partition is Retired or has exceeded its rebalancing threshold, the Update Distributor
will instead delete the object from the partition where it currently resides, and redirect
the AddOrReplace operation to a partition with available space.

AddOrModify
The intended use of AddOrModify is to update selected metadata regions for an item
thought to already exist in the index. The AddOrModify function will update an
existing object, or create a new object if it does not already exist. When modifying an
existing object, only the provided content and metadata is updated. Any metadata
regions that already exist which are not specified in the AddOrModify command will
be left intact.
There is no mechanism to delete a region which has already been defined for an
object, but you can delete the values by providing an empty string as the value for
the region ("").
One potential downside of the AddOrModify operation is that if you selectively modify
metadata regions and the target object is not already correctly indexed, you will
create a new object that only has the metadata regions or content which was defined
in the modify operation. This will effectively create an object which only has partial
data indexed. If you provide all metadata region values in a modify operation, this
situation will not arise. New applications may want to consider using the
"ModifyByQuery" or "Modify" indexing operators instead of AddOrModify, which do not
create an object if it does not already exist.

If you have "Read-Only" partitions and attempt to modify an object
in a Read-Only partition, this will create a duplicate object. This
happens because Read-Only partitions do not have Index Engines
running. No Index Engine claims ownership of the object, so it is
assumed that the object does not exist, and it is created in another
partition.

Modify
The Modify operation is used to update specific metadata in an object. Unlike the
AddOrModify operation, Modify will never create a new object. If the OTURN
specified in a Modify operation does not exist, the transaction is simply discarded.
Modify can add new metadata, or replace existing metadata. Metadata for regions
not included in the IPool message are unaffected.

Delete
The Delete function will remove an object from the index, including both the metadata
and the content.
Note that if an object exists in multiple partitions, it will only be removed from the
partition to which the Update Distributor sent the Delete operation. This is a very rare
case, and would likely only arise if partitions were marked as Read-Only and updates
to objects in the Read-Only partition were then performed.

DeleteByQuery
The DeleteByQuery operator deletes objects which meet the provided search criteria.
A standard “WHERE” clause is provided in OTURN. This operator can be used to
delete many objects at once. Since the Update Distributor broadcasts the function to
all active partitions, duplicate objects can also be removed.
DeleteByQuery is particularly useful for applications that no longer track the
unique identifier for an object.

Some versions of Content Server have difficulty removing
Renditions from the search index, since the delete operation given
to the indexing system happens after the information about the
Rendition is removed from the Content Server database. Using
DeleteByQuery, these objects can still be deleted from the index
because they have a unique pattern which can be located with a
search.

Applications which need to perform bulk deletes on a project will also find this far
more efficient. Instead of issuing 25,432 individual delete requests, one for every
object in a project, a single DeleteByQuery operation with an OTURN of
[region "ProjectName"] "old project"
would delete all objects marked as belonging to the project in a single transaction.
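As a sketch of how such a request is expressed in an IPool (whitespace added for
readability, as in the earlier example), the object carries the "WHERE" clause in its
OTURN entry:

<Object>
<Entry>
<Key>OTURN</Key>
<Value>
<Size>36</Size>
<Raw>[region "ProjectName"] "old project"</Raw>
</Value>
</Entry>
<Entry>
<Key>Operation</Key>
<Value>
<Size>13</Size>
<Raw>DeleteByQuery</Raw>
</Value>
</Entry>
</Object>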

ModifyByQuery
This operation is used to selectively modify the content or specific metadata regions
for objects in the index. The affected objects are specified by search parameters – a
valid “WHERE” clause within the OTURN entry of the IPool. If no objects match the
query, then no updates are performed. Every object in the index which matches the
query will have the provided regions updated. Other regions for objects are not
affected; for example, you could change the value in the region “CurrentVersion” to
“false” without modifying values in other regions.
The Update Distributor will send ModifyByQuery operations to every active partition.
To modify a specific known object, you can place an object ID in the OTURN field:
[region "OTURN"] "ObjectID=1833746;ver=3"
You can also quickly perform bulk operations, such as marking all the objects
associated with a specific project as “released”. The IPool would contain region
values such as:
<ProjectStatus>released</ProjectStatus>
And the OTURN entry in the IPool would contain a "WHERE" clause such as:
[region "ProjectName"] "Great Scott"
All objects with the value of “Great Scott” in a region labeled “ProjectName” will then
have their ProjectStatus region populated with the value "released".
A value for a region cannot be completely removed, but it can be replaced with an
empty string by providing a region definition in the IPool that has an empty string:
<ProjectStatus></ProjectStatus>
The full text content of an object cannot be updated using ModifyByQuery.

Transactional Indexing
The indexing process with OTSE is transactional in nature. This essentially means
that the indexing request is not deleted until the index updates have been committed
to disk.
Transactional indexing ensures that no indexing requests are lost in the event of a
power loss or similar problem while indexing is taking place.
OTSE treats all of the indexing requests within an input IPool as a single transaction.
The input IPool is not considered complete until every request in the IPool is serviced
and committed to disk. Only then is the IPool deleted.
There are performance considerations related to transactional indexing. The more
objects there are within an IPool indexing transaction, the more efficient the indexing
process is. This is because a new index fragment is created each time a transaction
completes. Many objects in a transaction therefore generate fewer new index
fragments, and use the disk bandwidth more efficiently.
The trade-off is indexing latency. By collecting index updates and packaging
them into transactions, the average time for an object to be indexed on a
low-load system is somewhat longer.
to minimize the lag time between an object update and the moment the changes are
reflected in the index, so large numbers of objects in the indexing IPool is generally
the best approach.
OTSE does not collect objects to create transactions. The number of objects in a
transaction is set by the upstream applications which are generating the indexing
updates. By default, Content Server 16 will attempt to package up to 1000 objects
within a single indexing transaction.
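As an illustration of the batching trade-off, the Python sketch below packages
objects into transaction-sized IPools. The helper names are hypothetical; in
practice Content Server and the IPool libraries perform this packaging.

BATCH_SIZE = 1000  # matches the Content Server default described above

def feed(objects, write_ipool):
    # write_ipool is a hypothetical callback that writes one IPool;
    # each IPool written becomes a single indexing transaction.
    batch = []
    for obj in objects:
        batch.append(obj)
        if len(batch) >= BATCH_SIZE:
            write_ipool(batch)
            batch = []
    if batch:
        write_ipool(batch)
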
IPool Quarantine
In the event that an object in an IPool cannot be indexed because of severe errors,
the affected indexing component will halt. Upon restart, all of the indexing operations
for the IPool will be rolled back. Depending on the error code and configuration
settings, the Admin Server might automatically restart the component. If an IPool
fails in this way 3 times, it is moved into quarantine and the next IPool is processed.
The quarantine location is a sub-directory named \failure in the IPool input directory.
If there are too many quarantined items, the IPool libraries can be configured to
either halt or discard the oldest IPool. Quarantine behavior is a Content Server
configuration, not in OTSE.

Query Interface
Queries to OTSE are submitted to the Search Federator over a socket connection
using a language known as OpenText Search Query Language (OTSQL).
Applications communicating directly with the Search Federator will need to
understand and implement this wire-level protocol exposed by the Search Federator.
Content Server implements this protocol, as does the Admin Server component of
Content Server and the search client built into OTSE.
Connection to the Search Federator requires knowledge of the computer IP address
and the port number on which the Search Federator is listening, which is configurable
within the search.ini file. The search client establishes a basic text socket to
engage in a query conversation, a generic network capability available in most
programming languages. The OTSQL commands and
responses described here are conveyed across the socket connection.
A conversation with the Search Federator consists of opening a socket connection,
issuing commands, receiving responses, and closing the socket connection.
Managing the number of open connections can be important in optimizing the overall
resource use in OTSE. There are two settings: the number of queries that can be
simultaneously active (being serviced by the Search Engines); and the queue size
(maximum number of queries waiting for service). By default, the queue size is 25
and the active query limit is 10. When the queue is full, the Search Federator simply
does not accept any additional socket connections.
A typical query conversation between an application and the Search Federator is:

open socket connection
set parameters
select
set cursor
get results
get results
get facets
hh
get time
close socket connection

Responses from the Search Federator are expressed in a clear text data stream
which explicitly includes data size information to allow parsing values without needing
to escape special characters.
The available commands are described below. The commands themselves are not
case sensitive, although parameters to the commands such as region names may be
case sensitive.
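The Python sketch below shows the shape of such a conversation. The host and port
are hypothetical placeholders for the values configured in search.ini, and response
handling is deliberately simplified; a real client must parse the size-prefixed
framing described for the individual commands below.

import socket

# Sketch of an OTSQL conversation with the Search Federator.
HOST, PORT = "searchhost.example.com", 11777   # placeholders; see search.ini

with socket.create_connection((HOST, PORT)) as sock:
    f = sock.makefile("rw", encoding="utf-8", newline="\n")

    def send(command):
        f.write(command + "\n")
        f.flush()

    send('select "OTName" where [region "OTName"] "cars"')
    send("set cursor 1")
    send("get results 100")
    send("get time")
    # Responses arrive on the same socket; a real client reads and
    # parses each <OTResult> block before sending the next command.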

Select Command
The select command is used to initiate a query. This command is essentially the
OpenText “OTSTARTS” query language, which is described in more detail in the
OTSQL section of this document. The basic form is:

select SELECTLIST [FACETLIST] where QUERYTERMS [orderedby ORDER]
The SELECTLIST defines the metadata regions that should be retrieved in the
results. FACETLIST is optional and defines the facet information to be computed
during the query. QUERYTERMS contains the search regions, terms and operators,
such as

(([region "OTName"] stem "happy" AND [region "OTModifiedDate"]
range "20110101~20110201") OR "exact string in the content")
The ORDEREDBY portion is optional – the default is to order by computed relevance,
which will include the QUERYTERMS. However, additional terms can be added to
relevance scoring, or the ordering can specify sorting based upon other regions. Note
that queries will run faster without an “ORDEREDBY” clause. If you do not care
about the order in which results are presented from the search engine, omitting this
clause can improve query performance.
The select command responds with the current cursor location and a count of the
number of results that match the query:

<OTResult>
Cursor 0
DocSetSize 1012
</OTResult>

Set Cursor Command
This command is used to set the start location for getting results. By default, the
cursor position is set to 1 (first result) after a select operation. It is also advanced
automatically when you get results to point to the next result. If you want to retrieve
results starting at result number 100, use this command:

set cursor 100


Which responds with an acknowledgement and the current cursor location.

<OTResult>
cursor 100
</OTResult>
The cursor is automatically advanced after a get results command, which means
that use of set cursor between get results is optional if you are retrieving
consecutive sets of results. It should also be noted that moving the cursor forward is
relatively efficient. Moving the cursor backwards internally requires a reset to the
start of the results and moving forward to the desired location. If you are performing
multiple get results operations, structuring them to move strictly forward through
the results is much faster. This observation is only true within a search transaction
(between open and close operations), and has no impact on distinct queries.
There is an alternative method for managing the cursor location. The general form of
a query is:
Select … where … orderedby … starting at N for M

Where N is the number of the first desired result, and M is the number of results to
return in the Get Results command. The first result has a number of 0.
Select "OTObjectID" where "dogs" starting at 1000 for 250

Would return results number 1000 through 1249 when Get Results is called. This
method is not generally used or recommended, and is noted here for completeness.
Using Set Cursor with Get Results is the recommended usage pattern.
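As a sketch of the recommended pattern (send() is the helper from the earlier socket
example, stubbed here so the fragment runs standalone), consecutive pages are
retrieved strictly forward:

def send(command):
    # Stand-in for writing the command to the Search Federator socket.
    print(command)

PAGE_SIZE = 1000

send('select "OTObjectID" where "dogs"')
send("set cursor 1")
for _ in range(5):
    # The cursor advances automatically, so no set cursor command is
    # needed between consecutive get results calls.
    send("get results %d" % PAGE_SIZE)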

Get Results Command
This command is used to retrieve search results after a select command. The results
for a query are retained by the Search Engines until the socket connection is closed.

get results count


The parameter count is an integer, and represents the number of results that should
be returned. If there are not enough results to fulfill the count, it will return as many
as possible, and provide the actual number of results in the response.
The returned results are based upon the sort order specified in the select command,
which by default is ordered by computed relevance. Note that internally the
relevance computation is a floating point value, even though it may be reported in
OTScore as an integer. This means that even though the user may perceive a
relevance score of 59 for multiple objects, the Search Engines can discriminate
between results that have relevance scores of 0.58993 and 0.58991 and order them
accordingly.
The response to get results is a count of the actual number of results returned, along
with a structure that contains all the values specified in the SELECTLIST parameter
of the select command.
A typical response is of this form:

<OTResult>
ROWS 4
ROW 0
COLUMN 0 "OTObject"
DATA 25
DataId=41280133&Version=1DATA END
COLUMN 1 "OTName"
DATA 29
Approval Handilist Poothe.pdfDATA END
ROW 1
COLUMN 0
DATA 25
DataId=41280094&Version=1DATA END
COLUMN 1
DATA 18
P&L Jun to Nov.xlsDATA END
ROW 2
COLUMN 0
DATA 25
DataId=41280131&Version=1DATA END
COLUMN 1
DATA 0
DATA END
ROW 3
COLUMN 0
DATA 25
DataId=41280093&Version=1DATA END
COLUMN 1
DATA 10
Mar TB.XLSDATA END
</OTResult>
In this example, there are 4 results, indicated by the “ROW” values. ROW values are
numbered starting at 0.
Each result contains 2 returned regions, identified by the COLUMN values. In the first
ROW, the COLUMN labels are provided. To save bandwidth, the COLUMN values are
not labeled in subsequent ROWS.
The COLUMN values are numbered starting at 0, in the same order in which the
regions were requested in the SELECT statement for the query. Note that the
DataId= portion of the COLUMN 0 results is typical of how Content Server provides
the data for indexing; it is not an artifact of the search technology.
If a value is not defined for a region, the region is still returned in the results with an
empty value. ROW 2 COLUMN 1 illustrates this case.
If ATTRIBUTES were requested in the select statement, then the requested attribute
information will be appended to the get results data. In the example below, the data
element for the region “TestSplit” has 3 values. The first value had one attribute, the
language (English), the second has two attributes, and the third value has no
attributes – indicated by the empty placeholder.

COLUMN 1 "TestSplit"
DATA 33
<>Hello</><>Goodbye</><>vanish</>DATA END
ATTRIBUTES 59
<>language="en"</><>language="fr"
translated="true"</><></>ATTRIBUTES END
If HIT LOCATIONS were requested in the select statement, the locations are added
to the results:

COLUMN 1 "TestSplit"
DATA 33
<>Hello</><>Goodbye</><>vanish</>DATA END
ATTRIBUTES 59
<>language="en"</><>language="fr" translated="true"</>
<></>ATTRIBUTES END
LOCATIONS 17
0 4 6 1; 2 10 7 3 LOCATIONS END
Each group gives the cell number followed by a POSITION,LENGTH,TERM triplet. Here
the first cell (cell numbering starts at 0) has a hit at location 4, length 6,
matching term 1. The third cell (2) has a hit starting at character 10 with
length 7, matching query term 3.
If you are retrieving large numbers of search results, it can be more efficient to break
the operation into multiple get results operations. Typically, these “gulp” sizes are
optimal in the 500 to 2000 results range. The performance benefit of using an
optimal size is typically only about 10 percent, so this is not a critical adjustment.
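Because the DATA blocks are length-prefixed, values can be sliced out of a response
without any character escaping. A minimal Python sketch, assuming a response
captured as a single string in exactly the DATA framing shown above (a production
parser would track the ROW and COLUMN structure and read the raw stream directly):

def extract_values(response):
    # Pull each length-prefixed value out of a captured response.
    values, pos = [], 0
    while True:
        start = response.find("DATA ", pos)
        if start < 0:
            return values
        size_end = response.index("\n", start)
        size = int(response[start + 5:size_end])
        value_start = size_end + 1
        values.append(response[value_start:value_start + size])
        pos = value_start + size + len("DATA END")

print(extract_values('DATA 25\nDataId=41280133&Version=1DATA END\n'
                     'DATA 29\nApproval Handilist Poothe.pdfDATA END\n'))
# ['DataId=41280133&Version=1', 'Approval Handilist Poothe.pdf']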

Get Facets Command
If the SELECT command specified that facets should be computed, then a
subsequent GET FACETS command will retrieve the facets that were generated in
the query. There are no parameters; all the facet information that was requested is
returned. The response has the following form:
get facets
<OTResult>
ROWS 1
ROW 0
COLUMN 0 "RegionName","RegionType"
FACETS facetLength
nFacets{+},{keyLength,key,count;}FACETS END
{COLUMN n … FACETS END}
</OTResult>
The facets follow the general structure of other search results, and thus include the
ROW and COLUMN constructs. Only ROW 0 is used, with each facet set
represented within a COLUMN. Column numbers start at 0.
The COLUMN line includes the RegionName and RegionType. The RegionName is
the same as the name of the region for which a facet was requested in the SELECT
statement. The RegionType may be used by an application to optimize how the
facets should be interpreted. The RegionType will be one of:
Date
Integer
Text
UserLogin
UserName
Enum
FileSize
The next line contains the text FACETS with the facetLength value. This is the total
length of the string in bytes on the next line including the FACETS END statement.
The next line contains the actual facet data. The first integer, nFacets, is the number
of key/value pairs that are included in the facet results for this column. The key/value
pairs are represented by data triplets of keyLength, key and count. The key is the
text of the value. The count is an integer. The keyLength is the number of bytes in
the key – using a length simplifies parsing.
Note that there is a special case for nFacets, where it may be appended with a plus
(+) character. This indicates that building of the facet data structures terminated
because of size restrictions. This means that there are facet values in the index for
this region that have not been considered in computing these facet results.
The facet data is terminated with the FACETS END text.
A simple example of output from a get facets command is included below. Note the
special case where a facet has no values, as illustrated in the COLUMN 1 values.
get facets
<OTResult>
ROWS 1
ROW 0
COLUMN 0 "OTModifyDate","Date"
FACETS 45
3,9,d20120605,14;9,d20120528,4;9,d20120514,1;
FACETS END
COLUMN 1 "OTUserName","UserLogin"
FACETS 3
1,;FACETS END
</OTResult>
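A minimal Python sketch of decoding one facet data line, assuming the exact comma
and semicolon layout shown above (keyLength is a byte count; for the ASCII keys in
this example, bytes and characters coincide):

def parse_facet_line(line):
    head, rest = line.split(",", 1)
    truncated = head.endswith("+")        # '+' means the size limit was hit
    pairs = []
    for _ in range(int(head.rstrip("+"))):
        key_len, rest = rest.split(",", 1)
        key = rest[:int(key_len)]
        rest = rest[int(key_len) + 1:]    # skip the key and its comma
        count, rest = rest.split(";", 1)
        pairs.append((key, int(count)))
    return truncated, pairs

print(parse_facet_line("3,9,d20120605,14;9,d20120528,4;9,d20120514,1;"))
# (False, [('d20120605', 14), ('d20120528', 4), ('d20120514', 1)])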

Date Facets
Facets for regions that are defined as type DATE in the LLFieldDefinitions.txt file
have a special presentation in the facet results.
Each date value is placed into buckets representing days, weeks, months, quarters
and years. Instead of the most frequent values being returned in facets, the most
recent values are returned. For most search-based applications, the
“recentness” of an object is a key consideration, and the implementation of date
facets reflects this requirement.
A single date value may be represented in multiple buckets. For example, if today is
July 1st 2012, an object with an OTCreateDate of June 30 2012 may be represented
in the facet values for yesterday, for this week, for last month, last quarter and this
year. Each date bucket type has a distinct naming convention to help parsers
discriminate between the buckets.
• Years have the form y2012. Years are aligned to the calendar. The current year
will include dates from the start of the year to today.
• Quarters have the form q201204, which represent the year and the month in
which the quarter starts. Quarters start in January, April, July and October. The
current quarter will include dates from the start of the quarter to today.
• Months have the form m201206, which represent the year and the month. Month
facets are aligned to the calendar month. The current month will include dates
from the start of the month to today.
• Weeks have the form w20120624, which represents the year, month and first
day of the week. Weeks are always aligned to start on Sundays. The current
week will include dates from the start of the week to today.
• Days have the form d20120630, which represents the year, month and day.
If the contents of a date bucket are empty (count of zero), then no result is returned
for that bucket.
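The naming convention can be computed directly. A Python sketch, assuming the
Sunday-aligned weeks and calendar-aligned quarters described above:

import datetime

def date_buckets(d):
    # Bucket labels following the naming convention described above.
    quarter_start = ((d.month - 1) // 3) * 3 + 1   # Jan, Apr, Jul, Oct
    sunday = d - datetime.timedelta(days=(d.weekday() + 1) % 7)
    return ["y%04d" % d.year,
            "q%04d%02d" % (d.year, quarter_start),
            "m%04d%02d" % (d.year, d.month),
            "w" + sunday.strftime("%Y%m%d"),
            "d" + d.strftime("%Y%m%d")]

print(date_buckets(datetime.date(2012, 6, 30)))
# ['y2012', 'q201204', 'm201206', 'w20120624', 'd20120630']
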
Refer to the FACETS portion of the SELECT statement for information on requesting
the number of facet values for each of years, quarters, months, weeks and days.

FileSize Facets
The search.ini file can be used to identify integer or long regions that should be
treated as FileSize facets. Size facets are optimized for values that represent file
sizes. Clearly, discrete file size facets are useless. File sizes have the property that
they range from 0 to Gigabytes, but are psychologically thought of in geometric sizes.
The FileSize facet places integers into ranges that follow this geometric pattern. The
entire set of sizes is returned, rather than the most frequent counts for facets.
Applications presenting facets may choose to combine these ranges into larger
ranges.
The buckets for FileSize facets and the corresponding labels for those buckets are
captured in the table below:

Label      Integer Range

0b         0
1b         1
2b         2 to 4
5b         5 to 9
10b        10 to 19
20b        20 to 49
50b        50 to 99
100b       100 to 199
200b       200 to 499
500b       500 to 999
1k         1,000 to 1,999
2k         2,000 to 4,999
5k         5,000 to 9,999
10k        10,000 to 19,999
20k        20,000 to 49,999
50k        50,000 to 99,999
100k       100,000 to 199,999
200k       200,000 to 499,999
500k       500,000 to 999,999
1m         1,000,000 to 1,999,999
2m         2,000,000 to 4,999,999
5m         5,000,000 to 9,999,999
10m        10,000,000 to 19,999,999
20m        20,000,000 to 49,999,999
50m        50,000,000 to 99,999,999
100m       100,000,000 to 199,999,999
200m       200,000,000 to 499,999,999
500m       500,000,000 to 999,999,999
1g         1,000,000,000 to 1,999,999,999
2g         2,000,000,000 to 4,999,999,999
5g         5,000,000,000 to 9,999,999,999
10g        10,000,000,000 to 19,999,999,999
20g        20,000,000,000 to 49,999,999,999
50g        50,000,000,000 to 99,999,999,999
100g       100,000,000,000 to 199,999,999,999
big        >= 200,000,000,000
negative   < 0
undefined  No value for field

The list of integer regions to be presented as FileSize facets is within the search.ini
file in the [Dataflow_] section. The default regions shown here are tailored for typical
Content Server installations:
GeometricFacetRegionsCSL=OTDataSize,OTObjectSize,FileSize
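A Python sketch of the 1-2-5 geometric bucketing, mirroring the table above:

def filesize_label(size):
    # Map an integer to its FileSize facet label per the table above.
    if size is None:
        return "undefined"
    if size < 0:
        return "negative"
    if size >= 200_000_000_000:
        return "big"
    if size == 0:
        return "0b"
    best = 1
    for base in (1, 1_000, 1_000_000, 1_000_000_000):
        for mult in (1, 10, 100):
            for step in (1, 2, 5):
                if step * mult * base <= size:
                    best = step * mult * base
    for base, suffix in ((1_000_000_000, "g"), (1_000_000, "m"),
                         (1_000, "k"), (1, "b")):
        if best >= base:
            return "%d%s" % (best // base, suffix)

print([filesize_label(n) for n in (0, 7, 1500, 2_500_000)])
# ['0b', '5b', '1k', '2m']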

Expand Command
This command is used to determine the list of words that are used in a search query
for a given term expansion operation. Term expansions occur when features such as
stemming, regular expressions or a thesaurus are used in a term. The simple case
of stemming to match boat and boats is illustrated below.

> expand stem "boat"


<OTResult>
ROWS 2
ROW 0
COLUMN 0 "Data"
DATA 4
boatDATA END
ROW 1
COLUMN 0 "Data"
DATA 5
boatsDATA END
</OTResult>
The following operator examples also work:

> expand thesaurus "boat"
> expand regex "^boat.*"
> expand phonetic "boat"
> expand range "sa~sc"
> expand < "apples"
Some of these cases can generate a very large number of matches. For regular
expressions or left-truncation this operation is potentially very slow, and should be
used judiciously. It is possible to limit the result set by appending the maximum
number of desired results to the expand operator within square brackets. The default
limit is 100 terms; the example below limits the result to 5 terms.

> expand[5] thesaurus "boat"


One possible application of the expand operation is to establish which terms should
be provided to the hit highlighting function.

Hit Highlight Command
The hh command is used to identify the characters within text that match the search
query. This is used by applications displaying search results that want to emphasize
the text that matches the query. The hh command is passed a block of text to be
analyzed and a list of terms to match. The output from hh is a list of start and end
positions of characters to be highlighted in the target text.
In the basic form, the hh command sequence has the following form:
> HH
> DATA 61
> The <B>rain</B> in <Tag>Spain</Tag> falls mainly on the
plain
> TERMS 2
> the
> spain falls

<OTResult>
HITS 3
0,3,0
52,3,0
24,17,1
</OTResult>
After the TERMS element, each keyword to be matched is entered on a separate line.
If there are multiple words in the line, it is considered to be a phrase to be matched.
This example requests hit highlighting for the terms “the” and “spain falls”.
The results consist of numeric triplets, where each triplet is of the form
POSITION,LENGTH,TERM. Both the position and the term numbering start at 0.
The hit highlighting code strips common HTML formatting characters out of the data.
In this example, the </Tag> is ignored when matching the phrase “spain falls”,
although these formatting tags are counted in the character positions.
You may need to use the EXPAND command to obtain a list of terms that should be
tested in hit highlighting.
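A Python sketch that applies the returned triplets to the submitted text (positions
count characters in the text exactly as submitted, including any markup):

def apply_hits(text, hits):
    # hits: (POSITION, LENGTH, TERM) triplets from the HH response.
    # Applied right to left so earlier positions remain valid.
    for pos, length, _term in sorted(hits, reverse=True):
        text = (text[:pos] + "<b>" + text[pos:pos + length] + "</b>"
                + text[pos + length:])
    return text

sample = "The <B>rain</B> in <Tag>Spain</Tag> falls mainly on the plain"
print(apply_hits(sample, [(0, 3, 0), (52, 3, 0), (24, 17, 1)]))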

Get Time
While a query is executing, detailed timing information for each element of the query
is tracked. The Get Time command will return this data, including total time, wait
time, execution time, and execution time broken down by each command execution
within the connection. To obtain accurate information about the entire search query,
this should be the last command executed before closing the connection.
<OTResult>
<TIME>
<ELAPSED>68638</ELAPSED>
<SELECT>21329</SELECT>
<GET RESULTS>610</GET RESULTS>
<GET FACETS>187</GET FACETS>
<HH>0</HH>
<GET STATS>31</GET STATS>
<EXECUTION>22157</EXECUTION>
<WAIT>46481</WAIT>
</TIME>
</OTResult>

Set Command
The set command is used to specify values for variables that apply to the subsequent
operations. The supported set operations include:

Set lexicon English
Set thesaurus Spanish
Set uniqueids true [maxNum]
The lexicon variable specifies the language preference for stemming. The thesaurus
variable identifies which thesaurus file should be used.
Set uniqueids true requests that the Search Federator remove duplicate results from
multiple Search Engines. The optional maxNum parameter is the upper limit on
performing de-duplication. If there are more results than maxNum, de-duplication
does not occur. De-duplication is generally not recommended, since it can negatively
impact query performance and increases the memory used by the Search Federator.
Duplicates of objects may exist if a partition was placed in read-only mode, and
subsequent attempts are made to modify an object managed by the read-only
partition. This causes a new instance of the object to be created in a read-write
partition. De-duplication is a method of last resort if you have misused the
read-only mode for partitions.
The Set Lexicon and Set Thesaurus commands are usually the first operations in a
handshaking sequence for a search query. If one or more search engines are
unavailable, the return message is:

MESSAGE 2 401 "Search engine(s) not ready."
This can be used as a convenience for an application to try another Search
Federator in environments that wish to support automated failover for high
availability. This does not apply if RMI is being used between the Search Federator
and Search Engines.

Get Regions Command
This command is not typically used in a search query. Instead it is used by an
application to discover the list of regions that exist in the search index. The first row
represents the titles for the columns in the result. Column 0 is the name of the
region; column 1 is labeled "Description" for historical compatibility reasons, but the
data returned in this column is always empty. After the title row, there will be one row
for every region defined in the index.
get regions

<OTResult>
ROWS 218
ROW 0
COLUMN 0 "Name"
DATA 18
OTWFMapTaskDueDateDATA END
COLUMN 1 "Description"
DATA 0
DATA END
ROW 1
COLUMN 0
DATA 17
PHYSOBJDefaultLocDATA END
COLUMN 1
DATA 0
DATA END
ROW 2
COLUMN 0
DATA 16
OTWFSubWorkMapIDDATA END
COLUMN 1
DATA 0
DATA END

</OTResult>
The Get Regions command can take an optional parameter, “types”.
get regions types
When the types parameter is present, this function will include the type definition for
the region in the response. This type definition can be used to provide optimized
interfaces for users (for example, integer comparisons instead of text modifiers). If
multiple partitions report different types, then the Search Federator will respond with
the value “inconsistent” as the type. Note that differences in region types for partitions
in Retired mode are allowed; the assessment of inconsistency is based only on
partitions that are not Retired. The possible types are: Integer, Long, Enum, Date,
Text, Boolean, Timestamp.
<OTResult>
ROWS 218
ROW 0
COLUMN 0 "Name"
DATA 18
OTWFMapTaskDueDateDATA END
COLUMN 1 "RegionType"
DATA 4
DateDATA END
COLUMN 2 "Description"
DATA 0
DATA END
ROW 1
COLUMN 0
DATA 17
PHYSOBJDefaultLocDATA END
COLUMN 1
DATA 4
EnumDATA END
COLUMN 2
DATA 0
DATA END

</OTResult>

Another optional parameter is facets:

get regions facets

When the facets parameter is present, then the type definition of generated facets is
included in the response. Normally, the facet types are the same as the region types,
but the special handling of integers that represent file sizes is an exception, returning
the value ‘FileSize’.
get regions types facets

<OTResult>

ROW 98
COLUMN 0
DATA 12
OTObjectSizeDATA END
COLUMN 1
DATA 4
LongDATA END
COLUMN 2
DATA 8
FileSizeDATA END
COLUMN 3
DATA 0
DATA END

</OTResult>

OTSQL Query Language
The SELECT command supported by the Query Interface implements the OpenText
Search Query Language, also known as OTSQL. Within this language, a query
consists of a number of basic parts, all contained on a single line:

SELECT parameters
FACETS parameters
WHERE clauses
ORDEREDBY parameters
Content Server users do not directly use OTSQL. The Content Server search query
language is known as LQL (historically, the Livelink Query Language). LQL is similar
to OTSQL in most respects, but provides some convenience operators and generally
uses different keywords. LQL in Content Server represents only the subset of
OTSQL that defines the WHERE clauses. Some of the differences between LQL and
OTSQL include:

LQL                   OTSQL
termset               termset
stemset               stemset
near, qlnear          prox[10,f]
qlprox                prox
term*                 right-truncation term
*term                 left-truncation term
t*er?                 regex ^t.+er.?$
qlregion              region
qlleft-truncation     left-truncation
qlright-truncation    right-truncation
qlthesaurus           thesaurus
qlstem                stem
qlphonetic            phonetic
qlregex               regex
qlrange               range
qllike                like
in                    in
any                   any
text                  text
” « » ‟ ″ “ „ ″       "

SELECT Syntax
The SELECT section is used to specify which regions in the index should be included
in the returned results. The more regions that are requested, the longer the ‘get
results' operations will take, but this does not impact the query time.
SELECT "region1","region2","region3"
To return all of the regions use the * keyword. For a Content Server installation, this
is not recommended, since there may be hundreds of regions. Requesting the
minimum necessary regions is suggested for optimal performance.
If you want to return information about the key/value attributes within text regions,
you can use the ATTRIBUTES modifiers:

SELECT "OTName","OTObject" WITH ALL ATTRIBUTES
SELECT "OTName","OTObject" WITH ATTRIBUTE "lang"
SELECT "OTName","OTObject" WITH ATTRIBUTES "lang" "color"
When attributes are requested, the response in the get results command is modified
to append the attribute information (see the “get results” description for more
information). The primary usage for requesting attributes is to identify language tags
attached to values in multi-language applications. The attributes modifier is applied
to all the regions specified in the select list.
The select statement can also be modified to request hit locations within the results:

SELECT "OTName","OTObject" WITH HIT LOCATIONS


When requested, the hit locations will be appended to the get results response, with
ordered triplets indicating the query term hit character position, length and term which
matched. The hit locations will be returned for all selected regions when requested.
You can request both hit locations and attributes in a single select statement.

SELECT "OTName","OTObject" WITH HIT LOCATIONS WITH ATTRIBUTE "lang"

FACETS Statement
The FACETS section specifies whether facets are desired, and if so, for which
regions. This is optional, with the default being no facets returned. Refer to the next
major section of this document entitled “Facets” for a complete description of the
FACETS statement.
Sample facet requests:
FACETS "regionX"[10],"regionY"
FACETS "OTCreateDate"[d100,m24]

The ‘get facets’ command is used to retrieve the results. See the commands section
for additional details.

WHERE Clause
The WHERE clause defines the rules by which an object satisfies the search query.
The basic form is:

WHERE <clause1> relationship <clause2> relationship <clause3>
A query determines which objects satisfy the search by means of search clauses. A
WHERE clause consists of a region, an operator and a term, although only the term
is mandatory.
The following are simple WHERE clauses:

where "red"
where "red riding hood"
where [region "name"] "red riding hood"
where [region "FileSize"] >= "1000" and [region "FileSize"]
< "10000"

WHERE Relationships
Each WHERE clause in a query is evaluated relative to other WHERE clauses by a
logical relationship. The supported relationships are:

AND                   Requires both the left and right expression.

AND-NOT               Requires the left expression be true but the right
                      expression be false.

OR                    Requires that either the left expression or the right
                      expression (or both) are satisfied.

XOR                   The exclusive or operator requires that either the left
                      expression or the right expression is satisfied, but not
                      both.

SOR                   The synonym OR operator matches terms in the same way
                      that the OR operator does, but the way the relevance
                      score is computed is somewhat different. In an OR
                      operation, if both terms are satisfied, they both
                      contribute to the relevance score. With a SOR operation,
                      only the term with the highest contribution is added to
                      the relevance.

PROX[distance,order]  The proximity operator is an "AND" operation which
PROX[10]              requires that the left and right expressions be within
PROX                  "distance" words of each other. If order is present (T
PROX[50,T]            for true, F for false), the left expression must also
                      precede the right expression. If no parameters are
                      specified, the interpretation is PROX[10,F].
                      The PROX operator ONLY works with simple terms and
                      phrases. It does not work in conjunction with expanded
                      term sets (wildcards, regular expressions, stemming,
                      etc.).
                      Refer to the SPAN operator for more advanced proximity
                      options.

Relationships are evaluated from left to right. Brackets can be used to clarify and
modify the order of evaluation of clauses. For example, using single letters a through
d to represent entire clauses:

where a or b and c and-not d


Is interpreted by OTSE as:

where (((a or b) and c) and-not d)


Brackets can be used to change the order of evaluation:

where a or ((b and c) and-not d)


In an actual query, this might look something like:

where thesaurus "pyjamas" or (([region "color"] "pink" and
[region "pattern"] = "polka dots") and-not [region "theme"] stem "boxers")

WHERE Terms
The search terms in a WHERE clause should normally be enclosed in quotes.
Although there are some specific cases where the lack of quotes is tolerated, if you
are writing a query application, quotes are recommended in all cases.
The first form of a search term is the simple token. This is a value which is normally
expected to pass through the tokenizer and be recognized in its entirety as a single
token. All operators work on simple terms.
"hello"
"pottery123"
"3.1415926"
The second form is an exact phrase. Not all operators are compatible with phrases.
Phrases should normally only be used in string comparison operations.
"the quick brown fox"
"1334.8556/995-x"
You can also request that matches are only returned when the entire value is an
exact match for the phrase. For example, if there is a search region “ProjectName”,
and possible values are “Plan A” and “Plan A Extended”, searching for “Plan A” will
match both of these cases. Preceding the phrase with an equality operator ( = ) can
differentiate these, and match only the values that do not include the “Extended”
term:
[region "ProjectName"] = "Plan A"
Finally, there is a special case for search terms, the * character (asterisk or star) or
the keyword all, with no quotation marks. This value is interpreted by the search
engine to match any object which has a value for the specified region. This will not
match objects if the region does not have a value defined for an object.
[region "name"] *
[region "name"] all

WHERE Operators
Each WHERE clause is comprised of a region specification, a comparison operation,
and a term. The region is optional, and if missing is assumed to be the default
search region list. The operation is optional, and if absent is assumed to match any
token within the region.
The following operators function with either simple tokens or phrases:

(none)   This is the default operation, where no operator is explicitly
         provided. Matches any value within the region. For example, a
         query for "York" will match a value of "New York".
=        Use of the equality operator will only match if the entire value
         is identical to the term provided. "York" will not match "New
         York" but a query for "New York" will.
!=       Will match all values which exist and do not exactly match the
         term.

The next set of operators is available for use with integers, dates and text metadata
values. They are disabled by default for full text query, since comparison queries in
full text are generally misleading and perform very slowly, although this behavior can
be changed by setting AllowFullTextComparison=true in the search.ini file.
These operators also have special capabilities for Date regions described later.

< Will match all values which exist and are less
than the specified term. If a phrase is
provided, only the first term in the phrase is
used.
<= Will match all values which exist and are less
than or equal to the specified term. If a
phrase is provided, only the first term in the
phrase is used.
> Will match all values which exist and are
greater than the specified term. If a phrase is
provided, only the first term in the phrase is
used.
>= Will match all values which exist and are
greater than or equal to the specified term. If
a phrase is provided, only the first term in the
phrase is used.

Constructing a query of the form

[region "x"] > "20150621" and [region "x"] < "20160101"

is not efficient. To improve performance, the query syntax parser will attempt to
identify usage patterns where multiple comparisons are made to a single region, and
convert it to the more efficient form of
[region "x"] range "20150621~20160101"

The following operators are designed for use with single tokens, not phrases. Some
limited phrase support is available with some of the operators as noted in the
explanations.
range "start~to" Will match any value between the start term
and the end term, inclusive. Note that the
start term must be less than the end term.
range "value1|value2|value3" The range operator can be provided with a
list of terms or phrases. This is equivalent to
value1 OR value2 OR value3. This operator
matches any value in a region; it is not
restricted to matching entire values.
thesaurus Will match the exact term or synonyms for
the term using the currently defined
thesaurus.
phonetic Will match phonetic equivalents for the term.
If applied to a phrase, phonetic matching for
each word in the phrase will be performed.
Refer to the Phonetic matching section for
more information.
regex Will interpret the term as a regular
expression. Values which satisfy the regular
expression match the term. Regular
expressions apply only to a single token.
Regular expressions are more fully described
later.
stem Will match values that meet the stemming
rules. Refer to the Stemming section for
more information. If stemming is applied to a
phrase, then the last word in the phrase is
stemmed.
right-truncation Right truncation matches terms which begin
with the provided search term. The user
would typically consider this as term*. If
used with a phrase, then the truncation is
applied to the last word in the phrase.
left-truncation Left truncation matches terms which end with
the provided search term. The user would
typically consider this to be of the form *term.
This operator is valid only for single tokens.
like String matching optimized for part number
and file names. Only valid with “Likable”
regions.
any (term,"search phrase") Match any term or phrase in the list. Unlike
the IN operator, partial matches within a
metadata region are acceptable. Equivalent
to (term SOR "search phrase").
in (term, "search phrase") Match any term or phrase in the list. Within a
region, only matches complete values.
Equivalent to (=term SOR ="search phrase").
not in (term, "search phrase") Excludes any objects containing the term or
phrase. For regions, equivalent to (and-not
[region "xx"] in (term,"search phrase")).
termset (N, term, term, "search phrase")
Matches objects where full text contains N or
more of the terms and phrases. N% may also be
used.
stemset (N, term, term, "search phrase")
Matches objects where full text contains N or
more of the stems (singular/plural) of the
terms and phrases. N% may also be used.
text (something to search)
For large blocks of text, finds objects with
similar common terms. Check Advanced
Concepts section for more details.
span (distance, query)
Match query within distance number of terms.

NOTE: the behavior of comparison operations depends upon the type
definition of the region. Text string comparisons use a text sort,
so that 2000 > 1000000 for values stored in a text region.

The following examples illustrate usage of WHERE operators:


<= "100"
stem "flower"
range "250~300"
range "alice|bob|carol|dave"
left-truncation "ntext"
right-truncation "opent"
= "my fair lady"
in (car,auto,suv,"sport utility vehicle")
any (house,home, "place of residence")
text (there must be documents with similar information)
span (5, swamp and (gas or methane))

Proximity - prox operator
A common requirement is to find search terms that are near one another. The PROX
operator provides an easy way to locate two terms within a specified distance, with
optional ordering. For example
big prox[3,t] truck

Will match "big truck" or "big red truck" but not "truck is big". The second
parameter is a single letter indicating whether order must match: use 't' (true)
or 'f' (false). In the example above, using 'f' would match "truck is big".

Proximity - span operator
Many proximity requirements are complex, especially for discovery and privacy
applications. Consider searching for “Michael Smith”.
• Michael may also be known as Mike.
• His middle names are James T., but the middle name or initial is optional in the text.
• The last name was given verbally, and might have been Smithe, Smit, or Smyth.
The “span” operator allows more complex queries to be evaluated and tested to
ensure that the entire query falls within a defined number of search terms. You can
thus construct a search of the form:
span(4, (michael or mike) and (smith or smithe or smit or
smyth))

The first parameter of the span operator is the maximum distance between terms that
will satisfy the query. These fragments would meet the distance of 4 requirement:
Mike smith
A smith named Michael
Michael Herbert James Smit

This would not:


Mike never met Bob Smith

The span operator supports query fragments for any combination of AND, OR, and
nesting (brackets) for single search terms.
“space” and span(10, ((Yellow and sun) or (blue and moon))
and (earth or planet))

The span operator can be used with full text, but not with text metadata.
A span query is a relatively expensive operation and can be very expensive when
used with wildcards (left-truncation and right-truncation) or regular expressions. By
default, the engine is configured to disable support for these types of term
expansions within the span operator. If term expansion is enabled, the search
engines will store temporary working data on disk files during the evaluation of the
span. Temporary files are stored by each Search Engine in their corresponding
index\tmp directory, and files are named matchingWordsNNNNN and
spanValuesNNNNN, where NNNNN is a dynamically generated unique value. The
temporary files are deleted when the query completes, and also by the general
purpose cleanup thread which runs from time to time.
If abused, the span operator has the potential to require large amounts of disk space
and will take a long time to execute. There are a number of limits set by default in
the search.ini configuration file, which can be adjusted if more complex queries must
be run. When a limit is reached, the search will be terminated as unsuccessful. The
limits apply to a single partition (not the entire query for the entire index) and are
located in the [Dataflow_] section of the configuration file, with the defaults shown
below.

SpanScanning=false
By default, use of term expansion (regex and wildcards) is not permitted with the
span operator. Set true to enable.

SpanMaxNumOfWords=20000
The upper limit on the number of terms that will be considered when wildcards and
regular expressions are expanded.

SpanMaxNumOfOffsets=1000000
Each term in the span expression may exist multiple times in documents. This file
stores the locations of the terms being evaluated. This is the upper limit for the
number of instances of matching terms.

SpanMaxTmpDirSizeInMB=1000
Limits the temporary disk space the partition can use for storing temporary data
during span operation evaluation.

SpanDiskModeSizeOfOr=30
The cost of executing a span is directly related to the number of “OR” operations in
the span query. This setting is an upper limit on the number of “OR” Boolean
operators that can be assessed.

Proximity - practical considerations
When using the prox or span operators, you may need to increase the distance to
accommodate pattern and tokenizer behavior. Keep in mind that the distance is
measured internally in the search engine by “tokens”, not by words.
In addition, if pattern insertion features of the Document Conversion Server are
enabled, unique tokens will be inserted into the full text at locations where phone
numbers, email addresses, hash tags or other items are detected.
Both the tokenization and pattern behaviors can increase the distance between
words. As a result, adding a small additional distance to the prox and span operators
may be needed to capture all the expected results.

WHERE Regions
A region is specified within square brackets with a region keyword, and enclosed in
quotation marks. The search term is likewise enclosed in quotation marks. There
are specific cases which are unambiguous and quotation marks are not required, but
for consistency your application should always use quotation marks. Region names
are case sensitive!
If the region portion of a WHERE clause is absent then the default search list is used
to determine the regions.
The following are examples of WHERE clauses using regions:
[region "OTNAME"] "cars"
[region "OTNAME"] all
[region "OTDate"] > "20100602"
[region "abc"] <= "string1"
Regions are grouped by OTSE into content and metadata regions, which are
internally represented by OTData and OTMeta. The representation of the “OTNAME”
in the example above is actually an abbreviated form of:
[region "OTMeta":"OTNAME"]
You can use OTMeta without a region name to examine all of the metadata regions.
However, this is relatively slow (depending on the number of regions) and in many
cases is not logical because of the different type definitions for regions.
You can also use OTMeta with some surrounding syntax to search within metadata
regions. For example, the clause:
[region "OTMeta"] "<someRegion>123 ABC</someRegion>"
Will find the exact value ‘123 ABC’ within the region “someRegion”. This is a much
slower way to locate the value, but there may be special cases where matching a
phrase anchored to the start or end of a region is needed.
You can specify searching in the full text using the OTData region:
[region "OTData"] "looking for this"
If you have indexed XML content, you can also search within specific XML regions of
the full text content using the XML structure; refer to the section on indexing XML data
for more information.
The WHERE clause can also be used to set restrictions on attribute/value tags for
text metadata. For example, to restrict a search to looking at French language
values of the OTName field, you might use the syntax:
[region "OTName"][attribute "lang"="fr"] "voiture"
This presumes that “lang” is the attribute name, and “fr” is the value for that attribute.
Multiple attribute fields are possible, which effectively operates as a Boolean “and”,
requiring that both attributes must match:
[region "OTName"][attribute "lang"="fr"][attribute
"size"="med"] all

Priority Region Chains
Certain types of search queries are very difficult to construct using Boolean
operations. In particular, OTSE supports a prioritized region evaluation method for
use with similar sparse regions. Consider document "creation" dates. A Content
Server object may or may not have dates from several possible sources: the source
(disk) creation date, the source (disk) modified date, an extracted date from
Microsoft Office document properties, and the date the object was added to Content
Server. If the source create date is defined, it is the best quality information and
should be used in evaluating the query. If it is not defined, then the source
modified date should be used, if defined. The next most reliable date is the
Microsoft Office property and, as a last resort, the Content Server date is used
only if none of the other date values exist for an object.
These priority chains of related metadata regions can be easily specified using the
“first” region declaration in a WHERE clause. For example, to find all objects with the
“best” date earlier than 5 years ago…
[first "OTExternalCreateDate", "OTExternalModifyDate",
"OTDocCreatedDate", "OTCreateDate"] < "-5y"

This syntax can be used to dynamically define the regions and their priority as part of
the query. However, this approach does not allow the value that matched the query
to be returned. If retrieval of the priority value is necessary, then a synthetic region
declaration must be made in the LLFieldDefinitions.txt file:
CHAIN GoodDate OTExternalCreateDate OTExternalModifyDate
OTDocCreatedDate OTCreateDate

A query can then be made using the pre-defined date, and the GoodDate field can
also be returned as a target of the SELECT:
[region "GoodDate"] < "-5y"

For those interested in trying to construct the equivalent query using standard
Boolean operators, an example is shown below. Note that using the 'first' feature is
not only more convenient, but the implementation is more efficient. Internally, a new
operator performs the necessary logic with fewer operations; it is not simply
converted to this Boolean equivalent:
[region "OTExternalCreateDate"] < "-5y" or ([region
"OTExternalCreateDate"] != all and ([region
"OTExternalModifyDate"] < "-5y" or ([region
"OTExternalModifyDate"] != all and ([region
"OTDocCreatedDate"] < "-5y" or ([region "OTDocCreatedDate"]
!= all and ([region "OTCreateDate"] < "-5y"))))))

The ‘first’ region method can be used with all region types and most operators.
However, search within a specific text metadata attribute value with the CHAIN / first
operator is not supported.
Minimum and Maximum Regions
Similar to the use of region chains, the search engine can be instructed to evaluate
an object based upon the minimum or maximum value of a set of regions. These can
be dynamically constructed as part of the query, as illustrated here:
[max "OTExternalCreateDate", "OTExternalModifyDate", "OTDocCreatedDate", "OTCreateDate"] < "-5y"
[min "Attr1", "Attr2", "Attr3"] = "6"
The min and max operators skip assessment of regions for which an object has no
value. For example, if an object had only Attr2 defined in the example above, then
the value of Attr2 would automatically be used as the minimum value. If none of the
regions has a value, the object does not match.
Min and max region assessments work for all data types, although not all operations
are supported. Supported operations include comparisons against a value (<,=, >,
etc.), basic term and phrase matching, IN, ranges, etc. However, operators that
expand to multiple elements are not available, such as termset, stemset, thesaurus,
wildcards and regular expressions.
For multi-value TEXT metadata regions, the smallest value in a set of values for a
region will be used when assessing a minimum region, and the largest value will be
used when assessing a maximum region.
In addition to specifying ad-hoc minimum and maximum region evaluations in a
query, a synthetic region may be defined as a convenience using the
LLFieldDefinitions.txt file:
MIN SmallAttr Attr1 Attr2 Attr3
MAX BigDate OTExternalCreateDate OTExternalModifyDate OTDocCreatedDate
A query could then be constructed using the predefined region:
[region "SmallAttr"] = "6"
A predefined region has the additional property that the tested value can also be
returned in a SELECT statement. Note that no additional storage or indexes are
created; this region definition is a directive to the query constructor. Both the
dynamic and predefined approaches execute identically.
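For instance, a query returning the tested minimum might look like this (a sketch
using the synthetic region defined above):
SELECT "OTObject", "SmallAttr" WHERE [region "SmallAttr"] = "6"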
As a point of interest, it is usually possible to construct an equivalent query using
standard Boolean logic, although the min and max forms are computationally more
efficient. The equivalent query is quite complex, and varies depending on the nature
of the comparison (greater than, equal, less than) and whether a minimum or
maximum is required. Where multi-value text is present, there is no Boolean logic
equivalent. As one example,
[min created,modified,embedded,record,system] >= "20150403"
Is equivalent to:
(([region created] >= "20150403" or [region created] != *) and
([region modified] >= "20150403" or [region modified] != *) and
([region embedded] >= "20150403" or [region embedded] != *) and
([region record] >= "20150403" or [region record] != *) and
([region system] >= "20150403" or [region system] != *) and
(([region created] >= "20150403") or ([region modified] >= "20150403") or
([region embedded] >= "20150403") or ([region record] >= "20150403") or
([region system] >= "20150403")))
Any or All Regions
To simplify constructing queries that need to find the same result in multiple regions,
the Any and All region specifications are available. This feature was first available in
the 16.2.3 (2017-12) update of the search engine.
The any region designation is a syntax shortcut for using the OR operator. The
convenience form:
[any "r1", "r2", "r3"] "bob"
Is equivalent to constructing this query using OR:
[region "r1"] "bob" or [region "r2"] "bob" or [region "r3"] "bob"
Similarly, the all region designation is a syntax shortcut for using the AND operator.
The convenience form:
[all "r4", "r5", "r6"] "sue"
Is equivalent to constructing this query using AND:
[region "r4"] "sue" and [region "r5"] "sue" and [region "r6"] "sue"
Regular Expressions
OTSE supports the use of regular expressions for matching tokens. A regular
expression is a pattern of characters. In the OTSE query language, a term preceded
by the operator regex is interpreted as a regular expression. Patterns are defined
using the following rules:
. The period matches any single character.
[ ] Square brackets enclose a character set or a range of characters. A range
consists of two characters separated by a hyphen, such as 0-9. The characters
within a range are determined by their ordering in the UTF8 character set. Examples:
[a-z] matches the letters of the alphabet
[$#.!%] matches a number of punctuation symbols. Contrary to popular belief, this
does not match obscene words.
[^ ] The caret ^ symbol has special meaning. If it immediately follows the opening
square bracket, then it negates the range. For example:
[^0-9x] matches any character except the digits 0 through 9 or the letter x.
Within a range, the caret symbol is also an escape character, which allows a closing
square bracket, hyphen or caret to be matched. For example:
[abc^]^-] matches any of the letters a, b or c, the closing square bracket, or the
hyphen.
^ The caret symbol is an anchor denoting the beginning of a word when used as the
first character in the regular expression. For example, the pattern
"^sp" will match spain or sporadic, but not hospital or wasp.
$ The dollar sign, when used as the last character in a pattern, is an anchor at the
end of a word. For example, the pattern
"sp$" will match wasp, but not spain or hospital.
* The asterisk matches the smallest preceding range zero or more times. The
preceding pattern may be a character or a range. For example, the regular expression
"ad*" will match a, ad, add, addition.
+ The plus character matches the smallest preceding range one or more times. For
example,
"tr[eay]+" will match words like try, tree, trey, treayaaa or country. It will not match tr.
? The question mark matches the smallest preceding range exactly zero or one time.
Reusing the previous example:
"tr[eay]?" will match try or pictr. However, it will not match tree.
| The vertical bar functions as an OR operation between patterns.
"go|stay" will match cargo or stay.
The range "[a-c]" could be represented as "a|b|c".
( ) Parentheses are used to group patterns together. This allows complex patterns to
be constructed.
"ho(us|m)e" will match both house and home.
\ The backslash is an escape character indicating that the following character should
be interpreted literally, and not as an operation. Use a double \\ to match the \ character.
"func\(a\)" matches func(a).
"3\.14" matches 3.14, but not 3714 ("3.14" would match 3714).
"folder\\subfolder" will match folder\subfolder.
Some additional examples:
"^l(uke|eia)" Match words that start with luke or leia.
"^....s?$" Match five letter words that end with the letter s, or four letter words.
"^en[a-z]+p[eaid]+$" Not sure how you spell encyclopedia? It starts with 'en', has
some letters, then a 'p', then some combination of e, a, i and d. Mind you, this also
matches envelope.
"(0?[1-9])|(1[0-2]):[0-5][0-9]" Find words that contain a string that might be a time in
12 hour format, such as 1:30, 03:26, 12:59.
"^s(ch)?m[iy](th|dt|tt)e?$" Match words like smith, smyth, Schmidt, smitte.
"^ope.+ext$" Matches the common user expectation of a wildcard in the middle of a
word: ope*ext.
Within a WHERE clause, the regex operator looks like this:
[region "Size"] regex "^(small|med)"
Regular expressions can be very expensive operations. In the worst case, the entire
internal dictionary may need to be examined to test every word as a potential match.
The most effective way to reduce the cost of finding candidate words is to anchor
the start of the regular expression with a caret.
It is also important to make the expression as targeted as possible. If the regular
expression matches thousands of possible words, then the resulting search query will
have an effective "OR" operation of thousands of terms.
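To illustrate the benefit of anchoring (the region name and patterns here are
illustrative only), both of the following match words ending in "ext", but the anchored
form lets the engine narrow the dictionary scan to words beginning with "ope" rather
than testing every word:
[region "OTName"] regex "ext$"
[region "OTName"] regex "^ope.+ext$"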
NOTE: The search index has typically normalized the indexed words to lower case
(see the section on the Tokenizer for details). Unless you are using a tokenizer that
preserves case, the use of upper case within a regular expression is normally not
appropriate.

Relative Date Queries
When searching within a region of type DATE or TIMESTAMP, there are special
operations available that simplify the creation of common relative date searches.
Relative date queries can use day, week, month, quarter or year comparisons,
represented by an integer immediately followed by the letter d, w, m, q or y. Positive
integers represent periods in the future; negative numbers are periods in the past. As
an example, "-1y" means the previous year.
The current date determines the meaning of the current week, month, quarter or year,
which can be expressly used in the query with the integer 0. For example, the
current month can be represented by "0m".
Relative weeks, months, quarters or years are aligned to their calendar boundaries;
they are not shortcuts for 7, 30, 90 or 365 days. The first day of the week is
determined by the system locale, which is Sunday for most areas of the world.
Calendar quarters are used, comprised of three month periods starting in January,
April, July and October. Relative date queries are supported for comparisons { < <=
>= > }, but not for equality (or inequality).
For illustration, assume that the LLFieldDefinitions.txt file contains the following entry
that captures the date a contract ends:
DATE EndDate
The query syntax has the form:
[region "regionName"] comparator "rDate"
e.g.
[region "EndDate"] >= "-365d"
The following table illustrates how the relative date value is interpreted, assuming
that today is Monday 13 October 2014.
rDate      Meaning                                Effective Query
>= +1m     Next month or later                    >= 20141101
<  0d      Before today                           <  20141013
>  -7d     More recent than 7 days ago            >  20141006
>  -1w     Later than last week                   >  20141011
>= 0w      This week or later                     >= 20141012
>= -1y     Last year or later                     >= 20130101
>= -365d   Last 365 days or later                 >= 20131013
>  -1y     After last year                        >  20131231
<  -2y     Before the previous 2 years            <  20120101
>= -1q     Last quarter or later (after July 1)   >= 20140701
>  -1q     After last quarter                     >  20140930
<= 0q      Before end of this quarter             <= 20141231
>  -16m    After June 2013                        >  20130630
If a TIMESTAMP region is used, the internal conversion is similar, but is expressed to
the millisecond level where necessary.
<= 0q      Before end of this quarter             <= 20141231T23:59:59.999
>  -7d     More recent than 7 days ago            >  20141006T23:59:59.999

Matching Lists of Terms
There are three query operations that are optimized for matching items in a list of
terms: IN, TERMSET, and STEMSET. These operations are valid only for the full text
(body) or text metadata regions.
The IN operator takes a list of simple terms or phrases, and is a more concise
method of matching items in a list than using OR operations. Consider the clause:
[region "lake"] in(superior, erie, "Lake of the Woods")
This is equivalent to:
[region "lake"] ="superior" SOR [region "lake"] ="erie" SOR
[region "lake"] ="Lake of the Woods"
Using the SOR operator ensures that multiple matches won't rank the result higher.
Note the use of the = modifier; the IN operator will only match entire values in
metadata regions. The behavior in full text content is slightly different, in that entire
value matching is no longer pertinent. In full text queries,
in(superior, erie, "Lake of the Woods")
is equivalent to
"superior" SOR "erie" SOR "Lake of the Woods"
The TERMSET feature allows you to locate objects that have at least N matching
values from the provided list. For example, the clause:
termset(5, Water, river, lake, pond, stream, creek, rain, rainfall, dam)
will match an object that contains 5 or more of the terms and phrases. This is a very
powerful construct for discovery and classification applications. There is no simple
equivalent representation. The example above could be expressed like this:
SELECT ... WHERE
(stream AND pond AND lake AND river AND water) OR
(creek AND pond AND lake AND river AND water) OR
(creek AND stream AND lake AND river AND water) OR
(creek AND stream AND pond AND river AND water) OR
(creek AND stream AND pond AND lake AND water) OR
(creek AND stream AND pond AND lake AND river) OR
(rain AND pond AND lake AND river AND water) OR
(rain AND stream AND lake AND river AND water) OR …
Fully written out, this query comprises 126 lines with 629 operators. The TERMSET
operator is powerful, concise, and eliminates errors in constructing complex queries.
The implementation of TERMSET and STEMSET is also internally optimized for these
cases. Queries may run considerably faster and use less memory with
TERMSET/STEMSET compared to executing the fully expanded equivalent queries
constructed of AND / OR terms.
The value of N can also be a percentage, meaning that at least the specified
percentage of terms must match. 50% of 4 terms means that 2 or more matching
terms are needed. 51% means that 3 or more must match, since the percentage is a
minimum requirement. Using percentages is typically useful when there are longer
lists of candidate matching terms. These are equivalent:
termset(3, Water, river, lake, "duck pond", "stream")
termset(50%, Water, river, lake, "duck pond", "stream")
Negative values for N are interpreted to mean M-N as the threshold, where M is the
number of terms. For example, if there are 10 terms, a value of -2 is equivalent to a
value of 8 for N. It may be of interest to note that at the endpoints for a list of N
terms, TERMSET 1 is an effective OR, and TERMSET N is an effective AND.
termset(1, red, blue, green) → red OR blue OR green
termset(3, red, blue, green) → red AND blue AND green
The STEMSET operator is similar to TERMSET, except that it matches stems of the
values (that is, singular and plural variations).
stemset(5, Water, river, lake, pond, stream, creek, rain,
rainfall, dam)
Would match an object that contains:
Water, rivers, ponds, stream, rain
Being singular/plural aware means that a document that had only the words:
Water, river, rivers, pond, ponds
will not match, since STEMSET considers the singular and plural forms of river and
pond to be the same term. This document therefore has only 3 matching terms,
instead of the required 5. Essentially,
stemset(2, water, river, pond)
can be thought of as
((stem(water) and stem(river)) or (stem(water) and stem(pond)) or
(stem(river) and stem(pond)))
or, in a somewhat simplified form which doesn't really cover all the variations of
stemming,
((water or waters) and (river or rivers)) or ((water or waters) and
(pond or ponds)) or ((river or rivers) and (pond or ponds))
Unlike the IN operator, STEMSET and TERMSET are not constrained to matching
only full values in text metadata regions. The negation of these operators is possible
using NOT, and can be interpreted as follows:
(m or n) not termset(2,a,b,c)
→ (m or n) and-not (termset(2,a,b,c))
[region "r"] not stemset(2,x,y,z)
→ not ([region "r"] stemset(2,x,y,z))
The TERMSET and STEMSET operators were first introduced in version 16.0.1
(June 2016).
ORDEREDBY
The ORDEREDBY portion of a query is optional. Its purpose is to give you control
over how the search results should be sorted (ranked) and returned in the get results
command. If omitted from the query, the result ranking is sorted by the relevance
score in descending order. This means that the most “relevant” results are returned
first.
Within Content Server, ordering of results is not available in the Livelink Query
Language (LQL). Content Server injects the appropriate ORDEREDBY statements as
needed depending upon the way results are displayed.

ORDEREDBY takes parameters, with the first parameter determining whether
additional parameters are accepted. These are:
ORDEREDBY Default
This is the same as omitting the parameter entirely, and ranks results by relevance.
ORDEREDBY Nothing
This parameter identifies that no sorting of the search results takes place. This
provides both a memory and performance improvement, especially if retrieving large
sets of search results. The order returned by a specific Search Engine is repeatable.
Where multiple partitions exist, the overall order is not repeatable, since the Search
Federator will select results based on the order in which Search Engines completed
their individual searches.
ORDEREDBY Relevancy
This is the default if the parameter is omitted. Results are ranked by relevance, with
the most relevant results first.
ORDEREDBY RankingExpression
The RankingExpression method allows you to extend the WHERE clauses to provide
additional parameters for evaluating relevance. This does not affect the objects
selected, only their relevance computation. For example:
ORDEREDBY RankingExpression ([region "size"] "small"
OR [region "color"]!= "green")
This would modify the relevance computation to favor objects which have the value
“small” in the region “size”, or any value except “green” in the “color” region. The
same rules that apply to WHERE clauses are used here.
Note that these values SUPPLEMENT the WHERE clauses, not replace them in the
scoring.
ORDEREDBY Region
The Region ordering allows you to sort the results by one or more specific fields.
ORDEREDBY REGION "OTCreateDate" ASC, "Author" DESC
This example sorts first by OTCreateDate in ascending order, and for objects with
identical OTCreateDate values they are further sorted by Author descending. Use of
ascending (ASC) or descending (DESC) is optional, with ascending being the default.
Regions are separated by commas.
If sorting on a multi-value text region, the first value (as provided during indexing) is
used as the sort key.
There are special syntax cases available for text regions which have multi-language
metadata, allowing you to specify which of the language values for the region should
be used in the sort.
ORDEREDBY REGION "OTName" SEQ "fr" DESC
The use of SEQ or SEQUENCE, followed by the language code, requests that the
value in the OTName region which has the key/value attribute pair of "lang"="fr" be
used when sorting this region. If there is no French language value defined, then the
system default language for the value will be used. If there is also no system default
language, then the language with the smallest value is used; otherwise the standard
"no attribute" sorting is used.
ORDEREDBY Existence
Rank the search results by the number of matching terms in an object. This modifies
the standard relevance computation slightly, so that the number of times a term
appears is not important; only the number of distinct terms which exist in the
document matters.
ORDEREDBY Rawcount
Rank the search results by the number of instances of terms in an object. This
modifies the standard relevance computation slightly, so that the number of times a
term appears is highly rated. The default scoring algorithm considers the number of
times a word appears, but it is only a modifier. Using Rawcount will make the
number of times words appear a major factor in the score.
ORDEREDBY Score[N]
Rank the search results using a combination of the ranking computation (global
settings) and boost values specified as parameters in the query. Refer to the
Relevance section of this document for details.
Performance Considerations for Sort Order
In some cases, the sorting requested for results can be a factor in search
performance. Sorting is performed in the search engines, and each search engine
requires temporary memory allocation and time to perform the sorting. For both time
and memory, the key variables are the type of sort, and the cursor position of the
requested results.
Orderedby Nothing is the fastest performer, and uses the least memory – since it
skips the sorting step entirely. If your application needs to gather all the results from
a query, the use of Nothing as the sort order is strongly recommended, especially if
you are dealing with large data sets. Sorting and retrieving 1 million results may
require on the order of 100 Mbytes of temporary memory. Sorting by Nothing will
avoid this penalty.
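For instance (the WHERE clause here is illustrative, and this assumes the usual
SELECT … WHERE … ORDEREDBY ordering), an application harvesting a complete
result set might issue:
SELECT "OTObject" WHERE [region "OTSubType"] "144" ORDEREDBY Nothing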
Sorting by primitive data types such as floats (relevance), integers, or dates is the
next best performing configuration. Roughly speaking, primitive types require about
4 Mbytes of RAM for each 100,000 results the cursor is advanced.
Sorting by string values is slower and uses more memory. Performance costs may
start to become material when moving the cursor past about 20,000 results. Memory
requirements vary depending on the lengths of the strings, but typically run about
15 Mbytes of temporary memory per 100,000 results the cursor is advanced.
Sorting on multiple fields is slower, and uses more memory. The performance
penalties are difficult to predict, since they depend on the numbers and types of
sorts. The order also matters – a sort on a number first, then on a string uses about
8 Mbytes per 100,000 results the cursor advances. Reversing it to sort on the string
first, then a number, would use more memory than just a string sort.
What does it mean when we talk about advancing the cursor position? Regardless of
how many search results there are, if you are only retrieving the first few hundred, the
sort time and memory required will be low. However, if you want sorted results
numbers 99,900 to 100,000, then the cursor must be advanced to at least position
100,000. The search engines must sort at least that number of results, requiring
significant resources. When asking for results 1 to 100, the search engines can
optimize their sorting implementation to focus on ensuring that just the minimum set
of values is properly sorted.
The memory resources required for sorting are per search engine, per concurrent
search query. If you want to support up to 10 concurrent queries, each asking for
100,000 results, then each search engine may need over 150 Mbytes of working
space available. In normal types of applications this pattern is rarely observed, and
in practice most applications use relatively small amounts of memory to retrieve less
than 10,000 results from a few concurrent queries.
Text Locale Sensitivity
When ordering results by a text region, locale-sensitive sorting is used by default. As
a result, sorting can differ somewhat depending upon the locale. Locale-sensitive
collation generally groups accented characters near their unaccented equivalents.
Depending on the locale, multiple characters may be considered as a single logical
character, and some punctuation may be ignored.
The locale for a system is determined from the operating system by Java, using the
Java system variables user.language, user.country and user.variant. For debugging,
these values are logged during startup. The locale can be explicitly set as command
line parameters to override the system defaults. For example:
java -Duser.country=CA -Duser.language=fr …
Locale sensitive sorting was first added in 20.4, and can be disabled in the
[Dataflow_] section of the search.ini file by requesting the older behavior:
OrderedbyRegionOld=true
Facets

Purpose of Facets
Facets allow metadata statistics about a search query to be retrieved. For example,
if facets are built for the region “Author”, and there were 300 results, facets might
supply the following information from the “Author” region:
Mike 121
Alexandra 72
David 32
Michelle 21
Stephen 19
Alex 11
Paul 6
The interpretation would be that of the 300 results, 121 of them had the value “Mike”
in the “Author” region, 72 had the value “Alexandra”, and so forth. As an application
developer, you can present this information to the user to help them understand more
about their search results. It is also common to allow the user to “drill down” into the
results based on facets. For example, the user might determine they only want
results authored by Ferdinand. They select Ferdinand, which re-issues the same
search, this time with an additional clause in the query along the lines of AND
[region "Author"] "Ferdinand" (requiring "Ferdinand" in the region "Author").

Requesting Facets
OTSE generates facet results when requested within the search queries. There are
no special configuration settings necessary to use facets, although optimization by
protecting commonly required facets may be a good idea. To request facets, in the
‘SELECT’ portion of the query, you add text along these lines:
SELECT "OTObject", "OTSummary" FACETS "Author", "CreationDate" WHERE …
OTSE would then generate facets for two regions: Author and CreationDate. There
is no defined limit to the number of facets that can be requested for a query, but
memory or performance limitations will become a factor for large numbers of facets.
The design optimizations selected for OTSE are based on expectations of 100 or
fewer distinct facets in use at any time.
Once the query completes, you retrieve the results from the search engines with the
command:
GET FACETS
The output from the GET FACETS command is described in more detail in the Query
Interface section.
Like the search results, the facets for the query are retained until the query is
terminated or times out. Except for date facets, the values are returned sorted from
highest frequency to lowest frequency.
When facet values are returned, there are a couple of additional values provided.
The number of facet values identifies the total number of facet values found. The
returned count is the number of facet values actually returned, which is usually
smaller. There is also an overflow indicator, which identifies whether the number of
facet values exceeded the configurable limit – meaning that the facet results are not
exact since they are incomplete.
In most applications, a user is not interested in reviewing thousands of possible
metadata values in a facet. Usually, only the most common values are of interest.
The facets implementation allows you to place a limit on the number of values for
each facet you want to see. Using syntax such as:
SELECT "OTObject" FACETS "Author"[5], "DocType"[15]
This would return only the 5 highest frequency values in the field “Author” and the 15
highest frequency values in the field “DocType”. By default, the first 20 values are
returned. This default can be overridden by a configuration setting. You are strongly
advised to limit the number of values returned, especially with facets that may contain
arbitrary values, since they can potentially contain millions of values which would
significantly impact search performance.
Facet Caching
Facets data structures are built on demand. Once created for a given facet, the
structure is retained in memory so that subsequent queries using the facet are very
fast. In order to keep memory use constrained, there is a maximum number of facets
that the search engine will retain. If a query requests new facets that are not in
memory and the maximum number of facets is exceeded, then the search engine will
delete the facet structure that has not been used for the longest time. The default is
to retain up to 25 facet structures in memory. There is a 10 minute “safety margin” –
meaning that even if 25 facets are exceeded, a facet that was used in the last 10
minutes will not be deleted. A facet that is included in an active query will also not be
deleted. The limit is therefore a guideline rather than an absolute maximum.
If your applications use more than 25 facets regularly, then search query
performance may suffer as facet data structures are regularly created and deleted.
You can adjust the number of facets to retain in memory in the [Dataflow_] section of
the search.ini file:
MaximumNumberOfCachedFacets=25
Text Region Facets
Use caution when requesting facets for fields that may contain arbitrary text values.
There may be very large numbers of values, which can result in poor performance for
search queries. As a minimum, ensure that you specify an upper limit on the number
of facet values you want to retrieve, which is described in more detail near the end of
the Facets section.
In building facets, the values of the text fields are examined to build the facets. If the
text regions for which facets are built are stored on disk, then the performance for
search will be impacted. You should consider using RAM storage for text regions for
which you expect to retrieve facets.
Text regions also support multiple values. In these cases, each value is separately
returned. If the region “DocType” for an object had multiple values (“ZIP”, “XML”,
“Word”), then the object is counted 3 times in the facets, once for each of the values.

Date Facets
Date facets represent a special case, which has been constructed specifically to
address a very common and important requirement, namely presenting facets that
represent the “recentness” of an object in the index. Date facets are not designed to
handle arbitrary dates or future dates.
If facets are requested for regions of type DATE, special handling occurs. Each day
within the supported time range is counted multiple times – as a day, within a week,
within a month, within a calendar quarter, and within a calendar year.
Date facets are not sorted by frequency. Instead they are ordered by recentness. If
you have requested facets for 8 months, you will always get the most recent 8
months returned. When constructing a query for date facets, the syntax within the
SELECT statement is:
… FACETS "CreateDate"[d30,w0,m12,q0,y10] …
The facet counts are optionally specified as a letter followed by the number of facet
values desired, where:
d – number of days, including today
w – number of weeks starting on Sunday, including today
m – number of months, including the current month
q – number of calendar quarters (Jan, Apr, Jul, Oct), including the current quarter
y – number of calendar years, including the current year

The example above would request the last 30 days, the last 12 months, the last 10
years, and no facets for weeks or quarters. To obtain no values for a category,
specify zero. Omitting the category will result in the default number of values being
returned. If the count for a value is zero, then no facet value will be returned.
The default number of date values to be returned is defined in the search.ini file. In
the [DataFlow_] section:
DateFacetDaysDefault=45
DateFacetWeeksDefault=27
DateFacetMonthsDefault=25
DateFacetQuartersDefault=21
DateFacetYearsDefault=10

The values returned for date facets are formatted to easily identify their type and date
range.
Days: d20120126 (dYYYYMMDD) 26 Jan 2012
Weeks: w20120108 (wYYYYMMDD) week starting 8 Jan 2012
Months: m201202 (mYYYYMM) Feb 2012
Quarters: q201204 (qYYYYMM) quarter starting Apr 2012
Years: y2012 (yYYYY) year 2012

Date facets can only be built for dates where the day is within range of the
maximum number of facet values, per the settings described later. The default is
32767 days, or about 90 years.

FileSize Facets
Integer regions may be marked in the search.ini file to have their facets presented as
FileSize facets. This mode groups file sizes into a set of about 30 pre-defined
ranges. It ignores the number of facet values requested, and always returns a fixed
number of facet values representing the buckets (or ranges). Details of these facet
values are described in the get facets command section.

Facet Security Considerations

When writing an application that leverages search facets, you may need to consider
the security implications. In a typical application such as Content Server, search
results are post-processed to filter out results that a particular user is not entitled to
see. It is more difficult to do this with facet values.
For applications in which the security requirements are high, you must ensure that
facets which contain sensitive information are not made available to users without
suitable clearance. In many cases, it is considered acceptable to display facets
which do not contain sensitive data, such as file sizes, object types, or dates. It might
also be possible to achieve acceptable security by reducing the exactness of the
object counts, displaying a more generic frequency indicator (e.g., 1 to 4 "bars", or
labels such as "many" or "few") instead of the precise counts from the search engine.
Ultimately, you will need to choose an appropriate tradeoff between user
convenience and search experience versus the risk that a user might glean harmful
information from facet values.
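As a sketch of the count-coarsening idea (the thresholds and labels here are
arbitrary choices, not part of the product):

from typing import Final

# Coarsen exact facet counts before display so precise numbers are not leaked.
def fuzzy_count(count: int) -> str:
    if count == 0:
        return "none"
    if count < 10:
        return "few"
    if count < 100:
        return "some"
    return "many"

print(fuzzy_count(121))  # "many"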
Facet Configuration Settings

There are a number of configuration settings in the search.ini file for facets. All
settings are located in the [DataFlow_] section of the search.ini file.
The expected number of facets is used to determine the initial amount of memory
that should be allocated when the facet data structure is created. This does not
place an upper limit on the number of facets that are possible, since the structure can
grow. It increases performance when the facet data structures are built.
ExpectedNumberOfFacets=8
The expected number of values per facet is used to determine the amount of memory
that should be allocated when a new facet data structure is created. This does not
place an upper limit on the number of values that are possible, since the structure
can grow. This is a minor optimization. Because of the bit-field representation used
for facet structures internally, this value should be a power of 2.
ExpectedNumberOfValuesPerFacet=16
The maximum facet value length represents the maximum length of a value that will
be considered for facet purposes. Longer strings are truncated, which means that
the facet system would treat the distinct values “I am a long facet value ending with
0123” and “I am a long facet value ending with 4567” as identical values. This limit
allows control over memory used for facets.
MaximumFacetValueLength=32
Facets can be entirely disabled within the Search Engines by setting the following
value to false:
UseFacetDataStructure=true

The maximum number of values per facet sets the upper limit on how many distinct
facet values are possible. This limitation is present as a failsafe from abuse, and
presumes the typical facet application is intended for much smaller data sets.
Increasing this value will increase the amount of memory required to store facet
information. Because the internal data structures use bit-fields, the optimal settings
for this value are 1 less than a power of 2 (e.g., 2**N - 1). It should be noted that
multi-value text fields consume a facet value for every combination of text values
contained in the field. For example, if the region "Colors" can contain combinations
of "red", "blue", "green" and "black", then 15 combinations are possible and 15 of the
facet values could potentially be used. If you expect to create facets for regions that
may have many combinations (such as email distribution lists) then this number may
need to be very large, and you may be limited by usable memory.
MaximumNumberOfValuesPerFacet=32767
The number of desired facets is the default number for the “most common N” facet
values to be returned if the number of desired facets is not specified in the query.
This ini setting does not affect the special return values for Date type facets.
NumberOfDesiredFacetValues=20
Reserving Facet Memory

Each search engine allocates memory for facets from its general pool of available
RAM. For small data sets and small numbers of facets, there is often enough
memory available that the search engines can draw from the memory allocation
without any impact. However, as the facet data sets become larger, it is a good idea
to increase the Java memory limit (the -Xmx parameter on the startup command
line).
The increase in memory depends on factors such as the number of facets, the
number of distinct values for a facet, and the size of the values. Roughly speaking, a
2 GB partition would need twice the memory of a 1 GB partition. Facets with a small
set of possible values, such as OTFileType or a Classification, require relatively little
memory. Facets with large numbers of possible string values, such as the parent ID,
keywords or hot phrases, would approach the theoretical maximum memory
requirement.
For reference, the approximate formula for the upper limit in bytes, excluding Java
object overhead, is:
MaxSizeInBytes = 1/8 * (MaxNumberOfObjects * MaximumNumberOfCachedFacets * log2(MaximumNumberOfValuesPerFacet))
               + MaximumNumberOfCachedFacets * MaximumNumberOfValuesPerFacet * MaximumFacetValueLength
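As a rough worked example (a sketch only; the 5 million object partition size is
hypothetical, and the other values are the defaults shown in this section):

from math import log2

max_objects = 5_000_000     # hypothetical partition size
cached_facets = 25          # MaximumNumberOfCachedFacets
values_per_facet = 32767    # MaximumNumberOfValuesPerFacet
value_length = 32           # MaximumFacetValueLength

# Bit-field portion plus stored facet value strings, per the formula above.
bitfields = max_objects * cached_facets * log2(values_per_facet) / 8
values = cached_facets * values_per_facet * value_length
print(f"~{(bitfields + values) / 2**20:.0f} MiB upper bound")  # ~249 MiB

This is an upper bound; as noted below, typical memory use is far smaller.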
You are unlikely to need to resort to these calculations. In practice, for a 1 GB
partition in Low Memory mode (3 to 5 million objects) with 10 to 15 typical facets in
use, memory consumption by facets is usually less than 40 MB. Content Server uses
this guideline for its default setting.
To reiterate: facet memory allocation is NOT an explicit setting. Simply increase the
Java heap size available on the command line. Content Server 16 exposes this
incremental memory allocation for search engines in its search admin pages.

Facet Performance Considerations
Computation of facet data increases the cost of a query. As with all performance
discussions, the list of variables and parameters makes providing solid performance
data difficult.
During normal operation, after the facet data structures are generated, computing
facet information for a single metadata region is relatively fast, typically less than 50
milliseconds. This time varies primarily with the number of search results, since the
facet values for every result need to be added together. If your typical queries return
many millions of results, average times would be closer to 1 second. As more facets
are requested for a query, these times are additive. Experience has been that facet
computation is not a material consideration for performance in most scenarios.
Conversely, initial generation of facet data structures can be relatively expensive.
Each potential metadata value must be examined, and a new facet value created or
the data structures updated if it already exists. The time to perform this task varies
widely based on the number of items in the partition, the data type, the number of
possible unique values, and for text metadata – whether the values are stored in
memory or on disk.
For example, if there is an enumerated data type with less than 100 possible values
in a partition containing just 1 million items, generation of the facet data structures is
likely less than 1 second.
At the other extreme, generating facet data structures on a text region that has high
cardinality (e.g. 200,000 possible values, such as a folder location or keywords/hot
phrases), in a large partition containing 10 million items that is configured for storage
on disk will take considerably longer, potentially many minutes.
For larger systems in particular, limiting the use of facets for regions with high
cardinality may be necessary to meet performance objectives.

Protected Facets
As noted above, the time required to generate facet data structures can be material.
In addition to building search facets on demand, it is possible to specify facets that
are known to be commonly used. On startup, the data structures for these facets will
be built if they are not in the Checkpoint; they are excluded from facet recycling
(never destroyed); and they are optionally saved in the Checkpoint file for faster
loading on next startup. Content Server uses this feature. To build protected facets at
startup, in the search.ini file, specify the regions in the [Dataflow_] section:
PrecomputeFacetsCSL=region1,region2,region3

As an option, the protected facets may be stored in the Checkpoint file. This also
means a copy of the facet data is maintained in the Index Engines, which requires
additional memory. To enable persisting facets in the Checkpoints, in the [Dataflow_]
section of the search.ini file add:
PersistFacetDataStructure=true

When specifying protected regions, you should also ensure that the desired number
of cached facets is greater than or equal to the number of protected facets specified
in this list. The desired number represents the point at which the search engine will
begin recycling non-protected facets to make room for new facets requested in
queries. In addition, the maximum number of facets should be higher still. The
maximum number of facets is the limit, which may be higher than the desired number
if there are many facets requested in a single query. Beyond this maximum number,
facet requests are discarded.
DesiredNumberOfCachedFacets=16
MaximumNumberOfCachedFacets=25

Search Agents
Search Agents are stored queries that are tested against new and changed objects
as part of the indexing process. The two most common uses of Search Agents are to
stay up to date on topics of interest, and for assigning classifications.
The monitoring case is illustrated by the Content Server concept of Prospectors.
Consider a situation where you want to know everything about a particular customer.
You construct a query to match the name of the customer or a few of the known key
contacts at that customer. By adding this as a Prospector, you are notified any time
new data is indexed that matches this query.
For classification, you construct a set of queries that define a specific classification
profile. For example, if all customer service requests use a form that contains the
text “customer support ticket”, then this query is attached to the classification agent,
and any object containing this phrase is marked with the classification. By using
many queries, you can build a complete set of classification categories. One object
may match several possible queries, and be tagged with multiple classifications this
way. In Content Server, this is known as Intelligent Classification.
In operation, the queries to be tested against new data are contained in a file.
Matches to the search agent queries are placed in iPools which are monitored by the
parent application, typically Content Server.
Search Agent Scheduling
There are two ways that Search Agents can be run: with every indexing transaction,
or interval since last run.
Transaction based execution is the default for backwards compatibility. Indexing
transactions are generally in the range of 500 items, so the overhead of running the
agent queries with each transaction can be very high.
Interval Execution
The preferred approach is to run the agents based on an interval. At the completion
of an indexing transaction, the time since the last agent run completed is checked,
and if suitable time has elapsed, the agents are run. The metadata region
OTObjectUpdateTime is used to construct queries that match objects which have
been added or changed since the last agent execution. This mode of operation was
first introduced in version 16.2.11 and can dramatically improve search throughput
when agents are used.
To configure interval execution, set the “EveryTransaction” value to false, and the
interval in milliseconds (default is 5 minutes). As a rough guide, the default interval
setting will use about 10% of the compute resources for executing search agents
compared to transaction-based agents.
[UpdateDistributor_xxx]
RunAgentsEveryTransaction=false
RunAgentIntervalInMS=30000

If the interval is set to a value of -1, the agent execution will pause. There is no loss
of activity – when the interval is restored to a positive value, the agent queries will
include all objects that were indexed while paused. Pausing may be desirable if
there is a temporary need to maximize indexing performance.
The Update Distributor keeps track of the agent execution in files that are stored in a
subdirectory of the search index:
index/enterprise/controls

The files are named upDist.N and contain the timestamp for each of the last agent
runs, expressed in Unix time (also known as Epoch time or POSIX time, i.e.
milliseconds since Jan 1, 1970). Sample file below.
UpDistVersion 1
SearchAgentBaseTimestamp 1571261889130 "MySA0"
SearchAgentBaseTimestamp 1571261889130 "MySA1"
EndOfUpDistState
The timestamp field used by default is the OTObjectUpdateTime. The field can be
changed, but there are currently no known scenarios where the default value should
not be used.
[Dataflow_xxx]
AgentTimestampField=OTObjectUpdateTime
When using interval agent execution, the Update Distributor timing summaries will
include the time spent running agent queries, identified with the label SAgents.
Search Agent Configuration
Search Agents are defined within the search.ini file. Multiple Search Agents are
possible. Each Search Agent has a separate section and an entry in the [Dataflow_]
section identifying the name of the agent. You must specify one query file per search
agent.
[Dataflow_]
SA0=agent1

[SearchAgent_agent1]
operation=OTProspector
readArea=d:\\locationpath
readIpool=334
queryFile=d:\\someDirectory\\prosp1.in
The readArea and readIpool parameters specify the file path and directory name
where iPools with results from the Search Agent should be written. These are then
consumed by the controlling application.
The queryFile contains the search queries to be applied during indexing. You can
have many search queries within each queryFile.
The operation can be one of OTProspector or OTClassify. This value does not
change the operation of the search agents, but is recorded in the output iPools, and
is used to help the application (typically Content Server) determine how the iPool
should be processed.
Search Agent Query Syntax
A Search Agent query file uses the UTF-8 character set encoded in plain text form,
with no Byte Order Marker (BOM) at the start. The file has the following sample
syntax:
SET LEXICON English
SET THESAURUS English
SET EXTRAFILTER ( != "<OTCurrentVersion>FALSE" ) and
( [ region "OTSubType" ] range
"136|557|144|749|751|0|145" )
SYNC "SomeIdentifier"
SELECT "OTObject", "OTScore" WHERE [REGION
"OTEmailSenderAddress"] "bsmith@abc.com"
get results score 1
SYNC "AnotherID"
SELECT "OTObject", "OTScore" WHERE [REGION "OTDate"] >
"20110201"
get results score 1
The SET and SELECT commands are identical to the SET and SELECT query
commands that would be issued to the Search Federator in regular searching. The
EXTRAFILTER section at the beginning is a shortcut representation for adding more
query terms to the WHERE clause of every SELECT statement.
The SYNC command echoes the string into the output iPool messages. It is used by
the application requesting the Search Agents to separate or identify the results from
each search query.
The special command “get results score X” requests that all results with a computed
relevance score greater or equal to X are returned. Given that the relevance score is
always between 1 and 100, the value of 1 here requests that all results be returned.
New Search Agent Query Files
Search Agent query files are assumed to have the file extension “.in”. The Update
Distributor consumes these files. In order to prevent contention in the event that an
application tries to modify a search agent query file during processing, a special file
naming convention is used.
The application requesting Search Agents should create the query file with the file
extension “.new”. When the Update Distributor next runs the Search Agents, it will
look for files with the .new extension, and rename them to .in files.
For example, assume that your application creates a Search Agent file named
prosp1.new. The Update Distributor will delete any existing prosp1.in file and rename
prosp1.new to prosp1.in. This approach allows Search Agent queries to be modified
without changing the search.ini file and restarting the Update Distributor.
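As a small sketch of this handoff (the path and query text are illustrative; note that
the file must be plain UTF-8 with no BOM, per the query syntax section below):

from pathlib import Path

# Write the updated agent queries; the Update Distributor will rename
# prosp1.new to prosp1.in the next time it runs the Search Agents.
queries = 'SYNC "MySA0"\nSELECT "OTObject", "OTScore" WHERE "customer support ticket"\nget results score 1\n'
Path(r"d:\someDirectory\prosp1.new").write_text(queries, encoding="utf-8")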
Search Agent iPools

Search Agents generate iPools for external consumption in a specific form, illustrated
below with a fragment from an Intelligent Classification iPool from Content Server.
White space has been added for legibility.
<Object>
<Entry>
<Key>OPERATION</Key>
<Value>
<Size>10</Size>
<Raw>OTClassify</Raw>
</Value>
</Entry>
<Entry>
<Key>MetaData</Key>
<Value>
<Size>297</Size>
<Raw>
<SYNC1>2959</SYNC1>
<Q1N0>OTObject</Q1N0>
<Q1R0C0>
<OTObject>DataId=16412&Version=1</OTObject>
</Q1R0C0>
<Q1N1>OTScore</Q1N1>
<Q1R0C1>93</Q1R0C1>
<Q1R1C0>
<OTObject>DataId=16389&Version=1</OTObject>
</Q1R1C0>
<Q1R1C1>39</Q1R1C1>
<Q1R2C0>
<OTObject>DataId=16390&Version=1</OTObject>
</Q1R2C0>
<Q1R2C1>29</Q1R2C1>
</Raw>
</Value>
</Entry>
<Entry>
<Key>MetaData</Key>
<Value>
<Size>2178</Size>
<Raw>
<SYNC2>3276</SYNC2>
<Q2N0>OTObject</Q2N0>
<Q2R0C0>
<OTObject>DataId=16388&Version=0</OTObject>
</Q2R0C0>
<Q2N1>OTScore</Q2N1>
<Q2R0C1>71</Q2R0C1>
<Q2R1C0>
<OTObject>DataId=16398&Version=0</OTObject>
</Q2R1C0>
<Q2R1C1>71</Q2R1C1>
<Q2R2C0>
<OTObject>DataId=16409&Version=0</OTObject>

The Search Agent type, in this case OTClassify, is the first entry in the iPool. This
value is drawn from the search.ini file in the Search Agent configuration setting.

NOTE: Each section contains a SYNC value, which is the separator specified in the
search agent query file. Content Server uses these SYNC values to match the
search results to the originating query.
The search results themselves are presented with a naming convention that reflects
a QUERY, ROW, COLUMN numbering scheme. For instance, the value <Q2R0C1> is
used for Query 2, Row 0 (the first result), Column 1 (the second region in the select
clause). Likewise, the value <Q1N0> labels the name of Column 0 for Query 1 (in
this case "OTObject"). Note that the names of the regions are only provided in the
first row for a given query.
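A consumer can decode these tags mechanically; a minimal sketch (the tag string is
taken from the example above):

import re

# Decode a result tag of the form Q<query>R<row>C<column>.
m = re.fullmatch(r"Q(\d+)R(\d+)C(\d+)", "Q2R0C1")
query, row, col = (int(g) for g in m.groups())
print(query, row, col)  # 2 0 1 -> Query 2, first row, second column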

Performance Considerations
Search Agents are not free. Although the Agents are only applied to newly added
objects, the frequency, complexity and number of queries run as agents can have a
noticeable impact on indexing performance. For applications with high indexing
rates, Search Agents may not be an appropriate feature.
If you require these types of features for high indexing volumes, you can consider
implementing your solution using standard search queries, serviced by the Search
Engines. By enabling the TIMESTAMP feature for objects, the exact indexing time of
objects can be determined, and a pure search application can provide similar
features, running on a scheduled interval.

Relevance Computation
Relevance is a measurement of how well actual search results meet the user
expectations for search result ranking. Relevance is a subjective area, based upon
user judgments and perception, and often requires experimentation and tuning to
optimize. This is one of the fundamental challenges with relevance tuning: if you
improve relevance for one type of user, you may well be reducing relevance for other
users who have different expectations.
Relevance is a method for determining how close to the top of the list a search result
should be placed. However, relevance has NO IMPACT on whether an object
actually satisfies a query. If a query matches 100,000 items, tuning relevance only
affects the ordering of the items, not which items are matched.
Search relevance is not entirely the responsibility of the search engine. Relevance
scoring is a function of many parameters, most of which are provided by the
application, such as Content Server. Tuning Content Server is also required to
optimize search relevance, but this document will focus more on the OTSE
contributions to relevance.
For typical users trying to find objects, relevance is an important consideration, and
the search results are usually presented sorted by the relevance score. However,
relevance is not a consideration for certain types of applications. For example, Legal
Discovery search is concerned with locating all objects, but does not care about the
order of presentation. Likewise, when using search to browse, results are often
sorted by date or object name.
Retrieving the Relevance Score
The search results will commonly return a relevance score value in the OTScore
region. Simply select OTScore as part of the search query to include the relevance
score in the results.
Some notes about this value are in order. First, this score is NOT an assessment of
the relevance of the object that reflects user expectations; it is a relative value that
assists with ordering of results. In other words, a value of 100 does not mean a
perfect match for the query, and a relevance score of 20 does not mean the object
is irrelevant. Because of this, displaying the relevance score to users may be
misleading, and is generally not recommended for casual users.
If you are writing an application that consumes search results, you should also be
aware that the OTScore field does not always contain a relevance score number.
This region contains the region values that reflect the requested sort order for results.
Often, this is relevance. But if you are sorting results by date or text regions, then the
OTScore region will not contain the relevance score.

Components of Relevance
There are two different types of computations that are applied to objects in the index
to determine their relevance. The first is “ranking”, which is a computation applied in
the same way on every search query. Ranking typically adjusts relevance by giving
higher weights to recently created objects, office documents, or known important
locations. Before Search Engine 16, Ranking was the only available relevance
scoring method, and ranking and relevance were often used interchangeably.
Beginning with Search Engine 16, a second type of relevance computation is
available, known as “boost”. Unlike Ranking, the Boosting parameters are dynamic,
and are provided on each query. This permits the application to add relevance
adjustments based on context, such as the user identity or current folder location.
The remainder of this section will cover the Ranking capabilities, with Boost features
detailed later. You can mix and match both Ranking and Boost, although each
additional relevance feature slightly increases the overall search query time.
In most cases, the ranking configuration is comprised of weights and regions. The
weights indicate how important the parameter is in scoring. Note that these weight
values are relative. Setting all the weights high is the same as setting all the weights
to a medium value. The difference in weights is ultimately what matters.
Some of the explanations below contain simplified versions of the equations used to
compute the scores. They are simplified to the extent that a number of additional
computations are performed to adjust the results from each computation to a
normalized range. The equations presented here are only intended to clarify the
impact that adjustments to the parameters make on the ranking computations.

Date Ranking
The date an object was created or updated is typically an important aspect of
relevance, especially for a dynamic or social application. In these cases, users tend
to favor objects that are recent. Applications such as archival on the other hand
typically do not care about recentness, and different settings might be appropriate.
The date ranking parameter allows you to identify metadata regions which contain
date values that reflect the recentness of an object, and configure their scoring
parameters.
Date ranking is computed using a decay rate from the current date. The decay rate
is one of the configurable values. Small values for decay rates will reduce the score
of older items more rapidly. A simplified approximation of the algorithm is:
Date Relevance = decay / (recentness + decay)
In practice, a very aggressive value that strongly favors recent objects would be a
decay rate of 20 days. Consider this chart of some representative values. The
decimal values in the body of the table represent the contribution to ranking, with
higher values representing higher ranking.

DECAY RATE                    AGE IN DAYS
                3       10      30      60      180
10              0.77    0.50    0.25    0.14    0.05
20              0.87    0.67    0.40    0.25    0.10
30              0.91    0.75    0.50    0.33    0.14
50              0.94    0.83    0.63    0.45    0.22
100             0.97    0.91    0.77    0.63    0.36

Clearly, small values of decay rates generate small ranking contributions for older
items. Remember that the date ranking value is only one component of the ranking
score, and you also control the weight to be applied to this computed value.
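To make the decay behavior concrete, here is a minimal Python sketch (not OTSE
source code) that reproduces the simplified approximation above and regenerates the
table values:

def date_relevance(age_in_days, decay):
    # Simplified approximation: decay / (recentness + decay)
    return decay / (age_in_days + decay)

for decay in (10, 20, 30, 50, 100):
    row = [round(date_relevance(age, decay), 2) for age in (3, 10, 30, 60, 180)]
    print(decay, row)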

The syntax for the date ranking configuration in the search.ini file is:
DateFieldRankers="dateRegion",decay,weight

For example, the following would use the last modified date on an object to compute
date ranking, with a moderately aggressive decay of 45 – but then make the overall
contribution of date to the ranking score small by giving it a weight of 2:
DateFieldRankers="OTModifiedDate",45,2

The date scoring algorithm supports multiple elements. For example, if you had two
different metadata regions that commonly contain important dates that reflect object
recentness, you can specify both, and each is independently computed and added to
the overall ranking score:
DateFieldRankers="OTCreateDate",45,50;"OTVerCDate",30,30

The DateFieldRankers setting is recorded in the search.ini file, and Content Server
exposes this configuration setting in the search administration pages.

Object Type Ranking


The ranking computation contains a relatively flexible Type Ranking mechanism for
selectively adjusting the ranking score if the contents of a region match a specific
value. Within Content Server, this capability is presented as a way to boost the score
for certain MIME types or object subtypes. In practice, this feature can be used in
more flexible ways if your application requires it.
The syntax for the object type ranking component looks like this:
TypeFieldRankers="region1",RW1:"textA",TWa:"textB",TWb:"textC",TWc;
"region2",RW2:"textD",TWd:"textE",TWe;
Where:
region is an ENUM or TEXT type region in which tests for the specified text will
occur.
RW (RegionWeight) is the weight (importance) attached to Object Type Ranking for
this region, relative to other elements of the ranking computation.
text is a string to check. If found, the associated text weight is used.
TW (TextWeight) is the relative weight for this text string within this region, and
should be an integer from 1 to 100.
Support for TEXT in addition to ENUM was first available starting with Search Engine
10.0 Update 11.
A simple example is shown below, which illustrates an application that is not strictly
object type scoring. In this example, we examine a region that describes the
department which owns an object. If Finance owns it, attach a score of 40, and a
score of 30 if Sales owns it. Then set an overall weight of 33 for this test relative to
the other ranking components.
TypeFieldRankers="Department",33:"Finance",40:"Sales",30;

NOTE: These are adjustment values for ranking. Any objects which do not meet the
criteria for the adjustment have an effective score of 0 for this computation.
Content Server exposes this search.ini setting in the search administration pages.

Text Region Ranking


It is common that specific metadata regions will contain text that is deemed to be
particularly important when determining the relevance for keyword matching. Often,
these are metadata regions that would contain the name or description of an object.
OTSE allows you to identify the text regions that should be given extra weight when
assessing keyword hits. The syntax is:
ExtraWeightFieldRankers="region1",weight1;"region2",weight2
In order for this feature to work, the regions being adjusted must also be included in
the list of default search regions. Typically, extra weight would be given to fields such
as the file name or description of an object. This setting is found in the search.ini file,
and Content Server exposes this configuration setting in the search administration
pages.

Full Text Search Ranking


Even in the absence of any of the specific ranking variables, OTSE will compute the
rank of an object based upon the statistical distribution of the matching search terms.
The algorithm is quite complex, and varies depending on the types of operators in the
query. The default value for the relative weight of the Full Text component is 100.
This weight is applied to the larger of either the Full Text component or the default
search regions component. The text ranking algorithm is roughly based on the
industry standard "BM25" formula. Some general guidelines:

Relative frequency
The relative ratio of matched search terms to the overall content size is a factor. The
higher this ratio, the higher the relevance. An obvious example… assume you
search for “combustible”. If document ROMEO has the word combustible 30 times in
1000 words (3%) and document JULIETTE has 50 instances of combustible in 2000
words (2.5%), then document ROMEO will be ranked higher.
Frequency
The more often the search terms occur in the text for an object, the higher the
ranking score.
Commonality
The more common a search term is in the dictionary for this partition, the less weight
it is given in computing the text score. For example, with typical English language
data, if you search for keywords "the" AND "scooter", the value given to matches for
"scooter" will be considerably higher than matches for "the", since "the" is overly
common.
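Because the text ranking algorithm is roughly based on BM25, a generic BM25-style
term score can help build intuition for the commonality and frequency effects
described above. The sketch below is an illustration only; OTSE's actual formula and
normalizations differ and are not published here:

import math

def bm25_term(tf, doc_len, avg_len, df, num_docs, k1=1.2, b=0.75):
    # Rare terms (low document frequency) receive a higher idf weight.
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return idf * norm

# Same term frequency, but "scooter" (rare) far outscores "the" (common):
print(bm25_term(tf=5, doc_len=1000, avg_len=1000, df=10, num_docs=100000))
print(bm25_term(tf=5, doc_len=1000, avg_len=1000, df=95000, num_docs=100000))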
The full text search ranking algorithm is applied to the indexed content, plus any
metadata regions defined in the default search list. The relative weight of the full text
search is also configurable. Both values are specified in the search.ini file.
The default region search list is defined in the search INI file as:
DefaultMetadataFieldNamesCSL="OTName,OTDComment"
ExpressionWeight=100
Content Server exposes the list of default regions to search in the administration
pages for search, and the values are stored in the search.ini file. Remember to
ensure that any metadata text regions given an adjusted score are included in this
default region search list.

Object Ranking
The search ranking algorithm also allows external applications to provide ranking
hints for objects. In a defined metadata field, the application can provide a numeric
ranking score – an integer between 0 and 100. The search ranking algorithm can
incorporate this ranking value into the overall rank. You have the ability to set a
ranking value for each object, define the field to be used for object ranking, and
assign an overall weight to Object Ranking relative to other elements of the ranking
algorithm. If there is no Object Ranking value for an object, it gets a ranking
adjustment of zero.
The Object Ranking settings are kept in the search.ini file. In the example below,
OTObjectScore is the metadata region that contains the ranking value, and 80 is the
relative weight attached to the Object Ranking component of the ranking calculation.
ObjectRankRanker="OTObjectScore",80
If you are developing applications around search, using the Object Ranking feature
can improve the overall user experience. Some of the common events used to
modify the ranking include tracking objects that are popular for download, objects
placed in particular “important” folders, how frequently objects are bookmarked, or
other situations which are appropriate to the application. As a developer, you also
need to remember to degrade the object ranking over time – an object which is
important now may well lose its relevance later.
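As an illustration of degrading object rank over time, the sketch below applies a
half-life decay to an application-assigned score. The function name and the 90-day
half-life are hypothetical choices for illustration, not an OTSE API:

from datetime import date

def decayed_object_rank(base_rank, event_date, half_life_days=90):
    # Halve the contribution of the triggering event every 90 days.
    age = (date.today() - event_date).days
    rank = base_rank * 0.5 ** (age / half_life_days)
    return max(0, min(100, round(rank)))  # object rank must be an integer 0..100

print(decayed_object_rank(80, date(2020, 11, 1)))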
One other observation for developers setting Object Ranking values: as described
elsewhere in this document, OTSE supports indexing select metadata regions for
objects. You do not need to re-index the entire object in order to set the Object Rank
value; using the ModifyByQuery indexing operation is usually a good choice. Re-
indexing the entire object each time a ranking value changes would likely have a
material negative impact on overall system performance – both on the application
and OTSE.
Within Content Server, the use of Object Ranking is a feature that is leveraged by the
Recommender module.

Relevance Boost Overview


Unlike ranking adjustments to relevance, boosting adjustments are specified in the
search query, and can differ with each query. The boost syntax varies depending on
the type of boost being requested. In operation, the ranking operation takes place
first, and results in an interim score in the range of 0 to 100. Boost operations are
applied later, and modify the ranking score to generate the final relevance score.
Relevance boosting is specified in the ORDEREDBY section of the search query:
SELECT … WHERE … ORDEREDBY SCORE[N] boost parameters

SCORE[N] identifies that boost adjusting is desired. N is a multiplier (in percent) of
the relevance computed in the ranking algorithm. Normally, N of 100 would be
recommended, which means that the ranking values are used without modification. If
N was 80, then the ranking values would be multiplied by 0.8 before final adjustments
from boosting. Setting the value of N to 0 would cause the ranking component of
relevance to be ignored (treated as 0).
There are three types of boost operations that may be applied: text (query), date and
integer. Boosting may allow the score to rise above 100, but never below 0.
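A minimal sketch of this composition, assuming the behavior described above
(ranking scaled by N percent, boost adjustments added afterwards, final score
clamped at 0 but allowed to exceed 100):

def final_score(ranking_score, n_percent, boost_adjustments):
    score = ranking_score * n_percent / 100.0   # SCORE[N] multiplier
    score += sum(boost_adjustments)             # BOOST[...] terms
    return max(0.0, score)                      # never below 0

print(final_score(70, 100, [-10, +15]))  # 75.0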

Query Boost
This boost method is used to adjust the relevance based on whether an object
matches query clauses. For illustration, consider the following example…
SELECT "OTObject" where "animal" ORDEREDBY Score[100] "dog"
BOOST[-10] "cat" BOOST[+15] ("t-rex" and "evolution")
BOOST[+%40]

The query will match items containing the text “animal”. However, we are less
interested in objects that also contain the text “dog”, so 10 is subtracted from the
relevance score. The user likes cats, so if the result contains the text “cat”, then we
add 15 to the score. If the result contained both “dog” and “cat”, then the net
adjustment would be +5. The full text clauses do not need to be simple, as shown
with the dinosaur adjustment. The dinosaur adjustment also illustrates that the
relevance can be boosted by a relative percentage. The text clause can also specify
text metadata regions and include complex parameters…
SELECT "OTObject" where "accident" ORDEREDBY Score[100]
([region "model"] in ("ford","Toyota",”gm") and [region
"Date"] > "-2m") BOOST[+15]

Date Boost
This boost method is used to adjust the relevance based on how closely the value in
a Date region matches a target date. Syntax is…
SELECT … ORDEREDBY Score[100]
BOOST[Date,"region","target",range,adjust]

Date is a keyword, indicating the boost method.

Region is the metadata field in the search index that should be tested.
Target is the date we are comparing against.
Range is an integer number of days on either side of the target for which a
boost adjustment should be applied.
Adjust is an integer value that specifies the maximum adjustment to be
applied if the value in the region is an exact match for the target. The
adjustment is reduced in a linear fashion based on distance from the target.
An example is in order.
SELECT … ORDEREDBY Score[100]
BOOST[Date,"OTCreateDate","20140415",60,40]

This boost essentially states: Examine the value in OTCreateDate for each matching
search result. If the value is April 15 2014, then add 40 to the relevance score. If the
value in the OTCreateDate field is within 60 days of April 15, then add a pro-rated
value. For example, if the value in OTCreateDate was May 30 (45 days away), then
adjust the relevance score by 10 (which is 40 * (60-45)/60).
The intent of this type of boost is to help users find items based on dates. A typical
use case might be “I am trying to find a document that I think was issued June of
2000, but maybe I am off by 6 months”. Any document in that +/- 6 month range gets
a boosted relevance, with a higher adjustment the closer to the target date.
Another common application would be adjusting for recentness, where the target
date is today, and all objects with dates within 90 days receive an adjustment.
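A minimal Python sketch of the pro-rated date boost described above (illustration
only; the parameters mirror the BOOST[Date,...] arguments):

from datetime import date

def date_boost(value, target, range_days, adjust):
    # Linear falloff: full adjustment at the target, zero beyond the range.
    distance = abs((value - target).days)
    if distance > range_days:
        return 0
    return adjust * (range_days - distance) / range_days

# OTCreateDate of May 30 2014 against target April 15 2014, range 60, adjust 40:
print(date_boost(date(2014, 5, 30), date(2014, 4, 15), 60, 40))  # 10.0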

Integer Boost
This boost method is designed to allow a range of values to be mapped to a
relevance contribution. For example, if there was a “usefulness” rating for a
document on a scale of 1 to 10, you could use that range to boost relevance on the
objects. Syntax is…
SELECT … ORDEREDBY Score[100]
BOOST[Integer,"region",lower,upper,adjust]

Integer is a keyword, indicating the boost method.


Region is the metadata field in the search index that should be tested.
Lower is an integer representing the starting point of the range of interest. Values
at the lower end of the range receive small adjustments. Values lower than the
lower limit are ignored, and have no adjustment.
Upper is an integer representing the high end of the range of interest. Values
near the upper limit receive large adjustments. Values higher than the upper limit
are ignored, and have no adjustment.
Adjust is an integer value that specifies the maximum adjustment to be applied if
the value in the region is an exact match for the Upper value. Between Lower
and Upper, the adjustment is scaled proportionately.

To illustrate the concept:


SELECT … ORDEREDBY Score[100]
BOOST[Integer,"Popularity",100,200,30]

This boost essentially states: Items with a Popularity value greater than 100 and less
than or equal to 200 will receive a relevance boost of up to 30. A value of 200 gets
the maximum adjustment of 30. A value of 120 would get a boost of 6 [ =30*(120-
100)/(200-100) ].
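The same scaling can be sketched in Python (illustration only; the parameters mirror
the BOOST[Integer,...] arguments):

def integer_boost(value, lower, upper, adjust):
    # Proportional adjustment inside (lower, upper]; zero outside.
    if value <= lower or value > upper:
        return 0
    return adjust * (value - lower) / (upper - lower)

print(integer_boost(120, 100, 200, 30))  # 6.0
print(integer_boost(200, 100, 200, 30))  # 30.0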

Multiple Boost Values


Multiple boost values can be requested. Note that each boost computation will
increase the search query time. A query with multiple boost values might look like
this:
SELECT … ORDEREDBY Score[100]
BOOST[Integer,"Popularity",100,200,10]
BOOST[Date,"SalesDate","20160115",16,5]
(in ("sales order","purchase order","PO") and [region "salesrep"]="doug") BOOST[-5]

Query versus Date / Integer Boost


You can use Date or Integer metadata regions in a Query Boost. For example, if you
simply wanted to boost the relevance of objects created on July 8 2015, you could
use:
SELECT … ORDEREDBY Score[100] [region "Date"] "20150708"
BOOST[+15]

So why are there separate methods for Dates and Integers? The Date and Integer
boost features allow the boost adjustment to be varied depending on how close the
values are to a target, versus the all-or-nothing adjustment that occurs with Query
Boosting. If you have applications where getting close is useful, rather than only
matching exactly, Date or Integer Boosting is superior.

Content Server Relevance Tuning


The section on “Relevance Computation” covers the generic principles available for
adjusting the search relevance scoring algorithm within OTSE. In this section, we
briefly look at some considerations for Content Server.
A small survey of Content Server customers revealed some interesting data. The
majority of Content Server installations are using the default relevance algorithm
settings. Content Server is a very flexible solution, used for a wide range of
applications. It is likely that many customers can improve their search experience by
understanding the effects of adjusting the relevance settings. This is particularly true
of customers that have upgraded from older versions of Content Server, and simply
bring forward their old configuration as part of the upgrade.

The first step is to consider your application and user expectations. In some cases,
search relevance won’t be an issue. For example, if you always sort results by date
or a metadata region, then search relevance scores are immaterial. If your primary
objective is building collections for eDiscovery applications, then gathering all search
results is far more important than which ones show up at the top of the list.
For most customers however, a review of their search expectations and some
Content Server 16 considerations are in order.

Date Relevance
This is usually an important factor. Content Server has many ‘Date’ fields, where the
date represents specific information. Consider some of the following:
Creation Date – usually refers to the date an object was added to the system. Often
this is a good value for relevance, but the creation date only refers to the first version.
Versioned objects which are updated will not change this date, which reduces its
value for these data types.
Version Creation Date – for versioned objects, such as documents, this is a good
choice. Each version of the object gets an updated version creation date. On the
other hand, many objects do not have the concept of a version creation date.
Modified Date – for some types of objects, such as folders, the modified date clearly
identifies when the folder has been created or updated. However, for other types of
objects, the modified date is too volatile. Depending upon other settings in Content
Server, the modified date may change for many reasons, and therefore does not
reflect the user expectation for when an object has truly changed.
Understanding which types of objects are most important in your application for
search relevance will help you determine which Content Server date values should
be used for date relevance scoring.
There are several other date fields in Content Server that may also be used. Review
the types of objects that are most important for your application, and choose dates
that best reflect creation or change that users would consider material to search
relevance. Recent experiments suggest that new default values for Content Server
using both the Creation Date and the Version Creation Date, with relatively high
weights, may be a good choice for typical document management and workflow
applications.

Boosting Object Types


Historically, a feature known as Object Type Ranking has been used with defaults
that provide a boost to objects based upon their MIME types or their Content Server
object subtypes. Usually, this is used to boost typical “Office” document formats.
This is very easy to review and optimize for the types of content that you want to
emphasize in your system. If boosting Microsoft Office documents based upon their
MIME types continues to be important, there is an important consideration here.
Historically, there were very few MIME types used for Microsoft Office documents.
With recent versions of Microsoft Office, this situation has changed. There are now
more than 20 MIME types officially used to represent Microsoft Office 2007 files
alone. The following chart is from the Microsoft TechNet website.

File Extension   File Type                                    MIME Type

.docx            Word 2007 document                           application/vnd.openxmlformats-officedocument.wordprocessingml.document
.docm            Word 2007 macro-enabled document             application/vnd.ms-word.document.macroEnabled.12
.dotx            Word 2007 template                           application/vnd.openxmlformats-officedocument.wordprocessingml.template
.dotm            Word 2007 macro-enabled document template    application/vnd.ms-word.template.macroEnabled.12
.xlsx            Excel 2007 workbook                          application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
.xlsm            Excel 2007 macro-enabled workbook            application/vnd.ms-excel.sheet.macroEnabled.12
.xltx            Excel 2007 template                          application/vnd.openxmlformats-officedocument.spreadsheetml.template
.xltm            Excel 2007 macro-enabled workbook template   application/vnd.ms-excel.template.macroEnabled.12
.xlsb            Excel 2007 binary workbook                   application/vnd.ms-excel.sheet.binary.macroEnabled.12
.xlam            Excel 2007 add-in                            application/vnd.ms-excel.addin.macroEnabled.12
.pptx            PowerPoint 2007 presentation                 application/vnd.openxmlformats-officedocument.presentationml.presentation
.pptm            PowerPoint 2007 macro-enabled presentation   application/vnd.ms-powerpoint.presentation.macroEnabled.12
.ppsx            PowerPoint 2007 slide show                   application/vnd.openxmlformats-officedocument.presentationml.slideshow
.ppsm            PowerPoint 2007 macro-enabled slide show     application/vnd.ms-powerpoint.slideshow.macroEnabled.12
.potx            PowerPoint 2007 template                     application/vnd.openxmlformats-officedocument.presentationml.template
.potm            PowerPoint 2007 macro-enabled template       application/vnd.ms-powerpoint.template.macroEnabled.12
.ppam            PowerPoint 2007 add-in                       application/vnd.ms-powerpoint.addin.macroEnabled.12
.sldx            PowerPoint 2007 slide                        application/vnd.openxmlformats-officedocument.presentationml.slide
.sldm            PowerPoint 2007 macro-enabled slide          application/vnd.ms-powerpoint.slide.macroEnabled.12
.one             OneNote 2007 section                         application/onenote
.onetoc2         OneNote 2007 TOC                             application/onenote
.onetmp          OneNote 2007 temporary file                  application/onenote
.onepkg          OneNote 2007 package                         application/onenote
.thmx            2007 Office system release theme             application/vnd.ms-officetheme

For new installations of Content Server, the use of MIME types and OTSubTypes for
Type Ranking is discouraged in favor of using OTFileType instead. OTFileType is
generated by the Document Conversion Server during indexing, and gives every
object a type such as "Microsoft Word", "Adobe PDF" or "Audio". This greatly
simplifies constructing the Type Rank, and improves accuracy.
Note that OTFileType was introduced in Content Server 10 Update 5, with some
minor tuning since then. If you have older data, then you may need to re-index the
objects. Details about the values for OTFileType are not included in this document.

Some of the more common values you may want to configure for Type Ranking using
the OTFileType region might be:
Word, Excel, PowerPoint, PDF, Folder, “Web Page”, Text, Audio, Video or Email.

Boosting Text Regions


For the majority of Content Server customers, boosting the name and description of
an object (OTName and OTDescription) is a reasonable approach. This basically
means that if the search keywords are in the name or description, the object gets
pushed higher in the rankings.
If users do not enter a name for the object, the file name of a document often
becomes the name of the object. However, the original file name may be different
than the managed object name. Because of this, you may also wish to consider
adding the actual file name to the boosted search regions. Be aware that this may
“double boost” objects where the file name is the same as the object name.
If you have a lot of content that is comprised of web pages, you may wish to add the
HTML keywords field (typically OTFilterKeywords) to the list of boosted text regions.
Remember to make sure that any regions that you boost here are also in the default
metadata region search list.

Default Search Regions


If you do not specify a region in which a term should be searched, the default
behavior is for OTSE to search within a list of regions. The relative rank of this
component is shared with the full text content weight. This is the default behavior
that a typical user will leverage in a basic search query; searching within specific
regions is generally an advanced search feature in most applications.
When using the default search regions, it is not necessary to find all the search terms
within a single region. For example, if the search term is blue butterflies in
the amazon, a potential match could have blue in the name, butterflies in the
description, and the remaining keywords in the body text.
Content Server ships with a number of default regions configured for relevance
searches. Default regions are searched if the user does not specify a region for a
WHERE clause. Default regions can simplify typical searches and improve
relevance, but each region added to the default list increases the time required to
perform a search. The choice of regions that should be included in the default
search regions is an important consideration when fine-tuning search relevance. As
a general rule, you should try to include any regions that contain overview, name or
descriptive values. Taxonomy labels, if used, are another good candidate. You should
definitely review these with an eye towards your expected use. Some examples…
Does the average user need to find workflow items? Perhaps workflow values
should be removed from the list.
If email messages are a key part of the managed content you wish to find, adding
the email sender, recipient or subject fields to the default search regions may be
a good idea.
If you have added custom applications with descriptive metadata regions, you
may want to consider whether any of those regions should be in the default
search region list.

Are HTML pages a key part of your data? Consider adding the HTML keywords
region to the default search regions.
Some applications, such as eDiscovery, are biased towards searching all possible
regions. The challenge is this: more default search regions results in slower query
performance. For small numbers of regions, this is not an issue. For eDiscovery,
with thousands of potential Microsoft Office document properties, this performance
degradation can be material. The “Aggregate-Text” features of the search engine
may be helpful for these situations.

Using Recommender
Recommender is a feature of Content Server which monitors user activity, and
leverages the Object Ranking feature of the search engine to boost the relevance
scores of certain objects. Specifically, the feature of Recommender known as
“Object Ranker” is responsible for computing relevance adjustments and triggering
the appropriate indexing updates. You can review the use of Recommender in the
Content Server documentation.

User Context
Statistically, a user is more likely to be searching for objects that meet one or more of
these types of criteria…
• It is located in my personal work area;
• It was created by me;
• It is located in the folder in which I am currently working;
• It is located in a sub-folder of my current location;
• It is in a location where I was recently working;

OTSE has no knowledge of the user performing a search. Content Server, however,
is aware of the user identity and location. New to Content Server 16, the relevance
boost features allow user context to be incorporated in relevance computation. For
example, each query could specify that items with the current user in the “created by”
metadata fields are emphasized, or that objects in specific locations and folders have
their relevance score enhanced. You should review these configuration settings in
Content Server, and adjust them to reflect your expected user behaviors.

Enforcing Relevancy
Adding Ranking Expressions to a search query results in more work for the Search
Engines. If the default relevance computation is performed (based on the WHERE
clause), then no material penalty occurs since the values are already retrieved as
part of the query evaluation. The Search Engines have an optimization that will
determine if the Ranking Expression is the same as the WHERE clause, in which
case the Ranking Expression computation is skipped. In updates of Content Server
prior to December 2015, the Ranking Expression differed from the WHERE clause,
which reduces query performance.

There is a configuration setting that will ignore the Ranking Expression and enforce
use of the default WHERE clause ranking. Effectively, this is the same as using
ORDEREDBY RELEVANCY in the query. For older versions of Content Server that
have installed the 2015-12 or later update, this setting can be used to achieve a
modest search query performance gain. In the search.ini file [Dataflow_] section, add:

ConvertREtoRelevancy=true

Extended Query Concepts


What do words like relevance, stemming and phonetic matching really mean? Key
search concepts and their implementation within OTSE are found within this section.

Thesaurus
OTSE has the ability to search not only for keywords, but for synonyms of keywords,
using a thesaurus system. This section of the document explores the use of a
thesaurus with OTSE.

Overview
Searching with a thesaurus specified allows a query to match synonyms of words.
For example, the English thesaurus might have an entry for house which includes
“home”, “residence” and “dwelling”. A search for the keyword “house” would also
match any of those words if the thesaurus is enabled.
The list of synonyms to be used is contained within a thesaurus file. You can have
many thesaurus files, and each query can specify which thesaurus file should be
used. In practice, this flexibility is generally used to select a thesaurus containing
synonyms for a particular language. OTSE ships with a number of standard
thesaurus files: English, French, German, Spanish, and Multilingual.
It is also possible to use a thesaurus to help find specialized words in specific
applications. For example, a medical thesaurus file could contain alternate names for
drugs, symptoms or other medical terminology. A custom corporate thesaurus could
contain synonyms for products, part numbers, customer names or departments.

Thesaurus Files
Thesaurus files should be placed in the “config” directory. They should follow a
naming convention of “thesaurus.xxx”, where xxx defines the language and identifies
the thesaurus file as provided in the search query. By convention, OpenText default
thesaurus files are provided for English, French, German, Spanish and Euro
(multilingual) as follows:
thesaurus.eng
thesaurus.fre
thesaurus.ger
thesaurus.spn
thesaurus.eur
Thesaurus files are stored in a proprietary file format which is optimized for
performance and size. These files are created using a thesaurus builder utility, which
converts a thesaurus from the Princeton WordNet format to the OpenText thesaurus
format.

Thesaurus Queries
In order to leverage a thesaurus in a search query, you choose the thesaurus using
the “SET” command, and specify thesaurus use for a search term using the
“thesaurus” operator in the query select statement.
set thesaurus eng
select "OTName" where thesaurus "home"
The value for the language (in this case “eng”) must match the extension of the
thesaurus file. This is an optional statement. The default language setting for the
Thesaurus is English.
The “thesaurus” operator in the select statement only applies to simple single terms –
it cannot be combined with other features such as proximity, stemming, wildcards or
phrase search.

Creating Thesaurus Files


OTSE contains several utility functions that will create thesaurus files from different
types of data sources.
One supported format is the Princeton WordNet format, a well documented format for
representing thesaurus information. Thesauri for many languages and purposes are
available in this format, many of which are available at no cost. You can create or
edit a WordNet file to create a custom thesaurus, then convert it to OpenText format
using the utility.
The sample syntax here assumes that you are running from the <OTHOME>/jre/bin
directory. The command line for converting a WordNet thesaurus to OpenText format
is:
java -Xmx500M -classpath ../../bin/otsearch.jar
com.opentext.search.modifiers.WordNetToThesaurus
y:/sourceWordNet ../config/destThesaurus
The second supported format is the EuroWordNet format. This format is used to link
multiple thesaurus files together as a package. OTSE has two utilities which can
build a thesaurus from a EuroWordNet format. The first of these is used to extract
thesaurus information for a single language, with the form:
java -Xmx500M -classpath ../../bin/otsearch.jar
com.opentext.search.modifiers.EuroWordNetToThesaurus
y:/eurowordnet/French/Text ../config/thesaurus.fre
The second utility is used to build a multilingual thesaurus, incorporating all the
available EuroWordNet languages into a single thesaurus. Note that this does not
allow you to search in one language and find synonyms in the other languages. The
syntax for building a multilingual thesaurus is:
java -Xmx500M -classpath ../../bin/otsearch.jar
com.opentext.search.modifiers.MultilingualEuroWordNetToThes
aurus y:/eurowordnet/General/Multi ../config/thesaurus.eur
OTSE is also capable of generating a thesaurus file from a more generic XML file
representation. The syntax is:

java -Xmx500M -classpath ../../bin/otsearch.jar
com.opentext.search.modifiers.XMLToThesaurus -name
thesaurusname -infile inputxmlfile
OTSE can convert an existing thesaurus file to this XML format:
java -Xmx500M -classpath ../../bin/otsearch.jar
com.opentext.search.modifiers.ThesaurusToXML -name
thesaurusFileName -outfile xmloutputfile
The form of the XML file to generate or read in the thesaurus management utilities is
shown below. This example is limited to a single entry, the <Headword> section
would be repeated once per entry.
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<OTThesaurus>
<Headword>
<Headword_Text>answer</Headword_Text>
<Meaning>
<Meaning_Text>noun meaning</Meaning_Text>
<PartOfSpeech>noun</PartOfSpeech>
<Synonym>response</Synonym>
<Synonym>reply</Synonym>
<Synonym>acknowledgement</Synonym>
<Synonym>riposte</Synonym>
<Synonym>return</Synonym>
<Synonym>retort</Synonym>
<Synonym>repartee</Synonym>
</Meaning>
<Meaning>
<Meaning_Text>verb meaning</Meaning_Text>
<PartOfSpeech>verb</PartOfSpeech>
<Synonym>respond</Synonym>
<Synonym>reply</Synonym>
<Synonym>rebut</Synonym>
<Synonym>retort</Synonym>
<Synonym>rejoin</Synonym>
<Synonym>écho</Synonym>
</Meaning>
</Headword>
</OTThesaurus>

Content Server Considerations


Within Content Server, the search thesaurus features are abstracted. Labels for
languages such as “English” and “French” are mapped to “eng” and “fre” using
configuration files. The search operator in LQL is “qlthesaurus” instead of
“thesaurus”. There is a separate configuration setting for the default thesaurus
language. This thesaurus configuration is covered in more detail in the Content
Server administrator documentation.

Stemming
Stemming is a method used to find words which have similar root forms, called
“stems”. The easiest way to explain stemming is by example.
The words flowers, flowering and flowered all have the same stem: flower. When
stemming is applied during a search, then a search for one of these words would
match any of these words.
The special terminology “stem” is used since the common element is not always a
word. For instance, for algorithmic reasons, the stem for “baby” might be “babi”,
which facilitates matching words such as babied or babies.
Stemming algorithms are not foolproof. In our example of “flower”, the stemming
algorithm might identify that “flow” is the stem – and try to find matches such as
flows, flowing or flowed. Stemming is a useful tool, but cannot always be relied upon
to behave as a user expects.
The concepts that make stemming possible are not applicable to all languages. In
general, Western European languages can use stemming, since plurals, tenses and
gender are typically formulated in terms of appending different endings to root forms
of words. Accordingly, the algorithms for stemming are different for each language.
There are many languages, such as East Asian languages, where the concept of
stemming does not apply.
Because of the language-specific aspects of stemming, a search engine has many
options available for how stemming should be implemented. One approach is to
stem words during indexing, and create an index of word stems. This can result in
very fast searches (since the stems are all pre-computed), but requires that you know
the language at index time. If only one language will ever be used, this is
acceptable. In multi-language environments, it is less useful. Some search
implementations will guess at the language during indexing and stem accordingly,
which is statistically useful but not always correct.
OTSE applies stemming rules at query time. This reduces the size of the index
(since word stems are not stored), but has a query performance penalty since the
stems for candidate words must be computed for each query.
The other key advantage of query-time stemming is that true multi-lingual stemming
can be used. Consider an index containing the following words:
Arrives (in English documents)
Arrivons (in French documents)
Arriva (in Spanish documents)
Each of these words might have the same stem (“Arriv”). By applying the stemming
algorithm at query time, the search system can differentiate between the English,
French and Spanish forms of the word based on the language preferences used for
stemming, since the English algorithms would not generate query expansions for the
words arrivons or arriva. This approach is not perfect, since in many cases similar
languages have common rules. For example, the French word “arriver” would match
the English stem for "Arrived", since the suffix "er" is also common in the English
language.

OTSE supplies stemming rules for 5 languages: English, French, German, Spanish
and Italian. When building a search query, you request the stemming rules in the
“SET” command, using the language preference. To request a match for keyword
stems, use the “stem” operator on a keyword in the select statement:
SET language fre
select "OTName" where stem "arrive"
The stem operator does not work in conjunction with other operators, such as
proximity, wildcards and exact phrase searches.

English Stemming Rules


The basic objective for English stemming is to find singular and plural forms of words.
Additional rules prevent short words from being stemmed. A summary of the
expansion rules is outlined here:
-s rule:
    plural: add -s if the word doesn't end in -s
    singular: remove -s if the word doesn't end in -ss
-es rule:
    plural: add -es if the word ends with -s, -z, -x, -sh, -ch, -o, but not if it
    already ends in -es
    singular: remove -es for these cases
    address common misspellings:
        replace -s with -es if ends with -zs, -xs, -shs, -chs, -os
        replace -es with -s if ends with -zes, -xes, -shes, -ches, -oes
-y rule:
    plural: replace -y with -ies if a consonant precedes the -y, e.g. baby/babies
    singular: replace -ies with -y if a consonant precedes the -ies
suffix substitutions:
    -f / -ves, e.g. wolf/wolves
    -fe / -ves, e.g. knife/knives
    -man / -men (with no length check), e.g. man/men, workman/workmen
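A minimal Python sketch of these singular/plural expansions, assuming only the rules
listed above (the shipped implementation has additional guards, such as minimum
word lengths, and the -f/-fe/-man substitutions, that are omitted here):

def english_expansions(word):
    forms = {word}
    # singular candidates
    if word.endswith("ies"):
        forms.add(word[:-3] + "y")                      # babies -> baby
    elif word.endswith(("zes", "xes", "shes", "ches", "oes")):
        forms.add(word[:-2])                            # boxes -> box
    elif word.endswith("s") and not word.endswith("ss"):
        forms.add(word[:-1])                            # flowers -> flower
    # plural candidates
    if word.endswith("y") and word[-2:-1] not in "aeiou":
        forms.add(word[:-1] + "ies")                    # baby -> babies
    elif word.endswith(("z", "x", "sh", "ch", "o")) and not word.endswith("es"):
        forms.add(word + "es")                          # box -> boxes
    elif not word.endswith("s"):
        forms.add(word + "s")                           # flower -> flowers
    return forms

print(english_expansions("baby"), english_expansions("box"))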

French Stemming Rules


Singular and plural version expansion is applied. Note that stemming is applied to
words after tokenizing, which has normalized the text to remove accents. Expansion
rules are:
drop leading "l'" or "d'" (e.g. "l'oiseau" and "d'oiseau"
are replaced with "oiseau")
apply the following suffix substitutions:
-au / -aux, e.g. bureau/bureaux, noyau/noyaux,
oiseau/oiseaux
-eu / -eux, e.g. cheveu/cheveux
-al / -aux, e.g. animal/animaux

The Information Company™ 122


Understanding Search Engine 21

-ou / -oux, e.g. bijou/bijoux


same -s rule as English, e.g. famille/familles
same -es rule as English, e.g. sandwich/sandwiches

Spanish Stemming Rules


Singular and plural version expansion is applied. Note that stemming is applied to
words after tokenizing, which has normalized the text to remove accents. Expansion
rules are:
plural: add -es if word does not end in -e, e.g. borrador/borradores, ley/leyes,
tisú/tisúes
singular: remove -es if the resulting word would not end in -e
suffix substitution: -z / -ces, e.g. voz/voces
same -s rule as English, e.g. libro/libros, pera/peras, café/cafés, camping/campings

Italian Stemming Rules


Singular and plural version expansion is applied. Note that stemming is applied to
words after tokenizing, which has normalized the text to remove accents. Expansion
rules are:
drop leading "l'" or "d'" (e.g. "l'amico" and "d'amico" are
replaced with "amico")
apply the following suffix substitutions:
-o / -i, e.g. gelato/gelati
-a / -e, e.g. casa/case
-e / -i, e.g. bicchiere/bicchieri
-co / -chi, e.g. casco/caschi
-go / -ghi, e.g. lago/laghi
-ica / -iche, e.g. amica/amiche
-ga / -ghe, e.g. paga/paghe
-cia / -ce, e.g. faccia/facce
-cio / -ci, e.g. bacio/baci
-zio / -zi, e.g. negozio/negozi
-gio / -gi, e.g. vantaggio/vantaggi

German Stemming Rules


Singular and plural version expansion is applied. Note that stemming is applied to
words after tokenizing, which typically expands umlaut characters (äpfel → aepfel)
and expands the sharp S (fußball → fussball). Expansion rules are:
add-umlaut rule to make plural:
i.e. to the last "a", "o", "u" or "au" (not including
"u" which are part of "au") add an umlaut if one was
not already there (i.e. replace with "ae", "oe", "ue"
or "aeu" (respectively))
e.g. Apfel/Äpfel, Boden/Böden
drop-umlaut rule to make singular:
    i.e. replace the last "ae", "oe" or "ue" with "a", "o" or "u" respectively
plural:
add -e or -en or -er if does not end with e or
e+consonant, add -n otherwise (except when already
ends with -n); also, when adding -e or -er, create
another variant of it with the add-umlaut rule
e.g. Hund/Hunde, Zeit/Zeiten, Kleid/Kleider,
Kugel/Kugeln, Gans/Gänse, Koch/Köche, Fluss/Flüsse,
Maus/Mäuse, Haus/Häuser
singular:
remove -e or -en or -er if the resulting word would
not end with e or e+consonant, remove -n otherwise
(except when ends with -nn); also, when removing -e
or -er, create another variant of it with the drop-
umlaut rule
suffix substitution:
-in / -innen, e.g. Lehrerin/Lehrerinnen
plural:
add -se if ends with -nis, and create another variant
of it with the add-umlaut rule, e.g.
Erlebnis/Erlebnisse
singular:
drop -se if ends with -nisse, and create another
variant of it with the drop-umlaut rule
same -s rule as English, e.g. Auto/Autos

Alternative Stemming Algorithm


There are two implementations of stemming available within OTSE. The default
implementation works by determining the stem of a word, then creating candidate
singular and plural forms to test against. For instance, if you search for “cover”, it
forms the stem “cover”, then tests for “cover” and “covers”. This implementation is
optimized for speed, and only expands the word list to common forms.
There is a second implementation which is much more rigorous and aggressive,
which tests each word in the dictionary to see if it is a possible match for the
keywords. This second implementation is considerably slower, and also matches
variations such as “coverous”, “coverly”, “covered95b”. For most customer
applications, this more aggressive form of stemming is not necessary or appropriate.
This form of stemming was the default in OT7.
The older, more aggressive form of stemming can be enabled in the search.ini file
within the [Dataflow_] section:
UseOldStemmingRules=true

Content Server and Stemming


Within Content Server, stemming for a term can be enabled in several ways. Firstly,
the administrator manages a global setting from the search administration pages that
enables stemming on all simple search bar keyword searches. Within the LQL
language, the prefix QLSTEM is used to specifically request stemming for a keyword.
In the advanced search pages, the modifier “relatedto” invokes stemming.
If stemming is enabled by default, it can usually be disabled in search bar queries by
adding quotation marks around each term. For example, search for “large” “size”.

Phonetic Matching
Phonetic matching, or “sounds like” algorithms, are used to match words that have
similarities when spoken aloud. There are many possible algorithms that can be
used for phonetic matching, and OTSE contains a phonetic matching algorithm which
is a variation of the classic US Government ‘Soundex’ algorithm.
Phonetic algorithms are primarily designed to help match surnames, particularly
where the names have been transcribed with potential errors. Matching surnames is
of particular interest for a number of reasons:
Many surnames were recorded as phonetic equivalents from other languages, often
with variations in spelling.
A name which sounds generally similar may in fact have different spelling, particularly
with language variations. Consider the dozens of variations of the name “Stephen”
that exist, including Steven, Steffen, Steffan, Stephan, Steafán, and Esteban.
There is no master dictionary that contains a “right” way to spell a surname, so it is
common for people hearing a name to write it as they think it should be spelled.
Smith, Smithe, and Smyth are all legitimate surnames – you cannot perform spelling
correction, since they are all correct.
In many applications, names are recorded over a poor quality phone connection,
which can introduce errors. I say ‘Pidduck’, the recipient hears and records ‘Pittock’.
All phonetic matching algorithms share some common attributes. They relax the
rules for matching search terms in certain ways. The result is more terms matching,
but with a decrease in accuracy. This decrease occurs because the algorithms can
match words which are clearly not related, despite having similar phonetic properties.
Matching “Schmidt” when querying for “Smith” makes sense. But you also need to
be prepared for false matches, such as finding “Country” when searching for
“Ghandi”.
Phonetic matching is generally NOT recommended for general keyword searching. It
is intended for use with names, and works best when applied against a metadata
region which is known to contain names. Otherwise, the number of false positives
will almost certainly be frustrating to a user.
There is one phonetic matching algorithm within OTSE, a modified “Soundex”
implementation. This algorithm is optimized for English. However, the algorithm is
sufficiently generic that it does provide useful results for many Western European
languages. The phonetic matching does not work for non-European languages.

To request a phonetic match for a keyword in a query, use the modifier ‘phonetic’:
Select X where [region "UserName"] phonetic "smith"
A phonetic modifier can only be applied to a simple keyword, and cannot be
combined with other features such as proximity, wildcards, regular expressions or
exact phrase searches.
There are two dictionaries of terms within the search engine, the primary dictionary
for terms that are “typical” western language words (Western European characters,
no punctuation or numbers), and the secondary dictionary for everything else.
Phonetic matching searches only for terms that meet the criteria for inclusion in the
primary dictionary.
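For illustration, here is the classic Soundex algorithm in Python. This is the
well-known public algorithm, not OTSE's modified variant, but it shows why "Smith",
"Smyth" and "Schmidt" can match phonetically:

def soundex(name):
    # Map consonants to digit classes; vowels and h/w/y carry no code.
    groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
    def code(ch):
        return next((d for letters, d in groups.items() if ch in letters), "")
    name = name.lower()
    result, last = name[0].upper(), code(name[0])
    for ch in name[1:]:
        d = code(ch)
        if d and d != last:
            result += d
        if ch not in "hw":   # h and w do not separate duplicate codes
            last = d
    return (result + "000")[:4]

print(soundex("Smith"), soundex("Smyth"), soundex("Schmidt"))  # S530 S530 S530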

Exact Substring Searching


A particularly difficult use case for searching involves finding exact substrings within
text metadata regions. While regular expressions can find exact substrings, they
have two major restrictions: they are potentially very expensive (slow) and only work
on a single token. Starting with SE10.5 Update 2015-09, a new capability for efficient
exact substring matching has been added that addresses these limitations.
For example, if a metadata region VendorPart has a value such as:
“Vendor_Acme:SSU 876MJACF/24 3.5inchesus”

Users might be accustomed to working only with a subset of the complete value, and
expect to find matches using arbitrary substrings of the value, such as:
Acme:SSU 87
ACF/24
F/24 3.5inches

The traditional searches using tokens, regular expressions and Like operators are
not sufficient.

Configuration
The implementation of exact substring matching is configured on a per-region basis,
and is valid only for text metadata regions. A custom tokenizer ID is configured for
the region in the LLFieldDefinitions.txt file; the custom tokenizer is specified in the
search.ini file; the custom tokenizer is constructed to encode the entire value using 4-
grams.
For example, in search.ini file [DataFlow_xxx] section:
RegExTokenizerFile2=c:/config/otsubstringExact1.txt

In LLFieldDefinitions.txt:
TEXT MyRegion FieldTokenizer=RegExTokenizerFile2

Additional details on configuring custom tokenizers are described in the Tokenizer
section of this document.

Note that there is an alternative mechanism available for specifying the entry in the
LLFieldDefinitions.txt file. The search.ini file can be used to logically append lines to
the field definitions file at startup (the file is not actually modified). This alternative
can be used by Content Server to control the configuration, since Content Server
does write the search.ini file.
ExtraLLFieldDefinitionsLine0=TEXT MyRegion FieldTokenizer=RegExTokenizerFile2

Re-indexing is not required. When the Index Engines are next started, a conversion
of the index for the region will be performed. You can apply or remove a custom
tokenizer this way for existing data.
By convention, tokenizers should be located in the config\tokenizers directory.
Content Server uses this location to present a list of available tokenizers to
administrators.

Substring Performance
A region indexed for exact substring matching will require about 8 times as much
space for storing the index for that region. In a typical situation, with only a few
regions configured this way, the storage requirement difference will be minimal.
Exact substring configuration is only possible when the “Low Memory” mode
configuration is enabled for text metadata.
When a region is configured for exact substring matching, every query is equivalent
to having wildcards on either side of the query string. In the example above, a
search for "SSU 87" is effectively a search for "*SSU 87*". No other operators
(comparisons, regular expressions, etc.) are allowed with regions configured for
exact substring searches.
The exact substring is usually much faster than a regular expression because of the
way the indexing is performed. By way of example, assume the indexed value is:
abcdefghijk. Using 4-grams, the following tokens are added to the dictionary: abcd
bcde cdef defg efgh fghi ghij hijk. You want to search for cdefgh. The query engine
will first look for the first 4-gram, "cdef", which is fast because it is in the dictionary. It
then looks for all 4-grams starting with "gh**", and finds values with adjacent "cdef +
gh**" 4-grams. While there may be a number of 4-grams in the region beginning
with "gh", this is much more efficient than scanning the entire dictionary with a
regular expression to find matches.
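The lookup strategy can be sketched in a few lines of Python (illustration only; the
real engine also verifies that the matched grams are adjacent in the stored value):

def four_grams(text):
    return {text[i:i + 4] for i in range(len(text) - 3)}

index = four_grams("abcdefghijk")   # abcd bcde cdef defg efgh fghi ghij hijk

def candidate_match(query, index):
    for i in range(0, len(query) - 3, 4):        # whole 4-grams of the query
        if query[i:i + 4] not in index:
            return False
    tail = query[-(len(query) % 4):] if len(query) % 4 else ""
    # A short tail such as "gh" becomes a prefix test over indexed grams.
    return not tail or any(g.startswith(tail) for g in index)

print(candidate_match("cdefgh", index))   # True
print(candidate_match("cdxy", index))     # False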

Substring Variations
The choice of Tokenizer determines the behavior of substring matching. The usual
suggested tokenizer would make the data case-insensitive, but otherwise leave all
other characters unchanged, including whitespace and punctuation.
Case sensitivity requires additional mappings in the tokenizer file. By default, the
tokenizer performs upper to lower case conversion. To preserve case sensitivity, add
a section to the start of a tokenizer file:
mappings {
0x41=0x41
0x42=0x42
0x43=0x43
}
Include a mapping to itself for every character that requires case preservation.
Ensure that suitable mappings for non-ASCII characters are included if those are
important for your application.
The other use case to be aware of is punctuation normalization or elimination.
Consider the example which includes ACF/24 in the value. If users cannot be
expected to use the slash character "/" correctly, there are a couple of variations that
may be used. Normalization would convert all (or a desired set) of punctuation
characters to a standard value, perhaps underscore. The string would be indexed as
if it had the value:
"Vendor_Acme_SSU_876MJACF_24_3_5inchesus"

If the user searches for Acme-SSU or ACF:24, the engine would similarly convert the
queries to “Acme_SSU” and “ACF_24”, which would then match.
Similarly, elimination strips all whitespace and punctuation from index and query
values. The index is built from:
“VendorAcmeSSU876MJACF2435inchesus”

With elimination, the test queries “Acme-SSU” or “ACF:24” are handled as if they
were “AcmeSSU” or “ACF24”, again generating a match. Eliminating punctuation is
generally better at finding a match (since it also handles extraneous whitespace), but
is not as precise – potentially returning some false positives.
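Both variations can be sketched with hypothetical helpers (Python, for illustration;
this is not the tokenizer configuration syntax):

import string

PUNCT = str.maketrans({c: "_" for c in string.punctuation + " "})

def normalize(value):
    # Punctuation and spaces collapse to a standard character.
    return value.lower().translate(PUNCT)

def eliminate(value):
    # Punctuation and whitespace are stripped entirely.
    return "".join(ch for ch in value.lower() if ch.isalnum())

print(normalize("ACF:24"))    # acf_24  -- matches the normalized index
print(eliminate("Acme-SSU"))  # acmessu -- matches the eliminated index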

Included Tokenizers
Customizing a tokenizer can be a challenge. To facilitate substring matching, there
are 3 tokenizers provided with OTSE that cover the most common exact substring
requirements, in addition to the default tokenizer.
ExactSubstringTokenizer.txt
This tokenizer is case insensitive, but otherwise preserves all punctuation and
spaces.
TolerantSubstringTokenizer.txt
This tokenizer eliminates all punctuation and whitespace. The strings “12-3.MY
name” and “123-m&n_amE” are equivalent, being interpreted as “123myname” in
both queries and indexed values.
EmailAddressTokenizer.txt
This tokenizer treats email addresses in common forms as a single token. With
the traditional tokenizer, bob.smith@acme-corp.com would be 5 tokens, as the
punctuation would be interpreted as white space. The email address tokenizer
leaves the email address intact as a single token. Searching on a single token for
email is faster and more accurate than a phrase search for multiple tokens.

Preserving other Query Features


As noted, once a region is marked for use with exact substrings, you cannot use
other search methods on the region. If you need to have both substring and regular
search, consider using the AGGREGATE feature.
AGGREGATE data types are used to create a new searchable region from a set of
existing regions, building only an index and not duplicating value storage. If our text
region is “VendorPart”, then we can create an AGGREGATE for substring searches:
AGGREGATE-TEXT VendorSubstring VendorPart FieldTokenizer=RegExTokenizerFile2
In this scenario, we can now perform regular searches against the original region
“VendorPart”, and searches against the region “VendorSubstring” will use the exact
substring searching technique.

Part Numbers and File Names


A new feature for OTSE is a set of techniques for optimizing search queries for fields
not normally constructed for human readability. Text metadata regions need to be
configured for this behavior, which is subsequently invoked using the “Like”
operator:
[region "PartNumber"] like "widget14"

Problem
Part numbers and file names are primary examples. A human might describe a part
for a machine as: “the 14 centimeter widget that fits jx27 engine”. Instead, we create
names along the lines of “PN3004/widget-14JX27”. Search technology that is
trying to formulate tokens and patterns based on regular sentence structure and
grammar rules will struggle to match these types of values.
Similarly, we create file names such as “SalesForecast2013-europeFRANCE
Rene&Gina1.doc”. With file uploads and Internet encoding, this can even inject
strings such as %20 or &amp; into the metadata values. Again, algorithms designed
to parse human language have difficulty succeeding with these metadata fields.

Like Operator
To accommodate these types of metadata search requirements, OTSE includes the
concept of a “Likable” region. If you have metadata that fits the problem profile, a list
of the appropriate metadata regions can be declared as Likable in the search.ini file:
OverTokenizedRegions=OTFileName,MyParts

This instructs the Index Engines to build a “shadow” region derived from the original
metadata region, but using a very different set of rules for interpreting the metadata
and building tokens. For example, the traditional indexed tokens for our sample part
number and file name values might be:
pn3004 widget 14jx27
salesforecast2013 europefrance rene gina1 doc

The tokens indexed in the shadow regions might be:


pn 3004 widget 14 jx 27
sales forecast salesforecast 2013 europe france
europefrance rene gina 1 doc

When a query using the like operator is processed, the query is also tokenized
using the alternate rules, and is tested against the shadow region instead of the
original region. In this case, the following queries would succeed that would typically
fail using normal human language tokenizing rules:
where [region "OTFileName"] like "gina 2013 sales forecast"
where [region "MyParts"] like "JX27 widget 3004"

If the like operator is requested for a region that does not support it, then the
operator is treated as an “AND” between the provided terms, and applied against the
original region instead of the shadow region.

Like Defaults
Since many users will not understand the requirement to specify the “like” operator
in a query, a configuration option is provided in the search.ini that allows the use of
Like as the default operator.
UseLikeForTheseRegions=OTName,OTFileName

If a query for a token or phrase is requested against one of these regions and there is
no explicit term operator provided, then Like will be assumed. This also works if the
region is in the list of default search regions. For example, the common Content
Server region OTName can be both a default search region and have the Like
operator applied by default. Note that Content Server can be configured to inject a
default operator (such as stem) into a query term, which would override the use of
like by default.
There is also a configuration setting that controls whether stemming should be used
when searching with Like queries. By default, this feature is active. If there is a term
component in a query that is 3 letters or longer, then either the singular or plural form
will match. To disable this feature and only match the entered values, in the
[Dataflow_] section of the search.ini file:
LikeUsesStemming=false

Shadow Regions
The synthetic shadow regions built to support the Like operator have some
properties of interest. They are created when the Index Engines start based upon
the search.ini settings. This adds some time to startup, but allows the Like feature
to be applied to existing data sets without re-indexing. The shadow regions are
saved on disk as part of the index until removed from the list of over-tokenized
regions, which also occurs on Index Engine restart.
The shadow regions have the same names as their masters, appended by
_OTShadow. If the region OTName is configured as likable, then the synthetic region
OTName_OTShadow is created. These regions consume space in the index. Due
to the extra tokenization, the space requirements for shadow regions are higher than
for equivalent normal text regions.
The shadow regions will show up in lists of regions, and are also directly queryable or
retrievable. Although not the intended use, it is valid to perform other queries on
these regions.

Token Generation with Like


A key element of the Like behavior is the aggressive generation of tokens from a
metadata value. Unlike most other search operators which are algorithmically
specific, the Like behavior attempts to “think like a person”. The rules are more
fluid, and are subject to change over time as more useful cases are identified. Some
of the current tokenization rules include:
• Breaking tokens at transitions between letters and numbers. For example,
14red9 (14 red 9)
• Breaking tokens at punctuation. For example, red.blue-green (red blue
green).
• Breaking tokens at upper to lower case transitions, and also keeping the
conjoined token. For example, HIThat (hi that hit hat hithat)
• Breaking tokens at single character upper case transitions. For example,
MyHouse (my house)
• Breaking number strings at punctuation, and also retaining the joined string. For example,
17,345 (17 345 17345).
• Removing leading zeros from number strings so that they match with or
without the zero. For example, 0078 ( 78 ).
• Converting URL encoded values to UTF8. For example go%2dfish (go-fish).
• Converting HTML encoded values to UTF8. For example my&nbsp;house
(my house).
• Identifying strings that appear to be URLs (may start with www or http) and
discarding parameters after a question mark.
• Truncating long strings to a maximum of about 256 characters.
Similar rules are applied to the query terms during a search, although usually only the
“best” interpretation of token splitting is used, instead of keeping the variations. In
some cases, alternatives will be used. For example, a search for BUCKShot would
be converted to a search for:
((bucks and hot) SOR (buck and shot) SOR (buckshot))

Limitations
Multi-value text regions have some limitations in behavior. The aggregate strings
from all the values are gathered together to create a single region that is tokenized
for the like operator. This means that there is no ability to combine the like
operator with specific values, such as might be expected when using attributes to
represent languages.
For example, a multi-value text region might have the English value “RedCar88” and
the French value “VoitureRouge88”. The like operator does not support examining
only one language. A search for “like RedVoiture” would match this object.
A second limitation is hit highlighting. Hit highlighting operations are not processed
using the like operation, which means a likely mismatch between tokens in the
original metadata value and the tokens matched in the shadow region during a query.
It is unclear what the correct operation should be, given the existence of the shadow
regions and one-to-many relationships between tokens and the original values. At
this time, hit highlighting ignores the like aspect of a query.
Most importantly, the like operator may generate many more search results than
expected. Due to the nature of part numbers and the behavior of the tokenization,
many small and common tokens can be generated. The like operator is biased
towards finding candidate search results, not towards filtering results to a most
probable match.
At this time there is no relevance adjustment based on the quality of the match in a
shadow region.

User Guidance
The description of the like operator so far provides useful background on
configuration and applications, but little practical advice for an end user. The usual
caveat applies: this guidance will not fit every situation.
Suggestions for a user trying to maximize success using a metadata region with the
like operator may include:
• Select fragments that appear to be logically distinct.
• Use spaces in place of punctuation.
• Do not enter a fragment of a longer numeric sequence as a search term.
• Do not enter a fragment of a text sequence as a search term.
• Do not use wild card operators.
• Providing more terms or portions of the part number yields more precise results.
An example using a fictitious part number string in a metadata region:
PN4556-WidgetRED01395b/v5.68.99 $2,867

The following queries would be successful:


4556 red
Widget 1395b
Pn v5.68.99
5 68 99
2867
867
On the other hand, these queries would fail:
idget [use widget]
2,86* [wildcards not permitted]
395 [fragment of 1395]

Email Domain Search


A common requirement in search discovery applications is finding email messages
sent to or from various companies. In the case of Content Server, the relevant
metadata regions contain lists of email addresses, possibly as multi-value addresses.
Searching for an email domain is not always reliable.
By way of illustration, assume the email region is OTSender. Perhaps various
values of OTSender include:
Fred.smith@acme.com
Ibm-rep@smith.com
Sales.uk@acme.com
Sales.uk@other-acme.com
other@acme.com

A search for acme.com might also find other-acme.com. A search expecting to
find smith in the domain might also find fred.smith@acme. A search for the
other-acme domain might also find other@acme. In some cases, you could use
exact phrases to better constrain the queries, but this places a high knowledge
burden on the user. Beginning with SE10.5 Update 2014-09, capabilities exist to
facilitate the common email domain search case.
If the region OTSender is declared to be an email region, the Index Engine will
construct a new region named OTSender_OTDomain, and place the domain
portions of the email addresses in this new region. The original OTSender region
remains unaffected. The OTSender_OTDomain region can now be easily searched.
The email domain indexing process can handle multiple values for email addresses
in two ways. If there is a list of addresses in a single value, they will be split using
some simple pattern matching rules, typically comma or semicolon delimited. Multi-
value regions are also supported. In both cases, each distinct email domain will be
represented as a value in the _OTDomain region.
Where multiple identical email domain values exist for an object in the email region,
duplicates will be removed. This behavior is important given that many recipients of
an email message are often in the same organization or email domain.
In the search.ini file there are several configuration settings for tuning and enabling
email domain search capabilities. The main setting to enable or disable the feature is
a comma-separated list of text metadata regions that should be treated as email
regions. By default, this list is empty; the example below enables two regions:
EmailDomainSourcesCSL=OTEmailSender,OTEmailRecipient

When you add or remove regions from the email domain list, the changes take effect
the next time the Index Engines are started. At startup, any new email domain
regions will be created and the values populated. This may add 10 or so minutes to
the first startup process. Likewise, if any regions were removed, they will be deleted
from the index at next startup.
Tuning of the behavior is possible with remaining configuration settings. By default,
_OTDomain is used as the suffix for the email domain regions, but this can be
adjusted. There is an upper limit on the number of distinct email domain values that
will be retained for a given email value, which defaults to 50. If you anticipate longer
lists of email domains, this value can be adjusted upwards. Finally, the separators
used to delimit an email domain can be defined. When indexing, a simple rule is
used that text after the @ symbol up to a separator character represents the email
domain. The separators are defined in the search.ini file, and default to comma,
colon, semicolon, various brackets and whitespace. The separator string must be
compatible with a Java regular expression.
EmailDomainFieldSuffix=_OTDomain
MaxNumberEmailDomains=50
EmailDomainSeparators=[,:;<>\\[\\]\\(\\)\\s]

An example: if a multi-value email region for an indexed object has the values:
<OTEmailSender>bob@gmail.com</OTEmailSender>
<OTEmailSender>bob@acme.com</OTEmailSender>
<OTEmailSender>sue@gmail.com</OTEmailSender>
<OTEmailSender>bob@smith.com</OTEmailSender>
<OTEmailSender>sue@other.co.uk</OTEmailSender>

The OTEmailSender_OTDomain for that object will have effective values of:
<OTEmailSender_OTDomain>gmail.com</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>acme.com</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>smith.com</OTEmailSender_OTDomain>
<OTEmailSender_OTDomain>other.co.uk</OTEmailSender_OTDomain>

The same _OTDomain values would exist if a single value email region contains the
string:
<OTEmailSender>bob@gmail.com, bob@acme.com[Robert]
sue@gmail.com;bob@smith.com(“MightyBob”);
sue@other.co.uk</OTEmailSender>
Text Operator - Similarity


The TEXT operator is designed to help locate objects given a large block of text. The
provided text is analyzed to generate a list of terms and phrases that are significant,
and the resulting list is used with either the TERMSET or STEMSET operator to
generate search results. This operator is ideal for similarity applications, typically
within classification and discovery.
Unlike other search operators, the user does not have direct control over the exact
behavior of the search query. A typical use case would be to copy a couple of
paragraphs from a document, and search using the TEXT operator to find documents
with similar information. The TEXT operator takes arbitrary text as the parameter,
excluding closing brackets and end of line characters.
To illustrate by example, perhaps the first few lines from Lewis Carroll’s “Alice in
Wonderland” are used:
text (Alice was beginning to get very tired of sitting by
her sister on the bank, and of having nothing to do: once
or twice she had peeped into the book her sister was
reading, but it had no pictures or conversations in it,
‘and what is the use of a book,’ thought Alice ‘without
pictures or conversations?’ So she was considering in her
own mind (as well as she could, for the hot day made her
feel very sleepy and stupid, whether the pleasure of making
a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink
eyes ran close by her. There was nothing so very remarkable
in that; nor did Alice think it so very much out of the way
to hear the Rabbit say to itself, ‘Oh dear! Oh dear! I
shall be late!’)

First observation: to be compatible with the TEXT operator, the paragraph-end
“CRLF” characters and closing brackets “)” in the source were removed.
The Text operator would then analyze the text, discarding short words and top words.
Statistical analysis would select notable words (and phrases). Although not in this
text example, overly long words or lists of numbers would be ignored. The resulting
set of 8 to 15 terms would then be used internally with stemset, with an effective
internal query something like:
stemset(80%,alice,sister,book,”pictures or
conversations”,rabbit,considering,trouble,picking,pleasure,
sleepy,stupid,daisies)

Which in turn would match all items that have 80% or more of those terms and
phrases in the full text of the object. In general, numbers are dropped from
consideration in TEXT queries. However, if the provided block of TEXT is relatively
short (less than about 250 characters), numbers will be included if necessary to meet
the minimum number of terms.
The TEXT operator has a number of configuration settings. See the Top Words
section below for more settings.
Performance degrades with more words used in stemset, while accuracy drops with
too few words. The upper limit on the number of terms and phrases to use is:
TextNumberOfWordsInSet=15

For accuracy, such as trying to match exact documents, termset is a better choice.
Otherwise, stemset is used to find more objects with singular/plural variations but
runs slightly slower:
TextUseTermSet=true

The percentage of matches with termset and stemset can be adjusted. Low values
find more objects with less similarity (eg: 40%). Higher values, such as 80%, require
better matches with the source material:
TextPercentage=80

Top Words
The TEXT query operator is specifically designed to efficiently locate good quality
results when provided with large blocks of text. In this particular scenario, overly
common words are of little value, and need to be discarded. In OTSE, the Top
Words feature is used for this purpose.
Top Words are those words which are found within a large percentage of the
documents. For example, the OpenText corporate document management system
has the word OpenText in many documents, and hence it is eliminated from TEXT
queries. Top Words are determined based upon the percentage of objects containing
a word. For example, if more than 30% of objects contain the word ‘date’, then ‘date’
is added to the Top Words list.
Top Words are computed independently for each search partition. Usually, more
partitions are added over a prolonged period. If the frequency of words changes over
time, then newer partitions will have slightly different Top Words than older partitions.
This also means that TEXT queries which eliminate Top Words might construct
slightly different queries on each partition.
The Top Words are first computed for a partition once it contains approximately
10,000 objects. On reaching 100,000 and 1,000,000 objects, the list is discarded
and recomputed. This helps to ensure that the Top Words properly reflect the
contents of the partition. The Top Words are stored in a file that is not human
readable, and has the name topwords.10000, with the number changing to reflect the
size. If the topwords.n file is missing, it will be generated during next startup or
checkpoint write.
The threshold for selecting Top Words is a real number that should be between 0.01
and 0.99, representing the fraction of objects in the partition that contain the word.
The default value is 0.33 (33%); in some typical partitions larger than 1
million objects, we found this generated a Top Words list of about 750 words. Larger fractions result
in fewer Top Words. In the [Dataflow_] section:
TextCutOff=0.33

If the Top Words features are not required, generation and use can be disabled by
setting:
TextAllowTopwordsBuild=false

Stop Words
Stop words are words which are considered too common to be relevant, or do not
convey any meaning, and are therefore stripped from search queries, or potentially
not even indexed. For English, a typical list of stop words would contain words such
as:
a, about, above, after, again, against, all, am, an,
and, any, are, aren't, as, at, be, because, been,
before, being, below, between, both, but…
The potential advantage of stop words is a reduction in the size of the search index.
However, use of stop words introduces several limitations for search.
If stop words are applied at indexing time, certain types of queries become
impossible. A Shakespearean scholar could never find Hamlet’s soliloquy “to be or
not to be”, since all of those words are considered stop words, and would not be in
the index.
Another reason to not apply stop words during indexing is the multi-lingual capability
of OTSE. The Spanish word “ante” is very common, so it should be a stop word, and
not indexed. However, in English, this is an uncommon word, so it clearly should be
indexed.
As a result, the search engine does not use stop words during indexing, nor are they
applied as a general rule during search queries. However, there is a closely related
capability known as Top Words that is used under special circumstances.
Advanced Feature Configuration


Occasionally, you may need to optimize the configuration settings for some very
complex parts of the search grid. This section provides some details about how
these parts of search work, what they do, and how you can adjust them to optimize
search for your application.

Accumulator
The Accumulator is an internal component of the Index Engines which is responsible
for gathering the tokens (or words) that are to be added to the full text search index.
A basic understanding of the Accumulator is useful when considering how to tune
and optimize an OTSE installation.
As objects are provided to the Index Engine, the full text objects are broken into
words using the Tokenizer, and added to the Accumulator. When the Accumulator is
full, this event triggers creation of a new full text search fragment. In a process
known as “dumping” the Accumulator, a fragment containing the objects stored within
the Accumulator is written to disk.
The transactional correctness of indexing is possible in part because of how the
Accumulator works. As objects are added to the accumulator, they are also written to
disk in the accumlog file. These files are monitored by the search engines to keep
the search index incrementally updated. When the Accumulator dumps, a new index
fragment is created, and the accumlog files are available for cleanup.
The size of the accumulator has an impact on system performance, and on the
maximum size of an object that can be indexed. A small Accumulator is forced to
dump frequently, which can reduce indexing performance. A large Accumulator
consumes more memory. The default size value for the Accumulator is 30 Mbytes
(which is a nominal allocation target – Java overhead results in the actual memory
consumption being higher), and can be set from within the Content Server search
administration pages, which sets the [Dataflow_] value in the search.ini file:
AccumulatorSizeInMBytes=30
If a single object is too large to fit within the Accumulator, it will be truncated –
discarding the excess text content. You cannot always predict whether an object will
exceed this size limit, since this is a measurement of internal memory use including
data structures, and not a measurement of the length of the strings being indexed.
The Accumulator will dump if it contains data and indexing has been idle. The idle
time before dumping is configurable:
DumpOnInactiveIntervalInMS=3600000
During indexing of an object, the accumulator also makes an assessment of the
quality of the data it is given to index. If the data is too “random” from a statistical
perspective, then the accumulator will reject it with a “BadObjectHeuristics” error.
The randomness configuration settings in the [Dataflow_] section are:
MaxRatioOfUniqueTokensPerObjectHeuristic1=0.1
MaxRatioOfUniqueTokensPerObjectHeuristic2=0.5
MaxAverageTokenLengthHeuristic1=10.0
MaxAverageTokenLengthHeuristic2=15.0
MinDocSizeInTokens=16384
The heuristics are relatively lax, and are essentially designed to protect the index
from situations where random data or binary data was provided. It is rare that these
values need to be adjusted, and some experimentation will be needed to find values
that meet special needs. There is a minimum size of about 16,384 tokens
(MinDocSizeInTokens) before these heuristics are applied, since small objects would
otherwise fail the uniqueness requirement.
There is one situation where this safety feature is known to occasionally discard good
objects. If a spreadsheet is indexed that contains lists of names, numbers and
addresses, the uniqueness of the tokens may be very high, and it may be rejected as
random.
A related configuration setting is an upper limit on the size of a single object. Objects
are truncated at this limit, meaning that only the first part of the object is indexed.
Note that this size limit is applied to the text given to the Index Engine, not the size of
an original document file. For example, a 15 MB Microsoft PowerPoint file might only
have a filtered size of 100 Kbytes. Conversely, an archive file (ZIP file) with a size of
1 MB might expand to more than 10 MB after filtering.
ContentTruncSizeInMBytes=10
From an indexing perspective, 10 Mbytes is a lot of information. For English
language documents, this would normally be more than 1 million words. By way of
comparison, this entire document in UTF8 form is well under 1 MByte.

Accumulator Chunking
Starting with Search Engine 10 Update 7, the Accumulator also has the ability to limit
the amount of memory consumed by “chunking” data during the indexing process.
Essentially, if the size of the accumulator exceeds a certain threshold, the input is
broken into smaller pieces, or chunks. Each chunk is separately prepared and
written to disk. When all the chunks are completed, a “merge” operation combines
the chunks into the index.
Chunking is a very disk-intensive process. When chunking occurs, there is a
noticeable impact on the indexing performance. Fortunately, chunking is only
required when indexing very large objects. Using the default settings, we noted while
indexing our own typical “document management” data set that chunking occurs with
hundreds of documents per million indexed, and showed an overall indexing
performance hit of about 15% in a development environment. If indexing
performance must be optimized, you can disable chunking or even reduce the
Content Truncation size described above to a small value (perhaps 1 MByte) such
that chunking may never happen.
There are configuration settings in the [DataFlow_] section of the search.ini file for
tuning the chunking process. The number of bytes in an object before chunking will
occur has a default of 5 MBytes. The feature can be disabled with a large value, say
100,000,000.
AccumulatorBigDocumentThresholdInBytes=5000000
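For example, to effectively disable chunking as described above, set the threshold
larger than any object that will be indexed:
AccumulatorBigDocumentThresholdInBytes=100000000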
An additional amount of memory for related data such as the dictionary is reserved
as working space, expressed as a percentage of the Accumulator size (typically 30
Mbytes), with a default of 10 percent.
AccumulatorBigDocumentOverhead=10

As a result of this change, it will no longer be possible to search within XML regions
in the body of text for large XML objects where chunking occurs. Chunking can be
disabled for XML documents by setting CompleteXML to true, but this will negate
the memory savings from chunking.
CompleteXML=false

Reverse Dictionary
The search engine maintains dictionaries of words in the index. The dictionary is
sorted to be efficient for matching words, and for matching portions of words where
the beginning of the word is known (right-truncation, such as washin*). However, for
matching terms that start with wildcards (left-truncation), the dictionary is not optimal.
The search engine can optionally store a second dictionary, known as the Reverse
Dictionary. This is a dictionary of each term spelled backwards. For instance, the
term “reverse” is stored as “esrever”. This Reverse Dictionary allows high
performance matching of terms that begin with a wildcard, and for certain types of
regular expressions that are right anchored (ending with a $).
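As an illustration of the idea (the terms shown are hypothetical), a left-truncated
query such as *ington can be evaluated as a right-truncated scan of the Reverse
Dictionary:
query term:     *ington
reversed scan:  notgni*
matches:        notgnihsaw -> washington
                notgnimraf -> farmington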
There is an indexing performance penalty associated with building and maintaining
the Reverse Dictionary. The penalty varies due to many factors, but has been
observed to be over 10%. There is additional disk space required, typically about 1
GB for a partition with 10 million objects. As far as memory is concerned, another
Accumulator instance is used which consumes about 30 MB of RAM in the default
configuration, and space of about 15 MB is required for term sorting. The Reverse
Dictionary is enabled with a setting in the [Dataflow] section of the search.ini file:
ReverseDictionary=true

By default, the Reverse Dictionary is disabled (false) to maintain backwards
compatibility – this feature was added in version 16.2.5 (June 2018). Existing
indexes will create (or destroy) the Reverse Dictionary during startup after this setting is
changed. The conversion is performed by the Index Engine, and the Search Engine
will then need to be restarted to apply the change during queries. Conversion to
create the Reverse Dictionary is relatively expensive, perhaps 30 minutes per
partition. Data does not need to be re-indexed.
Once enabled, the Reverse Dictionary also reserves memory for sorting results that
match the reverse dictionary during search queries. The default configuration
allocates space to sort up to about 100,000 terms per partition. If this number is
exceeded, performance is impacted. The value can be increased at a cost of about
15 MB per 100000 terms, with the [Dataflow] setting:
ReverseDictionaryScanningBufferWordEntries=100000
The Reverse Dictionary works with full text content and text metadata stored in “Low
Memory” mode. Older storage modes are not supported. The Reverse Dictionary is
not used with regions that are over-tokenized or configured for exact substring
matching.

Transaction Logs
In the event that an index or partition is corrupted or destroyed, OTSE provides
Transaction Logs to help rebuild and recover indexes with the least amount of re-
indexing. Transaction Logs are generated by the Index Engines with a minimal
record of the indexing operations that have been applied. A fragment of a
Transaction Log looks like this:
2018-03-15T14:19:22Z, replace - content, DataId=1009174&Version=1
2018-03-15T14:19:22Z, add, DataId=1036021&Version=1
2018-03-15T14:19:22Z, delete, DataId=1015932&Version=1
2018-03-15T14:19:22Z, add, DataId=1036022&Version=1
2018-03-15T14:19:22Z, add, DataId=1036023&Version=1
2018-03-15T14:19:22Z, Start writing new checkpoint
2018-03-15T14:19:23Z, Finish writing new checkpoint
2018-03-15T14:19:23Z, add, DataId=834715&Version=1

If an index is corrupted, it can be restored from the most recent backup. The
Transaction Log can then be used to determine which Content Server objects should
be re-indexed or deleted to bring the backup copy of the index up to date, based on
the date/time of the operations since the date of the backup.
The transaction logs are set up to rotate 4 logs of size 100 MB each, which should
typically be able to record more than 50 million operations for a partition. At this time,
these values are not adjustable. In a typical system with regular backups, this should
be more than enough to recover all transactions. If your backups are less frequent,
you may wish to copy these logs on a regular basis.
Multiple copies of the Transaction Logs can be written. The idea here is that these
logs must survive a disk crash to be useful for recovery. If you are concerned about
system recovery, consider recording the Transaction Logs on two different physical
disks. In the [IndexEngine_] section of the search.ini file:
TransactionLogFile=c:\logs\p1\transaction.log,
f:\logs\p1trans.log
TransactionLogRequired=false

In this example, logs are written to two locations. By default, the list is empty, which
disables writing the Transaction logs. The Index Engine will append text to the
provided file name to differentiate between the rotating logs. A second setting
dictates whether a failure to write Transaction Logs should be considered a
transaction failure, or should be accepted and allow indexing to continue. By default,
this is false – meaning the Transaction Logs are “nice to have”.
Protection
Because Content Server is relatively open and allows many types of applications to
be built on top of it, the search system can be exposed to unexpected data and
applications. This section touches on some of the configurable protection features of
OTSE.

Text Metadata Size


Text metadata regions are optimized for relatively small and important bits of
information. We have seen situations where customers attempt to place megabyte
text values in a text field. While this works, it consumes significant memory and CPU
to process. There is a default maximum size of 256 Kbytes for text in a single region
for an object. In the [Dataflow_] section, MetadataValueSizeLimitInKBytes controls
this size, and any regions listed in MultiValueLimitExclusionCSL are exempt.
If this limit is exceeded, the metadata is truncated to the maximum length, and the
string OTIndexLengthOverflow is added to the end so that you can search for these
conditions, and the OTIndexError count is incremented.
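Expressed in the [Dataflow_] section of the search.ini file, the default described
above would be:
MetadataValueSizeLimitInKBytes=256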

Text Metadata Values


Text metadata regions support multiple values. There is a default limit to the number
of values that can be accepted. This is especially important since processing multi-
value text regions consumes considerable stack space. The default is 200 values, as
defined in the search.ini file by the MultiValueLimitDefault setting. Regions listed in
MultiValueLimitExclusionCSL are exempt, which by default are regions used by
Content Server email management:
OTEmailToAddress
OTEmailToFullName
OTEmailBCCAddress
OTEmailBCCFullName
OTEmailCCAddress
OTEmailCCFullName
OTEmailRecipientAddress
OTEmailRecipientFullName
OTEmailSenderAddress
OTEmailSenderFullName
If this limit is exceeded, additional metadata values are discarded and an additional
metadata value of <>OTIndexMultiValueOverflow</> is added to make this
condition searchable, and the OTIndexError count is incremented.
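Expressed as search.ini settings, the defaults described above would be along these
lines (the exclusion list reflects the regions listed above):
MultiValueLimitDefault=200
MultiValueLimitExclusionCSL=OTEmailToAddress,OTEmailToFullName,
OTEmailBCCAddress,OTEmailBCCFullName,OTEmailCCAddress,OTEmailCCFullName,
OTEmailRecipientAddress,OTEmailRecipientFullName,OTEmailSenderAddress,
OTEmailSenderFullName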

Incorrect Indexing of Thumbnail Commands


An issue was detected whereby objects that had no content were being given
portions of the input IPool from DCS that contained requests to generate thumbnails.
This issue was corrected in the 16.2.4 update. However, attempts to re-balance
objects affected by this problem will fail – no full text will be provided on re-index, so
the Index Engine will see it as a partial update and not permit a rebalance of the
object. The full text checksum for the affected objects is always 485363284. There
is a configuration setting to allow objects with this checksum to be treated as if they
have no text:
EnableWeakContentCheck=true

Cleanup Thread
As the Index Engines update the index, they create new files and folders. The
Search Engines read these files to update their view of the index. Left alone, these
files will eventually fill the disk. The Cleanup Thread is the component of the Index
Engine that runs on a schedule to analyze the usage of the files, and delete those
which are no longer necessary.
A Cleanup Thread only examines and deletes files for a single partition; each Index
Engine therefore schedules a Cleanup Thread. The Cleanup Thread will delete
unused configuration files, as well as unused files listed in the configuration files,
such as accumlog, metalog, checkpoint and subindex fragment files. Search
Engines keep file handles open for config files currently in use, and this is the primary
mechanism used by the Cleanup Thread to determine if files can be deleted.
There is no specific process to monitor for the Cleanup Thread; it is part of the Index
Engine process. By default, the Cleanup Thread is scheduled to run every 10
minutes. You can adjust the interval in the search.ini file [Dataflow_] section:
FileCleanupIntervalInMS=600000
The Cleanup Thread also has a secure delete capability, disabled by default.
SecureDelete=false

When set to true, the Cleanup Thread will perform multiple overwrites of files with
patterns and random information before deleting them, making them unreadable by
most disk forensic tools. This also makes the file delete process considerably slower,
and uses significant I/O bandwidth. Some additional notes on this feature:
• The US Government has updated their guidelines to require physical
destruction of disk drives for highest security situations.
• Overwriting files is ineffective with journaling file systems.
• The algorithm is designed for use with magnetic media, and may not provide
any additional security with Solid State Disks.
• Optimizations by Storage Array Network storage systems may defeat this
feature.
The Cleanup Thread code has been enhanced starting with Search Engine 10
Update 4 to delete unused fragments more aggressively. If for some reason you
require the previous behavior, it can be requested in the search.ini file by setting
SubIndexCleanupMode=0. The default value is 1.
Merge Thread
The Merge Thread is a component of the Index Engine that consolidates full text
index fragments. As the Index Engines add or modify the index, they do not change
the existing files. Instead, they append new files, referred to as the “tail” fragments.
The Search Engines must search against all of the files that comprise the full text
index.
As the number of files containing index fragments grows, the performance of search
queries deteriorates. The purpose of the Merge Thread is to combine fragments to
create fewer files that the Search Engines need to use, ensuring that query
performance remains high. Merging also reduces the overall size of the index on
disk, since deleted objects are simply “marked” as deleted in the tail fragments, and
modified objects will have multiple representations until they are merged.
The Merge Thread will create new full-text index fragment files and then
communicate with the Search Engine using the Control File regarding which files now
comprise the index. Once the Search Engine changes (locks the new files), the
Cleanup Thread will delete the older index files.
Merging is a disk-intensive process. The Merge Thread therefore tries to maintain a
balance between how frequently merges occur and how many index fragments exist.
In a typical index, there are frequent merges taking place within the tail index
fragments, which tend to be small and can be merged quickly. Eventually, older and
larger fragments must also be merged.
An optimal target for the number of fragments an index should have is about 5. In
practice, the number of smaller fragments can grow quite large depending upon the
characteristics of the index. As a safeguard, there is a configuration setting that
places an upper limit on the number of fragments that are permitted for a partition
index, and this will force merges to occur. Too many fragments can seriously affect
query performance due to the level of disk activity in a query and the number of file
handles needed.
[Figure: Target size distribution of index fragments. Larger, older fragments change
less frequently; new data is written to small tail fragments.]

The Merge Thread configuration settings are located in the [Dataflow_] section of the
search.ini file:
// Merge thread
AttemptMergeIntervalInMS=10000
WantMerges=true
DesiredMaximumNumberOfSubIndexes=5
MaximumNumberOfSubIndexes=15
TailMergeMinimumNumberOfSubIndexes=8
CompactEveryNDays=30
NeighbouringIndexRatio=3
“Want Merges” would normally only be changed for debugging purposes. In most
installations, these settings do not need to be modified. One setting of note is the
Compact Every N Days value, which instructs the Merge Thread to make a more
aggressive attempt to merge indexes over the long term. This setting helps to merge
older index fragments which are relatively stable, and would otherwise not be
scheduled for compaction.
Merge Tokens
Merging fragments temporarily requires additional disk space, nominally the size of
all the fragments being merged. If the temporary disk space needed causes the
partition to exceed the configured maximum size of the partition, then the merge will
fail. One way to address this is to increase the configured allowable disk space.
However, increasing the disk space for every partition can be a costly approach to
solving the problem.
The better approach is to enable Merge Tokens. Merge Tokens are managed by the
Update Distributor, and can be granted on an as-needed basis to Index Engines that
do not have sufficient space to perform merges. If given a Merge Token, the Index
Engine will proceed to perform a merge even if this exceeds the configured maximum
disk space. If the largest index fragments are 20 GB, then 100 GB of temporary
space would suffice for 4 or 5 Merge Tokens. Relatively few Merge Tokens are
needed. 3 tokens would likely suffice for 10 partitions, perhaps 10 tokens for 100
partitions.
The Merge Token capability was first added in Update 2015-03, and the default
setting is disabled for backwards compatibility. In the [UpdateDistributor_] section of
the search.ini file:
NumOfMergeTokens=0
Too Many Sub-Indexes
Although OTSE has a typical target of merging down to 5 or so index fragments,
there are situations when this may not be possible. There is a maximum number of
allowable index fragments (or sub-indexes), which by default is 512. There have
been scenarios, usually due to odd disk file locking, where this limit has been
reached or exceeded. In this case, a Java exception will occur, logging a message
along these lines:
MergeThread:2:Exception:Exception in
MergeThread:java.lang.ArrayIndexOutOfBoundsException; 512

To recover from this, you can edit the [Dataflow_] section of the search.ini file to
increase the number of allowable sub-indexes (perhaps 600), and restart the affected
engines. Once recovered, the lower number should be restored, since running with
larger values has a potential negative performance impact.
MaximumSubIndexArraySize=512
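For example, a temporary recovery value per the guidance above:
MaximumSubIndexArraySize=600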

Tokenizer
The Tokenizer is the module within OTSE that breaks the input data into tokens. A
token is the basic element that is indexed and can be searched. The Tokenization
process is applied to both the input data to be indexed, and the search query terms
to be searched.
There is a default standard Tokenizer (Tokenizer1) built into OTSE that applies to
both the full text and all search regions. The system supports adding new tokenizers
that can be applied to specific metadata regions. In addition, Tokenizer1 can be
replaced and customized, or can be used with a number of configuration options.
Everything that follows until the section entitled “Metadata Tokenizers” describes the
use of the default Tokenizer1.

Language Support
OTSE is based upon the Unicode character set, specifically using the UTF-8
encoding method. This means that all indexing and query features can handle text
from most languages. If there are limitations in supported character sets, any
necessary changes would take place within the Tokenizer.

Case Sensitivity
By design, OTSE is not case sensitive. Text presented for indexing or terms provided
in a query are passed through the Tokenizer, which performs mapping to lower case.
This design decision provides a slight loss of potential feature capability in full text
search, but improves performance and reduces index size dramatically. Note that
text metadata values are stored in their original form, including accents and case, so
that retrieval of metadata has no accuracy loss. The mapping to lower case is not
applied to other aspects of the index, such as region names, which ARE case
sensitive.

Standard Tokenizer Behavior


When dealing with English words, the Tokenizer has a simple task. Consider the
input sentence “how are you?” The Tokenizer will create 3 tokens:
how
are
you
To improve the find-ability of terms, the Tokenizer is used to normalize input data –
convert it to basic equivalent forms. This includes converting capitals to lower case,
removing accents, or converting certain characters to their common basic forms. So
the input string of “The café Boiñgo” generates the tokens:
the
cafe
boingo
The next set of problems the Tokenizer handles relates to white space and
punctuation. White space is the set of characters used to represent breaks between
tokens. The ‘space’ character is one of these, but there are many more. Punctuation
is generally meaningless from a search perspective, so unless the punctuation is
contained within some specific patterns, the Tokenizer will normally ignore
punctuation characters. In practice, the Tokenizer works by searching for valid
character patterns, rather than by discarding whitespace and punctuation characters.
The Tokenizer handles a number of special cases. Consider the token “end.start”.
Most likely, this should be treated as two words. However, the text “14.5587” is
clearly a number. The Tokenizer recognizes patterns in text and identifies certain
special cases. Where a number is concerned, the period will be kept and the text will
index as a single token. The regular expression matching handles this.
Numbers are the easy case. Languages such as English allow words to be broken
with hyphens, particularly at line breaks. Consider the text “con-tract”. Is this two
tokens, or one? Should the hyphen be removed or replaced with a space? What if
the string is “con- tract”, with a space after the hyphen? Again, using appropriate
regular expressions will determine whether this is one or two words.
Several Asian character sets also need special handling. Written languages such as
Japanese, Chinese and Korean do not use the same concept of word breaks that are
common in European languages. For these character sets, the Tokenizer will instead
create overlapping sets of “bigrams” – pairs of adjacent characters.
Finally, the Tokenizer can be used to identify special forms of strings, and keep them
intact. A common case is part numbers. If your business commonly uses part
numbers of the form “1145\hgbuut-4478”, then the Tokenizer can be enhanced to
recognize this as a special case, and keep a string in this form intact as a single
token instead of breaking it into 3 separate tokens.

Customizing the Tokenizer


Warning: changing the Tokenizer for an existing index can cause unexpected results.
During fragment merges and accumulator dump activities, the Index Engine verifies
that the tokens have not changed. If the new Tokenizer causes existing words to be
tokenized differently, those words will be dropped from the Index and the event
recorded in the log files.
If you have special indexing and search requirements, you can create a custom
Tokenizer file. When you provide a new Tokenizer file, it completely replaces the
internal Tokenizer. The Tokenizer file is read when the OTSE components are
started, and used to build an optimal finite state machine for parsing strings. This
optimization means that a custom Tokenizer will not have a material impact on the
tokenizing performance.
The location of the Tokenizer file may be specified in the search.ini file, and allows a
unique Tokenizer to be used with each dataflow. This is normally not recommended.
[DataFlow_foo03278X2099X12621X11463]
RegExTokenizerFile=tokenizerFileName
The default location for a custom Tokenizer is the “config” folder for the search
engine, with the file name otsearchtokenizer.txt. If placed here, then the same
Tokenizer is used for all dataflows.
A restart of the engines is required after a change to the Tokenizer. Depending on
the change, reindexing of some or all of the content may be desired.
Creating a new Tokenizer is not trivial, and errors in the Tokenizer will require you to
re-index data to correct it. OpenText can provide services to help you customize your
Tokenizer.
Step one in customizing the Tokenizer is to obtain a copy of the default Tokenizer as
a starting point for reference. You can obtain this file from OpenText customer
support – it is built in to OTSE and not shipped with the product.

Tokenizer File Syntax


The basic layout of the tokenizer file is:
#
# comment lines start with the number sign
#
[comm|nocomm]
mappings {
map_specifications
}
ranges {
range_specifications
}
words{
word_specifications
}
The comm|nocomm line is optional, and not recommended. This controls whether
text that meets the criteria for SGML or XML style comments should be retained or
discarded. The default value is nocomm (do not index comments). This line is
equivalent to setting the standard Tokenizer options in the search.ini file with a value
of TokenizerOptions=2.

Tokenizer Character Mapping


The mappings section is used to map UTF8 characters from one value to another.
For instance, an upper case A to lower case a, or accented characters to non-
accented characters. The mappings section does not completely replace the default
character mapping; it supplements or replaces the specific mappings defined in the
section. However, providing any character mappings will require a complete
tokenizer file to be specified, including range and words sections. In the event that
no mapping exists for a character, the value is passed unchanged.
Mapping of characters takes place AFTER the tokenization has occurred.
The simple example below would be used to convert:
• upper case A (hexadecimal 41) to lower case a (hexadecimal 61)
• ntilde (ñ – hexadecimal f1) to lower case n (hexadecimal 6e)
• the character Æ (hexadecimal c6) to the two letters ae
• and drop the Unicode diacritical mark ` (combining grave accent, hexadecimal 300)
mappings {
0x41=0x61 0xf1=0x6e
0xc6=0x00650061 0x300=0x00
}

NOTE: the special case for mapping one character to two characters. To use this
feature, you must express the two characters as a single 32-bit value, with leading
zeros, with the second character first.

Using the null character as the “to” value in a mapping is a special case. Null
characters are skipped during a subsequent Indexing step, so mapping a character
to 0x00 will effectively drop it from the string. This may be useful for removing
standalone diacritical marks or punctuation such as the single quote mark from the
word “shouldn’t”.
The following table illustrates the default character mappings for many of the
European languages.
From → To
A-Z → a-z
À Á Â Ã Å à á ã å Ā ā Ă ă Ą ą → a
Ä Æ ä æ → ae
Ç ç Ć ć Ĉ ĉ Ċ ċ Č č → c
Ď ď Đ đ → d
È É Ê Ë è é ê ë Ē → e
Ì Í Î Ï ì í î ï → i
Ð ð → ð
Ñ ñ → n
Ò Ó Ô Õ Ø ò ó ô õ ø → o
Ö ö → oe
Ú Û ù ú û → u
Ü ü → ue
Ý ý ÿ → y
Þ → Þ (Large Thorn)
þ → Þ (small Thorn)
ß → ss
Note: prior to Update 2014-12, upper and lower case Ø characters were mapped to a zero.

Latin Extended-A Character Set Mapping


The Latin Extended-A code page, also known as Unicode Code Page 1, has all
characters mapped to their nearest single ASCII character equivalents, with the
following exceptions:
• The upper and lower case IJ ligatures are mapped to the two letters I J.
• Upper and lower case Letter L with Middle Dot are preserved (Ŀ and ŀ).
• Upper and lower case Œ ligatures are converted to oe.
• Accented W and Y characters are preserved (Ŵ ŵ Ŷ ŷ Ÿ).
• The ſ character (small letter “long s”) is preserved.

Arabic Characters
There are special cases implemented for tokenization of Arabic character sets, which
improves the findability of Arabic words.
Step 1 is character mapping. The character mapping is extended to handle cases in
which multiple characters must be mapped as a group. These mappings are:
In each group below, the listed variants are mapped to the first (base) character:
Mapped to 0627 ARABIC LETTER ALEF:
  0622 ARABIC LETTER ALEF WITH MADDA ABOVE
  0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
  0625 ARABIC LETTER ALEF WITH HAMZA BELOW
  0675 ARABIC LETTER HIGH HAMZA ALEF
Mapped to 0647 ARABIC LETTER HEH:
  0629 ARABIC LETTER TEH MARBUTA
Mapped to 064A ARABIC LETTER YEH:
  0649 ARABIC LETTER ALEF MAKSURA
  0626 ARABIC LETTER YEH WITH HAMZA ABOVE
  0678 ARABIC LETTER HIGH HAMZA YEH
Mapped to 0648 ARABIC LETTER WAW:
  0624 ARABIC LETTER WAW WITH HAMZA ABOVE
  0676 ARABIC LETTER HIGH HAMZA WAW
Mapped to 06C1 ARABIC LETTER HEH GOAL:
  06C2 ARABIC LETTER HEH GOAL WITH HAMZA ABOVE
Mapped to 06C7 ARABIC LETTER U:
  0677 ARABIC LETTER U WITH HAMZA ABOVE
Mapped to 06D2 ARABIC LETTER YEH BARREE:
  06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
Mapped to 06D5 ARABIC LETTER AE:
  06C0 ARABIC LETTER HEH WITH YEH ABOVE

In addition, several hundred Presentation Form characters are mapped to their
equivalent non-Presentation Forms.

Step 2 is removal of the following Unicode diacritical marks:
0640 ARABIC TATWEEL
064B ARABIC FATHATAN
064C ARABIC DAMMATAN
064D ARABIC KASRATAN
064E ARABIC FATHA
064F ARABIC DAMMA
0650 ARABIC KASRA
0651 ARABIC SHADDA
0652 ARABIC SUKUN
0653 ARABIC MADDAH ABOVE
0654 ARABIC HAMZA ABOVE
0655 ARABIC HAMZA BELOW
0670 ARABIC LETTER SUPERSCRIPT ALEF
0674 ARABIC LETTER HIGH HAMZA

Step 3 is removal of WAW and ALEF-LAM prefixes, only if doing so leaves at least 2
characters remaining.
The final step is removal of HEH-ALEF and YEH-HEH suffixes, again only if at least 2
characters will remain in the token.

Note that Arabic tokenization was improved significantly starting with Update 2014-12.

Complete List of Character Mappings


For completeness, a table of all the character mappings performed by OTSE is
included in the Configuration Files section later in this document.

Tokenizer Ranges
Ranges define the primitive building blocks of characters, organizing them in logical
groups. Each range specification is comprised of Unicode characters and character
ranges, expressed in hexadecimal notation. For example, a range for the simple
numeric characters 0 through 9 would be:
number 0x30-0x39
In practice, there are multiple Unicode code points where numbers could be
represented, so a richer definition of a number might need to include Arabic numerals
(x660-x669), Devenagari numerals (0x966-0x96f) and similar representations from
other languages. You would probably also want to use the character mapping
feature to convert these all to the ASCII equivalents:
number 0x30-0x39 0x660-0x669 0x966-0x96f

Tokenizer Regular Expressions


The words section describes how word tokens are built from the range values. Each
definition is on a separate line, and is a regular expression using one or more ranges.
There should be no line breaks within a word specification. If the text matches the
regular expression, it is accepted as a token. A simple example:
currency?dash?number+(nseparators+number+)*
This regular expression is based upon the ranges currency, dash, number and
nseparators. Specifically the regular expression above indicates that the text is a
word if it meets these criteria:

• May or may not start with currency – currency would be a list of symbols such as
$ ¥ £ or €.
• May or may not start with a dash after the optional currency sign.
• Has one or more numbers (0-9) following the optional dash and currency.
• Has zero or more sets of nseparators (, and .) and numbers following the first
number.
In general, the regular expressions are greedy – matching the longest possible string.
The following operations on ranges are supported, and are applied following the
range:
? Zero or one instance of the range
* Zero or more instances of the range
+ One or more instances of the range
( ) Use brackets to clarify order of operations and create groups
| Vertical bar is used as an “OR” operator between ranges
- Token matching this pattern is not valid; advance the start pointer one
character and continue

The Tokenizer begins at a specific character, and attempts to find the longest valid
regular expression match. Once found, it takes the matching value as a word,
advances to the character following the match, and repeats. If no match is found, it
advances one character and repeats.
In general, regular expressions that you construct should be relatively lax. In the
currency example above, for instance, we do not enforce 3 digits between commas.
Erring on the side of indexing information rather than rejecting it is a good guideline.
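As a worked example (the input strings are hypothetical), the currency expression
above would accept each of the following strings as a single token:
$1,234.56   [currency, numbers and nseparators]
-17         [optional dash plus numbers]
12.34.56    [numbers with repeated nseparators groups]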

East Asian Characters


Languages such as Chinese, Japanese and Korean do not generally split into tokens
like European languages do. There is a special mechanism available to group these
characters into pairs, called bi-grams.
By way of example, the string of 3 characters 信用卡 would be indexed as two
overlapping sets of bi-grams: 信用 and 用卡. This approach improves search
quality for these character sets, although the resulting index is somewhat larger than
an index built by treating each character as a unique token.
In the Tokenizer, the characters indexed in this way are expressed in a range, as
follows:
ranges{
gram2 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d 0xfa30-
0xfa6a 0xfa70-0xfad9
0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d
0x3041-0x3094 0x30a1-0x30fe 0xff66-0xff9d 0xff9e-
0xff9f
}
In the words section, there is a reserved keyword that implements this bi-gram
behavior for the matched regular expression, _NGRAM2:
words{
_NGRAM2 gram2+
}
Bigram indexing is the default behavior for these languages. Older versions of the
Search Engine indexed each East Asian character as a separate token. There is a
configuration setting in the search.ini file that can force use of the older method. This
may be useful if you have an older index that predates OTSE with significant East
Asian character content that you do not wish to re-index.
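The setting in question is the TokenizerOptions bit field described in the next
section; bit value 128 selects the older single-character behavior:
TokenizerOptions=128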

Tokenizer Options
If you are using the standard Tokenizer, the following options are available in
[Dataflow_xxx] section of the search.ini file:
TokenizerOptions=128
The default value is 0 (no options set). The options are a bit field, and can be added
together to combine values (see the example following this list). The bit field values are:

1 : a dash character “-“ is counted as a standard character for words. The string
“red-bananas-26” would be indexed as a single token, instead of as 3 the
consecutive tokens “red”, “bananas”, “26”.
2 : XML comments are indexed. By default, strings which fit the pattern for an
XML comment are stripped from the input. XML comments have the form
<!--any text in comment-->
4 : treat underscore characters “_” as separators. This would cause input such
as “My_house” to be indexed as two tokens, “my” and “house”. The default
would preserve this as a single token.
8 : special case handling to look for software version numbers of the form v2.0
and treat them as a single token.
16: treat the “at symbol” @ as a character in a word.
32: treat the Euro symbol as a character in a word.
128 : used to request the “older” method of indexing East Asian character strings
with each character as a separate token. The default indexes these strings as 2-
character “bi-grams”.

Testing Tokenizer Changes
This is an unsupported component of the OTSEARCH.JAR file. If it does not work as
expected, OpenText has no obligation to correct it. This capability is documented
simply for convenience, in case it is useful when debugging Tokenizer behavior.
There is a Java class function within the OTSEARCH.JAR that can accept a
tokenizer file and a test file, and display the tokens that would be generated. An
example in a Linux environment:
cd ~
cp=otsearch.jar
java -cp $cp com.opentext.search.tokenizer.RegExBreaker testtok -inifile tok.ini inputfile

testtok is the class for the test.
-inifile identifies that the tokenizer filename follows, in this case tok.ini.
inputfile is the name of the file containing the data you wish to tokenize.
If inputfile contains “THIS is a TEßT”, the output would be of the form:
|THIS|this
|is|is
|a|a
|TEßT|tesst
The first value on each line represents the word token accepted by the regular
expression parser, and the second value represents the result after the character
mappings are applied.

Sample Tokenizer
The following sample tokenizer file is similar to the default implementation. Indented
lines have been wrapped to fit the available space. In practice, lines should not be
broken.

ranges {
alpha 0x30-0x39 0x41-0x5a 0x5f 0x61-0x7a 0xc0-0xd6
0xd8-0xf6 0xf8-0x131 0x134-0x13e 0x141-0x148
0x14a-0x173 0x179-0x17e 0x384-0x386 0x388-0x38a
0x38c 0x38e-0x3a1 0x3a3-0x3ce 0x400-0x45f 0x5d0-0x5ea
0xFF10-0xFF19 0xFF21-0xFF3a 0xFF41-0xFF5a
number 0x30-0x39
numin 0x2c-0x2e
currency 0x24 0xfdfc
numstart 0x2d
alphain 0x5f
tagstart 0x3c
colon 0x3a
tagend 0x3e
slash 0x2f
onechar 0x3005-0x3006 0xff61-0xff65
gram2 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d 0xfa30-0xfa6a
0xfa70-0xfad9 0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d
0x3041-0x3094 0x30a1-0x30fe 0xff66-0xff9d 0xff9e-0xff9f
arabic 0x621-0x63a 0x640-0x655 0x660-0x669 0x670-0x6d3
0x6f0-0x6f9 0x6fa-0x6fc 0xFB50-0xFD3D 0xFD50-0xFDFB
0xFE70-0xFEFC 0x6d5 0x66e 0x66f 0x6e5 0x6e6 0x6ee 0x6ef
0x6ff 0xFDFD
indic 0x900-0x939 0x93C-0x94E 0x950-0x955 0x958-0x972
0x979-0x97F 0xA8E0-0xA8FB 0xC01-0xC03 0xC05-0xC0C
0xC0E-0xC10 0xC12-0xC28 0xC2A-0xC33 0xC35-0xC39
0xC3D-0xC44 0xC46-0xC48 0xC4A-0xC4D 0xC55 0xC56
0xC58 0xC59 0xC60-0xC63 0xC66-0xC6F 0xC78-0xC7F
0xB82 0xB83 0xB85-0xB8A 0xB8E-0xB90 0xB92-0xB95
0xB99 0xB9A 0xB9C 0xB9E 0xB9F 0xBA3 0xBA4
0xBA8-0xBAA 0xBAE-0xBB9 0xBBE-0xBC2 0xBC6-0xBC8
0xBCA-0xBCD 0xBD0 0xBD7 0xBE6-0xBFA
}
words {
alpha+(alphain+alpha+)*
currency?numstart?number+(numin+number+)*
arabic+
onechar
indic+
_NGRAM2 gram2+
tagstart ( alpha+ (alphain+alpha+)* | arabic+ | onechar
|indic+ | gram2)+ (colon- ( alpha+ (alphain+alpha+)*
| arabic+ | onechar |indic+| gram2)+)? (slash tagend)?
tagstart slash ( alpha+ (alphain+alpha+)* | arabic+
| onechar |indic+ | gram2)+ (colon- ( alpha+
(alphain+alpha+)* | arabic+ | onechar|indic+
| gram2)+)? tagend
}

Metadata Tokenizers
The default configuration uses the full text tokenizer for text metadata regions. OTSE
supports the use of additional tokenizers for text metadata regions. There are 3
requirements to enable this: creating the tokenizer file; referencing the tokenizer file
in the search.ini file; and associating the tokenizer with a metadata region.
Adding or changing the tokenizer configuration for text metadata is possible. When
the search system is restarted, the text metadata stored values are used to rebuild
the text metadata index using the new tokenizer settings. This may require several
hours on large search grids. There are configuration settings that determine the
behavior of the rebuilding when the tokenizers are changed. The first setting is a
failsafe to prevent accidental conversion if the tokenizers are deleted or changed
unintentionally: it requires that today's date be provided for the conversion to occur.
Use the value "any" to allow conversion any time the tokenizers are changed. The
second setting determines whether the conversion is applied to existing data or only
to new data. Applying the new tokenizer only to new data is usually not
recommended, because mixing tokenizations produces inconsistent results, so the
default value is true (existing data is re-tokenized). In the [Dataflow_] section:
AllowAlternateTokenizerChangeOnThisDate=20170925
ReindexMODFieldsIfChangeAlternateTokenizer=true

The search.ini file defines where the tokenizer definition files are located. For
example, to add two metadata tokenizer files:

[Dataflow_]
RegExTokenizerFile2=c:/config/tokenizers/partTKNZR.txt
RegExTokenizerFile3=c:/config/tokenizers/NoSpaceTokens.txt

Note that the additional tokenizer values start at the number 2; the first tokenizer
entry is always reserved for the full text tokenizer. The tokenizer definition files in this
example are located in the config/tokenizers directory, which by convention is the
preferred location for tokenizer definition files.
The next step is to identify the text metadata regions which should use the
enumerated tokenizers. This is done as an optional extension to the text region
definition in the LLFieldDefinitions.txt file:

TEXT OTPartNum FieldTokenizer=RegExTokenizerFile2
TEXT RegionX FieldTokenizer=RegExTokenizerFile3

The search engine would then apply the rules defined in partTKNZR.txt to the region
OTPartNum, and the tokenizer rules in the file NoSpaceTokens.txt to RegionX. The
tokenizer files are constructed using the same rules as the default full text tokenizer.

Metadata Tokenizer Example 1
This relatively simple tokenizer uses the default character mappings (e.g. upper to
lower case). It does not replace punctuation with whitespace, and does not break
words into multiple tokens. Instead, the output is a single value preserving all
punctuation, but encoded using 4-grams except for dual-byte characters such as
Chinese, which are encoded using bi-grams. This approach to tokenization would be
appropriate for metadata regions that require efficient exact substring matching on
the unmodified values.
ranges {
gram4 0x9-0xe00 0xe2f 0xe3b-0xe3f 0xe4e-0x3004
0x3007-0x3040 0x3095-0x30a0 0x30ff-0x33ff 0x9fa6-0xabff
0xd7a4-0xf8ff 0xfa2e-0xfa2f 0xfa6b-0xfa6f 0xfada-0xff60
0xffa0-0xfffd
onechar 0x3005-0x3006 0xff61-0xff65
gram2 0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d 0x3041-0x3094
0x30a1-0x30fe 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d
0xfa30-0xfa6a 0xfa70-0xfad9 0xff66-0xff9d 0xff9e-0xff9f
}
words {
_NGRAM4 gram4+
onechar
_NGRAM2 gram2+
}
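As a worked illustration (the part number is hypothetical), an input value of
PART-10x would be preserved as the single lower-cased value part-10x and encoded
as the overlapping 4-grams part, art-, rt-1, t-10 and -10x, allowing efficient exact
matching of any substring of four or more characters.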

Metadata Tokenizer Example 2
This example differs from the previous example in one material way. All punctuation
and white space is mapped to a null character, leaving a dense set of characters. The
default conversion of ASCII to lower case still applies (not explicitly required). This
example also uses 3-gram encoding, which is useful in some exact substring matching
situations. An input of So-me&vAL/ue her?e will be reduced to somevaluehere,
then encoded with 3-grams (som ome mev eva val alu lue ueh ehe her ere).
mappings {
0x9=0x0
0xa=0x0
0xb=0x0
0xc=0x0
0xd=0x0
0xe=0x0
0xf=0x0
0x10=0x0
0x11=0x0
0x12=0x0
0x13=0x0
0x14=0x0
0x15=0x0
0x16=0x0
0x17=0x0
0x18=0x0
0x19=0x0
0x1a=0x0
0x1b=0x0
0x1c=0x0
0x1d=0x0
0x1e=0x0
0x1f=0x0
0x20=0x0
0x21=0x0
0x22=0x0
0x23=0x0

< thousands of null mappings omitted>

0xfffb=0x0
0xfffc=0x0
0xfffd=0x0
}
ranges {
gram4 0x9-0xe00 0xe2f 0xe3b-0xe3f 0xe4e-0x3004
0x3007-0x3040 0x3095-0x30a0 0x30ff-0x33ff 0x9fa6-0xabff
0xd7a4-0xf8ff 0xfa2e-0xfa2f 0xfa6b-0xfa6f 0xfada-0xff60
0xffa0-0xfffd
onechar 0x3005-0x3006 0xff61-0xff65
gram2 0xe01-0xe2e 0xe30-0xe3a 0xe40-0xe4d 0x3041-0x3094
0x30a1-0x30fe 0x3400-0x9fa5 0xac00-0xd7a3 0xf900-0xfa2d
0xfa30-0xfa6a 0xfa70-0xfad9 0xff66-0xff9d 0xff9e-0xff9f
}

words {
_NGRAM4 gram4+
onechar
_NGRAM2 gram2+
}

Administration and Optimization
This section covers administration APIs, maintenance, backups, scaling, and
performance optimization: how to get the most out of your OTSE installation.

Index Quality Queries
There are a number of search queries that may be used to test the quality of the data
in the search index. The details of each feature are described elsewhere in this
document. As a convenience, these are summarized to provide a quick reference for
testing index quality.

Index Error Counts
Search on the OTIndexError region to identify objects which had invalid metadata
values. The larger the count, the more errors an object has.

Content Quality Assessment
Search on the OTContentStatus region to find objects where the content could not be
correctly or completely indexed. Presenting this region as a facet can make
interpretation easier.

Partition Sizes
Search for a partition name in the OTPartitionName region to get a count of the
number of objects stored in a given partition.

Metadata Corruption
Search for -1 in the region OTMetadataChecksum to identify whether the metadata
for any objects is corrupt. This is only valid if the metadata checksum feature is
enabled.

Bad Format Detection
Search for “unknown” in the OTFileType region. Use information about the results to
adjust the format recognition settings in the Document Conversion Server. You can
then collect and re-index the objects with “unknown” file types.

Text Metadata Truncation
Search for “OTIndexLengthOverflow” in the region OTMeta. This identifies metadata
that was too long and was truncated. Truncated metadata may indicate applications
abusing field sizes, or regions that need the limit adjusted.

Text Value Truncation
Search for “OTIndexMultiValueOverflow” in the region OTMeta. This identifies
metadata that had too many values, and the additional values were discarded. This
can isolate applications abusing the multi-value text region feature, or identify regions
that need the default value increased.
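As an illustration, using the bracketed region query form shown later in this
document (the exact syntax accepted by your application front end may differ), the
checks above can be expressed as queries such as:
[region "OTFileType"] "unknown"
[region "OTMeta"] "OTIndexLengthOverflow"
[region "OTMeta"] "OTIndexMultiValueOverflow"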

Search Result Caching
Search queries have the potential to run for very long times, especially if the results
are being retrieved in chunks, and the consuming application takes time to process
each chunk of search results. This interactive process of retrieving results can
extend the duration of a search query indefinitely. An example of this in Content
Server might be “Collect all search results”.
When queries are active for prolonged periods, they consume threads in the Search
Engines, potentially preventing other queries from running. In addition, while a
search query is active, the Search Engines will not update their index to ensure that
the query remains transactionally complete, which may eventually cause indexing
operations to stall.
To mitigate this possibility, the Search Federator has the ability to cache search
results. This behavior is triggered by the query duration. Once a query transaction is
open longer than a defined time, the Search Federator will proactively request all
search results from all the Search Engines. The results will be written to temporary
disk storage. The transactions with the Search Engines are then closed, and
subsequent requests to fetch results for the transaction are serviced by the Search
Federator from the temporary storage. Temporary files are deleted upon transaction
completion or startup of the Search Federator. The amount of space required
depends on the number of active transactions cached, number of search results in a
transaction, and amount of region data returned in the results. For typical
applications, 1 GB of space should be more than adequate.
Caching provides the highest value in scenarios that retrieve all the results from the
search engine.
There are two configuration controls for this feature, both in the [SearchFederator_]
portion of the search.ini file. The time before caching determines the time in
milliseconds at which the Search Federator will decide to begin caching the search
results. This time should generally be long enough that caching is not triggered
every time a query is slow. The other setting defines the disk storage location for the
temporary files. This should be the path to a working folder that the Search
Federator can access. This temporary file location must be defined to enable
caching of search results, and is empty (disabled) by default.
SearchResultCacheDirectory=G:\cache
TimeBeforeCachingResultsInMS=180000

Query Time Analysis

Query time and throughput varies based on many factors. The first step in optimizing
search query behavior is understanding how time is being consumed during search
queries. To help with this, the Search Federator keeps statistical information about
query performance, which is written to the Search Federator log once per hour.
Using this data, you can assess whether changes to the system or configuration are
improving or degrading search performance.
The data is written in tabular form, such that you can copy it and paste it into a
spreadsheet as Comma Separated Values to make analysis easier. The log entries
have this form, with leading time stamps and thread data omitted:

:Search Performance Summary for 12;00 on 2015-05-28:
:Time, Query Count, Elapsed, Execution, Wait, SELECT, RESULTS, FACETS, HH, STATS:
:12, 3, 50476, 19786, 30690, 18537, 1234, 0, 0, 15:
:11, 2, 64533, 38444, 26089, 38051, 299, 47, 0, 47:
:10, 0, 0, 0, 0, 0, 0, 0, 0, 0:
:9, 0, 0, 0, 0, 0, 0, 0, 0, 0:
… (values for up to 24 hours)…
:Days Ago, Query Count, Elapsed, Execution, Wait, SELECT, RESULTS, FACETS, HH, STATS:
:1, 18, 728909, 35687, 27098, 36709, 376, 36, 0, 45:
… (values for up to 14 days) …

Reading left to right, the values are:
Time: the reporting time. Time is written on the hour. For hourly entries, this is the
end of the hour period, with values from 0 to 23. For daily entries, this is the number
of days ago (e.g. 1 is yesterday). Hourly values are gathered into a daily value at
midnight.
Query Count: the number of search queries processed in the time period.
Elapsed: the sum of total time for each query, from start to end of transaction. Divide
by the number of queries to get the average.
Execution: total active time used by the search engines and search federator to
perform the search.
Wait: time while the transaction is open, but with no tasks for the search engines.
Typically while Content Server is permission trimming.
SELECT: portion of Execution time running the SELECT portion of the query. This is
where the matching results are computed.
RESULTS: time spent fetching search results, typically a result of GET operations.
FACETS: time spent retrieving facets. Note that facets are computed during SELECT,
so a portion of facet generation time is not included here.
HH: time spent computing hit highlighting information.
STATS: time spent retrieving query analysis statistics. As with facets, a portion of the
time generating STATS data is performed during the SELECT phase.
The accuracy of the timing is limited by the system timer; on Windows, the resolution
is typically 15 or 16 ms. The times are in milliseconds, totaled across all transactions;
divide by the number of transactions to obtain averages. The statistics are not
persisted between restarts, so the data starts at zero after every startup of the search
grid. This information is written when the log level is set to status level or higher.
Data on a given query is collected when the query completes, so queries that cross an
hour or day boundary are reported for the time when the query finished.
This data is also available on demand through the admin interface using the
command: getstatustext performance
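If a spreadsheet is inconvenient, the hourly summary rows can also be extracted with
a short script. This is a minimal sketch in Python: the log file name is a placeholder,
and it assumes the summary rows are wrapped in leading and trailing colons, with
any leading time stamp and thread data already stripped, as in the sample above.

import csv

with open("searchfederator.log", encoding="utf-8") as log, \
     open("query_times.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for line in log:
        line = line.strip()
        # Keep only the delimited summary rows, e.g. ":12, 3, 50476, ...:"
        if line.startswith(":") and line.endswith(":") and line.count(",") >= 9:
            writer.writerow(field.strip() for field in line.strip(":").split(","))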

Administration API
In addition to a socket-level interface to support search queries, the search
components have a socket-level interface that supports a number of administration
tasks. Each component honors a different set of commands, and in some cases
components reply to the same command with different information. Commands that
make sense for an Index Engine may be irrelevant for the Search Federator.
This section outlines the most common commands and the components to which
they apply. The client making the requests is also responsible for establishing a
socket connection to the component. The configuration of the port numbers for the
sockets is controlled in the search.ini file.
You do not need to use this API for management and maintenance. Applications
such as Content Server leverage the Administration API to hide details of
administration and provide unified administration interfaces.
The examples below use a > (prompt) symbol to represent the command(s), followed
by the response. White space has been added in responses for readability.
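As a minimal sketch of such a client in Python (the host, port number, and
newline-terminated framing shown here are assumptions; consult your search.ini for
the actual admin ports), a command can be issued over a plain socket:

import socket

def admin_command(host, port, command, timeout=30):
    # Connect to the process's admin port (an assumed port number is
    # used below), send one command, and read the reply until the
    # server closes the connection.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall((command + "\n").encode("utf-8"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8")

print(admin_command("localhost", 2099, "getstatuscode"))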

stop
Stops the process as soon as possible. Applies to all processes.
> stop
true

getstatustext
In the Index Engine, this command returns information about uptime, memory use
and number index operations performed:
> getstatustext

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<LIVELINK_IEname0_STATUS>
<upTimeSeconds>303</upTimeSeconds>
<numberOfRequests>0</numberOfRequests>
</LIVELINK_IEname0_STATUS>
<MetadataUsagePercentage>1</MetadataUsagePercentage>
<ContentUsagePercentage>0</ContentUsagePercentage>
</stats>

In the Search Federator, getstatustext returns summary information about uptime
and requests. In addition, this call is used to obtain detailed information about the
current status of each metadata region. In the “moveable” section, each region
defined in the index is listed along with a status indicating whether it is moveable.
The moveable status essentially identifies text regions, which can be moved to other
storage modes (DISK versus RAM storage, for example).
There are sections for ReadWrite, NoAdd and ReadOnly. In these sections, every
text (moveable) region is listed. In this example, the partition is in Read-Write mode,
so the regions are listed in the ReadWrite section. For each text region, an estimate
of the amount of memory currently used by the region, and the amount of memory
that would be used if the region was changed to other storage modes is provided.
Note that these are ESTIMATES and should not be used to accurately compute
memory requirements.
> getstatustext

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<LIVELINK_SFname0_STATUS>
<upTimeSeconds>363</upTimeSeconds>
<numberOfRequests>1</numberOfRequests>
</LIVELINK_SFname0_STATUS>
<partitionMemInfo>
<moveable>
<OTWFMapTaskDueDate>false</OTWFMapTaskDueDate>
<PHYSOBJDefaultLoc>false</PHYSOBJDefaultLoc>
<OTWFileName>true</OTWFileName>
</moveable>
<ReadWrite>
<OTSomeRegion>
<sizeInMemory>684</sizeInMemory>
<sizeInMemoryKB>0</sizeInMemoryKB>
<sizeOnDisk>0</sizeOnDisk>
<sizeOnDiskKB>0</sizeOnDiskKB>
<sizeOnRet>0</sizeOnRet>
<sizeOnRetKB>0</sizeOnRetKB>
</OTSomeRegion>
</ReadWrite>
<NoAdd>
</NoAdd>
<ReadOnly>
</ReadOnly>
</partitionMemInfo>
</stats>
In the Search Engines, this command is used to obtain basic uptime and number of
search queries performed since startup.
> getstatustext

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<LIVELINK_SEname0_STATUS>
<upTimeSeconds>580</upTimeSeconds>
<numberOfRequests>1</numberOfRequests>
</LIVELINK_SEname0_STATUS>
</stats>
Within the Update Distributor, the getstatustext command is used to obtain uptime,
and statistics about the number of IPool messages processed.
> getstatustext

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<LIVELINK_UpDist1_STATUS>
<upTimeSeconds>619</upTimeSeconds>
<requests>
<numberOfRequests>0</numberOfRequests>
</requests>
<IPoolTransactions>
<NumberCommitted>0</NumberCommitted>
<AverageTime>NaN</AverageTime>
<RunningStdDev>NaN</RunningStdDev>
<MaxTime>4.9E-324</MaxTime>
<MinTime>1.7976931348623157E308</MinTime>
</IPoolTransactions>
<ForcedCheckpoint>
<InForcedCheckpoint>STATUS</InForcedCheckpoint>
<TotalPartitionsToCheckpoint>X</TotalPartitionsToCheckpoint>
<PartitionsInCheckpoint>Y</PartitionsInCheckpoint>
<PartitionsFinishedCheckpoint>Z</PartitionsFinishedCheckpoint>
</ForcedCheckpoint>
</LIVELINK_UpDist1_STATUS>
</stats>
The ForcedCheckpoint section identifies how many partitions are busy writing
checkpoints. The possible values for STATUS are:
No Checkpoint Command
Checkpoint pending
Checkpoint in progress
If STATUS is not “Checkpoint in progress”, then X, Y and Z are 0. Otherwise, these
values represent the number of partitions in various stages of writing checkpoint files.

With the Search Federator, a variation of getstatustext can be used to retrieve data
about search query performance. The interpretation of the values is outlined in the
section entitled “Query Time Analysis”.
> getstatustext performance

<?xml version="1.0" encoding="UTF-8"?>

<performance>
<hours>
<hour>
<hourNumber>13</hourNumber>
<numQueries>1</numQueries>
<elapsed>71305</elapsed>
<execution>1149</execution>
<wait>70156</wait>
<SELECT>376</SELECT>
<RESULTS>773</RESULTS>
<FACETS>0</FACETS>
<HH>0</HH>
<STATS>0</STATS>
</hour>
<hour>
<hourNumber>12</hourNumber>
<numQueries>4</numQueries>
<elapsed>149954</elapsed>
<execution>100071</execution>
<wait>49883</wait>
<SELECT>99761</SELECT>
<RESULTS>201</RESULTS>
<FACETS>16</FACETS>
<HH>0</HH>
<STATS>93</STATS>
</hour>
</hours>
</performance>

Similarly, the Update Distributor can provide accumulated statistics about indexing
throughput and errors with “getstatustext performance”. First introduced in 20.4,
the output is in XML form and includes the same data that is written to the logs on an
hourly basis.
<?xml version="1.0" encoding="UTF-8"?>
<performance>
<hours>
<hour>
<hourNumber>8</hourNumber>
<AddOrReplace>0</AddOrReplace>
<AddOrModify>0</AddOrModify>
<Delete>0</Delete>
<DeleteByQuery>0</DeleteByQuery>
<ModifyByQuery>0</ModifyByQuery>
<Modify>0</Modify>
… (additional counters omitted) …
</hour>
</hours>
</performance>

Starting with the 2015-09 update, a new option for getstatustext will return a subset
of information, faster. The “basic” variation reduces the time needed by Content
Server to display partition data. The subset of data was specifically selected to meet
the needs of the Content Server “partition map” administration page. When basic is
used, the status and size of partitions is retrieved from cached data, and only
updated during select indexing operations such as “end transaction”. While
technically the information could be slightly incorrect, it is accurate enough for
practical purposes. If there is no cached data, then the slower methods are used –
querying each index engine for data.

> getstatustext basic

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<IEname0>
<status>12</status>
<MetadataUsagePercentage>11</MetadataUsagePercentage>
<ContentUsagePercentage>31</ContentUsagePercentage>
<Mode>ReadWrite</Mode>
<Behaviour>Normal</Behaviour>
</IEname0>
</stats>

For the Index Engines, there is new data in this response. Percentage full is
presented in two ways: one for text metadata, and one for usage of the allocated disk
space of the index. The Behaviour value represents the "soft" modes of a read/write
partition – update only, rebalancing. Sample responses from the other search
processes are shown below, returning the same codes as a "getstatuscode"
command.
<?xml version="1.0" encoding="UTF-8"?>
<stats>
<UpDist1>
<status>135</status>
</UpDist1>
</stats>

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<SFname0>
<status>12</status>
</SFname0>
</stats>

<?xml version="1.0" encoding="UTF-8"?>


<stats>
<SEname0>

<status>12</status>
</SEname0>
</stats>

getstatuscode
This function is used to determine if a process is ready, in error, or starting up.
Starting up is generally the status while an index is being loaded.

> getstatuscode

12

getstatuscode response values:

All Processes
10   Running, but not yet ready. Usually when loading an index.
-11  An error condition exists
12   Ready

Index Engine Codes from 301 to 500
301  Looking for Update Distributor

Update Distributor Codes from 129 to 300
131  Polling for transaction
133  Done
134  Waiting for partitions to be added (RMI mode)
135  Waiting for index engines
137  Contacting index engines

Search Engine Codes from 501 to 700
501  Registering with Search Federator
502  Waiting for initial index
503  Initializing search engine index

Search Federator Codes from 701 to 900
701  Waiting for search engine

registerWithRMIRegistry
For all processes, this command forces a reconnection with the RMI Registry, and
reloads the remote process dependencies. This is useful for resynchronizing after
some types of configuration changes without needing to restart the processes. If the
search grid is configured to not use RMI, this command is ignored.
> registerWithRMIRegistry

received ack

checkpoint
The checkpoint function is issued to the Update Distributor to force all partitions to
write a checkpoint file. This is especially useful as part of a graceful shutdown
process. If large metalogs are configured, the time to replay the metalogs during
startup can take a long time. Forcing checkpoints shortly before shutdown eliminates
metalogs and can dramatically improve startup time. After issuing the checkpoint
command, the Update Distributor waits for a number to be provided. The number is
a percentage, representing the threshold over which a checkpoint should be written.
For example, if a checkpoint is normally written when metalogs reach 200 Mbytes, a
value of 10 means that a checkpoint should be immediately forced if the metalog has
reached 20 Mbytes in size. The same logic applies for other checkpoint triggers,
such as number of new objects or number of objects modified. Any value other than
an integer from 0 to 99 will simply abort the command.
> checkpoint
> 10

true

reloadSettings
This command applies to all processes. Some, but not all, of the search.ini settings
can be applied while the processes are running, and some can only be applied when
the processes first start. This command requests that the process reload settings. A
list of reloadable settings is included near the end of this document.
> reloadSettings

received ack

getsystemvalue
Used to obtain specific values from the Index Engine. Currently, there are only two
keys defined. ConversionProgressPercent returns the percentage complete when an
index conversion is taking place. A "ping" operation to check that the process is
responding is also available. This command is different from the others in that it
requires two separate submissions: the first is the command and the second is the
key.
> getsystemvalue
> marco

polo

> getsystemvalue
> ConversionProgressPercent

36

addRegionsOrFields
This command applies to the Update Distributor only, and can be used to dynamically
add a region definition. Once added to an index, regions are generally sticky. The
LLFieldDefinitions.txt file is not updated, so using this command may cause drift
between the index and the LLFieldDefinitions.txt file. This discrepancy is not a
problem, but should be kept in mind in support situations.
The syntax requires exactly one TAB character after the type and before the region
name. This command waits for additional lines of definitions until an empty line is
sent, which terminates the input mode. The function returns true on completion.
> addRegionsOrFields
> text flip
> integer flop
>

true

runSearchAgents
Update Distributor only. Instructs the Update Distributor to run all of the search
agents which are currently defined against the entire index. Results are sent to the
search agent IPool.
> runsearchagents

true

runSearchAgent
Update Distributor only. Instructs the Update Distributor to run a specific search
agent. The search agent named must be correctly defined in the search.ini file.
Results are sent to the search agent IPool. This command expects one line with the
search agent after the command.
> runsearchagent
> bob

true

runSearchAgentOnUpdated
Update Distributor only. Instructs the Update Distributor to run the specific search
agents listed. Time is based on the values in the upDist.N file, and the timestamp
is updated (see Search Agent Scheduling). Requests are added to a queue and may
require some time to complete. Results are sent to the search agent IPool.
> runsearchagentonupdated
> MyAgentName
> AnotherAgent

true

runSearchAgentsOnUpdated
Update Distributor only. Instructs the Update Distributor to run all the search agents.
Time is based on the values in the upDist.N file, and the timestamp is updated (see
Search Agent Scheduling). Requests are added to a queue and may require some
time to complete. Results are sent to the search agent IPool.
> runsearchagentsonupdated

true

Server Optimization
There are many performance tuning parameters available with OTSE. There is no
single perfect configuration that meets all requirements. You can optimize for
indexing performance or query performance. There are tradeoffs between memory
and performance, and many external parameters can affect the OTSE behavior. In
this section we examine some of the most common options for system tuning. The
focus here is on administration and configuration tuning, not on application
optimization.

Metadata Region Fragmentation
When metadata values are modified, fragmentation of the memory used to store the
metadata takes place. In a typical system, this fragmentation will slowly increase the
memory used to store metadata over a period of days or weeks.
To combat this, OTSE includes a metadata memory defragmentation capability. By
default, this is scheduled to run monthly or nightly, depending on the metadata
storage methods being used. For most applications, this will be sufficient to prevent
any material memory loss.
With Low Memory configuration, fragmentation is much less pronounced, and the
defragmentation impact is also smaller. Starting with the 16.0.3 update,
defragmentation is restricted to running only on the first Sunday of each month. If
you require nightly defragmentation, there is a setting in the [DataFlow_] section that
can enforce this:
DefragmentFirstSundayOfMonthOnly=0

If your use of OTSE includes high volumes of indexing and metadata updates, then
fragmentation may occur more quickly. You can consider modifying the configuration
settings to run the defragmentation several times per day. While defragmentation is
happening, there will be short periods, typically a few seconds at a time, where
search query performance is degraded. In practice, we find that Low Memory Mode
without daily defragmentation provides the best indexing throughput.
The tuning parameters typically do not require adjustment unless you are
experiencing extraordinary levels of memory fragmentation. Within the
search.ini_override file, in the [DataFlow] section, the following settings can be
added to make adjustments if necessary:
DefragmentMemoryOptions=2
DefragmentSpaceInMBytes=10
DefragmentDailyTimes=2:30
Defragmentation times can be a list in 24 hour format (for example, 2:30;14:30) to run
multiple times per day. Space is the maximum temporary memory to consume while
defragmenting in MB; the larger the value, the faster defragmentation runs – up to a
limit based on the size of the largest region. To completely disable defragmentation,
set the DefragmentMemoryOptions value to 0. Setting the options value to 1 is not
recommended – it enables aggressive defragmentation, whereby all regions are
defragmented without relinquishing control to allow searches while defragmentation
occurs.
There are two other defragmentation settings that you will normally not need to
adjust:
DefragmentMaxStaggerMinutes=60
DefragmentStaggerSeedToAppend=SEED
If you have multiple search partitions, each partition will randomly select a
defragmentation start time up to “MaxStaggerMinutes” after the specified daily
defragmentation time. The purpose of this is to distribute CPU load randomly if you
have many partitions. The SEED value is a string used to seed the random number,
and is available to change if for some reason the default string “SEED” produces start
times which cluster too tightly. It is unlikely you will need to provide an alternative
string.

Partition Metadata Memory Sizing
When you configure a partition, one of the key settings is the amount of memory that
should be reserved for metadata. As the partition accepts more objects, it consumes
this memory. The memory usage is typically measured as a percentage of the
allocated memory, and generally referred to as “percent full”.
If your system will encompass millions of objects managed, the chances are good
that you will need multiple partitions. The number of partitions you require is based
upon many variables; the one we will consider here is the amount of memory that you
allocate to metadata in each partition.
Before delving too deeply into the alternatives, a note about 32 bit environments is in
order. Content Server 9.7.1 for Windows is deployed by default within a 32 bit Java
environment. The 32 bit environment restricts the amount of memory that a single
process can consume to about 1.3 gigabytes. Once you factor out
memory needed for other purposes, the practical upper limit for memory that can be
reserved for metadata is about 1 gigabyte. Customers using Content Server on
Solaris, which uses a 64 bit JVM, have reported success using larger partition sizes,
up to 3 gigabytes.
Assuming a 64 bit Java environment, such as Content Server 10.5 or 16, you can set
the partition sizes larger. Because of the number of variables, there is no simple
optimal size which is always correct. For systems which cannot contain the entire
index within a single partition, larger partition sizes are synonymous with fewer
partitions. Here are some of the tradeoffs:
• The memory overhead for a partition is more or less constant, regardless of the
partition size. Larger partitions are therefore more efficient in terms of memory
use, which can reduce the overall cost of hardware.
• In operation, partitions engage in high levels of disk access. Typically, fewer
partitions will result in more efficient use of the available disk bandwidth.
• During indexing, the Update Distributor will balance the load over the available
index engines. If high indexing performance is a key requirement, more
partitions may be preferable.
• For search queries returning small numbers of results (typical user searches),
fewer partitions are more efficient. This is typical of most Content Server
installations.
• Some specific types of queries are slow, and their performance is based on the
number of text values in the partition dictionary; smaller partitions are therefore
faster. If regular expression (complex pattern) queries on text values stored in
memory are common for your application, then smaller partitions may be a better
choice.
• A small partition would reserve about 1 gigabyte of RAM for metadata. A very
large partition would be about 8 gigabytes. Experimenting with intermediate
sizes before configuring a large partition is strongly recommended.

It is easy to make a small partition larger by changing the configuration, but making
a large partition smaller is more complex and may require some level of re-indexing.
Don't use large partitions until you are confident that they are appropriate for your
environment.
With Low Memory mode, the number of items that can be stored in a partition is
considerably larger than when memory storage modes are configured. Putting aside
all the caveats about performance variations, for new systems that are expected to
become relatively large, our reference for development is:
• Low Memory Mode configuration
• 2 GB Partition Memory configuration
This configuration should handle up to 10 million Content Server objects with
reasonable performance.
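As a minimal sketch (one setting only; a real partition definition includes more), the
2 GB reference memory size corresponds to the metadata size threshold described in
the next section:
MaxMetadataSizeInMBytes=2000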

Automatic Partition Modes
To minimize the intervention needed by an administrator to monitor and manage
search partitions, OTSE has the ability to automatically change the mode of operation
as the partition fills with data. There are three effective modes of operation for a
read-write partition: normal, update and rebalance. These modes are selected based
on the size of a partition as measured by metadata memory use.
Memory Usage Mode Switching
In normal operation, the partition will accept any operations. As the partition is filled,
eventually it will cross an “update only” threshold. Above this threshold, the partition
will not accept new objects for indexing, although it will continue to accept updates to
objects already indexed within the partition. If the percent full falls below this
threshold, the partition will once again accept new objects. This can happen if
objects are deleted or partition settings are changed.
If the partition continues to fill, eventually it will reach a “rebalance” threshold. In
rebalancing mode, updates to objects will cause them to be moved to other partitions,
as determined by the Update Distributor. Rebalancing continues until the partition
falls below the “stop rebalancing” percent full threshold. Rebalancing ensures that
the partition does not exceed the available memory, but it is an expensive process,
and should be considered a safety mechanism of last resort. Reserving sufficient
‘update only’ memory will minimize the need for rebalancing.
(Figure: percent full memory usage and thresholds for read-write partitions.)
Currently, very conservative default values are used: 80% full for rebalancing and
77% for the stop rebalancing threshold, which reflects the amount of memory
typically used by existing Content Server customers.
Selecting a suitable threshold for update-only mode requires a little more thought,
and depends upon your expected use of the search engine. The default value with
Content Server is a setting of 70%, which reserves 10% of the space for metadata
changes. Some considerations for adjusting this setting include:
• If your system has applications or custom modules known to add significant
new metadata to existing objects, you should allow more space for updates.
• Archival systems which rarely modify metadata can reduce the space
reserved for updates. Note that Content Server Records Management will
often update metadata when activities such as holds take place, even with
archive applications.
Note that these values are representative for traditional partitions with 1 GB of
memory for metadata. If you are using a larger partition, then reserving less space
for updates and rebalancing may be appropriate. The best practice is to periodically
review the percent full status of your partitions, and adjust the partition percent full
thresholds based upon your actual usage patterns.
The values in the search.ini file that define the various thresholds are:
MaxMetadataSizeInMBytes=1000
StartRebalancingAtMetadataPercentFull=99
StopRebalancingAtMetadataPercentFull=96
StopAddAtMetadataPercentFull=95
WarnAboutAddPercentFull=true
MetadataPercentFullWarnThreshold=90
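For example, with MaxMetadataSizeInMBytes=1000 and the defaults above, warnings
begin at 900 MB of metadata (90%), new objects are refused at 950 MB (95%), and
rebalancing starts at 990 MB (99%) and continues until usage falls below 960 MB
(96%).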

Disk Usage Mode Switching
The previous section describes automated mode selection based on memory used
for metadata. A similar capability exists for switching modes based on disk usage.
The method is identical to the metadata memory scenario, except that percent full is
measured relative to the amount of disk space used to store the index.
The amount of space needed to represent the index changes in size as metalogs are
consumed into checkpoints, or text index files are merged. Merge operations may
temporarily require twice the disk space used by the partition. This can be addressed
by keeping the maximum used space relatively low, or enabling Merge Tokens.
The maximum allowable disk usage for a partition is specified in MB, the thresholds
are set as percentages relative to this value. The default values are shown below.
MaxContentSizeInMBytes=50000
StartRebalancingAtContentPercentFull=95
StopRebalancingAtContentPercentFull=92
StopAddAtContentPercentFull=90
WarnAboutAddPercentContentFull=true
ContentPercentFullWarnThreshold=85

Selecting a Text Metadata Storage Mode
As described elsewhere in this document, there are several storage modes available
for text metadata, each with relative strengths. To summarize:
• Memory Storage (RAM) provides the fastest retrieval of metadata values, but
consumes the most memory.
• Value Storage (DISK) reduces the memory required by storing the Text metadata
values on disk, but keeps the Text metadata index in memory.
• Low Memory mode (DISK) moves the Text Metadata index to disk, dramatically
reducing the memory requirements.
• Merge File mode places the Text Metadata values in a separate set of files that
are merged in background processes. This mode is standard for Content Server
16.
• Retrieval Storage (DISK_RET) uses the least memory, storing the values on disk,
and discarding the index entirely, making the values non-searchable.
Use the FieldModeDefinitions.ini file to choose optimal settings for your application.
In addition to allowing specification of each field individually, this file can also be
used to set a default storage mode to be applied unless otherwise indicated.
For the majority of applications, Disk Storage with Low Memory Mode and Merge File
mode enabled is probably the optimal setting, and is certainly the configuration that
will provide the highest possible search indexing throughput. Retrieval Storage is
usually indicated for the Hot Phrases and Summaries regions (OTHP and
OTSummary).
Note that if you fill a partition in a low memory mode, you may not have enough
space later to convert to a higher memory usage mode. For example, if the partition
memory is 80% full with text regions in DISK mode, it is unlikely that you will be able
to switch the default setting to RAM mode unless some regions are removed or the
partition size is increased.

Content Server customers: remember you shouldn't edit this file directly, since there
are administration pages within Content Server that allow you to manage these
settings, and Content Server will over-write the FieldModeDefinitions.txt file.

High Ingestion Environments
Some applications are focused on making large amounts of data searchable and
consider the indexing performance to be the key factor to optimize. There are a
number of specific considerations for tuning a search grid in these situations.
The first recommendation is the use of Low Memory mode for text metadata storage,
since high ingestion rates drive large search grids, and Low Memory mode minimizes
the number of partitions that will be required. Secondly, use Merge File mode for
Text Metadata index storage, which reduces the Checkpoint size, since Checkpoint
writes consume a considerable portion of the total Index Engine time.
Update Distributor Bottlenecks
After every indexing batch transaction completes, the Update Distributor records
performance metrics in its log file if the log DebugLevel is set to info or status. This
information can shed light on where time is spent during the indexing process, and
help guide optimization. A typical performance record looks like this:

1398979250175:IPoolReadingThread:4:Timing info (counts). Total time 175255285 ms.
Start Transaction 11106708 ms (5513). End Transaction 7486279 ms (5513). Checkpoint
114769662 ms (1639). Local Update 12225821 ms (23161). Global Update 16140505 ms
(23161). Idle 9216366 ms (264). IPool Reading 1077868 ms (28674). Batch Processing
43376 ms (23161). Start Transaction and Checkpoint 0 ms (0). :

The times are cumulative since the Update Distributor was started. Each entry has
the form:

Category N ms (count).

Update Distributor Categories:

Total Time Total uptime of the Update Distributor – this includes the start-up time
that is not included in any other category – hence it will be larger than
the sum of the other categories.

Start Transaction Time the Update Distributor spends waiting for the Index Engines to be
ready to start a transaction.

End Transaction Time the Update Distributor spends waiting for a transaction to end,
excluding time to write checkpoint files. Too much time in this category
may indicate an excessive amount of time is spent running search
agents (for Content Server, usually Intelligent Classification or
Prospectors).

Checkpoint Time the Update Distributor waits for the Index Engines to write
checkpoint files. Large percentages of time here suggest that
checkpoints are created too frequently, or the storage system is under-
powered. Metalog thresholds can be adjusted to reduce the frequency
of checkpoint writes.

Local Update Time the Update Distributor is working with the Index Engines to update
the search index. This is useful time. It is common for this value to
remain below 15% of the time even when a system is performing well.

Global Update Time in which the Update Distributor is interrogating the Index Engines
prior to initiating the local update steps. A typical purpose is to establish
which Index Engine should receive a given indexing operation. Long
times here may indicate that Update Distributor batch sizes are too
small.

Idle The amount of time the update distributor is idle – it has completed all
the indexing it can, and is waiting for new updates to ingest. A high
percentage of time idle indicates that OTSE has additional capacity. If
indexing is slow and there is sufficient idle time, the bottlenecks likely
exist upstream in the indexing process (DCS, Extractors or DataFlow
processes). Note that you should always have some idle time, since the
demand on indexing throughput is not constant.

IPool Reading The amount of time the Update Distributor spends reading indexing
instructions from the disk. In general, this should be relatively small
compared to measurements such as Local Updates. If not, it may
indicate poor disk performance for the disk hosting the input IPools.

Batch Processing The amount of time planning how to proceed with the local update. This
value should be very small as a percentage of global update time.

Start Transaction and Checkpoint  Older systems, using RMI mode, could not
differentiate between time spent writing checkpoints and time spent
starting a transaction. Therefore, on these systems those two operations
are grouped into a single category. A properly configured system should
have a value of 0 in this field.

Parsing Time spent parsing metadata.

Backups Time spent performing backups.

Search Agents Time spent running search agent queries. Does not apply when
configured to use the older method of running agents after every index
transaction.

Network Problems The values NetIO1 through NetIO5 capture the number of times 1 to 5
retries were needed to read or write to network IO. The NetIOFailed
counts the number of times IO failed after 5 retries.

The Update Distributor also keeps statistical summaries of performance for up to 1
week. Once per hour, a summary of the data is written to the Update Distributor log
file. The first line of output is the hour interval for which the data was collected.
Search for “Index Performance Summary” to locate this data.
The second line is a list of comma-separated titles. Subsequent lines contain the
data points, up to 24 hourly values (the list resets at midnight) and up to 7 lines for
daily summaries. These lines can be copied and pasted into a spreadsheet as
comma separated values for easier readability, analysis and charting. Selected
values in the table include:
Operation Counts
The number of IPool messages processed for each of the operation types:
AddOrReplace, AddOrModify, Delete, DeleteByQuery, ModifyByQuery, and
Modify. A count of the number of IPools processed is also included.
Percentage Times
The time spent as a percentage performing various operations, per the Update
Distributor Categories table above. Idle time is a key measurement, indicating
whether the indexing system has sufficient capacity.
Backup Times
This information can be used to verify that backups are occurring, and ensure
they are completing in a reasonable time.
Agent Times
Agents are search queries run against data during indexing, generating IPools for
ingesting into Content Server. Content Server uses agents for Classification and
Prospectors. Too many agents, or complex/expensive agents, can materially
affect indexing throughput.
NetIO Stats
Keeps track of network retries and errors. Includes disk errors, most of which are
network attached. Non-zero numbers here can indicate hardware issues.

Checkpoint Writing Thresholds


In the Dataflow section of the search.ini file are settings that instruct the Index
Engines to create Checkpoint files when certain conditions are met. For Memory-
based partitions (not Low Memory mode) the default settings are that Checkpoints
are generated when the metalog grows to 16 MB, or when 5,000 objects are added,
or when 500 objects are modified.
[Dataflow sections]
MetaLogSizeDumpPointInBytes=16777216
MetaLogSizeDumpPointInObjects=5000
MetaLogSizeDumpPointInReplaceOps=500

Because the characteristics of Low Memory mode are different, these values can be
adjusted upwards significantly, perhaps to 100 MB, or 50,000 new objects or 10,000
objects modified. In order to maintain backwards compatibility and mixed mode
operation, OTSE has a separate set of Checkpoint Threshold configuration settings
for Low Memory Mode:
MetaLogSizeDumpPointInBytesLowMemoryMode=100000000
MetaLogSizeDumpPointInObjectsLowMemoryMode=50000
MetaLogSizeDumpPointInReplaceOpsLowMemoryMode=5000

Throughput normally increases with larger values because the number of times that
Checkpoints are created decreases. At the same time, this increases the likelihood
that many partitions will need to create checkpoint files at the same time. This may
place a high load on your disk system, and stall indexing for longer periods when
Checkpoint writes happen.
Larger values mean that more data is kept in the metalog and accumlog files instead
of in the Checkpoint. Larger metalog files require more time to consume during the
startup process for Index Engines or Search Engines. In most cases, this is a one-
time penalty and is acceptable.
When checkpoints are written, the Update Distributor writes lines to the log file that
indicate progress against each of the three configuration thresholds for each partition
that will write a checkpoint. Reviewing these lines can help you understand where
adjustments may be appropriate. The log lines look like this:

1399063311301:main:5:Added partition ZZZZ to checkpointing list:
1399063311301:main:5:with metalog size ratio 209715200/209715200=1.0:
1399063311301:main:5:with metalog object ratio 55321/50000:
1399063311301:main:5:with metalog replace operation ratio 0/5000:
When using Merge File storage mode, there are analogous settings that manage the
behavior of the background merge process:
MODCheckMode=0
MODCheckLogSizeDumpPointInBytes=536870912
MODCheckMergeThreadIntervalInMS=10000
MODCheckMergeMemoryOptions=0

Set the CheckMode to 1 to enable use of metadata Merge File mode. The LogSize
determines how large the CheckLog files may become before a merge operation is
triggered, and defaults to 512 MBytes. The MergeThreadInterval determines how
often the Index Engines check to see if a merge should be performed, with a default
of 10 seconds. The MemoryOptions default is optimized to minimize memory use;
setting this value to 1 uses perhaps 100 MB of additional RAM per partition for a
relatively small performance increase while performing merge operations.
Index Batch Sizes
The Update Distributor breaks input IPools into smaller batches for delivery to Index
Engines. The default is a batch size of 100. For Low Memory mode, this can be
higher, perhaps 500. Since the batch size is distributed across all the Index Engines
that are currently accepting new objects, the batch size can be further increased if
you have many partitions. A guideline might be 500 + 50 per partition. Larger
batches result in less transaction overhead.
[Update Distributor section]
MaxItemsInUpdateBatch=500
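For example, under the guideline above a grid with 12 partitions might use
500 + 50 × 12 = 1100. The figure is an illustration of the guideline, not a measured
optimum:
MaxItemsInUpdateBatch=1100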

Note that the batch size is also limited by the number of items in an IPool. Often, the
default Content Server maximum size for IPools is about 1000, so this may also need
to be modified to take full advantage of increases in the Update Distributor batch
size.
Starting with 20.3, batches are also split when the total size of the metadata plus text
in the objects to be indexed exceeds a defined threshold. The default is 10 MB, but
can be set higher if indexing large objects is common. This has been seen when
indexing email that has distribution lists with thousands of recipients. In the
[Dataflow_] section:
MaxBatchSizeInBytes=20000000

Prior to 20.3, the splitting of batches based on size used a different approach, where
the total size of the metadata of the objects in the batch could not exceed half of the
content truncation size (typically 5 MB).
There is another configuration setting that enables an optimization added in 16.2.2
related to how batches are handled. When processing ModifyByQuery or
DeleteByQuery operations, each request is sent to every Index Engine separately. In
practice, there are often many such contiguous operations in an IPool. The
optimization bundles these contiguous operations into a single communication to
each Index Engine, reducing the coordination overhead. By default, this optimization
is enabled, and can be controlled in the [DataFlow] section of the search.ini file:
GroupLocalUpdates=true

Partition Biasing
Research has shown that there is a strong correlation between the number of
partitions used for indexing and the typical indexing throughput rate. As expected,
more partitions improve parallel operation and increase the throughput. However,
the transaction overhead per partition is relatively fixed, and the batches become
small and fragmented when the operations are distributed to many partitions.
Depending on hardware, the optimal indexing throughput is usually in the range of 4
to 8 partitions.
To enable indexing in this optimal range for large search grids, there is a feature in
OTSE that restricts indexing of new objects to a specified number of partitions. For
example, you may have 12 partitions, but want to only fill 5 at a time for optimal
throughput. This is called partition biasing, and is set in the [Dataflow section]:
NumActivePartitions=5

The default value is 0, which disables partition biasing. Biasing only applies to new
objects being indexed. Updates to existing objects are always sent to the partition
that contains the object, regardless of biasing. For biasing purposes, a partition is
considered “full” when it reaches its “update only” percent full setting. The algorithm
for distributing new objects across active partitions is based upon sending objects
with approximately similar total sizes of full text and text metadata.
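As an illustration of the distribution idea (a simplified Python sketch of one plausible
heuristic, not the actual OTSE algorithm), new objects could be routed to whichever
active partition has accumulated the least text so far:

def choose_partition(accumulated_bytes):
    # accumulated_bytes[i] is the total full text plus text metadata (in bytes)
    # already routed to active partition i; pick the least-loaded partition
    return min(range(len(accumulated_bytes)), key=lambda i: accumulated_bytes[i])

loads = [10_000, 4_000, 7_500]    # three active partitions
target = choose_partition(loads)  # index 1 is least loaded
loads[target] += 2_048            # account for the new object's text size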
During an indexing performance test at HP labs in the summer of 2013, a brief test of
indexing throughput versus the number of partitions was performed. At the time, the
index contained about 46 million objects. There was plenty of spare CPU capacity,
and a very fast SAN was used for the index. In this particular test, the throughput
peaked around 12 partitions.

Parallel Checkpoints
Another index throughput adjustment setting is control over parallel checkpoints.
When a partition completes an indexing batch, it checks to see if the conditions for
writing a Checkpoint have been met. If so, then all partitions are given the
opportunity to write Checkpoint files. The logic is that if at least one partition is
stalled, then any partition that might need to write a Checkpoint soon should do it
now. However, when large numbers of partitions write Checkpoints at the same time,
disk or CPU capacity can be saturated, causing dramatic performance degradation
while the Checkpoints are written. The parallel Checkpoint
control lets you specify the maximum number of partitions that are allowed to write a
Checkpoint at the same moment. If more need to write Checkpoints, they must wait
until a slot is freed up by a Checkpoint write completing in another partition. You
should only need to adjust this if thrashing due to parallel Checkpoint writing is
suspected as a problem. The setting is disabled by default, and is configured in the
[Dataflow] section of the search.ini file:
MaximumParallelCheckpoints=8

Testing for Object Ownership


Beginning with 16.2.3 (December 2017), a new optimization is available for
ModifyByQuery and DeleteByQuery operations. For these operations, the Update
Distributor broadcasts the operation to every Index Engine. The Index Engines then
run the query to determine which object(s) match the query criteria.
The most common query criterion with Content Server is a match on a single object ID,
having the form: [region "OTObject"] "DataId=445195828". The optimization is to
recognize this specific form of the query, and instead of running a query, the DataId is
hashed and tested against a Bloom Filter. Only the partitions that pass the Bloom
Filter test go on to perform the query to match the DataId. This optimization has the
highest impact when there are many partitions.
The Bloom Filter is enabled by default, and typically requires about 8 MB of memory
for a partition that contains 10 million objects. Smaller partitions use less memory,
and the Bloom Filters are recomputed as partitions grow in size. When enabled, the
Bloom Filters have a minimum size of 2 MB, and maximum size of 128 MB. Bloom
Filter data is not persisted, which means that an Index Engine will typically require
one or two additional seconds per million objects to compute the Bloom Filter data
during start up.
When objects are deleted from a partition, their corresponding bits in the Bloom Filter
are not removed. Eventually, if many objects are deleted, this will result in higher
false positive responses, thereby reducing the degree of optimization. A restart of the
Index Engine is needed to rebuild the Bloom Filter.
There are several configuration settings for tuning the behavior of Bloom Filters in the
[Dataflow] section of the search.ini file. In general, the defaults are designed to stay
below a false positive rate of about 3%. The LogPeriod determines how frequently
statistics about the performance of the Bloom Filter are written to the log file.
AutoAdjust is recommended. The test to see whether an adjustment is needed (a
resize and recompute of the Bloom Filter) is governed by the MinAddsBetweenRebuilds
value. If AutoAdjust is disabled, then you are responsible for setting the NumBits and
Number of Hash Functions values, which are ignored when AutoAdjust is enabled. The
default settings are good to about 10 million items. Consulting public sources on the
math behind sizing and false positive rates would be advisable in this case.
LogPeriodOfDataIdQueries=1000
NumBitsInDataIdBloomFilter=67108864
NumDataIdHashFunctions=3
AutoAdjustDataIdBloomFilterSize=true
AutoAdjustDataIdBloomFilterMinAddsBetweenRebuilds=1048576
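For reference, the widely published Bloom Filter approximation can be used to
sanity-check these values. A minimal Python sketch (this is the standard textbook
formula, not OTSE source code):

import math

def false_positive_rate(num_bits, num_hashes, num_items):
    # Standard approximation: p = (1 - e^(-k*n/m))^k
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes

# Default settings above: 67108864 bits (8 MB) and 3 hash functions
print(false_positive_rate(67108864, 3, 10_000_000))  # roughly 0.05 at 10 million items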

To completely disable Bloom Filters, AutoAdjust should be set to false, and the
Number of Hash Functions should be set to 0.
A further optimization was added in version 20.4, in which a quick single-token
search for the data ID is performed to get a short list of objects, which are then tested
for the phrase match. This is considerably faster, since phrase searches are much
slower than single-token searches. The fast lookup can be disabled if
necessary in the [Dataflow_] section of the search.ini file:
DisableDataIdPhraseOpt=true

Compressed Communications
There is a configurable option in OTSE that allows the content data sent from the
Update Distributor to the Index Engines to be compressed. For systems which have
excess CPU capacity and slow networking to the Index Engines, enabling this option
can improve indexing throughput. Most systems do not have this performance
profile, so the feature is disabled by default. The threshold setting determines the
minimum size of full text content that needs to be present before the compression is
triggered for a specific object. Note that compression also requires additional
memory. The memory requirement varies based upon the maximum size of the text
content, and for a system with a content truncation size of 10 MB an Index Engine
would consume another 12 MB of RAM. In the [Dataflow_] section:
CompressContentInLocalUpdate=false
CompressContentInLocalUpdateThresholdInBytes=65535
Scanning Long Lists
There is a specific optimization available for updates to text metadata in partitions not
using Low Memory mode. Low Memory mode uses different data structures and
does not exhibit this behavior.
If metadata updates are applied to metadata values where many objects have the
same value, the update operation can be extremely slow. For example, the
“OTCurrentVersion” region may have 1 million objects with the value “true”. Updates
to this field would be very slow.
The optimization makes these updates fast, but requires additional memory.
Because many customers with this configuration have full partitions, they cannot
tolerate extra memory requirements, so the default is for the optimization to be
disabled (a value of 0). The configuration setting specifies the distance between
known synchronization points in the data structure. Values of about 2000 perform
well; values below 500 become memory-intensive. In the [Dataflow] section:
TextIndexSynchronizationPointGap=2000
Ingestion versus Size
When measuring performance of search indexing, bear in mind that throughput
decreases as the number of objects in the partition increases. As data structures
become larger, extending and updating the index becomes slower. The single largest
contributing factor to the performance degradation is writing Checkpoints. A
Checkpoint is a complete snapshot of the search partition. As the partition gets
larger, the time to create the Checkpoint increases. As a guideline, the indexing
throughput as a partition approaches 10 million items will be about 30% of the
throughput experienced for the first million items indexed.
Using achievable numbers with typical Content Server objects, indexing the first
million items in a partition may be possible in 6 hours. Indexing items 9 million to 10
million in a partition may require 18 hours or more.
Content Server Considerations
In many scenarios, the bottleneck for indexing occurs upstream of the search engine.
The indexing process starts with the Extractor in Content Server, which feeds IPools
to DCS. DCS prepares the data, and creates IPools that feed the Update Distributor
component of the search engine.
The first constraint is typically the Document Conversion Server. There are
mechanisms available in Content Server to run multiple DCS instances in parallel,
and the worker processes that each DCS instance manages for operations such as
format parsing or thumbnail generation can also be scaled up. If DCS throughput is
not the limiting factor, then running multiple Extractors in parallel is also an option that
can be configured.

Ingestion Rate Case Study


To help assess the performance of the Low Memory Mode configuration for high
ingestion rates or large systems, Hewlett-Packard graciously agreed to provide time
in their labs for testing ingestion on one of their multi-CPU servers with a fast HP
SAN as the index and test data storage. There is a performance white paper
available from OpenText that provides details about the hardware configuration.
Given that limited time was available, the testing was pragmatic rather than rigorously
scientific. For example, we might change some parameters for a short time as the
index was growing to assess the impact, but we did not have time to back out the
changes and run the identical test again in multiple configurations. Regardless, the
results have provided significant insights. Note that this test is focused on the search
engine. The inputs were IPool messages, hence factors such as DCS, database or
Extractor performance are not considered.
One key objective was to determine how indexing performance degrades as the size
of the index grows. Another was to confirm that performance remains acceptable as
partitions are used to store more items – since the historic comfort zone for a search
partition is less than 2 million objects. Finally, we also ran concurrent search load
tests to confirm that searches on large indexes under heavy indexing load return
results in an acceptable time.
Indexing batches of about 2.6 million items each were used for most test runs. The
objects indexed are statistically generated as IPools to simulate typical email
ingestion scenarios, with about 2 KB of metadata and 31 KB of full text content. Each
batch added a net increase of about 2 million new objects, although a mixture of
metadata updates and deletes were also included in each batch to simulate real-
world behavior with Content Server. The number of partitions was nominally 8,
although variations were tested. A summary of the results, with commentary, follows.

The test was seeded with an 8-partition index of about 14 million items. Initially, 12 to
16 partitions were enabled. After each batch of 2 million items was ingested, the
performance was reviewed and occasionally changes made to the configuration of
hardware or the index.
Below 50 million items in the index, an important observation is that the Update
Distributor does not appear to be a bottleneck, despite all data for all Index Engines
passing through the Update Distributor. We see many data points where the overall
throughput exceeds 100 items per second, which would be in the neighborhood of 8
million objects per day.
Once we had confirmed that performance with 16 partitions was relatively high, we
adjusted the number of partitions down to 8, to focus on building larger partitions in
the available lab time. As expected, the throughput with 8 partitions is significantly
lower. By the end of the test, the 8 partitions contained indexes of 10 million objects
each. At this size, the indexing throughput had decreased to just under 30 objects
per second. This is nearly 2 million objects per day, not including excess capacity for
downtime or spikes.
Some interesting data points:
• At about 94 million objects, we enabled more active partitions and observed
that much higher ingestion rates were still possible.

• Around the 30 million object mark, a faulty network card was replaced,
resulting in a material jump in performance.
• During one interval we duplicated the exact same test on the same
hardware, running concurrently. Our indexing tests were not fully engaging
the capacity of the HP hardware, generally staying below 30% CPU use.
Doubling the indexing load on the hardware resulted in dropping the
throughput from about 40 to about 30 objects per second for the observed
test, although we did manage to get a peak CPU use above 60%. The
duplicate concurrent test had similar performance characteristics. It would
appear that the HP environment has capacity for a much larger index than
we tested, or could also be used for other purposes such as the Document
Conversion Server.
• We disabled CPU hyper-threading for two runs, which reduced throughput
again from about 40 to 30 new objects per second. Lesson learned: leave
hyper-threading enabled for Intel CPUs.
What about searching? Search load tests from within Content Server were
performed concurrently while indexing was occurring. As expected, search became
slower as the index size increased. By test end, with 100 million items and indexing
40 objects per second, simple keyword searches from the search bar averaged less
than 3 seconds, and advanced search queries about 6 seconds, including search
facets. This is not the search engine time, but the overall time including Content
Server.
Does this ingestion case study have relevance for even larger systems? Yes. The
indexing throughput we measured is based on the number of “active” partitions, using
partition biasing. Eventually, you may have many more partitions, but by biasing
indexing to a limited subset, the indexing throughput can be modeled along the lines
seen in this example.
As a final note, this test was performed using Search Engine 10.0 Update 11. A
number of performance improvements, in particular for high ingestion rates, have
been implemented since this test was performed. Consider these data points to be
conservative.

Re-Indexing
Although OTSE has many features that provide upgrade capability and in-place data
correction, there are times when you may want to completely re-index your data set.
If you have a small index, re-indexing is fast and easy. For larger indexes, there are
some performance considerations.
It is faster to rebuild from an empty index than to re-index over existing data. There
are several reasons for this. Firstly, the checkpoint writing process slows down as
the index becomes larger, since there is more data to write to disk. When starting
fresh, the early checkpoint writing overhead is very small. Modifying values is also
more expensive than adding values – searching for existing values, removing them,
and adding new values to the structure is slower than simply adding data to a
structure.

Another key factor is the set of metalog update rules. In particular, the default checkpoint
write threshold is lower for updates than it is for adding new items to the index. This
is a reasonable value during normal operation, but when a complete re-index is in
progress and all objects are being modified, this setting will result in high
checkpoint overhead. A purge and re-index avoids this problem entirely. If re-
indexing very large data sets, increasing the threshold for replace operations may be a
useful strategy.

Optimize Regions to be Indexed


Don’t index metadata regions that are not needed.
Review the DCS documentation to ensure you have the right level of indexing
enabled for Microsoft Office Document properties. Chances are that unless
eDiscovery for Litigation Support is a key requirement, you can materially reduce
your index size by suppressing indexing of the extra document properties.
Examine your region definitions file (LLFieldDefinitions.txt). Make sure that DROP or
REMOVE is applied for regions which have little value for your business case. Verify
that the most efficient region type is defined for the remaining regions.
If you aren’t sure, you can move a text field to DISK_RET mode to reduce memory,
making it non-searchable. If you later determine that you do indeed need the field to
be searchable, you can change its storage mode back to DISK or RAM and make it
searchable again.

Selecting a Storage System


OTSE is a disk-intensive application. The characteristics of your disk storage can
have a dramatic effect on the performance of your search grid. If you are familiar
with configuring databases, applying similar guidelines to setting up storage for the
search index will normally give you good results.
The search grid is composed of dozens of active files per partition. With a large
search grid, hundreds of files may be actively read and written simultaneously.
Indexing creates new files and performs disk-intensive merging of existing files.
Both the Index Engines and Search Engines perform independent operations on
these files.
While there is no specific rule or guideline for what constitutes an appropriate disk
configuration, keep the following in mind.
If you don’t care about indexing and search performance, then don’t worry about
configuring a high performance disk system. If your data indexing rate will be low
and search queries don’t require fast response, then you can probably tolerate a low
performance storage system.
Each incremental search partition adds files that need to be managed and accessed.
Accessing many files always impacts query performance. The increased file access
for indexing will usually not be noticeable if you have low rates of indexing, perhaps
below a few typical updates per second (yes – per second. Depending upon the
situation, an Index Engine is capable of indexing 50 or more objects per second).

For maximizing indexing throughput, disk performance is a key parameter, since disk
I/O is usually the limiting factor. Using several sample test setups on similar (but not
identical) configurations in 2012, we measured indexing times with 4 partitions of:
• 390 minutes with a single good SCSI hard disk installed in the computer.
• 270 minutes attached to a lightly loaded storage array with a 10 Gb network
connection, running on VMware ESX.
• 5000+ minutes attached to a busy NFS storage array shared with other
applications, with a 10 Gb network connection, running on VMware ESX.
Read that last one again. You really can configure disk storage that will reduce the
performance of OTSE by a factor of 20 or more. Disk fragmentation also has an
impact. On Windows, we typically see a 20% indexing performance drop between a
pristine disk and one with 60% file fragmentation.
Note that the caching features of some SANs are too aggressive, and can report
incorrect information about file locking and update times.
Customers using basic Network Attached Storage such as file shares generally report
poor search performance. In general, storing the search index on a network file
share will give very poor results.
The incidence of network errors that customers experience when using either SAN or
NAS is surprisingly high. OTSE has relatively robust error detection and retries for
these cases, but failure of the search grid due to network errors is still possible.
When using any type of network storage for the index, monitoring the network for
errors is a good practice that may prevent a lot of frustration due to intermittent
errors.

NOTE: Do not use Microsoft SMB2. The Microsoft SMB2
storage system caches information in such a way that it does not
accurately report file locking and updates in a timely fashion,
resulting in incorrect behavior of OTSE.

NOTE: Apply Windows NTFS patches. When using SAN
storage with an NTFS file system and large search partitions,
some customers have hit Windows operating system limits for file
fragmentation. The Microsoft Knowledge Base article 967351
contains information about this limit and provides a patch that can
solve the problem for some Windows operating systems.

NOTE: Use drive mapping. Wherever possible, use drive
mapping instead of UNC paths for search index components. In
particular, customers have reported instability with Java accessing
drives on Network Attached Storage when UNC path names are
used.

A dedicated physical high performance disk system will usually outperform a network
attached disk system. However, a SAN with high bandwidth often has other benefits,
such as high availability, which make them attractive. If you are configuring a SAN
for use with search, treat the search engine like a database. The performance of the
disk system is almost always the limiting factor in performance.
Any type of network storage is acceptable for index backups. In fact, backing up the
index onto a different physical system is generally recommended.
Finally, a word about Solid State Disks (SSD). SSDs are gaining acceptance for high
performance enterprise storage. The characteristics of fast SSD are a good fit for
search engines. Given the large number of small random access reads that occur
when searching, SSD storage is an excellent choice for maximizing search query
performance. Indexing performance is not as dramatically affected, since the Index
Engines are generally optimized to read and write data in larger sequential blocks.
However, even with indexing, the highest indexing throughputs we have measured in
our labs occurred with local SSD storage for the index, around 1 million objects
indexed per hour. If you need to improve the query performance or indexing
throughput, investing in good SSD storage media for the index is likely the best
hardware investment you can make.

Measuring Network Quality


Some of the most difficult search issues to diagnose are due to errors in the
environment that affect reliability of network communications. Beginning with the
16.2.9 update, OTSE records network problems encountered by the Update
Distributor communicating with Index Engines, and by the Search Federator
communicating with Search Engines. Each communication is retried up to 5 times if
needed. OTSE counts the number of retries needed to complete a communication,
and the number of times the communication failed despite retries. The counts are
written to log files on an hourly basis as an extension of the “Index Performance
Summary” and “Search Performance Summary”. In the log files, the column
headings are NetIO1 through NetIO5 for retries, and NetIOFailed for failures. The
counts are also included in a “getstatustext performance” query to the admin port.
Retries and failures indicate problems in the environment and may include unreliable
network cards, bad cables, port conflicts, or virus/port scanners.
By default, recording of the network quality metrics is enabled, and can be disabled in
the [Dataflow_] section of the configuration file by setting the value to false:
LogNetworkIOStatistics=true

Measuring Disk Performance


OTSE keeps rudimentary disk performance statistics that are intended to help identify
when an environment is not performing as expected. During operation, both the
Index Engines and Search Engines track the performance of selected disk
operations. Occasionally, the summary information is written to the log files. The
data is cumulative since startup.
For the Index Engines, the average access time plus histogram data is maintained for
Writes, Syncs, Seeks and Close operations. In a fast environment, the times should
ideally be in the 0-2 millisecond bucket. If there are counts recorded in the longer
buckets, this is a strong indicator that there are performance problems with the
storage system.
Disk IO Counters. Read Bytes 0. Write Bytes 154394096.:
Histogram of Disk Writes. Avg 0 ms (381/18979). 0-2 ms
(18979). 3-5 ms (0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0).
51-100 ms (0). 101-200 ms (0). 201-500 ms (0). 501-Inf ms
(0).:
Histogram of Disk Syncs. Avg 179 ms (37276/208). 0-2 ms (0).
3-5 ms (0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (37). 51-100
ms (23). 101-200 ms (59). 201-500 ms (87). 501-Inf ms (2).:
Histogram of Disk Seeks. Avg 0 ms (1/78). 0-2 ms (78). 3-5 ms
(0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms (0).
101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
Histogram of Disk Closes. Avg 0 ms (0/2). 0-2 ms (2). 3-5 ms
(0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms (0).
101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:

In addition to the times, the number of disk errors that occur and the number of
retries needed to succeed are recorded. If errors exist, an additional line of this form
will be written:
Disk IO Retries Needed. 1 (7). 2 (6). 3 (8). 4 (2). 5+ (22).
failed (17).
For example, this entry indicates that on 7 occasions, 1 error/retry was required. On
22 occasions 5 or more retries were attempted, and 17 times the disk I/O failed even
with retries.
Similarly, the Search Engine reports performance for selected disk operations, writing
entries of this form:
Disk IO Counters. Read Bytes 112711231. Write Bytes 0.:
Histogram of Disk Reads. Avg 0 ms (122/12347). 0-2 ms (12347).
3-5 ms (0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms
(0). 101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
Histogram of Disk Seeks. Avg 0 ms (3/360). 0-2 ms (359). 3-5
ms (1). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms
(0). 101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:
Histogram of Disk Closes. Avg 0 ms (0/1). 0-2 ms (1). 3-5 ms
(0). 6-10 ms (0). 11-20 ms (0). 21-50 ms (0). 51-100 ms (0).
101-200 ms (0). 201-500 ms (0). 501-Inf ms (0).:

By default, reporting of this data is enabled and is written every 25 transactions. The
feature can be disabled and the frequency of reporting can be controlled in the
[Dataflow_] section of the search.ini file:
LogDiskIOTimings=true
LogDiskIOPeriod=25
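If you want to track these statistics over time, the histogram lines are regular enough
to parse mechanically. A small Python sketch (assuming the log line format shown
above, where the pair after “Avg” appears to be total milliseconds over operation count):

import re

line = "Histogram of Disk Syncs. Avg 179 ms (37276/208). 0-2 ms (0). ..."
match = re.search(r"Histogram of Disk (\w+)\. Avg (\d+) ms \((\d+)/(\d+)\)", line)
if match:
    operation, avg_ms, total_ms, count = match.groups()
    # Prints the operation type, the reported average, and the recomputed average
    print(operation, avg_ms, int(total_ms) / int(count))  # Syncs 179 179.21...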

Checkpoint Compression
There is an optional feature in OTSE that allows Checkpoint files to be compressed.
Checkpoint files can be large, over 1 GB as you exceed 1 million objects in a
partition. New Checkpoint files are written from time to time, usually by all partitions
at once, which can place a significant burden on the disk system.
The compression feature is disabled by default since, in a simple system with a
single spinning disk, compression makes Checkpoint writing CPU bound, and
indexing throughput may decrease by 10% to 15%. However, if you have a system
which is limited by disk bandwidth rather than CPU, then enabling Checkpoint
compression may be a good choice, and actually increase indexing performance.
The compression feature generally reduces the size of Checkpoint files by about
60%. Compression is enabled in the [Dataflow_] section of the search.ini file:
UseCompressedCheckpoints=true

Disk Configuration Settings


OTSE has a number of configuration variables which can potentially change the
characteristics of disk usage. The default settings are normally appropriate, but
experimenting with some of these parameters may be needed depending on your
disk system. These values are configured in the search.ini file.
Delayed Commit
Some Storage Area Networks (SANs) with slow SYNC characteristics require a
delay between certain types of operations. While normally 0, this setting will insert a
pause between key disk operations that improves system stability in these cases:
DelayedCommitInMilliseconds=10
Chunk Size
Some storage systems are sensitive to the chunk size when reading or writing data.
The default is 8192. Although normally we recommend the Java default of 32768, this
parameter can be forced to a smaller maximum value if necessary in the [DataFlow]
section of the search.ini file:
IOChunkBufferSize=8192
Query Parallelism
The Search Federator asks each Search Engine to return results. There are two key
performance tuning values in this process in the search.ini file. The first is how
aggressive the Search Federator will be with respect to asking Search Engines to
pre-fetch results to keep the Search Federator result merging queue full. The default
value of 0 is used to pre-fetch as much as possible, measured in terms of Search
Engine result blocks. Setting this number higher will delay pre-fetching, which can
reduce the number of results fetched but introduces delays into result retrieval. For
example, a value of 3 will wait until a Search Engine has been asked for 3 blocks of
results before beginning to pre-fetch results.
MergeSortCacheThreshold=3

The other parameter is the number of results a Search Engine fetches each time the
Search Federator asks for a set of results. The default value is 50. Larger values
are more efficient when the typical query is for many results. Smaller values are
more efficient for typical relevance-driven queries. In general, if using the preload
above, a value of 20 to 50 is likely optimal, and reduces the potential load on the disk
system.
MergeSortChunkSize=50
These values are multiplicative with the number of partitions. For example, if you
have 8 partitions and a MergeSortChunkSize of 250, then the MINIMUM number of
results that the Search Engines together will provide to the Search Federator is 2000.
Keeping the MergeSortChunkSize value low for systems with many partitions is
recommended.
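The arithmetic is simple but easy to overlook when tuning; a one-line Python sketch:

# Minimum results delivered to the Search Federator per fetch round
def minimum_results(num_partitions, merge_sort_chunk_size):
    return num_partitions * merge_sort_chunk_size

print(minimum_results(8, 250))  # 2000, as in the example above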
Throttling Indexing
In some environments, it may be the case that indexing operations are creating
metalogs faster than they can be consumed by the search engines. There is an
upper limit on how many unprocessed metalog files are acceptable, which can be
adjusted if necessary should Search Engines chronically lag behind the Index
Engines. This can happen in environments in which long-running search queries tie
up the Search Engines at the same time that high indexing rates are occurring. In
some cases this problem can be resolved by configuring Search Federator caching.
When this limit is reached, indexing updates will pause to allow the Search
Engines to close the gap.
AllowedNumConfigs=200
In situations where queries are constantly running, it may be necessary to force a
pause in processing search queries in order to give the Search Engines an
opportunity to consume the index changes. There are two settings to control this,
one that specifies the maximum time that queries are allowed to run continuously
(thus blocking updates), and the other is the duration of the pause which is injected
into searching. By default, this feature is disabled.
[SearchFederator_xxx]
BlockNewSearchesAfterTimeInMS=0
PauseTimeForIndexUpdatingInMS=30000

Small Read Cache


The Search Engines have an optional feature to reserve memory for a disk read
cache, which can buffer recent small blocks read from the index during queries.
Testing on a typical index showed a reduction of read operations of up to 17%. If you
enable this feature, ensure that you do a before/after set of timing tests. While a
benefit is typical, some environments show a small performance degradation of a few
percent in queries. By default, this is disabled (set to 0), and it is configured in the
[Dataflow_] section of the search.ini file. There is very little measured benefit to
exceeding 10 MB in a Search Engine for this optimization.
SmallReadCacheDesiredMaximumSizeOfSmallReadCachesInMB=5
File Retries
Experience has shown that disk reads and writes are not always reliable, especially
when low performance disk systems such as NAS or distributed file systems are in
use. To try to ensure correct operation in these environments, most file accesses
will detect errors and retry operations multiple times. The delay between retries is
about 2 seconds times the attempt number, so for N retries the total retry time is
N*(N+1) seconds (e.g. if N is 5, up to 30 seconds). In update 21.1, this retry behavior
was extended to cover reading the livelink.### files (aka livelink.ctl files). Using
these types of disk environments is strongly discouraged, and even if correct, they
can be extremely slow. The number of retries is adjustable, and defaults to 5.
NumberOfFileRecoveryAttempts=5
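A quick sketch of the retry timing formula above (illustrative arithmetic only):

def total_retry_seconds(num_retries):
    # The delay before attempt i is about 2*i seconds, so the
    # total is 2*(1 + 2 + ... + N) = N*(N+1) seconds
    return sum(2 * i for i in range(1, num_retries + 1))

print(total_retry_seconds(5))  # 30 seconds for the default of 5 retries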

Indexing Large Objects


The default settings of OTSE and the Document Conversion Server are designed to
handle all normal document sizes. Text is typically truncated to 10 or 20 MB, which
accounts for all but the very largest of documents. This document, for example,
contains well under 1 MB of text. The text in very large documents is often of little
value, and the first 10 MB contains matches due to redundant terms. Note that the
amount of text in a document is often very much smaller than the file size – for
example, a PowerPoint file might be 100 MB in size, but contain only 10 KB of actual
text.
However, there are situations where all of the text in very large documents must be
indexed. In experiments, we have successfully indexed documents comprised of
more than 200 MB of text. In order to achieve this, the Engines will need to have
significant spare memory (gigabytes). This is effectively done by setting the metadata
memory size to a large value (say 6 GB) with a maximum allowable utilization of
30%. In one experiment, we measured indexing success with available memory of
approximately (100MB RAM + 8MB RAM/1MB TEXT), in both index and search
engines. For example, a 200 MB text file succeeded with 1.7 GB of available
memory for processing. This experiment occurred before Update 16.2.3, where the
worst case scenario would require available RAM equivalent to 7x the size of the text.
Beginning with 16.2.3, the situation has improved, with a worst case RAM
requirement of 3x the size of the text.
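Based on the measured approximation above, a rough memory estimate can be
sketched as follows (a rule of thumb from one experiment, not a guarantee):

def estimated_ram_mb(text_size_mb):
    # Approximation from the experiment above: 100 MB base
    # plus 8 MB of RAM per MB of text, per engine
    return 100 + 8 * text_size_mb

print(estimated_ram_mb(200))  # 1700 MB, i.e. about 1.7 GB, matching the example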
The truncation size will also need to be adjusted upwards from about 10 MB to the
desired size, perhaps 210 MB. The timeouts for the Index Engines may also need to
be increased. Changes to settings in the Document Conversion Server will also be
required, including allocating more memory, adjusting truncation limits, and providing
much longer timeout values for processing formats.

Servers with Multiple CPUs


Large servers with multiple physical CPUs may require special consideration.
Several customers have experienced very slow operation with high-end expensive
hardware, which is counter-intuitive. Investigation has identified that systems with a
Non-Uniform Memory Access (NUMA) architecture need to be carefully configured.
OTSE and the Admin Servers do not have any special handling for execution on
NUMA nodes. The operating system tools are relied on for optimizing processor
affinity. In most cases, the default behavior of the operating system will allocate
processes and threads such that there is no problem in a system with multiple NUMA
nodes.
In a NUMA system, memory is partitioned with fast access to one CPU, and much
slower access by the other CPUs. OTSE uses many threads for execution, and the
operating system could assign different threads for the same Search Partition to
different physical CPUs. Tasks undertaken by the threads on CPUs not attached to
the memory take about 5 times longer to execute, in part because of slower memory
access, but also because serial interconnects between the CPUs must be used to
synchronize caches.
One approach to resolving this issue is to use operating system tools to pin
applications to physical CPUs. In a Content Server environment, Search Engine
processes are started and ‘owned’ by an Admin Server. It may therefore be
necessary to set the affinity of an Admin Server and all of its attached processes to a
single CPU. This in turn may require changing the number of Admin Servers in use
and allocating Search Engine processes to the Admin Servers to meet your
performance goals. In the Content Server environment, the Document Conversion
Server may likewise need to be adjusted.
The tools used to analyze the allocation of applications to CPUs and to pin
applications to CPUs vary by operating system. You may wish to investigate the use
of some of the following operating system functions for optimizing execution on
NUMA nodes:
Linux: taskset, numactl
Solaris: priocntl, pbind
Windows: start /NODE (may require hotfix to cmd.exe)
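For example, on Linux, an Admin Server and the processes it spawns might be bound
to a single NUMA node using numactl (the script path here is a hypothetical
placeholder, not the actual installation layout):
numactl --cpunodebind=0 --membind=0 /opt/opentext/adminserver/start.sh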
If you are running OTSE in a Virtual Environment, the VM tools will often have
processor and NUMA node affinity controls that may also be used to set node affinity.
Note that these considerations only apply to servers with multiple physical CPUs.
There is no scalability performance issue associated with many cores on a single
CPU.

Virtual Machines
In principle, virtual machines should be indistinguishable from physical computers
from the perspective of the software. In practice, problems occasionally arise from
running software in a virtual environment. OTSE is known to
operate with VMware ESX, Microsoft Hyper-V, and Solaris Zones. However,
OpenText cannot in reality rigorously test and certify every possible combination of
hardware and virtual environment, and there may be configurations of these virtual
environments that OpenText has not encountered which might be incompatible with
the search grid.
The most important point is this: virtual machines do NOT reduce the size of the
hardware you need to successfully operate a search grid. If anything, operating a
search grid in a virtual environment will require MORE hardware to achieve the same
performance levels, when measured in terms of memory and CPU cores/speed.
For small installations of the search grid where performance issues are not a factor, a
virtual environment can be attractive. However, as your system increases in size to
require many partitions, be aware that a virtual environment may be more costly than
a physical environment for the search grid, which needs to be considered against VM
benefits such as simplified deployment and management. Consider a search engine
as being analogous to a database. For larger or performance-intensive database
applications, the database is often left on bare metal, even if the remainder of an
application is virtualized. The Search Engine has performance characteristics similar
to a database and it may make sense to leave the Search Engine on dedicated
hardware.
One example of a limitation we have seen involves virtual machines in a Windows
Server environment. In some cases, the I/O stack space is not sufficient once the extra VM
layers are introduced, and tuning of the Windows settings to increase I/O resources
becomes necessary.
As with most applications deployed in a virtual environment, the software runs slower.
The change in performance depends on many factors, but a 10% to 15%
performance penalty is not uncommon.
We have also seen instances in which the memory used by Java in a VM
environment is reported as much higher than the equivalent situation on bare
hardware. In practice, the actual memory in use is very similar, but the reported
values can differ wildly. Often, over a period of many hours, the reported VM
memory will decline and converge on memory consumption reported on a bare
hardware environment.

Garbage Collection
The Java Virtual Machine will generally try to optimize the number of threads it
allocates to Garbage Collection. However, it is not always correct. For example,
when running in a Solaris Zones environment, the “SmartSharing” feature of Zones
can trigger the Java Garbage Collector to allocate very large numbers of threads and
memory resources, which in Zones may be manifested as Solaris Light Weight
Processes (LWPs).
If the number of threads on a system allocated to Garbage Collection seems
unusually large, you likely need to place a limit on the number of Garbage Collection
threads, which can be done by modifying the Java command line to add
-XX:ParallelGCThreads=N, where N is the maximum number of threads. Selecting N
may require experimentation, but values on the order of 8 are typical for a system
with 8 partitions, and values over 16 may provide little or no incremental value.
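For illustration, a Java command line with this flag might look like the following (the
JAR name and heap size are hypothetical placeholders, not the actual OTSE
invocation):
java -Xmx2g -XX:ParallelGCThreads=8 -jar searchengine.jar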

File Monitoring
Some tools that monitor file systems can cause contention for file access. One
known example of this is Windows Explorer. If you use Windows Explorer to browse
to a folder used by SE 10.5 to hold the search index, you will likely cause
file I/O errors and a failure of the search system.

Virus Scanning
The performance impact of virus scanning applications on the search grid is
catastrophic because of the intense disk activity that the search grid performs. In
some cases, file lock contention can also cause failure or corruption of the index. You
must ensure that virus scanning applications are disabled on all search grid file I/O.
The search system only indexes data provided by other applications. If virus
scanning is necessary, then scanning the data as it is added to the controlling
application (such as Content Server) is the recommended approach.
Related to this, we see virus scanners now offering port scanning features as well.
Like virus scanners, we have found that port scanners can significantly reduce
performance or cause failure of the software.

Thread Management
OTSE makes extensive use of the multi-threading capabilities of Java. In general,
this leads to performance improvements when the CPUs have threads available.
However, for very large search grids with over 100 search partitions, the number of
threads requested by OTSE may exceed the default configuration values for specific
operating systems. Depending upon the operating system, it is usually possible to
increase the limits for the number of usable threads. This problem is less likely to
occur when running with socket connections instead of RMI connections.
Configuring an operating system to permit more threads for a single Java application
is beyond the scope of this document, and may also include tuning memory
allocation parameters for the JRE. The objective here is simply to make you aware
that additional system tuning outside the parameters of OTSE may be necessary.

Scalability
This section explores various approaches to scaling OTSE for performance or high
availability. OTSE does not incorporate specific scalability features. Instead, by
leveraging standard methods for system scalability with an understanding of how the
search grid functions, we can illustrate some typical approaches to search scalability.

Query Availability
The majority of customers that desire high availability are generally concerned with
search query performance and uptime. Usually, this is tackled by running parallel
sets of the Search Federators and Search Engines in ‘silos’, with a shared search
index stored on a high availability file system.

To obtain the benefit of high availability, the search silos should be located on
separate physical hardware in order to tolerate equipment failure.
Search queries are not stateless transactions; they consist of a sequence of
operations – open a connection, issue a query, fetch results, and close the
connection. Because of this, simple load balancing solutions cannot easily be used
as a front end for multiple search federators. Instead, the application issuing search
queries should have the ability to direct entire query sequences to the appropriate
silo and Search Federator.
Content Server is one such application. If multiple silos are configured, search
queries will be issued to each one alternately. In the event that one silo stops
responding, Content Server will remove that target from the query rotation. Refer to
the Content Server search administration documentation for more information.
In this configuration, the Search Engines share access to a single search index. This
works because Search Engines are “read only” services which lock files that are in
use. All changes to the Search Index files are performed by the Index Engines.
When a Search Engine is using an index file, it keeps a file handle open – effectively
locking it. The Index Engines will not remove an index file until all Search Engines
remove their locks on a fragment. Because these locks are based on file handles in
the operating system, a Search Engine which crashes will not leave locks on files.
When Search Engines start, they load their status from the latest current checkpoint
and index files, and apply incremental changes from the accumlog and metalog files.

Because of this, no special steps are needed to ensure that Search Engines in each
silo are synchronized. They will automatically synchronize to the current version of
the index.
It is possible for an identical query sent to each silo at the same time to have minor
differences in the search results. The differences are rare, probably small, and short
lived – and would not be noticed or important for most applications. These potential
variances arise due to race conditions. The Search Engines in each silo update their
data autonomously. When an Index Engine updates the index files, perhaps adding
or modifying a number of objects, the Search Engines will independently detect the
change and update their data. For a short period of time, a given update to the
search index may be reflected in one of the search silos but not the other.
This approach to high availability for queries also allows many search grid
maintenance tasks to be performed on Search Federators or Search Engines without
disrupting search query availability. By stopping one silo, performing maintenance,
restarting the silo, and then repeating the process with the other silo, user queries are
not impacted throughout the process. Note that some administration tasks which
change fundamental configuration settings may not be possible without service
interruption.
An additional benefit of parallel silos is search throughput. Since applications such
as Content Server can distribute the query load across multiple silos, the overall
search performance might be higher. This will not be the case if the hardware on
which the search index is stored is a performance bottleneck, particularly the disk
which is shared by each silo.
For correct operation, each silo must have identical configuration settings. If you
have hand-edited any of the configuration files, you must ensure this is properly
reflected on both silos.

Indexing High Availability


By its nature, search indexing is not typically a real-time application. Objects for
indexing are placed in IPools (which are queues), then prepared by DCS, and added
to the index in batch transactions. By definition, there is latency and delay in these
processes, which vary based on many factors, including the indexing throughput, the
size and types of objects being indexed, or the number of objects in an IPool.
Because of this, high availability for search indexing is not a requirement for most
customers. Search queries remain available even when indexing operations are
down. Given the cost of adding duplicate equipment for high availability and the
limited benefits, it is rare that you would need to implement indexing high availability.
What is commonly required is a way to redeploy or recover in a reasonable time
frame if indexing dies. Configuring the indexing system on a virtual machine is one
possible approach to reducing the recovery and redeployment times for search
indexing. In the event that a system fails, the VM images for the indexing processes
can be rapidly deployed on other hardware.
Within Content Server, the “Admin Server” component is also available. The Admin
Server will monitor the indexing processes, and is capable of restarting them in the
event that unexpected errors occur.

If you absolutely must have true high availability for indexing, it must be
implemented using technologies external to the search grid, with a combination of
configuration settings and external clustering hardware or software. The general
principle is that two completely separate search grids are created, the indexing
workflow is split and duplicated, and the indexes are independently created and
managed. This is an exercise pursued using products such as Microsoft Cluster
Server, and beyond the scope of this document.

Sizing a Search Grid


Determining how a hardware environment should be sized for search is not always a
simple task. There are many variables that can affect this. While there is no firm
formula for estimating hardware requirements, this section will examine some of the
common considerations for search index sizing. The bottom line – experimenting in a
test environment with your actual application and representative data is the only way
to make solid predictions about how search grid hardware should be sized.

Minimizing Metadata
Many Content Server applications index much more metadata than is actually used in
searches. Using the LLFieldDefinitions file to REMOVE metadata fields that will
never be used can minimize the RAM requirements.

Metadata Types
By default, metadata regions are created as type TEXT. Integer, ENUM, and Boolean
types are more efficient, and using the LLFieldDefinitions file to pre-configure types
for these regions can reduce the RAM requirements.

Hot Phrases and Summaries


New installations of Content Server should consider configuring the OTHP and
OTSummary regions as “Retrieve Only” regions, which can reduce RAM
requirements by 25% or more depending on the type of data you index.

Partition RAM Size


Each partition brings with it a relatively fixed overhead of several hundred
MBytes of RAM. Added to this is the memory that each partition will
use to store metadata. The larger the memory allocated to a partition, the fewer
partitions are required, and thus the lower the overall RAM requirements.
However, simply using large partitions is not always the correct approach, since
performance of a partition degrades slightly as it grows in size.
Content Server 9.7.1 installations on Windows or Linux use a 32-bit JRE, which
places an upper limit of about 1 GByte on the partition RAM size. Also, it is
relatively easy to allocate more RAM to partitions to make them larger, but difficult to
break a large partition into smaller ones. If you have performance bottlenecks in
indexing or searching, splitting the index into multiple partitions can sometimes
improve performance by leveraging the parallelism of multiple partitions, although
this may require additional CPU cores or disk bandwidth.
Sample Data Point
Our sample system is comprised of a relatively typical mix of Content Server data
types from a “document management” application, including some use of forms
and workflow. There are several hundred core metadata regions, and several
thousand lesser-value metadata regions from applications such as Workflow.
In “RAM” mode, without tuning, a default 1 GB partition holds about 1.5 million
objects. Using a 3 GB partition size, we measure 4.4 million objects using about
2.5 GBytes of RAM for metadata.
In “DISK” mode, with the same data we can index the same 4.4 million objects
using a little more than 1.8 GBytes of RAM for metadata, which is roughly a 2.5
GByte partition.
In “Low Memory” disk mode, the same 4.4 million objects require about 700
MBytes of RAM, which can be done in a 1 GByte partition. We extrapolate that a
2 GB partition in Low Memory mode can potentially handle up to 10 Million
indexed objects from Content Server.
The general guideline using Low Memory mode with Content Server is that you can
expect a partition to accommodate 7 to 10 million typical Content Server objects with
reasonable performance using a 2 GB RAM partition size. The overall conservative
memory budget for such a partition is approximately 6 GB (2 GB of RAM plus 1 GB
of overhead and Java memory, for each of the Index Engine and the Search Engine).
Memory Use
When running a Java process, the amount of memory it may use is specified on
the command line. Java can be aggressive about consuming this memory. You
may be able to operate a partition with 1 GB of RAM, but if you made 8 GB of
memory available, Java may consume all of it. This memory use can be
misleading when analyzing resources used by a search partition.

Redundancy
If you are building a high availability system with failover capabilities, the hardware
must be suitably duplicated.

Spare Capacity
In the event that there are maintenance outages, or a requirement to re-index
portions of your data, you will need spare CPU capacity to handle this situation.
Although OTSE is a solid product, indexing problems can happen – generally
incorrect configuration or network/disk errors, although (perish the thought) there are
occasionally bugs found. Sizing the hardware to meet the bare minimum operating
capacity won’t allow you any headroom to recover from problems.

Indexing Performance
As with all sizing exercises, making predictions is fraught with danger. Ignoring the
peril, our anecdotal experience is that the Index Engines can ingest more than 1
Gigabyte of IPool data per hour.
A specific example on a computer that we frequently use for performance testing:
• Windows 2008 operating system, 2 Intel X5660 CPUs, 16 Gbytes RAM
• Update Distributor
• 4 Index Engines / partitions
• Partition metadata size of 1000 Mbytes
• Index stored on a single SCSI local hard disk
• Predominantly English data flow
this configuration consumes more than 4 GB per hour, comprising nearly 200,000
objects added or modified per hour. Usually, high performance indexing is limited by
disk I/O capacity.
Refer to the Hard Drive Storage section for more information.
Beyond about 4 partitions, the performance of the Update Distributor becomes a
factor, and you may need to ensure that the disk read capability for the indexing
IPools is adequate.

CPU Requirements
There is no single rule for the number of physical CPUs needed for a search grid.
Don’t rely on hyper-threading – physical CPU cores are key. The requirement is directly
related to your performance expectations. Some of the variables you should bear in
mind are outlined here.
Most customers optimize for cost and have low CPU counts. This means that search
works, but user satisfaction with performance may be low.
Active searches are CPU intensive. If good search time performance is expected,
you should have at least 1 CPU per search engine. This is especially true if multiple
concurrent searches will be running.
Searches are bursty in nature. CPUs will sit idle until a search request arrives, then
saturate the system. Administrators will tend to look at the average CPU use over
time, and claim that utilization is low, therefore no additional CPUs are needed. They
are wrong. Check to see if CPU utilization hits high levels during active searches,
then plan your CPUs based on load during that period.
Search Agents (Intelligent Classification, Prospectors) place an additional load on the
Search Engines. If you are using these features heavily, you may need to
accommodate them with some additional fractional CPU capacity. Search Agents run on a
schedule, so they have no impact most of the time, but a heavy potential impact
when run.
Indexing is expensive. If you need high indexing throughput, you should have at
least 1 CPU per active partition, plus 0.25 CPU per inactive partition, plus 1 CPU for
the Update Distributor. With low indexing throughput requirements, 1 CPU for 4
Index Engines may suffice.
In addition, spare capacity is needed on the Index Engines for the following events:
running index backups, writing checkpoints, and performing background merge
operations. These operations are designed to limit activity to a subset of partitions
concurrently (default about 6). You can choose degraded indexing during these
periods or allocate additional CPUs.
Example: suppose you want good search performance with many searches being run
(including searches for background RM disposition and hold) and expect hundreds of
thousands of indexing additions and updates every day, on a medium-large system
with 40 partitions (perhaps 500 million items), configured with 6 active partitions (the
number of partitions that accept new data, write checkpoints, and merge concurrently):
1 CPU – Update Distributor
6 CPUs – Active Index Engines
8 CPUs – Update Index Engines
40 CPUs - Search engines with fast response
Assume indexing throughput can tolerate short slowdowns for background
operations, no extras. Over 50 CPUs is an appropriate size. Conversely, the same
system which can tolerate large backlogs for indexing (perhaps catching up in the
evenings) and is comfortable with users waiting 20 seconds on average for a search
can probably get by with 16 CPUs.

Maintenance
As with all sophisticated server software, there are a number of suggestions, best
practices and configurations that contribute to the long-term health and performance
of the system. This section outlines some of the considerations.

Log Files
Each OTSE component has the ability to generate log files. There are separate log
files for each instance of each component. The basic settings are:
Logfile=<SectionName>.log
RequestsPerLogFlush=1
IncludeConfigurationFilesInLogs=true
Here, Logfile= specifies the path for logging (the file name is generated from the
component and the name of the partition). RequestsPerLogFlush specifies how
many logging events should be buffered before writing; the value of 1 is the least
performant, but does the best job of guaranteeing that logging is captured if
something crashes unexpectedly.
At startup, information about the version of OTSE and the environment is recorded
in the form of copies of the main configuration files, and can be used to verify that the
correct versions of software are running. This can be disabled by setting
IncludeConfigurationFilesInLogs to false.

Log Levels
Each component writes its log files with a configurable level of detail. The log
level for each component of the search engine is separately configured in the
search.ini file:
DebugLevel=0
The available log levels are:
0 – Lowest level; “guaranteed logging” output still occurs
1 – Severe errors are logged
2 – All error conditions are logged
3 – Warnings are logged
4 – Significant status information is logged
5 – Information level, most detail
If you are experiencing problems that require diagnosis, setting the log level to 5 is
recommended. You do not need to restart the search engine processes to change
the DebugLevel; it is a reloadable setting.

Log File Management


OTSE supports several methods for managing log files. For most installations, using
the rotating log file method is recommended. The rotating method cycles through a
fixed number of log files of a configurable size, ensuring that the space used for log
files is bounded, and also ensuring that the latest portions of the log files are
available if an error occurs. The rotating log file method also sets aside the startup
portion of the log file, since the startup information often provides valuable debugging
information. The rotating log file parameters in the search.ini file are:
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
These settings request that 25 rotating log files are retained, each up to 25 MB in
size; in addition, the last 10 log files from startup of the component are retained, also
up to 25 MB each. Total log space is therefore bounded at roughly
(25 + 10) × 25 MB, or about 875 MB, per component instance.
The logging method to be used is set in the search INI file:
CreationStatus=0
Where the acceptable values are:
0 – Append new data to existing log file
1 – Replace the existing log file each time the component starts
2 – Create a new log file on startup, rename the old one to current date/time
3 – Log to console. Windows only – don’t use this
4 – Rolling log files
Value 3 – log to console – should generally NOT be used. It is listed here for
completeness, but is not a production-grade implementation.

RMI Logging
The RMI logging section determines how the RMI Registry component performs
logging. It is defined in the General section, and the behavior is similar to the
descriptions above; however, the names of the settings in the search.ini file are
different:
RMILogFile ---> Logfile
RMILogTreatment ---> CreationStatus
RMILogLevel ---> DebugLevel

Backup and Restore


In many applications, the ability to search the index is considered essential. If an
index backup is not available and the index is destroyed, then a complete re-indexing
of the data is needed. Depending on the size of your data, this may take an
unacceptable period of time. In spite of this, many customers do not back up their
search index on a regular basis, and this eventually leads to considerable pain when
a hard disk fails.
Best practice is regular backup of the search index.
The “Backup and Restore” section of this document contains additional information
on the mechanics of managing index backups.

Application Level Index Verification


The verification tools available with OTSE can determine whether the index is self-
consistent. These built-in tools can verify that checksums are correct, file locations
and names are as expected, and other structural elements are intact. You should
use these tools any time you suspect the disk data may be corrupted.
These tools cannot verify whether the index contains the objects expected by the
application using the index. For this reason, applications should use the OTSE
features to implement a higher level of index verification.
Content Server, for example, provides an index verification feature within its search
administration pages. You should refer to the Content Server documentation for
details. This tool checks to ensure that the objects in the index match the objects
currently being managed, and that the indexed object quality is appropriate. The
Content Server Index Verification tool can also issue updates to the index to correct
discrepancies by adding, removing or updating objects, and it generates a status
report upon completion.

Purging a Partition Index


In the event that disk errors or other system events render the index files for a
particular search partition completely unusable, you may wish to reset the index to an
empty state. Be warned that this is a destructive process: the index will be lost.
This approach should only be taken as a last resort, and it is a good idea to back up
the files before attempting it.

Step 1
Ensure that the Index Engine and Search Engine for the partition are stopped. In
some cases, the processes might have started even though the index is corrupted.
For example, if only index offset files are corrupted, searching can still occur while
further indexing is prevented.
Step 2
Check the IndexDirectory= setting in the search.ini file in the
[Partition_xxxx] section to be certain which directory you should work in.
Certain key files in the index partition directory need to be preserved, and all other
files in the directory removed. The files that must be KEPT are:
• The signature file (the partition name with a .txt extension, typically of the form
servernameX848474X999040X74657.txt)
• ALL the .ini configuration files, including FieldModeDefinitions.ini
• Backup process definition files
Step 3
Create an empty file in the partition index directory named createindex.ot. At this
point the directory should have only the INI files, the signature file, and
createindex.ot.
Step 4
Start the Index Engine. It will create a new, empty search index.
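These steps can be scripted. Below is a minimal Python sketch of Steps 2 through 4,
assuming the engines are already stopped; the partition path is a placeholder that
you must replace with the IndexDirectory= value from your search.ini:

import os
import shutil

# Placeholder: use the IndexDirectory= value from the
# [Partition_xxxx] section of search.ini.
partition_dir = "F:/OpenText/index/enterprise/index1"

for name in os.listdir(partition_dir):
    path = os.path.join(partition_dir, name)
    # Keep the signature file (.txt) and all .ini configuration
    # files, including the backup process definition files.
    if name.endswith(".txt") or name.endswith(".ini"):
        continue
    if os.path.isdir(path):
        shutil.rmtree(path)    # index fragment folders
    else:
        os.remove(path)        # checkpoints, logs, control files, etc.

# Step 3: request creation of a new, empty index at next startup.
open(os.path.join(partition_dir, "createindex.ot"), "w").close()

After running a script like this, starting the Index Engine (Step 4) creates the new,
empty search index.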

Security Considerations
OTSE does not directly implement any application security measures. However, the
interfaces to the search components are well defined, and if necessary can be locked
down using standard computer and network security tools.
A quick checklist of security access points that should be considered if you are
contemplating securing access to OTSE and the index:
• Socket API ports
• RMI API ports
• Access to folders where OTSE stores the index on disk.
• Access to the configuration files – search.ini, search.ini_override,
llfielddefinitions.txt, fieldmodedefinitions.ini.
• Access to create indexing requests, written to an input IPool folder.
• Access to logging files or folders.
• Access to the search agents configuration file.

• Access to the search agents output IPool.
• Execute permissions for launching the application.
• Folders used in backup and restore operations.
• Java security policy
As mentioned in the performance tuning section, you should not implement virus
scanning applications for the search index. The performance degradation is severe
in these cases. Virus scanning should be implemented upstream where objects are
added to Content Server or the ECM application which is using the search
technology.

Java Security Policy


To improve security, OTSE can leverage a Java Security Policy file. By default, no
policy file (or a permissive policy file) is provided. However, when present, the policy
file can be used to enforce restrictions such as which IP addresses can connect
using sockets. The policy files to be used are specified in the search.ini file, and
located in the config directory. The features of the policy file are standard Java
capabilities, and are not documented here.
A typical policy file might look like this:

grant {
permission java.io.FilePermission "<<ALL FILES>>", "read, write, delete, execute";
permission java.lang.RuntimePermission "loadLibrary.jniipool";
permission java.lang.RuntimePermission "loadLibrary.jnimigrate";
permission java.lang.RuntimePermission "loadLibrary.libjniipool";
permission java.lang.RuntimePermission "loadLibrary.libjnimigrate";
permission java.lang.RuntimePermission "accessClassInPackage.*";
permission java.util.PropertyPermission "user.home", "read";
permission java.util.PropertyPermission "user.dir", "read";
permission java.util.PropertyPermission "java.security.policy", "read, write";
permission java.util.PropertyPermission "java.rmi.server.codebase", "read, write";
permission java.util.PropertyPermission "problemGenerator", "read";
permission java.lang.RuntimePermission "setIO";
permission java.util.PropertyPermission "sun.net.client.*", "read, write";
permission java.net.NetPermission "setDefaultAuthenticator", "read, write";
permission java.util.PropertyPermission "http.strictPostRedirect", "read, write";
permission java.net.SocketPermission "127.0.0.1", "accept, connect, listen, resolve";
permission java.net.SocketPermission "10.5.26.41", "accept, connect, listen, resolve";
};
The IP Whitelist capability is illustrated by the SocketPermission entries. Both IPv4
and IPv6 forms are accepted. This file is created by Content Server if the
administrator enables the IP Whitelist feature in the search administration pages.
The security policy file applies to the Update Distributor, Index Engines, Search
Federators and Search Engines. If RMI is used, the RMI Registry component
distributes the policy file to the other components. If socket communications are
being used instead of RMI, then each component loads the security policy file
independently.

Backup and Restore


Having a reasonably current backup of the search index is a key part of maintaining
your system. Index backups are instrumental in restoring service in the event of
index corruption or for disaster recovery.
The backup methods described here back up or restore only the index. You must
ALSO ensure that you have an appropriate copy of the supporting files available in
the event that these are changed between the time a backup and a restore occur.
This includes configuration files such as search.ini, search.ini_override,
LLFieldDefinitions.txt, FieldModeDefinitions.ini, and any custom tokenizers, thesaurus
or similar files. In most cases, these will be unchanged, and some of these files are
re-generated by Content Server – however, making a copy of these files is strongly
recommended.
There are three different approaches to backup and restore operations. The
recommended approach is to use the backup feature first made available in 16.2.9,
which is described first.
The second approach is to stop the search system and use operating system file
copy utilities. The index is a collection of files, so this approach is relatively easy –
but it has the undesirable requirement that search and indexing are disabled for the
duration. As systems grow large (many TB of index), an outage for backups
becomes material.
The third approach is to use the backup and restore utilities. These utilities support
both complete and differential backups, and have been in the product for many
years. They are superseded by the backup commands in Content Server 16.2.9
(June 2019), but remain in the search engine to support older versions of Content
Server (e.g. Content Server 16.0). This approach is being deprecated, since it is
complex to understand and manage, and most customers have abandoned its use.

Backup Feature – Method 1


The backup command is an instruction to the Update Distributor to create a complete
set of index backups. The Update Distributor will communicate with each Index
Engine, instructing the Index Engine to create a backup of its associated partition. In
a large search grid, the Update Distributor will manage the number of Index Engines
that are creating backups concurrently to ensure that CPU and disk capacity are not
abused.
The backup process does NOT require a search outage. Indexing and search
operations continue, subject to possible impacts of additional CPU and IO used by
the backup process. This method creates a complete backup of the grid. The
backup does not represent a single moment in time – each partition may have a
different capture time. The Index Transaction Logs can be used in conjunction with
the backups to reconstitute a current index from the backups.
There are several configuration settings that control the behavior of the backup
process. In the [UpdateDistributor_] section of the search.ini file:
BackupParentDir=c:/temp/backups
MaximumParallelBackups=4
BackupLabelPrefix=MyLabel
ControlDirectory=
KeepOldControlFiles=false
The BackupParentDir setting specifies where the backups should be written. This
must be a drive mapping that is visible to all the Admin servers running search
indexing processes. Within this directory, a sub-directory named with the time the
backup starts will be created, and within that directory each Index Engine will create
a directory using the partition name to store its index. You must have enough space
available to capture a complete copy of the index. The MaximumParallelBackups
setting determines how many Index Engines can be running backups concurrently;
this number should reflect the CPU and disk capacity of your system. The
BackupLabelPrefix is optional and can be used by a controlling application to help
track status. The ControlDirectory is optional, allowing you to override the default
location for control files used to manage the backup process. KeepOldControlFiles
is included for completeness and is generally reserved for running test scenarios.
Except for the ControlDirectory, these settings can be reloaded (changed without
restart). However, some of the settings are only used at the start of a backup, and
best practice is to make changes only when there is no backup running.
The admin port on the Update Distributor will listen for and respond to the following
commands related to creating backups:
backup
backup pause
backup resume
backup cancel
getstatustext
Backup starts a new backup process. Cancel and pause will complete writing
backups for the partitions that have already been instructed to create backup files;
this may take several minutes, so status checks include a “pausing” status (note that
some partitions may still be writing their output even though the status is “paused”).
Resume continues a paused backup. The response to backup commands is “true” if
the command has been accepted and acted upon, false otherwise.
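As an illustration, a small Python sketch that drives these commands over a socket
might look like the following. The host, the port, and the newline-terminated
command framing are assumptions for this sketch, not documented protocol details;
use the admin port configured for your Update Distributor:

import socket

ADMIN_HOST = "localhost"   # assumption: Update Distributor host
ADMIN_PORT = 5594          # assumption: your configured admin port

def admin_command(cmd):
    # Assumes a simple newline-terminated text exchange; a single
    # recv is sufficient for this illustration.
    with socket.create_connection((ADMIN_HOST, ADMIN_PORT), timeout=30) as s:
        s.sendall((cmd + "\n").encode("utf-8"))
        return s.recv(65536).decode("utf-8", "replace")

print(admin_command("backup"))          # "true" if accepted
print(admin_command("getstatustext"))   # includes <BackupStatus> elements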

Getstatustext responses are extended to include information about backups. Status
values include: None, InProgress, Completed, Paused, Pausing, Cancelled and
Failed. The details about the backups are returned as XML elements in a
getstatustext operation, along these lines:

<BackupStatus>
<InBackup>InProgress</InBackup>
<BackupLabel>MyLabel_20190322_112519734</BackupLabel>
<TotalPartitionsToBackup>10</TotalPartitionsToBackup>
<PartitionsInBackup>4</PartitionsInBackup>
<PartitionsFinishedBackup>0</PartitionsFinishedBackup>
<BackupDir>C:\p4\search.ot7\main\obj\log\ot7testoutput\
BackupGridTest_testBackup5199Ten4\backups\
20190322_112519734</BackupDir>
<BackupMessage></BackupMessage>
</BackupStatus>

Up to 3 BackupStatus elements may exist, one each for “InProgress” (including
Paused or Pausing), “Cancelled” (including Failed) and “Completed”.
The Update Distributor persists the status and progress of backups in files named
upDist.#. A file named upDist.ctl is used to track which upDist.# file is current.
Because this data is persisted, any backup in progress will resume and complete
even if the Update Distributor is stopped and started. By default, these files are
stored in the same directory that contains the Update Distributor log files. The file
contents are similar to this:
UpDistVersion 1
BackupStatus Completed
BackupTimestampString 20190325_143138107
BackupLabel MyLabel_20190325_143138107
BackupDir c:/temp/backups\20190325_143138107
NumPartitionsInThisBackup 1
NumBackupPartitionsCompleted 1
EndOfBackupRecord ----------------------------------------
BackupStatus Cancelled
BackupTimestampString 20190325_162312142
BackupLabel MyLabel_20190325_162312142
BackupDir c:/temp/backups\20190325_162312142
NumPartitionsInThisBackup 1
NumBackupPartitionsCompleted 0
EndOfBackupRecord ----------------------------------------
EndOfUpDistState
When a backup process completes successfully, a file named goodBackup.txt is
added to the backup location. This file is not required by OTSE for operation – its
presence makes it easier for administrators inspecting the file system to determine
that the backup in that location is good. The file contains a summary of the backup
using the same syntax as the upDist.ctl file, for example:

BackupStatus Completed
BackupTimestampString 20200528_123935489
BackupLabel SocketGridBase_BackupLabel_20200528_123935489
BackupDir C:\p4\search.ot7\main\obj\log\ot7testoutput\BackupGridTest_testBackup5179a\backups\20200528_123935489
NumPartitionsInThisBackup 2
NumBackupPartitionsCompleted 2
EndOfBackupRecord ----------------------------------------
Restoring Partitions
When restoring an index, the search partition(s) being restored must first be stopped.
Use file copy to restore the entire contents of the partition backup, then start the
Index Engine and Search Engine. The Transaction Logs can then be used to identify
missing transactions and bring the index up to date (be sure you have Transaction
Logs enabled). As a convenience, entries are written to the Transaction Logs to mark
the point at which a backup occurred. The backup markers in the Transaction Log
have this form:
2018-06-11T20:49:57Z, Backup started,
backupDir="c:/temp/backups\20180608_132859489/partition1",
label="MyLabel_20180608_132859489-partition1", config="livelink.27"

Backup – Method 2
Operating system file copy utilities can be used to back up the search index. All
search and index processes must be stopped for this approach to succeed. Ensure
that the entire contents of the index directories for each partition are copied.

Backup Utilities – Method 3


This method is no longer recommended. It remains in the product for backwards
compatibility, and is used in versions of Content Server up to 16.2.8. All backup and
restore descriptions in the remainder of the section are related to this method.
The backup utility has the ability to make a backup copy of an index while the index is
still in use. The backup utility also runs index verification checks before initiating a
backup, and also on the backed up copy of the files. This is equivalent to a “Level 1”
plus a partial “Level 4” index verification, which is sufficient to ensure that the files are
not corrupted – although it does not test the internal correctness of the data. Note
that the verification of backups is a new feature with Update 3.
Both complete and differential (incremental) backup operations are available. Use of
differential backups is discouraged, since when restoring you must ensure that the
right sequence of complete and differential backups is applied. OpenText is
considering whether differential backups as a feature should be removed from
Content Server in a future version.
When used with older versions of Content Server, the backup and restore features
are accessible through the search administration pages of Content Server – and
Content Server looks after the difficult bits of setting up the INI files and running the
utilities. This backup method is not supported with current updates of Content
Server.
Differential Backup
The very first backup ever performed must necessarily be a full backup.
Subsequent backups can be differential backups or full backups depending on your
preference. A differential backup differs from a full backup in that it only makes a
copy of files that have changed from the last backup that was performed. These files
are:
• metaLog and accumLog: these change frequently. The backup always saves
these for both full and differential backups.
• checkpoint file: for some partitions this can be a large file (over a GB). It is
only copied if it has changed.
• sub-index fragments: new fragments are saved.
The differential backup reduces the amount of disk space required for the backup
and also reduces the time taken to make the backup. However, it makes the restore
process more complex, and requires that you have a complete trail of differential
backups available, tracing back to a full backup.
Backup Process Overview
The backup and restore processes rely on special configuration files to control their
behavior and to record the status of the backups. As an administrator, you should
normally not modify these files. Content Server automatically generates these files
as needed for backups. This information is primarily for troubleshooting and as a
starting point for developers that are integrating index backup and restore into their
applications.
To run a full backup, a configuration file with the name ‘Full.ini’ must first be created
and placed in each partition folder. For a differential backup, a file with the name
‘Diff.ini’ must be created.
The backup utility is then run, which performs the backup operation on a single
partition.
On completion, the backup data is contained in a target directory, called FULL for a
full backup and DIFFx for a differential backup (where x is the order number of this
differential backup relative to the baseline full backup). The backup process also
creates a file called backup.ini, with copies in the source and backup target partition
folders.
Sample Full.ini File
Note that the Diff.ini file is identical except for its name. The Full.ini file uses basic
Windows INI file syntax with a single section, [Backup]. Comments are injected here
for explanatory purposes (lines starting with a # symbol); these should not exist in
the actual file. In practice, the only values you may want to change are the log file
name and log level.

[Backup]
# AutoNewDir requests that a new folder is created if it
# does not already exist.
AutoNewDir=True
DelConfig=FALSE

# DestDir identifies the folder where the backup should
# be placed.
DestDir=F:/backup/cs1064main01/ent

# These strings are identifiers that are required. Do
# not change these values.
DiffString=Differential
FullString=Full

# Index is the root location of the source index being backed up.
Index=F:/OpenText/cs1064main01/index/enterprise/index1

# Specify the names of regions that contain date and time values
# that can reasonably be expected to reflect object index dates.
IndexDateTag=OTCreateDate
IndexTimeTag=OTCreateTime

# Specify a name for the index. Can leave this as a constant.
IndexName=Livelink

# Label is a template for how the backup should be named.
# LangFile provides additional hints for formatting the Label.
Label=Enterprise_%m_%d_%Y_%T_58863
LangFile=F:/OpenText/cs1064main01/config/backuplabel.xml

# Specify logging level and locations for the backup process
LogFileName=F:/OpenText/cs1064main01/index/enterprise/index1/starskyX9099X18137X59605.log
LogLevel=1

# Option – leave as COPY. Don’t change this.
Option=COPY

# The location of the Content Server binaries, used to perform
# a search to obtain the most recent date and time
OTBinPath=F:/OpenText/cs1064main01/bin
ScriptFileName=

# Specify whether a full or differential backup is to be performed.
# Values can be DIFF or FULL
Type=DIFF

Sample Lang File


This file is used by the automatic file naming process in the backups to map numeric
date and time values to display forms. The default file uses English language
conventions for days and months. This is a convenience function; unless you simply
cannot accept English date structures for file names, you should leave this file alone.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<labelstrings xml:lang="">
<days>
<day number="1" longformat="Sunday" shortformat="Sun" />
<day number="2" longformat="Monday" shortformat="Mon" />
<day number="3" longformat="Tuesday" shortformat="Tue" />
<day number="4" longformat="Wednesday" shortformat="Wed" />
<day number="5" longformat="Thursday" shortformat="Thu" />
<day number="6" longformat="Friday" shortformat="Fri" />
<day number="7" longformat="Saturday" shortformat="Sat" />
</days>
<months>
<month number="1" longformat="January" shortformat="Jan" />
<month number="2" longformat="February" shortformat="Feb" />
<month number="3" longformat="March" shortformat="Mar" />
<month number="4" longformat="April" shortformat="Apr" />
<month number="5" longformat="May" shortformat="May" />
<month number="6" longformat="June" shortformat="Jun" />
<month number="7" longformat="July" shortformat="Jul" />
<month number="8" longformat="August" shortformat="Aug" />
<month number="9" longformat="September" shortformat="Sep" />
<month number="10" longformat="October" shortformat="Oct" />
<month number="11" longformat="November" shortformat="Nov" />
<month number="12" longformat="December" shortformat="Dec" />
</months>
<era>
<era number="1" shortformat="BC" />
<era number="2" shortformat="AD" />
</era>
<timeperiods>
<timeperiod number="1" shortformat="AM" />
<timeperiod number="2" shortformat="PM" />
</timeperiods>
</labelstrings>

Related to this are the format codes that are used in the label string. The codes are:

Value  Description
%%     A percentage sign
%a     The three-character abbreviated weekday name (e.g., Mon, Tue)
%b     The three-character abbreviated month name (e.g., Jan, Feb)
%d     The two-digit day of the month, from 01 to 31
%j     The three-digit day of year, from 001 through 366
%m     The two-digit month, from 01 to 12
%p     AM or PM
%w     The 1-digit weekday, from 1 through 7, where 1 = Sunday
%y     The two-digit year (e.g., 93)
%A     The full weekday name (e.g., Monday)
%B     The full month name (e.g., January)
%H     The two-digit hour on a 24-hour clock, from 00 to 23
%I     The two-digit hour, from 01 through 12
%M     The minutes past the hour, from 00 to 59
%P     AD or BC
%S     The seconds past the minute, from 00 to 59
%Y     The year, including the century (e.g., 1993)
%T     Replaced with the value of FullString or INCRString specified
       on the command line or in the backup config.ini file
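Most of these codes behave like strftime-style substitutions, with %T handled
specially. A small Python approximation (not the utility’s actual implementation)
expanding the Label template shown in the Full.ini sample:

from datetime import datetime

def expand_label(template, when, t_string):
    # %T becomes the FullString or DiffString value; the remaining
    # codes in this example happen to match strftime (%w and %P do not).
    return when.strftime(template.replace("%T", t_string))

print(expand_label("Enterprise_%m_%d_%Y_%T_58863",
                   datetime(2011, 4, 8), "Full"))
# -> Enterprise_04_08_2011_Full_58863

The expanded value matches the Label recorded in the Backup.ini sample below.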

Sample Backup.ini File


The Backup.ini file is created or updated after a backup operation completes. It is
later used by the restore utility. Its basic purpose is to record the status of the
backup, the files which are included in the backup, and checksum data that allows
the backup and restore operations to validate correct file copies.
This particular example is from the second differential backup of an index. Each
differential backup results in another [DIFFx] section. A full backup would contain
only the [FULL] section, with no [DIFFx] sections. Commentary and white space
have been added.

[General]
# 0 status is good, other values are error codes
Status=0

# The last differential backup number
DIFF=2

DiffString=Differential
FullString=Full

[FULL]
CheckPointSize=624
MetaLogNumber=51
MetaLogOffset=0
AccumLogNumber=39
AccumLogOffset=0
I1=61
I1Size=447
I2=66
I2Size=39
TotalIndexSize=1109
Label=Enterprise_04_08_2011_Full_58863
Date=20110408 145139
MetaLogChkSum=524293
AccumLogChkSum=524293
CheckPointChkSum=206517074
I1ChkSum=15804739427
I2ChkSum=11071697352
ConfigChkSum=1160933350
Success=0

[DIFF2]
CheckPointSize=665
MetaLogNumber=53
MetaLogOffset=0
AccumLogNumber=41
AccumLogOffset=11785068
I1=69
I1Size=9
TotalIndexSize=674
Label=Enterprise_04_08_2011_Differential_58863
Date=20110408 150047
MetaLogChkSum=524293
AccumLogChkSum=258080884
CheckPointChkSum=1282731032
I1ChkSum=9500506248
ConfigChkSum=624209792
Success=0

[DIFF1]
CheckPointSize=664
MetaLogNumber=52
MetaLogOffset=4732284
AccumLogNumber=39
AccumLogOffset=5824292
TotalIndexSize=664
Label=Enterprise_04_08_2011_Differential_58863
Date=20110408 145644
MetaLogChkSum=1542696885
AccumLogChkSum=238343344
CheckPointChkSum=3018112456
ConfigChkSum=389190926
Success=0
Running the Backup Utility
Once the Full.ini or Diff.ini file is in place, the backup utility can be run. The utility is
contained within the Search Engine, and is documented in the Utilities section of this
document.

Restore Process – Method 3


The restore procedure is considerably more complex than the backup. In its simplest
form, restoring an index comprises the following stages:
Preparation
The partition to be restored is placed in a known location. A configuration file called
restore.ini is created which points to this location. The target directory needs to be
empty, which means moving or deleting any existing index.
Analysis
The backup.ini file from the backup location is analyzed to determine which files and
folders are required to perform the restore operation. This information is written into
the restore.ini file.
The controlling application then needs to prompt the administrator to stage the
necessary folders before proceeding. Content Server is one application which
performs this coordination.
Copy
In the copy phase, the files specified in the restore.ini file are used as a guideline for
copying all the necessary files from that backup location to the search index. The
copy process takes place iteratively, with one differential backup folder processed on
each invocation, and the administrator staging needed files for the next copy
operation. The process is structured to support complex backup storage systems,
where each backup may have been placed in a tape archive.

Validate
The final step is validation, in which the restored index is checked for integrity.

These stages do not automatically happen one after the other. The administrator or
the controlling application needs to initiate the steps sequentially, after ensuring that
appropriate file preparation occurs.
The restore operation works on a single partition. Content Server provides a
mechanism to simplify the restore of the entire index, and prompts the administrator
to ensure the appropriate files and folders are available at each step. The syntax of
the restore utility is documented in the Utilities section of this document.

Restore.ini File
The restore.ini file is used for each stage of the restore procedure, and modified after
each stage. This file is the mechanism for transporting process information from one
phase to the next.
Before first running the analyze stage, a restore.ini file needs to be created that looks
like this:

[restore]
otbinpath=d:\opentext\bin
SourceDir=d:\llbackup\ent\incr18
destdir=d:\temprest
option=analyse
Once the analysis is complete, the restore.ini file will have been updated with
information about the files that will be copied, and should look like this (in the actual
file, the explanatory comments and extra white space shown here are not present):

[restore]
OTBinPath=d:\opentext\bin
BackupIndexName=livelink
LogFilename=indexrestore.log
RestoreHistory=restore.ini
BackupHistory=backup.ini
DestDir=d:\temprest
SourceDir=d:\llbackup\ent\incr18
loglevel=1

# CurrentImage indicates which [IMAGE#] section of this INI
# file should be examined to retrieve the needed files.
CurrentImage=1
success=0

# The insert option identifies that copy will take place next
option=insert

# TotalImage is the number of differential backups that will
# be copied.
TotalImage=4

LastObjectSize=110750
LastObjectDate=20010426

# Each image is all or part of a saved differential backup.
# Only 1 image is shown here.
[IMAGE1]
TotalIndexSize=12
Processed=No
Date=20010612 110624
Label=Enterprise_06_12_2001_Incremental
TotalFrag=5
Frag5Size=0.111492
Frag5CkSum=12169
Frag5=00097
Frag4Size=0.220095
Frag4CkSum=3698
Frag4=00096
Frag3Size=0.468937
Frag3CkSum=59855
Frag3=00095
Frag2Size=5.679858
Frag2CkSum=19250
Frag2=00094
Frag1Size=2.669858
Frag1CkSum=59557
Frag1=00087
Master=Yes
In operation, the administrator (or controlling application) is expected to examine the
IMAGE# section for the current image number, and mount the backup folder which
has the specified label and date. Once this is staged, the administrator edits the
restore.ini file to change the option from “insert” to “copy”, and runs the restore.
The restore utility will then copy the files from that one image, change the option
back to insert, and update the current image number; the process repeats until all
the IMAGE sections are processed.
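The cycle can also be driven by a controlling script. Below is a hedged Python sketch
of such a driver; the restore utility name ("otrestore") and its invocation are
placeholders – refer to the Utilities section for the actual command syntax:

import configparser
import subprocess

RESTORE_INI = "d:/temprest/restore.ini"   # placeholder path

while True:
    ini = configparser.ConfigParser()
    ini.read(RESTORE_INI)
    restore = ini["restore"]
    if restore.get("option", "").lower() != "insert":
        break    # all IMAGE# sections have been processed
    image = ini["IMAGE" + restore["CurrentImage"]]
    input("Stage backup '%s' dated %s, then press Enter..."
          % (image["Label"], image["Date"]))
    restore["option"] = "copy"
    # Note: configparser lowercases key names when writing; that is
    # acceptable for this illustration only.
    with open(RESTORE_INI, "w") as f:
        ini.write(f)
    # Placeholder invocation of the restore utility, which updates
    # restore.ini (option back to insert, next CurrentImage) itself.
    subprocess.run(["otrestore", RESTORE_INI], check=True)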

Index and Configuration Files


This part of the document provides a complete list of search.ini settings, together
with background information on how the partitions store the index on the file system.

Index Files
OTSE persists the search index on disk, in a specific hierarchy of folders and file
names. This section outlines each folder and file and its purpose. Below is a typical
listing for a search partition, for reference; each entry is described in detail
afterwards. There is one such folder for each partition.

servernameX848474X999040X74657.txt
accumlog.39
checkpoint.51
FieldModeDefinitions.ini
index.lck
livelink.280
livelink.ctl
metalog.51
topwords.100000
MODaccumlog.47
MODindex
\2
\\coreidx1.idx
\\coreidx2.idx
\\coreobj.dat
\\coreoff.dat
\\coreskip.idx
\\map
\\otheridx1.idx
\\otheridx2.idx
\\otherobj.dat
\\otheroff.dat
\\otherskip.idx
\\regionidx1.idx
\\regionobj.dat
\\regionoff.dat
\\regionskip.idx
\\updmask.dat
\3
\\ same
61
\coreidx1.idx
\coreidx2.idx
\coreobj.dat
\coreoff.dat
\coreskip.idx
\map
\otheridx1.idx
\otherobj.dat
\otheroff.dat
\otherskip.dat
\regionidx1.idx
\regionobj.dat
\regionoff.dat
\regionskip.dat
62
\ same

Signature File
This first file in the list, servernameXXXXX.txt, is technically not part of the search
index, and is not required for search or indexing operations. Content Server adds this
file to allow the administration interfaces in Content Server to verify that related
Search Engines and Index Engines are referencing the same directories. If upgrades
or migrations occur, the file may retain older server names; this is expected.

Accumulator Log File


This file (accumlog) contains the incremental updates to the full text index which have
occurred since the index file was last updated and written into an index fragment.
This file is managed by the Index Engines, and consumed by the Search Engines in
the normal course of operation. The accumlog enables rollback of partially
completed transactions.
The file is of the form accumlog.x, where x is a number that increments sequentially
each time the contents of the accumlog are committed to a new index fragment and a
new instance of the accumlog is created. The accumlog contains incremental adds
and deletes.

Metadata Checkpoint Files


Checkpoint files are of the form checkpoint.x, where x is an incrementing integer. A
Checkpoint contains a complete copy of the metadata for the partition, including the
values, the index and the dictionary. Checkpoints are managed by the Index
Engines.
A new Checkpoint file is created when the size of incremental metadata changes
(metalogs) exceeds a configuration value, typically 16 Mbytes. New checkpoint files
are also created when index conversions are performed during Index Engine startup.
To ensure synchronization, checkpoint files are written at the same time by all
partitions in a system. The coordination of simultaneous checkpoint creation is
directed by the Update Distributor.

Upon startup, or upon resynchronization, Search Engines load their metadata image
from the checkpoint file, and then apply incremental changes from the metalogs.
It is possible for multiple checkpoint files to exist for a partition. Normally, this only
occurs for a short period, when a Search Engine is still using an older checkpoint file
after the Index Engine has created a new one. The Index Engines will reduce the
number of checkpoint files to one at the earliest safe opportunity.
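Conceptually, the startup sequence resembles this Python sketch. It is an illustration
only – the checkpoint and metalog formats are internal, so the reader and replayer
below are hypothetical stand-ins:

import glob

def read_checkpoint(path):
    # Hypothetical reader; the real checkpoint is a binary image
    # of the partition metadata.
    return {"checkpoint": path, "replayed": []}

def apply_metalog(metadata, path):
    # Hypothetical replayer for incremental metadata updates.
    metadata["replayed"].append(path)

def load_partition_metadata(partition_dir):
    checkpoints = glob.glob(partition_dir + "/checkpoint.*")
    if not checkpoints:
        return None
    number = lambda p: int(p.rsplit(".", 1)[1])
    latest = max(checkpoints, key=number)
    metadata = read_checkpoint(latest)
    # Apply metalogs numbered at or after the checkpoint (the
    # numbering relationship follows the listing shown earlier).
    for log in sorted(glob.glob(partition_dir + "/metalog.*"), key=number):
        if number(log) >= number(latest):
            apply_metalog(metadata, log)
    return metadata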
Lock File
The Lock File is used by the Index Engine to indicate that this partition is in use. This
is a failsafe mechanism to ensure that multiple Index Engines will not attempt to use
the same data. In a properly configured system, this would not happen. The Lock
file provides additional insurance.
Control File
The Control File, named Livelink.ctl, is used by the Index Engines to record the name
of the current Config file. The Search Engines read this file to obtain the name of the
current Config file. To ensure atomic reads and writes, both the Index Engine and
Search Engine will lock this file when accessing it.
Top Words
Optional file. Top Words are used to track which words in an index are candidates for
exclusion from TEXT queries because they are too common. The file is named
topwords.n, where n is one of 10000, 100000 or 1000000 – which reflects the
number of objects in the partition when the file was generated.
Config File
Named livelink.x, where x is an incrementing number. The config file contains
detailed information about the index fragments, working file offsets, file checksums,
and other parameters needed by the Index Engine and Search Engine to properly
interpret the index files.
A new Config file is written each time the Index Engine creates a new fragment or
generates a checkpoint. A Search Engine will place a non-exclusive lock on the
Config file which represents the accumlog and metalog files it is currently consuming.
The Index Engine will clean up older, unused Config files.

Metalogs
A metalog contains incremental updates to metadata. The Index Engine writes
updates to the metalogs, and occasionally creates a checkpoint file that rolls up all
the metalogs since the last checkpoint into a new checkpoint file.
Search engines consume updates from the metalog files to keep their copy of the
metadata current. When a metalog exceeds a configurable size, a new checkpoint is
created and a new metalog started. It is possible for multiple metalogs to exist for
short periods while the Search Engines consume older metalogs.

Index Fragment Folders


The full text content for a partition is broken into partition fragments. Each fragment
is contained within a numbered folder within the partition index folder. In the example
at the start of this section, these folders are labeled 61 and 62. Folder 61 is exploded
to show the files within.
A new Index Fragment is created when the Index Engine fills the accumulator and
‘dumps’ it to disk. The files within a fragment are never modified once written to disk.
The Index Engines occasionally merge fragments to consolidate them, creating new
larger fragments in the process, and allowing the smaller fragments to be deleted. A
cleanup task in the Index Engine will delete the older, smaller fragments once the
Search Engine stops referencing them.
In an optimal configuration, the merge process attempts to structure the fragments
such that the number approaches about 5 fragments, with geometrically related
sizes. For example, 1000 MB, 300 MB, 100 MB, 30 MB, 10 MB. In practice, the
sizes will vary from this pattern given the reality of the sizes available for merging and
the opportunistic scheduling of merges based on the indexing load. If the indexing
load is high and sustained, the opportunity for merges may be rare, and the number
of Index Fragments can become large. Large numbers of fragments are undesirable
for query performance, so there is a configuration setting in the search.ini file that
places an upper limit on the number of acceptable fragments, which will force merge
activity, stalling the indexing process if necessary.
Within the Index Fragment Folder, there are a number of files as described below.
Core, Region and Other
Examining the fragment folder, note that there are files of the same type but having
the prefixes core, region and other. These file sets are similar, but used for different
data.
The ‘core’ files contain the full text search data for words which are comprised of the
basic ASCII character set (typically English).
The ‘region’ files contain the full text index for XML region names. These are special
cases that improve the performance of search for values within XML fields.
The ‘other’ files contain the full text index for all other words – those which are not
English and not XML tags.
The descriptions below for core files are also applicable to the files with ‘region’ and
‘other’ prefixes.
Index Files
The file coreidx1.idx contains the ‘dictionary’ of terms, plus pointers to the object id
file. As the dictionary grows large, multiple levels of dictionary pointers are created,
so you will often see coreidx2.idx, coreidx3.idx, and so forth. These higher numbered
index files contain references to the lower numbered files, with successively more
accurate data points. For instance, the coreidx2.idx file contains entries for every
16th dictionary value. This hierarchy improves dictionary lookup time. This structure
repeats until the highest numbered index file is smaller than 1 MByte. This 1 MByte
dictionary is kept in memory to optimize performance.
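The effect can be illustrated with a small Python sketch. This is a conceptual model
of the lookup, not the on-disk format; the toy samples every 2nd entry per level
instead of every 16th:

import bisect

terms = sorted("alpha bravo charlie delta echo foxtrot golf hotel".split())
SAMPLE = 2   # the real engine samples every 16th entry per level

# Build sparser levels until the top level is small enough to keep
# in memory (the engine uses a 1 MByte threshold; the toy uses 2 entries).
levels = [terms]
while len(levels[-1]) > 2:
    levels.append(levels[-1][::SAMPLE])

def lookup(word):
    # Narrow the candidate range level by level, so only a small
    # segment of the full dictionary is ever examined.
    lo, hi = 0, len(levels[-1]) - 1
    for depth in range(len(levels) - 1, 0, -1):
        level = levels[depth]
        i = bisect.bisect_right(level, word, lo, min(hi + 1, len(level))) - 1
        i = max(i, lo)
        lo = i * SAMPLE
        hi = min(i * SAMPLE + SAMPLE, len(levels[depth - 1]) - 1)
    return word in levels[0][lo:hi + 1]

print(lookup("foxtrot"))   # True
print(lookup("zebra"))     # False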

Object Files
The file coreobj.dat contains a list of all internal object IDs and pointers to the word
location lists in the offset file.
Offset File
The file coreoff.dat contains the lists of word offsets. These word offsets indicate to
the search engine the relative position of a word within an indexed object.
Skip File
The file coreskip.dat contains pointers to the offset file that allows the Search Engine
to quickly skip over large data sets.
Map File
The map file contains checksums that can be used to verify that the index fragment
files have not been corrupted. There is only one Map file per partition fragment.

Low Memory Metadata Files


The Low Memory mode of storing text region indices is available starting with Update
8, disabled by default. When configured, there are additional files in the index folder.
The purpose and functions of the Low Memory metadata files are the same as the
corresponding full text index files, except that they contain indices for text metadata
regions.
The Low Memory metadata index fragment subfolders are contained in the directory
called MODindex. There is also a MODaccumlog file, which is analogous to the
accumlog file.
If there are multiple metadata index fragment subfolders, then some of the subfolders
will also contain a file called updmask.dat, which is used to identify objects with
entries in earlier fragments that have been modified.

Metadata Merge Files


The Merge File storage mode places the text metadata values in a MODcheck file,
with incremental changes in a MODcheckLog file. These operate much like the full
text index files – using background merge processes to consolidate recently changed
values into larger, compacted files.
The files in the index are as follows:

MODCheck.x
This is the master file for the metadata values, and the target after a merge.
The value of x increments after each merge operation.
MODcheckLog.x
Changes to text values are recorded in this file until a merge operation
occurs.
MODpremerge.x+1
MODptrs.x+1
Files containing pointers used for recovery and playback during startup.
It is possible that multiple versions (values of .x) of these files may exist, especially if
a Search Engine is lagging in accepting updates from the Index Engine, or multiple
Search Engines exist.

Configuration Files
OTSE derives the bulk of its configuration settings from a number of files. In this
section, we review each of the files to convey the basic purpose of each.

Search.ini
Most settings for OTSE are contained within the search.ini file. There is one
search.ini file per Admin Server. In practice, this usually means one per physical
computer, although other permutations are possible.
When used with Content Server, the search.ini file is generated by Content Server.
Although Content Server may preserve some of the edit changes you might make to
the search.ini file, this is not guaranteed. In general, you should not edit this file.
Most of the entries are set by Content Server, and using the Content Server search
administration pages is the preferred method for interacting with this file.
If you must edit this file within a Content Server application, consider using the
search.ini_override file instead.
The search.ini file follows generally accepted conventions for the structure of a ‘.ini’
file.
The file consists of several configuration sections. Where sections contain settings
for a particular partition, the section name will include the partition name. Refer to
the Search.ini section of this document for detailed information on entries in the
Search.ini file.

Search.ini_override
This file is specifically designed to supplement or override any values set in the
search.ini file. Because the search.ini file is controlled by Content Server, editing the
search.ini file does not ensure that your changes will be preserved.
The override file is optional. When present, it need contain only those configuration
settings which you want to take precedence over the default settings or the settings
within the search.ini file.
There is a special value that can be used in override settings, the DELETE_OVERRIDE
value. When this value is encountered, it means that the explicit value for the setting
in the search.ini file should be ignored, and the default value used instead.
For example, the default value for CompactEveryNDays is 30. If the search.ini file
contains the setting:

CompactEveryNDays=100
But the search.ini_override file contains:

CompactEveryNDays=DELETE_OVERRIDE
Then the default value of 30 will be used.
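Conceptually, the value resolution works as in this short Python sketch (an
illustration, not the engine’s code):

DELETE_OVERRIDE = "DELETE_OVERRIDE"

def effective_value(name, override_ini, search_ini, defaults):
    # search.ini_override wins; DELETE_OVERRIDE discards any explicit
    # search.ini value and falls back to the built-in default.
    if name in override_ini:
        if override_ini[name] == DELETE_OVERRIDE:
            return defaults[name]
        return override_ini[name]
    return search_ini.get(name, defaults[name])

print(effective_value("CompactEveryNDays",
                      {"CompactEveryNDays": "DELETE_OVERRIDE"},
                      {"CompactEveryNDays": "100"},
                      {"CompactEveryNDays": "30"}))
# -> 30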
Note that the override file may need to be edited any time the partition configuration
changes. The most common situation is that when you create new partitions, you will
need to add corresponding sections to the override file.
If you use automatic partition creation (such as date based partition creation) within
Content Server, you may have difficulty keeping the override file current with newly
created partitions, and the override file might not be a good choice for this type of
deployment.

Backup.ini
This is an optional configuration file which is used to set the parameters for index
backup operations and record the status of the last backup operation. You should
not normally modify this file. Refer to the section on index backup for more
information.

FieldModeDefinitions.ini
This file defines the storage modes for text metadata regions, and should be located
in the partition directory. There is one FieldModeDefinitions.ini file per partition.
Although each partition could have different settings, keeping them identical across
partitions is generally recommended, and within a Content Server environment this is
enforced. A FieldModeDefinitions.ini file has the following form:

[General]
NoAdd=DISK
ReadOnly=DISK
ReadWrite=RAM

[ReadWrite]
someRegion1=DISK
someRegion2=RAM

[ReadOnly]
someRegion1=RAM
someRegion3=DISK

[NoAdd]
someRegion1=DISK_RET
someRegion2=RAM

The General section defines the default storage mode for text metadata regions.
The ReadWrite, ReadOnly and NoAdd sections allow control over the storage of
specific regions, and have priority over the General section. The possible values
are DISK, RAM and DISK_RET. Refer to the section on text metadata storage for
details.
Within Content Server, the FieldModeDefinitions.ini file is created and managed by
Content Server, and should not be edited.
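The precedence rules can be summarized in a few lines of Python (a conceptual
sketch of a parsed FieldModeDefinitions.ini, not OTSE code):

# Parsed form of the sample file above.
defs = {
    "General":   {"NoAdd": "DISK", "ReadOnly": "DISK", "ReadWrite": "RAM"},
    "ReadWrite": {"someRegion1": "DISK", "someRegion2": "RAM"},
    "ReadOnly":  {"someRegion1": "RAM", "someRegion3": "DISK"},
    "NoAdd":     {"someRegion1": "DISK_RET", "someRegion2": "RAM"},
}

def storage_mode(partition_mode, region):
    # A region entry in the partition-mode section wins; otherwise
    # the General default for that partition mode applies.
    return defs.get(partition_mode, {}).get(region,
                                            defs["General"][partition_mode])

print(storage_mode("ReadWrite", "someRegion1"))   # DISK (specific entry)
print(storage_mode("ReadWrite", "someRegion3"))   # RAM (General default)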

LLFieldDefinitions.txt
The field definitions file has several purposes. Experience indicates that most
customers do not understand or modify this file, which is unfortunate, since significant
performance and memory use benefits may be possible by reviewing and editing this
file BEFORE indexing your content. Once an index has been created, it is not
possible to change some of the settings in this file without generating startup errors.
One function of the file is to establish the type for each metadata region to be
indexed. Each region is tagged with a type such as:
• INT
• LONG
• TEXT
• DATETIME
• TIMESTAMP
• USER
• CHAIN
• AGGREGATE-TEXT
A second purpose for the field definitions file is to provide metadata parsing hints for
nested metadata regions. Using the NESTED operative, the input IPool parser can
ignore outer tags and extract and index the inner region elements.
The field definitions file also provides instructions for special handling of certain
region types. This includes dropping, removing, renaming and merging metadata
regions. You can also use the aggregate feature to create a new region comprised of
multiple text regions.
One field definitions file is required per Admin server. As a general rule, the field
definitions files should be identical across Admin servers; differences will result in
inconsistent handling of regions between partitions.
Content Server does not edit, generate or manage this file. In general, changes to
this file must be done manually. There is one exception to this – the search.ini file
has a special setting for logically appending lines to the LLFieldDefinitions.txt file.
This allows limited control over the definitions from Content Server. For example, if
the search.ini file contained these two lines:
ExtraLLFieldDefinitionsLine0=CHAIN MyID UserID TwitterID FacebookID
ExtraLLFieldDefinitionsLine1=LONG OTBigNumber
Then at startup time, OTSE acts as if these lines existed at the end of the
LLFieldDefinitions.txt file:
CHAIN MyID UserID TwitterID FacebookID
LONG OTBigNumber
Content Server usually ships with two versions of this file – a standard version, and
one for use with Enterprise Library Services. Which version to use is determined by
a setting in the search.ini file:

FieldDefinitionFile=LLFieldDefinitions.txt
Detailed information about each of the functions and data types of the field mode
definitions file can be found in the section of this document which covers metadata
regions.

SEARCH.INI Summary
This section gathers together most of the accessible configuration values that can be
used in the search.ini file, or the search.ini_override file. There are a number of
additional values which are only used for specific debugging or testing purposes that
are not listed here. A number of these configuration values are covered in more
detail in relevant sections of this document.
Not all processes read all sections of the search.ini file. Content Server generates
search.ini files for each process, and typically only includes values needed by the
process. Note that Content Server files do not include all of the entries, and default
settings are common.
Default values are displayed in this section wherever possible. Annotations in this
section are indicated with a // at the beginning of the line; this is not syntax
supported in an actual search.ini file, it is used here purely as a documentation device.
The settings in the INI file are applied when the processes start. Changes to this file
may require a restart of some or all of the search grid in order to take effect. Some of
these values can be re-applied to a running process without a restart, refer to the
“Reloadable Settings” section for a list.

General Section
This section appears in every search.ini file. Its basic purpose is to share with
all components the configuration settings for the RMI Grid Registry and the Admin
Server. If RMI communication between grid components is not used, the
General section is ignored and is not required.

[General]
AdminServerHostName=localhost

// RMI Registry
RMIRegistryPort=1099
RMIPolicyFile=otrmi.policy
RMICodebase=../bin/otsearch.jar
RMIAdminPort=8997

// RMI Grid Registry logging
RMILogFile=RMIRegistry.log
RMILogTreatment=0
RMILogLevel=10

Partition Section
The Partition section contains basic information about a partition, such as size,
memory usage preferences, and mode of operation. The section name must
include the partition name after the underscore.
[Partition_]
AllowedNumConfigs=500 (-1 = none)
AccumulatorSizeInMBytes=30
PartitionMode=ReadWrite | ReadOnly | NoAdd | Retired

// Size of index on disk
MaxContentSizeInMBytes=50000
StartRebalancingAtContentPercentFull=95
StopRebalancingAtContentPercentFull=92
StopAddAtContentPercentFull=90
WarnAboutAddPercentContentFull=true
ContentPercentFullWarnThreshold=85

// Metadata memory usage
MaxMetadataSizeInMBytes=1000
StartRebalancingAtMetadataPercentFull=99
StopRebalancingAtMetadataPercentFull=96
StopAddAtMetadataPercentFull=95
WarnAboutAddPercentFull=true
MetadataPercentFullWarnThreshold=90

// Set true to reserve partition for large objects
LargeObjectPartition=false

// IE0, IE1, etc. This is REQUIRED
// For RMI, this is location of grid registry
// For sockets, this is location of Index Engine
IE#=//host:port/indexEngineName

DataFlow Section
The DataFlow section contains the majority of configuration settings relating to how
data should be processed. The partition name must be appended to the section
name after the underscore.

[DataFlow_]
FieldDefinitionFile=LLFieldDefinitions.txt
FieldModeDefinitions=FieldModeDefinitions.ini
QueryTimeOutInMS=120000
SessionTimeOutInMS=216000
StatsTriggerThreshold=200

LastModifiedFieldName=OTModifyDate

// Interval for reading metalog files
UpdatePollIntervalInMS=10000

// For tuning use of basestore. In general, don’t touch this.
// Max 200 values in a multivalue text field
// Max 256K of data in a text field
// Allow email management regions to exceed these limits
MultiValueOverflowBoundary=0
MultiValueLimitDefault=200
MetadataValueSizeLimitInKBytes=256
MultiValueLimitExclusionCSL=OTEmailToAddress,OTEmailToFullName,
OTEmailBCCAddress,OTEmailBCCFullName,
OTEmailCCAddress,OTEmailCCFullName,
OTEmailRecipientAddress,OTEmailRecipientFullName,
OTEmailSenderAddress,OTEmailSenderFullName

// Time zone obtained from OS by default; you can set e.g. +5 for EST
TimestampTimeZone=

// Accumulator configuration
ContentTruncSizeInMBytes=10
DumpOnInactiveIntervalInMS=3600000
MaxRatioOfUniqueTokensPerObjectHeuristic1=0.1
MaxRatioOfUniqueTokensPerObjectHeuristic2=0.5
MaxAverageTokenLengthHeuristic1=10.0
MaxAverageTokenLengthHeuristic2=15.0
MinDocSizeInTokens=16384
DumpToDiskOnStart=false
AccumulatorBigDocumentThresholdInBytes=5000000
AccumulatorBigDocumentOverhead=10
CompleteXML=false

// Configure the Reverse Dictionary
ReverseDictionary=false
ReverseDictionaryScanningBufferWordEntries=100000

// Tokenizer
RegExTokenizerFile=otsearchtokenizer.txt

RegExTokenizerFileX=c:/config/tokenizers/partTKNZR.txt
TokenizerOptions=0
UseLikeForTheseRegions=
OverTokenizedRegions=
LikeUsesStemming=true
AllowAlternateTokenizerChangeOnThisDate=20170925
ReindexMODFieldsIfChangeAlternateTokenizer=true

// Facets
ExpectedNumberOfValuesPerFacet=16
ExpectedNumberOfFacetObjects=100000
MaximumFacetValueLength=32
UseFacetDataStructure=true
MaximumNumberOfValuesPerFacet=32767
NumberOfDesiredFacetValues=20
DateFacetDaysDefault=45
DateFacetWeeksDefault=27
DateFacetMonthsDefault=25
DateFacetQuartersDefault=21
DateFacetYearsDefault=10
GeometricFacetRegionsCSL=OTDataSize,OTObjectSize,FileSize
MaximumNumberOfCachedFacets=25
DesiredNumberOfCachedFacets=16

// Facet regions to compute on startup and protect


PrecomputeFacetsCSL=
PersistFacetDataStructure=true

// Enable and configure span features


SpanScanning=false
SpanMaxNumOfWords=20000
SpanMaxNumOfOffsets=1000000
SpanMaxTmpDirSizeInMB=1000
SpanDiskModeSizeOfOr=30

// Disk I/O tuning


DelayedCommitInMilliseconds=0
IOChunkBufferSize=8192
ParallelCommit=true
SmallReadCacheDesiredMaximumSizeOfSmallReadCachesInMB=0
NumberOfFileRecoveryAttempts=5

// Control reporting of disk timing in logs


LogDiskIOTimings=true
LogDiskIOPeriod=25

// Enable recording of network problems


LogNetworkIOStatistics=true

// Search IO buffers default to index buffer size.


// Modest space savings with small performance hit if set smaller.
MaxSizeInBytesOfSearchIOBuffers=-1

// The region name and value used to identify


// when objects should be indexed as XML with text regions
ContentRegionFieldName=OTFilterMIMEType
ContentRegionFieldValue=text/xml

// Enable region forgery checking with otb= attribute


IgnoreOTBAttribute=false

// Several controls exist for determining when a new metalog should


// be created. All are approximations!
MetaLogSizeDumpPointInBytes=16777216
MetaLogSizeDumpPointInObjects=5000
MetaLogSizeDumpPointInReplaceOps=500
MetaLogSizeDumpPointInBytesLowMemoryMode=100000000
MetaLogSizeDumpPointInObjectsLowMemoryMode=50000
MetaLogSizeDumpPointInReplaceOpsLowMemoryMode=5000

SubIndexCapSizeInMBytes=2147483647

// Skips indexing of regions when new data same as old


SkipMetadataSetOfEqualValues=true

// Merge thread
AttemptMergeIntervalInMS=10000
WantMerges=true
DesiredMaximumNumberOfSubIndexes=5
MaximumNumberOfSubIndexes=15
TailMergeMinimumNumberOfSubIndexes=8
MaximumSubIndexArraySize=512
CompactEveryNDays=30
NeighbouringIndexRatio=3

// Enable Merge Files for Text Metadata values, set to 1


MODCheckMode=0
// Control Merge File metalog size and merge time checks
MODCheckLogSizeDumpPointInBytes=536870912
MODCheckMergeThreadIntervalInMS=10000
MODCheckMergeMemoryOptions=0

// Some metadata regions need to be treated as content because


// they are derived from full text. MS Office properties.
ExtraDCSRegionNames=OTSummary,OTHP,OTFilterMIMEType,
OTContentLanguage,OTConversionError,OTFileName,OTFileType

ExtraDCSStartsWithNames=OTDoc,OTCA_,OTXMP_,OTCount_,OTMeta
DCSStartsWithNameExemptions=OTDocumentUserComment,OTDocumentUserExplanation
ExtrasWillOverride=false
// Handle bug where thumbnail requests were indexed as text
EnableWeakContentCheck=true

// Cleanup thread, removes unused files


// Cleanup mode 0 is pre-Update 3 algorithm
FileCleanupIntervalInMS=600000
SubIndexCleanupMode=1
WantFileCleanup=all | none
SecureDelete=false

// Metadata defragmentation
DefragmentFirstSundayOfMonthOnly=0
DefragmentMemoryOptions=2
DefragmentSpaceInMBytes=10
DefragmentDailyTimes=2:30
DefragmentMaxStaggerInMinutes=60
DefragmentStaggerSeedToAppend=SEED

// If a “validate” operation on a Checkpoint file fails, stop.


ContinueOnCorruptCheckpoint=false

// If changing existing types with LLFieldModeDefinitions.txt,


// enter today’s date
EnableRegionTypeConversionAsADate=YYYYMMDD
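// Example (illustrative date of January 26, 2021):
EnableRegionTypeConversionAsADate=20210126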

// List of Content Server fields that MUST be long


FieldsToBeLongCSL=OTCreatedByGroupID, OTDataID, OTOwnerID, OTParentID,
OTUserGroupID, OTVerCreatedByGroupID, OTWFManagerID, OTWFMapManagerID,
OTWFMapTaskPerformerID, OTWFMapTaskSubMapID, OTWFSubWorkMapID,
OTWFTaskPerformerID

// Set to false to disable removing empty regions


RemoveEmptyRegionsOnStartup=true

// Set to true to enable compression of Checkpoint files on disk


UseCompressedCheckpoints=false

// Two techniques for converting RAM/DISK fields for text metadata


// 0 is faster but uses more RAM. 1 is slower, less memory
MetadataConversionOptions=0

// Force IE to checkpoint if regions change at startup


// For instance, remove, merge, rename
WantCheckpointAfterFieldDefChanges=true

// On startup, if index has regions with Null characters.


// .. this is BAD DATA – a repair feature.
RemoveRegionsWithNull=false

// Relevance tuning
ExpressionWeight=100
ObjectRankRanker=
ExtraWeightFieldRankers=
DateFieldRankers=
TypeFieldRankers=
DefaultMetadataFieldNamesCSL=
// Set true for minor query performance boost on older CS instances
ConvertREtoRelevancy=false

// Maximum number of operators before a huge query should be


// broken into chunks and glued back together.
// Slower, but handles extreme cases
MaximumNumberOfBinaryOperators=15000

// Field length included in relevance for metadata on disk


// 2015-09 for scanning operators (regex, *, etc.).
// To reset to old form (less RAM, faster), set false
MODScanningLengthNorm=true

// For backwards compatibility. New apps should set this false


// Affects how the OTScore is computed in some edge cases
UseOldIntScores=true

// New stemming is faster, but less comprehensive


UseOldStem=false

// If updates and queries are contending, how many queries


// should be serviced before allowing an update
MaxSearchesWhileUpdateBlocked=0

// If updates and queries are contending, time before retry


RWGateRetryIntervalInMS=1000

// Optional logging to check for potential IO resource leaks


LogHighestNumberOfIOBuffers=false

// Default (false) is faster; set true to force more frequent logging output.


SyncIndexEngineLogEveryCommit=false

// Smaller values provide faster text operations but use more RAM


TextIndexSynchronizationPointGap=1000

// By default, relative comparison in full text is not allowed

// (although it is still allowed in Metadata regions)


AllowFullTextComparison=false

// Optimization that groups local updates for ModifyByQuery and DeleteByQuery


GroupLocalUpdates=true

// Restrict automatic timestamping to a list of


// Content Server object types (numeric OTSubType)
IndexTimestampOnlyCSL=

// For multivalue metadata with attributes, which attribute


// should have precedence for sorting. “Language” used since
// multilingual metadata is the primary user of this feature.
SystemDefaultSortLanguage=

// Orderby collation default is locale-sensitive


OrderedbyRegionOld=false

// Names of the DiskRet and Field Alias sections used by this partition
DiskRetSection=DISK_RET
FieldAliasSection=FAS_label

// Determines whether RMI or sockets are used within GRID


// New / better is sockets
GridConnectionType=rmi | direct
// Policy file – new location when sockets in use. Replaces the
// RMIPolicyFile from General section
PolicyFile=<path to otsearch.policy>
// Timeout for socket connections only
IEUpdateTimeoutinMS=120000
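
// Example: a grid converted to direct sockets (path is illustrative)
GridConnectionType=direct
PolicyFile=C:/OTHOME/config/otsearch.policy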

// name the search agent sections


SA0=label

// Configure email domain search


EmailDomainSourcesCSL=
EmailDomainFieldSuffix=_OTDomain
MaxNumberEmailDomains=50
EmailDomainSeparators=[,:;<>\\[\\]\\(\\)\\s]
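
// Example: with EmailDomainSourcesCSL=OTEmailSenderAddress, indexing
// jane@example.com would populate a searchable region named
// OTEmailSenderAddress_OTDomain with the value example.com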

// Optional – append lines to LLFieldDefinitions.txt


ExtraLLFieldDefinitionsLine0=CHAIN MyID UserID TwitterID FacebookID
ExtraLLFieldDefinitionsLine1=LONG OTBigNumber

// Index Engine Bloom Filter configuration


LogPeriodOfDataIdQueries=1000
NumBitsInDataIdBloomFilter=67108864
NumDataIdHashFunctions=3

AutoAdjustDataIdBloomFilterSize=true
AutoAdjustDataIdBloomFilterMinAddsBetweenRebuilds=1048576
DisableDataIdPhraseOpt=false

// Tuning for the TEXT query operator


TextCutOff=0.33
TextAllowTopwordsBuild=true
TextNumberOfWordsInSet=15
TextUseTermSet=true
TextPercentage=80

// Define large object partition size threshold


ObjectSizeThresholdInBytes=1000000

// Compress text content to Index Engines


CompressContentInLocalUpdate=false
CompressContentInLocalUpdateThresholdInBytes=65535

// TimeStamp region used for search agent scheduling


AgentTimestampField=OTObjectUpdateTime

// Do not use. Sets unit test conditions


BlockBackupIfThisFileExists
BlockStartTransactionIfThisFileExists

Update Distributor Section


Each Update Distributor requires an instance of this section. The name of the
Update Distributor is appended to the section name after the underscore.

[UpdateDistributor_]
// RMIServerPort not needed for direct socket connection mode
RMIServerPort=

AdminPort=
AllowRebalancingOfNoAddPartitions=false
IEUpdateTimeoutMilliSecs=3600000
MaxItemsInUpdateBatch=100
MaxBatchesPerIETransaction=1000
MaxBatchSizeInBytes=20000000
ReadOnlyConvertionBatchSize=1

// Retry and total wait time talking to UD, direct socket mode
WaitForTransactionMS=10000
MaxWaitForTransactionMS=600000

// for direct (non RMI) how often / long to try connecting to IE


ConnectionAttempts=5
ConnectionDelayBetweenAttemptsInMS=1000

// P0, P1, etc.


P#=partitionName

// ID is the path where inbound IPools reside, and the ReadArea


// is an integer for the folder number
IPoolId=
IPoolReadArea=
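
// Example (illustrative path and names): two partitions, reading
// IPools from folder 2
P0=Enterprise1
P1=Enterprise2
IPoolId=C:/OTHOME/index/enterprise/index/update
IPoolReadArea=2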

// Number of partitions that can merge even if disk percent full


NumOfMergeTokens=0

// Number of partitions to which new objects should be added


NumActivePartitions=0

// Number of partitions allowed to write checkpoints at same time


MaximumParallelCheckpoints=0

// logging
LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true
Logfile=<SectionName>.log
RequestsPerLogFlush=1

// index backup configuration


BackupParentDir=c:/temp/backups
MaximumParallelBackups=4
BackupLabelPrefix=MyLabel
ControlDirectory=
KeepOldControlFiles=false

Index Engine Section


This section is used primarily by the Index Engines. The Index Engine name must be
added to the section name after the underscore.

[IndexEngine_]
AdminPort=
IndexDirectory=

// RMI settings not needed if using sockets between UD and IE


RMIServerPort=
RMIUpdateDistributorURL=

// For direct (non RMI) a timeout between connection and first command
IEConnectionTimeoutInMS=10000

// Used in the backup/restore process


IndexName=Livelink

// Metadata Integrity Checksums


MetadataIntegrityMode=off | on | idle
MetadataIntegrityBatchSize=100
MetadataIntegrityBatchIntervalinMS=2000
TestMetadataIntegrityonDisk=true

// Log file configuration


LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true
Logfile=<SectionName>.log
RequestsPerLogFlush=1

// Transaction log files


TransactionLogFile=
TransactionLogRequired=false

// Level 3 Index Verify option


MaxVerifyIndexMODExceptions=300

Search Federator Section


This section is consumed by the Search Federators. The name of the Search
Federator must follow the underscore character in the section name.

[SearchFederator_]
RMIServerPort=
AdminPort=
SearchPort=8500

// SE0, SE1, etc. Required for each SE attached.


// For RMI, this is RMIRegistry location (same for each)
// For sockets, this is Search Engine location
SE#=//host:port/searchEngineName
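
// Example (illustrative hosts, ports and names)
SE0=//searchhost1:8503/SE_Enterprise1
SE1=//searchhost2:8504/SE_Enterprise2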

// For sockets (not RMI), SE connections: 5 retries 1 second apart


ConnectionAttempts=5
ConnectionDelayBetweenAttempts=1000

// Worker threads and queue size determine how many queries


// can be active and waiting. Larger values consume more
// system resources.
WorkerThreads=10
QueueSize=25

// Low priority search queue, disabled by default


LowPrioritySearchPort=-1
LowPriorityWorkerThreads=2
LowPriorityQueueSize=25

// Test setting, forcing long search queries


MinSearchTimeInMS=0
// Test setting, forcing random long holds on a read lock
RandomMaxReaderDelayInMS=0

// Chunks are results the SF asks SE to fetch


// Larger chunks are slower. Very small chunks add overhead
MergeSortChunkSize=50

// Cache threshold determines how aggressively SF cache is filled


// by the SEs. 0 is most aggressive, best in most cases.
MergeSortCacheThreshold=0

// Timeouts for closing application connections that are idle


// Use 0 to disable feature
FirstCommandReadTimeoutInMS=10000
SubsequentCommandReadTimeoutInMS=120000

// Query suspension time to prevent index throttling


BlockNewSearchesAfterTimeInMS=0
PauseTimeForIndexUpdatingInM=30000

// Removing duplicates slows search queries. More than about


// 1 million values to dedupe uses a LOT of RAM.
RemoveDuplicatesIDs=false
MaximumDuplicatesToRemove=1000000

// Optimistic – if an SE dies, restart it.


// Pessimistic – glass is half empty
ErrorRecovery=optimistic | pessimistic

// Log file management

LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true
Logfile=<SectionName>.log
RequestsPerLogFlush=1

// Search Result Cache time trigger and temp directory


SearchResultCacheDirectory=G:\cache
TimeBeforeCachingResultsInMS=300000

Search Engine Section


The configuration settings in this section are consumed by the Search Engines. The
partition name must follow the underscore character in the section name.

[SearchEngine_]
AdminPort=
IndexDirectory=

// Used in backup processes


IndexName=Livelink

// Log file configuration


LogSizeLimitInMBytes=25
MaxLogFiles=25
MaxStartupLogFiles=10
DebugLevel=0
CreationStatus=0
IncludeConfigurationFilesInLogs=true
Logfile=<SectionName>.log
RequestsPerLogFlush=1

// CS10 Update 4 and later can use direct instead of RMI


ConnectionType=rmi | direct

// If using direct sockets, RMI settings not needed


RMISearchFederatorServerName=
RMIServerPort=

// If using direct sockets need this for each SE


ServerPort=<search_engine_port>

// Disk tuning values that you should leave alone unless you
// are having disk problems. Use cautiously.
UseSystemIOBuffers=true

MaximumNumberCachedIOBuffers=100
SizeInBytesIOBuffers=4096

DiskRet Section
This section is present to allow use of DISK_RET storage mode in older systems
where Content Server does not support DISK_RET configuration in the search
administration pages. Normally, it should only be present in a search.ini_override file.
CS10 Update 3 and later put this information into the FieldModeDefinitions.ini file instead.

[DiskRetSection]
RegionsOnReadWritePartitions=
RegionsOnNoAddPartitions=
RegionsOnReadOnlyPartitions=
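
// Example: keep an (illustrative) large text region on disk for
// read-write partitions
RegionsOnReadWritePartitions=MyLargeTextRegion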

Search Agent Section


Search Agents are queries that are run on objects as they are being indexed. Within
Content Server, these are typically from Intelligent Classification or from Prospectors.
The search agent name must follow the underscore character in the section name.

[SearchAgent_]
operation=OTProspector | OTClassify

// IPool is the path, and readArea is a number which represents a


// folder within the ipool path where the results are stored.
readArea=
readIpool=

// The queries to be applied are contained in this path/file


queryFile=
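
// Example of a complete (hypothetical) Prospector agent section;
// all paths and names are illustrative
[SearchAgent_MyProspector]
operation=OTProspector
readIpool=C:/OTHOME/index/enterprise/agents
readArea=1
queryFile=C:/OTHOME/config/prospector_queries.txt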

Field Alias Section


This area defines a list of Content Server field names that are mapped to OTSE
region names at query time and during indexing. The label referenced by the
FieldAliasSection setting must be appended to the FAS section name after the
underscore character.

[FAS_label]
From=to
// example
Author=OTUserName

Index Maker Section


This section defines a number of internal values used to configure how a full text
search index is constructed and interpreted. It is included here for completeness.
DO NOT CHANGE these settings unless you have strong reasons for doing so and
understand exactly what you are doing. In general, this section is not present in a
search.ini file, and the default values are used.

[IndexMaker]
ObjectSkip=32
ObjectUseRLE=true
ObjectUseNyble=true
OffsetSkip=16
OffsetUseRLE=true
OffsetUseNyble=true
SmallestIndexIndexSizeInBytes=1048576
IndexingPartitionFactor=256

Reloadable Settings
A subset of the search.ini settings can be applied to search processes that are
already running. This feature is triggered using the “reloadSettings” command over
the admin API port. The search.ini settings applied at reload are:

Common Values
These values are reloadable in the Update Distributor, Index Engines, Search
Federator and Search Engines.

Logfile
RequestsPerLogFlush
CreationStatus
DebugLevel
LogSizeLimitInMBytes
MaxLogFiles
MaxStartupLogFiles
IncludeConfigurationFilesInLogs
NumberOfFileRecoveryAttempts
LargeObjectPartition
ObjectSizeThresholdInBytes
BlockBackupIfThisFileExists
BlockStartTransactionIfThisFileExists

If using RMI…

RMIRegistryPort
RMIPolicyFile
RMICodebase
AdminServerHostName

If not using RMI…

PolicyFile

Search Engines
DefaultMetadataFieldNamesCSL
DefragmentMemoryOptions
DefragmentSpaceInMBytes
DefragmentDailyTimes
DefragmentMaxStaggerInMinutes
DefragmentStaggerSeedToAppend
SkipMetadataSetOfEqualValues
MetadataConversionOptions
ExpressionWeight
ObjectRankRanker
ExtraWeightFieldRankers
DateFieldRankers
TypeFieldRankers
UseOldStem
HitLocationRestrictionFields
FieldAliasSection
DefaultMetadataAttributeFieldNames
SystemDefaultSortLanguage
SortingSequences
PrecomputeFacetsCSL
MaximumNumberOfCachedFacets
DesiredNumberOfCachedFacets
TextNumberOfWordsInSet
TextUseTermSet
TextPercentage

Update Distributor
MaxItemsInUpdateBatch
MaxBatchSizeInBytes
MaxBatchesPerIETransaction
NumOfMergeTokens
RunAgentIntervalInMS

** The list of partitions is also reloaded from the section names in the Update
Distributor, allowing partitions to be added without restarts.
Although Search Agent definitions are not included in this list, changes to the Search
Agents do not require a restart. Search Agents use another mechanism for updates;
refer to the section on Search Agents for details.

Tokenizer Mapping
Earlier in this document, the Tokenizer section references various character
mappings. For reference, a detailed list of character mappings performed by the
tokenizer is included below. If a character is not included in this table, it is not
mapped – it is added to the index as itself.
The leftmost character in each row (and its hexadecimal Unicode value) represents
the output character(s) of the mapping. The remaining values following the colon
represent a list of source characters that are mapped to that output character. Each
of these source characters in the list is separated by a comma, with Unicode values
in parentheses.
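For example, the first row below indicates that “A” (41) and “À” (c0), among other
characters, are indexed as “a” (61), so a query for “a” and a document containing “À”
resolve to the same indexed token.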

a (61): A (41), À (c0), Á (c1), Â (c2), Ã (c3), Å (c5), à (e0), á (e1), â (e2), ã (e3), å (e5), Ā (100), ā (101), Ă (102), ă (103), Ą (104), A (ff21), a (ff41)
b (62): B (42), B (ff22), b (ff42)
c (63): C (43), Ç (c7), ç (e7), ą (105), Ć (106), ć (107), Ĉ (108), ĉ
(109), Ċ (10a), ċ (10b), Č (10c), č (10d), C (ff23), c
(ff43)
d (64): D (44), Ď (10e), ď (10f), Đ (110), đ (111), D (ff24), d
(ff44)
e (65): E (45), È (c8), É (c9), Ê (ca), Ë (cb), è (e8), é (e9), ê
(ea), ë (eb), Ē (112), ē (113), Ĕ (114), ĕ (115), Ė (116), ė
(117), Ę (118), ę (119), Ě (11a), ě (11b), E (ff25), e
(ff45)
f (66): F (46), F (ff26), f (ff46)
g (67): G (47), Ĝ (11c), ĝ (11d), Ğ (11e), ğ (11f), Ġ (120), ġ (121),
Ģ (122), ģ (123), G (ff27), g (ff47)
h (68): H (48), Ĥ (124), ĥ (125), Ħ (126), ħ (127), H (ff28), h
(ff48)
i (69): I (49), Ì (cc), Í (cd), Î (ce), Ï (cf), ì (ec), í (ed), î
(ee), ï (ef), Ĩ (128), ĩ (129), Ī (12a), ī (12b), Ĭ (12c), ĭ
(12d), Į (12e), į (12f), İ (130), ı (131), I (ff29), i
(ff49)
j (6a): J (4a), Ĵ (134), ĵ (135), J (ff2a), j (ff4a)
k (6b): K (4b), Ķ (136), ķ (137), K (ff2b), k (ff4b)
l (6c): L (4c), ĸ (138), Ĺ (139), ĺ (13a), Ļ (13b), ļ (13c), Ľ (13d),
ľ (13e), Ł (141), ł (142), L (ff2c), l (ff4c)
m (6d): M (4d), M (ff2d), m (ff4d)
n (6e): N (4e), Ñ (d1), ñ (f1), Ń (143), ń (144), Ņ (145), ņ (146), Ň
(147), ň (148), Ŋ (14a), ŋ (14b), N (ff2e), n (ff4e)
o (6f): O (4f), Ò (d2), Ó (d3), Ô (d4), Õ (d5), Ø (d8), ò (f2), ó
(f3), ô (f4), õ (f5), ø (f8), Ō (14c), ō (14d), Ő (150), ő
(151), O (ff2f), o (ff4f)
p (70): P (50), P (ff30), p (ff50)
q (71): Q (51), Q (ff31), q (ff51)
r (72): R (52), Ŕ (154), ŕ (155), Ŗ (156), ŗ (157), Ř (158), ř (159),
R (ff32), r (ff52)
s (73): S (53), Ś (15a), ś (15b), Ŝ (15c), ŝ (15d), Ş (15e), ş (15f),
Š (160), š (161), S (ff33), s (ff53)
t (74): T (54), Ţ (162), ţ (163), Ť (164), ť (165), Ŧ (166), ŧ (167),
T (ff34), t (ff54)
u (75): U (55), Ù (d9), Ú (da), Û (db), ù (f9), ú (fa), û (fb), Ũ (168), ũ (169), Ū (16a), ū (16b), Ŭ (16c), ŭ (16d), Ů (16e), ů (16f), Ű (170), ű (171), Ų (172), ų (173), U (ff35), u (ff55)
v (76): V (56), V (ff36), v (ff56)
w (77): W (57), W (ff37), w (ff57)
x (78): X (58), X (ff38), x (ff58)
y (79): Y (59), Ý (dd), ý (fd), ÿ (ff), Y (ff39), y (ff59)
z (7a): Z (5a), Ź (179), ź (17a), Ż (17b), ż (17c), Ž (17d), ž (17e),
Z (ff3a), z (ff5a)
ae (00650061): Ä (c4), Æ (c6), ä (e4), æ (e6)
ð (f0): Ð (d0)
oe (0065006f): Ö (d6), ö (f6), Œ (152), œ (153)
ue (00650075): Ü (dc), ü (fc)
ss (00730073): ß (df)
ij (006a0069): IJ (132), ij (133)
ά (3ac): Ά (386)
έ (3ad): Έ (388)
ή (3ae): Ή (389)
ί (3af): Ί (38a)
ό (3cc): Ό (38c)
ύ (3cd): Ύ (38e)
ώ (3ce): Ώ (38f)
α (3b1): Α (391)
β (3b2): Β (392)
γ (3b3): Γ (393)
δ (3b4): Δ (394)
ε (3b5): Ε (395)
ζ (3b6): Ζ (396)
η (3b7): Η (397)
θ (3b8): Θ (398)
ι (3b9): Ι (399)
κ (3ba): Κ (39a)
λ (3bb): Λ (39b)
μ (3bc): Μ (39c)
ν (3bd): Ν (39d)
ξ (3be): Ξ (39e)
ο (3bf): Ο (39f)
π (3c0): Π (3a0)
ρ (3c1): Ρ (3a1)
σ (3c3): Σ (3a3)
τ (3c4): Τ (3a4)
υ (3c5): Υ (3a5)
φ (3c6): Φ (3a6)
χ (3c7): Χ (3a7)
ψ (3c8): Ψ (3a8)
ω (3c9): Ω (3a9)
ϊ (3ca): Ϊ (3aa)
ϋ (3cb): Ϋ (3ab)
ѐ (450): Ѐ (400)
ё (451): Ё (401)
ђ (452): Ђ (402)
ѓ (453): Ѓ (403)
є (454): Є (404)
ѕ (455): Ѕ (405)
і (456): І (406)
ї (457): Ї (407)
ј (458): Ј (408)
љ (459): Љ (409)
њ (45a): Њ (40a)
ћ (45b): Ћ (40b)
ќ (45c): Ќ (40c)

ѝ (45d): Ѝ (40d)
ў (45e): Ў (40e)
џ (45f): Џ (40f)
а (430): А (410)
б (431): Б (411)
в (432): В (412)
г (433): Г (413)
д (434): Д (414)
е (435): Е (415)
ж (436): Ж (416)
з (437): З (417)
и (438): И (418)
й (439): Й (419)
к (43a): К (41a)
л (43b): Л (41b)
м (43c): М (41c)
н (43d): Н (41d)
о (43e): О (41e)
п (43f): П (41f)
р (440): Р (420)
с (441): С (421)
т (442): Т (422)
у (443): У (423)
ф (444): Ф (424)
х (445): Х (425)
ц (446): Ц (426)
ч (447): Ч (427)
ш (448): Ш (428)
щ (449): Щ (429)
ъ (44a): Ъ (42a)
ы (44b): Ы (42b)
ь (44c): Ь (42c)
э (44d): Э (42d)
ю (44e): Ю (42e)
я (44f): Я (42f)
‫ا‬ Arabic
(627): ‫ ﴼ‬,(675) ‫ٵ‬ ,(625) ‫ إ‬,(623) ‫ أ‬,(622) ‫( آ‬fd3c), ‫ﴽ‬
(fd3d), ‫( ﹵‬fe75), ‫( ﺁ‬fe81), ‫( ﺂ‬fe82), ‫( ﺃ‬fe83), ‫( ﺄ‬fe84), ‫ﺇ‬
(fe87), ‫( ﺈ‬fe88), ‫( ﺍ‬fe8d), ‫( ﺎ‬fe8e)
‫ و‬Arabic (648): ‫ ؤ‬,(676) ‫ٶ‬ ,(624) ‫( ؤ‬fe85), ‫( ﺆ‬fe86), ‫( و‬feed), ‫ﻮ‬
(feee)
‫ ي‬Arabic (64a): ‫ ﯨ‬,(678) ‫ ٸ‬,(649) ‫ ى‬,(626) ‫( ئ‬fbe8), ‫( ﯩ‬fbe9), ‫ﱝ‬
(fc5d), ‫( ﲐ‬fc90), ‫( ﺉ‬fe89), ‫( ﺊ‬fe8a), ‫( ﺋ‬fe8b), ‫( ﺌ‬fe8c), ‫ﻯ‬
(feef), ‫( ﻰ‬fef0), ‫( ﻱ‬fef1), ‫( ﻲ‬fef2), ‫( ﻳ‬fef3), ‫( ﻴ‬fef4)
‫ ه‬Arabic (647): ‫ ﳙ‬,(629) ‫( ة‬fcd9), ‫( ﺓ‬fe93), ‫( ﺔ‬fe94), ‫( ﻩ‬fee9), ‫ﻪ‬
(feea), ‫( ﻫ‬feeb), ‫( ﻬ‬feec)
0 (30): ۰ (660), ۰ (6f0), 0 (ff10)
1 (31): ۱ (661), ۱ (6f1), 1 (ff11)
2 (32): ۲ (662), ۲ (6f2), 2 (ff12)
3 (33): ۳ (663), ۳ (6f3), 3 (ff13)
4 (34): ٤ (664), ۴ (6f4), 4 (ff14)
5 (35): ٥ (665), ۵ (6f5), 5 (ff15)
6 (36): ٦ (666), ۶ (6f6), 6 (ff16)
7 (37): ۷ (667), ۷ (6f7), 7 (ff17)
8 (38): ۸ (668), ۸ (6f8), 8 (ff18)
9 (39): ۹ (669), ۹ (6f9), 9 (ff19)
‫ ۇ‬Arabic (6c7): ‫ ﯇‬,(677) ‫( ٷ‬fbc7), ‫( ﯗ‬fbd7), ‫( ﯘ‬fbd8), ‫( ﯝ‬fbdd)
‫ ە‬Arabic (6d5): ‫( ۀ‬6c0), ‫( ۀ‬fba4), ‫( ﮥ‬fba5), ‫( ﯀‬fbc0)

‫ ه‬Arabic (6c1): ‫( ۀ‬6c2), ‫( ﮦ‬fba6), ‫( ہ‬fba7), ‫( ﮨ‬fba8), ‫( ﮩ‬fba9), ‫﯁‬
(fbc1), ‫( ﯂‬fbc2)
‫ ے‬Arabic (6d2): ‫( ۓ‬6d3), ‫( ے‬fbae), ‫( ﮯ‬fbaf), ‫( ۓ‬fbb0), ‫( ﮱ‬fbb1), ‫﯒‬
(fbd2)
- (2d): − (2212), - (ff0d)
ゟ (309f): ゜ (309c)
ア (30a2): ァ (30a1), ァ (ff67), ア (ff71)
イ (30a4): ィ (30a3), ィ (ff68), イ (ff72)
ウ (30a6): ゥ (30a5), ゥ (ff69), ウ (ff73)
エ (30a8): ェ (30a7), ェ (ff6a), エ (ff74)
オ (30aa): ォ (30a9), ォ (ff6b), オ (ff75)
ツ (30c4): ッ (30c3), ッ (ff6f), ツ (ff82)
ヤ (30e4): ャ (30e3), ャ (ff6c), ヤ (ff94)
ユ (30e6): ュ (30e5), ュ (ff6d), ユ (ff95)
ヨ (30e8): ョ (30e7), ョ (ff6e), ヨ (ff96)
ワ (30ef): ヮ (30ee), ワ (ff9c)
カ (30ab): ヵ (30f5), カ (ff76)
ケ (30b1): ヶ (30f6), ケ (ff79)
‫ ٱ‬Arabic (671): ‫( ٱ‬fb50), ‫( ﭑ‬fb51)
‫ ٻ‬Arabic (67b): ‫( ٻ‬fb52), ‫( ﭓ‬fb53), ‫( ﭔ‬fb54), ‫( ﭕ‬fb55)
‫ پ‬Arabic (67e): ‫( پ‬fb56), ‫( ﭗ‬fb57), ‫( ﭘ‬fb58), ‫( ﭙ‬fb59)
‫ ڀ‬Arabic (680): ‫( ڀ‬fb5a), ‫( ﭛ‬fb5b), ‫( ﭜ‬fb5c), ‫( ﭝ‬fb5d)
‫ ٺ‬Arabic (67a): ‫( ٺ‬fb5e), ‫( ﭟ‬fb5f), ‫( ﭠ‬fb60), ‫( ﭡ‬fb61)
‫ ٿ‬Arabic (67f): ‫( ٿ‬fb62), ‫( ﭣ‬fb63), ‫( ﭤ‬fb64), ‫( ﭥ‬fb65)
‫ ٹ‬Arabic (679): ‫( ٹ‬fb66), ‫( ﭧ‬fb67), ‫( ﭨ‬fb68), ‫( ﭩ‬fb69)
‫ ڤ‬Arabic (6a4): ‫( ڤ‬fb6a), ‫( ﭫ‬fb6b), ‫( ﭬ‬fb6c), ‫( ﭭ‬fb6d)
‫ ڦ‬Arabic (6a6): ‫( ڦ‬fb6e), ‫( ﭯ‬fb6f), ‫( ﭰ‬fb70), ‫( ﭱ‬fb71)
‫ ڄ‬Arabic (684): ‫( ڄ‬fb72), ‫( ﭳ‬fb73), ‫( ﭴ‬fb74), ‫( ﭵ‬fb75)
‫ ڃ‬Arabic (683): ‫( ڃ‬fb76), ‫( ﭷ‬fb77), ‫( ﭸ‬fb78), ‫( ﭹ‬fb79)
‫ چ‬Arabic (686): ‫( چ‬fb7a), ‫( ﭻ‬fb7b), ‫( ﭼ‬fb7c), ‫( ﭽ‬fb7d)
‫ ڇ‬Arabic (687): ‫( ڇ‬fb7e), ‫( ﭿ‬fb7f), ‫( ﮀ‬fb80), ‫( ﮁ‬fb81)
‫ ڍ‬Arabic (68d): ‫( ﮂ‬fb82), ‫( ﮃ‬fb83)
‫ ڌ‬Arabic (68c): ‫( ڌ‬fb84), ‫( ﮅ‬fb85)
‫ ڎ‬Arabic (68e): ‫( ڎ‬fb86), ‫( ﮇ‬fb87)
‫ ڈ‬Arabic (688): ‫( ڈ‬fb88), ‫( ﮉ‬fb89)
‫ ژ‬Arabic (698): ‫( ژ‬fb8a), ‫( ﮋ‬fb8b)
‫ ڑ‬Arabic (691): ‫( ڑ‬fb8c), ‫( ﮍ‬fb8d)
‫ ک‬Arabic (6a9): ‫( ک‬fb8e), ‫( ﮏ‬fb8f), ‫( ﮐ‬fb90), ‫( ﮑ‬fb91)
‫ گ‬Arabic (6af): ‫( ﮒ‬fb92), ‫( ﮓ‬fb93), ‫( ﮔ‬fb94), ‫( ﮕ‬fb95)
‫ ڳ‬Arabic (6b3): ‫( ڳ‬fb96), ‫( ﮗ‬fb97), ‫( ﮘ‬fb98), ‫( ﮙ‬fb99), ‫( ﮳‬fbb3)
‫ ڱ‬Arabic (6b1): ‫( ڱ‬fb9a), ‫( ﮛ‬fb9b), ‫( ﮜ‬fb9c), ‫( ﮝ‬fb9d)
‫ ں‬Arabic (6ba): ‫( ں‬fb9e), ‫( ﮟ‬fb9f), ‫( ﮺‬fbba)
‫ ڻ‬Arabic (6bb): ‫( ڻ‬fba0), ‫( ﮡ‬fba1), ‫( ﮢ‬fba2), ‫( ﮣ‬fba3), ‫( ﮻‬fbbb)
‫ ھ‬Arabic (6be): ‫( ھ‬fbaa), ‫( ﮫ‬fbab), ‫( ھ‬fbac), ‫( ﮭ‬fbad), ‫( ﮾‬fbbe)
‫ ڲ‬Arabic (6b2): ‫( ﮲‬fbb2)
‫ ڴ‬Arabic (6b4): ‫( ﮴‬fbb4)
‫ ڵ‬Arabic (6b5): ‫( ﮵‬fbb5)
‫ ڶ‬Arabic (6b6): ‫( ﮶‬fbb6)
‫ ڷ‬Arabic (6b7): ‫( ﮷‬fbb7)
‫ ڸ‬Arabic (6b8): ‫( ﮸‬fbb8)
‫ ڹ‬Arabic (6b9): ‫( ﮹‬fbb9)
‫ ڼ‬Arabic (6bc): ‫( ﮼‬fbbc)
‫ ڽ‬Arabic (6bd): ‫( ﮽‬fbbd)
‫ ڿ‬Arabic (6bf): ‫( ﮿‬fbbf)
‫ ة‬Arabic (6c3): ‫( ﯃‬fbc3)
‫ ۄ‬Arabic (6c4): ‫( ﯄‬fbc4)
‫ ۅ‬Arabic (6c5): ‫( ﯅‬fbc5), ‫( ﯠ‬fbe0), ‫( ﯡ‬fbe1)
‫ ۆ‬Arabic (6c6): ‫( ﯆‬fbc6), ‫( ﯙ‬fbd9), ‫( ﯚ‬fbda)

‫ۈ‬ Arabic (6c8): ‫﯈‬ (fbc8), ‫( ﯛ‬fbdb), ‫( ﯜ‬fbdc)
‫ۉ‬ Arabic (6c9): ‫﯉‬ (fbc9), ‫( ﯢ‬fbe2), ‫( ﯣ‬fbe3)
‫ۊ‬ Arabic (6ca): ‫﯊‬ (fbca)
‫ۋ‬ Arabic (6cb): ‫﯋‬ (fbcb), ‫( ﯞ‬fbde), ‫( ﯟ‬fbdf)
‫ی‬ Arabic (6cc): ‫﯌‬ (fbcc), ‫( ﯼ‬fbfc), ‫( ﯽ‬fbfd), ‫( ﯾ‬fbfe), ‫( ﯿ‬fbff)
‫ۍ‬ Arabic (6cd): ‫﯍‬ (fbcd)
‫ێ‬ Arabic (6ce): ‫﯎‬ (fbce)
‫ۏ‬ Arabic (6cf): ‫﯏‬ (fbcf)
‫ې‬ Arabic (6d0): ‫﯐‬ (fbd0), ‫( ﯤ‬fbe4), ‫( ﯥ‬fbe5), ‫( ﯦ‬fbe6), ‫( ﯧ‬fbe7)
‫ۑ‬ Arabic (6d1): ‫﯑‬ (fbd1)
‫ڭ‬ Arabic (6ad): ‫ڭ‬ (fbd3), ‫( ﯔ‬fbd4), ‫( ﯕ‬fbd5), ‫( ﯖ‬fbd6)
‫ ﯾﺎ‬Arabic (0627064a): ‫( ﯪ‬fbea), ‫( ﯫ‬fbeb)
‫ ﯾە‬Arabic (06d5064a): ‫( ﯬ‬fbec), ‫( ﯭ‬fbed)
‫ ﯾﻮ‬Arabic (0648064a): ‫( ﯮ‬fbee), ‫( ﯯ‬fbef)
‫ ﯾﯘ‬Arabic (06c7064a): ‫( ﯰ‬fbf0), ‫( ﯱ‬fbf1)
‫ ﯾﯚ‬Arabic (06c6064a): ‫( ﯲ‬fbf2), ‫( ﯳ‬fbf3)
‫ ﯾﯜ‬Arabic (06c8064a): ‫( ﯴ‬fbf4), ‫( ﯵ‬fbf5)
‫ ﯾﯥ‬Arabic (06d0064a): ‫( ﯶ‬fbf6), ‫( ﯷ‬fbf7), ‫( ﯸ‬fbf8)
‫ ﯾﻲ‬Arabic (064a064a): ‫( ﯹ‬fbf9), ‫( ﯺ‬fbfa), ‫( ﯻ‬fbfb), ‫( ﰃ‬fc03), ‫ﰄ‬
(fc04), ‫( ﱙ‬fc59), ‫( ﱚ‬fc5a), ‫( ﱨ‬fc68), ‫( ﱩ‬fc69), ‫( ﲕ‬fc95),
‫( ﲖ‬fc96)
‫ ﯾﺞ‬Arabic (062c064a): ‫( ﰀ‬fc00), ‫( ﱕ‬fc55), ‫( ﲗ‬fc97), ‫( ﳚ‬fcda)
‫ ﯾﺢ‬Arabic (062d064a): ‫( ﰁ‬fc01), ‫( ﱖ‬fc56), ‫( ﲘ‬fc98), ‫( ﳛ‬fcdb)
‫ ﯾﻢ‬Arabic (0645064a): ‫( ﰂ‬fc02), ‫( ﱘ‬fc58), ‫( ﱦ‬fc66), ‫( ﲓ‬fc93), ‫ﲚ‬
(fc9a), ‫( ﳝ‬fcdd), ‫( ﳟ‬fcdf), ‫( ﳰ‬fcf0)
‫ ﺑﺞ‬Arabic (062c0628): ‫( ﰅ‬fc05), ‫( ﲜ‬fc9c)
‫ ﺑﺢ‬Arabic (062d0628): ‫( ﰆ‬fc06), ‫( ﲝ‬fc9d)
‫ ﺑﺦ‬Arabic (062e0628): ‫( ﰇ‬fc07), ‫( ﲞ‬fc9e)
‫ ﺑﻢ‬Arabic (06450628): ‫( ﰈ‬fc08), ‫( ﱬ‬fc6c), ‫( ﲟ‬fc9f), ‫( ﳡ‬fce1)
‫ ﺑﻲ‬Arabic (064a0628): ‫( ﰉ‬fc09), ‫( ﰊ‬fc0a), ‫( ﱮ‬fc6e), ‫( ﱯ‬fc6f)
‫ ﺗﺞ‬Arabic (062c062a): ‫( ﰋ‬fc0b), ‫( ﲡ‬fca1)
‫ ﺗﺢ‬Arabic (062d062a): ‫( ﰌ‬fc0c), ‫( ﲢ‬fca2)
‫ ﺗﺦ‬Arabic (062e062a): ‫( ﰍ‬fc0d), ‫( ﲣ‬fca3)
‫ ﺗﻢ‬Arabic (0645062a): ‫( ﰎ‬fc0e), ‫( ﱲ‬fc72), ‫( ﲤ‬fca4), ‫( ﳣ‬fce3)
‫ ﺗﻲ‬Arabic (064a062a): ‫( ﰏ‬fc0f), ‫( ﰐ‬fc10), ‫( ﱴ‬fc74), ‫( ﱵ‬fc75)
‫ ﺛﺞ‬Arabic (062c062b): ‫( ﰑ‬fc11)
‫ ﺛﻢ‬Arabic (0645062b): ‫( ﰒ‬fc12), ‫( ﱸ‬fc78), ‫( ﲦ‬fca6), ‫( ﳥ‬fce5)
‫ ﺛﻲ‬Arabic (064a062b): ‫( ﰓ‬fc13), ‫( ﰔ‬fc14), ‫( ﱺ‬fc7a), ‫( ﱻ‬fc7b)
‫ ﺟﺢ‬Arabic (062d062c): ‫( ﰕ‬fc15), ‫( ﲧ‬fca7)
‫ ﺟﻢ‬Arabic (0645062c): ‫( ﰖ‬fc16), ‫( ﲨ‬fca8)
‫ ﺣﺞ‬Arabic (062c062d): ‫( ﰗ‬fc17), ‫( ﲩ‬fca9)
‫ ﺣﻢ‬Arabic (0645062d): ‫( ﰘ‬fc18), ‫( ﲪ‬fcaa)
‫ ﺧﺞ‬Arabic (062c062e): ‫( ﰙ‬fc19), ‫( ﲫ‬fcab)

‫ ﺧﺢ‬Arabic (062d062e): ‫( ﰚ‬fc1a)
‫ ﺧﻢ‬Arabic (0645062e): ‫( ﰛ‬fc1b), ‫( ﲬ‬fcac)
‫ ﺳﺞ‬Arabic (062c0633): ‫( ﰜ‬fc1c), ‫( ﲭ‬fcad), ‫( ﴴ‬fd34)
‫ ﺳﺢ‬Arabic (062d0633): ‫( ﰝ‬fc1d), ‫( ﲮ‬fcae), ‫( ﴵ‬fd35)
‫ ﺳﺦ‬Arabic (062e0633): ‫( ﰞ‬fc1e), ‫( ﲯ‬fcaf), ‫( ﴶ‬fd36)
‫ ﺳﻢ‬Arabic (06450633): ‫( ﰟ‬fc1f), ‫( ﲰ‬fcb0), ‫( ﳧ‬fce7)
‫ ﺻﺢ‬Arabic (062d0635): ‫( ﰠ‬fc20), ‫( ﲱ‬fcb1)
‫ ﺻﻢ‬Arabic (06450635): ‫( ﰡ‬fc21), ‫( ﲳ‬fcb3)
‫ ﺿﺞ‬Arabic (062c0636): ‫( ﰢ‬fc22), ‫( ﲴ‬fcb4)
‫ ﺿﺢ‬Arabic (062d0636): ‫( ﰣ‬fc23), ‫( ﲵ‬fcb5)
‫ ﺿﺦ‬Arabic (062e0636): ‫( ﰤ‬fc24), ‫( ﲶ‬fcb6)
‫ ﺿﻢ‬Arabic (06450636): ‫( ﰥ‬fc25), ‫( ﲷ‬fcb7)
‫ ﻃﺢ‬Arabic (062d0637): ‫( ﰦ‬fc26), ‫( ﲸ‬fcb8)
‫ ﻃﻢ‬Arabic (06450637): ‫( ﰧ‬fc27), ‫( ﴳ‬fd33), ‫( ﴺ‬fd3a)
‫ ﻇﻢ‬Arabic (06450638): ‫( ﰨ‬fc28), ‫( ﲹ‬fcb9), ‫( ﴻ‬fd3b)
‫ ﻋﺞ‬Arabic (062c0639): ‫( ﰩ‬fc29), ‫( ﲺ‬fcba)
‫ ﻋﻢ‬Arabic (06450639): ‫( ﰪ‬fc2a), ‫( ﲻ‬fcbb)
‫ ﻏﺞ‬Arabic (062c063a): ‫( ﰫ‬fc2b), ‫( ﲼ‬fcbc)
‫ ﻏﻢ‬Arabic (0645063a): ‫( ﰬ‬fc2c), ‫( ﲽ‬fcbd)
‫ ﻓﺞ‬Arabic (062c0641): ‫( ﰭ‬fc2d), ‫( ﲾ‬fcbe)
‫ ﻓﺢ‬Arabic (062d0641): ‫( ﰮ‬fc2e), ‫( ﲿ‬fcbf)
‫ ﻓﺦ‬Arabic (062e0641): ‫( ﰯ‬fc2f), ‫( ﳀ‬fcc0)
‫ ﻓﻢ‬Arabic (06450641): ‫( ﰰ‬fc30), ‫( ﳁ‬fcc1)
‫ ﻓﻲ‬Arabic (064a0641): ‫( ﰱ‬fc31), ‫( ﰲ‬fc32), ‫( ﱼ‬fc7c), ‫( ﱽ‬fc7d)
‫ ﻗﺢ‬Arabic (062d0642): ‫( ﰳ‬fc33), ‫( ﳂ‬fcc2)
‫ ﻗﻢ‬Arabic (06450642): ‫( ﰴ‬fc34), ‫( ﳃ‬fcc3)
‫ ﻗﻲ‬Arabic (064a0642): ‫( ﰵ‬fc35), ‫( ﰶ‬fc36), ‫( ﱾ‬fc7e), ‫( ﱿ‬fc7f)
‫ ﻛﺎ‬Arabic (06270643): ‫( ﰷ‬fc37), ‫( ﲀ‬fc80)
‫ ﻛﺞ‬Arabic (062c0643): ‫( ﰸ‬fc38), ‫( ﳄ‬fcc4)
‫ ﻛﺢ‬Arabic (062d0643): ‫( ﰹ‬fc39), ‫( ﳅ‬fcc5)
‫ ﻛﺦ‬Arabic (062e0643): ‫( ﰺ‬fc3a), ‫( ﳆ‬fcc6)
‫ ﻛﻞ‬Arabic (06440643): ‫( ﰻ‬fc3b), ‫( ﲁ‬fc81), ‫( ﳇ‬fcc7), ‫( ﳫ‬fceb)
‫ ﻛﻢ‬Arabic (06450643): ‫( ﰼ‬fc3c), ‫( ﲂ‬fc82), ‫( ﳈ‬fcc8), ‫( ﳬ‬fcec)
‫ﻛﻲ‬ Arabic (064a0643): ‫( ﰽ‬fc3d), ‫( ﰾ‬fc3e), ‫( ﲃ‬fc83), ‫( ﲄ‬fc84)
‫ﻟﺞ‬ Arabic (062c0644): ‫( ﰿ‬fc3f), ‫( ﳉ‬fcc9)
‫ﻟﺢ‬ Arabic (062d0644): ‫( ﱀ‬fc40), ‫( ﳊ‬fcca)
‫ﻟﺦ‬ Arabic (062e0644): ‫( ﱁ‬fc41), ‫( ﳋ‬fccb)
‫ ﻟﻢ‬Arabic (06450644): ‫( ﱂ‬fc42), ‫( ﲅ‬fc85), ‫( ﳌ‬fccc), ‫( ﳭ‬fced)
‫ ﻟﻲ‬Arabic (064a0644): ‫( ﱃ‬fc43), ‫( ﱄ‬fc44), ‫( ﲆ‬fc86), ‫( ﲇ‬fc87)
‫ ﻣﺞ‬Arabic (062c0645): ‫( ﱅ‬fc45), ‫( ﳎ‬fcce)

‫ ﻣﺢ‬Arabic (062d0645): ‫( ﱆ‬fc46), ‫( ﳏ‬fccf)
‫ ﻣﺦ‬Arabic (062e0645): ‫( ﱇ‬fc47), ‫( ﳐ‬fcd0)
‫ ﻣﻢ‬Arabic (06450645): ‫( ﱈ‬fc48), ‫( ﲉ‬fc89), ‫( ﳑ‬fcd1)
‫ ﻣﻲ‬Arabic (064a0645): ‫( ﱉ‬fc49), ‫( ﱊ‬fc4a)
‫ ﻧﺞ‬Arabic (062c0646): ‫( ﱋ‬fc4b), ‫( ﳒ‬fcd2)
‫ ﻧﺢ‬Arabic (062d0646): ‫( ﱌ‬fc4c), ‫( ﳓ‬fcd3)
‫ ﻧﺦ‬Arabic (062e0646): ‫( ﱍ‬fc4d), ‫( ﳔ‬fcd4)
‫ ﻧﻢ‬Arabic (06450646): ‫( ﱎ‬fc4e), ‫( ﲌ‬fc8c), ‫( ﳕ‬fcd5), ‫( ﳮ‬fcee)
‫ ﻧﻲ‬Arabic (064a0646): ‫( ﱏ‬fc4f), ‫( ﱐ‬fc50), ‫( ﲎ‬fc8e), ‫( ﲏ‬fc8f)
‫ ھﺞ‬Arabic (062c0647): ‫( ﱑ‬fc51), ‫( ﳗ‬fcd7)
‫ ھﻢ‬Arabic (06450647): ‫( ﱒ‬fc52), ‫( ﳘ‬fcd8)
‫ ھﻲ‬Arabic (064a0647): ‫( ﱓ‬fc53), ‫( ﱔ‬fc54)
‫ ﯾﺦ‬Arabic (062e064a): ‫( ﱗ‬fc57), ‫( ﲙ‬fc99), ‫( ﳜ‬fcdc)
‫ ذ‬Arabic (630): ‫( ﱛ‬fc5b), ‫( ﺫ‬feab), ‫( ﺬ‬feac)
‫ ر‬Arabic (631): ‫( ﱜ‬fc5c), ‫( ﺭ‬fead), ‫( ﺮ‬feae)
‫ ﯾﺮ‬Arabic (0631064a): ‫( ﱤ‬fc64), ‫( ﲑ‬fc91)
‫ ﯾﺰ‬Arabic (0632064a): ‫( ﱥ‬fc65), ‫( ﲒ‬fc92)
‫ ﯾﻦ‬Arabic (0646064a): ‫( ﱧ‬fc67), ‫( ﲔ‬fc94)
‫ ﺑﺮ‬Arabic (06310628): ‫( ﱪ‬fc6a)
‫ ﺑﺰ‬Arabic (06320628): ‫( ﱫ‬fc6b)
‫ ﺑﻦ‬Arabic (06460628): ‫( ﱭ‬fc6d)
‫ ﺗﺮ‬Arabic (0631062a): ‫( ﱰ‬fc70)
‫ ﺗﺰ‬Arabic (0632062a): ‫( ﱱ‬fc71)
‫ ﺗﻦ‬Arabic (0646062a): ‫( ﱳ‬fc73)
‫ ﺛﺮ‬Arabic (0631062b): ‫( ﱶ‬fc76)
‫ ﺛﺰ‬Arabic (0632062b): ‫( ﱷ‬fc77)
‫ ﺛﻦ‬Arabic (0646062b): ‫( ﱹ‬fc79)
‫ ﻣﺎ‬Arabic (06270645): ‫( ﲈ‬fc88)
‫ ﻧﺮ‬Arabic (06310646): ‫( ﲊ‬fc8a)
‫ ﻧﺰ‬Arabic (06320646): ‫( ﲋ‬fc8b)
‫ ﻧﻦ‬Arabic (06460646): ‫( ﲍ‬fc8d)
‫ ﯾﮫ‬Arabic (0647064a): ‫( ﲛ‬fc9b), ‫( ﳞ‬fcde), ‫( ﳠ‬fce0), ‫( ﳱ‬fcf1)
‫ ﺑﮫ‬Arabic (06470628): ‫( ﲠ‬fca0), ‫( ﳢ‬fce2)
‫ ﺗﮫ‬Arabic (0647062a): ‫( ﲥ‬fca5), ‫( ﳤ‬fce4)
‫ ﺻﺦ‬Arabic (062e0635): ‫( ﲲ‬fcb2)
‫ ﻟﮫ‬Arabic (06470644): ‫( ﳍ‬fccd), ‫( ﷲ‬fdf2)
‫ ﻧﮫ‬Arabic (06470646): ‫( ﳖ‬fcd6), ‫( ﳯ‬fcef)
‫ ﺛﮫ‬Arabic (0647062b): ‫( ﳦ‬fce6)
‫ ﺳﮫ‬Arabic (06470633): ‫( ﳨ‬fce8), ‫( ﴱ‬fd31)
‫ ﺷﻢ‬Arabic (06450634): ‫( ﳩ‬fce9), ‫( ﴌ‬fd0c), ‫( ﴨ‬fd28), ‫( ﴰ‬fd30)
‫ ﺷﮫ‬Arabic (06470634): ‫( ﳪ‬fcea), ‫( ﴲ‬fd32)

‫ ﻃﻲ‬Arabic (064a0637): ‫( ﳵ‬fcf5), ‫( ﳶ‬fcf6), ‫( ﴑ‬fd11), ‫( ﴒ‬fd12)
‫ ﻋﻲ‬Arabic (064a0639): ‫( ﳷ‬fcf7), ‫( ﳸ‬fcf8), ‫( ﴓ‬fd13), ‫( ﴔ‬fd14)
‫ ﻏﻲ‬Arabic (064a063a): ‫( ﳹ‬fcf9), ‫( ﳺ‬fcfa), ‫( ﴕ‬fd15), ‫( ﴖ‬fd16)
‫ ﺳﻲ‬Arabic (064a0633): ‫( ﳻ‬fcfb), ‫( ﳼ‬fcfc), ‫( ﴗ‬fd17), ‫( ﴘ‬fd18)
‫ ﺷﻲ‬Arabic (064a0634): ‫( ﳽ‬fcfd), ‫( ﳾ‬fcfe), ‫( ﴙ‬fd19), ‫( ﴚ‬fd1a)
‫ ﺣﻲ‬Arabic (064a062d): ‫( ﳿ‬fcff), ‫( ﴀ‬fd00), ‫( ﴛ‬fd1b), ‫( ﴜ‬fd1c)
‫ ﺟﻲ‬Arabic (064a062c): ‫( ﴁ‬fd01), ‫( ﴂ‬fd02), ‫( ﴝ‬fd1d), ‫( ﴞ‬fd1e)
‫ ﺧﻲ‬Arabic (064a062e): ‫( ﴃ‬fd03), ‫( ﴄ‬fd04), ‫( ﴟ‬fd1f), ‫( ﴠ‬fd20)
‫ ﺻﻲ‬Arabic (064a0635): ‫( ﴅ‬fd05), ‫( ﴆ‬fd06), ‫( ﴡ‬fd21), ‫( ﴢ‬fd22)
‫ ﺿﻲ‬Arabic (064a0636): ‫( ﴇ‬fd07), ‫( ﴈ‬fd08), ‫( ﴣ‬fd23), ‫( ﴤ‬fd24)
‫ ﺷﺞ‬Arabic (062c0634): ‫( ﴉ‬fd09), ‫( ﴥ‬fd25), ‫( ﴭ‬fd2d), ‫( ﴷ‬fd37)
‫ ﺷﺢ‬Arabic (062d0634): ‫( ﴊ‬fd0a), ‫( ﴦ‬fd26), ‫( ﴮ‬fd2e), ‫( ﴸ‬fd38)
‫ ﺷﺦ‬Arabic (062e0634): ‫( ﴋ‬fd0b), ‫( ﴧ‬fd27), ‫( ﴯ‬fd2f), ‫( ﴹ‬fd39)
‫ ﺷﺮ‬Arabic (06310634): ‫( ﴍ‬fd0d), ‫( ﴩ‬fd29)
‫ ﺳﺮ‬Arabic (06310633): ‫( ﴎ‬fd0e), ‫( ﴪ‬fd2a)
‫ ﺻﺮ‬Arabic (06310635): ‫( ﴏ‬fd0f), ‫( ﴫ‬fd2b)
‫ ﺿﺮ‬Arabic (06310636): ‫( ﴐ‬fd10), ‫( ﴬ‬fd2c)
‫ ﺗﺠﻢ‬Arabic (0645062c062a): ‫( ﵐ‬fd50)
‫ ﺗﺤﺞ‬Arabic (062c062d062a): ‫( ﵑ‬fd51), ‫( ﵒ‬fd52)
‫ ﺗﺤﻢ‬Arabic (0645062d062a): ‫( ﵓ‬fd53)
‫ ﺗﺨﻢ‬Arabic (0645062e062a): ‫( ﵔ‬fd54)
‫ ﺗﻤﺞ‬Arabic (062c0645062a): ‫( ﵕ‬fd55)
‫ ﺗﻤﺢ‬Arabic (062d0645062a): ‫( ﵖ‬fd56)
‫ ﺗﻤﺦ‬Arabic (062e0645062a): ‫( ﵗ‬fd57)
‫ ﺟﻤﺢ‬Arabic (062d0645062c): ‫( ﵘ‬fd58), ‫( ﵙ‬fd59)
‫ ﺣﻤﻲ‬Arabic (064a0645062d): ‫( ﵚ‬fd5a), ‫( ﵛ‬fd5b)
‫ ﺳﺤﺞ‬Arabic (062c062d0633): ‫( ﵜ‬fd5c)
‫ ﺳﺠﺢ‬Arabic (062d062c0633): ‫( ﵝ‬fd5d)
‫ ﺳﺠﻲ‬Arabic (064a062c0633): ‫( ﵞ‬fd5e)
‫ ﺳﻤﺢ‬Arabic (062d06450633): ‫( ﵟ‬fd5f), ‫( ﵠ‬fd60)
‫ ﺳﻤﺞ‬Arabic (062c06450633): ‫( ﵡ‬fd61)
‫ ﺳﻤﻢ‬Arabic (064506450633): ‫( ﵢ‬fd62), ‫( ﵣ‬fd63)
‫ ﺻﺤﺢ‬Arabic (062d062d0635): ‫( ﵤ‬fd64), ‫( ﵥ‬fd65)
‫ ﺻﻤﻢ‬Arabic (064506450635): ‫( ﵦ‬fd66), ‫( ﷅ‬fdc5)
‫ ﺷﺤﻢ‬Arabic (0645062d0634): ‫( ﵧ‬fd67), ‫( ﵨ‬fd68)
‫ ﺷﺠﻲ‬Arabic (064a062c0634): ‫( ﵩ‬fd69)
‫ ﺷﻤﺦ‬Arabic (062e06450634): ‫( ﵪ‬fd6a), ‫( ﵫ‬fd6b)
‫ ﺷﻤﻢ‬Arabic (064506450634): ‫( ﵬ‬fd6c), ‫( ﵭ‬fd6d)
‫ ﺿﺤﻲ‬Arabic (064a062d0636): ‫( ﵮ‬fd6e), ‫( ﶫ‬fdab)

‫ ﺿﺨﻢ‬Arabic (0645062e0636): ‫( ﵯ‬fd6f), ‫( ﵰ‬fd70)
‫ ﻃﻤﺢ‬Arabic (062d06450637): ‫( ﵱ‬fd71), ‫( ﵲ‬fd72)
‫ ﻃﻤﻢ‬Arabic (064506450637): ‫( ﵳ‬fd73)
‫ ﻃﻤﻲ‬Arabic (064a06450637): ‫( ﵴ‬fd74)
‫ ﻋﺠﻢ‬Arabic (0645062c0639): ‫( ﵵ‬fd75), ‫( ﷄ‬fdc4)
‫ ﻋﻤﻢ‬Arabic (064506450639): ‫( ﵶ‬fd76), ‫( ﵷ‬fd77)
‫ ﻋﻤﻲ‬Arabic (064a06450639): ‫( ﵸ‬fd78), ‫( ﶶ‬fdb6)
‫ ﻏﻤﻢ‬Arabic (06450645063a): ‫( ﵹ‬fd79)
‫ ﻏﻤﻲ‬Arabic (064a0645063a): ‫( ﵺ‬fd7a), ‫( ﵻ‬fd7b)
‫ ﻓﺨﻢ‬Arabic (0645062e0641): ‫( ﵼ‬fd7c), ‫( ﵽ‬fd7d)
‫ ﻗﻤﺢ‬Arabic (062d06450642): ‫( ﵾ‬fd7e), ‫( ﶴ‬fdb4)
‫ ﻗﻤﻢ‬Arabic (064506450642): ‫( ﵿ‬fd7f)
‫ ﻟﺤﻢ‬Arabic (0645062d0644): ‫( ﶀ‬fd80), ‫( ﶵ‬fdb5)
‫ ﻟﺤﻲ‬Arabic (064a062d0644): ‫( ﶁ‬fd81), ‫( ﶂ‬fd82)
‫ ﻟﺠﺞ‬Arabic (062c062c0644): ‫( ﶃ‬fd83), ‫( ﶄ‬fd84)
‫ ﻟﺨﻢ‬Arabic (0645062e0644): ‫( ﶅ‬fd85), ‫( ﶆ‬fd86)
‫ ﻟﻤﺢ‬Arabic (062d06450644): ‫( ﶇ‬fd87), ‫( ﶈ‬fd88)
‫ ﻣﺤﺞ‬Arabic (062c062d0645): ‫( ﶉ‬fd89)
‫ ﻣﺤﻢ‬Arabic (0645062d0645): ‫( ﶊ‬fd8a)
‫ ﻣﺤﻲ‬Arabic (064a062d0645): ‫( ﶋ‬fd8b)
‫ ﻣﺠﺢ‬Arabic (062d062c0645): ‫( ﶌ‬fd8c)
‫ ﻣﺠﻢ‬Arabic (0645062c0645): ‫( ﶍ‬fd8d)
‫ ﻣﺨﺞ‬Arabic (062c062e0645): ‫( ﶎ‬fd8e)
‫ ﻣﺨﻢ‬Arabic (0645062e0645): ‫( ﶏ‬fd8f)
‫ ﻣﺠﺦ‬Arabic (062e062c0645): ‫( ﶒ‬fd92)
‫ ھﻤﺞ‬Arabic (062c06450647): ‫( ﶓ‬fd93)
‫ ھﻤﻢ‬Arabic (064506450647): ‫( ﶔ‬fd94)
‫ ﻧﺤﻢ‬Arabic (0645062d0646): ‫( ﶕ‬fd95)
‫ ﻧﺤﻲ‬Arabic (064a062d0646): ‫( ﶖ‬fd96), ‫( ﶳ‬fdb3)
‫ ﻧﺠﻢ‬Arabic (0645062c0646): ‫( ﶗ‬fd97), ‫( ﶘ‬fd98)
‫ ﻧﺠﻲ‬Arabic (064a062c0646): ‫( ﶙ‬fd99), ‫( ﷇ‬fdc7)
‫ ﻧﻤﻲ‬Arabic (064a06450646): ‫( ﶚ‬fd9a), ‫( ﶛ‬fd9b)
‫ ﯾﻤﻢ‬Arabic (06450645064a): ‫( ﶜ‬fd9c), ‫( ﶝ‬fd9d)
‫ ﺑﺨﻲ‬Arabic (064a062e0628): ‫( ﶞ‬fd9e)
‫ ﺗﺠﻲ‬Arabic (064a062c062a): ‫( ﶟ‬fd9f), ‫( ﶠ‬fda0)
‫ ﺗﺨﻲ‬Arabic (064a062e062a): ‫( ﶡ‬fda1), ‫( ﶢ‬fda2)
‫ ﺗﻤﻲ‬Arabic (064a0645062a): ‫( ﶣ‬fda3), ‫( ﶤ‬fda4)
‫ ﺟﻤﻲ‬Arabic (064a0645062c): ‫( ﶥ‬fda5), ‫( ﶧ‬fda7)
‫ ﺟﺤﻲ‬Arabic (064a062d062c): ‫( ﶦ‬fda6), ‫( ﶾ‬fdbe)

‫ ﺳﺨﻲ‬Arabic (064a062e0633): ‫( ﶨ‬fda8), ‫( ﷆ‬fdc6)
‫ ﺻﺤﻲ‬Arabic (064a062d0635): ‫( ﶩ‬fda9)
‫ ﺷﺤﻲ‬Arabic (064a062d0634): ‫( ﶪ‬fdaa)
‫ ﻟﺠﻲ‬Arabic (064a062c0644): ‫( ﶬ‬fdac)
‫ ﻟﻤﻲ‬Arabic (064a06450644): ‫( ﶭ‬fdad)
‫ ﯾﺤﻲ‬Arabic (064a062d064a): ‫( ﶮ‬fdae)
‫ ﯾﺠﻲ‬Arabic (064a062c064a): ‫( ﶯ‬fdaf)
‫ ﯾﻤﻲ‬Arabic (064a0645064a): ‫( ﶰ‬fdb0)
‫ ﻣﻤﻲ‬Arabic (064a06450645): ‫( ﶱ‬fdb1)
‫ ﻗﻤﻲ‬Arabic (064a06450642): ‫( ﶲ‬fdb2)
‫ ﻛﻤﻲ‬Arabic (064a06450643): ‫( ﶷ‬fdb7)
‫ ﻧﺠﺢ‬Arabic (062d062c0646): ‫( ﶸ‬fdb8), ‫( ﶽ‬fdbd)
‫ ﻣﺨﻲ‬Arabic (064a062e0645): ‫( ﶹ‬fdb9)
‫ ﻟﺠﻢ‬Arabic (0645062c0644): ‫( ﶺ‬fdba), ‫( ﶼ‬fdbc)
‫ ﻛﻤﻢ‬Arabic (064506450643): ‫( ﶻ‬fdbb), ‫( ﷃ‬fdc3)
‫ ﺣﺠﻲ‬Arabic (064a062c062d): ‫( ﶿ‬fdbf)
‫ ﻣﺠﻲ‬Arabic (064a062c0645): ‫( ﷀ‬fdc0)
‫ ﻓﻤﻲ‬Arabic (064a06450641): ‫( ﷁ‬fdc1)
‫ ﺑﺤﻲ‬Arabic (064a062d0628): ‫( ﷂ‬fdc2)
‫ ﺻﻠﮯ‬Arabic (06d206440635): ‫( ﷰ‬fdf0)
‫ ﻗﻠﮯ‬Arabic (06d206440642): ‫( ﷱ‬fdf1)
‫ اﻛﺒﺮ‬Arabic (0631062806430627): ‫( ﷳ‬fdf3)
‫ ﻣﺤﻤﺪ‬Arabic (062f0645062d0645): ‫( ﷴ‬fdf4)
‫ ﺻﻠﻌﻢ‬Arabic (0645063906440635): ‫( ﷵ‬fdf5)
‫ رﺳﻮل‬Arabic (0644064806330631): ‫( ﷶ‬fdf6)
‫ ﻋﻞ‬Arabic (06440639): ‫( ﷷ‬fdf7)
‫ ﺳﻠﻢ‬Arabic (064506440633): ‫( ﷸ‬fdf8)
‫ ﺻﻠﻲ‬Arabic (064a06440635): ‫( ﷹ‬fdf9)
‫ ﺻﻠﻰ ﷲ ﻋﻠﯿﮫ وﺳﻠﻢ‬Arabic
(064506440633064800200647064a0644063900200647064406440627002
0064906440635): ‫( ﷺ‬fdfa)
‫ ﺟﻞ ﺟﻼﻟﮫ‬Arabic (0647064406270644062c00200644062c): ‫( ﷻ‬fdfb)
‫ ﷼‬Arabic (0644062706cc0631): ‫( ﷼‬fdfc)
ًArabic (64b): ‫( ﹱ‬fe71)
‫ ٳ‬Arabic (673): ‫( ﹳ‬fe73)
َArabic (64e): ‫( ﹷ‬fe77)
ُArabic (64f): ‫( ﹹ‬fe79)
ِArabic (650): ‫( ﹻ‬fe7b)
ّArabic (651): ‫( ﹽ‬fe7d)
ْArabic (652): ‫( ﹿ‬fe7f)
‫ ء‬Arabic (621): ‫( ﺀ‬fe80)
‫ ب‬Arabic (628): ‫( ب‬fe8f), ‫( ﺐ‬fe90), ‫( ﺑ‬fe91), ‫( ﺒ‬fe92)
‫ ت‬Arabic (62a): ‫( ت‬fe95), ‫( ﺖ‬fe96), ‫( ﺗ‬fe97), ‫( ﺘ‬fe98)
‫ ث‬Arabic (62b): ‫( ث‬fe99), ‫( ﺚ‬fe9a), ‫( ﺛ‬fe9b), ‫( ﺜ‬fe9c)

‫ج‬ Arabic (62c): ‫( ج‬fe9d), ‫( ﺞ‬fe9e), ‫( ﺟ‬fe9f), ‫( ﺠ‬fea0)
‫ح‬ Arabic (62d): ‫( ﺡ‬fea1), ‫( ﺢ‬fea2), ‫( ﺣ‬fea3), ‫( ﺤ‬fea4)
‫خ‬ Arabic (62e): ‫( خ‬fea5), ‫( ﺦ‬fea6), ‫( ﺧ‬fea7), ‫( ﺨ‬fea8)
‫د‬ Arabic (62f): ‫( د‬fea9), ‫( ﺪ‬feaa)
‫ز‬ Arabic (632): ‫( ز‬feaf), ‫( ﺰ‬feb0)
‫س‬ Arabic (633): ‫( س‬feb1), ‫( ﺲ‬feb2), ‫( ﺳ‬feb3), ‫( ﺴ‬feb4)
‫ش‬ Arabic (634): ‫( ش‬feb5), ‫( ﺶ‬feb6), ‫( ﺷ‬feb7), ‫( ﺸ‬feb8)
‫ص‬ Arabic (635): ‫( ص‬feb9), ‫( ﺺ‬feba), ‫( ﺻ‬febb), ‫( ﺼ‬febc)
‫ض‬ Arabic (636): ‫( ض‬febd), ‫( ﺾ‬febe), ‫( ﺿ‬febf), ‫( ﻀ‬fec0)
‫ط‬ Arabic (637): ‫( ط‬fec1), ‫( ﻂ‬fec2), ‫( ﻃ‬fec3), ‫( ﻄ‬fec4)
‫ظ‬ Arabic (638): ‫( ظ‬fec5), ‫( ﻆ‬fec6), ‫( ﻇ‬fec7), ‫( ﻈ‬fec8)
‫ع‬ Arabic (639): ‫( ﻉ‬fec9), ‫( ﻊ‬feca), ‫( ﻋ‬fecb), ‫( ﻌ‬fecc)
‫غ‬ Arabic (63a): ‫( غ‬fecd), ‫( ﻎ‬fece), ‫( ﻏ‬fecf), ‫( ﻐ‬fed0)
‫ف‬ Arabic (641): ‫( ف‬fed1), ‫( ﻒ‬fed2), ‫( ﻓ‬fed3), ‫( ﻔ‬fed4)
‫ق‬ Arabic (642): ‫( ق‬fed5), ‫( ﻖ‬fed6), ‫( ﻗ‬fed7), ‫( ﻘ‬fed8)
‫ك‬ Arabic (643): ‫( ك‬fed9), ‫( ﻚ‬feda), ‫( ﻛ‬fedb), ‫( ﻜ‬fedc)
‫ل‬ Arabic (644): ‫( ل‬fedd), ‫( ﻞ‬fede), ‫( ﻟ‬fedf), ‫( ﻠ‬fee0)
‫م‬ Arabic (645): ‫( م‬fee1), ‫( ﻢ‬fee2), ‫( ﻣ‬fee3), ‫( ﻤ‬fee4)
‫ن‬ Arabic (646): ‫( ن‬fee5), ‫( ﻦ‬fee6), ‫( ﻧ‬fee7), ‫( ﻨ‬fee8)
‫ﻻ‬ Arabic (06270644): ‫( ﻵ‬fef5), ‫( ﻶ‬fef6), ‫( ﻷ‬fef7), ‫( ﻸ‬fef8), ‫ﻹ‬
(fef9), ‫( ﻺ‬fefa), ‫( ﻻ‬fefb), ‫( ﻼ‬fefc)
・ (30fb): ・ (ff65)
ヲ (30f2): ヲ (ff66)
ー (30fc): ー (ff70)
キ (30ad): キ (ff77)
ク (30af): ク (ff78)
コ (30b3): コ (ff7a)
サ (30b5): サ (ff7b)
シ (30b7): シ (ff7c)
ス (30b9): ス (ff7d)
セ (30bb): セ (ff7e)
ソ (30bd): ソ (ff7f)
タ (30bf): タ (ff80)
チ (30c1): チ (ff81)
テ (30c6): テ (ff83)
ト (30c8): ト (ff84)
ナ (30ca): ナ (ff85)
ニ (30cb): ニ (ff86)
ヌ (30cc): ヌ (ff87)
ネ (30cd): ネ (ff88)
ノ (30ce): ノ (ff89)
ハ (30cf): ハ (ff8a)
ヒ (30d2): ヒ (ff8b)
フ (30d5): フ (ff8c)
ヘ (30d8): ヘ (ff8d)
ホ (30db): ホ (ff8e)
マ (30de): マ (ff8f)
ミ (30df): ミ (ff90)
ム (30e0): ム (ff91)
メ (30e1): メ (ff92)
モ (30e2): モ (ff93)
ラ (30e9): ラ (ff97)
リ (30ea): リ (ff98)
ル (30eb): ル (ff99)
レ (30ec): レ (ff9a)

ロ (30ed): ロ (ff9b)
ン (30f3): ン (ff9d)
゛ (309b): ゙ (ff9e)
゜ (309c): ゚ (ff9f)

Additional Information
Version history and selected built-in utilities.

Version History
This section of the document identifies which updates of the Search Engine contain
new features or material changes in behavior. This is not comprehensive, but a list of
the more notable changes.

Search Engine 10
Released with Content Server 10, approximately September 2010. The versions of
the search engine prior to this release were generally referred to as OT7.
• Add support for key-value attributes in text metadata, used for multi-lingual
metadata indexing and search.
• Added Hindi, Tamil and Telugu to the standard tokenizer.
• New percent full model with “soft” update-only mode and rebalancing.
• Defragmentation of metadata storage.
• Added ModifyByQuery.
• Added DeleteByQuery.
• Added Disk Retrieval Storage mode.
• Bi-gram indexing of far-east character sets. May require re-indexing of existing
content with far-east character sets.
• Faster ‘stemming’ focused on noun plurals.
• Content Status feature added.
• Synthetic regions: partition name and mode.
• Change bad metadata to record error instead of halting.
• Search Federator closes connections from inactive clients.
• Rolling log file support added.
• Various bug fixes
• Support for Java 6 (Update 20)

Search Engine 10 Update 1


Released with Content Server 10 Update 1 and Content Server 9.7.1 November
2010 cumulative patch, approximately February 2011.
• Stagger defragmentation times to limit CPU loading.

• Fewer checkpoints on startup if conversions take place.
• Tokenizer modified for case-insensitive Russian character indexing. Optional re-
indexing of Russian content may be desired to leverage this feature for older
objects.
• Default number of results per query reduced, improves get results performance.
• New TIMESTAMP data type implemented using ISO 8601 format.
• Aggregate-text feature implemented.
• REMOVE region capability added for LLFieldDefinitions.txt.
• RENAME feature for existing regions added.
• MERGE feature for existing regions added.
• Various bug fixes

Search Engine 10 Update 2


Released with Content Server 10 Update 2 and Content Server 9.7.1 Update 2,
approximately April 2011.
• RENAME feature extended for new data being indexed.
• MERGE feature extended for new data being indexed.
• Copy of configuration files included as reference in log files.
• Various bug fixes

Search Engine 10 Update 3


Released with Content Server 10 Update 3 and Content Server 9.7.1 Update 3,
approximately July 2011.
• Percent full defaults revised downwards for conservative deployment.
• Backup utility now performs Level 1 + partial Level 4 verification.
• Add OTFileType and OTContentLanguage to default LLFieldDefinitions.txt
• Check for base offset errors when creating fragments.
• Various bug fixes.

Search Engine 10 Update 4


Released with Content Server 10 Update 4 and Content Server 9.7.1 Update 4,
approximately September 2011.
• Adds search facet generation capabilities.
• Adds socket communication as alternative to RMI.
• Enhanced cleanup thread, more aggressive file removal.

• Fixed: Memory loss with defragmentation.
• Configuration limits for maximum transactions in a metalog.

Search Engine 10 Update 5


Released with Content Server 10 Update 5 and Content Server 9.7.1 Update 5,
approximately December 2011.
• Sockets and threads persist between Search Federators and Search Engines
• Search Engines can now terminate / recover broken socket connections
• Accumulator memory requirements reduced
• GetStatusText now much faster, but possibly less accurate in estimates

Search Engine 10 Update 5 Release 2


Available March 2012, this interim release fixed a select set of issues and
represented a “stable” version for use with Content Server 10 and 9.7.1 Updates 3
through 5.

Search Engine 10 Update 6


Released with Content Server 10 Update 6 and Content Server 9.7.1 Update 6,
approximately March 2012.
• Accumulator memory use reduced by chunking large text objects

Search Engine 10 Update 7


Released with Content Server 10 Update 7 and Content Server 9.7.1 Update 7,
approximately June 2012.
• Additional accumulator memory use reduction
• Accumulator performance improvement when chunking
• Multi-value text region support with Facets
• Relaxed whitespace rules parsing configuration files
• Invalid hostnames allowed by Windows now reported as errors

Search Engine 10 Update 8


Released with Content Server 10 Service Pack 2 and Content Server 9.7.1 Update 8,
approximately September 2012.
• Cleanup of unused facet data structures
• Improved Date facets
• Support for FileSize facets

• Significant performance improvements converting OT7 indexes
• Text metadata size and number of values protection
• Disk-based index storage available for beta testing

Search Engine 10 Update 9


Released with Content Server 10 Service Pack 2 Update 9 and Content Server 9.7.1
Update 9, approximately December 2012.
• Added in-place conversion of region type definitions
• Performance improvements for regular expressions and left truncation
• Improved cache management of search facets
• Added limits to number of values and total length of values for text metadata

Search Engine 10 Update 10


Released with Content Server 10 Service Pack 2 Update 10 and Content Server
9.7.1 Update 10, approximately March 2013.
• Addition of support for DATE metadata region type
• Optional compression of Checkpoint files
• Improved removal of invalid regions
• Performance improvements with Low Memory mode
• Improved hit highlighting for bi-gram indexed characters
• Command line echo option added to search client

Search Engine 10 Update 11


Released with Content Server 10 Service Pack 2 Update 11 and Content Server
9.7.1 Update 11, approximately June 2013.
• Accepts “Oracle” as vendor for use with Java 7
• Improved removal of nulls in regions during conversion
• Compute protected facets on startup
• IO Buffer leaks corrected
• TEXT fields can now be used as TypeFieldRankers
• Added query by OTPartitionName or OTPartitionMode

Search Engine 10 Update 12


Released with Content Server 10 Service Pack 2 Update 12 and Content Server
9.7.1 Update 12, approximately September 2013.
• Separate metalog checkpoint settings for Low Memory Mode
• Index biasing feature introduced to give preference to filling partitions
• Optional limit on number of simultaneous partitions writing checkpoints
• Fast traversal option for TEXT region in RAM with many identical values
• Get Regions function now includes region type information

Search Engine 10.5


Released with Content Server 10.5 and Content Server 10 Service Pack 2 Update
13, approximately December 2013.
• Index Engines will tolerate Search Engines not consuming metalogs
• Improved Index Engine shutdown reduces forced search grid restarts
• Introduction of the “LIKE” modifier
• RETIRE mode for partitions introduced
• REMOVE date regions
• Optimize facet creation on startup

Search Engine 10.5 Update 2014-03


Released with Content Server 10.5 Update 2014-03 and Content Server 10 Service
Pack 2 Update 2014-03, approximately March 2014.
• Performance optimization for sorting search results
• Optimize disk reads during metadata updates
• Remove TIME regions from DATETIME pairs

Search Engine 10.5 Update 2014-03 R2


Released as a hotfix for Content Server 10.5 Update 2014-03 and Content Server 10
Service Pack 2 Update 2014-03, March 2014.
• Bypass sorting if “Nothing” selected for sort order
• Optimize batch processing of DeleteByQuery and ModifyByQuery
• Add features for caching of search results

Search Engine 10.5 Update 2014-06


Released with Content Server 10.5 Update 2014-06 and Content Server 10 Service
Pack 2 Update 2014-06, June 2014.
• Add features for caching of search results
• Optimize indexing by using 1 thread per partition in Update Distributor
• Add administration command to force checkpoint writes
• Enhance Update Distributor “getstatustext” to include checkpoint data
• Enable the ‘LIKE’ capabilities for filename and part number search
• Add ‘Modify’ indexing operation

Search Engine 10.5 Update 2014-09


Released with Content Server 10.5 Update 2014-09 and Content Server 10 Service
Pack 2 Update 2014-09, September 2014.
• Reduced garbage collection, different estimation of percent full
• Added feature for searchable email domain regions
• ModifyByQuery will update regions with empty values
• OTFileType region repair on startup

Search Engine 10.5 Update 2014-12


Released with Content Server 10.5 Service Pack 1 Update 2014-12 and Content
Server 10 Update 2014-12, December 2014.
• Comparison queries for full text disallowed by default
• Relative day, week, month, quarter and year queries for DATE regions
• Added support for IN and NOT IN operators
• ModifyByQuery can completely remove empty values
• Improved Norsk/Dansk and Arabic tokenization
• Optional selective timestamps based on subtype

Search Engine 10.5 Update 2015-03


Released with Content Server 10.5 Update 2015-03 and Content Server 10 Update
2015-03, March 2015.
• Merge Tokens added to allow partitions to merge when out of disk space
• Partition rebalancing using disk percent full is supported

• Background text region index merges enabled, providing smaller checkpoint
files, faster startup, and higher ingestion throughput

Search Engine 10.5 Update 2015-06


Released with Content Server 10.5 Update 2015-06 and Content Server 10 Update
2015-06, June 2015.
• Additional Tokenizers can be used with text metadata regions
• Phonetic matching optimization, typically 30% faster
• Update Distributor performance statistics are logged

Search Engine 10.5 Update 2015-09


Released with Content Server 10.5 Update 2015-09 and Content Server 10 Update
2015-09, September 2015.
• Search execution times available per query or statistically
• Search facet types can be queried
• Getstatustext basic - efficient partition status calls added
• Added 2 decimal currency data type
• Exact substring matching for text metadata values added

Search Engine 10.5 Update 2015-12


Released with Content Server 10.5 Update 2015-12 and Content Server 10 Update
2015-12, December 2015.
• ConvertREtoRelevancy setting for query performance improvement on older
updates of Content Server.
• Reduced memory use with queries on ENUM and DATE types
• Federator caching now supports facets, statistics

Search Engine 16 Update 2016-03


Released with Content Server 16, Content Server 10.5 Update 2016-03 and Content
Server 10 Update 2016-03, March 2016.
• Protected facets are stored in the checkpoint for faster startup
• Per-query relevance boosting introduced
• Improved indexing throughput by optimizing file reads
• Reduced memory use in Search Engines for queries
• Large queries can be processed in chunks

Search Engine 16.0.1 (June 2016)


Released with Content Server 16.0.1, Content Server 10.5 Update 2016-06 and
Content Server 10 Update 2016-06, June 2016.
• Default operators for queries before chunking increased to 15,000
• Region comparisons converted to range operators
• Termset operator introduced
• Stemset operation introduced
• Query memory optimizations

Search Engine 16.0.2 (September 2016)


Released with Content Server 16.0.2 and Content Server 10.5 Update 2016-09,
September 2016.
• Metadata region forgery prevention added, using otb= attribute
• Optimizations for certain search query scenarios
• Corrected several relevance computation edge case issues

Search Engine 16.0.3 (December 2016)


Released with Content Server 16.0.3 and Content Server 10.5 Update 2016-12,
December 2016.
• Defragmentation is monthly with Low Memory Mode
• Optimized date facet generation
• Various indexing and query performance improvements

Search Engine 16.2.0 (March 2017)


Released with Content Server 16.2.0, Content Server 16.0.4, and Content Server
10.5 Update 2017-03, March 2017.
• Search.ini can be used to logically append lines to LLFieldDefinitions.txt
• Maximum number of sub-indexes now configurable
• Search Federator can report more than 2 billion results

Search Engine 16.2.1 (June 2017)


Released with Content Server 16.2.1, Content Server 16.0.5, and Content Server
10.5 Update 2017-06, June 2017.
• Priority CHAIN region introduced
• [first …] syntax added for dynamic queries on priority chains
• IndexVerify can test that TEXT values are readable

• Maximum number of sub-indexes now configurable
• Optimization in Metalog replay and AGGREGATE-TEXT indexing
• Java 8 u121

Search Engine 16.2.2 (September 2017)


Released with Content Server 16.2.2, Content Server 16.0.6, and Content Server
10.5 Update 2017-09, September 2017.
• MIN and MAX region capabilities added
• Optimization for grouping ModifyByQuery ops added, off by default
• Improved timeout handling for very large result sets
• Java 8 u131

Search Engine 16.2.3 (December 2017)


Released with Content Server 16.2.3, Content Server 16.0.7, and Content Server
10.5 Update 2017-12, December 2017.
• Bloom Filters added to optimize ModifyByQuery performance
• Reduced memory required to index very large objects
• Reduced temporary memory needed to build facets
• ANY search operator added
• ANY region query feature added
• ALL region query feature added
• Optimized hit highlighting performance through parallelization
• Java 8 u144

Search Engine 16.2.4 (March 2018)


Released with Content Server 16.2.4, Content Server 16.0.8, and Content Server
10.5 Update 2018-03, March 2018.
• Fix problem with thumbnail requests being indexed as part of text
• Extended file error retries due to DFS problems
• Introduce Top Words lists
• First implementation of the TEXT operator
• Java 8 u152

Search Engine 16.2.5 (June 2018)


Released with Content Server 16.2.5, Content Server 16.0.9, and Content Server
10.5 Update 2018-06, June 2018.
• Add Transaction Log file capability
• Add Reverse Dictionary
• Force conversion of some CS Integers to Long
• Java 8 u162

Search Engine 16.2.6 (September 2018)


Released with Content Server 16.2.6, Content Server 16.0.10, and Content Server
10.5 Update 2018-09, September 2018.
• Reverse Dictionary optimizations
• Report Disk I/O performance stats in log files
• Java 8 u181

Search Engine 16.2.7 (December 2018)


Released with Content Server 16.2.7, Content Server 16.0.11, and Content Server
10.5 Update 2018-12, December 2018.
• Optional low priority search queue added
• Java 8 u192

Search Engine 16.2.8 (March 2019)


Released with Content Server 16.2.8, Content Server 16.0.12, March 2019.
• Optional query suspension feature to prevent index throttling
• OpenJDK 11.0.1

Search Engine 16.2.9 (June 2019)


Released with Content Server 16.2.9 and Content Server 16.0.13, June 2019.
• Introduced span operator for advanced proximity searching
• New backup procedure
• Capture and log statistics on network errors
• OpenJDK 11.0.1

Search Engine 16.2.10 (September 2019)


Released with Content Server 16.2.10 and Content Server 16.0.14, September 2019.

• Optimization of numeric range search in text metadata fields
• OpenJDK 11.0.3

Search Engine 16.2.11 (December 2019)


Released with Content Server 16.2.11 and Content Server 16.0.15, December 2019.
• Capture and log statistics on network errors
• OTObjectUpdateTime always refreshed
• Regular expression and wildcard support in span operator
• Interval-based Search Agents introduced
• AdoptOpenJDK 11.0.5

Search Engine 20.2 (March 2020)


Released with Content Server 20.2 and Content Server 16.0.16, March 2020.
• Reserved partitions for objects with very large full text
• Additional span limits and controls
• Search Agent timing added to performance summary
• Support for long tokens
• AdoptOpenJDK 11.0.5

Search Engine 20.3 (July 2020)


Released with Content Server 20.3 and Content Server 16.0.17, July 2020.
• Improved batch splitting based on object and metadata size
• GroupLocalUpdates defaults to true
• Improve Search Federator query queue servicing
• Agent IPool info added to hourly stats
• Option to compress content sent to Index Engines
• Search Agent timing added to performance summary
• AdoptOpenJDK 11.0.6

Search Engine 20.4 (October 2020)


Released with Content Server 20.4 and Content Server 16.0.18, October 2020.
• Optimize modify/delete by query lookup times
• Locale-sensitive ordering of text metadata
• Improve distribution of objects to partitions by total size

• getstatustext performance for Update Distributor stats
• “good” file indicators written for index backups
• AdoptOpenJDK 11.0.7

Error Codes

Errors and warnings from OTSE may be exposed in multiple ways. Process Error
codes are returned in responses to communications; detailed information about
errors is normally contained in the log files. The tables below describe many of
the possible Process Error codes. This is not a comprehensive list.

Update Distributor
Code Description

129 Unable to load JNI library. To read or write IPools, OTSE leverages Content Server libraries. This file is named jniipool.dll (Windows) or jniipool.so, and is expected to reside in the <OTHOME>\bin directory.

131 Insufficient memory. The memory allocation can be adjusted using the -Xmx parameter on the command line. Content Server exposes this control in its administration pages.

132 Unhandled exception. This error generally indicates that an error occurred for which no specific error handling exists. Resolving the cause of this error will usually require examination of detailed logs.

149 Command line error. At least one of the parameters on the command line used to start the Update Distributor is incorrect.

150 Invalid URL.

152 Invalid partition name.

153 Error reading configuration file. The search.ini file is improperly constructed, or has an invalid setting.

170 Insufficient memory in an Index Engine. The Update Distributor cannot run because at least one Index Engine is out of memory.

171 Unable to contact at least one Index Engine. Possible causes include incorrect configuration, or conflicting use of ports.

172 One or more Index Engines have insufficient disk space.

173 Index is full. All Index Engines report they are unable to accept new objects.

174 IPool read or write error occurred.

Index Engine
Code Description

132 Unhandled exception. This error generally indicates that an error occurred for which no specific error handling exists. Resolving the cause of this error will usually require examination of detailed logs.

149 Command line error. At least one of the parameters on the command line used to start the Index Engine is incorrect.

150 Invalid URL.

153 Error reading configuration file. The search.ini file is improperly constructed, or has an invalid setting.

171 Unable to contact at least one Index Engine. Possible causes include incorrect configuration, or conflicting use of ports.

174 IPool read or write error occurred.

175 Unreadable index. The Index Engine is unable to load an existing index partition.

176 A restore from backup operation has failed.

180 Index failed to start. In some cases, this error is acceptable if the Index Engine is already running.

181 Request to start the Index Engine has been ignored because an index restore operation is in progress.

Search Federator
Code Description

132 Unhandled exception. This error generally indicates that an error occurred for which no specific error handling exists. Resolving the cause of this error will usually require examination of detailed logs.

149 Command line error. At least one of the parameters on the command line used to start the Search Federator is incorrect.

150 Invalid URL.

153 Error reading configuration file. The search.ini file is improperly constructed, or has an invalid setting.


Search Engine
Code Description

132 Unhandled exception. This error generally indicates that an error occurred for which no specific error handling exists. Resolving the cause of this error will usually require examination of detailed logs.

149 Command line error. At least one of the parameters on the command line used to start the Search Engine is incorrect.

150 Invalid URL.

153 Error reading configuration file. The search.ini file is improperly constructed, or has an invalid setting.

Utilities
OTSE contains a number of built-in utilities and diagnostic tools. These are often
used by OpenText support staff and developers when analyzing and testing an index.
Many of these will have limited value for customers, but may be of assistance when
diagnosing particular index problems. For convenience, basic documentation for
some of the more common utilities is included here.
Many of the utilities are NOT a supported feature of the product. They are not
guaranteed to work as described, and may be modified or removed at any time.

You are strongly advised to use the utilities on a backup of your
index, and not on a production copy. The potential exists to render
an index unusable for your application with some of these tools.
You have been warned.

General Syntax

The utilities are invoked by launching the search JAR using appropriate parameters.
The general syntax is:
java [-Xmx#M] -classpath <othome>\bin\otsearch.jar com.opentext.search.tools.<subclasspath> [parameters]

Where:

<othome>\bin is the file path where the search JAR file is located.

<subclasspath> is the name of the utility to be used.

[parameters] vary depending on the utility, as described in the following sections.

An example command line, using the VerifyIndex utility:

java -classpath c:\opentext\bin\otsearch.jar com.opentext.search.tools.index.VerifyIndex -level 5 -config search.ini -indexengine ieName -outFile verify_results.out -verbose true

Backup
The backup utility is used to create either differential or full backups of a partition.
Refer to the section on Backup and Restore for more information.
java -classpath otsearch.jar com.opentext.search.backup.Backup -inifile J:\index\Diff.ini
Where the inifile identifies the backup configuration file to be used.

Restore
The restore utility is used to restore an index from a prior backup. Refer to the
section on Backup and Restore for more information.

java -classpath otsearch.jar com.opentext.search.backup.Restore -inifile J:\index\Res.ini

Where the inifile identifies the restore.ini file to be used. You may need to run the
restore process many times. Using the utility directly is not for the faint of heart, and
you should probably let Content Server manage this for you.

DumpKeys
The DumpKeys utility attempts to generate a list of all the object IDs for objects in the
partition. It is often a tool of last resort for repairing a corrupted index, because
DumpKeys can sometimes retrieve data from a partition that is otherwise unreadable.
The input to DumpKeys is the search.ini file and partition information; the output
is a file of object IDs. Sample output looks like this:
c DataId=41280133&Version=1
c DataId=41280132&Version=1
c DataId=41280131&Version=1
The first character details where the object ID was found. If in the checkpoint file, the
first character is a ‘c’ (as in the example above). If an object ID was found in the
metalog file (recently indexed), the first character reflects the operation type:

n: new
a: add
r: replace
m: modify
d: delete
Invoking DumpKeys:

java -Xmx2000M -Xss10M -cp .\otsearch.jar; com.opentext.search.tools.analysis.DumpKeys -inifile <path_to_search.ini> -sectionName <IE_or_SE_Section_Name> -log <Path_to_log_file> -output <Path_to_DumpKeys_Output>

Parameters:
path_to_search.ini: Path to the search.ini file, typically /config/search.ini
IE_or_SE_Section_Name: The full section name including the SearchEngine_ or
IndexEngine_ prefix.
Path_to_DumpKeys_Output: Path to where the output file should be created.
Path_to_log_file: Path to where the log file should be created.
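For example, to pull out only the objects that the metalog recorded as deletes, a post-processing sketch using standard operating system tools (the file names are illustrative):

grep "^d " dumpkeys_output.txt > deleted_ids.txt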

VerifyIndex
This utility performs internal checks of the structure of the index. Levels 1 through 5
are cumulative, and level 10 is a distinct operation. Parameters are:
-level K -config SearchIniFile -indexengine IEName [-outFile OutFile] [-html true] [-verbose true]

level: a value between 1-5 or 10; see below for details
config: search.ini file
IEName: Index Engine signature in the search.ini file of the partition to be processed
outFile: output file containing results of verification
html: true requests output in HTML format
verbose: true generates progress and status information, even without errors

Levels are cumulative from 1-5; level 10 is distinct:

1: Surface level check of the index, identifying inconsistencies
2: also verifies the checksum of the checkpoint file
3: also verifies the checkpoint is loadable
4: also verifies the checksum of all full text index files
5: also verifies the word pointers in the content index are consistent
10: verifies that the word pointers in the metadata index are present
In addition to the output report, the VerifyIndex process has an exit code = 0 if OK, 1
if not.
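For example, a minimal wrapper script on Linux (a sketch only; the paths, Index Engine name, and check level are illustrative) can act on the exit code:

java -classpath /opentext/bin/otsearch.jar com.opentext.search.tools.index.VerifyIndex -level 2 -config /opentext/config/search.ini -indexengine IEname0 -outFile verify_results.out
if [ $? -ne 0 ]; then echo "VerifyIndex reported problems; see verify_results.out"; fi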
VerifyIndex should be run with a partition which is currently not in use by search or
index engines. Although a level 1 test may only take a few seconds, a level 5 or level
10 test might require up to 10 hours to run, depending upon the partition size and the
capabilities of the computer.
A Level 3 test has an option to rigorously test that TEXT metadata values can be
successfully read. When enabled, this will nearly double the execution time. The
check is controlled by a setting in the [IndexEngine_] section of the search.ini file,
which identifies how many exceptions should be logged before stopping. The default
is 300; set it to 0 to disable the check.
MaxVerifyIndexMODExceptions=300
By way of example, a sample level 5 test output is shown below:

Level 5 check starting on /index4
Level 5 verifies the postings in the content files are consistent

SubIndex Statistics
SubIndex \index4\12401 Tokens 22675687/25761333 (88%) Postings 1361632092/1440242913 (95%)
SubIndex \index4\21883 Tokens 15940369/16069809 (99%) Postings 761869365/774825024 (98%)
SubIndex \index4\23990 Tokens 931805/934883 (100%) Postings 34611908/34731025 (100%)
SubIndex \index4\25273 Tokens 348058/348058 (100%) Postings 5892009/5892009 (100%)
SubIndex \index4\27066 Tokens 28350/28350 (100%) Postings 1293163/1293163 (100%)

Index Statistics
Index Size = 4975289946
Max internalID = 629940
Active ObjectIDs = 594584
Total index: Tokens 39924269/43142433 (93%) Postings 2165298537/2256984134 (96%)
Total core tokens = 5954186 Total token length = 48085094 Average token length = 8.075847
Total other tokens = 37185437 Total token length = 405466390 Average token length = 10.903903
Total region tokens = 2810 Total token length = 42521 Average token length = 15.132029
Total content compression = 0.2505714
Level 5 check complete in 23528700 ms
If errors are found in a Level 10 diagnostic, they can usually be corrected using the
RebuildIndex utility.


RebuildIndex
This utility rebuilds the dictionary and index for metadata in a partition. This is
possible because an exact copy of the metadata is stored in the checkpoint files.
This does not affect the full text index. This utility can often be used to repair errors
detected by a Level 10 VerifyIndex.
Parameters:

-iniFile SearchIniFile -indexengine IEName

Where
SearchIniFile is the location and name of the search.ini file which should be used.
IEName is the name of the partition which should be rebuilt.

Because this utility needs to build and load the entire index, you may need to ensure
an appropriate -Xmx (memory allocation) parameter is specified on the Java
command line.
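A hypothetical invocation, following the general syntax above (this assumes RebuildIndex resides under com.opentext.search.tools.index like the other index tools; the memory allocation and paths are examples only):

java -Xmx8000M -classpath /opentext/bin/otsearch.jar com.opentext.search.tools.index.RebuildIndex -iniFile /opentext/config/search.ini -indexengine IEname0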

LogInterleaver
Each component of the search grid (Index and Search Engines, Search Federators
and the Update Distributor) creates its own log files. It can be difficult to trace
a single operation through multiple log files. The LogInterleaver function combines
multiple log files into a single log file, ordering entries according to their time
stamps, to simplify interpretation. The output file has a slightly different syntax:
each line of output is prefixed by the original log file name.
Parameters:

-d logDir -o outputFile

OR

outputFile logFile1 logFile2 ... logFileN

The first usage combines all the log files within the requested directory into the log
file specified by outputFile. In this usage, the logDir should be the same as the
working directory. The second usage combines a specific list of files.
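Two hypothetical invocations (the package name is an assumption, since only the utility name is shown above; file and directory names are illustrative):

java -classpath c:\opentext\bin\otsearch.jar com.opentext.search.tools.LogInterleaver -d c:\opentext\logs -o combined.log

java -classpath c:\opentext\bin\otsearch.jar com.opentext.search.tools.LogInterleaver combined.log search1.log index1.log updateDistributor.log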

tools.analysis.ConvertDateFormat
Log files from the search components have a time stamp in milliseconds from a
reference date. This utility will convert a log file to have human-readable time/date
values instead, which can be helpful when interpreting the logs manually.
This utility is somewhat unusual in that it reads from console input and writes to
console output, so the typical usage is to “pipe” the source logfile into the java
command line, and redirect the output to a target file like this:


type <logfile> | java -classpath c:\opentext\bin\otsearch.jar com.opentext.search.tools.analysis.ConvertDateFormat > formattedlog.txt

com.opentext.search.tokenizer.LivelinkTokenizer
This utility enters a console loop. You enter one line of text, and it responds by
printing out each search token generated on a separate line. Control-C will terminate
the loop.
Optional command line parameters:
-TokenizerOptions <Number> -tokenizerfile <RegExParserFile>

Where Number represents the bitwise controls for tokenizer options, as defined in
the Tokenizer section of this document. The tokenizerfile parameter specifies
an optional or custom tokenizer definition that may be used.
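A sample invocation (the JAR path is an example; with no parameters, the default tokenizer options apply):

java -classpath c:\opentext\bin\otsearch.jar com.opentext.search.tokenizer.LivelinkTokenizer

Typing a line of text at the prompt prints each generated search token on its own line, which is a convenient way to confirm how a given tokenizer configuration splits your data.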

ProfileMetadata
This utility function loads a checkpoint file, and writes information about the metadata
in the checkpoint to the console. You may wish to redirect the console output to a file
to capture the data.
Parameters:

[-l (0|1|2)] [-values (true|false)] checkpointFile

Where:
l: profile level, where 0=High Level, 1=Field Level (Default), 2=Field Part Level
values: true requests the number of objects with values and the estimated total memory requirement
checkpointFile: file name of the checkpoint file to be profiled
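A hypothetical invocation (the package name is an assumption, since only the class name is shown above, and the checkpoint file name is an example), redirecting the console output to a file:

java -classpath c:\opentext\bin\otsearch.jar com.opentext.search.tools.ProfileMetadata -l 1 -values true checkpoint.dat > profile.txt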
Refer to sample output fragments for the profile levels below.

Level 0:
3872084 Total accounted for memory
NumOfDataIDs=10721
NumOfValidDataIDs=10719

Level 1:
5201 Global:userIDMap
1932 Global:userNameGlobals
3036 Global:userLoginGlobals
2060 Field(Text):OTDocCompany
1996 Field(Text):OTDocRevisionNumber
10668 Field(Text):OTVerCDate


0 Field(Text):OTReservedByName
3872084 Total accounted for memory
NumOfDataIDs=10721
NumOfValidDataIDs=10719

Level 2:
5201 Global:userIDMap
1932 Global:userNameGlobals
3036 Global:userLoginGlobals
1376 Field(Text [RAM]):OTDocCompany dictionary (mappingEntries=0 wsEntries=1
tokenEntries=3)
256 Field(Text [RAM]):OTDocCompany content
428 Field(Text [RAM]):OTDocCompany index
2060 Field(Text [RAM]):OTDocCompany combined
1312 Field(Text [RAM]):OTDocRevisionNumber dictionary (mappingEntries=0
wsEntries=0 tokenEntries=1)
256 Field(Text [RAM]):OTDocRevisionNumber content
428 Field(Text [RAM]):OTDocRevisionNumber index
1996 Field(Text [RAM]):OTDocRevisionNumber combined
10668 Field(Date):OTVerCDate combined
684 Field(Date):OTDateEffective combined
1312 Field(Text [RAM]):OTContentIsTruncated dictionary (mappingEntries=0
wsEntries=0 tokenEntries=1)
33920 Field(Text [RAM]):OTContentIsTruncated content
428 Field(Text [RAM]):OTContentIsTruncated index
35660 Field(Text [RAM]):OTContentIsTruncated combined
Field(UserID):OTAssignedTo combined

Field(Integer):OTTimeCompleted combined
0 Field(UserLogin):OTReservedByName combined
3872084 Total accounted for memory
NumOfDataIDs=10721
NumOfValidDataIDs=10719
If the parameter “values” is true, the information for each region is considerably more
detailed:

Field(Text):OTWFMapTaskUserData values= 18 valuesSize= 7133 memorySize= 4108
Field(Text):OTVersionName values= 10630 valuesSize= 10630 memorySize= 35788
Field(Text):OTHP values= 5 valuesSize= 948 memorySize= 4364
Field(Date):OTVerMDate values= 10630 valuesSize= 85040 memorySize= 10668
Field(Integer):OTOwnerID values= 10719 valuesSize= 53583 memorySize= 7424
Field(Text):OTUserGroupID values= 10706 valuesSize= 42824 memorySize= 35916

tools.index.DiskReadWriteSpeed
The search configuration files allow you to control several aspects of file I/O. Tuning
these for optimal performance can be difficult, since many factors are involved. The
DiskReadWriteSpeed utility can help by simulating disk performance using several of
the available configurations. For each mode, this utility performs 32678 iterations of
the test using an 8 KB block of data. Note that this information can help you tune disk
performance or identify system I/O bottlenecks, but it is not necessarily sufficient to
draw a firm conclusion regarding the optimal configuration.
Parameters:

(write|read|both) TestDirectory
The operations tested are:

Stream read/write using RandomAccessFile
Stream read/write using FileOutputStream
Stream read/write using NIO1 (FileOutputStream base)
Stream read/write using NIO2 (RandomAccessFile base)
Random read/write using RandomAccessFile
Random read/write using NIO1 (FileOutputStream base)
Random read/write using NIO2 (RandomAccessFile base)

NOTE: NIO operations pull from a ByteBuffer as opposed to using a static byte array.
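A sample invocation (the test directory is illustrative; point it at a directory on the disk you want to measure):

java -classpath c:\opentext\bin\otsearch.jar com.opentext.search.tools.index.DiskReadWriteSpeed both c:\index\iotest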

SearchClient
The SearchClient is a console application that allows you to interactively issue
commands to the Search Federator. The SearchClient is useful for determining that
search is working as expected, or running queries without having an application such
as Content Server running. All console output is expressed in UTF-8 characters.
Note that you might need to raise the default Search Federator timeout values when
using the SearchClient.
It is possible to use the SearchClient with an index that is also being used in a live
production system. In this situation, an open SearchClient consumes a search
transaction from the available pool, which may reduce the transactions available to
other applications.
Parameters:
-host SFHost -port SearchPort [-adminport SFAdminPort] [-time true] [-echo true] [-pretty true] [-csv true]

SFHost is the URI for the target Search Federator, connected on SearchPort. The
-time true parameter adds response time information to each response.
The -echo parameter adds the input command to the output. This is useful when
redirecting input from a file for batch operations, so you can associate the commands
with the responses. By default, echo is false.
The -pretty parameter uses an alternate formatting of GET RESULTS. The
alternate format does not adhere to the API spec, but is better formatted for human
readability when developing or debugging.


The -csv true parameter outputs the results in a form that can be easily imported
into a spreadsheet (comma-separated values). This feature is most useful when
redirecting input and output from/to files. If -pretty is specified, it takes precedence
over -csv.
The -adminport setting enables specific commands to be interpreted and sent to the
administration port of the Search Federator. These admin commands are:

Reload - reload settings from the search.ini file
Stats - get statistics
Sendshutdown - send a request to shut down the Search Federator
In operation, the console of the SearchClient supports search query operations plus
some special commands. Query operations include SELECT, GET RESULTS, and
similar functions. The special administrative operations are:

exit / quit [close the client]
close [close the socket without closing the client]
sleep # [make the client wait for # ms]
sendquit [shut down the SF via the search port]
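A hypothetical session (the package name and port are assumptions; substitute the Search Federator host and port from your search.ini):

java -classpath c:\opentext\bin\otsearch.jar com.opentext.search.tools.SearchClient -host localhost -port 9800 -time true -pretty true

Once connected, type query operations such as SELECT and GET RESULTS at the prompt, and exit to close the client.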

Repair BaseOffset Errors

To minimize disk space, the offsets (or pointers) to portions of the index are kept as
values relative to the start of an index fragment. In rare cases where network or disk
errors have occurred, it might be possible for the base values to become misaligned,
resulting in overlapping (and therefore incorrect) indices which will not load properly.
The current version of the Search Engine will normally catch these cases when they
occur and exit immediately, but older versions of the software would sometimes
propagate these errors, resulting in a badly formed index.
The search engine contains a number of utilities that can be used to repair an index
in this state. This is a multi-step process, and ultimately any identified overlapping
objects will need to be deleted and re-indexed.
The main utility is RepairSubIndexes, which performs the cut of overlapped
sub-indexes.
The second utility is DumpSubIndexesIDs. For an input partition, the
DumpSubIndexesIDs utility goes through every active sub-index in the partition and
prints out internal-external objectID pairs to a file. It also prints out all the
internal-external IDs in the deleteMask (i.e., items marked for deletion).
The third tool is DiffObjectIDFiles. It takes as input the output file of
RepairSubIndexes and an output file from DumpSubIndexesIDs, and produces a diff
of internal-external ID pairs. The use of these tools is explained below.

Problem Illustration
subindex1 has internal IDs = 1,2,3,4,5,6,8,9
subindex2 has internal IDs = 5,7,8,9,10

BaseOffset problem: subindex1 should only contain 1,2,3,4. Internal IDs 5,6,8 and 9
overlap with subindex2.
Fix: cut 5,6,8 and 9 from subindex1.
Items 5, 8, and 9 already exist as duplicates in subindex2. However, item 6 only
exists in subindex1, so the fix would remove the only instance of item 6 from the
index content.
After fix:

subindex1New: 1,2,3,4
subindex2: 5,7,8,9,10
Output of DumpSubIndexesIDs before fix: IDs for subindex1, subindex2 and the deleteMask
Output of RepairSubIndexes: a file which lists the objects removed from subindex1
(5, 6, 8 and 9) along with their external IDs for re-indexing.
Output of DiffObjectIDFiles: a file which lists only object 6, along with its external ID,
for re-indexing.
Output of DumpSubIndexesIDs after fix: IDs for subindex1New, subindex2 and the deleteMask

Repair Option 1
This approach requires about 30 to 60 minutes for a typical partition, and makes the
index usable as quickly as possible. However, there may be a lot of objects that
need to be reindexed.
Running the RepairSubIndexes utility:

java -classpath otsearch.jar com.opentext.search.tools.index.RepairSubIndexes -level x -config search.ini -indexengine firstEngine

where level x is 1, 2 or 4 (slowest but most detailed)
where config is the search.ini file
where firstEngine is the IE associated with the partition that you want fixed, as specified in the search.ini
example (assuming the new otsearch.jar is in the current directory):

prompt>java -Xmx1300M -Xss10M -classpath ./otsearch.jar com.opentext.search.tools.index.RepairSubIndexes -level 1 -config /opentext/config/search.ini -indexengine IEname0
The minimum search.ini sections necessary to run this tool are the IE section, the
Dataflow section, and the Partition section. Any file paths mentioned in these
sections should be adjusted to point to the actual location of your index partition
directory in your environment.


Steps
1. Back-up the partition on which you will be doing the repair. Make sure that there
are no active processes accessing this partition (IEs, SEs, etc) during the repair.
2. Run RepairSubIndexes at level 1, 2 or 4. These levels map directly to the
equivalent VerifyIndex level used internally by RepairSubIndexes to test the
partition.
• If the partition is healthy, the utility will produce a report and exit.
• If the utility detects a problem other than the “baseOffset” problem, it will warn and exit.
• Otherwise, it will perform the repair. This can take 30-60 minutes depending on the size of the sub-index that is being fixed. The utility will produce an output file bearing the name of the sub-index that was fixed. This file contains the internal-external objectID (OTObject region value) pairs that can be utilized for re-indexing.
3. Run RepairSubIndexes again to verify the health of the newly built partition. If
further repair is needed, the utility will begin the work. This should be repeated
until the partition is reported as being healthy.
4. Re-index the objects listed in the output file. This re-index must necessarily be a
delete and an add. An update operation will not be sufficient for this case. Note:
The deletes must be fully completed BEFORE the add operations are attempted.

Additional Comments:
• While running the tools, it is strongly recommended that the output be redirected out to a file for easier analysis (… > repairoutput.txt).
• During the repair process, it is possible to navigate inside the directory where the index under repair sits, and to observe the new sub-index fragment being written out, growing larger in size over time.
• At the end of the process, the new sub-index will be slightly smaller than the original sub-index.
• The output file is written to the same directory as the index that is being repaired (the same location where the new fragment is made).

Repair Option 2
This method typically requires about 45 minutes longer per partition, but minimizes
the number of objects which may require re-indexing.
Running the RepairSubIndexes utility:

java -classpath otsearch.jar com.opentext.search.tools.index.RepairSubIndexes -level x -config search.ini -indexengine firstEngine
• where level x is 1, 2 or 4 (slowest but most detailed)
• where config is the search.ini file
• where firstEngine is the IE associated with the partition that you want fixed, as specified in the search.ini
example (assuming the new otsearch.jar is in the current directory):

prompt>java -Xmx1300M -Xss10M -classpath ./otsearch.jar com.opentext.search.tools.index.RepairSubIndexes -level 1 -config /opentext/config/search.ini -indexengine IEname0
Running the DumpSubIndexesIDs utility:

java -classpath otsearch.jar;otsearch-util.jar com.opentext.search.tools.index.DumpSubIndexesIDs -config search.ini -indexengine firstEngine

NOTE: no level information needs to be specified, and the utility JAR is required.

example (assuming that both the new otsearch.jar and otsearch-util.jar are in the current directory):

prompt>java -Xmx1300M -Xss10M -classpath ./otsearch.jar;./otsearch-util.jar com.opentext.search.tools.index.DumpSubIndexesIDs -config /opentext/config/search.ini -indexengine IEname0

Running the DiffObjectIDFiles utility:

prompt>java -classpath otsearch.jar;otsearch-util.jar com.opentext.search.tools.index.DiffObjectIDFiles -dir /index -deleteIDsFile fileName -subIndexIDsFile fileName

• where dir is the index directory where all the output files were written out
• where deleteIDsFile is the output file made by the RepairSubIndexes utility for the sub-index that was fixed
• where subIndexIDsFile is the appropriate output file made by the DumpSubIndexesIDs utility. It is crucial to use the correct file; if subindex1 and subindex2 overlap and subindex1 was cut, then use the DumpSubIndexesIDs file for subindex2.

example:

prompt>java -Xmx1300M -Xss10M -classpath ./otsearch.jar;./otsearch-util.jar com.opentext.search.tools.index.DiffObjectIDFiles -dir /index -deleteIDsFile index12401_ReIndexIDs.log -subIndexIDsFile 21883.log_1299091996464


The minimum search.ini sections necessary to run this tool are the Index Engine
section, the Dataflow section, and the Partition section. Any file paths mentioned in
these sections should be adjusted to point to the actual location of your index
partition directory in your environment.
Steps
1. Back-up the partition on which you will be doing the repair. Make sure that there
are no active processes accessing this partition (IEs, SEs, etc) during the repair.
2. Run RepairSubIndexes at level 1, 2 or 4. These levels map directly to the
equivalent VerifyIndex level used internally by RepairSubIndexes to test the
partition.

• If the partition is healthy, the utility will produce a report and exit.
• If the utility detects a problem other than the “baseOffset” problem, it will warn and exit.
• Otherwise, it will perform the repair. This can take 30-60 minutes depending on the size of the sub-index that is being fixed. The utility will produce an output file bearing the name of the sub-index that was fixed. This file contains the internal-external objectID (OTObject region value) pairs that can be utilized for re-indexing.
3a. Run RepairSubIndexes again to verify the health of the newly built partition. If
further repair is needed, the utility will begin the work. This should be repeated
until the partition is reported as being healthy.
3b. Run the DumpSubIndexesIDs utility after repair. This will generate a
date-stamped file for each sub-index. The file contains all the internal-external IDs
for each sub-index.
3c. Run the DiffObjectIDFiles tool (this only takes a few minutes). This will produce a
smaller set of objects to re-index. This set contains objects whose content was
cut from the bad sub-index and whose content is NOT contained anywhere else
in the partition.
4. Re-index the objects listed in the output file. This re-index must necessarily be a
delete and an add. An update operation will not be sufficient for this case. Note:
The deletes must be fully completed BEFORE the add operations are attempted.

NOTE: While running the DumpSubIndexesIDs tool, the utility will
likely report that many regions were ‘removed’ from the index.
This is due to the mode in which the utility runs while hydrating
the metadata part. No regions are actually permanently removed,
and this should not cause alarm.

Additional Comments:
• While running the tools, it is strongly recommended that the output be redirected out to a file for easier analysis (… > repairoutput.txt).
• During the repair process, it is possible to navigate inside the directory where the index under repair sits, and to observe the new sub-index fragment being written out, growing larger in size over time.
• At the end of the process, the new sub-index will be slightly smaller than the original sub-index.
• The output file is written to the same directory as the index that is being repaired (the same location where the new fragment is made).

New Base Offset Errors

If a repaired sub-index (or an existing good index) generates a new index fragment
which has overlapping base offsets, this case will be detected when the Index Engine
next attempts to merge sub-indexes or dump the accumulator to a new sub-index.
At the point of detection, the IE will stop and the partition data will remain unchanged.
A new lock file, called “baseOffset.stop”, will also be written to the current index
directory.
While this file remains in that directory, the Index Engine will be unable to
start or restart. This ensures that the state of the index is not modified, and the files
should be collected for customer support to assist in determining the root cause of
the problem (the Index Engine and Update Distributor logs, and the IPools that
triggered the error).
If for some reason you must ignore this condition and continue:
• if the baseOffset.stop lock is already present in the index directory, delete it
• make an empty file called “ignoreBaseOffset.ig” and place it in the index directory
The IE should come up and ignore the baseOffset problem. WARNING: this WILL
generate an index with base offset errors that will later need to be repaired.
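For example, on a Linux system (a sketch only; substitute the actual index directory for your partition):

cd /opentext/index1/index
rm -f baseOffset.stop
touch ignoreBaseOffset.ig

The empty ignoreBaseOffset.ig file lets the Index Engine start despite the detected problem; the resulting index will still need the repair procedure described above.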


Index of Terms

!=, 73
", 84
$, 83
( ), 83
*, 83
., 82
.in, 100
.new, 100
?, 83
[ ], 82
[^ ], 83
^, 83
|, 83
+, 83
<, 73, 76
<=, 73
=, 73
>, 73
>=, 73
3-gram, 157
4-grams, 157
Accumulator, 138, 140
AddOrModify, 53
AddOrReplace, 52
addRegionsOrFields, 170
Aggregate-Text, 30
all, 73
AllowedNumConfigs, 193
AND, 71
AND-NOT, 69, 71
Arabic, 150
ASC, 89
ascending, 89
asterisk, 72
attribute, 79
ATTRIBUTE, 70
Attributes, 23
Backup, 208
Backup.ini, 215
baseOffset.stop, 282
Bloom Filter, 183
Boolean, 29
Boost, 108
Brackets, 72
Caching, 161
Case Sensitivity, 146
CHAIN, 31
Character Mapping, 149
checkpoint command, 169
Checkpoint Compression, 192
Chunk Size, 192
Cleanup Thread, 141, 142, 143
ConversionProcessPercent, 170
ConvertDateFormat, 273, 274
Currency, 29
Cursor, 57
Date Facets, 93, 94
DateTime, 30
de-duplication, 66
Default, 89
Default Search Regions, 115
Defining a Region, 17
defragmentation, 171
Delayed Commit, 192
DelayedCommitInMilliseconds, 192
Delete, 53
DeleteByQuery, 54
DESC, 89
descending, 89
Diff.ini, 212
DiffObjectIDFiles, 280
Disk Configuration, 192
Disk fragmentation, 189
Disk Performance, 190
Disk Storage, 35
DiskReadWriteSpeed, 275
DROP, 18
DumpSubIndexesIDs, 280
Email Domain, 133
Empty Regions, 19
Entire Value, 72
ENUM, 29
Error Codes, 267
EuroWordNet, 119
Existence, 90
Expand, 64
EXTRAFILTER, 100
Facet Memory, 96
Facet Security, 94
Facets, 91
File Monitoring, 196
FileCleanupIntervalInMS, 143
first, 80
Fragmentation, 171
Full.ini, 212
Garbage Collection, 196
Get Facets, 60
Get Regions, 66
Get Results, 58
Get Time, 65
getstatuscode, 168
getstatustext, 163
getsystemvalue, 169
hh, 64
High Ingestion, 177
Hit Highlight, 64
HIT LOCATIONS, 70
HyperV, 195
ignoreBaseOffset.ig, 282
IN operator, 86
Index Engines, 8
Integer, 27
Interchange Pools, 50
IOChunkBufferSize, 192
iPool errors, 49
iPools, 50
IPv6, 208
JNI, 5
Key, 24, 25
Lang File, 214
left-truncation, 75, 76
Like, 129
LogInterleaver, 273
Long, 27
Low Memory, 36
LQL, 69
marco, 170
Maximum, 81
Memory Sizing, 172
Memory Storage, 35
Memory Use, 201
Merge Thread, 144
Merge Tokens, 145, 146, 176
MergeSortCacheThreshold, 192
MergeSortChunkSize, 193
Merging Regions, 20, 21
MetadataValueSizeLimitInKBytes, 142
Minimum, 81
MODDeflateMode, 36
Modify, 53
ModifyByQuery, 54
Multiple CPUs, 194
MultiValueLimitDefault, 142
Non-Uniform Memory Access, 194
Nothing, 89
null characters, 18
Null Regions, 18
NUMA, 194
Object Ranking, 107, 108
OR, 69, 71
ORDEREDBY, 88
OT7, 4
otb, 24
OTChecksum, 39
OTContentLanguage, 42
OTContentStatus, 40
OTContentUpdateTime, 44
OTData, 38
OTIndexError, 43, 44
OTIndexLengthOverflow, 142
OTIndexMultiValueOverflow, 142
OTMeta, 38
OTMetadataChecksum, 39
OTMetadataUpdateTime, 44
OTObject, 26, 39
OTObjectIndexTime, 44
OTObjectUpdateTime, 45
OTPartitionMode, 42
OTPartitionName, 42
OTSQL, 56, 68
OTSTARTS, 56
OTURN, 26
ParallelGCThreads, 196
Part Numbers, 129
Partition Biasing, 181
Partitions, 13
phonetic, 75
polo, 170
port scanners, 197
ProfileMetadata, 274
PROX, 71
Purge, 205
Quarantine, 55
range, 75
RankingExpression, 89
Rawcount, 90
Read-Only, 14
Read-Write, 16
RebuildIndex, 273
regex, 75, 82
Region, 89
Region Names, 17
Regions, 17
registerWithRMIRegistry, 169
Regular Expressions, 82
Re-Indexing, 187
Relative Date, 85
Relevance, 102
Relevancy, 89
reloadSettings, 169
Removing Regions, 19
Renaming Regions, 20
RepairSubIndexes, 278, 279
Restore, 208
Retired, 15
Retrieval Storage, 37
RFC 2373, 13
RFC 952, 13
right-truncation, 75
runSearchAgent, 170
runSearchAgents, 170, 171
Search Engines, 9
Search Federator, 8
SearchClient, 276
Select, 56
SEQ, 89
SEQUENCE, 89
Server Names, 13
Set lexicon, 66
Set thesaurus, 66
Set uniqueids, 66
Shadow Regions, 130
shards, 13
Signature File, 221
SmartSharing, 196
Sockets, 9
Solaris Light Weight Processes, 196
Solaris Zones, 195, 196
Solid State Disks, 143
SOR, 71
span, 76, 77, 78
starting at, 58
stem, 75
STEMSET, 86
stop, 163
Substring, 126
SYNC, 100
TERMSET, 86
Text, 26
Text Operator, 135
thesaurus, 75
Thread Management, 197
Throttling Indexing, 193
Timestamp, 27
Tokenizer, 146
Type Ranking, 105
Update Distributor, 8
Update-Only, 14
User, 30
Values, 22
VerifyIndex, 270, 271
Virtual Machines, 195
Virus Scanning, 197
VMWare ESX, 195
WHERE, 70
WHERE Operators, 73
WHERE Regions, 79, 80
WHERE Relationships, 71
WHERE Terms, 72
WordNet, 119
XML Text, 38
XOR, 69, 71


About OpenText
OpenText enables the digital world, creating a better way for organizations to work with information, on premises or in the
cloud. For more information about OpenText (NASDAQ: OTEX, TSX: OTC) visit opentext.com.
Connect with us:

OpenText CEO Mark Barrenechea’s blog


Twitter | LinkedIn

www.opentext.com
Copyright © 2021 Open Text SA or Open Text ULC (in Canada).
All rights reserved. Trademarks owned by Open Text SA or Open Text ULC (in Canada).
