
Intel Threading Challenge 2009

String Matching

Hao, Jianan
School of Computer Engineering, Nanyang Technological University
justhjn@gmail.com

JUNE 6, 2009

ENVIRONMENT

OS: Windows XP Professional 64-bit
IDE: Visual Studio 2008
Compiler: Intel C++ Compiler 11.0.074 (EM64T platform)
Technique: TBB

Please compile the source in the Release (x64) configuration, or use the file StringMatching.exe directly.

PROBLEM DESCRIPTION
Write a threaded program to search a database of DNA sequences, represented as strings of characters, to find matches of other DNA subsequences. Both the database and the query strings are made up of only four characters: 'A', 'C', 'G', and 'T'. The output must report the location of an exact match within any given input DNA sequence for each input search query string. If the query string matches within multiple sequences within the database, each result must be reported; and if the query string matches multiple locations within the same database sequence, the earliest position that matches exactly must be reported. The file names for this problem (database file, queries file, output results) will be given on the command line.

File formats: The input database and query files will have the same format. Each sequence will be prefixed by a line starting with a greater-than character ('>') followed by a description of the origin of the sequence of not more than 131 characters. The sequence will then begin on the next line and continue for some number of lines. Each line will contain exactly 80 characters from the set 'A', 'C', 'G', and 'T', except the last line, which may hold fewer than 80 characters. Following this last line will be the descriptor line of the next sequence, until the end of the file has been reached, which will be signified by the descriptor ">EOF".

For each query string contained within the second input file, the output file should print the descriptor of the query sequence and the descriptors of any database sequences that contain a match, as well as the position within the database sequence of that exact match. If the query sequence string is not found within any database sequence, a message to that effect should be printed after the query descriptor.

Timing: Total execution time will be used for scoring. This will allow for encoding or compression of the input database sequences to be done during input (if your algorithm uses such a transformation).
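The file format described above can be read with a few lines of standard C++. The following is only a minimal sketch, not the reader used in this entry (which works on a memory-mapped file); the names Record and read_records are hypothetical and chosen for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One record of the input format: a descriptor line and its sequence.
// (Record and read_records are illustrative names, not from the entry.)
struct Record {
    std::string descriptor;  // text after the leading '>'
    std::string sequence;    // all sequence lines, concatenated
};

// Reads records until the ">EOF" descriptor or the end of the stream.
std::vector<Record> read_records(std::istream& in) {
    std::vector<Record> records;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate Windows line endings ("\r\n").
        if (!line.empty() && line.back() == '\r') line.pop_back();
        if (line.empty()) continue;
        if (line[0] == '>') {
            if (line == ">EOF") break;
            records.push_back(Record{line.substr(1), std::string()});
        } else if (!records.empty()) {
            // A sequence line: append it to the current record.
            records.back().sequence += line;
        }
    }
    return records;
}
```

The same function serves both the database file and the query file, since the two share one format.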

ALGORITHMS DESIGN
The most famous algorithm for string matching is probably KMP [1], which requires O(m+n) time, where m and n are the lengths of the database and query strings. However, in a special scenario such as DNA matching, where one unit is smaller than 1 byte (DNA has only four nucleotides), KMP is not that efficient, because it compares nucleotides one by one. We suggest that a DNA sequence be compressed into a packed string in which each element takes only 2 bits and represents one nucleotide. We then generate a prefix for each query string and compare it with the DNA database. Once we find a matching prefix, we continue by comparing the following nucleotides; otherwise, we move the database string forward. Certainly, the compression and shifting operations have overhead. But considering that there are several query strings, compression helps to 1) fit more of the strings into cache and 2) express more nucleotides in a fixed-length word. Thus, we expect these additional operations to be worthwhile.
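To make the packed-string idea concrete, here is a minimal scalar sketch that stores each character in 2 bits, 32 characters per 64-bit word. The bit layout (character i in bits 2i and 2i+1) and the names encode_base, pack_sequence and base_at are assumptions made for illustration; the entry itself packs with SSE and may order the bits differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// 2-bit code taken from bits 1..2 of the ASCII value: A=0, C=1, T=2, G=3.
inline std::uint64_t encode_base(char c) {
    return (static_cast<unsigned char>(c) >> 1) & 0x03u;
}

// Packs a sequence into 64-bit words, 32 characters per word.
// Character i is stored in bits 2*(i%32) .. 2*(i%32)+1 of word i/32.
std::vector<std::uint64_t> pack_sequence(const std::string& seq) {
    std::vector<std::uint64_t> words((seq.size() + 31) / 32, 0);
    for (std::size_t i = 0; i < seq.size(); ++i) {
        words[i / 32] |= encode_base(seq[i]) << (2 * (i % 32));
    }
    return words;
}

// Reads the 2-bit code of character i back out of the packed words.
inline std::uint64_t base_at(const std::vector<std::uint64_t>& words, std::size_t i) {
    return (words[i / 32] >> (2 * (i % 32))) & 0x03u;
}
```

With this layout, one 64-bit comparison checks 32 characters at once, which is the source of the expected gain over a character-by-character comparison.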

SERIAL IMPLEMENTATION
The algorithm has four main parts: 1) read the original sequence data from files; 2) compress the database sequences and generate a prefix for each query; 3) search for the queries in the database entries; 4) write the results to a file.

For the first part, we use a memory-mapped file to reduce the cost of I/O calls. As the file formats of the database and query data are identical, we implement one function to cover both of them. In this part, we also strip the line-ending characters, i.e. \r\n, copy the nucleotides to aligned memory, and pad the buffer so that the later compression step is simpler.

For the second part, we employ SSE instructions to compress 16 nucleotides into 1 double word. Results are combined first and then stored to another memory region.

For the third and core part, the main operations are shifting and comparing. Each time, we first compare the 64-bit prefix with the database at the current position (the prefix comparison). If they match, we further examine the remaining nucleotides; otherwise, we shift the database and do the comparison again. To avoid repeating the shift, for each database entry we compare all the queries at each shift position. This algorithm may perform poorly when the query lengths vary widely.

For the last part, we simply use fprintf to write the output.

We tried to enlarge the prefix to 128 bits but did not have time to eliminate the bugs in that version.
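The shift-and-compare loop of the third part can be sketched as follows. This is a simplified scalar illustration under stated assumptions, not the entry's code: the names Query, make_query and search_entry are hypothetical, positions are 0-based, the window is built directly from the characters rather than from a pre-compressed buffer, and the remainder of a long query is verified on the original characters.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

inline std::uint64_t encode_base(char c) {
    return (static_cast<unsigned char>(c) >> 1) & 0x03u;  // A=0, C=1, T=2, G=3
}

struct Query {
    std::string text;      // original characters
    std::uint64_t prefix;  // first min(32, length) characters, 2 bits each
    std::uint64_t mask;    // selects the bits of `prefix` that are in use
};

Query make_query(const std::string& text) {
    Query q{text, 0, 0};
    std::size_t n = text.size() < 32 ? text.size() : 32;
    for (std::size_t i = 0; i < n; ++i) {
        q.prefix |= encode_base(text[i]) << (2 * i);
        q.mask |= std::uint64_t{3} << (2 * i);
    }
    return q;
}

// Returns, for each query, the earliest 0-based match position in db, or -1.
std::vector<long> search_entry(const std::string& db, const std::vector<Query>& queries) {
    std::vector<long> first(queries.size(), -1);
    std::uint64_t window = 0;  // the 32 characters starting at the current position
    for (std::size_t i = 0; i < 32 && i < db.size(); ++i)
        window |= encode_base(db[i]) << (2 * i);
    for (std::size_t pos = 0; pos < db.size(); ++pos) {
        // All queries are tested against the same window before it is shifted.
        for (std::size_t k = 0; k < queries.size(); ++k) {
            const Query& q = queries[k];
            if (first[k] >= 0 || q.text.empty() || pos + q.text.size() > db.size()) continue;
            if ((window & q.mask) != q.prefix) continue;  // prefix comparison
            if (q.text.size() <= 32 ||
                db.compare(pos + 32, q.text.size() - 32, q.text, 32, std::string::npos) == 0)
                first[k] = static_cast<long>(pos);
        }
        // Shift the window by one character and bring in the next one, if any.
        window >>= 2;
        if (pos + 32 < db.size()) window |= encode_base(db[pos + 32]) << 62;
    }
    return first;
}
```

Because every query is tested against the window before it moves, the database is shifted only once per position regardless of the number of queries.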

PARALLELISM
We use TBB to parallelize the search operation by applying parallel_for over the database entries. For input data with more than 16 database entries, since there is no data dependency between threads apart from possible false sharing when results are written, the program can be expected to utilize all the processors on the target platform. We measured a speedup of about 3.5 on a quad-core processor.
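The per-entry decomposition can be illustrated with the sketch below. To keep it compilable without any extra library, it uses std::thread with a shared atomic counter as a stand-in for parallel_for, and std::string::find as a stand-in for the packed search; both substitutions, and the name search_all, are assumptions made for illustration only.

```cpp
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// results[e][q] = earliest 0-based position of queries[q] in db[e], or npos.
std::vector<std::vector<std::size_t>> search_all(
        const std::vector<std::string>& db,
        const std::vector<std::string>& queries,
        unsigned num_threads) {
    std::vector<std::vector<std::size_t>> results(
        db.size(), std::vector<std::size_t>(queries.size(), std::string::npos));
    std::atomic<std::size_t> next{0};  // next database entry to hand out

    auto worker = [&]() {
        // Each entry is claimed by exactly one thread, so no two threads
        // ever write to the same results[e] row.
        for (std::size_t e = next.fetch_add(1); e < db.size(); e = next.fetch_add(1)) {
            for (std::size_t q = 0; q < queries.size(); ++q)
                results[e][q] = db[e].find(queries[q]);
        }
    };

    if (num_threads == 0) num_threads = 1;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < num_threads; ++t) pool.emplace_back(worker);
    for (std::thread& th : pool) th.join();
    return results;
}
```

Giving each database entry its own result row keeps the threads independent; rows that happen to share a cache line are the only place where false sharing could still appear.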

OPTIMIZATIONS
Several techniques are used to optimize the code.

A. Branch Elimination
Converting A, T, C, G into 2 bits with straightforward code may introduce many branches, which significantly disrupt the CPU pipeline. Thus, we use shift and mask operations instead: SHR 1 followed by AND 0x03 on each character.
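The shift-and-mask conversion works because bits 1 and 2 of the ASCII codes of the four letters are all different. A minimal sketch, with a branching version shown only for comparison (the function names are hypothetical):

```cpp
#include <cstdint>

// Branch-free: bits 1 and 2 of the ASCII code are distinct for A, C, G, T.
//   'A' = 0x41 -> 0, 'C' = 0x43 -> 1, 'T' = 0x54 -> 2, 'G' = 0x47 -> 3
inline std::uint8_t encode_base(char c) {
    return static_cast<std::uint8_t>((static_cast<unsigned char>(c) >> 1) & 0x03);
}

// Branching equivalent, shown only for comparison.
inline std::uint8_t encode_base_branchy(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'T': return 2;
        default:  return 3;  // 'G'
    }
}
```

Note that the resulting codes are not in alphabetical order (G maps to 3 and T to 2), which does not matter for exact matching as long as the database and the queries use the same mapping.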

B. Data Affinity
We place related data together to improve cache performance and to meet the alignment requirement of the SSE aligned move instructions.

C. Data Compression
Data compression, though it introduces overhead, greatly benefits cache performance and increases the hit rate of the prefix comparison. For example, without compression a 64-bit prefix can represent only 8 nucleotides, compared to 32 nucleotides with compression.

D. SSE Instructions
SSE instructions improve memory throughput and allow several elements to be processed in parallel. We use SSE instructions to compress the nucleotides.
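One possible way to compress 16 characters into one 32-bit double word with SSE2 intrinsics is sketched below, next to a scalar reference. This is an illustration under assumptions, not the entry's actual instruction sequence: the function names are hypothetical, character i is placed in bits 2i and 2i+1, and an unaligned load is used so that the sketch works on any buffer (the entry uses aligned memory).

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Scalar reference: character i goes to bits 2*i .. 2*i+1.
inline std::uint32_t pack16_scalar(const char* s) {
    std::uint32_t out = 0;
    for (int i = 0; i < 16; ++i)
        out |= static_cast<std::uint32_t>((static_cast<unsigned char>(s[i]) >> 1) & 3u) << (2 * i);
    return out;
}

// SSE2 version: encodes 16 characters at once, then merges neighbours.
inline std::uint32_t pack16_sse2(const char* s) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
    // Per byte: (c >> 1) & 3. The 16-bit shift leaks a bit across bytes,
    // but the mask removes it.
    v = _mm_and_si128(_mm_srli_epi16(v, 1), _mm_set1_epi8(3));
    // Merge byte pairs: 2 + 2 -> 4 bits in each 16-bit lane.
    v = _mm_and_si128(_mm_or_si128(v, _mm_srli_epi16(v, 6)), _mm_set1_epi16(0x000F));
    // Merge 16-bit pairs: 4 + 4 -> 8 bits in each 32-bit lane.
    v = _mm_and_si128(_mm_or_si128(v, _mm_srli_epi32(v, 12)), _mm_set1_epi32(0x000000FF));
    // Merge 32-bit pairs: 8 + 8 -> 16 bits in the low part of each 64-bit lane.
    v = _mm_or_si128(v, _mm_srli_epi64(v, 24));
    // Combine the two 16-bit halves into the final double word.
    std::uint32_t lo = static_cast<std::uint32_t>(_mm_cvtsi128_si32(v)) & 0xFFFFu;
    std::uint32_t hi = static_cast<std::uint32_t>(_mm_extract_epi16(v, 4)) & 0xFFFFu;
    return lo | (hi << 16);
}
```

Both functions read exactly 16 characters, which is why the input buffer is padded to a multiple of 16 before compression.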

E. Unrolling Loop
Unrolling a loop saves the cost of its jump operations. We manually unrolled the compression loop so that the results are combined in a 128-bit register.

FURTHER POTENTIAL IMPROVEMENTS


First, we may use SSE for the prefix comparison to increase the hit rate. Second, I/O and compression can be organized as a pipeline to gain parallelism, especially for large files. Third, we may read the query file first and thus start searching as soon as the first database entry has been read. Fourth, the shift operation is currently written in C++, and we may replace it with the SHRD instruction. In fact, we tried to use assembly for this but failed to debug the error. Fifth, task scheduling can be improved to balance the workload. Sixth, we can group query strings of similar length and search them together rather than searching all the entries.
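For reference, the effect that SHRD would provide for the shift can be written in portable C++ as below. This is a sketch under assumptions: the window is taken to be two 64-bit words shifted by one 2-bit element at a time, and the names Window128 and shift_one_base are hypothetical.

```cpp
#include <cstdint>

// A 128-bit window held as two 64-bit halves (illustrative layout).
struct Window128 {
    std::uint64_t lo;  // elements 0..31
    std::uint64_t hi;  // elements 32..63
};

// Shifts the whole window right by one 2-bit element. The update of `lo`
// is what a single SHRD lo, hi, 2 computes: `lo` is shifted right and its
// vacated top bits are filled from the low bits of `hi`.
inline void shift_one_base(Window128& w) {
    w.lo = (w.lo >> 2) | (w.hi << 62);
    w.hi >>= 2;
}
```

An optimizing compiler may already emit SHRD for this pattern, so the benefit of hand-written assembly would need to be measured.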

CONCLUSION
String matching is a classical problem and is easy to run in parallel. Our approach is based on the assumption that a database substring that matches the prefix will match the whole query entry. When this assumption holds, our code is fast and able to utilize multiple cores. In addition, many optimization tricks are used in the code, and more potential optimizations are available.

REFERENCES
1. Knuth-Morris-Pratt algorithm, Wikipedia, http://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm.
