

Research Statement
Yongchao Liu∗

Advances in high-throughput sequencing (HTS) technologies have enabled cost-affordable sequencing of


whole genomes, population-scale variation screening, and mass spectrometry-based proteomics. These HTS
technologies have propelled the development of a myriad of biological applications, and have significantly
altered the landscape of genomic and genetic research. Along with the ever-increasing throughput and
ever-decreasing sequencing cost, these technologies are continually propelling related research into the era
of big data. Taking the newest Illumina HiSeq X Ten system as an example, this sequencer enables the
sequencing of over 18,000 human genomes per year at a price of < $1,000 per 30× genome, successfully breaking
the thousand-dollar human genome barrier. In other words, this scale of sequencing throughput corresponds
to the production of a total of 90 billion base pairs (bp) per human genome and about 1.62 quadrillion bp
per year, by only a single sequencer. Under these circumstances, efficient management and processing of big
biological data has already become a very challenging proposition and has established a strong need for new
and sophisticated computational solutions.
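The arithmetic behind these throughput figures is easy to verify (assuming an approximately 3 Gbp human genome; the constants below are illustrative round numbers, not vendor specifications):

```python
# Back-of-the-envelope check of the sequencing throughput figures above.
GENOME_SIZE_BP = 3_000_000_000    # approximate haploid human genome length
COVERAGE = 30                     # 30x sequencing depth
GENOMES_PER_YEAR = 18_000         # quoted annual capacity of a single system

bp_per_genome = GENOME_SIZE_BP * COVERAGE         # 90 billion bp per genome
bp_per_year = bp_per_genome * GENOMES_PER_YEAR    # ~1.62 quadrillion bp per year

print(f"{bp_per_genome:,} bp per genome; {bp_per_year:,} bp per year")
```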

Acceleration via Parallel Processing


Conventional acceleration technologies employ shared-memory multiprocessors and distributed-memory CPU
clusters to address a broad range of computational challenges. However, the scale of such systems is
considerably limited by important factors such as power and cooling, and this has driven the world-wide adoption
of modern many-core architectures, including GPUs and Intel Many-Integrated-Core (MIC) processors, as
accelerators for data-intensive and compute-intensive applications. These accelerators enable massive
computational parallelism and revolutionize interconnect communication and memory
access, potentially reducing the computation time of complete applications by orders of magnitude.
I have been working on designing and implementing novel parallel algorithms for efficient analysis of
big biological data by employing a diversity of acceleration technologies, including accelerators (GPUs and
MIC processors) and clusters (CPU/GPU/MIC processor clusters). In addition to bioinformatics-specific
algorithms and applications (to be described below), I am researching and developing general-purpose parallel
algorithms that can be used in a diversity of applications and are represented by

• Provable bijective functions between job identifier and coordinate spaces for symmetric all-pairs com-
putation workload distribution both in shared-memory and distributed-memory systems [1];

• A lightweight dynamic workload distribution approach based on atomic operations for sparse linear
algebra (won the Best Paper Award in the prestigious IEEE ASAP 2015 conference) [2];

• A unified cache hit rate metric for performance measurement of cache-enabled computational kernels
[3];

• A fast parallel scan algorithm by taking advantage of global L2 cache coherency on NVIDIA GPUs [4].
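To illustrate the first item, one standard bijection between a linear job index k and an unordered pair (i, j) with i < j can be sketched as follows (a sketch of the general idea under common conventions, not necessarily the exact formulation used in [1]):

```python
from math import isqrt

def pair_to_index(i, j):
    """Map a pair (i, j) with 0 <= i < j to a unique linear job index."""
    return j * (j - 1) // 2 + i

def index_to_pair(k):
    """Inverse mapping: recover (i, j) from the linear job index k."""
    j = (1 + isqrt(8 * k + 1)) // 2
    i = k - j * (j - 1) // 2
    return i, j

# The bijection lets each worker enumerate its own pairs from a flat,
# contiguous range of job indices, with no coordination needed.
n = 6
total = n * (n - 1) // 2
assert all(index_to_pair(pair_to_index(i, j)) == (i, j)
           for j in range(n) for i in range(j))
assert sorted(pair_to_index(i, j)
              for j in range(n) for i in range(j)) == list(range(total))
```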

In addition, I am assembling a set of general-purpose parallel building blocks to construct a compute
unified library for heterogeneous computing, as the landscape of computing becomes increasingly
heterogeneous.
∗ School of Computational Science & Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA; Email:

yliu860@gatech.edu; Web: http://cc.gatech.edu/∼yliu. Last updated: October 12, 2016.


Big Biological Data Analytics


My research philosophy is to inspire technological innovation for health care and serve people
around the globe. Based on this belief, I have been working on developing novel parallel algorithms and
software tools for the analysis of big biological data for the benefit of human individuals. These algorithms
and tools streamline fundamental and computationally challenging biological issues with parallel computing,
via in-depth exploration of acceleration technologies, and cover a broad spectrum of research topics,
including protein sequence database search, multiple protein sequence alignment, comparative genomics, motif
discovery, sequence indexing and pattern search, end-to-end next-generation sequencing (NGS) data analysis
pipeline (including assembly, error correction, alignment and SNV/indel calling), and machine-learning-driven
gene co-expression networks. These works have garnered substantial interest from the relevant
communities and are well recognized in academia and industry. In the following, I highlight some of my
featured works (the number of citations of the corresponding papers can be obtained from my Google Scholar
profile).

• CUDASW++ [5, 6, 7]: a popular and state-of-the-art protein sequence database search algorithm,
rated by NVIDIA Corporation as a popular GPU-accelerated application. This
tool pioneered a partitioned vectorized maximal-sensitivity local alignment algorithm based on GPU
computing, in its second version [5] and investigated for the first time a GPU SIMD approach that
employs CUDA PTX SIMD video instructions to gain more data parallelism beyond the SIMT execution
model, in its third version [7]. Moreover, the ideas proposed in this tool have been extended to support
optimal alignment backtracking in our recent work [8].

• SWAPHI [9]: the first parallel algorithm to accelerate maximal-sensitivity protein database search
on multiple MIC processors sharing the same host.
• SWAPHI-LS [10]: the first parallel and distributed maximal-sensitivity genome comparison algorithm
on MIC processor clusters. This work was recommended for Best Paper Award in the prestigious
IEEE Cluster 2014 conference.

• MSAProbs [11, 12]: a parallel and state-of-the-art protein multiple sequence alignment tool based on
Hidden Markov Models.

• mCUDA-MEME [13, 14]: a popular parallel motif discovery algorithm based on GPU computing,
rated by NVIDIA Corporation as a popular GPU-accelerated application. This tool is
a highly parallel formulation and implementation of the popular MEME motif discovery algorithm on
GPU clusters, and has been deployed in our CompleteMOTIFS [15] pipeline collaboratively done with
Harvard Medical School.

• CUSHAW [16, 17, 18, 19]: a popular software suite for NGS short-read and long-read alignment,
rated by NVIDIA Corporation as a popular GPU-accelerated application. CUSHAW
[16] is the first release of this software package and pioneered a complete alignment pipeline for NGS
paired-end reads using GPU computing. CUSHAW2 [17] (with CUSHAW2-GPU [20] being the GPU-
accelerated version) is the second release and pioneered the use of maximal exact matches as seeds for fully
gapped alignment based on the well-known seed-and-extend paradigm. CUSHAW3 [18] is the third
release (with CUSHAW3-UPC [19] being the distributed-memory version) and investigated a hybrid
seeding approach to improve alignment quality by incorporating different seed types.

• DecGPU [21]: the first parallel and distributed error correction algorithm for NGS reads based on CUDA
and MPI parallel programming models. This work was reported by GenomeWeb, an independent
online news organization serving the global community of scientists, technology professionals, and
executives who use and develop the latest advanced tools in molecular biology research and molecular
diagnostics.


• Musket [22]: a new and efficient multistage k-mer-spectrum-based corrector for Illumina short-read data.
Inspired by Musket, another of our error correction algorithms, Hector [23], pioneered a novel homopolymer-
spectrum-based approach to handle homopolymer insertions and deletions, which are the dominant
sequencing errors in 454 pyrosequencing reads.

• PASHA [24]: a parallel and memory-efficient NGS short-read assembler using de Bruijn graphs, which is
the most memory-efficient assembler in the literature for complete de novo assembly of human genomes,
while maintaining highly competitive speed and assembly quality.

• SNVSniffer [25, 26]: an efficient and integrated caller identifying both germline and somatic SNVs/indels
from NGS data based on Bayesian models.

• All-Food-Seq [27]: a software pipeline for quantitative measurement of species composition in foodstuff
material, which takes advantage of untargeted deep sequencing of total metagenomic DNA and uses
a sequence read counting approach to quantify species proportions. This work was reported by
Australian Food News, the premier news website for the food industry in Australia.

• ParaBWT [28]: a leading parallel and space-efficient construction algorithm for Burrows-Wheeler trans-
form and suffix array on big genome data.

• LightPCC [1, 29]: the first parallel and distributed software package for pairwise correlation/dependence
computation on MIC processor clusters.

In particular, some of my algorithms remain top performers in their respective research areas. Moreover,
I have also worked on other problems such as k-mer indexing [30], alignment-free distance estimation [31] and
phylogenetic tree reconstruction [32]. Nevertheless, as technologies continually advance, we still need to
incessantly tackle the challenges arising from growing data scale and increasing compute and memory demands.
Therefore, continued effort should be devoted to developing groundbreaking techniques to advance research.

Machine Learning
Machine learning has become pervasive in our daily life in the past few years, and we may constantly be
using machine learning technologies without even noticing their existence. I am very interested in applying
machine learning techniques to solve important and challenging computational problems in bioinformatics,
and have already completed several works based on machine learning.
Firstly, I developed MSAProbs [11, 12], a state-of-the-art protein multiple sequence
alignment tool based on Hidden Markov Models. A number of recent benchmarking studies have
consistently ranked MSAProbs as a top-performing aligner in terms of accuracy. While yielding high alignment
accuracy, this algorithm comes at the expense of relatively long runtimes for large-scale input datasets
(e.g. thousands of protein sequences), and we therefore developed an MPI-based distributed-memory parallel
version [33]. By using distributed-memory systems, we managed to overcome the high memory overhead barrier
for multiple alignment of (tens of) thousands of protein sequences; and by scaling to hundreds of cores, we
achieve much faster speed on large-scale protein sequence datasets.
Secondly, as NGS-enabled accurate discovery of genetic variations is critical yet challenging, I developed
SNVSniffer [25, 26], an efficient and integrated caller identifying both germline and somatic SNVs/indels from
NGS data based on Bayesian models. For germline variant calling, we model allele counts per site to follow
a multinomial conditional distribution. For somatic variant calling, we rely on matched tumor-normal
sample pairs from the same individual and introduce a hybrid subtraction and joint sample analysis approach by
modeling tumor-normal allele counts per site to follow a joint multinomial conditional distribution.
This algorithm demonstrates highly competitive accuracy with superior speed compared to leading callers.
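To convey the flavor of such a multinomial model, the following is a deliberately simplified sketch of genotype calling at a single site, not SNVSniffer's actual formulation; the error rate, genotype set and read counts are hypothetical values chosen for illustration:

```python
from math import lgamma, log

def log_multinomial(counts, probs):
    """Log-pmf of a multinomial: log[ n!/prod(c_i!) * prod(p_i^c_i) ]."""
    n = sum(counts)
    out = lgamma(n + 1)
    for c, p in zip(counts, probs):
        out += c * log(p) - lgamma(c + 1)
    return out

# Toy genotype model over alleles (A, C, G, T); err is a hypothetical
# per-base sequencing error rate. For a heterozygote such as AC, each read
# draws from one of the two alleles with probability 0.5, with errors
# spread uniformly over the other three bases.
err = 0.01
genotypes = {
    "AA": [1 - err, err / 3, err / 3, err / 3],
    "AC": [0.5 - err / 3, 0.5 - err / 3, err / 3, err / 3],
    "CC": [err / 3, 1 - err, err / 3, err / 3],
}

counts = [14, 11, 0, 1]  # observed A/C/G/T read counts at the site
best = max(genotypes, key=lambda g: log_multinomial(counts, genotypes[g]))
print(best)  # the heterozygous genotype AC maximizes the likelihood here
```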
Thirdly, pairwise association measurement is an important operation in the search for meaningful insights
within a dataset, examining potentially interesting relationships between the variables of the dataset.
In bioinformatics, one typical application is to mine gene co-expression relationships via gene expression
data, which can be realized by query-based gene expression database search or gene co-expression network
analysis. In this regard, I developed LightPCC [1, 29], the first parallel and distributed software package
for pairwise correlation/dependence computation on MIC processor clusters. As of today, this library
has incorporated Pearson’s product-moment correlation coefficient, Spearman’s rank correlation coefficient,
Kendall’s rank correlation coefficient, distance correlation and mutual information (to be publicly released).
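The core computation in such a library is conceptually simple; a serial reference for all-pairs Pearson correlation over the rows of an expression matrix (illustrative only, not LightPCC's parallel implementation) can be written as:

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # assumes neither row is constant

def all_pairs_pearson(matrix):
    """Correlation for every unordered pair of rows (genes)."""
    return {(i, j): pearson(matrix[i], matrix[j])
            for j in range(len(matrix)) for i in range(j)}

expr = [  # toy expression matrix: 3 genes x 4 samples
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with gene 0
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with gene 0
]
corr = all_pairs_pearson(expr)
print(round(corr[(0, 1)], 6), round(corr[(0, 2)], 6))
```

The all-pairs loop is exactly the symmetric workload that the bijective index mapping distributes across processors.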

Future Research
Compact Computing for Big Data As biological data volume increases exponentially and data compression
is widely used in major data repositories such as NCBI and EBI, I expect that directly
operating on compressed data will become commonplace in the future. However, parallel processing of
compressed data is a challenging proposition for both shared-memory and distributed-memory systems; the
challenges include constrained random access to uncompressed content, independent decompression
of data blocks, balanced distribution of data, memory and computation, and the adaptation of existing
algorithms and applications to the requirements of compressed data processing. Based on these concerns,
I am conceiving a new computing concept, tentatively named Compact Computing.
In general, Compact Computing targets robust, flexible and reproducible parallel processing of big data and
consists of three core components, in principle: (i) tightly-coupled architectures, (ii) compressive and elastic
data representation, and (iii) efficient, scalable and service-oriented algorithms and applications. In this
context, component 1 can comprise conventional CPUs, a diversity of accelerators (e.g. FPGAs, GPUs and
MIC processors) and fast interconnect communication facilities; component 2 concentrates on data
structures and formats that enable efficient on-the-fly streaming compression and decompression; and component
3 targets the development of algorithms and applications that enable robust and efficient processing of big
data streams in parallel.
Albeit promising, Compact Computing must address challenges from some important factors,
e.g. algorithmic design and the diversity of acceleration technologies. For algorithmic design, I have
already done some preliminary work on data indexing/compression. One example is a provable space-efficient
multi-threaded algorithm for the construction of Burrows-Wheeler transform and suffix array for big genome
data [28] and this algorithm has been a state-of-the-art parallel big genome data indexing algorithm on shared-
memory systems since its debut in 2013. Besides making every effort to address algorithmic challenges, I
plan to concentrate on the following challenges posed by the diversity of acceleration technologies: unified
performance benchmarking metrics to measure different kernels targeting distinct architectures and unified
parallel programming models supporting multiple types of processing units.
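For reference, the Burrows-Wheeler transform mentioned above can be defined by a deliberately naive construction; real tools such as ParaBWT [28] use far more space- and time-efficient parallel algorithms rather than direct suffix sorting:

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via a naive sorted suffix array.

    The sentinel must sort before every character of the input text.
    """
    s = text + sentinel
    # Suffix array: start positions of all suffixes in lexicographic order.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT is the character cyclically preceding each sorted suffix.
    return "".join(s[i - 1] for i in sa)

print(bwt("banana"))  # annb$aa
```

The transform groups identical characters into runs, which is what makes BWT-based indexes both compressible and searchable.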

Combinatorial Disease Diagnostic Analysis While providing opportunities, the use of NGS
technologies in health care faces some specific challenges. In my opinion, there are at least four challenges
that need to be addressed: (i) new algorithms and methodologies are needed to infer significant
information from patients’ biological data, (ii) how to reduce the computational cost of
complex data analysis, (iii) how to combinatorially interpret the analysis results from diverse data, such as
single-genomic DNA/RNA sequence data, genotyping data, gene expression data, and metagenomic data,
in order to draw more convincing and confident diagnoses, and (iv) how to guarantee the correctness
of the analysis, since misinterpretation could ultimately result in serious diagnostic errors. Based on these
considerations, I aim to investigate a set of core algorithms for combinatorial diagnostic analysis of diseases,
which mainly rely on machine learning approaches for causal inference, pathogen discovery and progression
prediction, and take advantage of multifaceted analysis of a diversity of genomic and genetic
information from individuals in order to provide stronger guarantees of diagnostic accuracy. By utilizing
various data and providing automated yet cost-effective computation, this research has the potential to in-
spire technological innovation for health care and thereby augment the insight, discovery and evidence-based
decision support for health services.


Software and Platform Integration As mentioned above, I developed and released a set of open-source
parallel software tools for big biological data analysis. These tools cover a set of key and closely related
problems, and some of them are already among the state-of-the-art tools in their respective
research areas. Accordingly, my future work will include (i) systematic integration of parallel algorithms and
software tools to constitute a robust, HPC-enabled, big-data-oriented software platform specifically
designed for flexible and reproducible analysis of big biological data, (ii) the exploration of promising business
models to establish a self-financing scientific regime, and (iii) building partnerships with industry to
bring in external scholarships/grants/gifts as well as keep pace with changing user demands in
the real world. These efforts on software and platform integration/commercialization as well as external
industrial collaboration will facilitate sustainable software development by means of ensuring high code
quality and long-term maintenance, and would be expected to have substantial impact on relevant scientific
and business communities.
In addition, I will make every effort to win grants from well-known national funding agencies such as
NSF, NIH, DOE and DARPA. Finally, as technologies continually advance and new problems incessantly
emerge, we are propelled to continually conceive new algorithms and approaches to efficiently address data-
driven challenges faced by big data analytics, bridge the gap between theory and practice,
and connect parallel computing with data science & engineering.
In the future, I will further consolidate the partnerships with current collaborators and seek more oppor-
tunities to have interdisciplinary/multidisciplinary involvement.

References
[1] Yongchao Liu, Tony Pan, and Srinivas Aluru. Parallel pairwise correlation computation on intel xeon phi
clusters. In 28th International Symposium on Computer Architecture and High Performance Computing,
2016. in press.

[2] Yongchao Liu and Bertil Schmidt. Lightspmv: Faster csr-based sparse matrix-vector multiplication
on cuda-enabled gpus. In 2015 IEEE 26th International Conference on Application-specific Systems,
Architectures and Processors, pages 82–89. IEEE, 2015.

[3] Yongchao Liu and Bertil Schmidt. Lightspmv: faster cuda-compatible sparse matrix-vector multiplica-
tion using compressed sparse rows. Journal of Signal Processing Systems, 2016. under review.

[4] Yongchao Liu and Srinivas Aluru. Lightscan: faster scan primitive on cuda compatible manycore pro-
cessors. Journal of Parallel and Distributed Computing, 2016. under review.

[5] Yongchao Liu, Douglas Maskell, and Bertil Schmidt. Cudasw++: optimizing smith-waterman sequence
database searches for cuda-enabled graphics processing units. BMC Research Notes, 2(1):73, 2009.

[6] Yongchao Liu, Bertil Schmidt, and Douglas Maskell. Cudasw++ 2.0: enhanced smith-waterman protein
database search on cuda-enabled gpus based on simt and virtualized simd abstractions. BMC Research
Notes, 3(1):93, 2010.

[7] Yongchao Liu, Adrianto Wirawan, and Bertil Schmidt. Cudasw++ 3.0: accelerating smith-waterman
protein database search by coupling cpu and gpu simd instructions. BMC Bioinformatics, 14(1):117,
2013.

[8] Yongchao Liu and Bertil Schmidt. Gswabe: faster gpu-accelerated sequence alignment with optimal
alignment retrieval for short dna sequences. Concurrency and Computation: Practice and Experience,
27(4):958–972, 2015.

[9] Yongchao Liu and Bertil Schmidt. Swaphi: Smith-waterman protein database search on xeon phi co-
processors. In 2014 IEEE 25th International Conference on Application-specific Systems, Architectures
and Processors, pages 184–185. IEEE, 2014.


[10] Yongchao Liu, Tuan-Tu Tran, Felix Lauenroth, and Bertil Schmidt. Swaphi-ls: Smith-waterman al-
gorithm on xeon phi coprocessors for long dna sequences. In 2014 IEEE International Conference on
Cluster Computing, pages 257–265. IEEE, 2014.

[11] Yongchao Liu, Bertil Schmidt, and Douglas L Maskell. Msaprobs: multiple sequence alignment based on
pair hidden markov models and partition function posterior probabilities. Bioinformatics, 26(16):1958–
1964, 2010.

[12] Yongchao Liu and Bertil Schmidt. Multiple protein sequence alignment with msaprobs. Multiple Sequence
Alignment Methods, pages 211–218, 2014.

[13] Yongchao Liu, Bertil Schmidt, Weiguo Liu, and Douglas L Maskell. Cuda-meme: accelerating motif
discovery in biological sequences using cuda-enabled graphics processing units. Pattern Recognition
Letters, 31(14):2170–2177, 2010.

[14] Yongchao Liu, Bertil Schmidt, and Douglas L Maskell. An ultrafast scalable many-core motif discov-
ery algorithm for multiple gpus. In 2011 IEEE International Symposium on Parallel and Distributed
Processing Workshops and Phd Forum, pages 428–434. IEEE, 2011.

[15] Lakshmi Kuttippurathu, Michael Hsing, Yongchao Liu, Bertil Schmidt, Douglas L Maskell, Kyungjoon
Lee, Aibin He, William T Pu, and Sek Won Kong. Completemotifs: Dna motif discovery platform for
transcription factor binding experiments. Bioinformatics, 27(5):715, 2011.

[16] Yongchao Liu, Bertil Schmidt, and Douglas L Maskell. Cushaw: a cuda compatible short read aligner
to large genomes based on the burrows-wheeler transform. Bioinformatics, 2012.

[17] Yongchao Liu and Bertil Schmidt. Long read alignment based on maximal exact match seeds. Bioinfor-
matics, 28(18):i318–i324, 2012.

[18] Yongchao Liu, Bernt Popp, and Bertil Schmidt. Cushaw3: sensitive and accurate base-space and color-
space short-read alignment with hybrid seeding. PloS one, 9(1):e86869, 2014.

[19] Jorge González-Domı́nguez, Yongchao Liu, and Bertil Schmidt. Parallel and scalable short-read align-
ment on multi-core clusters using upc++. PloS one, 11(1):e0145490, 2016.

[20] Yongchao Liu and Bertil Schmidt. Cushaw2-gpu: empowering faster gapped short-read alignment using
gpu computing. Design & Test, IEEE, 31(1):31–39, 2014.

[21] Yongchao Liu, Bertil Schmidt, and Douglas Maskell. Decgpu: distributed error correction on massively
parallel graphics processing units using cuda and mpi. BMC Bioinformatics, 12(1):85, 2011.

[22] Yongchao Liu, Jan Schröder, and Bertil Schmidt. Musket: a multistage k-mer spectrum-based error
corrector for illumina sequence data. Bioinformatics, 29(3):308–315, 2013.

[23] Adrianto Wirawan, Robert S Harris, Yongchao Liu, Bertil Schmidt, and Jan Schröder. Hector: a parallel
multistage homopolymer spectrum based error corrector for 454 sequencing data. BMC Bioinformatics,
15(1):131, 2014.

[24] Yongchao Liu, Bertil Schmidt, and Douglas Maskell. Parallelized short read assembly of large genomes
using de bruijn graphs. BMC Bioinformatics, 12(1):354, 2011.

[25] Yongchao Liu, Martin Loewer, Srinivas Aluru, and Bertil Schmidt. Snvsniffer: An integrated caller
for germline and somatic snvs based on bayesian models. In 2015 IEEE International Conference on
Bioinformatics and Biomedicine, pages 83–90. IEEE, 2015.

[26] Yongchao Liu, Martin Loewer, Srinivas Aluru, and Bertil Schmidt. Snvsniffer: an integrated caller for
germline and somatic single-nucleotide and indel mutations. BMC Systems Biology, 10(2):47, 2016.


[27] Fabian Ripp, Christopher Felix Krombholz, Yongchao Liu, Mathias Weber, Anne Schäfer, Bertil
Schmidt, Rene Köppel, and Thomas Hankeln. All-food-seq (afs): a quantifiable screen for species
in biological samples by deep dna sequencing. BMC Genomics, 15(1):639, 2014.

[28] Yongchao Liu, Thomas Hankeln, and Bertil Schmidt. Parallel and space-efficient construction of burrows-
wheeler transform and suffix array for big genome data. IEEE/ACM Transactions on Computational
Biology and Bioinformatics, 13(3):592–598, 2016.

[29] Yongchao Liu, Tony Pan, and Srinivas Aluru. Parallelized kendall’s tau coefficient computation on many-
integrated-core processors. In 31st IEEE International Parallel & Distributed Processing Symposium,
2016. submitted.

[30] Tony Pan, Patrick Flick, Chirag Jain, Yongchao Liu, and Srinivas Aluru. Kmerind: A flexible parallel
library for k-mer indexing of biological sequences on distributed memory systems. In 7th ACM Conference
on Bioinformatics, Computational Biology, and Health Informatics. ACM, 2016. in press.

[31] Sharma V Thankachan, Sriram P Chockalingam, Yongchao Liu, Alberto Apostolico, and Srinivas Aluru.
Alfred: a practical method for alignment-free distance computation. Journal of Computational Biology,
2015.

[32] Yongchao Liu, Bertil Schmidt, and Douglas L Maskell. Parallel reconstruction of neighbor-joining trees
for large multiple sequence alignments using cuda. In IEEE International Symposium on Parallel &
Distributed Processing, pages 1–8. IEEE, 2009.

[33] Jorge González-Domı́nguez, Yongchao Liu, Juan Touriño, and Bertil Schmidt. Msaprobs-mpi: parallel
multiple sequence aligner for distributed-memory systems. Bioinformatics, page btw558, 2016.
