You are on page 1of 14

Textomics: A Dataset for Genomics Data Summary Generation

Mu-Chun Wang1,∗ , Zixuan Liu2,∗ , Sheng Wang2


1
University of Science and Technology of China
2
Paul G. Allen Scholl of Computer Science and Engineering, University of Washington
amoswang2000@mail.ustc.edu.cn
{zucksliu, swang}@cs.washington.edu

Abstract software, the last step of summarizing discovery


is still largely performed manually, substantially
Summarizing biomedical discovery from ge- slowing down the progress of scientific discovery
nomics data using natural languages is an
(Hwang et al., 2018). A plausible solution is to
essential step in biomedical research but is
mostly done manually. Here, we introduce automatically summarize the discovery from ge-
Textomics, a novel dataset of genomics data nomics data using neural text generation, which
description, which contains 22,273 pairs of has been successfully applied to radiology report
genomics data matrices and their summaries. generation (Wang et al., 2021; Yuan et al., 2019)
Each summary is written by the researchers and clinical notes generation (Melamud and Shiv-
who generated the data and associated with ade, 2019; Lee, 2018; Miura et al., 2021).
a scientific paper. Based on this dataset, we
In this paper, we study this novel task of gen-
study two novel tasks: generating textual sum-
mary from a genomics data matrix and vice erating sentences to summarize a genomics data
versa. Inspired by the successful applications matrix. Several excisting approaches demonstrate
of k nearest neighbors in modeling genomics encouraging results in generating short phrases to
data, we propose a kNN-Vec2Text model to describe functions of a set of genes (Wang et al.,
address these tasks and observe substantial im- 2018; Zhang et al., 2020; Kramer et al., 2014).
provement on our dataset. We further illustrate However, our task is fundamentally different from
how Textomics can be used to advance other these: the input of our task is a matrix that contains
applications, including evaluating scientific pa-
tens of thousands of genes, which could be noisier
per embeddings and generating masked tem-
plates for scientific paper understanding. Tex- than a set of selected genes; the outputs of our task
tomics serves as the first benchmark for gen- are sentences instead of short phrases or controlled
erating textual summaries for genomics data vocabularies.
and we envision it will be broadly applied to To study this task, we curate a novel dataset, Tex-
other biomedical and natural language process- tomics, by integrating data from PMC, PubMed,
ing applications.1 and Gene Expression Omnibus (GEO) (Edgar et al.,
2002) (Figure 1). GEO is the default database
1 Introduction
repository for researchers to upload their genomics
Modern genomics research has become increas- data matrices, such as gene expression matrices
ingly automated through being roughly divided into and mutation matrices. Each genomics data ma-
three sequential steps: next-generation sequenc- trix in GEO is a sample by feature matrices, where
ing technology produces a massive amount of ge- samples are from often humans or mice that are
nomics data, which are in turn processed by bioin- sequenced together to study a specific biological
formatics tools to identify key variants and genes, problem, and features are genes or variants. Each
and, ultimately, analyzed by biologists to summa- matrix is also associated with a few sentences that
rize the discovery (Goodwin et al., 2016; Kanehisa are written by researchers to summarize this data
and Bork, 2003). In contrast to the first two steps matrix. After pre-processing, we obtain 22,273 ma-
that have been automated by new technologies and trix summary pairs, spanning 9 sequencing technol-
∗ ogy platforms. Each matrix has on average 2,475
Equal Contribution
1
The link to access our code: https://github.com/ samples and 22,796 features. Each summary has
amos814/Textomics on average 46 words.

4878
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics
Volume 1: Long Papers, pages 4878 - 4891
May 22-27, 2022 2022
c Association for Computational Linguistics
a Building Dataset b Textomics Dataset c Tasks and Applications d kNN-Vec2Text Model
gene text
Dataset Matrix Summary Platform 1 Vec2Text query neaest
expression summary

Analysis of uterine mi This aim of neighbour


Analysis of uterine mi Analysis of
Analysis of uterine
croenvironment at gemi this study is
croenvironment at ge
necroenvironment
expression levelat....ge < , , uterine mi
croenviro...
, > to figure...
distance
ne transformation ....

... ...
nomics level such ....
Text2Vec
Platform 2 This aim of This We find The Analysis
study ... bone ... role of ... of ...
This data identifies th this study is
Analysis of uterine mi
e Analysis
ensembleofofuterine
stress mi Measurem to figure... Embedding
croenvironment at ge
croenvironment at ge ent of emb
responsive genes ....
bone protein as the ....
ne transformation ....
< , , ryo gene...
, > Paper similarity
Matrix-Paper
Link
... similar?
reweighted
Self Attention Module

...
embedding

Platform 9 Paper understanding


This study is about
This aim of [MASK] ...
epithelium Language Model
glucose
< ,
this study is
, to figure... , > ...
fibroblast
Output: Analysis of uterine microenvironment at
gene expression level. The hypothesis tested in ...
Platform 1 Platform 9

Figure 1: Flow chart of Textomics. a. Genomics data matrices and summaries are collected from GEO. Scientific
papers are collected from PMC and PubMed. Each data matrix is associated with a unique summary and a unique
scientific paper in Textomics. b. Textomics is divided into 9 sequencing platforms, spanning over various species.
Data matrices in the same platforms share the same features and can therefore be used to train a machine learning
model. c. Textomics can be used as the benchmark for a variety of tasks, including Vec2Text, Text2Vec, measuring
paper similarity, and scientific paper understanding. d. kNN-Vec2Text is developed to address the task of Vec2Text,
by first constructing a reference summary using similar genomics data matrices and then unifying these summaries
to generate a new summary.

We further propose a novel approach to automati- then discussed the related works and the potential
cally generate a summary from a genomics data ma- direction of future works (section 7 and 8).
trix, which is highly noisy and high-dimensional. k
nearest neighbor (kNN) approaches have obtained 2 Textomics Dataset
great success in genomics data by capturing the hid-
We collected genomics data matrices from Gene
den modules within it (Levine et al., 2015; Baran
Expression Omnibus (GEO) (Edgar et al., 2002).
et al., 2019). The key idea of our method is to find k
The feature of each data matrix represents the ex-
nearest summaries according to the genomics data
pression level of a gene or other genomic measure-
similarity and then exploit the attention mechanism
ments of a variant (typically real numbers). The
to convert these k nearest summaries to a new sum-
sample of each matrix is an experimental subject,
mary. Our method obtained substantial improve-
such as an experimental animal or a patient. Each
ment in comparison to baseline approaches. We
data matrix is associated with an expert-written
further illustrated how we can generate a genomics
summary, describing this data matrix. We obtained
data matrix from a given summary, offering the
in total 164,667 matrix-summary pairs, spanning
possibility to simulate genomics data from textual
12,219 sequencing platforms.
description. We then introduced how Textomics
can be used as a novel benchmark for measuring Samples in different platforms have different
scientific paper similarity and evaluating scientific features. However, data matrices belonging to
paper understanding. To the best of our knowledge, the same sequencing platform are from the same
Textomics and kNN-Vec2Text together build up species and share the same set of features, thus
the first large-scale benchmark for genomics data can be used together for model training. To further
summary generation, and can be broadly applied alleviate the missing feature problem, we kept the
to a variety of natural language processing tasks. top-20000 features with a lower missing rate and
filtered out the rest. We further selected 9 platforms
Our paper is written as follows: We first in- with the average lowest rate of missing value and
troduce the Textomics dataset (section 2) and de- the largest amount of matrix-summary pairs to guar-
scribe the Text2Vec and Vec2Text tasks (section antee the quality and the scale of the dataset. After
3). We then propose a baseline model and kNN- all, we imputed the resulted data matrices using
Vec2Text model for Vec2Text task (section 4.1) averaging imputation across different features.
and the model for Text2Vec task. We then evaluate Data matrices belonging to the same platform
our method (section 5) and provide two applica- have distinct samples (e.g., patient samples col-
tions (section 6) based on Textomics dataset. We lected from two hospitals). To make them com-

4879
parable and provide fixed-size features for ma- platforms in Supplementary Table S1.
chine learning models, we empirically used a five- Some of the data matrices are associated with a
number summary to represent each data matrix. In scientific paper, which describes how the authors
particular, we calculated the smallest, the first quar- generated and used the data. Therefore, the data
tile, the median, the third quartile, and the largest matrix and the summary can be used to help embed
value of each feature across samples in a specific these papers. We additionally retrieved these pa-
data matrix. We then concatenated these values of pers from PubMed and PMC databases according
all features, resulting in a 100k-dimensional fea- to the paper titles enclosed in GEO. We obtained
ture vector for each data matrix. Compared with the full text for those 7,691 freely accessible ones
other statistics such as mean, median, and mode of (Supplementary Table S1). We will introduce
the features, the five number statistics maintain the two applications that jointly use scientific papers
patterns hidden in the raw matrices better. This vec- and matrix-summary pairs in section 6.
tor will be finally used as the input to the machine
learning model. 3 Task Description

All genomics data summaries we collected were We aim to accelerate genomics discovery by gen-
written by the biologists who generate the corre- erating a textual summary given the five-number
sponding genomics data matrices. Therefore, these summary-based vector of a genomics data matrix.
summaries can properly reflect biologists’ descrip- We refer to the five-number summary-based vector
tions of their datasets. Since the summary is the as a gene feature vector for simplicity.
first piece of information that one can learn about Specifically, consider textual summary domain
the dataset, authors often tend to clearly character- D and gene feature vector domain V, let D =
dist
ize their dataset in the summary. However, directly {DD , DV } = {(di , vi )}N i=1 ∼ P(D, V) be a
leveraging raw data of these summaries is question- dataset containing N summary-vector pairs sam-
able. On the syntactic level, the lengths of sum- pled from the joint distribution of these two do-
nd
mary for each sample are different and comments mains, where di , hd1i , d2i , ..., di i i denotes a to-
are often used in genomics descriptions. In order ken sequence and vi ∈ Rlv denotes the gene fea-
to align our data and leverage the advanced Trans- ture vector. Here dji ∈ C, C is the vocabulary.
former model that requires fix-length sentences as We now formally define two cross-domain gener-
well as simplifies the structure of the summary, we
empirically removed the text in the brackets and Spearman correlation=0.45
truncated the summaries length to 64 words (the
1.0
Summary embedding similarity

percentage of summaries with a length greater than


64 is 41%). On the semantic level, there could 0.9
be non-informative summaries such as a simple 0.8
density

sentence ‘Please see our data below’ and some


0.7
outliers that are substantially different from other
summaries. In order to increase the quality of these 0.6
genomics data summaries, we manually inspected 0.5
and removed the non-informative summary and ex- 0.4
cluded the outliers based on the pairwise BLEU
(Papineni et al., 2002) scores through a progres- 0.3
sive automated procedure. Specifically, for every 0.0 0.2 0.4 0.6 0.8 1.0
summary, we treated it as the query text and cal- Gene feature vector similarity
culated the pairwise BLEU-1 scores with all other
Figure 2: Density plot showing the Spearman cor-
summaries, filtered out those median that is lower
relation between text-based similarity (y-axis) and
than 0.09, and then re-applied the procedure with vector-based similarity (x-axis) on sequencing platform
a higher threshold of 0.13. Finally, each of the 9 GPL6246. Each dot is a pair of data samples. A larger
platforms contains 471 matrix-summary pairs on Spearman correlation indicates this Encd is more accu-
average, presenting a desirable number of train- rate in embedding scientific papers.
ing samples to develop data summary generation
models. We summarized the statistics of these 9 ation tasks, Vec2Text and Text2Vec, based on our

4880
dataset. Given a gene feature vector vi , Vec2Text great challenges to effectively learn that projection.
aims to generate a summary di that could best de- k-nearest neighbors models have been extensively
scribe this vector vi ; given a textual summary di , used as the solution to overcome such issues in
Text2Vec aims to generate the gene feature vector genomics data analysis (Levine et al., 2015; Baran
vi that di describes. Since we are studying a novel et al., 2019). Therefore, one plausible solution
task on a novel dataset, we first examined the fea- is to explicitly leverage summaries from similar
sibility of this task. To this end, we obtained the gene feature vectors to improve the generation.
dense representation of each textual summary us- Inspired by the encouraging performance in us-
ing the pre-trained SPECTER model (Cohan et al., ing k-nearest neighbors (kNN) in seq2seq models
2020) and use these representations to calculate (Khandelwal et al., 2020, 2021) and genomics data
a summary-based similarity between each pair of analysis (Levine et al., 2015; Baran et al., 2019),
summaries. We also calculated a vector-based sim- we propose to convert the Vec2Text problem to
ilarity based on the gene feature vector using the a Text2Text problem according to the k-nearest
cosine similarity. We found that these two simi- neighbor of each vector.
larity measurements show a substantial agreement For a given gene feature vector g, we use ei ∈ R
(Figure 2, Supplementary Table S2). After fil- to denote its Euclidean distance to another gene
tering out the outliers, all 9 platforms achieved a feature vectors vi in D. We then select the sum-
Spearman correlation greater than 0.2, suggesting maries of k samples that have the minimum Eu-
the possibility to generate textual summary from clidean distances as the reference summary list
the gene feature vector and vice versa. t̃ = [dj1 , ..., djk ], where jm ∈ {1, 2, ..., |D|} de-
notes the index of ordered summaries w.r.t the Eu-
4 Methods
clidean distance, i.e, ej1 ≤ ej2 ≤ ... ≤ ej|D| .
4.1 Vec2Text In addition to alleviating the noise in genomics
We first introduce a baseline model that tries to en- data using the reference summary list (Levine et al.,
code gene feature vectors into the semantic embed- 2015; Baran et al., 2019), our method explicitly
ding space and then decodes it to generate text. The converts the Vec2Text problem to a Text2Text prob-
baseline model contains a word embedding func- lem, and can thus seamlessly incorporate many
tion Emb(.), a gene feature vector encoder Encv (.) advanced pre-trained language models into our
and a decoder Decv (.). Given a gene feature vector framework. The resulted problem we need to solve
vi , the encoder will first embed the data into a se- is a k sources to one target generation problem.
(0) One naive solution is to concatenate the k ref-
mantic representation space si = Encv (vi ), and
then the decoder will start from this representation erence summaries together. However, this con-
for the text generation. The generation process is catenation will make the source text much longer
(j) than the target text and how to order each sum-
autoregressive. It generates j-th word dˆi and its
(j) mary during concatenation also remains unclear.
embedding si as:
(j) (<j) (<j) Instead, we propose to transform this problem
P (dˆ |s
i i ) = Decv (s i ), j = 1, ..., nd . (1)
i into k one-to-one generation problem and then
Then we sample the next word and obtain its em- use attention-based strategy to fuse them. Con-
bedding as: cretely, let nj = max{nj1 , ..., njk } be the maxi-
(j) (j) (j) sample (j) (<j)
s = Emb(dˆ ), dˆ
i i i ∼ P (dˆ |s i ). (2)
i
mum length among all the reference summaries.
This model is trained using the following loss func- We first get the representation of each summary
(1) (n )
tion: xjm = Emb(djm ) = hxjm , ..., xjmj i for m =
|DV | nd (i)
1 X Xi (j) (<j) 1, ..., k. Here xjm denotes the vector embedding of
Lbaseline = − logP (dˆi |si ).
|DV | the i-th word in m-th summary. We construct fixed-
i=1 j=1
(3) length reference summaries by padding after the
end of each summary with length less than nj . We
4.1.1 kNN-Vec2Text Model then utilize self-attention module (SA) (Vaswani
The baseline model attempts to learn an encoder et al., 2017) to get the aggregated embedding of
that projects a gene feature vector to a semantic rep- each reference with their embeddings as well as
resentation. However, the substantial noise and the the gene feature vector distance ei . Let Qr , Kr , Vr
high-dimensionality of the gene feature vector pose be the query, key, value matrices of embedding

4881
sequence r = hr(1) , ..., r(lr ) i, we have: els, we used a one layer MLP network as its en-
SA(r) = Attention(Qr , Kr , Vr ). (4) coder, and tested with different decoder structure,
We then calculate the attention score as following: including canonical Transformer (decoder of T5)
(1) (nj ) (Vaswani et al., 2017), GPT-2 (Radford et al., 2019),
ajm = SA(hxjm , ..., xjm k i), (5)
and Sent-VAE (Bowman et al., 2016). For kNN-
scj = SA(hej1 · aj1 , ..., ejk · ajk i), (6) Vec2Text, we directly used both the encoder and
k
where scj = [scj1 , ..., scjk ] ∈ R . Here we used the decoder of T5 (Raffel et al., 2020), one of the
a 2-layer self attention scheme to first acquire the state-of-the-art Transformer style models. we set
aggregated feature of each summary and then cal- k = 4 and τ = 0.1 as this setting achieved the
culate the attention score based on that. The fi- best empirical performance, though it is worth not-
nal score is then calculated based on the attention ing that our model is robust on the choices of k
scores and temperature τ as: (from 1 to 4) and τ (from 0 to 1). For all 9 plat-
exp(τ · scjm ) forms, we reported the average performance under
wjm = Pk . (7)
l=1 exp(τ · scjl )
5-fold cross validation to evaluate the robustness
Then, we aggregate embedding sequences by tak- of our method. The results of BLEU-1 score (Pa-
ing weighted averages: pineni et al., 2002) are summarized in Figure 3a.
k We found that kNN-Vec2Text substantially outper-
(l) (l)
X
x̃j = wjm xjm , l = 1, ..., nj . (8) formed other methods by a large margin. Specif-
m=1 ically, kNN-Vec2Text obtained a 0.206 BLEU-1
Let P<l,x (d) = PθLM (d(l) |d(<l) , x), 0 < l < nd score on average while none of the other three meth-
be the probability distribution of d(l) output by the ods achieved an average BLEU-1 score greater than
language model θLM conditioned on the sequences 0.150. The prominent performance of our method
of the embedding vectors x and the first l − 1 se- demonstrates the effectiveness of using a k-nearest-
quence tokens. We feed the aggregated embedding neighbor approach to convert the Vec2Text problem
sequences into the language model to reconstruct to a Text2Text problem.
the summary d using an autoregressive-based loss To further understand the superior performance
function: of the kNN-Vec2Text model, we presented a case
nd
X X logP<l,x̃j (d) study in Table 1. In this case study, the generated
LkNN-Vec2Text = − . (9)
|DD | summary is highly accurate compared to the ground
d∈DD l=1
truth summary. By examining the summaries of
4.2 Text2Vec the 4 nearest neighbors in the gene feature vec-
tor space, we found that the generated summary
We model the reverse problem of generating the
is composed of short spans from each individual
gene feature vector v from a textual summary
neighbor, again indicating the advantage of using
d as a regression problem. Our model is com-
a k-nearest neighbor for this task. Our method
posed with a semantic encoder Encd (.) and a read-
leveraged an attention mechanism to unify these
out head MLP(.). Specifically, the encoder will
four neighbors, thus offering an accurate genera-
embed the textual summary into dense represen-
tion. We also observed consistent improvement of
tation x = Encd (d), and the readout head will
our method over comparison approaches on other
map the representation to the gene feature vector
metrics and summarized the results in Supplemen-
v̂ = MLP(x). Then we train this model by min-
tary Table S3.
imizing the rooted mean squared errors (RMSE):

5.2 Text2Vec
s
1 X
Lv = ||vˆi − vi ||22 . (10)
|DV | We next used the Text2Vec task to illustrate how our
vi ∈DV
dataset can be used to compare the performance of
5 Results different pre-trained language models. In particular,
we compared a recently proposed scientific paper
5.1 Vec2Text embedding method SPECTER (Cohan et al., 2020),
To evaluate the performance of kNN-Vec2Text which has demonstrated prominent performance
on the task of Vec2Text, we compared it to the in a variety of scientific paper analysis tasks, with
baseline models in 4.1. For the baseline mod- SciBERT (Beltagy et al., 2019), BioBERT (Lee

4882
a b
0.3

Improvement over baseline


kNN-Vec2Text SPECTER
GPT-2 70%
SciBERT
Sent-VAE
SentBERT
Transformer
BLEU-1 score

0.2
50% BioBERT
BERT

0.1
30%

0.0 10%
GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL
96 6244 10558 6887 198 570 6246 1261 13534 6887 10558 1261 96 198 570 6244 6246 13534

Textomics platform Textomics platform

Figure 3: Performance on Vec2Text (a) and Text2Vec (b) using Textomics as the benchmark. a. Bar plot comparing
our method kNN-Vec2Text with existing approaches on the ask of Vec2Text across 9 platforms in Textomics.
b. Bar plot comparing the performance of different scientific paper embedding methods across 9 platforms in
Textomics.

Table 1: A case study of the generated text by kNN-Vec2Text. Summaries of the four nearest neighbors in the
input space are shown. The generated text is composed of short spans from the four different neighbors (colored
in red). The BLEU-1 score for this example is 1 (prefect).

Analysis of B16 tumor microenvironment at gene expression level. The hypothesis tested in the present
Neighbor 1:
study was that Tregs orchestrated the immune response triggered in presence of tumors.
This study aims to look at gene expression profiles between wildtype and Bapx1 knockout cells of the gut
Neighbor 2:
in a E12.5 mouse embryo.
The role of bone morphogenetic protein 2 in regulating transformation of the uterine stroma during embryo
Neighbor 3:
implantation in mice was investigated by the conditional ablation of Bmp2 in the uterus using the mouse.
Measurement of specific gene expression in clinical samples is a promising approach for monitoring the
Neighbor 4:
recipient immune status to the graft in organ transplantation.
Analysis of uterine microenvironment at gene expression level. The hypothesis tested in the present study
Generated:
was that Tregs orchestrated the immune response triggered in presence of embryo.
Analysis of uterine microenvironment at gene expression level. The hypothesis tested in the present study
Truth:
was that Tregs orchestrated the immune response triggered in presence of embryo.

et al., 2020) and SentBERT (Wang and Kuo, 2020) overall performance among all five methods, sug-
and the vanilla BERT (Devlin et al., 2019). While gesting the advantage of separately modeling the
the other language models directly take the token title and the abstract when embedding scientific
sequence as the input, SPECTER model needs to papers.
take both the abstract and the title. To make a fair
comparison, we concatenated the title and the sum- 6 Applications
mary as the input for models other than SPECTER.
For all 9 platforms, we reported the average perfor- 6.1 Evaluate paper embedding via Textomics
mance under 5-fold cross validation. We further Embedding scientific papers is crucial to effectively
implemented a simple averaging baseline approach identify emerging research topics and new knowl-
that predicts the vector for a test summary accord- edge from scientific literature. To this end, many
ing to the average vectors of training samples. This machine learning models have been proposed to
baseline does not utilize any textual summary and embed scientific papers into dense embeddings
can thus help us assess the effect of using textual and then applied these embeddings for a variety
summary information in this task. We used RMSE of downstream applications (Cohan et al., 2020;
to evaluate the performance of all methods. We Lee et al., 2020; Wang and Kuo, 2020; Beltagy
reported the RMSE improvement of each method et al., 2019; Devlin et al., 2019). However, there
over the averaging baseline model in Figure 3b. is currently limited golden standard that can mea-
We found that all methods outperform the baseline sure the similarity between two papers. As a result,
approaches by gaining at least 15% improvement, existing approaches use surrogate metrics such as
indicating the importance of considering textual citation relationship, keywords, and user activities
summary in this task. SPECTER achieved the best to evaluate their paper embeddings (Cohan et al.,

4883
a b
0.6 0.6
SPECTER Abstract

Spearman correlation
Spearman correlation SentBERT Introduction
0.5 0.5 Method
SciBERT
BioBERT Result
0.4 BERT 0.4 Conclusion

0.3 0.3

0.2 0.2

0.1 0.1

0.0 0.0
GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL GPL
198 13534 6244 570 10558 6246 1261 6887 96 198 13534 6244 570 10558 6246 1261 6887 96

Textomics platform Textomics platform


Figure 4: Performance on using Textomics as the benchmark to evaluate scientific paper embeddings. (A). Bar plot
showing the comparison on embedding scientific papers using Textomics as the benchmark. (B). Bar plot showing
the comparison on SPECTER embedding of different paper sections using Textomics as the benchmark.

2020; Chen et al., 2019; Wang et al., 2019). conclusion. We first compared different paper em-
Textomics can be used to measure these paper bedding methods using the abstract of a paper. The
embedding approaches by examining the consis- five embedding methods we considered are intro-
tency between the embedding-based paper similar- duced in section 5.1. Since SPECTER takes both
ity and the embedding-based summary similarity the title and paragraph as the input we used the first
since both the paper and the summary are written sentence of the summary as a pseudo-title when
by the same authors. In particular, for a pair of encoding the summary. The results are summa-
summaries di , dj ∈ DD , let ti , tj be the text (e.g., rized in Figure 4a. We found that SPECTER was
abstracts) extracted from their corresponding scien- substantially better than other methods on 8 out
tific papers. Let Encd be the encoder of the paper of the 9 platforms. SPECTER is specifically de-
embedding method we want to evaluate. We first veloped to embed scientific papers by processing
get their embeddings as: the title and the abstract separately, whereas other
pre-trained language models simply concatenated
sdi , sdj = Encd (di ), Encd (dj ) ∈ Rls , (11)
the title and the abstract. The superior performance
ls
sti , stj = Encd (ti ), Encd (tj ) ∈R . (12) of SPECTER suggests the importance of separately
We then compute the pairwise Euclidean distance modeling paper title and abstract when embedding
between all pairs of summaries and all pairs of scientific papers. SentBERT obtained the best per-
paper text as: formance among four pre-trained language mod-
els, partially due to its prominent performance in
v
u ls
uX (k) (k)
s = t (s − s )2 di,j ∈ R, (13) sentence-level embedding. We further noticed that
di dj
k=1 the relative performance among different methods
v
u ls is largely consistent with the previous work evalu-
uX (k) (k) ated on other metrics (Cohan et al., 2020), demon-
sti,j = t (s − s )2 ti tj ∈ R. (14)
strating the high-quality of Textomics.
k=1

To evaluate the quality of the encoder Encd , we After observing the superior performance of
can calculate the Spearman correlation between the SPECTER, we next investigated which section of
pairwise summary similarity and the pairwise text the paper can be best used to assess paper similarity.
similarity. A larger Spearman correlation means Although existing paper embedding approaches of-
the summary / textual contents of two samples in ten leverage the abstract for embedding, other sec-
the pair are better aligned with each other, which tions, such as introduction and results might also be
indicates this Encd is more accurate in embedding informative, especially for papers describing a spe-
scientific papers. As a proof-of-concept, we ob- cific dataset or method. We thus applied SPECTER
tained the full text of 7,691 papers in our dataset to embed five different sections of each scientific
from the freely accessible PubMed Central. We paper and used Textomics to evaluate which section
segmented each paper into five sections, which in- can best reflect paper similarity. We observed a con-
cluded abstract, introduction, method, result and sistent improvement of using the abstract section

4884
in comparison to other paper sections (Figure 4B), the paragraph text ti . To generate Cbio , we lever-
which is consistent with the intuition that the ab- aged a recently developed biological terminology
stract represents a good summary of the scientific dataset Graphine (Liu et al., 2021), which provides
paper, again indicating the reliability of using Tex- the biological phrases spanning 227 categories. We
tomics to evaluate paper embedding methods. selected 10 categories that can produce the largest
number of masked sentences in Textomics. We
6.2 Scientific paper understanding manually filtered ambiguous words and stop words.
Creating masked sentences and then filling in these On average, each category contains 317 keywords.
masks can examine whether the machine learning We used a fully connected neural network to per-
model has properly understood a scientific paper form the multi-class classification task. The input
(Yang et al., 2019; Guu et al., 2020; Ghazvininejad feature is the concatenation of the masked summary
et al., 2019; Bao et al., 2020; Salazar et al., 2020). embedding and the paragraph embedding. We used
However, one challenge in such research is how SPECTER to derive these embeddings as it has
to generate masked sentences that are relevant to obtained the best performance in our previous anal-
a given paper while also ensuring the answer is ysis. The results are summarized in Figure 5. We
enclosed in the paper. Our dataset could be used observed improved accuracy on all ten categories,
to automatically generate such masked sentences which are much better than the 0.4% accuracy by
using the summary, which is highly relevant to the random guessing, indicating the usefulness of our
paper but also not overlapped with the paper. In benchmark in scientific paper understanding. Fi-
particular, we can mask out keywords from the nally, we found that the performance of each cate-
summary and then use this masked summary as gory varied across different platforms, suggesting
the question and let a machine learning model to the possibility to further improve the performance
find the answer from the non-overlapping scientific by jointly learning from all platforms.
paper. Let Cbio be a dictionary that contains bio-
logical keywords we want to mask out from the 7 Related work
summary, (di , ti ) be a pair of textual summary and
Our task is related to existing works that take struc-
paragraph text extracted from its corresponding sci-
(j) tured data as the input and then generate the un-
entific paper. If the j-th word wi = di ∈ Cbio in
structured text. Different input data modalities and
the summary belongs to Cbio , our proposed task is
related datasets have been considered in the litera-
to predict which word in Cbio is the missing word
ture, including text triplets in RDF graphs (Gardent
in dmasked given ti . The masked summary dmasked
et al., 2017; Ribeiro et al., 2020; Song et al., 2020;
is the same as di except its j-th word is substi-
Chen et al., 2020)), text-data tables (Lebret et al.,
tuted with [PAD]. For simplicity, we only mask
2016; Rebuffel et al., 2022; Dusek et al., 2020; Re-
at most one token in di . We, therefore, form our
buffel et al., 2020; Puduppully and Lapata, 2021;
task as a multi-class classification problem. Sim-
Chen et al., 2020), electronic medical records (Lee,
2018; Guan et al., 2018), radiology reports (Wang
0.7 Covoc Ecocore et al., 2021; Yuan et al., 2019; Miura et al., 2021),
Eupath Xpo
Obi_iee Argo and other continuous data modalities without ex-
Obi Trak
Planp Premedonto plicit textual structures such as image (Lin et al.,
0.5
Accuracy

2014; Cornia et al., 2020; Ke et al., 2019; Radford


et al., 2021), audio (Drossos et al., 2020; Manco
0.3
et al., 2021; Wu et al., 2021; Mei et al., 2021),
and video (Li et al., 2021; Ging et al., 2020; Zhou
0.1
et al., 2018; Li et al., 2020). Different from these
GPL
570
GPL
6887
GPL
6244
GPL
13534
GPL
6246
GPL
1261
GPL
10558
GPL
198
GPL
96 structures, our dataset takes a high dimensional
Textomics platform genomics feature matrix as input, which doesn’t
Figure 5: Bar plot showing the accuracy of filling the exhibit structure and is thus substantially different
masked sentences of ten biomedical categories across from other modalities. Moreover, our dataset is the
9 platforms using Textomics as the benchmark. first dataset that aims to convert genomics feature
vector to textual summary. The substantial noise
ilar to section 6.1, we used the paper abstract as and high-dimensionality of genomics data matrices

4885
further pose unique challenges in text generation. scription generation models jointly consider thou-
Our kNN-Vec2Text model is inspired by the re- sands of datasets, enabling the transfer of knowl-
cent success in applying kNN-based language mod- edge from other datasets. The generated descrip-
els to machine translation (Khandelwal et al., 2021) tion can guide biologists to write more informative
and language models (Khandelwal et al., 2020; He descriptions, which ultimately leads to better and
et al., 2021; Ton et al., 2021). The main difference larger genomics description data. When biologists
between our methods and their approaches is that start to obtain the generated description from NLP
while we try to leverage kNN in the genomics vec- tools, they will be able to write more informative
tor space to construct reference text, they use kNN descriptions with the assistance from these NLP
in the text embedding space during the autoregres- tools. On the NLP side, the relationship between
sive generation process to help adjust the sample a summary and a dataset is analogous to the rela-
distribution. Some other methods can be used to tionship between an abstract and a scientific paper.
generate text from vectors, such as (Bowman et al., A high-quality summary ideally contains all per-
2016; Song et al., 2019; Miao and Blunsom, 2016; spectives of a study, including problems, methods,
Montero et al., 2021; Zhang et al., 2019). Their and discoveries. Moreover, our work will bridge
inputs are latent vectors that need to be inferred the NLP and the genomics community and moti-
from the data and do not have specific meanings, vate people to analyze genomics data using NLP
which are different from our gene feature vectors. methods based on the multi-modality dataset in-
troduced in this paper. Textomics could also be
8 Conclusion and future work used to help scientific paper analysis tasks, such as
paper recommendation (Bai et al., 2019), citation
In this paper, we have proposed a novel dataset Tex- text generation (Luu et al., 2020), and citation pre-
tomics, containing 22,273 pairs of genomics ma- diction (Suzen et al., 2021).
trices and their corresponding textual summaries. Our method searches for the nearest neighbours
We then introduce a novel task of Vec2Text based by calculating the Euclidean distance between five-
on our dataset. This task aims to generate the tex- number summary vectors of the genomics feature
tual summary based on the gene feature vector. matrices. However, this might lose useful infor-
To address this task, we propose a novel method mation hidden in the original matrices. It’s chal-
kNN-Vec2Text, which constructs the reference text lenging and worth exploring end-to-end approaches
using nearest neighbors in the gene feature vector that can learn embeddings from the genomics fea-
space and then generates a new summary accord- ture matrices instead of representing them as five-
ing to this reference text. We further introduce number summary vectors. On the Text2Vec side,
two applications that can be advanced using our one remaining challenge that could be the future di-
dataset. One application aims at evaluating sci- rection of our work is to directly generate the whole
entific paper similarity according to the similarity genomics feature matrix instead of the five-number
of its corresponding data summary, and the other summary vector. Also, it would be interesting yet
application leverages our dataset to automatically challenging to jointly learn the Text2Vec and the
generate masked sentences for scientific paper un- Vec2Text tasks, and one potential solution is to fur-
derstanding. ther decode the generated vector to reconstruct the
To the best of our knowledge, Textomics and embedding of summaries in Text2Vec, and lever-
kNN-Vec2Text serve as the first large-scale ge- age the resulted decoder to predict the embedding
nomics data description benchmark, and we en- of text by using kNN method in the text embed-
vision it will be broadly applied to other natural ding space. Also, it is interesting to jointly model
language processing and biomedical tasks. On data from multiple platforms, which might lead to
the biomedical side, we provide the benchmark beneficial results by transferring biological insights
to develop new NLP tools that can generate the learned from different platforms.
description for a genomics data. Since each pub-
lic genomics data needs a description, such tools
will substantially accelerate this process. Also, de-
References
scriptions generated from Textomics could contain
new knowledge. While humans write the descrip- Xiaomei Bai, Mengyang Wang, Ivan Lee, Zhuo Yang,
tion almost solely based on that single dataset, de- Xiangjie Kong, and Feng Xia. 2019. Scientific paper

4886
recommendation: A survey. IEEE Access, 7:9324– Ondrej Dusek, Jekaterina Novikova, and Verena Rieser.
9339. 2020. Evaluating the state-of-the-art of end-to-end
natural language generation: The E2E NLG chal-
Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan lenge. Comput. Speech Lang., 59:123–156.
Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jian-
feng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2020. Ron Edgar, Michael Domrachev, and Alex E. Lash.
Unilmv2: Pseudo-masked language models for uni- 2002. Gene expression omnibus: Ncbi gene expres-
fied language model pre-training. In ICML. sion and hybridization array data repository. Nucleic
acids research, 30 1.
Yael Baran, Akhiad Bercovich, Arnau Sebe-Pedros,
Yaniv Lubling, Amir Giladi, Elad Chomsky, Zo-
Claire Gardent, Anastasia Shimorina, Shashi Narayan,
har Meir, Michael Hoichman, Aviezer Lifshitz, and
and Laura Perez-Beltrachini. 2017. Creating train-
Amos Tanay. 2019. Metacell: analysis of single-cell
ing corpora for NLG micro-planners. In Proceed-
rna-seq data using k-nn graph partitions. Genome
ings of the 55th Annual Meeting of the Association
biology, 20(1):1–19.
for Computational Linguistics (Volume 1: Long Pa-
Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. Scibert: pers), pages 179–188, Vancouver, Canada. Associa-
A pretrained language model for scientific text. In tion for Computational Linguistics.
EMNLP/IJCNLP (1), pages 3613–3618. Association
for Computational Linguistics. Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and
Luke Zettlemoyer. 2019. Mask-predict: Parallel de-
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, An- coding of conditional masked language models. In
drew M. Dai, Rafal Józefowicz, and Samy Ben- EMNLP.
gio. 2016. Generating sentences from a continuous
space. In CoNLL, pages 10–21. ACL. Simon Ging, Mohammadreza Zolfaghari, Hamed Pir-
siavash, and Thomas Brox. 2020. COOT: coopera-
Liqun Chen, Guoyin Wang, Chenyang Tao, Ding- tive hierarchical transformer for video-text represen-
han Shen, Pengyu Cheng, Xinyuan Zhang, Wenlin tation learning. In NeurIPS.
Wang, Yizhe Zhang, and Lawrence Carin. 2019. Im-
proving textual network embedding with global at- Sara Goodwin, John D McPherson, and W Richard
tention via optimal transport. In ACL (1), pages McCombie. 2016. Coming of age: ten years of
5193–5202. Association for Computational Linguis- next-generation sequencing technologies. Nature
tics. Reviews Genetics, 17(6):333–351.
Wenhu Chen, Yu Su, Xifeng Yan, and William Yang
Wang. 2020. KGPT: knowledge-grounded pre- Jiaqi Guan, Runzhe Li, Sheng Yu, and Xuegong Zhang.
training for data-to-text generation. In EMNLP (1), 2018. Generation of synthetic electronic medical
pages 8635–8648. Association for Computational record text. In BIBM, pages 374–380. IEEE Com-
Linguistics. puter Society.

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasu-
Downey, and Daniel S. Weld. 2020. SPECTER: pat, and Ming-Wei Chang. 2020. REALM: retrieval-
Document-level Representation Learning using augmented language model pre-training. CoRR,
Citation-informed Transformers. In ACL. abs/2002.08909.

Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, Junxian He, Graham Neubig, and Taylor Berg-
and Rita Cucchiara. 2020. Meshed-memory trans- Kirkpatrick. 2021. Efficient nearest neighbor lan-
former for image captioning. In CVPR, pages guage models. In EMNLP.
10575–10584. Computer Vision Foundation / IEEE.
Byungjin Hwang, Ji Hyun Lee, and Duhee Bang. 2018.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Single-cell rna sequencing technologies and bioin-
Kristina Toutanova. 2019. BERT: pre-training of formatics pipelines. Experimental & molecular
deep bidirectional transformers for language under- medicine, 50(8):1–14.
standing. In NAACL-HLT (1), pages 4171–4186. As-
sociation for Computational Linguistics.
Minoru Kanehisa and Peer Bork. 2003. Bioinfor-
George Doddington. 2002. Automatic evaluation matics in the post-sequence era. Nature genetics,
of machine translation quality using n-gram co- 33(3):305–310.
occurrence statistics. In Proceedings of the Second
International Conference on Human Language Tech- Lei Ke, Wenjie Pei, Ruiyu Li, Xiaoyong Shen, and Yu-
nology Research, page 138–145, San Francisco, CA, Wing Tai. 2019. Reflective decoding network for im-
USA. Morgan Kaufmann Publishers Inc. age captioning. In ICCV, pages 8887–8896. IEEE.

Konstantinos Drossos, Samuel Lipping, and Tuomas Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke
Virtanen. 2020. Clotho: an audio captioning dataset. Zettlemoyer, and Mike Lewis. 2021. Nearest neigh-
In ICASSP, pages 736–740. IEEE. bor machine translation. In ICLR. OpenReview.net.

4887
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zequn Liu, Shukai Wang, Yiyang Gu, Ruiyi Zhang,
Zettlemoyer, and Mike Lewis. 2020. Generalization Ming Zhang, and Sheng Wang. 2021. Graphine: A
through memorization: Nearest neighbor language dataset for graph-aware terminology definition gen-
models. In ICLR. OpenReview.net. eration. In EMNLP (1), pages 3453–3463. Associa-
tion for Computational Linguistics.
Michael Kramer, Janusz Dutkowski, Michael Yu, Vi-
neet Bafna, and Trey Ideker. 2014. Inferring gene Kelvin Luu, Rik Koncel-Kedziorski, Kyle Lo, Isabel
ontologies from pairwise similarity data. Bioinfor- Cachola, and Noah A. Smith. 2020. Citation text
matics, 30(12):i34–i42. generation. ArXiv, abs/2002.00317.

Alon Lavie and Abhaya Agarwal. 2007. METEOR: Ilaria Manco, Emmanouil Benetos, Elio Quinton, and
an automatic metric for MT evaluation with high György Fazekas. 2021. Muscaps: Generating cap-
levels of correlation with human judgments. In tions for music audio. In IJCNN, pages 1–8. IEEE.
WMT@ACL, pages 228–231. Association for Com-
putational Linguistics. Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun
Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao,
Rémi Lebret, David Grangier, and Michael Auli. 2016. Shengchen Li, Tom Ko, H. Tang, Xi Shao, Mark D.
Neural text generation from structured data with Plumbley, and Wenwu Wang. 2021. An encoder-
application to the biography domain. In EMNLP, decoder based audio captioning system with transfer
pages 1203–1213. The Association for Computa- and reinforcement learning. In DCASE, pages 206–
tional Linguistics. 210.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Oren Melamud and Chaitanya Shivade. 2019. Towards
Donghyeon Kim, Sunkyu Kim, Chan Ho So, automatic generation of shareable synthetic clinical
and Jaewoo Kang. 2020. Biobert: a pre-trained notes using neural language models. In Proceedings
biomedical language representation model for of the 2nd Clinical Natural Language Processing
biomedical text mining. Bioinform., 36(4):1234– Workshop, pages 35–45.
1240.
Yishu Miao and Phil Blunsom. 2016. Language as a
Scott Lee. 2018. Natural language generation for elec-
latent variable: Discrete generative models for sen-
tronic health records. CoRR, abs/1806.01353.
tence compression. In EMNLP.
Jacob H. Levine, Erin F. Simonds, Sean C. Bendall,
Kara L. Davis, El ad D. Amir, Michelle D. Tad- Yasuhide Miura, Yuhao Zhang, Emily Tsai, Curtis Lan-
mor, Oren Litvin, Harris G. Fienberg, Astraea Jager, glotz, and Dan Jurafsky. 2021. Improving factual
Eli R. Zunder, Rachel Finck, Amanda L. Gedman, completeness and consistency of image-to-text radi-
Ina Radtke, James R. Downing, Dana Pe’er, and ology report generation. In Proceedings of the 2021
Garry P. Nolan. 2015. Data-driven phenotypic dis- Conference of the North American Chapter of the As-
section of AML reveals progenitor-like cells that cor- sociation for Computational Linguistics (NAACL).
relate with prognosis. Cell, 162(1):184–197.
Ivan Montero, Nikolaos Pappas, and Noah A. Smith.
Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, 2021. Sentence bottleneck autoencoders from trans-
Licheng Yu, and Jingjing Liu. 2020. HERO: former language models. In EMNLP.
hierarchical encoder for video+language omni-
representation pre-training. In EMNLP (1), pages Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
2046–2065. Association for Computational Linguis- Jing Zhu. 2002. Bleu: a method for automatic eval-
tics. uation of machine translation. In Proceedings of
the 40th Annual Meeting of the Association for Com-
Yehao Li, Yingwei Pan, Jingwen Chen, Ting Yao, and putational Linguistics, pages 311–318, Philadelphia,
Tao Mei. 2021. X-modaler: A versatile and high- Pennsylvania, USA. Association for Computational
performance codebase for cross-modal analytics. In Linguistics.
ACM Multimedia, pages 3799–3802. ACM.
Ratish Puduppully and Mirella Lapata. 2021. Data-to-
Chin-Yew Lin. 2004. ROUGE: A package for auto- text generation with macro planning. Trans. Assoc.
matic evaluation of summaries. In Text Summariza- Comput. Linguistics, 9:510–527.
tion Branches Out, pages 74–81, Barcelona, Spain.
Association for Computational Linguistics. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish
Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,
Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, Gretchen Krueger, and Ilya Sutskever. 2021. Learn-
and C. Lawrence Zitnick. 2014. Microsoft COCO: ing transferable visual models from natural language
common objects in context. In ECCV (5), volume supervision. In ICML, volume 139 of Proceedings
8693 of Lecture Notes in Computer Science, pages of Machine Learning Research, pages 8748–8763.
740–755. Springer. PMLR.

4888
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Sheng Wang, Jianzhu Ma, Michael Ku Yu, Fan Zheng,
Dario Amodei, Ilya Sutskever, et al. 2019. Lan- Edward W Huang, Jiawei Han, Jian Peng, and Trey
guage models are unsupervised multitask learners. Ideker. 2018. Annotating gene sets by mining large
OpenAI blog, 1(8):9. literature collections with protein networks. In PA-
CIFIC SYMPOSIUM ON BIOCOMPUTING 2018:
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Proceedings of the Pacific Symposium, pages 602–
Lee, Sharan Narang, Michael Matena, Yanqi Zhou, 613. World Scientific.
Wei Li, and Peter J. Liu. 2020. Exploring the lim-
its of transfer learning with a unified text-to-text Wenlin Wang, Chenyang Tao, Zhe Gan, Guoyin Wang,
transformer. Journal of Machine Learning Research, Liqun Chen, Xinyuan Zhang, Ruiyi Zhang, Qian
21(140):1–67. Yang, Ricardo Henao, and Lawrence Carin. 2019.
Improving textual network learning with variational
Clément Rebuffel, Marco Roberti, Laure Soulier, Geof- homophilic embeddings. In NeurIPS, pages 2074–
frey Scoutheeten, Rossella Cancelliere, and Patrick 2085.
Gallinari. 2022. Controlling hallucinations at word
level in data-to-text generation. Data Min. Knowl. Yixin Wang, Zihao Lin, Jiang Tian, Zhongchao
Discov., 36(1):318–354. Shi, Yang Zhang, Jianping Fan, and Zhiqiang He.
2021. Confidence-guided radiology report genera-
Clément Rebuffel, Laure Soulier, Geoffrey tion. arXiv preprint arXiv:2106.10887.
Scoutheeten, and Patrick Gallinari. 2020. A
hierarchical model for data-to-text generation. Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar,
In ECIR (1), volume 12035 of Lecture Notes in and Juan Pablo Bello. 2021. Wav2clip: Learning ro-
Computer Science, pages 65–80. Springer. bust audio representations from clip. arXiv preprint
arXiv:2110.11499.
Leonardo F. R. Ribeiro, Yue Zhang, Claire Gardent,
and Iryna Gurevych. 2020. Modeling global and Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Car-
local node contexts for text generation from knowl- bonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019.
edge graphs. Trans. Assoc. Comput. Linguistics, Xlnet: Generalized autoregressive pretraining for
8:589–604. language understanding. In NeurIPS.

Julian Salazar, Davis Liang, Toan Q. Nguyen, and Ka- Jianbo Yuan, Haofu Liao, Rui Luo, and Jiebo Luo.
trin Kirchhoff. 2020. Masked language model scor- 2019. Automatic radiology report generation based
ing. In ACL. on multi-view image fusion and medical concept en-
richment. In MICCAI (6), volume 11769 of Lec-
Linfeng Song, Ante Wang, Jinsong Su, Yue Zhang, ture Notes in Computer Science, pages 721–729.
Kun Xu, Yubin Ge, and Dong Yu. 2020. Structural Springer.
information preserving for graph-to-text generation.
In ACL, pages 7987–7998. Association for Compu- Xinyuan Zhang, Yi Yang, Siyang Yuan, Dinghan Shen,
tational Linguistics. and Lawrence Carin. 2019. Syntax-infused varia-
tional autoencoder for text generation. In ACL (1),
Tianbao Song, Jingbo Sun, Bo Chen, Weiming Peng, pages 2069–2078. Association for Computational
and Jihua Song. 2019. Latent space expanded vari- Linguistics.
ational autoencoder for sentence generation. IEEE
Access, 7:144618–144627. Yanjian Zhang, Qin Chen, Yiteng Zhang, Zhongyu
Wei, Yixu Gao, Jiajie Peng, Zengfeng Huang, Wei-
Neslihan Suzen, Alexander Gorban, Jeremy Levesley, jian Sun, and Xuan-Jing Huang. 2020. Automatic
and Evgeny Mirkes. 2021. Semantic analysis for term name generation for gene ontology: Task and
automated evaluation of the potential impact of re- dataset. In Proceedings of the 2020 Conference on
search articles. Empirical Methods in Natural Language Processing:
Findings, pages 4705–4710.
Jean-Francois Ton, Walter A. Talbott, Shuangfei Zhai,
and Joshua M. Susskind. 2021. Regularized train- Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard
ing of nearest neighbor language models. ArXiv, Socher, and Caiming Xiong. 2018. End-to-end
abs/2109.08249. dense video captioning with masked transformer. In
CVPR, pages 8739–8748. Computer Vision Founda-
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob tion / IEEE Computer Society.
Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all A Appendices
you need. In NIPS, pages 5998–6008.
We provided more details here about our dataset
Bin Wang and C.-C. Jay Kuo. 2020. SBERT-WK:
A sentence embedding method by dissecting bert-
and related experimental results here. In Table S1,
based word models. IEEE ACM Trans. Audio we summarized the statistics information of 9 Tex-
Speech Lang. Process., 28:2146–2157. tomics platforms. There are 3 different species

4889
#Matrix #Matrix # Matrix
Platform Species #Feature Missing rates
(All) (PMC) (Vec2Text)
GPL96 Homo Sapiens 1,371 353 240 100K 0.19
GPL198 Arabidopsis Thaliana 1,081 194 250 100K 0.03
GPL570 Homo Sapiens 5,822 1,879 1,004 100K 0.12
GPL1261 Mus Musculus 4,563 1,326 1,059 100K 0.09
GPL6244 Homo Sapiens 1,831 659 307 100K 0.10
GPL6246 Homo Sapiens 2,366 850 388 100K 0.08
GPL6887 Mus Musculus 1,150 407 240 100K 0.09
GPL10558 Homo Sapiens 2,580 1,261 519 100K 0.11
GPL13534 Homo Sapiens 1,509 762 234 100K 0.26

Table S1: Statistics of the Textomics data. Each row is a sequencing platform in Textomics. All, PMC, Vec2Text
represent number of samples without filtering, with associated PMC full text article, and after using automated
filtering, respectively.

Textomics GPL GPL GPL GPL GPL GPL GPL GPL GPL
platform 96 198 570 1261 6244 6246 6887 10558 13534
Spearman correlation 0.36 0.20 0.24 0.34 0.44 0.45 0.22 0.38 0.30

Table S2: The result of Spearman correlation between gene data matrices and text summaries on 9 platforms.

Platform BLEU-1 ROUGE-1 ROUGE-L METEOR NIST


GPL96 0.179 0.233 0.166 0.143 0.817
GPL198 0.198 0.257 0.192 0.168 0.889
GPL570 0.212 0.269 0.205 0.182 0.936
GPL1261 0.229 0.283 0.226 0.202 0.980
GPL6244 0.183 0.250 0.179 0.156 0.750
GPL6246 0.219 0.269 0.210 0.187 0.950
GPL6887 0.198 0.260 0.196 0.171 0.847
GPL10558 0.191 0.257 0.177 0.165 0.842
GPL13534 0.242 0.332 0.279 0.260 1.124

Table S3: More results on evaluating Vec2Text task on Textomics.

4890
across 9 platforms, including Homo sapiens, Ara-
bidopsis thailiana, and Mus musculus. #Sample
(All) represents the entire number of samples for 9
platforms, #Sample (Vec2Text) represents the num-
ber of samples in the subset after BLEU filtering,
and #Sample (PMC) represents the number of sam-
ples in the subset with full scientific articles.
We also represented the results of Spearman cor-
relations between text-based similarity and vector-
based simlarity across 9 platforms in Table S2. The
Spearman correlations are all higher than 0.2 in ev-
ery platform, which shows a substantial agreement
between text-based similarity and vector-based sim-
ilarity.
In Table S3, We represented the scores of
different widely-used automatic metrics for
word level sentence generation evaluation on
Vec2Text task, including BLEU-1(Papineni et al.,
2002), BLEU-2, ROUGE-1(Lin, 2004), ROUGE-
L, METEOR(Lavie and Agarwal, 2007) and
NIST(Doddington, 2002). The results indicated
consistent improvement of our method over com-
parison approaches on different automatic metrics.

4891

You might also like