Professional Documents
Culture Documents
15 8111–8125
doi: 10.1093/nar/gkz646
Received May 25, 2019; Revised July 07, 2019; Editorial Decision July 13, 2019; Accepted July 15, 2019
* To
whom correspondence should be addressed. Tel: +86 20 8522 4031; Email: zhanggong-uni@qq.com
Correspondence may also be addressed to Qing-Yu He. Tel: +86 20 8522 7039; Email: tqyhe@email.jnu.edu.cn
Correspondence may also be addressed to Tong Wang. Tel: +86 20 8522 2616; Email: tongwang@jnu.edu.cn
†
The authors wish it to be known that, in their opinion, the first three authors should be regarded as joint First Authors.
C The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
8112 Nucleic Acids Research, 2019, Vol. 47, No. 15
Indeed, mis-annotations of individual ncRNAs have 37◦ C for 30 min, RFPs were magnetically isolated and pu-
been reported with their protein products since a decade rified by ethanol precipitation. The probe sequences of were
ago, such as CLUU1 (11) and ESRG (12), although some listed in Supplementary Table S8.
ncRNAs only encode short peptides (13–17). The discov- Sequencing libraries were constructed by following the
ery of these individual mis-annotations implicatively fa- guide for the NEBNext Multiplex Small RNA Library Prep
vors our hypothesis that a hidden human proteome en- Set for Illumina (NEB, Beijing, China). Briefly, the multi-
coded by purported ‘non-coding’ RNAs may exist. Re- plex 3 SR adapters were ligated to RFPs and hybridized
cently, Heesch et al. combined ribosome profiling with with reverse transcription primers. The reverse transcrip-
(http://proteomecentral.proteomexchange.org) with the fol- with the same molecular weights caused by certain post-
lowing respective identifiers, PXD000529, PXD000533, translational modifications as we described previously (26).
PXD000535 (19) and PXD001305 (25). The human lung In addition, only unique peptides with at least 9 residues
cell proteomics data are available from the iProX database were used for protein identifications (26). Proteins that do
(http://www.iprox.org, accession number: IPX00076200); not belong to PE1–5 proteins (the ‘protein products of
and the MS raw data for low molecular weight pro- known coding genes’ according to the Swiss-Prot database)
teins are publicly available in iProX (accession number: (28) were defined as ‘new proteins’.
IPX00076300).
on the transitions in the unfractionated peptide mixtures. gradient buffer (6%-30% solvent B (80% ACN, 0.1% FA))
The positive-response transitions were collected and com- for 30 min under a flow rate of 200 nl/min. The MRM
bined to run the MRM confirmation experiments at the acquisition was performed with the TSQ Quantiva sys-
second round, using the peptides mixtures before fraction- tem (Thermo Scientific) coupled with a Nanospray NG ion
ated as sample input. We further performed the third-round soure in positive ion mode. MS conditions were set as fol-
MRM to specifically analyze all the negative-response tran- lows: ion spray voltage 2300 V; ion transfer tube tempera-
sitions using the fractionations of each sample. ture 320◦ C; CID gas 1.5 mTorr; cycle time 3 s; chrome fil-
All data files were imported into Skyline and manually ter of 3 s. The raw data were analyzed using the Skyline
CRISPR/Cas9 Knockout Prolong Gold Antifade Reagent with DAPI (Life Technolo-
gies), slides were observed with a Zeiss LSM710 confocal
UBAP1-AST6 knockout cell lines were generated by using
microscope (Zeiss, Oberkochen, Germany).
the CRISPR/Cas9 system. The gRNAs targeting UBAP1-
AST6 gene were designed by the online tool developed by
Prof. Zhang (http://crispr.mit.edu/). The DNA sequences qRT-PCR
expressing these gRNAs were cloned into pGK1.1 lin- Total RNA was extracted from cells using the Trizol to-
ear vector (Genloci Biotechnologies Inc. Nanjing, Jiangsu, tal RNA isolation reagent (Life Technologies). RNA levels
Figure 1. The translating lncRNAs that can encode proteins (canonical ORF length ≥150 nt). (A) The number of translating lncRNAs detected by the
full-length translating mRNA sequencing (RNC-seq) in nine cell lines. (B) Ribosome footprints (RFP) of two translating lncRNAs as examples. Red bars
denote the RFP coverage along the RNA, and the grey region marks the predicted canonical ORF. (C) The chromosome enrichment of the translating
lncRNAs. P-values were calculated using Fisher Exact test. (D, E) Expression patterns of the translating lncRNAs in nine cell lines. The threshold of 0.1
(D) and 1.0 (E) RPKM are respectively used for positive detections.
Nucleic Acids Research, 2019, Vol. 47, No. 15 8117
Our ribosome profiling results confirmed that these trans- to use data-independent acquisition MS methods, includ-
lating lncRNAs were actively translated by ribosomes, as ing multiple reaction monitoring (MRM) and parallel re-
the ribosome footprints spanned across the transcripts, es- action monitoring (PRM). In general MRM/PRM are to
pecially across the predicted CDS regions (examples are select certain parent peptides for fragmentation and quan-
shown in Figure 1B). The translating lncRNAs distributed tification. This can be assisted by addition of synthetic pep-
across the entire human genome in all chromosomes (Sup- tides serving as spike-in standards for further quantification
plementary Figure S1) with almost no chromosome enrich- of those peptides in samples. With regular MRM (with-
ment detected (Figure 1C), indicating that these potential out synthetic peptides), we found evidence for 203 unique
dances were mostly <1 RPKM in mRNA as well as at the compositions (Figure 3D). The new proteins used a simi-
translation level, they were often considered as expression lar amount of uncharged amino acids as PE1 proteins, while
or experimental noises. Using our highly accurate and ex- they tended to use more positively charged amino acids and
perimental validated mapping algorithm (22), we could de- less negatively charged amino acids (Figure 3D). Third, the
termine their existence at transcription and translation lev- shorter length of the new proteins led to significantly less
els. The new proteins were also less abundant in protein level possibility of tryptic digestion (P < 10−16 , KS-test, Figure
than the PE1 proteins (Figure 3B), which challenged the 3E), thus decreasing the possibility of finding more unique
sensitivity of MS-matching methods. Second, the isoelec- peptides and supportive peptides. Fourth, the new proteins
tric point (pI) of the new proteins are significantly higher were predicted to be less stable in vivo than the PE1 pro-
than the PE1 proteins (P < 10−16 , KS-test, Figure 3C), teins, as calculated by the instability index (36) (P < 10−16 ,
which explains the difficulty for MS to detect them (26,35). KS-test, Figure 3F), indicating less possibility to find such
This difference tended to be caused by their amino acid
Nucleic Acids Research, 2019, Vol. 47, No. 15 8119
proteins in intact form in the cells. These properties make machine-learning. Poly(A) sites were predicted by aligning
the new proteins less probable to be detected. EST or RNA-seq reads to genome sequences (reviewed in
(37)) and thus the method is unable to distinguish protein-
Origin of the new proteins coding mRNAs and long non-coding RNAs with polyA
tails.
Another interesting question is why these coding genes have
An important criterion of the coding propensity predict-
been erroneously classified as non-protein-coding genes be-
ing algorithms lies in the evolutionary conservatism of pro-
fore. Previous studies used computational approaches to
teins across the species. Therefore, we performed homol-
predict genes as protein-coding or non-coding, based on
ogy search of these new proteins against various genomes
properties of gene and exon structures, potential CDS fea-
ranging from all genome-sequenced bacteria to primates,
tures (codon bias, etc), polyA sites and homology across
in comparison to PE1 proteins. As a result, we could de-
genomes (reviewed in (37,38)). We found that these new
tect homologous genes in all vertebrates for more than 88%
coding genes contained significantly less number of exons
of PE1 proteins, with nearly 40% PE1 genes having their
than the PE1 proteins (Figure 4A). 88% of the new cod-
ancestors in bacteria. In sharp contrast, more than half of
ing genes contain single-exons. The sequence properties of
the new proteins emerged in primates and did not exist even
the new proteins significantly deviated from known proteins
in rodents, and only <6% could be found in bacteria (Fig-
(Figures 2A and 3C-E), which might have confused the pre-
ure 4B). This implicates that the new proteins are mostly
diction algorithms based on supervised classification and
8120 Nucleic Acids Research, 2019, Vol. 47, No. 15
very young genes during evolution, which partially explains found that 27 (71%) of these proteins were also from un-
why the predicting algorithms have considered these genes known origin (Figure 4D).
as non-coding. Interestingly, most of the new proteins in
orangutan and gorilla show very high homology (>60%) Functional potency of new proteins
to the human counterparts, while only <30% of the PE1
proteins show such high homology to their human counter- To further investigate whether these new proteins were
parts. This suggests that the new proteins are well conserved potentially functional, we calculated the function-relevant
since their emergence in primates. translation ratio (TR) values, which we had previously de-
We then searched for the evolutionary ancestors for the fined as the translating mRNA versus total mRNA ratio of
new proteins within human genome. We found that 28.9% a certain gene (7). TR mainly reflects the translation initi-
of the new proteins were alternative splice variants of the ation, and it is known that high TR genes determine the
known protein-coding genes, 21.1% were homologous copy cellular phenotypes (7,20,39). Here, we found that the aver-
of known protein-coding genes with genetic alterations age TR of the translating lncRNAs was significantly higher
(Figure 4C). A small fraction of 1.3% was homologous to than that of known coding genes in seven out of nine tested
bacteria and virus, which might indicate a genetic hori- cell lines, except A549 and Hep3B (Figure 5A). This indi-
zontal transfer during symbiosis or infection (Figure 4C). cates a more active translation initiation of the new pro-
48.7% could not be explained by any of above origins (Fig- teins. It is known that more stable RNA secondary structure
ure 4C). Further comparison with the 38 new proteins (Fig- near the start codon decreases the translation initiation effi-
ure 2D) that fully complied with the HPP Guideline 2.1, we ciency (24). Our analyses showed that the new proteins have
less stable RNA structure near the start codon (Figure 5B),
which partially explains the more active translation initia-
Nucleic Acids Research, 2019, Vol. 47, No. 15 8121
tion of the new proteins. As the new proteins exist in a stable nucleus and 13 cytosol proteins). Representative images are
and full-length form in vivo (Figure 2E and F), they are un- shown in Figure 5D and Supplementary Figure S6. Our re-
likely to be merely products of erroneous ribosome binding. sults suggest that the new proteins are highly likely to be
Most proteins need to be localized properly to be func- functional in vivo.
tional. Therefore, we next predicted the subcellular localiza-
tion of the new proteins. The Cellular Component terms of
Cellular phenotypes regulated by new protein UBAP1-AST6
gene ontology were exported from BLAST2GO search. RE-
VIGO enrichment analysis (40) showed clear enrichment of We next tried to provide direct evidence on the functional-
the new proteins localized in various types of subcellular ity of one new protein UBAP1-AST6. We targeted UBAP1-
structures, including membrane systems, nucleus and mito- AST6 because of its interesting nucleoli localization and
chondrion (Figure 5C). We then employed confocal fluores- its high TR in the lung cancer A549 cells. UBAP1-AST6
cent microscopy to verify these predicted localizations by strictly localized in nucleus using EGFP and mCherry fu-
constructing these genes into pEGFP-N1 plasmids trans- sion proteins, respectively; our results excluded the influ-
fected into H1299 cells. As such we successfully validated ence of the fluorescent tags (Figure 6A and B). We then
all 20 new proteins with predicted subcellular localizations raised the antibody specifically against the UBAP1-AST6
(3 mitochondrion, 1 endoreticulum, 1 Golgi apparatus, 2 and found that it also localized in the nucleus in various hu-
man cell lines and human tissues (Figure 6C). To show the
8122 Nucleic Acids Research, 2019, Vol. 47, No. 15
Figure 6. The potential biological function of a new protein UBAP1-AST6. (A) Subcellular localization of UBAP1-AST6, fused by EGFP. (B) Same
as (A), UBAP1-AST6 is fused with mCherry. (C) Western blot verification of the nucleus localization of UBAP1-AST6 using three cell lines and three
human tissues. (D) The DNA sequence UBAP1-AST6-ATG mut plasmid. The start codon ATG of the UBAP1-AST6 ORF was mutated to GCG to
abolish translation initiation. The sequences were verified by Sanger sequencing. (E) qRT-PCR analysis of relative UBAP1-AST6 RNA expression in over-
expression (OV) and knock-out (KO) models. A549 cells were infected with LentiViral-flag (OV-Control) or LentiViral-UBAP1-AST6-flag (OV-UBAP1-
AST6), followed by qPCR analysis of UBAP1-AST6 relative to GAPDH. Similar analysis was also performed on CRISPR/Cas9 KO and pcDNA3.1-
UBAP1-AST6 rescue (KO-rescue) groups. In the rescue models, we included an ATG-mutated pcDNA3.1-UBAP1-AST6 group (KO-rescue-ATG-mut) as
a control. (F) Immunoblotting validation of UBAP1-AST6 expression. (G) Proliferation assays using WST-1. n = 3. (H) Colony formation assay. n = 3.
Nucleic Acids Research, 2019, Vol. 47, No. 15 8123
function of UBAP1-AST6, we built overexpression (OV) panded the reference protein database, resulting in an ele-
and knock-out (KO)/rescue models. To answer whether vated threshold to control FDR, thus reducing the sensi-
the function elicited by proteins or RNAs, in the rescue tivity of peptide identifications (42). In this study, the use
experiments, we included a control using ATG-mutated of cell-type specific translating mRNAs to serve as ORF
pcDNA3.1-UBAP1-AST6 plasmid to transfect A549 cells prediction and reference database for MS data search en-
(KO-rescue-ATG-mut) to abolish the translation initiation. sured the accurate matches. Third, the new coding genes
The qRT-PCR results showed that all OV or KO groups are generally not conserved. More than half of these genes
worked as expected (Figure 6D), and the protein and RNA emerged in primates, representing very young genes in evo-
24. Bentele,K., Saffert,P., Rauscher,R., Ignatova,Z. and Bluthgen,N. spectrometry data interpretation guidelines 2.1. J. Proteome Res., 15,
(2013) Efficient translation initiation dictates codon usage at gene 3961–3970.
start. Mol. Syst. Biol., 9, 675. 35. Horvatovich,P., Lundberg,E.K., Chen,Y.J., Sung,T.Y., He,F.,
25. Kelstrup,C.D., Jersie-Christensen,R.R., Batth,T.S., Arrey,T.N., Nice,E.C., Goode,R.J., Yu,S., Ranganathan,S., Baker,M.S. et al.
Kuehn,A., Kellmann,M. and Olsen,J.V. (2014) Rapid and deep (2015) Quest for missing proteins: update 2015 on
proteomes by faster sequencing on a benchtop quadrupole chromosome-centric human proteome project. J. Proteome Res., 14,
ultra-high-field Orbitrap mass spectrometer. J. Proteome Res., 13, 3415–3431.
6187–6195. 36. Guruprasad,K., Reddy,B.V. and Pandit,M.W. (1990) Correlation
26. Chen,Y., Li,Y., Zhong,J., Zhang,J., Chen,Z., Yang,L., Cao,X., between stability of a protein and its dipeptide composition: a novel