You are on page 1of 53

Computational Methods for Single-Cell

Data Analysis Guo-Cheng Yuan


Visit to download the full and correct content document:
https://textbookfull.com/product/computational-methods-for-single-cell-data-analysis-g
uo-cheng-yuan/
More products digital (pdf, epub, mobi) instant
download maybe you interests ...

Single Cell Analysis: Methods and Protocols Miodrag


Gužvi■

https://textbookfull.com/product/single-cell-analysis-methods-
and-protocols-miodrag-guzvic/

Computational Methods for Next Generation Sequencing


Data Analysis 1st Edition Ion Mandoiu

https://textbookfull.com/product/computational-methods-for-next-
generation-sequencing-data-analysis-1st-edition-ion-mandoiu/

Computational Cell Biology Methods and Protocols Louise


Von Stechow

https://textbookfull.com/product/computational-cell-biology-
methods-and-protocols-louise-von-stechow/

Computational Stem Cell Biology Methods and Protocols


Patrick Cahan

https://textbookfull.com/product/computational-stem-cell-biology-
methods-and-protocols-patrick-cahan/
Neural cell biology 1st Edition Cheng Wang

https://textbookfull.com/product/neural-cell-biology-1st-edition-
cheng-wang/

Time Series Analysis Methods and Applications for


Flight Data Zhang

https://textbookfull.com/product/time-series-analysis-methods-
and-applications-for-flight-data-zhang/

Mobile Data Mining Yuan Yao

https://textbookfull.com/product/mobile-data-mining-yuan-yao/

Computational Methods for Processing and Analysis of


Biological Pathways 1st Edition Anastasios Bezerianos

https://textbookfull.com/product/computational-methods-for-
processing-and-analysis-of-biological-pathways-1st-edition-
anastasios-bezerianos/

Computational methods for numerical analysis with R 1st


Edition James Patrick Howard Ii

https://textbookfull.com/product/computational-methods-for-
numerical-analysis-with-r-1st-edition-james-patrick-howard-ii/
Methods in
Molecular Biology 1935

Guo-Cheng Yuan Editor

Computational
Methods for
Single-Cell Data
Analysis
METHODS IN MOLECULAR BIOLOGY

Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK

For further volumes:


http://www.springer.com/series/7651
Computational Methods
for Single-Cell Data Analysis

Edited by

Guo-Cheng Yuan
Dana–Farber Cancer Institute and Harvard Chan School of Public Health, Boston, MA, USA
Editor
Guo-Cheng Yuan
Dana–Farber Cancer Institute and Harvard Chan
School of Public Health
Boston, MA, USA

ISSN 1064-3745 ISSN 1940-6029 (electronic)


Methods in Molecular Biology
ISBN 978-1-4939-9056-6 ISBN 978-1-4939-9057-3 (eBook)
https://doi.org/10.1007/978-1-4939-9057-3
Library of Congress Control Number: 2018967307

© Springer Science+Business Media, LLC, part of Springer Nature 2019


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction
on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation,
computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations
and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to
be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty,
express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.
The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Humana Press imprint is published by the registered company Springer Science+Business Media, LLC, part of
Springer Nature.
The registered company address is: 233 Spring Street, New York, NY 10013, U.S.A.
Preface

The cell is the fundamental unit of life. The biological functions of an organ or tissue are
results of coordinated action of a large number of cells, each having its own properties and
dynamic behavior. While it is traditional to classify cells with similar function and morphol-
ogy as cell types, it is also well-recognized that, even within each cell type, there remain
significant differences, or states, among individual cells. Current knowledge of the reper-
toire of cell types and cell states, as well as their dynamic changes, remains highly incomplete.
Systematic, comprehensive characterization of spatial and temporal organization of cellular
heterogeneity, along with the mechanisms underlying cell-type/state transition and main-
tenance, has important implications in development and diseases.
It is not until recently that it has become feasible to systematically investigate cellular
heterogeneity at the single-cell resolution, thanks to the rapid development of a number of
advanced technologies including sequencing, imaging, and microfluidic devices. Collec-
tively, single-cell technologies have created exciting opportunities to systematically charac-
terize the molecular behavior of individual cells at the omics scale. At the same time, the
analysis and integration of single-cell omic data are difficult due to a number of challenges
such as sparsity, technical variability, and spatial-temporal complexity.
During the past few years, numerous computational methods and software packages
have been developed to overcome these challenges. The aim of this book is to introduce to
the community the state of the art of computational approaches in single-cell data analysis.
Each chapter presents a computational toolbox that is aimed to overcome a specific chal-
lenge in single-cell analysis, such as data normalization, rare cell-type identification, and
spatial transcriptomics analysis. Rather than explaining the mathematical details, here the
focus is on hands-on implementation of computational methods for analyzing experimental
data. Taken together, these chapters cover a wide range of tasks and may serve as a handbook
for single-cell data analysis.
Finally, I would like to thank Prof. John M. Walker for his kind invitation and sustained
support throughout the preparation of this book. I would also like to express my sincere
gratitude to all the contributors for sharing their protocols.

Boston, MA, USA Guo-Cheng Yuan

v
Contents

Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v
Contributors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
1 Quality Control of Single-Cell RNA-seq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Peng Jiang
2 Normalization for Single-Cell RNA-Seq Data Analysis. . . . . . . . . . . . . . . . . . . . . . . 11
Rhonda Bacher
3 Analysis of Technical and Biological Variability in Single-Cell
RNA Sequencing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Beomseok Kim, Eunmin Lee, and Jong Kyoung Kim
4 Identification of Cell Types from Single-Cell Transcriptomic Data . . . . . . . . . . . . 45
Karthik Shekhar and Vilas Menon
5 Rare Cell Type Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Lan Jiang
6 scMCA: A Tool to Define Mouse Cell Types Based on Single-Cell
Digital Expression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Huiyu Sun, Yincong Zhou, Lijiang Fei, Haide Chen, and Guoji Guo
7 Differential Pathway Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Jean Fan
8 Pseudotime Reconstruction Using TSCAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Zhicheng Ji and Hongkai Ji
9 Estimating Differentiation Potency of Single Cells
Using Single-Cell Entropy (SCENT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Weiyan Chen and Andrew E. Teschendorff
10 Inference of Gene Co-expression Networks from Single-Cell
RNA-Sequencing Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Alicia T. Lamere and Jun Li
11 Single-Cell Allele-Specific Gene Expression Analysis. . . . . . . . . . . . . . . . . . . . . . . . . 155
Meichen Dong and Yuchao Jiang
12 Using BRIE to Detect and Analyze Splicing Isoforms
in scRNA-Seq Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Yuanhua Huang and Guido Sanguinetti
13 Preprocessing and Computational Analysis of Single-Cell
Epigenomic Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Caleb Lareau, Divy Kangeyan, and Martin J. Aryee

vii
viii Contents

14 Experimental and Computational Approaches for Single-Cell


Enhancer Perturbation Assay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
Shiqi Xie and Gary C. Hon
15 Antigen Receptor Sequence Reconstruction and Clonality Inference
from scRNA-Seq Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
Ida Lindeman and Michael J. T. Stubbington
16 A Hidden Markov Random Field Model for Detecting Domain
Organizations from Spatial Transcriptomic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Qian Zhu

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Contributors

MARTIN J. ARYEE  Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA,
USA; Broad Institute of MIT and Harvard, Cambridge, MA, USA
RHONDA BACHER  Department of Biostatistics, University of Florida, Gainesville, FL, USA
HAIDE CHEN  Center for Stem Cell and Regenerative Medicine, Zhejiang University School
of Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou,
China
WEIYAN CHEN  CAS Key Lab of Computational Biology, CAS-MPG Partner Institute for
Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institute of
Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences,
Shanghai, China
MEICHEN DONG  Department of Biostatistics, Gillings School of Global Public Health,
University of North Carolina, Chapel Hill, NC, USA
JEAN FAN  Department of Chemistry and Chemical Biology, Harvard University, Boston,
MA, USA
LIJIANG FEI  Center for Stem Cell and Regenerative Medicine, Zhejiang University School of
Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou, China
GUOJI GUO  Center for Stem Cell and Regenerative Medicine, Zhejiang University School of
Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou, China
GARY C. HON  Department of Obstetrics and Gynecology, Cecil H. and Ida Green Center for
Reproductive Biology Sciences, University of Texas Southwestern Medical Center, Dallas,
TX, USA
YUANHUA HUANG  EMBL-European Bioinformatics Institute, Cambridgeshire, UK
HONGKAI JI  Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health,
Baltimore, MD, USA
ZHICHENG JI  Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health,
Baltimore, MD, USA
LAN JIANG  Howard Hughes Medical Institute, Boston Children’s Hospital, Boston, MA,
USA; Program in Cellular and Molecular Medicine, Boston Children’s Hospital, Boston,
MA, USA; Division of Hematology/Oncology, Department of Pediatrics, Boston Children’s
Hospital, Boston, MA, USA
PENG JIANG  Regenerative Biology Laboratory, Morgridge Institute for Research, Madison,
WI, USA
YUCHAO JIANG  Department of Biostatistics, Gillings School of Global Public Health,
University of North Carolina, Chapel Hill, NC, USA; Department of Genetics, School of
Medicine, University of North Carolina, Chapel Hill, NC, USA; Lineberger
Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC, USA
DIVY KANGEYAN  Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA,
USA
BEOMSEOK KIM  Department of New Biology, DGIST, Daegu, Republic of Korea
JONG KYOUNG KIM  Department of New Biology, DGIST, Daegu, Republic of Korea
ALICIA T. LAMERE  Mathematics Department, Bryant University, Smithfield, RI, USA

ix
x Contributors

CALEB LAREAU  Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA, USA; Department of Pathology, Massachusetts General Hospital, Boston, MA,
USA
EUNMIN LEE  Department of New Biology, DGIST, Daegu, Republic of Korea
JUN LI  Applied and Computational Mathematics and Statistics Department, University of
Notre Dame, Notre Dame, IN, USA
IDA LINDEMAN  Wellcome Sanger Institute, Hinxton, Cambridge, UK; KG Jebsen Coeliac
Disease Research Centre and Department of Immunology, University of Oslo, Oslo, Norway
VILAS MENON  Janelia Research Campus, Howard Hughes Medical Institute, Ashburn, VA,
USA; Columbia University Medical Center, New York, NY, USA
GUIDO SANGUINETTI  School of Informatics, University of Edinburgh, Edinburgh, UK
KARTHIK SHEKHAR  Klarman Cell Observatory, Broad Institute of MIT and Harvard,
Cambridge, MA, USA
MICHAEL J. T. STUBBINGTON  Wellcome Sanger Institute, Hinxton, Cambridge, UK
HUIYU SUN  Center for Stem Cell and Regenerative Medicine, Zhejiang University School of
Medicine, Hangzhou, China; Stem Cell Institute, Zhejiang University, Hangzhou, China
ANDREW E. TESCHENDORFF  CAS Key Lab of Computational Biology, CAS-MPG Partner
Institute for Computational Biology, Shanghai Institute of Nutrition and Health,
Shanghai Institute of Biological Sciences, University of Chinese Academy of Sciences,
Chinese Academy of Sciences, Shanghai, China; UCL Cancer Institute, University College
London, London, UK
SHIQI XIE  Department of Obstetrics and Gynecology, Cecil H. and Ida Green Center for
Reproductive Biology Sciences, University of Texas Southwestern Medical Center, Dallas,
TX, USA
YINCONG ZHOU  Stem Cell Institute, Zhejiang University, Hangzhou, China; College of Life
Sciences, Zhejiang University, Hangzhou, China
QIAN ZHU  Dana-Farber Cancer Institute, Boston, MA, USA
Chapter 1

Quality Control of Single-Cell RNA-seq


Peng Jiang

Abstract
Single-cell RNA-seq (scRNA-seq) is emerging as a promising technology to characterize and dissect the
cell-to-cell variability. However, the mixture of technical noise and intrinsic biological variability makes
separating technical artifacts from real biological variation cells particularly challenging. Proper detection
and filtering out technical artifacts before downstream analysis are critical. Here, we present a protocol that
integrates both gene expression patterns and data quality to detect technical artifacts in scRNA-seq samples.

Key words scRNA-seq, Quality control, Integrate, Gene expression patterns, Data quality

1 Introduction

Single-cell RNA-seq (scRNA-seq) provides a relatively unbiased


approach to investigate the heterogeneity of cells in complex mix-
tures [1]. It has revolutionized our capacity to understand the
transcriptomic diversity of cellular states [2, 3], lineages [4], and
diseases [5]. However, one of the major challenges of this technol-
ogy is the noise behind the data [6, 7]. For example, profiling low
amounts of mRNA will likely lead to missing transcripts (“dropout”
events) during the reverse-transcription step and also substantially
distort original transcript abundance [6, 8]. Comparison of the top
differentially expressed genes between cell populations shows poor
consistency, suggesting that high variance could be caused by high-
magnitude outliers [8]. On the other hand, gene expression among
cells is inherently stochastic and the cell-to-cell variations can also
be a result of transcriptional bursts or fluctuations [9]. Controlling
quality of scRNA-seq and discarding technical artifacts are very
important for downstream analysis.
To detect potential technical artifacts (bad samples) in scRNA-
seq, previous studies have used various strategies that can be gener-
ally grouped into three categories. The first category involves using
housekeeping genes to perform quality control (QC). For example,
cells are filtered out if certain housekeeping genes (e.g., Actb,

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_1, © Springer Science+Business Media, LLC, part of Springer Nature 2019

1
2 Peng Jiang

Gapdh) are not expressed or abnormally expressed [10, 11]. The


assumption of this approach is that housekeeping genes are highly
and consistently expressed. This is true for bulk RNAs but it is not
necessarily true for single cells (see Note 1). For instance, a study
using single-cell qPCR has shown that the expressions of
housekeeping genes have high variations among individual cells,
and different cell types have distinguished housekeeping gene
expression patterns [12]. Thus, a reliance on housekeeping genes
to perform QC does not work for scRNA-seq samples. The second
category for QC involves using overall gene expression patterns to
define technical artifacts. For example, cells show distinguished
gene expression patterns if compared with the majority of the
cells are excluded from downstream analysis [13] (see Notes 2–3).
The major problem of these methods in this category is that they
may remove cells with real biological variation. The third category
involves using the number of genes detected and/or the reads
mapping rate to define technical artifacts [14]. However, the num-
ber of genes detected varies among experiments depending on the
quality of a particular library, cell type, or RNA-protocol. The
mapping rate cutoff is also hard to make, and thus the cutoff
settings are typically arbitrary. Thus, although single-cell
approaches hold great promise in investigating the cell heterogene-
ity, QC remains one of major challenges [7]. Nonetheless, our
previous studies and own work demonstrate that integrating both
gene expression patterns and sequencing data quality can be a
reasonable strategy for performing QC [15]. The basic assumption
of this approach is that if gene expression outliers are also associated
with poor sequencing library quality, they are more likely to be
technical artifacts than being real biological variation cells. We also
assume that gene expression outliers contain both cells with real
biological variation and technical artifacts, but the rest of the cells
(main population cells) in general are more likely to contain good
quality cells. Thus, we can use cells of the main population as
controls to estimate data quality cutoffs and a corresponding false
positive rate (FPR) (Fig. 1).
Here, we describe in detail our procedure for detecting techni-
cal artifacts in scRNA-seq using three batches of our published
human embryonic stem cells (ES cells) scRNA-seq data [16].

2 Materials

2.1 Lab Equipment 1. C1 Single-Cell Auto Prep IFC (Fluidigm).


2. EVOS FL Auto Cell Imaging system (Life Technologies).
3. Illumina HiSeq 2500 system.
Quality Control of Single-Cell RNA-seq 3

Data
D Quality (Main Population) False Positives Rate is estimated by % of main
population cells fail to pass data quality cutoffs
Cutoff
< 5% of Cells
Gene expression outliers
High Low
Technical Artifact with low data quality

Gene Expression Outliers

Subpopulation Cells

Fig. 1 Illustration of quality control (QC) for scRNA-seq framework. Cells can be separated out based on gene
expression patterns into gene expression outliers and cells of the main population. The data quality cutoffs are
determined by allowing a certain percentage (e.g., <5%) of main population cells that fail to pass them. The
technical artifacts are defined as gene expression outliers that fail to pass data quality cutoffs. The
subpopulation cells are defined as gene expression outliers that can pass data quality cutoffs

2.2 Kits 1. SMARTer PCR cDNA Synthesis kit (Clontech).


2. Advantage 2 PCR kit (Clontech).
3. Nextera XT DNA Sample Preparation Index Kit (Illumina).

2.3 ScRNA-seq Data 1. Raw scRNA-seq dataset (H1) can be accessed by Gene Expres-
sion Omnibus (GEO) with accession number (GSE64016).
2. The downloaded files from GEO are SRA format.
3. SRA toolkit (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.
cgi?view¼software) can be used to convert files from SRA
format to FASTQ format via “fastq-dump” utility.

3 Methods

3.1 H1 Human 1. Undifferentiated H1 human embryonic stem cells (hESCs)


Embryonic Stem Cells were cultured in E8 medium [17] on Matrigel-coated tissue
(hESCs) culture plates with daily media feeding at 37  C with 5% (vol/-
vol) CO2.
2. Cells were split every 3–4 days with 0.5 mM EDTA in 1  PBS
for standard maintenance.
3. Immediately before preparing single-cell suspensions for each
experiment, hESCs were individualized by Accutase (Life
Technologies), washed once with E8 medium, and resus-
pended at densities of 5.0–8.0  105 cells/mL in E8 medium
for cell capture.
4 Peng Jiang

4. The H1 hESCs is registered in the NIH Human Embryonic


Stem Cell Registry with the Approval Number: NIHhESC-10-
0043.
5. Details of the H1 cells can be found online (http://grants.nih.
gov/stem_cells/registry/current.htm?id¼29).

3.2 Single-Cell 1. 5000–8000 cells were loaded onto a medium size (10–17 μm)
Capture and cDNA C1 Single-Cell Auto Prep IFC (Fluidigm).
Library Preparation 2. The capture efficiency was inspected using EVOS FL Auto Cell
Imaging system (Life Technologies) to perform an automated
area scanning of the 96 capture sites on the IFC.
3. Empty capture sites or sites having more than one cell captured
were first noted and those samples were later excluded from
further library processing for RNA-seq.
4. Immediately after capture and imaging, reverse transcription
and cDNA amplification were performed in the C1 system
using the SMARTer PCR cDNA Synthesis kit (Clontech) and
the Advantage 2 PCR kit (Clontech).
5. Full-length, single-cell cDNA libraries were harvested the next
day from the C1 chip and diluted to a range of 0.1–0.3 ng/μL.
6. Diluted single-cell cDNA libraries were fragmented and ampli-
fied using the Nextera XT DNA Sample Preparation Kit and
the Nextera XT DNA Sample Preparation Index Kit (Illumina).
7. Libraries were multiplexed at 24 libraries per lane, and single-
end reads of 67-bp were sequenced on an Illumina HiSeq 2500
system.

3.3 Reads Mapping 1. Using Bowtie [18] to map raw reads against the reference
genes (e.g., human hg19 Refseq reference) allowing up to
two mismatches and a maximum of 20 multiple hits.
2. The mapped expected read counts and TPMs can be estimated
by RSEM [19].

3.4 Classification of 1. Given a cell, calculate a list of Spearman rank correlations


Cells into Gene comparing that given cell to the rest of the cells in the dataset
Expression Outliers (“one-to-others”).
and Cells of the Main 2. Then, that given cell is removed and a list of pairwise Spearman
Population rank correlations is calculated for the remaining cells
(“pairwise”).
3. Uses a one-sided Wilcoxon signed-rank test to assess whether
the “one-to-others” correlation is significantly lower than the
set of “pairwise” correlations.
4. A similar procedure is also performed using Pearson product-
moment correlations.
Quality Control of Single-Cell RNA-seq 5

5. Classify cells as either gene expression outliers or cells of the


main population based on p-values of both tests.
6. In this study, we define gene expression outliers as cells with p-
values less than 0.001 in both Spearman and Pearson tests.

3.5 Metrics to 1. Total number of mapped reads: the sum of mapped reads for all
Evaluate the scRNA- the genes. An extremely low number of mapped reads may
seq Library Quality affect the ability to characterize the transcriptome and could
be due to either a low mapping rate or other technical issues
introduced during sample prep or sequencing.
2. Mapping rate: the total number of mapped reads divided by the
read depth. Mapping rate can be effected by RNA degradation,
contamination with genomic DNA, or other technical issues
introduced during sample prep or sequencing.
3. Reads complexity: the ratio of unique reads (the count of reads
after removing duplicates) over the total number of all reads.

3.6 Combining 1. For each cell, calculates a quantile score (QS) for each quality
Library Quality Metrics metric. Given a metric, the QS of a cell is defined as the number
to Combined Scores of other cells in the dataset with equal or lower values divided
by the total number of cells. For example, if a cell has the 20th
highest mapping rate among a set of 80 cells, then the mapping
rate QS for this particular cell is 0.75. A higher QS indicates
better data quality.
2. Minimal Quantile Score (MQS): the minimal QS of the three
quality metrics.

MQS ¼ minfQSi g
i∈fmapped reads; mapping rate; reads complexityg
MQS assumes that each of the three quality metrics is critical
and that a deficiency in any of the three is a potential indicator of
technical issues. Thus the “final quality” of a cell depends on its
lowest quality metric score.
3. Weighted Combined Quality Score (WCQS): WCQS assumes
that the importance of each quality metric may depend on
specific experimental batches, protocols, and/or conditions.
WCQS assumes that the importance of each quality metric for
detecting technical artifacts is proportional to its ability to
discriminate between gene expression outliers and cells of the
main population. For example, given a batch of cells, if the
mapping rate of a given batch of cells can perfectly discriminate
between gene expression outliers and cells of the main popula-
tion, then it is more likely that the mapping rate is a dominant
player in detecting technical artifacts. In contrast, if a metric
does not indicate differences between gene expression outliers
and cells of the main population, then it should be removed
6 Peng Jiang

from prediction of potential technical artifacts. WCQS calcu-


lates a weighted aggregation quality score for each sample
defined as:
Pk
i¼1 w i Z i
WCQS ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi
q Pk 2
i¼1 w i

where Zi is the transformed Z-score of QS for data quality metric i,


according to Zi ¼ Φ1 (1  Pi) where Φ is the standard normal
cumulative distribution function, and Pi is the probability that
quality metric i is lower in a given cell than in the rest of the cells.
We estimate Pi is as Pi ¼ 1  QSi. To avoid numerical error, we can
set maximal and minimal Zi as +8.5 and 8.5, which corresponds to
Pi < 1016 and Pi > (1–1016), respectively. The wi is the weight-
ing factor for data quality metric i and is estimated according to the
individual quality metric’s ability to discriminate between cells of
the main population and gene expression outliers as
(
AUCi  0:5
wi ¼ ðAUCi > 0:5Þ
0:5
0 ðAUCi  0:5Þ
where AUCi is the area under the curve (AUC) of the receiver
operating characteristic (ROC) curve for quality metric i. If a
quality metric i (e.g., mapping rate) can perfectly discriminate
between cells of the main population and gene expression outliers,
then AUCi ¼ 1 and thus wi ¼ 1. If the values of a quality metric i are
randomly distributed in cells of the main population and gene
expression outliers, then the expected AUCi ¼ 0.5 and thus wi ¼ 0.

3.7 Identification of 1. We assume that good quality cells should pass particular MQS
Technical and WCQS cutoffs. We use cells of the main population as
controls to determine these cutoffs (see Note 4). You can
enumerate all possible combinatorial pairs of MQS and
WCQS cutoffs in a given dataset, calculate the fraction of cells
of the main population that pass both cutoffs of a pair, and then
uses the remaining cells of the main population to estimate the
corresponding false positive rate (FPR) for that pair (Fig.1).
2. If more than one pair of MQS and WCQS cutoffs results in the
same FPR, you can choose the cutoff pair that maximizes the
percentage of gene expression outliers failing to pass.
3. Applies these cutoffs to the gene expression outliers to identify
technical artifacts. Technical artifacts are defined as gene
expression outliers with poor data quality measurements.

3.8 SinQC Software 1. SinQC [15] is designed for implementing (Subheadings


3.3–3.6) (see Note 5).
2. The SinQC software and detailed user manual are available at
http://www.morgridge.net/SinQC.html
Quality Control of Single-Cell RNA-seq 7

4 Notes

1. Several studies used housekeeping genes to perform quality


control for scRNA-seq datasets [10, 11]. To further investigate
the feasibility of using housekeeping genes to perform quality
control for scRNA-seq datasets, we calculated the gene expression
levels (TPMs) for two housekeeping genes (Actb and Gapdh) in a
mouse scRNA-seq dataset [20]. The Gapdh is significantly higher
expressed in ES cells than in MEF cells (P ¼ 5.6e–06, 1-sided
Wilcoxon rank sum test) while the Actb is significantly lower
expressed in ES cells than in MEF cells (P < 2.2e–16, 1-sided
Wilcoxon rank sum test) [15]. It suggests that it is infeasible to use
housekeeping genes to perform QC for scRNA-seq dataset.
2. Using median gene expression values or the number of genes
detected (TPM > 1) to perform quality control (QC): Low
data quality (e.g., low mapping rate) can result in fewer number
of genes detected or low median gene expression values. How-
ever, the number of genes detected (TPM > 1) can also be
biologically related. The number of genes detected varies
depending on the quality of a particular library and cell types
[8]. We calculated the number of genes detected in a highly
heterogeneous scRNA-seq dataset containing 301 cells (mix-
ture of 11 different cell types) [4]. The number of genes
detected is highly cell type dependent, suggesting using the
number of gene detected to identify technical artifacts will
result in substantial bias ([15], Fig. S8). For highly heteroge-
neous scRNA-seq datasets, the technical artifacts detected by
this method are more likely to have fewer genes detected if
compared with QC pass cells. But this does not mean that the
cells with fewer genes detected are technical artifacts.
3. Using “genes detected and/or mapping rate” to perform qual-
ity control (QC): The basic idea of using “genes detected
and/or mapping rate” [14] to perform QC is that the fewer
number of genes detected could be due to both technical issues
and biological heterogeneity. But if a cell with fewer genes
detected is also associated with low mapping rate (mapping
rate is technical related), the cell might be more likely to be a
technical artifact. This approach is the most conceptually simi-
lar to our method. However, our method has strengths on two
aspects: First, since the mapping rate and the number of genes
detected are not directly correlated, the mapping rate cutoff
and the number of genes detected cutoff chosen are very
difficult and arbitrary. Our method maximizes the probability
that the technical artifacts are correctly detected while also
minimizing the false positives by using cells of the main popu-
lation as data quality controls. Second, in addition to mapping
rate, our method also takes other library quality metrics into
consideration (e.g., library complexity).
8 Peng Jiang

4. Our method assumes that gene expression outliers contain


both technical artifacts and biological variant cells, but cells of
the main population, in general, are more likely to contain
good quality cells. Thus, our method uses cells of the main
population as controls to estimate data quality score cutoffs and
a corresponding false positive rate (FPR). However, given a
FPR, it is a challenge to estimate the corresponding false nega-
tive rate (technical artifacts that are missed), due to that
scRNA-seq has no “ground-truth” for “bad samples.” Sensi-
tivity (also called the true positive rate) is the proportion of
positives (“technical artifacts”) that are correctly identified.
Specificity (also called the true negative rate) measures the
proportion of negatives (“good quality single cells”) that are
correctly identified. Since scRNA-seq has no “ground-truth”
for “good samples” and “bad samples,” it is a challenge to
estimate these two measurements directly. To further compare
the sensitivity and specificity of our method in high-
heterogeneity and low-heterogeneity datasets, we applied our
method to datasets with mixture of different portions of cell
types, and compared the overlap of technical artifacts detected
among them. For example, using a mouse scRNA-seq dataset
(48 ES cells and 44 MEF cells) [20], we mixed the cells into
three different categories: high-heterogeneity (48 ES cells +44
MEF cells), medium-heterogeneity (“ES cells (all) + 1/5
(MEF) cells” and (“MEF cells (all) + 1/5 (ES) cells”), and
low heterogeneity ((48 ES cells) and (44 MEF cells), sepa-
rately). Our method detects two technical artifacts (ESC_46
and ESC_32) in the high-heterogeneity dataset (48 ES cells
+44 MEF cells). These two technical artifacts can also be
robustly detected either in medium-heterogeneity dataset or
low heterogeneity dataset. However, if we apply our method to
each individual ES (48 cells) or MEF (44 cells) dataset sepa-
rately, we can detect more artifacts, comparing to apply our
method on pooled mixture datasets (48 ES cells +44 MEF
cells). Therefore, we conclude that our method increases spec-
ificity at the cost of dropping sensitivity when the extent of
heterogeneity in a dataset is high. In highly heterogeneous cell
populations, detecting technical artifacts carries a higher risk of
dropping real biological variation cells. The increased specific-
ity and decreased sensitivity of our method for highly hetero-
geneous cell populations is a good feature that can minimize
the false positives.
5. The running SinQC for scRNA-seq QC is not restrictive to
RSEM output files (“*.genes.results”). For users who do not
use RSEM, they can make a customized RSEM files (“*.genes.
results”) to run SinQC. A detailed manual can be found in
SinQC website (http://www.morgridge.net/SinQC.html).
Quality Control of Single-Cell RNA-seq 9

References
1. Eberwine J, Sul J-Y, Bartfai T, Kim J (2014) 11. Treutlein B, Brownfield DG, Wu AR, Neff NF,
The promise of single-cell sequencing. Nat Mantalas GL, Espinoza FH et al (2014) Recon-
Methods 11(1):25–27 structing lineage hierarchies of the distal lung
2. Trapnell C, Cacchiarelli D, Grimsby J, epithelium using single-cell RNA-seq. Nature
Pokharel P, Li S, Morse M et al (2014) The 509(7500):371–375. Epub 2014/04/18.
dynamics and regulators of cell fate decisions https://doi.org/10.1038/nature13173
are revealed by pseudotemporal ordering of 12. Oyolu C, Zakharia F, Baker J (2012) Distin-
single cells. Nat Biotechnol 32(4):381–386. guishing human cell types based on
Epub 2014/03/25. https://doi.org/10. housekeeping gene signatures. Stem Cells 30
1038/nbt.2859 (3):580–584
3. Xue Z, Huang K, Cai C, Cai L, Jiang CY, Feng Y 13. Zeisel A, Muñoz-Manchado AB, Codeluppi S,
et al (2013) Genetic programs in human and Lönnerberg P, La Manno G, Juréus A et al
mouse early embryos revealed by single-cell (2015) Cell types in the mouse cortex and
RNA sequencing. Nature 500(7464):593–597. hippocampus revealed by single-cell RNA-seq.
Epub 2013/07/31. https://doi.org/10. Science 347(6226):1138–1142
1038/nature12364 14. Kumar RM, Cahan P, Shalek AK, Satija R,
4. Pollen AA, Nowakowski TJ, Shuga J, Wang X, DaleyKeyser AJ, Li H et al (2014) Decon-
Leyrat AA, Lui JH et al (2014) Low-coverage structing transcriptional heterogeneity in plu-
single-cell mRNA sequencing reveals cellular ripotent stem cells. Nature 516(7529):56–61.
heterogeneity and activated signaling pathways Epub 2014/12/05. https://doi.org/10.
in developing cerebral cortex. Nat Biotechnol 1038/nature13920
32(10):1053–1058. Epub 2014/08/05. 15. Jiang P, Thomson JA, Stewart R (2016) Qual-
https://doi.org/10.1038/nbt.2967 ity control of single-cell RNA-seq by SinQC.
5. Patel AP, Tirosh I, Trombetta JJ, Shalek AK, Bioinformatics 32(16):2514–2516. https://
Gillespie SM, Wakimoto H et al (2014) Single- doi.org/10.1093/bioinformatics/btw176
cell RNA-seq highlights intratumoral hetero- 16. Leng N, Chu LF, Barry C, Li Y, Choi J, Li X,
geneity in primary glioblastoma. Science 344 et al (2015) Oscope identifies oscillatory genes
(6190):1396–1401. Epub 2014/06/14. in unsynchronized single-cell RNA-seq experi-
https://doi.org/10.1126/science.1254257 ments. Nat Methods 12(10):947–950.
6. Sandberg R (2014) Entering the era of single- https://doi.org/10.1038/nmeth.3549.
cell transcriptomics in biology and medicine. PubMed PMID: 26301841; PubMed Central
Nat Methods 11(1):22–24 PMCID: PMC4589503
7. Stegle O, Teichmann SA, Marioni JC (2015) 17. Chen G, Gulbranson DR, Hou Z, Bolin JM,
Computational and analytical challenges in Ruotti V, Probasco MD et al (2011) Chemically
single-cell transcriptomics. Nat Rev Genet 16 defined conditions for human iPSC derivation
(3):133–145 and culture. Nat Methods 8(5):424–429.
8. Kharchenko PV, Silberstein L, Scadden DT https://doi.org/10.1038/nmeth.1593
(2014) Bayesian approach to single-cell differ- 18. Langmead B, Trapnell C, Pop M, Salzberg SL
ential expression analysis. Nat Methods 11 (2009) Ultrafast and memory-efficient align-
(7):740–742 ment of short DNA sequences to the human
9. Munsky B, Neuert G, van Oudenaarden A genome. Genome Biol 10(3):R25. https://
(2012) Using gene expression noise to under- doi.org/10.1186/gb-2009-10-3-r25
stand gene regulation. Science 336 19. Li B, Dewey CN (2011) RSEM: accurate tran-
(6078):183–187 script quantification from RNA-Seq data with
10. Ting DT, Wittner BS, Ligorio M, Vincent or without a reference genome. BMC Bioinfor-
Jordan N, Shah AM, Miyamoto DT et al matics 12:323. https://doi.org/10.1186/
(2014) Single-cell RNA sequencing identifies 1471-2105-12-323
extracellular matrix gene expression by pancre- 20. Islam S, Kj€allquist U, Moliner A, Zajac P, Fan
atic circulating tumor cells. Cell Rep 8 J-B, Lönnerberg P et al (2011) Characteriza-
(6):1905–1918. Epub 2014/09/23. https:// tion of the single-cell transcriptional landscape
doi.org/10.1016/j.celrep.2014.08.029 by highly multiplex RNA-seq. Genome Res 21
(7):1160–1167
Chapter 2

Normalization for Single-Cell RNA-Seq Data Analysis


Rhonda Bacher

Abstract
In this chapter, we describe a robust normalization method for single-cell RNA sequencing data. The
procedure, SCnorm, is implemented in R and is part of Bioconductor. Also included in the package are
diagnostic functions to visualize normalization performance. This chapter provides an overview of the
methodology and provides example work-flows.

Key words Single-cell RNA-seq, Normalization, Gene expression, Read count, High-throughput
sequencing

1 Introduction

The purpose of normalization is to remove the effects of systematic


technical variability on the observed gene expression measure-
ments. Popular RNA-seq analyses such as sample clustering or
classification, and differential gene expression require between-
sample normalization as a first step to ensure measurements are
comparable across samples. Variation in sequencing depth (the total
amount each sample has been sequenced) is one particular artifact
that arises during the RNA-seq experiment. As samples are
sequenced more deeply, all gene counts should, on average,
increase proportionally [1]. We call this dependence of the read
counts on sequencing depth the count-depth relationship [2].
Most existing normalization methods developed for bulk
RNA-seq experiments calculate global scale factors to adjust each
sample for sequencing depth (one scale factor per sample is applied
to all genes in the sample) [1, 3–5]. These methods work well when
the count-depth relationship is common across genes, as it is in
bulk RNA-seq data (Fig. 1a). When global scale-factor normaliza-
tion [3] is effective, the resulting normalized count-depth relation-
ships are near zero for most genes (Fig. 1b). Conversely, in single-
cell RNA-seq data (scRNA-seq) genes exhibit variability in the
count-depth relationship (Fig. 1c). In this case, when the count-

Guo-Cheng Yuan (ed.), Computational Methods for Single-Cell Data Analysis, Methods in Molecular Biology, vol. 1935,
https://doi.org/10.1007/978-1-4939-9057-3_2, © Springer Science+Business Media, LLC, part of Springer Nature 2019

11
12 Rhonda Bacher

A Bulk C Single-cell

2.0
7 Expression Expression

9
High High
Un-normalized

8
6

1.5
7
Log Expression

Log Expression
5

6
Density

Density
4

5
Low Low

1.0
4
3

3
2

0.5
2
1

0.0
0

0
14.5 14.7 14.9 15.1 15.3 −1 0 1 2 14.0 14.5 15.0 15.5 16.0 −2 −1 0 1 2 3

Log Sequencing Depth Slope Log Sequencing Depth Slope

B D

2.0
7

9
Expression Expression
High High

8
6

1.5
7
Normalized
Log Expression

Log Expression
5

6
3
Density

Density
4

5
Low Low

1.0
4
3

3
2

0.5
2
1
1

0.0
0

0
14.5 14.7 14.9 15.1 15.3 −2 −1 0 1 2 14.0 14.5 15.0 15.5 16.0 −2 −1 0 1 2 3
Log Sequencing Depth Slope Log Sequencing Depth Slope

Fig. 1 For each gene, median quantile regression was used to estimate the count–depth relationship for a bulk
and single-cell RNA-seq dataset before and after normalization. (a) The left plot shows log un-normalized
expression versus log depth with estimated regression fits for three genes in a bulk RNA-seq dataset
containing no zero measurements and having low, moderate, and high expression defined as the median
expression among nonzero un-normalized measurements in the 10th to 20th quantile (blue), 40th to 50th
quantile (black), and 80th to 90th quantile (red), respectively. On the right, the estimated regression fits for all
genes within ten equally sized gene groups where genes were grouped by their median expression among
nonzero un-normalized measurements. (b) Similar to a for normalized bulk RNA-seq data. (c and d) Similar to
a and b for single-cell RNA-seq data

depth relationship is not common across genes, the application of


traditional global scale factor normalization methods leads to
biased counts (Fig. 1d).
Our method, SCnorm [2], addresses the variability in the
count-depth relationship in scRNA-seq data. SCnorm uses quantile
regression to estimate the count-depth relationship for each gene.
Genes with similar relationships are grouped together, and a second
quantile regression estimates scaling factors within each group. In
this chapter, we will further describe the SCnorm method and its
software implementation.

2 Materials

Our method for normalizing single-cell RNA-seq data is imple-


mented in the R package SCnorm and is available on Bioconductor:
https://bioconductor.org/packages/devel/bioc/html/
SCnorm.html
The package is compatible with R versions greater than 3.4.0.
The package may be installed directly from Bioconductor:
Normalization for Single-Cell RNA-Seq Data Analysis 13

source("https://bioconductor.org/biocLite.R")
biocLite("SCnorm")

3 Methods

We describe our approach step-by-step as follows:


1. Filter genes. Filtering of genes with very low average expression
is a common preprocessing step in RNA-seq analyses. As
detailed in Note 1, we require genes to have at least ten
nonzero expression counts. Filtered genes will not be included
in the normalization.
2. Estimate count-depth relationship per gene. Let Yg,j denote the
log nonzero expression for gene g in cell j for g ¼ 1,...,m and
j ¼ 1,...,n.
XLet 
Xj denote the log sequencing depth, calculated
m
as log g¼1
eY g, j
: A median quantile regression is fit to
estimate each gene’s relationship between log un-normalized
expression and log sequencing depth as
 
Q 0:5 Y g , j jX j ¼ βg , 0 þ βg , 1 X j
The estimated slope, βbg , 1 , represents the count-depth relation-
ship for gene g.
3. Group genes. Genes are clustered into K groups based on their
βbg , 1 using the K-medoids algorithm [6] via the clara function
in the R package cluster [7]. If a cluster contains less than
100 genes, it is joined to the nearest cluster based on the cluster
centers (medoids). Starting at K ¼ 1, SCnorm will increase
K to K + 1 until a convergence criterion is satisfied (described
below in step 5. Evaluating K).
4. Group fit. Within a homogenous gene group, a representative
gene could be selected and the scale factors calculated as the
ratio of its counts in each sample to the gene’s overall mean.
However, due to numerous zeros in the data and high levels of
variability [8], a representative gene for the group must be
estimated. We estimate a representative gene as the predicted
values from a quantile regression between the log nonzero
un-normalized expression counts from all genes in the group
and log sequencing-depth. For computational reasons, we
restrict the number of genes considered (see Note 1) to the
25% whose βbg , 1 is closest to the group’s overall count-depth
relationship, estimated as modeg β^g , 1 .
When fitting the quantile regression, the median may not
always best represent subtle effects of the count-depth relation-
ship for genes in a particular group, therefore we consider
multiple quantiles τ and degrees d and fit:
14 Rhonda Bacher

 
Q τk , d k Y j jX j ¼ βτ0k þ βτ1k X j þ . . . þ βτdk X dj k

We fit over a grid of combinations of τ and d from τ ∈ (.05, .10,


. . ., .85, .90) and d ∈ (1, 2, . . ., 6). For each model fit, the pre-
dicted values, Y b τk , d k , can be viewed as a representative gene. To
j

evaluate which specific values of τk and d k , τ∗ k and d k , are optimal,
we choose the model in which the predicted gene’s count-depth
relationship is closest to the group’s mode. Specifically, the pre-
dicted gene’s count depth relationship, b η τ1k , d k , is estimated using
median quantile regression:
 τk , d k 
Q 0:5 Y b jX j ¼ ητ0k , d k þ ητ1k , d k X j
j

The selected model is then the τk and dk which minimize


 modeg β^g , 1 j
η τ1k , d k
j^
Once the best model is selected, the scale factors are estimated
∗ ∗
b τk , d k
eY j ∗
as SF j , k ¼ τ∗
where Y τk is the τ*th quantile of expression
eY k

counts in the kth group. The normalized counts Y 0g , j are given by


Y
e g, j
SF j , k .
Evaluating K. To determine if a given K is sufficient, we
evaluate the count-depth relationship of the normalized data. For
every gene, its log normalized counts are regressed against the
original log sequencing depth in a median quantile regression:
 
Q 0:5 Y 0g , j jX j ¼ β0g , 0 þ β0g , 1 X j

We check that the distribution of β0g , 1 is centered around zero


for genes across all expression levels. To do so, we group genes into
ten equally sized groups based on their median nonzero expression.
For each group, the mode of the slopes is estimated. Any mode
outside of (0.1, 0.1) is evidence that the current K is not optimal,
and in which case K is increased by one and the group fit and
evaluation steps are repeated (see Note 2 for an example).
Multiple Conditions. When multiple conditions or batches are
present, SCnorm utilizes the procedure above for each condition
and then rescales across all cells. During rescaling, all genes are split
into quartiles based on their nonzero median un-normalized
expression measurements. Within each quartile and condition,
each gene is adjusted by a common scale factor defined as the
median of the gene specific fold-changes, where fold-changes are
calculated between each gene’s condition-specific mean and its
mean across conditions. Means are calculated over nonzero counts
(see Note 3 for additional details). If spike-ins are available and are
representative of the biological genes, they may be used in this step
to estimate the across condition scaling factors (additional details
are in Note 4).
Normalization for Single-Cell RNA-Seq Data Analysis 15

4 Notes

Note 1 reviews all arguments for running the main functions in the
SCnorm package. Note 2 demonstrates a standard workflow for
normalizing scRNA-seq data and evaluating the normalization.
Note 3 reviews important considerations for normalization when
multiple conditions are present. Note 4 details how to use spike-ins
for across condition scaling.
1. There are two main functions accessible by the user in the
SCnorm package:
SCnorm—implements the normalization procedure.
plotCountDepth—estimates and graphically displays the
count-depth relationships.
The SCnorm function requires only two arguments: the
un-normalized expression matrix (Data) and a vector denoting
the condition or batch each cell belongs to (Conditions). The
Data argument should contain un-normalized expression measure-
ments with genes (or other features) on the rows and cells on the
columns. The expected format of Data is a data matrix in R or of
class SummarizedExperiment of the SummarizedExperiment
R package [9]. The Conditions argument should be a vector with
length equal to the number of cells and match the exact order of the
columns of the Data argument.
The SCnorm function will implement the entire normalization
procedure described above, automatically iterating until an optimal
K is reached. A dataset with 100 cells and 20,000 genes will take
approximately 15 min to run with three computing cores. The
computation time will increase as the number of cells and genes
increases, though increasing the number of cores can be used to
offset the increased time. In the example given, increasing to seven
cores reduces the time to around 8 min.
The output of SCnorm is a SummarizedExperiment object
containing at minimum the normalized data (NormalizedData)
and a list of the genes not included in the normalization (Genes-
FilteredOut). Additional outputs may be generated using the
non-default options and are described in more detail below.
The full SCnorm function with default arguments is:

SCnorm(Data = NULL, Conditions = NULL, PrintProgressPlots = FALSE,


reportSF = FALSE, FilterCellNum = 10, FilterExpression = 0,
Thresh = 0.1, K = NULL, NCores = NULL, ditherCounts = FALSE,
PropToUse = 0.25, Tau = 0.5, withinSample = NULL, useZerosToScale
= FALSE,
useSpikes = FALSE)
16 Rhonda Bacher

Additional options the user may specify are:


PrintProgressPlots: If set to TRUE, SCnorm will automat-
ically produce count-depth plots evaluating each value of
K attempted. It is highly recommended to set this option to
TRUE and wrap the SCnorm function call in a pdf (or other gra-
phics) device. Although SCnorm determines the optimal
K internally, viewing these plots is one way to evaluate the results
of the normalization.
reportSF: If set to TRUE, SCnorm will output the scaling
factors calculated for each group. This may be useful if users plan
to conduct downstream differential expression (DE) tests, where
the matrix of scale factors may be provided to functions in the
EBSeq [10] and DESeq2 [11] packages which require the
un-normalized data as input. For scRNA-seq DE tools, such as
MAST [12] and scDD [13], the normalized counts may be used
directly.
FilterExpression: A single numeric value denoting the
cutoff used to exclude genes from the normalization based on
average expression. Occasionally (but rarely) very lowly expressed
genes with many tied counts may hinder the convergence of
SCnorm and need to be filtered from the normalization procedure.
FilterCellNum: The minimum number of nonzero counts
required for a gene to be included in the normalization. This value
must be larger than or equal to ten. Since SCnorm fits a median
quantile regression for each gene, we require a minimum of ten
nonzero expression values for each gene by default. This is usually
sufficient in practice although the user may wish to further filter
genes that have expression in at least some proportion of cells. In
that case, the proportion should be converted to an integer for the
specific experiment to be normalized.
Tau: This option likely does not need to be adjusted by the
user. It specifies the quantile used to estimate the quantile regres-
sion. We recommend using the median (Tau¼.5) quantile.
PropToUse: During the group fitting, only a subset of genes is
used in order to speed up computation time. We recommend using
25% of genes in the group. However, if the data are summarized by
transcripts rather than genes, additional reduction in computation
time may be obtained by setting this to 10%.
Thresh: During the evaluation procedure, SCnorm will calcu-
late the distance of the normalized slope modes from zero. The
default distance is 0.1 and in practice this offers the best trade-off
between bias and computation time. A smaller value will take
longer to converge, while a larger value may not result in the
optimal normalization.
K: The number of groups to split the genes into for normaliza-
tion. If any numeric value is supplied the iterative procedure of
SCnorm will not be performed. Users are not advised to set this
option. However, it may be helpful in quickly obtaining normalized
Normalization for Single-Cell RNA-Seq Data Analysis 17

data in the instance that the user previously ran the normalization
but did not save the data or wishes to change arguments that only
affect the across-condition scaling step.
NCores: The number of cores to use. The more cores available,
the faster SCnorm will perform. By default, SCnorm will use one
less than the number of cores available on the machine.
ditherCounts: When this option is set to TRUE, counts will
be randomly jittered by 0.01 prior to fitting. With unique molecu-
lar identifier (UMI) scRNA-seq experiments, the data typically have
many tied count values, which occasionally cause the quantile
regression fit to fail. We find that dithering the counts by a small
value avoids this issue and does not otherwise affect the normaliza-
tion procedure or resulting normalized counts.
withinSample: As demonstrated in previous papers, gene-
specific features may vary across samples. We have implemented
the method from Risso et al. [14] if the user wishes to first normal-
ize the counts based on a gene-specific feature such as GC content
or gene length. This argument expects a vector of equal length to
the number of rows of Data (and in matching order) with values
representing the gene-specific feature to normalize. Note that
within sample normalization should be used with caution as it is
often specific to the experiment and exploratory analyses are highly
recommended.
useZerosToScale: If set to TRUE, the zeros will be used
when scaling across conditions. Use of this argument depends on
which downstream differential expression tool will be used. If using
methods which test zeros separately from continuous counts, such
as MAST [12] or scDD [13], this option should remain FALSE.
However, for methods such as DESeq2 [11] which test all counts
together, this flag should be set to TRUE. A detailed example is
given in Note 3.
useSpikes: We do not implement the use of spike-ins for
within group normalization at this time because there are currently
too few to estimate scale factors robustly in all groups. However,
when multiple conditions or batches are being normalized, if this
argument is TRUE then spike-ins will be used to perform the across
condition scaling. The spike-ins are expected to be named follow-
ing the convention of “ERCC-”. Additional details regarding the
use of spike-ins is given in Note 4.
The plotCountDepth function is used to visualize the count-
depth relationships. It includes a wrapper for internal functions that
estimate the count-depth relationships and then outputs a plot.
During the normalization, if PrintProgressPlots¼TRUE, mul-
tiple calls will be made to the plotCountDepth function, other-
wise the function may be used stand-alone. The required
arguments are Data and Conditions similar to the SCnorm
function. All genes will be split into ten equally sized groups
based on their nonzero un-normalized median expression. A
18 Rhonda Bacher

density curve of the gene-specific count-depth relationships will be


plot with a different color for each expression group. In addition to
outputting a plot, the argument will provide a list object with each
gene’s name, the expression group it belongs to, and its estimated
count-depth relationship.
The default arguments for the plotCountDepth function are:

plotCountDepth <- function(Data, NormalizedData= NULL, Conditions =


NULL,
Tau = .5, FilterCellProportion = .10,
FilterExpression = 0, NumExpressionGroups =
10,
NCores=NULL, ditherCounts = FALSE)

Most of the additional arguments are the same as for the


SCnorm function, including ditherCounts, NCores, Filter-
Expression, and Tau. Here we use the option FilterCellPro-
portion, which allows the user to directly filter genes from
consideration if they do not have a certain proportion of cells
expressed. Arguments unique to this function are:
NumExpressionGroups: The number of equally sized expres-
sion groups used to visualize the count-depth relationships. We
typically use ten groups.
NormalizedData: If evaluating normalized data, then this
argument specifies the normalized data matrix (the format is
expected to be the same as the Data argument). Supplying this
argument is critical to ensure the evaluation of the normalized
measurements is done in terms of the original sequencing depths.
2. Here we apply SCnorm to a scRNA-seq dataset of 92 H1
human embryonic stem cells. Prior to sequencing, and follow-
ing library preparation, the cDNA for each cell was split into
two pools. One pool was sequenced with 96 cells per lane,
while the other pool was sequenced with 24 cells per lane.
Since lanes average similar numbers of total reads, this setup
results in one pool of cells having an average sequencing depth
of one million (H1-1 M) and the other with an average of four
million (H1-4 M). Prior to normalization, the H1-4 M group
will appear four times higher on average than the H1-1 M
group. Since the cells are exactly the same, these data provide
a benchmark to evaluate normalization procedures. It may be
downloaded directly from GEO (file: GSE85917_Bacher.
RSEM.xlsx):
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?
acc¼GSE85917
Normalization for Single-Cell RNA-Seq Data Analysis 19

We will use the first two sheets in the Excel file, which can be
loaded into R by:

> library(readxl)
> h1cells.4M <-
data.frame(read_excel("GSE85917_Bacher.RSEM.xlsx",
sheet=1), stringsAsFactors=F)
> h1cells.1M <-
data.frame(read_excel("GSE85917_Bacher.RSEM.xlsx",
sheet=2), stringsAsFactors=F)

Next, we visualize the variability in the count-depth relation-


ship across each of the datasets.

> library(SCnorm)
> cdr.1M <- plotCountDepth(Data = h1cells.1M, Conditions =
rep("1M",
ncol(h1cells.1M)))
> cdr.4M <- plotCountDepth(Data = h1cells.4M, Conditions =
rep("4M",
ncol(h1cells.4M)))

The results are shown in Fig. 2a, b. The count-depth relation-


ship varies across genes, indicating that SCnorm should be used to
normalize the data.
Using SCnorm we can normalize each batch independently and
then scale across the batches by specifying the Conditions
argument:

> Conditions <- c(rep("1M", ncol(h1cells.1M)), rep("4M",


ncol(h1cells.4M)))
> allData <- cbind(h1cells.1M, h1cells.4M)
> myNormData <- SCnorm(Data = allData, Conditions = Conditions,
PrintProgressPlots = TRUE, reportSF = TRUE)

Since PrintProgressPlots¼TRUE, each attempted value of


K will be plot until the procedure reaches convergence (Fig. 2c, d).
SCnorm finished in 11 min using three cores.
The output is obtained by using the results function:

> normData <- results(myNormData, type = "NormalizedData")


> genesOUT <- results(myNormData, type = "GenesFilteredOut")
> scaleFactors <- results(myNormData, type = "ScaleFactors")

3. Single-cell specific differential expression methods such as


MAST [12] and scDD [13] test the continuous (or nonzero)
counts separately from zero counts. Thus, the default option
20 Rhonda Bacher

A H1 - 1M
Expression Group Medians B 2.5
H1 - 4M
Expression Group Medians
2.0 1.67 (lowest) 2 (lowest)

3 5
2.0
4.5 8.5
1.5
15

Density
6.71 1.5
Density

10 27
1.0 48.62
15 1.0
23.5 88
0.5 0.5 155.88
39
71.32 299.6
0.0 0.0 852.85 (highest)
202.96 (highest)
−2 0 2 −2 0 2
C Slope
D Slope

Fig. 2 Count-depth relationships before and during normalization for the H1-1 M and H1-4 M data. (a) For the
H1-1 M dataset, the estimated count-depth relationships for all genes within ten equally sized gene groups
where genes were grouped by their median expression among nonzero un-normalized measurements. (b)
Similar to a for the H1-4 M dataset. (c) The count-depth relationship is shown for the normalized counts for
each value of K tried by SCnorm for the H1-1 M dataset. The genes remain in their initial expression groups as
shown in a. (d) Similar to C but for the H1-4 M dataset

for SCnorm when scaling across conditions is use-


Zeros ¼ FALSE in order to ensure the nonzero counts are
scaled appropriately. The reason for this is demonstrated in
Fig. 3 for a single-gene example in the H1-1 M and H1-4 M
datasets.
In this example, since the two conditions contain the exact
same cells, the normalized counts should have the same mean
across the two groups. When useZeros¼FALSE, SCnorm esti-
mates across condition scale factors based on the nonzero expres-
sion counts and the resulting nonzero means are equal (Fig. 3a).
This will lead to an appropriate call of not differentially expressed by
methods such as MAST [12] and scDD [13]. Figure 3b demon-
strates that if the DE method includes zeros in the differential
expression test, then the overall mean of the two groups will not
be equal. This effect is more pronounced in this example since the
two conditions have very different proportions of zeros. To avoid
an incorrect call of differential expression, the argument
Normalization for Single-Cell RNA-Seq Data Analysis 21

Mean of non-zero Mean of all


expression expression
A B
useZeros =FALSE

5
H1−1M mean H1−1M mean
H1−4M mean H1−4M mean
4

4
Log (Expression + 1)

Log (Expression + 1)
3

3
2

2
1

1
0

0
13.0 14.0 15.0 16.0 13.0 14.0 15.0 16.0
Log Sequencing Depth Log Sequencing Depth
C H1−1M mean
D H1−1M mean
useZeros =TRUE

H1−4M mean H1−4M mean


5

5
4

4
Log (Expression + 1)

Log (Expression + 1)
3

3
2

2
1

1
0

13.0 14.0 15.0 16.0 13.0 14.0 15.0 16.0


Log Sequencing Depth Log Sequencing Depth

Fig. 3 For a single gene, the log of the normalized counts for both the H1-1 M (blue) and H1-4 M (red) datasets
versus log sequencing depth are shown. A constant of one was added to the counts before taking the log to
highlight the zero counts. The top and bottom rows contain the normalized counts when useZer-
os¼FALSE or useZeros¼TRUE, respectively. The left column shows the condition-specific means
calculated on the nonzero counts only. The right column shows the condition-specific means calculated over
all counts

useZeros¼TRUE should be used instead and SCnorm will scale


across conditions including the zeros in the scale factor estimation.
Under this option, the means of the nonzero counts will not be
equal if the proportion of zeros is different across conditions and
will result in the nonzero mean of the group with more zeros to
appear higher (Fig. 3c), however the gene means including zeros
will be equal (Fig. 3d).
The results of the across condition scaling under the two
options are most critical when the conditions have very different
Another random document with
no related content on Scribd:
precious, is superior to that of coins, which were often carelessly
executed, as being merely designed for a medium of commercial
exchange. High art would not usually spend itself upon small
copper money, but be reserved for the more valuable pieces,
especially those of gold and silver[16]. The subjects of gems are
mostly mythological, or are connected with the heroic cycle; a
smaller, but more interesting number, presents us with portraits,
which however are in general uninscribed. At the same time, by
comparing these with portrait-statues and coins we are able to
identify Socrates, Plato, Aristotle, Demosthenes, Alexander the
Great, several of the Ptolemies, and a few others; most of which
may have been engraved by Greco-Roman artists. But the
catalogue of authentic portraits preserved to us, both Greek and
Roman, is, as K. O. Müller observes, now very much to be
thinned.
16. This remark however must not be pressed too closely. Certain small Greek
copper coins of Italy, Sicily, &c., are exceedingly beautiful.

With regard to ancient iconography in general, coins, without


doubt, afford the greatest aid; but no certain coin-portraits are, I
believe, earlier than Alexander[17]. The oldest Greek portrait-
statue known to me is that of Mausolus, now in the British
Museum; but the majority of the statues of Greek philosophers
and others are probably to be referred to the Roman times, when
the formation of portrait-galleries became a favourite pursuit.
With the Greeks it was otherwise; the ideal was ever uppermost
in their mind: they executed busts of Homer indeed and placed
his head on many of their coins; but of course these were no
more portraits than the statues of Jupiter and Pallas are portraits.
With regard to the relation of Greek archæology to the history of
Greece, both the monuments and the literature are abundant,
and they mutually illustrate one another; and the same remark is
more or less true for the histories of the nations afterwards to be
mentioned, upon which I shall therefore not comment in this
respect.
17. I am aware that there are reasons for believing that a Persian coin preserves a
portrait of Artaxerxes Mnemon, who reigned a little earlier.
From Greece, who taught Rome most or all that she ever
knew of the arts, we pass to the contemplation of the mistress of
the world herself. She found indeed in her own vicinity an earlier
civilisation, the Etruscan, whose archæological remains and
history generally are amongst the most obscure and perplexing
matters in all the world of fore-time. The sepulchral and other
monuments of Etruria are often inscribed, but no ingenuity has
yet interpreted them. The words of the Etruscan and other Italian
languages have been recently collected by Fabretti. There is
some story about a learned antiquary after many years’ research
coming to the conclusion that two Etruscan words were
equivalent to vixit annos, but which was vixit, and which annos,
he was as yet uncertain. We have also Etruscan wall-paintings,
and various miscellaneous antiquities in bronze, and among
them the most salient peculiarity of Etruscan archæology not
easily to be conjectured, its elegantly-formed bronze mirrors.
These, which are incised with mythological subjects, and often
inscribed, have attracted the especial attention of modern
scholars and antiquaries, who have gazed upon them indeed
almost as wistfully as the Tuscan ladies themselves.
But Greece had far more influence over Roman life and art
than Etruria.

Græcia capta ferum victorem cepit, et artes


Intulit agresti Latio.

Accordingly, Greek architecture (mostly of the later Corinthian


style, which was badly elaborated into the Composite) was
imported into Rome itself, and continued to flourish in the Greek
provinces of the empire. Temples and theatres continued much
as before; but the triumphal arch and column, the amphitheatre,
the bath and the basilica, are peculiarly Roman.
The genius of Rome however was essentially military, and the
stamp which she has left on the world is military also. Her
camps, her walls, and her roads, strata viaram, which, like
arteries, connected her towns one with another and with the
capital, are the real peculiarities of her archæology. The treatise
on Roman roads, by Bergier, occupies above 800 pages in the
Thesaurus of Grævius. Instead of bootlessly wandering over the
width of the world on these, let us rather walk a little over those
in our own country, and as we travel survey the general
character of the Roman British remains, which may serve as a
type of all. In the early part of this lecture, I observed that we, in
common with the rest of Western Europe, find in our islands
weapons which belong to the stone, bronze, and iron periods;
and here also, as in other places, the last-named period
doubtless connects itself with the Roman. But besides these, we
have other remains, many of which may be referred to the Celtic
population which Cæsar had to encounter, when he invaded our
shores. These remains may in great part perhaps (for I am
compelled to speak hesitatingly on a subject which I have
studied but little, and of which no one, however learned, knows
very much) be anterior to Roman times. Of this kind are the
cromlechs at Dufferin in South Wales, in Anglesey, and in
Penzance, of which there are models in the British Museum; of
this kind also are, most probably, the gigantic structures at
Stonehenge, about which so much has been written and
disputed. The British barrows of various forms and other
sepulchral remains may also be referred, I should conceive, in
part at least, to the pre-Roman Celtic period. The earlier mounds
contain weapons and ornaments of stone, bronze and ivory, and
rude pottery; the later ones, called Roman British barrows,
appear mostly not to contain stone implements, but various
articles of bronze and iron and pottery; also gold ornaments and
amber and bead necklaces. Other sepulchral monuments consist
merely of heaps of stones covering the body which has been laid
in the earth. Many researches into this class of remains have of
late years been made, and by none perhaps more patiently and
more successfully than by the late Mr Bateman, in Derbyshire.
The archæology of Wales has also been made the special object
of study by a society formed for the purpose. Some tribes of the
ancient Britons were certainly acquainted with the art of die-
sinking, and a great many coins, principally gold, are extant,
some of which may probably be as early as the second century
before Christ. They are, to speak generally, barbarous copies of
the beautiful gold staters of Philip of Macedon, which circulated
over the Greek world, and so might become known to our
forefathers by the route of Marseilles.
With these remarks I leave the Celtic remains in Britain; all
attempts to connect together the literary notices and the
antiquities of the Celts and Druids, so as to make out a history
from them, have been compared to attempts to “trace pictures in
the clouds[18].” Still we may say to the Celtic archæologist,
Θαρσεῖν χρὴ, φίλε Βύττε, τάχ’ αὔριον ἔσσετ’ ἄμεινον.

18. Pict. Hist. of England, Vol. I. p. 59. London, 1837.

One day matters may become clearer by the help of an


extended and scientific archæology.
But of the Romano-British remains it may be necessary to say
something. When we look at the map in Petrie’s Monumenta
Historica Britannica, in which the Roman roads are laid down by
their actual remains, we see the principal Roman towns and
stations connected together by straight lines, which are but little
broken. So numerous are they that we might almost fancy that
we were looking at a map in an early edition of a Railway Guide.
In this county they abound and have been very carefully traced,
and both here and in other counties are still used as actual
roads. In a few instances mile-stones have also been found. In
our own country, cut off, as Virgil says, from the whole world, we
do not expect the splendid monuments of Roman greatness, yet
even here the temple, the amphitheatre and the bath are not
unknown; and in our little Pompeii at Wroxeter we have, if my
memory deceive me not, some vestiges of fresco-painting, an art
of which we have such beautiful Roman examples elsewhere.
But everywhere we stumble upon camps and villas; everywhere
The tesselated pavements shew
Where Roman lamps were wont to glow.

And of these lamps themselves we have an infinite number and


variety, and on many of them representations of the games of the
circus and of various other things, formed in relief; a remark
which may also be made of their fine and valuable red Samian
ware; fragments of which are commonly met with, but the vases
are rarely entire. Of their other pottery, and of their glass and
personal ornaments, and miscellaneous objects, I must hardly
say any thing; but only observe that the Romans have left us a
very interesting series of coins relating to Britain; Claudius
records in gold the arch he raised in triumphant victory over us:
in the same way Hadrian, Antoninus Pius, Septimius Severus,
besides building their great walls against us, have, as well as
Caracalla and Geta, struck many pieces in silver and copper to
commemorate our tardy subjugation. The British emperors or
usurpers, Carausius and Allectus, have also left us very ample
series of coins, and indeed it is by these, much more than by the
monuments of letters, that their histories are known. In the fourth
and fifth centuries the monetary art declined greatly in the
Western Empire, and was on the whole at a very low ebb in the
Eastern or Byzantine Empire, and in the middle ages, generally,
throughout Europe.
At Constantinople a new school of Roman art arose, which
exercised a powerful influence on medieval art in general. Soon
after the foundation of Constantinople, Roman artists worked
there in several departments with a skill by no means
contemptible, though of a strangely conventional and grotesque
character; and from them, as it would seem, the medieval artists
of Central and Western Europe caught the love of the same
crafts, and carried them to much higher excellence. I would
allude in the first place, as being among the earliest, to ivory
carvings, principally consular diptychs. From the time of the
emperors it was the custom for consuls and other curule
magistrates to make presents both to officials and their friends of
ivory diptychs, which folded together like a pair of book-covers,
on which sculptures in low relief were carved, as a mode of
announcing their elevation. From the fourth and fifth centuries
down to the fourteenth we find them, some of the earliest with
classical subjects, as the triumph of Bacchus, probably of the
fourth century; but mostly with Scriptural ones, or with
representations of consuls. Some of these are enriched with
jewellery. The inscriptions accompanying them are either in
Greek or in Latin. In Germany they occur in the Carlovingian
period, though rarely, and in France and Italy later still. Perhaps it
should be mentioned that the ivory episcopal chair of St
Maximian at Ravenna, a work of the sixth century, is the finest
example extant of this class of antiques, and is doubly interesting
as being one of the very few extant specimens of furniture during
the first three centuries of the middle ages. Various casts of
medieval ivories, it may be added, have been executed and
circulated by the Arundel Society.
Another art learnt from Rome in her decline, or from
Constantinople, is the illumination of MSS., which the
calligraphers of the middle ages in all countries throughout
Europe carried to a very high perfection. Perhaps the earliest
example to be named is the Greek MS. of Genesis in the LXX,
now preserved in the Imperial Library at Vienna, probably of the
fourth century. The vellum is stained purple, and the MS. is
decorated with pictures executed in a quaint, but vigorous style.
In these, we find (as M. Labarte[19], a great authority for medieval
art, assures us) all the characters of Roman art in its decline,
such as it was imported to Constantinople by the artists whom
Constantine called to his new capital; and “they have served,” as
he adds, “for a point of departure” in the examination which he
has made of the tendencies and destinies of Byzantine art.
Compare the Vatican MSS. of Terence and Virgil. I cannot be
expected to enter into details about illuminations; they occur in
MSS. of all sorts, more or less, in Europe, down to the sixteenth
century, but especially in sacred books, such as were used in
Divine service. I need only call to your remembrance the
beautiful assemblage exhibited in the Fitzwilliam Museum and in
the University Library, to say nothing of the treasures possessed
by our different colleges.
19. Histoire des Arts au moyen âge. Album. Vol. II. pl. lxxvii. Paris, 1864.

There are many other objects of medieval art not unworthy of


being enlarged upon, which I intentionally pass over lightly, lest
their multiplicity should distract us; thus I will say little of its
pottery, its coins, or of its sculptures and bas-reliefs in stone.
With regard to the first of them, M. Labarte observes: “It is not
until the beginning of the fifteenth century that we find among the
European nations any pottery, but such as has been designed for
the commonest domestic use, and none that art has been
pleased to decorate.” These are objects which the middle ages
have in common with others; and they are objects in which a
comparison will not be favourable to medieval art. Still, we must
take care that a love of art does not blind us to the real value of
such things; they are always interesting for the history of art,
whatever their rudeness or whatever their ugliness; and,
moreover, they are often, as the coins of various nations, of high
historical interest. For examine, on our own series of barbarous
Saxon coins we have not only the successions of kings handed
down to us, in the several kingdoms of the so-called Heptarchy
and in the united kingdom, but also on the reverses of the same
coins we have mention made of a very large number of cities
and towns at which they were respectively struck. For example,
to take Cambridge, we find that coins were struck here by King
Edward the Martyr, Ethelred the Second, Canute, Harold the
First, and Edward the Confessor; also after the Conquest by
William the First and William the Second. We are thus furnished
with very early notices, and so in some measure able to estimate
the importance of the cities and towns of our island in medieval
times; though great caution is necessary here in making
deductions; for no coins appear to have been struck in
Cambridge after the reign of William Rufus. And this seems at
first sight so much the more surprising when we bear in mind
that money was struck in some of our cities, as York, Durham,
Canterbury, and Bristol, quite commonly, as late as the fifteenth
and sixteenth centuries. But, in truth, from the twelfth century
downwards, the number of cities and towns in which lawful
money was struck became comparatively small.
But I must not wander too far into numismatics. The art of
enamelling, peculiarly characteristic of the later periods of the
middle ages, is very fully treated of by M. Labarte, from whom I
derive the following facts. The most ancient writer that mentions
it is the elder Philostratus, a Greek writer of the third century,
who emigrated from Athens to Rome. In his Icones, or Treatise
on Images, the following passage occurs. After speaking of a
harness enriched with gold, precious stones, and various
colours, he adds: “It is said that the barbarians living near the
ocean pour colours upon heated brass, so that these adhere and
become like stone, and preserve the design represented.” It may,
therefore, be considered as established that the art of enamelling
upon metals had no existence in either Greece or Italy at the
beginning of the third century; and, moreover, that this art was
practised at least as early in the cities of Western Gaul. During
the invasions and wars which desolated Europe from the fourth
to the eleventh century almost all the arts languished, and some
may have been entirely lost. Enamelling was all but lost; for
between the third and the eleventh centuries the only two works
which occur as landmarks are the ring of King Ethelwulf in the
British Museum, and the ring of Alhstan, probably the bishop of
Sherburne, who lived at the same time. These two little pieces,
however, only serve to establish the bare existence of
enamelling in the West in the ninth century. But in this same
century the art was in all its splendour at Constantinople, and we
possess specimens of Byzantine workmanship of even an earlier
date. I cannot enter into the various modes of enamelling, which
are fully described by M. Labarte; but merely mention, without
comment, a few of the principal specimens, independently of the
Limoges manufacture, which constituted the chief glory of that
city from the eleventh century to the end of the medieval period.
“This became the focus whence emanated nearly all the
beautiful specimens of enamelled copper, which are so much
admired and so eagerly sought after for museums and
collections.” The principal earlier examples then are these; the
crown and the sword of Charlemagne, of the ninth century, now
in the Imperial Treasury at Vienna; the chalice of St Remigius, of
the twelfth century, in the Imperial Library at Paris; the shrine of
the Magi in Cologne, and the great shrine of Nôtre Dame at Aix-
la-Chapelle, presented by the Emperor Frederick Barbarossa in
the latter part of the same twelfth century. Also the full-length
portrait (25 inches by 13) of Geoffrey Plantagenet, father of our
Henry II., which formerly ornamented his tomb in the cathedral,
but is now in the Museum at Le Mans. The British Museum
likewise contains two or three fine examples; and among them
an enamelled plate representing Henry of Blois, Bishop of
Winchester, and brother of King Stephen.
Very fine also are the extant products of the goldsmith’s art in
the middle ages; which date principally from the eleventh
century, when the art received a new impulse in the West; those
of earlier date, with very few exceptions, now cease to exist.
They are principally chalices, reliquaries, censers, candlesticks,
croziers and statuettes.
Nor can I pass over in absolute silence the armour of the
middle ages. Until the middle of the ninth century it would appear
to have resembled the Roman fashion, of which it is needless to
say anything; but in Carlovingian times the hilts and scabbards of
dress-swords were very highly decorated; and about this period,
or rather later, the description of armour used by the ancients
was exchanged for the hauberk or coat of mail, which was the
most usual defensive armour during the period of the Crusades.
The first authentic monument where this mail-armour is
represented is on the Bayeux tapestry of Queen Matilda,
representing the invasion of England by William Duke of
Normandy in 1066; the most famous example of medieval
tapestry in existence, though other specimens are to be seen at
Berne, Nancy, La Chaise Dieu, and Coventry. The art of the
tapissier, however, in the eleventh century, when the Bayeux
tapestry was made, would appear to have been on the decline.
In the beginning of the fourteenth century plate-armour began to
come into use; and by and by this was decorated with
Damascene work, a style of art applied to the gate of a basilica
in Rome, which was sent from Constantinople, as early as the
eleventh century, but which did not become general in the West
till the fifteenth. To this I may just add, that sepulchral brasses,
on which figures in armour are often elaborately represented by
incised lines, are a purely medieval invention of the thirteenth
century. Sir Roger de Trumpington’s brass at Trumpington is one
of the very earliest examples. But time forbids me to say more of
sepulchral brasses, a class of antiquities almost confined to our
own country, of which we have some few specimens as late as
the seventeenth century, or to do more than allude to the
beautiful sepulchral monuments in stone of the medieval period,
with which we are all more or less familiar.
The most remarkable art to which the middle age gave birth
was oil-painting, the very queen of all the fine arts, though it was
to the age of the Medici that its immense development was due.
Previously painting had been subordinated to architecture; but
now, while mosaics, frescoes, and painted glass remained still
subservient to her, the art of painting occupies a distinct and
prominent rank of its own. It used commonly to be said that the
invention of painting on prepared panel was due to Margaritone
of Arezzo, who died about 1290, and in like manner that John
van Eyck invented oil-painting in 1410. Both these errors have
been propagated by the authority of Vasari. But it is now well
known, and has been conclusively proved, both by M. Labarte
and by Sir C. Eastlake, that these modes of painting are
mentioned by authors who lived more than a century before
Margaritone, in particular by the monk Theophilus, who in the
twelfth century composed a work entitled Diversarum artium
schedula. Paintings in oil either are or lately were in existence
anterior to John van Eyck; for example one at Naples, executed
by Filippo Tesauro, and dated 1309. We must ascend to much
earlier times to discover the true origin of portable paintings, and
we shall find it in the Byzantine Empire. The Greeks, about the
time that the controversy respecting images was rife, multiplied
little pictures of saints; these were afterwards brought over in
abundance by the priests and monks who followed the crusades,
and from the study of them, schools of painting in tempera arose
in Italy, in the twelfth century, at Pisa, Florence and other places.
The Byzantine school, M. Labarte tells us, reigned paramount in
Italy until the time of Giotto, i.e. the beginning of the fourteenth
century, and also in the schools of Bohemia and Cologne, the
most ancient in northern Europe, until towards the end of the
fourteenth century. In this country we have two very early
paintings, one of the beginning and the other of the end of the
same fourteenth century, in Westminster Abbey. The former,
probably a decoration of the high altar, is on wood; it represents
the Adoration of the Magi and other Scriptural subjects, and is
declared by Sir C. Eastlake to be worthy of a good Italian artist of
the fourteenth century, though he thinks that it was executed in
England. The latter is the canopy of the tomb of Richard II. and
Anne, his first wife, representing the Saviour and the Virgin and
other figures. The action and expression are declared by Sir C.
Eastlake to indicate the hand of a skilful painter. In 1396, £20
was paid by the sacrist for the execution of the work. These
remarks must suffice for a notice of medieval painting; the
glorious period of its history belongs rather to the Renaissance,
or post-medieval age.
The only archæological monuments of great importance which
remain to be mentioned are those of architecture, in connection
with the accessories of mosaics, frescoes, and painted glass.
The two former descended from classical times, the last is the
creation of the middle age. Mosaics having been originally used
only in pavements, at length were employed as embellishments
for the walls of basilicas, and, by a natural transition, of
churches. Constantine and his successors decorated many
churches in this manner, and in the East a ground of gold or
silver was introduced below the glass cubes of the mosaics, and
a lustre was by this means spread over the work which in earlier
times was altogether unknown. Thus the tympanum above the
principal door of the narthex of the Church of St Sophia, built by
the Emperor Justinian at Constantinople, is adorned with a
mosaic picture of the Saviour seated, the cubes of the mosaics
being of silvered glass; it is accompanied by Greek texts. This
and other later mosaics are figured by M. Labarte, in his last and
most splendid work, entitled Histoire des Arts au moyen âge;
among the rest a Transfiguration of the tenth century. The
Byzantine art, with its stiff conventionality, prevailed every where
till Cimabue, G. Gaddi, and Giotto imparted to its rudeness a
grace and nobleness which marked a new era. In the vestibule of
St Peter is a noble mosaic, partly after the design of Giotto,
representing Christ walking on the water, and the apostles in the
ship. But the very masters who raised the art to its perfection
brought about its destruction. Painting, restored by these same
great men, was too powerful a rival; and after the sixteenth
century, when it still flourished in Venice under the
encouragement of Titian, we hear little more of mosaics on any
great scale.
Passing over frescoes, which were much encouraged by
Charlemagne, and by various sovereigns and popes during the
middle ages, because the ravages of time have either destroyed
them altogether or left them in a deplorable condition, as for
example in some parish-churches in England, I will make a few
remarks on painted glass, so extensively used in the decoration
of the later churches.
The art of painting glass was unknown to the ancients, and
also to the early periods of the middle ages. “It is a fact,” says M.
Labarte, “acknowledged by all archæologists, that we do not now
know any painted glass to which an earlier date than the
eleventh century can be assigned with certainty.” Two
specimens, and no more, of this century, are figured by M.
Lasteyrie. The painted windows of the twelfth and thirteenth
centuries are nearly of the same character. They consist of little
historical medallions, distributed over mosaic grounds composed
of coloured (not painted) glass, borrowed from preceding
centuries. Fine examples from the church of St Denys and La
Sainte Chapelle at Paris, of the twelfth and thirteenth centuries,
are figured by M. Lasteyrie, and also by M. Labarte, who has
many beautiful remarks on their harmony with the buildings to
which they belong, on the elegance of their form, the richness of
their details, and the brilliancy of their colours. In the fourteenth
century, when examples become common, the glass-painters
copied nature with more fidelity, and exchanged the violet-tinted
masses, by which the flesh-tints had been rendered, for a
reddish gray colour, painted upon white glass, which approached
more nearly to nature. Large single figures now often occupy an
entire window. The improvement in drawing and colouring is a
compensation for the more striking effects of the brilliant yet
mysterious examples of the preceding centuries; and the end of
the fourteenth century is one of the finest epochs in the history of
painted glass. Painting on glass followed the progress of painting
in oils in the age which followed; and artists more and more
aimed at producing individual works; and in the latter half of the
fifteenth century buildings and landscapes in perspective were
first introduced. The decorations which surround the figures
being borrowed from the architecture of the time have often a
very beautiful effect. But the large introduction of grisailles
deprives the windows of this period of the transparent brilliancy
of the coloured mosaics of the earlier glass-painting. In the
sixteenth century, however, glass was nothing more than the
material subservient to the glass-painter, like canvas to the oil-
painter. Small pictures very highly finished were executed after
the designs of Michael Angelo, Raffaelle, and the other great
painters of the Renaissance. “But,” as M. Labarte truly says, “the
era of glass-painting was at an end. From the moment that it was
attempted to transform an art of purely monumental decoration
into an art of expression, its intention was perverted, and this led
of necessity to its ruin. The resources of glass-painting were
more limited than those of oil, with which it was unable to
compete. From the end of the sixteenth century the art was in its
decline, and towards the middle of the seventeenth was” almost
“entirely given up.” Our own age has seen its revival, and though
the success has been indeed great, we may hope that the zenith
has not yet been reached. “It is,” says Mr Winston, “a distinct and
complete branch of art, which, like many other medieval
inventions, is of universal applicability, and susceptible of great
improvement.” I have been a little more diffuse on glass-painting
than on some other subjects, as it is a purely medieval art, and
one which has now acquired a living interest. Various examples
of the different styles will easily suggest themselves to many, or,
if not, they may be studied in the splendid work of M. Lasteyrie,
entitled Histoire de la Peinture sur Verre d’après ses monuments
en France, and on a smaller scale in Mr Winston’s valuable Hints
on Glass-painting.
With regard to the architectural monuments of the medieval
world, I may, in addressing such an audience, consider them to
be sufficiently well known for my present purpose, which is to
give an indication, and little more, of the archæological remains
which have come down to our own days. Medieval architecture is
in itself a boundless subject; and as I have not specially studied
it, I could not, if I would, successfully attempt an epitome of its
various forms of Byzantine, Saracenic, Romanesque, Lombardic,
and of infinitely diversified Gothic. For a succinct yet
comprehensive view of all these and more, I must refer you to Mr
Fergusson’s Handbook of Architecture. Yet when we let our
imagination idly roam over Europe, and the adjoining regions of
Asia and Africa, what a host of architectural objects flits before it
in endless successions of variety and beauty! Think of Justinian’s
Church of St Sophia, which he boasted had vanquished
Solomon’s temple, and again of St Mark’s at Venice, as
Byzantine examples. Think next of the mosque of the Sultan
Hassan, and of the tombs of the Memlooks mingled with lovely
minarets and domes at Cairo; of the Dome of the Rock at
Jerusalem; of the Alhambra in Spain, with all the witchery of its
gold and azure decorations. Float, if you will, along the banks of
the Rhine or the Danube (as many of us have actually done),
and conjure up the majestic cathedrals, the spacious
monasteries and the ruined castles, telling of other days, with
which they are fringed. Let the bare mention of the names of
Milan, Venice, Rome; again of Paris, Rheims, Chartres, Amiens,
Troyes, Rouen, Avignon; and in fine those of Antwerp, Louvain,
and Brussels, suggest their own stories. Yet the magnificent
structures, secular and ecclesiastical, which I have either named
or hinted at, need not make us ashamed of our own country. We
are surrounded on all sides by an archæology which is
emphatically an archæology of progress, and we may justly be
proud of it as Englishmen. In this University and its immediate
neighbourhood we have fine specimens of Saxon, Norman, Early
English, Decorated, and Perpendicular styles of Gothic
architecture; and as regards the last of them, one of the most
splendid examples in the world. In the opinion of competent
judges the English cathedrals, while surpassed in size by many
on the Continent, are in excellence of art superior to those of
France or of any country in Europe. “Nothing can exceed the
beauty of the crosses which Edward I. erected on the spots
where the body of Queen Eleanor rested on its way to London.”
Some of these, Waltham for example, are quite equal to anything
of their class found on the Continent. “The vault of Westminster
Abbey” (says Mr Fergusson, on whose authority I make almost
every statement relating to medieval architecture) “is richer and
more beautiful in form than any ever constructed in France;” the
triforium is as beautiful as any in existence; and its
appropriateness of detail and sobriety of design render it one of
the most beautiful Gothic edifices in Europe.
I thus conclude my sketch, such as it is, of the archæology of
the world. Its aim has been to bring under review the rude
implements and weapons of primeval man; the colossal
structures of civilised man in Egypt and India; the strangely-
compounded palace-sculptures of Assyria and Babylonia; the
exquisitely ornamented columns of Persian halls; the massive
architecture of Phœnicia; the Gothic-like rock-tombs of Lycia; the
lovely temples, and incomparable works of art of every kind,
great and small, of Greece; the military impress of Roman
conquest; the medieval works of art in ivory, in enamel, in glass-
painting, as well as its glorious architectural remains, connecting
the middle ages with our own times. It has been drawn, as I
observed at the outset, under very adverse circumstances, and
must on that account venture to sue for much indulgence. It is
open, no doubt, to many criticisms: I expect to be charged with
grievous sins of omission, and perhaps of commission also: nor
do I suppose that I could entirely vindicate myself from such
charges. Worse than all perhaps, I have exposed myself to the
unanswerable sarcasm that I have talked about many subjects of
which I know but little. If, however, I have been able to compile
from trustworthy sources or manuals so much respecting those
particular branches of archæology which I have not studied, as
to bring before you their salient features in an intelligible manner,
that is enough for my purpose. I want no more, and I pretend to
no more; and I am conscious enough that even this purpose has
been but feebly accomplished. Tediousness, indeed, in dealing
with numerous details could hardly be altogether avoided; but
this is so much lighter a fault than an indulgence in mere
platitudes, running smoothly and amusingly, but emptily withal,
that I shall hear your verdict of guilty with composure.
It now only remains that I should very briefly point out what
qualifications are necessary for an archæologist, and also the
pleasure and advantage which result from his pursuits.
With regard to the first of these matters, the qualifications
necessary for an archæologist, they are to some considerable
extent the same as are necessary for a naturalist.
Like the naturalist, the antiquary must in the first place bring
together a large number of facts and objects. This is, no doubt, a
matter of great labour, but believe me, ‘labor ipse voluptas.’ The
labour is its own ample reward. The hunting out, the securing,
and the amassing facts and objects of antiquity, or of natural
history, are the field-sports of the learned or scientific Nimrod. In
a certain sense every archæologist must be a collector; he must
be mentally in possession of a mass of facts and objects,
brought together either by himself or by others. It is not
absolutely necessary that he should be a collector, in the sense
of being owner of a collection of his objects of study; in some
departments indeed of archæology to amass the objects
themselves is impossible: who, for instance, can collect Roman
roads or Gothic cathedrals? models, plans, and drawings, are
the only substitutes possible. But, with the facts relating to his
favourite objects, and also as much as possible with the objects
themselves, he must be familiar.
Yet this familiarity will not be enough to make him an
archæologist. Such knowledge may be possessed, and very
often is possessed, by a mere dealer in antiquities. The true
antiquary must not only be well acquainted with his facts, but he
must also, when there are sufficient data, proceed to reason
upon them. He puts them together, and considers what story
they have to render up. We saw a beautiful illustration of this in
the joint labours of the Scandinavian antiquaries and naturalists.
The order and sequence of the stone, bronze, and iron ages,
were distinctly made out; and even their chronology may one day
be discovered. The antiquary is enabled to form some judgment
of the civilisation, the arts, and the religion of the nations whose
remains he studies. Very often, as in the Roman series of coins,
he makes out political events in their history, and assigns their
dates. He determines the place of things in the historical series,
much as the naturalist does in the natural series.
Like the naturalist also he must be a man of learning, i.e. he
must be acquainted with what has been written by his fellow-
labourers in the same branch of study. Few know, prior to
experience, what a serious business this is. The bibliography of
every department of archæology, as well as of natural history, is
now becoming immense.
But besides a knowledge of facts, and objects, and books,
there are one or two other qualifications necessary for many
departments of archæology, the want of which has been very
prejudicial to some distinguished writers. Exact scholarship is
one of these qualifications. I do not merely mean that if a man be
engaged in Greek archæology, he must be aware of the
passages of Greek authors, in which the vases or the coins he is
talking about are alluded to, though he must certainly be
acquainted with these, and possess sufficient scholarship to
construe them correctly; but he must also be able to interpret his
written archæological monuments, such as his inscriptions and
the legends of his coins. This is oftentimes no easy matter, and it
requires a knowledge of strange words and dialects. Moreover, if
an inscription or a legend be mutilated (and this is very
frequently the case), unless the archæologist has an accurate
knowledge of the language in which it is written, whatever that
may be, Greek, Latin, Norman-French, or any other, what hope
is there that he will ordinarily be able to restore it, and having so
done interpret it with security or satisfaction? As one illustration
of many, I will cite Prof. Ramsay’s remark on Nibby’s dissertation
Delle vie degli Antichi: “In the first part of this article (on Roman
roads) his essay has been closely followed. Considerable
caution, however, is necessary in using the works of this author,
who, although a profound local antiquary is by no means an
accurate scholar[20].” Mr Bunbury, while pointing out the
advantages which scholars would derive from some
acquaintance with archæology, points out by implication the
advantage which archæologists would derive from scholarship.
“In this country,” says he, “the study of archæology is but too
much neglected; it forms no part of the ordinary training of our
classical scholars at the Universities, and is rarely taken up by
them in after life. It is generally considered as the exclusive
province of the professed antiquarian, who has seldom
undergone that early training in accurate scholarship, which is
regarded, and we think with perfect justice, by the student from
Oxford or Cambridge, as the indispensable foundation of sound
classical knowledge[21].” I think he is a little over-severe on us;
living men like Mr C. T. Newton, Mr Waddington, Mr Vaux, Mr C.
W. King, Mr C. K. Watson, and, last, but not least, like himself, to
whom others might be added, prove that his assertions must be
taken cum grano; even if it be true that this country has produced
no work connected with ancient art which can be compared with
the writings of Gerhard, or Welcker; of Thiersch, or Karl Otfried
Müller[22].
20. See Smith’s Dict. Gr. and Rom. Antiq. s. v. Viæ.

21. Edinburgh Review, u. s.

22. I feel a little inclined to dispute this: Stuart, one of the authors of the Antiquities of
Athens, which have been continued by other very able hands, and have also
been translated into German, may, perhaps, take rank with the authors named in
the text. K. O. Müller himself calls Millingen’s Ancient Unedited Monuments
(London, 1822) “a model of a work;” and though without doubt Millingen is inferior
to Müller in scholarship and in acquaintance with books, he is probably at least
his equal as a practical archæologist. Colonel Leake’s Numismata Hellenica
(London, 1856) may also be cited as an admirable combination of learning with
practical archæology.

Another thing very desirable for the successful prosecution of


some branches of archæology is an appreciation of art. Without
it we cannot judge of the value of many antiques, or enter into
their spirit or feeling; we neither discern their excellencies nor
their deficiencies. Mr King, who has made the province of
ancient gems peculiarly his own, justly calls them “little
monuments of perfect taste, ... only to be appreciated by the
educated and practised eye[23].” Moreover, this is the very
knowledge often so requisite for distinguishing genuine
antiquities from modern counterfeits. The modern forgers, who
fabricate Greek coins from false dies, do not often reach the
freedom and beauty of the originals; though it must be confessed
that some of them, as Becker, have carried their execrable art to
a very high perfection. It is but rarely that these men meet with
the punishment they deserve; yet it is satisfactory to know that
Charles Patin, great scholar and great antiquary as he was, was
banished by Lewis XIV. from his court for ever, for selling him a
false coin of Otho; and that a manufacturer of antiques in the
East, near Bagdad I believe, lately received by order of the
Turkish governor a sound bastinado on the soles of his feet for
reproducing the idols of misbelievers of old time.
23. Antique Gems, Introd. p. xxiii. London, 1860.

A knowledge of natural history in fine is occasionally very


useful to an antiquary. I will give two instances, not at all
generally known, one taken from zoology, one from botany. On
the reverse of the splendid Greek coins of Agrigentum a crab is
commonly represented. To an ignorant eye the crab looks much
like the crab in our shops here in Cambridge; the zoologist
recognises in it the fresh-water crab of the regions of the
Mediterranean; the numismatist, profiting by this knowledge,
sees at once that the type of the coin symbolizes not the harbour
of Agrigentum, as he had supposed, but its river. Again, on the
reverse of the beautiful Greek coins of Rhodes occurs a flower,
about which numismatists have disputed since the time of
Spanheim, whether it was the flower of the rose or of the
pomegranate. Even Col. Leake has here taken the wrong side,
and decided in favour of the pomegranate; the divided calyx at
once shews every botanist that the representation is intended for
the rose, conventional as that representation may be, from which
flower the island derives its name.
These are, I think, the principal qualifications which are
necessary or desirable for the archæologist. It only remains that I
should point out briefly some of the pleasures and advantages
that result from his pursuits. For I shall not so insult any one of
you, who are here present, as to suppose that this question is
lurking secretly in your mind, “Is there any good in archæology at
all? To what practical end do your researches tend?” My learned
predecessor well says that “this question is sometimes put to the
lover of science or letters by those from whom nature has
withheld the faculty of deriving pleasure from the exercise of the
intellect, and he feels for the moment degraded to the level of
such.” It is not so clear however that the fault must be put to the
account of nature. Rather, we may say,

Homine imperito nunquam quidquam injustius,


Qui nisi quod ipse facit, nihil rectum putat.

“No one,” says a Swedish scholar of the seventeenth century,


“blames the study of antiquity without evidencing his own
ignorance; as they that esteem it do credit to their own judgment;
so that to sum up its advantages we may assert, there is nothing
useful in literature, if the knowledge of antiquity be judged
unprofitable[24].” It is doubtless one of the many charms of
archæology that it illustrates and is illustrated by literature;
indeed, some knowledge of antiquity is little less than necessary
for every man of letters. Unless we have some knowledge of the

You might also like