by
PATIL, NISHANT
Marietta, GA
November 2018
REPRODUCING BIOINFORMATICS ANALYSIS AND GRAPHS 2
Abstract
A previous study reported findings regarding H3K9me3 marks in the early embryonic stages of
mice. This research aimed to see whether the researcher could reproduce these findings by
reproducing the data visualizations and bioinformatics analysis of three figures. This was done
by utilizing the raw, published data and processing it and programming visualizations according
to each figure. The researcher then analyzed the trends in each figure and decided how similar
the visualizations were and whether the original and reproduced figures yielded the same
conclusion. The research found that two of the three reproductions did not visually
match, although the bioinformatics analysis of the reproductions did yield the same or complementary
conclusions. The researcher's mentor supervised the research process and approved the results
when steps were correctly followed for reproduction and analysis. Due to a limitation in the type
of data, the reproductions could not be visually similar to the original figures, but rather
complementary in interpretation. This research concluded that, given the data used for
reproduction, the figure reproductions could not be made visually similar. The bioinformatics
analysis did reach the same conclusions, so the analysis could be reproduced. In genetics, this
study suggests how useful it can be to verify findings through other formats of data. This
research allowed the researcher to learn how to produce figures and process data.
However, it also shows that working with the same format of data as previous research would
Table of Contents
References
LIST OF ABBREVIATIONS
GO Gene Ontology
LIST OF TABLES
LIST OF FIGURES
Figure 1e ChIP-seq graph: Established (red) and disappeared (blue) H3K9me3 marks during embryonic development
Figure 2a A heatmap showing H3K9me3 domains during mouse embryo development
Figure 1 Schematic showing how stage-specific H3K9me3 marks and genes were identified
Figure 3 Gene ontology analysis of combined ICM and morula stage downregulated genes
Figure 4 Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage upregulated genes
Figure 5 Gene ontology analysis of combined ICM and morula stage upregulated genes
Figure 6 Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage downregulated genes
Chapter 1: Introduction
In their research, Wang et al. (2018) studied genetic reprogramming of mice during early
trimethylation (H3K9me3) on Long Terminal Repeats (LTRs), which are long, repetitive sequences of
DNA (Wang et al., 2018). Mammalian fertilized eggs undergo epigenetic modifications
following fertilization, and thus the mouse genome becomes demethylated during early
embryonic development. LTRs become hypomethylated and need to be regulated, and previous
studies have shown that H3K9me3 modifiers have regulated LTRs (Wang et al., 2018).
chromosomes that are tightly bound) in LTR regulation was unknown. The study by Wang et al.
(2018) elucidated the molecular details behind the reprogramming of H3K9me3-dependent
heterochromatin. This research examines how well their findings, namely the figures and
bioinformatics analysis, can be reproduced using data in a different format:
RNA-seq data.
How well can one reproduce the bioinformatics analysis and graphs from the
This research aims to reproduce the figures and analysis of a published study in order
to learn how figures are produced and how to interpret them. The process of reproduction
is useful knowledge for helping researchers to understand how to conduct genetic research in the
Research Questions
1. Applied subproblem: How well can one reproduce the bioinformatics analysis of the
2. Applied subproblem: How well can one reproduce the graphs of published data using R?
Hypothesis Statements
1. ASP 1: Using the bioinformatics algorithms, it is possible to reproduce the results of the
data analysis that Wang et al. (2018) produced and reach the same conclusion.
The independent variable is the bioinformatics algorithms, and the dependent variable is
2. ASP 2: Using the R programs and packages, it is possible to reproduce the results of the
visualization that Wang et al. (2018) have produced with a degree of similarity. The
independent variable is the packages used in R programming, and the dependent variable
The study of genetics involves collecting large amounts of complex data. Analyzing that data
can be overwhelming, challenging, and time-consuming. This research assists in
verifying the findings presented in the study by Wang et al. (2018). Because RNA-seq and
ChIP-seq data are generated differently, the values and genetic information may differ,
but the conclusions should ultimately be the same. Verifying these findings strengthens the
conclusions derived from the study by Wang et al. (2018) since more supporting information
1. Bioinformatics: The sum of the computational approaches to analyze, manage, and store
3. Cuffdiff: A program that you can use to find significant changes in transcript expression,
splicing, and promoter use (Trapnell Lab at the University of Washington's Department
5. Gene read: Each nucleotide sequence in a gene is called a read (Porter, 2007, para. 3).
6. ggplot2: ggplot2 is a plotting system for R, based on the grammar of graphics, which
tries to take the good parts of base and lattice graphics and none of the bad parts
(HGCC) is a computer network composed of 1 head node and 22 compute nodes and
serves multiple functions related to genomic projects and data storage (Department of
8. Open Source: Open source software is software with source code that anyone can
9. Package: In R, the fundamental unit of shareable code is the package (Wickham, 2015,
para. 1).
10. R: R is a system for statistical computation and graphics (R Core Team, 2018).
11. RNA-seq: A technique that examines the quantity and sequence of RNA in a sample
12. TopHat: TopHat is a fast splice junction mapper for RNA-Seq reads. TopHat works in
conjunction with Bowtie, which aligns gene reads (Center for Computational Biology at
Summary
Wang et al. (2018) found connections regarding H3K9me3 marks during specific
embryonic stages, and this research aims to see whether the conclusions derived can be matched
by making reproductions using RNA-seq data. The hypotheses claim that one can reproduce the
This research consists of two foundational subproblems: bioinformatics analysis and data
sequenced RNA-seq or ChIP-seq data, converting them to appropriate format, mapping the
genetic data to an appropriate genome, comparing differential gene expression, and analyzing
the data ("RNA Sequencing Analysis with TopHat", n.d., p. 4). The second foundational
subproblem involves importing the output of the first subproblem into R programs for
data visualization. This subproblem involves the use of various programs and packages in R;
a package is a "fundamental unit of shareable code" (Wickham, 2015, para. 1). The output of this
stage consists of various graphs that data scientists consult for trends and related analysis.
Bioinformatics Processing
The first foundational subproblem of the research is the bioinformatics analysis. It asks
how one can analyze bioinformatics data. As a part of bioinformatics analysis, the researcher
collected and processed RNA sequencing samples. The researcher retrieved data samples using
Read Archive), dbGaP, and ADSP data" (Sequence Read Archive, 2018a, para. 2). The researcher used
various options such as '-h' to get help regarding the documentation of commands. The '-h'
command option "displays ALL options, general usage, and version information" (Sequence
Read Archive, 2018a, para. 2). This is a very useful option for technical help. The researcher
used the '-f' option to "force object download" to ensure proper retrieval of raw data (Sequence
Read Archive, 2018a, para. 2). The '-l' option lists "the contents of a kart file" (Sequence Read
After downloading the data, the next step was to perform a file conversion from SRA to
the '.fastq.gz' file type. The 'fastq-dump' command allowed conversion of "SRA data into fastq
format" (Sequence Read Archive, 2018b, para. 1). The researcher used various options available
for this command. As before, the '-h' option gave quick access to the documentation. The
'-M' option is useful to filter data "by sequence length" (Sequence Read Archive, 2018b, para. 2).
The '-O' option specified the output directory in which to store the converted files (Sequence Read
Archive, 2018b, para. 2). There are two types of data: paired-end sequence data and single-read
sequence data. FASTQ files generated by the 'fastq-dump' command have different outputs based
The next important step was to align the "RNA-Seq reads to a genome in order to identify
exon-exon splice junctions" (Center for Computational Biology at Johns Hopkins University,
2016, para. 1). The researcher used the TopHat algorithm for this purpose. TopHat works together
with Bowtie: Bowtie maps gene reads to a reference genome, and TopHat finds splice junctions
and aligns gene reads. TopHat produced different outputs depending on whether the sequences are
'accepted_hits.bam' file, which was essential for further computation (Center for Computational
The next step was to use the bamToBed command, which "is a conversion utility that
converts sequence alignments in BAM format to BED records" (Quinlan Lab at University of
Utah, 2017, p. 46). The researcher ran the bamToBed command on the 'accepted_hits.bam' file
generated by the TopHat command. The researcher used various command options
such as "-mate1", "-split", "-ed", "-tag", etc. to modify certain aspects of a BAM file (Quinlan
The last important step was to compare differential gene expression using Cuffdiff "to
find significant changes in transcript expression, splicing, and promoter use" (Trapnell Lab at the
algorithm compared differences in gene mapping for analysis in an experiment. This step
To prepare RNA-seq data for visualization, one must fetch the raw data, convert it,
align the reads, convert the alignments to a compatible format, and identify differential
gene expression. The 'prefetch' command first downloaded data from a database such
as NCBI. The 'fastq-dump' command then prepared the raw data for TopHat by converting
the SRA files to the '.fastq.gz' file type. Next, the TopHat algorithm processed the output of the
'fastq-dump' command; working in conjunction with Bowtie, it aligned and mapped gene reads.
The TopHat output was then given to the 'bamToBed' command, which converted the alignments
to BED format for programs that do not read BAM files. Lastly, the Cuffdiff
algorithm compared gene mapping between two TopHat outputs. This processing
allowed one to analyze and further process the data for visualizing specific information using
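The processing chain described above can be sketched as a small helper that only assembles the command lines as argument lists, without running anything. This is an illustration written in Python rather than the researcher's actual scripts. The accession, Bowtie index name, annotation file, and directory names are hypothetical placeholders, a single-read layout is assumed, and handing the two 'accepted_hits.bam' files directly to Cuffdiff is an assumption of this sketch.

```python
from pathlib import Path


def per_sample_commands(accession, bowtie_index, work_dir="work"):
    """Return the command lines (as argument lists) for one sample.

    Nothing is executed here; each list could be handed to subprocess.run.
    All names passed in are hypothetical placeholders.
    """
    work = Path(work_dir)
    fastq = work / f"{accession}.fastq.gz"      # single-read layout assumed
    tophat_out = work / f"{accession}_tophat"
    bam = tophat_out / "accepted_hits.bam"      # file name taken from the text
    return [
        # 1. Download the SRA record.
        ["prefetch", accession],
        # 2. Convert SRA to gzip-compressed FASTQ in the working directory.
        ["fastq-dump", "--gzip", "-O", str(work), accession],
        # 3. Align reads and find splice junctions (TopHat calls Bowtie).
        ["tophat", "-o", str(tophat_out), bowtie_index, str(fastq)],
        # 4. Convert the alignments to BED records (written to standard output).
        ["bamToBed", "-i", str(bam)],
    ]


def cuffdiff_command(annotation_gtf, bam_a, bam_b, out_dir="cuffdiff_out"):
    """Return the Cuffdiff command comparing two aligned samples (assumed inputs)."""
    return ["cuffdiff", "-o", out_dir, annotation_gtf, bam_a, bam_b]


if __name__ == "__main__":
    for cmd in per_sample_commands("SRR0000000", "mouse_index"):
        print(" ".join(cmd))
    print(" ".join(cuffdiff_command("genes.gtf", "a.bam", "b.bam")))
```

Keeping the commands as data makes the order of the steps easy to inspect before anything is submitted to a cluster.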
Data Visualization
bioinformatic data output from the first foundational subproblem. It asked how one can visualize
bioinformatics data using R programming. The researcher used the R programming language for
statistical analysis and graphical representation of data. The researcher imported the output of the
first subproblem into R programs for data visualization. One of the advantages R offers is a wide
range of open-source (i.e. free) packages that provide advanced algorithms and complex
graphing capabilities (Krill, 2015, para. 5). Some of the important packages the researcher used
in R are 'ggplot2' and 'dplyr' (Krill, 2015, para. 8). Packages like ggplot2 and plotly provided
specialized functionality related to plotting graphs. Different options like bar graphs, pie charts,
scatter plots, and heat maps were available for visualization. The researcher could use these two
packages together since they complement each other. In ggplot2, "you provide the data, tell
'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of
the details" (Wickham et al., 2018, p. 1). 'ggplot' is the most important function available in this
package. This function allowed the researcher to create a new ggplot graph to visualize data
(Wickham et al., 2018, p. 112). The researcher used this function in conjunction with functions
like 'geom_bar' to create bar charts (Wickham et al., 2018, p. 43). After creating a graph in
ggplot2, the researcher used 'plotly' to "easily translate 'ggplot2' graphs to an interactive web-
based version and/or create custom web-based visualizations directly from R" for ease of access
from anywhere on the internet (Sievert et al., 2018, p. 1). 'ggplotly' is one of the important functions
that allows conversion of ggplot2 graphs to plotly (Sievert et al., 2018, p. 22). Specialized packages
such as 'RColorBrewer' create beautiful color palettes for data visualization (Neuwirth, 2015, p.
1). 'dplyr' is a package that provided a "fast, consistent tool for working with data frame like
objects, both in memory and out of memory" (Wickham, François, Henry, & Müller, 2018, p. 1).
The researcher used this package to work with datasets using functions like 'bind' (Wickham,
François, Henry, & Müller, 2018, p. 2), 'select' (Wickham, François, Henry, & Müller, 2018, p.
3), and 'filter' (Wickham, François, Henry, & Müller, 2018, p. 2) for data manipulation. 'knitr' is
another useful package that provided a tool for report generation in R (Xie et al., 2018, p. 1).
'knit' and 'stitch' are two important functions in this package. 'knit' converted the data in an input
file to a proper format (Davis, 2018, p. 27). 'stitch' automatically created "a report based on an R
script and a template" (Xie and Friendly, 2018, p. 65). This package was very useful in cases
where the researcher created templates with documentation for other researchers to reference.
The output of this subproblem consists of various graphs for data visualization.
Packages provided powerful functions that allowed for a great variety of customization of plots.
The ggplot2 package allowed for the creation of bar graphs, scatter plots, heat maps, and many
other visualizations. The plotly package worked in conjunction with ggplot2 by allowing for the
creation of web-based visualizations or by converting plots to a web format. dplyr allowed for
working with and manipulating data. One could use knitr to create documentation for written code for
future reference.
Summary
collected. Visualization alone is not enough for a clear understanding of that data; that is where
analysis of the visualized data came into the picture. Data
visualization complemented the analysis by providing clear and easily understandable graphs.
Technology helped this process by providing tools. The software packages and commands
available with technology allowed the researcher to reproduce the bioinformatics analysis and
The main requirement the mentor specified was to reproduce the bioinformatics
analysis and the data visualization of the published mouse genetic data referred to in the
research paper by Wang et al. (2018). The overarching question of this research was: how well
can one reproduce the bioinformatics analysis and graphs from the "Reprogramming of
This research aimed to reproduce the figures and analysis of a published study in order to
learn how figures are produced and how to interpret them. The process of reproduction is
useful knowledge for helping researchers understand how to conduct genetic research in the
future, how to account for differences, and how to interpret sources of differences. This chapter
describes how the researcher reproduced the figures and bioinformatics analysis of the data from
the paper by Wang et al. (2018). It covers the processing and the figure creation
process, along with the grading of the results of the visualizations and the bioinformatics
analysis.
This research followed the engineering design process (EDP). The research involved
requirements from the mentor, analysis of those requirements, and designing and developing a
solution in order to replicate the bioinformatics analysis and the visualizations of the research
paper by Wang et al. (2018). Finally, as a part of testing, the mentor compared the visualizations
and analyses of this research to those from the paper and provided approval.
Population
According to the original study by Wang et al. (2018), 6 B6D2F1 and C57BL female
mice, all of which were 8-10 weeks old, were mated with B6D2F1 or DBA2 male mice for that
study.
Sample
Because this research relied on the published raw data from the research by
Wang et al. (2018), the researcher fetched all of the raw RNA-seq data available from the study.
Materials/Instruments
The raw RNA-seq data the researcher processed was obtained from an NCBI database
and consisted of SRA files. This maintains validity and test-retest reliability, as the researcher
did not modify the data or use data from a different source. To process the data, the
researcher used the R programming language, Excel, HGCC, and the Gene Ontology Consortium. These
materials retain construct reliability and test-retest reliability, because the same tools were used
by Wang et al. (2018) and because machine algorithms and processes are not subject to human
error.
The visualization reproductions were graded on a scale of 1-5 according to the following
criteria. A rating of 1 meant that the reproduced figure was a different type of figure than the
expected figure. A rating of 2 meant that the reproduced figure was the correct type of graph but
contained some error in the data being graphed, such as incorrect processing of data or plotting the
wrong column on an axis. A rating of 3 meant that the reproduced figure was the same type of
figure as the expected figure and graphed the correct data but could not be numerically accurate
due to a data limitation. A rating of 4 meant that the reproduced figure was the correct type of
graph and plotted the correct values but had minor errors that prevented it from being an
exact reproduction. A rating of 5 meant that the reproduced figure was visually identical to
The bioinformatics analysis of each reproduction was graded against the original as
either a "satisfactory match" or an "unsatisfactory match". For example,
the conclusion for a reproduced graph could be a "satisfactory match" if the same
conclusion was derived from both figures or an "unsatisfactory match" if the analyses of the
The researcher reviewed the requirements provided by the mentor to reproduce the
bioinformatics analysis and to reproduce the data visualization (i.e. graphs) referenced in the
research paper by Wang et al. (2018). The researcher made reproductions of figures 1e and 2a,
and then provided an analysis of the trends. The researcher also created a gene ontology table, a
figure similar to figure 2c from the research paper by Wang et al. (2018), but this figure was
essentially only similar in concept as the researcher attempted reproductions with RNA-seq data,
which led to different gene ontology terms and y-values. The figures the researcher attempted to
Figure 1e:
Figure 1e. ChIP-seq graph: Established (red) and disappeared (blue) H3K9me3 marks during
heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell
Biology, 20, p. 621. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.
Figure 2c:
Figure 2c. Gene ontology analysis for oocyte-specific genes. Reprinted from "Reprogramming
al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part
of Springer Nature.
Figure 2a:
Figure 2a. A heatmap showing H3K9me3 domains during mouse embryo development.
embryo development," by Wang et al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by
The following figure demonstrates the logic behind reproducing figure 1e from the paper by
Figure 1. Schematic showing how stage-specific H3K9me3 marks and genes were identified.
Comparison of H3K9me3 marks/genes specific to the circled stage: The immediate stages
(before and after) were used for comparison, with the 2 cell stage as an example stage to
Before reproducing the graphs, the researcher carried out the following steps to prepare the raw
1. The researcher fetched mouse gene data provided by the mentor through HGCC (Human
Genetic Computer Cluster), a network of computers for genome projects, by logging into
2. The researcher accessed a script file to set up and execute the TopHat algorithm (an
algorithm that mapped genetic data to corresponding parts of a genome) for processing of
the mouse gene data through HGCC for mapping and aligning the mouse genes to the
mouse genome (a set of chromosomes). The output of this algorithm (.bam and .bed files)
3. The researcher accessed a script file to set up and execute the Cuffdiff algorithm (an
algorithm that compared differences in gene expression) on the mouse gene data from the
various stages of embryo development. The output of the CuffDiff
between two embryonic stages, which was used as input when further processing data for
visualization.
Since the goal of this research was to reproduce the graphs of the research paper using the
RNA-seq data, statistical analysis of the data did not apply to this research and was not part of
the scope. The results of this processing were in the form of data tables which one can
understand better through data visualization (i.e. plotting graphs). The researcher carried out the
1. The researcher imported the data from the '.diff' files generated by the CuffDiff comparisons to
other embryonic cell stages into an R program by reading the files into R.
Below is the programming logic the researcher used for this step:
Locate the folder in which the '.diff' file is placed on the computer. Open the appropriate
'.diff' file and import it into an R data table by reading each line of data.
2. The researcher wrote R programs to filter the imported data for selective processing
Below is the programming logic the researcher used for this step:
Use the subset function in R programming to filter out the data based on a filter applied
processing.
3. The researcher extracted a list and the number of genes specific to each embryonic
development stage.
Below is the programming logic the researcher used for this step:
Create lists of genes intersecting the previous and the next stage CuffDiff comparison
'.diff' files. For upregulated genes specific to a stage, intersect the upregulated genes from
the CuffDiff comparison to the previous stage with the downregulated genes in the next
stage. For downregulated genes specific to a stage, intersect the downregulated genes
from the CuffDiff comparison to the previous stage with the upregulated genes in the
next stage.
Below is the programming logic the researcher used for this step:
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
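The import, filter, and intersect steps above can be sketched as follows. This is an illustration written in Python rather than the researcher's R programs. The two small tables are made up and keep only a few columns from the layout of Cuffdiff's gene-level '.diff' output. Using the 'significant' flag together with the sign of the log2 fold change as the filter is an assumption, since the text does not state the exact filter that was applied.

```python
import csv
import io

# Tiny made-up tables in the column layout of a Cuffdiff '.diff' file
# (only the columns used below are included; gene names are invented).
PREV_VS_STAGE = """gene\tsample_1\tsample_2\tlog2(fold_change)\tsignificant
GeneA\t2cell\t4cell\t2.5\tyes
GeneB\t2cell\t4cell\t-1.8\tyes
GeneC\t2cell\t4cell\t3.1\tyes
GeneD\t2cell\t4cell\t0.2\tno
"""
STAGE_VS_NEXT = """gene\tsample_1\tsample_2\tlog2(fold_change)\tsignificant
GeneA\t4cell\t8cell\t-2.0\tyes
GeneB\t4cell\t8cell\t1.5\tyes
GeneC\t4cell\t8cell\t0.4\tno
GeneD\t4cell\t8cell\t-0.1\tno
"""


def read_diff(handle):
    """Read a tab-separated '.diff' table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def regulated(rows, direction):
    """Return the set of significant genes that go 'up' or 'down'.

    'up' means higher in sample_2 than in sample_1 (positive log2 fold change).
    """
    genes = set()
    for row in rows:
        if row["significant"] != "yes":
            continue
        change = float(row["log2(fold_change)"])
        if (direction == "up" and change > 0) or (direction == "down" and change < 0):
            genes.add(row["gene"])
    return genes


def stage_specific(prev_rows, next_rows):
    """Intersect the two comparisons as described in step 3."""
    up = regulated(prev_rows, "up") & regulated(next_rows, "down")
    down = regulated(prev_rows, "down") & regulated(next_rows, "up")
    return up, down


prev_rows = read_diff(io.StringIO(PREV_VS_STAGE))
next_rows = read_diff(io.StringIO(STAGE_VS_NEXT))
up_specific, down_specific = stage_specific(prev_rows, next_rows)
print(sorted(up_specific), sorted(down_specific))
```

With real data, the in-memory strings would be replaced by opened '.diff' files, and the sizes of the two sets would give the per-stage gene counts to plot.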
After the reproduction was complete following these steps, the researcher analyzed the trends of
The researcher carried out the following steps for producing gene ontology tables for
and input four lists total: two lists for combined 2 cell, 4 cell, and 8 cell stages'
upregulated and downregulated genes and two lists for combined ICM and morula stages'
upregulated and downregulated genes. The Gene Ontology Consortium analyzed these
2. The researcher then made Excel tables out of the gene ontology terms with the highest P
values. Terms were chosen first by highest P value and then by uniqueness, so that the tables
covered a variety of gene ontology terms. The x-axis was the gene ontology terms, and
the y-axis was the P value of the gene ontology term which resulted from the analysis.
3. The researcher wrote a bar graph R program to plot each of the Excel tables.
Below is the programming logic the researcher used for this step:
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
After the gene ontology graphs were completed following these steps, the researcher analyzed
The researcher carried out the following steps to make the reproduction of figure 2a:
1. The researcher took the lists of upregulated and downregulated genes specific to each
embryonic stage and extracted the start and end locations of the genes from the CuffDiff
output. The genes and their respective start and end locations were saved as an Excel file.
2. The researcher input this Excel file into an R script which generated a matrix from this
data, normalized the matrix, and then ran k-means clustering on the matrix.
3. The researcher created a heat map R program to visualize the final matrix.
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
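The normalize-then-cluster step above can be sketched as follows. This is an illustration written in Python rather than the researcher's R script. The matrix is made up, row-wise z-scoring stands in for whatever normalization was actually used, and the starting centroids are fixed so that the run is deterministic. The heat map itself is not drawn here; the sketch only produces the row order a heat map would use.

```python
import math


def zscore_rows(matrix):
    """Scale each row to mean 0 and standard deviation 1 (an assumed normalization)."""
    scaled = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / len(row))
        scaled.append([(x - mean) / sd if sd > 0 else 0.0 for x in row])
    return scaled


def kmeans(rows, centroids, iterations=20):
    """Plain k-means starting from the given centroids; returns one label per row."""
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(row, centroids[k]))
            for row in rows
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [row for row, label in zip(rows, labels) if label == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels


# Made-up values: rows stand for genes, columns for embryonic stages (an assumption).
matrix = [
    [1.0, 2.0, 9.0],
    [2.0, 3.0, 10.0],
    [9.0, 2.0, 1.0],
    [10.0, 3.0, 2.0],
]
scaled = zscore_rows(matrix)
# Seed with the first and third rows so the result does not depend on random starts.
labels = kmeans(scaled, centroids=[scaled[0], scaled[2]])
# Group rows by cluster so that similar rows sit together in the heat map.
order = sorted(range(len(scaled)), key=lambda i: labels[i])
print(labels, order)
```

Sorting the rows by cluster label before plotting is what makes blocks of similar rows visible in a heat map.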
After the reproduction was complete following these steps, the researcher analyzed the trends of
the graph to provide bioinformatics analysis of the data. After completing the reproductions and
their analysis, the researcher carried out testing of the bioinformatics analysis results and data
visualization plots before submitting them to the mentor. The researcher used the following
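Table 1
Test case # | Figure being reproduced | Expected figure | Reproduced figure | Degree of similarity
1 | | | |
2 | | | |
3 | | | |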
Test case #: A number identifying each analysis scenario.
Figure being reproduced: The figure number from the paper by Wang et al. (2018) that the
Expected figure: A picture of the original figure from the paper by Wang et al. (2018)
Reproduced figure: The figure the researcher generated as a result of following the method to
Degree of similarity: The visual similarity between the expected figure and the reproduced figure
on a scale of 1-5. The judgement for numerical ratings is explained in the previous section.
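Table 2
Test case # | Associated figure | Analysis of the associated figure | Analysis of the reproduced figure | Conclusion
1 | | | |
2 | | | |
3 | | | |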
Associated figure: The figure number from the paper by Wang et al. (2018) that the researcher
Analysis of the associated figure: Bioinformatics analysis of the trends shown in the associated
figure.
Analysis of the reproduced figure: Bioinformatics analysis of the trends shown in the reproduced
figure.
Conclusion: Whether the analysis of the associated figure is a satisfactory match to
the analysis of the reproduced figure. The criteria are provided in the previous section.
After completion of the above testing, the researcher will provide the following products to the
The mentor reviewed the results provided by the researcher, compared them with the original
research paper results, and confirmed that the results of the research were produced properly.
Assumptions
1. The R programming language remained free to use during the course of the internship. Since R
is an open-source language, its contents are free for everyone to download and modify
2. The licenses for the R packages (ggplot2, RColorBrewer, dplyr, plotly, and knitr) that the
researcher used remained free during the course of the internship. Like R itself, these
packages are open-source, so their contents are free to download and modify.
3. The mentor supervised the researcher during the course of the internship. To guide the
researcher through the process of the reproductions and to teach the researcher how to
interpret the results, the mentor supervised the researcher to ensure successful research.
4. The researcher conducted research at Yao Lab at Emory University. Since the mentor and
researcher met only at Yao Lab at Emory University, the researcher conducted the study
5. The researcher assumed that the published figures from the paper by Wang et al. (2018)
are accurate. The researcher assumed that before publishing the findings, Wang et al.
(2018) must have verified their results to prevent mistakes from occurring.
Limitations
1. Limitation: The researcher used RNA-seq data, which limited the embryonic stages that
the researcher analyzed and caused some figure reproductions to not look similar,
2. Limitation: The time window available for the internship was from September 10th,
Delimitations
3. Delimitation: R programming was the choice of language for data visualization in this
research.
4. Delimitation: The scope of this research was limited to reproducing the data analysis and
visualization results of the published data from the study by Wang et al. (2018).
5. Delimitation: Only the TopHat and CuffDiff algorithms were in the scope to map, align,
Summary
In order to collect the data, the researcher processed the data through HGCC and
reproduced each visualization by further processing data according to each visualization using R
programming and various functions inside packages. Then, the researcher compared each
reproduction with the associated figure and judged each reproduced figure according to the scale
presented in the previous section. Next, the researcher analyzed the trends of the figures and
decided whether the conclusion derived from the original matched the conclusion derived from
the reproductions.
Chapter 4: Findings
The purpose of this research was to reproduce the figures and bioinformatics analyses of
the figures in the research paper by Wang et al. (2018) in order to help the researcher
understand the differences between the techniques Wang et al. (2018) used and the techniques the
researcher used. This also helped the researcher understand the involvement of bioinformatics
analysis and R programming in genetics research. This chapter discusses the results of the
reproductions and bioinformatics analyses, whether they were similar to a satisfactory degree,
Results
Below are the results of the figure reproductions and bioinformatics analyses:
Table 1
Test case # | Figure being reproduced | Expected figure | Reproduced figure | Degree of similarity
1 | 1e | [image] | [image] | 3
2 | 2c | [image] | [image] | 3
3 | 2a | [image] | [image] | 4
Table 2
Regarding the reproductions of the figures, the reproduction of figure 1e had a degree of
similarity of 3, indicating that a data limitation prevented a visually identical reproduction. The
gene ontology tables, which were the reproduction of figure 2c from the paper by Wang et al.
(2018), had a degree of similarity of 3 as well, also indicating that a data limitation prevented a
The bioinformatics analysis reproduction results indicate that the data for both the
reproduction and the associated figure contain the same conclusion for figure 1e, making the
that are different in ranges compared to the associated figure, leading to a conclusion of an
light blue to dark blue color in terms of gene expression in the 2 cell and 4 cell stages, which
Evaluation of Findings
Although the figure reproductions, and therefore some of their bioinformatics analyses, were
not visually identical to the originals, these differences were expected because the researcher
and Wang et al. (2018) used different types of data. Wang et al. (2018) used ChIP-seq data,
which measures H3K9me3 marks, whereas the researcher used RNA-seq data, which measures gene
expression. In the context of biology, the results of the reproductions, while they initially
seemed to contradict the figures, are complementary. H3K9me3 is a repressive mark that keeps
genes from being expressed, so the more H3K9me3 marks there are, the fewer genes are
expressed. Comparing figure 1e to the RNA-seq reproduction with this knowledge explains why the
bioinformatics analyses of the two lead to the same conclusion: the reproduction showed fewer
stage-specific genes being expressed at points where there are many H3K9me3 marks. The
mismatch in gene ontology tables was also expected, because a gene ontology analysis of
H3K9me3 marks is bound to differ from one of genes specific to embryonic stages. The heat maps
were expected to match, and thus they reached the same conclusion based
Summary
Without any knowledge of genetics, a reader comparing the figures and bioinformatics analysis
with their RNA-seq reproductions may find the data contradictory because of the lack of visual
similarity. However, with knowledge of how H3K9me3 affects gene expression and of
the difference between RNA-seq and ChIP-seq data, the original figures and bioinformatics
analysis are complementary to their RNA-seq reproduction counterparts, as some data is
This research aimed to see how well one can reproduce the bioinformatics analysis and
data visualization from the published data by Wang et al. (2018). Through this research, the
researcher learned how bioinformatics and programming are connected in genetics research and
how to produce figures from data. The researcher reproduced the
figures and bioinformatics analysis first by processing raw data with tools such as TopHat,
Bowtie, and CuffDiff, and then by further processing the data according to each figure.
Afterward, the researcher analyzed the trends to provide an analysis. One of the main limitations
of this research is that the researcher worked with RNA-seq data, whereas the paper by
Wang et al. (2018) worked with ChIP-seq data. Using RNA-seq data limited
the embryonic stages the researcher could analyze compared to the ChIP-seq data and led to
some different visualizations and analyses because of a fundamental difference between RNA-seq
and ChIP-seq data. This chapter discusses how the research data affects the answers to the
Implications
The first applied subproblem discusses how well one can reproduce the graphs of
published data from the paper by Wang et al. (2018) using R. The researcher's hypothesis
stated that one can reproduce the visualizations to a satisfactory degree. Comparing the
reproductions with the original figures without factoring in genetic interpretation, only one of the
three reproductions was truly visually similar, although the correct data was graphed. Because
most of the reproductions were not visually similar to the original figures, the figures from the
paper by Wang et al. (2018) cannot be reproduced to be visually similar with this data. The second applied
subproblem discusses how well the bioinformatics analysis of the original data can match the
bioinformatics analysis of the reproduced data. The researcher's hypothesis claimed that the
bioinformatics analyses of both the original and the reproduction can lead to the same
conclusion. Based on the data, the conclusions of the original data did match those of the
reproductions, so the bioinformatics analysis of the reproduction can match that of the original
data. However, the gene ontology tables' conclusions did not match those of the gene ontology
table from the paper by Wang et al. (2018), but this was expected. A limitation that affects the
interpretation of the results is a constraint on the data the researcher worked with. The researcher's
mentor assigned the reproduction to be done with RNA-seq data. Working with ChIP-seq
data probably would have allowed for much greater visual and interpretational similarity. These
results describe the differences between RNA-seq and ChIP-seq data and demonstrate that
knowledge of genetics is essential to properly interpret research results in this study. This
research implies that in genetics, it can be important to verify results with different data, such
as ChIP-seq and RNA-seq. This can help to strengthen studies, as the conclusions
derived from genetic studies such as these are backed by more than one format of data.
Although this study itself does not connect to or warrant any future research, it
teaches how to make figures for research and how to interpret them in general by allowing the
actual paper. This helps the researcher to grow by letting the researcher gain first-hand
experience in a genetics lab setting and by letting the researcher communicate with professionals
in the field.
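The complementary relationship described in this paper, in which stages with more H3K9me3 marks show fewer expressed genes, could be checked numerically with a rank correlation between the two kinds of data. The sketch below is a Python illustration, not the researcher's R code. It assumes one ChIP-seq mark count and one RNA-seq expressed-gene count per stage; the per-stage numbers are made up for illustration, and `ranks`, `pearson`, and `spearman` are hypothetical helper names.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up per-stage values for illustration only; NOT data from the study.
stages = ["2 cell", "4 cell", "8 cell", "morula", "ICM"]
h3k9me3_marks = [120, 300, 450, 200, 90]      # hypothetical ChIP-seq counts
expressed_genes = [900, 600, 350, 700, 1000]  # hypothetical RNA-seq counts

rho = spearman(h3k9me3_marks, expressed_genes)
# The rank orders of these made-up lists are exactly reversed, so rho is -1
# up to floating-point rounding.
print(f"Spearman rho = {rho:.3f}")
```

A strongly negative value would support reading the two data formats as complementary, while a value near zero or a positive one would suggest that the visual mismatch reflects a real disagreement.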
Recommendations
For future research, the data collected imply that working with the same format of data
would lead to more visually accurate figure reproductions and possibly more accurate
bioinformatics analysis reproductions as well. To accurately reproduce the figures
and analysis of Wang et al. (2018), the results of this research suggest working with ChIP-seq data.
Conclusions
The bioinformatics analyses of the reproductions reached the same conclusions as those of the
original data. Not taking genetics knowledge into account, however, the figure reproductions were not
similar to the original figures. Thus, the bioinformatics analysis can be successfully reproduced, but
the figures cannot. Because of the difference between RNA-seq and ChIP-seq data, however, slightly
different figures were expected, especially for the gene ontology tables. Interpreting both the data
and the figures from both the original and the reproductions implies that some of the visual
References
Center for Computational Biology (2016). A spliced read mapper for RNA-Seq [Software].
Retrieved from
https://ccb.jhu.edu/software/tophat/manual.shtml
https://www.sas.com/en_us/insights/big-data/data-visualization.html
http://genetics.emory.edu/about/index.html
Krill, P. (June 30, 2015). Why R? The pros and cons of the R language. Retrieved from
https://www.infoworld.com/article/2940864/application-development/r-programming-
language-statistical-data-analysis.html
Mackenzie, R. J. (April 06, 2018). RNA-seq: Basic Applications and Protocol. Technology
https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-
protocol-299461
Neuwirth, E. (February 19, 2015). Package 'RColorBrewer' (Version 1.1-2) [Software PDF].
https://cran.r-project.org/web/packages/RColorBrewer/RColorBrewer.pdf
https://opensource.com/resources/what-open-source
Patil, N. (November 26, 2018). Reproducing the Bioinformatics Analysis and Data Visualization
Porter, S. (January 28, 2007). Basics: How do you sequence a genome? part III, reads and
https://digitalworldbiology.com/archive/basics-how-do-you-sequence-genome-part-iii-
reads-and-chromats
Quinlan Lab at University of Utah (December 08, 2017). Bedtools Documentation (Version
https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf
R Core Team. (July 2, 2018). R Language Definition (Version 3.5.1) [Software PDF].
https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf
RNA Sequencing Analysis With TopHat [Software PDF]. (n.d.). Retrieved from
https://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf
Sequence Read Archive (2018a). SRA Toolkit Documentation: Tool Prefetch [Software].
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=prefetch
Sequence Read Archive (2018b). SRA Toolkit Documentation: Tool fastq-dump [Software].
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump
Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., & Despouy, P.
(July 20, 2018). Package 'plotly' (Version 4.8.0) [Software PDF]. Comprehensive R
https://cran.r-project.org/web/packages/plotly/plotly.pdf
https://www.medicinenet.com/script/main/art.asp?articlekey=16836
http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/
Wang, C., Liu, X., Gao, Y., Yang, L., Li, C., Liu, W., . . . Gao, S. (2018). Reprogramming of
http://www.chipseq.com/chromatin-immunoprecipitation/
http://had.co.nz/ggplot2/
http://r-pkgs.had.co.nz/intro.html
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., & Woo, K. (July
https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
Wickham, H., François, R., Henry, L., & Müller, K. (October 16, 2018). Package 'dplyr' (Version
https://cran.r-project.org/web/packages/dplyr/dplyr.pdf
Xie, Y., Vogt, A., Andrew, A., Zvoleff, A., Simon, A., Atkins, A., . . . Foster, Z. (February 20,
https://cran.r-project.org/web/packages/knitr/knitr.pdf
APPENDIX A
Figure Reproductions
Figure 2. RNA-seq reproduction of figure 1e. Number of upregulated and downregulated genes
Figure 3. Gene ontology analysis of combined ICM and morula stage downregulated genes.
Analysis from Gene Ontology Consortium determined the most relevant and unique categories
for genes.
Figure 4. Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage upregulated genes.
Analysis from Gene Ontology Consortium determined the most relevant and unique categories
for genes.
Figure 5. Gene ontology analysis of combined ICM and morula stage upregulated genes.
Analysis from Gene Ontology Consortium determined the most relevant and unique categories
for genes.
Figure 6. Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage downregulated
genes. Analysis from Gene Ontology Consortium determined the most relevant and unique
Figure 7. RNA-seq reproduction of heat map. Heat map plotting gene specific concentrations.