
Running Head: REPRODUCING BIOINFORMATICS ANALYSIS AND GRAPHS
Reproducing the Bioinformatics Analysis and Data Visualization of a Research Paper
Advanced Scientific Research Paper
Submitted to the Center for Advanced Studies, Wheeler High School
by
PATIL, NISHANT
The Center for Advanced Studies
Wheeler High School
Marietta, GA
November 2018
Abstract
A previous study uncovered findings regarding H3K9me3 marks in early embryonic stages of
mice. This research aimed to see whether the researcher could reproduce these findings by
reproducing the data visualizations and bioinformatics analysis of three figures. This was done
by processing the raw, published data and programming visualizations according
to each figure. The researcher then analyzed the trends in each figure and decided how similar
the visualizations were and whether the original and reproduced figures yielded the same
conclusion. The results showed that two of the three reproductions did not visually
match, although the bioinformatics analysis of the reproductions did yield the same or
complementary conclusions. The researcher's mentor supervised the research process and approved
the results when steps were correctly followed for reproduction and analysis. Due to a limitation
in the type of data, the reproductions could not be visually similar to the original figures, but
rather complementary in interpretation. This research concluded that, based on the data used for
reproduction, the figure reproductions could not be made visually similar. The bioinformatics
analysis did reach the same conclusions, so the analysis could be reproduced. In genetics, this
study suggests how useful it can be to verify findings through other formats of data. This
research allowed the researcher to gain knowledge of how to produce figures and process data.
However, it also shows that working with the same format of data as the previous research would
prove more effective for reproduction purposes.
Key Words: epigenetics, reproduction, programming, bioinformatics, visualization

Table of Contents
Chapter 1: Introduction
    Statement of the problem
    Purpose of the Study
    Research Questions
    Hypothesis Statements
    Significance of the Study
    Definition of Key Terms
    Summary
Chapter 2: Literature Review
    Bioinformatics Processing
    Data Visualization
    Summary
Chapter 3: Research Method
    Research Methods and Design(s)
    Population
    Sample
    Materials/Instruments
    Operational Definition of Variables
    Data Collection, Processing, and Analysis
    Assumptions
    Limitations
    Delimitations
    Summary
Chapter 4: Findings
    Results
    Evaluation of Findings
    Summary
Chapter 5: Implications, Real World Connections, Recommendations, and Conclusions
    Implications
    Real World Connections
    Recommendations
    Conclusions
References
Appendix A: Figure Reproductions

LIST OF ABBREVIATIONS
GO Gene Ontology
HGCC Human Genetics Computer Cluster
H3K9me3 Trimethylation of lysine 9 on histone H3
LTR Long Terminal Repeat
NCBI National Center for Biotechnology Information

LIST OF TABLES
Table 1 Data Visualization Testing
Table 2 Bioinformatics Analysis Testing

LIST OF FIGURES
Figure 1e ChIP-seq graph: Established (red) and disappeared (blue) H3K9me3 marks during
embryonic development
Figure 2c Gene ontology analysis for oocyte-specific genes
Figure 2a A heatmap showing H3K9me3 domains during mouse embryo development
Figure 1 Schematic showing how stage-specific H3K9me3 marks and genes were identified
Figure 2 RNA-seq reproduction of figure 1e
Figure 3 Gene ontology analysis of combined ICM and morula stage downregulated genes
Figure 4 Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage upregulated genes
Figure 5 Gene ontology analysis of combined ICM and morula stage upregulated genes
Figure 6 Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage downregulated genes
Figure 7 RNA-seq reproduction of heat map

Chapter 1: Introduction
In their research, Wang et al. (2018) studied genetic reprogramming of mice during early
embryonic stages, specifically the marks of the heterochromatin marker histone 3 lysine 9
trimethylation (H3K9me3) on Long Terminal Repeats (LTRs), which are long, repetitive sequences
of DNA. Mammalian fertilized eggs undergo epigenetic modifications
following fertilization, and thus the mouse genome becomes demethylated during early
embryonic development. LTRs become hypomethylated and need to be regulated, and previous
studies have shown that H3K9me3 modifiers regulate LTRs (Wang et al., 2018).
However, the involvement of H3K9me3-dependent heterochromatin (heterochromatin being
chromatin that is tightly packed) in LTR regulation was unknown. The study by Wang et al.
(2018) determined the molecular details behind the reprogramming of H3K9me3-dependent
heterochromatin. This research examines how well their findings, namely the figures and
bioinformatics analysis, can be reproduced using data in a different format,
RNA-seq data.
Statement of the Problem
How well can one reproduce the bioinformatics analysis and graphs from the
"Reprogramming of H3K9me3-dependent heterochromatin during mammalian embryo
development" research paper?
Purpose of the Study
This research aims to reproduce the figures and analysis of a published study in order
to learn how figures are produced and how to interpret these figures. The process of reproduction
is useful knowledge for helping researchers to understand how to conduct genetic research in the
future, how to account for differences, and how to interpret sources of differences.

Research Questions
1. Applied subproblem: How well can one reproduce the bioinformatics analysis from
published genetic data?
2. Applied subproblem: How well can one reproduce the graphs of published data using R?
Hypothesis Statements
1. ASP 1: Using the bioinformatics algorithms, it is possible to reproduce the results of the
data analysis that Wang et al. (2018) produced and reach the same conclusion.
The independent variable is the bioinformatics algorithms, and the dependent variable is
the conclusion derived from the analysis.
2. ASP 2: Using the R programs and packages, it is possible to reproduce the results of the
visualization that Wang et al. (2018) have produced with a degree of similarity. The
independent variable is the packages used in R programming, and the dependent variable
is the degree of similarity on a scale of 1 to 5.
Significance of the Study
The study of genetics involves the collection of large amounts of complex data. Analysis of
that data can be an overwhelming, challenging, and time-consuming job. This research assists in
verifying the findings presented in the study by Wang et al. (2018). Although RNA-seq and
ChIP-seq data are sequenced differently, so the values and genetic information may differ,
the conclusions should ultimately be the same. Verifying these findings strengthens the
conclusions derived from the study by Wang et al. (2018) since more supporting information
backs up the findings.

Definition of Key Terms
1. Bioinformatics: The sum of the computational approaches to analyze, manage, and store
biological data (Stöppler, 2016, para. 1).
2. ChIP-seq: A technique to analyze the histone modifications, DNA modifications, or
binding sequence of proteins within a genome ("What is Chromatin Immunoprecipitation
(ChIP)?" n.d., para. 1).
3. Cuffdiff: A program that you can use to find significant changes in transcript expression,
splicing, and promoter use (Trapnell Lab at the University of Washington's Department
of Genome Sciences, 2017).
4. Data visualization: Data visualization is the presentation of data in a pictorial or
graphical format (SAS, n.d., para. 1).
5. Gene read: Each nucleotide sequence in a gene is called a read (Porter, 2007, para. 3).
6. ggplot2: ggplot2 is a plotting system for R, based on the grammar of graphics, which
tries to take the good parts of base and lattice graphics and none of the bad parts
(Wickham, 2013, para. 1).
7. Human Genetics Computer Cluster (HGCC): Human Genetics Computer Cluster
(HGCC) is a computer network composed of 1 head node and 22 compute nodes and
serves multiple functions related to genomic projects and data storage (Department of
Human Genetics at Emory University School of Medicine, 2017, para. 3).
8. Open Source: Open source software is software with source code that anyone can
inspect, modify, and enhance (Opensource.com, 2013, para. 3).
9. Package: In R, the fundamental unit of shareable code is the package (Wickham, 2015,
para. 1).
10. R: R is a system for statistical computation and graphics (R Core Team, 2018).
11. RNA-seq: A technique that examines the quantity and sequence of RNA in a sample
(Mackenzie, 2018, para. 1).
12. TopHat: TopHat is a fast splice junction mapper for RNA-Seq reads. TopHat works in
conjunction with Bowtie, which aligns gene reads (Center for Computational Biology at
Johns Hopkins University, 2016, para. 1).
Summary
Wang et al. (2018) found connections regarding H3K9me3 marks during specific
embryonic stages, and this research aims to see whether the conclusions derived can be matched
by reproductions made using RNA-seq data. The hypotheses claim that one can reproduce the
conclusions and claims accurately.

Chapter 2: Literature Review
This research consists of two foundational subproblems: bioinformatics analysis and data
visualization. The bioinformatics analysis foundational subproblem consists of collecting the
sequenced RNA-seq or ChIP-seq data, converting them to an appropriate format, mapping the
genetic data to an appropriate genome, comparing differential gene expression, and analyzing
the data ("RNA Sequencing Analysis with TopHat", n.d., p.4). The second foundational
subproblem involves importing the output of the previous subproblem into R programs for
data visualization. This subproblem involves the use of various packages, each a "fundamental
unit of shareable code" (Wickham, 2015, para. 1), and programs in R. The output of this stage
consists of various graphs that data scientists refer to for trends and related analysis.
Bioinformatics Processing
The first foundational subproblem of the research is the bioinformatics analysis. It asks
how one can analyze bioinformatics data. As a part of bioinformatics analysis, the researcher
collected and processed RNA sequencing samples. The researcher retrieved data samples using
prefetch, a bioinformatics command that "allows command-line downloading of SRA (Sequence
Read Archive), dbGaP, and ADSP data" (Sequence Read Archive, 2018a, para. 2). The researcher used
various options, such as '-h' for help with command documentation. The '-h'
command option "displays ALL options, general usage, and version information" (Sequence
Read Archive, 2018a, para. 2). This is a very useful option for technical help. The researcher
used the '-f' option to "force object download" to ensure proper retrieval of raw data (Sequence
Read Archive, 2018a, para. 2). The '-l' option lists "the contents of a kart file" (Sequence Read
Archive, 2018a, para. 2).

After downloading the data, the next step was to perform a file conversion from SRA to
the '.fastq.gz' file type. The 'fastq-dump' command allowed conversion of "SRA data into fastq
format" (Sequence Read Archive, 2018b, para. 1). The researcher used various options available
for this command. As always, the '-h' option was handy for easy reference of documentation. The
'-M' option is useful to filter data "by sequence length" (Sequence Read Archive, 2018b, para. 2).
The '-O' option helped to specify the output directory to store the information (Sequence Read
Archive, 2018b, para. 2). There are two types of data: paired-end sequence data and single-read
sequence data. FASTQ files generated by the 'fastq-dump' command have different outputs based
on whether the SRA data is paired-end or single-read.
The next important step was to align the "RNA-Seq reads to a genome in order to identify
exon-exon splice junctions" (Center for Computational Biology at Johns Hopkins University,
2016, para. 1). The researcher used the TopHat algorithm for this purpose. TopHat works along
with Bowtie: Bowtie maps gene reads to a reference genome, and TopHat finds splice junctions
and aligns gene reads. TopHat produced different outputs based on whether the sequences were
paired-end or single-read. One of the outputs of the TopHat command was an
'accepted_hits.bam' file, which was essential for further computation (Center for Computational
Biology at Johns Hopkins University, 2016, para. 19).
The next step was to use the bamToBed command, which "is a conversion utility that
converts sequence alignments in BAM format to BED records" (Quinlan Lab at University of
Utah, 2017, p. 46). The researcher ran the bamToBed command on the 'accepted_hits.bam' file
generated by the TopHat command. The researcher used various command options
such as "-mate1", "-split", "-ed", and "-tag" to control how the alignments in a BAM file
were converted (Quinlan Lab at University of Utah, 2017, p. 46).

The last important step was to compare differential gene expression using Cuffdiff "to
find significant changes in transcript expression, splicing, and promoter use" (Trapnell Lab at the
University of Washington's Department of Genome Sciences, 2017). In simple terms, this
algorithm compared differences in gene expression between the samples in an experiment. This
step concluded the first foundational subproblem of the research.
In order to process RNA-seq data successfully and prepare it for visualization, one must
fetch the data, convert it, align the reads, convert the alignments to a compatible format, and
identify differential gene expression. Processing began with the 'prefetch' command, which
downloaded data from a database such as NCBI. The 'fastq-dump' command then prepared the raw
data for TopHat input by converting the SRA files to the '.fastq.gz' file type. Next, the output
of the 'fastq-dump' command was processed by the TopHat algorithm, which, in conjunction
with Bowtie, aligned and mapped gene reads. The TopHat output was then given to the
'bamToBed' command, which converted the alignments to BED format for programs that do not
read BAM files. Lastly, the Cuffdiff algorithm compared gene expression between two TopHat
outputs. This processing allowed one to analyze and further process the data for visualizing
specific information in a programming language such as R.
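The order of the processing steps can be illustrated with a short sketch. It is written in Python purely for illustration and is not the script the researcher ran on HGCC. The accession numbers, Bowtie index name, annotation file, and directory layout are hypothetical placeholders, single-read data is assumed, and only basic documented options of each tool are shown. Cuffdiff is given the BAM alignments from TopHat, since its documented usage takes SAM or BAM files. The sketch only builds and prints the command lines; it does not execute them.

```python
import shlex


def build_pipeline(accession, bowtie_index, annotation_gtf, other_bam, workdir="work"):
    """Return the five processing steps as argument lists, in run order.

    All paths and names are hypothetical placeholders for illustration.
    """
    fastq_dir = f"{workdir}/fastq"
    # Assumes single-read data, so fastq-dump writes one gzipped FASTQ file.
    fastq = f"{fastq_dir}/{accession}.fastq.gz"
    tophat_out = f"{workdir}/tophat/{accession}"
    bam = f"{tophat_out}/accepted_hits.bam"
    return [
        # 1. Download the SRA file from NCBI.
        ["prefetch", accession],
        # 2. Convert SRA to gzipped FASTQ in the chosen output directory.
        ["fastq-dump", "--gzip", "-O", fastq_dir, accession],
        # 3. Align reads to the genome (TopHat calls Bowtie internally).
        ["tophat", "-o", tophat_out, bowtie_index, fastq],
        # 4. Convert the BAM alignments to BED records (written to stdout).
        ["bamToBed", "-i", bam],
        # 5. Compare expression between two sets of TopHat alignments.
        ["cuffdiff", "-o", f"{workdir}/cuffdiff", annotation_gtf, bam, other_bam],
    ]


steps = build_pipeline(
    accession="SRR0000000",          # hypothetical accession
    bowtie_index="mouse_index",      # hypothetical Bowtie index prefix
    annotation_gtf="genes.gtf",      # hypothetical annotation file
    other_bam="work/tophat/SRR0000001/accepted_hits.bam",
)
for step in steps:
    print(shlex.join(step))
```

Keeping each command as a list of arguments makes the order of the pipeline explicit and would allow each step to be passed to a process runner one at a time.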
Data Visualization
The second foundational subproblem of the research is data visualization of the
bioinformatic data output from the first foundational subproblem. It asked how one can visualize
bioinformatics data using R programming. The researcher used the R programming language for
statistical analysis and graphical representation of data. The researcher imported the output of the
first subproblem into R programs for data visualization. One of the advantages R offers is a wide
range of open-source (i.e. free) packages that provide advanced algorithms and complex
graphing capabilities (Krill, 2015, para. 5). Some of the important packages the researcher used
in R are 'ggplot2' and 'dplyr' (Krill, 2015, para. 8). Packages like ggplot2 and plotly provided
specialized functionality related to plotting graphs. Different options like bar graphs, pie charts,
scatter plots, and heat maps were available for visualization. The researcher could use these two
packages together since they complement each other. In ggplot2, "you provide the data, tell
'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of
the details" (Wickham et al., 2018, p. 1). 'ggplot' is the most important function available in this
package. This function allowed the researcher to create a new ggplot graph to visualize data
(Wickham et al., 2018, p. 112). The researcher used this function in conjunction with functions
like 'geom_bar' to create bar charts (Wickham et al., 2018, p. 43). After creating a graph in
ggplot2, the researcher used 'plotly' to "easily translate 'ggplot2' graphs to an interactive web-
based version and/or create custom web-based visualizations directly from R" for ease of access
from anywhere on the internet (Sievert et al., 2018, p. 1). 'ggplotly' is one of the important
functions that allows conversion of ggplot2 graphs to plotly (Sievert et al., 2018, p. 22).
Specialized packages such as 'RColorBrewer' create beautiful color palettes for data
visualization (Neuwirth, 2015, p. 1). 'dplyr' is a package that provided a "fast, consistent tool
for working with data frame like
objects, both in memory and out of memory" (Wickham, François, Henry, & Müller, 2018, p. 1).
The researcher used this package to work with datasets using functions like 'bind' (Wickham,
François, Henry, & Müller, 2018, p. 2), 'select' (Wickham, François, Henry, & Müller, 2018, p.
3), and 'filter' (Wickham, François, Henry, & Müller, 2018, p. 2) for data manipulation. 'knitr' is
another useful package that provided a tool for report generation in R (Xie et al., 2018, p. 1).
'knit' and 'stitch' are two important functions in this package. 'knit' converted the data in an input
file to a proper format (Davis, 2018, p. 27). 'stitch' automatically created "a report based on an R
script and a template" (Xie and Friendly, 2018, p. 65). This package was very useful in cases
where the researcher created templates with documentation for other researchers to reference.
The output of this subproblem consists of various graphs for data visualization.
In order to successfully visualize data in R, packages and functions were necessary.
Packages provided powerful functions that allowed for a great variety of customization of plots.
The ggplot2 package allowed for the creation of bar graphs, scatter plots, heat maps, and many
other visualizations. The plotly package worked in conjunction with ggplot2 by allowing for the
creation of web-based visualizations or by converting plots to a web format. dplyr allowed for
working with and manipulating data. One could use knitr to create documentation for written code for
future reference.
Summary
Understanding bioinformatics data required thorough analysis of the sample data
collected. Visualization alone is not enough for a clear understanding of that data; the trends
in each visualization must also be analyzed. Data visualization complemented the analysis by
providing clear and easily understandable graphs. The software packages and commands
available allowed the researcher to reproduce the bioinformatics analysis and
graphs by reusing the raw data.

Chapter 3: Research Method
The main requirement the mentor specified was to reproduce the bioinformatics
analysis and the data visualization of the published mouse genetic data referred to in the
research paper by Wang et al. (2018). The overarching question of this research was: how well
can one reproduce the bioinformatics analysis and graphs from the "Reprogramming of
H3K9me3-dependent heterochromatin during mammalian embryo development" research paper?
This research aimed to reproduce the figures and analysis of a published study in order to
learn how figures are produced and how to interpret these figures. The process of reproduction is
useful knowledge for helping researchers to understand how to conduct genetic research in the
future, how to account for differences, and how to interpret sources of differences. This chapter
identifies how the researcher reproduced the figures and bioinformatics analysis of the data from
the paper by Wang et al. (2018). This chapter encompasses the processing and the figure creation
process, along with the grading of the results of the visualizations and the bioinformatics
analysis.
Research Methodology and Design
This research followed the engineering design process (EDP). The research involved
requirements from the mentor, analysis of those requirements, and designing and developing a
solution in order to replicate the bioinformatics analysis and the visualizations of the research
paper by Wang et al. (2018). Finally, as a part of testing, the mentor compared the visualizations
and analyses of this research to those from the paper and provided approval.
Population
In the original study by Wang et al. (2018), six B6D2F1 and C57BL female
mice, all of which were 8-10 weeks old, were mated with B6D2F1 or DBA2 male mice.
Sample
Since the researcher worked from the published raw data of the research by
Wang et al. (2018), the researcher fetched all of the raw RNA-seq data available from the study.
This data was available at NCBI.
Materials/Instruments
The raw RNA-seq data the researcher processed was obtained from an NCBI database
and consisted of SRA files. This maintains validity and test-retest reliability, as the researcher
did not modify or use any data from a different source. For instruments used to process data, the
researcher used the R programming language, Excel, HGCC, and the Gene Ontology Consortium. These
materials retain construct reliability and test-retest reliability, because the same tools were used
by Wang et al. (2018) and because machine algorithms and processes are not subject to human
error.
Operational Definition of Variables
For grading the visualization reproductions on a scale of 1-5, the criteria for each number were as
follows: A rating of 1 meant that the reproduced figure was a different type of figure than the
expected figure. A rating of 2 meant that the reproduced figure was the correct type of graph, but
with some error in the data being graphed, such as incorrect processing of data or plotting a
wrong column as an axis. A rating of 3 meant that the reproduced figure was the same type of
figure as the expected figure and graphed the correct data but could not be numerically accurate
due to a data limitation. A rating of 4 meant that the reproduced figure was the correct type of
graph and had plotted the correct values but had minor errors that prevented it from being an
exact reproduction. A rating of 5 meant that the reproduced figure was visually identical to
the expected figure.
For grading the bioinformatics analysis of the reproductions in comparison to the original data,
the result of either a "satisfactory match" or an "unsatisfactory match" was applied. For example,
the conclusion of a reproduced graph could be either a "satisfactory match" if the same
conclusion was derived from both figures or an "unsatisfactory match" if the analyses of the
graphs were not complementary.
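The two grading schemes can be restated as simple decision rules. The sketch below is in Python and is only an illustration of the rubric as worded above, not a tool used in the research. The function and parameter names are hypothetical, and treating a complementary analysis as a "satisfactory match" is an assumption drawn from the wording of the rubric.

```python
def visual_rating(same_figure_type, correct_data, numerically_accurate, visually_identical):
    """Return the 1-5 visualization rating by checking the criteria in order."""
    if not same_figure_type:
        return 1  # different type of figure than expected
    if not correct_data:
        return 2  # right graph type, but wrong data or wrong column plotted
    if not numerically_accurate:
        return 3  # right type and data, limited by the data format
    if not visually_identical:
        return 4  # correct values, minor errors remain
    return 5      # visually identical to the expected figure


def analysis_grade(same_conclusion, complementary=False):
    """Return the bioinformatics analysis grade.

    Assumption: a complementary conclusion counts as a satisfactory match.
    """
    if same_conclusion or complementary:
        return "satisfactory match"
    return "unsatisfactory match"


# Example: a figure of the right type with the right data, limited by data format.
print(visual_rating(True, True, False, False))
print(analysis_grade(False, complementary=True))
```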
Data Collection, Processing, and Analysis
The researcher reviewed the requirements provided by the mentor to reproduce the
bioinformatics analysis and to reproduce the data visualization (i.e. graphs) referenced in the
research paper by Wang et al. (2018). The researcher made reproductions of figures 1e and 2a,
and then provided an analysis of the trends. The researcher also created a gene ontology table, a
figure similar to figure 2c from the research paper by Wang et al. (2018). This figure was
similar only in concept, because the researcher attempted the reproduction with RNA-seq data,
which led to different gene ontology terms and y-values. The figures the researcher attempted to
reproduce are shown below:
Figure 1e:
Figure 1e. ChIP-seq graph: Established (red) and disappeared (blue) H3K9me3 marks during
embryonic development. Reprinted from "Reprogramming of H3K9me3-dependent
heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell
Biology, 20, p. 621. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.
Figure 2c:
Figure 2c. Gene ontology analysis for oocyte-specific genes. Reprinted from "Reprogramming
of H3K9me3-dependent heterochromatin during mammalian embryo development," by Wang et
al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part
of Springer Nature.
Figure 2a:
Figure 2a. A heatmap showing H3K9me3 domains during mouse embryo development.
Reprinted from "Reprogramming of H3K9me3-dependent heterochromatin during mammalian
embryo development," by Wang et al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by
Macmillan Publishers Limited, part of Springer Nature.
The following figure demonstrates the logic behind reproducing figure 1e from the paper by
Wang et al. (2018):
Figure 1. Schematic showing how stage-specific H3K9me3 marks and genes were identified.
Comparison of H3K9me3 marks/genes specific to the circled stage: The immediate stages
(before and after) were used for comparison, with the 2 cell stage as an example stage to
demonstrate which stages to compare to.
Before reproducing the graphs, the researcher carried out the following steps to prepare the raw
data for further processing for visualization:
1. The researcher fetched mouse gene data provided by the mentor through HGCC (Human
Genetics Computer Cluster), a network of computers for genome projects, by logging into
the researcher's HGCC account.
2. The researcher accessed a script file to set up and execute the TopHat algorithm (an
algorithm that mapped genetic data to corresponding parts of a genome) for processing of
the mouse gene data through HGCC for mapping and aligning the mouse genes to the
mouse genome (a set of chromosomes). The output of this algorithm (.bam and .bed files)
was used as input in the next step.
3. The researcher accessed a script file to set up and execute the Cuffdiff algorithm (an
algorithm that compared differences in gene expression) on the mouse gene data from the
TopHat output and compared modifications of H3K9me3-dependent heterochromatin at
various stages of embryo development using this algorithm. The output of the Cuffdiff
algorithm was a '.diff' file generated from comparison of differential gene expression
between two embryonic stages, which was used as input when further processing data for
visualization.
Since the goal of this research was to reproduce the graphs of the research paper using the
RNA-seq data, statistical analysis of the data did not apply to this research and was not part of
the scope. The results of this processing were in the form of data tables which one can
understand better through data visualization (i.e. plotting graphs). The researcher carried out the
following steps to make the reproduction of figure 1e:
1. The researcher imported the data from '.diff' files generated by the Cuffdiff comparisons to
other embryonic cell stages into an R program by reading the files into R.
Below is the programming logic the researcher used for this step:
Locate the folder in which the '.diff' file is placed on the computer. Open the appropriate
'.diff' file and import it into an R data table by reading each line of data.
2. The researcher wrote R programs to filter the imported data for selective processing
based on predefined criteria for status and significance of genes.
Below is the programming logic the researcher used for this step:
Use the subset function in R programming to filter out the data based on a filter applied
on particular columns (status, significant, etc) to remove undesirable data from
processing.
3. The researcher extracted a list and the number of genes specific to each embryonic
development stage.
Below is the programming logic the researcher used for this step:
Create lists of genes intersecting the previous and the next stage CuffDiff comparison
'.diff' files. For upregulated genes specific to a stage, intersect the upregulated genes from
the CuffDiff comparison to the previous stage with the downregulated genes in the next
stage. For downregulated genes specific to a stage, intersect the downregulated genes
from the CuffDiff comparison to the previous stage with the upregulated genes in the
next stage.
4. The researcher wrote a bar graph R program to reproduce figure 1e.
Below is the programming logic the researcher used for this step:
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data filtered in the previous step. Use geometric functions such
as 'geom_bar', etc. to define the bar graph to be created.
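Steps 1 through 3 above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the researcher's actual script: the two small tables stand in for real CuffDiff '.diff' files, the gene names and values are hypothetical, and the helper names (read_diff, keep_tested, up_genes, down_genes) are invented for this example. The sketch also assumes that each comparison lists the earlier stage as the first sample, so that a positive log2 fold change means higher expression in the later stage.

```r
# Hypothetical stand-ins for two CuffDiff '.diff' files (tab-separated).
# A real file would be read with read.delim("path/to/file.diff").
prev_vs_stage <- paste(
  "gene\tstatus\tlog2(fold_change)\tsignificant",
  "GeneA\tOK\t2.1\tyes",
  "GeneB\tOK\t-1.7\tyes",
  "GeneC\tNOTEST\t0.4\tno",
  "GeneD\tOK\t3.0\tyes",
  sep = "\n"
)
stage_vs_next <- paste(
  "gene\tstatus\tlog2(fold_change)\tsignificant",
  "GeneA\tOK\t-2.5\tyes",
  "GeneB\tOK\t1.2\tyes",
  "GeneC\tOK\t1.0\tno",
  "GeneD\tOK\t0.8\tyes",
  sep = "\n"
)

# Step 1: import each table. read.delim() renames the column
# 'log2(fold_change)' to 'log2.fold_change.' to make it a valid R name.
read_diff <- function(txt) read.delim(text = txt, stringsAsFactors = FALSE)
prev <- read_diff(prev_vs_stage)
nxt  <- read_diff(stage_vs_next)

# Step 2: keep only genes that were tested successfully and are significant.
keep_tested <- function(d) subset(d, status == "OK" & significant == "yes")
prev <- keep_tested(prev)
nxt  <- keep_tested(nxt)

# Step 3: split by direction of change, then intersect across comparisons.
up_genes   <- function(d) d$gene[d$log2.fold_change. > 0]
down_genes <- function(d) d$gene[d$log2.fold_change. < 0]

# Upregulated at this stage: up versus the previous stage, down in the next.
stage_up   <- intersect(up_genes(prev), down_genes(nxt))
# Downregulated at this stage: down versus the previous stage, up in the next.
stage_down <- intersect(down_genes(prev), up_genes(nxt))

# Counts per direction, ready to be passed to a bar graph.
counts <- data.frame(
  direction = c("upregulated", "downregulated"),
  n_genes   = c(length(stage_up), length(stage_down))
)
print(counts)
```

The resulting counts table has one row per direction, which is the shape a bar graph function such as 'geom_bar' expects.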
After the reproduction was complete following these steps, the researcher analyzed the trends of
the graph to provide bioinformatics analysis of the data.
The researcher carried out the following steps for producing gene ontology tables for
genes specific to stages of embryonic development for a reproduction of figure 2c:

1. The researcher went to the Gene Ontology Consortium website (www.geneontology.org)
and input four lists total: two lists for combined 2 cell, 4 cell, and 8 cell stages'
upregulated and downregulated genes and two lists for combined ICM and morula stages'
upregulated and downregulated genes. The Gene Ontology Consortium analyzed these
lists for biological processes.
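2. The researcher then made Excel tables out of the gene ontology terms with the highest P
values. Terms were chosen first by highest P value and then by uniqueness, so that each
table covered a variety of gene ontology terms. In the resulting graphs, the x-axis was
the gene ontology term, and the y-axis was the P value of the term that resulted from the
analysis.
3. The researcher wrote a bar graph R program to plot each of the Excel tables.
Below is the programming logic the researcher used for this step:
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data from the previous step. Use geometric functions such
as 'geom_bar', etc. to define the bar graph to be created.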
After the gene ontology graphs were completed following these steps, the researcher analyzed
the trends of the graphs to provide bioinformatics analysis of the data.
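The bar graph step can be sketched as follows. This is only an illustration: the term labels and P values are hypothetical placeholders rather than results from the study, and the sketch assumes that the plotted score is a -log10-transformed P value, which the text does not state explicitly.

```r
library(ggplot2)

# Hypothetical gene ontology results; labels and P values are placeholders.
go <- data.frame(
  term    = c("GO term A", "GO term B", "GO term C"),
  p_value = c(1e-8, 1e-5, 1e-3)
)

# Assumption: the plotted score is -log10(P), so smaller raw P values
# produce taller bars.
go$score <- -log10(go$p_value)

# Order the bars from the highest to the lowest score.
go$term <- factor(go$term, levels = go$term[order(go$score, decreasing = TRUE)])

# geom_bar(stat = "identity") draws each bar at the height given by 'score'.
go_plot <- ggplot(go, aes(x = term, y = score)) +
  geom_bar(stat = "identity") +
  labs(x = "Gene ontology term", y = "-log10(P value)")

print(go_plot)
```

In practice the data frame would come from one of the Excel tables, for example after exporting it to a CSV file and reading it with read.csv.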
The researcher carried out the following steps to make the reproduction of figure 2a:
1. The researcher took the lists of upregulated and downregulated genes specific to each
embryonic stage and extracted the start and end locations of the genes from the CuffDiff
output. The genes and their respective start and end locations were saved as an Excel file.
2. The researcher input this Excel file into an R script which generated a matrix from this
data, normalized the matrix, and then ran a k-means clustering code on the matrix.
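3. The researcher created a heat map R program to visualize the final matrix.
Below is the programming logic for this step:
Use graphing functions like 'ggplot' from the ggplot2 package available in R
programming to plot the data from the previous step. Use geometric functions such
as 'geom_tile', etc. to define the heat map to be created.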
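Steps 2 and 3 above can be sketched as follows. The matrix here is a hypothetical genes-by-stages table with made-up values, the number of clusters (two) is an arbitrary choice for the example, and row-wise scaling is only one possible normalization; the text does not specify which was used.

```r
library(ggplot2)

set.seed(1)  # k-means starts from randomly chosen centers

# Hypothetical genes x stages matrix; values are made up for illustration.
stages <- c("2cell", "4cell", "8cell", "morula")
m <- matrix(
  c(1, 2, 8, 9,
    2, 1, 9, 8,
    9, 8, 1, 2,
    8, 9, 2, 1,
    1, 1, 9, 9,
    9, 9, 1, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:6), stages)
)

# Normalize each gene (row) to mean 0 and standard deviation 1.
m_norm <- t(scale(t(m)))

# Group genes with similar profiles; nstart repeats the random start
# several times and keeps the best solution.
km <- kmeans(m_norm, centers = 2, nstart = 10)

# Reshape to long format, ordering genes by cluster so that similar
# genes sit next to each other in the heat map.
gene_order <- rownames(m_norm)[order(km$cluster)]
long <- data.frame(
  gene  = factor(rep(rownames(m_norm), times = ncol(m_norm)), levels = gene_order),
  stage = factor(rep(colnames(m_norm), each = nrow(m_norm)), levels = stages),
  value = as.vector(m_norm)
)

# geom_tile() draws one colored cell per gene and stage.
heat <- ggplot(long, aes(x = stage, y = gene, fill = value)) +
  geom_tile()
print(heat)
```

Ordering the rows by cluster before plotting is what makes blocks of similar genes visible as bands of similar color.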
After the reproduction was complete following these steps, the researcher analyzed the trends of
the graph to provide bioinformatics analysis of the data. After completing the reproductions and
their analysis, the researcher carried out testing of the bioinformatics analysis results and data
visualization plots before submitting them to the mentor. The researcher used the following
format for testing prototypes:
Table 1
Data Visualization Testing
Test case # | Figure being reproduced | Expected figure | Reproduced figure | Degree of similarity (1-5)
1 | | | |
2 | | | |
3 | | | |
Test case #: Unique number to identify the test case.
Figure being reproduced: The figure number from the paper by Wang et al. (2018) that the
researcher attempted a reproduction of.
Expected figure: A picture of the original figure from the paper by Wang et al. (2018)
Reproduced figure: The figure the researcher generated as a result of following the method to
reproduce the corresponding figure.

Degree of similarity: The visual similarity between the expected figure and the reproduced figure
on a scale of 1-5. The judgement for numerical ratings is explained in the previous section.
Table 2
Bioinformatics Analysis Testing
Test case # | Associated figure | Analysis of the associated figure | Analysis of the reproduced figure | Conclusion
1 | | | |
2 | | | |
3 | | | |
Test case #: Unique number to identify the test case.
Associated figure: The figure number from the paper by Wang et al. (2018) that the researcher
attempted a reproduction of.
Analysis of the associated figure: Bioinformatics analysis of the trends shown in the associated
figure.
Analysis of the reproduced figure: Bioinformatics analysis of the trends shown in the reproduced
figure.
Conclusion: Determined whether the analysis of the associated figure is a satisfactory match to
the analysis of the reproduced figure. The criteria are provided in the previous section.
After completion of the above testing, the researcher provided the following products to the
mentor for review:
● Data visualization graphs for bioinformatics analysis
● Bioinformatics analysis data results

The mentor reviewed the results provided by the researcher, compared them with the original
research paper results, and confirmed that the results of the research were produced properly.
Assumptions
1. The R programming licensing remained free during the course of the internship. Since R
is an open-source language, its contents are free for everyone to download and modify.
2. The licenses for the R packages (ggplot2, RColorBrewer, dplyr, plotly, and knitr) that the
researcher used remained free during the course of the internship. Like R itself, these
packages are open-source, so their contents are free to download and modify.
3. The mentor supervised the researcher during the course of the internship. To guide the
researcher through the process of the reproductions and to teach the researcher how to
interpret the results, the mentor supervised the researcher to ensure successful research.
4. The researcher conducted research at Yao Lab at Emory University. Since the mentor and
researcher met only at Yao Lab at Emory University, the researcher conducted the study
in the presence of the mentor.
5. The researcher assumed that the published figures from the paper by Wang et al. (2018)
are accurate. The researcher assumed that before publishing the findings, Wang et al.
(2018) must have verified their results to prevent mistakes from occurring.
Limitations
1. Limitation: The researcher used RNA-seq data, which limited the embryonic stages that
the researcher could analyze and caused some figure reproductions to look different from
the original figures, because the RNA-seq results complemented, rather than duplicated,
the ChIP-seq data of the original figures.
2. Limitation: The time window available for the internship was from September 10th,
2018 to December 11th, 2018.

Delimitations
3. Delimitation: R programming was the choice of language for data visualization in this
research.
4. Delimitation: The scope of this research was limited to reproducing the data analysis and
visualization results of the published data from the study by Wang et al. (2018).
5. Delimitation: Only the TopHat and CuffDiff algorithms were in the scope to map, align,
and compare genetic datasets.
Summary
In order to collect the data, the researcher processed the data through HGCC and
reproduced each visualization by further processing data according to each visualization using R
programming and various functions inside packages. Then, the researcher compared each
reproduction with the associated figure and judged each reproduced figure according to the scale
presented in the previous section. Next, the researcher analyzed the trends of the figures and
decided whether the conclusion derived from the original matched the conclusion derived from
the reproductions.
Chapter 4: Findings
The purpose of this research was to reproduce the figures and bioinformatics analyses of
the figures in the research paper by Wang et al. (2018) in order to help the researcher
understand the differences between the techniques Wang et al. (2018) used and the techniques
the researcher used. This also helped the researcher understand the involvement of bioinformatics
analysis and R programming in genetics research. This chapter discusses the results of the
reproductions and bioinformatics analyses, whether they were similar to a satisfactory degree,
and an explanation in the case they were not similar.
Results
Below are the results of the figure reproductions and bioinformatics analyses:
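Table 1
Data Visualization Testing
Test case # | Figure being reproduced | Expected figure | Reproduced figure | Degree of similarity (1-5)
1 | 1e | Figure 1e in Wang et al. (2018) | Appendix A, Figure 2 | 3
2 | 2c | Figure 2c in Wang et al. (2018) | Appendix A, Figures 3-6 | 3
3 | 2a | Figure 2a in Wang et al. (2018) | Appendix A, Figure 7 | 4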
Table 2
Bioinformatics Analysis Testing
Test case # | Associated figure | Analysis of the associated figure | Analysis of the reproduced figure | Conclusion
1 | 1e | A large number of H3K9me3 established marks are shown on the 4 cell, 8 cell, and morula stages. | Low numbers of stage specific genes are expressed in 4 cell, 8 cell, and morula stages. | H3K9me3 is a repressive marker, thus the higher the number of established H3K9me3 marks, the lower the number of genes specific to stages. Satisfactory match.
2 | 2c | The P values in the figure range from 5 to 10. | The P values in most of the tables range from 1 to 8. Only in 2 cell, 4 cell, and 8 cell downregulated GO analysis do the P values go above 10. | The reproduced figures' P values for GO terms do not match the range of P values for the expected graph, indicating different biological processes to be significant. Unsatisfactory match.
3 | 2a | Normalized H3K9me3 input ratio is high in MII, zygote, 2 cell, and 4 cell stages. | Weaker gene expression in 2 cell and 4 cell stages. | Strong H3K9me3 marks in the ChIP-seq original are met with weak specific gene expression in the next stage in the RNA-seq reproduction. Both complement each other. Satisfactory match.
Regarding the reproductions of the figures, the reproduction of figure 1e had a degree of
similarity of 3, indicating that a data limitation prevented a visually identical reproduction. The
gene ontology tables, which were the reproduction of figure 2c from the paper by Wang et al.
(2018), had a degree of similarity of 3 as well, also indicating that a data limitation prevented a
visually identical reproduction. The reproduction of figure 2a had a degree of similarity of 4.
The bioinformatics analysis results indicate that the reproduction and the associated
figure led to the same conclusion for figure 1e, making the reproduction a satisfactory
match. However, the reproductions of figure 2c contained P values in different ranges
from those of the associated figure, leading to a conclusion of an unsatisfactory match
of bioinformatics analyses. The reproduction of figure 2a showed mostly light blue to
dark blue colors for gene expression in the 2 cell and 4 cell stages, which indicated
weaker gene expression in these stages. This complemented the high normalized H3K9me3
input ratio in the original figure, making the reproduction a satisfactory match.
Evaluation of Findings
Although the figure reproductions were not visually identical to the original figures, and
therefore neither were some of the bioinformatics analyses of the reproductions, these differences
were expected due to a difference in the type of data the researcher and Wang et al. (2018) used.
Wang et al. (2018) used ChIP-seq data, which measures H3K9me3 marks, whereas the
researcher used RNA-seq data, which measures gene expression. In the context of biology, the
results of the reproductions, while they initially seemed to contradict the figures, are
complementary. H3K9me3 is a repressive marker which stops genes from being expressed. The
higher the concentration of H3K9me3 marks, the fewer the genes that are expressed. Comparing
figure 1e to the RNA-seq reproduction with this knowledge explains why the bioinformatics
analyses of the two led to the same conclusion: the reproduction showed a lower number of
genes specific to a stage being expressed at points where there are many H3K9me3 marks. The
mismatch in gene ontology tables was also expected due to the difference between ChIP-seq and
RNA-seq data, as gene ontology analysis for H3K9me3 marks is bound to differ from gene
ontology analysis for genes specific to embryonic stages. The heat maps were expected to
match, and they reached the same conclusion.
Summary
To a reader without any knowledge of genetics, the original figures and bioinformatics
analysis may seem to contradict their RNA-seq reproductions due to a lack of visual
similarity. However, with knowledge of how H3K9me3 affects gene expression and knowledge
of the difference between RNA-seq and ChIP-seq data, the original figures and bioinformatics
analysis were complementary with their RNA-seq reproduction counterparts, as some data is
meant to be similar, while other data is not.

Chapter 5: Implications, Real World Connections, Recommendations, and Conclusions
This research aimed to see how well one can reproduce the bioinformatics analysis and
data visualization from the published data by Wang et al. (2018). Through this research, the
researcher learned how bioinformatics and programming are connected in genetics research and
how to produce figures from data. The researcher reproduced the figures and bioinformatics
analysis by first processing raw data with algorithms such as TopHat, Bowtie, and CuffDiff,
and then further processing the data according to each figure specifically. Afterward, the
researcher analyzed the trends to provide an analysis. One of the main limitations of this
research is that the researcher worked with RNA-seq data, whereas the paper by Wang et al.
(2018) worked with ChIP-seq data. Working with RNA-seq data limited the embryonic stages
the researcher could analyze compared to the ChIP-seq data and caused some visualizations and
analyses to differ due to a fundamental difference between RNA-seq and ChIP-seq data. This
chapter discusses how the research data affects the answers to the applied subproblems and its
connections to the real world.
Implications
The first applied subproblem discusses how well one can reproduce the graphs of
published data from the paper by Wang et al. (2018) using R. The researcher's hypothesis
stated that one can reproduce the visualizations to a satisfactory degree. Comparing the
reproductions with the original figures without factoring in genetic interpretation, only one of
the three reproductions was truly visually similar, although the correct data was graphed. The
other two reproductions were not visually similar to the original figures, and therefore, with
the data used, figures from the paper by Wang et al. (2018) could not all be reproduced to be
visually similar. The second applied subproblem discusses how well the bioinformatics analysis
of the original data can match the bioinformatics analysis of the reproduced data. The
researcher's hypothesis claimed that the bioinformatics analyses of both the original and the
reproduction can lead to the same conclusion. Based on the data, the conclusions of the original
data did match those of the reproductions for figures 1e and 2a, so the bioinformatics analysis
of the reproduction can match that of the original data. The gene ontology tables' conclusions
did not match those of the gene ontology table from the paper by Wang et al. (2018), but this
was expected. A limitation that affects the interpretation of the results is a constraint on the
data the researcher worked with. The researcher was assigned by their mentor to perform the
reproduction with RNA-seq data. Working with ChIP-seq data probably would have allowed for
much greater visual and interpretational similarity. These results describe the differences
between RNA-seq and ChIP-seq data and demonstrate that knowledge of genetics is essential to
properly interpret research results in this study. This research implies that in genetics, it can
be important to verify results with different formats of data, such as ChIP-seq and RNA-seq.
This can help to strengthen studies, as the conclusions derived from genetic studies such as
these are then backed by more than one format of data.
Real World Connections
Although this study itself does not connect to or warrant any future research, it
teaches how to make figures for research and how to interpret them in general by allowing the
researcher to take a hands-on approach to replicating the figures and bioinformatics analysis of
an actual paper. This helps the researcher to grow by letting the researcher gain first-hand
experience in a genetics lab setting and communicate with professionals in the field.
Recommendations
For future research, the data collected imply that working with the same format of data
would lead to more visually accurate figure reproductions and possibly more accurate
bioinformatics analysis reproductions as well. For the purpose of accurately reproducing the
figures and analysis in this case, the results of this research suggest working with ChIP-seq data.
Conclusions
The bioinformatics analyses of the reproductions reached the same conclusions as those of
the original data. Not taking genetics knowledge into account, however, the figure reproductions
were not similar to the original figures. Thus, the bioinformatics analysis can be successfully
reproduced, but the figures cannot. Due to the difference between RNA-seq and ChIP-seq data,
however, slightly different figures were expected, especially for the gene ontology tables.
Interpreting both the data and the figures from both the original and the reproductions implies
that some of the visual differences were actual genetic complements.

References
Center for Computational Biology (2016). A spliced read mapper for RNA-Seq [Software].
Retrieved from
https://ccb.jhu.edu/software/tophat/manual.shtml
Data Visualization: What it is and why it matters. (n.d.). Retrieved from
https://www.sas.com/en_us/insights/big-data/data-visualization.html
Emory University School of Medicine Department of Genetics. (n.d.). Retrieved from
http://genetics.emory.edu/about/index.html
Krill, P. (June 30, 2015). Why R? The pros and cons of the R language. Retrieved from
https://www.infoworld.com/article/2940864/application-development/r-programming-
language-statistical-data-analysis.html
Mackenzie, R. J. (April 06, 2018). RNA-seq: Basic Applications and Protocol. Technology
Networks. Retrieved from
https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-
protocol-299461
Neuwirth, E. (February 19, 2015). Package 'RColorBrewer' (Version 1.1-2) [Software PDF].
Comprehensive R Archive Network. Retrieved from
https://cran.r-project.org/web/packages/RColorBrewer/RColorBrewer.pdf
Opensource.com (2013). What is open source? Retrieved from
https://opensource.com/resources/what-open-source
Patil, N. (November 26, 2018). Reproducing the Bioinformatics Analysis and Data Visualization
of a Research Paper. Unpublished manuscript.

Porter, S. (January 28, 2007). Basics: How do you sequence a genome? part III, reads and
chromats. Retrieved from
https://digitalworldbiology.com/archive/basics-how-do-you-sequence-genome-part-iii-
reads-and-chromats
Quinlan Lab at University of Utah (December 08, 2017). Bedtools Documentation (Version
2.27.0) [Software PDF]. Retrieved from
https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf
R Core Team. (July 2, 2018). R Language Definition (Version 3.5.1) [Software PDF].
Comprehensive R Archive Network. Retrieved from
https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf
RNA Sequencing Analysis With TopHat [Software PDF]. (n.d.). Retrieved from
https://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf
Sequence Read Archive (2018a). SRA Toolkit Documentation: Tool Prefetch [Software].
National Center for Biotechnology Information. Retrieved from
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=prefetch
Sequence Read Archive (2018b). SRA Toolkit Documentation: Tool fastq-dump [Software].
National Center for Biotechnology Information. Retrieved from
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump
Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., & Despouy, P.
(July 20, 2018). Package 'plotly' (Version 4.8.0) [Software PDF]. Comprehensive R
Archive Network. Retrieved from
https://cran.r-project.org/web/packages/plotly/plotly.pdf
Stöppler, M. C. (2016). Definition of Bioinformatics. Retrieved from
https://www.medicinenet.com/script/main/art.asp?articlekey=16836
Trapnell Lab at University of Washington's Department of Genome Sciences (2017). Cufflinks
[Software]. Retrieved from
http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/
Wang, C., Liu, X., Gao, Y., Yang, L., Li, C., Liu, W., . . . Gao, S. (2018). Reprogramming of
H3K9me3-dependent heterochromatin during mammalian embryo development. Nature
Cell Biology, 20(5), 620-631. doi:10.1038/s41556-018-0093-4
What is Chromatin Immunoprecipitation (ChIP)? (2015). Retrieved from
http://www.chipseq.com/chromatin-immunoprecipitation/
Wickham, H. (2013). ggplot2 [Software]. Retrieved from
http://had.co.nz/ggplot2/
Wickham, H. (2015). R Packages [Software]. Retrieved from
http://r-pkgs.had.co.nz/intro.html
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., & Woo, K. (July
3, 2018). Package 'ggplot2' (Version 3.0.0) [Software PDF]. Comprehensive R Archive
Network. Retrieved from
https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
Wickham, H., François, R., Henry, L., & Müller, K. (October 16, 2018). Package 'dplyr' (Version
0.7.7) [Software PDF]. Comprehensive R Archive Network. Retrieved from
https://cran.r-project.org/web/packages/dplyr/dplyr.pdf
Xie, Y., Vogt, A., Andrew, A., Zvoleff, A., Simon, A., Atkins, A., … Foster, Z. (February 20,
2018). Package 'knitr' (Version 1.20) [Software PDF]. Comprehensive R Archive
Network. Retrieved from
https://cran.r-project.org/web/packages/knitr/knitr.pdf
APPENDIX A
Figure Reproductions
Figure 2. RNA-seq reproduction of figure 1e. Number of upregulated and downregulated genes
specific to each embryonic stage.

Figure 3. Gene ontology analysis of combined ICM and morula stage downregulated genes.
Analysis from Gene Ontology Consortium determined the most relevant and unique categories
for genes.
Figure 4. Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage upregulated genes.
Analysis from Gene Ontology Consortium determined the most relevant and unique categories
for genes.
Figure 5. Gene ontology analysis of combined ICM and morula stage upregulated genes.
Analysis from Gene Ontology Consortium determined the most relevant and unique categories
for genes.
Figure 6. Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage downregulated
genes. Analysis from Gene Ontology Consortium determined the most relevant and unique
categories for genes.

Figure 7. RNA-seq reproduction of heat map. Heat map plotting gene specific concentrations.
