Running Head: REPRODUCING BIOINFORMATICS ANALYSIS AND GRAPHS 1

Reproducing the Bioinformatics Analysis and Data Visualization of a Research Paper

Advanced Scientific Research Paper

Submitted to the Center for Advanced Studies, Wheeler High School

by

PATIL, NISHANT

The Center for Advanced Studies

Wheeler High School

Marietta, GA

November 2018

Abstract

A previous study uncovered findings regarding H3K9me3 marks in early embryonic stages of

mice. This research aimed to see whether the researcher could reproduce these findings by

reproducing the data visualizations and bioinformatics analysis of three figures. This was done

by utilizing the raw, published data and processing it and programming visualizations according

to each figure. The researcher then analyzed the trends in each figure and decided how similar

the visualizations were and whether the original and reproduced figures yielded the same

conclusion. The results of the research found that two of the three reproductions did not visually

match, although the bioinformatics analysis of the reproductions did yield the same or complementary

conclusions. The researcher's mentor supervised the research process and approved the results

when steps were correctly followed for reproduction and analysis. Due to a limitation in the type

of data, the reproductions could not be visually similar to the original figures, but rather

complementary in interpretation. This research concluded that, based on the data used for

reproduction, figure reproductions could not be made to be visually similar. The bioinformatics

analysis did reach the same conclusions, so the analysis could be reproduced. In genetics, this

study suggests how useful it can be to verify findings through other formats of data. This

research allowed the researcher to gain knowledge on how to produce figures and process data.

However, it also shows that working with the same format of data as previous research would

prove more effective for reproduction purposes.

Key Words: epigenetics, reproduction, programming, bioinformatics, visualization


Table of Contents

Chapter 1: Introduction
    Statement of the Problem
    Purpose of the Study
    Research Questions
    Hypothesis Statements
    Significance of the Study
    Definition of Key Terms
    Summary

Chapter 2: Literature Review
    Bioinformatics Processing
    Data Visualization
    Summary

Chapter 3: Research Method
    Research Methodology and Design
    Population
    Sample
    Materials/Instruments
    Operational Definition of Variables
    Data Collection, Processing, and Analysis
    Assumptions
    Limitations
    Delimitations
    Summary

Chapter 4: Findings
    Results
    Evaluation of Findings
    Summary

Chapter 5: Implications, Real World Connections, Recommendations, and Conclusions
    Implications
    Real World Connections
    Recommendations
    Conclusions

References

Appendix A: Figure Reproductions


LIST OF ABBREVIATIONS

GO Gene Ontology

HGCC Human Genetics Computer Cluster

H3K9me3 Trimethylation of lysine 9 on histone H3

LTR Long Terminal Repeat

NCBI National Center for Biotechnology Information


LIST OF TABLES

Table 1 Data Visualization Testing

Table 2 Bioinformatics Analysis Testing

LIST OF FIGURES

Figure 1e ChIP-seq graph: Established (red) and disappeared (blue) H3K9me3 marks during embryonic development

Figure 2c Gene ontology analysis for oocyte-specific genes

Figure 2a A heatmap showing H3K9me3 domains during mouse embryo development

Figure 1 Schematic showing how stage-specific H3K9me3 marks and genes were identified

Figure 2 RNA-seq reproduction of figure 1e

Figure 3 Gene ontology analysis of combined ICM and morula stage downregulated genes

Figure 4 Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage upregulated genes

Figure 5 Gene ontology analysis of combined ICM and morula stage upregulated genes

Figure 6 Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage downregulated genes

Figure 7 RNA-seq reproduction of heat map


Chapter 1: Introduction

In their research, Wang et al. (2018) studied genetic reprogramming of mice during early

embryonic stages, specifically the marks of heterochromatin marker histone 3 lysine 9

trimethylation (H3K9me3) on Long Terminal Repeats, which are long, repetitive sequences of

DNA (Wang et al., 2018). Mammalian fertilized eggs undergo epigenetic modifications

following fertilization, and thus the mouse genome becomes demethylated during early

embryonic development. LTRs become hypomethylated and need to be regulated, and previous

studies have shown that H3K9me3 modifiers have regulated LTRs (Wang et al., 2018).

However, the involvement of H3K9me3-dependent heterochromatin (heterochromatin being

tightly packed chromatin) in LTR regulation was unknown. The study by Wang et al.

(2018) uncovered the molecular details behind the reprogramming of H3K9me3-dependent

heterochromatin. This research examines how well their findings, namely the figures and

bioinformatics analysis, can be reproduced using data in a different format,

RNA-seq data.

Statement of the Problem

How well can one reproduce the bioinformatics analysis and graphs from the

"Reprogramming of H3K9me3-dependent heterochromatin during mammalian embryo

development" research paper?

Purpose of the Study

This research aims to reproduce the figures and analysis of a published study in order

to learn how figures are produced and how to interpret these figures. The process of reproduction

is useful knowledge for helping researchers to understand how to conduct genetic research in the

future, how to account for differences, and how to interpret sources of differences.


Research Questions

1. Applied subproblem: How well can one reproduce the bioinformatics analysis of the

genetics data from published genetic data?

2. Applied subproblem: How well can one reproduce the graphs of published data using R?

Hypothesis Statements

1. ASP 1: Using the bioinformatics algorithms, it is possible to reproduce the results of the

data analysis that Wang et al. (2018) produced and reach the same conclusion.

The independent variable is the bioinformatics algorithms, and the dependent variable is

the conclusion derived from the analysis.

2. ASP 2: Using the R programs and packages, it is possible to reproduce the results of the

visualization that Wang et al. (2018) have produced with a degree of similarity. The

independent variable is the packages used in R programming, and the dependent variable

is the degree of similarity on a scale of 1 to 5.

Significance of the Study

The study of genetics involves collection of a lot of complex data. Analysis of that data

can be a very overwhelming, challenging, and time-consuming job. This research assists in

verifying the findings presented in the study by Wang et al. (2018). Since the data are sequenced

differently between RNA-seq and ChIP-seq, although the values and genetic information may be

different, the conclusions should ultimately be the same. Verifying these findings strengthens the

conclusions derived from the study by Wang et al. (2018) since more supporting information

backs up the findings.


Definition of Key Terms

1. Bioinformatics: The sum of the computational approaches to analyze, manage, and store

biological data (Stöppler, 2016, para. 1).

2. ChIP-seq: A technique to analyze the histone modifications, DNA modifications, or

binding sequence of proteins within a genome ("What is Chromatin Immunoprecipitation

(ChIP)?" n.d., para. 1).

3. Cuffdiff: A program that you can use to find significant changes in transcript expression,

splicing, and promoter use (Trapnell Lab at the University of Washington's Department

of Genome Sciences, 2017).

4. Data visualization: Data visualization is the presentation of data in a pictorial or

graphical format (SAS, n.d., para. 1).

5. Gene read: Each nucleotide sequence in genes is called a read (Porter, 2007, para. 3).

6. ggplot2: ggplot2 is a plotting system for R, based on the grammar of graphics, which

tries to take the good parts of base and lattice graphics and none of the bad parts

(Wickham, 2013, para. 1).

7. Human Genetics Computer Cluster (HGCC): Human Genetics Computer Cluster

(HGCC) is a computer network composed of 1 head node and 22 compute nodes and

serves multiple functions related to genomic projects and data storage (Department of

Human Genetics at Emory University School of Medicine, 2017, para. 3).

8. Open Source: Open source software is software with source code that anyone can

inspect, modify, and enhance (Opensource.com, 2013, para. 3).

9. Package: In R, the fundamental unit of shareable code is the package (Wickham, 2015,

para. 1).

10. R: R is a system for statistical computation and graphics (R Core Team, 2018).

11. RNA-seq: A technique that examines the quantity and sequence of RNA in a sample

(Mackenzie, 2018, para. 1).

12. TopHat: TopHat is a fast splice junction mapper for RNA-Seq reads. TopHat works in

conjunction with Bowtie, which aligns gene reads (Center for Computational Biology at

Johns Hopkins University, 2016, para. 1).

Summary

Wang et al. (2018) found connections regarding H3K9me3 marks during specific

embryonic stages, and this research aims to see whether the conclusions derived can be matched

by making reproductions using RNA-seq data. The hypotheses claim that one can reproduce the

conclusions and claims accurately.


Chapter 2: Literature Review

This research consists of two foundational subproblems: bioinformatics analysis and data

visualization. The bioinformatics analysis foundational subproblem consists of collecting the

sequenced RNA-seq or ChIP-seq data, converting them to an appropriate format, mapping the

genetic data to an appropriate genome, comparing differential gene expression, and analyzing

the data ("RNA Sequencing Analysis with TopHat", n.d., p. 4). The second foundational

subproblem involves importation of output from the previous subproblem into R programs for

data visualization. This subproblem involves usage of various packages, each a "fundamental unit of

shareable code" (Wickham, 2015, para. 1), and programs in R. The output of this stage consists

of various graphs that data scientists refer to for trends and related analysis.

Bioinformatics Processing

The first foundational subproblem of the research is the bioinformatics analysis. It asks

how one can analyze bioinformatics data. As a part of bioinformatics analysis, the researcher

collected and processed RNA sequencing samples. The researcher retrieved data samples using

prefetch, a bioinformatics command that "allows command-line downloading of SRA (Sequence

Read Archive), dbGaP, and ADSP data" (Sequence Read Archive, 2018a, para. 2). They used

various options such as '-h' to get help regarding the documentation of commands. The '-h'

command option "displays ALL options, general usage, and version information" (Sequence

Read Archive, 2018a, para. 2). This is a very useful option for technical help. The researcher

used the '-f' option to "force object download" to ensure proper retrieval of raw data (Sequence

Read Archive, 2018a, para. 2). The '-l' option lists "the contents of a kart file" (Sequence Read

Archive, 2018a, para. 2).


After downloading the data, the next step was to perform a file conversion from SRA to

the '.fastq.gz' file type. The 'fastq-dump' command allowed conversion of "SRA data into fastq

format" (Sequence Read Archive, 2018b, para. 1). The researcher used various options available

for this command. As always, the '-h' option was handy for easy reference of documentation. The

'-M' option is useful to filter data "by sequence length" (Sequence Read Archive, 2018b, para. 2).

The '-O' option helped to specify the output directory to store the information (Sequence Read

Archive, 2018b, para. 2). There are two types of data: paired-end sequence data and single-read

sequence data. FASTQ files generated by the 'fastq-dump' command have different outputs based

on whether the SRA data is paired-end or single-read.

The next important step was to align the "RNA-Seq reads to a genome in order to identify

exon-exon splice junctions" (Center for Computational Biology at Johns Hopkins University,

2016, para. 1). The researcher used the TopHat algorithm for this purpose. TopHat works along

with Bowtie: Bowtie maps gene reads to a reference genome, and TopHat finds spliced junctions

and aligns gene reads. TopHat produced different outputs depending on whether the sequences were

paired-end or single-read. One of the outputs of the TopHat command was an

'accepted_hits.bam' file, which was essential for further computation (Center for Computational

Biology at Johns Hopkins University, 2016, para. 19).

The next step was to use the bamToBed command, which "is a conversion utility that

converts sequence alignments in BAM format to BED records" (Quinlan Lab at University of

Utah, 2017, p. 46). The researcher ran the bamToBed command on the 'accepted_hits.bam' file

generated by the TopHat command. The researcher used various command options

such as "-mate1", "-split", "-ed", "-tag", etc. to control how the BAM records were converted (Quinlan

Lab at University of Utah, 2017, p. 46).


The last important step was to compare differential gene expression using Cuffdiff "to

find significant changes in transcript expression, splicing, and promoter use" (Trapnell Lab at the

University of Washington's Department of Genome Sciences, 2017). In simple terms, this

algorithm compared differences in gene expression between samples in an experiment. This step

concluded the first foundational subproblem of the research.

In order to process RNA-seq data and prepare it for visualization, one must fetch the data,

convert it, align the reads, convert the alignments to a compatible format, and identify

differential gene expression. First, the 'prefetch' command downloaded data from a database such

as NCBI. Next, the 'fastq-dump' command prepared the raw data for TopHat input by converting the

SRA files to the '.fastq.gz' file type. The output of the 'fastq-dump' command was then processed

by the TopHat algorithm, which, in conjunction with Bowtie, aligned and mapped gene reads. The

TopHat output was then given to the 'bamToBed' command, which converted the alignments to BED

format for programs that read BED rather than BAM files. Lastly, the Cuffdiff algorithm compared

gene expression between two TopHat outputs. This processing allowed one to analyze and further

process the data for visualizing specific information in a programming language such as R.
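
The order of the commands described above can be sketched as follows. The sketch is written in Python purely for illustration and only builds the command lines; it does not run them. The accession SRR0000001, the Bowtie index name mm10_index, the annotation file genes.gtf, and all directory names are hypothetical placeholders. The options shown are common ones for single-read data, not necessarily the ones the researcher used, and paired-end data would need different options.

```python
# Sketch: build (but do not run) the command lines for the processing steps.
# The accession, index name, and all paths are hypothetical placeholders.

def build_pipeline(accession, bowtie_index, out_dir):
    """Return the per-sample commands as argument lists, in execution order."""
    fastq = f"{out_dir}/{accession}.fastq.gz"
    tophat_dir = f"{out_dir}/tophat_{accession}"
    return [
        # 1. Download the SRA record.
        ["prefetch", accession],
        # 2. Convert the SRA record to gzip-compressed FASTQ in out_dir.
        ["fastq-dump", "--gzip", "-O", out_dir, accession],
        # 3. Align reads; TopHat writes accepted_hits.bam into tophat_dir.
        ["tophat", "-o", tophat_dir, bowtie_index, fastq],
        # 4. Convert BAM alignments to BED records (written to standard output).
        ["bamToBed", "-i", f"{tophat_dir}/accepted_hits.bam"],
    ]


def build_cuffdiff(annotation_gtf, bam_a, bam_b, out_dir):
    """Return the Cuffdiff command that compares two TopHat outputs."""
    return ["cuffdiff", "-o", out_dir, annotation_gtf, bam_a, bam_b]


commands = build_pipeline("SRR0000001", "mm10_index", "data")
for cmd in commands:
    print(" ".join(cmd))
```

Keeping each command as a list of arguments makes the order of the steps explicit and lets the same function be reused for every sample before two of the TopHat outputs are handed to Cuffdiff.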

Data Visualization

The second foundational subproblem of the research is data visualization of the

bioinformatic data output from the first foundational subproblem. It asked how one can visualize

bioinformatics data using R programming. The researcher used the R programming language for

statistical analysis and graphical representation of data. The researcher imported the output of the

first subproblem into R programs for data visualization. One of the advantages R offers is a wide

range of open-source (i.e. free) packages that provide advanced algorithms and complex

graphing capabilities (Krill, 2015, para. 5). Some of the important packages the researcher used

in R are 'ggplot2' and 'dplyr' (Krill, 2015, para. 8). Packages like ggplot2 and plotly provided

specialized functionality related to plotting graphs. Different options like bar graphs, pie charts,

scatter plots, and heat maps were available for visualization. The researcher could use these two

packages together since they complement each other. In ggplot2, "you provide the data, tell

'ggplot2' how to map variables to aesthetics, what graphical primitives to use, and it takes care of

the details" (Wickham et al., 2018, p. 1). 'ggplot' is the most important function available in this

package. This function allowed the researcher to create a new ggplot graph to visualize data

(Wickham et al., 2018, p. 112). The researcher used this function in conjunction with functions

like 'geom_bar' to create bar charts (Wickham et al., 2018, p. 43). After creating a graph in

ggplot2, the researcher used 'plotly' to "easily translate 'ggplot2' graphs to an interactive web-

based version and/or create custom web-based visualizations directly from R" for ease of access

from anywhere on the internet (Sievert et al., 2018, p. 1). 'ggplotly' is one of the important functions

that allows conversion of ggplot2 graphs to plotly (Sievert et al., 2018, p. 22). Specialized packages

such as 'RColorBrewer' create beautiful color palettes for data visualization (Neuwirth, 2015, p.

1). 'dplyr' is a package that provided a "fast, consistent tool for working with data frame like

objects, both in memory and out of memory" (Wickham, François, Henry, & Müller, 2018, p. 1).

The researcher used this package to work with datasets using functions like 'bind' (Wickham,

François, Henry, & Müller, 2018, p. 2), 'select' (Wickham, François, Henry, & Müller, 2018, p.

3), and 'filter' (Wickham, François, Henry, & Müller, 2018, p. 2) for data manipulation. 'knitr' is

another useful package that provided a tool for report generation in R (Xie et al., 2018, p. 1).

'knit' and 'stitch' are two important functions in this package. 'knit' converted the data in an input

file to a proper format (Davis, 2018, p. 27). 'stitch' automatically created "a report based on an R

script and a template" (Xie and Friendly, 2018, p. 65). This package was very useful in cases

where the researcher created templates with documentation for other researchers to reference.

The output of this subproblem consists of various graphs for data visualization.

In order to successfully visualize data in R, packages and functions were necessary.

Packages provided powerful functions that allowed for a great variety of customization of plots.

The ggplot2 package allowed for the creation of bar graphs, scatter plots, heat maps, and many

other visualizations. The plotly package worked in conjunction with ggplot2 by allowing for the

creation of web-based visualizations or by rendering plots in a web format. dplyr allowed for

working with and manipulating data. One could use knitr to create documentation for written code for

future reference.

Summary

Understanding bioinformatics data required thorough analysis of the sample data

collected. Visualization alone is not enough for a clear understanding of that data. That is where

the analysis of the data visualization aspect of bioinformatics came into the picture. Data

visualization complemented the analysis by providing clear and easily understandable graphs.

Technology helped this process by providing tools. The software packages and commands

available with technology allowed the researcher to reproduce the bioinformatics analysis and

graphs by reusing the raw data.


Chapter 3: Research Method

The main requirement the mentor specified was to reproduce the bioinformatics

analysis and the data visualization of the published mouse genetic data referred to in the

research paper by Wang et al. (2018). The overarching question of this research was: how well

can one reproduce the bioinformatics analysis and graphs from the "Reprogramming of

H3K9me3-dependent heterochromatin during mammalian embryo development" research paper?

This research aimed to reproduce the figures and analysis of a published study in order to

learn how figures are produced and how to interpret these figures. The process of reproduction is

useful knowledge for helping researchers to understand how to conduct genetic research in the

future, how to account for differences, and how to interpret sources of differences. This chapter

identifies how the researcher reproduced the figures and bioinformatics analysis of the data from

the paper by Wang et al. (2018). This chapter encompasses the processing and the figure creation

process, along with the grading of the results of the visualizations and the bioinformatics

analysis.

Research Methodology and Design

This research followed the engineering design process (EDP). The research involved

requirements from the mentor, analysis of those requirements, and designing and developing a

solution in order to replicate the bioinformatics analysis and the visualizations of the research

paper by Wang et al. (2018). Finally, as a part of testing, the mentor compared the visualizations

and analyses of this research to those from the paper and provided approval.

Population

According to the original study by Wang et al. (2018), 6 B6D2F1 and C57BL female

mice, all of which were 8-10 weeks old, were mated with B6D2F1 or DBA2 male mice for that

study.

Sample

Since the researcher based this research on published raw data from the research by

Wang et al. (2018), the researcher fetched all of the raw RNA-seq data available from the study.

This data was available at NCBI.

Materials/Instruments

The raw RNA-seq data the researcher processed was obtained from an NCBI database

and consisted of SRA files. This maintains validity and test-retest reliability, as the researcher

did not modify or use any data from a different source. For instruments used to process data, the

researcher used R programming language, Excel, HGCC, and Gene Ontology Consortium. These

materials retain construct reliability and test-retest reliability, because the same tools were used

by Wang et al. (2018) and because machine algorithms and processes are not subject to human

error.

Operational Definition of Variables

For grading the visualization reproductions on a scale of 1-5, the criteria for each number were as

follows: A rating of 1 meant that the reproduced figure was a different type of figure than the

expected figure. A rating of 2 meant that the reproduced figure was the correct type of graph, but

with some error in the data being graphed, such as incorrect processing of data or plotting a

wrong column as an axis. A rating of 3 meant that the reproduced figure was the same type of

figure as the expected figure and graphed the correct data but could not be numerically accurate

due to a data limitation. A rating of 4 meant that the reproduced figure was the correct type of

graph and had plotted the correct values but had minor errors that prevented it from being an

exact reproduction. A rating of 5 meant that the reproduced figure was visually identical to

the expected figure.

For grading the bioinformatics analysis of the reproductions in comparison to the original data,

the result of either a "satisfactory match" or an "unsatisfactory match" was applied. For example,

the test conclusion for a reproduced graph could be either a "satisfactory match" if the same

conclusion was derived from both figures or an "unsatisfactory match" if the analyses of the

graphs were not complementary.

Data Collection, Processing, and Analysis

The researcher reviewed the requirements provided by the mentor to reproduce the

bioinformatics analysis and to reproduce the data visualization (i.e. graphs) referenced in the

research paper by Wang et al. (2018). The researcher made reproductions of figures 1e and 2a,

and then provided an analysis of the trends. The researcher also created a gene ontology table, a

figure similar to figure 2c from the research paper by Wang et al. (2018), but this figure was

essentially only similar in concept as the researcher attempted reproductions with RNA-seq data,

which led to different gene ontology terms and y-values. The figures the researcher attempted to

reproduce are shown below:

Figure 1e:

Figure 1e. ChIP-seq graph: Established (red) and disappeared (blue) H3K9me3 marks during

embryonic development. Reprinted from "Reprogramming of H3K9me3-dependent

heterochromatin during mammalian embryo development," by Wang et al. 2018, Nature Cell

Biology, 20, p. 621. Copyright 2018 by Macmillan Publishers Limited, part of Springer Nature.

Figure 2c:

Figure 2c. Gene ontology analysis for oocyte-specific genes. Reprinted from "Reprogramming

of H3K9me3-dependent heterochromatin during mammalian embryo development," by Wang et

al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by Macmillan Publishers Limited, part

of Springer Nature.

Figure 2a:

Figure 2a. A heatmap showing H3K9me3 domains during mouse embryo development.

Reprinted from "Reprogramming of H3K9me3-dependent heterochromatin during mammalian

embryo development," by Wang et al. 2018, Nature Cell Biology, 20, p. 622. Copyright 2018 by

Macmillan Publishers Limited, part of Springer Nature.

The following figure demonstrates the logic behind reproducing figure 1e from the paper by

Wang et al. (2018):

Figure 1. Schematic showing how stage-specific H3K9me3 marks and genes were identified.

Comparison of H3K9me3 marks/genes specific to the circled stage: The immediate stages

(before and after) were used for comparison, with the 2 cell stage as an example stage to

demonstrate which stages to compare to.
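
The comparison logic in the schematic can be sketched with a set intersection. For upregulated genes specific to a stage, the researcher intersected the genes upregulated relative to the previous stage with the genes downregulated in the next stage. The sketch below uses Python only for illustration (the researcher worked in R), and the function name and gene names are hypothetical.

```python
# Sketch: genes specific to one stage, found by intersecting two comparisons.
# The gene names are made up for illustration.

def stage_specific(changed_vs_previous, changed_in_next):
    """Return the genes present in both comparisons, sorted for stable output."""
    return sorted(set(changed_vs_previous) & set(changed_in_next))


# Hypothetical gene lists for the 2 cell stage:
up_vs_previous_stage = {"GeneA", "GeneB", "GeneC"}  # higher in 2 cell than in the stage before
down_in_next_stage = {"GeneB", "GeneC", "GeneD"}    # lower in the stage after than in 2 cell

specific_up = stage_specific(up_vs_previous_stage, down_in_next_stage)
print(specific_up, len(specific_up))
```

Only genes that appear in both lists count as specific to the stage, which is why a gene that changes in just one of the two neighboring comparisons is left out.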

Before reproducing the graphs, the researcher carried out the following steps to prepare the raw

data for further processing for visualization:

1. The researcher fetched mouse gene data provided by the mentor through HGCC (Human

Genetics Computer Cluster), a network of computers for genome projects, by logging into

their HGCC account.

2. The researcher accessed a script file to set up and execute the TopHat algorithm (an

algorithm that mapped genetic data to corresponding parts of a genome) for processing of

the mouse gene data through HGCC for mapping and aligning the mouse genes to the

mouse genome (a set of chromosomes). The output of this algorithm (.bam and .bed files)

was used as input in the next step.

3. The researcher accessed a script file to set up and execute the Cuffdiff algorithm (an

algorithm that compared differences in gene expression) on the mouse gene data from the

TopHat output and compared modifications of H3K9me3-dependent heterochromatin at

various stages of embryo development using this algorithm. The output of the CuffDiff

algorithm is a '.diff' file generated from comparison of differential gene expression

between two embryonic stages, which was used as input when further processing data for

visualization.

Since the goal of this research was to reproduce the graphs of the research paper using the

RNA-seq data, statistical analysis of the data did not apply to this research and was not part of

the scope. The results of this processing were in the form of data tables which one can

understand better through data visualization (i.e. plotting graphs). The researcher carried out the

following steps to make the reproduction of figure 1e:

1. The researcher imported the data from the '.diff' files generated by the CuffDiff comparisons

to other embryonic cell stages into an R program by reading the files into R.

Below is the programming logic the researcher used for this step:

Locate the folder in which the '.diff' file is placed on the computer. Open the appropriate

'.diff' file and import it into an R data table by reading each line of data.

2. The researcher wrote R programs to filter the imported data for selective processing

based on predefined criteria for status and significance of genes.

Below is the programming logic the researcher used for this step:

Use the subset function in R to filter the data on particular columns (status, significant,

etc.) and remove undesirable rows before further processing.

3. The researcher extracted a list and the number of genes specific to each embryonic

development stage.

Below is the programming logic the researcher used for this step:

Create lists of genes intersecting the previous and the next stage CuffDiff comparison

'.diff' files. For upregulated genes specific to a stage, intersect the upregulated genes from

the CuffDiff comparison to the previous stage with the downregulated genes in the next

stage. For downregulated genes specific to a stage, intersect the downregulated genes

from the CuffDiff comparison to the previous stage with the upregulated genes in the

next stage.

4. The researcher wrote a bar graph R program to reproduce figure 1e.

Below is the programming logic the researcher used for this step:

Use graphing functions like 'ggplot' from the ggplot2 package available in R

programming to plot the data filtered in the previous step. Use geometric functions such

as 'geom_bar', etc. to define the bar graph to be created.

After the reproduction was complete following these steps, the researcher analyzed the trends of

the graph to provide bioinformatics analysis of the data.
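The import, filter, and intersection logic in steps 1 to 3 above can be illustrated with a minimal sketch. It is written in Python with only the standard library and is not the researcher's R program. The gene names and values are invented, the tables are reduced to the four columns used here (real '.diff' files contain more), and the sketch assumes that a positive log2(fold_change) means higher expression in the later stage of each comparison.

```python
import csv
import io

def read_diff(handle):
    """Read a tab-separated '.diff' table into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))

def filtered_genes(rows, direction):
    """Keep genes with status OK and significant == yes that change in one direction."""
    genes = set()
    for row in rows:
        if row["status"] != "OK" or row["significant"] != "yes":
            continue  # drop untested or non-significant rows
        change = float(row["log2(fold_change)"])
        if (direction == "up" and change > 0) or (direction == "down" and change < 0):
            genes.add(row["gene"])
    return genes

def stage_specific(previous_vs_stage, stage_vs_next):
    """Intersect the two neighbouring comparisons to find stage-specific genes."""
    up = filtered_genes(previous_vs_stage, "up") & filtered_genes(stage_vs_next, "down")
    down = filtered_genes(previous_vs_stage, "down") & filtered_genes(stage_vs_next, "up")
    return up, down

# Hypothetical toy data standing in for two '.diff' files around the 2 cell stage.
COLUMNS = "gene\tstatus\tlog2(fold_change)\tsignificant\n"
ZYGOTE_VS_2CELL = COLUMNS + (
    "GeneA\tOK\t2.5\tyes\n"
    "GeneB\tOK\t-1.8\tyes\n"
    "GeneC\tOK\t3.0\tno\n"
    "GeneD\tNOTEST\t4.0\tno\n"
)
TWOCELL_VS_4CELL = COLUMNS + (
    "GeneA\tOK\t-2.0\tyes\n"
    "GeneB\tOK\t1.2\tyes\n"
    "GeneC\tOK\t-3.0\tyes\n"
    "GeneD\tOK\t-1.0\tyes\n"
)

up_genes, down_genes = stage_specific(
    read_diff(io.StringIO(ZYGOTE_VS_2CELL)),
    read_diff(io.StringIO(TWOCELL_VS_4CELL)),
)
print(sorted(up_genes), len(up_genes))
print(sorted(down_genes), len(down_genes))
```

The sizes of the two resulting sets are the kind of per-stage counts that a bar graph such as the one in step 4 would plot.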

The researcher carried out the following steps for producing gene ontology tables for

genes specific to stages of embryonic development for a reproduction of figure 2c:



1. The researcher went to the Gene Ontology Consortium website (www.geneontology.org)

and input four lists total: two lists for combined 2 cell, 4 cell, and 8 cell stages'

upregulated and downregulated genes and two lists for combined ICM and morula stages'

upregulated and downregulated genes. The Gene Ontology Consortium analyzed these

lists for biological processes.

2. The researcher then made Excel tables out of the gene ontology terms with the highest P

values. Terms were chosen first by highest P value and then by uniqueness, so that each

table covered a variety of gene ontology terms. In the resulting graphs, the x-axis showed the

gene ontology terms, and the y-axis showed the P value of each term from the analysis.

3. The researcher wrote a bar graph R program to plot each of the Excel tables.

Below is the programming logic the researcher used for this step:

Use graphing functions like 'ggplot' from the ggplot2 package available in R

programming to plot the data filtered in the previous step. Use geometric functions such

as 'geom_bar', etc. to define the bar graph to be created.

After the gene ontology graphs were completed following these steps, the researcher analyzed

the trends of the graphs to provide bioinformatics analysis of the data.
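The selection in step 2 above (highest values first, then distinct terms) could be sketched roughly as follows. This is an illustration in Python, not the procedure the researcher carried out in Excel. The numeric values are invented, and treating "uniqueness" as simply skipping repeated term names is an assumption made for the example.

```python
def top_terms(results, limit):
    """Rank (term, value) pairs by value, largest first, and keep distinct terms."""
    ranked = sorted(results, key=lambda item: item[1], reverse=True)
    seen = set()
    picked = []
    for term, value in ranked:
        key = term.lower()
        if key in seen:
            continue  # skip a term that was already selected
        seen.add(key)
        picked.append((term, value))
        if len(picked) == limit:
            break
    return picked

# Hypothetical gene ontology output: term names with made-up values.
example_results = [
    ("cell cycle", 8.1),
    ("DNA repair", 6.4),
    ("cell cycle", 7.9),
    ("translation", 3.2),
    ("RNA splicing", 5.0),
]

table_rows = top_terms(example_results, limit=3)
for term, value in table_rows:
    print(term, value)
```

Each selected pair corresponds to one bar in the graph: the term on the x-axis and its value on the y-axis.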

The researcher carried out the following steps to make the reproduction of figure 2a:

1. The researcher took the lists of upregulated and downregulated genes specific to each

embryonic stage and extracted the start and end locations of the genes from the CuffDiff

output. The genes and their respective start and end locations were saved as an Excel file.

2. The researcher input this Excel file into an R script which generated a matrix from this

data, normalized the matrix, and then ran a k-means clustering code on the matrix.

3. The researcher created a heat map R program to visualize the final matrix.

Below is the programming logic for this step:

Use graphing functions like 'ggplot' from the ggplot2 package available in R

programming to plot the data filtered in the previous step. Use geometric functions such

as 'geom_tile', etc. to define the heat map to be created.
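The matrix, normalization, and k-means sequence in step 2 above can be illustrated with a small sketch. It is written in Python with only the standard library and is not the researcher's R script. The source does not state which normalization was used, so the row-wise z-score here is an assumption, as are the toy values, the choice of two clusters, and the use of the first rows as starting centers.

```python
import statistics

def zscore_rows(matrix):
    """Scale each row to mean 0 and standard deviation 1 (assumed normalization)."""
    normalized = []
    for row in matrix:
        mean = statistics.fmean(row)
        spread = statistics.pstdev(row)
        normalized.append([(v - mean) / spread if spread else 0.0 for v in row])
    return normalized

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, k, iterations=20):
    """Plain k-means: assign rows to the nearest center, then recompute centers."""
    centers = [list(row) for row in rows[:k]]  # deterministic start: first k rows
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [statistics.fmean(column) for column in zip(*members)]
    return labels

# Hypothetical matrix: one row per gene, one column per embryonic stage.
matrix = [
    [1, 2, 9],
    [9, 2, 1],
    [2, 3, 10],
    [8, 1, 0],
]

normalized = zscore_rows(matrix)
labels = kmeans(normalized, k=2)
print(labels)
```

Rows that share a cluster label would be drawn next to each other in the heat map, which is what makes blocks of similar values visible.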

After the reproduction was complete following these steps, the researcher analyzed the trends of

the graph to provide bioinformatics analysis of the data. After completing the reproductions and

their analysis, the researcher carried out testing of the bioinformatics analysis results and data

visualization plots before submitting them to the mentor. The researcher used the following

format for testing prototypes:

Table 1

Data Visualization Testing

Test case # | Figure being reproduced | Expected figure | Reproduced figure | Degree of similarity (1-5)
1 | | | |
2 | | | |
3 | | | |

Test case #: Unique number to identify the test case.

Figure being reproduced: The figure number from the paper by Wang et al. (2018) that the

researcher attempted a reproduction of.

Expected figure: A picture of the original figure from the paper by Wang et al. (2018)

Reproduced figure: The figure the researcher generated as a result of following the method to

reproduce the corresponding figure.



Degree of similarity: The visual similarity between the expected figure and the reproduced figure

on a scale of 1-5. The judgement for numerical ratings is explained in the previous section.

Table 2

Bioinformatics Analysis Testing

Test case # | Associated figure | Analysis of the associated figure | Analysis of the reproduced figure | Conclusion
1 | | | |
2 | | | |
3 | | | |

Test case #: Unique number to identify the test case.

Associated figure: The figure number from the paper by Wang et al. (2018) that the researcher

attempted a reproduction of.

Analysis of the associated figure: Bioinformatics analysis of the trends shown in the associated

figure.

Analysis of the reproduced figure: Bioinformatics analysis of the trends shown in the reproduced

figure.

Conclusion: Determined whether the analysis of the associated figure is a satisfactory match to

the analysis of the reproduced figure. The criteria are provided in the previous section.

After completion of the above testing, the researcher provided the following products to the

mentor for review:

● Data visualization graphs for bioinformatics analysis

● Bioinformatics analysis data results



The mentor reviewed the results provided by the researcher, compared those with the original

research paper results, and approved that the results of the research were made properly.

Assumptions

1. The R programming licensing remained free during the course of the internship. Since R

is an open-source language, its contents are free for everyone to download and modify.

2. The licenses for the R packages (ggplot2, RColorBrewer, dplyr, plotly, and knitr) that the

researcher used remained free during the course of the internship. Due to R's open-source

nature, packages too are open-source, so their contents are free to download and modify.

3. The mentor supervised the researcher during the course of the internship. To guide the

researcher through the process of the reproductions and to teach the researcher how to

interpret the results, the mentor supervised the researcher to ensure successful research.

4. The researcher conducted research at Yao Lab at Emory University. Since the mentor and

researcher met only at Yao Lab at Emory University, the researcher conducted the study

in the presence of the mentor.

5. The researcher assumed that the published figures from the paper by Wang et al. (2018)

are accurate. The researcher assumed that before publishing the findings, Wang et al.

(2018) must have verified their results to prevent mistakes from occurring.

Limitations

1. Limitation: The researcher used RNA-seq data, which limited the embryonic stages that

the researcher analyzed and caused some figure reproductions to not look similar,

because they complemented the ChIP-seq data of the original figures.

2. Limitation: The time window available for the internship was from September 10th,

2018 to December 11th, 2018.



Delimitations

1. Delimitation: R programming was the choice of language for data visualization in this

research.

2. Delimitation: The scope of this research was limited to reproducing the data analysis and

visualization results of the published data from the study by Wang et al. (2018).

3. Delimitation: Only the TopHat and CuffDiff algorithms were in the scope to map, align,

and compare genetic datasets.

Summary

In order to collect the data, the researcher processed the data through HGCC and

reproduced each visualization by further processing data according to each visualization using R

programming and various functions inside packages. Then, the researcher compared each

reproduction with the associated figure and judged each reproduced figure according to the scale

presented in the previous section. Next, the researcher analyzed the trends of the figures and

decided whether the conclusion derived from the original matched the conclusion derived from

the reproductions.

Chapter 4: Findings

The purpose of this research was to reproduce the figures and bioinformatics analyses of

the figures in the research paper by Wang et al. (2018) in order to help the researcher

understand the differences between the techniques Wang et al. (2018) used and the techniques the

researcher used. This also helped the researcher understand the involvement of bioinformatics

analysis and R programming in genetics research. This chapter discusses the results of the

reproductions and bioinformatics analyses, whether they were similar to a satisfactory degree,

and an explanation in the case they were not similar.

Results

Below are the results of the figure reproductions and bioinformatics analyses:

Table 1

Data Visualization Testing

Test case # | Figure being reproduced | Expected figure | Reproduced figure | Degree of similarity (1-5)
1 | 1e | [image] | [image] | 3
2 | 2c | [image] | [image] | 3
3 | 2a | [image] | [image] | 4

Table 2

Bioinformatics Analysis Testing

Test case # | Associated figure | Analysis of the associated figure | Analysis of the reproduced figure | Conclusion
1 | 1e | A large number of H3K9me3 established marks are shown on the 4 cell, 8 cell, and morula stages. | Low numbers of stage specific genes are expressed in 4 cell, 8 cell, and morula stages. | H3K9me3 is a repressive marker, thus the higher the number of established H3K9me3 marks, the lower the number of genes specific to stages. Satisfactory match.
2 | 2c | The P values in the figure range from 5 to 10. | The P values in most of the tables range from 1 to 8. Only in 2 cell, 4 cell, and 8 cell downregulated GO analysis do the P values go above 10. | The reproduced figures' P values for GO terms do not match the range of P values for the expected graph, indicating different biological processes to be significant. Unsatisfactory match.
3 | 2a | Normalized H3K9me3 input ratio is high in MII, zygote, 2 cell, and 4 cell stages. | Weaker gene expression in 2 cell and 4 cell stages. | Strong H3K9me3 marks in the ChIP-seq original are met with weak specific gene expression in the next stage in the RNA-seq reproduction. Both complement each other. Satisfactory match.

Regarding the reproductions of the figures, the reproduction of figure 1e had a degree of

similarity of 3, indicating that a data limitation prevented a visually identical reproduction. The

gene ontology tables, which were the reproduction of figure 2c from the paper by Wang et al.

(2018), had a degree of similarity of 3 as well, also indicating that a data limitation prevented a

visually identical reproduction. The reproduction of figure 2a had a degree of similarity of 4.

The bioinformatics analysis reproduction results indicate that the data for both the

reproduction and the associated figure contain the same conclusion for figure 1e, making the

reproduction a satisfactory match. However, the reproductions of figure 2c contained P values

that are different in ranges compared to the associated figure, leading to a conclusion of an

unsatisfactory match of bioinformatics analyses. The reproduction of figure 2a showed a mostly

light blue to dark blue color in terms of gene expression in the 2 cell and 4 cell stages, which

indicated weaker gene expression in these stages.

Evaluation of Findings

Although the figure reproductions were not visually identical to the original figures, and

some of the bioinformatics analyses of the reproductions therefore differed as well, these differences

were expected due to a difference in the type of data the researcher and Wang et al. (2018) used.

Wang et al. (2018) used ChIP-seq data, which measures H3K9me3 marks, whereas the

researcher used RNA-seq data, which measures gene expression. In the context of biology, the

results of the reproductions, while they initially seemed contradictory to the original figures,

are complementary. H3K9me3 is a

repressive marker which stops genes from being expressed. The higher the concentration of

H3K9me3 marks, the fewer the number of genes that are expressed. Comparing figure 1e to the

RNA-seq reproduction visualization with this knowledge explains why the bioinformatics

analysis of the two leads to the same conclusion: the reproduction showed a lower number of

genes specific to a stage being expressed at points where there are many H3K9me3 marks. The

mismatch in gene ontology tables was also expected, as gene ontology analysis for H3K9me3

marks is bound to be different compared to genes specific to embryonic stages. In fact, due to the

difference between ChIP-seq and RNA-seq data, the gene ontology tables were expected to not

match. The heat maps were expected to match, and thus they reached the same conclusion based

on the information.

Summary

Without any knowledge of genetics, the figures and bioinformatics analysis compared to

their RNA-seq reproductions may seem to portray contradictory data due to a lack of visual

similarity. However, with knowledge of how H3K9me3 affects gene expression and knowledge

of the difference between RNA-seq and ChIP-seq data, the original figures and bioinformatics

analysis were complementary with their RNA-seq reproduction counterparts, as some data is

meant to be similar, while other data is not.



Chapter 5: Implications, Real World Connections, Recommendations, and Conclusions

This research aimed to see how well one can reproduce the bioinformatics analysis and

data visualization from the published data by Wang et al. (2018). Through this research, the

researcher learned how bioinformatics and programming are connected in genetics research. The

researcher also learned how to produce figures from data. The researcher reproduced the

figures and bioinformatics analysis first by processing raw data with algorithms such as TopHat,

Bowtie, and CuffDiff, and then further processing the data according to each figure specifically.

Afterward, the researcher analyzed the trends to provide an analysis. One of the main limitations

of this research is that the researcher conducted research on RNA-seq data, whereas the paper by

Wang et al. (2018) conducted research on ChIP-seq data. Researching on RNA-seq data limited

the embryonic stages the researcher could analyze compared to the ChIP-seq data and caused

some different visualizations and analyses due to a fundamental difference between RNA-seq

and ChIP-seq data. This chapter discusses how the research data affects the answers to the

applied subproblems and its connections to the real world.

Implications

The first applied subproblem discusses how well one can reproduce the graphs of

published data from the paper by Wang et al. (2018) using R. The researcher's hypothesis

mentioned that one can reproduce the visualizations to a satisfactory degree. Comparing the

reproductions with the original figures without factoring in genetic interpretation, only one of the

three reproductions was truly visually similar, although the correct data was graphed. The other two

reproductions were not visually similar to the original figures, and therefore, the figures from the

paper by Wang et al. (2018) cannot be reproduced to be visually similar from the RNA-seq data used. The second applied

subproblem discusses how well the bioinformatics analysis of the original data can match the

bioinformatics analysis of the reproduced data. The researcher's hypothesis claimed that the

bioinformatics analyses of both the original and the reproduction can lead to the same

conclusion. Based on the data, the conclusions of the original data did match those of the

reproductions, so the bioinformatics analysis of the reproduction can match those of the original

data. However, the gene ontology tables' conclusions did not match those of the gene ontology

table from the paper by Wang et al. (2018), but this was expected. A limitation that affects the

interpretation of the results is a constraint on the data the researcher worked with. The researcher

was assigned by their mentor to reproduce the figures using RNA-seq data. Working with ChIP-seq

data probably would have allowed for much greater visual and interpretational similarity. These

results describe the differences between RNA-seq and ChIP-seq data and demonstrate that

knowledge of genetics is essential to properly interpret research results in this study. This

research implies that in genetics, it can be important to verify the results with different data, such

as ChIP-seq and RNA-seq. This can help to strengthen studies, as the conclusions that are

derived as a result of genetic studies such as these are backed by more than one format of data.

Real World Connections

Although this study itself does not connect to or warrant any future research, this research

teaches how to make figures for research and how to interpret them in general by allowing the

researcher to take a hands-on approach to replicating figures and bioinformatics analysis of an

actual paper. This helps the researcher to grow by letting the researcher gain first-hand

experience in a genetics lab setting and by letting the researcher communicate with professionals

in the field.

Recommendations

For future research, the data collected implies that working with the same format of data

would lead to more visually accurate figure reproductions and possibly more accurate

bioinformatics analysis reproductions as well. For the purpose of accurately reproducing figures

and analysis of data, the results of this research suggest working with ChIP-seq data.

Conclusions

The bioinformatics analyses of the reproductions did conclude the same as the original

data. Not taking genetics knowledge into account, however, the figure reproductions were not

similar to the original figures. Thus, bioinformatics analysis can be successfully reproduced, but

the figures cannot. Due to the difference between RNA-seq and ChIP-seq data, however, slightly

different figures were expected, especially for gene ontology tables. Interpreting both the data

and the figures from both the original and the reproductions implies that some of the visual

differences were actual genetic complements.



References

Center for Computational Biology (2016). A spliced read mapper for RNA-Seq [Software].

Retrieved from

https://ccb.jhu.edu/software/tophat/manual.shtml

Data Visualization: What it is and why it matters. (n.d.). Retrieved from

https://www.sas.com/en_us/insights/big-data/data-visualization.html

Emory University School of Medicine Department of Genetics. (n.d.). Retrieved from

http://genetics.emory.edu/about/index.html

Krill, P. (June 30, 2015). Why R? The pros and cons of the R language. Retrieved from

https://www.infoworld.com/article/2940864/application-development/r-programming-

language-statistical-data-analysis.html

Mackenzie, R. J. (April 06, 2018). RNA-seq: Basic Applications and Protocol. Technology

Networks. Retrieved from

https://www.technologynetworks.com/genomics/articles/rna-seq-basics-applications-and-

protocol-299461

Neuwirth, E. (February 19, 2015). Package 'RColorBrewer' (Version 1.1-2) [Software PDF].

Comprehensive R Archive Network. Retrieved from

https://cran.r-project.org/web/packages/RColorBrewer/RColorBrewer.pdf

Opensource.com (2013). What is open source? Retrieved from

https://opensource.com/resources/what-open-source

Patil, N. (November 26, 2018). Reproducing the Bioinformatics Analysis and Data Visualization

of a Research Paper. Unpublished manuscript.



Porter, S. (January 28, 2007). Basics: How do you sequence a genome? part III, reads and

chromats. Retrieved from

https://digitalworldbiology.com/archive/basics-how-do-you-sequence-genome-part-iii-

reads-and-chromats

Quinlan Lab at University of Utah (December 08, 2017). Bedtools Documentation (Version

2.27.0) [Software PDF]. Retrieved from

https://media.readthedocs.org/pdf/bedtools/latest/bedtools.pdf

R Core Team. (July 2, 2018). R Language Definition (Version 3.5.1) [Software PDF].

Comprehensive R Archive Network. Retrieved from

https://cran.r-project.org/doc/manuals/r-release/R-lang.pdf

RNA Sequencing Analysis With TopHat [Software PDF]. (n.d.). Retrieved from

https://www.illumina.com/documents/products/technotes/RNASeqAnalysisTopHat.pdf

Sequence Read Archive (2018a). SRA Toolkit Documentation: Tool Prefetch [Software].

National Center for Biotechnology Information. Retrieved from

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=prefetch

Sequence Read Archive (2018b). SRA Toolkit Documentation: Tool fastq-dump [Software].

National Center for Biotechnology Information. Retrieved from

https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

Sievert, C., Parmer, C., Hocking, T., Chamberlain, S., Ram, K., Corvellec, M., & Despouy, P.

(July 20, 2018). Package 'plotly' (Version 4.8.0) [Software PDF]. Comprehensive R

Archive Network. Retrieved from

https://cran.r-project.org/web/packages/plotly/plotly.pdf

Stöppler, M. C. (2016). Definition of Bioinformatics. Retrieved from

https://www.medicinenet.com/script/main/art.asp?articlekey=16836

Trapnell Lab at University of Washington's Department of Genome Sciences (2017). Cufflinks

[Software]. Retrieved from

http://cole-trapnell-lab.github.io/cufflinks/cuffdiff/

Wang, C., Liu, X., Gao, Y., Yang, L., Li, C., Liu, W., . . . Gao, S. (2018). Reprogramming of

H3K9me3-dependent heterochromatin during mammalian embryo development. Nature

Cell Biology, 20(5), 620-631. doi:10.1038/s41556-018-0093-4

What is Chromatin Immunoprecipitation (ChIP)?. (2015). Retrieved from

http://www.chipseq.com/chromatin-immunoprecipitation/

Wickham, H. (2013). ggplot2 [Software]. Retrieved from

http://had.co.nz/ggplot2/

Wickham, H. (2015). R Packages [Software]. Retrieved from

http://r-pkgs.had.co.nz/intro.html

Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., & Woo, K. (July

3, 2018). Package 'ggplot2' (Version 3.0.0) [Software PDF]. Comprehensive R Archive

Network. Retrieved from

https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

Wickham, H., François, R., Henry, L., & Müller, K. (October 16, 2018). Package 'dplyr' (Version

0.7.7) [Software PDF]. Comprehensive R Archive Network. Retrieved from

https://cran.r-project.org/web/packages/dplyr/dplyr.pdf

Xie, Y., Vogt, A., Andrew, A., Zvoleff, A., Simon, A., Atkins, A., … Foster, Z. (February 20,

2018). Package 'knitr' (Version 1.20) [Software PDF]. Comprehensive R Archive

Network. Retrieved from

https://cran.r-project.org/web/packages/knitr/knitr.pdf

APPENDIX A

Figure Reproductions

Figure 2. RNA-seq reproduction of figure 1e. Number of upregulated and downregulated genes

specific to each embryonic stage.



Figure 3. Gene ontology analysis of combined ICM and morula stage downregulated genes.

Analysis from Gene Ontology Consortium determined the most relevant and unique categories

for genes.

Figure 4. Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage upregulated genes.

Analysis from Gene Ontology Consortium determined the most relevant and unique categories

for genes.

Figure 5. Gene ontology analysis of combined ICM and morula stage upregulated genes.

Analysis from Gene Ontology Consortium determined the most relevant and unique categories

for genes.

Figure 6. Gene ontology analysis of combined 2 cell, 4 cell, and 8 cell stage downregulated

genes. Analysis from Gene Ontology Consortium determined the most relevant and unique

categories for genes.



Figure 7. RNA-seq reproduction of heat map. Heat map plotting gene specific concentrations.
