Biological Data Analytics and Computing

Biological Sciences
1
Big Data
Module 0013 - Big Data
We live in the information age.

A deluge of data is being generated by humans. On the internet, for example,
we have social media interactions; in science, data collection processes; and
in commerce, business transactions and the adoption of new technologies.
Information technology has taken a very large stride compared with just a
few years ago. It has already transformed, and will continue to transform,
science, technology, business, and the arts. In short, I.T. has changed all
aspects of our lives.
Data science is a new discipline that will allow us to cope with the “big data
revolution” of modern times. The goal of this module is to give students a
sense of what big data is. It will also introduce why data analysis skills
are important in biology.
Because of the explosion in biological knowledge, many of the challenges in
the different fields are now challenges in computing and data handling.
Bioinformatics and data science, the application of computational techniques
to analyze information associated with biomolecules, slowly gained attention
and applications in the biological fields. Biological computing allows us to
analyze and make sense of data that traditional approaches cannot handle.
Data science in biology encompasses a wide range of subject areas, from
structural biology and genomics to gene expression studies.
In this module, we will survey the current state of the field, discuss the
main principles of big data analysis, and examine some of the studies being
conducted in biological computing.
At the end of this module, you will be able to:
1. state the roles of computational sciences in biological research
2. show the practical application of big data science
3. discuss the future trends in biological research
“Numbers have an important story to tell. They rely on you to give them a
voice.”
- Stephen Few
Course Module
The Power of Computing in Biology
Biological data are being produced at a phenomenal rate, thanks to the
advent of computers and the marriage of information technology and
biology. In the past decades, the compilation of data from hundreds and
thousands of patients and human subjects in basic research studies has
also generated a collection of biologically relevant databases. For example,
the GenBank repository of nucleic acid sequences and the SWISS-PROT
database of protein sequences have grown exponentially, doubling in size
roughly every year or two, since their establishment in the 1980s.
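To get a feel for what repeated doubling means, the arithmetic can be
sketched as follows. This is a minimal illustration only: the starting size of
1,000 entries is a made-up number, not a GenBank or SWISS-PROT figure, and
the function name and the doubling period parameter are assumptions
introduced here.

```python
def projected_size(initial_entries, years, doubling_period_years=1.0):
    """Size after `years` if the database doubles every `doubling_period_years`."""
    return initial_entries * 2 ** (years / doubling_period_years)


# Hypothetical starting point: 1,000 entries (illustrative, not a real figure).
start = 1_000
for year in (1, 5, 10):
    print(f"after {year:2d} year(s): {round(projected_size(start, year)):,} entries")
```

With yearly doubling, ten years multiplies the starting size by 2 to the
power of 10, or about a thousandfold. This is why storage and computing
capacity have to grow along with the data.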
Since the publication of the complete genome sequence of the bacterium
Haemophilus influenzae, complete sequences have been released for many
organisms, from bacteria to birds, frogs, and humans. The size of
these datasets ranges from 450 genes to over 100,000. Related projects on
gene expression, protein structures, and their interactions add to the
enormous quantity of information that we now have to deal with.
In the past decades, computers have become indispensable to biological
research. It is probably safe to say that we now depend on computers for
all our data analysis and storage. This is of practical importance because
computers can handle large quantities of data and probe the complex
dynamics among them. Bioinformatics, an emerging computational field in
biology, is defined as the application of computer science to understanding
the information related to and useful for biological systems. The union of
computer science and biology ushered in the information technology age in
biology.
The use of information science and computation helps, directly or
indirectly, in the deep understanding of biological systems. For example,
around the world, scientists and researchers produce over 100 gigabytes of
data a day. This volume can only be matched by developments in computer
technology, which allow for faster computation and better data storage. The
merging of the two fields has also revolutionized the methods for accessing,
storing, managing, and exchanging data from different parts of the world. In
short, these activities eliminate physical boundaries.
In 2015, for example, 1 trillion photos were taken and shared online. In
2017, nearly 90% of all photos taken in the world were shot with
smartphones.
What is Bioinformatics?
The aims of bioinformatics, including those anticipated for the future,
are threefold:
• First, to organize data in a way that maximizes access to existing
information.
• Second, to develop tools, methods, and freely available resources that
aid in data analysis. An example of this is the comparison of protein
sequences.
• Third, to use these tools to analyze the data and interpret the results
in a biologically meaningful way.
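The protein sequence comparison mentioned in the second aim can be sketched
in a few lines. The example below is a minimal illustration, assuming the two
sequences have already been aligned to the same length with "-" marking
gaps. The function name and the two peptide fragments are made up for this
example; real tools first compute the alignment itself, which is not shown
here.

```python
def percent_identity(seq_a, seq_b):
    """Percent of aligned, non-gap positions where the two residues match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip positions where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Two made-up, pre-aligned peptide fragments (one-letter amino acid codes).
fragment_1 = "MKTAYIAKQR"
fragment_2 = "MKTAHIAK-R"
print(f"identity: {percent_identity(fragment_1, fragment_2):.1f}%")
```

A high percent identity between two proteins is one simple signal that they
may be related, which is the kind of biologically meaningful similarity the
aims above refer to.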
For example, the largest yeast dataset used in biotech studies contains
measurements at many time points for thousands of genes. Genome-scale
datasets for other species include biochemical information on metabolic
pathways, regulatory networks, protein-protein interaction data from
different types of experiments, and systematic gene knockouts. These
approaches are used in applied fields such as medicine and food production.
Bacteria whose genomes have been sequenced may also be used for
biotechnology.
Figure 1. Above is a schematic outlining how scientists can use bioinformatics to aid rational drug
discovery. MLH1 is a human gene encoding a mismatch repair protein (mmr) situated on the short
arm of chromosome 3. Through linkage analysis and its similarity to mmr genes in mice, the gene
has been implicated in nonpolyposis colorectal cancer. Given the nucleotide sequence, the
probable amino acid sequence of the encoded protein can be determined using translation
software. Sequence search techniques can be used to find homologues in model organisms, and
based on sequence similarity, it is possible to model the structure of the human protein on
experimentally characterised structures. Finally, docking algorithms could design molecules that
could bind the model structure, leading the way for biochemical assays to test their biological
activity on the actual protein.
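The translation step described in the caption, going from a nucleotide
sequence to the probable amino acid sequence, can be sketched as follows.
This is a minimal illustration that assumes the standard genetic code and a
reading frame starting at the first base. The function name and the short
DNA string are made up for the example; the string is not the MLH1
sequence.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up example sequence (not MLH1): ATG GCC AAG TAA
print(translate("ATGGCCAAGTAA"))
```

The resulting amino acid string is what sequence search tools would then
compare against proteins from model organisms.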
Statistical Tools in Biological Studies
With so much big data now publicly available, computational methods have
become indispensable. Bioinformatics encompasses a wide range of subject
areas. In fact, wherever there are data, bioinformatics can be applied.
The work carried out by George Church and Sri Kosuri uses DNA as a digital
storage device, just like a portable USB drive. Strands of DNA were used to
store 96 bits of information using the DNA nucleotides.
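The idea of writing bits into nucleotides can be illustrated with a toy
encoder. The sketch below assumes a simple mapping of two bits per base
(00 to A, 01 to C, 10 to G, 11 to T). This mapping and the function names
are assumptions chosen for clarity, not necessarily the scheme used in the
work described above.

```python
# Hypothetical two-bits-per-base mapping, chosen only for illustration.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a nucleotide string, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Recover the original bytes from a nucleotide string."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"DNA"
strand = encode(message)
print(strand)          # nucleotide string for the 24 bits of the message
print(decode(strand))  # the round trip recovers the original bytes
```

Under this toy mapping, 96 bits (12 bytes) would occupy 48 nucleotides.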
Two principal approaches are fundamental to all studies involving
bioinformatics. The first compares and groups data according to
biologically meaningful similarities. The second analyzes one type of data
to infer observations for another type of data.
These approaches serve the main aims of using informatics in biology, that
is, of the marriage of the two fields. We aim to understand and organize the
information associated with biological molecules. In this way, we will be able
to examine individual systems in detail and possibly apply the findings to
larger populations. This will allow us to uncover and highlight unusual
features unique to some of the systems in focus.
The importance of statistical analysis in biological research is immense. At
the same time, statisticians are looking for better ways of looking at and
thinking about data interpretation and presentation, to help scientists avoid
missing important insights and information in their data.
Figure 2. The P value is widely treated as the “gold standard” for judging
statistical significance in biological research.
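One way to see what a P value measures is an exact permutation test, sketched
below. It asks: if the group labels were meaningless, how often would
shuffling them give a difference in means at least as large as the one
observed? This is a minimal illustration using only the Python standard
library. The function name and the measurements are made up, and this is
just one of several ways to obtain a P value.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation P value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Enumerate every way of relabelling n_a of the pooled values as group A.
    for indices in combinations(range(len(pooled)), n_a):
        chosen = set(indices)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Made-up measurements: three control and three treated samples.
control = [1.0, 2.0, 3.0]
treated = [4.0, 5.0, 6.0]
p = permutation_p_value(control, treated)
print(f"P = {p:.2f}")
```

With only three samples per group there are just 20 possible relabellings,
so even perfectly separated groups cannot give a P value below 0.1 with this
test. Sample size limits what any statistic can show.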
In the life sciences, there are typically two types of publication that use
information analysis, or so-called big data. Such publications draw on large
datasets and rely mostly or wholly on statistical evidence. This knowledge is
applicable to the medical sciences, clinical trials, and biotech studies.
All of these fields draw on cell and molecular biology, biochemistry, and
modern genetics.
Statistical evidence is important in the biological sciences. Computing sample
size, reporting outlying results, and other issues are also being monitored.
Cell and molecular biologists have the luxury of being able to probe their
experimental systems in multiple, independent ways, using this approach in
tandem with sophisticated statistics to draw significant conclusions.
Understanding basic statistics and data handling allows experimental
biologists to determine the relationships among their observations.
For example, the results of a single representative experiment cannot be
presented as valid evidence on their own. Only when N is at least 2 or 3
does it become clearer that the data are replicable and interpretable.
Statistical conclusions are important in biological research, particularly those
involving large datasets. Descriptive statistics, for instance, are necessary
and most applicable when there are too many data points to visualize easily.
Inferential statistics may be utilized to make it easier to interpret the results.
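The descriptive statistics mentioned above can be computed with the Python
standard library. The sketch below summarizes a small set of replicate
values by their mean, standard deviation (s.d.), and standard error of the
mean (s.e.m.). The function name and the measurements are made up for
illustration.

```python
from math import sqrt
from statistics import mean, stdev


def describe(values):
    """Return N, mean, sample s.d., and s.e.m. for a list of measurements."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate variability")
    sd = stdev(values)   # sample standard deviation (n - 1 in the denominator)
    sem = sd / sqrt(n)   # standard error of the mean
    return {"n": n, "mean": mean(values), "sd": sd, "sem": sem}


# Made-up measurements from three independent experiments.
measurements = [4.8, 5.1, 5.4]
summary = describe(measurements)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

The function refuses a single value on purpose: with N = 1 there is no way
to estimate variability, which matches the point that one representative
experiment cannot stand on its own.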
Future Trends in Biological Research
With the advent of information technology, the life sciences have become a
data-driven field of science. New issues emerge and will continue to
challenge the established ways of doing basic and applied research and of
applying technology. In the new era of Big Data, the life sciences will
develop rapidly. Algorithms for data, and knowledge of how to work with
them, will also become increasingly available to the public as more and more
databases are built.
When nucleotide sequencers became able to generate sequence data for us to
analyze, big data started to govern the life sciences. With this, new
integrative and radical ways of doing research will be proposed.
Electronic infrastructures will play a key role in shaping future
research directions in the life sciences and biological research.
A good example of Big Data complexity is the range of information resources
that can be combined to elucidate how biomolecules interact with each other.
Multidisciplinary approaches are the basis for discovering such
connections. Discoveries of this scale can lead to the development of
new drugs. Structural biology and bioinformatics are key tools providing
insight into interactions at the molecular level. However, the rapid expansion
of available data on the structure, chemistry, and dynamics of biomolecules
poses new technical challenges. For example, we may need more cloud
storage and other forms of data exchange that ensure data security.
Increasing Importance of Connections for Mathematical Science Research
The increased interconnectivity of the biomedical and mathematical sciences
communities has led to a significant increase in interdisciplinary research
and has opened up fields that were previously unexplored or considered
unconnected.
Innovation of modes for scholarly interactions and professional growth

The Internet and the World Wide Web have broken down communication and
cultural barriers in science. Software tools and data sets are now publicly
available, and dissemination of research results has never been easier.
Informal ideas are also shared freely and liberally through blogs and other
venues. These technologies have changed how collaboration is done across
fields.
Reintegrating nature: systems biology

Biology is a single connected science focused on scientific approaches to
understanding biological phenomena. Each study is characterized by its focus
of interest and a biological question. By limiting the effect of other
elements on the problem of interest, biologists sharpen our understanding of
that problem. Continued learning about a specific biological system is
nevertheless necessary to make connections with other systems and
structures. Ultimately, it is impossible to isolate life from the rest of our
universe. How the parts connect is the realm of systems biology.
Figure 3. Bioinformatics and systems biology are expected to connect the areas of biology.
What have we learned?

In this module we discussed the roles of computational sciences in biological
research and showed several examples of the practical application of big data
science in biological research. Finally, we discussed the future trends in
biological research and the future directions of big data research.
Glossary
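Standard deviation (s.d.) – The typical difference between a value and the
mean value
Standard error of the mean (s.e.m.) – An estimate of how variable the
means would be if the experiment were repeated many times
Confidence interval – An interval within which the population mean is
expected to lie with a stated level of confidence (commonly 95%)
Independent data – Values from separate experiments
Replicate data – Values from repeated measurements within the same
experiment, which are therefore not independent
Sampling error – Variation caused by sampling part of a population rather
than measuring the whole population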
