# INTRODUCTION TO SCIENTIFIC COMPUTING

CONCLUSION WORK

August 7, 2010

## What is scientific computing?

Scientific computing applies the computational power of available computer technology
to solving problems in various fields of science, such as chemistry, physics, astronomy,
pharmacy, biology and mechanics, but also economics, linguistics and others. In order to better
understand natural phenomena, scientists can use computers in a variety of ways. One
of the major uses of computers is running simulations of events occurring in the world
around us. Data from the results of simulations, but also from other sources, can be
displayed on a computer in such a way that humans can see patterns and understand the
underlying process. Methods from artificial intelligence, a field of computer science, can be
used to automatically classify or even predict properties of objects based on unknown and
possibly complex rules. Simulations together with mathematical optimization methods
can be used to find the best possible solution of practical real-world problems. Computers
can easily, quickly and accurately process large datasets and work with complex mathematical
equations. In the past, these tasks were error-prone and took a lot of time.
Now they are a matter of minutes. Even the measuring devices which scientists use now
contain computers. Some of them, such as satellites, have limited ability to communicate
with the operators and therefore must do some work automatically. For these reasons,
computers are an indispensable tool in today’s science.

## Methods of scientific computing

In this section we take a closer look at the various methods.

Modelling
In this context, a model is a simplified version of an actual phenomenon, process or object
used for study and experimentation. Such models can be physical (for example the
well-known small models of buildings in architecture) or virtual/mathematical. Modelling is
used for this purpose not only in science, but also in industry, education or art. Scientific
computing usually deals with mathematical models of natural phenomena represented
in a computer. There are 3 basic types of models used in this field: simulation models,
mathematical models and optimization models.

Mathematical model
In the simpler cases, it’s possible to construct a closed form equation which models the
process we are studying. Such equations have the advantage of being easy to evaluate
for any given parameter and they can also be easily studied from multiple viewpoints
or integrated into other models. For more complex systems, it’s usually not possible to
obtain such a representation.
For example, a model of the position $s$ and velocity $v$ of a freely falling object is
$$v(t) = g t$$
$$s(t) = \tfrac{1}{2} g t^2$$
assuming a uniform gravitational field, no air resistance, zero initial speed and the origin of the
coordinate system placed at the position of the object at $t = 0$.
There are also other kinds of mathematical models which are closer to simulation or
optimization.
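The advantage of a closed-form model, being easy to evaluate for any parameter, can be shown with a minimal Python sketch of the free-fall equations. The value of `G` and the function names are assumptions made for this illustration only.

```python
G = 9.81  # assumed gravitational acceleration near the Earth's surface, in m/s^2


def velocity(t, g=G):
    """Velocity v(t) = g*t of a body dropped from rest."""
    return g * t


def position(t, g=G):
    """Distance fallen s(t) = 1/2 * g * t^2, measured from the release point."""
    return 0.5 * g * t ** 2


if __name__ == "__main__":
    # Evaluate the model at a few arbitrary times.
    for t in (0.0, 1.0, 2.0):
        print(f"t = {t:.1f} s  v = {velocity(t):.2f} m/s  s = {position(t):.2f} m")
```

Because the model is just a pair of functions, it can be evaluated at any time directly, without stepping through earlier moments.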

Simulations
Simulations are among the most important tools computers provide in this field, because
they are highly versatile and not too difficult to implement. To create a simulation, scientists
write a computer model of the natural process they want to study, in which they describe
the simple behaviour of the actors in the model (such as atoms, stars or ants), specify how
many actors there are and let the computer simulate their behaviour step by
step. This allows them to watch and study processes which are otherwise hard, expensive
or even impossible to watch in reality. There are many examples of such phenomena. The
"hard" group includes behaviour in the microworld, which is only sometimes possible to
observe, and the behaviour of materials (gases, sands and others), whereas the "impossible" group
includes the evolution of the universe and collisions of galaxies.
Simulations are also used for prediction. For example, when a comet is approaching the Earth,
its trajectory is calculated by simulating the movement of the comet.
Simulations always use only a model of the studied process, which means that details not
important for our current point of view are not considered. This is necessary in order to be
able to actually create the model and run it on available computers in a reasonable time.
Nevertheless, there are always big enough models to make use of the biggest and fastest
computers. Currently clusters of thousands of processors are being used for scientific
simulations.
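The step-by-step idea can be sketched with a very small simulation of a single falling body. This is only an illustration under assumed values (time step, gravitational acceleration) and uses the simplest integration scheme, explicit Euler; real simulations have many actors and use more careful schemes.

```python
def simulate_fall(duration, dt=0.001, g=9.81):
    """Simulate a body falling from rest by stepping time forward (explicit Euler)."""
    v, s = 0.0, 0.0
    steps = round(duration / dt)
    for _ in range(steps):
        s += v * dt   # move the body using its current velocity
        v += g * dt   # then let gravity change the velocity
    return s, v


if __name__ == "__main__":
    distance, speed = simulate_fall(2.0)
    print(f"simulated: s = {distance:.3f} m, v = {speed:.3f} m/s")
    print(f"closed form: s = {0.5 * 9.81 * 2.0 ** 2:.3f} m")
```

The simulated distance differs slightly from the closed-form value, and the difference shrinks as `dt` gets smaller. This is the usual trade-off between accuracy and running time.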

Visualization
The human brain has an incredible ability to find patterns in what it sees, so the
benefit of proper visualization of data should not be underrated. Computers can help
a great deal here as well. Thanks to computer games, the field of computer graphics is well
developed and cheap dedicated hardware is available. The interesting
and sometimes beautiful results are also used for popularizing science. Visualization can also
show the state of a simulation on the fly, which allows
the operator to check whether it is running as expected, or even to steer it towards a particularly
interesting part of the problem which was not foreseen at the beginning.

Prediction
The field of artificial intelligence has developed several methods for machine learning.
Machine learning is a process in which the parameters of a mathematical model executed
by the computer are automatically tuned based on the input data. This phase is
called learning. If it is successful, the model’s parameters afterwards reflect the
underlying pattern in the data and we can use the model to predict unknown features of new data.
For example, we might have a dataset of 100 chemical compounds together with the
information whether or not they are active in interaction with one specific target. We can
build an artificial neural network and train it with these compounds and their activity.
When training is done, the neural network can be used to predict the activity of new compounds
based on their similarity to the compounds "it knows". Another example is the use of
machine learning for automatic classification of insects into species based on an insect’s
features or its photo. This can help with routine classification and save a lot of tedious
work.

Optimization
In the real world we often face a situation in which we can choose among many alternatives
and know how to calculate the profit of each one, but don’t know
which brings the highest profit, because there are just too many of them. This is called
an optimization problem and it can occur both in everyday life (in business, for instance)
and as a subproblem in computing problems.
For example, finding the stable configuration of a molecule is an optimization problem,
because each atom’s position is a variable and we are able to (approximately) calculate
the potential energy of each configuration. There are, however, usually too many configurations
to try them all out, so we need a smarter way of finding the best
one.
Over the last several decades, many approaches have been developed to tackle
this problem. Optimization problems can also be divided into several classes, some of
which are easier to solve and some harder.
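To make the molecular example concrete, here is a minimal sketch that searches for the lowest-energy distance between two atoms instead of trying every possible distance. The Lennard-Jones form in reduced units is used as a stand-in energy function, and the step-halving search is just one simple local method chosen for illustration; neither is claimed to be what a real chemistry package does.

```python
def lennard_jones(r):
    """Toy potential energy of two atoms at distance r (reduced units)."""
    return 4.0 * (r ** -12 - r ** -6)


def minimise_1d(energy, start, step=0.5, tolerance=1e-6):
    """Step-halving local search: move while the energy drops, else shrink the step."""
    x = start
    while step > tolerance:
        # Only consider positive distances.
        candidates = [c for c in (x - step, x + step) if c > 0]
        best = min(candidates, key=energy)
        if energy(best) < energy(x):
            x = best
        else:
            step /= 2.0
    return x


if __name__ == "__main__":
    r_min = minimise_1d(lennard_jones, start=2.0)
    print(f"lowest energy at r = {r_min:.4f}, E = {lennard_jones(r_min):.4f}")
```

For this potential the exact minimum is known to lie at r = 2^(1/6), so the search result can be checked. With many atoms the number of variables grows and such a simple search is no longer sufficient.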

Data processing
In some fields, measuring devices can supply large amounts of data about some natural
phenomena and we need to process these data in order to understand them. This is yet
another application of computers in science. Whether we need to automate methods
which have been used for years or devise completely new methods, computers can help
significantly.
In the simpler cases, it’s just the ability to store data and retrieve them based on queries about
any feature, or to find correlations automatically. For example, bioinformatics is the field of applying
computers to process biological data. It works with data such as DNA sequences, protein
sequences, gene expression or even medical publications. For biological sequences, methods
have been developed which make it possible to align and compare them, find common motifs
and, based on that, build phylogenetic trees (trees of ancestry) of species. By comparing
the sets of expressed genes in healthy and unhealthy (cancer-suffering, for example)
patients, more can be learned about the specific disease. Protein folding, which means
determining the 3D structure of a protein based on its sequence of amino acids, is a vital
task for understanding the protein’s function in the organism and for drug discovery. Currently
computers are not powerful enough to solve this task without additional data, but methods
which use the known folding of similar proteins are being successfully applied today. In
bioinformatics and medicine, the number of publications per year is very large and the
human body is extremely complex. Semantic methods, which have their origin in philosophy,
are being used to categorise and search the publications and information about the
human body.
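As an illustration of comparing biological sequences, the following sketch computes the score of the best global alignment of two short sequences with the classic Needleman-Wunsch dynamic programming scheme. The scoring values (match, mismatch, gap) are assumptions chosen for the example; practical tools use substitution matrices and more refined gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of sequences a and b (Needleman-Wunsch)."""
    # previous[j] holds the best score for aligning a[:i-1] with b[:j]
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Either align the two letters, or insert a gap in one of the sequences.
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(alignment_score("ACGT", "ACGT"))
    print(alignment_score("ACGT", "AGT"))
```

A higher score means the sequences are more similar. Pairwise scores like this one are a typical starting point for building trees of related sequences.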
And while bioinformatics deals with data on the order of billions of DNA base pairs,
there are even bigger challenges. The recently launched Large Hadron Collider employs
gigantic detectors which will observe the debris after particle collisions. The data from
these detectors will be processed in the hope that evidence for the existence of the so-called Higgs boson
will be found. But the volume of data will be tremendous and cannot be processed even
by all the computers physicists have available at the moment, so it must first be cut down
by simple filters, which should discard the uninteresting part. And even after this reduction,
there will be a lot of work.

Measuring devices
Computers are ubiquitous nowadays and that includes data-collecting devices as well,
especially those which need to be autonomous to some extent or must process the data
first.
For example, telescopes in Earth orbit must compress the data first, because
connection bandwidth and time are limited. And explorers on Mars have to possess a certain
degree of autonomous behaviour, because the time it takes for a signal to travel the large
distance to an operator on Earth and back is long and a response may come too late.

Tools
Aside from specialised software developed for a specific task, there are some programs
which have a wider range of application. Many of them are free of charge or even open
source and some of them are commercial. We’ll give a short overview of the best-known
general-purpose programs.

Mathematical software

Nowadays computers can not only add and multiply numbers, but can also work with
symbolic mathematical objects such as equations, functions and so on. This means that
they can speed up mathematical work which cannot be done with numbers alone and
help avoid mistakes people could make. Programs which can manipulate symbolic
objects are called Computer Algebra Systems.
Another kind of mathematical software is aimed more at the numeric part, because, after
all, that’s the area where computers excel and it is also very useful. Both kinds of programs
can usually present the objects they work with graphically, for example as a function
graph or a color plot of a matrix.
Some well-known mathematical software packages include:

Mathematica (commercial program developed by Wolfram Research)
http://www.wolfram.com
A comprehensive mathematical toolbox which includes a computer algebra
system, a custom programming language, powerful plotting and data
visualization capabilities, efficient handling of large data sets, statistics
libraries and more. Wolfram Research provides data sets from several
fields, some of them (such as current weather) updated in real time.
Mathematica has applications in engineering, biology and medicine,
finance, statistics, chemistry and other fields [1].

MATLAB (commercial program developed by The MathWorks)
http://www.mathworks.com/products/matlab/
MATLAB is a computer system for numerical calculations and data processing
which is controlled using a custom programming language.
Aside from data manipulation, the language makes it possible to create GUIs, connect
to other languages, visualize data and more. Libraries are available
for MATLAB for many fields, such as bioinformatics, optimization, image
and signal processing, economics, distributed computing etc.

R (open source)
http://www.r-project.org/
R is a programming language and a statistical environment. The software
provides an extensive library of statistical and plotting functions.
R is widely used for statistical data analysis and statistical software
development [2].

Python, SciPy, NumPy, matplotlib (open source)
http://www.python.org
http://www.scipy.org
http://matplotlib.sourceforge.net
Python is a well-known general-purpose dynamic programming language
used in many fields. SciPy and NumPy are numerical computing
libraries for Python which contain functions for working with large
datasets, optimization, linear algebra, signal and image processing and
more. matplotlib is a powerful plotting library for Python which complements
SciPy and NumPy. Together they provide a flexible alternative
to programs such as MATLAB.

GNU Octave (open source)
http://www.gnu.org/software/octave/
Octave is a numerical computing program. It has a custom programming
language very similar to MATLAB’s and Octave itself is mostly compatible
with MATLAB. By itself it does not support plotting and visualization,
but it can be used together with Gnuplot.

## Distributed and parallel computing

Due to the memory and processor demands of many scientific computing tasks, distributed
computing is an inseparable facet of scientific computing. Faster and faster
supercomputers are being built, many of which are used for science.
In the past, specialised hardware used to be built for specific computing demands. This
is mostly no longer the case today, because it’s more cost-efficient to use the same
hardware which is being produced en masse for the commercial sector. That means that
today’s supercomputers are built using the same computer architecture and the same
type of processors (only the fastest version) as consumer computers. Most
of them are structured as clusters: more or less standard computers connected by a
network. Programs for clusters are written mostly in C, C++ or Fortran. One notable piece
of software for cluster computing is MPI, an open standard for message passing. A program
running on a cluster consists of independently running processes which communicate by
passing messages over the network and MPI defines a standard interface and a set of
functions for this task, allowing the programmer to concentrate on the computation
rather than the technical details of network communication.
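The message-passing style can be imitated on a single machine. The sketch below is only an analogy and does not use MPI: it relies on threads and queues from Python’s standard library, and the function names are made up for this example. Each worker sees nothing but the messages it receives and sends, which is the programming model MPI provides across the nodes of a cluster.

```python
import queue
import threading


def worker(inbox, outbox):
    """Receive chunks of numbers as messages and send back their partial sums."""
    while True:
        chunk = inbox.get()
        if chunk is None:      # a None message tells the worker to stop
            break
        outbox.put(sum(chunk))


def parallel_sum(numbers, n_workers=4):
    """Split the data, pass the chunks to the workers, then combine the replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    chunks = [numbers[i::n_workers] for i in range(n_workers)]
    for chunk in chunks:
        inbox.put(chunk)
    for _ in threads:
        inbox.put(None)        # one stop message per worker
    total = sum(outbox.get() for _ in chunks)
    for thread in threads:
        thread.join()
    return total


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101))))
```

In a real cluster program the workers would be separate processes on different machines and the queues would be replaced by MPI send and receive calls. Python threads also do not speed up pure computation, so this sketch shows the communication pattern only.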
For example, the fastest supercomputer at the moment is Jaguar [3], a Cray XT5 system
with more than 200,000 Opteron processor cores, the same kind of processor which can be
found in high-end consumer desktop PCs or servers.
A significant new development in the field of high performance computing is the use of
chips which were originally designed purely as game graphics accelerators. These chips
are specialized for graphics, but due to continuous high demand over more than 10 years,
they have developed quickly. Today they allow game developers to program the materials
(color, shininess, transparency, bumpiness) of game objects, which means that the graphics
accelerators have become programmable. These accelerators are highly parallel, containing
hundreds of small processing cores. The cores are not universal and cannot work independently
as CPU cores can, but for some tasks such a card can bring a hundredfold
acceleration. Coupled with relatively low prices, this makes them a perfect tool for certain high
performance jobs. Programs using these accelerators must be specifically written for a
given architecture. There are currently two major interfaces/programming environments.
CUDA is a proprietary interface to Nvidia’s graphics cards and at present dominates
the field. The new standard, OpenCL, is open and has already been embraced by both
AMD and Nvidia.

## Pharmacoinformatics

This section will describe the topic of Pharmacoinformatics. There were two lectures
given by Gerhard Ecker in this course and I attended his course Computational Life Sciences
as well [4].

What is it?
Just as bioinformatics employs computers to process biological data, pharmacoinformatics
is the field of applying computers in pharmacy, specifically in drug design. Drug design is a
big business, but also a very tough one. It usually takes more than 10 years and millions
of € to bring a new drug to market, mostly due to the large amount of mandatory testing.
Computers are used in all stages of drug design, but mainly in the initial phases of finding
the best chemical compound. It needs to be highly active, must have no side effects and
should have minimal interactions with other drugs.

## Chemical laboratory in a computer

In order to be able to process chemical compounds in a computer, there must be a proper
way of representing them. For some algorithms, simpler representations are sufficient; for
others we need the complete 3D structure (positions of atoms) and the bonds between them. In the
group of simple representations we can find SMILES strings. An example of one is CC(=O)C, a
simple chain with a double-bonded oxygen in the middle. These strings have the advantage
of simplicity: they can be used as keys for database searching, as a way of entering
molecules into the computer, etc. But for many applications the complete structure is
required. We need to know the position of every atom and all its bonds (covalent and
other).
In certain applications, such as activity prediction, we need to reduce the level of detail about
the given compounds and generalize, to find features which describe a group of similar
compounds. So far, over 2000 such descriptors have been used in the field. Among the
simplest descriptors we can find the total number of C atoms, the total number of CH3
groups and so on. Another important descriptor is lipophilicity, the "friendliness" of the
compound with lipids. This usually has a big influence on biological activity, because cell
membranes are made of lipids. Simple descriptors capture only a little of the properties of
a molecule, so they are either used together with many other simple descriptors,
or replaced by more complicated descriptors such as fingerprints or autocorrelation.
Fingerprinting uses the structure of the compound to create a short vector of numbers
which is expected to be similar for compounds of similar structure. Autocorrelation is
similar, but applies a simple descriptor as well as the structure.
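The simplest descriptors mentioned above, such as atom counts, can be read directly from a SMILES string. The following is a rough sketch for a restricted subset of SMILES only: it does not handle atoms written in square brackets, and the descriptor names are hypothetical. A real application would use a cheminformatics toolkit instead.

```python
import re
from collections import Counter

# Matches two-letter halogens first, then single-letter organic-subset atoms.
# Lowercase letters denote aromatic atoms in SMILES.
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def atom_counts(smiles):
    """Count heavy atoms per element in a simple SMILES string (no bracket atoms)."""
    return Counter(symbol.capitalize() for symbol in ATOM_PATTERN.findall(smiles))


def simple_descriptors(smiles):
    """Return a small set of hypothetical descriptors for one compound."""
    counts = atom_counts(smiles)
    return {
        "carbon_atoms": counts["C"],
        "oxygen_atoms": counts["O"],
        "heavy_atoms": sum(counts.values()),
    }


if __name__ == "__main__":
    print(simple_descriptors("CC(=O)C"))
```

For the example string CC(=O)C this gives three carbon atoms and one oxygen atom. Such a handful of numbers says little about a molecule on its own, which is why many descriptors are combined in practice.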
When we have a 3D structure of the compound, we can build a VolSurf descriptor. It
stands apart from the others, because it’s not just a few numbers and it’s not easy to calculate
either. The molecule is put in a virtual box and interaction forces with some kind of atom
are calculated at regularly spaced points. To make the result usable as a descriptor, a
statistical method such as principal component analysis needs to be used to extract the
most important features.

## Determining 3D structure of molecules

For the more advanced techniques in pharmacoinformatics the 3D structure of the molecules
needs to be known. With small compounds, which is usually the case for drugs,
it can be generated by a computer just from its 2D structure (or a SMILES string). To
find the 3D conformation, we need to solve an optimization problem where the variables
are torsion angles on the bonds and the objective function is the system’s energy. This
problem is computationally tractable for small molecules. It is solved either using traditional
optimization methods such as stochastic search, systematic search or evolutionary
algorithms, or by using known 3D structures of similar compounds.
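A systematic search over a single torsion angle can be sketched in a few lines. The energy function below is a hypothetical toy potential invented for this illustration, not a real force field. It is constructed so that its lowest value lies at 180 degrees.

```python
import math


def torsion_energy(angle_deg):
    """Hypothetical torsional energy (arbitrary units) with its minimum at 180 degrees."""
    phi = math.radians(angle_deg)
    return (1.0 + math.cos(3.0 * phi)) + 0.5 * (1.0 + math.cos(phi))


def systematic_search(energy, step_deg=1):
    """Evaluate the energy on a regular grid of angles and return the best angle."""
    return min(range(0, 360, step_deg), key=energy)


if __name__ == "__main__":
    best = systematic_search(torsion_energy)
    print(f"lowest energy at {best} degrees, E = {torsion_energy(best):.4f}")
```

With one angle, 360 evaluations are enough. With several rotatable bonds the grid grows exponentially, which is why stochastic and evolutionary methods are used as well.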
Proteins are very large molecules, on the order of hundreds or thousands of atoms. That
means that finding their structure ab initio, from the ground up, is generally not feasible. It can
be determined experimentally, using X-ray crystallography or NMR spectroscopy. But
experimental methods are sometimes rather expensive and difficult and for some proteins
impossible to carry out. With computers, one way to determine a protein’s structure is
through homology modelling. This method calculates the 3D structure based on known
structures of other proteins which have similar amino acid sequences.

Activity prediction
One of the problems pharmacy faces is finding the best chemical compound for a given
target. The number of possible chemical compounds is vast, because atoms can
be combined in almost any way. Pharmacologists have developed methods such as
High Throughput Screening, which are capable of testing up to 10,000 compounds a
day, but even this is too slow and expensive. Predicting the activity of virtual compounds
using computers can therefore be very helpful. There are several ways of going about
that.

QSAR

QSAR stands for Quantitative Structure-Activity Relationships. As the name implies, it tries
to capture the relationship between a compound’s structure and its biological activity. More
specifically, a QSAR model is a function of several molecular descriptors which gives the predicted
activity. To find the function, statistical methods such as regression analysis are used
on a set of known (compound, activity) pairs. It is important to observe several rules
to get meaningful results: the training set should be large enough with respect to the
number of descriptors, one should not extrapolate too far away from the training set and
the biological system in which the activity of the training set is measured should be as simple
(and understandable) as possible.
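In its simplest form, the regression step of QSAR can be sketched with a single descriptor and ordinary least squares. The training values below are made up for illustration, and real QSAR models use several descriptors at once.

```python
def fit_line(descriptor, activity):
    """Ordinary least squares for activity = slope * descriptor + intercept."""
    n = len(descriptor)
    mean_x = sum(descriptor) / n
    mean_y = sum(activity) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(descriptor, activity))
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training set: one descriptor value and one measured activity per compound.
log_p = [1.0, 2.0, 3.0, 4.0]
measured_activity = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_line(log_p, measured_activity)
# Predict only inside the range covered by the training set (no extrapolation).
predicted = slope * 2.5 + intercept

if __name__ == "__main__":
    print(f"activity = {slope:.2f} * logP + {intercept:.2f}; prediction at 2.5: {predicted:.2f}")
```

The prediction is made at a descriptor value between the training points, in line with the rule about not extrapolating far from the training set.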

Pharmacophore

Another approach is based on the insight that the precise 3D structure is not as important
for biological activity as the "high level" features such as hydrophobicity, lipophilicity,
hydrogen bond donor/acceptor or ionizability which are generated by the structure. These
features together with their 3D placement can be used to describe a group of chemical
compounds which have the same properties. Then a pharmacophore model built from a
set of known active compounds can be used to find more active compounds in a database
of virtual chemical compounds.

Machine learning

Machine learning is a subfield of artificial intelligence which develops mathematical
models capable of capturing the most important features in a set of data (the training set) and
then predicting properties of new items based on this generalized model. There are many
such models and some of them, such as neural networks, are already rather well known. Let’s
begin with them.

## Artificial Neural Networks

This model is inspired by nature, which has developed one rather good system for learning,
the brain. It consists of billions of neurons and every one of them is connected to
thousands of other cells. These connections can store information and also think, or at
least that’s what is believed.
An artificial neuron is just a virtual object simulated by the computer. It has a number
of inputs from other neurons and every input has a weight. The neuron calculates a
weighted sum of its inputs (each input is a real number) and generates a single output
value which is a function of this weighted sum.
This function is called the transfer function and it can be a simple linear function, a step function
or a continuous sigmoid function. Each neuron has only one output value, but it can be
propagated to many other neurons. An artificial neural network is usually structured in
layers and each neuron from one layer is connected to each neuron in the next layer.
ANNs are a form of supervised learning, because for every input we have an expected
output and the network adjusts itself to match the expected output. To train the network,
the data must be in the form of pairs (input, expected output). For each of these pairs, the
input data is fed into the first layer of neurons and propagated using the rules described above,
and the output from the last layer of the network is compared with the expected output.
The weights of the edges are then slightly adjusted to come closer to the expected output.
After going through the entire training set, the network should have the strongest weights
for the most commonly occurring patterns in the training set and we can start feeding in
data for which we need a prediction.
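The neuron and the weight adjustment described above can be sketched for the smallest possible case: a single neuron with a sigmoid transfer function, trained on a toy data set. This is a simplification for illustration, not a layered network. The training set (the logical OR function), the learning rate and the number of passes are all assumptions.

```python
import math


def neuron_output(weights, bias, inputs):
    """Weighted sum of the inputs passed through a sigmoid transfer function."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-total))


def train_neuron(samples, epochs=2000, rate=0.5):
    """Repeatedly nudge the weights so the output moves towards the expected value."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, expected in samples:
            error = expected - neuron_output(weights, bias, inputs)
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Toy training set of (input, expected output) pairs: the logical OR function.
OR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

if __name__ == "__main__":
    weights, bias = train_neuron(OR_SAMPLES)
    for inputs, expected in OR_SAMPLES:
        print(inputs, expected, round(neuron_output(weights, bias, inputs), 3))
```

Note that the sketch passes over the training set many times. A single neuron can only learn patterns that are separable by a line, which is why real networks combine many neurons in layers.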
Naturally, neural networks also have some weaknesses. Some of them can be overcome
by using more complicated methods, some of them can’t. Most importantly, a neural
network doesn’t explain why it gives the results it does. The information captured by the weights
cannot be easily interpreted. This is in contrast with some other methods, such as QSAR
or decision trees, in which the generalised pattern can be easily seen.

## Self Organizing Maps

This model can be used to reduce the dimensionality of data by finding the most important
features. It is a 3-dimensional table which takes vectors as input. It is a form of unsupervised
learning, so the training data does not have an "expected value" component.
Training is performed by sequentially inserting the input vectors into the table. Each vector
is compared with all vectors represented by columns of the table and the column with
the closest values is selected. This column and nearby columns are slightly adjusted to
become more similar to the input vector. We expect that at the end
of training, similar input vectors will occupy nearby columns. Then we can simply take a
map of the columns and see which items are close to each other. For instance, biologically
active compounds will be located in columns near known active compounds.
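The training procedure can be sketched with a deliberately tiny map. To keep the example short, the map here is a single row of four units rather than a full table, the data are two made-up groups of two-dimensional vectors, and the learning rate and neighbourhood schedule are assumptions.

```python
def best_unit(weights, vector):
    """Index of the unit whose weight vector is closest to the input vector."""
    def squared_distance(unit):
        return sum((w - x) ** 2 for w, x in zip(unit, vector))
    return min(range(len(weights)), key=lambda i: squared_distance(weights[i]))


def train_som(weights, data, epochs=20, start_rate=0.3):
    """Train a one-dimensional map in place; neighbours are adjacent list indices."""
    for epoch in range(epochs):
        rate = start_rate * (1.0 - epoch / epochs)   # learning rate decays towards zero
        radius = 1 if epoch == 0 else 0              # neighbours move only in the first pass
        for vector in data:
            winner = best_unit(weights, vector)
            for i in range(max(0, winner - radius), min(len(weights), winner + radius + 1)):
                influence = rate if i == winner else rate / 2.0
                weights[i] = [w + influence * (x - w) for w, x in zip(weights[i], vector)]
    return weights


# Hypothetical data: two groups of similar two-dimensional input vectors.
group_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
group_b = [(1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
som = train_som([[0.2, 0.2], [0.4, 0.4], [0.6, 0.6], [0.8, 0.8]], group_a + group_b)

if __name__ == "__main__":
    for vector in group_a + group_b:
        print(vector, "-> unit", best_unit(som, vector))
```

After training, vectors from the same group select the same unit and the two groups end up at opposite ends of the map. This is the behaviour the method relies on when locating new compounds near known active ones.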

## Decision trees and others

Decision trees can be used to classify input items into groups. Items in the training set
contain an "expected group" component which means that decision trees are a form of
supervised learning, just like artificial neural networks. The model is represented by a
tree whose nodes contain predicates and their outgoing edges correspond to the possible
values of the predicates. In the training phase the tree is built so that in each node the
predicates split the input data in the biggest possible groups, which should lead to a rather
compact tree. When the tree is complete, a new item can be classified simply by going
through the tree and following the edges which correspond to our input item.
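Classifying an item with a finished tree can be sketched as follows. The tree is written by hand for illustration rather than learned from data, and its predicates, thresholds and class names are hypothetical.

```python
# A hand-written tree for illustration; predicates and class names are hypothetical.
# Inner nodes are (feature, threshold, subtree_if_below, subtree_otherwise) tuples,
# leaves are plain class labels.
TREE = ("log_p", 3.0,
        ("h_bond_donors", 2, "inactive", "active"),
        "active")


def classify(tree, item):
    """Walk from the root to a leaf, following the edge that matches the item."""
    while not isinstance(tree, str):
        feature, threshold, below, otherwise = tree
        tree = below if item[feature] < threshold else otherwise
    return tree


if __name__ == "__main__":
    compound = {"log_p": 1.5, "h_bond_donors": 3}
    print(classify(TREE, compound))
```

The path taken through the tree doubles as an explanation of the result, which is the interpretability advantage over neural networks noted earlier.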
Other methods include random forests, a generalization of decision trees, support vector
machines based on separating the data points by a hyperplane, or clustering, based on
building spatially compact groups of data items.

Docking
Docking means performing a rather accurate simulation of the interaction between our
compound and the target protein. Naturally, the 3D structures of both participants need to
be known, as well as the location of the binding pocket on the protein, which can be determined
from known interactions of the protein with similar compounds. It is even possible to
create a cocrystal of the protein and a compound and find the 3D positions of both using X-ray
crystallography. But as we already mentioned, these experiments are not as fast and
cheap as "virtual experiments".
Docking is an optimization problem. We can translate (move) the compound
in 3 dimensions, we can rotate it and sometimes we also need to twist or stretch
the bonds, because the lowest energy state in isolation may not be the same as the lowest
energy state in interaction with a protein. There are several ways of scoring the
conformations, but molecular mechanics, which calculates the total energy of the system, is the
most commonly used one. Other approaches once again make use of known alignments.
Docking is a rather accurate method of estimating the binding affinity and can be used to
find hits in virtual databases or for optimization (finding a better compound than the one
we already have, one which is less toxic for example). Compared to other methods, it
is rather computationally intensive.

## The future of drug design

A new trend in medicine is personalised treatment, which considers the patient’s gender, age,
weight, type of metabolism and past problems and suggests the best strategy. It can
make treatment more effective and less dangerous at the same time. Even better results
could be achieved if we included the person’s genetic information in the process. The
technology is already sufficiently advanced, but the risk of misusing someone’s genetic
data prevents such approaches from being implemented in practice.
And despite all advances in medicine, the best strategy remains a healthy lifestyle [4].

## Linear optimization example

Let $x_1$ be the number of Astro sets produced per day and $x_2$ the number of Cosmo sets.
The objective function of this LP problem is the profit from producing and selling given
number of TV sets:
$$f(x_1, x_2) = 20 x_1 + 10 x_2$$
The constraints can be expressed as follows:

$$
\begin{aligned}
x_1, x_2 &\ge 0 \\
x_1 &\le 70 \\
x_2 &\le 50 \\
x_1 + 2 x_2 &\le 120 \\
x_1 + x_2 &\le 90
\end{aligned}
$$

Solving the problem using OpenOffice.org Calc with $x_1, x_2$ constrained to integers gives
the following result:
$$x_1 = 70, \quad x_2 = 20$$
for a total profit of \$1600.
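Because the problem is small and the variables are integers, the result can be cross-checked by simply trying every feasible production plan. This brute-force sketch is not how a spreadsheet solver works internally; it only verifies the answer.

```python
def profit(x1, x2):
    """Objective function: profit from x1 Astro sets and x2 Cosmo sets."""
    return 20 * x1 + 10 * x2


def feasible(x1, x2):
    """Check the two combined constraints; the loop bounds below enforce
    0 <= x1 <= 70 and 0 <= x2 <= 50."""
    return x1 + 2 * x2 <= 120 and x1 + x2 <= 90


best = max(
    ((x1, x2) for x1 in range(71) for x2 in range(51) if feasible(x1, x2)),
    key=lambda plan: profit(*plan),
)

if __name__ == "__main__":
    print(best, profit(*best))  # prints (70, 20) 1600
```

The enumeration agrees with the spreadsheet result. For larger problems, trying all combinations quickly becomes impractical and a proper LP solver is needed.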

## Critical evaluation

As a lecture intended to introduce freshmen to the topic of scientific computing, this
subject is doing well in my opinion. There are just a few minor problems that I would like
to point out.
The part about good programming practices in the lecture about Computational Chemistry
was somewhat out of place. Don’t get me wrong, this topic should definitely be taught at
universities, for the sake of anybody even remotely working with software, but it should
be put in the correct place. In a programming course, that is. As I noticed from the
extensive use of Excel, students of this subject are not expected to know programming at
all, so these notes are not relevant for them at this point.
Distributed/parallel computing is a more general topic which finds application in every
part of scientific computing, so maybe it could get a lecture of its own instead of being a
part of the Computational Chemistry lecture.
I think this subject should contain more practical work with scientific software such as
Mathematica, MATLAB, R and also the free alternatives to these programs, possibly instead
of working so much with Excel. This would help students in subsequent courses such
as Optimierung :)
Aside from these notes, the topics were diverse and sometimes even colorful. Coupled
with practical hands-on exercises, the course was certainly an interesting one.

## References

[1] Mathematica Solutions, http://www.wolfram.com/solutions/
[2] Data analysts captivated by R’s power, http://www.nytimes.com/2009/01/07/technology