You are on page 1of 89

Python for Science and Engineering

Dr Edward Schoeld A*STAR / Singapore Computational Sciences Club Seminar June 14, 2011

Scientic programming in 2011

Most scientists and engineers are: programming for 50+% of their work time (and rising) self-taught programmers using inefcient programming practices using the wrong programming languages: C++, FORTRAN, C#, PHP, Java, ...

Scientic programming needs

Rapid prototyping Efciency for computational kernels Pre-written packages! Vectors, matrices, modelling, simulations, visualisation Extensibility; web front-ends; database backends; ...

Ed's story: How I found Python

PhD in statistical pattern recognition: 2001-2006 Needed good tools for my research! Discovered Python in 2002 after frustration with C++, Matlab, Java, Perl Contributed to NumPy and SciPy: maxent, sparse matrices, optimization, Monte Carlo, etc. Managed six releases of SciPy in 2005-6

1. Why Python?

Introducing Python

What is it? What is it good for? Who uses it?

What is Python?
interpreted strongly but dynamically typed object-oriented intuitive, readable open source, free batteries included

batteries included
Pythons standard library is: very large well-supported well-documented

Pythons standard library

data types operating system CGI testing calendar strings compression complex numbers multimedia email networking GUI FTP databases XML threads arguments cryptography CSV les serialization

What is an efcient programming language?

Native Python code executes 10x more slowly than C and FORTRAN

Would you build a racing car ...

... to get to Kuala Lumpur ASAP?

Date 1961 1984 1997 2000, Apr 2003, Aug 2007, Mar 2009, Sep

Cost per GFLOPS (US $) US $1.1 trillion US $15,000,000 US $30,000 $1000 $82 $0.42 $0.13

Technology 17 million IBM 1620s Cray X-MP Two 16-CPU clusters of Pentiums Bunyip Beowulf cluster KASY0 Ambric AM2045 ATI Radeon R800 Source: Wikipedia: FLOPS

Unit labor cost growth

Proxy for cost of programmer time


When FORTRAN was invented, computer time was more expensive than programmer time. In the 1980s and 1990s that reversed.

Efcient programming

Python code is 10x faster to write than C and FORTRAN

What if ...
... you now need to reach Sydney?

Advantages of Python
Easy to write Easy to maintain Great standard libraries Thriving ecosystem of third-party packages Open source

Batteries included

Pythons standard library is: very large well supported well documented

Pythons standard library

data types operating system CGI testing calendar strings compression complex numbers multimedia email networking GUI FTP databases XML threads arguments cryptography CSV les serialization

What is the date 177 days from now?

Natural applications of Python

Rapid prototyping Plotting, visualisation, 3D Numerical computing Web and database programming All-purpose glue

Python vs other languages

Languages used at CSIRO

Python Fortran Java

Matlab IDL Perl

C C++ C# R +5-10 others!

Which language do I choose?

A different language for each task? A language you know? A language others in your team are using: support and help?

Python Interpreted Powerful data input/output Great plotting General-purpose language Cost Open source Yes Yes Yes Powerful Free Yes

Matlab Yes Yes Yes Limited $$$ No

Python Powerful Portable Standard libraries Easy to write and maintain Easy to learn Yes Yes Vast Yes Yes

C++ Yes In theory Limited No No


Fast to write Good for embedded systems, device drivers and operating systems Good for most other high-level tasks







Standard library



Python Powerful, well-designed language Standard libraries Easy to learn Code brevity Easy to write and maintain Yes Vast Yes Short Yes

Java Yes Vast No Verbose Okay

Open source
Python is open source software Benets: No vendor lock-in Cross-platform Insurance against bugs in the platform Free

Python success stories

Computer graphics: Industrial Light & Magic Web: Google: News, Groups, Maps, Gmail Legacy system integration: AstraZeneca - collaborative drug discovery

Python success stories (2)

Aerospace: NASA Research: universities worldwide ... Others: YouTube, Reddit, BitTorrent, Civilization IV,

Industrial Light & Magic

Python spread from scripting to the entire production pipeline Numerous reviews since 1996: Python is still the best tool for them

United Space Alliance

A common sentiment: We achieve immediate functioning code so much faster in Python than in any other language that its staggering. - Robin Friedrich, Senior Project Engineer

Case study: air-trafc control

Eric Newton, Python for Critical Applications: http:// recall.html

Metaslash, Inc: 1999 to 2001 Mission-critical system for air-trafc control Replicated, fault-tolerant data storage

Case study: air-trafc control

Python prototype -> C++ implementation -> Python again Why? C++ dependencies were buggy C++ threads, STL were not portable enough Pythons advantages over C++ More portable 75% less code: more productivity, fewer bugs

More case studies

See for lots more case studies and success stories

2. The scientic Python ecosystem

Scientic software development

Small beginnings Piecemeal growth, quirky interfaces ... Large, cumbersome systems

An n-dimensional array/matrix package

Centre of Pythons numerical computing ecosystem


The most fundamental tool for numerical computing in Python Fast multi-dimensional array capability

What NumPy denes:

Two fundamental objects: 1. n-dimensional array
2. universal function

a rich set of numerical data types nearly 400 functions and methods on arrays: type conversions mathematical logical

NumPy's features
Fast. Written in C with BLAS/LAPACK hooks. Rich set of data types Linear algebra: matrix inversion, decompositions, Discrete Fourier transforms Random number generation Trig, hypergeometric functions, etc.

Elementwise array operations

Loops are mostly unnecessary Operate on entire arrays!
>>> a = numpy.array([20, 30, 40, 50]) >>> a < 35 array([True, True, False, False], dtype=bool) >>> b = numpy.arange(4) >>> a - b array([20, 29, 38, 47]) >>> b**2 array([0, 1, 4, 9])

Universal functions
NumPy denes 'ufuncs' that operate on entire arrays and other sequences (hence 'universal') Example: sin()
>>> a = numpy.array([20, 30, 40, 50]) >>> c = 10 * numpy.sin(a) >>> c array([ 9.12945251, -9.88031624, 7.4511316 , -2.62374854])

Array slicing

Arrays can be sliced and indexed powerfully:

>>> a = numpy.arange(10)**3 >>> a array([ 0, 1, 8, 27, 64, 125, 216, 343, 512, 729]) >>> a[2:5] array([ 8, 27, 64])

Fancy indexing

Arrays can be used as indices into other arrays:

>>> a = numpy.arange(12)**2 >>> ind = numpy.array([ 1, 1, 3, 8, 5 ]) >>> a[ind] array([ 1, 1, 9, 64, 25])

Other linear algebra features

Matrix inversion: mat(A).I Or: linalg.inv(A) Linear solvers: linalg.solve(A, x)
Pseudoinverse: linalg.pinv(A)

What is SciPy?

A community A conference A package of scientic libraries

Python for scientic software

Back-end: computational work Front-end: input / output, visualization, GUIs Dozens of great scientic packages exist

Python in science (2)

NumPy: numerical / array module Matplotlib: great 2D and 3D plotting library IPython: nice interactive Python shell SciPy: set of scientic libraries: sparse matrices, signal processing, RPy: integration with the R statistical environment

Python in science (3)

Cython: C language extensions Mayavi: 3D graphics, volumetric rendering Nitimes, Nipype: Python tools for neuroimaging SymPy: symbolic mathematics library

Python in science (4)

VPython: easy, real-time 3D programming UCSF Chimera, PyMOL, VMD: molecular graphics PyRAF: Hubble Space Telescope interface to RAF astronomical data BioPython: computational molecular biology Natural language toolkit: symbolic + statistical NLP Physics: PyROOT

The SciPy package

BSD-licensed software for maths, science, engineering
integration optimization interpolation FFTs clustering signal processing linear algebra ODEs n-dim image processing interpolation sparse matrices maximum entropy statistics scientic constants C/C++ and Fortran integration

SciPy optimisation example

Fit a model to noisy data: y = a/xb sin(cx)+

Example: tting a model with

Task: Fit a model of the form y = a/bx sin(cx)+ to noisy data. Spec: 1. Generate noisy data 2. Choose parameters (a, b, c) to minimize sum squared errors 3. Plot the data and tted model (next session)

SciPy optimisation example

import numpy import pylab from scipy.optimize import leastsq def myfunc(params, x): (a, b, c) = params return a / (x**b) * numpy.sin(c * x) true_params = [1.5, 0.1, 2.] def f(x): return myfunc(true_params, x) def err(params, x, y): # error function return myfunc(params, x) - y

SciPy optimisation example

# n x y y Generate noisy data to fit = 30; xmin = 0.1; xmax = 5 = numpy.linspace(xmin, xmax, n) = f(x) += numpy.rand(len(x)) * 0.2 * \ (y.max() - y.min())

v0 = [3., 1., 4.] # initial param estimate # Fitting v, success = leastsq(err, v0, args=(x, y), maxfev=10000) print 'Estimated parameters: ', v print 'True parameters: ', true_params X = numpy.linspace(xmin, xmax, 5 * n) pylab.plot(x, y, 'ro', X, myfunc(v, X))

SciPy optimisation example

Fit a model to noisy data: y = a/xb sin(cx)+

Ingredients for this example

numpy.linspace numpy.random.rand for the noise model (uniform) scipy.optimize.leastsq

Sparse matrix example

Construct and solve a sparse linear system

Sparse matrices
Sparse matrices are mostly zeros. They can be symmetric or asymmetric. Sparsity patterns vary: block sparse, band matrices, ... They can be huge! Only non-zeros are stored.

Sparse matrices in SciPy

SciPy supports seven sparse storage schemes ... and sparse solvers in Fortran.

Sparse matrix creation

To construct a 1000x1000 lil_matrix and add values:
>>> from scipy.sparse import lil_matrix >>> from numpy.random import rand >>> from scipy.sparse.linalg import spsolve >>> >>> >>> >>> A = lil_matrix((1000, 1000)) A[0, :100] = rand(100) A[1, 100:200] = A[0, :100] A.setdiag(rand(1000))

Solving sparse matrix systems

Now convert the matrix to CSR format and solve Ax=b:
>>> A = A.tocsr() >>> b = rand(1000) >>> x = spsolve(A, b) # Convert it to a dense matrix and solve, and check that the result is the same: >>> from numpy.linalg import solve, norm >>> x_ = solve(A.todense(), b) # Compute norm of the error: >>> err = norm(x - x_) >>> err < 1e-10 True

Great plotting package in Python Matlab-like syntax Great rendering: anti-aliasing etc. Many backends: Cairo, GTK, Cocoa, PDF Flexible output: to EPS, PS, PDF, TIFF, PNG, ...

Matplotlib: worked examples

Search the web for 'Matplotlib gallery'

Example: NumPy vectorization

1. Use a Monte Carlo algorithm to estimate : 1. Generate uniform random variates (x,%y) over [0, 1]. 2. Estimate from the proportion p that land in the unit circle. 2. Time two ways of doing this: 1. Using for loops 2. Using array operations (vectorized)

3. Scaling

High-performance computing

Aspects to HPC
Supercomputers Parallel programming Caches, shared memory Code porting Distributed clusters / grids Scripting Job control Specialized hardware

Python for HPC

Advantages Portability Easy scripting, glue Maintainability Proling to identify hotspots Vectorization with NumPy Disadvantages Global interpreter lock Less control than C Native loops are slow

Large data sets

Useful Python language features: Generators, iterators Useful packages: Great HDF5 support from PyTables!

Hierarchical data
Databases without the relational baggage

Great interface for HDF5 data

Efcient support for massive data sets

Applications of PyTables
aeronautics drug discovery nancial analysis climate prediction telecommunications data mining statistical analysis etc.

Breaking news: June 2011

PyTables Pro is now being open sourced. Indexed searches for speed Merging with PyTables Working project name: NewPyTables

PyTables performance
OPSI indexing engine speed: Querying 10 billion rows can take hundredths of a second! Target use-case: mostly read-only or append-only data

Principles for efcient code

Important principles
1. "Premature optimization is the root of all evil" Don't write cryptic code just to make it more efcient!

2. 1-5% of the code takes up the vast majority of the computing time! ... and it might not be the 1-5% that you think!

Checklist for efcient code

From most to least important: 1. Check: Do you really need to make it more efcient? 2. Check: Are you using the right algorithms and data structures? 3. Check: Are you reusing pre-written libraries wherever possible? 4. Check: Which parts of the code are expensive? Measure, don't guess!

Relative efciency gains

Exponential-order and polynomial-order speedups are possible by choosing the right algorithm for a task. These require the right data structures! These dwarf 10-25x linear-order speedups from: using lower-level languages using different language constructs.

4. About Python Charmers

The largest Python training provider in South-East Asia Delighted customers include:

Most popular course topics

Python for Programmers Python for Scientists and Engineers Python for Geoscientists Python for Bioinformaticians New courses: Python for Financial Engineers Python for IT Security Professionals 4 days 3 days 3 days 4 days 4 days 4 days

Python Charmers: Topics of expertise

Python: beginners, advanced Scientic data processing with Python Software engineering with Python Large-scale problems: HPC, huge data sets, grids Statistics and Monte Carlo problems

Python Charmers: Topics of expertise (2)

Spatial data analysis / GIS General scripting, job control, glue GUIs with PyQt Integrating with other languages: R, C, C++, Fortran, ... Web development in Django

How to get in touch

See or email us at: