
Unix 39

path. The full path always starts with a forward slash (/), which represents the root
of the Unix directory structure. For instance, the pwd command returns the full path
to the current location. A relative path is relative to your current location within the
directory structure. It often starts with the name of a directory (moving down in the
directory tree) or with ../ (moving up in the directory tree).
The path to a directory or file can be added to many Unix commands. For instance,
assuming the current location within the directory tree is /home/students/rjones/,
listing the contents of the directory projects/sst can be achieved using the relative
path

ls -l projects/sst

or the full path

ls -l /home/students/rjones/projects/sst

Some commands such as the copying command cp even take two paths, the first
being the path to the input file or directory (source) and the second being the path
to the output file or directory (destination).
The (relative) path to the current directory is a dot (.) or a dot followed by a forward
slash (./) which can be used to copy a file from another directory into the current
directory as shown in the following example.

cp dir1/dir2/myfile.csv ./

3.4.4 Special Characters


In order to understand the use of special characters and how they are interpreted by
the Unix system it is important to know their meaning. Table 3.4.5.1 lists the most
commonly used special Unix characters.

Table 3.4.5.1: Special characters used in Unix commands.

Character Name Description


. Dot Shortcut for current directory.
.. Double-dot Shortcut for the parent directory (one level up).
/ Forward slash Divides directory and filenames in paths.
~ Tilde Shortcut for path to home directory.
* Asterisk (wildcard) Matches any number of characters in file and directory names.
? Question mark Like the asterisk but matches exactly one character.
& Ampersand Used to run a process in the background.

The dot (.) and tilde (~) are shortcuts for paths pointing to the current directory
and the home directory of the user, respectively. The double-dot (..) and forward
slash (/) are used in paths, and represent a level up within the directory tree and the
separator between directory and file names, respectively.
The asterisk (*) and the question mark (?) are useful for listing or searching for specific
file or directory patterns. For instance, the following example lists all files that begin
with Engelstaedter, end with the file extension .pdf and are located in a directory
called papers that sits at the root of the home directory. The command can be
executed from anywhere within the Unix directory tree.

ls -l ~/papers/Engelstaedter*.pdf
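The behaviour of the question mark can be tested with a few sample files. The filenames below are made up for illustration.

```shell
# Create some sample files (hypothetical names)
touch data1.csv data2.csv data10.csv

# The question mark matches exactly one character, so this lists
# data1.csv and data2.csv but not data10.csv
ls data?.csv
```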

If a special character is supposed to appear as a literal character in a text rather than be
interpreted by the Unix system then it needs to be escaped. This is done by adding a
backslash (\) just in front of the character. If the backslash itself needs to be escaped
then another backslash is added in front of it.
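As a short illustration (the filenames here are made up), a space or an asterisk in a filename can be escaped with a backslash or protected with quotes:

```shell
# Create a file whose name contains a space by escaping the space
touch my\ file.txt
ls -l my\ file.txt

# Quoting the whole name works as well
ls -l "my file.txt"

# A literal asterisk must be escaped so it is not treated as a wildcard
touch report\*.txt
ls -l report\*.txt
```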

3.5 Working with Files and Directories

3.5.1 Creating Text Files and Directories


Directories and text files can be created from the command line. Some examples are
shown in Table 3.5.1.1.

Text files can be created in various ways. First, a new text file can be created by
using a text editor as shown in Section 3.4.3. Second, the touch command followed
by a filename can be used. The actual purpose of the touch command is to update
the access and modification time of a file but it also creates an empty text file if the
specified filename does not correspond to an existing file. Third, a text file can be
created by redirecting the text output from a Unix command to a file as explained in
Section 3.5.x.

Unix is very picky about white space as it is interpreted by the Unix
system as the end of a command, path or filename. Do not use spaces in file
or directory names; use underscores instead. If white space does exist in a
filename then double-quotes can be used to make the Unix system aware of
the parts belonging together.

The mkdir (make directory) command can be used to create single or multiple
directories as shown in Table 3.5.1.1. More details about the mkdir command can
be found in the man pages (Section 3.4.2).

Table 3.5.1.1: Unix commands for creating directories and files.

Command Description
mkdir dir1 Create a directory called dir1.
mkdir dir1/dir2 Create a sub-directory dir2 inside directory dir1.
mkdir dir1 dir2 dir3 Create three directories in one go called dir1, dir2 and dir3.
mkdir -p dir1/dir2 Create directory dir2 and parent directory dir1 in one go.
touch myfile.txt Create an empty text file.
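The commands from Table 3.5.1.1 can be combined; the directory and file names below are made up for illustration.

```shell
mkdir -p projects/data          # create directory data and its parent projects in one go
touch projects/data/notes.txt   # create an empty text file inside
ls -R projects                  # list the new directory tree recursively
```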

3.5.2 Listing Files and Directories


The ls (list) command has already been briefly introduced in examples in previous
sections. The ls command is an extremely powerful command due to its flexibility
facilitated by the many available options that change how the ls command behaves.
Some frequently used examples of how the ls command can be used are shown in
Table 3.5.2.1.

Table 3.5.2.1: Examples for listing files and directories using the ls command.

Command Description
ls Simple list of files and directories.
ls -l Long list file format.
ls -lt Sort list by time.
ls -ltr Sort list by time and reverse order.
ls -lh Show size in human-readable format.
ls -lhS Sort by size and show size in human-readable format.
ls -la List all files including hidden files (that start with a dot).
ls -p | grep / List directories only.
ls -l dir1/ List the contents of directory dir1.

3.5.3 Moving Around in the Directory Tree


Finding your way in a directory structure from the command line can be confusing
in the beginning. The pwd command can be used to show the current location within
the directory tree. Depending on the configuration, some Unix command prompts
also include the path of the current directory.
The cd (change directory) command can be used to move around within the directory
structure. Full or relative paths can be used with the cd command (see Section 3.4.4).
Some examples of using the cd command are listed in Table 3.5.3.1.

Table 3.5.3.1: Examples for moving around in the directory tree using the cd command.

Command Description
cd Jump to the root of the home directory.
cd dir1 Move into directory dir1.
cd dir1/dir2 Move two levels down into sub-directory dir2.
cd .. Move one level up in the directory tree.
cd ../.. Move two levels up in the directory tree.

3.5.4 Copying, Moving, Renaming and Deleting Files and Directories

The cp (copy) command can be used to copy files and directories. The mv (move)
command can be used to move or rename files and directories. The rm (remove)
command is used to delete files and directories. Some examples of frequently used
commands to copy, move, rename and delete files and directories are shown in Table
3.5.4.1.
Applying a copy or remove command recursively to a directory as in cp -R dir or
rm -r dir means that the command includes all files and sub-directories contained
inside the directory.
Trying to remove a directory without the recursive option as in rm dir1 will likely fail
with a warning message similar to the one below indicating that it is a directory that
the user is trying to remove. This is meant to be a safeguard as the directory may
contain many files and/or sub-directories.

rm: cannot remove 'dir1': Is a directory

Care should be taken when deleting files or directories, especially when
deleting directories recursively (rm -rf). There is no Recycle Bin to recover
deleted files. Once deleted they are gone for good. However, most server
administrators run frequent backups of user areas. So there may still be a
chance to recover an older version of a deleted file or directory from the
last backup.

Table 3.5.4.1: Examples for copying (cp), moving (mv), renaming (mv) and deleting (rm) files and
directories.

Command Description
cp <ifile> <ofile> Copy a file (generic syntax).
cp file.txt dir/file1.txt Create a copy of file.txt named file1.txt in the
directory dir.
cp ~/test.txt ./ Copy a file from the root of the home directory to the
current directory without changing the file’s name.
cp -R idir odir Copy a directory recursively (generic syntax).
mv ifile ofile Move or rename a file or directory (generic syntax).
mv file.txt dir/ Move file file.txt into the directory dir.
mv file.txt ../ Move file file.txt one level up in the directory tree.
mv file1.txt file2.txt Rename file1.txt to file2.txt.
rm file.txt Delete file.txt.
rm -f file.txt Delete file file.txt without confirmation (force delete).
rm -r dir Remove directory dir and all its content.

When copying and moving files care should be taken as existing files may
be overwritten without warning.
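One way to guard against accidental overwriting is the -i (interactive) option of cp and mv, which asks for confirmation before replacing an existing file; GNU versions also offer -n (no-clobber). A small sketch with made-up filenames (note that recent GNU cp versions return a nonzero status when -n skips a copy):

```shell
rm -rf backup && mkdir backup
echo "important" > file.txt

# No prompt here because the destination does not exist yet
cp -i file.txt backup/file.txt

# -n refuses to overwrite the existing destination
echo "changed" > file.txt
cp -n file.txt backup/file.txt || true

cat backup/file.txt    # still contains "important"
```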

3.6 Advanced Unix Commands

3.6.1 Examining Text Files


While text files can always be opened using a text editor (Section 3.4.3) this is sometimes
not very practical. For instance, a data file in ASCII format may be a million lines
long but only the first few lines containing the file header are of interest. Several
commands are available on the Unix system to examine the content of text files
without changing the file content.
The cat command simply ‘dumps’ the content of a text file into the terminal window.
It works well for short text files.
For longer text files the less command is more practical as it allows both forward
and backward scrolling as well as searching within the file. The q key can be used
to terminate the less command and jump back to the command line.

The head and tail commands can be used to dump lines from the beginning or the
end of a text file into the terminal window, respectively.
The differing behaviour of these commands is best tested with a long text file
(longfile.txt). Examples are shown in Table 3.6.1.1. For additional command options
such as how to search file content as part of the less command the man pages can be
consulted.

Table 3.6.1.1: Examples for examining text files using the cat, less, head and tail commands.

Command Description
cat longfile.txt Print the contents of the file to the terminal window.
less longfile.txt Scroll and search through file.
head longfile.txt Dump the first 10 lines of the file to the terminal window.
head -n 20 longfile.txt Dump the first 20 lines of the file to the terminal window.
tail longfile.txt Dump the last 10 lines of the file to the terminal window.
tail -n 20 longfile.txt Dump the last 20 lines of the file to the terminal window.
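head and tail can also be combined in a pipe to extract a range of lines from the middle of a file. The example below first generates a sample longfile.txt so it can be run anywhere:

```shell
# Generate a sample file with 200 numbered lines
seq 1 200 > longfile.txt

# Print lines 95 to 100: take the first 100 lines, then the last 6 of those
head -n 100 longfile.txt | tail -n 6
```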

3.6.2 File and Directory Properties


Using the ls -lh command provides a detailed list of files and directories showing the
size in human-readable format. The list includes tabular information on permissions,
ownership, size and access/creation time known as file properties. Executing the ls
-lh command may produce output similar to the following example.

-rwxr-xr-x 1 rjones climate 1.2K Apr 19 16:32 file.txt

A description of the file properties in the example above is shown in Table 3.6.2.1. The
first part (-rwxr-xr-x) are the file permissions. They are explained in more detail in
Section 3.6.3. The second part (1) is the number of hard links to the file (can be ignored
most of the time). The third and fourth parts (rjones and climate) are the username of
the file owner and the group the owner belongs to. When a Unix account is created by
the system administrator the owner is placed into a group for management purposes.
The last three parts (1.2K, Apr 19 16:32 and file.txt) show the file size, creation or
last modification time and the filename, respectively.

Table 3.6.2.1: Example of file and directory properties.

Command Description
-rwxr-xr-x File permissions.
1 Number of hard links to this file.
rjones Username of file owner.
climate Name of the group the file belongs to.
1.2K File size (in human-readable format).
Apr 19 16:32 Date and time of creation or last modification.
file.txt Filename.

3.6.3 File Permissions


Understanding the permission part of the file properties (first line in Table 3.6.2.1)
can be challenging at first. However, it is important to understand file permissions
as they control who can read, edit and execute the file.
File permissions are divided into four sections and contain a total of ten characters
(Figure 3.6.3.1). The first section (yellow) is a single character that shows the file type.
The letter d indicates a directory, a hyphen (-) indicates a file and the letter l indicates
a link.
The following three sections contain sets of three characters showing the permissions
for the user (blue), group (green) and others (orange) (Figure 3.6.3.1). The user is also
sometimes referred to as the owner.
The order of the three characters in each section shows the permissions for read
(r, first character), write (w, second character) and execute (x, third character). If the
letter r, w or x is set then the corresponding read, write or execute permission has been granted. If,
instead of a letter, a hyphen (-) is shown then the specific permission has not been
granted.
The example shown in Figure 3.6.3.1 (-rwxr-xr-x) is a very commonly used set of
permissions that allows the user to read, write and execute the file and members of
the group and others (everyone else on the system) to only read and execute the file.
This means that no one apart from the user can modify the file but everyone else can
read, copy and execute it.

Figure 3.6.3.1: Unix file permissions.

In some cases a plus symbol (+) is shown as an eleventh character indicating that
extended file permissions have been set using Access Control Lists (not covered in
this book).

3.6.4 Changing File Permissions and Ownership


When a file or directory is created the default permissions defined by the system
administrator are set. In most cases these permissions are sensible and do not require
changing. However, there are situations when changing file or directory permissions
is needed. For instance, when sensitive information needs to be protected then access
to a file or directory can be removed for everyone but the file owner. The traditional
way of changing permissions is to use the chmod (change mode) command. The
generic syntax is the chmod command followed by options (if required), the desired
permission and the filename.

chmod <options> <permissions> <filename>

The permissions part can be set using either symbolic or octal notation. The symbolic
notation will be discussed first.
In its simplest form, the symbolic permissions part is made up of three characters. The
first character defines the group for which permissions are intended to be changed
(u for user, g for group, o for others or a for all). The second character defines
whether permission is intended to be granted or to be removed (+ for granting or -
for removing permissions). The third character defines which permissions are being
modified (r for read, w for write or x for execute). For more options consult the manual
pages.
For example, executing the command ls -l script.sh may return the following file
properties.

-rw-rw---- 1 rjones climate 34 May 13 12:39 script.sh

Only the file owner rjones and members of the group climate can read and edit the
file. Others (everyone else) cannot access the file. No one can execute the file.
The following command adds execute permissions for the file owner (needed to run
a Shell script).

chmod u+x script.sh

The ls -l script.sh command now returns the following updated file properties.

-rwxrw---- 1 rjones climate 34 May 13 12:39 script.sh

The octal notation uses a three-digit octal number to set the permissions. An easy
way to identify the octal number for a specific set of permissions is to use one of the
online Unix permission calculators¹⁸.
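The octal digits can also be worked out by hand: each digit is the sum of read (4), write (2) and execute (1), with one digit each for user, group and others. A short sketch (the filename is made up):

```shell
# 6 = 4+2 = rw-, 4 = r--, so 644 corresponds to -rw-r--r--
touch notes.txt
chmod 644 notes.txt
ls -l notes.txt
```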
Some examples for changing file and directory permissions using symbolic and octal
notation are given in Table 3.6.4.1.

Table 3.6.4.1: Examples for changing file and directory permissions using octal and symbolic
notation.

Command Description
chmod 755 file.txt Set file permissions to -rwxr-xr-x (often the default).
chmod -R 760 dir1 Set file permissions to drwxrw---- for a directory recursively.
chmod +r file.txt Give everyone read permission to a file.
chmod g-w file.txt Remove write permission to a file for members of the group.
¹⁸http://permissions-calculator.org

The owner and group of the file or directory can be changed using the chown (change
owner) and chgrp (change group) commands, respectively. Some examples for the use
of the chown and chgrp command are given in Table 3.6.4.2.

Table 3.6.4.2: Examples for changing file and directory ownership and group information.

Command Description
chown jking script.sh Change the owner of the file script.sh to jking.
chgrp students script.sh Change the group associated with the file script.sh
to students.
chown -R jking:students data Change the owner and group of the directory data
recursively in one command.

3.6.5 Changing the Unix Account Password


In order to change the password of the Unix account the passwd command can be used.
The following steps demonstrate how to change the password. The old password
must be known before a new password can be set.

1. Log into the Unix account.
2. Execute the command passwd.
3. Enter the old password when prompted.
4. Enter the new password when prompted.
5. Re-enter the new password when prompted.

The next time you log into the Unix account the new password should be
used.

If the old password is unknown then only the system administrator can
reset the password.

3.6.6 Redirecting Command Output


The output from any command that prints information to the terminal window
(standard output) can be saved in a file if needed. This is referred to as output
redirection. Redirecting the output of a command to a file can be achieved by adding
the greater-than symbol (>) after the command followed by a filename.
For example, the following command will save the first 15 lines of the CSV file
data.csv to a new file named data_new.csv.

head -n 15 data.csv > data_new.csv

If the file to which the output is redirected does not exist then it will be created.
If the file already exists then it will be overwritten.
In order to append the redirected output to the content of an already existing file two
joined greater-than symbols (>>) can be used instead of one (>).
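The difference between the two redirection operators can be demonstrated in a few lines (log.txt is a made-up filename):

```shell
echo "first line"  > log.txt    # creates (or overwrites) log.txt
echo "second line" >> log.txt   # appends to log.txt
cat log.txt                     # shows both lines
```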

3.6.7 Finding Files


When working on a server for some time, more and more files and directories
accumulate. In order to search for files and directories the find command can be
used. It is a powerful and very flexible tool. There are many options available that
change how the search is conducted. For a complete list of available options a look
at the man pages is advised. The generic format of the find command is as follows.

find [options] [path] [expression(s)]

The path can be either a dot (.) indicating that the search should start at the current
location or any other full or relative path (see Section 3.4.4) pointing to a directory on
the server. Expressions are where the power of the find command lies as they can be
used to determine the search pattern. Some examples that are frequently used with
the find command are shown in Table 3.6.7.1.

Table 3.6.7.1: Examples for finding files and directories using the find command.

Command Description
find . -iname 'myFile.txt' Find the file myFile.txt, case-insensitive.
find . -name '*.pro' -print Find all files and directories ending with .pro,
case-sensitive.
find . -type f -iname '*ipcc*' Find files (ignore directories) that contain ipcc in
the file name.
find . -not -iname '*.dat' Find all files that do not end with .dat.
find . -user abcd1234 Find all files owned by a user called abcd1234.
find . -type f -size +100M Find files that are larger than 100 MB.
find . -type f -size -100M Find files that are smaller than 100 MB.
find . -maxdepth 1 -name '*.py' Find all files ending with .py in the current
directory only.

The find command returns an unsorted list of files. In order to generate a sorted
list the find command output can be passed on to the ls command using backticks.
For instance, the following command searches for files with the file extension .ppt
starting the search at the current location within the directory tree. The whole find
command construct is put between backticks. The ls -l command is placed at the
beginning of the line.

ls -l `find . -iname '*.ppt'`

Note the difference between a single quote and a backtick in the above
command. The backtick can normally be found on the keyboard just below
the Esc key.
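An alternative to the backtick construct is find's own -exec option, which avoids problems with spaces in filenames. The sample directory and file below are made up:

```shell
# Create a sample file to find
mkdir -p talks && touch talks/intro.ppt

# Run ls -l on every match; the {} is replaced by the found paths
find . -iname '*.ppt' -exec ls -l {} +
```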

3.6.8 File Compression and Archives


A ZIP file is an archive file that uses lossless data compression and usually has the file
extension .zip. It can contain one or more files or directories. The name ZIP derives
from the phrase ‘move at high speed’.
A TAR file is an archive file that combines one or more files or directories
into a single file while preserving file information such as ownership, permissions
and time stamps. It tends to have the file extension .tar. The name TAR was derived
from tape archive as the method was originally developed to write data to tape.
Both formats are quite commonly used to archive data and are sometimes used
together to create tarred zip files.
The compression tool GZIP (the G stands for GNU) is commonly combined with the
tar format to generate compressed archives having the file extension .tar.gz. Some
examples of working with .zip, .tar and .tar.gz files are given in Table 3.6.8.1.

Table 3.6.8.1: Examples of working with file compression and archives.

Command Description
unzip -l file.zip List the content of file.zip.
zip -r docs.zip docs Create a zipped archive docs.zip containing all files
in the directory docs.
unzip docs.zip Extract archive from docs.zip.
tar -ztvf out.tar.gz List the content of out.tar.gz without extracting.
tar -cf out.tar <infiles> Create a tar file out.tar containing several input
files.
tar -xf out.tar Extract files from out.tar.
tar -czf out.tar.gz <infiles> Create a tar file out.tar.gz with GZIP compression
in one go.
tar -xzf out.tar.gz Extract files from a GZIP tar file out.tar.gz in one
go.
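The tar commands from Table 3.6.8.1 can be combined into a complete round trip (the directory and file names are made up):

```shell
# Create a sample directory with one file
mkdir -p docs && echo "hello" > docs/a.txt

tar -czf docs.tar.gz docs   # create a compressed archive
tar -ztvf docs.tar.gz       # list its contents without extracting

rm -r docs                  # remove the original directory ...
tar -xzf docs.tar.gz        # ... and restore it from the archive
cat docs/a.txt              # prints: hello
```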

3.6.9 Download Files from the Command Line


It is relatively straightforward to download files directly from the Unix command
line using the wget command. The general syntax is the wget command followed by
the URL of the file to be downloaded. The wget command is especially useful when
downloading datasets as it can be included in a script (e.g., a Python or Shell script)
to loop, for instance, over a range of years.
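Such a loop could be sketched as follows. The URL pattern and filenames are made up for illustration, and echo is used instead of a real download so the sketch can be run safely; drop the echo to actually fetch the files.

```shell
# Print the wget command that would be executed for each year
for year in 2001 2002 2003; do
    echo wget -N "https://example.com/data/tas_${year}.nc"
done
```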
In the following example the wget command is used to download the data file
absolute.nc¹⁹ from the Climate Research Unit (CRU) of the University of East Anglia
(UEA). The -N option makes sure the wget command checks if the file already exists
locally and, if it does, to only download the file if the remote version of the file is
newer than the local version.

¹⁹CRU Global 1961-1990 mean monthly surface temperature climatology on a 5° by 5° grid (Jones et al.,
1999). Note that in this file, latitudes run from North to South.

wget -N https://crudata.uea.ac.uk/cru/data/temperature/absolute.nc

While the download command is executed some information appears in the terminal
window including which server the file is downloaded from (crudata.uea.ac.uk) and
the file size (62K). The download progress is shown as well as the download speed.
It is possible to rename the downloaded file on the fly within the same command
using the -O option followed by the new filename. In the following example the
downloaded file is renamed to CRU_Sfc_T.nc.

wget -N https://crudata.uea.ac.uk/cru/data/temperature/absolute.nc -O CRU_Sfc_T.nc

3.7 Long-running Jobs


Some computational jobs or data downloads take days or even weeks to complete.
This is especially the case when working with large datasets as often done in climate
sciences. Under normal circumstances a processing job terminates when the terminal
window is closed or the SSH or VPN connection is interrupted. It is therefore not
practical to have an active connection and open terminal window running for days
until the job completes.
In the following subsections possible options for dealing with long-running jobs are
discussed.

3.7.1 GNU Screen (recommended)


GNU Screen (also referred to simply as Screen) is a Linux application that allows the
user to run multiple virtual console sessions from within a single terminal window. Once
a session is created and a long-running job started the user can detach from the
session and the job will keep running even if the terminal window is closed or the
VPN connection is interrupted. Later (e.g., from home) the user can re-attach to the
session to check if the job is still running as expected.

To create a Screen session the screen command is used followed by the option -S and
the name of the Screen session. In the following command a Screen session named
era5 is created.

screen -S era5

To check which Screen sessions are currently set up the screen -ls command can be
used. All running Screen sessions will be listed, including information about the Screen
session’s names and associated ID numbers as well as the current connection status
(Attached or Detached). The output from the screen -ls command may look similar
to the following.

There is a screen on:
        64521.era5    (09/10/19 16:51:08)    (Attached)
1 Socket in /var/run/screen/S-worc1870.

Multiple Screen sessions with the same name can be created. They are
distinguishable via the associated process ID number (64521 in the example
above), which can also be used as part of the Screen session name (64521.era5).

To detach from a Screen session either the keyboard shortcut Ctrl-a d can be
used or the terminal window can be closed using the mouse.
To re-attach to a Screen session the command screen -dR followed by the session
name can be used. The -d option detaches any open connections (e.g., in another
terminal or on another machine). The -R option re-attaches to the Screen session.
To terminate a Screen session while being attached the keyboard shortcut Ctrl-a k
(kill) can be used or the Unix exit command can be executed on the command line.

Executing the Unix command exit on the command line while being
attached to a Screen session will terminate the session.

Some of the more frequently used Screen commands are listed in Table 3.7.1.1.

Table 3.7.1.1: Frequently used screen commands and keyboard shortcuts.

Command Description
screen -S era5 Start a Screen session named era5.
screen -R era5 Reconnect to the Screen session named era5.
screen -dR era5 Close any open connections to the era5 session and
reconnect.
Ctrl-a d Keyboard shortcut for detaching from a session.
Ctrl-a k Keyboard shortcut for terminating (killing) a session.
4. Multi-dimensional Gridded Datasets

4.1 The Earth’s Coordinate System and Realms
In order to understand how models represent the spherical nature of our planet it
is important to be familiar with the terminology that is used to describe Earth’s
horizontal and vertical space. In a simplified conceptual model planet Earth takes
the form of a sphere. Our planet has one geographic pole in the north and one
in the south. Lines connecting the two poles are called meridians. Each meridian
is associated with a constant longitude value. Longitude values are expressed in
degrees west and east from the Prime Meridian. By convention the prime meridian
passes through the Royal Observatory in Greenwich, UK, and is associated with 0°
longitude. Meridian longitude values decrease westwards from the prime meridian
halfway around the Earth down to -180° and increase eastwards halfway around the
Earth up to 180°.
The line located at equal distance from both poles circling in the east-west direction
around the globe is called the Equator. Lines parallel to the Equator towards the
north and south are called parallels. Their position on the planet is determined by
the angle from the horizontal Equator plane. This angle is referred to as latitude. The
latitude value of the parallels increases from the Equator northwards up to 90° at the
north pole and decreases from the Equator southwards down to -90° at the south pole.
Each point on the Earth’s surface can be identified by a pair of latitude and longitude
coordinates.
Our planet is surrounded by a layer of gases (primarily nitrogen, oxygen, argon and
carbon dioxide) that make up the atmosphere (known as air). A planet without an
atmosphere does not have weather. Most of these gases are within 16 km of the
surface. Air pressure and density decrease with distance from the land or sea surface.
Mean sea-level pressure (MSLP) is the average air pressure at mean sea-level. The
global average MSLP is 1013.25 hPa. In addition, large parts of the Earth’s surface
are covered by water forming large ocean basins (e.g., Atlantic, Pacific and Indian
Ocean) and some smaller, shallower seas (e.g., Mediterranean, North Sea). Water
pressure increases with ocean depth. Both the global oceans and the atmosphere
constitute the vertical component of the Earth’s climate system above the surface.

4.2 The Model Grid

Figure 4.2.1: Schematic of surface grid cells (2D) and atmospheric grid boxes (3D) of an atmospheric
general circulation model (AGCM).

An atmospheric general circulation model (AGCM) is a climate model that simulates


the state of the Earth’s atmosphere based on a mathematical model of the planetary
general circulation using thermodynamic equations. It is, however, impossible for
AGCMs to compute the climate state for an infinite number of possible points on the
Earth’s surface and in its atmosphere. AGCMs therefore compute the climate state
for regularly spaced points around the planet. These model grid points are generally
referred to as the model grid. These points are located at the centre of grid cells.
Surface variables such as precipitation or 2 m air temperature do not have a
vertical representation but follow the surface topography (Figure 4.2.1).
In contrast, for atmospheric variables such as air temperature, humidity or winds the
data points are located at the centre of horizontally distributed and vertically stacked
3-dimensional grid boxes (Figure 4.2.1). As indicated in Figure 4.2.1 AGCMs compute
horizontal and vertical exchanges between the surface and the lowest atmospheric
level of grid boxes as well as between the atmospheric grid boxes based on the
thermodynamic equations. Data values associated with data points are meant to
represent an average value for the grid cell area (surface variables) or grid box volume
(atmospheric variables). The data values change with every model timestep.
The horizontal distance between data points is referred to as the model horizontal or
spatial resolution. AGCM horizontal resolutions typically range between 1° and 5°.
For regional area models and weather forecast models the spatial resolution is
typically much finer. If the horizontal distance between the data points is large then the model
resolution is referred to as coarse or low. If the horizontal distance between the data
points is small then the model resolution is referred to as fine or high. The higher
the model resolution the larger the number of data points for which climate variables
have to be computed. Therefore, computing time and processing power requirements
increase rapidly with increasing model resolution. The longitudinal distance
between data points may be different from the latitudinal distance.
The lowest set of horizontally distributed grid boxes creates a layer around the planet
that interacts with land and ocean surfaces. In the vertical domain additional layers
stacked on top of each other make up the atmosphere (Figure 4.2.1). The layers
are referred to as model levels. The vertical distance between data points (vertical
resolution) is more complex and will be discussed in more detail in Section xxx.
Multi-dimensional Gridded Datasets 59

Figure 4.2.2: Schematic of projecting regularly spaced longitude and latitude grid cells wrapped
around the globe onto a horizontal plane using the cylindrical projection cut along a) the prime
meridian and b) the 180° (-180°) meridian.

The surface grid cell boundaries and by extension the horizontal atmospheric grid box
boundaries follow the Earth’s meridians and parallels (Figure 4.2.1). In the context
of the model grid the term zonal is used to describe phenomena associated with
changes between grid boxes aligned in the east-west direction. Zonally aligned grid
boxes are bound by one parallel to the north and one to the south (Figure 4.2.1). The
term meridional is used to describe phenomena associated with changes between
grid boxes aligned in the north-south direction. Meridionally aligned grid boxes are
bound by one meridian to the west and one to the east. In climate science the terms

zonal and meridional are used to describe directional climate variables or statistics
such as meridional wind, which refers to the v-component (north-south) of the wind, or
zonally averaged global surface temperature, which refers to the surface temperature
averaged around the Earth between two specified latitudes.
While on a regularly spaced grid around the Earth the meridional grid cell width will
always be the same from one pole to the other, the zonal grid cell width decreases
with distance from the Equator towards the poles (Figure 4.2.1 and Figure 4.2.2).
Therefore, the area covered by each grid cell will always be the same in the zonal
direction whereas the cell area decreases in the meridional direction away from the
Equator towards the poles. For example, for meridians with a 1° longitudinal spacing
the distance from one meridian to the next is about 111 km near the Equator, 96 km
at 30° latitude and 56 km at 60° latitude.
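The decrease of the zonal grid cell width with latitude can be reproduced with a few lines of Python, assuming a spherical Earth with a radius of 6371 km (the helper name is illustrative):

```python
import math

# East-west (zonal) distance spanned by one degree of longitude at a
# given latitude, assuming a spherical Earth (radius 6371 km).
def zonal_km_per_degree(lat_deg, radius_km=6371.0):
    circumference_km = 2 * math.pi * radius_km
    return (circumference_km / 360.0) * math.cos(math.radians(lat_deg))

for lat in (0, 30, 60):
    print(lat, round(zonal_km_per_degree(lat)))  # 0 -> 111, 30 -> 96, 60 -> 56
```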
For the purpose of climate computations the grid wrapped around the spherical
Earth is transposed to a regular grid where all grid boxes have the same size. For
illustration purposes the cylindrical projection may be used whereby the grid cells
are first projected onto a cylinder by an imaginary light source at the centre of the
Earth, after which the cylinder is cut along a meridian and ‘unfolded’ into
a 2-dimensional plane. Cutting the cylinder along the prime meridian results in a
Pacific-centred map (Figure 4.2.2a) whereas cutting the cylinder along the 180° (-180°)
meridian results in an Africa-centred map (Figure 4.2.2b). Each grid cell represents
a single data value. It is important to note that while the transposed grid cells all
have the same size the data value associated with each grid box still represents the
real-world grid cell area which changes with distance from the Equator as discussed
above.

4.3 Grid Indexing and Geographical Referencing of Data Points
As discussed in Section 4.2 a distinction is made between surface fields which follow
the surface topography and which have no vertical component and atmospheric
levels which have a vertical component taking the form of atmospheric layers.
Schematics of the corresponding grids can be seen for the surface field in Figure
4.3.1a and for a single atmospheric layer in Figure 4.3.1b. Each grid cell in Figure
4.3.1a as well as each grid box in Figure 4.3.1b is associated with a single data value.

Both represent 2-dimensional fields of the same size with the same number of data
points. Therefore, both are treated the same way during model data analysis. The
only difference between the two fields is with regards to what they represent (surface
field vs. atmospheric layer).
The surface and single atmospheric layer data fields shown in Figures 4.3.1a and 4.3.1b
are global fields with a 5° by 5° spatial resolution, represented by 72 grid cells or
grid boxes in the longitude direction and 36 grid cells or grid boxes in the latitude
direction. The northernmost boundary of the field is at 90° latitude, representing the
north pole. The southernmost boundary is at -90° latitude, representing the south
pole. The westernmost boundary of the global field is at -180° (or 0°) longitude. The
easternmost boundary is at 180° (or 360°) longitude.
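The coordinate arrays for such a 5° by 5° global grid can be generated with NumPy. This sketch assumes cell-centred data points and a north-to-south latitude ordering:

```python
import numpy as np

# Cell-centre coordinates of a global 5 x 5 degree grid:
# 72 longitudes (west to east) and 36 latitudes (north to south).
res = 5.0
lons = np.arange(-180 + res / 2, 180, res)  # -177.5, -172.5, ..., 177.5
lats = np.arange(90 - res / 2, -90, -res)   # 87.5, 82.5, ..., -87.5

print(len(lons), len(lats))  # 72 36
```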

Figure 4.3.1: Schematic of a) a global surface data field and b) a single atmospheric layer data
field, each with a 5° by 5° spatial resolution. Grid cell and grid box boundaries and the
geographical locations of the data points are indicated for the magnified cells.

Indices are used to refer to a specific grid cell or its associated data value. The
position of each grid cell or grid box within the 2-dimensional grids shown in Figure
4.3.1 can be specified by a pair of index values. The first index value specifies
the latitudinal position and the second specifies the longitudinal position. In the
latitudinal direction the northernmost grid cell has the index 0. The index then
increases in steps of 1 towards the south until index value 35 for the southernmost
grid cell. Note that for 36 grid cells in the latitude direction the index runs from
0 to 35 (Figure 4.3.1). Similarly, in the longitudinal direction the westernmost grid
cell has the index 0 increasing in steps of 1 towards the east until index 71 for the
easternmost grid cell. Note that for 72 grid cells in the longitude direction the index
goes from 0 to 71 (Figure 4.3.1).
By using this system the grid cells and grid boxes that make up the four corners of
the global field can be specified by the index pairs [0, 0] for the northwestern corner,
[0, 71] for the northeastern corner, [35, 0] for the southwestern corner and [35, 71]
for the southeastern corner. All other grid cells can be specified by their respective
indices in the same way, both for surface fields (Figure 4.3.1a) and single level
fields (Figure 4.3.1b).
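In NumPy, which stores such fields as arrays, the same index pairs apply directly. A small sketch with dummy values:

```python
import numpy as np

# A 2-D field with 36 rows (latitude, north to south) and 72 columns
# (longitude, west to east), filled with dummy values 0, 1, 2, ...
field = np.arange(36 * 72).reshape(36, 72)

print(field[0, 0])    # northwestern corner -> 0
print(field[0, 71])   # northeastern corner -> 71
print(field[35, 0])   # southwestern corner -> 2520
print(field[35, 71])  # southeastern corner -> 2591
```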

It is important to remember that indices always start with 0.

The magnified grid cell in Figure 4.3.1a has the grid cell boundaries -85° and -90°
latitude and -180° (or 0°) and -175° (or 5°) longitude. The data point associated with
this grid cell is geographically located at the centre of the cell at -87.5° latitude
and -177.5° (or 2.5°) longitude. The concept of centred data points is similar for a
single atmospheric layer grid box. The magnified grid box shown in Figure 4.3.1b
has the grid box boundaries -85° and -90° latitude and 175° (or 355°) and 180° (or 360°)
longitude. Accordingly, the geographical location of the data point is at -87.5° latitude
and 177.5° (or 357.5°) longitude. In addition, the data point here is vertically raised
to the middle of the depth of the atmospheric layer. Different types of atmospheric
levels will be discussed in more detail in Section x.x¹.

Figure 4.3.2: Schematic representation of a 3-dimensional 5° by 5° spatial resolution atmospheric
data field with 17 vertical layers. Latitude and longitude coordinates are indicated in green. Grid
box indices are indicated in red.

Stacked atmospheric layers make up the atmosphere in the model and thereby
introduce a third dimension to the data field (Figure 4.3.2). As a result three indices
are now required to reference a grid box. The spatial indices associated with
longitude and latitude are the same as for the surface and single atmospheric layer
data field (Figure 4.3.1). In addition, a third index is added, usually located at the
first position within the index triplet. In the example depicted in Figure 4.3.2 the
index associated with the highest vertical layer in the atmosphere is 0 and the index
associated with the lowest atmospheric layer is 16. Note that for 17 vertical layers the
index goes from 0 to 16. The grid box associated with the westernmost, southernmost
and highest atmospheric level has the index triplet [0, 35, 0]. Similar, the grid box
associated with the easternmost, northernmost and lowest atmospheric level has the
index triplet [16, 0, 71]. Every single grid box or its associated data value within
this 3-dimensional data field can be referenced by using this system of index triplets
(Figure 4.3.2).
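The triplet indexing can be sketched with a NumPy array ordered [level, latitude, longitude] as in Figure 4.3.2 (the inserted values are dummies for illustration):

```python
import numpy as np

# A 3-D atmospheric field ordered [level, latitude, longitude]:
# 17 levels (top to bottom), 36 latitudes (north to south),
# 72 longitudes (west to east).
field = np.zeros((17, 36, 72))

field[0, 35, 0] = 1.0   # highest level, southernmost, westernmost box
field[16, 0, 71] = 2.0  # lowest level, northernmost, easternmost box

print(field.shape)  # (17, 36, 72)
```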

Data fields are stored in data files, most likely in netCDF format (see Section 2.5.4).
It is important to note that the order in which the dimensions and indices (longitude,
latitude, levels) are stored may vary and differ from the examples presented in Figure
4.3.1 and Figure 4.3.2. While the index for the longitude dimension will always start at
the westernmost position with 0 and increase towards the east the order of indices for
the latitude dimension and vertical levels may be reversed. This means the latitude
index may start at the southernmost position with 0 and increase towards the north,
and the index for the vertical levels may start with 0 at the lowest level and increase
with each level upwards.
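When a file uses the reversed (south-to-north) latitude ordering, the field and its coordinate array can be flipped to the north-to-south convention used in the examples here. A minimal NumPy sketch with a toy 4-by-2 field:

```python
import numpy as np

# Toy example: 4 latitudes ordered south to north, 2 longitudes.
lats = np.array([-87.5, -82.5, 82.5, 87.5])
field = np.arange(4 * 2).reshape(4, 2)

# Flip the latitude axis (axis 0) so index 0 is the northernmost row.
lats_ns = lats[::-1]
field_ns = np.flip(field, axis=0)

print(lats_ns[0])      # 87.5
print(field_ns[0, 0])  # 6 (the value that was in the last row)
```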
In addition, the index positions within the index pairs or index triplets may change.
In the examples presented here the index position is [latitude, longitude] for the 2-
dimensional fields (Figure 4.3.1) and [level, latitude, longitude] for the 3-dimensional
field (Figure 4.3.2). How the order of indices and their position within the index pair
or index triplet can be identified in a data file will be discussed in more detail in
Section xx².

4.4 The Time Dimension

Figure 4.5.1: Schematic of the time dimension in gridded datasets. Timesteps are indicated as T1 to
Tn corresponding to the indices 0 to n-1.

In addition to two (longitude and latitude) or three (longitude, latitude and vertical)
spatial dimensions a dataset can have a time dimension. In that case a 2 or 3-
dimensional data field will be associated with each timestep. An example is shown
in Figure 4.5.1 for a field with three spatial dimensions. The values associated with
each individual small grid box are likely to change between one timestep and the
next. Similar to the indexing of the spatial and vertical dimensions, the index of the
first timestep is also 0.
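In array terms the time dimension simply adds a leading axis. A sketch assuming an ordering of [time, level, latitude, longitude] (the ordering may differ between files):

```python
import numpy as np

# A 4-D field ordered [time, level, latitude, longitude]
# (the ordering is an assumption; it may differ between files).
data = np.zeros((3, 17, 36, 72))  # 3 timesteps

first_step = data[0]  # the first timestep has index 0
print(first_step.shape)  # (17, 36, 72)
```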

4.5 Horizontal Resolutions and Grid Types


This section provides a brief overview of some aspects related to model spatial
resolutions and grid types. The spatial resolution at which output data is available
may differ from the resolution at which the model is actually run. In addition, varying
types of grids are used when output variables are saved and some of the differences
are discussed in the following sub-sections.

4.5.1 Spectral Resolution


The atmospheric part of most global climate models consists of a spectral model
that uses spherical harmonics to calculate model variables instead of calculating
them for grid points. Spectral models are computationally more efficient than grid
point models. The model resolution is usually expressed in the form
T<spectral_resolution>L<number_of_levels> whereby T is the spectral resolution of the model and
L is the number of vertical levels. The vertical levels part is, however, often omitted.
The spectral resolution of some reanalysis models is listed in Table 4.5.1.1.

Table 4.5.1.1: Spectral resolution of some reanalysis models.

Reanalysis      Spectral Resolution   Resolution [km]   Grid Resolution [°]

ERA-40          T159L60               125               1.125 x 1.125
ERA-Interim     T255L60               80                0.75 x 0.75
ERA5            T639L137              30                0.28125 x 0.28125
NCEP CFSR       T382L64               38                0.25 x 0.25, 0.5 x 0.5
NCEP-DOE R2     T62L28                -                 2.5 x 2.5
MERRAv2         -                     50                0.5 x 0.625
JRA-55          TL319L60              55                1.25 x 1.25
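For several of the truncations in Table 4.5.1.1 the listed grid resolution follows the rough rule of thumb that the grid spacing in degrees is approximately 180°/(T + 1). Note this is only an approximation, and distributed output grids (e.g., for ERA-Interim or JRA-55) may be coarser than the rule suggests:

```python
# Rough rule of thumb relating a spectral truncation T to an
# equivalent grid spacing in degrees. This is an approximation;
# distributed output grids may be rounded or coarsened.
def approx_spacing_deg(truncation):
    return 180.0 / (truncation + 1)

print(approx_spacing_deg(159))  # 1.125   (matches the ERA-40 grid)
print(approx_spacing_deg(639))  # 0.28125 (matches the ERA5 grid)
```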

4.5.2 Full and Reduced Gaussian Grid


In contrast to upper air variables which are mostly calculated by a spectral model (Sec-
tion 4.5.1), surface variables are often computed on Gaussian grids. Computations on
Gaussian grids are more efficient than computations on regular latitude-longitude
grids. Each Gaussian grid point can be referenced by using latitude and longitude

coordinates (orthogonal coordinate system). On a Gaussian grid the grid points in the
zonal direction (along each parallel) are equally spaced. This means that the distance
between two adjacent degrees of longitude is the same for a given latitude. The grid
points in the meridional direction (along each meridian) are unequally spaced. This
means that the distance between adjacent degrees of latitude varies with distance
between the equator and the poles. The unequal spacing between grid points in the
meridional direction is determined by Gaussian quadrature³ calculations. Gaussian
grids have no grid point at the poles or on the Equator. However, the distances
between lines of latitude are symmetrical about the Equator.
There are two types of Gaussian grids. First, the full Gaussian grid (also referred to
as regular Gaussian grid). On a full (regular) Gaussian grid the number of zonal grid
points (grid points along each parallel) is always the same regardless of the latitude.
Second, the reduced Gaussian grid (also referred to as thinned or quasi-regular
Gaussian grid). On a reduced (thinned or quasi-regular) Gaussian grid the number
of zonal grid points (grid points along each parallel) decreases towards the poles.
Gaussian grids are labelled using the N value whereby N is the number of latitude
grid points between the Equator and the poles. The total number of latitude grid
points between the poles is, therefore, 2N. The total number of longitude grid points
is usually 4N on a full Gaussian grid, and also along latitude circles located close to
the Equator on a reduced Gaussian grid.
Table 4.5.2.1 illustrates the concepts described above for an N80 Gaussian grid (e.g.,
ERA-40 surface fields). Similar tables provided by ECMWF can be found for N320⁴,
N640⁵ and N1280⁶.
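Gaussian latitudes can be computed directly as the arcsines of the Gauss-Legendre quadrature nodes. The following NumPy sketch reproduces the first N80 latitude from Table 4.5.2.1:

```python
import numpy as np

# Gaussian latitudes for an N80 grid: 2N = 160 Gauss-Legendre nodes,
# converted from sine-of-latitude to degrees and ordered north to south.
N = 80
nodes, _weights = np.polynomial.legendre.leggauss(2 * N)
lats = np.degrees(np.arcsin(nodes))[::-1]

print(lats[0])    # approximately 89.1416, the first grid point from the pole
print(len(lats))  # 160
```

Note there is no grid point at the poles or on the Equator, and the latitudes are symmetrical about the Equator, as described above.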

³https://en.wikipedia.org/wiki/Gaussian_quadrature
⁴https://confluence.ecmwf.int/display/FCST/Gaussian+grid+with+320+latitude+lines+between+pole+and+equator
⁵https://confluence.ecmwf.int/display/FCST/Gaussian+grid+with+640+latitude+lines+between+pole+and+equator
⁶https://confluence.ecmwf.int/display/FCST/Gaussian+grid+with+1280+latitude+lines+between+pole+and+
equator

Table 4.5.2.1: Example of full and reduced N80 Gaussian grid points (adapted from BADC).

Grid Point      Associated      Grid Points along a      Grid Points along a
Number from     Latitude [°]    Latitude Circle for a    Latitude Circle for a
Pole                            Reduced Gaussian Grid    Full Gaussian Grid
1 89.1416 18 320
2 88.0294 25 320
3 86.9108 36 320
4 85.7906 40 320
5 84.6699 45 320
6 83.5489 54 320
7 82.4278 60 320
8 81.3066 64 320
9 80.1853 72 320
10 79.0640 72 320
11 77.9426 80 320
12 76.8212 90 320
13 75.6998 96 320
14 74.5784 100 320
15 73.4570 108 320
16 72.3356 120 320
17 71.2141 120 320
18 70.0927 128 320
19 68.9712 135 320
20 67.8498 144 320
21 66.7283 144 320
22 65.6069 150 320
23 64.4854 160 320
24 63.3639 160 320
25 62.2425 180 320
26 61.1210 180 320
27 59.9995 180 320
28 58.8780 192 320
29 57.7566 192 320
30 56.6351 200 320
31 55.5136 200 320
32 54.3921 216 320
33 53.2707 216 320
34 52.1492 216 320
35 51.0277 225 320
36 49.9062 225 320
37 48.7847 240 320
38 47.6632 240 320
39 46.5418 240 320
40 45.4203 256 320
41 44.2988 256 320
42 43.1773 256 320
43 42.0558 256 320
44 40.9343 288 320
45 39.8129 288 320
46 38.6914 288 320
47 37.5699 288 320
48 36.4484 288 320
49 35.3269 288 320
50 34.2054 288 320
51 33.0839 288 320
52 31.9624 288 320
53 30.8410 300 320
54 29.7195 300 320
55 28.5980 300 320
56 27.4765 300 320
57 26.3550 320 320
58 25.2335 320 320
59 24.1120 320 320
60 22.9905 320 320
61 21.8690 320 320
62 20.7476 320 320
63 19.6261 320 320
64 18.5046 320 320
65 17.3831 320 320
66 16.2616 320 320
67 15.1401 320 320
68 14.0186 320 320
69 12.8971 320 320
70 11.7756 320 320
71 10.6542 320 320
72 9.53270 320 320
73 8.41120 320 320
74 7.28970 320 320
75 6.16820 320 320
76 5.04670 320 320
77 3.92520 320 320
78 2.80370 320 320
79 1.68220 320 320
80 0.56070 320 320

4.5.3 Regular Latitude-Longitude Grid


In contrast to the full or reduced Gaussian grid, the grid points of a regular
latitude-longitude grid are evenly spaced in both the zonal and meridional direction.
The spacing in degrees between two adjacent latitude grid points does not, however,
have to be the same as the spacing between two adjacent longitude grid points. Quite
often model variables are saved on a regular latitude-longitude grid. Some regular
grid resolutions are shown in Table 4.5.1.1.

4.6 Vertical Level Types


Climate and NWP models divide the atmosphere into distinct vertical layers. What
the layers represent depends on which vertical coordinate system is used. The

differences between some of the more common vertical coordinate systems will be
discussed in the following sub-sections.

4.6.1 Pressure, Potential Temperature and Potential Vorticity Levels
Atmospheric model data are commonly provided on a pressure coordinate system.
Each layer is associated with a specific atmospheric pressure and referred to as
a pressure level. Pressure levels can be seen as surfaces of constant pressure, of
which the lowest is usually a layer close to the surface (e.g., 1000 hPa). The pressure
associated with each level decreases with distance from the surface. Most variables
of the upper atmosphere (e.g., air temperature, relative and specific humidity and
wind components) will be available on pressure levels.
The absolute altitude above mean sea-level of data points on pressure levels will
vary geographically depending on the distribution of high and low pressure. For
instance, in a region of high pressure the 500 hPa pressure level will be located
at a higher elevation compared to a region of low pressure. If geopotential height
fields are available then they can be used as an approximation of gravity-adjusted
heights of pressure level data points. If only the geopotential field is available then
the elevation can be calculated by dividing the geopotential values (usually given in
m²/s²) by gravitational acceleration (g = 9.80665 m/s²).
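The conversion is a single division; the 500 hPa geopotential value below is just an illustrative mid-latitude magnitude, and the helper name is made up:

```python
# Convert geopotential (m^2/s^2) to geopotential height (m) by
# dividing by the standard gravitational acceleration.
G = 9.80665  # m/s^2

def geopotential_to_height(geopotential):
    return geopotential / G

# An illustrative 500 hPa geopotential of 54000 m^2/s^2 corresponds
# to a geopotential height of roughly 5506 m:
print(round(geopotential_to_height(54000.0)))
```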
Similar to data on pressure levels, some variables may be made available on potential
temperature and potential vorticity levels. As with pressure, the layers represent
surfaces of equal potential temperature or potential vorticity.

4.6.2 Sigma (Model) Levels


One problem with data on pressure levels as described in Section 4.6.1 is that pressure
levels close to the surface (e.g., at 1000 hPa) may ‘cut through’ topographically
elevated terrain such as mountains. In contrast, sigma levels (also referred to as model
levels) are terrain-following. Figure 4.7.2.1 shows sigma levels (green) for a version of
the ECMWF ensemble forecast model system that has 31 levels.
Sigma levels represent atmospheric layers with respect to the ratio between the
atmospheric pressure and the pressure at the surface. At the surface sigma is equal

to 1. Sigma decreases with increasing altitude as the atmospheric pressure decreases.
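The definition sigma = p/ps can be sketched in a few lines of Python, showing how the same sigma level sits at different pressures over low terrain and over an elevated plateau (the surface pressures are illustrative values):

```python
# Sigma is the ratio of the pressure at a level to the surface
# pressure, so the pressure on a sigma level follows the terrain.
def sigma_to_pressure(sigma, surface_pressure_hpa):
    return sigma * surface_pressure_hpa

# The sigma = 0.5 level over low terrain (ps ~ 1000 hPa) and over a
# high plateau (ps ~ 700 hPa):
print(sigma_to_pressure(0.5, 1000.0))  # 500.0 hPa
print(sigma_to_pressure(0.5, 700.0))   # 350.0 hPa
```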


In data files sigma levels are likely to be referred to by using level numbers starting
with 1 up to the total number of levels.

Figure 4.7.2.1: Hybrid sigma-pressure levels used by the ECMWF model. (a) The elevation of the
model levels (green lines; the example shows levels from the 31 level model; level indices k in
green) changes with surface pressure (black curve at the bottom). The data value for a given
pressure value p can be located at different levels in the grid (the red line marks the location
of p = 600 hPa). (b) Example of how the surface orography affects the vertical displacement
of the grid points in a vertical section. (Source: Three-dimensional visualization of ensemble
weather forecasts - Part 1: The visualization tool Met.3D (version 1.0) - Scientific Figure on
ResearchGate. Available from: https://www.researchgate.net/figure/Hybrid-sigma-pressure-levels-
used-by-the-ECMWF-model-a-The-elevation-of-the-model_fig9_307835524 [accessed 5 Aug, 2019];
available via license: Creative Commons Attribution 4.0 International)

4.6.3 Sigma-Hybrid Levels


While sigma levels are terrain-following close to the surface, the impact of
topography decreases with altitude up to a point where the topography no longer
affects the atmosphere. Modern atmospheric models therefore use a mix of both

pressure levels and sigma levels as a vertical coordinate system referred to as hybrid
levels or hybrid sigma-pressure levels.
Figure 4.7.3.1 shows hybrid sigma-pressure levels (blue) for a version of the ECMWF
forecast model system that has 91 levels. Close to the surface the levels are terrain-
following hybrid sigma-pressure levels. At approximately midway through the
atmosphere the levels transition to pure pressure levels.
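On hybrid levels the pressure at a level is commonly computed as p = A + B · ps, where the coefficient A (in Pa) dominates aloft and the dimensionless coefficient B dominates near the surface. The coefficients in the sketch below are made-up illustrative values, not actual ECMWF model coefficients:

```python
# Hybrid sigma-pressure level: p = A + B * ps.
# The A and B values used here are illustrative only.
def hybrid_pressure(a_pa, b, surface_pressure_pa):
    return a_pa + b * surface_pressure_pa

print(hybrid_pressure(0.0, 1.0, 101325.0))     # near the surface: pure sigma behaviour
print(hybrid_pressure(5000.0, 0.0, 101325.0))  # aloft: a pure 50 hPa pressure level
```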

Figure 4.7.3.1: The 91 Sigma levels used in ENS configuration of the atmospheric model.
The 137 level configuration is similarly distributed but with a relatively higher verti-
cal resolution. Sigma levels are terrain-following at lower levels and become constant
pressure levels for the upper troposphere and above. (Source: ECMWF. Available from:
https://confluence.ecmwf.int/display/FUG/Grid+point+Resolution [accessed 7 Aug, 2019])
5. The netCDF File Format
5.1 Introduction to the netCDF File Format
The netCDF file format (Section 2.5.4) has become the most commonly used data
file format for saving gridded climate data in recent years. The first step in climate
data analysis after obtaining access to data files is to get a good understanding of the
contents of the file. It is essential to understand how the data stored within netCDF
files are organised and what the data represent as this is the basis for any subsequent
data operations. The most important questions to ask of a data file are as follows:

• What temporal and spatial dimensions are associated with the data fields?
• What is the spatial resolution and what spatial domain is covered?
• What is the temporal resolution and what time period is covered?
• Which data variables are saved in the file?
• What units are the data variables saved in?

The variable names and variable dimensions are especially important as these are
needed to read the data correctly into analysis software packages such as Python.
In addition, it may be helpful to find out which time unit and reference time are
used (discussed later in more detail). All the information needed to answer
the above questions is stored in the netCDF file headers, sometimes also called file
metadata. The netCDF file headers describe most aspects of the data the file contains,
which is why this data format is referred to as self-describing.

5.2 netCDF File Headers


Three tools are introduced here that allow the exploration of netCDF file headers.
These are the netCDF utility tool ncdump, the Climate Data Operators (CDO)
package and the quick-look tool ncview. All three tools can do much more than
The netCDF File Format 77

just read netCDF file headers but their use is described here only with regards to
that purpose. The difference between the three tools lies in the way the file header
information is presented. Which tool to use depends on personal preference as well as
the information one is interested in. For instance, ncdump displays a well-structured
overview of the file header in the terminal window whereas the CDO package is more
useful for looking at date and time information because it automatically converts
the timestamps into a more sensible ‘human-readable’ format. ncview is useful for
visually inspecting the geographical domain, spatial pattern in the data and data
value ranges of the fields stored in the netCDF file allowing, for instance, the user to
quickly check if a CDO file operation has produced the expected results.

5.2.1 Exploring netCDF File Headers with ncdump


The netCDF utility tool ncdump developed at Unidata¹ can be used to display a text
representation of the netCDF file headers inside the terminal window. To list only
the file header information rather than all of the data values, the -h option
is used with the ncdump command. Running the ncdump command on a netCDF file
without the -h option will display all data values (could be millions!). The following
example shows the ncdump command to list the file headers of a file named data.nc.

ncdump -h data.nc

Executing the above command will generate text output inside the terminal similar
to the following.
Example output from a ncdump -h command

netcdf data {
dimensions:

longitude = 480 ;
latitude = 241 ;
time = UNLIMITED ; // (408 currently)
variables:
float longitude(longitude) ;
longitude:standard_name = "longitude" ;
¹https://www.unidata.ucar.edu/software/netcdf/netcdf-4/newdocs/netcdf/ncdump.html

longitude:long_name = "longitude" ;
longitude:units = "degrees_east" ;
longitude:axis = "X" ;
float latitude(latitude) ;
latitude:standard_name = "latitude" ;
latitude:long_name = "latitude" ;
latitude:units = "degrees_north" ;
latitude:axis = "Y" ;
double time(time) ;
time:standard_name = "time" ;
time:long_name = "time" ;
time:units = "hours since 1900-01-01 00:00:00" ;
time:calendar = "standard" ;
double t2m(time, latitude, longitude) ;
t2m:long_name = "2 metre temperature" ;
t2m:units = "K" ;
t2m:_FillValue = -32767. ;

// global attributes:
:CDI = "Climate Data Interface version 1.6.0 (http://code.zmaw.\
de/projects/cdi)" ;
:Conventions = "CF-1.0" ;
:history = "Fri May 09 16:49:11 2014: cdo monmean erai_t2m_1979\
_2012.nc ./erai_mm_t2m_1979_2012.nc\n", "Tue Nov 26 15:36:25 2013: cdo -b F64 -\
mergetime erai_t2m_00.nc erai_t2m_06.nc erai_t2m_12.nc erai_t2m_18.nc erai_t2m_\
1979_2012.nc\n", "2013-11-26 15:19:53 GMT by mars2netcdf-0.92" ;
:CDO = "Climate Data Operators version 1.6.0" ;
}

The file name is indicated in line 1. In lines 2 to 6 the dimensions of the data in the
file are shown. In this example the file has three dimensions: two spatial
dimensions (longitude and latitude) and one time dimension (time). The longitude
dimension has 480 data points and the latitude dimension has 241 data points. Setting
the time dimension to UNLIMITED is quite common as it allows additional
timesteps to be appended to the netCDF file structure. The current number of timesteps is 408.
Lines 7 to 26 provide information about the variables included in the file. First, details
about the variables associated with the three dimensions are listed in lines 8 to 22
including the variables longitude (line 8), latitude (line 13) and time (line 18). These

variables are associated with the dimensions and are also referred to as coordinate
variables. Following the coordinate variables, details about the data variable are
shown in lines 23 to 26 (netCDF files can hold multiple data variables). The data
variable in this example is called t2m (line 23).
The general format in which variable information is presented is the following. First,
an indented single line shows the data type, the variable name and in brackets the
dimension(s) associated with the variable. Second, further indented, a list of variable
attributes and their values is presented for each variable. In the following paragraphs
the variables and their attributes will be discussed in some more detail.
Line 8 shows that the coordinate variable longitude is of the data type float (floating
point) and that it is associated with a single dimension named longitude. Note that for
coordinate variables the variable name and the associated dimension name
are often the same - they should not be confused. The longitude variable contains 480
longitude values. In this example the longitude variable has four attributes (lines
9 to 12) named standard_name, long_name, units and axis which provide additional
information about the variable. The standard_name and long_name attributes are both
set to longitude and the units attribute is set to degrees_east. The longitude variable
represents the X axis on a map.

When writing netCDF files it is important to be familiar with CF conventions
(Section 5.4) for the standard_name and units attributes. The standardisation
allows for the clear identification of variables when exchanging data whereas
the long_name attribute may be assigned by the institution (e.g., for use in
plot labels).

The latitude variable information (lines 13 to 17) looks very similar to that of the
longitude variable. The main difference is that the units attribute of the latitude
variable is degrees_north and that the latitude variable represents the Y axis on a
map.
The variable time (line 18) is associated with the time dimension which means the
time variable stores 408 time values of the data type double (double precision). The
time variable attributes (lines 19 to 22) show that the standard_name and long_name
attributes are both set to ‘time’.

The units attribute of the time variable is especially important to understand as it
often leads to some confusion. The time variable’s units attribute shows that time
values are given in hours since 1900-01-01 00:00:00. It is common practice to save
time information in netCDF files in relation to a given reference time. This is also
referred to as a relative time axis. Other frequently used time units are days since
... or seconds since .... Saving time information this way has its advantages but the
time values are often hard to interpret. For example, it may take some time to work
out what date and time corresponds to 1025610 hours since midnight of 1 January
1900 (the answer is 18 UTC on 31 December 2016). For that reason, ncdump is not the
best tool to explore date and time information from netCDF files. To look at date
and time information in netCDF files in a more convenient way it is suggested to
use the CDO operator sinfon (see Section 5.2.2) which makes use of the time variable
attributes units and calendar to calculate the correct date and time for each timestep.
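For a standard calendar, the conversion that such tools perform can be reproduced with the Python standard library:

```python
from datetime import datetime, timedelta

# Decode a relative time value with units
# "hours since 1900-01-01 00:00:00" (standard calendar).
reference = datetime(1900, 1, 1)
value = 1025610  # hours since the reference time

print(reference + timedelta(hours=value))  # 2016-12-31 18:00:00
```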
Lines 23 to 26 show details about the data variable. The variable name is t2m and the
data type is double (double precision). Three dimensions are given in brackets after
the variable name t2m indicating that this is a 3-dimensional field associated with
the dimensions time, latitude and longitude. The long_name attribute is set to 2 metre
temperature and the units attribute is set to K (Kelvin). Based on the three dimension
sizes it can be deduced that the t2m variable holds 47197440 (more than forty-seven
million) individual 2m air temperature values (408 x 241 x 480). Any potentially
missing values are set to -32767.0 as indicated by the t2m variable attribute
_FillValue (line 26).
Line 28 marks the beginning of the last section of the ncdump -h command output
which is called global attributes. It provides additional information about the
netCDF file and can include information about the origin and processing of the data.
Particular attention should be paid to the history global attribute (line 32). It shows
information about any processing the data have undergone since the file was initially
created. For instance, every CDO command that changes the netCDF file will be
saved in this global history attribute.
The ncdump command has a few more options than the -h option (for header information).
A complete list of options and their usage can be looked up in the manual pages (see
the section on manual pages). For instance, some information about the spatial domain
covered by the data field can be found by printing out the longitude and latitude
variables using the -v option with the ncdump command as shown in the following
command. Note that the variable names longitude and latitude are separated by a
comma without any spaces.

ncdump -v longitude,latitude data.nc

5.2.2 Exploring netCDF File Headers with CDO


Climate Data Operators (CDO) is a command line tool used for climate data analysis.
CDO will be introduced in full in Section 8. Here, four CDO operators will be
introduced that can be used to extract netCDF file header information. The operators
are info, infon, sinfo and sinfon. The s at the beginning of the sinfo and sinfon operators
stands for short information list. The n at the end of the infon and sinfon operators
stands for list by parameter name. What follows is an example of the use of the
sinfon operator applied to a netCDF file named data.nc (the same file as used in
the ncdump example in Section 5.2.1).

cdo sinfon data.nc

Executing the above command will produce output in the terminal window that may
look similar to the following.
Example output from a cdo sinfon command

1 File format: netCDF


2 -1 : Institut Source Ttype Levels Num Gridsize Num Dtype : Parameter \
3 name
4 1 : unknown unknown instant 1 1 115680 1 F64 : t2m
5 Grid coordinates :
6 1 : lonlat > size : dim = 115680 nx = 480 ny = 241
7 longitude : first = 0 last = 359.25 inc = 0.75 degree\
8 s_east circular
9 latitude : first = 90 last = -90 inc = -0.75 degrees\
10 _north
11 Vertical coordinates :
12 1 : surface : 0
13 Time coordinate : 408 steps
14 RefTime = 1900-01-01 00:00:00 Units = hours Calendar = standard
15 YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:mm:ss YYYY-MM-DD hh:\
16 mm:ss
17 1979-01-31 18:00:00 1979-02-28 18:00:00 1979-03-31 18:00:00 1979-04-30 18:\
18 00:00
19 1979-05-31 18:00:00 1979-06-30 18:00:00 1979-07-31 18:00:00 1979-08-31 18:\
20 00:00
21 1979-09-30 18:00:00 1979-10-31 18:00:00 1979-11-30 18:00:00 1979-12-31 18:\
22 00:00
23 ...
24 2012-09-30 18:00:00 2012-10-31 18:00:00 2012-11-30 18:00:00 2012-12-31 18:\
25 00:00
26 cdo sinfon: Processed 1 variable over 408 timesteps ( 0.10s )

Lines 1 to 4 display general file information in the form of a table with the table
headers in line 2 (wrapped onto line 3) and the associated values in line 4. The
table headers include institute (Institut), data source (Source), type of statistical
processing (Ttype), number of levels (Levels), z-axis number (Num), horizontal grid
size (Gridsize), grid size number (Num), data type (Dtype) and parameter identifier
(Parameter name). From the information in line 4 it can be deduced that the file
contains data on a single level, that the horizontal grid has a total of 115680 grid
points, that the data are saved as 64-bit floating point values (F64) and that the
parameter name is t2m.

Note that a netCDF file may contain more than one variable.

Lines 5 to 10 list details about the horizontal grid. Line 6 shows that the grid type
is lonlat, meaning the data are on a regular longitude/latitude grid (see the section
on netCDF grid types for more details). The number of grid boxes in the longitude
direction (nx = 480) and in the latitude direction (ny = 241) reveals that there are
115680 data points. Lines 7 to 10 show the range of the longitude and latitude
variables, respectively, as well as their associated spatial resolution. The data field
is on a global grid with a 0.75° by 0.75° spatial resolution. Note that the first
longitude value is 0° and the last is 359.25°, indicating a Pacific-centred global field
(see Figure 4.2.2a).
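These grid and time details are internally consistent, which can be checked with a few lines of Python using the numbers reported above:

```python
# Numbers reported by cdo sinfon for the example file.
nx, ny = 480, 241   # grid boxes in longitude and latitude
inc = 0.75          # grid spacing in degrees

print(nx * ny)                 # 115680 grid points in total
print(0 + (nx - 1) * inc)      # 359.25 -> last longitude when the first is 0
print(90 - (ny - 1) * inc)     # -90.0  -> last latitude when the first is 90
print((2012 - 1979 + 1) * 12)  # 408 monthly timesteps from 1979 to 2012
```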
Useful information displayed in the beginning of the output includes the grid type
lonlat (regular lat/lon grid), the number of longitude (480) and latitude (241) grid
boxes, the first and last longitude (0° to 359.25°) and latitude (90° to -90°) values, the
spatial resolution (0.75° by 0.75°) and the number of timesteps (408). In addition, the
date/time information for each timestep is listed. The final line provides information
about the number of variables that were processed (1, named t2m) and the time the
processing took (0.1s).

Note that the date/time information includes details about the hour, minutes
and seconds (hh:mm:ss). The example file used here contains monthly
fields. Therefore, the details of time from days down to seconds should be
ignored as they are artefacts from the way CDO handles time information.

In contrast to the output of the sinfon operator (short list), the output of the infon
operator (long list) looks somewhat different: for each timestep, summary statistics
are provided. The Minimum, Mean and Maximum values are useful to quickly check
whether the range of the data makes sense (for instance, after the file was manipulated
with a CDO operator).
Some other useful operators that provide information about the data content of a
netCDF file are listed in Table 5.2.2.1.

Table 5.2.2.1: Some additional CDO operators that provide useful netCDF file information.

CDO operator Shows information about the …


npar Number of parameters
nlevel Number of levels
nyear Number of years
nmon Number of months
ndate Number of dates
ntime Number of timesteps
showformat File format
showcode Code numbers
showname Variable names
showstdname Standard names
showlevel Levels
showyear Years
showmon Months
showdate Dates
showtime Times
showtimestamp Timestamps

5.2.3 Exploring netCDF File Headers with ncview


A lightweight quick-look tool called Ncview² can be used to check variable names,
visualise data fields, click through time steps or create time series for specific grid
boxes. To open a netCDF file use the ncview command followed by the file name.

ncview data.nc

The resulting graphical windows will look similar to the ones shown in Figure
5.2.3.1. The main window allows a variable to be selected which will open in a new
window showing, for instance, a map. Clicking a specific location on the map will
open another window showing the time series for that location. Colour maps can be
adjusted.
²http://meteora.ucsd.edu/~pierce/ncview_home_page.html

Figure 5.2.3.1: Ncview graphical windows. Image from David Pierce’s webpage at UC San Diego’s
Scripps Institution of Oceanography (http://meteora.ucsd.edu/~pierce/ncview_home_page.html; ac-
cessed 25 Apr, 2020).

While ncview is a useful tool to have a quick look at netCDF data files it is
too limited in its functionality to allow serious data analysis or plotting.

5.3 Packed netCDF Files


Climate data require a lot of storage space. To reduce the size of netCDF files, the
data values in a file are sometimes packed. Packing reduces the file size by converting
data saved as floating point values (floats) to integer values, which require less
storage space. For the conversion a scale factor (scale_factor) and an added offset
(add_offset) value are applied to the input data using the following formula.

unpacked_data_value = packed_data_value * scale_factor + add_offset
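Applied in Python, the formula looks as follows. The scale factor, offset and packed values here are purely illustrative; in a real file they come from the variable's attributes and data.

```python
# Illustrative packing parameters; in a real file they are read from the
# variable's scale_factor and add_offset attributes.
scale_factor = 0.001
add_offset = 250.0

# Packed values as they would be stored on disk (16-bit integers).
packed = [-32767, 0, 20000]

# unpacked_data_value = packed_data_value * scale_factor + add_offset
unpacked = [p * scale_factor + add_offset for p in packed]
print(unpacked)  # roughly [217.233, 250.0, 270.0]
```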

To find out if a netCDF file contains packed or unpacked data values the netCDF
utility tool ncdump can be used (see Section 5.2.1). If a netCDF file is packed then the
output of ncdump -h will show a scale_factor and add_offset attribute listed as part
of the variable attributes similar to the example below.
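The exact attribute values differ from dataset to dataset; a hypothetical excerpt from the variable section of an ncdump -h output of a packed file might look like this (all numbers are purely illustrative):

```
short t2m(time, latitude, longitude) ;
        t2m:scale_factor = 0.0012 ;
        t2m:add_offset = 258.4 ;
        t2m:_FillValue = -32767s ;
        t2m:units = "K" ;
```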
Most software packages automatically detect whether a netCDF file is packed or not
and convert the data fields accordingly when the file is read in. In this case there is
no need to worry about it. However, sometimes when applying CDO commands to
packed netCDF files the -b <bits> option (see the section on CDO options) needs to
be used, whereby <bits> is either F32 or F64, setting the precision of the output data
to 32-bit or 64-bit floating point values. Without the -b option the CDO command
may return an error message. The output file generated by the CDO command will
be an unpacked netCDF file.

5.4 netCDF File Format Conventions


The contents of a netCDF file should be organised and labelled following specific
conventions. Such conventions allow interoperability between data providers, data
users and application developers. A good overview of different conventions can be
found on the Unidata webpage³; the CF Convention⁴ is the recommended
standard.
It is good practice when generating original netCDF files to follow the CF Con-
ventions as it allows data analysis software or visualisation packages such as R,
Matlab, Python, IDL, NCL or Panoply to interpret the contents correctly. Online CF
compliance checkers can be found on the CF Conventions and Metadata⁵ webpage.
³https://www.unidata.ucar.edu/software/netcdf/conventions.html
⁴http://cfconventions.org
⁵http://cfconventions.org/compliance-checker.html
6. Python - Concepts and Work
Environment
6.1 Python Overview
Python has become one of the most popular programming languages in recent years.
It is a versatile high-level programming language with an easy-to-understand syntax
and English-like statements. This makes Python code fast to write, easy to debug and
portable to other systems.
As an open source programming language Python can be used by anyone to develop
software or web applications for commercial and non-commercial purposes in
accordance with the license issued by the Python Software Foundation (PSF¹).
Before the Python programming language itself is introduced in Chapter 7 this chap-
ter covers some of the concepts and tools that underpin Python code development.

6.2 Python Concepts

6.2.1 Python Modules and Packages


The concept of modular building blocks underpins Python software development.
The smallest building block is a Python module which is just a single file that contains
valid Python code (a Python module can also be written in the C programming
language). Python files have the file extension .py. It is impractical to write large
Python applications in a single file and therefore they tend to be split up into
individual modules (files). Combining different modules to create a larger application
is referred to as packaging. The result is a Python package.
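As a minimal sketch of the module concept (the module name mytools and the function in it are invented for this illustration), the following creates a one-function module file and imports it:

```python
import importlib.util
import pathlib
import tempfile

# A module is just a file containing valid Python code.
module_source = "def kelvin_to_celsius(temperature):\n    return temperature - 273.15\n"

with tempfile.TemporaryDirectory() as tmpdir:
    # Write the module file mytools.py ...
    module_path = pathlib.Path(tmpdir) / "mytools.py"
    module_path.write_text(module_source)

    # ... and import it, as 'import mytools' would if the file were on the path.
    spec = importlib.util.spec_from_file_location("mytools", module_path)
    mytools = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mytools)

    print(mytools.kelvin_to_celsius(273.15))  # 0.0
```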
¹https://docs.python.org/3/license.html

More than 100,000 Python packages have been developed over the years for all kinds
of purposes. Some are well supported and actively developed while others are not.
The latter tend not to stand the test of time. For the purpose of climate computations
and visualisation only a small number of well-supported Python packages is needed
with each package serving a specific purpose (Table 6.2.1.1). For instance, the NumPy
package allows computations with multi-dimensional number arrays while the
Matplotlib package provides functionality for everything related to plotting data.

Table 6.2.1.1: Some of the Python packages commonly used in climate computing and visualisation.

Package Purpose
Cartopy² Geospatial data processing for creating maps and other geospatial data analyses.
CIS Tools³ Analysing, comparing and visualising earth system data.
IPython⁴ Powerful shell for interactive computing.
Iris⁵ Powerful, format-agnostic interface for working with multi-dimensional earth science data.
Matplotlib⁶ Cross-platform 2D plotting library and interactive environments.
MetPy⁷ Reading, visualizing, and performing calculations with weather data.
netCDF4⁸ Object-oriented Python interface to the netCDF version 4 library.
NumPy⁹ Powerful scientific computing on N-dimensional arrays.
Pandas¹⁰ Data analysis and manipulation tool.

²https://scitools.org.uk/cartopy/docs/latest/
³http://www.cistools.net
⁴https://ipython.org
⁵https://scitools.org.uk/iris/docs/latest
⁶https://matplotlib.org
⁷https://unidata.github.io/MetPy/latest/index.html
⁸https://anaconda.org/anaconda/netcdf4
⁹https://numpy.org/
¹⁰https://pandas.pydata.org
