
Introduction

Yao-Yuan Chuang

1
Outline
High Performance Computing
Scientific Computing
Parallel Computing
UNIX, GNU, and Linux
Cluster Environment
Bash
Linux Commands
Software development process

2
High-Performance
Computing
The term high performance computing (HPC) refers to
the use of (parallel) supercomputers and computer clusters,
that is, computing systems comprised of multiple (usually
mass-produced) processors linked together in a single
system with commercially available interconnects.

This is in contrast to mainframe computers, which are
generally monolithic in nature.

Because of their flexibility, power, and relatively low cost,
HPC systems increasingly dominate the world of
supercomputing. Usually, computer systems in or above the
teraflop region are counted as HPC computers.

3
Scientific Computing
Scientific computing (or computational science) is the
field of study concerned with constructing
mathematical models and numerical solution techniques
and using computers to analyze and solve scientific and
engineering problems.

In practical use, it is typically the application of
computer simulation and other forms of computation to
problems in various scientific disciplines.

Scientists and engineers develop computer programs,
application software, that model systems being studied and
run these programs with various sets of input parameters.
Typically, these models require massive amounts of
calculations (usually floating-point) and are often executed
on supercomputers or distributed computing platforms.

4
Problem Domains for Computational
Science/Scientific Computing

Numerical simulations
Numerical simulations have different objectives depending on the
nature of the task being simulated:
Reconstruct and understand known events (e.g., earthquakes, tsunamis,
and other natural disasters).
Predict future or unobserved situations (e.g., weather, sub-atomic
particle behavior).

Model fitting and data analysis
Appropriately tune models or solve equations to reflect observations,
subject to model constraints (e.g., oil exploration geophysics,
computational linguistics).
Use graph theory to model networks, especially those connecting
individuals, organizations, and websites.

Optimization
Optimize known scenarios (e.g., technical and manufacturing
processes, front end engineering).

5
Supercomputer
A supercomputer is a computer
that leads the world in terms of
processing capacity, particularly
speed of calculation, at the time
of its introduction. The term
"Super Computing" was first used
by New York World newspaper in
1929 to refer to large custom-
built tabulators IBM made for
Columbia University.

6
IBM Roadrunner, LANL, USA, 1.026
PFLOPS

7
Moore's Law
Moore's Law is the
empirical observation
made in 1965 that the
number of transistors
on an integrated circuit
for minimum
component cost doubles
every 24 months.

The most popular
formulation is of the doubling of the number
of transistors on integrated circuits every
18 months.
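As a rough illustration (an added restatement, not from the slide), the observation can be written as an exponential-growth formula with a doubling period $T$ of roughly 18–24 months:

$$N(t) = N_0 \cdot 2^{\,t/T}$$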

8
FLOPS
In computing, FLOPS (or flops) is an acronym
meaning FLoating point Operations Per Second.
This is used as a measure of a computer's
performance, especially in fields of scientific
calculations that make heavy use of floating point
calculations; similar to million instructions per second
(MIPS).

The standard SI prefixes can be used for this purpose,
resulting in such units as megaFLOPS (MFLOPS, 10^6),
gigaFLOPS (GFLOPS, 10^9), teraFLOPS (TFLOPS, 10^12),
petaFLOPS (PFLOPS, 10^15) and exaFLOPS (EFLOPS, 10^18).

As of 2008, the fastest supercomputer's performance
tops out at one petaflop.
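A hedged back-of-the-envelope example (the numbers are illustrative, not from the slide): theoretical peak FLOPS is roughly the product of core count, clock rate, and floating-point operations per cycle per core, e.g.

$$\text{peak} \approx \text{cores} \times \text{clock} \times \frac{\text{FLOPs}}{\text{cycle}} = 4 \times 3\,\text{GHz} \times 4 = 48\ \text{GFLOPS}$$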

9
www.top500.org

10
www.linuxhpc.org

11
www.winhpc.org

12
Intel's 80-Core Chip

Performance numbers at 4.27 GHz:
Peak performance: 1.37 SP TFLOPS
Explicit PDE solver: 1 SP TFLOPS
Matrix multiplication: 0.51 SP TFLOPS
2 independent FMAC units: 2 single-precision FLOPS per cycle
13
Compiler
A programming language is an
artificial language that can be used
to control the behavior of a
machine, particularly a computer.

A compiler is a computer program
(or set of programs) that translates
text written in a
computer language (the source
language) into another computer
language (the target language).
The original sequence is usually
called the source code and the
output called object code.
Commonly the output has a form
suitable for processing by other
programs (e.g., a linker), but it
may be a human-readable text file.
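As a minimal sketch (the file name and build commands below are illustrative, not from the slide), a compiler such as gcc turns C source code into object code, which a linker then combines into an executable:

```c
/* hello.c -- minimal source program; the compiler translates this
 * source language (C) into object code for the target machine.
 *
 * Typical build steps (illustrative):
 *   gcc -c hello.c -o hello.o    compile: source -> object code
 *   gcc hello.o -o hello         link: object code -> executable
 */
#include <stdio.h>

int main(void)
{
    printf("Hello, HPC world!\n");
    return 0;
}
```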

14
Debug and Performance
Analysis
Debugging is a methodical process of finding and reducing
the number of bugs, or defects, in a computer program or a
piece of electronic hardware, thus making it behave as
expected.

In software engineering, performance analysis (a field of
dynamic program analysis) is the investigation of a
program's behavior using information gathered as the
program runs, as opposed to static code analysis. The usual
goal of performance analysis is to determine which parts of
a program to optimize for speed or memory usage.

A profiler is a performance analysis tool that measures the
behavior of a program as it runs, particularly the frequency
and duration of function calls. The output is a stream of
recorded events (a trace) or a statistical summary of the
events observed (a profile).
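A minimal sketch of the idea of measuring a program's behavior as it runs (manual timing rather than a real profiler such as gprof; the function and file names are illustrative assumptions):

```c
/* time_work.c -- crude measurement of how long a function takes,
 * illustrating the kind of data a profiler gathers automatically. */
#include <stdio.h>
#include <time.h>

static double work(long n)          /* some function worth measuring */
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)i;
    return s;
}

int main(void)
{
    clock_t t0 = clock();
    double s = work(50000000L);
    clock_t t1 = clock();

    printf("result = %f\n", s);
    printf("CPU time in work(): %.3f s\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC);
    return 0;
}
```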

15
Optimization
In computing, optimization is the process of
modifying a system to make some aspect of it
work more efficiently or use fewer resources. For
instance, a computer program may be optimized
so that it executes more rapidly, or is capable of
operating within a reduced amount of
memory storage, or draws less battery power in a
portable computer. The system may be a single
computer program, a collection of computers or
even an entire network such as the Internet.
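A small, hedged illustration of the idea (one common hand optimization, not a specific technique from the slide): moving a loop-invariant computation out of a loop so the program executes more rapidly without changing its result.

```c
#include <math.h>

/* Before: sqrt(scale) is recomputed on every iteration. */
void scale_slow(double *a, int n, double scale)
{
    for (int i = 0; i < n; i++)
        a[i] *= sqrt(scale);
}

/* After: the loop-invariant value is hoisted out of the loop. */
void scale_fast(double *a, int n, double scale)
{
    double s = sqrt(scale);
    for (int i = 0; i < n; i++)
        a[i] *= s;
}
```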

16
(http://en.wikipedia.org/wiki/Optimization_%28computer_science%29)
Parallel Computing
Parallel computing is the simultaneous execution of the
same task (split up and specially adapted) on multiple
processors in order to obtain results faster.

The idea is based on the fact that the process of solving a
problem usually can be divided into smaller tasks, which
may be carried out simultaneously with some coordination.

Flynn's taxonomy
whether all processors execute the same instructions at the
same time (single instruction/multiple data -- SIMD)
each processor executes different instructions (multiple
instruction/multiple data -- MIMD).

17
(http://en.wikipedia.org/wiki/Parallel_computing)
Terminology in parallel
computing
Efficiency
is the execution time using a single processor divided by the product of
the execution time using a multiprocessor and the number of processors
(see the formulas after this list).
Parallel Overhead
the extra work associated with parallel version compared to its
sequential code, mostly the extra CPU time and memory space
requirements from synchronization, data communications, parallel
environment creation and cancellation, etc.
Synchronization
the coordination of simultaneous tasks to ensure correctness and avoid
unexpected race conditions.
Speedup
also called parallel speedup, which is defined as wall-clock time of best
serial execution divided by wall-clock time of parallel execution.
Amdahl's law can be used to give a maximum speedup factor.
Scalability
a parallel system's ability to gain a proportionate increase in parallel
speedup with the addition of more processors. (See also the Parallel
Computing Glossary.)
Task
a logically high level, discrete, independent section of computational
work. A task is typically executed by a processor as a program.
18
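Written as formulas (a restatement of the two definitions above, not additional material from the slide; here $T_1$ is the wall-clock time on a single processor and $T_p$ the time on $p$ processors):

$$S = \frac{T_1}{T_p}, \qquad E = \frac{T_1}{p\,T_p} = \frac{S}{p}$$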
Amdahl's law
Amdahl's law, named after computer architect Gene Amdahl,
is used to find the maximum expected improvement to an
overall system when only part of the system is improved. It is
often used in parallel computing to predict the theoretical
maximum speedup using multiple processors.

$$\text{Speedup} = \frac{1}{\displaystyle\sum_{k=0}^{n} \frac{P_k}{S_k}}$$

where
P_k is a percentage of the instructions that can be improved (or slowed),
S_k is the speed-up multiplier (where 1 is no speed-up and no slowing),
k represents a label for each different percentage and speed-up, and
n is the number of different speed-ups/slow-downs resulting from the
system change.
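A hedged worked example (the numbers are illustrative, not from the slide): for the common single-improvement case where a fraction $P$ of the work can be parallelized across $N$ processors, the formula reduces to

$$S = \frac{1}{(1-P) + P/N}, \qquad \text{e.g. } P = 0.95,\ N = 8 \;\Rightarrow\; S \approx 5.9$$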

19
Parallelism
One of the simplest methods used to accomplish increased
parallelism is to begin the first steps of instruction fetching
and decoding before the prior instruction finishes
executing. This is the simplest form of a technique known
as instruction pipelining, and is utilized in almost all
modern general-purpose CPUs. Pipelining allows more than
one instruction to be executed at any given time by
breaking down the execution pathway into discrete stages.
Another strategy of achieving performance is to execute
multiple programs or threads in parallel. This area of
research is known as parallel computing. In
Flynn's taxonomy, this strategy is known as Multiple
Instructions-Multiple Data or MIMD.
A less common but increasingly important paradigm of
CPUs (and indeed, computing in general) deals with data
parallelism. A vector processor, or array processor, is a
CPU design that is able to run mathematical operations on
multiple data elements simultaneously.

20
Parallelism
In computing, multitasking is a method by which
multiple tasks, also known as processes, share
common processing resources such as a CPU.

Multiprocessing is a generic term for the use of two
or more central processing units (CPUs) within a single
computer system. It also refers to the ability of a
system to support more than one processor and/or the
ability to allocate tasks between them.

A multi-core microprocessor is one that combines two
or more independent processors into a single package,
often a single integrated circuit (IC). A dual-core
device contains two independent microprocessors. In
general, multi-core microprocessors allow a computing
device to exhibit some form of thread-level parallelism
(TLP) without including multiple microprocessors in
separate physical packages. This form of TLP is often
known as chip-level multiprocessing.
21
Computer Cluster
A computer cluster is a group of tightly coupled
computers that work together closely so that in
many respects they can be viewed as though
they are a single computer. The components of a
cluster are commonly, but not always, connected
to each other through fast local area networks.
Clusters are usually deployed to improve
performance and/or availability over that
provided by a single computer, while typically
being much more cost-effective than single
computers of comparable speed or availability.
22
High-Availability (HA)
Clusters
High-availability clusters (a.k.a.
failover clusters) are implemented
primarily for the purpose of improving
the availability of services which the
cluster provides. They operate by
having redundant nodes, which are
then used to provide service when
system components fail. The most
common size for an HA cluster is two
nodes, which is the minimum
requirement to provide redundancy.
HA cluster implementations attempt
to manage the redundancy inherent
in a cluster to eliminate single points
of failure. There are many commercial
implementations of High-Availability
clusters for many operating systems.
The Linux-HA project is one
commonly used free software HA
package for the Linux OS.
23
Load-balancing Clusters
Load-balancing clusters operate by having all workload
come through one or more load-balancing front ends, which
then distribute it to a collection of back end servers.
Although they are primarily implemented for improved
performance, they commonly include high-availability
features as well. Such a cluster of computers is sometimes
referred to as a server farm.

There are many commercial load balancers available,
including Platform LSF HPC, Sun Grid Engine,
Moab Cluster Suite and Maui Cluster Scheduler. The
Linux Virtual Server project provides one commonly used
free software package for the Linux OS.

For example, the web farms of Google and Yahoo.

24
High-performance Clusters
High-performance clusters are implemented primarily to
provide increased performance by splitting a computational
task across many different nodes in the cluster, and are
most commonly used in scientific computing.

One of the most popular HPC implementations is a cluster
with nodes running Linux as the OS and free software to
implement the parallelism. This configuration is often
referred to as a Beowulf cluster.

HPC clusters are optimized for workloads which require jobs or
processes happening on the separate cluster computer
nodes to communicate actively during the computation.
These include computations where intermediate results
from one node's calculations will affect future calculations
on other nodes.

25
Beowulf (computing)
Beowulf is a design for high-performance parallel computing
clusters on inexpensive personal computer hardware.
Originally developed by Thomas L. Sterling and
Donald Becker at NASA, Beowulf systems are now deployed
worldwide, chiefly in support of scientific computing.

A Beowulf cluster is a group of usually identical PC
computers networked into a small TCP/IP LAN, with
libraries and programs installed which allow processing to be
shared among them.

There is no particular piece of software that defines a cluster
as a Beowulf. Commonly used parallel processing libraries
include MPI (Message Passing Interface) and PVM
(Parallel Virtual Machine). Both of these permit the
programmer to divide a task among a group of networked
computers, and recollect the results of processing.
26
OpenMosix
openMosix is a free cluster
management system that provides
single-system image (SSI) capabilities,
e.g. automatic work distribution among
nodes. It allows program processes
(not threads) to migrate to machines in
the node's network that would be able
to run that process faster (
process migration). It is particularly
useful for running parallel and intensive
input/output (I/O) applications. It is
released as a Linux kernel patch, but is
also available on specialized LiveCDs
and as a Gentoo Linux kernel choice.

openMosix is currently considered
stable on the 2.4 kernel for the x86
architecture.

27
Pthread
POSIX Threads is a POSIX standard for threads. The
standard defines an API for creating and manipulating
threads.

POSIX or "Portable Operating System Interface for uniX" is


the collective name of a family of related standards specified
by the IEEE to define the application programming interface
(API) for software compatible with variants of the Unix
operating system.

A thread in computer science is short for a thread of


execution. Threads are a way for a program to split itself
into two or more simultaneously (or pseudo-simultaneously)
running tasks. Threads and processes differ from one
operating system to another, but in general, the way that a
thread is created and shares its resources is different from
the way a process does.
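A minimal sketch of the Pthreads API (an illustrative example, not from the slide; error handling omitted): one extra thread is created, does some work, and is joined by the main thread.

```c
/* pthread_hello.c -- compile with: gcc pthread_hello.c -lpthread */
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    long id = (long)arg;
    printf("hello from thread %ld\n", id);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, worker, (void *)1L);  /* start the thread */
    printf("hello from the main thread\n");
    pthread_join(tid, NULL);                         /* wait for it to finish */
    return 0;
}
```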

28
Message Passing Interface
The Message Passing Interface (MPI) is a language-
independent computer communications descriptive application
programmer interface (API) with defined semantics and flexible
interpretations; it does not define the protocol by which these
operations are to be performed in the sense of sockets for TCP/IP
or other layer-4 and below models in the ISO/OSI Reference Model.

It is consequently a layer-5+ type set of interfaces, although
implementations can cover most layers of the reference model,
with sockets+TCP/IP as a common transport used inside the
implementation. MPI's goals are high performance, scalability, and
portability.

Productivity of the interface for programmers is not one of the key
goals of MPI, and MPI is generally considered to be low-level. It
expresses parallelism explicitly rather than implicitly. MPI is
considered successful in achieving high performance and high
portability, but is often criticized for its low-level qualities.
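A minimal MPI sketch (illustrative, not from the slide; assumes an MPI implementation such as MPICH or Open MPI is installed): each process reports its rank out of the total number of processes.

```c
/* mpi_hello.c -- compile with: mpicc mpi_hello.c -o mpi_hello
 *                run with:     mpirun -np 4 ./mpi_hello        */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                  /* start the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */

    printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                          /* shut MPI down cleanly     */
    return 0;
}
```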

29
TCP/IP
Application layer
Transport layer (TCP)
Network layer (IP)
Data link layer
Physical layer

30
OpenMP
The OpenMP (Open Multi-Processing)
application programming interface (API) supports multi-
platform shared memory multiprocessing programming in C
/C++ and Fortran on many architectures, including Unix
and Microsoft Windows platforms. It consists of a set of
compiler directives, library routines, and
environment variables that influence run-time behavior.

Jointly defined by a group of major computer hardware and
software vendors, OpenMP is a portable, scalable model
that gives programmers a simple and flexible interface for
developing parallel applications for platforms ranging from
the desktop to the supercomputer.
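A minimal OpenMP sketch (illustrative, not from the slide): a single compiler directive splits the loop iterations across the available threads, with a reduction clause combining the per-thread partial sums.

```c
/* omp_sum.c -- compile with: gcc -fopenmp omp_sum.c -o omp_sum */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* the directive distributes the loop iterations among threads;
       reduction(+:sum) safely combines the per-thread partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += 1.0 / i;

    printf("harmonic sum = %f (up to %d threads)\n",
           sum, omp_get_max_threads());
    return 0;
}
```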

(http://en.wikipedia.org/wiki/Openmp)
31
Distributed Shared Memory
Distributed Shared Memory (DSM), in computer science,
refers to a wide class of software and hardware
implementations, in which each node of a cluster has
access to a large shared memory in addition to each node's
limited non-shared private memory.

Software DSM systems can be implemented within an
operating system, or as a programming library. Software
DSM systems implemented in the operating system can be
thought of as extensions of the underlying virtual memory
architecture. Such systems are transparent to the
developer, which means that the underlying distributed
memory is completely hidden from the users. In contrast,
software DSM systems implemented at the library or
language level are not transparent and developers usually
have to program differently. However, these systems offer a
more portable approach to DSM system implementation.

32
Distributed Computing
Distributed computing is a method of computer processing
in which different parts of a program run simultaneously on
two or more computers that are communicating with each
other over a network.

Distributed computing is a type of parallel processing, but the
latter term is most commonly used to refer to processing in
which different parts of a program run simultaneously on two
or more processors that are part of the same computer.

While both types of processing require that a program be
parallelized -- divided into sections that can run
simultaneously -- distributed computing also requires that the
division of the program take into account the different
environments on which the different sections of the program
will be running. For example, two computers are likely to have
different file systems and different hardware components.

33
(http://en.wikipedia.org/wiki/Distributed_computing)
BOINC
The Berkeley Open Infrastructure for Network
Computing (BOINC) is a middleware system for
volunteer computing, originally developed to support
SETI@home, but intended to be useful for other
applications as well.

Currently BOINC is being developed by a team based at the
University of California, Berkeley, led by David Anderson,
who also leads SETI@home. As a "quasi-supercomputing"
platform, BOINC has over 435,000 active computers (hosts)
worldwide processing on average 500 TFLOPS as of
January 30, 2007.

boinc.berkeley.edu
34
Grid Computing
Grid computing or grid clusters are a technology closely
related to cluster computing. The key differences between
grids and traditional clusters are that grids connect collections
of computers which do not fully trust each other, and hence
operate more like a computing utility than like a single
computer. In addition, grids typically support more
heterogeneous collections than are commonly supported in
clusters.

Grid computing is optimized for workloads which consist of
many independent jobs or packets of work, which do not have
to share data between the jobs during the computation
process. Grids serve to manage the allocation of jobs to
computers which will perform the work independently of the
rest of the grid cluster. Resources such as storage may be
shared by all the nodes, but intermediate results of one job
do not affect other jobs in progress on other nodes of the
grid.

35
Topics in Parallel Computing

General: High-performance computing
Parallelism: Data parallelism, Task parallelism
Theory: Speedup, Amdahl's law, Flynn's Taxonomy, Cost efficiency,
Gustafson's Law, Karp-Flatt Metric
Elements: Process, Thread, Fiber, Parallel Random Access Machine,
Multiprocessing, Multitasking, Memory coherency, Cache coherency,
Barrier, Synchronization, Distributed computing, Grid computing
Programming: Programming model, Implicit parallelism,
Explicit parallelism
Hardware: Computer cluster, Beowulf, Symmetric multiprocessing,
Non-Uniform Memory Access, Cache-only memory architecture,
Asymmetric multiprocessing, Simultaneous multithreading,
Shared memory, Distributed memory, Massively parallel processing,
Superscalar processing, Vector processing, Supercomputer
Software: Distributed shared memory, Application checkpointing
APIs: Pthreads, OpenMP, Message Passing Interface (MPI)
Problems: Embarrassingly parallel, Grand Challenge

36
UNIX
Unix (officially trademarked as UNIX) is a computer
operating system originally developed in the 1960s and 1970s by a
group of AT&T employees at Bell Labs including Ken Thompson,
Dennis Ritchie and Douglas McIlroy. Today's Unix systems are split
into various branches, developed over time by AT&T as well as
various commercial vendors and non-profit organizations.

The present owner of the trademark UNIX is The Open Group, an
industry standards consortium. Only systems fully compliant with
and certified to the Single UNIX Specification qualify as "UNIX"
(others are called "Unix system-like" or "Unix-like").

During the late 1970s and early 1980s, Unix's influence in
academic circles led to large-scale adoption (particularly of the BSD
variant, originating from the University of California, Berkeley) of
Unix by commercial startups, the most notable of which is
Sun Microsystems. Today, in addition to certified Unix systems,
Unix-like operating systems such as Linux, Mac OS X and BSD
derivatives are commonly encountered.

37
UNIX

38
(http://en.wikipedia.org/wiki/Unix)
GNU
The GNU project was publicly announced by
Richard Stallman.

GNU was to be a complete Unix-like operating system
composed entirely of free software. Software development
work began in January 1984. By the beginning of the
1990s, the project had produced or collected most of the
necessary components of this system, including libraries,
compilers, text editors, and a Unix shell.

Thus the GNU mid-level portions of the operating system
were almost complete. The upper level could be supplied by
the X Window System, but the lower level, which consisted
of a kernel, device drivers, and daemons, was still mostly
lacking. In 1990, the GNU project began developing the
GNU Hurd kernel, based on the Mach microkernel.

39
(http://en.wikipedia.org/wiki/GNU)
Linux
Linus Torvalds, creator of the Linux kernel.

In 1991, work on the Linux kernel was begun by Linus Torvalds
while attending the University of Helsinki. Torvalds
originally created the Linux kernel as a non-commercial
replacement for the Minix kernel; he later changed his
original non-free license to the GPLv2, which differed
primarily in that it also allowed for commercial
redistribution.

Although dependent on the Minix userspace at first, work
from both Linux kernel developers and the GNU project
allowed Linux to work with GNU components. Thus Linux
filled the last major gap in running a complete, fully
functional operating system built from free software.

40
(http://en.wikipedia.org/wiki/Linux)
Kernel
In computing, the kernel is the central component of
most computer operating systems (OSs). Its
responsibilities include managing the system's resources
and the communication between hardware and software
components.

As a basic component of an operating system, a kernel
provides the lowest-level abstraction layer for the
resources (especially memory, processors and I/O devices)
that applications must control to perform their function.
It typically makes these facilities available to application
processes through inter-process communication
mechanisms and system calls.

While monolithic kernels will try to achieve these goals by
executing all the code in the same address space to
increase the performance of the system, microkernels run
most of their services in user space, aiming to improve
maintainability and modularity of the codebase.

41
Linux Distribution
A Linux distribution, often simply distribution
or distro, is a member of the Linux family of
Unix-like operating systems comprised of the
Linux kernel, the non-kernel parts of the
GNU operating system, and assorted other
software. Linux distributions take a variety of
forms, from fully-featured desktop and server
operating systems to minimal environments
(typically for use in embedded systems, or for
booting from a floppy).

CentOS, Debian, Fedora, Gentoo, Knoppix,
Mandriva Linux, Red Hat Enterprise Linux,
Slackware, SUSE Linux, Ubuntu, and more.
42
Unix Shell
A Unix shell, also called "the command line", provides
the traditional user interface for the Unix
operating system and for Unix-like systems.

bash Bourne Again SHell, (mostly) sh-compatible and
csh-compatible, standard shell on Linux systems and
Mac OS X.
csh C shell. Written by Bill Joy for BSD systems.
ksh Korn shell, standard shell on many proprietary
Unix systems, powerful successor to the Unix
Bourne shell (sh), written by David Korn,
rc originally written for Plan 9.
sh Bourne shell, only shell present on all UNIX and
Unix-like systems, written by Stephen Bourne for
Version 7 Unix.
tcsh TENEX C shell, standard shell on BSD systems.
zsh Z shell.
43
(http://en.wikipedia.org/wiki/Unix_shell)
Bash
Bash is a Unix shell written for the GNU Project. The name
of the actual executable is bash. Its name is an acronym for
Bourne-again shell, a pun ('Bourne again' / 'born again') on
the name of the Bourne shell (sh), an early and important
Unix shell written by Stephen Bourne and distributed with
Version 7 Unix circa 1978. Bash was created in 1987 by
Brian Fox. In 1990 Chet Ramey became the primary
maintainer.

Bash is the default shell on most Linux systems as well as on
Mac OS X, and it can be run on most Unix-like operating
systems. It has also been ported to Microsoft Windows within
the Cygwin POSIX emulation environment for Windows and
to MS-DOS by the DJGPP project. Released under the
GNU General Public License, Bash is free software.

44
(http://en.wikipedia.org/wiki/Bash)
Math Cluster

45
Math Cluster

IP 140.127.223.11 (eth1)
IP 192.168.0.254 (eth0)

192.168.0.1 ~7 (node1 ~ node7)

hpc001

46
Remote Login: SSH Client

Don't forget to change the password after the first login.
47


Kernel Specific Commands
Kernel specific
date Print or set the system date and/or
time
dmesg Print the kernel message buffer
ipcrm Remove a message queue,
semaphore set or shared memory id
ipcs Provide information on IPC facilities
uname Print assorted system statistics

48
(http://en.wikipedia.org/wiki/Linux)
General User Commands
dd Convert and copy a file (Disk Dump)
dirname Strip non-directory suffixes from a
path
echo Print to standard output
env Show environment variables; run a
program with altered environment variables
file (or stat) Determine the type of a file
nohup Run a command with immunity to
hangups outputting to non-tty
sh The Bourne shell, the standard Unix shell
uptime Print how long the system has been
running

49
(http://en.wikipedia.org/wiki/Linux)
Processes and tasks
management
anacron Periodic command scheduler
at Single-time command scheduler
chroot Change the system root directory for all
child processes
cron Periodic command scheduler
crontab Crontab file editor
daemonic Interface to daemon init scripts
htop Interactive ncurses-based process viewer that allows
scrolling to see all processes and their full command lines
kill Send a signal to process, or terminate a process (by
PID)
killall Terminate all processes (in GNU/Linux, it's kill by
name)

50
(http://en.wikipedia.org/wiki/Linux)
Processes and tasks
management
nice Alter priorities for processes
pgrep Find PIDs of processes by name
pidof GNU/Linux equivalent of pgrep
pkill Send a signal to process, or terminate a process (by
name). Equivalent to Linux killall
ps Report process status
renice Alter the priorities of an already running process
sleep Delay for specified time
time Time a command
timex Time process shell execution, measure process
data and system activity
top Produce a dynamic list of all resident processes
wait Wait for the specified process

51
(http://en.wikipedia.org/wiki/Linux)
User management and
support
chsh Change user shell
finger Get details about user
id Print real/effective UIDs/GIDs
last show listing of last logged in users
lastlog show last log in information for users
locale Get locale specific information
localedef Compile locale definitions
logname Print user's login name
man Manual browser
mesg Control write access to your terminal
passwd Change user password
52
(http://en.wikipedia.org/wiki/Linux)
User management and
support
su Start a new process (defaults to shell) as a
different user (defaults to root)
sudo execute a command as a different user.
users Show who is logged on (only users names)
w Show logged-in users and their current tasks
whatis command description from whatis
database
whereis locates the command's binary and
manual pages associated with it
which (Unix) locates where a command is
executed from
who Show who is logged on (with some details)
write Send a message to another user
53
(http://en.wikipedia.org/wiki/Linux)
Filesystem Utilities
info The GNU alternative to man
man The standard unix documentation system

chattr Change file attributes on a Linux second


extended file system
chgrp Change the group of a file or directory
chmod Change the permissions of a file or
directory
chown Change the owner of a file or directory
cd Change to another directory location
cp Copy a file or directory to another location

54
Filesystem Utilities
df Report disk space
dircmp Compare contents of files between two directories
du Calculate used disk space
fdupes Find or remove duplicate files within a directory
find Search for files through a directory hierarchy
fsck Filesystem check
ln Link one file/directory to another
ls List directory contents
lsattr List file attributes on a Linux second extended file
system
lsof list open files

55
Filesystem Utilities
mkdir Make a directory
mkfifo Make a named pipe
mount Mount a filesystem
mv Move or rename a file or directory
pwd Print the current working directory
rm Delete a file or directory tree
readlink Display value of a symbolic link, or display
canonical path for a file
rmdir Delete an empty directory
touch Create a new file or update its modification time
tree Print a depth-indented tree of a given directory
unlink System call to remove a file or directory

56
Archivers and compression
afio Compatible superset of cpio with added functionality
ar Maintain, modify, and extract from archives. Now
largely obsoleted by tar
bzip2 Block-sorting file compressor
compress Traditional compressor using the LZW algorithm
cpio A traditional archiving tool/format
gzip The gzip file compressor
p7zip 7zip for unix/linux
pack, pcat, unpack included in old versions of AT&T Unix.
Uses Huffman coding, obsoleted by compress.
pax POSIX archive tool that handles multiple formats.
tar Tape ARchiver, concatenates files
uncompress Uncompresses files compressed with
compress.
zcat Prints files to stdout from gzip archives without
unpacking them to separate file(s)

57
Text Processing
awk A pattern scanning and processing language
banner Creates ascii art version of an input string for printing large
banners
cat Concatenate files to standard output
cksum Print the CRC checksum and bytecount of a file (see also
MD5)
cmp Compare two files byte for byte
cut Remove sections from each line of a file or standard input
diff Compare two text files line by line
egrep Extended pattern matching (synonym for "grep -E")
fgrep Simplified pattern matching (synonym for "grep -F")
fold Wrap each input line to fit within the given width
grep Print lines matching a pattern
head Output the first parts of a file
iconv Convert the encoding of the specified files
join Join lines of two files on a common field
less Improved more-like text pager

58
Text Processing
more Pager
nroff Fixed-width (non-typesetter) version of the standard Unix
typesetting system
patch Change files based on a patch file
sed Stream EDitor
sort Sort lines of text files
split Split a file into pieces
tail Output the tail end of files
tee Read from standard input, write to standard output and files
uudecode Decodes a binary file that was used for transmission
using electronic mail
uuencode Encodes a binary file for transmission using electronic
mail
wc Word/line/byte count

59
Text Editors
GNU Emacs Freely programmable full-screen text editor
and general computing environment (using builtin Elisp, a
simple dialect of the Lisp programming language)
Joe a screen-oriented text editor using a Wordstar-style
command set
Jove a screen-oriented text editor using an Emacs-style
command set
pico PIne's message COmposition editor (simple, easy to
use screen editor)
vi "Visual" (screen-oriented) text editor (originally ex in
screen-oriented "visual" mode)
vim Vi IMproved, portable vi-compatible editor with
multiple buffers, screen splitting, syntax highlighting and a
lot of other features not found in standard ex/vi
XEmacs Popular version of emacs that is derived from
GNU Emacs

60
Communication
ftp, sftp File transfer protocol, secure FTP
NFS Network filesystem
OpenVPN virtual private (encrypting) networking software
Postfix mail transfer agent
rsh, SSH, telnet Remote login
Samba SMB and CIFS client and server for UNIX
Sendmail popular E-Mail transport software
talk Talk to another logged-in user
uustat a Basic Networking Utilities (BNU) command that
displays status information about several types of BNU
operations
uux Remote command execution over UUCP

61
Network monitoring and
security
Ethereal and tethereal a feature rich protocol analyzer (now called
Wireshark, see below)
John the Ripper password cracking software
Nessus a comprehensive open-source network vulnerability scanning
program
Netstat displays a list of the active network connections on the computer
Nmap free port scanning software
SAINT System Administrators Integrated Network Tool Network
Vulnerability Scanner.
SATAN the Security Administrator Tool for Analyzing Networks, a
testing and reporting tool that collects information about networked
hosts
Snoop Solaris packet sniffer
Snort an open source network intrusion detection system
tcpdump a computer network debugging tool that intercepts and
displays TCP/IP packets being transmitted or received
Wireshark a protocol analyzer, or "packet sniffer", similar to tcpdump,
that adds a GUI frontend, and more sorting and filtering options
(formerly named Ethereal)

62
Programming Tools
bash Bourne Again SHell, (mostly) sh-compatible and csh-
compatible, standard shell on Linux systems and Mac OS X.
csh C shell. Written by Bill Joy for BSD systems.
ksh Korn shell, standard shell on many proprietary Unix
systems, powerful successor to the Unix Bourne shell (sh), written
by David Korn,
rc originally written for Plan 9.
sh Bourne shell, only shell present on all UNIX and Unix-like
systems, written by Stephen Bourne for Version 7 Unix.
tcsh TENEX C shell, standard shell on BSD systems.
zsh Z shell.
awk Standard Unix pattern scanning and text processing tool.
perl Perl scripting language.
PHP PHP scripting language.
Python Python scripting language.

63
Compilers
as GNU assembler tool.
c99 C programming language.
cc C compiler.
dbx (System V and BSD) Symbolic debugger.
f77 Fortran 77 compiler.
gcc GNU Compiler Collection C frontend (also known as GNU C Compiler)
gdb GNU symbolic debugger.
ld Program linker.
lex Lexical scanner generator.
ltrace (Linux) Trace dynamic library calls in the address space of the
watched process.
m4 Macro language.
make Automate builds.
nm List symbols from object files.
size return the size of the sections of an ELF file.
strace (Linux) or truss (Solaris) Trace system calls with their arguments
and signals. Useful debugging tool, but does not trace calls outside the
kernel, in the address space of the process(es) being watched.

64
Desktops/Graphical User
Interfaces
CDE Common Desktop Environment, most commonly
found on proprietary UNIX systems
Enlightenment an open source window manager for the X
Window System
FVWM and its variant FVWM95, which has been modified to
behave like Windows 95; also FVWM-Crystal, which aims to be
eye candy
GNOME GNU Network Object Model Environment
IceWM ICE Window Manager
JWM Joe's Window Manager
KDE K Desktop Environment
XFce a desktop environment for Unix and other Unix-like
platforms

65
Package Management
apt Front-end for dpkg or rpm
debconf Debian package configuration management
system
dpkg The Debian package manager
drakconf Front-end configuration utility for Mandriva Linux
emerge A frontend to portage
pacman A package manager used primarily by Arch Linux
portage The Gentoo Linux package manager
rpm Originally the package manager for Red Hat Linux,
now used by several distributions including Mandriva Linux
Synaptic GTK+ frontend for the apt package manager.
Primarily used by Ubuntu Linux, Debian Sarge, and other
Debian-based systems; but usable on any system using apt.
urpmi Front-end to rpm, used by Mandriva Linux
YaST - System management utility mainly used by SuSE
yum - Front-end for rpm, used by Fedora
66
Web Browsers
Dillo Extremely light-weight web browser
ELinks Enhanced links
Epiphany Light-weight GNOME web browser
Galeon Light-weight old GNOME web browser
Konqueror KDE web browser
Links Console based web browser
lynx Console based web browser
Mozilla Application Suite Graphical cross platform web
browser & email client
Mozilla Firefox Extensible Web browser
Opera Web browser and e-mail client (
Proprietary Software)
w3m Console based web browser

67
Desktop Publishing
groff Traditional typesetting system
LaTeX Popular TeX macro package for higher-
level typesetting
lp Print a file (on a line printer)
Passepartout Desktop publishing program
pr Convert text files for printing
Scribus Desktop publishing program
TeX Macro-based typesetting system
troff The original and standard Unix typesetting
system

68
Databases
DB2
Firebird
MySQL
Oracle
PostgreSQL
Progress Software
SQLite
Sybase

69
Math Tools
maxima Symbol manipulation program.
octave Numerical computing language (mostly
compatible with Matlab) and environment.
R Statistical programming language.
units Unit conversion program.
bc An arbitrary precision calculator language
with syntax similar to the C programming
language.
cal Displays a calendar
dc Reverse-Polish desk calculator which
supports unlimited precision arithmetic
fortune Fortune cookie program that prints a
random quote
70
Unix command line
File and file system management: cat | cd | chmod | chown |
chgrp | cp | du | df | file | fsck | ln | ls | lsof | mkdir | mount | mv |
pwd | rm | rmdir | split | touch
Process management: at | chroot | crontab | kill | killall | nice |
pgrep | pidof | pkill | ps | sleep | time | top | wait | watch
User Management/Environment: env | finger | id | mesg |
passwd | su | sudo | uname | uptime | w | wall | who | whoami |
write
Text processing: awk | cut | diff | ex | head | iconv | join | less |
more | paste | sed | sort | tail | tr | uniq | wc | xargs
Shell programming: echo | expr | printf | unset
Printing: lp
Communications:
inetd | netstat | ping | rlogin | traceroute
Searching:
find | grep | strings
Miscellaneous:
banner | bc | cal | man | size | yes

71
Software development
process
A software development process is a structure imposed
on the development of a software product. Synonyms
include software lifecycle and software process. There
are several models for such processes, each describing
approaches to a variety of tasks or activities that take place
during the process.
Software Elements Analysis
Specification
Software architecture
Implementation (or coding)
Testing
Documentation
Software Training and Support
Maintenance

72
Software Elements Analysis
The most important task in creating a software
product is extracting the requirements.

Customers typically know what they want, but
not what software should do, while incomplete,
ambiguous or contradictory requirements are
recognized by skilled and experienced software
engineers.

Frequently demonstrating live code may help
reduce the risk that the requirements are
incorrect.

73
Specification
Specification is the task of precisely describing the software
to be written, possibly in a rigorous way.

In practice, most successful specifications are written to
understand and fine-tune applications that were already
well-developed, although safety-critical software systems
are often carefully specified prior to application
development.

Specifications are most important for external interfaces
that must remain stable.

74
Software Architecture
The architecture of a software system refers to an abstract
representation of that system. Architecture is concerned
with making sure the software system will meet the
requirements of the product, as well as ensuring that future
requirements can be addressed. The architecture step also
addresses interfaces between the software system and
other software products, as well as the underlying
hardware or the host operating system.

75
Implementation and Testing
Implementation (or coding): Reducing a design to code
may be the most obvious part of the software engineering
job, but it is not necessarily the largest portion.

Testing: Testing of parts of software, especially where code
by two different engineers must work together, falls to the
software engineer.

76
Documentation and
Training
An important (and often overlooked) task is documenting
the internal design of software for the purpose of future
maintenance and enhancement. Documentation is most
important for external interfaces.
Common types of computer hardware/software
documentation include online help, FAQs, HowTos, and
user guides. The term RTFM is often used in regard to such
documentation, especially to computer hardware and
software user guides.
RTFM is an initialism for the statement "Read The Fucking
Manual." This instruction is sometimes given in response to
a question when the person being asked believes that the
question could be easily answered by reading the relevant
"manual" or instructions.
Some people prefer the backronym "Read The Fine
Manual." Alternatively, the "F" can be dropped entirely and
the initialism rendered as "RTM" (Read The Manual), or the
more polite "RTMP" (Read The Manual Please).
77
Software Training and
Support
A large percentage of software projects fail because the
developers fail to realize that it doesn't matter how much
time and planning a development team puts into creating
software if nobody in an organization ends up using it.
People are occasionally resistant to change and avoid
venturing into an unfamiliar area, so as a part of the
deployment phase, it's very important to have training
classes for the most enthusiastic software users (build
excitement and confidence), shifting the training towards
the neutral users intermixed with the avid supporters, and
finally incorporate the rest of the organization into adopting
the new software. Users will have lots of questions and
software problems which leads to the next phase of
software.

78
Maintenance
Maintaining and enhancing software to cope with newly
discovered problems or new requirements can take far
more time than the initial development of the software. Not
only may it be necessary to add code that does not fit the
original design but just determining how software works at
some point after it is completed may require significant
effort by a software engineer. About two-thirds of all software
engineering work is maintenance, but this statistic can be
misleading. A small part of that is fixing bugs. Most
maintenance is extending systems to do new things, which
in many ways can be considered new work. In comparison,
about two-thirds of all civil engineering, architecture, and
construction work is maintenance in a similar way.

79
Software Developing
Models
Waterfall model
Spiral model
Model driven development
User experience
Top-down and bottom-up design
Chaos model
Evolutionary prototyping
Prototyping
V model
Extreme Programming
Hysterical raisins
80
Waterfall Model
The waterfall model is a
sequential
software development model
(a process for the creation of
software) in which
development is seen as flowing
steadily downwards (like a
waterfall) through the phases
of requirements analysis,
design, implementation,
testing (validation), integration
, and maintenance. The origin
of the term "waterfall" is often
cited to be an article published
in 1970 by W. W. Royce.

81
Spiral Model
The new system requirements are defined in as much detail as
possible. This usually involves interviewing a number of users
representing all the external or internal users and other
aspects of the existing system.

A preliminary design is created for the new system.

A first prototype of the new system is constructed from the
preliminary design. This is usually a scaled-down system, and
represents an approximation of the characteristics of the final
product.

A second prototype is evolved by a fourfold procedure: (1)
evaluating the first prototype in terms of its strengths,
weaknesses, and risks; (2) defining the requirements of the
second prototype; (3) planning and designing the second
prototype; (4) constructing and testing the second prototype.

82
Spiral Model
At the customer's option, the entire project can be aborted
if the risk is deemed too great. Risk factors might involve
development cost overruns, operating-cost miscalculation,
or any other factor that could, in the customer's judgment,
result in a less-than-satisfactory final product.
The existing prototype is evaluated in the same manner as
was the previous prototype, and, if necessary, another
prototype is developed from it according to the fourfold
procedure outlined above.
The preceding steps are iterated until the customer is
satisfied that the refined prototype represents the final
product desired.
The final system is constructed, based on the refined
prototype.
The final system is thoroughly evaluated and tested.
Routine maintenance is carried out on a continuing basis to
prevent large-scale failures and to minimize downtime.
83
Top-Down and Bottom-up
Top-down and bottom-up are strategies of
information processing and knowledge ordering, mostly involving
software, and by extension other humanistic and scientific
system theories (see systemics).
In the top-down model an overview of the system is formulated,
without going into detail for any part of it. Each part of the system
is then refined by designing it in more detail. Each new part may
then be refined again, defining it in yet more detail until the entire
specification is detailed enough to validate the model. The top-
down model is often designed with the assistance of "black boxes"
that make it easier to bring to fulfillment, but which can be
insufficient and irrelevant for understanding the elementary mechanisms.
In bottom-up design, first the individual parts of the system are
specified in great detail. The parts are then linked together to
form larger components, which are in turn linked until a complete
system is formed. This strategy often resembles a "seed" model,
whereby the beginnings are small, but eventually grow in
complexity and completeness.

84
Software prototyping
Identify basic requirements
Determine basic requirements including the input and output
information desired. Details, such as security, can typically be
ignored.
Develop Initial Prototype
The initial prototype is developed that includes only user
interfaces.
Review
The customers, including end-users, examine the prototype
and provide feedback on additions or changes.
Revise and Enhance the Prototype
Using the feedback, both the specifications and the prototype
can be improved. Negotiation about what is within the scope
of the contract/product may be necessary. If changes are
introduced, then a repeat of steps #3 and #4 may be
needed.

85
Extreme Programming
Extreme Programming (XP) is a
software engineering methodology, the most prominent of
several agile software development methodologies. Like
other agile methodologies, Extreme Programming differs
from traditional methodologies primarily in placing a higher
value on adaptability than on predictability. Proponents of
XP regard ongoing changes to requirements as a natural,
inescapable and desirable aspect of software development
projects; they believe that being able to adapt to changing
requirements at any point during the project life is a more
realistic and better approach than attempting to define all
requirements at the beginning of a project and then
expending effort to control changes to the requirements.

86
Software Architecture
Software architecture is commonly organized in views,
which are analogous to the different types of blueprints
made in building architecture. Some possible views
(actually, viewpoints in the IEEE 1471 ontology) are:
Functional/logic view
Code view
Development/structural view
Concurrency/process/thread view
Physical/deployment view
User action/feedback view
Several languages for describing software architectures
have been devised, but no consensus has yet been reached
on which symbol-set and view-system should be adopted.
The UML was established as a standard "to model systems
(and not just software)," and thus applies to views about
software architecture.

87
