 
Making reliable distributed systems in the presence of software errors
Final version (with corrections) — last update 20 November 2003
Joe Armstrong 
A Dissertation submitted to the Royal Institute of Technology in partial fulfilment of the requirements for the degree of Doctor of Technology
The Royal Institute of Technology
Stockholm, Sweden
December 2003
Department of Microelectronics and Information Technology
 
TRITA–IMIT–LECS AVH 03:09
ISSN 1651–4076
ISRN KTH/IMIT/LECS/AVH-03/09–SE
and
SICS Dissertation Series 34
ISSN 1101–1335
ISRN SICS–D–34–SE
© Joe Armstrong, 2003
Printed by Universitetsservice US-AB 2003
 
To Helen, Thomas and Claire 
 
CHAPTER 4. PROGRAMMING TECHNIQUES 
This philosophy is completely different to that used in a sequential programming language where there is no alternative but to try and handle all errors in the thread of control where the error occurs. In a sequential language with exceptions, the programmer encloses any code that is likely to fail within an exception handling construct and tries to contain all errors that can occur within this construct.
Remote handling of errors has several advantages:
1. The error-handling code and the code which has the error execute within different threads of control.
2. The code which solves the problem is not cluttered up with the code which handles the exception.
3. The method works in a distributed system and so porting code from a single-node system to a distributed system needs little change to the error-handling code.
4. Systems can be built and tested on a single node system, but deployed on a multi-node distributed system without massive changes to the code.
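As a rough illustration of the first two points, the following minimal sketch lets one process do a calculation with no error-handling code at all, while a second process observes it through a monitor. The module name remote_error_demo and the function run/1 are hypothetical names chosen for this sketch; only the built-in spawn_monitor/1 and the 'DOWN' message format are standard Erlang.

```erlang
-module(remote_error_demo).
-export([run/1]).

%% run(N) spawns a worker that evaluates 100 div N and then observes it
%% from the calling process. The worker contains no error handling; if N
%% is 0 it simply crashes with a badarith error.
run(N) ->
    {Pid, Ref} = spawn_monitor(fun() -> 100 div N end),
    receive
        %% The worker ran to completion.
        {'DOWN', Ref, process, Pid, normal} ->
            worker_finished;
        %% The worker crashed; the reason arrives here, in another process.
        {'DOWN', Ref, process, Pid, Reason} ->
            {worker_failed, Reason}
    end.
```

Calling remote_error_demo:run(5) should return worker_finished, while run(0) should return a tuple of the form {worker_failed, {badarith, Stacktrace}}. The failure is detected and handled entirely outside the process in which it occurred.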
4.3.2 Workers and supervisors
To make the distinction between processes which perform work and processes which handle errors clearer, we often talk about worker and supervisor processes:
One process, the worker process, does the job. Another process, the supervisor process, observes the worker. If an error occurs in the worker, the supervisor takes actions to correct the error. The nice thing about this approach is that:
1. There is a clean separation of issues. The processes that are supposed to do things (the workers) do not have to worry about error handling.
2. We can have special processes which are only concerned with error handling.
3. We can run the workers and supervisors on different physical machines.
4. It often turns out that the error correcting code is generic, that is, generally applicable to many applications, whereas the worker code is more often application specific.
Point three is crucial: given that Erlang satisfies requirements R3 and R4 (see page 27) then we can run worker and supervisor processes on different physical machines, and thus make a system which tolerates hardware errors where entire processes fail.
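The worker and supervisor split can be sketched in a few lines, assuming a deliberately simple restart policy. The module sup_sketch, the function start/2 and the restart limit are hypothetical and are not taken from the text; the sketch relies only on the standard built-ins spawn_link/1 and process_flag/2.

```erlang
-module(sup_sketch).
-export([start/2]).

%% start(WorkerFun, MaxRestarts) turns the calling process into a
%% supervisor for one worker. It returns when the worker finishes
%% normally or when the allowed restarts have been used up.
start(WorkerFun, MaxRestarts) ->
    %% Convert exit signals from linked processes into messages.
    process_flag(trap_exit, true),
    supervise(WorkerFun, MaxRestarts).

supervise(WorkerFun, RestartsLeft) ->
    Pid = spawn_link(WorkerFun),
    receive
        %% The worker completed its job.
        {'EXIT', Pid, normal} ->
            worker_done;
        %% The worker crashed and restarts remain: start a fresh worker.
        {'EXIT', Pid, _Reason} when RestartsLeft > 0 ->
            supervise(WorkerFun, RestartsLeft - 1);
        %% The worker crashed and no restarts remain: report the reason.
        {'EXIT', Pid, Reason} ->
            {gave_up, Reason}
    end.
```

For example, sup_sketch:start(fun() -> exit(boom) end, 2) should return {gave_up, boom} after three attempts, and sup_sketch:start(fun() -> ok end, 2) should return worker_done. The supervisor code knows nothing about what the worker does, which is what makes it generic. To place the worker on another machine, the standard spawn_link/2, which takes a node name as its first argument, could be used instead of spawn_link/1.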
4.4 Let it crash
How does our philosophy of handling errors fit in with coding practices? What kind of code must the programmer write when they find an error? The philosophy is let some other process fix the error, but what does this mean for their code? The answer is let it crash. By this I mean that in the event of an error, then the program should just crash. But what is an error? For programming purposes we can say that:
• exceptions occur when the run-time system does not know what to do.
• errors occur when the programmer doesn't know what to do.
If an exception is generated by the run-time system, but the programmer had foreseen this and knows what to do to correct the condition that caused the exception, then this is not an error. For example, opening a file which does not exist might cause an exception, but the programmer might decide that this is not an error. They therefore write code which traps this exception and takes the necessary corrective action.
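The file example can be sketched as follows. Note one assumption: Erlang's standard file:read_file/1 reports a missing file as the return value {error, enoent} rather than by raising an exception, so the sketch handles the foreseen condition by matching on that value. The module config_sketch and the function read_or_default/2 are hypothetical names.

```erlang
-module(config_sketch).
-export([read_or_default/2]).

%% A missing file is a foreseen condition, so it is handled here and is
%% not treated as an error. Any other result, such as {error, eacces},
%% matches no clause: the programmer does not know what to do, so the
%% process crashes with a case_clause error.
read_or_default(Path, Default) ->
    case file:read_file(Path) of
        {ok, Contents}  -> Contents;
        {error, enoent} -> Default
    end.
```

Only the one condition the programmer knows how to correct is handled. Everything else is left to crash, so that some other process can deal with it.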
 
Errors occur when the programmer does not know what to do. Programmers are supposed to follow specifications, but often the specification does not say what to do and therefore the programmer does not know what to do. Here is an example:
Suppose we are writing a program to produce code for a microprocessor. The specification says that a load operation is to result in opcode 1 and a store operation should result in opcode 2. The programmer turns this specification into code like:
    asm(load) -> 1;
    asm(store) -> 2.
Now suppose that the system tries to evaluate asm(jump). What should happen? Suppose you are the programmer and you are used to writing defensive code; then you might write:
    asm(load) -> 1;
    asm(store) -> 2;
    asm(X) -> ??????
but what should the ??????'s be? What code should you write? You are now in the situation that the run-time system was faced with when it encountered a divide-by-zero situation and you cannot write any sensible code here. All you can do is terminate the program. So you write:
    asm(load) -> 1;
    asm(store) -> 2;
    asm(X) -> exit({oops,i,did,it,again,in,asm,X}).
But why bother? The Erlang compiler compiles
    asm(load) -> 1;
    asm(store) -> 2.
almost as if it had been written:
