Distributed Systems Course Notes

Original notes by Ian Wakeman with revisions by Dan Chalmers


Contents
1 Introduction
  1.1 Distributed Systems Trailer
    1.1.1 Detailed content of Distributed Systems
  1.2 Aims and Learning Outcomes
  1.3 Prerequisites
  1.4 Teaching methods
  1.5 Assessment
  1.6 Course programme
  1.7 Reading list

2 Lecture Notes
  2.1 Introduction
    2.1.1 Course Outline
    2.1.2 What’s a Distributed System
    2.1.3 Example Distributed Systems
    2.1.4 What do we want from a Distributed System?
    2.1.5 Elements of a Distributed System
    2.1.6 Conclusion
  2.2 Bits and Bytes
    2.2.1 Bits, Bytes, Integers etc
    2.2.2 Prefixes
    2.2.3 Memory and Packets
    2.2.4 Bit Manipulation
  2.3 Foundations of Distributed Systems
    2.3.1 Physical Concepts
    2.3.2 Packets
    2.3.3 Conclusion: Network Properties
  2.4 Operating System Support
    2.4.1 Protocols
    2.4.2 Layering
    2.4.3 Reliable Transmission: The Basic Techniques
    2.4.4 Group Communication
    2.4.5 Sockets
    2.4.6 Request and Response
  2.5 Remote Procedure Call (RPC)
  2.6 Distributed Objects: The Java Approach
  2.7 Enterprise computing and Corba
  2.8 Computer Security: Why you should never trust a computer system
  2.9 Names and Naming Services
  2.10 Distributed File Systems
  2.11 Content Distribution Networks
  2.12 Peer to Peer (p2p) Services and Overlay Networks
  2.13 Replication: Availability and Consistency
  2.14 Shared Data and Transactions
  2.15 Concurrency Control and Transactions
  2.16 Distributed Transactions
  2.17 Transactions: Coping with Failure

3 Exercises and answers
  3.1 Exercises
  3.2 The Answers
  3.3 Sample Exam Question

Table 1: Course Timetable 2006

Week       Lecture                         Lecture                         Exercise Class
1 (8/1)    Introduction                    Fundamentals                    Programming Exercise
2 (15/1)   OS Issues                       RPC                             Programming Exercise
3 (22/1)   Object Systems                  Enterprise Computing and CORBA  Programming Exercise
4 (29/1)   Security                        Security                        Security
5 (5/2)    Naming                          Distributed File Systems        Marking
6 (12/2)   P2P Networks                    P2P Networks                    Names
7 (19/2)   Replication                     Replication                     Exam sample
8 (26/2)   Transactions                    Concurrency Control             Programming Exercise
9 (5/3)    Distributed Transactions        Coping with Failure             Marking
10 (12/3)  Pervasive Computing Management  No lecture                      No class

Assignment deadlines, in term order: Ass 1 due; Ass 2 due; Portfolio due.


Chapter 1 Introduction

This is the online version of the course information sheet. These notes will be updated from time to time as the course progresses. For 2006-7 questions and comments should be directed to Dan Chalmers, either in a timetabled session, my office hour at 9:30-10:30 on Mondays, or by email to d.chalmers@sussex.ac.uk.

1.1 Distributed Systems Trailer
• Learn about even more Internet applications!!!
• Learn about distributed Operating Systems!!!!
• Learn how to program Distributed Systems!!!!!

1.1.1 Detailed content of Distributed Systems
• How to build software systems on multiple machines connected by networks.
• RPC, distributed objects, Content Distribution Networks, Peer to Peer applications, Replication and concurrency control and other stuff
• Pre-requisite - Multimedia Communication Systems (or be prepared to learn fast)
• Languages - Java. Learn Java sockets, rmi, and how to build simulations
• 2 Assignments, 40% coursework, 60% exam
• Classes a mixture of programming and problem solving

1.2 Aims and Learning Outcomes
This course aims to convey an understanding of the problems of programming distributed systems. After taking the course, the student will be able to program over a networked system, and will be able to criticise algorithms and designs for distributed systems.

1.3 Prerequisites
The course assumes that the courses Introduction to Operating Systems and Multimedia Communications Technology have been taken. It assumes programming skills in the Java programming language and a rudimentary knowledge of computer hardware. If you haven’t taken these courses, be prepared to do additional work to keep up with the course.

1.4 Teaching methods
Two lectures and one exercise class per week together with 2 programming assignments. Check the url http://www.informatics.sussex.ac.uk/courses/dist-sys/ for copies of lecture slides and other course material.

1.5 Assessment
This is a one term course, taught in the spring term of the final year. The undergraduate course is assessed by coursework (40%) and by unseen examination (60%). The postgraduate course is assessed solely through coursework (100%).

1.6 Course programme
This programme is subject to change. Remember to check Sussex Direct for the times and rooms, as these vary between UG and PG students and from week to week.

1.7 Reading list
• Distributed Systems – Concepts and Design. George F. Coulouris, Jean Dollimore and Tim Kindberg. Addison-Wesley. Fourth edition, 2005. This is the course textbook, and I will occasionally be recommending readings from it. The notes are based on the third edition, 2001, although updates to reflect the fourth edition are being made and this is the most current version.

• Computer Networking: A Top-Down Approach Featuring the Internet. J. Kurose and K. Ross. Addison-Wesley, 2000. ISBN 0-201-47711-4. The course textbook from Multimedia Communications Technology, full of useful and overlapping material.
• Operating System Concepts. Abraham Silberschatz, Peter B. Galvin and Greg Gagne. John Wiley & Sons. Sixth edition. If you already have this book from the Operating Systems course, the material on distributed systems is very relevant.


Chapter 2 Lecture Notes

2.1 Introduction
Main Points
• What we are going to cover
• What distributed systems are
• What characteristics well-designed systems have

2.1.1 Course Outline
What are we going to do?
• 1 lecture on computer communications and networks and loosely coupled systems, revising material from the Multimedia Communications Technology course.
• 5 lectures on Remote Procedure Call, distributed objects and classic distributed systems such as NFS
• 3 lectures on integrating distributed systems through the web, content distribution networks and Peer to Peer computing.
• 5 lectures on closely coupled systems for distributed transactions
• 1 lecture on recent advances in networks and distributed systems
The Postgraduate course will get an additional two hours in their seminar slot.

Exercises
Two exercises:
1. TBA (33% weighted)

2. TBA (66% weighted)
The exercises will be in Java, and where appropriate, will be building upon and extending provided skeleton code.
• The exercises will be peer-assessed.
• In the next exercise class after the handin you will be required to mark two of your classmates’ assignments.
• Successfully marking an assignment will provide you with 10 marks towards the possible 100 on that assignment.
• At the end of the term, you will hand in your two assignments along with the assessments from your classmates.
• These will be formally given a course mark out of 40 marks.

Detailed Timetable
See Table 1: Course Timetable 2006. Remember to check Sussex Direct for the times and rooms, as these vary between UG and PG students and from week to week.

MSc Programme
• Lectures shared with undergraduates

• Details to be discussed in the MSc seminar
• Read from reading list
• Work on programming exercises

2.1.2 What’s a Distributed System
A distributed system: physically separate computers working together
• Cheaper and easier to build lots of simple computers
• Easier to add power incrementally
• Machines may necessarily be remote, but system should work together
• Higher availability - one computer crashes, others carry on working
• Better reliability - store data in multiple locations
• More security - each piece easier to secure to right level.

The real world
In real life, can get:
• Worse availability - every machine must be up. “A distributed system is one where some machine you’ve never heard of fails and prevents you from working”
• Worse reliability
• Worse security
Problem: Coordination is more difficult because multiple people involved, and communication is over network.

Your Task: What are the distributed interactions when you login at an x-terminal?
[Figure: Interactions at an X terminal]
[Figure: Simplified interactions]

2.1.3 Example Distributed Systems
Electronic Mail: Mail delivered to remote mailbox. Requires global name space to identify users, transport mechanisms to get mail to mailbox
Distributed Information - WWW: Remote information hidden below hypertext browser. Caching and other features operate transparently


Distributed File System: Files stored on many machines, generally not machine you’re working on. Files accessed transparently by OS knowing they’re remote and doing remote operations on them such as read and write, e.g. Network File System (NFS)
Trading Floor System: Bids made, stocks sold, screens updated.

2.1.4 What do we want from a Distributed System?
1. Resource Sharing
2. Openness
3. Concurrency
4. Scalability
5. Fault Tolerance
6. Transparency
Your Task: order the importance of each of these features for the example systems in the previous slide.

Network assumptions
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn’t change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
(Source: The Eight Fallacies of Distributed Computing - Peter Deutsch)

2.1.5 Elements of a Distributed System
Another way to view a system...
• Communications system
• Messages
• Machines
• Processes on Machines
• Programs
• People

2.1.6 Conclusion
• It’s difficult to design a good distributed system: there are a lot of problems in getting “good” characteristics, not the least of which is people
• Over the next ten weeks you will gain some insight into how to design a good distributed system.

2.2 Bits and Bytes
Main Points
• Bits and Bytes
• Kilo, Mega, Giga, Tera
• Memory, Packets and Bit Manipulation

2.2.1 Bits, Bytes, Integers etc
• All data in computers is held as a sequence of ones and zeros.
• A number is represented as the base 2 number held in some pre-ordained fixed number of bits
• Almost all other data is represented as various sets of numbers.
• A byte is 8 bits sequenced together - what is the maximum number?
• Integers (in Java and most other languages nowadays) are 32 bits long - what is the maximum number?
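As a quick check on the two questions above, here is a minimal Java sketch. The class name `BitRanges` and its two helper methods are illustrative only. One assumption matters for the answer: 8 bits read as an unsigned number run from 0 to 255, but Java's `byte` and `int` are signed (two's complement), so their maxima are 127 and 2^31 - 1.

```java
final class BitRanges {
    /** Largest unsigned value that fits in the given number of bits (1..62). */
    static long maxUnsigned(int bits) {
        return (1L << bits) - 1;
    }

    /** Largest value of a signed two's-complement number of the given width (2..63). */
    static long maxSigned(int bits) {
        return (1L << (bits - 1)) - 1;
    }

    public static void main(String[] args) {
        System.out.println(maxUnsigned(8));   // 255
        System.out.println(maxSigned(8));     // 127, the same as Byte.MAX_VALUE
        System.out.println(maxUnsigned(32));  // 4294967295
        System.out.println(maxSigned(32));    // 2147483647, the same as Integer.MAX_VALUE
    }
}
```

Which of the two answers is "the" maximum depends on whether the top bit is treated as a sign bit, which is exactly the kind of interpretation both ends of a connection have to agree on.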

2.2.2 Prefixes
• In communications we talk often about throughput in bits/second, or moving files of some particular size. We use magnitude prefixes for convenience:
  kilo  1000 ×
  mega  1000000 ×
  giga  10^9 ×
  tera  10^12 ×
• There is often confusion as to whether a kilobyte is 1000 or 1024 bytes. When dealing with processor architectures, it’s generally 1024. When dealing with communications, it’s generally 1000. State assumptions if it is not obvious from context.

2.2.3 Memory and Packets
• A computer stores data as bits in memory.
• When it wants to send this data to another computer, it copies the bits into the memory of the communications device.
• The communications device sends the bits through the network to the other machine (we’ll cover the details of this in the coming week).
• The other machine’s communication device places the bits into memory which the other machine can access.
• The tricky bits come in ensuring that both machines interpret the bits correctly.

2.2.4 Bit Manipulation
• Not only must the data be sent, but also accompanying information allowing the computers to interpret the context of the data.
• Communications software must be able to pick out arbitrary bits from an opaque bit sequence and interpret their meaning.
• We do this using bitwise operations - and, or, exclusive or, negation.

2.3 Foundations of Distributed Systems
Aim: How do we send messages between computers across a network?
Physical Concepts: Bandwidth, Latency
Packet Concepts: Data, Headers

Routing Concepts: Shared Media, Switches and Routing Tables, Routing Protocols
For more detailed notes, see Dan Chalmers’ Multimedia Communications Technology notes in http://www.informatics.sussex.ac.uk/courses/mct/.

2.3.1 Physical Concepts
What is a message?
• A piece of information which needs to move from one process to another, eg a request to open a file, an email message, the results of a remote print request.
For the network, this is just a sequence of bits in memory. Need to communicate this sequence of bits across communications system to other host. Signal to other side whether each bit in sequence is 0 or 1. To communicate, each end needs access to a common substrate, and the sender then changes the value of a physical characteristic of the substrate. Examples - voltage level of a piece of wire, light level in optical fibre, frequency of radio wave in air

Signal characteristics
If physical characteristic has two levels then signal is binary. If it has more than two levels, can encode a number of bits to each level. If 4 levels, then level 0 = 00, level 1 = 01, level 2 = 10, level 3 = 11. If 8 levels, how many bits can be encoded?

Bandwidth
Adjusting the level at one end, and recognising the level has changed at the other, takes finite time. The finite time limits the speed at which bits can be signalled. The rate of signalling is the bandwidth, measured in bits per second. If it takes 20 microseconds to raise the voltage on a wire, and 10 milliseconds to recognise the new level, what is the possible bandwidth if there are two levels? Eight levels?

Noise and errors
Can we get infinite bandwidth by increasing the number of levels? No, since noise makes it impossible to correctly distinguish level. Noise is random changes in the physical characteristic to which all physical phenomena are prone. Always some probability that level will be misinterpreted. Always some probability of error in message.


Goal of communications engineers is to make this probability as low as necessary.

Latency
Does signal propagate along wire infinitely fast? No, the speed of propagation is limited by the speed of light. The resulting delay is described as propagation delay. Latency is time taken for signal to travel from one end of communication system to destination. Since communication system may reconstruct message at intermediate points, time taken to reconstruct message is also part of latency. This is known as switching delay.

Your Task: Describe the bandwidth, propagation delay and switching delay in a game of chinese whispers.
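The bandwidth question in this section (20 microseconds to raise the voltage, 10 milliseconds to recognise the new level) can be worked through with a small Java sketch. It rests on two assumptions that the notes do not state: each signalled level costs the raise time plus the recognise time before the next one can start, and the number of levels is a power of two so that each level carries log2(levels) bits. The class and method names are illustrative.

```java
final class SignallingRate {
    /** Bits carried by one signalled level; levels must be a power of two. */
    static int bitsPerLevel(int levels) {
        return Integer.numberOfTrailingZeros(levels); // log2 for powers of two
    }

    /** Bits per second if every level change takes raiseSeconds + recogniseSeconds. */
    static double bitsPerSecond(int levels, double raiseSeconds, double recogniseSeconds) {
        double secondsPerLevel = raiseSeconds + recogniseSeconds;
        return bitsPerLevel(levels) / secondsPerLevel;
    }

    public static void main(String[] args) {
        double raise = 20e-6;      // 20 microseconds
        double recognise = 10e-3;  // 10 milliseconds
        System.out.println("2 levels: " + bitsPerSecond(2, raise, recognise) + " bit/s");
        System.out.println("8 levels: " + bitsPerSecond(8, raise, recognise) + " bit/s");
    }
}
```

Under these assumptions two levels give roughly 100 bit/s and eight levels roughly 300 bit/s: the recognition time dominates, and more levels help only by packing more bits into each change.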

2.3.2 Packets

If there is an error in a message, how is it detected and rectified?
1. Compute a checksum over the message
2. Send the checksum with the message
3. Calculate a new checksum over the received message
4. Compare the checksums - if different, then message is in error
5. Ask for the message to be resent
Probability of error rises with length of message. Thus message sent in separate lumps with maximum number of bits per lump, known as packets. If the message fits in one packet, good. Otherwise message is in many packets.

Addressing
How do we direct the packet to correct recipient(s)?
• Put header bits on front of packet, analogous to address and other information on envelope in postal system. Add source of packet to allow returns
• Packet layout: Destination | Source | Packet body
• If destination and source share the same physical medium, then destination can listen for its own address (ethernet, wireless, single point to point link)
• If they don’t share the same LAN, we use routing
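The checksum steps and the Destination | Source | Packet body layout can be sketched together in Java. Everything specific here is an assumption made for illustration: one-byte addresses, a one-byte checksum field placed in the header, and a simple additive checksum over the body only. Real protocols use larger address fields and stronger checks such as CRCs.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SimplePacket {
    static final int HEADER_BYTES = 3; // destination, source, checksum (illustrative layout)

    /** Additive checksum: sum of the bytes, kept to 8 bits. */
    static int checksum(byte[] data) {
        int sum = 0;
        for (byte b : data) {
            sum = (sum + (b & 0xFF)) & 0xFF; // mask so each byte is treated as unsigned
        }
        return sum;
    }

    /** Builds destination | source | checksum | body. Addresses are 0..255. */
    static byte[] build(int destination, int source, byte[] body) {
        byte[] packet = new byte[HEADER_BYTES + body.length];
        packet[0] = (byte) destination;
        packet[1] = (byte) source;
        packet[2] = (byte) checksum(body);
        System.arraycopy(body, 0, packet, HEADER_BYTES, body.length);
        return packet;
    }

    static int destination(byte[] packet) { return packet[0] & 0xFF; }

    static int source(byte[] packet) { return packet[1] & 0xFF; }

    static byte[] body(byte[] packet) {
        return Arrays.copyOfRange(packet, HEADER_BYTES, packet.length);
    }

    /** Receiver's check: recompute the checksum and compare it with the one sent. */
    static boolean isIntact(byte[] packet) {
        return (packet[2] & 0xFF) == checksum(body(packet));
    }

    public static void main(String[] args) {
        byte[] packet = build(7, 200, "hello".getBytes(StandardCharsets.US_ASCII));
        System.out.println(isIntact(packet)); // true
        packet[HEADER_BYTES] ^= 0x01;         // flip one bit of the body, as noise might
        System.out.println(isIntact(packet)); // false, so the receiver asks for a resend
    }
}
```

The `& 0xFF` masks are the bitwise "and" from section 2.2.4 at work: Java bytes are signed, so the mask is needed to read a header field back as a value between 0 and 255.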

[Figure: processes in user space on each host use sockets provided by the operating system. A packet carries source and destination ports, source and destination IP addresses, and source and destination MAC addresses. Network interfaces connect hosts through switches on the local network and on to the Internet.]

Names, addresses and routes
Name: Identifier of object, eg Ian Wakeman, Mum
Address: Location of object, eg Rm 4C6, COGS
Route: How to get there from here - “turn left, first right and up stairs, first door on left”

Internet: Name is a Domain Name, such as www.cogs.susx.ac.uk. More later. To get packet through network, turn name into address by asking the Domain Name System (DNS). Place address in packet header and hand to nearest router. Router locates MAC address corresponding to IP address, if they’re on the same LAN.

Indirect Addressing

Architecture of a switch
A switch is a specialised computer. Can turn a PC into a router, using free software.
1. A packet arrives on a link
2. The switch gets the destination of the packet from the packet header
3. The switch looks up the destination in a routing table in memory and discovers the output link
4. The switch queues the packet to be sent on the output link
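The four steps of the switch can be sketched as a table lookup followed by a queue insertion. This is a minimal illustration, not how any real switch is implemented: the one-byte destination at the start of the packet, the integer link numbers and the choice to drop packets with no known route are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

final class SwitchSketch {
    /** Routing table: destination address to output link number. */
    private final Map<Integer, Integer> routingTable = new HashMap<>();
    /** One queue of waiting packets per output link. */
    private final Map<Integer, Queue<byte[]>> outputQueues = new HashMap<>();

    void addRoute(int destination, int outputLink) {
        routingTable.put(destination, outputLink);
        outputQueues.computeIfAbsent(outputLink, link -> new ArrayDeque<>());
    }

    /**
     * Steps 2 to 4: read the destination from the header, look it up, queue the packet.
     * Returns the chosen output link, or -1 if no route is known (the packet is dropped).
     */
    int forward(byte[] packet) {
        int destination = packet[0] & 0xFF; // assumed: first header byte is the destination
        Integer link = routingTable.get(destination);
        if (link == null) {
            return -1;
        }
        outputQueues.get(link).add(packet);
        return link;
    }

    /** Number of packets waiting on the given output link. */
    int queued(int outputLink) {
        Queue<byte[]> queue = outputQueues.get(outputLink);
        return queue == null ? 0 : queue.size();
    }

    public static void main(String[] args) {
        SwitchSketch sw = new SwitchSketch();
        sw.addRoute(7, 2);
        System.out.println(sw.forward(new byte[] {7, 1, 42})); // 2
        System.out.println(sw.forward(new byte[] {9, 1, 42})); // -1, no route to 9
        System.out.println(sw.queued(2));                      // 1
    }
}
```

The queue per output link is where the variable delay and loss discussed later come from: if packets arrive faster than a link can send them, the queue grows until something has to be dropped.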

[Figure: Architecture of a switch. Packets arrive on an input link; the processor looks up the packet destination in the routing table (destination to link) and sends the packet on one of the output links.]

Distributed Routing Table Maintenance
The situation:

• People are network administrators (and end users)
• Communication systems are links (possibly multipoint)
• Machines are switches
• Messages may be lost

The problem: Given an address, how does a router know which output link to send the packet on?
Choices:
1. Packet could contain list of all output links - source routing. Requires source to lookup and determine route. May be good thing.
2. Router could look up address in local routing table, and send out of corresponding link. If not in table, send out default link.

How do we construct tables? Could install entries by hand, known as static routing. But
• Limited to entries people know about.
• Unable to ensure consistency and absence of loops
• Unable to respond to changing topology, sharks gnawing through undersea cables etc.
• Internet has no centralised authority
So we use distributed routing algorithm

Distance Vector Routing
1. Each switch knows its own address
2. Each link has “cost”, such as a value of 1 per link, or measure of delay.
3. Each switch starts out with distance vector, consisting of 0 for itself, and infinity for everyone else
4. Switches exchange distance vectors with neighbour switches, and whenever info changes
5. Switch saves most recent vector from neighbours
6. Switch calculates own distance vector by examining cost from neighbour and adding cost of link to neighbour
7. Use link with minimum cost to destination as link to route out.
Examples include RIP and BGP.

Link State Routing
1. Each switch knows addresses that are direct neighbours
2. Switch constructs packets saying who are neighbours - link state packets.
3. Link state packets flooded to all other switches
4. Switch constructs complete graph using most recent link state packets from all other switches
5. Use Dijkstra shortest path to figure out routing table.
Examples include OSPF.

2.3.3 Conclusion: Network Properties
Packet Switched Networks present certain fundamental problems to the distributed systems programmer:
• Whilst switches converge on a consistent set of routes, packets will bounce around in the network, and suffer delays or get dropped.
• Switches shared with other packets, therefore get queues.
• Total latency is therefore variable.
• Queue size management (aka congestion control) changes the bandwidth available to machines. Therefore bandwidth is variable.
• Loss rate is variable (noise, available switch buffers).
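Looking back at the distance vector steps earlier in this section, steps 3 and 5 to 7 can be sketched in Java as follows. This is an illustration under stated assumptions rather than the algorithm of any particular protocol: switches are named by strings, a destination missing from a vector stands for "infinity", and the exchange of vectors between switches (step 4) is left to the caller. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class DistanceVectorSwitch {
    static final int INFINITY = Integer.MAX_VALUE;

    private final String self;
    private final Map<String, Integer> linkCost = new HashMap<>();                  // neighbour to link cost
    private final Map<String, Map<String, Integer>> savedVectors = new HashMap<>(); // latest vector per neighbour
    private Map<String, Integer> distance = new HashMap<>();                        // destination to best cost
    private Map<String, String> nextHop = new HashMap<>();                          // destination to neighbour

    DistanceVectorSwitch(String self) {
        this.self = self;
        distance.put(self, 0); // step 3: 0 for itself, everyone else implicitly infinite
    }

    void addLink(String neighbour, int cost) {
        linkCost.put(neighbour, cost);
    }

    /** Steps 5 to 7: save the neighbour's vector, recompute our own. Returns true if ours changed. */
    boolean receive(String neighbour, Map<String, Integer> vector) {
        savedVectors.put(neighbour, new HashMap<>(vector));
        Map<String, Integer> newDistance = new HashMap<>();
        Map<String, String> newNextHop = new HashMap<>();
        newDistance.put(self, 0);
        for (Map.Entry<String, Map<String, Integer>> saved : savedVectors.entrySet()) {
            Integer cost = linkCost.get(saved.getKey());
            if (cost == null) {
                continue; // vector from a switch we have no link to
            }
            for (Map.Entry<String, Integer> entry : saved.getValue().entrySet()) {
                String destination = entry.getKey();
                int advertised = entry.getValue();
                if (destination.equals(self) || advertised == INFINITY) {
                    continue;
                }
                int total = cost + advertised; // cost from neighbour plus cost of link to neighbour
                if (total < newDistance.getOrDefault(destination, INFINITY)) {
                    newDistance.put(destination, total);
                    newNextHop.put(destination, saved.getKey());
                }
            }
        }
        boolean changed = !newDistance.equals(distance);
        distance = newDistance;
        nextHop = newNextHop;
        return changed;
    }

    int distanceTo(String destination) {
        return distance.getOrDefault(destination, INFINITY);
    }

    String nextHopTo(String destination) {
        return nextHop.get(destination);
    }

    public static void main(String[] args) {
        DistanceVectorSwitch a = new DistanceVectorSwitch("A");
        a.addLink("B", 1);
        a.addLink("C", 5);
        Map<String, Integer> fromB = new HashMap<>();
        fromB.put("B", 0);
        fromB.put("D", 3);
        a.receive("B", fromB);
        System.out.println(a.distanceTo("D") + " via " + a.nextHopTo("D")); // 4 via B
    }
}
```

A `true` result from `receive` is the "whenever info changes" trigger in step 4: the switch would then send its new vector to its neighbours. The sketch recomputes from the saved vectors each time, so a neighbour advertising a worse cost is handled as well as one advertising a better cost.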

2.4 Operating System Support
Aim: What do we need from an operating system to support distributed systems?
Main Points:
Layering: An organisational principle for modular communications
Protocol techniques: Acknowledgements, timers, windows etc
Operating Systems Abstractions: Processes, threads, sockets, daemons

2.4.1 Protocols
Protocol: Agreement between two (or more) parties as to how information is to be transmitted. At minimum, will include the interpretation of the bits in the packet. May include finite state machine movements between senders and receivers. Protocol information is carried in headers at the front of packets (or trailers at the end).

2.4.2 Layering
Networks are used to transmit messages between processes. Basic service provided by network - variable loss, bandwidth, latency. Need to layer other services on top. Protocols used to give messaging functionality

Table 2.2: Packets and messages

Packets in reality     Messaging abstraction
Limited Size           Arbitrary Size
Unordered (sometimes)  Ordered
Unreliable             Reliable
Machine to machine     Process to process
Local Area Net         Routed anywhere
Asynchronous           Synchronous
Insecure               Secure

Reason for Layering
• Easier to build functions with higher abstractions.
• Define ordering of abstractions to simplify necessary functions
• Provides modularity
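To make the modularity point concrete, here is a minimal Java sketch of layering by encapsulation. Each layer adds its own header on the way down and removes it on the way up, treating everything inside as opaque data. The layer names follow the familiar Internet stack, but the bracketed text tags standing in for headers are placeholders invented for this example, not real header formats.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class LayeringSketch {
    /** A layer prepends its own header; the payload is opaque to it. */
    static byte[] wrap(String header, byte[] payload) {
        byte[] h = header.getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[h.length + payload.length];
        System.arraycopy(h, 0, out, 0, h.length);
        System.arraycopy(payload, 0, out, h.length, payload.length);
        return out;
    }

    /** The matching layer at the receiver checks and strips exactly its own header. */
    static byte[] unwrap(String header, byte[] data) {
        byte[] h = header.getBytes(StandardCharsets.US_ASCII);
        if (data.length < h.length || !Arrays.equals(h, Arrays.copyOfRange(data, 0, h.length))) {
            throw new IllegalArgumentException("missing header " + header);
        }
        return Arrays.copyOfRange(data, h.length, data.length);
    }

    /** Sending: transport wraps the message, then internet, then network interface. */
    static byte[] send(byte[] message) {
        return wrap("[ETH]", wrap("[IP]", wrap("[TCP]", message)));
    }

    /** Receiving: the layers unwrap in the opposite order. */
    static byte[] receive(byte[] frame) {
        return unwrap("[TCP]", unwrap("[IP]", unwrap("[ETH]", frame)));
    }

    public static void main(String[] args) {
        byte[] frame = send("hello".getBytes(StandardCharsets.US_ASCII));
        System.out.println(new String(frame, StandardCharsets.US_ASCII));          // [ETH][IP][TCP]hello
        System.out.println(new String(receive(frame), StandardCharsets.US_ASCII)); // hello
    }
}
```

Because `wrap` and `unwrap` never look inside the payload, one layer can be replaced without touching the others, which is the modularity argument in the list above.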

[Figure: Layering in the Internet. Layers: Applications, Transport, Internet, Network Interface, Underlying Network. Data units: Message; Datagrams (UDP) or Streams (TCP); UDP or TCP packets; IP packets; Network-specific frames.]
[Figure: The Headers in an Ethernet Frame. The application message is carried as TCP data behind a TCP header, which is carried as IP data behind an IP header, which is carried as the data of an Ethernet frame behind an Ethernet header.]

2.4.3 Reliable Transmission: The Basic Techniques
These are some of the basic protocol techniques that are used in lower layer protocols. They are often reused in higher level protocols.

Labelling
• Split up message into smaller chunks.
• Place label in header indicating which part of message it is, eg “abcdefg” → 1 of 3 “abc”, 2 of 3 “def”, 3 of 3 “g”
• Labels typically sequential fixed field integers

Acknowledging
To tell the sender data has been received correctly (after checking checksums etc), use acknowledgements.

• When the data packet is received, send an acknowledgement message back to the sender.
• Acknowledgement message contains the label of the message being acknowledged.
• Acknowledgement of a packet can sometimes implicitly acknowledge reception of all previously sent packets (Go back N)
• Multiple labels can be sent in the acknowledgement (Selective acknowledgement)
• If data is being returned to the sender, acknowledgement information can be piggybacked on the return data packet. Acknowledgement information is then part of the header, eg TCP

Timeouts and retransmission
• The sender measures the expected time between sending a message and receiving an acknowledgement
• Sender starts a timer after sending a packet.
• If the timer expires before an acknowledgement is received, the packet is resent
• Receiver must be able to deal with duplicate packets
Questions
1. How can the expected round trip time be measured?
2. What value should the timer be set to?

Negative Acknowledgement
Sometimes, the pattern of data exchange makes it easier to use negative acknowledgements
• If data-flow is constant, receiver knows when a packet is expected.
• Can send a negative acknowledgement (NAK) indicating that an expected message hasn't been received

Windowing
• To increase utilization, and decrease acknowledgement overhead, multiple packets can be sent before the sender waits for an acknowledgement
• The number of packets that can be sent is the window.
• Window needs to be limited to avoid overloading network or receiver.
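The labelling, acknowledgement and retransmission techniques above can be sketched together as a stop-and-wait sender (a window of one packet). Everything here is a simplified assumption made for illustration: the class StopAndWait, the Fate enum that scripts which transmissions lose the data or the acknowledgement, and the use of a loop iteration to stand in for a timer expiring.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.TreeMap;

// Hypothetical simulation: labelled chunks, acknowledgements, and resending
// when no acknowledgement arrives. No real network or timer is involved.
class StopAndWait {
    // Scripted outcome of one transmission.
    enum Fate { OK, DATA_LOST, ACK_LOST }

    static class Receiver {
        private final TreeMap<Integer, String> chunks = new TreeMap<>();
        int duplicates = 0;

        // Store the chunk under its label; a repeated label is a duplicate and
        // is ignored. The returned label acts as the acknowledgement.
        int deliver(int seq, String data) {
            if (chunks.putIfAbsent(seq, data) != null) duplicates++;
            return seq;
        }

        // Reassemble the message in label order.
        String message() { return String.join("", chunks.values()); }
    }

    // Sends the message chunk by chunk and returns the number of transmissions.
    static int send(String message, int chunkSize, Receiver rx, Iterator<Fate> fates) {
        int total = (message.length() + chunkSize - 1) / chunkSize;
        int transmissions = 0;
        for (int seq = 1; seq <= total; seq++) {
            int start = (seq - 1) * chunkSize;
            String data = message.substring(start, Math.min(message.length(), start + chunkSize));
            boolean acked = false;
            while (!acked) {
                transmissions++;
                Fate fate = fates.hasNext() ? fates.next() : Fate.OK;
                if (fate == Fate.DATA_LOST) continue;      // "timer expires": resend
                int ack = rx.deliver(seq, data);
                acked = (fate == Fate.OK) && (ack == seq); // a lost ack also forces a resend
            }
        }
        return transmissions;
    }

    public static void main(String[] args) {
        Receiver rx = new Receiver();
        Iterator<Fate> fates = Arrays.asList(
                Fate.OK, Fate.DATA_LOST, Fate.OK, Fate.ACK_LOST, Fate.OK).iterator();
        int sent = send("abcdefg", 3, rx, fates);
        System.out.println(rx.message() + ": " + sent + " transmissions, "
                + rx.duplicates + " duplicate(s)");
    }
}
```

The lost acknowledgement is the case that produces a duplicate at the receiver, which is why the receiver must key chunks by label rather than append whatever arrives.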

Careful adjustment of the window size is key to avoiding and controlling congestion, and to dynamic performance (slow start in TCP).

State Synchronization
• Both ends of the protocol exchange typically need to agree on some starting state
• In labelling systems, both ends need to agree on initial label value
• Use a handshake of messages containing suggestions for state, and then return messages agreeing the value of the state.

2.4.4 Group Communication
• Often, sender wants the same packet replicated to multiple receivers eg game updates, mirroring etc
• Network offers multicast to provide this functionality.
• Receivers join a particularly addressed group - class D addresses in IP, 1110XXXX.XXXXXXXX.XXXXXXXX.XXXXXXXX
• Network conspires to deliver packets sent to this address efficiently to all receivers.
• Available in Local Area, not always available in wide area (unfortunately).

2.4.5 Sockets
• Sockets are the near-universal abstraction for using TCP and UDP
• Provides data structures for holding addresses and other context information, and methods for sending and receiving data.
• Clients use sockets to talk to servers, often known as daemons (1)

(1) From the Hackers Dictionary: daemon /day'mn/ or /dee'mn/ n. [from the mythological meaning, later rationalized as the acronym 'Disk And Execution MONitor'] A program that is not invoked explicitly, but lies dormant waiting for some condition(s) to occur. The idea is that the perpetrator of the condition need not be aware that a daemon is lurking (though often a program will commit an action only because it knows that it will implicitly invoke a daemon). [...] Daemons are usually spawned automatically by the system, and may either live forever or be regenerated at intervals.

Operating Systems Background

[Figure: Operating systems background. In user space, each process has an address space (thread stacks, heap, code etc), threads of control, and file and socket descriptors. Sockets lead into the operating system kernel, which holds TCP and UDP over IP over the network interfaces.]

Socket Usage: Single threaded TCP Server

try {
    Create a socket
    Bind the address to the socket
    loop {
        Accept connections on socket
        Receive data on connection
        do some work
        Send Response
        Close connection on socket
    }
} catch and deal with any exceptions

Socket Usage: TCP client

try {
    Create a socket
    Bind the address of the server to the socket
        // Allows the operating system to choose port
        // and address for the receiver
    Open connection on socket    // Connection setup
    Send data on connection
    Wait for Response
    Close connection on socket
} catch and deal with any exceptions

Gotchas
• The chunk of data written to a socket is not necessarily read as a chunk from the remote socket - packetisation may fragment the chunks.

• Both ends have to agree on how to interpret the data written and read from the socket - the concrete syntax of the data stream.
• Data interpretation is through application standards defining bit fields eg the RFCs defining HTTP, SMTP, FTP etc
• Interpreting data can require very messy bit manipulation.

Socket Usage: Concurrent TCP Server
• A single threaded server can only deal with one client at a time
• This is often unacceptable in terms of performance eg web servers

try {
    Create a socket
    Bind the address to the socket
    loop {
        Accept connections on socket
        spawn a worker thread to deal with connection
    }
} catch and deal with any exceptions

and the worker thread goes

try {
    Receive data on Connection
    Do some work
    Send response
    Close Connection
} catch and deal with any exceptions

Thread Pools
• Spawning a thread for every connection can consume too many resources if connections come too quickly
• An alternative approach is to have a fixed size pool of threads
• When a connection is received, pass the connection to the next available thread from the pool
• When all the threads are busy, server blocks and doesn't accept more connections
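Returning to the socket gotchas above: because a stream socket does not preserve the boundaries of the chunks written to it, the two ends need an agreed concrete syntax for finding where each message ends. One common convention, sketched here as an assumption rather than anything these notes prescribe, is to send a 4-byte length before each message. The class Framing is invented for the example, and in-memory streams stand in for the socket's streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical length-prefixed framing over a byte stream.
class Framing {
    // Sender: each message is a 4-byte big-endian length, then its UTF-8 bytes.
    static byte[] encode(String... messages) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(wire);
            for (String message : messages) {
                byte[] body = message.getBytes(StandardCharsets.UTF_8);
                out.writeInt(body.length);
                out.write(body);
            }
            return wire.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Receiver: recover the message boundaries from the undifferentiated bytes.
    // available() is only a reliable end-of-data test for in-memory streams;
    // with a real socket the loop would end on end-of-stream instead.
    static List<String> decode(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            List<String> messages = new ArrayList<>();
            while (in.available() > 0) {
                byte[] body = new byte[in.readInt()];
                in.readFully(body); // keeps reading until the whole body has arrived
                messages.add(new String(body, StandardCharsets.UTF_8));
            }
            return messages;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(encode("GET /index.html", "bye")));
    }
}
```

With a real connection the same two loops would wrap the socket's output and input streams; readFully is what copes with a message arriving in several fragments.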

2.4.6 Request and Response
• Design patterns for client server communications are very stereotyped.
• Can we automatically generate client server code?
• Yes!
• We can model a server as an object waiting for methods to be called
• Client then obtains reference to object and calls methods
• Distributed Object Systems and Remote Method Invocation (next lecture)

2.4.7 Conclusion
• Layering is a modularisation approach allowing services to improve upon the services offered by lower services.
• Protocol techniques from lower layers are often re-used again and again (see the "End to End Argument" paper).
• Sockets encapsulate TCP/UDP endpoints and can be used to construct clients and servers.

2.5 Remote Procedure Call (RPC)

Main Points
• Send/Receive
• One vs Two-way communications
• Remote Procedure Call
• Cross address space vs Cross machine communication

2.5.1 Send/Receive
• How do you program distributed applications?
• Need to synchronise multiple threads, which run on different machines (can't use test&set at bottom as in a single machine)
• Instead the atomic operations are Send and Receive - doesn't require shared memory for synchronising cooperating threads.
• Mailbox - temporary holding area for messages (ports)

Ideal abstractions

The send abstraction

send(mbox, message)

Send a message, possibly over network, to specified mailbox
When does send return?
1. When Receiver process gets message?
2. When message is safely buffered on destination machine?
3. Immediately, if message is buffered on source node?
Choice depends upon system designer.

The receive abstraction

receive(mbox, buffer)

Wait until mbox has message, then copy message into buffer

In this abstraction, send and receive are atomic:
• never get portion of a message (all or nothing) - need to ensure buffer is of sufficient size
• two receivers can't get the same message - there is local synchronization on the mbox
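Within a single machine, the mailbox abstraction can be approximated with a bounded blocking queue. The sketch below is an illustration only: the class Mailbox and the method produceAndConsume are invented names, and send here returns as soon as the message is buffered in the mailbox, which is just one of the three choices listed above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical in-process mailbox: a bounded buffer with blocking send/receive.
class Mailbox<T> {
    private final BlockingQueue<T> slots;

    Mailbox(int capacity) {
        slots = new ArrayBlockingQueue<>(capacity);
    }

    // Blocks while the mailbox is full; returns once the message is buffered.
    void send(T message) {
        try {
            slots.put(message);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Blocks until a message is present, then removes and returns all of it.
    // The queue's own locking ensures two receivers never get the same message.
    T receive() {
        try {
            return slots.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // One producer thread sends 1..count; the caller consumes and sums them.
    static int produceAndConsume(int count) {
        Mailbox<Integer> mbox = new Mailbox<>(2);
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= count; i++) mbox.send(i);
        });
        producer.start();
        int sum = 0;
        for (int i = 0; i < count; i++) sum += mbox.receive();
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(produceAndConsume(10));
    }
}
```

Neither side checks for space or for waiting messages; the blocking behaviour of send and receive does the synchronisation.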

2.5.2 Message Styles

• 1 way - messages flow in one direction (BSD Unix pipes)
• 2 way - request response (Remote Procedure Call)

1 Way example

Producer:
    char msg1[100]    // maximum message size 100 bytes
    while(1)
        prepare message    /* make coke */
        send(mbox, msg1)

Consumer:
    char msg2[100]
    while(1)
        receive(mbox, msg2)
        process message    /* drink coke */

Producer/consumer doesn't worry about space in mailbox - handled by send/receive forcing process to block if no space

Request Response
• Example: Read a file on a remote machine
• Also known as client server - client=requester, server=responder.
• Server provides "service" (file storage) to client

Client: (requesting the file)
    char response[1000];
    send(mbox1, "read /etc/passwd")
    receive(mbox2, response)

Server:
    char command[1000], answer[1000];
    receive(mbox1, command)
    decode command
    read file into answer
    send(mbox2, answer)

Server has to decode command, just as OS has to decode message to find mailbox
What if file too big for response - then use a big message protocol (eg TCP)

2.5.3 Remote Procedure Call

• Call a procedure on a remote machine
• client calls: rpc_read("/tsunb/random.txt")
  translates this into call on server: read("/tsunb/random.txt")

RPC Implementation
• Request Response Message passing
• "stub" provides glue on client and server
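To make the role of the stubs concrete, here is a minimal in-process sketch. All names (MiniRpc, ClientStub, ServerStub, rpcRead) and the text message format of a procedure name, a newline, then one argument are assumptions for illustration. The "network" is simply a function that takes a request message and returns the reply.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical RPC stubs over request/response message passing.
class MiniRpc {
    // Server stub: unbundles the request, calls the procedure, bundles the reply.
    static class ServerStub {
        private final Map<String, Function<String, String>> procedures = new HashMap<>();

        void register(String name, Function<String, String> procedure) {
            procedures.put(name, procedure);
        }

        String handle(String request) {
            int split = request.indexOf('\n');
            String name = request.substring(0, split);
            String argument = request.substring(split + 1);
            Function<String, String> procedure = procedures.get(name);
            if (procedure == null) return "ERR\nno such procedure: " + name;
            return "OK\n" + procedure.apply(argument);
        }
    }

    // Client stub: to the caller this looks like an ordinary local method.
    static class ClientStub {
        private final Function<String, String> transport; // send request, wait for reply

        ClientStub(Function<String, String> transport) {
            this.transport = transport;
        }

        String rpcRead(String path) {
            String reply = transport.apply("read\n" + path); // bundle args, send, wait
            int split = reply.indexOf('\n');
            String status = reply.substring(0, split);
            String body = reply.substring(split + 1);
            if (!status.equals("OK")) throw new IllegalStateException(body);
            return body;                                     // unbundle and return
        }
    }

    public static void main(String[] args) {
        ServerStub server = new ServerStub();
        // Stands in for a real read(); no file is actually opened.
        server.register("read", path -> "contents of " + path);
        ClientStub client = new ClientStub(server::handle);
        System.out.println(client.rpcRead("/tsunb/random.txt"));
    }
}
```

Swapping the transport function for one that writes to a socket and waits for the reply would leave the caller of rpcRead unchanged, which is the point of the stub.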

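[Figure: RPC call path. The client (caller) calls the client stub, which bundles the args and sends them through the packet handler and network transport. On the server, the packet handler receives the request, the server stub unbundles the args and calls the server (callee). On return, the server stub bundles the return values and sends them back; the client stub receives and unbundles them and returns to the caller.]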
RPC Pseudo Code

Client stub:
    build message
    send message
    wait for response
    unpack reply
    return result

Server stub:
    create N threads to wait for work to do
    while(1)
        wait for command
        decode and unpack request parameters
        call procedure
        build reply with results
        send reply

RPC and procedure call
In a normal procedure call
1. arguments pushed on stack,
2. name converted to address,
3. return address pushed on stack,
4. results either in register or on stack.
Equivalent:
• Parameters - request message
• Result - reply message
• Name of procedure - passed in request message
• return address - caller's mailbox

Implementation issues
Stub generator - generates stubs automatically.
• Uses procedure signature: name, types of arguments and return values.
• Generates code on client to pack and send message, and on server to receive and unpack message

How does client know where to send message?
Definition: Binding is linking a service to a location.
static: fixed at compile time (C)
dynamic: fixed at runtime (lisp, rpc)
Dynamic binding in rpc done via name service, which provides translation of service ⇒ mailbox
Runtime binding allows
Access control: check who is permitted to use service
Fail-over: if server fails, use another

What if there are multiple servers? Can they use same mailbox? Yes, as long as no state is carried forward from one call to the next. Counter-example: open, seek, read, close - each uses context of previous operation

2.5.4 Cross domain communication

If processes are in different address spaces, RPC can be used to communicate between processes on the same machine as well as on different machines

Microkernel Operating Systems
Conventional monolithic structure: OS has all services running in kernel space. Difficult to design because of cross dependence, difficult to debug.
Idea: use rpc to communicate between services. Split kernel up into application servers, running in separate address spaces. Kernel then becomes message passer. All requests to the operating system work via rpc mechanisms. Examples include Mach, QNX.

[Figure: Monolithic structure compared with microkernel services. Monolithic: applications above a kernel containing file system, VM, windowing, threads and networking. Microkernel: applications, file system and window servers communicate by RPC over a microkernel providing threads and address spaces.]

Advantages
Fault Isolation: Bugs are more isolated (firewall)
Enforces Modularity: Allows incremental upgrade of pieces of software
Location Transparent: Service can be remote or local eg X

2.5.5 Problems with RPC

RPC provides location transparency except
Failures: More failure modes in dist sys than on single machine, since machines or network may crash. Need support for exceptions
Performance: Cost of procedure call << same machine rpc << network rpc. Programmers must be aware RPC is cheap but not free. Caching helps, but failures more complex.
We will examine problems more fully in the coming lectures.

2.5.6 Summary
• Remote procedure call hides message request response exchanges between client and servers behind programming semantics which resemble normal procedures

• Remote Procedure Call provides good abstraction for programming distributed systems across address spaces and across machines - but comes at a performance cost.

2.6 Distributed Objects: The Java Approach

Main Points
• Why distributed objects
• Distributed Object design points
• Java RMI
• Dynamic Code Loading

2.6.1 What's an Object?

An Object is an autonomous entity having state manipulable only by a set of methods

public interface BankAccount extends Remote {
    public void deposit(float amount) throws RemoteException;
    public void withdraw(float amount) throws RemoteException;
    public float balance() throws RemoteException;
    ...
}

2.6.2 Why Distributed Objects?

Distributed Systems multiplies complexity
• multiple machines
• multiple people
• multiple organisations
• ...
Large amount of communication between system designers in producing distributed systems. Problem is how to manage complexity at design time

Software Engineering
Software design should produce well-engineered software which satisfies requirements:
• Comprehensible, so that it's easy to maintain and modify. Easier to test
• Reusable, cheaper than rebuilding and fewer bugs

Objects as a basis for distributed systems give you techniques to manage complexity:
Abstraction: hide unnecessary details, so keep system comprehensible
Encapsulation: allows elements to be extracted ⇒ comprehensibility and reusability
Concurrency control: allows easy management of concurrent activities

2.6.3 How to build Distributed Object Systems

What are the various entities?
• Programmers using existing services
• Programs running on various machines offering services
• Packets using RPC protocol to invoke methods in programs

How do we communicate between these things?

[Figure: programmers communicate with programs through the Interface Definition Language; programs communicate with packets through the rpc system, using a concrete syntax and rpc protocols.]

2.6.4 Objects and RPC systems

No real distinction between distributed method invocation and rpc systems.

Pure object systems
• Provides dynamic binding through name service, possibly with migration and other features
• Protocol processing can be part of OS, which allows asynchronous processing when appropriate
• Examples include Java Remote Method Invocation (RMI), Corba

Static rpc systems such as Sun rpc
• Binding of services to machine by programmer
• Synchronous processing since protocol processing in user thread

2.6.5 Java RMI
• Java has RPC built in as Remote Method Invocation
• No separate IDL - uses Java for interface definition

[Figure: Java RMI. Components: client JVM (client object, remote stub instantiated through classloader), rmiregistry, Code Repository, server JVM (generic dispatcher, remote object). Numbered steps 1 to 7 cover: bind name, lookup name, get classes, call method, remote call over TCP, method invoke.]

Remote Interfaces
• An interface in Java specifies a set of methods that the object implementing that interface will provide
• Java RMI uses interfaces which extend java.rmi.Remote as a way of specifying which methods can be invoked remotely.

public interface Foo extends Remote {
    public void myRemoteMethod() throws RemoteException;
}

public class Bar extends RemoteObject implements Foo {
    public void myRemoteMethod() throws RemoteException {
        ...
    }
    ...
}

Remote Objects and Remote References
• To use a remote object, an object must acquire a remote reference, typically as a method result, or as a field in another result object.
• In Java, a remote reference looks just like a normal object reference.
• To provide the necessary communications code, a remote object must extend java.rmi.server.RemoteObject or one of its subclasses.
• Java will then provide remote references to the object when a reference is passed out of the local JVM.

Stub files and generic method dispatch
• To invoke a remote method, code acting as a proxy or stub for the remote object must run on the local machine.
• This code implements the appropriate interfaces, and marshalls the required method and arguments before sending them as a byte stream on a TCP connection to the remote machine.
• rmic is the tool that generates the stub file from the implementation of the remote object.
• At the remote end, a generic dispatcher uses reflection to determine which method is wanted, and calls the invoke method from the reflection package to call the method.
• Results or exceptions are then returned to the caller.

Java Distributed Garbage Collection
• Garbage collection (GC) is the removal of objects when they are no longer needed.
• Single address space GC basically checks for references to objects. If no references are found, the object is removed.

• Distributed GC is complicated because the traffic to check all possible references is infeasible - references can be passed arbitrarily from machine to machine.
• Instead, a remote reference corresponds to a proxy in the local machine. The proxy informs the remote object it holds a reference.
• When the proxy is GCed, it tells the remote object that it no longer holds a reference.
• When a remote object knows of zero proxies, it is a candidate for GC.

Parameter and Result Passing
• In local method invocation, object references are passed as arguments and results - call-by-reference.
• In remote method invocation, only objects which are accessible remotely can be passed by reference.
• Other objects must be passed by value - call-by-value - and instantiated as copies on the remote machine
• Objects passed by value must be capable of being passed by value - ie they must support the java.io.Serializable interface.
• Serializable objects and their associated object graph can be flattened into a byte array.
• If an object implements Serializable and all of its references are Serializable, it can be passed by value

Remote Exceptions
• The number and probability of failure modes are far higher in distributed systems.
• The designers of rmi decided to make this explicit by forcing programmers to deal with a possible RemoteException in all remote invocations.
• Therefore all methods in a Remote interface must throw RemoteException.

RMIRegistry
• How do classes get the initial remote reference (bootstrap)?
• Remote objects bind themselves against a given textual name (eg myRemoteName) with the rmiregistry
• Objects can then resolve the name remotely by querying the rmiregistry.

• The rmiregistry will return a remote reference, and the location of the relevant server class files - the interface and the stub files.

Downloading of Classes
• The layout of classes and the bytecodes for implementing class methods are detailed in class files.
• Class files are loaded on demand as objects are created or static methods are invoked.
• Normal class loading comes through the default classloader, which searches the CLASSPATH.
• Additional classloaders can be used by programmers to load class files from more exotic places.

ClassLoaders
• Rmi must allow the interface and stub files for remote objects to be downloaded over the network - uses the RMIClassLoader.
• Code loaded from arbitrary places is a security risk.
• Java provides for a security policy to be defined for a classloader so that all classes from that classloader can have their actions sandboxed.
• Typically, these actions are network access, file access, screen access etc, and are specified in java policy files.

Activation
• Using an active thread continuously for an object which is accessed infrequently may be a poor use of resources - consider machines with millions of objects.
• Instead, allow objects to change state from active to passive and vice versa.
• When active, they are normal remote objects.
• When passive, the object's state is stored in persistent storage eg a file, and responsibility for accepting calls to that object is handed over to an activator.
• When the activator receives a call, it creates a new instance of the object and instantiates its state from its stored state. This is hidden from the programmer.
• Compare to the use of inetd to control typical Internet services such as ftp, telnet etc.
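The call-by-value rule described under Parameter and Result Passing can be seen without any network at all, by flattening a Serializable object to bytes and rebuilding it. The sketch below uses the standard java.io object streams. The class names PassByValue and Person are invented for the example, and this only imitates the copying step, not RMI's actual wire protocol.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical demonstration of call-by-value through serialization.
class PassByValue {
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;

        Person(String name) {
            this.name = name;
        }
    }

    // Flatten an object (and the graph it references) into a byte array.
    static byte[] flatten(Serializable object) {
        try (ByteArrayOutputStream bytes = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Instantiate a fresh copy from the flattened bytes, as a receiver would.
    static Object rebuild(byte[] flattened) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(flattened))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    static Person copyByValue(Person person) {
        return (Person) rebuild(flatten(person));
    }

    public static void main(String[] args) {
        Person original = new Person("Ada");
        Person copy = copyByValue(original);
        copy.name = "Bob"; // changes the copy only
        System.out.println(original.name + " / " + copy.name);
    }
}
```

A method that receives such a copy cannot affect the caller's object, which is the practical difference from passing a remote reference.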

2.6.6 Summary
• Described the key elements of Java RMI.
• Refer to these in using rmi to help in understanding some of the problems that occur.
• Other possible choices for distributed objects in the next lecture

2.7 Enterprise computing and Corba

Main Points
• What are computers used for?
• Three tier models
• CORBA

2.7.1 Computers in Business
• The hard job in commercial computing is not writing word processors, but to integrate the various business activities - selling, buying, managing, coordinating production, ...
• The key asset in business computing is the data that the business has built up over the years, generally held in a database management system.
• Business computing therefore has to integrate the front end activities carried out by people with the data held in the databases.
• Changes in the data and requests from the front end must be interpreted according to the business logic.

2.7.2 Three Tier Models and the Web

[Figure: Three tier model. View Objects (XML and friends) connect over the Network to Server Objects; Business Objects communicate through ORBs with Legacy Applications, Factory Processes and the DBMS.]

The view objects
• The front end is the user interface, which will provide various views, depending on function.
• Functions include views in the warehouse of the stock control system, Point Of Sale views for the tills, business overviews for decision support systems.
• These can be a mixture of xml, html, javascript and java in a web browser.

• Or they can be fully blown applications talking directly back to the server objects

Server objects and middleware

[Figure: Three tier model: View Objects, Network, Server Objects, Business Objects, ORBs, DBMS, Factory Processes, Legacy Applications.]

• The server objects encapsulate the business logic - how the business uses its data.
• This is subject to tweaking and change as the business evolves - ie a high maintenance activity.
• Object and particularly component based approaches are most useful here.
• The CORBA object standards intend to provide business objects to support the implementation of business logic in a component framework.

Legacy systems and software wrappers

[Figure: Three tier model: View Objects, Network, Server Objects, Business Objects, ORBs, DBMS, Factory Processes, Legacy Applications.]

• Most companies have a substantial investment in their existing software, and this software already does its job
• To integrate this legacy software into new systems requires software wrappers that can speak to the new systems.
• ...

2.7.3 CORBA
• CORBA is a middleware design to allow application programs to communicate with each other irrespective of their programming language.
• Standardised by the Object Management Group (OMG), an industrial consortium of well over 100 companies.
• Many different implementations, which can all interoperate.

The CORBA Object Model

[Figure: The CORBA object model. Client side: client program, proxy for A, ORB core. Server side: ORB core, object adapter, skeleton or dynamic invocation, Servant A. Request and Reply messages pass between the ORB cores. The client side has access to an implementation repository and the server side to an interface repository.]

client program: calls a method in the remote servant program A.
proxy for A: marshalls the arguments in invocation requests and unmarshalls replies. Available at compile time, and is generated from the IDL.
ORB core: implements the standardised communication systems, and provides services for converting remote references to strings etc.
Object Adapter: bridges between the ORB and the target language, providing dispatch for method invocations, remote references and activation control
skeletons: are in the target language and do the unmarshalling of arguments and marshalling of results.
Servant for A: does the actual work.
Implementation repository: activates registered servers on demand and locates running servers, using the object adapter name.
Interface repository: provides information about registered IDL interfaces, so as to enable dynamic invocation, cf reflection in Java.

The Interface Definition Language - IDL

Modules: CORBA IDL provides modules which function in a similar fashion to Java packages, providing scope control mechanisms for names.
Interfaces: Interfaces are similar to Java interfaces, providing collections of methods offered by an object. IDL supports inheritance
Methods: Method descriptions are similar to Java, except that parameters can be tagged as
    in: value passed into the method
    out: value returned from the method, in addition to the return value of the call
    inout: value passed in and returned.
Attributes: can be declared in IDL - the compiler generates accessor methods (get/set) automatically.
Exception: Methods can raise exceptions, which may return values
Primitive types: Such as byte, char, short, int, long, double
Constructed Types: sequence (variable length list), array, struct (collections of values of different types), enumerated (set of named integer values), unions (choice of values, depending on discriminator enumerated type tag)
Remote Object references: Objects are not passed by value. Instead, references to objects can be returned and passed around. These references are interpreted by the local communications modules (the Object Request Broker, ORB).

IDL Example

// From file Person.idl
struct Person {
    string name;
    string place;
    long year;
};

interface PersonList {
    readonly attribute string listname;
    void addPerson(in Person p);
    void getPerson(in string name, out Person p);
    long number();
};

Language Bindings
• In normal system languages - c, c++, java - these are quite straightforward, as long as the programmer is aware of the semantics of the IDL.
• But some of the semantics need special support.
• Consider
    void getPerson(in string name, out Person p);
• The Java equivalent is
    void getPerson(String name, PersonHolder p);
  where PersonHolder contains an instance of the returned out value of the Person argument.

CORBA Services
• CORBA includes specifications for a number of services:
Naming Service: uses a naming context to lookup a name. The naming contexts can be linked, so that a name actually points to another naming context, allowing hierarchical names.
Trading Service: allows services to be located by description, rather than name.
Event and Notification Service: provides for event management, including pattern matching for notification.
Security Service: provides for management of authentication and access control, audit trails etc.

Transaction and Concurrency Control Service: implements transactional mechanisms to provide concurrency control and ACID semantics to operations
Persistent Object Service: allows objects to be stored in passive form when not required and activated on demand.

Transport Issues - IIOP
• There is a standard protocol for use between ORBs - the General InterORB Protocol, GIOP.
• GIOP specifies the concrete syntax for data placed into the byte stream, the Common Data Representation or CDR.
• The implementation almost universally used is the Internet Inter-ORB Protocol, IIOP.
• IIOP specifies the layout of messages, and the standard layout for remote object references.

2.7.4 Web Services - Business to Business
• HTML/javascript and friends are good for rendering interfaces to people, but how should business computers talk to each other?
• CORBA has never bridged the business to business gap for various reasons.
• Can XML based solutions bridge the gap?

SOAP, WSDL and XML
• Web Services are business to business RPC systems based on XML and HTTP
• Messages are described using an XML variant called SOAP.
• The concrete syntax is thus ASCII tags and content.
• The interface definition language is Web Services Description Language (WSDL)
• The IDL is thus (very verbose) XML
• More detailed notes can be found at Vladimiro's Internet Technologies notes
• ...

2.7.5 Summary
• The majority of programmers in the commercial world create bespoke solutions to manage business processes.
• Three tier models of business solutions provide flexibility yet control the complexity of the business logic.
• CORBA provides a middleware framework in which to implement the business logic.

2.8 Computer Security: Why you should never trust a computer system

Goal: Prevent Misuse of computers
• Definitions
• Authentication
• Private and Public Key Encryption
• Access Control and Capabilities
• Enforcement of security policies
• Examples of Security Problems

2.8.1 Definitions

Types of Misuse
• Accidental
• Intentional
Protection is to prevent either accidental or intentional misuse
Security is to prevent intentional misuse

Three pieces to security
Authentication: Who user is
Authorisation: Who is allowed to do what
Enforcement: Ensure that people only do what they are allowed to do

A loophole in any of these can cause problems eg
1. Log in as super-user
2. Log in as anyone, do anything
3. Can you trust software to make decisions about 1 and 2?

2.8.2 Authentication

Common approach: Passwords. Shared secret between two parties. Since only I know the password, machine can assume it is me.

Problem 1: system must keep copy of secret, to check against password. What if malicious user gains access to this list of passwords?

Encryption: Transformation on data that is difficult to reverse - in particular, secure digest functions.

Secure Digest Functions
A secure digest function h = H(M) has the following properties:
1. Given M, it is easy to compute h.
2. Given h, it is hard to compute M.
3. Given M, it is hard to compute a different M' such that H(M) = H(M')

For example: Unix /etc/passwd file
Password → one way transform → encrypted password
System stores only encrypted version, so ok if someone reads the file. When you type in password, system encrypts password and compares against the stored encrypted version.
Over the years, password protection has evolved from DES, using a well-known string as the input data and the password as the key, through MD5 to SHA-1.

Passwords as Human Factors Problem
Passwords must be long and obscure.
Paradox: short passwords are easy to crack, but long ones people write down
Improving technology means we have to use longer passwords. Consider that unix initially required only lowercase 5 letter passwords
How long for an exhaustive search? 26^5 = 11,881,376, roughly 10,000,000
• In 1975, 10 ms to check a password → about 1 day
• In 2003, 0.00001 ms to check a password → about 0.1 second
Most people choose even simpler passwords such as English words - it takes even less time to check for all words in a dictionary.

Some solutions
1. Extend everyone's password with a unique number (stored in passwd file). Can't crack multiple passwords at a time.

2. Require more complex passwords, eg 7 characters with lower, upper, number and special: 70^7 ≈ 8000 billion, or 1 day. But people pick common patterns, eg 6 lower case plus a number.
3. Make it take a long time to check each password. For example, delay every login attempt by 1 second.
4. Assign long passwords. Give everyone a calculator or smart card to carry around to remember the password, with a PIN to activate. Need physical theft to steal the card.
Problem 3 Can you trust the encryption algorithm? Recent example: techniques thought to be safe, such as DES, have back doors (intentional or unintentional). If there is a back door, you don't need to do a complete exhaustive search.

2.8.3 Authentication in distributed systems: Private Key Encryption
In distributed systems, the network between the machine on which the password is typed and the machine the password is authenticating on is accessible to everyone.
Two roles for encryption:
1. Authentication - do we share the same secret?
2. Secrecy - I don't want anyone to know this data (eg medical records)
Use an encryption algorithm that can easily be reversed given the correct key, and is difficult to reverse without the key.
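Returning to the password solutions listed in 2.8.2, the first and third (a per-user unique number, and a deliberately slow check) can be sketched together. This is a minimal illustration, not the Unix scheme described in the notes: it swaps in PBKDF2 with SHA-256 from the Python standard library as the one way transform, and the names make_record, check and ITERATIONS are made up for the example.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor; raise it to make each check slower


def make_record(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest) to store instead of the password itself."""
    salt = os.urandom(16)  # the per-user "unique number" kept in the password file
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest


def check(password: str, salt: bytes, digest: bytes) -> bool:
    """Re-run the one way transform and compare with the stored digest."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)


# Two users with the same password end up with different stored digests,
# so one dictionary pass cannot crack both at once.
salt_a, digest_a = make_record("correct horse")
salt_b, digest_b = make_record("correct horse")
```

Because the salt differs per user, an attacker has to repeat the whole dictionary search for every entry in the file, and the iteration count multiplies the cost of each guess.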

and instead. and has proven strong against a large body of analysis. 192 or 256 bit keys. as long as they trust the server. 3DES or triple DES is used. can’t derive password. shifts and transpositions.Kerberos Operation The server keeps a list of passwords. IDEA The International Data Encryption Algorithm has 128 bit keys. all of which are fast on modern processors. can’t decode without password.54 CHAPTER 2. • From plain text and cipher text. which has 128 bit keys. multiplication. B to talk to each other. The 56 bit key is too weak for most uses now. Authentication Server . AES The US based NIST has defined a new encryption algorithm based on Rijndael. addition. we get both secrecy and authentication. LECTURE NOTES spy plaintext password encrypt secure insecure transmission ciphertext MI6 plaintext password decrypt secure ciphertext • From cipher text. which offers 128. provides a way for two parties. . Symmetric Encryption Symmetric encryptions use exclusive ors. • As long as the password stays secret. A. DES The Data Encryption Standard (DES) was released in 1977.

Notation: Kxy is a key for talking between x and y. (...)∧K means encrypt message (...) with key K.
Simplistic overview:
• A asks server for key: A → S (Hi! I'd like a key for A,B)
• Server gives back special "session" key encrypted in A's key, along with ticket to give to B: S → A (Use Kab (This is A! Use Kab)∧Ksb)∧Ksa
• A gives B the ticket: A → B (This is A! Use Kab)∧Ksb
Details: add in timestamps to limit how long each key exists, and use single use authenticators so that clients are sure of server identities and to prevent a machine replaying messages later. Also have to include encrypted checksums to prevent a malicious user inserting garbage into a message.

2.8.4 Public Key Encryption
Public key encryption is a much slower alternative to private key; it separates authentication from secrecy.
Each key is a pair K, K⁻¹
With private key: (text)∧K∧K = text
With public key:
• (text)∧K∧K⁻¹ = text, but (text)∧K∧K ≠ text
• (text)∧K⁻¹∧K = text, but (text)∧K⁻¹∧K⁻¹ ≠ text
Can't derive K from K⁻¹, or vice versa
Public key directory Idea: K is kept secret, K⁻¹ made public, such as in a public directory
For example:
(I'm Ian)∧K Everyone can read it, but only I could have sent it (authentication).
(Hi!)∧K⁻¹ Anyone can send it but only I can read it (secrecy).
((I'm Ian)∧K Hi!)∧K'⁻¹ On first glance, only I can send it, only you can read it. What's wrong with this assumption?
Problem: How do you trust the dictionary of public keys? Maybe somebody lied to you in giving you a key.
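The key pair identities in 2.8.4 can be checked with a toy example. The sketch below assumes textbook RSA with tiny primes, which the notes do not name and which is far too small and too simple for real use; apply_key, k_priv and k_pub are illustrative names, with k_priv playing the secret half of the pair and k_pub the published half.

```python
# Toy key pair from tiny primes; real keys use primes hundreds of digits long.
p, q = 61, 53
n = p * q                      # modulus shared by both halves of the pair
phi = (p - 1) * (q - 1)
k_pub = 17                     # published half of the pair
k_priv = pow(k_pub, -1, phi)   # secret half: modular inverse (Python 3.8+)


def apply_key(value: int, key: int) -> int:
    """Encrypt or decrypt a small integer message with one half of the pair."""
    return pow(value, key, n)


text = 65
signed = apply_key(text, k_priv)   # only the key holder can produce this
sealed = apply_key(text, k_pub)    # anyone can produce this

# Applying the other half of the pair recovers the text ...
recovered_from_signed = apply_key(signed, k_pub)
recovered_from_sealed = apply_key(sealed, k_priv)
# ... but applying the same half twice does not.
same_half_twice = apply_key(sealed, k_pub)
```

The first recovery corresponds to authentication (anyone can read, only the key holder could have sent), the second to secrecy (anyone can send, only the key holder can read).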

2.8.5 Secure Socket Layer
• Provides a technique for data sent over a TCP connection to be encrypted.
• Uses public key technology to agree upon a key, then 3DES or whatever to encrypt the session.
• Data encrypted in blocks, optionally with compression first.
• Used in http as https, and for telnet, ftp etc.
SSL handshake protocol
[Figure: messages exchanged between client and server]
ClientHello, ServerHello: establish protocol version, session ID and cipher suite; exchange random values
ServerCertificate, CertificateRequest, ServerDone: send server certificate and optionally request client certificate
ClientCertificate, (PreMasterKey)Kserver, ClientDone: send client certificate if requested; send pre-master key, encrypted in server public key, for server to generate session key
All messages now encrypted in session key

2.8.6 Authorisation
Who can do what.
Access control matrix: formalisation of all the permissions in the system
[Table: access control matrix with objects file1, file2, file3, ... as columns and users A, B, C, ... as rows; boxes hold rights such as rw or r]
For example, one box represents C can read file3
Potentially huge number of users and objects, so impractical to store all of these.
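One way to picture why the full matrix is rarely stored is to keep only the non-empty boxes. The sketch below is an illustration under assumptions: apart from "C can read file3", the entries are invented, and allowed, column and row are hypothetical helper names.

```python
# Sparse access control matrix: only non-empty boxes are stored.
# Apart from ("C", "file3"), the entries are made up for illustration.
matrix = {
    ("A", "file1"): "rw",
    ("A", "file2"): "r",
    ("B", "file2"): "rw",
    ("C", "file3"): "r",
}


def allowed(user: str, obj: str, right: str) -> bool:
    """True if the box for (user, obj) contains the requested right."""
    return right in matrix.get((user, obj), "")


def column(obj: str) -> dict[str, str]:
    """Everything stored for one object: who may do what to it."""
    return {u: rights for (u, o), rights in matrix.items() if o == obj}


def row(user: str) -> dict[str, str]:
    """Everything stored for one user: what that user may touch."""
    return {o: rights for (u, o), rights in matrix.items() if u == user}
```

Slicing by column gives the per-object view and slicing by row gives the per-user view, which are the two ways of cutting the matrix down to something storable.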

Approaches to authorisation
1. Access control list - store all permissions for all users with each object. Still might be a large number of users. Unix addresses this by having rwx for user, group and world. Recent systems provide a way of specifying groups of users and permissions for each group.
2. Capability list - each process stores tickets for the objects it can operate on.

2.8.7 Enforcement
Enforcer is the program that checks passwords, access control lists, etc
Any bug in the enforcer means a way for a malicious user to gain the ability to do anything
In Unix, superuser has all the powers of the Unix kernel - can do anything. Because of coarse-grained access control, lots of stuff has to run as superuser to work. If there is a bug in any such program, you're in trouble.
Paradox:
1. make enforcer as small as possible - easier to get correct, but simple minded protection model (more programs have high privilege).
2. fancy protection - only minimal privilege, but hard to get right.
Digital Rights Management
• Digital content is produced by people earning a living, and they wish to protect their investment
• Digital Rights Management is the catchall term for the technologies controlling the use of digital content.
• Two main approaches: Containment, in which the content is wrapped in an encrypted shell, so that users have to prove their capability of using the content through knowledge of the key. Watermarking, where the content is marked so that devices know that the data is protected.
• The problem is how to enforce the checking of capabilities and enforce the no circumvention requirement.

2.8.8 Trusted Computing Platform
Question: How do we ensure no software transgresses digital rights?
Answer: By only allowing approved software to have access to the data.
Trusted Operating Systems
• Initial design came from mobile code thoughts on how to protect programmable network components.
• The basic computer and operating system must be running validated software.
• There must be an untamperable part of the computer/OS that can hold keys and hand them over only to validated software.
• The untamperable part of the computer is the Fritz chip, a co-processor which holds a unique certificate that it is running some validated code.
• Digital content can require that only validated software can be used to play it and that it can only be played on a specific machine.
Startup Sequence
[Figure: ROM, Memory, Fritz coprocessor and Processor on the Main Bus; IO Bus Interface; Ethernet Adapter ...]
1. Fritz checks the boot rom has been signed, executes, validates the state of the machine.
2. Fritz checks the first section of the OS has been signed, executes, loads and validates the state of the machine.
3. As hardware comes up, the serial number of the hardware is checked to see if it is valid, and then the associated driver is checked to have been signed and validates the state.
4. Once hardware is fully up, control is handed over to the validated OS.
5. If new hardware is added that is not valid, then the machine must be validated.

• The Fritz chip uses the key it holds and, with the inner kernel, can unpack data and check it is being passed to the certified software and is still on the correct machine.
• Palladium is Microsoft's name for their secure OS. (Footnote 2: Next Generation Secure Computing Base.)
• Economic issues? Many.
• Ethical issues? Many.

2.8.9 Firewalls
• TCP/IP has port numbers which indicate services that packets should go to.
• Firewalls can prevent packets entering or leaving networks for some services.
• Firewalls inspect each packet going in and out of the network, applying pattern matches on port numbers (and other fields).
• Filters can be triggered and stateful, eg if a Telnet connection is ongoing to a host, then allow X connections from that host to the telnet source, otherwise disallow X connections.
• Traffic is generally routed through static routes installed at ingress points for networks, although filtering can be more distributed, eg through tunneling.

2.8.10 Classes of security problems
Abuse of privilege If the superuser is evil, then we're all in trouble - nothing can be done
Imposter Break into system by pretending to be someone else. For example in Unix, can set up a .rhosts file to allow logins from one machine to another, without having to retype password. Also allows rsh - command to do an operation on a remote node. Combination means: send rsh request, pretending to be from the trusted user, to install a .rhosts file granting imposter full access. Similarly, if have open X windows connection over the network, can send message appearing to be key strokes from window but which really is commands to allow imposter access. If no encryption, no way to stop this
Trojan Horse Greeks present Troy with present of wooden horse, but army hidden inside. Trojan Horse appears helpful, but really does something harmful

Salami Attack Richard Pryor in Superman 3. Idea was to build up a small bit at a time. What happens to partial pennies from interest on bank and mortgage accounts? Bank keeps it. Re-program so that partial pennies go to programmer's account. Millions of customers adds up quickly.
Eavesdropping Listener - tap into serial line on back of terminal or onto Ethernet. See everything typed in, as almost everything goes over network unencrypted. On telnet, password goes over network unencrypted.
How can these be prevented? Hard to build a system that is both useful and prevents misuse. See Internet Worm later.
Tenex Popular operating system at universities in early seventies before Unix. Thought to be secure. To demonstrate, created a team to find holes. Given source and documentation (wanted source to be given away, as Unix was), gave them normal account. In 48 hours they had every password on the system.
Code for password check
    for (i = 0; i < 8; i++)
        if (userPasswd[i] != realPasswd[i])
            goto error;
Appears that you'd have to try all 256^8 combinations. But Tenex used virtual memory, and it can interact badly with above code.
Key idea: Force page fault at inopportune times. Arrange first character in string to be the last character in a page, rest to be on next page. Arrange that the page with the first character is in memory, page with remainder on disk, eg by referencing lots of other pages then referencing the first page.
a|bcdefgh where a in memory, bcdefgh on disk
By timing how long the password check takes, can figure out whether the first character is correct: if fast, then first character is wrong; if slow, then first character is correct, page fault, other character incorrect.
Try all first characters till one is slow. Then put first two characters in memory, remainder on disk. Try all second characters till one is slow...
On average takes 128 × 8 guesses to crack a password.
Fix easy: don't stop till you look at all characters. But how is this known in advance? Difference in timing can reveal how far check has proceeded.
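The Tenex leak can be reproduced without real page faults. In the sketch below the number of characters examined stands in for the timing or page fault signal, and the attack recovers the password one position at a time. SECRET, leaky_check, crack and safe_check are hypothetical names, the alphabet is restricted to lowercase letters to keep the example small, and the fix uses the standard library's hmac.compare_digest rather than anything Tenex did.

```python
import hmac
import string

SECRET = "tenexpwd"  # hypothetical 8 character password held by the system
ALPHABET = string.ascii_lowercase


def leaky_check(guess: str) -> tuple[bool, int]:
    """Early-exit comparison. The second value is how many characters were
    examined, standing in for what timing or a page fault reveals."""
    examined = 0
    for g, s in zip(guess, SECRET):
        examined += 1
        if g != s:
            return False, examined
    return len(guess) == len(SECRET), examined


def crack() -> tuple[str, int]:
    """Recover SECRET one position at a time; returns (password, guesses made)."""
    known, guesses = "", 0
    while len(known) < len(SECRET):
        for c in ALPHABET:
            guesses += 1
            padding = "?" * (len(SECRET) - len(known) - 1)
            ok, examined = leaky_check(known + c + padding)
            # Getting past position len(known) means c was the right character.
            if ok or examined > len(known) + 1:
                known += c
                break
    return known, guesses


def safe_check(guess: str) -> bool:
    """Fix: a comparison whose running time does not depend on where the
    first mismatch occurs."""
    return hmac.compare_digest(guess.encode(), SECRET.encode())
```

With the leak, the cost is at most the alphabet size times the password length, rather than the alphabet size raised to the password length.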

Variant attack: kernel checks arguments to call before using them. If multithreaded, could co-ordinate non-calling thread to change args after kernel check.
Internet Worm 1988 - Worm infected thousands of machines over Internet. Three attacks
• Dictionary lookup on passwd file
• Sendmail - Debug mode, if configured wrongly will execute commands as root.
• fingerd - finger ianw@csrp.crn Fingerd didn't check for length of string, only allocated a fixed size array on stack. By passing a carefully crafted really long string, could overwrite stack, get the program to call arbitrary code.
Got caught because idea was to launch attacks onto other machines from whatever systems were broken into. Ended up cutting salami too thickly: when machines were attacked multiple times, it dragged CPU down so much that people noticed.
Thompson's self-replicating program Bury Trojan Horses in binaries, so no evidence in source. Replicates to every Unix system in the world, and even onto new unixes on different platforms with no visible sign. Would give Ken Thompson, one of original Unix designers, ability to logon to any system in world.
1. Make it possible (easy)
2. Hide it (difficult)
Step 1 Modify login.c
A: if(name == "ken") don't check password, log in as root
Idea is to then hide change so no one can see it
Step 2 Modify the compiler. Instead of code in login.c, have code in compiler
B: if(see trigger) insert A into input stream

Whenever the compiler sees a trigger (/*gobbledygook*/), puts A into source stream into compiler. Now, don't need A in login.c, just need trigger. Need to get rid of problem in compiler
Step 3 Modify compiler to have
C: if(see trigger2) insert B and C into input stream
Note that this can now be self-replicating code.
Step 4 Compile the compiler with C present - now in the binary for the compiler
Step 5 Replace code for C in compiler with trigger2
Result - everything is in binary. As long as trigger2 remains in compiler source, will replicate C, with B, with A, to every compiler built. Every time login.c recompiled, backdoor created. Every time compiler recompiled, the compiler re-inserts backdoor inserter. When porting to a new machine, compiler used to generate new cross compiler, and then new compiler from source code written in C.

2.8.11 Lessons
1. Hard to resecure system after penetration. How do you remove backdoor? Remove triggers? But what if another trigger in editor? If observe trigger being removed, re-insert trigger on saving file.
2. Hard to detect when system has been penetrated. Easy to make system forget.
3. Any system has loopholes, and every system has bugs.
4. The more complex the system, the more bugs - KISS

2.9 Names and Naming Services
2.9.1 Main Points
• Use of names
• Structure of names

• Name Services
• Domain Name Service (DNS) - The Internet Name Service
Definitions:
Names - what it's called
Address - where it is
Route - how to get there

2.9.2 Why names?
Object Identification a service or resource we want to use, eg a filename, a telecommunications provider, a person
Allow Sharing Communicating processes can pass names and thus share resources
Location Independence If we separate the name from the address, can migrate object transparently
Security If large number of possible names, knowing the name of the object means that it must explicitly have been passed. If entire system constructed so that names are passed with authorisation then knowing name means chain of trust to allow access to object.

2.9.3 What does one do with names?
• Use as arguments to functions, eg to call a service
• Create names. If an object comes into creation it must be named. Name should meet rules for system, eg be unique, therefore naming authority (possibly object) must keep track of what names it has allocated
• Delete names. When an object disappears from the system, at some point may wish to delete name and allow re-use.

2.9.4 What's a name?
• Part of the design space for any distributed system is how to construct the name.
• Choice of design has ramifications for design of rest of system and performance of service.

Unique Identifier
• Unique Identifier (UID) aka flat names, primitive names.
• No internal structure, just string of bits.
• Only use is comparison against other UIDs, eg for lookup in table containing information on named object.
• Provide location independence and uniformity
• But difficult to name different versions of the same object, eg Distributed Systems edition 1 vs Distributed Systems edition 2. Real objects have relationships to each other - useful to reflect in naming practice
• Difficult to discover address from name - need to search entire name space

2.9.5 Partitioned names
Add partition structure to name to enable some function to be more efficient, typically location and name allocation
[Figure: tree of DNS domains with nodes com, edu, uk, ac, co, susx, ucl, cogs, ...]

• Domain Name Service (DNS) name - unix.rn.informatics.scitech.sussex.ac.uk. Each part of name comes from flat name space. Division of name space and possible objects into smaller space. "uk" reduces objects to those within the uk, "ac" to those administered by academic networks, and so on.
• When allocating names, simply add to lowest part of partition. Low risk of collision, since small number of other objects, all administered by same authority
• When looking up name, can guess where information will reside.

2.9.6 Descriptive names
• Necessary to have a unique name.
• But useful to name objects in different ways, such as by service they offer, eg Postman Pat, John of Gwent, www.informatics.sussex.ac.uk
• Create name using attributes of object.
• Note that objects can therefore have multiple names, not all of which are unique
• Choice of name structure depends on system. DNS chooses partition according to administration of creation, with aliases to allow naming by service, eg ftp.informatics.sussex.ac.uk is also doc-sun.rn.informatics.scitech.sussex.ac.uk.

2.9.7 Object Location from name - broadcast
• Ask all possible objects within the system if they respond to that name. Can be efficient if network supports broadcast, eg Ethernet and other LANs.
• Equivalent to distributing name table across all objects, objects storing names referring to them.
• Only want positive responses - all responses would generate a lot of traffic.
• Scaling problem when move into wide area.
– Greater number of hosts imply greater probability of failure
– Broadcasts consume higher proportion of bandwidth, made worse due to higher failures needing more location requests
• Broadcast used only on small LAN based systems (and in initial location of directory)

2.9.8 Location through database
• Keep information about object in a table, indexed by a name. Location is just another piece of information.
• In DNS, table is stored as {name,attribute} pairing, in a resource record
• If database centralised, then
– Whole system fails if database machine fails
– Database machine acts as bottleneck for performance of system as whole
– In wide area systems, authority should be shared amongst controlling organisations
So name information usually distributed.

2.9.9 Distributed Name Servers
Parts of the name table are distributed to different servers, eg in Domain Name Service, servers are allocated portions of the name space below certain domains such as the root, ac.uk, susx.ac.uk
Names can be partitioned between servers based on
algorithmic clustering eg apply well-known hash function on name to map to server. May result in server being remote from object. Only technique for UIDs
structural clustering if name is structured, use structure to designate names to particular server, such as in DNS
attribute clustering if names are constructed using attributes, then servers take responsibility for certain attributes.

2.9.10 Availability and performance
If a single server is responsible for name space, there is still single point of failure, and a performance bottleneck. Most systems therefore
• Replicate name list to other servers. Also increases performance for heavily accessed parts of name space, eg secondary servers in DNS
• Cache information received by lookup. No need to repeat lookup if asked for same information again. Increases performance. Implemented in both client side and server (in recursive calls) in DNS.
If information is cached, how do we know when it's invalid? May attempt to use inconsistent information.
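One common answer to the invalidation question is to give each cached entry a limited lifetime, as DNS does with a time-to-live on each record. The sketch below is a minimal illustration of that idea, assuming a single fixed lifetime for all entries; NameCache, resolve and the injected lookup and clock functions are hypothetical names, not part of any real resolver API.

```python
import time


class NameCache:
    """Client-side cache of name lookups; entries expire after ttl seconds."""

    def __init__(self, lookup, ttl: float, clock=time.monotonic):
        self._lookup = lookup   # function name -> attribute, eg a query to a server
        self._ttl = ttl
        self._clock = clock     # injectable so the behaviour can be tested
        self._entries = {}      # name -> (attribute, time stored)

    def resolve(self, name: str):
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]       # still fresh: no request to the server
        attribute = self._lookup(name)
        self._entries[name] = (attribute, now)
        return attribute
```

A short lifetime bounds how long inconsistent information can be used, at the cost of more requests to the name server.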

2.9.11 Maintaining consistency for distributed name services
Alleviated by following features of some distributed systems
• In most systems objects change slowly, so names live for a long time, and are created infrequently
• If address of an object is wrong, it causes an error. Address user can recover if it assumes one of the possible problems is inconsistent information.
• Obsolete information can be fixed by addressed object leaving redirection pointers. Equivalent to leaving a forwarding address to new home.
However, there are always systems which break these assumptions, eg highly dynamic distributed object system, creating lots of objects and names and deleting lots of names.

2.9.12 Client and Name server interaction
• Hidden behind RPC interface.
• Client knows of server to ask (either installed in file, or through broadcast location).
• Client calls with arguments of name and required attribute, eg address. In DNS arguments are name and the type of requested attribute.
• Server will return result with either required attribute or error message
Lookup Modes If name not stored on server, may be stored on other server. Two options
Recursive server asks other possible server about name and attribute, which may then have to ask another server and so on
[Figure: recursive lookup: client sends req to name server 1, which sends req to name server 2, which sends req to name server 3; each ans is passed back along the chain]
Iterative server returns address of other possible server to client, who then resends request to new server.
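The two lookup modes differ only in who follows the referral. The sketch below models that with a hypothetical chain of three in-memory servers; SERVERS, the name www.example.org and the address are invented for the example, and a real name server would sit behind an RPC interface rather than a dictionary.

```python
# Hypothetical three-server chain: each server either holds the name or
# knows which server to try next.
SERVERS = {
    "ns1": {"table": {}, "referral": "ns2"},
    "ns2": {"table": {}, "referral": "ns3"},
    "ns3": {"table": {"www.example.org": "192.0.2.7"}, "referral": None},
}


def recursive_lookup(server: str, name: str):
    """The server chases the referral itself and passes the answer back."""
    entry = SERVERS[server]
    if name in entry["table"]:
        return entry["table"][name]
    if entry["referral"] is None:
        return None
    return recursive_lookup(entry["referral"], name)


def iterative_lookup(first_server: str, name: str):
    """The client follows each referral; returns (answer, servers asked)."""
    asked, server = [], first_server
    while server is not None:
        asked.append(server)
        entry = SERVERS[server]
        if name in entry["table"]:
            return entry["table"][name], asked
        server = entry["referral"]
    return None, asked
```

In the recursive mode the client makes one request and the servers do the work; in the iterative mode the client makes one request per server but each server stays simple.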

10.68 CHAPTER 2.how do we make sure each client sees most up to date copy of file? . LECTURE NOTES client request "try ns2" request "try ns3" request answer name server 1 name server 2 name server 3 2.9.1 Distributed File Systems Main Points A Distributed File System provides transparent access to files stored on a remote disk Themes: • Failures .13 Summary • Names can be flat.distributed name servers suffer from inconsistency • Interaction with name server best modelled by RPC 2. or they can be structured • Centralised name servers suffer from availability .what happens when server crashes but client doesn’t? or vice versa • Performance ⇒ Caching .10 2.use caching at both the clients and the server to improve performance • Cache Coherence .

10. write.2 Client Implementation • Request for operation on file goes to OS. • Subtrees in directory structure generally map to file system. Provides access transparency rsuna rsuna local bin tsunb tsunb 2. does operation and returns results.3 No Caching • Simple approach: Use RPC to forward each file system request to remote server (older versions of Novell Netware). read. seek.10. • OS recognises file is remote and constructs RPC to remote file server • Server receives call.2.10. close • Server implements operations as it would for local request and passes result back to client . • Example operations: open. DISTRIBUTED FILE SYSTEMS 69 2.

[Figure: clients A and B send read and write requests to server S, which holds X in its cache and replies with data or done]
Advantages and Disadvantages of uncached remote file service
Advantage server provides consistent view of file system to both A and B.
Disadvantage can be lousy performance
• Going over network is slower than going through memory
• Lots of network traffic - congestion
• Server can be a bottleneck - what if lots of clients

2.10.4 NFS - Sun Network File System
Main idea - uses caching to reduce network load
• Cache file blocks, file headers etc in both client and server memory.
• More recent NFS implementations use a disk cache at the client as well.
Advantage: If open/read/write/close can be done locally, no network traffic
Issues: failure and cache consistency
Failures What if server crashes? Can client wait until server comes back up and continue as before?
1. Any data in server memory but not yet on disk can be lost
2. If there is shared state across RPCs, eg open seek read. What if server crashes after seek? Then when client does "read" it will fail.

3. Message re-tries - suppose server crashes after it does "rm foo" but before it sends acknowledgement? Message system will retry and send it again. How does system know not to delete it again (someone else may have created new "foo")? Could use transactions, but NFS takes more ad hoc approach.
What if client crashes?
1. Might lose modified data in client cache.
NFS Protocol
1. Write-through caching - when a file is modified, all modified blocks are sent immediately to the server disk. To the client, "write" doesn't return until all bytes are stored on disk.
2. Stateless Protocol - server keeps no state about client, except as hints to improve performance
• Each read request gives enough information to do entire operation: ReadAt(inumber, position) not Read(openFile).
• When server crashes and restarts, can start processing requests immediately, as if nothing had happened.
3. Operations are idempotent: all requests are ok to repeat (all requests are done at least once). So if server crashes between disk I/O and message send, client can resend message and server does request over again
• Read and write file blocks are easy - just re-read or re-write file block, so no side-effects.
• What about "remove"? NFS just ignores this - does the remove twice and returns error if file not found; application breaks if inadvertently removes other file.
4. Failures are transparent to client system. Is this good idea? What should happen if server crashes? If application in middle of reading file when server crashes, options:
(a) Hang until server returns (next week...)
(b) Return an error? But networked file service is transparent. Application doesn't know network is there. Many Unix apps ignore errors and crash if there is a problem.
NFS has both options - the administrator can select which one. Usually hang and return error if absolutely necessary.
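The difference between ReadAt(inumber, position) and Read(openFile) is that the first can be repeated safely after a lost reply. The sketch below illustrates this with a toy in-memory server; StatelessFileServer and its method names are hypothetical and are not the NFS wire protocol.

```python
class StatelessFileServer:
    """Toy server: every request names the file and the offset explicitly,
    so the server holds no per-client position between requests."""

    def __init__(self):
        self.files = {}  # inumber -> bytearray of file contents

    def read_at(self, inumber: int, position: int, count: int) -> bytes:
        return bytes(self.files[inumber][position:position + count])

    def write_at(self, inumber: int, position: int, data: bytes) -> None:
        blob = self.files.setdefault(inumber, bytearray())
        if len(blob) < position:
            blob.extend(b"\0" * (position - len(blob)))  # pad any gap with zeros
        blob[position:position + len(data)] = data

    def remove(self, inumber: int) -> bool:
        """Returns False if the file was already gone, eg on a retried request."""
        return self.files.pop(inumber, None) is not None
```

Repeating read_at or write_at with the same arguments leaves the same result, whereas a repeated remove reports an error the second time, which matches the "remove" problem described above.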

Cache consistency
• What if multiple clients sharing same files? Easy if they are both reading - each gets a local copy in their cache.
• What if one writing? How do updates happen?
• Note NFS has write-through cache policy. If one client modifies file, writes through to server.
• How does other client find out about change?
NFS and weak consistency
• In NFS, client polls server periodically to check if file has changed. Polls server if data hasn't been checked in the last 3-30 seconds (exact time is tunable parameter)
[Figure: client A holds new version X' and writes it through to server S, which stores X' on disk; client B still caches old X and, at t=30, asks the server "X still ok?"]
• When file changed on one client, server is notified, but other clients use old version of file till timeout. They then check server and get new version.
• What if multiple clients write to same file? In NFS get either version or mixed version. Completely arbitrary!
Sequential ordering constraints
• What should happen? If one CPU changes file, and before it completes, another CPU reads file?
• We want isolation between operations, so read should get old file if it completes before write starts, new version if it starts after write completes. Either all new or all old any other way, cf serialisability.

NFS Pros and Cons
• it's simple
• it's highly portable
• but it's sometimes inconsistent
• but it doesn't scale to large numbers of clients
Note that this describes NFS version 3.

2.10.5 Andrew File System
AFS (CMU late 80s) ⇒ DCE DFS (commercial product)
1. Callbacks: Server records who has copies of file
2. Write through on close
• If file changes, server is updated (on close)
• Server then immediately tells all those with old copy
AFS Session Semantics Session semantics - updates only visible on close
• In Unix (single machine), updates visible immediately to other programs who have file open.
• In AFS, everyone who has file open sees old version; anyone who opens file again will see new version.
In AFS:
1. on open and cache miss: get file from server, set up callback
2. on write close: send copy to server, which tells all clients with copies to fetch new version from server on next open
• Files cached on local disk; NFS (used to) cache only in memory
[Figure: server S holds callback list {X:A,B,..} and X' on disk; client A has written X'; client B, holding old X, is told to fetch the new version next time X is opened]
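The callback mechanism and session semantics can be sketched with in-memory objects. This is a simplified illustration under assumptions: AfsServer, AfsClient and their methods are hypothetical names, a dictionary stands in for the local disk cache, and failure handling is ignored.

```python
class AfsServer:
    def __init__(self):
        self.files = {}      # name -> contents
        self.callbacks = {}  # name -> set of clients holding a cached copy

    def fetch(self, client, name):
        self.callbacks.setdefault(name, set()).add(client)  # set up callback
        return self.files[name]

    def store(self, writer, name, data):
        self.files[name] = data
        for client in self.callbacks.get(name, set()) - {writer}:
            client.invalidate(name)   # old copies must be refetched on next open
        self.callbacks[name] = {writer}


class AfsClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}  # name -> contents (stands in for the local disk cache)

    def open(self, name):
        if name not in self.cache:            # cache miss: go to the server
            self.cache[name] = self.server.fetch(self, name)
        return self.cache[name]

    def close_after_write(self, name, data):
        self.cache[name] = data
        self.server.store(self, name, data)   # write through on close

    def invalidate(self, name):
        self.cache.pop(name, None)
```

A client that already has the file open keeps working on the contents it was given, while the next open after the invalidation fetches the new version, which is the session semantics described above.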

• What if server crashes? Lose all your callback state.
• Reconstruct information from client - go ask everyone "who has which files cached"
AFS Pros and Cons
• Disk as cache ⇒ more files cached locally
• Callbacks ⇒ server not involved if file is read-only (majority of file access is read-only)
• But on fast LANs, local disk is slower than remote memory
NFS version 4 will provide session semantics when it is deployed by vendors in the 2005 timeframe.

2.10.6 Summary
• Remote file performance needs caching to get decent performance.
• Central server is a bottleneck
– Performance bottleneck:
∗ All data is written through to server
∗ All cache misses go to server
– Availability bottleneck
∗ Server is single point of failure
– Cost bottleneck
∗ Server machines high cost relative to workstation

2.11 Peer to Peer (p2p) Services and Overlay Networks
Main Points:
• What is an Overlay Network?
• Gnutella - free form searching.
• Distributed Hash Tables - object location

2.11.1 Overlay Networks
• The Internet is successful because the management is decentralised, yet connectivity is maintained
• Can we build applications that allow people to connect their machines, yet maintain control over their machines?
• Yes, by building an overlay network connecting instances of the application
• Examples include the web, file sharing, DoS protection.
A Generic Overlay Network
[Figure 2.1: Generic Overlay Networks - machines connected to each other across a WAN]
• Rather than directly routing messages to target, the overlay routes messages between the constituent machines
• The overlay network builds routing tables to meet application needs

2.11.2 Gnutella
• Used to share mp3 files, and much other copyrighted content.
• Simple protocol based on flooding and caching.
• Many different implementations interoperate.
Definitions
Servent The entities making up the Gnutella network, emphasising that clients are servers and vice versa.
Ping A message sent into the network. When a Servent receives the Ping it responds with a Pong to the sender. It will then flood the Ping message to other Servents it is connected to.
Pong A message containing one or more servent addresses and some information about the data it is sharing.
Query A freetext description of the data asked for. A servent responds with a QueryHit if a match is found against its local data set.
QueryHit Provides enough information to acquire the data matching the query.
Joining the Network
• A Servent joining the network can send a Ping request to discover other nodes.
• Servents receiving the Ping can choose to return a Pong with Servent details.
• Pong messages return along the same path as Ping messages.
• Servents cache recently seen Ping identifiers so as to return Pong messages back along the same path. If the Pong identifier is not in the cache, then it will remove the Pong message.
• There is a TimeToLive field in each message, decremented each time the message passes through a Servent. When the field reaches zero, the message is discarded.
• Servents can choose to initiate connections to other discovered Servents.
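The Ping flooding rules above (flood to neighbours, cache seen identifiers, route Pongs back along the same path, stop when TimeToLive runs out) can be sketched with direct method calls in place of network messages. This is an illustration only: the Servent class, connect and the integer message identifiers are assumptions and do not follow the Gnutella wire format.

```python
import itertools

_ids = itertools.count(1)  # source of message identifiers for the example


class Servent:
    def __init__(self, name: str):
        self.name = name
        self.neighbours = []  # directly connected Servents
        self.seen = {}        # ping id -> neighbour it arrived from (None at origin)
        self.pongs = []       # names learned by the Servent that sent the Ping

    def send_ping(self, ttl: int) -> None:
        ping_id = next(_ids)
        self.seen[ping_id] = None
        for n in self.neighbours:
            n.on_ping(ping_id, ttl, self)

    def on_ping(self, ping_id: int, ttl: int, sender: "Servent") -> None:
        if ping_id in self.seen:
            return                              # duplicate: already handled
        self.seen[ping_id] = sender
        sender.on_pong(ping_id, self.name)      # answer back along the path
        if ttl - 1 > 0:                         # decrement, discard at zero
            for n in self.neighbours:
                if n is not sender:
                    n.on_ping(ping_id, ttl - 1, self)

    def on_pong(self, ping_id: int, name: str) -> None:
        if ping_id not in self.seen:
            return                              # unknown Ping: drop the Pong
        back = self.seen[ping_id]
        if back is None:
            self.pongs.append(name)             # we are the originator
        else:
            back.on_pong(ping_id, name)         # relay one hop back


def connect(a: Servent, b: Servent) -> None:
    a.neighbours.append(b)
    b.neighbours.append(a)
```

In a chain a-b-c-d, a Ping from a with a TimeToLive of 2 reaches b and c but not d, and the cache of seen identifiers stops a Ping circulating forever when the overlay contains a loop.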

Queries
• Queries are flooded through the network in a similar manner to Pings.
• When a Servent receives a Query, it returns a QueryHit if the query text "matches" the search criteria
• How the Servent interprets the search field is entirely a local matter. It could only do exact matching, or just return everything in its dataset.
• The QueryHit holds a Servent specific identifier and information about the file.
• The QueryHit returns along the same path as the Query, allowing caching.
• If the user chooses to download the file, it initiates a direct connection to the holding Servent and retrieves the data using http on the Gnutella port.

Bootstrapping
• GWebCache is a set of web servers which store the IP addresses of "up" Gnutella hosts.
• Most clients now implement some form of UltraPeer, where machines which are well-connected become better than others and have 10-100 leaf nodes and < 10 connections to other ultrapeers.
• Ultrapeers are the nodes who advertise themselves in the GWebCache.
• Ultrapeers can then filter traffic for leaf nodes using the uploaded tables from the leaf nodes.

Gnutella Issues
Security How can you trust the data you've downloaded? Gnutella is well-known for carrying viruses and trojans.
Bandwidth Usage When a node becomes an UltraPeer, the bandwidth usage increases enormously. This can cause problems for the institution in which the machine is sited.
FreeLoaders The network works because people share files and donate bandwidth and disk space. What happens when people just take without giving?
Copyright Most of the material on Gnutella is copyright to someone or other. Should people abuse copyright in this way?

2.11.3 Chord - A Distributed Hash Table Example
• When the name of the object is known, there are more efficient search structures, such as Hash Tables.
• A hash table takes an input key k, calculates the hash function on the key h(k), and uses this as an index into a table.
• In a distributed hash table, the table into which h(k) indexes is distributed across the nodes in the DHT.
• The key is provided, the hash function is calculated h(k), and h(k) is used to route to the node which would hold the object corresponding to the key.

Basic Operations
Chord is a research project which designed one of the first DHTs. Many others have now been designed.

Function             Description
insert(key, value)   Inserts a key/value binding in the DHT.
lookup(key)          Return the value associated with key.
join(n)              Causes a node to add itself into the Chord system.
leave()              Causes a node to leave the Chord system.
Table 2.3: The Chord API

Identifiers and Keys
• A node generates its identifier by picking a value randomly from the hash space eg the 128 bits of SHA applied to its dns name.
• The node joins the DHT and determines who its predecessor and successor are in the table.
Successor(n) The node with the lowest identifier greater than n's identifier, allowing for wraparound.
Predecessor(n) The node with the highest identifier less than n's identifier, allowing for wraparound.
• A node is then responsible for its own identifier and the identifiers between its identifier and its predecessor's identifier.

The Identifier Circle

[Figure 2.2: Identifier circle for 3 bit identifier. Identifiers 0 to 7 are arranged on a circle with nodes at 0, 1 and 3: node 0 is responsible for keys 4-0, node 1 for key 1, and node 3 for keys 2-3.]

Routing in Chord
Idea: Route to a node that halves the distance to the node responsible for the target key.
• For identifier length N bits, each node with identifier n maintains a table with the address of the nodes that succeed identifiers $(n + 2^{0}) \bmod 2^{N}$, $(n + 2^{1}) \bmod 2^{N}$, $(n + 2^{2}) \bmod 2^{N}$ through to $(n + 2^{N-1}) \bmod 2^{N}$.
• To route to a target k, look up the identifier in the routing table that precedes k, and pass the request onto this node.

Routing Example: see Figure 2.3.
www.pdos.lcs.mit.edu/chord/

Expected Number of Routing Hops
• Each hop in the routing will, on average, decrease the distance to the target by one half the current distance.
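The routing rule can be made concrete with a small sketch on a 3-bit identifier circle. The constants M, SPACE and NODES (nodes at identifiers 0, 1 and 3) and every function name are assumptions chosen for illustration; "the identifier in the routing table that precedes k" is interpreted here as the table entry whose node most closely precedes k on the circle. Each function consults a global list of nodes purely to keep the sketch short, whereas a real node would only know its own table.

```python
M = 3                    # identifier length in bits (assumption: 3-bit circle)
SPACE = 2 ** M
NODES = [0, 1, 3]        # assumption: sorted identifiers of the live nodes


def successor(ident):
    """First node at or after ident on the circle, wrapping around."""
    ident %= SPACE
    for n in NODES:
        if n >= ident:
            return n
    return NODES[0]


def predecessor(n):
    """Node immediately before node n on the circle (index -1 wraps)."""
    return NODES[NODES.index(n) - 1]


def finger_table(n):
    """Routing table of node n as (target identifier, succeeding node) pairs."""
    return [((n + 2 ** i) % SPACE, successor(n + 2 ** i)) for i in range(M)]


def between(x, a, b, inclusive_end):
    """True if x lies on the circle after a and before b (optionally at b)."""
    x, a, b = x % SPACE, a % SPACE, b % SPACE
    if inclusive_end and x == b:
        return True
    if a < b:
        return a < x < b
    return x > a or x < b


def lookup(start, key):
    """Return the list of nodes visited when routing `key` from `start`."""
    path, n = [start], start
    while not between(key, predecessor(n), n, True):   # n is not responsible
        succ = finger_table(n)[0][1]
        if between(key, n, succ, True):
            n = succ                                    # successor holds the key
        else:
            # farthest table entry that still precedes the key
            n = next(node for _, node in reversed(finger_table(n))
                     if between(node, n, key, False))
        path.append(n)
    return path


route = lookup(3, 1)     # visits node 3, then node 0, then node 1
```

With these assumed nodes, a lookup for key 1 starting at node 3 first jumps to node 0 (the table entry closest before the key) and then to node 1, which is responsible for key 1.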

[Figure 2.3: Routing example for lookup(1), going from node 3 to the node responsible for 1. Node 0 is responsible for keys 4-0, node 1 for key 1 and node 3 for keys 2-3. The routing tables shown in the figure are:]

Node 0            Node 1            Node 3
target  succ.     target  succ.     target  succ.
1       1         2       3         4       0
2       3         3       3         5       0
4       0         5       0         7       0

• If the size of the identifier space is S, then the average number of routing hops is, with high probability, O(log(S)).

Maintenance: Joins and Leaves
On joining:
• The node generates its identifier n.
• A node locates its immediate successor by looking up n.
• The node then transfers from its successor the (key, value) pairs for the identifiers it is now responsible for, ie those between its predecessor's identifier and n.
• Update the routing tables of the relevant nodes: for i = 1 to N

2.BitTorrent . yet retain the O(log(n)) messages of DHTs? 2.11. CONTENT DISTRIBUTION NETWORKS p = findPredecessor(n − 2i−1 ) p. • Genesis of much of the current research in networks and distributed systems.4 Current Research Challenges • Gnutella is good for open search (as is Google) • DHTs are good for object location • Can we build overlay networks which provide looser search porperties.12 Content Distribution Networks Main Points • Building content caches • Pre-fetching data • Using your neighbours .5 Summary • Overlay networks are an extension of the Internet design philosophy to distributed applications.12. and provide redundant paths.11.updateTables() 81 • Chord implementation does this through timed checks on correctness of successor and predecessor relations 2. 2. • They provide routing mechanisms that are robust to individual failures.

2.12.1 Getting Content over a Network

• Users want to download content from servers as quickly as possible

• What structures can we use to improve their experience, and the impact upon the network?

2.12.2 Web Caches

• Large organisations can improve web performance by sticking a web cache in front of HTTP connections
• The cache inspects an incoming HTTP request to see if it can satisfy the request from locally cached objects.
• If yes, then the object is returned, and the last reference time is updated
• If no, then the object is retrieved and copied locally if allowed.
• Web objects can be marked as non-cacheable.
• Caching is a major part of the HTTP standard

Cache Performance
• The performance of a web cache is difficult to model, since the performance is a complex mixture of interaction between TCP, HTTP and content.
• Caches work because of temporal locality, due to popularity of content, and spatial locality, due to structure of HTML documents
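The hit and miss behaviour described for web caches can be sketched in a few lines. This is an illustration only, not an HTTP implementation: the class name WebCache, the fetch callback (which returns a body and a cacheable flag) and the example origin function are all hypothetical.

```python
import time


class WebCache:
    """Toy cache placed in front of an origin fetch function."""

    def __init__(self, fetch):
        self.fetch = fetch        # assumption: url -> (body, cacheable)
        self.store = {}           # url -> [body, last reference time]
        self.hits = 0
        self.misses = 0

    def get(self, url):
        entry = self.store.get(url)
        if entry is not None:
            entry[1] = time.monotonic()     # update the last reference time
            self.hits += 1
            return entry[0]
        self.misses += 1
        body, cacheable = self.fetch(url)   # retrieve from the origin server
        if cacheable:                       # copy locally only if allowed
            self.store[url] = [body, time.monotonic()]
        return body


# Hypothetical origin: anything under /private is marked non-cacheable.
origin_requests = []

def origin(url):
    origin_requests.append(url)
    return "body of " + url, not url.startswith("/private")

cache = WebCache(origin)
for url in ["/index", "/index", "/private/report", "/private/report"]:
    cache.get(url)
```

The second request for /index is served locally, while both requests for the non-cacheable object go back to the origin server.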

• Measurements of web proxies give the following results (based on JANET caches)
  – Request hit rate is about 60%.
  – Volume hit rate is about 30%.
  – Latency improvement is around a factor of 3 on average
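The gap between the request hit rate and the volume hit rate arises when the objects that hit in the cache are smaller than those that miss. The sketch below shows the two calculations on an invented request log; the function name hit_rates and the sizes are assumptions chosen so that the result reproduces the 60% and 30% figures, and they are not measurements.

```python
def hit_rates(log):
    """Return (request hit rate, volume hit rate) for a request log.

    log: list of (size in KBytes, was_hit) pairs, one per request.
    """
    total_requests = len(log)
    total_volume = sum(size for size, _ in log)
    hit_requests = sum(1 for _, hit in log if hit)
    hit_volume = sum(size for size, hit in log if hit)
    return hit_requests / total_requests, hit_volume / total_volume


# Invented log: three small objects hit, two large objects miss.
sample_log = [(10, True), (10, True), (10, True), (35, False), (35, False)]
request_rate, volume_rate = hit_rates(sample_log)
```

Here three of five requests hit (60%), but they account for only 30 of the 100 KBytes transferred (30%).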

Problems with Caching
• Not all content is marked as cacheable, eg because the site wants accurate records of who looks at content.
• All hosts behind a cache appear to come from one address.
Question: Why is this a problem?

2.12.3 Pre-fetching Data

• Can we improve performance by pro-actively distributing content to caches?
• Yes...

Active Content Distribution
• The html uses links to the cdn domain name eg akamai.com
• The internet service provider has entries in their local DNS for akamai.com pointing to a machine on the ISP's network.
• Content will therefore be supplied from the local machine rather than the original machine
• Customer interaction improved.
• Bandwidth requirements of servers reduced

2.12.4 Using your Peers: BitTorrent
• Why not use the other people receiving the content?
• BitTorrent downloads from other machines
• Basic Idea:
  – To download, the host contacts a machine tracking those already downloading the torrent.
  – Peers are selected at random from the tracker.
  – Pieces are selected to download from those available on the downloader's peer set until all pieces of the file have been received

The .torrent file
• To start downloading from a torrent, the consumer must first locate a .torrent file.
• This contains information about the file length, name and hashing numbers of the file blocks, and the url of a tracker.

• The file is split into 250 KByte pieces, each having a SHA1 hash calculated.
• A tracker holds the IP addresses of current downloaders

Locating peers
• After receiving the torrent file, the downloader contacts the tracker
• The tracker inserts the downloader into its list of downloaders, and returns a random list of other downloaders
• This list becomes the downloader's peers
• Question What is the shape of the overlay Graph?

Choosing Pieces
• The downloader will contact its peers to discover what pieces they have.
• It then chooses a piece to download:
  – The first choice is made randomly, so as to spread load
  – Subsequent pieces are based on a rarest-first approach to increase probability all pieces are available.
• When a peer has downloaded a new piece which matches the SHA1 hash, it notifies its peers it has a new piece.

Choosing Downloaders
• A request to upload is accepted if the requester recently uploaded to it
• This provides an incentive to machines to share data
• Periodically other machines are tried for upload

2.12.5 Summary
• Web caching improves performance by a reasonable factor, dependent on situation
• Pro-active content distribution can reduce latency and improve bandwidth usage for popular services
• BitTorrent can improve bandwidth usage by spreading load across peers.
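The piece selection rule from "Choosing Pieces" (a random first piece, then rarest first) and the SHA1 check on a downloaded piece can be sketched as follows. The function names, the data layout and the lowest-index tie-break are assumptions made for illustration and are not taken from any BitTorrent client.

```python
import hashlib
import random
from collections import Counter


def choose_piece(have, peer_pieces, rng=random):
    """Return the index of the next piece to request, or None.

    have:        set of piece indices already downloaded.
    peer_pieces: dict mapping peer name -> set of piece indices it holds.
    """
    counts = Counter(p for pieces in peer_pieces.values()
                     for p in pieces if p not in have)
    if not counts:
        return None                          # nothing useful is on offer
    if not have:
        return rng.choice(sorted(counts))    # first piece: random, spreads load
    rarest = min(counts.values())
    # assumption: ties between equally rare pieces go to the lowest index
    return min(p for p, c in counts.items() if c == rarest)


def piece_is_valid(data, expected_sha1_hex):
    """Check a downloaded piece against the hash from the .torrent file."""
    return hashlib.sha1(data).hexdigest() == expected_sha1_hex


# Hypothetical peer set: piece 2 is held by one peer only, so it is rarest.
peers = {"p1": {0, 1, 2}, "p2": {0, 1}, "p3": {0}}
next_piece = choose_piece({0}, peers)
```

Requesting the rarest piece first means that a piece held by only one peer is copied before that peer leaves, which raises the chance that every piece stays available somewhere in the peer set.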

2.13 Replication: Availability and Consistency
• Motivation for replication
• Multicasting updates to a group of replicas
• Total Ordering
• Causal Ordering
• Techniques for ordering protocols
• ISIS CBCAST

2.13.1 What is Replication?
• Multiple copies of dynamic state stored on multiple machines eg Copies of files stored on different machines, name servers storing name address mappings
• Caching can be seen as a form of replication.

Why is Replication used?
Performance enhancement
• Single Server acts as a bottleneck - if we can balance load amongst multiple servers, get apparent performance gain
• If clients are geographically distributed, we can site servers near clients and reduce communication costs

Availability
• If a machine fails, then we can still provide a service
• Probability of total failure reduced such as all data being lost, since data replicated across multiple machines
• If probability of failure is $pr(fail)$ for a given machine in n machines, then assuming independence of failure, the probability of loss of service is $pr(fail)^{n}$ and the availability of the service is $1 - pr(fail)^{n}$
• eg, if mean time between failure for 3 machines is 5 days and repair time is 4 hours, then $pr(fail) = \frac{4}{5 \times 24} \approx 0.033$ and Availability $= 1 - 0.033^{3} \approx 99.996\%$

Fault Tolerance
Even in the presence of failure, the service will continue to give the correct service
• Stronger than availability, since can provide real-time guarantees (with extra work!)
• Can protect against arbitrary failure where machines feed wrong information (Byzantine Failure)
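The availability arithmetic from this section can be checked with a short calculation. The function name and parameters are assumptions made for illustration; like the notes, the sketch approximates the probability that a machine is down by repair time divided by mean time between failures, and assumes failures are independent.

```python
def availability(mtbf_hours, repair_hours, replicas):
    """Availability of a service replicated on `replicas` machines."""
    p_fail = repair_hours / mtbf_hours    # chance that one machine is down
    return 1 - p_fail ** replicas         # service is lost only if all are down


three_replicas = availability(mtbf_hours=5 * 24, repair_hours=4, replicas=3)
one_machine = availability(mtbf_hours=5 * 24, repair_hours=4, replicas=1)
```

With three replicas the result is about 99.996%, against roughly 96.7% for a single machine with the same failure and repair times.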

2.13.2 Issues in Replication
A collection of replicas should behave as if state was stored at one single site
• When accessed by client, view should be consistent
• Replication should be transparent - client unaware that servers are replicated
If we are providing a replica service, replica can be passive or active.

Passive Replication
Passive replicas are standbys, to maintain service on failure. No performance improvement. Standbys must monitor and copy state of active server. Provide availability in simple manner. Used for highly available systems eg space applications

2.13.3 Consistency
• Clients can modify resource on any of the replicas.
• What happens if another client requests resource before replica has informed others of modification, as in cache consistency in distributed file systems?
• Answer depends upon application...

Example Distributed Bulletin Board System (BBS)
[Figure: clients connect through front ends to a set of Replica Managers]
• Users read and submit articles through Front End.
• Articles replicated across a number of servers
• Front Ends can connect to any server
• Servers propagate articles between themselves so that all servers hold copies of all articles.

• User membership of a given bbs is tightly controlled.

Questions on BBS:
• How should messages be passed between replicas?
• Should order of presentation of articles to clients be the same across all replicas? Are weaker ordering semantics possible?
• When a client leaves bbs group, and comes back, can they see articles submitted after they have left? Is this desirable?
• What should happen when replicas are temporarily partitioned?

2.13.4 Updating Server state
Clients read and update state at any of the replicated servers eg submit messages in bbs. To maintain consistency, three things are important
Multicast communication Messages delivered to all servers in the group replicating data
Ordering of messages Updates occur in the same "order" at each server
Failure recovery When servers or the network fails, the replicas must be able to regain consistency. Done through Voting and Transactions (later in course)

2.13.5 Multicast and Process Groups
A Process Group: a collection of processes that co-operate towards a common goal.
Multicast communication: One message is sent to the members of a process group
Idea: Instead of knowing address of process, just need to know an address representing the service. Lower levels take care of routing messages.
Useful for:
Replicated Services One update message goes to all replicas, which perform identical operations. Reduces communication costs.
Locating objects in distributed services Request for object goes to all processes implementing service, but only process holding object replies.

Group Services
Maintenance of group information is a complex function of the name service (for tightly managed groups)
Create Group Create a group identifier that is globally unique.

Join Group Join a group. Requires joining process information to be disseminated to message routing function. May require authentication and notification of existing members.
Leave Group Remove a process from a group. May require authentication, may occur as a result of failure or partition. Need to notify message routing function, may notify other members.
Member List Supply the list of processes within a group. Needed for reliable message delivery, may require authentication.

2.13.6 Message Ordering
If two processes multicast to a group, the messages may be arbitrarily ordered at any member of the group.
Process P1 multicasts message a to a group comprising processes P1, P2, P3 and P4. Process P2 multicasts message b to the same group. The order of arrival of a and b at members of the group can be different.
[Figure: P1 multicasts a and P2 multicasts b to the group P1-P4; a and b arrive in different orders at different members]

Ordering example
[Figure: P1 sends "create object" and then "delete object" to P2 and P3; at one of them the messages arrive in the opposite order, "delete object" then "create object"]
• Order of operations may be important - delete object, create object.
• If delete object arrives before create object, then operation not completed

Ordering Definitions
Various definitions of order with increasing complexity in multicasting protocol
FIFO Ordering Messages from one process are processed at all group members in same order
Causal Ordering All events which preceded the message transmission at a process precede message reception at other processes. Events are message receptions and transmissions.
Total Ordering Messages are processed at each group member in the same order.
Sync Ordering For a sync ordered message, either an event occurred before message reception at all processes, or after message. Other events may be causally or totally ordered.

FIFO ordering
Achieved by process adding a sequence number to each message. Group member orders incoming messages with respect to sequence number. Applicable when each process state is separate, or operations don't modify state, just add incremental updates or read.

Total Ordering
When several messages are sent to a group, all members of the group receive the messages in the same order. Two techniques for implementation:

Sequencer Elect a special sequencing node. All messages are sent to sequencer, who then sends messages onto replicas. FIFO ordering from sequencer guarantees total ordering. Suffers from single point of failure (recoverable by election) and bottleneck.

Sequence Number Negotiation Sender negotiates a largest sequence number with all replicas.
1. Replicas store largest final sequence number yet seen $F_{max}$, and largest proposed sequence number $P_{max}$
2. Sender sends all replicas message with temporary ID.
3. Each Replica i replies with suggested sequence number of $\max(F_{max}, P_{max}) + 1$. Suggested sequence number provisionally assigned to message and message placed in holdback queue (ordered with smallest sequence number at front)
4. Sending site chooses largest sequence number and notifies replicas of final sequence number. Replicas replace provisional sequence number with final sequence number.
5. When item at front of queue has an agreed final sequence number, deliver the message.

Causal Ordering
"Cause" means "since we don't know application, messages might have causal ordering"
a and b are events, generally sending and receiving of messages. We define the causal relation, a → b, if
1. if a and b are events at the same process, a → b implies a happened before b
2. if a is a message sent by process P1 and b is the arrival of the same message at P2, then a → b is true
In bulletin board, an article titled "re: Multicast Routing" in response to an article called "Multicast Routing" should always come after, even though may be received before the initial article.

Holdback Queue Received messages are not passed to the application immediately, but are held in a holdback queue until the ordering constraints are met.
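A holdback queue can be illustrated for the simplest case, where the ordering constraint is a single sequence number such as the one stamped on each message by an elected sequencer. The class name, the numbering from 1 and the use of a heap are assumptions made for this sketch; negotiation of sequence numbers between replicas is not modelled.

```python
import heapq


class HoldbackQueue:
    """Deliver messages to the application strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 1       # assumption: sequence numbers start at 1
        self.held = []          # min-heap of (sequence number, message)
        self.delivered = []     # what the application has seen, in order

    def receive(self, seq, message):
        heapq.heappush(self.held, (seq, message))
        # Release every message whose turn has come; later ones stay held back.
        while self.held and self.held[0][0] == self.next_seq:
            _, ready = heapq.heappop(self.held)
            self.delivered.append(ready)
            self.next_seq += 1


replica = HoldbackQueue()
replica.receive(2, "b")      # held: message 1 has not arrived yet
replica.receive(3, "c")      # held
replica.receive(1, "a")      # releases a, then b, then c
```

Every replica that applies the same rule delivers the messages in the same order, whatever order the network presents them in.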

CBCAST - Causal ordering in ISIS
ISIS is a real commercial distributed system, based on process groups. Causal ordering for multicast within a group is based around Vector Timestamps.
The vector $VT$ has an identifier entry for each member of the group, typically an integer. Vector timestamps have one operation defined
$merge(u, v)[k] = \max(u[k], v[k])$, for $k = 1..n$
Incoming messages are placed on a holdback queue, until all messages which causally precede the message have been delivered.

CBCAST Implementation
1. All processes $p_i$ initialise the vector to zero
2. When $p_i$ multicasts a new message, it first increments $VT_i[i]$ by 1; it piggybacks $vt = VT_i$ on the message
3. Messages are delivered to the application in process $p_j$ when
   • The message is the next in sequence from $p_i$, i.e. $vt[i] = VT_j[i] + 1$
   • All causally prior messages that have been delivered to $p_i$ must have been delivered to $p_j$, i.e. $VT_j[k] \geq vt[k]$ for $k \neq i$.
4. When a message bearing a timestamp $vt$ is delivered to $p_j$, $p_j$'s timestamp is updated as $VT_j = merge(vt, VT_j)$

In words
• Incoming vector timestamp is compared to current timestamp.
• If conditions for delivery to process not met, then message placed on holdback queue.
• When an incoming message is delivered, the timestamp is updated by the merge.
• Examine all messages in the holdback queue to see if they can be delivered.
• CBCAST requires reliable delivery.
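The CBCAST delivery rules can be sketched as a small simulation. This is not the ISIS implementation: the class name CbcastProcess, the message tuple layout and the zero-based process indices are assumptions, the network is replaced by direct method calls, and a sender is taken to have delivered its own message at the moment it multicasts it.

```python
class CbcastProcess:
    """One group member applying the CBCAST delivery conditions."""

    def __init__(self, index, group_size):
        self.i = index
        self.vt = [0] * group_size     # vector timestamp, initialised to zero
        self.holdback = []             # received but not yet deliverable
        self.delivered = []            # payloads passed to the application

    def multicast(self, payload):
        self.vt[self.i] += 1                       # increment own entry first
        return (self.i, list(self.vt), payload)    # piggyback the timestamp

    def _deliverable(self, sender, vt):
        next_from_sender = vt[sender] == self.vt[sender] + 1
        seen_all_prior = all(self.vt[k] >= vt[k]
                             for k in range(len(vt)) if k != sender)
        return next_from_sender and seen_all_prior

    def receive(self, message):
        self.holdback.append(message)
        progress = True
        while progress:                # re-examine the queue after each delivery
            progress = False
            for held in list(self.holdback):
                sender, vt, payload = held
                if self._deliverable(sender, vt):
                    self.holdback.remove(held)
                    self.delivered.append(payload)
                    self.vt = [max(a, b) for a, b in zip(self.vt, vt)]  # merge
                    progress = True


p1, p2, p3 = (CbcastProcess(i, 3) for i in range(3))
m1 = p1.multicast("article")           # timestamp (1,0,0)
p2.receive(m1)                         # delivered immediately
m2 = p2.multicast("re: article")       # timestamp (1,1,0)
p3.receive(m2)                         # held back: m1 not yet delivered at p3
held_before = list(p3.holdback)
p3.receive(m1)                         # m1 is delivered, then m2 is released
```

The reply reaches p3 before the article it answers, but it stays on the holdback queue until the article has been delivered, so p3's application sees them in causal order.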

[Figure: Causal Example. Processes P1, P2 and P3 exchange messages with timestamps (1,0,0) and (1,1,0). Labels in the figure: "delivered immediately"; "Message delayed on holdback queue"; "Message on holdback queue (1,1,0) can now be delivered".]

Group View Changes
• When group membership changes, what set of messages should be delivered to members of changed group?
• What happens to undelivered messages of failed members?
• What messages should new member get?
• ISIS solves by sending a sync ordered message announcing that the group view has changed. Messages thus belong to a particular group view.
• Use coordinator to decide which messages belong to which view.

2.13.7 Summary
• Replication of services and state increase availability
• Replication increases performance
• Replication increases Fault tolerance
• To maintain consistency, multicast updates to all replicas
• Use sequence numbers to maintain FIFO ordering
• Use Vector Timestamps to maintain Causal Ordering
• Use elected sequencers or identifier negotiation to maintain total ordering

2.14 Shared Data and Transactions
• Stateful Servers
• Atomicity

• Transactions
• ACID
• Serial Equivalence

2.14.1 Servers and their state
• Servers manage a resource, such as a database or a printer
• Attempt to limit problems of distributed access by making server stateless, such that each request is independent of other requests.
  – Servers can crash in between servicing clients
  – Client requests cannot interfere with each other (assuming concurrency control in server)
• But we can't always design stateless servers.

Stateful Servers
• Some applications better modelled as extended conversations, eg retrieving a list of records in a large database better modelled as getting batch of records at a time.
• If long duration then need to be aware of state.
• If other conversations need to go on, eg modify records during retrieval, how do we allow concurrency?
• What happens if machine fails - need to recover.
• If application requires state to be consistent across a number of machines, then each machine must recognise when it can update internal data. Needs to keep track of state of distributed conversation

2.14.2 Atomicity
Stateful server have two requirements
1. Accesses from different clients shouldn't interfere with each other
2. Clients should get fast access to the server
• Should also aim to be fault tolerant

Definition
We define atomicity as
All or Nothing A client's operation on a server's resource should complete successfully, and the results hold thereafter (yea, even unto a server crash), or it should fail and the resource should show no effect of the failed operation
Isolation Each operation should proceed without interference from other clients' operations - intermediate effects should not be visible.

Example
Mutual Exclusion For a multi-threaded server, if two or more threads attempt to modify the same piece of data, then the updates should have mutual exclusion around the updates to provide isolation, using semaphores or monitors
Synchronisation In situations such as Producer Consumer, need to allow one operation to finish so second operation can use results, needing isolation.

2.14.3 Automatic Teller Machines and Bank accounts
[Figure: automatic teller machines (ATMs) connected to an ATM controller, which accesses the bank accounts "Greedy" and "Sharing" and connects to other banks]
• An ATM or cashmachine allows transfer of funds between accounts.
• Accounts are held at various machines belonging to different banks
• Accounts offer the following operations
deposit Place an amount of money in an account
withdraw Take an amount of money from an account
balance Get the current value in an account

• Operations implemented as read() and write() of values, so withdraw x from A and deposit x in B implemented as
  1. A.write(A.read() - x)
  2. B.write(B.read() + x)

Concurrency Problems
• Lost Update
• Inconsistent Retrievals

2.14.4 Transactions
Transactions are technique for grouping operations on data so that either all complete or none complete
Typically server offers transaction service, such as:
beginTransaction(transId) Record the start of a transaction and associate operations with this transId with this transaction.
commitTransaction(transId) Commit all the changes the operations in this transaction have made to permanent storage.
abortTransaction(transId) Abort all the changes the transaction operations have done, and roll back to previous state.

ACID
Transactions are described by the ACID mnemonic
Atomicity Either all or none of the Transaction's operations are performed. If a transaction is interrupted by failure, then partial changes are undone
Consistency System moves from one self-consistent state to another
Isolation An incomplete transaction never reveals partial state or changes before committing
Durability After committing, the system never loses the results of the transaction, independent of any subsequent failure

2.14.5 Serial Equivalence
Definition: Two transactions are serial if all the operations in one transaction precede the operations in the other.
eg the following actions are serial $R_i(x) W_i(x) R_i(y) R_j(x) W_j(y)$
Definition: Two operations are in conflict if:

[Figure: lost update example]

Transaction T                           Transaction U
A.withdraw(4,T); B.deposit(4,T)         C.withdraw(3,U); B.deposit(3,U)

balance = A.read()            £100
A.write(balance - 4)          £96
                                        balance = C.read()            £300
                                        C.write(balance - 3)          £297
balance = B.read()            £200
                                        balance = B.read()            £200
                                        B.write(balance + 3)          £203
B.write(balance + 4)          £204

[Figure: inconsistent retrieval example]

Transaction T                           Transaction U
A.withdraw(100,T); B.deposit(100,T)     Bank.total(U)

balance = A.read()            £200
A.write(balance - 100)        £100
                                        balance = A.read()            £100
                                        balance = B.read() + balance  £300
                                        balance = C.read() + balance  £300+
balance = B.read()            £200
B.write(balance + 100)        £300

• At least one is a write
• They both act on the same data
• They are issued by different transactions
$R_i(x) R_j(x) W_i(x) W_j(y) R_i(y)$ has $R_j(x) W_i(x)$ in conflict

Definition: Two schedules are computationally equivalent if:
• The same operations are involved (possibly reordered)
• For every pair of operations in conflict, the same operation appears first in each schedule
So, a schedule is serialisable if the schedule is computationally equivalent to a serial schedule.
Question: Is the following schedule for these two transactions serially equivalent? $R_i(x) R_i(y) R_j(y) W_j(y) R_i(x) W_j(x) W_i(y)$

Transaction Nesting
Transactions may themselves be composed of multiple transactions eg Transfer is a composition of withdraw and deposit transactions, which are themselves composed of read and write transactions
Benefits:
• Nested transactions can run concurrently with other transactions at same level in hierarchy
• If lower levels abort, may not need to abort whole transaction. Can instead use other means of recovery.

2.14.6 Summary
• Transactions provide technique for managing stateful servers
• Need to worry about concurrency control
• Need to worry about aspects of distribution
• Need to worry about recovery from failure

2.15 Concurrency Control and Transactions
• Problem restatement
• Locking
• Optimistic control
• Timestamping
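As a restatement of the problem, the conflict and equivalence definitions from Section 2.14.5 can be turned into a small check. The sketch uses a precedence graph, a standard technique that is not described in these notes: an edge from one transaction to another records that some operation of the first conflicts with, and comes before, an operation of the second, and a schedule is equivalent to a serial one exactly when this graph has no cycle. The tuple layout and function names are assumptions made for illustration.

```python
def conflicts(op1, op2):
    """Operations are (transaction, kind, item) tuples, kind is 'R' or 'W'."""
    t1, kind1, item1 = op1
    t2, kind2, item2 = op2
    return t1 != t2 and item1 == item2 and "W" in (kind1, kind2)


def is_serialisable(schedule):
    """True if the schedule is computationally equivalent to a serial one."""
    # Edge (a, b): an operation of a conflicts with a later operation of b.
    edges = {(first[0], later[0])
             for pos, first in enumerate(schedule)
             for later in schedule[pos + 1:]
             if conflicts(first, later)}
    remaining = {op[0] for op in schedule}
    # Peel off transactions nothing has to precede; a cycle blocks progress.
    while remaining:
        free = {t for t in remaining
                if not any(dst == t and src in remaining for src, dst in edges)}
        if not free:
            return False
        remaining -= free
    return True


# The reads and writes of account B from a lost update interleaving.
lost_update = [("T", "R", "B"), ("U", "R", "B"), ("U", "W", "B"), ("T", "W", "B")]
serial = [("T", "R", "B"), ("T", "W", "B"), ("U", "R", "B"), ("U", "W", "B")]
```

The lost update interleaving fails the check because T must come before U (T reads B before U writes it) and U must also come before T (U reads B before T writes it).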

2.15.1 Why concurrency control?
• To increase performance, multiple transactions must be able to carry on work simultaneously...
• ...but if data is shared, then can lead to problems such as lost updates and inconsistent retrievals.
• So we must ensure schedules of access to data for concurrent transactions are computationally equivalent to a serial schedule of the transactions.

2.15.2 Locking
• As in operating systems, locks control access for different clients
• Granularity of data locked should be small so as to maximise concurrency, with trade-off against complexity.
• To prevent intermediate leakage, once lock is obtained, it must be held till transaction commits or aborts

Conflict rules
• Conflict rules determine rules of lock usage
• If operations are not in conflict, then locks can be shared ⇒ read locks are shared
• Operations in conflict imply operations should wait on lock ⇒ write waits on read or write lock, read waits on write lock
• Since can't predict other item usage till end of transactions, locks must be held till transaction commits or aborts.
• If operation needs to do another operation on same data then promotes lock if necessary and possible - operation may conflict with existing shared lock

Rules for strict two phase locking
1. When operation accesses data item within transaction
   (a) If item isn't locked, then server locks and proceeds
   (b) If item is held in a conflicting lock by another transaction, transaction must wait till lock released
   (c) If item is held by non-conflicting lock, lock is shared and operation proceeds
   (d) If item is already locked by same transaction, lock is promoted if possible (refer to rule b)
2. When transaction commits or aborts, locks are released
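Rules (a) to (d) can be sketched as a lock table. The sketch is single-threaded and illustrative only: instead of blocking, lock() returns False when the transaction would have to wait, and the class name, method names and the "read"/"write" mode strings are assumptions.

```python
class LockManager:
    """Toy lock table following the strict two phase locking rules."""

    def __init__(self):
        self.locks = {}     # item -> (mode, set of transaction ids holding it)

    def lock(self, trans_id, item, mode):
        """Return True if the lock is granted, False if the caller must wait."""
        held = self.locks.get(item)
        if held is None:                              # rule (a): item not locked
            self.locks[item] = (mode, {trans_id})
            return True
        held_mode, holders = held
        if holders == {trans_id}:                     # rule (d): sole holder
            if mode == "write":
                self.locks[item] = ("write", holders)     # promote the lock
            return True
        if mode == "read" and held_mode == "read":    # rule (c): share read lock
            holders.add(trans_id)
            return True
        return False                                  # rule (b): conflict, wait

    def unlock(self, trans_id):
        """Rule 2: on commit or abort release every lock the transaction holds."""
        for item in list(self.locks):
            _, holders = self.locks[item]
            holders.discard(trans_id)
            if not holders:
                del self.locks[item]


manager = LockManager()
shared = manager.lock("T", "i", "read") and manager.lock("U", "i", "read")
blocked = not manager.lock("U", "i", "write")   # T still shares the read lock
manager.unlock("T")                             # T commits
promoted = manager.lock("U", "i", "write")      # U is now the sole holder
```

Promotion is refused while another transaction shares the read lock, which is the case rule (d) refers back to rule (b) for.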

Locking Implementation
• Locks generally implemented by a lock manager
lock(transId, DataItem, LockType) Lock the specified item if possible, else wait according to rules above
unLock(transId) Release all locks held by the transaction
• Lock manager generally multi-threaded, requiring internal synchronisation
• Heavyweight implementation

Example
Transactions T and U.
• T: $R_T(i)$, $W_T(j, 44)$
• U: $W_U(i, 55)$, $R_U(j)$, $W_U(j, 66)$
Question What are the possible schedules allowed under strict locking?
Question Are there any schedules computationally equivalent to a serial schedule which are disallowed?

Deadlocks
• Locks can lead to deadlock, under the following conditions
  1. Limited access (eg mutex or finite buffer)
  2. No preemption (if someone has resource can't take it away)
  3. Hold and wait. Independent threads must possess some of their needed resources and be waiting for the remainder to become free.
  4. Circular chain of requests and ownership.
• Most common way of protecting against deadlock is through timeouts. After timeout, lock becomes vulnerable and can be broken if another transaction attempts to gain lock, leading to aborted transactions

Drawbacks of Locking
• Locking is overly restrictive on the degree of concurrency
• Deadlocks produce unnecessary aborts
• Lock maintenance is an overhead, that may not be required

2.15.3 Optimistic Concurrency Control
• Most transactions do not conflict with each other
• So proceed without locks, and check on close of transaction that there were no conflicts
  – Analyse conflicts in validation process
  – If conflicts could result in non-serialisable schedule, abort one or more transactions
  – else commit

Implementation of Optimistic Concurrency Control
Transaction has following phases
1. Read phase in which clients read values and acquire tentative versions of items they wish to update
2. Validation phase in which operations are checked to see if they are in conflict with other transactions - complex part. If invalid, then abort.
3. If validated, tentative versions are written to permanence, and transaction can commit (or abort).

Validation approaches
• Validation based upon conflict rules for serialisability
• Validation can be either against completed transactions or active transactions - backward and forward validation.
• Simplify by ensuring only one transaction in validation and write phase at one time
• Trade-off between number of comparisons, and transactions that must be stored.

Forward Validation
1. A transaction in validation is compared against all transactions that haven't yet committed
2. Writes may affect ongoing reads
3. The write set of the validating transaction is compared against the read sets of all other active transactions.
4. If the sets conflict, then either abort validating transaction, delay validation till conflicting transaction completes, or abort conflicting transaction.
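The comparison at the heart of forward validation is a set intersection, sketched below. The function name and data layout are assumptions made for illustration; the sketch only reports which active transactions conflict and leaves the choice between the three responses in step 4 to the caller.

```python
def forward_validate(write_set, active_read_sets):
    """Return the ids of active transactions that conflict with the validator.

    write_set:        set of items written by the validating transaction.
    active_read_sets: dict mapping active transaction id -> set of items read.
    """
    return sorted(tid for tid, reads in active_read_sets.items()
                  if write_set & reads)


# Hypothetical state: U has read x and y, V has read only z.
active = {"U": {"x", "y"}, "V": {"z"}}
conflicting = forward_validate({"x"}, active)   # U has read what is being written
```

An empty result means the validating transaction can move on to its write phase.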

Backward validation
1. Writes of current transaction can't affect previous transaction reads, so we only worry about reads with overlapping transactions that have committed.
2. If current read sets conflict with already validated overlapping transactions write sets, then abort validating transaction

2.15.4 Timestamping
Operates on tentative versions of data
• Each Transaction receives global unique timestamp on initiation
• Every object, x, in the system or database carries the maximum (ie youngest) timestamp of last transaction to read, $RTM(x)$ (Read Timestamp Maximum), and maximum of last transaction to write, $WTM(x)$ (Write Timestamp Maximum)
• If transaction requests operation that conflicts with younger transaction, older transaction restarted with new timestamp.
• Transactions committed in order of timestamps, so a transaction may have to wait for earlier transaction to commit or abort before committing.
• Since tentative version is only written when transaction is committed, read operations may have to wait until the last transaction to write has committed.

An operation in transaction $T_i$ with start time $TS_i$ is valid if:
• The operation is a read operation and the object was last written by an older transaction ie $TS_i > WTM(x)$. If read permissible, $RTM(x) = MAX(TS_i, RTM(x))$
• The operation is a write operation and the object was last read and written by older transactions ie $TS_i > RTM(x)$ and $TS_i > WTM(x)$. If permissible, $WTM(x) = TS_i$

2.15.5 Summary
• Locks are commonest ways of providing consistent concurrency
• Optimistic concurrency control and timestamping used in some systems
• But, consistency in concurrency is application dependent - for shared editors, people may prefer to trade speed of execution against possibilities of conflict resolution. Problems can occur with long term network partition. Approaches based on notification and people resolution becoming popular.
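The validity rules for timestamped operations in Section 2.15.4 can be sketched for a single object. The class and method names are assumptions made for illustration; tentative versions, waiting for earlier transactions to commit and restarting with a new timestamp are all left out, so a False result simply marks an operation that the rules reject.

```python
class TimestampedObject:
    """One data item carrying its read and write timestamp maxima."""

    def __init__(self):
        self.rtm = 0    # youngest transaction to have read the object
        self.wtm = 0    # youngest transaction to have written the object

    def read(self, ts):
        if ts > self.wtm:                    # last written by an older transaction
            self.rtm = max(ts, self.rtm)
            return True
        return False

    def write(self, ts):
        if ts > self.rtm and ts > self.wtm:  # last read and written by older ones
            self.wtm = ts
            return True
        return False


x = TimestampedObject()
results = [
    x.write(2),    # accepted: nothing younger has touched x
    x.read(1),     # rejected: x was written by the younger transaction 2
    x.read(3),     # accepted
    x.write(2),    # rejected: x was read by the younger transaction 3
    x.write(4),    # accepted
]
```

A rejected operation corresponds to the case where the older transaction has to be restarted with a new timestamp.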

2.16 Distributed Transactions

• Models for distributed transactions
• Attaining distributed commitment
• Distributed Concurrency Control

2.16.1 Single Server Transactions

[Figure: client1, client2, ..., clientN send transaction 1, transaction 2, ..., transaction N to a Transaction Manager in front of a resource server]

• Till now, transactions have referred to multiple clients, single server.
• How do we have multiple clients interacting with multiple servers? eg complicated funds transfer involving different accounts from different banks?
• Generalise transactions to distributed case...

2.16.2 Distributed Transactions

Distributed Transaction Requirements

General characteristics of distributed systems
• Independent Failure Modes
• No global time
• Inconsistent State

Need to consider:
• how to achieve distributed commitment (or abort)
• how to achieve distributed concurrency control

Models

[Figure: two models. Simple Distributed model: a client and transaction T with servers X, Y and Z. Nested Transaction: a client and transaction T with subtransactions T1, T2, T11, T12, T21 and T22 and servers X, Y, Z, M, N and P]

• If client runs transactions, then each transaction must complete before proceeding to next
• If transactions are nested, then transactions at same level can run in parallel
• Client uses a single server to act as coordinator for all other transactions. The coordinator handles all communication with other servers

Question: What are the requirements of transaction ids?

2.16.3 Atomic Commit Protocols

• Distribution implies independent failure modes, ie machine can fail at any time, and others may not discover.
• If one phase commit, client requests commit, but one of the servers may have failed - no way of ensuring durability
• Instead, commit in 2 phases, thus allowing server to request abort.

2 Phase Commit

• One coordinator responsible for initiating protocol.
• Other entities in protocol called participants.
• If coordinator or participant unable to commit, all parts of transaction are aborted.
• Two phases
  Phase 1 Reach a common decision
  Phase 2 Implement that decision at all sites

2 Phase Commit Details

1. Phase 1 The coordinator sends a CanCommit? message to all participants in transaction.
2. Participants reply with vote yes or no. If vote is no participant aborts immediately.
3. Phase 2 Coordinator collects votes including own:
   (a) If all votes are yes, coordinator commits transaction and sends DoCommit to all participants.
   (b) Otherwise transaction is aborted, and coordinator sends abortTransaction to all participants.
4. When a participant receives DoCommit, it commits its part of the transaction and confirms using HaveCommitted

2 Phase Commit Diagram

[Figure: message exchange between coordinator and participant. Messages: CanCommit?, Yes, DoCommit, HaveCommitted. States shown: Prepared to commit (waiting for votes), Prepared to commit (uncertain), Commit, Committed (or aborted), Done]

Note:
• If participant crashes after having voted to commit, it can ask coordinator about results of vote.
• Timeouts are used when messages are expected.
• Introduces new state in transaction Prepared to commit.
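The message flow of the two phases can be sketched as a single-process simulation. This is an illustration under stated assumptions: the Participant class, its method names and the two_phase_commit function are invented for the example, messages are modelled as direct method calls, and crashes, timeouts and the recovery file are left out entirely.

```python
class Participant:
    def __init__(self, name, vote_yes=True):
        self.name = name
        self.vote_yes = vote_yes
        self.state = "active"

    def can_commit(self):
        # Phase 1: reply to CanCommit?. A "no" voter aborts immediately,
        # a "yes" voter enters the uncertain (prepared to commit) state.
        self.state = "uncertain" if self.vote_yes else "aborted"
        return self.vote_yes

    def do_commit(self):
        # Phase 2: commit our part and confirm to the coordinator.
        self.state = "committed"
        return "HaveCommitted"

    def abort_transaction(self):
        self.state = "aborted"


def two_phase_commit(coordinator_vote, participants):
    """Run both phases and return the coordinator's decision."""
    votes = [p.can_commit() for p in participants]   # Phase 1
    if coordinator_vote and all(votes):              # Phase 2 (a)
        for p in participants:
            p.do_commit()
        return "committed"
    for p in participants:                           # Phase 2 (b)
        p.abort_transaction()
    return "aborted"


all_yes = [Participant("X"), Participant("Y")]
outcome_yes = two_phase_commit(True, all_yes)

one_no = [Participant("X"), Participant("Y", vote_yes=False)]
outcome_no = two_phase_commit(True, one_no)
```

The sketch makes the key property visible: a single no vote, from any participant or from the coordinator itself, aborts every part of the transaction.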

2.16.4 Distributed Concurrency Control

Locking

• Locking is done per item, not per client.
• No problems generalising to multiple servers... except in dealing with distributed deadlock
• Same techniques as usual, but interesting dealing with distributed deadlock detection.

Optimistic Concurrency Control

• Need to worry about distributed validation
• Simple model of validation had only one transaction being validated at a time - can lead to deadlock if different coordinating servers attempt to validate different transactions.
• Also need to validate in correct serialisable order.
• One solution is to globally only allow one transaction to validate at a time.
• Other solution is to validate in two phases with timestamp allocation local, then global to enforce ordering.

Timestamping

• If clocks are approximately synchronised, then timestamps can be <localtimestamp, coordinatingserverid> pairs, and an ordering defined upon server ids.

2.16.5 Summary

• Nested Transactions are best model for distributed transactions
• Two Phase Commit protocol suitable for almost all cases
• Distributed Concurrency control is only slightly more difficult than for single server case

2.17 Transactions: Coping with Failure

• Failure Modes
• Recovery Techniques
• Partitions and quorum voting

2.17.1 Failure Modes

For Transactions to be atomic and durable, need to examine failures
1. Transaction-local failures, detected by the application which calls abort eg insufficient funds. No info loss, need to undo changes made.
2. Transaction-local failures, not detected by application, but by system as whole, eg divide by zero. System calls abort.
3. System failures affecting transactions in progress but not media eg CPU failure. Loss of volatile store and possibly all transactions in progress. On recovery, special recovery manager undoes effects of all transactions in progress at failure.
4. Media failures affecting database eg head crash. No way of protecting against this.

2.17.2 Recovery

• We assume a machine crashes, but then is fixed and returns to operation [5].
• We need to recover state to ensure that the guarantees of the transactional systems are kept.
• Use a recovery file or log that is kept on permanent storage.

The Recovery Manager

• Recovery from failure handled by entity called Recovery Manager.
• Keeps information about changes to the resource in a recovery file (also called Log) kept in stable storage - ie something that will survive failure.
• When coming back up after failure, recovery manager looks through recovery file and undoes changes (or redoes changes) so as uncommitted transactions didn't happen, and committed transactions happened.
• Events recorded on Recovery file for each change to an object in database.

Recovery File

Information recorded per event include:
Transaction Id  To associate change with a transaction
Record Id       The identifier of the object
Action type     Create/Delete/Update etc

[5] hence sidestepping the impossibility of byzantine agreement in asynchronous systems, because we assume the machine always returns

Old Value       To enable changes to be undone
New Value       To enable changes to be redone

Also log beginTransaction, prepareToCommit, commit, and abort actions, with their associated transaction id.

Recovering

If after failure,
• the database is undamaged, undo all changes made by transactions executing at time of failure
• the database is damaged, then restore database from archive and redo all changes from committed transactions since archive date.

The Recovery file entry is made and committed to stable storage before the change is made - incomplete transactions can be undone, committed transactions redone.

What might happen if database changed before recovery file written?

Note that recovery files have information needed to undo transactions.

Checkpointing

Calculation of which transaction to undo and redo on large logs can be slow. Recovery files can get too large. Instead, augment recovery file with checkpoint
• Force recovery file to stable storage
• Write checkpoint record to stable store with
  1. A list of currently active transactions
  2. for each transaction a pointer to the first record in recovery file for that transaction
• Force database to disk
• Write address of checkpoint record to restart location atomically

Recovering with checkpoints

To recover, have undo and redo lists. Add all active transactions at last checkpoint to undo list
1. Forwards from checkpoint to end,
   (a) If find beginTransaction add to undo list
   (b) If find commit record add to redo list
   (c) If find abort record remove from undo list
2. Backwards from end to first record in checkpointed transactions, execute undo for all transaction operations on undo list
3. Forwards from checkpoint to end, redo operations for transactions on redo list

At checkpoint can discard all recovery file to first logged record in checkpointed transactions

Recovery of the Two Phase Commit Protocol

[Figure: the 2 Phase Commit Diagram again. Messages: CanCommit?, Yes, DoCommit, HaveCommitted. States shown: Prepared to commit (waiting for votes), Prepared to commit (uncertain), Commit, Committed (or aborted), Done]

• Coordinator uses prepared to signal starting protocol, commit on signalling DoCommit and done to indicate end of protocol in recovery file.
• Participant uses uncertain to indicate that it has replied yes to commit request, and committed when it receives DoCommit.
• On recovery, coordinator aborts transactions which reach prepared, but not committed, and resends DoCommit when in commit state but not done
• Participant requests decision from coordinator if in uncertain state.

2.17.3 Network Partition

Transactions are often used to keep replicas consistent. If network partitions (cable breaks), replicas divided into two or more sets (possibly with common members).

[Figure: replicas A, B and C, with the link A−B broken]

Link A−B broken
Routing takes time to determine reroute from A−B via C
So clients behind A can see only A and C
Clients behind B can only see B and C
Clients behind C see all replicas

Can we still write and read from any of the sets? Yes, but
• Must reduce possible read and write sets to maintain consistency
• Or relax consistency requirements and resolve problems when partition is healed

Quorum Consensus

Consider set of replicas, where replicated objects have version numbers at each replica.
1. Assign a weighting of votes to each replica, indicating importance.
2. For client to perform operation, it must gather votes from all the replicas it can talk to (denote X).
3. X ≥ votes set for read quorum R to enable read
4. X ≥ votes set for write quorum W to enable write.
5. As long as
   • W > half the total number of votes
   • R + W > total number of votes in group
   Each read quorum and each write quorum will have at least one member in common.

Partition Example

For three replicas, R1, R2, R3, we can allocate votes to give different properties depending upon requirements

Replica   config 1   config 2   config 3
R1        1          2          1
R2        0          1          1
R3        0          1          1

1. What should R and W be set to in the three configurations?
2. What properties result from these configurations?

Read and Write Operations

When write happens, object has a version number incremented

To read, collect votes from replica managers with version numbers of object. Guaranteed to have at least one up to date copy if in read quorum, from which read occurs.

To write, collect votes from replica managers with version numbers of object. If write quorum with up to date copy not discovered, then copy up to date copy around to create write quorum. Then write is allowed.

Manipulating R and W gives different characteristics eg R = 1 and W = number of copies gives unanimous update.

Cached copies of objects can be incorporated as weak representatives with 0 votes, but usable for reads.

2.17.4 Summary

• Atomicity comes from using logging techniques on operations at server, where log is kept on stable storage
• Voting can be used to give availability for resources on partitioned replicas.
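The quorum rules and the versioned read and write operations can be sketched as below. This is a simplified illustration: the function names, the (version, value) tuples and the example vote assignment (one vote per replica with R = W = 2) are assumptions chosen for the example, and the step of copying an up to date copy around is reduced to overwriting every reachable replica.

```python
def valid_quorums(votes, r, w):
    """Check W > half the total votes and R + W > total votes."""
    total = sum(votes.values())
    return w > total / 2 and r + w > total


def gather(votes, reachable):
    """Votes a client can collect from the replicas it can talk to."""
    return sum(votes[name] for name in reachable)


def read(replicas, votes, reachable, r):
    """Return the newest (version, value) seen, or None without a quorum."""
    if gather(votes, reachable) < r:
        return None
    return max((replicas[n] for n in reachable), key=lambda vv: vv[0])


def write(replicas, votes, reachable, w, value):
    """Install value with an incremented version on the reachable replicas."""
    if gather(votes, reachable) < w:
        return False
    newest = max(replicas[n][0] for n in reachable)
    for n in reachable:
        replicas[n] = (newest + 1, value)
    return True


votes = {"R1": 1, "R2": 1, "R3": 1}
replicas = {name: (0, "old") for name in votes}

# A client that can only see R1 and R3 can still write with W = 2.
wrote = write(replicas, votes, ["R1", "R3"], 2, "new")
# A client that can only see R2 and R3 still reads the new value via R3.
seen = read(replicas, votes, ["R2", "R3"], 2)
# A client that can only see R2 cannot form a read quorum.
blocked = read(replicas, votes, ["R2"], 2)
```

Because R + W exceeds the total number of votes, the read quorum {R2, R3} must overlap the write quorum {R1, R3}, which is why the stale copy at R2 is never returned on its own.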

Chapter 3

Exercises and answers

3.1 Exercises

3.1.1 Communication System Fundamentals

1. Assume that two computers are sending 1000 byte packets over a shared channel that operates at 64000 bits per second. To ensure fairness, after sending a packet, the sending machine must wait 200 microseconds before sending another packet, whilst a waiting machine can send a packet 100 microseconds after packet is sent. How long will it take the two machines to each send a MByte file?
2. Ethernet uses a carrier sense multiple access with collision detect scheme. Explain the terms in italics. Even using CSMA/CD, there is still a finite probability of collision between packets. Explain why.
3. Switched networks are taking over from shared media networks as Local Area Networks. Why?

The answer in Section 3.2.1

3.1.2 Devising a Routing Protocol

Your task is to devise a routing protocol based on distance vector routing. You should follow the following steps:
1. What functionality is required from the exchange of messages?
2. What information has to be exchanged?
3. How should this information be represented - define class structures down to primitive types for this section.

4. How should the packets be laid out?
5. When should events occur in the protocol? Think of simple state transition diagrams.

The answer in Section 3.2.2.

3.1.3 Layering

1. For the following:
   • Bandwidth 2 MBit/s
   • 1000 byte packets, 10 byte ack packet
   • Cable distance 3000 km, propagation speed 2 × 10^5 km/s
   What size window is required to keep the line fully utilised?
2. For a sequence space of 3 bits, what is the maximum window size? For a sequence space of size N?
3. If the bandwidth of a network is 100 MBit/s and the sequence space is 32 bits, and each byte is labelled, what is the maximum packet lifetime?
4. The memory buffer capacity of a certain bottle-necked router in the Internet is less than one packet per connection traversing it. What effects will this have on TCP?

The answer in Section 3.2.3.

3.1.4 Serialization

In a remote procedure call system, a key decision is how to represent the data structures within the messages passed between clients and servers. In this exercise, we will think about some of the necessary mechanisms.

1. A remote procedure call system has a simple interface representation, where the following is an example.

    const MAXLENGTH 256;
    struct id {
        string name<MAXLENGTH>;
        int pin;
    };
    program BANKACCOUNT {
        version BANKVERS {
            void deposit(id, float) = 1;
            void withdraw(id, float) = 2;
            float balance(id) = 3;
        } = 1;
    } = 0x28786554;

   Design a representation system for laying out the bits within the request and the response packets. You should consider which information will be known by both ends, and whether there needs to be explicit representation of type information or of data length information.
2. Java uses object references as well as primitive types. Discuss the problems in serializing Java objects, with particular reference to encoding chains of objects, and to deciding whether references are remote or local.

3.1.5 Remote Procedure Call

1. Design a Java interface for a voting service, which can allow a user to retrieve a list of candidates and their manifestoes, to discover the current number of votes a candidate has, and to vote for a particular candidate.
2. Convert the Java interface above to a remote invokable interface. What other classes are needed to provide a complete implementation?
3. Design the equivalent XDR interface for the above voting service.
4. Sun rpc can either work over udp or tcp - what are the advantages and disadvantages of the two approaches?
5. Exception handling is a weak part of the sun rpc system, relying principally on the unix "have global variables in the program environment errno" approach. What are the desirable features of exception handling and what are the problems in extending sun rpc to support exception handling?

The answer in Section 3.2.5.

3.1.6 Security

1. Describe some of the ways in which conventional email is vulnerable to eavesdropping, masquerading, tampering, replay and denial of service. Suggest methods by which email could be protected against each of these forms of attack.

3.1.7 Names and distributed filing systems

1. Discuss the problems raised by the use of aliases in a name service, and indicate how, if at all, these problems may be overcome.
2. The DNS uses a ptr resource record to allow reverse mapping from IP addresses to DNS names. Outline how a lookup must proceed in looking up a ptr record.
3. What data must each NFS client module hold on behalf of each user-level process? Declare a suitable data structure in C to hold the data.
4. Many of the problems in distributed file systems arise because the files are writable. How does the design space change if files are only ever created, read and destroyed? How would you design a scalable file service for such a service? Consider the scalability problems of lookup and retrieval.

The answer in Section 3.2.6.

3.1.8 Availability and Ordering

1. In a multi-user game, players at separate workstations move figures around a common scene. The figures may throw projectiles at one another, and a hit debilitates the unfortunate recipient for some pre-determined interval. What type of ordering is required here? The game incorporates magic devices which may be picked up by a player to assist her. What type of ordering should be applied to the pick up device operation?
2. Three processes, 0, 1, and 2, use the ISIS total ordering protocol, as described in the notes using sequence number agreements. A message is sent from process 0, which reaches process 1 ok, but is delayed in reaching process 2. Process 1 then sends a message which reaches process 2 ok. The original message from process 0 then reaches process 2. In which order are the messages delivered to the processes?
3. A process group using CBcast has three members. Process 0 sends two messages with vector timestamps (6, ...) and (7, ...). Process 1 sends two messages with vector timestamps (5, ...) and (5, ...). Process 3 sends two messages with vector timestamps (6, ...) and (6, ...). Show a possible ordering of the transmission and delivery events at these processes.
4. What guarantees on message delivery are required for CBcast?

The answer in Section ??.

3.1.9 Transactions and Concurrency

1. A server manages data items, and allows read and write operations. Give three serialisable schedules of the following transactions
   • T: R(j), R(i), W(j,44), W(i,33)
   • U: R(k), W(i,55), R(i), W(k,66)
2. For transactions T and U, explain which of the following interleavings can occur with strict two phase locking and with optimistic concurrency control. Compare the possible outcomes in timestamping where TST < TSU and TST > TSU
   (a) RT(i), WU(i, 55), WT(j, 44), WU(j, 66)
   (b) RT(i), WT(j, 44), WU(i, 55), WU(j, 66)
   (c) WU(i, 55), WU(j, 66), RT(i), WT(j, 44)
   (d) WU(i, 55), RT(i), WU(j, 66), WT(j, 44)
   (e) WU(i, 55), RT(i), WT(j, 44), WU(j, 66)
3. Define an inconsistent retrieval. What pattern of operational conflicts may lead to an inconsistent retrieval?
4. Why might the start time of a transaction not be the best time to allocate its timestamp? Given the timestamps of two committed transactions, can you always draw their serialisation graphs? Compare the overhead of implementing locking with that of timestamping

The answer in Section 3.2.7.

3.1.10 Distributed Transactions

1. Transactional file systems have been implemented to provide stronger protection for application programs. Show a possible interface and describe the semantics of a transactional file system that buffers operational results till committed. What issues should be addressed to ensure consistency across machine crashes?
2. Suggest two situations in the two phase commit protocol in which all the workers voted yes in the first phase, yet the coordinator decides to abort the transaction.
3. A server manages the data items a1, a2, ... an. The server provides read and write operations on these items. Consider the following transactions:

   T   x = read(i), write(j, 44)
   U   write(i, 55), write(j, 66)
   V   write(k, 77), write(k, 88)

   Describe the information written to the log file if strict two phase locking is in use, and U acquires ai and aj before T. Describe how the recovery manager would use this information to recover the effects of T, U and V when the server restarts after a crash. What is the significance of the commit entries in a log file?
4. Five replicas A, B, C, D and E use quorum consensus with weights A = 3, B = C = 2, D = E = 1. State the possible values that may be chosen for a write quorum. Give the choice of read quorums for each write quorum, and for each read quorum, how may the quorum be formed and what is the minimum number of servers involved? Which choices allow the service to continue when one of the servers is unavailable?

The answer in Section 3.2.8.

3.2 The Answers

3.2.1 Communication System Fundamentals

1. Assume that two computers are sending 1000 byte packets over a shared channel that operates at 64000 bits per second. After each packet, the receiving machine sends an acknowledgement packet of 10 bytes. Machines alternate in sending files. How long will it take to send a 1 MByte file between the two machines? State your assumptions.
   • Data Packet takes 1000 × 8 / 64000 seconds = 125 milliseconds. Acknowledgement packet takes 10 × 8 / 64000 seconds = 1.25 milliseconds, so packet is 125 milliseconds + 1.25 milliseconds = 126.25 ms. 1 MByte contains approx 1000 packets. So time is approximately 1000 × 126.25 ms ≈ 126 s, assuming negligible latency, and no loss.
2. Ethernet uses a carrier sense multiple access with collision detect scheme. Explain the terms in italics. Even using CSMA/CD, there is still a finite probability of collision between packets. Explain why.
   • Because it takes a finite time for the signal to propagate along the cable. If at time t, a station transmits, then it won't be heard till t + δt at the other stations. If they check and transmit before t + δt then a collision will occur. The longer the wire, the longer the latency, and so the greater possibility of collision. This is also why there is a minimum length of packet, so that the jamming signal can be detected.
3. Switched networks are taking over from shared media networks as Local Area Networks. Why?

   • Matches actual wiring laid out in buildings - most wiring is a star laid out from a dry riser.
   • More secure.
   • More fault tolerant - someone inadvertently unplugging the wire doesn't break the whole network.
   • Each machine can get full bandwidth of wire
   • Can scale the switch hub easier

3.2.2 Devising a Routing Protocol

Answers can be found by looking at RFC2453 RIP Version 2, G. Malkin.

3.2.3 Layering

1. Data Transmission Time = Packet size / Bandwidth = 1000 × 8 / (2 × 10^6) = 4 ms
   Propagation delay = Distance / Signal speed = 3000 / (2 × 10^5) = 15 ms
   Ack Transmission Time = Packet size / Bandwidth = 10 × 8 / (2 × 10^6) = 0.04 ms
   Round Trip Time = 2 × propagation delay + data transmission time + ack transmission time = 2 × 15 + 4 + 0.04 = 34.04 ms
   For 100% utilisation, window size × data transmission time ≥ rtt → window size ≥ 34.04 / 4 = 8.51, ie 9 packets. We ignore processing delay at sender and receiver.
2. Assume single channel, ie one in which packets arrive in order but can be lost. Then maximum window size is 2^N − 1; otherwise, assuming go-back-n, the window may advance into space which hasn't been acknowledged. If the network can delay, duplicate and reorder packets arbitrarily, then the maximum window size depends upon the maximum packet lifetime.
3. There are 2^32 labelled bytes. These can be transmitted in 2^32 / (12.5 × 10^6) = 343 s. Maximum lifetime of a packet must be less than 343 seconds.
4. TCP breaks. Basically a lot of the TCP flows throttle back due to congestion control till they are no longer sending, and those packets sent are often lost, leading to high numbers of retransmissions.

3.2.4 Serialization

1. This is a sample of Sun rpc and its associated data representation, xdr. Sun rpc is described in rfc 1831, whilst the concrete syntax for sun xdr is available in rfc 1832.

Some bits of the answer - the remaining is for you to fill in.

    struct rpc_msg {
        unsigned int xid;
        union switch (msg_type mtype) {
        case CALL:
            call_body cbody;
        case REPLY:
            reply_body rbody;
        } body;
    };

    struct call_body {
        unsigned int rpcvers;   /* must be equal to two (2) */
        unsigned int prog;
        unsigned int vers;
        unsigned int proc;
        opaque_auth cred;
        opaque_auth verf;
        /* procedure specific parameters start here */
    };

    union reply_body switch (reply_stat stat) {
    case MSG_ACCEPTED:
        accepted_reply areply;
    case MSG_DENIED:
        rejected_reply rreply;
    } reply;

    struct accepted_reply {
        opaque_auth verf;
        union switch (accept_stat stat) {
        case SUCCESS:
            opaque results[0];
            /*
             * procedure-specific results start here
             */
        case PROG_MISMATCH:
            struct {
                unsigned int low;
                unsigned int high;
            } mismatch_info;
        default:
            /*
             * Void. Cases include PROG_UNAVAIL, PROC_UNAVAIL,
             * GARBAGE_ARGS, and SYSTEM_ERR.
             */
            void;
        } reply_data;
    };

2. This is Java Serialization. The tricky bits are how to represent Java objects efficiently.

3.2.5 Remote Procedure Call

3.2.6 Names and Distributed File Systems

1. Aliases are useful in allowing service names eg www, ftp, smtp etc to map onto a single host. This works through using a CNAME for the alias name which points back to the canonical name of the host. Disadvantages are:
   (a) Performance hit through the additional indirection
   (b) Administrative problems, since there is no backwards pointer from the canonical host name entry to the aliases. Can be made easier by decent tools.
   (c) Using the alias name where a reverse or PTR lookup is required ensures that the names don't correspond
2. IP addresses have their own portion of the dns tree under in-addr.arpa. This space is managed by ARIN, who keep the netnumbers on the root servers, which have entries for all net numbers. Each net number has an NS entry to point to the particular dns server which maintains the tables for the individual hosts.

3.2.7 Transactions and Concurrency

1. • RT(j), RT(i), WT(j), WT(i), RU(k), WU(i), RU(i), WU(k)
   • RU(k), WU(i), RU(i), WU(k), RT(j), RT(i), WT(j), WT(i)
   • RU(k), RT(j), RT(i), WT(j), WT(i), WU(i), RU(i), WU(k)
2. (a) Impossible under strict locking or timestamping with TST > TSU, possible under occ or timestamping with TST < TSU.
   (b) Possible under locking, occ, timestamping with TST < TSU, impossible under timestamping with TST > TSU.
3. An inconsistent retrieval occurs when a transaction reads values that another transaction has only partially updated, leaving the set of values in an inconsistent state. An example pattern is RT(x), WU(x), WU(y), RT(y).

   (c) Possible under locking, occ, timestamping with TST > TSU, impossible with timestamping with TST < TSU
   (d) Possible under timestamping with TST > TSU, impossible otherwise
   (e) Impossible under locking and timestamping. Possible under optimistic concurrency control. Why? Because the writes don't actually reveal their changed values till the transactions commit.
4. A transaction may do a lot of work without accessing data values before it accesses the shared data. Thus its timestamp may be very old before it even accesses the data, and thus may have a high probability of abortion. A better time to allocate the timestamp is when it first accesses data.

3.2.8 Distributed Transactions

1. Java like things

    TransId beginTransaction() throws TransactionException;
    void commitTransaction(TransId) throws TransactionAbortedException, TransactionException;
    void abortTransaction(TransId) throws TransactionException;
    Value read(TransId, Object) throws TransactionAbortedException, TransactionException;
    void write(TransId, Object, Value) throws TransactionAbortedException, TransactionException;

   Java equivalent may extend Reader and Writer classes to TransactionReader and TransactionWriter, and require transaction ids to be passed in on creation, and for the commit and abort operation, depending on particular semantics of rollback - most systems wait till commitment before performing writes, requiring reads within a transaction potentially to read the tentative values. Flashier type systems which ensure commit and abort operations happen are good topics for research. Recovery managers and logs
2. Coordinator crashes before exiting the prepared state. Network partitions one participant from coordinator - the coordinator eventually times out and issues abort.
3. U Start, U write i, U write j, U commit, T write j, T commit, V write k, V write k, V commit. Recovery works by recovery manager redoing operations in order for those transactions that have committed, and undoing any operations that have aborted.
4. Total votes = 9. In the answers below, the read quorums can be formed by any set of servers whose votes are at least the read quorum.

   W = 5, R >= 5
   W = 6, R >= 4
   W = 7, R >= 3
   W = 8, R >= 2
   W = 9, R >= 1  Writing unavailable when any server is down.

3.3 Sample Exam Question

In this section, I'll introduce the standard approach taken in exam questions for this course. Each question attempts to measure the learning outcomes achieved for a particular subtopic of the course. The first part of the question attempts to measure whether you have learnt the basic knowledge, asking you to reproduce basic definitions or similar from the notes. The second part asks you to apply your knowledge in the solution of some problem, generally in a manner that you will have seen before. The final part attempts to discover whether you can extrapolate from your knowledge to new situations, or meld your knowledge with other areas.

3.3.1 Availability and Ordering

This is a sample question for 20 marks. Attempt the question using the notes, as you would for an exam.

1. Define the following:
   (a) Total Ordering [2 marks]
   (b) Causal Ordering [2 marks]
2. Describe the CBCast Causal Ordering Protocol [6 marks]
3. A process group using CBcast has three members. Process 0 sends two messages with vector timestamps (6, ...) and (7, ...). Process 1 sends two messages with vector timestamps (5, ...) and (5, ...). Process 3 sends two messages with vector timestamps (6, ...) and (6, ...). Show a possible ordering of the transmission and delivery events at these processes. [7 marks]
4. What guarantees on message delivery are required for CBcast? [3 marks]

3.3.2 The answer

1. (a) Total ordering - each process in group receives messages in the same order.
   (b) Causal ordering - before a message is received, each process in group receives all messages received at the sending process before the message is sent.

2. 1 mark for mentioning vector timestamps. 2 marks for describing the vector timestamp as an array of sequence numbers per process. 1 mark for describing how the local vector timestamp is updated. 2 marks for describing the conditions for delivering the message to the application.
3. 2 marks for a diagram as below. 5 marks depending upon how correct the answers are. If there is no diagram, but the answers are perfectly correct, 7 marks would be awarded - however a mistake would result in no marks.

   [Figure: event diagram for the three processes, with each message labelled by its vector timestamp]

4. 3 marks for noting that reliable delivery is required
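The "conditions for delivering the message to the application" in part 2 can be illustrated with a short sketch. It assumes the usual vector-timestamp rule for causal broadcast (a message from sender j is deliverable when it is the next one expected from j and it depends on nothing the receiver has not yet delivered); the function names and the small example vectors are invented for illustration and are not the timestamps from the question.

```python
def deliverable(msg_vt, sender, local_vt):
    """True if a message with timestamp msg_vt from sender can be delivered.

    Assumed rule: the sender's entry must be exactly one more than the
    receiver's, and every other entry must not exceed the receiver's.
    """
    return all(
        m == l + 1 if k == sender else m <= l
        for k, (m, l) in enumerate(zip(msg_vt, local_vt))
    )


def deliver(msg_vt, local_vt):
    """Update the local vector timestamp after delivery (entry-wise max)."""
    return [max(m, l) for m, l in zip(msg_vt, local_vt)]


local = [0, 0, 0]
first = [1, 0, 0]    # sent by process 0
second = [1, 1, 0]   # sent by process 1 after it delivered `first`

early = deliverable(second, 1, local)   # arrives before `first`: must wait
ok_first = deliverable(first, 0, local)
local = deliver(first, local)
ok_second = deliverable(second, 1, local)
```

A message that fails the test is held back and retried after each delivery, which is why the protocol also needs the reliable delivery noted in part 4: a held message can only be released if the messages it depends on eventually arrive.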
