
Networks, Protocols and Distributed Systems

A Slightly Theoretic Crash Course
Haraldur Darri Þorvaldsson

Overview of this Talk
Networks as graphs of queues
Blocking/non-blocking program styles
Reliable/unreliable network channels
Concrete examples: TCP, UDP
MMOs abstracted: shared distributed state
Wider applicability of the network model

Networks as Graphs of Queues
Typical diagram view: some abstractions, a dash of hardware
[Figure: two computers, each with a Network Interface Card, communicating across a network]

Today: the programmer's view/model: queues of messages
send = enqueue, receive = dequeue
[Figure: two nodes connected by message queues (channels)]

The Basic Distributed Systems Model

A bunch of nodes exchanging messages across dedicated channels: pairs of unidirectional queues
Nodes cannot observe or modify other nodes directly
All inter-node effects are through messages

The Life of a Node
A node has a sequence of events, which can be:
1. A computation step (changing the node's state)
Basically: the sequential execution of a program snippet

2. A send event (enqueues a message on a channel)
3. A receive event (dequeues a message from a channel)

A message contains a finite amount of data
For example: a string over some alphabet
Physical messages (packets) are typically 50 to 9000 bytes

No model of time; only sequences of events

The Life of a Channel
When a message is enqueued to a channel:
The channel appends the message to the end of its queue

When a channel is asked to dequeue a message:
It removes and delivers the message at the front of its queue

This describes a perfect, reliable channel
Real networks fail; we mitigate with clever software as much as possible

Example: Transmission Control Protocol (TCP)
Delivers the correct byte stream (if anything)
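The perfect channel described above is just a FIFO queue. A minimal sketch in Python, where the class name `Channel` and its method names are illustrative choices and not part of the talk:

```python
from collections import deque


class Channel:
    """A perfect, reliable, unidirectional channel: a FIFO queue of messages."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, msg):
        # A send event appends the message to the end of the queue.
        self._queue.append(msg)

    def dequeue(self):
        # A receive event removes and delivers the message at the front.
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


a_to_b = Channel()
for msg in ["first", "second", "third"]:
    a_to_b.enqueue(msg)

delivered = [a_to_b.dequeue() for _ in range(3)]
print(delivered)  # ['first', 'second', 'third']
```

Nothing is lost, duplicated or reordered here; the later slides relax exactly these guarantees.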

Distributed Algorithm/Protocol
Simple Example: Load Balancing
Two nodes execute the following pseudocode:
    myLoad = ComputeCurrentLoad()
    send(myLoad)
    remoteLoad = receive()
    halfLoad = (myLoad + remoteLoad) / 2
    if myLoad > halfLoad:
        hand off (myLoad - halfLoad) units of work
    else if myLoad < halfLoad:
        take on (halfLoad - myLoad) units of work

Computation events: every line except send() and receive()

Networking events: send() and receive()
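The load-balancing pseudocode can be run as two threads connected by a pair of queues. This is a sketch under assumptions: Python's `queue.Queue` stands in for a reliable channel, the loads 7 and 3 are fixed inputs instead of a `ComputeCurrentLoad()` call, and the function name `balance` and the `result` dictionary are made up for the example.

```python
import queue
import threading


def balance(my_load, send_q, recv_q, result, name):
    send_q.put(my_load)                  # send event
    remote_load = recv_q.get()           # receive event; blocks until a message arrives
    half_load = (my_load + remote_load) / 2
    if my_load > half_load:
        result[name] = ("hand off", my_load - half_load)
    elif my_load < half_load:
        result[name] = ("take on", half_load - my_load)
    else:
        result[name] = ("nothing", 0)


# One unidirectional queue per direction, as in the basic model.
a_to_b, b_to_a = queue.Queue(), queue.Queue()
result = {}
node_a = threading.Thread(target=balance, args=(7, a_to_b, b_to_a, result, "A"))
node_b = threading.Thread(target=balance, args=(3, b_to_a, a_to_b, result, "B"))
node_a.start()
node_b.start()
node_a.join()
node_b.join()

print(result["A"])  # ('hand off', 2.0)
print(result["B"])  # ('take on', 2.0)
```

Both nodes send before they receive, so neither can get stuck waiting for the other's first message.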

Animation of Algorithm Instance
Node A: myLoad = 7, remoteLoad = 3, myDiff = 2
Node B: myLoad = 3, remoteLoad = 7, myDiff = 2
The rest of the rebalancing takes place, somehow

Timing Diagram of Algorithm Instance
Show the order of events (or time) at each node as a long horizontal or vertical arrow
After all, nodes are independent/concurrent

Draw an arrow from each send event to its corresponding receive event
A: compute 7, send 7, receive 3, compute diff
B: compute 3, send 3, receive 7, compute diff
Callouts on A's timeline: What was A doing here? ... or here?
Here A is working. Here A is waiting (for a message).

Timing Diagram of Algorithm Instance
Show computation events as thick bars
Absence of a bar means waiting for a message
Questions:
1. How does a node's program wait?
2. What happens if a message never arrives?
A: compute 7, send 7, receive 3, compute diff
B: compute 3, send 3, receive 7, compute diff

Answers, for our example
Our example's receive() call blocks
If the input channel is empty, node execution suspends until the remote node enqueues a message

Problem: if the remote node never enqueues a message, we'll wait forever!
A new, exciting way for programs to run forever (in addition to infinite loops in a sequential program)
We'll say more about failures later

Blocking of Sends
We could model channels as having infinite space for messages (even more perfect!)
But we'll be more realistic and say: channels have a finite capacity.
Hence, send() can also block, when the channel is full, with no space for additional messages
Execution resumes once the remote node dequeues a message, freeing up space in the queue
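A bounded `queue.Queue` from Python's standard library behaves like the finite-capacity channel described here. In this sketch the capacity of 2 and the timeout of 0.1 seconds are arbitrary choices; the timeout exists only so the example terminates instead of blocking forever.

```python
import queue

channel = queue.Queue(maxsize=2)    # finite capacity: two messages
channel.put("m1")
channel.put("m2")

try:
    # A plain put() would block here indefinitely, since nobody dequeues.
    channel.put("m3", timeout=0.1)
    blocked = False
except queue.Full:
    blocked = True

first = channel.get()               # the remote node dequeues, freeing space
channel.put("m3", timeout=0.1)      # now the send completes immediately

print(blocked, first, channel.qsize())  # True m1 2
```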

To Block, or Not to Block
Pro: blocking is relatively simple/easy
Sends and receives look like computation events; the program looks a lot like a sequential program
Terminology: the execution appears synchronous
System execution is deterministic, given the start state

Waiting is implicit: programs don't check the channels but proceed as if they're always in a ready state

Con: can limit performance and interaction styles
Suspending/resuming execution carries costs
Strict request/response messaging can be restrictive

Non-Blocking Alternative: Polling
Add a new non-blocking event: receive_if()
Returns a message if the queue is non-empty, or else a queue-empty indicator
The node can go do something else when the queue is empty
A new send_if() event may return queue-full

The program acknowledges time, is asynchronous
The system is now inherently non-deterministic

Permits one node to handle multiple queues
Poll them in turn, handle those that are ready
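The two polling events map directly onto the non-blocking calls of Python's `queue.Queue`. A minimal sketch, assuming sentinel objects `QUEUE_EMPTY` and `QUEUE_FULL` (names invented here) for the two indicators and three hypothetical input channels:

```python
import queue

QUEUE_EMPTY = object()   # sentinel: nothing to receive right now
QUEUE_FULL = object()    # sentinel: no room to send right now


def receive_if(channel):
    """Return a message if one is waiting, else QUEUE_EMPTY. Never blocks."""
    try:
        return channel.get_nowait()
    except queue.Empty:
        return QUEUE_EMPTY


def send_if(channel, msg):
    """Enqueue msg if there is room, else return QUEUE_FULL. Never blocks."""
    try:
        channel.put_nowait(msg)
        return None
    except queue.Full:
        return QUEUE_FULL


channels = {name: queue.Queue(maxsize=4) for name in ("1A", "2A", "3A")}
send_if(channels["2A"], "hello from 2")
send_if(channels["3A"], "hello from 3")

# One node polls several queues in turn and handles those that are ready.
handled = []
for name in sorted(channels):
    msg = receive_if(channels[name])
    if msg is not QUEUE_EMPTY:
        handled.append((name, msg))

print(handled)  # [('2A', 'hello from 2'), ('3A', 'hello from 3')]
```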

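Example: Publish/Subscribe w/ Polling
[Figure: clients and a server A, connected by channel pairs such as channel 1A (client 1 to server) and channel A1 (server to client 1); each node holds a copy of a variable, e.g. a=1]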

Client n Code for Publish/Subscribe
    do forever:
        msg = receive_if(An)
        if msg ≠ queue-empty:
            var, val = unpack contents of msg
            update variable var with value val
        compute something for a while
        for each variable var I want to set to value val:
            msg = pack var and val into a message
            if send_if(nA, msg) = queue-full:
                exit
Alternative to exit: wait a little while, then try again

Server Code for Publish/Subscribe
    sndChannels = {A1, A2, A3}
    do forever:
        for rch in {1A, 2A, 3A}:
            msg = receive_if(rch)
            if msg ≠ queue-empty:
                var, val = unpack contents of msg
                update my variable var with value val
                for sch in sndChannels:
                    if send_if(sch, msg) = queue-full:
                        remove sch from sndChannels

Real Network Channels Fail!
We can model such unreliable channels:
Asked to enqueue, the channel might:
Do nothing at all (drop messages)
Note: same as send_if() with a full channel

Append a different message (corrupt messages)

Asked to dequeue, the channel might:
Remove and deliver a different message (reorder messages)
Deliver a message but not remove it (duplicate messages)

Example: the Internet's User Datagram Protocol (UDP)
Message drops, reorders, duplicates
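The failure modes listed above can be added to the queue model with a few random choices. This is an illustrative sketch only: the class name `UnreliableChannel`, the probabilities and the seed are assumptions, corruption is left out, and real networks do not fail with fixed independent probabilities.

```python
import random
from collections import deque


class UnreliableChannel:
    """A queue that may drop, reorder or duplicate messages (UDP-like)."""

    def __init__(self, seed=0, p_drop=0.2, p_dup=0.2, p_reorder=0.2):
        self._queue = deque()
        self._rng = random.Random(seed)
        self.p_drop, self.p_dup, self.p_reorder = p_drop, p_dup, p_reorder

    def enqueue(self, msg):
        if self._rng.random() < self.p_drop:
            return                       # drop: do nothing at all
        self._queue.append(msg)

    def dequeue(self):
        if not self._queue:
            return None                  # queue-empty indicator
        index = 0
        if len(self._queue) > 1 and self._rng.random() < self.p_reorder:
            index = self._rng.randrange(len(self._queue))   # reorder
        msg = self._queue[index]
        if self._rng.random() >= self.p_dup:
            del self._queue[index]       # normal case: remove what was delivered
        return msg                       # otherwise it stays queued: a duplicate


channel = UnreliableChannel(seed=42)
for n in range(10):
    channel.enqueue(n)

received = []
while (msg := channel.dequeue()) is not None:
    received.append(msg)
print(received)   # some subset of 0..9, possibly out of order, possibly with repeats
```

Setting all three probabilities to zero turns it back into the perfect channel.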

Queue Model of UDP/IP
Each network interface of an Internet device is identified by a globally unique IP address
A 32-bit integer (in IPv4), e.g. 82D0F047 hexadecimal
Written as dot-separated decimals, from most to least significant byte, e.g. 130.208.240.71

A UDP channel comprises an IP address and a UDP port: a 16-bit integer
Ports below 1024 are allotted by convention to well-known services, such as DNS.
My main DNS server is at 46.22.96.35:53
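The conversion between the hexadecimal integer and the dotted notation can be checked with Python's standard `ipaddress` module, or by shifting out one byte at a time:

```python
import ipaddress

address = ipaddress.IPv4Address(0x82D0F047)
print(address)              # 130.208.240.71
print(hex(int(address)))    # 0x82d0f047

# The same thing by hand: most significant byte first.
octets = [(0x82D0F047 >> shift) & 0xFF for shift in (24, 16, 8, 0)]
print(octets)               # [130, 208, 240, 71]
```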

Sending/Receiving UDP Messages
UDP is connectionless: you send a message to a channel anytime (via the OS's APIs, e.g. socket)
But you have no idea if it gets delivered or not
Messages can be up to ~64 KB in size, but prefer <1500 bytes, or a few KB at most

To receive UDP: bind as a listener on some port P (via the OS's API, e.g. socket)
You will receive (a subset of the) UDP messages sent to channel: your IP address:P
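A minimal sketch of sending and receiving one UDP message with Python's `socket` module, on the loopback interface. The payload, the 1500-byte buffer and the one-second timeout are arbitrary choices for the example; binding to port 0 asks the OS to pick any free port.

```python
import socket

# Receiver: bind as a listener on some port of the loopback address.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))         # port 0: let the OS choose a free port
receiver.settimeout(1.0)                # do not wait forever for a lost message
address = receiver.getsockname()        # the (IP address, port) of this channel

# Sender: no connection setup, just send to the channel.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"hello", address)

try:
    data, source = receiver.recvfrom(1500)
except socket.timeout:
    data, source = None, None           # UDP gives no delivery guarantee
finally:
    sender.close()
    receiver.close()

print(data)
```

On loopback the message will normally arrive, but the code still has to handle the case where it does not.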

Example: Reliable Communication
Want to exchange an ordered sequence of messages over an unreliable channel that drops, duplicates and reorders messages
This is what TCP provides, on top of the unreliable Internet Protocol (IP) packet delivery service
UDP is a very thin layer on top of IP

Reliable Messaging: Sender Protocol
What's a good value for "a little while"?

    global numSent = 0
    // channel now represents both send and recv queues
    function reliable_send(msg, channel):
        numSent = numSent + 1
        do forever:
            send_if(channel, (numSent, msg))
            wait for a little while
            reply = receive_if(channel)
            if reply ≠ queue-empty:
                // use fresh names so msg is not overwritten before a retransmit
                ackNum, tag = unpack reply
                if tag = ACK and ackNum = numSent:
                    return

Reliable Messaging: Receiver Protocol
    global numReceived = 0
    // channel now represents both send and recv queues
    function reliable_receive(channel):
        do forever:
            packet = receive_if(channel)
            if packet ≠ queue-empty:
                packetNum, msg = unpack packet
                if packetNum = numReceived + 1:
                    numReceived = numReceived + 1
                    send_if(channel, (numReceived, ACK))
                    return msg
                // duplicate or stale packet: repeat the last ACK
                send_if(channel, (numReceived, ACK))
            wait a little while

Let's Check our Protocol
The channel is our adversary: it misbehaves and tries to confuse us. Try the protocol with:
Dropped, reordered, duplicate messages
Dropped, reordered, duplicate ACKs

Below is the failure-free, happy case:
A: send(1,Bla), receive(1,ACK)
B: receive(1,Bla), send(1,ACK)
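One way to check the protocol is to run it against a simulated adversarial channel. The sketch below is a simplification, not the protocol code from the slides: sender and receiver are interleaved in a single loop instead of running as separate nodes that "wait a little while", and the `LossyChannel` class, its probabilities, the seed and the step limit are all assumptions made for the test.

```python
import random
from collections import deque

ACK = "ACK"


class LossyChannel:
    """Hypothetical test channel: drops, duplicates and reorders at random."""

    def __init__(self, rng, p_drop=0.3, p_dup=0.3, p_reorder=0.3):
        self.queue, self.rng = deque(), rng
        self.p_drop, self.p_dup, self.p_reorder = p_drop, p_dup, p_reorder

    def send_if(self, packet):
        if self.rng.random() < self.p_drop:
            return                                   # dropped
        self.queue.append(packet)
        if self.rng.random() < self.p_dup:
            self.queue.append(packet)                # duplicated

    def receive_if(self):
        if not self.queue:
            return None                              # queue-empty
        if self.rng.random() < self.p_reorder:
            self.queue.rotate(-self.rng.randrange(len(self.queue)))  # reordered
        return self.queue.popleft()


def transfer(messages, seed=1, max_steps=10_000):
    rng = random.Random(seed)
    data, acks = LossyChannel(rng), LossyChannel(rng)
    num_sent, num_received, delivered = 1, 0, []
    for _ in range(max_steps):
        if num_sent > len(messages):
            return delivered                         # every message was ACKed
        # Sender: (re)transmit the current message.
        data.send_if((num_sent, messages[num_sent - 1]))
        # Receiver: accept only the next expected number, always ACK the latest.
        packet = data.receive_if()
        if packet is not None:
            packet_num, msg = packet
            if packet_num == num_received + 1:
                num_received += 1
                delivered.append(msg)
            acks.send_if((num_received, ACK))
        # Sender: move on only when the matching ACK arrives.
        if acks.receive_if() == (num_sent, ACK):
            num_sent += 1
    raise RuntimeError("gave up after max_steps")


print(transfer(["Bla", "Bli", "Blo"]))  # ['Bla', 'Bli', 'Blo']
```

Each message is delivered exactly once and in order because the receiver accepts only packet number numReceived + 1 and the sender ignores every ACK except the one for its current packet.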

Take-home Points
Designing robust network protocols is difficult
Have to anticipate and handle every type of failure that can occur, at any stage in the protocol
The Message Queue/Event model can help a lot

Use existing building blocks whenever possible
For example: UDP is rarely beneficial. Better to use a reliable transport, like TCP
You'll end up re-implementing TCP anyway
Possible exception: fast-paced networked games

TCP vs. Our Toy Protocol
Transmits byte sequences, not discrete messages
You send a byte buffer; TCP chops it up into segments (packets) any way it pleases, and ACKs byte sequence positions.
You must provide message framing, e.g. prepend the length of your messages to their data

Buffers sent and received data and has multiple segments in flight on the network at the same time
Message-by-message ping-pong would be way too slow

Performs flow control and congestion avoidance
Adjusts transmission rate to current network bandwidth and shares bandwidth fairly with other connections
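Length-prefix framing, as suggested above, can be sketched with Python's `struct` module. The 4-byte big-endian length header and the function names `frame` and `unframe` are choices made for this example; any agreed-upon header format works.

```python
import struct


def frame(message: bytes) -> bytes:
    """Prepend a 4-byte big-endian length to the message."""
    return struct.pack("!I", len(message)) + message


def unframe(buffer: bytes):
    """Split a received byte buffer into (complete messages, leftover bytes)."""
    messages = []
    while len(buffer) >= 4:
        (length,) = struct.unpack("!I", buffer[:4])
        if len(buffer) < 4 + length:
            break                        # message not fully received yet
        messages.append(buffer[4:4 + length])
        buffer = buffer[4 + length:]
    return messages, buffer


stream = frame(b"hello") + frame(b"world!")

# TCP may deliver the byte stream in arbitrary pieces, e.g. 7 bytes first.
first, rest = unframe(stream[:7])
second, rest = unframe(rest + stream[7:])

print(first)   # []
print(second)  # [b'hello', b'world!']
```

The receiver keeps the leftover bytes and retries once more data arrives, so message boundaries survive however TCP segments the stream.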

Queue Model of TCP/IP
TCP is connection oriented: you establish a connection with a remote node before exchanging messages with it
To agree on initial sequence numbers, etc.

We can model this as creating a new channel
We thought of UDP channels as pre-existing

A TCP channel is globally/uniquely identified by two IPAddress:Port pairs
The IP addresses of the two nodes involved

Queue Model of TCP/IP, Continued
Node 1.2.3.4 uses port 1. This node is playing the role of a server, accepting connections at a well-known port (e.g. port 80 for HTTP, the World Wide Web protocol)
Node 2.2.2.2 uses port 2; node 3.3.3.3 uses ports 2 and 3. These nodes are playing the roles of clients connecting to the server. They choose their ports at will.
Channels, each a pair of unidirectional queues:
2.2.2.2:2 to 1.2.3.4:1 and 1.2.3.4:1 to 2.2.2.2:2
3.3.3.3:2 to 1.2.3.4:1 and 1.2.3.4:1 to 3.3.3.3:2
3.3.3.3:3 to 1.2.3.4:1 and 1.2.3.4:1 to 3.3.3.3:3

Queues Are Real!
Networking hardware/software is full of queues
[Figure: a computer and a smartphone connected through two routers]

Modeling Multi-User Games
Multiple nodes hold a copy of some state
We want them to behave as if there was a single shared instance of the state
They can't really, they can only exchange messages

Nodes that (propose to) mutate state must notify other nodes, which update their copies
Problem: nodes can diverge, breaking the illusion
They can mutate differently or in a different order

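Replicated System Problems
First-order problem: conflicting updates
[Figure: three nodes each hold a copy of x and y (initially x = 2, y = 0). One node issues the update x←3 and another issues x←5; the copies apply them in different orders and end up with different values of x]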

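Near-Universal Solution: Master/Slave
One node is the master for updates; the other, slave nodes forward their updates to the master
[Figure: slaves forward their updates x←3 and x←5 to the master, which applies them in one order and distributes them to every node, so all copies of x and y end up equal]

In essence: we ensure everyone's receive queue looks the same as the master's queue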
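The idea that every node sees the master's queue can be sketched in a few lines. The class names, the `submit` method and the initial state x = 2, y = 0 are assumptions for illustration; message transport is reduced to appending to each replica's inbox.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.state = {"x": 2, "y": 0}    # assumed initial state
        self.inbox = deque()             # receive queue, filled by the master

    def apply_pending(self):
        while self.inbox:
            var, val = self.inbox.popleft()
            self.state[var] = val


class Master(Replica):
    def __init__(self, replicas):
        super().__init__()
        self.replicas = replicas

    def submit(self, update):
        # The master fixes one global order: apply, then forward to everyone.
        var, val = update
        self.state[var] = val
        for replica in self.replicas:
            replica.inbox.append(update)


replicas = [Replica(), Replica()]
master = Master(replicas)

# Two conflicting updates reach the master in some order; that order wins everywhere.
master.submit(("x", 3))
master.submit(("x", 5))
for replica in replicas:
    replica.apply_pending()

print(master.state, [r.state for r in replicas])
```

Because every replica applies the same updates in the same order, all copies converge on x = 5.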

So far, so good, but...
What we've shown is basically a distributed cache, where slaves are eventually consistent
Master is authoritative. It is in a position to authenticate, modify or reject changes
MMOs usually have a permanent, trusted master (operated by the game corp) since end users cheat!

Problem: a slave's decision to mutate may have been based on stale (old, obsolete) data
For example: shot a dude who had moved away

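Inconsistent Execution/Race Condition
The state is shared but the simulation is not
[Figure: the master applies and distributes x←5, so x goes from 2 to 5. A slave's update x←3, computed while its copy still showed x = 2, arrives afterwards]
When the node computed update x←3, x had value 2. Does the update still make sense?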

Solution 1: DB-Style Distributed Locking
1. Slave sends master a request to lock the set of variables it wants to read and/or update
2. The master acknowledges the request, if no other node has any of the variables locked
Otherwise: rejects or delays the lock request

3. Slave then executes the event and sends the update
No inconsistency, others can't modify the vars

4. Master updates and unlocks

Problems with Locking
Low performance: slaves spend at least a message round trip waiting, for each update
This alone rules out locking for most games

Fault tolerance: if a slave crashes or loses its connection, variables are left in a locked state
Deadlocks: lock requests can form circular wait-for dependencies
Not a biggie, the master can detect such cycles and break them by rejecting one of the lock requests

Solution 2: Optimistic Concurrency Control
Give the master enough information to be able to reject updates based on (possibly) stale data
Slave sends with updates the read set of variables read by the event's execution, as well as their values
Master checks if all of an update's read variables still have these values. If not, it rejects the update
Alternative: master tracks which updates each slave has received and rejects an update if any read-set variable has changed since (disregarding values)
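The value-based check on the master can be sketched as follows. The class and method names and the initial state are hypothetical; the example replays the x←5 then x←3 conflict, where the second update was computed while x was still 2.

```python
class Master:
    def __init__(self, state):
        self.state = dict(state)

    def try_update(self, read_set, writes):
        """Apply writes only if every value in read_set is still current."""
        for var, seen_value in read_set.items():
            if self.state[var] != seen_value:
                return False             # update was based on stale data: reject
        self.state.update(writes)
        return True


master = Master({"x": 2, "y": 0})

# Both slaves read x = 2 before computing their updates.
ok_first = master.try_update(read_set={"x": 2}, writes={"x": 5})
ok_second = master.try_update(read_set={"x": 2}, writes={"x": 3})

print(ok_first, ok_second, master.state)  # True False {'x': 5, 'y': 0}
```

The rejected slave has to re-run its event against the fresh value of x, which is where the livelock risk under high contention comes from.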

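Optimistic Conc. Ctrl in Action
Server verifies updates were made assuming correct variable values
[Figure: the same scenario, but the slave's update is sent as x←3 (x=2), i.e. together with the value of x it read. The server has already applied x←5, so it can tell that the update assumed a stale value]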

Optimistic Concurrency Control Pro & Con
Pro: when there are no conflicts, there is no waiting and no additional delay
Con: read sets can be large, eat network bandwidth
Con: high contention (many conflicts) may cause livelock: some slave keeps losing out
For example: a slave with high network latency
Can be hard to ensure fairness for all nodes

Solution 3: Share Execution, not Updates
Instead of sending state mutations, slaves send user input (mouse/keyboard) to the master
Master executes the total simulation and distributes the resulting state updates to slaves

Shared Execution Pro & Con
Pro: works well, this is essentially how most quick-paced games do it (FPSes, e.g.)
Games no longer treated as a database problem

Con: centralized master limits scalability
Con: large delay from mouse/keyboard action to effect on screen (e.g. turning head)
On the order of a network round trip, 10s of ms
Makes players sick/drives them crazy

Solutions to Shared Execution Delay
Prediction: slaves also execute game logic, assuming immediate effect of the user's input
Predict how the player's character moves, predict how other users' characters will move.

When the master sends actual/authoritative updates, slaves must reconcile their local version using the updates, converge to the master
Shift characters towards the correct position, e.g.

This is not a fully solved problem
FPS engines (Quake, Unreal) have finely hand-tuned, fairly ad hoc solutions
Separate predictions for character running, jumping, gunshots, flying grenades
Heavily optimized/compressed encoding of update packets, to conserve bandwidth

Can be solved generally through determinism
Slaves roll back their state to the time of the new server update and then replay all events back to now
As you figure it out, use the Queue, Luke!
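Rollback and replay can be illustrated with a toy one-dimensional game. Everything here is an assumption made for the sketch: the state is a single position, time is counted in ticks, an input is a signed move, and the `step` function stands in for deterministic game logic.

```python
def step(position, move):
    """Deterministic game logic: same state and input always give the same result."""
    return position + move


class PredictingClient:
    def __init__(self):
        self.position = 0
        self.tick = 0
        self.inputs = {}                 # tick -> local input applied at that tick

    def local_input(self, move):
        self.tick += 1
        self.inputs[self.tick] = move
        self.position = step(self.position, move)   # predict immediately

    def server_update(self, tick, position):
        # Roll back to the authoritative state at `tick` ...
        self.position = position
        # ... then replay every later local input back to now.
        for t in range(tick + 1, self.tick + 1):
            self.position = step(self.position, self.inputs[t])
        # Inputs up to `tick` are confirmed and can be forgotten.
        self.inputs = {t: m for t, m in self.inputs.items() if t > tick}


client = PredictingClient()
for move in (1, 1, 1):
    client.local_input(move)
predicted = client.position              # 3: all three moves assumed to succeed

# The server reports that the move at tick 1 was blocked (position stayed 0).
client.server_update(tick=1, position=0)
print(predicted, client.position)        # 3 2
```

The replay only gives the right answer because `step` is deterministic; any randomness or timing dependence in the game logic would make client and master diverge again.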

Distributed Systems Everywhere!
Multicore machines (with NUMA)
Fast, failure-free networks (memory, PCI Express)
[Figure: cores connected to memory]

Distributed Systems Everywhere!
Shared-memory threads: can model as nodes
Memory accesses are message passing
Implemented by memory controller hardware
[Figure: timing diagram between nodes A and B with the events r1 = load(10001234), r2 = r1 + 1, store(10001234: 155) and return(10001234: ??)]

Summary
Modeling distributed systems as nodes exchanging messages via queues is very useful
This is how academics do it, for their proofs!

Shared state is the canonical hard problem for distributed systems
We've seen the tip of the iceberg today. Add partial failures, partial subscriptions, partitioned servers, dynamic migration

MMOs are special, but not all that special
Yet to successfully apply knowledge from DB/Distr
