IPDA Tutorial
A Slightly Theoretic Crash Course
Haraldur Darri Þorvaldsson
Overview of this Talk
Networks as graphs of queues
Blocking/non-blocking program styles
Reliable/unreliable network channels
Concrete examples: TCP, UDP
MMOs abstracted: shared distributed state
Wider applicability of the network model
Networks as Graphs of Queues
Typical diagram view: some abstractions, a dash of hardware
[Diagram: two computers connected through a communication network]
Today: the programmer's view/model: queues of messages
send/enqueue, receive/dequeue
[Diagram: nodes connected by message queues (channels)]
The Basic Distributed Systems Model
The Life of a Node
A node has a sequence of events, which can be:
1. A computation step (changing the node's state)
Basically: the sequential execution of a program snippet
A message contains a finite amount of data
For example: a string over some alphabet
Physical messages (packets) are typically 50 to 9000 bytes
No model of time; only sequences of events
The Life of a Channel
When a message is enqueued to a channel:
The channel appends the message to the end of its queue
When a channel is asked to dequeue a message:
The channel removes and delivers the message at the front of its queue
This describes a perfect, reliable channel
Real networks fail; we mitigate with clever software as much as possible
Example: Transmission Control Protocol (TCP)
Delivers the correct byte stream (if anything)
Distributed Algorithm/Protocol
Simple Example: Load Balancing
Two nodes execute the following pseudocode:
myLoad = ComputeCurrentLoad()
send(myLoad)
remoteLoad = receive()
halfLoad = (myLoad + remoteLoad) / 2
if myLoad > halfLoad:
    hand off (myLoad − halfLoad) units of work
else if myLoad < halfLoad:
    take on (halfLoad − myLoad) units of work
Legend: computation events, networking events
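The load-balancing exchange above can be tried out with two threads standing in for the two nodes and two in-memory queues standing in for the channels. This is only a sketch: the names `balance` and `run_pair` are made up for illustration, and the actual transfer of work is left out.

```python
import queue
import threading

def balance(my_load, send_q, recv_q, result, name):
    """Run one node's side of the exchange and record the units to move."""
    send_q.put(my_load)                  # networking event: send(myLoad)
    remote_load = recv_q.get()           # networking event: blocking receive()
    half_load = (my_load + remote_load) / 2
    # Positive: hand off that many units. Negative: take on that many units.
    result[name] = my_load - half_load

def run_pair(load_a, load_b):
    """Start nodes A and B with the given loads and wait for both to finish."""
    a_to_b, b_to_a = queue.Queue(), queue.Queue()
    result = {}
    nodes = [
        threading.Thread(target=balance, args=(load_a, a_to_b, b_to_a, result, "A")),
        threading.Thread(target=balance, args=(load_b, b_to_a, a_to_b, result, "B")),
    ]
    for node in nodes:
        node.start()
    for node in nodes:
        node.join()
    return result

outcome = run_pair(7, 3)   # A should hand off 2 units, B should take on 2
```

Both nodes send before they receive, so neither waits on an empty channel forever.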
Animation of Algorithm Instance
[Animation: one node has load 7 and the other load 3; they exchange loads, each computes a difference of 2, and the rest of the rebalancing takes place, somehow]
Timing Diagram of Algorithm Instance
Show the order of events (or time) at each node as a long horizontal or vertical arrow
After all, nodes are independent/concurrent
Draw an arrow from each send event to its corresponding receive event
[Timing diagram]
A: compute 7, send 7, receive 3, compute diff
B: compute 3, send 3, receive 7, compute diff
What was A doing here? ...or here?
Here A is working. Here A is waiting (for a message).
Timing Diagram of Algorithm Instance
Show computation events as thick bars
Absence of a bar means waiting for a message
Questions:
1. How does a node's program wait?
2. What happens if a message never arrives?
[Timing diagram]
A: compute 7, send 7, receive 3, compute diff
B: compute 3, send 3, receive 7, compute diff
Answers, for our example
Our example's receive() call blocks
If the input channel is empty, node execution suspends until the remote node enqueues a message
Problem: if the remote node never enqueues a message, we'll wait forever!
A new, exciting way for programs to run forever (in addition to infinite loops in sequential programs)
We'll say more about failures later
Blocking of Sends
We could model channels as having infinite space for messages (even more perfect!)
But we'll be more realistic and say: channels have a finite capacity.
Hence, send() can also block, when the channel is full, with no space for additional messages
Execution resumes once the remote node dequeues a message, freeing up space in the queue
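A finite-capacity channel can be approximated with Python's standard `queue.Queue` and a `maxsize`. The sketch below assumes a capacity of two messages and a receiver that is busy for a tenth of a second; both numbers are arbitrary.

```python
import queue
import threading
import time

channel = queue.Queue(maxsize=2)   # finite capacity: room for two messages
channel.put("m1")
channel.put("m2")

# A non-blocking send on a full channel fails immediately.
try:
    channel.put_nowait("m3")
    full_detected = False
except queue.Full:
    full_detected = True

def slow_receiver():
    time.sleep(0.1)                # the remote node is busy for a while
    channel.get()                  # dequeuing "m1" frees one slot

threading.Thread(target=slow_receiver).start()

start = time.monotonic()
channel.put("m3")                  # blocking send: suspends until a slot is free
blocked_for = time.monotonic() - start
```

The blocking `put` returns only after the receiver has dequeued a message, which is the behaviour described for send() on a full channel.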
To Block, or Not to Block
Pro: blocking is relatively simple/easy
Sends and receives look like computation events; the program looks a lot like a sequential program
Terminology: the execution appears synchronous
System execution is deterministic, given the start state
Waiting is implicit: programs don't check whether channels are ready but proceed as if they're always in a ready state
Con: can limit performance and interaction styles
Suspending/resuming execution carries costs
Strict request/response messaging can be restrictive
Non-Blocking Alternative: Polling
Add a new non-blocking event: receive_if()
Returns a message if the queue is non-empty, or else a queue-empty indicator
The node can go do something else when the queue is empty
A new send_if() event may return queue-full
The program acknowledges time; it is asynchronous
The system is now inherently non-deterministic
Permits one node to handle multiple queues
Poll them in turn, handle those that are ready
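One way to picture receive_if() and send_if() is as thin wrappers over the non-blocking calls of Python's `queue.Queue`. The sentinel `QUEUE_EMPTY`, the boolean return of `send_if`, and the helper `poll_once` are assumptions made for this sketch.

```python
import queue

QUEUE_EMPTY = object()   # sentinel standing in for the "queue empty" indicator

def receive_if(channel):
    """Return the next message, or QUEUE_EMPTY, without ever waiting."""
    try:
        return channel.get_nowait()
    except queue.Empty:
        return QUEUE_EMPTY

def send_if(channel, msg):
    """Enqueue msg if there is room; return False to signal 'queue full'."""
    try:
        channel.put_nowait(msg)
        return True
    except queue.Full:
        return False

def poll_once(channels):
    """Poll each named channel in turn and collect the messages that were ready."""
    ready = []
    for name, channel in channels.items():
        msg = receive_if(channel)
        if msg is not QUEUE_EMPTY:
            ready.append((name, msg))
    return ready

channels = {"1A": queue.Queue(maxsize=1), "2A": queue.Queue(maxsize=1)}
send_if(channels["2A"], "a=1")     # only channel 2A has a message waiting
```

Calling `poll_once(channels)` here returns the single ready message from channel 2A and skips the empty channel 1A without waiting on it.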
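Example: Publish/Subscribe with Polling
[Diagram: clients connected to server A by pairs of channels, e.g. channel 1A (client 1 to server) and channel A1 (server to client 1); an update a=1 sent by one client is forwarded by the server to the clients]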
Client n Code for Publish/Subscribe
do forever:
    msg = receive_if(An)
    if msg ≠ queue-empty:
        var, val = unpack contents of msg
        update variable var with value val
    compute something for a while
    for each variable var I want to set to value val:
        msg = pack var and val into a message
        if send_if(nA, msg) = queue-full:
            exit
Alternative to exiting: wait a little while, then try again
Server Code for Publish/Subscribe
sndChannels = {A1, A2, A3}
do forever:
    for rch in {1A, 2A, 3A}:
        msg = receive_if(rch)
        if msg ≠ queue-empty:
            var, val = unpack contents of msg
            update my variable var with value val
            for sch in sndChannels:
                if send_if(sch, msg) = queue-full:
                    remove sch from sndChannels
Real Network Channels Fail!
We can model such unreliable channels:
Asked to enqueue, the channel might:
Do nothing at all (drop messages)
Note: same as send_if() with a full channel
Asked to dequeue, the channel might:
Remove and deliver a different message (reorder messages)
Deliver a message but not remove it (duplicate messages)
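The three misbehaviours listed above can be simulated with a small queue wrapper. The class name `UnreliableChannel`, the probabilities, and the seeded random generator are all assumptions chosen for illustration.

```python
import random
from collections import deque

class UnreliableChannel:
    """A queue that may drop on enqueue and reorder or duplicate on dequeue."""

    def __init__(self, drop=0.2, reorder=0.2, duplicate=0.2, seed=0):
        self.queue = deque()
        self.drop, self.reorder, self.duplicate = drop, reorder, duplicate
        self.rng = random.Random(seed)   # seeded so runs are repeatable

    def enqueue(self, msg):
        if self.rng.random() < self.drop:
            return                       # drop: do nothing at all
        self.queue.append(msg)

    def dequeue(self):
        if not self.queue:
            return None                  # nothing to deliver
        index = 0
        if len(self.queue) > 1 and self.rng.random() < self.reorder:
            # reorder: deliver some later message instead of the front one
            index = self.rng.randrange(1, len(self.queue))
        msg = self.queue[index]
        if self.rng.random() >= self.duplicate:
            del self.queue[index]        # normal case: remove what was delivered
        return msg                       # otherwise it stays queued: a duplicate
```

Setting all three probabilities to zero gives back the perfect channel, which makes it easy to test a protocol first without failures and then with them.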
Queue Model of UDP/IP
Each network interface of an Internet device is identified by a globally unique IP address
A 32-bit integer, e.g. 82D0F047 hexadecimal
Written as dot-separated decimals, from most to least significant byte, e.g. 130.208.240.71
A UDP channel comprises an IP address and a UDP port: a 16-bit integer
Ports below 1024 are allotted by convention to well-known services, such as DNS.
My main DNS server is at 46.22.96.35:53
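The relation between the 32-bit integer and the dotted-decimal notation can be checked with Python's standard `ipaddress` module. The variable names are illustrative only.

```python
import ipaddress

# From the 32-bit integer to dot-separated decimals.
dotted = str(ipaddress.IPv4Address(0x82D0F047))

# And back again, from text to the integer.
as_int = int(ipaddress.IPv4Address("130.208.240.71"))

# The same conversion by hand: four bytes, most significant first.
manual = ".".join(str(byte) for byte in (0x82D0F047).to_bytes(4, "big"))

# A UDP channel is an (IP address, port) pair; the port must fit in 16 bits.
dns_channel = ("46.22.96.35", 53)
port_fits = 0 <= dns_channel[1] < 2**16
```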
Sending/Receiving UDP Messages
UDP is connectionless: you send a message to a channel anytime (via the OS's APIs, e.g. socket)
But you have no idea whether it gets delivered or not
A message can be up to ~64 KB in size, but prefer <1500 bytes, or a few KB at most
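As a rough illustration of the socket API mentioned above, the following sketch sends one datagram over the loopback interface. The loopback address, the payload, and the one-second timeout are assumptions for the demo; a real peer would be a remote (address, port) pair.

```python
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
receiver.settimeout(1.0)                 # never wait forever for a datagram
channel = receiver.getsockname()         # the (IP address, UDP port) pair

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"hello", channel)         # no connection set-up needed

try:
    data, source = receiver.recvfrom(1500)   # read at most 1500 bytes
except socket.timeout:
    data, source = None, None            # UDP made no delivery promise
finally:
    sender.close()
    receiver.close()
```

The timeout matters: without it, a dropped datagram would leave `recvfrom` blocked forever.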
Example: Reliable Communication
We want to exchange an ordered sequence of messages over an unreliable channel that drops, duplicates and reorders messages
This is what TCP provides, on top of the unreliable Internet Protocol (IP) packet delivery service
UDP is a very thin layer on top of IP
Reliable Messaging: Sender Protocol
What's a good value for "a little while"?
global numSent = 0
// channel now represents both send and recv queues
function reliable_send(msg, channel):
    numSent = numSent + 1
    do forever:
        send_if(channel, (numSent, msg))
        wait for a little while
        reply = receive_if(channel)
        if reply ≠ queue-empty:
            // use a separate name so msg is not overwritten before a resend
            numReceived, kind = unpack reply
            if kind = ACK and numReceived = numSent:
                return
Reliable Messaging: Receiver Protocol
global numReceived = 0
// channel now represents both send and recv queues
function reliable_receive(channel):
    do forever:
        packet = receive_if(channel)
        if packet ≠ queue-empty:
            packetNum, msg = unpack packet
            if packetNum = numReceived + 1:
                numReceived = numReceived + 1
                send_if(channel, (numReceived, ACK))
                return msg
            // duplicate or out-of-order packet: re-send the last ACK
            send_if(channel, (numReceived, ACK))
        wait a little while
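The sender and receiver protocols can be exercised together in a single-threaded simulation. This sketch is a simplification, not the slides' exact code: it models only dropped packets and dropped ACKs (plus the duplicates that retransmission creates), runs sender and receiver in lock-step instead of waiting "a little while", and the names `lossy_put` and `transfer` are invented for it.

```python
import random
from collections import deque

def lossy_put(q, item, rng, loss):
    """Enqueue item unless the channel decides to drop it."""
    if rng.random() >= loss:
        q.append(item)

def transfer(messages, loss=0.3, seed=1):
    """Send messages in order over lossy queues; return what the receiver delivered."""
    rng = random.Random(seed)
    to_receiver, to_sender = deque(), deque()
    delivered = []
    num_received = 0
    for num_sent, msg in enumerate(messages, start=1):
        acked = False
        while not acked:
            # Sender: transmit, or retransmit after an unanswered attempt.
            lossy_put(to_receiver, (num_sent, msg), rng, loss)
            # Receiver: accept only the next expected number, ACK everything.
            while to_receiver:
                packet_num, payload = to_receiver.popleft()
                if packet_num == num_received + 1:
                    num_received += 1
                    delivered.append(payload)
                lossy_put(to_sender, (num_received, "ACK"), rng, loss)
            # Sender: stop retransmitting once the matching ACK arrives.
            while to_sender:
                n, kind = to_sender.popleft()
                if kind == "ACK" and n == num_sent:
                    acked = True
    return delivered
```

When an ACK is lost, the sender retransmits a packet the receiver already has; the sequence-number check discards the duplicate and the ACK is simply sent again.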
Let's Check our Protocol
The channel is our adversary: it misbehaves and tries to confuse us. Try the protocol with:
Dropped, reordered, duplicate messages
Dropped, reordered, duplicate ACKs
Below is the failure-free, happy case:
[Timing diagram]
A: send(1, Bla), receive(1, ACK)
B: receive(1, Bla), send(1, ACK)
Take-home Points
Designing robust network protocols is difficult
You have to anticipate and handle every type of failure that can occur, at any stage in the protocol
The Message Queue/Event model can help a lot
TCP vs. Our Toy Protocol
Transmits byte sequences, not discrete messages
You send a byte buffer; TCP chops it up into segments (packets) any way it pleases and ACKs byte sequence positions.
You must provide message framing, e.g. prepend the length of your messages to their data
Buffers sent and received data and has multiple segments in flight on the network at the same time
Message-by-message ping-pong would be way too slow
Performs flow control and congestion avoidance
Adjusts transmission rate to current network bandwidth and shares bandwidth fairly with other connections
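Length-prefix framing, as suggested above for byte streams, might look like the sketch below. The 4-byte big-endian header and the function names `frame` and `deframe` are assumptions; any fixed, agreed-upon header format would do.

```python
import struct

HEADER = struct.Struct("!I")   # assumed format: 4-byte big-endian length prefix

def frame(message: bytes) -> bytes:
    """Prepend the message length so the receiver can find message boundaries."""
    return HEADER.pack(len(message)) + message

def deframe(buffer: bytearray) -> list:
    """Remove and return every complete message at the front of buffer."""
    messages = []
    while len(buffer) >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, 0)
        end = HEADER.size + length
        if len(buffer) < end:
            break                      # rest of this message has not arrived yet
        messages.append(bytes(buffer[HEADER.size:end]))
        del buffer[:end]
    return messages
```

Because TCP may deliver the stream in arbitrary pieces, the receiver appends whatever arrives to `buffer` and calls `deframe` after each read; incomplete messages simply stay in the buffer until the rest arrives.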
Queue Model of TCP/IP
TCP is connection-oriented: you establish a connection with a remote node before exchanging messages with it
To agree on initial sequence numbers, etc.
We can model this as creating a new channel
We thought of UDP channels as pre-existing
Queue Model of TCP/IP, Continued
These nodes are playing the roles of clients connecting to a server. They choose their ports at will.
[Diagram: channels labeled by endpoint pairs, e.g. 2.2.2.2:2 to 1.2.3.4:1 and 1.2.3.4:1 to 2.2.2.2:2]
Node 2.2.2.2 uses port 2
Node 1.2.3.4 uses port 1
Node 3.3.3.3 uses ports 2, 3
This node is playing the role of a server, accepting connections at a well-known port (e.g. port 80 for HTTP, the World Wide Web protocol)
Queues Are Real!
Networking hardware/software is full of queues
[Diagram: a computer and a smartphone connected through routers]
Modeling Multi-User Games
Multiple nodes hold a copy of some state
We want them to behave as if there was a single shared instance of the state
They can't really; they can only exchange messages
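Replicated System Problems
First-order problem: conflicting updates
[Diagram: several nodes each hold copies of variables x and y; update messages X3 and X5 (set x to 3, set x to 5) are issued concurrently, and the copies of x end up differing between nodes]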
Near-Universal Solution: Master/Slave
One node is the master for updates; the other, slave nodes forward their updates to the master
[Diagram: slaves forward updates X3 and X5 to the master, which sends the same sequence of updates to every slave]
In essence: we ensure everyone's receive queue looks the same as the master's queue
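The idea that every node's receive queue mirrors the master's can be sketched with plain in-memory queues. The classes `Replica` and `Master` and the (variable, value) update format are hypothetical, and message loss is ignored.

```python
from collections import deque

class Replica:
    """A node holding a copy of the shared state."""

    def __init__(self):
        self.state = {"x": 0, "y": 0}
        self.inbox = deque()             # receive queue, fed only by the master

    def apply_all(self):
        while self.inbox:
            var, val = self.inbox.popleft()
            self.state[var] = val

class Master(Replica):
    """The single node that decides the order of all updates."""

    def __init__(self, slaves):
        super().__init__()
        self.slaves = slaves

    def handle(self, update):
        """Apply a forwarded update, then broadcast it to every slave in that order."""
        var, val = update
        self.state[var] = val
        for slave in self.slaves:
            slave.inbox.append(update)

slaves = [Replica(), Replica()]
master = Master(slaves)
master.handle(("x", 3))                  # update forwarded by one slave
master.handle(("x", 5))                  # conflicting update from another slave
for slave in slaves:
    slave.apply_all()
```

Both slaves see the updates in the master's order, so the conflict between the two writes to x is resolved the same way everywhere.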
So far, so good, but...
What we've shown is basically a distributed cache, where slaves are eventually consistent
The master is authoritative. It is in a position to authenticate, modify or reject changes
MMOs usually have a permanent, trusted master (operated by the game corp) since end users cheat!
Inconsistent Execution/Race Condition
The state is shared but the simulation is not
[Diagram: the master distributes update X5 while a slave that has not yet seen it issues update X3, computed from the old value of x]
Solution 1: DB-Style Distributed Locking
1. The slave sends the master a request to lock the set of variables it wants to read and/or update
2. The master acknowledges the request, if no other node has any of the variables locked
Otherwise: it rejects or delays the lock request
3. The slave then executes the event and sends the update
No inconsistency: others can't modify the variables
4. The master updates and unlocks
Problems with Locking
Low performance: slaves spend at least a message round trip waiting, for each update
This alone rules out locking for most games
Solution 2: Optimistic Concurrency Control
Give the master enough information to be able to reject updates based on (possibly) stale data
The slave sends with its updates the read set of variables read by the event's execution, as well as their values
The master checks whether all of an update's read variables still have these values. If not, it rejects the update
Alternative: the master tracks which updates each slave has received and rejects updates if any read-set value has changed (disregarding values)
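The value-checking variant of the read-set test can be written in a few lines. The class name `Master`, the method `submit`, and the dictionary format for read sets and writes are assumptions of this sketch.

```python
class Master:
    """Authoritative copy of the state that validates optimistic updates."""

    def __init__(self, state):
        self.state = dict(state)

    def submit(self, read_set, writes):
        """Apply writes only if every read variable still has the value the slave saw."""
        for var, seen in read_set.items():
            if self.state.get(var) != seen:
                return False             # stale read: reject the update
        self.state.update(writes)
        return True

master = Master({"x": 2, "y": 0})
first = master.submit({"x": 2}, {"x": 5})    # read x=2, which is still current
second = master.submit({"x": 2}, {"x": 3})   # also read x=2, but x is now 5
```

The first update is accepted and the second rejected, since by the time it arrives x no longer has the value it was computed from.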
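Optimistic Concurrency Control in Action
The server verifies that updates were made assuming correct variable values
[Diagram: the master has distributed update X5; a slave's update X3 carries its read set (x=2), which the master checks against its current values]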
Optimistic Concurrency Control Pro & Con
Pro: when there are no conflicts, there is no waiting and no additional delay
Con: read sets can be large and eat network bandwidth
Con: high contention (many conflicts) may cause livelock: some slave keeps losing out
For example: a slave with high network latency
It can be hard to ensure fairness for all nodes
Solution 3: Share Execution, not Updates
Instead of sending state mutations, slaves send user input (mouse/keyboard) to the master
The master executes the total simulation and distributes the resulting state updates to the slaves
Shared Execution Pro & Con
Pro: works well; this is essentially how most quick-paced games do it (FPSes, e.g.)
Games are no longer treated as a database problem
Solutions to Shared Execution Delay
Prediction: slaves also execute game logic, assuming immediate effect of the user's input
Predict how the player's character moves, predict how other users' characters will move.
This is not a fully solved problem
FPS engines (Quake, Unreal) have finely hand-tuned, fairly ad hoc solutions
Separate predictions for character running, jumping, gunshots, flying grenades
Heavily optimized/compressed encoding of update packets, to conserve bandwidth
Can be solved generally through determinism
Slaves roll back their state to the time of the new server update and then replay all events back to now
As you figure it out, use the Queue, Luke!
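Roll back and replay can be illustrated with a deliberately tiny game: a position on one axis and moves that add to it. Everything here (the `step` function, the tick numbering, the class `PredictingClient`) is a hypothetical stand-in for real, deterministic game logic.

```python
def step(position, move):
    """Deterministic game logic: a move is a signed distance along one axis."""
    return position + move

class PredictingClient:
    """A slave that predicts locally and corrects itself on server updates."""

    def __init__(self, position=0):
        self.position = position
        self.pending = []                # queue of (tick, move) not yet confirmed

    def local_input(self, tick, move):
        self.pending.append((tick, move))
        self.position = step(self.position, move)   # predict the immediate effect

    def server_update(self, tick, position):
        """Roll back to the server's state at `tick`, then replay later inputs."""
        self.pending = [(t, m) for t, m in self.pending if t > tick]
        self.position = position
        for _, move in self.pending:
            self.position = step(self.position, move)

client = PredictingClient()
client.local_input(1, 2)                 # predicted position: 2
client.local_input(2, 3)                 # predicted position: 5
client.server_update(1, 1)               # server: after tick 1 the position was 1
```

After the correction the client replays only the tick-2 move on top of the server's position. The replay is only valid because `step` is deterministic, which is the point made about determinism above.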
Distributed Systems Everywhere!
Multi-core machines (with NUMA)
Fast, failure-free networks (memory, PCI Express)
[Diagram: cores connected to memory]
Distributed Systems Everywhere!
Shared-memory threads: can model as nodes
Memory accesses are message passing
Implemented by memory controller hardware
[Diagram: threads A and B exchanging messages with memory]
A: r1 = load(10001234); r2 = r1 + 1
B: store(10001234: 155)
Memory to A: return(10001234: ??)
Summary
Modeling distributed systems as nodes exchanging messages via queues is very useful
This is how academics do it, for their proofs!
Shared state is the canonical hard problem for distributed systems
We've seen the top of the iceberg today. Add partial failures, partial subscriptions, partitioned servers, dynamic migration
MMOs are special, but not all that special
Yet to successfully apply knowledge from DB/Distr