
Journey to the Center of the Linux Kernel: Traffic Control, Shaping and QoS

Julien Vehent [http://jve.linuxwall.info], see revisions [http://wiki.linuxwall.info/doku.php/en:ressources:dossiers:networking:traffic_control?do=revisions]

1 Introduction
This document describes the Traffic Control subsystem of the Linux Kernel in depth, algorithm by algorithm, and shows how it can be used to manage the outgoing traffic of a Linux system. Throughout the chapters, we will discuss both the theory behind and the usage of Traffic Control, and demonstrate how one can gain complete control over the packets passing through their system.

(a QoS graph)

The initial target of this paper was to gain better control over a small DSL uplink. It grew over time to cover a lot more than that. 99% of the information provided here can be applied to any type of server, as well as routers, firewalls, etc.

The Traffic Control topic is large and in constant evolution, as is the Linux Kernel. The real credit goes to the developers behind the /net directory of the kernel, and all of the researchers who created and improved all of these algorithms. This is merely an attempt to document some of this work for the masses. Any participation and comments are welcome, in particular if you spotted an inconsistency somewhere. Please email julien[at]linuxwall.info, your messages are always most appreciated.

For the technical discussion, since the LARTC mailing list doesn't exist anymore, try those two:
* Netfilter users mailing list [http://vger.kernel.org/vger-lists.html#netfilter] for general discussions
* NetDev mailing list [http://vger.kernel.org/vger-lists.html#netdev] is where magic happens (developers ML)

2 Motivation
This article was initially published in the French issue of GNU/Linux Magazine France #127, in May 2010. GLMF is kind enough to provide a contract that releases the content of the article under Creative Commons after some time. I extended the initial article quite a bit since, but you can still find the original French version here
[http://wiki.linuxwall.info/doku.php/fr:ressources:dossiers:networking:traffic_control]

My interest in the traffic shaping subsystem of Linux started around 2005, when I decided to host most of the services I use myself. I read the documentation available on the subject (essentially LARTC [http://lartc.org/]) but found it incomplete and ended up reading the original research publications and the source code of Linux.
I host most of my Internet services myself, at home or on small end servers (dedibox and so on). This includes web hosting (this wiki, some blogs and a few websites), SMTP and IMAP servers, some FTP servers, XMPP, DNS and so on. French ISPs allow this, but only give you 1Mbps (128KBps) of uplink, which corresponds to the TCP ACKs rate necessary for a 20Mbps downlink.
1Mbps is enough for most usage, but without traffic control, any weekly backup to an external location fills up the DSL link and slows down everyone on the network. During that time, both the visitors of this wiki and my wife chatting on Skype will experience high latency. This is not acceptable, because the priority of the weekly backup is a lot lower than the two others. Linux gives you the flexibility to shape the network traffic and use all of your bandwidth efficiently, without penalizing real-time applications. But this comes with a price, and the learning curve to implement an efficient traffic control policy is quite steep. This document provides an accurate and comprehensive introduction to the most used QoS algorithms, and gives you the tools to implement and understand your own policy.

3 The basics of Traffic Control
In the Internet world, everything is packets. Managing a network means managing packets: how they are generated, routed, transmitted, reordered, fragmented, etc. Traffic Control works on packets leaving the system. It doesn't, initially, have as an objective to manipulate packets entering the system (although you could do that, if you really want to slow down the rate at which you receive packets). The Traffic Control code operates between the IP layer and the hardware driver that transmits data on the network. We are discussing a portion of code that works on the lower layers of the network stack of the kernel. In fact, the Traffic Control code is the very one in charge of constantly furnishing packets to send to the device driver.
It means that the TC module, the packet scheduler, is permanently activated in the kernel. Even when you do not explicitly want to use it, it's there scheduling packets for transmission. By default, this scheduler maintains a basic queue (similar to a FIFO type queue) in which the first packet arrived is the first to be transmitted.
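You can see this default scheduler on any interface that has no explicit policy configured. The output below is only an illustration (the exact format varies with the kernel and iproute2 versions), but it typically shows the pfifo_fast qdisc and its three bands:

# tc qdisc show dev eth0
qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1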

At the core, TC is composed of queuing disciplines, or qdisc, that represent the scheduling policies applied to a queue. Several types of qdisc exist. If you are familiar with the way CPU schedulers work, you will find that some of the TC qdisc are similar. We have FIFO, FIFO with multiple queues, FIFO with hash and round robin (SFQ). We also have a Token Bucket Filter (TBF) that assigns tokens to a qdisc to limit its flow rate (no token = no transmission = wait for a token). This last algorithm was then extended to a hierarchical mode called HTB (Hierarchical Token Bucket). And also Quick Fair Queuing (QFQ), Hierarchical Fair Service Curve (HFSC), Random Early Detection (RED), etc.
For a complete list of algorithms, check out the source code at kernel.org
[http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=tree;f=net/sched;hb=HEAD].

3.1 First contact
Let's skip the theory for now and start with an easy example. We have a web server on which we would like to limit the flow rate of packets leaving the server. We want to fix that limit at 200 kilobits per second (25 KB/s).
This setup is fairly simple, and we need three things:
1. a Netfilter rule to mark the packets that we want to limit
2. a Traffic Control policy
3. a Filter to bind the packets to the policy

3.2 Netfilter MARK
Netfilter can be used to interact directly with the structure representing a packet in the kernel. This structure, the sk_buff [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob;f=include/linux/skbuff.h], contains a field called __u32 nfmark that we are going to modify. TC will then read that value to select the destination class of a packet.
The following iptables rule will apply the mark '80' to outgoing packets (OUTPUT chain) sent by the web server (TCP source port is 80).
# iptables -t mangle -A OUTPUT -o eth0 -p tcp --sport 80 -j MARK --set-mark 80

We can control the application of this rule via the netfilter statistics:
# iptables -L OUTPUT -t mangle -v
Chain OUTPUT (policy ACCEPT 74107 packets, 109M bytes)
 pkts bytes target prot opt in  out   source    destination
73896  109M MARK   tcp  --  any eth0  anywhere  anywhere     tcp spt:www MARK xset 0x50/0xffffffff

You probably noticed that the rule is located in the mangle table. We will go back to that a little bit later.

3.3 Two classes in a tree
To manipulate TC policies, we need the /sbin/tc binary from the **iproute** package [http://www.linuxfoundation.org/collaborate/workgroups/networking/iproute2] (aptitude install iproute).
The iproute package must match your kernel version. Your distribution's package manager will normally take care of that.
We are going to create a tree that represents our scheduling policy, and that uses the HTB scheduler. This tree will contain two classes: one for the marked traffic (TCP sport 80), and one for everything else.
# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 200kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 824kbit ceil 1024kbit prio 2 mtu 1500

The two classes are attached to the root. Each class has a guaranteed bandwidth (rate value) and an opportunistic bandwidth (ceil value). If the totality of the bandwidth is not used, a class will be allowed to increase its flow rate up to the ceil value. Otherwise, the rate value is applied. It means that the sum of the rate values must correspond to the total bandwidth available.
In the previous example, we consider the total upload bandwidth to be 1024kbits/s, so class 10 (web server) gets 200kbits/s and class 20 (everything else) gets 824kbits/s.
TC can use both kbit and kbps notations, but they don't have the same meaning. kbit is the rate in kilobits per second, and kbps is in kilobytes per second. In this article, I will use the kbit notation only.
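To verify that the classes were created as intended, you can list them with tc. The output shown here is only indicative, the exact formatting depends on your iproute2 version:

# tc class show dev eth0
class htb 1:10 root prio 1 rate 200Kbit ceil 200Kbit burst 1600b cburst 1600b
class htb 1:20 root prio 2 rate 824Kbit ceil 1024Kbit burst 1600b cburst 1600b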

3.4 Connecting the marks to the tree
We now have on one side a traffic shaping policy, and on the other side packet marking. To connect the two, we need a filter.
A filter is a rule that identifies packets (handle parameter) and directs them to a class (fw flowid parameter). Since several filters can work in parallel, they can also have a priority. A filter must be attached to the root of the QoS policy, otherwise, it won't be applied.
# tc filter add dev eth0 parent 1:0 protocol ip prio 1 handle 80 fw flowid 1:10
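The filter can be checked in a similar way; the mark 80 appears in hexadecimal (0x50) in the output, which again may differ slightly depending on your version of tc:

# tc filter show dev eth0
filter parent 1: protocol ip pref 1 fw handle 0x50 classid 1:10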

We can test the policy using a simple client/server setup. Netcat is very useful for such testing. Start a listening process on the server that applies the policy using:

# nc -l -p 80 < /dev/zero

And connect to it from another machine using:
# nc 192.168.1.1 80 > /dev/null

The server process will send zeros (taken from /dev/zero) as fast as it can, and the client will receive them and throw them away, as fast as it can. Using iptraf to monitor the connection, we can supervise the bandwidth usage (bottom right corner).

The value is 199.20kbits/s, which is close enough to the 200kbits/s target. The precision of the scheduler depends on a few parameters that we will discuss later on.
Any other connection from the server that uses a source port different from TCP/80 will have a flow rate between 824kbits/s and 1024kbits/s (depending on the presence of other connections in parallel).
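If you prefer a live view over iptraf, you can also watch the byte and packet counters of each class while the transfer is running, for example with:

# watch -n 1 tc -s class show dev eth0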

4 Twenty Thousand Leagues Under the Code
Now that we enjoyed this first contact, it is time to go back to the fundamentals of the Quality of Service of Linux. The goal of this chapter is to dive into the algorithms that compose the traffic control subsystem. Later on, we will use that knowledge to build our own policy.
The code of TC is located in the net/sched directory of the sources of the kernel. The kernel separates the flows entering the system (ingress) from the flows leaving it (egress). And, as we said earlier, it is the responsibility of the TC module to manage the egress path.
The illustration below shows the path of a packet inside the kernel, where it enters (ingress) and where it leaves (egress). If we focus on the egress path, a packet arrives from layer 4 (TCP, UDP, ...) and then enters the IP layer (not represented here). The Netfilter chains OUTPUT and POSTROUTING are integrated in the IP layer and are located between the IP manipulation functions (header creation, fragmentation, ...). At the exit of the NAT table of the POSTROUTING chain, the packet is transmitted to the egress queue, and this is where TC starts its work.
Almost all the devices use a queue to schedule the egress traffic. The kernel possesses algorithms that can manipulate these queues, they are the queuing disciplines (FIFO, SFQ, HTB, ...). The job of TC is to apply these queuing disciplines to the egress queue in order to select a packet for transmission.
TC works with the sk_buff [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob;f=include/linux/skbuff.h] structure that represents a packet in the kernel. It doesn't manipulate the packet data directly. sk_buff is a shared structure accessible anywhere in the kernel, thus avoiding unnecessary duplication of data. This method is a lot more flexible and a lot faster because sk_buff contains all of the necessary information to manipulate the packet; the kernel therefore avoids copies of headers and payloads from different memory areas that would ruin the performance.

On a regular basis, the packet scheduler will wake up and launch any preconfigured scheduling algorithms to select a packet to transmit.
Most of the work is launched by the function dev_queue_xmit from net/core/dev.c, which only takes an sk_buff structure as input (this is enough, sk_buff contains everything needed, such as skb->dev, a pointer to the output NIC).
dev_queue_xmit makes sure the packet is ready to be transmitted on the network, that the fragmentation is compatible with the capacity of the NIC, and that the checksums are calculated (if this is handled by the NIC, this step is skipped). Once those controls are done, and if the device has a queue in skb->dev->qdisc, then the sk_buff structure of the packet is added to this queue (via the enqueue function) and the qdisc_run function is called to select a packet to send.
This means that the packet that has just been added to the NIC's queue might not be the one immediately transmitted, but we know that it is ready for subsequent transmission the moment it is added to the queue.
To each device is attached a root queuing discipline. This is what we defined earlier when creating the root qdisc to limit the flow rate of the web server:
# tc qdisc add dev eth0 root handle 1: htb default 20

This command means: attach a root qdisc identified by id 1 to the device eth0, use htb as a scheduler and send everything to class 20 by default.
We will find this qdisc at the pointer skb->dev->qdisc. The enqueue and dequeue functions are also located there, respectively in skb->dev->qdisc->enqueue() and skb->dev->qdisc->dequeue(). This last dequeue function will be in charge of forwarding the sk_buff structure of the packet selected for transmission to the NIC's driver.
The root qdisc can have leaves, known as classes, that can also possess their own qdisc, thus constructing a tree.

4.1 Classless Disciplines
This is a pretty long way down. For those who wish to deepen the subject, I recommend reading Understanding Linux Network Internals, by Christian Benvenuti at O'Reilly.
We now have a dequeue function whose role is to select the packet to send to the network interface. To do so, this function calls scheduling algorithms that we are now going to study. There are two types of algorithms: Classless and Classful. The classful algorithms are composed of qdisc that can contain classes, like we did in the previous example with HTB. In opposition, classless algorithms cannot contain classes, and are (supposedly) simpler.

4.1.1 PFIFO_FAST
Let's start with a small one. pfifo_fast is the default scheduling algorithm used when no other is explicitly defined. In other words, it's the one used on 99.9% of the Linux systems. It is classless and a tiny bit more evolved than a basic FIFO queue, since it implements 3 bands working in parallel. These bands are numbered 0, 1 and 2 and emptied sequentially: while 0 is not empty, 1 will not be processed, and 2 will be the last. So we have 3 priorities: the packets in queue 0 will be sent before the packets in queue 2.
The kernel uses the Type of Service field (the 8-bit field from bit 8 to bit 15 of the IP header, see below) to select the destination band of a packet.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |Version|  IHL  |Type of Service|          Total Length         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         Identification        |Flags|      Fragment Offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Time to Live |    Protocol   |         Header Checksum       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Source Address                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Destination Address                        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    Options                    |    Padding    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Example Internet Datagram Header, from RFC 791

This algorithm is defined in **net/sched/sch_generic.c** [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob_plain;f=net/sched/sch_generic.c;hb=HEAD] and represented in the diagram below.

(dia source)
The length of a band, representing the number of packets it can contain, is set to 1000 by default and defined outside of TC. It's a parameter that can be set using ifconfig, and visualized in /sys:


# cat /sys/class/net/eth0/tx_queue_len
1000

Once the default value of 1000 is passed, TC will start dropping packets. This should very rarely happen because TCP makes sure to adapt its sending speed to the capacity of both systems participating in the communication (that's the role of the TCP slow start). But experiments showed that increasing that limit to 10,000, or even 100,000, in some very specific cases of gigabit networks can improve performance. I wouldn't recommend touching this value unless you really know what you are doing. Increasing a buffer size to a too large value can have very negative side effects on the quality of a connection. Jim Gettys called that bufferbloat, and we will talk about it in the last chapter.
Because this algorithm is classless, it is not possible to plug another scheduler after it.
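If you do want to experiment with this queue length, it can be changed at runtime, either with ifconfig or with ip (the value 500 below is just an example):

# ifconfig eth0 txqueuelen 500
# ip link set dev eth0 txqueuelen 500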

4.1.2 SFQ, Stochastic Fairness Queuing
Stochastic Fairness Queuing is an algorithm that shares the bandwidth without giving any privilege of any sort. The sharing is fair by being stochastic, or random if you prefer. The idea is to take a fingerprint, or hash, of the packet based on its header, and to use this hash to send the packet to one of the 1024 buckets available. The buckets are then emptied in a round-robin fashion.
The main advantage of this method is that no packet will have priority over another one. No connection can take over the others, and everybody has to share. The repartition of the bandwidth across the packets will almost always be fair, but there are some minor limitations. The main limitation is that the hashing algorithm might produce the same result for several packets, and thus send them to the same bucket. One bucket will then fill up faster than the others, breaking the fairness of the algorithm. To mitigate this, SFQ will modify the parameters of its hashing algorithm on a regular basis, by default every 10 seconds.
The diagram below shows how the packets are processed by SFQ, from entering the scheduler at the top, to being dequeued and transmitted at the bottom. The source code is available in net/sched/sch_sfq.c [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob_plain;f=net/sched/sch_sfq.c;hb=HEAD]. Some variables are hardcoded in the source code:
* SFQ_DEFAULT_HASH_DIVISOR gives the number of buckets and defaults to 1024
* SFQ_DEPTH defines the depth of each bucket, and defaults to 128 packets
#define SFQ_DEPTH                 128   /* max number of packets per flow */
#define SFQ_DEFAULT_HASH_DIVISOR  1024

These two values determine the maximum number of packets that can be queued at the same time: 1024 * 128 = 131072 packets can be waiting in the SFQ scheduler before it starts dropping packets. But this theoretical number is very unlikely to ever be reached, because it would require the algorithm to fill up every single bucket evenly before starting to dequeue or drop packets.
The tc command line lists the options that can be fed to SFQ:


# tc qdisc add sfq help
Usage: ... sfq [ limit NUMBER ] [ perturb SECS ] [ quantum BYTES ]

limit: the size of the buckets. It can be reduced below SFQ_DEPTH but not increased past it. If you try to put a value above 128, TC will simply ignore it.
perturb: the frequency, in seconds, at which the hashing algorithm is updated.
quantum: the maximum amount of bytes that the round-robin dequeue process will be allowed to dequeue from a bucket. At a bare minimum, this should be equal to the MTU of the connection (1500 bytes on ethernet). Imagine that we set this value to 500 bytes: all packets bigger than 500 bytes would not be dequeued by SFQ, and would stay in their bucket. We would, very quickly, arrive at a point where all buckets are blocked and no packets are transmitted anymore.

4.1.2.1 SFQ Hashing algorithm

(DIA source)

The hash used by SFQ is computed by the sfq_hash function (see the source code), and takes into account the source and destination IP addresses, the layer 4 protocol number (still in the IP header), and the TCP ports (if the IP packet is not fragmented). Those parameters are mixed with a random number regenerated every 10 seconds (the perturb value defines that).
Let's walk through a simplified version of the algorithm steps. The connection has the following parameters:
source IP: 126.255.154.140
source port: 8080
destination IP: 175.112.129.215
destination port: 2146


/* IP source address in hexadecimal */
h1 = 7efffef0

/* IP destination address in hexadecimal */
h2 = af7081d7

/* 06 is the protocol number for TCP (bits 72 to 80 of the IP header).
   We perform a XOR between the variable h2 obtained in the previous step
   and the TCP protocol number */
h2 = h2 XOR 06

/* if the IP packet is not fragmented, we include the TCP ports in the hash */
/* 1f900862 is the hexadecimal representation of the source and destination ports.
   We perform another XOR with this value and the h2 variable */
h2 = h2 XOR 1f900862

/* And finally, we use the Jenkins algorithm with some additional "golden numbers".
   This jhash function is defined somewhere else in the kernel source code */
h = jhash(h1, h2, perturbation)

The result obtained is a 32-bit hash value that will be used by SFQ to select the destination bucket of the packet. Because the perturb value is regenerated every 10 seconds, the packets from a reasonably long connection will be directed to different buckets over time.
But this also means that SFQ might break the sequencing of the sending process. If two packets from the same connection are placed in two different buckets, it is possible that the second bucket will be processed before the first one, therefore sending the second packet before the first one. This is not a problem for TCP, which uses sequence numbers to reorder the packets when they reach their destination, but for UDP it might be. For example, imagine that you have a syslog daemon sending logs to a central syslog server. With SFQ, it might happen that a log message arrives before the log that precedes it. If you don't like it, use TCP.
TC gives us some more flexibility on the hashing algorithm. We can, for example, modify the fields considered by the hashing process. This can be done using TC filters, as follows:

# tc qdisc add dev eth0 root handle 1: sfq perturb 10 quantum 3000 limit 64
# tc filter add dev eth0 parent 1:0 protocol ip handle 1 flow hash keys src,dst divisor 1024

The filter above simplifies the hash to keep only the source and destination IP addresses as input parameters. The divisor value is the number of buckets, as seen before. We could, then, create an SFQ scheduler that works with 10 buckets only and considers the IP addresses of the packets in the hash.
This discipline is classless as well, which means we cannot direct packets to another scheduler when they leave SFQ. Packets are transmitted to the network interface only.
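If you want to return to the default pfifo_fast scheduler after experimenting with SFQ, simply delete the root qdisc:

# tc qdisc del dev eth0 root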

4.2 Classful Disciplines
4.2.1 TBF, Token Bucket Filter
Until now, we looked at algorithms that do not allow control over the amount of bandwidth. SFQ and PFIFO_FAST give the ability to smoothen the traffic, and even to prioritize it a bit, but not to control its throughput.
In fact, the main problem when controlling the bandwidth is to find an efficient accounting method, because counting in memory is extremely difficult and costly to do in real time, so computer scientists took a different approach here.
Instead of counting the packets (or the bits transmitted by the packets, it's the same thing), the Token Bucket Filter algorithm sends, at a regular interval, a token into a bucket. This is disconnected from the actual packet transmission, but when a packet enters the scheduler, it will consume a certain number of tokens. If there are not enough tokens for it to be transmitted, the packet waits.
Until now, with SFQ and PFIFO_FAST, we were talking about packets, but with TBF we now have to look into the bits contained in the packets. Let's take an example: a packet carrying 8000 bits (1KB) wishes to be transmitted. It enters the TBF scheduler and TBF checks the content of its bucket: if there are 8000 tokens in the bucket, TBF destroys them and the packet can pass. Otherwise, the packet waits until the bucket has enough tokens.
The frequency at which tokens are added into the bucket determines the transmission speed, or rate. This is the parameter at the core of the TBF algorithm, shown in the diagram below.

(DIA source)
Another particularity of TBF is to allow bursts. This is a natural side effect of the algorithm: the bucket fills up at a continuous rate, but if no packets are being transmitted for some time, the bucket will get completely full. Then, the next packets to enter TBF will be transmitted right away, without having to wait and without having any limit applied to them, until the bucket is empty. This is called a burst, and in TBF the burst parameter defines the size of the bucket.
So with a very large burst value, say 1,000,000 tokens, we would let a maximum of 83 fully loaded packets (roughly 124KBytes if they all carry their maximum MTU) traverse the scheduler without applying any sort of limit to them.
To overcome this problem, and provide better control over the bursts, TBF implements a second bucket, smaller and generally the same size as the MTU. This second bucket cannot store large amounts of tokens, but its replenishing rate will be a lot faster than the one of the big bucket. This second rate is called peakrate and it will determine the maximum speed of a burst.
Let's take a step back and look at those parameters again. We have:
peakrate > rate: the second bucket fills up faster than the main one, to allow and control bursts. If the peakrate value is infinite, then TBF behaves as if the second bucket didn't exist. Packets would be dequeued according to the main bucket, at the speed of rate.
burst > MTU: the size of the first bucket is a lot larger than the size of the second bucket. If the burst is equal to the MTU, then peakrate is equal to rate and there is no burst.
So, to summarize: when everything works smoothly, packets are enqueued and dequeued at the speed of rate. If tokens are available when packets enter TBF, those packets are transmitted at the speed of peakrate until the first bucket is empty. This is represented in the diagram above, and in the source code at net/sched/sch_tbf.c [http://git.kernel.org/?p=linux/kernel/git/next/linux-next.git;a=blob_plain;f=net/sched/sch_tbf.c;hb=HEAD], the interesting function being tbf_dequeue.
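As an illustration, a minimal standalone TBF policy shaping an entire interface could look like the command below. The values are arbitrary examples, and latency is used here instead of an explicit limit to size the queue:

# tc qdisc add dev eth0 root tbf rate 220kbit burst 1540 latency 50ms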

The configurable options of the TBF scheduler are listed in TC:
# tc qdisc add tbf help
Usage: ... tbf limit BYTES burst BYTES[/BYTES] rate KBPS [ mtu BYTES[/BYTES] ]
               [ peakrate KBPS ] [ latency TIME ] [ overhead BYTES ] [ linklayer TYPE ]

We recognize burst, rate, mtu and peakrate that we discussed above. The limit parameter is the size of the packet queue (see diagram above). latency represents another way of calculating the limit parameter, by setting the maximum time a packet can wait in the queue (the size of the queue is then derived from it, the calculation includes all of the values of burst, rate, peakrate and mtu). overhead and linklayer are two other parameters whose story is quite interesting. Let's take a look at those now.

4.2.1.1 DSL and ATM, the Ox that believed to be a Frog

If you have ever read Jean de La Fontaine [http://en.wikipedia.org/wiki/Jean_de_La_Fontaine], you probably know the story of The Frog That Wanted To Be As Big As The Ox. Well, in our case, it's the opposite, and your packets might not be as small as they think they are.
While most local networks use Ethernet, up until recently a lot of communications (at least in Europe) were done over the ATM protocol. Nowadays, ISPs are moving toward all-over-IP, but ATM is still around. The particularity of ATM is to split large ethernet packets into much smaller ones, called cells. A 1500 bytes ethernet packet would be split into ~30 smaller ATM cells of just 53 bytes each. And from those 53 bytes, only 48 are from the original packet, the rest is occupied by the ATM headers.
So where is the problem? Consider the following network topology.

The QoS box is in charge of performing the packet scheduling before transmitting it to the modem. The packets are then split by the modem into ATM cells. So our initial 1.5KB ethernet packet is split into 32 ATM cells, for a total size of 32 * 5 bytes of headers per cell + 1500 bytes of data = (32*5) + 1500 = 1660 bytes. 1660 bytes is 10.6% bigger than 1500. When ATM is used, we lose 10% of bandwidth compared to an ethernet network (this is an estimate that depends on the average packet size, etc).
If TBF doesn't know about that, and calculates its rate based on the sole knowledge of the ethernet MTU, then it will transmit 10% more packets than the modem can transmit. The modem will start queuing, and eventually dropping, packets. The TCP stacks will have to adjust their speed, traffic gets erratic and we lose the benefit of TC as a traffic shaper.
Jesper Dangaard Brouer did his Master's Thesis on this topic, and wrote a few patches for the kernel and TC. These patches implement the overhead and linklayer parameters, and can be used to inform the TBF scheduler of the type of link to account for.
* overhead represents the quantity of bytes added by the ATM headers, 5 by default
* linklayer defines the type of link, either ethernet or {atm, adsl}. atm and adsl are the same thing and represent a 5 bytes header overhead
We can use these parameters to fine tune the creation of a TBF scheduling discipline:
# tc qdisc add dev eth0 root tbf rate 1mbit burst 10k latency 30ms linklayer atm
# tc -s qdisc show dev eth0
qdisc tbf 8005: root rate 1000Kbit burst 10Kb lat 30.0ms
 Sent 738 bytes 5 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

And by setting the linklayer to atm, we take into account an overhead of 5 bytes per cell later in the transmission, therefore preventing the modem from buffering the packets.

4.2.1.2 The limits of TBF

TBF gives a pretty accurate control over the bandwidth assigned to a qdisc. But it also imposes that all packets pass through a single queue. If a big packet is blocked because there are not enough tokens to send it, smaller packets that could potentially be sent instead are blocked behind it as well. This is the case represented in the diagram above, where packet #2 is stuck behind packet #1. We could optimize the bandwidth usage by allowing the smaller packet to be sent instead of the bigger one. We would, however, fall into the same problem of reordering packets that we discussed with the SFQ algorithm.
The other solution would be to give more flexibility to the traffic shaper, declare several TBF queues in parallel, and route the packets to one or the other using filters. We could also allow those parallel queues to borrow tokens from each other, in case one is idle and the other one is not.
We just prepared the ground for classful qdiscs, and the Hierarchical Token Bucket.

4.2.2 HTB, Hierarchical Token Bucket
The Hierarchical Token Bucket (HTB) is an improved version of TBF that introduces the notion of classes. Each class is, in fact, a TBF-like qdisc, and classes are linked together as a tree, with a root and leaves. HTB introduces a number of features to improve the management of bandwidth, such as a priority between classes, a way to borrow bandwidth from another class, or the possibility to plug another qdisc as an exit point (a SFQ for example).
Let's take a simple example, represented in the diagram below.

(htb_en.dia.zip)
The tree is created with the commands below:
# tc qdisc add dev eth0 root handle 1: htb default 20
# tc class add dev eth0 parent 1:0 classid 1:10 htb rate 200kbit ceil 400kbit prio 1 mtu 1500
# tc class add dev eth0 parent 1:0 classid 1:20 htb rate 200kbit ceil 200kbit prio 2 mtu 1500

HTB uses a similar terminology to TBF and SFQ:
burst is identical to the burst of TBF: it's the size of the token bucket
rate is the speed at which tokens are generated and put in the bucket, the speed of the leaf, like in TBF
quantum is similar to the quantum discussed in SFQ, it's the amount of bytes to serve from the leaf at once
The new parameters are ceil and cburst. Let us walk through the tree to see how they work.
In the example above, we have a root qdisc with handle 1 and two leaves, qdisc handle 10 and qdisc handle 20. The root will apply filters to decide where a packet should be directed (we will discuss those later), and by default packets are sent to leaf #20 (default 20).
Leaf #10 has a rate value of 200kbits/s, a ceil value of 400kbits/s (which means it can borrow 200kbits/s more than its rate) and a priority (prio) of 1.
Leaf #20 has a rate value of 200kbits/s, a ceil value of 200kbits/s (which means it cannot borrow anything, rate == ceil) and a priority of 2.
Each HTB leaf will, at any point, have one of the three following statuses:
HTB_CANT_SEND: the class can neither send nor borrow, no packets are allowed to leave the class
HTB_MAY_BORROW: the class cannot send using its own tokens, but can try to borrow from another class
HTB_CAN_SEND: the class can send using its own tokens
Imagine a group of packets that enter TC and are marked with the flag #10, and therefore are directed to leaf #10. The bucket for leaf #10 does not contain enough tokens to let the first packets pass, so it will try to borrow some from its neighbor leaf #20. The quantum value of leaf #10 is set to the MTU (1500 bytes), which means the maximal amount of data that leaf #10 will try to send is 1500 bytes. If packet #1 is 1400 bytes large, and the bucket in leaf #10 has enough tokens for 1000 bytes, then the leaf will try to borrow the remaining 400 bytes from its neighbor leaf #20.
The quantum is the maximal amount of bytes that a leaf will try to send at once. The closer the value is to the MTU, the more accurate the scheduling will be, because we reschedule after every 1500 bytes. And the larger the value of quantum, the more a leaf will be privileged: it will be allowed to borrow more tokens from its neighbor. But of course, since the total amount of tokens in the tree is not unlimited, if a token is borrowed from a leaf, another leaf cannot use it anymore. Therefore, the bigger the value of quantum, the more a leaf is able to steal from its neighbor. This is tricky because those neighbors might very well have packets to send as well.
When configuring TC, we do not manipulate the value of quantum directly. There is an intermediary parameter called r2q that calculates the quantum automatically based on the rate: quantum = rate / r2q. By default, r2q is set to 10, so for a rate of 200kbits, quantum will have a value of 20kbits.
For very small or very large bandwidths, it is important to tune r2q properly. If r2q is too large, too many packets will leave a queue at once. If r2q is too small, not enough packets are sent.
One important detail is that r2q is set on the root qdisc once and for all. It cannot be configured for each leaf separately (a small example of setting it is shown after the option listing below).
TC offers the following configuration options for HTB:

Usage: ... qdisc add ... htb [default N] [r2q N]
 default  minor id of class to which unclassified packets are sent {0}
 r2q      DRR quantums are computed as rate in Bps/r2q {10}
 debug    string of 16 numbers each 0-3 {0}

... class add ... htb rate R1 [burst B1] [mpu B] [overhead O]
                      [prio P] [slot S] [pslot PS]
                      [ceil R2] [cburst B2] [mtu MTU] [quantum Q]
 rate     rate allocated to this class (class can still borrow)
 burst    max bytes burst which can be accumulated during idle period {computed}
 mpu      minimum packet size used in rate computations
 overhead per-packet size overhead used in rate computations
 linklay  adapting to a linklayer e.g. atm
 ceil     definite upper class rate (no borrows) {rate}
 cburst   burst but for ceil {computed}
 mtu      max packet size we create rate map for {1600}
 prio     priority of leaf; lower are served first {0}
 quantum  how much bytes to serve from leaf at once {use r2q}
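As announced above, here is a small example of setting r2q when creating the root qdisc; the value 5 is purely illustrative and should be tuned to your rates:

# tc qdisc add dev eth0 root handle 1: htb default 20 r2q 5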

As you can see, we are now familiar with all of those parameters. If you just jumped to this section without reading about SFQ and TBF, please read those chapters for a detailed explanation of what those parameters do.
Remember that, when configuring leaves in HTB, the sum of the rates of the leaves cannot be higher than the rate of the root. It makes sense, right?

4.2.2.1 Hysteresis and HTB

Hysteresis. If this barbarian word is not familiar to you, as it wasn't to me, here is how Wikipedia defines it: "Hysteresis is the dependence of a system not just on its current environment but also on its past."
Hysteresis is a side effect introduced by an optimization of HTB. In order to reduce the load on the CPU, HTB initially did not recalculate the content of the bucket often enough, therefore allowing some classes to consume more tokens than they actually held, without borrowing.
The problem was corrected and a parameter introduced to allow or block the usage of estimates in HTB calculations. The kernel developers kept the optimization feature simply because it can prove useful in high traffic networks, where recalculating the content of the bucket each time is simply not doable.
But in most cases, this optimization is simply deactivated, as shown below:
# cat /sys/module/sch_htb/parameters/htb_hysteresis
0

4.2.3 HFSC, Hierarchical Fair Service Curve

http://nbd.name/gitweb.cgi?p=openwrt.git;a=tree;f=package/qos-scripts/files;h=71d89f8ad63b0dda0585172ef01f77c81970c8cc;hb=HEAD
[http://nbd.name/gitweb.cgi?p=openwrt.git;a=tree;f=package/qos-scripts/files;h=71d89f8ad63b0dda0585172ef01f77c81970c8cc;hb=HEAD]

4.2.4 QFQ, Quick Fair Queueing

http://www.youtube.com/watch?v=r8vBmybeKlE [http://www.youtube.com/watch?v=r8vBmybeKlE]

4.2.5 RED, Random Early Detection

4.2.6 CHOKe, CHOose and {Keep, Kill}

5 Shaping the traffic on the Home Network
Home networks are tricky to shape, because everybody wants the priority and it's difficult to predetermine a usage pattern. In this chapter, we will build a TC policy that answers general needs. Those are:
Low latency. The uplink is only 1.5Mbps and the latency shouldn't be more than 30ms under high load. We can tune the buffers in the qdisc to ensure that our packets will not stay in a large queue for 500ms waiting to be processed.
High UDP responsiveness, for applications like Skype and DNS queries.
Guaranteed HTTP/S bandwidth: half of the uplink is dedicated to the HTTP traffic (although other classes can borrow from it) to ensure that web browsing, probably 80% of a home network usage, is smooth and responsive.
TCP ACKs and SSH traffic get higher priority. In the age of Netflix and HD VoD, it's necessary to ensure fast download speed. And for that, you need to be able to send TCP ACKs as fast as possible. This is why those packets get a higher priority than the HTTP traffic.
A general class for everything else.
This policy is represented in the diagram below. We will use PFIFO_FAST and SFQ termination qdiscs once we exit HTB to perform some additional scheduling (and prevent a single HTTP connection from eating all of the bandwidth, for example).

(DIA source)
The script that generates this policy is available on github via the icon below, with comments to help you follow through.

(get the bash script from github)
Below is one of the sections, in charge of the creation of the class for SSH. I have replaced the variables with their values for readability.
# SSH class: for outgoing connections to
# avoid lag when somebody else is downloading
# however, an SSH connection cannot fill up
# the connection to more than 70%
echo "# ssh - id 300 - rate 160kbit ceil 1120kbit"
/sbin/tc class add dev eth0 parent 1:1 classid 1:300 htb \
    rate 160kbit ceil 1120kbit burst 15k prio 3

# SFQ will mix the packets if there are several
# SSH connections in parallel
# and ensure that none has the priority
echo "# ~ sub ssh: sfq"
/sbin/tc qdisc add dev eth0 parent 1:300 handle 1300: \
    sfq perturb 10 limit 32

echo "# ~ ssh filter"
/sbin/tc filter add dev eth0 parent 1:0 protocol ip \
    prio 3 handle 300 fw flowid 1:300

echo "# ~ netfilter rule - SSH at 300"
/sbin/iptables -t mangle -A POSTROUTING -o eth0 -p tcp \
    --tcp-flags SYN SYN --dport 22 -j CONNMARK \
    --set-mark 300

The first rule is the definition of the HTB class, the leaf. It connects back to its parent 1:1, defines a rate of 160kbit/s and can use up to 1120kbit/s by borrowing the difference from other leaves. The burst value is set to 15k, which is 10 full packets with an MTU of 1500 bytes.
The second rule defines a SFQ qdisc connected to the HTB one above. That means that once packets have passed the HTB leaf, they will pass through a SFQ leaf before being transmitted. The SFQ will ensure that multiple parallel connections are mixed before being transmitted, and that one connection cannot eat the whole bandwidth. We limit the size of the SFQ queue to 32 packets, instead of the default of 128.
Then comes the TC filter in the third rule. This filter will check the handle of each packet, or, to be more accurate, the value of nf_mark in the sk_buff representation of the packet in the kernel. Using this mark, the filter will direct SSH packets to the HTB leaf above.
Even though this rule is located in the SSH class block for clarity, you might have noticed that the filter has the root qdisc for parent (parent 1:0). Filters are always attached to the root qdisc, and not to the leaves. That makes sense, because the filtering must be done at the entrance of the traffic control layer.
And finally, the fourth rule is the iptables rule that applies a mark to SYN packets leaving the gateway (connection establishments). Why SYN packets only? To avoid performing complex matching on all the packets of all the connections. We will rely on netfilter's capability to maintain stateful information to propagate a mark placed on the first packet of the connection to all of the other packets. This is done by the following rule at the end of the script:
echo"#~propagatingmarksonconnections" iptablestmangleAPOSTROUTINGjCONNMARKrestoremark

Let us now load the script on our gateway, and visualize the qdiscs created.
# /etc/network/if-up.d/lnw_gateway_tc.sh start
~~~~ LOADING eth0 TRAFFIC CONTROL RULES FOR ramiel ~~~~
# cleanup
RTNETLINK answers: No such file or directory
# define a HTB root qdisc
# uplink - rate 1600kbit ceil 1600kbit
# interactive - id 100 - rate 160kbit ceil 1600kbit
# ~ sub interactive: pfifo
# ~ interactive filter
# ~ netfilter rule - all UDP traffic at 100
# tcp acks - id 200 - rate 320kbit ceil 1600kbit
# ~ sub tcp acks: pfifo
# ~ filtre tcp acks
# ~ netfilter rule for TCP ACKs will be loaded at the end
# ssh - id 300 - rate 160kbit ceil 1120kbit
# ~ sub ssh: sfq
# ~ ssh filter
# ~ netfilter rule - SSH at 300
# http branch - id 400 - rate 800kbit ceil 1600kbit
# ~ sub http branch: sfq
# ~ http branch filter
# ~ netfilter rule http/s
# default - id 999 - rate 160kbit ceil 1600kbit
# ~ sub default: sfq
# ~ filtre default
# ~ propagating marks on connections
# ~ Mark TCP ACKs flags at 200
Traffic Control is up and running

# /etc/network/if-up.d/lnw_gateway_tc.sh show
 qdiscs details
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0 ver 3.17
qdisc pfifo 1100: parent 1:100 limit 10p
qdisc pfifo 1200: parent 1:200 limit 10p
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b flows 32/1024 perturb 10sec
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b flows 32/1024 perturb 10sec
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b flows 32/1024 perturb 10sec

 qdiscs statistics
qdisc htb 1: root refcnt 2 r2q 40 default 999 direct_packets_stat 0
 Sent 16776950 bytes 125321 pkt (dropped 4813, overlimits 28190 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1100: parent 1:100 limit 10p
 Sent 180664 bytes 1985 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo 1200: parent 1:200 limit 10p
 Sent 5607402 bytes 100899 pkt (dropped 4813, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1300: parent 1:300 limit 32p quantum 1514b perturb 10sec
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1400: parent 1:400 limit 32p quantum 1514b perturb 10sec
 Sent 9790497 bytes 15682 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc sfq 1999: parent 1:999 limit 32p quantum 1514b perturb 10sec
 Sent 1198387 bytes 6755 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0

The output above shows just two of the types of output tc can generate. You might find the class statistics helpful to diagnose leaf consumption:
# tc -s class show dev eth0
[...truncated...]
class htb 1:400 parent 1:1 leaf 1400: prio 4 rate 800000bit ceil 1600Kbit burst 30Kb cburst 1600b
 Sent 10290035 bytes 16426 pkt (dropped 0, overlimits 0 requeues 0)
 rate 23624bit 5pps backlog 0b 0p requeues 0
 lended: 16424 borrowed: 2 giants: 0
 tokens: 4791250 ctokens: 120625

Above are shown the detailed statistics for the HTTP leaf, and you can see the accumulated rate, statistics of packets per second, but also the tokens accumulated, lended, borrowed, etc. This is the most helpful output to diagnose your policy in depth.

6 A word about "Buffer Bloat"
We mentioned that too large buffers can have a negative impact on the performance of a connection. But how bad is it exactly?
The answer to that question was investigated by Jim Gettys [http://gettys.wordpress.com/bufferbloat-faq/] when he found his home network to be inexplicably slow.
He found that, while we were increasing the bandwidth of network connections, we didn't worry about the latency at all. Those two factors are quite different and both critical to the good quality of a network. Allow me to quote Gettys's FAQ here:
A 100 Gigabit network is always faster than a 1 megabit network, isn't it? More bandwidth is always better! I want a faster network!

No, such a network can easily be much slower. Bandwidth is a measure of capacity, not a measure of how fast the network can respond. You pick up the phone to send a message to Shanghai immediately, but dispatching a cargo ship full of blu-ray disks will be amazingly slower than the telephone call, even though the bandwidth of the ship is billions and billions of times larger than the telephone line. So more bandwidth is better only if its latency (speed) meets your needs. More of what you don't need is useless.

Bufferbloat destroys the speed we really need.

More information on Gettys's page, and in this paper from 1996: It's the Latency, Stupid [http://rescomp.stanford.edu/~cheshire/rants/Latency.html].
Long story short: if you have bad latency, but large bandwidth, you will be able to transfer very large files efficiently, but a simple DNS query will take a lot longer than it should. And since those DNS queries, and other small messages such as VoIP, are very often time sensitive, bad latency impacts them a lot (a single packet takes several hundreds of milliseconds to traverse the network).
So how does that relate to buffers? Gettys proposes a simple experiment [http://gettys.wordpress.com/2010/11/29/home-router-puzzle-piece-one-fun-with-your-switch/] to illustrate the problem.
We said earlier that Linux ships with a default txqueuelen of 1000. This value was increased in the kernel when gigabit ethernet cards became the standard. But not all links are gigabit, far from it. Consider the following two computers:

We will call them desktop and laptop. They are both gigabit, and the switch is gigabit.
If we verify the configuration of their network interfaces, we will confirm that:
the interfaces are configured in gigabit via ethtool
the txqueuelen is set to 1000


# ifconfig eth0 | grep txqueuelen
          collisions:0 txqueuelen:1000
# ethtool eth0 | grep Speed
        Speed: 1000Mb/s

On one machine, launch nttcp with the '-i' switch to make it wait for connections:
# nttcp -i

On the laptop, launch nttcp -t -D -n2048000 <server IP>, where:
-t means this machine is transmitting, or sending data
-D disables TCP_NODELAY; see the Red Hat doc [http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/1.1/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-Application_Tuning_and_Deployment-TCP_NODELAY_and_Small_Buffer_Writes.html]
-n is the number of buffers of 4096 bytes given to the socket

# nttcp -t -D -n2048000 192.168.1.220

And at the same time, on the laptop, launch a ping of the desktop.
64 bytes from 192.168.1.220: icmp_req=1 ttl=64 time=0.300 ms
64 bytes from 192.168.1.220: icmp_req=2 ttl=64 time=0.386 ms
64 bytes from 192.168.1.220: icmp_req=3 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=4 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=5 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=6 ttl=64 time=19.2 ms
64 bytes from 192.168.1.220: icmp_req=7 ttl=64 time=19.3 ms
64 bytes from 192.168.1.220: icmp_req=8 ttl=64 time=19.0 ms
64 bytes from 192.168.1.220: icmp_req=9 ttl=64 time=0.281 ms
64 bytes from 192.168.1.220: icmp_req=10 ttl=64 time=0.362 ms

The first two pings are launched before nttcp is started. When nttcp starts, the latency augments, but this is still acceptable.
Now, reduce the speed of each network card on the desktop and the laptop to 100Mbps. The command is:

# ethtool -s eth0 speed 100 duplex full
# ethtool eth0 | grep Speed
        Speed: 100Mb/s

And run the same test again. After 60 seconds, here are the latencies I get:
64 bytes from 192.168.1.220: icmp_req=75 ttl=64 time=183 ms
64 bytes from 192.168.1.220: icmp_req=76 ttl=64 time=179 ms
64 bytes from 192.168.1.220: icmp_req=77 ttl=64 time=181 ms

And one last time, with an Ethernet speed of 10Mbps:
64 bytes from 192.168.1.220: icmp_req=187 ttl=64 time=940 ms
64 bytes from 192.168.1.220: icmp_req=188 ttl=64 time=937 ms
64 bytes from 192.168.1.220: icmp_req=189 ttl=64 time=934 ms

Almost one second of latency between two machines next to each other. Every time we divide the speed of the interface by an order of magnitude, we augment the latency by an order of magnitude as well. Under that load, opening an SSH connection from the laptop to the desktop takes more than 10 seconds, because we have a latency of almost 1 second per packet.
Now, while this last test is running, and while you are enjoying the ridiculous latency of your SSH session to the desktop, we will get rid of the transmit and ethernet buffers.


# ifconfig eth0 | grep txqueuelen
          collisions:0 txqueuelen:1000
# ethtool -g eth0
Ring parameters for eth0:
[...]
Current hardware settings:
[...]
TX:             511

We start by changing the txqueuelen value on the laptop machine from 1000 to zero. The latency will not change.
# ifconfig eth0 txqueuelen 0
64 bytes from 192.168.1.220: icmp_req=1460 ttl=64 time=970 ms
64 bytes from 192.168.1.220: icmp_req=1461 ttl=64 time=967 ms

Then we reduce the size of the TX ring of the ethernet card. Now that we don't have any buffers anymore, let's see what happens:
# ethtool -G eth0 tx 32
64 bytes from 192.168.1.220: icmp_req=1495 ttl=64 time=937 ms
64 bytes from 192.168.1.220: icmp_req=1499 ttl=64 time=0.865 ms
64 bytes from 192.168.1.220: icmp_req=1500 ttl=64 time=60.3 ms
64 bytes from 192.168.1.220: icmp_req=1501 ttl=64 time=53.1 ms
64 bytes from 192.168.1.220: icmp_req=1502 ttl=64 time=49.2 ms
64 bytes from 192.168.1.220: icmp_req=1503 ttl=64 time=45.7 ms

The latency just got divided by 20! We dropped from almost one second to barely 50ms. This is the effect of excessive buffering in a network, and this is what happens, today, in most Internet routers.

6.1 What happens in the buffer?
If we take a look at the Linux networking stack, we see that the TCP stack is a lot above the transmit queue and ethernet buffer. During a normal TCP connection, the TCP stack starts sending and receiving packets at a normal rate, and accelerates its sending speed at an exponential rate: send 2 packets, receive ACKs, send 4 packets, receive ACKs, send 8 packets, receive ACKs, send 16 packets, receive ACKs, etc.
This is known as the TCP Slow Start [http://tools.ietf.org/html/rfc5681]. This mechanism works fine in practice, but the presence of large buffers will break it.
A buffer of 1MB on a 1Gbit/s link will empty in ~8 milliseconds. But the same buffer on a 1Mbit/s link will take 8 seconds to empty. During those 8 seconds, the TCP stack thinks that all of the packets it sent have been transmitted, and will probably continue to increase its sending speed. The subsequent packets will get dropped, the TCP stack will panic, drop its sending rate, and restart the slow start procedure from 0: 2 packets, get ACK, 4 packets, get ACK, etc.
But while the TCP stack was filling up the TX buffers, all the other packets that our system wanted to send got either stuck somewhere in the queue, with several hundreds of milliseconds of delay before being transmitted, or purely dropped.
The problem happens on the TX queue of the sending machine, but also on all the buffers of the intermediary network devices. And this is why Gettys went to war against the home router vendors.

Discussion
Paul Bixel, 2011/12/08 00:46
This is an interesting article and I am especially interested in your discussion of the linklayer atm option now supported by TC. This kind of explanation is needed because there is so little written about the linklayer option and the proper settings for mtu/mpu/tsize & overhead.
In your discussion you mention the overhead parameter is defaulted to 5 and it is implied that it is not necessary therefore to specify it when atm is used. But according to http://ace-host.stuart.id.au/russell/files/tc/tc-atm [http://ace-host.stuart.id.au/russell/files/tc/tc-atm] the overhead parameter is variable and much larger than 5. Someone is not correct.

Jussi Kivilinna, 2011/12/24 12:57
Just a note on overhead/linklayer options: since 2.6.27 there has been a generic tc stab option that adds overhead/linklayer support to all qdiscs. Some documentation at: http://www.linuxhowtos.org/manpages/8/tc-stab.htm [http://www.linuxhowtos.org/manpages/8/tc-stab.htm]


