Professional Documents
Culture Documents
Thebuffersusedbythekerneltomanagenetworkpacketsarereferredtoassk_buffsinLinux.(Their
BSDcounterpartsarereferredtoasmbufs).Thebuffersarealwaysallocatedasatleasttwoseparate
components:afixedsizeheaderoftypestructsk_buff;andavariablelengtharealargeenoughtohold
allorpartofthedataofasinglepacket.
Theheaderisalargestructureinwhichthefunctionofmanyoftheelementsisfairlyobvious,butthe
useofsomeothers,especiallythelengthrelatedfieldsandthepointersintothedatacomponentare
sometimesnotespeciallyclear.
1
Protocolheaderpointers
Thenextmajorsectioncontainsdefinitionsofpointerstotransport,network,andlinkheadersas
unionssothatonlyasinglewordofstorageisallocatedforeachlayer'sheaderpointer.Notallof
thesepointerswillbevalidallofthetime.Infactontheoutputpath,allofthepointerswillbeinvalid
initiallyandshouldthusbeusedwithcare!
240
241 union {
242 struct tcphdr *th;
243 struct udphdr *uh;
244 struct icmphdr *icmph;
245 struct igmphdr *igmph;
246 struct iphdr *ipiph;
247 struct ipv6hdr *ipv6h;
248 unsigned char *raw;
249 } h; // <--- Transport header address
250
251 union {
252 struct iphdr *iph;
253 struct ipv6hdr *ipv6h;
254 struct arphdr *arph;
255 unsigned char *raw;
256 } nh; // <--- Network header address
257
258 union {
259 unsigned char *raw;
260 } mac; // <--- MAC header address
2
Routingrelatedentries
Thdst_entrypointerisanextremelyimportantfieldisapointertotheroutecacheentryusedtoroute
thesk_buff.Thisroutecacheelementtowhichitpointscontainspointerstofunctionsthatare
invokedtoforwardthepacket.Thispointerdstmustpointtoavalidroutecacheelementbeforea
bufferispassedtotheIPlayerfortransmission.
Thesec_pathpointerisarelativelynewoptionalfieldwhichsupportsadditional"hooks"fornetwork
security.
Thecontrolbufferisanejunkyardthatcanbeusedasascratchpadduringprocessingbyagiven
layeroftheprotocol.ItsmainuseisbytheIPlayertocompileheaderoptions.
265 /*
266 * This is the control buffer. It is free to use for every
267 * layer. Please put your private variables there. If you
268 * want to keep them across layers you have to skb_clone()
269 * first. This is owned by whoever has the skb queued ATM.
270 */
271 char cb[48];
272
3
Lengthfields
Theusageoflen,data_len,andtruesizeareeasytoconfuse.
Thevalueoftruesizeisthelengthofthevariablesizedatacomponent(s)plusthesizeofthe
sk_buffheader.Thisistheamountthatischargedagainstthesock'ssendorreceivequota.
Thevaluesoftheothertwoaresettozeroatallocationtime.
Whenapacketisreceived,thelenfieldissettothesizeofacompleteinputpacketincluding
headers.Thisvalueincludesdatainthekmalloc'dpart,fragmentchainand/orunmappedpage
buffers.Asheadersareremovedoraddedthevalueoflenisdecrementedandincremented
accordingly.
Thevalueofthedata_lenfieldisthenumberofbytesinthefragmentchainandinunmapped
pagebuffersandisnormally0.
4
Referencecounting
Referencecountingisacriticallyimportanttechniquethatisusedtopreventbothmemoryleaksand
invalidpointeraccesses.Itisusedinallnetworkdatastructuresthataredynamicallyallocatedand
freed.Unfortunatelythereisnostandardnameforeitherthevariablethatcontainsthereference
countnorthehelperfunction(ifany)thatmanipulatesit:(
Theatomicvariableuserscountsthenumberofprocessesthatholdareferencetothesk_buff
structureitself.
Itisincrementedwheneverthebufferisshared.
Itisdecrementedwhenabufferislogicallyfreed.
The buffer is physically freed only when the reference count reaches 0.
5
Pointersintothedataarea
Thesepointersallpointintothevariablesizecomponentofthebufferwhichactuallycontainsthe
packetdata.Atallocationtime
data,head,andtailinitiallypointtothestartoftheallocatedpacketdataareaand
endpointstotheskb_shared_infostructurewhichbeginsatnextbytebeyondtheareaavailable
forpacketdata.
Alargecollectionofinlinefunctionsdefinedininclude/linux/skbuff.hmaybeusedinadjustmentof
data,tail,andlenasheadersareaddedorremoved.
6
MACHeaderdefinition
LinuxprefersthestandardDIXethernetheaderto802.x/803.xframing.However,thelatterareboth
alsosupported.
93 struct ethhdr
94 {
95 unsigned char h_dest[ETH_ALEN]; /* dest eth addr */
96 unsigned char h_source[ETH_ALEN]; /* src eth addr */
97 unsigned short h_proto; /* packet type*/
98 };
IPHeader
7
Theskb_shared_infostructure
Thestructskb_shared_infodefinedininclude/linux/skbuff.hisusedtomanagefragmentedbuffers
andunmappedpagebuffers.Thisstructureresidesattheendofthekmalloc'ddataareaandis
pointedtobytheendelementofthestructsk_buffheader.Theatomicdatarefisareferencecounter
thatcountsthenumberofentitiesthatholdreferencestothekmalloc'ddataarea.
Whenabufferisclonedthesk_buffheaderiscopiedbutthedataareaisshared.Thuscloning
incrementsdatarefbutnotusers.
Functionsofstructureelements:
dataref Thenumberofusersofthedataofthissk_buffthisvalueisincremented
eachtimeabufferiscloned.
frag_list IfnotNULL,thisvalueispointertothenextsk_buffinthechain.The
fragmentsofanIPpacketundergoingreassemblyarechainedusingthis
pointer.
frags Anarrayofpointerstothepagedescriptorsofupunmappedpage
buffers.
nr_frags Thenumberofelementsoffragsarrayinuse.
8
Supportforfragmenteddatainsk_buffs
Theskb_shared_infostructureisusedwhenthedatacomponentofasinglesk_buffconsistsof
multiplefragments.Thereareactuallytwomechanismswithwhichfragmentedpacketsmaybe
stored:
The*frag_listpointerisusedtolinkalistofsk_buffheaderstogether.Thismechanismis
usedatreceivetimeinthereassemblyoffragmentedIPpackets.
Thenr_fragscounterandthefrags[]arrayareusedforunmappedpagebuffers.Thisfacility
wasaddedinkernel2.4andispresumablydesignedtosupportsomemannerofzerocopy
facilityinwhichpacketsmaybereceiveddirectlyintopagesthatcanbemappedintouser
space.
Thevalueofthedata_lenfieldrepresentsthesumtotalofbytesresidentinfragmentlistsand
unmappedpagebuffers.
Exceptforreassemblyoffragmentedpacketsthevalueofdata_lenisalways0.
Typicalbufferorganization
Thefragmentlistandunmappedbufferstructuresleadtoarecursiveimplementationof
checksumminganddatamovementcodethatisquitecomplicatedinnature.
Fortunately,inpractice,anunfragmentedIPpacketalwaysconsistsofonly:
Aninstanceofthestructsk_buffbufferheader.
Thekmalloc'd``data''areaallocatedholdingbothpacketheadersandata.
9
Unmappedpagebuffers
Theskb_frag_tstructurerepresentsanunmappedpagebuffer.
Functionsofstructureelements:
page Pointertoastructpagewhichcontrolstherealmemorypageframe.
offset Offsetinpagefromwheredataisstored.
size Lengthofthedata.
10
Managementofbuffercontentpointers
Fivefieldsaremostimportantinthemanagementofdatainthekmalloc'dcomponentofthebuffer.
head Pointstothefirstbyteofthekmalloc'dcomponent.Itissetatbuffer
allocationtimeandneveradjustedthereafter.
end Pointstothestartoftheskb_shared_infostructure(i.e.thefirstbyte
beyondtheareainwhichpacketdatacanbestored.)Itisalsosetat
bufferallocationtimeandneveradjustedthereafter.
data Pointstothestartofthedatainthebuffer.Thispointermaybe
adjustedforwardorbackwardasheaderdataisremovedoraddedtoa
packet.
tail Pointstothebytefollowingthedatainthebuffer.Thispointermay
alsobeadjusted.
len Thevalueoftaildata.
Othertermsthatarecommonlyencounteredinclude:
headroom Thespacebetweentheheadanddatapointers
tailroom Thespacebetweenthattailandendpointers.
Initiallyhead=data=tailandlen=0.
11
Buffermanagementconveniencefunctions
Linuxprovidesanumberofconveniencefunctionsformanipulatingthesefields.Notethat
noneofthesefunctionsactuallycopiesanydataintooroutofthebuffer!
Theyaregoodtousebecausetheyprovidebuiltinchecksforvariousoverflowandunderflow
errorsthatifundetectedcancauseunpredictablebehaviorforwhichthecausecanbevery
hardtoidentify!
12
Reservingspaceattheheadofthebuffer
Theskb_reserve()functiondefinedininclude/linux/skbuff.hiscalledtoreserveheadroomforthe
hardwareheaderwhichshallbefilledinlater.Sincetheskb>headpointeralwayspointstothestart
ofthekmalloc'darea,thesizeoftheheadroomisdefinedasskb>dataskb>head.Theheadpointer
isleftunchanged,thedataandtailpointersareadvancedbythespecifiedamount.
Atransportprotocolsendroutinemightusethisfunctiontoreservespaceheadersandpointdatato
wherethedatashouldbecopiedfromuserspace.
13
Appendingdatatothetailofabuffer
Theskb_put()functioncanbeusedtoincrementthelenandtailvaluesafterdatahasbeenplacedin
thesk_buff().Theactualfillingofthebufferismostcommonlyperformedby
aDMAtransferoninputor
acopy_from_user()onoutput.
The transport protocol might use this function after copying the data from user space.
14
Insertingnewdataatthefrontofbuffer.
Theskb_push()functiondecrementsthedatapointerbythelenpassedinandincrementsthevalueof
skb>lenbythesameamount.Itisusedtoextendthedataareabacktowardtheheadendofthe
buffer.Itreturnsapointerthenewvalueofskb>data.
ThetransportlayerprotocolmightusethisfunctionwhenpreparingtobuildtransportandIPheaders.
15
Removingdatafromthefrontofthebuffer
Theskb_pull()functionlogiciallyremovesdatafromthestartofabufferreturningthespacetothe
headroom.Itincrementstheskb>datapointeranddecrementsthevalueofskb>leneffectively
removingdatafromtheheadofabufferandreturningittotheheadroom.Itreturnsapointertothe
newstartofdata.
Thereceivesideofthetransportlayermightusethisfunctionduringreceptionwhenremovinga
headerfromthepacket.
TheBUG_ONconditionwillraisedifanattemptismadetopullmoredatathanexistscausingskb
>lentobecomenegativeorifatattemptismadetopullacrosstheboundarybetweenthekmalloc'd
partandthefragmentchain.
16
Removingdatafromthetailofabuffer
Theskb_trim()functioncanbeusedtodecrementthelengthofabufferandmovethetailpointer
towardthehead.Thenewlengthnottheamounttobetrimmedispassedin.Thismightbedoneto
removeatrailerfromapacket.Theprocessisstraightforwardunlessthebufferisnonlinear.Inthat
case,___pskb_trim()mustbecalledanditbecomesyourworstnightmare.
17
Obtainingtheavailableheadandtailroom.
Thefollowingfunctionsmaybeusedtoobtainthelengthoftheheadroomandtailroom.Ifthebuffer
isnonlinear,thetailroonis0byconvention.
18
Determininghowmuchdataisinthekmalloc'dpartofthebuffer..
Theskb_headlen()functionreturnsthelengthofthedatapresentlyinthekmalloc'dpartofthebuffer.
Thissectionissometimesreferredtoastheheader(eventhoughthestructsk_buffitselfismore
properlyreferredtoasthebufferheader.)
Nonlinearbuffers
Abufferislinearifandonlyifallthedataiscontainedinthekmalloc'dheader.
Theskb_is_nonlinear()returnstrueifthereisdatainthefragmentlistorinunmappedpagebuffers.
19
Managinglistsofsk_buffs
Buffersawaitingprocessingbythenextlayerofthenetworkstacktypicallyresideinlinkedliststhat
arecalledbufferqueues.Thestructurebelowdefinesabufferqueueheader.Becausethesk_buff
structurealsobeginswith*nextand*prevpointers,pointerstosk_buffandsk_buff_headare
sometimesusedinterchangably.
Empty list
20
Queuemanagementfunctions
Anumberoffunctionsareprovidedbythekerneltosimplifyqueuemanagementoperationsandthus
improvetheirreliability.Thesefunctionsaredefinedininclude/linux/skbuff.h.
Obtainingapointertothefirstbufferinthequeue.
Theskb_peek()functionmaybeusedtoobtainapointertothefirstelementinanonemptyqueue.
Notethatsk_buff_headandsk_buffpointersareusedinterchangablyinline569.This(bad)practice
workscorrectlybecausethefirsttwoelementsofthesk_buff_headstructurearethesameasthoseof
thesk_buff.Ifthenextpointerpointsbacktotheheader,thelistisemptyandNULLisreturned.
21
Testingforanemptyqueue.
Theskb_queue_empty()functionsreturnstrueifthequeueisemptyanfalseifitisnot.
22
Removalofbuffersfromqueues
Theskb_dequeue()functionisusedtoremovethefirstbufferfromtheheadofthespecifiedspecified
queue.Itcalls__skb_dequeueafterobtainingthelist'sassociatedlock.
23
Themechanicsofdequeue
The__skb_dequeue()functiondoestheworkofactuallyremovingansk_bufffromthereceive
queue.Sincethesk_buff_headstructurecontainsthesamelinkpointersasanactualsk_buffstructure,
itcanmasqueradeasalistelementasisdoneviathecastinline708.
Inline708previssettopointtothesk_buff_head.Theninline709,thelocalvariablenextreceives
thevalueofthenextpointerinthesk_buff_head.Thetestinline711checkstoseeifthenextpointer
stillpointstothesk_buff_head.Ifsothelistwasempty.Ifnotthefirstelementisremovedfromthe
listanditslinkfieldsarezeroed.
24
Addingbufferstoqueues
SincebufferqueuesareusuallymanagedinaFIFOmannerandbuffersareremovedfromtheheadof
thelist,theyaretypicallyaddedtoalistwithskb_queue_tail().
25
Themechanicsofenqueue
Theactualworkofenqueuingabufferonthetailofaqueueisdonein__skb_queue_tail().
Thesk_buff_headpointerinthesk_buffissetandthelengthfieldinthesk_buff_headisincremented.
(Thesetwolinesarereversedfromkernel2.4.x.)
686 list->qlen++;
687 next = (struct sk_buff *)list;
Herenextpointstothesk_buff_headstructureandprevpointtothesk_buffstructurethatwas
previouslyatthetailofthelist.Notethattheliststructureiscircularwiththeprevpointerofthe
sk_buff_headpointingtothelastelementofthelist.
26
Removalofallbuffersfromaqueue
Theskb_queue_purge()functionmaybeusedtoremoveallbuffersfromaqueueandfreethem.This
mightbeusedwhenasocketisbeingclosedandthereexistreceivedpacketsthathavenotyetbeen
consumedbytheapplication.
Whenabufferisbeingfreedbesuretousekfree_skb()andnotkfree().
Thisversionfromskbuff.hmaybeusedifandonlyifthelistlockisheld.
27
Allocationofsk_buffsfortransmission
Thesock_alloc_send_skb()functionresidesinnet/core/sock.c.Itisnormallycalledforthispurpose.
Itisaminimalwrapperroutinethatsimplyinvokesthesock_alloc_send_pskb()functiondefinedin
net/core/sock.c,withdata_lenparametersettozero.Thesizefieldpassedhashistoricallyhadthe
value:userdatasize+transportheaderlength+IPheaderlength+devicehardwareheaderlength
+15.Theremaybeanewhelperfunctiontocomputethesizenow.Whenyoucall
sock_alloc_send_skb(),youmustsetnoblockto0.
Andwhenyouallocateasupervisorypacketinthecontextofasoftirqyoumustusedev_alloc_skb().
28
Whensock_alloc_send_pskb()isinvokedontheUDPsendpathviathefastIPbuildroutine,the
variableheader_lenwillcarrythelengthascomputedonthepreviouspageandthevariabledata_len
willalwaysbe0.Examinationofthenetworkcodefailedtoshowanyevidenceofanonzerovalueof
data_len.
Thesock_sndtimeo()functiondefinedininclude/net/sock.hreturnsthesndtimeovaluesetby
sock_init_datatoMAX_SCHEDULE_TIMEOUTwhichinturnisdefinedasLONG_MAXfor
blockingcallsandreturnstimeoutaszerofornonblockingcalls.
29
Themainallocationloop.
Arelativelylongloopisenteredhere.Ifnotransmitbufferspaceisavailabletheprocesswillsleep
viathecalltosock_wait_for_wmem()whichappearsatline1213.Thefunctionsock_error()retrieves
anyerrorcodethatmightbepresent,andclearsitatomicallyfromthesockstructure.
Exitconditionsinclude
successfulallocationofthesk_buff,
anerrorconditionreturnedbysock_error,closingofthesocket,and
receiptofasignal.
30
Verifyingthatquotaisnotexhausted.
sock_alloc_send_pskb()willallocateansk_buffonlyiftheamountofsendbufferspace,
sk>wmem_alloc,thatiscurrentlyallocatedtothesocketislessthanthesendbufferlimit,
sk>sndbuf.Thebufferlimitisinheritedfromthesystemdefaultsetduringsocketinitialization.
Ifallocationworked,skbwillholdtheaddressofthebufferotherwiseitwillbe0.Allocationwillfail
onlyincaseofsomecatastrophickernelmemoryexhaustion.
1168 if (skb) {
1169 int npages;
1170 int i;
1171
1172 /* No pages, we're done... */
1173 if (!data_len)
1174 break;
Atthispointinthecodeissomeawfulstuffinwhichunmappedpagebuffersareallocated.Wewill
skipoverthis.
31
Arrivalheremeansalloc_skb()returned0.
Sleepinguntilwmemisavailable
Ifcontrolreachesthebottomoftheloopinsock_alloc_send_pskb(),thennospacewasavailableand
iftherequesthasnottimedoutandthereisnosignalpendingthenitisnecessarytosleepwhilethe
linklayerconsumessomepackets,transmitsthemandthenreleasesthebufferspacetheyoccupy.
32
Thisistheendofsock_alloc_send_pskb.Thefunctionskb_set_owner_w()
setstheownerfieldofthesk_bufftosk
callssock_hold()toincrementtherefcountofthestructsock.
addsthetruesizetosk_wmem_alloc
andsetsthedestructorfunctionfieldoftheskbtosock_wfree.
33
Thealloc_skb()function
Theactualallocationofthesk_buffheaderstructureandthedataareaisperformedbythealloc_skb()
functionwhichisdefinedinnet/core/skbuff.cCommentsattheheadofthefunctiondescribeits
operation:
``Allocateanewsk_buff.Thereturnedbufferhasnoheadroomandatailroomofsize
bytes.Theobjecthasareferencecountofone.Thereturnisthebuffer.Onafailurethe
returnisNULL.Buffersmayonlybeallocatedfrominterrupts/bottomhalvesusinga
gfp_maskofGFP_ATOMIC.''
Thehardcoded0inthecallto__alloc_skb()saysnottoallocatefromthefclonecache.
34
The__alloc_skb()function
Therealworkisdonehere.Thewrapperonthepreviouspageonlysetsthefcloneflagto0.Acloned
bufferisoneinwhichtwostructsk_buffscontrolthesamedataarea.Becausereliabletransfer
protocolsusuallymakeexactlyonecloneofEVERYbuffer,eachallocationfromthefclonecache
returnstwoadjacentsk_buffheaders.
Clonedandnonclonedbufferheadersnowareallocatedfromseparatecaches.
Theheadisthestructsk_buff
35
Thedataportionisallocatedfromoneofthe"general"caches.Thesecachesconsistsofblocksthat
aremultiplesofpagesize,andallocationoccursusingabestfitstrategy.
Allelementsofthestructsk_buffuptothetruesizefieldaresetto0.Thenthehead,tail,data,and
endpointersaresettocorrectinitialstate.
Finallytheskb_shared_infostructureatthetailofthekmalloc'edpartisinitialized.Whymustitbe
donesequentially?
36
Managingfclones.
Thislooksseriouslyugly...Anfclonemustimmediatelyfollowtheparentinmemory.Thetermchild
referstothepotentialclonethatimmediatelyfollowstheparentinmemory.Furthermorethereisan
unnamedatomicvariablefollowingthechildbufferinthefclonecache.Thisvariableisalways
accessedusingthepointernamefclone_refandcountsthetotalnumberofreferencescurrentlyheld
fortheparent+child.
Heretheatomicfclone_refissetto1.ThefclonestateoftheparentissettoFCLONE_ORIGwhich
makessense,butthestateofthechildissettoFCLONE_UNAVAILABLEwhichseemsjust
backwardtomebecausethechildisnowAVAILABLEforuseincloning.
Itappearsthatifthebufferdidn'tcomefromthefclonecachethattheskb>fcloneflagisimplicitlyset
toFCLONE_UNAVAILABLE(0)bythememset().Ugh.
227 enum {
228 SKB_FCLONE_UNAVAILABLE,
229 SKB_FCLONE_ORIG,
230 SKB_FCLONE_CLONE,
231 };
180 if (fclone) {
181 struct sk_buff *child = skb + 1;
182 atomic_t *fclone_ref = (atomic_t *) (child + 1);
183
184 skb->fclone = SKB_FCLONE_ORIG;
185 atomic_set(fclone_ref, 1);
186
187 child->fclone = SKB_FCLONE_UNAVAILABLE;
188 }
189 out:
190 return skb;
191 nodata:
192 kmem_cache_free(cache, skb);
193 skb = NULL;
194 goto out;
195 }
37
The old version
163
164 struct sk_buff *alloc_skb(unsigned int size,int gfp_mask)
165 {
166 struct sk_buff *skb;
167 u8 *data;
alloc_skb()ensuresthatwhencalledfromaninterrupthandler,itiscalledusingtheGFP_ATOMIC
flag.Inearlierincarnationsofthecodeitloggedupto5instancesofawarningmessagesifsuchwas
notthecase.Nowitsimplycrashesthesystem!
38
Allocationoftheheader
Thestructsk_buffheaderisallocatedeitherfromthepoolorfromthecacheviatheslaballocator.A
poolisatypicallysmalllistofobjectsnormallymanagedbytheslaballocatorthathaverecentlybeen
releasedbyaspecificprocessorinanSMPcomplex.Thusthereisonepoolperobjecttypeper
processor.Theobjectiveisofpoolusageisto:
toavoidspinlockingand
toobtainbettercachebehaviorbyattemptingtoensurethatanobjectthathasbeenrecently
usedisreallocatedtotheCPUthatlastusedit.
Allocatingthedatabuffer
SKB_DATA_ALIGNincrementssizetoensurethatsomemannerofcachelinealignmentcanbe
achieved.Notethattheactualalignmentdoesnotoccurhere.
39
Headerinitialization
truesizeholdstherequestedbuffer'ssize+thesizeofofthesk_buffheader.Itdoesnotincludeslab
overheadortheskb_shared_info.Initially,allthespaceinthebuffermemoryisassignedtothetail
component.
No fragments
209 skb_shinfo(skb)->nr_frags = 0;
210 skb_shinfo(skb)->frag_list = NULL;
211 return skb;
212
213 nodata:
214 skb_head_to_pool(skb);
215 nohead:
216 return NULL;
217 }
40
Waitinguntilmemorybecomesavailable
Ifaprocessentersarapidsendloop,datawillaccumulateinsk_buffsfarfasterthanitcanbe
transmitted.Whenthesendingprocesshasconsumeditswmemquotaitisputtosleepuntilspaceis
recoveredthroughsuccessfultransmissionofpacketsandsubsequentreleaseofthesk_buffs.
FortheUDPpaththevalueoftimeoiseither
0forsocketswiththenonblockingattributeor
themaximumpossibleunsignedintforallothers.
Whenwebuildaconnectionprotocol,youcancopythiscodeasabasisforwaitinginsideacallto
cop_listen().
41
Sleep/wakeupdetails
Atimeoof0willhavecausedajumptothefailureexit.Arrivalheregenerallymeanswaitforever.
Thesomewhatcomplex,multistepprocedureusedtosleepisnecessarytoavoidanastyrace
conditionthatcouldoccurwithtraditionalinterruptible_sleep_on()/wake_up_interruptible()
synchronization.
Aprocessmighttestforavailablememory,
thenmemorybecomesavailableinasoftirqandawakeupbeissued,
thentheprocessgoestosleep
possiblyforalongtime.
42
Mechanicsofwait
Thissituationisavoidedbyputtingthetask_structonthewaitqueuebeforetestingforavailable
memoryandisexplainedwellintheLinuxDeviceDriversbook.
Thestructsockcontainsavariable,wait_queue_head_t*sk_sleep,thatdefinesthewaitqueueon
whichtheprocesswillsleep.Thelocalvariablewaitisthewaitqueueelementthatthe
prepare_to_wait()functionwillputonthequeue.Thecalltoschedule_timeo()actuallyinitiatesthe
wait.
43
Chargingtheownerforallocatedwritebufferspace.
Theskb_set_owner_w()functionsetsupthedestructorfunctionand"bills"theownerfortheamount
ofspaceconsumed.Thecalltosock_holdincrementssk>refcntonthestructsocktoindicatethat
thissk_buffholdsapointertothestructsock.Thisreferencewillnotbereleaseduntilsock_putis
calledbythedestructorfunction,sock_wfree(),atthetimethesk_buffisfreed.
44
Thekfree_skbfunction.
Thekfree_skb()functionatomicallydecrementsthenumberofusersandinvokes__kfree_skb()to
actuallyfreethebufferwhenthenumberofusersbecomes0.Thestandardtechniqueofreference
countingisemployed,butinawaythatissomewhatsubtle.
Iftheatomic_read()returns1,thenthisthreadofcontrolistheonlyentitythatholdsapointertothis
skb_buff.Thesubtlepartoftheprocedureisthatthisalsoimpliesthereisnowayanyotherentityis
goingtobeabletoobtainareference.Sincethisentityholdstheonlyreference,itwouldhaveto
provideitandthisentityisnotgoingtodothat.
Iftheatomic_read()returns2,forexample,thereisanexposuretoaracecondition.Bothentitiesthat
holdreferencescouldsimultaneouslydecrementwiththeresultbeingthatbothreferenceswerelost
without__kfree_skb()everbeingcalledatall.
Theatomic_dec_and_test()definedininclude/asm/atomic.hresolvesthatpotentialproblem.It
atomicallydecrementsthereferencecounterandreturnstrueonlyifthedecrementoperationproduced
0.
45
Freeingansk_bufftheoldway
Thekfree_skb()functionatomicallydecrementsthenumberofusersandinvokes__kfree_skb()to
actuallyfreethebufferwhenthenumberofusersbecomes0.Thestandardtechniqueofreference
countingisemployed,butinawaythatissomewhatsubtle.
Iftheatomic_read()returns1,thenthisthreadofcontrolistheonlyentitythatholdsapointertothis
skb_buff.Thesubtlepartoftheprocedureisthatthisalsoimpliesthereisnowayanyotherentityis
goingtobeabletoobtainareference.Sincethisentityholdstheonlyreference,itwouldhaveto
provideitandthisentityisnotgoingtodothat.
Iftheatomic_read()returns2,forexample,thereisanexposuretoaracecondition.Bothentitiesthat
holdreferencescouldsimultaneouslydecrementwiththeresultbeingthatbothreferenceswerelost
without__kfree_skb()everbeingcalledatall.
Theatomic_dec_and_test()definedininclude/asm/atomic.hresolvesthatpotentialproblem.It
atomicallydecrementsthereferencecounterandreturnstrueonlyifthedecrementoperationproduced
0.
46
The__kfree_skb()function
The__kfree_skb()functionusedtoensurethatthesk_buff()doesnotbelongtoanybufferlist.It
appearsthatisnolongerdeemednecessary.
Thedst_entryentityisalsoreferencecounted.Thestructrtablewillactuallybereleasedonlyifthis
bufferholdsthelastreference.Thecalltothedestructor()functionadjuststheamountofsndbuf
spaceallocatedtostructsockthatownsthebuffer.
__kfree_skbalsousedtoinitializethestateofthestructsk_buffheaderviatheskb_headerinit
function.Thekfree_skbmem()functionreleasesallassociatedbufferstorageincludingfragments.The
structsk_buffusedtobereturnedtothecurrentprocessor'spoolunlessthepoolisalreadyfullin
whichcaseitwasreturnedthecache.Poolsseemtohavegoneaway.
393 kfree_skbmem(skb);
394 }
47
Freeingthethedataandtheheaderwithkfree_skbmem()
Thekfree_skbmem()functioninvokesskb_release_data()tothefreethedata.Itusedtocall
skb_head_to_pooltoreturnthestructsk_bufftotheperprocessorcache.Nowacomplexsetof
operationsregardingthefclonestateareperformed.
Recallthatthepossiblesettingsoftheskb>fcloneflagare:
227 enum {
228 SKB_FCLONE_UNAVAILABLE,
229 SKB_FCLONE_ORIG,
230 SKB_FCLONE_CLONE,
231 };
Ifthebufferdidn'tcomefromthefclonecache,itsflagwillbesettoSKB_FCLONE_UNAVAILABLE.
Ifthebufferistheparentanddidcomefromthefclonecache.Theflagwillbesetto
SKB_FCLONE_ORIG.Ifthebufferisthechildandcamefromthefclonecache,theflagwillbeset
toSKB_FCLONE_UNAVAILABLEifthebufferisavailableforuse,butitwillbesetto
SKB_FCLONE_CLONEifthebufferisinuse.Anavailablebufferwillneverbefreed.Therefore,
iftheflagsaysSKB_FCLONE_UNAVAILABLE,thenthisisastandalonebuffernotonfromthe
fclonecache.Simple,no?Tohavereachedthispointinthecodeskb>usersisguaranteedtobe1.
Sonofurthertestingisneeded.
48
Thisistheparentofthetwobufferpair.Theatomicvariablefollowingthechildcountstotal
referencestotheparentandchild.(Itwassettoonewhentheparentwasallocatedbutbeforeany
cloninghastakenplace.Freeingtheparentimplicitlyfreesthechildclone,andwedon'tknow
whethertheparentorthechildwillbefreedfirst.Therefore,theunnamedatomicvariablefollowing
thechildmustbe1inordertofreetheparent.Sincethisatomicvariablehasnonameitissomewhat
difficulttofindallreferencestoit.
Thisisthechildclone.Itismadeavailableforcloningagainbyjustresettingthefcloneflagto
FCLONE_UNAVAILABLE.Butiftheparenthasalreadybeenfreed,thenfreeingthechildwill
causea"real"free.
49
Releasingunmappedpagebuffers,thefragmentlist,andthekmalloc'darea
Theskb_release_data()functioncallsput_page()tofreeanyunmappedpagebuffers,
skb_drop_fraglist()tofreethefragmentchain,andthencallskfree()tofreethekmalloc'edcomponent
thatnormallyholdsthecompletepacket.
Thedatamaybereleasedonlywhenitisassuredthatnoentityholdsapointertothedata.Ifthe
clonedflagisnotsetitisassumedthatwhoeverisattemptingtofreethesk_buffheaderistheonly
entitythatheldapointertothedata.
Iftheclonedflagisset,thedatarefreferencecountercontrolsthefreeingofthedata.Unfortunately
thedatareffieldhasnowbeensplitintotwobitfields.Itisshownintheskb_clone()functionthatthe
clonedflagissetintheheaderofboththeoriginalbufferandtheclonewhenansk_buffiscloned.
Wedividedatarefintotwohalves.Thehigher16bitsholdreferencestothepayloadpartofskb
>data.Thelower16bitsholdreferencestotheentireskb>data.Itisuptotheusersoftheskbto
agreeonwherethepayloadstarts.Allusersmustobeytherulethattheskb>datareferencecount
mustbegreaterthanorequaltothepayloadreferencecount.Holdingareferencetothepayload
partmeansthattheuserdoesnotcareaboutmodificationstotheheaderpartofskb>data.
50
Theoldversionofskb_release_data
51
Releasing the fragment list
277
278 static void skb_drop_list(struct sk_buff **listp)
279 {
280 struct sk_buff *list = *listp;
281
282 *listp = NULL;
283
284 do {
285 struct sk_buff *this = list;
286 list = list->next;
287 kfree_skb(this);
288 } while (list);
289 }
290
291 static inline void skb_drop_fraglist(struct sk_buff *skb)
292 {
293 skb_drop_list(&skb_shinfo(skb)->frag_list);
294 }
Question:Howdoestheloopterminationlogicwork?
52
Freeingthestructsk_bufftheoldway.
Theskb_head_to_pool()functionreleasesthesk_buffstructure.Whetherthesk_buffisreturnedtothe
cacheorplacedontheperprocessorhotlistdependsuponthepresentlengthofthehotlistqueue.
Recallthatthermem,wmemquotasalsolivein/proc/sys/net/core.
53
Thewritebufferdestructorfunction
Whenthedestructorfunctionsock_wfree()isinvoked,itdecrementsthewmem_alloccounterbythe
truesizefieldandwillwakeupaprocessthatissleepingonthesocketifappropriate.
Thecalltosock_put()undoesthecalltosock_hold()madeinskb_set_owner_w()indicatingthe
sk_buffnolongerholdsapointertothestructsock,Thevalueofsk>use_write_queueissetto1by
TCPbutisnotsetbyUDP.Therefore,sock_def_write_spacewillbecalledforaUDPsocket.
54
Thesock_putfunction
Sincethestructsocketalsoholdsapointertothestructsockthiswillalwaybejustadecrementwhen
sk_buffsarebeingfreed.Ifsk_refcntweretoequal1whencalledbysock_wfree()itwouldbea
catastrophicfailure!!!
55
Wakingaprocesssleepingonwmem.
Thedefaultwritespacefunctionissock_def_write_space().Itwillnotattempttowakeupawaiting
processuntilatleasthalfofthesndbufspaceisfree.Italsohastoensurethatthereisasleeping
processbeforeawakeupisattempted.
56
Devicedriverallocationofsk_buffs
Whereastransportprotocolsmustallocatebuffersfortransmittraffic,itisnecessaryfordevice
driverstoallocatethebuffersthatwillholdreceivedpackets.Thedev_alloc_skb()functiondefinedin
linux/skbuff.hisusedforthispurpose.Thedev_alloc_skb()isoftencalledinthecontextofahardor
softIRQandthusmustuseGFP_ATOMICtoindicatethatsleepingisnotanoptionifthebuffer
cannotbeallocated.
Accordingtocommentsinthecodethereservationof16bytes(NET_SKB_PAD)ofheadroomis
donefor(presumablycache)optimizations....notforheaderspace.
57
Accountingfortheallocationofreceivebufferspace.
Adevicedriverwillnotcallskb_set_owner_r()becauseitdoesnotknowwhichstructsockwill
eventuallyownthesk_buff.However,whenareceivedsk_buffiseventuallyassignedtoastructsock,
skb_set_owner_r()willbecalled.
Interestingly,unlikeskb_set_owner_w(),Theskb_set_owner_r()functiondoesnotcallsock_hold()
eventhoughitdoesholdapointertothestructsock.Thisseemstosetupthepossibilityofanugly
raceconditionifasocketisclosedaboutthetimeapacketisreceived.
58
Sharingandcloningofsk_buffs
Therearetworelatedmechanismsbywhichmultipleentitiesmayholdpointerstoansk_buff
structureorthedataitdescribes.Ansk_buffissaidtobesharedwhenmorethanoneprocessholdsa
pointertothestructsk_buff.Sharingiscontrolledbytheskb>userscounter.Abuffermaynot
actuallybefreeduntiltheusecountreaches0.Abufferissharedviaacalltoskb_get().
Sharedbuffersmustbeassumedtobereadonly.Specifically,verybadthingswillhappeniftwo
entitiesthatshareabuffertrytoputthebufferondifferentqueues!!!
Asseenpreviously,thekfree_skb()functionwillactuallyfreeabufferonlywhencalledbythelast
userthatholdsareferencetothestructsk_buff.
59
Clonedbuffers
Incontrast,aclonedbufferisoneinwhichmultiplestructskbuffheadersreferenceasingledataarea.
Aclonedheaderisindicatedbysettingtheskb>clonedflag.Thenumberofusersoftheshareddata
areaiscountedbythedatarefelementoftheskb_shared_infostructure.Cloningisnecessarywhen
multipleusersofthesamebufferneedtomakechangestothestructsk_buff.Forexample,areliable
datagramprotocolneedstoretainacopyofansk_buffthathasbeenpassedtothedevlayerfor
transmission.Boththetransportprotocolandthedevlayermayneedtomodifytheskb>nextand
skb>prevpointers.
60
Creatingacloneofansk_buff.
Theskb_clone()functionisdefinedinnet/core/skbuff.c.Itduplicatesthestructsk_buffheader,but
thedataportionremainsshared.Theusecountoftheclonetobesettoone.Ifmemoryallocation
fails,NULLisreturned.Theownershipofthenewbufferisnotassignedtoanystructsock.Ifthis
functioniscalledfromaninterrupthandlergfp_maskmustbeGFP_ATOMIC.
Thepointernisoptimisticallysettotheaddressofthefclone.ThetestforSKB_FCLONE_ORIG
ensuresthatabrokenattempttofcloneabufferfromthestandardcachewillNOTbeattempted.Ifa
successfulfcloneoccursthentheunnamedatomic_tvariablefollowingthefclonewillbecome2.
Herewhenthebufferisallocatedfromtheskbuff_head_cachethefcloneflagisexplicitlysetto0.
Thefcloneflagisatwobitbitfield.
432 n = skb + 1;
433 if (skb->fclone == SKB_FCLONE_ORIG &&
434 n->fclone == SKB_FCLONE_UNAVAILABLE) {
435 atomic_t *fclone_ref = (atomic_t *) (n + 1);
436 n->fclone = SKB_FCLONE_CLONE;
437 atomic_inc(fclone_ref);
438 } else {
439 n = kmem_cache_alloc(skbuff_head_cache, gfp_mask);
440 if (!n)
441 return NULL;
442 n->fclone = SKB_FCLONE_UNAVAILABLE;
443 }
444
61
Therestofthefunctiondealswithcopyingspecificfieldsoneattime.Whynotusememcpyandthen
overridethefieldsthatwedon'twantcopied?
Clonelivesonnolistandhasnowownersocket.
457#ifdef CONFIG_INET
458 secpath_get(skb->sp);
459#endif
460 memcpy(n->cb, skb->cb, sizeof(skb->cb));
461 C(len);
462 C(data_len);
463 C(csum);
464 C(local_df);
465 n->cloned = 1;
466 n->nohdr = 0;
467 C(pkt_type);
468 C(ip_summed);
469 C(priority);
62
470#if defined(CONFIG_IP_VS) || defined(CONFIG_IP_VS_MODULE)
471 C(ipvs_property);
472#endif
473 C(protocol);
Clonemustnothaveadestructortoavoid"doublecredit"forfreeingdata.Forproperaccountingina
realiableprotocol,theclonenottheoriginalmustbepasseddownthestackfortransmissionbecause
theoriginalwillnecessarilybefreedlast.Ifmultipleretransmissionsarerequired,anewclonemust
becreatedforeachretransmission.
However,iffcloningisinusethenewclonejustrecyclethefclonebecauseitwillhavealreadybeen
freedbythetimetheretransmissionoccurs.
63
499 C(truesize);
500 atomic_set(&n->users, 1);
501 C(head);
502 C(data);
503 C(tail);
504 C(end);
505
506 atomic_inc(&(skb_shinfo(skb)->dataref));
507 skb->cloned = 1;
508
509 return n;
510 }
64
Convertingasharedbuffertoaclone
Theskb_share_check()function,definedininclude/linux/skbuff.h,clonesasharedsk_buff.Afterthe
cloningtakesplace, thecallto kfree_skb() decrements skb>users ontheoriginalcopy. Ashared
buffernecessarilyhasausecountexceedingone,andsothecalltokfree_skb()simplydecrementsit.
The skb_shared() inline function returns TRUE if the number of users of the buffer exceeds 1.
65
Obtainingabufferfromoneoftheperprocessorpools
Theskb_head_from_pool()functionusedtoprovidebuffersfromafastaccessperCPUcache.It
detachesandreturnsthefirstsk_buffheaderinthelistorreturnsNULLifthelistisempty.Interrupt
disablementinsteadoflockingcanbeusedbecauseandonlybecausethepoolislocaltotheprocessor.
116 if (skb_queue_len(list)) {
117 struct sk_buff *skb;
118 unsigned long flags;
119
120 local_irq_save(flags);
121 skb = __skb_dequeue(list);
122 local_irq_restore(flags);
123 return skb;
124 }
125 return NULL;
126 }
66
Nonlinearbuffers
Nonlinearsk_buffsarethoseconsistingofunmappedpagebuffersandadditionalchainedstruct
sk_buffs.Probably1/2ofthenetworkcodeinthekernelisdedicatedtodealingwiththerarelyused
abomination.Anonzerovalueofdata_lenisanindicatorofnonlinearity.Forobviousreasonsthe
simpleskb_put()functionneithersupportsnortoleratesnonlinearity.SKB_LINEAR_ASSERT
checksvalueofdata_lenthroughfunctionskb_is_nonlinear.Anonzerovalueresultsinanerror
messagetobeloggedbyBUG.
Trimmingnonlinearbuffers
Therealtrimfunctionis___pskb_trim()functionwhichisdefinedinnet/core/skbuff.c.Itgetsreally
uglyreallyfastbecauseitmustdealwithunmappedpagesandbufferchains.
The value of offset denotes length of the kmalloc'd component of the sk_buff.
67
Thisloopprocessesanyunmappedpagefragmentsthatmaybeassociatedwiththebuffer.
AddthefragmentsizetooffsetandcompareitagainstthelengthoftheIPpacket.Ifendisgreater
thanlen,thenthisfragmentneedstobetrimmed.Inthiscase,ifthesk_buffisaclone,itsheaderand
skb_shared_infostructurearereallocatedhere.
Iftheoffsetofthestartofthefragmentliesbeyondtheendofthedata,thefragmentisfreedand
numberoffragmentsdecrementedbyone.Otherwise,thefragmentsizeisdecrementedsothatits
lengthisconsistentwiththesizeofthepacket.
Update offset so that it reflects the offset to the start position of the next fragment.
68
Afterprocessingtheunmappedpagefragments,someadditionaladjustmentsmaybenecessary.Here
lenholdsthetargettrimmedlengthandoffsetholdstheoffsettothefirstbyteofdatabeyondthe
unmappedpagefragments.Sinceskb>lenisgreaterthanlenitisnotclearhowoffsetcanbesmaller
thanlen.
Iflen<=skb_headlen(skb)thenallofthedatanowresidesinthekmalloc'edportionofthesk_buff.
Ifthesk_buffisnotclonedthenpresumablyskb_drop_fraglist()freesthenowunusedelements.
else {
768 if (len <= skb_headlen(skb)) {
769 skb->len = len;
770 skb->data_len = 0;
771 skb->tail = skb->data + len;
772 if (skb_shinfo(skb)->frag_list &&
!skb_cloned(skb))
773 skb_drop_fraglist(skb);
774 }
Inthiscasetheoffsetisgreaterthanorequaltolen.Thetrimmingoperationisachievedby
decrementingskb>data_lenbytheamounttrimmedandsetttingskb>lentothetargetlength.
else {
775 skb->data_len -= skb->len - len;
776 skb->len = len;
777 }
778 }
779
780 return 0;
781 }
69
Miscellaneousbuffermanagementfunctions
Theskb_cow()functionisdefinedininclude/linux/skbuff.h.Itensuresthattheheadroomofthe
sk_buffisatleast16bytes.Thesk_buffisreallocatedifitsheadroomisinadequateofsmallorifit
hasaclone.Recallthatdev_alloc_skb()usedskb_reserve()toestablisha16byteheadroomwhenthe
packetwasallocated.Thusforthe``normal''casethevalueofdeltawillbe0here.
Whentheheadroomissmallorthesk_buffiscloned,reallocatethesk_buffwithspecifiedheadroom
size.
1082 }
70