There are a few general rules which ease the study of Linux traffic control.
Traffic control structures under Linux are the same whether the initial configuration has been done with tcng or with tc.
Any router performing a shaping function should be the bottleneck on the link, and should be shaping slightly below the maximum
available link bandwidth. This prevents queues from forming in other routers, affording maximum control of packet latency/deferral
to the shaping device.
A device can only shape traffic it transmits. Because the traffic has already been received on an input interface, it cannot be shaped. The traditional solution to this problem is an ingress policer.
Every interface must have a qdisc. The default qdisc (the pfifo_fast qdisc) is used when another qdisc is not explicitly attached to the
interface.
A classful qdisc added to an interface with no child classes typically just consumes CPU for no benefit.
Any newly created class contains a FIFO qdisc. This qdisc can be replaced explicitly with any other qdisc. The FIFO qdisc will be removed
implicitly if a child class is attached to the class.
Classes directly attached to the root qdisc can be used to simulate virtual circuits.
o Where lower classes are only able to send if higher ones have no packets available.
o The PRIO scheduler prefers to dequeue any available packet from the highest priority band first, falling back to the lower priority
bands.
Priority can be deduced from the source or destination IP address of the packet using the "classifier" concept; this is called
classification. We use "filters" to classify.
The PRIO qdisc is a simple classful queueing discipline that contains an arbitrary number of classes of differing priority.
The classes are dequeued in numerical descending order of priority.
PRIO is a scheduler and never delays packets.
o It is a work-conserving qdisc, though the qdiscs contained in its classes may not be.
o Very useful for lowering latency when there is no need to slow down traffic.
The PRIO qdisc doesn't actually shape, it only subdivides traffic based on how you configured your filters.
You can consider the PRIO qdisc a kind of pfifo_fast, whereby each band is a separate class instead of a simple FIFO.
o You can also add another qdisc to the 3 predefined classes, whereas pfifo_fast is limited to simple FIFO qdiscs.
This qdisc is very useful in case you want to prioritize certain kinds of traffic without using only TOS-flags but using all the
power of the tc filters.
Because it doesn't actually shape, the same warning as for SFQ holds:
o Either use it only if your physical link is really full or wrap it inside a classful qdisc that does shape. The latter holds
for almost all cable modems and DSL devices.
When dequeuing, band 0 is tried first and only if it did not deliver a packet does PRIO try band 1, and so onwards.
o Maximum reliability packets should therefore go to band 0, minimum delay to band 1 and the rest to band 2.
o As the PRIO qdisc itself will have minor number 0, band 0 is actually major:1, band 1 is major:2, etc.
For major, substitute the major number assigned to the qdisc on 'tc qdisc add' with the handle parameter.
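The band-by-band dequeue behaviour described above can be sketched as a toy model (plain Python, not kernel code; the class and method names are illustrative):

```python
from collections import deque

class PrioSketch:
    """Toy model of the PRIO scheduler: one FIFO per band;
    dequeue always drains the lowest-numbered (highest-priority) band first."""

    def __init__(self, bands=3):
        self.queues = [deque() for _ in range(bands)]

    def enqueue(self, packet, band):
        self.queues[band].append(packet)

    def dequeue(self):
        # Work-conserving: try band 0, then band 1, and so on.
        for q in self.queues:
            if q:
                return q.popleft()
        return None  # all bands empty

prio = PrioSketch()
prio.enqueue("bulk", 2)
prio.enqueue("interactive", 0)
order = [prio.dequeue(), prio.dequeue(), prio.dequeue()]
# Interactive traffic in band 0 always leaves before bulk traffic in band 2.
```

Note the starvation risk this implies: as long as band 0 has packets, the lower-priority bands are never served.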
Classification:
Three methods are available to PRIO to determine in which band a packet will be enqueued.
o From userspace
• A process with sufficient privileges can encode the destination class directly with SO_PRIORITY, see socket(7).
o With a tc filter
• A tc filter attached to the root qdisc can point traffic directly to a class
bands
o Number of bands to create. It must be between 2 and 16. Default is 3. If changed from the default of 3, priomap must be updated.
o If tc filters are not used to classify traffic, the PRIO qdisc looks at the TC_PRIO priority to decide how to enqueue traffic.
o The bands are classes, and are called major:1 to major:3 by default, so if your PRIO qdisc is called 12:, tc filter traffic to 12:1 to grant it
more priority. Band 0 goes to minor number 1! Band 1 to minor number 2, etc.
o The priomap maps the priority of a packet to a class. The priority can either be set directly from userspace, or be derived from the
Type of Service of the packet.
o Determines how packet priorities, as assigned by the kernel, map to bands. Mapping occurs based on the TOS octet of the packet,
which looks like this:
o This option defines a table which is used to assign a packet to a class based on its priority.
o There is one entry in the table per packet priority, so the table has 16 entries. The first entry in the table contains the class number
for priority 0 packets, the second entry contains the class number for priority 1 packets, and so on.
o The "priomap" keyword is followed by the class numbers for each priority in ascending priority number order.
o If not specified this default priomap is used: priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
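The priomap lookup just described can be sketched in a few lines (plain Python; the kernel does the equivalent lookup with its prio2band array):

```python
# Default priomap from tc-prio: one band entry per Linux priority 0..15.
DEFAULT_PRIOMAP = [1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]

def band_for_priority(priority, priomap=DEFAULT_PRIOMAP):
    # Priorities are masked to the table size (TC_PRIO_MAX == 15).
    return priomap[priority & 0x0F]

# Priority 6 ("Interactive") lands in band 0; priority 4 ("Int. Bulk") in band 1.
interactive_band = band_for_priority(6)
int_bulk_band = band_for_priority(4)
```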
PRIO dequeues starting from band (class) 0. The TOS in packets has to be selected based on the required priority, and hence the band, as per the table below.

The four TOS bits (the 'TOS field') are defined as:

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|    PRECEDENCE   |          TOS          | MBZ |
+-----+-----+-----+-----+-----+-----+-----+-----+

Binary   Decimal   Meaning
-----------------------------------------
1000     8         Minimize delay (md)
0100     4         Maximize throughput (mt)
0010     2         Maximize reliability (mr)
0001     1         Minimize monetary cost (mmc)
0000     0         Normal Service

Tcpdump -v -v shows you the value of the entire TOS field, not just the four bits. It is the value you see in the first column of this table:

TOS     Bits   Means                    Linux Priority   Band
---------------------------------------------------------------
0x0     0      Normal Service           0 Best Effort    1
0x2     1      Minimize Monetary Cost   1 Filler         2
0x4     2      Maximize Reliability     0 Best Effort    1
0x6     3      mmc+mr                   0 Best Effort    1
0x8     4      Maximize Throughput      2 Bulk           2
0xa     5      mmc+mt                   2 Bulk           2
0xc     6      mr+mt                    2 Bulk           2
0xe     7      mmc+mr+mt                2 Bulk           2
0x10    8      Minimize Delay           6 Interactive    0
0x12    9      mmc+md                   6 Interactive    0
0x14    10     mr+md                    6 Interactive    0
0x16    11     mmc+mr+md                6 Interactive    0
0x18    12     mt+md                    4 Int. Bulk      1
0x1a    13     mmc+mt+md                4 Int. Bulk      1
0x1c    14     mr+mt+md                 4 Int. Bulk      1
0x1e    15     mmc+mr+mt+md             4 Int. Bulk      1

o The second column contains the value of the relevant four TOS bits, followed by their translated meaning. For example, 15 stands for a packet wanting Minimal Monetary Cost, Maximum Reliability, Maximum Throughput AND Minimum Delay.
o The fourth column lists the way the Linux kernel interprets the TOS bits, by showing to which Priority they are mapped.
o The last column shows the result of the default priomap. On the command line, the default priomap looks like this: 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 (3 classes or bands 0, 1, 2 by default; more bands can be configured; 16 TOS priorities). This means that priority 4, for example, gets mapped to band (class) number 1.
o The priomap also allows you to list higher priorities (> 7) which do not correspond to TOS mappings, but which are set by other means.

This table from RFC 1349 (read it for more details) explains how applications might very well set their TOS bits:

TELNET                  1000 (minimize delay)
FTP
  Control               1000 (minimize delay)
  Data                  0100 (maximize throughput)
SMTP
  Command phase         1000 (minimize delay)
  DATA phase            0100 (maximize throughput)
ICMP
  Errors                0000
  Requests              0000 (mostly)
  Responses             <same as request> (mostly)

Classes:
PRIO classes cannot be configured further - they are automatically created when the PRIO qdisc is attached. Each class however can contain yet a further qdisc. Large amounts of traffic in the lower bands can cause starvation of higher bands.
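The TOS-to-priority column of the table above can be encoded as a small lookup (a Python sketch; it assumes, per the layout above, that MBZ is the least-significant bit, so the four TOS bits are recovered by shifting the TOS octet right by one):

```python
# Linux priority for each 4-bit TOS value, transcribed from the table above
# (index = the "Bits" column, value = the "Linux Priority" column).
TOS2PRIO = [0, 1, 0, 0, 2, 2, 2, 2, 6, 6, 6, 6, 4, 4, 4, 4]

def linux_priority(tos_octet):
    # Drop the MBZ bit and keep the four TOS bits.
    return TOS2PRIO[(tos_octet >> 1) & 0x0F]

# 0x10 (Minimize Delay) -> priority 6 (Interactive)
# 0x08 (Maximize Throughput) -> priority 2 (Bulk)
md_prio = linux_priority(0x10)
mt_prio = linux_priority(0x08)
```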
9.5.3.2. Sample configuration
Bulk traffic will go to 30:, interactive traffic to 20: or 10:.
Command lines:

# tc qdisc add dev eth0 root handle 1: prio
// This *instantly* creates classes 1:1, 1:2, 1:3
# tc qdisc add dev eth0 parent 1:1 handle 10: sfq
# tc qdisc add dev eth0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
# tc qdisc add dev eth0 parent 1:3 handle 30: sfq

Perform bulk data transfer with a tool that properly sets TOS flags, then create some interactive traffic to verify that it goes to the higher bands:

# tc -s qdisc ls dev eth0
qdisc sfq 30: quantum 1514b
 Sent 384228 bytes 274 pkts (dropped 0, overlimits 0)
qdisc tbf 20: rate 20Kbit burst 1599b lat 667.6ms
 Sent 2640 bytes 20 pkts (dropped 0, overlimits 0)
qdisc sfq 10: quantum 1514b
 Sent 14926 bytes 193 pkts (dropped 0, overlimits 0)
qdisc prio 1: bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 401836 bytes 488 pkts (dropped 0, overlimits 0)

As you can see, the bulk traffic went to handle 30:, which is the lowest priority band, just as intended. It worked - the additional interactive traffic has gone to 10:, which is our highest priority qdisc, and none of it was sent to the lowest priority band, which previously received our entire scp.
PRIO: Classifying packets with filters:
To determine which class shall process a packet, the so-called 'classifier chain' is called each time a choice needs to be made.
This chain consists of all filters attached to the classful qdisc that needs to decide.
root 1:
|
_1:1_
/ | \
/ | \
/ | \
10: 11: 12:
/ \ / \
10:1 10:2 12:1 12:2
We have a PRIO qdisc called '10:' containing 3 classes, and we want to assign all traffic from and to port 22 to the highest priority band. The filters would be:
# tc filter add dev eth0 protocol ip parent 10: prio 1 u32 match ip dport 22 0xffff flowid 10:1
# tc filter add dev eth0 protocol ip parent 10: prio 1 u32 match ip sport 80 0xffff flowid 10:1
# tc filter add dev eth0 protocol ip parent 10: prio 2 flowid 10:2
Attach to eth0, node 10:, a priority-1 u32 filter that matches on IP destination port 22 *exactly* and sends matching traffic to band 10:1; the second command does the same for source
port 80. The last command says that anything unmatched so far should go to band 10:2, the next-highest priority.
# tc filter add dev eth0 parent 10:0 protocol ip prio 1 u32 match ip dst 4.3.2.1/32 flowid 10:1
# tc filter add dev eth0 parent 10:0 protocol ip prio 1 u32 match ip src 1.2.3.4/32 flowid 10:1
# tc filter add dev eth0 protocol ip parent 10: prio 2 flowid 10:2
This assigns traffic to 4.3.2.1 and traffic from 1.2.3.4 to the highest priority queue, and the rest to the next-highest one.
You can concatenate matches; to match on traffic from 4.3.2.1 and from source port 80, do this:
# tc filter add dev eth0 parent 10:0 protocol ip prio 1 u32 match ip src 4.3.2.1/32 match ip sport 80 0xffff flowid 10:1
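The mask semantics behind these u32 matches can be illustrated with a tiny helper (illustrative Python, not the kernel's u32 implementation):

```python
def u32_match(value, pattern, mask):
    """u32-style match: compare only the bits selected by the mask."""
    return (value & mask) == (pattern & mask)

# 'match ip sport 80 0xffff' is an exact 16-bit port match:
exact = u32_match(80, 80, 0xFFFF)

# 'match ip src 1.2.3.0/24' compares only the top 24 address bits:
ip = 0x01020304    # 1.2.3.4 as a 32-bit integer
net = 0x01020300   # 1.2.3.0
in_subnet = u32_match(ip, net, 0xFFFFFF00)
```

A /32 mask (0xFFFFFFFF) therefore matches a single host, which is why /32 and "no mask" behave the same in the commands above.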
These are all the filtering commands you will normally need. Most shaping commands presented here start with this preamble:
o # tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ..
These are the so called 'u32' matches, which can match on ANY part of a packet.
o On source/destination address
• Source mask 'match ip src 1.2.3.0/24', destination mask 'match ip dst 4.3.2.0/24'. To match a single host, use /32, or omit the mask.
On fwmark
o You can mark packets with either ipchains or iptables and have that mark survive routing across interfaces.
o This is really useful to, for example, shape only the traffic on eth1 that came in on eth0. Syntax:
# tc filter add dev eth1 protocol ip parent 1:0 prio 1 handle 6 fw flowid 1:1
Note that this is not a u32 match!
Packets which are selected by the filter go to the high-priority class, while all other packets go to the low-priority class.
Whenever there are packets in the high-priority queue, they are sent before packets in the low-priority queue (e.g. the “sch_prio”
queuing discipline works this way).
In order to prevent high-priority traffic from starving low-priority traffic, we use the token bucket filter (TBF) queuing discipline, which
enforces a rate of at most 1 Mbps.
Note that there are better ways to accomplish what we've done here, e.g. by using class-based queuing (CBQ).
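The rate-limiting idea behind TBF can be sketched as a token bucket (toy Python model; `rate` in bytes/second and `burst` in bytes are illustrative parameters, not tc's exact units):

```python
class TokenBucketSketch:
    """Tokens accrue at `rate` bytes/sec up to `burst` bytes;
    a packet conforms only if enough tokens are available."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst   # start with a full bucket
        self.last = 0.0

    def conforms(self, pkt_len, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_len:
            self.tokens -= pkt_len
            return True
        return False  # over the rate: delay (shaping) or drop (policing)

tbf = TokenBucketSketch(rate=1000, burst=1500)
first = tbf.conforms(1500, now=0.0)    # full bucket: conforms
second = tbf.conforms(1500, now=0.0)   # bucket empty: does not conform
third = tbf.conforms(1500, now=1.5)    # 1.5 s * 1000 B/s refills the bucket
```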
Figure 2.1.8 implements three qdiscs, viz. PRIO, TBF and FIFO.
We can implement a default class where packets are sent when they do not match any of our filters.
“Leave a default codepoint and assign it to the best-effort PHB".
Let's see now with a little detail how the diagram 2.1.18 represents our queuing discipline.
The main qdisc is the PRIO queue which receives the packets.
It applies the filters and selects the packets marked as "high priority".
Marking could be based on MF (multi-field selection) or DS-codepoint.
Classifier puts the packets marked with “high priority” in the class “High” as shown in figure.
The rest of the packets (those not marked with our high-priority identifier) go to the "Low" class below.
Finally qdiscs for both classes deliver the packets on the right side of the main queue to be dispatched to the physical network.
Observation:
o The "High" priority class is controlled by an upper level of throughput (implemented by TBF).
o High priority class traffic has to be controlled to avoid starvation of the lower priority classes.
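The high/low pipeline described above can be sketched end to end (toy Python; the `is_high_priority` predicate and the DSCP value 46 stand in for the real MF/DS-codepoint classifier and are assumptions, not part of the original text):

```python
from collections import deque

def is_high_priority(packet):
    # Stand-in for the real classifier (MF selection or DS codepoint);
    # DSCP 46 (Expedited Forwarding) is just an illustrative choice.
    return packet.get("dscp") == 46

high, low = deque(), deque()

def enqueue(packet):
    (high if is_high_priority(packet) else low).append(packet)

def dequeue():
    # High class first; in the full design a TBF on `high` caps its
    # throughput so that `low` is not starved.
    if high:
        return high.popleft()
    if low:
        return low.popleft()
    return None

enqueue({"dscp": 0, "id": 1})
enqueue({"dscp": 46, "id": 2})
first_out = dequeue()["id"]
```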
tc qdisc [ add | change | replace | link ] dev DEV [ parent qdisc-id | root ] [ handle qdisc-id ] qdisc [ qdisc specific parameters ]
tc class [ add | change | replace ] dev DEV parent qdisc-id [classid class-id ] qdisc [ qdisc specific parameters ]
// You could have "u32 match src " or specify an sport or any protocol
tc filter add dev eth0 protocol ip parent 1: prio 2 u32 match ip src 0/0 flowid 1:2
• The default pfifo_fast qdisc should already be capable of doing what the above commands do, by obeying the ToS bits.
o So another solution, without messing with tc at all, would be to just configure your ssh and MySQL daemons to set
the ToS bits on their traffic appropriately.
// net/sched/sch_api.c
// Main classifier routine: scans the classifier chain attached to this qdisc,
// (optionally) tests for protocol and asks specific classifiers.
int tc_classify(struct sk_buff *skb, struct tcf_proto *tp, struct tcf_result *res)
{
        int err = 0;
        __be16 protocol;

#ifdef CONFIG_NET_CLS_ACT
        struct tcf_proto *otp = tp;
reclassify:
#endif
        protocol = skb->protocol;
        err = tc_classify_compat(skb, tp, res);

#ifdef CONFIG_NET_CLS_ACT
        if (err == TC_ACT_RECLASSIFY) {
                u32 verd = G_TC_VERD(skb->tc_verd);
                tp = otp;

                if (verd++ >= MAX_REC_LOOP) {
                        printk("rule prio %u protocol %02x reclassify loop, "
                               "packet dropped\n",
                               tp->prio & 0xffff, ntohs(tp->protocol));
                        return TC_ACT_SHOT;
                }
                skb->tc_verd = SET_TC_VERD(skb->tc_verd, verd);
                goto reclassify;
        }
#endif
        return err;
}

// net/sched/sch_api.c
int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp,
                       struct tcf_result *res)
{
        __be16 protocol = skb->protocol;
        int err = 0;

        for (; tp; tp = tp->next) {
                if ((tp->protocol == protocol ||
                     tp->protocol == htons(ETH_P_ALL)) &&
                    (err = tp->classify(skb, tp, res)) >= 0) {
#ifdef CONFIG_NET_CLS_ACT
                        if (err != TC_ACT_RECLASSIFY && skb->tc_verd)
                                skb->tc_verd = SET_TC_VERD(skb->tc_verd, 0);
#endif
                        return err;
                }
        }
        return -1;
}
// include/linux/pkt_sched.h -- PRIO section
#define TCQ_PRIO_BANDS     16
#define TCQ_MIN_PRIO_BANDS 2
#define TC_PRIO_MAX        15

// net/sched/sch_prio.c -- prio_tune() continued
        sch_tree_lock(sch);
        q->bands = qopt->bands;
        memcpy(q->prio2band, qopt->priomap, TC_PRIO_MAX + 1);

static unsigned long prio_get(struct Qdisc *sch, u32 classid)
{
        struct prio_sched_data *q = qdisc_priv(sch);
        unsigned long band = TC_H_MIN(classid);

        if (band - 1 >= q->bands)
                return 0;
        return band;
}

static int prio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
                                 struct gnet_dump *d)
{
        struct prio_sched_data *q = qdisc_priv(sch);
        struct Qdisc *cl_q;

        cl_q = q->queues[cl - 1];
        cl_q->qstats.qlen = cl_q->q.qlen;
        if (gnet_stats_copy_basic(d, &cl_q->bstats) < 0 ||
            gnet_stats_copy_queue(d, &cl_q->qstats) < 0)
                return -1;
        return 0;
}

static struct tcf_proto **prio_find_tcf(struct Qdisc *sch, unsigned long cl)
{
        struct prio_sched_data *q = qdisc_priv(sch);

        if (cl)
                return NULL;
        return &q->filter_list;
}

static void prio_put(struct Qdisc *q, unsigned long cl)
{
        return;
}

// prio_enqueue() fragment
        qdisc = prio_classify(skb, sch, &ret);
#ifdef CONFIG_NET_CLS_ACT
        if (qdisc == NULL) {
                if (ret & __NET_XMIT_BYPASS)
                        sch->qstats.drops++;
                kfree_skb(skb);
                return ret;
        }
#endif
        ret = qdisc_enqueue(skb, qdisc); // Band or class qdisc

// prio_peek() fragment
        for (prio = 0; prio < q->bands; prio++) {
                struct Qdisc *qdisc = q->queues[prio];
                struct sk_buff *skb = qdisc->ops->peek(qdisc);
                if (skb) // Get from the highest priority band qdisc/queue first.
                        return skb;
        }
        return NULL;
}

static struct sk_buff *prio_dequeue(struct Qdisc *sch)
{
        struct prio_sched_data *q = qdisc_priv(sch);

// prio_dump() fragment
        opt.bands = q->bands;
        memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX + 1);
        return skb->len;

nla_put_failure:
        nlmsg_trim(skb, b);
        return -1;
}