
Overview of PRIO Qdisc Linux Implementation

NOMUS COMM SYSTEMS PVT LTD, HYDERABAD
Presented by NSS MURTHY on 19-12-2019
8. Rules, Guidelines and Approaches

8.1. General Rules of Linux Traffic Control

 There are a few general rules which ease the study of Linux traffic control.

 Traffic control structures under Linux are the same whether the initial configuration has been done with tcng or with tc.

 Any router performing a shaping function should be the bottleneck on the link, and should be shaping slightly below the maximum
available link bandwidth. This prevents queues from forming in other routers, affording maximum control of packet latency/deferral
to the shaping device.

 A device can only shape traffic it transmits. Because ingress traffic has already been received on an input interface, it cannot be shaped there. The traditional solution to this problem is an ingress policer.

 Every interface must have a qdisc. The default qdisc (the pfifo_fast qdisc) is used when another qdisc is not explicitly attached to the
interface.

 A classful qdisc added to an interface with no child classes typically just consumes CPU for no benefit.

 Any newly created class contains a FIFO. This qdisc can be replaced explicitly with any other qdisc. The FIFO qdisc will be removed
implicitly if a child class is attached to this class.

 Classes directly attached to the root qdisc can be used to simulate virtual circuits.

 A filter can be attached to a classful qdisc or to a class.


PRIO
net/sched/sch_prio.c Simple 3-band priority "scheduler"
// include/linux/pkt_sched.h
#define TCQ_PRIO_BANDS      16
#define TCQ_MIN_PRIO_BANDS  2   // default number of bands is 3
#define TC_PRIO_MAX         15

struct prio_sched_data
{
    int                 bands;
    struct tcf_proto    *filter_list;
    u8                  prio2band[TC_PRIO_MAX+1];
    struct Qdisc        *queues[TCQ_PRIO_BANDS];
};

// This function is called during prio module init.
static int __init prio_module_init(void)
{
    return register_qdisc(&prio_qdisc_ops);
}

static void __exit prio_module_exit(void)
{
    unregister_qdisc(&prio_qdisc_ops);
}

module_init(prio_module_init)
module_exit(prio_module_exit)

static const struct Qdisc_class_ops prio_class_ops =
{
    .graft      = prio_graft,        // child qdisc manipulation
    .leaf       = prio_leaf,         // child qdisc manipulation
    .get        = prio_get,          // class manipulation
    .put        = prio_put,          // class manipulation
    .walk       = prio_walk,         // class manipulation
    .tcf_chain  = prio_find_tcf,     // filter manipulation
    .bind_tcf   = prio_bind,         // filter manipulation
    .unbind_tcf = prio_put,          // filter manipulation
    .dump       = prio_dump_class,
    .dump_stats = prio_dump_class_stats,
};

static struct Qdisc_ops prio_qdisc_ops __read_mostly =
{
    .next       = NULL,
    .cl_ops     = &prio_class_ops,
    .id         = "prio",
    .priv_size  = sizeof(struct prio_sched_data),
    .enqueue    = prio_enqueue,
    .dequeue    = prio_dequeue,
    .peek       = prio_peek,
    .drop       = prio_drop,
    .init       = prio_init,
    .reset      = prio_reset,
    .destroy    = prio_destroy,
    .change     = prio_tune,
    .dump       = prio_dump,
    .owner      = THIS_MODULE,
};
 The PRIO qdisc is a non-shaping container for a configurable number of classes which are dequeued in order.

 This allows for easy prioritization of traffic,

o where lower-priority bands are only able to send if the higher-priority bands have no packets available.

o The PRIO scheduler prefers to dequeue any available packet from the highest priority band first, then falls back to the lower priority queues.

 To facilitate configuration, Type Of Service bits are honoured by default.

 Priority can also be deduced from the source or destination IP address of the packet using the "classifier" concept. This is called classification; we use "filters" to classify.

tc qdisc ... dev dev ( parent classid | root ) [ handle major: ] prio
    [ bands bands ] [ priomap band,band,band... ] [ estimator interval timeconstant ]

 The PRIO qdisc is a simple classful queueing discipline that contains an arbitrary number of classes of differing priority.
 The classes are dequeued in numerical descending order of priority.
 PRIO is a scheduler and never delays packets.
o It is a work-conserving qdisc, though the qdiscs contained in its classes may not be.
o It is very useful for lowering latency when there is no need to slow traffic down.
 The PRIO qdisc doesn't actually shape; it only subdivides traffic based on how you configured your filters.

 You can consider the PRIO qdisc a kind of pfifo_fast, whereby each band is a separate class instead of a simple FIFO.
o You can also attach another qdisc to each of the 3 predefined classes, whereas pfifo_fast is limited to simple FIFO qdiscs.

 When the PRIO qdisc is created, three classes are created by default.

 When a packet is enqueued to the PRIO qdisc,

o A class is chosen based on the filter commands you gave.
o These classes by default contain pure FIFO qdiscs with no internal structure, but you can replace these with any qdisc you have
available.

 Whenever a packet needs to be dequeued,

o Class :1 is tried first.
o Higher-numbered classes are only used if the lower-numbered bands did not yield a packet.

 This qdisc is very useful in case you want to prioritize certain kinds of traffic without using only TOS-flags but using all the
power of the tc filters.

 Because it doesn't actually shape, the same warning as for SFQ holds:
o Either use it only if your physical link is really full or wrap it inside a classful qdisc that does shape. The latter holds
for almost all cable modems and DSL devices.

 In formal words, the PRIO qdisc is a Work-Conserving scheduler.
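The strict-priority, work-conserving dequeue described above can be sketched in plain userspace C. Everything below (the band_fifo struct and the fifo_init/fifo_push/fifo_pop helpers) is invented purely for illustration; only the loop in prio_style_dequeue mirrors the qdisc's behaviour.

```c
#include <assert.h>
#include <stddef.h>

#define BANDS 3
#define SLOTS 64

/* Hypothetical per-band FIFO: a fixed ring with head/tail indices. */
struct band_fifo {
    int pkt[SLOTS];
    int head, tail;
};

static void fifo_init(struct band_fifo *f) { f->head = f->tail = 0; }

static void fifo_push(struct band_fifo *f, int pkt)
{
    f->pkt[f->tail++ % SLOTS] = pkt;
}

static int fifo_pop(struct band_fifo *f, int *out)
{
    if (f->head == f->tail)
        return 0;                       /* band empty */
    *out = f->pkt[f->head++ % SLOTS];
    return 1;
}

/* PRIO-style dequeue: always drain the lowest-numbered (highest priority)
 * non-empty band first. Returns 0 only when every band is empty, i.e. the
 * scheduler is work-conserving: it never idles while any band has a packet. */
static int prio_style_dequeue(struct band_fifo bands[BANDS], int *out)
{
    for (int prio = 0; prio < BANDS; prio++)
        if (fifo_pop(&bands[prio], out))
            return 1;
    return 0;
}
```

A packet in band 0 is always delivered before anything waiting in bands 1 or 2, regardless of arrival order.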


Algorithm:

 On creation with 'tc qdisc add', a fixed number of bands is created.


o Each band is a class, although it is not possible to add classes with 'tc qdisc add';
- the number of bands to be created must instead be specified on the command line when attaching PRIO to its root.

 When dequeuing, band 0 is tried first and only if it did not deliver a packet does PRIO try band 1, and so onwards.
o Maximum reliability packets should therefore go to band 0, minimum delay to band 1 and the rest to band 2.
o As the PRIO qdisc itself will have minor number 0, band 0 is actually major:1, band 1 is major:2, etc.
 For major, substitute the major number assigned to the qdisc on 'tc qdisc add' with the handle parameter.

Classification:

 Three methods are available to PRIO to determine in which band a packet will be enqueued.
o From userspace
• A process with sufficient privileges can encode the destination class directly with SO_PRIORITY; see socket(7).

o With a tc filter
• A tc filter attached to the root qdisc can point traffic directly to a class

o With the priomap


• Based on the packet priority, which in turn is derived from the Type of Service assigned to the packet.

 Only the priomap is specific to this qdisc.


Qdisc Parameters:

 bands
o Number of bands to create. It must be between 2 and 16. Default is 3. If changed from the default of 3, priomap must be updated.

o If tc filters are not used to classify traffic, the PRIO qdisc looks at the TC_PRIO priority to decide how to enqueue traffic.

o The bands are classes, and are called major:1 to major:3 by default; so if your PRIO qdisc is called 12:, tc filter traffic to 12:1 to grant it
more priority. Band 0 goes to minor number 1, band 1 to minor number 2, etc.

 priomap classForPrio_0 classForPrio_1 ... classForPrio_15

o The priomap maps the priority of a packet to a class. The priority can either be set directly from userspace, or be derived from the
Type of Service of the packet.

o Determines how packet priorities, as assigned by the kernel, map to bands. Mapping occurs based on the TOS octet of the packet,
which looks like this:

o It is only used if the packet is not assigned a class by a classifier.

o This option defines a table which is used to assign a packet to a class based on its :priority:.

o There is one entry in the table per packet priority, so the table has 16 entries. The first entry in the table contains the class number
for priority 0 packets, the second entry contains the class number for priority 1 packets, and so on.
o The "priomap" keyword is followed by the class numbers for each priority in ascending priority number order.
o If not specified this default priomap is used: priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
PRIO dequeues starting from band (class) 0, so the TOS value in packets has to be selected according to the required priority, and hence band, as per the table below.

The four TOS bits (the 'TOS field') sit inside the TOS octet like this:

   0     1     2     3     4     5     6     7
+-----+-----+-----+-----+-----+-----+-----+-----+
|   PRECEDENCE    |          TOS          | MBZ |
+-----+-----+-----+-----+-----+-----+-----+-----+

The four TOS bits are defined as:

Binary  Decimal  Meaning
---------------------------------------------
1000    8        Minimize delay (md)
0100    4        Maximize throughput (mt)
0010    2        Maximize reliability (mr)
0001    1        Minimize monetary cost (mmc)
0000    0        Normal Service

 As there is 1 bit to the right of these four bits, the actual value of the TOS field is double the value of the TOS bits.
 tcpdump -v -v shows you the value of the entire TOS field, not just the four bits. It is the value you see in the first column of this table:

TOS     Bits  Means                    Linux Priority   Band
-------------------------------------------------------------
0x0     0     Normal Service           0 Best Effort    1
0x2     1     Minimize Monetary Cost   1 Filler         2
0x4     2     Maximize Reliability     0 Best Effort    1
0x6     3     mmc+mr                   0 Best Effort    1
0x8     4     Maximize Throughput      2 Bulk           2
0xa     5     mmc+mt                   2 Bulk           2
0xc     6     mr+mt                    2 Bulk           2
0xe     7     mmc+mr+mt                2 Bulk           2
0x10    8     Minimize Delay           6 Interactive    0
0x12    9     mmc+md                   6 Interactive    0
0x14    10    mr+md                    6 Interactive    0
0x16    11    mmc+mr+md                6 Interactive    0
0x18    12    mt+md                    4 Int. Bulk      1
0x1a    13    mmc+mt+md                4 Int. Bulk      1
0x1c    14    mr+mt+md                 4 Int. Bulk      1
0x1e    15    mmc+mr+mt+md             4 Int. Bulk      1

 The second column contains the value of the relevant four TOS bits, followed by their translated meaning.
o For example, 15 stands for a packet wanting Minimal Monetary Cost, Maximum Reliability, Maximum Throughput AND Minimum Delay.
 The fourth column lists the way the Linux kernel interprets the TOS bits, by showing to which Priority they are mapped.
 The last column shows the result of the default priomap. On the command line, the default priomap looks like this: 1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1 (3 classes or bands 0, 1, 2 by default; more bands can be configured; there are 16 TOS priorities).
o This means that priority 4, for example, gets mapped to band (class) number 1.
o The priomap also allows you to list higher priorities (> 7) which do not correspond to TOS mappings, but which are set by other means.
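The table can be cross-checked mechanically: take the TOS byte as tcpdump shows it, extract the four TOS bits, map them to a Linux priority, and run that priority through the default priomap. The tos_bits2prio table below is transcribed from the "Linux Priority" column of the table above; this is a userspace sketch for verification, not the kernel's actual lookup code.

```c
#include <assert.h>

/* Default priomap, as printed by tc: Linux priority -> band. */
static const int default_priomap[16] = {
    1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

/* Linux priority for each 4-bit TOS value, transcribed from the table:
 * 0 = Best Effort, 1 = Filler, 2 = Bulk, 4 = Interactive Bulk, 6 = Interactive. */
static const int tos_bits2prio[16] = {
    0, 1, 0, 0, 2, 2, 2, 2, 6, 6, 6, 6, 4, 4, 4, 4
};

/* TOS byte (whole field, as tcpdump shows it) -> band under the default priomap. */
static int tos_to_band(int tos_byte)
{
    int tos_bits = (tos_byte >> 1) & 0xf;   /* drop the low bit (MBZ side) */
    return default_priomap[tos_bits2prio[tos_bits]];
}
```

For example, 0x10 (Minimize Delay) yields priority 6 and therefore band 0, matching the table's last column.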
 This table from RFC 1349 (read it for more details) explains how applications might very well set their TOS bits:

TELNET                   1000 (minimize delay)
FTP
  Control                1000 (minimize delay)
  Data                   0100 (maximize throughput)
TFTP                     1000 (minimize delay)
SMTP
  Command phase          1000 (minimize delay)
  DATA phase             0100 (maximize throughput)
Domain Name Service
  UDP Query              1000 (minimize delay)
  TCP Query              0000
  Zone Transfer          0100 (maximize throughput)
NNTP                     0001 (minimize monetary cost)
ICMP
  Errors                 0000
  Requests               0000 (mostly)
  Responses              <same as request> (mostly)

Classes:
 PRIO classes cannot be configured further; they are automatically created when the PRIO qdisc is attached.
 Each class can however contain yet a further qdisc.

Bugs:
 Large amounts of traffic in the lower bands can cause starvation of higher bands.
 This can be prevented by attaching a shaper (for example, tc-tbf(8)) to these bands to make sure they cannot dominate the link.
9.5.3.2. Sample configuration

          1:            root qdisc
         / | \
        /  |  \
       /   |   \
     1:1  1:2  1:3      classes
      |    |    |
     10:  20:  30:      qdiscs
     sfq  tbf  sfq
band  0    1    2

Bulk traffic will go to 30:, interactive traffic to 20: or 10:.

Command lines:

# tc qdisc add dev eth0 root handle 1: prio
## This *instantly* creates classes 1:1, 1:2, 1:3

# tc qdisc add dev eth0 parent 1:1 handle 10: sfq
# tc qdisc add dev eth0 parent 1:2 handle 20: tbf rate 20kbit buffer 1600 limit 3000
# tc qdisc add dev eth0 parent 1:3 handle 30: sfq

Now perform a bulk data transfer with a tool that properly sets TOS flags:

# scp tc ahu@10.0.0.11:./
ahu@10.0.0.11's password:
tc                 100% |*****************************|   353 KB    00:00

# tc -s qdisc ls dev eth0
qdisc sfq 30: quantum 1514b
 Sent 384228 bytes 274 pkts (dropped 0, overlimits 0)
qdisc tbf 20: rate 20Kbit burst 1599b lat 667.6ms
 Sent 2640 bytes 20 pkts (dropped 0, overlimits 0)
qdisc sfq 10: quantum 1514b
 Sent 2230 bytes 31 pkts (dropped 0, overlimits 0)
qdisc prio 1: bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 389140 bytes 326 pkts (dropped 0, overlimits 0)

As you can see, all traffic went to handle 30:, which is the lowest priority band, just as intended. Now to verify that interactive traffic goes to higher bands, we create some interactive traffic:

# tc -s qdisc ls dev eth0
qdisc sfq 30: quantum 1514b
 Sent 384228 bytes 274 pkts (dropped 0, overlimits 0)
qdisc tbf 20: rate 20Kbit burst 1599b lat 667.6ms
 Sent 2640 bytes 20 pkts (dropped 0, overlimits 0)
qdisc sfq 10: quantum 1514b
 Sent 14926 bytes 193 pkts (dropped 0, overlimits 0)
qdisc prio 1: bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 401836 bytes 488 pkts (dropped 0, overlimits 0)

It worked: all additional traffic has gone to 10:, which is our highest priority qdisc. No traffic was sent to the lowest priority band, which previously received our entire scp.
PRIO: Classifying packets with filters:
To determine which class shall process a packet, the so-called 'classifier chain' is called each time a choice needs to be made.
This chain consists of all filters attached to the classful qdisc that needs to decide.
root 1:
|
_1:1_
/ | \
/ | \
/ | \
10: 11: 12:
/ \ / \
10:1 10:2 12:1 12:2

We have a PRIO qdisc called '10:' containing 3 classes, and we want to assign all traffic from and to port 22 to the highest priority band, the filters would be:

# tc filter add dev eth0 protocol ip parent 10: prio 1 u32 match ip dport 22 0xffff flowid 10:1
# tc filter add dev eth0 protocol ip parent 10: prio 1 u32 match ip sport 80 0xffff flowid 10:1
# tc filter add dev eth0 protocol ip parent 10: prio 2 flowid 10:2

This attaches to eth0 node 10: a priority-1 u32 filter that matches on IP destination port 22 *exactly* and sends it to band 10:1, and then repeats the same for source
port 80. The last command says that anything unmatched so far should go to band 10:2, the next-highest priority.

To select on an IP address, use this:

# tc filter add dev eth0 parent 10:0 protocol ip prio 1 u32 match ip dst 4.3.2.1/32 flowid 10:1
# tc filter add dev eth0 parent 10:0 protocol ip prio 1 u32 match ip src 1.2.3.4/32 flowid 10:1
# tc filter add dev eth0 protocol ip parent 10: prio 2 flowid 10:2

This assigns traffic to 4.3.2.1 and traffic from 1.2.3.4 to the highest priority queue, and the rest to the next-highest one.

You can concatenate matches, to match on traffic from 1.2.3.4 and from port 80, do this:
# tc filter add dev eth0 parent 10:0 protocol ip prio 1 u32 match ip src 4.3.2.1/32 match ip sport 80 0xffff flowid 10:1
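Conceptually, a u32 match is a masked compare against a 32-bit word taken at some offset into the packet; "match ip dport 22 0xffff" compares the word containing the TCP/UDP ports against value 22 under mask 0x0000ffff. The sketch below illustrates only that compare; the u32_key struct and its fields are invented for illustration and are much simpler than the kernel's real u32 selectors.

```c
#include <assert.h>
#include <stdint.h>

/* A u32 selector, conceptually: match if (word & mask) == value, where
 * "word" is the 32-bit big-endian chunk at "offset" bytes into the packet. */
struct u32_key {
    uint32_t value;
    uint32_t mask;
    int      offset;    /* byte offset into the packet, multiple of 4 */
};

static int u32_match(const struct u32_key *k, const uint8_t *pkt)
{
    const uint8_t *p = pkt + k->offset;
    uint32_t word = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
                    ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    return (word & k->mask) == k->value;
}
```

With a 20-byte IP header, the source and destination ports share the word at offset 20, so "dport 22 0xffff" becomes value 0x00000016, mask 0x0000ffff at that offset.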
 These are all the filtering commands you will normally need. Most shaping commands presented here start with this preamble:
o # tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 ..

 These are the so called 'u32' matches, which can match on ANY part of a packet.
o On source/destination address
• Source mask 'match ip src 1.2.3.0/24', destination mask 'match ip dst 4.3.2.0/24'. To match a single host, use /32, or omit the mask.

o On source/destination port, all IP protocols


• Source: 'match ip sport 80 0xffff', destination: 'match ip dport 80 0xffff'

o On ip protocol (tcp, udp, icmp, gre, ipsec)


• Use the numbers from /etc/protocols, for example, icmp is 1: 'match ip protocol 1 0xff'.

 On fwmark
o You can mark packets with either ipchains or iptables and have that mark survive routing across interfaces.
o This is really useful, for example, to shape traffic on eth1 only if it came in on eth0. Syntax:
# tc filter add dev eth1 protocol ip parent 1:0 prio 1 handle 6 fw flowid 1:1
Note that this is not a u32 match!

You can place a mark like this:


# iptables -A PREROUTING -t mangle -i eth0 -j MARK --set-mark 6
The number 6 is arbitrary.
The following command shows all the rules that mark packets in the mangle table, and also how many packets and bytes have matched:
# iptables -L -t mangle -n -v

 On the TOS field


o To select interactive, minimum delay traffic:
# tc filter add dev ppp0 parent 1:0 protocol ip prio 10 u32 match ip tos 0x10 0xff flowid 1:4
o Use 0x08 0xff for bulk traffic.
o For more filtering commands, see the Advanced Filters chapter.
 Figure 2.1.8 shows a queuing discipline with two delay priorities.

 Packets which are selected by the filter go to the high-priority class, while all other packets go to the low-priority class.

 Whenever there are packets in the high-priority queue, they are sent before packets in the low-priority queue (e.g. the “sch_prio”
queuing discipline works this way).

 In order to prevent high-priority traffic from starving low-priority traffic, we use the token bucket filter (TBF) queuing discipline, which
enforces a rate of at most 1 Mbps.

 Finally, the queuing of low-priority packets is done by a FIFO queuing discipline.

 Note that there are better ways to accomplish what we've done here, e.g. by using class-based queuing (CBQ).

 Figure 2.1.8 implements three qdiscs, viz. PRIO, TBF and FIFO.
 We can implement a default class where packets are sent when they do not match any of our filters.
 “Leave a default codepoint and assign it to the best-effort PHB".
 Let's now see in a little more detail how Figure 2.1.8 represents our queuing discipline.

 The main qdisc is the PRIO queue which receives the packets.
 It applies the filter and selects those packets marked as "high priority".
 Marking could be based on MF (multi-field selection) or DS-codepoint.

 Classifier puts the packets marked with “high priority” in the class “High” as shown in figure.
 Rest of the packets (not marked with our high priority identifier) go to the "Low" class below.

 The "High" class implements a TBF qdisc.


o Typically, each class "owns" one queue. This time the queue assigned to the "High" class is a TBF queue.
o This queue limits the maximum throughput traversing it to 1 Mbps (have a look at the diagram).
o To the right of the TBF queue representation, a little queue and a little clock are shown.
 They represent the queue behaviour, and indicate that some sort of metering is being done on the queue to measure the flow rate.

 The "Low" class queue is a FIFO queue.


o A simple "First-In, First-Out" queue.
o It doesn't try to exercise any control; it just enqueues and dequeues packets as they arrive.

 Finally, the qdiscs of both classes deliver the packets on the right side of the main queue, to be dispatched to the physical network.

 Observation:
o The "High" priority class is controlled by an upper level of throughput (implemented by TBF).
o High priority class traffic has to be controlled to avoid "starvation" of lower priority classes.
tc qdisc [ add | change | replace | link ] dev DEV [ parent qdisc-id | root ] [ handle qdisc-id ] qdisc [ qdisc specific parameters ]

tc class [ add | change | replace ] dev DEV parent qdisc-id [classid class-id ] qdisc [ qdisc specific parameters ]

tc filter [ add | change | replace ] dev DEV [ parent qdisc-id | root ]


protocol protocol prio priority filtertype [ filtertype specific parameters ] flowid flow-id
// Create a prio qdisc named 1:
tc qdisc add dev eth0 root handle 1: prio (Creates 3 bands with priorities 0,1,2 by default, unless specified otherwise on the command line)

// Add a filter matching port 22 -> band 1 //ssh


tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dport 22 0xffff flowid 1:1

// Add another filter matching port 3306 -> band 1 //MySQL


tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dport 3306 0xffff flowid 1:1

//Add another filter matching protocol 1 (icmp) -> band 1


tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip protocol 1 0x00ff flowid 1:1

// You could have "u32 match src " or specify an sport or any protocol
tc filter add dev eth0 protocol ip parent 1: prio 2 u32 match ip src 0/0 flowid 1:2

• The default pfifo_fast qdisc should already be capable of doing what is done by the above commands, by obeying the ToS bits.
o So another solution, without touching tc at all, would be to just configure your ssh and MySQL daemons to set
the ToS bits on their traffic appropriately.

# tc qdisc add dev eth0 root handle 1: prio


# tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dport 22 0xffff flowid 1:1
# tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dport 3306 0xffff flowid 1:1
# tc filter add dev eth0 parent 1: prio 3 protocol all u32 match u32 0 0 flowid 1:3
# tc -s qdisc ls dev eth0
qdisc prio 1: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
Sent 125836067 bytes 87549 pkt (dropped 0, overlimits 0 requeues 347)
// "Handles"
// include/linux/pkt_sched.h

All traffic control objects have 32-bit identifiers, or "handles". They can be considered opaque numbers from the user-API viewpoint, but they actually always consist of two fields, major and minor numbers, which are interpreted specially by the kernel and may be used by applications, though this is not recommended.

For example, qdisc handles always have a minor number equal to zero; classes (or flows) have a major equal to the parent qdisc's major, and a minor uniquely identifying the class inside the qdisc.

Macros to manipulate handles:

#define TC_H_MAJ_MASK (0xFFFF0000U)


#define TC_H_MIN_MASK (0x0000FFFFU)

#define TC_H_MAJ(h) ((h)&TC_H_MAJ_MASK)


#define TC_H_MIN(h) ((h)&TC_H_MIN_MASK)

#define TC_H_MAKE(maj,min) (((maj)&TC_H_MAJ_MASK)|((min)&TC_H_MIN_MASK))

#define TC_H_UNSPEC (0U)


#define TC_H_ROOT (0xFFFFFFFFU)
#define TC_H_INGRESS (0xFFFFFFF1U)
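A quick userspace check of these macros: the handle written as "1:3" on the tc command line is major 1, minor 3, i.e. the 32-bit value 0x00010003.

```c
#include <assert.h>
#include <stdint.h>

/* Handle macros as in include/linux/pkt_sched.h. */
#define TC_H_MAJ_MASK (0xFFFF0000U)
#define TC_H_MIN_MASK (0x0000FFFFU)
#define TC_H_MAJ(h)   ((h) & TC_H_MAJ_MASK)
#define TC_H_MIN(h)   ((h) & TC_H_MIN_MASK)
#define TC_H_MAKE(maj, min) (((maj) & TC_H_MAJ_MASK) | ((min) & TC_H_MIN_MASK))

/* "1:3" is major 1 (upper 16 bits), minor 3 (lower 16 bits). */
static uint32_t handle_1_3(void)
{
    return TC_H_MAKE(1U << 16, 3U);
}
```

Note how a qdisc handle such as "1:" has minor 0, consistent with the rule stated above.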
struct tcf_proto
{
    // Fast access part
    struct tcf_proto    *next;
    void                *root;
    int                 (*classify)(struct sk_buff *, struct tcf_proto *, struct tcf_result *);
    __be16              protocol;

    // All the rest
    u32                 prio;
    u32                 classid;
    struct Qdisc        *q;
    void                *data;
    struct tcf_proto_ops *ops;
};

struct tcf_proto_ops
{
    struct tcf_proto_ops *next;
    char                 kind[IFNAMSIZ];

    int             (*classify)(struct sk_buff *, struct tcf_proto *, struct tcf_result *);
    int             (*init)(struct tcf_proto *);
    void            (*destroy)(struct tcf_proto *);

    unsigned long   (*get)(struct tcf_proto *, u32 handle);
    void            (*put)(struct tcf_proto *, unsigned long);
    int             (*change)(struct tcf_proto *, unsigned long, u32 handle, struct nlattr **, unsigned long *);
    int             (*delete)(struct tcf_proto *, unsigned long);
    void            (*walk)(struct tcf_proto *, struct tcf_walker *arg);

    // rtnetlink specific
    int             (*dump)(struct tcf_proto *, unsigned long, struct sk_buff *skb, struct tcmsg *);

    struct module   *owner;
};

struct tcf_result
{
    unsigned long   class;
    u32             classid;
};

// net/sched/sch_api.c
// Main classifier routine: scans the classifier chain attached to this qdisc,
// (optionally) tests for protocol and asks specific classifiers.
int tc_classify(struct sk_buff *skb, struct tcf_proto *tp, struct tcf_result *res)
{
    int err = 0;
    __be16 protocol;
#ifdef CONFIG_NET_CLS_ACT
    struct tcf_proto *otp = tp;
reclassify:
#endif
    protocol = skb->protocol;
    err = tc_classify_compat(skb, tp, res);

#ifdef CONFIG_NET_CLS_ACT
    if (err == TC_ACT_RECLASSIFY)
    {
        u32 verd = G_TC_VERD(skb->tc_verd);
        tp = otp;

        if (verd++ >= MAX_REC_LOOP)
        {
            printk("rule prio %u protocol %02x reclassify loop, packet dropped\n",
                   tp->prio & 0xffff, ntohs(tp->protocol));
            return TC_ACT_SHOT;
        }
        skb->tc_verd = SET_TC_VERD(skb->tc_verd, verd);
        goto reclassify;
    }
#endif
    return err;
}

int tc_classify_compat(struct sk_buff *skb, struct tcf_proto *tp, struct tcf_result *res)
{
    __be16 protocol = skb->protocol;
    int err = 0;

    for (; tp; tp = tp->next)
    {
        if ((tp->protocol == protocol || tp->protocol == htons(ETH_P_ALL)) &&
            (err = tp->classify(skb, tp, res)) >= 0)
        {
#ifdef CONFIG_NET_CLS_ACT
            if (err != TC_ACT_RECLASSIFY && skb->tc_verd)
                skb->tc_verd = SET_TC_VERD(skb->tc_verd, 0);
#endif
            return err;
        }
    }
    return -1;
}
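The inner loop of tc_classify_compat() (walk the chain, skip filters whose protocol doesn't apply, return the first non-negative verdict) can be mimicked in userspace with mock types. The mock_filter struct and mock_classify function below are stand-ins invented for this sketch; a "mark" plays the role of the real classify callback's match logic.

```c
#include <assert.h>
#include <stddef.h>

#define ETH_P_ALL 0x0003    /* matches every protocol, as in the kernel */

/* Mock filter: applies to one protocol (or ETH_P_ALL) and "classifies" a
 * packet into classid when the packet's mark equals want_mark. */
struct mock_filter {
    struct mock_filter *next;
    int protocol;
    int want_mark;
    int classid;
};

/* Walk the chain like tc_classify_compat(): skip filters for other
 * protocols, return the first matching filter's classid, or -1 if no
 * filter matched. */
static int mock_classify(int pkt_protocol, int pkt_mark,
                         const struct mock_filter *tp)
{
    for (; tp; tp = tp->next) {
        if (tp->protocol != pkt_protocol && tp->protocol != ETH_P_ALL)
            continue;
        if (tp->want_mark == pkt_mark)
            return tp->classid;
    }
    return -1;
}
```

As in the kernel, filter order in the chain decides precedence: the first match wins.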
include/linux/pkt_sched.h // PRIO section

#define TCQ_PRIO_BANDS      16
#define TCQ_MIN_PRIO_BANDS  2
#define TC_PRIO_MAX         15

struct tc_prio_qopt
{
    int   bands;                    // Number of bands.
    __u8  priomap[TC_PRIO_MAX+1];   // Map: logical priority -> PRIO band
};

static int prio_init(struct Qdisc *sch, struct nlattr *opt)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    int i, err;

    for (i = 0; i < TCQ_PRIO_BANDS; i++)
        q->queues[i] = &noop_qdisc;

    if ((err = prio_tune(sch, opt)) != 0)
        return err;
    return 0;
}

static int prio_tune(struct Qdisc *sch, struct nlattr *opt)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    struct tc_prio_qopt *qopt;
    int i;

    qopt = nla_data(opt);

    if (qopt->bands > TCQ_PRIO_BANDS || qopt->bands < 2)
        return -EINVAL;

    for (i = 0; i <= TC_PRIO_MAX; i++)
    {
        if (qopt->priomap[i] >= qopt->bands)
            return -EINVAL;
    }

    sch_tree_lock(sch);
    q->bands = qopt->bands;
    memcpy(q->prio2band, qopt->priomap, TC_PRIO_MAX+1);

    // Bands beyond the configured count get the noop qdisc;
    // destroy any old child attached there.
    for (i = q->bands; i < TCQ_PRIO_BANDS; i++)
    {
        struct Qdisc *child = q->queues[i];
        q->queues[i] = &noop_qdisc;
        if (child != &noop_qdisc)
        {
            qdisc_tree_decrease_qlen(child, child->q.qlen);
            qdisc_destroy(child);
        }
    }
    sch_tree_unlock(sch);

    // Active bands without a child get a default pfifo qdisc.
    for (i = 0; i < q->bands; i++)
    {
        if (q->queues[i] == &noop_qdisc)
        {
            struct Qdisc *child, *old;

            child = qdisc_create_dflt(qdisc_dev(sch), sch->dev_queue,
                                      &pfifo_qdisc_ops,
                                      TC_H_MAKE(sch->handle, i + 1));
            if (child)
            {
                sch_tree_lock(sch);
                old = q->queues[i];
                q->queues[i] = child;

                if (old != &noop_qdisc)
                {
                    qdisc_tree_decrease_qlen(old, old->q.qlen);
                    qdisc_destroy(old);
                }
                sch_tree_unlock(sch);
            }
        }
    }
    return 0;
}
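The parameter validation at the top of prio_tune() is easy to reproduce in userspace: bands must lie in [2, 16], and every priomap entry must name an existing band. validate_prio_opt below is a hypothetical helper mirroring only those two checks.

```c
#include <assert.h>
#include <errno.h>

#define TCQ_PRIO_BANDS 16
#define TC_PRIO_MAX    15

/* Mirror of the sanity checks at the top of prio_tune(). */
static int validate_prio_opt(int bands,
                             const unsigned char priomap[TC_PRIO_MAX + 1])
{
    int i;

    if (bands > TCQ_PRIO_BANDS || bands < 2)
        return -EINVAL;

    /* Every priority must map to a band that actually exists. */
    for (i = 0; i <= TC_PRIO_MAX; i++)
        if (priomap[i] >= bands)
            return -EINVAL;

    return 0;
}
```

This is why the man page says the priomap must be updated when changing bands from the default of 3: entries naming a removed band would be rejected.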
static struct Qdisc *prio_leaf(struct Qdisc *sch, unsigned long arg)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    unsigned long band = arg - 1;

    return q->queues[band];
}

static unsigned long prio_get(struct Qdisc *sch, u32 classid)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    unsigned long band = TC_H_MIN(classid);

    if (band - 1 >= q->bands)
        return 0;
    return band;
}

// #define TC_H_MIN_MASK (0x0000FFFFU)
// #define TC_H_MIN(h) ((h) & TC_H_MIN_MASK)

static void prio_put(struct Qdisc *q, unsigned long cl)
{
    return;
}

static struct tcf_proto **prio_find_tcf(struct Qdisc *sch, unsigned long cl)
{
    struct prio_sched_data *q = qdisc_priv(sch);

    if (cl)
        return NULL;
    return &q->filter_list;
}

static unsigned long prio_bind(struct Qdisc *sch, unsigned long parent, u32 classid)
{
    return prio_get(sch, classid);
}

static int prio_dump_class(struct Qdisc *sch, unsigned long cl, struct sk_buff *skb, struct tcmsg *tcm)
{
    struct prio_sched_data *q = qdisc_priv(sch);

    tcm->tcm_handle |= TC_H_MIN(cl);
    tcm->tcm_info = q->queues[cl - 1]->handle;
    return 0;
}

static int prio_dump_class_stats(struct Qdisc *sch, unsigned long cl, struct gnet_dump *d)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    struct Qdisc *cl_q;

    cl_q = q->queues[cl - 1];
    cl_q->qstats.qlen = cl_q->q.qlen;

    if (gnet_stats_copy_basic(d, &cl_q->bstats) < 0 ||
        gnet_stats_copy_queue(d, &cl_q->qstats) < 0)
        return -1;
    return 0;
}
static int prio_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
    struct Qdisc *qdisc;
    int ret;

    qdisc = prio_classify(skb, sch, &ret);

#ifdef CONFIG_NET_CLS_ACT
    if (qdisc == NULL)
    {
        if (ret & __NET_XMIT_BYPASS)
            sch->qstats.drops++;
        kfree_skb(skb);
        return ret;
    }
#endif

    ret = qdisc_enqueue(skb, qdisc);    // enqueue to the band (class) qdisc
    if (ret == NET_XMIT_SUCCESS)
    {
        // Update the root prio qdisc's stats.
        sch->bstats.bytes += qdisc_pkt_len(skb);
        sch->bstats.packets++;
        sch->q.qlen++;
        return NET_XMIT_SUCCESS;
    }
    if (net_xmit_drop_count(ret))
        sch->qstats.drops++;
    return ret;
}

static struct sk_buff *prio_peek(struct Qdisc *sch)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    int prio;

    for (prio = 0; prio < q->bands; prio++)
    {
        struct Qdisc *qdisc = q->queues[prio];
        struct sk_buff *skb = qdisc->ops->peek(qdisc);
        if (skb)    // Peek at the highest priority band's qdisc first.
            return skb;
    }
    return NULL;
}

static struct sk_buff *prio_dequeue(struct Qdisc *sch)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    int prio;

    for (prio = 0; prio < q->bands; prio++)
    {
        struct Qdisc *qdisc = q->queues[prio];
        struct sk_buff *skb = qdisc->dequeue(qdisc);
        if (skb)    // Take from the highest priority band's qdisc first.
        {
            sch->q.qlen--;
            return skb;
        }
    }
    return NULL;
}
net/sched/sch_prio.c

The skb's priority field can be used to select the band (priority) that identifies the qdisc used to enqueue the packet: its upper 2 bytes must then contain the major handle of the prio qdisc. Otherwise tc_classify() is used to classify the packet against the filter list associated with the prio qdisc.

static struct Qdisc *prio_classify(struct sk_buff *skb, struct Qdisc *sch, int *qerr)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    u32 band = skb->priority;   // copy skb->priority into band
    struct tcf_result res;
    int err;

    *qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
    if (TC_H_MAJ(skb->priority) != sch->handle)
    {
        err = tc_classify(skb, q->filter_list, &res);
#ifdef CONFIG_NET_CLS_ACT
        switch (err)
        {
        case TC_ACT_STOLEN:
        case TC_ACT_QUEUED:
            *qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
        case TC_ACT_SHOT:
            return NULL;
        }
#endif
        if (!q->filter_list || err < 0)
        {
            if (TC_H_MAJ(band))
                band = 0;
            return q->queues[q->prio2band[band & TC_PRIO_MAX]];
        }
        band = res.classid;     // Important
    }
    band = TC_H_MIN(band) - 1;

    if (band >= q->bands)
        return q->queues[q->prio2band[0]];
    return q->queues[band];
}
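The band-selection fallback of prio_classify() — use skb->priority directly when it carries this qdisc's major handle, otherwise run it through prio2band — can be simulated in userspace. pick_band below is a stand-in with the filter path omitted; it is not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define TC_PRIO_MAX 15
#define TC_H_MAJ(h) ((h) & 0xFFFF0000U)
#define TC_H_MIN(h) ((h) & 0x0000FFFFU)

/* Mock of prio_classify()'s no-filter path: if skb->priority carries "our"
 * major handle, its minor selects the band directly (minor 1 = band 0);
 * otherwise map the priority through prio2band. Returns the band index. */
static unsigned int pick_band(uint32_t skb_priority, uint32_t sch_handle,
                              unsigned int bands,
                              const uint8_t prio2band[TC_PRIO_MAX + 1])
{
    uint32_t band = skb_priority;

    if (TC_H_MAJ(skb_priority) != sch_handle) {
        /* Not our handle (and no filters here): fall back to the priomap. */
        if (TC_H_MAJ(band))
            band = 0;
        return prio2band[band & TC_PRIO_MAX];
    }

    band = TC_H_MIN(band) - 1;      /* minor N selects band N-1 */
    if (band >= bands)              /* wraps for minor 0, caught here too */
        return prio2band[0];
    return band;
}
```

Note how the unsigned wrap-around for minor 0 (band becomes 0xFFFFFFFF) is caught by the same bounds check that rejects out-of-range bands, exactly as in the kernel function.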
static void prio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    int prio;

    if (arg->stop)
        return;

    for (prio = 0; prio < q->bands; prio++)
    {
        if (arg->count < arg->skip)
        {
            arg->count++;
            continue;
        }
        if (arg->fn(sch, prio + 1, arg) < 0)
        {
            arg->stop = 1;
            break;
        }
        arg->count++;
    }
}

static inline int qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
{
#ifdef CONFIG_NET_SCHED
    if (sch->stab)
        qdisc_calculate_pkt_len(skb, sch->stab);
#endif
    return sch->enqueue(skb, sch);
}
static void prio_reset(struct Qdisc *sch)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    int prio;

    for (prio = 0; prio < q->bands; prio++)
        qdisc_reset(q->queues[prio]);
    sch->q.qlen = 0;
}

static unsigned int prio_drop(struct Qdisc *sch)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    unsigned int len;
    struct Qdisc *qdisc;
    int prio;

    // Drop from the lowest priority (highest-numbered) band first.
    for (prio = q->bands - 1; prio >= 0; prio--)
    {
        qdisc = q->queues[prio];
        if (qdisc->ops->drop && (len = qdisc->ops->drop(qdisc)) != 0)
        {
            sch->q.qlen--;
            return len;
        }
    }
    return 0;
}

static void prio_destroy(struct Qdisc *sch)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    int prio;

    tcf_destroy_chain(&q->filter_list);
    for (prio = 0; prio < q->bands; prio++)
        qdisc_destroy(q->queues[prio]);
}

static int prio_dump(struct Qdisc *sch, struct sk_buff *skb)
{
    struct prio_sched_data *q = qdisc_priv(sch);
    unsigned char *b = skb_tail_pointer(skb);
    struct tc_prio_qopt opt;

    opt.bands = q->bands;
    memcpy(&opt.priomap, q->prio2band, TC_PRIO_MAX+1);

    NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);

    return skb->len;

nla_put_failure:
    nlmsg_trim(skb, b);
    return -1;
}
