While looking for a possible reason of bugzilla report on HTB oops:
http://bugzilla.kernel.org/show_bug.cgi?id=12858
I found the code in htb_delete calling htb_destroy_class on zero
refcount is very misleading: it can suggest this is a common path, and
destroy is called under sch_tree_lock. Actually, this can never happen
like this because before deletion cops->get() is done, and after
delete a class is still used by tclass_notify. The class destroy is
always called from cops->put(), so without sch_tree_lock.
This doesn't mean much now (since 2.6.27) because all vulnerable calls
were moved from htb_destroy_class to htb_delete, but there was a bug
in older kernels. The same change is done for other classful scheds,
which, it seems, didn't have similar locking problems here.
Reported-by: m0sia <m0sia@m0sia.ru>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy <kaber@trash.net> suggested using a workqueue instead
of hrtimers to trigger netif_schedule() when there is a problem with
setting exact time of this event: 'The differnce - yeah, it shouldn't
make much, mainly wake up the qdisc earlier (but not too early) after
"too many events" occured _and_ no further enqueue events wake up the
qdisc anyways.'
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Let's get some info on possible config problems. This patch brings
back an old warning, but is printed only once now.
With feedback from Patrick McHardy <kaber@trash.net>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy <kaber@trash.net> suggested:
> How about making this flag and the warning message (in a out-of-line
> function) globally available? Other qdiscs (f.i. HFSC) can't deal with
> inner non-work-conserving qdiscs as well.
This patch uses qdisc->flags field of "suspected" child qdisc.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently htb_do_events() breaks events recounting for a level after 2
jiffies, but there is no reason to repeat this for next levels and
increase delays even more (with softirqs disabled). htb_dequeue_tree()
can add to this too, btw. In such a case q->now time is invalid anyway.
Thanks to Patrick McHardy for spotting an error around earlier version
of this patch.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Next event time should consider jiffies used for recounting. Otherwise
qdisc_watchdog_schedule() triggers hrtimer immediately with the event
in the past, and may cause very high ksoftirqd cpu usage (if highres
is on).
There is also removed checking "event" for zero in htb_dequeue(): it's
always true in this place.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We can skip WARN_ON() in htb_dequeue_tree() because there should be
always a similar warning from htb_lookup_leaf() earlier.
The first WARN_ON() in in htb_lookup_leaf() is changed to BUG_ON()
because most likly this should end with oops anyway.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
htb_id_find_next_upper() is usually called to find a class with next
id after some previously removed class, so let's move a check for
equality to the end: it's the least likely here.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Replace HTB_ACCNT() macro with inlines to make it more readable.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
L2T() is currently used only in one place (and has one spurious
parameter, btw), so let's: 'get rid of L2T completely, and just
use "qdisc_l2t(rate, size)" directly.' - quote & feedback from
David S. Miller.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While implementing htb_parent_to_leaf() there where added backup prio
and quantum struct htb_class fields to preserve these values for inner
classes in case of their return to leaf. This patch cleans this a bit
by removing union leaf duplicates.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The functions gen_new_estimator and gen_replace_estimator can return
errors, but they were being ignored.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The use of xchg() hasn't been necessary since 2.2.something when proper
locking was added to packet schedulers.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
After implementing qdisc->ops->peek() and changing sch_netem into
classless qdisc there are no more qdisc->ops->requeue() users. This
patch removes this method with its wrappers (qdisc_requeue()), and
also unused qdisc->requeue structure. There are a few minor fixes of
warnings (htb_enqueue()) and comments btw.
The idea to kill ->requeue() and a similar patch were first developed
by David S. Miller.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds qdisc_peek_dequeued() wrapper to emulate peek method
with qdisc->dequeue() and storing "peeked" skb in qdisc->gso_skb until
dequeuing. This is mainly for compatibility reasons not to break some
strange configs because peeking is expected for non-work-conserving
parent qdiscs to query work-conserving child qdiscs.
This implementation requires using qdisc_dequeue_peeked() wrapper
instead of directly calling qdisc->dequeue() for all qdiscs ever
querried with qdisc->ops->peek() or qdisc_peek_dequeued().
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use qdisc_root_sleeping_lock() instead of qdisc_root_lock() where
appropriate. The only difference is while dev is deactivated, when
currently we can use a sleeping qdisc with the lock of noop_qdisc.
This shouldn't be dangerous since after deactivation root lock could
be used only by gen_estimator code, but looks wrong anyway.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While passing a qdisc root lock to gen_new_estimator() and
gen_replace_estimator() dev could be deactivated or even before
grafting proper root qdisc as qdisc_sleeping (e.g. qdisc_create), so
using qdisc_root_lock() is not enough. This patch adds
qdisc_root_sleeping_lock() for this, plus additional checks, where
necessary.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Based upon a bug report by Josip Rodin.
Packet schedulers should only return NET_XMIT_DROP iff
the packet really was dropped. If the packet does reach
the device after we return NET_XMIT_DROP then TCP can
crash because it depends upon the enqueue path return
values being accurate.
Signed-off-by: David S. Miller <davem@davemloft.net>
Recent changes introduced a bug in htb_delete(): cl->parent->children
counter update misses checking cl->parent for NULL, which is used for
root classes, so deleting them causes an oops.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy <kaber@trash.net> noticed that it would be nice to
handle NET_XMIT_BYPASS by NET_XMIT_SUCCESS with an internal qdisc flag
__NET_XMIT_BYPASS and to remove the mapping from dev_queue_xmit().
David Miller <davem@davemloft.net> spotted a serious bug in the first
version of this patch.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy <kaber@trash.net> noticed:
"The other problem that affects all qdiscs supporting actions is
TC_ACT_QUEUED/TC_ACT_STOLEN getting mapped to NET_XMIT_SUCCESS
even though the packet is not queued, corrupting upper qdiscs'
qlen counters."
and later explained:
"The reason why it translates it at all seems to be to not increase
the drops counter. Within a single qdisc this could be avoided by
other means easily, upper qdiscs would still increase the counter
when we return anything besides NET_XMIT_SUCCESS though.
This means we need a new NET_XMIT return value to indicate this to
the upper qdiscs. So I'd suggest to introduce NET_XMIT_STOLEN,
return that to upper qdiscs and translate it to NET_XMIT_SUCCESS
in dev_queue_xmit, similar to NET_XMIT_BYPASS."
David Miller <davem@davemloft.net> noticed:
"Maybe these NET_XMIT_* values being passed around should be a set of
bits. They could be composed of base meanings, combined with specific
attributes.
So you could say "NET_XMIT_DROP | __NET_XMIT_NO_DROP_COUNT"
The attributes get masked out by the top-level ->enqueue() caller,
such that the base meanings are the only thing that make their
way up into the stack. If it's only about communication within the
qdisc tree, let's simply code it that way."
This patch is trying to realize these ideas.
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Removes legacy reinvent-the-wheel type thing. The generic
machinery integrates much better to automated debugging aids
such as kerneloops.org (and others), and is unambiguous due to
better naming. Non-intuively BUG_TRAP() is actually equal to
WARN_ON() rather than BUG_ON() though some might actually be
promoted to BUG_ON() but I left that to future.
I could make at least one BUILD_BUG_ON conversion.
Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
When code wants to lock the qdisc tree state, the logic
operation it's doing is locking the top-level qdisc that
sits of the root of the netdev_queue.
Add qdisc_root_lock() to represent this and convert the
easiest cases.
In order for this to work out in all cases, we have to
hook up the noop_qdisc to a dummy netdev_queue.
Signed-off-by: David S. Miller <davem@davemloft.net>
The lock is now an attribute of the device queue.
One thing to notice is that "suspicious" places
emerge which will need specific training about
multiple queue handling. They are so marked with
explicit "netdev->rx_queue" and "netdev->tx_queue"
references.
Signed-off-by: David S. Miller <davem@davemloft.net>
It can be obtained via the netdev_queue. So create a helper routine,
qdisc_dev(), to make the transformations nicer looking.
Now, qdisc_alloc() now no longer needs a net_device pointer argument.
Signed-off-by: David S. Miller <davem@davemloft.net>
A netdev_queue is an entity managed by a qdisc.
Currently there is one RX and one TX queue, and a netdev_queue merely
contains a backpointer to the net_device.
The Qdisc struct is augmented with a netdev_queue pointer as well.
Eventually the 'dev' Qdisc member will go away and we will have the
resulting hierarchy:
net_device --> netdev_queue --> Qdisc
Also, qdisc_alloc() and qdisc_create_dflt() now take a netdev_queue
pointer argument.
Signed-off-by: David S. Miller <davem@davemloft.net>
The filter_cnt is supposed to count filter references to a class.
Since the qdisc can't be the target of a filter, it doesn't need
a filter_cnt. In fact the counter is never decreased since cls_api
considers a return value of zero a failure and doesn't unbind again.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that the qdisc isn't destroyed in hierarchical order anymore,
the only user of the child lists left is htb_parent_last_child().
This can be easily changed to use a counter of children to save
a few bytes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Hash list removal currently happens twice (once in htb_delete, once
in htb_destroy_class), which makes it harder to use the dynamically
sized class hash without adding special cases for HTB. The reason is
that qdisc destruction destroys classes in hierarchical order, which
is not necessary if filters are destroyed in a separate iteration
during qdisc destruction.
Adjust qdisc destruction to follow the same scheme as other hierarchical
qdiscs by first performing a filter destruction pass, then destroying
all classes in hash order.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pass double tcf_proto pointers to tcf_destroy_chain() to make it
clear the start of the filter list for more consistency.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Add a htb_hysteresis parameter to htb_sch.ko and by sysfs magic make
it runtime adjustable via
/sys/module/sch_htb/parameters/htb_hysteresis mode 640.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Acked-by: Martin Devera <devik@cdi.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
The HTB hysteresis mode reduce the CPU load, but at the
cost of scheduling accuracy.
On ADSL links (512 kbit/s upstream), this inaccuracy introduce
significant jitter, enought to disturbe VoIP. For details see my
masters thesis (http://www.adsl-optimizer.dk/thesis/), chapter 7,
section 7.3.1, pp 69-70.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Acked-by: Martin Devera <devik@cdi.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch removes CVS keywords that weren't updated for a long time
from comments.
Signed-off-by: Adrian Bunk <bunk@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There is lack of removing a class from the event queue while changing
from parent to leaf which can cause corruption of this rb tree. This
patch fixes a bug introduced by my patch: "sch_htb: turn intermediate
classes into leaves" commit: 160d5e10f8.
Many thanks to Jan 'yanek' Bortl for finding a way to reproduce this
rare bug and narrowing the test case, which made possible proper
diagnosing.
This patch is recommended for all kernels starting from 2.6.20.
Reported-and-tested-by: Jan 'yanek' Bortl <yanek@ya.bofh.cz>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
HTB is event driven algorithm and part of its work is to apply
scheduled events at proper times. It tried to defend itself from
livelock by processing only limited number of events per dequeue.
Because of faster computers some users already hit this hardcoded
limit.
This patch limits processing up to 2 jiffies (why not 1 jiffie ?
because it might stop prematurely when only fraction of jiffie
remains).
Signed-off-by: Martin Devera <devik@cdi.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
htb_requeue() enqueues skbs for which htb_classify() returns NULL.
This is wrong because such skbs could be handled by NET_CLS_ACT code,
and the decision could be different than earlier in htb_enqueue().
So htb_requeue() is changed to work and look more like htb_enqueue().
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use nla_nest_start/nla_nest_end for dumping nested attributes.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
nla_parse() returns more detailed errno codes, propagate them back on
error.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert packet schedulers to use the netlink API. Unfortunately a gradual
conversion is not possible without breaking compilation in the middle or
adding lots of casts, so this patch converts them all in one step. The
patch has been mostly generated automatically with some minor edits to
at least allow seperate conversion of classifiers and actions.
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Qdisc_class_ops are const, and Qdisc_ops are mostly read.
Using "const" and "__read_mostly" qualifiers helps to reduce false
sharing.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change L2T (length to time) macros, in all rate based schedulers, to
call a common function qdisc_l2t() that does the rate table lookup.
This function handles if the packet size lookup is larger than the
rate table, which often occurs with TSO enabled.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>