Ethernet switch device driver model (switchdev)
===============================================
Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com>


The Ethernet switch device driver model (switchdev) is an in-kernel driver
model for switch devices which offload the forwarding (data) plane from the
kernel.

Figure 1 is a block diagram showing the components of the switchdev model for
an example setup using a data-center-class switch ASIC chip. Other setups
with SR-IOV or soft switches, such as OVS, are possible.


                             User-space tools

       user space                   |
      +-------------------------------------------------------------------+
       kernel                       | Netlink
                                    |
                     +--------------+-------------------------------+
                     |         Network stack                        |
                     |           (Linux)                            |
                     |                                              |
                     +----------------------------------------------+

                           sw1p2     sw1p4     sw1p6
                      sw1p1  +  sw1p3  +  sw1p5  +          eth1
                        +    |    +    |    +    |            +
                        |    |    |    |    |    |            |
                     +--+----+----+----+----+----+---+  +-----+-----+
                     |         Switch driver         |  |    mgmt   |
                     |        (this document)        |  |   driver  |
                     |                               |  |           |
                     +--------------+----------------+  +-----------+
                                    |
       kernel                       | HW bus (eg PCI)
      +-------------------------------------------------------------------+
       hardware                     |
                     +--------------+----------------+
                     |         Switch device (sw1)   |
                     |  +----+                       +--------+
                     |  |    v offloaded data path   | mgmt port
                     |  |    |                       |
                     +--|----|----+----+----+----+---+
                        |    |    |    |    |    |
                        +    +    +    +    +    +
                       p1   p2   p3   p4   p5   p6

                             front-panel ports


                                    Fig 1.


Include Files
-------------

	#include <linux/netdevice.h>
	#include <net/switchdev.h>


Configuration
-------------

Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure that switchdev
model support is built for the driver.


Switch Ports
------------

On switchdev driver initialization, the driver will allocate and register a
struct net_device (using register_netdev()) for each enumerated physical switch
port, called the port netdev. A port netdev is the software representation of
the physical port and provides a conduit for control traffic to/from the
controller (the kernel) and the network, as well as an anchor point for higher
level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using
standard netdev tools (iproute2, ethtool, etc.), the port netdev can also
give the user access to the physical properties of the switch port, such
as PHY link state and I/O statistics.

There is (currently) no higher-level kernel object for the switch beyond the
port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops.

A switch management port is outside the scope of the switchdev driver model.
Typically, the management port does not participate in the offloaded data plane
and is driven by a different driver, such as a NIC driver, bound to the
management port device.

Switch ID
^^^^^^^^^

The switchdev driver must implement the switchdev op switchdev_port_attr_get
for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same
physical ID for each port of a switch. The ID must be unique between switches
on the same system. The ID does not need to be unique between switches on
different systems.

The switch ID is used to locate ports on a switch and to know if aggregated
ports belong to the same switch.

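As a rough illustration of how the switch ID groups ports, the following
userspace C sketch compares two port IDs byte for byte. The names phys_id and
same_switch, and the 32-byte limit, are assumptions made for this example and
are not the kernel's own definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PHYS_ID_MAX_LEN 32	/* assumed upper bound for this sketch */

/* Hypothetical stand-in for the ID a port reports as its parent ID. */
struct phys_id {
	unsigned char id[PHYS_ID_MAX_LEN];
	size_t len;
};

/* Two ports belong to the same switch when their IDs match exactly. */
static bool same_switch(const struct phys_id *a, const struct phys_id *b)
{
	return a->len == b->len && memcmp(a->id, b->id, a->len) == 0;
}
```

A caller deciding whether two bonded ports can share a hardware LAG would
call same_switch() on their IDs and fall back to software forwarding when it
returns false.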
Port Netdev Naming
^^^^^^^^^^^^^^^^^^

Udev rules should be used for port netdev naming, using some unique attribute
of the port as a key, for example the port MAC address or the port PHYS name.
Hard-coding of kernel netdev names within the driver is discouraged; let the
kernel pick the default netdev name, and let udev set the final name based on a
port attribute.

Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
useful for dynamically-named ports where the device names its ports based on
external configuration. For example, if a physical 40G port is split logically
into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
name for each port using port PHYS name. The udev rule would be:

SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \
	ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}"

Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0
would be sub-port 0 on port 1 on switch 1.

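The naming convention can be sketched in userspace C as below. The helper name
format_port_name is hypothetical, and dropping the "sZ" suffix for a port that
is not split is an assumption of this sketch (it matches names such as sw1p1
in Fig 1).

```c
#include <stdio.h>

/*
 * Format a port name following the suggested "swXpYsZ" convention.
 * A negative subport means the port is not split, so "sZ" is omitted
 * (an assumption of this sketch).
 * Returns the snprintf() result: the length the full name needs.
 */
static int format_port_name(char *buf, size_t len,
			    unsigned int sw, unsigned int port, int subport)
{
	if (subport < 0)
		return snprintf(buf, len, "sw%up%u", sw, port);
	return snprintf(buf, len, "sw%up%us%d", sw, port, subport);
}
```

Calling format_port_name(name, sizeof(name), 1, 1, 0) fills name with
"sw1p1s0", the example given above.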
Port Features
^^^^^^^^^^^^^

NETIF_F_NETNS_LOCAL

If the switchdev driver (and device) only supports offloading of the default
network namespace (netns), the driver should set this feature flag to prevent
the port netdev from being moved out of the default netns. A netns-aware
driver/device would not set this flag and would be responsible for partitioning
hardware to preserve netns containment. This means hardware cannot forward
traffic from a port in one namespace to another port in another namespace.

Port Topology
^^^^^^^^^^^^^

The port netdevs representing the physical switch ports can be organized into
higher-level switching constructs. The default construct is a standalone
router port, used to offload L3 forwarding. Two or more ports can be bonded
together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge
L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3
tunnels can be built on ports. These constructs are built using standard Linux
tools such as the bridge driver, the bonding/team drivers, and netlink-based
tools such as iproute2.

The switchdev driver can know a particular port's position in the topology by
monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a
bond will see its upper master change. If that bond is moved into a bridge,
the bond's upper master will change. And so on. The driver tracks such
movements, by registering for netdevice events and acting on
NETDEV_CHANGEUPPER, to know what position a port occupies in the overall
topology.

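The bookkeeping described above can be modelled in a few lines of userspace C.
This is only a sketch: struct dev, change_upper() and find_bridge() are
hypothetical names, and a real driver would react to NETDEV_CHANGEUPPER from
its netdevice notifier rather than call a helper directly.

```c
#include <stddef.h>

enum dev_kind { DEV_PORT, DEV_BOND, DEV_BRIDGE };

struct dev {
	enum dev_kind kind;
	struct dev *upper;	/* current upper master, NULL if none */
};

/* Handle a (modelled) CHANGEUPPER event: record or clear the upper master. */
static void change_upper(struct dev *dev, struct dev *upper, int linking)
{
	dev->upper = linking ? upper : NULL;
}

/* Walk the chain of upper masters to find the bridge a port sits under. */
static struct dev *find_bridge(struct dev *dev)
{
	for (; dev; dev = dev->upper)
		if (dev->kind == DEV_BRIDGE)
			return dev;
	return NULL;
}
```

After a port is moved into a bond and the bond into a bridge, find_bridge()
on the port returns that bridge; once the bond leaves the bridge it returns
NULL again.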
L2 Forwarding Offload
---------------------

The idea is to offload the L2 data forwarding (switching) path from the kernel
to the switchdev device by mirroring bridge FDB entries down to the device. An
FDB entry is the {port, MAC, VLAN} tuple forwarding destination.

To offload L2 bridging, the switchdev driver/device should support:

	- Static FDB entries installed on a bridge port
	- Notification of learned/forgotten src mac/vlans from device
	- STP state changes on the port
	- VLAN flooding of multicast/broadcast and unknown unicast packets

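To make the {port, MAC, VLAN} tuple concrete, here is a minimal userspace
sketch of an FDB keyed by MAC and VLAN that yields a destination port. The
table size and the names fdb_entry, fdb_add and fdb_lookup are assumptions for
illustration; they are not bridge or switchdev APIs.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FDB_SIZE 8	/* tiny table, for illustration only */

struct fdb_entry {
	bool used;
	uint8_t mac[6];
	uint16_t vid;
	int port;	/* forwarding destination */
};

static struct fdb_entry fdb[FDB_SIZE];

/* Find the entry keyed by {MAC, VLAN}, or NULL if none. */
static struct fdb_entry *fdb_find(const uint8_t *mac, uint16_t vid)
{
	for (int i = 0; i < FDB_SIZE; i++)
		if (fdb[i].used && fdb[i].vid == vid &&
		    memcmp(fdb[i].mac, mac, 6) == 0)
			return &fdb[i];
	return NULL;
}

/* Add or update an entry; returns -1 when the table is full. */
static int fdb_add(const uint8_t *mac, uint16_t vid, int port)
{
	struct fdb_entry *e = fdb_find(mac, vid);

	for (int i = 0; !e && i < FDB_SIZE; i++)
		if (!fdb[i].used)
			e = &fdb[i];
	if (!e)
		return -1;
	e->used = true;
	memcpy(e->mac, mac, 6);
	e->vid = vid;
	e->port = port;
	return 0;
}

/* Return the destination port, or -1 to signal "unknown, flood". */
static int fdb_lookup(const uint8_t *mac, uint16_t vid)
{
	struct fdb_entry *e = fdb_find(mac, vid);

	return e ? e->port : -1;
}
```

A lookup miss (-1) corresponds to the unknown unicast case in the list above,
where the packet is flooded within the VLAN.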
Static FDB Entries
^^^^^^^^^^^^^^^^^^

The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
to support static FDB entries installed to the device. Static bridge FDB
entries are installed, for example, using the iproute2 bridge command:

	bridge fdb add ADDR dev DEV [vlan VID] [self]

The driver should use the helper switchdev_port_fdb_xxx ops for ndo_fdb_xxx
ops, and handle add/delete/dump of SWITCHDEV_OBJ_ID_PORT_FDB object using
switchdev_port_obj_xxx ops.

XXX: what should be done if offloading this rule to hardware fails (for
example, due to full capacity in hardware tables)?

Note: by default, the bridge does not filter on VLAN and only bridges untagged
traffic. To enable VLAN support, turn on VLAN filtering:

	echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering

Notification of Learned/Forgotten Source MAC/VLANs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The switch device will learn/forget source MAC address/VLAN on ingress packets
and notify the switch driver of the mac/vlan/port tuples. The switch driver,
in turn, will notify the bridge driver using the switchdev notifier call:

	err = call_switchdev_notifiers(val, dev, info);

Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when
forgetting, and info points to a struct switchdev_notifier_fdb_info. On
SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the
bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge
command will label these entries "offload":

	$ bridge fdb
	52:54:00:12:35:01 dev sw1p1 master br0 permanent
	00:02:00:00:02:00 dev sw1p1 master br0 offload
	00:02:00:00:02:00 dev sw1p1 self
	52:54:00:12:35:02 dev sw1p2 master br0 permanent
	00:02:00:00:03:00 dev sw1p2 master br0 offload
	00:02:00:00:03:00 dev sw1p2 self
	33:33:00:00:00:01 dev eth0 self permanent
	01:00:5e:00:00:01 dev eth0 self permanent
	33:33:ff:00:00:00 dev eth0 self permanent
	01:80:c2:00:00:0e dev eth0 self permanent
	33:33:00:00:00:01 dev br0 self permanent
	01:00:5e:00:00:01 dev br0 self permanent
	33:33:ff:12:35:01 dev br0 self permanent

Learning on the port should be disabled on the bridge using the bridge command:

	bridge link set dev DEV learning off

Learning on the device port should be enabled, as well as learning_sync:

	bridge link set dev DEV learning on self
	bridge link set dev DEV learning_sync on self

The learning_sync attribute enables syncing of the learned/forgotten FDB entry
to the bridge's FDB. It's possible, but not optimal, to enable learning on the
device port and on the bridge port, and disable learning_sync.

To support learning and learning_sync port attributes, the driver implements
switchdev op switchdev_port_attr_get/set for
SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS. The driver should initialize the attributes
to the hardware defaults.

FDB Ageing
^^^^^^^^^^

The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
the responsibility of the port driver/device to age out these entries. If the
port device supports ageing, when the FDB entry expires, it will notify the
driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the
device does not support ageing, the driver can simulate ageing using a
garbage collection timer to monitor FDB entries. Expired entries will be
notified to the bridge using SWITCHDEV_FDB_DEL. See the rocker driver for an
example of a driver running an ageing timer.

To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The
notification will reset the FDB entry's last-used time to now. The driver
should rate limit refresh notifications, for example, no more than once a
second. (The last-used time is visible using the bridge -s fdb option).

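The two timing decisions above (rate-limiting refreshes and expiring idle
entries) can be sketched in userspace C as follows. Times are plain
millisecond counters; struct learned_entry, may_refresh() and has_expired()
are hypothetical names, and this is not how the rocker driver is written.

```c
#include <stdbool.h>
#include <stdint.h>

#define REFRESH_INTERVAL_MS 1000	/* "no more than once a second" */

struct learned_entry {
	uint64_t last_refresh_ms;	/* time of the last refresh sent */
	uint64_t last_seen_ms;		/* last time hardware saw the address */
};

/* True if a refresh notification may be sent now without exceeding the rate. */
static bool may_refresh(struct learned_entry *e, uint64_t now_ms)
{
	if (now_ms - e->last_refresh_ms < REFRESH_INTERVAL_MS)
		return false;
	e->last_refresh_ms = now_ms;
	return true;
}

/* True if the entry has not been seen for ageing_ms and should be deleted. */
static bool has_expired(const struct learned_entry *e, uint64_t now_ms,
			uint64_t ageing_ms)
{
	return now_ms - e->last_seen_ms >= ageing_ms;
}
```

In this model a driver would send the SWITCHDEV_FDB_ADD refresh only when
may_refresh() returns true, and a periodic garbage collection pass would send
SWITCHDEV_FDB_DEL for every entry where has_expired() is true.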
STP State Change on Port
^^^^^^^^^^^^^^^^^^^^^^^^

Internally or with a third-party STP protocol implementation (e.g. mstpd), the
bridge driver maintains the STP state for ports, and will notify the switch
driver of STP state change on a port using the switchdev op
switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_STP_STATE.

State is one of BR_STATE_*. The switch driver can use STP state updates to
update the ingress packet filter list for the port. For example, if the port is
DISABLED, no packets should pass, but if the port moves to BLOCKED, then STP
BPDUs and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass.

Note that STP BPDUs are untagged and STP state applies to all VLANs on the port
so packet filters should be applied consistently across untagged and tagged
VLANs on the port.

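A minimal userspace sketch of such an ingress filter is shown below. The enum
is a local model rather than the kernel's BR_STATE_* definitions, and treating
the listening and learning states like blocking is an assumption of this
sketch; the text above only specifies the disabled and blocked cases.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local model of the bridge port states; values are this sketch's own. */
enum stp_state {
	STP_DISABLED,
	STP_LISTENING,
	STP_LEARNING,
	STP_FORWARDING,
	STP_BLOCKING,
};

/* True for destination addresses in the IEEE 01:80:c2:xx:xx:xx range. */
static bool is_ieee_link_local(const uint8_t *dst)
{
	return dst[0] == 0x01 && dst[1] == 0x80 && dst[2] == 0xc2;
}

/* Decide whether an ingress packet passes the port's filter. */
static bool stp_ingress_allowed(enum stp_state state, const uint8_t *dst)
{
	switch (state) {
	case STP_DISABLED:
		return false;	/* nothing passes */
	case STP_FORWARDING:
		return true;	/* everything passes */
	default:
		/* blocking; listening/learning treated alike (assumption) */
		return is_ieee_link_local(dst);
	}
}
```

The decision does not look at the VLAN tag at all, which mirrors the note that
the filter must behave the same for untagged and tagged traffic.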
Flooding L2 domain
^^^^^^^^^^^^^^^^^^

For a given L2 VLAN domain, the switch device should flood multicast/broadcast
and unknown unicast packets to all ports in the domain, if allowed by the port's
current STP state. The switch driver, knowing which ports are within which
VLAN L2 domain, can program the switch device for flooding. The packet may
be sent to the port netdev for processing by the bridge driver. The
bridge should not reflood the packet to the same ports the device flooded,
otherwise there will be duplicate packets on the wire.

To avoid duplicate packets, the switch driver should mark a packet as already
forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark
the skb using the ingress bridge port's mark and prevent it from being forwarded
through any bridge port with the same mark.

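The duplicate-suppression rule can be modelled in userspace C as below. The
structures are simplified stand-ins (struct packet is not an skb), and the
function names are hypothetical; the sketch only shows the comparison of
marks, assuming that all ports of one switch share a mark.

```c
#include <stdbool.h>

struct bridge_port {
	int offload_fwd_mark;	/* ports of one switch share a mark */
};

struct packet {
	bool hw_forwarded;	/* set by the switch driver */
	int fwd_mark;		/* copied from the ingress port by the bridge */
};

/* Bridge ingress: remember the ingress port's mark if hardware forwarded. */
static void bridge_ingress(struct packet *pkt, const struct bridge_port *in)
{
	pkt->fwd_mark = pkt->hw_forwarded ? in->offload_fwd_mark : 0;
}

/* Bridge egress check: skip ports the hardware has already covered. */
static bool bridge_may_forward(const struct packet *pkt,
			       const struct bridge_port *out)
{
	return !(pkt->hw_forwarded && pkt->fwd_mark == out->offload_fwd_mark);
}
```

With two switch ports sharing mark 1 and a software port with mark 2, a packet
flooded by hardware on the first switch port is not sent again to the second,
but is still forwarded to the software port.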
It is possible for the switch device to not handle flooding and push the
packets up to the bridge driver for flooding. This is not ideal as the number
of ports in the L2 domain scales, because the device is much more efficient at
flooding packets than software.

If supported by the device, flood control can be offloaded to it, preventing
certain netdevs from flooding unicast traffic for which there is no FDB entry.

IGMP Snooping
^^^^^^^^^^^^^

In order to support IGMP snooping, the port netdevs should trap to the bridge
driver all IGMP join and leave messages.
The bridge multicast module will notify port netdevs of every multicast group
change, whether the group is statically configured or dynamically joined/left.
The hardware implementation should forward all registered multicast
traffic groups only to the configured ports.

L3 Routing Offload
------------------

Offloading L3 routing requires that the device be programmed with FIB entries
from the kernel, with the device doing the FIB lookup and forwarding. The device
does a longest prefix match (LPM) on FIB entries matching the route prefix and
forwards the packet to the matching FIB entry's nexthop(s) egress ports.

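For readers unfamiliar with LPM, the following userspace C sketch performs the
lookup with a linear scan over a small IPv4 table. Real devices use dedicated
lookup hardware; struct fib_route and lpm_lookup() are names invented for this
example, and addresses are kept in host byte order for simplicity.

```c
#include <stddef.h>
#include <stdint.h>

struct fib_route {
	uint32_t dst;		/* prefix, host byte order */
	int dst_len;		/* prefix length, 0..32 */
	int egress_port;
};

/* Build the netmask for a prefix length; a /0 route matches everything. */
static uint32_t prefix_mask(int len)
{
	return len == 0 ? 0 : ~(uint32_t)0 << (32 - len);
}

/* Return the matching route with the longest prefix, or NULL if none. */
static const struct fib_route *lpm_lookup(const struct fib_route *tbl,
					  size_t n, uint32_t addr)
{
	const struct fib_route *best = NULL;

	for (size_t i = 0; i < n; i++) {
		uint32_t mask = prefix_mask(tbl[i].dst_len);

		if ((addr & mask) != (tbl[i].dst & mask))
			continue;
		if (!best || tbl[i].dst_len > best->dst_len)
			best = &tbl[i];
	}
	return best;
}
```

With routes for 11.0.0.0/30, 11.0.0.8/30 and a default route in the table, a
lookup of 11.0.0.9 selects the 11.0.0.8/30 entry, while an address outside
both /30 prefixes falls through to the default route.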
To program the device, the driver has to register a FIB notifier handler
using register_fib_notifier. The following events are available:

	FIB_EVENT_ENTRY_ADD: used for both adding a new FIB entry to the device,
	                     or modifying an existing entry on the device.
	FIB_EVENT_ENTRY_DEL: used for removing a FIB entry
	FIB_EVENT_RULE_ADD, FIB_EVENT_RULE_DEL: used to propagate FIB rule changes

FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:

	struct fib_entry_notifier_info {
		struct fib_notifier_info info; /* must be first */
		u32 dst;
		int dst_len;
		struct fib_info *fi;
		u8 tos;
		u8 type;
		u32 tb_id;
		u32 nlflags;
	};

to add/modify/delete IPv4 dst/dst_len prefix on table tb_id. The *fi
structure holds details on the route and route's nexthops. *dev is one of the
port netdevs mentioned in the route's next hop list.

Routes offloaded to the device are labeled with "offload" in the ip route
listing:

	$ ip route show
	default via 192.168.0.2 dev eth0
	11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload
	11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
	11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload
	11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
	12.0.0.2 proto zebra metric 30 offload
		nexthop via 11.0.0.1 dev sw1p1 weight 1
		nexthop via 11.0.0.9 dev sw1p2 weight 1
	12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload
	12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload
	192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15

The "offload" flag is set in case at least one device offloads the FIB entry.

XXX: add/mod/del IPv6 FIB API

Nexthop Resolution
^^^^^^^^^^^^^^^^^^

The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
the switch device to forward the packet with the correct dst mac address, the
nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac
address discovery comes via the ARP (or ND) process and is available via the
arp_tbl neighbor table. To resolve the route's nexthop gateways, the driver
should trigger the kernel's neighbor resolution process. See the rocker
driver's rocker_port_ipv4_resolve() for an example.

The driver can monitor for updates to arp_tbl using the netevent notifier
NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops
for the routes as arp_tbl is updated. The driver implements ndo_neigh_destroy
to know when arp_tbl neighbor entries are purged from the port.

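The dependency between a nexthop and its neighbor entry can be modelled in
userspace C as below. This is a sketch only: struct neigh here is a local
type, not the kernel's neighbour structure, and the three helpers are
hypothetical stand-ins for reacting to a neighbor update, to a neighbor being
destroyed, and for programming the hardware.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct neigh {
	uint32_t gw;		/* nexthop gateway IPv4 address */
	bool resolved;
	uint8_t mac[6];		/* valid only when resolved */
};

/* Modelled neighbor update: record the MAC now known for the gateway. */
static void neigh_update(struct neigh *n, const uint8_t *mac)
{
	memcpy(n->mac, mac, 6);
	n->resolved = true;
}

/* Modelled neighbor removal: the nexthop can no longer be used as is. */
static void neigh_destroy(struct neigh *n)
{
	n->resolved = false;
}

/* Copy the dst MAC for hardware; false means "not resolved yet". */
static bool nexthop_dst_mac(const struct neigh *n, uint8_t *mac_out)
{
	if (!n->resolved)
		return false;
	memcpy(mac_out, n->mac, 6);
	return true;
}
```

A route whose nexthop_dst_mac() returns false cannot yet be forwarded in
hardware; in this model the driver would program it once neigh_update() has
run for that gateway.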
Transaction item queue
^^^^^^^^^^^^^^^^^^^^^^

For switchdev ops attr_set and obj_add, a 2-phase transaction model is used.
The first phase is to "prepare" anything needed, including various checks,
memory allocation, etc. The goal is to handle anything that might fail in this
phase. The second phase is to "commit" the actual changes.

Switchdev provides an infrastructure for sharing items (for example memory
allocations) between the two phases.

The object is created by the driver in the "prepare" phase and queued up by:
	switchdev_trans_item_enqueue()
During the "commit" phase, the driver gets the object by:
	switchdev_trans_item_dequeue()

If a transaction is aborted during "prepare" phase, switchdev code will handle
cleanup of the queued-up objects.

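The prepare/commit hand-off can be illustrated with a small userspace queue.
This is a model of the idea only: struct trans, trans_enqueue(),
trans_dequeue() and trans_abort() are names invented for this sketch and do
not reproduce the signatures of the switchdev helpers named above.

```c
#include <stdlib.h>

struct trans_item {
	struct trans_item *next;
	void *data;
};

struct trans {
	struct trans_item *head, *tail;
};

/* "prepare" phase: queue a preallocated object; returns -1 on failure. */
static int trans_enqueue(struct trans *t, void *data)
{
	struct trans_item *it = malloc(sizeof(*it));

	if (!it)
		return -1;
	it->data = data;
	it->next = NULL;
	if (t->tail)
		t->tail->next = it;
	else
		t->head = it;
	t->tail = it;
	return 0;
}

/* "commit" phase: take objects back in the order they were queued. */
static void *trans_dequeue(struct trans *t)
{
	struct trans_item *it = t->head;
	void *data;

	if (!it)
		return NULL;
	t->head = it->next;
	if (!t->head)
		t->tail = NULL;
	data = it->data;
	free(it);
	return data;
}

/* Abort: free everything still queued (objects must be non-NULL, malloc'd). */
static void trans_abort(struct trans *t)
{
	void *data;

	while ((data = trans_dequeue(t)))
		free(data);
}
```

Because every allocation happens before the commit phase starts, the commit
code in this model only dequeues and uses memory that already exists, so it
has no allocation failure path of its own.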