Commit Graph

950 Commits

Author SHA1 Message Date
Marcelo Leitner 88eab472ec netfilter: conntrack: adjust nf_conntrack_buckets default value
Manually bumping either nf_conntrack_buckets or nf_conntrack_max has
become a common task as our Linux servers tend to serve more and more
clients/applications, so let's adjust nf_conntrack_buckets this to a
more updated value.

Now for systems with more than 4GB of memory, nf_conntrack_buckets
becomes 65536 instead of 16384, resulting in nf_conntrack_max=256k
entries.

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2014-12-23 14:20:10 +01:00
Duan Jiong fd223068fc fib_trie.txt: fix typo
Fix the typo, there should be "It".
On the other hand, fix whitespace errors detected by checkpatch.pl

Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-15 11:45:15 -05:00
Rami Rosen 6dc696401a Documentation (ixgbe.txt): use a decimal address.
This patch fixes the erronous usage of an hexadecimal address in the
example, by replacing it with a decimal address.

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09 16:02:32 -05:00
Willem de Bruijn cbd3aad5ce net-timestamp: expand documentation and test
Documentation:
  expand explanation of timestamp counter

Test:
  new: flag -I requests and prints PKTINFO
  new: flag -x prints payload (possibly truncated)
  fix: remove pretty print that breaks common flag '-l 1'

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-08 20:20:48 -05:00
Willem de Bruijn 829ae9d611 net-timestamp: allow reading recv cmsg on errqueue with origin tstamp
Allow reading of timestamps and cmsg at the same time on all relevant
socket families. One use is to correlate timestamps with egress
device, by asking for cmsg IP_PKTINFO.

on AF_INET sockets, call the relevant function (ip_cmsg_recv). To
avoid changing legacy expectations, only do so if the caller sets a
new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.

on AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
returned for all origins. only change is to set ifindex, which is
not initialized for all error origins.

In both cases, only generate the pktinfo message if an ifindex is
known. This is not the case for ACK timestamps.

The difference between the protocol families is probably a historical
accident as a result of the different conditions for generating cmsg
in the relevant ip(v6)_recv_error function:

ipv4:        if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
ipv6:        if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {

At one time, this was the same test bar for the ICMP/ICMP6
distinction. This is no longer true.

Signed-off-by: Willem de Bruijn <willemb@google.com>

----

Changes
  v1 -> v2
    large rewrite
    - integrate with existing pktinfo cmsg generation code
    - on ipv4: only send with new flag, to maintain legacy behavior
    - on ipv6: send at most a single pktinfo cmsg
    - on ipv6: initialize fields if not yet initialized

The recv cmsg interfaces are also relevant to the discussion of
whether looping packet headers is problematic. For v6, cmsgs that
identify many headers are already returned. This patch expands
that to v4. If it sounds reasonable, I will follow with patches

1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
   (http://patchwork.ozlabs.org/patch/366967/)
2. sysctl to conditionally drop all timestamps that have payload or
   cmsg from users without CAP_NET_RAW.
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-08 20:20:48 -05:00
Jiri Pirko 007f790c82 net: introduce generic switch devices support
The goal of this is to provide a possibility to support various switch
chips. Drivers should implement relevant ndos to do so. Now there is
only one ndo defined:
- for getting physical switch id is in place.

Note that user can use random port netdevice to access the switch.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Thomas Graf <tgraf@suug.ch>
Acked-by: Andy Gospodarek <gospo@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-02 20:01:20 -08:00
David S. Miller 60b7379dc5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-11-29 20:47:48 -08:00
Andrew Lutomirski 138a7f4927 net-timestamp: Fix a documentation typo
SOF_TIMESTAMPING_OPT_ID puts the id in ee_data, not ee_info.

Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-25 13:35:26 -05:00
Mahesh Bandewar 2ad7bf3638 ipvlan: Initial check-in of the IPVLAN driver.
This driver is very similar to the macvlan driver except that it
uses L3 on the frame to determine the logical interface while
functioning as packet dispatcher. It inherits L2 of the master
device hence the packets on wire will have the same L2 for all
the packets originating from all virtual devices off of the same
master device.

This driver was developed keeping the namespace use-case in
mind. Hence most of the examples given here take that as the
base setup where main-device belongs to the default-ns and
virtual devices are assigned to the additional namespaces.

The device operates in two different modes and the difference
in these two modes in primarily in the TX side.

(a) L2 mode : In this mode, the device behaves as a L2 device.
TX processing upto L2 happens on the stack of the virtual device
associated with (namespace). Packets are switched after that
into the main device (default-ns) and queued for xmit.

RX processing is simple and all multicast, broadcast (if
applicable), and unicast belonging to the address(es) are
delivered to the virtual devices.

(b) L3 mode : In this mode, the device behaves like a L3 device.
TX processing upto L3 happens on the stack of the virtual device
associated with (namespace). Packets are switched to the
main-device (default-ns) for the L2 processing. Hence the routing
table of the default-ns will be used in this mode.

RX processins is somewhat similar to the L2 mode except that in
this mode only Unicast packets are delivered to the virtual device
while main-dev will handle all other packets.

The devices can be added using the "ip" command from the iproute2
package -

	ip link add link <master> <virtual> type ipvlan mode [ l2 | l3 ]

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Laurent Chavey <chavey@google.com>
Cc: Tim Hockin <thockin@google.com>
Cc: Brandon Philips <brandon.philips@coreos.com>
Cc: Pavel Emelianov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-24 15:29:18 -05:00
Giuseppe CAVALLARO 233b36cf1f stmmac: update driver documentation
Recently many changes have been done inside the driver
so this patch updates the driver's doc for example reviewing
information for the rx and tx processes that are managed
by napi method, adding new information for missing glue-logic files
etc.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-19 15:04:57 -05:00
David S. Miller 4e84b496fd Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-11-06 22:01:18 -05:00
Loganaden Velvindron 219b5f29a5 net: Add missing descriptions for fwmark_reflect for ipv4 and ipv6.
It was initially sent by Lorenzo Colitti, but was subsequently
lost in the final diff he submitted.

Signed-off-by: Loganaden Velvindron <logan@elandsys.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-05 15:43:57 -05:00
Erik Kline 7fd2561e4e net: ipv6: Add a sysctl to make optimistic addresses useful candidates
Add a sysctl that causes an interface's optimistic addresses
to be considered equivalent to other non-deprecated addresses
for source address selection purposes.  Preferred addresses
will still take precedence over optimistic addresses, subject
to other ranking in the source address selection algorithm.

This is useful where different interfaces are connected to
different networks from different ISPs (e.g., a cell network
and a home wifi network).

The current behaviour complies with RFC 3484/6724, and it
makes sense if the host has only one interface, or has
multiple interfaces on the same network (same or cooperating
administrative domain(s), but not in the multiple distinct
networks case.

For example, if a mobile device has an IPv6 address on an LTE
network and then connects to IPv6-enabled wifi, while the wifi
IPv6 address is undergoing DAD, IPv6 connections will try use
the wifi default route with the LTE IPv6 address, and will get
stuck until they time out.

Also, because optimistic nodes can receive frames, issue
an RTM_NEWADDR as soon as DAD starts (with the IFA_F_OPTIMSTIC
flag appropriately set).  A second RTM_NEWADDR is sent if DAD
completes (the address flags have changed), otherwise an
RTM_DELADDR is sent.

Also: add an entry in ip-sysctl.txt for optimistic_dad.

Signed-off-by: Erik Kline <ek@google.com>
Acked-by: Lorenzo Colitti <lorenzo@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29 15:11:36 -04:00
Eric Dumazet dca145ffaa tcp: allow for bigger reordering level
While testing upcoming Yaogong patch (converting out of order queue
into an RB tree), I hit the max reordering level of linux TCP stack.

Reordering level was limited to 127 for no good reason, and some
network setups [1] can easily reach this limit and get limited
throughput.

Allow a new max limit of 300, and add a sysctl to allow admins to even
allow bigger (or lower) values if needed.

[1] Aggregation of links, per packet load balancing, fabrics not doing
 deep packet inspections, alternative TCP congestion modules...

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yaogong Wang <wygivan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-29 15:05:15 -04:00
Li RongQing 1a9525f68e Documentation: replace __sk_run_filter with __bpf_prog_run
__sk_run_filter has been renamed as __bpf_prog_run, so replace them in comments

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-10 15:10:50 -04:00
Linus Torvalds 35a9ad8af0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Most notable changes in here:

   1) By far the biggest accomplishment, thanks to a large range of
      contributors, is the addition of multi-send for transmit.  This is
      the result of discussions back in Chicago, and the hard work of
      several individuals.

      Now, when the ->ndo_start_xmit() method of a driver sees
      skb->xmit_more as true, it can choose to defer the doorbell
      telling the driver to start processing the new TX queue entires.

      skb->xmit_more means that the generic networking is guaranteed to
      call the driver immediately with another SKB to send.

      There is logic added to the qdisc layer to dequeue multiple
      packets at a time, and the handling mis-predicted offloads in
      software is now done with no locks held.

      Finally, pktgen is extended to have a "burst" parameter that can
      be used to test a multi-send implementation.

      Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4,
      virtio_net

      Adding support is almost trivial, so export more drivers to
      support this optimization soon.

      I want to thank, in no particular or implied order, Jesper
      Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal
      Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann,
      David Tat, Hannes Frederic Sowa, and Rusty Russell.

   2) PTP and timestamping support in bnx2x, from Michal Kalderon.

   3) Allow adjusting the rx_copybreak threshold for a driver via
      ethtool, and add rx_copybreak support to enic driver.  From
      Govindarajulu Varadarajan.

   4) Significant enhancements to the generic PHY layer and the bcm7xxx
      driver in particular (EEE support, auto power down, etc.) from
      Florian Fainelli.

   5) Allow raw buffers to be used for flow dissection, allowing drivers
      to determine the optimal "linear pull" size for devices that DMA
      into pools of pages.  The objective is to get exactly the
      necessary amount of headers into the linear SKB area pre-pulled,
      but no more.  The new interface drivers use is eth_get_headlen().
      From WANG Cong, with driver conversions (several had their own
      by-hand duplicated implementations) by Alexander Duyck and Eric
      Dumazet.

   6) Support checksumming more smoothly and efficiently for
      encapsulations, and add "foo over UDP" facility.  From Tom
      Herbert.

   7) Add Broadcom SF2 switch driver to DSA layer, from Florian
      Fainelli.

   8) eBPF now can load programs via a system call and has an extensive
      testsuite.  Alexei Starovoitov and Daniel Borkmann.

   9) Major overhaul of the packet scheduler to use RCU in several major
      areas such as the classifiers and rate estimators.  From John
      Fastabend.

  10) Add driver for Intel FM10000 Ethernet Switch, from Alexander
      Duyck.

  11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric
      Dumazet.

  12) Add Datacenter TCP congestion control algorithm support, From
      Florian Westphal.

  13) Reorganize sk_buff so that __copy_skb_header() is significantly
      faster.  From Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits)
  netlabel: directly return netlbl_unlabel_genl_init()
  net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers
  net: description of dma_cookie cause make xmldocs warning
  cxgb4: clean up a type issue
  cxgb4: potential shift wrapping bug
  i40e: skb->xmit_more support
  net: fs_enet: Add NAPI TX
  net: fs_enet: Remove non NAPI RX
  r8169:add support for RTL8168EP
  net_sched: copy exts->type in tcf_exts_change()
  wimax: convert printk to pr_foo()
  af_unix: remove 0 assignment on static
  ipv6: Do not warn for informational ICMP messages, regardless of type.
  Update Intel Ethernet Driver maintainers list
  bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING
  tipc: fix bug in multicast congestion handling
  net: better IFF_XMIT_DST_RELEASE support
  net/mlx4_en: remove NETDEV_TX_BUSY
  3c59x: fix bad split of cpu_to_le32(pci_map_single())
  net: bcmgenet: fix Tx ring priority programming
  ...
2014-10-08 21:40:54 -04:00
Linus Torvalds 6325e940e7 arm64 updates for 3.18:
- eBPF JIT compiler for arm64
 - CPU suspend backend for PSCI (firmware interface) with standard idle
   states defined in DT (generic idle driver to be merged via a different
   tree)
 - Support for CONFIG_DEBUG_SET_MODULE_RONX
 - Support for unmapped cpu-release-addr (outside kernel linear mapping)
 - set_arch_dma_coherent_ops() implemented and bus notifiers removed
 - EFI_STUB improvements when base of DRAM is occupied
 - Typos in KGDB macros
 - Clean-up to (partially) allow kernel building with LLVM
 - Other clean-ups (extern keyword, phys_addr_t usage)
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUNB6NAAoJEGvWsS0AyF7x22sP/1qPQvFoY71fSqTZmSY+kfgW
 UMXhDFZOd+khD2TPHWptbgBRDElTQjRPHyISv/8ILKwDNoMlUDLlYkp1XPLM/nlB
 ea9ou2GX8iktqgM2JF5r4vk1hjH6JqEGOUHyWKZc7ibphTVm3dhg3nWL1A4peOUG
 0UyX79kl8BLAaggLSUhjtUz1GMpSNlb6Pc1ForUXaPMayBlOcVoOzh1ir7b5wb3e
 IvotUY1gv+opE9uK0QPr1AJSfpCogPEfQ2TSCP8MQZjxkrEz69n0HaFvdy60rwf4
 DaJiqBoQ5MSP3Bw+qvoYgyz+tfiPFAvEF+O3YQ5x3LBTteoooriFYH4mL7DsicAs
 2WLor/342mHykE0bOc44/gNl8B/xaZNzvO2ezLYrjVGsiY2QHTZ7fXB8arPUvQSS
 RUXVfHmcv4qthZjI17rgreBKvsfeFIMighSfvMJnVhGqDSvB8abjiPwZjzqB91Bq
 pu5MDitNgR3k3ctwzRaS6JtH2CluVFv97xIS4VaD/hm3JnS5NPeTXFou3Gb3lvon
 d/wXOIB3vY8FDMIt+BMCQPzWiU0liZ/sN7p1bsOmkgZ1wLOZ0nmsaHF09PDRGbtA
 vifopwaw9qtNlcVrTB/rDBCDaT0Ds/mTYD/a3+ch5CYUeLmQmfW/vBMfq/3gUt65
 JdI/nTVXawbl2CpBWw36
 =SAfQ
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:
 - eBPF JIT compiler for arm64
 - CPU suspend backend for PSCI (firmware interface) with standard idle
   states defined in DT (generic idle driver to be merged via a
   different tree)
 - Support for CONFIG_DEBUG_SET_MODULE_RONX
 - Support for unmapped cpu-release-addr (outside kernel linear mapping)
 - set_arch_dma_coherent_ops() implemented and bus notifiers removed
 - EFI_STUB improvements when base of DRAM is occupied
 - Typos in KGDB macros
 - Clean-up to (partially) allow kernel building with LLVM
 - Other clean-ups (extern keyword, phys_addr_t usage)

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (51 commits)
  arm64: Remove unneeded extern keyword
  ARM64: make of_device_ids const
  arm64: Use phys_addr_t type for physical address
  aarch64: filter $x from kallsyms
  arm64: Use DMA_ERROR_CODE to denote failed allocation
  arm64: Fix typos in KGDB macros
  arm64: insn: Add return statements after BUG_ON()
  arm64: debug: don't re-enable debug exceptions on return from el1_dbg
  Revert "arm64: dmi: Add SMBIOS/DMI support"
  arm64: Implement set_arch_dma_coherent_ops() to replace bus notifiers
  of: amba: use of_dma_configure for AMBA devices
  arm64: dmi: Add SMBIOS/DMI support
  arm64: Correct ftrace calls to aarch64_insn_gen_branch_imm()
  arm64:mm: initialize max_mapnr using function set_max_mapnr
  setup: Move unmask of async interrupts after possible earlycon setup
  arm64: LLVMLinux: Fix inline arm64 assembly for use with clang
  arm64: pageattr: Correctly adjust unaligned start addresses
  net: bpf: arm64: fix module memory leak when JIT image build fails
  arm64: add PSCI CPU_SUSPEND based cpu_suspend support
  arm64: kernel: introduce cpu_init_idle CPU operation
  ...
2014-10-08 05:34:24 -04:00
Linus Torvalds b6420ebd4a Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/doc
Pull documentation updates from Jiri Kosina:
 "Updates to kernel documentation.

  I took this over (hopefully temporarily) from Randy who was not
  willing to maintain it any longer.  This pile mostly is a relay of
  queue that Randy already had in his tree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/doc:
  Documentation: fix broken v4l-utils URL
  Documentation: update include path for mpssd
  Documentation: correct parameter error for dma_mapping_error
  MAINTAINERS: update location of linux-doc tree
  Documentation: remove networking/.gitignore
  tools: add more endian.h macros
  Make Documenation depend on headers_install
  Docs: this_cpu_ops: remove redundant add forms
  Documentation: disable vdso_test to avoid breakage with old glibc
  Documentation: update vDSO makefile to build portable examples
  Documentation: update .gitignore files
  Documentation: support glibc versions without htole macros
  v4l2-pci-skeleton: Only build if PCI is available
  Documentation: fix misc. warnings
  Documentation: make functions static to avoid prototype warnings
  Documentation: add makefiles for more targets
  Documentation: use subdir-y to avoid unnecessary built-in.o files
2014-10-07 21:14:57 -04:00
Linus Torvalds d0cd84817c dmaengine-3.17
1/ Step down as dmaengine maintainer see commit 08223d80df "dmaengine
    maintainer update"
 
 2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit
    7787380336 "net_dma: mark broken"), without reports of performance
    regression.
 
 3/ Miscellaneous fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJUKDLKAAoJEB7SkWpmfYgC7wwP/iNHqRjf1suMUTBIF3P6Hgbe
 VCUwh0IkuujMPDG46WRn6cYzarRxVPLoGaLHLPszgjI6pmGPVv19wqeDOlUxtcmr
 0iQWEWv/zqseaAIW+4gj/WYCyMgKil49EUBJKCZCfNmIaad+e0pr8f0uE5yOkHPM
 tqWoZERu9A4dlXGr1TjeOZVzdnPrCt92MrLDN6ZZ6tMuJaEc5PauaLxKTeGy5fYj
 UB+k1xJQzECbsYfpB+uCVYl5/qPO1rNyuBYS8THCsW+JYmrbbfH2kkF2lo2FaUpO
 8Yd50FtzXHKWwAt7BzfIwU2M7x0wRmryrC/xsQi6M+WmVeHYvvHUIpzaA66xRZ5x
 fCy3Fu8sEnmnmboAbh2v2c5uTycqRl2xPzbpLAuxglloXIxzi3ckp6ESF/Z4SldH
 oxIoEievN7lah3vKgvlHZYcWDzrYr8EKf/EzFe9RqDBQDKtzDzre1H9Uivr387Vm
 uFUcGHYG/GXuX47C7EUsMtaSW2UEoR2ytw/HR6CKFPTVXwAzEO6kA9vg0EqL0iIq
 2wVLgavlZuwegmaUBgnr+bgVZMvVN7OU7fAIRVe5xNO6itrPKvheSlQthmRiiq9C
 uzOu4PS6PexqzHUNPCcJpCsj+lawmCSrE0bxtPzTA/CQInVgWs219V9+W5Gn/0YA
 EARN9k6ueX9PZPQrPQLm
 =BBBv
 -----END PGP SIGNATURE-----

Merge tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine

Pull dmaengine updates from Dan Williams:
 "Even though this has fixes marked for -stable, given the size and the
  needed conflict resolutions this is 3.18-rc1/merge-window material.

  These patches have been languishing in my tree for a long while.  The
  fact that I do not have the time to do proper/prompt maintenance of
  this tree is a primary factor in the decision to step down as
  dmaengine maintainer.  That and the fact that the bulk of drivers/dma/
  activity is going through Vinod these days.

  The net_dma removal has not been in -next.  It has developed simple
  conflicts against mainline and net-next (for-3.18).

  Continuing thanks to Vinod for staying on top of drivers/dma/.

  Summary:

   1/ Step down as dmaengine maintainer see commit 08223d80df
      "dmaengine maintainer update"

   2/ Removal of net_dma, as it has been marked 'broken' since 3.13
      (commit 7787380336 "net_dma: mark broken"), without reports of
      performance regression.

   3/ Miscellaneous fixes"

* tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine:
  net: make tcp_cleanup_rbuf private
  net_dma: revert 'copied_early'
  net_dma: simple removal
  dmaengine maintainer update
  dmatest: prevent memory leakage on error path in thread
  ioat: Use time_before_jiffies()
  dmaengine: fix xor sources continuation
  dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup()
  dma: mv_xor: Remove all callers of mv_xor_slot_cleanup()
  dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call
  ioat: Use pci_enable_msix_exact() instead of pci_enable_msix()
  drivers: dma: Include appropriate header file in dca.c
  drivers: dma: Mark functions as static in dma_v3.c
  dma: mv_xor: Add DMA API error checks
  ioat/dca: Use dev_is_pci() to check whether it is pci device
2014-10-07 20:39:25 -04:00
Alexei Starovoitov 38b2cf2982 net: pktgen: packet bursting via skb->xmit_more
This patch demonstrates the effect of delaying update of HW tailptr.
(based on earlier patch by Jesper)

burst=1 is the default. It sends one packet with xmit_more=false
burst=2 sends one packet with xmit_more=true and
        2nd copy of the same packet with xmit_more=false
burst=3 sends two copies of the same packet with xmit_more=true and
        3rd copy with xmit_more=false

Performance with ixgbe (usec 30):
burst=1  tx:9.2 Mpps
burst=2  tx:13.5 Mpps
burst=3  tx:14.5 Mpps full 10G line rate

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-01 22:08:12 -04:00
Daniel Borkmann e3118e8359 net: tcp: add DCTCP congestion control algorithm
This work adds the DataCenter TCP (DCTCP) congestion control
algorithm [1], which has been first published at SIGCOMM 2010 [2],
resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
recently as an informational IETF draft available at [4]).

DCTCP is an enhancement to the TCP congestion control algorithm for
data center networks. Typical data center workloads are i.e.
i) partition/aggregate (queries; bursty, delay sensitive), ii) short
messages e.g. 50KB-1MB (for coordination and control state; delay
sensitive), and iii) large flows e.g. 1MB-100MB (data update;
throughput sensitive). DCTCP has therefore been designed for such
environments to provide/achieve the following three requirements:

  * High burst tolerance (incast due to partition/aggregate)
  * Low latency (short flows, queries)
  * High throughput (continuous data updates, large file
    transfers) with commodity, shallow buffered switches

The basic idea of its design consists of two fundamentals: i) on the
switch side, packets are being marked when its internal queue
length > threshold K (K is chosen so that a large enough headroom
for marked traffic is still available in the switch queue); ii) the
sender/host side maintains a moving average of the fraction of marked
packets, so each RTT, F is being updated as follows:

 F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
 alpha := (1 - g) * alpha + g * F, where g is a smoothing constant

The resulting alpha (iow: probability that switch queue is congested)
is then being used in order to adaptively decrease the congestion
window W:

 W := (1 - (alpha / 2)) * W

The means for receiving marked packets resp. marking them on switch
side in DCTCP is the use of ECN.

RFC3168 describes a mechanism for using Explicit Congestion Notification
from the switch for early detection of congestion, rather than waiting
for segment loss to occur.

However, this method only detects the presence of congestion, not
the *extent*. In the presence of mild congestion, it reduces the TCP
congestion window too aggressively and unnecessarily affects the
throughput of long flows [4].

DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
processing to estimate the fraction of bytes that encounter congestion,
rather than simply detecting that some congestion has occurred. DCTCP
then scales the TCP congestion window based on this estimate [4],
thus it can derive multibit feedback from the information present in
the single-bit sequence of marks in its control law. And thus act in
*proportion* to the extent of congestion, not its *presence*.

Switches therefore set the Congestion Experienced (CE) codepoint in
packets when internal queue lengths exceed threshold K. Resulting,
DCTCP delivers the same or better throughput than normal TCP, while
using 90% less buffer space.

It was found in [2] that DCTCP enables the applications to handle 10x
the current background traffic, without impacting foreground traffic.
Moreover, a 10x increase in foreground traffic did not cause any
timeouts, and thus largely eliminates TCP incast collapse problems.

The algorithm itself has already seen deployments in large production
data centers since then.

We did a long-term stress-test and analysis in a data center, short
summary of our TCP incast tests with iperf compared to cubic:

This test measured DCTCP throughput and latency and compared it with
CUBIC throughput and latency for an incast scenario. In this test, 19
senders sent at maximum rate to a single receiver. The receiver simply
ran iperf -s.

The senders ran iperf -c <receiver> -t 30. All senders started
simultaneously (using local clocks synchronized by ntp).

This test was repeated multiple times. Below shows the results from a
single test. Other tests are similar. (DCTCP results were extremely
consistent, CUBIC results show some variance induced by the TCP timeouts
that CUBIC encountered.)

For this test, we report statistics on the number of TCP timeouts,
flow throughput, and traffic latency.

1) Timeouts (total over all flows, and per flow summaries):

            CUBIC            DCTCP
  Total     3227             25
  Mean       169.842          1.316
  Median     183              1
  Max        207              5
  Min        123              0
  Stddev      28.991          1.600

Timeout data is taken by measuring the net change in netstat -s
"other TCP timeouts" reported. As a result, the timeout measurements
above are not restricted to the test traffic, and we believe that it
is likely that all of the "DCTCP timeouts" are actually timeouts for
non-test traffic. We report them nevertheless. CUBIC will also include
some non-test timeouts, but they are drawfed by bona fide test traffic
timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
TCP timeouts. DCTCP reduces timeouts by at least two orders of
magnitude and may well have eliminated them in this scenario.

2) Throughput (per flow in Mbps):

            CUBIC            DCTCP
  Mean      521.684          521.895
  Median    464              523
  Max       776              527
  Min       403              519
  Stddev    105.891            2.601
  Fairness    0.962            0.999

Throughput data was simply the average throughput for each flow
reported by iperf. By avoiding TCP timeouts, DCTCP is able to
achieve much better per-flow results. In CUBIC, many flows
experience TCP timeouts which makes flow throughput unpredictable and
unfair. DCTCP, on the other hand, provides very clean predictable
throughput without incurring TCP timeouts. Thus, the standard deviation
of CUBIC throughput is dramatically higher than the standard deviation
of DCTCP throughput.

Mean throughput is nearly identical because even though cubic flows
suffer TCP timeouts, other flows will step in and fill the unused
bandwidth. Note that this test is something of a best case scenario
for incast under CUBIC: it allows other flows to fill in for flows
experiencing a timeout. Under situations where the receiver is issuing
requests and then waiting for all flows to complete, flows cannot fill
in for timed out flows and throughput will drop dramatically.

3) Latency (in ms):

            CUBIC            DCTCP
  Mean      4.0088           0.04219
  Median    4.055            0.0395
  Max       4.2              0.085
  Min       3.32             0.028
  Stddev    0.1666           0.01064

Latency for each protocol was computed by running "ping -i 0.2
<receiver>" from a single sender to the receiver during the incast
test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
that traffic traversed the DCTCP queue and was not dropped when the
queue size was greater than the marking threshold. The summary
statistics above are over all ping metrics measured between the single
sender, receiver pair.

The latency results for this test show a dramatic difference between
CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
which incurs the maximum queue latency (more buffer memory will lead
to high latency.) DCTCP, on the other hand, deliberately attempts to
keep queue occupancy low. The result is a two orders of magnitude
reduction of latency with DCTCP - even with a switch with relatively
little RAM. Switches with larger amounts of RAM will incur increasing
amounts of latency for CUBIC, but not for DCTCP.

4) Convergence and stability test:

This test measured the time that DCTCP took to fairly redistribute
bandwidth when a new flow commences. It also measured DCTCP's ability
to remain stable at a fair bandwidth distribution. DCTCP is compared
with CUBIC for this test.

At the commencement of this test, a single flow is sending at maximum
rate (near 10 Gbps) to a single receiver. One second after that first
flow commences, a new flow from a distinct server begins sending to
the same receiver as the first flow. After the second flow has sent
data for 10 seconds, the second flow is terminated. The first flow
sends for an additional second. Ideally, the bandwidth would be evenly
shared as soon as the second flow starts, and recover as soon as it
stops.

The results of this test are shown below. Note that the flow bandwidth
for the two flows was measured near the same time, but not
simultaneously.

DCTCP performs nearly perfectly within the measurement limitations
of this test: bandwidth is quickly distributed fairly between the two
flows, remains stable throughout the duration of the test, and
recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
fairly, and has trouble remaining stable.

  CUBIC                      DCTCP

  Seconds  Flow 1  Flow 2    Seconds  Flow 1  Flow 2
   0       9.93    0          0       9.92    0
   0.5     9.87    0          0.5     9.86    0
   1       8.73    2.25       1       6.46    4.88
   1.5     7.29    2.8        1.5     4.9     4.99
   2       6.96    3.1        2       4.92    4.94
   2.5     6.67    3.34       2.5     4.93    5
   3       6.39    3.57       3       4.92    4.99
   3.5     6.24    3.75       3.5     4.94    4.74
   4       6       3.94       4       5.34    4.71
   4.5     5.88    4.09       4.5     4.99    4.97
   5       5.27    4.98       5       4.83    5.01
   5.5     4.93    5.04       5.5     4.89    4.99
   6       4.9     4.99       6       4.92    5.04
   6.5     4.93    5.1        6.5     4.91    4.97
   7       4.28    5.8        7       4.97    4.97
   7.5     4.62    4.91       7.5     4.99    4.82
   8       5.05    4.45       8       5.16    4.76
   8.5     5.93    4.09       8.5     4.94    4.98
   9       5.73    4.2        9       4.92    5.02
   9.5     5.62    4.32       9.5     4.87    5.03
  10       6.12    3.2       10       4.91    5.01
  10.5     6.91    3.11      10.5     4.87    5.04
  11       8.48    0         11       8.49    4.94
  11.5     9.87    0         11.5     9.9     0

SYN/ACK ECT test:

This test demonstrates the importance of ECT on SYN and SYN-ACK packets
by measuring the connection probability in the presence of competing
flows for a DCTCP connection attempt *without* ECT in the SYN packet.
The test was repeated five times for each number of competing flows.

              Competing Flows  1 |    2 |    4 |    8 |   16
                               ------------------------------
Mean Connection Probability    1 | 0.67 | 0.45 | 0.28 |    0
Median Connection Probability  1 | 0.65 | 0.45 | 0.25 |    0

As the number of competing flows moves beyond 1, the connection
probability drops rapidly.

Enabling DCTCP with this patch requires the following steps:

DCTCP must be running both on the sender and receiver side in your
data center, i.e.:

  sysctl -w net.ipv4.tcp_congestion_control=dctcp

Also, ECN functionality must be enabled on all switches in your
data center for DCTCP to work. The default ECN marking threshold (K)
heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).

In above tests, for each switch port, traffic was segregated into two
queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
0x04 - the packet was placed into the DCTCP queue. All other packets
were placed into the default drop-tail queue. For the DCTCP queue,
RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
More details however, we refer you to the paper [2] under section 3).

There are no code changes required to applications running in user
space. DCTCP has been implemented in full *isolation* of the rest of
the TCP code as its own congestion control module, so that it can run
without a need to expose code to the core of the TCP stack, and thus
nothing changes for non-DCTCP users.

Changes in the CA framework code are minimal, and DCTCP algorithm
operates on mechanisms that are already available in most Silicon.
The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
the paper, but we leave the option that it can be chosen carefully
to a different value by the user.

In case DCTCP is being used and ECN support on peer site is off,
DCTCP falls back after 3WHS to operate in normal TCP Reno mode.

ss {-4,-6} -t -i diag interface:

  ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
  ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
  send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
  reordering:101 rcv_space:29200

  ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
  cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
  325.5Mbps rcv_rtt:1.5 rcv_space:29200

More information about DCTCP can be found in [1-4].

  [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
  [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
  [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
  [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00

Joint work with Florian Westphal and Glenn Judd.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-29 00:13:10 -04:00
Dan Williams 7bced39751 net_dma: simple removal
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used
and there is no plan to fix it.

This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards.
Reverting the remainder of the net_dma induced changes is deferred to
subsequent patches.

Marked for stable due to Roman's report of a memory leak in
dma_pin_iovec_pages():

    https://lkml.org/lkml/2014/9/3/177

Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: David Whipple <whipple@securedatainnovations.ch>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: <stable@vger.kernel.org>
Reported-by: Roman Gushchin <klamm@yandex-team.ru>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2014-09-28 07:05:16 -07:00
Alexei Starovoitov 51580e798c bpf: verifier (add docs)
this patch adds all of eBPF verfier documentation and empty bpf_check()

The end goal for the verifier is to statically check safety of the program.

Verifier will catch:
- loops
- out of range jumps
- unreachable instructions
- invalid instructions
- uninitialized register access
- uninitialized stack access
- misaligned stack access
- out of range stack access
- invalid calling convention

More details in Documentation/networking/filter.txt

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Alexei Starovoitov 99c55f7d47 bpf: introduce BPF syscall and maps
BPF syscall is a multiplexor for a range of different operations on eBPF.
This patch introduces syscall with single command to create a map.
Next patch adds commands to access maps.

'maps' is a generic storage of different types for sharing data between kernel
and userspace.

Userspace example:
/* this syscall wrapper creates a map with given type and attributes
 * and returns map_fd on success.
 * use close(map_fd) to delete the map
 */
int bpf_create_map(enum bpf_map_type map_type, int key_size,
                   int value_size, int max_entries)
{
    union bpf_attr attr = {
        .map_type = map_type,
        .key_size = key_size,
        .value_size = value_size,
        .max_entries = max_entries
    };

    return bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

'union bpf_attr' is backwards compatible with future extensions.

More details in Documentation/networking/filter.txt and in manpage

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-26 15:05:14 -04:00
Peter Foley e043271b6a Documentation: remove networking/.gitignore
Remove empty networking/.gitignore

Signed-off-by: Peter Foley <pefoley2@pefoley.com>
Cc: rdunlap@infradead.org
Cc: linux-doc@vger.kernel.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-09-26 11:03:02 +02:00
Peter Foley c5e2a7e012 Documentation: update .gitignore files
Add some missing files to .gitignore.
Push Documentation/.gitignore down into subdirectories.

Signed-off-by: Peter Foley <pefoley2@pefoley.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-09-26 11:02:59 +02:00
Peter Foley adb19fb66e Documentation: add makefiles for more targets
Add a bunch of previously unbuilt source files to the Documentation build
machinery.

Signed-off-by: Peter Foley <pefoley2@pefoley.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-09-26 11:02:56 +02:00
Peter Foley df68a01014 Documentation: use subdir-y to avoid unnecessary built-in.o files
Change the Documentation makefiles from obj-m to subdir-y
to avoid generating unnecessary built-in.o files since nothing
in Documentation/ is ever linked in to vmlinux.

Signed-off-by: Peter Foley <pefoley2@pefoley.com>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-09-26 11:02:55 +02:00
Eric Dumazet 4cdf507d54 icmp: add a global rate limitation
Current ICMP rate limiting uses inetpeer cache, which is an RBL tree
protected by a lock, meaning that hosts can be stuck hard if all cpus
want to check ICMP limits.

When say a DNS or NTP server process is restarted, inetpeer tree grows
quick and machine comes to its knees.

iptables can not help because the bottleneck happens before ICMP
messages are even cooked and sent.

This patch adds a new global limitation, using a token bucket filter,
controlled by two new sysctl :

icmp_msgs_per_sec - INTEGER
    Limit maximal number of ICMP packets sent per second from this host.
    Only messages whose type matches icmp_ratemask are
    controlled by this limit.
    Default: 1000

icmp_msgs_burst - INTEGER
    icmp_msgs_per_sec controls number of ICMP packets sent per second,
    while icmp_msgs_burst controls the burst size of these packets.
    Default: 50

Note that if we really want to send millions of ICMP messages per
second, we might extend idea and infra added in commit 04ca6973f7
("ip: make IP identifiers less predictable") :
add a token bucket in the ip_idents hash and no longer rely on inetpeer.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23 12:47:38 -04:00
David S. Miller 1f6d80358d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	arch/mips/net/bpf_jit.c
	drivers/net/can/flexcan.c

Both the flexcan and MIPS bpf_jit conflicts were cases of simple
overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-23 12:09:27 -04:00
Sébastien Barré 72b126a45e Revert "ipv4: Clarify in docs that accept_local requires rp_filter."
This reverts commit c801e3cc19 ("ipv4: Clarify in docs that accept_local requires rp_filter.").
It is not needed anymore since commit 1dced6a854 ("ipv4: Restore accept_local behaviour in fib_validate_source()").

Suggested-by: Julian Anastasov <ja@ssi.bg>
Cc: Gregory Detal <gregory.detal@uclouvain.be>
Cc: Christoph Paasch <christoph.paasch@uclouvain.be>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-12 16:34:17 -04:00
Markos Chandras 1d7efe9dfa Documentation: filter: Add MIPS to architectures with BPF JIT
MIPS supports BPF JIT since v3.16-rc1

Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Markos Chandras <markos.chandras@imgtec.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-10 15:24:27 -07:00
Alexei Starovoitov 02ab695bb3 net: filter: add "load 64-bit immediate" eBPF instruction
add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register.
All previous instructions were 8-byte. This is first 16-byte instruction.
Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction:
insn[0].code = BPF_LD | BPF_DW | BPF_IMM
insn[0].dst_reg = destination register
insn[0].imm = lower 32-bit
insn[1].code = 0
insn[1].imm = upper 32-bit
All unused fields must be zero.

Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM
which loads 32-bit immediate value into a register.

x64 JITs it as single 'movabsq %rax, imm64'
arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn

Note that old eBPF programs are binary compatible with new interpreter.

It helps eBPF programs load 64-bit constant into a register with one
instruction instead of using two registers and 4 instructions:
BPF_MOV32_IMM(R1, imm32)
BPF_ALU64_IMM(BPF_LSH, R1, 32)
BPF_MOV32_IMM(R2, imm32)
BPF_ALU64_REG(BPF_OR, R1, R2)

User space generated programs will use this instruction to load constants only.

To tell kernel that user space needs a pointer the _pseudo_ variant of
this instruction may be added later, which will use extra bits of encoding
to indicate what type of pointer user space is asking kernel to provide.
For example 'off' or 'src_reg' fields can be used for such purpose.
src_reg = 1 could mean that user space is asking kernel to validate and
load in-kernel map pointer.
src_reg = 2 could mean that user space needs readonly data section pointer
src_reg = 3 could mean that user space needs a pointer to per-cpu local data
All such future pseudo instructions will not be carrying the actual pointer
as part of the instruction, but rather will be treated as a request to kernel
to provide one. The kernel will verify the request_for_a_pointer, then
will drop _pseudo_ marking and will store actual internal pointer inside
the instruction, so the end result is the interpreter and JITs never
see pseudo BPF_LD_IMM64 insns and only operate on generic BPF_LD_IMM64 that
loads 64-bit immediate into a register. User space never operates on direct
pointers and verifier can easily recognize request_for_pointer vs other
instructions.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09 10:26:47 -07:00
Zi Shen Lim e54bcde3d6 arm64: eBPF JIT compiler
The JIT compiler emits A64 instructions. It supports eBPF only.
Legacy BPF is supported thanks to conversion by BPF core.

JIT is enabled in the same way as for other architectures:

	echo 1 > /proc/sys/net/core/bpf_jit_enable

Or for additional compiler output:

	echo 2 > /proc/sys/net/core/bpf_jit_enable

See Documentation/networking/filter.txt for more information.

The implementation passes all 57 tests in lib/test_bpf.c
on ARMv8 Foundation Model :) Also tested by Will on Juno platform.

Signed-off-by: Zi Shen Lim <zlim.lnx@gmail.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
2014-09-08 14:39:21 +01:00
Willem de Bruijn 18a47e6d8a net-timestamp: fix allocation error in test
A buffer is incorrectly zeroed to the length of the pointer. If
cfg_payload_len < sizeof(void *) this can overwrites unrelated memory.
The buffer contents are never read, so no need to zero.

Fixes: 8fe2f761ca ("net-timestamp: expand documentation")

Reported-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05 17:31:03 -07:00
Hannes Frederic Sowa a9fe8e2994 ipv4: implement igmp_qrv sysctl to tune igmp robustness variable
As in IPv6 people might increase the igmp query robustness variable to
make sure unsolicited state change reports aren't lost on the network. Add
and document this new knob to igmp code.

RFCs allow tuning this parameter back to first IGMP RFC, so we also use
this setting for all counters, including source specific multicast.

Also take over sysctl value when upping the interface and don't reuse
the last one seen on the interface.

Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-04 22:26:14 -07:00
Hannes Frederic Sowa 2f711939d2 ipv6: add sysctl_mld_qrv to configure query robustness variable
This patch adds a new sysctl_mld_qrv knob to configure the mldv1/v2 query
robustness variable. It specifies how many retransmit of unsolicited mld
retransmit should happen. Admins might want to tune this on lossy links.

Also reset mld state on interface down/up, so we pick up new sysctl
settings during interface up event.

IPv6 certification requests this knob to be available.

I didn't make this knob netns specific, as it is mostly a setting in a
physical environment and should be per host.

Cc: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-04 22:26:14 -07:00
Willem de Bruijn 8fe2f761ca net-timestamp: expand documentation
Expand Documentation/networking/timestamping.txt with new
interfaces and bytestream timestamping. Also minor
cleanup of the other text.

Import txtimestamp.c test of the new features.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-01 21:49:08 -07:00
stephen hemminger a3d1214688 neigh: document gc_thresh2
Missing documentation for gc_thresh2 sysctl.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-25 17:37:10 -07:00
Vasu Dev 38758f552d i40e: adds FCoE to build and updates its documentation
Adds newly added FCoE files to the build but only if FCoE module is configured.

Also, updates i40e document for added FCoE support.

Signed-off-by: Vasu Dev <vasu.dev@intel.com>
Tested-by: Jack Morgan<jack.morgan@intel.com>
Signed-off-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 19:41:13 -07:00
Alexei Starovoitov 7ae457c1e5 net: filter: split 'struct sk_filter' into socket and bpf parts
clean up names related to socket filtering and bpf in the following way:
- everything that deals with sockets keeps 'sk_*' prefix
- everything that is pure BPF is changed to 'bpf_*' prefix

split 'struct sk_filter' into
struct sk_filter {
	atomic_t        refcnt;
	struct rcu_head rcu;
	struct bpf_prog *prog;
};
and
struct bpf_prog {
        u32                     jited:1,
                                len:31;
        struct sock_fprog_kern  *orig_prog;
        unsigned int            (*bpf_func)(const struct sk_buff *skb,
                                            const struct bpf_insn *filter);
        union {
                struct sock_filter      insns[0];
                struct bpf_insn         insnsi[0];
                struct work_struct      work;
        };
};
so that 'struct bpf_prog' can be used independent of sockets and cleans up
'unattached' bpf use cases

split SK_RUN_FILTER macro into:
    SK_RUN_FILTER to be used with 'struct sk_filter *' and
    BPF_PROG_RUN to be used with 'struct bpf_prog *'

__sk_filter_release(struct sk_filter *) gains
__bpf_prog_release(struct bpf_prog *) helper function

also perform related renames for the functions that work
with 'struct bpf_prog *', since they're on the same lines:

sk_filter_size -> bpf_prog_size
sk_filter_select_runtime -> bpf_prog_select_runtime
sk_filter_free -> bpf_prog_free
sk_unattached_filter_create -> bpf_prog_create
sk_unattached_filter_destroy -> bpf_prog_destroy
sk_store_orig_filter -> bpf_prog_store_orig_filter
sk_release_orig_filter -> bpf_release_orig_filter
__sk_migrate_filter -> bpf_migrate_filter
__sk_prepare_filter -> bpf_prepare_filter

API for attaching classic BPF to a socket stays the same:
sk_attach_filter(prog, struct sock *)/sk_detach_filter(struct sock *)
and SK_RUN_FILTER(struct sk_filter *, ctx) to execute a program
which is used by sockets, tun, af_packet

API for 'unattached' BPF programs becomes:
bpf_prog_create(struct bpf_prog **)/bpf_prog_destroy(struct bpf_prog *)
and BPF_PROG_RUN(struct bpf_prog *, ctx) to execute a program
which is used by isdn, ppp, team, seccomp, ptp, xt_bpf, cls_bpf, test_bpf

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:03:58 -07:00
Alexei Starovoitov 4df95ff488 net: filter: rename sk_chk_filter() -> bpf_check_classic()
trivial rename to indicate that this functions performs classic BPF checking

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-02 15:02:38 -07:00
Vince Bridgers 49193a66f0 Documentation: networking: phy.txt: Update text for indirect MMD access
Update the PHY library documentation to describe how a specific PHY
driver can use the PAL MMD register access routines or override those
routines with it's own in the event the PHY does not support the IEEE
standard for reading and writing MMD phy registers.

Signed-off-by: Vince Bridgers <vbridgers2013@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-30 20:00:22 -07:00
Willem de Bruijn 4d276eb6a4 net: remove deprecated syststamp timestamp
The SO_TIMESTAMPING API defines three types of timestamps: software,
hardware in raw format (hwtstamp) and hardware converted to system
format (syststamp). The last has been deprecated in favor of combining
hwtstamp with a PTP clock driver. There are no active users in the
kernel.

The option was device driver dependent. If set, but without hardware
support, the correct behavior is to return zero in the relevant field
in the SCM_TIMESTAMPING ancillary message. Without device drivers
implementing the option, this field is effectively always zero.

Remove the internal plumbing to dissuage new drivers from implementing
the feature. Keep the SOF_TIMESTAMPING_SYS_HARDWARE flag, however, to
avoid breaking existing applications that request the timestamp.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-29 11:39:50 -07:00
Willem de Bruijn 68a360e82e packet: remove deprecated syststamp timestamp
No device driver will ever return an skb_shared_info structure with
syststamp non-zero, so remove the branch that tests for this and
optionally marks the packet timestamp as TP_STATUS_TS_SYS_HARDWARE.

Do not remove the definition TP_STATUS_TS_SYS_HARDWARE, as processes
may refer to it.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-29 11:39:50 -07:00
Nikolay Aleksandrov 1bab4c7507 inet: frag: set limits and make init_net's high_thresh limit global
This patch makes init_net's high_thresh limit to be the maximum for all
namespaces, thus introducing a global memory limit threshold equal to the
sum of the individual high_thresh limits which are capped.
It also introduces some sane minimums for low_thresh as it shouldn't be
able to drop below 0 (or > high_thresh in the unsigned case), and
overall low_thresh should not ever be above high_thresh, so we make the
following relations for a namespace:
init_net:
 high_thresh - max(not capped), min(init_net low_thresh)
 low_thresh - max(init_net high_thresh), min (0)

all other namespaces:
 high_thresh = max(init_net high_thresh), min(namespace's low_thresh)
 low_thresh = max(namespace's high_thresh), min(0)

The major issue with having low_thresh > high_thresh is that we'll
schedule eviction but never evict anything and thus rely only on the
timers.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:36 -07:00
Florian Westphal e3a57d18b0 inet: frag: remove periodic secret rebuild timer
merge functionality into the eviction workqueue.

Instead of rebuilding every n seconds, take advantage of the upper
hash chain length limit.

If we hit it, mark table for rebuild and schedule workqueue.
To prevent frequent rebuilds when we're completely overloaded,
don't rebuild more than once every 5 seconds.

ipfrag_secret_interval sysctl is now obsolete and has been marked as
deprecated, it still can be changed so scripts won't be broken but it
won't have any effect. A comment is left above each unused secret_timer
variable to avoid confusion.

Joint work with Nikolay Aleksandrov.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:36 -07:00
Florian Westphal b13d3cbfb8 inet: frag: move eviction of queues to work queue
When the high_thresh limit is reached we try to toss the 'oldest'
incomplete fragment queues until memory limits are below the low_thresh
value.  This happens in softirq/packet processing context.

This has two drawbacks:

1) processors might evict a queue that was about to be completed
by another cpu, because they will compete wrt. resource usage and
resource reclaim.

2) LRU list maintenance is expensive.

But when constantly overloaded, even the 'least recently used' element is
recent, so removing 'lru' queue first is not 'fairer' than removing any
other fragment queue.

This moves eviction out of the fast path:

When the low threshold is reached, a work queue is scheduled
which then iterates over the table and removes the queues that exceed
the memory limits of the namespace. It sets a new flag called
INET_FRAG_EVICTED on the evicted queues so the proper counters will get
incremented when the queue is forcefully expired.

When the high threshold is reached, no more fragment queues are
created until we're below the limit again.

The LRU list is now unused and will be removed in a followup patch.

Joint work with Nikolay Aleksandrov.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-27 22:34:35 -07:00
Jianhua Xie 92abf75033 bonding: update bonding.txt for Layer2 hash factors
Document the Layer 2 hash factors with packet type ID field.

CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: David S. Miller <davem@davemloft.net>
CC: Pan Jiafei <Jiafei.Pan@freescale.com>

Signed-off-by: Jianhua Xie <jianhua.xie@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-17 16:03:27 -07:00
Willem de Bruijn 26c4fdb052 net-timestamp: document deprecated syststamp
The SO_TIMESTAMPING API defines option SOF_TIMESTAMPING_SYS_HW.
This feature is deprecated. It should not be implemented by new
device drivers. Existing drivers do not implement it, either --
with one exception.

Driver developers are encouraged to expose the NIC hw clock as a
PTP HW clock source, instead, and synchronize system time to the
HW source.

The control flag cannot be removed due to being part of the ABI, nor
can the structure scm_timestamping that is returned. Due to the one
legacy driver, the internal datapath and structure are not removed.

This patch only clearly marks the interface as deprecated. Device
drivers should always return a syststamp value of zero.

Signed-off-by: Willem de Bruijn <willemb@google.com>

----

We can consider adding a WARN_ON_ONCE in__sock_recv_timestamp
if non-zero syststamp is encountered
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-15 16:32:45 -07:00
Tom Herbert cb1ce2ef38 ipv6: Implement automatic flow label generation on transmit
Automatically generate flow labels for IPv6 packets on transmit.
The flow label is computed based on skb_get_hash. The flow label will
only automatically be set when it is zero otherwise (i.e. flow label
manager hasn't set one). This supports the transmit side functionality
of RFC 6438.

Added an IPv6 sysctl auto_flowlabels to enable/disable this behavior
system wide, and added IPV6_AUTOFLOWLABEL socket option to enable this
functionality per socket.

By default, auto flowlabels are disabled to avoid possible conflicts
with flow label manager, however if this feature proves useful we
may want to enable it by default.

It should also be noted that FreeBSD has already implemented automatic
flow labels (including the sysctl and socket option). In FreeBSD,
automatic flow labels default to enabled.

Performance impact:

Running super_netperf with 200 flows for TCP_RR and UDP_RR for
IPv6. Note that in UDP case, __skb_get_hash will be called for
every packet with explains slight regression. In the TCP case
the hash is saved in the socket so there is no regression.

Automatic flow labels disabled:

  TCP_RR:
    86.53% CPU utilization
    127/195/322 90/95/99% latencies
    1.40498e+06 tps

  UDP_RR:
    90.70% CPU utilization
    118/168/243 90/95/99% latencies
    1.50309e+06 tps

Automatic flow labels enabled:

  TCP_RR:
    85.90% CPU utilization
    128/199/337 90/95/99% latencies
    1.40051e+06

  UDP_RR
    92.61% CPU utilization
    115/164/236 90/95/99% latencies
    1.4687e+06

Signed-off-by: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-07 21:14:21 -07:00
Jesper Dangaard Brouer 9ceb87fcea pktgen: document tuning for max NIC performance
Using pktgen I'm seeing the ixgbe driver "push-back", due TX ring
running full.  Thus, the TX ring is artificially limiting pktgen.
(Diagnose via "ethtool -S", look for "tx_restart_queue" or "tx_busy"
counters.)

Using ixgbe, the real reason behind the TX ring running full, is due
to TX ring not being cleaned up fast enough. The ixgbe driver combines
TX+RX ring cleanups, and the cleanup interval is affected by the
ethtool --coalesce setting of parameter "rx-usecs".

Do not increase the default NIC TX ring buffer or default cleanup
interval.  Instead simply document that pktgen needs special NIC
tuning for maximum packet per sec performance.

Performance results with pktgen with clone_skb=100000.
TX ring size 512 (default), adjusting "rx-usecs":
 (Single CPU performance, E5-2630, ixgbe)
 - 3935002 pps - rx-usecs:  1 (irqs:  9346)
 - 5132350 pps - rx-usecs: 10 (irqs: 99157)
 - 5375111 pps - rx-usecs: 20 (irqs: 50154)
 - 5454050 pps - rx-usecs: 30 (irqs: 33872)
 - 5496320 pps - rx-usecs: 40 (irqs: 26197)
 - 5502510 pps - rx-usecs: 50 (irqs: 21527)

TX ring size adjusting (ethtool -G), "rx-usecs==1" (default):
 - 3935002 pps - tx-size:  512
 - 5354401 pps - tx-size:  768
 - 5356847 pps - tx-size: 1024
 - 5327595 pps - tx-size: 1536
 - 5356779 pps - tx-size: 2048
 - 5353438 pps - tx-size: 4096

Notice after commit 6f25cd47d (pktgen: fix xmit test for BQL enabled
devices) pktgen uses netif_xmit_frozen_or_drv_stopped() and ignores
the BQL "stack" pause (QUEUE_STATE_STACK_XOFF) flag.  This allow us to put
more pressure on the TX ring buffers.

It is the ixgbe_maybe_stop_tx() call that stops the transmits, and
pktgen respecting this in the call to netif_xmit_frozen_or_drv_stopped(txq).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-01 15:50:23 -07:00
Ben Greear d933319657 ipv6: Allow accepting RA from local IP addresses.
This can be used in virtual networking applications, and
may have other uses as well.  The option is disabled by
default.

A specific use case is setting up virtual routers, bridges, and
hosts on a single OS without the use of network namespaces or
virtual machines.  With proper use of ip rules, routing tables,
veth interface pairs and/or other virtual interfaces,
and applications that can bind to interfaces and/or IP addresses,
it is possibly to create one or more virtual routers with multiple
hosts attached.  The host interfaces can act as IPv6 systems,
with radvd running on the ports in the virtual routers.  With the
option provided in this patch enabled, those hosts can now properly
obtain IPv6 addresses from the radvd.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-01 12:16:24 -07:00
Linus Torvalds f9da455b93 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:

 1) Seccomp BPF filters can now be JIT'd, from Alexei Starovoitov.

 2) Multiqueue support in xen-netback and xen-netfront, from Andrew J
    Benniston.

 3) Allow tweaking of aggregation settings in cdc_ncm driver, from Bjørn
    Mork.

 4) BPF now has a "random" opcode, from Chema Gonzalez.

 5) Add more BPF documentation and improve test framework, from Daniel
    Borkmann.

 6) Support TCP fastopen over ipv6, from Daniel Lee.

 7) Add software TSO helper functions and use them to support software
    TSO in mvneta and mv643xx_eth drivers.  From Ezequiel Garcia.

 8) Support software TSO in fec driver too, from Nimrod Andy.

 9) Add Broadcom SYSTEMPORT driver, from Florian Fainelli.

10) Handle broadcasts more gracefully over macvlan when there are large
    numbers of interfaces configured, from Herbert Xu.

11) Allow more control over fwmark used for non-socket based responses,
    from Lorenzo Colitti.

12) Do TCP congestion window limiting based upon measurements, from Neal
    Cardwell.

13) Support busy polling in SCTP, from Neal Horman.

14) Allow RSS key to be configured via ethtool, from Venkata Duvvuru.

15) Bridge promisc mode handling improvements from Vlad Yasevich.

16) Don't use inetpeer entries to implement ID generation any more, it
    performs poorly, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1522 commits)
  rtnetlink: fix userspace API breakage for iproute2 < v3.9.0
  tcp: fixing TLP's FIN recovery
  net: fec: Add software TSO support
  net: fec: Add Scatter/gather support
  net: fec: Increase buffer descriptor entry number
  net: fec: Factorize feature setting
  net: fec: Enable IP header hardware checksum
  net: fec: Factorize the .xmit transmit function
  bridge: fix compile error when compiling without IPv6 support
  bridge: fix smatch warning / potential null pointer dereference
  via-rhine: fix full-duplex with autoneg disable
  bnx2x: Enlarge the dorq threshold for VFs
  bnx2x: Check for UNDI in uncommon branch
  bnx2x: Fix 1G-baseT link
  bnx2x: Fix link for KR with swapped polarity lane
  sctp: Fix sk_ack_backlog wrap-around problem
  net/core: Add VF link state control policy
  net/fsl: xgmac_mdio is dependent on OF_MDIO
  net/fsl: Make xgmac_mdio read error message useful
  net_sched: drr: warn when qdisc is not work conserving
  ...
2014-06-12 14:27:40 -07:00
Alexei Starovoitov 783e327b69 net: filter: document internal instruction encoding
This patch adds a description of eBPFs instruction encoding in order
to bring the documentation in line with the implementation.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:18 -07:00
Alexei Starovoitov e4ad403269 net: filter: mention eBPF terminology as well
Since the term eBPF is used anyway on mailing list discussions, lets
also document that in the main BPF documentation file and replace a
couple of occurrences with eBPF terminology to be more clear.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 15:39:18 -07:00
Alexei Starovoitov e430f34ee5 net: filter: cleanup A/X name usage
The macro 'A' used in internal BPF interpreter:
 #define A regs[insn->a_reg]
was easily confused with the name of classic BPF register 'A', since
'A' would mean two different things depending on context.

This patch is trying to clean up the naming and clarify its usage in the
following way:

- A and X are names of two classic BPF registers

- BPF_REG_A denotes internal BPF register R0 used to map classic register A
  in internal BPF programs generated from classic

- BPF_REG_X denotes internal BPF register R7 used to map classic register X
  in internal BPF programs generated from classic

- internal BPF instruction format:
struct sock_filter_int {
        __u8    code;           /* opcode */
        __u8    dst_reg:4;      /* dest register */
        __u8    src_reg:4;      /* source register */
        __s16   off;            /* signed offset */
        __s32   imm;            /* signed immediate constant */
};

- BPF_X/BPF_K is 1 bit used to encode source operand of instruction
In classic:
  BPF_X - means use register X as source operand
  BPF_K - means use 32-bit immediate as source operand
In internal:
  BPF_X - means use 'src_reg' register as source operand
  BPF_K - means use 32-bit immediate as source operand

Suggested-by: Chema Gonzalez <chema@google.com>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Chema Gonzalez <chema@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-11 00:13:16 -07:00
Linus Torvalds 1aacb90eaa Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial into next
Pull trivial tree changes from Jiri Kosina:
 "Usual pile of patches from trivial tree that make the world go round"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (23 commits)
  staging: go7007: remove reference to CONFIG_KMOD
  aic7xxx: Remove obsolete preprocessor define
  of: dma: doc fixes
  doc: fix incorrect formula to calculate CommitLimit value
  doc: Note need of bc in the kernel build from 3.10 onwards
  mm: Fix printk typo in dmapool.c
  modpost: Fix comment typo "Modules.symvers"
  Kconfig.debug: Grammar s/addition/additional/
  wimax: Spelling s/than/that/, wording s/destinatary/recipient/
  aic7xxx: Spelling s/termnation/termination/
  arm64: mm: Remove superfluous "the" in comment
  of: Spelling s/anonymouns/anonymous/
  dma: imx-sdma: Spelling s/determnine/determine/
  ath10k: Improve grammar in comments
  ath6kl: Spelling s/determnine/determine/
  of: Improve grammar for of_alias_get_id() documentation
  drm/exynos: Spelling s/contro/control/
  radio-bcm2048.c: fix wrong overflow check
  doc: printk-formats: do not mention casts for u64/s64
  doc: spelling error changes
  ...
2014-06-04 08:50:34 -07:00
David S. Miller 54e5c4def0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/bonding/bond_alb.c
	drivers/net/ethernet/altera/altera_msgdma.c
	drivers/net/ethernet/altera/altera_sgdma.c
	net/ipv6/xfrm6_output.c

Several cases of overlapping changes.

The xfrm6_output.c has a bug fix which overlaps the renaming
of skb->local_df to skb->ignore_df.

In the Altera TSE driver cases, the register access cleanups
in net-next overlapped with bug fixes done in net.

Similarly a bug fix to send ALB packets in the bonding driver using
the right source address overlaps with cleanups in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-24 00:32:30 -04:00
Daniel Borkmann 04caa48930 net: filter: doc: add section for BPF test suite
Mention the recently added test suite in the documentation file.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-23 16:48:05 -04:00
Tobias Klauser b0db5cdf39 net: doc: Update references to skb->rxhash
In commit 61b905da33 ("net: Rename skb->rxhash to skb->hash"), skb->rxhash
was renamed to skb->hash. Update references in Documentation
accordingly.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-22 15:26:37 -04:00
Oliver Hartkopp 277bd320e7 can: add documentation for CAN filter usage optimisation
To benefit from special filters for single SFF or single EFF CAN identifier
subscriptions the CAN_EFF_FLAG bit and the CAN_RTR_FLAG bit has to be set
together with the CAN_(SFF|EFF)_MASK in can_filter.mask.

Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2014-05-19 09:38:25 +02:00
Bjørn Mork a563babeb5 net: cdc_mbim: add driver documentation
An initial attempt on describing some of the odd APIs
provided by this driver.

Cc: Greg Suarez <gsuarez@smithmicro.com>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13 17:46:09 -04:00
Carlos Garcia c98be0c96d doc: spelling error changes
Fixed multiple spelling errors.

Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Carlos E. Garcia <carlos@cgarcia.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-05-05 15:32:05 +02:00
Alexei Starovoitov dfee07ccef net: filter: doc: expand and improve BPF documentation
In particular, this patch tries to clarify internal BPF calling
convention and adds internal BPF examples, JIT guide, use cases.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-04 19:40:00 -04:00
David S. Miller 4366004d77 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/igb/e1000_mac.c
	net/core/filter.c

Both conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-24 13:19:00 -04:00
Mahesh Bandewar e9f0fb8849 bonding: Add tlb_dynamic_lb parameter for tlb mode
The aggresive load balancing causes packet re-ordering as active
flows are moved from a slave to another within the group. Sometime
this aggresive lb is not necessary if the preference is for less
re-ordering. This parameter if used with value "0" disables
this dynamic flow shuffling minimizing packet re-ordering. Of course
the side effect is that it has to live with the static load balancing
that the hashing distribution provides. This impact is less severe if
the correct xmit-hashing-policy is used for the tlb setup.

The default value of the parameter is set to "1" mimicing the earlier
behavior.

Ran the netperf test with 200 stream for 1 min between two hosts with
4x1G trunk (xmit-lb mode with xmit-policy L3+4) before and after these
changes. Following was the command used for those 200 instances -

    netperf -t TCP_RR -l 60 -s 5 -H <host> -- -r81920,81920

Transactions per second:
    Before change: 1,367.11
    After  change: 1,470.65

Change-Id: Ie3f75c77282cf602e83a6e833c6eb164e72a0990
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-24 13:04:34 -04:00
Mahesh Bandewar f05b42eaa2 bonding: Added bond_tlb_xmit() for tlb mode.
Re-organized the xmit function for the lb mode separating tlb xmit
from the alb mode. This will enable use of the hashing policies
like 802.3ad mode. Also extended use of xmit-hash-policy to tlb mode.

Now the tlb-mode defaults to BOND_XMIT_POLICY_LAYER2 if the xmit policy
module parameter is not set (just like 802.3ad, or Xor mode).

Change-Id: I140257403d272df75f477b380207338d0f04963e
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-24 13:04:34 -04:00
Ben Hutchings c06cbcb605 net: Update my email address
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-23 15:04:42 -04:00
Chema Gonzalez 4cd3675ebf filter: added BPF random opcode
Added a new ancillary load (bpf call in eBPF parlance) that produces
a 32-bit random number. We are implementing it as an ancillary load
(instead of an ISA opcode) because (a) it is simpler, (b) allows easy
JITing, and (c) seems more in line with generic ISAs that do not have
"get a random number" as a instruction, but as an OS call.

The main use for this ancillary load is to perform random packet sampling.

Signed-off-by: Chema Gonzalez <chema@google.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-22 21:27:57 -04:00
Linus Torvalds cd6362befe Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Here is my initial pull request for the networking subsystem during
  this merge window:

   1) Support for ESN in AH (RFC 4302) from Fan Du.

   2) Add full kernel doc for ethtool command structures, from Ben
      Hutchings.

   3) Add BCM7xxx PHY driver, from Florian Fainelli.

   4) Export computed TCP rate information in netlink socket dumps, from
      Eric Dumazet.

   5) Allow IPSEC SA to be dumped partially using a filter, from Nicolas
      Dichtel.

   6) Convert many drivers to pci_enable_msix_range(), from Alexander
      Gordeev.

   7) Record SKB timestamps more efficiently, from Eric Dumazet.

   8) Switch to microsecond resolution for TCP round trip times, also
      from Eric Dumazet.

   9) Clean up and fix 6lowpan fragmentation handling by making use of
      the existing inet_frag api for it's implementation.

  10) Add TX grant mapping to xen-netback driver, from Zoltan Kiss.

  11) Auto size SKB lengths when composing netlink messages based upon
      past message sizes used, from Eric Dumazet.

  12) qdisc dumps can take a long time, add a cond_resched(), From Eric
      Dumazet.

  13) Sanitize netpoll core and drivers wrt.  SKB handling semantics.
      Get rid of never-used-in-tree netpoll RX handling.  From Eric W
      Biederman.

  14) Support inter-address-family and namespace changing in VTI tunnel
      driver(s).  From Steffen Klassert.

  15) Add Altera TSE driver, from Vince Bridgers.

  16) Optimizing csum_replace2() so that it doesn't adjust the checksum
      by checksumming the entire header, from Eric Dumazet.

  17) Expand BPF internal implementation for faster interpreting, more
      direct translations into JIT'd code, and much cleaner uses of BPF
      filtering in non-socket ocntexts.  From Daniel Borkmann and Alexei
      Starovoitov"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1976 commits)
  netpoll: Use skb_irq_freeable to make zap_completion_queue safe.
  net: Add a test to see if a skb is freeable in irq context
  qlcnic: Fix build failure due to undefined reference to `vxlan_get_rx_port'
  net: ptp: move PTP classifier in its own file
  net: sxgbe: make "core_ops" static
  net: sxgbe: fix logical vs bitwise operation
  net: sxgbe: sxgbe_mdio_register() frees the bus
  Call efx_set_channels() before efx->type->dimension_resources()
  xen-netback: disable rogue vif in kthread context
  net/mlx4: Set proper build dependancy with vxlan
  be2net: fix build dependency on VxLAN
  mac802154: make csma/cca parameters per-wpan
  mac802154: allow only one WPAN to be up at any given time
  net: filter: minor: fix kdoc in __sk_run_filter
  netlink: don't compare the nul-termination in nla_strcmp
  can: c_can: Avoid led toggling for every packet.
  can: c_can: Simplify TX interrupt cleanup
  can: c_can: Store dlc private
  can: c_can: Reduce register access
  can: c_can: Make the code readable
  ...
2014-04-02 20:53:45 -07:00
Linus Torvalds 159d8133d0 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "Usual rocket science -- mostly documentation and comment updates"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  sparse: fix comment
  doc: fix double words
  isdn: capi: fix "CAPI_VERSION" comment
  doc: DocBook: Fix typos in xml and template file
  Bluetooth: add module name for btwilink
  driver core: unexport static function create_syslog_header
  mmc: core: typo fix in printk specifier
  ARM: spear: clean up editing mistake
  net-sysfs: fix comment typo 'CONFIG_SYFS'
  doc: Insert MODULE_ in module-signing macros
  Documentation: update URL to hfsplus Technote 1150
  gpio: update path to documentation
  ixgbe: Fix format string in ixgbe_fcoe.
  Kconfig: Remove useless "default N" lines
  user_namespace.c: Remove duplicated word in comment
  CREDITS: fix formatting
  treewide: Fix typo in Documentation/DocBook
  mm: Fix warning on make htmldocs caused by slab.c
  ata: ata-samsung_cf: cleanup in header file
  idr: remove unused prototype of idr_free()
2014-04-02 16:23:38 -07:00
David S. Miller f91ca783f1 linux-can-fixes-for-3.15-20140401
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iEYEABECAAYFAlM6jfsACgkQjTAFq1RaXHOVTwCeOyOYOJ+Ze2x33WydGcGsvX8a
 QPoAoJVAcekmL5MVbcMTTvc8QL1AebAm
 =4aEp
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-3.15-20140401' of git://gitorious.org/linux-can/linux-can

linux-can-fixes-for-3.15-20140401

Marc Kleine-Budde says:

====================
this is a pull request of 16 patches for the 3.15 release cycle.

Bjorn Van Tilt contributes a patch which fixes a memory leak in usb_8dev's
usb_8dev_start_xmit()s error path. A patch by Robert Schwebel fixes a typo in
the can documentation. The remaining patches all target the c_can driver. Two
of them are by me; they add a missing netif_napi_del() and return value
checking. Thomas Gleixner contributes 12 patches, which address several
shortcomings in the driver like hardware initialisation, concurrency, message
ordering and poor performance.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-01 17:49:50 -04:00
Robert Schwebel 5fb7639dee can: Documentation: fix parameter name "sample-point"
This patch fixes the name of the parameter to configure the sample point used
in iproute2's ip command. The correct writing is "sample-point" not
"sample_point".

Signed-off-by: Robert Schwebel <r.schwebel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2014-04-01 11:54:55 +02:00
Alexei Starovoitov 9a985cdc5c doc: filter: extend BPF documentation to document new internals
Further extend the current BPF documentation to document new BPF
engine internals. Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31 00:45:09 -04:00
Florian Fainelli 604fdf4286 Documentation: networking: phy.txt: MDIO bus reset is optional
Update the MDIO bus documentation to mention that the MDIO bus reset
function is completely optional. It became optional with commit
e13934563d ("[PATCH] PHY Layer fixup")

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-28 01:38:03 -04:00
David S. Miller 04f58c8854 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	Documentation/devicetree/bindings/net/micrel-ks8851.txt
	net/core/netpoll.c

The net/core/netpoll.c conflict is a bug fix in 'net' happening
to code which is completely removed in 'net-next'.

In micrel-ks8851.txt we simply have overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-25 20:29:20 -04:00
Masanari Iida df5cbb2783 doc: fix double words
Fix double words "the the" in various files
within Documentations.

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-03-21 13:16:58 +01:00
stephen hemminger 88050049e7 netlink: fix setsockopt in mmap examples in documentation
The documentation for how to use netlink mmap interface is incorrect.
The calls to setsockopt() require an additional argument.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-20 14:11:38 -04:00
Jakub Kicinski 59cb89e6b7 doc: update driver TX algorithm in timestamping.txt
Since cd4d8fdad1 ("net: kernel panic in dev_hard_start_xmit:
remove faulty software TX time stamping") dev_hard_start_xmit()
will not provide software timestamps. It's a responsibility of
the drivers to call skb_tx_timestamp() at the right time.

Cc: linux-doc@vger.kernel.org
Signed-off-by: Jakub Kicinski <kubakici@wp.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-18 15:16:48 -04:00
Fernando Luis Vazquez Cao 406d49656f igb: remove references to long gone command line parameters
Command line parameters QueuePairs, Node, EEE, DMAC and InterruptThrottleRate
do not exist these days. Remove all references to them in the Documentation
folder and update code comments.

Signed-off-by: Fernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-18 13:25:41 -04:00
Vince Bridgers 04add4ab7b Documentation: networking: Add Altera Ethernet (TSE) Documentation
This patch adds a bindings description for the Altera Triple Speed Ethernet
(TSE) driver. The bindings support the legacy SGDMA soft IP as well as the
preferred MSGDMA soft IP. The TSE can be configured and synthesized in soft
logic using Altera's Quartus toolchain. Please consult the bindings document
for supported options.

Signed-off-by: Vince Bridgers <vbridgers2013@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-17 21:26:57 -04:00
David S. Miller 85dcce7a73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/usb/r8152.c
	drivers/net/xen-netback/netback.c

Both the r8152 and netback conflicts were simple overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-14 22:31:55 -04:00
Geert Uytterhoeven a93c125604 packet: doc: Spelling s/than/that/
Signed-off-by: Geert Uytterhoeven <geert+renesas@linux-m68k.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-12 16:15:17 -04:00
David S. Miller 3894004289 RxRPC development
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIVAwUAUxX4KBOxKuMESys7AQJqkRAAmxU1GEGESCE/F/U3Hm9pRFeg9kj6+7BO
 7vPrUzB5KEnu4eXjCc60a+BVmV0fJRiS7zsFWzmcKgjIZzJdobbStD944tdz0kIx
 w7RFKOr+D6+zyo0zr1Fj0DwlhvuOpfBsisYtNC5J7OTWc3whgnIkroLt16FNFasu
 gxe7oo+KHiTXIJuIf6E/p5iK1te7CUsIIc1v01hgVVD91Udk4odWyl288BzzcXuC
 5WFYaMnOG+9ndo+4/mCt5dN0FDYL2a1djyknPdXP6HOSBoqVBjPxFGxuN8O9HFMV
 ghWjNG72YGDk0jy6ghiQijdoxE4l2iy0/gYGzSiCMSp59aHHlkq/XIhOoZWD96Nw
 TAoo0PbTFrqRzfvOQ8lMULDneB0KP20gI/AUbgnU3CGataALntAKiVOhrOtXaG21
 trE77omDA5FtMlQ60DabTl6RveMZXCZ+LwHPuZ0Euxo4uykmDLdOChZADGoqZA+5
 VzI1Qa4YK+9B/fkgUyiIzS4syCHGISFIIQ7/Srr26UMlpuK4WDPA9a11hKxrbrkA
 JxEkAk9i/KD3/XsWWfxEMPgNarwHOyDQQ4/XYKx20O2N9RDrao+cNw6ghlwZS+mo
 bmx/sqWmMDaQoMNqL8Q+bq2W7omG8wNgv1pn6MVnFSfiJHAMDhNZkBzp0uCLGFpA
 8WYj9OBcUgY=
 =PK2i
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-devel-20140304' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
net-next: AF_RXRPC fixes and development

Here are some AF_RXRPC fixes:

 (1) Fix to remove incorrect checksum calculation made during recvmsg().  It's
     unnecessary to try to do this there since we check the checksum before
     reading the RxRPC header from the packet.

 (2) Fix to prevent the sending of an ABORT packet in response to another
     ABORT packet and inducing a storm.

 (3) Fix UDP MTU calculation from parsing ICMP_FRAG_NEEDED packets where we
     don't handle the ICMP packet not specifying an MTU size.

And development patches:

 (4) Add sysctls for configuring RxRPC parameters, specifically various delays
     pertaining to ACK generation, the time before we resend a packet for
     which we don't receive an ACK, the maximum time a call is permitted to
     live and the amount of time transport, connection and dead call
     information is cached.

 (5) Improve ACK packet production by adjusting the handling of ACK_REQUESTED
     packets, ignoring the MORE_PACKETS flag, delaying the production of
     otherwise immediate ACK_IDLE packets and delaying all ACK_IDLE production
     (barring the call termination) to half a second.

 (6) Add more sysctl parameters to expose the Rx window size, the maximum
     packet size that we're willing to receive and the number of jumbo rxrpc
     packets we're willing to handle in a single UDP packet.

 (7) Request ACKs on alternate DATA packets so that the other side doesn't
     wait till we fill up the Tx window.

 (8) Use a RCU hash table to look up the rxrpc_call for an incoming packet
     rather than stepping through a hierarchy involving several spinlocks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-07 16:08:59 -05:00
Andrew Lutomirski adca476782 net: Improve SO_TIMESTAMPING documentation and fix a minor code bug
The original documentation was very unclear.

The code fix is presumably related to the formerly unclear
documentation: SOCK_TIMESTAMPING_RX_SOFTWARE has no effect on
__sock_recv_timestamp's behavior, so calling __sock_recv_ts_and_drops
from sock_recv_ts_and_drops if only SOCK_TIMESTAMPING_RX_SOFTWARE is
set is pointless.  This should have no user-observable effect.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06 16:18:01 -05:00
David S. Miller 67ddc87f16 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/wireless/ath/ath9k/recv.c
	drivers/net/wireless/mwifiex/pcie.c
	net/ipv6/sit.c

The SIT driver conflict consists of a bug fix being done by hand
in 'net' (missing u64_stats_init()) whilst in 'net-next' a helper
was created (netdev_alloc_pcpu_stats()) which takes care of this.

The two wireless conflicts were overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05 20:32:02 -05:00
Oliver Hartkopp 821047c405 can: remove CAN FD compatibility for CAN 2.0 sockets
In commit e2d265d3b5 (canfd: add support for CAN FD in CAN_RAW sockets)
CAN FD frames with a payload length up to 8 byte are passed to legacy
sockets where the CAN FD support was not enabled by the application.

After some discussions with developers at a fair this well meant feature
leads to confusion as no clean switch for CAN / CAN FD is provided to the
application programmer. Additionally a compatibility like this for legacy
CAN_RAW sockets requires some compatibility handling for the sending, e.g.
make CAN2.0 frames a CAN FD frame with BRS at transmission time (?!?).

This will become a mess when people start to develop applications with
real CAN FD hardware. This patch reverts the bad compatibility code
together with the documentation describing the removed feature.

Acked-by: Stephane Grosjean <s.grosjean@peak-system.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2014-03-03 14:29:52 +01:00
David Howells 817913d8cd af_rxrpc: Expose more RxRPC parameters via sysctls
Expose RxRPC parameters via sysctls to control the Rx window size, the Rx MTU
maximum size and the number of packets that can be glued into a jumbo packet.

More info added to Documentation/networking/rxrpc.txt.

Signed-off-by: David Howells <dhowells@redhat.com>
2014-02-26 17:25:07 +00:00
David Howells 5873c0834f af_rxrpc: Add sysctls for configuring RxRPC parameters
Add sysctls for configuring RxRPC protocol handling, specifically controls on
delays before ack generation, the delay before resending a packet, the maximum
lifetime of a call and the expiration times of calls, connections and
transports that haven't been recently used.

More info added in Documentation/networking/rxrpc.txt.

Signed-off-by: David Howells <dhowells@redhat.com>
2014-02-26 17:25:06 +00:00
Mathias Krause 72f8e06f3e pktgen: document all supported flags
The documentation misses a few of the supported flags. Fix this. Also
respect the dependency to CONFIG_XFRM for the IPSEC flag.

Cc: Fan Du <fan.du@windriver.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-24 18:54:26 -05:00
David S. Miller 1e8d6421cf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/bonding/bond_3ad.h
	drivers/net/bonding/bond_main.c

Two minor conflicts in bonding, both of which were overlapping
changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-19 01:24:22 -05:00
Veaceslav Falico 52f65ef33b bonding: document the new _arp options for arp_validate
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-18 16:47:15 -05:00
Veaceslav Falico 13ac34a886 bonding: permit using arp_validate with non-ab modes
Currently it's disabled because it's sometimes hard, in typical configs, to
make it work - because of the nature how the loadbalance modes work - as
it's hard to deliver valid arp replies to correct slaves by the switch.

However we still can use arp_validation in loadbalance with several other
configs, per example with arp_validate == 2 for backup with one broadcast
domain, without the switch(es) doing any balancing - this way we'd be (a
bit more) sure that the slave is up.

So, enable it to let users decide which one works/suits them best. Also
correct the mode limitation from BOND_OPT_ARP_VALIDATE.

CC: Nikolay Aleksandrov <nikolay@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-18 16:47:14 -05:00
Claudiu Manoil 34018fd419 gianfar: Remove sysfs stubs for FIFOCFG and stashing
Removing the sysfs stubs for the Tx FIFOCFG and ATTRELI
(stashing) config registers, as these registers may only
be configured after a MAC reset, with the controller stopped
(i.e. during hw init, at probe() time).  The current sysfs
stubs allow on-the-fly updates of these registers (the locking
measures are useless and only add unecessary code).

Changing these registers is discouraged. Only the default values
will be used instead.

Moreover, the stashing (ATTRELI) configuration options were
effectively disabled (didn't get to the hw anyway if changed)
because the stashing device_flags (HAS_BD_STASHING|HAS_BUF_STASHING)
were "accidentally" cleared during probe().

Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-18 15:03:02 -05:00
Florian Fainelli 7f6224b7c7 Documentation: networking: update phy.txt with recent changes
The PHY library was missing a bunch of newly added PHY driver callbacks
along with a smallish description of what they do, fix that.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-17 16:40:09 -05:00
Stanislav Fomichev 45f7435968 tcp: remove unused min_cwnd member of tcp_congestion_ops
Commit 684bad1107 "tcp: use PRR to reduce cwin in CWR state" removed all
calls to min_cwnd, so we can safely remove it.
Also, remove tcp_reno_min_cwnd because it was only used for min_cwnd.

Signed-off-by: Stanislav Fomichev <stfomichev@yandex-team.ru>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 18:22:34 -05:00
Paul Gortmaker 455016e5dc Documentation/networking: delete orphaned 3c505.txt file.
In the commit 0e245dbaac
("drivers/net: delete the 3Com 3c505/3c507 intel i825xx support")
we clobbered the 3c505 driver (over a year ago) along with other
abandoned ISA drivers.

However, this orphaned README file escaped detection at that
time, and has lived on until today. Get rid of it now.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-02-13 17:08:29 -05:00
Henrik Austad 3cf8ca1c25 Documentation/: update 00-INDEX files
Some of the 00-INDEX files are somewhat outdated and some folders does
not contain 00-INDEX at all.  Only outdated (with the notably exception
of spi) indexes are touched here, the 169 folders without 00-INDEX has
not been touched.

New 00-INDEX
 - spi/* was added in a series of commits dating back to 2006

Added files (missing in (*/)00-INDEX)
 - dmatest.txt was added by commit 851b7e16a0 ("dmatest: run test via
   debugfs")
 - this_cpu_ops.txt was added by commit a1b2a555d6 ("percpu: add
   documentation on this_cpu operations")
 - ww-mutex-design.txt was added by commit 040a0a3710 ("mutex: Add
   support for wound/wait style locks")
 - bcache.txt was added by commit cafe563591 ("bcache: A block layer
   cache")
 - kernel-per-CPU-kthreads.txt was added by commit 49717cb404
   ("kthread: Document ways of reducing OS jitter due to per-CPU
   kthreads")
 - phy.txt was added by commit ff76496347 ("drivers: phy: add generic
   PHY framework")
 - block/null_blk was added by commit 12f8f4fc03 ("null_blk:
   documentation")
 - module-signing.txt was added by commit 3cafea3076 ("Add
   Documentation/module-signing.txt file")
 - assoc_array.txt was added by commit 3cb989501c ("Add a generic
   associative array implementation.")
 - arm/IXP4xx was part of the initial repo
 - arm/cluster-pm-race-avoidance.txt was added by commit 7fe31d28e8
   ("ARM: mcpm: introduce helpers for platform coherency exit/setup")
 - arm/firmware.txt was added by commit 7366b92a77 ("ARM: Add
   interface for registering and calling firmware-specific operations")
 - arm/kernel_mode_neon.txt was added by commit 2afd0a0524 ("ARM:
   7825/1: document the use of NEON in kernel mode")
 - arm/tcm.txt was added by commit bc581770cf ("ARM: 5580/2: ARM TCM
   (Tightly-Coupled Memory) support v3")
 - arm/vlocks.txt was added by commit 9762f12d3e ("ARM: mcpm: Add
   baremetal voting mutexes")
 - blackfin/gptimers-example.c, Makefile was added by commit
   4b60779d5e ("Blackfin: add an example showing how to use the
   gptimers API")
 - devicetree/usage-model.txt was added by commit 31134efc68 ("dt:
   Linux DT usage model documentation")
 - fb/api.txt was added by commit fb21c2f428 ("fbdev: Add FOURCC-based
   format configuration API")
 - fb/sm501.txt was added by commit e6a0498071 ("video, sm501: add
   edid and commandline support")
 - fb/udlfb.txt was added by commit 96f8d864af ("fbdev: move udlfb out
   of staging.")
 - filesystems/Makefile was added by commit 1e0051ae48
   ("Documentation/fs/: split txt and source files")
 - filesystems/nfs/nfsd-admin-interfaces.txt was added by commit
   8a4c6e19cf ("nfsd: document kernel interfaces for nfsd
   configuration")
 - ide/warm-plug-howto.txt was added by commit f74c91413e ("ide: add
   warm-plug support for IDE devices (take 2)")
 - laptops/Makefile was added by commit d49129accc
   ("Documentation/laptop/: split txt and source files")
 - leds/leds-blinkm.txt was added by commit b54cf35a7f ("LEDS: add
   BlinkM RGB LED driver, documentation and update MAINTAINERS")
 - leds/ledtrig-oneshot.txt was added by commit 5e417281cd ("leds: add
   oneshot trigger")
 - leds/ledtrig-transient.txt was added by commit 44e1e9f8e7 ("leds:
   add new transient trigger for one shot timer activation")
 - m68k/README.buddha was part of the initial repo
 - networking/LICENSE.(qla3xxx|qlcnic|qlge) was added by commits
   40839129f7, c4e84bde1d, 5a4faa8737
 - networking/Makefile was added by commit 3794f3e812 ("docsrc: build
   Documentation/ sources")
 - networking/i40evf.txt was added by commit 105bf2fe6b ("i40evf: add
   driver to kernel build system")
 - networking/ipsec.txt was added by commit b3c6efbc36 ("xfrm: Add
   file to document IPsec corner case")
 - networking/mac80211-auth-assoc-deauth.txt was added by commit
   3cd7920a2b ("mac80211: add auth/assoc/deauth flow diagram")
 - networking/netlink_mmap.txt was added by commit 5683264c39
   ("netlink: add documentation for memory mapped I/O")
 - networking/nf_conntrack-sysctl.txt was added by commit c9f9e0e159
   ("netfilter: doc: add nf_conntrack sysctl api documentation") lan)
 - networking/team.txt was added by commit 3d249d4ca7 ("net: introduce
   ethernet teaming device")
 - networking/vxlan.txt was added by commit d342894c5d ("vxlan:
   virtual extensible lan")
 - power/runtime_pm.txt was added by commit 5e928f77a0 ("PM: Introduce
   core framework for run-time PM of I/O devices (rev.  17)")
 - power/charger-manager.txt was added by commit 3bb3dbbd56
   ("power_supply: Add initial Charger-Manager driver")
 - RCU/lockdep-splat.txt was added by commit d7bd2d68aa ("rcu:
   Document interpretation of RCU-lockdep splats")
 - s390/kvm.txt was added by 5ecee4b (KVM: s390: API documentation)
 - s390/qeth.txt was added by commit b4d72c08b3 ("qeth: bridgeport
   support - basic control")
 - scheduler/sched-bwc.txt was added by commit 88ebc08ea9 ("sched: Add
   documentation for bandwidth control")
 - scsi/advansys.txt was added by commit 4bd6d7f356 ("[SCSI] advansys:
   Move documentation to Documentation/scsi")
 - scsi/bfa.txt was added by commit 1ec90174bd ("[SCSI] bfa: add
   readme file")
 - scsi/bnx2fc.txt was added by commit 12b8fc10ea ("[SCSI] bnx2fc: Add
   driver documentation")
 - scsi/cxgb3i.txt was added by commit c3673464eb ("[SCSI] cxgb3i: Add
   cxgb3i iSCSI driver.")
 - scsi/hpsa.txt was added by commit 992ebcf14f ("[SCSI] hpsa: Add
   hpsa.txt to Documentation/scsi")
 - scsi/link_power_management_policy.txt was added by commit
   ca77329fb7 ("[libata] Link power management infrastructure")
 - scsi/osd.txt was added by commit 78e0c621de ("[SCSI] osd:
   Documentation for OSD library")
 - scsi/scsi-parameter.txt was created/moved by commit 163475fb11
   ("Documentation: move SCSI parameters to their own text file")
 - serial/driver was part of the initial repo
 - serial/n_gsm.txt was added by commit 323e84122e ("n_gsm: add a
   documentation")
 - timers/Makefile was added by commit 3794f3e812 ("docsrc: build
   Documentation/ sources")
 - virt/kvm/s390.txt was added by commit d9101fca3d ("KVM: s390:
   diagnose call documentation")
 - vm/split_page_table_lock was added by commit 49076ec2cc ("mm:
   dynamically allocate page->ptl if it cannot be embedded to struct
   page")
 - w1/slaves/w1_ds28e04 was added by commit fbf7f7b4e2 ("w1: Add
   1-wire slave device driver for DS28E04-100")
 - w1/masters/omap-hdq was added by commit e0a29382c6 ("hdq:
   documentation for OMAP HDQ")
 - x86/early-microcode.txt was added by commit 0d91ea86a8 ("x86, doc:
   Documentation for early microcode loading")
 - x86/earlyprintk.txt was added by commit a1aade4788 ("x86/doc:
   mini-howto for using earlyprintk=dbgp")
 - x86/entry_64.txt was added by commit 8b4777a4b5 ("x86-64: Document
   some of entry_64.S")
 - x86/pat.txt was added by commit d27554d874 ("x86: PAT
   documentation")

Moved files
 - arm/kernel_user_helpers.txt was moved out of arch/arm/kernel by
   commit 37b8304642 ("ARM: kuser: move interface documentation out of
   the source code")
 - efi-stub.txt was moved out of x86/ and down into Documentation/ in
   commit 4172fe2f8a ("EFI stub documentation updates")
 - laptops/hpfall.c was moved out of hwmon/ and into laptops/ in commit
   efcfed9bad ("Move hp_accel to drivers/platform/x86")
 - commit 5616c23ad9 ("x86: doc: move x86-generic documentation from
   Doc/x86/i386"):
   * x86/usb-legacy-support.txt
   * x86/boot.txt
   * x86/zero_page.txt
 - power/video_extension.txt was moved to acpi in commit 70e66e4df1
   ("ACPI / video: move video_extension.txt to Documentation/acpi")

Removed files (left in 00-INDEX)
 - memory.txt was removed by commit 00ea8990aa ("memory.txt: remove
   stray information")
 - gpio.txt was moved to gpio/ in commit fd8e198cfc ("Documentation:
   gpiolib: document new interface")
 - networking/DLINK.txt was removed by commit 168e06ae26
   ("drivers/net: delete old parallel port de600/de620 drivers")
 - serial/hayes-esp.txt was removed by commit f53a2ade0b ("tty: esp:
   remove broken driver")
 - s390/TAPE was removed by commit 9e280f6693 ("[S390] remove tape
   block docu")
 - vm/locking was removed by commit 57ea8171d2 ("mm: documentation:
   remove hopelessly out-of-date locking doc")
 - laptops/acer-wmi.txt was remvoed by commit 020036678e ("acer-wmi:
   Delete out-of-date documentation")

Typos/misc issues
 - rpc-server-gss.txt was added as knfsd-rpcgss.txt in commit
   030d794bf4 ("SUNRPC: Use gssproxy upcall for server RPCGSS
   authentication.")
 - commit b88cf73d92 ("net: add missing entries to
   Documentation/networking/00-INDEX")
   * generic-hdlc.txt was added as generic_hdlc.txt
   * spider_net.txt was added as spider-net.txt
 - w1/master/mxc-w1 was added as mxc_w1 by commit a5fd9139f7 ("w1: add
   1-wire master driver for i.MX27 / i.MX31")
 - s390/zfcpdump.txt was added as zfcpdump by commit 6920c12a40
   ("[S390] Add Documentation/s390/00-INDEX.")

Signed-off-by: Henrik Austad <henrik@austad.us>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>	[rcu bits]
Acked-by: Rob Landley <rob@landley.net>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Mark Brown <broonie@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Len Brown <len.brown@intel.com>
Cc: James Bottomley <JBottomley@parallels.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-02-10 16:01:40 -08:00
Martin Schwenke d922e1cb1e net: Document promote_secondaries
From 038a821667f62c496f2bbae27081b1b612122a97 Mon Sep 17 00:00:00 2001
From: Martin Schwenke <martin@meltin.net>
Date: Tue, 28 Jan 2014 15:16:49 +1100
Subject: [PATCH] net: Document promote_secondaries

This option was added a long time ago...

  commit 8f937c6099
  Author: Harald Welte <laforge@gnumonks.org>
  Date:   Sun May 29 20:23:46 2005 -0700

    [IPV4]: Primary and secondary addresses

Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-27 20:39:21 -08:00
Neil Horman bb9fbe2ddd AF_PACKET: Add documentation for queue mapping fanout mode
Recently I added a new AF_PACKET fanout operation mode in commit
2d36097, but I forgot to document it.  Add PACKET_FANOUT_QM as an available mode
in the af_packet documentation.  Applies to net-next.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-27 12:57:32 -08:00
dingtianhong e1d206a713 bonding: update bonding.txt for primary description
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-22 17:45:32 -08:00
Chen-Yu Tsai 938dfdaa3c net: stmmac: Allocate and pass soc/board specific data to callbacks
The current .init and .exit callbacks requires access to driver
private data structures. This is not a good seperation and abstraction.

Instead, we add a new .setup callback for allocating private data, and
pass the returned pointer to the other callbacks.

Signed-off-by: Chen-Yu Tsai <wens@csie.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19 20:02:02 -08:00
Florent Fourcot 6444f72b4b ipv6: add flowlabel_consistency sysctl
With the introduction of IPV6_FL_F_REFLECT, there is no guarantee of
flow label unicity. This patch introduces a new sysctl to protect the old
behaviour, enable by default.

Changelog of V3:
 * rename ip6_flowlabel_consistency to flowlabel_consistency
 * use net_info_ratelimited()
 * checkpatch cleanups

Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19 17:12:31 -08:00
David S. Miller aef2b45fe4 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Conflicts:
	net/xfrm/xfrm_policy.c

Steffen Klassert says:

====================
This pull request has a merge conflict between commits be7928d20b
("net: xfrm: xfrm_policy: fix inline not at beginning of declaration") and
da7c224b1b ("net: xfrm: xfrm_policy: silence compiler warning") from
the net-next tree and commit 2f3ea9a95c ("xfrm: checkpatch erros with
inline keyword position") from the ipsec-next tree.

The version from net-next can be used, like it is done in linux-next.

1) Checkpatch cleanups, from Weilong Chen.

2) Fix lockdep complaints when pktgen is used with IPsec,
   from Fan Du.

3) Update pktgen to allow any combination of IPsec transport/tunnel mode
   and AH/ESP/IPcomp type, from Fan Du.

4) Make pktgen_dst_metrics static, Fengguang Wu.

5) Compile fix for pktgen when CONFIG_XFRM is not set,
   from Fan Du.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 23:14:25 -08:00
David S. Miller 853dc21bfe Included changes:
- drop dependency against CRC16
 - move to new release version
 - add size check at compile time for packet structs
 - update copyright years in every file
 - implement new bonding/interface alternation feature
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABCAAGBQJS0p53AAoJEEKTMo6mOh1VZ7sP/RDgh5PMXC5l1LNTG9gtsV5Z
 zzqs+kEfeTCivcMtONXhBhuli7wlW3eZehscB/Hn6VOKf40Ktwb0tNjrJZ+OMEHC
 hwR7i56ucNOKA5L1cswWwprIY3tV3dK+tI/Y7Oc0d/HAkhU2j3wHWdzCdUMa2yBp
 PurZZrRFXrqcKtIKP+AK1DkkQ2TyUZVNB6dZDf1AifMxhcfFf6Vxg1JGj6AKgvKF
 zf9q8SC5x33qGfvx4VML1b0JNChAwt01PecY/Eo154W3eOWHNXh9UxQEQBRoChbJ
 A/hCoo/LWGFeyqZwEeWBIjR+nGi/5zbCg310FGTgjWZaJ+BBD/mjKAjDhtszX2sI
 Lf0TJ7ytfCn/qij0Xmqg3R5RnmstJpb+weZ7gMqk63o9I06RtvV6/x58tPB+7+Y1
 AJ5egjL0yUypn7LtFUL3z1S7Np6m+FC9KuH47Yc1FyXR8tSwDWYszdlAloqPeYVI
 CDMw73T+6vsWr2UPBnXqudy0BtG3XT8LFXAUC9GMkz0k8GY8RZK7MeUun9Sax+ra
 c7MC+yZYDhwKgNSiEw3J86n0ozsIGVCheAQGuenmdicQdirJa8KImj1CZCoGqWBk
 kWsyNhw68SVLZFSfCiKPjJjbWijpVfQKM/TSB30Cmj2L1hyWeEtBn0McHig6srdX
 hf1Xh4tFh1yGwMYdCcAq
 =BKos
 -----END PGP SIGNATURE-----

Merge tag 'batman-adv-for-davem' of git://git.open-mesh.org/linux-merge

Included changes:
- drop dependency against CRC16
- move to new release version
- add size check at compile time for packet structs
- update copyright years in every file
- implement new bonding/interface alternation feature

Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 21:50:27 -08:00
Norbert van Bolhuis 7e11daa7c1 packet: doc: describe PACKET_MMAP with one packet socket for rx and tx
Document how to use one AF_PACKET mmap socket for RX and TX.

Signed-off-by: Norbert van Bolhuis <nvbolhuis@aimvalley.nl>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 14:45:59 -08:00
Hannes Frederic Sowa 8ed1dc44d3 ipv4: introduce hardened ip_no_pmtu_disc mode
This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
to be honored by protocols which do more stringent validation on the
ICMP's packet payload. This knob is useful for people who e.g. want to
run an unmodified DNS server in a namespace where they need to use pmtu
for TCP connections (as they are used for zone transfers or fallback
for requests) but don't want to use possibly spoofed UDP pmtu information.

Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
if the returned packet is in the window or if the association is valid.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Suggested-by: Florian Weimer <fweimer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:22:55 -08:00
Hannes Frederic Sowa f87c10a8aa ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing
While forwarding we should not use the protocol path mtu to calculate
the mtu for a forwarded packet but instead use the interface mtu.

We mark forwarded skbs in ip_forward with IPSKB_FORWARDED, which was
introduced for multicast forwarding. But as it does not conflict with
our usage in unicast code path it is perfect for reuse.

I moved the functions ip_sk_accept_pmtu, ip_sk_use_pmtu and ip_skb_dst_mtu
along with the new ip_dst_mtu_maybe_forward to net/ip.h to fix circular
dependencies because of IPSKB_FORWARDED.

Because someone might have written a software which does probe
destinations manually and expects the kernel to honour those path mtus
I introduced a new per-namespace "ip_forward_use_pmtu" knob so someone
can disable this new behaviour. We also still use mtus which are locked on a
route for forwarding.

The reason for this change is, that path mtus information can be injected
into the kernel via e.g. icmp_err protocol handler without verification
of local sockets. As such, this could cause the IPv4 forwarding path to
wrongfully emit fragmentation needed notifications or start to fragment
packets along a path.

Tunnel and ipsec output paths clear IPCB again, thus IPSKB_FORWARDED
won't be set and further fragmentation logic will use the path mtu to
determine the fragmentation size. They also recheck packet size with
help of path mtu discovery and report appropriate errors.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: David Miller <davem@davemloft.net>
Cc: John Heffner <johnwheffner@gmail.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-13 11:22:54 -08:00
Antonio Quartulli c28c0b6b14 batman-adv: add missing sysfs attributes to README
Add missing sysfs attributes in the proper section of the README

Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2014-01-12 14:41:18 +01:00
FX Le Bail 509aba3b0d IPv6: add the option to use anycast addresses as source addresses in echo reply
This change allows to follow a recommandation of RFC4942.

- Add "anycast_src_echo_reply" sysctl to control the use of anycast addresses
  as source addresses for ICMPv6 echo reply. This sysctl is false by default
  to preserve existing behavior.
- Add inline check ipv6_anycast_destination().
- Use them in icmpv6_echo_reply().

Reference:
RFC4942 - IPv6 Transition/Coexistence Security Considerations
   (http://tools.ietf.org/html/rfc4942#section-2.1.6)

2.1.6. Anycast Traffic Identification and Security

   [...]
   To avoid exposing knowledge about the internal structure of the
   network, it is recommended that anycast servers now take advantage of
   the ability to return responses with the anycast address as the
   source address if possible.

Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-07 15:51:39 -05:00
Fan Du e5f79d111f {pktgen, xfrm} Document IPsec usage in pktgen.txt
Update pktgen.txt for reference when using IPsec.

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2014-01-03 07:29:12 +01:00
Greg Rose 105bf2fe6b i40evf: add driver to kernel build system
Modify the existing Kconfig, Makefile, and MAINTAINERS to add the driver
to the kernel. Add a Makefile and a documentation

Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Signed-off-by: Greg Rose <gregory.v.rose@intel.com>
Tested-by: Sibai Li <sibai.li@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-12-31 16:27:49 -08:00
dingtianhong 84a6a0acad bonding: update Documentation/networking/bonding.txt for option lp_interval
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-31 00:40:31 -05:00
Atzm Watanabe ac7686b969 packet: doc: add documentation for VLAN TPID delivery
Introduce TP_STATUS_VLAN_TPID_VALID bit into the documentation.

Signed-off-by: Atzm Watanabe <atzm@stratosphere.co.jp>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-21 00:01:40 -05:00
David S. Miller 1669cb9855 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2013-12-19

1) Use the user supplied policy index instead of a generated one
   if present. From Fan Du.

2) Make xfrm migration namespace aware. From Fan Du.

3) Make the xfrm state and policy locks namespace aware. From Fan Du.

4) Remove ancient sleeping when the SA is in acquire state,
   we now queue packets to the policy instead. This replaces the
   sleeping code.

5) Remove FLOWI_FLAG_CAN_SLEEP. This was used to notify xfrm about the
   posibility to sleep. The sleeping code is gone, so remove it.

6) Check user specified spi for IPComp. Thr spi for IPcomp is only
   16 bit wide, so check for a valid value. From Fan Du.

7) Export verify_userspi_info to check for valid user supplied spi ranges
   with pfkey and netlink. From Fan Du.

8) RFC3173 states that if the total size of a compressed payload and the IPComp
   header is not smaller than the size of the original payload, the IP datagram
   must be sent in the original non-compressed form. These packets are dropped
   by the inbound policy check because they are not transformed. Document the need
   to set 'level use' for IPcomp to receive such packets anyway. From Fan Du.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19 18:37:49 -05:00
David S. Miller de47c4ab25 Merge branch 'for-davem' of git://gitorious.org/linux-can/linux-can-next
Marc Kleine-Budde says:

====================
this is a pull request of four patches for net-next/master.

There is one patch by Markus Pargmann, which speeds up the c_can
driver, a patch by John Whitmore which updates the in tree
documentation. A patch by Jeff Kirsher which replaces the FSF's address
by a link and a patch by Alexander Shiyan which converts the mcp251x
driver to make use of managed resources.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19 15:13:14 -05:00
Hannes Frederic Sowa cd174e67a6 ipv4: new ip_no_pmtu_disc mode to always discard incoming frag needed msgs
This new mode discards all incoming fragmentation-needed notifications
as I guess was originally intended with this knob. To not break backward
compatibility too much, I only added a special case for mode 2 in the
receiving path.

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 16:58:20 -05:00
David S. Miller 143c905494 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/i40e/i40e_main.c
	drivers/net/macvtap.c

Both minor merge hassles, simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-18 16:42:06 -05:00
Hannes Frederic Sowa 188b04d580 ipv4: improve documentation of ip_no_pmtu_disc
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-17 15:20:15 -05:00
John Whitmore f35f6c8f74 can: update MAINTAINERS and Documentation
Changed MAINTAINERS file to add Documentation/networking/can.txt to the list of
maintained files.

can.txt:
- Globally changed Socket CAN to SocketCAN
- Removed section 3.3 from the document
- Updated Section 7
- Corrected a few simple typos

Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: John Whitmore <johnfwhitmore@gmail.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2013-12-17 11:47:20 +01:00
Fan Du b3c6efbc36 xfrm: Add file to document IPsec corner case
Create Documentation/networking/ipsec.txt to document IPsec
corner issues and other info, which will be useful when user
deploying IPsec.

Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2013-12-16 12:54:05 +01:00
Daniel Borkmann 7924cd5e0b filter: doc: improve BPF documentation
This patch significantly updates the BPF documentation and describes
its internal architecture, Linux extensions, and handling of the
kernel's BPF and JIT engine, plus documents how development can be
facilitated with the help of bpf_dbg, bpf_asm, bpf_jit_disasm.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-11 20:28:35 -05:00
Florian Fainelli 87aa9f9c61 net: phy: consolidate PHY reset in phy_init_hw()
There are quite a lot of drivers touching a PHY device MII_BMCR
register to reset the PHY without taking care of:

1) ensuring that BMCR_RESET is cleared after a given timeout
2) the PHY state machine resuming to the proper state and re-applying
potentially changed settings such as auto-negotiation

Introduce phy_poll_reset() which will take care of polling the MII_BMCR
for the BMCR_RESET bit to be cleared after a given timeout or return a
timeout error code.

In order to make sure the PHY is in a correct state, phy_init_hw() first
issues a software reset through MII_BMCR and then applies any fixups.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:38:59 -05:00
Daniel Borkmann d346a3fae3 packet: introduce PACKET_QDISC_BYPASS socket option
This patch introduces a PACKET_QDISC_BYPASS socket option, that
allows for using a similar xmit() function as in pktgen instead
of taking the dev_queue_xmit() path. This can be very useful when
PF_PACKET applications are required to be used in a similar
scenario as pktgen, but with full, flexible packet payload that
needs to be provided, for example.

On default, nothing changes in behaviour for normal PF_PACKET
TX users, so everything stays as is for applications. New users,
however, can now set PACKET_QDISC_BYPASS if needed to prevent
own packets from i) reentering packet_rcv() and ii) to directly
push the frame to the driver.

In doing so we can increase pps (here 64 byte packets) for
PF_PACKET a bit:

  # CPUs -- QDISC_BYPASS   -- qdisc path -- qdisc path[**]
  1 CPU  ==  1,509,628 pps --  1,208,708 --  1,247,436
  2 CPUs ==  3,198,659 pps --  2,536,012 --  1,605,779
  3 CPUs ==  4,787,992 pps --  3,788,740 --  1,735,610
  4 CPUs ==  6,173,956 pps --  4,907,799 --  1,909,114
  5 CPUs ==  7,495,676 pps --  5,956,499 --  2,014,422
  6 CPUs ==  9,001,496 pps --  7,145,064 --  2,155,261
  7 CPUs == 10,229,776 pps --  8,190,596 --  2,220,619
  8 CPUs == 11,040,732 pps --  9,188,544 --  2,241,879
  9 CPUs == 12,009,076 pps -- 10,275,936 --  2,068,447
 10 CPUs == 11,380,052 pps -- 11,265,337 --  1,578,689
 11 CPUs == 11,672,676 pps -- 11,845,344 --  1,297,412
 [...]
 20 CPUs == 11,363,192 pps -- 11,014,933 --  1,245,081

 [**]: qdisc path with packet_rcv(), how probably most people
       seem to use it (hopefully not anymore if not needed)

The test was done using a modified trafgen, sending a simple
static 64 bytes packet, on all CPUs.  The trick in the fast
"qdisc path" case, is to avoid reentering packet_rcv() by
setting the RAW socket protocol to zero, like:
socket(PF_PACKET, SOCK_RAW, 0);

Tradeoffs are documented as well in this patch, clearly, if
queues are busy, we will drop more packets, tc disciplines are
ignored, and these packets are not visible to taps anymore. For
a pktgen like scenario, we argue that this is acceptable.

The pointer to the xmit function has been placed in packet
socket structure hole between cached_dev and prot_hook that
is hot anyway as we're working on cached_dev in each send path.

Done in joint work together with Jesper Dangaard Brouer.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:23:33 -05:00
David S. Miller 34f9f43710 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Merge 'net' into 'net-next' to get the AF_PACKET bug fix that
Daniel's direct transmit changes depend upon.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:20:14 -05:00
Daniel Borkmann 66e56cd46b packet: fix send path when running with proto == 0
Commit e40526cb20 introduced a cached dev pointer, that gets
hooked into register_prot_hook(), __unregister_prot_hook() to
update the device used for the send path.

We need to fix this up, as otherwise this will not work with
sockets created with protocol = 0, plus with sll_protocol = 0
passed via sockaddr_ll when doing the bind.

So instead, assign the pointer directly. The compiler can inline
these helper functions automagically.

While at it, also assume the cached dev fast-path as likely(),
and document this variant of socket creation as it seems it is
not widely used (seems not even the author of TX_RING was aware
of that in his reference example [1]). Tested with reproducer
from e40526cb20.

 [1] http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap#Example

Fixes: e40526cb20 ("packet: fix use after free race in send path when dev is released")
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Tested-by: Salam Noureddine <noureddine@aristanetworks.com>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-09 20:09:20 -05:00
David S. Miller f1abb346d8 Merge branch 'for-davem' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next
John W. Linville says:

====================
Please pull this batch of updates intended for the 3.14 stream...

For the mac80211 bits, Johannes says:

"I have various improvements/cleanups/fixes all over, but the shortlog
shows that Luis's regulatory work and mesh work from the cozybit folks
are the biggest ones, along with the CSA fixes."

Along with that, we have big batches of updates to brcmfmac, rtlwifi,
and ath9k.  There are updates to wcn36xx, rt2x00, and a handful of
others as well.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 14:25:23 -05:00
Eric Dumazet f54b311142 tcp: auto corking
With the introduction of TCP Small Queues, TSO auto sizing, and TCP
pacing, we can implement Automatic Corking in the kernel, to help
applications doing small write()/sendmsg() to TCP sockets.

Idea is to change tcp_push() to check if the current skb payload is
under skb optimal size (a multiple of MSS bytes)

If under 'size_goal', and at least one packet is still in Qdisc or
NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
will be delayed up to TX completion time.

This delay might allow the application to coalesce more bytes
in the skb in following write()/sendmsg()/sendfile() system calls.

The exact duration of the delay is depending on the dynamics
of the system, and might be zero if no packet for this flow
is actually held in Qdisc or NIC TX ring.

Using FQ/pacing is a way to increase the probability of
autocorking being triggered.

Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control
this feature and default it to 1 (enabled)

Add a new SNMP counter : nstat -a | grep TcpExtTCPAutoCorking
This counter is incremented every time we detected skb was under used
and its flush was deferred.

Tested:

Interesting effects when using line buffered commands under ssh.

Excellent performance results in term of cpu usage and total throughput.

lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39

 Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':

      35209.439626 task-clock                #    2.901 CPUs utilized
             2,294 context-switches          #    0.065 K/sec
               101 CPU-migrations            #    0.003 K/sec
             4,079 page-faults               #    0.116 K/sec
    97,923,241,298 cycles                    #    2.781 GHz                     [83.31%]
    51,832,908,236 stalled-cycles-frontend   #   52.93% frontend cycles idle    [83.30%]
    25,697,986,603 stalled-cycles-backend    #   26.24% backend  cycles idle    [66.70%]
   102,225,978,536 instructions              #    1.04  insns per cycle
                                             #    0.51  stalled cycles per insn [83.38%]
    18,657,696,819 branches                  #  529.906 M/sec                   [83.29%]
        91,679,646 branch-misses             #    0.49% of all branches         [83.40%]

      12.136204899 seconds time elapsed

lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89

 Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
      40045.864494 task-clock                #    3.301 CPUs utilized
               171 context-switches          #    0.004 K/sec
                53 CPU-migrations            #    0.001 K/sec
             4,080 page-faults               #    0.102 K/sec
   111,340,458,645 cycles                    #    2.780 GHz                     [83.34%]
    61,778,039,277 stalled-cycles-frontend   #   55.49% frontend cycles idle    [83.31%]
    29,295,522,759 stalled-cycles-backend    #   26.31% backend  cycles idle    [66.67%]
   108,654,349,355 instructions              #    0.98  insns per cycle
                                             #    0.57  stalled cycles per insn [83.34%]
    19,552,170,748 branches                  #  488.244 M/sec                   [83.34%]
       157,875,417 branch-misses             #    0.81% of all branches         [83.34%]

      12.130267788 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-06 12:51:41 -05:00
John W. Linville d86804cb70 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem 2013-12-06 10:37:24 -05:00
John W. Linville 4b074b0762 Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next 2013-12-02 14:25:38 -05:00
Luis R. Rodriguez 8fe02e167e cfg80211: consolidate passive-scan and no-ibss flags
These two flags are used for the same purpose, just
combine them into a no-ir flag to annotate no initiating
radiation is allowed.

Old userspace sending either flag will have it treated as
the no-ir flag. To be considerate to older userspace we
also send both the no-ir flag and the old no-ibss flags.
Newer userspace will have to be aware of older kernels.

Update all places in the tree using these flags with the
following semantic patch:

@@
@@
-NL80211_RRF_PASSIVE_SCAN
+NL80211_RRF_NO_IR
@@
@@
-NL80211_RRF_NO_IBSS
+NL80211_RRF_NO_IR
@@
@@
-IEEE80211_CHAN_PASSIVE_SCAN
+IEEE80211_CHAN_NO_IR
@@
@@
-IEEE80211_CHAN_NO_IBSS
+IEEE80211_CHAN_NO_IR
@@
@@
-NL80211_RRF_NO_IR | NL80211_RRF_NO_IR
+NL80211_RRF_NO_IR
@@
@@
-IEEE80211_CHAN_NO_IR | IEEE80211_CHAN_NO_IR
+IEEE80211_CHAN_NO_IR
@@
@@
-(NL80211_RRF_NO_IR)
+NL80211_RRF_NO_IR
@@
@@
-(IEEE80211_CHAN_NO_IR)
+IEEE80211_CHAN_NO_IR

Along with some hand-optimisations in documentation, to
remove duplicates and to fix some indentation.

Signed-off-by: Luis R. Rodriguez <mcgrof@do-not-panic.com>
[do all the driver updates in one go]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2013-11-25 20:49:35 +01:00
Ben Hutchings a4bcc795e9 net_tstamp,doc: Add test program for SIOC{G,S}HWTSTAMP
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2013-11-22 20:10:53 +00:00
Ben Hutchings fd468c74bd net_tstamp: Add SIOCGHWTSTAMP ioctl to match SIOCSHWTSTAMP
SIOCSHWTSTAMP returns the real configuration to the application
using it, but there is currently no way for any other
application to find out the configuration non-destructively.
Add a new ioctl for this, making it unprivileged.

Signed-off-by: Ben Hutchings <bhutchings@solarflare.com>
2013-11-19 19:07:21 +00:00
Eric Dumazet 98e09386c0 tcp: tsq: restore minimal amount of queueing
After commit c9eeec26e3 ("tcp: TSQ can use a dynamic limit"), several
users reported throughput regressions, notably on mvneta and wifi
adapters.

802.11 AMPDU requires a fair amount of queueing to be effective.

This patch partially reverts the change done in tcp_write_xmit()
so that the minimal amount is sysctl_tcp_limit_output_bytes.

It also remove the use of this sysctl while building skb stored
in write queue, as TSO autosizing does the right thing anyway.

Users with well behaving NICS and correct qdisc (like sch_fq),
can then lower the default sysctl_tcp_limit_output_bytes value from
128KB to 8KB.

This new usage of sysctl_tcp_limit_output_bytes permits each driver
authors to check how their driver performs when/if the value is set
to a minimum of 4KB.

Normally, line rate for a single TCP flow should be possible,
but some drivers rely on timers to perform TX completion and
too long TX completion delays prevent reaching full throughput.

Fixes: c9eeec26e3 ("tcp: TSQ can use a dynamic limit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Sujith Manoharan <sujith@msujith.org>
Reported-by: Arnaud Ebalard <arno@natisbad.org>
Tested-by: Sujith Manoharan <sujith@msujith.org>
Cc: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-14 16:25:14 -05:00
Nikolay Aleksandrov 12465fb833 bonding: document the new packets_per_slave option
Add new documentation for the packets_per_slave option available
for balance-rr mode.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-07 15:10:21 -05:00
David S. Miller 13521a5797 Merge branch 'for-davem' of git://gitorious.org/linux-can/linux-can-next
Marc Kleine-Budde says:

====================
here's a pull request for net-next.

It includes a patch by Oliver Hartkopp et al. that adds documentation
for the broadcast manager to Documentation/networking/can.txt. Three
patches by me that clean up the netlink handling code in the CAN core.
And another patch that removes a not needed function from the ti_hecc
driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04 19:59:44 -05:00
Yuchung Cheng 9f9843a751 tcp: properly handle stretch acks in slow start
Slow start now increases cwnd by 1 if an ACK acknowledges some packets,
regardless the number of packets. Consequently slow start performance
is highly dependent on the degree of the stretch ACKs caused by
receiver or network ACK compression mechanisms (e.g., delayed-ACK,
GRO, etc).  But slow start algorithm is to send twice the amount of
packets of packets left so it should process a stretch ACK of degree
N as if N ACKs of degree 1, then exits when cwnd exceeds ssthresh. A
follow up patch will use the remainder of the N (if greater than 1)
to adjust cwnd in the congestion avoidance phase.

In addition this patch retires the experimental limited slow start
(LSS) feature. LSS has multiple drawbacks but questionable benefit. The
fractional cwnd increase in LSS requires a loop in slow start even
though it's rarely used. Configuring such an increase step via a global
sysctl on different BDPS seems hard. Finally and most importantly the
slow start overshoot concern is now better covered by the Hybrid slow
start (hystart) enabled by default.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04 19:57:59 -05:00
Yuchung Cheng 0d41cca490 tcp: enable sockets to use MSG_FASTOPEN by default
Applications have started to use Fast Open (e.g., Chrome browser has
such an optional flag) and the feature has gone through several
generations of kernels since 3.7 with many real network tests. It's
time to enable this flag by default for applications to test more
conveniently and extensively.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04 19:57:47 -05:00
David S. Miller 394efd19d5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be.h
	drivers/net/netconsole.c
	net/bridge/br_private.h

Three mostly trivial conflicts.

The net/bridge/br_private.h conflict was a function signature (argument
addition) change overlapping with the extern removals from Joe Perches.

In drivers/net/netconsole.c we had one change adjusting a printk message
whilst another changed "printk(KERN_INFO" into "pr_info(".

Lastly, the emulex change was a new inline function addition overlapping
with Joe Perches's extern removals.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-04 13:48:30 -05:00
Eric Dumazet 74d332c13b net: extend net_device allocation to vmalloc()
Joby Poriyath provided a xen-netback patch to reduce the size of
xenvif structure as some netdev allocation could fail under
memory pressure/fragmentation.

This patch is handling the problem at the core level, allowing
any netdev structures to use vmalloc() if kmalloc() failed.

As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT
to kzalloc() flags to do this fallback only when really needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Joby Poriyath <joby.poriyath@citrix.com>
Cc: Ben Hutchings <bhutchings@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-03 23:19:00 -05:00
Oliver Hartkopp 51b2f451b5 can: add broadcast manager documentation
This patch adds documentation about the broadcast manager. It's based on Brian
Thorne's initial patch http://marc.info/?l=linux-can&m=138119382015496&w=2 and
Daniele Venzano's work http://brownhat.org/docs/socketcan.html .

Signed-off-by: Brian Thorne <hardbyte@gmail.com>
Cc: Daniele Venzano <linux@brownhat.org>
Cc: Andre Naujoks <nautsch2@gmail.com>
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2013-10-31 20:45:04 +01:00
Masanari Iida c17cb8b55b doc:net: Fix typo in Documentation/networking
Correct spelling typo in Documentation/networking

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-30 17:10:20 -04:00
Randy Dunlap ad86de802d Documentation/networking: netdev-FAQ typo corrections
Various typo fixes to netdev-FAQ.txt:
- capitalize Linux
- hyphenate dual-word adjectives
- minor punctuation fixes

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-27 16:55:40 -04:00
Marek Lindner bc58eeef74 batman-adv: update email address for Marek Lindner
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2013-10-19 14:53:48 +02:00
Simon Wunderlich c679ff8fb2 batman-adv: update email address for Simon Wunderlich
My university will stop email service for alumni in january 2014, please
use my new e-mail address instead.

Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
2013-10-19 14:53:41 +02:00
Simon Wunderlich 9f4980e68b batman-adv: remove vis functionality
This is replaced by a userspace program, we don't need this
functionality to bloat the kernel.

Signed-off-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <lindner_marek@yahoo.de>
Signed-off-by: Antonio Quartulli <antonio@meshcoding.com>
2013-10-09 21:22:32 +02:00
Nikolay Aleksandrov 7a6afab1de bonding: document the new xmit policy modes and update the changed ones
Add new documentation for encap2+3 and encap3+4, also update the formula
for the old modes due to the changes.

Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-03 15:36:38 -04:00
Neil Horman 7eacd03810 bonding: Make alb learning packet interval configurable
running bonding in ALB mode requires that learning packets be sent periodically,
so that the switch knows where to send responding traffic.  However, depending
on switch configuration, there may not be any need to send traffic at the
default rate of 3 packets per second, which represents little more than wasted
data.  Allow the ALB learning packet interval to be made configurable via sysfs

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Acked-by: Veaceslav Falico <vfalico@redhat.com>
CC: Jay Vosburgh <fubar@us.ibm.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-15 22:20:44 -04:00
Jesse Brandeburg 1bff652941 i40e: include i40e in kernel proper
This patch adds the changes for Kconfig, i40e.txt, MAINTAINERS, Kbuild
and new i40e/Makefile to build i40e with the kernel.

New driver build option is CONFIG_I40E

Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Shannon Nelson <shannon.nelson@intel.com>
CC: PJ Waskiewicz <peter.p.waskiewicz.jr@intel.com>
CC: e1000-devel@lists.sourceforge.net
Tested-by: Kavindya Deegala <kavindya.s.deegala@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2013-09-11 02:28:40 -07:00
Daniel Borkmann f212781082 net: ipv6: mld: document force_mld_version in ip-sysctl.txt
Document force_mld_version parameter in ip-sysctl.txt.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-09-04 14:53:21 -04:00
Sonic Zhang e2a240c7d3 driver:net:stmmac: Disable DMA store and forward mode if platform data force_thresh_dma_mode is set.
Some synopsys ip implementation doesn't support DMA store and forward mode,
such as BF60x. So, set force_thresh_dma_mode to use DMA thresholds only.
Update document and devicetree as well.

Signed-off-by: Sonic Zhang <sonic.zhang@analog.com>
Acked-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-30 17:26:09 -04:00
Daniel Borkmann 7ec06da81d net: packet: document available fanout policies
Update documentation to add fanout policies that are available.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 16:43:29 -04:00
Eric Dumazet 95bd09eb27 tcp: TSO packets automatic sizing
After hearing many people over past years complaining against TSO being
bursty or even buggy, we are proud to present automatic sizing of TSO
packets.

One part of the problem is that tcp_tso_should_defer() uses an heuristic
relying on upcoming ACKS instead of a timer, but more generally, having
big TSO packets makes little sense for low rates, as it tends to create
micro bursts on the network, and general consensus is to reduce the
buffering amount.

This patch introduces a per socket sk_pacing_rate, that approximates
the current sending rate, and allows us to size the TSO packets so
that we try to send one packet every ms.

This field could be set by other transports.

Patch has no impact for high speed flows, where having large TSO packets
makes sense to reach line rate.

For other flows, this helps better packet scheduling and ACK clocking.

This patch increases performance of TCP flows in lossy environments.

A new sysctl (tcp_min_tso_segs) is added, to specify the
minimal size of a TSO packet (default being 2).

A follow-up patch will provide a new packet scheduler (FQ), using
sk_pacing_rate as an input to perform optional per flow pacing.

This explains why we chose to set sk_pacing_rate to twice the current
rate, allowing 'slow start' ramp up.

sk_pacing_rate = 2 * cwnd * mss / srtt

v2: Neal Cardwell reported a suspect deferring of last two segments on
initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
into account tp->xmit_size_goal_segs

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Tom Herbert <therbert@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 15:50:06 -04:00
Hannes Frederic Sowa b800c3b966 ipv6: drop fragmented ndisc packets by default (RFC 6980)
This patch implements RFC6980: Drop fragmented ndisc packets by
default. If a fragmented ndisc packet is received the user is informed
that it is possible to disable the check.

Cc: Fernando Gont <fernando@gont.com.ar>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 15:32:08 -04:00
David S. Miller 5b2941b18d Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/jesse/openvswitch
Jesse Gross says:

====================
A number of significant new features and optimizations for net-next/3.12.
Highlights are:
 * "Megaflows", an optimization that allows userspace to specify which
   flow fields were used to compute the results of the flow lookup.
   This allows for a major reduction in flow setups (the major
   performance bottleneck in Open vSwitch) without reducing flexibility.
 * Converting netlink dump operations to use RCU, allowing for
   additional parallelism in userspace.
 * Matching and modifying SCTP protocol fields.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-27 22:11:18 -04:00
Jeff Kirsher d7064f4c19 Documentation/networking/: Update Intel wired LAN driver documentation
Updates the documentation to the Intel wired LAN drivers.

Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Tested-by: Phil Schmitt <phillip.j.schmitt@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-27 16:05:26 -04:00
Andy Zhou 03f0d916aa openvswitch: Mega flow implementation
Add wildcarded flow support in kernel datapath.

Wildcarded flow can improve OVS flow set up performance by avoid sending
matching new flows to the user space program. The exact performance boost
will largely dependent on wildcarded flow hit rate.

In case all new flows hits wildcard flows, the flow set up rate is
within 5% of that of linux bridge module.

Pravin has made significant contributions to this patch. Including API
clean ups and bug fixes.

Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
2013-08-23 16:43:07 -07:00
David S. Miller 89d5e23210 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Conflicts:
	net/netfilter/nf_conntrack_proto_tcp.c

The conflict had to do with overlapping changes dealing with
fixing the use of an "s32" to hold the value returned by
NAT_OFFSET().

Pablo Neira Ayuso says:

====================
The following batch contains Netfilter/IPVS updates for your net-next tree.
More specifically, they are:

* Trivial typo fix in xt_addrtype, from Phil Oester.

* Remove net_ratelimit in the conntrack logging for consistency with other
  logging subsystem, from Patrick McHardy.

* Remove unneeded includes from the recently added xt_connlabel support, from
  Florian Westphal.

* Allow to update conntracks via nfqueue, don't need NFQA_CFG_F_CONNTRACK for
  this, from Florian Westphal.

* Remove tproxy core, now that we have socket early demux, from Florian
  Westphal.

* A couple of patches to refactor conntrack event reporting to save a good
  bunch of lines, from Florian Westphal.

* Fix missing locking in NAT sequence adjustment, it did not manifested in
  any known bug so far, from Patrick McHardy.

* Change sequence number adjustment variable to 32 bits, to delay the
  possible early overflow in long standing connections, also from Patrick.

* Comestic cleanups for IPVS, from Dragos Foianu.

* Fix possible null dereference in IPVS in the SH scheduler, from Daniel
  Borkmann.

* Allow to attach conntrack expectations via nfqueue. Before this patch, you
  had to use ctnetlink instead, thus, we save the conntrack lookup.

* Export xt_rpfilter and xt_HMARK header files, from Nicolas Dichtel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-20 13:30:54 -07:00
Hannes Frederic Sowa fc4eba58b4 ipv6: make unsolicited report intervals configurable for mld
Commit cab70040df ("net: igmp:
Reduce Unsolicited report interval to 1s when using IGMPv3") and
2690048c01 ("net: igmp: Allow user-space
configuration of igmp unsolicited report interval") by William Manley made
igmp unsolicited report intervals configurable per interface and corrected
the interval of unsolicited igmpv3 report messages resendings to 1s.

Same needs to be done for IPv6:

MLDv1 (RFC2710 7.10.): 10 seconds
MLDv2 (RFC3810 9.11.): 1 second

Both intervals are configurable via new procfs knobs
mldv1_unsolicited_report_interval and mldv2_unsolicited_report_interval.

(also added .force_mld_version to ipv6_devconf_dflt to bring structs in
line without semantic changes)

v2:
a) Joined documentation update for IPv4 and IPv6 MLD/IGMP
   unsolicited_report_interval procfs knobs.
b) incorporate stylistic feedback from William Manley

v3:
a) add new DEVCONF_* values to the end of the enum (thanks to David
   Miller)

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: William Manley <william.manley@youview.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-13 17:05:04 -07:00
David S. Miller 71acc0ddd4 Revert "net: sctp: convert sctp_checksum_disable module param into sctp sysctl"
This reverts commit cda5f98e36.

As per Vlad's request.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09 13:09:41 -07:00
Daniel Borkmann cda5f98e36 net: sctp: convert sctp_checksum_disable module param into sctp sysctl
Get rid of the last module parameter for SCTP and make this
configurable via sysctl for SCTP like all the rest of SCTP's
configuration knobs.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-09 11:33:02 -07:00
Paul Gortmaker 49dfe76261 Documentation: add networking/netdev-FAQ.txt
A collection of expectations and operational details about how
networking development takes place in the context of the netdev
mailing list.

The content is meant to capture specific items that are unique
to netdev workflow, and not re-document generic linux expectations
that are already captured elsewhere.

This was originally proposed[1] as a regular posting mailing list
FAQ, but it probably is more universally accessible here in tree.

[1] https://lwn.net/Articles/559211/

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-31 17:36:29 -07:00
Florian Westphal fd158d79d3 netfilter: tproxy: remove nf_tproxy_core, keep tw sk assigned to skb
The module was "permanent", due to the special tproxy skb->destructor.
Nowadays we have tcp early demux and its sock_edemux destructor in
networking core which can be used instead.

Thanks to early demux changes the input path now also handles
"skb->sk is tw socket" correctly, so this no longer needs the special
handling introduced with commit d503b30bd6
(netfilter: tproxy: do not assign timewait sockets to skb->sk).

Thus:
- move assign_sock function to where its needed
- don't prevent timewait sockets from being assigned to the skb
- remove nf_tproxy_core.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2013-07-31 16:39:40 +02:00
Hannes Frederic Sowa 5ad37d5dee tcp: add tcp_syncookies mode to allow unconditionally generation of syncookies
| If you want to test which effects syncookies have to your
| network connections you can set this knob to 2 to enable
| unconditionally generation of syncookies.

Original idea and first implementation by Eric Dumazet.

Cc: Florian Westphal <fw@strlen.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-30 16:15:18 -07:00
Eric Dumazet c9bee3b7fd tcp: TCP_NOTSENT_LOWAT socket option
Idea of this patch is to add optional limitation of number of
unsent bytes in TCP sockets, to reduce usage of kernel memory.

TCP receiver might announce a big window, and TCP sender autotuning
might allow a large amount of bytes in write queue, but this has little
performance impact if a large part of this buffering is wasted :

Write queue needs to be large only to deal with large BDP, not
necessarily to cope with scheduling delays (incoming ACKS make room
for the application to queue more bytes)

For most workloads, using a value of 128 KB or less is OK to give
applications enough time to react to POLLOUT events in time
(or being awaken in a blocking sendmsg())

This patch adds two ways to set the limit :

1) Per socket option TCP_NOTSENT_LOWAT

2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.

This changes poll()/select()/epoll() to report POLLOUT
only if number of unsent bytes is below tp->nosent_lowat

Note this might increase number of sendmsg()/sendfile() calls
when using non blocking sockets,
and increase number of context switches for blocking sockets.

Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
defined as :
 Specify the minimum number of bytes in the buffer until
 the socket layer will pass the data to the protocol)

Tested:

netperf sessions, and watching /proc/net/protocols "memory" column for TCP

With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y

Using 128KB has no bad effect on the throughput or cpu usage
of a single flow, although there is an increase of context switches.

A bonus is that we hold socket lock for a shorter amount
of time and should improve latencies of ACK processing.

lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

           412,514 context-switches

     200.034645535 seconds time elapsed

lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
Final       Final                                             %     Method %      Method
1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB

 Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':

         2,675,818 context-switches

     200.029651391 seconds time elapsed

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:54:48 -07:00
Daniel Borkmann 91705c61b5 net: sctp: trivial: update mailing list address
The SCTP mailing list address to send patches or questions
to is linux-sctp@vger.kernel.org and not
lksctp-developers@lists.sourceforge.net anymore. Therefore,
update all occurences.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-24 17:53:38 -07:00
Linus Torvalds 496322bc91 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "This is a re-do of the net-next pull request for the current merge
  window.  The only difference from the one I made the other day is that
  this has Eliezer's interface renames and the timeout handling changes
  made based upon your feedback, as well as a few bug fixes that have
  trickeled in.

  Highlights:

   1) Low latency device polling, eliminating the cost of interrupt
      handling and context switches.  Allows direct polling of a network
      device from socket operations, such as recvmsg() and poll().

      Currently ixgbe, mlx4, and bnx2x support this feature.

      Full high level description, performance numbers, and design in
      commit 0a4db187a9 ("Merge branch 'll_poll'")

      From Eliezer Tamir.

   2) With the routing cache removed, ip_check_mc_rcu() gets exercised
      more than ever before in the case where we have lots of multicast
      addresses.  Use a hash table instead of a simple linked list, from
      Eric Dumazet.

   3) Add driver for Atheros CQA98xx 802.11ac wireless devices, from
      Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
      Marek Puzyniak, Michal Kazior, and Sujith Manoharan.

   4) Support reporting the TUN device persist flag to userspace, from
      Pavel Emelyanov.

   5) Allow controlling network device VF link state using netlink, from
      Rony Efraim.

   6) Support GRE tunneling in openvswitch, from Pravin B Shelar.

   7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
      Daniel Borkmann and Eric Dumazet.

   8) Allow controlling of TCP quickack behavior on a per-route basis,
      from Cong Wang.

   9) Several bug fixes and improvements to vxlan from Stephen
      Hemminger, Pravin B Shelar, and Mike Rapoport.  In particular,
      support receiving on multiple UDP ports.

  10) Major cleanups, particular in the area of debugging and cookie
      lifetime handline, to the SCTP protocol code.  From Daniel
      Borkmann.

  11) Allow packets to cross network namespaces when traversing tunnel
      devices.  From Nicolas Dichtel.

  12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
      manner akin to how we monitor real network traffic via ptype_all.
      From Daniel Borkmann.

  13) Several bug fixes and improvements for the new alx device driver,
      from Johannes Berg.

  14) Fix scalability issues in the netem packet scheduler's time queue,
      by using an rbtree.  From Eric Dumazet.

  15) Several bug fixes in TCP loss recovery handling, from Yuchung
      Cheng.

  16) Add support for GSO segmentation of MPLS packets, from Simon
      Horman.

  17) Make network notifiers have a real data type for the opaque
      pointer that's passed into them.  Use this to properly handle
      network device flag changes in arp_netdev_event().  From Jiri
      Pirko and Timo Teräs.

  18) Convert several drivers over to module_pci_driver(), from Peter
      Huewe.

  19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use a
      O(1) calculation instead.  From Eric Dumazet.

  20) Support setting of explicit tunnel peer addresses in ipv6, just
      like ipv4.  From Nicolas Dichtel.

  21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.

  22) Prevent a single high rate flow from overruning an individual cpu
      during RX packet processing via selective flow shedding.  From
      Willem de Bruijn.

  23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
      Dumazet.

  24) Don't just drop GSO packets which are above the TBF scheduler's
      burst limit, chop them up so they are in-bounds instead.  Also
      from Eric Dumazet.

  25) VLAN offloads are missed when configured on top of a bridge, fix
      from Vlad Yasevich.

  26) Support IPV6 in ping sockets.  From Lorenzo Colitti.

  27) Receive flow steering targets should be updated at poll() time
      too, from David Majnemer.

  28) Fix several corner case regressions in PMTU/redirect handling due
      to the routing cache removal, from Timo Teräs.

  29) We have to be mindful of ipv4 mapped ipv6 sockets in
      upd_v6_push_pending_frames().  From Hannes Frederic Sowa.

  30) Fix L2TP sequence number handling bugs, from James Chapman."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
  drivers/net: caif: fix wrong rtnl_is_locked() usage
  drivers/net: enic: release rtnl_lock on error-path
  vhost-net: fix use-after-free in vhost_net_flush
  net: mv643xx_eth: do not use port number as platform device id
  net: sctp: confirm route during forward progress
  virtio_net: fix race in RX VQ processing
  virtio: support unlocked queue poll
  net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
  Documentation: Fix references to defunct linux-net@vger.kernel.org
  net/fs: change busy poll time accounting
  net: rename low latency sockets functions to busy poll
  bridge: fix some kernel warning in multicast timer
  sfc: Fix memory leak when discarding scattered packets
  sit: fix tunnel update via netlink
  dt:net:stmmac: Add dt specific phy reset callback support.
  dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
  dt:net:stmmac: Allocate platform data only if its NULL.
  net:stmmac: fix memleak in the open method
  ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
  net: ipv6: fix wrong ping_v6_sendmsg return value
  ...
2013-07-09 18:24:39 -07:00
Geert Uytterhoeven b78ba72cda Documentation: Fix references to defunct linux-net@vger.kernel.org
linux-net@vger.kernel.org was replaced by netdev@oss.sgi.com was replaced
by netdev@vger.kernel.org.

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-09 12:42:19 -07:00
Linus Torvalds 80cc38b163 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina:
 "The usual stuff from trivial tree"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (34 commits)
  treewide: relase -> release
  Documentation/cgroups/memory.txt: fix stat file documentation
  sysctl/net.txt: delete reference to obsolete 2.4.x kernel
  spinlock_api_smp.h: fix preprocessor comments
  treewide: Fix typo in printk
  doc: device tree: clarify stuff in usage-model.txt.
  open firmware: "/aliasas" -> "/aliases"
  md: bcache: Fixed a typo with the word 'arithmetic'
  irq/generic-chip: fix a few kernel-doc entries
  frv: Convert use of typedef ctl_table to struct ctl_table
  sgi: xpc: Convert use of typedef ctl_table to struct ctl_table
  doc: clk: Fix incorrect wording
  Documentation/arm/IXP4xx fix a typo
  Documentation/networking/ieee802154 fix a typo
  Documentation/DocBook/media/v4l fix a typo
  Documentation/video4linux/si476x.txt fix a typo
  Documentation/virtual/kvm/api.txt fix a typo
  Documentation/early-userspace/README fix a typo
  Documentation/video4linux/soc-camera.txt fix a typo
  lguest: fix CONFIG_PAE -> CONFIG_x86_PAE in comment
  ...
2013-07-04 11:40:58 -07:00
David S. Miller 0c1072ae02 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/freescale/fec_main.c
	drivers/net/ethernet/renesas/sh_eth.c
	net/ipv4/gre.c

The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list)
and the splitting of the gre.c code into seperate files.

The FEC conflict was two sets of changes adding ethtool support code
in an "!CONFIG_M5272" CPP protected block.

Finally the sh_eth.c conflict was between one commit add bits set
in the .eesr_err_check mask whilst another commit removed the
.tx_error_check member and assignments.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-07-03 14:55:13 -07:00
David S. Miller 4e144d3a80 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
The following batch contains Netfilter/IPVS updates for net-next,
they are:

* Enforce policy to several nfnetlink subsystem, from Daniel
  Borkmann.

* Use xt_socket to match the third packet (to perform simplistic
  socket-based stateful filtering), from Eric Dumazet.

* Avoid large timeout for picked up from the middle TCP flows,
  from Florian Westphal.

* Exclude IPVS from struct net if IPVS is disabled and removal
  of unnecessary included header file, from JunweiZhang.

* Release SCTP connection immediately under load, to mimic current
  TCP behaviour, from Julian Anastasov.

* Replace and enhance SCTP state machine, from Julian Anastasov.

* Add tweak to reduce sync traffic in the presence of persistence,
  also from Julian Anastasov.

* Add tweak for the IPVS SH scheduler not to reject connections
  directed to a server, choose a new one instead, from Alexander
  Frolkin.

* Add support for sloppy TCP and SCTP modes, that creates state
  information on any packet, not only initial handshake packets,
  from Alexander Frolkin.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-30 17:35:13 -07:00
Julian Anastasov 4d0c875dcc ipvs: add sync_persist_mode flag
Add sync_persist_mode flag to reduce sync traffic
by syncing only persistent templates.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Tested-by: Aleksey Chudov <aleksey.chudov@gmail.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-06-26 18:01:46 +09:00
Veaceslav Falico 8599b52e14 bonding: add an option to fail when any of arp_ip_target is inaccessible
Currently, we fail only when all of the ips in arp_ip_target are gone.
However, in some situations we might need to fail if even one host from
arp_ip_target becomes unavailable.

All situations, obviously, rely on the idea that we need *completely*
functional network, with all interfaces/addresses working correctly.

One real world example might be:
vlans on top on bond (hybrid port). If bond and vlans have ips assigned
and we have their peers monitored via arp_ip_target - in case of switch
misconfiguration (trunk/access port), slave driver malfunction or
tagged/untagged traffic dropped on the way - we will be able to switch
to another slave.

Though any other configuration needs that if we need to have access to all
arp_ip_targets.

This patch adds this possibility by adding a new parameter -
arp_all_targets (both as a module parameter and as a sysfs knob). It can be
set to:

	0 or any (the default) - which works exactly as it's working now -
	the slave is up if any of the arp_ip_targets are up.

	1 or all - the slave is up if all of the arp_ip_targets are up.

This parameter can be changed on the fly (via sysfs), and requires the mode
to be active-backup and arp_validate to be enabled (it obeys the
arp_validate config on which slaves to validate).

Internally it's done through:

1) Add target_last_arp_rx[BOND_MAX_ARP_TARGETS] array to slave struct. It's
   an array of jiffies, meaning that slave->target_last_arp_rx[i] is the
   last time we've received arp from bond->params.arp_targets[i] on this
   slave.

2) If we successfully validate an arp from bond->params.arp_targets[i] in
   bond_validate_arp() - update the slave->target_last_arp_rx[i] with the
   current jiffies value.

3) When getting slave's last_rx via slave_last_rx(), we return the oldest
   time when we've received an arp from any address in
   bond->params.arp_targets[].

If the value of arp_all_targets == 0 - we still work the same way as
before.

Also, update the documentation to reflect the new parameter.

v3->v4:
Kill the forgotten rtnl_unlock(), rephrase the documentation part to be
more clear, don't fail setting arp_all_targets if arp_validate is not set -
it has no effect anyway but can be easier to set up. Also, print a warning
if the last arp_ip_target is removed while the arp_interval is on, but not
the arp_validate.

v2->v3:
Use _bh spinlock, remove useless rtnl_lock() and use jiffies for new
arp_ip_target last arp, instead of slave_last_rx(). On bond_enslave(),
use the same initialization value for target_last_arp_rx[] as is used
for the default last_arp_rx, to avoid useless interface flaps.

Also, instead of failing to remove the last arp_ip_target just print a
warning - otherwise it might break existing scripts.

v1->v2:
Correctly handle adding/removing hosts in arp_ip_target - we need to
shift/initialize all slave's target_last_arp_rx. Also, don't fail module
loading on arp_all_targets misconfiguration, just disable it, and some
minor style fixes.

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:58:38 -07:00
Veaceslav Falico d7d35c681f bonding: doc: some details on backup slave arp validation
Add some details to bonding documentation on how backup slave arp
validation works.

Signed-off-by: Veaceslav Falico <vfalico@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:58:38 -07:00
Cong Wang 7623757661 doc: fix some syntax errors in netlink mmap sample code
Cc: Patrick McHardy <kaber@trash.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-25 16:47:02 -07:00
Shan Wei a3c910d2e7 tcp: doc : fix the syncookies default value
syncookies is on for default since in commit e994b7c901
(tcp: Don't make syn cookies initial setting depend on CONFIG_SYSCTL).

And fix a typo of CONFIG_SYN_COOKIES.

Signed-off-by: Shan Wei <davidshan@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-24 00:26:26 -07:00
Cong Wang e3d73bcedf net: add doc for ip_early_demux sysctl
commit 6648bd7e0e (ipv4: Add sysctl knob to control
early socket demux) introduced such sysctl, but forgot to add
doc into Documentation/networking/ip-sysctl.txt. This patch adds it.

Basically I grab the doc from the description of commit 41063e9dd1
(ipv4: Early TCP socket demux.) and the above commit.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-12 15:10:13 -07:00
Rami Rosen e8b265e8ba doc:networking: Fix default value (icmp_ignore_bogus_error_responses).
This patch fixes icmp_ignore_bogus_error_responses default value to be 1 instead
of FALSE. It is initialized to 1 in icmp_sk_init().

Signed-off-by: Rami Rosen <ramirose@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 23:31:54 -07:00
Daniel Borkmann d70a3f887a doc: packet: simplify tpacket example code
This patch simplifies the tpacket_v3 example code a bit by getting rid
of unecessary macro wrappers, removing some debugging code so that it is
more to the point, and also adds a header comment. Now this example code
is the very minimum one needs to start from when dealing with tpacket_v3
and ~100 lines smaller than before.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-06-07 14:39:05 -07:00
Stefan Huber 3d9738aa41 Documentation/networking/ieee802154 fix a typo
Corrected the words Accsess to Access.

Signed-off-by: Stefan Huber <steffhip@googlemail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-06-05 16:24:59 +02:00
Anatol Pomozov f884ab15af doc: fix misspellings with 'codespell' tool
Signed-off-by: Anatol Pomozov <anatol.pomozov@gmail.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2013-05-28 12:02:12 +02:00
Cong Wang b1098bbe1b bonding: remove ifenslave.c from kernel source
As Stephen proposed:
Since bonding supports configuration via iproute (netlink) and sysfs, I think
it is time to purge the old ifenslave code out of Documentation/networking
and update the documentation.

Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Cong Wang <amwang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-27 23:34:46 -07:00
Masanari Iida 3dd17edea0 doc:networking: Fix typo in documentation/networking
Correct spelling typo

Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-27 23:29:18 -07:00
Willem de Bruijn 191cb1f21a rps: document flow limit in scaling.txt
Explain the mechanism and API of the recently merged
rps flow limit patch.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-05-23 18:54:44 -07:00
Daniel Borkmann 2940b26bec packet: doc: update timestamping part
Bring the timestamping section in sync with the implementation.

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-25 01:22:22 -04:00
Patrick McHardy 5683264c39 netlink: add documentation for memory mapped I/O
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-19 14:58:36 -04:00
Giuseppe CAVALLARO 49cfbf675c stmmac: review driver documentation
This patch reviews the driver documentation file;
for example, there were some new fields (in the driver
module parameter section) and the ptp files were
not documented.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-08 16:55:27 -04:00
Werner Almesberger 56aa091d60 ieee802154/nl-mac.c: make some MLME operations optional
Check for NULL before calling the following operations from "struct
ieee802154_mlme_ops": assoc_req, assoc_resp, disassoc_req, start_req,
and scan_req.

This fixes a current oops where those functions are called but not
implemented. It also updates the documentation to clarify that they
are now optional by design. If a call to an unimplemented function
is attempted, the kernel returns EOPNOTSUPP via netlink.

The following operations are still required: get_phy, get_pan_id,
get_short_addr, and get_dsn.

Note that the places where this patch changes the initialization
of "ret" should not affect the rest of the code since "ret" was
always set (again) before returning its value.

Signed-off-by: Werner Almesberger <werner@almesberger.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-04-08 12:00:16 -04:00
Daniel Borkmann 4eb0614825 doc: packet: add minimal TPACKET_V3 example code
Lost in space for a long time, but it finally came back to us from
some ancient code tombs. This patch adds a minimal runnable example
of Linux' packet mmap(2) from Chetan Loke's TPACKET_V3. Special
thanks to David S. Miller, and also Eric Leblond and Victor Julien!

Cc: Eric Leblond <eric@regit.org>
Cc: Victor Julien <victor@inliniac.net>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-30 17:40:27 -04:00
Giuseppe CAVALLARO 94fbbbf894 stmmac: update the Doc and Version (PTP+SGMII)
This patch updates the stmmac.txt file adding information related to the PTP
and SGMII/RGMII supports.

Also the patch updates the driver version to: March_2013.

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-26 12:53:37 -04:00
Yuchung Cheng e33099f96d tcp: implement RFC5682 F-RTO
This patch implements F-RTO (foward RTO recovery):

When the first retransmission after timeout is acknowledged, F-RTO
sends new data instead of old data. If the next ACK acknowledges
some never-retransmitted data, then the timeout was spurious and the
congestion state is reverted.  Otherwise if the next ACK selectively
acknowledges the new data, then the timeout was genuine and the
loss recovery continues. This idea applies to recurring timeouts
as well. While F-RTO sends different data during timeout recovery,
it does not (and should not) change the congestion control.

The implementaion follows the three steps of SACK enhanced algorithm
(section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Step 2 and
3 are in tcp_process_loss().  The basic version is not supported
because SACK enhanced version also works for non-SACK connections.

The new implementation is functionally in parity with the old F-RTO
implementation except the one case where it increases undo events:
In addition to the RFC algorithm, a spurious timeout may be detected
without sending data in step 2, as long as the SACK confirms not
all the original data are dropped. When this happens, the sender
will undo the cwnd and perhaps enter fast recovery instead. This
additional check increases the F-RTO undo events by 5x compared
to the prior implementation on Google Web servers, since the sender
often does not have new data to send for HTTP.

Note F-RTO may detect spurious timeout before Eifel with timestamps
does so.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-21 11:47:51 -04:00
Yuchung Cheng 9b44190dc1 tcp: refactor F-RTO
The patch series refactor the F-RTO feature (RFC4138/5682).

This is to simplify the loss recovery processing. Existing F-RTO
was developed during the experimental stage (RFC4138) and has
many experimental features.  It takes a separate code path from
the traditional timeout processing by overloading CA_Disorder
instead of using CA_Loss state. This complicates CA_Disorder state
handling because it's also used for handling dubious ACKs and undos.
While the algorithm in the RFC does not change the congestion control,
the implementation intercepts congestion control in various places
(e.g., frto_cwnd in tcp_ack()).

The new code implements newer F-RTO RFC5682 using CA_Loss processing
path.  F-RTO becomes a small extension in the timeout processing
and interfaces with congestion control and Eifel undo modules.
It lets congestion control (module) determines how many to send
independently.  F-RTO only chooses what to send in order to detect
spurious retranmission. If timeout is found spurious it invokes
existing Eifel undo algorithms like DSACK or TCP timestamp based
detection.

The first patch removes all F-RTO code except the sysctl_tcp_frto is
left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-21 11:47:50 -04:00
David S. Miller 61816596d1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull in the 'net' tree to get Daniel Borkmann's flow dissector
infrastructure change.

Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-20 12:46:26 -04:00
Julian Anastasov 0c12582fbc ipvs: add backup_only flag to avoid loops
Dmitry Akindinov is reporting for a problem where SYNs are looping
between the master and backup server when the backup server is used as
real server in DR mode and has IPVS rules to function as director.

Even when the backup function is enabled we continue to forward
traffic and schedule new connections when the current master is using
the backup server as real server. While this is not a problem for NAT,
for DR and TUN method the backup server can not determine if a request
comes from client or from director.

To avoid such loops add new sysctl flag backup_only. It can be needed
for DR/TUN setups that do not need backup and director function at the
same time. When the backup function is enabled we stop any forwarding
and pass the traffic to the local stack (real server mode). The flag
disables the director function when the backup function is enabled.

For setups that enable backup function for some virtual services and
director function for other virtual services there should be another
more complex solution to support DR/TUN mode, may be to assign
per-virtual service syncid value, so that we can differentiate the
requests.

Reported-by: Dmitry Akindinov <dimak@stalker.com>
Tested-by: German Myzovsky <lawyer@sipnet.ru>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2013-03-19 21:21:51 +09:00
Christoph Paasch 1a2c6181c4 tcp: Remove TCPCT
TCPCT uses option-number 253, reserved for experimental use and should
not be used in production environments.
Further, TCPCT does not fully implement RFC 6013.

As a nice side-effect, removing TCPCT increases TCP's performance for
very short flows:

Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
for files of 1KB size.

before this patch:
	average (among 7 runs) of 20845.5 Requests/Second
after:
	average (among 7 runs) of 21403.6 Requests/Second

Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-17 14:35:13 -04:00
Li RongQing b66c66dc5c Documentation: fix neigh/default/gc_thresh1 default value.
The default value is 128, not 256
	#grep gc_thresh1 net/ -rI
	net/decnet/dn_neigh.c:	.gc_thresh1 =			128,
	net/ipv6/ndisc.c:	.gc_thresh1 =	 128,
	net/ipv4/arp.c:	.gc_thresh1	= 128,

Signed-off-by: Li RongQing <roy.qing.li@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-15 09:12:23 -04:00
Nandita Dukkipati 6ba8a3b19e tcp: Tail loss probe (TLP)
This patch series implement the Tail loss probe (TLP) algorithm described
in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
first patch implements the basic algorithm.

TLP's goal is to reduce tail latency of short transactions. It achieves
this by converting retransmission timeouts (RTOs) occuring due
to tail losses (losses at end of transactions) into fast recovery.
TLP transmits one packet in two round-trips when a connection is in
Open state and isn't receiving any ACKs. The transmitted packet, aka
loss probe, can be either new or a retransmission. When there is tail
loss, the ACK from a loss probe triggers FACK/early-retransmit based
fast recovery, thus avoiding a costly RTO. In the absence of loss,
there is no change in the connection state.

PTO stands for probe timeout. It is a timer event indicating
that an ACK is overdue and triggers a loss probe packet. The PTO value
is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
ACK timer when there is only one oustanding packet.

TLP Algorithm

On transmission of new data in Open state:
  -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
  -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
  -> PTO = min(PTO, RTO)

Conditions for scheduling PTO:
  -> Connection is in Open state.
  -> Connection is either cwnd limited or no new data to send.
  -> Number of probes per tail loss episode is limited to one.
  -> Connection is SACK enabled.

When PTO fires:
  new_segment_exists:
    -> transmit new segment.
    -> packets_out++. cwnd remains same.

  no_new_packet:
    -> retransmit the last segment.
       Its ACK triggers FACK or early retransmit based recovery.

ACK path:
  -> rearm RTO at start of ACK processing.
  -> reschedule PTO if need be.

In addition, the patch includes a small variation to the Early Retransmit
(ER) algorithm, such that ER and TLP together can in principle recover any
N-degree of tail loss through fast recovery. TLP is controlled by the same
sysctl as ER, tcp_early_retrans sysctl.
tcp_early_retrans==0; disables TLP and ER.
		 ==1; enables RFC5827 ER.
		 ==2; delayed ER.
		 ==3; TLP and delayed ER. [DEFAULT]
		 ==4; TLP only.

The TLP patch series have been extensively tested on Google Web servers.
It is most effective for short Web trasactions, where it reduced RTOs by 15%
and improved HTTP response time (average by 6%, 99th percentile by 10%).
The transmitted probes account for <0.5% of the overall transmissions.

Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-12 08:30:34 -04:00
Jason Wang f422d2a04f net: docs: document multiqueue tuntap API
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-03-06 14:56:10 -05:00
Stephen Hemminger ca2eb5679f tcp: remove Appropriate Byte Count support
TCP Appropriate Byte Count was added by me, but later disabled.
There is no point in maintaining it since it is a potential source
of bugs and Linux already implements other better window protection
heuristics.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-02-05 14:51:16 -05:00