linux

Commit Graph

Author	SHA1	Message	Date
Jason Wang	e6fd07c899	tun: unbreak truncated packet signalling Commit `6680ec68ef` (tuntap: hardware vlan tx support) breaks the truncated packet signal by nev return a length greater than iov length in tun_put_user(). This patch fixes by always return the length of packet plus possible vlan header. Caller can detect the truncated packet by comparing the return value and the size of io length. Cc: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-12-11 15:23:06 -05:00
David S. Miller	bbd37626e6	net: Revert macvtap/tun truncation signalling changes. Jason Wang and Michael S. Tsirkin are still discussing how to properly fix this. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-12-10 22:10:21 -05:00
Jason Wang	923347bb83	tun: unbreak truncated packet signalling Commit `6680ec68ef` (tuntap: hardware vlan tx support) breaks the truncated packet signal by never return a length greater than iov length in tun_put_user(). This patch fixes this by always return the length of packet plus possible vlan header. Caller can detect the truncated packet by comparing the return value and the size of iov length. Reported-by: Vlad Yasevich <vyasevich@gmail.com> Cc: Vlad Yasevich <vyasevich@gmail.com> Cc: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-12-10 22:06:49 -05:00
Zhi Yong Wu	d0b7da8afa	tun: update file current position Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-12-06 12:42:14 -05:00
Jason Wang	96f8d9ecf2	tuntap: limit head length of skb allocated We currently use hdr_len as a hint of head length which is advertised by guest. But when guest advertise a very big value, it can lead to an 64K+ allocating of kmalloc() which has a very high possibility of failure when host memory is fragmented or under heavy stress. The huge hdr_len also reduce the effect of zerocopy or even disable if a gso skb is linearized in guest. To solves those issues, this patch introduces an upper limit (PAGE_SIZE) of the head, which guarantees an order 0 allocation each time. Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-11-14 16:05:27 -05:00
Michael S. Tsirkin	5c0c52c910	tun: don't look at current when non-blocking We play with a wait queue even if socket is non blocking. This is an obvious waste. Besides, it will prevent calling the non blocking variant when current is not valid. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-10-08 15:38:35 -04:00
Jason Wang	662ca437e7	tuntap: correctly handle error in tun_set_iff() Commit `c8d68e6be1` (tuntap: multiqueue support) only call free_netdev() on error in tun_set_iff(). This causes several issues: - memory of tun security were leaked - use after free since the flow gc timer was not deleted and the tfile were not detached This patch solves the above issues. Reported-by: Wannes Rombouts <wannes.rombouts@epitech.eu> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-09-12 17:21:42 -04:00
Jason Wang	7bf6630523	tuntap: orphan frags before trying to set tx timestamp sock_tx_timestamp() will clear all zerocopy flags of skb which may lead the frags never to be orphaned. This will break guest to guest traffic when zerocopy is enabled. Fix this by orphaning the frags before trying to set tx time stamp. The issue were introduced by commit `eda2977291` (tun: Support software transmit time stamping). Cc: Richard Cochran <richardcochran@gmail.com> Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-09-05 12:44:31 -04:00
Jason Wang	4bfb0513ff	tuntap: purge socket error queue on detach Commit `eda2977291` (tun: Support software transmit time stamping) will queue skbs into error queue when tx stamping is enabled. But it forgets to purge the error queue during detach. This patch fixes this. Cc: Richard Cochran <richardcochran@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-09-05 12:44:31 -04:00
Pavel Emelyanov	76975e9cb4	tun: Get skfilter layout The only thing we may have from tun device is the fprog, whic contains the number of filter elements and a pointer to (user-space) memory where the elements are. The program itself may not be available if the device is persistent and detached. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-21 12:21:45 -07:00
Pavel Emelyanov	849c9b6f93	tun: Allow to skip filter on attach There's a small problem with sk-filters on tun devices. Consider an application doing this sequence of steps: fd = open("/dev/net/tun"); ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" }); ioctl(fd, TUNATTACHFILTER, &my_filter); ioctl(fd, TUNSETPERSIST, 1); close(fd); At that point the tun0 will remain in the system and will keep in mind that there should be a socket filter at address '&my_filter'. If after that we do fd = open("/dev/net/tun"); ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" }); we most likely receive the -EFAULT error, since tun_attach() would try to connect the filter back. But (!) if we provide a filter at address &my_filter, then tun0 will be created and the "new" filter would be attached, but application may not know about that. This may create certain problems to anyone using tun-s, but it's critical problem for c/r -- if we meet a persistent tun device with a filter in mind, we will not be able to attach to it to dump its state (flags, owner, address, vnethdr size, etc.). The proposal is to allow to attach to tun device (with TUNSETIFF) w/o attaching the filter to the tun-file's socket. After this attach app may e.g clean the device by dropping the filter, it doesn't want to have one, or (in case of c/r) get information about the device with tun ioctls. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-21 12:21:45 -07:00
Pavel Emelyanov	3d407a80b6	tun: Report whether the queue is attached or not Multiqueue tun devices allow to attach and detach from its queues while keeping the interface itself set on file. Knowing this is critical for the checkpoint part of criu project. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-21 12:21:45 -07:00
Pavel Emelyanov	fb7589a162	tun: Add ability to create tun device with given index Tun devices cannot be created with ifidex user wants, but it's required by checkpoint-restore project. Long time ago such ability was implemented for rtnl_ops-based interface for creating links (`9c7dafbf` net: Allow to create links with given ifindex), but the only API for creating and managing tuntap devices is ioctl-based and is evolving with adding new ones (`cde8b15f` tuntap: add ioctl to attach or detach a file form tuntap device). Following that trend, here's how a new ioctl that sets the ifindex for device, that _will_ be created by TUNSETIFF ioctl looks like. So those who want a tuntap device with the ifindex N, should open the tun device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF. If the index N is busy, then the register_netdev will find this out and the ioctl would be failed with -EBUSY. If setifindex is not called, then it will be generated as before. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-21 12:21:45 -07:00
David S. Miller	2ff1cf12c9	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2013-08-16 15:37:26 -07:00
Dan Carpenter	15718ea0d8	tun: signedness bug in tun_get_user() The recent fix `d9bf5f1309` "tun: compare with 0 instead of total_len" is not totally correct. Because "len" and "sizeof()" are size_t type, that means they are never less than zero. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-15 14:51:23 -07:00
Weiping Pan	d9bf5f1309	tun: compare with 0 instead of total_len Since we set "len = total_len" in the beginning of tun_get_user(), so we should compare the new len with 0, instead of total_len, or the if statement always returns false. Signed-off-by: Weiping Pan <wpan@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-13 19:29:08 -07:00
Eric Dumazet	28d6427109	net: attempt high order allocations in sock_alloc_send_pskb() Adding paged frags skbs to af_unix sockets introduced a performance regression on large sends because of additional page allocations, even if each skb could carry at least 100% more payload than before. We can instruct sock_alloc_send_pskb() to attempt high order allocations. Most of the time, it does a single page allocation instead of 8. I added an additional parameter to sock_alloc_send_pskb() to let other users to opt-in for this new feature on followup patches. Tested: Before patch : $ netperf -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 2304 212992 212992 10.00 46861.15 After patch : $ netperf -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 2304 212992 212992 10.00 57981.11 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-10 01:16:44 -07:00
Jason Wang	c3bdeb5c7c	net: move zerocopy_sg_from_iovec() to net/core/datagram.c To let it be reused and reduce code duplication. Also document this function. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-07 16:52:34 -07:00
Jason Wang	b4bf07771f	net: move iov_pages() to net/core/iovec.c To let it be reused and reduce code duplication. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-08-07 16:52:33 -07:00
Jason Wang	6680ec68ef	tuntap: hardware vlan tx support Inspired by commit `f09e2249c4` (macvtap: restore vlan header on user read). This patch adds hardware vlan tx support for tuntap. This is done by copying vlan header directly into userspace in tun_put_user() instead of doing it through __vlan_put_tag() in dev_hard_start_xmit(). This eliminates one unnecessary memmove() in vlan_insert_tag() for 802.1ad and 802.1q traffic. pktgen test shows about 20% improvement for 802.1q traffic: Before: 662149pps 317Mb/sec (317831520bps) errors: 0 After: 801033pps 384Mb/sec (384495840bps) errors: 0 Cc: Basil Gor <basil.gor@gmail.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-07-27 20:09:21 -07:00
Richard Cochran	eda2977291	tun: Support software transmit time stamping. This patch adds transmit time stamping to the tun/tap driver. Similar support already exists for UDP, can, and raw packets. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-07-22 14:58:19 -07:00
Jason Wang	885291761d	tuntap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS We try to linearize part of the skb when the number of iov is greater than MAX_SKB_FRAGS. This is not enough since each single vector may occupy more than one pages, so zerocopy_sg_fromiovec() may still fail and may break the guest network. Solve this problem by calculate the pages needed for iov before trying to do zerocopy and switch to use copy instead of zerocopy if it needs more than MAX_SKB_FRAGS. This is done through introducing a new helper to count the pages for iov, and call uarg->callback() manually when switching from zerocopy to copy to notify vhost. We can do further optimization on top. The bug were introduced from commit `0690899b4d` (tun: experimental zero copy tx support) Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-07-18 13:04:25 -07:00
Jason Wang	3dd5c3308e	tuntap: correctly linearize skb when zerocopy is used Userspace may produce vectors greater than MAX_SKB_FRAGS. When we try to linearize parts of the skb to let the rest of iov to be fit in the frags, we need count copylen into linear when calling tun_alloc_skb() instead of partly counting it into data_len. Since this breaks zerocopy_sg_from_iovec() since its inner counter assumes nr_frags should be zero at beginning. This cause nr_frags to be increased wrongly without setting the correct frags. This bug were introduced from `0690899b4d` (tun: experimental zero copy tx support) Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-07-10 18:45:52 -07:00
David S. Miller	0c1072ae02	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/freescale/fec_main.c drivers/net/ethernet/renesas/sh_eth.c net/ipv4/gre.c The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list) and the splitting of the gre.c code into seperate files. The FEC conflict was two sets of changes adding ethtool support code in an "!CONFIG_M5272" CPP protected block. Finally the sh_eth.c conflict was between one commit add bits set in the .eesr_err_check mask whilst another commit removed the .tx_error_check member and assignments. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-07-03 14:55:13 -07:00
Michael S. Tsirkin	7e24bfbe43	tun: fix recovery from gup errors get user pages might fail partially in tun zero copy mode. To recover we need to put all pages that we got, but code used a wrong index resulting in double-free errors. Reported-by: Brad Hubbard <bhubbard@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-25 16:16:45 -07:00
David S. Miller	d98cae64e4	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/wireless/ath/ath9k/Kconfig drivers/net/xen-netback/netback.c net/batman-adv/bat_iv_ogm.c net/wireless/nl80211.c The ath9k Kconfig conflict was a change of a Kconfig option name right next to the deletion of another option. The xen-netback conflict was overlapping changes involving the handling of the notify list in xen_netbk_rx_action(). Batman conflict resolution provided by Antonio Quartulli, basically keep everything in both conflict hunks. The nl80211 conflict is a little more involved. In 'net' we added a dynamic memory allocation to nl80211_dump_wiphy() to fix a race that Linus reported. Meanwhile in 'net-next' the handlers were converted to use pre and post doit handlers which use a flag to determine whether to hold the RTNL mutex around the operation. However, the dump handlers to not use this logic. Instead they have to explicitly do the locking. There were apparent bugs in the conversion of nl80211_dump_wiphy() in that we were not dropping the RTNL mutex in all the return paths, and it seems we very much should be doing so. So I fixed that whilst handling the overlapping changes. To simplify the initial returns, I take the RTNL mutex after we try to allocate 'tb'. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-19 16:49:39 -07:00
Pavel Emelyanov	944a1376b6	tun: Turn tun_flow_init() into void fn This routine doesn't fail since `9fdc6bef` (tuntap: dont use a private kmem_cache) so it makes sense to compact the code a little bit. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-12 15:08:14 -07:00
Pavel Emelyanov	274038f8c9	tun: Report "persist" flag to userspace The TUN_PERSIST flag is not reported at all -- both TUNGETIFF, and sysfs "flags" attribute skip one. Knowing whether a device is persistent or not is critical for checkpoint-restore, thus I propose to add the read-only IFF_PERSIST one for this. Setting this new IFF_PERSIST is hardly possible, as TUNSETIFF doesn't check for unknown flags being zero and thus there can be trash. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-12 15:07:21 -07:00
Jason Wang	19a6afb23e	tuntap: set SOCK_ZEROCOPY flag during open Commit `54f968d6ef` (tuntap: move socket to tun_file) forgets to set SOCK_ZEROCOPY flag, which will prevent vhost_net from doing zercopy w/ tap. This patch fixes this by setting it during file open. Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-12 00:44:35 -07:00
Jason Wang	92bb73ea2c	tuntap: fix a possible race between queue selection and changing queues Complier may generate codes that re-read the tun->numqueues during tun_select_queue(). This may be a race if vlan->numqueues were changed in the same time and can lead unexpected result (e.g. very huge value). We need prevent the compiler from generating such codes by adding an ACCESS_ONCE() to make sure tun->numqueues were only read once. Bug were introduced by commit `c8d68e6be1` (tuntap: multiqueue support). Reported-by: Michael S. Tsirkin <mst@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-06-10 14:32:47 -07:00
Jason Wang	8e6d91ae09	tuntap: forbid changing mq flag for persistent device We currently allow changing the mq flag (IFF_MULTI_QUEUE) for a persistent device. This will result a mismatch between the number the queues in netdev and tuntap. This is because we only allocate a 1q netdevice when IFF_MULTI_QUEUE was not specified, so when we set the IFF_MULTI_QUEUE and try to attach more queues later, netif_set_real_num_tx_queues() may fail which result a single queue netdevice with multiple sockets attached. Solve this by disallowing changing the mq flag for persistent device. Bug was introduced by commit `edfb6a148c` (tuntap: reduce memory using of queues). Reported-by: Sriram Narasimhan <sriram.narasimhan@hp.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-05-29 00:21:32 -07:00
David S. Miller	58717686cf	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c drivers/net/ethernet/emulex/benet/be.h include/net/tcp.h net/mac802154/mac802154.h Most conflicts were minor overlapping stuff. The be2net driver brought in some fixes that added __vlan_put_tag calls, which in net-next take an additional argument. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-30 03:55:20 -04:00
Gao feng	3811ae76bc	net: tun: release the reference of tun device in tun_recvmsg We forget to release the reference of tun device in tun_recvmsg. bug introduced in commit `54f968d6ef` (tuntap: move socket to tun_file) Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-29 11:06:37 -04:00
Jason Wang	e8dbad66ef	tuntap: correct the return value in tun_set_iff() commit (`3be8fbab` tuntap: fix error return code in tun_set_iff()) breaks the creation of multiqueue tuntap since it forbids to create more than one queues for a multiqueue tuntap device. We need return 0 instead -EBUSY here since we don't want to re-initialize the device when one or more queues has been already attached. Add a comment and correct the return value to zero. Reported-by: Jerry Chu <hkchu@google.com> Cc: Jerry Chu <hkchu@google.com> Cc: Wei Yongjun <weiyj.lk@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Jerry Chu <hkchu@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-25 01:48:23 -04:00
David S. Miller	6e0895c2ea	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/emulex/benet/be_main.c drivers/net/ethernet/intel/igb/igb_main.c drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c include/net/scm.h net/batman-adv/routing.c net/ipv4/tcp_input.c The e{uid,gid} --> {uid,gid} credentials fix conflicted with the cleanup in net-next to now pass cred structs around. The be2net driver had a bug fix in 'net' that overlapped with the VLAN interface changes by Patrick McHardy in net-next. An IGB conflict existed because in 'net' the build_skb() support was reverted, and in 'net-next' there was a comment style fix within that code. Several batman-adv conflicts were resolved by making sure that all calls to batadv_is_my_mac() are changed to have a new bat_priv first argument. Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO rewrite in 'net-next', mostly overlapping changes. Thanks to Stephen Rothwell and Antonio Quartulli for help with several of these merge resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-22 20:32:51 -04:00
Wei Yongjun	3be8fbab18	tuntap: fix error return code in tun_set_iff() Fix to return a negative error code from the error handling case instead of 0, as returned elsewhere in this function. [ Bug added in linux-3.8 , commit `4008e97f86` ("tuntap: fix ambigious multiqueue API") ] Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-12 15:00:04 -04:00
Jason Wang	c0317998c3	tuntap: initialize vlan_features The vlan_features was zero which prevents vlan GSO packets to be transmitted to userspace. This is suboptimal so enable this by initialize vlan_features for tuntap. Netperf shows better performance of guest receiving since vlan TSO works for tuntap: before: netperf -H 192.168.5.4 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.4 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.01 2786.67 after: netperf -H 192.168.5.4 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.4 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 8085.49 Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-04-11 16:21:57 -04:00
Jason Wang	40893fd0fd	net: switch to use skb_probe_transport_header() Switch to use the new help skb_probe_transport_header() to do the l4 header probing for untrusted sources. For packets with partial csum, the header should already been set by skb_partial_csum_set(). Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-27 12:48:31 -04:00
Jason Wang	38502af77e	tuntap: set transport header before passing it to kernel Currently, for the packets receives from tuntap, before doing header check, kernel just reset the transport header in netif_receive_skb() which pretends no l4 header. This is suboptimal for precise packet length estimation (introduced in `1def9238`) which needs correct l4 header for gso packets. So this patch set the transport header to csum_start for partial checksum packets, otherwise it first try skb_flow_dissect(), if it fails, just reset the transport header. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-26 12:44:43 -04:00
Wei Yongjun	f7de0b9368	tuntap: remove unused variable in __tun_detach() The variable dev is initialized but never used otherwise, so remove the unused variable. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-13 11:31:58 -04:00
Eric Dumazet	f8af75f351	tun: add a missing nf_reset() in tun_net_xmit() Dave reported following crash : general protection fault: 0000 [#1] SMP CPU 2 Pid: 25407, comm: qemu-kvm Not tainted 3.7.9-205.fc18.x86_64 #1 Hewlett-Packard HP Z400 Workstation/0B4Ch RIP: 0010:[<ffffffffa0399bd5>] [<ffffffffa0399bd5>] destroy_conntrack+0x35/0x120 [nf_conntrack] RSP: 0018:ffff880276913d78 EFLAGS: 00010206 RAX: 50626b6b7876376c RBX: ffff88026e530d68 RCX: ffff88028d158e00 RDX: ffff88026d0d5470 RSI: 0000000000000011 RDI: 0000000000000002 RBP: ffff880276913d88 R08: 0000000000000000 R09: ffff880295002900 R10: 0000000000000000 R11: 0000000000000003 R12: ffffffff81ca3b40 R13: ffffffff8151a8e0 R14: ffff880270875000 R15: 0000000000000002 FS: 00007ff3bce38a00(0000) GS:ffff88029fc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fd1430bd000 CR3: 000000027042b000 CR4: 00000000000027e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process qemu-kvm (pid: 25407, threadinfo ffff880276912000, task ffff88028c369720) Stack: ffff880156f59100 ffff880156f59100 ffff880276913d98 ffffffff815534f7 ffff880276913db8 ffffffff8151a74b ffff880270875000 ffff880156f59100 ffff880276913dd8 ffffffff8151a5a6 ffff880276913dd8 ffff88026d0d5470 Call Trace: [<ffffffff815534f7>] nf_conntrack_destroy+0x17/0x20 [<ffffffff8151a74b>] skb_release_head_state+0x7b/0x100 [<ffffffff8151a5a6>] __kfree_skb+0x16/0xa0 [<ffffffff8151a666>] kfree_skb+0x36/0xa0 [<ffffffff8151a8e0>] skb_queue_purge+0x20/0x40 [<ffffffffa02205f7>] __tun_detach+0x117/0x140 [tun] [<ffffffffa022184c>] tun_chr_close+0x3c/0xd0 [tun] [<ffffffff8119669c>] __fput+0xec/0x240 [<ffffffff811967fe>] ____fput+0xe/0x10 [<ffffffff8107eb27>] task_work_run+0xa7/0xe0 [<ffffffff810149e1>] do_notify_resume+0x71/0xb0 [<ffffffff81640152>] int_signal+0x12/0x17 Code: 00 00 04 48 89 e5 41 54 53 48 89 fb 4c 8b a7 e8 00 00 00 0f 85 de 00 00 00 0f b6 73 3e 0f b7 7b 2a e8 10 40 00 00 48 85 c0 74 0e <48> 8b 40 28 48 85 c0 74 05 48 89 df ff d0 48 c7 c7 08 6a 3a a0 RIP [<ffffffffa0399bd5>] destroy_conntrack+0x35/0x120 [nf_conntrack] RSP <ffff880276913d78> This is because tun_net_xmit() needs to call nf_reset() before queuing skb into receive_queue Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-03-06 16:05:00 -05:00
Sasha Levin	b67bfe0d42	hlist: drop the node parameter from iterators I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S \| hlist_for_each_entry_continue(a, - b, c) S \| hlist_for_each_entry_from(a, - b, c) S \| hlist_for_each_entry_rcu(a, - b, c, d) S \| hlist_for_each_entry_rcu_bh(a, - b, c, d) S \| hlist_for_each_entry_continue_rcu_bh(a, - b, c) S \| for_each_busy_worker(a, c, - b, d) S \| ax25_uid_for_each(a, - b, c) S \| ax25_for_each(a, - b, c) S \| inet_bind_bucket_for_each(a, - b, c) S \| sctp_for_each_hentry(a, - b, c) S \| sk_for_each(a, - b, c) S \| sk_for_each_rcu(a, - b, c) S \| sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S \| sk_for_each_safe(a, - b, c, d) S \| sk_for_each_bound(a, - b, c) S \| hlist_for_each_entry_safe(a, - b, c, d, e) S \| hlist_for_each_entry_continue_rcu(a, - b, c) S \| nr_neigh_for_each(a, - b, c) S \| nr_neigh_for_each_safe(a, - b, c, d) S \| nr_node_for_each(a, - b, c) S \| nr_node_for_each_safe(a, - b, c, d) S \| - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S \| - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S \| for_each_host(a, - b, c) S \| for_each_host_safe(a, - b, c, d) S \| for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2013-02-27 19:10:24 -08:00
Pravin B Shelar	c9af6db4c1	net: Fix possible wrong checksum generation. Patch `cef401de7b` (net: fix possible wrong checksum generation) fixed wrong checksum calculation but it broke TSO by defining new GSO type but not a netdev feature for that type. net_gso_ok() would not allow hardware checksum/segmentation offload of such packets without the feature. Following patch fixes TSO and wrong checksum. This patch uses same logic that Eric Dumazet used. Patch introduces new flag SKBTX_SHARED_FRAG if at least one frag can be modified by the user. but SKBTX_SHARED_FRAG flag is kept in skb shared info tx_flags rather than gso_type. tx_flags is better compared to gso_type since we can have skb with shared frag without gso packet. It does not link SHARED_FRAG to GSO, So there is no need to define netdev feature for this. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-13 13:30:10 -05:00
David S. Miller	188d1f76d0	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Conflicts: drivers/net/ethernet/intel/e1000e/ethtool.c drivers/net/vmxnet3/vmxnet3_drv.c drivers/net/wireless/iwlwifi/dvm/tx.c net/ipv6/route.c The ipv6 route.c conflict is simple, just ignore the 'net' side change as we fixed the same problem in 'net-next' by eliminating cached neighbours from ipv6 routes. The e1000e conflict is an addition of a new statistic in the ethtool code, trivial. The vmxnet3 conflict is about one change in 'net' removing a guarding conditional, whilst in 'net-next' we had a netdev_info() conversion. The iwlwifi conflict is dealing with a WARN_ON() conversion in 'net-next' vs. a revert happening in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-02-05 14:12:20 -05:00
Jason Wang	9e85722d58	tuntap: allow polling/writing/reading when detached We forbid polling, writing and reading when the file were detached, this may complex the user in several cases: - when guest pass some buffers to vhost/qemu and then disable some queues, host/qemu needs to do its own cleanup on those buffers which is complex sometimes. We can do this simply by allowing a user can still write to an disabled queue. Write to an disabled queue will cause the packet pass to the kernel and read will get nothing. - align the polling behavior with macvtap which never fails when the queue is created. This can simplify the polling errors handling of its user (e.g vhost) We can simply achieve this by don't assign NULL to tfile->tun when detached. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-29 15:43:04 -05:00
Michael S. Tsirkin	af668b3c27	tun: fix carrier on/off status Commit `c8d68e6be1` removed carrier off call from tun_detach since it's now called on queue disable and not only on tun close. This confuses userspace which used this flag to detect a free tun. To fix, put this back but under if (clean). Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Jason Wang <jasowang@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Tested-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-29 15:43:03 -05:00
David S. Miller	f1e7b73acc	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Bring in the 'net' tree so that we can get some ipv4/ipv6 bug fixes that some net-next work will build upon. Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-29 15:32:13 -05:00
Eric Dumazet	cef401de7b	net: fix possible wrong checksum generation Pravin Shelar mentioned that GSO could potentially generate wrong TX checksum if skb has fragments that are overwritten by the user between the checksum computation and transmit. He suggested to linearize skbs but this extra copy can be avoided for normal tcp skbs cooked by tcp_sendmsg(). This patch introduces a new SKB_GSO_SHARED_FRAG flag, set in skb_shinfo(skb)->gso_type if at least one frag can be modified by the user. Typical sources of such possible overwrites are {vm}splice(), sendfile(), and macvtap/tun/virtio_net drivers. Tested: $ netperf -H 7.7.8.84 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3959.52 $ netperf -H 7.7.8.84 -t TCP_SENDFILE TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3216.80 Performance of the SENDFILE is impacted by the extra allocation and copy, and because we use order-0 pages, while the TCP_STREAM uses bigger pages. Reported-by: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-28 00:27:15 -05:00
Jason Wang	b8732fb7f8	tuntap: limit the number of flow caches We create new flow caches when a new flow is identified by tuntap, This may lead some issues: - userspace may produce a huge amount of short live flows to exhaust host memory - the unlimited number of flow caches may produce a long list which increase the time in the linear searching Solve this by introducing a limit of total number of flow caches. Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-23 13:47:06 -05:00
Jason Wang	edfb6a148c	tuntap: reduce memory using of queues A MAX_TAP_QUEUES(1024) queues of tuntap device is always allocated unconditionally even userspace only requires a single queue device. This is unnecessary and will lead a very high order of page allocation when has a high possibility to fail. Solving this by creating a one queue net device when userspace only use one queue and also reduce MAX_TAP_QUEUES to DEFAULT_MAX_NUM_RSS_QUEUES which can guarantee the success of the allocation. Reported-by: Dirk Hohndel <dirk@hohndel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>	2013-01-23 13:47:06 -05:00

1 2 3 4 5

234 Commits