Commit Graph

26714 Commits

Author SHA1 Message Date
Linus Torvalds
cb62ab71fe Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking changes from David Miller:

 1) Get rid of the error prone NLA_PUT*() macros that used an embedded
    goto.

 2) Kill off the token-ring and MCA networking drivers, from Paul
    Gortmaker.

 3) Reduce high-order allocations made by datagram AF_UNIX sockets, from
    Eric Dumazet.

 4) Add PTP hardware clock support to IGB and IXGBE, from Richard
    Cochran and Jacob Keller.

 5) Allow users to query timestamping capabilities of a card via
    ethtool, from Richard Cochran.

 6) Add loadbalance mode to the teaming driver, from Jiri Pirko.  Part
    of this is that we can now have BPF filters not attached to sockets,
    and the loadbalancing function is calculated using one.

 7) Francois Romieu went through the network drivers removing gratuitous
    uses of netdev->base_addr, perhaps some day we can remove it
    completely but it's used for ISA probing still.

 8) Add a BPF JIT for sparc.  I know, who cares, right? :-)

 9) Move networking sysctl registry away from using the compatability
    mode interfaces in the sysctl code.  From Eric W Biederman.

10) Pavel Emelyanov added a way to save and restore TCP socket state via
    TCP_REPAIR, TCP_REPAIR_QUEUE, and TCP_QUEUE_SEQ socket options as
    well as a way to forcefully bind a socket to a port via the
    sk->sk_reuse value SK_FORCE_REUSE.  There is also a
    TCP_REPAIR_OPTIONS which allows to reinstante the TCP options
    enabled on the connection.

11) Several enhancements from Eric Dumazet that, in particular, can
    enhance splice performance on TCP sockets significantly.

     a) Reset the offset of the per-socket sendmsg page when we know
        we're the only use of the page in linear_to_page().

     b) Add facilities such that skb->data can be backed a page rather
        than SLAB kmalloc'd memory.  In particular devices which were
        receiving into linear RX buffers can now end up providing paged
        data.

    The big result is that code like splice and GRO do not have to copy
    any more.

12) Allow a pure sender to more gracefully handle ACK backlogs in TCP.
    What can happen at high rates is that the sender hasn't grown his
    receive buffer limits at all (he's not receiving data so really
    doesn't need to), but the non-data ACKs consume receive buffer
    space.

    sk_add_backlog() is too aggressive in dropping frames in this case,
    so relax it's requirements by using the receive buffer plus the send
    buffer limit as the backlog limit instead of just the former.

    Also from Eric Dumazet.

13) Add ipv6 support to L2TP, from Benjamin LaHaise, James Chapman, and
    Chris Elston.

14) Implement TCP early retransmit (RFC 5827), from Yuchung Cheng.
    Basically, we can start fast retransmit before hiting the dupack
    threshold under certain conditions.

15) New CODEL active queue management packet scheduler, from Eric
    Dumazet based upon initial work by Dave Taht.

    Basically, the big feature is that packets are dropped (or ECN bits
    are set) based upon how long packets live in the queue, rather than
    the queue length (which is what RED uses).

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1341 commits)
  drivers/net/stmmac: seq_file fix memory leak
  ipv6/exthdrs: strict Pad1 and PadN check
  USB: qmi_wwan: Add ZTE (Vodafone) K3520-Z
  USB: qmi_wwan: Add ZTE (Vodafone) K3765-Z
  USB: qmi_wwan: Make forced int 4 whitelist generic
  net/ipv4: replace simple_strtoul with kstrtoul
  net/ipv4/ipconfig: neaten __setup placement
  net: qmi_wwan: Add Vodafone/Huawei K5005 support
  net: cdc_ether: Add ZTE WWAN matches before generic Ethernet
  ipv6: use skb coalescing in reassembly
  ipv4: use skb coalescing in defragmentation
  net: introduce skb_try_coalesce()
  net:ipv6:fixed space issues relating to operators.
  net:ipv6:fixed a trailing white space issue.
  ipv6: disable GSO on sockets hitting dst_allfrag
  tg3: use netdev_alloc_frag() API
  net: napi_frags_skb() is static
  ppp: avoid false drop_monitor false positives
  ipv6: bool/const conversions phase2
  ipx: Remove spurious NULL checking in ipx_ioctl().
  ...
2012-05-21 10:03:46 -07:00
Linus Torvalds
31ed8e6f93 Merge branch 'dentry-cleanups' (dcache access cleanups and optimizations)
This branch simplifies and clarifies the dcache lookup, and allows us to
do certain nice optimizations when comparing dentries.  It also cleans
up the interface to __d_lookup_rcu(), especially around passing the
inode information around.

* dentry-cleanups:
  vfs: make it possible to access the dentry hash/len as one 64-bit entry
  vfs: move dentry name length comparison from dentry_cmp() into callers
  vfs: do the careful dentry name access for all dentry_cmp cases
  vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu
  vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces
2012-05-21 08:50:57 -07:00
Linus Torvalds
7e5cb5e151 Merge branch 'vfs-cleanups' (random vfs cleanups)
This teaches vfs_fstat() to use the appropriate f[get|put]_light
functions, allowing it to avoid some unnecessary locking for the common
case.

More noticeably, it also cleans up and simplifies the "getname_flags()"
function, which now relies on the architecture strncpy_from_user() doing
all the user access checks properly, instead of hacking around the fact
that on x86 it didn't use to do it right (see commit 92ae03f2ef: "x86:
merge 32/64-bit versions of 'strncpy_from_user()' and speed it up").

* vfs-cleanups:
  VFS: make vfs_fstat() use f[get|put]_light()
  VFS: clean up and simplify getname_flags()
  x86: make word-at-a-time strncpy_from_user clear bytes at the end
2012-05-21 08:46:08 -07:00
Linus Torvalds
8c12fec90c Merge branch 'stat-cleanups' (clean up copying of stat info to user space)
This makes cp_new_stat() a bit more readable, and avoids having to
memset() the whole structure just to fill in a couple of padding fields.

This is another result of me looking at code generation of functions
that show up high on certain kernel profiles, and just going "Oh, let's
just clean that up".

Architectures that don't supply the #define to fill just the padding
fields will still fall back to memset().

* stat-cleanups:
  vfs: don't force a big memset of stat data just to clear padding fields
  vfs: de-crapify "cp_new_stat()" function
2012-05-21 08:41:38 -07:00
David S. Miller
17eea0df5f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-20 21:53:04 -04:00
Linus Torvalds
14e931a264 Merge branch 'for-linus' of git://git.kernel.dk/linux-block
Pull block layer fixes from Jens Axboe:
 "A few small, but important fixes.  Most of them are marked for stable
  as well

   - Fix failure to release a semaphore on error path in mtip32xx.
   - Fix crashable condition in bio_get_nr_vecs().
   - Don't mark end-of-disk buffers as mapped, limit it to i_size.
   - Fix for build problem with CONFIG_BLOCK=n on arm at least.
   - Fix for a buffer overlow on UUID partition printing.
   - Trivial removal of unused variables in dac960."

* 'for-linus' of git://git.kernel.dk/linux-block:
  block: fix buffer overflow when printing partition UUIDs
  Fix blkdev.h build errors when BLOCK=n
  bio allocation failure due to bio_get_nr_vecs()
  block: don't mark buffers beyond end of disk as mapped
  mtip32xx: release the semaphore on an error path
  dac960: Remove unused variables from DAC960_CreateProcEntries()
2012-05-19 10:12:17 -07:00
Linus Torvalds
73f1f5dd3e Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc fixes from Andrew Morton.

* emailed from Andrew Morton <akpm@linux-foundation.org>: (4 patches)
  frv: delete incorrect task prototypes causing compile fail
  slub: missing test for partial pages flush work in flush_all()
  fs, proc: fix ABBA deadlock in case of execution attempt of map_files/ entries
  drivers/rtc/rtc-pl031.c: configure correct wday for 2000-01-01
2012-05-18 15:56:25 -07:00
Linus Torvalds
30a08bf2d3 proc: move fd symlink i_mode calculations into tid_fd_revalidate()
Instead of doing the i_mode calculations at proc_fd_instantiate() time,
move them into tid_fd_revalidate(), which is where the other inode state
(notably uid/gid information) is updated too.

Otherwise we'll end up with stale i_mode information if an fd is re-used
while the dentry still hangs around.  Not that anything really *cares*
(symlink permissions don't really matter), but Tetsuo Handa noticed that
the owner read/write bits don't always match the state of the
readability of the file descriptor, and we _used_ to get this right a
long time ago in a galaxy far, far away.

Besides, aside from fixing an ugly detail (that has apparently been this
way since commit 61a2878402: "proc: Remove the hard coded inode
numbers" in 2006), this removes more lines of code than it adds.  And it
just makes sense to update i_mode in the same place we update i_uid/gid.

Al Viro correctly points out that we could just do the inode fill in the
inode iops ->getattr() function instead.  However, that does require
somewhat slightly more invasive changes, and adds yet *another* lookup
of the file descriptor.  We need to do the revalidate() for other
reasons anyway, and have the file descriptor handy, so we might as well
fill in the information at this point.

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-18 14:06:17 -07:00
Cyrill Gorcunov
eb94cd96e0 fs, proc: fix ABBA deadlock in case of execution attempt of map_files/ entries
map_files/ entries are never supposed to be executed, still curious
minds might try to run them, which leads to the following deadlock

  ======================================================
  [ INFO: possible circular locking dependency detected ]
  3.4.0-rc4-24406-g841e6a6 #121 Not tainted
  -------------------------------------------------------
  bash/1556 is trying to acquire lock:
   (&sb->s_type->i_mutex_key#8){+.+.+.}, at: do_lookup+0x267/0x2b1

  but task is already holding lock:
   (&sig->cred_guard_mutex){+.+.+.}, at: prepare_bprm_creds+0x2d/0x69

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&sig->cred_guard_mutex){+.+.+.}:
         validate_chain+0x444/0x4f4
         __lock_acquire+0x387/0x3f8
         lock_acquire+0x12b/0x158
         __mutex_lock_common+0x56/0x3a9
         mutex_lock_killable_nested+0x40/0x45
         lock_trace+0x24/0x59
         proc_map_files_lookup+0x5a/0x165
         __lookup_hash+0x52/0x73
         do_lookup+0x276/0x2b1
         walk_component+0x3d/0x114
         do_last+0xfc/0x540
         path_openat+0xd3/0x306
         do_filp_open+0x3d/0x89
         do_sys_open+0x74/0x106
         sys_open+0x21/0x23
         tracesys+0xdd/0xe2

  -> #0 (&sb->s_type->i_mutex_key#8){+.+.+.}:
         check_prev_add+0x6a/0x1ef
         validate_chain+0x444/0x4f4
         __lock_acquire+0x387/0x3f8
         lock_acquire+0x12b/0x158
         __mutex_lock_common+0x56/0x3a9
         mutex_lock_nested+0x40/0x45
         do_lookup+0x267/0x2b1
         walk_component+0x3d/0x114
         link_path_walk+0x1f9/0x48f
         path_openat+0xb6/0x306
         do_filp_open+0x3d/0x89
         open_exec+0x25/0xa0
         do_execve_common+0xea/0x2f9
         do_execve+0x43/0x45
         sys_execve+0x43/0x5a
         stub_execve+0x6c/0xc0

This is because prepare_bprm_creds grabs task->signal->cred_guard_mutex
and when do_lookup happens we try to grab task->signal->cred_guard_mutex
again in lock_trace.

Fix it using plain ptrace_may_access() helper in proc_map_files_lookup()
and in proc_map_files_readdir() instead of lock_trace(), the caller must
be CAP_SYS_ADMIN granted anyway.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reported-by: Sasha Levin <levinsasha928@gmail.com>
Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Dave Jones <davej@redhat.com>
Cc: Vasiliy Kulikov <segoon@openwall.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-17 18:00:51 -07:00
David S. Miller
028940342a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2012-05-16 22:17:37 -04:00
Linus Torvalds
dfae359f08 Merge git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fix from Jeff Layton

* git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix misspelling of "forcedirectio"
2012-05-16 14:22:38 -07:00
Jeff Layton
531c8ff0d4 cifs: fix misspelling of "forcedirectio"
...and add a "directio" synonym since that's what the manpage has
always advertised.

Acked-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-05-16 11:26:25 -05:00
Linus Torvalds
9ff00d58a9 Three fixes for 3.4:
- Fix a lock ordering deadlock in JFFS2
  - Fix an oops in the dataflash driver, triggered by a dummy call to test
    whether it has OTP functionality.
  - Fix request_mem_region() failure on amsdelta NAND driver.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iEYEABECAAYFAk+vekgACgkQdwG7hYl686N8bQCfdizsFrliKbDW20R/pO66NoAV
 aloAn0ln+mwe3rIdNt8qKynW8e8dbudF
 =R7XS
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-3.4-20120513' of git://git.infradead.org/linux-mtd

Pull three MTD fixes from David Woodhouse:
 - Fix a lock ordering deadlock in JFFS2
 - Fix an oops in the dataflash driver, triggered by a dummy call to test
   whether it has OTP functionality.
 - Fix request_mem_region() failure on amsdelta NAND driver.

* tag 'for-linus-3.4-20120513' of git://git.infradead.org/linux-mtd:
  mtd: ams-delta: fix request_mem_region() failure
  jffs2: Fix lock acquisition order bug in gc path
  mtd: fix oops in dataflash driver
2012-05-13 11:33:09 -07:00
Bernd Schubert
f908ee9463 bio allocation failure due to bio_get_nr_vecs()
The number of bio_get_nr_vecs() is passed down via bio_alloc() to
bvec_alloc_bs(), which fails the bio allocation if
nr_iovecs > BIO_MAX_PAGES. For the underlying caller this causes an
unexpected bio allocation failure.
Limiting to queue_max_segments() is not sufficient, as max_segments
also might be very large.

bvec_alloc_bs(gfp_mask, nr_iovecs, ) => NULL when nr_iovecs  > BIO_MAX_PAGES
bio_alloc_bioset(gfp_mask, nr_iovecs, ...)
bio_alloc(GFP_NOIO, nvecs)
xfs_alloc_ioend_bio()

Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-11 16:45:12 +02:00
Jeff Moyer
080399aaaf block: don't mark buffers beyond end of disk as mapped
Hi,

We have a bug report open where a squashfs image mounted on ppc64 would
exhibit errors due to trying to read beyond the end of the disk.  It can
easily be reproduced by doing the following:

[root@ibm-p750e-02-lp3 ~]# ls -l install.img
-rw-r--r-- 1 root root 142032896 Apr 30 16:46 install.img
[root@ibm-p750e-02-lp3 ~]# mount -o loop ./install.img /mnt/test
[root@ibm-p750e-02-lp3 ~]# dd if=/dev/loop0 of=/dev/null
dd: reading `/dev/loop0': Input/output error
277376+0 records in
277376+0 records out
142016512 bytes (142 MB) copied, 0.9465 s, 150 MB/s

In dmesg, you'll find the following:

squashfs: version 4.0 (2009/01/31) Phillip Lougher
[   43.106012] attempt to access beyond end of device
[   43.106029] loop0: rw=0, want=277410, limit=277408
[   43.106039] Buffer I/O error on device loop0, logical block 138704
[   43.106053] attempt to access beyond end of device
[   43.106057] loop0: rw=0, want=277412, limit=277408
[   43.106061] Buffer I/O error on device loop0, logical block 138705
[   43.106066] attempt to access beyond end of device
[   43.106070] loop0: rw=0, want=277414, limit=277408
[   43.106073] Buffer I/O error on device loop0, logical block 138706
[   43.106078] attempt to access beyond end of device
[   43.106081] loop0: rw=0, want=277416, limit=277408
[   43.106085] Buffer I/O error on device loop0, logical block 138707
[   43.106089] attempt to access beyond end of device
[   43.106093] loop0: rw=0, want=277418, limit=277408
[   43.106096] Buffer I/O error on device loop0, logical block 138708
[   43.106101] attempt to access beyond end of device
[   43.106104] loop0: rw=0, want=277420, limit=277408
[   43.106108] Buffer I/O error on device loop0, logical block 138709
[   43.106112] attempt to access beyond end of device
[   43.106116] loop0: rw=0, want=277422, limit=277408
[   43.106120] Buffer I/O error on device loop0, logical block 138710
[   43.106124] attempt to access beyond end of device
[   43.106128] loop0: rw=0, want=277424, limit=277408
[   43.106131] Buffer I/O error on device loop0, logical block 138711
[   43.106135] attempt to access beyond end of device
[   43.106139] loop0: rw=0, want=277426, limit=277408
[   43.106143] Buffer I/O error on device loop0, logical block 138712
[   43.106147] attempt to access beyond end of device
[   43.106151] loop0: rw=0, want=277428, limit=277408
[   43.106154] Buffer I/O error on device loop0, logical block 138713
[   43.106158] attempt to access beyond end of device
[   43.106162] loop0: rw=0, want=277430, limit=277408
[   43.106166] attempt to access beyond end of device
[   43.106169] loop0: rw=0, want=277432, limit=277408
...
[   43.106307] attempt to access beyond end of device
[   43.106311] loop0: rw=0, want=277470, limit=2774

Squashfs manages to read in the end block(s) of the disk during the
mount operation.  Then, when dd reads the block device, it leads to
block_read_full_page being called with buffers that are beyond end of
disk, but are marked as mapped.  Thus, it would end up submitting read
I/O against them, resulting in the errors mentioned above.  I fixed the
problem by modifying init_page_buffers to only set the buffer mapped if
it fell inside of i_size.

Cheers,
Jeff

Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Nick Piggin <npiggin@kernel.dk>

--

Changes from v1->v2: re-used max_block, as suggested by Nick Piggin.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-05-11 16:42:14 +02:00
Linus Torvalds
26fe575028 vfs: make it possible to access the dentry hash/len as one 64-bit entry
This allows comparing hash and len in one operation on 64-bit
architectures.  Right now only __d_lookup_rcu() takes advantage of this,
since that is the case we care most about.

The use of anonymous struct/unions hides the alternate 64-bit approach
from most users, the exception being a few cases where we initialize a
'struct qstr' with a static initializer.  This makes the problematic
cases use a new QSTR_INIT() helper function for that (but initializing
just the name pointer with a "{ .name = xyzzy }" initializer remains
valid, as does just copying another qstr structure).

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 19:54:35 -07:00
Linus Torvalds
ee983e8967 vfs: move dentry name length comparison from dentry_cmp() into callers
All callers do want to check the dentry length, but some of them can
check the length and the hash together, so doing it in dentry_cmp() can
be counter-productive.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 19:54:35 -07:00
Linus Torvalds
94753db5ed vfs: do the careful dentry name access for all dentry_cmp cases
Commit 12f8ad4b05 ("vfs: clean up __d_lookup_rcu() and dentry_cmp()
interfaces") did the careful ACCESS_ONCE() of the dentry name only for
the word-at-a-time case, even though the issue is generic.

Admittedly I don't really see gcc ever reloading the value in the middle
of the loop, so the ACCESS_ONCE() protects us from a fairly theoretical
issue. But better safe than sorry.

Also, this consolidates the common parts of the word-at-a-time and
bytewise logic, which includes checking the length.  We'll be changing
that later.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 19:54:09 -07:00
Linus Torvalds
8c01a529b8 vfs: remove unnecessary d_unhashed() check from __d_lookup_rcu
The check for d_unhashed() is not strictly incorrect, but at the same
time it is also not sensible.  The actual dentry removal from the dentry
hash chains is totally asynchronous to the __d_lookup_rcu() logic, and
we depend on __d_drop() updating the sequence number to invalidate any
lookup of an unhashed dentry.

So checking d_unhashed() is not incorrect, but it's not useful either:
the code has to work correctly even without it. So just remove it.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 19:52:35 -07:00
Linus Torvalds
7c283324da Merge branch 'akpm' (Andrew's patch-bomb)
Merge misc fixes from Andrew Morton.

* emailed from Andrew Morton <akpm@linux-foundation.org>: (8 patches)
  MAINTAINERS: add maintainer for LED subsystem
  mm: nobootmem: fix sign extend problem in __free_pages_memory()
  drivers/leds: correct __devexit annotations
  memcg: free spare array to avoid memory leak
  namespaces, pid_ns: fix leakage on fork() failure
  hugetlb: prevent BUG_ON in hugetlb_fault() -> hugetlb_cow()
  mm: fix division by 0 in percpu_pagelist_fraction()
  proc/pid/pagemap: correctly report non-present ptes and holes between vmas
2012-05-10 15:17:24 -07:00
Konstantin Khlebnikov
16fbdce62d proc/pid/pagemap: correctly report non-present ptes and holes between vmas
Reset the current pagemap-entry if the current pte isn't present, or if
current vma is over.  Otherwise pagemap reports last entry again and
again.

Non-present pte reporting was broken in commit 092b50bacd ("pagemap:
introduce data structure for pagemap entry")

Reporting for holes was broken in commit 5aaabe831e ("pagemap: avoid
splitting thp when reading /proc/pid/pagemap")

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Reported-by: Pavel Emelyanov <xemul@parallels.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-10 15:06:44 -07:00
Dan Carpenter
48a5730e5b cifs: fix revalidation test in cifs_llseek()
This test is always true so it means we revalidate the length every
time, which generates more network traffic.  When it is SEEK_SET or
SEEK_CUR, then we don't need to revalidate.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-05-09 15:16:22 -05:00
David S. Miller
0d6c4a2e46 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/intel/e1000e/param.c
	drivers/net/wireless/iwlwifi/iwl-agn-rx.c
	drivers/net/wireless/iwlwifi/iwl-trans-pcie-rx.c
	drivers/net/wireless/iwlwifi/iwl-trans.h

Resolved the iwlwifi conflict with mainline using 3-way diff posted
by John Linville and Stephen Rothwell.  In 'net' we added a bug
fix to make iwlwifi report a more accurate skb->truesize but this
conflicted with RX path changes that happened meanwhile in net-next.

In e1000e a conflict arose in the validation code for settings of
adapter->itr.  'net-next' had more sophisticated logic so that
logic was used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-07 23:35:40 -04:00
Josh Cartwright
226bb7df3d jffs2: Fix lock acquisition order bug in gc path
The locking policy is such that the erase_complete_block spinlock is
nested within the alloc_sem mutex.  This fixes a case in which the
acquisition order was erroneously reversed.  This issue was caught by
the following lockdep splat:

   =======================================================
   [ INFO: possible circular locking dependency detected ]
   3.0.5 #1
   -------------------------------------------------------
   jffs2_gcd_mtd6/299 is trying to acquire lock:
    (&c->alloc_sem){+.+.+.}, at: [<c01f7714>] jffs2_garbage_collect_pass+0x314/0x890

   but task is already holding lock:
    (&(&c->erase_completion_lock)->rlock){+.+...}, at: [<c01f7708>] jffs2_garbage_collect_pass+0x308/0x890

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #1 (&(&c->erase_completion_lock)->rlock){+.+...}:
          [<c008bec4>] validate_chain+0xe6c/0x10bc
          [<c008c660>] __lock_acquire+0x54c/0xba4
          [<c008d240>] lock_acquire+0xa4/0x114
          [<c046780c>] _raw_spin_lock+0x3c/0x4c
          [<c01f744c>] jffs2_garbage_collect_pass+0x4c/0x890
          [<c01f937c>] jffs2_garbage_collect_thread+0x1b4/0x1cc
          [<c0071a68>] kthread+0x98/0xa0
          [<c000f264>] kernel_thread_exit+0x0/0x8

   -> #0 (&c->alloc_sem){+.+.+.}:
          [<c008ad2c>] print_circular_bug+0x70/0x2c4
          [<c008c08c>] validate_chain+0x1034/0x10bc
          [<c008c660>] __lock_acquire+0x54c/0xba4
          [<c008d240>] lock_acquire+0xa4/0x114
          [<c0466628>] mutex_lock_nested+0x74/0x33c
          [<c01f7714>] jffs2_garbage_collect_pass+0x314/0x890
          [<c01f937c>] jffs2_garbage_collect_thread+0x1b4/0x1cc
          [<c0071a68>] kthread+0x98/0xa0
          [<c000f264>] kernel_thread_exit+0x0/0x8

   other info that might help us debug this:

    Possible unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(&(&c->erase_completion_lock)->rlock);
                                  lock(&c->alloc_sem);
                                  lock(&(&c->erase_completion_lock)->rlock);
     lock(&c->alloc_sem);

    *** DEADLOCK ***

   1 lock held by jffs2_gcd_mtd6/299:
    #0:  (&(&c->erase_completion_lock)->rlock){+.+...}, at: [<c01f7708>] jffs2_garbage_collect_pass+0x308/0x890

   stack backtrace:
   [<c00155dc>] (unwind_backtrace+0x0/0x100) from [<c0463dc0>] (dump_stack+0x20/0x24)
   [<c0463dc0>] (dump_stack+0x20/0x24) from [<c008ae84>] (print_circular_bug+0x1c8/0x2c4)
   [<c008ae84>] (print_circular_bug+0x1c8/0x2c4) from [<c008c08c>] (validate_chain+0x1034/0x10bc)
   [<c008c08c>] (validate_chain+0x1034/0x10bc) from [<c008c660>] (__lock_acquire+0x54c/0xba4)
   [<c008c660>] (__lock_acquire+0x54c/0xba4) from [<c008d240>] (lock_acquire+0xa4/0x114)
   [<c008d240>] (lock_acquire+0xa4/0x114) from [<c0466628>] (mutex_lock_nested+0x74/0x33c)
   [<c0466628>] (mutex_lock_nested+0x74/0x33c) from [<c01f7714>] (jffs2_garbage_collect_pass+0x314/0x890)
   [<c01f7714>] (jffs2_garbage_collect_pass+0x314/0x890) from [<c01f937c>] (jffs2_garbage_collect_thread+0x1b4/0x1cc)
   [<c01f937c>] (jffs2_garbage_collect_thread+0x1b4/0x1cc) from [<c0071a68>] (kthread+0x98/0xa0)
   [<c0071a68>] (kthread+0x98/0xa0) from [<c000f264>] (kernel_thread_exit+0x0/0x8)

This was introduce in '81cfc9f jffs2: Fix serious write stall due to erase'.

Cc: stable@kernel.org [2.6.37+]
Signed-off-by: Josh Cartwright <joshc@linux.com>
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
2012-05-07 20:30:14 +01:00
Linus Torvalds
8529f613b6 vfs: don't force a big memset of stat data just to clear padding fields
Admittedly this is something that the compiler should be able to just do
for us, but gcc just isn't that smart.  And trying to use a structure
initializer (which would get us the right semantics) ends up resulting
in gcc allocating stack space for _two_ 'struct stat', and then copying
one into the other.

So do it by hand - just have a per-architecture macro that initializes
the padding fields.  And if the architecture doesn't provide one, fall
back to the old behavior of just doing the whole memset() first.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-06 18:02:40 -07:00
Linus Torvalds
a52dd971f9 vfs: de-crapify "cp_new_stat()" function
It's an unreadable mess of 32-bit vs 64-bit #ifdef's that mostly follow
a rather simple pattern.

Make a helper #define to handle that pattern, in the process making the
code both shorter and more readable.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-06 17:47:30 -07:00
Linus Torvalds
271fd5d728 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs
Pull btrfs fixes from Chris Mason:
 "The big ones here are a memory leak we introduced in rc1, and a
  scheduling while atomic if the transid on disk doesn't match the
  transid we expected.  This happens for corrupt blocks, or out of date
  disks.

  It also fixes up the ioctl definition for our ioctl to resolve logical
  inode numbers.  The __u32 was a merging error and doesn't match what
  we ship in the progs."

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: avoid sleeping in verify_parent_transid while atomic
  Btrfs: fix crash in scrub repair code when device is missing
  btrfs: Fix mismatching struct members in ioctl.h
  Btrfs: fix page leak when allocing extent buffers
  Btrfs: Add properly locking around add_root_to_dirty_list
2012-05-06 10:20:07 -07:00
Chris Mason
b9fab919b7 Btrfs: avoid sleeping in verify_parent_transid while atomic
verify_parent_transid needs to lock the extent range to make
sure no IO is underway, and so it can safely clear the
uptodate bits if our checks fail.

But, a few callers are using it with spinlocks held.  Most
of the time, the generation numbers are going to match, and
we don't want to switch to a blocking lock just for the error
case.  This adds an atomic flag to verify_parent_transid,
and changes it to return EAGAIN if it needs to block to
properly verifiy things.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-06 07:23:47 -04:00
Linus Torvalds
12f8ad4b05 vfs: clean up __d_lookup_rcu() and dentry_cmp() interfaces
The calling conventions for __d_lookup_rcu() and dentry_cmp() are
annoying in different ways, and there is actually one single underlying
reason for both of the annoyances.

The fundamental reason is that we do the returned dentry sequence number
check inside __d_lookup_rcu() instead of doing it in the caller.  This
results in two annoyances:

 - __d_lookup_rcu() now not only needs to return the dentry and the
   sequence number that goes along with the lookup, it also needs to
   return the inode pointer that was validated by that sequence number
   check.

 - and because we did the sequence number check early (to validate the
   name pointer and length) we also couldn't just pass the dentry itself
   to dentry_cmp(), we had to pass the counted string that contained the
   name.

So that sequence number decision caused two separate ugly calling
conventions.

Both of these problems would be solved if we just did the sequence
number check in the caller instead.  There's only one caller, and that
caller already has to do the sequence number check for the parent
anyway, so just do that.

That allows us to stop returning the dentry->d_inode in that in-out
argument (pointer-to-pointer-to-inode), so we can make the inode
argument just a regular input inode pointer.  The caller can just load
the inode from dentry->d_inode, and then do the sequence number check
after that to make sure that it's synchronized with the name we looked
up.

And it allows us to just pass in the dentry to dentry_cmp(), which is
what all the callers really wanted.  Sure, dentry_cmp() has to be a bit
careful about the dentry (which is not stable during RCU lookup), but
that's actually very simple.

And now that dentry_cmp() can clearly see that the first string argument
is a dentry, we can use the direct word access for that, instead of the
careful unaligned zero-padding.  The dentry name is always properly
aligned, since it is a single path component that is either embedded
into the dentry itself, or was allocated with kmalloc() (see __d_alloc).

Finally, this also uninlines the nasty slow-case for dentry comparisons:
that one *does* need to do a sequence number check, since it will call
in to the low-level filesystems, and we want to give those a stable
inode pointer and path component length/start arguments.  Doing an extra
sequence check for that slow case is not a problem, though.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-04 18:21:14 -07:00
Greg Kroah-Hartman
6f24f89287 hfsplus: Fix potential buffer overflows
Commit ec81aecb29 ("hfs: fix a potential buffer overflow") fixed a few
potential buffer overflows in the hfs filesystem.  But as Timo Warns
pointed out, these changes also need to be made on the hfsplus
filesystem as well.

Reported-by: Timo Warns <warns@pre-sense.de>
Acked-by: WANG Cong <amwang@redhat.com>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Miklos Szeredi <mszeredi@suse.cz>
Cc: Sage Weil <sage@newdream.net>
Cc: Eugene Teo <eteo@redhat.com>
Cc: Roman Zippel <zippel@linux-m68k.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Dave Anderson <anderson@redhat.com>
Cc: stable <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-04 17:11:24 -07:00
Linus Torvalds
c6de1687f5 Merge git://git.samba.org/sfrench/cifs-2.6
Pull CIFS fixes from Steve French.

* git://git.samba.org/sfrench/cifs-2.6:
  fs/cifs: fix parsing of dfs referrals
  cifs: make sure we ignore the credentials= and cred= options
  [CIFS] Update cifs version to 1.78
  cifs - check S_AUTOMOUNT in revalidate
  cifs: add missing initialization of server->req_lock
  cifs: don't cap ra_pages at the same level as default_backing_dev_info
  CIFS: Fix indentation in cifs_show_options
2012-05-04 15:34:21 -07:00
Stefan Behrens
ea9947b439 Btrfs: fix crash in scrub repair code when device is missing
Fix that when scrub tries to repair an I/O or checksum error and one of
the devices containing the mirror is missing, it crashes in bio_add_page
because the bdev is a NULL pointer for missing devices.

Reported-by: Marco L. Crociani <marco.crociani@gmail.com>
Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-04 15:16:07 -04:00
Alexander Block
d04b1debc9 btrfs: Fix mismatching struct members in ioctl.h
Fix the size members of btrfs_ioctl_ino_path_args and
btrfs_ioctl_logical_ino_args. The user space btrfs-progs utilities used
__u64 and the kernel headers used __u32 before.

Signed-off-by: Alexander Block <ablock84@googlemail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-04 15:16:06 -04:00
Josef Bacik
17de39ac17 Btrfs: fix page leak when allocing extent buffers
If we happen to alloc a extent buffer and then alloc a page and notice that
page is already attached to an extent buffer, we will only unlock it and
free our existing eb.  Any pages currently attached to that eb will be
properly freed, but we don't do the page_cache_release() on the page where
we noticed the other extent buffer which can cause us to leak pages and I
hope cause the weird issues we've been seeing in this area.  Thanks,

Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-04 15:16:06 -04:00
Chris Mason
e5846fc665 Btrfs: Add properly locking around add_root_to_dirty_list
add_root_to_dirty_list happens once at the very beginning of the
transaction, but it is still racey.

Signed-off-by: Chris Mason <chris.mason@oracle.com>
2012-05-04 15:14:11 -04:00
Stefan Metzmacher
d8f2799b10 fs/cifs: fix parsing of dfs referrals
The problem was that the first referral was parsed more than once
and so the caller tried the same referrals multiple times.

The problem was introduced partly by commit
066ce68994,
where 'ref += le16_to_cpu(ref->Size);' got lost,
but that was also wrong...

Cc: <stable@vger.kernel.org>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Tested-by: Björn Jacke <bj@sernet.de>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-05-03 22:47:39 -05:00
Linus Torvalds
e419b4cc58 vfs: make word-at-a-time accesses handle a non-existing page
It turns out that there are more cases than CONFIG_DEBUG_PAGEALLOC that
can have holes in the kernel address space: it seems to happen easily
with Xen, and it looks like the AMD gart64 code will also punch holes
dynamically.

Actually hitting that case is still very unlikely, so just do the
access, and take an exception and fix it up for the very unlikely case
of it being a page-crosser with no next page.

And hey, this abstraction might even help other architectures that have
other issues with unaligned word accesses than the possible missing next
page.  IOW, this could do the byte order magic too.

Peter Anvin fixed a thinko in the shifting for the exception case.

Reported-and-tested-by: Jana Saout <jana@saout.de>
Cc:  Peter Anvin <hpa@zytor.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-03 14:01:40 -07:00
Jeff Layton
a557b97616 cifs: make sure we ignore the credentials= and cred= options
Older mount.cifs programs passed this on to the kernel after parsing
the file. Make sure the kernel ignores that option.

Should fix:

    https://bugzilla.kernel.org/show_bug.cgi?id=43195

Cc: Sachin Prabhu <sprabhu@redhat.com>
Reported-by: Ronald <ronald645@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-05-03 13:50:01 -05:00
Steve French
f966424e99 [CIFS] Update cifs version to 1.78
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-05-03 13:50:01 -05:00
Ian Kent
936ad90944 cifs - check S_AUTOMOUNT in revalidate
When revalidating a dentry, if the inode wasn't known to be a dfs
entry when the dentry was instantiated, such as when created via
->readdir(), the DCACHE_NEED_AUTOMOUNT flag needs to be set on the
dentry in ->d_revalidate().

The false return from cifs_d_revalidate(), due to the inode now
being marked with the S_AUTOMOUNT flag, might not invalidate the
dentry if there is a concurrent unlazy path walk. This is because
the dentry reference count will be at least 2 in this case causing
d_invalidate() to return EBUSY. So the asumption that the dentry
will be discarded then correctly instantiated via ->lookup() might
not hold.

Signed-off-by: Ian Kent <raven@themaw.net>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Cc: Steve French <smfrench@gmail.com>
Cc: linux-cifs@vger.kernel.org
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-05-03 13:49:47 -05:00
Linus Torvalds
529acf5898 NFS client bugfixes for Linux 3.4
Highlights include:
 - Fixes for the NFSv4 security negotiation
 - Use the correct hostname when mounting from a private namespace
 - NFS net namespace bugfixes for the pipefs filesystem
 - NFSv4 GETACL bugfixes
 - IPv6 bugfix for NFSv4 referrals
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJPoK8MAAoJEGcL54qWCgDyr4AP/1cSY4ZjaZwZm1l9M1l1RBtx
 zBBE6RfM+4eKqwAzFNaIjLjslLMMkTV0TsARYG/CQrJ4DuonHDkdGMwXdTgWFYNN
 AuVO50QKTy+8j2PqY5t84/d6WrFrxbCckKyhixb4/uHtl6mB2jdICA7xLWa4hndS
 kPhRYZQt4zs+Db7Y66nXCLnpWaWoR34ZNxbpoCTLLyYIiUOTplfSfJ21bVZWN3Pt
 M5BYUdKDfgDV15V1/UqULL9j3xnrgFsOK9DjiHEXppXZYfEqfwmEMg9ZQw2AfAm1
 HcrcVv3YTa0I4ag3s/IeZ7wot8PJPOMQzVnzvD2FIO8FX+9vkkYQ3BwoQSVv21Ar
 hgywkT/MMlz9mCDqpjJQVgTaNq4AOoFBF5MXQz9KLWSdummjZs3ILMkpV7Ze3qpj
 Q6GEgii5Xr+Pj/D5D5W3gvkcztDhn3ziSv7fuL5fEADfrP6tYxNmLlP1MKPzrtJn
 SP7WnkmcuWXdvfnKAeOeqAsrvDuaNoHRjtNmfe1PAajUWcvVuLidYhi84dtRYvBe
 N4ukQGqerBoHN3nYhQHl0p9arXA6mAdb2Y9Pt9FY3nraA7e+oJWaEfq1vuFEgF8s
 et8mDrGYpVN155qUvCBGNIwyQXgGt6LLhBZVF9OJa59JfRPDkagaIaTVPlhKJm/Q
 Mbx7dfpGgDU+aLipyv2Q
 =vLBv
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-3.4-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 - Fixes for the NFSv4 security negotiation
 - Use the correct hostname when mounting from a private namespace
 - NFS net namespace bugfixes for the pipefs filesystem
 - NFSv4 GETACL bugfixes
 - IPv6 bugfix for NFSv4 referrals

* tag 'nfs-for-3.4-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv4.1: Use the correct hostname in the client identifier string
  SUNRPC: RPC client must use the current utsname hostname string
  NFS: get module in idmap PipeFS notifier callback
  NFS: Remove unused function nfs_lookup_with_sec()
  NFS: Honor the authflavor set in the clone mount data
  NFS: Fix following referral mount points with different security
  NFS: Do secinfo as part of lookup
  NFS: Handle exceptions coming out of nfs4_proc_fs_locations()
  NFS: Fix SECINFO_NO_NAME
  SUNRPC: traverse clients tree on PipeFS event
  SUNRPC: set per-net PipeFS superblock before notification
  SUNRPC: skip clients with program without PipeFS entries
  SUNRPC: skip dead but not buried clients on PipeFS events
  Avoid beyond bounds copy while caching ACL
  Avoid reading past buffer when calling GETACL
  fix page number calculation bug for block layout decode buffer
  NFSv4.1 fix page number calculation bug for filelayout decode buffers
  pnfs-obj: Remove unused variable from objlayout_get_deviceinfo()
  nfs4: fix referrals on mounts that use IPv6 addrs
2012-05-02 08:17:57 -07:00
Jeff Layton
58fa015f61 cifs: add missing initialization of server->req_lock
Cc: Pavel Shilovsky <piastryyy@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-05-01 22:29:51 -05:00
Jeff Layton
8f71465c19 cifs: don't cap ra_pages at the same level as default_backing_dev_info
While testing, I've found that even when we are able to negotiate a
much larger rsize with the server, on-the-wire reads often end up being
capped at 128k because of ra_pages being capped at that level.

Lifting this restriction gave almost a twofold increase in sequential
read performance on my craptactular KVM test rig with a 1M rsize.

I think this is safe since the actual ra_pages that the VM requests
is run through max_sane_readahead() prior to submitting the I/O. Under
memory pressure we should end up with large readahead requests being
suppressed anyway.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-05-01 22:27:54 -05:00
Sachin Prabhu
156d17905e CIFS: Fix indentation in cifs_show_options
Trivial patch which fixes a misplaced tab in cifs_show_options().

Signed-off-by: Sachin Prabhu <sprabhu@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
2012-05-01 22:19:43 -05:00
Randy Dunlap
8a7dc4b04b nfsd: fix nfs4recover.c printk format warning
Fix printk format warnings -- both items are size_t,
so use %zu to print them.

fs/nfsd/nfs4recover.c:580:3: warning: format '%lu' expects type 'long unsigned int', but argument 3 has type 'size_t'
fs/nfsd/nfs4recover.c:580:3: warning: format '%lu' expects type 'long unsigned int', but argument 4 has type 'unsigned int'

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-30 12:28:48 -07:00
Trond Myklebust
3617e5031b NFSv4.1: Use the correct hostname in the client identifier string
We need to use the hostname of the process that created the nfs_client.
That hostname is now stored in the rpc_client->cl_nodename.

Also remove the utsname()->domainname component. There is no reason
to include the NIS/YP domainname in a client identifier string.

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2012-04-30 12:04:58 -04:00
Linus Torvalds
64f371bc31 autofs: make the autofsv5 packet file descriptor use a packetized pipe
The autofs packet size has had a very unfortunate size problem on x86:
because the alignment of 'u64' differs in 32-bit and 64-bit modes, and
because the packet data was not 8-byte aligned, the size of the autofsv5
packet structure differed between 32-bit and 64-bit modes despite
looking otherwise identical (300 vs 304 bytes respectively).

We first fixed that up by making the 64-bit compat mode know about this
problem in commit a32744d4ab ("autofs: work around unhappy compat
problem on x86-64"), and that made a 32-bit 'systemd' work happily on a
64-bit kernel because everything then worked the same way as on a 32-bit
kernel.

But it turned out that 'automount' had actually known and worked around
this problem in user space, so fixing the kernel to do the proper 32-bit
compatibility handling actually *broke* 32-bit automount on a 64-bit
kernel, because it knew that the packet sizes were wrong and expected
those incorrect sizes.

As a result, we ended up reverting that compatibility mode fix, and
thus breaking systemd again, in commit fcbf94b9de.

With both automount and systemd doing a single read() system call, and
verifying that they get *exactly* the size they expect but using
different sizes, it seemed that fixing one of them inevitably seemed to
break the other.  At one point, a patch I seriously considered applying
from Michael Tokarev did a "strcmp()" to see if it was automount that
was doing the operation.  Ugly, ugly.

However, a prettier solution exists now thanks to the packetized pipe
mode.  By marking the communication pipe as being packetized (by simply
setting the O_DIRECT flag), we can always just write the bigger packet
size, and if user-space does a smaller read, it will just get that
partial end result and the extra alignment padding will simply be thrown
away.

This makes both automount and systemd happy, since they now get the size
they asked for, and the kernel side of autofs simply no longer needs to
care - it could pad out the packet arbitrarily.

Of course, if there is some *other* user of autofs (please, please,
please tell me it ain't so - and we haven't heard of any) that tries to
read the packets with multiple writes, that other user will now be
broken - the whole point of the packetized mode is that one system call
gets exactly one packet, and you cannot read a packet in pieces.

Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-29 13:30:08 -07:00
Linus Torvalds
9883035ae7 pipes: add a "packetized pipe" mode for writing
The actual internal pipe implementation is already really about
individual packets (called "pipe buffers"), and this simply exposes that
as a special packetized mode.

When we are in the packetized mode (marked by O_DIRECT as suggested by
Alan Cox), a write() on a pipe will not merge the new data with previous
writes, so each write will get a pipe buffer of its own.  The pipe
buffer is then marked with the PIPE_BUF_FLAG_PACKET flag, which in turn
will tell the reader side to break the read at that boundary (and throw
away any partial packet contents that do not fit in the read buffer).

End result: as long as you do writes less than PIPE_BUF in size (so that
the pipe doesn't have to split them up), you can now treat the pipe as a
packet interface, where each read() system call will read one packet at
a time.  You can just use a sufficiently big read buffer (PIPE_BUF is
sufficient, since bigger than that doesn't guarantee atomicity anyway),
and the return value of the read() will naturally give you the size of
the packet.

NOTE! We do not support zero-sized packets, and zero-sized reads and
writes to a pipe continue to be no-ops.  Also note that big packets will
currently be split at write time, but that the size at which that
happens is not really specified (except that it's bigger than PIPE_BUF).
Currently that limit is the system page size, but we might want to
explicitly support bigger packets some day.

The main user for this is going to be the autofs packet interface,
allowing us to stop having to care so deeply about exact packet sizes
(which have had bugs with 32/64-bit compatibility modes).  But user
space can create packetized pipes with "pipe2(fd, O_DIRECT)", which will
fail with an EINVAL on kernels that do not support this interface.

Tested-by: Michael Tokarev <mjt@tls.msk.ru>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: David Miller <davem@davemloft.net>
Cc: Ian Kent <raven@themaw.net>
Cc: Thomas Meyer <thomas@m3y3r.de>
Cc: stable@kernel.org  # needed for systemd/autofs interaction fix
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-29 13:12:42 -07:00
Linus Torvalds
e994defb7b VFS: make vfs_fstat() use f[get|put]_light()
Use the *_light() versions that properly avoid doing the file user count
updates when they are unnecessary.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-28 14:55:17 -07:00
Linus Torvalds
3f9f0aa687 VFS: clean up and simplify getname_flags()
This removes a number of silly games around strncpy_from_user() in
do_getname(), and removes that helper function entirely.  We instead
make getname_flags() just use strncpy_from_user() properly directly.

Removing the wrapper function simplifies things noticeably, mostly
because we no longer play the unnecessary games with segments (x86
strncpy_from_user() no longer needs the hack), but also because the
empty path handling is just much more obvious.  The return value of
"strncpy_to_user()" is much more obvious than checking an odd error
return case from do_getname().

[ non-x86 architectures were notified of this change several weeks ago,
  since it is possible that they have copied the old broken x86
  strncpy_from_user. But nobody reacted, so .. See

    http://www.spinics.net/lists/linux-arch/msg17313.html

  for details ]

Cc: linux-arch@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-04-28 14:38:32 -07:00