exofs uses simple_write_end() for it's .write_end handler. But
it is not enough because simple_write_end() does not call
mark_inode_dirty() when it extends i_size. So even if we do
call mark_inode_dirty at beginning of write out, with a very
long IO and a saturated system we might get the .write_inode()
called while still extend-writing to file and miss out on the last
i_size updates.
So override .write_end, call simple_write_end(), and afterwords if
i_size was changed call mark_inode_dirty().
It stands to logic that since simple_write_end() was the one extending
i_size it should also call mark_inode_dirty(). But it looks like all
users of simple_write_end() are memory-bound pseudo filesystems, who
could careless about mark_inode_dirty(). I might submit a
warning-comment patch to simple_write_end() in future.
CC: Stable <stable@kernel.org>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Some on disk exofs constants and types are defined in the pnfs_osd_xdr.h
file. Since we needed these types before the pnfs-objects code was
accepted to mainline we duplicated the minimal needed definitions into
an exofs local header. The definitions where conditionally included
depending on !CONFIG_PNFS defined. So if PNFS was present in the tree
definitions are taken from there and if not they are defined locally.
That was all good but, the CONFIG_PNFS is planed to be included upstream
before the pnfs-objects is also included. (The first pnfs batch might be
pnfs-files only)
So condition exofs local definitions on the absence of pnfs_osd_xdr.h
inclusion (__PNFS_OSD_XDR_H__ not defined). User code must make sure
that in future pnfs_osd_xdr.h will be included before fs/exofs/pnfs.h,
which happens to be so in current code.
Once pnfs-objects hits mainline, exofs's local header will be removed.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This patch changes on-disk format, it is accompanied with a parallel
patch to mkfs.exofs that enables multi-device capabilities.
After this patch, old exofs will refuse to mount a new formatted FS and
new exofs will refuse an old format. This is done by moving the magic
field offset inside the FSCB. A new FSCB *version* field was added. In
the future, exofs will refuse to mount unmatched FSCB version. To
up-grade or down-grade an exofs one must use mkfs.exofs --upgrade option
before mounting.
Introduced, a new object that contains a *device-table*. This object
contains the default *data-map* and a linear array of devices
information, which identifies the devices used in the filesystem. This
object is only written to offline by mkfs.exofs. This is why it is kept
separate from the FSCB, since the later is written to while mounted.
Same partition number, same object number is used on all devices only
the device varies.
* define the new format, then load the device table on mount time make
sure every thing is supported.
* Change I/O engine to now support Mirror IO, .i.e write same data
to multiple devices, read from a random device to spread the
read-load from multiple clients (TODO: stripe read)
Implementation notes:
A few points introduced in previous patch should be mentioned here:
* Special care was made so absolutlly all operation that have any chance
of failing are done before any osd-request is executed. This is to
minimize the need for a data consistency recovery, to only real IO
errors.
* Each IO state has a kref. It starts at 1, any osd-request executed
will increment the kref, finally when all are executed the first ref
is dropped. At IO-done, each request completion decrements the kref,
the last one to return executes the internal _last_io() routine.
_last_io() will call the registered io_state_done. On sync mode a
caller does not supply a done method, indicating a synchronous
request, the caller is put to sleep and a special io_state_done is
registered that will awaken the caller. Though also in sync mode all
operations are executed in parallel.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
In anticipation for multi-device operations, we separate osd operations
into an abstract I/O API. Currently only one device is used but later
when adding more devices, we will drive all devices in parallel according
to a "data_map" that describes how data is arranged on multiple devices.
The file system level operates, like before, as if there is one object
(inode-number) and an i_size. The io engine will split this to the same
object-number but on multiple device.
At first we introduce Mirror (raid 1) layout. But at the final outcome
we intend to fully implement the pNFS-Objects data-map, including
raid 0,4,5,6 over mirrored devices, over multiple device-groups. And
more. See: http://tools.ietf.org/html/draft-ietf-nfsv4-pnfs-obj-12
* Define an io_state based API for accessing osd storage devices
in an abstract way.
Usage:
First a caller allocates an io state with:
exofs_get_io_state(struct exofs_sb_info *sbi,
struct exofs_io_state** ios);
Then calles one of:
exofs_sbi_create(struct exofs_io_state *ios);
exofs_sbi_remove(struct exofs_io_state *ios);
exofs_sbi_write(struct exofs_io_state *ios);
exofs_sbi_read(struct exofs_io_state *ios);
exofs_oi_truncate(struct exofs_i_info *oi, u64 new_len);
And when done
exofs_put_io_state(struct exofs_io_state *ios);
* Convert all source files to use this new API
* Convert from bio_alloc to bio_kmalloc
* In io engine we make use of the now fixed osd_req_decode_sense
There are no functional changes or on disk additions after this patch.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
If I do a "git mv" together with a massive code change
and commit in one patch, git looses the rename and
records a delete/new instead. This is bad because I want
a rename recorded so later rebased/cherry-picked patches
to the old name will work. Also the --follow is lost.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Even though exofs has a 4k block size, statfs blocks
is in sectors (512 bytes).
Also if target returns 0 for capacity then make it
ULLONG_MAX. df does not like zero-size filesystems
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
It is important to print in the logs when a filesystem was
mounted and eventually unmounted.
Print the osd-device's osd_name and pid the FS was
mounted/unmounted on.
TODO: How to also print the namespace path the filesystem was
mounted on?
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
There are two places that initialize inodes: exofs_iget() and
exofs_new_inode()
As more members of exofs_i_info that need initialization are
added this code will grow. (soon)
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Iner-loops printing is converted to EXOFS_DBG2 which is #defined
to nothing.
It is now almost bareable to just leave debug-on. Every operation
is printed once, with most relevant info (I hope).
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
the two places inside exofs that where taking the BKL were:
exofs_put_super() - .put_super
and
exofs_sync_fs() - which is .sync_fs and is also called from
.write_super.
Now exofs_sync_fs() is protected from itself by also taking
the sb_lock.
exofs_put_super() directly calls exofs_sync_fs() so there is no
danger between these two either.
In anyway there is absolutely nothing dangerous been done
inside exofs_sync_fs().
Unless there is some subtle race with the actual lifetime of
the super_block in regard to .put_super and some other parts
of the VFS. Which is highly unlikely.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Remove smp_lock.h from files which don't need it (including some headers!)
* Add smp_lock.h to files which do need it
* Make smp_lock.h include conditional in hardirq.h
It's needed only for one kernel_locked() usage which is under CONFIG_PREEMPT
This will make hardirq.h inclusion cheaper for every PREEMPT=n config
(which includes allmodconfig/allyesconfig, BTW)
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The use of file_fsync() in exofs_file_sync() is not necessary since it
does some extra stuff not used by exofs. Open code just the parts that
are currently needed.
TODO: Farther optimization can be done to sync the sb only on inode
update of new files, Usually the sb update is not needed in exofs.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Boaz,
Congrats on getting all the OSD stuff into 2.6.30!
I just pulled the git, and saw that the IBM copyrights are still there.
Please remove them from all files:
* Copyright (C) 2005, 2006
* International Business Machines
IBM has revoked all rights on the code - they gave it to me.
Thanks!
Avishay
Signed-off-by: Avishay Traeger <avishay@gmail.com>
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
When failing a read request in the sync path, called from
write_begin, I forgot to free the allocated bio, fix it.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Conflicts:
drivers/message/fusion/mptsas.c
fixed up conflict between req->data_len accessors and mptsas driver updates.
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
Add a ->sync_fs method for data integrity syncs, and reimplement
->write_super ontop of it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Push down lock_super into ->write_super instances and remove it from the
caller.
Following filesystem don't need ->s_lock in ->write_super and are skipped:
* bfs, nilfs2 - no other uses of s_lock and have internal locks in
->write_super
* ext2 - uses BKL in ext2_write_super and has internal calls without s_lock
* reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in
->write_super
* xfs - no other uses of s_lock and uses internal lock (buffer lock on
superblock buffer) to serialize ->write_super. Also xfs_fs_write_super
is superflous and will go away in the next merge window
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Move BKL into ->put_super from the only caller. A couple of
filesystems had trivial enough ->put_super (only kfree and NULLing of
s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs,
hugetlbfs, omfs, qnx4, shmem, all others got the full treatment. Most
of them probably don't need it, but I'd rather sort that out individually.
Preferably after all the other BKL pushdowns in that area.
[AV: original used to move lock_super() down as well; these changes are
removed since we don't do lock_super() at all in generic_shutdown_super()
now]
[AV: fuse, btrfs and xfs are known to need no damn BKL, exempt]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We just did a full fs writeout using sync_filesystem before, and if
that's not enough for the filesystem it can perform it's own writeout
in ->put_super, which many filesystems already do.
Move a call to foofs_write_super into every foofs_put_super for now to
guarantee identical behaviour until it's cleaned up by the individual
filesystem maintainers.
Exceptions:
- affs already has identical copy & pasted code at the beginning of
affs_put_super so no need to do it twice.
- xfs does the right thing without it and I have changes pending for
the xfs tree touching this are so I don't really need conflicts
here..
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
libosd users that need to work with bios, must sometime use
the request_queue associated with the osd_dev. Make a wrapper for
that, and convert all in-tree users.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
For supporting of chained-bios we can not inspect the first
bio only, as before. Caller shall pass the total length of the
request, ie. sum_bytes(bio-chain).
Also since the bio might be a chain we don't set it's direction
on behalf of it's callers. The bio direction should be properly
set prior to this call. So fix a couple of write users that now
need to set the bio direction properly
[In this patch I change both library code and user sites at
exofs, to make it easy on integration. It should be submitted
via James's scsi-misc tree.]
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
CC: Jeff Garzik <jeff@garzik.org>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
By popular demand, define usefull wrappers for osd_req_read/write
that recieve kernel pointers. All users had their own.
Also remove these from exofs
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
rq->data_len served two purposes - the length of data buffer on issue
and the residual count on completion. This duality creates some
headaches.
First of all, block layer and low level drivers can't really determine
what rq->data_len contains while a request is executing. It could be
the total request length or it coulde be anything else one of the
lower layers is using to keep track of residual count. This
complicates things because blk_rq_bytes() and thus
[__]blk_end_request_all() relies on rq->data_len for PC commands.
Drivers which want to report residual count should first cache the
total request length, update rq->data_len and then complete the
request with the cached data length.
Secondly, it makes requests default to reporting full residual count,
ie. reporting that no data transfer occurred. The residual count is
an exception not the norm; however, the driver should clear
rq->data_len to zero to signify the normal cases while leaving it
alone means no data transfer occurred at all. This reverse default
behavior complicates code unnecessarily and renders block PC on some
drivers (ide-tape/floppy) unuseable.
This patch adds rq->resid_len which is used only for residual count.
While at it, remove now unnecessasry blk_rq_bytes() caching in
ide_pc_intr() as rq->data_len is not changed anymore.
Boaz : spotted missing conversion in osd
Sergei : spotted too early conversion to blk_rq_bytes() in ide-tape
[ Impact: cleanup residual count handling, report 0 resid by default ]
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Cc: Borislav Petkov <petkovbb@googlemail.com>
Cc: Sergei Shtylyov <sshtylyov@ru.mvista.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Doug Gilbert <dgilbert@interlog.com>
Cc: Mike Miller <mike.miller@hp.com>
Cc: Eric Moore <Eric.Moore@lsi.com>
Cc: Darrick J. Wong <djwong@us.ibm.com>
Cc: Pete Zaitcev <zaitcev@redhat.com>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Added some documentation in exofs.txt, as well as a BUGS file.
For further reading, operation instructions, example scripts
and up to date infomation and code please see:
http://open-osd.org
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This patch ties all operation vectors into a file system superblock
and registers the exofs file_system_type at module's load time.
* The file system control block (AKA on-disk superblock) resides in
an object with a special ID (defined in common.h).
Information included in the file system control block is used to
fill the in-memory superblock structure at mount time. This object
is created before the file system is used by mkexofs.c It contains
information such as:
- The file system's magic number
- The next inode number to be allocated
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
implementation of directory and inode operations.
* A directory is treated as a file, and essentially contains a list
of <file name, inode #> pairs for files that are found in that
directory. The object IDs correspond to the files' inode numbers
and are allocated using a 64bit incrementing global counter.
* Each file's control block (AKA on-disk inode) is stored in its
object's attributes. This applies to both regular files and other
types (directories, device files, symlinks, etc.).
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
OK Now we start to read and write from osd-objects. We try to
collect at most contiguous pages as possible in a single write/read.
The first page index is the object's offset.
TODO:
In 64-bit a single bio can carry at most 128 pages.
Add support of chaining multiple bios
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
implementation of the file_operations and inode_operations for
regular data files.
Most file_operations are generic vfs implementations except:
- exofs_truncate will truncate the OSD object as well
- Generic file_fsync is not good for none_bd devices so open code it
- The default for .flush in Linux is todo nothing so call exofs_fsync
on the file.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
This patch includes osd infrastructure that will be used later by
the file system.
Also the declarations of constants, on disk structures,
and prototypes.
And the Kbuild+Kconfig files needed to build the exofs module.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>