Commit Graph

27 Commits

Author SHA1 Message Date
Christoph Lameter 39743889aa [PATCH] Swap Migration V5: sys_migrate_pages interface
sys_migrate_pages implementation using swap based page migration

This is the original API proposed by Ray Bryant in his posts during the first
half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.

The intent of sys_migrate is to migrate memory of a process.  A process may
have migrated to another node.  Memory was allocated optimally for the prior
context.  sys_migrate_pages allows to shift the memory to the new node.

sys_migrate_pages is also useful if the processes available memory nodes have
changed through cpuset operations to manually move the processes memory.  Paul
Jackson is working on an automated mechanism that will allow an automatic
migration if the cpuset of a process is changed.  However, a user may decide
to manually control the migration.

This implementation is put into the policy layer since it uses concepts and
functions that are also needed for mbind and friends.  The patch also provides
a do_migrate_pages function that may be useful for cpusets to automatically
move memory.  sys_migrate_pages does not modify policies in contrast to Ray's
implementation.

The current code here is based on the swap based page migration capability and
thus is not able to preserve the physical layout relative to it containing
nodeset (which may be a cpuset).  When direct page migration becomes available
then the implementation needs to be changed to do a isomorphic move of pages
between different nodesets.  The current implementation simply evicts all
pages in source nodeset that are not in the target nodeset.

Patch supports ia64, i386 and x86_64.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08 20:12:42 -08:00
Peter Chubb 24b8e0cc09 [IA64] Remove warnings for gcc 4.0 IA64 compilation.
This patch removes some compilation warnings, mostly
trivially. acpi.c fix also noted by Kenji Kaneshige.

Signed-off-by; Peter Chubb <peterc@gelato.unsw.edu.au>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-09-16 09:45:27 -07:00
Linus Torvalds 486a153f0e Merge master.kernel.org:/pub/scm/linux/kernel/git/sam/kbuild 2005-09-09 15:46:49 -07:00
Chen, Kenneth W 383f2835eb [PATCH] Prefetch kernel stacks to speed up context switch
For architecture like ia64, the switch stack structure is fairly large
(currently 528 bytes).  For context switch intensive application, we found
that significant amount of cache misses occurs in switch_to() function.
The following patch adds a hook in the schedule() function to prefetch
switch stack structure as soon as 'next' task is determined.  This allows
maximum overlap in prefetch cache lines for that structure.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Luck, Tony" <tony.luck@intel.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-09 13:57:31 -07:00
Sam Ravnborg 39e01cb874 kbuild: ia64 use generic asm-offsets.h support
Delete obsolete stuff from arch Makefile
Rename file to asm-offsets.h
The trick used in the arch Makefile to circumvent the circular
dependency is kept.

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2005-09-09 22:03:13 +02:00
Chen, Kenneth W 0232622324 [IA64] minor performance tune-up in ia64_switch_to
The reenabling of psr.ic should really belong to dtr mapping code block.
It make the fall through code fast since it doesn't need to execute the
predicated-off instruction.  Logically make more sense as well since psr.ic
was turned off in .map code block.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-09-07 13:56:23 -07:00
Ingo Molnar 6cb54819d7 [PATCH] remove sys_set_zone_reclaim()
This removes sys_set_zone_reclaim() for now.  While i'm sure Martin is
trying to solve a real problem, we must not hard-code an incomplete and
insufficient approach into a syscall, because syscalls are pretty much
for eternity.  I am quite strongly convinced that this syscall must not
hit v2.6.13 in its current form.

Firstly, the syscall lacks basic syscall design: e.g. it allows the
global setting of VM policy for unprivileged users. (!) [ Imagine an
Oracle installation and a SAP installation on the same NUMA box fighting
over the 'optimal' setting for this flag. What will they do? Will they
try to set the flag to their own preferred value every second or so? ]

Secondly, it was added based on a single datapoint from Martin:

 http://marc.theaimsgroup.com/?l=linux-mm&m=111763597218177&w=2

where Martin characterizes the numbers the following way:

 ' Run-to-run variability for "make -j" is huge, so these numbers aren't
   terribly useful except to see that with reclaim the benchmark still
   finishes in a reasonable amount of time. '

in other words: the fundamental problem has likely not been solved, only
a tendential move into the right direction has been observed, and a
handful of numbers were picked out of a set of hugely variable results,
without showing the variability data. How much variance is there
run-to-run?

I'd really suggest to first walk the walk and see what's needed to get
stable & predictable kernel compilation numbers on that NUMA box, before
adding random syscalls to tune a particular aspect of the VM ... which
approach might not even matter once the whole picture has been analyzed
and understood!

The third, most important point is that the syscall exposes VM tuning
internals in a completely unstructured way. What sense does it make to
have a _GLOBAL_ per-node setting for 'should we go to another node for
reclaim'? If then it might make sense to do this per-app, via numalib or
so.

The change is minimalistic in that it doesnt remove the syscall and the
underlying infrastructure changes, only the user-visible changes.  We
could perhaps add a CAP_SYS_ADMIN-only sysctl for this hack, a'ka
/proc/sys/vm/swappiness, but even that looks quite counterproductive
when the generic approach is that we are trying to reduce the number of
external factors in the VM balance picture.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-08-01 10:03:56 -07:00
Robert Love d108919b2b [IA64] inotify: ia64 syscalls.
Attached patch adds the inotify syscalls to ia64.

Signed-off-by: Robert Love <rml@novell.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-07-27 10:46:12 -07:00
H. J. Lu 763b3917e7 [IA64] Fix a typo in arch/ia64/kernel/entry.S
Both 2.4 and 2.6 kernels need this patch for the next binutils.

Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-07-08 13:23:49 -07:00
Tony Luck 54522b6613 Auto merge with /home/aegl/GIT/ia64-test 2005-06-28 08:24:49 -07:00
Jens Axboe 22e2c507c3 [PATCH] Update cfq io scheduler to time sliced design
This updates the CFQ io scheduler to the new time sliced design (cfq
v3).  It provides full process fairness, while giving excellent
aggregate system throughput even for many competing processes.  It
supports io priorities, either inherited from the cpu nice value or set
directly with the ioprio_get/set syscalls.  The latter closely mimic
set/getpriority.

This import is based on my latest from -mm.

Signed-off-by: Jens Axboe <axboe@suse.de>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-27 14:33:29 -07:00
Martin Hicks 753ee72896 [PATCH] VM: early zone reclaim
This is the core of the (much simplified) early reclaim.  The goal of this
patch is to reclaim some easily-freed pages from a zone before falling back
onto another zone.

One of the major uses of this is NUMA machines.  With the default allocator
behavior the allocator would look for memory in another zone, which might be
off-node, before trying to reclaim from the current zone.

This adds a zone tuneable to enable early zone reclaim.  It is selected on a
per-zone basis and is turned on/off via syscall.

Adding some extra throttling on the reclaim was also required (patch
4/4).  Without the machine would grind to a crawl when doing a "make -j"
kernel build.  Even with this patch the System Time is higher on
average, but it seems tolerable.  Here are some numbers for kernbench
runs on a 2-node, 4cpu, 8Gig RAM Altix in the "make -j" run:

			wall  user   sys   %cpu  ctx sw.  sleeps
			----  ----   ---   ----   ------  ------
No patch		1009  1384   847   258   298170   504402
w/patch, no reclaim     880   1376   667   288   254064   396745
w/patch & reclaim       1079  1385   926   252   291625   548873

These numbers are the average of 2 runs of 3 "make -j" runs done right
after system boot.  Run-to-run variability for "make -j" is huge, so
these numbers aren't terribly useful except to seee that with reclaim
the benchmark still finishes in a reasonable amount of time.

I also looked at the NUMA hit/miss stats for the "make -j" runs and the
reclaim doesn't make any difference when the machine is thrashing away.

Doing a "make -j8" on a single node that is filled with page cache pages
takes 700 seconds with reclaim turned on and 735 seconds without reclaim
(due to remote memory accesses).

The simple zone_reclaim syscall program is at
http://www.bork.org/~mort/sgi/zone_reclaim.c

Signed-off-by: Martin Hicks <mort@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-21 18:46:14 -07:00
Tony Luck f2cbb4f019 Auto merge with /home/aegl/GIT/linus 2005-06-15 14:06:48 -07:00
Tony Luck 325a479c4c Merge with temp tree to get David's gdb inferior calls patch 2005-05-17 15:53:14 -07:00
David Mosberger-Tang bfd6859408 [IA64] Avoid .spillpsp directive in handcoded assembly
Some time ago, GAS was fixed to bring the .spillpsp directive in line
with the Intel assembler manual (there was some disagreement as to
whether or not there is a built-in 16-byte offset).  Unfortunately,
there are two places in the kernel where this directive is used in
handwritten assembly files and those of course relied on the "buggy"
behavior.  As a result, when using a "fixed" assembler, the kernel
picks up the UNaT bits from the wrong place (off by 16) and randomly
sets NaT bits on the scratch registers.  This can be noticed easily by
looking at a coredump and finding various scratch registers with
unexpected NaT values.  The patch below fixes this by using the
.spillsp directive instead, which works correctly no matter what
assembler is in use.

Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-05-10 13:52:00 -07:00
David Mosberger-Tang 9df6f705c0 [IA64] fix typos caught by new assembler
Patch below fixes 3 trivial typos which are caught by the new
assembler (v2.169.90).  Please apply.

[Note: fix to memcpy that was also part of this patch was separately
 applied from patches by H.J. and Andreas ... so the delta here only
 has the other two fixes. -Tony]

Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-05-03 10:56:42 -07:00
Stephen Rothwell 7d87e14c23 [PATCH] consolidate sys_shmat
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-05-01 08:59:12 -07:00
David Mosberger-Tang e7e965fa19 [IA64] use srlz.d instead of srlz.i in ia64_leave_kernel()
This patch switches the srlz.i in ia64_leave_kernel() to srlz.d.  As
per architecture manual, the former is needed only to ensure that the
clearing of PSR.IC is seen by the VHPT for subsequent instruction
fetches.  However, since the remainder of the code (up to and
including the RFI instruction) is mapped by a pinned TLB entry, there
is no chance of an iTLB miss and we don't care whether or not the VHPT
sees PSR.IC cleared.  Since srlz.d is substantially cheaper than
srlz.i, this should shave off a few cycles off the interrupt path
(unverified though; I'm not setup to measure this at the moment).

Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-04-27 21:22:08 -07:00
David Mosberger-Tang c03f058fbf [IA64] In ia64_leave_syscall(), fix comments and whitespace only.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-04-27 21:18:22 -07:00
David Mosberger-Tang 87e522a0f7 [IA64] Schedule ia64_leave_syscall() to read ar.bsp earlier
Reschedule code to read ar.bsp as early as possible.  To enable this,
don't bother clearing some of the registers when we're returning to
kernel stacks.  Also, instead of trying to support the pNonSys case
(which makes no sense), do a bugcheck instead (with break 0).  Finally,
remove a clear of r14 which is a left-over from the previous patch.

Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-04-27 21:17:44 -07:00
David Mosberger-Tang 96e017495e [IA64] On return from syscall, hint b7 with __kernel_syscall_via_epc().
Why is this a good idea?  Clearing b7 to 0 is guaranteed to do us no
good and writing it with __kernel_syscall_via_epc() yields a 6 cycle
improvement _if_ the application performs another EPC-based system-
call without overwriting b7, which is not all that uncommon.  Well
worth the minimal cost of 1 bundle of code.

Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-04-27 21:16:07 -07:00
David Mosberger-Tang 3c79c8b1d9 [IA64] Schedule fp-clearing insns at least 6 cycles after reading ar.bsp.
Decreases syscall overhead by approximately 6 cycles.

Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-04-27 21:15:13 -07:00
David Mosberger-Tang 9ec1a7ad43 [IA64] Use dynamic prediction for RSE-clearing branches.
This by itself is good for a 1-2 cycle speed up.  Effect is bigger
when combined with the later patches.

Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-04-27 21:13:33 -07:00
David Mosberger-Tang 06ef660816 [IA64] __ia64_syscall() is no longer used anywhere in the kernel. Remove it.
Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-04-27 21:10:45 -07:00
David Mosberger-Tang a37d98f6a9 [IA64] fix syscall-optimization goof
Sadly, I goofed in this syscall-tuning patch:

ChangeSet 1.1966.1.40 2005/01/22 13:31:05 davidm@hpl.hp.com
  [IA64] Improve ia64_leave_syscall() for McKinley-type cores.

  Optimize ia64_leave_syscall() a bit better for McKinley-type cores.
  The patch looks big, but that's mostly due to renaming r16/r17 to r2/r3.
  Good for a 13 cycle improvement.

The problem is that the size of the physical stacked registers was
loaded into the wrong register (r3 instead of r17).  Since r17 by
coincidence always had the value 1, this had the effect of turning
rse_clear_invalid into a no-op.  That poses the risk of leaking kernel
state back to user-land and is hence not acceptable.

The fix below is simple, but unfortunately it costs us about 28 cycles
in syscall overhead. ;-(

Unfortunately, there isn't much we can do about that since those
registers have to be cleared one way or another.

	--david

Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-04-25 13:20:38 -07:00
David Mosberger-Tang 30325d1771 [IA64] speed up syscall path a bit more
Recently I noticed that clearing ar.ssd/ar.csd right before srlz.d is
causing significant stalling in the syscall path.  The patch below
fixes that by moving the register-writes after srlz.d.  On a Madison,
this drops break-based getpid() from 241 to 226 cycles (-15 cycles).

Signed-off-by: David Mosberger-Tang <davidm@hpl.hp.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
2005-04-25 13:03:16 -07:00
Linus Torvalds 1da177e4c3 Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.

Let it rip!
2005-04-16 15:20:36 -07:00