+---------------------------------------------------------------------------+
|  wm-FPU-emu   an FPU emulator for 80386 and 80486SX microprocessors.      |
|                                                                           |
|  Copyright (C) 1992,1993,1994,1995,1996,1997,1999                         |
|                       W. Metzenthen, 22 Parker St, Ormond, Vic 3163,      |
|                       Australia.  E-mail billm@melbpc.org.au              |
|                                                                           |
|    This program is free software; you can redistribute it and/or modify   |
|    it under the terms of the GNU General Public License version 2 as      |
|    published by the Free Software Foundation.                             |
|                                                                           |
|    This program is distributed in the hope that it will be useful,        |
|    but WITHOUT ANY WARRANTY; without even the implied warranty of         |
|    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the          |
|    GNU General Public License for more details.                           |
|                                                                           |
|    You should have received a copy of the GNU General Public License      |
|    along with this program; if not, write to the Free Software            |
|    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.              |
|                                                                           |
+---------------------------------------------------------------------------+

wm-FPU-emu is an FPU emulator for Linux. It is derived from wm-emu387
which was my 80387 emulator for early versions of djgpp (gcc under
msdos); wm-emu387 was in turn based upon emu387 which was written by
DJ Delorie for djgpp. The interface to the Linux kernel is based upon
the original Linux math emulator by Linus Torvalds.

My target FPU for wm-FPU-emu is that described in the Intel486
Programmer's Reference Manual (1992 edition). Unfortunately, numerous
facets of the functioning of the FPU are not well covered in the
Reference Manual. The information in the manual has been supplemented
with measurements on real 80486s. Even so, it is simply not possible
to be sure that all of the peculiarities of the 80486 have been
discovered, so there are always likely to be obscure differences
in the detailed behaviour of the emulator and a real 80486.

wm-FPU-emu does not implement all of the behaviour of the 80486 FPU,
but is very close. See "Limitations" later in this file for a list of
some differences.

Please report bugs, etc to me at:
       billm@melbpc.org.au
or     b.metzenthen@medoto.unimelb.edu.au

For more information on the emulator and on floating point topics, see
my web pages, currently at http://www.suburbia.net/~billm/


--Bill Metzenthen
  December 1999


----------------------- Internals of wm-FPU-emu -----------------------

Numeric algorithms:
(1) Add, subtract, and multiply. Nothing remarkable in these.
(2) Divide has been tuned to get reasonable performance. The algorithm
    is not the obvious one which most people seem to use, but is designed
    to take advantage of the characteristics of the 80386. I expect that
    it has been invented many times before I discovered it, but I have not
    seen it. It is based upon one of those ideas which one carries around
    for years without ever bothering to check it out.
(3) The sqrt function has been tuned to get good performance. It is based
    upon Newton's classic method. Performance was improved by capitalizing
    upon the properties of Newton's method, and the code is once again
    structured taking account of the 80386 characteristics.
(4) The trig, log, and exp functions are based in each case upon
    quasi-"optimal" polynomial approximations. My definition of "optimal"
    was based upon getting good accuracy with reasonable speed.
(5) The argument reducing code for the trig functions effectively uses
    a value of pi which is accurate to more than 128 bits. As a consequence,
    the reduced argument is accurate to more than 64 bits for arguments up
    to a few pi, and accurate to more than 64 bits for most arguments,
    even for arguments approaching 2^63. This is far superior to an
    80486, which uses a value of pi which is accurate to 66 bits.

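As an illustration of item (3), here is a minimal sketch of Newton's
method for a square root in portable C. It is not the emulator's code:
the function name newton_sqrt, the use of double rather than the 64-bit
extended significand, and the choice of starting guess are all
assumptions made for the example.

```c
#include <assert.h>
#include <float.h>

/* Newton's iteration for sqrt(x):  y <- (y + x/y) / 2.
 * Near the root each step roughly doubles the number of correct bits. */
double newton_sqrt(double x)
{
    assert(x >= 0.0 && x <= DBL_MAX);   /* finite, non-negative input only */
    if (x == 0.0)
        return 0.0;

    /* Hypothetical starting guess: any value >= sqrt(x) works, because
     * the iteration then decreases monotonically towards the root. */
    double y = (x > 1.0) ? x : 1.0;

    for (;;) {
        double next = 0.5 * (y + x / y);
        if (next >= y)      /* no further decrease: converged */
            break;
        y = next;
    }
    return y;
}
```

A poor starting guess only costs extra early iterations; once the
estimate is close, the doubling of correct bits per step is the property
of Newton's method that a tuned implementation can exploit.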
The code of the emulator is complicated slightly by the need to
account for a limited form of re-entrancy. Normally, the emulator will
emulate each FPU instruction to completion without interruption.
However, it may happen that when the emulator is accessing the user
memory space, swapping may be needed. In this case the emulator may be
temporarily suspended while disk i/o takes place. During this time
another process may use the emulator, thereby perhaps changing static
variables. The code which accesses user memory is confined to five
files:
    fpu_entry.c
    reg_ld_str.c
    load_store.c
    get_address.c
    errors.c
As of version 1.12 of the emulator, no static variables are used
(apart from those in the kernel's per-process tables). The emulator is
therefore now fully re-entrant, rather than having just the restricted
form of re-entrancy which is required by the Linux kernel.

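The hazard described above can be shown with a small C sketch. All of
the names here (shared_top, push_static, struct fpu_ctx, push_ctx) are
hypothetical and are not taken from the emulator source; the example
only contrasts state held in a static variable with state held in a
per-process structure.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_DEPTH 8   /* the x87 register stack has eight entries */

/* Non-re-entrant style: a single static variable shared by all callers. */
static int shared_top;

int push_static(void)
{
    shared_top = (shared_top + STACK_DEPTH - 1) % STACK_DEPTH;
    return shared_top;
}

/* Re-entrant style: all mutable state lives in a per-process context. */
struct fpu_ctx {
    int top;            /* register-stack top for this process only */
};

int push_ctx(struct fpu_ctx *ctx)
{
    assert(ctx != NULL);
    ctx->top = (ctx->top + STACK_DEPTH - 1) % STACK_DEPTH;
    return ctx->top;
}
```

If one process is suspended between two calls and another process runs
in the meantime, push_static lets the second process disturb the value
the first one sees on resuming, whereas each fpu_ctx is unaffected by
calls made on behalf of other processes.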
----------------------- Limitations of wm-FPU-emu -----------------------

There are a number of differences between the current wm-FPU-emu
(version 2.01) and the 80486 FPU (apart from bugs). The differences
are fewer than those which applied to the 1.xx series of the emulator.
Some of the more important differences are listed below:

The Roundup flag does not have much meaning for the transcendental
functions and its 80486 value with these functions is likely to differ
from its emulator value.

In a few rare cases the Underflow flag obtained with the emulator will
be different from that obtained with an 80486. This occurs when the
following conditions apply simultaneously:
(a) the operands have a higher precision than the current setting of the
    precision control (PC) flags.
(b) the underflow exception is masked.
(c) the magnitude of the exact result (before rounding) is less than 2^-16382.
(d) the magnitude of the final result (after rounding) is exactly 2^-16382.
(e) the magnitude of the exact result would be exactly 2^-16382 if the
    operands were rounded to the current precision before the arithmetic
    operation was performed.
If all of these apply, the emulator will set the Underflow flag but a real
80486 will not.

NOTE: Certain formats of Extended Real are UNSUPPORTED. They are
unsupported by the 80486. They are the Pseudo-NaNs, Pseudoinfinities,
and Unnormals. None of these will be generated by an 80486 or by the
emulator. Do not use them. The emulator treats them differently in
detail from the way an 80486 does.

Self modifying code can cause the emulator to fail. An example of such
code is:
          movl %esp,(%ebx)
          fld1
The FPU instruction may be (usually will be) loaded into the prefetch
queue of the CPU before the movl instruction is executed. If the
destination of the 'movl' overlaps the FPU instruction then the bytes
in the prefetch queue and memory will be inconsistent when the FPU
instruction is executed. The emulator will be invoked but will not be
able to find the instruction which caused the device-not-present
exception. For this case, the emulator cannot emulate the behaviour of
an 80486DX.

Handling of the address size override prefix byte (0x67) has not been
extensively tested yet. A major problem exists because using it in
vm86 mode can cause a general protection fault. Address offsets
greater than 0xffff appear to be illegal in vm86 mode but are quite
acceptable (and work) in real mode. A small test program developed to
check the addressing, and which runs successfully in real mode,
crashes dosemu under Linux and also brings Windows down with a general
protection fault message when run under the MS-DOS prompt of Windows
3.1. (The program simply reads data from a valid address.)

The emulator supports 16-bit protected mode, with one difference from
an 80486DX. An 80486DX will allow some floating point instructions to
write a few bytes below the lowest address of the stack. The emulator
will not allow this in 16-bit protected mode: no instructions are
allowed to write outside the bounds set by the protection.

----------------------- Performance of wm-FPU-emu -----------------------

Speed.
-----

The speed of floating point computation with the emulator will depend
upon instruction mix. Relative performance is best for the instructions
which require most computation. The simple instructions are adversely
affected by the FPU instruction trap overhead.


Timing: Some simple timing tests have been made on the emulator functions.
The times include load/store instructions. All times are in microseconds
measured on a 33MHz 386 with 64k cache. The Turbo C tests were under
ms-dos, the next two columns are for emulators running with the djgpp
ms-dos extender. The final column is for wm-FPU-emu in Linux 0.97,
using libm4.0 (hard).

function    Turbo C        djgpp 1.06        WM-emu387      wm-FPU-emu

+           60.5           154.8             76.5           139.4
-           61.1-65.5      157.3-160.8       76.2-79.5      142.9-144.7
*           71.0           190.8             79.6           146.6
/           61.2-75.0      261.4-266.9       75.3-91.6      142.2-158.1

sin()       310.8          4692.0            319.0          398.5
cos()       284.4          4855.2            308.0          388.7
tan()       495.0          8807.1            394.9          504.7
atan()      328.9          4866.4            601.1          419.5-491.9

sqrt()      128.7          crashed           145.2          227.0
log()       413.1-419.1    5103.4-5354.21    254.7-282.2    409.4-437.1
exp()       479.1          6619.2            469.1          850.8


The performance under Linux is improved by the use of look-ahead code.
The following results show the improvement which is obtained under
Linux due to the look-ahead code. Also given are the times for the
original Linux emulator with the 4.1 'soft' lib.

 [ Linus' note: I changed look-ahead to be the default under linux, as
   there was no reason not to use it after I had edited it to be
   disabled during tracing ]

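This file does not say how the look-ahead works. The sketch below
assumes the usual approach: after emulating one FPU instruction, the
emulator peeks at the next instruction and, if it is also an FPU
instruction (first opcode byte in the escape range 0xD8 to 0xDF),
emulates it at once instead of returning and taking another trap. The
function names and the one-byte-per-instruction model are hypothetical
simplifications; real instructions vary in length and may carry
prefixes.

```c
#include <assert.h>
#include <stddef.h>

/* x87 instructions begin with an escape opcode in the range 0xD8..0xDF. */
static int is_fpu_opcode(unsigned char op)
{
    return op >= 0xD8 && op <= 0xDF;
}

/* Count how often the emulator is entered through a trap for a stream of
 * instructions, each represented only by its first opcode byte. */
size_t count_traps(const unsigned char *ops, size_t n, int lookahead)
{
    assert(ops != NULL || n == 0);

    size_t traps = 0;
    size_t i = 0;
    while (i < n) {
        if (!is_fpu_opcode(ops[i])) {
            i++;                    /* ordinary instruction: no trap */
            continue;
        }
        traps++;                    /* the CPU traps into the emulator */
        i++;                        /* emulate the trapping instruction */
        while (lookahead && i < n && is_fpu_opcode(ops[i]))
            i++;                    /* emulate the next one without a new trap */
    }
    return traps;
}
```

For a run of three FPU instructions followed by an integer instruction
and one more FPU instruction, this model counts four traps without
look-ahead and two with it, which is the kind of saving on trap
overhead that benefits the simple instructions most.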
            wm-FPU-emu with    original with
            look-ahead         'soft' lib

+           106.4              190.2
-           108.6-111.6        192.4-216.2
*           113.4              193.1
/           108.8-124.4        700.1-706.2

sin()       390.5              2642.0
cos()       381.5              2767.4
tan()       496.5              3153.3
atan()      367.2-435.5        2439.4-3396.8

sqrt()      195.1              4732.5
log()       358.0-387.5        3359.2-3390.3
exp()       619.3              4046.4


These figures are now somewhat out-of-date. The emulator has become
progressively slower for most functions as more of the 80486 features
have been implemented.


----------------------- Accuracy of wm-FPU-emu -----------------------


The accuracy of the emulator is in almost all cases equal to or better
than that of an Intel 80486 FPU.

The results of the basic arithmetic functions (+,-,*,/), and fsqrt
match those of an 80486 FPU. They are the best possible; the error for
these never exceeds 1/2 an lsb. The fprem and fprem1 instructions
return exact results; they have no error.


The following table compares the emulator accuracy for the sqrt(),
trig and log functions against the Turbo C "emulator". For this table,
each function was tested at about 400 points. Ideal worst-case results
would be 64 bits. The reduced Turbo C accuracy of cos() and tan() for
arguments greater than pi/4 can be thought of as being related to the
precision of the argument x; e.g. an argument of pi/2-(1e-10) which is
accurate to 64 bits can result in a relative accuracy in cos() of
about 64 + log2(cos(x)) = 31 bits.


Function      Tested x range            Worst result               Turbo C
                                        (relative bits)

sqrt(x)       1 .. 2                    64.1                       63.2
atan(x)       1e-10 .. 200              64.2                       62.8
cos(x)        0 .. pi/2-(1e-10)         64.4 (x <= pi/4)           62.4
                                        64.1 (x = pi/2-(1e-10))    31.9
sin(x)        1e-10 .. pi/2             64.0                       62.8
tan(x)        1e-10 .. pi/2-(1e-10)     64.0 (x <= pi/4)           62.1
                                        64.1 (x = pi/2-(1e-10))    31.9
exp(x)        0 .. 1                    63.1 **                    62.9
log(x)        1+1e-6 .. 2               63.8 **                    62.1

** The accuracy for exp() and log() is low because the FPU (emulator)
does not compute them directly; two operations are required.


The emulator passes the "paranoia" tests (compiled with gcc 2.3.3 or
later) for 'float' variables (24 bit precision numbers) when precision
control is set to 24, 53 or 64 bits, and for 'double' variables (53
bit precision numbers) when precision control is set to 53 bits (a
properly performing FPU cannot pass the 'paranoia' tests for 'double'
variables when precision control is set to 64 bits).

The code for reducing the argument for the trig functions (fsin, fcos,
fptan and fsincos) has been improved and now effectively uses a value
for pi which is accurate to more than 128 bits precision. As a
consequence, the accuracy of these functions for large arguments has
been dramatically improved (and is now very much better than an 80486
FPU). There is also now no degradation of accuracy for fcos and fptan
for operands close to pi/2. Measured results are (note that the
definition of accuracy has changed slightly from that used for the
above table):

Function      Tested x range          Worst result
                                      (absolute bits)

cos(x)        0 .. 9.22e+18           62.0
sin(x)        1e-16 .. 9.22e+18       62.1
tan(x)        1e-16 .. 9.22e+18       61.8

It is possible with some effort to find very large arguments which
give much degraded precision. For example, the integer number
           8227740058411162616.0
is within about 1e-7 of a multiple of pi. To find the tan (for
example) of this number to 64 bits precision it would be necessary to
have a value of pi which had about 150 bits precision. The FPU
emulator computes the result to about 42.6 bits precision (the correct
result is about -9.739715e-8). On the other hand, an 80486 FPU returns
0.01059, which in relative terms is hopelessly inaccurate.

For arguments close to critical angles (which occur at multiples of
pi/2) the emulator is more accurate than an 80486 FPU. For very large
arguments, the emulator is far more accurate.


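The benefit of carrying extra bits of pi through the argument reduction
can be sketched in C with the common technique of splitting the constant
into a short head and a tail (often called Cody-Waite reduction). This
is not the emulator's algorithm, which works with more than 128 bits of
pi. The function name reduce_pio2 and the restriction to 0 <= x < 1e6
are assumptions for the example; the constants are the double-precision
values used by fdlibm for pi/2.

```c
#include <assert.h>

/* pi/2 split into a head with 33 significant bits and a tail. */
static const double PIO2_HI     = 1.57079632673412561417e+00;
static const double PIO2_LO     = 6.07710050650619224932e-11;
static const double TWO_OVER_PI = 6.36619772367581382433e-01;

/* Reduce x to r, roughly in [-pi/4, pi/4], with x = k*(pi/2) + r. */
double reduce_pio2(double x, long *k_out)
{
    assert(k_out != 0);
    assert(x >= 0.0 && x < 1.0e6);      /* keeps k below 2^20 */

    long k = (long)(x * TWO_OVER_PI + 0.5);     /* nearest multiple of pi/2 */

    /* k * PIO2_HI is exact because the head has only 33 significant
     * bits and k has at most 20; the tail then supplies the bits of
     * pi/2 that the head leaves out. */
    double r = (x - (double)k * PIO2_HI) - (double)k * PIO2_LO;

    *k_out = k;
    return r;
}
```

With a single rounded constant the error in pi is multiplied by k and
soon dominates the reduced argument; splitting the constant postpones
that loss, and a longer value of pi postpones it further.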
Prior to version 1.20 of the emulator, the accuracy of the results for
the transcendental functions (in their principal range) was not as
good as the results from an 80486 FPU. From version 1.20, the accuracy
has been considerably improved and these functions now give measured
worst-case results which are better than the worst-case results given
by an 80486 FPU.

The following table gives the measured results for the emulator. The
number of randomly selected arguments in each case is about half a
million. The group of three columns gives the frequency of the given
accuracy in number of times per million, thus the second of these
columns shows that an accuracy of between 63.80 and 63.89 bits was
found at a rate of 133 times per one million measurements for fsin.
The results show that the fsin, fcos and fptan instructions return
results which are in error (i.e. less accurate than the best possible
result (which is 64 bits)) for about one per cent of all arguments
between -pi/2 and +pi/2. The other instructions have a lower
frequency of results which are in error. The last two columns give
the worst accuracy which was found (in bits) and the approximate value
of the argument which produced it.

                                  frequency (per M)
                                 -------------------   ---------------
instr    arg range     # tests   63.7   63.8    63.9   worst   at arg
                                 bits   bits    bits    bits
-----    ------------  -------   ----   ----   -----   -----  --------
fsin     (0,pi/2)       547756      0    133   10673   63.89  0.451317
fcos     (0,pi/2)       547563      0    126   10532   63.85  0.700801
fptan    (0,pi/2)       536274     11    267   10059   63.74  0.784876
fpatan   4 quadrants    517087      0      8    1855   63.88  0.435121 (4q)
fyl2x    (0,20)         541861      0      0    1323   63.94  1.40923  (x)
fyl2xp1  (-.293,.414)   520256      0      0    5678   63.93  0.408542 (x)
f2xm1    (-1,1)         538847      4    481    6488   63.79  0.167709


Tests performed on an 80486 FPU showed results of lower accuracy. The
following table gives the results which were obtained with an AMD
486DX2/66 (other tests indicate that an Intel 486DX produces
identical results). The tests were basically the same as those used
to measure the emulator (the values, being random, were in general not
the same). The total number of tests for each instruction is given
at the end of the table; in each case about 100k tests were performed.
Another line of figures at the end of the table shows that most of the
instructions return results which are in error for more than 10
percent of the arguments tested.

The numbers in the body of the table give the approximate number of
times a result of the given accuracy in bits (given in the left-most
column) was obtained per one million arguments. For three of the
instructions, two columns of results are given:
  * The second column for f2xm1 gives the number of cases where the
    results of the first column were for a positive argument; this
    shows that this instruction gives better results for positive
    arguments than it does for negative.
  * In the cases of fcos and fptan, the first column gives the results
    when all cases where arguments greater than 1.5 were removed from
    the results given in the second column.
Unlike the emulator, an 80486 FPU returns results of relatively poor
accuracy for these instructions when the argument approaches pi/2. The
table does not show those cases when the accuracy of the results was
less than 62 bits, which occurs quite often for fsin and fptan when the
argument approaches pi/2. This poor accuracy is discussed above in
relation to the Turbo C "emulator", and the accuracy of the value of pi.


bits   f2xm1   f2xm1  fpatan    fcos    fcos   fyl2x fyl2xp1    fsin   fptan   fptan
62.0       0       0       0       0     437       0       0       0       0     925
62.1       0       0      10       0     894       0       0       0       0    1023
62.2      14       0       0       0    1033       0       0       0       0     945
62.3      57       0       0       0    1202       0       0       0       0    1023
62.4     385       0       0      10    1292       0      23       0       0    1178
62.5    1140       0       0     119    1649       0      39       0       0    1149
62.6    2037       0       0     189    1620       0      16       0       0    1169
62.7    5086      14       0     646    2315      10     101      35      39    1402
62.8    8818      86       0     984    3050      59     287     131     224    2036
62.9   11340    1355       0    2126    4153      79     605     357     321    1948
63.0   15557    4750       0    3319    5376     246    1281     862     808    2688
63.1   20016    8288       0    4620    6628     511    2569    1723    1510    3302
63.2   24945   11127      10    6588    8098    1120    4470    2968    2990    4724
63.3   25686   12382      69    8774   10682    1906    6775    4482    5474    7236
63.4   29219   14722      79   11109   12311    3094    9414    7259    8912   10587
63.5   30458   14936     393   13802   15014    5874   12666    9609   13762   15262
63.6   32439   16448    1277   17945   19028   10226   15537   14657   19158   20346
63.7   35031   16805    4067   23003   23947   18910   20116   21333   25001   26209
63.8   33251   15820    7673   24781   25675   24617   25354   24440   29433   30329
63.9   33293   16833   18529   28318   29233   31267   31470   27748   29676   30601

Per cent with error:
        30.9             3.2            18.5     9.8    13.1    11.6            17.4
Total arguments tested:
       70194   70099  101784  100641  100641  101799  128853  114893  102675  102675


------------------------- Contributors -------------------------------

A number of people have contributed to the development of the
emulator, often by just reporting bugs, sometimes with suggested
fixes, and a few kind people have provided me with access in one way
or another to an 80486 machine. Contributors include (to those people
who I may have forgotten, please forgive me):

        Linus Torvalds
        Tommy.Thorn@daimi.aau.dk
        Andrew.Tridgell@anu.edu.au
        Nick Holloway, alfie@dcs.warwick.ac.uk
        Hermano Moura, moura@dcs.gla.ac.uk
        Jon Jagger, J.Jagger@scp.ac.uk
        Lennart Benschop
        Brian Gallew, geek+@CMU.EDU
        Thomas Staniszewski, ts3v+@andrew.cmu.edu
        Martin Howell, mph@plasma.apana.org.au
        M Saggaf, alsaggaf@athena.mit.edu
        Peter Barker, PETER@socpsy.sci.fau.edu
        tom@vlsivie.tuwien.ac.at
        Dan Russel, russed@rpi.edu
        Daniel Carosone, danielce@ee.mu.oz.au
        cae@jpmorgan.com
        Hamish Coleman, t933093@minyos.xx.rmit.oz.au
        Bruce Evans, bde@kralizec.zeta.org.au
        Timo Korvola, Timo.Korvola@hut.fi
        Rick Lyons, rick@razorback.brisnet.org.au
        Rick, jrs@world.std.com

...and numerous others who responded to my request for help with
a real 80486.