This implements the same choices as the gc runtime, except that on
32-bit x86 we use the fence instruction only if the processor
supports SSE2.

The code here is hacked up for speed; the gc runtime uses straight
assembler.
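
As a rough illustration of the SSE2 gating described above, here is a
minimal C sketch. It is not the libgo code. The helper names
have_sse2 and fenced_ticks are hypothetical, and the use of lfence
ahead of rdtsc is an assumption made for the example; this message
only says "the fence instruction".

```c
#include <stdint.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>

/* Hypothetical helper: does the CPU advertise SSE2?
   (CPUID leaf 1, EDX bit 26, exposed by <cpuid.h> as bit_SSE2.) */
static int have_sse2(void)
{
#if defined(__x86_64__)
    return 1; /* SSE2 is part of the x86-64 baseline. */
#else
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx & bit_SSE2) != 0;
#endif
}

/* Hypothetical helper: read the time stamp counter, issuing a fence
   first only when SSE2 is available. lfence is an assumption here. */
static uint64_t fenced_ticks(void)
{
    uint32_t lo, hi;
    if (have_sse2())
        __asm__ __volatile__("lfence" ::: "memory");
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Non-x86 fallback so the sketch still compiles elsewhere. */
static int have_sse2(void) { return 0; }
static uint64_t fenced_ticks(void) { return 0; }
#endif
```

On a pre-SSE2 32-bit processor the sketch skips the fence rather than
faulting on an unsupported instruction. A real implementation would
likely cache the CPUID result instead of querying it on every call.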

Reviewed-on: https://go-review.googlesource.com/97715
From-SVN: r258336