ewx | amd64

I’ve not written a nontrivial amount of assembler in over 15 years, but have over the past few days started doing some for a future version of my Mandelbrot set program.

A few random thoughts:

I did most of my assembler programming on the 68K, so the AT&T syntax preferred by gas (i.e. op src,dest) seems natural to me. But the AMD documentation uses the Intel syntax (op dest,src), and mentally turning every instruction back to front is mindbending, so I quickly gave up on AT&T and switched to Intel syntax.
Fortunately there’s a .intel_syntax directive, and I did some Z80 long ago, so the Intel order isn’t wholly alien.
The MUL instruction’s a bit lame, isn’t it? The 68K could multiply any pair of registers (or in fact memory, for the source) and store the result in any register. Intel/AMD’s MUL fixes the destination and one of the inputs.
I was disappointed to find that there doesn’t seem to be a high precision integer multiply (128*128->256 would be just the ticket) in the SSEn instructions.
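
To make the wished-for operation concrete, here is a minimal C sketch of a 128*128->256 multiply built from four 64*64->128 products, which is roughly what has to be done by hand in the absence of a wider instruction. The type names `u128` and `u256` and the helpers `mul64` and `mul128` are made up for illustration, and `unsigned __int128` is a GCC/Clang extension rather than standard C; on amd64 those compilers typically turn it into a single MUL, with one input in RAX and the result in RDX:RAX.

```c
#include <stdint.h>

/* Hypothetical types for this sketch: little-endian 64-bit limbs, w[0] lowest. */
typedef struct { uint64_t w[2]; } u128;
typedef struct { uint64_t w[4]; } u256;

/* 64x64->128 multiply. unsigned __int128 is a GCC/Clang extension (an
   assumption of this sketch); on amd64 it usually compiles to one MUL. */
static inline void mul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}

/* Schoolbook 128x128->256: four partial products, carries propagated by hand. */
static u256 mul128(u128 a, u128 b)
{
    u256 r = {{0, 0, 0, 0}};
    for (int i = 0; i < 2; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 2; j++) {
            uint64_t hi, lo;
            mul64(a.w[i], b.w[j], &hi, &lo);
            /* Accumulate the low half plus the incoming carry into limb i+j. */
            unsigned __int128 t = (unsigned __int128)r.w[i + j] + lo + carry;
            r.w[i + j] = (uint64_t)t;
            /* Cannot overflow: product + limb + carry always fits in 128 bits. */
            carry = hi + (uint64_t)(t >> 64);
        }
        r.w[i + 2] = carry;
    }
    return r;
}
```

Calling `mul128` on two values whose limbs are all ones is a handy check of the carry handling, since every partial product is then at its maximum.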

From:

gerald_duck

IANAx86-programmer. However, my guess is that high-precision multiplies aren't provided because wide multiplies create very heavy data interdependencies and therefore need deep pipelining. RISC principles (which have fallen out of vogue, but do still inform design decisions) dictate that that's a bad candidate for a single instruction.

Hence SSE's shallow vectorisation: several narrow multiplies in parallel rather than wide multiplies. I'm guessing you're expected to parallelise your algorithm a smidgen — doing as much in parallel as possible without reaching divergent decision logic — so you can exploit the ability to do two or four narrow integer multiplies simultaneously.

Given you're playing with a Mandelbrot set, a prime candidate might be manipulating real and imaginary portions of a number in parallel?

(Or am I teaching my grandmother to suck eggs, here…)

From:

ewx.livejournal.com

PMULUDQ (two parallel 32x32->64) looks like the best available for integer work; not very compelling when I already have a 64x64->128. For floating point the situation is somewhat better but that's not what I'm after right now.
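
For reference, a small C sketch of what PMULUDQ computes, using the SSE2 intrinsic `_mm_mul_epu32` that maps to it. The function name `mul32x2` is made up for illustration, and the scalar branch is only a portable stand-in for builds without SSE2. Each 64-bit lane contributes just its low 32 bits, so a single instruction yields two 32x32->64 products.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; SSE2 is baseline on amd64 */
#endif

/* Hypothetical helper: out[k] = low32(a[k]) * low32(b[k]) for k = 0, 1. */
static void mul32x2(const uint64_t a[2], const uint64_t b[2], uint64_t out[2])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vp = _mm_mul_epu32(va, vb);      /* PMULUDQ: two 32x32->64 */
    _mm_storeu_si128((__m128i *)out, vp);
#else
    /* Scalar fallback with the same result, for non-SSE2 targets. */
    out[0] = (a[0] & 0xFFFFFFFFu) * (b[0] & 0xFFFFFFFFu);
    out[1] = (a[1] & 0xFFFFFFFFu) * (b[1] & 0xFFFFFFFFu);
#endif
}
```

The upper 32 bits of each input lane are ignored, which is why this is two narrow multiplies rather than one wide one.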