ewx | amd64

I’ve not written a nontrivial amount of assembler in over 15 years, but have over the past few days started doing some for a future version of my Mandelbrot set program.

A few random thoughts:

I did most of my assembler programming on the 68K, so the AT&T syntax preferred by gas (i.e. op src,dest) seems natural to me. But the AMD documentation uses the Intel syntax and turning everything back to front is mindbending, so I quickly gave up on AT&T and switched to Intel syntax.
Fortunately there’s a .intel_syntax directive, and I did some Z80 long ago, so the Intel order isn’t wholly alien.
The MUL instruction’s a bit lame, isn’t it? The 68K could multiply any pair of registers (or in fact memory, for the source) and store the result in any register. Intel/AMD’s MUL fixes the destination and one of the inputs.
I was disappointed to find that there doesn’t seem to be a high precision integer multiply (128*128->256 would be just the ticket) in the SSEn instructions.
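
For illustration, here is a minimal sketch of how a 128*128->256 multiply might be assembled from the 64x64->128 multiply that MUL does provide. It assumes a compiler with the GCC/Clang `unsigned __int128` extension; the type and function names (`u128`, `u256`, `mul_128x128`) are made up for this sketch and are not taken from the Mandelbrot program.

```c
#include <stdint.h>

/* 128-bit and 256-bit values as little-endian arrays of 64-bit limbs.
   These type names are hypothetical, chosen for this sketch. */
typedef struct { uint64_t w[2]; } u128;
typedef struct { uint64_t w[4]; } u256;

/* Assumption: GCC/Clang extension, standing in for the RDX:RAX pair. */
typedef unsigned __int128 wide;

/* Schoolbook multiply: four 64x64->128 partial products, with the
   carry propagated one limb at a time. */
static u256 mul_128x128(u128 a, u128 b)
{
    u256 r = {{0, 0, 0, 0}};
    for (int i = 0; i < 2; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 2; j++) {
            /* Cannot overflow 128 bits:
               (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
            wide t = (wide)a.w[i] * b.w[j] + r.w[i + j] + carry;
            r.w[i + j] = (uint64_t)t;        /* low half stays in place */
            carry = (uint64_t)(t >> 64);     /* high half moves up a limb */
        }
        r.w[i + 2] = carry;
    }
    return r;
}
```

On x86-64 each `(wide)x * y` would typically compile down to a single one-operand MUL, so the cost is four multiplies plus the add-with-carry chain.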

From:

keirf.livejournal.com

MUL? MUL?! In my day we had to run a loop using DX and accumulate the answer in AX!

From:

simont

The MUL instruction is indeed feeble, but if I remember rightly there's a more recent (i.e. after the original 8086) IMUL which makes up for its shortcomings and has a sensibly diverse set of source and destination options. If you're targeting x86-64 only, you ought to be able to use that with a clear conscience.

From:

ewx.livejournal.com

Unfortunately IMUL is a signed multiply and I need unsigned.
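
A small sketch of where the signed and unsigned multiplies part company, assuming (as seems likely here) that the full double-width product is what is needed. The low 64 bits of the product are the same either way; only the high half differs. It relies on the GCC/Clang `__int128` extension, and the function names are invented for this illustration.

```c
#include <stdint.h>

/* Low 64 bits of the product: identical for signed and unsigned. */
static uint64_t mul_lo(uint64_t a, uint64_t b)
{
    return a * b;
}

/* High 64 bits of the unsigned product (what one-operand MUL leaves in RDX). */
static uint64_t mulhi_unsigned(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* High 64 bits of the signed product (what one-operand IMUL leaves in RDX).
   The product is converted to unsigned before shifting so the shift is
   well defined. */
static uint64_t mulhi_signed(int64_t a, int64_t b)
{
    return (uint64_t)((unsigned __int128)((__int128)a * b) >> 64);
}
```

For example, the bit pattern of all ones times 2 has a high half of 1 when read as unsigned (2^64-1) but a high half of all ones when read as signed (-1), while the low half is the same in both cases.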

From:

simont

Ah, drat. (Apparently I'm a bit rusty on this stuff myself. I used to know bloody everything about x86oids, though admittedly that was around the time IMUL was newfangled and weird.)

From:

simont

Operand order: I wavered between 68k and x86 in my childhood, so I've used both src,dest and dest,src order in assembly languages. Mostly I can get used to either reasonably quickly (or at least I could in the day – I might have more trouble now I've been writing extensive ARM assembler for twelve years solid), but one place where I really fell down is the compare instruction.

In x86 or ARM, with dest,src order, it's obvious how the compare instruction works with the subsequent conditional branch. If you CMP x,y and then branch-if-greater, it's clear that the 'greater' gets mentally inserted between the two compare operands, so you're branching if x > y. But on 68k, they flip the compare instruction's operands but don't flip the names of the branch conditions, so you always have to remember that comparing x,y followed by branch-if-greater means that you're branching if x is less than y.

Of course it makes sense if you're thinking of CMP as a trial subtraction, and in really complicated cases where you're abusing the condition codes to do fun things, that's the only way you can think of it. But for normal workaday code that isn't doing anything exciting, you really don't want to put your brain into that mode every time; you just want to say to yourself "now check if x > y and branch somewhere else if so", and then you want to translate that thought into a CMP and conditional branch in a basically trivial and mechanical manner, and mentally inverting the condition every time I did that was something I never got used to on 68k.
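
The trial-subtraction view can be sketched in C as a rough model. This is not real flag-level emulation of either CPU, just the signed "greater" condition (Z clear, N equal to V) applied to a 32-bit subtraction; all function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model a 32-bit CMP as the trial subtraction lhs - rhs and report whether
   a signed branch-if-greater would then be taken. */
static bool greater_after_sub(int32_t lhs, int32_t rhs)
{
    int64_t exact = (int64_t)lhs - (int64_t)rhs;      /* true difference */
    uint32_t wrapped = (uint32_t)exact;               /* 32-bit result */
    bool z = wrapped == 0;                            /* zero flag */
    bool n = (wrapped >> 31) != 0;                    /* negative flag */
    bool v = exact < INT32_MIN || exact > INT32_MAX;  /* overflow flag */
    return !z && (n == v);
}

/* x86 (Intel syntax) or ARM: CMP x,y subtracts y from x. */
static bool x86_cmp_then_jg(int32_t x, int32_t y)
{
    return greater_after_sub(x, y);   /* taken when x > y */
}

/* 68k: CMP x,y subtracts x from y. */
static bool m68k_cmp_then_bgt(int32_t x, int32_t y)
{
    return greater_after_sub(y, x);   /* taken when y > x, i.e. x < y */
}
```

With the same operands as written in the source, the two models take the branch in opposite cases, which is the mental inversion described above.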

From:

ewx.livejournal.com

I think I must have got used to it on the 68K as I don't recall much trouble with it.

From:

gerald_duck

IANAx86-programmer. However, my guess is that high-precision multiplies aren't provided because wide multiplies create very heavy data interdependencies and therefore need deep pipelining. RISC principles (which have fallen out of vogue, but do still inform design decisions) dictate that that's a bad candidate for a single instruction.

Hence SSE's shallow vectorisation: several narrow multiplies in parallel rather than wide multiplies. I'm guessing you're expected to parallelise your algorithm a smidgen — doing as much in parallel as possible without reaching divergent decision logic — so you can exploit the ability to do two or four narrow integer multiplies simultaneously.

Given you're playing with a Mandelbrot set, a prime candidate might be manipulating real and imaginary portions of a number in parallel?

(Or am I teaching my grandmother to suck eggs, here…)

From:

ewx.livejournal.com

PMULUDQ (two parallel 32x32->64) looks like the best available for integer work; not very compelling when I already have a 64x64->128. For floating point the situation is somewhat better but that's not what I'm after right now.
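
To make the comparison concrete, here is a sketch in portable scalar C (deliberately not SSE intrinsics) of building one 64x64->128 product out of 32x32->64 pieces, which is the size of multiply PMULUDQ offers. The function name `mul_64x64` is an assumption of this example.

```c
#include <stdint.h>

/* Full unsigned 64x64->128 product from four 32x32->64 partial products. */
static void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;   /* low and high halves of a */
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;   /* low and high halves of b */

    uint64_t p00 = a0 * b0;
    uint64_t p01 = a0 * b1;
    uint64_t p10 = a1 * b0;
    uint64_t p11 = a1 * b1;

    /* Middle column: three terms each below 2^32, so no 64-bit overflow. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

That is four partial products plus shifting and carry handling for a result that a single scalar MUL delivers directly, so two 32x32->64 multiplies per instruction would cover only half of the multiplications and none of the recombination.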

From:

mobbsy.livejournal.com

Some of these (http://en.wikipedia.org/wiki/Advanced_Vector_Extensions) might be what you're looking for ... and will be available in next year's CPUs.

From:

fanf

We were briefly confused yesterday when looking at some logs, before we realised that amd64 was not a CRSID, but rather a machine architecture from a user agent string...

(Having said that, amd64 has been allocated to someone who left three years ago.)