Comments (151)
- RiverCrochet: XOR is a simple logic-gate operation. SUB would have to be an ALU operation. A one-bit adder (which is subtraction in reverse) makes signals pass through two gates; see https://en.wikipedia.org/wiki/Adder_(electronics) for the circuit. You need the 2 gates for adding/subtracting because you care about carry. So if you're adding/subtracting 8 bits, 16 bits, or more, you're connecting multiples of these together, and that carry has to ripple through all the rest of the gates one-by-one. It can't be parallelized without extra circuitry, which increases your costs in other ways. Without the AND gate needed for carry, all the XORs can fire off at the same time. Even if you added the extra circuitry for a parallelizable add/subtract to make it as fast as XOR, your actual parallel XOR would still consume less power.
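To make the carry argument above concrete, here is a minimal Python sketch of a bit-serial subtractor next to a plain XOR. The function names `ripple_sub` and `parallel_xor` are made up for this illustration, and the "stages" counter is only a stand-in for gate delay, not a model of any real ALU.

```python
def ripple_sub(a: int, b: int, width: int = 8) -> tuple[int, int]:
    """Compute (a - b) mod 2**width with a chain of 1-bit full subtractors.

    Returns the result and the number of sequential stages the borrow
    signal had to pass through (one per bit in this naive design).
    """
    borrow = 0
    result = 0
    stages = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        diff = x ^ y ^ borrow
        # Borrow out of this bit depends on the borrow coming in,
        # so bit i cannot finish before bit i - 1 has finished.
        borrow = ((1 - x) & y) | ((1 - (x ^ y)) & borrow)
        result |= diff << i
        stages += 1
    return result, stages


def parallel_xor(a: int, b: int, width: int = 8) -> int:
    """Every output bit depends only on the two input bits at that position."""
    return (a ^ b) & ((1 << width) - 1)


if __name__ == "__main__":
    print(ripple_sub(0x5A, 0x5A))    # both forms zero the value
    print(parallel_xor(0x5A, 0x5A))
```

Both calls produce zero for identical operands; the difference is only that the subtractor's loop carries state from one bit to the next while the XOR does not.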
- Suzuran: On some of IBM's smaller processors, such as channel controllers and the CSP used in the midrange line prior to the System/38, the xor instruction had a special feature when used with identical source and destination: it would inhibit parity and/or ECC error checking on the read cycle, which meant that xor could be used to clear a register or memory location that had been stored with bad parity without taking a machine check or processor check.
- Sweepi: "Bonus bonus chatter: The xor trick doesn’t work for Itanium because mathematical operations don’t reset the NaT bit. Fortunately, Itanium also has a dedicated zero register, so you don’t need this trick. You can just move zero into your desired destination." Will remember for the next time I write asm for Itanium!
- NewCzech: The obvious answer is that XOR is faster. To do a subtract, you have to propagate the carry bit from the least-significant bit to the most-significant bit. In XOR you don't have to do that because the output of every bit is independent of the other adjacent bits. Probably, there are ALU pipeline designs where you don't pay an explicit penalty. But not all, and so XOR is faster. Surely, someone as awesome as Raymond Chen knows that. The answer is so obvious and basic I must be missing something myself?
- butterisgood: I recall thinking about these things quite a bit when reading Michael Abrash back in the 90s. How much of that advice applies to anything these days is questionable. Back then we used to squeeze as much as possible from every clock cycle. And cache misses weren’t great but the “front side bus” vs CPU clock difference wasn’t so insane either. RAM is “far away” now. So the stuff you optimize for has changed a bit. Always measure!
- matja: SUB has higher latency than XOR on some Intel CPUs. Latency (L) and throughput (T) measurements from the InstLatx64 project (https://github.com/InstLatx64/InstLatx64):
  | GenuineIntel | ArrowLake_08_LC | SUB r64, r64 | L: 0.26ns = 1.00c | T: 0.03ns = 0.135c |
  | GenuineIntel | ArrowLake_08_LC | XOR r64, r64 | L: 0.03ns = 0.13c | T: 0.03ns = 0.133c |
  | GenuineIntel | GoldmontPlus    | SUB r64, r64 | L: 0.67ns = 1.0 c | T: 0.22ns = 0.33 c |
  | GenuineIntel | GoldmontPlus    | XOR r64, r64 | L: 0.22ns = 0.3 c | T: 0.22ns = 0.33 c |
  | GenuineIntel | Denverton       | SUB r64, r64 | L: 0.50ns = 1.0 c | T: 0.17ns = 0.33 c |
  | GenuineIntel | Denverton       | XOR r64, r64 | L: 0.17ns = 0.3 c | T: 0.17ns = 0.33 c |
  I couldn't find any AMD chips where the same is true.
- drfuchs: Relatedly, there's a steganographic opportunity to hide info in machine code by using "XOR rax,rax" for a "zero" and "SUB rax,rax" for a "one" in your executable. Shouldn't be too hard to add a compiler feature to allow you to specify the string you want encoded into its output.
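A toy version of that encoding, working on assembly text rather than inside a real compiler, might look like the following. The `encode` and `decode` helpers and the most-significant-bit-first ordering are assumptions made for this sketch; the comment above does not specify either.

```python
ZERO = "xor rax, rax"  # carries a hidden 0 bit
ONE = "sub rax, rax"   # carries a hidden 1 bit


def encode(message: bytes) -> list[str]:
    """Emit one zeroing instruction per message bit, most significant bit first."""
    return [
        ONE if (byte >> bit) & 1 else ZERO
        for byte in message
        for bit in range(7, -1, -1)
    ]


def decode(instructions: list[str]) -> bytes:
    """Recover the hidden bytes, ignoring any instruction that is not a zeroing idiom."""
    bits = [1 if ins == ONE else 0 for ins in instructions if ins in (ZERO, ONE)]
    out = bytearray()
    # Only whole groups of 8 bits are turned back into bytes.
    for start in range(0, len(bits) - len(bits) % 8, 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    listing = encode(b"hi")
    print("\n".join(listing[:8]))
    print(decode(listing))
```

Each zeroing site carries exactly one bit, so a message of n bytes needs 8n places in the program where a register is cleared.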
- zahlman:
  > but xor took a slightly lead due to some fluke, perhaps because it felt more “clever”.
  Absolutely. But I can also imagine that it feels more like something that should be more efficient, because it's "a bit hack" rather than arithmetic. After all, it avoids all the "data dependencies" (carries, never mind the ALU is clocked to allow time for that regardless)! I imagine that a similar feeling is behind XOR swap.
  > Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.
  Network effects are much older than social media, then....
- b1temy: Back when I was in university, one of the units touching Assembly[0] required students to use subtraction to zero out the register instead of using the move instruction (which also worked), as it used fewer cycles. I looked it up afterwards and xor was also a valid instruction in that architecture to zero out a register, and used even fewer cycles than the subtraction method; but it was not listed in the subset of the assembly language instructions we were allowed to use for that unit. I suspect that it was deemed a bit off-topic, since you would need to explain what the mathematical XOR operation was (if you hadn't already learned about it in other units), when the unit was about something else entirely, but everyone knows what subtraction is, and that subtracting a number from itself gives zero.
  [0] Not x86; I do not recall the exact architecture.
- nopurpose: It amazes me how entertaining Raymond's writing on the most mundane aspects of computing often is.
- enduku: I ran into this rabbit hole while writing an x86-64 asm rewriter. xor was the default zeroing idiom. I only used sub reg,reg when I actually wanted its flags result. Otherwise the main rule is: do not touch either form unless flags liveness makes the rewrite obviously safe. I had about 40 such idioms for the passes.
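The "flags liveness" rule can be sketched for a single straight-line block as below. This is not the commenter's rewriter: the `Insn` record and `flags_live_after` helper are hypothetical, the flags register is treated as one unit instead of tracking individual flags, and control flow is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical instruction record: text plus coarse flag usage."""
    text: str
    reads_flags: bool = False
    writes_flags: bool = False


def flags_live_after(insns: list[Insn], index: int) -> bool:
    """Return True if the flags value at `index` may still be read later.

    Scans forward: a reader before any writer means the flags are live;
    a pure writer first means they are dead. Reaching the end of the block
    is treated as live, which is the conservative answer.
    """
    for insn in insns[index + 1:]:
        if insn.reads_flags:
            return True
        if insn.writes_flags:
            return False
    return True


def can_zero_with_xor(insns: list[Insn], index: int) -> bool:
    """`mov reg, 0` preserves flags and `xor reg, reg` clobbers them,
    so the swap is only allowed when nothing downstream reads the flags."""
    return not flags_live_after(insns, index)


if __name__ == "__main__":
    block = [
        Insn("cmp rdi, rsi", writes_flags=True),
        Insn("mov eax, 0"),
        Insn("jne somewhere", reads_flags=True),
    ]
    print(can_zero_with_xor(block, 1))  # the jne still needs cmp's flags
```

An instruction such as `adc` that both reads and writes flags counts as a reader here, because the read check comes first.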
- defrost:
  > Once an instruction has an edge, even if only extremely slight, that’s enough to tip the scales and rally everyone to that side.
  And this, interestingly, is why life on earth uses left-handed amino acids and right-handed sugars ... and why left-handed sugar is perfect for diet sodas.
- adrian_b: It should be noted that XOR is just (bitwise) subtraction modulo 2. There are many kinds of SUB instructions in the x86-64 ISA, which do subtraction modulo 2^64, modulo 2^32, modulo 2^16 or modulo 2^8. To produce a null result, any kind of subtraction can be used, and XOR is just a particular case of subtraction; it is not a different kind of operation. Unlike for bigger moduli, when operations are done modulo 2, addition and subtraction are the same, so XOR can be used for either addition modulo 2 or subtraction modulo 2.
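The modulo-2 claim is easy to check numerically. The sketch below, with illustrative helper names `bitwise_sub_mod2`, `bitwise_add_mod2`, and `sub_mod`, computes each bit position independently modulo 2 and compares against Python's built-in `^`.

```python
def bitwise_sub_mod2(a: int, b: int, width: int = 64) -> int:
    """Subtract bit by bit, each position taken modulo 2, with no borrows."""
    result = 0
    for i in range(width):
        bit = (((a >> i) & 1) - ((b >> i) & 1)) % 2
        result |= bit << i
    return result


def bitwise_add_mod2(a: int, b: int, width: int = 64) -> int:
    """Add bit by bit, each position taken modulo 2, with no carries."""
    result = 0
    for i in range(width):
        bit = (((a >> i) & 1) + ((b >> i) & 1)) % 2
        result |= bit << i
    return result


def sub_mod(a: int, b: int, width: int = 64) -> int:
    """Ordinary subtraction modulo 2**width, as a SUB instruction computes it."""
    return (a - b) % (1 << width)


if __name__ == "__main__":
    a, b = 0xDEADBEEF, 0x12345678
    print(bitwise_sub_mod2(a, b) == a ^ b)
    print(bitwise_add_mod2(a, b) == a ^ b)
    print(sub_mod(a, a), a ^ a)
```

Any of these operations applied to a value and itself gives zero, which is the only property the zeroing idiom relies on.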
- tliltocatl: It might be because XOR is rarely used for anything else (in terms of static count; dynamically it surely appears a lot in some hot loops), so it is easier to spot and identify as "special" if you are writing manual assembly.
- empiricus: The hardware implementation of xor is simpler than sub, so it should consume slightly less energy. I wonder how much energy was saved in the whole world by using xor instead of sub.
- anematode: My favorite (admittedly not super useful) trick in this domain is that sbb eax, eax breaks the dependency on the previous value of eax (just like xor and sub) and only depends on the carry flag. arm64 is less obtuse and just gives you csetm (special case of csinv) for this purpose.
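The arithmetic behind `sbb eax, eax` can be modeled in a few lines. This sketch covers only the architectural result (eax - eax - CF truncated to 32 bits); whether a given CPU actually treats the instruction as dependency-breaking is a microarchitectural question that it does not model. The names `sbb32` and `sbb_same` are invented for the example.

```python
MASK32 = 0xFFFFFFFF


def sbb32(dst: int, src: int, carry_in: int) -> tuple[int, int]:
    """Model `sbb dst, src` on 32-bit operands: dst - src - CF.

    Returns (result, carry_out), where carry_out is 1 if a borrow occurred.
    """
    full = dst - src - carry_in
    return full & MASK32, 1 if full < 0 else 0


def sbb_same(eax: int, carry_in: int) -> int:
    """`sbb eax, eax`: the old value of eax cancels out of the result."""
    return sbb32(eax, eax, carry_in)[0]


if __name__ == "__main__":
    for eax in (0, 1, 0x12345678, MASK32):
        print(hex(eax), hex(sbb_same(eax, 0)), hex(sbb_same(eax, 1)))
```

For every starting value the result is 0 when the carry flag is clear and 0xFFFFFFFF when it is set, which is the all-zeros or all-ones mask that csetm produces from a condition on arm64.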
- rasz: Looking at a random 1989 Zenith 386SX BIOS written in assembly, so purely programmer preference:
  8 'sub al, al', 14 'sub ah, ah', 3 'sub ax, ax'
  26 'xor al, al', 43 'xor ah, ah', 3 'xor ax, ax'
  Edit: checked a 2010 BIOS and found not a single 'sub x, x'.
- dreamcompiler: I vaguely remember we used the XOR trick on processors other than Intel, so it may not be Intel-specific. In principle, sub requires 4 steps:
  1. Move both operands to the ALU.
  2. Invert the second operand (two's complement conversion).
  3. Add (which internally is just XOR plus carry propagation).
  4. Move the result to the proper result register.
  This is absolutely not how modern processors do it in practice; there are many shortcuts, but at least with pure XOR you don't need two's complement conversion or carry propagation.
  Source: Wrote microcode at work a million years ago when designing a GPU.
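Steps 2 and 3 of that recipe amount to the identity a - b = a + ~b + 1 (modulo the register width). A word-level sketch, assuming a 32-bit register and using the made-up name `sub_via_invert_and_add`:

```python
def sub_via_invert_and_add(a: int, b: int, width: int = 32) -> int:
    """Subtract by inverting the second operand and adding with a carry-in of 1."""
    mask = (1 << width) - 1
    inverted = ~b & mask             # step 2: bitwise invert within the register width
    return (a + inverted + 1) & mask  # step 3: add, with the +1 completing two's complement


def plain_xor(a: int, b: int, width: int = 32) -> int:
    """XOR needs neither the inversion nor the add."""
    return (a ^ b) & ((1 << width) - 1)


if __name__ == "__main__":
    x = 0xCAFEBABE
    print(sub_via_invert_and_add(x, x))  # zeroes the value
    print(plain_xor(x, x))               # so does this, with less work
    print(sub_via_invert_and_add(10, 3))
```

The +1 is typically folded into the adder's carry input rather than performed as a separate addition, which is one of the shortcuts the comment alludes to.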
- jhoechtl: Back in the stone ages, XORing was just 1 byte of opcode. Habits stick. In effect, XORing has not been faster for a long time.