4 years ago by svat
If you've never seen it before, this is a beautiful blog post on related matters: "Optimising pointer subtraction with 2-adic integers" http://blog.sigfpe.com/2010/05/optimising-pointer-subtractio...
4 years ago by kevinventullo
A few months back I wrote a blog post very much inspired by Dan's post, though a little more academic: https://kevinventullo.com/2020/12/21/2-adic-logarithms-and-f...
The idea is to exploit deeper structures in p-adic number theory to develop useful and novel algorithms with unsigned integer arithmetic, e.g. fast integer exponentiation. Feedback welcome!
(Warning: math heavy)
4 years ago by svat
This was a nice post, thank you! I don't yet have an intuition for how to think about these 2-adic exponentials and logs (e.g. what does it mean to say that log_2(5) = 6713115954038056572?), but clearly the mathematics works out and even leads to faster code, so it's very cool and I hope to read it again and understand it better.
(As a sidenote on the motivation, I do wonder how common/useful it is to compute x^y mod (2^64) for random 64-bit integers x and y (so they'd be close to 2^63 on average)... from this perspective, the promised future post about computing the k-th root of a perfect k-th power integer sounds very interesting!)
4 years ago by kevinventullo
Thanks, that serves as good motivation to complete that post!
As far as the statement that
log_2(5) = 6713115954038056572,
really that number is only an approximation; it is the last 64 bits of the "true" log_2(5), which has infinitely many bits. That is, the above equation in binary would be log_2(5) = 101110100101001110001110110010010000000011100100010011001111100.
In the true 2-adic world, log_2(5) = ...101110100101001110001110110010010000000011100100010011001111100,
where there are infinitely many leading bits. (Classical positive integers embed into the 2-adic world as numbers with all but finitely many bits set to 0, and similarly classical negative integers have all but finitely many bits set to 1.)
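A tiny C illustration of that embedding, truncated to 64 bits (my own example, not from the post): printed as unsigned 64-bit values, a classical positive integer has finitely many 1 bits, while a classical negative integer has finitely many 0 bits.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t five = 5;                   /* ...00000101: finitely many 1 bits */
    uint64_t minus_three = (uint64_t)-3; /* ...11111101: finitely many 0 bits */
    printf("%016llx\n", (unsigned long long)five);        /* 0000000000000005 */
    printf("%016llx\n", (unsigned long long)minus_three); /* fffffffffffffffd */
    return 0;
}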
4 years ago by thechao
> I don't know how gcc generates its approximate 2-adic reciprocals. Possibly it uses something based on the Euclidean GCD algorithm. I wasn't able to find the precise line of source in a reasonable time.
It's 0x1FFF...F / 7 = 0x249...249249249.
Just about the oddest thing I've seen; I guess, if I were the author, it'd be my "1 in 10000" day?
4 years ago by xyzzy_plugh
Makes more sense in binary:
0xFFF..F 111111111111111111111111
0x7 000000000000000000000111
0x249..9 001001001001001001001001
4 years ago by svat
I think you're trying to explain what's special about the number -1227133513 or how it comes about / how one could arrive at it. This is something the author understands very well; in fact that's what the whole post is about (well, it's about more advanced matters, using this as a "hook"), and many nice methods to arrive at that number are discussed in the post and comments. (And BTW your computation corresponds to 1227133513 or -1/7, not to -1227133513 or 1/7.) The sentence you quoted is merely asking which of the many discussed methods gcc is actually using, and this cannot be answered by mathematics; it needs to be answered by pointing to the relevant source code.
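(If you want to see that correspondence concretely, here's a quick check in C, my own sketch: multiplying 1227133513 by 7 wraps around to 0xFFFFFFFF, i.e. -1 mod 2^32, so 1227133513 behaves as -1/7 there, and its negation behaves as 1/7.)

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t x = 1227133513u;                    /* 0x49249249 */
    printf("%08x\n", (unsigned)(x * 7u));        /* ffffffff, i.e. -1 mod 2^32 */
    printf("%08x\n", (unsigned)((0u - x) * 7u)); /* 00000001, so -x acts as 1/7 */
    return 0;
}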
4 years ago by spoonjim
For a total n00b, could you explain why subtracting a char is a division?
"It doesn't seem like there could be much to say about that. The A structure is 7 bytes long so the subtraction implicitly divides by 7. That's about it."
4 years ago by svat
(Note that the subtraction is between two pointers to "struct A", not two chars. In the snippet "char" is used simply because it's 1 byte; one could also use int8_t, assuming a byte is 8 bits long.)
It's a fact about how the C language does pointer arithmetic.
For instance, suppose you have an array of many 64-bit integers, and a pointer to one of the elements of the array:
int64_t arr[100];
int64_t *p;
p = &arr[50];
Now, the value of p is a certain address in memory. As each element of the array takes 8 bytes (64 bits = 8 bytes), the address of the next element of the array is 8 bytes past the location of p. In some languages (including assembly), you have to refer to this address by manually adding 8 to the address stored in p. In C, to refer to the next element of the array, you write "p + 1", and the compiler translates that into adding 8 bytes. Conversely, if you give the new pointer a name:
int64_t *q;
q = p + 1;
then you expect (q - p) to be 1, not 8 (even though the value stored in q is 8 more than the value stored in p). That is, when subtracting pointers (of the same type), the result implicitly divides by the size of the type. In the code example in the blog post, as the size of "struct A" is 7 bytes, subtracting two pointers to A will subtract the memory addresses and divide by 7.
If that wasn't clear, see this sentence on Wikipedia: https://en.wikipedia.org/w/index.php?title=Pointer_(computer... or the answers to this question: https://stackoverflow.com/questions/3238482/pointer-subtract... or play with it yourself here: https://godbolt.org/z/M6PEWYeYx
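If a stand-alone example helps, here's a minimal sketch along the lines of the post's snippet (my own code, assuming a struct whose size is 7 bytes):

#include <stdio.h>

struct A { char x[7]; };  /* sizeof(struct A) == 7 on typical platforms */

int main(void) {
    struct A arr[10];
    struct A *p = &arr[2];
    struct A *q = &arr[9];
    /* The raw addresses differ by 7 * 7 = 49 bytes, but pointer subtraction
       divides by sizeof(struct A), so this prints 7. */
    printf("%td\n", q - p);
    return 0;
}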
4 years ago by spoonjim
Ok, I've looked at this about five separate times and think that I finally get it. Thank you so much.
4 years ago by throw8away
35 / 7 = 5
35-7 = 28, 1
28-7 = 21, 2
21-7 = 14, 3
14-7 = 07, 4
07-7 = 00, 5
thus we can see that, in essence, division is repeated subtraction, at least computationally in some contexts.
An easy way to count how many times X fits into Y is simply to count how many times you can subtract X from Y before Y would go negative, as in the sketch below.
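In C, that naive scheme looks something like this (just to illustrate the idea; real hardware and compilers use much faster methods):

unsigned divide_by_subtraction(unsigned y, unsigned x) {
    /* Assumes x > 0. Counts how many copies of x can be removed from y. */
    unsigned count = 0;
    while (y >= x) {
        y -= x;
        count++;
    }
    return count;  /* e.g. divide_by_subtraction(35, 7) == 5 */
}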
4 years ago by chrchang523
See also https://libdivide.com/ , when you need to repeatedly divide by a constant that isn't known until runtime.
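For reference, a minimal sketch of how that looks with libdivide's C API (based on my reading of its documentation; double-check the current headers for exact names): you pay for the magic-number computation once, then reuse it for many divisions.

#include <stddef.h>
#include <stdint.h>
#include "libdivide.h"

uint64_t sum_of_quotients(const uint64_t *vals, size_t n, uint64_t d) {
    struct libdivide_u64_t div = libdivide_u64_gen(d);  /* d known only at runtime */
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += libdivide_u64_do(vals[i], &div);       /* multiply-and-shift under the hood */
    return total;
}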
4 years ago by oh_sigh
Honestly if you're not familiar with the nitty gritty details of division, this sounds similar to those nodejs packages for "left padding a string"
4 years ago by bjoli
The key here is "at runtime".
4 years ago by oh_sigh
Yes. The funny part is that, as I said, if you're not familiar with the nitty-gritty details of division, you might think the library just exposes a function (a,b)->a/b, which works at runtime.
4 years ago by layoutIfNeeded
> I call the corresponding divisors ideal
Not a particularly good name, as "ideal" already has an established meaning in algebra: https://en.m.wikipedia.org/wiki/Ideal_(ring_theory)
4 years ago by dahart
Didn't "ideal" have an established meaning in English before ring theory existed? Curious, why'd you pick that specialization of this word, out of all that are available? https://en.wikipedia.org/wiki/Ideal
4 years ago by layoutIfNeeded
When shown on a 7-segment display, the only numbers whose images don't contain a little square are 1, 2, 3, 4, 5, and 7. I call them "square-free numbers".
Oh, you're saying "square-free" has an established meaning in mathematics? https://en.wikipedia.org/wiki/Square-free_integer
Well, you see, "square" already had an established meaning in English before number theory existed. Curious, why'd you pick that specialization of this word, out of all that are available? :)
4 years ago by dahart
> "square" already had an established meaning
Wasn't your straw man example "square-free", not "square"...?
4 years ago by gowld
It's not an ideal name, even though it's ideal.
4 years ago by withinboredom
This is computer science, not algebra.
4 years ago by Tainnor
algebra plays a huge role in CS, e.g. cryptography
4 years ago by tragomaskhalos
Genuine question: why would any compiler bother to optimise division by such obtuse divisors, other than a purist's satisfaction?
4 years ago by Denvercoder9
There's no specific optimisation for these specific numbers. There's a more general optimisation to turn division by a fixed number into a multiplication and shift where possible (because division is a lot slower than multiplication), and there is another optimisation that eliminates useless shifts. Combined they yield this result for these specific numbers.
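As a hand-rolled illustration of that combination (my own constants, not pulled from gcc's source): the general multiply-and-shift trick, plus the special case where the dividend is known to be an exact multiple, in which the shift disappears entirely.

#include <stdint.h>

/* General trick: for any 32-bit unsigned n, n / 5 == (n * 0xCCCCCCCD) >> 34,
   where 0xCCCCCCCD is ceil(2^34 / 5). */
uint32_t div5(uint32_t n) {
    return (uint32_t)(((uint64_t)n * 0xCCCCCCCDu) >> 34);
}

/* Special case: when n is known to be an exact multiple of 7 (like the pointer
   difference in the article), multiplying by the inverse of 7 mod 2^32 is
   enough and no shift is needed. 0xB6DB6DB7 * 7 == 1 (mod 2^32). */
uint32_t exact_div7(uint32_t n) {
    return n * 0xB6DB6DB7u;
}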
4 years ago by nitrogen
As an example of applicability, I used reciprocal multiplication and shifting, plus knowledge of how many bits of accuracy I didn't need, to get my old Kinect code for my former startup to run near 30fps instead of 2-3fps on hardware with no FPU and IIRC no divide instruction.
4 years ago by joe_the_user
Yeah, a more detailed description of this general method seems missing from the article. Anyone care to provide it?
4 years ago by chrchang523
See the "Labor of Division" blog posts by the author of libdivide:
https://ridiculousfish.com/blog/posts/labor-of-division-epis...
https://ridiculousfish.com/blog/posts/labor-of-division-epis...
4 years ago by pkaye
The book "Hacker's Delight" goes over a lot of these kinds of software algorithms for integers.
4 years ago by warkdarrior
Division is orders of magnitude slower than multiplication.
4 years ago by Tuna-Fish
Less than one order of magnitude these days. On Tiger Lake, integer multiply is 3 cycles, while integer division is 12 for 32-bit integers and 15 for 64-bit.
Of course, the multiply is fully pipelined while the division is only partially, but even that leaves only a 10-fold difference.
Division is much faster these days than it used to be. It's still worth it to have compiler optimizations for it, but it's not important to avoid it anymore.
4 years ago by Fordec
> Less than one order of magnitude these days.
> only a 10-fold difference.
A ten fold difference is literally one order of magnitude. Which is what the post you're replying to factually stated.
4 years ago by mmoskal
Cortex-M4 (think $2) has 1 cycle multiplication and 2-12 cycle division (with early termination). If you go down to Cortex-M0 (think $0.50) then you get software division which is indeed orders of magnitude slower. Of course lower down the line there isn't even hardware multiplication.
Interestingly, the ARM11 in original RPi and RPi0 doesn't have hardware integer division (despite being 10x faster than typical M4 and typically having 10000x RAM), but it does have floating point division.
4 years ago by jbay808
Embedded systems programmers don't always have such luxury!
4 years ago by gowld
Why do some people assume decimal ten as the base of the orders of magnitude when discussing binary arithmetic?
4 years ago by martincmartin
They're often special cases of general algorithms. You can do many divisions as combinations of multiplications & shifts; in this case, the resulting code simplifies down to the smallest possible.
4 years ago by lamontcg
For floating point math, is this why I see "/2" written as "*0.5" all the time? Is that really necessary, and can't compilers figure that out?
4 years ago by kevin_thibedeau
They don't necessarily produce the same result, and strict IEEE conformance prohibits such transformations. With GCC, -ffast-math enables optimizations like this.
4 years ago by SuchAnonMuchWow
In this specific example, they do produce exactly the same result:
`a / b` is completely equivalent to `a * (1 / b)` when both b and 1/b are perfectly representable in floating point without rounding, which is the case for 2 and 0.5
4 years ago by lamontcg
That makes sense. I typically have to worry about larger relative tolerances anyway so I think I can start using that trick.
4 years ago by jvickers
In JavaScript (specifically node / V8), if dividing multiple values by the same divisor, is it fastest to calculate the reciprocal and then multiply by it?
How about other languages (and compiler systems)?
Are there binary tricks worth knowing about to quickly calculate reciprocals?
4 years ago by brrrrrm
It's typically faster to calculate a reciprocal (especially at compile time) and then multiply by it, but this is often at the expense of precision. In C++, for example, the -O3 flag on gcc will still emit a division (divss) to preserve compliant floating point precision in this example: https://godbolt.org/z/PY3Wjno4P
However, when we enable `-ffast-math`, we permit the compiler to do a bit more aggressive optimization in this domain (trading off precision), and we see multiplication (mulss) is emitted: https://godbolt.org/z/KEb1nc43z
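In C terms, the usual pattern when the divisor is only known at runtime looks like this (a sketch, nothing V8-specific): hoist the one division out of the loop and let the cheaper multiply run per element, accepting one extra rounding step per value.

#include <stddef.h>

void scale_down(double *vals, size_t n, double d) {
    double inv = 1.0 / d;       /* one division up front */
    for (size_t i = 0; i < n; i++)
        vals[i] *= inv;         /* cheap multiply per element */
}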
4 years ago by dahart
> Are there binary tricks worth knowing about to quickly calculate reciprocals?
Yes! You can calculate an approximate reciprocal with a single subtraction and some type-casting or bit manipulation.
The accuracy isn't great (relative error of up to ~5%), but it might be good enough for many applications. The method is famous for being used to normalize 3d vectors in Quake https://en.wikipedia.org/wiki/Fast_inverse_square_root
This same method for calculating 1/sqrt(x) can also be used to calculate 1/x by using a different magic constant and skipping the shift-right by one bit.
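For reference, the inverse-square-root version looks roughly like this (my own sketch using memcpy instead of the original's pointer cast; the magic constant is the one from the Wikipedia article). The reciprocal variant mentioned above just drops the `>> 1` and uses a different constant.

#include <string.h>
#include <stdint.h>

/* Approximate 1/sqrt(x) via the bit trick plus one Newton-Raphson step. */
float fast_rsqrt(float x) {
    float y = x;
    uint32_t i;
    memcpy(&i, &y, sizeof i);             /* reinterpret the float's bits */
    i = 0x5f3759df - (i >> 1);            /* magic constant and shift */
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - 0.5f * x * y * y); /* one refinement step */
}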
4 years ago by andrewfromx
I'd like to find these numbers for base 9 vs base 10; see https://stackoverflow.com/questions/67226727/is-there-a-way-... If you look at that sample golang code, is that the right way to go about this? Or is there a much simpler way already built into a high-level language like Go?
4 years ago by omgJustTest
Floating point, while convenient, does have substantial drawbacks in the implementations of circuits that we have today. Efficiency gains like this somewhat expose facets that are normally hidden. It will be interesting to see the transition to either analog NN or quantum computers, which inherently have better representations of continuous values.
Edit: Additionally, the serial vs parallel processing paradigm shift is a main advantage of the switch too! Queued systems seem to have severe scheduling limitations, which the real world does not always inherently have.
4 years ago by svat
The linked article is about integer division and has nothing to do with floating point, and in fact I can't see how this idea could be applied to floating-point arithmetic. Can it?
4 years ago by wizzwizz4
Not directly, no. The underlying mathematics kind of can, but you'd need dedicated circuitry (or bit-hacking) for it because of the floating point parts, and it would fall over around subnormals.
4 years ago by dahart
I'm not certain, but while reading I was feeling like there potentially is a relationship. It was reminding me of the algorithm where you can approximate a reciprocal square root using a single subtraction from a magic number constant. https://en.wikipedia.org/wiki/Fast_inverse_square_root