Faster Integer Division with Floating Point
Multiplication on a common microcontroller is easy. But division is much more difficult. Even with hardware assistance, a 32-bit division on a modern 64-bit x86 CPU can run between 9 and 15 cycles. SIMD (single instruction, multiple data) instruction sets used for array processing, like AVX or NEON, often don’t offer division at all (although the RISC-V vector extensions do). However, many processors support floating point division. Does it make sense to use floating point division to replace integer division? According to [Wojciech Mula] in a recent post, the answer is yes.
The plan is simple: cast the 8-bit numbers into 32-bit integers and then to floating point numbers. These can be divided in bulk via the SIMD instructions and then converted in reverse to the 8-bit result. You can find several code examples on GitHub.
Since modern processors have several SIMD instruction sets, the post takes the time to benchmark many different variations of a program dividing in a loop. The basic program is the reference and, thus, has a “speed factor” of 1. Unrolling the loop, a common loop optimization technique, doesn’t help much and, on some CPUs, can make the loop slower.
Converting to floating point and using AVX2 sped the program up by a factor of 8 to 11, depending on the CPU. Some of the processors supported AVX-512, which also offered considerable speed-ups.
This is one of those examples of why profiling is so important. If you’d asked us whether converting integer division to floating point might make a program run faster, we’d have bet the answer was no, but we’d have been wrong.
As CPUs get more complex, optimizing gets a lot less intuitive. If you are interested in things like AVX-512, we’ve got you covered.
hackaday.com/2024/12/22/faster…
Published at 2024-12-23 06:00:50