Brancheless Programming Python example
Modern processors batch many lines of code to optimize for speed. What happens in case of a branche? Learn how to branches can slow down the CPU
In 1986 I would fire up Turbo Assembler on my C64 and write assembly code that was blazingly fast. The processor would process one instruction per cycle and the world was easy and OK.
Since then, chip manufactures did many things to improve processing power. They make transitors smaller and fit more and more on one wafer. I hear a transistor can be made with a 1000x1000x1000 atoms these days.
But shrinking CPU’s is not the only way to make processors faster anymore. Moore’s law is not only about more transistors but also about optimizing the code. By this I mean the code in the processor.
One of the tricks of modern CPU’s is batching instructions. Many lines of code are loaded and analysed at once. But what happens if there are branches? Well, it turns out that branches have a high chance of going one direction and the CPU optimizes for that branche. But in case the branche goes the other direction, the pipe is flushed and there is a performance hit.
This video Branchless Programming: Why If is Sloowww… and what we can do about it! shows two examples in c++. It shows the difference between the branched and brancheless code because the assembly code generated from the branched c++ code has j(ump) opcodes and the brancheless does not.
How does Python deal with branched and branchless code? Python has a disassembler that we use to show the byte code that is generated.
Here is a function that returns the smallest number of any two given numbers:
def smaller(a, b): if a < b: return a else: return b print(smaller(1, 2)) print(smaller(10, 20))
To show the generated byte code, you can use the
dis function from the
sys module. Here is an example. I’ve omitted the function calls.
import dis def smaller(a, b): if a < b: return a else: return b dis.dis(smaller)
4 0 LOAD_FAST 0 (a) 2 LOAD_FAST 1 (b) 4 COMPARE_OP 0 (<) 6 POP_JUMP_IF_FALSE 12 5 8 LOAD_FAST 0 (a) 10 RETURN_VALUE 7 >> 12 LOAD_FAST 1 (b) 14 RETURN_VALUE 16 LOAD_CONST 0 (None) 18 RETURN_VALUE
And there we see the JUMP statement:
6 POP_JUMP_IF_FALSE 12. This indicates a branche: If FALSE, jump to 12. If not, continue
Brancheless Byte Code
Let’s rewrite the program. The function still uses relational expressions but does not use them to branche from it.
import dis def smaller(a, b): return a * (a < b) + b * (b <= a) dis.dis(smaller)
4 0 LOAD_FAST 0 (a) 2 LOAD_FAST 0 (a) 4 LOAD_FAST 1 (b) 6 COMPARE_OP 0 (<) 8 BINARY_MULTIPLY 10 LOAD_FAST 1 (b) 12 LOAD_FAST 1 (b) 14 LOAD_FAST 0 (a) 16 COMPARE_OP 1 (<=) 18 BINARY_MULTIPLY 20 BINARY_ADD 22 RETURN_VALUE
The re-written programm does not use branches anymore but the effect of the function is still the same!
Writing brancheless code can speed up the code. However we are talking about fractions of seconds. This can be interesting if you are doing something in a loop and repeat it over and over.
But you probably know enough about code to make a smart decision when it comes down to write readable code that can be understood by you future you. The question is: Do you want to write an if/else statement that anyone understands, or a more cryptic line of code that prevents branching?
That said, it is still interesting to know about branched code v.d. brancheless code.
If you want to know more about modern processors, listen to Lex Fridmans podcast where he interviews legendary microprocessor engineer Jim Keller: Jim Keller: Moore’s Law, Microprocessors, Abstractions, and First Principles AI Podcast