Branch and Loop Reorganization to Prevent Mispredicts
By Jeff Andrews, published on May 6, 2011
by Jeff Andrews
Introduction
Proven techniques and Intel tools enable developers to minimize branch mispredictions and keep deep-pipelined processors fully utilized.
Modern microprocessors are pipelined in order to get more instructions completed faster. This means that instructions do not wait for the previous ones to complete before their execution begins. A problem with this approach arises, however, due to conditional branches. If the microprocessor encounters a conditional branch and the result for the condition has not yet been calculated, how does it know whether to take the branch or not? This is where branch prediction comes in.
Branch prediction is what the processor uses to decide whether to take a conditional branch or not. Getting this information as accurately as possible is important, as an incorrect prediction (mispredict) will cause the microprocessor to throw out all the instructions that did not need to be executed and start over with the correct set of instructions. This process is particularly expensive with deeply pipelined processors.
This article introduces the various branch-prediction methods used by the microprocessor and provides some tips about how to avoid costly mispredicts. The paper assumes that the reader is familiar with programming in C and with IA32 assembly-language instructions.
Branch Examples
A pipelined processor makes it possible to begin execution of instructions before their predecessors are completed by breaking the instruction execution up into stages. When a conditional branch is encountered, the microprocessor uses branch prediction to determine which direction the branch will take. The following are examples of C language commands that cause conditional branches.
The first type of construction considered here that causes conditional branches is if-else:
// // C code // if ( data == 0 ) … else if ( data == 1 ) … else … ; ; Assembly code ; mov eax, DWORD PTR data cmp eax, 0 jne NextIfElse1 … jmp EndIfElse NextIfElse1: cmp eax, 1 jne NextIfElse2 … jmp EndIfElse NextIfElse2: … EndIfElse: |
// // C code // switch ( data ) { case 0: … break; case 1: … break; default: … } ; ; Assembly code ; mov eax, DWORD PTR data cmp eax, 0 jne NextCase1 … jmp EndSwitch NextCase1: cmp eax, 1 jne NextCase2 … jmp EndSwitch NextCase2: … EndSwitch: |
// // C code // for ( i=0; i < 10; i++ ) { … } ; ; Assembly code ; mov esi, data mov ecx, 0 ForLoop: cmp ecx, 10 jge EndForLoop … add ecx, 1 jmp ForLoop EndForLoop: |
// // C code // while ( data > 0 ) { … data--; } ; ; Assembly code ; mov eax, data WhileLoop: cmp eax, 0 jle EndWhile … sub eax, 1 jmp WhileLoop EndWhile: |
// // C code // do { … data--; } while ( data > 0 ); ; ; Assembly code ; mov eax, data DoWhileLoop: … sub eax, 1 cmp eax, 0 jg DoWhileLoop EndDoWhile: |
Static Branch Prediction
There are two kinds of branch prediction: static and dynamic. Static branch prediction is used by the microprocessor the first time a conditional branch is encountered, and dynamic branch prediction is used for succeeding executions of the conditional branch code.
Static branch prediction is used when there is no data collected by the microprocessor when it encounters a branch, which is typically the first time a branch is encountered. The rules are simple:
- A forward branch defaults to not taken
- A backward branch defaults to taken
In order to effectively write your code to take advantage of these rules, when writing if-else or switch statements, check the most common cases first and work progressively down to the least common. Loops do not necessarily require any special ordering of code for static branch prediction, as only the condition of the loop iterator is normally used.
The Pentium® 4 Processor introduced new instructions for adding static hints to branches. It is not recommended that a programmer use these instructions, as they add slightly to the size of the code and are static hints only. It is best to use a conditional branch in the manner that the static predictor expects, rather than adding these branch hints.
In the event that a branch hint is necessary, the following instruction prefixes can be added before a branch instruction to change the way the static predictor behaves:
- 0x3E – statically predict a branch as taken
- 0x2E – statically predict a branch as not taken
For example:
mov eax, data cmp eax, 5 __emit 0x3E jne branch add eax, 1 branch: |
Dynamic Branch Prediction
Dynamic branch prediction is done in the microprocessor by using a history log of previously encountered branches containing data for each branch, noting whether or not it was taken. This branch-history log is known as the Branch Target Buffer (BTB). Every time a branch is encountered and the microprocessor knows which direction the branch has taken, the BTB is updated to reflect that information.
The BTB is a buffer that holds a branch’s address and a brief history of the direction that a branch has taken. The address field is used somewhat like an index to the BTB, where it looks up whether a branch should be taken or not taken. There are 16 bits in the Pentium 4 processor to signify whether a branch should be taken or not. The bits work like a circular buffer, with each bit being checked for each check into the BTB.
The following is an example of the BTB entry for a backward branch (i.e., a do-while loop in C++) doing four iterations with all its entries already filled in:

In this example, the do-while loop has been executed multiple times, with each execution of the loop containing a fixed amount of four iterations. Now that the history for this loop is in the BTB, whenever this code is executed again, it will not cause any branch mispredicts and the accompanying penalty.
One thing to remember is that the BTB is finite. Once all the entries in the BTB have been consumed, an older entry will need to be used for a new branch that is encountered.
Since the BTB is only 16 entries long in the Pentium 4 processor, the prediction will eventually fail for loops that are longer than 16 iterations. This limitation can be avoided by unrolling a loop until it is only 16 iterations long. When this is done, a loop conditional will always fit into the BTB, and a branch misprediction will not occur on loop exit. The following is an exam ple of loop unrolling:
// // Not-unrolled // for (i=0; i < 32; i++) { some_val += some_array[i]; } // //Unrolled // for (i=0; i < 32; i+=2) { some_val += some_array[i] * 20; some_val += some_array[i+1] * 20; } |
Another benefit of loop unrolling is that dependence chains are stretched, allowing for deep pipelines to get more utilization.
There are some rules that need to be followed:
- Do not go beyond 16 iterations for the Pentium 4 processor and four iterations for the Pentium® II processor and Pentium® III processor.
- Branches within a loop can place a heavy demand on the BTB, since they are now multiplied the same number of times the loop is unrolled.
It is also best to remove branches from within a loop, if possible. By doing so, the branch is only taken once, rather than for each iteration of the loop. This is only possible when the conditional does not change during the entire duration of the loop. The following is an example of how to do this:
// //original version // for ( i=0; i < 10; i++ ) { if ( some_value > 10 ) data++; else data--; } // // conditional extracted version // if ( some_value > 10 ) { for ( i=0; i < 10; i++ ) data++; } else { for ( i=0; i < 10; i++ ) data--; } |
Conditional Instructions
The Pentium® Pro processor introduced new instructions that help eliminate branches. These instructions behave differently, depending on whether the condition is met or not. The instructions are as follows:
- SETcc – sets the destination register to 0 if the condition is not met and to 1 if the condition is met.
- CMOVcc – moves the data from the source register to the destination register if the condition is met; otherwise, it is treated as a NOP by the microprocessor.
- FCMOVcc – moves the floating point data from the source FP register to the destination FP register if the condition is met; otherwise, it is treated as a NOP by the microprocessor.
The cc stands for the different conditional expressions available (e.g., ne – not equal, z – zero). Given that, the following condi tional instruction moves ebx into eax if the value of ecx is not equal to 10:
cmp ecx, 10 cmovne eax, ebx |
; ; original version ; add eax, somevalue jcc branch add edx, 1 branch: ; ; setcc version ; add eax, somevalue setc ebx, 1 add edx, ebx |
; ; original version ; cmp eax, somevalue jge branch mov ebx, constant0 branch: ; ; cmovcc version ; cmp eax, somevalue mov ecx, constant0 movl ebx, ecx |
; ; original version ; fld comp1 fcomp comp2 fnstsw ax test ah, 1 jne branch fld data fadd dataadjust fstp data branch: ; ; fcmovcc version ; fld dataadjust fldz fld comp1 fcomp comp2 fcmovnbe st(1) fadd data fstp data fdecstp |
When using conditional instructions, verify that they execute faster than conditional branches, as they can incur some additional instruction overhead for the more complex branches.
For more information on conditional instructions, please refer to the IA-32 Intel® Architecture Software Developer’s Manual, Volume 2: Instructions Set Reference.
Intel® Software Development Tools
The VTune™ Performance Analyzer is an excellent tool for determining if branch mispredictions are occurring and how much of an impact they are causing on an executable. The VTune analyzer can sample many different processor events, including mispredicted branches. The simplest way to do this is to create an "Advanced Activity Configuration" and create a sampling collector with all the "Primary performance tuning events" selected. This will give you more than the needed number of events, but it will also alert you to some other possible issues.
The Intel® Compilers can automatically unroll loops and insert branch-removing instructions. Using the Intel Compilers is simple, as they are flag compatible with the products they integrate with (e.g., Microsoft Visual Studio*). When the capability is enabled, the compilers will insert conditional instructions and branch hints.
The compilers can also enable application profiling, so that elements like case statements will be reordered to have the most common cases appear first, taking better advantage of static branch prediction.
Evaluation copies of the VTune Performance Analyzer and the Intel compilers can be downloaded from the links given above.
Conclusion
Deep-pipelined processors give the ability to boost clock frequencies up to several gigahertz. In order to take advantage of deep pipelines, however, branch mispredictions should be avoided so that the pipelines remain fully utilized.
The techniques described in this article, which range to the simple to the more complex, allow developers to avoid mispredictions. Using Intel software-development tools simplifies the removal of branch mispredictions and provides efficient solutions. If hand tuning of the code is required, using the methods given here will certainly help to minimize branch mispredictions.
Additional Resources
Articles
- Avoiding the Cost of Branch Misprediction
- Performance Analysis of Applications Running on Itanium® Processor
Community
About the Author
Jeff Andrews is an Application Engineer with Intel Corporation specializing in optimizing code for ISVs. He also researches software technologies that enhance the performance of applications on Intel® processors.
3 comments
TopAnonymous said on Sep 16,2011
i am doing my research work in this area.i had some idea,can you give some comment on that.
Anonymous said on Sep 1,2010
ITS VERY NICE
FOR UNDERSTANDING PURPOSE IT VERY EASY TO UNDERSTAND TO ME
Anonymous said on Sep 27,2009
Hi Jeff,
I am taking Parallel Computer Architecture course this semester. And above description is very helpful for branch prediction. But I still have a doubt on unconditional branch that appears for the first time in code and unconditional branch for the second or more. Can you please elaborate on this point? I would appreciate if you can e-mail me your point of view.
Thanks,
Saumil Patel
Add a Comment
Sign inHave a technical question? Visit our forums. Have site or software product issues? Contact support.