How does a CPU predict a branch
Intro
After clearing up the myths about branch prediction, I have been spending more time digging into how CPUs implement it in practice. And the results were amazing.
As pipelines grow longer, branch mispredictions become more expensive. Vendors are investing more and more in branch prediction, among other techniques like the Instruction Pointer queue or the decoded instruction cache.
Today we will go through some common techniques of branch prediction.
Overview
From my last post, we know the predictor consists of 2 main components: the Branch Target Buffer (BTB) and the Branch Target Calculator (BTC).
To be honest, I could not find many resources about the BTC, so instead let us zoom into the BTB and its cache and see how they work together.
Branch prediction techniques
Prediction with One-bit
The naive approach is to have one bit record the last result of the branch. Our cache in this case is an array in which each entry holds a single bit with the last result, and we use that bit as the prediction for the next result.
It is impossible to remember the last result of every branch, so instead we use only some bits (normally the lowest bits) of the instruction address to locate the recording bit.
This approach works well for runs of identical results, like all Taken (TTTT…) or all Not-Taken (NNNN…).
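As a rough illustration, here is a minimal C sketch of such a one-bit table. The 10-bit index width and the predict/update helper names are illustrative assumptions, not how real hardware exposes this:

```c
#include <stdint.h>

#define TABLE_BITS 10                /* assumption: index with the lowest 10 address bits */
#define TABLE_SIZE (1 << TABLE_BITS)

/* One bit per entry: 1 = last result was Taken, 0 = Not-Taken. */
static uint8_t last_result[TABLE_SIZE];

int predict(uint32_t pc) {
    return last_result[pc & (TABLE_SIZE - 1)];  /* predict the last recorded result */
}

void update(uint32_t pc, int taken) {
    last_result[pc & (TABLE_SIZE - 1)] = taken; /* remember the newest result */
}
```

Note that two branches whose addresses share the same lowest bits will share an entry; that aliasing is exactly the trade-off of indexing by only some address bits.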
Prediction with Two-bit
In the previous approach, a single change in the branch result causes two mispredictions: one when the result changes, and another when it changes back. Consider a branch with this behavior: TTTTNTTTT. Most of the time it is Taken, so keeping the prediction at Taken is better than switching to Not-Taken after the single N.
If we detect this kind of pattern, we should fortify our Taken prediction, and we should not switch to Not-Taken easily. How can we do that?
We use 2 bits to store our history! Now we have 4 states: Strongly Not-Taken, Weakly Not-Taken, Weakly Taken, and Strongly Taken. A Taken result moves the state one step toward Strongly Taken, a Not-Taken result one step toward Strongly Not-Taken, and we predict Taken whenever we are in one of the two Taken states.
The structure of our cache stays unchanged. The only difference is that each cache entry now contains 2 bits, representing these 4 states.
Now it takes 2 consecutive Not-Taken results to change our prediction! Papers and articles refer to this approach as a “saturating counter”.
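A sketch of the same table upgraded to 2-bit saturating counters, under the same illustrative sizes and helper names as before:

```c
#include <stdint.h>

#define TABLE_BITS 10
#define TABLE_SIZE (1 << TABLE_BITS)

/* States: 0 = strongly Not-Taken, 1 = weakly Not-Taken,
 *         2 = weakly Taken,       3 = strongly Taken. */
static uint8_t counter[TABLE_SIZE];

int predict(uint32_t pc) {
    return counter[pc & (TABLE_SIZE - 1)] >= 2;  /* predict Taken in the two upper states */
}

void update(uint32_t pc, int taken) {
    uint8_t *c = &counter[pc & (TABLE_SIZE - 1)];
    if (taken  && *c < 3) (*c)++;   /* saturate at 3 instead of wrapping around */
    if (!taken && *c > 0) (*c)--;   /* saturate at 0 */
}
```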
Keeping global branch history
In practice, many branches have patterns, for example:
```c
for (int i = 0; i < 100; i++) {
    if (i % 3 == 0) {
        foo++;
    } else {
        bar++;
    }
}
```
In the example above, the branch is Taken once every 3 iterations, so its pattern looks like TNNTNNTNN… To detect this kind of pattern, we need to remember the history of previous branch results.
We can improve our prediction by storing the recent results of branches and using them for the prediction of the next result:
We maintain several bits of recent branch history. Every time we want to predict a branch, we combine the lowest bits of its address with the history bits to index into the prediction table, where each entry is again a saturating counter.
In our example above, we use 2 global bits to store recent history. Each instruction address now has 2^2 = 4 saturating counters, one for each possible history state.
This approach is referred to as “Two-level adaptive, global”.
The more bits we keep for history, the more accurate our prediction becomes, as we can detect longer patterns. But it comes at the cost of more storage per instruction address. In our example above, if the cache capacity stays the same, we can only hold predictions for 1/4 as many branches compared with the previous approach.
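Putting the pieces together, here is a sketch of a two-level adaptive predictor with global history, using the 2 history bits from the example. The 8-bit address slice is an arbitrary assumption:

```c
#include <stdint.h>

#define HIST_BITS 2                  /* 2 global history bits, as in the example */
#define ADDR_BITS 8                  /* assumption: lowest 8 address bits */
#define TABLE_SIZE (1 << (ADDR_BITS + HIST_BITS))

static uint8_t counter[TABLE_SIZE];  /* 2-bit saturating counters, as before */
static uint8_t global_history;       /* shift register of recent outcomes */

static unsigned table_index(uint32_t pc) {
    unsigned addr = pc & ((1u << ADDR_BITS) - 1);
    unsigned hist = global_history & ((1u << HIST_BITS) - 1);
    return (addr << HIST_BITS) | hist;   /* concatenate address bits and history */
}

int predict(uint32_t pc) {
    return counter[table_index(pc)] >= 2;
}

void update(uint32_t pc, int taken) {
    uint8_t *c = &counter[table_index(pc)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    global_history = (global_history << 1) | (taken ? 1 : 0);  /* shift in newest result */
}
```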
Keeping local branch history
The global history approach suffers from another issue: recent history is shared between all branches. Imagine the code now becomes like this:
```c
for (int i = 0; i < 100; i++) {
    if (i % 3 == 0) {
        if (another_condition1() == 1) {
            foo++;
        }
        if (another_condition2() == 2) {
            bar++;
        }
    }
}
```
Since we have only one history register shared between all branches, the results of different branches are mixed together and interfere with each other.
We can fix this by having separate histories for each individual address.
We first use the address to locate the branch’s own history, then combine that history with the address to index the prediction, which is again a saturating counter.
This approach is referred to as “Two-level adaptive, local”. The global variant is a special case of this approach where the branch history table has only 1 entry.
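The local variant only changes where the history comes from: each address slot gets its own history register. A sketch under the same illustrative sizes:

```c
#include <stdint.h>

#define HIST_BITS 2
#define ADDR_BITS 8
#define NUM_HISTORIES (1 << ADDR_BITS)             /* one history register per slot */
#define TABLE_SIZE (1 << (ADDR_BITS + HIST_BITS))

static uint8_t history[NUM_HISTORIES];             /* per-branch history registers */
static uint8_t counter[TABLE_SIZE];

static unsigned table_index(uint32_t pc) {
    unsigned addr = pc & ((1u << ADDR_BITS) - 1);
    unsigned hist = history[addr] & ((1u << HIST_BITS) - 1);
    return (addr << HIST_BITS) | hist;
}

int predict(uint32_t pc) {
    return counter[table_index(pc)] >= 2;
}

void update(uint32_t pc, int taken) {
    unsigned addr = pc & ((1u << ADDR_BITS) - 1);
    uint8_t *c = &counter[table_index(pc)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    history[addr] = (history[addr] << 1) | (taken ? 1 : 0);  /* update only this branch's history */
}
```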
Keeping global branch history – GShare
How can we make our prediction more accurate than the previous approaches? Remember that we use only the lowest bits of the address together with a small number of history bits. Increasing either number quickly inflates the storage required for the BTB cache, so we need some way to balance this.
Another issue is that some history patterns are more common than others, so if we use the history pattern directly to index the table, some entries are used heavily while others are barely touched.
The solution is to combine the address and the history pattern with a hash function instead of concatenating them. This distributes usage more evenly across entries and reduces the chance of multiple addresses sharing the same entry.
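In GShare the hash is famously simple: XOR the branch address with the global history. A sketch, where the 12-bit index width is an arbitrary choice:

```c
#include <stdint.h>

#define INDEX_BITS 12                 /* assumption: 4096 counters */
#define TABLE_SIZE (1 << INDEX_BITS)

static uint8_t counter[TABLE_SIZE];
static uint32_t global_history;

static unsigned table_index(uint32_t pc) {
    /* XOR the branch address with the global history, spreading
     * common history patterns across the whole table. */
    return (pc ^ global_history) & (TABLE_SIZE - 1);
}

int predict(uint32_t pc) {
    return counter[table_index(pc)] >= 2;
}

void update(uint32_t pc, int taken) {
    uint8_t *c = &counter[table_index(pc)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    global_history = (global_history << 1) | (taken ? 1 : 0);
}
```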
Loop counter
Each loop contains at least one branch that decides whether to continue or break out of the loop. If there are other branches within the loop body, their outcomes pollute the history, so we would need a very long history to predict the loop’s branch.
One way to resolve this is to have a separate mechanism for loops and regular branches. We call this the loop counter.
We keep a counter for each loop, which counts the period N the first time the loop is executed. For subsequent executions, we predict that the loop will again run for N iterations before breaking.
We need to store extra data for the loop mechanism (see the sketch after this list), including:
- Whether the branch has loop behavior or not.
- Whether the branch is taken or not taken at the end of the period.
- The period N.
- The current count.
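Here is a sketch of the steady-state behavior of one such entry, mirroring the fields above. It deliberately omits the training phase that learns N and detects loop behavior, and all names and widths are illustrative:

```c
#include <stdint.h>

/* One loop predictor entry, mirroring the fields listed above. */
struct loop_entry {
    uint8_t  is_loop;        /* does this branch behave like a loop? */
    uint8_t  taken_at_exit;  /* result on the final iteration of the period */
    uint16_t period;         /* N, learned the first time the loop ran */
    uint16_t count;          /* position within the current period */
};

int predict(const struct loop_entry *e) {
    if (e->count + 1 == e->period)
        return e->taken_at_exit;       /* last iteration: predict the exit result */
    return !e->taken_at_exit;          /* otherwise: predict "keep looping" */
}

void update(struct loop_entry *e, int taken) {
    if (taken == e->taken_at_exit)
        e->count = 0;                  /* the loop exited: a new period begins */
    else
        e->count++;                    /* still inside the loop */
}
```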
Hybrid predictor
The idea of a hybrid predictor comes naturally, as different approaches work well in different contexts.
To combine them, we need a meta-predictor that decides which predictor’s result is used. There are many ways to implement this meta-predictor; the simplest is a saturating counter that remembers which predictor gives better results for a particular branch address.
Intel Pentium M CPUs, for example, pair a loop counter predictor with a two-level adaptive predictor using global history. It is possible to extend this scheme to more than 2 predictors.
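A sketch of that simplest meta-predictor: one 2-bit saturating counter per address slot that drifts toward whichever of two predictors (called A and B here, purely for illustration) has been more accurate:

```c
#include <stdint.h>

#define TABLE_BITS 10
#define TABLE_SIZE (1 << TABLE_BITS)

/* One 2-bit saturating counter per slot:
 * values 0-1 select predictor A, values 2-3 select predictor B. */
static uint8_t chooser[TABLE_SIZE];

int predict(uint32_t pc, int pred_a, int pred_b) {
    unsigned i = pc & (TABLE_SIZE - 1);
    return chooser[i] >= 2 ? pred_b : pred_a;
}

void update(uint32_t pc, int pred_a, int pred_b, int taken) {
    unsigned i = pc & (TABLE_SIZE - 1);
    /* Drift toward whichever predictor was right, only when they disagree. */
    if (pred_b == taken && pred_a != taken && chooser[i] < 3) chooser[i]++;
    if (pred_a == taken && pred_b != taken && chooser[i] > 0) chooser[i]--;
}
```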
Future directions
Branch prediction will likely be further improved in the future, as pipelines get longer. The following directions are being researched:
- Hybrid predictor.
- Alloyed predictor: Combining local and global history as the index to the prediction table.
- Keeping unimportant branches out: Many branches will typically go the same way. If we keep them out of the prediction table, we can save space and increase accuracy.
- Decoding both branches: If the prediction is uncertain, the processor can choose to decode and execute both paths instead.
- Neural networks: The storage requirement of table-based predictors increases exponentially with the number of history bits, and so does the warm-up time. Methods with smaller storage requirements, such as perceptron-based neural predictors, can be used instead.
- Reducing context-switch loss: A context switch often clears all information the predictor has collected, and more complex predictors require a longer warm-up time. It is likely that new methods will emerge to reduce this loss.
Closing thoughts
In conclusion, branch prediction is a cornerstone of modern processor design, with a significant impact on performance. By exploring the various implementation techniques and their hybrids, I hope you have gained some insight into the trade-offs inherent in each method.
That is all for today’s post! Enjoy the weekend.
References:
- Agner Fog’s great resource for optimization: https://www.agner.org/optimize