Optimizations Corner:
Detecting Partial Stalls

The first step in removing partial stalls from the code is detecting them. A partial stall matters to the application's performance only if the code containing it is executed frequently. The best tool for evaluating the significance of the problem and for detecting partial stalls is the VTune(TM) Performance Analyzer. Partial stalls can be detected statically, by static code analysis, and dynamically, using event-based sampling (EBS). To sample the event, create a test example. The test example should have the following characteristics:
    int a;  /* assumed global; the original listing omits the declaration */

    int main(void)
    {
        int i;

        for (i = 0; i < 100000000; i++)
            a = i | 0xff00;
        return 0;
    }

The relevant events are partial stall events and partial stall cycles. Select these events using the menu option Configure | Options | Sampling | Processor Events for EBS. In addition, sample for clock ticks. Then look at the sampling session's summary view, shown in Figure 1, which lists the partial stall events and cycles as reported by the VTune Performance Analyzer.
This example had about 88M partial stalls and, as a result, lost 594M cycles. The machine was a 300 MHz Pentium II system; one processor was in idle mode. The run lost about two seconds of the 3.4 seconds it took to run the code (594M cycles at 300 MHz). If the total number of partial stall cycles is very low, recovering the lost cycles does not produce a significant speedup. If it is higher, you can zoom into the hot spots of partial stalls. The hot spot view appears in Figure 2.
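The arithmetic behind these figures can be sketched in a few lines of C. This is only an illustration: the function names are hypothetical and are not part of VTune or of the original example. The inputs are the counts quoted above.

```c
#include <assert.h>

/* Hypothetical helper: convert a sampled cycle count into seconds
   at a given clock rate (in Hz). */
double cycles_to_seconds(double cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return cycles / clock_hz;
}

/* Hypothetical helper: average number of cycles lost per stall event. */
double cycles_per_event(double cycles, double events)
{
    assert(events > 0.0);
    return cycles / events;
}

/* Hypothetical helper: share of the total run time spent in stalls. */
double stall_fraction(double stall_seconds, double total_seconds)
{
    assert(total_seconds > 0.0);
    return stall_seconds / total_seconds;
}
```

With the numbers from the sampling session, cycles_to_seconds(594e6, 300e6) gives 1.98 seconds, which is about 58% of the 3.4 second run, and cycles_per_event(594e6, 88e6) gives roughly 6.75 cycles lost per partial stall. A ratio this large is what makes the hot spot worth investigating.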
Only one hot spot has all the partial stall events. Click on the hot spot to display the offending line of code; in this case, a = i | 0xff00; It is not immediately clear why this line causes a partial stall, so examine the mixed assembly/source view, shown in Figure 3.
The compiler elected to "optimize" the bitwise OR (|) operation by using an 8-bit register instead of a 32-bit register. Click on the instruction marked ! to get the full story from the context-sensitive help. In this case, the solution is to select the Pentium Pro code generation strategy instead of the default blended strategy in the compiler's code generation options. In many other cases, changing the compiler flags may not solve the problem. The compiler may generate partial stalls for other reasons, such as code that mixes unsigned chars with integers. In most cases, you can work around the problem by changing the code.

Statically Detecting Partial and MOB Stalls

Another simple way to detect partial stalls is to open the static code analysis view, request a detailed view of the functions with code, and click on the warning column to sort the functions by warning. Click on functions with Pentium Pro warnings to jump to the relevant source code. Partial stalls are marked with PPro_Partial_Stall and MOB stalls with PPro_Mem_Stall, so search for the string PPro_ using the Find option. This helps you navigate to all the instructions flagged with the ! mark. For every warning, check that the surrounding code is actually executed often (look for clock tick samples). Optimizing code that is executed infrequently does not help overall performance.

Dynamically Detecting MOB Stalls

Unfortunately, there is no event that counts MOB stalls directly. Several other events can indicate a performance problem that may be due to a MOB stall. The relevant events are:
Each of these events can have a high count as a result of several other performance issues as well as a MOB stall. The Resource Related Stalls and low parallelization events are general events that indicate a problem; the number of those events gives an indication of the cost of the problem. The Resource Related Stalls event counts the number of clock cycles during which a resource-related stall occurs. This includes stalls due to register renaming buffer entries, memory buffer entries, branch misprediction recovery, and delays in retiring mispredicted branches. If there are no performance problems, resource stalls should not be a concern. However, if the system is not running at full speed, the event may indicate one of several possible problems, including MOB stalls.
The last source of this event is the one created by a MOB stall. The problem with this event is that the first two reasons are more common, so identifying the MOB stall cases may require a lot of work.
In this example, according to the Clock Tick event, the MOB stall causes a loss of 8 cycles. The Micro-Ops retired - Low parallelization event indicates an average loss of about four cycles per iteration. The resource-related stall costs about 7.8 cycles per iteration. Store buffer blocks occur about 0.94 times per iteration. These events indicate the presence of a MOB stall or another store-buffer-blocking problem.

Conclusion

You can improve the performance of your application by detecting and fixing microarchitectural bottlenecks (also known as "glass jaws") such as partial and MOB stalls. Compiler performance problems or faulty assembly coding can cause partial stalls. When the compiler is at fault, using the Pentium Pro code generation strategy instead of the default blended option may fix the problem. Otherwise, minor code modifications can help, such as using 32-bit values instead of 8-bit values for the problematic variable. In assembly code, algorithmic reasons, such as packing several colors in one pixel register, may cause a partial stall. In those cases, use shifts rather than trying to use the 8-bit parts of a register.

Haim Barad has a Ph.D. in Electrical Engineering (1987) from the University of Southern California. His areas of concentration are 3D graphics, video, and image processing. Haim was on the Electrical Engineering faculty at Tulane University before joining Intel in 1995. Haim is a staff engineer and currently leads the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.