Our Properties: Gamasutra GameCareerGuide IndieGames Indie Royale GDC IGF Game Developer Magazine GAO
My Message close
Contents
Pentium III Prefetch Optimizations Using the VTune Performance Analyzer
 
 
Printer-Friendly VersionPrinter-Friendly Version
 
Latest News
spacer View All spacer
 
February 10, 2012
 
Road to the IGF: Lucky Frame's Pugs Luv Beats
 
Analyst questions validity of unusual January NPD results [3]
 
DICE 2012: Blizzard's Pearce on World Of Warcraft's launch hangover
spacer
Latest Jobs
spacer View All     Post a Job     RSS spacer
 
February 10, 2012
 
Sony Computer Entertainment America LLC
Audio Tools Engineer
 
Sony Computer Entertainment America LLC
World Wide Studios Technical Product Manager
 
Sony Computer Entertainment America LLC
Senior Software Application Engineer
 
Sony Computer Entertainment America LLC
Senior Gamer Insights Specialist
 
High 5 Games
Technical Artist
 
Airtight Games
Art Director
spacer
Latest Features
spacer View All spacer
 
February 10, 2012
 
arrow Principles of an Indie Game Bottom Feeder [19]
 
arrow Postmortem: CyberConnect 2's Solatorobo: Red the Hunter [1]
 
arrow Jerked Around by the Magic Circle - Clearing the Air Ten Years Later [39]
 
arrow Building the World of Reckoning [4]
 
arrow SPONSORED FEATURE: TwitchTV - How to Build Community Around Your Game in 2012 [13]
 
arrow Happy Action, Happy Developer: Tim Schafer on Reimagining Double Fine [9]
 
arrow Building an iOS Hit: Phase 1 [11]
 
arrow Postmortem: Appy Entertainment's SpellCraft School of Magic [5]
spacer
Latest Blogs
spacer View All     Post     RSS spacer
 
February 10, 2012
 
Audio Passes: Success Through Layering
 
What the current RPG can learn from Diablo 1
 
Double Fine's Kickstarter Windfall: Will Patronage Supplant Traditional Game Publishing? [5]
 
The Principles of Game Monetization
 
Did DoubleFine Just break the publishing model for good? [13]
spacer
About
spacer Editor-In-Chief/News Director:
Kris Graft
Features Director:
Christian Nutt
Senior Contributing Editor:
Brandon Sheffield
News Editors:
Frank Cifaldi, Tom Curtis, Mike Rose, Eric Caoili, Kris Graft
Editors-At-Large:
Leigh Alexander, Chris Morris
Advertising:
Jennifer Sulik
Recruitment:
Gina Gross
 
Feature Submissions
 
Comment Guidelines
Sponsor
Features
  Pentium III Prefetch Optimizations Using the VTune Performance Analyzer
by Ornit Gross [Programming]
Post A Comment Share on Twitter Share on Facebook RSS
 
 
July 30, 1999 Article Start Page 1 of 4 Next
 

The most appealing applications today for the home PC market handle growing amounts of data. Video decoders and encoders manipulate big frame buffers, 3D games handle large textures and data structures, and speech recognition applications demand a lot of memory. Unfortunately, most of the data that these applications need is out of the cache when it's most needed.

Transferring data to and from the main memory slows down the application, forcing it to wait for the data to become available. However, in some cases, the data address is known ahead of time and the data could have been fetched in advance, reducing these waiting cycles. Until now, those who have tried to prefetch the data by simply reading it in advance discovered that this trick does not work well.


With the advent of the Streaming SIMD Extensions for the Pentium III processor, game developers now can use instructions for controlling the cache, which use prefetches and streaming stores. However, these instructions must be used carefully, because excessive use may not lead to the expected speed boost, and may even slow down your game.

About the New Prefetch Instructions

The new prefetch instructions of the Pentium III provide hints to the processor – they suggest where into in the memory hierarchy the data should be prefetched. However, there's no guarantee that the data will be fetched. The prefetches do not affect the functionality of your code, apart from moving data blocks along the memory hierarchy.

There are dedicated prefetches for data of temporal locality (data that will be accessed again in the short term) and for data of non-temporal locality (data that is accessed only once). The four prefetch instructions are as follows:

  1. PrefetchNTA. This instruction is for non-temporal data. It only fetches into the first level cache (L1), without polluting the second level cache (L2).
  2. PrefetchT0. This instruction is used for temporal data that fits into L1. It fetches into the whole cache hierarchy of L1 and L2.
  3. PrefetchT1. This instruction is for temporal data that fits into L2 without polluting L1.
  4. PrefetchT2. This implementation in the Pentium III processor is the same as for PrefetchT1.

(The Intel Architecture Software Developer's Manual provides a complete architectural description of the prefetch instructions, beyond their implementation in the Pentium III processor.)

Unaligned accesses are supported, and the prefetch does not split cache lines. In other words, only one cache line of 32 bytes is fetched, which includes the address of the prefetch. If the data is already in the desired cache level, or closer to the processor, then it is not moved.

You cannot prefetch data from uncacheable ranges such as Write Combining, AGP, video frame buffer, and so on. These prefetches are treated as no operations (NOPs).

An attempt to prefetch an illegal address is ignored, meaning there is no data movement. You do not get an exception. It is difficult to track pointer bugs in the prefetch address (wrong address to prefetch from), which do not result in improved performance.

Using Prefetches Efficiently

Using prefetches efficiently is more of an art than a science, but any developer can acquire the necessary skills. To take full advantage of the prefetches, you must follow several simple guidelines. Following these guidelines can make the difference between great and acceptable prefetch optimizations. The micro-architectural reasons for such guidelines are based on the organization of the load buffer, the store buffer and the fill buffer as described in the Intel Architecture Software Optimization Reference Manual. The guidelines derived from the micro-architecture are:

  1. Prefetch only when the probability of a cache miss is high (use the VTune analyzer as described in the section below titled, "Analyzing a Sample Application"). Redundant prefetches may carry some overhead, so be thrifty about using prefetch instructions.
  2. Avoid prefetching the same cache line (32 bytes) more than once. Instead, unroll your loops to handle full cache lines per iteration, making it easier to arrange the prefetching per iteration.
  3. Change your data structures to include as much useful data as possible. Use the structure of arrays (SoA) format instead of the array of structures (AoS) format to increase the prefetching efficiency. The data is already available, so make use of it now instead of prefetching it again later.
  4. Spread the prefetches among other computational instructions, and if possible, space them out, and don't use them around load instructions. This guideline is tightly related to the following one, and is based on the processor resources that both the prefetches and the loads share.
  5. Carefully mix prefetches, streaming stores, loads and stores that miss the caches, and stores to uncacheable memory range. All these instructions cause data transaction to or from the main memory, and all of them share the same valuable processor resource

 
Article Start Page 1 of 4 Next
 
Comments


none
 
Comment:
 




UBM Techweb
Game Network
Game Developers Conference | GDC Europe | GDC Online | GDC China | Gamasutra | Game Developer Magazine | Game Advertising Online
Game Career Guide | Independent Games Festival | Indie Royale | IndieGames

Other UBM TechWeb Networks
Business Technology | Business Technology Events | Telecommunications & Communications Providers

Privacy Policy | Terms of Service | Contact Us | Copyright © UBM TechWeb, All Rights Reserved.