Gamasutra: The Art & Business of Making Games
Sponsored Feature: Who Moved the Goal Posts? The Rapidly Changing World of CPUs


October 19, 2009 (Page 3 of 7)

Focus on the Intel Core i7 architecture


Figure 6. The core and uncore diagrams of the Intel Core i7 processor design.

The Intel Core i7 processor architecture is designed from the ground up to be modular, allowing for a variable number of cores per processor. The cores on the Intel Core i7 processor are identical, and each supports better branch prediction, improved memory performance when dealing with unaligned loads and cache-line splits, and improved stores compared to previous architectures. Each common core carries two register sets to support SMT (simultaneous multithreading); the two register sets share the L1 data and L2 caches and the out-of-order execution units, so each core appears to the OS as two logical processors.
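A quick way to see the effect of SMT from code is to ask the OS how many logical processors it exposes; on an SMT-enabled Core i7 this is twice the physical core count. A minimal sketch, using the C++11 thread library (the function name here is our own):

```cpp
#include <thread>

// Sketch: query how many logical processors the OS exposes. On an
// SMT-enabled Core i7, expect twice the number of physical cores.
unsigned logical_processor_count() {
    // hardware_concurrency() may return 0 when the count is unknown.
    unsigned n = std::thread::hardware_concurrency();
    return n ? n : 1;  // fall back to 1 so callers can always divide work
}
```

A thread-pool sizing heuristic would typically start from this count rather than a hard-coded core number.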

The Intel Core i7 processor microarchitecture also introduced an uncore portion to the processor design. The uncore contains the remaining processor features, such as the memory controller and the power and clocking controllers, and it runs at a frequency that differs from that of the common cores. The architecture allows different numbers of cores to be attached to a single uncore design, and allows the uncore itself to vary with its intended market, for example by adding additional QuickPath Interconnect (QPI) links for multi-socket servers and workstations.

To get the best performance from the Intel Core i7 processor, developers need to implement multithreading aggressively. Game engines written for four or fewer threads will use only a fraction of the available processing power and may see only limited gains over previous quad-core processors. Developers have to go beyond functional parallelism, which merely separates out subsystems such as physics or AI, and move to data-level parallelism, with the work of individual functions distributed across multiple threads.
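As an illustration of data-level parallelism, a single per-entity update can be split into ranges and handed to as many worker threads as the machine reports. This is a minimal sketch with an invented integration step, not any particular engine's job system:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-entity update over a half-open index range [begin, end).
void integrate(std::vector<float>& pos, const std::vector<float>& vel,
               float dt, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i)
        pos[i] += vel[i] * dt;
}

// Data-level parallelism: one function, its data split across all threads,
// rather than one whole subsystem per thread.
void parallel_integrate(std::vector<float>& pos, const std::vector<float>& vel,
                        float dt) {
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = (pos.size() + n - 1) / n;  // ceiling division
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t) {
        std::size_t begin = std::min(pos.size(), t * chunk);
        std::size_t end   = std::min(pos.size(), begin + chunk);
        if (begin < end)
            workers.emplace_back(integrate, std::ref(pos), std::cref(vel),
                                 dt, begin, end);
    }
    for (auto& w : workers) w.join();
}
```

A production engine would reuse a persistent thread pool instead of spawning threads per call, but the range-splitting idea is the same.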

One of the key requirements of any graphically intensive game will be ensuring that the rendering engine itself scales across multiple threads, so that the graphics pipeline between the game and the video card doesn't become the bottleneck limiting the use of both the processor and the graphics processing unit (GPU). The support for multithreaded rendering contexts in DirectX 11 should significantly benefit this area of games programming.


Figure 7. The memory hierarchy on the Intel Core i7 processor.

Figure 7 shows the memory hierarchy on the Intel Core i7 processor, which is significantly different from previous Intel desktop designs. Each core has a 32 KB L1 cache accessible within four cycles, which is incredibly fast. Compared to the Intel Core processor family, the 256 KB L2 cache is significantly smaller, even though it is much faster, with a 10-cycle access time, and runs at a higher clock rate. The smaller L2 cache is offset by an additional, much larger L3 cache that is part of the uncore and shared by all of the common cores.

The new L3 cache has a 35-40 cycle access time, depending on the variable ratio between the core and uncore clock speeds. The L3 cache has a shared, inclusive design that mirrors the data held in the L1 and L2 caches. While the inclusive design might at first glance seem a less efficient use of the total cache available on the die, it allows for much better memory latency.

If one core needs data that it cannot find in its own local L1 and L2 caches, it need only query the inclusive L3 cache before accessing main memory, instead of probing the other cores' caches: a miss in the inclusive L3 guarantees the data is held by no other core. The L3 cache is connected to the new integrated memory controller, which connects to the system DRAM and offers much higher bandwidth than the front-side bus used on Core 2 processors. In a multi-socket system, the memory controller communicates with the other socket via a QuickPath Interconnect link, again replacing the front-side bus.
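The practical consequence of this hierarchy (4-cycle L1, 10-cycle L2, 35-40 cycle L3, then DRAM) is that access patterns which reuse cache lines stay in the fast tiers. A small illustrative sketch, with invented function names: both loops below compute the same sum, but the unit-stride version uses every byte of each fetched cache line, while the strided version keeps falling through to L3 and DRAM on large arrays.

```cpp
#include <cstddef>
#include <vector>

// Unit stride: consecutive floats share 64-byte cache lines, so each line
// fetched into L1/L2 is fully consumed before eviction.
float sum_row_major(const std::vector<float>& m,
                    std::size_t rows, std::size_t cols) {
    float s = 0.0f;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

// Stride of `cols` floats: each access may touch a different cache line,
// giving poor line reuse once the matrix exceeds the cache.
float sum_col_major(const std::vector<float>& m,
                    std::size_t rows, std::size_t cols) {
    float s = 0.0f;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}
```

Timing the two variants on a matrix larger than the L3 cache makes the latency tiers in Figure 7 directly visible.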

Compared to the previous generation of high-end desktop systems, the performance of main memory improved both in bandwidth (roughly tripled) and in latency (roughly halved). When deciding whether to run eight threads of computation, it is important to have enough memory bandwidth available to keep from starving the processor cores.

Intel Streaming SIMD Instruction Sets

The Intel Core i7 processor also introduced an extension to the SIMD instruction sets supported on Intel processors, called SSE4.2. The new instructions were targeted at accelerating specific algorithms, and while they themselves may have limited benefit in games (depending on whether you use an algorithm that benefits from them), support for SSE in general is very important to getting the most from the PC. It is frequently surprising to games developers just how widespread the support for previous versions of SSE is.

The Intel SSE instruction set was originally introduced in 1999, and the Intel SSE2 set soon followed in 2000 (Figure 8). It seems reasonable to assume that most game developers today are targeting hardware built after 2000, and thus developers can essentially guarantee support for Intel SSE2 at a minimum. Targeting at least a dual-core configuration as a minimum specification will guarantee support for Intel SSE3 on both Intel and AMD platforms.
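Since SSE can be treated as a baseline, the intrinsics are safe to use directly on x86 targets. A minimal sketch (the function name is ours, and the scalar fallback is only there to keep the example portable to non-x86 compilers):

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics, available on x86 CPUs since 1999
#define HAVE_SSE 1
#endif

// Add four packed single-precision floats in one instruction via _mm_add_ps.
// Unaligned loads/stores keep the sketch simple; aligned data is faster.
void add4(const float* a, const float* b, float* out) {
#ifdef HAVE_SSE
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
#else
    for (std::size_t i = 0; i < 4; ++i)  // scalar fallback for non-x86
        out[i] = a[i] + b[i];
#endif
}
```

On any x86-64 target the SSE path always compiles in, since SSE and SSE2 are part of the base x86-64 instruction set.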


Figure 8. The evolution of Intel Streaming SIMD Instructions.

In addition to hand-coded optimizations, one of the goals of each new architecture is to allow compilers more flexibility in the types of optimizations they can generate automatically. Improvements such as better unaligned loads and more flexible SSE load instructions enable the compiler to optimize code that previously could not be transformed safely with the context information available at compile time. Thus it is always advisable to test performance with the compiler flags set to target the specific hardware. Some compilers also allow multiple code paths to be generated at the same time, providing backward compatibility together with fast paths for targeted hardware.
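Multiple code paths can also be selected by hand at run time. As one possible approach (not the article's), GCC and Clang expose __builtin_cpu_supports() for exactly this kind of dispatch; the function and path names below are illustrative:

```cpp
// Runtime CPU dispatch sketch: pick an SSE4.2 fast path when the host CPU
// supports it, otherwise fall back to the SSE2 baseline. On non-x86 targets
// or other compilers, report a generic scalar path.
const char* pick_simd_path() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("sse4.2"))
        return "sse4.2";   // fast path, e.g. Core i7 and later
    return "sse2";         // baseline guaranteed on x86-64
#else
    return "scalar";       // portable generic path
#endif
}
```

An engine would typically run a check like this once at startup and install the matching function pointers, rather than branching per call.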

