Gamasutra - Feature - "A Glimpse Inside the Cell Processor"
By Jim Turley
[Author's Bio]

Gamasutra
July 13, 2006

A Glimpse Inside the Cell Processor


 



Magic Eight Ball

The real magic of Cell lies with its eight "synergistic processor elements," or SPEs, shown in Figure 2. These are specially designed processors created from scratch by the IBM/Sony/Toshiba team just for Cell. They're not compatible with Power or PowerPC code in any way; they have their own distinct instruction set and internal architecture. For most code, and particularly for parallel vector operations, the SPEs do the heavy lifting. Each SPE is identical to its neighbors, and all share the same common bus with the central Power Processing Element (PPE in IBM-speak).

Like the central PowerPC processor, each SPE is a dual-issue machine, but unlike the PPE, its two execution pipelines are not symmetrical. In other words, each SPE can execute two instructions simultaneously, but not two instructions of the same kind. The SPE's pipeline is "unbalanced," in that it can execute only arithmetic operations on one side (either fixed- or floating-point) and only logic, memory, or flow-control operations on the other side. That's not unusual; other superscalar processors have unbalanced pipelines, too. Most modern x86 chips, for example, have internal execution units dedicated to math, logic, or flow-control instructions, and the hardware (or the compiler) determines how many of those can actually be used each cycle. It's the combination of these elements that determines the processor's ultimate performance and suitability to a task.

Each SPE is a 128-bit machine, with 128 registers that are each 128 bits wide. Its internal execution units are also 128 bits wide, which allows each SPE to handle either very large numbers or several small numbers at once. For example, each SPE can process two double-precision floats, four single-precision floats or long integers, eight 16-bit short integers, or 16 chars or other byte-sized quantities, all in a single cycle.
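To make the lane arithmetic concrete, here is a minimal sketch in portable C. It is not SPE code: the vec128 union, lanes_for, and add_f32 are hypothetical names invented for this illustration, and it assumes the usual 8-byte double and 4-byte float so that the union is exactly 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* One 128-bit register viewed as each of the lane types the text lists.
   The union and its field names are illustrative, not taken from any SDK. */
typedef union {
    double   f64[2];   /* two double-precision floats    */
    float    f32[4];   /* four single-precision floats   */
    uint32_t u32[4];   /* four 32-bit integers           */
    uint16_t u16[8];   /* eight 16-bit short integers    */
    uint8_t  u8[16];   /* sixteen byte-sized quantities  */
} vec128;

/* Number of elements of a given byte width that fit in one register. */
static size_t lanes_for(size_t element_bytes) {
    return sizeof(vec128) / element_bytes;
}

/* Element-wise add of four single-precision floats. An SPE would do this
   with one vector instruction; this portable stand-in uses a plain loop. */
static vec128 add_f32(vec128 a, vec128 b) {
    vec128 r;
    for (size_t i = 0; i < 4; i++) {
        r.f32[i] = a.f32[i] + b.f32[i];
    }
    return r;
}
```

Calling lanes_for(sizeof(uint16_t)) gives 8 and lanes_for(1) gives 16, matching the counts in the paragraph above.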

Although it stretches the definition considerably, each SPE has a RISC-like instruction set. An SPE can load and store only quad-word (128-bit) quantities, and all transactions must be on aligned addresses. If you want to load or store a byte or char, you've got to transfer the whole 16-byte quantity first and then mask off, merge, or extract the bits you want.
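The load-then-extract and read-modify-write patterns described above can be sketched in portable C. This is a simulation under stated assumptions, not SPE code: the local_store array stands in for the SPE's memory, memcpy stands in for the quad-word load and store, and both function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define QUADWORD 16u  /* bytes in one 128-bit quad-word */

/* Load one byte by fetching the whole aligned quad-word that contains it
   and then extracting the wanted byte. local_store is assumed to be a
   multiple of 16 bytes long. */
static uint8_t load_byte_via_quadword(const uint8_t *local_store, uint32_t addr) {
    uint8_t quad[QUADWORD];
    uint32_t base = addr & ~(QUADWORD - 1u);      /* round down to a 16-byte boundary */
    memcpy(quad, local_store + base, QUADWORD);   /* the only kind of load available  */
    return quad[addr & (QUADWORD - 1u)];          /* pick out the requested byte      */
}

/* Store one byte: read the containing quad-word, merge the new byte in,
   and write the whole quad-word back. */
static void store_byte_via_quadword(uint8_t *local_store, uint32_t addr, uint8_t value) {
    uint8_t quad[QUADWORD];
    uint32_t base = addr & ~(QUADWORD - 1u);
    memcpy(quad, local_store + base, QUADWORD);
    quad[addr & (QUADWORD - 1u)] = value;
    memcpy(local_store + base, quad, QUADWORD);
}
```

A single-byte store therefore costs a load, a merge, and a store, which is one reason SPE code tends to work on whole vectors rather than individual bytes.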

Each SPE actually has seven different execution units, although only two can be used at a time, as mentioned previously. Because one of the two execution pipelines is dedicated to arithmetic operations, an SPE can process fixed- or floating-point numbers nonstop while the other execution unit(s) in the other pipeline handle program flow. This reduces (but doesn't prevent) pipeline "bubbles" that get in the way of streaming data at top speed without interruption. Some DSP processors have similar internal architectures that separate program flow from data manipulation, and it works quite well most of the time. If the code tries to execute two arithmetic operations at once, the chip simply runs them in sequence instead of side-by-side. This isn't really a programming error but it does reduce the SPE's throughput considerably.
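A toy model shows why instruction mix matters. The sketch below counts issue cycles under a deliberately simplified assumption: two adjacent instructions issue together only when one is arithmetic and the other is not. It ignores data dependencies, latencies, and any address-alignment rules the real hardware imposes, and the op_class and issue_cycles names are hypothetical.

```c
#include <stddef.h>

/* Two instruction classes, following the split described in the text:
   OP_ARITH = fixed- or floating-point arithmetic,
   OP_OTHER = logic, memory, or flow control. */
typedef enum { OP_ARITH, OP_OTHER } op_class;

/* Count issue cycles for a straight-line instruction stream under the
   simplified pairing rule: adjacent instructions of different classes
   issue in the same cycle; otherwise they issue one at a time. */
static size_t issue_cycles(const op_class *ops, size_t n) {
    size_t cycles = 0;
    size_t i = 0;
    while (i < n) {
        if (i + 1 < n && ops[i] != ops[i + 1]) {
            i += 2;   /* one instruction per pipeline: dual issue */
        } else {
            i += 1;   /* same class twice, or last instruction: single issue */
        }
        cycles++;
    }
    return cycles;
}
```

In this model, four instructions that alternate between the two classes finish in two cycles, while four arithmetic instructions in a row take four.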

Internal Data Flow

Unlike the PPE, the SPEs do not have caches. Instead, they each get a 256KB "local store" that only they can see. All code and data for the SPE must be stored within this 256KB local area. In fact, the SPEs cannot "see" the rest of the chip's address space at all. They can't access each other's local stores, nor can they access the PPE's caches or other on-chip or off-chip resources. In effect, each SPE is blind and limited to just its own little corner of the Cell world.
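Because code and data share the same local store, fitting a task onto an SPE becomes a budgeting exercise. The following sketch illustrates the arithmetic only; the spe_footprint structure and its fields are hypothetical and do not correspond to any real toolchain report.

```c
#include <stddef.h>

#define LOCAL_STORE_BYTES (256u * 1024u)  /* 256KB = 262,144 bytes */

/* Hypothetical breakdown of what one SPE task keeps in its local store. */
typedef struct {
    size_t code_bytes;    /* program text                    */
    size_t stack_bytes;   /* stack and other working data    */
    size_t buffer_bytes;  /* size of one data buffer         */
    size_t buffer_count;  /* number of such buffers          */
} spe_footprint;

/* Total bytes the task needs. */
static size_t footprint_total(const spe_footprint *f) {
    return f->code_bytes + f->stack_bytes + f->buffer_bytes * f->buffer_count;
}

/* Nonzero if everything fits in the local store at once. */
static int fits_in_local_store(const spe_footprint *f) {
    return footprint_total(f) <= LOCAL_STORE_BYTES;
}
```

For example, 64KB of code, a 16KB stack, and two 16KB buffers total 112KB and fit comfortably; 200KB of code with the same stack and two 32KB buffers totals 280KB and does not.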

Why the crippled address map? Each SPE is limited to just a single memory bank with deterministic access characteristics in order to guarantee its performance. Off-chip (or even on-chip) memory accesses take time, sometimes an unpredictable amount of time, and that goes against the SPE's purpose. SPEs are designed to be ultra-fast and ultra-reliable units for processing streaming media, often in real-time situations where the data can't be retransmitted. By limiting their options and purpose, Cell's designers gave the SPEs deterministic performance.

This is where the DMA controllers come in. Each SPE has its own 128-bit wide DMA controller (64 bits in, 64 bits out) between it and Cell's local bus. The PPE and all eight SPEs share the same bus, called the Element Interconnect Bus (EIB). Through this bus each DMA controller fetches the instructions and data that its attached SPE will need. The DMA controller also pushes results out onto the shared bus, where it can be exported off-chip, sent to on-chip peripherals, or cached by the PPE.
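One common way to keep a processor like this busy is double buffering: while the SPE works on one local buffer, its DMA controller fills the other. The sketch below shows only the structure of that loop in portable C. It is an assumption-laden simulation: dma_get and dma_put are hypothetical stand-ins built on memcpy, not the real DMA interface, and because memcpy is synchronous no actual overlap occurs here.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4  /* elements per transfer; tiny for illustration */

/* Stand-in for a DMA "get": main memory -> local-store buffer. */
static void dma_get(float *local, const float *main_mem, size_t count) {
    memcpy(local, main_mem, count * sizeof(float));
}

/* Stand-in for a DMA "put": local-store buffer -> main memory. */
static void dma_put(float *main_mem, const float *local, size_t count) {
    memcpy(main_mem, local, count * sizeof(float));
}

/* Multiply n floats by k, streaming them through two local buffers.
   n is assumed to be a multiple of CHUNK. */
static void scale_stream(float *dst, const float *src, size_t n, float k) {
    float buf[2][CHUNK];            /* two buffers in the "local store" */
    size_t chunks = n / CHUNK;
    if (chunks == 0) {
        return;
    }
    dma_get(buf[0], src, CHUNK);    /* prefetch the first chunk */
    for (size_t c = 0; c < chunks; c++) {
        size_t cur = c & 1u;        /* buffer holding the current chunk */
        if (c + 1 < chunks) {       /* start fetching the next chunk    */
            dma_get(buf[cur ^ 1u], src + (c + 1) * CHUNK, CHUNK);
        }
        for (size_t i = 0; i < CHUNK; i++) {
            buf[cur][i] *= k;       /* compute on the current chunk */
        }
        dma_put(dst + c * CHUNK, buf[cur], CHUNK);  /* push results out */
    }
}
```

On real hardware the fetch of the next chunk would be asynchronous, and the code would wait for it to complete before switching buffers; that wait is omitted here because the stand-in copies finish immediately.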

The central processor's L1 and L2 caches snoop the EIB, so the caches are always fully coherent. The SPEs do not snoop the bus; in fact, they don't monitor bus traffic at all. That means that the central PowerPC processor is aware of what data the SPEs may transfer but the SPEs are totally unaware of any traffic amongst their neighbors. Again, this keeps the SPEs relatively simple and limits interruptions or unwanted effects on their behavior. If the SPEs need to be made aware of external data changes, their respective DMA controllers will have to fetch it. And that, presumably, would be under the control of the central PPE.

Super Cell

Mere mortals can program the Cell processor, but it's a unique experience. A handful of embedded systems developers already have experience programming multiprocessor systems; some have even coded multicore processors. But Cell promises to up the game. Each of the chip's nine individual processor elements is itself a dual-issue machine with complex pipeline interlocks, cache-coherence issues, and synchronization problems. Keeping all eight SPEs fed at once promises to be a real chore. Yet the results are bound to be spectacular. If your application can benefit from sustained high-speed floating-point operations and can be parallelized across two or more SPEs, you should be in for a real treat. That is, once you get the code running.

IBM is working on an "Octopiler" that compiles C code and balances it across Cell's eight SPEs. Tools like that, and like the ones described in our companion article on page 18, are absolutely necessary if Cell is to be a success. To take another example from the video game industry, Sega's Saturn console was a failure largely because its four-processor architecture (three SuperH chips and a 68000) was too difficult to program. Developers working under tight deadlines simply ignored much of the system's power because they couldn't harness it effectively. Cell brings that problem in spades. It's an impressive achievement in computer architecture and semiconductor manufacturing. Products based on Cell promise to be equally impressive. But bringing Cell to life will require real software alchemy.

Jim Turley is editor in chief of Embedded Systems Design magazine. You can reach him at jturley@cmp.com.

 

 





Copyright © 2005 CMP Media LLC
