The real magic of Cell lies with its eight "synergistic processor
elements," or SPEs, shown in Figure 2. These are specially designed
processors created from scratch by the IBM/Sony/Toshiba team just for
Cell. They're not compatible with Power or PowerPC code in any way;
they have their own distinct instruction set and internal architecture.
For most code, and particularly for parallel vector operations, the
SPEs do the heavy lifting. Each SPE is identical to its neighbors, and
all share the same common bus with the central Power Processing Element
(PPE in IBM-speak).
Like the central PowerPC processor, each SPE is a dual-issue machine,
but unlike the PPE, its two execution pipelines are not symmetrical. In
other words, each SPE can execute two instructions simultaneously, but
not two instructions of the same type. The SPE's pipeline is "unbalanced," in
that it can execute only arithmetic operations on one side (either
fixed- or floating-point) and only logic, memory, or flow-control
operations on the other side. That's not unusual; other superscalar
processors have unbalanced pipelines, too. Most modern x86 chips, for
example, have internal execution units dedicated to math, logic, or
flow-control instructions and the hardware (or the compiler) determines
how many of those can actually be used each cycle. It's the combination
of these elements that determines the processor's ultimate performance
and suitability to a task.
Each SPE is a 128-bit machine, with 128 registers that are each 128
bits wide. Its internal execution units are also 128 bits wide, which
allows each SPE to handle either very large numbers or several small
numbers at once. For example, each SPE can process two double-precision
floats, four single-precision floats or long integers, eight 16-bit
short integers, or 16 chars or other byte-sized quantities, all in a
single cycle.
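To make the lane arithmetic concrete, here is a minimal sketch in portable C that models one 128-bit register as a union. The type vec128 and the helpers add_f32 and add_i16 are hypothetical names invented for illustration, not part of any Cell toolchain, and the loops stand in for what the hardware would do across all lanes at once.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit SPE register: the same 16 bytes
   viewed at different lane widths. Illustration only. */
typedef union {
    double  f64[2];   /* two double-precision floats   */
    float   f32[4];   /* four single-precision floats  */
    int32_t i32[4];   /* four 32-bit integers          */
    int16_t i16[8];   /* eight 16-bit short integers   */
    uint8_t u8[16];   /* sixteen byte-sized quantities */
} vec128;

static_assert(sizeof(vec128) == 16, "a vec128 must be exactly 128 bits");

/* Lane-wise add of four floats. A loop here; one vector operation
   on a real 128-bit execution unit. */
static vec128 add_f32(vec128 a, vec128 b)
{
    vec128 r;
    for (int i = 0; i < 4; i++)
        r.f32[i] = a.f32[i] + b.f32[i];
    return r;
}

/* Lane-wise add of eight 16-bit integers (wraps on overflow). */
static vec128 add_i16(vec128 a, vec128 b)
{
    vec128 r;
    for (int i = 0; i < 8; i++)
        r.i16[i] = (int16_t)(a.i16[i] + b.i16[i]);
    return r;
}
```

The narrower the data type, the more lanes fit in the same register, which is why the same 128-bit datapath handles two doubles or 16 chars per operation.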
Although it stretches the definition considerably, each SPE has a
RISC-like instruction set. The SPEs can load and store only quad-word
(128-bit) quantities, and all transactions must be on aligned addresses.
If you want to load or store a byte or char, you've got to transfer the
whole 16-byte quantity first and then mask off, merge, or extract the
bits you want.
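The read-modify-write sequence for a single byte can be sketched in plain C. This is a simulation under stated assumptions, not SPE code: the qword type and the load_qword, store_qword, load_byte, and store_byte helpers are hypothetical, and an ordinary byte array stands in for the SPE's memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define QWORD 16u   /* bytes in one 128-bit quad-word */

typedef struct { uint8_t b[QWORD]; } qword;

/* The only transfers this model allows: whole, aligned quad-words. */
static qword load_qword(const uint8_t *mem, uint32_t addr)
{
    qword q;
    assert(addr % QWORD == 0);          /* aligned addresses only */
    memcpy(q.b, mem + addr, QWORD);
    return q;
}

static void store_qword(uint8_t *mem, uint32_t addr, qword q)
{
    assert(addr % QWORD == 0);
    memcpy(mem + addr, q.b, QWORD);
}

/* Reading one byte: fetch the enclosing quad-word, then extract. */
static uint8_t load_byte(const uint8_t *mem, uint32_t addr)
{
    qword q = load_qword(mem, addr & ~(QWORD - 1u));
    return q.b[addr % QWORD];
}

/* Writing one byte: fetch 16 bytes, merge the new byte, write 16 back. */
static void store_byte(uint8_t *mem, uint32_t addr, uint8_t value)
{
    uint32_t base = addr & ~(QWORD - 1u);  /* round down to alignment */
    qword q = load_qword(mem, base);
    q.b[addr % QWORD] = value;
    store_qword(mem, base, q);
}
```

A one-byte store therefore costs a load, a merge, and a store, which is a good reason to keep SPE data in quad-word-sized pieces wherever possible.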
Each SPE actually has seven different execution units, although only
two can be used at a time, as mentioned previously. Because one of the
two execution pipelines is dedicated to arithmetic operations, an SPE
can process fixed- or floating-point numbers nonstop while the other
execution unit(s) in the other pipeline handle program flow. This
reduces (but doesn't prevent) pipeline "bubbles" that get in the way of
streaming data at top speed without interruption. Some DSP processors
have similar internal architectures that separate program flow from
data manipulation, and it works quite well most of the time. If the
code tries to execute two arithmetic operations at once, the chip
simply runs them in sequence instead of side-by-side. This isn't really
a programming error but it does reduce the SPE's throughput
considerably.
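A toy model shows how much instruction mix matters. The sketch below assumes just two instruction classes and lets two adjacent instructions share a cycle only when their classes differ. The names insn_class and issue_cycles are made up for this example, and the model deliberately ignores latencies, data dependencies, and any ordering rules the real hardware may impose.

```c
#include <assert.h>
#include <stddef.h>

/* Two classes, following the text: arithmetic on one pipeline,
   logic/memory/flow-control on the other. */
typedef enum { ARITH, OTHER } insn_class;

/* Count issue cycles for an in-order instruction stream. Two adjacent
   instructions pair up only if they belong to different classes;
   otherwise they issue one after the other. */
static size_t issue_cycles(const insn_class *stream, size_t n)
{
    size_t cycles = 0;
    size_t i = 0;

    assert(stream != NULL || n == 0);
    while (i < n) {
        if (i + 1 < n && stream[i] != stream[i + 1])
            i += 2;     /* one of each class: dual issue */
        else
            i += 1;     /* same class (or last one): single issue */
        cycles++;
    }
    return cycles;
}
```

In this model, a perfectly interleaved stream of four instructions takes two cycles, while four arithmetic instructions in a row take four: the same work at half the throughput.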
Internal Data Flow
Unlike the PPE, the SPEs do not have caches. Instead, they each get a 256K
"local store" that only they can see. All code and data for the SPE
must be stored within this 256K local area. In fact, the SPEs cannot
"see" the rest of the chip's address space at all. They can't access
each other's local stores, nor can they access the PPE's caches or other
on-chip or off-chip resources. In effect, each SPE is blind and limited
to just its own little corner of the Cell world.
Why the crippled address map? Each SPE is limited to just a single
memory bank with deterministic access characteristics in order to
guarantee its performance. Off-chip (or even on-chip) memory accesses
take time, sometimes an unpredictable amount of time, and that goes
against the SPEs' purpose. They're designed to be ultra-fast and
ultra-reliable units for processing streaming media, often in real-time
situations where the data can't be retransmitted. By limiting their
options and purpose, Cell's designers gave the SPEs deterministic
performance.
This is where the DMA controllers come in. Each SPE has its own 128-bit
wide DMA controller (64 bits in, 64 bits out) between it and Cell's
local bus. The PPE and all eight SPEs share the same bus, called the
Element Interconnect Bus (EIB). Through this bus each DMA controller
fetches the instructions and data that its attached SPE will need. The
DMA controller also pushes results out onto the shared bus, where they
can be exported off-chip, sent to on-chip peripherals, or cached by the
PPE.
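The resulting programming pattern (DMA a chunk in, compute on it locally, DMA the results out) can be sketched in ordinary C. Everything here is a simulation under assumptions: spe_model, dma_get, dma_put, scale_local, and process_stream are hypothetical names, memcpy stands in for the DMA controller, and the 1,024-float chunk size is arbitrary. Real SPE code would use the vendor's DMA interface instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LS_SIZE      (256u * 1024u)   /* 256K local store, per the text */
#define CHUNK_FLOATS 1024u            /* arbitrary chunk size for this sketch */

typedef struct { uint8_t ls[LS_SIZE]; } spe_model;

/* Simulated DMA: copy from "main memory" into the local store. */
static void dma_get(spe_model *spe, uint32_t ls_addr,
                    const void *main_mem, size_t offset, size_t len)
{
    assert(ls_addr <= LS_SIZE && len <= LS_SIZE - ls_addr);
    memcpy(spe->ls + ls_addr, (const uint8_t *)main_mem + offset, len);
}

/* Simulated DMA: copy results from the local store back out. */
static void dma_put(const spe_model *spe, uint32_t ls_addr,
                    void *main_mem, size_t offset, size_t len)
{
    assert(ls_addr <= LS_SIZE && len <= LS_SIZE - ls_addr);
    memcpy((uint8_t *)main_mem + offset, spe->ls + ls_addr, len);
}

/* The compute step touches only the local store, never main memory. */
static void scale_local(spe_model *spe, uint32_t ls_addr, size_t count, float k)
{
    for (size_t i = 0; i < count; i++) {
        float v;
        memcpy(&v, spe->ls + ls_addr + i * sizeof v, sizeof v);
        v *= k;
        memcpy(spe->ls + ls_addr + i * sizeof v, &v, sizeof v);
    }
}

/* Stream a large array through the local store one chunk at a time. */
static void process_stream(spe_model *spe, float *data, size_t count, float k)
{
    for (size_t done = 0; done < count; done += CHUNK_FLOATS) {
        size_t n = count - done < CHUNK_FLOATS ? count - done : CHUNK_FLOATS;
        dma_get(spe, 0, data, done * sizeof(float), n * sizeof(float));
        scale_local(spe, 0, n, k);
        dma_put(spe, 0, data, done * sizeof(float), n * sizeof(float));
    }
}
```

This sketch performs each transfer and each compute step strictly in sequence. A tuned version would presumably overlap the transfer of one chunk with computation on another, which is where keeping the SPEs fed becomes the hard part.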
The central processor's L1 and L2 caches snoop the EIB, so the caches
are always fully coherent. The SPEs do not snoop the bus; in fact, they
don't monitor bus traffic at all. That means that the central PowerPC
processor is aware of what data the SPEs may transfer but the SPEs are
totally unaware of any traffic amongst their neighbors. Again, this
keeps the SPEs relatively simple and limits interruptions or unwanted
effects on their behavior. If the SPEs need to be made aware of
external data changes, their respective DMA controllers will have to
fetch the updated data. And that, presumably, would be under the control of the
central PPE.
Super Cell
Mere mortals can program the Cell processor but it's a unique
experience. A handful of embedded systems developers already have
experience programming multiprocessor systems; some have even coded
multicore processors. But Cell promises to up the game. Each of the
chip's nine individual processor elements is itself a dual-issue
machine with complex pipeline interlocks, cache-coherence issues, and
synchronization problems. Keeping all eight SPEs fed at once promises
to be a real chore. Yet the results are bound to be spectacular. If
your application can benefit from sustained high-speed floating-point
operations and can be parallelized across two or more SPEs you should
be in for a real treat. That is, once you get the code running.
IBM is working on an "Octopiler" that compiles C code and balances it
across Cell's eight SPEs. Tools like that, and like the ones described
in our companion article on page 18, are absolutely necessary if Cell
is to be a success. To take another example from the video game
industry, Sega's Saturn console was a failure largely because its
four-processor architecture (three SuperH chips and a 68000) was too
difficult to program. Developers working under tight deadlines simply
ignored much of the system's power because they couldn't harness it
effectively. Cell brings that problem in spades. It's an impressive
achievement in computer architecture and semiconductor manufacturing.
Products based on Cell promise to be equally impressive. But bringing
Cell to life will require real software alchemy.
Jim Turley is editor in chief of Embedded Systems Design magazine. You can reach him at jturley@cmp.com.