With
nine processor cores, a single Cell processor chip (called a Cell
Broadband Engine or CBE) often performs an order of magnitude more work
than a traditional single-core chip at the same clock rate. Cell's
parallel configuration and performance are seldom seen in traditional
CPU architectures for any market--much less the cost-sensitive consumer
electronics business.
The CBE, jointly created by Sony, Toshiba, and IBM, distributes its
huge computational capacity over two different kinds of processor
cores, so its development environment is quite different from that of
comparatively conventional homogeneous multiprocessor architectures.
Cell programmers need special facilities to help them harness these
computational resources effectively.
In this article, I'll introduce a new object format called CBE Embedded
SPE Object Format (CESOF). Programmers in the Sony, Toshiba, and IBM
Cell Broadband Engine Design Center (STIDC) created the CESOF
specification to help Cell programmers integrate the interacting
programs for these two different types of processor cores. I'll also
introduce the design concept, the structure, and a simple usage sample
of CESOF.
Multiple Cores in the Cell Processor
Using heterogeneous (that is, different kinds of) processor cores in a
multi-core system has become a popular practice in the embedded systems
space. For a particular algorithm or application with expected
regularity, a specialized and highly optimized circuit usually provides
better performance within a smaller chip area and with lower power
consumption than general-purpose cores. In fact, embedded systems
designers recognize that many of their applications or workloads can
benefit from specialized cores such as a single instruction, multiple
data (SIMD) engine, a floating-point accelerator, or a direct memory
access (DMA) controller.
High-performance computation workloads, modern media-rich applications,
and many algorithms in other domains all exhibit a lot of regularity in
their tasks. Replacing one or more of the generic processor cores with
specialized circuits will likely give a better performance/cost ratio
for these applications.
Cell's chip design, shown in Figure 1, strikes a balance by using one
generic Power Processor Element (PPE) and eight Synergistic Processor
Elements (SPEs) to provide a better performance/cost ratio (in terms of
chip area and power consumption), particularly for high-performance
computing and media processing. The eight SPEs are specialized SIMD
cores, each with its own private local memory. The performance/cost
ratio is particularly impressive when an algorithm can be distributed
over all eight engines at the same time with properly staged data
traffic.
Each SPE can run independently from the others. Its instruction set is
designed to execute SIMD instructions efficiently. All SIMD
instructions operate on 128-bit vector data in different element
configurations: byte, halfword, word, and quadword sizes. For
example, because a 128-bit vector holds sixteen 8-bit characters, one
SIMD instruction can perform 16 character operations at the same time.
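To make the "16 character operations at once" idea concrete, here is a
minimal sketch in portable C. It only emulates the behavior: the type
vec_u8x16 and the functions vec_add_u8 and lanes_for_element_size are
hypothetical names chosen for illustration, not part of any Cell SDK,
and the loop stands in for what an SPE would do in a single
instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* One 128-bit vector is 16 bytes wide. */
#define VEC_BYTES 16

/* Hypothetical stand-in for a 128-bit vector register viewed as
   16 unsigned characters. Real hardware uses a native vector type. */
typedef struct {
    uint8_t lane[VEC_BYTES];
} vec_u8x16;

/* Emulates one byte-wise SIMD add. Each of the 16 lanes is added
   independently and wraps modulo 256; no carry crosses lanes. */
vec_u8x16 vec_add_u8(vec_u8x16 a, vec_u8x16 b)
{
    vec_u8x16 r;
    for (size_t i = 0; i < VEC_BYTES; i++) {
        r.lane[i] = (uint8_t)(a.lane[i] + b.lane[i]);
    }
    return r;
}

/* Number of elements one 128-bit vector holds for a given element
   size in bytes: 16 bytes, 8 halfwords, 4 words, or 1 quadword. */
size_t lanes_for_element_size(size_t element_bytes)
{
    return VEC_BYTES / element_bytes;
}
```

Calling vec_add_u8 once processes all 16 characters, which is the work
a scalar core would need 16 separate additions to do.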
Another important design aspect is the use of the on-chip local memory
located next to each SPE. This closeness reduces the distance, and thus
the latency, from a processor core to its execution memory space. The
address space of the SPE instructions spans only its own local memory;
the SPE fetches instructions from the local store, loads data from the
local store, and stores data to the local store. SPE instructions
cannot "see" the rest of the chip's (or the system's) address space.
The simplicity of this local memory design improves memory-access time,
memory bandwidth, chip area, and power consumption, but it does require
extra steps for an SPE program to bring external data into the local
store. An SPE can't load data from or store data to the system memory
directly. Instead, it uses a DMA operation to transfer data between the
system memory and its local store. This is quite different from the
general-purpose PPE core, whose load and store instructions access
data directly at effective addresses backed by off-chip physical
system memory.
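The get, compute, put pattern described above can be sketched as a
small simulation in portable C. This is not Cell SDK code: sim_spe,
sim_dma_get, sim_dma_put, and sim_spe_double_bytes are hypothetical
names, memcpy stands in for the DMA engine, and the 256 KB local store
size is a hardware detail this article does not state. The point is
only that the SPE-side routine touches nothing but the local store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed local store size per SPE (256 KB). */
#define LOCAL_STORE_SIZE (256u * 1024u)

/* Hypothetical model of one SPE: just its private local store. */
typedef struct {
    uint8_t local_store[LOCAL_STORE_SIZE];
} sim_spe;

/* Returns 1 if [ls_offset, ls_offset + size) fits in the local store. */
static int range_ok(size_t ls_offset, size_t size)
{
    return ls_offset <= LOCAL_STORE_SIZE &&
           size <= LOCAL_STORE_SIZE - ls_offset;
}

/* Simulated DMA "get": system memory -> local store.
   Returns 0 on success, -1 if the range is outside the local store. */
int sim_dma_get(sim_spe *spe, size_t ls_offset,
                const uint8_t *system_mem, size_t size)
{
    if (!range_ok(ls_offset, size)) return -1;
    memcpy(&spe->local_store[ls_offset], system_mem, size);
    return 0;
}

/* Simulated DMA "put": local store -> system memory. */
int sim_dma_put(const sim_spe *spe, size_t ls_offset,
                uint8_t *system_mem, size_t size)
{
    if (!range_ok(ls_offset, size)) return -1;
    memcpy(system_mem, &spe->local_store[ls_offset], size);
    return 0;
}

/* SPE-side work: reads and writes the local store only, mirroring
   the rule that SPE loads and stores cannot see system memory. */
int sim_spe_double_bytes(sim_spe *spe, size_t ls_offset, size_t count)
{
    if (!range_ok(ls_offset, count)) return -1;
    for (size_t i = 0; i < count; i++) {
        uint8_t v = spe->local_store[ls_offset + i];
        spe->local_store[ls_offset + i] = (uint8_t)(v * 2);
    }
    return 0;
}
```

A caller would invoke sim_dma_get, then sim_spe_double_bytes, then
sim_dma_put, in that order. On real hardware the transfers run
asynchronously, so an SPE program must also wait for each one to
complete, a step this sketch leaves out.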
As a side note, the internal core of an SPE, without the DMA engine, is
called a Synergistic Processor Unit (SPU). In the naming conventions of
software code, SPU and SPE are sometimes used interchangeably where the
distinction is not important.
A high-speed bus called the Element Interconnect Bus (EIB) connects
these nine cores (one PPE and eight SPEs) with the physical memory.
Through this bus, an SPE's DMA engine (not the SPE load and store
instructions) transfers data between the system memory and its local
store.
Developing and combining the code modules for cores with different
instruction sets and memory spaces presents a big challenge to
conventional programming tools. Programmers need an additional
facility, such as CESOF, to glue these heterogeneous code modules
together. In the remainder of this article I'll introduce the design
concept, the structure, and a simple usage example of CESOF.