Gamasutra: The Art & Business of Making Games
Programming the Cell Broadband Engine
July 21, 2006
With nine processor cores, a single Cell processor chip (called a Cell Broadband Engine or CBE) often performs an order of magnitude more work than a traditional single-core chip at the same clock rate. Cell's parallel configuration and performance are seldom seen in traditional CPU architectures for any market--much less the cost-sensitive consumer electronics business.

The CBE, jointly created by Sony, Toshiba, and IBM, distributes its huge computational capacity over two different kinds of processor cores, so its development environment is quite different from that of comparatively conventional homogeneous multiprocessor architectures. Cell programmers need special facilities to help them harness such computational resources effectively.

In this article, I'll introduce a new object format called CBE Embedded SPE Object Format (CESOF). Programmers in the Sony, Toshiba, and IBM Cell Broadband Engine Design Center (STIDC) created the CESOF specification to help Cell programmers integrate the interacting programs for these two different types of processor cores. I'll also introduce the design concept, the structure, and a simple usage example of CESOF.

Multiple Cores in the Cell Processor

Using heterogeneous (that is, different kinds of) processor cores in a multi-core system has become a popular practice in the embedded systems space. For a particular algorithm or application with expected regularity, a specialized and highly optimized circuit usually provides better performance within a smaller chip area and with lower power consumption than general-purpose cores. In fact, embedded systems designers recognize that many of their applications or workloads can benefit from specialized cores such as a single instruction, multiple data (SIMD) engine, a floating-point accelerator, or a direct memory access (DMA) controller.

High-performance computation workloads, modern media-rich applications, and many algorithms in other domains all exhibit a lot of regularity in their tasks. Replacing one or more of the generic processor cores with specialized circuits will likely give a better performance/cost ratio for these applications.

Cell's chip design, shown in Figure 1, strikes a balance by using one generic Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs) to provide a better performance/cost ratio (in terms of chip area and power consumption), particularly for high-performance computing and media processing. The eight SPEs are specialized SIMD cores, each with its own private local memory. The performance/cost ratio is particularly impressive when an algorithm can be distributed over all eight engines at the same time with properly staged data traffic.

[Figure 1. The Cell processor: one Power Processor Element (PPE) and eight Synergistic Processor Elements (SPEs).]

Each SPE can run independently from the others. Its instruction set is designed to execute SIMD instructions efficiently. All SIMD instructions handle 128-bit vector data in different element configurations: byte, half word, word, and quad-word sizes. For example, one SIMD instruction can perform 16 character operations at the same time.

Another important design aspect is the use of the on-chip local memory located next to each SPE. This closeness reduces the distance, and thus the latency, from a processor core to its execution memory space. The address space of the SPE instructions spans only its own local memory; the SPE fetches instructions from the local store, loads data from the local store, and stores data to the local store. SPE instructions cannot "see" the rest of the chip's (or the system's) address space.

The simplicity of this local memory design improves memory-access time, memory bandwidth, chip area, and power consumption, but it does require extra steps for an SPE program to bring external data into the local store. An SPE can't load or store data directly to or from system memory. Instead, it uses a DMA operation to transfer data between system memory and its local store. This is quite different from the general-purpose PPE core, whose load and store instructions access data directly through effective addresses backed by off-chip physical system memory.

As a side note, the internal core of an SPE, excluding the DMA engine, is called a Synergistic Processor Unit (SPU). In software naming conventions, SPU and SPE are sometimes used interchangeably where the distinction is unimportant.

Connecting these nine cores (one PPE and eight SPEs) with the physical memory is a high-speed bus called Element Interconnect Bus (EIB). Through this bus, an SPE DMA engine (not the SPE load and store instructions) transfers data between the system memory and its local store memory.

Developing and combining the code modules for cores with different instruction sets and memory spaces presents a big challenge to conventional programming tools. Programmers need an additional facility, such as CESOF, to glue these heterogeneous code modules together. In the remainder of this article I'll introduce the design concept, the structure, and a simple usage example of CESOF.


