
By Ronen Zohar & Haim Barad
Gamasutra
April 16, 1999

An Introduction to Streaming SIMD Extensions

The Streaming SIMD Extensions are a natural extension of the SIMD-style processing that was introduced with MMX technology. They include three basic categories of instructions: SIMD floating-point instructions, SIMD integer instructions, and memory-related cache-control instructions.

The SIMD integer instructions are extensions to MMX technology and operate on the 64-bit wide MMX technology registers. For the purposes of this article, however, we are primarily interested in the other two instruction categories. The SIMD floating-point instructions operate on 128-bit wide registers, each holding four packed single-precision floating-point values. The Pentium III has eight such new registers, which support 4-wide SIMD processing of single-precision floating-point numbers. The memory-related cache-control instructions help control cache locality and reduce engine latency. These memory hints (e.g., prefetches) tell the processor which data we will need soon, so that it can be fetched in advance and is ready when we need it. Figure 1 shows a diagram with the new 128-bit packed floating-point registers. For more information, read "Optimizing Games for the Pentium III Processor".
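As a rough illustration of the two categories this article relies on, the sketch below adds two float arrays four elements at a time and issues prefetch hints for data a few iterations ahead. It uses the standard SSE intrinsics from `xmmintrin.h`; the function name `add_arrays_sse` and the prefetch distance of 16 floats are assumptions made for this example, not values taken from the article.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_prefetch
#include <cassert>
#include <cstddef>

// Adds two float arrays four elements at a time (hypothetical helper).
// Assumes a, b, and out are 16-byte aligned and n is a multiple of 4.
void add_arrays_sse(const float* a, const float* b, float* out, std::size_t n) {
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        // Hint that data 16 floats ahead will be needed soon.
        // The distance is an illustrative guess and should be tuned by measurement.
        if (i + 16 < n) {
            _mm_prefetch(reinterpret_cast<const char*>(a + i + 16), _MM_HINT_T0);
            _mm_prefetch(reinterpret_cast<const char*>(b + i + 16), _MM_HINT_T0);
        }
        __m128 va = _mm_load_ps(a + i);             // aligned load of four floats
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(out + i, _mm_add_ps(va, vb));  // four additions in one instruction
    }
}
```

A caller would declare its buffers with 16-byte alignment (for example with `alignas(16)`) before passing them in.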

Figure 1. The new 128-bit packed floating-point registers.

Working with Data structures

To begin with, we emphasize vertical parallelism. The algorithms can often be written much as they would be for serial code, except that each operation processes four pieces of data (i.e., four vertices) in parallel. This type of processing is often more efficient in a SIMD environment, since data dependencies often limit the amount of horizontal parallelism you can exploit.
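To make the idea of vertical parallelism concrete, here is a small sketch of a dot product computed for four vertices at once. `Lanes4` is a hypothetical portable stand-in for one 128-bit register, and the lane loop models what a single SIMD instruction would do; neither name comes from the article.

```cpp
#include <cassert>

// Hypothetical stand-in for one 128-bit register: four floats, one per vertex.
struct Lanes4 {
    float v[4];
};

// Vertical dot product: lane i computes the dot product for vertex i.
// The body reads like the scalar formula, and no lane depends on another,
// so no data has to be shuffled across a register.
Lanes4 dot4(const Lanes4& ax, const Lanes4& ay, const Lanes4& az,
            const Lanes4& bx, const Lanes4& by, const Lanes4& bz) {
    Lanes4 r;
    for (int i = 0; i < 4; ++i) {
        r.v[i] = ax.v[i] * bx.v[i] + ay.v[i] * by.v[i] + az.v[i] * bz.v[i];
    }
    return r;
}
```

A horizontal version would instead keep the x, y, and z of one vertex in a single register and would have to sum across its lanes, which is where the data dependencies mentioned above come in.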

Now, let's look at the data. A vertex pool is commonly represented as an array of structures (AOS). Each structure contains data for one vertex (such as its position, normal, color, texture coordinates, and so on), and the entire pool is represented as an array of all these structures. However, this method of representing the vertex pool makes the transfer to SIMD difficult. Why? Well, keep in mind that data should be 16-byte aligned to avoid performance penalties, and that the Pentium III requires 16-byte alignment when reading and writing the Streaming SIMD Extensions registers (since these new registers are 128 bits wide). Because the size of a vertex structure is not always a multiple of 16 bytes, you can always align the first vertex in the array, but the second and later vertices will not necessarily be aligned.
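The alignment problem is easy to see with a concrete layout. The sketch below assumes a hypothetical 36-byte vertex (position, normal, packed color, and texture coordinates); the struct and helper names are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical AOS vertex: 3 + 3 + 1 + 2 = 9 four-byte fields, 36 bytes in total.
struct AosVertex {
    float x, y, z;        // position
    float nx, ny, nz;     // normal
    std::uint32_t color;  // packed diffuse color
    float u, v;           // texture coordinates
};

static_assert(sizeof(AosVertex) == 36, "no padding expected with 4-byte fields");

// Byte offset of vertex i from the start of the array.
constexpr std::size_t vertex_offset(std::size_t i) {
    return i * sizeof(AosVertex);
}

// True when vertex i starts on a 16-byte boundary, assuming the array base is aligned.
constexpr bool starts_aligned(std::size_t i) {
    return vertex_offset(i) % 16 == 0;
}
```

With this layout, vertex 0 is aligned, vertices 1 through 3 start at offsets 36, 72, and 108, and only every fourth vertex lands on a 16-byte boundary again.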

On the other hand, you could represent your data as a structure of arrays (SOA). In this method, each element of vertex data is stored in its own array, and all of these arrays are used together to form the vertex pool (which takes the format of xxxx…, yyyy…, zzzz…, and so on). Although this method may seem a bit strange, it has advantages. First, if the start of each array is aligned, then the entire data structure is aligned. Second, the regular non-SIMD algorithm can be used in a vertical fashion; instead of dealing with one vertex per iteration, the algorithm can handle four vertices per iteration. There are some drawbacks to this method, though. The main drawback is that you have to manage a set of pointers or indices, one for each element of the vertex data. Another drawback is that because the algorithm handles four data elements in one iteration, all of the constants and parameters have to be expanded four times, which requires more time and storage space.
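A minimal sketch of the SOA approach follows, written in portable C++ so that the lane loop stands in for the 4-wide instructions. The names `SoaPositions`, `Expanded4`, `expand`, and `scale_positions` are hypothetical; the point is only to show the two drawbacks described above: one pointer per vertex element, and a scalar constant that must be expanded to four copies.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical SOA pool: one array (and therefore one pointer) per vertex element.
struct SoaPositions {
    float* x;
    float* y;
    float* z;
};

// A scalar constant expanded four times so that it fills every lane.
struct Expanded4 {
    float lane[4];
};

Expanded4 expand(float s) {
    return Expanded4{{s, s, s, s}};
}

// Scales `count` positions, four vertices per iteration.
// Assumes count is a multiple of 4.
void scale_positions(SoaPositions p, std::size_t count, const Expanded4& k) {
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        // The lane loop models three 4-wide multiplies (one each for x, y, z).
        for (int lane = 0; lane < 4; ++lane) {
            p.x[i + lane] *= k.lane[lane];
            p.y[i + lane] *= k.lane[lane];
            p.z[i + lane] *= k.lane[lane];
        }
    }
}
```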

There is a third method of representing a vertex pool. This method is a hybrid of the previous two, and involves dividing the vertex pool into small SIMD portions (usually four or eight vertices per portion). Each portion is represented as an SOA, and the entire vertex pool is an array of these portions. Thus, if the portion size is four, the vertex pool looks like this: xxxxyyyyzzzz, xxxxyyyyzzzz, … and so on. This method retains the alignment, lets us deal with four vertices in each iteration without changing the algorithm, and requires us to manage only one pointer. We will use this method in this article.
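The hybrid layout can be sketched with plain float arrays. In the example below, `PositionBlock`, `BlockIndex`, `locate`, and `set_position` are hypothetical names; only positions are shown, and a portion size of four is assumed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical hybrid portion: four vertices stored as xxxx yyyy zzzz.
struct PositionBlock {
    float x[4];
    float y[4];
    float z[4];
};

// Where a vertex lives inside the pool.
struct BlockIndex {
    std::size_t block;  // which four-vertex portion
    std::size_t lane;   // position inside that portion (0..3)
};

// Maps a flat vertex index to its portion and lane.
BlockIndex locate(std::size_t vertex) {
    return BlockIndex{vertex / 4, vertex % 4};
}

// Writes one vertex position into the pool through a single base pointer.
void set_position(PositionBlock* pool, std::size_t vertex, float x, float y, float z) {
    const BlockIndex at = locate(vertex);
    pool[at.block].x[at.lane] = x;
    pool[at.block].y[at.lane] = y;
    pool[at.block].z[at.lane] = z;
}
```

Because each portion here is 48 bytes, a multiple of 16, every portion in an aligned array starts on a 16-byte boundary.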

The data structures we will work with are defined as follows:

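#include <fvec.h>  // Intel C++ class library; declares F32vec4

struct SSimdTex {
    F32vec4 u, v;       // texture coordinates of four vertices
};

struct SSimdVector {
    F32vec4 x, y, z;    // each member holds one component of four vertices
};

struct SSimdVertex {
    SSimdVector pos;    // positions of four vertices
    SSimdVector norm;   // normals of four vertices
    SSimdTex tex;
};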

The vertex pool is defined as:

_MM_ALIGN16 SSimdVertex vertex_pool[(MAX_NUM_VERTICES + 3) / 4];
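The array size in the pool definition is a ceiling division: each element holds four vertices, so the count is rounded up to the next multiple of four. The helper names below are hypothetical and only illustrate the arithmetic.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // vertices per pool element

// Number of four-vertex elements needed to hold n vertices (rounds up).
constexpr std::size_t blocks_needed(std::size_t n) {
    return (n + kLanes - 1) / kLanes;
}

// Unused lanes in the last element. They are processed along with the real
// vertices, so it is probably wise to fill them with harmless values.
constexpr std::size_t padding_lanes(std::size_t n) {
    return blocks_needed(n) * kLanes - n;
}
```

For example, five vertices need two elements, and the second element carries three padding lanes.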

Since the algorithm operates on four vertices in each iteration, all of the external data (such as the transform matrix and the lighting and material information) must be expanded so that each scalar value fills all four lanes. To keep things simple, we created structures built from F32vec4 members to hold the expanded data. For example, the transform matrix is defined as:

struct SSimd4X4Matrix {
    F32vec4 _11, _12, _13, _14;
    F32vec4 _21, _22, _23, _24;
    F32vec4 _31, _32, _33, _34;
    F32vec4 _41, _42, _43, _44;
};
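To show how such an expanded matrix might be used, the sketch below transforms four points at once. It is a portable illustration rather than the article's implementation: `Float4` is a hypothetical stand-in for `F32vec4`, the other names are invented, and a row-vector convention (v' = v * M, with the translation in the fourth row) is assumed because the `_11` to `_44` naming suggests it.

```cpp
#include <cassert>

// Hypothetical portable stand-in for F32vec4: four floats with lane-wise operators.
struct Float4 {
    float v[4];
};

Float4 splat(float s) { return Float4{{s, s, s, s}}; }  // expand a scalar to four lanes

Float4 operator*(const Float4& a, const Float4& b) {
    Float4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}

Float4 operator+(const Float4& a, const Float4& b) {
    Float4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

struct Vec3x4 { Float4 x, y, z; };      // four positions, one component per member
struct ExpandedMatrix { Float4 m[4][4]; };  // every scalar element expanded to four lanes

// Expands a scalar 4x4 matrix so that each element fills all four lanes.
ExpandedMatrix expand_matrix(const float (&s)[4][4]) {
    ExpandedMatrix e;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            e.m[r][c] = splat(s[r][c]);
    return e;
}

// Transforms four points (w assumed to be 1) with the assumed row-vector convention.
Vec3x4 transform_points(const Vec3x4& p, const ExpandedMatrix& M) {
    Vec3x4 out;
    out.x = p.x * M.m[0][0] + p.y * M.m[1][0] + p.z * M.m[2][0] + M.m[3][0];
    out.y = p.x * M.m[0][1] + p.y * M.m[1][1] + p.z * M.m[2][1] + M.m[3][1];
    out.z = p.x * M.m[0][2] + p.y * M.m[1][2] + p.z * M.m[2][2] + M.m[3][2];
    return out;
}
```

The transform reads exactly like the scalar formula; the expansion step is what allows every multiply and add to apply to four vertices at once.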

