The Science Behind Kinects or Kinect 1.0 versus 2.0
by Daniel Lau on 11/27/13 10:20:00 am

The following blog post, unless otherwise noted, was written by a member of Gamasutra’s community.
The thoughts and opinions expressed are those of the writer and not Gamasutra or its parent company.


Machine vision is the area of research focused on measuring physical objects using a video camera. Sometimes confused with the broader area of computer vision, machine vision is typically associated with manufacturing and detecting defects in parts on an assembly line. Computer vision is interested in any and all applications that involve teaching a computer about its physical surroundings via imaging sensors. In either case, 3-D imaging sensors such as Microsoft's Kinect camera are having a profound impact because, at an absurdly low price, they solve a host of problems associated with perspective distortion in traditional 2-D cameras. Perspective distortion refers to how things look smaller as they get farther away.


Figure 1: Illustration of the stereo-imaging method.



Of the many methods of measuring depth with a camera, the Kinect 1.0 sensor falls within a broad range of technologies that rely on triangulation. Triangulation is the process used by stereo-imaging systems, and it is also how the human visual system (i.e. two eyes) works. The process is illustrated in Fig. 1, where I show two points in space at varying distances from the camera. Looking at the two images as viewed by the stereo-camera pair, the blue sphere, being closer to the cameras, has a greater disparity in position from camera A to camera B. That is, the blue sphere appears to move almost three-quarters of the cameras' fields of view while the red sphere moves only half of this distance. This difference in travel distance between the red and blue spheres is a phenomenon known as parallax: closer objects produce greater degrees of parallax than distant objects.
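
To make the relationship concrete, depth falls out of a measured disparity through the standard pinhole stereo relation Z = f * B / d, where f is the focal length in pixels, B is the baseline between the two viewpoints, and d is the disparity in pixels. Here is a minimal Python sketch using illustrative numbers; the focal length and baseline below are made up, not the Kinect's actual calibration values.

# Minimal sketch: depth from stereo disparity (Z = f * B / d).
# The focal length and baseline are illustrative, not real Kinect calibration.
def depth_from_disparity(disparity_px, focal_px=580.0, baseline_m=0.075):
    """Return depth in meters for a disparity measured in pixels."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# A nearby object produces a large disparity; a distant one, a small disparity:
print(depth_from_disparity(43.5))   # ~1.0 m
print(depth_from_disparity(10.9))   # ~4.0 m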

Of course, the Kinect 1.0 sensor doesn't have two cameras performing triangulation. Instead, it relies on triangulation between a near-infrared camera and a near-infrared laser source to perform a process called structured light. Structured light is one of the earliest methods of 3-D scanning: a single light stripe mechanically sweeps across a target object, and from the large sequence of images taken as the stripe sweeps across the target, a complete 3-D surface can be reconstructed. Figure 2 shows how this would work for a single frame of video, where the position of the laser stripe appears to shift left and right with depth such that the farther to the right it appears, the closer the target surface is to the camera. Note that with just the single laser stripe, a single frame of video can only reconstruct a single point of the target surface per row of the image sensor. So a 640x480 camera could only reconstruct, at most, 480 unique points in space. Hence, we need to sweep the laser so that we can generate many points in each row.
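
As a rough illustration of the single-stripe case, the Python sketch below finds the brightest column in each row of one frame (the laser stripe) and converts that horizontal shift into a depth, producing at most one point per row, exactly the 480-point-per-frame limit just described. The reference column, focal length, and baseline are hypothetical values for illustration only.

import numpy as np

# Rough sketch of single-stripe structured light: one depth point per row.
# frame: a 480x640 grayscale image of the scene lit by the laser stripe.
# ref_col: the column where the stripe would land on a reference plane.
def stripe_depths(frame, ref_col=320, focal_px=580.0, baseline_m=0.075):
    depths = np.full(frame.shape[0], np.nan)
    for row in range(frame.shape[0]):
        col = np.argmax(frame[row])        # brightest pixel = stripe location
        disparity = col - ref_col          # stripe shifts right for closer surfaces
        if disparity > 0:
            depths[row] = focal_px * baseline_m / disparity
    return depths                          # at most 480 points for a 640x480 frame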


Figure 2: Illustration of the structured-light imaging method using a single laser stripe.



Obviously, for our sweeping laser stripe, we would need the target object to stay still during the scanning process. So this doesn't really work in real-time systems like the Kinect 1.0 camera, where we want to measure a moving subject. Instead, the Kinect 1.0 sensor makes use of a pseudo-random dot pattern produced by the near-infrared laser source that illuminates the entire field of view, as illustrated in Fig. 3. Given this image, the processing hardware inside the camera looks at small windows of the captured image and attempts to find the matching dot pattern in the projected pattern.
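
A simplified way to picture this window matching is block correlation: take a small patch of the captured IR image, slide it along the same row of the stored reference pattern, and report the horizontal shift with the best match as the disparity. The Python sketch below captures the flavor of that search; the window size, search range, and scoring are illustrative guesses, not the actual matching algorithm inside the Kinect.

import numpy as np

# Simplified sketch of dot-pattern window matching: for a small window of
# the captured IR image, search along the same row of the stored reference
# pattern and return the horizontal shift (disparity) with the best match.
def match_window(captured, reference, row, col, win=9, max_shift=64):
    half = win // 2
    patch = captured[row - half:row + half + 1, col - half:col + half + 1]
    best_shift, best_score = None, -np.inf
    for shift in range(max_shift):
        cand = reference[row - half:row + half + 1,
                         col + shift - half:col + shift + half + 1]
        if cand.shape != patch.shape:      # ran off the edge of the pattern
            break
        score = np.sum(patch * cand)       # simple (unnormalized) correlation
        if score > best_score:
            best_score, best_shift = score, shift
    return best_shift                      # disparity in pixels, or None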


Figure 3: Captured IR image from Microsoft Kinect 1.0 Sensor.



Of course, you're probably wondering why the perspective distortion phenomenon doesn't make dots look smaller and closer together as the reflecting surface gets farther away from the camera. It doesn't because the camera and the laser projector are epipolar rectified. That is, the camera and projector have matching fields of view: the laser projector emits a cone of light that grows ever larger as you get farther from its source, and that cone expands at the same rate as the lines of sight of the camera's pixels. What this means in the captured image is that dots in the projected pattern appear to move left to right with distance, not up and down. So by simply tracking a dot's horizontal coordinate, the Kinect can tell you how far that dot is from the camera sensor, i.e. that pixel's observed depth.

The problem for the Kinect 1.0 sensor lies in the manner in which it relies on small windows of the captured image. That is, it needs to detect individual dots, and then it needs to find neighboring dots as well. From a constellation of these dots, it can identify the exact constellation of points in the projected dot pattern. Without a constellation of points, there is no way for the Kinect processor to uniquely identify a dot in the projected pattern. We call this an ambiguity, and it means no depth estimate can be derived. For computer gaming, this isn't really a problem because body parts are large enough that the constellations fit inside the pixels forming your arm. For measuring thin objects like hair or, perhaps, a utility cord as thin as a single image pixel, this is a significant obstacle, and it's the major roadblock that the Kinect 2.0 attempts to address.

The Microsoft Kinect 2.0 sensor relies upon a novel image sensor that indirectly measures the time it takes for pulses of laser light to travel from a laser projector, to a target surface, and then back to an image sensor. How is this possible? Quite easily, if you consider that, in the time it takes for one complete clock cycle of a 1 GHz processor, a pulse of light travels about 1 foot. That means that if I can build a stopwatch that runs at 10 GHz, I can easily measure the round-trip travel distance of a pulse of light to within 0.10 feet. If I pulse the laser and make many measurements over a short period of time, I can increase my precision further; LIDAR systems built on this principle can measure distances to within less than a centimeter at ranges over one kilometer.
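
The back-of-the-envelope arithmetic is just distance = speed of light times time, halved for the round trip. Here is a tiny Python sketch of those numbers:

# Back-of-the-envelope time-of-flight arithmetic.
C_FT_PER_S = 9.836e8               # speed of light: roughly 0.98 ft per nanosecond

def one_way_distance_ft(round_trip_s):
    """Distance to the target given the measured round-trip travel time."""
    return C_FT_PER_S * round_trip_s / 2.0

# One tick of a 1 GHz clock (1 ns) of round-trip time is about half a foot one way;
# a 10 GHz "stopwatch" resolves the round-trip path to roughly a tenth of a foot.
print(one_way_distance_ft(1e-9))   # ~0.49 ft
print(C_FT_PER_S * 1e-10)          # ~0.098 ft of round-trip path per 10 GHz tick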

What the Kinect 2.0 sensor does is take a pixel and divide it in half. Half of this pixel is then turned on and off really fast such that, when it is on, it is absorbing photons of laser light, and when it is off, it rejects the photons. The other half of the pixel is doing the same thing; however, it's doing it 180 degrees out of phase from the first half such that, when the first half is on, it's off, and when the first half is off, it's on. At the same time this is happening, a laser light source is also being pulsed in phase with the first pixel half such that, if the first half is on, so is the laser, and if the pixel half is off, the laser will be too.

As illustrated by the timing diagram of Fig. 4, suppose we aim the laser source directly at the camera in very close proximity; then the time it takes for the laser light to leave the laser and land on the sensor is basically 0 seconds. This is depicted in Fig. 4 by the top row of light pulses (red boxes) being in perfect alignment with the gray columns. As such, the laser light will be absorbed by the first half of all camera pixels, since these halves are turned on, and rejected by the second halves, since these halves are turned off. Now suppose I move the laser source back one foot. Then the laser light will arrive one 1 GHz clock cycle after it left the source, as depicted by the second row of laser pulses in Fig. 4. So light photons that left the laser just as it was turned on will arrive just after the first halves of the camera pixels are turned on, meaning that they will still be absorbed by the first halves and rejected by the second.


Figure 4: Illustration of the indirect time-of-flight method.



Photons leaving the laser just as it is turned off will then arrive just after the camera pixels' second halves are turned on, meaning they will be rejected by the first halves and absorbed by the second. That means that the total amount of light absorbed by the first halves will decrease slightly while that of the second halves will increase slightly. As we move the laser source even farther away from the camera sensor, more and more of the photons emitted by the laser source will arrive at the camera sensor while the second halves are turned on, meaning that the second-half recordings will get larger and larger while the first-half recordings get smaller and smaller. After several milliseconds of exposure, the two total amounts of photons recorded by the two halves are compared. As more and more total photons are recorded by the second halves of the pixels compared to the first, we can assume the round-trip distance that the light traveled is larger. Now, it's important to have two halves record the incoming laser light because the target surface may absorb some of the laser light. If it does, then the total number of photons reflected back will not be equal to the total number that was projected, but this affects both pixel halves equally. So it's the ratio of photons recorded by the two halves that carries the depth information, not the total number recorded by either side.
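
Put as a formula, within one pulse width the fraction of returned photons landing in the second half's on-window grows with the round-trip travel time, so the ratio of the two photon counts gives distance regardless of how reflective the surface is. The Python sketch below shows the textbook pulsed indirect time-of-flight relation with an invented pulse width; it is not a claim about the exact formula implemented in the Kinect 2.0 silicon.

# Sketch of the two-bucket idea: the fraction of returned photons caught by
# the out-of-phase half grows with round-trip time, so the ratio gives depth.
C_M_PER_S = 2.998e8

def distance_from_buckets(q1, q2, pulse_width_s=50e-9):
    """q1, q2: photon counts from the in-phase and out-of-phase pixel halves."""
    total = q1 + q2
    if total == 0:
        return float("nan")                      # nothing came back (no target)
    round_trip_s = pulse_width_s * (q2 / total)  # delay as a fraction of a pulse
    return C_M_PER_S * round_trip_s / 2.0        # one-way distance in meters

# A darker surface returns fewer photons overall, but the ratio is unchanged:
print(distance_from_buckets(8000, 2000))   # ~1.5 m
print(distance_from_buckets(800, 200))     # same ~1.5 m with 10x less light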

At some point, though, the travel distance of the laser light might be so long that laser photons arrive so late at the sensor that they overshoot the pixels' second halves' on-window and arrive during the first halves' next on-window, as depicted by the third row of laser pulses in Fig. 4. This results in an ambiguity, which is resolved by increasing the time that the pixel halves are turned on, giving more time for the light to travel round trip and land inside the second halves' on-window. Of course, it also means that it will be harder to detect small changes in travel distance, since all sensors have limited precision as well as some amount of thermal noise (i.e. free electrons floating through the semiconductor lattice) that looks like light photons. So what Kinect 2.0 does is take two measurements, where the first measurement is a low-resolution estimate with no ambiguities in distance. The second measurement is then taken with high precision, using the first estimate to eliminate any ambiguities. Of course, depending on how fast the sensor works, we can always take additional measurements with greater and greater degrees of precision.
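
One way to picture the coarse-then-fine scheme is as unwrapping: the precise measurement repeats (wraps around) every short unambiguous interval, and the coarse measurement tells you which repetition you are in. A small Python sketch of that logic, with invented range values:

# Sketch of resolving range ambiguity with a coarse and a fine measurement.
# The fine measurement wraps around every fine_range_m meters; the coarse
# measurement is unambiguous but imprecise. The numbers are invented.
def unwrap_depth(coarse_m, fine_m, fine_range_m=1.875):
    """Pick the fine-measurement candidate closest to the coarse estimate."""
    n = round((coarse_m - fine_m) / fine_range_m)   # number of whole wraps
    return fine_m + n * fine_range_m

# Coarse says "about 4.1 m"; the precise-but-wrapped reading is 0.33 m:
print(unwrap_depth(4.1, 0.33))   # 4.08 m  (0.33 + 2 * 1.875)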

Now, while all of this time-of-flight business sounds really cool, the Kinect 2.0 is even cooler because the sensor also has built-in ambient light rejection: each pixel individually detects when it is oversaturated with incoming ambient light and resets itself in the middle of an exposure. The Kinect 1.0 sensor has no means of rejecting ambient light and, as such, cannot be used in environments with strong near-infrared light sources (e.g. sunlight). In fact, the Kinect 2.0 sensor's light rejection is one of the reasons why its original developers considered using the system in automotive applications for things like a rear-view camera.

For gaming, this process of indirectly measuring the time of flight of a laser pulse allows each pixel to independently measure distance, whereas Kinect 1.0 has to measure distance using neighborhoods of pixels. Kinect 1.0 could not measure distances in the spaces between laser spots. And this has an impact on depth resolution: Kinect 1.0 has been cited as having a depth resolution limit of around 1 centimeter. Kinect 2.0 is limited by the speed at which it can pulse its laser source, with shorter pulses offering higher degrees of depth precision, and it can pulse that laser at really short intervals. What Kinect 1.0 has that Kinect 2.0 doesn't is that it can rely on off-the-shelf camera sensors and can run at any frame rate, whereas Kinect 2.0 uses a unique sensor that is very expensive to manufacture. Only a large corporation like Microsoft and only a high-volume market like gaming could achieve the economies of scale needed to bring this sensor into the home at such an affordable price. Considering the technology involved, one might say absurdly low price.


Of course, over the next couple of months the differences between the two sensors are going to become apparent as researchers, such as myself, get our hands on the new sensor and have the opportunity to see when and where the two systems are most appropriate. At present, I'm hard at work developing machine vision systems for precision dairy farming using the Kinect 1.0 sensor, but I'm doing so knowing that I'll need the precision of the new sensor.


Stanley Rosenbaum
Great down to earth explanation of the differences between the two sensors!

It is definitely a help in understanding what they do.

Joel Bennett
Good explanation.

As someone who is also working with the Kinect for non-gaming purposes, I completely agree - it will be interesting to see how well it holds up. From what I've seen, it does seem to have a much wider field of view, but also seems to suffer from significant noise toward the outside ranges of that field of view. For my purposes (low-cost 3D scanning), the large field of view really doesn't help much. I'd be curious to hear more about how you are using it with dairy farming. (For some reason I am picturing automatic milking machines that locate and latch onto the udders...)

Daniel Lau
The milking machine is an actual product and an impressive one at that. I have the cameras mounted on the ceiling outside the milking parlor so I can record the cows leaving the parlor. The cows are black and white, so it's hard to observe them and separate the cow's body pixels from the floor and walls of the hallway. Using the Kinect, I can do adaptive green-screening, and then I can merge the images to produce a complete 3-D model of the animal. And from its markings, I can identify which cow it is. Over many days of this, I can observe changes in body shape that might be indicative of motion concerns and changes in body weight.
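
Roughly, the depth-based green-screening boils down to keeping only the pixels whose depth places them above the floor; a minimal fixed-threshold sketch looks like the Python below (the adaptive version adjusts the cutoffs per scene, and the numbers here are invented for illustration, not values from my actual system).

# Illustrative sketch of depth-based "green-screening": keep only the pixels
# whose depth places them between the ceiling-mounted camera and the floor.
# depth_m: a NumPy array of per-pixel depths in meters.
def segment_animal(depth_m, near_m=1.0, far_m=2.5):
    """Return a boolean mask of pixels likely to belong to the animal."""
    return (depth_m > near_m) & (depth_m < far_m)

# mask = segment_animal(depth_frame)   # depth_frame: per-pixel depth in meters
# animal_pixels = color_frame[mask]    # pull out just the animal for modeling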

Ernest Adams
This is incredibly cool. Seems kind of a waste to use it on games... :)

Kostas Zarifis
Fascinating stuff, thanks for sharing Daniel

Merc Hoffner
Hi Daniel. I hope you're still reading this - I missed the article at first - really really great, very informative.

I was wondering if you could answer some of my technical questions? Is the random structured lightfield in the Kinect 1 sensor actually random or pseudo random? What I mean to say is, is the pattern pre-prepared and stored in some kind of lookup-table, identical for every Kinect sold? Or is it different for every unit requiring custom 'calibration/programming' for each unit? Or is it different every time, requiring some kind of internal image calibration? How exactly is the pattern generated? Some kind of filter in the laser? To achieve such a high contrast speckle effect without defocusing at arbitrary distances I assume the laser has to replicate an ultra-bright point-source - how do they manage the heat? And if the hypothetical filter is so close to the source and it's dealing with a relatively large wavelength wouldn't the tiny image structures create diffraction limit problems?

Also with Kinect 2.0, with such a short absorbance duration, each pixel must capture incredibly few photons. Does this mean the sensors are incredibly sensitive? Meaning the pixels have to have a large surface area and thus a low resolution or a very expensive piece of silicon? And/or as I think you alluded to, if a huge number of short duration samples is statistically accumulated to boost the signal and filter noise, does that mean some kind of on-board accumulation logic running at 10+ GHz? It sounded like there was a lot of logic overall - how can that all run at these ultra-high speeds without something crazy like a liquid nitrogen setup? Also, if the traditional approach to keep everything in focus is to keep the optics and the sensor small, how does that marry up with the competing drive to make the sensor big? With near infra-red is the sensor pushing into diffraction limitations?

Finally, I heard the market analysis guys say Kinect 2 costs $74 to build. I figure they have no experience with commercial TOF parts, and the cheapest lowest resolution sensors I've seen go upwards of $3000, and even a Microsoft guy said Kinect 2 costs nearly as much to build as the rest of the box, which would still represent an enormous cost reduction compared to academic/industrial units, right? Are the analysts way off base? (could you even tell us?).

Sorry, a lot of questions. Thanks for any kind of insight you can provide.

Daniel Lau
The pattern is a repeating (tiled) pseudo-random dot pattern specifically designed for this application and is created by reflecting the laser source off a diffraction grating. The laser has to be class 1 for consumer use. Diffraction gratings are very inexpensive to make in quantity. For the most part, there is probably very little difference between any two sensors coming off the assembly line. I'm not particularly impressed with how well the IR and RGB cameras are aligned with one another, but I wouldn't be surprised if there were a simple calibration process used on each sensor before it ships from the factory. There are warnings that if you drop the unit on the floor, you could lose calibration. At under 1 mW, there isn't much heat to worry about, and the dots aren't so small that you would see diffraction issues with the laser spots.

The Kinect 2 uses a custom CMOS camera sensor. CMOS camera sensors are readily available and run at rates of 3000+ frames per second. The turning on and off of the pixels is an impressive innovation in camera design; however, it's not like a clock signal that works its way through a transistor network. It's more like a wire that runs down the center of each column of pixels. When the wire is on, one side of each pixel is on, and when it's off, the other side of the pixel is on. So the rest of the camera circuitry thinks it's a traditional camera with moderate exposure times like 15 ms. But because of this pulsating wire, each half of the pixel is only on for half that window.

Intel has a time-of-flight sensor that is being manufactured and sold by Creative. That sensor is under $150 and is being sold at a profit. Do you think Creative can afford to sell anything at a loss or would have any way to make money after the initial sale like MS does with games? So price wise, that $74 unit price for the Kinect 2 sounds about right. As for the engineering costs, I'm sure the Kinect 2 probably did cost the same in engineering as the console simply because the console is just taking hardware off the shelf. The Kinect 2 is a completely novel image sensor; however, a lot of the engineering was done by Canesta before they were bought out by Microsoft.

Rohit Raghunathan
Hi Daniel,

Thanks for posting this. By far the most informative article about the Kinect 2 I've read so far. Could you post a link to your references for your info about the Kinect 2 in this article? I haven't found any literature on the ToF tech used in the new Kinect.