Machine vision is the area of research focused on measuring physical objects using a video camera. Sometimes confused with the broader area of computer vision, machine vision is typically associated with manufacturing and detecting defects in parts on an assembly line. Computer vision is interested in any and all applications that involve teaching a computer about its physical surroundings via imaging sensors. In either case, 3-D imaging sensors such as Microsoft's Kinect camera are having a profound impact because they solve a host of problems associated with perspective distortion in traditional 2-D cameras at an absurdly low price. Perspective distortion refers to how things look smaller as they get farther away.
Of the many methods of measuring depth with a camera, the Kinect 1.0 sensor falls within a broad range of technologies that rely on triangulation. Triangulation is the process used by stereo-imaging systems, and it is how the human visual system (i.e. two eyes) works. The process is illustrated in Fig. 1, where I show two points in space at different distances from the camera. Looking at the two images as viewed by the stereo-camera pair, the blue sphere, being closer to the cameras, has a greater disparity in position from camera A to camera B. That is, the blue sphere appears to move almost three-quarters of the cameras' fields of view while the red sphere moves only half of this distance. This difference in apparent travel between the red and blue spheres is a phenomenon known as parallax: closer objects produce greater degrees of parallax than distant objects.
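To put numbers on the parallax idea, here is a minimal sketch of the textbook pinhole-stereo relation, where disparity equals focal length times baseline divided by depth. The function names and the rig values (a 600-pixel focal length and a 10 cm baseline) are my own assumptions for illustration, not the specifications of any particular camera.

```python
def disparity_from_depth(focal_px: float, baseline_m: float, depth_m: float) -> float:
    """Horizontal shift, in pixels, of a point between camera A and camera B."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_px * baseline_m / depth_m


def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Invert the relation above: recover depth from a measured disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero means infinitely far away)")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig: 600-pixel focal length, cameras mounted 10 cm apart.
FOCAL_PX = 600.0
BASELINE_M = 0.10

near_shift = disparity_from_depth(FOCAL_PX, BASELINE_M, 1.0)  # closer sphere, 1 m away
far_shift = disparity_from_depth(FOCAL_PX, BASELINE_M, 2.0)   # farther sphere, 2 m away
print(near_shift, far_shift)  # the closer sphere shifts by more pixels
```

Doubling the distance halves the shift, which is why depth precision from triangulation degrades as the target moves away from the cameras.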
Of course, the Kinect 1.0 sensor doesn't have two cameras performing triangulation. Instead, it relies on triangulation between a near-infrared camera and a near-infrared laser source to perform a process called structured light. Structured light is one of the first methods of 3-D scanning. In its classic form, a single light stripe mechanically sweeps across a target object, and from a large sequence of images taken as the stripe sweeps across the target, a complete 3-D surface can be reconstructed. Figure 2 shows how this would work for a single frame of video, where the position of the laser stripe appears to move left and right with depth such that the farther to the right the stripe appears, the closer the target surface is to the camera. Note that with just the single laser stripe, a single frame of video can only reconstruct a single point of the target surface per row of the image sensor. So a 640x480 camera could only reconstruct, at most, 480 unique points in space per frame. Hence, we need to sweep the laser so that we can generate many points in each row.
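The cost of sweeping can be estimated with a few lines of arithmetic. This sketch assumes one stripe position per image column and a 30 frames-per-second camera; both numbers are assumptions chosen for illustration, and the function name is hypothetical.

```python
def stripe_scan(columns: int, rows: int, frames_per_second: float):
    """Estimate the cost of a full scan with a single swept laser stripe."""
    points_per_frame = rows    # the stripe crosses each image row at most once
    frames_needed = columns    # assumption: one frame per stripe position (column)
    scan_seconds = frames_needed / frames_per_second
    return points_per_frame, frames_needed, scan_seconds


points, frames, seconds = stripe_scan(columns=640, rows=480, frames_per_second=30.0)
print(points, frames, round(seconds, 1))
```

Under these assumptions a dense 640x480 scan takes a little over 21 seconds, during which the target has to hold still.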
Obviously, for our sweeping laser stripe, we would need the target object to stay still during the scanning process. So this doesn't really work in real-time systems like the Kinect 1.0 camera, where we want to measure a moving subject. Instead, the Kinect 1.0 sensor makes use of a pseudo-random dot pattern, produced by the near-infrared laser source, that illuminates the entire field of view as illustrated in Fig. 3. Given the captured image of this pattern, the processing hardware inside the camera looks at small windows of the captured image and attempts to find the matching dot pattern in the projected pattern.
Of course, you're probably wondering why the perspective distortion phenomenon doesn't make dots look smaller and closer together as the reflecting surface gets farther away from the camera. It doesn't because the camera and the laser projector are epipolar rectified. That is, the camera and projector have matching fields of view. The projector emits a cone of laser light, so the projected pattern grows larger and larger as the reflecting surface gets farther from the sensor, and it expands at the same rate as the lines of sight of the camera's pixels. What this means in the captured image is that dots in the projected pattern appear to move left and right with distance, not up and down. So by simply tracking a dot's horizontal coordinate, the Kinect can tell you how far that dot is from the camera sensor, i.e. that pixel's observed depth.
The problem for the Kinect 1.0 sensor is in the manner in which it relies on small windows in the captured image. That is, it needs to detect individual dots, and then it needs to find neighboring dots as well. From a constellation of these dots, it can identify the exact matching constellation in the projected dot pattern. Without a constellation of points, there is no way for the Kinect processor to uniquely identify a dot in the projected pattern. We call this an ambiguity, and it means the sensor cannot derive a depth estimate. For computer gaming, this isn't really a problem because body parts are large enough that the constellations fit inside the pixels forming your arm. For measuring thin objects like hair or, perhaps, a utility cord as thin as a single image pixel, this is a significant obstacle, and it's the major roadblock that the Kinect 2.0 attempts to address.
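The ambiguity is easy to reproduce with a toy example. The sketch below treats one row of the projected pattern as a list of 1s (dots) and 0s (gaps) and searches it for an observed window. The pattern itself and the function name are made up for illustration; the real pattern is two-dimensional, and the matching done inside the sensor is not described here.

```python
def find_matches(projected, window):
    """Return every offset at which `window` occurs in the `projected` row."""
    n = len(window)
    return [i for i in range(len(projected) - n + 1)
            if projected[i:i + n] == window]


# Hypothetical 16-sample row of a dot pattern: 1 = dot, 0 = no dot.
PROJECTED_ROW = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1]

# A lone dot could have come from any dot position in the row: ambiguous.
single_dot = find_matches(PROJECTED_ROW, [1])

# A five-sample constellation (copied from offset 3) occurs exactly once.
constellation = find_matches(PROJECTED_ROW, PROJECTED_ROW[3:8])

print(len(single_dot), constellation)
```

An object thinner than the window never shows the sensor a full constellation, so every candidate offset remains plausible and no depth can be assigned.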
The Microsoft Kinect 2.0 sensor relies upon a novel image sensor that indirectly measures the time it takes for pulses of laser light to travel from a laser projector, to a target surface, and then back to an image sensor. How is this possible? Quite easily, if you consider that, in the time it takes for one complete clock cycle in a 1 GHz processor, a pulse of light travels about 1 foot. That means that if I can build a stopwatch that runs at 10 GHz, I can easily measure the round-trip travel distance of a pulse of light to within 0.10 feet. If I pulse the laser and make many measurements over a short period of time, I can increase my precision to the point where LIDAR systems are available that can measure distances to within less than a centimeter at a range of over one kilometer.
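The one-foot-per-nanosecond rule of thumb is easy to check. The sketch below computes how far light travels in one clock tick, and it also shows the usual square-root improvement from averaging repeated measurements. The averaging rule assumes independent, zero-mean noise on each measurement, which is my assumption rather than a property of any specific LIDAR; the function names are hypothetical.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0
METERS_PER_FOOT = 0.3048


def feet_per_tick(clock_hz: float) -> float:
    """Distance light travels, in feet, during one period of the given clock."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz / METERS_PER_FOOT


def averaged_error(single_shot_error: float, measurements: int) -> float:
    """Expected error after averaging, assuming independent zero-mean noise."""
    return single_shot_error / math.sqrt(measurements)


print(round(feet_per_tick(1e9), 3))    # 1 GHz clock: just under one foot
print(round(feet_per_tick(10e9), 3))   # 10 GHz clock: just under a tenth of a foot
print(averaged_error(0.1, 10_000))     # 10,000 pulses shrink a 0.1 ft error 100-fold
```

Note that these are round-trip distances; the distance to the target is half of what the light travels.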
What the Kinect 2.0 sensor does is take a pixel and divide it in half. Half of this pixel is then turned on and off really fast such that, when it is on, it is absorbing photons of laser light, and when it is off, it rejects the photons. The other half of the pixel is doing the same thing; however, it's doing it 180 degrees out of phase from the first half such that, when the first half is on, it's off. And when the first half is off, it's on. At the same time this is happening, a laser light source is also being pulsed in phase with the first pixel half such that, if the first half is on, so is the laser. And if the pixel half is off, the laser will be too.
As illustrated by the timing diagram of Fig. 4, suppose we aim the laser source directly at the camera in very close proximity. Then the time it takes for the laser light to leave the laser and land on the sensor is basically 0 seconds. This is depicted in Fig. 4 by the top row of light pulses (red boxes) being in perfect alignment with the gray columns. As such, the laser light will be absorbed by the first half of all camera pixels, since these halves are turned on, and rejected by the second halves, since these halves are turned off. Now suppose I move the laser source back one foot. Then the laser light will arrive one cycle of a 1 GHz clock (about one nanosecond) later than when it left the source, as depicted by the second row of laser pulses in Fig. 4. So photons that left the laser just as it was turned on will arrive shortly after the first halves of the camera pixels have turned on, meaning that they will be absorbed by the first halves and rejected by the second.
Photons leaving the laser just as it is turned off will then arrive just after the camera pixels' second halves are turned on, meaning they will be rejected by the first halves and absorbed by the second. That means that the total amount of light absorbed by the first halves will decrease slightly while the amount absorbed by the second halves will increase slightly. As we move the laser source even farther away from the camera sensor, more and more of the photons emitted by the laser source will arrive at the camera sensor while the second halves are turned on, meaning that the second-half recordings will be larger and larger while the first-half recordings will be smaller and smaller. After several milliseconds of exposure, the total amounts of photons recorded by the two halves are compared. The more photons recorded by the second halves of the pixels compared to the first, the longer the round-trip distance that the light traveled. Now it's important to have two halves record the incoming laser light because it may be that the target surface will absorb some of the laser light. If it does, then the total number of photons reflected back will not be equal to the total number that was projected. This will affect both pixel halves equally. So it's the ratio of photons recorded by the two halves, not the total number recorded by either side, that tells us the distance.
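The ratio argument can be sketched with an idealized model: square laser pulses of a fixed width, a first pixel half that is on exactly while the laser is on, and a second half that is on for the same duration immediately afterward. This is a simplification I am assuming for illustration, not the actual signal processing inside the Kinect 2.0, and all function and parameter names are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def simulate_halves(distance_m, pulse_s, reflectivity, photons_emitted=1_000_000.0):
    """Photons collected by each pixel half for a target at `distance_m`."""
    delay_s = 2.0 * distance_m / SPEED_OF_LIGHT_M_PER_S   # round-trip travel time
    late_fraction = delay_s / pulse_s                      # share arriving after half 1 turns off
    if not 0.0 <= late_fraction <= 1.0:
        raise ValueError("target is outside the unambiguous range of this pulse width")
    returned = photons_emitted * reflectivity              # the surface absorbs the rest
    first_half = returned * (1.0 - late_fraction)
    second_half = returned * late_fraction
    return first_half, second_half


def distance_from_halves(first_half, second_half, pulse_s):
    """Recover distance from the ratio of the two recordings."""
    late_fraction = second_half / (first_half + second_half)
    return 0.5 * SPEED_OF_LIGHT_M_PER_S * pulse_s * late_fraction


PULSE_S = 50e-9  # hypothetical 50 ns on-window
bright = simulate_halves(1.5, PULSE_S, reflectivity=0.9)
dark = simulate_halves(1.5, PULSE_S, reflectivity=0.2)
print(distance_from_halves(*bright, PULSE_S), distance_from_halves(*dark, PULSE_S))
```

Both the bright and the dark surface yield the same distance because the reflectivity scales the two recordings equally and cancels in the ratio.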
At some point though, the travel distance of the laser light might be so long that laser photons will arrive so late to the sensor that they overshoot the pixels' second halves' on-window and arrive at the first halves' next on-window, as depicted by the third row of laser pulses in Fig. 4. This results in an ambiguity, which is resolved by increasing the time that the pixel halves are turned on, giving more time for the light to travel round trip and land inside the second halves' on-window. Of course, it also means that it will be harder to detect small changes in travel distance, since all sensors have limited precision as well as some amount of thermal noise (i.e. free electrons floating through the semiconductor lattice) that looks like light photons. So what Kinect 2.0 does is take two measurements, where the first measurement is a low-resolution estimate with no ambiguities in distance. The second measurement is then taken with high precision, using the first estimate to eliminate any ambiguities. Of course, depending on how fast the sensor works, we can always take additional estimates with greater and greater degrees of precision.
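One common way to combine a coarse and a fine measurement is sketched below. The fine measurement is assumed to wrap around every `fine_range_m` meters, and the coarse estimate is assumed to be unambiguous and accurate to within half of that range. The names and numbers are hypothetical, and this is a generic unwrapping scheme rather than the Kinect 2.0's documented procedure.

```python
def resolve_ambiguity(coarse_m: float, fine_wrapped_m: float, fine_range_m: float) -> float:
    """Use a coarse, unambiguous estimate to unwrap a precise but wrapped one."""
    # How many whole fine-range intervals best explain the coarse estimate?
    wraps = round((coarse_m - fine_wrapped_m) / fine_range_m)
    return fine_wrapped_m + wraps * fine_range_m


FINE_RANGE_M = 1.5            # hypothetical unambiguous range of the precise measurement
true_distance_m = 5.3

fine_wrapped_m = true_distance_m % FINE_RANGE_M   # precise, but only known modulo 1.5 m
coarse_m = 5.1                                    # unambiguous, but off by 20 cm

print(resolve_ambiguity(coarse_m, fine_wrapped_m, FINE_RANGE_M))
```

The coarse estimate only has to pick the right interval; the final precision comes entirely from the fine measurement.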
Now while all of this time-of-flight business sounds really cool, the Kinect 2.0 is even cooler because the sensor also has built-in ambient light rejection, where each pixel individually detects when it is oversaturated with incoming ambient light and then resets itself in the middle of an exposure. The Kinect 1.0 sensor has no means of rejecting ambient light, and as such, cannot be used in environments prone to near-infrared light sources (e.g. sunlight). In fact, the Kinect 2.0 sensor's light rejection is one of the reasons why its original developers considered using the system in automotive applications for things like a rear-view camera.
For gaming, this process of indirectly measuring the time of flight of a laser pulse allows each pixel to independently measure distance, whereas Kinect 1.0 has to measure distance using neighborhoods of pixels. Kinect 1.0 could not measure distances in the spaces between laser spots. This has an impact on depth resolution, where Kinect 1.0 has been cited as having a depth resolution limit of around 1 centimeter. Kinect 2.0 is limited by the speed at which it can pulse its laser source, with shorter pulses offering higher degrees of depth precision, and it can pulse that laser at really short intervals. What Kinect 1.0 has that Kinect 2.0 doesn't is that it can rely on off-the-shelf camera sensors and can run at any frame rate, whereas Kinect 2.0 has a unique sensor that is very expensive to manufacture. Only a large corporation like Microsoft and only a high-volume market like gaming could achieve the economies of scale needed to bring this sensor into the home at such an affordable price. Considering the technology involved, one might say an absurdly low price.
Of course, over the next couple of months the differences between the two sensors are going to become apparent as researchers, such as myself, get our hands on the new sensor and have the opportunity to see when and where the two systems are most appropriate. At present, I'm hard at work developing machine vision systems for precision dairy farming using the Kinect 1.0 sensor, but I'm doing so knowing that I'll need the precision of the new sensor.