Understanding Errors in Time of Flight Depth Sensors
Table 1: Technical data of the examined camera systems as published by the manufacturers.

| Device | Kinect v2 | Xtion 2 | O3D303 |
| --- | --- | --- | --- |
| Color Image Resolution | 1920 x 1080 | 2592 x 1944 | - |
| Depth Image Resolution | 512 x 424 | 640 x 480 | 352 x 264 |
| Depth Image Frame Rate | 30 fps | 30 fps | 4 - 25 fps |
| Access to IR Recordings | Yes | No | Yes |
| Depth Image Field of View | 70° x 60° | 74° x 52° | 60° x 45° |
| Range | 0.5 - 4.5 m | 0.8 - 3.5 m | 0.3 - 30 m |
This article analyzes the individual Time-of-Flight camera systems used in the course of this work: the IFM O3D303 (Figure 1c), the Microsoft Kinect v2 (Figure 1a), and the Asus Xtion 2 (Figure 1b). Table 1 lists the technical data of the camera systems as published by the manufacturers. Note that the Kinect v2 and the Xtion 2 are RGB-D camera systems that have an RGB camera installed in addition to the Time-of-Flight sensor, while the O3D303 features a Time-of-Flight sensor only. The Kinect v2 delivers an RGB color image, a depth image, and an infrared image; the Xtion 2 provides only a color image and a depth image; the O3D303 provides a depth image and an infrared image. The Kinect v2 and the Xtion 2 are operated via USB 3.0, whereas the O3D303 is addressed via a network connection over which its data is transmitted. As explained in the article Capturing Depth Information: Principles and Techniques of Time-of-Flight Sensors, the ambiguity distance is directly related to the modulation frequency used. From the measured maximum distance of the depth image, a frequency of 30 MHz was determined for the wave function used by the O3D303 and 20 MHz for the Xtion 2.
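For a continuous-wave Time-of-Flight sensor, this relationship can be stated directly. As a brief worked example, using the rounded speed of light $c \approx 3 \times 10^{8}$ m/s and the frequencies named above, the unambiguous range is half the modulation wavelength:

$$d_{\text{amb}} = \frac{c}{2 f_{\text{mod}}}, \qquad d_{\text{amb}}(30\,\text{MHz}) \approx 5.0\,\text{m}, \qquad d_{\text{amb}}(20\,\text{MHz}) \approx 7.5\,\text{m}$$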
None of the cameras has an adjustable focus or aperture. However, the O3D303 allows finer configuration of the exposure times and offers additional options, such as optionally using up to three different modulation frequencies and three exposure modes in which multiple images with different exposure times are captured; this enables high exposure, and correspondingly low noise, without overexposing the sensor. In the present work, only a single frequency of 30 MHz and the simplest exposure mode are used for evaluating the O3D303, in which only one shot with a variable exposure time is taken.
In the following subchapters, the individual camera systems will be examined for their characteristics and the influence of light propagation on the depth image. It should be noted that for better distinction, the graphs follow a fixed scheme: The datasets of the Kinect v2 are represented by blue squares, while for the Xtion 2 red circles and for the O3D303 black triangles are used.
Radial Distortion
Monocular camera calibration can be considered a 'solved problem' and is performed here using the algorithms offered by OpenCV [HF14]. In this work, chessboard patterns are used to calibrate the RGB and infrared cameras and to determine their radial distortion. Radial distortion refers to the geometric imaging error of optical systems caused by lens curvature. The deviation of the camera image from the linear camera model can be determined using a known chessboard pattern. The calibration also yields the focal length and the optical center of the camera, which are needed to estimate the position and orientation of the ArUco markers addressed in the following subchapters.
The chessboard pattern is an 8 by 5 chessboard with exactly square fields, each with a side length of 32 mm. For calibration, between 60 and 80 shots were taken for each infrared camera, with the chessboard in different orientations and positions (see Figure 3). For the Xtion 2, the chessboard pattern had to be printed on a transparent film attached to a plexiglass sheet, since the Xtion 2 does not provide access to the infrared image (see Figure 2b). The image therefore shows artifacts caused by reflections on the plexiglass sheet. Instead of the infrared image, the depth image was used to calibrate the depth camera. The result of the calibration consists of a camera matrix and distortion coefficients, which can be used to correct the distortion and convert the camera image into a linear camera model.
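As a rough sketch of this calibration procedure, the following Python snippet shows how it can be carried out with OpenCV; the inner-corner count of 7 x 4 (for an 8 x 5 board), the file path, and the sub-pixel refinement parameters are assumptions for illustration and not taken from the actual setup.

```python
import glob
import cv2
import numpy as np

pattern_size = (7, 4)   # inner corners of an 8 x 5 chessboard (assumed)
square_size = 32.0      # side length of one field in mm

# 3D coordinates of the chessboard corners in the board's own coordinate system
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size

obj_points, img_points = [], []
for path in sorted(glob.glob("calib_ir/*.png")):      # placeholder path
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(img, pattern_size)
    if not found:
        continue
    corners = cv2.cornerSubPix(
        img, corners, (5, 5), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_points.append(objp)
    img_points.append(corners)

# K is the camera matrix, dist the distortion coefficients described below
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, img.shape[::-1], None, None)
```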
The camera matrix is defined as follows:

$$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$$
where $f_x$ and $f_y$ represent the focal length and $c_x$ and $c_y$ the optical center in pixel coordinates. The distortion coefficients are provided as a tuple of 4, 5, or 8 values:

$$(k_1,\ k_2,\ p_1,\ p_2\ [,\ k_3\ [,\ k_4,\ k_5,\ k_6]])$$
where in this study exclusively 5-tuples are used. With the help of the camera matrix and the distortion coefficients, an undistorted pixel coordinate $(u', v')$ can now be determined for each pixel coordinate $(u, v)$. The OpenCV implementation is used for this purpose.
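A minimal sketch of this step, assuming placeholder values for the camera matrix `K` and the distortion coefficients `dist` (in practice these come from the calibration above); with `P=K`, OpenCV's `undistortPoints` returns the result in pixel coordinates again:

```python
import numpy as np
import cv2

# Placeholder calibration values, not measured values of any of the sensors
K = np.array([[365.0, 0.0, 256.0],
              [0.0, 365.0, 212.0],
              [0.0, 0.0, 1.0]])
dist = np.array([0.09, -0.27, 0.0, 0.0, 0.09])   # (k1, k2, p1, p2, k3)

uv = np.array([[[250.0, 180.0]]], dtype=np.float32)  # distorted pixel coordinate (u, v)
# With P set to the camera matrix, the output is the undistorted pixel coordinate (u', v')
uv_undistorted = cv2.undistortPoints(uv, K, dist, P=K)
print(uv_undistorted.ravel())
```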
Figure 4 illustrates the correction for the infrared cameras used. For this purpose, a line was drawn from $(u, v)$ to $(u', v')$ on a regular grid to visualize the degree of distortion. For all cameras, the lens shape is reflected in the distortion, and the strength of the distortion increases towards the edge of the image and is particularly pronounced in the corners.
Temperature
Multiple sources confirm that the measured depth values of Time-of-Flight sensors are influenced by the temperature of the sensors themselves, as they warm up during operation [GVS18][HF14]. While Hertzberg and Frese [HF14] merely note that the operating temperature can affect the sensors, Giancola et al. [GVS18] investigated the influence of temperature on the depth values of the Kinect v2: the sensor was started cold, a static scene was captured, and the influence of the operating temperature on the measured depth values was observed over time. Additionally, Giancola et al. [GVS18] examined the effect of improved cooling through additional fans. They concluded that temperature affects the measured depth values of the device and that additional cooling counteracts this effect. Therefore, the devices used in this study are also examined for their operating temperature and its influence. The sensors are started cold, and the optical axis is aligned orthogonally to a flat surface. The central pixel of the capture is observed, and changes over a period of 13 minutes are recorded. Figure 5 illustrates the change in depth values with increasing operating time. The study focuses only on the self-heating of the sensors, so the influence of ambient temperature was not investigated; to keep it from confounding the measurements, the recordings were made in an air-conditioned room at a constant 23 degrees Celsius. Since Giancola et al. [GVS18] observed the progression of the deviation over a period of 24 hours and found that it stabilizes after about 10 minutes, a long-term study was omitted.
As the measurement was limited to the central pixel of the depth image, it was additionally examined whether the temperature-induced deviation of the depth values is the same for every pixel of the image. Figure 6 shows the degree of deviation across the image: for each pixel, the depth value measured in the first minute was compared with the depth value measured after 30 minutes of operation. There are hardly any changes with the O3D303, while slight variances in the form of a ring in the image center are visible with the Kinect v2. The deviation of the depth values of the Xtion 2, however, depends strongly on the scene being viewed. As a detailed examination of the influence of temperature would exceed the scope of this work, it is not considered in the simulation. It is noted, however, that temperature does influence the depth values and that this effect is particularly pronounced with the Xtion 2, which was taken into account in the following measurements.
Random Errors
Previous studies have investigated the relationship between noise behavior and measured distance [GVS18][Kel15][But14]. Giancola et al. [GVS18] examined the Kinect v2 in detail and found a linear dependence between distance and noise. For the measurement, they used a robot arm extrinsically calibrated to the camera, so the distance between the camera and the measuring surface was known. They examined random errors from 1000 mm up to a distance of 4000 mm and found that the noise decreased between 1000 mm and 1500 mm before increasing approximately linearly. Butkiewicz [But14] reported similar results.
In contrast, this study did not use a calibrated robot arm but considered the noise behavior in relation to the distance measured by the Time-of-Flight sensor itself. Two independent recordings were made. The camera was fixed while the distance to a flat surface, aligned orthogonally to the optical axis, was increased. The surface was moved away from the sensor in small steps, and 1000 depth values were recorded at each step to calculate the standard deviation (sigma) for each distance. This process was carried out with black and with white construction paper. Figure 7 shows the standard deviation of the sensors in relation to the distance. It should be noted that the O3D303 was operated with a constant exposure time. The O3D303 is capable of creating several phase images with different exposure times through more complex exposure modes, thus reducing noise; this was omitted in favor of comparability. It is evident that the behavior of the investigated Time-of-Flight sensors differs from one another. For the Kinect v2, the course of the standard deviation is additionally shown at a different scale in Figure 8. The two trends represent the measurements on the light and the dark surface; the dark surface shows a higher standard deviation in relation to the distance. As in the study by Giancola et al. [GVS18], a slight decrease in deviation up to a distance of ~1000 mm and a subsequent increase is observed. The Xtion 2 shows similar behavior, with the minimum closer to 1500 mm, while no initial minimum is observed with the O3D303. The local minimum in the O3D303's measurements at around 2400 mm is presumably correlated with the systematic error investigated in the subsection Systematic Errors.
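A sketch of how the per-distance standard deviation can be computed; `grab_depth_frame` is a placeholder for the respective camera SDK call, and invalid pixels are assumed to be reported as 0:

```python
import numpy as np

def center_pixel_noise(grab_depth_frame, n_frames=1000):
    """Mean and standard deviation of the central pixel over n_frames captures."""
    samples = []
    for _ in range(n_frames):
        depth = grab_depth_frame()            # 2D array of depth values in mm
        h, w = depth.shape
        samples.append(depth[h // 2, w // 2])
    samples = np.asarray(samples, dtype=np.float64)
    samples = samples[samples > 0]            # drop invalid (zero) measurements
    return samples.mean(), samples.std(ddof=1)
```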
Since, according to Li [Li14], noise behavior depends on the intensity of the reflected infrared signal, this study extended the investigation of noise behavior to include a measurement in relation to the intensity of the signal. The same recordings as for Figure 7 were used. Figure 9 shows, for the Kinect v2 and the O3D303, the standard deviation in relation to the intensity of the reflected infrared radiation. The course of the deviation corresponds to the expected course of Equation 43 introduced in the study by Li [Li14].
Systematic Errors
The systematic error, often referred to as the Wiggling Error in other works, is caused by the emitted amplitude-modulated signal deviating from a perfect sine wave, which results in a distance-dependent error in the phase shift estimation [GVS18]. To measure this distance-dependent error, a ground truth distance is required against which the measured distance of the sensor can be compared. Giancola et al. [GVS18] used a robot arm extrinsically calibrated to the camera, so that the distance between the robot arm and the sensor was known. This work uses ArUco markers to determine the ground truth distance between the measuring surface and the sensor. ArUco is a library for Augmented Reality applications developed at the University of Córdoba [RRMSMC18][SGJMSMCMC15]. With ArUco, it is possible to detect markers in the image and determine their position and orientation relative to the camera. When four ArUco markers are placed on a planar surface, the distance between the camera and the measuring surface enclosed by the markers can be approximated. For the recording, the camera was fixed and the measuring surface was aligned orthogonally to the optical axis, while the distance of the measuring surface to the camera was slowly increased. Figure 10a shows the experimental setup with which the data for the O3D303 was collected.
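As an illustration of how such a ground truth distance can be obtained, the following sketch uses OpenCV's classic `cv2.aruco` API; the marker dictionary and side length are assumptions, and `K` and `dist` are the intrinsics from the calibration described earlier (newer OpenCV versions expose the same functionality through `cv2.aruco.ArucoDetector`).

```python
import numpy as np
import cv2

MARKER_LENGTH = 0.10   # marker side length in metres (assumed)
DICTIONARY = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)

def marker_distances(gray, K, dist):
    """Distance of each detected ArUco marker centre from the camera origin."""
    corners, ids, _rejected = cv2.aruco.detectMarkers(gray, DICTIONARY)
    if ids is None:
        return {}
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(corners, MARKER_LENGTH, K, dist)
    return {int(i): float(np.linalg.norm(t))
            for i, t in zip(ids.ravel(), tvecs.reshape(-1, 3))}
```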
Since the resolution of the image plays an important role in precisely determining the position of the ArUco markers, the color cameras of the Xtion 2 and Kinect v2 were used to determine the marker positions, as they have a higher resolution. The depth camera and the color camera were extrinsically calibrated to each other using a chessboard pattern, and the depth information calculated from the color image was then transformed into the space of the depth image. However, determining the marker position introduces its own error, which increases with the distance to the marker and manifests itself as noise in the recordings, since the corners of the markers can be located less accurately.
Figure 11 shows the results of the measurements. It should be noted that the O3D303 does not have a color camera, and the resolution of its infrared camera, at 352 x 264, is too low to guarantee precise determination of the ArUco marker positions in the image. Therefore, three recordings with different marker sizes were overlaid to achieve sufficient accuracy, which, however, also introduced errors into the measurement. The results of the Kinect v2 are consistent with the measurements of Giancola et al. [GVS18], who observed a superposition of several sinusoidal waves. Giancola et al. [GVS18] and Keller [Kel15] also determined the phase and amplitude of the sinusoidal waves using the Fourier transform, which Keller [Kel15] used to simulate the error; this is not done in this work. The measurement results of the O3D303 suggest that the course of the deviation, similar to the results of Keller [Kel15], could be due to only one sinusoidal wave, as the O3D303, unlike the Kinect v2, uses only one wave function with a frequency of 30 MHz to generate the depth image. Only the measurement results of the Xtion 2 deviate from expectations, as no sinusoidal wave can be detected in them.
Multipath Errors due to Indirect Lighting
The Multipath Error is an error in the calculation of depth values caused by the influence of indirect lighting. Part of the emitted light is reflected back to the sensor via detours, i.e. after additional reflections off the surfaces in the scene. This leads to an overestimation of the depth value, since indirect reflections have traveled a longer path than direct reflections before reaching the sensor. To investigate the influence of indirect lighting, two bright panels were set up orthogonally to each other in front of the camera, with ArUco markers attached to them to determine the camera position relative to the surfaces.
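One common way to describe this effect, given here as the standard phasor model for continuous-wave sensors rather than a model taken from the cited works, is that the estimated phase is the argument of the sum of the direct return and all indirect returns:

$$\hat{\varphi} = \arg\!\left( a_0\, e^{\,i\varphi_0} + \sum_{k} a_k\, e^{\,i\varphi_k} \right), \qquad \hat{d} = \frac{c\,\hat{\varphi}}{4\pi f_{\text{mod}}}$$

Since every indirect path is longer than the direct one ($\varphi_k > \varphi_0$), the summed phasor is rotated towards larger phases, and the estimated distance $\hat{d}$ is overestimated.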
To compare the depth values provided by the depth sensor with ground truth depth values, a point cloud of the experimental setup was generated with a Structured Light camera. A Structured Light camera is a camera system in which an infrared projector projects a known dot pattern that is captured by an infrared camera and from which depth values are computed. The method is a proprietary process developed by PrimeSense, so its inner workings cannot be discussed in detail.
The Xtion 1 camera used here does not exhibit a multipath error, making it suitable for generating ground truth depth values for this experiment [WS17]. The Xtion 1 also exhibits a systematic error, but it increases linearly with the measured distance and is therefore sufficiently accurate for short distances of up to 1000 mm [GVS18][TMT13]. For the generation of the point cloud, the marker mapping approach developed by Muñoz-Salinas et al. [MSMJYBMC17] was also used. With its help, the relative positions of the markers to each other could be calculated and stored in the form of a marker map, making it possible to assign a unique position in space to each ArUco marker when generating the point cloud. Like the Xtion 2 and Kinect v2, the Xtion 1 has a color and a depth camera, so the color image can be used in combination with the ArUco marker map to determine the orientation of the camera in space and to transform the measured depth values of the camera into that space accordingly. Figure 12 shows the experimental setup with the O3D303 and the generated point cloud used as a reference.
For each camera, 10,000 depth images of the static scene were taken, and the average depth value for each pixel was computed to reduce the random error explained in the subsection Random Errors. With the help of the ArUco markers, the point cloud was transformed so that it corresponded to the real positions of the experimental setup relative to the camera. This allowed the deviation of the measured depth values from the expected depth values to be determined. Figure 13 illustrates the result of the experiment in the form of a color map. It can be seen that the influence of the multipath error differs in detail for each camera, but all cameras have in common that corners in the image are rounded off due to interreflections. Figure 14 further illustrates this behavior by comparing the depth values of the point cloud and of the Time-of-Flight cameras along a horizontal scan line through the middle of the image.
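A minimal sketch of the per-pixel averaging step, assuming the captured frames are stacked into a NumPy array of shape (N, H, W) and that invalid pixels are reported as 0:

```python
import numpy as np

def mean_depth(frames):
    """Per-pixel mean depth over a stack of frames, ignoring invalid (zero) pixels."""
    frames = np.asarray(frames, dtype=np.float64)
    valid = frames > 0
    counts = valid.sum(axis=0)
    summed = np.where(valid, frames, 0.0).sum(axis=0)
    # Pixels that were never valid are returned as NaN
    return np.where(counts > 0, summed / np.maximum(counts, 1), np.nan)
```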
Lens Scattering Error
The Lens Scattering Error describes an error in the depth image, well known from the literature, that is caused by incoming light affecting adjacent pixels [HF14][JL10][CCMDH07][CCMDH09]. In addition to diffraction at the aperture, reflections and scattering effects within the lens system are cited as reasons for distorted depth values of neighboring pixels. This effect causes strongly illuminated objects in the foreground to influence the depth values of less illuminated objects in the background. Hertzberg and Frese [HF14] investigated the lens scattering effect using a round retroreflector positioned in front of a weakly reflecting background. First, a recording of the background without the reflector was made, and then the same scene was recorded with the reflector, which allowed the influence of the retroreflector on the background to be determined. This work avoids the use of a retroreflector, as it caused lens flares, i.e. lens reflections, in addition to the lens scattering effect in all sensors. Additionally, the retroreflector saturated the intensity of the infrared image at the reflector, so the original intensity of the reflected radiation could no longer be determined, rendering the recording unusable for evaluating the lens scattering effect.
Jamtsho and Lichti [JL10] used two highly reflective surfaces at different distances, positioned orthogonally to the optical axis of the sensor, for their evaluation. They observed the influence of the depth values of the surface closer to the sensor on the depth values of the surface in the background. This work follows a similar approach and uses a weakly reflective surface made of black construction paper in the background and a highly reflective surface in the foreground. The distance of the highly reflective surface to the camera is varied, and its influence on the background is observed. Figure 15 illustrates the experimental setup and the resulting point cloud from the measurement, which will be examined in detail subsequently.
In Figure 16, the depth values from the horizontal scan line are presented. The blue depth values are from the reference recording of the background. The red depth values were taken from a recording in which an additional object was positioned in the foreground of the image. The highly reflective foreground affects the depth values of the entire remaining recording. The transition from the foreground to the background is where the lens scattering error particularly affects the depth values of the background. These observations are also consistent with those of Hertzberg and Frese [HF14] and Jamtsho and Lichti [JL10].
Figure 17 shows the results of the investigation of the lens scattering effect. The graphs show the influence of a pixel on its neighboring pixels in relation to the intensity, in the form of a point spread function, which indicates how strongly the intensity of one pixel spreads to its neighbors. The results differ from those of the studies by Hertzberg and Frese [HF14] and Jamtsho and Lichti [JL10]. While Hertzberg and Frese determined a purely positive course of the function, the O3D303 also exhibits negative values (see Figure 17a). Jamtsho and Lichti approximated the lens scattering function using a sum of Gaussian functions, which is not possible for the results of the O3D303 and Kinect v2, as the function is neither monotonically decreasing in the positive range nor monotonically increasing in the negative range. The course of the function is reminiscent of the pattern of diffraction disks, also known as Airy disks, which arise from the diffraction of a light beam at an aperture [Pad09], suggesting a connection.
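For reference, the intensity profile of such an Airy pattern behind a circular aperture of radius $a$ is the classical diffraction result (quoted here as textbook optics, not as a fit to the measured point spread functions):

$$I(\theta) = I_0 \left( \frac{2\, J_1(k a \sin\theta)}{k a \sin\theta} \right)^{\!2}, \qquad k = \frac{2\pi}{\lambda},$$

where $J_1$ denotes the Bessel function of the first kind and $\lambda$ the wavelength of the infrared light.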
Mixed Pixels Error
The Mixed Pixels error, often also called the Flying Pixels error, describes an error in the depth image, well known from the literature, in which the depth value of a pixel is composed of several measured depths; it is observed where directly adjacent pixels have greatly differing depth values [GVS18][TMT13][But14][HF14]. To evaluate the mixed pixels error, the influence of lens scattering must be excluded as far as possible. Therefore, a weakly reflecting object was positioned in front of a highly reflecting background, and the influence of the weakly reflecting surface on the background was observed. The influence of the lens scattering effect cannot be excluded completely in this setup, but the foreground should have a comparatively low influence on the background. In this way, the mixed pixels error for the background can be isolated from the lens scattering error.
For the evaluation, three separate recordings were made with the O3D303. First, only the background was recorded in order to evaluate the influence of the foreground object on the background: 10,000 depth images were created and the average depth value for each pixel was computed to suppress the random error. Then, an object was placed in front of the background and another recording was made in the same way. Finally, the background was removed again and a recording was made of the foreground object alone, to evaluate the influence of the background on the foreground. Figure 19 shows the depth values along the horizontal scan line of the depth image for these recordings. The blue depth values represent the separately recorded background, the black depth values are from the separate recording of the foreground object, and the red depth values represent the joint recording of foreground and background. It can be seen that the foreground object hardly affects the depth values of the background; the depth values of the foreground object, on the other hand, are distorted by the background. To exclude the influence of lens scattering considered in the subsection Lens Scattering Error, all depth values belonging to the foreground were therefore masked.
Figure 20 shows the joint recording of the foreground object and the background, where the depth values of the joint recording were modified so that the points identified as part of the foreground from the separate recording of the foreground object were removed. It can be seen that the mixed pixels error has almost completely disappeared. Therefore, it is suspected that the mixed pixels error known from the literature is solely due to the lens scattering effect since the error disappears when attempting to isolate it.
Positions of the Infrared LEDs
In the analysis of the mixed pixels error, it was noticed that in the case of the Kinect v2 and the Xtion 2, part of the image lies in shadow and was not directly illuminated by the infrared LEDs, resulting in artifacts in the image. These do not occur with the O3D303, as the sensor is surrounded by four LEDs. The LEDs of the Kinect v2 and the Xtion 2 are positioned on one side next to the sensor, creating shadows and areas of the image that are only indirectly illuminated, leading to an overestimation of the distance.
To investigate the error, a reference recording of a background was first made. Then, an object was positioned in the foreground, and a second recording was made. The recordings were then compared, and the deviation from the reference image was determined. Figure 21 shows the experimental setup and the resulting point cloud. The deviations from the reference value are represented by the grayscale. A bright color in this case means that a deviation is present, while black indicates that the depth value matches the reference recording.
Figure 22 illustrates, using the coordinates along the horizontal scan line, the error in the depth image caused by the shadow of an object. The blue line shows the course of the background, and the red dots show the coordinates of the recording with an object placed in front of this background. The points with X-coordinates in the range between 86 and 131 belong to the object placed in the image that casts the shadow. At X-coordinate 186, a flying pixel can be seen, caused by the lens scattering error. From X-coordinate 263 onward, clear deviations from the ground truth values are visible, caused by the area lying in shadow and being only indirectly illuminated. Hertzberg and Frese [HF14] pointed out that the position of the LEDs can influence the depth image, but this influence has neither been investigated nor simulated in the literature to date.
Bibliography

[HF14] Christoph Hertzberg and Udo Frese. Detailed modeling and calibration of a time-of-flight camera. In Proceedings of the 11th International Conference on Informatics in Control, Automation and Robotics. SCITEPRESS - Science and Technology Publications, 2014. doi:10.5220/0005067205680579.

[GVS18] Silvio Giancola, Matteo Valenti, and Remo Sala. A Survey on 3D Cameras: Metrological Comparison of Time-of-Flight, Structured-Light and Active Stereoscopy Technologies. Springer-Verlag GmbH, 2018.

[Kel15] Maik Keller. Real-time Simulation of Time-of-Flight Sensors and Accumulation of Range Camera Data. PhD thesis, University of Siegen, 2015.

[But14] Thomas Butkiewicz. Low-cost coastal mapping using Kinect v2 time-of-flight cameras. In 2014 Oceans - St. John's. IEEE, September 2014. doi:10.1109/oceans.2014.7003084.

[Li14] Larry Li. Time-of-flight camera - an introduction. Texas Instruments Technical White Paper, 2014.

[RRMSMC18] Francisco Romero Ramirez, Rafael Muñoz-Salinas, and Rafael Medina-Carnicer. Speeded up detection of squared fiducial markers. Image and Vision Computing, 76, June 2018. doi:10.1016/j.imavis.2018.05.004.

[SGJMSMCMC15] Sergio Garrido-Jurado, Rafael Muñoz-Salinas, Francisco Madrid-Cuevas, and Rafael Medina-Carnicer. Generation of fiducial marker dictionaries using mixed integer linear programming. Pattern Recognition, 51, October 2015. doi:10.1016/j.patcog.2015.09.023.

[WS17] Oliver Wasenmüller and Didier Stricker. Comparison of Kinect v1 and v2 depth images in terms of accuracy and precision. In Computer Vision – ACCV 2016 Workshops, pages 34–45. Springer International Publishing, 2017. doi:10.1007/978-3-319-54427-4_3.

[TMT13] Alex Teichman, Stephen Miller, and Sebastian Thrun. Unsupervised intrinsic calibration of depth sensors via SLAM. In Robotics: Science and Systems 2013, June 2013. doi:10.15607/RSS.2013.IX.027.

[MSMJYBMC17] Rafael Muñoz-Salinas, Manuel J. Marin-Jimenez, Enrique Yeguas-Bolivar, and Rafael Medina-Carnicer. Mapping and localization from planar markers. Pattern Recognition, 73:158–171, January 2017. doi:10.13140/RG.2.2.31751.65440.

[JL10] Sonam Jamtsho and Derek Lichti. Modelling scattering distortion in 3D range camera. International Archives of Photogrammetry, Remote Sensing and Spatial Information Sciences, 38, January 2010.

[CCMDH07] James Christian Charles Mure-Dubois and Heinz Hügli. Optimized scattering compensation for time-of-flight camera. Art. no. 67620H, September 2007. doi:10.1117/12.733961.

[CCMDH09] James Christian Charles Mure-Dubois and Heinz Hügli. Time-of-flight imaging of indoor scenes with scattering compensation. January 2009.

[Pad09] Paul Padley. Waves and Optics. 2009.