Depth Extraction from a Single Image and Its Application

In this chapter, a method for generating a depth map from a single image is presented. The proposed approach applies a sequence of blurring and deblurring operations to a point to determine its depth. The method makes no assumptions about the properties of the scene when resolving depth ambiguity in complex images. Since applications that manipulate a depth map can be realized by first obtaining an all-in-focus image through deblurring and then blurring that image, we also present methods to derive all-in-focus images from our depth maps. Furthermore, 2D to 3D conversion can be achieved from the estimated depth map. Several demonstrations show the performance and applications of the estimated depth map.


Introduction
Deriving depth information from 2D images is one of the most important problems in image processing and computer vision. Depth information can be applied in 2D to 3D conversion, image refocusing, scene interpretation, the reconstruction of 3D scenes, and depth-based image editing. Several techniques derive depth information, such as depth from focus [1], stereo vision [2], and depth from motion [3]. Nevertheless, these techniques need multiple images, which makes them impractical when only one image is available or when feature correspondences between the images cannot be resolved reliably. To this end, a number of approaches have been proposed to acquire depth information from a single image, such as the computational photography approach [4], which modifies the shape of the aperture of a traditional lens, and the Kinect approach [5], which uses structured light to derive depth maps.
An image captured by a conventional camera shows the out-of-focus parts of a scene blurred. The blurriness of a pixel is described by its "circle of confusion" (COC) and is usually modeled as a 2D Gaussian function. When a single image is taken by a conventional camera with a fixed focal length, aperture size, and distance from the image plane to the lens, a pixel's COC is related only to the depth of the corresponding scene point. In such cases, depth estimation corresponds to blur estimation. Theoretically, if the depth map of an image can be accurately estimated, applications that manipulate the depths of objects can be realized by applying deblurring followed by blurring operations. This is because the deblurring operation moves objects closer to the camera and the blurring operation moves them farther away.
The blurring operation is more robust to depth map inaccuracy than the deblurring operation, and many applications have been successfully built on it. For example, defocus magnification [6] enlarges the out-of-focus area in an image by magnifying the existing blurriness while keeping the shape of sharp regions, which modifies the depths of objects that are not in the focal plane so that they move farther away from the plane. The deblurring operation, on the other hand, can be very sensitive to the accuracy of a depth map. A deblurring operation usually highlights the edged and textured points in an image. If their depths are overestimated, the operation can generate ringing artifacts that severely degrade the perceptual quality of an image.
Estimating a depth map from a single image is a fundamentally ill-posed problem. For example, in a single image, we cannot resolve the ambiguity between out-of-focus edges and originally smooth edges, we cannot determine whether a blurred point is in front of or behind the focal plane, and we cannot estimate the depths of points in a smooth area. These problems cannot be resolved without assumptions relating local image features to the scene. In this context, a widely adopted assumption is that a blurred edge is obtained by smoothing a step edge with a Gaussian kernel [7]. Although this assumption has been adopted in the autofocus systems of cameras, the goal of autofocus is to derive the depths of manually selected scene points rather than the depth map of an image. The approach based on assumptions about scene points has also been used to estimate depth maps. Edges in a scene are first modeled, and the depths of the blurred edge points are then derived from the degree of blurriness that has been applied to the scene to obtain the points.
However, many types of singularities far beyond step edges can appear in an image, and only a few of them can be modeled. An approach based on scene modeling can therefore be too restrictive to derive precise depths for all points. As a result, the depth precision obtained from scene modeling is usually limited to images with two depth layers, foreground and background.
In this chapter, we propose a blurring-deblurring method that does not require the modeling of edge points. In the blurring process, a point is blurred by increasing its COC to the limit of the camera. In the deblurring process, a point is deblurred by reducing its COC to the limit at the other end. The results of these two processes are then combined to derive the depths of edges. The approach thus estimates the depths of edge points from the camera's characteristic curve of COC vs. depth. We demonstrate that the proposed approach can reliably derive depth maps of complex images and synthesize all-in-focus images. Furthermore, the depth maps can be used to synthesize stereo images for 3D visualization on a mobile device. Figure 1 shows a diagram of the applications of the depth map of a single image.
The remainder of this chapter is organized as follows. The relationship between the depth of a point and out-of-focus blurriness in images obtained by the thin-lens camera model is reviewed in Section 2. The proposed blurring-deblurring approach is presented in Section 3. The depth refinement approach and image deblurring process are presented in Section 4. In Section 5, we demonstrate the depth map results and applications. Section 6 concludes the chapter.

Camera model and out-of-focus blurriness
When an object is not in the camera's focal plane, its out-of-focus blurriness is characterized by the COC. However, it is impossible to determine from the blurriness alone whether the object is behind or in front of the focal plane [8]. In the following, we consider both cases.
In the thin-lens model, $u$ is the distance between the lens and the scene point, $v$ is the distance between the focal plane and the lens, and $f$ is the focal length of the lens. If a light point is not in the focal plane but placed in front of the camera, the source's image will be a circular disk with diameter $D_{\mathrm{COC}}$ instead of a point, as shown in Figure 2.
Let $d$ be the distance between the lens and the image sensor. Then the in-focus scene distance $u_{\text{in-focus}}$ can be derived from the lens formula, as given in Eq. (1). For a particular lens, the focal length $f$ and the aperture diameter $A$ are constants, so the F-number $N = f/A$ is also a constant. Given the geometric relationship $D_{\mathrm{COC}}(u)$ shown in Figure 2 and the lens formula, the COC diameter of a scene point at distance $u$ from the lens depends on whether $u > u_{\text{in-focus}}$ (the scene point is farther from the lens than the focal plane) or $u < u_{\text{in-focus}}$ (the scene point is closer to the lens than the focal plane).
In the case where $u > u_{\text{in-focus}}$, the similar triangles shown in Figure 2(a), combined with Eq. (1) and $N = f/A$, give Eq. (4). In the case where $u < u_{\text{in-focus}}$, the similar triangles shown in Figure 2(b), combined with Eq. (1) and $N = f/A$, give Eq. (6). From Eqs. (4) and (6), we can derive $D_{\mathrm{COC}}$ of a scene point; however, the equations do not allow us to determine whether the scene point is in front of or behind the focal plane. To remove the ambiguity, we adopt the assumption that all scene points are behind the focal plane.
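The two cases can be illustrated numerically. The sketch below assumes the standard thin-lens result $D_{\mathrm{COC}} = f^2\,|u - u_{\text{in-focus}}| / \big(N\,u\,(u_{\text{in-focus}} - f)\big)$, which is expected to agree with Eqs. (4) and (6) but is not copied from them. The function names and the camera parameters are hypothetical.

```python
def coc_diameter(u: float, f: float, n: float, u_focus: float) -> float:
    """Thin-lens circle-of-confusion diameter for a scene point at distance u.

    Assumed standard form: D = f^2 * |u - u_focus| / (n * u * (u_focus - f)).
    All lengths share one unit (for example millimetres).
    """
    if u <= 0 or u_focus <= f:
        raise ValueError("need u > 0 and u_focus > f")
    return f * f * abs(u - u_focus) / (n * u * (u_focus - f))


def coc_limit(f: float, n: float, u_focus: float) -> float:
    """Limit of the COC diameter as u goes to infinity (D*_COC in the text)."""
    return f * f / (n * (u_focus - f))


# Hypothetical camera: 50 mm lens at f/2, focused at 2 m.
F, N, U_FOCUS = 50.0, 2.0, 2000.0

behind_u = 4000.0                                        # behind the focal plane
front_u = U_FOCUS * behind_u / (2 * behind_u - U_FOCUS)  # in front, same COC

behind = coc_diameter(behind_u, F, N, U_FOCUS)
front = coc_diameter(front_u, F, N, U_FOCUS)
print(f"u = {behind_u:.0f} mm -> D_COC = {behind:.4f} mm")
print(f"u = {front_u:.0f} mm -> D_COC = {front:.4f} mm")
```

The two printed diameters coincide, which is the front/behind ambiguity that the behind-the-focal-plane assumption removes.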
An image is usually modeled as the convolution of the scene and a camera-relative point spread function (PSF). The pillbox function is an ideal PSF: a box function with support $\sigma$ and constant value $1/\sigma$. The pillbox function is usually approximated by a Gaussian function with standard deviation $\sigma/\sqrt{2}$. With this factor, the difference between the frequency-domain magnitudes of the two functions is small, and the Gaussian is easier to analyze. In this chapter, we use the Gaussian function to characterize the PSF of a camera.

Blurring and deblurring approach
Using the proposed approach, scene depths are determined from the estimated blurriness in a single image by combining blurring and deblurring processes. In this section, we explain the rationale for combining the blurring and deblurring processes and provide the formulation of the combined approach. The depth of a scene point is defined as the distance between the camera lens and the point. In addition, the proposed method assumes that all scene points of interest are behind the focal plane (the case $u > u_{\text{in-focus}}$), as in [9].
Figure 2. The geometry of imaging: $u$ is the distance of a scene point from the lens, $u_{\text{in-focus}}$ is the distance of the focal plane from the lens, $d$ is the distance between the lens and the image sensor, and $A$ is the diameter of the lens aperture.

Concept
The ($D_{\mathrm{COC}}$ vs. $u$) curve of Eq. (4), illustrated in Figure 3(a), gives the relationship between the depth of a scene point and its $D_{\mathrm{COC}}$ value for a given camera. $D_{\mathrm{COC}}$ increases with the depth of the scene point. When $D_{\mathrm{COC}}$ reaches its limit $D^{*}_{\mathrm{COC}}$, the point can be assumed to be at infinity.
Let $D_{\mathrm{COC}}$ be the blurriness of a point. A blurring operator can be defined that adds an increment of blurriness to the point to obtain a new blurriness $D_{\mathrm{COC}} + \delta(D_{\mathrm{COC}})$. This can be regarded as increasing the depth of the point by moving it along the ($D_{\mathrm{COC}}$ vs. $u$) curve toward the right end point of the figure. If the blurring operation is applied repeatedly, the blurriness can reach $D^{*}_{\mathrm{COC}}$, at which the point is at infinity.
If the increment in blurriness needed for a point to reach $D^{*}_{\mathrm{COC}}$ can be determined, we can convert this increment into an increment in depth by referring to the ($D_{\mathrm{COC}}$ vs. $u$) curve in Figure 3(a) and derive the true depth of the point. However, as shown by the curve of $\partial u / \partial D_{\mathrm{COC}}$ in Figure 3(b), a small increment in blurriness close to $D^{*}_{\mathrm{COC}}$ yields a very large increment in depth. This means that depth determination close to $D^{*}_{\mathrm{COC}}$ is relatively unstable and inaccurate.
On the other hand, the deblurring operator is defined to reduce the blurriness of the point to obtain $D_{\mathrm{COC}} - \delta(D_{\mathrm{COC}})$. If a deblurring operation is repeatedly applied to a point, the point becomes sharper. The deblurring process gradually reduces the depth of the point by moving it along the ($D_{\mathrm{COC}}$ vs. $u$) curve toward the left end point, which corresponds to moving the point to the focal plane, where it is in focus. If the decrement in blurriness needed to bring a point into focus can be determined, we can convert it into the decrement in depth from the point to the focal plane and then refer to the ($D_{\mathrm{COC}}$ vs. $u$) curve in Figure 3(a) to obtain the true depth of the point. However, as shown by the curve of $\partial u / \partial D_{\mathrm{COC}}$ in Figure 3(b), a small decrement in depth close to the focal plane can yield a substantial decrement in $D_{\mathrm{COC}}$, which means that $D_{\mathrm{COC}}$ cannot be reliably and accurately obtained when the point is moved close to the focal plane by a deblurring operation.
Figure 3. (a) In the blurring process, the dotted point is moved along curve A. In the deblurring process, the point is moved along curve B. (b) The derivative of the curve in (a). When $u$ approaches $\infty$, the blurring process fails to estimate the depth. When $u$ is close to the focal plane, $u_{\text{in-focus}}$, the deblurring process fails.
Since depth estimation at large $D_{\mathrm{COC}}$ and $D_{\mathrm{COC}}$ estimation for a point close to the focal plane are both unreliable, we propose a blurring and deblurring approach that combines the differential blurring and deblurring operations to yield a more robust depth estimate of a scene point than either operation alone.
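The instability near $D^{*}_{\mathrm{COC}}$ can be made concrete with a short sketch. It inverts the assumed standard thin-lens relation $D_{\mathrm{COC}} = D^{*}_{\mathrm{COC}}\,(1 - u_{\text{in-focus}}/u)$ for points behind the focal plane and evaluates $\partial u / \partial D_{\mathrm{COC}}$. This relation is an assumption standing in for Eq. (4), and the camera parameters are hypothetical.

```python
def coc_limit(f: float, n: float, u_focus: float) -> float:
    """D*_COC: the COC diameter of a point at infinity (assumed thin-lens form)."""
    return f * f / (n * (u_focus - f))


def depth_from_coc(d_coc: float, f: float, n: float, u_focus: float) -> float:
    """Depth of a point behind the focal plane with the given COC diameter."""
    d_star = coc_limit(f, n, u_focus)
    if not 0.0 <= d_coc < d_star:
        raise ValueError("COC diameter must lie in [0, D*_COC)")
    return u_focus * d_star / (d_star - d_coc)


def depth_sensitivity(d_coc: float, f: float, n: float, u_focus: float) -> float:
    """Derivative du/dD_COC of the inverse curve above."""
    d_star = coc_limit(f, n, u_focus)
    return u_focus * d_star / (d_star - d_coc) ** 2


# Hypothetical camera: 50 mm lens at f/2, focused at 2 m.
F, N, U_FOCUS = 50.0, 2.0, 2000.0
D_STAR = coc_limit(F, N, U_FOCUS)

fractions = (0.1, 0.5, 0.9, 0.99)
sensitivities = [depth_sensitivity(p * D_STAR, F, N, U_FOCUS) for p in fractions]
for p, s in zip(fractions, sensitivities):
    depth = depth_from_coc(p * D_STAR, F, N, U_FOCUS)
    print(f"D_COC = {p:.2f} D*: depth = {depth:10.0f} mm, du/dD = {s:14.0f}")
```

Under these assumptions the sensitivity grows without bound as the blurriness approaches $D^{*}_{\mathrm{COC}}$, so a small error in the measured blurriness there translates into a very large depth error.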

Formulation
Let $u_0$ be the true depth of the point at $x$, and let $F_b$ and $F_d$ be the blurring measurement and the deblurring measurement, respectively. The blurring measurement measures whether a point is blurred to $D^{*}_{\mathrm{COC}}$, and the deblurring measurement measures whether that point is deblurred to be in focus. We define $F_d$ to be a proper function with a (local) minimum near $u_{\text{in-focus}}$. The true depth $u_0$ of the point $x$ is determined by minimizing the objective function in Eq. (7), which combines the two measurements, subject to the constraint in Eq. (8). Here $D^{*}_{\mathrm{COC}} - D_{\mathrm{COC}}(u_{\text{in-focus}})$ is a camera-dependent constant, and $\lambda$ is the Lagrangian parameter that balances the blurriness and deblurriness measurements. The constraint involves two quantities: the increment of blurriness needed to bring a point assumed to be at depth $u$ to $D^{*}_{\mathrm{COC}}$, and the decrement of $D_{\mathrm{COC}}$ needed to bring that point to the focal plane. The constraint in Eq. (8) is necessary because it states that the sum of the added blurriness from the current guess $u$ to $D^{*}_{\mathrm{COC}}$ and the reduced blurriness from $u$ to $D_{\mathrm{COC}}(u_{\text{in-focus}})$ is the constant $D^{*}_{\mathrm{COC}} - D_{\mathrm{COC}}(u_{\text{in-focus}})$.
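One reading of this formulation that is consistent with the surrounding description can be written as follows. It is a sketch only: the exact argument forms of $F_b$ and $F_d$ and the notation for the two blurriness changes are assumptions, not the chapter's own Eqs. (7) and (8).

```latex
% Hedged sketch: F_b, F_d are the blurring and deblurring measurements,
% lambda balances them; the bracketed terms are the added and removed blurriness.
\begin{align}
  u_0 &= \arg\min_{u}\; F_b(x, u) + \lambda\, F_d(x, u), \\
  \text{subject to}\quad
  &\bigl[D^{*}_{\mathrm{COC}} - D_{\mathrm{COC}}(u)\bigr]
   + \bigl[D_{\mathrm{COC}}(u) - D_{\mathrm{COC}}(u_{\text{in-focus}})\bigr]
   = D^{*}_{\mathrm{COC}} - D_{\mathrm{COC}}(u_{\text{in-focus}}).
\end{align}
```

The first bracket is the blurriness added to move a point assumed at depth $u$ to infinity, and the second is the blurriness removed to bring it into focus, so their sum does not depend on $u$.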

Blurring and deblurring measurements
To simplify the analysis without loss of generality, the following derivations are based on one-dimensional signals and neglect boundary conditions.

Blurring measurement
The objective of the blurring process is to determine the amount of blurriness required for a point to reach $D^{*}_{\mathrm{COC}}$. When edged or textured patches are placed progressively farther away, the details of the patches become faint, their variances decrease, and only their mean brightness can be observed at infinity. Thus, the variance of a patch can be used as the blurriness measurement. Specifically, when a patch is blurred to reach $D^{*}_{\mathrm{COC}}$, its variance can be assumed to be 0. Let the true depth of the scene point $x$ be $u_0$, and let the image of the point be $f_0(x) = g_{\sigma(u_0)^2} * s(x)$, where $s$ is the scene signal, $g$ is the Gaussian function, and $\sigma(u_0)^2$ is the variance of $g$ at depth $u_0$. The blurriness measurement $F_b$ is defined in Eq. (12). Its evaluation follows from Eqs. (11) and (12) by using the fact that the convolution of two Gaussians of variances $\sigma_1^2$ and $\sigma_2^2$ is a Gaussian of variance $\sigma_1^2 + \sigma_2^2$. If $u$ is equal to $u_0$, the blurred patch reduces to $g_{\sigma(u_\infty)^2} * s(x)$, which can be approximated as the mean of $f_0(x)$, so its variance is approximately zero. Thus, $F_b$ reaches a local minimum when $u$ is equal to $u_0$.
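The two facts used above, that patch variance shrinks as blur grows and that Gaussian variances add under convolution, can be checked on a one-dimensional step edge. This is a minimal sketch with hypothetical helper names and a kernel truncated at three standard deviations. It is not the chapter's $F_b$.

```python
import math


def gaussian_kernel(sigma: float) -> list[float]:
    """Sampled Gaussian, truncated at 3 sigma and normalised to sum to 1."""
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal: list[float], sigma: float) -> list[float]:
    """Gaussian blur of a 1D signal; samples beyond the ends repeat the end value."""
    if sigma <= 0:
        return list(signal)
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [sum(w * signal[min(max(i + k - radius, 0), last)]
                for k, w in enumerate(kernel))
            for i in range(len(signal))]


def patch_variance(signal: list[float], center: int, half_width: int) -> float:
    """Variance of the samples in a window around `center`."""
    patch = signal[center - half_width:center + half_width + 1]
    mean = sum(patch) / len(patch)
    return sum((v - mean) ** 2 for v in patch) / len(patch)


edge = [0.0] * 64 + [1.0] * 64          # ideal step edge
scales = (0.0, 1.0, 2.0, 4.0, 8.0, 16.0)
variances = [patch_variance(blur(edge, s), 64, 8) for s in scales]
for s, v in zip(scales, variances):
    print(f"sigma = {s:4.1f}: patch variance = {v:.4f}")

# Blurring by sigma 2 and then 3 should match one blur of sigma sqrt(2^2 + 3^2).
twice = blur(blur(edge, 2.0), 3.0)
once = blur(edge, math.sqrt(13.0))
max_gap = max(abs(a - b) for a, b in zip(twice, once))
print(f"largest difference between the two blurs: {max_gap:.5f}")
```

The variance around the edge decreases monotonically with the blur scale, which is why it can serve as an indicator of how close a patch is to the fully blurred state.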

Deblurring measurement using blurring-deblurring operator
In contrast to blurring, deblurring is extremely unstable, and it usually assumes some prior knowledge of the scene so that the high-frequency (edge and texture) information can be recovered. Because of this prior assumption, when the depth is overestimated, ringing artifacts occur in the result of the deblurring process.
An image is first deblurred and then blurred by the same Gaussian kernel with variance $\sigma(u)^2$. The deblurring process tends to over-enhance the high-frequency information in the image if $u$ is an overestimated depth. As shown by the subfigures in the second row of Figure 4(c), this causes severe artifacts. We therefore use the error produced by this blurring-deblurring operator at a given point, denoted $S(x, u)$, as a measurement. As shown in Figure 4(d), when the estimated blurring scale exceeds the true scale (which is 4 in this example), $S(x, u)$ increases dramatically because of the artifacts in the neighborhood of the edge points.
From Figure 4(d), $S(x, u)$ is asymmetric with respect to over- and underestimation of $u_0$, where $u_0$ is the true blurring scale or true depth. To capture the transition point from small to large values of $S(x, u)$, we calculate the curvature of the smooth curve at each candidate $u_j$, as shown in Figure 4(e). The larger the curvature, the higher the probability that $u_j$ is the pivot point of the transition. Thus, we define the deblurring measurement $F_d$ in Eq. (17) based on this curvature. The measure $F_d(x, u - u_0 + u_{\text{in-focus}})$ has a local minimum at the transition of $S(x, u)$. Thus, when $u$ is equal to $u_0$, $F_d(x, u - u_0 + u_{\text{in-focus}})$ attains its minimum.
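Locating the transition by curvature can be sketched as below. The code assumes the usual plane-curve curvature $|S''| / (1 + S'^2)^{3/2}$ evaluated with central differences, which may differ from the chapter's own curvature equation. The sample values of $S$ are hypothetical and only mimic the shape described for Figure 4(d).

```python
def curvature(values: list[float], step: float = 1.0) -> list[float]:
    """Curvature |S''| / (1 + S'^2)^(3/2) at interior samples (0 at the ends)."""
    result = [0.0] * len(values)
    for j in range(1, len(values) - 1):
        first = (values[j + 1] - values[j - 1]) / (2 * step)
        second = (values[j + 1] - 2 * values[j] + values[j - 1]) / (step * step)
        result[j] = abs(second) / (1 + first * first) ** 1.5
    return result


# Hypothetical error curve: nearly flat up to scale 4, then rising sharply.
scales = list(range(9))
errors = [0.0, 0.01, 0.02, 0.03, 0.04, 1.0, 2.0, 3.0, 4.0]

kappa = curvature(errors)
knee = max(range(len(kappa)), key=kappa.__getitem__)
print(f"largest curvature at scale {scales[knee]}")
```

For this toy curve the largest curvature falls at scale 4, the last scale before the error starts to grow.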

Depth estimation
The blurring and deblurring measurements, $F_b$ and $F_d$, defined in Eqs. (12) and (17), respectively, can be substituted into the objective function in Eq. (7) to obtain Eq. (18). The blurring and deblurring approach can now be used to derive the solution for $u_0$ based on the constraint in Eq. (8). The complexity of the problem depends on how precisely the depth is measured for each point. Although depth is an important cue, relative depths, such as which object is in the foreground and which is in the background, appear to be more important than accurate absolute depths. The blurring and deblurring approach is a point-wise optimization method. To save computational cost, we select the best solution from a sequence of $d$ candidate depths, $u_1, \cdots, u_d$: the solution of the optimization problem is the candidate depth that minimizes the objective. The candidate depths were chosen according to Eq. (19). For an image of $N^2$ pixels in which $\gamma$ is the ratio of edged and textured pixels, point-wise blurring and deblurring operations are needed at $\gamma N^2$ pixels to determine the depth map of the image.
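The candidate search for a single point can be sketched as follows. The two measurement callables are toy stand-ins for $F_b$ and $F_d$, and the function name and the weight are hypothetical. Only the selection rule, picking the candidate with the smallest combined value, is taken from the text.

```python
from typing import Callable, Sequence


def best_candidate_depth(
    candidates: Sequence[float],
    blur_measure: Callable[[float], float],
    deblur_measure: Callable[[float], float],
    lam: float,
) -> float:
    """Return the candidate depth minimising blur_measure + lam * deblur_measure."""
    return min(candidates, key=lambda u: blur_measure(u) + lam * deblur_measure(u))


# Toy measurements for one pixel, both with minima close to depth 3.
candidates = [1.0, 2.0, 3.0, 4.0, 5.0]
estimate = best_candidate_depth(
    candidates,
    blur_measure=lambda u: (u - 3.0) ** 2,
    deblur_measure=lambda u: abs(u - 3.2),
    lam=0.5,
)
print(f"estimated depth: {estimate}")
```

Because each pixel is handled independently, the same search can be repeated for every edged or textured pixel.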

Depth refinement and image deblurring
The blurring and deblurring measurements are only able to determine the depths of edge and texture points. If blurring and deblurring with a Gaussian kernel of any variance are applied to a sufficiently large patch of constant value, the result is again a patch of constant value. Therefore, the proposed approach cannot reliably determine the depths of points in smooth regions. Thus, we resort to another approach to derive the depths of smooth scene points.
All-in-focus synthesis, on the other hand, generates an image that is focused everywhere, that is, it transfers the depth map of an image to a depth map in the focal plane. An all-in-focus image can therefore be generated by the deblurring process. In practice, since the deblurring process is very sensitive to overestimated depths, any overestimated depths in an image can degrade the all-in-focus result and render a visually unacceptable image.
This problem cannot be trivially solved by subtracting a constant depth from all the points, because the value to subtract is not easy to determine. It should be large enough to stabilize the deblurring process and at the same time small enough that the depths are not underestimated too much, which would leave the all-in-focus image blurred. We used two methods, depth quantization and a total variation (TV) deblurring process, to rectify the effects of depth estimation errors.

Depths of smooth scene points
Following [6], we estimate the depths at edges and textures and then propagate the results to the other points. In our method, we use the Canny edge detector [10] to decide whether a point is an edged or textured point. The depths of these points (called Canny points) are then estimated by the blurring and deblurring approach. For convenience, we call the remaining points the smooth points.
The propagation algorithm to derive the depths of smooth points was based on the solution of the Dirichlet problem [11], which addresses the temperature distribution from the boundary to the interior of a medium. The solution of the Dirichlet problem is based on two principles: the maximum principle and the uniqueness principle. The maximum principle states that the interior temperature lies between the maximum boundary temperature and the minimum boundary temperature, and the uniqueness principle states that the solution of the problem is unique. In our approach, we regarded the temperature as the depth and defined the boundary points as the union of the smooth points at the border and the non-smooth points of an image. The depth was first assigned to the smooth points at the border of the image. Then, we used the solution of the Dirichlet problem to derive the depth of the smooth points inside the image. By this approach, the depths of the smooth points were never larger than those of the enclosing points.
The steps of the depth propagation procedure are as follows. First, we normalized the depths of the Canny points by setting the depth of the point farthest from the camera to 1. Then, we assigned depths to the smooth points on the borders of the image. Because the top border of an image is usually the background, the smooth points on that border were assigned the depth 1. A smooth point on the left-hand, right-hand, or bottom border was assigned the depth of the closest non-smooth point. Figure 5 shows an example of how the depths are propagated across an image.
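A discrete version of this propagation can be sketched by repeatedly replacing each unknown depth with the average of its four neighbours, which solves the discrete Laplace equation with the known depths as boundary values. The function name, the grid, the sweep count, and the depth values are hypothetical. The chapter does not state which solver was used.

```python
from typing import Optional


def propagate_depths(
    known: list[list[Optional[float]]], sweeps: int = 500
) -> list[list[float]]:
    """Fill the None entries with the discrete harmonic interpolation of the rest.

    Every border entry must already hold a depth, as in the procedure above.
    """
    rows, cols = len(known), len(known[0])
    for r in range(rows):
        for c in range(cols):
            on_border = r in (0, rows - 1) or c in (0, cols - 1)
            if on_border and known[r][c] is None:
                raise ValueError("every border point needs an assigned depth")
    fixed = [v for row in known for v in row if v is not None]
    start = sum(fixed) / len(fixed)
    depth = [[start if v is None else v for v in row] for row in known]
    for _ in range(sweeps):                      # Gauss-Seidel sweeps
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                if known[r][c] is None:
                    depth[r][c] = 0.25 * (depth[r - 1][c] + depth[r + 1][c]
                                          + depth[r][c - 1] + depth[r][c + 1])
    return depth


# Hypothetical 6 x 6 example: top border at depth 1, other borders at 0.2,
# and one interior Canny point with an estimated depth of 0.5.
size = 6
grid: list[list[Optional[float]]] = [[None] * size for _ in range(size)]
for c in range(size):
    grid[0][c] = 1.0
    grid[size - 1][c] = 0.2
for r in range(1, size - 1):
    grid[r][0] = 0.2
    grid[r][size - 1] = 0.2
grid[3][3] = 0.5

result = propagate_depths(grid)
for row in result:
    print(" ".join(f"{v:.2f}" for v in row))
```

Every propagated value stays between the smallest and largest known depth, which is the maximum principle described above.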

Depth quantization
In the quantization process, a depth is approximated by the representative depth of a layer. The idea is motivated by scalar quantization in compression, in which a coefficient is approximated by a quantized value. The number of layers $L$ is a parameter of the process. First, the histogram of the depths obtained in Section 3.3.3 is calculated. The representative (anchor) depth of layer 1, $a_1$, is always assigned the minimum of all the depths. When the depths are partitioned into two layers, the anchor depth of layer 2, $a_2$, is determined by the optimization in Eq. (20), where $a_2$ is subject to $a_2 > z_i \ge a_1$ for each $z_i$ in layer 1 and $z_i \ge a_2$ for each $z_i$ in layer 2. By recursively subdividing a depth layer into two depth layers, the procedure can be applied to acquire $L$ layers. Once the anchor depth of a layer is determined, the depths in the layer are updated to the anchor depth; for instance, the depth $z_i$ in layer $j$ is assigned $a_j$. Hence, $z_i - a_j$ for $z_i$ in layer $j$ is the error in approximating $z_i$ with $a_j$, and the anchor depth obtained from Eq. (20) minimizes this error.
The anchor depth $a_2$ in Eq. (20) is derived by first sampling a few depths as candidate anchor depths. Each candidate is then set as $a_2$ to calculate the average error from Eq. (20). With the help of the histogram of the depths, this process can be carried out efficiently. The anchor depth is the candidate that yields the smallest average error. Figure 6 shows the estimated depth map and the depth quantization results on an image composed of four layers of depths. After quantization, some outliers in Figure 6(b) are removed.
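The two-layer case can be sketched as below. Since Eq. (20) is not reproduced here, the sketch assumes that the quantity being minimized is the average of $z_i - a_j$ over all depths. The function names and sample depths are hypothetical, and the histogram speed-up is omitted for brevity.

```python
def second_anchor(depths: list[float], candidates: list[float]) -> tuple[float, float]:
    """Return (a1, a2): a1 is the minimum depth, a2 the best candidate anchor."""
    a1 = min(depths)

    def average_error(a2: float) -> float:
        # Depths below a2 belong to layer 1 (anchor a1), the rest to layer 2.
        return sum(z - (a2 if z >= a2 else a1) for z in depths) / len(depths)

    valid = [a for a in candidates if a > a1]
    if not valid:
        raise ValueError("need at least one candidate above the minimum depth")
    return a1, min(valid, key=average_error)


def quantize(depths: list[float], anchors: list[float]) -> list[float]:
    """Replace each depth with the largest anchor that does not exceed it."""
    return [max(a for a in anchors if a <= z) for z in depths]


# Hypothetical normalised depths forming a near layer and a far layer.
depths = [0.10, 0.12, 0.15, 0.11, 0.80, 0.82, 0.85, 0.90]
a1, a2 = second_anchor(depths, candidates=[0.2, 0.5, 0.8, 0.85])
layers = quantize(depths, [a1, a2])
print(f"anchors: {a1}, {a2}")
print(f"quantized depths: {layers}")
```

Applying the same split recursively to one of the resulting layers would give further layers, as described above.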

Deblurring process
A patch $y$ can be modeled as $g(\sigma) * x$, where $x$ and $y$ are vectors, $x$ is the vectorized scene patch $X$ ($x = \mathrm{vec}(X)$), and $\sigma$ is the out-of-focus blurriness of $x$ in $y$. The deblurring process restores $x$ by solving the minimization in Eq. (21), where $\mu$ is a Lagrangian multiplier and $\|D_1 X\|_1$ and $\|D_2 X\|_1$ denote the discrete total variation of $X$ in the horizontal and vertical directions, respectively. The convolution $g(\sigma) * x$ can be represented as a matrix-vector multiplication. Eq. (21) can be solved by an efficient variable splitting technique, as described in Ref. [12].
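To show why a total variation term discourages ringing, the sketch below evaluates an objective of the common form $\tfrac{\mu}{2}\,\|g(\sigma) * x - y\|_2^2 + \|D_1 X\|_1 + \|D_2 X\|_1$. This form is an assumption about Eq. (21), the blur is passed in as a callable, and the patches are hypothetical. The code only evaluates the objective and does not implement the variable splitting solver of Ref. [12].

```python
from typing import Callable

Image = list[list[float]]


def anisotropic_tv(image: Image) -> float:
    """Sum of absolute horizontal and vertical differences between neighbours."""
    rows, cols = len(image), len(image[0])
    horizontal = sum(abs(image[r][c + 1] - image[r][c])
                     for r in range(rows) for c in range(cols - 1))
    vertical = sum(abs(image[r + 1][c] - image[r][c])
                   for r in range(rows - 1) for c in range(cols))
    return horizontal + vertical


def tv_objective(estimate: Image, observed: Image,
                 blur: Callable[[Image], Image], mu: float) -> float:
    """Assumed objective: (mu / 2) * ||blur(estimate) - observed||^2 + TV(estimate)."""
    blurred = blur(estimate)
    data = sum((b - o) ** 2
               for blurred_row, observed_row in zip(blurred, observed)
               for b, o in zip(blurred_row, observed_row))
    return 0.5 * mu * data + anisotropic_tv(estimate)


def identity(image: Image) -> Image:
    """Stand-in for the Gaussian blur, used here to keep the example short."""
    return image


clean = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]]
ringing = [[0.0, -0.2, 1.2, 1.0], [0.0, -0.2, 1.2, 1.0]]   # overshoot at the edge

print(f"TV of clean edge:   {anisotropic_tv(clean):.2f}")
print(f"TV of ringing edge: {anisotropic_tv(ringing):.2f}")
print(f"objective of clean estimate: {tv_objective(clean, clean, identity, 10.0):.2f}")
```

The overshoot around the edge raises the TV term, so an estimate with ringing is penalized relative to a clean edge that explains the observation equally well.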

Experimental results
We demonstrate the depth estimation results and some applications of the estimated depth map, including all-in-focus image synthesis, refocusing, defocus magnification, and 2D to 3D conversion. We use synthetic images, real images, and video frames to show the performance of the method.

Synthetic image
First, we evaluated the depth estimation on synthetic images. The scene images with ground truth depth maps are from [13], but the images and depth maps are not aligned. The proposed method estimates blurriness from a blurred image, whereas the given scene images are close to all-in-focus. Therefore, we aligned each depth map with its original image by nearest neighbor scaling and then applied Eq. (4) to generate a blur map from the given depth map with fixed camera parameters. The resulting Gaussian blur scales are limited to the range from 0 to 8. A blurred image is synthesized by convolving the original scene image with the Gaussian kernels given by the corresponding blur map. Two sets of original scene images, ground truth depth maps, synthetic blurred images, and estimated depth maps are shown in Figure 7.
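The synthesis step can be sketched in one dimension: every output sample is a Gaussian-weighted average of the sharp signal, with the kernel scale read from the blur map at that position. This gather-style approximation, the function name, and the test signal are assumptions. The chapter does not specify how the spatially varying convolution was implemented.

```python
import math


def blur_with_map(signal: list[float], sigma_map: list[float]) -> list[float]:
    """Blur each sample with its own Gaussian scale; scale 0 leaves it unchanged."""
    last = len(signal) - 1
    output = []
    for i, sigma in enumerate(sigma_map):
        if sigma <= 0:
            output.append(signal[i])
            continue
        radius = math.ceil(3 * sigma)
        offsets = range(-radius, radius + 1)
        weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in offsets]
        # Samples beyond the ends of the signal repeat the end value.
        total = sum(w * signal[min(max(i + k, 0), last)]
                    for k, w in zip(offsets, weights))
        output.append(total / sum(weights))
    return output


# Fine texture; the left half is in focus, the right half has blur scale 4.
texture = [0.0, 1.0] * 32
blur_map = [0.0] * 32 + [4.0] * 32
blurred = blur_with_map(texture, blur_map)
print("in-focus samples:", blurred[:4])
print("blurred samples: ", [round(v, 3) for v in blurred[40:44]])
```

The in-focus half is left untouched, while the texture in the blurred half collapses toward its mean brightness.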

Real image
Second, we applied the depth estimation method to real images. Figure 8 shows the results on a two-layer image; the input image (a) is from [14]. The estimated depth map, synthesized all-in-focus image, quantized depth map, refocused image, and defocus-magnified image are shown in Figure 8(b)-(f), respectively. With the quantized depth map, refocusing and defocus magnification can be carried out on the all-in-focus image. The quality of these applications therefore depends on the quality of the depth map and the all-in-focus image.
Figure 9(a) is a three-layer poker card image in which the camera was focused on the third row of cards. The blurriness increases from the bottom of the image to the top, so the first and second rows of cards are out of focus. Figure 9(b) shows the synthesized all-in-focus image, and (c) and (d) show the corresponding magnified regions from (a) and (b), respectively.
Figure 10(a) was captured from a slanted brick wall. As the camera was focused on the leftmost part of the image, the blurriness of the brick wall increases progressively from the left of the image to the right. Figure 10(b) shows the synthesized all-in-focus image, and (c) and (d) show the magnified regions. For these two sets of images, the results show a significant improvement over the original images.

Video frame
Third, we applied the method to video frames to show the performance of 2D to 3D conversion. The video frames in Figure 11(a), which are from YouTube, are used as the input left-eye images; (b) shows the right-eye images synthesized from the left-eye images and the corresponding depth maps; and (c) shows the anaglyphs synthesized from (a) and (b). The 3D effect can be viewed with anaglyph glasses. It can also be viewed by displaying the stereo image on a mobile device mounted in a VR viewer such as Google Cardboard.
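One simple way to obtain a right-eye view and an anaglyph from a left-eye row and its depth is sketched below. The disparity model (a leftward shift that shrinks linearly with normalized depth), the hole filling, and the red-cyan channel assignment are common choices assumed here for illustration. The chapter does not describe its own synthesis procedure, and all names and values are hypothetical.

```python
from typing import Optional

Pixel = tuple[int, int, int]


def synthesize_right_row(left_row: list[Pixel], depth_row: list[float],
                         max_disparity: int) -> list[Pixel]:
    """Shift each pixel left by a disparity that is largest for near points.

    Depths are normalised so that 1.0 is the farthest point (zero disparity).
    """
    width = len(left_row)
    right: list[Optional[Pixel]] = [None] * width
    # Paint far pixels first so that nearer pixels overwrite them.
    for c in sorted(range(width), key=lambda col: depth_row[col], reverse=True):
        target = c - round(max_disparity * (1.0 - depth_row[c]))
        if 0 <= target < width:
            right[target] = left_row[c]
    filled: list[Pixel] = []
    for c, pixel in enumerate(right):
        if pixel is None:                     # hole uncovered by the shift
            pixel = filled[c - 1] if c > 0 else left_row[c]
        filled.append(pixel)
    return filled


def red_cyan_anaglyph(left_row: list[Pixel], right_row: list[Pixel]) -> list[Pixel]:
    """Red channel from the left view, green and blue from the right view."""
    return [(l[0], r[1], r[2]) for l, r in zip(left_row, right_row)]


grey, red = (10, 10, 10), (200, 0, 0)
left = [grey, grey, grey, red, grey, grey]
depth = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0]        # the red pixel is the near object

right = synthesize_right_row(left, depth, max_disparity=2)
anaglyph = red_cyan_anaglyph(left, right)
print("right-eye row:", right)
print("anaglyph row: ", anaglyph)
```

In this toy row the near pixel moves two positions to the left in the right-eye view, and the gap it leaves is filled from its left neighbour.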

Conclusions
In this chapter, a single-image depth estimation method was presented. The depth map is derived from the camera's characteristic curve of COC vs. depth. Applications that manipulate depth maps can be realized by first deblurring an image to an all-in-focus image and then blurring the all-in-focus image. The successful generation of an all-in-focus image from a depth map is therefore an important test of the depth estimation method. Furthermore, the quality of 2D to 3D conversion also depends on the performance of depth estimation. The proposed depth estimation method makes it possible to produce high-quality all-in-focus images and 2D to 3D conversions, even from originals with a complex depth layout.