RMSE of recovered depth.

## Abstract

Fixational eye movement is an essential function for watching things using the retina, which has the property of responding only to changes in incident light. However, since the rotation of the eyeball causes the translational movement of the crystalline lens, it is possible in principle to recover the depth of the object from the moving image obtained in this way. We have proposed two types of depth restoration methods based on fixation tremor; differential-type method and integral-type method. The first is based on the change in image brightness between frames, and the latter is based on image blurring due to movement. In this chapter, we introduce them and explain the simulations and experiments performed to verify their operation.

### Keywords

- motion stereoscopic
- fixational eye movements
- differential-type method
- integral-type method
- optical flow
- gradient equation
- image blur

## 1. Introduction

When humans stare at a target, an irregular involuntary movement called fixational eye movements occur [1]. The human retina can maintain reception sensitivity by finely vibrating the image of the target on the retina, so in order to see something, first fixation motion is required. It has been reported that the vibrations may work not only as such the intrinsic function to preserve photosensitivity but also as an assistance in image analysis, the mechanism of which can be interpreted as an instance of stochastic resonance (SR) [1]. SR is inspired by biology, more specifically by neuron dynamics [2], and based on it, the Dynamic Retina (DR) [3] and the Resonant Retina (RR) [4], which are new vision devices taking advantage of random camera vibrations, were proposed for contrast enhancement and edge detection respectively. It has been reported that the movement of the retinal image due to fixation eye movements can be an unconscious clue to depth perception, and an actual vision system based on fixational eye movements has been proposed [5].

On the other hand, binocular stereopsis is vigorous and plays an essential role in depth perception of a human vision system [6]. In general, binocular stereopsis detects relatively large disparities, hence it can recognize high accurate depth. However this causes an occlusion problem, and a lot of solutions of it have been proposed. Wang et al. have proposed a local detector for occlusion based on deep learning [7]. In [8], a robust depth restoration method has been proposed that integrates line-field imaging technology that simultaneously observes multiple angle views with stereo vision. Therefore, we expect that primitive depth information detected by fixational eye movements can be used to solve occlusion for binocular stereopsis. There is a concern that the accuracy of depth restoration by a small camera motion is lower than that of stereo vision. Even so, it is expected that erroneous correspondence due to the existence of occlusion can be reduced by using the depth information from fixational eye movements for the correspondence problem in stereo vision.

In monocular stereoscopic vision, “structure from motion (SFM)” has been the most widely studied, and many remarkable results have been reported. SFM has various calculation principles. To achieve spatially dense depth recovery with high computational efficiency, a method based on the gradient equation that expresses the constraint between the spatiotemporal derivative values of image intensity and the movement on the image is effective [9, 10, 11]. It should be noted that for such a gradient method, there is an appropriate size of movement to recover the correct depth. Since the gradient equation holds perfectly for small motions, the error in the equation cannot be ignored for very large motions. On the contrary, in the case of small movement, the motion information is buried in the observation error of the spatiotemporal derivative in intensity.

Adaptation of the frame rate is required to make the motion size suitable for the gradient method. We have proposed a method that does not require a variable frame rate based on multi-resolution decomposition of images, but it requires high computational cost [12]. Therefore, we focus on small movements with an emphasis on avoiding equation errors in the gradient method. Then, in order to solve the above signal-to-noise ratio (S/N) problem that occurs with small movements, many observations are collected and used all at once [13, 14]. In such a strategy, it is desirable that the direction and size of the motion take different values. From the above discussion, we examined a depth perception model based on fixational eye movement and gradient method. Fixational eye movements are divided into three types: microsaccades, drifts, and tremors. As the first report of our attempt, we focused on tremor, the smallest of the three types. In the next step, we plan to use drift and microsaccade analogies for further progress. Using a lot of images captured with random small motions of camera, which consists of three-dimensional (3-D) rotations imitating fixational eyeball motions [1], many observations can be used at each pixel, i.e. many gradient equations can be used to recover the each depth value corresponding to the each pixel. Since the difference between the center of the three-dimensional rotation and the lens center generates a translational motion of the lens center, depth information can be obtained from these images. Simulations with artificial images confirm that the proposed method works effectively when the observed noise is an actual sample of a theoretically defined noise model.

However, if the wavelength of the main luminance pattern is small compared to the size of the motion in the image, aliasing will occur and the gradient equation will be useless. In other words, the methods of [13, 14] cannot be applied. To avoid the above problem, we proposed a new scheme based on the integral form that also used the analogy of fixational eye movement [15, 16]. Add up the many images generated by the above method to get one blurry image. The degree of blur is a function of pixel position and also depends on the depth value of each pixel. That is, the difference in the degree of image blur indicates the depth information. Based on the proposed scheme, the spatial distribution of the image blur is effectively estimated using the blurred image and the original image without blur. By modeling the small 3-D rotation of the camera as a Gaussian random variable, the depth map can be calculated analytically from this blur distribution.

Several depth recovery methods using motion-blur have been already proposed, but those use the blur caused by definite and simple camera motions. For example, blur by a translational camera motion is used in [17], and blur by an unconstrained camera motion composed of translation and rotation is assumed in [18]. The depth recovery performance of these methods depends on the orientation of the texture in the image. That is, if the texture has a strip pattern whose direction is parallel to the direction of motion in the image, there is less blurring and accurate depth recovery is difficult. Unlike such a specific camera motion, the random camera rotation used in this study works well for any texture. Deterministic camera motion can be used just to solve this problem, but since it does not require precise control of the camera, it is easy to implement random camera rotation in a real system.

We proposed two algorithms based on the integral scheme. The first algorithm detects a point spread function (PSF) that represents image blur and then analyzes it to restore depth [15]. The second algorithm directly calculates the depth without detecting the PSF [16]. These algorithms use a motion-blurred image and a reference un-blurred image. It is expected that the performance of the proposed scheme depends on the degree of motion blur. For the same PSF, i.e. the fixed deviation of the random camera rotations, fine texture is advantageous for observing the accurate blur. This characteristic is the opposite of the method based on the differential scheme based on the gradient equation.

The features of our methods described above can be summarized as follows.

In a camera motion model that simulates eyeball rotation, the translation of the lens center is secondarily generated by eye rotation, reducing the number of unknown parameters. This camera motion model also allows you to recover depth as an absolute quantity instead of a relative quantity.

The movement of the camera is subtle because it simulates the tremor component of the eyeball. Therefore, by using a large number of image pairs, it is possible to improve the accuracy of depth recovery while avoiding occlusion.

In order to use multiple image pairs at the same time, we have adopted a direct framework that explicitly uses the inverse depth, which is a common parameter for them.

In the following, we will first explain the camera model and the imaging system, and then explain the differential scheme and the integral scheme, and the algorithms based on each. Due to the limitation of the number of pages, the integer type explains only the direct method. The function and characteristics of each algorithm are shown as the result of computer simulation with an emphasis on quantitative comparison with the true value. Finally, one of the algorithms in the differential scheme is applied to the real image, and the result is explained.

## 2. Camera motions imitating tremor and projection model

In this research, we use a perspective projection system as a camera imaging model. The camera is fixed in the

A brief description of the camera motion model that mimics the tremor component of fixational eye movements proposed in the previous study [12]. According to the analogy of the human eyeball, the center of camera rotation is set behind the lens center by

In general, the translational motion of a camera lens is essential to recover depth, and our camera motion model can implicitly achieve that translation simply by rotating the camera. This facilitates camera control. In addition, the system can recover absolute depth by pre-calibrating

From Eq. (1), it can be known that

In the above equations,

In this study, we treat

In this study,

## 3. Differential-type method

### 3.1 Gradient equation for rigid motion

The general gradient equation is the first approximation of the assumption that image brightness is invariable before and after the relative 3-D motion between a camera and an object. Assuming that the brightness values before and after 3-D motion are equal, the image brightness after 3-D motion are expressed by Taylor expansion, and terms of degree 2 and above are ignored. As a result, at each pixel

By substituting Eqs. (2) and (3) into Eq. (5), the gradient equation representing a rigid motion constraint can be derived explicitly.

### 3.2 Probabilistic model for differential-type method

Let * d* for each pixel when dealing with drift and microsaccade in future research.

In this study, we assume that optical flow is very small, and hence, observation errors of

where

As mentioned above, considering the neighborhood correlation of * d* to be recovered, in this study, to simplify modeling, we use the following equation as the depth model.

where

### 3.3 Algorithm for differential-type method

By applying the MAP-EM algorithm [19], parameter

In the EM scheme,

Based on the densities defined at 3.2, the objective function is derived as

using the following definitions derived by formulating the posterior density

In the M step,

From this representation,

For

where

and

### 3.4 Numerical experiments of differential-type method

In order to confirm the function of the proposed method, we conducted numerical experiments using artificial images. Figure 2(a) shows the original image generated using the depth map shown in Figure 2(b). The image size is

In our model, the successive image pairs are used in turn to calculate

We executed the proposed algorithm using . However, over-applying this constraint will increase the recovery error because the recovered depth map will be too smooth. Note that the scales of the

0.6983 | 0.3948 | 0.2097 | 0.0945 | 0.3699 | |

0.4741 | 0.3079 | 0.1841 | 0.0769 | 0.2124 |

## 4. Integral-type method

### 4.1 Image motion blurring related to depth

From Eqs. (2) and (3), and the probabilistic characteristics of

This equation can be seen as a function of the inverse depth

There are some schemes to obtain the variance–covariance matrix of optical flow defined by Eq. (19) locally at each image position from multiple images observed through random camera rotations imitating tremor. The simplest and most natural way is to first detect the optical flow from the image and then calculate its quadratic statistic. However, here we are considering a case where the intensity pattern is fine with respect to the temporal sampling rate and it is difficult to accurately detect the optical flow. Therefore, we adopt an integral-formed scheme in which the variance–covariance matrix is calculated as the distribution of local image blur.

We define an averaged image brightness

where

As explained above, we model

### 4.2 Direct algorithms for integral-type method

In the two-step algorithm, after detecting the variance–covariance matrix of the image blur shown in Eq. (19), the maximum likelihood estimation of

We assume that a depth value in a local region

We can recover the depth individually for each pixel by minimizing this function with respect to the depth corresponding to each pixel. Therefore, simultaneous multivariate optimization is not required and one-dimensional numerical search can be adopted.

By assuming a spatially smooth depth map in the solution, we can define the following objective function based on regularization theory of Poggio et al. [21].

where

For discrete computation, we can approximate the smoothness constraint in Eq. (23) using

Using Eq. (25) and the discrete representation of Eq. (24), we can minimize Eq. (23) by the following iterative formulation with an iteration number of

#### 4.2.1 Numerical experiments of integral-type method

The proposed algorithm assumes that the definition of the motion blur image in Eq. (20) holds. To observe such ideal motion blur, it takes enough exposure time for imaging. Here, we use artificial images to examine the characteristics of the proposed algorithm with respect to the relationship between the size of image motion and the fineness of intensity texture.

We artificially generate motion blur images. First, a true depth map is set up, and a large number of images are generated by a computer graphics technique that randomly samples

Local optimization algorithms (LOA) are computationally expensive, and global optimization algorithms (GOA) are slow to converge. Therefore, we considered a hybrid algorithm that uses LOA sparsely to obtain the initial value of GOA. For the initial value of LOA, use the plane corresponding to the background of Figure 4(b). To make a rough estimate, LOA uses blocks of

The depth restoration simulation was executed while changing the camera rotation size

## 5. Real image experiments of differential-type method

### 5.1 Selective use of image pairs to improve accuracy

When applying the difference-type method to an actual image and checking the actual performance, the performance was improved by selecting the image pair used for depth restoration. We have adopted a scheme that excludes image pairs that are expected to have large approximation errors in the gradient equation on a pixel-by-pixel basis. We can use the inner product of the spatial gradient vectors of consecutive image pairs to select image pairs that do not cause aliasing problems. For each pixel, the image pairs of which the sign of the inner product

In the next step, from the image pairs remained by the above decision, we additively select the suitable image pairs at each pixel by estimating the amount of the higher order terms included in the observation of

After discarding a bad image pair, the higher-order terms can be considered small. In this case, the quadratic term in Eq. (28) can be estimated for each pixel

We can define a measure for estimating the equation error as the ratio of this higher order term to the first order term.

This measurement depends on the direction of the optical flow but is invariant with respect to the amplitude of the optical flow. To calculate the value of

### 5.2 Camera system implementation

We built the camera hardware system for examining the practical performance of our camera model shown in Figure 1. The implemented camera system is shown in Figure 9.

The camera system can be rotated around the horizontal axis i.e.

### 5.3 Experimental results

In this section, we explain the results of the experiments using the real images captured by the developed camera system [22]. Our camera system has a parallel stereo function. That is, the camera can be moved laterally by the slide system. Prior to the experiment, we calibrated the camera’s internal parameters, including focal length and

Figure 11 shows the result of the recovered depth for each threshold set as a constant multiple of the reference value. We also looked at the results using all image pairs. From these results, it can be confirmed that by reducing the magnification, inappropriate image pairs can be discarded and the accuracy of depth recovery is improved. The percentage shown in the caption of the figure shows the number of image pairs used for recovery, which is determined in conjunction with the change in threshold.

## 6. Conclusions

In this chapter, we introduced a depth recovery algorithms that uses large number of images with small movements by using camera motion that simulates fixational eye movements, especially the tremor component. The algorithms can be divided into a differential-type and an integral-type. For the differential-type, it is desirable that the movement on the image is relatively small with respect to the texture pattern of the surface to be imaged, and conversely, for the integral-type, it is appropriate to apply it to a fine texture compared to the movement on the image. Therefore, ideally, the development of a depth recovery system in which both schemes function adaptively and selectively according to the target texture is the most important task in the future.

A detailed technical issue is to automatically determine the parameters that control the smoothness of the depth. This can be achieved by considering all unknowns as stochastic variables and formulating them in the variational Bayesian framework. As for the integration method, since the resolution of the recovered depth is low in principle, it is possible to consider a composite type in which the differential-type is applied again and refinement is performed on the result obtained by the integral-type.

So far, we have considered a method that assumes only tremor, but in the future, we are planning to study camera motion that also simulates drift and microsaccade. In the method for drift component, it is necessary to extend the method based on tremor to the online version, and then update the depth estimate while advancing the tracking of the target as time series processing. When using microsaccades, it is necessary to handle large movements between frames. Therefore, based on the correspondence of feature points, sparse but highly accurate depth restoration can be expected. Drift itself does not have much merit in its use, but it plays an important role in generating microsaccades. As described above, we believe that an interesting system can be realized by comprehensively using the three components.

On the other hand, stereoscopic vision and motion stereoscopic vision are difficult to handle objects with few textures. In [24], we proposed a stereo system that considers shading information. The projected images to both cameras are calculated by computer graphics technique while changing the depth estimation value. The depth is determined so that the generated image matches the image observed by each camera. As a result, the association between images is indirectly realized. By introducing this method, it becomes possible to handle textureless objects. We aim to develop a comprehensive depth restoration method, including the multi-resolution processing proposed in [12]. In another scheme that deals with the textureless region in stereo vision, the region where the depth value is constant or changes smoothly, called the support region, is adaptively determined [25]. We will also consider whether the relationship between image changes due to tremor and microsaccade can be used for adaptive determination of this support region.

In recent years, many realizations of stereoscopic vision and motion stereoscopic vision by deep learning have been reported [26, 27, 28]. And the relationship with the conventional method based on mathematical formulas is often questioned. The deep learning method is hampered by the addition of a large number of images and annotations to them. Although unsupervised learning is often devised, the solution is often limited. Therefore, even if the conventional method is rather complicated and takes time, if a method capable of more precise depth recovery is constructed, it can be used for annotation calculation of deep learning. This can be understood as copying the conventional method to deep neural network (DNN). DNN takes time to learn, but has the advantage of being able to infer at high speed. In this way, it is important that both schemes develop in a two-sided relationship.

Here, the method of calibrating the axis of rotation is explained using Figure 12. Let a point in 3-D space be

The translation

where

Furthermore, by substituting the relation of

By expressing this equation in terms of components and organizing it, the following two equations are derived.

By substituting Eq. (35) into Eq. (36), the solution of

## References

- 1.
Martinez-Conde S, Macknik S L, and Hubel D. The role of fixational eye movements in visual perception. Nature Reviews. 2004;5:229–240 - 2.
Stemmler M. A single spike suffices: the simplest form of stochastic resonance in model neuron. Network: Computations in Neural Systems. 1996;61:687–716 - 3.
Propokopowicz P and Cooper P. The dynamic retina. Int. J. Computer Vision. 1995;16:191–204 - 4.
Hongler M-O, de Meneses Y L, Beyeler A, and Jacot J. The resonant retina: Exploiting vibration noise to optimally detect edges in an image. IEEE Trans. Pattern Anal. Machine Intell. 2003;25:1051–1062 - 5.
Ando S, Ono N, and Kimachi A. Involuntary eye-movement vision based on three-phase correlation image sensor. In: Proceedings on19th Sensor Symposium; 2002. p. 83–86 - 6.
Lazaros N, Sirakoulis G-C, and Gasteratos A. Review of stereo vision algorithm: from software to hardware. Int. J. Optomechatronics. 2008;5:435–462 - 7.
Wang J and Zickler T. Local detection of stereo occlusion boundaries. In: Proceedings on CVPR; 2019. p. 3818–3827 - 8.
Liu F, Zhou S, Wang Y, Hou G, Sun Z, and Tan T. Binocular light-field: imaging theory and occlusion-robust depth perception application. IEEE Trans. Image Process. 2019;29: 1628–1640 - 9.
Horn B K P and Schunk B. Determining optical flow. Artif. Intell. 1981;17:185–203 - 10.
Simoncelli E P. Bayesian multi-scale differential optical flow. In: Jahne B, Haubecker H, Geibier P, editors. Handbook of Computer Vision and Applications.Academic Press; 1999. vol. 2. p. 397–422 - 11.
Bruhn A and Weickert J. Locas/kanade meets horn/schunk: combining local and global optic flow methods. Int. J. Computer Vision. 2005;61:211–231 - 12.
Tagawa N, Kawaguchi J, Naganuma S, and Okubo K. Direct 3-d shape recovery from image sequence based on multi-scale bayesian network. In: Proceedings on ICPR; 2008. p. CD–ROM - 13.
Tagawa N. Depth Perception model based on fixational eye movements using Bayesian statistical inference. In: Proceedings on ICPR; 2010. p. 1662–1665 - 14.
Tagawa N and Alexandrova T. Computational model of depth perception based on fixational eye movements. In: Proceedings on VISAPP; 2010, p. 328–333 - 15.
Tagawa N, Iida Y, and Okubo K. Depth perception model exploiting blurring caused by random small camera motions. In: Proceedings on VISAPP; 2012, p. 329–334 - 16.
Tagawa N, Koizumi S, and Okubo K. Direct depth recovery from motion blur caused by random camera rotations imitating fixational eye movements. In: Proceedings on VISAPP; 2013, p. 177–186 - 17.
Sorel M and Flusser J. Space-variant restoration of images degraded by camera motion blur. IEEE Trans. Image Processing. 2008;17:105–116 - 18.
Paramanand C and Rajagopalan A N. Depth from motion and optical blur with unscented Kalman filter. IEEE Trans. Image Processing. 2012;21:2798–2811 - 19.
Dempster A-P, Laird N-M, and Rubin D-B. Maximum likelihood from incomplete data. J. Roy. Statist. Soc. B. 1977;39:1–38 - 20.
Green P.-J. On use of the Em algorithm for penalized likelihood estimation. J. Roy. Statist. Soc. B. 1990;52:443–452 - 21.
Poggio T, Torre V, and Koch C. Computational vision and regularization theory. Nature. 1985;317:314–319 - 22.
Tsukada S, Ho Y, Tagawa N, and Okubo K. Accuracy improvement for depth from small irregular camera motions and its performance evaluation. In: Proceedings on ICIAR; 2015, p. 306–315 - 23.
Zhang Z. A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Machine Intell. 2000;22:1330–1334 - 24.
Wakabayashi K, Tagaw N, and Okubo K. Shape from multi-view images based on image generation consistency. In: Proceedings on VISAPP; 2013, p. 334–340 - 25.
Wu W, Zhu H, Yu S, and Shi J. Stereo matching with fusing adaptive support weights. IEEE Access. 2019;7:61960–61974 - 26.
Laina I, Rupprecht C, Belagiannis V, Tombari F, and Navab N. Deeper depth prediction with fully convolutional residual networks. In: Proceedings on 3D Vision;2016, p. 239–248 - 27.
Ummenhofer B, Zhau H, Uhrig J, Mayer N, Ilg E, Dosovitskiy A, and Brox T. DeMoN: Depth and motion network for learning monocular stereo. In: Proceedings on CVPR;2017, p. 5038–5047 - 28.
Chen P-Y, Liu A H, Liu Y-C, Wang Y-C F. Towards scene understanding: unsupervised monocular depth estimation with semantic-aware representation. In: Proceedings on CVPR;2019, p. 2624–2632