Refinement of Visual Hulls for Human Performance Capture



Introduction
Generation of dynamic three-dimensional (3D) mesh sequences of human performance using multiple cameras has been actively investigated in recent years (de Aguiar et al., 2008; Hisatomi et al., 2008; Kanade et al., 1997; Kim et al., 2007; Matsuyama et al., 2004; Nobuhara & Matsuyama, 2003; Snow et al., 2000; Starck & Hilton, 2007; Tomiyama et al., 2004; Toyoura et al., 2007; Tung et al., 2008; Vlasic et al., 2008). The topic is drawing a lot of attention because conventional 3D shape measurement tools, such as laser scanners, shape (structure)-from-motion (Huang & Netravali, 1994; Poelman & Kanade, 1997), shape-from-shading (Zhang et al., 1999), etc., are difficult to apply to dynamic scenes. On the other hand, depth cameras, such as time-of-flight (Foix et al., 2011) and structured light (Fofi et al., 2004) cameras, can measure depth only from their own viewpoint and do not capture the entire 3D shape of objects. There are many attractive applications of 3D human performance capture, such as movies, education, computer aided design (CAD), heritage documentation, broadcasting, surveillance, gaming, etc.
Shape-from-silhouette (or volume intersection) (Laurentini, 1994) is a fundamental process for generating the convex hulls of 3D objects. Because the shape-from-silhouette algorithm is directly affected by the foreground/background segmentation, a well-controlled monotone background is often employed (de Aguiar et al., 2008; Kim et al., 2007; Starck & Hilton, 2007; Tomiyama et al., 2004; Toyoura et al., 2007; Vlasic et al., 2008). However, proper segmentation has remained a serious problem even in such studios. Therefore, a number of approaches have been proposed for refining the geometrical data of the objects in both the spatial and temporal domains. This chapter reviews recent works on the refinement of visual hulls and describes our contribution, which features iterative refinement of foreground/background segmentation and visual hull generation. The rest of this chapter is organized as follows. Section 2 reviews related works on robust 3D model reconstruction. Section 3 describes our 3D studio and our proposed algorithm. Experimental results are presented in Section 4. Finally, concluding remarks are given in Section 5.
Spatial domain approaches
Tomiyama (Tomiyama et al., 2004) and Starck (Starck & Hilton, 2007) employed stereo matching to calculate a more detailed shape of the object. The depth search range was restricted by the visual hull model under the assumption that the actual surface point must lie on or inside the visual hull, according to the theory of space carving (Kutulakos & Seitz, 2000). This constraint reduced the computational cost and, at the same time, the depth estimation error due to mismatching. A similar idea can also be found in (Fua & Leclerc, 1995), but that work was aimed at 2.5D (multiview + depth) model reconstruction. The graph cuts algorithm has also been employed after shape-from-silhouette to refine the concave parts of objects (Hisatomi et al., 2008; Liu et al., 2006; Tung et al., 2008). In (Hisatomi et al., 2008), a constraint term imposed by silhouette edges was introduced to preserve thin parts. Tung (Tung et al., 2008) combined the super-resolution and dynamic 3D shape reconstruction problems into a single Markov random field (MRF) energy formulation and optimized the cost function by graph cuts. These approaches only remove unnecessary voxels; the loss of voxels caused by erroneous silhouette extraction cannot be recovered. Therefore, these algorithms should be applied after shape-from-silhouette processing with perfect foreground/background segmentation, so that only surplus voxels are removed.
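The visual-hull constraint on stereo matching can be sketched as follows. This is a minimal illustration, not the implementation in (Tomiyama et al., 2004) or (Starck & Hilton, 2007): the cost values and the depth interval are hypothetical, and in a real system `costs` would be per-pixel matching costs and `hull_interval` would come from intersecting the viewing ray with the visual hull.

```python
def constrained_depth_search(costs, hull_interval):
    # Restrict the stereo depth search to [d_near, d_far], the interval
    # in which the visual hull says the surface can lie. Depths outside
    # the hull are never considered, which saves computation and avoids
    # matching to a spuriously low-cost depth outside the object.
    d_near, d_far = hull_interval
    return min(range(d_near, d_far + 1), key=lambda d: costs[d])

# Toy cost profile: the unconstrained minimum (index 0) is a mismatch
# outside the hull; the constrained search picks index 2 instead.
costs = [0.1, 0.9, 0.3, 1.2, 0.8]
best = constrained_depth_search(costs, (1, 3))  # -> 2
```

Note how the global minimum of `costs` (depth index 0) is rejected because it falls outside the hull interval, which is exactly the mismatch-suppression effect described above.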

Other approaches
An alternative to shape-from-silhouette is applying graph cuts directly in the 3D space (Snow et al., 2000). The difference from (Hisatomi et al., 2008; Liu et al., 2006; Tung et al., 2008) is that this approach does not use the volume intersection. In (Snow et al., 2000), the data term was the sum of values attached to the voxels, where each value was based on the observed intensities of the pixels that intersect the voxel, and the smoothness term was defined as the number of empty voxels adjacent to filled ones. However, the accuracy of the modeling was not discussed in (Snow et al., 2000). As pointed out in (Hisatomi et al., 2008), combining shape-from-silhouette and the graph cuts algorithm yields better results for flat-color and repetitive-color-pattern regions. The probabilistic model of (Broadhurst et al., 2001) calculates the photo-consistency energy of two cases, i.e., whether the voxel exists or not. The probability of the existence of each voxel was calculated by Bayes' rule to choose which case is more likely. Similar stochastic approaches can also be found in (Bonet & Viola, 1999; Isidoro & Sclaroff, 2002).
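The Bayesian voxel-occupancy idea can be written compactly. The sketch below is only in the spirit of (Broadhurst et al., 2001); the likelihood values are assumed given (in practice they would come from a photo-consistency model over the pixel colors observing the voxel), and the uniform prior is our simplifying assumption.

```python
def occupancy_posterior(lik_occupied, lik_empty, prior=0.5):
    # Bayes' rule: P(occupied | colors) is proportional to
    # P(colors | occupied) * P(occupied). The two likelihoods compare
    # the photo-consistency of the "voxel exists" and "voxel empty"
    # hypotheses; the voxel is kept if the posterior favors occupancy.
    num = lik_occupied * prior
    return num / (num + lik_empty * (1.0 - prior))

# A voxel whose observed colors are four times more consistent with
# "occupied" than with "empty" gets posterior 0.8 under a flat prior.
p = occupancy_posterior(0.8, 0.2)  # -> 0.8
```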

Temporal domain approaches
In temporal domain approaches, 3D models are generated by deforming and refining reference 3D models across frames. Therefore, the manner of establishing correspondence between feature points in neighboring frames (models) is important for extracting the deformation and refinement parameters. Temporal domain refinement not only generates more accurate shapes of the 3D objects, but also keeps the geometry and topology of the generated 3D models coherent throughout the frames (i.e., the number of vertices and their connectivity are consistent). This also facilitates better-quality texture mapping, compression, and motion tracking and analysis of the generated 3D model sequences. Nobuhara (Nobuhara & Matsuyama, 2003) proposed a deformable mesh model taking into account five constraints: photo consistency, silhouette, smoothness, 3D motion flow, and inertia. First, intraframe deformation was conducted considering the first three constraints (this part constitutes spatial refinement), and then the 3D model in the previous frame was deformed to match the model in the present frame considering the last two constraints. In (Vlasic et al., 2008), a skeleton model was used to track the motion of the object, and the template model was deformed using linear blend skinning to meet the silhouette fitting constraint. The algorithm depended only on the silhouette; no color information was utilized. Feature-based tracking in the captured 2D images using scale-invariant feature transform (SIFT) features (Lowe, 2004) was proposed by de Aguiar (de Aguiar et al., 2008). The model was then deformed based on the extracted motion. Details were recovered by adjusting the vertices to the silhouette contours and by estimating depth using multiview stereo. In this work, the initial model was generated using a laser scanner. In (Luo et al., 2010), a modified annealed particle filtering was proposed to track the motion, and deformation and shape refinement were performed considering the silhouette of the human body.

Proposed work in this chapter
Most of the algorithms, for spatial refinement in particular, are designed only to eliminate unnecessary voxels, not to recover erroneously removed voxels (exceptions can be found, for example, in (Kim et al., 2007)). Therefore, the misclassification of a foreground object region as background in segmentation is a critical problem, not to mention that the excess voxels introduced by the dilation process used to work around this problem are difficult to remove even with the sophisticated algorithms listed above. We have therefore developed a 3D model generation algorithm with smaller numbers of lost and surplus voxels (Yamasaki et al., 2009), which can be categorized as a spatial domain approach. This algorithm works well even without a monotone background. Our algorithm is based on iterative feedback between silhouette extraction and 3D modeling; namely, the generated 3D models are rendered and used as seeds for the graph cuts algorithm (Boykov & Jolly, 2001; Rother et al., 2008) for better silhouette extraction. The improved silhouette images are then used to reconstruct the 3D models. This iterative process is repeated until the geometrical shape of the 3D models converges. As a result, both voxel loss and voxel surplus can be suppressed drastically compared with conventional algorithms. The difference from (Kim et al., 2007; Toyoura et al., 2007) is that the generated 3D models are improved iteratively, not by a single-shot correction. In addition, the computational cost is not very large because the number of required iterations is quite small, as discussed in Section 4. Whereas (Nobuhara et al., 2007) updates the silhouette images one by one sequentially, which is time consuming, the proposed method updates all the silhouette images in each iteration.

Studio setup
Our 3D modeling studio is illustrated in Fig. 1. The studio consisted of 12 capturing units, each comprising a camera with 1360 × 1024 resolution and a Camera Link interface, a light, and a personal computer (Intel Core2 Duo 2.4 GHz, 4 GB memory, RAID 0 HDD operating at 3 GB/s) attached to a pole. All the cameras were synchronized by an external signal generator. The frame rate was up to 34 fps. The system was built in our laboratory room (Fig. 1(b)). No special background such as a blue sheet was utilized. Only the computers were covered with cloths because they are shiny and affect the silhouette extraction. Camera calibration was done using Tsai's method (Tsai, 1987).
The system was easy to set up and portable. Disassembling and setting up the studio again can be achieved in a few hours. The size of the studio was about 6 m × 5 m, but these dimensions are flexible, depending on the size of the object and the area required for the object to move around.
Fig. 1. (a) Floor plan. (b) A view from a certain camera.

Flow of the algorithm
The flowchart of our 3D modeling algorithm is shown in Fig. 2. In the initial step, we conducted conventional silhouette extraction and 3D modeling. Then, we proceeded to the iterative processing between silhouette refinement using the rendered images and 3D model reconstruction with error compensation. When the generated 3D model converged, i.e., was not very different from that of the previous step, the iteration was terminated and the final 3D mesh was obtained.
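The feedback loop above can be summarized in a few lines. This is only a structural sketch: the callables `reconstruct`, `render`, `refine`, and `model_diff` are placeholders for the chapter's shape-from-silhouette, rendering, graph-cuts refinement, and model-comparison stages, and the toy usage below just demonstrates the control flow.

```python
def iterative_refinement(silhouettes, reconstruct, render, refine,
                         model_diff, eps, max_iter=10):
    # Initial model from the initial silhouettes.
    model = reconstruct(silhouettes)
    for _ in range(max_iter):
        # Render the current model into every camera view...
        rendered = [render(model, i) for i in range(len(silhouettes))]
        # ...refine each silhouette using its rendered counterpart...
        silhouettes = [refine(s, r) for s, r in zip(silhouettes, rendered)]
        # ...and rebuild the model; stop once it barely changes.
        new_model = reconstruct(silhouettes)
        converged = model_diff(new_model, model) < eps
        model = new_model
        if converged:
            break
    return model

# Toy stand-ins: "silhouettes" are numbers and the "model" is their mean,
# so the loop converges immediately; only the control flow is meaningful.
mean = lambda xs: sum(xs) / len(xs)
result = iterative_refinement(
    [0.0, 2.0],
    reconstruct=mean,
    render=lambda m, i: m,
    refine=lambda s, r: (s + r) / 2,
    model_diff=lambda a, b: abs(a - b),
    eps=1e-6)
```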
For higher-quality modeling, especially for reconstructing concave parts, sophisticated model refinement algorithms are required after shape-from-silhouette, such as deformable meshes (Matsuyama et al., 2004), stereo matching (Starck & Hilton, 2007; Tomiyama et al., 2004), and graph cuts in the 3D space (Hisatomi et al., 2008; Tung et al., 2008). However, such a model refinement process is out of the scope of this chapter. Our target is to generate shape-from-silhouette-based 3D mesh models with fewer lost voxels while suppressing surplus voxels, so that such refinement algorithms work better.

Shape-from-silhouette with error compensation
Shape-from-silhouette is a 3D modeling algorithm that works by taking the intersection of the visual cones of all the cameras surrounding the object, as shown in Fig. 3. In other words, a voxel remains only if it is seen as foreground by all the cameras; otherwise, it is removed. In this manner, the visual hull of the 3D object is estimated. Then, various refinement algorithms are applied for modeling concave parts or smoothing the model. One of the most significant disadvantages of this approach is that when a voxel is invisible from even a single camera due to erroneous silhouette extraction, it is eliminated. On the other hand, the probability of a nonobject voxel being visible to all the cameras is quite low because the voxel can be labeled as nonobject by the other cameras. Such loss of voxels degrades the visual quality of the model. An example is shown in Fig. 4. In this case, the left arm in camera #10 was missing because of erroneous silhouette extraction, and the error significantly affected the generated 3D model. Note that the refinement algorithms of (Hisatomi et al., 2008; Matsuyama et al., 2004; Tomiyama et al., 2004; Tung et al., 2008) cannot recover the loss of voxels shown in Fig. 4, because they are designed to eliminate unnecessary voxels, not to add necessary ones. Therefore, two kinds of error (loss) compensation algorithms are introduced in this chapter.
One such algorithm is the voting-based modeling method. Let n be the number of cameras in the studio and m be an integer ranging from 1 to n − 1. If a voxel is visible from at least n − m cameras, the voxel survives. Typically, m is set to 1 or 2 because the probability that a voxel belonging to the object is invisible from two or more cameras in the view range is quite low. Therefore, voxels that were deleted due to erroneous segmentation can be recovered. If we increase m, the generated 3D model expands more than necessary; namely, voxels that should be deleted remain in the visual hull.
If the error in silhouette extraction occurs in many camera views, we should reconsider the silhouette extraction algorithm itself. In this approach, one 3D model is generated for each frame, independent of the value of m.
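The voting rule can be sketched directly. This is a simplified illustration, not the chapter's implementation: the per-camera foreground classifications are assumed to be already projected onto the voxel grid, and the toy 1D "volume" below is ours.

```python
import numpy as np

def voting_carve(classifications, m=1):
    # classifications: (n_cameras, ...) boolean array, True where camera
    # i's silhouette marks the voxel as foreground after projection.
    # m = 0 is the strict volume intersection (visible from ALL cameras);
    # m >= 1 tolerates up to m erroneous "background" votes per voxel.
    n = classifications.shape[0]
    votes = classifications.sum(axis=0)  # foreground votes per voxel
    return votes >= (n - m)

# Toy example: 3 cameras, a 1D "volume" of 5 voxels; camera 2's
# silhouette erroneously drops voxel 2.
c = np.array([[1, 1, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0]], dtype=bool)
strict = voting_carve(c, m=0)  # voxel 2 is lost
tol    = voting_carve(c, m=1)  # voxel 2 is recovered
```

With m = 1 the voxel missed by a single erroneous silhouette survives, while the empty voxels (foreground in fewer than n − 1 views) are still carved away, which is the trade-off discussed in the text.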
The other approach is modeling with the other (n − 1) camera views. When generating the foreground/background seeds for the i-th camera view, the (n − 1) camera views excluding the i-th view are used for the modeling, and the generated 3D model is rendered from the i-th camera position only for improving the i-th silhouette. Therefore, we need to conduct the 3D modeling n times, once for each camera view. This approach implicitly assumes that the segmentation error does not occur in multiple views at the same time, which is reasonable in most cases. It is important to note that errors can occur in multiple parts of an image as long as the condition mentioned above holds. The restriction here is that a voxel is misclassified as a nonobject region by no more than a single camera. Modeling with (n − 2) or fewer camera views is not reasonable because the number of models to generate becomes quite large: n × (n − 1) for the case of n − 2.
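The leave-one-out scheme amounts to the following loop. This is a structural sketch only: `reconstruct` stands in for the shape-from-silhouette stage, and the toy usage with integers and `sum` is ours, chosen just to make the exclusion pattern visible.

```python
def leave_one_out_models(silhouettes, reconstruct):
    # Model i is built from every view EXCEPT view i; rendering model i
    # from camera i's position then gives a seed for refining silhouette
    # i that is not biased by silhouette i's own errors.
    n = len(silhouettes)
    return [reconstruct([s for j, s in enumerate(silhouettes) if j != i])
            for i in range(n)]

# With toy "silhouettes" [1, 2, 3] and reconstruct = sum, model i is the
# sum of the other two views.
models = leave_one_out_models([1, 2, 3], reconstruct=sum)  # -> [5, 4, 3]
```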
In the iteration process, 3D model reconstruction is conducted multiple times. In particular, the cost of the modeling-with-the-other-(n − 1)-camera-views approach becomes quite expensive as the number of cameras increases. To save computational cost, the 3D modeling in the iterations can be done at a rough spatial resolution, with only the final modeling carried out at a finer spatial resolution. Another option is to iterate the refinement process only once, because the modeling accuracy after a single iteration becomes sufficiently high, as demonstrated in Section 4.

Silhouette extraction and updating
In the initial silhouette extraction, conventional background subtraction combined with graph cuts was used. The background and foreground regions with high confidence were generated as follows:

  foreground seed if |Y(x, y) − Y_BG(x, y)| > Th1,
  background seed if |Y(x, y) − Y_BG(x, y)| < Th2,
  unknown otherwise.

Here, Y(x, y) is the chroma value of the pixel at (x, y) and Y_BG(x, y) is that of the background model. Th1 and Th2 are predefined threshold values, where Th1 > Th2, chosen to extract background and foreground regions with high confidence. When |Y(x, y) − Y_BG(x, y)| lies between Th2 and Th1, the pixel is left as unknown. Then, the background/foreground maps are fed to the graph cuts algorithm as seeds. The silhouette extraction results are shown in Fig. 4(a).
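The thresholding rule above maps directly to a trimap computation. This is a minimal sketch under our own conventions (label codes 0/1/2 and the array names are ours, not the chapter's):

```python
import numpy as np

def seed_trimap(Y, Y_bg, th1, th2):
    # 0 = confident background, 1 = confident foreground, 2 = unknown.
    # Requires th1 > th2, as stated in the text.
    assert th1 > th2
    d = np.abs(Y.astype(np.float64) - Y_bg.astype(np.float64))
    trimap = np.full(d.shape, 2, dtype=np.uint8)  # default: unknown
    trimap[d > th1] = 1                           # confident foreground
    trimap[d < th2] = 0                           # confident background
    return trimap

# Three pixels against a flat background model: small, medium, and
# large deviations become background, unknown, and foreground seeds.
Y = np.array([12.0, 40.0, 150.0])
Y_bg = np.array([10.0, 10.0, 10.0])
tri = seed_trimap(Y, Y_bg, th1=50, th2=10)  # -> [0, 2, 1]
```

The `unknown` pixels are exactly those the graph cuts step is left to decide, which is why the two thresholds are set conservatively.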
In the iteration process, we assume that the erroneous loss of voxels is compensated by either of the methods described in Section 3.2. The silhouette refinement for each camera view was conducted using three images: the original captured image (Fig. 5(a)), the silhouette image from the previous step (Fig. 5(b)), and the 3D model rendered from the camera position (Fig. 5(c)). The background seed was generated by the logical AND operation between the background regions in the previous silhouette image (Fig. 5(b)) and the rendered image (Fig. 5(c)). The similar-color region (Fig. 5(d)) between the original captured image (Fig. 5(a)) and the rendered image (Fig. 5(c)), together with the eroded silhouette image from the previous step (Fig. 5(e)), was logically summed to form the foreground seed. As a result, the seeds for the background and the foreground for the graph cuts in the next step were generated, as demonstrated in Fig. 5(f). In the figure, the gray, black, and white regions represent the background, foreground, and unknown regions, respectively. The updated silhouette is shown in Fig. 5(g). This procedure was applied to each camera view independently. The updated silhouette images were then utilized again for the 3D modeling. An example of the updated 3D model after a single feedback loop is shown in Fig. 5(h).
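The seed update is a pair of boolean mask operations. The sketch below assumes all four inputs are already available as foreground masks over the same image grid (the array names are ours); the toy three-pixel example only illustrates the logic.

```python
import numpy as np

def update_seeds(prev_fg, rendered_fg, similar_color, eroded_prev_fg):
    # Background seed: background in BOTH the previous silhouette and
    # the rendered model image (logical AND of the background regions).
    background_seed = ~prev_fg & ~rendered_fg
    # Foreground seed: similar-color region OR the eroded previous
    # silhouette (logical sum).
    foreground_seed = similar_color | eroded_prev_fg
    return background_seed, foreground_seed

prev_fg     = np.array([True,  False, False])
rendered_fg = np.array([False, False, True])
similar     = np.array([True,  False, False])
eroded      = np.array([False, True,  False])
bg, fg = update_seeds(prev_fg, rendered_fg, similar, eroded)
# bg -> [False, True, False], fg -> [True, True, False]
```

Pixels in neither seed remain unknown and are resolved by the graph cuts step, mirroring the gray/black/white regions of Fig. 5(f).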

Experimental setup
The experiments were conducted using the 3D studio with 12 cameras described in Section 3. Consecutive 5–10 frames of video (12 cameras × 5–10 frames = 60–120 images) were recorded for five people in different clothes and poses. The ground-truth data of the silhouettes were generated by hand. Then, ground-truth 3D model sequences were generated by the shape-from-silhouette algorithm. Our shape-from-silhouette program was based on (Tomiyama et al., 2004) (courtesy of Tomiyama and colleagues). The stereo matching in (Tomiyama et al., 2004) was disabled in the experiments to investigate the effect of the iterative silhouette updating only. The accuracy of the model was calculated by comparing the voxels.
The voxels in the generated model that did not exist in the ground-truth model were regarded as surplus voxels.On the other hand, voxels in the ground truth that were not observed in the generated model were regarded as lost voxels.
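The two error measures can be computed as set differences over boolean volumes. This is a sketch under one assumption we make explicit: the chapter reports percentages without stating the denominator, and here both rates are taken relative to the ground-truth voxel count.

```python
import numpy as np

def voxel_errors(generated, ground_truth):
    # Loss: ground-truth voxels missing from the generated model.
    # Surplus: generated voxels absent from the ground truth.
    # Both are normalized by the ground-truth voxel count (our choice
    # of denominator; the chapter does not specify it).
    gen = np.asarray(generated, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    total = gt.sum()
    loss = (gt & ~gen).sum() / total
    surplus = (gen & ~gt).sum() / total
    return loss, surplus

# Toy 4-voxel volume: one ground-truth voxel is missing (loss) and one
# extra voxel appears (surplus).
loss, surplus = voxel_errors([1, 1, 0, 1], [1, 1, 1, 0])
```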

Evaluation of the five different models
Fig. 6 shows the 3D models generated using only the initial silhouettes, those generated using the voting-based modeling method, and the ground-truth models. In model A in Fig. 6(a), for instance, it is observed that the lost voxels at the back of the head and the missing right hand were compensated correctly. On the other hand, there were still some lost voxels at the right leg in model E. In this case, the color of the trousers was very close to that of the carpet, and the assumption made in Section 3.2 that "the probability of a voxel that belongs to the object being invisible from two or more cameras is quite low" no longer held. When the cameras look down on the objects, the same region of the floor is observed by multiple cameras. Therefore, the color of the floor should be made different from that of the performer's trousers; alternatively, a random pattern can be used on the floor, as in (Toyoura et al., 2007).
The average errors over the frames for the best (model A) and the worst (model B) cases are summarized in Tables 1 and 2; model E, for which the assumption does not hold, is excluded. The modeling performance reported by Toyoura et al. (Toyoura et al., 2007) is also shown in Table 3 for comparison; note that the experimental setup and the target models were very different from ours. In Toyoura's approach, the loss of voxels is reduced, but at the same time the surplus of voxels is increased, and the generated models are "fat" compared with the ground-truth model. In our approach, by contrast, both the loss and surplus of voxels were suppressed effectively.

Table 3. Results in (Toyoura et al., 2007): loss 2.7%, surplus 11%, total error 14%.

When the voting-based modeling method without iteration using n − 1 camera views was employed, the loss of voxels was quite small. However, the generated models contained many surplus voxels, resulting in a larger total error than the modeling using the initial silhouettes. The region where a major loss of voxels occurred (0.18%) was the region that did not satisfy the assumption that the probability of a voxel belonging to the object being invisible from two or more cameras is quite low. In other words, our assumption was valid for 99.8% of the region. It can be observed that the proposed approaches generate better 3D models than a simple volume intersection method in terms of both loss and surplus of voxels. Namely, the iterative processing between silhouette extraction and 3D modeling can reduce voxel loss while suppressing voxel surplus. Among the lost voxels in the initial model (2.1%), 90% of them (1.9% of the whole model) were invisible from only a single camera, and the loss was reduced to 0.73% in the voting-based method using n − 1 cameras and to 0.90% with the other-camera-views method. In addition, we can see that the voting-based method was good at reducing voxel loss, whereas modeling with the other camera views performed well in reducing voxel surplus. The modeling errors under the looser assumption that the probability of a voxel belonging to the object being invisible from three (not two) or more cameras is low are also shown in Table 4 (see the voting-based methods using n − 2 cameras with/without iteration). In the voting-based method without iteration, the loss of voxels was as low as 0.007%, almost negligible. On the other hand, the surplus of voxels increased up to 47%. When the voting-based method with iteration using n − 2 cameras was employed, voxel loss was the minimum among the proposed methods. However, the surplus of voxels tended to be somewhat larger than in the other approaches and was almost the same as the initial model in some frames (not shown).
The optimal number of cameras to use in the iteration should be decided considering the total number of cameras, the shape refinement process in the following stage, the required error rate, etc. Fig. 7 shows the modeling accuracy for model A. The error was almost constant throughout the frames, independent of the poses of the performer. In all the frames, the shape of the model converged at the second iteration (the difference between the models in the first and second iterations was smaller than ε). To investigate how the errors change in the iteration process, the errors for model A averaged over the 10 frames are shown in Fig. 8 as a function of the number of iterations. In this experiment, the termination decision was disabled. Iteration zero stands for the initial model. Regardless of whether the algorithm was the voting-based method or modeling with the other cameras, the generated 3D model converged quickly, and the errors did not improve much after the first iteration. Therefore, modeling with only a single feedback loop is sufficient in most cases.
The mean processing time for the voting-based method using n − 1 cameras was 35 s, and that for modeling with the other cameras was 45 s, using an Intel Core2 Duo 2.4 GHz with 2.5 GB memory. On the other hand, the simple volume intersection took 2.5 s.

Conclusions
In this chapter, we have reviewed visual hull refinement algorithms and presented an iterative refinement algorithm. Through the cross-feedback between 3D model reconstruction with the updated silhouettes and silhouette extraction using the rendered images, both the loss and surplus of voxels can be kept very small. We have also proposed two shape-from-silhouette algorithms with error compensation to recover from missed background/foreground segmentation. Experimental results demonstrated that the loss of voxels was reduced from 2.1% to 0.73–0.90% and the surplus of voxels was reduced from 9.4% to 0.99–1.2%, respectively. Achieving as small a loss of voxels as possible is important because the surplus of voxels can be eliminated by further postprocessing, whereas it is very difficult to recover erroneously eliminated voxels.
Fig. 7. Modeling accuracy for model A: (a) surplus of voxels, (b) loss of voxels, (c) total error, comparing modeling using the initial silhouettes, voting-based modeling without iteration (n − 1), voting-based modeling with iteration (n − 1), and modeling with the other (n − 1) cameras.

Fig. 8. Model refinement effects as a function of the number of iterations.

Table 4. Averaged modeling accuracy for model A over the 10 frames.