The advent of digital video has caused a technological revolution that has changed audiovisual communication in several ways. The digital format is inherently suited to computational processing and, as a consequence, it has had a huge impact on the cinema and television industries. Nowadays, with the advances in Internet and wireless networking, digital video has been consolidated as a new and important medium. For example, the Skype application relies on this kind of media to allow partners who are far apart to communicate with each other.
The current generation of digital video brings revolutionary aspects, such as the incorporation of new data types in the media. Depth information is certainly a natural one, inserted in digital videos in order to provide more realism. That is, the insertion of depth agrees with the human perceptual system and also makes scene analysis by computers easier, mainly if the goal is to extract high-level information. In this way, three-dimensional video (or simply 3D video) comes up, used either to reproduce moving images with the sensation of a third dimension or to visualize a dynamic scene from viewpoints other than the one from which it was filmed. 3D videos that allow scene visualization from new viewpoints can be constructed using an image-based or a model-based approach. This type of 3D video is known as free-viewpoint video (FVV), and 3D videos providing depth perception are called 3DV or stereoscopic videos.
So, in the scope of this text, the main characteristic of a 3D video is that it captures the dynamics and movement of the scene during filming and offers the user the possibility of changing the point of view during exhibition, besides supplying the three-dimensional model of the visualized objects. Automatic construction of photo-realistic three-dimensional models of a scene is important in applications such as interactive visualization of remotely located environments or objects. One could also provide a modified version of a real scene for virtual reality tasks. Other applications of 3D video are in Archeology, Oceanography, Historic and Cultural Sites, Arts, Education and Entertainment.
In general, an end-to-end 3D video system pipeline consists of the following stages: capture system setup, 3D reconstruction, 3D representation, coding, transmission, decoding, rendering and 3D display. They can be grouped into four main blocks: 3D Content Creation (capture and 3D reconstruction stages), 3D Representation, Delivery (coding, transmission and decoding stages) and Visualization (rendering and 3D display stages).
In this text, we provide an extensive literature review on the 3D Content Creation, 3D Representation and Visualization blocks of the 3D video pipeline. The Delivery block, which covers coding, transmission and decoding techniques, is not in the scope of this text. That block is mainly relevant to applications involving some network channel, such as Internet applications and 3D TV.
The chapter is organized as follows. Section ▭ explains the pipeline of 3D videos from capture to display. As part of the 3D Content Creation block, we discuss acquisition systems and 3D reconstruction techniques in Section ▭. Section ▭ presents the most popular 3D representation formats in the context of 3DV and FVV. The Visualization block, with its rendering and 3D display stages, is discussed in Section ▭. Finally, Section ▭ concludes the chapter.
2. Pipeline of 3D video systems
3D videos are now a huge success due to the release of the film Avatar in 2009. Besides their use in cinemas, applications that require some sort of 3D video transmission, such as Internet and 3D TV, are also receiving attention. 3D TV, for example, is a reality and the first commercial 3D channels are available.
For this sort of application, an end-to-end 3D video system is subdivided into four main blocks: 3D Content Creation, 3D Representation, Delivery and Visualization (see Fig. ▭).
The 3D Content Creation block (Fig. ▭) is responsible for providing the data used to create the 3D video. The process starts at the Capture stage (Subsec. ▭) with the choice of the equipment that will be used to capture the scene and process the data. Examples of devices for scene capture are 3D scanners, time-of-flight (ToF) sensors and digital cameras. Digital cameras are the most widely used for capturing dynamic scenes, sometimes combined with other sensors. Other necessary equipment includes computers, disks, grabber cards, etc. Projectors are also used in some systems to improve the quality of the captured data. The number of cameras in a setup varies and depends on the application as well as on its cost. For example, in the literature we can find systems with more than 50 cameras  and also systems composed of only one camera and one projector .
After the capture stage the data is sent to post-processing, where low-level algorithms are applied to correct the data and improve its accuracy. For example, algorithms for color correction, correction of lens and keystone distortion, camera calibration, feature extraction and tracking, and image rectification and alignment belong to this stage. For explanations of these algorithms, we refer the reader to any Computer Vision book, such as the one in .
The processed data is sent to the 3D Reconstruction stage. The 3D reconstruction problem refers to the recovery of scene geometry, i.e., the 3D coordinates of the objects that compose the scene. This stage is responsible for creating the data that will be used within the 3D video representation. Common techniques for geometry recovery are structure-from-stereo, shape-from-silhouette, structure-from-motion, shape-from-focus and defocus, as well as shape-from-shading. In Subsection ▭ we will discuss structure-from-stereo, structure-from-motion and shape-from-silhouette techniques in the context of 3DV and FVV. Structure-from-stereo methods are the most popular in the 3D video literature and have been investigated by the MPEG group for standardization. Another research line on 3D reconstruction fuses data obtained from digital cameras and ToF sensors , .
A review of dynamic scene capture can be found in .
At the 3D Representation stage (Section ▭) a format is chosen to store the data from the 3D Content Creation block. There is a variety of 3D representation schemes in the literature . The choice depends on the target application and the capture devices. The schemes can be classified as image-based (Subsec. ▭), geometry-based (Subsec. ▭) or based on depth maps (Subsec. ▭), the last of which combines image and geometry aspects . Geometry-based formats represent data as we know it from Computer Graphics. They offer full navigation of the scene or object, but they have realistic rendering issues due to errors in the reconstruction step. On the other hand, image-based formats avoid the explicit 3D reconstruction of the scene and provide a more realistic visualization. Depth-map formats are more suitable for 3DV and FVV coding and have been investigated for standardization by the MPEG group.
The Delivery block is responsible for 3D video coding, transmission and decoding. Usually, it is necessary in applications with some type of network, such as Internet and 3D TV. Moreover, coding and decoding of 3D videos are important for the development of storage media, e.g., Blu-ray discs. These topics are not in the scope of this text. We refer the reader interested in coding of 3D videos to the works in , , . Readers interested in transmission and storage of 3D videos are referred to references , . A discussion about technologies to deliver 3D content to mobile devices can be found in .
The last building block of a 3D video system is the most important to the end user, because it deals with Visualization of the 3D content. It comprises the Rendering stage and 3D Displays (Fig. ▭). The Rendering stage ▭ is responsible for employing algorithms to render the data stored in the representation format. The main focus is on view synthesis methods, which are necessary for free-viewpoint functionality and autostereoscopic displays. More than other stages, this one is in charge of providing a realistic view of 3D dynamic scenes. Of course, its performance depends on several factors, such as the accuracy of the reconstructed data and data loss during transmission. In a 3D TV scenario it also depends on the processing capability of the receiver.
3D Displays (Subsec. ▭) are responsible for the depth perception of stereoscopic videos. For free-viewpoint videos they also have to provide means of interaction with the visualized content. 3D display technologies have been in constant development since 3D media became more accessible to home users. Specialists in consumer electronics predict that in 2015 more than 30% of all high-definition panels at home will be equipped with 3D capabilities. Stereoscopic video technologies are mature and a huge success in cinemas, but there is room for improvement, especially regarding 3D displays. Stereoscopic displays are the most popular 3D displays in the market, but in order to provide depth perception they require the use of uncomfortable glasses. To overcome this limitation, research on autostereoscopic displays is under way. Autostereoscopic displays allow depth perception and FVV without eyewear. Other types of 3D displays are based on holography and integral imaging. We refer readers interested in advances in holography and integral imaging to references  and , respectively.
3. 3D content creation
There is a variety of technologies for digitally acquiring the geometry of a 3D object. The choice of the acquisition setup strongly depends on the application and, of course, on its cost. Digital cameras, 3D laser scanners and time-of-flight (ToF) sensors are the most popular devices for geometry and color acquisition.
An important laser scanner system has been presented in . It utilizes a laser triangulation scanner and a high-resolution color camera to scan the 5m tall David, a sculpture by Michelangelo. Structured light scanner setups are composed of a projector and one or more cameras , . In these systems a pattern is projected onto the object surface in order to improve the quality of the captured 3D object coordinates. In reference  the authors propose the scanning of 3D objects using a ToF camera. All the systems cited above capture 3D information of static scenes. Figure ▭ shows the simple acquisition setup utilized to capture the geometry of Parthenon sculptures .
Simple structured light scanner consisting of a digital camera, a projector and a tripod used in and, on the right, a sculpture model obtained after 14 scans.
For dynamic scenes the most used devices are digital cameras. Systems with one or two cameras can be found in the literature. For example, in reference  scene structure and motion are retrieved using a hand-held camera, and a real-time 3D system with a high-definition camera and a projector is presented in . However, most setups utilize several digital cameras, as in , for example. The concept of 3D video bricks was introduced in . One 3D video brick is composed of a projector, two black-and-white cameras and a high-definition color camera. The complete setup comprises multiple 3D video bricks.
Cameras can be arranged in a parallel or convergent setup (see Figure ▭). One of the pioneering projects in this area is presented in reference . The 3D Dome Studio uses 51 cameras mounted on a 5m diameter dome and applies stereo techniques to reconstruct the shape of a moving object. The same techniques have been used in a circular setup with more than 30 cameras to shoot a football game. All cameras are synchronized and point to the same target from different angles. The set of captured multi-view images is processed and a 3D model is reconstructed. In reference , 7 cameras were placed in a fixed convergent setup pointing to the center of the scene. The cameras are synchronized and calibrated. The main goal is to reconstruct and render human bodies from any viewpoint and to estimate their motion parameters. Another curved setup can be found in  where 12 cameras were placed on the ceiling, around the scene.
An example of a parallel setup can be seen in . It uses six consumer-quality FireWire video cameras aligned in two rows. The cameras were partitioned into stereo pairs, and every stereo pair is connected to one PC for stereo processing. The 3D system in  captures dynamic events with several cameras arranged in sequence and generates novel views with interpolation methods. Another example of a parallel camera arrangement can be seen in .
Example of convergent setup with 51 cameras proposed in (left) and a parallel one with 16 cameras in (right).
All the studio setups described above use controlled illumination to facilitate the reconstruction process. As a consequence, studio setups strongly restrict the type of scene that can be observed. In  the authors use auto-exposure and compensate for gain changes in order to capture outdoor scenes, which have a large variation in illumination. The setup is portable and can be carried in a backpack or mounted on a vehicle. It consists of a GPS, an inertial sensor and an omnidirectional camera containing six cameras (see Fig. ▭).
Recently a new system configuration has been investigated. These 3D systems employ sensor fusion, combining depth sensors and digital cameras , . Their main goal is to obtain more accurate depth maps by combining stereo methods with data acquired by depth sensors.
Commercial solutions for easy 3D acquisition are available. They are called stereo or 3D cameras. Figure ▭ shows the Bumblebee XB3 stereo camera from Point Grey Research and the full-HD professional 3D Panasonic AG-3DA1. Both cameras are available at Natalnet Laboratory. The Bumblebee XB3 has a 3-sensor multi-baseline design with variable resolutions and comes with software for stereo processing. The Panasonic AG-3DA1 has integrated twin lenses and records and processes synchronized left and right streams. The recorded channels are stored on memory cards in AVCHD format.
Bumblebee XB3 from Point Grey Research (left) and full-HD professional 3D Panasonic AG-3DA1 (right).
3.2. 3D reconstruction
After the images are captured and pre-processed, they are sent to the reconstruction stage. The 3D reconstruction problem refers to the recovery of scene geometry, i.e., the 3D coordinates of the objects that compose the scene. This stage is responsible for creating the data that will be used within the 3D video representation.
3D video systems in the literature differ in the reconstruction methods they employ. Examples of such methods are shape from focus, shape from shading, structure from motion, shape from silhouette and structure from stereo. We refer the reader to any Computer Vision book  for a broad discussion of existing reconstruction methods. However, structure-from-stereo techniques have been shown to be more suitable for 3DV and FVV .
Here we review some works on structure-from-stereo, structure-from-motion and shape-from-silhouette techniques within the context of 3D videos.
3.2.1. Structure from stereo
The most popular method of 3D reconstruction is stereo . It is based on the principle of stereo vision (or stereopsis), which mirrors the human visual system . Because of the position of our eyes, our brain receives two views of the same scene from two slightly different viewpoints at the same horizontal level. Our brain fuses these two images and measures the disparity in order to estimate depth . Computationally, the stereo process has three main steps: a particular location of the surface is selected in one image (feature extraction); the selected location is identified in the other image (the matching or correspondence problem); and the disparity between the two corresponding locations is computed (reconstruction) . The process used to obtain 3D point coordinates from a set of known corresponding image locations is called triangulation , . Overviews of the problem of recovering 3D structure from stereo can be found in the literature , .
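As a minimal illustration of the reconstruction step, the sketch below triangulates one matched point under the assumption of a rectified stereo pair, for which the depth follows from the disparity d, the focal length f (in pixels) and the baseline B as Z = f B / d. The function name and the parameters (focal_px, baseline_m, cx, cy) are illustrative assumptions and are not taken from any of the cited systems.

```python
def triangulate_rectified(x_left, y, disparity, focal_px, baseline_m, cx, cy):
    """Return (X, Y, Z) in metres for a pixel of the left image.

    Assumes a rectified pair: both cameras share the focal length
    focal_px (pixels) and principal point (cx, cy), and are separated
    horizontally by baseline_m. disparity = x_left - x_right, in pixels.
    """
    if disparity <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    z = focal_px * baseline_m / disparity      # depth along the optical axis
    x = (x_left - cx) * z / focal_px           # back-project the pixel column
    y3d = (y - cy) * z / focal_px              # back-project the pixel row
    return x, y3d, z
```

For a general, non-rectified camera pair, triangulation instead intersects the two viewing rays, typically in a least-squares sense.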
Over the years many efforts have been made by academics to compute stereo efficiently for static and dynamic events. The stereo literature is very extensive. In  an important work surveying and evaluating binocular stereo algorithms has been presented. The authors categorize dense binocular stereo algorithms according to four steps: matching cost computation, cost aggregation, disparity computation and disparity refinement.
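The first three steps of this categorization can be sketched for a single rectified scanline as follows. This is a toy winner-take-all matcher written for illustration only, assuming grayscale intensities stored in Python lists; disparity refinement is omitted and the function name is hypothetical.

```python
def scanline_disparity(left, right, max_disp, half_win=1):
    """Winner-take-all disparity for one rectified scanline.

    left, right: lists of pixel intensities of equal length.
    A left pixel at column x is compared with right pixels at x - d.
    """
    n = len(left)
    disparities = []
    for x in range(n):
        best_d, best_cost = 0, float("inf")
        for d in range(max_disp + 1):
            # Cost aggregation: sum the per-pixel costs over a 1-D window.
            cost = 0
            for k in range(-half_win, half_win + 1):
                xl = min(max(x + k, 0), n - 1)        # clamp at the borders
                xr = min(max(x + k - d, 0), n - 1)
                cost += abs(left[xl] - right[xr])     # matching cost (absolute difference)
            # Disparity computation: keep the candidate with the lowest cost.
            if cost < best_cost:
                best_cost, best_d = cost, d
        disparities.append(best_d)
    return disparities
```

Local methods of this kind are fast but fail in textureless regions, which is one reason why the aggregation and refinement steps receive so much attention.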
For free-viewpoint video development, the acquisition of images from many different viewpoints is mandatory (see Fig. ▭). Thus, the problem of reconstructing 3D scenes from more than two frames arises, the so-called multi-view stereo reconstruction problem .
Example of multi-camera setup (left) and images of a same scene captured from many different viewpoints (right). Figures taken from
Many algorithms to compute multi-view stereo have been developed . A taxonomy for multi-view stereo methods has been proposed , similar to the one presented in  for the evaluation of binocular stereo methods. The multi-view algorithms are classified and evaluated according to six properties: scene representation, photoconsistency measure, visibility model, shape prior, reconstruction algorithm and initialization requirements. According to this taxonomy, the reconstruction algorithms can be classified into four main classes :
Cost computation on a 3D volume - for example, voxel coloring methods ;
Minimization of a cost function - for example, space carving methods ;
Computation of depth-maps;
Extraction and matching of feature points.
In  the authors propose a new algorithm for multi-view stereo reconstruction that employs a pipeline other than the Feature Extraction, Matching and Reconstruction pipeline of traditional stereo methods. It starts with a sparse set of matched points that is expanded to a denser set and filtered using visibility constraints. This process results in a patch-based representation of the surface, which is then transformed into a mesh-based representation.
Multi-view stereo algorithms have been applied to obtain the geometry of 3D objects from photos . Also, many 3D video systems based on multi-view stereo algorithms have been proposed , , , . In the context of FVV one of the pioneering works can be seen in . The authors use the multi-baseline stereo algorithm of  to obtain depth maps, which are edited to remove inaccuracies. It reconstructs foreground and background objects. The system in  is also based on the same algorithm.
A recent overview of coding algorithms for stereo and multi-view video can be found in .
The most difficult part of stereo computation is the matching or correspondence problem . Active stereo methods try to overcome this limitation by emitting and projecting some sort of wave onto the surface. In structured light approaches a controlled illumination pattern is projected. This methodology has been applied to obtain 3D models of cultural artifacts, such as statues , , .
Many 3DV and FVV systems benefit from this idea , , , , , . In , for example, a real-time 3D system is presented. It utilizes only one camera and one projector, which must be synchronized to guarantee that the pattern is projected at the same time as the camera captures it. Camera and projector have to be calibrated as well. The projector projects slides with a sequence of colored stripes in which consecutive stripes may not have the same color (see Fig. ▭). Experiments were made with static scenes and also with scenes containing reasonably fast movements. The system needs improvements in the quality of the reconstructed scenes, but it is a promising approach towards a real-time 3D video system. Unlike the previous setup, the multi-view stereo system in  projects a binary pattern of vertical stripes with randomly varying widths (see Fig. ▭).
Upper row: scene illuminated with colored stripes (left) and the reconstructed scene (right) . Lower row: color image (left) and same image with structured light illumination (right) .
Methods to compute depth via triangulation have been widely investigated by the computer vision community. Stereo, laser scanning and temporally or color-coded structured light are the most popular. Usually they are classified as active or passive methods. In  a new classification of triangulation-based 3D reconstruction methods is proposed. Instead of being labeled passive or active, the methods are classified according to the domain in which corresponding features are located. Techniques such as laser scanning and passive stereo identify features only in the spatial domain. Methods such as temporally coded structured light use features only in the temporal domain. The spacetime stereo approach looks for features in both the spatial and temporal domains (see Fig. ▭). This new methodology has been applied to the reconstruction of dynamic scenes , .
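To make the classification concrete, the following sketch computes a sum-of-squared-differences matching cost over a window that extends in both space and time. It is only an illustration under simplifying assumptions (rectified grayscale sequences indexed as seq[t][y][x], a constant disparity inside the window, and a caller that keeps the window inside the data); the function and parameter names are hypothetical and do not reproduce the formulation of the cited works.

```python
def spacetime_ssd(left_seq, right_seq, x, y, t, d, half_w=1, half_t=1):
    """Sum of squared differences over a spatio-temporal window.

    left_seq, right_seq: nested lists indexed as seq[t][y][x].
    d: candidate disparity, assumed constant inside the window.
    The caller must keep the whole window inside the sequences.
    """
    cost = 0
    for dt in range(-half_t, half_t + 1):          # temporal extent
        for dy in range(-half_w, half_w + 1):      # vertical extent
            for dx in range(-half_w, half_w + 1):  # horizontal extent
                l = left_seq[t + dt][y + dy][x + dx]
                r = right_seq[t + dt][y + dy][x + dx - d]
                cost += (l - r) ** 2
    return cost
```

With half_t = 0 the cost reduces to a purely spatial window, as in passive stereo, and with half_w = 0 it compares only the temporal intensity profile of a single pixel, as in temporally coded structured light.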
In parallel, other research groups were also interested in spatio-temporal benefits. In reference  the authors have employed the spacetime approach in three different cases. For static scenes they used structured light to obtain high-quality depth maps, and improvements over traditional stereo methods were observed. They tested the spacetime theory on quasi-static objects such as waterfalls, where it proved to be more efficient. For dynamic scenes under natural lighting conditions it behaved like traditional stereo. The approach presented in  has been used to develop a scalable 3D video system .
In reference  the spacetime approach was used to improve the video resolution of dynamic scenes. Super-resolution is obtained simultaneously in space and time, which makes the system capable of recovering dynamic events that happen faster than the video frame rate.
3.2.2. Structure from motion
In Computer Vision, the problem of recovering Structure From Motion (SFM)  refers to the process of finding the three-dimensional structure of an object by analyzing its motion over time. We perceive a lot of information about the three-dimensional structure of the environment by moving around. The same happens when the objects in the scene move.
The SFM problem is similar to stereo vision. In both approaches, the image correspondences and the 3D coordinates of the object must be computed. But in SFM, in order to find correspondences between images, features such as corners must be tracked from one image to another. The trajectories of these features are used to reconstruct the 3D object and the camera motion. Because of this feature tracking, SFM is especially effective with video sequences.
Most SFM techniques reconstruct scenes with rigid objects, but in ,  the authors deal with scenes containing non-rigid objects, such as animals and humans. A limitation of SFM is that pixel correspondences can only be calculated accurately for salient features.
In  the authors use structure from motion to reconstruct static scenes from a sequence of uncalibrated images. For this purpose, a hand-held camera is used. The method requires restricted camera motion, especially regarding camera rotations. No prior information is required besides the images themselves. One limitation is that it strongly depends on image texture, because it is a feature-based approach.
The reconstruction of 3D scenes captured by a hand-held camera was the main goal of other works, , as well. Structure-from-motion techniques have also been used to reconstruct city architecture . The authors try to fuse the data obtained by the SFM approach with GPS measurements.
3.2.3. Shape from silhouette
Many 3D reconstruction algorithms are based on object silhouettes. This class of techniques is known as Shape-from-Silhouette . The important concept of the Visual Hull of an object was introduced in  to identify which parts of an object are important to silhouette-based approaches. A formal definition is:
"The visual hull $VH(S, R)$ of an object $S$ relative to a viewing region $R$ is a region of $E^3$ such that, for each point $P \in VH(S, R)$ and each viewpoint $V \in R$, the half-line starting at $V$ and passing through $P$ contains at least a point of $S$." 
For each viewpoint, the half-lines that start at the viewpoint and pass through the object silhouette form a silhouette cone. The volume generated by intersecting the silhouette cones from all viewpoints is the visual hull ▭. Volume carving  is the approach commonly used for this purpose. Since volumetric techniques are traditionally slow, the image-based visual hull (IBVH)  has been developed to overcome this limitation. It runs in real time and, like all image-based rendering techniques, it provides a realistic rendering of the scene. It is pertinent to observe that silhouette approaches suffer from one important limitation: they are not able to recover concave surface regions. Thus, the reconstruction of concave objects is not guaranteed with silhouette approaches alone. Efforts to overcome this problem have been made , as well.
In the context of 3DV and FVV, silhouette approaches have been widely used to recover the 3D object surface. The systems in , , ,  employ the same volumetric strategy: the visual hull volume is computed and then divided into voxels. For each frame and viewing position, every voxel is marked as occupied (object portion) or empty (background portion). After this process the remaining voxels contain the object and form a voxel-based representation of it. Finally, the marching cubes algorithm transforms the voxel model into a triangle mesh, which represents the object surface.
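The voxel marking step can be sketched as follows. This is a simplified illustration rather than the implementation of any cited system: it assumes a cubic voxel grid, binary silhouettes stored as nested lists, and one hypothetical projection function per view that maps a voxel index to a silhouette pixel (in a real system this mapping comes from camera calibration).

```python
from itertools import product


def carve_visual_hull(grid_size, views):
    """Return the set of voxels whose projection lies inside every silhouette.

    views: list of (project, silhouette) pairs, where project maps a voxel
    (x, y, z) to a (row, col) pixel and silhouette[row][col] is True for
    object pixels and False for background pixels.
    """
    occupied = set()
    for voxel in product(range(grid_size), repeat=3):
        inside_all = True
        for project, silhouette in views:
            row, col = project(voxel)
            if not silhouette[row][col]:
                inside_all = False     # carved away by this view
                break
        if inside_all:
            occupied.add(voxel)
    return occupied
```

Because a voxel survives as soon as it projects inside all silhouettes, cavities that no silhouette reveals remain filled, which is the concavity limitation of silhouette-based approaches.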
3D video systems using variants of IBVH have already been proposed , , . Reference  presents a complete 3DV and FVV system that combines visual hull, surface texture, image features and inertia constraints to perform a high-quality reconstruction of dynamic scenes.
4. 3D video representation
Various representation schemes for 3D videos can be found in the literature . Usually the choice depends on the target application, but for some authors  it completely determines the design of the 3D video system.
Geometry-based modeling (Subsec. ▭) represents data as we know it from Computer Graphics. In order to use this format the 3D scene has to be reconstructed and the geometry stored in a well-known format, such as polygonal meshes or point clouds. These formats offer full navigation of the scene or object, but they have realistic rendering issues due to errors in the reconstruction step. On the other hand, image-based formats (Subsec. ▭) avoid the explicit 3D reconstruction of the scene and provide a more realistic visualization. However, there is a critical trade-off between realistic rendering and the size of the stored data.
4.1. Image-based representation
The most popular format for three-dimensional video is stereoscopic video, composed of two video signals, one for each eye. It is the image-based format used by movie theaters and by current 3D TV for home entertainment. Due to its simplicity it can be encoded using existing video codecs, by performing spatial or temporal interleaving. For spatial interleaving the images for the right and left eyes are resized and packed into a single frame, arranged either side-by-side or top-bottom. In temporal interleaving the right and left images are shown alternately in time.
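The three packing schemes can be sketched as below. The example assumes frames stored as lists of pixel rows with even width and height, and it resizes by simply dropping every other column or row; practical encoders low-pass filter before decimating. All function names are illustrative.

```python
def pack_side_by_side(left, right):
    """Halve the width of each view and place them next to each other."""
    if len(left) != len(right):
        raise ValueError("views must have the same number of rows")
    return [l_row[::2] + r_row[::2] for l_row, r_row in zip(left, right)]


def pack_top_bottom(left, right):
    """Halve the height of each view and stack them vertically."""
    return [list(row) for row in left[::2]] + [list(row) for row in right[::2]]


def interleave_temporal(left_frames, right_frames):
    """Alternate left and right frames in time: L0, R0, L1, R1, ..."""
    return [frame for pair in zip(left_frames, right_frames) for frame in pair]
```

Both spatial schemes keep the packed frame at the original resolution, so it passes through an existing 2D codec unchanged, at the price of half the horizontal or vertical resolution per eye.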
For FVV systems there are the Light field ,  and Ray-space  representations. Neither representation performs any geometric reconstruction, which avoids the artifacts generated by this process. Thus, they lead to a more realistic rendering of the scenes. However, the realistic rendering comes at the cost of a huge amount of data. They need to store and transmit a set of views that are interpolated at the receiver side in order to generate novel views. If only a few views are transmitted, the rendering quality is poor.
4.2. Geometry-based representations
4.2.1. Polygonal meshes
Polygonal meshes  are the most popular 3D scene representation in many industries such as architecture and entertainment. Due to realism requirements in computer graphics and the development of 3D scanning technologies, polygonal meshes representing 3D surfaces contain millions of polygons. On one hand they can represent satisfactorily almost any geometric detail of the surface. On the other hand these meshes are complex and computationally expensive to be stored, transmitted and rendered. To overcome these limitations, many techniques to compress and simplify complex meshes have been developed leading to progressive approaches , even for time-varying meshes .
Important projects that build 3D polygonal mesh models from scanner systems or photos have been proposed , . Many 3DV and FVV systems are based on a polygonal mesh representation , , . In , , , ,  a triangular mesh is obtained from a voxel representation via the marching cubes algorithm, after silhouette-based reconstruction. In reference , instead of the marching cubes algorithm, the authors use multi-level partition of unity (MPU) implicits . Reference  uses a prior body model consisting of 16 closed triangle meshes. Researchers in ,  present a deformable three-dimensional mesh model which allows the recovery of 3D shape and 3D motion. The shape is represented by the triangular mesh, and the movement by vertex translations. Deformations occur between and within frames, subject to photometric and smoothness constraints, for example. Figure ▭ shows a result obtained after intra-frame deformation.
4.2.2. Point-based representation
In point-based schemes the geometry is represented by a set of points sampled from the surfaces in the scene . Neither topological nor connectivity information is explicitly stored. Points offer advantages over other representations because they are the simplest geometric primitive.
Progressive approaches have also been applied to point-based representations , . The need arises in applications that deal with a huge amount of data and/or involve some sort of data transmission, such as Internet or broadcast. In  the geometry and texture of 3D objects are encoded in terms of surface particles associated with an octree . The encoding is done in an order that allows the surface to be reconstructed progressively. The same idea has been employed to reconstruct and render the David statue . In that work a hierarchy of spheres was used instead of an octree and the resulting representation was rendered using splatting techniques .
3D video systems claiming high-quality rendering of point-based representations are available in literature , , . In  each point of the representation is associated with its color, avoiding the use of textures. Also, each point is modeled by a Gaussian ellipsoid generated by three vectors, with origin in its center. This is a probabilistic model representing the positional uncertainty of each point.
The authors in  propose a framework for recording 3D videos. The prototype has been tested by capturing and reproducing dynamic scenes with one moving human. They utilize a time-varying three-dimensional hierarchical point-based data structure to store the 3D video. One such data structure is constructed per frame. Then, two different splatting techniques are employed to render a continuous surface of the 3D object.
In  a point-based variant of the image-based visual hull  is used in the design of an immersive environment for virtual design and collaboration. The authors of  propose a real-time free-viewpoint system based on the concept of 3D video fragments. 3D video fragments are point samples of a 3D object surface with some attributes, e.g., position, surface normal and color. The system uses an inter-frame prediction scheme to dynamically update those attributes and thereby avoids recomputing the full 3D representation for each frame.
Commercial 3D video systems based on point representations are already available in the market. For example, Libero Vision Company  offers products for creating realistic virtual views of sports events from arbitrary viewpoints .
4.3. Depth maps-based representation
A depth map  is a special kind of digital image. In a depth map each pixel represents the distance from the sensor to a visible point in the scene. Thus, it captures the 3D scene structure and can be interpreted as a surface sampling .
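The interpretation of a depth map as a surface sampling can be illustrated by back-projecting every pixel into a 3D point. The sketch below assumes a pinhole camera with hypothetical intrinsics (fx, fy, cx, cy), and it assumes that each pixel stores the Z coordinate along the optical axis rather than the Euclidean distance along the viewing ray; non-positive values are treated as missing measurements.

```python
def depth_map_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map into a list of (X, Y, Z) points.

    depth: list of rows, depth[v][u] is the Z value of pixel (u, v).
    fx, fy: focal lengths in pixels; (cx, cy): principal point.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:                     # no valid measurement at this pixel
                continue
            x = (u - cx) * z / fx          # pinhole model, horizontal axis
            y = (v - cy) * z / fy          # pinhole model, vertical axis
            points.append((x, y, z))
    return points
```

The resulting point set samples only the surfaces visible from the sensor, which is why several depth maps from different viewpoints are needed to cover a complete object.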
Nowadays, representations based on depth maps are the most popular and promising for 3DV and FVV. This is because some of them support both 3DV coding, where the left and right images are encoded, and FVV coding, where view synthesis can be performed. An explanation of depth-image-based representations and a recent review of 3D video representations using depth maps are available in the literature , .
In order to build a reliable 3D model, a dense depth map must be established, that is, a depth estimate for each pixel of the intensity images. The pioneering works in ,  compute dense depth maps from all available views. A scene description is created using the depth map aligned with the intensity image for each recorded angle. However, they convert each depth map into a triangle mesh and employ texture mapping to render the scene. This representation supports only free-viewpoint video and cannot render stereoscopic video. The work developed in  uses a layered representation - intensity image and associated depth map - for view interpolation. It also converts the depth maps into triangle meshes to benefit from programmable GPUs.
A 3D representation that combines a conventional 2D video stream with synchronized depth information was proposed in  during the ATTEST project . It is called video plus depth (V+D) and allows the rendering of two virtual views corresponding to a stereo pair. This format has been standardized by the MPEG group and is known as MPEG-C Part 3 .
The video plus depth format has been extended to multiview video plus depth (MVD) . With this format only a subset of M images and their associated depth maps is transmitted to a display of N views. The remaining views are interpolated via image-based warping.
Another available format is layered depth video (LDV) , which is based on the concept of the layered depth image (LDI) . An LDV is composed of a 2D video (color image), the associated depth maps and other layers, for example an occlusion layer or residual layers of depth and color. This representation is more compact than MVD. However, thanks to its redundancy, the MVD format provides a more realistic rendering. Both formats are under investigation by the MPEG group .
5. 3D video visualization
Most developed 3D video systems aim to provide realistic visualization. The rendering technique employed strongly depends on the 3D representation used to model the 3D scene.
Popular approaches are texture mapping and colorimetry for surface-based representations, light fields  and depth-image based rendering (DIBR) . Examples of FVV systems employing these rendering techniques can be found in , , . Systems based on point-cloud representations usually apply splatting techniques , . For 3DV rendering, video plus depth (V+D) representations achieve depth perception by applying DIBR techniques to synthesize the second video of the stereo pair.
One important task of the rendering stage is to generate virtual views. This matters not only for FVV systems but also for autostereoscopic displays. The general idea behind virtual view synthesis is to project the image into 3D space and then project it again onto the image plane of a virtual camera placed at the desired position. Inherent problems of this process are occlusions and object boundary areas. A region that is occluded in a natural view may be visible from a virtual view position, leading to holes in the novel view. Object boundary areas are difficult to handle because their pixels mix background and foreground colors. In addition, depth estimates in such areas are unreliable. Both situations lead to artifacts after projection into novel views.
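The projection and re-projection step, and the way holes appear, can be illustrated on a single scanline. The sketch below assumes the simplest possible setup, which is not taken from any cited system: a virtual camera shifted to the right of the original one along the horizontal axis, parallel optical axes, and a focal length expressed in pixels. Under these assumptions a pixel at column u with depth Z moves to column u - f*b/Z. The function name and parameters are hypothetical.

```python
from typing import List, Optional, Sequence


def warp_scanline(colors: Sequence[object], depths: Sequence[float],
                  focal_px: float, baseline: float) -> List[Optional[object]]:
    """Forward-warp one image row to a horizontally shifted virtual camera.

    Returns the warped row; entries left as None are holes.
    """
    width = len(colors)
    out: List[Optional[object]] = [None] * width
    zbuf = [float("inf")] * width  # nearest depth written so far per column
    for u, (color, z) in enumerate(zip(colors, depths)):
        target = round(u - focal_px * baseline / z)  # disparity shrinks with depth
        if 0 <= target < width and z < zbuf[target]:
            out[target] = color    # the closer surface wins (occlusion handling)
            zbuf[target] = z
    return out


if __name__ == "__main__":
    row = ["a", "b", "c", "F", "G", "d", "e", "f"]   # F, G: foreground object
    depth = [4.0, 4.0, 4.0, 2.0, 2.0, 4.0, 4.0, 4.0]
    print(warp_scanline(row, depth, focal_px=4.0, baseline=1.0))
    # → ['b', 'F', 'G', None, 'd', 'e', 'f', None]
```

In the example the foreground pixels shift by two columns and the background by one, so the foreground overwrites a background pixel and uncovers a column (index 3) for which the original view has no color. The last column is empty because it lies outside the field of view of the original image.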
The view interpolation schemes of the MVD and LDV representations presented in  and , respectively, are good strategies to overcome these limitations. The MVD scheme identifies unreliable regions by extracting a main layer and two boundary layers - one for background boundaries and another for foreground boundaries (Fig. ▭). After extraction, the layers are projected into 3D space and the virtual view position is interpolated from the original view positions through spherical linear interpolation. After that, all layers in 3D space are projected separately in the proper order and the results are merged. Finally, the artifacts naturally introduced by image-based 3D warping are detected and corrected.
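Spherical linear interpolation is commonly formulated on unit quaternions. The following sketch shows that standard formulation, assuming it is applied to the orientations of the two original cameras stored as (w, x, y, z) quaternions; how the cited scheme parameterizes the camera pose is not specified here, so this is an illustration only.

```python
import math
from typing import Tuple

Quat = Tuple[float, float, float, float]  # (w, x, y, z), assumed unit length


def slerp(q0: Quat, q1: Quat, t: float) -> Quat:
    """Interpolate between two unit quaternions for t in [0, 1]."""
    cos_theta = sum(a * b for a, b in zip(q0, q1))
    if cos_theta < 0.0:                 # take the shorter arc on the sphere
        q1 = tuple(-c for c in q1)
        cos_theta = -cos_theta
    theta = math.acos(min(cos_theta, 1.0))
    if theta < 1e-6:                    # nearly identical: fall back to lerp
        w0, w1 = 1.0 - t, t
    else:
        s = math.sin(theta)
        w0 = math.sin((1.0 - t) * theta) / s
        w1 = math.sin(t * theta) / s
    q = tuple(w0 * a + w1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)
```

For example, interpolating halfway between the identity and a rotation of 90 degrees about the z axis yields a rotation of 45 degrees about the same axis, with constant angular speed as t varies.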
The LDV approach also identifies unreliable regions and extracts a main layer but, unlike MVD, it extracts only one boundary layer that combines both background and foreground boundaries (see Fig. ▭ for comparison). In LDV representations only a central view and the associated residual layers are transmitted, leading to some color differences in novel views. Thus, MVD performs better than LDV in terms of rendering quality, but the latter is a more compact representation.
(Left) Layers of MVD: main layer in gray, foreground boundary layer in blue and background boundary layer in green. (Right) Layers of LDV: main layer in gray and one boundary layer combining both background and foreground boundaries.
5.2. 3D displays
Mechanisms offering the perception of depth are now widely available. 3D cinema is experiencing huge success and 3D TV for home entertainment is a reality. The popularization of 3D TV is due to advances across the whole 3D video pipeline, especially in 3D displays.
Examples of 3D displays are stereoscopic and autostereoscopic displays , holograms  and integral imaging . Here we briefly present the two types most suited to home entertainment: stereoscopic and autostereoscopic displays.
Stereoscopic displays are the most popular type of 3D display. They show two multiplexed images on the screen, both depicting the same scene captured from two slightly different angles. The viewer wears special glasses that separate the multiplexed image into two images - one for the left eye and one for the right eye - so that each eye sees only one of them. Image multiplexing schemes rely on color, polarization or time. The separation is thus possible because the two images use different colors (e.g., red and cyan), use different polarizations or are shown in alternating frames. Anaglyph, polarized or shutter glasses, respectively, are required to deliver each image to the corresponding eye. The major drawback of this approach is that the viewer must wear glasses to perceive depth.
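Two of these multiplexing schemes are simple enough to sketch directly. The code below assumes images stored as rows of (r, g, b) tuples and uses the common red-cyan convention in which the red channel comes from the left image and the green and blue channels from the right one. The function names are hypothetical, and real systems add color correction and synchronization that are omitted here.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def anaglyph(left: Image, right: Image) -> Image:
    """Color multiplexing: red from the left view, green/blue from the right."""
    if len(left) != len(right) or any(len(a) != len(b) for a, b in zip(left, right)):
        raise ValueError("left and right images must have the same size")
    return [[(l[0], r[1], r[2]) for l, r in zip(lrow, rrow)]
            for lrow, rrow in zip(left, right)]


def frame_sequential(left_frames: Sequence[object],
                     right_frames: Sequence[object]) -> List[object]:
    """Time multiplexing: alternate left and right frames (L0, R0, L1, R1, ...)."""
    if len(left_frames) != len(right_frames):
        raise ValueError("both views need the same number of frames")
    out: List[object] = []
    for l, r in zip(left_frames, right_frames):
        out.extend((l, r))
    return out
```

With anaglyph glasses the red filter passes only the left view and the cyan filter only the right view, while shutter glasses block each eye in step with the alternating frame sequence.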
Autostereoscopic displays offer depth perception without requiring special glasses or any other user-mounted device. The main limitations of this technology are its cost and the number of users able to perceive depth at the same time. Autostereoscopic displays rely on viewing zones in which the user must remain so that one image is visible to the right eye and another to the left. Such a display can be two-view or multi-view. In the first case only one stereo pair is displayed, providing 3DV capabilities. In the second, multiple stereo pairs are displayed, providing both 3DV and FVV functionalities. Here, FVV means that an observer moving in front of the display perceives a natural motion parallax. The technologies employed in two-view autostereoscopic displays are parallax barriers and lenticular sheets. In the multi-view case, the methods used are multiview parallax barriers, time multiplexing combined with a parallax barrier, and lenticular arrays combined with pixelated emissive displays.
This chapter provides an overview of the 3D video production pipeline. We have concentrated on systems that are not concerned with 3D data coding and transmission. 3D video is a broad research area, and here we have briefly outlined its main issues and advances. An extensive list of publications is provided below for readers interested in more details.
3D media is already part of our everyday lives, and for this reason a great deal of leading research is under way. Regarding capture devices, 3D cameras are already on the market, even for professional use, but they are still expensive. Although there are not many options for home users, 3D cameras are becoming cheaper as new technologies develop.
Along with the quality of the 3D content produced and advances in 3D displays, standardization plays an important role in the success of 3D video. To this end, the MPEG group is working on the standardization of depth map-based representations, which have been shown to be more suitable in this context. In parallel, multiview autostereoscopic displays are being developed with the intention of making them the next generation of TV sets.