Recently used pedestrian databases by the researchers.
A key frame is a representative frame that captures the essential content of a video sequence. It is used for indexing, classification, evaluation, and retrieval of video. Existing algorithms generate relevant key frames, but they also generate some redundant ones, and some cannot represent the entire shot. In this chapter, an effective algorithm based on the fusion of deep features and histograms is proposed to overcome these issues. It extracts the most relevant key frames by eliminating ambiguity in key frame selection. It can be applied to the video sequence in parallel, which reduces computational and time complexity. The results indicate that the algorithm is effective at extracting relevant key frames from videos.
- deep learning
- neural network
- video processing
- computer vision
In video analysis and processing, retrieving the relevant and necessary information is essential, because a long video is difficult to process completely in a short time without losing its semantic details. Key frame extraction is a primary step in many computer vision algorithms. A key frame is a part of the video that provides a visual summary of, and meaningful information about, the video sequence. Key frames are useful in many applications such as video scene analysis, browsing, searching, information retrieval, and indexing. Aigrain et al. describe the benefits of key frame extraction for information extraction from a video sequence. HongJiang et al. show that, for any video sequence, key frame extraction lets the user search, index, and retrieve information faster and more efficiently. Liu et al. and Gargi et al. proposed object motion-based approaches to key frame extraction. A video has a complex structure: it is a combination of scenes, shots, and frames, as shown in Figure 1. In many computer vision applications, such as content-based video retrieval (CBVR), video scene analysis, and video sequence summarization, it is necessary to analyze the overall video structure. The major components of video analysis are video scene segmentation, shot boundary detection, and key frame selection and extraction [6, 7, 8]. The main use of key frame extraction is to remove redundant frames so that the video is compact and readable and can be processed faster.
Conventional key frame extraction methods eliminate redundant and similar frames in a video without affecting the semantic details of its visual content. The input to these techniques is either a complete video or a video divided into a set of shots by shot boundary detection methods. As shown in Figure 1, a shot is a consecutive sequence of adjacent frames captured by the video camera. In this chapter, we propose an approach for video key frame extraction that is fast, accurate, and computationally efficient. The chapter is organized as follows. Section 1 introduces video structure and the importance of key frame extraction in a video surveillance system. Section 2 describes existing methods for estimating the number of key frames. Section 3 describes existing key frame extraction methods with their issues and challenges. Section 4 describes the proposed approach for key frame extraction. Section 5 discusses experimental results and possible future directions. Finally, Section 6 concludes the chapter with a discussion.
2. Key frame size estimation methods available in the literature
The major problem in a key frame extraction algorithm is determining the size of the key frame set, that is, the number of key frames for a specific video sequence. Several methods for key frame size estimation are available in the literature, and this section discusses them briefly. In one approach, the author considers one key frame for each shot of a video, selected as the frame with the maximum entropy value in the shot. This is neither appropriate nor accurate for a video with long shots, because many useful frames are discarded by the pre-defined fixed selection. Extracting fewer key frames does not solve the problem: the output must be a set of key frames that is necessary and sufficient to represent the visual content of the video. In other approaches, the first, middle, or last frame of the shot is selected, but the resulting frames have a low correlation with each other in visual content. These methods are computationally less complex but also less accurate. In [9, 10], the authors describe three different ways of identifying the key frames in a video sequence. Each is described briefly as follows.
2.1 Priori knowledge base as a fixed number
In this method, the number of key frames is fixed to a pre-defined value before the key frame extraction process. Let “k” be the number of key frames; the key frame set is then defined by Eq. (1):
The sequence of video frames varies with the type of video. The specific summarization of key frames is defined by Eq. (2):
Here n is the number of frames in the video, and the remaining symbols denote the key frame summarization factor and the distance measure used to compute the dissimilarity between frames. This method is useful for keeping the number of key frames small while still covering the complete visual content of the video.
2.2 Posteriori knowledge base as unknown
In this method, the number of key frames is not fixed and remains unknown until the key frame extraction process is complete. The number of key frames depends on the content of the video: a scene with dynamic action yields more key frames, and a static scene yields fewer. Key frame generation can be represented by Eq. (3):
where the additional parameter sets the tolerated level of dissimilarity. The other parameters are the same as in the previous method.
2.3 Determined-fixed number
In this method, the number of key frames is determined before the key frame extraction process begins. In the approaches of [11, 12], key frames are extracted using a clustering technique. The key frame extraction algorithm stops when the number of extracted key frames matches the pre-defined value.
3. Key frame extraction methods with its issues and challenges
Several methods for extracting key frames are available in the literature. Hannane et al. and Hu et al. group key frame extraction methods into the following categories: sequential comparison of frames, global comparison of frames, minimum correlation between frames, minimum reconstruction error in frames, temporal variance between frames, maximum coverage of video frames, reference key frame, curve simplification, key frame extraction using clustering, object- and event-based key frame extraction, and panoramic key frames. Each of these methods is described briefly as follows.
3.1 Sequential comparison of frames
In this method, each frame of a video sequence is compared with the previously extracted key frame. If the difference between the previously extracted key frame and the current frame is high, the current frame becomes the new key frame. In one such approach, key frames are extracted based on a color histogram comparison between the current and previous frames of a video sequence. The main advantage of this method is that it is simple and computationally inexpensive. Its disadvantage is that the extracted set contains redundant key frames.
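The sequential comparison idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each frame is represented by a small feature vector (for example, a normalized color histogram), the difference is an L1 distance, the first frame seeds the key frame set, and the function name and threshold value are hypothetical rather than taken from the cited works.

```python
def sequential_key_frames(features, threshold):
    """Return indices of key frames chosen by sequential comparison.

    features:  list of equal-length numeric vectors, one per frame
               (for example, normalized color histograms).
    threshold: hypothetical cut-off on the L1 distance between vectors.
    """
    if not features:
        return []
    key_indices = [0]  # assumption: the first frame seeds the key frame set
    for i in range(1, len(features)):
        last_key = features[key_indices[-1]]
        # L1 distance between the current frame and the last key frame
        diff = sum(abs(a - b) for a, b in zip(features[i], last_key))
        if diff > threshold:
            key_indices.append(i)
    return key_indices


# Toy two-bin "histograms"; the last frame repeats the content of the first.
frames = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.8], [0.25, 0.75], [1.0, 0.0]]
selected = sequential_key_frames(frames, threshold=0.5)
```

In this toy input the last frame has the same content as the first, yet both are selected, because each frame is compared only with the most recent key frame. This is the redundancy noted above.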
3.2 Universal frame comparison
In this method, the global difference between the frames in a shot is computed using a predetermined, application-specific objective function. Zhuang et al. describe different objective functions for comparing the frames in a shot. Each of these functions is discussed briefly as follows.
3.3 Minimum associations
In this method, relevant key frames are generated from a shot by minimizing the sum of the associations between frames. The extracted key frames are tightly coupled with each other. Liu uses a graph-based approach to extract distinct key frames together with their associations. A weighted directed graph represents the shot, and the shortest path is computed using the A* algorithm. The frames with minimum association and low correlation are selected as the key frames of the shot.
3.4 Minimum reconstruction error
In this method, key frames are extracted by minimizing the difference between a predicted frame and the set of frames in a shot. The predicted frame is generated by interpolation. Chao et al. presented an approach that selects a pre-defined set of key frames and reduces the frame reconstruction error. Another work proposes a combination of the predicted frame-based approach and the selection of a pre-defined set of key frames. This method uses motion-based features.
3.5 Similar temporal variance
In these methods, frames with similar variance are selected as the key frames of a given shot. The sum of the temporal variance between all frames is used as the objective function. The temporal variance is computed by summing the changes in frame content within a shot.
3.6 Maximum key frame representation coverage
In this method, the representation coverage of a key frame is the number of frames in a shot that the key frame can represent. This measure is useful for choosing the number of key frames. The advantage of this method over the universal comparison method is that the extracted key frames are maintainable and contain the global context information of a shot. Its only disadvantage is that it is computationally complex.
3.7 Predetermined reference frame
In this method, a key frame is generated by comparing a predetermined reference frame with each frame in a shot. The main advantage of this method is that it is computationally inexpensive and easy to implement. Its drawback is that it does not represent the global context of a shot efficiently.
3.8 Trajectory curve simplification
In this method, a trajectory curve is generated from the frames. The curve is a sequence of points in the feature space. Calic and Izquierdo present a dynamic method for scene change detection and key frame generation. A frame difference metric is computed from small block features in a scene, and a contour detection method is then used to plot the trajectory curve from this metric.
3.9 Cluster-based key frame extraction
In this method, clusters are created from the data points and features of the video sequence. The key frame set consists of the frames closest to the cluster centers. In [21, 22], fuzzy K-means- and fuzzy C-means-based methods for key frame selection are presented, in which the clusters are generated from features such as motion sequences and the distance matrix score. Another approach combines K-means with the mean squared error for key frame selection. Pan et al. proposed an enhanced fuzzy C-means clustering algorithm for key frame selection, in which clusters are generated using the color feature and the frame with the highest entropy in each cluster is taken as the key frame. The advantage of cluster-based approaches is that they cover the global characteristics of the scene. Their disadvantage is the high computational cost of cluster generation and feature extraction.
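As a rough illustration of the cluster-based idea, the sketch below runs a plain (non-fuzzy) K-means on per-frame feature vectors and keeps, for each cluster, the frame closest to its center. It is a simplification of the cited fuzzy variants, not a reproduction of them; the evenly spaced seeding, the iteration count, and all function names are assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_key_frames(features, k, iterations=10):
    """Pick one key frame per cluster: the frame nearest the cluster center."""
    n = len(features)
    dim = len(features[0])
    # Assumption: centers are seeded with k evenly spaced frames.
    centroids = [list(features[(i * n) // k]) for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every frame to its nearest center.
        clusters = [[] for _ in range(k)]
        for idx, feat in enumerate(features):
            nearest = min(range(k),
                          key=lambda c: squared_distance(feat, centroids[c]))
            clusters[nearest].append(idx)
        # Update step: move each center to the mean of its members.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = [
                    sum(features[m][d] for m in members) / len(members)
                    for d in range(dim)
                ]
    key_frames = []
    for c, members in enumerate(clusters):
        if members:
            key_frames.append(
                min(members,
                    key=lambda m: squared_distance(features[m], centroids[c]))
            )
    return sorted(key_frames)


# Six frames with one-dimensional toy features forming two obvious groups.
toy_features = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.3]]
cluster_keys = kmeans_key_frames(toy_features, k=2)
```

Even in this small example every iteration touches every frame and every center, which hints at the computational cost mentioned above.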
3.10 Event-driven key frame extraction
In this method, the extracted key frames contain event and object details. The advantage of this method is that each key frame describes the object, its motion pattern, and the event details. The drawback is that rules for identifying objects and events in a key frame must be defined in advance for each application. Hence, the accuracy of this algorithm depends on the parameters assumed before the key frame extraction algorithm is executed.
3.11 Full details key frame extraction (panoramic frame)
In this method, the key frame contains the complete detail of a scene in a shot. Papageorgiou and Poggio presented a key frame extraction approach using the homography matrix. The main advantage of this method is that it covers the global context of the shot. Its drawback is its high computational complexity. A comparative analysis of recently used key frame extraction algorithms is shown in Table 1. The methods are compared in terms of their characteristics, advantages, and shortcomings.
|Method name||Characteristics||Advantage||Shortcomings of the method||Ref.||Year|
|Clustering method (Zhuang et al.)||Analysis of short boundary video||Faster processing||||1998|
|Entropy method (Mentzelopoulos et al.)||Best method for unpredictable dataset||Local feature selection||||2012|
|Histogram method (Rasheed et al.)||Similarity measure between key frames||High-level segmentations||||2015|
|Motion analysis method (Wolf et al.)||Optical flow-based analysis||Faster mid-range key frame selection||||2016|
|Triangle-based method (Liu et al.)||Determination of the motion characteristics||Reduces the motion effects on the video||||2016|
|3D augmentation method (Chao et al.)||Processing short and fast motion video data||Combines the video data into multidimensional model||||2018|
|Optimal key frame selection method (Sze et al.)||Best method for continuously growing video sequence by adopting the temporary key frame||Faster processing due to probabilistic analysis||||2017|
|Context-based method (Chang et al.)||Best method for repetitive information contents||Generates a multilevel abstract of the information||||2017|
|Motion-based extraction method (Luo et al.)||Adopts the advantages from digital capture devices||Reduces the spatiotemporal effects||||2015|
|Robust principal component analysis method (Dang et al.)||Adopts the decomposition method for sparse component analysis||Analyzes the frames for consumer videos with fewer contents or rapid content shift||||2010|
4. Proposed methodology for key frame extraction
The proposed approach combines histograms and deep learning to extract the relevant key frames from a video sequence. Figure 2 shows the main steps of the proposed framework. The steps of key frame extraction are (1) video reading from the database, (2) frame extraction from the video, (3) preprocessing, (4) histogram generation, (5) histogram comparison, (6) distinct key frame generation, and (7) key frame extraction using a convolutional neural network (CNN). Each of these steps is described in the following subsections.
4.1 Video reading from database
We tested this algorithm on various publicly available datasets and on our own behavioral dataset. The first step is to read a video from the database. The raw video sequence selected from the database is represented by Eq. (4):
4.2 Frame extraction from video
The frames are extracted from the video selected in step 1 and stored in a local directory for further processing. The set of extracted frames is represented by Eq. (5):
4.3 Preprocessing of frames
In the preprocessing step, the key frame queue is initialized to zero, because no key frame has been selected at the initial step. Next, the frames extracted in step 2 are converted from the RGB color space to the HSV color space. This conversion gives more specific color, gray shade, and brightness information. In the HSV model, hue is the color portion, expressed as a number from 0 to 360. Saturation describes the amount of gray in a particular color, ranging from 0 to 100%. The value component represents the intensity of the color, ranging from 0 to 100%, where 0 is completely black and 100 is the brightest and reveals the most color.
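The HSV ranges described above can be checked with Python's standard `colorsys` module, which works on values in [0, 1]. The helper below is a hypothetical wrapper written for this illustration: it takes 8-bit RGB values and rescales the result to degrees and percentages. It is not part of the chapter's implementation, which would more likely convert whole frames with an image library.

```python
import colorsys


def rgb_to_hsv_units(r, g, b):
    """Convert 8-bit RGB to (hue in degrees, saturation in %, value in %)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s * 100.0, v * 100.0


pure_red = rgb_to_hsv_units(255, 0, 0)      # fully saturated, brightest
black = rgb_to_hsv_units(0, 0, 0)           # value 0 is completely black
mid_gray = rgb_to_hsv_units(128, 128, 128)  # saturation 0: no color, only gray
```

A gray pixel has zero saturation, so only its value component carries information. This separation of color from brightness is what the HSV conversion provides.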
4.4 Histogram generation
In this step, a normalized histogram is generated from the hue, saturation, and value components so that adjacent frames can be compared. Normalization provides contrast enhancement and a compact representation of the intensity and color information of the frame. The normalized histogram Hn is computed by Eq. (6):
where n denotes a possible intensity value.
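A minimal sketch of a normalized histogram for one channel follows. It assumes the usual definition, in which each bin count is divided by the total number of pixels so that the bins sum to 1. The bin count, the function name, and the sample values are hypothetical.

```python
def normalized_histogram(values, bins, max_value):
    """Histogram of one channel, scaled so that the bins sum to 1.

    values:    flat list of channel values in [0, max_value]
    bins:      number of histogram bins (hypothetical choice)
    max_value: upper end of the channel range, e.g. 360 for hue
    """
    counts = [0] * bins
    for v in values:
        # Map the value to a bin index; clamp the top edge into the last bin.
        index = min(int(v * bins / max_value), bins - 1)
        counts[index] += 1
    total = len(values)
    if total == 0:
        return [0.0] * bins
    return [c / total for c in counts]


# Six toy hue values (degrees) binned into four 90-degree bins.
hue_values = [0, 10, 100, 190, 200, 350]
hist = normalized_histogram(hue_values, bins=4, max_value=360)
```

Because the result sums to 1 regardless of frame size, histograms of frames with different resolutions can be compared directly.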
4.5 Histogram comparison
In this step, the normalized histograms of adjacent frames are compared using the Bhattacharyya distance measure, defined by Eq. (7):
The quantities in Eq. (7) are the histogram of the previous frame, the histogram of the current frame, and the number of histogram bins.
The Bhattacharyya distance gives the match score between the two histograms. Its value ranges from 0 to 1: 0 indicates an exact match of the frame content, 0.5 a half match, and 1 a mismatch. Next, the score is compared with a threshold to separate dissimilar frames from similar ones:
If the score exceeds the threshold, the current frame is dissimilar to the previous frame.
Add the frame to the key frame queue.
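The comparison step can be sketched as follows. Since the exact form of the distance is not reproduced here, the code assumes the normalized variant used by OpenCV's `HISTCMP_BHATTACHARYYA` comparison, which depends on the two histograms and the number of bins and yields 0 for identical histograms and 1 for non-overlapping ones. The threshold value of 0.5 and the function names are hypothetical.

```python
import math


def bhattacharyya_distance(h1, h2):
    """Bhattacharyya distance between two histograms with the same bins.

    Assumed form:
    d = sqrt(1 - sum(sqrt(h1[i] * h2[i])) / sqrt(mean(h1) * mean(h2) * N^2))
    """
    n = len(h1)
    mean1 = sum(h1) / n
    mean2 = sum(h2) / n
    if mean1 == 0 or mean2 == 0:
        return 1.0  # assumption: an empty histogram matches nothing
    overlap = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    coefficient = overlap / math.sqrt(mean1 * mean2 * n * n)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, 1.0 - coefficient))


def is_dissimilar(prev_hist, curr_hist, threshold=0.5):
    """True when the current frame should be queued as a candidate key frame."""
    return bhattacharyya_distance(prev_hist, curr_hist) > threshold


same = bhattacharyya_distance([0.5, 0.5], [0.5, 0.5])
disjoint = bhattacharyya_distance([1.0, 0.0], [0.0, 1.0])
```

With normalized histograms the denominator equals 1, so the distance reduces to the square root of one minus the summed bin-wise overlap.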
4.6 Distinct key frame generation
In this step, the distinct key frames are selected, and redundant key frames are removed from the key frame queue as follows:
If the current frame is dissimilar to the previous one:
Add the frame to the queue of distinct key frames.
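One plausible reading of this step is sketched below: each candidate in the key frame queue is kept only if it differs from every frame already kept, so repeated content is removed. This is an assumption about the procedure, not the chapter's exact rule. For self-containment the example uses a simple total variation distance between normalized histograms; any distance in [0, 1], such as the Bhattacharyya distance, could be passed instead. All names and the threshold are hypothetical.

```python
def total_variation(h1, h2):
    """Half the L1 distance between two normalized histograms (range 0 to 1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def distinct_key_frames(queue, threshold, distance=total_variation):
    """Drop queued candidates that are too similar to a frame already kept.

    queue: list of (frame_index, histogram) pairs in temporal order.
    """
    distinct = []
    for index, hist in queue:
        # Keep the candidate only if it differs from every kept frame.
        if all(distance(hist, kept) > threshold for _, kept in distinct):
            distinct.append((index, hist))
    return [index for index, _ in distinct]


candidates = [
    (0, [1.0, 0.0]),
    (12, [0.1, 0.9]),
    (30, [0.95, 0.05]),  # nearly the same content as frame 0
    (41, [0.5, 0.5]),
]
kept = distinct_key_frames(candidates, threshold=0.2)
```

Frame 30 is dropped because it almost repeats frame 0, even though it differs from its immediate predecessor in the queue.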
4.7 Key frame extraction using a convolution neural network
A CNN is composed of two basic parts: feature extraction and classification. Feature extraction consists of several convolution layers, each followed by max-pooling and an activation function. The classifier usually consists of fully connected layers, as shown in Figure 3.
The extracted distinct key frames are used as test queries in the classification phase. The features of the input frames are extracted using the CNN feature extraction module, and the learned features are matched against the distinct key frame features to find the best matching frame, which is returned as a key frame in the output in the form of a frame index number. The key frame extraction and the CNN stages run in parallel to improve efficiency.
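The matching of frame features against distinct key frame features can be illustrated without a trained network. In the sketch below, short hand-written vectors stand in for CNN feature vectors, and cosine similarity is assumed as the matching score; the chapter does not state which score is used, so both the score and the function names are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matching_frames(frame_features, key_features):
    """For each key frame feature, return the index of the most similar frame."""
    matches = []
    for key in key_features:
        best = max(
            range(len(frame_features)),
            key=lambda i: cosine_similarity(frame_features[i], key),
        )
        matches.append(best)
    return matches


# Placeholder vectors standing in for CNN features of four input frames.
frame_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
# Placeholder features of two distinct key frames used as queries.
key_features = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9]]
matched = best_matching_frames(frame_features, key_features)
```

The output is a list of frame index numbers, which matches the form of output described above.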
5. Experiment results and discussion
In this section, we evaluate the efficiency of the proposed method on a publicly available database and on our own human action database. The results show an improvement over conventional methods at a low time complexity. The experiments are discussed in the following subsections.
5.1 Dataset analysis
The performance of the key frame extraction technique was evaluated and compared with state-of-the-art methods using benchmark databases. Sample videos were taken from the benchmark databases and the human action database listed in Table 2.
|Data source||Purpose||# Image or video clips||Annotation||Environment||Ref.||Year|
|MIT||City street pedestrian segmentation, detection, and tracking||709 pedestrian images, 509 training and 200 test images||No annotated pedestrian||Daylight scenario|||
|Caltech Pedestrian dataset||Detection and tracking of pedestrian walking on the street||250,000 frames (in 137 approximately minute-long segments)||350,000 bounding boxes and 2300 unique pedestrians were annotated||Urban environment||||2012|
|GM-ATCI||Rear-view pedestrian segmentation, detection, and tracking||250 video sequences||200 K annotated pedestrian bounding boxes||Dataset was collected in both day and night scenarios, with different weather and lighting conditions||||2015|
|Daimler||Detection and tracking of pedestrian||15,560 pedestrian samples, 6744 negative samples||2D bounding box overlap criterion and float disparity map and a ground truth shape image||Urban environment||||2016|
|NICTA 2016||Segmentation, pose estimation, learning of pedestrian||25,551 unique pedestrians, 50,000 images||2D ground truth image||Urban environment||||2016|
|MS COCO 2018||Object detection, segmentation, key point detection, DensePose detection||300,000, 2 million instances, 80 object categories||5 captions per image||Urban environment||||2018|
|Mapillary Vistas dataset||Semantic understanding street scenes||25,000 images, 152 object categories||Pixel-accurate and instance-specific human annotations for understanding street scenes||Urban environment||||2017|
|MS COCO 2017||Recognition, segmentation, captioning||328,124 images, 1.5 million object instances||Segmented people and objects||Urban environment||||2017|
|MS COCO 2015||Recognition, segmentation||328,124 images, 80 object categories||Segmented people and objects||Urban environment||||2015|
|ETH||Segmentation, detection, tracking||Videos||The dataset consists of other traffic agents such as different cars and pedestrians||Urban environment||||2010|
|TUD-Brussels||Detection, tracking||1092 image pairs||1776 annotated pedestrian||Urban environment||||2009|
|INRIA||Detection, segmentation||498 images||Annotations are marked manually||Urban environment||||2005|
|Detection, tracking||60,000 frames||7900 annotated pedestrians||Urban environment||||2009|
|PASCAL VOC 2012||Detection, classification, segmentation||11,530 images, 20 object classes||27,450 ROI annotated 6929 segmentations||Urban environment||||2012|
|dataset (own DB)||Pedestrian behavior recorded in the college environment||50 human behavior datasets||No annotated pedestrian||Daylight scenario||—||—|
5.2 Computational complexity of the proposed system
As shown in Table 3, the proposed method achieves the highest recall (0.95) and precision (0.92) and the lowest CPU time (0.50 ms) among the compared techniques. The comparative analysis of the recall and precision metrics for each video sequence is shown in Figure 4, where the proposed approach achieves the highest recall and precision for all video sequences. A high value of one metric alone is generally not sufficient. Precision measures the ability of a technique to retrieve only relevant results, so a high precision means the extracted key frames are relevant. However, high precision can also be achieved by selecting very few key frames in a video sequence, which is why recall must be considered as well. Both speed and accuracy are important in a key frame extraction algorithm. If the algorithm is slow, the throughput of the system suffers. The extracted key frames must also be relevant and accurate, because they affect subsequent processes such as object detection, classification, and object description (Figures 5 and 6).
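The interplay between the two metrics can be seen in a small worked example. The sketch assumes the standard set-based definitions (precision is the fraction of extracted key frames that are correct, recall is the fraction of ground-truth key frames that were found) and exact index matching. The frame indices are invented for illustration and are not results from the experiments.

```python
def precision_recall(extracted, ground_truth):
    """Set-based precision and recall for extracted key frame indices."""
    extracted_set = set(extracted)
    truth_set = set(ground_truth)
    correct = len(extracted_set & truth_set)
    precision = correct / len(extracted_set) if extracted_set else 0.0
    recall = correct / len(truth_set) if truth_set else 0.0
    return precision, recall


# Four frames are extracted, all correct, but one true key frame is missed.
precision, recall = precision_recall(
    extracted=[3, 40, 75, 120],
    ground_truth=[3, 40, 75, 98, 120],
)
```

Here precision is perfect even though one key frame is missing, which shows why a high value of a single metric is not sufficient.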
|Type of features||Recall||Precision||CPU time (ms)|
|Proposed key frame extraction algorithm||0.95||0.92||0.50|
|Discrete cosine coefficients and rough sets theory based ||0.88||0.82||0.90|
|Content relative thresholding technique based ||0.80||0.81||0.80|
|Multi-scale color contrast, relative motion intensity, and relative motion consistency based ||0.83||0.80||0.90|
|Color and structure feature based ||0.80||0.86||0.98|
5.3 Qualitative result of frame extraction
Qualitative results of the proposed deep learning approach to key frame extraction are shown in Figure 7. The figure illustrates that relevant and non-redundant key frames are extracted from the video sequence. The dataset consists of seven suspicious student behaviors. The pedestrian behaviors were recorded at prominent places of the college during different academic activities.
This chapter describes and evaluates the methodologies, strategies, and stages involved in video key frame extraction. It also analyzes the issues and challenges of each key frame extraction method. According to the literature survey, most of the techniques proposed by earlier researchers can perform key frame extraction, but most of them fail to address the trade-off between accuracy and speed. The proposed framework improves key frame extraction irrespective of the video length, depending instead on the content type. This is made possible by the histogram-based comparison of video scene content combined with deep features from a convolutional neural network. The approach can generate key frames dynamically from any video sequence. We performed experiments on publicly available databases and obtained encouraging results.