Comparison of methods.
This work describes the recognition of human activity based on the interaction between people and objects in domestic settings, specifically in a kitchen. Unlike other proposals, it considers human activity as a process and does not rely on visual tracking. A video is a sequence of images. An ordered sequence of images can therefore be analyzed in terms of the objects present in each scene in order to infer the activity being performed. However, it is not enough to consider the objects present in the scene; it is necessary to determine whether those objects are used by the people present. If they are used, they are evidently necessary to carry out the activity; if they are not used, they only add noise to the recognized activity. Therefore, it is necessary to build a conceptualization of the objects in the scene with characteristics (definition of an object, motion detector, object recognition, object position, object action) that allow them to be recognized, their degree of use to be determined (unchanged, added, removed, moved, and indeterminate), and their influence on the possible recognized activity to be assessed.
- human activity
- computer vision
- human/computer interaction
- human/robot interaction
- feature extraction
- behavior representation
This study is part of a larger research project called InHands [1, 2]. The approach recognizes human activity without the traditional tracking of hand and arm movements. A short version of this research is shown in  and this work presents several improvements to the initial proposal presented in .
To introduce the field of human activity recognition that is of interest in this research, the definition of human/object interactions given in  is as follows:
“The most typical human-object interaction recognition approaches are the ones ignoring interplays between object recognition and motion estimation. In those works, objects are generally recognized first, and activities involving them are recognized by analyzing the objects’ motion. They have made the object recognition and motion estimation independent or made it so that the motion estimation is strictly dependent on the object recognition” [2, 4].
In addition to the definition quoted here, this chapter defines the structure and classification of human activity recognition, from which we extract the following:
This research applies the approach of human/object interactions, and included in these are more specific subtopics: syntactic and description-based.
In  a description-based proposal for activity recognition is presented. This methodology consists of motion detection and tracking complemented by event analysis; the motion detection part is of interest here, and the capture of images for our proposal builds on it. As in  we use a single camera for capture and extract the background for segmentation, but we use image difference.
Among the relevant works for establishing an activity recognition procedure is , in which each action event is assigned a symbol, so that a sequence of actions corresponds to a string of symbols. In our proposal we use words instead of symbols, and a set of words, regardless of their order, can form an activity.
Similar to  is the proposal of : a BOW (bag of words). A BOW treats an image like a paragraph in which the repetition of one or more words allows the content or essence of the text to be recognized. For us the analogue is the repetition of objects in the image: interaction with them indicates the activity that is developing.
For the InHands project, proactive assistance is very important. With this premise,  presents a probabilistic prediction of the actions carried out, which is precisely what our proposal intends to implement.
A similar work, in which a Kinect camera is employed, is . However, to increase the reliability of the recognition, radio-frequency identification tags are added there, which will not be implemented in our proposal.
The first information that system considers is hand and object tracking, followed by object and action recognition. Seven principal actions are defined: place, move, chop, mix, pour, spoon, and scoop .
An example of the results obtained is the preparation of a cake, for which seven objects, 17 actions, about 6000 frames, and approximately 200 seconds are used. For us the definition of actions is simpler. Therefore, our actions will be those explained in Section 2.1.5.
In this research, we consider that the results of activity recognition would be useful to provide proactive assistance. Therefore, the recognition of the activity should be determined while the activity is being carried out and with the aim of facilitating the robotic assistance considered in later stages of the InHands project.
Recognition is approached with computer vision methods. Specifically, it is based on the recognition of objects in the scene and on the interaction with these objects through their manipulation. However, the interaction with the objects does not involve tracking the objects or the arms or hands of the person performing the action. What matters in the recognition process is the initial and final positions of the objects and whether or not they are present.
2.1. Conceptualization of an object
In this proposal the conceptualization of an object is explained in five constitutive parts: definition of an object, motion detector, object recognition, object position, and object action.
2.1.1. Definition of an object
Considering the relevance of the objects and interaction with these, it is necessary to develop a definition of the object with parameters that allow the differentiation between them. Therefore, the chosen parameters are:
Identification number: this allows us to have a unique number for each object despite having similar characteristics such as color.
Color: this is the main characteristic that allows us to recognize the type of object present in the scene.
Position: this is given by the coordinates of the centroid of each object, taking as origin the lower left corner of the kitchen counter.
Actions with objects: four basic interactions have been defined with the objects by the user: add, remove, move, unchanged.
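The four parameters above can be grouped into a single record. The following is a minimal sketch, not the authors' implementation: the class and field names are hypothetical, and the color is stored as a plain label for simplicity. The UNDETERMINED state is the initial value mentioned in Section 2.1.5.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Action(Enum):
    """Interaction states of an object; UNDETERMINED is the initial state."""
    UNDETERMINED = "undetermined"
    UNCHANGED = "unchanged"
    MOVE = "move"
    ADD = "add"
    REMOVE = "remove"


@dataclass
class SceneObject:
    """Hypothetical container for the parameters that define an object."""
    object_id: int                     # unique identification number
    color: str                         # label of the recognized color class
    position: Tuple[float, float]      # centroid (x, y); origin at the lower
                                       # left corner of the kitchen counter
    action: Action = Action.UNDETERMINED


# Illustrative values only.
cup = SceneObject(object_id=7, color="red", position=(120.0, 340.5))
```

Two objects with the same color remain distinguishable through `object_id`, which is the role the identification number plays in the definition.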
2.1.2. Motion detector
Motion detection is important because it tells us that an activity is taking place in the scene. In addition, the objects in the scene (OBJECT RECOGNITION) and their positions (OBJECT POSITION) are acquired. Since we do not track, it is important to recapture the objects in the scene after the movements made by the person present. This subsequent capture avoids occlusions and lets us determine an action (OBJECT ACTION) for each object by checking, through its presence and position, whether it was moved, removed, added, or not used in the last action.
As shown in the flow chart of Figure 2, one of the most important methods applied in this part of the system is image difference, specifically the mixture-of-Gaussians method according to . Figure 2(a) shows the flow chart for motion detection and Figure 2(b) shows the three results after executing the motion detection algorithm.
This image difference allows us to extract the objects from the background, which is dynamically updated while the system is working. The motion detection algorithm continuously compares frames, using a minimum threshold to decide whether the variation between a frame at time t and the following frames at t + 1, t + 2, and t + 3 implies that an action was performed or is only visual noise.
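The thresholded frame comparison can be illustrated with a simplified sketch. It uses a plain mean absolute difference between grayscale frames rather than the mixture-of-Gaussians background model named in the text, and the function names and the threshold value are assumptions chosen for illustration.

```python
def frame_difference(frame_a, frame_b):
    """Mean absolute per-pixel difference between two grayscale frames
    given as equally sized lists of rows."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += abs(pixel_a - pixel_b)
            count += 1
    return total / count


def motion_detected(reference, following, threshold=10.0):
    """True if any of the following frames differs from the reference
    frame by more than the (hypothetical) threshold."""
    return any(frame_difference(reference, f) > threshold for f in following)


# Tiny 2x2 frames: sensor noise versus a real change in the scene.
still = [[10, 10], [10, 10]]
noisy = [[11, 9], [10, 12]]
moved = [[10, 200], [180, 10]]

print(motion_detected(still, [noisy, noisy, noisy]))  # False
print(motion_detected(still, [noisy, moved, noisy]))  # True
```

Small variations stay below the threshold and are treated as visual noise, while a large change in any of the three following frames is reported as motion.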
2.1.3. Object recognition
The object recognition process is developed in Figure 3. The first step is to get the mask by difference of images and apply it to the original image. The result of the application of the mask is the region of interest (ROI).
From each ROI we obtain a 10-bin histogram in RG chromaticity space (a two-dimensional color space that carries no intensity information). Working with RG chromaticity avoids problems caused by brightness. In addition, the images were normalized in RGB color space (in which a pixel is identified by the intensity of its red, green, and blue values), and black and white were excluded by thresholds.
A comparison of the new histograms obtained from the ROI was made against our database. The selected comparison method was the Bhattacharyya distance. As shown in Table 1, this method was selected for its lower error rate and low computation time.
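A chromaticity histogram of this kind can be sketched as follows. This is an illustration under stated assumptions: 10 bins are used per chromaticity axis (a 10 x 10 grid), and the `dark` and `bright` thresholds on R + G + B that exclude near-black and near-white pixels are hypothetical values, not those of the original system.

```python
def rg_histogram(pixels, bins=10, dark=30, bright=735):
    """Normalized 2-D histogram over (r, g) chromaticity.

    pixels: iterable of (R, G, B) tuples with channels in 0..255.
    dark, bright: assumed thresholds on R + G + B used to skip
    near-black and near-white pixels.
    """
    hist = [[0.0] * bins for _ in range(bins)]
    kept = 0
    for R, G, B in pixels:
        s = R + G + B
        if s <= dark or s >= bright:
            continue  # black or white pixel: chromaticity is unreliable
        r, g = R / s, G / s  # intensity is divided out here
        i = min(int(r * bins), bins - 1)
        j = min(int(g * bins), bins - 1)
        hist[i][j] += 1.0
        kept += 1
    if kept:
        hist = [[value / kept for value in row] for row in hist]
    return hist


# A bright red and a dark red pixel fall in the same bin;
# the black and white pixels are skipped.
sample = [(200, 0, 0), (100, 0, 0), (0, 0, 0), (255, 255, 255)]
hist = rg_histogram(sample)
```

Because r = R / (R + G + B) and g = G / (R + G + B), scaling a pixel's brightness leaves its bin unchanged, which is the property that makes this space robust to lighting changes.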
|Observations||Quick||Moderately fast, more accurate matches||Quick-and-dirty matching||Moderately fast, more accurate matches|
|Range [exact … mismatch]||[1.0 … -1.0]||[0.0 … 2.0]||[1.0 … 0.0]||[0.0 … 1.0]|
The object database holds five different perspectives of each object. For classification, k-NN (k-nearest neighbors) was chosen with K = 1.
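Histogram matching with the Bhattacharyya distance followed by a nearest-neighbor decision can be sketched as below. The sketch assumes flattened histograms that are already normalized to sum to 1; the function names and the toy database (two views per object instead of five) are hypothetical.

```python
import math


def bhattacharyya_distance(p, q):
    """Distance between two normalized histograms:
    0.0 for identical histograms, 1.0 for no overlap."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - overlap))


def classify_1nn(query, database):
    """Return the label of the stored view closest to the query (K = 1)."""
    best_label, _ = min(
        ((label, bhattacharyya_distance(query, view))
         for label, views in database.items()
         for view in views),
        key=lambda pair: pair[1],
    )
    return best_label


# Toy database: each object is stored under several perspectives.
database = {
    "cup": [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],
    "bowl": [[0.1, 0.2, 0.7], [0.2, 0.2, 0.6]],
}

label = classify_1nn([0.65, 0.25, 0.10], database)
```

Storing several perspectives per object lets the single nearest neighbor come from whichever stored view best resembles the current one.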
Evaluation of the aforementioned classifier was performed in a similar way to the proposal in . In detail, a confusion matrix is employed to measure the recognition of each of the objects. Forty images were used for validation and 30 for each test of a total of 1870 images. Some examples are shown in Table 2.
A probably better alternative for reducing object recognition time is . The YOLO detection system is a very fast detector that can process streaming video in real time with less than 25 ms of latency. However, YOLO is more prone to localization errors: for example, Fast R-CNN has 8.6% localization errors versus 19% for YOLO. As explained in Section 2.1.4, localization is an important part of the object definition in this proposal.
Other alternatives that take reduced time into account are [13, 14]. TensorFlow is a flexible system that can be used to express a wide variety of algorithms. For this proposal, its main advantage would be the capacity to distribute the object recognition process across many computational devices.
2.1.4. Object position
In this proposal the position of the object is one of the essential characteristics; this allows us to define the actions resulting from the interaction with it. If the position does not change between images it means that the object was not used (UNCHANGED); if on the other hand it changes it means that the object is necessary in the developed activity (MOVE).
As for the technical details the centroid is a pixel in the image; this pixel is positioned in the coordinate system of the image. Taking into account that the InHands project requires obtaining world coordinates to assist with robots, then these coordinates must be referenced to the kitchen counter.
The homography matrix H is used according to . Matrix H must be applied together with the intrinsic and extrinsic matrices from the camera calibration, which are explained in . The result of the homography is expressed in millimeters, and the final adjustment is made with rotation and translation matrices (Figure 4).
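Applying a homography to a centroid amounts to a matrix product in homogeneous coordinates followed by a division by the third component. The sketch below assumes H has already been estimated; the matrix values are invented for illustration and do not come from the actual calibration.

```python
def apply_homography(H, pixel):
    """Map an image pixel (u, v) to plane coordinates with a 3x3 matrix H."""
    u, v = pixel
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    return (x / w, y / w)  # back from homogeneous coordinates


# Hypothetical H: 2 mm per pixel plus an offset, no perspective distortion.
H = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, -5.0],
    [0.0, 0.0, 1.0],
]

centroid_mm = apply_homography(H, (100, 50))
print(centroid_mm)  # (210.0, 95.0)
```

With a real camera the last row of H is generally not (0, 0, 1), so the division by w is what accounts for perspective.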
2.1.5. Object action
To build the object with the previously obtained characteristics (ID number, color, centroid), we initially assign the state “UNDETERMINED” to its “Action” feature, see Figure 5(a). The human/object interaction is recorded in this Action feature, which can then take one of four values:
UNCHANGED: The object was present in the previous activity but it has not been moved, which implies that the object is present in the scene but it was not used in the activity.
MOVE: This indicates that the object was present in a previous scene and now changes position implying that it was used in the developing activity.
ADD: The object was not present in previous scene, and because it is now added to the scene it is assumed to be necessary for the developing activity.
REMOVE: An object present in previous scenes is no longer present. This suggests that it is no longer necessary for the developing activity.
To avoid detecting false movements caused by occlusions or errors in the calculation of the centroid a tolerance range was established; only movements greater than 5 mm are recorded.
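The four actions and the 5 mm tolerance can be expressed as a comparison between two captures of the scene. This is a minimal sketch assuming each capture is a dictionary from object identifier to centroid in millimeters; the function name and data layout are hypothetical.

```python
import math

TOLERANCE_MM = 5.0  # from the text: only movements greater than 5 mm count


def assign_actions(previous, current, tolerance=TOLERANCE_MM):
    """Compare two captures, each a dict {object_id: (x_mm, y_mm)},
    and return a dict {object_id: action}."""
    actions = {}
    for obj_id, (x, y) in current.items():
        if obj_id not in previous:
            actions[obj_id] = "ADD"
            continue
        prev_x, prev_y = previous[obj_id]
        displacement = math.hypot(x - prev_x, y - prev_y)
        actions[obj_id] = "MOVE" if displacement > tolerance else "UNCHANGED"
    for obj_id in previous:
        if obj_id not in current:
            actions[obj_id] = "REMOVE"
    return actions


previous = {1: (100.0, 100.0), 2: (200.0, 50.0), 3: (300.0, 300.0)}
current = {1: (102.0, 101.0), 2: (260.0, 50.0), 4: (50.0, 50.0)}

actions = assign_actions(previous, current)
```

Object 1 shifts by about 2.2 mm, which is inside the tolerance, so it is reported as UNCHANGED rather than MOVE.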
2.2. Human activity recognition
The taxonomy proposed in  helps to situate the approach of this research. More specifically, the approach falls under hierarchical approaches and has much in common with the syntactic and description-based views. In particular, the human/object interaction approach uses a syntax to define human activity, but it does not need to consider order, sequences, or logical structure (Figure 6).
A methodology that is important to compare against is . BOW collects features by assigning each one to the nearest word and counting the frequency of occurrence of that word in the images. In our proposal, each object can be a word and the repetition of these words is relevant for determining the activity, but unlike  it is not necessary to consider the sequence in which words occur.
2.2.1. Definition of an activity
How do you define an activity? This is one of the questions that arose during the project; in this case a recipe inspires it, so we will use ingredients, kitchen tools, and possible substitutes to define an activity, see Figure 7(a).
INGREDIENTS: This refers to objects considered as ingredients for the preparation of a recipe, e.g. for a cereal-activity (cereal, milk).
TOOLS: This includes cooking utensils and cutlery necessary for the elaboration of the recipes-activities, e.g. cereal-activity (bowl, spoon).
SUBSTITUTES: These are kitchen utensils that could be replaced by others that perform a similar function, e.g. a cup for a glass.
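An activity defined in this recipe-like way can be represented as three sets of object names. The sketch below is illustrative only: the class name, the `role_of` helper, and the substitute chosen for the juice activity are assumptions, while the cereal example follows the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical recipe-style definition of an activity."""
    name: str
    ingredients: frozenset
    tools: frozenset
    substitutes: frozenset = frozenset()

    def role_of(self, obj):
        """Return the role an object plays in this activity, or None."""
        if obj in self.ingredients:
            return "ingredient"
        if obj in self.tools:
            return "tool"
        if obj in self.substitutes:
            return "substitute"
        return None


cereal = Activity("cereal", frozenset({"cereal", "milk"}),
                  frozenset({"bowl", "spoon"}))
# Assumed for illustration: a cup may stand in for the glass.
juice = Activity("juice", frozenset({"juice"}),
                 frozenset({"glass"}), frozenset({"cup"}))
```

Keeping the three categories as unordered sets reflects the point that the order in which objects appear is not part of the definition.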
2.2.2. Evaluation function and activity recognition
To start the activity recognition we group the objects into three lists according to their action: a list of moved objects (MOVE), a list of objects added to the scene (ADD), and a list of objects that are present in the scene but have not been moved or removed (UNCHANGED).
The first evaluation calculates the contribution of each ingredient, utensil, and substitute that has undergone a movement, with the contributions of ingredients, utensils, and substitutes weighted by the constants a, b, and c, respectively. The result is the value that these objects provide for each of the probable activities (Eq. 1).
The second evaluation calculates the contribution of each ingredient, utensil, and substitute that has been added to the scene. As in the first evaluation, the contributions of ingredients, utensils, and substitutes are weighted by the constants a, b, and c, respectively. The result is the value that these objects provide for each of the probable activities (Eq. 2).
The third evaluation is exactly the same in its procedure as the two previously made, with the only difference being that the objects that intervene here are the ones that have remained in the scene without any change, i.e. they have not been moved, added, or removed (Eq. 3).
It is important to emphasize that an object can be a utensil or a substitute depending on the activity; for example, a glass is a utensil if the activity is to prepare juice (activity 1) but a substitute if the activity is to prepare coffee (activity N). The terms of Eqs. (1)–(3) are:
- Result for each activity from 1 to N.
- Recognized objects that are considered ingredients for each activity.
- Recognized objects that are considered utensils for each activity.
- Recognized objects that are considered substitutes for each activity.
- Action of the list being evaluated: MOVE, ADD, or UNCHANGED.
- Constants a, b, and c for tuning the contribution of each category.
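One possible reading of the three evaluations is a weighted count per activity. The sketch below assumes that each evaluation has the form a × (ingredients found) + b × (utensils found) + c × (substitutes found) for the objects sharing one action; this form, the function name, the weight values, and the activity definitions are assumptions for illustration and may differ from the actual equations.

```python
def evaluate(objects, activities, a=1.0, b=0.5, c=0.25):
    """Score every activity from the objects that share one action
    (for example, all objects marked MOVE).

    Assumed form: a * ingredients + b * utensils + c * substitutes.
    """
    scores = {}
    for name, roles in activities.items():
        n_ingredients = sum(1 for o in objects if o in roles["ingredients"])
        n_utensils = sum(1 for o in objects if o in roles["utensils"])
        n_substitutes = sum(1 for o in objects if o in roles["substitutes"])
        scores[name] = a * n_ingredients + b * n_utensils + c * n_substitutes
    return scores


# Hypothetical definitions: the glass is a utensil for juice
# but only a substitute for coffee.
activities = {
    "juice": {"ingredients": {"juice"}, "utensils": {"glass"},
              "substitutes": {"cup"}},
    "coffee": {"ingredients": {"coffee", "sugar"},
               "utensils": {"cup", "spoon"}, "substitutes": {"glass"}},
}

moved = ["juice", "glass"]
print(evaluate(moved, activities))  # {'juice': 1.5, 'coffee': 0.25}
```

The same function would be called once per action list (MOVE, ADD, UNCHANGED), giving three score tables for the same set of activities.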
The fourth evaluation adds the activity lists resulting from Eqs. (1)–(3). Explicitly, the result of Eq. (4), the time-weighted sum of the lists, adds the probable activities that result from moving, adding, or not changing objects in a scene. This result corresponds to the probable instantaneous activity, i.e. the activity in the last 1 to 4 frames.
- Summation of the value for each activity (over the last 1 to 4 frames).
- Variables that change over time.
- The elapsed time from the start of an activity. Initial value is set to , later decreased.
- Average time for the execution of any predefined activity.
These factors weight the contributions of the results of Eqs. (1)–(3). Note that activities resulting from actions such as moving (MOVE) or adding (ADD) objects to the scene are more relevant over time than the result of just keeping objects in the scene with no position changes (UNCHANGED).
The results of Eq. (4) represent activities recognized in the last four frames, so they in no way represent a global or final result of the recognized activity.
To obtain a more reliable result we must add up a considerable number of results of Eq. (4); the sum gives a statistically reliable result for the recognized activity, and Eq. (8) takes the maximum value as the activity recognized.
The sum runs over the total number of samples of results of Eq. (4) taken during the average time of the recognized activity.
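The two stages, an instantaneous weighted sum followed by accumulation and selection of the maximum, can be sketched as follows. This is a simplification under stated assumptions: the time-dependent factors are replaced by fixed hypothetical weights, and the function names and sample scores are invented for illustration.

```python
def instantaneous(move, add, unchanged,
                  w_move=1.0, w_add=1.0, w_unchanged=0.5):
    """Weighted sum of the three per-activity score tables for one
    window of 1 to 4 frames. The weights are fixed here, whereas the
    text describes factors that change over time."""
    return {
        activity: (w_move * move[activity]
                   + w_add * add[activity]
                   + w_unchanged * unchanged[activity])
        for activity in move
    }


def recognize(window_scores):
    """Accumulate instantaneous scores over many windows and return
    the activity with the largest total, together with all totals."""
    totals = {}
    for scores in window_scores:
        for activity, value in scores.items():
            totals[activity] = totals.get(activity, 0.0) + value
    best = max(totals, key=totals.get)
    return best, totals


# The second window alone would suggest coffee (an erroneous partial
# result), but the accumulated totals still favor juice.
windows = [
    {"juice": 1.5, "coffee": 0.25},
    {"juice": 0.0, "coffee": 1.0},
    {"juice": 2.0, "coffee": 0.5},
]

best, totals = recognize(windows)
print(best)  # juice
```

Summing over many windows is what filters out isolated misrecognitions caused by occlusions or lighting changes in individual frames.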
The complete system that is obtained to perform our proposal of activity recognition is outlined in Figure 8; each of its constitutive parts was explained in the previous sections.
In this research a Kinect color camera was employed; depth information was not used. The people performing the cooking activities in Figure 9(a) did so spontaneously, without any training or prepared script. The system was tested with four typical breakfast activities: brewing coffee, juice preparation, cereal preparation, and chocolate preparation.
The ingredients, cooking utensils, and substitutes are shown in Figure 9(b) as follows: coffee, sugar, chocolate, juice, cereal, milk, bowl, cup, glass, plate, and spoon.
The first tests of the proposed activity recognition system used five videos for each of the four activities. The results were excellent: after 600 frames the activity being executed could be clearly differentiated, see Figure 10(b). Partial results of Eq. (4) are illustrated in Figure 10(a). It should be mentioned that the 1-to-4-frame results illustrated in Figure 10(a) suffer from occlusions and changes in lighting that hinder accurate recognition of objects; however, thanks to the evaluation of Eq. (8), the erroneous partial results can be filtered out to obtain a correct overall result.
Proactive support is one of the objectives of the InHands project, so our system should be able to recognize activities without having to be segmented, i.e. in a normal sequence of events recognize the different activities that are being developed.
As mentioned, the following test phase consisted of presenting several activities without segmentation and checking whether the system was capable of recognizing them. Figure 11 illustrates one of the tests, with unsegmented preparation of juice, cereal, and coffee in no preconfigured order. As evidenced in Figure 11(a), the instantaneous activity recognition (1 to 4 frames) does not clearly show the activity performed, but after processing with Eq. (8) the result is clear (Figure 11(b)). This demonstrates that the system functions continuously without the need to segment activities. The final activity is coffee preparation, with satisfactory performance from the beginning. Figure 12 shows a sample frame of our video process for the recognition of objects and activities .
This proposal for recognition of human activity is different from others based on tracking as shown in . It is based on the preparation of cooking recipes, considering the interaction of the user with the objects present in the scene. In a detailed way, after the recognition of objects we classify them in categories (ingredients, utensils, and substitutes) and for the interaction we define four actions (add, remove, move, and unchanged) weighting the contribution of objects and interactions to determine possible activity.
Among its notable features: it does not use methods that are intrusive for the user, and it requires an average time of 0.25 s for an instantaneous recognition of activity [18, 19, 20, 21, 22]. In all the tests, recognition succeeded despite problems of brightness and occlusion, allowing completely natural movements of the user.
The system is flexible and scalable by simply adding more activity definitions (recipes). The system works continuously with no default activity segmentation.
Future work would be to define with statistical methods the weighting constants here designated as a, b, c,.
This research was supported by the InHands project (Interactive robotics for Human Assistance in Domestic Scenarios), grant P6-L13-AL.INHAND funded by Fundació La Caixa, inside the Recercaixa research program .
C. Flores and J. Aranda are associated with the Institute for Bioengineering of Catalunya and Universitat Politècnica de Catalunya, Barcelona-Tech, Spain .