Open access peer-reviewed chapter

Stereo-Measurement of 3D Poses of a Walking Person with a Puppet Model and Low-Rank Filter

Written By

Hiroshi Hanaizumi

Submitted: 18 November 2022 Reviewed: 26 November 2022 Published: 11 January 2023

DOI: 10.5772/intechopen.1000850

From the Edited Volume

Model Predictive Control - Theory and Applications

Constantin Volosencu


Abstract

In order to achieve safety and security in public places by recognizing people’s behavior, we have to measure a person’s pose and capture his or her movement. Conventionally, it was hard to recognize the person itself before capturing the pose, and research was instead based on silhouettes obtained by techniques such as background subtraction. Efficient recognition was difficult with silhouettes because they are greatly affected by the observation direction. This difficulty disappears if we adopt three-dimensional processing, and some methods using Kinect have been proposed; unfortunately, their performance was limited by the usage environment. Here, OpenPose is applied to stereo-pair images to obtain the two-dimensional coordinates of the joint points. These coordinates may fluctuate depending on the lighting. Assuming the fluctuation follows a Gaussian distribution, the reconstructed three-dimensional coordinates are contaminated by a larger noise component following a Cauchy distribution. Since the Cauchy distribution has no defined mean or variance, the desired smoothing cannot be achieved even by averaging. The method described here provides stable poses that can withstand subsequent use by applying a low-rank filter and a puppet model in which the distance between joint points is invariant.

Keywords

  • stereo-camera
  • OpenPose
  • joint position
  • noise reduction
  • Cauchy distribution

1. Introduction

Recognizing the behaviors of people in public spaces such as train stations is very important to ensure our safety and security. For example, if a person feels sick, falls, or has trouble with another person while walking, we can automatically recognize it and help the person immediately. A person’s behavior can be defined as a time-trace pattern of the person’s poses. In other words, if the pose can be quantified, the behavior can be recognized as a pattern of temporal change. For the quantification, it is first essential to be able to measure the pose with high precision. There have been many methods for recognizing human poses [1, 2, 3]. They can be divided into two groups: silhouette-based methods and joint-based ones. In the former group, human silhouettes were obtained, for example, by background subtraction or by calculating optical flow [4], and were used to analyze the behavior. In the latter, joint positions measured by Kinect were used [5, 6, 7, 8]. The silhouette-based methods, however, had difficulties in estimating the poses: it was not easy to find feature points describing the pose from a silhouette. On the other hand, the joint-based methods easily described the pose as the placement of the joint points. In particular, Kinect gave us their three-dimensional positions and made the analysis easier. Unfortunately, Kinect projects a dot pattern to obtain the distance to a target, so it suffers from problems such as a limited measurable distance and poor performance outdoors. OpenPose [9] is another solution. It gives us 25 keypoints, including the human joints, on an image. The two-dimensional coordinates are useful for describing poses in a walk crossing the optical axis perpendicularly [10]. The OpenPose-based method, however, has problems similar to the silhouette-based one: the change in the apparent size of the pose with distance makes the analysis more complicated. Some solution has been required.

In this paper, we propose a measurement method for obtaining the three-dimensional joint coordinates of a human pose by applying OpenPose to a sequence of stereo-pair images. We obtain the three-dimensional coordinates of each joint from the two-dimensional coordinate pair extracted by OpenPose in the two images of the stereo pair. This means that we quantify the pose of a person in three-dimensional space with a small number of joints. Recently, OpenPose has provided three-dimensional keypoint features in a demo; it requires a specific multi-camera system, however, and poses some barriers to general motion measurement. Here we construct a software system for obtaining three-dimensional joint coordinates using a single general-purpose stereo camera. Stereo measurement also has the advantage of acquiring a person’s pose from a distance, so we can measure the poses of multiple persons simultaneously over a wide area. The joint coordinates reconstructed from stereo-pair two-dimensional ones are severely affected by missing values and/or fluctuations contained in the OpenPose output, so noise reduction is essential in the proposed method. When we obtain the three-dimensional coordinates, the parallax between the stereo-pair images appears in the denominator of the stereo-vision formula. Therefore, Gaussian fluctuation in the stereo-pair images appears as a larger noise component following a Cauchy distribution in the three-dimensional coordinates. Since the mean and the variance of the Cauchy distribution are undefined, simple averaging, effective against Gaussian noise, does not achieve the expected smoothing. We need another approach for noise reduction in the three-dimensional coordinates. We regard walking as a kind of periodic movement with a spatial correlation structure among all joints. Therefore, we focus on their dominant mode as the signal component to be measured.
We realize the noise reduction by extracting the low-rank component [11] of the time sequence of joint coordinates recognized by OpenPose. We call it a low-rank filter (LRF). Applying the LRF separately to each joint’s data may destroy the spatial structure. Therefore, we compensate the joint coordinates using a puppet model [12], introduced from the fact that every bone length, that is, the distance between two joints, is invariant for any pose. The severe noise in the measured joint coordinates yields many candidate poses, including ones that are meaningless as human postures. The puppet model enables us to eliminate the meaningless poses by fixing the lengths of the upper and lower limbs without restricting the angle between them. The length fixing acts as a kind of constraint. Thus, knowledge about the object makes meaningful measurement and recognition possible even under adverse conditions. In the following sections, we describe the principle and the procedure of the method. Application results for an actual walking person are also shown.


2. Principle

In order to realize direction-free analysis of human walking, we consider a remote joint-based measurement using stereo vision. In the conversion from the joint coordinates in the stereo-pair images to three-dimensional ones, Gaussian noise expands into much larger noise following a Cauchy distribution. Therefore, the main task of this research is to reduce this severe noise and obtain accurate three-dimensional joint coordinates. Human joints do not move randomly but with a mutual spatial correlation structure. From this fact, we introduce a puppet model that acts to restrict each joint position to a plausible area.

2.1 World and camera coordinate system

Figure 1 shows the world and camera coordinate systems used here. A stereo-camera pair C1 and C2 is calibrated (distortion correction and rectification are performed) so that their optical axes are parallel, they share the same focal length, and they are separated by the baseline length B. The cameras are installed so that the center of camera C1 coincides with the origin of the world coordinate system.

Figure 1.

World and camera coordinate systems.

A point P(x, y, z) in the world coordinate system is observed as image points p1(u_l, v_l) and p2(u_r, v_r) on cameras C1 and C2, respectively. The coordinates (x, y, z) are obtained as

$$x=\frac{u_l B}{u_l-u_r},\qquad y=\frac{v_l B}{u_l-u_r},\qquad z=\frac{fB}{u_l-u_r},\tag{E1}$$

where f is the focal length and B the baseline length. The camera coordinates (u_l, v_l) and (u_r, v_r) are also calculated from (x, y, z) as

$$u_l=\frac{fx}{z},\qquad u_r=\frac{f(x-B)}{z},\qquad v_l=v_r=\frac{fy}{z}.\tag{E2}$$

Gaussian noise contained in the joint coordinates u_l, v_l, u_r, and v_r observed on the stereo-pair images appears in both the numerator and the denominator of Eq. (1). Since the ratio of two normally distributed random variables with mean 0 and different variances follows a Cauchy distribution [13], the obtained three-dimensional joint coordinates x, y, and z are severely affected by noise following a Cauchy distribution. Because the first and second moments of the Cauchy distribution diverge, its mean and variance are not defined. This means that simple averaging does not achieve the expected smoothing.
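The failure of simple averaging can be checked numerically: the ratio of two independent zero-mean Gaussian variables follows a standard Cauchy distribution, whose sample mean never settles, while order statistics such as the median remain stable. A minimal NumPy sketch (illustrative, not from the chapter):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Ratio of two independent zero-mean Gaussians: standard Cauchy samples.
samples = rng.normal(size=n) / rng.normal(size=n)

# Order statistics are stable: the median estimates the location (0),
# and the interquartile range estimates twice the scale (quartiles at +/-1).
median = np.median(samples)
q25, q75 = np.percentile(samples, [25, 75])

# The sample mean, by contrast, is dominated by rare near-zero denominators
# that produce enormous samples; it does not converge as n grows.
print(median, q75 - q25, samples.mean())
```

With the seed above, the median and interquartile range stay near 0 and 2, while the sample mean keeps jumping from run to run; this is exactly why robust, structure-aware filtering is needed instead of averaging.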

2.2 OpenPose and keypoint coordinates

OpenPose is the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints [9]. Here we use it as a human joint detector and adopt 15 of the keypoints in its output format BODY_25 for human walking analysis, as shown in Figure 2. OpenPose provides each keypoint with a detection confidence in a JSON file. The confidence takes a value between 0 and 1; the higher the value, the more reliable the detection. When a joint is hidden by the body, its confidence is 0, and its coordinates are also 0. Missing coordinates, such as those due to joint occlusion, therefore yield large noise, and we have to complete the missing values first. OpenPose may also produce incorrect detections, such as leg swapping, arm swapping, leg duplication, and arm duplication, due to poor lighting conditions during image capture. In those cases the confidence takes lower but non-zero values, so such errors are hard to detect from the confidence alone. Both swapping and duplication also cause large noise, so we need algorithms to detect and correct them.

Figure 2.

OpenPose keypoints in BODY_25.

2.2.1 Missing value interpolation

Our strategy for interpolating the missing values is to fill them in a simple way first and then smooth them while preserving features. Let s_1, …, s_i, …, s_j, …, s_n be a sequence of measured coordinates with some missing values such that

$$s_k=\begin{cases}0 & (i\le k\le j)\\ \ne 0 & (\text{others}).\end{cases}\tag{E3}$$

These missing values are interpolated by internal division between the non-zero-confidence observations at both ends of the gap as

$$s_k=\frac{s_{i-1}(j-k+1)+s_{j+1}(k-i+1)}{j-i+2}\qquad(i\le k\le j).\tag{E4}$$

When j = n, we use s_k = s_{i-1} (i ≤ k ≤ n), and when i = 1, we use s_k = s_{j+1} (1 ≤ k ≤ j).
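As a sketch, the gap filling of Eqs. (3) and (4), including the end-of-sequence rules above, can be implemented as follows (illustrative code, assuming zero-valued samples mark zero-confidence frames):

```python
import numpy as np

def fill_missing(s):
    """Interpolate zero-confidence (zero-valued) runs in a 1-D sequence.

    Interior gaps are filled by the internal-division formula of Eq. (4);
    gaps touching either end are held at the nearest non-zero value.
    """
    s = np.asarray(s, dtype=float).copy()
    n = len(s)
    k = 0
    while k < n:
        if s[k] == 0:                           # start of a missing run [i, j]
            i = k
            j = i
            while j + 1 < n and s[j + 1] == 0:
                j += 1
            if i == 0 and j == n - 1:
                pass                            # nothing observed at all
            elif i == 0:                        # gap at the head: hold s[j+1]
                s[i:j + 1] = s[j + 1]
            elif j == n - 1:                    # gap at the tail: hold s[i-1]
                s[i:j + 1] = s[i - 1]
            else:                               # interior gap: Eq. (4)
                for t in range(i, j + 1):
                    s[t] = (s[i - 1] * (j - t + 1)
                            + s[j + 1] * (t - i + 1)) / (j - i + 2)
            k = j + 1
        else:
            k += 1
    return s
```

For a single-frame gap, Eq. (4) reduces to the average of its two neighbors, and a tail gap is held at the last non-zero value, which matches the behavior reported in Section 4.2.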

2.2.2 Keypoint swapping and duplication

In wrong recognitions, the left and right legs may be duplicated as one leg, and leg swapping also occurs frequently. In normal walking, one leg is swung to the front and fixed on the floor, then the other is swung and fixed, alternately. The left and right ankles draw continuous trajectories that accurately reflect these movements. When leg swapping and/or duplication occurs, discontinuities appear in the trajectories, so we detect these errors by finding the discontinuities: a swap is detected as a discontinuity at the same position on both the left and right ankle trajectories, and a duplication as a discontinuity on only one of them. For simplicity, consider a person walking parallel to the x-axis as shown in Figure 2. Let the x-coordinate vectors of the left and right ankles be s_L = (s_L^1, s_L^2, …, s_L^n)^T and s_R = (s_R^1, s_R^2, …, s_R^n)^T for n observations. Their elements s_L^i and s_R^i (i = 1, …, n) are positions relative to the keypoint mid-hip. A swap is detected at the i-th element when δ_L^i/w > threshold and δ_R^i/w > threshold, where w is a normalizer reflecting the stride length of the walk:

$$w=\min\bigl(\max(s_L)-\min(s_L),\ \max(s_R)-\min(s_R)\bigr),\tag{E5}$$
$$\delta_L^i=s_L^i-s_L^{i-1},\tag{E6}$$
$$\delta_R^i=s_R^i-s_R^{i-1}.\tag{E7}$$
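The detection rule of Eqs. (5)–(7) can be sketched as follows; the default threshold 0.3 is the value found experimentally in Section 4.2, and frames where only one trajectory jumps are flagged as duplications (illustrative code):

```python
import numpy as np

def detect_discontinuities(sL, sR, threshold=0.3):
    """Classify frames with leg swapping or duplication.

    sL, sR: x-coordinates of the left/right ankles relative to the mid-hip.
    Returns (swap_frames, dup_frames) as lists of frame indices.
    A swap is a jump on BOTH trajectories (Eqs. 5-7); a duplication is a
    jump on only one of them.
    """
    sL = np.asarray(sL, float)
    sR = np.asarray(sR, float)
    # Eq. (5): stride-length normalizer
    w = min(sL.max() - sL.min(), sR.max() - sR.min())
    dL = np.abs(np.diff(sL)) / w        # |Eq. (6)| normalized
    dR = np.abs(np.diff(sR)) / w        # |Eq. (7)| normalized
    swaps, dups = [], []
    for i in range(len(dL)):
        jump_L, jump_R = dL[i] > threshold, dR[i] > threshold
        if jump_L and jump_R:
            swaps.append(i + 1)
        elif jump_L or jump_R:
            dups.append(i + 1)
    return swaps, dups
```

A detected swap is then corrected by exchanging the two ankle trajectories from that frame on, and a duplication by re-interpolating the affected trajectory as in Section 2.2.1.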

2.2.3 Vertical averaging

As already mentioned, the stereo cameras are parallelized by the calibration. Therefore, there should be no difference in the vertical coordinates of corresponding joints in the stereo-pair images. If a difference exists, it is considered to be due to noise, and we replace both vertical coordinates with their average value. Thus, the OpenPose joint data are completed.

2.3 Pose vector and behavior matrix

In this study, we define behavior as the temporal change of poses. Therefore, it is necessary to quantify each pose in order to describe the behavior. We define the keypoint vector q_t and the pose vector p_t independently of the dimension. The former is described as

$$q_t=\bigl(u_0^t,\ v_0^t,\ u_1^t,\ v_1^t,\ \ldots,\ u_{m-1}^t,\ v_{m-1}^t\bigr)^T,\tag{E8}$$

on an image or

$$q_t=\bigl(x_0^t,\ y_0^t,\ z_0^t,\ x_1^t,\ y_1^t,\ z_1^t,\ \ldots,\ x_{m-1}^t,\ y_{m-1}^t,\ z_{m-1}^t\bigr)^T,\tag{E9}$$

in the world coordinate system, where (u_i, v_i) and (x_i, y_i, z_i) (i = 0, …, m−1) are the two- and three-dimensional coordinates of the keypoints, respectively, and m = 15. Of course, since a 2D pose strongly depends on the direction of observation, the 3D pose vector, which has no such dependence, is the valid one; when we simply refer to the pose vector, we mean the 3D one. Using coordinates relative to a reference keypoint, the latter is represented as

$$p_t=\bigl(u_0^t-u_8^t,\ v_0^t-v_8^t,\ \ldots,\ u_{m-1}^t-u_8^t,\ v_{m-1}^t-v_8^t\bigr)^T,\tag{E10}$$

on an image or

$$p_t=\bigl(x_0^t-x_8^t,\ y_0^t-y_8^t,\ z_0^t-z_8^t,\ \ldots,\ x_{m-1}^t-x_8^t,\ y_{m-1}^t-y_8^t,\ z_{m-1}^t-z_8^t\bigr)^T,\tag{E11}$$

in the world coordinate system, where (u_8, v_8) and (x_8, y_8, z_8) are the coordinates of the keypoint mid-hip at time t; p_t does not include the mid-hip itself while q_t does. We consider that the keypoint mid-hip is located near the true centroid of the human body, so we use it instead of the centroid. Thus, the pose vector represents the relative pose regardless of position. Finally, we define the keypoint matrix Q_n and the behavior matrix P_n for n observations as

$$Q_n=\bigl(q_0\ \ q_1\ \cdots\ q_{n-1}\bigr),\qquad P_n=\bigl(p_0-\mu\ \ p_1-\mu\ \cdots\ p_{n-1}-\mu\bigr),\qquad \mu=\frac{1}{n}\sum_{i=0}^{n-1}p_i.\tag{E12}$$

The covariance matrix of the behavior matrix is also obtained as

$$\Sigma=\frac{1}{n-1}P_n^T P_n.\tag{E13}$$

Since the elements within the covariance matrix represent the degree of correlation between corresponding coordinates, we consider that it includes important information about the spatial correlation structure of joint movements.
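The construction of Eqs. (11)–(13) can be sketched as follows (illustrative code; here the pose vectors are stored as rows rather than columns, which transposes Eq. (12) but leaves the covariance of Eq. (13) unchanged):

```python
import numpy as np

def behavior_covariance(keypoints, ref=8):
    """Build pose vectors (Eq. 11), the behavior matrix (Eq. 12), and its
    covariance (Eq. 13).

    keypoints: array of shape (n_frames, m, 3) of 3-D joint coordinates,
    with joint `ref` (mid-hip) used as the origin of each pose vector.
    Returns (P, cov): the n x 3(m-1) matrix of pose vectors and the
    covariance matrix of the centered behavior matrix.
    """
    kp = np.asarray(keypoints, float)
    n, m, _ = kp.shape
    rel = kp - kp[:, ref:ref + 1, :]      # coordinates relative to mid-hip
    rel = np.delete(rel, ref, axis=1)     # p_t excludes the mid-hip itself
    P = rel.reshape(n, -1)                # rows are pose vectors p_t
    mu = P.mean(axis=0)                   # mean pose of Eq. (12)
    Pn = P - mu                           # centered behavior matrix
    cov = Pn.T @ Pn / (n - 1)             # Eq. (13)
    return P, cov
```

With m = 15 keypoints, each pose vector has 42 elements (14 joints times three coordinates), so the covariance matrix is 42 × 42 and encodes the spatial correlation structure among the joint movements.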

2.4 Auto-regression model and low-rank filter

OpenPose may yield some fluctuations in its recognitions, and the pose vectors still contain the resulting noise. Since it severely affects the three-dimensional coordinates, we have to reduce the noise in the pose vectors. On the other hand, the coordinates have a correlation structure: they do not change abruptly in time while keeping their relative positions. Therefore, we realize noise reduction by extracting low-rank components from the observed time-sequential joint coordinates. The one-dimensional coordinate sequence is first converted into a two-dimensional matrix, a rank-reduction operation is applied to that matrix, and the result is converted back into the original one-dimensional sequence. To perform this process, we need an interconversion algorithm between a one-dimensional signal and a two-dimensional matrix. Let U_k be the temporal observation vector of positions of the k-th joint of a moving person on the image:

$$U_k=\bigl(u_k^0,\ u_k^1,\ \ldots,\ u_k^n\bigr)^T,\tag{E14}$$

where the superscripts denote time, that is, the frame index. Since the movement is neither very fast nor random, the values are correlated with each other. This means that they fit an auto-regressive (AR) model of length m:

$$\begin{aligned}
u_k^m &= a_0u_k^0+a_1u_k^1+\cdots+a_{m-1}u_k^{m-1}\\
u_k^{m+1} &= a_0u_k^1+a_1u_k^2+\cdots+a_{m-1}u_k^{m}\\
&\ \,\vdots\\
u_k^n &= a_0u_k^{n-m}+a_1u_k^{n-m+1}+\cdots+a_{m-1}u_k^{n-1}.
\end{aligned}\tag{E15}$$

These equations show that the current value of a one-dimensional signal is determined by the values of the m signals preceding it. Using matrix format, we rewrite Eq. (15) as

$$\begin{pmatrix}
u_k^0 & u_k^1 & \cdots & u_k^{m-1}\\
u_k^1 & u_k^2 & \cdots & u_k^{m}\\
\vdots & \vdots & & \vdots\\
u_k^{n-m} & u_k^{n-m+1} & \cdots & u_k^{n-1}
\end{pmatrix}
\begin{pmatrix}a_0\\ a_1\\ \vdots\\ a_{m-1}\end{pmatrix}
=
\begin{pmatrix}u_k^m\\ u_k^{m+1}\\ \vdots\\ u_k^n\end{pmatrix},\tag{E16}$$

and for short as

$$XA=Y,\tag{E17}$$

where the matrix X is called a Hankel matrix. The vector Y is calculated from the matrix X, and X can also be reconstructed from Y. Considering that the observations consist of a dominant signal component with some correlation structure plus noise, Eq. (17) can be written as

$$(X_s+X_n)A=Y_s+Y_n.\tag{E18}$$

Here we realize the low-rank filter for the vector Y_s by separating X_s, with rank r_f, from X_n:

$$X_sA_s=Y_s,\qquad \operatorname{rank}(X_s)=r_f.\tag{E19}$$

Since the coefficient vector A_s in Eq. (19) is different from A in Eq. (17), an iterative algorithm is needed for the calculation [11].
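The chapter's iterative algorithm follows Ref. [11]; a generic stand-in with the same interconversion structure is Cadzow-style iteration: Hankel-ize the sequence, truncate the matrix rank by SVD, average the anti-diagonals back into a sequence, and repeat. A sketch, not the authors' exact code:

```python
import numpy as np

def low_rank_filter(u, m=12, rank=1, n_iter=20):
    """Cadzow-style low-rank filtering of a 1-D coordinate sequence.

    The sequence is Hankel-ized (as in Eq. 16), the matrix is truncated to
    the given rank by SVD, and the anti-diagonals are averaged to return
    to a 1-D sequence; the process is iterated.  m = 12 and rank = 1 are
    the values used for the walking data in Section 4.2.
    """
    s = np.asarray(u, float).copy()
    n = len(s)
    for _ in range(n_iter):
        # Hankel matrix: rows are consecutive length-m windows of the signal
        H = np.lib.stride_tricks.sliding_window_view(s, m).copy()
        U, sv, Vt = np.linalg.svd(H, full_matrices=False)
        sv[rank:] = 0.0                       # keep only the dominant modes
        Hr = (U * sv) @ Vt
        # De-Hankelize: average all entries mapping to the same time index
        out = np.zeros(n)
        cnt = np.zeros(n)
        for i in range(Hr.shape[0]):
            out[i:i + m] += Hr[i]
            cnt[i:i + m] += 1
        s = out / cnt
    return s
```

For a pure sinusoid the signal subspace has rank 2 (two complex exponentials), so rank = 2 is the natural choice in that case; the walking coordinates in Section 4.2 were best served by rank 1.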

2.5 Spatial AR model and filtering

As we have already defined, human behavior is a temporal sequence of poses. We have so far focused on the temporal correlation structure of each joint and constructed the low-rank filter accordingly. Here, as an analogy from temporal correlation to spatial correlation, we propose a spatial AR model. The model is based on the fact that we can estimate the pose of the whole body even if some joints are hidden and cannot be observed. We consider that the linear form of the matrix X in Eq. (17) and its reproducibility from the coordinate vector Y are essential to the low-rank filtering, so we introduce a matrix and a coordinate vector in the same manner. Let x_j^t be the coordinate vector of the j-th joint at time t, 0 the zero vector, and a_k the k-th coefficient vector:

$$\mathbf{x}_j^t=\begin{pmatrix}x_j^t\\ y_j^t\\ z_j^t\end{pmatrix},\qquad \mathbf{0}=\begin{pmatrix}0\\0\\0\end{pmatrix},\qquad \mathbf{a}_k=\begin{pmatrix}a_k^x\\ a_k^y\\ a_k^z\end{pmatrix}.\tag{E20}$$

By analogy with Eq. (16), we propose a new linear matrix equation:

$$\begin{pmatrix}
\mathbf{0} & \mathbf{x}_1^{t-2} & \cdots & \mathbf{x}_{m-1}^{t-2}\\
\mathbf{x}_0^{t-2} & \mathbf{0} & \cdots & \mathbf{x}_{m-1}^{t-2}\\
\vdots & & & \vdots\\
\mathbf{x}_0^{t-2} & \mathbf{x}_1^{t-2} & \cdots & \mathbf{0}\\
\mathbf{0} & \mathbf{x}_1^{t-1} & \cdots & \mathbf{x}_{m-1}^{t-1}\\
\vdots & & & \vdots\\
\mathbf{x}_0^{t-1} & \mathbf{x}_1^{t-1} & \cdots & \mathbf{0}\\
\mathbf{0} & \mathbf{x}_1^{t} & \cdots & \mathbf{x}_{m-1}^{t}\\
\vdots & & & \vdots\\
\mathbf{x}_0^{t} & \mathbf{x}_1^{t} & \cdots & \mathbf{0}
\end{pmatrix}
\begin{pmatrix}\mathbf{a}_0\\ \mathbf{a}_1\\ \vdots\\ \mathbf{a}_{m-1}\end{pmatrix}
=
\begin{pmatrix}\mathbf{x}_0^{t-2}\\ \vdots\\ \mathbf{x}_{m-1}^{t-2}\\ \mathbf{x}_0^{t-1}\\ \vdots\\ \mathbf{x}_{m-1}^{t-1}\\ \mathbf{x}_0^{t}\\ \vdots\\ \mathbf{x}_{m-1}^{t}\end{pmatrix}.\tag{E21}$$

The model is characterized by a symmetrical calculation in which the coordinates x, y, and z of each joint on the right side are obtained from all non-zero joint coordinates on the left side. We believe this symmetrical calculation enables stable analysis independent of the walking direction. Since the number of unknowns increases threefold, we need at least three sequential poses for the calculation; we use more poses to obtain a smoothing effect. The filtering is achieved by decreasing the rank of the left-side matrix.
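Eq. (21) admits several concrete realizations; one illustrative reading builds the left-side matrix for a window of poses (one row per joint per frame, with that joint's own block zeroed), truncates its rank by SVD, and recovers each joint by averaging its surviving copies. With seven 15-joint poses this gives a 105 × 45 matrix whose rank can be reduced from 45 to 16, the numbers reported in Section 4.4. This is a hedged sketch, not the authors' exact algorithm:

```python
import numpy as np

def spatial_rank_reduction(poses, rank=16):
    """Rank reduction over a window of poses, sketching Eq. (21).

    poses: (w, m, 3) window of 3-D poses (w = 7, m = 15 gives the
    105 x 45 matrix of Section 4.4).  Each joint coordinate is then
    recovered by averaging its copies in the rank-reduced matrix.
    """
    X = np.asarray(poses, float)
    w, m, _ = X.shape
    flat = X.reshape(w, 3 * m)
    # Row (t, j): the full pose at frame t with joint j's block zeroed,
    # mirroring the zero entries on the left side of Eq. (21).
    M = np.repeat(flat, m, axis=0).reshape(w, m, 3 * m)
    for j in range(m):
        M[:, j, 3 * j:3 * j + 3] = 0.0
    U, s, Vt = np.linalg.svd(M.reshape(w * m, 3 * m), full_matrices=False)
    s[rank:] = 0.0                               # decrease the rank
    Mr = ((U * s) @ Vt).reshape(w, m, 3 * m)
    # Joint j survives in the rows of all other joints; average its copies.
    out = np.empty_like(X)
    for j in range(m):
        others = [jp for jp in range(m) if jp != j]
        out[:, j] = Mr[:, others, 3 * j:3 * j + 3].mean(axis=1)
    return out
```

Because every output coordinate is assembled from rows in which it co-occurs with all other joints, the reduction acts on the whole pose symmetrically, in line with the direction-independence argued above.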

2.6 Puppet models

Eq. (1) tells us that fluctuations in the observed two-dimensional coordinates u_l and u_r are expanded and yield severe noise in the three-dimensional ones. Since the noise reduction by the low-rank filter above is applied independently to each joint observation, it is not sufficient for reconstructing a meaningful three-dimensional pose; some constraints are required. Here we propose two types of puppet model as pose stabilizers. The models are based on the fact that the distance between neighboring joints is invariant under pose changes. Figure 3 shows the puppet with the joint numbers (black) and limb lengths in [m] (red). The puppet has a symmetrical structure with predetermined limb lengths and joints that move freely in any direction. We determined the lengths by roughly measuring the limbs of a person 1.7 [m] tall. We introduce two models with different constraint methods: a flexible type and a rigid one. We call the former Puppet-I [12] and the latter Puppet-II. In both models, we regard keypoint 8 (mid-hip) as the reference joint.

Figure 3.

Puppet model with joint number (black) and limb lengths (red) [m].

2.6.1 Puppet-I

In this model, we evaluate the limb length between a neighboring joint pair whose positions are calculated using Eq. (1). If the length does not match that of the model, we regard the position of the image point u^r on the right camera image as erroneous and adjust the parallax u^l − u^r so that the lengths match. This process starts from the reference joint. Let the position of the i-th neighboring joint P_i(x_i, y_i, z_i) be calculated from the stereo-pair image points Q_i^l(u_i^l, v_i^l) and Q_i^r(u_i^r, v_i^r), and let P_k(x_k, y_k, z_k) be the reference. The parallax adjustment is performed as

$$\Delta u^r_{\mathrm{opt}}=\underset{\Delta u^r}{\arg\min}\left|\sqrt{(x_i-x_k)^2+(y_i-y_k)^2+(z_i-z_k)^2}-L_{ik}\right|,\tag{E22}$$

where,

$$x_i=\frac{Bu_i^l}{u_i^l-u_i^r+\Delta u^r},\tag{E23}$$
$$y_i=\frac{Bv_i^l}{u_i^l-u_i^r+\Delta u^r},\tag{E24}$$
$$z_i=\frac{Bf}{u_i^l-u_i^r+\Delta u^r},\tag{E25}$$

and L_ik is the predetermined distance from the reference to the i-th joint. The adjustment is applied repeatedly to successive joint pairs: mid-hip to hip, hip to knee, and knee to ankle for the legs, and mid-hip to neck, neck to shoulder, shoulder to elbow, and elbow to wrist for the arms. The optimization in Eq. (22) is similar to finding the solution of a quadratic-like curve, as shown in Figure 4, which indicates that we may have two solutions (left) or none (right). In the former case, we select the best combination, for example, so that the two shoulders have the longest distance between them; arms and legs are likewise compensated so that they have the largest angle at the elbow and knee, respectively. In the latter case, the compensation is performed so that the difference becomes minimal. Substituting the optimal adjuster Δu^r_opt for the i-th joint into Eqs. (23)–(25), we obtain the three-dimensional coordinates of the joints.
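The one-dimensional optimization of Eqs. (22)–(25) can be sketched as a dense grid search over the parallax adjuster (illustrative code; the search range and step are assumptions, and the grid's global minimum may land on either of the two solutions of Figure 4):

```python
import numpy as np

def adjust_parallax(ul, vl, ur, ref_xyz, L, B, f, search=2.0, step=1e-3):
    """Puppet-I parallax adjustment (Eqs. 22-25) by grid search.

    Finds the parallax correction du that places the reconstructed joint
    at the model limb length L from the reference joint ref_xyz.
    B: baseline length, f: focal length (in pixels).
    Returns (du_opt, joint_xyz).
    """
    du = np.arange(-search, search, step)
    d = ul - ur + du                      # adjusted parallax, Eqs. (23)-(25)
    x = B * ul / d
    y = B * vl / d
    z = B * f / d
    xr, yr, zr = ref_xyz
    err = np.abs(np.sqrt((x - xr) ** 2 + (y - yr) ** 2
                         + (z - zr) ** 2) - L)    # Eq. (22) objective
    i = np.argmin(err)
    return du[i], np.array([x[i], y[i], z[i]])
```

In the actual pipeline, the choice between the two candidate solutions would then be made by the combination rules described above (longest shoulder distance, largest elbow/knee angle).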

Figure 4.

Model limb length and solutions.

2.6.2 Puppet-II

In this model, we regard the relative positions of the body joints, that is, mid-hip, left hip, right hip, neck, left shoulder, and right shoulder, as fixed independently of pose. Assuming rotation angles α, β, γ and a shift x_s, y_s, z_s, we transfer the model shown in Figure 5a into the world coordinate system and then predict the positions of these joints in both camera coordinate systems. We regard the measurement as made when the squared sum of the differences between the predicted positions (u_l^p, v_l^p), (u_r^p, v_r^p) and the observed ones (u_l^o, v_l^o), (u_r^o, v_r^o) on the camera coordinates over the six body joints reaches its minimum, and we thereby obtain the optimum angles and shifts as

Figure 5.

Rigid body model, upper limb model, and lower one with limb lengths.

$$E^{\mathrm{body}}_{\mathrm{opt}}=\underset{\alpha,\beta,\gamma,x_s,y_s,z_s}{\arg\min}\sum_{j\in \mathrm{body}}\Bigl[(u_{l,j}^o-u_{l,j}^p)^2+(v_{l,j}^o-v_{l,j}^p)^2+(u_{r,j}^o-u_{r,j}^p)^2+(v_{r,j}^o-v_{r,j}^p)^2\Bigr],\tag{E26}$$

where E^body_opt is the vector of optimized angles and shifts:

$$E^{\mathrm{body}}_{\mathrm{opt}}=\bigl(\alpha_{\mathrm{opt}},\ \beta_{\mathrm{opt}},\ \gamma_{\mathrm{opt}},\ x_{\mathrm{opt}},\ y_{\mathrm{opt}},\ z_{\mathrm{opt}}\bigr)^T,\tag{E27}$$

the summation runs over the six joints of the body model (the nose is not used), and the camera image coordinates are calculated from the initial positions of the body joints (x_j^init, y_j^init, z_j^init), j ∈ body, as

$$\begin{pmatrix}x_j^p\\ y_j^p\\ z_j^p\end{pmatrix}=R(\alpha,\beta,\gamma)\begin{pmatrix}x_j^{\mathrm{init}}\\ y_j^{\mathrm{init}}\\ z_j^{\mathrm{init}}\end{pmatrix}+\begin{pmatrix}x_s\\ y_s\\ z_s\end{pmatrix},\qquad j\in \mathrm{body},\tag{E28}$$
$$R(\alpha,\beta,\gamma)=\begin{pmatrix}1&0&0\\ 0&\cos\alpha&-\sin\alpha\\ 0&\sin\alpha&\cos\alpha\end{pmatrix}\begin{pmatrix}\cos\beta&0&\sin\beta\\ 0&1&0\\ -\sin\beta&0&\cos\beta\end{pmatrix}\begin{pmatrix}\cos\gamma&-\sin\gamma&0\\ \sin\gamma&\cos\gamma&0\\ 0&0&1\end{pmatrix},\tag{E29}$$
$$u_{l,j}^p=\frac{fx_j^p}{z_j^p},\qquad u_{r,j}^p=\frac{f(x_j^p-B)}{z_j^p},\qquad v_{l,j}^p=v_{r,j}^p=\frac{fy_j^p}{z_j^p}.\tag{E30}$$
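The rigid-body fit of Eqs. (26)–(30) can be sketched with a generic nonlinear least-squares solver (illustrative code using SciPy's `least_squares`, not the authors' optimizer; the initial shift of 5 m in front of the camera is an assumption):

```python
import numpy as np
from scipy.optimize import least_squares

def rot(a, b, g):
    """Rotation matrix R(alpha, beta, gamma) of Eq. (29): Rx @ Ry @ Rz."""
    ca, sa = np.cos(a), np.sin(a)
    cb, sb = np.cos(b), np.sin(b)
    cg, sg = np.cos(g), np.sin(g)
    Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
    Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
    Rz = np.array([[cg, -sg, 0], [sg, cg, 0], [0, 0, 1]])
    return Rx @ Ry @ Rz

def project(pts, f, B):
    """Stereo projection of Eq. (30): one (ul, vl, ur, vr) row per 3-D point."""
    x, y, z = pts.T
    return np.stack([f * x / z, f * y / z, f * (x - B) / z, f * y / z], axis=1)

def fit_body(model_pts, obs, f, B):
    """Fit the rotation angles and shift of Eq. (26) so the projected rigid
    body model matches the observed stereo keypoints `obs`
    ((ul, vl, ur, vr) per joint)."""
    def residual(p):
        a, b, g, xs, ys, zs = p
        pred = project(model_pts @ rot(a, b, g).T + np.array([xs, ys, zs]), f, B)
        return (pred - obs).ravel()
    # Start the body in front of the camera; z must stay positive.
    return least_squares(residual, x0=[0.0, 0.0, 0.0, 0.0, 0.0, 5.0]).x
```

Because the residual vanishes only when both projections match, the recovered pose reproduces the triangulated 3D joint positions; the front/back flip ambiguity discussed in Section 4.3 corresponds to a second local minimum of this cost.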

The three-dimensional coordinates of the body joints are obtained by substituting the optimum angles and shifts into Eq. (28). After the body-joint measurement, the arms and legs are processed. As shown in Figure 5b and c, the arm and leg models have the same structure, so the same process can be applied to both. We modify the process described in Eqs. (26)–(30) to, for example, that for the left arm l_arm, including joints 6 and 7, as

$$E^{\mathrm{l\_arm}}_{\mathrm{opt}}=\underset{\alpha,\beta,\gamma,r}{\arg\min}\sum_{j\in \mathrm{l\_arm}}\Bigl[(u_{l,j}^o-u_{l,j}^p)^2+(v_{l,j}^o-v_{l,j}^p)^2+(u_{r,j}^o-u_{r,j}^p)^2+(v_{r,j}^o-v_{r,j}^p)^2\Bigr],\tag{E31}$$

where r is a bending coefficient (0 ≤ r ≤ 1) between the upper limb l_1 and the lower one l_2 in the triangle shown in Figure 5d, and

$$h=\frac{2S}{r(l_1+l_2)}.\tag{E32}$$

The area S of the triangle is calculated using Heron’s formula. After setting the initial coordinates, the rotation matrix R(α, β, γ) is applied to them, and they are then shifted to the corresponding body joint, that is, the left shoulder in this case:

$$\begin{pmatrix}x_j^p\\ y_j^p\\ z_j^p\end{pmatrix}=R(\alpha,\beta,\gamma)\begin{pmatrix}x_j^{\mathrm{init}}\\ y_j^{\mathrm{init}}\\ z_j^{\mathrm{init}}\end{pmatrix}+\begin{pmatrix}x_5\\ y_5\\ z_5\end{pmatrix},\qquad j\in \mathrm{l\_arm}.\tag{E33}$$

After the optimization, the elements of the vector E^l_arm_opt are substituted into Eq. (33) to obtain the three-dimensional coordinates of joints 6 and 7, that is, the left elbow and left wrist. Those of the other parts, the right arm, left leg, and right leg, are determined in the same manner. Table 1 shows the initial coordinates of the joints used in each part’s optimization. Thus, the three-dimensional coordinates of all the joints are measured.

Joint   x     y                    z
2       0     0                    0
3       −h    sqrt(l1^2 − h^2)     0
4       0     r(l1 + l2)           0
5       0     0                    0
6       h     sqrt(l1^2 − h^2)     0
7       0     r(l1 + l2)           0
9       0     0                    0
10      −h    sqrt(l1^2 − h^2)     0
11      0     r(l1 + l2)           0
12      0     0                    0
13      h     sqrt(l1^2 − h^2)     0
14      0     r(l1 + l2)           0

Table 1.

Initial coordinates for arm and leg models.
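The initial coordinates of Table 1 follow from Eq. (32) and Heron's formula; a sketch for one limb (illustrative code, with `side` selecting the sign of the middle joint's x-offset as in the table):

```python
import numpy as np

def limb_initial_coords(l1, l2, r, side=+1):
    """Initial joint coordinates for an arm/leg model (Eq. 32, Table 1).

    l1, l2: upper/lower limb lengths; r in [0, 1] is the bending
    coefficient, so the end joint sits at distance r*(l1 + l2) from the
    root.  Returns the root (shoulder/hip), middle (elbow/knee), and end
    (wrist/ankle) joints as rows.
    """
    base = r * (l1 + l2)
    # Heron's formula for the triangle with sides l1, l2, and the base
    s = (l1 + l2 + base) / 2.0
    S = np.sqrt(max(s * (s - l1) * (s - l2) * (s - base), 0.0))
    h = 2.0 * S / base if base > 0 else 0.0       # Eq. (32)
    mid_y = np.sqrt(max(l1 ** 2 - h ** 2, 0.0))   # Table 1, middle joint
    return np.array([
        [0.0, 0.0, 0.0],          # root joint
        [side * h, mid_y, 0.0],   # elbow/knee
        [0.0, base, 0.0],         # wrist/ankle
    ])
```

By construction the root-to-middle distance is l1 and the middle-to-end distance is l2 for any bending coefficient, so the limb lengths of the puppet stay fixed while r alone controls the bend.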


3. Procedures

The following is the procedure of the proposed method. We measure the three-dimensional positions of 15 joints of the model BODY_25. The main process is noise reduction of the two-dimensional joint coordinates recognized by OpenPose on the stereo-pair images.

Step 1: Reading keypoint data from a JSON file and checking their consistencies.

The keypoint data for a single person in the image may be divided into portions and output as several persons’ data. Mis-recognitions such as ghost keypoints also appear because of the very high sensitivity. The portions are collected by checking keypoint consistency between the stereo-pair images.

Step 2: Interpolating the coordinates of missing keypoints.

Using the confidence in the JSON file, we determine whether there is a missing keypoint or not. Then, the missing values are interpolated.

Step 3: Correcting the leg swapping and duplication.

They are detected by using the normalized frame-to-frame displacement of both ankles and are corrected. Arm swapping and duplication are corrected in the same way.

Step 4: Applying the low-rank filter (LRF-2D).

The LRF-2D is applied to the time sequence of each two-dimensional coordinate of each joint on both stereo-pair images.

Step 5: Constructing three-dimensional poses.

Three-dimensional poses are constructed by applying puppet-I or puppet-II.

Step 6: Applying LRF-3D.

LRF-3D is applied to the time sequence of each three-dimensional coordinate of each joint in the constructed poses.

Step 7: Applying the spatial AR model filtering (SARF).

SARF is applied to a three-dimensional pose in all frames.

Figure 6 shows the processing flow of the proposed method in which devices or functions are blue-backed and resultant data orange-backed.

Figure 6.

Processing flow of the proposed method.


4. Experiments and discussion

Here we show the stereo camera used and its calibration result, observed joint data, and some preprocessing results for them. Performances of the low-rank filter, spatial AR model, and puppet models are also discussed with the implementation.

4.1 Stereo camera and its calibration

Figure 7 shows the stereo camera used in the experiments below, and Table 2 its specifications. The camera was stereo-calibrated so that the image pair was parallelized. We evaluated the calibration performance from the difference in vertical coordinates between the stereo-pair images at the corner points of the chessboard sheet used for the calibration, as shown in Figure 8. The difference was within ±0.5 pixels except near the four corners of the image, so we found that the parallelization was achieved with high accuracy.

Figure 7.

The stereo camera (Zed2) used in the experiments.

Parameter           Value
Number of pixels    3840 × 1080
Format              Side by side
Frames per second   30 fps
Baseline length     120 mm
Field of view       120 degrees

Table 2.

Specifications of the stereo camera. *The Zed2 originally has higher performance; we used a lower mode for easy recording.

Figure 8.

Vertical coordinate differences among all corresponding points in stereo-pair images.

4.2 Data completion and 2D-filtering

We recorded a video of a person walking in a room illuminated by a ceiling light. He/she walked across the optical axis of the camera at a distance of about 5 [m] going right and about 4 [m] returning, keeping the same pace as far as possible. We applied OpenPose to obtain the keypoint coordinates on the stereo-pair images in each frame, as shown in Figure 9. Some of the joint coordinates were missing, as expected, so we found the sections having zero confidence and linearly interpolated them by connecting the nonzero values at both ends of each section, as shown in Figure 10. The top panel shows that the confidence falls to zero around frames 70 and 90 and beyond frame 140 due to occlusions; the first two produce spike-shaped noise and the third a zero output in the middle panel. The bottom panel shows that the spikes were well interpolated and that the last nonzero value was held to the end of the data.

Figure 9.

Walking path and an example of recognized keypoints.

Figure 10.

Performance in completing missing data; confidence (top), missing data (middle), and completion result (bottom).

Figure 11 shows an example of the original trajectories of the ankles (upper) after missing-value completion and the correction result (lower). The original trajectories included leg duplications at frames 23–25, 41, 58, 90, and 121, and leg swappings at 56 and 73. We experimentally determined the threshold as 0.3, which gave the best performance. Figure 12 shows the performance of the low-rank filter applied to the time sequence of the horizontal coordinate of the right ankle; the resultant coordinates are compared with the original ones. After some trials, we set the AR model length to 12 and the rank to unity (r_f = 1). The low-rank filter performed excellently in reducing the severe noise: the result was smooth and reproduced the original shape well, which cannot be obtained with a simple low-pass filter. Expecting to capture some personal features, we repeated the experiments with the rank increased up to three; unfortunately, larger ranks yielded only instability in the pose.

Figure 11.

Correction results for the swappings and duplications; original (upper) and correcting result (lower).

Figure 12.

Performance of 2D low-rank filter applied to the time sequence of the right ankle horizontal coordinates.

4.3 Puppet model fitting

Figure 13 shows the performance of the puppet-model fitting for stabilizing the three-dimensional poses reconstructed from the joints on the stereo-pair images. The left column shows the reconstructed pose with footprints, and the right column the trajectories of both shoulders projected onto the x-z plane.

Figure 13.

Reconstructions of 3D-pose and shoulder trajectories.

In Figure 13a, although some joints are shifted from their original positions, the overall shape is still recognizably human. Such discrepancies are considered to result from the low-rank filter in the previous stage being applied to each joint independently, regardless of the joints’ mutual positions. This can also be seen from the fact that the gap between the two-dimensional trajectories of the two shoulders shown in Figure 13b is not constant. The distortion was somewhat compensated by applying the puppet-I model, as shown in Figure 13c. Puppet-I acted as a kind of constraint and gave the reconstructed pose a positional relationship among the joints that is possible for a human. As shown in Figure 13d, the shoulder width was corrected to a constant, although the shoulder positions still fluctuated considerably. The fluctuation indicates that some systematic noise remains in the parallax after fitting puppet-I. Although constraining the shoulder width alone is not enough, puppet-I enabled us to reproduce relative poses from the measured joint coordinates smoothed by the low-rank filter. We consider that the fluctuations can be reduced by applying a normal low-pass filter, such as a moving-average filter.

The model puppet-II was developed as an improved version of puppet-I, because puppet-I in rare cases produces large distortions that stretch the upper body (shoulders and arms) away from the lower body (hips and legs), particularly in areas far from the camera. We speculate that this phenomenon is related to the decreasing precision of OpenPose joint recognition as the subject becomes smaller in the image. Therefore, we adopted a rigid body in puppet-II together with some pose-stabilizing techniques. Whereas the puppet pose is optimized in the three-dimensional world coordinate system in puppet-I, it is optimized on the two-dimensional image coordinate system in puppet-II. When a person crosses the optical axis and both shoulders lie on it, the shoulders appear at the same point on the reference image and the difference in their parallax is slight. Noise in the stereo-pair images then prevents us from distinguishing whether the front of the model is fitted to the front of the person or flipped to fit the back. Since the optimization depends strongly on the initial values, we suppressed the model flip by stabilizing them. The stabilization was performed by fitting the body model in Figure 5a to the joint coordinates not only of the target frame but of multiple frames before and after it. This multi-frame fitting is based on the fact that translating corresponding points on the stereo-pair images while keeping the parallax produces a pure translation in the world coordinate system without changing the distance. We translated the body joints in all frames in the window so that the mid-point between the neck and the mid-hip matched that of the target frame. We set a window of width 24 and used the optimization result for all frames in it as the initial values for the optimization of each frame in a central small window containing 12 frames.
After finishing the optimization, we shifted the window so that the central small windows covered all frames without gaps, and repeated this procedure until all frames were processed. Before performing the optimization for the legs and arms, we applied an inversion test: when the direction from neck to nose in the optimized model did not match that in the measured pose, we flipped the body model. Figure 13e shows one of the resultant poses with footprints, and Figure 13f the shoulder trajectories. The procedures above adjusted the positional relationship among the joints and did not smooth the distance from the camera. Figure 14 shows the result of suppressing the variation in the distance by applying a moving average filter, together with the shoulder trajectories. Assuming that a walking person does not change posture abruptly within 1/3 of a second, we set the window width to 21 frames.
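The windowing scheme described above can be sketched as follows; the function only schedules the index ranges, with the per-frame puppet fitting itself left abstract (window widths follow the text, everything else is an assumption):

```python
def schedule_windows(n_frames, outer=24, inner=12):
    """Yield (outer_window, inner_window) index ranges so that the inner
    windows tile all frames without gaps.

    For each step, optimization results over the outer window supply
    initial values for optimizing the frames in the central inner window,
    which stabilizes the fit against front/back model flips.
    """
    margin = (outer - inner) // 2
    start = 0
    while start < n_frames:
        inner_lo, inner_hi = start, min(start + inner, n_frames)
        outer_lo = max(0, inner_lo - margin)
        outer_hi = min(n_frames, inner_hi + margin)
        yield (outer_lo, outer_hi), (inner_lo, inner_hi)
        start += inner
```

Each inner window is processed with initial values drawn from its surrounding outer window, and the inner windows together cover every frame exactly once.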

Figure 14.

Processed 3D-pose and shoulder trajectories with moving average.

4.4 Posterior 3D filtering

The final processing steps were three-dimensional low-rank filtering as a smoother in the time domain and spatial AR model filtering as a smoother in the spatial domain. Both filters were applied to the moving-average data.

Figure 15a shows the performance of the low-rank filter LRF-3D applied separately to the time sequence of the three-dimensional coordinates of the right ankle, and Figure 15b that of the spatial AR model filter SARF, which reduced the rank from 45 to 16 using 7 sequential poses at a time, applied to the coordinates of each pose. Both figures show the differences dx, dy, and dz in the coordinates x, y, and z before and after applying the respective filter. Pulse-shaped noise, mainly in the z-coordinate, was removed by the former filter, and the noise slightly remaining in the x-coordinate was removed by the latter. The values on the vertical axis in Figure 15b are an order of magnitude smaller than those in Figure 15a, indicating that the previously applied LRF-3D removed most of the spike-shaped noise. Since the mean difference contained no large bias component, both filters appear to have smoothed the poses without distorting them. Figure 16 shows the final results using the models puppet-I and puppet-II, respectively, with the same post-processing applied after each puppet model fitting. In these figures, we also add the footprints and a virtual plane for recognizing the walking trajectory and for comparing the two models' performances.
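A minimal stand-in for such rank reduction is truncation of the singular value decomposition; the chapter's filter may differ in detail (e.g. the iterative partial matrix shrinkage of ref. [11]), and the matrix layout below is an assumption:

```python
import numpy as np

def low_rank_filter(X, rank):
    """Return the best rank-`rank` approximation of X by SVD truncation.

    Rows could hold stacked joint coordinates (e.g. 45 of them) and
    columns sequential frames; discarding the small singular values
    removes spike-like noise while keeping the dominant motion modes.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[rank:] = 0.0  # keep only the leading modes
    return (U * s) @ Vt
```

A signal that already lies in a low-dimensional subspace passes through unchanged, while components outside that subspace are suppressed.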

Figure 15.

Performance of the low-rank filter and the spatial AR model filter.

Figure 16.

Final poses with footprints.

In Figure 16a, we see that puppet-I produced almost the same shape in the pose itself, but the footprints were slightly disturbed and their trajectories deviated from the U-shape. On the other hand, the footprints in Figure 16b show a more precise U-shaped trajectory and a regular pattern. These features indicate that the rhythmic movement of the legs in actual walking was captured well by the model puppet-II. We therefore regard puppet-II as having higher performance as a sensing tool, especially in predicting joint positions, in walking measurement.

Figure 17 shows intermediate results corresponding to the procedures listed in Figure 6. Joints reconstructed from the original data were severely affected by intense noise, yielding impossible shapes and widely scattered footprints, as shown in Figure 17b. Applying the missing-value completion did not change the situation (Figure 17c). The impossible joint placements and scattered footprints were dramatically improved by the low-rank filter LRF-2D, although some unnaturalness remained (Figure 17d). This unnaturalness was removed almost completely by predicting and correcting the body position through fitting the model puppet-II (Figure 17e). The following low-rank filter LRF-3D provided smoothness in the joint movements (Figure 17f). The last filter, the spatial AR model filter SARF, also gave spatial smoothness (Figure 17g), although LRF-3D was so effective that the effect of SARF is almost invisible. These results indicate that the proposed method measured the walking person almost perfectly in both pose and position. We also provide the whole scene of the walking person as a video at https://youtube.com/shorts/gvqF3m9xPjk. As can be seen from Figure 16 (and the video), the LRF and the puppet-II model performed very well in the walking measurements conducted in this study. The former extracts the dominant mode in the signal, so from the standpoint of behavior measurement, the longer the same movement lasts (in this case, the more steps taken), the more stable the results will be. On the other hand, if the pose changes significantly in a short period, the dominant mode itself may become unstable and the performance may be insufficient. We believe that such degradation could be mitigated by recording poses with a stereo camera with a higher frame rate.
The latter was introduced based on the fact that the joint arrangement in a person's upper body (shoulders and hips) does not change significantly with pose. The good performance of this model can be interpreted as the effect of incorporating existing knowledge into the measurement. In this case, we believe that the fixed joint arrangement, acting as a correlation structure, narrows the search area in the optimization process and thereby improves overall accuracy.

Figure 17.

Intermediate results after the main processing steps have been completed.

Now, we consider the applicability of the puppet-II model to persons with different heights and body types. We think the body type is not a serious problem, and the height difference can be compensated by expanding or shrinking the length of each limb in proportion to the height. The reasons are that roughly measured limb lengths already gave almost perfect results, that the model has a very simple structure, and that the position and direction of body parts were determined by an optimization process rather than a matching one.
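The compensation for height could be as simple as the following sketch (the limb-length dictionary, the reference height, and proportional scaling are all hypothetical illustrations of the idea, not parameters from the chapter):

```python
def scale_puppet(limb_lengths, reference_height, target_height):
    """Scale each limb length of a puppet model by the height ratio.

    `limb_lengths` maps a limb name to its length in meters for a person
    of `reference_height`; proportional scaling is assumed sufficient to
    adapt the model to a person of `target_height`.
    """
    ratio = target_height / reference_height
    return {name: length * ratio for name, length in limb_lengths.items()}
```

All limbs keep their mutual proportions, so the rigid-body constraints of the model are preserved under the rescaling.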

Conventionally, to recognize human behavior, we have to measure poses and analyze the behavior as their temporal change. In this process, recognizing the human itself was an extremely hard task. Recent advances in deep learning have enabled software systems such as OpenPose to quantitatively extract human joints independently of the observation direction. Analysis with two-dimensional data, however, still depended on the observation direction, and analysis with three-dimensional data was attractive but left some problems in acquiring the three-dimensional human joint coordinates. In this study, we have established a technical basis for quantitatively handling three-dimensional poses and their temporal changes, that is, behavior, by removing the extreme noise in the three-dimensional joint coordinates converted from the stereo-pair joint coordinates extracted by OpenPose.


5. Conclusions

Based on the fact that human joints do not move at random but move together with a kind of spatial correlation structure in each action, a joint-based method was proposed to measure the three-dimensional human pose, together with position, in walking, as the first step toward obtaining that correlation structure. The joint positions were obtained by applying OpenPose to both left and right images acquired with a general-purpose stereo camera. The original three-dimensional joint coordinates were severely affected by missing values and/or positional fluctuations in the two-dimensional coordinates on the stereo-pair images. Since the noise in the three-dimensional coordinates caused by fluctuations assumed to be Gaussian followed a Cauchy distribution, simple averaging did not smooth the noise well. Focusing on the dominant mode, we succeeded in separating the severe noise by lowering the rank of the time-series data of the joint coordinates; we called this the two-dimensional low-rank filter LRF-2D. Independent application of LRF-2D to the time series of each joint coordinate yielded some positional distortions in joint locations. These distortions were corrected by introducing the puppet model, a kind of constraint derived from the fact that the lengths of the backbone, upper limbs, and lower limbs are invariant during actions. Two types of model, a flexible puppet and a rigid one, were proposed, and the rigid puppet model gave better performance than the flexible one. In the rigid model, the locations of both shoulders and hips relative to the backbone were fixed in the body part of the puppet, and their three-dimensional positions were determined so that the sum of squared differences between the back-projected two-dimensional joint coordinates and the observed ones on the stereo-pair images was minimized.
This meant that the optimum position of the body part was determined from the observed two-dimensional joint coordinates while predicting its possible three-dimensional positions using known knowledge. Subsequent application of LRF-3D to the time-series data of the joint coordinates, followed by application of the spatial AR model filter SARF to the joints in each frame, achieved almost perfect smoothing of the joint movements in walking. The final result represented the rhythmic footsteps well. Thus, model-predictive measurement led us to a successful measurement of a walking person. In addition, we found that OpenPose has high precision in detecting human joints and high directional reproducibility in recognizing their positions, sufficient for performing stereo-vision measurement.
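The heavy-tailed depth noise behind this design can be illustrated numerically: depth from stereo is proportional to the reciprocal of the parallax, and the reciprocal of a Gaussian-perturbed parallax has Cauchy-like tails, so the sample mean is unstable while the median stays near the noise-free value (the numbers below are illustrative, not measured values):

```python
import numpy as np

rng = np.random.default_rng(1)
# Gaussian-perturbed parallax with a small mean, as for a distant point.
parallax = 2.0 + rng.standard_normal(100_000)
depth = 1.0 / parallax  # depth is proportional to 1/parallax
# Rare near-zero parallax samples blow up individual depth values (and
# hence the sample mean), while the median stays close to 1/2.
median_depth = np.median(depth)
```

This is why simple averaging fails here, and why a mode-based smoother such as the low-rank filter is needed.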

In this study, we have established a technological basis that enables a quantitative understanding of human posture and its temporal changes, that is, behavior with position, by optical measurement from a remote location. This is expected to advance research on the recognition of human behavior and its application to gait recognition. However, although we evaluated the measurement performance of the walking-measurement system qualitatively here, a more quantitative evaluation is needed before the system can be used in applied research. At the same time, a rigorous evaluation is needed to determine the performance of OpenPose as a human joint recognition tool, including consideration of other recognition systems with higher performance. Another future task is to extract and analyze the spatial correlation structure among all joints in movements such as walking.


Abbreviations

AR: Auto-Regression
JSON: JavaScript Object Notation
LRF: Low-rank filter
LRF-2D: Two-dimensional low-rank filter
LRF-3D: Three-dimensional low-rank filter
SARF: Spatial AR model filter

References

  1. 1. Urtasun R, Fua P. 3D tracking for gait characterization and recognition. In: Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition (FGR'04); 17-19 May 2004; Seoul. New York: IEEE. Vol. 1. pp. 17-22
  2. 2. Spencer N, Carter J. Towards pose invariant gait reconstruction. In: Proceedings of IEEE International Conference on Image Processing 2005; New York: IEEE. Vol. 3. 2005
  3. 3. Bashir K, Xiang T, Gong S, Mary Q. Gait representation using flow fields. In: Proceedings Of the British Machine Vision Conference 2009 (BMVC 2009); London. 7-10 September 2009
  4. 4. Munea TL, Jembre YZ, Weldegebriel HT, Chen L, Huang C, Yang C. The progress of human pose estimation: A survey and taxonomy of models applied in 2D human pose estimation. IEEE Access. 2020;8:133330-133348
  5. 5. Khamsemanan N, Nattee C, Jianwattanapaisarn N. Human identification from freestyle walks using posture-based gait feature. IEEE Transaction on Information Forensics and Security. 2018;13(1):119-128
  6. 6. Gianaria E, Balossino N, Velonaki M. Gait characterization using dynamic skeleton acquisition. In: Proceedings of the IEEE 15th International Workshop on Multimedia Signal Processing (MMSP 2013); September 30–October 2 2013; Pula. New York: IEEE. 2013. pp. 440-445
  7. 7. Chattopadhyay P, Sural S, Mukherjee J. Frontal gait recognition from incomplete sequences using RGB-D camera. IEEE Transactions on Information Forensics and Security. 2014;9(11):1843-1856
  8. 8. Ahmed F, Paul PP, Gavrilova ML. DTW-based kernel and rank-level fusion for 3D gait recognition using Kinect. The Visual Computer. 2015;31:915-924. DOI: 10.1007/s00371-015-1092-0
  9. 9. Cao Z, Hidalgo G, Simon T, Wei S, Sheikh Y. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021;43:172-186
  10. 10. Hanaizumi H, Misono H. An OpenPose-based method to detect texting while walking. In: Proceedings of the 7th International Conference on Intelligent Systems and Image Processing 2019 (ICISIP 2019); Taipei; 5-9 September 2019. pp. 130-134
  11. 11. Konishi K, Uruma K, Takahashi T, Furukawa T. Iterative partial matrix shrinkage algorithm for matrix rank minimization. Signal Processing. 2014;100:124-131. DOI: 10.1016/j.sigpro.2014.01.014
  12. 12. Hanaizumi H, Otahara A. A Method for Measuring Three-Dimensional Human Joint Movements in Walking. In: Proceedings of the 60th Annual Conference of the Society of Instrument and Control Engineers of Japan 2021 (SICE 2021); Tokyo; 8-12 September 2021. pp. 1476-1481
  13. 13. Wikipedia: Ratio distribution [Internet]. Available from: https://en.wikipedia.org/wiki/Ratio_distribution [Accessed: 2022-11-30]
