EE 863 Final Project
Yisheng Chen & Wei Hu
Implementation and Comparison of Different Human Body Part Detection Algorithms
3. Two Approaches of Human Body Part Detection
4. Implementation and Simulation Results
5. Comparison, Conclusion and Future Work
Human body part detection in video sequences is an interesting yet challenging problem in computer vision. It is an important step for human body tracking and posture estimation systems. In combination with a temporal analysis, it can be used for action recognition or event detection.
We have some previous course experience in human posture recognition, in which we extracted the silhouette of the figure from the captured video and compared it against the silhouette of our synthetic figure using a pixel-by-pixel, distance-based function to evaluate the goodness of fit and recognize the posture. We are interested in other approaches that detect human body parts and recognize human posture without a synthetic figure model. We are also curious about the effectiveness of these different approaches in real implementation problems.
With the knowledge learned in the computer vision course this quarter and the opportunity to select a final vision project of our own interest, we are motivated to explore the area of human body part detection further, try some different approaches, and compare the results. After a thorough and broad search, we decided to implement two different human body part detection algorithms introduced in the following two papers:
Paper No. 1: Ghost: A Human Body Part Labeling System Using Silhouettes, by Ismail Haritaoglu et al., 14th International Conference on Pattern Recognition, August 1998, Brisbane, Australia.
Paper No. 2: Particle Filter with Analytical Inference for Human Body Tracking, by Mun Wai Lee et al., IEEE Workshop on Motion and Video Computing, 2002.
In paper no. 1, a human body part labeling system using silhouettes is introduced, and the main idea for human body part detection is to use convex hull analysis. Paper no. 2 addresses a broader topic, human body tracking. Given our limited time and energy, we focus only on the human body part detection algorithm introduced in paper no. 2, in which curvature analysis rather than convex hull analysis is explored and shown to be more helpful. We implement the two approaches to human body part detection and compare the simulation results.
The rest of the report is organized as follows: section 2 gives some background knowledge of human body part detection; section 3 gives more details about the algorithms introduced in the two reference papers; section 4 presents our implementation and simulation results; comparison, conclusion and future work are provided in section 5.
A human body in a given posture has a topological structure that constrains the relative locations of the body parts. The relative locations of the body parts do not change much as long as the body remains in the same posture. For different camera viewpoints, however, the order of the body part locations may vary as a function of the viewpoint.
Features such as intensities, edges and silhouettes are widely used in human body part detection. Many earlier approaches, including those that we are going to implement, use silhouettes because they are less sensitive to noise and because the primary parts of the human body are likely to lie on the silhouette boundary in most generic postures. A silhouette-based 2D body model usually consists of the primary body parts, such as the head, hands and feet, and the secondary parts, such as the elbows, knees, shoulders and hips. After categorizing the posture into one of the main posture types (standing, sitting, crawling, lying, etc.), the body part detection algorithm usually tries to locate the primary body parts using different constraints, such as constraints on the order of the body parts for a given posture and constraints on the path distance between parts.
For body part detection, the head is usually detected first and taken as a reference point, and the other parts are detected with respect to the head location. The silhouette region that includes the head is usually found by combining the major axis, the median, and posture estimation with convex and concave points. The remaining primary body parts are then found by analyzing different constraints. The relative distance of a hull point from the head, the median coordinate, and the feet should be consistent with the topology of the main posture.
3. Two approaches of human body part detection
The system Ghost, proposed in paper no. 1 by Haritaoglu et al., segments the silhouette from the background. It computes the vertical and horizontal projections of the silhouette to determine the global posture of the person and the person's orientation relative to the camera (front view, left side view or right side view). To recognize the posture, the system computes the projections of the current silhouette and compares them to projection models built for a set of predefined postures and points of view. The projection histogram method they use is a descriptor for shape analysis tasks. It describes the region of an object by projecting its pixels onto the Cartesian coordinate axes. The normalization is done by rescaling the object mask to a maximum size of pixels in one dimension and centering a small pixels window on it. The vertical and horizontal histograms are obtained from the object mask by counting the pixels along the horizontal and the vertical lines respectively.
After estimating the main posture and the viewpoint using the projection histogram method, they determine the body parts by analyzing the contour of the silhouette. The localization of the body parts is accomplished by combining a convex hull analysis, partial mapping of the body parts and the known topology of the human body for each main posture. They first detect the principal parts of the body, such as the head, hands and feet (the extremities of the body), and based on these detections they search for the secondary parts, such as the shoulders, elbows and knees (the articulations).
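The pixel-counting step described above can be sketched in a few lines of Python. This is only a minimal illustration, not the Ghost implementation: the function name `projection_histograms` and the list-of-rows mask representation are assumptions made for the example, and the normalization and template comparison steps are left out.

```python
def projection_histograms(mask):
    """Count silhouette pixels per row and per column of a binary mask.

    mask is a list of rows, each a list of 0/1 values (1 = silhouette pixel).
    This representation is an assumption for illustration only.
    """
    # Counting along each horizontal line gives one value per row.
    row_counts = [sum(row) for row in mask]
    # Counting along each vertical line gives one value per column.
    column_counts = [sum(column) for column in zip(*mask)]
    return row_counts, column_counts


# Tiny hand-made mask: 3 rows by 4 columns.
mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
row_counts, column_counts = projection_histograms(mask)
```

In a full system these two histograms would be normalized and compared against stored templates for each posture and viewpoint, which is the part we do not implement.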
We will implement most steps of the algorithm introduced in paper no. 1 except the posture recognition process with projection histograms, since this step requires a large data library of normalized horizontal and vertical projection templates for the different main postures and viewpoints. We don't have these templates, so we skip this step. We implement all the other steps of the detection algorithm, such as detection of the convex hull on silhouettes and prediction of body part locations.
In paper no. 2, the authors introduce a new method of body part detection. They use different approaches to detect the hands, the head and the main axis of the torso. First, the hands are detected along the outlines of the foreground. Peaks of convex curvature are extracted along the silhouette boundary. Using a prior estimate of the hand positions, based on human body structure and tracking information, unlikely hand positions are further eliminated. The head detection is performed using a reference chain code representation of a head-shoulder contour as a template for the head. They match this template along the contour boundary of the extracted silhouette to detect the head. To achieve scale invariance, the contours are rescaled with respect to the estimated human height. The chain code features are normalized before comparison to achieve rotation invariance. The matching error is based on chain code differencing. For the main axis of the torso, they first extract the medial axis of the 2D silhouettes. The medial axis points in different views are matched using the epipolar constraint, and their 3D positions are computed. A line is then fitted to these 3D points using PCA and RANSAC. This extracted line provides a measurement of the torso orientation and a constraint that the torso must lie along the line.
Since we use the 2D silhouette of the image and our video sequences have only a single viewpoint, we do not try their head or torso detection methods. We only use the curvature analysis method and apply it to our hand and foot detection.
4. Implementation and Simulation Results
We have implemented most steps of the silhouette-based human body part detection introduced in paper no. 1. We follow the assumption that the video sequence is acquired using a stationary camera and that there is very little background clutter, which is typical for an indoor environment. Our test data are a video sequence of dance footage (48 frames), which meets these requirements and is shown below, and another video sequence of a badminton player (100 frames).
Dance Footage (48 frames) Badminton Player (100 frames)

We then extract the body silhouette manually. The background subtraction algorithm is not the focus of this paper and is actually introduced in the paper "W4: A real time system for detecting and tracking people" by Haritaoglu et al. in the Proceedings of the Third Face and Gesture Recognition Conference, 1998. We concentrate on the detection part. Figure 2 shows the silhouette we get. Again, it is only one frame of a video sequence. The whole video sequence can be viewed on the web page designed for this course project.
Manually Extracted Silhouette (for the right person only).
After we get the silhouette, we find its contour using an edge detector.

We can then start the detection process with both the contour and the silhouette. First we detect the head. We cannot assume that the head is the highest pixel on the contour, because the hands may be higher than the head. One example is shown in Figure 4.
Figure 4 Posture with hands higher than head

We then travel along the contour and find the local highest pixels. For example, in figure 5, the black '+' sign marks the start of a run of local highest pixels and the red circle marks the end of the run.
Figure 5 Local highest pixels for a contour
We then carry out the horizontal and vertical projection of the silhouette. We calculate the x-axis projection because, from the constraints for the standing posture, we know that the head is near the maximum of the x-axis projection histogram. But this is not true all the time. For example, in figure 5, the green pixel marked with a '*' is the local highest pixel nearest to the maximum of the x-axis projection, but it is not the head; another example is shown in figure 6, in which the green pixel marked with a '*' is again the nearest local highest pixel, but it is not the head either.
Figure 6 Example in which head is not near the max of x-axis projection.
So we need to add more constraints to locate the head. The first constraint is that the head must be in the upper body; the second is that the head position cannot change much between two neighboring frames. If we cannot locate the head in the first frame, we locate it manually at the beginning. With these two constraints, we can process the following frames and locate the head, whose position is taken as the midpoint between the start and the end of the local highest pixels.
The 4 images in figure 7 show how the head position changes over time. The left column shows the results obtained when ignoring information from previous frames; these results are obviously incorrect. The right column shows the results obtained when considering the information from previous frames; these are the correct results. The upper row shows the head x-axis position and the bottom row shows the head y-axis position.
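The head search under these two constraints can be sketched as follows. This is a simplified illustration rather than our exact code: the function names, the `upper_limit_y` threshold standing in for "upper body", and the `max_jump` distance between frames are all hypothetical parameters, and image coordinates with y growing downward are assumed.

```python
import math


def local_highest_runs(contour):
    """Return (start, end) index pairs of locally highest plateaus.

    contour is a closed list of (x, y) points in traversal order,
    with y growing downward, so "highest" means smallest y.
    """
    n = len(contour)
    ys = [p[1] for p in contour]
    # Start the scan at a lowest point so that no top plateau wraps around.
    offset = ys.index(max(ys))
    runs = []
    k = 0
    while k < n:
        m = k
        while m + 1 < n and ys[(offset + m + 1) % n] == ys[(offset + k) % n]:
            m += 1
        y = ys[(offset + k) % n]
        before = ys[(offset + k - 1) % n]
        after = ys[(offset + m + 1) % n]
        if before > y and after > y:
            runs.append(((offset + k) % n, (offset + m) % n))
        k = m + 1
    return runs


def pick_head(contour, runs, prev_head, max_jump, upper_limit_y):
    """Pick the run midpoint that is in the upper body and closest to prev_head."""
    best = None
    for start, end in runs:
        (x0, y0), (x1, y1) = contour[start], contour[end]
        mid = ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
        if mid[1] > upper_limit_y:
            continue  # below the assumed upper-body limit
        jump = math.hypot(mid[0] - prev_head[0], mid[1] - prev_head[1])
        if jump > max_jump:
            continue  # moved too far since the previous frame
        if best is None or jump < best[0]:
            best = (jump, mid)
    return None if best is None else best[1]


# Toy contour: a raised hand at (1, 0) is higher than the head plateau at y = 1.
contour = [(0, 10), (0, 4), (1, 0), (2, 4), (3, 5),
           (4, 1), (5, 1), (6, 1), (7, 5), (8, 10)]
runs = local_highest_runs(contour)
head = pick_head(contour, runs, prev_head=(5, 1), max_jump=3, upper_limit_y=5)
```

In the toy contour the raised hand is the highest point, but it is rejected because it is too far from the previous head position, so the midpoint of the head plateau is returned.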
Figure 7 Head position change
So, finally, the head position is the midpoint between the start point and the end point of the local highest pixels.
After we get the head position, we travel along the contour clockwise from the head. We then apply the convex hull algorithm to locate the hands and feet. Most of the time (from the 1st frame to the 46th frame), the convex hull algorithm works very well: the hands and feet are included in the convex hull set, although we do not yet know which hull points are the hands and feet. In the last frame (the 48th frame), the convex hull algorithm gives the knees instead of the feet, as shown in figure 8.
Figure 8. Cases in which convex hull algorithm doesn't work.
Since there are usually multiple convex corners near a body part tip, we merge them: if several corners are very close to each other, we treat them as one corner, and the new corner is the middle point of the combined corners.
We then get plots in which the frame number is shown on the x-axis and the ratio of the contour point index to the total number of contour points is shown on the y-axis. We use the head position detected in the previous step as both the starting point and the end point of the contour, so the ratios 0 and 1 correspond to the same position on the contour. The head detected by the convex hull algorithm is pretty close to the position detected in the previous step, but for some frames the convex hull algorithm does not work for head detection. Figure 9 shows 5 curves, for the two hands, the two feet and the head. In some frames, the convex hull algorithm gives more than 4 corners (excluding the head), and we delete the extras according to the context.
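As an illustration of how hull corners can be found on a contour, here is a small sketch using Andrew's monotone chain algorithm. The report does not depend on this particular hull algorithm; it is chosen here only because it is short and standard, and the function name and toy contour are assumptions for the example.

```python
def convex_hull(points):
    """Return the convex hull vertices of a set of 2D points (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); its sign gives the turn direction.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Toy contour: a square with a concave notch at (2, 2).
contour = [(0, 0), (4, 0), (4, 4), (2, 2), (0, 4)]
hull = convex_hull(contour)
# Contour indices of the hull corners, i.e. the body part candidates.
hull_indices = sorted(contour.index(p) for p in hull)
```

The notch point at index 3 is concave and therefore not a hull corner; only the remaining contour indices are kept as candidates for the head, hands and feet.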
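The merging of nearby corners can be sketched as below, working on contour indices. The function name `merge_close_corners` and the `max_gap` threshold are hypothetical, and for simplicity this sketch does not merge corners across the point where the contour index wraps around.

```python
def merge_close_corners(indices, max_gap):
    """Group corner indices that are within max_gap of their predecessor
    and replace each group by the middle index between its first and last corner."""
    groups = []
    for i in sorted(indices):
        if groups and i - groups[-1][-1] <= max_gap:
            groups[-1].append(i)
        else:
            groups.append([i])
    return [(g[0] + g[-1]) // 2 for g in groups]


# Three corners near index 10-13 and two near 40-41 collapse to one corner each.
merged = merge_close_corners([10, 12, 13, 40, 41, 90], max_gap=3)
```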
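The y-axis value of these plots can be computed as in the short sketch below. The function name is hypothetical; the only assumption is that contour points are indexed in traversal order and that the head index marks ratio 0.

```python
def contour_ratio(index, head_index, contour_length):
    """Position of a contour point as a fraction of the contour length,
    measured along the traversal direction starting from the head."""
    return ((index - head_index) % contour_length) / contour_length


# With 400 contour points and the head at index 100:
ratio_after_head = contour_ratio(150, 100, 400)
ratio_before_head = contour_ratio(50, 100, 400)
```

A point just before the head in traversal order gets a ratio close to 1, which is why ratios 0 and 1 describe the same place on the contour.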
Figure 9. Primary body part position in the contour from convex hull analysis for the 48 frames.
Figure 10 Two hands and two feet position in the contour from convex hull analysis for the 48 frames.
Finally, we get the detected body parts for all 48 frames, in which the green point at the top represents the head and the 4 red points represent the hands and feet.

Here is the result of the badminton sequence using the same head detection and convex hull algorithm.

For the other approach, from paper no. 2, since we have only one viewpoint and are only interested in body part detection, we borrowed the idea of curvature analysis for hand and foot detection and implemented it in combination with the head detection algorithm from paper no. 1.
Along the contour of the silhouette, we calculate the curvature at each point, and a typical curvature plot is as follows:
Figure 12 Curvature of the contour.
The curvatures of the convex points are negative and those of the concave points are positive. Moreover, hands and feet tend to be local minimum points. So we can roughly locate the hands and feet by finding local minima. We mark all local minima that are less than -0.1 with a black '*'. If two minima are very close to each other, we combine them into one point, which is marked with a red 'o'. Figure 13 shows the resulting hand and foot positions in the contour from curvature analysis for the 48 frames.
Figure 13 Hands and feet position in the contour from curvature analysis for the 48 frames.
The next step is to limit the number of body parts in each frame to exactly 4: left and right hand, left and right foot. The positions of these primary parts in previous frames are used here: since the positions of the hands and feet cannot change dramatically between two consecutive frames, we eliminate the extra false alarm points produced by the curvature analysis. After this step, we have the following body part positions for the video sequence:
Figure 14 Improved hand and feet position in the contour from curvature analysis for the 48 frames.
Finally, we get the detected body parts for all 48 frames with the aid of curvature analysis.
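The curvature computation and the thresholded local-minimum search can be sketched as follows. This is a minimal version under stated assumptions: the contour is smoothed with a Gaussian along the point index, the signed curvature is computed as (x'y'' - y'x'') / (x'^2 + y'^2)^(3/2) with finite differences, and the traversal direction is chosen so that convex points come out negative. The function names and the kernel radius of 3 sigma are choices made for the example.

```python
import math


def gaussian_kernel(sigma):
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_closed(values, kernel):
    """Circular convolution, since the contour is a closed curve."""
    n = len(values)
    radius = len(kernel) // 2
    return [sum(kernel[k + radius] * values[(i + k) % n]
                for k in range(-radius, radius + 1))
            for i in range(n)]


def curvature(contour, sigma):
    """Signed curvature at each point of a closed contour."""
    kernel = gaussian_kernel(sigma)
    xs = smooth_closed([p[0] for p in contour], kernel)
    ys = smooth_closed([p[1] for p in contour], kernel)
    n = len(contour)
    kappa = []
    for i in range(n):
        dx = (xs[(i + 1) % n] - xs[i - 1]) / 2.0
        dy = (ys[(i + 1) % n] - ys[i - 1]) / 2.0
        ddx = xs[(i + 1) % n] - 2.0 * xs[i] + xs[i - 1]
        ddy = ys[(i + 1) % n] - 2.0 * ys[i] + ys[i - 1]
        denom = (dx * dx + dy * dy) ** 1.5
        kappa.append((dx * ddy - dy * ddx) / denom if denom > 0 else 0.0)
    return kappa


def convex_candidates(kappa, threshold=-0.1):
    """Indices of local curvature minima below the threshold."""
    n = len(kappa)
    return [i for i in range(n)
            if kappa[i] < threshold
            and kappa[i] <= kappa[i - 1]
            and kappa[i] <= kappa[(i + 1) % n]]


# Synthetic test shape: a thin ellipse whose two sharp tips act like limb ends.
n = 200
contour = [(10.0 * math.cos(-2.0 * math.pi * i / n),
            2.0 * math.sin(-2.0 * math.pi * i / n)) for i in range(n)]
kappa = curvature(contour, sigma=1.0)
tips = convex_candidates(kappa)
```

On this synthetic ellipse the two sharp tips are the only candidates, while the flat sides stay above the -0.1 threshold. A larger `sigma` smooths away small bumps, which is how the Gaussian parameter can be used to suppress false alarm points.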
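One simple way to apply this frame-to-frame constraint is a greedy nearest-neighbour assignment, sketched below. This is an assumed simplification, not necessarily the exact rule we used: each tracked part takes the closest unused candidate within a hypothetical `max_jump` distance, and keeps its previous position if no candidate is close enough.

```python
import math


def keep_tracked_parts(prev_parts, candidates, max_jump):
    """Assign each tracked part the nearest candidate from the current frame.

    prev_parts maps a part name to its (x, y) position in the previous frame.
    Candidates that are not assigned to any part are treated as false alarms.
    """
    remaining = list(candidates)
    current = {}
    for name, prev in prev_parts.items():
        nearest = None
        if remaining:
            nearest = min(remaining,
                          key=lambda c: math.hypot(c[0] - prev[0], c[1] - prev[1]))
        if nearest is not None and math.hypot(nearest[0] - prev[0],
                                              nearest[1] - prev[1]) <= max_jump:
            current[name] = nearest
            remaining.remove(nearest)
        else:
            # Assumption: hold the previous position when nothing plausible is found.
            current[name] = prev
    return current


prev_parts = {
    "left_hand": (10, 20),
    "right_hand": (50, 20),
    "left_foot": (20, 90),
    "right_foot": (40, 90),
}
# Six candidates from the curvature analysis; (30, 55) and (30, 10) are false alarms.
candidates = [(11, 21), (49, 19), (30, 55), (21, 91), (41, 89), (30, 10)]
tracked = keep_tracked_parts(prev_parts, candidates, max_jump=5)
```

The result always contains exactly the 4 tracked parts, and the two extra candidates are dropped because no part was near them in the previous frame.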

Result of the badminton player

5. Comparison, Conclusion and Future Work
We have implemented simplified versions of the two approaches for human body part detection. The advantage of both approaches is that they use only the silhouette information and avoid color, texture and other complicated features.
The first step of both algorithms is to detect the head, and all other body parts are identified by their position relative to the head. Thus if we cannot obtain the head position, we may continue to detect the body parts, but we cannot track them accurately. Figures 16 and 17 show cases in which the head position is really hard for our algorithms to detect. In figure 16, the head overlaps with the arms, and in figure 17, the head is surrounded by the arms and we cannot find the local highest position of the head.
Figure 16 Example of head detection failure Figure 17 Yet another example of head detection failure

After the head detection, one algorithm uses convex hull analysis and the other uses curvature analysis as an aid. With both approaches we obtain a final human body part tracker. Comparing the final results of our implementation, however, the second approach, which uses curvature analysis as an aid, seems to work better than the first.
For the first approach, using the convex hull analysis of paper no. 1, we found some problems in our implementation, such as the situations in the following two frames. In figure 18, the hand is not properly detected: we detected the left wrist in the 1st frame. In figure 19, the foot is not properly detected: the right knee is detected in the 46th frame. What we need are the fingers and toes.
Figure 18 Incorrect hand detection example Figure 19 Incorrect foot detection

This is due to a limitation of the convex hull algorithm: not all the primary parts lie on the convex hull for every posture. With the help of curvature analysis, we can overcome this problem, since the hands and feet tend to be at points where the curvature has a local minimum. Together with the constraints on the positions of body parts in consecutive frames, we finally get a pretty nice body part tracker.
Since we work in a single-camera scenario, neither algorithm can guarantee detection of all body parts when there is self-occlusion. For example, when the hand overlaps the leg (Figure 20), we cannot see her hand in the silhouette, only the elbow. Coupled with the fact that the silhouettes we used are very noisy, this may produce many false detections in the results. In our results, the convex hull algorithm gives more false detections than the curvature analysis algorithm. The reason is that there is no way to assign weights to different convex hull corners, so it is very hard to eliminate the false alarm points in the convex hull algorithm. In the curvature analysis algorithm, on the other hand, the importance of each point is indicated by its curvature: the larger the absolute value, the more important the point. We can then choose the most significant points as the body parts. Moreover, when calculating the curvature, we may change the smoothness parameter (the sigma of the Gaussian filter) to eliminate potential false alarm points at that stage. As shown in Figure 21, the curvature analysis algorithm gives fewer false alarm points.
Figure 20 The limitation of using silhouette

Figure 21 Primary body part candidates before false alarm deletion for the badminton sequence. The left is generated by Convex Hull algorithm, and the right is by Curvature.
When most of the primary body parts appear in the silhouettes, we can detect, identify and track them very well. However, due to the limitation of a single camera and the lack of color and texture features, in some scenarios we may fail to detect some body parts, or we may misidentify them because of ambiguity, and the tracking then fails. For example, in figure 21, the middle two curves converge into one and then diverge again, which means that the two legs gradually overlap and then separate. In that case, we cannot tell which curve is the left leg and which is the right one. Another typical example is a turning sequence, which may introduce more ambiguities. Loose clothes may also make good tracking more difficult.
Example of failed tracking

In the future, we will introduce more image features into the detection and tracking to make the algorithm more robust. Using multiple camera views is another option to increase the robustness. The positions of the primary body parts are good clues for determining the character's posture, and our ultimate goal is to reconstruct the posture in 3D.
All videos are .mov files, and we recommend QuickTime.
Head is marked in Green or Blue, and hands and feet are in Red.
Dance Footage (840 KB)
Dance Detection & Tracking by Convex Hull (944 KB)
Dance Detection & Tracking by Curvature Analysis (944 KB)
Badminton Footage (1.7 MB)
Badminton Detection & Tracking by Convex Hull (1.92 MB)
Badminton Detection & Tracking by Curvature Analysis (1.92 MB)