NSF Report - Facial Expression Understanding
III-B. Computer Vision and Face Processing

Thomas S. Huang, Peter Burt, and Kenji Mase

Abstract. Computer vision deals with the problem of scene analysis; more specifically, the extraction of 3D information about scenes/objects from 2D (possibly time-varying) images obtained by sensors such as television cameras. Over the years, many algorithms have been developed for determining 3D shape, texture, and motion. In this session, some of the major approaches are reviewed in light of applications to face structure/motion analysis. Emphasis is on methods of estimating 3D rigid and nonrigid motion/structure from 2D image sequences.

Presenters: T. S. Huang, P. Burt, K. Mase

Introduction

In this brief tutorial, we review computer vision with special emphasis on face processing. From an engineering viewpoint, the goal of computer vision is to build automated systems capable of analyzing and understanding 3D and possibly time-varying scenes. The inputs to such systems are typically two-dimensional (2D) images taken, for example, by television cameras, although sometimes direct range sensors are also used.

There are three levels of tasks in computer vision systems.

(i) Level 1 (lowest): Reconstruction -- to recover 3D information, both geometrical (shape) and photometric (texture), from the sensed data.
(ii) Level 2 (middle): Recognition -- to detect and identify objects.
(iii) Level 3 (highest): Understanding -- to figure out what is going on in the (possibly time-varying) scene.

Representation

Representation is a central issue in object recognition and scene understanding. A desirable representation is easy to construct (from sensed data), easy to update, easy to use (for particular applications such as object recognition), and efficient to store. There are two main classes of 3D shape representation methods: surface and volume (Chen & Huang, 1988; Agin & Binford, 1973; Requicha, 1980).
The spatial relationships between objects can be represented by relational graphs (structures) (Barrow & Popplestone, 1971). Most work on object representation in robot vision has been on the geometrical aspects. Relatively little has been done on the photometric aspects of object representation.

Reconstruction

By reconstruction, we mean the determination of the 3D coordinates of points on the object surface and some of the reflectance properties. Again, most work in computer vision has been on the geometrical aspects.

The use of laser range finders is the most direct way of obtaining ranges of points on an object surface. A laser beam hits a surface point, and the reflected light is caught by a sensor. The range of the point is determined by either the time-of-flight or the phase shift of the beam. By scanning the laser beam, one can obtain a range map. A recent example is the ERIM laser scanner, which uses the phase-shift method. It gives a 256 x 256 range map as well as several bands of intensity images in about 2 seconds. The maximum range is about 32 feet, and the accuracy about plus or minus 0.5 feet.

The 3D shape of an object surface can be determined from a single 2D image of it, if a regular structure (such as a line array or a square grid) is projected on it (Will & Pennington, 1971). For example, if a square grid is projected on a plane (in 3D), then the images of the grid lines are still straight lines, and from the skewness of the grid, one can determine the orientation of the plane.

Another method is laser illuminated triangulation. The method involves a laser and a camera. The geometry of the setup is known. The laser beam illuminates a spot on the object surface. An image of the surface is taken, and the 2D coordinates of the image of the spot are measured. Then, by triangulation, the 3D coordinates of the point are determined. By changing the direction of the laser beam, one can obtain 3D coordinates of different points on the surface.
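
The triangulation step in laser illuminated triangulation can be illustrated with a minimal sketch. It assumes an ideal pinhole camera at the origin looking along the z axis, with a known focal length, and a laser whose position and beam direction are known in the same camera frame. The function name `triangulate_spot` and all parameter names are hypothetical and chosen only for illustration. Because measured rays rarely intersect exactly, the sketch returns the midpoint of the shortest segment joining the camera ray and the laser ray.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(v):
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]


def triangulate_spot(image_xy, focal_length, laser_origin, laser_dir):
    """Estimate the 3D position (camera frame) of a laser spot.

    Assumptions (not from the report): pinhole camera at the origin,
    optical axis along +z, image coordinates in the same units as
    focal_length, laser pose known in the camera frame.
    """
    # Ray from the camera center through the measured image point.
    c = _unit([image_xy[0], image_xy[1], focal_length])
    # Ray along the laser beam.
    d = _unit(laser_dir)
    o = list(laser_origin)

    # Find s, t minimizing |s*c - (o + t*d)|^2 (closest points on both rays).
    k = _dot(c, d)
    det = 1.0 - k * k
    if det < 1e-12:
        raise ValueError("camera ray and laser beam are parallel")
    a, b = _dot(c, o), _dot(d, o)
    s = (a - k * b) / det
    t = (k * a - b) / det

    p_cam = [s * x for x in c]
    p_laser = [oi + t * di for oi, di in zip(o, d)]
    # Midpoint of the shortest segment between the two rays.
    return [0.5 * (p + q) for p, q in zip(p_cam, p_laser)]


# Synthetic check: project a known surface point and recover it.
true_point = [0.1, -0.05, 2.0]
f = 0.05
spot_xy = (f * true_point[0] / true_point[2], f * true_point[1] / true_point[2])
laser_origin = [0.3, 0.0, 0.0]
laser_dir = [p - o for p, o in zip(true_point, laser_origin)]
estimate = triangulate_spot(spot_xy, f, laser_origin, laser_dir)
```

Repeating the call for each beam direction gives the 3D coordinates of different surface points, as described above.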
The classical technique of photogrammetry (Moffitt & Mikhail, 1980) is passive stereo. While the methods described above are "active" in the sense that the illumination is controlled, this method is "passive" -- the illumination is natural. Two images of an object are taken from different viewpoints. Then:

(i) corresponding points are found between the two images;
(ii) for each corresponding pair, the 3D coordinates of the point are determined by triangulation.

The first step, that of finding correspondences, is extremely difficult. Typically, for each point in one of the images, one aims to find the corresponding point in the other image. This is usually done by some sort of cross-correlation. In computer vision, seminal work in stereo was done by Marr (1982). Usually, it is not possible to find point correspondences in some regions of the images (e.g., where the intensity is almost uniform). Thus, 3D surface fitting (interpolation) is necessary (Grimson, 1983).

The most difficult problem in reconstruction is to extract 3D information about an object from a single 2D image. The pioneering work is that of Horn (1986), who investigated the problem of shape from shading, i.e., extracting surface orientation information about an object from a 2D intensity image of it. Equations can be written relating the normal direction at a 3D surface point to the observed intensity of the corresponding image point; these involve the reflectance function at the 3D point and the illumination parameters. The difficulty is that there are too many unknown variables, making the problem basically underdetermined.

Witkin (1981) was the first to investigate the recovery of surface orientation from texture. If we assume that the surface texture is actually isotropic in 3D, then from the anisotropy of the observed image texture, the surface orientation in 3D may be deduced.
For example, we can estimate the orientation of a planar lawn relative to the camera geometry from the local density variation of the grass in the image.

Kanade (1981) was the first to investigate the recovery of surface orientation from 2D shapes in the image. If we can assume that the 3D object has certain symmetry, then surface orientation may be recovered from the skew symmetry of the image. For example, if we can assume that an observed 2D ellipse is the perspective view of a circle in 3D, then the orientation of the circle may be determined.

More recently, interesting work has been done using color (Lee, 1986) and polarization (Wolff, 1988).

Object Recognition

The problem is to recognize a 3D object from one (or sometimes several) 2D views of it. By recognition, one usually means classifying the object into one of a set of prescribed classes. In an easy case, all objects belonging to the same class look almost exactly the same, and the number of classes is small. In a harder case, the number of classes is very large -- e.g., in fingerprint identification. In the hardest case, the classes are generic, i.e., objects in the same class may look very different -- e.g., recognizing a "chair."

The main approach to recognition of a 3D object from a single 2D view is as follows. We have a 3D model for each class. The observed 2D view of the unknown object is compared with many 2D views generated from the 3D models of candidate classes. The best match determines the class. In practice, a major problem is how to reduce the search space (the number of candidate classes, and the number of 2D views from each 3D model) by using a priori information, heuristics, etc. For example, in vehicle classification, detecting and counting the wheels can help to limit the number of candidate classes. Thus, if there are more than two wheels in the side view of a vehicle, it cannot be a sedan (Leung & Huang, 1992).
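
The matching scheme described above, together with the wheel-count pruning, can be sketched as follows. This is only an illustration under strong assumptions: each 2D view is reduced to a short numeric descriptor, the descriptors of the model views are already available, and Euclidean distance is used as the matching criterion. The names `classify_view`, `model_views`, and `max_wheels`, and all numeric values, are hypothetical and do not come from the report or from Leung & Huang (1992).

```python
import math


def classify_view(observed, model_views, max_wheels, wheels_seen):
    """Return (best_class, distance) for an observed 2D view descriptor.

    observed     -- descriptor of the unknown object's 2D view
    model_views  -- {class_name: [descriptor, ...]}, one descriptor per
                    2D view generated from that class's 3D model
    max_wheels   -- {class_name: most wheels visible in a side view}
    wheels_seen  -- number of wheels detected in the observed view
    """
    # Prune classes that cannot show this many wheels in a side view.
    candidates = [
        name for name in model_views
        if wheels_seen <= max_wheels.get(name, math.inf)
    ]

    best_name, best_dist = None, math.inf
    for name in candidates:
        for view in model_views[name]:
            dist = math.dist(observed, view)
            if dist < best_dist:
                best_name, best_dist = name, dist
    return best_name, best_dist


# Toy data: two classes, two generated views each (made-up descriptors).
model_views = {
    "sedan": [[1.0, 0.5], [0.9, 0.6]],
    "truck": [[2.0, 1.0], [1.1, 0.55]],
}
max_wheels = {"sedan": 2, "truck": 3}

with_three_wheels = classify_view([1.0, 0.52], model_views, max_wheels, 3)
with_two_wheels = classify_view([1.0, 0.52], model_views, max_wheels, 2)
```

With three wheels detected, the sedan views are never compared, so the search is both smaller and less likely to return a physically impossible class.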
A general way of reducing the number of 2D views of a 3D model (which one has to generate and compare with the 2D view of the unknown object) is the concept of the aspect graph (Koenderink & van Doorn, 1979; Eggert & Bowyer, 1989; Ponce & Kriegman, 1989). An aspect graph enumerates all possible "qualitative" aspects an object may assume. We partition the viewing space into regions such that viewpoints within the same region give 2D views of the object which are "qualitatively" similar. A change in "quality" occurs only when the viewpoint crosses a region boundary. Thus, it is necessary to compare the 2D view of the unknown object with only one typical 2D view from each region. Of course, for a given application one has to define precisely what is meant by "quality," and the definition of quality will obviously influence the matching criterion. For example, if the 3D object is a polyhedron, then the "quality" of a 2D view could be the numbers of faces, edges, and vertices.

Motion Analysis and Face Processing

The role of motion in recognizing facial expressions

Automated analysis of image motion can play several roles in monitoring and recognizing human facial expressions. A motion system is needed to detect and track the head and face. Then, within the face, a motion system is needed to track motions or deformations of the mouth, eyes, and other facial structures involved in the generation of expressions. While expressions can often be recognized from a single snapshot of a face, the ability to discriminate subtle expressions requires a comparison over time as the face changes shape. In addition, the temporal pattern of changes in facial expression observed through motion analysis may carry important information. Particularly rapid and precise analysis of lip motion is needed for lip reading.
At the other extreme, a motion analysis system is needed to monitor the gross motions of the head, body and hands to provide gesture recognition for next generation computer interfaces (Huang & Orchard, 1992).

Three approaches to motion analysis

The approaches to motion analysis that have been developed in computer vision fall broadly into three classes (Huang, 1987). Each has advantages and disadvantages for applications to the recognition of facial expressions.

Feature Tracking: In the feature tracking approach to motion analysis, motion estimates are obtained only for a selected set of prominent features in the scene. Analysis is performed in two steps: first, each image frame of a video sequence is processed to detect prominent features, such as edges or corner-like patterns; then the features are matched between frames to determine their motion. An advantage of this approach is that it achieves efficiency by greatly reducing image data prior to motion analysis, from a quarter million pixels to a few hundred features. A disadvantage is that motion vectors are not obtained for all points in the scene. When applied to a face, it may not find vectors for all parts of the face that contribute to an expression.

Flow: In the flow approach to motion analysis, motion vectors are estimated at a regular array of points over the scene. The motion vector at a given point can be based on local gradients in space and time at that point, or on the cross correlation of the pattern in the neighborhood of the point between successive frames. Advantages of the flow approach are that a dense array of motion vectors is obtained, and that processing is well suited for special purpose hardware. Disadvantages are that local estimates of flow tend to be noisy and that prodigious computations are required.
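
The correlation variant of the flow approach can be sketched in a few lines. The example below is a minimal illustration, not the method of any particular system: for a point in one frame it searches a small range of displacements in the next frame and keeps the one whose neighborhood has the highest normalized cross-correlation with the original neighborhood. The function names, the window half-width `half`, and the search radius `search` are assumptions made for the example, and images are plain nested lists of gray values.

```python
import math
import random


def patch(img, r, c, half):
    """Flatten the (2*half+1) x (2*half+1) neighborhood centered at (r, c)."""
    return [img[i][j]
            for i in range(r - half, r + half + 1)
            for j in range(c - half, c + half + 1)]


def ncc(a, b):
    """Normalized cross-correlation of two equal-length patches."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den if den > 0 else 0.0


def match_block(prev, curr, r, c, half=2, search=3):
    """Return the displacement (dy, dx) of the pattern at (r, c) in prev."""
    margin = half + search
    if not (margin <= r < len(prev) - margin and
            margin <= c < len(prev[0]) - margin):
        raise ValueError("point too close to the image border")
    ref = patch(prev, r, c, half)
    best_score, best_dy, best_dx = -2.0, 0, 0
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            score = ncc(ref, patch(curr, r + dy, c + dx, half))
            if score > best_score:
                best_score, best_dy, best_dx = score, dy, dx
    return best_dy, best_dx


# Synthetic pair of frames: a random texture moved down 1 and right 2 pixels.
rng = random.Random(0)
size = 20
prev_frame = [[rng.random() for _ in range(size)] for _ in range(size)]
curr_frame = [[prev_frame[(i - 1) % size][(j - 2) % size]
               for j in range(size)] for i in range(size)]

# Motion vectors on a regular (here very coarse) array of points.
flow = {(r, c): match_block(prev_frame, curr_frame, r, c)
        for r in (8, 12) for c in (8, 12)}
```

Even this toy version evaluates 49 candidate displacements of 25 pixels each for every grid point, which gives a sense of why dense flow is computationally demanding. On real images with uniform regions, the correlation peak is much less distinct than on this random texture, which is one source of noisy local estimates.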
Pattern Tracking: Pattern tracking is similar to feature tracking in that patterns are first located in each image frame, then these are followed from frame to frame to estimate motion. It differs from feature tracking in that "high level" patterns are used that are indicative of the objects of interest, rather than "low level" generic features, such as edge elements. In the case of face tracking, a pattern-based method might search for an entire face, using a set of face templates to span a range of face types, orientations, and scales, or it might search first for distinctive parts of the face, such as the eyes. Efficient techniques restrict the search to regions of the scene where a face is likely to occur based on where it was found in the previous frame, and reduce the resolution and sample density of the source images to the minimum sufficient for the task. An advantage of the pattern tracking approach is that it is directed to just the objects of interest in the scene, such as faces; motion is not computed for extraneous background patterns. A disadvantage is that an enormous number of relatively complex patterns may need to be searched to identify faces over a wide range of viewing conditions.

Model based alignment

Advanced methods for motion analysis make use of models to constrain the estimation process. Models range from simple and generic to complex and object specific. In the first category are "smoothness" constraints that look for solutions to the motion equations that vary smoothly everywhere in a scene, except at object boundaries. These constraints are implicit in almost all approaches to motion analysis and reflect the fact that physical objects are locally rigid. In the second category are object model constraints that, in effect, align a 3D model of an object, such as a face and head, to the image data in order to estimate both object motion and orientation (pose).
The use of models can significantly improve the efficiency and precision of motion estimates since analysis considers only the motions that are physically possible. At the same time, it facilitates image interpretation by decomposing motion into physically meaningful components. For example, observed motions of a face can be separated into the motions of the head within the camera-centered coordinate system, and the motions of parts of the face within a head-centered coordinate system. This separation is essential for detecting subtle facial expressions.

State of technology

The analysis of image motion is perhaps the most active subarea of research in computer vision. Methods that provide sufficient detail and precision for recognizing facial expression and that can be computed at video rates are only now becoming feasible. Hardware capable of real time motion analysis is being built for specialized tasks, including video compression and moving target detection. Still, significant further development is required before "off-the-shelf" technology will be available for application to recognizing facial expressions.

Optic Flow Approach to Face Processing

Once the head position has been located by any means, we can examine the movements produced by facial actions. The human face has several distinctive features such as the eyes, mouth and nose. The facial skin of the cheeks and forehead has a fine-grained texture. These features and textures can be exploited as cues for extracting and then recognizing facial expressions by computer vision techniques. In the following, we give examples of applying existing computer vision techniques to expression recognition and lip reading, together with experimental results.

Extraction of facial movement

Muscle actions can be observed directly in an image sequence as optical flow, which is computed from the motion of facial features and the deformation of the skin.
A gradient-based optical flow algorithm has better characteristics for face analysis than other algorithms, such as correlation-based, filtering-based, and token-matching-based ones. For instance, it does not require feature extraction and tracking processes but only assumes smooth, short range translational motion. Dense flow information was computed from image sequences of facial expression with Horn and Schunck's (1981) basic gradient algorithm. The computed flow can capture skin deformation fairly well, except for the areas where the aperture problem and the motion discontinuity problem arise. The aperture problem refers to the fact that if a straight edge is in motion, locally it is possible to estimate only the component of the velocity that is orthogonal to the edge. Abrupt changes of the flow field cause problems because the Horn-Schunck algorithm uses the smoothness constraint.

Extracting muscle movement

The dense optical flow information is reduced to a measure of each muscle (group) action by taking the average length of the directional components along the major directions of muscle contraction. Several muscle windows are located manually to define each muscle group, using feature points as references. The muscle windows are regarded as a muscle (group) model which has a single orientation of muscle contraction/relaxation. Experiments confirmed that some important actions could be extracted and plotted as functions of time. The muscle action derived from the muscle (group) model can be associated with several Action Units (AUs) of the Facial Action Coding System (Ekman & Friesen, 1978). Thus, for example, an estimated muscle motion of a happy expression could be scored as AUs 6(0.6) + 12(0.35) + 17(0.28) + 25(0.8) + 26(0.20). The figures in parentheses, e.g. 0.6 for AU 6, indicate the AU's strength, which is equated to the maximum absolute velocity (pixels/frame) within a short sequence.
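
The reduction from dense flow to a per-window muscle action can be sketched as follows. This is one plausible reading of the description above, not the authors' implementation: each flow vector inside a manually placed window is projected onto the window's assumed contraction direction, the projections are averaged to give a signed velocity per frame, and the strength is taken as the largest absolute value over a short sequence. The names `window_activity` and `action_strength`, the window format, and the toy flow values are all hypothetical.

```python
import math


def window_activity(flow, window, direction):
    """Mean flow component along `direction` inside a muscle window.

    flow      -- 2D grid (list of rows) of (u, v) vectors in pixels/frame,
                 u horizontal and v vertical
    window    -- (top, left, bottom, right), bottom/right exclusive
    direction -- (dx, dy) major direction of muscle contraction
    """
    norm = math.hypot(direction[0], direction[1])
    ux, uy = direction[0] / norm, direction[1] / norm
    top, left, bottom, right = window
    total, count = 0.0, 0
    for r in range(top, bottom):
        for c in range(left, right):
            u, v = flow[r][c]
            total += u * ux + v * uy  # signed length of the projection
            count += 1
    return total / count


def action_strength(flow_sequence, window, direction):
    """Maximum absolute window velocity over a short flow sequence."""
    return max(abs(window_activity(f, window, direction))
               for f in flow_sequence)


def uniform_flow(u, v, rows=4, cols=4):
    """Toy flow field with the same vector everywhere."""
    return [[(u, v) for _ in range(cols)] for _ in range(rows)]


# Three frames of toy flow: rest, contraction, partial relaxation.
sequence = [uniform_flow(0.0, 0.0),
            uniform_flow(0.6, 0.0),
            uniform_flow(-0.2, 0.0)]
cheek_window = (1, 1, 3, 3)        # hypothetical window placement
contraction_direction = (1.0, 0.0)  # hypothetical muscle orientation

per_frame = [window_activity(f, cheek_window, contraction_direction)
             for f in sequence]
strength = action_strength(sequence, cheek_window, contraction_direction)
```

The per-frame values form the kind of time function mentioned above, and the single strength value plays the role of a figure in parentheses in the AU score.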
Recognition of facial expression

Since the optical flow of a facial image sequence contains rich information about the facial actions of various expressions, conventional pattern classification techniques as well as neural networks can be used for recognition of expressions. As a preliminary experiment, a feature vector whose elements are the means and the variances of optical flow data in evenly divided small regions (blocks) is used. The dimensionality is reduced to make the vector concise by introducing a separation goodness criterion function, which is similar to Fisher's criterion (Mase, 1991). In a recognition experiment on four expressions in motion (happiness, anger, surprise, and disgust), 19 out of 22 test data were correctly identified (Mase & Pentland, 1990a).

Lipreading

Approaches similar to those used for facial expression recognition can be used for lip reading (Duda & Hart, 1973). Four muscle windows are located around the mouth (above, below, left, and right) to extract optical flow data related to speech generation. The mean values of each flow vector component within a window are computed, as in expression recognition. Principal component analysis of the training set of English digits produces two principal parameters, i.e., mouth opening and elongation, which are also intuitively plausible. Experiments on continuously spoken data from four speakers show over 70% accuracy of digit recognition, including word segmentation. Since optical flow becomes zero when motion stops or reverses, the approach is well suited to temporal segmentation of facial expressions as well as lipreading.

Note: This report was prepared by T. S. Huang, P. Burt, and K. Mase, based on Workshop presentations by these authors, and edited by T. S. Huang.