Essential Behavioral Science of the Face and Gesture that Computer Scientists Need to Know
Joseph C. Hager & Paul Ekman
Introduction
Scientific research on topics such as nonverbal communication, body movement, and facial expression is a difficult and slow effort because investigators lack tools that would make measuring relevant phenomena inexpensive, highly repeatable, and rapid. Advances in digital computing equipment and computational approaches to image processing, object recognition, and tracking provide a possible answer to the question of how to measure such events. This paper focuses on a different question, one that must not be overlooked in a fascination with technique: what to measure. Before providing an answer, we briefly consider some of the background and concepts important for justifying our position.
The study of how the human body participates in communication has a long, diverse history in the fields of philosophy, communications, psychology, and social sciences. No one definition has emerged as perfect for every theorist and situation, but the idea that changes in the appearance of the body produce an effect in another person is often at the core of definitions of nonverbal communication. Shannon and Weaver (1949) presented one influential approach to communication from an engineering perspective, useful in the present discussion. They distinguished information from the meaning of signals, and suggested the base 2 logarithm of the number of equally likely possible signals as a measure of information, similar to a measure of entropy. Natural animal communication, however, is characterized not by equally likely signals, but by Markov chains, in which each behavior in a sequence statistically constrains the subsequent behaviors. Thus, signals transmitted in such communication have a residual entropy (information) component and a redundancy component. Considering Shannon and Weaver's perspective, we derive a concept of what can be measured: the set of possible signals, the subset that is emitted in actual communication, the redundancies or constraints among signals, and the semantic import of these signals. In regard to the face, measurement of the signals themselves is the current bottleneck, and a most suitable goal for computer measurement. How does one measure signals?
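The information measures mentioned above can be illustrated with a minimal sketch. It assumes that signals have already been measured and recorded as a sequence of discrete labels, and it estimates a first-order Markov model from that sequence. The function names and the example labels are hypothetical, chosen only for illustration; they do not come from Shannon and Weaver or from any particular study.

```python
import math
from collections import Counter, defaultdict


def max_information_bits(n_signals):
    """Information per signal, in bits, if all n_signals were equally likely."""
    return math.log2(n_signals)


def conditional_entropy_bits(sequence):
    """Residual entropy of the next signal given the current one.

    Transition probabilities are estimated from adjacent pairs in the
    observed sequence (a first-order Markov assumption).
    """
    followers_of = defaultdict(Counter)
    for current, following in zip(sequence, sequence[1:]):
        followers_of[current][following] += 1
    total_pairs = len(sequence) - 1
    entropy = 0.0
    for followers in followers_of.values():
        n_current = sum(followers.values())
        for count in followers.values():
            p_joint = count / total_pairs   # P(current, next)
            p_cond = count / n_current      # P(next | current)
            entropy -= p_joint * math.log2(p_cond)
    return entropy


def redundancy(sequence):
    """Fraction of the maximum information removed by sequential constraints."""
    h_max = max_information_bits(len(set(sequence)))
    if h_max == 0:
        raise ValueError("redundancy is undefined for a single signal type")
    return 1.0 - conditional_entropy_bits(sequence) / h_max


# Hypothetical sequence of two signal types that strictly alternate:
# each signal fully determines the next, so the redundancy is complete.
strict = list("ABABABAB")
print(max_information_bits(len(set(strict))), redundancy(strict))
```

Note that nothing in this calculation requires knowing what the signals mean; it requires only that they be measured reliably.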
In a branch of philosophy and communications called semiotics, a distinction is made between the sign vehicle and the sign. The sign vehicle is the physical substrate from which the sign is composed, the energies that are the basis of the sign. Messages are composed of signs. The sign is a token that carries meaning, corresponding closely to the signal or symbol in Shannon's terminology, while the sign vehicle is the transmitter's medium and part of mechanisms for emitting signals. In the following, we argue that the best strategy to measure signals is to measure the sign vehicles on which they are based. As Shannon and Weaver imply, it is not necessary for the measurement of signals to understand their meaning. Indeed, you will see below that it can be counterproductive to focus on the message (meaning) when one is interested in measuring signals. An important error to avoid is the omission of certain signals from the measurement because they were prematurely considered to be meaningless. (Hager, 1983, reviewed non-signal functions of facial behaviors, which are not considered here as they have no special implications for measuring behaviors.)
Once signals can be measured, it is possible to calculate information and determine the meaning of the signals. For example, ethologists studying the displays of animals have used Shannon's method to measure information and channel capacity (see Smith, 1977, for a review). As discussed below, many studies on the emotional meaning of facial signs have been conducted.
Here, then, is the main point of this paper for those trying to measure behaviors involved in communication: measure the sign vehicles. For the face, this means measuring muscular actions by the visible changes they produce in bulges, bags, pouches, wrinkles, shapes and positions of facial features. Sign vehicles for other body behaviors include the muscular actions that produce changes in the location, position, orientation, shape, size, color, and other characteristics of the body. Allow researchers who specialize in the meaning of such signals to determine their interpretation, if any, with as much freedom from prior filtering and reduction as possible.
Gesture and Body Movement
Work on the analysis of gesture and body movement has, for the most part, taken one of two basic approaches, each explained below: measurement of body motion and orientation, or analysis of function.
Body motion and orientation
Perhaps the approach most compatible with a computerized measurement system focuses on the orientation and motion of the body in space. Rosenfeld (1982) summarized this approach, noting issues familiar in computer vision, such as frames of reference, segmentation, and feature location. There is no consensus about what aspects of body movement and orientation are most important to measure for much of this field, especially in areas where the concept of discrete signals is not important. The number of possible measurements is open ended, ranging, in Rosenfeld's terms, from "elementary systems [that] emphasize minimal components of movement" to "higher-order systems [that] emphasize complex functional configurations as their basic units" (p. 242). Measurements typically are highly specific to the research application, such as the orientation of a body part to a particular object in a room, but coarser measurements, such as gross body movement, might be applicable across a greater number of research projects.
Functional analysis
This approach focuses on relatively complex behaviors and their meaning. For example, Ekman & Friesen (1969) categorized the types of messages conveyed by nonverbal behaviors. Affect displays (emotions), including happiness, sadness, anger, disgust, surprise, and fear, convey messages about organismic states and are relatively automatic, involuntary, and stereotyped. The most specific of these messages are conveyed by the face, and some are the same in all known human cultures. Emblems are learned, culture-specific symbolic communicators, such as the wink, that convey messages similar to short verbal phrases. Adaptors are self-manipulative movements, such as lip biting, that help manage body function. Illustrators are actions that accompany and highlight speech, such as a raised brow and the sweep of a hand. Regulators are nonverbal conversational mediators, such as nods or smiles. The variety of configurations, orientations, and locations that equivalent behaviors at this level might assume would seem to make them a difficult candidate for computational measurement.
Facial Signal Systems
Ekman (1978) described the four general classes of sign vehicles by which the face conveys information. Static facial sign vehicles represent relatively permanent features of the face, such as the bony structure and soft tissue masses, that contribute to an individual's enduring appearance. Slow facial sign vehicles represent changes in the appearance of the face that occur gradually over time, such as the development of permanent wrinkles and changes in skin texture. Artificial sign vehicles represent features of the face determined artificially, such as eyeglasses and cosmetics. Rapid facial sign vehicles represent phasic changes in neuromuscular activity that lead to visually detectable changes in facial appearance. Ekman also discussed eighteen different classes of messages that can be derived from these sign vehicles.
The face's rapid sign vehicles are relevant to signals about emotion and cognitive state, with the other three classes providing noise or background. These movements of the facial muscles pull the skin and tissues, temporarily distorting the shape of the eyes, brows, and lips, and the appearance of folds, furrows and bulges in different patches of skin. The changes in facial muscular activity typically are brief, lasting a few seconds; rarely do they endure more than five seconds or less than 250 ms, although they can last for minutes or even hours, particularly in cases of crisis or pathology. The most useful terminology for describing or measuring facial actions refers to the production system -- the activity of specific muscles. These muscles may be designated by their Latin names or a numeric system. An alternative level of description involves terms such as smile, smirk, frown, sneer, etc., which are imprecise, ignoring differences between a variety of different muscular actions to which they may refer, and mixing description of the sign with inferences about meaning or the message which they may convey.
For the facial signal system, it can be stated without equivocation that if one is interested in emotional or cognitive signals, one must measure the rapid sign vehicles, based principally on muscular actions. Some investigators have tried to measure motion in the face by tracking features, artificially applied markers, distortion of grids, or some other basis. It cannot be stated too strongly that the relationship of such measures to concepts such as emotion is totally unknown, except to the extent that they are known to reflect the actions of specific muscles. The only substantial body of evidence for relationships between facial behavior and emotion (other than the trivial) is based on the muscles acting. These measurements might be useful if they are tied to the muscles, but this is not necessarily a straightforward task. For example, it is possible to construct by consulting reference manuals a table of the direction in which the eyebrows move under action of the different muscles that occur in different emotions, but no experienced facial scorer would base the scoring of a muscle on this one criterion because many other changes must be considered when scoring. Therefore, a measurement system based on verbal descriptions of how certain features move when particular muscles move would be imprecise. One could construct a measurement system by measuring sign vehicles or some features that were not mapped or referenced to specific muscles, but these measurements would then have to be validated as relevant to emotion separately from work that has validated the relationship of muscle actions to emotion. The best way to construct a computer measurement system is to understand the techniques now used to measure facial behaviors, including those that use the human as the measurer.
Current Techniques for Measuring the Rapid Facial Signals
Investigators have used numerous methods for measuring facial movements resulting from the action of muscles (see a review by Ekman, 1982, of 14 such techniques and Hager, 1985, for a comparison of the units of measurement), although only a few are based on the muscles acting. Of these, the Facial Action Coding System (FACS) (Ekman and Friesen, 1978) is the most comprehensive, widely used, and versatile. The sections below describe FACS and compare it to electromyography (EMG), the only machine measurement system in widespread use.
The Facial Action Coding System (FACS)
Ekman and Friesen's (1978) Facial Action Coding System (FACS) was developed by determining from palpation, EMG, knowledge of anatomy, videotapes and photographs how the contraction of each facial muscle (singly and in combination with other muscles) changes the appearance of the face. Videotapes of more than 5000 different combinations of muscular actions were examined to determine the specific changes in appearance that occurred and how best to differentiate one from another. Measurement with FACS is in terms of Action Units (AUs) rather than muscular units for two reasons. First, for a few changes in appearance, more than one muscle is combined into a single AU. It is not possible to reliably distinguish which specific muscle acts to produce the lowering of the eyebrow and the drawing of the eyebrows together, and therefore the three muscles involved in these changes in appearance are combined into one specific Action Unit. Likewise, the muscles involved in opening the lips are also combined. Second, FACS separates into two AUs the activity of the frontalis muscle, because the inner and outer portion of this muscle can act independently, producing different changes in appearance. There are 46 AUs which account for changes in facial expression, and 12 AUs which describe changes in gaze direction and head orientation in coarser terms.
FACS coders spend approximately 100 hours learning the basics of FACS. Self-instructional materials teach the anatomy of facial activity, i.e., how muscles singly and in combination change the appearance of the face. Prior to using FACS in research, learners are urged to score a videotaped test, to ensure they are measuring facial behavior in agreement with prior learners. To date, more than 300 people in all areas of the world have learned FACS and achieved inter-coder agreement on this test of proficiency, and many others have some degree of familiarity with this method, which has become a de facto standard for social, behavioral, and computer scientists studying the face.
A FACS coder "dissects" an observed expression, decomposing it into the specific AUs which produced the movement. The coder repeatedly views records of behavior in slowed and stopped motion to determine which AU or combination of AUs best account for the observed changes. The scores for a facial expression consist of the list of AUs which produced it. The focus on individual AUs minimizes the coder's inference about the emotional meanings of the expression. The precise duration of each action also is determined, and the intensity of each muscular action and any lateral asymmetry is rated. In the most elaborate use of FACS, the coder determines the onset (first evidence) of each AU, when the action reaches an apex (maximum excursion), the end of the apex period when it begins to decline, and when it disappears from the face completely (offset). These time measurements are usually much more costly to obtain than the decision about which AU(s) produced the movement, and in most research only onset and offset have been measured.
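To make concrete what a computer measurement system would need to output to match this kind of scoring, the following is a minimal sketch of one possible record for a scored action. The class name, field names, and the form of the intensity and asymmetry ratings are assumptions made for illustration; they are not a format defined by FACS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ActionUnitEvent:
    """One scored Action Unit with its timing, in seconds from record start."""
    au: int                               # Action Unit number, e.g. 1 or 4
    onset_s: float                        # first evidence of the action
    offset_s: float                       # action has disappeared completely
    apex_start_s: Optional[float] = None  # action reaches maximum excursion
    apex_end_s: Optional[float] = None    # action begins to decline
    intensity: Optional[int] = None       # hypothetical rating scale
    asymmetry: Optional[str] = None       # hypothetical label, e.g. "left"

    def __post_init__(self):
        # Whatever timing markers were scored must be in chronological order.
        markers = [self.onset_s, self.apex_start_s, self.apex_end_s, self.offset_s]
        known = [t for t in markers if t is not None]
        if known != sorted(known):
            raise ValueError("timing markers must be in chronological order")

    @property
    def duration_s(self) -> float:
        return self.offset_s - self.onset_s


def expression_score(events: List[ActionUnitEvent]) -> List[int]:
    """The score for an expression: the list of AUs that produced it."""
    return sorted({event.au for event in events})


# Hypothetical upper-face expression in which only onset and offset were
# scored for AU 4, while AU 1 also has apex timing.
brow_events = [
    ActionUnitEvent(au=4, onset_s=1.0, offset_s=3.5),
    ActionUnitEvent(au=1, onset_s=1.2, offset_s=3.0, apex_start_s=1.8, apex_end_s=2.4),
]
print(expression_score(brow_events))
```

The apex fields are optional because, as noted above, most research has measured only onset and offset.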
Facial Electromyography (EMG)
Neural activation results in the release of acetylcholine at motor neuron end plates in the striated muscles, producing muscle action potentials (MAPs). Electromyography (EMG) measures these MAPs. Activation of the facial muscles produces dynamic as well as configuration information about neuromotor processes underlying rapid facial signals (Cacioppo & Dorfman, 1987). Short or low intensity changes in these discharges can occur without producing feature distortions on the surface of the face, which enables EMG to measure activity that might not be visible, and, therefore, not a social signal. EMG has served as a useful complement to visible facial action coding systems (see review by Cacioppo, Tassinary, and Fridlund, 1990). The activity of individual muscle strands can be measured by inserting needle electrodes into tissue, but this method is painful, time-consuming, and highly intrusive. Usually in human research, electrodes are attached to the surface of the face, creating the possibility of crosstalk between MAPs generated from nearby muscles. Obviously, it is impossible to disguise the EMG researcher's interest in the subject's facial behavior, an awareness that could alter behavior. Even the placement of electrodes, with their collars, adhesives, and conducting gels, might physically interfere with the natural action of muscles or provide unnatural sensory feedback about facial movements. Finally, many electrodes might need to be attached to distinguish all the important muscular actions that the face can produce -- this correspondence has not been fully explicated. Nevertheless, EMG produces numbers related to a relevant physical process with minimal intervention of human judgement, a circumstance that inspires confidence in users and funding agencies. These latter features of a measurement tool should be duplicated by computer measurement, without the disadvantages listed above.
Evidence About Which Facial Actions Signal Which Emotions
The FACS scoring units are descriptive, involving no inferences about emotions. For example, the scores for an upper face expression might be that the inner corners of the eyebrows are pulled up (AU 1) and together (AU 4), not that the eyebrows' position shows sadness. Analysis can use these purely descriptive AU scores or FACS scores can be converted into other scores, such as their emotional relevance. This conversion involves working at the semantic level of communication, finding the meaning of actions in terms of emotion. How are relationships between facial actions and emotions established?
Although the relations between facial signals and emotion were originally based on theory, there is now considerable empirical support for the emotional meanings of many facial action patterns (for a review, see Ekman, 1994). The evidence is primarily based on observers' interpretations of facial expressions (e.g., judgements of pictures of facial expressions). Some research has examined how facial expressions relate to other responses the person may emit (i.e., physiological activity, voice, and verbal content) and to the occasion when the expression occurs. The findings that follow indicate the flavor of this evidence. Across cultures there is highly significant agreement among observers in categorizing facial expressions of happiness, sadness, surprise, anger, disgust, and fear. The experimental inductions of what individuals report as being positive and negative emotional states are associated with distinct facial actions, as are the reports of specific positive and specific negative emotions. Cultural influences can, but do not necessarily, alter these experimental outcomes significantly. These outcomes can be found in neonates and the blind as well as sighted adults, although the evidence on the blind and neonates is more limited than that for sighted adults. Emotion-specific activity in the autonomic nervous system appears to emerge when facial prototypes of emotion are produced on request, muscle by muscle. Different patterns of regional brain activity coincide with different facial expressions. The variability in emotional expressions observed across individuals and cultures is attributable to factors such as differences in which emotion, or sequence of emotions, was evoked and to cultural prescriptions regarding the display of emotions. Findings such as these are tied to the specific facial muscles that act. 
If new measurements produced by computer are not tied to specific muscles, then similar steps to validate the measurements as signals with emotional meaning must be conducted again.
Knowledge about the muscular actions that signal emotion can help define what a machine measurement technique must do. Textbook illustrations of expressions corresponding to categories of emotions are great simplifications of what is actually seen in natural communication, which is what social and behavioral scientists need to study. There is no single expression that is the one prototype for any emotion; instead, a prototype expression is more concept than natural fact, representing a high amplitude, highly redundant signal. The complete set of expressions corresponding to each emotion has not yet been identified, although there is consensus about the key muscular actions involved in several categories of emotion. Expressions like those in textbook illustrations are a relatively small proportion of emotion-related expressions observed in naturally occurring communication. Thus, a computer that could recognize only prototype expressions like those seen in textbooks would be useless to behavioral scientists, and it is not yet possible to specify a training set of all the emotion expressions. This example illustrates the potential hazards of measuring meaning as the base unit, rather than the signal. The solution, again, is to measure the sign vehicles, not merely signs with known meanings.
Still photographs of prototype expressions capture actions frozen at high intensities. In nature, not all the actions in a prototype may need to appear in order to signal an emotion. Even if all the prototype actions do occur, they may not start, reach their maximum excursion, offset, and end at the same time. The individual actions may be temporally dispersed, they may not reach the same level of intensity, or all may be at low intensity. Thus, a complete computer measurement system must be able to measure the temporal dynamics of each muscular action and their intensity in the complete range. Eventually, programs that reassemble individual actions into interpretable groups of actions based, in part, on such temporal information need to be written, guided by behavioral research. In addition to partial expressions of a prototype are expressions that blend signals of two or more emotions. These possibilities make for quite an extensive set of facial signals relevant to emotion.
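As a rough sketch of what reassembling individual actions by temporal information might involve, the code below groups scored actions whose time spans overlap or nearly overlap. The grouping rule, the max_gap_s tolerance, and the tuple layout (AU number, onset, offset, in seconds) are all assumptions made for illustration. Which rule is actually appropriate is a question for behavioral research, not something this sketch settles.

```python
def group_by_overlap(events, max_gap_s=0.0):
    """Group (au, onset_s, offset_s) tuples into temporally connected sets.

    An event joins the current group if its onset falls no later than
    max_gap_s after the latest offset seen so far in that group.
    """
    groups = []
    group_end = None
    for event in sorted(events, key=lambda e: e[1]):
        _, onset, offset = event
        if groups and onset <= group_end + max_gap_s:
            groups[-1].append(event)
            group_end = max(group_end, offset)
        else:
            groups.append([event])
            group_end = offset
    return groups


# Hypothetical scored actions: AUs 12 and 6 overlap early in the record,
# and AUs 1 and 4 overlap later, with different onsets and offsets.
scored = [
    (12, 0.0, 2.0),
    (6, 0.3, 1.8),
    (1, 5.0, 6.0),
    (4, 5.5, 6.2),
]
for group in group_by_overlap(scored):
    print(sorted(au for au, _, _ in group))
```

Because the grouping operates on individual actions with their own onsets and offsets, temporally dispersed and partial expressions are kept rather than being discarded for failing to match a prototype.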
Technological advances may make it possible to identify and track every hair and pore on the face, allowing very subtle distinctions in appearance. This possibility raises the question of how fine and detailed measurements need to be in order to capture the information relevant to signals investigators study. The complete answer is largely unknown, but some details of facial action relevant to emotion can illustrate these subtleties. A number of lines of evidence -- from physiological correlates to subjective feelings (reviewed in Ekman, 1992) -- support a distinction between emotional and nonemotional smiling, which is based on whether or not the muscle that orbits the eye (AU 6) is present with the muscle that pulls the lip corners up obliquely (AU 12). Emotional smiles are presumed to be involuntary and to be associated with the subjective experience of happiness and associated physiological changes. Nonemotional smiles are presumed to be voluntary, and not to be associated with happy feelings nor with physiological changes unique to happiness. This case illustrates that a relatively subtle distinction in appearance can have important implications for interpretation. An illustration for the temporal domain is that the 60 fields per second of NTSC video are not fast enough to determine unequivocally whether some lid movements are blinks (closures).
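Both illustrations can be expressed in a few lines, as a sketch under stated assumptions. The function name and the returned labels are hypothetical, and the rule checks only whether AU 6 and AU 12 appear together in a set of scored AUs, ignoring timing and intensity. The field interval is simple arithmetic on the 60 fields per second mentioned above.

```python
def classify_smile(aus_present):
    """Label a set of scored AUs using the AU 6 / AU 12 distinction.

    The labels reflect presumptions described in the text, not certainties.
    """
    aus = set(aus_present)
    if 12 not in aus:
        return "no smile"
    return "emotional smile" if 6 in aus else "nonemotional smile"


print(classify_smile([6, 12]))
print(classify_smile([12]))

# Temporal resolution of video sampled at 60 fields per second.
FIELDS_PER_SECOND = 60
field_interval_ms = 1000 / FIELDS_PER_SECOND
print(round(field_interval_ms, 1))  # about 16.7 ms between successive fields
```

A measurement system that reported only "smile" would have lost the information needed to make this distinction, which is why the AUs themselves are the unit to measure.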
The issue of emotional versus nonemotional smiling is related to other distinctions psychologists make about facial expressions, such as conscious/unconscious, intended/unintended, controlled/uncontrolled, voluntary/involuntary, and spontaneous/deliberate. These distinctions are important and implicate different neural substrates. Subcortical mechanisms underlie spontaneous, rigid behaviors that arise from elementary processes, and cortical mechanisms provide flexibility in facial behavior with the influence of learning. Humans know how to simulate expressions of emotion deliberately without involving any other component of the emotional system. Consider, for example, an attempt to train a computer to recognize emotion expressions by using expressions requested (e.g., "look angry") from subjects as a reference set. Signals relevant to emotion would be confounded with signals about the conscious, controlled, voluntary, deliberate, nonemotional nature of such performances, seriously compromising the effort. (Such requests also have other faults, such as differing, and possibly incorrect, sets of muscular actions among subjects. They might be of some value if the muscles involved were scored manually with FACS.) These performances might be useful for other goals, such as in determining the signs of feigned expressions, perhaps bearing on the issue of deception. To repeat, the best strategy is to measure sign vehicles, then let behavioral scientists figure out what the signs and their meanings might be.
Databases of Signs
One final issue to clarify remains. How does one get examples of the sign vehicles associated with specific facial muscular actions? We have spent the last three years collecting examples of appearances associated with known muscular actions, and can report that this task is tedious and difficult. To obtain these samples, subjects contracted specific facial muscles, singly and in combination, on request, repeatedly and over multiple sessions. Few individuals can perform these actions accurately, without also moving an unwanted muscle. The difficult combinations of actions require subjects who know FACS so they understand the complicated requests and can interpret feedback about what to do or not to do. Thus, most of the subjects needed to be professionals or students in the field of facial expression. Their experience with the action of their facial muscles enabled them to provide feedback about the correctness of their own actions. These performances were videotaped and carefully examined using FACS to locate performances that contained the correct actions and nothing else. The images digitized from these videotapes are precise examples of specific FACS Action Units and combinations, in a range of intensities.
This collection of images and additional collections need to be put online in a database that researchers can access. Besides the images, the database must also contain information about what exactly the images contain. We are currently searching for funds for this purpose. A similar approach should be taken with body movement samples.
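One way to picture the information that would have to accompany each image is sketched below. The record layout, field names, identifiers, and intensity values are entirely hypothetical and do not describe an existing database; the sketch shows only that each image would be indexed by the Action Units it contains, so that researchers could retrieve examples of a specific action or combination and nothing else.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass
class ImageRecord:
    """Hypothetical description of one digitized image."""
    image_id: str
    subject_id: str
    au_intensity: Dict[int, int]  # AU number -> intensity on an assumed scale

    @property
    def aus(self) -> FrozenSet[int]:
        return frozenset(self.au_intensity)


def find_exact(records: Iterable[ImageRecord], aus: Iterable[int]) -> List[ImageRecord]:
    """Return records that contain exactly the requested AUs and nothing else."""
    wanted = frozenset(aus)
    return [record for record in records if record.aus == wanted]


# Hypothetical records for illustration only.
records = [
    ImageRecord("img001", "s01", {1: 2}),
    ImageRecord("img002", "s01", {1: 3, 4: 2}),
    ImageRecord("img003", "s02", {6: 2, 12: 4}),
]
print([record.image_id for record in find_exact(records, [1, 4])])
```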
References
Cacioppo, J. T. & Dorfman, D. D. (1987). Waveform moment analysis in psychophysiological research. Psychological Bulletin, 102, 421-438.
Cacioppo, J. T., Tassinary, L. G., & Fridlund, A. F. (1990). The skeletomotor system. In J. T. Cacioppo and L. G. Tassinary (Eds.), Principles of psychophysiology: Physical, social, and inferential elements (pp. 325-384). New York: Cambridge University Press.
Ekman, P. (1978). Facial signs: Facts, fantasies, and possibilities. In T. Sebeok (Ed.), Sight, Sound and Sense. Bloomington: Indiana University Press.
Ekman, P. (1982). Methods for measuring facial action. In K. R. Scherer & P. Ekman (Eds.), Handbook of methods in nonverbal behavior research (pp. 45-90). Cambridge: Cambridge University Press.
Ekman, P. (1992). Facial expression of emotion: New findings, new questions. Psychological Science, 3, 34-38.
Ekman, P. (1994). Strong evidence for universals in facial expressions: A reply to Russell's mistaken critique. Psychological Bulletin, 115, 268-287.
Ekman, P. & Friesen, W. V. (1969). The repertoire of nonverbal behavior: Categories, origins, usage, and coding. Semiotica, 1, 49-98.
Ekman, P. & Friesen, W. V. (1978). Facial action coding system: A technique for the measurement of facial movement. Palo Alto, Calif.: Consulting Psychologists Press.
Hager, J. C. (1983). The inner and outer meanings of facial expression. In J. T. Cacioppo & R. E. Petty (Eds.), Social Psychophysiology: A Sourcebook. New York: Guilford.
Hager, J. C. (1985). A comparison of units for visually measuring facial action. Behavior Research Methods, Instruments, & Computers, 17, 450-468.
Rosenfeld, H. M. (1982). Measurement of body motion and orientation. In K. R. Scherer & P. Ekman (Eds.), Handbook of methods in nonverbal behavior research (pp. 199-286). Cambridge: Cambridge University Press.
Shannon, C. E. & Weaver, W. (1949). The Mathematical Theory of Communication. Urbana, IL: University of Illinois Press.
Smith, W. J. (1977). The Behavior of Communicating: An Ethological Appraisal. Cambridge, MA: Harvard University Press.
Note: This manuscript is from the International Workshop on Automatic Face- and Gesture-Recognition Proceedings, June 26-28, 1995, Zurich, Switzerland. Martin Bichsel (Editor). Published by MultiMedia Laboratory, Department of Computer Science, University of Zurich, Winterthurerstrasse 190, CH-8057 Zurich, Switzerland.