
Gestures Details

Gestures were created for the eight commands described in the speech details section above. All gestures were performed with the right hand elevated to approximately shoulder level. See Figures 3 through 10 for examples of the gestures.

The backward gesture starts with the right hand elevated and the palm toward the face. The gesture then consists of repeated movements of the palm of the right hand toward the body.


  
Figure 3: Backward Gesture

The down gesture starts with the palm of the right hand parallel to the floor, followed by repeated vertical movement of the palm of the right hand.


  
Figure 4: Down Gesture

The forward gesture is the opposite of the backward gesture. The gesture starts with the right hand elevated as in the backward gesture, but with the back of the hand toward the face. The gesture then consists of repeated movements of the hand away from the body.


  
Figure 5: Forward Gesture

The left gesture starts with the right hand elevated at approximately shoulder level, and the palm of the hand is turned perpendicular to the floor and toward the body. The palm is then swept repeatedly from left to right.


  
Figure 6: Left Gesture

The release gesture starts with the right hand elevated at approximately shoulder level, and the hand starts in a fist. The fist is then opened to a hand shaped like it is grasping a large ball. (Think of throwing something from your clasped hand.)


  
Figure 7: Release Gesture

The right gesture is the opposite of the left gesture. It starts with the right hand elevated at approximately shoulder level, with the palm of the hand turned perpendicular to the floor and away from the body. The palm is then swept repeatedly from right to left.


  
Figure 8: Right Gesture

The stop gesture is the "international symbol" stop gesture. The hand is held above shoulder level, with the palm of the hand visible. The hand does not move.


  
Figure 9: Stop Gesture

The up gesture is the opposite of the down gesture. The gesture starts with the back of the right hand parallel to the floor, followed by repeated vertical movement of the palm of the right hand.


  
Figure 10: Up Gesture

As stated before, the data were recorded on videotape using a production-grade S-VHS camera. The video was digitized from tape, and the even fields were used for visual feature extraction. Hence the feature vector rate was determined by the frame rate: 30 Hz, or one vector every 33.3 milliseconds. As mentioned in the speech details, the video sequence endpoints were segmented by hand (i.e., the start and end times for digitization were determined by my visual analysis of the video sequence).

To create feature vectors that parameterize the video sequences, a concept similar to speech analysis was used. Instead of spectral features, temporal features of the view of the hand were exploited. In other words, the hand's motion over a period of time and the visual shape of the hand describe the gesture. For example, the stop gesture in Figure 9 presents much more hand surface area over time than the up gesture in Figure 10, in which the visible surface area varies temporally. Hence, temporal derivatives of location, along with center of mass measurements, were used as data for gesture analysis. These derivatives of location, velocity and acceleration, emphasize relative movements that do not depend on position in the scene. This relative measurement provides robust detection no matter where the subject is in the display. In addition, a simple measurement of the distance from the hand to the head was used to help distinguish gestures in which the hand looks the same. The idea is that although an up gesture (Figure 10) can appear spatially very similar to a forward gesture (Figure 5), the up gesture is typically performed farther from the body.

A separate program was developed as a gesture analyzer; it performs the same step, feature vector creation, as HCopy does in HTK. It was created as a project for a Computer Vision class, and a complete write-up can be found in [1]. Essentially, the hand and head are tracked using a ``blob tracking'' algorithm, which segments an image by skin color. In other words, the centroids of the skin regions were calculated to determine the velocities and accelerations of the blobs, along with the major axis, minor axis, and tilt of a constant-irradiance ellipse. The centroid and the ellipse parameters are calculated from the image moments using the standard equations:

\begin{equation}
\mathit{xCenter} = \frac{\mu_{10}}{\mu_{00}}, \qquad
\mathit{yCenter} = \frac{\mu_{01}}{\mu_{00}}
\end{equation}
\begin{equation}
\mathit{Maj} = \left( \frac{\mu_{20}+\mu_{02}+[(\mu_{20}-\mu_{02})^2+4\mu_{11}^2]^{1/2}}{\mu_{00}/2} \right)^{1/2}
\end{equation}
\begin{equation}
\mathit{Min} = \left( \frac{\mu_{20}+\mu_{02}-[(\mu_{20}-\mu_{02})^2+4\mu_{11}^2]^{1/2}}{\mu_{00}/2} \right)^{1/2}
\end{equation}
\begin{equation}
\mathit{Tilt} = \frac{1}{2}\tan^{-1}\left( \frac{2\mu_{11}}{\mu_{20}-\mu_{02}} \right)
\end{equation}
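As an illustration of the moment equations, the following is a minimal Python sketch, not the analyzer described in [1]. It assumes NumPy, a binary skin mask as input, and that the second-order moments are central moments taken about the centroid. The cross moment is squared inside the root, as in the standard form of these formulas, and `arctan2` is used in place of a plain inverse tangent to avoid dividing by zero when the two second-order moments are equal. The function name `blob_ellipse` is hypothetical.

```python
import numpy as np


def blob_ellipse(mask):
    """Return (x_center, y_center, major, minor, tilt) for a binary blob mask.

    Assumption: second-order moments are central moments (about the centroid).
    """
    ys, xs = np.nonzero(mask)          # pixel coordinates of the blob
    m00 = float(xs.size)               # zeroth moment: blob area in pixels
    if m00 == 0:
        raise ValueError("mask contains no foreground pixels")

    # Centroid from the first-order moments: mu_10 / mu_00 and mu_01 / mu_00.
    x_c = xs.sum() / m00
    y_c = ys.sum() / m00

    # Central second-order moments.
    dx = xs - x_c
    dy = ys - y_c
    mu20 = float(np.sum(dx * dx))
    mu02 = float(np.sum(dy * dy))
    mu11 = float(np.sum(dx * dy))

    # Major and minor axes and tilt of the equivalent ellipse.
    root = np.sqrt((mu20 - mu02) ** 2 + 4.0 * mu11 ** 2)
    major = np.sqrt((mu20 + mu02 + root) / (m00 / 2.0))
    minor = np.sqrt(max(mu20 + mu02 - root, 0.0) / (m00 / 2.0))
    tilt = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)
    return x_c, y_c, major, minor, tilt
```

For a blob that is wider than it is tall and aligned with the image axes, this sketch returns a major axis larger than the minor axis and a tilt of zero.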
Temporal derivatives up to second order were computed over three images to obtain hand velocities and accelerations. The ellipse representing the hand's shape was given in each image by the equations stated previously. The feature vector was made up of eight components of ``HTK's user defined data type'' as follows: X velocity, Y velocity, X acceleration, Y acceleration, hand ellipse major axis length, hand ellipse minor axis length, hand ellipse tilt in radians, and distance from head center to hand center. The vectors were sampled every 33.333 milliseconds, and no smoothing function was applied to the data.

As with speech, an HMM for isolated gestures was devised, this time a five-state, single-stream, eight-mixture HMM. See Figure 11 for the left-to-right prototype HMM used for all gestures. A five-state HMM was chosen because most gestures consisted of the following: a beginning, followed by a transitional period, a repeated sequence, another transitional period, and finally an end.
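The eight-component feature vector listed above can be sketched as follows. This is only an illustration under stated assumptions: the text says the derivatives were computed over three images but does not give the difference scheme, so central differences are assumed here, and the function name `gesture_features` and its argument layout are hypothetical. NumPy is assumed.

```python
import numpy as np

FRAME_PERIOD = 1.0 / 30.0  # seconds between feature vectors (30 Hz)


def gesture_features(hand_xy, head_xy, major, minor, tilt, dt=FRAME_PERIOD):
    """Build one 8-component feature vector.

    hand_xy: three (x, y) hand centroids for frames t-1, t, t+1.
    head_xy: (x, y) head centroid for frame t.
    Assumption: central differences over the three frames.
    """
    p_prev, p_cur, p_next = (np.asarray(p, dtype=float) for p in hand_xy)

    # First derivative (velocity) and second derivative (acceleration).
    velocity = (p_next - p_prev) / (2.0 * dt)
    accel = (p_next - 2.0 * p_cur + p_prev) / dt ** 2

    # Distance from head center to hand center in the middle frame.
    head_dist = np.linalg.norm(p_cur - np.asarray(head_xy, dtype=float))

    return np.array([velocity[0], velocity[1],
                     accel[0], accel[1],
                     major, minor, tilt, head_dist])
```

Because only differences of position enter the first four components, shifting every centroid by the same offset leaves the vector unchanged, which is the position independence the text relies on.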


  
Figure 11: Gesture HMM Prototype
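A left-to-right prototype of this kind can be sketched in Python as below. This is not the HTK prototype file used in the project. It assumes five emitting states with no skip transitions, equal initial stay and advance probabilities, and flat initial mixture parameters, none of which is stated in the text. The function name and dictionary keys are hypothetical, and NumPy is assumed.

```python
import numpy as np


def left_to_right_prototype(n_states=5, n_mix=8, dim=8, stay=0.5):
    """Return initial parameters for a left-to-right Gaussian-mixture HMM.

    Assumptions: each state may only repeat or advance to the next state,
    and all mixtures start with zero mean and unit variance.
    """
    trans = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        trans[i, i] = stay              # remain in the current state
        trans[i, i + 1] = 1.0 - stay    # advance to the next state
    trans[-1, -1] = 1.0                 # final state only repeats

    return {
        "trans": trans,
        "weights": np.full((n_states, n_mix), 1.0 / n_mix),
        "means": np.zeros((n_states, n_mix, dim)),
        "vars": np.ones((n_states, n_mix, dim)),
    }
```

Training would then re-estimate these parameters from the training vectors for each gesture. The zeros below the diagonal of the transition matrix are what make the model left to right.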

For each command gesture, a single vector was chosen from the total vector set and set aside as the test vector. See Table 2 for the number of training versus test vectors. The prototype HMM was trained using the training vectors for each gesture.


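Table 2. Number of training vectors and test vectors.
\begin{tabular}{lrrrr}
\hline
      & Backward & Down  & Forward & Left \\
\hline
Train & 15       & 11    & 19      & 11   \\
Test  & 1        & 1     & 1       & 1    \\
Total & 16       & 12    & 20      & 12   \\
\hline
      & Release  & Right & Stop    & Up   \\
\hline
Train & 15       & 11    & 18      & 15   \\
Test  & 1        & 1     & 1       & 1    \\
Total & 16       & 12    & 19      & 16   \\
\hline
\end{tabular}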

As in the speech recognizer, the simplest possible grammar was imposed on the gesture recognizer: each isolated gesture was equally likely after every other isolated gesture. See Figure 2 for a visual example of the gesture network used. Lastly, the dictionary was equally simple, because one HMM corresponded to one gesture.
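The uniform grammar and one-to-one dictionary can be pictured with the short Python sketch below. It is an illustration only, not the HTK network or dictionary files used in the project. The names are hypothetical, and it assumes that a gesture may also follow itself.

```python
import math

# The eight command gestures, in the order of Figures 3 through 10.
GESTURES = ["backward", "down", "forward", "left",
            "release", "right", "stop", "up"]


def uniform_gesture_network(gestures):
    """Map each gesture to log-probabilities of the gesture that follows it.

    Assumption: every gesture, including the same one, is equally likely next.
    """
    log_p = math.log(1.0 / len(gestures))
    return {prev: {nxt: log_p for nxt in gestures} for prev in gestures}


# One HMM per gesture, so each dictionary entry lists a single model name.
DICTIONARY = {gesture: [gesture] for gesture in GESTURES}
```

With such a network the grammar contributes the same score to every hypothesis, so recognition is decided entirely by the HMM likelihoods.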


Greg Berry
9/15/1997