SignSpeak research highlights

On this page, we illustrate some highlights of the research results obtained by the SignSpeak project.

Corpus Annotation

In the course of the SignSpeak project, the RWTH-PHOENIX-Weather corpus, a large corpus in German Sign Language and German in the domain of weather forecasting, has been recorded and annotated by one of the consortium members, RWTH Aachen University. With ground-truth annotations for head and face tracking as well as gloss time segmentations, annotations and translations, the corpus can foster scientific research in various domains such as face and hand tracking, sign language recognition and translation. The corpus comprises a total of 46,438 glosses, which corresponds to about seven hours of signing, making it one of the largest single-domain sign language corpora available. As the data originates from the natural signing situation of interpreting spoken TV news programmes, it is much less controlled and thus more challenging than the artificial laboratory data of other corpora. The RWTH-PHOENIX-Weather corpus is therefore a first step towards developing automatic recognition and translation systems for natural sign languages.

Radboud University Nijmegen also consolidated its Corpus-NGT by improving the consistency of the annotation scheme and the translations. On top of the already existing corpus, about 68,000 glosses were annotated. Here, the left and right hand were annotated individually, and head shake information was partly annotated as well. Thus, the Corpus-NGT allows for research into the multimodal nature of sign languages.

Linguistic Research

The linguistic research performed by Radboud University Nijmegen focused on topics which are highly relevant for automatic processing of sign languages.

Since automatic recognition and translation systems usually work on the sentence level, a first step in an overall pipeline is to segment a sign language video into sentences. The linguistic research therefore examined cues for sentence boundaries in an empirical study, as sentences are usually not separated by long pauses. Here, one finding was that the 'spreading' of the non-dominant hand in a two-handed sign is clearly related to sentence boundaries. Other cues often only gave circumstantial evidence, since they did not only indicate sentence boundaries but also conveyed other information.

Another problem in sign language recognition is posed by transitional movements, where the hand moves from the end position of one sign to the starting position of the next sign. Here, kinematic measurements indicated that the timing of the peak velocity differs between lexical movements and transitional movements.
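To make the kinematic measure concrete, the following minimal sketch computes where the peak speed falls within a single movement, as a fraction of the movement's duration. The function name, the input layout and the default frame rate are assumptions made for illustration; this is not the measurement procedure used in the study.

```python
import math


def relative_peak_velocity_time(positions, fps=25.0):
    """Return the time of peak speed as a fraction (0..1) of the movement.

    positions: list of per-frame hand coordinates, e.g. [(x, y), ...].
    fps: frame rate of the recording (assumed value, adjust as needed).
    """
    if len(positions) < 3:
        raise ValueError("need at least three frames")
    # Speed between consecutive frames: distance travelled times frame rate.
    speeds = [math.dist(a, b) * fps for a, b in zip(positions, positions[1:])]
    # Index of the first interval with maximal speed.
    peak = max(range(len(speeds)), key=speeds.__getitem__)
    # Interval i lies between frames i and i + 1, so use its midpoint.
    return (peak + 0.5) / len(speeds)
```

Applied to segmented lexical and transitional movements, the resulting fractions could then be compared between the two groups.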

Moreover, consecutive signs often influence each other. For example, in order to minimize transitional movements, a person might sign the same sign at different heights depending on the height of the previous sign. These coarticulations are challenging for automatic systems, because the same sign may look different depending on the context. In a study, these coarticulations were examined on the Corpus-NGT, confirming previous studies for American Sign Language and especially indicating that locations on the weak hand were influenced more strongly by coarticulation than other locations. In a further study, coarticulations of handshapes were also examined. Here, strong coarticulation effects of the thumb position were found.

To summarize, the linguistic research investigated several areas which are of direct use for the development of automatic sign language recognition systems.

Face and Hand Tracking

The most important visual cue is the trajectory of the dominant hand of the signer. Taken together, however, other cues are equally important, including the trajectory of the non-dominant hand, hand configurations, and facial expressions, all of which are also exploited by the SignSpeak system. In using all of these cues, the SignSpeak project goes beyond most existing work on automatic analysis of sign language, which concentrates mostly on hand trajectories. Further, equally important cues exist beyond those exploited by SignSpeak, such as pointing gestures and body lean.

Feature Extraction

An important step in video analysis is the extraction of localized features that describe the appearance of objects visible in the image in terms of the spatial or spatiotemporal arrangement of color and intensity patterns. Vectors of such features can be used to detect, recognize and track objects such as hands and faces, and to characterize and distinguish hand configurations without fitting articulated hand models, which would be extremely difficult to achieve within the constraints imposed by the SignSpeak scenario.

Hand feature extraction

Like facial expressions (e.g. fear, anger, surprise, contempt, disgust, happiness and sadness), which involve facial attributes (e.g. overall face, eyes, ears, mouth, chin, nose and cheeks), hands can take a variety of shapes. Hand shapes are particularly important in sign languages, where we apply machine learning techniques to learn, recognize, express and translate what has been signed. For the SignSpeak project, CRIC has investigated the extraction of spatio-temporal features by means of histograms of 3D oriented gradients (HoG3D). We found that features extracted by HoG3D at a single scale (see next figure) are highly descriptive if the temporal and spatial domains are well sampled.

Figure 1. Handshapes spatiotemporal feature extraction (HoG3D)

At the same time, CRIC has developed a supervised learning method, namely hand-shape features using learning-based descriptive fragments. With this method, we show that hand-shape features derived from image fragments (see next figure) can be extracted from a subset of a sign language vocabulary and shared across the complete vocabulary. This reduces the time complexity of classifying much larger sign language vocabularies.

We evaluated the hand features extracted by both methods using the test suites developed and provided by RWTH, with two annotated corpora for training and testing. The obtained word error rates (WER) indicate the highly descriptive capabilities of both methods. As both methods generate very similar results, it is important to take into account their respective processing times for hand feature extraction: here, HoG3D is much faster than hand-shape features using learning-based descriptive fragments. Overall, the learning-based descriptive fragments method can still improve the descriptiveness of the extracted features, whereas HoG3D features seem to be bounded.

Figure 2. Relevant feature subset of 483 descriptive handshape fragments

Hand tracking

Hand tracking is the problem of finding the most likely sequence of hand positions throughout the video, given the image observations, physical constraints imposed by the human body, and contextual knowledge about the scenario. Since hands can take a vast variety of different appearances, hand tracking in video remains a difficult problem, and the use of information in addition to image observations is essential.

To address the highly variable hand and background appearances, UIBK developed a hand tracking algorithm based on robust principal component analysis. It employs an L1 objective function and projection-pursuit optimization to achieve robust localization of hands by reducing sensitivity to background clutter and partial occlusions.

UIBK also developed a multi-object tracking algorithm that tracks both hands simultaneously. Like the RWTH tracker, it first detects plausible hand locations and then links them over time by global optimization. Unlike the RWTH tracker, it optimizes using linear programming instead of dynamic programming. By jointly optimizing the trajectories of both hands, it avoids the collapsing of both targets onto the same hand and reduces the danger of confusing left and right hands. To play to its full strengths, this method requires more than two tracked objects and longer sequences than those used in the SignSpeak task. Tracking algorithms, especially those that use many different input features, contain many tunable parameters, which makes them difficult to use in practice. UIBK therefore developed a novel strategy for tuning hand-tracking parameters automatically via a structured-output support vector machine algorithm, with very promising results.
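The detect-then-link idea can be illustrated for a single hand with a small dynamic-programming sketch. It is a generic illustration, not the RWTH or UIBK implementation: the candidate format, the per-candidate detection cost and the `jump_weight` parameter are assumptions introduced here.

```python
import math


def link_candidates(frames, jump_weight=1.0):
    """Pick one candidate per frame so that the total cost is minimal.

    frames: list (one entry per video frame) of candidate lists, where each
            candidate is (x, y, cost) and a lower cost means a more plausible
            hand location (hypothetical format).
    jump_weight: penalty per pixel moved between consecutive frames (assumed).
    Returns the chosen (x, y) positions, one per frame.
    """
    if not frames or any(not candidates for candidates in frames):
        raise ValueError("every frame needs at least one candidate")
    # best[i]: minimal cost of any path ending in candidate i of the current frame.
    best = [cost for _, _, cost in frames[0]]
    back = []  # back[t][i]: best predecessor of candidate i in frame t + 1
    for prev, cur in zip(frames, frames[1:]):
        new_best, pointers = [], []
        for x, y, cost in cur:
            options = [
                best[j] + jump_weight * math.dist((x, y), (px, py))
                for j, (px, py, _) in enumerate(prev)
            ]
            j = min(range(len(options)), key=options.__getitem__)
            new_best.append(options[j] + cost)
            pointers.append(j)
        best = new_best
        back.append(pointers)
    # Trace the globally optimal path back from the cheapest final candidate.
    i = min(range(len(best)), key=best.__getitem__)
    path = [i]
    for pointers in reversed(back):
        i = pointers[i]
        path.append(i)
    path.reverse()
    return [frames[t][i][:2] for t, i in enumerate(path)]
```

Because the optimum is computed over the whole sequence, a locally attractive but isolated detection is rejected when reaching it would require an implausibly large jump. Tracking two hands jointly, as in the linear-programming formulation, would need additional constraints that this single-target sketch does not model.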

Head and Face Tracking

In vision-based automatic recognition and translation of sign language, facial expressions are some of the most important non-manual semantic features that can be observed or measured in the visual signal. They convey rich information that often helps to disambiguate or complement the manual features. UIBK developed and improved a series of tools for robust and accurate extraction of face features in real time. These include landmark points (e.g., the nose tip, the lip contour points, and many more) as well as the degree of aperture of the mouth, the degree of aperture of each eye, the degree of raising of each eyebrow, the angles of rotation of the head in 3-space, and head shake. The common denominator of those tools is the modeling of the general human face as an object with a variable shape and appearance, i.e. a shape and appearance which can depart to some extent from a fixed template (the 'average face') via a basis of allowed variation modes. Those variation modes are meant to capture the intrinsic sources of variability (identity, expression, skin color, etc.) as well as the extrinsic sources of variability (light changes, camera distortion, etc.). Such a model is called a deformable model. Of particular interest are statistical deformable models, i.e. those for which the allowed variations are learned from the data, as they make it possible to easily inject high-level expert knowledge via labeled examples. To this end, UIBK annotated the face shape on numerous images taken from SignSpeak corpora, notably RWTH-PHOENIX-Weather, for which 369 images were manually annotated (see next figure).

Figure 3. RWTH-PHOENIX-v2.0 face shape-annotated images

Specifically, UIBK chose to employ the so-called Active Appearance Models (AAM) technique, a statistical deformable model construction and fitting technique often used in image interpretation. Model construction is performed in such a way that it is possible to generate photorealistic images of the object of interest via a number of parameters. When fitting the model to a new image, the somewhat large number of parameters controlling the 'AAM image' are iteratively refined toward producing the best possible match to the target image. Different choices of fitting algorithm yield variants with different performance characteristics. UIBK implemented and evaluated a number of these variants, each suited to different application cases: some perform very fast and accurate image interpretation in a single-subject setup, while others offer remarkable robustness to extreme expressions and changes of identity and illumination, allowing their use in a multi-subject setup (typically one single model for all signers from the same database). Additionally, UIBK incorporated some refinements into its tools to further improve the overall robustness, such as the use of 3D shape models against large head rotations, as well as a strategy to adapt a slow and unstable multi-subject AAM on the fly into a robust and fast person-specific AAM. It should be noted that genuine generic model construction and fitting is a difficult problem and still a very active topic of research; this on-the-fly adaptation method allows the construction of effective models without the need for expert-labeled data, even if no genuinely generic model is available beforehand.
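The notion of an 'average face' plus a basis of learned variation modes can be sketched for the shape part alone as a linear model obtained by principal component analysis of annotated landmark vectors. This is a simplified illustration under stated assumptions: a full AAM would also align the shapes beforehand and model appearance, and the function names below are hypothetical rather than taken from the UIBK tools.

```python
import numpy as np


def build_shape_model(shapes, n_modes):
    """Learn a mean shape and variation modes from annotated landmark vectors.

    shapes: array of shape (n_images, 2 * n_landmarks), each row holding the
            flattened landmark coordinates of one annotated image.
    Returns (mean, modes), where modes has shape (n_modes, 2 * n_landmarks).
    """
    shapes = np.asarray(shapes, dtype=float)
    mean = shapes.mean(axis=0)
    # Rows of vt are the principal directions of the centred training shapes.
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    return mean, vt[:n_modes]


def synthesize(mean, modes, params):
    """Generate a shape: the mean plus a weighted sum of variation modes."""
    return mean + np.asarray(params, dtype=float) @ modes


def project(mean, modes, shape):
    """Recover the mode weights that best describe a given shape."""
    return (np.asarray(shape, dtype=float) - mean) @ modes.T
```

Fitting then amounts to searching for the parameter vector whose synthesized model best matches a new image; the iterative refinement used for that search is beyond this sketch.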

Facial Feature Extraction

UIBK designed a simple yet effective tracking scheme for AAM-based extraction of facial features. Using designated polygons and/or local appearance patches obtained from the accurate AAM-based segmentation of the face image, UIBK developed shape-based heuristics and classification/ranking methods for extracting semantically meaningful facial features such as head pose, mouth and eye apertures, and eyebrow raise. Such features were found to significantly enhance sign recognition as measured by the Word Error Rate (WER) within SignSpeak. The next figure illustrates facial feature extraction for one of the seven RWTH-PHOENIX-v02 signers.

Sign Language Recognition

Sign languages convey meaning simultaneously via different communication channels such as the hand shape, orientation, position and movement of the hands as well as by non-manual information such as body posture and facial expression. Sign language recognition, which is inspired by spoken language recognition where only the single communication channel of the audio signal exists, requires a combination of these different aspects of signing to deal with the multimodal nature of sign languages. For example, a signer may move both hands independently and may indicate additional information by facial expression and mouthing. Thus, RWTH Aachen University developed a stream combination technique which takes into account such asynchronous information channels. Moreover, many techniques which are state of the art in automatic speech recognition could be transferred to sign language recognition, e.g. signer adaptation and discriminative training.

For lab data, the achieved recognition error rates of 12% are similar to the results in automatic speech recognition. On real-life data such as interpretations of broadcast news, the error rates are still high, and more research is needed before computers are able to understand natural signing.
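Recognition quality on this page is reported as word error rate (WER). As a reminder of how that measure is commonly computed, here is a minimal sketch based on the word-level edit distance between a reference and a recognized gloss sequence. The function name and the gloss sequences in the usage note are made up for illustration and are not taken from the corpora.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.

    reference, hypothesis: whitespace-separated gloss (or word) sequences.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must not be empty")
    # prev[j]: edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)
```

For example, `word_error_rate("REGEN MORGEN NORD WIND", "REGEN NORD STARK WIND")` yields 0.5, since two edits are needed against a reference of four glosses.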

Sign Language Translation

In sign language translation, several techniques specifically suited for sign languages have been developed.

Since sign language corpora are rather small compared to bilingual corpora in spoken language translation, the classical approach of splitting off a dedicated development set for parameter optimization turned out to lead to unstable results. Consequently, a method similar to cross-validation was applied in which alternating parts of the training data were used as the development set, leading to more stable parameters.
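The rotation of the development set can be sketched as follows. This is only an illustration of the cross-validation-like idea; the number of parts, the way sentences are assigned to parts, and how the parameters tuned on each split are finally combined are not specified above and are assumptions here.

```python
def rotating_dev_splits(sentence_pairs, n_parts):
    """Yield (train, dev) splits in which each part serves once as the dev set.

    sentence_pairs: list of training items, e.g. (gloss sentence, German sentence).
    n_parts: number of alternating parts (assumed tuning choice).
    """
    if not 2 <= n_parts <= len(sentence_pairs):
        raise ValueError("n_parts must be between 2 and the number of items")
    # Interleaved assignment keeps the parts similar in size and content.
    parts = [sentence_pairs[i::n_parts] for i in range(n_parts)]
    for k in range(n_parts):
        dev = parts[k]
        train = [pair for i, part in enumerate(parts) if i != k for pair in part]
        yield train, dev
```

Each split would then be used to train on `train` and optimize the system parameters on `dev`, so that no single small development set dominates the result.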

The morphological analyzer Morphisto was applied to the spoken German text to deal with the mismatch in morphology between sign languages and spoken languages. Words were converted to their base forms, and noun compounds were split, leading to better alignments between the glosses and the spoken words.

For the Corpus-NGT, for which the left and right hand and head shakes were annotated on different streams, a simple method to deal with multiple input streams was applied as well. In an empirical study, the coverage of this approach to merge the streams according to the timeline was evaluated.
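A timeline-based merge of annotation streams can be illustrated with a short sketch. The stream names, the `(start_ms, end_ms, gloss)` annotation format and the tie-breaking rule are assumptions made for this example; the actual procedure applied to the Corpus-NGT may differ.

```python
def merge_streams(streams):
    """Merge several annotation streams into one sequence ordered by time.

    streams: dict mapping a stream name (e.g. "right", "left", "head") to a
             list of (start_ms, end_ms, gloss) annotations (hypothetical format).
    Returns a list of (stream name, gloss) pairs sorted by start time, with
    ties broken by end time.
    """
    merged = [
        (start, end, name, gloss)
        for name, annotations in streams.items()
        for start, end, gloss in annotations
    ]
    merged.sort(key=lambda item: (item[0], item[1]))
    return [(name, gloss) for _, _, name, gloss in merged]
```

With made-up input such as `{"right": [(0, 400, "A"), (900, 1300, "C")], "left": [(500, 800, "B")]}`, the result is a single sequence in which the glosses appear in order of their start times. Overlapping annotations are linearized rather than kept parallel.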

One issue in sign language translation is the lack of large corpora, because corpus annotation is time-consuming and laborious. Consequently, we investigated methods to automatically generate more bilingual data without any additional annotation effort by applying semi-supervised learning techniques to monolingual German data. In this effort, morphological analysis again played an important role in bridging the difference between German Sign Language and spoken German.

The user study showed that the quality of the translation system itself was quite satisfactory when neglecting recognition errors.