
SignSpeak Results and Demonstration

On this page, we illustrate the overall system developed in the course of the project and present some of the results achieved, including a few example videos.

The overall system, which translates a sign language video into the text of a spoken language, consists of three steps:
First, image and video analysis is applied to the video of the signed utterances, yielding a stream of features which describe the hand shape, facial features and movements. These features are then used by the continuous sign language recognition software to produce a sequence of glosses. Glosses transcribe the meaning of the individual signs of a signed utterance. Since sign languages differ in grammar and word order from spoken languages, the gloss sequence finally has to be translated into a text of the spoken language by the sign language translation system. This data flow is sketched below.
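A schematic of this three-step data flow, with hypothetical function names standing in for the actual project components, might look as follows:

```python
from typing import List

def extract_features(frame) -> List[float]:
    """Stage 1 (placeholder): image/video analysis yielding hand shape,
    facial and movement features for a single video frame."""
    raise NotImplementedError

def recognize_glosses(feature_stream: List[List[float]]) -> List[str]:
    """Stage 2 (placeholder): continuous sign language recognition,
    mapping the feature stream to a gloss sequence such as
    ['MONDAY', 'CHANGE', 'MORE', 'COLD']."""
    raise NotImplementedError

def translate_glosses(glosses: List[str]) -> str:
    """Stage 3 (placeholder): statistical machine translation from the
    gloss sequence to spoken-language text, handling the differences
    in grammar and word order."""
    raise NotImplementedError

def translate_video(frames) -> str:
    # The three stages are chained: features -> glosses -> text.
    features = [extract_features(frame) for frame in frames]
    return translate_glosses(recognize_glosses(features))
```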


With regard to the system, several significant research results have been achieved in the course of the project:

  • As part of the project, the RWTH-Phoenix Weather corpus, one of the largest real-life data corpora in German Sign Language, was annotated. Previously, mainly lab-recorded corpora of reduced complexity existed.
  • Linguistic research focused on handshape coarticulation and cues for sentence boundaries.
  • Face tracking using a technique called active appearance models led to significant improvements.
  • In sign language recognition, the feature streams were combined to incorporate multiple modalities (right and left hand, facial expressions, body pose, etc.).
  • In sign language translation, a method to create stable data-driven systems with the amounts of data at hand was developed; moreover, monolingual data was incorporated using lightly supervised training.
  • The industrial partner Telefonica I+D investigated the integration of SignSpeak technologies into a communication service, using a sign language video mail system as an example.
For more details about the highlights and achievements of this project, please also refer to the "Highlights" page.

[Figure: system architecture]

To evaluate the quality of the individual parts of our architecture, we use the following setup:
  • To evaluate the quality of the recognition system, we compare the gloss output of the system to the glosses annotated by a human annotator. We compute an error rate by counting the number of glosses which were inserted, deleted or substituted by the system compared to the human-annotated sequence, and dividing this count by the number of glosses in the reference. This measure is called the "recognition error rate" (a minimal code sketch follows this list).
  • To evaluate the quality of the translation system, we similarly compare the text output of the system to a human translation. Here, we allow not only insertions, deletions and substitutions, but also shifts of parts of the sentence to another position; e.g. the phrase "on Monday" can be moved to the beginning or the end of a sentence without considerably changing its meaning. This measure is called the "translation edit rate" (TER).
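The following is a minimal sketch of how such an edit-distance-based error rate can be computed, using the standard Levenshtein dynamic program over gloss sequences. It illustrates the metric itself, not the project's actual evaluation code.

```python
def recognition_error_rate(reference, hypothesis):
    """Edit-distance-based error rate over gloss sequences:
    (substitutions + insertions + deletions) / len(reference)."""
    m, n = len(reference), len(hypothesis)
    # d[i][j] = minimal number of edits turning the first i reference
    # glosses into the first j hypothesis glosses.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # delete all i reference glosses
    for j in range(n + 1):
        d[0][j] = j  # insert all j hypothesis glosses
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[m][n] / m
```

TER is computed analogously, but additionally allows contiguous phrases to be shifted, counting each shift as a single edit; the search over shifts is more involved and is omitted from this sketch.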

However, it is usually possible to translate one sentence into several equivalent sentences, and thus the comparison of the translation output to only one reference usually underestimates the translation quality. We therefore opted to also perform a human evaluation at the end of the project in which the quality of the system output was judged by a total of nine human experts. We asked both Deaf and hearing experts who are fluent in German Sign Language to evaluate the sentences generated by our system. In the following, we present the results from this evaluation with some example videos.

The user evaluation was performed on translations of the RWTH-Phoenix Weather corpus, a corpus in the domain of weather forecasts in German Sign Language and spoken German. In the evaluation, we compared three setups:

  • For some videos, the evaluators had to judge the results of the overall pipeline of recognition and translation ("pipeline").
  • For some videos, the evaluators had to judge the machine translation of human-annotated glosses ("translation").
    In this setup, the errors introduced by the sign language recognition system were excluded.
  • For some videos, the evaluators were presented with the original text spoken by the announcer, for comparison purposes ("reference").

The evaluators had to judge adequacy and fluency on a scale from 1 (poor) to 5 (excellent). The adequacy score measures whether the translation conveys the meaning of the original video. Taking example video 2 shown below: if the system translates a signed utterance with the reference translation "At daytime 12 degrees at the Baltic and up to 20 degrees in Lower Bavaria." as "At daytime in the north and in the south up to 30 degrees in the northeast", it receives a low adequacy score, since important information ("12 degrees", "Baltic", "20 degrees", "Lower Bavaria") was dropped or recognized/translated incorrectly.

The fluency score measures whether the translation is grammatically correct and fluent. A translation which contains all the information of the signed utterance but contains grammatical errors thus gets a good adequacy score, but a low fluency score. An example is video 3 shown below, which translates the utterance with the meaning "The new week still begins alternating and a bit colder." as "On Monday alternating weather colder." While the main information is captured, the sentence contains neither a verb nor a conjunction to link "alternating weather" and "colder". The results of the human evaluation can be seen in the following figures.

[Figures: human evaluation adequacy results and human evaluation fluency results]


In the first figure, we can see that in terms of adequacy the gap between the "pipeline" setup and the "translation" setup is larger than the gap between "translation" and "reference". This indicates that most of the adequacy, i.e. the transfer of meaning, is lost in the recognition part of the pipeline. The overall adequacy of the system was judged at 2.5, somewhat below the midpoint of the scale. The fluency, that is, the grammatical well-formedness of the overall pipeline output, was judged at an average of 3.0. Here we can see that the score of the whole pipeline is very similar to that of the translation of the annotated glosses. This implies that it is mainly the translation system which introduces grammatical errors and disfluent expressions. Remarkably, even the text spoken by the announcer does not receive a perfect score of 5. This is mainly because the sentences signed by the interpreters do not exactly correspond to what the announcer said: the interpreters leave out some information due to time constraints or slightly reformulate the sentences.

In the following, we show three short video sequences which were translated by the SignSpeak system.

(Note that the original corpus is in German Sign Language and German. For demonstration purposes, we have translated both the glosses and the text spoken by the announcer into English. To see the original data, please change the language of the website to German.)

[Video 1]
Gloss: SOUTH lefthand-REGION A_BIT RAIN CAN
Recognition: SOUTH A_BIT RAIN TEMPEST CAN
Translation: In the south initially still a few showers or tempests .
Reference: In the very south a few showers are possible.


The sign RAIN contains a repetition of the hand movement, which sometimes poses problems for the recognition system: it regards the sequence as several signs instead of only one. Here, the system produced an additional TEMPEST, which together with the omitted lefthand-REGION leads to a recognition error rate of 40%. The translation system carries the tempest error over; moreover, it translates A_BIT as "initially still a few" because this translation was seen in the training data. The human evaluation of this translation was average to good, with an adequacy of 3.5 and a fluency of 3.6, while the automatic translation measure TER would indicate a high error rate of 62.5%.
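This error rate can be reproduced with the recognition_error_rate sketch given above (the project's own scoring tool may of course differ in details such as tokenization):

```python
reference  = "SOUTH lefthand-REGION A_BIT RAIN CAN".split()
hypothesis = "SOUTH A_BIT RAIN TEMPEST CAN".split()

# Best alignment: 1 deletion (lefthand-REGION) + 1 insertion (TEMPEST)
# against 5 reference glosses -> 2 / 5 = 0.4.
print(recognition_error_rate(reference, hypothesis))  # 0.4
```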

[Video 2]
Gloss: DAY NORTHEAST 12 SOUTH THEN BAVARIA IX BAVARIA MAXIMUM 20 TEMPERATURE
Recognition: DAY NORTH NORTHEAST SOUTH MAXIMUM 30 TEMPERATURE
Translation: At daytime in the north and in the south up to 30 degrees in the northeast
Reference: At daytime 12 degrees at the Baltic and up to 20 degrees in Lower Bavaria.


This video shows an example where the system did not perform well. One issue is the mouthing, which was not properly recognized; for example, BALTIC differs from NORTHEAST only by the mouthing ("Ostsee"). The sign NORTHEAST was misrecognized as NORTH NORTHEAST, because the sign for NORTHEAST begins with the sign NORTH. The translation system then reordered the NORTHEAST to the end of the sentence. The recognition error rate was 71.4%, and the human evaluation was 1.1 for adequacy and 1.9 for fluency. The project results indicate that for German Sign Language, future research should focus on the recognition of mouthing.

[Video 3]
Gloss: MONDAY CHANGE MORE COLD
Recognition: MONDAY CHANGE MORE COLD
Translation: On Monday alternating weather colder .
Reference: The new week still begins alternating and a bit colder.


For this short sequence, the recognition system correctly recognized all glosses, leading to a recognition error rate of 0%. The translation system did not add a verb to the sentence, leading to a translation which sounds a bit choppy. Thus, the human evaluation yielded an adequacy of 3.3 and a rather low fluency of 2.6. In general, one difficulty of the translation task was to add filler words and sometimes verbs which are not part of the gloss sequence. Since we used data-driven statistical methods, the system could only learn to add such words from translations it saw during training. The often free translations of the human interpreters (e.g. rendering the original phrase "The new week still begins" simply as MONDAY) made this task rather difficult.

To summarize, the main achievement of the project was the creation of an overall system which can translate real-life data, not only data recorded in a research laboratory. The experiments showed that the translation part of the pipeline already works quite satisfactorily. The sign language recognition system has made substantial progress, both with regard to recognition results and to the incorporation of multiple modalities (right and left hand, facial expression, etc.). In addition, as part of the project, the RWTH-Phoenix Weather corpus, a corpus in German Sign Language and German, was annotated and can be used for research purposes by the community. Thus, we hope that future research can profit both from the methods developed and from the data annotated in the course of the SignSpeak project.