Modifying the apertium stream format and solving the markup reordering problem using wordbound blanks
- Mentors: Flammie, Tino Didriksen
- Organization: Apertium
Markup handling has long been a problem in Apertium. It was done using superblanks, which encapsulate markup information during the translation process. This works well to protect the formatting of the document. However, languages represent information differently, and during translation words and phrases move around, get deleted, split, merge, and so on. The markup on a word needs to stay with that word; otherwise the translation ends up with erroneous markup. In the example below, the bold that marked "blanco" lands on "dog" instead of "white":
Spanish Input: <i>El perro</i> <b>blanco</b>
English Output: <i>The white</i> <b>dog</b>
As part of this project, a new kind of blank was proposed: the wordbound blank (wblank). It contains any information that needs to stay attached to a word or phrase during the translation process. After modifying most modules in the pipeline to work with wblanks, writing new deformatters and reformatters, and storing markup in wblanks, the same input now translates as:
Spanish Input: <i>El perro</i> <b>blanco</b>
English Output: <i>The</i> <b>white</b> <i>dog</i>
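The idea of binding markup to words can be illustrated with a small Python sketch. This is not Apertium's stream format or pipeline code: the (word, tag) tuples and the `reorder` and `reformat` helpers are hypothetical names chosen for illustration, and the word order is hard-coded rather than produced by real transfer rules.

```python
from itertools import groupby

# Hypothetical model: every token carries its own markup tag (or None),
# loosely mimicking what a wordbound blank attaches to a word.


def reorder(tokens, order):
    """Return the tokens in the given target order (stand-in for transfer)."""
    return [tokens[i] for i in order]


def reformat(tokens):
    """Wrap each run of adjacent tokens that share a tag in that tag."""
    parts = []
    for tag, run in groupby(tokens, key=lambda token: token[1]):
        text = " ".join(word for word, _ in run)
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return " ".join(parts)


# Word-for-word translation of "<i>El perro</i> <b>blanco</b>";
# each word keeps the tag of the source word it came from.
translated = [("The", "i"), ("dog", "i"), ("white", "b")]

# English places the adjective before the noun: The(0) white(2) dog(1).
target = reorder(translated, [0, 2, 1])

print(reformat(target))  # <i>The</i> <b>white</b> <i>dog</i>
```

Because each tag travels with its word, the reformatting step only has to merge adjacent words that share a tag. That is why the italic span splits in two around the bold "white" instead of staying at fixed positions in the sentence.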
Wordbound blanks should prove immensely useful for users of the Apertium MT system who translate HTML or other formatted documents such as ODT, DOCX, and PPTX.