German nominal compounds in Statistical Machine Translation tasks into Spanish

Friday February 28, 2014 at 14:15 – 16:00 (HF: 217)

Compounds in Germanic languages such as German or Norwegian pose a challenge for many Natural Language Processing (NLP) applications: because they can be coined on the fly, they must be detected, disambiguated and processed successfully alongside the other words in a text.

The common state-of-the-art strategy for dealing with new, non-lexicalized compounds is to split them into their constituents to avoid data sparsity problems. This approach has also proven successful in Statistical Machine Translation (SMT), as reported by Koehn and Knight (2003), Popović et al. (2006), Stymne (2008), Fritzinger and Fraser (2010) and Stymne et al. (2013). However, all of these experiments involved language pairs consisting of a Germanic language (mainly German, but also Swedish, Danish and Norwegian) and English. I have focused on the statistical machine translation of German nominal compounds into Spanish. Because Spanish is a morphologically rich language, simply splitting the compounds does not work as well as it does for English, and alternative solutions are needed.
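As a rough illustration of what "splitting compounds into their constituents" can look like, here is a minimal sketch of a frequency-based splitter in the spirit of Koehn and Knight (2003): among all segmentations into known words, it keeps the one whose parts have the highest geometric mean frequency. It is not the system used in the talk. The frequency table `FREQ` contains invented counts, the linking elements in `LINKERS` are a simplified assumption, and the names `candidate_splits` and `best_split` are hypothetical.

```python
from math import prod

# Invented toy counts; a real system would collect these from a corpus.
FREQ = {
    "aktion": 852,
    "plan": 710,
    "akt": 224,
    "ion": 79,
    "aktionsplan": 5,
}

LINKERS = ("", "s", "es")  # simplified set of linking elements (assumption)
MIN_LEN = 3                # shortest part we are willing to consider


def candidate_splits(word):
    """Yield every segmentation of `word` into parts found in FREQ."""
    if word in FREQ:
        yield (word,)
    for i in range(MIN_LEN, len(word) - MIN_LEN + 1):
        head, rest = word[:i], word[i:]
        for linker in LINKERS:
            if linker and not head.endswith(linker):
                continue
            # Strip the linking element before looking the part up.
            stem = head[: len(head) - len(linker)]
            if len(stem) >= MIN_LEN and stem in FREQ:
                for tail in candidate_splits(rest):
                    yield (stem,) + tail


def geometric_mean(parts):
    """Geometric mean of the part frequencies (0 for unknown parts)."""
    return prod(FREQ.get(p, 0) for p in parts) ** (1.0 / len(parts))


def best_split(word):
    """Return the highest-scoring segmentation, or the word itself."""
    word = word.lower()
    candidates = list(candidate_splits(word)) or [(word,)]
    return max(candidates, key=geometric_mean)


if __name__ == "__main__":
    print(best_split("Aktionsplan"))
```

With these toy counts, `best_split("Aktionsplan")` returns `("aktion", "plan")`: the unsplit form is rare, and the over-split `("akt", "ion", "plan")` scores lower because of its infrequent parts. A word with no known parts is returned unchanged.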

In this presentation, I will show the results of the experiments I carried out during my secondment at RWTH Aachen University in Germany, using both the state-of-the-art strategy and an alternative approach. I will also briefly present my ongoing work to incorporate these findings and achieve better results.

By Carla Parra Escartín, PhD Candidate, Research Group: Language Models and Resources, LLE
