HandOn — Scaling up the LOGON Prototype

At the conclusion of about fifteen person-years of research and development, the LOGON consortium had assembled a functional prototype for quality-oriented Norwegian–English machine translation for a limited domain and relatively small vocabulary. For practical reasons, the LOGON target domain was circumscribed by a collection of about 50,000 words of Norwegian running text, six booklets on hiking in the Norwegian mountains (most published in both Norwegian and English). About one tenth of the original corpus was set aside for testing and quality evaluation. On this held-out data, the final LOGON prototype reached an end-to-end coverage of about two thirds (i.e. translating two out of three sentences). Very favorable human judgments of translation quality indicate that the hybrid transfer-based approach has a good handle on output quality in those cases where it does succeed end-to-end. The total LOGON vocabulary (in a sense the intersection of the analysis, transfer, and generation grammars) comprises about 4,000 word stems.

In 2007, the LOGON core developer team received funding from the Norwegian KUNSTI language technology program to conduct an experiment in the scalability of the LOGON prototype. The HandOn project is an intense, six-month effort at significantly enlarging the end-to-end vocabulary of the LOGON prototype, while of course preserving the emphasis on high-quality translation. The main research theme in HandOn, accordingly, is the efficacy of existing methodology and tools for scaling up the LOGON approach, where the HandOn consortium aims to gauge precisely the cost (and possibly the limits) of system scalability in the hybrid LOGON setup.

Moving to a less constrained (more realistic, some might say) notion of its target domain, the consortium has assembled a corpus of comparable texts (rather than more expensive parallel texts) of tourism information for visitors to Norway, comprising around 200,000 running words in each of Norwegian and English. Aiming for an enlarged end-to-end vocabulary of around 8,000 word stems, the consortium is currently (as of late 2007) enlarging the lexica of all three grammars in a semi-automated fashion. In parallel work, the HandOn initiative extends the LOGON methodology for automated transfer rule acquisition from the combination of a bilingual dictionary and monolingual text corpora, with special emphasis on the ubiquitous phenomenon of compound analysis and translation.

The HandOn consortium aims to wrap up an enhanced translation demonstrator early in 2008 and will document its experience (including precise quantitative measures of the effort invested and the associated system improvements) in a project-final report. Assuming a positive outcome to this experiment, the consortium aims to seek renewed funding for a follow-up effort, further extending the scope of the LOGON approach (possibly also targeting additional language pairs) and advancing cutting-edge research on hybrid, transfer-based MT. The HandOn project includes the Universities of Oslo (co-ordinator) and Bergen, as well as the Center for the Study of Language and Information at Stanford University.

