Serge SHAROFF

University of Leeds (United Kingdom)
Guest of Lattice – May & June 2019


In May and June 2019, the labex TransferS and Thierry Poibeau (Lattice) host Serge SHAROFF, associate professor at the School of Languages, Cultures and Societies of the University of Leeds.

Natural language processing and multilingualism

Although more than 6,000 languages exist in the world, the electronic resources needed to build high-performing (syntactic or semantic) parsers are available for only about fifty of them. Even then, enough data to "train" automatic processing systems exists for only a minority of those fifty languages. To work around these problems, researchers are now developing systems based on "multilingual" representations of information. Counter-intuitive as it may seem at first, a high-performing system for a language X can be obtained by transferring knowledge acquired by and for a language Y (in practice, modern systems of this type use several languages at once). The question of multilingual knowledge representation, which is generally approached from a purely engineering standpoint, deserves a multidisciplinary perspective. Serge Sharoff will discuss in particular the impact of research in linguistic typology and of language contact on natural language processing (for example, although Komi, a Finno-Ugric language of northern Russia, does not share its origin with Russian, contact between the two languages and the bilingualism of all Komi speakers have made the language highly porous, with Russian structures transposed directly into Komi).

Tuesday 14 May: Evaluation and usability of machine translation

Salle Cavaillès, from 10:00

Translation quality evaluation: MT vs human translation

Serge Sharoff (Univ. Leeds)

In modern life we are surrounded by translations from other languages, some of which are unreliable. This talk investigates the task of detecting low-quality human translations automatically. The task is important in many applications, such as translation training, screening candidates or monitoring translation submissions, yet few resources are available for training Machine Learning models for it. In my talk, I will show how to approximate a proper training corpus with a composite one created from low-quality MT outputs and good-quality human translations.
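
As an illustration only, the composite corpus idea can be sketched in a few lines of Python. The abstract does not describe the actual data format or pipeline, so the `Example` class, the `build_composite_corpus` function and the labelling scheme (1 for human translations, 0 for MT outputs) are assumptions made for this sketch.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One training item (hypothetical structure, not taken from the talk)."""
    source: str
    translation: str
    label: int  # 1 = assumed good quality (human), 0 = assumed low quality (MT)


def build_composite_corpus(human_pairs, mt_pairs, seed=0):
    """Merge good human translations and low-quality MT outputs into one labelled set.

    Both arguments are iterables of (source, translation) string pairs.
    The result is shuffled deterministically so the two classes are interleaved.
    """
    examples = [Example(src, tgt, 1) for src, tgt in human_pairs]
    examples += [Example(src, tgt, 0) for src, tgt in mt_pairs]
    random.Random(seed).shuffle(examples)
    return examples
```

A classifier trained on such a set would then be applied to human translations of unknown quality; how well the MT-derived negatives stand in for poor human translations is the question the talk addresses.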

Post-editing machine translation: MT technologies in real-life use scenarios

Hanna Martikainen (CLILLAC-ARP — Univ Paris Diderot)

It is generally acknowledged that machine-translated output is of sufficient quality today for commercial use with post-editing, and the technology is being integrated into translation workflows in various settings (Koponen 2016). With the recent advent of neural MT and the undeniable advances in fluency it has brought about, this trend is expected to grow even stronger. However, automatic and human evaluation metrics of MT often yield inconsistent results on quality (see for instance Castilho et al. 2017), and discrepancies between automatic metrics such as HTER scores and perceived post-editing effort as well as post-editing time have been observed (see for instance Koponen et al. 2019). In this talk, I will present some real-life scenarios of MT integration into translation workflows in professional as well as educational settings and discuss end-users’ perception of MT. I will seek to determine what kind of factors are known to influence the use and usefulness of MT in actual settings and explore the different parameters that affect it, with a specific focus on the emerging paradigm of neural MT.

Monday 20 May

Lattice, room 512, Montrouge, 11:00

Text typology vs text topology: reliable detection of genres

Serge Sharoff (Univ. Leeds)

There are different kinds of texts on the Web, from FAQs to shopping pages to journalism to blogs. Different sorts of pages have quite different uses and characteristics. The talk will present a topological approach to text typology in which the texts are described in terms of their similarity to prototype genres. The suggested set of categories is designed to be applicable to any text on the Web and to be reliable in annotation practice. Reliably annotated texts also provide the basis for automatic genre classification.

[this seminar will be preceded by another seminar given by Ismael Ramos Ruiz, of the Université de Caen]

Wednesday 5 June, morning: "Multilingual approaches and cross-lingual transfer in NLP (natural language processing)"

9:30: Welcome and coffee

9:30-10:30:

Language adaptation: exploiting similarity between the languages in NLP models

Serge Sharoff (Univ. Leeds)

Some languages have very few NLP resources, while many of them are closely related to better-resourced languages. This talk explores how the similarity between the languages can be utilised by porting resources from better- to lesser-resourced languages. This can be achieved by combining cross-lingual embedding methods with a lexical similarity measure based on the detection of cognates. I show that the resulting embedding space helps in applications such as morphological prediction and Named Entity Recognition, when a model is trained using data from better-resourced languages and is applied to lesser-resourced ones.
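
To make the notion of a cognate-based lexical similarity measure concrete, here is a minimal Python sketch. The abstract does not say which measure is used in the talk; normalised Levenshtein (edit) distance is swapped in here as a common stand-in, and the function names and the 0.7 threshold are hypothetical choices for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cognate_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance normalised by the longer word."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def likely_cognates(pairs, threshold=0.7):
    """Keep word pairs whose similarity reaches the (assumed) threshold."""
    return [(a, b) for a, b in pairs if cognate_similarity(a, b) >= threshold]
```

Word pairs selected this way could serve as an additional signal when aligning the embedding spaces of two related languages; a purely orthographic measure like this one ignores regular sound correspondences and script differences, which a real system would need to handle.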

10:30-11:30:

Annotating and predicting semantic frames in French: feedback from adapting FrameNet to another language

Marie Candito (Université Paris Diderot)

FrameNet’s semantic frames (Baker et al. 1998) are structured representations of eventualities as evoked in texts. These representations provide semantic generalizations, by grouping several predicates evoking the same core semantics, and tagging predicate-argument relations with semantic roles. These generalizations are useful both for the practical objective of representing the semantics of texts and for qualitatively and quantitatively analyzing syntactic-semantic divergences. In this talk we will present the creation of a French FrameNet that includes semantic frames adapted from English, a lexicon of predicates that can evoke these frames, and corpus annotations. We will provide feedback from this adaptation and will present a study on automatic FrameNet parsing, showing that using normalized syntactic representations is beneficial.

[This is joint work with Marianne Djemaa, Laure Vieu, Philippe Muller (for the French FrameNet) and Olivier Michalon, Alexis Nasr, Corentin Ribeyre (for FrameNet parsing)]

11:30-11:45: Break

11:45-12:45:

Leveraging neural machine translation models for language documentation

Laurent Besacier (Université Grenoble-Alpes)

Recent advances in NLP now make it possible to recast several language-related research topics as highly interdisciplinary fields where linguistics leverages machine learning to improve our knowledge of language use. For instance, the emerging field of computational language documentation (CLD) seeks to assist linguists by providing them with automatic processing tools.
But can we really leverage modern computational models when only small and mostly oral corpora are available? I will examine the particular problem of discovering words in an unsegmented stream of characters, or phonemes, transcribed from speech in a very-low-resource setting. This primarily involves a segmentation procedure, which can also be paired with an alignment procedure when a translation is available. Neural sequence-to-sequence (seq2seq) models, now very successful for neural machine translation, can be used for this. To analyze whether such models can also learn from small data, I will present an empirical evaluation of three seq2seq architectures (based on CNN, RNN and Transformer) on a realistic Bantu corpus for language documentation (Mboshi from the Republic of Congo).

12:45: End of the morning session

Wednesday 5 June, afternoon (room to be determined)

Natural Language Processing in Russia today, an Overview

Informal workshop, discussion with Serge Sharoff

Serge Sharoff joined the University of Leeds, UK, in 2003 after obtaining his PhD in 1997 from the Moscow Lomonosov State University and holding postdoctoral appointments at the Russian Research Institute for Artificial Intelligence (1997-2000) and a Humboldt Research Fellowship at the University of Bielefeld (2001-2002). His research focuses on Natural Language Processing, including automated methods for collecting corpora from the web, their analysis in terms of domains and genres, and the extraction of lexicons and terminology from corpora. The application domains for this kind of research in the Digital Humanities include text annotation, information retrieval, machine translation and computer-assisted language learning. His research stresses the inherent multilinguality of NLP, which implies that tools and resources can be ported across languages.

Free admission, subject to available seating

Tuesday 14 May 2019, from 10:00
ENS, 45 rue d’Ulm, 75005
Salle Cavaillès (1st floor, staircase A)
Monday 20 May 2019, 11:00
Laboratoire Lattice,
1 rue Maurice Arnoux 92120 Montrouge
room 512
Wednesday 5 June 2019, from 9:30
ENS, 45 rue d’Ulm, 75005
Salle Celan (ground floor, staircase A)