Parallel corpora: perspectives from motion events and Slavic languages


Annemarie Verkerk and Ruprecht von Waldenfels (University of Reading, Berkshire, United Kingdom)

9 December 2013


Talk given as part of the Lattice project "Espace, Temps, Existence".
Coordinators: Laure Sarda and Anne Carlier


Parallel corpora are collections of texts that are all translations of a single original text. In recent years, there has been growing interest in using parallel corpora to study a range of linguistic phenomena. In this presentation, we will give a hands-on introduction to building and using parallel corpora. We will start with a general introduction, giving an overview of some studies that have used parallel corpora and of the advantages and disadvantages of using them for comparative research. Then we will present two parallel corpora: the Alice corpus and the ParaSol parallel corpus.

The Alice corpus contains parallel prose in more than 20 Indo-European languages. It has been constructed solely for the purpose of investigating motion event encoding and is therefore quite small and not publicly available. We will give an overview of how this corpus was created, discussing various aspects of parallel corpus building, including: 1) deciding which texts to use; 2) storing data in a relational database; 3) managing access for multiple users; 4) generating output for further analysis.
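As a rough illustration of steps 2 and 4, the following minimal Python sketch stores aligned sentences in a relational table and exports sentence pairs for two languages. It is not the schema of the Alice corpus: the table name `translation`, the functions `build_corpus` and `aligned_pairs`, and the example sentences are all invented for illustration, and an in-memory SQLite database stands in for whatever database system was actually used.

```python
import sqlite3


def build_corpus(sentences):
    """Create an in-memory database from (segment_id, lang, text) rows.

    Hypothetical schema: one row per translated segment, where
    segment_id identifies the same passage across all languages.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE translation ("
        " segment_id INTEGER NOT NULL,"
        " lang TEXT NOT NULL,"
        " text TEXT NOT NULL,"
        " PRIMARY KEY (segment_id, lang))"
    )
    conn.executemany("INSERT INTO translation VALUES (?, ?, ?)", sentences)
    conn.commit()
    return conn


def aligned_pairs(conn, lang_a, lang_b):
    """Return (segment_id, text_a, text_b) for segments present in both languages."""
    query = (
        "SELECT a.segment_id, a.text, b.text "
        "FROM translation AS a "
        "JOIN translation AS b ON a.segment_id = b.segment_id "
        "WHERE a.lang = ? AND b.lang = ? "
        "ORDER BY a.segment_id"
    )
    return conn.execute(query, (lang_a, lang_b)).fetchall()


# Invented example rows, not taken from any corpus.
rows = [
    (1, "en", "Alice ran across the field."),
    (1, "fr", "Alice traversa le champ en courant."),
    (2, "en", "She went down the hole."),
]
conn = build_corpus(rows)
pairs = aligned_pairs(conn, "en", "fr")
```

Keying each row on the segment identifier and the language code means a new language can be added without changing the table layout, and a self-join retrieves only the segments that exist in both languages.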

ParaSol is a parallel corpus focussing on Slavic languages, but containing prose in over 20 languages. It uses a framework and web interface, based on standard tools, that are designed to make the corpus easy to build, maintain and query. We will cover the rationale, the overall design and the concrete steps leading from digitizing data to querying the corpus.

At the end of the presentation, we will leave some time for interaction with the audience, so that attendees can see how these two corpora can be used.

