LREC, Conference on Language Resources and Evaluation (Istanbul, 2012)

Since its first edition, held in Granada in 1998, LREC has become the major event on language resources and evaluation for language technologies. In the Research Group for Human Language Technologies' article, we describe and make public large-scale language resources (a large web corpus and a word frequency list) for medium-density European languages, together with the toolchain used to create them. To make the process uniform across languages, we used tools that are either language-independent or easily customizable for each language, and we reimplemented certain stages of the process (sentence- and word-level tokenizers, boilerplate and near-duplicate detection).