The Acquis Communautaire multilingual parallel corpus and Eurovoc (v 2.2)

Version 3.0 is now available. Go to http://langtech.jrc.it/JRC-Acquis.html to get this latest and extended version.

Introduction
Statistics on the corpus
Source of the documents
Document conversion and processing
Sentence alignment across the various languages
Eurovoc classification of the texts
Usage conditions / licensing issues
Related information
Contributors
Contact

1) Introduction

What is the Acquis Communautaire

Before joining the European Union (EU), the new Member States (NMS) needed to translate and approve the existing EU legislation, consisting of selected texts written between the 1950s and 2005. This body of legislative text, which consists of approximately eight thousand documents and which covers a variety of domains, is called the Acquis Communautaire (AC). As there were 20 official EU languages at the beginning of the year 2005, the AC thus exists as a parallel text (text and its translation) in 20 languages. The languages are Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Slovak, Slovene and Swedish. The EU Candidate Countries Croatia, Romania and Bulgaria have started translating the AC, so that some of the documents are available in these languages, as well. However, Croatian and Bulgarian texts are not currently part of the distribution.

The linguistic research interest of the Acquis Communautaire

In (computational) linguistics, parallel corpora are useful resources that are used for various applications and purposes. Most parallel corpora exist for a small number of languages. To our knowledge, the AC with its 20+ languages and its approximately 8,000 documents is the largest existing parallel corpus, if we take into account both its size and the number of languages covered.

The AC and other Community legislation are publicly available on the European Commission's web sites. The Language Technology team of the Joint Research Centre (JRC, http://langtech.jrc.it) in Ispra, Italy, attempted to identify the documents that are part of the AC, downloaded them and converted them to XML format. In further processing steps, the texts were cleaned of their footers and annexes, and they were sentence-aligned. Instead of using a single pivot language, all possible language pair combinations were aligned individually. This is useful because aligned sentences often stand in an n-to-n relationship, and this relationship frequently differs depending on the language pair involved.

For some of the documents, only preliminary translations were available. For the online texts in some of the languages, only the title had been translated, while the displayed text was in English. An automatic language recognition tool was therefore used to filter out texts that are labelled as being in one language but are actually in English. No manual check was carried out.

The European Commission's Office for Official Publications OPOCE manages the distribution rights of this aligned multilingual parallel corpus. OPOCE agreed that the corpus can be given to research partners for non-commercial use. See the section on licensing issues, below.

2) Statistics

The ACQUIS corpus is currently available in 21 languages with the following distribution (version 2.2):

Language ISO Code	No. of Texts	Text body: Total No. Words	Text body: Total No. Characters	Text body: Average No. Words	Signature: Total No. Words	Annex: Total No. Words	Total No. Words (Text + Signature + Annex)
cs	7983	5979261	38479314	749	609441	2100301	8689003
da	7939	6548461	44444011	825	691894	1599456	8839811
de	7914	6576633	47047334	831	571928	1506847	8654608
el	7782	7377316	47715936	948	559487	1628451	9565254
en	7972	7512013	45150120	942	667978	1752545	9932536
es	7809	7964255	48281455	1020	709279	1832745	10506279
et	7944	4925361	38603952	620	439184	1819226	7183771
fi	7735	5134294	43705813	664	565226	1180877	6880397
fr	7862	7812577	45609935	994	673061	1726720	10212358
hu	7489	5391810	40601868	720	539967	1887476	7819253
it	7872	7264126	46792286	923	707467	1704221	9675814
lt	7966	5386359	39936370	676	625365	1948354	7960078
lv	7980	5656335	39290110	709	461736	2011426	8129497
mt	7639	7230538	43919981	947	505324	2288013	10023875
nl	7882	7339465	47699598	931	712255	1710041	9761761
pl	7968	5974605	43160945	750	668248	2070687	8713540
pt	7848	7851904	47225710	1001	648180	1838833	10338917
ro	5792	5122354	33681450	884	402929	4047393	9572676
sk	5278	3911895	26077956	741	413511	1381471	5706877
sl	7984	5989322	37844883	750	573052	2153138	8715512
sv	7731	6472717	42990411	837	560188	1424887	8457792
Average	7636	6353410	42340925	831	585947	1886338	8825695
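
As a quick consistency check on the table above, the last column should equal the sum of the text body, signature and annex word counts, and the average column should equal the body word count divided by the number of texts. The following is a minimal Python sketch of that arithmetic, applied to the "cs" row copied from the table. The function name and the assumption that a row is a tab-separated string are illustrative only and are not part of the distribution.

```python
def check_row(row):
    """Check one tab-separated row of the statistics table.

    Returns (language, total_ok, average_ok).
    """
    fields = row.split("\t")
    lang = fields[0]
    # Column order follows the table: texts, body words, body characters,
    # average words per text, signature words, annex words, grand total.
    texts, body_words, _chars, avg_words, sig_words, annex_words, total = (
        int(f) for f in fields[1:]
    )
    total_ok = body_words + sig_words + annex_words == total
    average_ok = round(body_words / texts) == avg_words
    return lang, total_ok, average_ok


# Row for Czech, copied from the table above.
CS_ROW = "cs\t7983\t5979261\t38479314\t749\t609441\t2100301\t8689003"

print(check_row(CS_ROW))  # ('cs', True, True)
```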

3) Source of the documents

All documents were downloaded from the websites http://europa.eu/ and http://ccvista.taiex.be. See the publication at LREC'2006 for details. The texts in the official EU languages on http://europa.eu/ were found in HTML format, while the Romanian translations on http://ccvista.taiex.be were found in MS-Word format.

4) Document conversion and processing

All documents have a numerical identifier called the CELEX code (see http://europa.eu/celex/). This code helps to find the same text in the various languages.
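
To illustrate how the CELEX code ties the language versions together, here is a minimal Python sketch that groups document identifiers by CELEX code. It relies on the id format "jrcCELEX-LG" shown in the "Document format / DTD" subsection of this document. The two-letter language suffix, the function names and the example identifiers are assumptions made for illustration; the identifiers are placeholders, not references to specific documents.

```python
import re

# Pattern for the id format "jrcCELEX-LG".
# Assumption: the language code is two lowercase letters after the last hyphen.
ID_PATTERN = re.compile(r"^jrc(?P<celex>.+)-(?P<lang>[a-z]{2})$")


def split_doc_id(doc_id):
    """Split an id such as 'jrc31995R0001-en' into (celex, language)."""
    match = ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"unexpected document id: {doc_id!r}")
    return match.group("celex"), match.group("lang")


def group_by_celex(doc_ids):
    """Map each CELEX code to a {language: document id} dictionary."""
    groups = {}
    for doc_id in doc_ids:
        celex, lang = split_doc_id(doc_id)
        groups.setdefault(celex, {})[lang] = doc_id
    return groups


# Placeholder identifiers, for illustration only.
ids = ["jrc31995R0001-en", "jrc31995R0001-fr", "jrc31996L0002-en"]
groups = group_by_celex(ids)
print(sorted(groups["31995R0001"]))  # ['en', 'fr']
```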

Conversion from HTML and MS-Word to XML

The downloaded documents (see Section 3) were converted to XML. The title and body text were isolated, and the paragraph breaks (<P> HTML tags) were kept. All texts were uniformly encoded in UTF-8.

Identification of footers / annexes

A list of rules was used to detect the beginning of the documents' annexes and signatures (repetitive and frequently multilingual text strings ending the documents) and to separate the text body from the less useful text parts. As the rules were hand-written by developers who do not speak the 20 languages, some signatures and annexes may have been missed and some may have been recognised wrongly.
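
The general idea of rule-based splitting can be sketched in Python as follows. This is not the rule set actually used for the corpus, which is not reproduced in this document. The two English-only patterns ("Done at" for the start of a signature, "ANNEX" for the start of an annex) and all names are assumptions chosen purely for illustration.

```python
import re

# Hypothetical, English-only rules; the real rules cover many languages
# and are not listed in this document.
SIGNATURE_START = re.compile(r"^Done at\b")
ANNEX_START = re.compile(r"^ANNEX\b")


def split_document(paragraphs):
    """Assign each paragraph to 'body', 'signature' or 'annex'.

    Paragraphs are scanned in order; once a rule fires, all following
    paragraphs belong to that part until a later rule fires.
    """
    parts = {"body": [], "signature": [], "annex": []}
    current = "body"
    for para in paragraphs:
        text = para.strip()
        if ANNEX_START.match(text):
            current = "annex"
        elif current == "body" and SIGNATURE_START.match(text):
            current = "signature"
        parts[current].append(para)
    return parts


sample = [
    "Article 1",
    "This Regulation shall enter into force on the day of its publication.",
    "Done at Brussels, 1 January 2000.",
    "For the Commission",
    "ANNEX I",
    "List of products",
]
parts = split_document(sample)
print(len(parts["body"]), len(parts["signature"]), len(parts["annex"]))  # 2 2 2
```

As with the hand-written rules described above, a sketch like this misses any signature or annex that does not match a pattern, and it can misfire on body paragraphs that happen to start with the same words.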

Document format / DTD

The documents have the format as illustrated below. The DTD for this format is also provided with the distribution.

    <TEI.2 id="jrcCELEX-LG" n="CELEX" lang="LG">
        <teiHeader lang="en" date.created="DATE">
            <fileDesc>
                <titleStmt>
                    <title>JRC-ACQUIS CELEX LANGUAGE</title>
                    <title>Document Title</title>
                </titleStmt>
                <extent>nb_of_paragraphs paragraph segments</extent>
                <publicationStmt>
                    <distributor>
                        <xref url="http://wt.jrc.it/lt/acquis/">http://wt.jrc.it/lt/acquis/</xref>
                    </distributor>
                </publicationStmt>
                <notesStmt>
                    ....
                </notesStmt>
                <sourceDesc>
                    <bibl>Downloaded from <xref url="Downloading_URL">Downloading_URL</xref> on <date>Downloading_DATE</date></bibl>
                </sourceDesc>
            </fileDesc>
            <profileDesc>
                <textClass>
                    <classCode scheme="eurovoc">Eurovoc_Code</classCode>
                    .....
                </textClass>
            </profileDesc>
        </teiHeader>
        <text>
            <body>
                <head n="1">Document Title</head>
                <div type="body">
                    .......
                </div>
                <div type="signature">
                    <p n="paragraph_number">... signature text...</p>
                    ....
                </div>
                <div type="annex">
                    <p n="paragraph_number">... annex text...</p>
                    ....
                </div>
            </body>
        </text>
    </TEI.2>

Note that the title, body text, signature and annex are further marked up with <p>...</p> tags. Each <p> tag carries an attribute (n) holding its sequential number in the document, which is used in the paragraph alignment.
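
A document in this format can be read with a standard XML parser. The Python sketch below uses the standard library module xml.etree.ElementTree to collect the <p> elements of each <div>, keyed by their n attribute. The sample document, its placeholder identifier and its paragraph numbering are invented for illustration. Real files may additionally reference the DTD or use entities that this simple approach does not resolve.

```python
import xml.etree.ElementTree as ET

# Invented sample that follows the element structure described above.
SAMPLE = """<TEI.2 id="jrc00000X0000-en" n="00000X0000" lang="en">
  <teiHeader lang="en" date.created="DATE"/>
  <text>
    <body>
      <head n="1">Document Title</head>
      <div type="body">
        <p n="2">First paragraph.</p>
        <p n="3">Second paragraph.</p>
      </div>
      <div type="signature">
        <p n="4">Signature text.</p>
      </div>
      <div type="annex">
        <p n="5">Annex text.</p>
      </div>
    </body>
  </text>
</TEI.2>"""


def read_paragraphs(xml_text):
    """Return (celex, language, {n: (div type, paragraph text)})."""
    root = ET.fromstring(xml_text)
    paragraphs = {}
    for div in root.iter("div"):
        div_type = div.get("type")
        for p in div.findall("p"):
            # itertext() also collects text nested inside child elements.
            text = "".join(p.itertext()).strip()
            paragraphs[int(p.get("n"))] = (div_type, text)
    return root.get("n"), root.get("lang"), paragraphs


celex, lang, paragraphs = read_paragraphs(SAMPLE)
print(celex, lang, paragraphs[2])
```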

5) Sentence alignment of the texts across languages

Strictly speaking, the corpus is currently aligned at the paragraph level, as it was the <P> elements that were aligned. However, the paragraphs of the AC corpus are usually short and typically contain one sentence, or even only part of a sentence.

Two different programs were used for the alignment, and both alignment results are available as part of the distribution. The alignments have not been evaluated. The first program used for alignment was Vanilla, written by Pernilla Danielsson and Daniel Ridings, which implements the widely used Gale and Church / Dynamic Time Warping algorithm. The C source and documentation of the program are available at http://nl.ijs.si/telri/Vanilla/. The second program used is HunAlign, described by Varga, Halácsy, Kornai, Nagy, Németh & Trón.

We decided to align the sentences of each language pair separately, instead of using one pivot language. As the corpus exists in twenty-one languages, there are 210 possible language pair combinations. For each individual language pair, we thus produced files containing the language pair-specific alignment information. These files contain, for each document identifier (CELEX number), pointers (in "n" attribute) to the paragraphs that are translations of each other. The format used is that of the Text Encoding Initiative (TEI).
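
The figure of 210 follows from counting unordered pairs: with n = 21 languages there are n(n-1)/2 = 21 x 20 / 2 = 210 pairs. A short Python sketch that enumerates them from the ISO codes listed in the statistics table (the variable names are illustrative):

```python
from itertools import combinations

# ISO codes of the 21 languages in the statistics table.
LANGUAGES = [
    "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "hu", "it",
    "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

# Each unordered pair appears exactly once, e.g. ('cs', 'da') but not ('da', 'cs').
pairs = list(combinations(LANGUAGES, 2))

print(len(pairs))  # 210
print(pairs[0])    # ('cs', 'da')
```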

Due to the size of the corpus and the number of language pairs, the files do not contain the text itself. If you want to produce the parallel corpus for a specific language pair, you thus need to generate this corpus on the basis of the monolingual corpora (which all contain paragraph identifiers in <p> tags) and the alignment information.
See directory alignment for further information on how to generate such an aligned corpus.
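
The join itself can be sketched in a few lines of Python. This sketch does not parse the distributed alignment files, whose exact layout is documented in the alignment directory. It assumes that, for one document and one language pair, the paragraphs have already been read into dictionaries keyed by the n attribute and that the alignment has already been read into a list of (source numbers, target numbers) pairs. The function name, data layout and sample texts are all assumptions made for illustration.

```python
def build_parallel(src_paragraphs, tgt_paragraphs, links):
    """Join two monolingual documents into aligned text pairs.

    src_paragraphs, tgt_paragraphs: {n: paragraph text}
    links: list of (source n values, target n values); a link may group
           several paragraphs on either side (n-to-n alignment).
    """
    rows = []
    for src_ns, tgt_ns in links:
        src_text = " ".join(src_paragraphs[n] for n in src_ns)
        tgt_text = " ".join(tgt_paragraphs[n] for n in tgt_ns)
        rows.append((src_text, tgt_text))
    return rows


# Invented sample data: two source paragraphs correspond to one target paragraph.
english = {1: "Title", 2: "First part,", 3: "second part."}
french = {1: "Titre", 2: "Première partie, deuxième partie."}
links = [([1], [1]), ([2, 3], [2])]

for src, tgt in build_parallel(english, french, links):
    print(src, "|||", tgt)
```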

6) Eurovoc classification of the texts

Most CELEX (EU) documents have been manually classified according to the subject domains to which they belong. The classification scheme used is the Eurovoc thesaurus (http://europa.eu/celex/eurovoc/), which is a multilingual wide-coverage conceptual thesaurus. The European Parliament, large parts of the European Commission and about twenty national and regional European parliaments use Eurovoc for the classification of their documents. The Eurovoc thesaurus consists of over 6,000 descriptor terms (classes) that are organized hierarchically into up to eight levels, using the relationships Broader Term - Narrower Term (BT-NT) to describe the hierarchical relationship, and Related Term (RT) to link descriptors that are related but not linked hierarchically. Additionally, synonyms and near-synonyms for some of the descriptors are listed, marked with the Use-For (UF) tag. Eurovoc exists in over twenty languages and is maintained actively. As the descriptors are defined precisely with Scope Notes, each descriptor has exactly one translation in each of the languages. Numerical descriptor IDs link the various language versions. This feature makes Eurovoc an ideal means for cross-lingual search and retrieval applications and more.

While the parliaments use professional human indexers to classify their documents manually, the JRC has been working on automating this task. For details, see http://langtech.jrc.it/Eurovoc.html.

Most Acquis Communautaire texts have been classified manually with Eurovoc descriptors. The file celex-EurovocId.txt contains the lists of numerical descriptor IDs that have been assigned to each of the AC documents. As the AC documents have been written over a period of about fifty years and the Eurovoc thesaurus keeps evolving, the documents are indexed with different Eurovoc versions. The Eurovoc descriptor codes for documents older than 1995 are not currently available. Furthermore, a small number of newer documents also seems not to have been Eurovoc-indexed, so that Eurovoc descriptor codes are not available for all AC documents.
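
Once the descriptor lists are loaded, looking up the Eurovoc codes of a document is a simple dictionary access. The layout of celex-EurovocId.txt is not described in this document, so the Python sketch below assumes one document per line, with the CELEX code followed by whitespace-separated numerical descriptor IDs. Check this assumption against the actual file before use. The function name and the sample lines are invented for illustration.

```python
def parse_descriptor_lines(lines):
    """Map CELEX codes to lists of numerical Eurovoc descriptor IDs.

    Assumed line layout (not confirmed by this document):
        CELEX_CODE ID ID ID ...
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        celex, ids = fields[0], fields[1:]
        mapping[celex] = [int(i) for i in ids]
    return mapping


# Invented sample lines in the assumed layout.
sample_lines = [
    "00000X0001 100 2345 678",
    "",
    "00000X0002 42",
]
descriptors = parse_descriptor_lines(sample_lines)
print(descriptors["00000X0001"])  # [100, 2345, 678]

# Documents without Eurovoc codes are simply absent from the mapping.
print(descriptors.get("00000X0003", []))  # []
```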

With this distribution, we provide the numerical Eurovoc descriptor codes. Should you be interested in the descriptor text (the class name in any of the EU languages), you will need to get the licence for Eurovoc from OPOCE.

7) Usage conditions / Licensing issues

Acquis Communautaire corpus

According to an agreement with the European Commission's Office for Official Publications OPOCE, the AC corpus can be used and distributed for research purposes, but the following usage conditions must be adhered to:

The European Communities consider legislative and quasi-legislative documents published in the Official Journal of the European Union and related COM and SEC series as well as charters and treaties and ECJ case-law to be in the public domain. Prior written permission is thus not required for their reproduction/translation, and they may be reproduced/translated freely without restriction, including for the purpose of further non-commercial dissemination to final users, subject to the condition that appropriate acknowledgment is given to the European Communities and to the source, and provided that the additional guidelines set out below are respected.

(1) Whenever a document is reproduced verbatim from a source other than the printed version of the Official Journal of the European Union, a prominently positioned disclaimer should read:

'Only European Community legislation printed in the paper edition of the Official Journal of the European Union is deemed authentic.'

(2) For the reasons stated in the disclaimer above, it is advisable to ensure that translations are made from the printed, authentic version of the Official Journal. This precaution, while minimizing the risk of error, does not confer any legal status whatsoever to the translated text. The following notice shall accompany the translated text, printed below the acknowledgment:

'Originally published in the official languages of the European Union in the Official Journal of the European Union by the Office for Official Publications of the European Communities. Responsibility for the translation into [specify language] from the original [specify language] edition lies entirely with [name of translation copyright holder].'

Moreover, please note that we do not consider a "further commercial dissemination" the inclusion, as reference material for consultation purposes, of small amounts of relevant legislative texts in articles/thesis/studies/reports/books issued by third-party authors or publishers, whatever the means, and disseminated subject to payment.

Eurovoc thesaurus

Unlike the AC corpus, the Eurovoc thesaurus (http://europa.eu/celex/eurovoc/) must not be used or disseminated without prior written permission from the European Commission's Office for Official Publications OPOCE. If you want to get the rights to use Eurovoc and to receive a copy of the multilingual thesaurus, please contact OPOCE at opoce-info-copyright@cec.eu.int, mentioning the file reference number 2005-COP-395. To our knowledge, the licence is free of charge for research purposes and a commercial licence costs 500 Euro. To obtain a commercial licence, please contact OPOCE.

8) Related information

The JRC Workshop on Exploiting multilingual parallel corpora (26-27 September 2005) was dedicated to exploring methods to exploit the Acquis Communautaire and similar corpora. You can find more information on the workshop web page http://langtech.jrc.it/0509_EU-Enlargement-Workshop.html.

A description of the Acquis Communautaire corpus was published in the paper below. Please check the web site http://langtech.jrc.it for more up-to-date publications on the subject.

Steinberger, Ralf, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufiş, Dániel Varga (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006). Genoa, Italy, 24-26 May 2006. Available at http://langtech.jrc.it/.

9) Contributors

The following persons have contributed to the gathering, preparation and publication of the aligned Acquis Communautaire corpus:

10) Contact

To obtain a licence of the Eurovoc thesaurus, please contact the European Commission's Office for Official Publications OPOCE at opoce-info-copyright@cec.eu.int, mentioning the file reference number 2005-COP-395 (see above).

For information about the AC corpus and related work, please contact Ralf Steinberger or another member of the JRC's Language Technology team (see http://langtech.jrc.it/index.html#Staff) at the email address of the format Firstname.Lastname@jrc.it. The postal address is: