GETALIGNMENTWITHTEXT(1)   User Contributed Perl Documentation   GETALIGNMENTWITHTEXT(1)

NAME
       getAlignmentWithText.pl - program that adds the text to the alignment
       files.

SYNOPSIS
       perl getAlignmentWithText.pl -acquisDir "JRC-Acquis_corpus_folder" jrc-en-fr.xml >en-fr_alignedCorpus_withText.xml

       To select only the documents from a list of Celex codes:

       perl getAlignmentWithText.pl -acquisDir "JRC-Acquis_corpus_folder" -selectionList "file_withCelexCode" jrc-en-fr.xml >en-fr_alignedCorpus_withText.xml

       To process several files, use an output folder as follows:

       perl getAlignmentWithText.pl -acquisDir "JRC-Acquis_corpus_folder" -selectionList "file_withCelexCode" -outDir "Output_folder" jrc-bg-cs.xml jrc-en-it.xml

DESCRIPTION
       To build the aligned corpus for a language pair, the program needs
       two inputs: the corpora for each language and the alignment
       information. The corpora are located through the option "acquisDir",
       which specifies where the Acquis corpus is stored. If the option is
       not specified, the default value is the current directory. The
       alignment information is provided as an argument.

       Using the option "selectionList" you can provide a list of Celex
       codes to be processed; the program will output only the files whose
       Celex code appears in the list. The codes are given in a file, one
       Celex code per line. This option can be useful if you want to process
       only documents that have Eurovoc descriptors.

       The option "outDir" makes it possible to process more than one
       language pair. You have to specify all the language pairs that you
       want to process as arguments and give the output directory where the
       alignments with text will be written. Each result file is named after
       the input file (without extension), followed by "_withText", followed
       by the extension (e.g. jrc-en-fr_withText.xml).

       The program outputs an aligned corpus, containing documents in the
       following format:

       ...
       ....header....

       19 paragraph links:

       Décision du Comité mixte de l’EEE
       DECIZIA COMITETULUI MIXT AL SEE
       no 163/2002
       nr. 163/2002
       du 6 décembre 2002
       din 6 decembrie 2002
       ....

       The file is plain XML. The UTF-8 encoding must be used in order to
       handle all character sets (French-Greek, for example).

       Example of use for the Lithuanian-Swedish alignment:

       Before launching the program, make sure you have uncompressed the
       alignment file (using the gunzip command, for example):

       gunzip jrc-lt-sv.xml.gz

       Then get and unpack the two corpora:

       tar xzf jrc-lt.tgz
       tar xzf jrc-sv.tgz

       Then launch this program using a perl5 interpreter:

       perl getAlignmentWithText.pl -acquisDir . jrc-lt-sv.xml > jrc-lt-sv_withText.xml

COMMENTS
       We have deliberately chosen to parse the texts without an XML parser.
       The format of the XML texts is well known, and the script has to be
       as fast as possible in order to handle 8000 texts in less than 5
       minutes.

AUTHORS
       camelia.ignat@jrc.it, bruno.pouliquen@jrc.it

perl v5.8.5                       2007-07-13           GETALIGNMENTWITHTEXT(1)
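The two conventions described under DESCRIPTION (the "_withText" output name and the one-code-per-line selection file) can be illustrated with a minimal Perl sketch. It is not taken from getAlignmentWithText.pl: the helper names with_text_name and read_selection are hypothetical, the codes "CODE-A" and "CODE-B" are placeholders rather than real Celex codes, and the handling of blank lines and surrounding whitespace is an assumption.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper (not taken from the script): derive the output file
# name as <input name without extension>_withText<extension>.
sub with_text_name {
    my ($input_file) = @_;
    my ($name, $dir, $ext) = fileparse($input_file, qr/\.[^.]*/);
    return $name . '_withText' . $ext;
}

# Hypothetical helper: read one Celex code per line from a filehandle and
# return a hash reference usable as a lookup set. Skipping blank lines and
# trimming whitespace are assumptions made for this sketch.
sub read_selection {
    my ($fh) = @_;
    my %wanted;
    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;
        next if $line eq '';
        $wanted{$line} = 1;
    }
    return \%wanted;
}

print with_text_name('jrc-en-fr.xml'), "\n";   # jrc-en-fr_withText.xml

# Placeholder codes for illustration only, read from an in-memory file.
my $list = "CODE-A\nCODE-B\n";
open my $fh, '<', \$list or die "cannot open in-memory list: $!";
my $wanted = read_selection($fh);
close $fh;
print(exists $wanted->{'CODE-A'} ? "selected\n" : "skipped\n");
```

A hash is used for the selection so that checking each document's code stays a constant-time lookup, which is in keeping with the speed goal stated under COMMENTS.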