Extracting A Parallel Corpus from the Common Crawl
● Candidate document pairs identified via URLs
  ○ http://europa.eu/index_de.htm
  ○ http://europa.eu/index_en.htm
● HTML documents aligned using tags
  ○ misaligned pairs are dropped
● Sentences aligned using Church & Gale
  ○ i.e. based on sentence length only
● A handful of heuristics to check sentence alignment
  ○ if there are numbers they must match
● Currently no charset detection or language detection (!)

Accessing CommonCrawl Data

● 60TB of data on Amazon S3, only feasible to access through Elastic Map-Reduce
● Strategy:
  ○ Mappers search for language codes in URLs:
    ■ http://europa.eu/index_de.htm and http://europa.eu/index_en.htm are both mapped to http://europa.eu/index_*.htm
  ○ Reducers receive candidate bilingual document pairs and return aligned parallel sentences
● Implementation:
  ○ Hadoop code (Java) which works on Amazon Elastic Map-Reduce or on a local Hadoop install

Example sentence pairs (English, Spanish):

● More choices, better coverage
  ○ Más opciones y mejor cobertura
● We pay 100% of covered costs directly
  ○ Pagamos directamente el 100% dea los costos
● Clear terms, flexible payment options
  ○ Condiciones claras y opciones de pago flexibles
● 24-hour roadside assistance
  ○ Asistencia en carretera las 24 horas
● Your choice of repair shops
  ○ Usted elige el taller de reparación
● Towing and rental car provided
  ○ Servicio de remolque y alquiler de automóviles
● A company you can trust
  ○ Una empresa en la que usted puede confiar

Counts per language pair (from a 1GB sample of the 60TB corpus):

         de-en   de-es   de-fr   en-es   en-fr   es-fr
sents     5025     408     186     390    5535     284
words    15971     883     477     902   25673    1106
chars   122701    8068    3435    6016  161938    7649
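
The mapper/reducer strategy and the "numbers must match" heuristic from the slides can be illustrated with a minimal sketch. This is not the Hadoop (Java) implementation the slides refer to; it is a plain Python illustration under stated assumptions. The names `LANG_CODES`, `url_key`, `candidate_pairs`, and `numbers_match` are hypothetical, the list of language codes is assumed from the language pairs in the counts table, and the rule "replace the last delimited language code in the URL" is one plausible reading of the slides, not a confirmed detail.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumption: these four codes are taken from the language pairs in the
# counts table; the original code may recognise a different set.
LANG_CODES = ("de", "en", "es", "fr")

# A code only counts when it is delimited by non-alphanumeric characters,
# so the "de" inside "index" is ignored while "_de." is matched.
_LANG_RE = re.compile(
    r"(?<![A-Za-z0-9])(" + "|".join(LANG_CODES) + r")(?![A-Za-z0-9])"
)
_NUM_RE = re.compile(r"\d+")


def url_key(url):
    """Mapper step: return (wildcarded_url, lang), or None if no code is found.

    Assumption: the last delimited language code in the URL is the one replaced.
    """
    matches = list(_LANG_RE.finditer(url))
    if not matches:
        return None
    m = matches[-1]
    return url[:m.start()] + "*" + url[m.end():], m.group(1)


def candidate_pairs(urls):
    """Reducer-side grouping: URLs sharing a key become candidate document pairs."""
    by_key = defaultdict(dict)
    for url in urls:
        keyed = url_key(url)
        if keyed is None:
            continue
        key, lang = keyed
        # Keep the first URL seen for each language under a given key.
        by_key[key].setdefault(lang, url)
    pairs = []
    for langs in by_key.values():
        for a, b in combinations(sorted(langs), 2):
            pairs.append((a, b, langs[a], langs[b]))
    return pairs


def numbers_match(src, tgt):
    """Heuristic check: both sentences must contain the same digit sequences."""
    return sorted(_NUM_RE.findall(src)) == sorted(_NUM_RE.findall(tgt))
```

Grouping on the wildcarded URL is what lets a reducer see both language versions of a page together. The number check is deliberately crude: it compares digit runs only and would need extra handling for numbers written out as words or formatted differently across languages.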