(a) Finding parallel data, and (b) Introduction to synchronous grammars
LING 575 Lecture 6, part two
Chris Quirk
With materials borrowed from Kristina Toutanova, David Chiang, Philipp Koehn, Dekai Wu, and others…

Learning curves (Koehn, Och, and Marcu 2003)
• Parameters and algorithms matter…
• …but there’s no data like more data

Where did the initial SMT data come from?
• Two papers at ACL 1991, both on parallel sentence extraction
– Gale and Church at AT&T
– Brown, Lai, and Mercer at IBM
– Both described a very similar method for finding parallel sentence pairs in the Hansards, a parallel corpus of parliamentary proceedings
• Parliamentary discussion, 1973–1986
– English: 85M tokens / 3.5M sentences
– French: 98M tokens / 3.7M sentences

Sample parliamentary discussion

Alignment model
• Treat each document as a sequence of sentences & paragraph markers ¶
• Generate a sequence of beads
– Find a sequence of beads that generates both the French and English corpora
– 8 bead types
– Learn a prior over each bead type, and a distribution over sentence lengths given a bead type

Length parameters
• Probability distributions over sentence lengths Pr(ℓe) and Pr(ℓf)
– For ℓ ≤ 80, we can use the empirical distribution
– For ℓ > 80, fit a Poisson distribution to the tail
• Conditional distribution over joint lengths:
– Pr(ℓf | ℓe) ∝ exp(−(r − μ)² / (2σ²))

Distribution over beads
• Bead probabilities:
– Pr(ei) = Pr(e) × Pr(ℓei)
– Pr(fj) = Pr(f) × Pr(ℓfj)
– Pr(ei fj) = Pr(ef) × Pr(ℓei) × Pr(ℓfj | ℓei)
– Pr(ei ei+1 fj) = Pr(eef) × Pr(ℓei) × Pr(ℓei+1) × Pr(ℓfj | ℓei + ℓei+1)
– Pr(ei fj fj+1) = Pr(eff) × Pr(ℓei) × Pr(ℓfj + ℓfj+1 | ℓei)
– plus the paragraph beads Pr(¶e), Pr(¶f), and Pr(¶e ¶f)

Evaluation
• Error rate: 3.2% w/o paragraphs
• Drops to 2.0% w/ paragraphs
• Drops to 0.9% w/ “anchor points”
– Coarse-to-fine alignment; the coarse pass is based on speaker changes

Weaknesses and next steps
• Each document is just a sequence of lengths and paragraph markers
– Good enough for very parallel data
– More and more dangerous as the data becomes less parallel
• In addition, we can model lexical items
– For instance, Moore 2002: Fast and
accurate sentence alignment of bilingual corpora
– Problems: how do we model, where do we estimate parameters, and what about the search space?

How to model lexical correspondences?
• Moore’s approach: just use Model 1
– P(t | s) = (ε / (l+1)^m) ∏_{j=1}^{m} Σ_{i=0}^{l} t(tj | si)
• Prune parameters
– Keep only the top 5K vocabulary items, tokens must occur at least twice, fold everything else into an UNK class
– Minimizes the parameter set with minimal impact on quality

How do we estimate parameters?
• Could use a seed parallel corpus
• Here, use the output of a length-based aligner
– Find beads with a very high posterior probability
– Train Model 1 on these beads
• Upside: no need for resources to start the process
• Downside: the length-based aligner must be a reasonable starting point

Evaluation
• English-Spanish technical docs

More parallel documents?
• Existing sources?
• Find bitexts on the web
• Identify parallelism using
– Document structure
– Document content
• Exploit search engines, the Internet Archive
• Evaluate intrinsically

STRAND (Resnik and Smith, 2003)
• Find local pages that might have translations
• Generate potential page pairs
• Filter pairs based on structural criteria

Finding pages with translations
• Look for
– Parent pages (e.g. a page with outgoing links labeled English/Anglais and French/Français)
– Sibling pages (e.g. a French page with an English/Anglais outgoing link)
• Can optionally spider out from there
– A domain with one parallel page pair is likely to have many more

Extracting handles from URLs

Generating candidate pairs
• Need to match up pages
– Siblings are easy; crawled document pairs are more difficult
• Start with a substitution list {en_us -> fr_fr}
  For each English document URL e
    For each substitution mapping s matching e
      f = s applied to e
      If f is a URL for a French document
        Propose pair <e, f>
• Can match other ways, e.g.
by doc length

Structural filtering
• Parallel documents often have parallel structure

Comparing structure
• Align using diff – the worst case is worse than O(nm); the common case is about 10x faster

Alignment quality

Quality of extracted data
• Ask two humans to rate the adequacy of
– 30 good sentence pairs (from NIST training data)
– 30 web-extracted sentence pairs
– 30 Chinese sentences with MT English output
• Scale: ratings

Recent work
• Large Scale Parallel Document Mining for Machine Translation
– Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner
– Translate as much content as possible from the native language into English
– Perform approximate duplicate detection

What about imperfect data?
• Limited amounts of truly parallel data
• We can learn a lot from “comparable” corpora
– Lexicon entries (e.g. Rapp 1999, Fung 2000)
– Sentence pairs (e.g. Zhang and Vogel 2002)
– Named entity translations (e.g. Al-Onaizan and Knight 2002, Klementiev and Roth 2006)
– Fragment pairs (e.g. Munteanu and Marcu 2006)

Bird Brain Dies After Years of Research / ¿Qué le ha pasado a Alex?
WALTHAM, Mass. (AP) — Alex, a parrot that could count to six, identify colors and even express frustration with repetitive scientific trials, has died after 30 years of helping researchers better understand the avian brain. The cause of Alex's death was unknown. The African grey parrot's average life span is 50 years, Brandeis University scientist Irene Pepperberg said. Alex was discovered dead in his cage Friday, she said, but she waited to release the news until this week so grieving researchers could get over the shock and talk about it. "It's devastating to lose an individual you've worked with pretty much every day for 30 years," Pepperberg told The Boston Globe. "Someone was working with him eight to 12 hours every day of his life." Alex's advanced language and recognition skills revolutionized the understanding of the avian brain.
After Pepperberg bought Alex from an animal shop in 1973, the parrot learned enough English to identify 50 objects, seven colors and five shapes. He could count up to six, including zero, was able to express desires, including his frustration with the repetitive research. He also occasionally instructed two other parrots at the lab to "talk better" if they mumbled, though it wasn't clear whether he was simply mimicking researchers. Alex hadn't reached his full cognitive potential and was demonstrating the ability to take distinct sounds from words he knew and combine them to form new words, Pepperberg said. Just last month, he pronounced the word "seven" for the first time. … Washington. (EFE y Redacción).- Alex, un loro africano que podía diferenciar colores y cuya inteligencia maravilló a los científicos durante más de 30 años, fue encontrado muerto en el laboratorio de la Universidad de Brandeis, en el estado de Massachusetts. Un comunicado de la universidad señaló hoy que Alex, un ejemplar gris comprado para estudiar el cerebro de las aves en 1977, podía diferenciar 50 objetos, distinguía siete colores y formas. Además, podía contar hasta seis y expresaba deseos y hasta frustración cuando las pruebas científicas eran demasiado repetidas. También decía "hablen bien" cuando los otros dos loros del laboratorio, Griffin, de 12 años, y Arthur, de 8, pronunciaban mal las palabras que habían aprendido. Según la universidad, su desarrollo era similar al de un niño de 2 años e intelectualmente, tenía el cerebro de uno de 5. "Es devastador perder a un individuo con el cual una ha trabajado todos los días durante 30 años", dijo Irene Pepperberg, científico de la Universidad de Brandeis. Se calcula que la media de vida de un loro es alrededor de 50 años, aunque pueden alcanzar los 100. Alex pertenecía a la variante de los 'yacos', los loros más inteligentes de su especie. 
Pepperberg agregó que Alex fue encontrado muerto en su jaula el pasado viernes y que se desconocen las causas de su deceso. La investigadora informó que lo vio con vida el jueves pasado cuando se despidió de él diciéndole: "Sé bueno. Te quiero. Nos vemos mañana". El loro le respondió: "Mañana estás aquí". …

Example aligned fragments from the article pair:

Además, podía contar hasta seis y expresaba deseos y hasta frustración cuando las pruebas científicas eran demasiado repetidas.
He could count up to six, including zero, was able to express desires, including his frustration with the repetitive research.

"Es devastador perder a un individuo con el cual una ha trabajado todos los días durante 30 años", dijo Irene Pepperberg, científico de la Universidad de Brandeis.
"It's devastating to lose an individual you've worked with pretty much every day for 30 years," Pepperberg told The Boston Globe.

Fundamental problem
• Given sentence pairs with some content in common, identify the fragment alignment
"It's devastating to lose an individual you've worked with pretty much every day for 30 years," Pepperberg told The Boston Globe.
"Es devastador perder a un individuo con el cual una ha trabajado todos los días durante 30 años", dijo Irene Pepperberg, científico de la Universidad de Brandeis.
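One simple attack on this fragment-alignment problem is a signal-style heuristic: score each source word by whether it has a translation partner in the other sentence, smooth the scores, and retain positive runs. The sketch below is illustrative only; the toy lexicon, window size, and minimum run length are invented, not taken from the lecture.

```python
# Toy sketch: detect a candidate "parallel" fragment inside a noisy sentence pair.
# Words with a partner in the other sentence score +1, others -1; the signal is
# smoothed with a centered moving average, and strictly positive runs of at
# least `min_len` words are kept. The tiny lexicon is invented for illustration.

LEXICON = {("fraude", "fraud"), ("es", "is"), ("normal", "normal"),
           ("en", "in"), ("la", "the"),
           ("republica", "republic"), ("dominicana", "dominican")}

def word_score(src_word, tgt_words):
    return 1.0 if any((src_word, t) in LEXICON for t in tgt_words) else -1.0

def find_fragments(src_words, tgt_words, window=3, min_len=3):
    scores = [word_score(w, tgt_words) for w in src_words]
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - window // 2), min(len(scores), i + window // 2 + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    fragments, run = [], []
    for w, s in zip(src_words, smoothed):
        if s > 0:
            run.append(w)
        else:
            if len(run) >= min_len:
                fragments.append(run)
            run = []
    if len(run) >= min_len:
        fragments.append(run)
    return fragments

src = "el fraude es normal en la republica dominicana dijo bosch".split()
tgt = "fraud is normal in the dominican republic he said".split()
print(find_fragments(src, tgt))
```

Note how the smoothing bridges isolated negative words inside a mostly-matching span, while short accidental matches are filtered by the run-length threshold.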
Finding promising sentence pairs in news corpora (Munteanu & Marcu 2006)
• Pipeline: a seed parallel corpus is word-aligned to train word alignment models; source- and target-language documents are indexed (inverted index over words); similar document pairs are found using cross-language IR; promising document pairs are filtered by length and vocabulary to yield promising sentence pairs

Munteanu and Marcu’s approach
• Motivated by signal processing
– Pick one language, assign each word a score in [-1,+1] based on its best-scoring partner word in the other language
– Smooth the signal with a moving average
– Retain strictly positive sequences of ≥ 3 words
– Apply to each side independently
• Comments
– Selection is independent: the English translation of a Spanish word may not meet the English filtering criteria
– Spans are simply concatenated
– No notion of location

A probabilistic interpretation
• Insight: fragments S and T are parallel iff P(S, T) > P(S) ∙ P(T)
… iff P(T | S) > P(T)
… iff P(S | T) > P(S)
• Present two generative models of comparable sentence pairs that capture this insight
• Hidden alignments identify fragment correspondences in noisy sentence pairs
– Selection is no longer independent
– Position in the other sentence matters
• Evaluate in terms of end-to-end MT (BLEU)

Comparable Model B (Quirk and Udupa, 2007)
• Joint, generative model of source and target fragments
– Decide the number of fragments
– Each fragment generates source and target words to be appended (one side, but not both, may be empty)
– Fragment alignment is monotonic
• Requires source + target n-gram LMs, conditional HMM models of S|T (and vice versa)
• Monolingual fragment score: P(S) (or P(T))
• Bilingual fragment score: min { P(S) ∙ P(T|S), P(T) ∙ P(S|T) }

Example: “El fraude es lo normal en la República Dominicana en la actual situación”, afirmó Bosch / “Fraud is normal in the Dominican Republic”, he said to reporters
– Bilingual fragment score: min { P(El fraude…) ∙ P(Fraud is… | El fraude…), P(Fraud is…) ∙ P(El fraude… | Fraud is…) }
– Monolingual fragment score, e.g. P(en la actual situacion | … Republica Dominicana)

Search procedure
• Monotone 0th-order model: dynamic programming
– δ[j, k] := best fragment alignment of the first j source and k target words
– δ[j, k] = max { δ[j′, k′] ∙ P(Sj′…Sj, Tk′…Tk) : 0 ≤ j′ < j, 0 ≤ k′ < k }
• Exact, but expensive: O(n⁶)
• Beam search provides a significant speedup
– Model 1 scores prune bilingual spans
– Bilingual fragment size limitations reduce the search space

Data
• Parallel data: Spanish-English Europarl, WMT 2006
– Provided tokenization, lowercased
– GIZA++ word alignment (1⁵H⁵4⁵); grow-diag-final
• Comparable data: Spanish and English Gigaword corpora
– Same tokenization as above, lowercased
– Spanish: 2.2M docs, 20M sents, 690M words
– English: 3.5M docs, 49M sents, 1.8B words
• First-pass extraction stats
– Low recall: 27M doc pairs, 2.6M promising sentence pairs
– High recall: 28M doc pairs, 84M promising sentence pairs

Fragment extraction
• Extract fragments from the Spanish and English Gigaword corpora using three approaches:
– MM: reimplementation of Munteanu and Marcu 2006
– A(e|s): conditional model, one direction only
– B: joint model of Spanish and English
• Word alignment models, language models, and other models for MM trained on Europarl data only

Spanish-English BLEU scores
• [Bar chart: Europarl baseline vs. systems adding news and web data extracted by MM, A(e|s), and B; BLEU values range roughly from 19 to 29]

What about Wikipedia?
• Available in many languages
• Same-topic articles connected via “interwiki links”
– English: 3,000,000+ articles
– English-Spanish pairs: 278,000+
– English-Bulgarian pairs: 50,000+
– English-German pairs: 477,000+
– English-Dutch pairs: 356,000+
• Goal: find parallel sentences, improve the MT system

Baseline Model: binary classifier on sentence pairs

Baseline model
• Binary classifier:
– (S, T) = true for true sentence pairs <S, T>
– (S, T′) = (S′, T) = false for all other pairs
• Severe class imbalance problem
– We get 𝒪(n) positive examples and 𝒪(n²) negative examples
– The classifier needs a strong push to predict positive
• One solution: a ranking model
– Train a scoring model σ(S, T) so that for a given pair (S, T), σ(S, T) > σ(S, T′) for all other target sentences T′ in the same document

Ranking Model Features
• We define the following features on both Model 1 and HMM word alignments:
– Log probability
– Number of aligned/unaligned words
– Longest aligned/unaligned spans
– Number of words with fertility 1, 2, and 3+
• Length feature (Moore 2002): log Poisson(|T|; |S|·r)
• Difference in relative sentence position

Wikipedia Features
• Number of hyperlinks pointing to equivalent articles (determined by the “interwiki links”)
• Image feature: fires on sentences which are parts of captions of the same image
• List feature: fires when both sentences are part of a list

Sequence Model
• We extend the ranking model to include features based on the previous alignment
• These features are based on the positions of the aligned target sentences

Experiments
• We annotated 20 Wikipedia article pairs in
1. Spanish – English
2. Bulgarian – English
3.
German – English
• Sentence pairs were given a quality rating:
– 1: Some phrases are parallel
– 2: Mostly parallel, with some missing words
– 3: Very high quality, likely translated by a bilingual user
• Ratings 2 and 3 were considered correct in our experiments

BLEU results

INTRODUCTION TO SYNCHRONOUS GRAMMARS

Why do we parse natural language?
1. To find a deeper or alternate interpretation
– A translation task: find the target-language meaning given the source-language representation
2. To check for grammaticality / well-formedness
– Also a translation task, but an argument for using parsing on the target side

Overview
• Motivation
– Examples of reordering/translation phenomena
• Synchronous context-free grammar
– Example derivations
– ITG grammars
– Reordering for ITG grammars
• Applications of bracketing ITG grammars
– Applications: ITGs for word alignment
• Hierarchical phrase-based translation with Hiero
– Rule extraction
– Model features
• Decoding for SCFGs and integrating a LM

Motivation for tree-based translation
• Phrases capture contextual translation and local reordering surprisingly well
• However, this information is brittle:
– “author of the book 本書的作者” tells us nothing about how to translate “author of the pamphlet” or “author of the play”
– The Chinese phrase “NOUN1 的 NOUN2” becomes “NOUN2 of NOUN1” in English
• There are general principles a phrase-based system is not using
– Some languages have adjectives before the nouns, some after
– Some languages place prepositions before nouns, some after
– Some languages put PPs before the head, others after
– Some languages place relative clauses before the head, others after
• Discontinuous translations are not handled well by phrase-based systems
– ne … pas in French
– Separable prefixes in German
– Split constructions in Chinese

Types of tree-based systems
• Formally syntax-based
– Use the notion of a grammar, but without linguistically motivated annotations (no
nouns, verbs, etc.)
– Model the hierarchical nature of language
– Examples: phrase-based ITGs and Hiero (we will focus on these in this lecture)
• Linguistically syntax-based
– Use parse information derived from a parser (rule-based, treebank-trained, etc.)
– Could be source- or target-side parsing
– Phrase structure trees, dependency trees

SYNCHRONOUS CONTEXT-FREE GRAMMARS

Review: context-free grammars
• A CFG is a tuple:
– Terminal set Σ (e.g. “the”, “man”, “left”)
– Nonterminal set N (e.g. “NP”, “VP”)
– Rule set R; each rule has a parent symbol from N and a production or yield drawn from (N ∪ Σ)* (e.g. “S -> NP VP”, “DT -> the”)
– A top symbol S ∈ N
• We can both parse and generate with this
– Parse: start with a terminal sequence, find substrings that match the yield of a rule and replace them with the parent symbol; repeat until we reach the top symbol
– Generate: start with the top symbol, pick a non-terminal and replace it with its yield, until we have only terminal symbols in the sequence

Synchronous context-free grammars
• A generalization of context-free grammars
(Slides from David Chiang, ACL 2006 tutorial: context-free grammars, with an example in Japanese; synchronous CFGs)

Rules with probabilities
• Joint probability of the source and target language re-writes, given the non-terminal on the left
• Could also use the conditional probability of target given source, or of source given target.
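A tiny weighted SCFG makes these rule probabilities concrete. The grammar, weights, and words below are invented for illustration (a toy Japanese-to-English fragment in the spirit of the tutorial example); co-indexed nonterminals on the two sides are expanded together, and a derivation's probability is the product of its rule probabilities.

```python
# Minimal weighted SCFG sketch. Rules and probabilities are invented for
# illustration. A nonterminal occurrence is a (symbol, index) tuple; the same
# index on the source and target yields marks the linked occurrence.
RULES = {
    "S":   [((("NP", 1), ("VP", 2)), (("NP", 1), ("VP", 2)), 1.0)],
    "NP":  [(("watashi", "wa"), ("i",), 1.0)],
    # verb-final source order paired with verb-initial target order
    "VP":  [((("OBJ", 1), ("V", 2)), (("V", 2), ("OBJ", 1)), 1.0)],
    "OBJ": [(("hako", "wo"), ("the", "box"), 0.7),
            (("doa", "wo"), ("the", "door"), 0.3)],
    "V":   [(("akemasu",), ("open",), 1.0)],
}

def best_derivation(symbol):
    """Expand `symbol` greedily with its highest-probability rule,
    returning (source words, target words, derivation probability)."""
    src_yield, tgt_yield, prob = max(RULES[symbol], key=lambda r: r[2])
    linked = {}                      # index -> target words of that subtree
    src_out, tgt_out = [], []
    for item in src_yield:
        if isinstance(item, tuple):  # co-indexed nonterminal
            sub_src, sub_tgt, sub_p = best_derivation(item[0])
            linked[item[1]] = sub_tgt
            src_out.extend(sub_src)
            prob *= sub_p
        else:                        # terminal
            src_out.append(item)
    for item in tgt_yield:
        if isinstance(item, tuple):  # emit the linked subtree's target words
            tgt_out.extend(linked[item[1]])
        else:
            tgt_out.append(item)
    return src_out, tgt_out, prob

src, tgt, p = best_derivation("S")
print(" ".join(src), "->", " ".join(tgt), p)
```

The VP rule is where the synchrony matters: the two sides share the same children, but the target yield lists them in a different order, so reordering falls out of the grammar rather than a separate distortion model.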
Synchronous CFGs (slide from David Chiang, ACL 2006 tutorial)

Chinese-English example
• English parse: (S (NP I) (VP ate (PP at (NP the restaurant))))
• Synchronous tree: wo/I, zai/at, fan dien/restaurant, chi fan/ate, yielding “I ate at the restaurant”

Stochastic Inversion Transduction Grammars (Wu, 1997)
• First use of SCFGs in statistical MT
• Restricted form of SCFGs: only concatenate or swap adjacent non-terminals
– X → ⟨Y₁ Z₂, Y₁ Z₂⟩, written X → [Y Z]
– X → ⟨Y₁ Z₂, Z₂ Y₁⟩, written X → ⟨Y Z⟩
• At the lowest level, generate words or word pairs
– X → e/f; X → ε/f; X → e/ε

Even more restricted: bracketing ITG grammars
• A minimal number of non-terminal symbols
• Does not capture linguistic syntax, but can be used to explain word alignment and translation
– A → ⟨A₁ A₂, A₁ A₂⟩, i.e. A → [A A]
– A → ⟨A₁ A₂, A₂ A₁⟩, i.e. A → ⟨A A⟩
– A → x/y; A → x/ε; A → ε/y
• Can be extended to allow direct generation of one-to-many or many-to-many blocks (Block ITG): A → x̄/ȳ for phrases x̄, ȳ

Reordering in a bracketing ITG grammar
• Because of the assumption of hierarchical movement of contiguous sequences, the space of possible word alignments between sentence pairs is limited
• Assume we start with a bracketing ITG grammar
• Allow any foreign word to translate to any English word or to the empty string
– A → f/e; A → f/ε; A → ε/e
• A possible alignment is one that is the result of a synchronous parse of the source and target with the grammar

Example re-ordering with ITG
• Grammar includes A → 1/1; A → 2/2; A → 3/3; A → 4/4
• Can the bracketing ITG generate these sentence pairs? [1,2,3,4] / [1,2,3,4]
– A₁ → [A₂ A₃], A₂ → [A₄ A₅], A₄ → [A₆ A₇], A₆ → 1/1, A₇ → 2/2, A₅ → 3/3, A₃ → 4/4
• Are there other synchronous parses of this sentence pair?
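The ITG reordering constraint can be checked directly: a permutation is generable by the bracketing ITG exactly when it can be recursively split into two blocks whose target positions are each contiguous, kept in order ([ ]) or swapped (⟨ ⟩). A minimal sketch of that check (standard recursive decomposition, not code from the lecture):

```python
from itertools import permutations

def itg_parsable(perm):
    """True if the permutation can be built by a bracketing ITG, i.e. it can
    be recursively split into two blocks, each covering a contiguous range of
    target positions, composed straight ([ ]) or inverted (< >)."""
    if len(perm) <= 1:
        return True
    for k in range(1, len(perm)):
        left, right = perm[:k], perm[k:]
        # both halves must map to contiguous target spans
        if (max(left) - min(left) == len(left) - 1 and
                max(right) - min(right) == len(right) - 1):
            if itg_parsable(left) and itg_parsable(right):
                return True
    return False

perms = list(permutations(range(1, 5)))
ok = [p for p in perms if itg_parsable(p)]
bad = [p for p in perms if not itg_parsable(p)]
print(len(ok), bad)  # 22 parsable; only the two "inside-out" permutations fail
```

Running this over all 24 permutations of 4 words confirms the count quoted in the lecture: 22 are parsable, and the two failures are the inside-out cases (2,4,1,3) and (3,1,4,2).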
[1,2,3,4] / [1,2,3,4]

Example re-ordering with ITG
• Other re-orderings with parses
• A horizontal bar over a rule means the non-terminals are swapped

But some re-orderings are not allowed
• When words move “inside-out”
• 22 out of the 24 permutations of 4 words are parsable by the bracketing ITG

Number of permutations compared to ones parsable by ITG

Applications of ITGs
• Word alignment and translation
• Also string edit distance with moves
• One recent interesting piece of work is Haghighi et al.’s 2009 paper on supervised word alignment with block ITGs
– Aria Haghighi, John Blitzer, John DeNero, and Dan Klein, “Better word alignments with supervised ITG Models”

Comparison of oracle alignment error rate (AER) for different alignment spaces (from Haghighi et al. 2009)
• Space of all alignments, space of 1-to-1 alignments, space of ITG alignments

Block ITG: adding one-to-many alignments (from Haghighi et al. 2009)

Comparison of oracle AER for different alignment spaces (from Haghighi et al. 2009)

Alignment performance using a discriminative model (from Haghighi et al. 2009)

Training for maximum likelihood
• So far, results were with MIRA
– Requires only finding the best alignment under the model
– Efficient under 1-to-1 and ITG models
• If we want to train for maximum likelihood according to a log-linear model
– Requires summing over all possible alignments
– This is tractable in ITGs (we will discuss bitext parsing in a bit)
– One of the big advantages of ITGs

MIRA versus maximum likelihood training

(Slides from David Chiang, ACL 2006 tutorial)

David Chiang, ISI/USC
HIERARCHICAL PHRASE-BASED TRANSLATION

Hierarchical phrase-based translation overview
• Motivation
• Extracting rules
• Scoring derivations
• Decoding without an LM
• Decoding with an LM

Motivation
• Review of phrase-based models
– Segment the input into a sequence of phrases
– Translate each phrase
– Re-order phrases depending on distortion and perhaps the lexical content of the phrases
• Properties of phrase-based models
– Local re-ordering is captured within phrases for frequently occurring groups of words
– Global re-ordering is not modeled well
– Only contiguous translations are learned

Chinese-English example
• “Australia is one of the few countries that have diplomatic relations with North Korea.”
• Output from a phrase-based system:
– Captured some reordering through phrase translation and phrase re-ordering
– Did not re-order the relative clause and the noun phrase

Idea: hierarchical phrases
• ⟨yu X₁ you X₂, have X₂ with X₁⟩
• The variables stand for corresponding hierarchical phrases
• Captures the fact that PP phrases tend to come before the verb in Chinese and after the verb in English
• Serves as both a discontinuous phrase pair and a re-ordering rule

Other example hierarchical phrases
• ⟨X₁ de X₂, the X₂ that X₁⟩
– Chinese relative clauses modify NPs on the left; English relative clauses modify NPs on the right
• ⟨X₁ zhiyi, one of X₁⟩

A synchronous CFG for the example
Only one non-terminal X plus the start symbol S is used
• X → ⟨yu X₁ you X₂, have X₂ with X₁⟩
• X → ⟨X₁ de X₂, the X₂ that X₁⟩
• X → ⟨X₁ zhiyi, one of X₁⟩
• X → ⟨Aozhou, Australia⟩
• X → ⟨Beihan, North Korea⟩
• X → ⟨shi, is⟩
• X → ⟨bangjiao, diplomatic relations⟩
• X → ⟨shaoshu guojia, few countries⟩
• S → ⟨S₁ X₂, S₁ X₂⟩ [glue rule]
• S → ⟨X₁, X₁⟩

General approach
• Align parallel training data using word-alignment models (e.g.
GIZA++)
• Extract hierarchical phrase pairs
– These can be represented as SCFG rules
• Assign probabilities (scores) to rules
– As in log-linear models for phrase-based MT, we can define various features on rules to come up with rule scores
• Translate new sentences
– Parse with the SCFG grammar
– Integrate a language model

Example derivation

Extracting hierarchical phrases
• Start with contiguous phrase pairs, as in phrasal SMT models (called initial phrase pairs)
• Make rules for these phrase pairs and add them to the rule set extracted from this sentence pair

Extracting hierarchical phrase pairs
• For every rule of the sentence pair
– For every initial phrase pair contained in it
• Replace the initial phrase pair by a non-terminal
• Extract a new rule

Another example: a hierarchical phrase vs. traditional phrases

Constraining the grammar rules
• This method generates too many phrase pairs and leads to spurious ambiguity
– Place constraints on the set of allowable rules for robustness/speed

Adding glue rules
• For continuity with phrase-based models, add glue rules that can split the source into phrases and translate each:
– S → ⟨S₁ X₂, S₁ X₂⟩
– S → ⟨X₁, X₁⟩
• Question: if we only have conventional phrase pairs and these two rules, what system do we have?
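The example derivation mentioned above can be simulated with the toy grammar from the earlier slide. This is a minimal sketch, not the slides' code: the rule labels (PP, REL, ONE, GLUE) are hypothetical names for four of the grammar's rules, the romanizations follow the slides, and integers in a rule template mark the linked non-terminal slots shared by the two sides.

```python
# Rule templates: (source side, target side); an integer is a slot that
# is filled by the same child subtree on both sides.
RULES = {
    "PP":   (("yu", 0, "you", 1), ("have", 1, "with", 0)),
    "REL":  ((0, "de", 1), ("the", 1, "that", 0)),
    "ONE":  ((0, "zhiyi"), ("one", "of", 0)),
    "GLUE": ((0, 1), (0, 1)),          # S -> <S1 X2, S1 X2>
}
LEX = {  # terminal rules X -> <f, e>
    "Aozhou": "Australia", "shi": "is", "Beihan": "North Korea",
    "bangjiao": "diplomatic relations", "shaoshu guojia": "few countries",
}

def realize(node):
    """Return (source string, target string) for a derivation tree."""
    if isinstance(node, str):          # lexical rule
        return node, LEX[node]
    kids = [realize(c) for c in node[1:]]
    src_t, tgt_t = RULES[node[0]]
    src = " ".join(kids[s][0] if isinstance(s, int) else s for s in src_t)
    tgt = " ".join(kids[s][1] if isinstance(s, int) else s for s in tgt_t)
    return src, tgt

# one derivation for the lecture's Chinese-English example
tree = ("GLUE", ("GLUE", "Aozhou", "shi"),
        ("ONE", ("REL", ("PP", "Beihan", "bangjiao"), "shaoshu guojia")))
src, tgt = realize(tree)
```

Walking the tree bottom-up reproduces both sentences at once, with the PP and REL rules performing the long-distance re-ordering that a phrase-based system missed.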
• Question: what do we get if we also add these rules?
– X → ⟨X₁ X₂, X₁ X₂⟩
– X → ⟨X₁ X₂, X₂ X₁⟩

Assigning scores to derivations
• A derivation is a parse tree for the source and target sentences
• As in phrase-based models, we choose the derivation that maximizes the score
– The derivation corresponds to a target sentence, which is returned as the translation
– There are multiple derivations of a target sentence, but we do not sum over them; we approximate with max, as in phrase-based models
• P(D) ∝ ∏_i φ_i(D)^{λ_i}
• Feature functions on rules, plus a language model feature:
– P(D) ∝ P_LM(e)^{λ_LM} ∏_{i≠LM} ∏_{(X→⟨γ,α⟩)∈D} φ_i(X→⟨γ,α⟩)^{λ_i}

Assigning scores to derivations (continued)
• Except for the language model, the score can be represented as a weight according to a weighted SCFG:
– w(D) = ∏_{(X→⟨γ,α⟩)∈D} w(X→⟨γ,α⟩)
– w(X→⟨γ,α⟩) = ∏_{i≠LM} φ_i(X→⟨γ,α⟩)^{λ_i}
– P(D) ∝ P_LM(e)^{λ_LM} × w(D)
• Features
– Rule probabilities in two directions, P(γ|α) and P(α|γ)
– Lexical weighting in two directions, P_lex(γ|α) and P_lex(α|γ)
– Exp of the rule count, word count, and glue-rule count

Estimating feature values and feature weights
• Estimating translation probabilities in two directions
– As in phrase-based models, heuristic estimation using relative frequencies of rule counts
– A count of one from every sentence for each initial phrase pair
– For each initial phrase pair in a sentence, a fractional (equal) count for each rule obtained by subtracting sub-phrases
– Relative frequency of these counts:
  P(γ|α) = c(X→⟨γ,α⟩) / Σ_{γ′} c(X→⟨γ′,α⟩)
• Estimating the values of the parameters λ
– Minimum error rate training, as in phrase-based models: maximize the BLEU score on a development set

Finding the best translation: decoding
• In the absence of a language model feature, the scores look like this:
– w(D) = ∏_{(X→⟨γ,α⟩)∈D} w(X→⟨γ,α⟩)
• This can be represented by a weighted SCFG
• Parse the source sentence using the source side of the grammar with CKY (cubic time in sentence length)
• Read off the target sentence assembled from the corresponding target derivation

Finding the best translation including an LM
• Method 1:
generate the k-best derivations without the LM, then rescore with the LM
– May need an extremely large k-best list to reach the highest-scoring derivation once the LM is included
• Method 2: integrate the LM into the grammar by intersecting the target side of the SCFG with the LM: a very large expansion of the rule set, with time O(n³ |T|^{4(m−1)}) for an m-gram LM over target vocabulary T
• Method 3: integrate the LM while parsing with the SCFG, using cube pruning to generate k-best LM-scored translations at each span

Comparison to a phrase-based system

Summary
• Described hierarchical phrase-based translation
– Uses hierarchical rules encoding phrase re-ordering and discontinuous lexical correspondence
– Rules include traditional contiguous phrase pairs
– Can translate efficiently without an LM using SCFG parsing
– Outperforms phrase-based models for several languages
• Hiero is implemented in Moses and Joshua

References
• David Chiang. Hierarchical phrase-based translation. Computational Linguistics, 2007.
• David Chiang. An introduction to synchronous grammars. Notes and slides from an ACL 2006 tutorial.
• Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 1997.
• A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. Better word alignments with supervised ITG models. ACL 2009.
• Many other interesting papers use ITGs and extensions to Hiero; some will be added to the web page.
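Method 3 above depends on cube pruning. This is a minimal sketch of its core idea only, not Chiang's implementation: given two score-sorted candidate lists (e.g. hypotheses for two adjacent spans), enumerate combinations best-first with a heap instead of scoring all of them. The enumeration is exact when the combined score is monotonic in both inputs; once non-local LM scores break monotonicity, the same procedure becomes the approximate pruning used in decoding.

```python
import heapq

def cube_top_k(a, b, combine, k):
    """Lazily pop the k best combinations of two descending-sorted score
    lists without scoring all len(a) * len(b) pairs."""
    heap = [(-combine(a[0], b[0]), 0, 0)]   # max-heap via negated scores
    seen = {(0, 0)}
    best = []
    while heap and len(best) < k:
        neg, i, j = heapq.heappop(heap)
        best.append(-neg)
        for ni, nj in ((i + 1, j), (i, j + 1)):   # frontier neighbors
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-combine(a[ni], b[nj]), ni, nj))
    return best

# e.g. combining two hypothesis lists by adding log-probabilities
print(cube_top_k([-1.0, -2.0, -5.0], [-0.5, -3.0],
                 lambda x, y: x + y, 3))   # [-1.5, -2.5, -4.0]
```

Only a frontier of candidate pairs is ever on the heap, which is what keeps LM-integrated parsing affordable at each CKY span.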