(a) Finding parallel data, and (b) Introduction to synchronous grammars
LING 575 Lecture 6, part two
Chris Quirk
With materials borrowed from Kristina Toutanova, David Chiang, Philipp Koehn, Dekai Wu, and others…

Learning curves (Koehn, Och, and Marcu 2003)
• Parameters and algorithms matter…
• …but there’s no data like more data

Where did the initial SMT data come from?
• Two papers at ACL 1991, both on parallel sentence extraction
– Gale and Church at AT&T
– Brown, Lai, and Mercer at IBM
– Both described a very similar method for finding parallel sentence pairs in the Hansards, a parallel corpus of parliamentary proceedings
• Parliamentary discussion, 1973–1986
– English: 85M tokens / 3.5M sentences
– French: 98M tokens / 3.7M sentences

Sample parliamentary discussion

Alignment model
• Treat each document as a sequence of sentences & paragraph markers ¶
• Generate a sequence of beads
– Find a sequence of beads that generates both the French and English corpora
– 8 bead types
– Learn a prior over each bead type, and a distribution over sentence lengths given a bead type

Length parameters
• Probability distributions over sentence lengths Pr(ℓe) and Pr(ℓf)
– For ℓ ≤ 80, we can use the empirical distribution
– For ℓ > 80, fit a Poisson distribution to the tail
• Conditional distribution over joint lengths:
– Pr(ℓf | ℓe) ∝ exp(−(r − μ)² / (2σ²))

Distribution over beads
• Bead probabilities:
– Pr(ei) = Pr(e) × Pr(ℓei)
– Pr(fj) = Pr(f) × Pr(ℓfj)
– Pr(ei fj) = Pr(ef) × Pr(ℓei) × Pr(ℓfj | ℓei)
– Pr(ei ei+1 fj) = Pr(eef) × Pr(ℓei) × Pr(ℓei+1) × Pr(ℓfj | ℓei + ℓei+1)
– Pr(ei fj fj+1) = Pr(eff) × Pr(ℓei) × Pr(ℓfj + ℓfj+1 | ℓei)
– plus the paragraph beads Pr(¶e), Pr(¶f), and Pr(¶e ¶f)

Evaluation
• Error rate: 3.2% w/o paragraphs
• Drops to 2.0% w/ paragraphs
• Drops to 0.9% w/ “anchor points”
– Coarse-to-fine alignment; the coarse pass is based on speaker changes

Weaknesses and next steps
• Each document is just a sequence of lengths and paragraph markers
– Good enough for very parallel data
– More and more dangerous as the data becomes less parallel
• In addition, we can model lexical items
– For instance, Moore 2002: Fast and
accurate sentence alignment of bilingual corpora
– Problems: how do we model, where do we estimate parameters, and what about the search space?

How to model lexical correspondences?
• Moore’s approach: just use Model 1
– P(t | s) = (ε / (l+1)^m) ∏_{j=1}^{m} Σ_{i=0}^{l} t(tj | si)
• Prune parameters
– Keep only the top 5K vocabulary items, tokens must occur at least twice, fold everything else into an UNK class
– Minimizes the parameter set with minimal impact on quality

How do we estimate parameters?
• Could use a seed parallel corpus
• Here, use the output of a length-based aligner
– Find beads with a very high posterior probability
– Train Model 1 on these beads
• Upside: no need for resources to start the process
• Downside: the length-based aligner must be a reasonable starting point

Evaluation
• English-Spanish technical docs

More parallel documents?
• Existing sources?
• Find bitexts on the web
• Identify parallelism using
– Document structure
– Document content
• Exploit search engines, the Internet Archive
• Evaluate intrinsically

STRAND (Resnik and Smith, 2003)
• Find local pages that might have translations
• Generate potential page pairs
• Filter pairs based on structural criteria

Finding pages with translations
• Look for
– Parent pages (e.g. a page with outgoing links labeled English/Anglais and French/Français)
– Sibling pages (e.g. a French page with an English/Anglais outgoing link)
• Can optionally spider out from there
– A domain with one parallel page pair is likely to have many more

Extracting handles from URLs

Generating candidate pairs
• Need to match up pages
– Siblings are easy; crawled document pairs are more difficult
• Start with a substitution list {en_us -> fr_fr}
  For each English document URL e
    For each substitution mapping s matching e
      f = s applied to e
      If f is a URL for a French document
        Propose pair <e, f>
• Can match other ways, e.g.
by doc length

Structural filtering
• Parallel documents often have parallel structure

Comparing structure
• Align using diff – the worst case is worse than O(nm); the common case is about 10x faster

Alignment quality

Quality of extracted data
• Ask two humans to rate the adequacy of
– 30 good sentence pairs (from NIST training data)
– 30 web-extracted sentence pairs
– 30 Chinese sentences with MT English output
• Scale: ratings

Recent work
• Large Scale Parallel Document Mining for Machine Translation
– Jakob Uszkoreit, Jay M. Ponte, Ashok C. Popat, and Moshe Dubiner
– Translate as much content as possible from the native language into English
– Perform approximate duplicate detection

What about imperfect data?
• Limited amounts of truly parallel data
• We can learn a lot from “comparable” corpora
– Lexicon entries (e.g. Rapp 1999, Fung 2000)
– Sentence pairs (e.g. Zhang and Vogel 2002)
– Named entity translations (e.g. Al-Onaizan and Knight 2002, Klementiev and Roth 2006)
– Fragment pairs (e.g. Munteanu and Marcu 2006)

Bird Brain Dies After Years of Research / ¿Qué le ha pasado a Alex?
WALTHAM, Mass. (AP) — Alex, a parrot that could count to six, identify colors and even express frustration with repetitive scientific trials, has died after 30 years of helping researchers better understand the avian brain. The cause of Alex's death was unknown. The African grey parrot's average life span is 50 years, Brandeis University scientist Irene Pepperberg said. Alex was discovered dead in his cage Friday, she said, but she waited to release the news until this week so grieving researchers could get over the shock and talk about it. "It's devastating to lose an individual you've worked with pretty much every day for 30 years," Pepperberg told The Boston Globe. "Someone was working with him eight to 12 hours every day of his life." Alex's advanced language and recognition skills revolutionized the understanding of the avian brain.
After Pepperberg bought Alex from an animal shop in 1973, the parrot learned enough English to identify 50 objects, seven colors and five shapes. He could count up to six, including zero, was able to express desires, including his frustration with the repetitive research. He also occasionally instructed two other parrots at the lab to "talk better" if they mumbled, though it wasn't clear whether he was simply mimicking researchers. Alex hadn't reached his full cognitive potential and was demonstrating the ability to take distinct sounds from words he knew and combine them to form new words, Pepperberg said. Just last month, he pronounced the word "seven" for the first time. … Washington. (EFE y Redacción).- Alex, un loro africano que podía diferenciar colores y cuya inteligencia maravilló a los científicos durante más de 30 años, fue encontrado muerto en el laboratorio de la Universidad de Brandeis, en el estado de Massachusetts. Un comunicado de la universidad señaló hoy que Alex, un ejemplar gris comprado para estudiar el cerebro de las aves en 1977, podía diferenciar 50 objetos, distinguía siete colores y formas. Además, podía contar hasta seis y expresaba deseos y hasta frustración cuando las pruebas científicas eran demasiado repetidas. También decía "hablen bien" cuando los otros dos loros del laboratorio, Griffin, de 12 años, y Arthur, de 8, pronunciaban mal las palabras que habían aprendido. Según la universidad, su desarrollo era similar al de un niño de 2 años e intelectualmente, tenía el cerebro de uno de 5. "Es devastador perder a un individuo con el cual una ha trabajado todos los días durante 30 años", dijo Irene Pepperberg, científico de la Universidad de Brandeis. Se calcula que la media de vida de un loro es alrededor de 50 años, aunque pueden alcanzar los 100. Alex pertenecía a la variante de los 'yacos', los loros más inteligentes de su especie. 
Pepperberg agregó que Alex fue encontrado muerto en su jaula el pasado viernes y que se desconocen las causas de su deceso. La investigadora informó que lo vio con vida el jueves pasado cuando se despidió de él diciéndole: "Sé bueno. Te quiero. Nos vemos mañana". El loro le respondió: "Mañana estás aquí". …

Example aligned fragments from the article pair:

Además, podía contar hasta seis y expresaba deseos y hasta frustración cuando las pruebas científicas eran demasiado repetidas.
He could count up to six, including zero, was able to express desires, including his frustration with the repetitive research.

"Es devastador perder a un individuo con el cual una ha trabajado todos los días durante 30 años", dijo Irene Pepperberg, científico de la Universidad de Brandeis.
"It's devastating to lose an individual you've worked with pretty much every day for 30 years," Pepperberg told The Boston Globe.

Fundamental problem
• Given sentence pairs with some content in common, identify the fragment alignment
"It's devastating to lose an individual you've worked with pretty much every day for 30 years," Pepperberg told The Boston Globe.
"Es devastador perder a un individuo con el cual una ha trabajado todos los días durante 30 años", dijo Irene Pepperberg, científico de la Universidad de Brandeis.
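One simple attack on this fragment-alignment problem is a signal-style heuristic: score each source word by whether it has a translation partner in the other sentence, smooth the scores, and retain positive runs. The sketch below is illustrative only; the toy lexicon, window size, and minimum run length are invented, not taken from the lecture.

```python
# Toy sketch: detect a candidate "parallel" fragment inside a noisy sentence pair.
# Words with a partner in the other sentence score +1, others -1; the signal is
# smoothed with a centered moving average, and strictly positive runs of at
# least `min_len` words are kept. The tiny lexicon is invented for illustration.

LEXICON = {("fraude", "fraud"), ("es", "is"), ("normal", "normal"),
           ("en", "in"), ("la", "the"),
           ("republica", "republic"), ("dominicana", "dominican")}

def word_score(src_word, tgt_words):
    return 1.0 if any((src_word, t) in LEXICON for t in tgt_words) else -1.0

def find_fragments(src_words, tgt_words, window=3, min_len=3):
    scores = [word_score(w, tgt_words) for w in src_words]
    smoothed = []
    for i in range(len(scores)):
        lo, hi = max(0, i - window // 2), min(len(scores), i + window // 2 + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    fragments, run = [], []
    for w, s in zip(src_words, smoothed):
        if s > 0:
            run.append(w)
        else:
            if len(run) >= min_len:
                fragments.append(run)
            run = []
    if len(run) >= min_len:
        fragments.append(run)
    return fragments

src = "el fraude es normal en la republica dominicana dijo bosch".split()
tgt = "fraud is normal in the dominican republic he said".split()
print(find_fragments(src, tgt))
```

Note how the smoothing bridges isolated negative words inside a mostly-matching span, while short accidental matches are filtered by the run-length threshold.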
Finding promising sentence pairs in news corpora (Munteanu & Marcu 2006)
• Pipeline: a seed parallel corpus is word-aligned to train word alignment models; source- and target-language documents are indexed (inverted index over words); similar document pairs are found using cross-language IR; promising document pairs are filtered by length and vocabulary to yield promising sentence pairs

Munteanu and Marcu’s approach
• Motivated by signal processing
– Pick one language, assign each word a score in [-1,+1] based on its best-scoring partner word in the other language
– Smooth the signal with a moving average
– Retain strictly positive sequences of ≥ 3 words
– Apply to each side independently
• Comments
– Selection is independent: the English translation of a Spanish word may not meet the English filtering criteria
– Spans are simply concatenated
– No notion of location

A probabilistic interpretation
• Insight: fragments S and T are parallel iff P(S, T) > P(S) ∙ P(T)
… iff P(T | S) > P(T)
… iff P(S | T) > P(S)
• Present two generative models of comparable sentence pairs that capture this insight
• Hidden alignments identify fragment correspondences in noisy sentence pairs
– Selection is no longer independent
– Position in the other sentence matters
• Evaluate in terms of end-to-end MT (BLEU)

Comparable Model B (Quirk and Udupa, 2007)
• Joint, generative model of source and target fragments
– Decide the number of fragments
– Each fragment generates source and target words to be appended (one side, but not both, may be empty)
– Fragment alignment is monotonic
• Requires source + target n-gram LMs, conditional HMM models of S|T (and vice versa)
• Monolingual fragment score: P(S) (or P(T))
• Bilingual fragment score: min { P(S) ∙ P(T|S), P(T) ∙ P(S|T) }

Example: “El fraude es lo normal en la República Dominicana en la actual situación”, afirmó Bosch / “Fraud is normal in the Dominican Republic”, he said to reporters
– Bilingual fragment score: min { P(El fraude…) ∙ P(Fraud is… | El fraude…), P(Fraud is…) ∙ P(El fraude… | Fraud is…) }
– Monolingual fragment score, e.g. P(en la actual situacion | … Republica Dominicana)

Search procedure
• Monotone 0th-order model: dynamic programming
– δ[j, k] := best fragment alignment of the first j source and k target words
– δ[j, k] = max { δ[j′, k′] ∙ P(Sj′…Sj, Tk′…Tk) : 0 ≤ j′ < j, 0 ≤ k′ < k }
• Exact, but expensive: O(n⁶)
• Beam search provides a significant speedup
– Model 1 scores prune bilingual spans
– Bilingual fragment size limitations reduce the search space

Data
• Parallel data: Spanish-English Europarl, WMT 2006
– Provided tokenization, lowercased
– GIZA++ word alignment (1⁵H⁵4⁵); grow-diag-final
• Comparable data: Spanish and English Gigaword corpora
– Same tokenization as above, lowercased
– Spanish: 2.2M docs, 20M sents, 690M words
– English: 3.5M docs, 49M sents, 1.8B words
• First-pass extraction stats
– Low recall: 27M doc pairs, 2.6M promising sentence pairs
– High recall: 28M doc pairs, 84M promising sentence pairs

Fragment extraction
• Extract fragments from the Spanish and English Gigaword corpora using three approaches:
– MM: reimplementation of Munteanu and Marcu 2006
– A(e|s): conditional model, one direction only
– B: joint model of Spanish and English
• Word alignment models, language models, and other models for MM trained on Europarl data only

Spanish-English BLEU scores
• [Bar chart: Europarl baseline vs. systems adding news and web data extracted by MM, A(e|s), and B; BLEU values range roughly from 19 to 29]

What about Wikipedia?
• Available in many languages
• Same-topic articles connected via “interwiki links”
– English: 3,000,000+ articles
– English-Spanish pairs: 278,000+
– English-Bulgarian pairs: 50,000+
– English-German pairs: 477,000+
– English-Dutch pairs: 356,000+
• Goal: find parallel sentences, improve the MT system

Baseline Model: binary classifier on sentence pairs

Baseline model
• Binary classifier:
– (S, T) = true for true sentence pairs <S, T>
– (S, T′) = (S′, T) = false for all other pairs
• Severe class imbalance problem
– We get 𝒪(n) positive examples and 𝒪(n²) negative examples
– The classifier needs a strong push to predict positive
• One solution: a ranking model
– Train a scoring model σ(S, T) so that for a given pair (S, T), σ(S, T) > σ(S, T′) for all other target sentences T′ in the same document

Ranking Model Features
• We define the following features on both Model 1 and HMM word alignments:
– Log probability
– Number of aligned/unaligned words
– Longest aligned/unaligned spans
– Number of words with fertility 1, 2, and 3+
• Length feature (Moore 2002): log Poisson(|T|; |S|·r)
• Difference in relative sentence position

Wikipedia Features
• Number of hyperlinks pointing to equivalent articles (determined by the “interwiki links”)
• Image feature: fires on sentences which are parts of captions of the same image
• List feature: fires when both sentences are part of a list

Sequence Model
• We extend the ranking model to include features based on the previous alignment
• These features are based on the positions of the aligned target sentences

Experiments
• We annotated 20 Wikipedia article pairs in
1. Spanish – English
2. Bulgarian – English
3.
German – English
• Sentence pairs were given a quality rating:
– 1: Some phrases are parallel
– 2: Mostly parallel, with some missing words
– 3: Very high quality, likely translated by a bilingual user
• Ratings 2 and 3 were considered correct in our experiments

BLEU results

INTRODUCTION TO SYNCHRONOUS GRAMMARS

Why do we parse natural language?
1. To find a deeper or alternate interpretation
– A translation task: find the target-language meaning given the source-language representation
2. To check for grammaticality / well-formedness
– Also a translation task, but an argument for using parsing on the target side

Overview
• Motivation
– Examples of reordering/translation phenomena
• Synchronous context-free grammar
– Example derivations
– ITG grammars
– Reordering for ITG grammars
• Applications of bracketing ITG grammars
– Applications: ITGs for word alignment
• Hierarchical phrase-based translation with Hiero
– Rule extraction
– Model features
• Decoding for SCFGs and integrating a LM

Motivation for tree-based translation
• Phrases capture contextual translation and local reordering surprisingly well
• However, this information is brittle:
– “author of the book 本書的作者” tells us nothing about how to translate “author of the pamphlet” or “author of the play”
– The Chinese phrase “NOUN1 的 NOUN2” becomes “NOUN2 of NOUN1” in English
• There are general principles a phrase-based system is not using
– Some languages have adjectives before the nouns, some after
– Some languages place prepositions before nouns, some after
– Some languages put PPs before the head, others after
– Some languages place relative clauses before the head, others after
• Discontinuous translations are not handled well by phrase-based systems
– ne … pas in French
– Separable prefixes in German
– Split constructions in Chinese

Types of tree-based systems
• Formally syntax-based
– Use the notion of a grammar, but without linguistically motivated annotations (no
nouns, verbs, etc.)
– Model the hierarchical nature of language
– Examples: phrase-based ITGs and Hiero (we will focus on these in this lecture)
• Linguistically syntax-based
– Use parse information derived from a parser (rule-based, treebank-trained, etc.)
– Could be source- or target-side parsing
– Phrase structure trees, dependency trees

SYNCHRONOUS CONTEXT-FREE GRAMMARS

Review: context-free grammars
• A CFG is a tuple:
– Terminal set Σ (e.g. “the”, “man”, “left”)
– Nonterminal set N (e.g. “NP”, “VP”)
– Rule set R; each rule has a parent symbol from N and a production or yield drawn from (N ∪ Σ)* (e.g. “S -> NP VP”, “DT -> the”)
– A top symbol S ∈ N
• We can both parse and generate with this
– Parse: start with a terminal sequence, find substrings that match the yield of a rule and replace them with the parent symbol; repeat until we reach the top symbol
– Generate: start with the top symbol, pick a non-terminal and replace it with its yield, until we have only terminal symbols in the sequence

Synchronous context-free grammars
• A generalization of context-free grammars
(Slides from David Chiang, ACL 2006 tutorial: context-free grammars, with an example in Japanese; synchronous CFGs)

Rules with probabilities
• Joint probability of the source and target language re-writes, given the non-terminal on the left
• Could also use the conditional probability of target given source, or of source given target.
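A tiny weighted SCFG makes these rule probabilities concrete. The grammar, weights, and words below are invented for illustration (a toy Japanese-to-English fragment in the spirit of the tutorial example); co-indexed nonterminals on the two sides are expanded together, and a derivation's probability is the product of its rule probabilities.

```python
# Minimal weighted SCFG sketch. Rules and probabilities are invented for
# illustration. A nonterminal occurrence is a (symbol, index) tuple; the same
# index on the source and target yields marks the linked occurrence.
RULES = {
    "S":   [((("NP", 1), ("VP", 2)), (("NP", 1), ("VP", 2)), 1.0)],
    "NP":  [(("watashi", "wa"), ("i",), 1.0)],
    # verb-final source order paired with verb-initial target order
    "VP":  [((("OBJ", 1), ("V", 2)), (("V", 2), ("OBJ", 1)), 1.0)],
    "OBJ": [(("hako", "wo"), ("the", "box"), 0.7),
            (("doa", "wo"), ("the", "door"), 0.3)],
    "V":   [(("akemasu",), ("open",), 1.0)],
}

def best_derivation(symbol):
    """Expand `symbol` greedily with its highest-probability rule,
    returning (source words, target words, derivation probability)."""
    src_yield, tgt_yield, prob = max(RULES[symbol], key=lambda r: r[2])
    linked = {}                      # index -> target words of that subtree
    src_out, tgt_out = [], []
    for item in src_yield:
        if isinstance(item, tuple):  # co-indexed nonterminal
            sub_src, sub_tgt, sub_p = best_derivation(item[0])
            linked[item[1]] = sub_tgt
            src_out.extend(sub_src)
            prob *= sub_p
        else:                        # terminal
            src_out.append(item)
    for item in tgt_yield:
        if isinstance(item, tuple):  # emit the linked subtree's target words
            tgt_out.extend(linked[item[1]])
        else:
            tgt_out.append(item)
    return src_out, tgt_out, prob

src, tgt, p = best_derivation("S")
print(" ".join(src), "->", " ".join(tgt), p)
```

The VP rule is where the synchrony matters: the two sides share the same children, but the target yield lists them in a different order, so reordering falls out of the grammar rather than a separate distortion model.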
Synchronous CFGs (slide from David Chiang, ACL 2006 tutorial)

Chinese-English example
• English parse: (S (NP I) (VP ate (PP at (NP the restaurant))))
• Synchronous tree: wo/I, zai/at, fan dien/restaurant, chi fan/ate, yielding “I ate at the restaurant”

Stochastic Inversion Transduction Grammars (Wu, 1997)
• First use of SCFGs in statistical MT
• Restricted form of SCFGs: only concatenate or swap adjacent non-terminals
– X → ⟨Y₁ Z₂, Y₁ Z₂⟩, written X → [Y Z]
– X → ⟨Y₁ Z₂, Z₂ Y₁⟩, written X → ⟨Y Z⟩
• At the lowest level, generate words or word pairs
– X → e/f; X → ε/f; X → e/ε

Even more restricted: bracketing ITG grammars
• A minimal number of non-terminal symbols
• Does not capture linguistic syntax, but can be used to explain word alignment and translation
– A → ⟨A₁ A₂, A₁ A₂⟩, i.e. A → [A A]
– A → ⟨A₁ A₂, A₂ A₁⟩, i.e. A → ⟨A A⟩
– A → x/y; A → x/ε; A → ε/y
• Can be extended to allow direct generation of one-to-many or many-to-many blocks (Block ITG): A → x̄/ȳ for phrases x̄, ȳ

Reordering in a bracketing ITG grammar
• Because of the assumption of hierarchical movement of contiguous sequences, the space of possible word alignments between sentence pairs is limited
• Assume we start with a bracketing ITG grammar
• Allow any foreign word to translate to any English word or to the empty string
– A → f/e; A → f/ε; A → ε/e
• A possible alignment is one that is the result of a synchronous parse of the source and target with the grammar

Example re-ordering with ITG
• Grammar includes A → 1/1; A → 2/2; A → 3/3; A → 4/4
• Can the bracketing ITG generate these sentence pairs? [1,2,3,4] / [1,2,3,4]
– A₁ → [A₂ A₃], A₂ → [A₄ A₅], A₄ → [A₆ A₇], A₆ → 1/1, A₇ → 2/2, A₅ → 3/3, A₃ → 4/4
• Are there other synchronous parses of this sentence pair?
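The ITG reordering constraint can be checked directly: a permutation is generable by the bracketing ITG exactly when it can be recursively split into two blocks whose target positions are each contiguous, kept in order ([ ]) or swapped (⟨ ⟩). A minimal sketch of that check (standard recursive decomposition, not code from the lecture):

```python
from itertools import permutations

def itg_parsable(perm):
    """True if the permutation can be built by a bracketing ITG, i.e. it can
    be recursively split into two blocks, each covering a contiguous range of
    target positions, composed straight ([ ]) or inverted (< >)."""
    if len(perm) <= 1:
        return True
    for k in range(1, len(perm)):
        left, right = perm[:k], perm[k:]
        # both halves must map to contiguous target spans
        if (max(left) - min(left) == len(left) - 1 and
                max(right) - min(right) == len(right) - 1):
            if itg_parsable(left) and itg_parsable(right):
                return True
    return False

perms = list(permutations(range(1, 5)))
ok = [p for p in perms if itg_parsable(p)]
bad = [p for p in perms if not itg_parsable(p)]
print(len(ok), bad)  # 22 parsable; only the two "inside-out" permutations fail
```

Running this over all 24 permutations of 4 words confirms the count quoted in the lecture: 22 are parsable, and the two failures are the inside-out cases (2,4,1,3) and (3,1,4,2).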
[1,2,3,4] / [1,2,3,4]

Example re-ordering with ITG
• Other re-orderings with parses
• A horizontal bar over a rule means the non-terminals are swapped

But some re-orderings are not allowed
• When words move “inside-out”
• 22 out of the 24 permutations of 4 words are parsable by the bracketing ITG

Number of permutations compared to ones parsable by ITG

Applications of ITGs
• Word alignment and translation
• Also string edit distance with moves
• One recent interesting piece of work is Haghighi et al.’s 2009 paper on supervised word alignment with block ITGs
– Aria Haghighi, John Blitzer, John DeNero, and Dan Klein, “Better word alignments with supervised ITG Models”

Comparison of oracle alignment error rate (AER) for different alignment spaces (from Haghighi et al. 2009)
• Space of all alignments, space of 1-to-1 alignments, space of ITG alignments

Block ITG: adding one-to-many alignments (from Haghighi et al. 2009)

Comparison of oracle AER for different alignment spaces (from Haghighi et al. 2009)

Alignment performance using a discriminative model (from Haghighi et al. 2009)

Training for maximum likelihood
• So far, results were with MIRA
– Requires only finding the best alignment under the model
– Efficient under 1-to-1 and ITG models
• If we want to train for maximum likelihood according to a log-linear model
– Requires summing over all possible alignments
– This is tractable in ITGs (we will discuss bitext parsing in a bit)
– One of the big advantages of ITGs

MIRA versus maximum likelihood training

(Slides from David Chiang, ACL 2006 tutorial)

David Chiang, ISI/USC
HIERARCHICAL PHRASE-BASED TRANSLATION

Hierarchical phrase-based translation overview
• Motivation
• Extracting rules
• Scoring derivations
• Decoding without an LM
• Decoding with an LM

Motivation
• Review of phrase-based models
– Segment the input into a sequence of phrases
– Translate each phrase
– Re-order phrases depending on distortion and perhaps the lexical content of the phrases
• Properties of phrase-based models
– Local re-ordering is captured within phrases for frequently occurring groups of words
– Global re-ordering is not modeled well
– Only contiguous translations are learned

Chinese-English example
• “Australia is one of the few countries that have diplomatic relations with North Korea.”
• Output from a phrase-based system:
– Captured some reordering through phrase translation and phrase re-ordering
– Did not re-order the relative clause and the noun phrase

Idea: hierarchical phrases
• ⟨yu X₁ you X₂, have X₂ with X₁⟩
• The variables stand for corresponding hierarchical phrases
• Captures the fact that PP phrases tend to come before the verb in Chinese and after the verb in English
• Serves as both a discontinuous phrase pair and a re-ordering rule

Other example hierarchical phrases
• ⟨X₁ de X₂, the X₂ that X₁⟩
– Chinese relative clauses modify NPs on the left; English relative clauses modify NPs on the right
• ⟨X₁ zhiyi, one of X₁⟩

A synchronous CFG for the example
Only one non-terminal X plus the start symbol S is used
• X → ⟨yu X₁ you X₂, have X₂ with X₁⟩
• X → ⟨X₁ de X₂, the X₂ that X₁⟩
• X → ⟨X₁ zhiyi, one of X₁⟩
• X → ⟨Aozhou, Australia⟩
• X → ⟨Beihan, North Korea⟩
• X → ⟨shi, is⟩
• X → ⟨bangjiao, diplomatic relations⟩
• X → ⟨shaoshu guojia, few countries⟩
• S → ⟨S₁ X₂, S₁ X₂⟩ [glue rule]
• S → ⟨X₁, X₁⟩

General approach
• Align parallel training data using word-alignment models (e.g.
GIZA++)
• Extract hierarchical phrase pairs
– These can be represented as SCFG rules
• Assign probabilities (scores) to rules
– As in log-linear models for phrase-based MT, we can define various features on rules to come up with rule scores
• Translate new sentences
– Parse with the SCFG grammar
– Integrate a language model

Example derivation

Extracting hierarchical phrases
• Start with contiguous phrase pairs, as in phrasal SMT models (called initial phrase pairs)
• Make rules for these phrase pairs and add them to the rule set extracted from this sentence pair

Extracting hierarchical phrase pairs
• For every rule of the sentence pair
– For every initial phrase pair contained in it
• Replace the initial phrase pair by a non-terminal
• Extract a new rule

Another example: a hierarchical phrase vs. traditional phrases

Constraining the grammar rules
• This method generates too many phrase pairs and leads to spurious ambiguity
– Place constraints on the set of allowable rules for robustness/speed

Adding glue rules
• For continuity with phrase-based models, add glue rules that can split the source into phrases and translate each:
– S → ⟨S₁ X₂, S₁ X₂⟩
– S → ⟨X₁, X₁⟩
• Question: if we only have conventional phrase pairs and these two rules, what system do we have?
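The example derivation mentioned above can be simulated with the toy grammar from the earlier slide. This is a minimal sketch, not the slides' code: the rule labels (PP, REL, ONE, GLUE) are hypothetical names for four of the grammar's rules, the romanizations follow the slides, and integers in a rule template mark the linked non-terminal slots shared by the two sides.

```python
# Rule templates: (source side, target side); an integer is a slot that
# is filled by the same child subtree on both sides.
RULES = {
    "PP":   (("yu", 0, "you", 1), ("have", 1, "with", 0)),
    "REL":  ((0, "de", 1), ("the", 1, "that", 0)),
    "ONE":  ((0, "zhiyi"), ("one", "of", 0)),
    "GLUE": ((0, 1), (0, 1)),          # S -> <S1 X2, S1 X2>
}
LEX = {  # terminal rules X -> <f, e>
    "Aozhou": "Australia", "shi": "is", "Beihan": "North Korea",
    "bangjiao": "diplomatic relations", "shaoshu guojia": "few countries",
}

def realize(node):
    """Return (source string, target string) for a derivation tree."""
    if isinstance(node, str):          # lexical rule
        return node, LEX[node]
    kids = [realize(c) for c in node[1:]]
    src_t, tgt_t = RULES[node[0]]
    src = " ".join(kids[s][0] if isinstance(s, int) else s for s in src_t)
    tgt = " ".join(kids[s][1] if isinstance(s, int) else s for s in tgt_t)
    return src, tgt

# one derivation for the lecture's Chinese-English example
tree = ("GLUE", ("GLUE", "Aozhou", "shi"),
        ("ONE", ("REL", ("PP", "Beihan", "bangjiao"), "shaoshu guojia")))
src, tgt = realize(tree)
```

Walking the tree bottom-up reproduces both sentences at once, with the PP and REL rules performing the long-distance re-ordering that a phrase-based system missed.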
• Question: what do we get if we also add these rules?
– X → ⟨X₁ X₂, X₁ X₂⟩
– X → ⟨X₁ X₂, X₂ X₁⟩

Assigning scores to derivations
• A derivation is a parse tree for the source and target sentences
• As in phrase-based models, we choose the derivation that maximizes the score
– The derivation corresponds to a target sentence, which is returned as the translation
– There are multiple derivations of a target sentence, but we do not sum over them; we approximate with max, as in phrase-based models
• P(D) ∝ ∏_i φ_i(D)^{λ_i}
• Feature functions on rules, plus a language model feature:
– P(D) ∝ P_LM(e)^{λ_LM} ∏_{i≠LM} ∏_{(X→⟨γ,α⟩)∈D} φ_i(X→⟨γ,α⟩)^{λ_i}

Assigning scores to derivations (continued)
• Except for the language model, the score can be represented as a weight according to a weighted SCFG:
– w(D) = ∏_{(X→⟨γ,α⟩)∈D} w(X→⟨γ,α⟩)
– w(X→⟨γ,α⟩) = ∏_{i≠LM} φ_i(X→⟨γ,α⟩)^{λ_i}
– P(D) ∝ P_LM(e)^{λ_LM} × w(D)
• Features
– Rule probabilities in two directions, P(γ|α) and P(α|γ)
– Lexical weighting in two directions, P_lex(γ|α) and P_lex(α|γ)
– Exp of the rule count, word count, and glue-rule count

Estimating feature values and feature weights
• Estimating translation probabilities in two directions
– As in phrase-based models, heuristic estimation using relative frequencies of rule counts
– A count of one from every sentence for each initial phrase pair
– For each initial phrase pair in a sentence, a fractional (equal) count for each rule obtained by subtracting sub-phrases
– Relative frequency of these counts:
  P(γ|α) = c(X→⟨γ,α⟩) / Σ_{γ′} c(X→⟨γ′,α⟩)
• Estimating the values of the parameters λ
– Minimum error rate training, as in phrase-based models: maximize the BLEU score on a development set

Finding the best translation: decoding
• In the absence of a language model feature, the scores look like this:
– w(D) = ∏_{(X→⟨γ,α⟩)∈D} w(X→⟨γ,α⟩)
• This can be represented by a weighted SCFG
• Parse the source sentence using the source side of the grammar with CKY (cubic time in sentence length)
• Read off the target sentence assembled from the corresponding target derivation

Finding the best translation including an LM
• Method 1:
generate the k-best derivations without the LM, then rescore with the LM
– May need an extremely large k-best list to reach the highest-scoring derivation once the LM is included
• Method 2: integrate the LM into the grammar by intersecting the target side of the SCFG with the LM: a very large expansion of the rule set, with time O(n³ |T|^{4(m−1)}) for an m-gram LM over target vocabulary T
• Method 3: integrate the LM while parsing with the SCFG, using cube pruning to generate k-best LM-scored translations at each span

Comparison to a phrase-based system

Summary
• Described hierarchical phrase-based translation
– Uses hierarchical rules encoding phrase re-ordering and discontinuous lexical correspondence
– Rules include traditional contiguous phrase pairs
– Can translate efficiently without an LM using SCFG parsing
– Outperforms phrase-based models for several languages
• Hiero is implemented in Moses and Joshua

References
• David Chiang. Hierarchical phrase-based translation. Computational Linguistics, 2007.
• David Chiang. An introduction to synchronous grammars. Notes and slides from an ACL 2006 tutorial.
• Dekai Wu. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 1997.
• A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. Better word alignments with supervised ITG models. ACL 2009.
• Many other interesting papers use ITGs and extensions to Hiero; some will be added to the web page.
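Method 3 above depends on cube pruning. This is a minimal sketch of its core idea only, not Chiang's implementation: given two score-sorted candidate lists (e.g. hypotheses for two adjacent spans), enumerate combinations best-first with a heap instead of scoring all of them. The enumeration is exact when the combined score is monotonic in both inputs; once non-local LM scores break monotonicity, the same procedure becomes the approximate pruning used in decoding.

```python
import heapq

def cube_top_k(a, b, combine, k):
    """Lazily pop the k best combinations of two descending-sorted score
    lists without scoring all len(a) * len(b) pairs."""
    heap = [(-combine(a[0], b[0]), 0, 0)]   # max-heap via negated scores
    seen = {(0, 0)}
    best = []
    while heap and len(best) < k:
        neg, i, j = heapq.heappop(heap)
        best.append(-neg)
        for ni, nj in ((i + 1, j), (i, j + 1)):   # frontier neighbors
            if ni < len(a) and nj < len(b) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-combine(a[ni], b[nj]), ni, nj))
    return best

# e.g. combining two hypothesis lists by adding log-probabilities
print(cube_top_k([-1.0, -2.0, -5.0], [-0.5, -3.0],
                 lambda x, y: x + y, 3))   # [-1.5, -2.5, -4.0]
```

Only a frontier of candidate pairs is ever on the heap, which is what keeps LM-integrated parsing affordable at each CKY span.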