Bibliography for Statistical Alignment and Machine Translation
1 Alignment models
2 Translation models
3 Preprocessing and complementary applications
4 Decoding
5 Related tasks
6 Useful methods and algorithms
Bibliography
- SINGLE-WORD BASED TRANSLATION/ALIGNMENT MODELS
- Systematic Comparison of Various Statistical Alignment Models [Och and Ney 2003]
- A description and comparison of some heuristic models, IBM Models 1 to 5
[Brown 1993], the HMM model [Vogel 1996],
and a log-linear combination of the HMM model and IBM Model 4, called Model 6.
[Och and Ney 2003] have measured the effect of various techniques that,
independently of the models used, can improve the results:
- number of alignments in training
- The effect measured here is that
of including more alignments in the training of the fertility-based alignment models: the neighborhood of the Viterbi alignment, or an even larger
set of high-probability alignments, called pegged alignments in
[Brown 1993].
- smoothing
- Alignment and fertility probabilities are smoothed
to solve the problems of rare words and overfitting.
- word classes
- The alignment parameters of HMM, Model 4 and Model
5 include a dependence on the word classes of the surrounding
words.
- conventional bilingual dictionary
- Entries of a conventional
bilingual dictionary are added to the corpus.
- symmetrization
- The Viterbi alignment is calculated in both
source-target and target-source directions. The two alignments are then
combined into a single alignment matrix.
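A minimal sketch of the symmetrization idea, assuming each directional Viterbi alignment is available as a set of index pairs. Only the two simplest combinations (intersection and union) are shown; the refined heuristics of [Och and Ney 2003] are not reproduced here, and the function names are illustrative.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Combine two directional alignments.

    src_to_tgt: set of (i, j) links from the source-to-target alignment.
    tgt_to_src: set of (j, i) links from the target-to-source alignment.
    Returns (intersection, union), both as sets of (i, j) links.
    """
    # Flip the target-to-source links so both sets use (source, target) order.
    flipped = {(i, j) for (j, i) in tgt_to_src}
    return src_to_tgt & flipped, src_to_tgt | flipped


def alignment_matrix(links, src_len, tgt_len):
    """Render a set of (i, j) links as a 0/1 matrix, one row per source word."""
    return [[1 if (i, j) in links else 0 for j in range(tgt_len)]
            for i in range(src_len)]
```

The intersection tends to favour precision and the union recall, which is why the two are usually treated as the end points between which a combined alignment is chosen.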
- Extensions to HMM-based Statistical Word Alignment Models
- [Toutanova 2002]
Several extensions to the HMM alignment model are tested, namely introducing a "tag
translation" probability as a factor alongside the alignment and lexicon
probabilities, simulating fertility for the HMM model with a new
probability, and a new treatment of the NULL word. Their combination shows
promising results (in terms of Alignment Error Rate) compared with the HMM model and
IBM Model 4, especially when few training sentences are available, since POS tags seem to provide a
more useful clustering for building word classes than classes learned
automatically. Other dependencies on POS tags are described, but no
results are reported.
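For reference, a short sketch of the Alignment Error Rate as it is commonly defined in [Och and Ney 2003], assuming a reference annotation with "sure" links S and "possible" links P (with S contained in P) and a hypothesis alignment A: AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|). The function name and the set-of-pairs representation are illustrative choices.

```python
def alignment_error_rate(hypothesis, sure, possible):
    """Compute AER for one or more sentence pairs.

    All arguments are sets of (source index, target index) links.
    Sure links are treated as possible links as well.
    """
    possible = possible | sure
    denominator = len(hypothesis) + len(sure)
    if denominator == 0:
        # Nothing proposed and nothing required: count as no error.
        return 0.0
    matched = len(hypothesis & sure) + len(hypothesis & possible)
    return 1.0 - matched / denominator
```

Lower values are better; a hypothesis that contains every sure link and only possible links scores 0.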
- PHRASE-BASED TRANSLATION/ALIGNMENT MODELS
- Modeling with structures in SMT
- [Wang and Waibel 1998] The sentences are decomposed into a sequence of n
phrases (bilingual mutual information clustering algorithm + phrasing operator). It is an IBM-type translation model with both phrase-based parameters, which describe the alignment between phrases,
and single-word-based parameters, which account for alignments within
phrases and for individual word translations. All the parameters are trained together with the
EM algorithm.
- Phrase-based, Joint Probability Model for SMT
- [Marcu and Wong 2002] The parameters of the model are the phrase translation probabilities and the probabilities
of distortion between the words of the source phrase and the center
of mass of the target phrase. For the phrase translation
table, all unigrams are selected, together with a selection of
high-frequency n-grams. A Viterbi-based EM algorithm is applied.
- SMT based on Hierarchical Phrase Alignment
- [Watanabe 2002] See next section.
- Integrated Phrase Segmentation and Alignment Model
- [Zhang unpublished] As in
[Marcu and Wong 2002], it is a joint model based on bilingual phrase pairs called concepts. The present model does not require a prior selection of phrases. A metric is defined to calculate the probability of association of a source phrase and a target phrase. The metric involves parameters of the type p(,) and p(,) which are based on co-occurrences between words. The algorithm first calculates associations between words in the alignment matrix and then extends them into phrases.
- ASSOCIATION APPROACH
- Combining clues for Word Alignment
- [Tiedemann 2003]
To find an alignment, the method builds a matrix of association probabilities
between each pair of words. These probabilities are the result of combining
several features related to each pair (co-occurrence in the corpus, measured with the Dice
score; string similarity, measured with LCSR; POS tags; position in the sentence;
chunks...). Co-occurrence and similarity features are used to define a raw
alignment, from which the other features are learned. From the association
matrix, a word alignment is extracted using a dynamic programming algorithm.
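The two basic clues named above can be sketched as follows, assuming the usual definitions: the Dice score 2·c(s,t) / (c(s) + c(t)) over co-occurrence and word counts, and LCSR as the length of the longest common subsequence divided by the length of the longer string. How [Tiedemann 2003] weights and combines the clues is not reproduced; the function names are illustrative.

```python
def dice(cooc, freq_src, freq_tgt):
    """Dice coefficient from a co-occurrence count and the two word counts."""
    total = freq_src + freq_tgt
    return 2.0 * cooc / total if total else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcsr(a, b):
    """Longest common subsequence ratio of two strings, in [0, 1]."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0
```

For instance, `lcsr("colour", "color")` compares a common subsequence of length 5 with a longer string of length 6.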
- Probability Model
- [Cherry and Lin 2003]
The model calculates the probability of a sequence of links given the context of each link. The context is formed by the sentence pair and the previous links.
It uses a parse tree to extract adjacency and dependency features. A best-first search is used to find the highest-probability alignment.
- lexical correspondences generation
- [Ahrenberg 1998] Using a co-occurrence measure (t-score) and making
certain assumptions about the bitext, an alignment is found between open-class
units and between closed-class units. Some morphological information is used
(expressions with different suffixes may be treated as equal), weights are
distributed depending on positions, and multi-word expressions are also
considered.
- Melamed's work
- add reference
- SINGLE-WORD BASED TRANSLATION/ALIGNMENT MODELS
- Systematic Comparison of Various Statistical Alignment Models
- [Och and Ney 2003]
- Extensions to HMM-based Statistical Word Alignment Models
- [Toutanova 2002]
- Context-dependent maximum entropy models
- [Berger 1996, García Varea 2002]. For instance, in
[García Varea 2002], some lexical parameters are estimated with a context-dependent ME model during the Giza++ training.
- Gibbs-Markov models
- [Lafferty 1996] The model presented is a statistical model with an underlying HMM, but where the state transition and output symbol generation probabilities are given by Gibbs distributions, that is, in the form of the exponential of a sum of weighted feature functions. These features allow the probabilities to be context-dependent.
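A small sketch of a Gibbs distribution of the kind described above: each outcome y gets a score exp(Σ_k w_k f_k(context, y)), normalized over all outcomes. The feature functions, weights and outcome names are hypothetical; this is not the parametrization or training procedure of [Lafferty 1996].

```python
import math


def gibbs_distribution(outcomes, features, weights, context):
    """Return {outcome: probability} for one context.

    features: list of functions f(context, outcome) -> float.
    weights:  list of floats, one per feature function.
    """
    scores = {
        y: math.exp(sum(w * f(context, y) for f, w in zip(features, weights)))
        for y in outcomes
    }
    normalizer = sum(scores.values())
    return {y: s / normalizer for y, s in scores.items()}
```

Because the features take the context as an argument, the resulting transition or emission probabilities can depend on it, which is the point made in the entry.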
- PHRASE-BASED TRANSLATION/ALIGNMENT MODELS
- Modeling with structures in SMT
- [Wang and Waibel 1998]
- Phrase-based, Joint Probability Model for SMT
- [Marcu and Wong 2002]
- SMT based on Hierarchical Phrase Alignment
- [Watanabe 2002]
- Integrated Phrase Segmentation and Alignment Model
- [Zhang unpublished]
- TRANSLATION MODELS BASED ON PHRASES "CONSISTENT" WITH THE WORD ALIGNMENT
- Alignment templates
- [Och 1999] Phrases consistent with the symmetrized Giza++ alignment are extracted. Consistent means that the words in a phrase pair are aligned only to each other, not to words outside the pair. The phrases are then generalized with bilingual word classes, and the probabilities of applying the templates are estimated with relative frequencies.
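The consistency criterion can be sketched as follows, assuming the word alignment is a set of (source index, target index) links. This simplified version returns only the tightest target span for each source span and does not extend phrases over unaligned words; `max_len` and the function name are illustrative, and the generalization to word classes is not shown.

```python
def extract_phrase_pairs(links, src_len, max_len=4):
    """Return consistent phrase pairs as ((i1, i2), (j1, j2)) inclusive spans."""
    pairs = []
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            # Target positions linked to any word of the source span.
            targets = [j for (i, j) in links if i1 <= i <= i2]
            if not targets:
                continue
            j1, j2 = min(targets), max(targets)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistent: no target word in the span is linked outside the source span.
            if all(i1 <= i <= i2 for (i, j) in links if j1 <= j <= j2):
                pairs.append(((i1, i2), (j1, j2)))
    return pairs
```

With links {(0, 0), (1, 2), (2, 1)}, the source span 0-1 is rejected because target word 1 is linked to source word 2, outside the span.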
- Statistical Phrase-Based Translation
- [Koehn 2003] The model has a phrase translation component and a phrase distortion component. The distortion component is trained with the model of
[Marcu and Wong 2002], and three methods to build the phrase translation table are compared: blocks consistent with the symmetrized Giza++ alignment (translation probabilities are relative frequencies), the same blocks restricted to syntactic phrases, and phrases trained with the model of
[Marcu and Wong 2002]. Restricting to syntactic phrases is harmful; the phrases extracted with Giza++ give the best results.
- phrase-based SMT
- [Zens 2002] Introduces the segmentation into phrases as a hidden variable. Estimates phrase translation probabilities as relative frequencies. The results are no better than with alignment templates.
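Relative-frequency estimation, as used in several of the entries above, amounts to counting. A minimal sketch, assuming the extracted phrase pairs are given as a list of (source phrase, target phrase) tuples and that the probability is conditioned on the source phrase (the direction of conditioning is an assumption here):

```python
from collections import Counter


def relative_frequencies(phrase_pairs):
    """Estimate p(target | source) = count(source, target) / count(source)."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(source for source, _ in phrase_pairs)
    return {
        (source, target): count / source_counts[source]
        for (source, target), count in pair_counts.items()
    }
```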
- FINITE-STATE TRANSDUCERS (FSTs)
- translation using FSTs
- [Casacuberta 2002] FSTs offer two (three) possible strategies to deal
with speech translation, namely sequential and integrated (and iterative).
Both rely on the training of a transducer from a set of bilingual units
usually called tuples, implementing a bilingual language model. These tuples
must be extracted from a word alignment done beforehand.
- TRANSLATION MODELS USING A PREVIOUS ALIGNMENT, POS TAGGING AND PARSING/CHUNKING
- chunkMT
- [Koehn and Knight 2002] A symmetrized Giza++ alignment of the corpus is performed and both sides of the corpus are POS
tagged and chunked. Next, chunk mappings are collected based on the
alignment between words of those chunks.
The model divides the process of machine translation into three steps: sentence-level chunk reordering, chunk mapping, and word translation.
- SMT based on Hierarchical Phrase Alignment
- [Watanabe 2002] Hierarchical structures are converted to a sequence of chunks. Model 4 is trained on the "chunk" corpus as if each chunk were one word. Model 4 is then applied again, this time at the word level, only allowing alignment between chunks.
- SYNTAX-BASED TRANSLATION MODELS
- Syntax-based translation model
- [Yamada and Knight 2001] The source side of the corpus is POS-tagged and syntactically parsed
to build parse trees, which are the input of the system. Three successive
operations are stochastically applied to the tree: reordering of the child nodes of each
internal node, insertion (or not) of an extra word at each
node, and translation of the leaves. Finally, reading the
translated leaves gives the target sentence. The model parameters
(probabilities of each operation) are trained via the EM algorithm.
- TRANSLATION MODELS USING FEATURES
- Discriminative training and Maximum Entropy Models for Statistical Machine Translation
- [Och and Ney 2002] The components of the translation model are integrated as feature functions of the Maximum Entropy framework.
More feature functions are added.
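In this framework a candidate translation is scored by a weighted sum of feature functions, Σ_m λ_m h_m(e, f), and the highest-scoring candidate is chosen. A minimal sketch, assuming each candidate comes with a dictionary of feature values (for instance log-probabilities of a translation model and a language model); the feature names and weights are hypothetical, and training the weights is not shown.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values; features and weights are dicts keyed by name."""
    return sum(weights[name] * value for name, value in features.items())


def best_candidate(candidates, weights):
    """Pick the candidate with the highest log-linear score.

    candidates: dict mapping a candidate translation to its feature dict.
    """
    return max(candidates, key=lambda c: loglinear_score(candidates[c], weights))
```

Adding a feature function then only requires a new key in the feature dictionaries and a corresponding weight.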
- Statistical Machine Translation on Paraphrased Corpora
- (Taro Watanabe)
- Sonja Niessen's papers
-
- Using POS Information for SMT into Morphologically Rich Languages
- [Ueffing and Ney 2003] When translating from English to Spanish or Catalan, generating the correct verb full form is especially difficult because English verb forms carry less information. To cope with this, the English pronouns, modals and verbs are identified with POS tagging and grouped into "new" full forms (e.g. "you will have" becomes
"you_will_have"). In questions, the order of the verbal forms is inverted before forming the groups.
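The grouping step might be sketched as below, assuming the English side is already POS-tagged with Penn Treebank style tags (PRP for personal pronouns, MD for modals, VB* for verbs). The tag set, the function name and the handling of adjacent groups are assumptions for illustration, not the procedure of [Ueffing and Ney 2003]; the inversion in questions is not handled.

```python
def merge_verb_groups(tagged_tokens):
    """Join consecutive pronoun/modal/verb tokens with underscores.

    tagged_tokens: list of (word, tag) pairs.
    Returns the list of words with each such run merged into one token.
    """
    merged, group = [], []
    for word, tag in tagged_tokens:
        if tag in ("PRP", "MD") or tag.startswith("VB"):
            group.append(word)
            continue
        # A non-verbal token closes the current group, if any.
        if group:
            merged.append("_".join(group))
            group = []
        merged.append(word)
    if group:
        merged.append("_".join(group))
    return merged
```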
- NATURAL LANGUAGE UNDERSTANDING
- Comparison of Alignment Templates and Maximum Entropy Models for Natural Language Understanding
- [Bender 2003] The training is done on a word-aligned corpus. If the corpus is annotated with word-concept correspondences, very precise features can be implemented, giving information on, for instance, the source words surrounding the current word and the concepts surrounding the corresponding concept. In the TABA corpus (87k words, vocabulary 2k, 27 concepts), 6000 lexical features, 400 prior features and 23000 compound features have been defined.
- Feature-based Language Understanding
- [Papineni 1998] The training is done on a sentence-aligned corpus: the features indicate the presence or absence in a sentence pair of n-grams, long-distance bigrams or word groups.
- MACHINE-AIDED TRANSLATION
- Maximum Entropy/Minimum Divergence Translation Model
- [Foster 2000]
In this machine-assisted translation task (find the most probable next word given a context), the same type of lexical features is defined as in
[Papineni 1998]. Position features are also introduced, based on position classes to avoid data sparseness. Because of the huge number of features, a feature selection step is necessary.
- MAXIMUM ENTROPY MODELLING
- A Maximum Entropy Approach to Natural Language Processing
- [Berger 1996] A fundamental paper.
- A comparison of algorithms for maximum entropy parameter estimation
- [Malouf 2002]
Lars Ahrenberg, Mikael Andersson, and Magnus Merkel. 1998.
A simple hybrid aligner for generating lexical correspondences in
parallel texts.
In Proc. of the 36th Annual Meeting of the Association for
Computational Linguistics and 17th International Conference on Computational
Linguistics (COLING-ACL'98), pages 29-35, Montreal, Canada, August 10-14.
PDF
Oliver Bender, Klaus Macherey, Franz Josef Och, and Hermann Ney. 2003.
Comparison of alignment templates and maximum entropy models for
natural language understanding.
In Proc. of the 10th Conference of the European Chapter of the
ACL (EACL), Budapest, Hungary, April.
PS
Adam Berger, Stephen Della Pietra, and Vincent Della Pietra. 1996.
A maximum entropy approach to natural language processing.
Computational Linguistics, 22(1):39-72, March.
PS
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and R. L.
Mercer.
1993.
The mathematics of statistical machine translation: Parameter
estimation.
Computational Linguistics, 19(2):263-311.
PDF
Francisco Casacuberta, Enrique Vidal, and Juan Miguel Vilar.
2002.
Architectures for speech-to-speech translation using finite-state
models.
In Proceedings of the Workshop on Speech-to-Speech Translation:
Algorithms and Systems, pages 39-44, Philadelphia, July.
PDF
Colin Cherry and Dekang Lin.
2003.
A probability model to improve word alignment.
In Proc. of the Annual Meeting of the Association for
Computational Linguistics.
PDF PS
George Foster.
2000.
Incorporating position information into a maximum entropy/minimum
divergence translation model.
In Proc. of CoNLL-2000 and LLL-2000, pages 37-42, Lisbon,
Portugal.
PS
Ismael García Varea, Franz Josef Och, Hermann Ney, and Francisco Casacuberta.
2002.
Improving alignment quality in statistical machine translation using
context-dependent maximum entropy models.
In Proc. 19th Int. Conf. on Computational Linguistics, pages
1051-1054, Taipei, Taiwan.
PS.GZ
Philipp Koehn and Kevin Knight.
2002.
ChunkMT: Statistical machine translation with richer linguistic
knowledge.
Draft. Unpublished.
PS
Philipp Koehn, Franz Josef Och, and Daniel Marcu.
2003.
Statistical phrase-based translation.
In Proc. of the Annual Meeting of the Association for
Computational Linguistics.
PDF PS
John Lafferty.
1996.
Gibbs-Markov models.
Computing Science and Statistics, 27:370-377.
PS
Robert Malouf.
2002.
A comparison of algorithms for maximum entropy parameter estimation.
In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL), pages 49-55.
PDF
Daniel Marcu and William Wong. 2002.
A phrase-based, joint probability model for statistical machine
translation.
In Proc. of the Conference on Empirical Methods in Natural
Language Processing, Philadelphia, PA, July 6-7.
PDF
Franz Josef Och and Hermann Ney.
2002.
Discriminative training and maximum entropy models for statistical
machine translation.
In Proc. of the Annual Meeting of the Association for
Computational Linguistics, pages 295-302, Philadelphia, PA, July.
PS
Franz Josef Och and Hermann Ney. 2003.
A systematic comparison of various statistical alignment models.
Computational Linguistics, 29(1):19-51, March.
Franz Josef Och, Christoph Tillmann, and Hermann Ney.
1999.
Improved alignment models for statistical machine translation.
In Proc. of the Conference on Empirical Methods in Natural
Language Processing and Very Large Corpora, pages 20-28, University of
Maryland, College Park, MD, June.
PS
K. A. Papineni, S. Roukos, and R. T. Ward.
1998.
Maximum likelihood and discriminative training of direct translation
models.
In Proc. Int. Conf. on Acoustics, Speech, and Signal
Processing, pages 189-192, Seattle, WA, May.
PDF
Jörg Tiedemann.
2003.
Combining clues for word alignment.
In Proc. of the 10th Conference of the European Chapter of the
ACL (EACL), Budapest, Hungary, April 12-17.
PDF PS
Kristina Toutanova, H. Tolga Ilhan, and Christopher D. Manning.
2002.
Extensions to HMM-based statistical word alignment models.
In Proc. of the Conference on Empirical Methods in Natural
Language Processing, Philadelphia, PA, July 6-7.
PDF
Nicola Ueffing and Hermann Ney.
2003.
Using POS information for statistical machine translation into
morphologically rich languages.
In Proc. of the 10th Conference of the European Chapter of the
ACL (EACL), Budapest, Hungary, April.
PS
Stephan Vogel, Hermann Ney, and Christoph Tillmann.
1996.
HMM-based word alignment in statistical translation.
In COLING'96: The 16th Int. Conf. on Computational Linguistics,
pages 836-841, Copenhagen, Denmark, August.
PS
Ye-Yi Wang and Alex Waibel.
1998.
Modeling with structures in statistical machine translation.
In Proc. of the 36th Annual Meeting of the Association for
Computational Linguistics and 17th International Conference on Computational
Linguistics, Montreal, Canada.
PS.GZ
Taro Watanabe, Kenji Imamura, and Eiichiro Sumita.
2002.
Statistical machine translation based on hierarchical phrase
alignment.
In Proc. of the 9th International Conference on Theoretical and
Methodological Issues in Machine Translation (TMI), pages 188-198,
Keihanna, Japan, March.
PDF
Kenji Yamada and Kevin Knight.
2001.
A syntax-based statistical translation model.
In Proc. of the Annual Meeting of the Association for
Computational Linguistics, Toulouse, France.
PS
Richard Zens, Franz Josef Och, and Hermann Ney.
2002.
Phrase-based statistical machine translation.
In Springer Verlag, editor, Proc. German Conference on
Artificial Intelligence (KI), September.
PS
Ying Zhang.
unpublished.
Integrated phrase segmentation and alignment model for statistical
machine translation.
Adrià de Gispert
Patrik Lambert
Last update: July 31, 2003