Algorithmic approaches to identifying inflectional stems: implications from Spanish
Jackson L. Lee and John A. Goldsmith (University of Chicago)

Overview: Given a paradigm with inflected word forms, how exactly do we know what the stem material is? This paper deals with the long-standing but under-researched problem of inflectional stem identification (Goldsmith 2010, Spencer 2012) and, by drawing insights from mathematics and bioinformatics, describes three language-independent and algorithmic solutions: substrings, multisets, and subsequences. Inflectional stem identification is important because successfully identifying stem material in a paradigm entails locating affixal material in each word form. Stem identification is therefore part of the morpheme segmentation problem, the core of a morphological analysis, which in turn is the basis for syntactic/semantic analysis. We pin down three algorithmically explicit approaches and compare their pros and cons in morphological analysis by focusing on verbal paradigms in Spanish. We conclude that, among the approaches we consider, the subsequence approach most closely matches what is desirable. As the approaches are algorithmic and machine-implementable, they are useful and relevant not only to Spanish and other Romance languages, but also to languages with inflectional morphology in general.

Leading idea: When a linguist seeks to identify the stem in an inflectional paradigm, the intuition at work is that we wish to find the maximal common material across all word forms (Spencer 2012). The point of this paper is to show that there are three natural ways to define the notion of “common material”, employing the notions of substrings, multisets, and subsequences.

The three approaches: Consider the string “abcde”. “a”, “bc”, and “cde” are substrings. “ac” is not a substring, because it violates adjacency, and “ba” is not either, because it violates linear ordering.
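The substring notion can be made concrete with a minimal Python sketch. The helper name substrings is hypothetical and is introduced here only for illustration; it is not taken from the authors' implementation.

```python
def substrings(s):
    """Return the set of all non-empty contiguous substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

subs = substrings("abcde")
print({"a", "bc", "cde"} <= subs)   # True: all three are contiguous and in order
print("ac" in subs, "ba" in subs)   # False False: adjacency / ordering violated
```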
For stem identification, the substring approach chooses the longest common substring among inflected word forms in a paradigm. The multiset approach disregards linear ordering and adjacency altogether: it treats each word as if it were a bag of unordered phonemes. Any combination of a subset of symbols from “abcde” would be a multiset from “abcde”. (A multiset, as opposed to a set, allows repetition of symbols within it; for instance, {b,a,a} is a multiset, but not a set, while {b,a} is a set.) To identify stem material in a paradigm, the multiset approach chooses the largest common multiset among the inflected word forms. Finally, a subsequence from a string respects the linear ordering without necessarily abiding by the adjacency requirement. For the original string “abcde”, legal subsequences include “abd”, “ace”, “bde”, and so forth. The subsequence approach chooses the longest common subsequence among word forms in a paradigm.

Substrings: The substring approach to identifying stems has, implicitly or not, been frequently used, and lies at the heart of what is called concatenative morphology. For instance, given the present indicative paradigm of Spanish cantar ‘to sing’, canto-cantas-canta-cantamos-cantáis-cantan, the longest common substring among these six forms is “cant”, which corresponds well to what a linguist would say is the stem for the cantar paradigm. While this substring approach for stem identification is clearly effective for concatenative morphology, its weakness appears when there is deviation from simple concatenation. This issue has long been acknowledged in the literature on computational approaches to morphological learning (Goldsmith 2010, Hammarström and Borin 2011), most notably with the challenges posed by non-concatenative, root-and-pattern morphology in Semitic languages such as Arabic and Hebrew. But similar challenges are also presented by Spanish conjugation, with various vowel-alternating patterns in the stem-changing verbs.
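As an illustration of the substring approach, the following Python sketch finds the longest substring(s) shared by every form in a paradigm. The function name longest_common_substrings is hypothetical, and the brute-force search is an assumption made for brevity, not a description of the authors' learner.

```python
def longest_common_substrings(forms):
    """Return the set of longest substrings shared by every form (ties kept)."""
    shortest = min(forms, key=len)
    # Try candidate lengths from longest to shortest; stop at the first hit.
    for length in range(len(shortest), 0, -1):
        candidates = {shortest[i:i + length]
                      for i in range(len(shortest) - length + 1)}
        common = {c for c in candidates if all(c in f for f in forms)}
        if common:
            return common
    return set()

cantar = ["canto", "cantas", "canta", "cantamos", "cantáis", "cantan"]
print(longest_common_substrings(cantar))  # {'cant'}
```

The function returns a set rather than a single string because a paradigm may have several equally long common substrings.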
An example is the verb poder ‘can’ with o∼ue, whose present indicative paradigm is puedo-puedes-puede-podemos-podéis-pueden. For this poder paradigm, the substring approach to identifying the stem gives three analyses, with “p”, “e”, “d” being three possible longest common substrings (ignoring accents for stress here). In the tables below, stem material is marked with square brackets; where a form contains two instances of “e”, both possible alignments are given.

             analysis 1   analysis 2             analysis 3
STEM         p            e                      d
puedo        [p]uedo      pu[e]do                pue[d]o
puedes       [p]uedes     pu[e]des / pued[e]s    pue[d]es
puede        [p]uede      pu[e]de / pued[e]      pue[d]e
podemos      [p]odemos    pod[e]mos              po[d]emos
podeis       [p]odeis     pod[e]is               po[d]eis
pueden       [p]ueden     pu[e]den / pued[e]n    pue[d]en

Intuitively, at least both “p” and “d” should be part of the stem for the poder paradigm, but none of the three substring analyses above give a stem satisfying this. Therefore, the substring approach is suboptimal and too restrictive for such morphology.

Multisets: The multiset approach abandons both linear ordering and adjacency and squeezes out as stem material what is common across all word forms. For the poder paradigm, the multiset approach gives the unordered {p,d,e} as the stem:

STEM         {p,d,e}
puedo        [p]u[e][d]o
puedes       [p]u[e][d]es / [p]ue[d][e]s
puede        [p]u[e][d]e / [p]ue[d][e]
podemos      [p]o[d][e]mos
podeis       [p]o[d][e]is
pueden       [p]u[e][d]en / [p]ue[d][e]n

The multiset approach improves on the substring approach, because both “p” and “d” are in the multiset-based stem. However, a new problem arises: {p,d,e} cannot tell if “e” is suffixal or part of the o∼ue alternation. Abandoning linear ordering in stem identification appears undesirable.

Subsequences: The subsequence approach requires linear ordering but not adjacency for stem identification. For the poder paradigm, two longest common subsequences are “p-d” and “p-e”:

             analysis 1     analysis 2
STEM         pd             pe
puedo        [p]ue[d]o      [p]u[e]do
puedes       [p]ue[d]es     [p]u[e]des / [p]ued[e]s
puede        [p]ue[d]e      [p]u[e]de / [p]ued[e]
podemos      [p]o[d]emos    [p]od[e]mos
podeis       [p]o[d]eis     [p]od[e]is
pueden       [p]ue[d]en     [p]u[e]den / [p]ued[e]n

The subsequence-based stem “p-d” is desirable because both o∼ue and all suffixes are excluded.
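The multiset and subsequence analyses of the poder paradigm can be reproduced with a short Python sketch. The function names are hypothetical, and the exhaustive enumeration of subsequences is an assumption chosen for clarity: it is practical only for short word forms and is not presented as the authors' implementation.

```python
from collections import Counter
from functools import reduce
from itertools import combinations

def largest_common_multiset(forms):
    """Intersect the symbol counts of all forms (linear order is ignored)."""
    return reduce(lambda a, b: a & b, (Counter(f) for f in forms))

def is_subsequence(candidate, word):
    """True if candidate's symbols occur in word in the same order."""
    remaining = iter(word)
    return all(ch in remaining for ch in candidate)

def longest_common_subsequences(forms):
    """Return every longest subsequence shared by all forms (ties kept)."""
    shortest = min(forms, key=len)
    # Enumerate subsequences of the shortest form, longest first.
    for length in range(len(shortest), 0, -1):
        candidates = {"".join(c) for c in combinations(shortest, length)}
        common = {c for c in candidates
                  if all(is_subsequence(c, f) for f in forms)}
        if common:
            return common
    return set()

# Accents are ignored, as in the text.
poder = ["puedo", "puedes", "puede", "podemos", "podeis", "pueden"]
print(sorted(largest_common_multiset(poder).elements()))  # ['d', 'e', 'p']
print(sorted(longest_common_subsequences(poder)))         # ['pd', 'pe']
```

Returning all tied subsequences, rather than one arbitrary winner, keeps both “pd” and “pe” available for a later disambiguation step.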
To decide between “p-d” and “p-e”, our automatic morphological learner relies on cross-paradigmatic comparison (e.g., with non-stem-changing verbs of the -ER conjugation) and observes that the suffixal pattern due to the “p-d” stem is a better match with suffixal patterns from other paradigms.

Conclusions: This paper is intended as a contribution to the development of algorithmic approaches to identifying inflectional stems. We show that there are three approaches (substrings, multisets, and subsequences) useful for such a task, and that the subsequence approach appears to be the most helpful regardless of morphological typology. Not only are these approaches explicit, they are also straightforwardly amenable to machine-implemented analyses, which makes them readily applicable to cross-linguistic paradigmatic data sets for comparable and verifiable results.

References:
Goldsmith, J. 2010. Segmentation and morphology. In Handbook of Computational Ling. and Natural Lang. Processing.
Hammarström, H. and L. Borin. 2011. Unsupervised learning of morphology. Computational Ling. 37.
Spencer, A. 2012. Identifying stems. Word Structure 5.