Algorithmic approaches to identifying inflectional stems: implications from Spanish
Jackson L. Lee and John A. Goldsmith (University of Chicago)
Overview: Given a paradigm with inflected word forms, how exactly do we know what the stem
material is? This paper deals with the long-standing but under-researched problem of inflectional
stem identification (Goldsmith 2010, Spencer 2012) and, by drawing insights from mathematics
and bioinformatics, describes three language-independent and algorithmic solutions: substrings,
multisets, and subsequences. Inflectional stem identification is important because successfully
identifying stem material in a paradigm entails locating affixal material in each word form. Stem
identification is therefore part of the morpheme segmentation problem – the center of a morphological analysis, which in turn is the basis for syntactic/semantic analysis. We pin down three
algorithmically explicit approaches and compare their pros and cons in morphological analysis by
focusing on verbal paradigms in Spanish. We conclude that, among the approaches we consider,
the subsequence approach most closely matches what is desirable. As the approaches are algorithmic and machine-implementable, they are useful and relevant not only to Spanish and other
Romance languages, but also to languages with inflectional morphology in general.
Leading idea: When a linguist seeks to identify the stem in an inflectional paradigm, the intuition
at work is that we wish to find the maximal common material across all word forms (Spencer
2012). The point of this paper is to show that there are three natural ways to define the notion of
“common material”, employing the notions of substrings, multisets, and subsequences.
The three approaches: Consider the string “abcde”. “a”, “bc”, and “cde” are substrings. “ac”
is not a substring, because it violates adjacency, and “ba” is not either, because it violates linear
ordering. For stem identification, the substring approach chooses the longest common substring
among inflected word forms in a paradigm. The multiset approach disregards linear ordering
and adjacency altogether: it treats each word as if it were a bag of unordered phonemes. Any
combination of a subset of symbols from “abcde” would be a multiset from “abcde”. (A multiset,
as opposed to a set, allows repetition of symbols within it; for instance, {b,a,a} is a multiset, but
not a set, while {b,a} is a set.) To identify stem material in a paradigm, the multiset approach
chooses the largest common multiset among the inflected word forms. Finally, a subsequence from
a string respects the linear ordering without necessarily abiding by the adjacency requirement.
For the original string “abcde”, legal subsequences include “abd”, “ace”, “bde”, and so forth. The
subsequence approach chooses the longest common subsequence among word forms in a paradigm.
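The three notions of "common material" can be stated as simple containment tests. The following is a minimal Python sketch, not the authors' implementation; the function names `is_substring`, `is_subsequence`, and `is_submultiset` are illustrative assumptions.

```python
from collections import Counter


def is_substring(candidate, word):
    """True if candidate occurs in word with order and adjacency preserved."""
    return candidate in word


def is_subsequence(candidate, word):
    """True if candidate occurs in word with order preserved (gaps allowed)."""
    remaining = iter(word)
    # Each membership test consumes the iterator up to the matched symbol,
    # so later symbols can only match further to the right.
    return all(symbol in remaining for symbol in candidate)


def is_submultiset(candidate, word):
    """True if word contains every symbol of candidate at least as many times."""
    # Counter subtraction keeps only positive counts; an empty result
    # means nothing in candidate is missing from word.
    return not (Counter(candidate) - Counter(word))


word = "abcde"
for candidate in ["bc", "ac", "ba"]:
    print(candidate,
          is_substring(candidate, word),
          is_subsequence(candidate, word),
          is_submultiset(candidate, word))
# bc True True True
# ac False True True
# ba False False True
```

The output mirrors the examples in the text: "ac" fails only the adjacency requirement, while "ba" also fails linear ordering and survives only as a multiset.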
Substrings: The substring approach to identifying stems has, implicitly or not, been frequently
used, and lies at the heart of what is called concatenative morphology. For instance, given the
present indicative paradigm of Spanish cantar ‘to sing’, canto-cantas-canta-cantamos-cantáis-cantan, the longest common substring among these six forms is “cant”, which corresponds well to
what a linguist would say is the stem for the cantar paradigm. While this substring approach for
stem identification is clearly effective for concatenative morphology, its weakness appears when
there is deviation from simple concatenation. This issue has long been acknowledged in the literature on computational approaches to morphological learning (Goldsmith 2010, Hammarström and
Borin 2011), most notably with the challenges posed by non-concatenative, root-and-pattern morphology in Semitic languages such as Arabic and Hebrew. But similar challenges are also presented
by Spanish conjugation, with various vowel-alternating patterns in the stem-changing verbs. An
example is the verb poder ‘can’, with the o∼ue alternation, whose present indicative paradigm is puedo-puedes-puede-podemos-podéis-pueden. For this poder paradigm, the substring approach to identifying
the stem gives three analyses, with “p”, “e”, and “d” being the three possible longest common substrings
(ignoring accents for stress here):
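          analysis 1   analysis 2             analysis 3
STEM      p            e                      d
puedo     [p]uedo      pu[e]do                pue[d]o
puedes    [p]uedes     pu[e]des / pued[e]s    pue[d]es
puede     [p]uede      pu[e]de / pued[e]      pue[d]e
podemos   [p]odemos    pod[e]mos              po[d]emos
podeis    [p]odeis     pod[e]is               po[d]eis
pueden    [p]ueden     pu[e]den / pued[e]n    pue[d]en
(Brackets mark the stem material; “/” separates alternative alignments.)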
Intuitively, at least both “p” and “d” should be part of the stem for the poder paradigm, but none
of the three substring analyses above gives a stem satisfying this. Therefore, the substring approach
is suboptimal and too restrictive for such morphology.
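The substring analyses for the two paradigms can be reproduced with a brute-force search. This is a hedged sketch rather than the authors' code: `longest_common_substrings` is a hypothetical helper, and accents are dropped from the word forms, as in the text.

```python
def longest_common_substrings(words):
    """Return every longest substring shared by all words (brute force)."""
    shortest = min(words, key=len)
    # Try candidate lengths from longest to shortest and stop at the first hit.
    for length in range(len(shortest), 0, -1):
        candidates = {shortest[i:i + length]
                      for i in range(len(shortest) - length + 1)}
        common = sorted(c for c in candidates if all(c in w for w in words))
        if common:
            return common
    return []


CANTAR = ["canto", "cantas", "canta", "cantamos", "cantais", "cantan"]
PODER = ["puedo", "puedes", "puede", "podemos", "podeis", "pueden"]

print(longest_common_substrings(CANTAR))  # ['cant']
print(longest_common_substrings(PODER))   # ['d', 'e', 'p']
```

For cantar the search returns a single four-symbol stem, while for poder it returns three competing one-symbol candidates, which is the ambiguity described above.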
Multisets: The multiset approach abandons both linear ordering and adjacency and squeezes out
as stem material what is common across all word forms. For the poder paradigm, the multiset
approach gives the unordered {p,d,e} as the stem:
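STEM      {p,d,e}
puedo     [p]u[e][d]o
puedes    [p]u[e][d]es / [p]ue[d][e]s
puede     [p]u[e][d]e / [p]ue[d][e]
podemos   [p]o[d][e]mos
podeis    [p]o[d][e]is
pueden    [p]u[e][d]en / [p]ue[d][e]n
(Brackets mark the stem material; “/” separates alternative alignments.)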
The multiset approach improves on the substring approach, because both “p” and “d” are in the
multiset-based stem. However, a new problem arises: {p,d,e} cannot tell if “e” is suffixal or part
of the o∼ue alternation. Abandoning linear ordering in stem identification appears undesirable.
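One way to compute the largest common multiset is to intersect symbol counts across the paradigm. The sketch below assumes Python's `collections.Counter` as the multiset representation; `largest_common_multiset` is an illustrative name, not something from the paper.

```python
from collections import Counter
from functools import reduce
from operator import and_


def largest_common_multiset(words):
    """Intersect symbol counts across all words.

    Counter & Counter keeps, for each symbol, the minimum of the two counts,
    so the result holds each symbol as often as every word contains it.
    """
    return reduce(and_, (Counter(word) for word in words))


PODER = ["puedo", "puedes", "puede", "podemos", "podeis", "pueden"]

stem = largest_common_multiset(PODER)
print(sorted(stem.elements()))            # ['d', 'e', 'p']

# Order is not represented at all: these two "stems" are indistinguishable.
print(Counter("pde") == Counter("edp"))   # True
```

The second line of output illustrates the drawback noted above: because the representation carries no positional information, it cannot say whether the shared “e” precedes or follows “d”.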
Subsequences: The subsequence approach requires linear ordering but not adjacency for stem
identification. For the poder paradigm, two longest common subsequences are “p-d” and “p-e”:
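          analysis 1    analysis 2
STEM      pd            pe
puedo     [p]ue[d]o     [p]u[e]do
puedes    [p]ue[d]es    [p]u[e]des / [p]ued[e]s
puede     [p]ue[d]e     [p]u[e]de / [p]ued[e]
podemos   [p]o[d]emos   [p]od[e]mos
podeis    [p]o[d]eis    [p]od[e]is
pueden    [p]ue[d]en    [p]u[e]den / [p]ued[e]n
(Brackets mark the stem material; “/” separates alternative alignments.)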
The subsequence-based stem “p-d” is desirable because both o∼ue and all suffixes are excluded. To
decide between “p-d” and “p-e”, our automatic morphological learner relies on cross-paradigmatic
comparison (e.g., with non-stem-changing verbs of the -ER conjugation) and observes that the
suffixal pattern due to the “p-d” stem is a better match with suffixal patterns from other paradigms.
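The two competing subsequence stems can be found by exhaustive search over the shortest word form. This is a minimal sketch under stated assumptions: the helper names are hypothetical, accents are dropped, and the enumeration is exponential in the length of the shortest form, so it suits only short word forms. It does not implement the cross-paradigmatic comparison that selects between the candidates.

```python
from itertools import combinations


def is_subsequence(candidate, word):
    """True if candidate occurs in word with order preserved (gaps allowed)."""
    remaining = iter(word)
    return all(symbol in remaining for symbol in candidate)


def longest_common_subsequences(words):
    """Return every longest subsequence shared by all words (brute force)."""
    shortest = min(words, key=len)
    # combinations() picks positions in left-to-right order, so each
    # candidate is a genuine subsequence of the shortest word.
    for length in range(len(shortest), 0, -1):
        candidates = {"".join(c) for c in combinations(shortest, length)}
        common = sorted(c for c in candidates
                        if all(is_subsequence(c, w) for w in words))
        if common:
            return common
    return []


PODER = ["puedo", "puedes", "puede", "podemos", "podeis", "pueden"]
CANTAR = ["canto", "cantas", "canta", "cantamos", "cantais", "cantan"]

print(longest_common_subsequences(PODER))   # ['pd', 'pe']
print(longest_common_subsequences(CANTAR))  # ['cant']
```

For the concatenative cantar paradigm the result coincides with the substring analysis, while for poder the search returns exactly the two candidates discussed above.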
Conclusions: This paper is intended as a contribution to the development of algorithmic approaches to identifying inflectional stems. We show that there are three approaches—substrings,
multisets, and subsequences—useful for such a task, and that the subsequence approach appears to
be the most helpful regardless of morphological typology. Not only are these approaches explicit,
they are also straightforwardly amenable to machine-implemented analyses, which makes them
readily applicable to cross-linguistic paradigmatic data sets for comparable and verifiable results.
References: Goldsmith, J. 2010. Segmentation and morphology. In Handbook of Computational
Linguistics and Natural Language Processing. • Hammarström, H. and L. Borin. 2011. Unsupervised learning
of morphology. Computational Linguistics 37. • Spencer, A. 2012. Identifying stems. Word Structure 5.