Assignment 11: Morphological analyzers and finite-state transducers

This is a graded assignment. You need to follow the link at the private course page to work on your own repository. The deadline for this assignment is Friday Feb 8 2019, 12:00 CET.

You are strongly recommended to use git properly/productively: commit every ‘unit’ of work (e.g., individual exercises) and bug fixes separately. You do not need to worry about the mistakes in the earlier commits. Your assignment will be evaluated based only on the final commit before the deadline.

This assignment has two parts. In the fist part, you are required to build a simple finite-state morphological analyzer using Xerox languages. In the second part you are going to implement an FST class in Python that is capable transducing a string given an FST similar to the one you created in the fist part.

Part I: Defining morphology of English nouns

This part of the assignment is about extending the simple morphological analyzer we studied in the earlier in-class demonstration. A slightly modified solution is given you as a starter code in analyzer/ subdirectory. You should study the explanations in the page linked above, and the statter code before starting.

The code should work with both hfst-xfst from HFST and foma. Both implement Xerox’s XFST interface, and provide pre-compiled binaries for major operating systems (see respective web pages for installation instructions).

The following is an example interactive session with hfst-xfst (edited for brevity):

$ hfst-xfst 
hfst[0]: source analyzer.xfst
hfst[1]: down cat<N><SG>
cat
hfst[1]: up cats
cat<N><PL>
hfst[1]: up
apply up> foxes
fox<N><PL>
apply up> foxs
???
apply up> finches
finch<N><PL>
apply up> finchs
???
apply up> cates
???
apply up> mice
mouse<N><PL>
apply up> mouses
???

A finite-state morphological description can be used for both analysis and generation. In analysis mode, the system is presented with a word (surface form) and asked to provide an analysis, e.g., given cats we get cat<N><PL> as analysis which is one way of saying that “cats is the plural noun with lemma cat”. The generation is the reverse problem. We want to know, for example, what is the plural of the noun cat. Given our FST, we form the appropriate symbolic expression, which turns out to be cat<N><PL> in our definition above, and ask the system to generate the word.

The convention in defining finite-state morphological analyzers is to consider the underlying form (analyses) as the input language, and surface forms (actual words) as the output language. By default the transducer generates surface forms from the underlying representations. To use it as an analyzer, we need to either follow the output symbols on the arcs during transduction and output corresponding input symbols, or we need to invert the transducer.

In the XFST terminology, apply down means transduce from the input language to the output language. Hence, it will work as a morphological generator. apply up means transduce from the output language to the input language, analyzing the given word.

In the first example (line 3) above, the transducer generates cat from the analysis string cat<N><SG>. The other examples are analysis examples, where cats, foxes, finches and mice are correctly analyzed as the plural forms of corresponding lemmas, while incorrect forms *cates, *foxs *finchs and *mouses are rejected (we are listing the animals, computer mouses do not count).

1. More exceptions, and additional contexts

Change the morphological analyzer such that the following words are analyzed correctly.

sheep (only sheep should be accepted as plural)
fish (both fish and fishes should be accepted as plural)
monarch (only monarchs should be accepted as plural)

2. Add a ‘y-replacement’ rule

The starter code implements (a partial version of) the e-insertion rule. Add a rule that replaces (or does not replace) the word-final ‘y’ with ‘i’ as in the following examples.

butterfly (plural butterflies)
turkey (plural turkeys)

Make sure the new rule is also used appropriately by the final FST. Your rule should be general. That is, it should work for any noun ending in “y”, e.g., bunny or donkey, if its lemma is added to the lexicon.

Save the resulting finite state transducer as analyzer.fst in AT&T format, using write att command of xfst. Make sure to check in and push all your changes, including the AT&T formatted automata file.

3. (optional) extend the analyzer for superlative and comparative forms of adjectives

This exercise is optional, you are recommended to do it to get further insight into finite-state morphological analyzer development, but it will not affect your grade.

Extend the morphological analyzer to analyze comparative and superlative forms of adjectives. At a minimum, your analyzer should cover the following cases.

fast - fast<Adj>
faster - fast<Adj><Comp>
fastest - fast<Adj><Sup>
later - late<Adj><Comp>
easier - easy<Adj><Comp>
bigger - big<Adj><Comp>
*comfortabler - no analyses
*comfortablest - no analyses

Naturally, the incorrect forms like *fastr *lateer *easyer *biger should be rejected.

Your analyzer should reuse the code from the earlier steps, e.g., using the same e-insertion or y-replacement rules with nouns.

Part II: Implementing a finite-state transducer

In this part you will implement a simple transducer in python. The template is provided in fst.py. The exercises below can be implemented by extending the relevant parts of the finite-state automata assignment.

4. Read an ATT transducer file

Similar to the FSA exercise, we will construct our FST objects by reading in transducer files in AT&T format. The format is similar to the one you worked with earlier. The only difference is, instead of a single symbol, now we have two symbols. The following is an example describing a transducer that accepts one or more as, replacing all but the final one with b, and deleting the final a from the output. Note that the symbol @0@ stands for null string, ε.

0       1       a       @0@
0       0       a       b
1

The format is similar to FSA format, except transitions now have four columns, namely, source state, target state, input symbol, and output symbol, separated by tabs.

Implement the read_att() method that reads in such a text file, and constructs the FST transition table.

5. Implement the transduction

Implement the method transduce() that returns (or yields) the list of all output strings for a given input if the input is accepted.

Note that ε-transitions are common in finite-state morphological analyzers, and determinization and epsilon removal is not possible for all FSTs. As a result, we will need to handle epsilon transitions during transduction (or recognition). However, you are not required to detect ε-loops for this assignment, you can assume the transducer does not contain any ε-loops.

Another common practice is the use of multi-character symbols, such as <N> on FST transitions. You should make sure that your transduce() method tokenizes the input string such that the multi-char symbols are handled correctly. This means that multi-char symbols are never treated as multiple symbols.

6. Implement the inversion

Implement the method invert() that inverts the FST in-place. That is, after calling the method, input and outputs symbols on all transitions must be swapped.

7. Implement a command line interface

Implement a command line interface with the following specification:

python3 fst.py -h should print command line help
python3 fst.py analyze fst_file input_file should read in the AT&T-formatted fst_file, and produces all morphological analyses for all words in the input_file (note: you need to invert the original FST). Input file should contain one word per line. The output should list all analyses of a word on single row. For example, for the analyzer.fst created in the first part of the assignment, and a input file
```
cats
fish
```
the output should be

cat<N><PL>
fish<N><SG> fish<N><PL>

python3 fst.py generate fst_file input_file should read in the fst_file and produce all outputs for the words in the input_file. Input and output formats should be the same as analyze above.

For this simple exercise, you can parse the command line arguments yourself by inspecting sys.argv. However, using a standard library, like argparse is recommended.