Assignment 12: Exercises with dependency treebanks

This is a graded assignment. Follow the link at the private course page to work on your own repository. The deadline for this assignment is Friday, Feb 22 2019, 12:00 CET.

You are strongly recommended to use git properly/productively: commit every ‘unit’ of work (e.g., individual exercises) and bug fixes separately. You do not need to worry about the mistakes in the earlier commits. Your assignment will be evaluated based only on the final commit before the deadline.

In this set of exercises you will work with dependency trees and dependency treebanks. We will not do a full dependency parsing exercise, but work on three relevant smaller exercises. Your knowledge from graph theory and graph algorithms will also be handy for the solutions of these problems.

Although the programs you develop should be general, and can be used with any dependency treebank, in this assignment, we will use Universal Dependencies treebanks, which are freely available from the link above.

Some utilities for handling (reading/writing) CoNLL-U files is provided as conllu.py. You are free to use it for the following exercises. You can also extend it if you find it necessary, or just write your own routines to handle CoNLL-U files from scratch. Do not use other CoNLL-U/UD libraries or code from them for the exercises below.

For all of the exercises, we are interested in primary dependencies (columns 7 and 8). You do not need to pay attention to the extended dependencies (column 9).

1. Statistics on word order

SVO     10  5%
SOV     50  25%
...

Do not forget to check in your dat files (text files with statistics with your choice of languages) as well as your code.

2. Finding and counting non-projective trees

Write another Python program that finds and counts the number of non-projective trees in a treebank in CoNLL-U format. Your program should take a single command-line argument, the name of the CoNLL-U file, and output the number and ratio (percentage) of non-projective trees in the treebank.

Update the texts files you created in Exercise 1, by adding a single line similar to the following example.

NON-PROJ    20  0.5%

You are encouraged to also count the number of crossing dependencies, but not required for this assignment. Detecting a single crossing dependency is enough as counting a sentence as non-projective.

The template for this exercise is in file non-proj.py.

3. Pseudo projectivization

Given a non-projective parser, one way to deal with non-projectivity is to “projectivize” the trees during training, and postprocess the parser output to obtain non-projective trees.

To create a projective tree from a non-projective one, crossing dependencies are modified by moving the head of the dependency one level up until the dependency is not non-projective. The non-projectivity information is typically encoded in the dependency labels. Given a non-projective edge, (w1, l1, w2) where w1 is the head and l1 is the relation type, we replace it with (w0, l0.l1, w2) where w0 is the head of w1 and l0 is the relation label between w1 and w0, that is, there is a dependency (w0, l0, w1).

For example, given the following tree

Leftt headed

we convert it to:

Right headed

Make sure that your tree has a single root (there is only one token with head 0), after the conversion.

The template for this exercise is in file pseudo-proj.py. A few toy non-projective trees are given in non-proj.conllu, which you can use for testing your code.

4. (optional) post processing pseudo-projective trees

Write another Python script to convert a pseudo-projective tree as described in Exercise 3 above back to its non-projective original. For this exercise, you can assume that the combined relation labels (like obj.nmod above) are not ambiguous.

This exercise will not be part of your graded work, but recommended to gain more insight into the problem.