Many less-resourced languages have no tagged corpus, or only a small amount of poorly tagged material. In such cases unsupervised learning is a great first step beyond simply picking a random tag or the most frequent form according to some corpus statistics. The way CG is currently combined with the tagger works reasonably well empirically, but it is somewhat unsound with respect to the tagger's theory of operation: words are assigned different ambiguity classes during training than during tagging. For my first task, I will implement new ways of combining them, both during tagging and during tagger training.
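To make the mismatch concrete, here is a minimal sketch (names and readings are hypothetical, not taken from any actual codebase): the tagger learns statistics keyed on ambiguity classes taken from the full morphological analysis, but at tagging time CG has already pruned some readings, so the same word can present a different class than the one trained on.

```python
def ambiguity_class(readings):
    """An ambiguity class is just the set of tags a word can take."""
    return frozenset(readings)

# Hypothetical example word: the analyser gives three readings,
# but CG prunes one of them before the tagger runs.
analyser_readings = ["noun", "verb", "adj"]   # seen during training
cg_pruned_readings = ["noun", "verb"]         # seen during tagging

train_class = ambiguity_class(analyser_readings)
tag_class = ambiguity_class(cg_pruned_readings)

# The two classes are distinct, so the statistics learned for
# train_class are never consulted when tagging encounters tag_class.
print(train_class != tag_class)  # True
```

The pruned class is a strict subset of the training-time class here, which is the typical situation: CG only removes readings, it never adds them.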
For my second task, I will implement an averaged perceptron POS tagger based on the one in the Python module nltk. It will be configurable in terms of word features (i.e. it can be set to look at suffixes or prefixes of different lengths) to accommodate different types of languages. Another requirement is that it work (not necessarily very well) with little or no configuration (which means eventually not relying on the coarse tags of the other taggers).
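The configurable-feature idea could look roughly like the following sketch. The function name and the `suffix_lens`/`prefix_lens` parameters are assumptions for illustration, in the spirit of the feature extraction in nltk's PerceptronTagger rather than its actual interface:

```python
def affix_features(word, suffix_lens=(1, 2, 3), prefix_lens=()):
    """Extract sparse binary features for a perceptron tagger,
    with per-language configurable suffix and prefix lengths."""
    feats = {"bias": 1, "word=" + word.lower(): 1}
    for n in suffix_lens:
        if len(word) >= n:
            feats["suf%d=%s" % (n, word[-n:].lower())] = 1
    for n in prefix_lens:
        if len(word) >= n:
            feats["pre%d=%s" % (n, word[:n].lower())] = 1
    return feats

# A suffix-oriented language might use long suffixes only;
# a prefix-oriented one would enable prefix_lens instead.
feats = affix_features("running", suffix_lens=(2, 3), prefix_lens=(2,))
```

The point of keeping the lengths as parameters is that the same tagger code can be pointed at morphologically different languages by changing configuration alone, which fits the requirement of working with little or no per-language setup.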