Prototype-Based Relevance Learning for Genre Classification

Bachelor's Thesis by Jan Gasthaus

Submitted September 2007


This bachelor's thesis evaluates several prototype-based supervised learning algorithms from the Learning Vector Quantization (LVQ) family with respect to their suitability for text classification tasks in computational linguistics. The algorithms under investigation are LVQ [Kohonen, 1986], GLVQ [Sato and Yamada, 1996], and SNG [Hammer et al., 2005], as well as their relevance-learning extensions GRLVQ [Hammer and Villmann, 2002] and SRNG [Hammer et al., 2005]. Genre classification in the British National Corpus serves as the benchmark text classification problem.

The algorithms are evaluated on three distinct genre classification tasks, each combined with two different feature sets. Performance is analyzed across different parameter settings and model complexities, and the influence of the parameters on classification accuracy and learning behavior is examined. Performance is also compared to that of the well-known support vector machine (SVM) classifier. In addition, a qualitative analysis of the additional information provided by these algorithms (relevance information, interpretable prototypes) is performed.

The algorithms are found to achieve high accuracies, comparable to those of the SVM classifier, on all classification tasks and data sets.