Thesis Proposal:

B.Sc. Thesis Topics:

      Ask by e-mail for current topics.

 

M.Sc. Thesis Topic:

In cooperation with Peter Fankhauser from IDS Mannheim this topic is offered:

Training and Testing Corpus-Specific Spelling Correction

Optical Character Recognition (OCR) for historical documents is still a challenging task, resulting in a fairly high percentage of misspellings. The noisy channel model (see, e.g., [Jurafsky and Martin 2016, Chapter 5; Norvig 2009]) provides a powerful and flexible framework for spelling correction. It estimates the probability of a spelling correction by combining a channel (error) model with a language model for the target language. The goal of this thesis is to train corpus-specific channel models on example misspellings derived by clustering words according to their word embeddings [Mikolov et al. 2013]. Preliminary experiments [Knappen et al. 2017] show that such clusters group correctly spelled words with their misspelled counterparts, because both occur in very similar usage contexts. In more detail, this thesis comprises the following tasks:

(1) Deriving a corpus-specific channel model based on manually filtered word embedding clusters.

(2) Calculating an appropriate language model for the corpus.

(3) Implementing a noisy channel spelling correction (or adapting an existing implementation).

(4) Testing the accuracy of the approach and comparing it to noisy channels based on generic channel and language models. The Royal Society Corpus shall be used as the test corpus.
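
To make the tasks above more concrete, here is a minimal sketch of a noisy channel corrector in Python. For an observed word x it picks the candidate w that maximises P(x | w) * P(w), where P(x | w) comes from the channel model and P(w) from the language model. Everything specific in it is an assumption made for illustration and not part of the topic description: the function names, the restriction of the channel model to single-character substitutions, the unigram language model, the add-one smoothing, and the invented training pairs and word counts. In the thesis, the pairs would come from the manually filtered word embedding clusters and the counts from the corpus itself.

```python
import math
from collections import Counter


def train_channel(pairs):
    """Count character substitutions (intended -> observed) in equal-length pairs."""
    subs, totals = Counter(), Counter()
    for observed, intended in pairs:
        if len(observed) != len(intended):
            continue  # insertions and deletions are ignored in this sketch
        for o, i in zip(observed, intended):
            subs[(i, o)] += 1
            totals[i] += 1
    return subs, totals


def channel_logprob(observed, intended, subs, totals, alphabet_size=26):
    """log P(observed | intended), per-character and add-one smoothed."""
    return sum(
        math.log((subs[(i, o)] + 1) / (totals[i] + alphabet_size))
        for o, i in zip(observed, intended)
    )


def lm_logprob(word, counts, total):
    """log P(word) under an add-one smoothed unigram language model."""
    return math.log((counts[word] + 1) / (total + len(counts) + 1))


def correct(observed, counts, subs, totals):
    """Return the vocabulary word maximising P(observed | w) * P(w)."""
    total = sum(counts.values())
    # Candidates: same-length vocabulary words differing in at most one character.
    candidates = [
        w for w in counts
        if len(w) == len(observed)
        and sum(a != b for a, b in zip(w, observed)) <= 1
    ]
    if not candidates:
        return observed  # nothing plausible found, keep the word as it is
    return max(
        candidates,
        key=lambda w: channel_logprob(observed, w, subs, totals)
        + lm_logprob(w, counts, total),
    )


# Invented (misspelling, correction) pairs, imitating an OCR confusion of s and f.
pairs = [("fome", "some"), ("prefent", "present"),
         ("fuch", "such"), ("obferved", "observed")]
# Invented unigram counts standing in for corpus frequencies.
counts = Counter({"some": 30, "home": 60, "same": 25, "such": 20, "present": 10})

subs, totals = train_channel(pairs)
print(correct("fome", counts, subs, totals))
```

In this toy setting "home" is more frequent than "some", so a language model alone would prefer it. The trained channel model has seen s misread as f several times and never h misread as f, which is why the combined score favours "some". This is the effect a corpus-specific channel model is expected to have compared with a generic one.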

 

References:

Daniel Jurafsky & James H. Martin. Speech and Language Processing. Draft Nov 2016. URL: https://web.stanford.edu/~jurafsky/slp3/5.pdf

Jörg Knappen, Stefan Fischer, Hannah Kermes, Elke Teich, and Peter Fankhauser. 2017. The Making of the Royal Society Corpus.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Peter Norvig. Natural Language Corpus Data. Chapter 14 in Toby Segaran, Jeff Hammerbacher (eds.) Beautiful Data, O'Reilly Media, June 2009.

 If you have questions, send an e-mail to fankhauser@ids-mannheim.de and dietrich@lsv.uni-saarland.de