Grzegorz Chrupała
Saarland University
FR 7.4 Spoken Language Systems
Building C7 1, Room 0.04
P.O.B. 151150
66041 Saarbrücken
+49 681 302 58126
gchrupala@lsv.uni-saarland.de
I am a Postdoctoral Researcher in the Spoken Language Systems group at Saarland University. My current research focuses on using data-driven language processing methods to improve Question Answering.
Teaching
- Introduction to classification and sequence labeling at IRTG Annual Meeting 2009. Slides
- Two-day tutorial on Machine Learning at Dublin City University, 18-19 March 2009. Slides
- Pattern and Speech recognition, winter semester 2009/2010
- Statistical natural language processing, summer semester 2009
- Pattern and Speech recognition, winter semester 2008/2009
- Statistical natural language processing, summer semester 2008
Software
- Morfette: a tool for supervised learning of inflectional morphology.
Recent publications
- Grzegorz Chrupała and Dietrich Klakow. A Named Entity Labeler for German: exploiting
Wikipedia and distributional clusters. To appear in LREC 2010
- Michael Wiegand, Saeedeh Momtazi, Stefan Kazalski, Fang Xu, Grzegorz Chrupała and Dietrich Klakow. 2008. The Alyssa System at TAC QA 2008. In Proceedings of TAC 2008. PDF.
We present the Alyssa QA system which
participated in the TAC 2008 Question
Answering Track. The system consists of
two parallel streams: the blogger stream
which is used in order to deal with
questions which ask for lists of blog authors,
and the main stream which processes other
opinion questions. We also use a named
entity detection component specialized to
the entertainment domain. Evaluation
results show that our system exhibits
systematically better performance on blogger
questions than on other rigid questions.
- Grzegorz Chrupała. 2008. Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing. PhD dissertation, Dublin City University. PDF Single-spaced PDF
Data-driven grammar induction aims at producing wide-coverage grammars of human languages. Initial efforts in this field produced relatively shallow linguistic representations such as phrase-structure trees, which only encode constituent structure. Recent work on inducing deep grammars from treebanks addresses this shortcoming by also recovering non-local dependencies and grammatical relations. My aim is to investigate the issues arising when adapting an existing Lexical Functional Grammar (LFG) induction method to a new language and treebank, and find solutions which will generalize robustly across multiple languages.
The research hypothesis is that by exploiting machine-learning algorithms to learn morphological features, lemmatization classes and grammatical functions from treebanks we can reduce the amount of manual specification and improve robustness, accuracy and domain- and language-independence for LFG parsing systems.
Function labels can often be relatively straightforwardly mapped to LFG grammatical functions. Learning them reliably permits grammar induction to depend less on language-specific LFG annotation rules. I therefore propose ways to improve acquisition of function labels from treebanks and translate those improvements into better-quality f-structure parsing.
In a lexicalized grammatical formalism such as LFG a large amount of syntactically relevant information comes from lexical entries. It is, therefore, important to be able to perform morphological analysis in an accurate and robust way for morphologically rich languages. I propose a fully data-driven supervised method to simultaneously lemmatize and morphologically analyze text and obtain competitive or improved results on a range of typologically diverse languages.
- Grzegorz Chrupała, Georgiana Dinu and Josef van
Genabith. 2008. Learning Morphology with Morfette. In Proceedings of LREC
2008. PDF
Morfette is a modular, data-driven, probabilistic system which learns
to perform joint morphological tagging and lemmatization from
morphologically annotated corpora. The system is composed of two
learning modules which are trained to predict morphological tags and
lemmas using the Maximum Entropy classifier. The third module
dynamically combines the predictions of the Maximum-Entropy models and
outputs a probability distribution over tag-lemma pair sequences. The
lemmatization module exploits the idea of recasting lemmatization as a
classification task by using class labels which encode mappings from
wordforms to lemmas.
Experimental evaluation results and error analysis on three
morphologically rich languages show that the system achieves high
accuracy with no language-specific feature engineering or additional
resources.
- Grzegorz Chrupała and Josef van Genabith. 2007. Using very large
corpora to detect raising and control verbs. In Proceedings
of LFG07. PDF
The distinction between raising and subject-control verbs, although
crucial for the construction of semantics, is not easy to make given
access to only the local syntactic configuration of the sentence. In
most contexts raising verbs and control verbs display identical
superficial syntactic structure.
Linguists apply grammaticality tests to distinguish these verb
classes. Our idea is to learn to predict the raising-control
distinction by simulating such grammaticality judgments by means of
pattern searches.
Experiments with regression tree models show that pattern
counts from large unannotated corpora can be used to assess how likely
a verb form is to appear in raising vs. control constructions. For
this task it is beneficial to use the much larger but also noisier Web
corpus rather than the smaller and cleaner Gigaword corpus.
A similar methodology can be useful for detecting other lexical
semantic distinctions: it could be used whenever a test employed to
make linguistically interesting distinctions can be reduced to a
pattern search in an unannotated corpus.
- Grzegorz Chrupała, Nicolas Stroppa, Josef van Genabith and
Georgiana Dinu. 2007. Better Training for Function
Labeling. In Proceedings of RANLP 2007, Borovets,
Bulgaria. PDF
Function labels enrich constituency parse tree nodes with information
about their abstract syntactic and semantic roles. A common way to
obtain function-labeled trees is to use a two-stage architecture where
first a statistical parser produces the constituent structure and then
a second component such as a classifier adds the missing function
tags.
In order to achieve optimal results, training examples for
machine-learning-based classifiers should be as similar as possible to
the instances seen during prediction. However, the method which has
been used so far to obtain training examples for the function labeling
classifier suffers from a serious drawback: the training examples come
from perfect treebank trees, whereas test examples are derived from
parser-produced, imperfect trees.
We show that extracting training instances from the reparsed training
part of the treebank results in better training material as measured
by similarity to test instances. We show that our training method
achieves statistically significantly higher f-scores on the function
labeling task for the English Penn Treebank. Currently our method
achieves 91.47% f-score on section 23 of the WSJ, the highest score
reported in the literature so far.
- Grzegorz Chrupała. 2006. Simple Data-Driven Context-Sensitive
Lemmatization. In Proceedings of SEPLN 2006, Zaragoza, Spain. PDF
Lemmatization for languages with rich inflectional morphology is one of the
basic, indispensable steps in a language processing pipeline. In this paper
we present a simple data-driven context-sensitive approach to lemmatizing word forms in
running text. We treat lemmatization as a classification task for Machine Learning, and
automatically induce class labels. We achieve this by computing a Shortest Edit Script (SES)
between reversed input and output strings. An SES describes the transformations that have to
be applied to the input string (word form) in order to convert it to the output string
(lemma). Our approach shows competitive performance on a range of typologically different
languages.
- Grzegorz Chrupała and Josef van Genabith. 2006. Using
Machine-Learning to Assign Function Labels to Parser Output for
Spanish. In Proceedings of the COLING/ACL 2006 Main Conference
Poster Sessions. PDF
Data-driven grammatical function tag assignment has been studied for English using the Penn-II Treebank data. In this paper we address the question of whether such methods can be applied successfully to other languages and treebank resources. In addition to tag assignment accuracy and f-scores we also present results of a task-based evaluation.
We use three machine-learning methods to assign Cast3LB function tags to sentences parsed with Bikel's parser trained on the Cast3LB treebank. The best performing method, SVM, achieves an f-score of 86.87% on gold-standard trees and 66.67% on parser output - a statistically significant improvement of 6.74% over the baseline. In a task-based evaluation we generate LFG functional-structures from the function-tag-enriched trees. On this task we achieve an f-score of 75.67%, a statistically significant 3.4% improvement over the baseline.
- Grzegorz Chrupała and Josef van Genabith. 2006. Improving
Treebank-Based Automatic LFG Induction for Spanish. In
Proceedings of the LFG06 Conference. PDF
We describe several improvements to the method of treebank-based LFG
induction for Spanish from the Cast3LB treebank [10]. We discuss the
different categories of problems encountered and present the solutions
adopted. Some of the problems involve a simple adoption of existing
linguistic analyses, as in our treatment of clitic doubling and null
subjects. In other cases there is no standard LFG account for the
phenomenon we wish to model and we adopt a compromise, conservative
solution. This is exemplified by our treatment of Spanish periphrastic
constructions. In yet another case, the less configurational nature of
Spanish means that the LFG annotation algorithm has to rely mostly on
Cast3LB function tags, and consequently a reliable method of adding
those tags to parse trees had to be developed. This method achieves
over 6% improvement over the baseline for the Cast3LB-function-tag
assignment task, and over 3% improvement over the baseline for LFG
f-structure construction from function-tag-enriched trees.
- Xavier Carreras, Lluís Màrquez and Grzegorz Chrupała. 2004. Hierarchical Recognition of Propositional Arguments with Perceptrons. In Proceedings of HLT-NAACL 2004 Workshop: Eighth Conference on Computational Natural Language Learning (CoNLL-2004). PDF
We describe a system for the CoNLL-2004 Shared Task
on Semantic Role Labeling (Carreras and Marquez, 2004a).
The system implements a two-layer learning architecture to
recognize arguments in a sentence and predict the role they
play in the propositions. The exploration strategy visits
possible arguments bottom-up, navigating through the clause
hierarchy. The learning components in the architecture are
implemented as Perceptrons,
and are trained simultaneously online, adapting their
behavior to the global target of the system. The learning
algorithm follows the global strategy introduced in
(Collins, 2002) and adapted in (Carreras and Marquez, 2004b)
for partial parsing tasks.
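The SEPLN 2006 abstract above describes inducing lemmatization class labels from a Shortest Edit Script (SES) computed between reversed word form and lemma. The following is a minimal sketch of that idea, not the code from the paper or from Morfette. It assumes an insert/delete-only script derived from a longest common subsequence, and the function names (edit_script, apply_script, lemma_class, lemmatize) are hypothetical. The encoding actually used in the paper may differ.

```python
def edit_script(src: str, dst: str) -> tuple:
    """Shortest insert/delete script turning src into dst (assumed encoding).

    Each operation is (kind, position_in_src, character), where kind is
    "I" (insert before that position) or "D" (delete at that position).
    """
    n, m = len(src), len(dst)
    # lcs[i][j] = length of the longest common subsequence of src[i:] and dst[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if src[i] == dst[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    ops = []
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and src[i] == dst[j]:
            i, j = i + 1, j + 1          # characters match: keep
        elif j < m and (i == n or lcs[i][j + 1] >= lcs[i + 1][j]):
            ops.append(("I", i, dst[j]))  # insert dst[j] before src[i]
            j += 1
        else:
            ops.append(("D", i, src[i]))  # delete src[i]
            i += 1
    return tuple(ops)


def apply_script(src: str, ops: tuple) -> str:
    """Apply a script from edit_script; raise ValueError if it does not fit src."""
    inserts, deletes = {}, {}
    for kind, pos, ch in ops:
        if kind == "I":
            if pos > len(src):
                raise ValueError("insert position outside the string")
            inserts.setdefault(pos, []).append(ch)
        else:
            if pos >= len(src) or src[pos] != ch:
                raise ValueError("delete does not match the string")
            deletes[pos] = ch
    out = []
    for i in range(len(src) + 1):
        out.extend(inserts.get(i, []))    # insertions come before src[i]
        if i < len(src) and i not in deletes:
            out.append(src[i])
    return "".join(out)


def lemma_class(form: str, lemma: str) -> tuple:
    """Class label: the edit script between the reversed form and reversed lemma."""
    return edit_script(form[::-1], lemma[::-1])


def lemmatize(form: str, label: tuple) -> str:
    """Apply a class label to a word form and return the resulting lemma."""
    return apply_script(form[::-1], label)[::-1]


label = lemma_class("cantamos", "cantar")
print(label)
print(lemmatize("hablamos", label))
```

Because the strings are reversed, positions in the script count from the end of the word, so a suffix change learned from one word form can apply to forms of a different length. A classifier would then predict such a label for each word form in context.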