Time & Location: TBD
Instructor: Vagrant Gautam (vgautam@lsv.uni-saarland.de)
Suitable for: Research-oriented (LST and CS) Masters students
NLP papers commonly invoke abstract concepts like “interpretability,” “bias,” “reasoning,” “stereotypes,” and so on. Each subfield has a shared understanding of what these terms mean and how we should treat them, and this shared understanding is the basis on which datasets are built to test for them, metrics are proposed to quantify them, and claims are made about systems. But what exactly do these terms mean? What should they mean? And how should we measure them? These questions are the focus of this seminar on defining and measuring abstract concepts in NLP.
In two-week cycles, we will cover various concepts in NLP research from the selection below. We will read papers that analyze or critique how a given concept is used, and then use this as a lens to read, discuss, and critique two or more recent NLP papers that use the concept. We will also try to reimagine how we would run these projects and write these papers in light of what we have learned.
Course requirements and grading
- Present a concept and lead a discussion about it (I will give a sample of what this should look like) – 35%
- Engage with presentations about other concepts – 30%
- Write a report designing an NLP project involving the concept, either a novel project or a reimagining of one of the papers we discussed in the seminar – 35%
Learning objectives
This course will help you acquire / practise the following skills, among others:
- How to read and critique papers (both interdisciplinary and more conventional NLP papers)
- How to critically evaluate conceptualization (defining an abstract concept) and operationalization (creating empirical measures of that concept), towards answering questions like: What is X, and how do we conceive of it? What should X mean? How do we translate our conception of X into something we can observe and measure? How, concretely, can we measure it with an NLP system, and how do we operationalize that measurement with data, metrics, and gold labels? How do our choices in defining, conceptualizing, and operationalizing X lead to gaps in the claims we make about X? (A minimal sketch after this list illustrates the distinction in code.)
- How to design NLP projects in ways that address critiques and push the discipline forward
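To make the conceptualization/operationalization distinction concrete, here is a minimal, hypothetical Python sketch, not drawn from any of the papers below, that operationalizes “bias” as an accuracy gap between two groups. Every name, label, and number in it is made up for illustration, and every choice it makes, from the metric to the group labels to trusting the gold labels, is exactly the kind of decision this seminar will teach you to question.

```python
# Hypothetical sketch: one way to *operationalize* the abstract concept
# of "bias" as something measurable, here the accuracy gap of a
# classifier between two demographic groups. All data below is made up.

def accuracy(preds, golds):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def accuracy_gap(preds, golds, groups, group_a, group_b):
    """One possible operationalization of 'bias': the difference in
    accuracy between examples belonging to group_a and group_b.

    Note the conceptual commitments baked in: bias is treated as a
    performance disparity, groups are discrete and known in advance,
    and the gold labels are assumed to be unbiased themselves.
    """
    def group_accuracy(group):
        pairs = [(p, g) for p, g, m in zip(preds, golds, groups) if m == group]
        return accuracy([p for p, _ in pairs], [g for _, g in pairs])

    return group_accuracy(group_a) - group_accuracy(group_b)

# Toy usage: binary predictions for eight examples, four per group.
preds  = [1, 0, 1, 1, 0, 0, 1, 0]
golds  = [1, 0, 0, 1, 0, 1, 1, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(accuracy_gap(preds, golds, groups, "A", "B"))  # 0.25
```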
Course logistics
At our first session, we will discuss and decide on concrete details of the course structure, including submission deadlines, specifics of the course requirements, etc. I will present an overview of what this seminar will cover, discuss expectations about presentations and participation, and teach you how to read papers effectively (and quickly). After this session, I will send out a form for students to indicate interest in different concepts. At our second session, I will give you a sample of what leading a discussion should look like. Sessions 3 to N-1 will be student-led discussions. We will have one final session after report submission to share ideas from the reports, discuss feedback, and wrap up the seminar.
General reading
- It Takes Two to Tango: Navigating Conceptualizations of NLP Tasks and Measurements of Performance
- If all you have is a hammer, everything looks like a nail: Operationalization matters
- Reading papers: How to Read a Paper, How to Read a Technical Paper, Ten simple rules for reading a scientific paper, Reading a Technical Paper
List of concepts
We will choose a subset of the following topics based on class size and student interest: robustness, stereotypes, interpretability, explainability, paraphrases, bias, race, emergent abilities, gender, generalization, names, reasoning. Each concept is listed below with a paper that does a critical or conceptual analysis of it (sometimes alongside a survey), followed by other papers that use the concept, which we will discuss and evaluate. I may add further topics, such as understanding, intelligence, accountability, memorization, and trust, depending on interest.
Robustness: Beyond generalization: a theory of robustness in machine learning
- How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?
- Sorting through the noise: Testing robustness of information processing in pre-trained language models
- ROBBIE: Robust Bias Evaluation of Large Generative Language Models
Stereotypes: Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets
- SeeGULL: A Stereotype Benchmark with Broad Geo-Cultural Coverage Leveraging Generative Models
- StereoSet: Measuring stereotypical bias in pretrained language models
- Stereotypes and Smut: The (Mis)representation of Non-cisgender Identities by Text-to-Image Models
Interpretability: The Mythos of Model Interpretability: In machine learning, the concept of interpretability is both important and slippery and Against Interpretability: a Critical Examination of the Interpretability Problem in Machine Learning
- RAVEL: Evaluating Interpretability Methods on Disentangling Language Model Representations
- Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models
Explainability: Explanation in artificial intelligence: Insights from the social sciences
- Attention is not Explanation
- Attention is not not Explanation
- Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals
Paraphrases: What Is a Paraphrase?
- How Large Language Models are Transforming Machine-Paraphrase Plagiarism
- Paraphrase Types for Generation and Detection
- When and how to paraphrase for named entity recognition?
Bias: Language (Technology) is Power: A Critical Survey of “Bias” in NLP
- Don’t Blame the Annotator: Bias Already Starts in the Annotation Instructions
- Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection
- Social Bias Frames: Reasoning about Social and Power Implications of Language
- Intrinsic Bias Metrics Do Not Correlate with Application Bias
Race: A Survey of Race, Racism, and Anti-Racism in NLP
- A Multilingual Dataset of Racial Stereotypes in Social Media Conversational Threads
- The Risk of Racial Bias in Hate Speech Detection
- Evaluation of African American Language Bias in Natural Language Generation
Emergent abilities: Are Emergent Abilities of Large Language Models a Mirage?
- Emergent Abilities of Large Language Models
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
Gender: Theories of “Gender” in NLP Bias Research and Gender as a Variable in Natural-Language Processing: Ethical Considerations
- The Lou Dataset — Exploring the Impact of Gender-Fair Language in German Text Classification
- On Evaluating and Mitigating Gender Biases in Multilingual Settings
Generalization: A taxonomy and review of generalization research in NLP
- Lexical Generalization Improves with Larger Models and Longer Training
- Crosslingual Generalization through Multitask Finetuning
- Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Names: Stop! In the Name of Flaws: Disentangling Personal Names and Sociodemographic Attributes in NLP