CS674: Natural Language Processing
Reaction Essay Readings
The primary readings in this list were chosen with a variety of
criteria in mind, including impact, accessibility to the non-expert, recency,
brevity, and on-line availability. Also, since these papers are meant to
serve as subjects for reaction essays, "provocativeness" was an
additional consideration. It was not always easy to balance all these conditions,
and in some cases potentially unorthodox choices were made. Hence, this
collection should not be regarded as a compilation of major papers, but
rather as a set of starting points for becoming acquainted with some issues
and ideas in natural language processing.
In this vein, the related references are meant simply to give some idea
of the range of work on the same topic. I've omitted papers covered in class.
Instructions: Reaction essays are due on the following Mondays:
February 4, 11, 18, and 25, and March 4. The intent is threefold: to acquaint
you with some important recent papers and subjects in the field, to help
you find a project topic, and to train you in how to read a research paper
both quickly and effectively. Of course, it is expected that you will look
at papers that aren't on the list as well!
Your reaction essay is just that: a short (one or two pages) critical
reading of one of the primary papers
(in red and marked with a filled-in bullet)
from the list below. Briefly (1-2 paragraphs) describe the problem attacked
and the solution proposed, but do not merely summarize the paper. Then
address questions such as: Is the evaluation fair and informative? Are
the underlying assumptions valid? When are the proposed methods applicable?
On the other hand, don't spend an inordinate amount of time on these essays
-- they are meant to be brief, and will be graded on a check-plus/check/check-minus basis.
A note about proper citation: it is not acceptable to copy out
sentences or phrases from papers without citation. If you have not placed
quotation marks around a passage, then you are claiming that these words
are your own. Violating the Code of Academic Integrity can result, and has
resulted, in failing the course. When in doubt, credit your source!
Eugene Charniak, 1997. Statistical
Parsing with a Context-free Grammar and Word Statistics.
Proceedings of the Fourteenth National Conference on Artificial
Intelligence (AAAI-97), pp. 598--603.
Abstract: We describe a parsing system based upon a language model for
English that is, in turn, based upon assigning probabilities to possible
parses for a sentence. This model is used in a parsing system by finding
the parse for the sentence with the highest probability. This system outperforms
previous schemes. As this is the third in a series of parsers by different
authors that are similar enough to invite detailed comparisons but different
enough to give rise to different levels of performance, we also report
on some experiments designed to identify what aspects of these systems
best explain their relative performance.
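The central idea above, scoring each candidate parse with a probability model and returning the argmax, can be sketched in miniature. The grammar, rule probabilities, and parse trees below are invented purely for illustration; Charniak's actual model also conditions on word statistics, which this sketch omits:

```python
# Toy PCFG: hypothetical rule probabilities, invented for illustration.
# A parse tree is a nested tuple (label, child, ...); leaves are words.
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("I",)): 0.2,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("NP", "PP")): 0.2,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V", "NP", "PP")): 0.3,
    ("PP", ("P", "NP")): 1.0,
    ("V", ("saw",)): 1.0,
    ("Det", ("the",)): 1.0,
    ("N", ("man",)): 0.5,
    ("N", ("telescope",)): 0.5,
    ("P", ("with",)): 1.0,
}

def parse_prob(tree):
    """P(parse) = product of the probabilities of the rules it uses."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        if isinstance(child, tuple):
            p *= parse_prob(child)
    return p

def best_parse(candidates):
    """The parsing decision: return the candidate with highest probability."""
    return max(candidates, key=parse_prob)

# Two parses of "I saw the man with the telescope": the PP attached to
# the verb phrase vs. to the noun phrase.
np_man = ("NP", ("Det", "the"), ("N", "man"))
pp = ("PP", ("P", "with"), ("NP", ("Det", "the"), ("N", "telescope")))
vp_attach = ("S", ("NP", "I"), ("VP", ("V", "saw"), np_man, pp))
np_attach = ("S", ("NP", "I"), ("VP", ("V", "saw"), ("NP", np_man, pp)))
```

Under these made-up probabilities, best_parse([vp_attach, np_attach]) selects the verb attachment; a real system must also generate the candidate parses and estimate rule probabilities from a treebank.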
Michael Collins, 1997. Three
generative, lexicalised models for statistical parsing
35th Annual Meeting of the Association for Computational Linguistics
and 8th Conference of the European Chapter of the Association for Computational
Linguistics: Proceedings of the Conference (ACL/EACL '97), pp 16-23.
Joshua Goodman, 1996. Parsing
algorithms and metrics.
Proceedings of the 34th ACL, pp. 177--183.
Eugene Charniak, 2000. A
Maximum-Entropy-Inspired Parser.
Proceedings of the First Meeting of the North American Chapter of
the Association for Computational Linguistics (NAACL 2000), pp. 132--139.
See also Technical
Report CS99-12, Department of Computer Science, Brown University.
Adwait Ratnaparkhi, 1999. Learning
to Parse Natural Language with Maximum Entropy Models.
Machine Learning 34 (1/2/3), pp. 151--175.
Michael Collins, 2000.
Discriminative Reranking for Natural Language Parsing.
Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000).
Mark Johnson, 2001. Joint
and conditional estimation of tagging and parsing models.
Proceedings of the 39th Annual Meeting of the Association for Computational
Linguistics/10th Conference of the European Chapter (ACL-2001).
David Chiang, 2000. Statistical
parsing with an automatically-extracted tree adjoining grammar.
Proceedings of the 38th Annual Meeting of the Association for Computational
Linguistics (ACL 2000), pp 456--463.
Abstract: We discuss the advantages of lexicalized tree-adjoining grammar
as an alternative to lexicalized PCFG for statistical parsing, describing
the induction of a probabilistic LTAG model from the Penn Treebank and
evaluating its parsing performance. We find that this induction method
is an improvement over the EM-based method of (Hwa, 1998), and that the
induced model yields results comparable to lexicalized PCFG.
- Chiang's home page
- Related references:
Yves Schabes and Richard C. Waters, 1995. Tree
Insertion Grammar: A Cubic-Time Parsable Formalism That Lexicalizes Context-Free
Grammar without Changing the Trees Produced.
Computational Linguistics 21, pp. 479--513.
Note: the version linked here is to MERL Technical Report TR94-13.
Rebecca Hwa, 1998. An
Empirical Evaluation of Probabilistic Lexicalized Tree Insertion Grammars.
Proceedings of ACL/COLING 1998, pp 557--563.
Fei Xia, Martha Palmer, and Aravind
Joshi, 2000. A
Uniform Method of Grammar Extraction and Its Applications.
Proceedings of Empirical Methods in Natural Language Processing
(EMNLP-2000), pp. 53--62.
Owen Rambow, K.
Vijay-Shanker, and David
Weir, 2001. D-Tree
Substitution Grammars.
Computational Linguistics 27(1), pp. 87--122.
The U. Penn XTAG system
Jong C. Park, 1995. Quantifier
Scope and Constituency.
Proceedings of ACL '95, pp. 205--212.
Abstract: Traditional approaches to quantifier scope typically need stipulation
to exclude readings that are unavailable to human understanders. This paper
shows that quantifier scope phenomena can be precisely characterized by
a semantic representation constrained by surface constituency, if the distinction
between referential and quantificational NPs is properly observed. A CCG
implementation is described and compared to other approaches.
- Park's home page
Jerry R. Hobbs and Andrew Kehler, 1997. A
theory of parallelism and the case of VP ellipsis.
35th Annual Meeting of the Association for Computational Linguistics
and 8th Conference of the European Chapter of the Association for Computational
Linguistics: Proceedings of the Conference, pp. 394--401.
Abstract: We provide a general account of parallelism in discourse, and
apply it to the special case of resolving possible readings for instances
of VP ellipsis. We show how several problematic examples are accounted
for in a natural and straightforward fashion. The generality of the approach
makes it directly applicable to a variety of other types of ellipsis.
- Author home pages: Hobbs
home page, Kehler home page
Marilyn Walker, 1996. Limited
Attention and Discourse Structure
Computational Linguistics, 22(2), pp. 255--264.
Abstract: This squib examines the role of limited attention in a theory
of discourse structure and proposes a model of attentional state that relates
current hierarchical theories of discourse structure to empirical evidence
about human discourse processing capabilities. First, I present examples
that are not predicted by Grosz and Sidner's stack model of attentional
state. Then I consider an alternative model of attentional state, the cache
model, which accounts for the examples, and which makes particular processing
predictions. Finally I suggest a number of ways that future research could
distinguish the predictions of the cache model and the stack model.
- Walker home page
Michael Strube, 1998. Never
Look Back: An Alternative to Centering.
Proceedings of COLING-ACL '98, pp. 1251--1257.
Abstract: I propose a model for determining the hearer's attentional state
which depends solely on a list of salient discourse entities (S-list).
The ordering among the elements of the S-list covers also the function
of the backward-looking center in the centering model. The ranking criteria
for the S-list are based on the distinction between hearer-old and hearer-new
discourse entities and incorporate preferences for inter- and intra-sentential
anaphora. The model is the basis for an algorithm which operates incrementally,
word by word.
- Strube homepage
Barbara Grosz and Candace
Sidner. 1998. Lost
Intuitions and Forgotten Intentions. In Centering Theory in Discourse,
Marilyn Walker, Aravind Joshi and Ellen Prince, eds., pp. 39--51.
Marilyn A. Walker, 1998. Centering,
anaphora resolution, and discourse structure.
In Marilyn A. Walker, Aravind K. Joshi, and Ellen F. Prince, editors,
Centering Theory in Discourse.
Eleni Miltsakaki and Karen Kukich, 2000. The
Role of Centering Theory's Rough-Shift in the Teaching and Evaluation of
Writing Skills.
Proceedings of the 38th ACL, pp. 408--415.
Ciprian Chelba and Frederick Jelinek, 1998. Exploiting
syntactic structure for language modeling.
COLING-ACL '98 36th Annual Meeting of the Association for Computational
Linguistics and 17th International Conference on Computational Linguistics:
Proceedings of the Conference, Vol. 1, pp. 225--231.
Abstract: The paper presents a language model that develops syntactic structure
and uses it to extract meaningful information from the word history, thus
enabling the use of long distance dependencies. The model assigns probability
to every joint sequence of words-binary-parse-structure with headword annotation
and operates in a left-to-right manner, and is therefore usable for automatic
speech recognition. The model, its probabilistic parameterization, and
a set of experiments meant to evaluate its predictive power are presented;
an improvement over standard trigram modeling is achieved.
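The trigram baseline that the structured model is compared against conditions each word only on the two preceding words. A minimal maximum-likelihood version, with a made-up two-sentence corpus and none of the smoothing a usable model needs:

```python
from collections import Counter

def train_trigram(sentences):
    """Collect trigram and bigram counts, with sentence-boundary padding."""
    tri, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>", "<s>"] + s + ["</s>"]
        for u, v, w in zip(toks, toks[1:], toks[2:]):
            tri[(u, v, w)] += 1
            bi[(u, v)] += 1
    return tri, bi

def sentence_prob(s, tri, bi):
    """P(sentence) as a product of maximum-likelihood trigram probabilities."""
    toks = ["<s>", "<s>"] + s + ["</s>"]
    p = 1.0
    for u, v, w in zip(toks, toks[1:], toks[2:]):
        p *= tri[(u, v, w)] / bi[(u, v)]  # zero for any unseen trigram
    return p

# Invented toy corpus.
tri, bi = train_trigram([["the", "dog", "barks"], ["the", "dog", "sleeps"]])
```

Here sentence_prob(["the", "dog", "barks"], tri, bi) is 0.5, since the two training sentences disagree only on the third word. The structured language model replaces the fixed two-word history (u, v) with headwords exposed by a partial parse, which is how it reaches longer-distance dependencies.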
Stanley F. Chen and Joshua
Goodman, 1996. An
empirical study of smoothing techniques for language modeling
Proceedings of the 34th Meeting of the Association for Computational
Linguistics, pp 310--318.
TR version: TR-10-98,
Computer Science Group, Harvard University, 1998.
Ronald Rosenfeld, 2000. Two
Decades of Statistical Language Modeling: Where Do We Go From Here?
Proceedings of the IEEE, 88(8).
Ciprian Chelba and Milind Mahajan, 2001. Information
Extraction Using the Structured Language Model.
Proceedings of the 2001 Conference on Empirical Methods in Natural
Language Processing, Lillian Lee and Donna Harman, eds., pp. 74--81.
Lillian Lee and Fernando Pereira, 1999. Distributional
similarity models: Clustering vs. nearest neighbors.
Proceedings of the Thirty-Seventh Annual Meeting of the Association
for Computational Linguistics (ACL'99), pp 23-40.
Abstract: Distributional similarity is a useful notion in estimating the
probabilities of rare joint events. It has been employed both to cluster
events according to their distributions, and to directly compute averages
of estimates for distributional neighbors of a target event. Here, we examine
the tradeoffs between model size and prediction accuracy for cluster-based
and nearest neighbors distributional models of unseen events.
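To make the notion concrete: one common instance of distributional similarity compares, say, verbs by the distributions of objects they occur with. The sketch below uses Jensen-Shannon divergence on an invented handful of (verb, object) pairs; the paper itself studies several divergence-based measures and far larger models:

```python
from collections import Counter, defaultdict
from math import log2

def cond_dists(pairs):
    """Estimate P(object | verb) from (verb, object) co-occurrence pairs."""
    counts = defaultdict(Counter)
    for verb, obj in pairs:
        counts[verb][obj] += 1
    return {v: {o: n / sum(c.values()) for o, n in c.items()}
            for v, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence between two sparse distributions (dicts)."""
    d = 0.0
    for k in set(p) | set(q):
        pk, qk = p.get(k, 0.0), q.get(k, 0.0)
        m = (pk + qk) / 2
        if pk:
            d += 0.5 * pk * log2(pk / m)
        if qk:
            d += 0.5 * qk * log2(qk / m)
    return d

def nearest_neighbors(target, dists, k=2):
    """Rank the other events by distributional similarity to the target."""
    others = [v for v in dists if v != target]
    return sorted(others,
                  key=lambda v: js_divergence(dists[target], dists[v]))[:k]

# Invented co-occurrence data.
dists = cond_dists([
    ("drink", "water"), ("drink", "beer"), ("drink", "coffee"),
    ("gulp", "water"), ("gulp", "beer"),
    ("sip", "coffee"), ("sip", "tea"),
    ("eat", "bread"), ("eat", "meat"),
])
```

On this toy data, "gulp" comes out as the nearest neighbor of "drink" (shared objects), while "eat" is maximally distant (disjoint objects); estimates for a rare event can then be borrowed from its neighbors or its cluster.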
Hinrich Schütze, 1992. Dimensions of Meaning.
Proceedings of Supercomputing '92, pp. 787--796.
An HTML version is also available.
Fernando Pereira, Naftali
Tishby, and Lillian Lee, 1993. Distributional
Clustering of English Words
Proceedings of the 31st ACL, pp 183-90.
Thomas Hofmann and Jan Puzicha, 1998. Statistical
Models for Co-occurrence Data.
AI Memo 1625, CBCL Memo 159, MIT.
Dekang Lin, 1998. Automatic
Retrieval and Clustering of Similar Words.
COLING-ACL98, pp. 768--773.
Mats Rooth, Stefan
Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil, 1999. Inducing
a Semantically Annotated Lexicon via EM-Based Clustering.
Proceedings of ACL '99, pp. 104--111.
Naftali Z. Tishby, Fernando
Pereira, and William Bialek, 1999. The
Information Bottleneck Method.
Proceedings of the 37th Allerton Conference on Communication, Control, and Computing.
David Yarowsky and Grace Ngai, 2001. Inducing
Multilingual POS Taggers and NP Bracketers via Robust Projection across
Aligned Corpora.
Proceedings of NAACL 2001, pp. 200--207.
Abstract: This paper investigates the potential for projecting linguistic
annotations including part-of-speech tags and base noun phrase bracketings
from one language to another via automatically word-aligned parallel corpora.
First, experiments assess the accuracy of unmodified direct transfer of
tags and brackets from the source language English to the target languages
French and Chinese, both for noisy machine-aligned sentences and for clean
hand-aligned sentences. Performance is then substantially boosted over
both of these baselines by using training techniques optimized for very
noisy data, yielding 94-96% core French part-of-speech tag accuracy and
90% French bracketing F-measure for stand-alone monolingual tools trained
without the need for any human-annotated data in the given language.
Kenji Yamada and Kevin Knight, 2001. A
Syntax-Based Statistical Translation Model.
Proceedings of ACL 2001, pp. 523--530.
Abstract: We present a syntax-based statistical translation model.
Our model transforms a source-language parse tree into a target-language
string by applying stochastic operations at each node. These operations
capture linguistic differences such as word order and case marking. Model
parameters are estimated in polynomial time using an EM algorithm. The
model produces word alignments that are better than those produced by IBM
Model 5.
For further reading (but not a reaction essay): Bonnie Dorr, Pamela Jordan,
and J. Benoit, 1999, A
Survey of Current Paradigms in Machine Translation
Advances in Computers, edited by Marvin V. Zelkowitz, Vol. 49.
David Yarowsky, 1995. Unsupervised
word sense disambiguation rivaling supervised methods
Proceedings of the 33rd Annual Meeting of the Association for Computational
Linguistics, pp 189--196.
Abstract: This paper presents an unsupervised learning algorithm for sense
disambiguation that, when trained on unannotated English text, rivals the
performance of supervised techniques that require time-consuming hand annotations.
The algorithm is based on two powerful constraints -- that words tend to
have one sense per discourse and one sense per collocation -- exploited
in an iterative bootstrapping procedure. Tested accuracy exceeds 96%.
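The bootstrapping loop can be caricatured in a few lines. Everything below -- the contexts, the seed collocations, the adoption rule -- is invented for illustration; Yarowsky's actual algorithm ranks collocations by log-likelihood in a decision list and additionally exploits the one-sense-per-discourse constraint:

```python
def bootstrap(contexts, seeds, iterations=5):
    """Grow a collocation -> sense map from a few seed collocations.

    contexts: one set of context words per occurrence of the ambiguous word.
    seeds: initial collocation words with known senses.
    """
    colloc = dict(seeds)
    for _ in range(iterations):
        # Label each occurrence whose known collocations agree on one sense.
        labels = []
        for ctx in contexts:
            senses = {colloc[w] for w in ctx if w in colloc}
            labels.append(next(iter(senses)) if len(senses) == 1 else None)
        # Adopt every word that co-occurs with exactly one sense
        # ("one sense per collocation") as a new collocation for that sense.
        votes = {}
        for ctx, sense in zip(contexts, labels):
            if sense is not None:
                for w in ctx:
                    votes.setdefault(w, set()).add(sense)
        colloc = dict(seeds)
        colloc.update({w: next(iter(s))
                       for w, s in votes.items() if len(s) == 1})
    return colloc

# Invented contexts for the ambiguous word "plant".
contexts = [
    {"factory", "workers"},
    {"workers", "union"},
    {"leaf", "green"},
    {"green", "growth"},
]
learned = bootstrap(contexts,
                    seeds={"factory": "industrial", "leaf": "botanical"})
```

After a couple of iterations, the contexts containing no seed word ({"workers", "union"} and {"green", "growth"}) have been labeled via collocations adopted along the way, which is the essence of the bootstrapping step.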
Dan Roth, 1998. Learning
to Resolve Natural Language Ambiguities: A Unified Approach
Proceedings of AAAI '98.
Abstract: We analyze a few of the commonly used statistics based and machine
learning algorithms for natural language disambiguation tasks and observe
that they can be re-cast as learning linear separators in the feature space.
Each of the methods makes a priori assumptions, which it employs, given
the data, when searching for its hypothesis. Nevertheless, as we show,
it searches a space that is as rich as the space of all linear separators.
We use this to build an argument for a data driven approach which merely
searches for a good linear separator in the feature space, without
further assumptions on the domain or a specific problem.
We present such an approach - a sparse network of linear separators,
utilizing the Winnow learning algorithm - and show how to use it in a variety
of ambiguity resolution problems. The learning approach presented is attribute-efficient
and, therefore, appropriate for domains having very large number of attributes.
In particular, we present an extensive experimental comparison of our
approach with other methods on several well studied lexical disambiguation
tasks such as context-sensitive spelling correction, prepositional phrase
attachment and part of speech tagging. In all cases we show that our approach
either outperforms other methods tried for these tasks or performs comparably
to the best.
Steven Abney and Marc Light, 1999. Hiding
a Semantic Class Hierarchy in a Markov Model
Proceedings of the ACL '99 Workshop on Unsupervised Learning in
Natural Language Processing, pp. 1--8.
Abstract: We introduce a new model of selectional preference induction.
Unlike previous approaches, we provide a stochastic generation model for
the words that appear as arguments of a predicate. More specifically, we
define a hidden Markov model with the general shape of a given semantic
class hierarchy. This model has a number of attractive features, among
them that selectional preference can be seen as distributions over words.
Initial results are promising. However, unsupervised parameter estimation
has proven problematic. A central problem is word sense ambiguity in the
training corpora. We describe attempts to modify the forward-backward algorithm,
an EM algorithm, to handle such disambiguation. Although these attempts
were unsuccessful at improving performance, we believe they give insight
into the nature of the bottlenecks and into the behavior of the EM algorithm.
Michele Banko and Eric Brill, 2001. Scaling
to Very Very Large Corpora for Natural Language Disambiguation
Proceedings of ACL 2001, pp. 26--33.
Abstract: The amount of readily available on-line text has reached hundreds
of billions of words and continues to grow. Yet for most core natural language
tasks, algorithms continue to be optimized, tested and compared after training
on corpora consisting of only one million words or less. In this paper,
we evaluate the performance of different learning methods on a prototypical
natural language disambiguation task, confusion set disambiguation, when
trained on orders of magnitude more labeled data than has previously been
used. We are fortunate that for this particular application, correctly
labeled training data is free. Since this will often not be the case, we
examine methods for effectively exploiting very large corpora when labeled
data comes at a cost.
CS674, Spring '02