Alistair receives a text message from his friend Beatrice, who is very excited because she
has received word of a happy event. She texts:
i talked to my dad’s second cousin in toronto on good friday and he was just good newsed!
Consider each of the following linguistic concepts. For each one, briefly explain what it is,
and decide whether the sentence illustrates it, and if so, explain how. You do not need to
start each answer on a new page. (We’re indicating this by using roman numerals instead of letters.)
(i) Constituents
(ii) Multi-word units/collocations
(iii) Coreference
(iv) Syntactic ambiguity
(v) Number agreement
(vi) “Infinite capacity from finite means”
Consider the following two sentences:
(S1) the people eat roasted peanuts in the morning
(S2) the peanuts eat roasted people in the morning
and consider this probabilistic context-free grammar in Chomsky normal form, where the number following each rule A → y denotes Pr(A → y | A):

S → NP VP 1.0
VP → V NP 0.7
VP → VP PP 0.3
NP → Det N 0.6
NP → Adj N 0.4
PP → P NP 1.0
N → morning 0.5
N → people 0.3
N → peanuts 0.2
V → eat 0.7
V → buy 0.3
Adj → roasted 0.4
Adj → toasted 0.5
Adj → little 0.1
Det → the 0.5
Det → a 0.5
P → in 0.7
Recall that the Viterbi CKY algorithm is a version of the CKY algorithm that finds the highest-probability parse for a sentence S given a probabilistic CFG like this one. Let’s call the probability of that parse viterbi(S). Would viterbi(S1) = viterbi(S2)? Provide a brief, clear explanation of your answer. Note that actually computing viterbi(S1) and viterbi(S2) using Viterbi CKY is optional and not necessary to construct a full-credit response; conversely, merely computing viterbi(S1) and viterbi(S2) and comparing them is not a full-credit answer.
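For reference (and not as a substitute for the explanation the question asks for), a minimal Viterbi CKY sketch might look like the following; the dictionary encoding of the grammar below is a hypothetical implementation choice, not part of the question.

```python
from collections import defaultdict

# Hypothetical encoding of the PCFG above (CNF: binary rules + lexical rules).
lexical = {            # (A, word): Pr(A -> word | A)
    ("N", "morning"): 0.5, ("N", "people"): 0.3, ("N", "peanuts"): 0.2,
    ("V", "eat"): 0.7, ("V", "buy"): 0.3,
    ("Adj", "roasted"): 0.4, ("Adj", "toasted"): 0.5, ("Adj", "little"): 0.1,
    ("Det", "the"): 0.5, ("Det", "a"): 0.5,
    ("P", "in"): 0.7,
}
binary = {             # (A, B, C): Pr(A -> B C | A)
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7, ("VP", "VP", "PP"): 0.3,
    ("NP", "Det", "N"): 0.6, ("NP", "Adj", "N"): 0.4,
    ("PP", "P", "NP"): 1.0,
}

def viterbi_cky(words):
    """Return the probability of the best parse rooted in S, or 0.0 if none."""
    n = len(words)
    # best[(i, j, A)] = max probability of nonterminal A spanning words[i:j]
    best = defaultdict(float)
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                best[(i, i + 1, A)] = max(best[(i, i + 1, A)], p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (A, B, C), p in binary.items():
                    q = p * best[(i, k, B)] * best[(k, j, C)]
                    if q > best[(i, j, A)]:
                        best[(i, j, A)] = q
    return best[(0, n, "S")]

viterbi_cky("the people eat roasted peanuts in the morning".split())
```

Running it on S1 and S2 and comparing the two numbers is, per the prompt, optional and not itself a full-credit answer.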
(b) Adam and Betty are building an app for the Amazon Echo. Adam is using the CKY algorithm to handle the
analysis of input sentences using a non-probabilistic context-free grammar. (Naturally the grammar is in Chomsky
normal form.) Betty is writing a semantic interpreter that takes resulting parse trees as input and translates them into
meaning representations that get used in the app.
They’ve decided that for every sentence, Adam’s code should pass Betty’s interpreter a list of all the valid parses for
that input sentence, instead of just one parse tree for it, because without a probabilistic grammar Adam doesn’t really
have a good way of identifying which parse is best, if there’s more than one.
Caitlin is Adam and Betty’s boss. She’s asked Adam and Betty how much time it will take for the app to process an n-
word-long sentence in the worst possible case. Adam answers O(n³). Betty says it’s exponential in n.
After thinking for a minute, Caitlin points to one of them and says, “You’re right!” Who did Caitlin point to, and why
was that person correct? Give a clear, concise argument. You are welcome to illustrate with examples if that would
be helpful, although you don’t have to.
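For intuition about the quantities involved: the number of distinct binary bracketings of an n-word sentence is given by the Catalan numbers, which grow roughly like 4ⁿ. A quick illustrative sketch:

```python
from math import comb

def catalan(n):
    """n-th Catalan number: the number of distinct binary bracketings
    of a sequence with n+1 leaves."""
    return comb(2 * n, n) // (n + 1)

# The number of binary trees over n words is catalan(n - 1).
for n in [2, 5, 10, 20]:
    print(n, catalan(n - 1))
```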
(a) Consider the following problem: given a previously unseen string of natural language text, what language is it
written in? One approach to this problem is to use n-gram models.
Assume you have training samples containing 20,000 words of running text in each of k languages,
{L1, …, Lk}. Furthermore, to keep things simple, assume that all these languages are alphabetic (i.e., not
ideographic like many Asian languages) and that all data is in Unicode, and that all k languages use
significantly overlapping alphabets (e.g., you could suppose they all use a roman alphabet with ac-
cents/diacritics). If you need to, for concreteness, you can assume the genre is newswire, and that there is
no markup.
Now, imagine that you are asked to design a solution to the problem based on n-grams.
(i) Would you use n-grams composed of words or characters? Justify your answer in quantitative terms
clearly stating any assumptions you make (e.g., about sizes of vocabularies, sizes of alphabets, or any other
relevant quantities).
(ii) Assuming you had a well-smoothed n-gram model Mi for each language Li, and a previously unseen
observed document O, how would you use perplexity or cross-entropy (either one is ok) to decide which
language document O is in?
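One possible shape of such a solution, as a hedged sketch: train a smoothed character-bigram model per language and pick the language whose model assigns the document the lowest cross-entropy. The tiny training strings and the add-one smoothing below are illustrative assumptions, not part of the question.

```python
import math
from collections import Counter

def train_char_bigram(text, alphabet):
    """Add-one-smoothed character bigram model; returns P(c2 | c1)."""
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text[:-1])
    V = len(alphabet)
    def prob(c1, c2):
        return (bigrams[(c1, c2)] + 1) / (unigrams[c1] + V)
    return prob

def cross_entropy(model, text):
    """Average negative log2 probability per character under the model."""
    pairs = list(zip(text, text[1:]))
    return -sum(math.log2(model(c1, c2)) for c1, c2 in pairs) / len(pairs)

# Hypothetical toy corpora standing in for the 20,000-word training samples.
alphabet = set("abcdefghijklmnopqrstuvwxyz ")
models = {
    "en": train_char_bigram("the cat sat on the mat " * 50, alphabet),
    "fr": train_char_bigram("le chat est sur le tapis " * 50, alphabet),
}

doc = "the dog sat on the rug"
guess = min(models, key=lambda L: cross_entropy(models[L], doc))
```

Minimizing cross-entropy here is equivalent to minimizing perplexity, since perplexity is 2 raised to the cross-entropy.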
(b) An early proposal for statistical part-of-speech tagging suggested that, given a sequence of words (w1, w2, …, wn),
one should choose the sequence of tags (t1, t2, …, tn) for which the value of
is maximized. Was this model an instance of the noisy channel model for part-of-speech tagging? If so, explain. If
not, explain why not.
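For background, the noisy channel model for tagging chooses the tag sequence maximizing P(t1 … tn) · P(w1 … wn | t1 … tn), typically factored as ∏ᵢ P(tᵢ | tᵢ₋₁) P(wᵢ | tᵢ). A toy scoring sketch with hypothetical parameters:

```python
import math

# Hypothetical toy parameters for a noisy-channel (HMM) tagger:
# a source model P(t_i | t_{i-1}) and a channel model P(w_i | t_i).
trans = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.8, ("NN", "VB"): 0.5}
emit = {("DT", "the"): 0.7, ("NN", "dog"): 0.01, ("VB", "barks"): 0.02}

def noisy_channel_score(words, tags):
    """log P(tags) + log P(words | tags): the noisy-channel objective."""
    score = 0.0
    prev = "<s>"
    for w, t in zip(words, tags):
        score += math.log(trans[(prev, t)]) + math.log(emit[(t, w)])
        prev = t
    return score

noisy_channel_score(["the", "dog", "barks"], ["DT", "NN", "VB"])
```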
For each question, state whether it is true or false, and briefly but clearly explain the rationale for your answer. You do not need
to start each answer on a new page. (We’re indicating this by using roman numerals instead of letters.)
True or false: The sentence “Susan and Mary enjoyed the spaghetti they served with her brother” illustrates at least
three different kinds of linguistic ambiguity.
True or false: Zipf’s law tells us that to build successful systems we only need to care about the highest-probability words.

Recall that in a standard bigram model, we express the probability of an entire sequence w1 w2 … wn as a product of bigram probabilities P(wt | wt−1). Now suppose the parameters used in the product are instead defined as follows:

P(wt | wt−1) = P(ct | ct−1) P(wt | ct)

where C = {c1, …, c|C|} is a set of hidden categories, very much analogous to the way hidden states are used as tags for words in an HMM.
True or false: If the size of the vocabulary is V = 50,000, then using this model with |C| = 32 cuts the number of
parameters you need to estimate by a factor of about 10, compared to the number of parameters in the standard
bigram model.
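As a sanity check on the arithmetic, assuming the class-based factorization P(wt | wt−1) = P(ct | ct−1) P(wt | ct) and counting one parameter per conditional-probability entry (both assumptions made here for illustration):

```python
V = 50_000   # vocabulary size
C = 32       # number of hidden categories

# Standard bigram model: one P(w_t | w_{t-1}) entry per ordered word pair.
bigram_params = V * V

# Class-based model: P(c_t | c_{t-1}) entries plus P(w_t | c_t) entries.
class_params = C * C + V * C

ratio = bigram_params / class_params
print(bigram_params, class_params, ratio)
```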
Recall that one way to evaluate a constituency parser is to look at the precision and recall of the constituents
that it finds, evaluated against the constituents found in a correct parse of the same sentence. Consider a true parse T
and a system parse P, below:
[T] (S (NP the prof) (VP (VPRT looked up) (NP the grade)) (ADVP today))
[P] (S (NP the prof) (VP looked (PP up (NP the grade)) (ADVP today)))
True or false: The labeled precision of constituents for the predicted parse is 2/3 and the labeled recall is 2/3.
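Labeled precision and recall can be computed mechanically by extracting (label, span) triples from each tree and intersecting the two sets. A sketch (the trees T and P are repeated inline so the code is self-contained; the bracketing of P assumes one plausible reading of the attachment):

```python
import re

def parse_sexp(s):
    """Parse a bracketed tree string into nested [label, children...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    def helper(i):
        assert tokens[i] == "("
        node = [tokens[i + 1]]          # node label
        i += 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                child, i = helper(i)
                node.append(child)
            else:
                node.append(tokens[i])  # word leaf
                i += 1
        return node, i + 1
    tree, _ = helper(0)
    return tree

def labeled_spans(tree):
    """Set of (label, start, end) word spans for every constituent."""
    spans = set()
    def walk(node, start):
        if isinstance(node, str):       # a word covers one position
            return start + 1
        pos = start
        for child in node[1:]:
            pos = walk(child, pos)
        spans.add((node[0], start, pos))
        return pos
    walk(tree, 0)
    return spans

T = "(S (NP the prof) (VP (VPRT looked up) (NP the grade)) (ADVP today))"
P = "(S (NP the prof) (VP looked (PP up (NP the grade)) (ADVP today)))"
gold = labeled_spans(parse_sexp(T))
pred = labeled_spans(parse_sexp(P))
match = gold & pred
precision = len(match) / len(pred)
recall = len(match) / len(gold)
```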