
UNIT 5 Text Analytics

Part I: Twitter

Q1. Which of these problems is the LEAST likely to be a good application of natural language
processing?
- Processing medical records to extract symptoms
- Judging the winner of a poetry contest
- Flagging customer reviews on Amazon as problematic
- Automatically organizing emails based on their content

Q2. For each tweet, we computed an overall score by averaging all five scores assigned by the
Amazon Mechanical Turk workers. However, Amazon Mechanical Turk workers might make
significant mistakes when labeling a tweet. The mean could be highly affected by this.
Which of the three alternative metrics below would best capture the typical opinion of the five
Amazon Mechanical Turk workers, would be less affected by mistakes, and is well-defined regardless
of the five labels?
- An overall score equal to the median (middle) score
- An overall score equal to the majority score
- An overall score equal to the minimum score
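As a minimal sketch, the candidate metrics can be computed side by side in base R. The five scores below are hypothetical (one label is assumed to be a mistake), and the variable names are illustrative only.

```r
# Hypothetical labels from five workers; the last one is an assumed mistake.
scores <- c(1, 1, 2, 1, -2)

mean_score   <- mean(scores)    # arithmetic average of all five labels
median_score <- median(scores)  # middle label after sorting
min_score    <- min(scores)     # lowest label

# Majority score: a label chosen by more than half of the workers, if one exists.
counts   <- table(scores)
majority <- names(counts)[counts > length(scores) / 2]

print(c(mean = mean_score, median = median_score, min = min_score))
print(majority)
```

Rerunning the sketch with other label sets, for example c(2, 2, 1, 1, 0), shows how each metric behaves when the workers disagree.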

Q3. For each new sentence listed below, pick the preprocessing task discussed in the previous
video that would change the sentence "Data is useful AND powerful!" into that new sentence.
1. New sentence: Data useful powerful!
2. New sentence: data is useful and powerful
3. New sentence: Data is use AND power!

a) Cleaning up irregularities (changing to lowercase and removing punctuation)
b) Removing stop words
c) Stemming

Q4. a) Given a corpus in R, how many commands do you need to run in R to clean up the
irregularities (removing capital letters and punctuation)?

b) How many commands do you need to run to stem the document?
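
The preprocessing tasks above could be run on a corpus roughly as follows. This is a sketch under assumptions: it uses the tm and SnowballC packages, builds a one-document corpus from the example sentence, and relies on tm's built-in English stop word list. Recent versions of tm expect base functions such as tolower to be wrapped in content_transformer.

```r
library(tm)         # text-mining framework
library(SnowballC)  # supplies the stemmer that stemDocument relies on

# One-document corpus built from the example sentence.
corpus <- VCorpus(VectorSource("Data is useful AND powerful!"))

# Cleaning up irregularities: lowercase, then strip punctuation.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Removing stop words (tm's built-in English list; an assumption here).
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Stemming.
corpus <- tm_map(corpus, stemDocument)

# Text of the single document after all steps.
result <- content(corpus[[1]])
print(result)
```

Printing the document after each tm_map call, rather than only at the end, makes the effect of each individual step visible.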

Q5. In the previous video, we showed a list of all words that appear at least 20 times in our tweets.
Which of the following words appear at least 100 times? Select all that apply. (HINT: use the
findFreqTerms function)

app, buy, freak, ipad, iphon, itun, like, love, new, think
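
A small illustration of the findFreqTerms hint, assuming the tm package. The four tweets and the threshold of 2 are hypothetical stand-ins; with the real tweet data the same call would use lowfreq = 100 on the document-term matrix built in the video.

```r
library(tm)

# Hypothetical mini-corpus standing in for the real tweets.
tweets <- c("love my new iphone",
            "new ipad app",
            "love the app store",
            "think different")
corpus <- VCorpus(VectorSource(tweets))

# Rows are documents, columns are terms, cells are term counts.
frequencies <- DocumentTermMatrix(corpus)

# Terms whose total count across all documents is at least lowfreq.
frequent <- findFreqTerms(frequencies, lowfreq = 2)
print(frequent)
```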

Q6. In the previous video, we used CART and Random Forest to predict sentiment. Let's see how well
logistic regression does. Build a logistic regression model (using the training set) to predict
"Negative" using all of the independent variables. You may get a warning message after building
your model - don't worry (we explain what it means in the explanation).

Now, make predictions using the logistic regression model:

predictions = predict(tweetLog, newdata=testSparse, type="response")

where "tweetLog" should be the name of your logistic regression model. You might also get a
warning message after this command, but don't worry - it is due to the same problem as the
previous warning message.

a) Build a confusion matrix (with a threshold of 0.5) and compute the accuracy of the model.
What is the accuracy?
b) Is this worse or better than the baseline model accuracy of 84.5%? Think about the
properties of logistic regression that might make this the case!
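
The workflow in this question can be sketched end to end in base R. Everything about the data here is an assumption: make_data, word_a and word_b are hypothetical, and the simulated data frames merely stand in for the real sparse word-frequency training and test sets (the name trainSparse is assumed by analogy with testSparse). The accuracy it prints says nothing about the real tweets.

```r
set.seed(1)

# Hypothetical generator for a tiny stand-in data set with two word indicators.
make_data <- function(n) {
  word_a <- rbinom(n, 1, 0.3)
  word_b <- rbinom(n, 1, 0.3)
  prob   <- plogis(-2 + 2 * word_a + 1 * word_b)
  data.frame(Negative = rbinom(n, 1, prob), word_a = word_a, word_b = word_b)
}
trainSparse <- make_data(200)
testSparse  <- make_data(100)

# Logistic regression on all independent variables.
tweetLog <- glm(Negative ~ ., data = trainSparse, family = "binomial")

# Predicted probabilities on the test set.
predictions <- predict(tweetLog, newdata = testSparse, type = "response")

# Confusion matrix at a threshold of 0.5: rows are actual, columns are predicted.
actual    <- testSparse$Negative == 1
predicted <- predictions > 0.5
confusion <- table(actual = actual, predicted = predicted)
print(confusion)

# Accuracy: share of test observations classified correctly.
accuracy <- mean(actual == predicted)

# Baseline: always predict the most frequent outcome in the test set.
baseline <- max(table(actual)) / length(actual)
print(c(accuracy = accuracy, baseline = baseline))
```

Computing the accuracy from the logical vectors rather than from fixed cells of the table keeps the calculation valid even when the model predicts only one class.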

Part II: Man vs. Machine: How IBM Built a Jeopardy Champion

Q1. What were the goals of IBM when they set out to build Watson? Select all that apply.

- To build a computer that could compete with the best human players at Jeopardy!
- To build a computer for Jeopardy! that could be used as a contestant on every show.
- To build a computer that could generate questions on Jeopardy!
- To build a computer that could answer questions that are commonly believed to require human intelligence.

Q2. For which of the following reasons is Jeopardy! challenging? Select all that apply.

- A wide variety of categories.
- Expert knowledge is required in a few specific categories.
- Speed is required - you have to buzz in faster than your competitors.
- The categories and clues are often cryptic.

Q3. Which of the following two questions do you think would be EASIER for a computer to answer?

- Was Abraham Lincoln generally considered a happy man?
- What year was Abraham Lincoln born?

Q4. a) Select the LAT (lexical answer type) of the following Jeopardy! question: NICHOLAS II WAS
THE LAST RULING CZAR OF THIS ROYAL FAMILY (Hint: The answer is "The Romanovs")
- NICHOLAS II
- THE LAST RULING CZAR
- THIS ROYAL FAMILY
b) Select the LAT of the following Jeopardy question: REGARDING THIS DEVICE, ARCHIMEDES
SAID, "GIVE ME A PLACE TO STAND ON, AND I WILL MOVE THE EARTH" (Hint: The answer is
"A lever")
- THIS DEVICE
- ARCHIMEDES
- A PLACE
- THE EARTH

Q5. To predict which candidate answer is correct, we said that Watson uses logistic regression.
Which of the other techniques that we have learned could be used instead? Select all that apply.
- Linear Regression
- CART
- Random Forest
