
Neural Networks

Universidad de las
Américas Puebla
Neural Networks
Final Report
Sentiment Analysis with Python
and Scikit-Learn
Carmen Paola Hernández Morales
Héctor Beristain Bermúdez

ID 146873
ID 145826

Professor: Gibran Etcheverry Doger

Carmen Paola Hernández Morales / Héctor Beristain Bermúdez

We analyze a particular application of sentiment analysis: deciding whether a
document expresses a positive or negative feeling towards the product that is
being discussed.
Theoretical framework
In general, a learning problem considers a set of n data samples and then
tries to predict properties of unknown data.
We can categorize learning problems as follows:
Supervised learning, where the data comes with additional attributes that
we want to predict. This kind of problem comprises:
Classification: samples belong to two or more classes and we want to learn
from already labeled data how to predict the class of unlabeled data. An
example of a classification problem is handwritten digit recognition, where
the objective is to assign each input vector to one of a finite number of
discrete categories. Another way to think of classification is as a discrete
(as opposed to continuous) form of supervised learning, where there is a
limited number of categories and, for each of the n samples provided, one
tries to label it with the correct category or class.
Regression: if the desired output consists of one or more continuous
variables, then the task is called regression. An example of a regression
problem would be predicting the length of a salmon as a function of its age.
Unsupervised learning, in which the training data consists of a set of input
vectors x without any corresponding target values. The goal in this type of
problem can be to find groups of similar examples within the data, which is
called clustering, or to determine the distribution of data in the input
space, known as density estimation, or to project data from a
high-dimensional space down to two or three dimensions for the purpose of
visualization.
In Bonzanini's words, sentiment analysis can be defined as the process of
determining whether a piece of writing is positive, negative or neutral [1].
Sentiment analysis is a field of study that analyzes people's opinions
towards products and other entities, usually expressed in written form, such
as online reviews. In recent years it has been much discussed in academia
and industry, thanks to the popularity of social networks, which provide a
constant source of opinionated text to analyze.

We use Python, and in particular scikit-learn, for these experiments.
To install scikit-learn, run the following command in the console:
Last login: Thu Nov 12 10:14:40 on console
MacBook-Pro-de-Hector-5:~ HectorBeristainBermudez$ pip install -U scikit-learn
Collecting scikit-learn
Downloading scikit-learn-0.17.tar.gz (7.8MB)
100% || 7.8MB 36kB/s

This installs scikit-learn for Python 3. The terminal prints the output
excerpted below; the complete system response is in the RTF document
(TerminalScikitInstall).
Building wheels for collected packages: scikit-learn
Running bdist_wheel for scikit-learn
Complete output from command /Users/HectorBeristainBermudez/anaconda/bin/python3 -c "import
setuptools;__file__='/private/var/folders/l0/gz5nbf7d0z138k05b6zs0dfr0000gn/T/pip-build-fyqoahor/scikitlearn/';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d
Partial import of sklearn during the build process.
UserWarning: Specified path is invalid.
warnings.warn('Specified path %s is invalid.' % d)
error: Command "g++ -fno-strict-aliasing -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall
-I/Users/HectorBeristainBermudez/anaconda/include -arch x86_64
-I/Users/HectorBeristainBermudez/anaconda/lib/python3.5/site-packages/numpy/core/include -c
sklearn/svm/src/libsvm/libsvm_template.cpp -o build/temp.macosx-10.5-x86_643.5/sklearn/svm/src/libsvm/libsvm_template.o" failed with exit status 69
---------------------------------------
MacBook-Pro-de-Hector-5:~ HectorBeristainBermudez$

The dataset used for these experiments is the well-known Polarity Dataset
v2.0, downloadable from the Movie Review Data link provided by Bonzanini.
The dataset contains 2,000 documents, labeled and preprocessed. In
particular, there are two labels, positive and negative, with 1,000
documents each. Each line of a document is a sentence. The preprocessing
takes care of most of the work we would otherwise have to do to get started,
so we can focus on the classification problem.
Real-world data is often messy and needs suitable preprocessing before we
can make good use of it. All we need to do in this case is read the files
and split the text into words on whitespace.
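Splitting on whitespace can be sketched in a few lines. The sample text
below is made up for illustration (it is not taken from the dataset) and
assumes the preprocessed style of one sentence per line with punctuation
already separated from words; tokenize is a hypothetical helper name, not
part of the original script.

```python
def tokenize(text):
    """Split a preprocessed document into tokens on any run of whitespace."""
    return text.split()

# Made-up sample (assumption): one sentence per line, lowercased,
# punctuation already separated from the words by spaces.
sample = "the plot is thin .\nthe acting , however , is superb ."
tokens = tokenize(sample)
print(tokens)
```

Because str.split() with no argument treats newlines like any other
whitespace, the sentence boundaries disappear and the document becomes a
flat list of tokens.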
The code can be found as a Gist on Marco Bonzanini's GitHub. In the
following we explain the main tasks of the script:
# You need to install scikit-learn:
# sudo pip install scikit-learn
# Dataset: Polarity dataset v2.0

import sys
import os
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.metrics import classification_report

def usage():
    print("python %s <data_dir>" % sys.argv[0])

if __name__ == '__main__':
    # Expect the dataset directory as the only command-line argument
    if len(sys.argv) < 2:
        usage()
        sys.exit(1)
    data_dir = sys.argv[1]
    classes = ['pos', 'neg']

The first step reads the content of the files and creates lists of
training/testing documents and labels.
We split the dataset into training (90% of the documents) and testing (10%)
by exploiting the file names (they all start with cvX, with X in [0..9]).
This calls for k-fold cross-validation, which is not implemented in the
example but is fairly easy to integrate.
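# Read the data
train_data = []
train_labels = []
test_data = []
test_labels = []
for curr_class in classes:
    dirname = os.path.join(data_dir, curr_class)
    for fname in os.listdir(dirname):
        with open(os.path.join(dirname, fname), 'r') as f:
            content = f.read()
            # Files starting with cv9 form the 10% test split
            if fname.startswith('cv9'):
                test_data.append(content)
                test_labels.append(curr_class)
            else:
                train_data.append(content)
                train_labels.append(curr_class)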
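
The file-name trick extends naturally to the k-fold cross-validation
mentioned in the text: each digit X in cvX can serve as the held-out fold in
turn. The following is only a minimal sketch under that assumption; fold_of
and kfold_splits are hypothetical helper names, and the file names are
made-up examples that merely follow the cvX naming pattern.

```python
def fold_of(fname):
    """Return the fold digit X from a file name that starts with cvX."""
    return int(fname[2])

def kfold_splits(fnames, k=10):
    """Yield (fold, train_names, test_names), holding out one fold at a time."""
    for fold in range(k):
        test = [f for f in fnames if fold_of(f) == fold]
        train = [f for f in fnames if fold_of(f) != fold]
        yield fold, train, test

# Made-up file names for illustration only
names = ["cv000_a.txt", "cv123_b.txt", "cv950_c.txt", "cv999_d.txt"]
splits = list(kfold_splits(names))
fold9 = splits[9]
print(fold9)
```

Fold 9 corresponds to the single split used in the script (files starting
with cv9 held out); averaging the scores over all ten folds would give a
more stable estimate than one split alone.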

Scikit-learn provides several vectorizers to translate the input documents
into vectors of features. Typically we want to give appropriate weights to
different words, and TF-IDF is one of the most common weighting schemes
used in text analytics applications. In scikit-learn, we can use the
TfidfVectorizer:
# Create feature vectors
vectorizer = TfidfVectorizer(min_df=5,
                             max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

The parameters used with the vectorizer in this example are:
min_df=5: discard words appearing in fewer than 5 documents.
max_df=0.8: discard words appearing in more than 80% of the documents.
sublinear_tf=True: use sublinear weighting of term frequencies.
use_idf=True: enable IDF weighting.
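
To make these parameters concrete, here is a small pure-Python sketch of the
weighting. It is not scikit-learn's implementation: it follows the formulas
given in the scikit-learn documentation, assuming the default smoothed IDF
and L2 row normalization are in effect, and it uses a made-up four-document
corpus with smaller thresholds than the report's min_df=5 and max_df=0.8.
The function name tfidf is hypothetical.

```python
import math
from collections import Counter

def tfidf(docs, min_df=1, max_df=1.0):
    """Toy TF-IDF: sublinear tf, smoothed idf, L2-normalized rows."""
    n = len(docs)
    tokenized = [d.split() for d in docs]
    # Document frequency: number of documents containing each term
    df = Counter(t for toks in tokenized for t in set(toks))
    # Keep terms found in at least min_df documents (absolute count)
    # and in at most a max_df fraction of all documents
    vocab = sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df)
    # Smoothed idf (assumed default): ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(t for t in toks if t in idf)
        # Sublinear tf: replace the raw count c with 1 + ln(c)
        row = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in row.values())) or 1.0
        rows.append({t: w / norm for t, w in row.items()})
    return vocab, rows

# Made-up corpus for illustration only
docs = ["good good movie", "bad movie", "good plot", "bad plot movie"]
vocab, rows = tfidf(docs, min_df=2, max_df=0.7)
print(vocab)
```

Here "movie" occurs in three of the four documents (75%), so max_df=0.7
removes it, in the same way that max_df=0.8 removes near-ubiquitous words
from the review corpus.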
Scikit-Learn comes with a number of different classifiers already built-in. In
these experiments, we use different variations of Support Vector Machine
(SVM), which is commonly used in classification applications.
Once the vectorizer has generated the feature vectors for training and
testing, we can call the classifier as described above. In the example, we try
different variations of SVM:
# Perform classification with SVM, kernel=rbf
classifier_rbf = svm.SVC()
t0 = time.time()
classifier_rbf.fit(train_vectors, train_labels)
t1 = time.time()
prediction_rbf = classifier_rbf.predict(test_vectors)
t2 = time.time()
time_rbf_train = t1-t0
time_rbf_predict = t2-t1
# Perform classification with SVM, kernel=linear
classifier_linear = svm.SVC(kernel='linear')
t0 = time.time()
classifier_linear.fit(train_vectors, train_labels)
t1 = time.time()
prediction_linear = classifier_linear.predict(test_vectors)
t2 = time.time()
time_linear_train = t1-t0
time_linear_predict = t2-t1
# Perform classification with linear SVM via LinearSVC (liblinear)
classifier_liblinear = svm.LinearSVC()
t0 = time.time()
classifier_liblinear.fit(train_vectors, train_labels)
t1 = time.time()
prediction_liblinear = classifier_liblinear.predict(test_vectors)
t2 = time.time()
time_liblinear_train = t1-t0
time_liblinear_predict = t2-t1

The SVC() class generates an SVM classifier with an RBF (Gaussian) kernel as
the default option (several other options are available).
The fit() method performs the training; it requires the training data
processed by the vectorizer as well as the correct class labels.
The classification step consists of predicting the labels for the test data.
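
The fit/predict/timing pattern repeated for each classifier could be
factored into a small helper. The sketch below is an illustration only:
timed is a hypothetical helper, and MajorityClassifier is a made-up stand-in
that mimics the fit()/predict() shape of the SVM classes so the example runs
without scikit-learn.

```python
import time
from collections import Counter

class MajorityClassifier:
    """Hypothetical stand-in exposing fit()/predict() like the SVM classes."""
    def fit(self, X, y):
        # "Training" here just remembers the most frequent label
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

clf = MajorityClassifier()
_, train_time = timed(clf.fit, [[0], [1], [2]], ["pos", "pos", "neg"])
prediction, predict_time = timed(clf.predict, [[3], [4]])
print(prediction)
```

With the real classifiers, the same helper would be called as
timed(classifier.fit, train_vectors, train_labels) and
timed(classifier.predict, test_vectors).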

After performing the classification, we print the quality results using
classification_report(), and some timing information.
# Print results in a nice table
print("Results for SVC(kernel=rbf)")
print("Training time: %fs; Prediction time: %fs" % (time_rbf_train, time_rbf_predict))
print(classification_report(test_labels, prediction_rbf))
print("Results for SVC(kernel=linear)")
print("Training time: %fs; Prediction time: %fs" % (time_linear_train, time_linear_predict))
print(classification_report(test_labels, prediction_linear))
print("Results for LinearSVC()")
print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
print(classification_report(test_labels, prediction_liblinear))
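
The columns reported by classification_report() are precision, recall and
F1-score per class. As a reminder of what they measure, the following
sketch computes them by hand for one class; prf is a hypothetical helper and
the labels are made up for illustration, not results from the experiment.

```python
def prf(true_labels, predicted, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Made-up labels for illustration only
truth = ["pos", "pos", "neg", "neg"]
guess = ["pos", "neg", "neg", "neg"]
pos_scores = prf(truth, guess, "pos")
neg_scores = prf(truth, guess, "neg")
print(pos_scores, neg_scores)
```

Precision is the fraction of documents predicted as a class that really
belong to it, recall is the fraction of documents of that class that were
found, and F1 is their harmonic mean.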

We obtained the complete script from the Gist/GitHub link given at the end
of Bonzanini's article, saved it, and called it from the command line:
Last login: Mon Nov 23 09:13:34 on ttys000
MacBook-Pro-de-Hector-5:~ HectorBeristainBermudez$ cd desktop
MacBook-Pro-de-Hector-5:desktop HectorBeristainBermudez$ python review_polarity/txt_sentoken/
DeprecationWarning: inspect.getargspec() is deprecated, use inspect.signature() instead
if 'order' in inspect.getargspec(np.copy)[0]:
Results for SVC(kernel=rbf)
Training time: 8.367630s; Prediction time: 0.805447s
precision recall f1-score support
avg / total

Results for SVC(kernel=linear)
Training time: 6.867254s; Prediction time: 0.760790s
precision recall f1-score support
avg / total

Results for LinearSVC()
Training time: 0.056955s; Prediction time: 0.000342s
precision recall f1-score support
avg / total
MacBook-Pro-de-Hector-5:desktop HectorBeristainBermudez$

The default RBF kernel performs worse than the linear kernel. This opens a
discussion on Gaussian vs. linear kernels that is beyond the scope of this
report, but as a rule of thumb, when the number of features is much higher
than the number of samples (documents), a linear kernel is probably the
preferred choice. Moreover, there are options to properly tune the
parameters of an RBF kernel.
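
For reference, the two kernel functions being compared can be written out
in plain Python. This is only an illustration of the formulas, not the
libSVM code; the vectors and the gamma value are arbitrary choices made for
the example.

```python
import math

def linear_kernel(x, y):
    """Linear kernel: K(x, y) = x . y"""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel: K(x, y) = exp(-gamma * ||x - y||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Arbitrary example vectors and gamma (assumptions for illustration)
x = [1.0, 0.0, 2.0]
y = [0.0, 1.0, 2.0]
k_lin = linear_kernel(x, y)
k_rbf = rbf_kernel(x, y, gamma=0.5)
print(k_lin, k_rbf)
```

The gamma parameter is one of the values that would need tuning for the RBF
kernel, whereas the linear kernel has no kernel parameter of its own.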

SVC() with a linear kernel is much slower than LinearSVC(). This is easily
explained by the fact that, under the hood, scikit-learn relies on different
C libraries. In particular, SVC() is implemented using libSVM, while
LinearSVC() is implemented using liblinear, which is explicitly designed for
this kind of application.
We discussed an application of sentiment analysis, addressed as a document
classification problem with Python and scikit-learn.
The choice of the classifier and the feature extraction process influence
the overall quality of the results, and it is always good to experiment with
different configurations. Scikit-learn offers many options from this point
of view.
Knowing the underlying implementation also allows a better choice in terms
of speed.
[1] Bonzanini, M. (2015, January 19). Sentiment Analysis with Python and
scikit-learn. Retrieved November 20, 2015.
[3] Scikit-Learn. (n.d.). Retrieved November 20, 2015.
[4] Sentiment Analysis. (2015). Retrieved November 20, 2015.
[5] 1.4. Support Vector Machines. (n.d.). Retrieved November 22, 2015.
