Professional Documents
Culture Documents
3
4
Content
• Text classification problem
• Feature extraction from text
• Naïve Bayes method
• Logistic Regression method
• Sentiment Analysis Casestudy
Text Classification: definition
• Input:
• a document d
• a fixed set of classes C = {c1, c2,…, cJ}
8
Machine Learning Artchitecture
Learning(Học) Training
Labels
Training
Samples
Learned
Features Training
model
Learned
model
Inference(Dự đoán)
Features Prediction
Test Sample
Classification Methods:
Supervised Machine Learning
• Any kind of classifier
• Naïve Bayes
• Logistic regression
• Support-vector machines
• k-Nearest Neighbors
• Neural Network
• …
Precision and recall
• Precision: % of selected items that are correct
Recall: % of correct items that are selected
13
Feature vectors
• Values for several features of an observation can be put into a
single vector
# proper # 1st person # commas
nouns pronouns
2 0 0
5 0 0
0 1 1
14
Feature vectors
• Features should be useful in discriminating between categories.
15
Classification: Sentiment Analysis
16
Feature Representation
• Convert this example to a vector using bag-of-words features
18
Brill and Banko on spelling correction
Naive Bayes: Exercise!
• Model
• Inference
20
Logistic Regression
21
Logistic Regression
22
Sigmoid Function
• Trong số các hàm có 2 tích chất trên thì hàm sigmoid: f (s) 1 1e được sử
s
23
Optimization of loss based on gradient decent
In more detail about Gradient
decent:
https://towardsdatascience.com/imp
lement-gradient-descent-in-python-
9b93ed7108d1
• It is an optimization algorithm to
find the minimum of a function.
We start with a random point on
the function and move in the
negative direction of the
gradient of the function to reach
the local/global minima.
24
Gradient descent
• Nếu đạo hàm của hàm số x tại xt: f′(xt) >0 f′(xt)>0 thì xt nằm
về bên phải so với x* (và ngược lại). Để điểm tiếp theo xt +1
gần với x* hơn, chúng ta cần di chuyển xt về phía bên trái,
tức về phía âm. Nói các khác, chúng ta cần di chuyển
ngược dấu với đạo hàm: xt+1 = xt + ∆
• xt+1 = xt - µ f’(xt )
25
Gradient descent
26
Gradient Descent Algorithm
Variants of GD algorithm:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
27
Pros and cons of gradient descent
• Simple and often quite effective on ML tasks
• Often very scalable
• Only applies to smooth functions (differentiable)
• Might find a local minimum, rather than a global
one
28
Visualization of algorithms of GD
29
Selecting learning rate
• Use grid-search in log-space over small values on a tuning set:
• e.g., 0.01, 0.001, …
• Sometimes, decrease after each pass:
• e.g factor of 1/(1 + dt), t=epoch
• sometimes 1/t2
• Fancier techniques:
• Adaptive gradient: scale gradient differently for each dimension
(Adagrad, ADAM, ….)
30
Logistic Regression: Summary
31
SENTIMENT ANALYSIS
32
Google Product Search
• a
33
Bing Shopping
• a
34
Sentiment Analysis task
• Input: Cho một câu (đoạn) văn bản.
• Ouput: Xác định câu này là câu tích cực (1) hoặc tiêu cực (0).
• Thưc hiện trên bộ dữ liệu review phim ảnh: IMDB
35
Sentiment Analysis
• Chuẩn bị dữ liệu: IMDB Movie Reviews
• Tiền xử lý dữ liệu
• Trích trọn đặc trưng (Vectorization)
• Chọn mô hình học máy và huấn luyên
36
Tiền xử lý dữ liệu
• Đơn vị nhỏ nhất là từ. Dãy các từ:
Ví dụ: Text: This is a cat. --> Word Sequence: [this, is, a, cat]
37
Biểu thức chính qui
• Biêu thức chính qui (or regex) is a sequence of characters that
represent a search pattern
• Each character has a meaning; for example, “. means any
character that isn't the newline character: '\n‘”
• These characters are often combined with quantifiers, such as *,
which means zero or more.
• Biểu thức chính qui rất có ích trong việc xử lý xâu ký tự
38
Biểu thức chính qui
39
Quiz
• Example:
1) Trính ra số từ s:
s = 'My 2 favourite numbers are 8 and 25. My mobie is
0912203062‘.
1) Trích ra email từ s:
s = ‘Hello from shubhamg199630@gmail.com to priya@yahoo.com
about the meeting @2PM'
40
Features extraction
• Now that we have a way to extract information from text in the
form of word sequences, we need a way to transform these
word sequences into numerical features: this is vectorization.
• Kỹ thuật vector hóa đơn giản nhất là Bag Of Words (BOW). It
starts with a list of words called the vocabulary (this is often all
the words that occur in the training data)
41
Features extraction
• To use BOW vectorization in Python, we can rely on
CountVectorizer from the scikit-learn library
• scikit-learn has a built-in list of stop words that can
be ignored by passing stop_words="english" to the
vectorizer
• Moreover, we can pass our custom pre-processing function
fromearlier to automatically clean the text before it’s
vectorized.
42
Sentiment Analysis CaseStudy
• Dữ liệu văn bản IMDB: The IMDB movie reviews dataset is a set
of 50,000 reviews, half of which are positive and the other half
negative
• Chúng ta có thể download dữ liệu từ:
http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
• Chúng ta có thư mục chứa dữ liệu là: aclImdb
• we can use the following function to load the training/test datasets from
IMDB
43
Sentiment Analysis CaseStudy
• feature vectors that result from BOW are usually very large
(80,000-dimensional vectors in this case)
• we need to use simple algorithms that are efficient on a large
number of features (e.g., Naive Bayes, linear SVM, or logistic
regression)
44
Cải tiến mô hình hiện tại
• Trích trọng đặc trưng rất là quan trọng (Features Engineering
• There are some biases attached with only looking at how many
times a word occurs in a text. In particular, the longer the text,
the higher its features (word counts) will be
• Dựa vào đặc trưng TF-IDF
• Dựa vào n-gram
45
Cải tiến mô hình
• Đặc trưng TF-IDF
• Chúng ta có thể huấn luyện mô hình Linear SVM với đặc
trưng TF-IDF đơn giản bằng việc thay thế hàm
CountVectorizer bằng hàm TfIdfVectorizer
• Kết quả tăng khoảng 2%
46
Cải tiến mô hình
• Sử dụng các từ độc lập sẽ không tốt. Ví dụ:
• if the word good occurs in a text, we will naturally tend to
say that this text is positive, even if the actual expression
that occurs is actually not good. Cụm từ tốt hơn
• Sử dụng n-gram để xử lý vấn đề này
• An N-gram is a set of N successive words (e.g., very
good [ 2-gram] and not good at all [4-gram]).
Using N-grams, we produce richer word sequences.
• Ví dụ với N =2:
This is a cat. --> [this, is, a, cat, (this, is), (is, a), (a, cat)]
47
Cải tiến mô hình
• In practice, including N-grams in our TF-IDF vectorizer
is as simple as providing an additional parameter
ngram_range=(1, N).
vectorizer = TfidfVectorizer(stop_words="english",
preprocessor=clean_text,
ngram_range=(1, 2))
48
Homework
• Phân tích quan điểm với dữ liệu chuẩn IMDB (link:
http://ai.stanford.edu/~amaas/data/sentiment/)
49
References
• Chapter 4,5 - Speech and Language Processing (3rd ed. draft)
(https://web.stanford.edu/~jurafsky/slp3/4.pdf)
(https://web.stanford.edu/~jurafsky/slp3/5.pdf)
50