Professional Documents
Culture Documents
for Dummies
Po-Ting Lai (賴柏廷)
potinglai@iis.sinica.edu.tw
Intelligent Agent Systems Lab. (IASL)
Institute of Information Science, Academia Sinica,
Nankang, Taipei
Research Assistant
Outline
• Set up environment (Anaconda)
• Named entity recognition task
• Implement named entity recognition
Anaconda
• Available Packages
• See details in https://docs.anaconda.com/anaconda/packages/py3.6_win-64.html
Set up environment
• Install Anaconda
• Install sklearn-crfsuite
• “Hello World!” (in Jupyter)
Install “Anaconda” 1/3
Install “Anaconda” 2/3
Install “Anaconda” 3/3
Environment for the Lab
• Anaconda 5.2, Python 3.6 version
• https://www.anaconda.com/download/
• Windows 10 64 bit
Set up environment
• Install Anaconda
• Install sklearn-crfsuite
• “Hello World!” (in Jupyter)
Install “sklearn-crfsuite”
Install sklearn-crfsuite
• pip install sklearn-crfsuite
Set up environment
• Install Anaconda
• Install sklearn-crfsuite
• “Hello World!” (in Jupyter)
“Hello World!” (in Jupyter) 1/4
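In a new notebook, the first cell can simply be:

```python
# Classic first program: print a greeting to confirm the kernel runs.
message = "Hello World!"
print(message)
```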
https://aidea-web.tw/moe
Data Example
Data Tsv Example
Data Format Converter
• Convert corpus to conll:
• Inputs: "sample_data/txt", "sample_data/annotations.tsv"
• Output: "sample_data.iob2"
• Command:
• python corpus2conll.py "sample_data/txt" "sample_data/annotations.tsv" "sample_data.iob2"
• Convert conll to corpus:
• Inputs: "sample_data/txt", "sample_data.iob2"
• Output: "sample_data.tsv"
• Command:
• python conll2corpus.py "sample_data/txt" "sample_data.iob2" "sample_data.tsv"
corpus2conll.py
conll2corpus.py
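For reference, a conll/IOB2 file as produced above typically has one token per line, with the token and its IOB2 tag separated by a tab, and a blank line between sentences. The tokens and tags below are made up for illustration and are not taken from the lab's dataset:

```
Aspirin	B-Chemical
inhibits	O
platelet	O
aggregation	O
.	O
```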
Some Tools for NER
• CRF:
• CRF++: https://taku910.github.io/crfpp/
• Sklearn-crfsuite: https://sklearn-crfsuite.readthedocs.io/en/latest/
• Deep Learning:
• GRAM-CNN: https://github.com/valdersoul/GRAM-CNN
• Anago: https://github.com/Hironsan/anago
Load Data
Load Data
Code
def read_iob2_sents(in_iob2_file):
    sents = []
    with open(in_iob2_file, 'r', encoding='utf8') as f:
        # Sentences are separated by blank lines; tokens by newlines.
        for sent_iob2 in f.read().split('\n\n'):
            sent = []
            for raw in sent_iob2.split('\n'):
                if raw == '':
                    continue
                columns = raw.split('\t')
                sent.append(tuple(columns))
            if sent:  # skip empty blocks (e.g. trailing newlines)
                sents.append(sent)
    return sents
train_sents = read_iob2_sents('sample_data.iob2')
train_sents[0]
Outline
• Set up environment (Anaconda)
• Named entity recognition task
• Implement named entity recognition
Load Training and Test Sets
train_sents and test_sents
Code
train_sents = read_iob2_sents('sample_data.iob2')
test_sents = read_iob2_sents('sample_data.iob2')  # demo only: the same sample file is reused as the test set
• algorithm:
• 'lbfgs' - Gradient descent using the L-BFGS method
• 'l2sgd' - Stochastic Gradient Descent with L2 regularization term
• 'ap' - Averaged Perceptron
• 'pa' - Passive Aggressive (PA)
• 'arow' - Adaptive Regularization Of Weight Vector (AROW)
• min_freq (float, optional (default=0)):
• Cut-off threshold for occurrence frequency of a feature. CRFsuite will ignore features whose frequencies of occurrence in the training data are no greater than min_freq. The default is no cut-off.
Parameters
• c1 (float, optional (default=0)):
• The coefficient for L1 regularization. If a non-zero value is specified, CRFsuite switches to the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) method. The default value is zero (no L1 regularization).
• Supported training algorithms: lbfgs
• c2 (float, optional (default=1.0)):
• The coefficient for L2 regularization.
• Supported training algorithms: l2sgd, lbfgs
Prediction
Code
y_pred = crf.predict(X_test)
y_pred
Evaluation
Evaluation
Code
from sklearn_crfsuite import metrics
labels = list(crf.classes_)
labels.remove('O')
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3))
word2features
sent2features
Code
def word2features(sent, i):
    # Features for the current token.
    word = sent[i][0]
    features = {
        'word': word,
        'word.islower()': word.islower(),
        'word.isalnum()': word.isalnum(),
        'word.isupper()': word.isupper(),
        'word.isdigit()': word.isdigit(),
        'word.isalpha()': word.isalpha()}
    if i > 0:
        # Features for the previous token.
        word1 = sent[i-1][0]
        features.update({
            '-1:word': word1,
            '-1:word.islower()': word1.islower(),
            '-1:word.isalnum()': word1.isalnum(),
            '-1:word.isupper()': word1.isupper(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.isalpha()': word1.isalpha()})
    else:
        features['BOS'] = True  # beginning of sentence
    if i < len(sent)-1:
        # Features for the next token.
        word1 = sent[i+1][0]
        features.update({
            '+1:word': word1,
            '+1:word.islower()': word1.islower(),
            '+1:word.isalnum()': word1.isalnum(),
            '+1:word.isupper()': word1.isupper(),
            '+1:word.isdigit()': word1.isdigit(),
            '+1:word.isalpha()': word1.isalpha()})
    else:
        features['EOS'] = True  # end of sentence
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]
features = sent2features(train_sents[0])
features
Load Training and Test Sets
X_train
Code
def sent2labels(sent):  # the IOB2 label is the last column of each token tuple
    return [token[-1] for token in sent]
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]