NER Lab

Named Entity Recognition
for Dummies
Po-Ting Lai (賴柏廷)
potinglai@iis.sinica.edu.tw
Intelligent Agent Systems Lab. (IASL)
Institute of Information Science, Academia Sinica,
Nankang, Taipei
Research Assistant
Outline
• Set up environment (Anaconda)
• Named entity recognition task
• Implement named entity recognition
Anaconda
• Available Packages
• See details in
https://docs.anaconda.com/anaconda/packages/py3.6_win-64.html
Set up environment
• Install Anaconda
• Install sklearn-crfsuite
• “Hello World!” (in Jupyter)
Install “Anaconda”
Environment for the Lab
• Anaconda 5.2 with Python 3.6 version
• https://www.anaconda.com/download/
• Windows 10 64 bit
Set up environment
Install “sklearn-crfsuite”
• Open Anaconda Prompt (Anaconda3)

Install sklearn-crfsuite
• pip install sklearn-crfsuite
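A quick way to confirm that the install worked is to check that Python can find the packages. This is a minimal sketch using only the standard library; the helper name is_installed is made up for this example.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the top-level module can be located (it is not imported)."""
    return importlib.util.find_spec(module_name) is not None

# The sklearn-crfsuite package is imported under the name sklearn_crfsuite.
for name in ("sklearn", "sklearn_crfsuite"):
    print(name, "OK" if is_installed(name) else "MISSING")
```

If either line prints MISSING, re-run the pip command from the Anaconda Prompt.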
Set up environment
“Hello World!” (in Jupyter)
• Enter “jupyter notebook” at the Anaconda Prompt
• Type print("Hello World!") in a cell, then press Shift+Enter to run the cell
• Shift+Enter: Execute the cell
• ESC+A: Create a new cell above the current cell
• ESC+B: Create a new cell below the current cell
• ESC+H: Display all keyboard shortcuts
Outline
Our NER Task
https://aidea-web.tw/moe
Data Example
Data Tsv Example
Data Format Converter
• Convert corpus to conll:
• Inputs: "sample_data/txt", "sample_data/annotations.tsv"
• Output: "sample_data.iob2"
• Command:
• python corpus2conll.py "sample_data/txt" "sample_data/annotations.tsv" "sample_data.iob2"
• Convert conll to corpus:
• Inputs: "sample_data/txt", "sample_data.iob2"
• Output: "sample_data.tsv"
• Command:
• python conll2corpus.py "sample_data/txt" "sample_data.iob2" "sample_data.tsv"
corpus2conll.py
conll2corpus.py
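The IOB2 scheme that the .iob2 file uses tags the first token of an entity as B-<type>, the remaining tokens of that entity as I-<type>, and everything else as O. The sketch below is not corpus2conll.py; it only illustrates the tagging scheme. The function name, the token-index span layout (start inclusive, end exclusive), and the PER/LOC types are assumptions made for the example.

```python
def spans_to_iob2(tokens, spans):
    """Tag tokens with IOB2 labels.

    spans: list of (start, end, entity_type) token indices, end exclusive
    (a hypothetical layout, not the lab's annotation format).
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # remaining tokens of the entity
    return tags

tokens = ["Po-Ting", "Lai", "works", "in", "Taipei"]
tags = spans_to_iob2(tokens, [(0, 2, "PER"), (4, 5, "LOC")])
for token, tag in zip(tokens, tags):
    print(token + "\t" + tag)
```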
Some Tools for NER
• CRF:
• CRF++: https://taku910.github.io/crfpp/
• Sklearn-crfsuite: https://sklearn-crfsuite.readthedocs.io/en/latest/
• Deep Learning:
• GRAM-CNN: https://github.com/valdersoul/GRAM-CNN
• Anago: https://github.com/Hironsan/anago
Load Data
Code
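def read_iob2_sents(in_iob2_file):
    """Read an IOB2 file into a list of sentences (lists of column tuples)."""
    sents = []
    with open(in_iob2_file, 'r', encoding='utf8') as f:
        # Sentences are separated by blank lines.
        for sent_iob2 in f.read().split('\n\n'):
            sent = []
            for raw in sent_iob2.split('\n'):
                if raw == '':
                    continue
                # Columns within a row are tab-separated.
                columns = raw.split('\t')
                sent.append(tuple(columns))
            # Skip empty blocks, e.g. the one after a trailing blank line.
            if sent:
                sents.append(sent)
    return sents

train_sents = read_iob2_sents('sample_data.iob2')
train_sents[0]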
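The next slides call sent2tokens and sent2labels, whose definitions are not shown in this deck. A minimal sketch, assuming the token is the first column of each row and the gold label is the last column (the actual column layout of sample_data.iob2 may differ):

```python
def sent2tokens(sent):
    # Assumption: the token is the first column of every row.
    return [row[0] for row in sent]

def sent2labels(sent):
    # Assumption: the IOB2 label is the last column of every row.
    return [row[-1] for row in sent]

example = [("Po-Ting", "B-PER"), ("Lai", "I-PER"), ("speaks", "O")]
print(sent2tokens(example))
print(sent2labels(example))
```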
Outline
Load Training and Test Sets
X_train and y_train
Code
train_sents = read_iob2_sents('sample_data.iob2')
# The lab reuses the sample file as the test set, so scores on it
# are not a held-out estimate.
test_sents = read_iob2_sents('sample_data.iob2')

X_train = [sent2tokens(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2tokens(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
Learn CRF Model
Code
import sklearn
import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    min_freq=0
)
crf.fit(X_train, y_train)
Parameters
• https://sklearn-crfsuite.readthedocs.io/en/latest/api.html
• algorithm:
• 'lbfgs' - Gradient descent using the L-BFGS method
• 'l2sgd' - Stochastic Gradient Descent with L2 regularization term
• 'ap' - Averaged Perceptron
• 'pa' - Passive Aggressive (PA)
• 'arow' - Adaptive Regularization Of Weight Vector (AROW)
• min_freq (float, optional (default=0)):
• Cut-off threshold for occurrence frequency of a feature. CRFsuite
will ignore features whose frequencies of occurrences in the
training data are no greater than min_freq. The default is no cut-off.
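The min_freq rule can be pictured with plain Python. This is only an illustration of "drop what occurs no more than min_freq times"; CRFsuite applies the cut-off internally to its own feature representation, and the helper names attribute and apply_min_freq are hypothetical.

```python
from collections import Counter

def attribute(name, value):
    # Assumption for this sketch: a string-valued feature counts as one
    # attribute per distinct value, e.g. "word:Lai".
    return f"{name}:{value}" if isinstance(value, str) else name

def apply_min_freq(X, min_freq=0):
    """Drop attributes whose count in X is no greater than min_freq.

    X: list of sentences, each a list of feature dicts.
    """
    counts = Counter(
        attribute(k, v) for sent in X for feats in sent for k, v in feats.items()
    )
    return [
        [{k: v for k, v in feats.items() if counts[attribute(k, v)] > min_freq}
         for feats in sent]
        for sent in X
    ]

X = [[{"word": "Lai", "BOS": True}, {"word": "Lai"}, {"word": "Taipei"}]]
filtered = apply_min_freq(X, min_freq=1)
print(filtered)
```

With min_freq=0, the default, nothing is removed, because every attribute that appears has a count of at least 1.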
Parameters
• c1 (float, optional (default=0)):
• The coefficient for L1 regularization. If a non-zero value is specified,
CRFsuite switches to the Orthant-Wise Limited-memory Quasi-Newton
(OWL-QN) method. The default value is zero (no L1 regularization).
• Supported training algorithms: lbfgs
• c2 (float, optional (default=1.0)):
• The coefficient for L2 regularization.
• Supported training algorithms: l2sgd, lbfgs
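For orientation, the role of c1 and c2 with the lbfgs algorithm can be written as a penalized negative log-likelihood. The notation (w for the feature weights, N training sequences) is introduced here and is not from the slides; treat the exact scaling of the terms as an assumption and check the CRFsuite documentation if it matters.

```latex
\min_{w}\;
  -\sum_{n=1}^{N} \log p\!\left(y^{(n)} \mid x^{(n)};\, w\right)
  \;+\; c_1 \lVert w \rVert_1
  \;+\; c_2 \lVert w \rVert_2^2
```

With c1 = 0 only the L2 term remains. With c1 > 0 the L1 term is not differentiable at zero, which is why a non-zero c1 needs the OWL-QN variant.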
Prediction
Code
y_pred = crf.predict(X_test)
y_pred
Evaluation
Code
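from sklearn_crfsuite import metrics

labels = list(crf.classes_)
# 'O' dominates the data, so leave it out of the report.
labels.remove('O')
# Sort by entity type first, then by B/I, so B-X and I-X are adjacent.
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))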
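The sort key above orders labels by everything after the first character (the entity type) and only then by the B/I prefix, so the report lists B- and I- rows of the same type next to each other. A small standalone check, with label names made up for the example:

```python
# Hypothetical label set; the real one comes from crf.classes_.
labels = ["I-LOC", "B-PER", "B-LOC", "I-PER"]

sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
print(sorted_labels)  # ['B-LOC', 'I-LOC', 'B-PER', 'I-PER']
```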
word2features
sent2features
Code
def word2features(sent, i):
    # Features of the current token.
    word = sent[i][0]
    features = {
        'word': word,
        'word.islower()': word.islower(),
        'word.isalnum()': word.isalnum(),
        'word.isupper()': word.isupper(),
        'word.isdigit()': word.isdigit(),
        'word.isalpha()': word.isalpha()}
    # Features of the previous token, or a begin-of-sentence flag.
    if i > 0:
        word1 = sent[i-1][0]
        features.update({
            '-1:word': word1,
            '-1:word.islower()': word1.islower(),
            '-1:word.isalnum()': word1.isalnum(),
            '-1:word.isupper()': word1.isupper(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.isalpha()': word1.isalpha()})
    else:
        features['BOS'] = True
    # Features of the next token, or an end-of-sentence flag.
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word': word1,
            '+1:word.islower()': word1.islower(),
            '+1:word.isalnum()': word1.isalnum(),
            '+1:word.isupper()': word1.isupper(),
            '+1:word.isdigit()': word1.isdigit(),
            '+1:word.isalpha()': word1.isalpha()})
    else:
        features['EOS'] = True
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

features = sent2features(train_sents[0])
features
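word2features applies the same six tests at three positions (previous, current, next token). One possible way to write that with less repetition is sketched below; word_shape and word2features_compact are hypothetical names, and the sketch is meant to produce the same keys as the function above rather than replace it.

```python
def word_shape(word, prefix=""):
    """The six per-word features, with an optional position prefix."""
    feats = {prefix + "word": word}
    for test in ("islower", "isalnum", "isupper", "isdigit", "isalpha"):
        # e.g. key "-1:word.islower()" holds word.islower()
        feats[prefix + "word." + test + "()"] = getattr(word, test)()
    return feats

def word2features_compact(sent, i):
    features = word_shape(sent[i][0])
    if i > 0:
        features.update(word_shape(sent[i - 1][0], prefix="-1:"))
    else:
        features["BOS"] = True   # first token of the sentence
    if i < len(sent) - 1:
        features.update(word_shape(sent[i + 1][0], prefix="+1:"))
    else:
        features["EOS"] = True   # last token of the sentence
    return features

# Toy sentence with made-up labels.
sent = [("Lai", "B-PER"), ("2018", "O")]
first = word2features_compact(sent, 0)
last = word2features_compact(sent, 1)
print(sorted(first))
```

The first token gets BOS and the "+1:" features but no "-1:" features; the last token gets EOS and the "-1:" features.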
Load Training and Test Sets
X_train
Code
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
Learn CRF Model
Code
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    min_freq=0
)
crf.fit(X_train, y_train)
Prediction
Code
y_pred = crf.predict(X_test)
y_pred
Evaluation
Code
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))
Dump Prediction
Code
def write_iob2(data, pred, out_iob2_file):
    # newline='\n' keeps '\n' line endings on Windows as well.
    with open(out_iob2_file, 'w', encoding='utf8', newline='\n') as iob2_writer:
        for sent, sent_pred in zip(data, pred):
            for columns, label in zip(sent, sent_pred):
                # Append the predicted label as an extra last column.
                iob2_writer.write('\t'.join(columns) + '\t' + label + '\n')
            # A blank line ends each sentence.
            iob2_writer.write('\n')

write_iob2(test_sents, y_pred, 'pred_sample_data.iob2')
Data Format Converter
• Convert corpus to conll:
• Inputs: "sample_data/txt", "sample_data/annotations.tsv"
• Output: "sample_data.iob2"
• Command:
• python corpus2conll.py "sample_data/txt" "sample_data/annotations.tsv" "sample_data.iob2"
• Convert conll to corpus:
• Inputs: "sample_data/txt", "sample_data.iob2"
• Output: "sample_data.tsv"
• Command:
• python conll2corpus.py "sample_data/txt" "sample_data.iob2" "sample_data.tsv"
conll2corpus.py
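Going from IOB2 tags back to entity annotations means collecting each B- tag together with the I- tags of the same type that follow it. The sketch below shows that idea on token indices; it is not conll2corpus.py, and the function name and the (start, end, type) output layout are assumptions.

```python
def iob2_to_spans(tags):
    """Return (start, end, entity_type) token spans, end exclusive.

    An I- tag that does not continue an open entity of the same type
    is ignored in this sketch.
    """
    spans = []
    start, etype = None, None
    # The extra "O" at the end closes an entity that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

spans = iob2_to_spans(["B-PER", "I-PER", "O", "O", "B-LOC"])
print(spans)  # [(0, 2, 'PER'), (4, 5, 'LOC')]
```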
Q&A
Po-Ting Lai (賴柏廷)
potinglai@iis.sinica.edu.tw
Intelligent Agent Systems Lab. (IASL)
Institute of Information Science,
Academia Sinica,
Nankang, Taipei
Research Assistant
