
Named Entity Recognition for Dummies
Po-Ting Lai (賴柏廷)
potinglai@iis.sinica.edu.tw
Intelligent Agent Systems Lab. (IASL)
Institute of Information Science, Academia Sinica,
Nankang, Taipei
Research Assistant
Outline
• Set up environment (Anaconda)
• Named entity recognition task
• Implement named entity recognition
Anaconda
• Available Packages

• See details in
https://docs.anaconda.com/anaconda/packages/py3.6_win-64.html

Set up environment
• Install Anaconda
• Install sklearn-crfsuite
• “Hello World!” (in Jupyter)
Install “Anaconda” 1/3
Install “Anaconda” 2/3
Install “Anaconda” 3/3
Environment for the Lab
• Anaconda 5.2 (Python 3.6 version)
• https://www.anaconda.com/download/
• Windows 10, 64-bit

Install “sklearn-crfsuite”

• Open Anaconda Prompt (Anaconda3)



Install sklearn-crfsuite
• pip install sklearn-crfsuite
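Once pip finishes, one way to confirm that the package is visible to the Python interpreter is to look it up without importing it. This is a minimal sketch using only the standard library; the helper name `is_installed` is made up for this illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


# sklearn-crfsuite is imported under the module name "sklearn_crfsuite".
print("sklearn_crfsuite installed:", is_installed("sklearn_crfsuite"))
```

If this prints False, the install most likely went to a different Python environment than the one running the check.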
“Hello World!” (in Jupyter) 1/4

• Enter “jupyter notebook”


“Hello World!” (in Jupyter) 2/4
“Hello World!” (in Jupyter) 3/4

After entering print("Hello World!"), press Shift+Enter to execute the code in the cell
“Hello World!” (in Jupyter) 4/4
• Shift+Enter: Execute the cell
• ESC+A: Create a new cell above the current cell
• ESC+B: Create a new cell below the current cell
• ESC+H: Display all keyboard shortcuts
Outline
• Set up environment (Anaconda)
• Named entity recognition task
• Implement named entity recognition
Our NER Task

https://aidea-web.tw/moe
Data Example
Data Tsv Example
Data Format Converter
• Convert corpus to conll:
• Inputs: "sample_data/txt", "sample_data/annotations.tsv"
• Output: "sample_data.iob2"
• Command:
• python corpus2conll.py "sample_data/txt" "sample_data/annotations.tsv" "sample_data.iob2"
• Convert conll to corpus:
• Inputs: "sample_data/txt", "sample_data.iob2"
• Output: "sample_data.tsv"
• Command:
• python conll2corpus.py "sample_data/txt" "sample_data.iob2" "sample_data.tsv"

corpus2conll.py
conll2corpus.py
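The contents of the two converter scripts are not reproduced in the text, so the following is only a sketch of the general idea behind a corpus-to-IOB2 conversion, not the author's implementation. It assumes tokens and entity annotations are both given as character offsets; the function name `spans_to_iob2`, the input layout, and the entity type "name" are hypothetical.

```python
def spans_to_iob2(tokens, entities):
    """Tag tokens with IOB2 labels.

    tokens:   list of (text, start, end) character offsets (end exclusive)
    entities: list of (start, end, entity_type) character spans
    """
    labels = []
    for _text, tok_start, tok_end in tokens:
        label = "O"
        for ent_start, ent_end, ent_type in entities:
            # A token belongs to an entity if it lies inside the entity span.
            if tok_start >= ent_start and tok_end <= ent_end:
                # The first token of an entity gets B-, the rest get I-.
                prefix = "B" if tok_start == ent_start else "I"
                label = prefix + "-" + ent_type
                break
        labels.append(label)
    return labels


# Hypothetical example data, made up for illustration.
tokens = [("Dr", 0, 2), ("Po-Ting", 3, 10), ("Lai", 11, 14), ("spoke", 15, 20)]
entities = [(3, 14, "name")]
for (text, _start, _end), label in zip(tokens, spans_to_iob2(tokens, entities)):
    print(text + "\t" + label)
```

Each printed line is one token and its label separated by a tab, which is the row shape the loading code in the following slides expects.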
Some Tools for NER
• CRF:
• CRF++: https://taku910.github.io/crfpp/
• Sklearn-crfsuite: https://sklearn-crfsuite.readthedocs.io/en/latest/

• Deep Learning:
• GRAM-CNN: https://github.com/valdersoul/GRAM-CNN
• Anago: https://github.com/Hironsan/anago
Load Data
Code
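def read_iob2_sents(in_iob2_file):
    """Read an IOB2 file into a list of sentences, each a list of column tuples."""
    sents = []
    with open(in_iob2_file, 'r', encoding='utf8') as f:
        # Sentences are separated by blank lines
        for sent_iob2 in f.read().split('\n\n'):
            sent = []
            for raw in sent_iob2.split('\n'):
                if raw == '':
                    continue
                # Columns within a row are tab-separated
                columns = raw.split('\t')
                sent.append(tuple(columns))
            # Skip empty chunks (e.g. after the file's trailing newline)
            if sent:
                sents.append(sent)
    return sents

train_sents = read_iob2_sents('sample_data.iob2')
train_sents[0]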
Outline
• Set up environment (Anaconda)
• Named entity recognition task
• Implement named entity recognition
Load Training and Test Sets
X_train and y_train
Code
train_sents = read_iob2_sents('sample_data.iob2')
# The same sample file is loaded as the test set here, so the scores
# reported later are not a held-out estimate.
test_sents = read_iob2_sents('sample_data.iob2')

X_train = [sent2tokens(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2tokens(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]
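The code above calls `sent2tokens` and `sent2labels`, but their definitions do not appear in the text. A minimal sketch follows, assuming each row is a tuple as returned by `read_iob2_sents`, with the token in the first column and the IOB2 label in the last column. The column layout is an assumption, and the toy sentence is made up.

```python
def sent2tokens(sent):
    """Return the token (first column) of every row in a sentence."""
    return [row[0] for row in sent]


def sent2labels(sent):
    """Return the IOB2 label (assumed to be the last column) of every row."""
    return [row[-1] for row in sent]


# Toy sentence in the shape returned by read_iob2_sents (values are made up).
toy_sent = [("Taipei", "B-location"), ("is", "O"), ("rainy", "O")]
print(sent2tokens(toy_sent))
print(sent2labels(toy_sent))
```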
Learn CRF Model
Code
import sklearn
import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    min_freq=0
)
crf.fit(X_train, y_train)
Parameters
• https://sklearn-crfsuite.readthedocs.io/en/latest/api.html

• algorithm:
• 'lbfgs' - Gradient descent using the L-BFGS method
• 'l2sgd' - Stochastic Gradient Descent with L2 regularization term
• 'ap' - Averaged Perceptron
• 'pa' - Passive Aggressive (PA)
• 'arow' - Adaptive Regularization Of Weight Vector (AROW)
• min_freq (float, optional (default=0)):
• Cut-off threshold for occurrence frequency of a feature. CRFsuite
will ignore features whose frequencies of occurrences in the
training data are no greater than min_freq. The default is no cut-off.
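To make the cut-off rule concrete, here is a rough sketch of what "ignore features whose frequency is no greater than min_freq" means. It counts feature names only, which is a simplification made for this illustration; CRFsuite applies the threshold to its own internal feature representation. The function name `filter_rare_features` is hypothetical.

```python
from collections import Counter


def filter_rare_features(X, min_freq):
    """Drop features that occur min_freq times or fewer in the training data."""
    counts = Counter()
    for sent in X:
        for token_features in sent:
            counts.update(token_features.keys())
    return [
        [{k: v for k, v in feats.items() if counts[k] > min_freq} for feats in sent]
        for sent in X
    ]


# Toy data: "a" occurs three times, "b" and "c" once each.
X_toy = [[{"a": 1, "b": 1}, {"a": 1}], [{"a": 1, "c": 1}]]
print(filter_rare_features(X_toy, min_freq=1))
```

With min_freq=0, every feature that occurs at least once is kept, which matches "no cut-off".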
Parameters
• c1 (float, optional (default=0)):
• The coefficient for L1 regularization. If a non-zero value is specified,
CRFsuite switches to the Orthant-Wise Limited-memory Quasi-Newton
(OWL-QN) method. The default value is zero (no L1 regularization).
• Supported training algorithms: lbfgs
• c2 (float, optional (default=1.0)):
• The coefficient for L2 regularization.
• Supported training algorithms: l2sgd, lbfgs
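As a rough illustration of how the two coefficients act, the sketch below computes a combined L1 and L2 penalty on a weight vector in the usual elastic-net form. The exact scaling CRFsuite uses internally is not stated in the slides, so treat the formula and the name `regularization_penalty` as assumptions.

```python
def regularization_penalty(weights, c1=0.0, c2=1.0):
    """Return c1 * sum(|w|) + c2 * sum(w^2) for a list of feature weights."""
    l1 = sum(abs(w) for w in weights)
    l2 = sum(w * w for w in weights)
    return c1 * l1 + c2 * l2


# Made-up weights; with c1=c2=0.1 as in the training call above.
print(regularization_penalty([0.5, -2.0, 0.0], c1=0.1, c2=0.1))
```

A larger c1 pushes more weights to exactly zero, while a larger c2 shrinks all weights toward zero without removing them.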
Prediction
Code
y_pred = crf.predict(X_test)
y_pred
Evaluation
Code
from sklearn_crfsuite import metrics

labels = list(crf.classes_)
labels.remove('O')
# Group B- and I- labels of the same entity type together
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(y_test, y_pred,
                                         labels=sorted_labels, digits=3))
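To show what the sort key and the flattened report are doing, here is a small standard-library sketch. It sorts a made-up label set with the same key, then computes token-level precision, recall and F1 for each label by hand. The helper `flat_prf` and the toy sequences are hypothetical and are not part of sklearn-crfsuite.

```python
from itertools import chain


def flat_prf(y_true, y_pred, label):
    """Token-level precision, recall and F1 for one label over flattened sequences."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    tp = sum(t == label and p == label for t, p in zip(true_flat, pred_flat))
    n_pred = pred_flat.count(label)
    n_true = true_flat.count(label)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Same key as above: sort by entity type first, then by the B/I prefix.
labels = ["I-time", "B-name", "B-time", "I-name"]
sorted_labels = sorted(labels, key=lambda name: (name[1:], name[0]))
print(sorted_labels)

y_true_toy = [["B-name", "I-name", "O"], ["B-time", "O"]]
y_pred_toy = [["B-name", "O", "O"], ["B-time", "B-name"]]
for label in sorted_labels:
    print(label, flat_prf(y_true_toy, y_pred_toy, label))
```

These scores are per token, so an entity counts as partly correct even when only some of its tokens are tagged correctly.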
word2features
sent2features
Code
def word2features(sent, i):
    word = sent[i][0]
    # Features of the current token
    features = {
        'word': word,
        'word.islower()': word.islower(),
        'word.isalnum()': word.isalnum(),
        'word.isupper()': word.isupper(),
        'word.isdigit()': word.isdigit(),
        'word.isalpha()': word.isalpha()}
    if i > 0:
        # Features of the previous token
        word1 = sent[i-1][0]
        features.update({
            '-1:word': word1,
            '-1:word.islower()': word1.islower(),
            '-1:word.isalnum()': word1.isalnum(),
            '-1:word.isupper()': word1.isupper(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.isalpha()': word1.isalpha()})
    else:
        # Beginning of sentence
        features['BOS'] = True
    if i < len(sent)-1:
        # Features of the next token
        word1 = sent[i+1][0]
        features.update({
            '+1:word': word1,
            '+1:word.islower()': word1.islower(),
            '+1:word.isalnum()': word1.isalnum(),
            '+1:word.isupper()': word1.isupper(),
            '+1:word.isdigit()': word1.isdigit(),
            '+1:word.isalpha()': word1.isalpha()})
    else:
        # End of sentence
        features['EOS'] = True
    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

features = sent2features(train_sents[0])
features
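The three feature groups in `word2features` differ only in their key prefix, so the same feature names can be produced with a shared helper. This is a compact variant written for illustration, not the author's code; the names `token_shape` and `word2features_compact` and the toy sentence are made up.

```python
def token_shape(word, prefix=""):
    """Surface features of one token, with an optional position prefix on each key."""
    return {
        prefix + "word": word,
        prefix + "word.islower()": word.islower(),
        prefix + "word.isalnum()": word.isalnum(),
        prefix + "word.isupper()": word.isupper(),
        prefix + "word.isdigit()": word.isdigit(),
        prefix + "word.isalpha()": word.isalpha(),
    }


def word2features_compact(sent, i):
    """Features of token i plus its left and right neighbours."""
    features = token_shape(sent[i][0])
    if i > 0:
        features.update(token_shape(sent[i - 1][0], "-1:"))
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        features.update(token_shape(sent[i + 1][0], "+1:"))
    else:
        features["EOS"] = True  # end of sentence
    return features


# Toy sentence: rows are (token, label) tuples; the values are made up.
toy_sent = [("Taipei", "B-location"), ("101", "I-location")]
print(word2features_compact(toy_sent, 0))
```

Adding a wider context window then only requires more calls with prefixes such as "-2:" or "+2:".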
Load Training and Test Sets
X_train
Code
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]


y_test = [sent2labels(s) for s in test_sents]
Learn CRF Model
Code
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    min_freq=0
)
crf.fit(X_train, y_train)
Prediction
Code
y_pred = crf.predict(X_test)
y_pred
Evaluation
Code
print(metrics.flat_classification_report(y_test, y_pred,
                                         labels=sorted_labels, digits=3))
Dump Prediction
Code
def write_iob2(data, pred, out_iob2_file):
    with open(out_iob2_file, 'wb') as iob2_writer:
        for sent, sent_pred in zip(data, pred):
            for row, label in zip(sent, sent_pred):
                # Original columns, then the predicted label as the last column
                iob2_writer.write(bytes('\t'.join(row) + '\t' +
                                        label + '\n', encoding='utf8'))
            # A blank line marks the end of a sentence
            iob2_writer.write(bytes('\n', encoding='utf8'))

write_iob2(test_sents, y_pred, 'pred_sample_data.iob2')
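To preview what the dumped file looks like without touching the disk, the same layout can be built as a string. This is a sketch for illustration; the helper name `format_iob2` and the toy rows are made up.

```python
def format_iob2(data, pred):
    """Return the text write_iob2 would produce: one row per token, blank line per sentence."""
    lines = []
    for sent, sent_pred in zip(data, pred):
        for row, label in zip(sent, sent_pred):
            # Original columns followed by the predicted label
            lines.append("\t".join(row) + "\t" + label)
        lines.append("")  # blank line ends the sentence
    return "\n".join(lines) + "\n"


# Toy input: one sentence with gold labels, plus made-up predictions.
toy_data = [[("Taipei", "B-location"), ("is", "O")]]
toy_pred = [["B-location", "O"]]
print(format_iob2(toy_data, toy_pred))
```

Each output row keeps the gold label in its original column and appends the prediction as the final column.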
Data Format Converter
• Convert corpus to conll:
• Inputs: "sample_data/txt", "sample_data/annotations.tsv"
• Output: "sample_data.iob2"
• Command:
• python corpus2conll.py "sample_data/txt" "sample_data/annotations.tsv" "sample_data.iob2"
• Convert conll to corpus:
• Inputs: "sample_data/txt", "sample_data.iob2"
• Output: "sample_data.tsv"
• Command:
• python conll2corpus.py "sample_data/txt" "sample_data.iob2" "sample_data.tsv"

conll2corpus.py
Q&A
