PRBoost: Prompt-Based Rule Discovery and Boosting for Interactive Weakly-Supervised Learning
Figure 1: Overall framework for PRBoost. In each iteration, PRBoost (1) identifies large-error instances from the limited clean data and converts each large-error instance to a prompt template for prompting-based rule discovery; (2) presents candidate rules to human experts for annotation and uses accepted rules to generate new weak labels; (3) trains a new weak model with self-training and ensembles it with the previous models.
weakly labeled dataset Dr. Then we self-train the weak model mt+1 and integrate it into the ensemble model.

4.1 Candidate Rule Generation

Target rule proposal on large-error instances  We design a boosting-style (Hastie et al., 2009) strategy for generating prompt-based candidate rules. This strategy iteratively checks feature regimes in which the current model mt is weak, and proposes candidate rules from such regimes. We use the small labeled dataset Dl to identify hard instances, i.e., instances on which the model tends to make cumulative mistakes during iterative learning. The discovered rules can complement the current rule set R and refine the weak labels, so the next model mt+1 trained on the refined weakly labeled data can perform better in the weak regimes.

We initialize the weights of the instances in Dl as wi = 1/|Dl|, i = 1, 2, ..., |Dl|. During the iterative model learning process, each wi is updated according to the model's weighted loss on instance xi ∈ Dl. Specifically, in iteration t ∈ {1, ..., n}, we weigh the samples by

wi ← wi · exp(αt · I(yi ≠ mt(xi))),  i = 1, 2, ..., |Dl|.  (1)

In Equation 1, αt is the weight of model mt, which will be used both for detecting hard instances and for model ensembling (Section 4.3). We compute αt from the model's error rate on Dl:

αt = log((1 − errt) / errt) + log(K − 1),  (2)

where errt is given by

errt = Σ_{i=1}^{|Dl|} wi I(yi ≠ mt(xi)) / Σ_{i=1}^{|Dl|} wi.  (3)

Intuitively, a sample xi receives a larger weight wi (Equation 1) if the model ensemble consistently makes mistakes on xi. A large error is often caused by poor coverage (unlabeled instances matched by few or no rules) or by dominating noise in the local feature regimes (rule-matched labels are wrong). The weights can thus guide the rule generator to target the top-n large-error instances Xe = {x^e_i}_{i=1}^n. By proposing rules from such instances, we aim to discover novel rules that complement the current rule set and model ensemble most effectively.
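The weighting scheme above follows a standard multi-class boosting formulation. A minimal NumPy sketch of Equations 1-3 is given below; function and variable names are illustrative and not taken from the paper's released code.

```python
import numpy as np

def boosting_round(w, y_true, y_pred, num_classes):
    """One boosting-style weighting round on the clean set D_l (Eqs. 1-3).

    w       : current instance weights, shape (|D_l|,)
    y_true  : gold labels on D_l
    y_pred  : predictions of the current weak model m_t on D_l
    returns : (updated weights, model weight alpha_t, weighted error err_t)
    """
    mistakes = (y_true != y_pred).astype(float)            # I(y_i != m_t(x_i))
    err_t = np.sum(w * mistakes) / np.sum(w)               # Eq. 3
    err_t = np.clip(err_t, 1e-8, 1 - 1e-8)                 # guard against log(0)
    alpha_t = np.log((1 - err_t) / err_t) + np.log(num_classes - 1)  # Eq. 2
    w = w * np.exp(alpha_t * mistakes)                     # Eq. 1: up-weight mistakes
    return w, alpha_t, err_t

def top_large_error_instances(w, n):
    """Indices of the top-n large-error instances used to seed rule proposal."""
    return np.argsort(-w)[:n]

# Weights start uniform, as in the paper: w = np.full(len(y_true), 1.0 / len(y_true))
```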
Prompt-based rule proposal  For a wide range of NLP tasks such as relation extraction and text classification, we can leverage prompts to construct informative rule templates, which naturally lead to expressive labeling rules for WSL. Motivated by this, we design a rule proposal module based on PLM prompting. We present concrete examples of our prompt-based rules in Table 1.
Input:  Microsoft is an American technology corporation founded by Bill Gates.
Prompt: [Input] The Person Bill Gates [Mask] the Organization Microsoft.
Rule:   {Entity Pair == (Person, Org)} ∧ {[Mask] == founded} ∧ {st,j ≥ threshold} → per:found

Input:  Marvell Software Solutions Israel is a wholly owned subsidiary of Marvell Technology Group.
Prompt: [Input] The Marvell Software Solutions Israel is a [Mask].
Rule:   {[Mask] == subsidiary ∨ corporation ∨ company} ∧ {st,j ≥ threshold} → Company

Input:  Liverpool short of firepower for crucial encounter. Rafael Benitez must gamble with Liverpools Champions League prospects tonight but lacks the ammunition to make it a fair fight.
Prompt: [Mask] News: [Input]
Rule:   {[Mask] == Liverpool ∨ Team ∨ Football ∨ Sports} ∧ {st,j ≥ threshold} → Sports

Table 1: Examples of prompt-based rules for relation extraction, ontology classification, and news topic classification. Here [Input] denotes the original input, [Mask] denotes the mask token, and ∧, ∨ are logical operators. Bold words indicate the ground-truth label of the original input.
The input instance comes from the large-error instances identified on the clean dataset Dl. For each task, we have a task-specific template to reshape the original input for prompting PLMs. The resulting prompt typically includes the original input as the context and a mask token to be filled by the PLMs. The final rule encompasses multiple atomic parts to capture different views of the information. Each rule is accompanied by the ground-truth label of the original input instance; this label will be assigned to the unlabeled instances matched by the rule.

For example, as shown in Table 1, the prompt for the relation extraction task can be "entity [MASK] entity", which rephrases the original input using relation phrases while keeping the key semantics. Take news topic classification as another example: by filling the masked slot in the prompt, PLMs propose candidate keyword-based rules for topic classification. Different from rules extracted from surface patterns of the corpus (e.g., n-gram rules), such a prompt-based rule proposal can generate words that do not appear in the original inputs; this capability is important for model generalization.

Given a large-error instance x^e_i ∈ Xe, we first convert it into a prompt x^p_i = τ(x^e_i). Such a prompt consists of the key components of the original input and a [MASK] token. By inheriting the original input, we construct context for the [MASK] token to be predicted by a pre-trained LM M. To complete the rule, we feed each x^p_i to M to obtain the probability distribution of the [MASK] token over the vocabulary V:

p([MASK] = v̂ | x^p_i) = exp(v̂ · M(x^p_i)) / Σ_{v∈V} exp(v · M(x^p_i)),  (4)

where M(·) denotes the output vector of M, v is the embedding of a token in the vocabulary V, and v̂ is the embedding of the predicted masked token. We collect the top-k predictions with the highest p([MASK] = v̂ | x^p_i) to form the candidate rules. By filling the rules based on x^e_i with the prompt predictions, we obtain the candidate rule set in iteration t, denoted as Rt = {rj}_{j=1}^B.
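To make the prompting step concrete, the sketch below uses the HuggingFace fill-mask pipeline as a stand-in for the masked LM M and reads off the top-k [MASK] predictions as candidate rule keywords (Equation 4). The template and the rule fields are simplified from Table 1; all names here are illustrative, not the paper's released code.

```python
from transformers import pipeline

# Stand-in for the pre-trained LM M; any masked LM with a fill-mask head works.
mask_filler = pipeline("fill-mask", model="roberta-base")

def propose_candidate_rules(instance_text, template, label, top_k=10):
    """Turn a large-error instance into a prompt x^p = tau(x^e) and collect
    the top-k [MASK] predictions as candidate rule keywords (Eq. 4)."""
    prompt = template.format(input=instance_text)
    predictions = mask_filler(prompt, top_k=top_k)
    # Each candidate rule pairs a predicted mask word with the instance's gold label.
    return [{"keyword": p["token_str"].strip(), "score": p["score"], "label": label}
            for p in predictions]

# Example with the news-topic template of Table 1 ("[Mask] News: [Input]"):
candidates = propose_candidate_rules(
    "Liverpool short of firepower for crucial encounter.",
    template="<mask> News: {input}",   # RoBERTa's mask token is <mask>
    label="Sports",
)
```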
4.2 Rule Annotation and Matching

Interactive rule evaluation  As the candidate rules Rt can still be noisy, PRBoost presents Rt to human experts to select high-quality rules. Specifically, for each candidate rule rj ∈ Rt, we present it along with its prompt template x^p_j to human experts, who judge whether the rule rj should be accepted. Formally, rj is associated with a label dj ∈ {1, 0}. When a rule is accepted (dj = 1), it is incorporated into the accepted rule set R^+_t for later weak label generation.

Weak Label Generation  After human evaluation, the accepted rules R^+_t are used to match unlabeled instances in Du. We design a mixed soft-matching procedure for matching rules with unlabeled instances, which combines embedding-based similarity and prompt-based vocabulary similarity. The two similarities complement each other: the embedding-based similarity captures global semantics, while the prompt-based similarity captures local features in terms of vocabulary overlap. Given a rule rj ∈ R^+_t and an unlabeled instance xu ∈ Du, we detail the computation of the two similarities below.

First, the embedding similarity is computed as the cosine similarity between the rule and instance embeddings (Zhou et al., 2020):

s^a_j = (eu · e_{rj}) / (‖eu‖ · ‖e_{rj}‖),  (5)

where eu is the instance embedding of xu and e_{rj} is the rule embedding of rj; both embeddings are obtained from a PLM encoder.

Next, to compute the prompt-based similarity, we feed τ(xu) into the prompting model (Equation 4) and use the top-k candidates at the [MASK] position as the predicted vocabulary for instance xu.
We measure the vocabulary overlap between Vu and V_{rj} as

s^b_j = |Vu ∩ V_{rj}| / k,  (6)

where Vu is the vocabulary of instance xu and V_{rj} is the vocabulary of rule rj. Note that for the unlabeled instance we have |Vu| = k, while for the rule we have |V_{rj}| ≤ k, because human annotators may abstain from some candidate predictions.

The final matching score is computed by combining the above two similarities:

sj = α s^a_j + (1 − α) s^b_j.  (7)

The instance xu is matched by the rule rj if sj is higher than the matching threshold σ obtained on the development set. When xu is matched by multiple rules that provide conflicting labels, we use the one with the highest matching score to assign the weak label. If the matching score sj is lower than σ for all j ∈ {1, ..., k}, we abstain from labeling the instance xu.
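A minimal sketch of the mixed soft matching in Equations 5-7 follows. The embedding and vocabulary fields are placeholders for whatever PLM encoder and rule representation are used; the abstention behavior mirrors the description above, and the names are ours.

```python
import numpy as np

def cosine(u, v):
    """Embedding similarity s^a_j (Eq. 5)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def match_score(inst, rule, k, alpha=0.5):
    """Mixed matching score s_j combining Eqs. 5-7."""
    s_a = cosine(inst["emb"], rule["emb"])                   # global semantics
    s_b = len(set(inst["vocab"]) & set(rule["vocab"])) / k   # Eq. 6: vocabulary overlap
    return alpha * s_a + (1 - alpha) * s_b                   # Eq. 7

def assign_weak_label(inst, accepted_rules, sigma, k, alpha=0.5):
    """Label an unlabeled instance with the best-matching accepted rule;
    return None (abstain) if no rule clears the threshold sigma."""
    if not accepted_rules:
        return None
    scored = [(match_score(inst, r, k, alpha), r["label"]) for r in accepted_rules]
    best_score, best_label = max(scored)
    return best_label if best_score >= sigma else None
```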
4.3 Model Training & Ensemble

In iteration t, with the new rule-matched data Dr, we obtain an enlarged weakly labeled dataset Dt = Dt−1 ∪ Dr. We fit a weak model mt on Dt by optimizing

min_θ (1/|Dt|) Σ_{(xi, ŷi)∈Dt} ℓCE(mt(xi), ŷi),  (8)

where ŷi is the weak label for instance xi and ℓCE is the cross-entropy loss.

While the weakly labeled dataset has been enlarged, there are still unmatched instances in Du. To exploit such unlabeled and unmatched instances, we adopt the self-training technique for weak model training (Lee, 2013). The self-training process can propagate information from the matched weak labels to the unmatched instances to improve the model mt. Following previous models (Xie et al., 2016; Yu et al., 2021b), for each instance xi ∈ Du, we generate a soft pseudo-label ỹij from the current model mt:

ỹij = (q²ij / fj) / Σ_{j′∈Y} (q²ij′ / fj′),  fj = Σi qij,  (9)

where qi = mt(xi) is a probability vector such that qi ∈ R^K, and qij is its j-th entry, j ∈ {1, ..., K}. The above process yields a pseudo-labeled dataset D̃u. We update mt by optimizing

Lc(mt, ỹ) = (1/|D̃u|) Σ_{x∈D̃u} DKL(ỹ ‖ mt(x)),  (10)

where DKL(P ‖ Q) = Σk pk log(pk / qk) is the Kullback–Leibler divergence.

Finally, we incorporate the self-trained weak model into the ensemble model. The final model is a weighted ensemble of the weak models:

fθ(·) = Σ_{t=1}^{n} αt mt,  (11)

where a weak model mt with a low error rate errt is assigned a higher coefficient αt according to Equation 2.
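A PyTorch-style sketch of the target sharpening and KL objective in Equations 9 and 10, following the formulation of Xie et al. (2016); the tensor names are ours and the snippet assumes the weak model outputs logits.

```python
import torch
import torch.nn.functional as F

def sharpen_targets(q):
    """Soft pseudo-labels (Eq. 9): square the class probabilities, normalize by
    the soft class frequency f_j, then renormalize per instance.
    q: (batch, K) probabilities from the current weak model m_t."""
    f = q.sum(dim=0)                                  # f_j = sum_i q_ij
    weight = q ** 2 / f                               # q_ij^2 / f_j
    return weight / weight.sum(dim=1, keepdim=True)

def self_training_loss(logits, q):
    """KL(y_tilde || m_t(x)) averaged over the pseudo-labeled batch (Eq. 10)."""
    targets = sharpen_targets(q).detach()             # pseudo-labels are held fixed
    log_p = F.log_softmax(logits, dim=1)
    return F.kl_div(log_p, targets, reduction="batchmean")
```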
5 Experiments

5.1 Experiment Setup

Tasks and Datasets  We conduct experiments on four benchmark datasets: TACRED (Zhang et al., 2017) for relation extraction, DBPedia (Zhang et al., 2015) for ontology classification, ChemProt (Krallinger et al., 2017) for chemical-protein interaction classification, and AG News (Zhang et al., 2015) for news topic classification. For the initial weak supervision sources, we use the labeling rules provided by existing works: Zhou et al. (2020) for TACRED, Meng et al. (2020) for DBPedia, and Zhang et al. (2021) for ChemProt and AG News. The statistics of the four datasets are shown in Table 5. For the development set, we do not directly use the full development set, as suggested by recent works (Gao et al., 2021; Perez et al., 2021). This prevents the model from taking advantage of the massive number of labeled data in the development set. Instead, we create a real label-scarce scenario and keep the number of samples in the validation set Dv the same as in the limited clean labeled set Dl, namely |Dv| = |Dl|.

Baselines  We include three groups of baselines.

Fully Supervised Baseline: PLM: We use the pre-trained language model RoBERTa-base (Liu et al., 2019) as the backbone and fine-tune it with the full clean labeled data, except for ChemProt. On ChemProt, we choose BioBERT (Lee et al., 2020) as the backbone for all the baselines and our model to better adapt to this domain-specific task. The performance of fully supervised methods serves as an upper bound for weakly-supervised methods.

Weakly Supervised Baselines: (1) Snorkel (Ratner et al., 2017) is a classic WSL model. It aggregates different labeling functions with probabilistic models, then feeds the aggregated labels to a PLM for the target task. (2) LOTClass (Meng et al., 2020) is a recent model for weakly-supervised text classification. It uses label names to probe PLMs to generate weak labels, and performs self-training using the weak labels for classification. (3) COSINE (Yu et al., 2021b)
Method (Metrics)                  TACRED (F1)        DBpedia (Acc.)   ChemProt (Acc.)   AG News (Acc.)
Supervised Baselines
PLM w. 100% training data         66.9 (66.3/67.6)   99.4             79.7              94.4
PLM w. limited training data†     32.9 (40.8/27.6)   98.0             59.4              86.4
Weakly Supervised Baselines
Rule Matching                     20.1 (85.0/11.4)   63.2             46.9              52.3
Snorkel (Ratner et al., 2017)     39.7 (39.2/40.1)   69.5             56.4              86.2
LOTClass (Meng et al., 2020)      —                  91.1             —                 86.4
COSINE (Yu et al., 2021b)         39.5 (38.9/40.3)   73.1             59.8              87.5
Snorkel + fine-tuning†            40.8 (41.0/40.6)   97.6             64.9              87.7
LOTClass + fine-tuning†           —                  98.1             —                 88.0
COSINE + fine-tuning†             41.0 (40.4/41.7)   97.9             65.7              88.0
PRBoost                           48.1 (42.7/55.1)   98.3             67.1              88.9

Table 2: Main results on the four benchmark datasets. †: we use different proportions of clean data for fine-tuning, as described in Section 5.1. We use a gray background to show the results of WSL baselines fine-tuned on the clean data, and highlight the best fine-tuned results in purple and the best WSL results in blue.
Figure 3: Results of interactive methods on AG News.

Figure 4: Rule performance and model accuracy vs. iterations on AG News.
TACRED, ChemProt and AG News when the training data is limited. For the model fine-tuned with 100% training data, we narrow the gap to fully supervised learning compared to other WS approaches.

Comparing the performance gains across datasets, the gap between PRBoost and the baselines is largest on TACRED, which is the most challenging task among the four, with 41 different relation types. ChemProt is the smallest dataset with only 5,400 training instances, so the gain is larger when the WSL methods are fine-tuned with clean labels. The performance gaps among different methods are small on DBPedia, especially after they are fine-tuned using clean labeled data. DBPedia is a relatively simple dataset: using only 0.1% clean data for fine-tuning, RoBERTa already achieves 98% accuracy, and the other WSL methods perform similarly after fine-tuning.

It is worth noting that PRBoost performs strongly across all the tasks because we can easily design a task-specific prompt template to adapt to each task. In contrast, some WSL baselines are difficult to apply to certain tasks. For example, LOTClass achieves strong performance on DBPedia and AG News because its weak sources are tailored for text classification, but it is hard to apply to relation extraction tasks. Similarly, IWS performs well on binary classification problems using n-gram based rules, but the method is only designed for binary classification, making it unsuitable for complex multi-class tasks.

5.3 Rule Annotation Agreement and Cost

In this set of experiments, we benchmark model performance and annotation cost against interactive learning baselines (detailed in Appendix D): IWS, CAL, and Entropy-based AL. As shown in Figure 3, PRBoost outperforms IWS, which also features rule-level annotation, by 1.2% with very close annotation cost. Our method outperforms the best interactive baseline, CAL, by 1.1% in terms of accuracy, while using about 0.6× the annotation cost.

While annotating model-proposed rules or instances, we asked all three annotators to time their annotation. On average, it takes each annotator less than 3 seconds to annotate one rule, while it takes nearly 10 seconds to annotate one instance. Rule-level annotation is much more efficient than instance-level annotation because 1) we show the prompt rather than the original instance to humans, which is shorter and easier to read; and 2) upon scanning the prompt, the annotators can swiftly select qualified rules as they only differ at the [MASK] position. This shows that rule-level annotation is an efficient and suitable paradigm for interactive WSL.

Iteration   1    2    3    4    5    6    7    8    9    10   Overall
P̄          .89  .90  .93  .90  .87  .92  .91  .91  .87  .90  .90
P̄e         .63  .59  .73  .71  .62  .73  .66  .56  .68  .68  .65
κ           .71  .77  .73  .66  .65  .71  .75  .79  .60  .68  .71

Table 3: Annotation agreement measured by Fleiss' kappa κ on AG News. P̄ measures annotation agreement over all categories; P̄e computes the quadratic sum of the proportion of assignments to each category.

For the annotation agreement, we compute Fleiss' kappa κ (Fleiss, 1971) to evaluate the agreement among multiple human annotators. This statistic assesses the reliability of agreement among multiple annotators: κ = 1 indicates complete agreement among all annotators, and no agreement results in κ ≤ 0. As shown in Table 3, we obtained an average κ = 0.71, which means the annotators achieve substantial agreement. For each iteration, κ ranges between [0.60, 0.79], indicating the stability of the annotation agreement.
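For reference, Fleiss' kappa over such a rule-annotation table can be computed as below; the three-annotator, accept/reject setting is assumed from the setup described above, and the toy counts are purely illustrative.

```python
import numpy as np

def fleiss_kappa(counts):
    """counts: (N_rules, K_categories) matrix; counts[i, j] is the number of
    annotators who assigned rule i to category j (here K = 2: accept/reject)."""
    N, _ = counts.shape
    n = counts[0].sum()                        # annotators per rule (assumed constant)
    p_j = counts.sum(axis=0) / (N * n)         # proportion of assignments per category
    P_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))  # per-rule agreement
    P_bar = P_i.mean()                         # observed agreement (P-bar)
    P_e = np.sum(p_j ** 2)                     # chance agreement (quadratic sum, P-bar_e)
    return (P_bar - P_e) / (1 - P_e)

# Three annotators, each accepting or rejecting every candidate rule:
table = np.array([[3, 0], [2, 1], [0, 3], [3, 0]])  # rows: rules, cols: accept/reject
kappa = fleiss_kappa(table)
```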
5.4 Rule Quality in Iterative Learning

In this set of experiments, we evaluate the quality of the rules discovered by PRBoost. Figure 2 visualizes the discovered rules on the AG News dataset. We observe that 1) the rules can rectify some misclassified data, and 2) the rules can complement each other. For the first observation, we can take Figure 2(a) and Figure 2(b) as an example. In iteration 0, where new rules have not yet been proposed, it is obvious that some green and purple data points are mixed into the orange cluster.
After the first-round rule proposal, PRBoost has already rectified part of the wrong predictions via rule matching. This is because our rule proposal is targeted at the large-error instances; such adaptively discovered rules can capture the model's weakness more accurately than simply enumerated rules. For the second observation, we found that more misclassified data points get matched by the newly discovered rules as the iterations increase. This demonstrates that PRBoost can gradually enlarge the effective rule set by adding complementary rules, which avoids proposing repetitive rules that cannot improve the rule coverage.

Figure 4 shows the changes in rule accuracy, rule coverage, and model performance during the iterative learning process on AG News. As shown, the model's accuracy increases steadily during learning, improving from 86.7% to 88.9% after 10 iterations. This improvement arises from two key aspects of PRBoost. First, the enlarged rule set continuously augments the weakly labeled data, which provides more supervision for weak model training. Second, the model ensemble refines the previous large errors step by step, resulting in increasing ensemble performance.

Regarding the rule coverage and accuracy, we observe that the coverage of the rule set improves from 56.4% to 77.8%, and the rule accuracy from 83.1% to 85.6%. Such improvements show that PRBoost can adaptively propose novel rules to complement the previous rule set, which can match instances that were previously unmatchable. Note that the increased rule coverage has not compromised rule accuracy, but rather improved it. The reason is two-fold: (1) the human-in-the-loop evaluation can select high-quality rules for generating new weak labels; (2) for instances with wrong initial weak labels, PRBoost can discover more rules for the same instances and correct the weak labels through majority voting.

Figure 5: Ablation study on AG News. The three horizontal lines represent the non-iterative methods. We use COSINE as the initial WS baseline. For the supervised baseline, we fine-tune RoBERTa on 5% clean data.

5.5 Ablation Study

We study the effectiveness of various components in PRBoost and show the ablation study results in Figure 5. We have the following findings.

First, the boosting-based iterative rule discovery strategy is effective. For the "w/o ensemble" setting, we fix the annotation budget B but discover candidate rules from large-error samples in one iteration. The results show the superiority of the iterative strategy in PRBoost, which brings a 1.2% performance gain. PRBoost iteratively identifies the current model's weaknesses and proposes rules to strengthen itself; therefore it adaptively discovers more effective rules than static rule discovery.

Second, ensembling alone without new rule discovery is not as effective. For the "w/o rule" variant, we do not propose new rules, but instead ensemble multiple self-trained weak classifiers. The final performance drops significantly under this setting, by 1.5%. This demonstrates that the newly proposed rules provide complementary weak supervision to the model. Although simply ensembling multiple weak classifiers also helps WSL, it is not as effective as training multiple complementary weak models as in PRBoost.

Third, self-training benefits learning from new weak labels. For the "w/o self-training" setting, we do not use the self-training technique when learning each weak classifier. The performance deteriorates by 0.6%. This is because part of the data is still unmatched after we propose new rules, and self-training leverages the unlabeled data to help the model generalize better.

6 Conclusion

We proposed PRBoost to iteratively discover prompt-based rules for interactive weakly-supervised learning. Through a boosting-style ensemble strategy, it iteratively evaluates model weakness to identify large-error instances for new rule proposal. From such large-error instances, its prompt-based rule discovery module leads to expressive rules that can largely improve rule coverage while being easy to annotate. The discovered rules complement the current rule set and refine the WSL model continuously. Our experiments on four benchmarks demonstrate that PRBoost can largely improve WSL and narrow the gaps between WSL models and fully-supervised models.
Acknowledgements

This work was supported by ONR MURI N00014-17-1-2656, NSF IIS-2008334, IIS-2106961, and research awards from Google, Amazon, Facebook, and Kolon Inc.

References
Xiang Chen, Xin Xie, Ningyu Zhang, Jiahuan Yan, Shumin Deng, Chuanqi Tan, Fei Huang, Luo Si, and Huajun Chen. 2021. AdaPrompt: Adaptive prompt-based finetuning for relation extraction. arXiv preprint arXiv:2104.07650.

Dongjin Choi, Sara Evensen, Çağatay Demiralp, and Estevam Hruschka. 2021. TagRuler: Interactive tool for span-level data programming by demonstration. In Companion Proceedings of the Web Conference 2021, pages 673–677.

Zi-Yi Dou, Pengfei Liu, Hiroaki Hayashi, Zhengbao Jiang, and Graham Neubig. 2021. GSum: A general framework for guided neural abstractive summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4830–4842, Online. Association for Computational Linguistics.

Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz, and Noam Slonim. 2020. Active learning for BERT: An empirical study. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7949–7962, Online. Association for Computational Linguistics.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Sainyam Galhotra, Behzad Golshan, and Wang-Chiew Tan. 2021. Adaptive rule discovery for labeling text data. In Proceedings of the 2021 International Conference on Management of Data, pages 2217–2225.

Alex Holub, Pietro Perona, and Michael C. Burl. 2008. Entropy-based active learning for object recognition. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8. IEEE.

Cheng-Yu Hsieh, Jieyu Zhang, and Alexander Ratner. 2022. Nemo: Guiding and contextualizing weak supervision for interactive data programming. arXiv preprint arXiv:2203.01382.

Shengding Hu, Ning Ding, Huadong Wang, Zhiyuan Liu, Juanzi Li, and Maosong Sun. 2021. Knowledgeable prompt-tuning: Incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035.

Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. 2016. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2410–2420, Berlin, Germany. Association for Computational Linguistics.

Martin Krallinger, Obdulia Rabal, Saber A. Akhondi, et al. 2017. Overview of the BioCreative VI chemical-protein interaction track. In BioCreative Evaluation Workshop, volume 1, pages 141–146.

Dong-Hyun Lee. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 896.

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Jiacheng Li, Haibo Ding, Jingbo Shang, Julian McAuley, and Zhe Feng. 2021a. Weakly supervised named entity tagging with learnable logical rules. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4568–4581, Online. Association for Computational Linguistics.

Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.

Yinghao Li, Pranav Shetty, Lucas Liu, Chao Zhang, and Le Song. 2021b. BERTifying the hidden Markov model for multi-source weakly supervised named entity recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6178–6190, Online. Association for Computational Linguistics.

Chen Liang, Yue Yu, Haoming Jiang, Siawpeng Er, Ruijia Wang, Tuo Zhao, and Chao Zhang. 2020. BOND: BERT-assisted open-domain named entity recognition with distant supervision. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1054–1064, New York, NY, USA. Association for Computing Machinery.

Pierre Lison, Jeremy Barnes, Aliaksandr Hubin, and Samia Touileb. 2020. Named entity recognition without labelled data: A weak supervision approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1518–1533, Online. Association for Computational Linguistics.

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586.

Xiao Liu, Kaixuan Ji, Yicheng Fu, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2021b. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021c. GPT understands, too. arXiv preprint arXiv:2103.10385.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.

Katerina Margatina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. 2021. Active learning by acquiring contrastive examples. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 650–663, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Dheeraj Mekala and Jingbo Shang. 2020. Contextualized weak supervision for text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 323–333, Online. Association for Computational Linguistics.

Yu Meng, Yunyi Zhang, Jiaxin Huang, Chenyan Xiong, Heng Ji, Chao Zhang, and Jiawei Han. 2020. Text classification using label names only: A language model self-training approach. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9006–9017, Online. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. arXiv preprint arXiv:2105.11447.

Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2017. Snorkel: Rapid training data creation with weak supervision. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 11, page 269. NIH Public Access.

Esteban Safranchik, Shiying Luo, and Stephen Bach. 2020. Weakly supervised sequence tagging from noisy rules. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 5570–5578.

Timo Schick and Hinrich Schütze. 2021a. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021b. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352. Association for Computational Linguistics.

Yanyao Shen, Hyokun Yun, Zachary Lipton, Yakov Kronrod, and Animashree Anandkumar. 2017. Deep active learning for named entity recognition. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 252–256, Vancouver, Canada. Association for Computational Linguistics.

Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).

Paroma Varma and Christopher Ré. 2018. Snuba: Automating weak supervision to label training data. In Proceedings of the VLDB Endowment. International Conference on Very Large Data Bases, volume 12, page 223. NIH Public Access.

Junyuan Xie, Ross Girshick, and Ali Farhadi. 2016. Unsupervised deep embedding for clustering analysis. In International Conference on Machine Learning, pages 478–487. PMLR.

Jieyu Zhang, Yue Yu, Yinghao Li, Yujing Wang, Yaming Yang, Mao Yang, and Alexander Ratner. 2021. WRENCH: A comprehensive benchmark for weak supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Rongzhi Zhang, Yue Yu, and Chao Zhang. 2020. SeqMix: Augmenting active sequence labeling via sequence mixup. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8566–8579, Online. Association for Computational Linguistics.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28:649–657.

Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor Angeli, and Christopher D. Manning. 2017. Position-aware attention and supervised data improve slot filling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 35–45, Copenhagen, Denmark. Association for Computational Linguistics.

Xinyan Zhao, Haibo Ding, and Zhe Feng. 2021. GLaRA: Graph-based labeling rule augmentation for weakly supervised named entity recognition. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3636–3649, Online. Association for Computational Linguistics.

Wenxuan Zhou, Hongtao Lin, Bill Yuchen Lin, Ziqi Wang, Junyi Du, Leonardo Neves, and Xiang Ren. 2020. NERO: A neural rule grounding framework for label-efficient relation extraction. In The Web Conference, pages 2166–2176.
B Hyper-parameters

Table 4: More rule examples on the text classification dataset AG News and the relation extraction dataset TACRED.
open-source PLMs and public data. We believe this rule-level annotation process will not amplify any bias in the original data. We do not foresee any ethical issues or direct social consequences.

F Model Ensemble

In practice, we keep αt the same for each weak model during model ensembling. Equation 11 weights each weak model mt by a computed coefficient αt; intuitively, a weak model mt with a higher αt impacts the ensemble results more. This paradigm has proven effective under fully-supervised settings, but we found it is not directly applicable in WSL. Since we initialize a model m0 on the given weak source, it already achieves relatively strong performance (much better than random guessing), i.e., the error rate err0 is low. This yields a high α0 according to Equation 2, so the initialized model would dominate the subsequent predictions, limiting the effectiveness of the model ensemble. Therefore, we assign the same weight to each weak model but still follow the design of identifying large-error instances. This is reasonable because the weight wi computed by Equation 1 still reflects the model weakness and can guide the rule proposal. By discovering rules based on the large-error instances, we iteratively complement the weak feature regimes through model training on rule-matched data and strengthen the ensemble model.
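A minimal sketch of the ensemble prediction: uniform weights as used in practice per this appendix, with the αt weighting of Equation 11 available as an alternative. The function name and the assumption that each weak model returns class logits are ours.

```python
def ensemble_logits(models, x, alphas=None):
    """Combine the weak models m_1..m_T (Eq. 11). With alphas=None the models
    are weighted uniformly, as done in practice in the WSL setting."""
    outputs = [m(x) for m in models]          # per-model class scores for input x
    if alphas is None:
        alphas = [1.0 / len(models)] * len(models)
    return sum(a * o for a, o in zip(alphas, outputs))
```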