Automated question generation and question answering from Turkish texts


using text-to-text transformers

Fatih Cagatay AKYON1,2∗, Devrim CAVUSOGLU1,3, Cemil CENGIZ1, Sinan Onur ALTINUC1,2, Alptekin TEMIZEL2

1 OBSS AI, Ankara, Turkey
2 Graduate School of Informatics, METU, Ankara, Turkey
3 Computer Engineering, METU, Ankara, Turkey

∗ Correspondence: fatih.akyon@metu.edu.tr

This work is licensed under a Creative Commons Attribution 4.0 International License.


Abstract: While exam-style questions are a fundamental educational tool serving a variety of purposes, manual
construction of questions is a complex process that requires training, experience and resources. To reduce the expenses
associated with the manual construction of questions and to satisfy the need for a continuous supply of new questions,
automatic question generation (QG) techniques can be utilized. However, compared to automatic question answering
(QA), QG is a more challenging task. In this work, we fine-tune a multilingual T5 (mT5) transformer in a multi-
task setting for QA, QG and answer extraction tasks using a Turkish QA dataset. To the best of our knowledge, this
is the first academic work that attempts to perform automated text-to-text question generation from Turkish texts.
Evaluation results show that the proposed multi-task setting achieves state-of-the-art Turkish question answering and
question generation performance on the TQuADv1 and TQuADv2 datasets and the XQuAD Turkish split. The source code and
pre-trained models are available at this https URL.

Key words: Turkish, question answering, question generation, answer extraction, multi-task, transformer

1. Introduction
Question Generation (QG) is the task of generating questions from a given context and, optionally, some answers.
Research on QG has grown rapidly as the task has become popular in areas such as education, commercial applications
(e.g., chatbots and dialogue systems) and healthcare [1].
Early works in QG were based mainly on human-designed rules. These works focused on transforming
a declarative sentence into the corresponding question by utilizing sophisticated syntactic transformation rules,
and mainly relied on handcrafted feature extraction from documents. A method for generating multiple-choice
tests from instructional documents such as textbooks or encyclopedias was proposed in [2]. In this work,
domain-specific terms were extracted using a term frequency approach, and the sentences including the
retrieved terms were transformed into questions using the parsed syntactic information of the sentence. In [3],
the input text is first simplified with a set of transformations to produce multiple declarative sentences. Then,
each declarative sentence is transformed into a set of possible questions by syntactic and lexical transformations.
However, since it relies on rule-based transformations, the method is not applicable to other languages and question
styles. Similarly, [4] is a preliminary work that gives the details of an implementation plan for rule-based
question generation from Turkish texts using syntactic (constituent or dependency) parsing and semantic role labeling systems.
In the question generation part, the author plans to utilize manually generated templates
and rules. However, the proposed method is not fully automated, given the manual selection of templates
and its rule-based nature. Moreover, the paper does not provide sufficient technical details, and no follow-up paper
giving the details of the planned implementation is available.
Recently, many neural network based techniques have been proposed for QG. An encoder-decoder
architecture based on an LSTM seq2seq model is adopted in [5]. Both the input sentence and the paragraph
containing the sentence are encoded via separate bidirectional LSTMs [6] and then concatenated. This
representation is then fed to the decoder, a left-to-right LSTM, to generate the question. The decoder
learns to use the information in the more relevant parts of the encoded input representation via an attention layer.
Later models included the target answer in the input to avoid questions that are too short and/or broadly targeted
(lacking a precise target), such as "What is mentioned?". Some models achieved this by treating
the answer's position as an extra input feature [7, 8], while others encoded the answer using a separate network
[9, 10]. Moreover, position embeddings have been used to give more attention to the context words closer to
the answer. Others utilized additional decoders to predict the question word (when, how, why, etc.) before
generating the question [11]. LSTM-based seq2seq models struggle to capture the paragraph-level context that
is needed to generate high-quality questions. The seq2seq model was extended with answer tagging, a maxout pointer
mechanism and a gated self-attention encoder in [8]. A multi-stage attention mechanism linking long document context
with the targeted answer is used in [12].

Figure 1. Multi-task setting.

Figure 2. End-to-end question generation from context.

Recently, transformer-based models have come to dominate NLP research, including QG research.
These models capture longer and more comprehensive contexts more effectively than their QG
predecessors, mainly LSTM-based seq2seq models. A transformer-based model, combined with a pre-processing
step based on named entity recognition (NER), is used in [13]. In that work, a variety
of named entities are first extracted from the input text and then replaced with their named entity tags to help
the model generalize. Stronger QG results were obtained by approaches using large transformer-based
language models (LMs). A pre-trained BERT was adopted for QG in [14], where three models were proposed:
a straightforward BERT for QG, sequential question generation with BERT using the previously decoded
results, and finally highlighting the answer in the context, which yielded a performance improvement. A pre-trained
GPT-2 is used for QG in [15] in a straightforward way by preparing the inputs in a particular format. The
authors evaluate the model outputs in several respects, such as context copying, failures in constructing questions,
robustness and answer awareness.
In this work, in order to fully automate the question generation process from Turkish texts using a single
model, we propose a multi-task learning based fine-tuning of the mT5 model [16]. To the best of our knowledge,
this is the first academic work that attempts to perform automated text-to-text question generation from
Turkish texts. The advantage of the mT5 language model is that it is pre-trained on multiple tasks and multiple
languages simultaneously in a unified text-to-text fashion. Therefore, it can readily be fine-tuned on multiple
tasks in Turkish, such as answer extraction, QA and QG, after converting the datasets to the common
text-to-text format. Generation in the multi-task setting is illustrated in Figure 1. Moreover, this approach does not
require an external answer extraction model or human effort to label the answers, since the same model can be
used to extract the answers from the context, as seen in Figure 2.

Figure 3. Multi-task fine-tuning of the multilingual pre-trained mT5 model: 1) QA, which uses the context and
question pair as input and the answer as target; 2) QG, which uses the answer-highlighted context as input and the question as
target; 3) answer extraction, which uses the sentence-highlighted context as input and the answer list with separators as target.

2. Proposed Approach

In this section, the details of the multi-task fine-tuning and inference are presented.


2.1. Modeling
To formally describe the task, let c denote the context, a denote the answer, which is a span inside c, and q
denote the question targeting the answer a. Given a context c, the task is to model
P (q, a|c) = P (a|c)P (q|a, c) (1)

To obtain a fully automated question generation pipeline, we assume that in the generation phase the answer
may not be given, and thus we also train the model to find answers in the given context. That is, we model
P (a|c) as an answer extraction task. This way, question generation can be performed without
explicitly provided answers. For this task, we use context and answer pairs from the {context, question, answer}
triplets of a SQuAD-style dataset: the context is used directly as the input and the answers are used as targets
for the answer extraction task during training.
For the second part of the equation, P (q|a, c), we use the answer and context pairs as inputs and the question
targeting the given answer as the target during training. In the generation phase, if an answer is given with the context,
we skip the answer extraction step; otherwise, answer extraction is performed before question generation. We use the
pre-trained multilingual T5 (mT5) [16] for modeling the tasks. The mT5 model is based on T5 [17], which is
an encoder-decoder model trained on multiple tasks simultaneously in a unified text-to-text fashion.
When providing the inputs to the text-to-text transformer, the different parts of the input c, a, q are separated
by a separator. In both the single-task and multi-task QG settings, we apply three different input formats:
prepend, highlight and both. In the prepend format, we prepend the base input text with a task-specific prefix as in
T5 [18]. For example, for the QG task we prepend the base input with the "generate question: " prefix. In the
highlight format, we wrap the answer text in the context with a special <hl> (highlight) token, similar to [16].
The both format combines the two previous formats, applying both the prefix and the highlight tokens.
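As an illustration of these formats, the following minimal sketch (our own simplification, not the released code) builds the QG input strings; the exact wording around the answer in the prepend case is an assumption, while the "generate question: " prefix and the <hl> token follow the description above.

```python
# Minimal sketch of the three QG input formats (prepend, highlight, both).
# Helper names and the "answer: ... context: ..." wording used for the
# prepend case are illustrative assumptions, not the paper's released code.

def highlight_answer(context: str, answer: str, hl: str = "<hl>") -> str:
    """Wrap the answer span inside the context with <hl> tokens."""
    start = context.index(answer)
    end = start + len(answer)
    return f"{context[:start]}{hl} {answer} {hl}{context[end:]}"

def build_qg_input(context: str, answer: str, style: str) -> str:
    """Build the text-to-text model input for question generation."""
    if style == "prepend":
        return f"generate question: answer: {answer} context: {context}"
    if style == "highlight":
        return highlight_answer(context, answer)
    if style == "both":
        return f"generate question: {highlight_answer(context, answer)}"
    raise ValueError(f"unknown input style: {style}")

print(build_qg_input("Ankara Türkiye'nin başkentidir.", "Ankara", "both"))
# generate question: <hl> Ankara <hl> Türkiye'nin başkentidir.
```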

2.2. Single-task Setting


We use a SQuAD [19] style QA dataset as input and modify each sample to train our model in the single-task
setting for the QA and QG tasks. In the generation phase, the QA task requires a context and a question to be
given, while the QG task requires a context and an answer. Differently from the multi-task setting described in the
next section, an answer must be given along with the context for the QG task in the generation phase, whereas in
the multi-task setting the answer is not strictly required.

2.3. Multi-task Setting


We use a SQuAD [19] style QA dataset as input and modify each sample to train our model in a multi-task
setting. For the answer extraction task, we put highlight tokens at the start and end of the sentence that contains
the answer to be extracted; for question generation, the answer of the question to be generated is highlighted
[16]. Moreover, we prepend the "question", "context", "generate question" and "extract answer" tokens to each
sample to help the model distinguish one task from another. In Figure 3, the input and target formats of the
model during fine-tuning are presented.
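To make the conversion concrete, the sketch below (our own illustration, with assumed spellings for the <hl> and <sep> tokens and the task prefixes) expands a single {context, question, answer} triplet into the three (input, target) training pairs shown in Figure 3.

```python
# Sketch: expand one SQuAD-style sample into the three multi-task training
# pairs (QA, QG, answer extraction). Prefix strings and the <hl>/<sep>
# token spellings are assumptions mirroring Figure 3, not released code.

def make_multitask_pairs(context: str, question: str, answer: str,
                         answer_sentence: str):
    pairs = []
    # 1) QA: question + context as input, answer as target.
    pairs.append((f"question: {question} context: {context}", answer))
    # 2) QG: answer highlighted inside the context, question as target.
    qg_ctx = context.replace(answer, f"<hl> {answer} <hl>", 1)
    pairs.append((f"generate question: {qg_ctx}", question))
    # 3) Answer extraction: the sentence containing the answer is
    #    highlighted; the answers found in it form the target.
    ae_ctx = context.replace(answer_sentence, f"<hl> {answer_sentence} <hl>", 1)
    pairs.append((f"extract answer: {ae_ctx}", f"{answer} <sep>"))
    return pairs
```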

2.4. End-to-end Question Generation


In this work, we adopt the answer-aware question generation methodology [20]. As a result, to generate
questions, the model requires both the context and the answer, as illustrated in Figure 3. Instead of manually
deciding the answers to be highlighted, the model is also used for automatic answer extraction. As a result,
end-to-end question generation from raw text becomes possible, as seen in Figure 2.
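A hedged sketch of this two-step pipeline is given below: the same fine-tuned checkpoint first extracts candidate answers and then generates a question for each of them. The checkpoint path, the prefixes and the <hl>/<sep> tokens are illustrative assumptions; only the overall flow follows the description above.

```python
# Sketch of end-to-end question generation with a single fine-tuned mT5
# model via the Hugging Face Transformers API. The checkpoint path and the
# exact prompt/token strings are assumptions, not released artifacts.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "path/to/finetuned-mt5"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def run(text: str, max_length: int = 64) -> str:
    """Generate one output sequence for a text-to-text input."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    out = model.generate(ids, max_length=max_length, num_beams=4)
    # Assumes <hl>/<sep> were added as ordinary tokens so they survive decoding.
    return tokenizer.decode(out[0], skip_special_tokens=True)

def end_to_end_qg(context: str):
    # Step 1: answer extraction. For brevity the whole context is highlighted
    # here; the approach above highlights one sentence at a time.
    raw = run(f"extract answer: <hl> {context} <hl>")
    answers = [a.strip() for a in raw.split("<sep>") if a.strip()]
    # Step 2: answer-aware question generation for each extracted answer.
    pairs = []
    for ans in answers:
        hl_ctx = context.replace(ans, f"<hl> {ans} <hl>", 1)
        pairs.append((run(f"generate question: {hl_ctx}"), ans))
    return pairs
```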

3. Experimental Setup and Results


All experiments have been performed on an Nvidia A100 GPU with 80 GB VRAM using the Transformers Trainer
[21] on a PyTorch [22] backend.

3.1. Datasets and Metrics


TQuAD [23] is a Turkish QA dataset on Turkish and Islamic science history that was published within the scope of
the Teknofest 2018 Artificial Intelligence competition. An extended version of TQuAD is also available [24]. Both
of these datasets have the same structure as SQuAD [19] and are used during fine-tuning and evaluation. In
the results, we refer to them as TQuADv1 [23] and TQuADv2 [24].
XQuAD [25] is a multilingual QA dataset consisting of 240 paragraphs and 1190 question-answer pairs in
Arabic, Chinese, German, Greek, Hindi, Russian, Spanish, Thai, Turkish and Vietnamese. These samples have
been professionally translated from the SQuAD 1.1 dev set. The Turkish split of XQuAD is used to evaluate
the fine-tuned models.
For evaluating QA performance, the widely accepted F1 and Exact Match (EM) scores [19]
are calculated. Although there is no widely accepted automatic evaluation metric for measuring QG
performance, as reported in [26], most previous works used classic metrics such as BLEU [27], METEOR
[28] and ROUGE [29]. METEOR is not applicable to Turkish, as it applies stemming and synonym matching
(in English) in addition to computing precision and recall values; hence it has been excluded from our experiments.
We report BLEU-1, BLEU-2 and ROUGE-L in the conducted experiments for evaluating QG performance.
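For reference, these scores can be computed with common Python packages; the sketch below is one possible implementation (not necessarily the exact evaluation scripts used in this work), computing SQuAD-style EM/F1 for QA and BLEU-1/BLEU-2/ROUGE-L for QG with naive whitespace tokenization.

```python
# One possible way to compute the reported metrics; not necessarily the
# exact scripts used here. Requires the nltk and rouge-score packages.
from collections import Counter
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

def exact_match(pred: str, gold: str) -> float:
    """SQuAD-style EM on raw strings (answer normalization omitted)."""
    return float(pred.strip() == gold.strip())

def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 with whitespace tokenization."""
    p, g = pred.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def qg_scores(pred: str, gold: str) -> dict:
    """BLEU-1, BLEU-2 and ROUGE-L for a generated question vs. a reference."""
    ref, hyp = gold.split(), pred.split()
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
    return {
        "bleu1": sentence_bleu([ref], hyp, weights=(1.0,)),
        "bleu2": sentence_bleu([ref], hyp, weights=(0.5, 0.5)),
        "rougeL": rouge.score(gold, pred)["rougeL"].fmeasure,
    }
```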

3.2. Hyper-parameter Tuning

Table 1. Hyper-parameter search to find the best initial learning rate value and number of training epochs for BERT.

LR    Ep.  TQuADv2 F1  TQuADv2 EM  XQuAD F1  XQuAD EM
1e-4   3   67.1        50.5        53.0      37.4
1e-4   5   66.5        50.1        52.8      36.7
1e-4   8   66.4        49.7        52.8      36.7
1e-4  10   66.7        50.0        52.7      37.7
1e-5   3   55.1        38.6        49.7      31.3
1e-5   5   57.5        41.4        41.4      23.6
1e-5   8   60.1        43.8        44.5      27.1
1e-5  10   60.8        44.1        45.5      30.1

Before performing detailed comparisons, we chose the best hyper-parameters for the BERT [30] and mT5
[16] models. A grid search has been performed to select the optimizer type, initial learning rate and the number
of training epochs. The BERTurk-base language model [31] has been fine-tuned for the QA task on the TQuADv2 training
split, and F1 and EM scores have been calculated on the TQuADv2 validation split and the XQuAD Turkish split. Results
for learning rates 1e-4 and 1e-5 are summarized in Table 1.


Table 2. Hyper-parameter search to find the best optimizer type, initial learning rate and number of training epochs
for mT5. The first block shows the model setup; the second block shows the QA evaluation scores.

Optim.     LR    Ep.  TQuADv2 F1  TQuADv2 EM  XQuAD F1  XQuAD EM
AdamW      1e-4   5   38.4        23.9        29.5      15.2
AdamW      1e-4  10   51.9        35.6        39.3      21.8
AdamW      1e-4  15   56.7        39.0        40.8      22.5
AdamW      1e-3   5   60.4        43.3        43.7      24.5
AdamW      1e-3  10   63.7        47.2        47.0      29.6
AdamW      1e-3  15   64.8        49.3        48.2      31.1
Adafactor  1e-4  10   50.9        35.1        37.2      19.6
Adafactor  1e-3   5   57.9        40.0        41.2      22.3
Adafactor  1e-3  10   62.6        45.2        43.4      26.4
Adafactor  1e-3  15   63.5        46.6        44.1      26.6

Table 3. Hyper-parameter search to find the best optimizer type, initial learning rate and number of training epochs
for mT5. The first block shows the model setup; the second block shows the QG evaluation scores.

Optim.     LR    Ep.  TQuADv2 BLEU-1  TQuADv2 BLEU-2  TQuADv2 ROUGE-L  XQuAD BLEU-1  XQuAD BLEU-2  XQuAD ROUGE-L
AdamW      1e-4   5   25.8            20.4            34.8             15.8          10.2          23.8
AdamW      1e-4  10   26.8            21.4            38.0             17.1          11.0          23.4
AdamW      1e-4  15   30.8            24.7            39.5             17.9          11.7          26.3
AdamW      1e-3   5   34.4            28.1            42.4             19.2          12.6          26.9
AdamW      1e-3  10   36.4            30.1            43.7             19.8          12.9          27.4
AdamW      1e-3  15   39.3            32.8            45.6             21.3          14.0          28.5
Adafactor  1e-4  10   27.5            22.0            37.5             16.9          11.0          25.0
Adafactor  1e-3   5   34.0            27.9            41.7             17.9          11.4          26.0
Adafactor  1e-3  10   37.5            31.1            44.2             18.8          12.0          26.6
Adafactor  1e-3  15   38.6            32.0            46.1             19.9          12.8          27.0

It has to be noted that, as the model diverged with a learning rate of 1e-3, results for this learning rate are not
reported. As a result, we selected the set of parameters attaining the overall best scores across all metrics and
set the learning rate to 1e-4 and the number of epochs to 3 for BERT. Similarly, the mT5 language model [16] has been
fine-tuned in a multi-task setting on the TQuADv2 training split. Then, F1 and EM scores for the QA samples (Table 2)
and BLEU and ROUGE scores for the QG samples (Table 3) have been calculated on the TQuADv2 validation split and the
XQuAD Turkish split. Note that trainings with a 1e-5 learning rate diverged for both AdamW and Adafactor and hence
are not reported. After these experiments, we selected the set of parameters attaining the overall best scores across
all metrics, which was achieved by the AdamW optimizer with a learning rate of 1e-3 and 15 epochs.
For the remainder of the experiments, these fine-tuned BERT and mT5 models with the determined set
of parameters have been used.
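As an illustration, the selected mT5 configuration (AdamW, learning rate 1e-3, 15 epochs) can be expressed with the Transformers Trainer roughly as follows; the batch size, output path and dataset objects are placeholders rather than values reported above.

```python
# Hedged sketch of the selected mT5 fine-tuning configuration with the
# Transformers Trainer. Batch size and output path are placeholders; the
# construction of the tokenized multi-task dataset is omitted.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")

args = TrainingArguments(
    output_dir="mt5-base-tquad2-multitask",  # placeholder path
    learning_rate=1e-3,                      # selected initial learning rate
    num_train_epochs=15,                     # selected number of epochs
    per_device_train_batch_size=8,           # assumed, not reported in the paper
)

# AdamW is the Trainer's default optimizer, matching the selected setup.
# `train_dataset` would hold the tokenized multi-task pairs from Section 2.3.
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()
```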


3.3. Evaluation Results

Table 4. BERTurk-base and mT5-base QA evaluation results for TQuADv2 fine-tuning.

Setting           TQuADv2 F1  TQuADv2 EM  XQuAD F1  XQuAD EM
BERTurk           67.1        50.5        53.0      37.4
Single-task mT5   71.6        55.1        60.7      40.2
Multi-task mT5    71.5        56.2        61.1      43.3

Table 5. mT5-base QG evaluation results for single-task (ST) and multi-task (MT) for TQuADv2 fine-tuning.

Setting            TQuADv2 BLEU-1  TQuADv2 ROUGE-L  XQuAD BLEU-1  XQuAD ROUGE-L
MT-Both mT5        47.6            53.9             27.9          35.8
MT-Highlight mT5   45.9            52.5             26.2          34.8
MT-Prepend mT5     45.5            52.6             25.0          34.0
ST-Both mT5        46.1            52.6             26.2          34.1
ST-Highlight mT5   45.2            52.4             25.8          33.5
ST-Prepend mT5     43.6            50.8             23.4          31.8

Table 6. TQuADv1 and TQuADv2 fine-tuning QA evaluation results for multi-task mT5 variants and BERTurk.
MT-Both means the mT5 model is fine-tuned with the 'Both' input format in a multi-task setting.

TQuADv1 fine-tuning results
Setting             TQuADv1 F1  TQuADv1 EM  XQuAD F1  XQuAD EM
BERTurk-base        62.5        45.2        42.9      26.6
MT-Both mT5-small   63.8        48.5        43.5      27.0
MT-Both mT5-base    72.1        55.8        54.6      35.9
MT-Both mT5-large   74.7        59.6        62.1      42.9

TQuADv2 fine-tuning results
Setting             TQuADv2 F1  TQuADv2 EM  XQuAD F1  XQuAD EM
BERTurk-base        67.1        50.5        53.0      37.4
MT-Both mT5-small   65.0        49.3        48.8      32.9
MT-Both mT5-base    71.5        56.2        61.1      43.3
MT-Both mT5-large   73.3        58.4        65.0      46.7

For evaluating the question answering performance of the proposed technique, we use the fine-tuned
BERTurk result as the baseline. The BERT and mT5 models are fine-tuned on the TQuADv2 training split separately.
Then, F1 and EM scores are calculated on the TQuADv2 validation split and the XQuAD Turkish split. The TQuADv2
fine-tuning results can be seen in Table 4. According to these results, the proposed mT5 settings outperform BERT
on the QA task. Moreover, the multi-task setting further increases the QA performance.


Table 7. TQuADv1 and TQuADv2 fine-tuning QG evaluation results for multi-task mT5 variants. MT-Both means the
mT5 model is fine-tuned with the 'Both' input format in a multi-task setting.

TQuADv1 fine-tuning results
Setting             TQuADv1 BLEU-1  TQuADv1 BLEU-2  TQuADv1 ROUGE-L  XQuAD BLEU-1  XQuAD BLEU-2  XQuAD ROUGE-L
MT-Both mT5-small   37.3            30.1            44.3             19.8          12.7          26.3
MT-Both mT5-base    48.4            41.7            53.6             21.6          14.1          28.3
MT-Both mT5-large   49.8            43.2            55.2             24.9          16.3          30.2

TQuADv2 fine-tuning results
Setting             TQuADv2 BLEU-1  TQuADv2 BLEU-2  TQuADv2 ROUGE-L  XQuAD BLEU-1  XQuAD BLEU-2  XQuAD ROUGE-L
MT-Both mT5-small   39.6            32.9            46.5             21.1          13.8          28.4
MT-Both mT5-base    47.6            41.2            53.9             27.9          20.9          35.8
MT-Both mT5-large   49.1            42.7            54.3             29.3          21.9          37.5

To compare the question generation performance of the proposed single-task and multi-task settings, we fine-tuned
the mT5 model on the TQuADv2 training split in both settings. The three different input formats explained in
Section 2.1 are used, and BLEU-1 and ROUGE-L scores are calculated on the TQuADv2 validation split
and the XQuAD Turkish split. According to the TQuADv2 fine-tuning results in Table 5, the highlight format increases
the BLEU-1 scores by up to 1.6 points and the ROUGE-L scores by up to 1.7 points compared to the prepend format in the
single-task setting. Moreover, the highlight format increases the BLEU-1 scores by up to 1.2 points and the ROUGE-L scores
by up to 0.8 points compared to the prepend format in the multi-task setting. Furthermore, combining both techniques
increases the BLEU-1 scores by up to 2.9 points and the ROUGE-L scores by up to 3.8 points compared to the prepend
format.
Additional experiments have been conducted to show the overall performance of the larger mT5 variants,
mT5-base and mT5-large, as well as how BERTurk compares to them. Table 6 presents the QA evaluation results for
the TQuADv1 and TQuADv2 fine-tunings, and Table 7 presents the corresponding QG evaluation results for the
TQuADv1 and TQuADv2 fine-tunings. First, on the smaller dataset all mT5 variants outperform
BERTurk on the QA task, while on the larger dataset BERTurk may outperform mT5-small. This shows that for the
QA task, the mT5-based approach is a viable option when data is scarce, whereas regular single-task
training may be preferred when there is enough data at hand. Second, the comparison of the mT5 results shows
that increasing the model size improves the performance considerably in all scores for both datasets. Moreover,
the largest gain is obtained when switching from mT5-small to mT5-base. While using an even bigger model
by switching to mT5-large improves the performance further, the effect is relatively smaller. Nevertheless, this trend
of obtaining better scores by increasing the model capacity is consistent with previous works on other
transformer-based models.

4. Conclusions
By combining the proposed answer extraction and answer-aware QG modules, it is possible to fully automate
the QG task without any manual answer extraction labor. Automated evaluation metrics on the TQuAD validation
set show that the model is capable of generating meaningful question-answer pairs from the context after fine-tuning.
Moreover, the results show that the proposed multi-task approach performs better on QA, answer
extraction and QG than the single-task setting. By combining the prepend and highlight input formats, the
QG performance of an mT5 model can be boosted by up to 10%. In future work, further experiments with the
multi-task model on other QG tasks, such as multiple-choice, true/false and yes/no question generation, will be
examined, and the effect of multilingual knowledge in mT5 will be analysed.

References

[1] Xiang Yue, Xinliang Frederick Zhang, Ziyu Yao, Simon Lin, and Huan Sun. Cliniqg4qa: Generating diverse questions
for domain adaptation of clinical question answering. arXiv preprint arXiv:2010.16021, 2020.
[2] Ruslan Mitkov et al. Computer-aided generation of multiple-choice tests. In Proceedings of the HLT-NAACL 03
workshop on Building educational applications using natural language processing, pages 17–22, 2003.
[3] Michael Heilman and Noah A Smith. Good question! statistical ranking for question generation. In Human Language
Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational
Linguistics, pages 609–617, 2010.
[4] Katira Soleymanzadeh. Domain specific automatic question generation from text. In Proceedings of ACL 2017,
Student Research Workshop, pages 82–88, 2017.
[5] Xinya Du, Junru Shao, and Claire Cardie. Learning to ask: Neural question generation for reading comprehension.
In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 1342–1352, 2017.
[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[7] Qingyu Zhou, Nan Yang, Furu Wei, Chuanqi Tan, Hangbo Bao et al. Neural question generation from text: A
preliminary study. In National CCF Conference on Natural Language Processing and Chinese Computing, pages
662–671. Springer, 2017.
[8] Yao Zhao, Xiaochuan Ni, Yuanyuan Ding, and Qifa Ke. Paragraph-level neural question generation with maxout
pointer and gated self-attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing, pages 3901–3910, 2018.
[9] Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. Question generation for question answering. In Proceedings
of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 866–874, 2017.
[10] Yanghoon Kim, Hwanhee Lee, Joongbo Shin, and Kyomin Jung. Improving neural question generation using answer
separation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6602–6609, 2019.
[11] Xingwu Sun, Jing Liu, Yajuan Lyu, Wei He, Yanjun Ma et al. Answer-focused and position-aware neural
question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,
pages 3930–3939, 2018.
[12] Luu Anh Tuan, Darsh Shah, and Regina Barzilay. Capturing greater context for question generation. In Proceedings
of the AAAI Conference on Artificial Intelligence, volume 34, pages 9065–9072, 2020.
[13] Kettip Kriangchaivech and Artit Wangperawong. Question generation by transformers. arXiv preprint
arXiv:1909.05017, 2019.
[14] Ying-Hong Chan and Yao-Chung Fan. A recurrent bert-based model for question generation. In Proceedings of the
2nd Workshop on Machine Reading for Question Answering, pages 154–162, 2019.
[15] Luis Enrico Lopez, Diane Kathryn Cruz, Jan Christian Blaise Cruz, and Charibeth Cheng. Transformer-based
end-to-end question generation. arXiv preprint arXiv:2005.01107, 2020.
[16] Linting Xue, Noah Constant, Adam Roberts, Mihir Kale, Rami Al-Rfou et al. mT5: A massively multilingual
pre-trained text-to-text transformer. arXiv preprint arXiv:2010.11934, 2020.
[17] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang et al. Exploring the limits of transfer
learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.


[18] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang et al. Exploring the limits of transfer
learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020.
[19] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine
comprehension of text. In EMNLP, 2016.
[20] Xingwu Sun, Jing Liu, Yajuan Lyu, Wei He, Yanjun Ma et al. Answer-focused and position-aware neural question
generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages
3930–3939, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. URL
https://aclanthology.org/D18-1427.
[21] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue et al. Transformers: State-
of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations, pages 38–45, Online, October 2020. Association for Computational
Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[22] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury et al. Pytorch: An imperative style,
high-performance deep learning library. In NeurIPS, 2019.
[23] TQuAD. Turkish question answering dataset. https://github.com/TQuad/turkish-nlp-qa-dataset, 2018.
[24] Okan Ciftci, Ugurcan Kok, and Filiz Gozet. Extended Turkish question answering dataset.
https://github.com/okanvk/Turkish-Reading-Comprehension-Question-Answering-Dataset, 2021.
[25] Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. On the cross-lingual transferability of monolingual represen-
tations. arXiv preprint arXiv:1910.11856, 2019.
[26] Jacopo Amidei, Paul Piwek, and Alistair Willis. Evaluation methodologies in automatic question generation 2013-
2018. 2018.
[27] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of
machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics,
pages 311–318, 2002.
[28] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with
human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine
translation and/or summarization, pages 65–72, 2005.
[29] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out,
pages 74–81, 2004.
[30] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[31] Stefan Schweter. BERTurk - BERT models for Turkish, April 2020. URL https://doi.org/10.5281/zenodo.3770924.
