{20214001062,2022024954}@m.scnu.edu.cn, fanchenyou@scnu.edu.cn
1 Introduction
Classroom teaching is a crucial social process that involves extended interactions
between teachers and students. Within this context, teacher-student dialogues
promote the exchange of knowledge, skills, and ideas, and demonstrate cognitive
processes such as questioning, critical thinking, answering, reflecting, and receiving
feedback. These dialogues serve as a vital medium for effective communication
and the development of independent thinking skills among students.
In this perspective paper, we explore the use of Artificial Intelligence for
Education and focus on utilizing the recent successful Large Language Models
2 K. Tan et al.
[Figure: an AI Teacher interacting with students whose answers are S1: x-5, S2: x*5, and S3: 5*x]
Additionally, there may be times when the class syllabus is temporarily altered,
requiring the teacher to design a new lesson within a limited time frame for public
demonstrations or other events. Consequently, it can be difficult to ensure the delivery of a high-quality, well-designed lesson; we refer to this scenario as a case of “style transfer.” Another challenge is faced by novice teachers who may struggle to
develop an effective dialogue plan, given their relative lack of experience and
limited access to teaching resources such as senior teachers’ templates. This is
known as a case of “knowledge transfer.”
Finally, the task of fairly grading teaching dialogues and offering constructive
feedback to teachers for self-enhancement remains a challenging issue in pedagogy
that presently necessitates meticulous human evaluations. This conundrum is
referred to as “dialogue assessment.”
In this perspective paper, we propose that advanced AI models can play a
crucial role in addressing key interactive scenarios during emergency situations
and improving the overall quality of dialogue. Specifically, based on the annotated
cases mentioned above, we identify three crucial areas of research:
2 Related Work
Large Language Models (LLMs). Since the inception of the Transformer architecture [47] in 2017, language models such as ELMo [31] and the Transformer-based BERT [12] have dominated the NLP realm by producing leading performance.
Recently, the practice of scaling up model sizes to build large language models (LLMs) has received substantial attention. LLMs such as the GPT family [6, 30], PaLM [10], Galactica [43], LaMDA [44], and LLaMA [46] have emerged at enormous scales, ranging from 10B to 1000B parameters.
These huge models show great capacities in understanding human languages and
performing reasoning tasks such as arithmetic reasoning and question answering.
Inspired by LLMs, Large Visual Models (LVMs), such as Vision Transformer
(ViT) [13] and SAM [21], and multi-modality models, such as CLIP [33], have
also emerged. These studies propose various ways of fusing different modalities
such as texts [12], image patches [9,33], segmentation masks [29], and sounds [34].
Many recent efforts have been made to improve the answering quality of LLMs. The Chain-of-Thought (CoT) technique [50] proposes to add exemplar intermediate reasoning steps as hints at the head of the original query, which encourages the model to produce its own step-by-step reasoning before giving the final answer.
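As a minimal illustration, CoT prompting can be sketched as simple prompt construction; the exemplar and helper function below are illustrative assumptions, not taken from the paper:

```python
# Illustrative Chain-of-Thought (CoT) prompting: an exemplar whose
# intermediate reasoning steps are written out is placed at the head of
# the original query, nudging the model to reason step by step.
EXEMPLAR = (
    "Q: A class has 12 girls and 9 boys. How many students are there?\n"
    "A: There are 12 girls and 9 boys. 12 + 9 = 21. The answer is 21.\n"
)

def build_cot_prompt(query: str) -> str:
    """Prepend the reasoning exemplar to the original query."""
    return EXEMPLAR + "Q: " + query + "\nA:"

prompt = build_cot_prompt("Solve 3x = 15 for x.")
```

Sending such a prompt to an LLM typically elicits intermediate steps before the final answer, which is the effect the CoT technique relies on.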
Opportunities, Challenges and Prospects of LLMs in Classroom Teaching 5
3 Model Architecture
[Figure: model architecture with Dialogue Encoding and Teaching Scheme components]
physics teachers accordingly. This explicit role-playing setup provides specific context for the generated content.
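Such a role-playing setup can be sketched as a system-prompt template; the function and its wording are illustrative assumptions, not the paper's implementation:

```python
def build_role_prompt(subject: str, grade_level: str, topic: str) -> str:
    """Compose an illustrative system prompt that casts the LLM as a
    specific subject teacher, giving explicit context for generation."""
    return (
        f"You are an experienced {subject} teacher preparing a lesson "
        f"for {grade_level} students.\n"
        f"Topic: {topic}\n"
        "Generate a classroom dialogue that teaches this topic step by step."
    )

prompt = build_role_prompt("physics", "9th-grade", "Newton's second law")
```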
The third practical user scenario we consider is assessing AIGC for classroom teaching fairly and consistently. The crucial challenge in using AIGC directly in a classroom setting is how to evaluate the quality of the generated content. To overcome this challenge, we propose a set of strategies that promote grading that is equitable, competent, and comparable to human expert evaluations.
1) Grading Based on Human-Feedback. We may seek the input of human
pedagogical experts and senior teachers to evaluate the quality of our generated
dialogues and content. For example, a usual practice is to ask the human evaluators
to assess the quality of the generated answer on a scale of 1 to 10, with 1 indicating
the lowest quality and 10 indicating the highest quality, and to provide a written
explanation justifying their scores. To avoid personal biases, we ensure that each
teaching case is scored by multiple experts and calculate the final score for each
case based on a reasonable set of scores or majority voting.
The outcomes from this process are then averaged to obtain a final grade for
each piece of content. Nevertheless, it is evident that this Human-Feedback based
rating approach has a crucial drawback: it can be very costly since it requires
human involvement. As a result, it cannot be easily scaled up in the same way
as AIGC, and it is limited in its applicability.
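As a sketch, multi-expert score aggregation might look as follows; the trimming rule is an illustrative choice for dampening personal bias, not a procedure specified in the paper:

```python
def aggregate_scores(scores: list[int]) -> float:
    """Aggregate several experts' 1-10 ratings for one teaching case.
    When at least three experts rated the case, drop the single highest
    and lowest score to dampen personal bias, then average the rest."""
    if len(scores) >= 3:
        scores = sorted(scores)[1:-1]  # trim one outlier at each end
    return sum(scores) / len(scores)

# Five experts rate one generated dialogue; the stray 3 is trimmed away.
final_grade = aggregate_scores([7, 8, 8, 9, 3])  # averages 7, 8, 8
```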
2) Grading Based on LLMs. LLMs are versatile and capable AI models
that, theoretically, can grade their own answers effectively. To achieve this self-
grading ability, we can follow a methodology called Self-Ask [32].
Specifically, to evaluate the quality of generated answers for downstream tasks, we utilize a prompting prefix indicating that the response was self-generated based on the preceding conversation. We can then obtain the LLM's grading and comments accordingly. By automating this process with the prompting prefix,
we can efficiently obtain a comprehensive set of scores for all generated answers.
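A self-grading prompt of this kind might be assembled as below; the prefix wording and helper function are illustrative assumptions rather than the exact Self-Ask formulation:

```python
# Illustrative self-grading prefix in the spirit of Self-Ask [32]: it
# tells the LLM the answer was self-generated from the preceding
# conversation, then requests a 1-10 score with a justification.
GRADING_PREFIX = (
    "The following answer was generated by you based on the preceding "
    "conversation. Rate its quality on a scale of 1 to 10 and briefly "
    "justify your score.\n"
)

def build_self_grading_prompt(context: str, answer: str) -> str:
    """Combine conversation context, grading prefix, and the answer."""
    return (
        f"Conversation so far:\n{context}\n\n"
        f"{GRADING_PREFIX}"
        f"Answer to grade:\n{answer}\n"
        "Score (1-10) and comment:"
    )

grading_prompt = build_self_grading_prompt(
    "Teacher: What is the value of x if 3x = 15?", "Student answer: x = 5"
)
```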
Nevertheless, self-grading has systematic drawbacks. First, the general-purpose
LLMs lack the background knowledge necessary to accurately assess educational
objectives. As a result, grading answers is often based on general guesses such
as the fluency of the dialogue, rather than on dialogical reasons. Additionally,
it is widely known that most LLMs are better suited to understanding English
In this section, we outline our ideas for refining the capability and interpretability
of large AI models in order to unlock their domain-specific capabilities.
Supervised ranking loss. Inspired by the supervised Reward Modeling
methodology as mentioned in Section 4.3, we can design the following contrastive
ranking loss to align the AI-generated grading with human evaluations. Specifi-
cally, we can randomly select K AI-generated answers in each round. We then
ask both the human annotators and the Ext-LLM to grade the answers based
on the same contexts for generation. Thus, we obtain K pairs of human and AI
grading as a dataset D.
We design a ranking loss function to align the Ext-LLM grading capacity
with humans by minimizing the discrepancy of their grading results, as follows:
$$\mathrm{loss}(\theta) = -\frac{1}{K(K-1)/2}\; \mathbb{E}_{(x,\,y_1,\,y_2)\sim D}\Big[\log\big(\sigma\big(r_\theta(x, y_1) - r_\theta(x, y_2)\big)\big)\Big]. \tag{1}$$
In Eq. 1, rθ(x, y1) and rθ(x, y2) are the Ext-LLM's gradings of the AI-generated answers y1 and y2 under context x, respectively. Notably, y1 is the answer of the randomly sampled pair (y1, y2) preferred by human grading. By minimizing the ranking loss in Eq. 1 over the K(K − 1)/2 distinct pairs enumerated from the K answers in D, we can fine-tune the Ext-LLM parameters θ to align its grading ability with humans. It is convenient to extend this grading-alignment procedure to comments by using a reward model for comments.
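For concreteness, Eq. 1 can be computed as below for a single context x, assuming for simplicity that the K answers are already sorted by human preference (an assumption of this sketch, not a requirement of the method):

```python
import math
from itertools import combinations

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def ranking_loss(grades: list[float]) -> float:
    """Pairwise ranking loss of Eq. 1 for one context x.
    `grades` holds the Ext-LLM grades r_theta(x, y_i) for K answers,
    ordered from most to least preferred by human annotators, so in
    every enumerated pair the earlier answer plays the role of y_1."""
    pairs = list(combinations(range(len(grades)), 2))  # K(K-1)/2 pairs
    total = sum(
        math.log(sigmoid(grades[i] - grades[j])) for i, j in pairs
    )
    return -total / len(pairs)

# Grades that agree with the human ordering give a small loss; an
# inverted ordering gives a large one, so minimization aligns the two.
aligned = ranking_loss([9.0, 6.0, 3.0])
inverted = ranking_loss([3.0, 6.0, 9.0])
```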
Reinforcement Learning from Human Feedback methodology (RLHF).
Many schools have accumulated self-collected teaching databases over the years.
To fully explore those specialized educational datasets to improve AIGC, we can
further involve the human-in-the-loop processes for enhancing both assessment
and generation. To this end, we can construct more sophisticated and effective
approaches by employing extensive human interventions to further refine the
Ext-LLM model, based on the RLHF technique [11, 30].
We take the dialogue auto-completion task as an example. Given the context x and a dialogue yl generated by an LLM, we train the Ext-LLM, via a normal prompt, to refine the dialogue into ye. Let the reward of the Ext-LLM's output ye be rθ(x, ye), and let the human-annotated grade be rh(x, ye). We can utilize a standard RL method such as PPO [37] to maximize the reward rθ(x, ye) while taking the human reward as a baseline to make the training process robust. During training, we can also jointly minimize the ranking loss in Eq. 1 as a multi-task learning objective.
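As a minimal sketch of the reward shaping just described, the human grade can serve as a baseline subtracted from the model reward; the simple subtraction is an illustrative choice, not the paper's exact PPO objective:

```python
def shaped_reward(r_theta: float, r_h: float) -> float:
    """Advantage-like training signal for PPO-style fine-tuning on the
    dialogue auto-completion task: maximize the Ext-LLM reward
    r_theta(x, y_e) while using the human grade r_h(x, y_e) as a
    baseline that reduces variance and keeps training robust."""
    return r_theta - r_h

# A refined dialogue the model rewards at 8.5 but humans grade at 7.0
# yields a positive signal of 1.5 for the policy update.
signal = shaped_reward(8.5, 7.0)
```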
style transfer. For the auto-completion task, we presented two feasible ways of
generating dialogues from user prompts, helping teachers to deal with unforeseen
events or necessities of revising the teaching plan within a short period of time.
For the teaching dialogue transfer task, we showed the potential of AIGC for
imitating expert teaching dialogues and providing high-quality dialogues for a
wider range of topics and subjects.
In Section 5, we propose fully fine-tuning a controllable and competent External LLM (Ext-LLM) for grading and commenting, to enhance the interpretability and explainability of black-box large AI models. Specifically, we propose employing medium-sized Ext-LLMs that maintain the capacity for content generation as well as flexibility and interpretability, tuned on our own teaching databases or directly on human evaluations and dialogue corrections.
Finally, we identify several open problems related to the utilization of AIGC in classroom teaching, to which future work can pay special attention.
1. How to extend the power of AI to generate multi-modal and multimedia content, such as teaching videos and audio, across multiple disciplines including arts, music, and physical education.
2. How to ensure the privacy of AIGC, a crucial issue that can hinder its widespread use. Studies can focus on preventing the generation of fake faces through malicious DeepFake techniques [36, 45]. We can also explore differential privacy techniques [3, 14, 15] to protect generated personal data and identities.
Ultimately, we hope that this perspective paper can have a constructive impact
on helping more AI and pedagogical researchers to understand the frontiers of
emerging concepts in LLMs and AIGC and gain insights into the potential of
applying these techniques to the cause of education. Moving forward, we provide
a comprehensive analysis of the limitations and potentials of AI-generation
techniques and propose several accessible directions for further research. We wish
for the development of AIGC for educational purposes to be safe and meaningful,
so that it can be used to its fullest potential without growing uncontrollably.
References
1. Midjourney. https://www.midjourney.com/ (accessed May 2, 2023)
2. Wolfram plugin for chatgpt. https://www.wolfram.com/wolfram-plugin-chatgpt/
(accessed May 4, 2023)
3. Abadi, M., Chu, A., Goodfellow, I., McMahan, H.B., Mironov, I., Talwar, K., Zhang,
L.: Deep learning with differential privacy. In: SIGSAC (2016)
4. Alexander, R.J.: Border crossings: Towards a comparative pedagogy. Comparative
Education 37(4), 507–523 (2001)
5. Asterhan, C., Clarke, S., Resnick, L.: Socializing intelligence through academic talk
and dialogue. Socializing Intelligence Through Academic Talk and Dialogue pp.
1–480 (2015)
6. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee-
lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot
learners. Advances in neural information processing systems 33, 1877–1901 (2020)
7. Cao, Y., Li, S., Liu, Y., Yan, Z., Dai, Y., Yu, P.S., Sun, L.: A comprehensive survey
of ai-generated content (aigc): A history of generative ai from gan to chatgpt. arXiv
preprint arXiv:2303.04226 (2023)
8. Chen, M., Sutskever, I., Zaremba, W., et al.: Evaluating large language models
trained on code. arXiv preprint arXiv:2107.03374 (2021)
9. Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.:
Uniter: Universal image-text representation learning. In: ECCV. pp. 104–120 (2020)
10. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham,
P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling language modeling
with pathways. arXiv preprint arXiv:2204.02311 (2022)
11. Christiano, P.F., Leike, J., Brown, T., Martic, M., Legg, S., Amodei, D.: Deep
reinforcement learning from human preferences. Advances in neural information
processing systems 30 (2017)
12. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi-
rectional transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)
13. Dosovitskiy, A., et al.: An image is worth 16x16 words: Transformers for image
recognition at scale. ICLR (2020)
14. Dwork, C., Roth, A.: The Algorithmic Foundations of Differential Privacy (2014)
15. Fan, C., Hu, J., Huang, J.: Private semi-supervised federated learning. In: Interna-
tional Joint Conference on Artificial Intelligence (2022)
16. Guzmán, V., Larrain, A.: The transformation of pedagogical practices into dialogic
teaching: towards a dialogic notion of teacher learning. Professional Development
in Education pp. 1–14 (2021)
17. Howe, C., Abedin, M.: Classroom dialogue: A systematic review across four decades
of research. Cambridge journal of education 43(3), 325–356 (2013)
18. Howe, C., Hennessy, S., Mercer, N., Vrikki, M., Wheatley, L.: Teacher–student
dialogue during classroom teaching: Does it really impact on student outcomes?
Journal of the Learning Sciences 0, 1–51 (03 2019)
19. Howe, C., Ilie, S., Guardia, P., Hofmann, R., Mercer, N., Riga, F.: Principled
improvement in science: Forces and proportional relations in early secondary-school
teaching. International Journal of Science Education 37(1), 162–184 (2015)
20. Howe, C., Mercer, N.: Commentary on the papers. Language and Education 31(1),
83–92 (2017)
21. Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T.,
Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything.
arXiv:2304.02643 (2023)
22. Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-
training with frozen image encoders and large language models. arXiv preprint
arXiv:2301.12597 (2023)
23. Liu, X., Pang, T., Fan, C.: Federated prompting and chain-of-thought reasoning
for improving llms answering. arXiv preprint arXiv:2304.13911 (2023)
24. Loewen, S., Sato, M.: Interaction and instructed second language acquisition.
Language teaching 51(3), 285–329 (2018)
25. Sato, M., Ballinger, S.: Understanding peer interaction: Research synthesis
and directions. Peer interaction and second language learning: Pedagogical potential
and research agenda p. 1 (2016)
26. Mercer, N.: The seeds of time: Why classroom dialogue needs a temporal analysis.
The journal of the learning sciences 17(1), 33–59 (2008)
27. Mercer, N., Dawes, L.: The study of talk between teachers and students, from the
1970s until the 2010s. Oxford review of education 40(4), 430–445 (2014)
46. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T.,
Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient
foundation language models. arXiv preprint arXiv:2302.13971 (2023)
47. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
Ł., Polosukhin, I.: Attention is all you need. NIPS (2017)
48. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A.,
Zhou, D.: Self-consistency improves chain of thought reasoning in language models
(2023)
49. Wankhade, M., Rao, A.C.S., Kulkarni, C.: A survey on sentiment analysis methods,
applications, and challenges. Artificial Intelligence Review 55(7), 5731–5780 (2022)
50. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., Zhou, D.: Chain
of thought prompting elicits reasoning in large language models. arXiv preprint
arXiv:2201.11903 (2022)
51. Wu, S., Irsoy, O., Lu, S., Dabravolski, V., Dredze, M., Gehrmann, S., Kambadur,
P., Rosenberg, D., Mann, G.: Bloomberggpt: A large language model for finance.
arXiv preprint arXiv:2303.17564 (2023)
52. Yang, J., Jin, H., Tang, R., Han, X., Feng, Q., Jiang, H., Yin, B., Hu, X.: Harnessing
the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint
arXiv:2304.13712 (2023)
53. Zaib, M., Zhang, W.E., Sheng, Q.Z., Mahmood, A., Zhang, Y.: Conversational
question answering: A survey. Knowledge and Information Systems 64(12), 3151–
3195 (2022)
54. Zhang, Z., Zhang, A., Li, M., Smola, A.: Automatic chain of thought prompting in
large language models. arXiv preprint arXiv:2210.03493 (2022)
55. Zhao, W.X., Zhou, K., et al.: A survey of large language models. arXiv preprint
arXiv:2303.18223 (2023)