
HuaTuo (华驼): Tuning LLaMA Model with Chinese Medical Knowledge

Haochun Wang∗ , Chi Liu∗ , Nuwa Xi, Zewen Qiang, Sendong Zhao, Bing Qin and Ting Liu
Research Center for Social Computing and Information Retrieval,
Harbin Institute of Technology, China
{hcwang,cliu,nwxi,zwqiang,sdzhao,bqin,tliu}@ir.hit.edu.cn

arXiv:2304.06975v1 [cs.CL] 14 Apr 2023

Abstract

Large Language Models (LLMs), such as the LLaMA model, have demonstrated their effectiveness in various general-domain natural language processing (NLP) tasks. Nevertheless, LLMs have not yet performed optimally in biomedical-domain tasks, owing to the need for medical expertise in the responses. In response to this challenge, we propose HuaTuo (华驼), a LLaMA-based model that has been supervised-fine-tuned with generated QA (Question-Answer) instances. The experimental results demonstrate that HuaTuo generates responses that possess more reliable medical knowledge. Our proposed HuaTuo model is accessible at https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese.

1 Introduction

The advent of instruction-following large language models (LLMs), represented by ChatGPT (OpenAI, 2022), has generated significant interest due to their exceptional performance in understanding instructions and generating human-like responses. Compared to smaller models, LLMs exhibit strong generalization across various natural language processing (NLP) tasks and a unique emergent ability to solve unseen or complicated tasks. Despite ChatGPT's non-open-source status, open-source communities have provided several alternatives, such as LLaMA (Touvron et al., 2023), with relatively affordable training costs. This positions LLMs as potential solutions for real-world scenarios requiring communication and reasoning.

However, despite their numerous merits, LLMs are not designed to cater specifically to the medical domain. Their general domain knowledge often falls short in such specialized fields, where accurate and domain-specific expert knowledge is critical. This can lead to sub-optimal diagnostic precision, drug recommendations, and medical advice, potentially endangering patients. Few efforts have been made to address this problem, with existing approaches primarily focusing on supplying LLMs with medical information retrieved from conversations, where human errors may occur more frequently. Additionally, LLMs are typically trained on English text, constraining their comprehension and response capabilities in languages that differ significantly from English, such as Chinese, and rendering their direct application in Chinese contexts less than ideal.

In this paper, we present the HuaTuo (华驼) model, an LLM tailored for the biomedical domain with a focus on the Chinese language. By generating diverse instruction data based on medical knowledge from CMeKG, we emphasize ensuring the factual correctness of the model's responses, which is vital in the biomedical domain. Through this process, we collect over 8,000 instruction instances for supervised fine-tuning. Our model builds upon the open-source LLaMA-7B base model, integrates structured and unstructured medical knowledge from the Chinese medical knowledge graph (CMeKG), and employs knowledge-based instruction data for fine-tuning.

In summary, our contributions can be summarized as follows:

• We introduce the HuaTuo model, the first open-source Chinese biomedical LLM tuned with knowledge-based instruction data;

• We integrate structured and unstructured medical knowledge from CMeKG, ensuring that our model has accurate and domain-specific knowledge;

• We propose SUS, a novel metric for evaluating LLMs in the biomedical domain, considering safety, usability, and smoothness.

* Equal contribution.
2 Related Works

2.1 Large Language Models

Recent advancements in large language models (LLMs) have demonstrated their superiority over previous-generation paradigms, such as pre-training and fine-tuning. The significant increase in model scale has led to qualitative changes in LLMs, commonly referred to as emergent abilities. These include in-context learning for zero-shot tasks and chains of thought that enhance the model's performance on complex tasks.

OpenAI's development of ChatGPT and GPT-4 has revolutionized the perception of LLMs. Although these models exhibit remarkable performance, OpenAI has not disclosed details regarding their training strategies or weight parameters. LLaMA serves as an open-source alternative to GPT, with sizes ranging from 7 billion to 65 billion parameters. Taori et al. trained Alpaca based on LLaMA with instruction tuning.

While comparable in performance to GPT-3.5, LLaMA performs poorly on Chinese tasks because its training data is primarily limited to English corpora. To address Chinese-specific applications, Du et al.; Zeng et al. introduced GLM, a 130 billion-parameter auto-regressive pre-trained model with multiple training objectives. ChatGLM further incorporates code training and aligns with human intentions through supervised fine-tuning, offering a tailored solution for Chinese contexts.

2.2 Pre-trained Models in the Biomedical Domain

Although large language models (LLMs) exhibit remarkable performance in general domains, their lack of domain-specific knowledge results in suboptimal performance in fields that require specialized expertise, such as biomedicine. The biomedical field's inherent nature necessitates models that possess comprehensive knowledge bases for relevant queries, particularly when applied to real-world situations where patients seek health and medical advice. Several efforts have been made to adapt LLMs to the biomedical domain.

Existing approaches primarily employ ChatGPT for data assistance and train smaller models using its distilled or translated knowledge. ChatDoctor (Li et al., 2023) represents the first attempt to adapt LLMs to the biomedical field by fine-tuning LLaMA with conversation demonstrations synthesized via ChatGPT. DoctorGLM (Xiong et al.) leverages ChatGLM-6B as the base model and fine-tunes it with the Chinese translation of the ChatDoctor dataset, obtained through ChatGPT. Additionally, Chen et al. develop a Chinese and Medically Enhanced Language model within their collection of LLMs. Collectively, these works illustrate the potential for LLMs to be successfully applied within the biomedical domain.

3 HuaTuo Model

In this section, we introduce the training process of our HuaTuo (华驼) model.

3.1 Base Model

LLaMA (Touvron et al., 2023) is a collection of multi-lingual base models with parameters ranging from 7 billion to 65 billion, which are open-source to the research community. Here, we adopt the LLaMA-7B model for more accessible training.

3.2 Medical Knowledge

There are various kinds of medical knowledge, generally including (1) structured medical knowledge, such as medical knowledge graphs, and (2) unstructured medical knowledge, such as medical guidelines. We utilized a Chinese medical knowledge graph, CMeKG (Odmaa et al., 2019), which provides retrieved medical knowledge about diseases, drugs, symptoms, etc. Table 1 shows several knowledge cases from the CMeKG knowledge base.

3.3 Knowledge-based Instruction Data

Instruct-tuning has proven effective for tuning large language models (Wei et al., 2022; Ouyang et al., 2022), helping models perform satisfactorily in zero-shot scenarios at the cost of sufficient annotated instructions. Inspired by the automatic construction of instructions along with instances (inputs and outputs) (Wang et al., 2022; Taori et al., 2023), we generate our instruction data based on the above medical knowledge.

As demonstrated in Table 2, instruct-tuning involves supervised fine-tuning on the training instances and an instruction that describes the task in natural language. However, for a large language model for medical dialogue, inputs are mostly stated as questions and the instructions are all of the form "Answer the following question". Therefore, we dispose of the instructions and preserve only the inputs for our HuaTuo.

While the generated instructions are required to be diverse enough for unseen tasks (Wang et al.,
Type: Disease
Knowledge in Chinese:
{"class": "百种常见病", "中心词": "肝癌", "药物治疗": ["瑞格非尼", "对乙型或丙型肝炎有效的抗病毒药物", "索拉非尼"], "多发地区": ["撒哈拉以南的非洲"], "高危因素": ["肥胖", "HBV DNA过高", "慢性酗酒", "男性", "慢性乙型肝炎感染", "肝癌家族史", "慢性丙型肝炎肝硬化", "核心启动子突变", "肝硬化", "HCV重叠感染", "老年性心瓣膜病", "乙型肝炎e抗原", "糖尿病"], "发病部位": ["肝脏"], "辅助检查": ["肝功能检查"], "病史": ["长期慢性乙肝病史"]}
Knowledge translated to English:
{"class": "Common Diseases", "Key Word": "Liver Cancer", "Drug Treatment": ["Regorafenib", "Antiviral drugs effective against hepatitis B or C", "Sorafenib"], "High Prevalence Regions": ["Sub-Saharan Africa"], "High Risk Factors": ["Obesity", "High HBV DNA levels", "Chronic alcoholism", "Male gender", "Chronic hepatitis B infection", "Family history of liver cancer", "Cirrhosis due to chronic hepatitis C", "Core promoter mutation", "Liver cirrhosis", "HCV co-infection", "Senile valvular heart disease", "Hepatitis B e antigen", "Diabetes"], "Affected Area": ["Liver"], "Auxiliary Examination": ["Liver function test"], "Medical History": ["Long-term history of chronic hepatitis B"]}

Type: Drug
Knowledge in Chinese:
{"class": "西药", "中心词": "二甲双胍", "性状": ["糖衣或薄膜衣片,除去包衣后显白色"], "英文名称": ["异福片", "格华止"], "分类": ["双胍类", "抗结核病药"], "规格": ["0.25g"], "OTC类型": ["乙类OTC", "甲类OTC"], "适应证": ["糖尿病", "肥胖"], "通用名": ["异福片"], "成份": ["利福平及异烟肼", "异烟肼", "异烟肼0.1克", "异烟肼150毫克", "本品为复方制剂", "利福平", "利福平300毫克", "利福平0.15克", "盐酸二甲双胍", "盐酸"]}
Knowledge translated to English:
{"Class": "Western Medicine", "Key Word": "Metformin", "Appearance": ["Sugar-coated or film-coated tablets, white after removal of coating"], "English Names": ["Yifupian", "Gehuazhi"], "Classification": ["Biguanide class", "Anti-tuberculosis drug"], "Specifications": ["0.25g"], "OTC Types": ["OTC Class B", "OTC Class A"], "Indications": ["Diabetes", "Obesity"], "Generic Name": ["Yifupian"], "Ingredients": ["Isoniazid and pyrazinamide", "Pyrazinamide", "0.1g pyrazinamide", "150mg pyrazinamide", "This product is a compound preparation", "Isoniazid", "300mg isoniazid", "0.15g isoniazid", "Metformin hydrochloride", "Hydrochloride"]}

Type: Symptom
Knowledge in Chinese:
{"中心词": "毛发脱落", "检查": ["毛发矿物质检查"], "相关疾病": ["斑秃", "慢性疲劳综合症"], "相关症状": ["毛发色淡而呈棕色", "毛发干燥易断", "皮肤变硬"], "所属科室": ["内科", "皮肤性病", "放疗、化疗科"], "发病部位": ["头部"]}
Knowledge translated to English:
{"Key Word": "Hair Loss", "Examinations": ["Hair mineral analysis"], "Related Diseases": ["Alopecia areata", "Chronic Fatigue Syndrome"], "Related Symptoms": ["Hair color is light and brown", "Hair is dry and brittle", "Skin becomes hardened"], "Related Departments": ["Internal Medicine", "Dermatology and Venereology", "Radiation and Chemotherapy"], "Affected Area": ["Head"]}

Table 1: Knowledge cases in the CMeKG.
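To make the use of such structured records concrete, the sketch below parses a simplified CMeKG-style entry and assembles a seed prompt that could be sent to an instruction-generating model. This is a minimal illustration, not the authors' actual pipeline: the field names, record contents, and prompt wording are all assumptions made for the example.

```python
import json

# A simplified, hypothetical CMeKG-style disease record
# (field names loosely follow Table 1; contents are illustrative).
record_json = '''
{"class": "Common Diseases",
 "key_word": "Liver Cancer",
 "drug_treatment": ["Regorafenib", "Sorafenib"],
 "high_risk_factors": ["Obesity", "Chronic alcoholism"]}
'''

def build_seed_prompt(record: dict) -> str:
    """Turn one knowledge record into a seed prompt asking an LLM to
    write a medical QA instance grounded in the listed facts."""
    facts = "; ".join(
        f"{field}: {', '.join(v) if isinstance(v, list) else v}"
        for field, v in record.items()
        if field != "class"  # the coarse category adds little to the prompt
    )
    return (
        "Based on the following medical knowledge, write one question a "
        "patient might ask and a factually correct answer.\n"
        f"Knowledge: {facts}"
    )

record = json.loads(record_json)
prompt = build_seed_prompt(record)
print(prompt)
```

In this sketch the resulting prompt would then be passed to a generation API to produce one QA training instance per sampled record.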


Instruction:
Translate the following sentence into Chinese.

Input:
What are the possible reasons for liver cancer?

Output:
肝癌可能的原因有什么?

Table 2: Instance with an instruction.

2022) in the general domain, the correctness of the facts in the responses from the large language model is of more concern in the biomedical domain (Gilson et al., 2023). Thus, we first sample knowledge instances from the knowledge graph and then generate the instances based on the specific knowledge with the OpenAI API (OpenAI, 2022). Finally, we collect over 8,000 instruction instances, like the examples in Table 3, as training instances for supervised fine-tuning.

4 Experiment

4.1 Baselines

In order to demonstrate the superior performance of HuaTuo, we conducted a comparative analysis with three baseline models.

• LLaMA (Touvron et al., 2023) serves as the foundation model for our HuaTuo. In particular, we employed LLaMA-7B for its relative fairness in comparison to the other baselines and its ease of training.

• Alpaca (Taori et al., 2023) is an instruction-tuned version of LLaMA that boasts more than 80,000 instances generated in the general domain.

• ChatGLM (Zeng et al., 2023) is an optimized dialogue model specifically designed for Chinese chatting scenarios. In our analysis, we compared HuaTuo's performance with ChatGLM-6B.

4.2 Metrics

For generation tasks in the general domain, evaluation metrics such as BLEU and ROUGE are utilized to determine whether a generative model can produce responses similar to the ground truth. However, for medical QA tasks we consider three dimensions: (1) safety, (2) usability, and (3) smoothness. Safety determines whether the response includes anything that can mislead the user into danger, such as wrong medicine recommendations. Usability reflects the medical expertise of a specific response. Smoothness represents the ability as a language model.

In the domain of natural language generation, various evaluation metrics are utilized to assess the efficacy of generative models. The widely used metrics in the general domain include BLEU and ROUGE, which compare generated responses with the ground truth. Additionally, for medical question-answering tasks, we introduce an evaluation metric, SUS. The SUS metric consists of three dimensions: Safety, Usability, and Smoothness. The "Safety" dimension assesses whether the generated response has the potential to mislead the user and pose a danger to their health, for example, through incorrect medication recommendations. The "Usability" dimension evaluates the extent to which the generated response reflects medical expertise, while the "Smoothness" dimension measures the proficiency of the generative model as a language model.

4.3 Results

In this study, we constructed a test set of potential questions in Chinese dialogue scenarios and compared the generated responses of our HuaTuo model with those of the three baseline models. To evaluate model performance, we recruited five annotators with medical backgrounds who assessed the randomly mixed responses of the models using a three-point scale for each dimension of Safety, Usability, and Smoothness (SUS). The SUS scale ranges from 1 (not acceptable) to 3 (good), with 2 indicating an acceptable response. The average SUS scores are presented in Table 4. Although LLaMA achieves the highest safety score, its responses are often uninformative and merely rephrase the questions, resulting in a low usability score. On the other hand, our HuaTuo model significantly improves knowledge usability without much compromise in safety.

Ethics Statement

The HuaTuo initiative is primarily dedicated to research and is not intended to offer medical advice. The medical information utilized in this study is sourced from an open-access medical knowledge graph. It is important to note that the accuracy of responses generated by large language models cannot
Instance (in Chinese)
Question: 患者XXX因头部及眼后部疼痛并能听到连续不断的隆隆声来到医院,医生对其进行了检查后诊断为偏头痛。请问,该病有什么治疗方法?
Answer: 治疗偏头痛的方法包括药物治疗和非药物治疗。药物治疗主要包括三大类:急性发作期治疗、预防性治疗和混合治疗。非药物治疗包括改变生活方式,如减少熬夜、避免过度劳累、避免饮酒、戒烟等;此外还包括针灸、按摩、脉冲放电治疗等。在治疗中应该综合评估患者的病情、临床表现、个人情况等因素,制定个性化的治疗方案。

Instance (translated to English)
Question: Patient XXX came to the hospital due to pain in the head and behind the eyes, accompanied by continuous rumbling sounds. After an examination, the doctor diagnosed the patient with a migraine. What are the treatment options for this condition?
Answer: The methods for treating migraines include medication and non-pharmacological therapies. Medication therapy primarily consists of three categories: acute attack treatment, preventative treatment, and mixed treatment. Non-pharmacological therapies include lifestyle changes such as reducing staying up late, avoiding overwork, avoiding alcohol consumption, and quitting smoking; in addition, they also include treatments such as acupuncture, massage, and pulse discharge therapy. During treatment, a comprehensive assessment of the patient's condition, clinical manifestations, personal circumstances, and other factors should be taken into account to develop a personalized treatment plan.

Table 3: Example of the supervised fine-tuning training instances.
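Since the paper discards the generic "Answer the following question" instruction and keeps only the question as input, a Table-3-style pair reduces to a simple input/output record. The sketch below is a minimal illustration of that formatting step; the function name, field names, and example text are assumptions, not the authors' actual preprocessing code.

```python
def format_training_instance(question: str, answer: str) -> dict:
    """Format one QA pair for supervised fine-tuning. Following the
    paper's setup, no separate task instruction is kept: the patient
    question itself serves as the model input."""
    return {"input": question.strip(), "output": answer.strip()}

# Illustrative pair, paraphrasing the Table 3 example.
example = format_training_instance(
    "What are the treatment options for migraine?",
    "Treatments include acute-attack medication, preventative medication, "
    "and lifestyle changes such as avoiding alcohol and overwork.",
)
print(example["input"])
```

Records of this shape, collected over the 8,000+ generated instances, would then form the supervised fine-tuning corpus.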

Model           Safety  Usability  Smoothness

LLaMA           2.93    1.21       1.58
Alpaca          2.64    2.05       2.30
ChatGLM         2.59    1.93       2.41
HuaTuo (华驼)   2.88    2.12       2.47

Table 4: Experimental results of SUS score for the models.
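Averages like those in Table 4 can be computed directly from raw annotator ratings. The sketch below assumes one 1–3 rating per annotator per SUS dimension for a single model; the rating values themselves are invented for illustration and do not reproduce the paper's data.

```python
from statistics import mean

# Hypothetical ratings: five annotators, each giving a 1-3 score
# per SUS dimension for one model's responses.
ratings = {
    "Safety":     [3, 3, 3, 2, 3],
    "Usability":  [2, 2, 3, 2, 2],
    "Smoothness": [3, 2, 2, 3, 2],
}

# Average each dimension over annotators, as reported in Table 4.
sus_scores = {dim: round(mean(vals), 2) for dim, vals in ratings.items()}
print(sus_scores)  # one averaged score per dimension
```

In the paper's evaluation, responses from all models are randomly mixed before annotation, so the same averaging would be applied per model after un-blinding.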

be guaranteed, and the medical knowledge utilized therein should not be construed as a substitute for professional medical advice. If one experiences any discomfort or distress, it is strongly advised to seek the guidance of a qualified medical professional.

References

Zhihong Chen, Junying Chen, Hongbo Zhang, Feng Jiang, Guiming Chen, Fei Yu, Tiannan Wang, Juhao Liang, Chen Zhang, Zhiyi Zhang, Jianquan Li, Xiang Wan, Haizhou Li, and Benyou Wang. 2023. LLM Zoo: Democratizing ChatGPT. https://github.com/FreedomIntelligence/LLMZoo.

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.

Aidan Gilson, Conrad W Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, David Chartash, et al. 2023. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Medical Education, 9(1):e45312.

Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, and You Zhang. 2023. ChatDoctor: A medical chat model fine-tuned on LLaMA model using medical domain knowledge.

Byambasuren Odmaa, Yang Yunfei, Sui Zhifang, Dai Damai, Chang Baobao, Li Sujian, and Zan Hongying. 2019. Preliminary study on the construction of Chinese medical knowledge graph. Journal of Chinese Information Processing, 33(10):1–7.

OpenAI. 2022. ChatGPT. https://chat.openai.com.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Honglin Xiong, Sheng Wang, Yitao Zhu, Zihao Zhao, Yuxiao Liu, Qian Wang, and Dinggang Shen. DoctorGLM: Fine-tuning your Chinese doctor is not a herculean task.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, Weng Lam Tam, Zixuan Ma, Yufei Xue, Jidong Zhai, Wenguang Chen, Zhiyuan Liu, Peng Zhang, Yuxiao Dong, and Jie Tang. 2023. GLM-130B: An open bilingual pre-trained model. In The Eleventh International Conference on Learning Representations (ICLR).
