2 Related Work

2.1 Instruction Tuning Dataset

Instruction tuning aims to train large language models to generate responses that align with input instructions, thereby equipping LLMs with conversational and task-execution capabilities. Compared with a standard LLM, an SFT model behaves more controllably and predictably, serving the goal of alignment with human intent. Methods for building instruction tuning datasets include: (1) pure manual annotation (Conover et al., 2023), which constructs instructions and answers entirely by hand and is very time-consuming and labor-intensive; (2) conversion from existing datasets (Mishra et al., 2022; Sanh et al., 2022; Chung et al., 2022), where supervised datasets from NLP tasks are turned into instruction tuning data; and (3) automatic generation using LLMs (Honovich et al., 2022; Wang et al., 2023; Xu et al., 2023a; Ji et al., 2023; Xu et al., 2023b).

Recently, more and more studies have begun to pay attention to the importance of data quality in instruction tuning. LIMA (Zhou et al., 2023) uses only 1,000 high-quality instructions and outputs for SFT, achieving very strong performance without any RLHF training. AlpaGasus (Chen et al., 2023) uses a powerful LLM to automatically identify and filter out low-quality data, yielding high-quality instruction tuning data that improves both performance and training speed. Humpback (Li et al., 2023) filters for high-quality samples to fine-tune a more powerful LLM.

Other work (Song et al., 2023) explores the impact of mixing strategies across different instruction tuning datasets. The Tulu series (Konchakov et al., 2023; Ivison et al., 2023) shows that increasing instruction diversity can effectively improve performance and that different instruction tuning datasets can elicit or enhance specific skills, while no single dataset (or combination) yields the best performance across all assessments.
Source                         Quantity    Source                          Quantity
Zhihu                          8837        Douban                          3132
Xiaohongshu                    1508        Segment Fault                   458
Encyclopedia Article           980         Encyclopedia of China           1706
WikiHow                        1876        COIG PC                         3000
Middle School Exam             2000        Graduate Entrance Examination   475
Logi QA                        422         CValue                          906
COIG-Human-Value               101         Chinese Traditional             232
Idiom Explanation              112         Poem Writing                    47
Classical Chinese Translation  112         MBA Encyclopedia                10689
Finance NLP Task               600         Medical Encyclopedia            8351
Medical Article                186         Law                             2645
Total                          48375

Table 1: Data statistics for all sources.

…contents posted before 2018, as earlier content may become outdated due to changes in programming languages or software versions. We then select the "accepted" answers with at least 5 upvotes. Furthermore, we manually review all the (question, answer) pairs to remove or modify low-quality content.
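This pre-filtering step is mechanical enough to sketch in code. A minimal sketch, assuming each crawled (question, answer) record is a dict with hypothetical post_date, accepted, and upvotes fields; the manual review pass happens afterwards:

```python
from datetime import date

def keep_answer(record: dict) -> bool:
    """Automatic filters applied before manual review.

    Field names (post_date, accepted, upvotes) are hypothetical; the
    thresholds follow the criteria stated in the text above.
    """
    return (
        record["post_date"] >= date(2018, 1, 1)  # drop pre-2018 content
        and record["accepted"]                   # keep only "accepted" answers
        and record["upvotes"] >= 5               # require at least 5 upvotes
    )

# Example records standing in for the crawler's output.
crawled_records = [
    {"post_date": date(2021, 5, 1), "accepted": True, "upvotes": 12},
    {"post_date": date(2016, 3, 9), "accepted": True, "upvotes": 40},
]
candidates = [r for r in crawled_records if keep_answer(r)]  # keeps only the first
```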
Figure 1: Most common root verbs (inner circle) and top direct noun objects (outer circle) in the CQIA dataset. Note that we only visualize a verb-noun pair when it has more than 30 instances, and many instructions do not contain a verb-noun structure.

Figure 2: Overview of CQIA Task Types.

4 Data Analysis

4.1 Statistics

Table 1 describes the data statistics for all sources. We collected a total of 48,375 instances from 22 sources across the Chinese Internet and its communities, covering domains ranging from general knowledge and STEM to the humanities. Figure 2 illustrates the variety of task types, encompassing information extraction, question answering, code generation, etc. We show the distributions of instruction and response lengths in Figure 3.
4.2 Diversity

To analyze the diversity of the COIG-CQIA dataset, we follow prior work (Wang et al., 2023; Lou et al., 2023) and employ the HanLP tool (He and Choi, 2021) to parse the instructions, extracting the verb closest to the root along with its top direct noun object. We then plot the top 20 most common root verbs and their corresponding direct noun objects in Figure 1. The figure shows that CQIA features a diverse range of instructions and intentions.
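Concretely, the verb-noun pairs can be read off a dependency parse: take the token attached to the artificial root as the root verb, then the noun it governs as a direct object. A minimal sketch with HanLP 2.x; the pretrained-model constant follows the HanLP docs, while the 'dobj' relation label and the `instructions` list are assumptions:

```python
import hanlp
from collections import Counter

# HanLP multi-task pipeline (tokenization + dependency parsing); the model
# constant follows the HanLP 2.x docs and is an assumption of this sketch.
HanLP = hanlp.load(hanlp.pretrained.mtl.CLOSE_TOK_POS_NER_SRL_DEP_SDP_CON_ELECTRA_SMALL_ZH)

def root_verb_object(instruction: str):
    """Return (root verb, its direct noun object), or None if absent."""
    doc = HanLP([instruction], tasks='dep')
    tokens, deps = doc['tok/fine'][0], doc['dep'][0]  # deps: (head, rel); head 0 = root
    root = next((i for i, (head, _) in enumerate(deps) if head == 0), None)
    if root is None:
        return None
    # Direct object governed by the root verb; 'dobj' is the Stanford-style
    # label assumed here and may differ across HanLP dependency schemes.
    obj = next((tokens[i] for i, (head, rel) in enumerate(deps)
                if head == root + 1 and rel == 'dobj'), None)
    return (tokens[root], obj) if obj else None

# Count verb-noun pairs over the dataset; `instructions` stands in for the
# list of CQIA instruction strings. Only pairs with >30 instances are plotted.
counts = Counter(p for inst in instructions if (p := root_verb_object(inst)) is not None)
plotted = {pair: n for pair, n in counts.items() if n > 30}
```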
5 Experimental Setup

In this section, we describe how we use COIG-CQIA to fine-tune models and elaborate on our evaluation methods.

5.1 Evaluation

C-Eval is a comprehensive Chinese evaluation suite for foundation models. It consists of 13,948 multiple-choice questions spanning 52 diverse disciplines and four difficulty levels. We choose the answer option with the highest log-likelihood as the model's final prediction.
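This scoring rule has a standard implementation for causal LMs: run the question prompt through the model once and compare the log-probabilities assigned to each option letter as the next token. A minimal sketch with Hugging Face transformers; the checkpoint name is a placeholder, and only the first token of each option letter is scored:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any of the causal LMs evaluated here would do.
tok = AutoTokenizer.from_pretrained("01-ai/Yi-6B")
model = AutoModelForCausalLM.from_pretrained("01-ai/Yi-6B").eval()

@torch.no_grad()
def predict_option(prompt: str, options=("A", "B", "C", "D")) -> str:
    """Return the option letter with the highest log-likelihood after the prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    logprobs = torch.log_softmax(model(ids).logits[0, -1], dim=-1)  # next-token dist.
    scores = [logprobs[tok.encode(o, add_special_tokens=False)[0]].item()
              for o in options]
    return options[scores.index(max(scores))]
```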
CMMLU is a comprehensive evaluation benchmark specifically designed to assess the knowledge and reasoning abilities of LLMs within the context of the Chinese language and culture. CMMLU covers a wide range of subjects, comprising 67 topics that span from elementary to advanced professional levels. It includes subjects that require computational expertise, such as physics and mathematics, as well as disciplines within the humanities and social sciences.

BELLE-EVAL is an open-ended test set comprising 12 different instruction types across various domains, including open question answering, brainstorming, mathematics, coding, etc. It can be used to assess a model's ability to follow instructions.
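The BELLE-EVAL scores in Tables 2 and 3 come from GPT-4 acting as the judge. The exact judging prompt is not reproduced here, so the following is only a sketch of the usual pattern, with a hypothetical rubric, using the OpenAI chat API:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical rubric; the actual BELLE-EVAL judging prompt may differ.
JUDGE_TEMPLATE = (
    "You are grading a Chinese instruction-following response.\n"
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    "Rate the response from 0 to 100 and reply with the number only."
)

def judge(instruction: str, response: str) -> float:
    """Ask GPT-4 to score one (instruction, response) pair."""
    reply = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            instruction=instruction, response=response)}],
    )
    return float(reply.choices[0].message.content.strip())
```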
Source Open QA Brainstorming Classification Generation Summarization Rewrite Closed QA Extract Math Code Average
SegmentFault 26.3 33.6 8.9 23.5 29.9 20.1 25.0 22.6 14.5 32.1 23.7
COIG PC 31.4 47.7 28.7 44.8 43.4 53.3 45.5 28.4 35.6 23.2 38.2
Douban 45.2 64.5 23.3 64.4 38.6 37.1 34.2 25.4 32.9 42.1 40.8
Zhihu 48.1 72.3 19.0 66.9 24.3 29.5 28.9 12.3 8.5 40.0 35.0
Logi QA 45.0 65.4 32.9 55.3 37.1 49.5 47.1 38.9 17.9 40.0 42.9
Ruozhiba 64.8 84.6 50.3 73.1 45.0 39.9 57.0 30.3 27.3 63.6 53.6
Wiki 51.0 67.6 21.5 60.4 30.8 31.5 30.2 21.4 22.7 34.7 37.2
Finance 43.2 65.7 30.0 57.3 36.4 30.2 34.6 31.4 15.7 27.5 37.2
Exam 51.5 74.3 42.0 70.9 54.1 60.5 56.2 47.7 41.5 49.9 54.9
Xhs 25.2 47.0 8.4 45.4 8.6 21.4 22.5 7.0 28.5 27.1 24.1
WikiHow 0.5 1.2 1.5 7.7 18.1 3.1 23.1 0.5 0.0 2.1 5.8
CQIA-Subset 59.8 86.4 48.2 79.4 60.9 69.9 50.6 42.0 37.8 55.8 64.2
Table 2: The performance of Yi-6B trained on various datasets, evaluated on BELLE-EVAL using GPT-4.
Source Open QA Brainstorming Classification Generation Summarization Rewrite Closed QA Extract Math Code Average
SegmentFault 51.3 70.7 43.8 66.8 57.1 75.7 41.4 47.1 75.5 65.0 60.7
COIG PC 19.1 38.0 27.8 43.9 37.8 63.7 40.6 25.8 43.6 19.1 37.2
Douban 57.4 81.7 63.6 76.2 54.7 60.1 47.5 50.4 73.8 50.6 63.2
Zhihu 66.5 90.4 52.6 82.6 71.2 78.2 46.4 39.6 76.0 62.2 69.1
Logi QA 51.4 76.4 64.9 75.6 60.0 71.4 61.6 52.7 47.0 45.3 62.0
Ruozhiba 75.9 92.3 76.5 92.1 77.3 70.9 67.2 68.5 72.6 65.2 76.9
Wiki 63.0 75.7 44.0 80.6 47.9 66.6 47.9 50.0 56.8 55.6 60.5
Finance 46.8 71.1 17.1 60.1 27.4 23.6 17.2 29.4 28.5 24.8 36.7
Exam 49.4 79.7 64.7 79.9 61.5 79.8 66.2 61.0 52.8 56.3 66.2
Xhs 51.3 76.1 38.5 68.0 25.8 46.0 28.4 32.1 74.6 36.3 50.3
WikiHow 54.7 75.2 32.1 68.2 45.3 55.9 40.9 55.8 41.0 44.4 52.7
CQIA-Subset 56.2 84.5 48.1 72.9 60.5 70.9 54.6 50.8 52.5 49.5 61.9
Table 3: The performance of Yi-34B trained on various datasets, evaluated on BELLE-EVAL using GPT-4.