arXiv:2308.16474v1 [cs.CL] 31 Aug 2023
2. RELATED WORK

With the significant improvement in the capabilities of processing multi-modal data by models such as GPT-4 [9], Kosmos-2 [10], Shikra [11], and HuggingGPT [6], MLLMs have become a focal point of research attention. Research related to MLLMs primarily falls into four categories [4]. Multi-modal In-Context Learning (M-ICL): M-ICL uses a small set of examples as prompt input to stimulate the model's latent capabilities and normalize its output [12, 13]. Multi-modal Chain-of-Thought (M-CoT): M-CoT obtains answers for multi-modal tasks through explicit step-by-step reasoning, presenting intermediate inference steps [7, 14]. Multi-modal Instruction Tuning (M-IT): M-IT fine-tunes a pre-trained MLLM [15, 16] on instruction-format data (instructions, multi-modal inputs, and answers). Multi-modal Tool Assistance (M-TA): M-TA employs external multi-modal tools, such as search engines and OpenAI Codex, to support computations beyond the core capabilities of LLMs during reasoning [6, 17].

However, existing approaches yield only a single result per subtask and therefore cannot guarantee that this result is optimal. We propose a strategy that uses multiple models to obtain candidate results for each subtask and then deduces the optimal subtask outcome. This approach aims to enhance the overall performance of MLLMs.

3.2. Optimal Subtask Result Acquisition

3.2.1. Models Selection

Once task planning is complete, each subtask must be matched with its corresponding model type, i.e., the most suitable type of model for executing that subtask, such as object detection, image-to-text, or image classification. To achieve this goal, this paper adopts an approach similar to HuggingGPT [6], using model descriptions as the language interface for model linking. Specifically, the descriptions of models are first obtained from the machine learning community (Hugging Face), and the model for each subtask is then selected dynamically through the context task-model assignment mechanism. The category of the selected model then gives the model type to be obtained.

After determining the model type, existing methods [7, 8] typically choose a single pre-trained model to execute a specific subtask. However, they cannot verify that this single pre-trained model is optimal, and thus cannot guarantee that the subtask result is optimal. To ensure the best subtask result, taking the machine learning community Hugging Face as an example, this paper selects multiple pre-trained models to perform the same subtask according to the rankings of several evaluation metrics such as "Most Downloads", "Most Likes", and "Trending".
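The ranking-based selection described above, including the rule of moving down a ranking when a model has already been taken from another ranking, can be sketched as follows. This is a minimal illustration, not the paper's released code; the ranking lists and model names are hypothetical placeholders.

```python
# Sketch: pick one distinct model per evaluation-metric ranking for a
# single subtask. If the top model of a ranking was already selected
# (repeated across rankings), move down that ranking instead.

def select_models(rankings, per_ranking=1):
    """Return distinct high-ranked models, `per_ranking` from each list."""
    selected = []
    for ranking in rankings:
        picked = 0
        for model in ranking:
            if model not in selected:
                selected.append(model)
                picked += 1
                if picked == per_ranking:
                    break
    return selected

# Hypothetical rankings for an image-classification subtask.
rankings = [
    ["vit-base", "resnet-50", "convnext-tiny"],  # "Most Downloads"
    ["vit-base", "swin-tiny", "resnet-50"],      # "Most Likes"
    ["swin-tiny", "beit-base", "vit-base"],      # "Trending"
]

print(select_models(rankings))  # → ['vit-base', 'swin-tiny', 'beit-base']
```

Note how "vit-base" tops both the downloads and likes rankings, so the likes ranking contributes its second entry instead, yielding three distinct high-performance models for the same subtask.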
Fig. 2. An overview of our model architecture.
ing”. In particular, when the pre-training model is repeated, amalgamating existing results to generate new results. This
the model is selected downwards according to the correspond- emphasis arises due to the potential introduction of error in-
ing evaluation metric. This approach ensures multiple high- formation during the generation process.
performance pre-trained models for the same subtask.
3.3. Response Generation
3.2.2. Models Execution
Finally, this paper utilizes LLM to amalgamate the optimal
After selecting multiple pre-trained models for the same sub- results of all subtasks to generate a response [6, 8], which
task, the subsequent step involves executing the multiple pre- integrates all the information of the first two parts into a con-
trained models. During this stage, the system automatically cise summary. The summary encompasses the list of planning
inputs the parameters of the subtask into the multiple pre- tasks and optimal results for all subtasks. Most important are
trained models and executes these models to obtain the in- the optimal results of the subtasks, as they constitute the piv-
ference results. In particular, for the same subtask, multiple otal points of the model’s final decision. These optimal results
pre-trained models are concurrently invoked. These multiple are presented in a structured format, such as categories with
pre-trained models employ identical parameters, allowing for classification probabilities in an image classification model.
the acquisition of results from multiple pre-trained models for The model allows LLM to input these structured optimal re-
the same subtask. sults and generate responses in user-friendly human language.
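The response-generation step above amounts to assembling the planned task list and the structured optimal results into a prompt for the LLM. A minimal sketch follows; the task names, result payloads, and the `query_llm` helper mentioned at the end are hypothetical placeholders, not the paper's actual prompt format.

```python
# Sketch: build the response-generation prompt from the task plan and
# the structured optimal subtask results (hypothetical format).

def build_response_prompt(user_request, planned_tasks, optimal_results):
    """Flatten structured subtask results into a summary prompt for the LLM."""
    lines = [
        f"User request: {user_request}",
        "Planned subtasks: " + ", ".join(planned_tasks),
        "Optimal subtask results:",
    ]
    for task, result in optimal_results.items():
        lines.append(f"- {task}: {result}")
    lines.append("Summarize these results as a user-friendly answer.")
    return "\n".join(lines)

prompt = build_response_prompt(
    "What animal is in the photo?",
    ["image-classification"],
    {"image-classification": {"label": "cat", "score": 0.93}},
)
print(prompt)
# The prompt would then be passed to the chosen LLM interface,
# e.g. response = query_llm(prompt), where query_llm is a placeholder.
```

Keeping the results structured (category plus probability, as in the image-classification example of Section 3.3) lets the LLM ground its final answer in the pivotal subtask outcomes rather than free-form text.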
Table 1. Experiment results of our model ESP and existing models on three tasks in the GPT-4 annotated dataset.

tasks. One dataset involves annotations automatically generated using GPT-4, encompassing 3,497 diverse user requests, including 1,450 single-task requests, 1,917 sequential-task requests, and 130 graph-task requests. The other dataset is human-annotated, where expert annotators were invited to label intricate requests, totaling 46 user requests, including 24 sequential-task requests and 22 graph-task requests.

Evaluation Metrics. The evaluation metrics applied to the single task include Accuracy (Acc), Precision (Pre), Recall (Rec), and F1. For the sequential task, the evaluation metrics consist of Edit Distance (ED) [19], Precision, Recall, and F1. For the graph task, the evaluation metrics comprise GPT-4 Score (G4S) [20], Precision, Recall, and F1. In the corresponding tables of experimental results, an upward arrow indicates that a higher metric value signifies better model performance, while a downward arrow indicates that a smaller metric value signifies better model performance.

Hyperparameters. This paper employs HuggingGPT [6] as the baseline model and evaluates our approach against different LLMs (Alpaca-7b [21], Vicuna-7b [20], and GPT-3.5). Additionally, the detailed prompts for the task planning and response generation stages are consistent with HuggingGPT.

Table 2. Experiment results of our model ESP and existing models on two tasks in the human-annotated dataset.

LLM         Model        Sequential Task     Graph Task
                         Acc ↑    ED ↓       Acc ↑    F1 ↑
Alpaca-7b   HuggingGPT   0        0.96       4.17     4.17
            ESP          2.5      0.72       6.67     6.67
Vicuna-7b   HuggingGPT   7.45     0.89       10.12    7.84
            ESP          11.11    0.67       12.82    9.82
GPT-3.5     HuggingGPT   18.18    0.76       20.83    16.45
            ESP          21.82    0.68       23.81    18.37

4.2. Main Results

The experimental results on the GPT-4 annotated dataset are presented in Table 1. It is evident from the table that our model demonstrates superior performance across the three tasks compared to HuggingGPT. Specifically, in the respective tasks, our model exhibits significantly higher accuracy, precision, recall, F1 score, and GPT-4 Score than HuggingGPT, while also achieving notably lower Edit Distance values. For instance, with the Alpaca-7b LLM, our model enhances the F1 scores of the single task, sequential task, and graph task from 4.88, 22.80, and 20.59 to 9.76, 25.06, and 22.71, respectively, representing improvements of 100.0%, 9.9%, and 10.3%. In the case of the Vicuna-7b LLM, our model elevates the F1 scores of the single task, sequential task, and graph task from 29.44, 22.89, and 28.08 to 39.05, 51.92, and 51.91, respectively, demonstrating enhancements of 18.3%, 105.5%, and 65.9%. In the case of the GPT-3.5 LLM, our model raises the F1 scores for the single task, sequential task, and graph task from 39.05, 51.92, and 51.91 to 40.02, 53.47, and 53.88, respectively, demonstrating enhancements of 2.5%, 3.0%, and 3.8%.

The results of the experiments on the human-annotated dataset are presented in Table 2. It is evident from the table that our model outperforms HuggingGPT across the two tasks. Specifically, for the Alpaca-7b, Vicuna-7b, and GPT-3.5 LLMs, our model enhances the accuracy of the sequential task from 0, 7.45, and 18.18 to 2.5, 11.11, and 21.82, respectively, while decreasing the Edit Distance from 0.96, 0.89, and 0.76 to 0.72, 0.67, and 0.68, representing reductions of 25.0%, 24.7%, and 10.5%. Simultaneously, our model improves the accuracy of the graph task from 4.17, 10.12, and 20.83 to 6.67, 12.82, and 23.81 under the three LLM conditions, resulting in improvements of 60.0%, 26.7%, and 14.3%, and increases its F1 score from 4.17, 7.84, and 16.45 to 6.67, 9.82, and 18.37, corresponding to enhancements of 60.0%, 25.3%, and 11.7%, respectively.

In summary, the comparison of these experimental results
leads to the conclusion that our model showcases remarkable enhancements across tasks on both the GPT-4 annotated dataset and the human-annotated dataset, thereby providing compelling evidence of the effectiveness of the proposed methodology. We also found that our method is effective when applied to the various major language models mentioned above, leading to significant improvements across the evaluation metrics for different tasks. Moreover, as the parameter size of the large language models decreases and their integration capacity weakens, the enhancements provided by our proposed method become even more pronounced.

5. CONCLUSION

This paper considers the utilization of multiple pre-trained models to accomplish the same subtask. By integrating the results from these multiple pre-trained models, the optimal outcome for the subtask can be obtained, thereby enhancing the performance of multi-modal large language models. Overall, the method proposed in this paper offers a novel possibility for developing complex artificial intelligence systems. It can provide more comprehensive solutions and yield superior results, thereby truly enhancing the performance of multi-modal large language models in their application.

6. REFERENCES

[6] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang, “HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face,” arXiv preprint arXiv:2303.17580, 2023.

[7] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan, “Visual ChatGPT: Talking, drawing and editing with visual foundation models,” arXiv preprint arXiv:2303.04671, 2023.

[8] Tanmay Gupta and Aniruddha Kembhavi, “Visual programming: Compositional visual reasoning without training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14953–14962.

[9] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz, “Capabilities of GPT-4 on medical challenge problems,” arXiv preprint arXiv:2303.13375, 2023.

[10] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei, “Kosmos-2: Grounding multimodal large language models to the world,” arXiv preprint arXiv:2306.14824, 2023.