Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models


1 Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University
2 Institute of Artificial Intelligence, Xiamen University

LaVIN: large vision-language instructed model

In this work, we propose a novel and affordable solution for vision-language instruction tuning, namely Mixture-of-Modality Adaptation (MMA). In particular, MMA is an end-to-end optimization regime that connects the image encoder and the LLM via lightweight adapters. We also propose a novel routing algorithm within MMA, which helps the model automatically shift its reasoning path between single- and multi-modal instructions. Based on MMA, we develop a large vision-language instructed model called LaVIN, which demonstrates superior training efficiency and better reasoning ability than existing multimodal LLMs across a variety of instruction-following tasks.
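The description above is high level, so here is a minimal PyTorch-style sketch of what a mixture-of-modality adapter with a soft router could look like. This is not the released LaVIN implementation; the class and parameter names (Adapter, MMAdapter, adapter_dim, temperature) and the mean-pooled routing signal are illustrative assumptions.

# Minimal sketch (not the authors' code): a mixture-of-modality adapter that
# softly routes hidden states between two lightweight adaptation paths,
# one intended for text-only instructions and one for image+text instructions.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Lightweight bottleneck adapter: down-project, nonlinearity, up-project, residual."""
    def __init__(self, dim: int, adapter_dim: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, adapter_dim)
        self.up = nn.Linear(adapter_dim, dim)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class MMAdapter(nn.Module):
    """Mixture of two adapters with a learned router over modalities (illustrative)."""
    def __init__(self, dim: int, adapter_dim: int = 8, temperature: float = 10.0):
        super().__init__()
        self.text_adapter = Adapter(dim, adapter_dim)  # path for single-modal instructions
        self.mm_adapter = Adapter(dim, adapter_dim)    # path for multi-modal instructions
        self.router = nn.Linear(dim, 2)                # predicts soft routing weights
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) hidden states from a frozen LLM layer.
        # Route from the mean token representation; soft weights keep it differentiable.
        weights = torch.softmax(self.router(x.mean(dim=1)) / self.temperature, dim=-1)
        out_text = self.text_adapter(x)
        out_mm = self.mm_adapter(x)
        return weights[:, 0, None, None] * out_text + weights[:, 1, None, None] * out_mm


if __name__ == "__main__":
    layer = MMAdapter(dim=4096)
    hidden = torch.randn(2, 16, 4096)  # dummy hidden states
    print(layer(hidden).shape)         # torch.Size([2, 16, 4096])

Consistent with the lightweight-adapter idea described above, only the adapter and router parameters would be trainable in such a setup, while the image encoder and LLM remain frozen; this is what keeps the tuning cheap.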

Performance

Science QA

Performance comparison on the ScienceQA test set. Question classes: NAT = natural science, SOC = social science, LAN = language science, TXT = text context, IMG = image context, NO = no context, G1-6 = grades 1-6, G7-12 = grades 7-12. † denotes that LaVIN is trained with doubled epochs. #Params denotes the number of trainable parameters.

Examples

Examples on different instruction-following tasks

Comparison of LaVIN-13B and existing methods on single- and multi-modal instructions. The noteworthy aspects of the responses are highlighted in green, whereas the illogical portions are marked in red.

Examples of multimodal dialogue

Comparison of LaVIN-13B and existing multimodal LLMs in multi-turn conversations. GPT-4 assigns a score ranging from 1 to 10 to evaluate the quality of a response, with a higher score indicating superior performance. The noteworthy aspects of the responses are highlighted in green, whereas the illogical portions are marked in red.

BibTeX


        @article{luo2023towards,
          title={Towards Efficient Visual Adaption via Structural Re-parameterization},
          author={Luo, Gen and Huang, Minglang and Zhou, Yiyi and Sun, Xiaoshuai and Jiang, Guangnan and Wang, Zhiyu and Ji, Rongrong},
          journal={arXiv preprint arXiv:2302.08106},
          year={2023}
        }

        @article{luo2023cheap,
          title={Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models},
          author={Luo, Gen and Zhou, Yiyi and Ren, Tianhe and Chen, Shengxin and Sun, Xiaoshuai and Ji, Rongrong},
          journal={arXiv preprint arXiv:2305.15023},
          year={2023}
        }

Acknowledgement

This repo borrows some data and code from LLaMA, Stanford Alpaca, LLaVA and LLaMA-Adapter. Thanks for their great work.