Leaderboard

We present the voting results with LLaVA-v1.5-13B as the anchor. Each entry denotes the win/tie/lose counts of the benchmarked model against LLaVA-v1.5-13B. See our paper for results under other evaluation protocols and anchors. Information on the benchmarked models is provided here.

| Rank | Model | Perception | Understanding | Applying | Analyzing | Evaluation | Creation | Win Rate over LLaVA-v1.5-13B |
|------|-------|------------|---------------|----------|-----------|------------|----------|------------------------------|
| 🏅️ | Claude-3 | 56/13/1 | 98/9/3 | 45/11/4 | 83/14/3 | 33/5/2 | 33/6/1 | 0.83 |
| 🥈 | GPT-4V | 56/10/4 | 101/6/3 | 29/12/19 | 73/22/5 | 33/2/5 | 2/0/38 | 0.70 |
| 🥉 | LLaVA-v1.6-34B | 46/17/7 | 78/22/10 | 36/15/9 | 61/28/11 | 33/3/4 | 24/10/6 | 0.66 |
| 4 | LLaVA-v1.6-Vicuna-13B | 40/21/9 | 65/33/12 | 35/19/6 | 51/26/23 | 33/5/2 | 27/9/4 | 0.60 |
| 5 | LLaVA-v1.6-Vicuna-7B | 31/25/14 | 56/37/17 | 26/23/11 | 40/31/29 | 22/10/8 | 19/10/11 | 0.46 |
| 6 | ALLaVA-3B-Longer | 22/21/27 | 57/30/23 | 23/17/20 | 44/30/26 | 16/10/14 | 17/12/11 | 0.43 |
| 7 | Gemini-1.0-Pro | 45/10/15 | 36/35/39 | 24/19/17 | 33/28/39 | 9/8/23 | 16/8/16 | 0.39 |
| 8 | Qwen-VL-Chat | 34/22/14 | 38/36/36 | 26/18/16 | 35/29/36 | 15/6/19 | 9/12/19 | 0.37 |
| 9 | LVIS | 22/28/20 | 32/39/39 | 11/27/22 | 33/36/31 | 14/9/17 | 9/16/15 | 0.29 |
| 10 | mPLUG-Owl2 | 16/24/30 | 30/34/46 | 17/17/26 | 23/38/39 | 15/8/17 | 11/14/15 | 0.27 |
| 11 | LLaVA-v1.5-7B | 19/22/29 | 27/47/36 | 13/29/18 | 21/43/36 | 9/14/17 | 8/13/19 | 0.23 |
| 12 | MiniGPT-v2 | 12/25/33 | 24/32/54 | 11/25/24 | 17/38/45 | 9/9/22 | 6/6/28 | 0.19 |
| 13 | InstructBLIP | 15/16/39 | 13/36/61 | 6/23/31 | 13/29/58 | 10/7/23 | 4/9/27 | 0.15 |
| 14 | Cheetor | 12/20/38 | 7/27/76 | 10/22/28 | 16/23/61 | 4/4/32 | 3/4/33 | 0.12 |
| 15 | SEED-LLaMA | 16/15/39 | 5/25/80 | 10/21/29 | 7/25/68 | 3/7/30 | 3/3/34 | 0.10 |
| 16 | kosmos2 | 6/22/42 | 6/18/86 | 6/15/39 | 10/20/70 | 1/4/35 | 2/3/35 | 0.07 |
| 17 | Yi-VL-6B | 4/17/49 | 8/22/80 | 5/27/28 | 5/29/66 | 3/9/28 | 3/9/28 | 0.07 |
| 18 | Fuyu-8B | 7/19/44 | 7/27/76 | 6/14/40 | 4/22/74 | 3/7/30 | 0/6/34 | 0.06 |
| 19 | LWM | 2/18/50 | 5/15/90 | 4/21/35 | 2/18/80 | 3/2/35 | 2/6/32 | 0.04 |
| 20 | OpenFlamingo | 8/13/49 | 2/8/100 | 3/14/43 | 2/21/77 | 1/2/37 | 1/5/34 | 0.04 |
| 21 | BLIP2 | 3/13/54 | 2/15/93 | 6/8/46 | 0/22/78 | 0/1/39 | 0/2/38 | 0.03 |
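
As a reading aid, the sketch below shows how the last column can be recomputed from the per-dimension win/tie/lose counts, assuming the win rate is the total number of wins divided by the total number of comparisons (wins + ties + losses); this assumption matches the Claude-3 and GPT-4V rows above. The helper name `win_rate` is illustrative, not part of the released codebase.

```python
# Recompute the win rate over LLaVA-v1.5-13B from per-dimension W/T/L counts.
# Assumption: win rate = total wins / total comparisons (ties count as non-wins).

def win_rate(wtl_per_dimension):
    """wtl_per_dimension: list of (win, tie, lose) tuples, one per dimension."""
    wins = sum(w for w, _, _ in wtl_per_dimension)
    total = sum(w + t + l for w, t, l in wtl_per_dimension)
    return wins / total

# Rows copied from the table above (Perception ... Creation).
claude_3 = [(56, 13, 1), (98, 9, 3), (45, 11, 4), (83, 14, 3), (33, 5, 2), (33, 6, 1)]
gpt_4v   = [(56, 10, 4), (101, 6, 3), (29, 12, 19), (73, 22, 5), (33, 2, 5), (2, 0, 38)]

print(f"Claude-3: {win_rate(claude_3):.2f}")  # 0.83
print(f"GPT-4V:   {win_rate(gpt_4v):.2f}")    # 0.70
```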