PlotGen-Bench: Evaluating VLMs on Generating Visualization Code from Diverse Plots across Multiple Libraries

Yi Zhao1*, Zhen Yang1*, Shuaiqi Duan2, Wenmeng Yu2, Zhe Su2, Jibing Gong3†, Jie Tang1†
1Tsinghua University, 2Zhipu.AI, 3Yanshan University
*Equal Contribution  †Corresponding author

Abstract

Recent advances in vision–language models (VLMs) have expanded their multimodal code generation capabilities, yet their ability to generate executable visualization code from plots remains underexplored, especially in complex scenarios involving 3D plots, animations, plot-to-plot transformations, or multiple libraries. To address this gap, we introduce PlotGen-Bench, a comprehensive benchmark for evaluating plot-to-code generation under realistic and complex visualization scenarios. The benchmark spans 9 major categories, 30 subcategories, and 3 core tasks (plot replication, plot transformation, and multi-library generation), covering 2D, 3D, and animated plots across 5 widely used visualization libraries. Through a systematic evaluation of state-of-the-art open- and closed-source VLMs, we find that open-source models still lag considerably behind in visual fidelity and semantic consistency, despite achieving comparable code executability. Moreover, all models exhibit substantial degradation on reasoning-intensive tasks such as chart-type conversion and animation generation. PlotGen-Bench establishes a rigorous foundation for advancing research toward more capable and reliable VLMs for visualization authoring and code synthesis.
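
To make the plot-replication task concrete, the sketch below shows what the ground-truth side of a single benchmark item might look like: a short Matplotlib script whose rendered figure is given to the model, which must then regenerate equivalent code from the image alone. The script is purely illustrative and is not drawn from the dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ground-truth script for a plot-replication item.
# The model sees only the rendered PNG and must produce code that
# yields a visually and semantically equivalent figure.
rng = np.random.default_rng(seed=0)
groups = ["A", "B", "C"]
data = [rng.normal(loc=mu, scale=1.0, size=200) for mu in (0.0, 1.0, 2.0)]

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(data)
ax.set_xticks([1, 2, 3], labels=groups)
ax.set_xlabel("Group")
ax.set_ylabel("Value")
ax.set_title("Distribution comparison across groups")
fig.savefig("reference.png", dpi=150)
```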

Research Overview

Figure 1: Overview of the PlotGen-Bench evaluation pipeline.
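
As a rough sketch of the pipeline's execution stage: each model-generated script is run in a fresh interpreter, only scripts that exit cleanly count toward the pass rate, and the rendered output is then judged against the reference plot. The function below is illustrative; the timeout and pass criterion are assumptions, not the benchmark's exact harness.

```python
import subprocess
import sys

def executes_cleanly(script_path: str, timeout_s: int = 60) -> bool:
    """Run a model-generated visualization script in a fresh Python
    interpreter; a zero exit code within the time limit counts as a
    successful execution (the pass criterion here is an assumption)."""
    try:
        proc = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0
```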

Performance Comparison Across Different Plot Types

Table 1: Performance comparison (0–100) of vision–language models on PlotGen-Bench across different plot types. Overall scores are computed as macro-averages across all sub-categories, excluding Animation for fair comparison.
Model Dist. Compar. Trend Compos. Corr. Flow Dim. Enh. Anim. Overall
Open-source VLM
InternVL3-9B 16.0 26.8 21.7 20.8 13.3 25.0 17.7 25.3 2.5 21.8
InternVL3-78B 43.4 55.7 39.6 47.3 49.0 40.4 55.9 48.0 33.2 48.3
Qwen2.5-VL-7B-Instruct 34.3 47.2 28.0 31.7 28.6 25.4 44.4 35.6 2.6 35.2
Qwen2.5-VL-72B-Instruct 56.2 61.1 48.0 52.8 60.2 51.7 71.8 55.4 32.0 56.4
MiMo-VL-7B-SFT 22.5 29.0 24.3 25.4 28.9 23.7 28.7 31.2 7.0 27.4
MiMo-VL-7B-RL 21.2 24.0 25.5 23.4 29.2 18.1 24.8 28.9 6.4 25.7
Kimi-VL-A3B-Instruct 33.4 39.9 40.6 25.5 31.0 31.0 30.3 38.0 17.7 34.4
Kimi-VL-A3B-Thinking 18.3 34.4 17.3 16.6 19.5 10.8 19.2 21.0 5.5 20.3
GLM-4.1V-9B-Thinking 32.4 44.2 31.9 33.8 35.0 47.0 34.4 40.0 -- 26.9
GLM-4.5V 48.6 52.0 44.2 43.5 40.2 28.9 51.2 50.9 -- 47.2
Qwen3-VL-235B-A22B-Instruct 64.7 71.0 60.6 66.4 66.8 61.5 65.1 62.3 38.1 64.6
Closed-source VLM
Claude-Sonnet-4-20250514-Thinking 76.7 76.2 72.7 74.3 76.1 50.1 82.9 72.5 44.5 74.1
Claude-3.7-Sonnet-20250219-Thinking 73.7 79.9 72.8 74.4 81.0 69.5 67.6 72.8 34.5 74.0
GPT-5-2025-08-07 78.7 87.8 76.1 78.3 78.4 75.0 81.9 79.9 45.2 79.7
GPT-4o-2024-11-20 69.1 71.5 69.1 58.5 68.2 68.7 70.1 69.8 41.7 67.7
GPT-4o-Mini-2024-07-18 51.2 62.8 54.4 52.4 51.8 54.9 61.2 57.2 37.1 55.5
Gemini-2.5-Pro 73.1 76.8 70.7 76.0 75.3 59.5 76.3 67.7 -- 72.0
Gemini-2.5-Flash 57.3 60.3 44.7 55.1 61.9 54.1 56.0 59.9 -- 60.0
Doubao-1.5-Thinking-Pro-Vision-250415 48.6 55.7 52.0 38.6 48.9 45.6 51.8 50.8 28.6 48.7
Doubao-Seed-1-6-Thinking-250715 55.2 67.9 54.0 54.5 52.7 47.7 57.3 58.9 32.5 57.1
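
For reference, the Overall column reduces to a plain macro-average over per-sub-category scores with Animation dropped. A minimal sketch, assuming scores arrive as a {sub-category: score} mapping (the input format is an assumption):

```python
def overall_score(scores_by_subcategory: dict[str, float]) -> float:
    """Macro-average across sub-categories, excluding Animation,
    mirroring the Overall column of Table 1."""
    kept = [score for name, score in scores_by_subcategory.items()
            if name != "Animation"]
    return sum(kept) / len(kept)
```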

Performance Comparison Across Different Libraries

Table 2: Performance comparison of vision–language models on PlotGen-Bench across different libraries. PR denotes the pass rate (%), and GS denotes the GPT-4o score (0–100).
Model Matplotlib Seaborn Plotly Plotnine NetworkX
PR GS PR GS PR GS PR GS PR GS
Open-source VLM
InternVL3-9B 43.4 20.7 20.4 6.9 46.5 19.3 0.0 0.0 41.7 21.5
InternVL3-78B 71.7 43.1 69.4 42.7 81.4 52.0 21.1 11.9 86.1 56.5
Qwen2.5-VL-7B-Instruct 71.7 28.7 57.1 31.3 79.1 45.0 0.0 0.0 75.0 35.2
Qwen2.5-VL-72B-Instruct 84.7 53.1 77.6 55.1 79.1 53.3 15.8 13.5 83.3 58.0
MiMo-VL-7B-SFT 56.5 30.7 34.7 17.2 2.3 1.0 10.5 2.7 47.2 21.8
MiMo-VL-7B-RL 60.9 27.4 28.6 16.2 7.0 4.2 0.0 0.0 47.2 26.2
Kimi-VL-A3B-Instruct 32.6 8.8 61.2 34.0 18.6 12.4 36.8 15.8 72.2 35.1
Kimi-VL-A3B-Thinking 45.7 12.0 26.5 14.4 11.6 6.3 0.0 0.0 38.9 18.2
GLM-4.1V-9B-Thinking 58.7 34.1 46.9 33.0 51.1 32.5 0.0 0.0 61.1 43.1
GLM-4.5V 58.7 44.4 49.0 38.7 32.5 22.4 21.1 17.3 52.8 39.5
Qwen3-VL-235B-A22B-Instruct 93.5 62.0 87.8 62.8 79.1 58.1 63.2 41.6 97.2 72.0
Closed-source VLM
Claude-Sonnet-4-20250514-Thinking 97.8 74.4 93.9 75.7 83.7 59.3 36.8 29.5 100.0 81.1
Claude-3.7-Sonnet-20250219-Thinking 93.5 78.6 87.8 73.0 87.0 64.2 5.3 4.2 94.4 79.8
GPT-5-2025-08-07 93.5 82.7 91.8 78.9 88.3 76.9 36.8 27.7 86.1 72.1
GPT-4o-2024-11-20 89.1 65.6 85.7 64.4 86.0 61.5 68.4 41.3 91.7 69.5
GPT-4o-Mini-2024-07-18 82.6 52.2 75.5 54.0 79.1 49.0 42.1 28.6 91.7 52.2
Gemini-2.5-Pro 82.6 74.2 89.8 80.9 72.1 60.4 10.5 8.0 97.2 85.0
Gemini-2.5-Flash 67.3 56.2 77.5 67.1 44.2 37.2 15.7 13.0 72.2 63.4
Doubao-1.5-Thinking-Pro-Vision-250415 65.2 50.1 36.7 28.0 44.2 34.2 10.5 8.0 61.1 46.8
Doubao-Seed-1-6-Thinking-250715 80.4 65.8 53.1 40.4 34.8 30.0 26.3 20.8 66.7 54.8
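
The PR and GS columns are per-library aggregates over individual benchmark items. A sketch of that tabulation, where the record fields 'library', 'executed', and 'score' are illustrative assumptions rather than the benchmark's actual schema:

```python
from collections import defaultdict

def per_library_metrics(results: list[dict]) -> dict[str, tuple[float, float]]:
    """Aggregate per-item results into per-library PR (pass rate, %) and
    GS (mean judge score, 0-100), as reported in Table 2. The field names
    'library', 'executed', and 'score' are assumed for illustration."""
    by_library: dict[str, list[dict]] = defaultdict(list)
    for item in results:
        by_library[item["library"]].append(item)

    metrics = {}
    for library, items in by_library.items():
        pr = 100.0 * sum(it["executed"] for it in items) / len(items)
        gs = sum(it["score"] for it in items) / len(items)
        metrics[library] = (round(pr, 1), round(gs, 1))
    return metrics
```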

Data Download

The PlotGen-Bench dataset is publicly available for research purposes. It contains over 745 plot-code pairs across 9 major categories, 30 subcategories, and 5 widely used visualization libraries. The dataset is split into training, validation, and test sets to facilitate model development and evaluation.
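
Once downloaded, a split could be consumed along the following lines; the JSONL file layout, directory name, and field names here are assumptions for illustration, not the published schema:

```python
import json
from pathlib import Path

def load_split(root: str, split: str) -> list[dict]:
    """Read one split (e.g. 'train', 'val', 'test') as a list of
    plot-code pair records. The {split}.jsonl layout is assumed."""
    path = Path(root) / f"{split}.jsonl"
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]

# Example usage (paths are hypothetical):
# test_items = load_split("PlotGen-Bench", "test")
# print(len(test_items), sorted(test_items[0]))
```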


Citation

@article{zhao2025plotgenbench,
  title={PlotGen-Bench: Evaluating VLMs on Generating Visualization Code from Diverse Plots across Multiple Libraries},
  author={Yi Zhao and Zhen Yang and Shuaiqi Duan and Wenmeng Yu and Zhe Su and Jibing Gong and Jie Tang},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}