VIPERGPT: VISUAL INFERENCE VIA PYTHON EXECUTION FOR REASONING

Dídac Surís*, Sachit Menon*, Carl Vondrick
*Equal contribution
Columbia University


VIPERGPT DECOMPOSES VISUAL QUERIES INTO INTERPRETABLE STEPS.


ABSTRACT

Answering visual queries is a complex task that requires both visual processing
and reasoning. End-to-end models, the dominant approach for this task, do not
explicitly differentiate between the two, limiting interpretability and
generalization. Learning modular programs presents a promising alternative, but
has proven challenging due to the difficulty of learning both the programs and
modules simultaneously. We introduce ViperGPT, a framework that leverages
code-generation models to compose vision-and-language models into subroutines to
produce a result for any query. ViperGPT utilizes a provided API to access the
available modules, and composes them by generating Python code that is later
executed. This simple approach requires no further training, and achieves
state-of-the-art results across various complex visual tasks.
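
The generate-then-execute loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `ToyImage`, `toy_code_generator`, `answer_query`, and the `execute_command` entry point are hypothetical names, the "image" is just a list of labels, and the code generator is a canned string rather than a real model. It is not the ViperGPT API.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyImage:
    """Stand-in for an image: just a list of object labels (assumption)."""
    labels: List[str]

    def find(self, name: str) -> List[str]:
        # Hypothetical module call; a real system would run a detector here.
        return [label for label in self.labels if label == name]


def toy_code_generator(query: str) -> str:
    """Stand-in for a code-generation model: returns a canned program."""
    return (
        "def execute_command(image):\n"
        "    chairs = image.find('chair')\n"
        "    return len(chairs)\n"
    )


def answer_query(image: ToyImage, query: str,
                 generator: Callable[[str], str] = toy_code_generator
                 ) -> Tuple[object, str]:
    source = generator(query)                     # 1. generate Python source
    namespace: dict = {}
    exec(source, namespace)                       # 2. define the generated function
    result = namespace["execute_command"](image)  # 3. run it on the input
    return result, source


image = ToyImage(labels=["chair", "table", "chair", "lamp", "chair"])
answer, program = answer_query(image, "How many chairs are there?")
print(answer)   # 3
print(program)  # the generated program doubles as an interpretable trace
```

Returning the program text alongside the answer reflects the interpretability point: the steps that produced the result can be read directly. Executing generated code with `exec` is only reasonable in a sandboxed setting.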


HOW DOES IT WORK?


A BRIEF EXPLANATION OF VIPERGPT.


LOGICAL REASONING

ViperGPT can perform logical operations because it directly executes Python
code.
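
As a rough sketch of what this means in practice, a generated program can combine module outputs with ordinary Python `and` / `or`. The `exists` helper below is a hypothetical stand-in for a detection module, and the scene is assumed to be a plain set of labels.

```python
from typing import Set


def exists(scene: Set[str], name: str) -> bool:
    # Hypothetical stand-in for a visual detector: membership in a label set.
    return name in scene


scene = {"fork", "plate", "cup"}

# "Is there a fork or a spoon?" is evaluated by Python's `or`.
has_fork_or_spoon = exists(scene, "fork") or exists(scene, "spoon")

# "Are there both a fork and a knife?" is evaluated by Python's `and`.
has_fork_and_knife = exists(scene, "fork") and exists(scene, "knife")

print(has_fork_or_spoon)   # True
print(has_fork_and_knife)  # False
```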


SPATIAL UNDERSTANDING

We show ViperGPT's spatial understanding.
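
One way such spatial queries could be expressed in code is by comparing bounding-box coordinates. The sketch below is illustrative only: the `Box` class, its fields, and the `left_of` helper are assumptions, and the boxes are hard-coded instead of coming from a detector.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Hypothetical detection: a label plus horizontal pixel extents."""
    label: str
    left: float
    right: float

    @property
    def horizontal_center(self) -> float:
        return (self.left + self.right) / 2


def left_of(boxes: List[Box], anchor_label: str) -> List[str]:
    """Labels of boxes whose center lies left of the anchor, ordered left to right."""
    anchor = next(b for b in boxes if b.label == anchor_label)
    ordered = sorted(boxes, key=lambda b: b.horizontal_center)
    return [b.label for b in ordered
            if b.horizontal_center < anchor.horizontal_center]


boxes = [Box("dog", 20, 40), Box("ball", 50, 60), Box("cat", 0, 10)]
leftmost = min(boxes, key=lambda b: b.horizontal_center).label

print(leftmost)                # cat
print(left_of(boxes, "ball"))  # ['cat', 'dog']
```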


KNOWLEDGE

ViperGPT can access the knowledge of large language models.




CONSISTENCY

ViperGPT answers similar questions with consistent reasoning.


MATH

ViperGPT can count and divide, all using Python.
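
A small sketch of counting followed by division: the detections list and the `count` and `fair_share` helpers are hypothetical, with a list of labels standing in for detector output.

```python
from typing import List, Tuple


def count(detections: List[str], label: str) -> int:
    # Hypothetical stand-in for counting the patches a detector returns.
    return sum(1 for d in detections if d == label)


def fair_share(items: int, people: int) -> Tuple[int, int]:
    """Return (items per person, items left over); guards against zero people."""
    if people == 0:
        raise ValueError("no people detected, cannot divide")
    return divmod(items, people)


detections = ["slice"] * 8 + ["person"] * 3
per_person, leftover = fair_share(count(detections, "slice"),
                                  count(detections, "person"))
print(per_person, leftover)  # 2 2
```

The arithmetic is carried out by the Python interpreter, so it is exact for integers regardless of how the counts were obtained.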




ATTRIBUTES

We show some ViperGPT examples involving attributes.


RELATIONAL REASONING

ViperGPT can reason about relations between objects.


NEGATION

Negation is programmatic, not neural.
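
To illustrate the idea, negation can be applied with Python's `not` to the output of an attribute check, so the check itself never has to model "not". The object records and the `has_color` helper below are hypothetical stand-ins for module outputs.

```python
from typing import Dict, List


def has_color(obj: Dict[str, str], color: str) -> bool:
    # Hypothetical stand-in for a neural attribute classifier.
    return obj["color"] == color


objects: List[Dict[str, str]] = [
    {"label": "cup", "color": "red"},
    {"label": "cup", "color": "blue"},
    {"label": "plate", "color": "red"},
]

# "Which cups are not red?": the negation is the Python `not` operator.
not_red_cups = [o for o in objects
                if o["label"] == "cup" and not has_color(o, "red")]

print(not_red_cups)  # [{'label': 'cup', 'color': 'blue'}]
```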




TEMPORAL REASONING

ViperGPT can navigate through videos to understand them.
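
A minimal sketch of navigating a video in code: here a video is assumed to be a list of frames, each frame a set of labels, and `first_frame_with` and `labels_after` are hypothetical helpers. A real pipeline would run visual modules on each frame instead.

```python
from typing import List, Optional, Set

Frame = Set[str]  # assumption: a frame is reduced to the labels visible in it


def first_frame_with(frames: List[Frame], label: str) -> Optional[int]:
    """Index of the first frame containing `label`, or None if it never appears."""
    for index, frame in enumerate(frames):
        if label in frame:
            return index
    return None


def labels_after(frames: List[Frame], label: str) -> Set[str]:
    """All labels seen in frames strictly after `label` first appears."""
    start = first_frame_with(frames, label)
    if start is None:
        return set()
    seen: Set[str] = set()
    for frame in frames[start + 1:]:
        seen |= frame
    return seen


video = [{"boy"}, {"boy", "ball"}, {"boy", "ball", "dog"}, {"dog"}]
print(first_frame_with(video, "ball"))  # 1
print(sorted(labels_after(video, "ball")))  # ['ball', 'boy', 'dog']
```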







RELATED WORK

 * Visual Programming for Compositional Visual Reasoning
 * Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
 * Neural Module Networks
 * Inferring and Executing Programs for Visual Reasoning
 * PAL: Program-aided Language Models
 * Code4Struct: Code Generation for Few-Shot Structured Prediction from
   Natural Language
 * Program of Thoughts Prompting: Disentangling Computation from Reasoning for
   Numerical Reasoning Tasks
 * Toolformer: Language Models Can Teach Themselves to Use Tools


BIBTEX

@article{surismenon2023vipergpt,
    title={ViperGPT: Visual Inference via Python Execution for Reasoning},
    author={D\'idac Sur\'is and Sachit Menon and Carl Vondrick},
    journal={Proceedings of IEEE International Conference on Computer Vision (ICCV)},
    year={2023}
}

We thank Revant Teotia and Chengzhi Mao for helpful feedback, as well as
3Blue1Brown and the Manim community for their wonderful visualization library.

This research is based on work partially supported by the DARPA MCS program
under Federal Agreement No. N660011924032 and the NSF CAREER Award #2046910. DS
is supported by the Microsoft PhD Fellowship and SM is supported by the NSF
GRFP.
