A Multi-Stage Vision-Language Framework for Knowledge-based Visual Question Answering
Devised a two-stage VLM-based pipeline that first uses the knowledge of LMs to sample multiple candidate answers, then uses CLIP to select the most likely one.
Dec 5, 2023
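
The sample-then-select flow described in this entry can be illustrated with a minimal sketch. It is not the project's actual code: `sample_candidates` (standing in for the LM sampler), `embed_text` (standing in for a CLIP-style text encoder), the image embedding, and the "question + answer" prompt format are all hypothetical placeholders chosen for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_answer(
    image_emb: Vector,
    question: str,
    candidates: List[str],
    embed_text: Callable[[str], Vector],  # hypothetical CLIP-style text encoder
) -> Tuple[str, Dict[str, float]]:
    """Stage 2: score each candidate against the image and keep the best one."""
    unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
    if not unique:
        raise ValueError("no candidate answers to choose from")
    # Assumed prompt format: the question followed by the candidate answer.
    scores = {c: cosine(image_emb, embed_text(f"{question} {c}")) for c in unique}
    best = max(unique, key=lambda c: scores[c])
    return best, scores


def answer_question(
    image_emb: Vector,
    question: str,
    sample_candidates: Callable[[str, int], List[str]],  # hypothetical LM sampler
    embed_text: Callable[[str], Vector],
    n_samples: int = 5,
) -> Tuple[str, Dict[str, float]]:
    """Stage 1 (sample candidates with an LM) followed by stage 2 (select)."""
    candidates = sample_candidates(question, n_samples)
    return select_answer(image_emb, question, candidates, embed_text)
```

In a real setup the two callables would wrap a language model and a CLIP checkpoint; here they are left as parameters so the control flow can be exercised with simple stubs.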