giga-chad99

giga-chad99 t1_j45pa30 wrote

Regarding that Winoground paper: Isn't compositionality what DALL-E 2, Imagen, Parti, etc. are famous for? Like the avocado chair, or some very specific images like "a raccoon in a spacesuit playing poker". SOTA vision-language models are the only models that actually show convincing compositionality, or am I wrong?

4