samb-t
samb-t t1_iwgz27t wrote
Reply to comment by ButterscotchLost421 in [D] How long should it take to train a diffusion model on CIFAR-10? by ButterscotchLost421
7 secs sounds very fast, but if you're not using a massive model, it's CIFAR, and you're on an A100, it's not implausible. You might want to double check just to be sure though.
samb-t t1_iwguscp wrote
Did you get the 1.3M number from the config file (config.training.n_iters = 1300001)? If so, that's the number of training steps, not epochs! So hopefully it's more like 7 hours to train on an A100, thank god!
samb-t t1_irvsicm wrote
Reply to comment by MohamedRashad in [D] Reversing Image-to-text models to get the prompt by MohamedRashad
If you have enough resources to train an autoregressive model, then you could take advantage of the fact that these big text-to-image models are conditioned on CLIP embeddings, and instead train an autoregressive model to predict prompts conditioned on CLIP image embeddings. That way there are no non-differentiable parts to bypass, and the CLIP embedding should be a pretty good descriptor of both the input image and the prompt.
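A minimal sketch of that first idea (all names, dimensions, and the tiny GRU decoder are illustrative stand-ins; in practice you'd use a pretrained CLIP image encoder and a real tokenizer):

```python
import torch
import torch.nn as nn

class PromptDecoder(nn.Module):
    """Toy autoregressive prompt model conditioned on a CLIP image embedding.

    Hypothetical sketch: the conditioning embedding is projected to the
    decoder's initial hidden state, so the whole pipeline stays differentiable.
    """
    def __init__(self, vocab_size=1000, clip_dim=512, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_proj = nn.Linear(clip_dim, hidden)  # condition -> initial state
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens, image_emb):
        h0 = torch.tanh(self.init_proj(image_emb)).unsqueeze(0)
        x = self.embed(tokens)
        y, _ = self.gru(x, h0)
        return self.out(y)  # next-token logits

# One training step on stand-in data (random tensors in place of real
# CLIP image embeddings and tokenized prompts).
model = PromptDecoder()
image_emb = torch.randn(4, 512)           # would come from CLIP's image encoder
prompt = torch.randint(0, 1000, (4, 16))  # tokenized prompt
logits = model(prompt[:, :-1], image_emb)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 1000), prompt[:, 1:].reshape(-1)
)
loss.backward()  # fully differentiable: no discrete bottleneck to bypass
```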
If you don't have enough resources, then (just thinking out loud, there's probably a better way, but this might give some ideas) you could again use a pretrained CLIP model:

1. Embed the input image with the CLIP image encoder.
2. Using the CLIP text encoder, optimise the input text so its embedding is close to the image embedding.

The problem there is again that text is discrete, so you can't backprop through it. You could use Gumbel-softmax to approximate the discrete text values though (annealing down how continuous it is). Alternatively, you could treat the embedding distance loss as an energy function and use discrete MCMC, something like Gibbs-with-Gradients. But both of those options still probably aren't great; it's a horrible optimisation space.
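The Gumbel-softmax variant could be sketched roughly like this (everything here is a made-up stand-in: the "text encoder" is just a mean over an embedding table rather than CLIP's transformer, and all dimensions are toy-sized):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, dim, seq_len = 100, 64, 8
token_table = torch.randn(vocab, dim)  # stand-in for CLIP's token embeddings
image_emb = torch.randn(dim)           # stand-in for the CLIP image embedding

def text_encode(soft_tokens):
    # soft_tokens: (seq_len, vocab) relaxed one-hots -> weighted token
    # embeddings, pooled. A real setup would run CLIP's text transformer here.
    return (soft_tokens @ token_table).mean(0)

# Optimise per-position logits over the vocabulary instead of discrete tokens.
logits = torch.zeros(seq_len, vocab, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    tau = max(0.1, 1.0 - step / 200)  # anneal temperature towards discrete
    soft = F.gumbel_softmax(logits, tau=tau, dim=-1)
    loss = 1 - F.cosine_similarity(text_encode(soft), image_emb, dim=0)
    opt.zero_grad()
    loss.backward()
    opt.step()

tokens = logits.argmax(-1)  # read off a discrete "prompt" at the end
```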
samb-t t1_j50gpn4 wrote
Reply to [D] Question about using diffusion to denoise images by CurrentlyJoblessFML
I think what you're looking for is Palette, which does paired image-to-image translation with conditional diffusion models. I believe that approach is exactly what you're describing: concatenating along the channel dimension.
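The conditioning itself is about one line; here's a minimal sketch (a single conv stands in for the actual U-Net denoiser, and all shapes are illustrative):

```python
import torch
import torch.nn as nn

# Palette-style conditioning sketch: the clean conditioning image is
# concatenated with the noisy target along the channel dimension, so the
# denoiser simply takes a 6-channel input instead of 3.
denoiser = nn.Conv2d(6, 3, kernel_size=3, padding=1)  # stand-in for a U-Net

noisy_target = torch.randn(2, 3, 32, 32)  # x_t, the partially-noised target
condition = torch.randn(2, 3, 32, 32)     # the paired source image (no noise)
x = torch.cat([noisy_target, condition], dim=1)  # -> (2, 6, 32, 32)
pred_noise = denoiser(x)  # network predicts the noise added to the target
```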