Jean-Porte

Jean-Porte t1_ja9ejvo wrote

You can increase the relevant timeout parameter; it helps.

But I agree, I don't even understand why they don't fall back to logging locally on failure instead of KILLING A ONE-WEEK JOB ON A HIGH-END GPU SERVER (MORE THAN $100 WORTH OF COMPUTE TIME)
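
The fallback being asked for is easy to sketch. Here is a minimal, hypothetical version (all names are invented; `remote_log` stands in for whatever logging client is timing out): retry the remote call a few times, and if it still fails, append the metrics to a local file instead of raising and taking the training job down.

```python
import json
import time

def safe_log(metrics, remote_log, local_path="metrics_fallback.jsonl",
             retries=3, backoff=2.0):
    """Try the remote logger a few times; if it keeps failing, append
    the metrics to a local JSONL file instead of raising and killing
    the training loop that called us."""
    for attempt in range(retries):
        try:
            remote_log(metrics)
            return True  # remote logging succeeded
        except Exception:
            time.sleep(backoff * (attempt + 1))  # simple linear backoff
    # Remote logging is down: keep the data locally, let training continue.
    with open(local_path, "a") as f:
        f.write(json.dumps(metrics) + "\n")
    return False
```

The local JSONL file can be re-uploaded once the logging service is reachable again, so nothing is lost and the GPU keeps working.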

10

Jean-Porte t1_j6wvy2p wrote

The traditional language modeling loss (negative log-likelihood) is misaligned with human expectations. One negation radically changes the meaning of a sentence, but it doesn't radically change the log-likelihood. Under the loss, it isn't weighted any more heavily than a "the" or a superfluous word.

With RLHF, important words have a correspondingly large impact, and the training signal is directly aligned with human preferences.
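
The negation point can be made concrete with a toy calculation (the per-token probabilities below are made up): mean NLL gives every token the same 1/N weight, so two continuations with identical per-token probabilities get identical loss even when one is the negation of the other.

```python
import math

def mean_nll(token_probs):
    """Average negative log-likelihood over the tokens of a sentence.
    Every token gets the same 1/N weight, regardless of how much it
    matters for the meaning."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for two continuations:
#   "the movie was so good"   vs   "the movie was not good"
# If the model happens to assign the same probability to "so" and
# "not" in context, the loss cannot tell the sentences apart.
so_good  = [0.9, 0.8, 0.9, 0.4, 0.6]
not_good = [0.9, 0.8, 0.9, 0.4, 0.6]

print(mean_nll(so_good) == mean_nll(not_good))  # True: same loss, opposite meaning
```

An RLHF reward model, by contrast, scores the whole sentence, so flipping "so" to "not" can swing the reward as much as it swings a human judgment.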

22

Jean-Porte t1_j6hif9e wrote

T5 is fine-tuned on supervised classification, i.e. trained to output the labels directly. That's why it outperforms GPT-3.

Generative models are not as good as discriminative models at discriminative tasks. A carefully tuned DeBERTa is probably better than ChatGPT. But ChatGPT has a user-friendly text interface. And GLUE-style evaluation is not charitable to ChatGPT's capabilities: the model might internally store the answer but produce output misaligned with the benchmark's expected format.

I always wonder why we don't try to scale up discriminative models. DeBERTa-xxlarge is "only" 1.3B parameters, and it outperforms T5 13B.
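
A toy illustration of the discriminative-vs-generative gap (all numbers are invented): a classification head normalizes over just the task labels, while a generative model's label words have to compete with the entire vocabulary, so the model's "internal answer" can look weak on a benchmark unless you renormalize over the label tokens.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Discriminative head: logits over just the two task labels.
disc = softmax([2.0, 0.5])

# Generative model with the same preference between the two label
# words, but they now compete with 50 other (toy) vocabulary tokens.
gen = softmax([2.0, 0.5] + [1.5] * 50)
label_mass = gen[0] + gen[1]  # total probability on the label words

# Renormalizing over the label words alone recovers the same decision,
# which is one way a raw benchmark score can undersell the model.
renorm = gen[0] / label_mass
```

Here `disc[0]` is a confident prediction, while `label_mass` is tiny even though the model "knows" the answer; only the renormalized probability matches the discriminative head.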

17