Bot-69912020

Bot-69912020 t1_j24hkd7 wrote

It might be more transparent to split your approach into two steps. First, we try to get a valid probability vector for each prediction (i.e., the vector sums to 1). Second, we try to recalibrate the probabilities in each vector to improve the correctness of the predicted probabilities.

For the first point, it is important to know the range of your invalid outputs: If they are negative as well as positive, you might want to transform your whole output via the softmax function. If you only have positive values v1, ..., vm, but their sum is not 1, then it is sufficient to compute vi / (v1+...+vm) to get valid probabilities.
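As a rough sketch of this first step (the helper name `to_probabilities` is mine, and checking for negative values is just one heuristic for choosing between the two cases):

```python
import numpy as np

def to_probabilities(v, use_softmax=None):
    """Map a raw score vector to a valid probability vector.

    If scores can be negative, use softmax; if all scores are
    positive, dividing by their sum is enough.
    """
    v = np.asarray(v, dtype=float)
    if use_softmax is None:
        use_softmax = bool(np.any(v < 0))
    if use_softmax:
        z = v - v.max()          # subtract max for numerical stability
        e = np.exp(z)
        return e / e.sum()
    return v / v.sum()

print(to_probabilities([2.0, 1.0, 1.0]))   # sum-normalize: [0.5 0.25 0.25]
print(to_probabilities([2.0, -1.0, 0.5]))  # mixed signs: softmax
```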

Now, we can try to improve the predicted probabilities via post-hoc recalibration. Several methods have been proposed for this, but the simplest baseline, which works surprisingly well in most cases, is temperature scaling. Start with that and try to make it work - it almost always gives at least minor improvements in ECE and NLL (don't use ECE alone, it is unreliable; see Fig. 2). Once TS works, you can still try out ensemble temperature scaling, parametrized temperature scaling, intra-order-preserving scaling, splines, ...
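A minimal temperature-scaling sketch, assuming you have validation logits and labels as numpy arrays (the grid search is just the simplest way to fit the single parameter T; in practice you would use a proper optimizer like L-BFGS, and the toy data here is mine):

```python
import numpy as np

def nll(logits, labels, T):
    # softmax cross-entropy of temperature-scaled logits
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.linspace(0.05, 10.0, 400)):
    # one scalar parameter: pick the T minimizing validation NLL
    return min(grid, key=lambda T: nll(logits, labels, T))

# toy overconfident model: correct class tends to win, but the
# logits are inflated, so predicted probabilities are too extreme
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)
logits = rng.normal(0.0, 1.0, size=(500, 3))
logits[np.arange(500), labels] += 2.0
logits *= 5.0  # inflate -> overconfidence, so the fitted T ends up > 1
T = fit_temperature(logits, labels)
print("fitted T:", T)
```

Note that a single T > 1 softens all predictions uniformly and never changes the argmax, so accuracy is untouched.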

Some of these methods (including temperature scaling) take logits as inputs and output logits again. So, to obtain logits, apply the multivariate logit function if you already have probabilities, or simply use your untransformed outputs as logits if you would have used softmax in the first step.
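A sketch of the conversion in both directions (since softmax is invariant to additive shifts per row, log-probabilities are a valid choice of logits):

```python
import numpy as np

def probs_to_logits(p, eps=1e-12):
    # multivariate logit: log-probabilities are valid logits
    # (logits are only defined up to an additive constant per row)
    return np.log(np.clip(p, eps, 1.0))

def logits_to_probs(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p = np.array([[0.7, 0.2, 0.1]])
print(logits_to_probs(probs_to_logits(p)))  # round trip recovers p
```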

2

Bot-69912020 t1_iyf8gr7 wrote

When my prof tried to get the conference badge, he accidentally queued at WorkBoat and only realized it when they rejected him lol

Would have been really funny if they had gone through with it and he ended up walking through a boating conference, having no clue what was going on.

2

Bot-69912020 t1_ivxbxml wrote

I don't know about each specific implementation, but via the definition of subgradients you can get 'derivatives' of convex but non-differentiable functions (which ReLU is).

More formally: A subgradient of a convex function f at a point x is any x' such that f(y) ≥ f(x) + ⟨x', y − x⟩ for all y. The set of all possible subgradients at a point x is called the subdifferential of f at x.
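A quick numerical check of this inequality for ReLU at its kink x = 0, where the subdifferential is the whole interval [0, 1] (any g in it works as a "derivative"; this toy check is mine):

```python
# numeric check of the subgradient inequality f(y) >= f(x) + g*(y - x)
# for f = ReLU at x = 0, where any g in [0, 1] is a subgradient
import numpy as np

relu = lambda t: max(t, 0.0)
x = 0.0
ys = np.linspace(-5, 5, 101)
for g in [0.0, 0.3, 1.0]:   # all valid subgradients at 0
    assert all(relu(y) >= relu(x) + g * (y - x) for y in ys)
for g in [-0.1, 1.1]:       # outside the subdifferential [0, 1]
    assert not all(relu(y) >= relu(x) + g * (y - x) for y in ys)
print("subgradient inequality verified")
```

Autodiff frameworks just commit to one element of the subdifferential at 0 (typically g = 0).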

For more details, see here.

17

Bot-69912020 t1_itfqpz6 wrote

Illustration of why it happens in low dimensions: https://twitter.com/adad8m/status/1582231644223987712

I think the main problem is that all textbooks introduce the bias-variance tradeoff as something close to a theoretical law, while in reality it is just an empirical observation, and we simply hadn't bothered to check this observation across more settings... until now
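For reference, the decomposition the "tradeoff" is stated about can be estimated empirically by Monte Carlo (a toy sketch of mine with a cubic fit to a noisy sine; all constants are arbitrary):

```python
# Monte-Carlo estimate of the pointwise decomposition
#   E[(y - f_hat(x))^2] = bias^2 + variance + noise
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # ground truth
sigma = 0.3                           # noise standard deviation
x_test = 0.35                         # point where we decompose the error
degree, n_train, n_rep = 3, 30, 2000

preds = np.empty(n_rep)
for r in range(n_rep):
    x = rng.uniform(0, 1, n_train)
    y = f(x) + rng.normal(0, sigma, n_train)
    coef = np.polyfit(x, y, degree)   # refit on a fresh training set
    preds[r] = np.polyval(coef, x_test)

bias2 = (preds.mean() - f(x_test)) ** 2
var = preds.var()
mse = ((preds - (f(x_test) + rng.normal(0, sigma, n_rep))) ** 2).mean()
print(bias2, var, mse)  # mse ~ bias2 + var + sigma^2
```

The decomposition itself is exact; what's empirical is the claim that bias² and variance must trade off monotonically as model capacity grows.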

6