Comments (47)
- qurren:
One thing extremely worth noting that the article does not: the reason "temperature" is called that is because softmax is mathematically identical to the Boltzmann distribution [1] from thermodynamics, which describes the probability distribution of energy states of an ensemble of particles in equilibrium. In terminology more familiar to ML folks, the particles' energies are distributed as the softmax of their negative energies divided by their temperatures (in Kelvin), with units scaled by the Boltzmann constant (k_B).

Setting an LLM's temperature to zero is mathematically the same thing as cooling an ensemble of particles to absolute zero: in physics, the particles are all forced into their lowest energy state; in LLMs, the model is forced to deterministically predict the single most likely logit/token.

Now to draw another analogy for what happens at high temperatures: the reason a heating element glows red when it is hot is that the expectation value (mean) of energy under this softmax distribution goes up with temperature, and when the energy gets high enough, the particles start shaking off energy in the form of photons that are now energetic enough to be in the visible spectrum. Incandescent bulbs with tungsten filaments are even hotter than that heating element, and glow white because at even higher temperature T, the softmax distribution's mean energy moves higher and the distribution flattens out, roughly covering the whole visible spectrum somewhat more uniformly. In the case of the bulb, photons of all sorts of wavelengths are being spewed out, and that is white light. Likewise, if you set an LLM's temperature to an absurdly high number, it spews out a very wide spectrum of mostly nonsense tokens.

[1] https://en.wikipedia.org/wiki/Boltzmann_distribution
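For readers who want to see the temperature knob concretely, here is a minimal sketch of temperature-scaled softmax (the function name and example logits are illustrative, not from the article or the comment):

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Temperature-scaled softmax: divide logits by T before exponentiating."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 1.0))    # moderate spread
print(softmax_with_temperature(logits, 0.01))   # nearly one-hot: the T -> 0 "absolute zero" limit
print(softmax_with_temperature(logits, 100.0))  # nearly uniform: the high-T "white light" regime
```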
- antirez:
> The relative differences between values get exaggerated, which means the largest logit value dominates the output, while smaller values are squashed. This is exactly what we want for confident predictions, but it also explains why softmax can be problematic when you want uncertainty estimates

Actually I believe that most of the time, even after softmax, sampling is way too permissive, sometimes accepting low-quality candidates. We all have the experience of seeing frontier LLMs occasionally put a word in a different language that is really off-putting and almost impossible to explain, or make other odd errors in just a single word of the output: most of the time this is not what the model wanted to say, but rather sampling casually selecting a low-quality token. I believe a better approach is to have a strong filter on which candidates are acceptable, like in the example here: https://antirez.com/news/142
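A minimal sketch of one such candidate filter (an illustration of the general idea, not necessarily the exact scheme described at the link): keep only tokens whose probability is within some factor of the top candidate, then renormalize and sample.

```python
import numpy as np

def filtered_sample(logits, ratio=0.3, rng=np.random.default_rng()):
    """Sample only from candidates whose probability is at least
    `ratio` times the probability of the top candidate (illustrative filter)."""
    z = np.asarray(logits, dtype=float)
    p = np.exp(z - z.max())
    p /= p.sum()
    keep = p >= ratio * p.max()      # strong filter on acceptable candidates
    p = np.where(keep, p, 0.0)
    p /= p.sum()                     # renormalize over the survivors
    return rng.choice(len(p), p=p)

print(filtered_sample([5.0, 4.8, 1.0, -2.0]))  # only the first two tokens can ever be drawn
```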
- ComplexSystems:
Good article, but:

> We take the exponential of each input and normalize by the sum of all exponentials. This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1; technically this is a pseudo-probability distribution (they're not derived from a probability space), but it's close enough to a probability distribution and for practical purposes they work just fine.

Why is this a "pseudo-probability distribution"?
- hibijibies:
Nice article and explanations!

On a tangential note, I keep noticing "why x matters" and "it's crucial here", which just remind me of Claude. Recently Claude has been gaslighting me on complex problems with such statements, and seeing them in an article is low-key infuriating at this point. I can't trust Claude anymore on the most complex problems, where it sometimes gets the answer right but completely misses the point and introduces huge, complex blocks of code and logic with precisely "why it matters" and "this is crucial here".
- khelavastr:
This was solved by MS Research in 2018: https://www.microsoft.com/en-us/research/blog/microsoft-rese...
- Glyptodon:
What happens if you use an integer like 2 or 3 instead of e in the softmax equation? Is e what makes it so they end up summing to 1? (I have not done real math in yearssss.)
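For what it's worth, a quick numerical check (not from the thread) shows that the normalization step, not the base e, is what makes the outputs sum to 1; using base 2 is equivalent to ordinary softmax with the logits scaled by ln 2, i.e. a temperature of 1/ln 2:

```python
import numpy as np

def softmax_base(logits, base=np.e):
    """Softmax-like map with an arbitrary base b > 1: b^x / sum(b^x)."""
    z = np.asarray(logits, dtype=float)
    e = np.power(base, z - z.max())   # shift for numerical stability
    return e / e.sum()

x = np.array([2.0, 1.0, 0.1])
p2 = softmax_base(x, base=2.0)
print(p2, p2.sum())                    # still sums to 1
print(softmax_base(x * np.log(2.0)))   # identical result: base 2 == softmax at temperature 1/ln 2
```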
- bjourne:
So softmax is an e^x projection followed by an L1 norm. Why is the e^x projection useful?
- dkislyuk:
Something that really helped me grasp the foundational relevance of the softmax is to justify from first principles why e^x shows up as the preferred mapping function in the numerator (1). The stated problem of mapping raw inputs/scores/logits to a probability distribution can be solved by a bunch of arbitrary functions, and the usual justification given for softmax is "it has nice derivatives", which is empirically useful but not satisfying.

The sketch of the justification is something like this. We first need a function that maps from (-inf, inf) to a unique positive value, and then we need to normalize the resulting values. Setting aside the normalizing step, we imagine an f(x) that needs to satisfy the following properties:

1. It should be strictly positive, so that we can normalize it into a (0, 1) probability.

2. It should preserve the relative ordering of the logits to allow them to be interpreted as scores. Thus f(x) should be monotonically increasing.

3. It should be continuous and differentiable everywhere, since we are interested in learning through this function via backpropagation.

4. It should have shift-invariance with respect to the input, as we don't want the model to have to learn some preferred logit-space where there is a stronger learning signal. For example, applying softmax to the values `(-1, 1, 3, 5)` yields the same result as applying it to `(9, 11, 13, 15)`. This property can also be restated as a "scale invariance of probability ratios", where the ratio between f(x) and f(x+c) for a given c is a constant. One useful interpretation of this property is that the learning domain or "gradient-learning surface" is stable, and high-magnitude initializations won't impede the learning process.

Taken at face value, these properties uniquely define e^x. The last property is actually pretty debatable, because in the context of machine learning we actually do have a "preferred logit-space", namely closer to zero, for numerical stability. But there are other ways to enforce this in a post-hoc manner (e.g. weight initialization, normalization layers, etc.).

Another property that uniquely justifies e^x, and thus softmax, is IIA (independence of irrelevant alternatives), which states that the odds for two classes, p_i / p_j, depend only on the logits/inputs for i and j; an irrelevant class k has no impact. For example, for Softmax([5, 7, 1]) and Softmax([5, 7, 10]), the resulting odds for the first two values (p_i / p_j) should be the same in both distributions, regardless of the third value.

Finally, if the "desired properties" approach is not satisfying, a more theoretical route for justifying the form of the softmax uses the framework of maximum entropy (E. T. Jaynes published this in 1957 to justify the Boltzmann distribution).

TL;DR: softmax is not the only way to map unnormalized values to a probability distribution, but it can be justified through axiomatic properties.

(1) One could say that the exponential shows up because of the Boltzmann distribution, but then the same question applies.
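A quick numerical check of the shift-invariance property (4) and the IIA claim, using the example values from the comment (the helper function itself is just an illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shifting by the max is harmless thanks to shift invariance
    return e / e.sum()

# Property 4: adding a constant to every logit changes nothing.
a = softmax(np.array([-1.0, 1.0, 3.0, 5.0]))
b = softmax(np.array([9.0, 11.0, 13.0, 15.0]))
print(np.allclose(a, b))          # True

# IIA: the odds p_i / p_j for the first two classes ignore the third logit.
p = softmax(np.array([5.0, 7.0, 1.0]))
q = softmax(np.array([5.0, 7.0, 10.0]))
print(p[0] / p[1], q[0] / q[1])   # both equal exp(5 - 7) ~= 0.1353
```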
- xchip"This transforms a vector of arbitrary real numbers into values between 0 and 1 that sum to 1"Not really, softmax transforms logits (logariths of probabilities) into probabilities.Probabilities → logits → back again.Start with p = [0.6, 0.3, 0.1]. Logits = log(p) = [-0.51, -1.20, -2.30]. Softmax(logits) = original p.NN prefer to output logits because they are linear and go from -inf to +inf.