## Introduction

I originally wanted to write down the proof for the Gumbel-max trick but soon realized it is actually the same idea as a much more common problem: *exponential race*. So, in this note let's go from this common problem and arrive at the Gumbel-max trick.

## Competing Alarms

As a preparation let's solve a probability problem first:

There are $N$ clocks started simultaneously, and the $i$-th clock rings after a random time $T_i \sim \text{Exp}(\lambda_i)$.

- (1) Let $X$ be the random time at which the first ring occurs (i.e. when any one of the clocks rings); find the distribution of $X$
- (2) Find the probability that the $i$-th clock rings first

Let $X = \min\{T_1, T_2, \dots, T_N\}$, and let $F_X(t)$ and $F_{T_i}(t)$ be the corresponding CDFs. We also have $F_{T_i}(t) = 1- e^{-\lambda_it}$.

Following order statistics of $\min$, we have $P(X>t) = \prod_{i=1}^N P(T_i>t)$ or equivalently,

$$ 1 - F_X(t) = \prod_{i=1}^N (1-F_{T_i}(t)) = \prod_{i=1}^N e^{-\lambda_it} = e^{-\sum_{i=1}^N \lambda_it} $$

Therefore

$$ \begin{equation} X \sim \text{Exp}(\lambda_X = \sum_{i=1}^N \lambda_i) \label{part1} \end{equation} $$

i.e. the $\min$ of a set of independent exponential random variables is still an exponential random variable, with its rate $\lambda_X$ being the sum of the rates of those variables.
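This is easy to check numerically. Here is a minimal sketch (the rates below are arbitrary example values): we simulate many races and compare the empirical mean of the minimum against the mean $1/\lambda_X$ of $\text{Exp}(\lambda_X)$.

```python
import numpy as np

# Empirical check of part (1): min of independent Exp(lambda_i) variables
# is Exp(sum of lambda_i). The rates here are arbitrary example values.
rng = np.random.default_rng(0)
rates = np.array([0.5, 1.0, 2.5])              # lambda_1, lambda_2, lambda_3
n = 200_000

# One draw per clock per trial; numpy parameterizes by scale = 1/lambda.
T = rng.exponential(scale=1.0 / rates, size=(n, len(rates)))
X = T.min(axis=1)                              # time of the first ring

# Exp(lambda_X) has mean 1/lambda_X, with lambda_X = rates.sum() = 4.0.
print(X.mean())                                # should be close to 0.25
```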

For the second part of the problem, we can consider two competing alarms $T_1$ and $T_2$ to begin with. Our goal is to find $P(T_1<T_2)$.

$$ \begin{split} P(T_1 < T_2) & = \int_0^{+\infty} \int_{t_1}^{+\infty} P(T_1=t_1) P(T_2=t_2) dt_2 dt_1 \\\\ &= \int_0^{+\infty} P(T_1=t_1) \left(1-F_{T_2}(t_1)\right) dt_1 \\\\ &= \int_0^{+\infty} \lambda_1 e^{-\lambda_1 t_1} e^{-\lambda_2 t_1} dt_1 \\\\ & = \frac{\lambda_1}{\lambda_1+\lambda_2} \end{split} $$

Now, let's consider one specific clock $T_k$ versus all the rest, denoted $T_{-k} = \min_{i \neq k} T_i$. According to $\ref{part1}$, we know that $T_{-k} \sim \text{Exp}(\sum_{i\neq k} \lambda_i)$. Using the result above, we have the solution for part (2) as follows

$$ \begin{equation} P(T_k \text{ rings first}) = P(T_k < T_{-k}) = \frac{\lambda_k}{\lambda_k+\sum_{i\neq k}\lambda_i} = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i} \label{part2} \end{equation} $$

Of course, we can also do the integration directly and get the same result

$$ \begin{split} P(T_k < T_{-k}) & = \int_0^{+\infty} P(T_k=t_k) \left( \idotsint_{t_k}^{+\infty} \prod_{i\neq k}P(T_i=t_i) dt_i \right) dt_k \\\\ & = \int_0^{+\infty} P(T_k=t_k) \left( \prod_{i\neq k} \left(1-F_{T_i}(t_k)\right) \right) dt_k \\\\ & = \int_0^{+\infty} \lambda_k \exp{\left(-\lambda_k t_k\right)} \exp{\left(-\sum_{i \neq k}\lambda_i t_k\right)} dt_k \\\\ & = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i} \end{split} $$

By the way, this setup, where multiple exponential random variables compete and we look for the first arrival, is also called an *exponential race*.
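The two-clock result is also quick to sanity-check by simulation. A small sketch, with arbitrary example rates chosen so that $\frac{\lambda_1}{\lambda_1+\lambda_2} = \frac{1}{4}$:

```python
import numpy as np

# Empirical check of P(T_1 < T_2) = l1 / (l1 + l2) for two competing alarms.
# The rates are arbitrary example values.
rng = np.random.default_rng(1)
l1, l2 = 1.0, 3.0
n = 200_000

T1 = rng.exponential(scale=1.0 / l1, size=n)
T2 = rng.exponential(scale=1.0 / l2, size=n)
print((T1 < T2).mean())                        # should be close to 1/4
```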

## Exponential-Min Trick

I just made up the name "Exponential-Min". The better name for this section is probably *Sampling from Multinomial by Argmining*.

Suppose we have a set of positive numbers $[\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]$. Correspondingly, we have a normalized probability vector $\vec{p}=[p_1, p_2, p_3, \dots, p_N]$, where $p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}$. This probability vector specifies a multinomial distribution over $N$ choices.

Now, if we were to draw a sample from $\{1, 2, \dots, N\}$ according to this multinomial distribution specified by $\vec{p}$ (which is fundamentally specified by $[\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]$), what should we do?

Normally, we do the following:

- We have $[\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]$
- We compute $\vec{p}=[p_1, p_2, p_3, \dots, p_N]$, where $p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}$.
- We generate a uniform random number $Q$ between 0 and 1, i.e. $Q \sim \text{Uniform}(0,1)$
- We figure out where $Q$ lands among the cumulative sums $c_i = \sum_{j=1}^{i} p_j$, i.e. if $c_{i-1} < Q \le c_i$ we return $i$. (Of course we take $c_0=0$; note that $c_N=1$)
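The standard routine above can be sketched in a few lines (0-indexed, with arbitrary example rates): normalize, take cumulative sums, and locate where the uniform draw lands.

```python
import numpy as np

# Standard inverse-CDF sampling from the multinomial specified by the rates.
# The rates are arbitrary example values.
rng = np.random.default_rng(2)
rates = np.array([0.5, 1.0, 2.5])
p = rates / rates.sum()                        # normalized probability vector
cum = np.cumsum(p)                             # c_1 <= c_2 <= ... <= c_N = 1

def sample_index():
    q = rng.uniform()                          # Q ~ Uniform(0, 1)
    # searchsorted gives the first index i (0-based) with q <= cum[i],
    # i.e. the interval (c_{i-1}, c_i] that Q lands in.
    return int(np.searchsorted(cum, q))

counts = np.bincount([sample_index() for _ in range(100_000)], minlength=3)
print(counts / 100_000)                        # should be close to p
```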

But that's the boring way. With the Exponential-Min trick, we can instead do the following:

- We have $[\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N]$
- We don't compute $\vec{p}$; instead we sample $T_i \sim \text{Exp}(\lambda_i)$ for $i=1, 2, \dots, N$, i.e. we have a total of $N$ samples, one from each $\text{Exp}(\lambda_i)$
- We now take $\argmin([T_1, T_2, \dots, T_N])$ as our result sample
- We proved in $\ref{part2}$ that such a sample indeed follows the multinomial distribution specified by $\vec{p}=[p_1, p_2, p_3, \dots, p_N]$, where $p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}$.

Thus, somehow we ended up *Sampling from Multinomial by Argmining*!
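The whole trick fits in a few lines. A minimal sketch, with arbitrary example rates: sample one exponential per choice, argmin, and check that the empirical frequencies match $\vec{p}$.

```python
import numpy as np

# The Exponential-Min trick: sample T_i ~ Exp(lambda_i) and return the argmin.
# The rates are arbitrary example values.
rng = np.random.default_rng(3)
rates = np.array([0.5, 1.0, 2.5])
n = 200_000

# One exponential draw per choice per trial; the argmin is our sample.
T = rng.exponential(scale=1.0 / rates, size=(n, len(rates)))
samples = T.argmin(axis=1)

empirical = np.bincount(samples, minlength=len(rates)) / n
print(empirical)                               # close to rates / rates.sum()
```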

## Gumbel Distribution

Now let's move on from the Exponential distribution to the Gumbel distribution.

The Gumbel distribution with unit scale ($\beta=1$) is parameterized by a location parameter $\mu$. $\text{Gumbel}(\mu)$ has the following CDF and PDF

$$ \text{CDF: } F(x; \mu)=e^{-e^{-(x-\mu)}} $$

$$ \text{PDF: } f(x; \mu) = e^{-\left((x-\mu)+e^{-(x-\mu)}\right)} $$

Now consider a set of $N$ independent Gumbel random variables $G_i$, each with its own location parameter $\mu_i$, i.e. $G_i \sim \text{Gumbel}(\mu_i)$.

The Gumbel distribution has two properties that are quite analogous to the *exponential race* example above.

- (1) Let $Z = \max\{G_i\}$, then $Z \sim \text{Gumbel}\left(\mu_Z = \log \sum_{i=1}^N e^{\mu_i} \right)$

The proof is straightforward and similar to above:

$$ F_Z(x) = \prod_{i=1}^N F_{G_i}(x) = \prod_{i=1}^N e^{-e^{-(x-\mu_i)}} = e^{-\sum_{i=1}^N e^{-(x-\mu_i)}} = e^{-e^{-x} \sum_{i=1}^N e^{\mu_i}} = e^{-e^{-(x-\mu_Z)}} $$

- (2) A corollary of the above is that the probability of $G_k$ being the max is $P(G_k > G_{-k}) = \frac{e^{\mu_k}}{\sum_{i=1}^N e^{\mu_i}}$, where $G_{-k} = \max_{i \neq k} G_i$
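Property (1) is easy to verify numerically. A small sketch, with arbitrary $\mu_i$: we compare the empirical mean of the max against the known mean of $\text{Gumbel}(\mu_Z)$, which is $\mu_Z + \gamma$ with $\gamma$ the Euler-Mascheroni constant.

```python
import numpy as np

# Empirical check: max of independent Gumbel(mu_i, 1) variables is
# Gumbel(log sum exp(mu_i), 1). The mu_i are arbitrary example values.
rng = np.random.default_rng(4)
mus = np.array([-1.0, 0.0, 2.0])
n = 200_000

G = rng.gumbel(loc=mus, scale=1.0, size=(n, len(mus)))
Z = G.max(axis=1)

# Gumbel(mu, 1) has mean mu + Euler-Mascheroni constant (~0.5772).
mu_Z = np.log(np.exp(mus).sum())
print(Z.mean(), mu_Z + np.euler_gamma)         # the two should nearly match
```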

## Gumbel-Max Trick

Here we can tell a nearly identical, parallel story to the one in the Exponential-Min Trick section. And this section should really be called *Sampling from Multinomial by Argmaxing*.

The main differences are

- The numbers (parameters) $\mu_i$ can be any real number, potentially negative, whereas each $\lambda_i$ must be positive
- The probability vector is determined by $p_k = \frac{e^{\mu_k}}{\sum_{i=1}^N e^{\mu_i}}$ instead of $p_k = \frac{\lambda_k}{\sum_{i=1}^N\lambda_i}$
- We generate samples with $G_i \sim \text{Gumbel}(\mu_i)$ instead of $T_i \sim \text{Exp}(\lambda_i)$
- We take $\argmax$ over $G_i$ instead of taking $\argmin$ over $T_i$
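The parallel recipe above can be sketched as follows, with arbitrary example $\mu_i$. Here the Gumbel noise is generated via the inverse CDF, $-\log(-\log U)$ for $U \sim \text{Uniform}(0,1)$, shifted by $\mu_i$.

```python
import numpy as np

# The Gumbel-Max trick: add Gumbel(0, 1) noise to each mu_i and argmax.
# The mu_i (unnormalized log-probabilities) are arbitrary example values.
rng = np.random.default_rng(5)
mus = np.array([-1.0, 0.0, 2.0])
n = 200_000

U = rng.uniform(size=(n, len(mus)))
G = mus - np.log(-np.log(U))                   # G_i ~ Gumbel(mu_i)
samples = G.argmax(axis=1)

empirical = np.bincount(samples, minlength=len(mus)) / n
softmax = np.exp(mus) / np.exp(mus).sum()      # p_k = e^mu_k / sum e^mu_i
print(empirical, softmax)                      # the two should nearly match
```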

## When is Gumbel-Max Trick Useful?

It seems like a lot of work to sample from a multinomial by argmaxing over Gumbel samples (or argmining over Exponential samples). In what situation would we ever want to do this?

The short answer is that the Gumbel-Max trick allows us to make a sampling step **differentiable**. Specifically, it makes sampling from a multinomial distribution differentiable. We'll take a closer look at this in a future post, but pause for a second and think about it. We are saying it is possible to **differentiate through the action of drawing a discrete sample** from a multinomial distribution! This was a pretty surprising/amazing possibility to me.

Regarding downstream applications, differentiating through sampling is an important "trick" in neural-network-based variational inference in general. Multinomial discrete random variables are prevalent in many learning problems. The Gumbel-Max trick allows us to work with them in many interesting neural variational inference problems, which we will look into in future posts.