Jensen–Shannon Divergence

From Wikipedia, the free encyclopedia

Statistical distance measure

In probability theory and statistics, the Jensen–Shannon divergence, named after Johan Jensen and Claude Shannon, is a method of measuring the similarity between two probability distributions. It is also known as information radius (IRad)[1][2] or total divergence to the average.[3] It is based on the Kullback–Leibler divergence, with some notable (and useful) differences, including that it is symmetric and always has a finite value. The square root of the Jensen–Shannon divergence is a metric often referred to as the Jensen–Shannon distance. The closer the Jensen–Shannon distance is to zero, the more similar the distributions are.[4][5][6]

Definition

Consider the set

{\displaystyle M_{+}^{1}(A)}

of probability distributions where

{\displaystyle A}

is a set provided with some σ-algebra of measurable subsets. In particular we can take

{\displaystyle A}

to be a finite or countable set with all subsets being measurable.

The Jensen–Shannon divergence (JSD) is a symmetrized and smoothed version of the Kullback–Leibler divergence

{\displaystyle D(P\parallel Q)}

. It is defined by

{\displaystyle {\rm {JSD}}(P\parallel Q)={\frac {1}{2}}D(P\parallel M)+{\frac {1}{2}}D(Q\parallel M),}

where

{\displaystyle M={\frac {1}{2}}(P+Q)}

is a mixture distribution of

{\displaystyle P}

and

{\displaystyle Q}

.
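The two-distribution definition above can be sketched in a few lines of Python for discrete distributions. This is a minimal illustration, not a reference implementation: the function names `kl_divergence` and `jsd`, the list-of-probabilities representation, and the default base-2 logarithm are all assumptions made for the example.

```python
import math

def kl_divergence(p, q, base=2.0):
    """D(P || Q) for discrete distributions given as equal-length sequences."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # outcomes with P(x) = 0 contribute nothing to the sum
            total += pi * math.log(pi / qi, base)
    return total

def jsd(p, q, base=2.0):
    """JSD(P || Q) = 1/2 D(P || M) + 1/2 D(Q || M) with M = (P + Q) / 2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)

# Hypothetical example distributions on three outcomes.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(jsd(p, q))  # 0.5
print(jsd(q, p))  # 0.5, the divergence is symmetric
```

Because M is positive wherever P or Q is positive, the ratio inside the logarithm never divides by zero, which is why the result stays finite even when P and Q have different supports.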

The geometric Jensen–Shannon divergence[7] (or G-Jensen–Shannon divergence) yields a closed-form formula for divergence between two Gaussian distributions by taking the geometric mean.

A more general definition, allowing for the comparison of more than two probability distributions, is:

{\displaystyle {\begin{aligned}{\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})&=\sum _{i}\pi _{i}D(P_{i}\parallel M)\\&=H\left(M\right)-\sum _{i=1}^{n}\pi _{i}H(P_{i})\end{aligned}}}

where

{\displaystyle {\begin{aligned}M&:=\sum _{i=1}^{n}\pi _{i}P_{i}\end{aligned}}}

and

{\displaystyle \pi _{1},\ldots ,\pi _{n}}

are weights that are selected for the probability distributions

{\displaystyle P_{1},P_{2},\ldots ,P_{n}}

, and

{\displaystyle H(P)}

is the Shannon entropy for distribution

{\displaystyle P}

. For the two-distribution case described above,

{\displaystyle P_{1}=P,P_{2}=Q,\pi _{1}=\pi _{2}={\frac {1}{2}}.\ }

Hence, for those distributions

{\displaystyle P,Q}

{\displaystyle {\rm {JSD}}(P\parallel Q)=H(M)-{\frac {1}{2}}{\bigg (}H(P)+H(Q){\bigg )}}
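As a numerical check of the general definition, the following sketch computes the weighted divergence both as a sum of Kullback–Leibler terms and as an entropy difference, and the two should agree up to rounding. The helper names (`entropy`, `kl`, `generalized_jsd`) and the example distributions and weights are hypothetical, chosen only for illustration.

```python
import math

def entropy(p, base=2.0):
    """Shannon entropy H(P) of a discrete distribution."""
    return -sum(x * math.log(x, base) for x in p if x > 0.0)

def kl(p, q, base=2.0):
    """Kullback-Leibler divergence D(P || Q)."""
    return sum(x * math.log(x / y, base) for x, y in zip(p, q) if x > 0.0)

def generalized_jsd(dists, weights, base=2.0):
    """Return the weighted JSD computed two ways: (KL form, entropy form)."""
    size = len(dists[0])
    # Mixture M = sum_i pi_i * P_i, computed outcome by outcome.
    m = [sum(w * d[k] for w, d in zip(weights, dists)) for k in range(size)]
    via_kl = sum(w * kl(d, m, base) for w, d in zip(weights, dists))
    via_entropy = entropy(m, base) - sum(
        w * entropy(d, base) for w, d in zip(weights, dists)
    )
    return via_kl, via_entropy

# Three hypothetical distributions with non-uniform weights summing to 1.
dists = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4]]
weights = [0.5, 0.25, 0.25]
via_kl, via_entropy = generalized_jsd(dists, weights)
print(via_kl, via_entropy)  # the two values should match up to rounding
```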

Bounds

The Jensen–Shannon divergence is bounded by 1 for two discrete probability distributions, given that one uses the base 2 logarithm:[8]

{\displaystyle 0\leq {\rm {JSD}}(P\parallel Q)\leq 1}

With this normalization, it is a lower bound on the total variation distance between P and Q:

{\displaystyle {\rm {JSD}}(P\parallel Q)\leq {\frac {1}{2}}\|P-Q\|_{1}={\frac {1}{2}}\sum _{\omega \in \Omega }|P(\omega )-Q(\omega )|}
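The base-2 bounds above can be spot-checked empirically. The sketch below draws random discrete distributions and counts how often the chain 0 ≤ JSD ≤ total variation fails; it is a sanity check under assumed names (`jsd_base2`, `total_variation`, `random_distribution`) and an arbitrary seed, not a proof.

```python
import math
import random

def jsd_base2(p, q):
    """JSD(P || Q) in bits for discrete distributions on a shared outcome set."""
    total = 0.0
    for a, b in zip(p, q):
        m = 0.5 * (a + b)
        if a > 0.0:
            total += 0.5 * a * math.log2(a / m)
        if b > 0.0:
            total += 0.5 * b * math.log2(b / m)
    return total

def total_variation(p, q):
    """Half the L1 distance between P and Q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def random_distribution(rng, size):
    """Normalize uniform random draws into a probability vector."""
    raw = [rng.random() for _ in range(size)]
    s = sum(raw)
    return [x / s for x in raw]

rng = random.Random(0)  # arbitrary seed, for reproducibility only
pairs = [(random_distribution(rng, 5), random_distribution(rng, 5))
         for _ in range(1000)]
tol = 1e-12  # slack for floating-point rounding
violations = sum(
    1 for p, q in pairs
    if not (-tol <= jsd_base2(p, q) <= total_variation(p, q) + tol)
)
print(violations)

# Distributions with disjoint support attain the upper bound of 1 bit.
print(jsd_base2([1.0, 0.0], [0.0, 1.0]))  # 1.0
```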

With base-e logarithm, which is commonly used in statistical thermodynamics, the upper bound is

{\displaystyle \ln(2)}

. In general, the bound in base b is

{\displaystyle \log _{b}(2)}

:

{\displaystyle 0\leq {\rm {JSD}}(P\parallel Q)\leq \log _{b}(2)}

More generally, for more than two probability distributions, the Jensen–Shannon divergence is bounded by

{\displaystyle \log _{b}(n)}

, where n is the number of distributions:[8]

{\displaystyle 0\leq {\rm {JSD}}_{\pi _{1},\ldots ,\pi _{n}}(P_{1},P_{2},\ldots ,P_{n})\leq \log _{b}(n)}
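To illustrate the general bound, the sketch below evaluates the entropy form of the divergence for n point masses on distinct outcomes with uniform weights, a case in which the value equals log_b(n) in each base tried. The function names and the choice n = 4 are assumptions made for the example.

```python
import math

def entropy(p, base):
    """Shannon entropy H(P) in the given logarithm base."""
    return -sum(x * math.log(x, base) for x in p if x > 0.0)

def generalized_jsd(dists, weights, base):
    """Entropy form: H(M) - sum_i pi_i H(P_i), with M = sum_i pi_i P_i."""
    size = len(dists[0])
    m = [sum(w * d[k] for w, d in zip(weights, dists)) for k in range(size)]
    return entropy(m, base) - sum(
        w * entropy(d, base) for w, d in zip(weights, dists)
    )

n = 4  # hypothetical number of distributions
# Point mass i puts all probability on outcome i, so the supports are disjoint.
point_masses = [[1.0 if k == i else 0.0 for k in range(n)] for i in range(n)]
uniform_weights = [1.0 / n] * n

results = {}
for base in (2.0, math.e, 10.0):
    results[base] = generalized_jsd(point_masses, uniform_weights, base)
    print(base, results[base], math.log(n, base))
```

With non-uniform weights the same construction gives the entropy of the weight vector, which is at most log_b(n).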

Relation to mutual information

The Jensen–Shannon divergence is the mutual information between a random variable

{\displaystyle X}

associated to a mixture distribution between

{\displaystyle P}

and

{\displaystyle Q}

and the binary indicator variable

{\displaystyle Z}

that is used to switch between

{\displaystyle P}

and

{\displaystyle Q}

to produce the mixture. Let

{\displaystyle X}

be some abstract function on the underlying set of events that discriminates well between events, and choose the value of

{\displaystyle X}

according to

{\displaystyle P}

if

{\displaystyle Z=0}

and according to

{\displaystyle Q}

if

{\displaystyle Z=1}

, where

{\displaystyle Z}

is equiprobable. That is, we are choosing

{\displaystyle X}

according to the probability measure

{\displaystyle M=(P+Q)/2}

, and its distribution is the mixture distribution. We compute

{\displaystyle {\begin{aligned}I(X;Z)&=H(X)-H(X|Z)\\&=-\sum M\log M+{\frac {1}{2}}\left[\sum P\log P+\sum Q\log Q\right]\\&=-\sum {\frac {P}{2}}\log M-\sum {\frac {Q}{2}}\log M+{\frac {1}{2}}\left[\sum P\log P+\sum Q\log Q\right]\\&={\frac {1}{2}}\sum P\left(\log P-\log M\right)+{\frac {1}{2}}\sum Q\left(\log Q-\log M\right)\\&={\rm {JSD}}(P\parallel...
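The relation between the mutual information I(X;Z) and the Jensen–Shannon divergence described in this section can be checked numerically. The sketch below builds the joint distribution of (Z, X) for an equiprobable switch Z and compares the mutual information with the divergence; the names `mutual_information_bits` and `jsd_bits` and the example distributions are hypothetical.

```python
import math

def mutual_information_bits(joint):
    """I(X;Z) in bits for a joint table indexed as joint[z][x]."""
    pz = [sum(row) for row in joint]                        # marginal of Z
    px = [sum(row[x] for row in joint)
          for x in range(len(joint[0]))]                    # marginal of X
    total = 0.0
    for z, row in enumerate(joint):
        for x, pxz in enumerate(row):
            if pxz > 0.0:
                total += pxz * math.log2(pxz / (px[x] * pz[z]))
    return total

def jsd_bits(p, q):
    """JSD(P || Q) in bits, using M = (P + Q) / 2."""
    total = 0.0
    for a, b in zip(p, q):
        m = 0.5 * (a + b)
        if a > 0.0:
            total += 0.5 * a * math.log2(a / m)
        if b > 0.0:
            total += 0.5 * b * math.log2(b / m)
    return total

# Hypothetical distributions; Z = 0 selects P and Z = 1 selects Q,
# each with probability 1/2, so the marginal of X is the mixture M.
p = [0.6, 0.3, 0.1]
q = [0.2, 0.2, 0.6]
joint = [[0.5 * a for a in p], [0.5 * b for b in q]]

mi = mutual_information_bits(joint)
print(mi, jsd_bits(p, q))  # the two values should agree up to rounding
```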
