Foundations and tools

In this lecture, we introduce the foundational questions of information theory, both in the classical and quantum setting. This serves to emphasize the fact that quantum information theory is really information theory specialized to information carriers that abide by the axioms of quantum mechanics, and as a motivation to develop a set of tools that are useful to tackle more complex tasks.

1. State discrimination

State discrimination is the task of distinguishing between two known states given a system prepared in one of the two states. More precisely, we consider a setup where there is a sender and a receiver. The sender can prepare one of two states that are also known to the receiver. The receiver gets one such state and decides which of the two states was prepared by the sender.

1.1. Classical setting

In the classical setting, the state can be thought of as a one-bit register that is prepared in \(0\) with some probability \(p\) or \(q\) depending on whether the sender is preparing one of the two states \(P\) or \(Q\).1The interest of such a task should be clear: the receiver is trying to recover the information about which state was prepared — that could be one bit of a message for instance — from the signal that he receives. An extreme situation is when \(p = 1\) and \(q = 0\), that is \(P\) is a register with value \(0\), and \(Q\) is a register with value \(1\). Then it is easy to discriminate between the two upon reading the value of the register itself. Things get a bit more interesting whenever \(p,q \in (0,1)\). What is then the best guess that can be made about the state chosen by the sender?

The answer is given by applying Bayes' rule after stating our prior on the behavior of the sender. Here, we assume that our initial knowledge about the behavior of the sender is that he can choose to prepare \(P\) or \(Q\) with equal probability, i.e. \(\frac 1 2\). If we denote by \(R\) the random variable that corresponds to the choice of \(P\) or \(Q\), then we have:

\begin{align} \Pr(b = 0, R = P) = p/2, & \ \Pr(b = 1, R = P) = (1-p)/2 \\ \Pr(b = 0, R = Q) = q/2, & \ \Pr(b = 1, R = Q) = (1-q)/2. \end{align}

This in turn gives2Remember that \(b \in \{0,1\}\) so that \(\Pr(b|R=P) = (1-b)p + b(1-p)\).:

\begin{align} \Pr(R = P | b) & = \frac{\Pr(b, R = P) }{\Pr(b)} = \frac{(1-b)\frac p 2 + b \frac{1-p}{2}}{(1-b)\frac{p+q}{2} + b (1-\frac{p+q}{2})} \\ \Pr(R = Q | b) & = \frac{\Pr(b, R = Q) }{\Pr(b)} = \frac{(1-b)\frac q 2 + b \frac{1-q}{2}}{(1-b)\frac{p+q}{2} + b (1-\frac{p+q}{2})}. \end{align}

Given this, upon observing \(b\), the receiver should decide that \(R = P\) whenever \(\Pr(R = P | b) \geq \Pr(R = Q | b)\), i.e. whenever:

\begin{align} p-q & \geq 0 \mbox{ for } b = 0\\ q-p & \geq 0 \mbox{ for } b = 1. \end{align}

By denoting \(\hat R\) the random variable that represents the choice of the receiver, we can now assess the success probability \(\Pr(\hat R = R)\). This probability is equal to:

\begin{align} \Pr(\hat R = R ) & = \Pr(b=0, R = P) \one_{p\geq q} + \Pr(b=1, R = P) \one_{p\leq q} \nonumber \\ & \qquad + \Pr(b=0, R = Q) \one_{q\geq p} + \Pr(b=1, R = Q) \one_{q\leq p} \\ & = \frac{p + 1 - q}{2} \one_{p\geq q} + \frac{1-p + q}{2} \one_{p\leq q}\\ & = \frac 1 2 + \frac 1 2 |p - q|. \end{align}

The trace distance3The normalization of the trace distance varies depending on authors. It is often normalized using an additional \(\frac{1}{2}\) factor so that the trace distance is between 0 and 1 for probability distributions. between two probability distributions \(P\) and \(Q\) over some event space \(\Gamma\) is:

\begin{equation} \| P - Q \|_1 = \sum_{\gamma \in \Gamma} |\Pr(P = \gamma) - \Pr(Q =\gamma)|. \end{equation}

For binary variables \(P\) and \(Q\) parametrized by \(p \coloneqq \Pr(P=0)\) and \(q\coloneqq \Pr(Q=0)\) we have \(\|P-Q\|_1 = 2 |p-q|\).

Using this definition, the above success probability for the classical state discrimination is \(\frac{1}{2}(1 + \frac{1}{2}\| P - Q\|_1)\). This means that the trace-distance is an operational measure of similarity between probability distributions. It characterizes how well a receiver can perform the discrimination task presented above, and one recovers the usual intuition that when \(p = 1\) and \(q = 0\), \(\|P - Q \|_1 = 2\) so that the probability of success is 1, while for \(p = q\), both probability distributions are identical so that \(\|P - Q\|_1 = 0\), and the receiver needs to randomly guess the value of \(R\). This intuition can be made more formal by noticing that \(\| \cdot \|_1\) defines a metric over the space of probability distributions:

  • it is symmetric \(\| P - Q \|_1 = \| Q - P \|_1\)
  • it is non negative \(\| P-Q \|_1 \geq 0\)
  • it is definite \(\| P - Q \|_1 = 0 \Rightarrow P = Q\)
  • and it satisfies the triangle inequality \(\|P - R\|_1 \leq \|P - Q\|_1 + \|Q - R \|_1\).
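The optimal guessing rule and the resulting success probability \(\frac 1 2 + \frac 1 2 |p - q|\) derived above are easy to check numerically. Below is a minimal sketch (the helper name and parameters are ours, chosen for illustration) that simulates the sender/receiver game for binary distributions and compares the empirical success rate with the formula:

  import numpy as np

  def empirical_success(p, q, n_trials=200_000, seed=0):
      """Simulate the discrimination game: R = P or Q with prob 1/2, then b = 0 with prob p (resp. q)."""
      rng = np.random.default_rng(seed)
      R = rng.integers(0, 2, n_trials)                    # 0 -> sender chose P, 1 -> sender chose Q
      b = (rng.random(n_trials) >= np.where(R == 0, p, q)).astype(int)
      # Bayes-optimal receiver: on b = 0 guess P iff p >= q, and make the opposite guess on b = 1.
      guess_on_b0 = 0 if p >= q else 1
      guess = np.where(b == 0, guess_on_b0, 1 - guess_on_b0)
      return np.mean(guess == R)

  p, q = 0.8, 0.3
  print(empirical_success(p, q))       # empirical success rate, close to 0.75
  print(0.5 + 0.5 * abs(p - q))        # 1/2 + |p-q|/2, i.e. 1/2 (1 + ||P-Q||_1 / 2)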

1.2. Quantum setting

1.2.1. Trace norm and trace distance for operators

The trace norm of \(A \in Hom(\mathcal H, \mathcal H')\) is:

\begin{equation} \|A\|_1 = \tr(|A|), \end{equation}

with \(|A| = \sqrt{A^\dagger A}\).

The trace norm of \(A\) is the sum of its singular values.

Decompose \(A = U \Delta V\) where \(U\) and \(V\) are unitary matrices and \(\Delta\) is a rectangular diagonal matrix with non-negative (singular) values. Then expanding \(A^\dagger A = V^\dagger \Delta^\dagger\Delta V\) shows that \(A^\dagger A\) is a Hermitian matrix whose eigenvalues are the squares of the singular values in \(\Delta\), so that \(\tr(|A|) = \tr(\sqrt{A^\dagger A})\) is the sum of the singular values of \(A\).
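A quick numerical sanity check of this statement (a sketch with an arbitrary rectangular test matrix; any \(A\) works):

  import numpy as np

  rng = np.random.default_rng(1)
  A = rng.normal(size=(3, 5)) + 1j * rng.normal(size=(3, 5))   # a generic A in Hom(H, H')

  # tr|A| with |A| = sqrt(A^dagger A), computed from the eigenvalues of A^dagger A.
  eigs = np.linalg.eigvalsh(A.conj().T @ A)
  trace_norm = np.sqrt(np.clip(eigs, 0, None)).sum()

  print(trace_norm)                                    # tr|A|
  print(np.linalg.svd(A, compute_uv=False).sum())      # sum of the singular values of A: the same number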

It is also easy to show that:

  • \(\|A\|_1 \geq 0\) and \(\|A\|_1 = 0 \Leftrightarrow A = 0\);
  • \(\| \alpha A \|_1 = |\alpha| \| A \|_1\) for \(\alpha \in \mathbb C\);
  • \(\| U A V^\dagger \|_1 = \| A \|_1\) for \(U\) and \(V\) isometries.

For \(A \in End(\mathcal H)\),

\begin{equation} \|A\|_1 = \max_U |\tr(UA)|,\end{equation}

where \(U\) is a unitary.

We use the SVD of \(A = V \Delta W\) and write:

\begin{align} |\tr(UA) | & = |\tr(UV\Delta W)| = |\tr((\sqrt{\Delta})(\sqrt \Delta W UV))| \\ & \leq \sqrt{\tr(\sqrt \Delta \sqrt \Delta)} \sqrt{\tr ((\sqrt \Delta W UV)^\dagger (\sqrt \Delta W UV))} \\ & = \tr(\Delta) \\ & = \| A \|_1 \end{align}

where we have used the Cauchy–Schwarz inequality4\(|\langle A,B\rangle| \leq \|A\| \|B\|\), with equality if \(A \propto B\). for the HS inner product. The equality is obtained for \(W U V = \one_{\mathcal H}\).

From there we obtain the triangle inequality:

For \(A,B \in End(\mathcal H)\),

\begin{equation} \|A + B \|_1 \leq \|A\|_1 + \|B \|_1. \end{equation}

Note that this indeed also holds for \(A,B \in Hom(\mathcal H, \mathcal H')\).

We use the variational characterization of the trace norm:

\begin{align} \|A + B\|_1 & = \max_U |\tr(U(A+B))| \\ & = |\tr(\hat U (A+B))| \\ & = |\tr(\hat U A) + \tr(\hat UB)| \\ & \leq |\tr(\hat U A)| + |\tr(\hat UB)| \\ & \leq \max_U |\tr(UA)| + \max_U |\tr(UB)| \\ & = \|A\|_1 + \|B\|_1, \end{align}

where \(\hat U\) is the maximizing \(U\) of the first line.

Using the triangle inequality, we then get the convexity of the trace norm:

For \(A,B \in Hom(\mathcal H, \mathcal H')\) and \(\lambda \in [0,1]\)

\begin{equation} \| \lambda A + (1-\lambda) B \|_1 \leq \lambda \|A\|_1 + (1-\lambda) \|B\|_1. \end{equation}

For \(A,B \in Hom(\mathcal H, \mathcal H')\) the trace-distance between \(A\) and \(B\) is

\begin{equation} \|A-B\|_1. \end{equation}

For \(\rho\) and \(\sigma\) density matrices we have \(0\leq \|\rho-\sigma\|_1 \leq 2\) — which often motivates the use of the normalized trace-distance equal to \(\frac 1 2 \| \rho - \sigma \|_1\).

1.2.2. Trace distance as an operational measure of distinguishability

The interest of the trace distance for density matrices is its operational interpretation as a measure of the ability to discriminate quantum states.

For \(\rho, \sigma\) density matrices, we have

\begin{equation} \frac{1}{2} \|\rho -\sigma\|_1 = \max_{0 \leq K \leq \one} \tr(K(\rho-\sigma)) \end{equation}

Diagonalize \(\rho-\sigma\) and consider \(\Pi_{>0}\) the projector onto the subspaces with positive eigenvalues and \(\Pi_{<0}\) the projector onto the subspace with negative eigenvalues. Then define \(R\coloneqq \Pi_{>0} (\rho - \sigma) \Pi_{>0}\) and \(S\coloneqq \Pi_{<0} (\sigma - \rho) \Pi_{<0}\) so that \(R,S\) are PSD and \(|\rho-\sigma| = R + S\). Now because the positive and negative eigenvalue subspaces are orthogonal we have

\begin{equation} \| \rho - \sigma\|_1 = \tr(|\rho - \sigma|) = \tr(R) + \tr(S). \end{equation}

But we also have \(\tr(\rho - \sigma) = 0 = \tr(R) - \tr(S)\) so that \(\| \rho - \sigma\|_1 = 2 \tr(R)\).

Now by construction \(\tr(R) = \tr(\Pi_{>0} (\rho - \sigma) \Pi_{>0}) = \tr(\Pi_{>0} (\rho-\sigma))\). Finally, for any \(0 \leq K \leq \one\), we have

\begin{align} \tr(K(\rho-\sigma)) & = \tr(K(R-S)) \\ & \leq \tr(K R) \\ & \leq \tr(R) \\ & = \frac{1}{2} \| \rho - \sigma\|_1. \end{align}

Equality is attained for \(K = \Pi_{>0}\), since \(\Pi_{>0} R = R\) and \(\Pi_{>0} S = 0\), so that \(\Pi_{>0}\) is a maximizing \(K\).

The above property states that the (normalized) trace-distance corresponds to the biggest difference in the probability of obtaining a given measurement outcome on \(\rho\) and on \(\sigma\). In our state discrimination setting the sender now sends \(\rho\) or \(\sigma\) at random with equal probability to the receiver. The receiver in turn performs a measurement with POVM elements \(K_0,K_1\): if he obtains the outcome associated with \(K_0\), he declares to have received \(\rho\), and if he obtains \(K_1\) he declares \(\sigma\). In such a case, the success probability is:

\begin{equation} \frac 1 2 \tr(K_0 \rho) + \frac 1 2 \tr(K_1 \sigma). \end{equation}

Using \(K_0 + K_1 = \one\), we obtain that the success probability is

\begin{equation} \frac 1 2 (1 + \tr (K_0 (\rho-\sigma))) \end{equation}

Optimizing over \(K_0\) can be done using the above property, and the maximizing \(K_0\) is the projection onto the positive eigenvalue subspace of \(\rho - \sigma\). In such case, we then obtain a success probability of

\begin{equation} \frac 1 2 \left( 1 + \frac 1 2 \| \rho - \sigma \|_1\right), \end{equation}

which corresponds to the generalization of the previous result for classical probability distributions.
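In practice, the maximizing measurement (often called the Helstrom measurement) can be constructed from the spectral decomposition of \(\rho - \sigma\), exactly as in the proof above. A minimal numerical sketch (the random-state helper and names are ours):

  import numpy as np

  def random_density_matrix(d, rng):
      """A random density matrix, obtained by normalizing G G^dagger for a Gaussian G."""
      G = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
      rho = G @ G.conj().T
      return rho / np.trace(rho)

  rng = np.random.default_rng(2)
  rho, sigma = random_density_matrix(4, rng), random_density_matrix(4, rng)

  eigvals, eigvecs = np.linalg.eigh(rho - sigma)
  half_trace_dist = 0.5 * np.abs(eigvals).sum()          # 1/2 ||rho - sigma||_1

  # Optimal POVM element: projector onto the positive eigenvalue subspace of rho - sigma.
  P_pos = eigvecs[:, eigvals > 0]
  K0 = P_pos @ P_pos.conj().T

  print(np.trace(K0 @ (rho - sigma)).real, half_trace_dist)   # both equal max_K tr(K(rho - sigma))
  print(0.5 * (1 + half_trace_dist))                          # optimal success probability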

2. Channel discrimination

Channel discrimination is another foundational information theoretic task that highlights some of the differences between quantum and classical information. There, we again have two parties, but in a slightly different setting: the first one, say Alice, is preparing a quantum system in state \(\rho\), the second one, say Bob, applies to it a CPTP map chosen at random from a set \(\{\mathcal E, \mathcal F\}\) and gives back the quantum system to Alice, who then measures it. The task of Alice is to discriminate between \(\mathcal E\) and \(\mathcal F\).

2.1. Naive approach

Channel discrimination can be cast into state discrimination. After all, Alice could consider that she receives at random either \(\mathcal E(\rho)\) or \(\mathcal F(\rho)\) which she can discriminate using the optimal strategy seen above. The maximum success probability in this case is:

\begin{equation} \frac 1 2 \left ( 1 + \frac 1 2 \| \mathcal E(\rho) - \mathcal F(\rho) \|_1 \right). \end{equation}

Because Alice can optimize over \(\rho\), we would naturally be interested in

\begin{equation} \max_{\rho} \frac 1 2 \left ( 1 + \frac 1 2 \| \mathcal E(\rho) - \mathcal F(\rho) \|_1 \right), \end{equation}

which points at \(\max_\rho \| \mathcal E(\rho) - \mathcal F(\rho) \|_1\) as a measure of distinguishability between \(\mathcal E\) and \(\mathcal F\).

2.2. Super-dense coding

In fact, there is more that can be done in order to discriminate channels.

To get an intuition we will consider the following channels: \(\mathcal E(\cdot) \coloneqq I(\cdot)I\) and \(\mathcal F(\cdot) \coloneqq X(\cdot) X\). One can perform the optimization above very easily and obtain that the optimal state to prepare is \(\ket 0\) and the measurement is \(\{\ketbra 0, \ketbra 1\}\). The success probability is 1 as the resulting state is \(\ket 0\) for \(\mathcal E\) and \(\ket 1\) for \(\mathcal F\), which are perfectly distinguishable.

In a similar fashion, if \(\mathcal F(\cdot) \coloneqq Z(\cdot)Z\), then the optimal state is \(\ket +\) and the measurement is \(\{\ketbra +, \ketbra - \}\). The second case can actually be seen as the first one rotated by a Hadamard gate5\(H\ket 0 = \ket +\) and \(H\ket 1 = \ket -\)., and from there it is clear that using the naïve approach there is no single initial state and optimal measurement that can distinguish the 3 channels \(I, X, Z\) — and a fortiori \(I, X, Y, Z\) — with probability of success equal to 1.

Yet, there is a situation where discriminating between \(I,X,Y,Z\) is possible with success probability 1, which can be derived from the existence of the super dense coding protocol.

Super dense coding is a protocol allowing a sender (Alice) to send a 2-bit message \(m \in \{0,1,2,3\}\) to a receiver (Bob) by sending half a pre-established EPR-pair through a perfect quantum channel.

  1. Alice and Bob share an EPR pair \(\frac 1 {\sqrt 2} \left(\ket{00} + \ket{11}\right)\);
  2. Alice chooses \(m\) and depending on its value she applies \(I,X,Y,Z\) to her half of the EPR-pair, and sends it to Bob
  3. Bob performs a Bell State Measurement on the two qubits (the received half and his own half)
  4. He sets the received message \(\hat m\) depending on the observed Bell state:
\begin{align} \frac{1}{\sqrt 2}\left(\ket{00} + \ket{11}\right) & \rightarrow \hat m = 0 \\ \frac{1}{\sqrt 2}\left(\ket{00} - \ket{11}\right) & \rightarrow \hat m = 3 \\ \frac{1}{\sqrt 2}\left(\ket{10} + \ket{01}\right) & \rightarrow \hat m = 1 \\ \frac{1}{\sqrt 2}\left(\ket{10} - \ket{01}\right) & \rightarrow \hat m = 2 \end{align}

The interest of this protocol for channel discrimination comes from the fact that we can imagine that Alice chooses one of the \(I,X,Y,Z\) channels to apply to half of an EPR-pair prepared by Bob, who then performs a measurement that perfectly distinguishes the 4 possibilities. This goes beyond what the naïve approach suggested was possible. What is the special ingredient that allows this? The use of an EPR-pair, i.e. a maximally entangled state, for which one half acts as a reference. The naïve approach didn't allow this as the channel was implicitly applied to the whole state, while here it is only applied partially. In fact, this shouldn't come as too much of a surprise as we have seen that a channel is characterized by its Choi state, obtained by applying the channel onto half of a maximally entangled state.
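Concretely, applying the four Paulis to Alice's half of the EPR pair produces the four mutually orthogonal Bell states, which a Bell-state measurement distinguishes perfectly. A minimal check (a sketch; the qubit ordering Alice tensor Bob is our convention):

  import numpy as np

  I = np.eye(2)
  X = np.array([[0, 1], [1, 0]])
  Y = np.array([[0, -1j], [1j, 0]])
  Z = np.array([[1, 0], [0, -1]])

  epr = np.array([1, 0, 0, 1]) / np.sqrt(2)          # (|00> + |11>)/sqrt(2), ordering A tensor B

  # Alice encodes m = 0, 1, 2, 3 by applying I, X, Y, Z to her qubit.
  encoded = [np.kron(P, I) @ epr for P in (I, X, Y, Z)]

  # Gram matrix of the four resulting states: the identity, so they are perfectly distinguishable.
  gram = np.array([[abs(np.vdot(u, v)) for v in encoded] for u in encoded])
  print(np.round(gram, 10))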

2.3. Diamond norm

Given the intuition built through the previous example, we realize that the optimization over \(\rho\) is not enough. It should be supplemented by an optimization over the size of the Hilbert-space so that the success probability can be written:

\begin{equation} \frac 1 2 \left ( 1 + \frac 1 2 \sup_{\mathcal H_R} \max_{\rho} \| (\mathcal E \otimes \one_{\mathcal H_R})(\rho) - (\mathcal F \otimes \one_{\mathcal H_R})(\rho) \|_1\right ), \end{equation}

where \(\mathcal E, \mathcal F\) are CPTP maps from \(\mathcal H\) to \(\mathcal H'\) and \(\rho\) is a density matrix over \(\mathcal H \otimes \mathcal H_R\).

This defines an induced6This means a norm over states is used to define a norm over operators using some optimization procedure norm over channels:

For \(\mathcal E\) a linear map on operators over \(\mathcal H\) (in particular, a CPTP map or a difference of two CPTP maps),

\begin{equation} \| \mathcal E \|_{\diamond} \coloneqq \max_{\ket \psi \in \mathcal H \otimes \mathcal H} \| (\mathcal E \otimes \one_{\mathcal H})(\ketbra \psi) \|_1 \end{equation}

Note that above it is enough to optimize over pure states — using the convexity of the trace distance. Additionally, the optimization over \(\mathcal H_R\) that was expected is omitted in the above definition as it is indeed sufficient to consider \(\mathcal H_R= \mathcal H\) using the fact that the Schmidt rank of a state over \(\mathcal H \otimes \mathcal H_R\) is bounded by \(\dim \mathcal H\).

Finally, and similarly to the trace-distance, one can define the diamond distance between CPTP maps:

For \(\mathcal E, \mathcal F\) two CPTP maps from \(\mathcal H\) to \(\mathcal H'\), the diamond distance between them is

\begin{equation} \| \mathcal E - \mathcal F \|_\diamond \end{equation}
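Computing \(\|\mathcal E - \mathcal F\|_\diamond\) exactly is usually done via a semidefinite program, but evaluating the defining expression on the maximally entangled input already gives the trace distance between the normalized Choi states, which is a lower bound on the diamond distance. A minimal sketch for two single-qubit channels specified by Kraus operators (the example channels are our own choice, not from the lecture):

  import numpy as np

  def choi_state(kraus_ops, d):
      """Normalized Choi state (E tensor id)(|Phi><Phi|) with |Phi> = sum_i |ii>/sqrt(d)."""
      phi = np.zeros(d * d, dtype=complex)
      phi[:: d + 1] = 1 / np.sqrt(d)
      Phi = np.outer(phi, phi.conj())
      return sum(np.kron(K, np.eye(d)) @ Phi @ np.kron(K, np.eye(d)).conj().T for K in kraus_ops)

  def trace_norm_hermitian(A):
      return np.abs(np.linalg.eigvalsh(A)).sum()

  d = 2
  X = np.array([[0, 1], [1, 0]])
  identity_channel = [np.eye(d)]                              # E = id
  bit_flip = [np.sqrt(0.5) * np.eye(d), np.sqrt(0.5) * X]     # F flips the qubit with probability 1/2

  diff = choi_state(identity_channel, d) - choi_state(bit_flip, d)
  print(trace_norm_hermitian(diff))   # lower bound on ||E - F||_diamond from the maximally entangled input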

3. Other measures of distance

3.1. Fidelity

We have seen that the trace-distance arises quite naturally from the state discrimination task. One might also argue that another well-motivated way to compare two pure quantum states \(\ket \psi, \ket \varphi\) is the squared overlap between the states: \(|\braket{\psi}{\varphi}|^2\). It arises from the following protocol. Again a sender prepares either \(\ket \psi\) or \(\ket \varphi\), and the receiver tries to distinguish one from the other. To do that, instead of using the optimal measurement seen earlier, he could decide to perform the measurement \(\{\ketbra \psi, \one - \ketbra \psi\}\). If the first outcome is obtained, he will decide that \(\ket \psi\) was sent, and \(\ket \varphi\) otherwise. The probability of error given that \(\ket \varphi\) is sent is therefore \(\tr(\ketbra \psi \ketbra \varphi) = |\braket{\psi}{\varphi}|^2\). Hence, the higher the fidelity, the harder it is to discriminate \(\ket \psi\) from \(\ket \varphi\) in this test.

The pure state fidelity for \(\ket \psi, \ket \varphi \in \mathcal H\) is defined as:

\begin{equation} F(\ket \psi, \ket \varphi) \coloneqq |\braket{\psi}{\varphi}|^2 \end{equation}
\begin{align} & F(\ket\psi, \ket \varphi) = F(\ket \varphi, \ket\psi) \\ & 0\leq F(\ket\psi,\ket\varphi) \leq 1 \end{align}

There is a notion of classical fidelity for probability distributions \(p(x), q(x)\) called Bhattacharyya overlap7Show that for classical states expressed using quantum systems, the two correspond:

\begin{equation} F(p,q) = \left(\sum_x \sqrt{p(x)q(x)}\right)^2. \end{equation}

Given this definition for pure states, it would be useful to extend it to mixed states. As we will see, this is not so easy in the general case, so we first build some intuition with the pure-mixed case, i.e. the fidelity \(F(\ket\psi,\rho)\). In such a case, one way to recover the pure-state fidelity is to consider \(\rho\) as a statistical mixture of the states appearing in its spectral decomposition. More precisely, we define the expected fidelity of \(\rho\) against \(\ket\psi\) as the expectation value of the pure-state fidelity, considering that \(\rho\) is prepared from its spectral decomposition:

\begin{equation} F(\ket\psi,\rho) \coloneqq \sum_\lambda p_\lambda |\braket{\psi}{\varphi_\lambda}|^2 = \bra \psi \rho \ket \psi, \end{equation}

where \(\rho = \sum_\lambda p_\lambda \ketbra{\varphi_\lambda}\) with \(\{\ket{\varphi_\lambda}\}_\lambda\) an orthonormal set.

The crucial ingredient in the above generalization is that the spectral decomposition offers a natural way to recover pure-state fidelities only, and a way to recombine each of the individual fidelities. With this in mind, a possibility — that is actually correct — would be to rely on the purification of mixed states to recover the pure-state situation. This defines the (Uhlmann) fidelity, which works for mixed states and coincides with the pure-state and expected fidelity whenever we are in a pure-pure or pure-mixed setting:

For \(\rho, \sigma\) density matrices over \(\mathcal H_A\) and \(\ket \psi, \ket \varphi \in \mathcal H_A \otimes \mathcal H_R\) such that \(\tr_R (\ketbra \psi) = \rho\) and \(\tr_R( \ketbra \varphi) = \sigma\), the Uhlmann fidelity is8Remember that (i) purifications are equivalent up to a unitary on the reference system part — hence starting from \(\ket \psi\) and \(\ket \varphi\) one can obtain all of them by applying \(U\) for \(\ket \psi\) and \(V\) for \(\ket \varphi\) — and (ii) that maximizing over \(U\) and \(V\) is overkill as they impact the value to optimize through \(U^\dagger V\) which is a unitary itself.

\begin{equation} F(\rho,\sigma) = \max_{U}|\bra\psi (\one_A \otimes U) \ket \varphi|^2. \end{equation}

Fortunately, Uhlmann's theorem allows for a closed-form expression of the fidelity:

For \(\rho,\sigma\) density matrices over \(\mathcal H_A\),

\begin{equation} F(\rho,\sigma) = \| \sqrt \rho \sqrt \sigma \|_1^2 = \left (\tr(\sqrt{\sqrt\sigma \rho \sqrt \sigma})\right)^2 \end{equation}

Given \(\rho\) we can take the purification \(\ket \psi = (\sqrt{\rho} \otimes \one_R) \ket\Gamma\) where \(\ket\Gamma = \sum_i \ket{i}_A\otimes \ket{i}_R\) is an unnormalized maximally entangled state over \(\mathcal H_A \otimes \mathcal H_R\), and similarly for \(\sigma\). Then9It is easy to show that for any \(A\in End(\mathcal H)\), \((A\otimes\one) \ket\Gamma = (\one\otimes A^T)\ket\Gamma\) with \(\ket\Gamma\) the maximally entangled state on \(\mathcal H \otimes \mathcal H\) by directly expressing the matrix elements \(\ketbra{i}{j}\). It is also straightforward using the same approach to show that \(\tr(A) = \bra\Gamma (A\otimes \one) \ket \Gamma\). we have for any \(U\):

\begin{align} |\bra \psi (\one_A \otimes U) \ket \varphi |^2 & = |\bra{\Gamma} (\sqrt{\rho}\sqrt{\sigma} \otimes U) \ket \Gamma|^2 \\ & = |\bra \Gamma (\sqrt{\rho}\sqrt{\sigma}U^T \otimes \one_R) \ket \Gamma|^2 \\ & = |\tr(\sqrt\rho \sqrt\sigma U^T)|^2. \end{align}

Now maximizing over \(U\) and using the variational characterization of the trace norm we get:

\begin{equation} \max_U |\tr(\sqrt\rho \sqrt\sigma U^T)|^2 = \|\sqrt\rho\sqrt\sigma\|_1^2. \end{equation}
The fidelity satisfies the following properties:

  • \(F\) is symmetric
  • \(F\) is multiplicative with respect to tensor products: \(F(\rho_A\otimes \rho_B, \sigma_A\otimes \sigma_B) = F(\rho_A, \sigma_A)F(\rho_B, \sigma_B)\)
  • \(F\) is invariant under isometries
  • \(F\) is non-decreasing under partial trace10The idea is to use the variational characterization of \(F\) and realize that the lhs optimizes over unitaries on the reference system, while the rhs optimizes over unitaries on the reference system and \(B\). This means that forgetting about one part of the system makes it easier to take one state for the other. This is generalized in the next property: there, if you see \(\mathcal E\) as a possibly noisy evolution, it will improve the fidelity between the states. This kind of inequality is also referred to as a data processing inequality. Such inequalities apply to various kinds of quantities and are very useful in characterizing the evolution of some quantities under noisy processes.: \(F(\rho_{AB}, \sigma_{AB}) \leq F(\rho_A, \sigma_A)\)
  • \(F\) is monotone under CPTP maps: \(F(\rho,\sigma) \leq F(\mathcal E (\rho), \mathcal E(\sigma))\)
  • \(\sqrt F\) is jointly concave with respect to its arguments: \(\sqrt F(\lambda \rho + (1-\lambda) \rho', \lambda \sigma + (1-\lambda) \sigma') \geq \lambda\sqrt F(\rho,\sigma) + (1-\lambda) \sqrt F(\rho',\sigma')\)
  • \(F\) is concave wrt one of its arguments: \(F(\lambda \rho + (1-\lambda)\rho',\sigma) \geq \lambda F(\rho, \ \sigma) + (1-\lambda) F(\rho', \sigma)\)
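Uhlmann's closed form also makes the fidelity easy to evaluate numerically. A minimal sketch using the second expression of the theorem (the square-root helper and the test states are ours), which also checks the pure-mixed case \(F(\ket\psi,\rho) = \bra\psi\rho\ket\psi\):

  import numpy as np

  def psd_sqrt(M):
      """Square root of a positive semidefinite Hermitian matrix via its spectral decomposition."""
      ev, V = np.linalg.eigh(M)
      return (V * np.sqrt(np.clip(ev, 0, None))) @ V.conj().T

  def fidelity(rho, sigma):
      """Uhlmann fidelity F(rho, sigma) = (tr sqrt(sqrt(sigma) rho sqrt(sigma)))^2."""
      s = psd_sqrt(sigma)
      return psd_sqrt(s @ rho @ s).trace().real ** 2

  rng = np.random.default_rng(3)
  G = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
  rho = G @ G.conj().T
  rho /= np.trace(rho)

  psi = np.array([1.0, 0.0])                    # pure state |0>
  sigma = np.outer(psi, psi.conj())

  print(fidelity(rho, sigma), (psi.conj() @ rho @ psi).real)   # pure-mixed case: F = <psi|rho|psi>
  print(fidelity(rho, rho))                                    # F(rho, rho) = 1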

4. Classical entropies and information

4.1. Overview of the operational definition of information and entropy

Information is a concept that was quantified by Shannon in 1948. As for distance measures, it stems from a simple foundational task involving sending messages from Alice to Bob. This task is called source coding and corresponds to Alice sending messages seen as random bit strings picked at random from a given distribution — it could be for instance random words from the dictionary, random sentences from a book, or random paragraphs from a corpus of texts 11Think about ChatGPT-like situations where the entire web was scraped to gain knowledge about correlations between sentences.. The objective of this task is to encode the message in such a way that Alice sends the minimum number of bits to Bob, while still allowing Bob to decode the message perfectly. That is, we want to find the best possible compression for the source emitting the messages.

Shannon's source coding theorem states the following. Let a source emit words out of an alphabet \(\Sigma\) — say \(\{A,B,C,\ldots\}\). Consider \(f\) an encoding of these words into words over an alphabet \(\Pi\) — say \(\{0, 1\}\) — that is uniquely decodable12This means that any concatenation of codewords can be unambiguously parsed back into the original sequence of words. Then we can consider an optimal \(f\), that is, one such that when \(X\) is drawn at random from the words on \(\Sigma\) according to a known probability distribution, the expected length of the encoding, \(\mathbb E(|f(X)|)\), is minimal. Then

\begin{equation} \frac{H(X)}{\log_2(|\Pi|)} \leq \mathbb{E}[|f(X)|] < \frac{H(X)}{\log_2(|\Pi|)} + 1, \end{equation}

where \(H(X) \coloneqq - \sum_x \Pr(X=x)\log_2(\Pr(X=x))\) is the entropy of \(X\). In other words, this theorem states that the best encoding \(f\) will produce codewords whose expected length is essentially given by \(H(X)/\log_2(|\Pi|)\). If \(\Pi\) is taken to be bits, then \(H(X)\) directly measures the expected length of the codewords. This means that the average information content of \(X\) is \(H(X)\) bits, as it is not possible to reduce the length of the codewords any further13Let's beat that bound! Consider messages made out of the 26 letters and 10 numerals, chosen uniformly at random, and consider the Morse encoding. We have \(H(X) = - \log_2(1/36) \approx 5.17\), while we see that all letters are encoded using at most 4 symbols and numerals exactly 5. Hence, we have \(\mathbb{E}(|f(X)|) \leq 4 \times \frac{26}{36} + 5 \times \frac{10}{36} \approx 4.28\). What's wrong?
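For a concrete check of the source coding bound, one can compare \(H(X)\) with the expected length of an optimal prefix-free binary code, e.g. a Huffman code. The sketch below (our own helper, with an arbitrary small distribution) verifies \(H(X) \leq \mathbb E(|f(X)|) < H(X) + 1\):

  import heapq
  import numpy as np

  def shannon_entropy(p):
      p = np.asarray(p, dtype=float)
      p = p[p > 0]
      return -(p * np.log2(p)).sum()

  def huffman_lengths(p):
      """Codeword lengths of a Huffman code for the distribution p (binary code alphabet)."""
      heap = [(prob, [i]) for i, prob in enumerate(p)]
      lengths = [0] * len(p)
      heapq.heapify(heap)
      while len(heap) > 1:
          p1, idx1 = heapq.heappop(heap)
          p2, idx2 = heapq.heappop(heap)
          for i in idx1 + idx2:
              lengths[i] += 1                  # these symbols gain one more code bit
          heapq.heappush(heap, (p1 + p2, idx1 + idx2))
      return lengths

  p = [0.5, 0.25, 0.125, 0.0625, 0.0625]       # an example distribution over 5 words
  lengths = huffman_lengths(p)
  expected_length = sum(pi * li for pi, li in zip(p, lengths))
  print(shannon_entropy(p), expected_length)   # for this dyadic distribution both equal 1.875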

A second important task put forth by Shannon is called channel coding. It corresponds to Alice sending messages to Bob, but this time through a noisy channel. The goal is then to find out how to encode the message, possibly adding some redundancy, so that Bob can retrieve Alice's message with very high probability. The important quantity that is highlighted by this task is the mutual information, which is derived from entropy (see below).

Here, instead of going through the proofs of Shannon's theorems — i.e. source and channel coding — we will take an axiomatic approach to defining entropy, yet I encourage you to read Shannon's paper and see the concept of entropy and mutual information emerge from these operational tasks.

4.2. Axiomatic definition

4.2.1. Entropy of events

We consider \((\Omega, \mathcal E, P)\) a probability space14\(\Omega\) is the sample space — the set of possible outcomes — \(\mathcal E\) is the event space — each event is a set of outcomes — and \(P\) a probability function which assigns a probability to each element in \(\mathcal E\). The entropy of an event \(E\) is a function over \(\mathcal E\):

\begin{align} H:\ & \mathcal E \rightarrow \mathbb R \cup \{\infty\} \nonumber \\ & E \mapsto H(E) \nonumber \end{align}

that satisfies:

  1. independence of representation: \(H(E)\) only depends on \(P(E)\)
  2. continuity (relative to the topology induced by the trace distance)
  3. additivity for two independent events \(E\) and \(E'\): \(H(E \cap E') = H(E) + H(E')\)
  4. normalization such that \(H(E) = 1\) for \(P(E) = \frac 1 2\)

With these constraints, the entropy of \(E\) is unique:

Any function \(H\) satisfying 1-4 above has the form \(H(E) = -\log_2(P(E))\)

Clearly \(-\log_2(P(E))\) satisfies 1-4. Because \(-\log_2\) is bijective from \([0,1]\) to \(\mathbb R^+ \cup \{+\infty\}\), condition 1 allows us to write \(H(E) = f(-\log_2(P(E)))\), where \(f\) is a function from \(\mathbb R^+ \cup \{+\infty\}\) to \(\mathbb R \cup \{+\infty\}\). Condition 2 implies that \(f\) is continuous, and additivity implies that \(f(x) + f(y) = f(x + y)\), by considering two independent events \(E\) and \(E'\) such that \(-\log_2(P(E)) = x\) and \(-\log_2(P(E')) = y\). This in turn implies that \(f\) is linear, and the normalization then gives the claimed result.

4.2.2. Entropies for random variables

The most obvious way to go from events to random variables is to consider the expected entropy for each event defining the random variable, i.e.

Shannon entropy of a random variable \(X\) with alphabet \(\Sigma\) is the expected entropy of the events \(\{X = \sigma\}\) for \(\sigma \in \Sigma\):

\begin{equation} H(X) = - \sum_{\sigma\in \Sigma} \Pr(X=\sigma) \log_2(\Pr(X=\sigma)). \end{equation}

But this is not the only possible way of combining the various probabilities of the events \(\{X=\sigma\}\):

The Min-entropy of a random variable \(X\) with alphabet \(\Sigma\) is

\begin{equation} H_{\min}(X) = \min_\sigma - \log_2(\Pr(X=\sigma)). \end{equation}

It is also possible to define the Max-entropy, that is related to the number of events with non-zero probability15This measures the maximum entropy a random variable with the same support as \(X\) could have.:

The Max-entropy of a random variable \(X\) with alphabet \(\Sigma\) is

\begin{equation} H_{\max}(X) = \log_2(|\{\sigma \in \Sigma \ : \ \Pr(X=\sigma) > 0\}|). \end{equation}
\begin{equation} H_{\min}(X) \leq H(X) \leq H_{\max}(X). \end{equation}

For all entropies (min, max, Shannon), we have

  • Permutation invariance: \(H(X) = H(\pi(X))\) where \(\pi\) permutes the elements of \(X\);
  • Non negativity;
  • Equality to \(0\) is equivalent to having exactly one element with probability 1;
  • Upper bound equal to \(\log_2|\Sigma|\) where \(\Sigma\) is the alphabet.
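A small numerical illustration of the three entropies and of the ordering \(H_{\min}(X) \leq H(X) \leq H_{\max}(X)\) stated above (a minimal sketch; the distribution is an arbitrary example):

  import numpy as np

  p = np.array([0.5, 0.2, 0.2, 0.1, 0.0])      # a distribution over a 5-letter alphabet

  H_shannon = -(p[p > 0] * np.log2(p[p > 0])).sum()
  H_min = -np.log2(p.max())                    # min_sigma -log2 Pr(X = sigma)
  H_max = np.log2((p > 0).sum())               # log2 of the support size

  print(H_min, H_shannon, H_max)               # 1.0 <= 1.76... <= 2.0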

4.2.3. More entropies

Whenever there are correlated random variables \(X\) and \(Y\), one can define a conditional random variable when a specific value \(Y=y\) is known. We write this random variable as \(X|Y=y\) and its probability distribution is \(\Pr(X|Y=y) = \frac{\Pr(X,Y=y)}{\Pr(Y=y)}\). The associated Shannon entropy is \(H(X|Y=y) = -\sum_{x} \Pr(X=x|Y=y) \log_2(\Pr(X=x|Y=y))\). This represents the entropy of \(X\) when \(Y=y\). The conditional Shannon entropy of \(X\) given \(Y\) is defined as this quantity averaged over the possible values of \(y\) and thus represents the expected remaining uncertainty about \(X\) when \(Y\) is known:

The conditional Shannon entropy of \(X\) given \(Y\) is

\begin{equation} H(X|Y) = \sum_y \Pr(Y=y) H(X|Y=y). \end{equation}

The Shannon entropy satisfies the chain rule16This allows to draw Venn diagrams for entropies.:

\begin{equation} H(X|YZ) = H(XY|Z) - H(Y|Z) \end{equation}

Conditional entropies can be defined for the min and max entropy as well. For the min entropy we simply minimize over all possible realizations of \(Y\), while for the max entropy, we maximize over the possible values of \(Y\):

\begin{equation} H_{\min}(X|Y) = \min_y H_{\min}(X|Y=y). \end{equation}
\begin{equation} H_{\max}(X|Y) = \max_y \log_2 |\{x: \Pr(X=x|Y=y) \neq 0\}|. \end{equation}

All three entropies satisfy an important and powerful inequality stating that conditional entropies can only decrease when they are conditioned on an additional random variable:

\begin{equation} H(X|Y) \geq H(X|YZ). \end{equation}

While such a property is expected — the uncertainty about \(X\) can only decrease when we add more conditioning — it is important to mention it here, as its quantum analogue is non-trivial.

4.2.4. Mutual information

Mutual information is a concept that closely follows our everyday experience. When we get information, we get it through something about something else: we receive a message telling us that some event happened. The message is not the event itself, but it is correlated with it, in the sense that had the event been different, the message would have been different too.

Formally, the Shannon mutual information between \(X\) and \(Y\) is the average amount of uncertainty about \(X\) that \(Y\) removes when it is observed:

\begin{equation} I(X:Y) = H(X) - H(X|Y) = H(X) + H(Y) - H(XY). \end{equation}

The mutual information is non-negative.17Use the strong subadditivity.
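These definitions are straightforward to evaluate from a joint probability table. The sketch below (the table and helper names are ours) computes \(H(X|Y)\) and \(I(X:Y)\), and checks the identity \(H(X|Y) = H(XY) - H(Y)\) (the chain rule with a trivial \(Z\)) together with the non-negativity of the mutual information:

  import numpy as np

  def H(p):
      p = np.asarray(p, dtype=float).ravel()
      p = p[p > 0]
      return -(p * np.log2(p)).sum()

  # Joint distribution Pr(X=x, Y=y): rows index x, columns index y.
  P_XY = np.array([[0.3, 0.1],
                   [0.1, 0.5]])

  P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)

  H_X, H_Y, H_XY = H(P_X), H(P_Y), H(P_XY)
  H_X_given_Y = sum(py * H(P_XY[:, y] / py) for y, py in enumerate(P_Y))   # sum_y Pr(Y=y) H(X|Y=y)

  print(np.isclose(H_X_given_Y, H_XY - H_Y))   # H(X|Y) = H(XY) - H(Y)
  print(H_X - H_X_given_Y)                     # I(X:Y) >= 0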

5. Quantum entropies

5.1. Generalizing classical entropies

Surely, it is possible to encode classical information — i.e. classical probability distributions — inside quantum systems. This holds for single systems as well as for correlated ones. Because we could also perform all classical processing within quantum systems, it is ultimately desirable that the definition of quantum entropy as well as conditional entropy, mutual information and relative entropy match those of random variables whenever the quantum states correspond to random variables.

This leads to the following definitions:

The von Neumann entropy for a system \(A\) in state \(\rho\) is18Show that it matches the classical definition for classical states.:

\begin{equation} H(A)_\rho = - \tr(\rho \log_2 \rho). \end{equation}

The min entropy is defined using the \(\infty\)-norm, which equals the largest eigenvalue of \(\rho\), by:

\begin{equation} H_{\min}(A)_\rho = -\log_2 \|\rho\|_{\infty} \end{equation}

The max entropy is defined by

\begin{equation} H_{\max}(A)_\rho = \log_2 \dim(\mathrm{supp}(\rho)), \end{equation}

where \(\mathrm{supp}(\rho)\), the support of \(\rho\), is the orthogonal complement of the \(0\)-eigenvalue subspace of \(\rho\), so that \(\dim(\mathrm{supp}(\rho)) = \mathrm{rank}(\rho)\).

Entropies satisfy useful properties:

  • Unitary invariance: \(H(A)_\rho = H(A)_{U\rho U^\dagger}\).
  • Positivity: \(H(A)_\rho \geq 0\).
  • Additivity for uncorrelated states: \(H(AB)_{\rho_A\otimes\rho_B} = H(A)_{\rho_A} + H(B)_{\rho_B}\).
  • For \(\rho_{AB}\) a quantum-classical state \(\rho_{AB} = \sum_b \Pr(B=b)\, \rho_{A,b} \otimes \ketbra b\), the von Neumann entropy satisfies \(H(AB)_{\rho_{AB}} = H(\{\Pr(B=b)\}_b) + \sum_b \Pr(B=b) H(A)_{\rho_{A,b}}\), where the first term is the Shannon entropy of the distribution of \(B\).

and specifically for the von Neumann entropy

\(H(A)_\rho \leq \log_2 d\), where \(d\) is the dimension of the Hilbert space of \(A\).
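For a concrete computation, all three quantum entropies are simple functions of the spectrum of \(\rho\). A minimal sketch (the single-qubit state is an arbitrary example):

  import numpy as np

  def quantum_entropies(rho):
      ev = np.linalg.eigvalsh(rho)
      ev_pos = ev[ev > 1e-12]
      H_vn = -(ev_pos * np.log2(ev_pos)).sum()     # von Neumann entropy
      H_min = -np.log2(ev.max())                   # -log2 ||rho||_infty
      H_max = np.log2(len(ev_pos))                 # log2 of the dimension of supp(rho)
      return H_min, H_vn, H_max

  rho = np.array([[0.75, 0.0],
                  [0.0, 0.25]])
  print(quantum_entropies(rho))    # (0.415..., 0.811..., 1.0): again H_min <= H <= H_max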

5.2. Conditional entropy

There could a priori be several ways of defining conditional von Neumann entropy for a density matrix \(\rho_{AB}\) with respect to the \(A\), \(B\) cut. The most tempting would be to define it through a measurement on \(B\) that gives a different reduced density matrix on \(A\) for each of the possible outcomes — this would be reminiscent of the computation for quantum-classical states above. The problem is that such conditional entropy would involve some optimization over the possible measurements and would not necessarily satisfy some of the other properties expected from conditional entropy. In fact the right way to define the conditional entropy is through the classical identity \(H(X|Y) = H(X,Y) - H(Y)\).

The quantum conditional entropy for a state \(\rho_{AB}\) for subsystems \(A\), \(B\) is

\begin{equation} H(A|B)_{\rho_{AB}} = H(AB)_{\rho_{AB}} - H(B)_{\rho_{AB}}, \end{equation}

where \(H(B)_{\rho_{AB}} = H(B)_{\rho_B}\) for \(\rho_B = \tr_A \rho_{AB}\).

Reversing the definition of \(H(A|B)\) we obtain the chain rule for quantum entropies:

\begin{equation} H(AB)_\rho = H(A|B)_\rho + H(B)_\rho. \end{equation}

Whereas the conditional entropy of random variables is non-negative, this is not necessarily the case for the quantum conditional entropy. E.g. take a pure bi-partite state \(\rho\) over \(A\) and \(B\); then \(H(AB)_\rho = 0\) and \(H(A)_\rho = H(B)_\rho \geq 0\), which implies \(H(A|B)_\rho = H(B|A)_\rho \leq 0\).
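The Bell state makes this explicit. A minimal sketch (the partial-trace helper is ours) computing \(H(A|B)\) for \(\frac{1}{\sqrt 2}(\ket{00}+\ket{11})\):

  import numpy as np

  def von_neumann(rho):
      ev = np.linalg.eigvalsh(rho)
      ev = ev[ev > 1e-12]
      return -(ev * np.log2(ev)).sum()

  def partial_trace_A(rho_AB, dA, dB):
      # Trace out subsystem A of a (dA*dB) x (dA*dB) density matrix.
      return rho_AB.reshape(dA, dB, dA, dB).trace(axis1=0, axis2=2)

  phi = np.array([1, 0, 0, 1]) / np.sqrt(2)        # (|00> + |11>)/sqrt(2)
  rho_AB = np.outer(phi, phi.conj())
  rho_B = partial_trace_A(rho_AB, 2, 2)

  H_AB = von_neumann(rho_AB)                       # 0 (pure state)
  H_B = von_neumann(rho_B)                         # 1 (maximally mixed qubit)
  print(H_AB - H_B)                                # H(A|B) = -1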

For a quantum-classical state \(\rho_{AB} = \sum_b p(b) \rho_{A,b}\otimes \ketbra b\) over \(A\) and \(B\), we have

\begin{equation} H(A|B)_{\rho_{AB}} = \sum_b p(b) H(A)_{\rho_{A,b}}, \end{equation}

and consequently, \(H(A|B) \geq 0\).

For \(\rho_{ABC}\) a state for the 3 subsystems \(A\), \(B\) and \(C\), then

\begin{equation} H(A|B)_{\rho_{ABC}} \geq H(A|BC)_{\rho_{ABC}}. \end{equation}

This inequality, known as strong subadditivity, is fundamental in quantum theory as many other properties derive from it. We will see one consequence in the next paragraph.

5.3. Mutual information

Mutual information, once conditional entropy is defined, follows from the same definition as in the classical case:

For a state \(\rho\) over subsystems \(A\) and \(B\), the quantum mutual information is defined as:

\begin{equation} I(A:B)_\rho = H(A)_\rho - H(A|B)_\rho = H(B)_\rho - H(B|A)_\rho = H(A)_\rho + H(B)_\rho - H(AB)_\rho. \end{equation}
\begin{equation} I(A:B)_\rho \geq 0. \end{equation}

Apply strong subadditivity to \(H(X|Y)_{\rho_{XYZ}}\) with \(\rho_{XYZ} = \rho_{XZ} \otimes \ketbra 0_Y\): since \(Y\) is in a pure product state, \(H(X|Y)_{\rho_{XYZ}} = H(X)_{\rho_{XZ}}\) and \(H(X|YZ)_{\rho_{XYZ}} = H(X|Z)_{\rho_{XZ}}\), so that \(H(X) \geq H(X|Z)\). Identifying \(X\) with \(A\) and \(Z\) with \(B\) gives \(I(A:B)_\rho \geq 0\). Note that the general case (not setting \(Y\) up in the state \(\ketbra 0\)) gives that the conditional mutual information is non-negative.

Another interesting property of quantum mutual information states that quantum post-processing of subsystem \(B\) does not increase the quantum mutual information with \(A\). More precisely:

Let \(A\), \(B\), \(C\) be three subsystems, \(\rho_{AB}\) a state over \(AB\), and \(\mathcal E\) a CPTP map from \(B\) to \(C\). Then:

\begin{equation} I(A:B)_{\rho_{AB}} \geq I(A:C)_{(\one_A\otimes \mathcal E)(\rho_{AB})}. \end{equation}

We prove that \(H(A|B)_{\rho_{AB}} \leq H(A|C)_{(\one_A\otimes \mathcal E)(\rho_{AB})}\). This is a direct application of the fact that \(\mathcal E\) can be implemented by an isometry from \(B\) to \(CD\) for some auxiliary quantum system \(D\), followed by discarding \(D\): isometries leave the conditional entropy unchanged, so \(H(A|B)_{\rho_{AB}} = H(A|CD)\), and strong subadditivity gives \(H(A|CD) \leq H(A|C)\).
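As an illustration of this data processing inequality, here is a minimal sketch (the depolarizing channel and the helpers are our own choice, not from the lecture): sending the \(B\) half of a Bell pair through a depolarizing channel can only decrease \(I(A:B)\).

  import numpy as np

  def von_neumann(rho):
      ev = np.linalg.eigvalsh(rho)
      ev = ev[ev > 1e-12]
      return -(ev * np.log2(ev)).sum()

  def partial_trace(rho, d, keep):
      r = rho.reshape(d, d, d, d)        # indices (a, b, a', b') for two subsystems of dimension d
      return r.trace(axis1=1, axis2=3) if keep == 0 else r.trace(axis1=0, axis2=2)

  def mutual_information(rho_AB, d):
      H_A = von_neumann(partial_trace(rho_AB, d, 0))
      H_B = von_neumann(partial_trace(rho_AB, d, 1))
      return H_A + H_B - von_neumann(rho_AB)

  phi = np.array([1, 0, 0, 1]) / np.sqrt(2)
  rho_AB = np.outer(phi, phi.conj())

  # Depolarizing channel on B with probability p: sigma -> (1-p) sigma + p I/2, applied as id_A tensor E.
  p = 0.3
  rho_A = partial_trace(rho_AB, 2, 0)
  rho_out = (1 - p) * rho_AB + p * np.kron(rho_A, np.eye(2) / 2)

  print(mutual_information(rho_AB, 2), mutual_information(rho_out, 2))   # 2.0 >= the post-processed value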
