# Importance Sampling

14 September 2017

Consider a probability distribution on $$\mathcal X$$ with the density $$p$$. (We will interchangeably denote the probability distribution and its density by the same letter). Given a function $$f: \mathcal X \to \mathbb R$$, our goal is to estimate \begin{align} I := \E[f(X)] := \int_{\mathcal X} f(x) p(x) \,\mathrm dx. \end{align}

## Monte Carlo Estimator

We can estimate this using $$N$$ independent and identically distributed (iid) samples from $$p$$ as follows: \begin{align} X_n &\sim p \\
I_N^{\text{MC}} &= \frac{1}{N} \sum_{n = 1}^N f(X_n). \end{align}
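As a minimal sketch of this estimator in Python, assuming for illustration that $$p$$ is a standard normal and $$f(x) = x^2$$, so that $$I = \E[X^2] = 1$$:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sample_p, N):
    """Plain Monte Carlo estimate of E[f(X)] with X_n ~ p iid."""
    x = sample_p(N)              # N iid draws from p
    return np.mean(f(x))

# Example: p = Normal(0, 1) and f(x) = x^2, so I = E[X^2] = 1.
I_mc = mc_estimate(lambda x: x**2, rng.standard_normal, 10**5)
```

With $$N = 10^5$$ samples, `I_mc` should land close to the true value $$1$$.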

The expectation of $$I_N^{\text{MC}}$$ is equal to $$I$$: \begin{align} \E\left[I_N^{\text{MC}}\right] &= \E\left[\frac{1}{N} \sum_{n = 1}^N f(X_n)\right] \\
&= \frac{1}{N} \sum_{n = 1}^N \E\left[f(X_n)\right] \\
&= \frac{1}{N} \sum_{n = 1}^N I \\
&= I. \end{align}

The variance of $$I_N^{\text{MC}}$$ is: \begin{align} \Var\left[I_N^{\text{MC}}\right] &= \E\left[{I_N^{\text{MC}}}^2\right] - \E\left[I_N^{\text{MC}}\right]^2 \\
&= \E\left[\left(\frac{1}{N} \sum_{n = 1}^N f(X_n)\right)^2\right] - I^2 \\
&= \frac{1}{N^2} \E\left[\sum_{n = 1}^N f(X_n)^2 + \sum_{n, \ell, n \neq \ell} f(X_n) f(X_\ell)\right] - I^2 \\
&= \frac{1}{N^2} \left(\sum_{n = 1}^N \E\left[f(X_n)^2\right] + \sum_{n, \ell, n \neq \ell} \E\left[f(X_n) f(X_\ell)\right]\right) - I^2 \\
&= \frac{1}{N^2} \left(N \E\left[f(X)^2\right] + (N^2 - N) \E\left[f(X)\right]^2 \right) - I^2 \\
&= \frac{\E\left[f(X)^2\right] - \E\left[f(X)\right]^2}{N} \\
&= \frac{\Var[f(X)]}{N}. \end{align}
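The $$1/N$$ decay of the variance is easy to check empirically. A small sketch, assuming a standard-normal $$p$$ and $$f(x) = x^2$$, for which $$\Var[f(X)] = \Var[X^2] = 2$$:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_estimator_var(N, reps=2000):
    """Variance of the MC estimator across many independent repetitions."""
    x = rng.standard_normal((reps, N))
    estimates = (x**2).mean(axis=1)   # one MC estimate per row
    return estimates.var()

# Var[f(X)] = Var[X^2] = 2, so the estimator variance should be about 2 / N:
v_100 = empirical_estimator_var(100)   # expect about 0.02
v_400 = empirical_estimator_var(400)   # expect about 0.005
```

Quadrupling $$N$$ cuts the empirical variance by about a factor of four, as the formula predicts.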

## Importance Sampling

Instead of drawing samples from $$p$$, assume that we can

• draw samples from a proposal distribution on $$\mathcal X$$ with density $$q$$,
• evaluate density $$q$$, and
• evaluate density $$p$$ (we will consider the case when we can only evaluate $$p$$ up to a normalization constant later).

The importance sampling estimator is as follows: \begin{align} X_n &\sim q \\
W_n &= \frac{p(X_n)}{q(X_n)} \\
I_N^{\text{IS}} &= \frac{1}{N} \sum_{n = 1}^N W_n f(X_n). \label{eq:is_est} \end{align}
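A minimal sketch of this estimator, assuming for illustration a standard-normal target $$p$$, a wider Gaussian proposal $$q$$, and $$f(x) = x^2$$ (so $$I = 1$$):

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, std=1.0):
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2 * np.pi))

def is_estimate(f, p_pdf, q_pdf, sample_q, N):
    """Importance sampling estimate of E_p[f(X)] with proposal q."""
    x = sample_q(N)
    w = p_pdf(x) / q_pdf(x)      # importance weights W_n
    return np.mean(w * f(x))

# Target p = Normal(0, 1), proposal q = Normal(0, 2^2), f(x) = x^2, so I = 1.
I_is = is_estimate(
    f=lambda x: x**2,
    p_pdf=normal_pdf,
    q_pdf=lambda x: normal_pdf(x, std=2.0),
    sample_q=lambda n: 2.0 * rng.standard_normal(n),
    N=10**5,
)
```

Even though all samples come from $$q$$, the weights $$W_n$$ correct the estimate toward the expectation under $$p$$.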

The expectation of $$I_N^{\text{IS}}$$ is equal to $$I$$: \begin{align} \E\left[I_N^{\text{IS}}\right] &= \E\left[\frac{1}{N} \sum_{n = 1}^N W_n f(X_n)\right] \\
&= \frac{1}{N} \sum_{n = 1}^N \E\left[W_n f(X_n)\right] \\
&= \frac{1}{N} \sum_{n = 1}^N \E\left[\frac{p(X_n)}{q(X_n)} f(X_n)\right] \\
&= \frac{1}{N} \sum_{n = 1}^N \int_{\mathcal X} \frac{p(x_n)}{q(x_n)} f(x_n) q(x_n) \,\mathrm dx_n \\
&= \frac{1}{N} \sum_{n = 1}^N \int_{\mathcal X} p(x) f(x) \,\mathrm dx \\
&= I. \end{align}

The variance of $$I_N^{\text{IS}}$$ is: \begin{align} \Var\left[I_N^{\text{IS}}\right] &= \E\left[{I_N^{\text{IS}}}^2\right] - I^2 \\
&= \E\left[\left(\frac{1}{N} \sum_{n = 1}^N W_n f(X_n)\right)^2\right] - I^2 \\
&= \frac{1}{N^2} \E\left[\sum_{n = 1}^N \left(W_n f(X_n)\right)^2 + \sum_{n, \ell, n \neq \ell} \left(W_n f(X_n)\right)\left(W_\ell f(X_\ell)\right)\right] - I^2 \\
&= \frac{1}{N^2} \sum_{n = 1}^N \E\left[\left(W_n f(X_n)\right)^2\right] + \frac{1}{N^2} \sum_{n, \ell, n \neq \ell} \E\left[\left(W_n f(X_n)\right)\left(W_\ell f(X_\ell)\right)\right] - I^2 \\
&= \frac{1}{N} \left(\int_{\mathcal X} \left(\frac{p(x)}{q(x)} f(x)\right)^2 q(x) \,\mathrm dx - I^2 \right) \\
&= \frac{1}{N} \left(\int_{\mathcal X} \frac{p(x)^2 f(x)^2}{q(x)} \,\mathrm dx - I^2 \right). \end{align}

The “optimal” proposal $$q^{\text{opt}}$$ has the density \begin{align} q^{\text{opt}}(x) &= \frac{|f(x)| p(x)}{\int_{\mathcal X} |f(x')| p(x') \,\mathrm dx'}. \end{align} This proposal is optimal in the sense that the estimator’s variance is the lowest possible. The variance of an estimator using the optimal proposal, $$\Var\left[I_N^{\text{IS, opt}}\right]$$, is not greater than the variance of an estimator using an arbitrary proposal $$q$$, $$\Var\left[I_N^{\text{IS}}\right]$$: \begin{align} \Var\left[I_N^{\text{IS, opt}}\right] &= \frac{1}{N} \left(\int_{\mathcal X} \frac{p(x)^2 f(x)^2}{\frac{|f(x)| p(x)}{\int_{\mathcal X} |f(x')| p(x') \,\mathrm dx'}} \,\mathrm dx - I^2 \right) \\
&= \frac{1}{N} \left(\int_{\mathcal X} |f(x)| p(x) \,\mathrm dx \cdot \int_{\mathcal X} \frac{p(x)^2 f(x)^2}{|f(x)| p(x)} \,\mathrm dx - I^2 \right) \\
&= \frac{1}{N} \left(\int_{\mathcal X} |f(x)| p(x) \,\mathrm dx \cdot \int_{\mathcal X} |f(x)| p(x) \,\mathrm dx - I^2 \right) \\
&= \frac{1}{N} \left(\left(\int_{\mathcal X} \frac{|f(x)| p(x)}{q(x)} \cdot 1 \cdot q(x) \,\mathrm dx\right)^2 - I^2 \right) \\
&\leq \frac{1}{N} \left(\left(\int_{\mathcal X} \left(\frac{|f(x)| p(x)}{q(x)}\right)^2 \cdot q(x) \,\mathrm dx\right) \cdot \left(\int_{\mathcal X} 1^2 \cdot q(x) \,\mathrm dx\right) - I^2 \right) & \text{(Cauchy-Schwarz ineq.)}\\
&= \frac{1}{N} \left(\int_{\mathcal X} \frac{f(x)^2 p(x)^2}{q(x)} \,\mathrm dx - I^2 \right) \\
&= \Var\left[I_N^{\text{IS}}\right]. \end{align}
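The effect is visible numerically. In a toy setting with $$p = \mathrm{Normal}(0, 1)$$ and $$f(x) = x^2$$, the optimal proposal is proportional to $$x^2 e^{-x^2/2}$$, which is more spread out than $$p$$; a Gaussian proposal widened toward it (an illustrative choice, not the true optimum) already cuts the per-sample variance well below $$\Var[f(X)] = 2$$:

```python
import numpy as np

rng = np.random.default_rng(3)

def normal_pdf(x, std=1.0):
    return np.exp(-0.5 * (x / std) ** 2) / (std * np.sqrt(2 * np.pi))

def per_sample_var(q_std, N=10**5):
    """Empirical variance of one term W_n f(X_n) under q = Normal(0, q_std^2)."""
    x = q_std * rng.standard_normal(N)
    terms = (normal_pdf(x) / normal_pdf(x, q_std)) * x**2
    return terms.var()

v_p = per_sample_var(1.0)              # q = p recovers plain MC: Var about 2
v_wide = per_sample_var(np.sqrt(2.0))  # wider q, closer to |f| p: Var about 0.54
```

Choosing $$q = p$$ recovers the plain Monte Carlo variance, while a proposal shaped more like $$|f| p$$ reduces it several-fold.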

## Self-Normalized Importance Sampling

Now, assume that we can

• draw samples from a proposal distribution on $$\mathcal X$$ with density $$q$$,
• evaluate density $$q$$, and
• evaluate the density $$p$$ only up to a normalization constant, i.e. we can evaluate $$\tilde p(x) = Zp(x)$$ where $$Z$$ is the normalization constant.

The self-normalized importance sampling estimator is as follows: \begin{align} X_n &\sim q \\
\tilde W_n &= \frac{\tilde p(X_n)}{q(X_n)} \\
\tilde I_N^{\text{IS}} &= \frac{\frac{1}{N}\sum_{n = 1}^N \tilde W_n f(X_n)}{\frac{1}{N}\sum_{n = 1}^N \tilde W_n}. \label{eq:self_norm_est} \end{align}
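A sketch of the self-normalized estimator, assuming for illustration the unnormalized Gaussian target $$\tilde p(x) = e^{-x^2/2}$$ (so $$Z = \sqrt{2\pi}$$), a wider Gaussian proposal, and $$f(x) = x^2$$ with true value $$I = 1$$:

```python
import numpy as np

rng = np.random.default_rng(4)

def snis_estimate(f, p_tilde, q_pdf, sample_q, N):
    """Self-normalized IS estimate of E_p[f(X)] from an unnormalized target."""
    x = sample_q(N)
    w = p_tilde(x) / q_pdf(x)    # unnormalized weights; Z cancels in the ratio
    return np.sum(w * f(x)) / np.sum(w)

# Unnormalized target p_tilde(x) = exp(-x^2 / 2), proposal q = Normal(0, 2^2).
I_snis = snis_estimate(
    f=lambda x: x**2,
    p_tilde=lambda x: np.exp(-0.5 * x**2),
    q_pdf=lambda x: np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi)),
    sample_q=lambda n: 2.0 * rng.standard_normal(n),
    N=10**5,
)
```

Note that the normalization constant never has to be computed: it appears in every weight and cancels between the numerator and the denominator.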

### Consistency of the Estimator

Theorem. $$\tilde I_N^{\text{IS}}$$ converges to $$I$$ almost surely, i.e. \begin{align} P\left(\left\{\omega: \lim_{N \to \infty} \tilde I_N^{\text{IS}}(\omega) = I\right\}\right) = 1. \end{align}

Proof. Denote the numerator of \eqref{eq:self_norm_est} by $$\alpha_N$$ and the denominator by $$\beta_N$$. By the strong law of large numbers, $$\alpha_N$$ converges to $$ZI$$ almost surely and $$\beta_N$$ converges to $$Z$$ almost surely. Now, let \begin{align} A = \left\{\omega: \lim_{N \to \infty} \alpha_N(\omega) = ZI\right\} \\
B = \left\{\omega: \lim_{N \to \infty} \beta_N(\omega) = Z\right\} \\
C = \left\{\omega: \lim_{N \to \infty} \tilde I_N^{\text{IS}}(\omega) = I\right\}. \end{align} We have $$P(A) = P(B) = 1$$ and hence $$P(A \cap B) = P(A) + P(B) - P(A \cup B) = 2 - P(A \cup B) = 1$$, since $$P(A \cup B) = 1$$. We also have $$A \cap B \subseteq C$$: for any $$\omega \in A \cap B$$, $$\lim_{N \to \infty} \alpha_N(\omega) = ZI$$ and $$\lim_{N \to \infty} \beta_N(\omega) = Z$$, and by the algebra of limits, $$\lim_{N \to \infty} \tilde I_N^{\text{IS}}(\omega) = \lim_{N \to \infty} \frac{\alpha_N(\omega)}{\beta_N(\omega)} = \frac{ZI}{Z} = I$$. Hence $$P(C) \geq P(A \cap B) = 1$$. Since $$P(C) \leq 1$$, $$P(C) = 1$$. $$\square$$

### Asymptotic Variance

It is more difficult to compute the bias and variance of the self-normalized importance sampling estimator \eqref{eq:self_norm_est} because it cannot be decomposed into a sum of independent terms as in \eqref{eq:is_est}. We instead resort to the central limit theorem and the delta method, which state the following.

Central limit theorem. Given iid random vectors $$v_n$$ ($$n = 1, 2, \dotsc$$) on $$\mathbb R^D$$ with a common mean $$\mu \in \mathbb R^D$$ and covariance $$\Sigma \in \mathbb R^{D \times D}$$, \begin{align} \sqrt{N} \left(\frac{1}{N} \sum_{n = 1}^N v_n - \mu\right) \xrightarrow{d} \mathrm{Normal}(0, \Sigma), \end{align} where $$\xrightarrow{d}$$ denotes convergence in distribution and $$\mathrm{Normal}(0, \Sigma)$$ denotes a multivariate normal distribution with mean $$0$$ and covariance $$\Sigma$$. $$\triangle$$

Delta method. Given $$\mu \in \mathbb R^D$$ and an estimator $$\hat \mu_N$$ with the convergence \begin{align} \sqrt{N} \left(\hat \mu_N - \mu\right) \xrightarrow{d} \mathrm{Normal}(0, \Sigma), \end{align} then, for a function $$g: \mathbb R^D \to \mathbb R$$ that is continuously differentiable at $$\mu$$, with gradient $$\nabla g: \mathbb R^D \to \mathbb R^D$$, we have \begin{align} \sqrt{N} \left(g(\hat \mu_N) - g(\mu)\right) \xrightarrow{d} \mathrm{Normal}\left(0, \nabla g(\mu)^T \Sigma \nabla g(\mu)\right). \end{align} $$\triangle$$

We can apply these theorems directly to self-normalized importance sampling, where \begin{align} v_n &= \begin{bmatrix} \tilde W_n f(X_n) \\
\tilde W_n \end{bmatrix}, \\
g\left(\begin{bmatrix} a \\
b \end{bmatrix}\right) &= \frac{a}{b}, \end{align} so that \begin{align} \mu &= \begin{bmatrix} ZI \\
Z \end{bmatrix}, \\
\nabla g\left(\begin{bmatrix} a \\
b \end{bmatrix}\right) &= \begin{bmatrix} 1 / b \\
-a / b^2 \end{bmatrix}, \\
\Sigma &= \begin{bmatrix} \Var(\tilde W_n f(X_n)) & \Cov(\tilde W_n, \tilde W_n f(X_n)) \\
\Cov(\tilde W_n, \tilde W_n f(X_n)) & \Var(\tilde W_n) \end{bmatrix}. \end{align} This results in the following convergence for the self-normalized importance sampling estimator: \begin{align} \sqrt{N} (\tilde I_N^{\text{IS}} - I) \xrightarrow{d} \mathrm{Normal}\left(0, \tilde \sigma_{\text{asymptotic}}^2\right), \end{align} where the asymptotic variance is \begin{align} \tilde \sigma_{\text{asymptotic}}^2 = \nabla g(\mu)^T \Sigma \nabla g(\mu) = \int_{\mathcal X} \frac{p(x)^2 (f(x) - I)^2}{q(x)} \,\mathrm dx. \end{align}

### Optimal Proposal

A proposal that minimizes the asymptotic variance $$\tilde \sigma_{\text{asymptotic}}^2$$ has the following form: \begin{align} \tilde q^{\text{opt}}(x) &= \frac{p(x)|f(x) - I|}{\int_{\mathcal X} p(x') |f(x') - I| \,\mathrm dx'}. \end{align}

To see this, we can show that the asymptotic variance associated with $$\tilde q^{\text{opt}}$$ is not greater than the asymptotic variance associated with any other $$q$$: \begin{align} \int_{\mathcal X} \frac{p(x)^2 (f(x) - I)^2}{\tilde q^{\text{opt}}(x)} \,\mathrm dx &= \int_{\mathcal X} \frac{p(x)^2 (f(x) - I)^2}{\frac{p(x)|f(x) - I|}{\int_{\mathcal X} p(x') |f(x') - I| \,\mathrm dx'}} \,\mathrm dx \\
&= \int_{\mathcal X} p(x) |f(x) - I| \,\mathrm dx \cdot \int_{\mathcal X} p(x) |f(x) - I| \,\mathrm dx \\
&= \E_q\left[\frac{p}{q} |f - I|\right]^2 \\
&\leq \E_q\left[\frac{p^2 |f - I|^2}{q^2}\right] && \text{(Jensen's ineq.)} \\
&= \int_{\mathcal X} \frac{p(x)^2 |f(x) - I|^2}{q(x)} \,\mathrm dx \\
&= \int_{\mathcal X} \frac{p(x)^2 (f(x) - I)^2}{q(x)} \,\mathrm dx. \end{align}
