Statistics

A quick reference on basic concepts and properties in statistics.

Based on notes taken during a course [1].

Expectation

For a discrete random variable,

\[ \mathrm{E}[X] = \sum_x x P(X = x) \]

For a continuous random variable with probability density function \(f\),

\[ \mathrm{E}[X] = \int x f(x) \, dx \]

Expectation is a linear operator:

\[ \mathrm{E}[aX + bY] = a \mathrm{E}[X] + b \mathrm{E}[Y] \]

If \(X\) and \(Y\) are independent:

\[ \mathrm{E}[XY] = \mathrm{E}[X] \mathrm{E}[Y] \]
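
For example, for a single roll of a fair six-sided die,

\[ \mathrm{E}[X] = \sum_{x=1}^{6} x \cdot \frac{1}{6} = \frac{21}{6} = 3.5 \]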

Variance

\[ \mathrm{Var}[X] = \sigma^2 = \mathrm{E}[(X - \mathrm{E}[X])^2] \]

Standard deviation \(\sigma\) is defined as the square root of the variance \(\sigma^2\).

Other properties:

\[ \mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2 \]

\[ \mathrm{Var}[aX] = a^2 \mathrm{Var}[X] \]
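
For example, for a Bernoulli random variable with success probability \(p\), \(X^2 = X\), so \(\mathrm{E}[X^2] = \mathrm{E}[X] = p\) and

\[ \mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2 = p - p^2 = p(1-p) \]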

Covariance

\[ \mathrm{Cov}(X,Y) = \sigma_{X,Y} = \mathrm{E}[(X-\mathrm{E}[X])(Y-\mathrm{E}[Y])] \]

\[ \mathrm{Cov}(X,Y) = \mathrm{E}[XY] - \mathrm{E}[X] \mathrm{E}[Y] \]

\[ \mathrm{Cov}(X,X) = \mathrm{Var}[X] \]

\[ \mathrm{Var}[X+Y] = \mathrm{Var}[X] + 2 \mathrm{Cov}(X,Y) + \mathrm{Var}[Y] \]
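
In particular, if \(X\) and \(Y\) are independent, then \(\mathrm{Cov}(X,Y) = \mathrm{E}[XY] - \mathrm{E}[X]\mathrm{E}[Y] = 0\) and

\[ \mathrm{Var}[X+Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] \]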

Correlation

\[ \mathrm{Corr}(X,Y) = \rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}[X] \mathrm{Var}[Y]}} \]

If \(X\) and \(Y\) are independent, then they are uncorrelated:

\[ \mathrm{Corr}(X,Y) = 0 \]
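
The converse does not hold. For example, if \(X \sim N(0,1)\) and \(Y = X^2\), then \(\mathrm{Cov}(X,Y) = \mathrm{E}[X^3] - \mathrm{E}[X]\mathrm{E}[X^2] = 0\), so \(X\) and \(Y\) are uncorrelated even though \(Y\) is completely determined by \(X\).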

Transformation

If \(f_X(x)\) is the probability density function of \(X\) and \(Y=g(X)\), where \(g\) is invertible and differentiable, then

\[ f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right| \]
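
For example, if \(Y = g(X) = e^X\), then \(g^{-1}(y) = \ln y\) and \(\frac{d}{dy} g^{-1}(y) = \frac{1}{y}\), so for \(y > 0\)

\[ f_Y(y) = f_X(\ln y) \cdot \frac{1}{y} \]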

Estimators

An estimator is a random variable, computed from a random sample, that estimates the true value of a parameter.

For example, the mean of a random sample \(\hat{\theta} = \overline{X}\) is an estimator for the true mean \(\mathrm{E}[X]\).

An estimator \(\hat{\theta}\) estimating \(\theta\) is unbiased if \(\mathrm{E}[\hat{\theta}] = \theta\).

For a random sample \(X_1, \ldots, X_n\),

\[ \mathrm{E}[\overline{X}] = \mathrm{E}[X] \]

\[ \mathrm{Var}[\overline{X}] = \frac{\mathrm{Var}[X]}{n} \]
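
The second identity follows from the variance properties above, since the \(X_i\) are independent and identically distributed:

\[ \mathrm{Var}[\overline{X}] = \mathrm{Var}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}[X_i] = \frac{\mathrm{Var}[X]}{n} \]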

The bias of \(\hat{\theta}\) is

\[ \mathrm{B}(\hat{\theta}) = \mathrm{E}[\hat{\theta}] - \theta \]

The mean squared error is

\[ \mathrm{MSE}(\hat{\theta}) = \mathrm{E}[(\hat{\theta} - \theta)^2] = \mathrm{Var}[\hat{\theta}] + \mathrm{B}(\hat{\theta})^2 \]

For two unbiased estimators \(\hat{\theta}_1\) and \(\hat{\theta}_2\), \(\hat{\theta}_1\) is more efficient than \(\hat{\theta}_2\) if

\[ \mathrm{Var}[\hat{\theta}_1] < \mathrm{Var}[\hat{\theta}_2] \]
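
For example, for a random sample \(X_1, \ldots, X_n\) with \(n > 1\) and \(\mathrm{Var}[X] > 0\), both \(X_1\) and \(\overline{X}\) are unbiased estimators of \(\mathrm{E}[X]\), but \(\overline{X}\) is more efficient because \(\mathrm{Var}[\overline{X}] = \mathrm{Var}[X]/n < \mathrm{Var}[X_1]\).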

Method of moments estimation

Assuming a particular distribution with unknown parameters, set the sample moments \(\overline{X}, \overline{X^2}, \ldots\) equal to the corresponding true moments \(\mathrm{E}[X], \mathrm{E}[X^2], \ldots\), and solve for the distribution parameters.
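
For example, for an exponential distribution with rate \(\lambda\), the first true moment is \(\mathrm{E}[X] = 1/\lambda\). Setting \(\overline{X} = 1/\lambda\) and solving gives the method of moments estimator

\[ \hat{\lambda} = \frac{1}{\overline{X}} \]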

Maximum likelihood estimation

Assuming a particular distribution with unknown parameters, the maximum likelihood estimator is the set of parameter values that maximizes the likelihood of the observed sample.1 The likelihood is proportional to the joint probability mass function or density function of the sample, viewed as a function of the parameters.
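
For example, for a random sample \(x_1, \ldots, x_n\) from an exponential distribution with rate \(\lambda\), the likelihood and log-likelihood are

\[ L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_i x_i}, \qquad \ln L(\lambda) = n \ln \lambda - \lambda \sum_{i=1}^n x_i \]

Setting \(\frac{d}{d\lambda} \ln L(\lambda) = \frac{n}{\lambda} - \sum_i x_i = 0\) gives \(\hat{\lambda} = 1/\overline{x}\), which in this case coincides with the method of moments estimator.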

Under the conditions in which the Cramér-Rao lower bound holds, a maximum likelihood estimator is consistent, asymptotically unbiased, asymptotically efficient, and asymptotically normal:

\[ \hat{\theta}_n \sim N(\theta, \mathrm{CRLB}_\theta) \]

Invariance property

If \(\tau\) is an invertible function, the maximum likelihood estimator for \(\tau(\theta)\) is \(\tau(\hat{\theta})\), where \(\hat{\theta}\) is the maximum likelihood estimator for \(\theta\).
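
For example, if \(\hat{\lambda}\) is the maximum likelihood estimator for the rate \(\lambda\) of an exponential distribution, then the maximum likelihood estimator for the mean \(\tau(\lambda) = 1/\lambda\) is \(1/\hat{\lambda}\).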

Consistent estimator

An estimator \(\hat{\theta}_n\) is a consistent estimator for \(\theta\) if

\[ \hat{\theta}_n \xrightarrow{P} \theta \]
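
For example, by the weak law of large numbers, \(\overline{X} \xrightarrow{P} \mathrm{E}[X]\) provided \(\mathrm{E}[X]\) is finite, so the sample mean is a consistent estimator of the true mean.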

Asymptotically unbiased estimator

An estimator \(\hat{\theta}_n\) is an asymptotically unbiased estimator for \(\theta\) if

\[ \lim_{n \rightarrow \infty} \mathrm{E}[\hat{\theta}_n] = \theta \]
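
For example, \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2\) has \(\mathrm{E}[\hat{\sigma}^2] = \frac{n-1}{n} \sigma^2\), which is biased for every finite \(n\) but converges to \(\sigma^2\) as \(n \rightarrow \infty\), so it is an asymptotically unbiased estimator of \(\sigma^2\).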

Asymptotically efficient

An estimator \(\hat{\theta}_n\) is asymptotically efficient if its variance approaches the Cramér-Rao lower bound as \(n \rightarrow \infty\):

\[ \lim_{n \rightarrow \infty} \frac{\mathrm{CRLB}_\theta}{\mathrm{Var}[\hat{\theta}_n]} = 1 \]

Moment generating functions

\[ M_X(t) = \mathrm{E}[e^{tX}] = \int_{-\infty}^\infty e^{tx} f_X(x) \, dx \]
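
For example, for an exponential distribution with rate \(\lambda\) and \(t < \lambda\),

\[ M_X(t) = \int_0^\infty e^{tx} \lambda e^{-\lambda x} \, dx = \lambda \int_0^\infty e^{-(\lambda - t)x} \, dx = \frac{\lambda}{\lambda - t} \]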

For independent \(X_1\), …, \(X_n\) and \(Y = \sum_{k=1}^n X_k\),

\[ M_Y(t) = \prod_{k=1}^n M_{X_k}(t) \]

If two probability distributions have the same moment generating function, they are the same distribution.
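
For example, if \(X_1 \sim \mathrm{Poisson}(\lambda_1)\) and \(X_2 \sim \mathrm{Poisson}(\lambda_2)\) are independent, then \(M_{X_k}(t) = e^{\lambda_k (e^t - 1)}\) and

\[ M_{X_1 + X_2}(t) = e^{\lambda_1 (e^t - 1)} e^{\lambda_2 (e^t - 1)} = e^{(\lambda_1 + \lambda_2)(e^t - 1)} \]

which is the moment generating function of a \(\mathrm{Poisson}(\lambda_1 + \lambda_2)\) distribution, so \(X_1 + X_2 \sim \mathrm{Poisson}(\lambda_1 + \lambda_2)\).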

Cramér-Rao lower bound

For an unbiased estimator \(\hat{\tau}\) of \(\tau(\theta)\),

\[ \mathrm{Var}[\hat{\tau}] \geq \frac{(\tau'(\theta))^2}{\mathrm{E} \left [ \left( \frac{\partial}{\partial \theta} \ln f(\vec{x}; \theta) \right)^2 \right] } \]

The lower bound holds if

\[ \frac{\partial}{\partial \theta} \int f(\vec{x}; \theta) \, d\vec{x} = \int \frac{\partial}{\partial \theta} f(\vec{x}; \theta) \, d\vec{x} \]

\[ \frac{\partial}{\partial \theta} \ln f(\vec{x}; \theta) \quad \text{exists} \]

\[ 0 < \mathrm{E} \left[ \left( \frac{\partial}{\partial \theta} \ln f(\vec{x}; \theta) \right)^2 \right] < \infty \]

Fisher information properties

\[ \mathrm{E}\left[ \frac{\partial}{\partial \theta} \ln f(\vec{x}; \theta) \right] = 0 \]

\[ \mathrm{E}\left[ \left( \frac{\partial}{\partial\theta} \ln f(\vec{x};\theta) \right)^2 \right] = -\mathrm{E}\left[ \frac{\partial^2}{\partial\theta^2} \ln f(\vec{x};\theta) \right] \]

If \(Y=(X_1, \ldots, X_n)\) is a tuple of independent and identically distributed random variables,

\[ \mathrm{E}\left[ \left( \frac{\partial}{\partial\theta} \ln f_Y(\vec{y};\theta) \right)^2 \right] = n \mathrm{E}\left[ \left( \frac{\partial}{\partial\theta} \ln f_X(x;\theta) \right)^2 \right] \]
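
For example, for a random sample of size \(n\) from a Poisson distribution with mean \(\theta\), \(\ln f_X(x; \theta) = x \ln \theta - \theta - \ln x!\), so \(\frac{\partial^2}{\partial\theta^2} \ln f_X(x;\theta) = -x/\theta^2\) and the Fisher information per observation is \(\mathrm{E}[X]/\theta^2 = 1/\theta\). For estimating \(\tau(\theta) = \theta\), the Cramér-Rao lower bound is

\[ \mathrm{CRLB}_\theta = \frac{1}{n \cdot \frac{1}{\theta}} = \frac{\theta}{n} \]

which \(\overline{X}\) attains, since \(\mathrm{Var}[\overline{X}] = \mathrm{Var}[X]/n = \theta/n\).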

References

[1]
Jem Corcoran. Statistical inference for estimation in data science. Course on Coursera.

  1. Note that the maximum likelihood estimator is not necessarily the most probable set of parameters. Many statisticians have an aversion to using Bayes’ theorem.