Statistics

A quick reference on basic concepts and properties in statistics.

Based on notes taken during a course [1].

Expectation

For a discrete random variable,

\[ \mathrm{E}[X] = \sum_x x P(X = x) \]

For a continuous random variable with probability density function \(f\),

\[ \mathrm{E}[X] = \int x f(x) \, dx \]

Expectation is a linear operator:

\[ \mathrm{E}[aX + bY] = a \mathrm{E}[X] + b \mathrm{E}[Y] \]

If \(X\) and \(Y\) are independent:

\[ \mathrm{E}[XY] = \mathrm{E}[X] \mathrm{E}[Y] \]
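
For example, for a single roll of a fair six-sided die,

\[ \mathrm{E}[X] = \sum_{x=1}^{6} x \cdot \frac{1}{6} = \frac{21}{6} = 3.5 \]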

Variance

\[ \mathrm{Var}[X] = \sigma^2 = \mathrm{E}[(X - \mathrm{E}[X])^2] \]

Standard deviation \(\sigma\) is defined as the square root of the variance \(\sigma^2\).

Other properties:

\[ \mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2 \]

\[ \mathrm{Var}[aX] = a^2 \mathrm{Var}[X] \]
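
For example, for a Bernoulli random variable with success probability \(p\), \(X^2 = X\), so \(\mathrm{E}[X^2] = \mathrm{E}[X] = p\) and

\[ \mathrm{Var}[X] = \mathrm{E}[X^2] - \mathrm{E}[X]^2 = p - p^2 = p(1-p) \]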

Covariance

\[ \mathrm{Cov}(X,Y) = \sigma_{X,Y} = \mathrm{E}[(X-\mathrm{E}[X])(Y-\mathrm{E}[Y])] \]

\[ \mathrm{Cov}(X,Y) = \mathrm{E}[XY] - \mathrm{E}[X] \mathrm{E}[Y] \]

\[ \mathrm{Cov}(X,X) = \mathrm{Var}[X] \]

\[ \mathrm{Var}[X+Y] = \mathrm{Var}[X] + 2 \mathrm{Cov}(X,Y) + \mathrm{Var}[Y] \]
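
In particular, if \(X\) and \(Y\) are independent, then \(\mathrm{Cov}(X,Y) = \mathrm{E}[XY] - \mathrm{E}[X]\mathrm{E}[Y] = 0\) and

\[ \mathrm{Var}[X+Y] = \mathrm{Var}[X] + \mathrm{Var}[Y] \]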

Correlation

\[ \mathrm{Corr}(X,Y) = \rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}[X] \mathrm{Var}[Y]}} \]

If \(X\) and \(Y\) are independent, then they are uncorrelated:

\[ \mathrm{Corr}(X,Y) = 0 \]
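
The converse does not hold. For example, if \(X \sim N(0,1)\) and \(Y = X^2\), then \(\mathrm{Cov}(X,Y) = \mathrm{E}[X^3] - \mathrm{E}[X]\mathrm{E}[X^2] = 0\), so \(X\) and \(Y\) are uncorrelated even though \(Y\) is completely determined by \(X\).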

Transformation

If \(f_X(x)\) is the probability density function of \(X\) and \(Y=g(X)\), where \(g\) is invertible and differentiable, then

\[ f_Y(y) = f_X(g^{-1}(y)) \cdot \left| \frac{d}{dy} g^{-1}(y) \right| \]
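
For example, if \(Y = g(X) = e^X\), then \(g^{-1}(y) = \ln y\) and \(\frac{d}{dy} g^{-1}(y) = \frac{1}{y}\), so for \(y > 0\)

\[ f_Y(y) = f_X(\ln y) \cdot \frac{1}{y} \]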

Estimators

An estimator is a random variable, computed from a random sample, that estimates the true value of a parameter.

For example, the mean of a random sample \(\hat{\theta} = \overline{X}\) is an estimator for the true mean \(\mathrm{E}[X]\).

An estimator \(\hat{\theta}\) estimating \(\theta\) is unbiased if \(\mathrm{E}[\hat{\theta}] = \theta\).

For a random sample \(X_1, \ldots, X_n\),

\[ \mathrm{E}[\overline{X}] = \mathrm{E}[X] \]

\[ \mathrm{Var}[\overline{X}] = \frac{\mathrm{Var}[X]}{n} \]
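
The second identity follows from the variance properties above, since the \(X_i\) are independent and identically distributed:

\[ \mathrm{Var}[\overline{X}] = \mathrm{Var}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}[X_i] = \frac{\mathrm{Var}[X]}{n} \]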

The bias of \(\hat{\theta}\) is

\[ \mathrm{B}(\hat{\theta}) = \mathrm{E}[\hat{\theta}] - \theta \]

The mean squared error is

\[ \mathrm{MSE}(\hat{\theta}) = \mathrm{E}[(\hat{\theta} - \theta)^2] = \mathrm{Var}[\hat{\theta}] + \mathrm{B}(\hat{\theta})^2 \]

For two unbiased estimators \(\hat{\theta}_1\) and \(\hat{\theta}_2\), \(\hat{\theta}_1\) is more efficient than \(\hat{\theta}_2\) if

\[ \mathrm{Var}[\hat{\theta}_1] < \mathrm{Var}[\hat{\theta}_2] \]
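
For example, for a random sample \(X_1, \ldots, X_n\) with \(n > 1\) and \(\mathrm{Var}[X] > 0\), both \(X_1\) and \(\overline{X}\) are unbiased estimators of \(\mathrm{E}[X]\), but \(\overline{X}\) is more efficient because \(\mathrm{Var}[\overline{X}] = \mathrm{Var}[X]/n < \mathrm{Var}[X_1]\).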

Method of moments estimation

Assuming a particular distribution with unknown parameters, set the sample moments \(\overline{X}, \overline{X^2}, \ldots\) equal to the corresponding true moments \(\mathrm{E}[X], \mathrm{E}[X^2], \ldots\), and solve for the distribution parameters.
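
For example, for an exponential distribution with rate \(\lambda\), the first true moment is \(\mathrm{E}[X] = 1/\lambda\). Setting \(\overline{X} = 1/\lambda\) and solving gives the method of moments estimator

\[ \hat{\lambda} = \frac{1}{\overline{X}} \]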

Maximum likelihood estimation

Assuming a particular distribution with unknown parameters, the maximum likelihood estimator is the set of parameter values that maximizes the likelihood of the observed sample.1 The likelihood is proportional to the joint probability mass function or density function of the sample, viewed as a function of the parameters.
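
For example, for a random sample \(x_1, \ldots, x_n\) from an exponential distribution with rate \(\lambda\), the likelihood and log-likelihood are

\[ L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_i x_i}, \qquad \ln L(\lambda) = n \ln \lambda - \lambda \sum_{i=1}^n x_i \]

Setting \(\frac{d}{d\lambda} \ln L(\lambda) = \frac{n}{\lambda} - \sum_i x_i = 0\) gives \(\hat{\lambda} = 1/\overline{x}\), which in this case coincides with the method of moments estimator.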

Under the conditions in which the Cramér-Rao lower bound holds, a maximum likelihood estimator is consistent, asymptotically unbiased, asymptotically efficient, and asymptotically normal:

\[ \hat{\theta}_n \sim N(\theta, \mathrm{CRLB}_\theta) \]

Invariance property

If \(\tau\) is an invertible function, the maximum likelihood estimator for \(\tau(\theta)\) is \(\tau(\hat{\theta})\), where \(\hat{\theta}\) is the maximum likelihood estimator for \(\theta\).
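
For example, if \(\hat{\lambda}\) is the maximum likelihood estimator for the rate \(\lambda\) of an exponential distribution, then the maximum likelihood estimator for the mean \(\tau(\lambda) = 1/\lambda\) is \(1/\hat{\lambda}\).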

Consistent estimator

An estimator \(\hat{\theta}_n\) is a consistent estimator for \(\theta\) if

\[ \hat{\theta}_n \xrightarrow{P} \theta \]
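
For example, by the weak law of large numbers, \(\overline{X} \xrightarrow{P} \mathrm{E}[X]\) provided \(\mathrm{E}[X]\) is finite, so the sample mean is a consistent estimator of the true mean.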

Asymptotically unbiased estimator

An estimator \(\hat{\theta}_n\) is an asymptotically unbiased estimator for \(\theta\) if

\[ \lim_{n \rightarrow \infty} \mathrm{E}[\hat{\theta}_n] = \theta \]
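
For example, \(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \overline{X})^2\) has \(\mathrm{E}[\hat{\sigma}^2] = \frac{n-1}{n} \sigma^2\), which is biased for every finite \(n\) but converges to \(\sigma^2\) as \(n \rightarrow \infty\), so it is an asymptotically unbiased estimator of \(\sigma^2\).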

Asymptotically efficient

An estimator \(\hat{\theta}_n\) is asymptotically efficient if its variance approaches the Cramér-Rao lower bound as \(n \rightarrow \infty\):

\[ \lim_{n \rightarrow \infty} \frac{\mathrm{CRLB}_\theta}{\mathrm{Var}[\hat{\theta}_n]} = 1 \]

Moment generating functions

\[ M_X(t) = \mathrm{E}[e^{tX}] = \int_{-\infty}^\infty e^{tx} f_X(x) \, dx \]
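
For example, for an exponential distribution with rate \(\lambda\) and \(t < \lambda\),

\[ M_X(t) = \int_0^\infty e^{tx} \lambda e^{-\lambda x} \, dx = \lambda \int_0^\infty e^{-(\lambda - t)x} \, dx = \frac{\lambda}{\lambda - t} \]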

For independent \(X_1\), …, \(X_n\) and \(Y = \sum_{k=1}^n X_k\),

\[ M_Y(t) = \prod_{k=1}^n M_{X_k}(t) \]

If two probability distributions have the same moment generating function, they are the same distribution.
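
For example, if \(X_1 \sim \mathrm{Poisson}(\lambda_1)\) and \(X_2 \sim \mathrm{Poisson}(\lambda_2)\) are independent, then \(M_{X_k}(t) = e^{\lambda_k (e^t - 1)}\) and

\[ M_{X_1 + X_2}(t) = e^{\lambda_1 (e^t - 1)} e^{\lambda_2 (e^t - 1)} = e^{(\lambda_1 + \lambda_2)(e^t - 1)} \]

which is the moment generating function of a \(\mathrm{Poisson}(\lambda_1 + \lambda_2)\) distribution, so \(X_1 + X_2 \sim \mathrm{Poisson}(\lambda_1 + \lambda_2)\).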

Cramér-Rao lower bound

For an unbiased estimator \(\hat{\tau}\) of \(\tau(\theta)\),

\[ \mathrm{Var}[\hat{\tau}] \geq \frac{(\tau'(\theta))^2}{\mathrm{E} \left [ \left( \frac{\partial}{\partial \theta} \ln f(\vec{x}; \theta) \right)^2 \right] } \]

The lower bound holds if

\[ \frac{\partial}{\partial \theta} \int f(\vec{x}; \theta) \, d\vec{x} = \int \frac{\partial}{\partial \theta} f(\vec{x}; \theta) \, d\vec{x} \]

\[ \frac{\partial}{\partial \theta} \ln f(\vec{x}; \theta) \quad \text{exists} \]

\[ 0 < \mathrm{E} \left[ \left( \frac{\partial}{\partial \theta} \ln f(\vec{x}; \theta) \right)^2 \right] < \infty \]

Fisher information properties

\[ \mathrm{E}\left[ \frac{\partial}{\partial \theta} \ln f(\vec{x}; \theta) \right] = 0 \]

\[ \mathrm{E}\left[ \left( \frac{\partial}{\partial\theta} \ln f(\vec{x};\theta) \right)^2 \right] = -\mathrm{E}\left[ \frac{\partial^2}{\partial\theta^2} \ln f(\vec{x};\theta) \right] \]

If \(Y=(X_1, \ldots, X_n)\) is a tuple of independent and identically distributed random variables,

\[ \mathrm{E}\left[ \left( \frac{\partial}{\partial\theta} \ln f_Y(\vec{y};\theta) \right)^2 \right] = n \mathrm{E}\left[ \left( \frac{\partial}{\partial\theta} \ln f_X(x;\theta) \right)^2 \right] \]
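
For example, for a random sample of size \(n\) from a Poisson distribution with mean \(\theta\), \(\ln f_X(x; \theta) = x \ln \theta - \theta - \ln x!\), so \(\frac{\partial^2}{\partial\theta^2} \ln f_X(x;\theta) = -x/\theta^2\) and the Fisher information per observation is \(\mathrm{E}[X]/\theta^2 = 1/\theta\). For estimating \(\tau(\theta) = \theta\), the Cramér-Rao lower bound is

\[ \mathrm{CRLB}_\theta = \frac{1}{n \cdot \frac{1}{\theta}} = \frac{\theta}{n} \]

which \(\overline{X}\) attains, since \(\mathrm{Var}[\overline{X}] = \mathrm{Var}[X]/n = \theta/n\).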

References

[1]
Jem Corcoran. Statistical inference for estimation in data science. Course on Coursera.

  1. Note that the maximum likelihood estimator is not necessarily the most probable set of parameters. Many statisticians have an aversion to using Bayes’ theorem.