15  Learning from Data

15.1 Summarizing Data

So far, we’ve talked about distributions in a theoretical sense, looking at different properties of random variables. In practice, we don’t observe random variables themselves; we observe realizations of random variables. These realizations are roughly what we mean by “data”.

Sample mean: This is the most common measure of central tendency, calculated by summing across the observations and dividing by the number of observations.

\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i

The sample mean is an estimate of the expected value of a distribution.
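
As a quick illustration, here is a minimal Python sketch that computes the sample mean directly from the formula; the data values are purely hypothetical.

```python
# Hypothetical observations (illustrative only)
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(x)
x_bar = sum(x) / n   # (1/n) * sum_{i=1}^{n} x_i
print(x_bar)         # 5.0
```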

Dispersion: We also typically want to know how spread out the data are relative to the center of the observed distribution. There are several ways to measure dispersion.

Sample variance: The sample variance is the sum of the squared deviations from the sample mean, divided by the number of observations minus 1.

\hat{\text{Var}}(X) = \frac{1}{n-1}\sum_{i = 1}^n (x_i - \bar{x})^2

Again, this is an estimate of the variance of a random variable; we divide by n - 1 instead of n in order to get an unbiased estimate.

Standard deviation: The sample standard deviation is the square root of the sample variance.

\hat{SD}(X) = \sqrt{\hat{\text{Var}}(X)} = \sqrt{\frac{1}{n-1}\sum_{i = 1}^n (x_i - \bar{x})^2}
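
Continuing with the same hypothetical data, a short sketch of the sample variance and standard deviation; note the n - 1 denominator, which is also what Python's statistics.variance and statistics.stdev use.

```python
import math

# Hypothetical observations (illustrative only)
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(x)
x_bar = sum(x) / n
var_hat = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)  # divide by n - 1, not n
sd_hat = math.sqrt(var_hat)

print(var_hat, sd_hat)  # ~4.571, ~2.138
```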

Covariance and Correlation: Both of these quantities measure the degree to which two variables vary together, and are estimates of the covariance and correlation of two random variables as defined above.

  1. Sample covariance: \hat{\text{Cov}}(X,Y) = \frac{1}{n-1}\sum_{i = 1}^n(x_i - \bar{x})(y_i - \bar{y})
  2. Sample correlation: \hat{\text{Corr}}(X,Y) = \frac{\hat{\text{Cov}}(X,Y)}{\sqrt{\hat{\text{Var}}(X)\hat{\text{Var}}(Y)}}
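
As a small illustration with hypothetical paired data, both quantities can be computed directly from these formulas:

```python
import math

# Hypothetical paired observations (illustrative only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

cov_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
var_x = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
var_y = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
corr_hat = cov_hat / math.sqrt(var_x * var_y)

print(cov_hat, corr_hat)  # 2.0, 0.8
```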

Example 15.1 Using the above table, calculate the sample versions of:

  1. \text{Cov}(X,Y)
  2. \text{Corr}(X, Y)

15.2 Law of Large Numbers

In probability theory, asymptotic analysis is the study of limiting behavior. By limiting behavior, we mean the behavior of some random process as the number of observations gets larger and larger.

Why is this important? We rarely know the true process governing the events we see in the social world. It is helpful to understand how such unknown processes theoretically must behave, and asymptotic theory helps us do this.

Theorem 15.1 (Law of Large Numbers (LLN)) For a sequence of independent draws of random variables with the same mean \mu, the sample average after n draws, \bar{X}_n = \frac{1}{n}(X_1 + X_2 + \ldots + X_n), converges in probability to the expected value of X, \mu, as n \rightarrow \infty:

\lim\limits_{n\to \infty} P(|\bar{X}_n - \mu | > \varepsilon) = 0

A shorthand for this is \bar{X}_n \xrightarrow{p} \mu, where the arrow is read as “converges in probability to” as n\to \infty. This result is an important motivation for the widespread use of the sample mean, as well as for the intuitive link between averages and expected values.

More precisely, this version of the LLN is called the weak law of large numbers because it leaves open the possibility that |\bar{X}_n - \mu | > \varepsilon still occurs many times along the sequence. The strong law of large numbers states that, under a few more conditions, the probability that the limit of the sample average equals the true mean is 1, i.e., P(\lim_{n\to\infty}\bar{X}_n = \mu) = 1 (and other possibilities occur with probability 0), but the difference is rarely consequential in practice.

The strong law of large numbers holds so long as the expected value exists; no other assumptions are needed. However, the rate of convergence can differ greatly depending on the distribution underlying the observed data. When extreme observations occur often (i.e., when kurtosis is large), convergence is much slower; the distribution of financial returns is a classic example.
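
A quick simulation sketch illustrates the LLN: the running sample mean of i.i.d. draws settles down around the true mean as n grows. The Exponential distribution and its mean used here are arbitrary, illustrative choices.

```python
import random

random.seed(42)
mu = 2.0
# i.i.d. Exponential draws with E[X] = mu (rate = 1/mu)
draws = [random.expovariate(1 / mu) for _ in range(100_000)]

running_sum = 0.0
for n, x in enumerate(draws, start=1):
    running_sum += x
    if n in (10, 100, 1_000, 10_000, 100_000):
        print(f"n = {n:>7,}: X_bar_n = {running_sum / n:.4f}")  # drifts toward 2.0
```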

15.3 Central Limit Theorem

We are now finally ready to revisit, in somewhat more precise terms, the two pillars of statistical theory with which we motivated Probability.

Theorem 15.2 (Central Limit Theorem (i.i.d. case)) Let \{X_n\} = \{X_1, X_2, \ldots\} be a sequence of i.i.d. random variables with finite mean (\mu) and variance (\sigma^2). Then the sample mean \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n}, once centered and scaled, converges in distribution to a standard Normal distribution:

\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \xrightarrow{d} \text{Normal}(0, 1).

Another way to write this as a probability statement is that for all real numbers a,

P\left(\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \le a\right) \rightarrow \Phi(a)

as n\to \infty, where \Phi(x) = \int_{-\infty}^x \frac{1}{\sqrt{2\pi}}e^{-\frac{t^2}{2}} \, dt is the CDF of a Normal distribution with mean 0 and variance 1.

This result means that, as n grows, the distribution of the sample mean \bar X_n = \frac{1}{n} (X_1 + X_2 + \cdots + X_n) is approximately normal with mean \mu and standard deviation \frac{\sigma}{\sqrt n}, i.e.,

\bar{X}_n \approx \text{Normal}\bigg(\mu, \frac{\sigma^2}{n}\bigg). The standard deviation of \bar X_n (which is roughly a measure of the precision of \bar X_n as an estimator of \mu) decreases at the rate 1/\sqrt{n}, so, for example, to increase its precision by a factor of 10 (i.e., to get one more digit right), one needs to collect 10^2 = 100 times more units of data.
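
A simulation sketch makes this concrete: even when the underlying data are skewed, the standardized sample mean behaves approximately like a standard Normal. The Exponential parent distribution, sample size, and number of replications below are illustrative choices.

```python
import math
import random

random.seed(7)
mu, sigma = 1.0, 1.0          # Exponential(rate = 1) has mean 1 and sd 1
n, n_reps = 50, 20_000

z_values = []
for _ in range(n_reps):
    sample = [random.expovariate(1.0) for _ in range(n)]
    x_bar = sum(sample) / n
    z_values.append((x_bar - mu) / (sigma / math.sqrt(n)))  # standardized sample mean

# Empirical P(Z <= 1) should be close to Phi(1) ~ 0.8413
p_hat = sum(z <= 1.0 for z in z_values) / n_reps
print(p_hat)
```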

Intuitively, this result also explains why practitioners often feel comfortable assuming Normality whenever many small, independent processes combine to form the realized observations.