Statistics Glossary
A comprehensive guide to key terms and concepts in Statistics.
A
Anscombe’s Quartet
A set of four datasets that have nearly identical simple descriptive statistics (mean, variance, correlation) but appear very different when graphed. It demonstrates the importance of visualizing data before analyzing it.
Asymptotic Normality
The property of an estimator where its sampling distribution approaches a normal distribution as the sample size increases.
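A simulation sketch (assumed setup: exponential data, samples of size 50): even though the population is skewed, the standardized sample mean behaves approximately like a standard normal, so roughly 95% of replications land within ±1.96:

```python
import random
from math import sqrt
from statistics import fmean

random.seed(0)
n, reps = 50, 10_000
mu, sigma = 1.0, 1.0  # mean and sd of the Exponential(1) population

inside = 0
for _ in range(reps):
    xbar = fmean(random.expovariate(1.0) for _ in range(n))
    z = (xbar - mu) / (sigma / sqrt(n))  # standardized sample mean
    inside += abs(z) <= 1.96

coverage = inside / reps
print(coverage)  # close to 0.95 for moderately large n
```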
B
Bias
The difference between the expected value of an estimator and the true value of the parameter being estimated. Bias is a measure of systematic error.
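A classic example: the variance estimator that divides by n is biased downward, while dividing by n − 1 removes the bias. A small simulation (assumed setup: normal samples of size 5, true variance 1) makes the systematic error visible:

```python
import random
from statistics import fmean

random.seed(1)
n, trials = 5, 20_000
biased, unbiased = [], []
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    m = fmean(xs)
    ss = sum((x - m) ** 2 for x in xs)
    biased.append(ss / n)          # divides by n: expectation is (n-1)/n * sigma^2
    unbiased.append(ss / (n - 1))  # divides by n-1: expectation is sigma^2

print(fmean(biased), fmean(unbiased))  # first average sits systematically below 1.0
```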
Bias-Variance Tradeoff
The conflict in trying to simultaneously minimize bias and variance, the two sources of error that prevent supervised learning algorithms from generalizing beyond their training set.
Box Plot
A standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It is useful for identifying outliers and skewness.
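The five-number summary behind a box plot can be computed with the standard library (a sketch; note that quartile conventions vary — here `method='inclusive'`):

```python
from statistics import quantiles, median

data = [2, 4, 4, 5, 6, 7, 8, 9, 12, 15, 41]  # 41 looks like an outlier
q1, med, q3 = quantiles(data, n=4, method='inclusive')
summary = (min(data), q1, med, q3, max(data))
print(summary)  # the five numbers a box plot draws
```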
C
Central Tendency
A central or typical value for a probability distribution. The most common measures of central tendency are the arithmetic mean, the median, and the mode.
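The three measures can disagree on the same data, especially when it is skewed; a quick stdlib sketch with hypothetical income figures:

```python
from statistics import mean, median, mode

incomes = [30, 32, 32, 35, 38, 40, 250]  # right-skewed: one very large value
print(mean(incomes))    # pulled upward by the outlier
print(median(incomes))  # robust middle value
print(mode(incomes))    # most frequent value
```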
Confidence Interval
A range of values, derived from sample statistics, that is constructed so that a specified proportion of such intervals (the confidence level, e.g., 95%) would contain the value of the unknown population parameter across repeated sampling.
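A sketch of a normal-approximation 95% interval for a population mean (sample values are made up; assumes the sample is large enough for the normal approximation to be reasonable):

```python
from math import sqrt
from statistics import fmean, stdev, NormalDist

sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.0]
n = len(sample)
xbar, s = fmean(sample), stdev(sample)
z = NormalDist().inv_cdf(0.975)   # ~1.96 for a 95% interval
half_width = z * s / sqrt(n)
ci = (xbar - half_width, xbar + half_width)
print(ci)
```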
Consistency
A property of an estimator where it converges in probability to the true value of the parameter as the sample size tends to infinity.
Correlation
A statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). The correlation coefficient r ranges from -1 to 1.
D
Descriptive Statistics
Summary measures that describe the main features of a given data set, which can represent either an entire population or a sample of a population.
E
Efficiency
A measure of the quality of an estimator. An efficient estimator has the minimum possible variance among all unbiased estimators (achieving the Cramér-Rao lower bound).
Estimator
A rule or formula that tells us how to calculate an estimate of a population parameter based on sample data.
H
Histogram
A graphical representation of the distribution of numerical data. It is an estimate of the probability distribution of a continuous variable.
I
Interquartile Range (IQR)
A measure of statistical dispersion, equal to the difference between the 75th and 25th percentiles, or between the upper and lower quartiles: IQR = Q3 - Q1.
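The IQR in code, together with the conventional 1.5 × IQR fences often used to flag outliers (a sketch; the 1.5 multiplier is a common rule of thumb, not a law):

```python
from statistics import quantiles

data = [7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
q1, _, q3 = quantiles(data, n=4, method='inclusive')
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # conventional outlier fences
outliers = [x for x in data if x < lo or x > hi]
print(iqr, outliers)
```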
Irreducible Error
The error that cannot be reduced by creating a better model. It is caused by the noise in the data itself.
K
Kurtosis
A measure of the “tailedness” of the probability distribution of a real-valued random variable. High kurtosis indicates heavy tails (more outliers), while low kurtosis indicates light tails.
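The standard library has no kurtosis function, so a minimal sketch computes excess kurtosis from central moments (population-style moments, dividing by n):

```python
from statistics import fmean

def excess_kurtosis(xs):
    """Population-style excess kurtosis: m4 / m2**2 - 3 (0 for a normal)."""
    m = fmean(xs)
    m2 = fmean((x - m) ** 2 for x in xs)
    m4 = fmean((x - m) ** 4 for x in xs)
    return m4 / m2 ** 2 - 3

light = [4, 5, 5, 6, 6, 6, 7, 7, 8]   # values clustered near the mean
heavy = [1, 5, 6, 6, 6, 6, 6, 7, 11]  # same mean, but longer tails
print(excess_kurtosis(light), excess_kurtosis(heavy))
```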
L
Likelihood Function
A function of the parameters of a statistical model, given specific observed data. Likelihood differs from probability in that the data is fixed and the parameters vary.
Log-Likelihood
The natural logarithm of the likelihood function. It is often easier to maximize the log-likelihood than the likelihood itself because sums are easier to work with than products.
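A sketch with a normal model shows why: the log of a product of densities equals the sum of their logs, and the sum avoids the numerical underflow that products of many small densities suffer (the model parameters here are assumed):

```python
from math import log, prod, isclose
from statistics import NormalDist

data = [1.2, 0.8, 1.5, 0.9, 1.1]
model = NormalDist(mu=1.0, sigma=0.5)  # assumed parameters

likelihood = prod(model.pdf(x) for x in data)
log_likelihood = sum(log(model.pdf(x)) for x in data)

print(isclose(log(likelihood), log_likelihood))  # True: log turns products into sums
```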
M
Maximum A Posteriori (MAP)
An estimate of an unknown quantity that equals the mode of the posterior distribution. It incorporates a prior distribution over the parameter.
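A standard conjugate sketch (assumed setup: a Beta(2, 2) prior on a coin's heads probability and k heads in n flips): the posterior is Beta(a + k, b + n − k), and its mode is the MAP estimate.

```python
a, b = 2, 2   # assumed prior pseudo-counts for heads and tails
k, n = 7, 10  # observed: 7 heads in 10 flips

# Posterior is Beta(a + k, b + n - k); its mode (for shape params > 1) is the MAP.
post_a, post_b = a + k, b + n - k
p_map = (post_a - 1) / (post_a + post_b - 2)
p_mle = k / n  # for comparison: the MLE ignores the prior

print(p_map, p_mle)  # the prior pulls the MAP toward 0.5
```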
Maximum Likelihood Estimation (MLE)
A method of estimating the parameters of a probability distribution by maximizing a likelihood function, so that under the assumed statistical model, the observed data is most probable.
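For a normal model the MLE has a closed form: the sample mean and the n-divisor variance. A numeric sketch checks the maximization directly by confirming that nearby perturbed parameters give a lower log-likelihood:

```python
from math import log, pi
from statistics import fmean

data = [2.3, 1.9, 2.7, 2.1, 2.5, 1.8, 2.4]

def log_likelihood(mu, var):
    """Normal log-likelihood of the data at parameters (mu, var)."""
    n = len(data)
    return -0.5 * n * log(2 * pi * var) - sum((x - mu) ** 2 for x in data) / (2 * var)

mu_hat = fmean(data)
var_hat = fmean((x - mu_hat) ** 2 for x in data)  # MLE divides by n, not n - 1

best = log_likelihood(mu_hat, var_hat)
print(best >= log_likelihood(mu_hat + 0.1, var_hat))  # True: perturbing mu lowers it
print(best >= log_likelihood(mu_hat, var_hat * 1.2))  # True: perturbing var lowers it
```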
Mean
The arithmetic average of a set of numbers, calculated by dividing the sum of the values by the number of values. It is sensitive to outliers.
Mean Squared Error (MSE)
A measure of the quality of an estimator. It measures the average squared difference between the estimated values and the actual value. MSE incorporates both bias and variance.
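The decomposition MSE = bias² + variance holds exactly; a simulation sketch verifies it for the n-divisor variance estimator (assumed setup: normal samples of size 5, true variance 1):

```python
import random
from math import isclose
from statistics import fmean

random.seed(2)
theta = 1.0  # true population variance
n, trials = 5, 5_000
estimates = []
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    m = fmean(xs)
    estimates.append(sum((x - m) ** 2 for x in xs) / n)  # biased estimator

mse = fmean((e - theta) ** 2 for e in estimates)
bias = fmean(estimates) - theta
var = fmean((e - fmean(estimates)) ** 2 for e in estimates)
print(isclose(mse, bias ** 2 + var))  # True: the identity is exact
```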
Median
The middle value separating the greater and lesser halves of a data set. It is robust to outliers.
Method of Moments (MoM)
A method of estimation that equates sample moments (e.g., sample mean, sample variance) to population moments (expected values) to solve for unknown parameters.
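A one-parameter sketch: for an Exponential(λ) distribution, E[X] = 1/λ, so equating the first sample moment to the first population moment gives the estimator 1/x̄ (simulation parameters here are assumed):

```python
import random
from statistics import fmean

random.seed(3)
true_rate = 2.0
sample = [random.expovariate(true_rate) for _ in range(10_000)]

# Method of moments: set the sample mean equal to the population mean 1/lambda.
rate_hat = 1 / fmean(sample)
print(rate_hat)  # close to the true rate 2.0
```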
Mode
The value that appears most often in a set of data values.
N
Normal Distribution
A probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. Also known as the bell curve.
O
Outlier
A data point that differs significantly from other observations. Outliers may be due to variability in the measurement or may indicate experimental error.
P
Percentile
A score below which a given percentage of scores in its frequency distribution falls (exclusive definition) or a score at or below which a given percentage falls (inclusive definition).
Population Moment
The expected value of a power of a random variable. The first moment is the mean, the second central moment is the variance.
Posterior Distribution
The probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey.
Prior Distribution
The probability distribution that would express one’s beliefs about an uncertain quantity before some evidence is taken into account.
R
Range
The difference between the largest and smallest values in a set of values.
S
Sample Moment
The average of a power of the observed values in a sample. The first sample moment is the sample mean.
Skewness
A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Positive skew indicates a tail on the right; negative skew indicates a tail on the left.
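Like kurtosis, skewness is absent from the standard library; a minimal moment-based sketch (population-style moments, dividing by n):

```python
from statistics import fmean

def skewness(xs):
    """Population-style skewness: m3 / m2**1.5 (0 for symmetric data)."""
    m = fmean(xs)
    m2 = fmean((x - m) ** 2 for x in xs)
    m3 = fmean((x - m) ** 3 for x in xs)
    return m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))         # symmetric: 0
print(skewness([1, 2, 2, 3, 3, 3, 10]))  # long right tail: positive
```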
Standard Deviation
A measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean, while a high standard deviation indicates that the values are spread out over a wider range.
Sufficiency
A statistic is sufficient for a parameter if no other statistic that can be calculated from the same sample provides any additional information as to the value of the parameter.
V
Variance
The expectation of the squared deviation of a random variable from its mean. It measures how far a set of numbers is spread out from their average value.
Variance (of an Estimator)
The expectation of the squared deviation of an estimator from its mean. It measures the spread or precision of the estimator. High variance implies overfitting in predictive models.