Frequency Distributions and Measures of Central Tendency
Describe frequency distributions and measures of central tendency and dispersion
Frequency Distributions
Frequency distributions are a method of tabulating or graphically displaying the number of times each value (or range of values) occurs in a set of observations.
The Normal Distribution
The normal distribution is a Gaussian distribution, in which the majority of values cluster around the mean whilst more extreme values become progressively less frequent.
The normal distribution is common in medicine for two reasons.
- Much of the variation in biology follows a normal distribution
- When multiple random samples are taken from a population, the mean of these samples follows a normal distribution, even if the characteristic being measured is not normally distributed
This is known as the central limit theorem.
- It is useful because many statistical tests are only valid when the data follow a normal distribution
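A minimal sketch of the central limit theorem in Python, assuming an arbitrary skewed (exponential) population; the population mean, sample size, and number of repeated samples are illustrative choices only:

```python
import random
import statistics

# Sketch of the central limit theorem: individual values are drawn from a
# skewed (exponential) distribution, yet the means of repeated samples
# cluster symmetrically around the population mean.
random.seed(1)

population_mean = 10     # exponential distribution with mean 10 (arbitrary)
sample_size = 50         # observations per sample
n_samples = 2000         # number of repeated samples

sample_means = [
    statistics.mean(random.expovariate(1 / population_mean) for _ in range(sample_size))
    for _ in range(n_samples)
]

print(statistics.mean(sample_means))   # close to 10, the population mean
print(statistics.stdev(sample_means))  # close to 10 / sqrt(50), the standard error
```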
The formula for the normal distribution is given by:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
From this, it can be seen that the two variables which determine the shape of the normal distribution are:
- μ (mu): The mean
- σ (sigma): The standard deviation
The Standard Normal Distribution
The standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. Its equation is much simpler, which is why it is used:

$$f(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}}$$
Any normal distribution can be transformed to fit a standard normal distribution using a z transformation:

$$z = \frac{x - \mu}{\sigma}$$
The value of z then gives a standardised score, i.e. the number of standard deviations from the mean on a standardised curve. This can then be used to determine probability.
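As an illustration, the following sketch applies the z transformation to a hypothetical observation (the mean of 140 mmol/L and SD of 3 mmol/L are assumed purely for the example) and converts the z score into a probability using Python's standard-library NormalDist:

```python
from statistics import NormalDist

# Hypothetical example: a population with mean 140 mmol/L and SD 3 mmol/L
# (e.g. serum sodium), and an observed value of 146 mmol/L.
mu, sigma = 140, 3
x = 146

# z transformation: number of standard deviations the observation lies from the mean
z = (x - mu) / sigma              # (146 - 140) / 3 = 2.0

# The standard normal CDF converts the z score into a probability
p_below = NormalDist().cdf(z)     # P(value <= 146) ~ 0.977
p_above = 1 - p_below             # P(value > 146)  ~ 0.023

print(z, round(p_below, 3), round(p_above, 3))
```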
The Binomial Distribution
Where observations belong to one of two mutually exclusive categories, i.e.:
If the probability of one outcome is $p$, then the probability of the other is $q = 1 - p$. The probability of observing $k$ events in $n$ independent observations is then:

$$P(k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
If the number of observations is very large and the probability of an event is small, a Poisson distribution can be used to approximate a binomial distribution.
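A brief sketch comparing the binomial probabilities with their Poisson approximation; the event probability and number of observations are hypothetical values chosen to show the approximation working:

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k events in n observations, each with probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events, with rate lam = n * p."""
    return lam**k * math.exp(-lam) / math.factorial(k)

# Hypothetical example: a rare event with probability 0.002 observed in 1000 patients
n, p = 1000, 0.002
for k in range(5):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, n * p), 4))
```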
Measures of Central Tendency
As noted above for the normal distribution, results tend to cluster around a central value. This central value can be quantified using measures of central tendency, of which there are three:
- Mode
  - The most common value in the sample
- Median
  - The middle value when the sample is ranked from lowest to highest
  - The median is the best measure of central tendency when the data are skewed
- Arithmetic mean
  - The average, i.e.:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

  - The mean is common and reliable, though it may be unrepresentative if the distribution is skewed
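The three measures can be compared on a small, hypothetical right-skewed sample (e.g. lengths of stay in days) using Python's statistics module:

```python
import statistics

# Hypothetical, right-skewed sample (e.g. lengths of stay in days)
los = [1, 2, 2, 2, 3, 3, 4, 5, 21]

print(statistics.mode(los))    # 2    - the most common value
print(statistics.median(los))  # 3    - the middle value; barely affected by the outlier
print(statistics.mean(los))    # ~4.8 - pulled upwards by the skewed tail
```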
Measures of Dispersion
Measures of variability describe the degree of dispersion around the central value.
Basic Measures of Deviation
- Range: The lowest and highest values in the sample
  - Highly influenced by outliers
- Percentiles: Rank observations into 100 equal parts, so that the median becomes the 50th percentile
  - A better measure of spread than the range
- Interquartile range: The 25th to 75th centile
  - A box-and-whisker plot graphically demonstrates the median, 25th centile, 75th centile, and (usually) the 10th and 90th centiles
    - Outliers are represented by dots
    - Occasionally the whiskers plot the range, and no outliers are plotted
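A short sketch of percentiles and the interquartile range, using the same hypothetical skewed sample as above; note how the IQR ignores the single outlier that dominates the range:

```python
import statistics

# Same hypothetical skewed sample as above
los = [1, 2, 2, 2, 3, 3, 4, 5, 21]

# statistics.quantiles with n=4 returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(los, n=4)

print(min(los), max(los))   # the range (1 to 21) is dominated by the single outlier
print(q1, q2, q3)           # quartiles; q2 is the median
print(q3 - q1)              # interquartile range - unaffected by the outlier
```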
Variance and Standard Deviation
Variance is a better measure of variability than the above methods. Variance:
- Evaluates how far each observation is from the mean, and penalises observations more the further they lie from the mean
- Sums the squares of each difference and divides by the number of observations minus one, i.e.:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

- $n - 1$ is used (instead of $n$) because the mean of the sample is known, and therefore the last observation must take on a known value
- This is known as the degrees of freedom, a mathematical restriction which arises when one statistic is used to estimate another
- It is a confusing topic best illustrated with an example:
- You have been given a sample of two observations (say, ages of two individuals), and you know nothing about them
- The degrees of freedom is two, since those observations can take on any value.
- Alternatively, imagine you have been given the same sample, but this time I tell you that the mean age of the sample is 20
- The degrees of freedom is one, since if I tell you the value of one of the observations is 30, you know that the other must be 10
- Therefore, only one of the observations is free to vary: as soon as its value is known, the value of the other observation is known as well
- Different statistical tests may result in additional losses in degrees of freedom.
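A minimal sketch of the sample variance calculation on a hypothetical sample of ages, comparing the manual sum-of-squares calculation with the standard-library functions that divide by n − 1 and by n:

```python
import statistics

ages = [18, 22, 25, 31, 34]   # hypothetical sample

n = len(ages)
mean = statistics.mean(ages)

# Sample variance: sum of squared deviations from the mean, divided by (n - 1)
variance = sum((x - mean) ** 2 for x in ages) / (n - 1)

print(variance)                    # manual calculation
print(statistics.variance(ages))   # library equivalent (also divides by n - 1)
print(statistics.pvariance(ages))  # population variance divides by n instead
```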
Standard Deviation
The standard deviation is the positive square root of the variance.
In a normally distributed sample:
- 1 SD either side of the mean should include ~68% of results
- 2 SD either side of the mean should include ~95% of results
- 3 SD either side of the mean should include ~99.7% of results
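These proportions can be confirmed from the standard normal cumulative distribution function, as in this brief sketch:

```python
from statistics import NormalDist

# Proportion of a normal distribution lying within k standard deviations of the mean
std_normal = NormalDist()   # mean 0, SD 1
for k in (1, 2, 3):
    proportion = std_normal.cdf(k) - std_normal.cdf(-k)
    print(k, round(proportion, 4))   # ~0.6827, ~0.9545, ~0.9973
```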
Standard Error and Confidence Intervals
Standard error of the mean is:
- A measure of the precision of the estimate of the mean
- Calculated from the standard deviation and the sample size
  - As the sample size grows, the SEM decreases (as the estimate becomes more precise)
- Given by the formula:

$$SEM = \frac{SD}{\sqrt{n}}$$
- Used to calculate the confidence interval
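A minimal sketch of the SEM calculation, using the same hypothetical sample of ages as above:

```python
import math
import statistics

ages = [18, 22, 25, 31, 34]   # hypothetical sample

sd = statistics.stdev(ages)        # sample standard deviation (n - 1 denominator)
sem = sd / math.sqrt(len(ages))    # standard error of the mean: SD / sqrt(n)

print(round(sd, 2), round(sem, 2))
```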
Confidence Interval
The confidence interval:
- Gives a range in which the true population parameter is likely to lie
  - The width of the interval is related to the standard error and the degree of confidence (typically 95%)
- Is a function of the sample statistic (in this case the mean), rather than the actual observations
- Has several benefits over the p-value:
  - Indicates the magnitude of the difference in a meaningful way
  - Indicates the precision of the estimate
    - The smaller the confidence interval, the more precise the estimate
  - Allows statistical significance to be calculated
    - If the confidence interval for a ratio (e.g. a relative risk or odds ratio) crosses 1, then the result is not statistically significant. Note that the inverse is not true: a result is not necessarily significant because its CI does not cross 1 (for an absolute difference, the value of no effect is 0, not 1).
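A brief sketch of an approximate 95% confidence interval for a sample mean, using the same hypothetical ages; the 1.96 multiplier assumes a z-based interval (for a sample this small, a t-based interval would be slightly wider):

```python
import math
import statistics

ages = [18, 22, 25, 31, 34]   # hypothetical sample

mean = statistics.mean(ages)
sem = statistics.stdev(ages) / math.sqrt(len(ages))

# Approximate 95% confidence interval for the mean: mean +/- 1.96 * SEM
# (1.96 is the z value for 95% confidence)
lower = mean - 1.96 * sem
upper = mean + 1.96 * sem

print(round(mean, 1), (round(lower, 1), round(upper, 1)))
```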