If the number of observations is large, frequency histograms
of many biological variables will have similar shapes (they will be 'bell
shaped'). Many, but not all, bell-shaped frequency distributions are examples
of a special type of distribution known as the Normal or Gaussian
Distribution.
A normal distribution can be characterized by two
parameters, the mean and the standard deviation (see most statistics textbooks
for the equation describing a normal distribution). Compare this with the
binomial distribution, which is
also controlled by two parameters, the number of trials and the probability of a
success (n and p), and the Poisson
distribution, which is controlled by a single
parameter, the mean number of events.
Although populations may be normally distributed this is not
always the case for samples, particularly small ones. The term 'normal' has some
unfortunate connotations. It should not be taken as an implication that all
other frequency distributions are abnormal.
The normal distribution has a central role in many
statistical tests because those tests assume that data are normally distributed.
If this assumption is not met, many significance tests become unreliable. Part of
the mystique surrounding the normal distribution is that measures derived from
samples (such as sample means) obtained from obviously non-normal populations
will themselves tend to be normally distributed, provided the samples are large
enough. This relationship is formalised by the central
limit theorem.
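The central limit theorem can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the exponential population, the sample size of 30 and the 2000 repeated samples are all arbitrary choices that do not come from the text.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 30   # observations per sample (illustrative assumption)
N_SAMPLES = 2000   # number of samples drawn (illustrative assumption)


def sample_mean(n):
    """Mean of n draws from an exponential population with mean 1.

    The exponential distribution is strongly skewed, i.e. clearly non-normal.
    """
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))


sample_means = [sample_mean(SAMPLE_SIZE) for _ in range(N_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.stdev(sample_means)

# Theory predicts the means centre on 1 with a standard deviation
# of 1 / sqrt(30), which is roughly 0.18.
print(f"mean of sample means: {mean_of_means:.3f}")
print(f"sd of sample means:   {sd_of_means:.3f}")

# In a normal distribution about 68% of values lie within one
# standard deviation of the mean.
within_one_sd = sum(
    abs(m - mean_of_means) <= sd_of_means for m in sample_means
) / N_SAMPLES
print(f"proportion within 1 sd: {within_one_sd:.3f}")
```

Even though the individual observations are skewed, the proportion of sample means within one standard deviation should come out close to the 68% expected of a normal distribution.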
The reason why many biological variables are normally
distributed is that their values are a consequence of multifactorial processes,
i.e. they have many causes. If these causes are additive it is almost inevitable
that the values will follow a normal distribution. Consider the following rather
simplified scenario. The height of a particular organism is due mainly to 5
nonallelic genes, each of which has a dominant (tall) and recessive (short)
allele. Since the genes are nonallelic we can assume that their effects would
be additive. The possible genotypes and their associated phenotypes are shown
below. In this example the heterozygote phenotype is assumed to be identical
to that of the homozygote dominant, so we do not need to differentiate between
homozygote dominant and heterozygote individuals.
Genotype                          Height   Number of combinations
'Tall' genes    'Short' genes              (using nCr)
0               5                 0        1
1               4                 1        5
2               3                 2        10
3               2                 3        10
4               1                 4        5
5               0                 5        1

From this simple example we can see that most individuals
would be of average height, few would be very short or very tall. Indeed, if
enough heights were obtained they would follow a normal distribution.
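The combination counts in the gene example can be reproduced with a few lines of Python. This is a minimal sketch that assumes, as the nCr counts imply but the text does not state outright, that every combination of 'tall' and 'short' genes is equally likely.

```python
from math import comb

N_GENES = 5  # number of nonallelic height genes in the example

# Number of ways of having k 'tall' genes out of 5, for k = 0..5 (nCr).
combinations = [comb(N_GENES, k) for k in range(N_GENES + 1)]
total = sum(combinations)

# Print a crude text histogram: one '#' per combination.
for tall, count in enumerate(combinations):
    short = N_GENES - tall
    print(f"{tall} tall, {short} short: {count:2d} {'#' * count}")

# Share of combinations giving the two middle heights (2 or 3 'tall' genes).
middle_share = (combinations[2] + combinations[3]) / total
print(f"share with 2 or 3 'tall' genes: {middle_share}")
```

The histogram is symmetrical and peaks in the middle, which is the beginning of the bell shape described above.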
An understanding of the characteristics of a normal curve is
central to your understanding of the whole of parametric statistics. Some
examples of normal curves are shown below.
Three distributions in which µ = 0 but
which have different standard deviations. Note how increasing the standard
deviation 'flattens' the curve and makes it more likely that observations will
be further from the mean. Since they have identical means, the three curves all have
their maximum height at the same value (µ).
Three distributions in which the standard deviations are identical
(σ = 1) but which differ in
their means (µ = 2, 4 and 6). This time the curves have the same profiles. Indeed, if the curves
could be 'slid' along the x axis they would overlap perfectly. Changing the mean
has not affected the variability; it has just affected the position of the curve
along the x axis.
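The effect of the two parameters on the curve can be checked numerically. The sketch below uses the standard textbook formula for the normal density, which the text refers to but does not give. The function name normal_pdf and the standard deviations 1, 2 and 3 are assumptions made for illustration; the means 2, 4 and 6 are those quoted above.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mean, sd):
    """Height of the normal curve at x for a given mean and standard deviation."""
    z = (x - mean) / sd
    return exp(-0.5 * z * z) / (sd * sqrt(2 * pi))


# Same mean (0), different standard deviations: the peak, which sits
# at the mean, gets lower as the standard deviation increases.
peaks = {sd: normal_pdf(0, 0, sd) for sd in (1, 2, 3)}
for sd, height in peaks.items():
    print(f"sd = {sd}: height at the mean = {height:.4f}")

# Same standard deviation (1), different means: the curves are identical
# apart from their position, so the height at each mean is the same.
shifted_peaks = [normal_pdf(m, m, 1) for m in (2, 4, 6)]
print(shifted_peaks)
```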
The normal distribution is very important because it allows
us to answer probability questions about many important biological variables.
Consider the following example. A normal population of 1000 body weights has a
mean of 70 g. This means that half of the weights will be less than 70 g while
the other half will be 70 g or more. This is obvious because the distribution is symmetrical
about the mean. However, suppose we wish to know what proportion of the
population is heavier than 80 g.
We can determine the answer to this question if we also know
the standard deviation of the population from which
the observation was obtained (this is because a normal distribution is
controlled by two parameters, µ and σ).
The method involves converting our normal distribution to a standard form in which
µ = 0 and σ = 1.
This is relatively simple since any set of values can be
made to have a mean of 0 if we calculate their original mean and then subtract
it from each of the observations: e.g. the mean of 3, 4 & 5 is 4; if we now
subtract 4 from each observation we have -1, 0 & 1, the mean of which is 0.
Similarly we can make the standard deviation equal to 1 if we divide these
deviations by the original standard deviation. If we do this to any set of data we convert from
the original units to something called Z scores. The reason for doing all
of this is that we will then be able to use one set of standard statistical
tables for all populations irrespective of the original values of µ and σ.
Z = (x_{i} - µ) / σ
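The standardisation just described can be sketched in Python using the 3, 4 & 5 example. The helper name z_scores is hypothetical, and the sketch assumes the values are treated as a complete population, so the population standard deviation is used.

```python
import statistics


def z_scores(values):
    """Convert values to Z scores: subtract the mean, divide by the sd."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    return [(x - mean) / sd for x in values]


observations = [3, 4, 5]
z = z_scores(observations)

print(z)                      # the Z scores, symmetrical about 0
print(statistics.fmean(z))    # mean of the Z scores
print(statistics.pstdev(z))   # standard deviation of the Z scores
```

Whatever the original units, the resulting Z scores have a mean of 0 and a standard deviation of 1 (apart from floating-point rounding). If the values were a sample rather than a whole population, statistics.stdev would be the more usual choice.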
Z is known as a normal deviate (sometimes symbolised
by D). The value of Z can be used to determine the proportion of
the population which has a value equal to or greater than the value of x (see
diagram above). In the body weight example, if we assume that σ = 10 then
Z = (80 - 70) / 10 = 1.0
From tables of standard normal deviates we find that Z
= 1.0 corresponds to an upper-tail probability of 0.1587. This means that 15.87% of the
population weighs more than 80 g.
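The table lookup can be checked with Python's standard library, which provides the normal distribution as statistics.NormalDist (Python 3.8 or later). This is a sketch of the body weight example using the mean of 70 g and the assumed standard deviation of 10 g; the variable names are illustrative.

```python
from statistics import NormalDist

MEAN_WEIGHT = 70.0      # population mean in g
SD_WEIGHT = 10.0        # population standard deviation in g (assumed in the text)
THRESHOLD = 80.0        # weight of interest in g
POPULATION_SIZE = 1000

# Convert the threshold to a Z score.
z = (THRESHOLD - MEAN_WEIGHT) / SD_WEIGHT

# Proportion of the standard normal distribution lying above Z.
upper_tail = 1.0 - NormalDist().cdf(z)

# The same proportion obtained without standardising first.
direct = 1.0 - NormalDist(MEAN_WEIGHT, SD_WEIGHT).cdf(THRESHOLD)

expected_count = upper_tail * POPULATION_SIZE

print(f"Z = {z}")
print(f"proportion heavier than 80 g: {upper_tail:.4f}")
print(f"expected number out of 1000: {expected_count:.0f}")
```

Both routes give the same proportion, which is why a single table of standard normal deviates is enough for any normal population.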
This page, with acknowledgement, from a web site on univariant statistics by Dr Alan Fielding BSc MSc PhD FLS FHEA, Senior Learning and Teaching Fellow, School of Biology, Chemistry and Health Science, Manchester Metropolitan University. Alan has a new site with information on monitoring and statistics. He may be contacted at alan@alanfielding.co.uk or via his web page.
