Sources of error in biological data
- Human Error - repeated readings and data checking should
identify these errors, e.g. 575 transposed to 757.
- Instrumentation Limitations and Rounding Errors - be aware of the
limitations (sensitivity) of your technique.
- Uncontrolled Factors - even the best designs will leave some variables
uncontrolled, e.g. experiments run on different days or at different times of the day.
- Unrepresentative Samples - this should not occur if the sampling is controlled; be certain of the limits of the population from which you are sampling.
- Statistical Fluctuations (random or experimental errors) - most statistical
tests assume that they exist.
- Systematic Errors - Is the equipment calibrated?
Does it measure what you think it is measuring? This type of error will
introduce unacceptable bias.
Random and experimental errors
Consider the following questions:
- What is the area (mm2) of a woodland, measured
from the following map?
- What is your resting heart rate (beats per minute)?
- How many fingers have you got(!) ?
Try repeating these measurements 5 times.
Did you obtain identical values?
It would be very surprising if all 5 area and heart rate
measurements were identical. Similarly, it would be very surprising if your
finger count varied. It is common for repeat measurements to differ from each
other. These represent measurement errors. 'We can never measure anything
exactly, although we can count exactly'.
Are the errors in the first two examples equivalent?
There is a fundamental difference in the two quantities
being measured. The woodland area would not change between measurements
(assuming that the map is immune to alterations in size due to changes in the
atmosphere). Your heart rate is however continuously fluctuating so we should
expect some between-reading variation. In the heart rate example the best that
we can hope for is an estimate of the average heart rate.
Thus, differences between our estimates of the woodland area
would be a consequence of slight variations in the measurement process which
should occur randomly. Sometimes we overestimate the area, sometimes we
underestimate it.
Differences in our heart rate estimates are a consequence of
two processes. Some of the differences are due to the measurement process
(random measurement errors), others are a consequence of the natural variation
in heart rate. These latter errors are due to experimental error.
Accuracy and precision
Unfortunately there is often confusion about the distinction
between these two terms. For example, the Collins English Dictionary (New
Edition) defines accuracy as 'faithful measurement or representation of the
truth; correctness; precision'. The two terms do not have the same
meanings when used in a scientific context.
- Accuracy is the closeness of an observation to its true value.
- Precision is the similarity between repeat readings.
Accepting that, in an ideal world, we should aim for both
accuracy and precision, which of the following is the most desirable in an
experiment? Assume that the true value for the measurement is 10 and we have two
sets of repeat measurements:
- Set 1 : 11.0, 11.1, 10.9, 11.0, 11.2, 11.1
- Set 2 : 9.5, 10.5, 9.1, 9.9, 10.9, 10.1
Set 2 is the most desirable because it has the greatest
accuracy. Set 1 has systematic errors even though the precision is high. Bias is
dangerous, particularly when combined with high precision. If we have
imprecision we can make use of statistical methods that attach confidence
limits to our estimate.
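The contrast between the two sets can be checked numerically. A minimal sketch (true value assumed to be 10, as above; the mean measures accuracy via bias, the standard deviation measures precision):

```python
# Illustrative only: quantify accuracy (bias) and precision (spread)
# for the two sets of repeat measurements discussed above.
from statistics import mean, stdev

TRUE_VALUE = 10.0
set1 = [11.0, 11.1, 10.9, 11.0, 11.2, 11.1]
set2 = [9.5, 10.5, 9.1, 9.9, 10.9, 10.1]

for name, data in [("Set 1", set1), ("Set 2", set2)]:
    bias = mean(data) - TRUE_VALUE   # accuracy: closeness to the true value
    spread = stdev(data)             # precision: similarity of repeat readings
    print(f"{name}: mean={mean(data):.2f} bias={bias:+.2f} sd={spread:.2f}")
```

Set 1 shows a large bias with a small standard deviation (precise but biased); Set 2 shows essentially no bias with a larger standard deviation (accurate but imprecise).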
In any experiment you should aim to ensure that systematic
errors are a fraction of the random errors. The only way to achieve this aim is
by careful standardisation of equipment and techniques.
Rounding
Rounding a number involves the discarding of information, in
the form of digits. Sometimes this is acceptable and even desirable, at other
times rounding may introduce errors. Whether, and by how much, you should round
depends on the number of significant figures that you wish to work with.
Unless you are working with small discrete numbers (e.g.
number of eggs in a nest) the way in which you record a number implies the
precision to which the observation was recorded. Thus, if you record two
observations as 11 and 11.0 you are implying that they were recorded at
different levels of precision. Recording the observation as 11 implies that the
true value lies somewhere between 10.5 & 11.5, whereas recording the
observation as 11.0 implies a value between 10.95 & 11.05.
Note that decimal places and significant figures are not
equivalent. Decimal places is the number of digits following the decimal point.
Significant figures is the number of digits excluding leading zeros and, in whole numbers, trailing zeros that serve only as placeholders.
|recorded value|implied limits |
|26|25.5 - 26.5|
|26.002|26.0015 - 26.0025|
|26000|25500 - 26500|
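The implied limits can be derived mechanically: they sit half of the last recorded unit either side of the value. The helper below is a hypothetical illustration (not from the original text), treating trailing zeros in whole numbers as placeholders:

```python
# Hypothetical helper: implied limits of a recorded value, i.e. half of
# the last recorded unit either side of the value as written down.
def implied_limits(recorded: str) -> tuple[float, float]:
    if "." in recorded:
        decimals = len(recorded.split(".")[1])
        half_unit = 0.5 * 10 ** -decimals
    else:
        # Trailing zeros in a whole number are treated as placeholders here.
        trailing = len(recorded) - len(recorded.rstrip("0"))
        half_unit = 0.5 * 10 ** trailing
    value = float(recorded)
    return value - half_unit, value + half_unit

print(implied_limits("26"))      # (25.5, 26.5)
print(implied_limits("26.002"))  # approximately (26.0015, 26.0025)
print(implied_limits("26000"))   # (25500.0, 26500.0)
```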
If you round off numbers too early in a series of
calculations you can introduce errors. For example, consider the numbers:
10.2 11.1 9.3 10.1 9.2
The sum of these numbers is 49.9, but if you rounded them
before the summation you would have:
10 11 9 10 9
which sum to 49.
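The same example, verified in code: round once at the end rather than rounding each value before summing.

```python
# Rounding before summation discards information and shifts the total.
values = [10.2, 11.1, 9.3, 10.1, 9.2]

exact_sum = round(sum(values), 1)              # sum first, round once
rounded_first = sum(round(v) for v in values)  # round each value, then sum

print(exact_sum)      # 49.9
print(rounded_first)  # 49
```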
Outliers and Mistakes
When you inspect your data, prior to more formal analyses,
you may find that one or more values is far from the majority of observations.
These unusual observations are called outliers. There does not appear to
be a universally agreed definition of the term 'outlier'. Some, but by no means
all, of these outliers will be a consequence of mistakes. Even the most careful
workers occasionally make mistakes. The trick is being able to recognise the
mistakes and deal with them appropriately.
The informal definition of an outlier relates the value of
an observation to others from the same data set. Outliers are, therefore,
characterised by their distance from other observations, but how distant does an
observation need to be before it is classed as an outlier? There are several
statistical methods for detecting outliers; consequently, the concept of a
statistical outlier is closely linked to the amount of variability in a
set of data. Three methods for outlier identification are described below.
- Univariate samples : box plots
- Bivariate data : regression residuals
- Multivariate data : PCA. A multivariate outlier need not
be extreme in any of its components. Generally, outliers are sought
among the extreme values; while this notion is easy to define in the
univariate case, it is not so straightforward in the multivariate case.
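For the univariate case, a sketch of the usual box-plot rule: flag values more than 1.5 interquartile ranges beyond the quartiles. The data values are invented, and `statistics.quantiles` may give slightly different quartile estimates than other packages:

```python
# Box-plot (IQR) rule for univariate outliers: a value is flagged if it
# lies more than 1.5 * IQR below Q1 or above Q3.
from statistics import quantiles

def iqr_outliers(data):
    q1, _, q3 = quantiles(data, n=4)      # quartiles (exclusive method)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

readings = [9.5, 10.5, 9.1, 9.9, 10.9, 10.1, 25.0]  # 25.0 is a suspect value
print(iqr_outliers(readings))  # [25.0]
```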
What should you do with outliers?
It depends how the outlier achieved its unusual value. There
are a number of possibilities:
- Was the value entered incorrectly? There is no substitute for careful checking. Mistakes often
occur when data are being entered or copied. Try to reduce the number of
occasions that this happens. Beware of transcription errors such as double
reading, omission and transposition (e.g. 45 becomes 54). If it was due to a
data entry error, make the correction.
- Was there anything unusual about the way in which this
observation was obtained? For example, was it collected by another researcher, did
someone else make up the standards, was there a problem with the water supply?
If you suspect that an observation was associated with an unusual event then you
can probably delete it.
- Is it a real value, from an unusual case? If the outlier was not a consequence of a mistake then you
must assume that its value is correct. Unusual observations can provide
interesting biological information, so do not dismiss them. It may be possible to
apply a transformation which will remove the
problem. If this is not possible, you may wish to investigate its effect on any
statistical analysis. For example, outliers on the x-axis of a regression
analysis may have a disproportionate effect (leverage) on the
slope and intercept of that line. Typically, you should carry out the analysis
with and without the outliers present.
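This leverage effect can be demonstrated with made-up numbers. The `fit_line` helper below is illustrative (ordinary least squares written out by hand, not from the text); the data follow a line of slope roughly 2 except for one extreme point:

```python
# Fit an ordinary least-squares line with and without an extreme point
# to show how a single high-leverage outlier drags the slope.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx   # (slope, intercept)

xs = [1, 2, 3, 4, 5, 10]
ys = [2.1, 3.9, 6.0, 8.1, 9.9, 5.0]   # the last point is the outlier

print(fit_line(xs, ys))             # slope pulled well below 2 by the outlier
print(fit_line(xs[:-1], ys[:-1]))   # close to slope 2, intercept 0
```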
Residuals
If you make predictions (for example using
a regression equation or standard curve) or estimates, there are often
differences between the actual and the predicted values. These differences are
called residuals. In general, residuals are
thought to be a consequence of experimental error:
Residual = Actual value - Predicted value
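The definition translates directly into code (the values below are invented for illustration):

```python
# Residual = actual value - predicted value, computed pairwise.
actual    = [5.1, 7.2, 9.0]
predicted = [5.0, 7.5, 9.1]   # e.g. read off a standard curve

residuals = [round(a - p, 2) for a, p in zip(actual, predicted)]
print(residuals)  # [0.1, -0.3, -0.1]
```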
Many statistical analyses make assumptions about the
residuals. In particular they are important in:
- χ² association analysis
- regression analysis
- analysis of variance
If you make a prediction about the category to which a case
belongs you may get this prediction correct or incorrect. For example, suppose
that you are predicting a person's sex (without using the more obvious
features!) then your prediction will be either male or female. The accuracy of your
predictions is summarised in a confusion matrix. For example:
|Actual gender /