Sources of error in biological data
- Human Error: repeated readings and data checking should identify these errors, e.g. 575 transposed to 757.
- Instrumentation Limitation and Rounding Errors: be aware of the limitations (sensitivity) of your technique.
- Uncontrolled Factors: even the best designs will leave some variables uncontrolled, e.g. experiments run on different days or at different times of the same day.
- Unrepresentative Samples: these should not occur if the sampling is controlled; be certain of the limits of the population from which you are sampling.
- Statistical Fluctuations (random or experimental errors): most statistical tests assume that these exist.
- Systematic Errors: is the equipment calibrated? Does it measure what you think it is measuring? This type of error will introduce unacceptable bias.
Measurement Errors
Random and experimental errors
Consider the following questions:
- What is the area (mm²) of a woodland, measured from the following map?
- What is your resting heart rate (beats per minute)?
- How many fingers have you got(!)?
Try repeating these measurements 5 times.
Did you obtain identical values?
It would be very surprising if all 5 area and heart rate
measurements were identical. Similarly, it would be very surprising if your
finger count varied. It is common for repeat measurements to differ from each
other. These represent measurement errors. 'We can never measure anything
exactly, although we can count exactly'.
Are the errors in the first two examples equivalent?
There is a fundamental difference in the two quantities
being measured. The woodland area would not change between measurements
(assuming that the map is immune to alterations in size due to changes in the
atmosphere). Your heart rate, however, is continuously fluctuating, so we should
expect some between-reading variation. In the heart rate example the best that
we can hope for is an estimate of the average heart rate.
Thus, differences between our estimates of the woodland area
would be a consequence of slight variations in the measurement process which
should occur randomly. Sometimes we overestimate the area, sometimes we
underestimate it.
Differences in our heart rate estimates are a consequence of
two processes. Some of the differences are due to the measurement process
(random measurement errors), others are a consequence of the natural variation
in heart rate. These latter errors are due to experimental
error.
Unfortunately there is often confusion about the distinction between the terms
accuracy and precision. For example, the Collins English Dictionary (New
Edition) defines accuracy as 'faithful measurement or representation of the
truth; correctness; precision'. They do not have the same meanings when used in
a scientific context.
- Accuracy is the closeness of an observation to its true value.
- Precision is the similarity between repeat readings.
Accepting that, in an ideal world, we should aim for both accuracy and
precision, which of the following is the more desirable in an experiment?
Assume that the true value for the measurement is 10 and we have two sets of
repeat measurements:
- Set 1: 11.0, 11.1, 10.9, 11.0, 11.2, 11.1
- Set 2: 9.5, 10.5, 9.1, 9.9, 10.9, 10.1
Set 2 is the more desirable because it has the greater accuracy. Set 1 has
systematic errors even though the precision is high. Bias is dangerous,
particularly when combined with high precision. If we have imprecision we can
make use of statistical methods that attach confidence limits to the accuracy
of our estimate.
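These two properties can be quantified: the bias (distance of the mean from the true value) reflects accuracy, and the standard deviation of the repeats reflects precision. A minimal sketch in Python, using the two sets above:

```python
import statistics

true_value = 10.0
set1 = [11.0, 11.1, 10.9, 11.0, 11.2, 11.1]
set2 = [9.5, 10.5, 9.1, 9.9, 10.9, 10.1]

for name, data in [("Set 1", set1), ("Set 2", set2)]:
    bias = statistics.mean(data) - true_value  # low bias = high accuracy
    spread = statistics.stdev(data)            # low spread = high precision
    print(f"{name}: bias = {bias:+.2f}, spread = {spread:.2f}")
```

Set 1 comes out precise (spread about 0.10) but biased by about +1.05; Set 2 is essentially unbiased but far less precise (spread about 0.65), matching the discussion above.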
In any experiment you should aim to ensure that systematic errors are only a
small fraction of the random errors. The only way to achieve this aim is by
careful standardisation of equipment and techniques.
Rounding Errors
Rounding a number involves the discarding of information, in the form of
digits. Sometimes this is acceptable and even desirable; at other times
rounding may introduce errors. Whether, and by how much, you should round
depends on the number of significant figures that you wish to work with.
Unless you are working with small discrete numbers (e.g.
number of eggs in a nest) the way in which you record a number implies the
precision to which the observation was recorded. Thus, if you record two
observations as 11 and 11.0 you are implying that they were recorded at
different levels of precision. Recording the observation as 11 implies that the
true value lies somewhere between 10.5 & 11.5, whereas recording the
observation as 11.0 implies a value between 10.95 & 11.05.
Note that decimal places and significant figures are not equivalent. Decimal
places is the number of digits following the decimal point. Significant
figures is the number of digits excluding leading zeros and, in whole numbers,
trailing zeros.
Some examples:

number    significant figures    decimal places    implied limits
26        2                      0                 25.5 - 26.5
26.002    5                      3                 26.0015 - 26.0025
26000     2                      0                 25500 - 26500
0.003     1                      3                 0.0025 - 0.0035
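The implied limits follow mechanically from the position of the last recorded digit: the true value lies within half of that digit's place value either side. A small sketch using Python's decimal module (note that trailing zeros in whole numbers such as 26000 are ambiguous, so this only behaves sensibly for values written with an explicit decimal point or without trailing zeros):

```python
from decimal import Decimal

def implied_limits(recorded):
    """Implied limits of a recorded value: the true value lies within
    half of the last recorded digit's place value either side."""
    d = Decimal(recorded)
    half = Decimal(1).scaleb(d.as_tuple().exponent) / 2
    return d - half, d + half

for s in ["11", "11.0", "26.002", "0.003"]:
    low, high = implied_limits(s)
    print(f"{s}: {low} - {high}")
```

Recording 11 gives limits 10.5 to 11.5, while 11.0 gives 10.95 to 11.05, exactly as described above.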
If you round off numbers too early in a series of calculations you can
introduce errors. For example, consider the numbers:
10.2, 11.1, 9.3, 10.1, 9.2
The sum of these numbers is 49.9, but if you rounded them before the summation
you would have:
10, 11, 9, 10, 9
which sum to 49.
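The effect is easy to demonstrate; a quick check in Python with the numbers above:

```python
values = [10.2, 11.1, 9.3, 10.1, 9.2]

exact_sum = round(sum(values), 1)              # sum first, round last: 49.9
rounded_first = sum(round(v) for v in values)  # round first: 10+11+9+10+9 = 49
print(exact_sum, rounded_first)
```

Rounding last loses nothing; rounding first has already discarded 0.9 of the total.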
Outliers and Mistakes
When you inspect your data, prior to more formal analyses,
you may find that one or more values is far from the majority of observations.
These unusual observations are called outliers. There does not appear to
be a universally agreed definition of the term 'outlier'. Some, but by no means
all, of these outliers will be a consequence of mistakes. Even the most careful
workers occasionally make mistakes. The trick is being able to recognise the
mistakes and deal with them appropriately.
Identifying outliers
The informal definition of an outlier relates the value of
an observation to others from the same data set. Outliers are, therefore,
characterised by their distance from other observations, but how distant does an
observation need to become before it is classed as an outlier? There are several
statistical methods for detecting outliers, consequently the concept of a
statistical outlier is closely linked to the amount of variability in a
set of data. Three methods for outlier identification are described.
- Univariate samples: box plots
- Bivariate data: regression residuals
- Multivariate data: PCA. A multivariate outlier need not be extreme in any of
its components. Generally, outliers are sought among the extreme values; while
this notion is easy to define in the univariate case, it is not so
straightforward in the multivariate case.
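For the univariate case, the box-plot rule can be written down directly: flag any value more than 1.5 times the interquartile range beyond the quartiles. A sketch (quartile conventions differ between packages; this uses Python's statistics.quantiles with its default 'exclusive' method):

```python
import statistics

def boxplot_outliers(data):
    """Return values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the box-plot rule)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

print(boxplot_outliers([9.5, 10.5, 9.1, 9.9, 10.9, 10.1, 25.0]))  # flags 25.0
```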
What should you do with outliers?
It depends how the outlier achieved its unusual value. There
are a number of possibilities:
- Was a value entered incorrectly? There is no substitute for careful checking.
Mistakes often occur when data are being entered or copied. Try to reduce the
number of occasions on which this happens. Beware of transcription errors such
as double reading, omission and transposition (e.g. 45 becomes 54). If it was
due to a data entry error, make the correction.
- Was there anything unusual about the way in which this observation was
obtained? For example, was it collected by another researcher, did someone
else make up the standards, was there a problem with the water supply? If you
suspect that an observation was associated with an unusual event then you can
probably delete it.
- Is it a real value, from an unusual case? If the outlier was not a
consequence of a mistake then you must assume that its value is correct.
Unusual observations can provide interesting biological information, so do not
dismiss them. It may be possible to apply a transformation which will remove
the problem. If this is not possible, you may wish to investigate their effect
on any statistical analysis. For example, outliers on the x-axis of a
regression analysis may have a disproportionate effect on the slope and
intercept of the fitted line. Typically, you should carry out the analysis
with and without the outliers present.
Prediction errors
If you make predictions (for example using
a regression equation or standard curve) or estimates, there are often
differences between the actual and the predicted values. These differences are
called residuals. In general residuals are
thought to be a consequence of experimental
error.
Residual = Actual value - Predicted value
Many statistical analyses make assumptions about the
residuals. In particular they are important in:
- χ² (chi-squared) association analysis
- regression analysis
- analysis of variance
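As a concrete sketch, suppose a standard curve has been fitted (here an illustrative line, predicted = 2x + 1; both the line and the data are invented for the example). The residuals are then simply actual minus predicted:

```python
# Illustrative standard curve: predicted = 2*x + 1 (assumed for this example)
xs = [1, 2, 3, 4]
actual = [3.1, 4.9, 7.2, 9.0]

predicted = [2 * x + 1 for x in xs]
residuals = [round(a - p, 2) for a, p in zip(actual, predicted)]
print(residuals)  # [0.1, -0.1, 0.2, 0.0]
```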
Classification errors
If you make a prediction about the category to which a case
belongs you may get this prediction correct or incorrect. For example, suppose
that you are predicting a person's sex (without using the more obvious
features!) your prediction will be either male or female. The accuracy of your
predictions is summarised in a confusion matrix. For example:
Actual \ Predicted    Male            Female
Male                  70 (correct)    20 (errors)
Female                30 (errors)     80 (correct)
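From such a matrix, the overall accuracy of the predictions is the proportion of cases on the diagonal. With the illustrative counts above:

```python
# Confusion matrix counts from the table above: (actual, predicted) -> cases
confusion = {
    ("male", "male"): 70, ("male", "female"): 20,
    ("female", "male"): 30, ("female", "female"): 80,
}

total = sum(confusion.values())
correct = sum(n for (actual, predicted), n in confusion.items()
              if actual == predicted)
print(f"accuracy = {correct}/{total} = {correct / total:.0%}")  # 150/200 = 75%
```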