Environmental Monitoring

Sources of error in biological data

  • Human Error - repeated readings and data checking should identify these errors, e.g. 575 transposed to 757.
  • Instrumentation Limitation
  • Rounding errors - be aware of the limitations (sensitivity) of your technique.
  • Uncontrolled Factors - even the best designs will leave some variables uncontrolled, e.g. experiments on different days or different times of the same day.
  • Unrepresentative Samples - this should not occur if the sampling is controlled, be certain of the limits of the population from which you are sampling.
  • Statistical Fluctuations (random or experimental errors) - most statistical tests assume that these exist.
  • Systematic Errors - Is the equipment calibrated? Does it measure what you think it is measuring? This type of error will introduce unacceptable bias.

Measurement Errors

Random and experimental errors

Consider the following questions:

  • What is the area (mm2) of a woodland, measured from the following map?
  • What is your resting heart rate (beats per minute)?
  • How many fingers have you got(!)?

[Figure: woodland map]

Try repeating these measurements 5 times.

Did you obtain identical values?

It would be very surprising if all 5 area and heart rate measurements were identical. Similarly, it would be very surprising if your finger count varied. It is common for repeat measurements to differ from each other. These represent measurement errors. 'We can never measure anything exactly, although we can count exactly'.

Are the errors in the first two examples equivalent?

There is a fundamental difference in the two quantities being measured. The woodland area would not change between measurements (assuming that the map is immune to alterations in size due to changes in the atmosphere). Your heart rate is however continuously fluctuating so we should expect some between-reading variation. In the heart rate example the best that we can hope for is an estimate of the average heart rate.

Thus, differences between our estimates of the woodland area would be a consequence of slight variations in the measurement process which should occur randomly. Sometimes we overestimate the area, sometimes we underestimate it.

Differences in our heart rate estimates are a consequence of two processes. Some of the differences are due to the measurement process (random measurement errors), others are a consequence of the natural variation in heart rate. These latter errors are due to experimental error.
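The two error sources can be separated in a small simulation (a hypothetical Python sketch; the quantities and spreads are invented for illustration). Repeat readings of a fixed quantity vary only through random measurement error, while repeat readings of a fluctuating quantity vary through measurement error plus natural variation:

```python
import random

random.seed(1)  # reproducible illustration

TRUE_AREA = 1250.0  # hypothetical woodland area (mm^2), fixed between readings

# Fixed quantity: repeat readings differ only by random measurement error.
area_readings = [TRUE_AREA + random.gauss(0, 12) for _ in range(5)]

# Fluctuating quantity: the true value itself drifts between readings
# (experimental error), on top of the measurement error.
heart_readings = []
for _ in range(5):
    true_rate = 70 + random.gauss(0, 4)                    # natural variation
    heart_readings.append(true_rate + random.gauss(0, 1))  # measurement error

print([round(x, 1) for x in area_readings])
print([round(x, 1) for x in heart_readings])
```

Both lists vary between readings, but only the heart rate readings contain variation that no amount of careful measurement could remove.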

Accuracy and precision of measurements

Unfortunately there is often confusion about the distinction between these two terms. For example, the Collins English Dictionary (New Edition) defines accuracy as 'faithful measurement or representation of the truth; correctness; precision'. They do not have the same meanings when used in a scientific context.

  • Accuracy is the closeness of an observation to its true value.
  • Precision is the similarity between repeat readings.

Accepting that, in an ideal world, we should aim for both accuracy and precision, which of the following is the most desirable in an experiment? Assume that the true value for the measurement is 10 and we have two sets of repeat measurements:

  • Set 1 : 11.0, 11.1, 10.9, 11.0, 11.2, 11.1
  • Set 2 : 9.5, 10.5, 9.1, 9.9, 10.9, 10.1

Set 2 is the most desirable because it has the greatest accuracy. Set 1 has a systematic error even though its precision is high. Bias is dangerous, particularly when combined with high precision. If our measurements are merely imprecise we can use statistical methods that attach confidence limits to our estimate.
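A quick way to see the distinction is to compute the bias (accuracy) and spread (precision) of each set. A sketch using Python's standard library:

```python
from statistics import mean, stdev

TRUE_VALUE = 10.0
set1 = [11.0, 11.1, 10.9, 11.0, 11.2, 11.1]
set2 = [9.5, 10.5, 9.1, 9.9, 10.9, 10.1]

for name, data in (("Set 1", set1), ("Set 2", set2)):
    bias = mean(data) - TRUE_VALUE   # accuracy: closeness to the true value
    spread = stdev(data)             # precision: similarity of repeat readings
    print(f"{name}: bias = {bias:+.2f}, spread = {spread:.2f}")
```

Set 1 has a large bias with a small spread (precise but inaccurate); Set 2 is unbiased on average but less precise.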

In any experiment you should aim to ensure that systematic errors are a small fraction of the random errors. The only way to achieve this is by careful standardisation of equipment and techniques.

Rounding Errors

Rounding a number involves the discarding of information, in the form of digits. Sometimes this is acceptable and even desirable, at other times rounding may introduce errors. Whether, and by how much, you should round depends on the number of significant figures that you wish to work with.

Unless you are working with small discrete numbers (e.g. number of eggs in a nest) the way in which you record a number implies the precision to which the observation was recorded. Thus, if you record two observations as 11 and 11.0 you are implying that they were recorded at different levels of precision. Recording the observation as 11 implies that the true value lies somewhere between 10.5 & 11.5, whereas recording the observation as 11.0 implies a value between 10.95 & 11.05.

Note that decimal places and significant figures are not equivalent. Decimal places is the number of digits following the decimal point. Significant figures is the number of digits excluding leading zeros and, in whole numbers, placeholder trailing zeros (so 26000 has two significant figures, while 11.0 has three).

Some examples

number    significant figures    decimal places    implied limits
26                 2                    0           25.5 - 26.5
26.002             5                    3           26.0015 - 26.0025
26000              2                    0           25500 - 26500
0.003              1                    3           0.0025 - 0.0035
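The implied limits can be generated mechanically. The sketch below (a hypothetical helper, stdlib only) takes the number as written, i.e. as a string, because the written form is what carries the precision; it follows the convention above that trailing zeros in whole numbers are placeholders:

```python
def implied_limits(s):
    """Half-unit limits implied by how a positive number is written.

    The last recorded digit sets the resolution; the true value is
    assumed to lie within half that resolution either side.
    """
    value = float(s)
    if "." in s:
        step = 10 ** -len(s.split(".")[1])      # e.g. "26.002" -> 0.001
    else:
        stripped = s.rstrip("0")                # "26000" -> "26"
        step = 10 ** (len(s) - len(stripped))   # placeholder zeros -> 1000
    return value - step / 2, value + step / 2

print(implied_limits("26"))      # (25.5, 26.5)
print(implied_limits("26000"))   # (25500.0, 26500.0)
```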

If you round off numbers too early in a series of calculations you can introduce errors. For example consider the numbers:

10.2 11.1 9.3 10.1 9.2

The sum of these numbers is 49.9, but if you rounded them before the summation you would have:

10 11 9 10 9

which sum to 49.
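The effect is easy to reproduce with the numbers above (Python sketch):

```python
values = [10.2, 11.1, 9.3, 10.1, 9.2]

early = sum(round(v) for v in values)  # round first: information discarded
late = round(sum(values))              # sum first, round once at the end

print(early, late)
```

Rounding early gives 49; keeping full precision until the end gives 49.9, which rounds to 50, so early rounding has shifted the final result by a whole unit.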

Outliers and Mistakes

When you inspect your data, prior to more formal analyses, you may find that one or more values lie far from the majority of observations. These unusual observations are called outliers. There does not appear to be a universally agreed definition of the term 'outlier'. Some, but by no means all, outliers will be a consequence of mistakes. Even the most careful workers occasionally make mistakes. The trick is being able to recognise the mistakes and deal with them appropriately.

Identifying outliers

The informal definition of an outlier relates the value of an observation to the others from the same data set. Outliers are therefore characterised by their distance from other observations, but how distant does an observation need to be before it is classed as an outlier? Several statistical methods exist for detecting outliers, and the concept of a statistical outlier is closely linked to the amount of variability in a set of data. Three methods for outlier identification are described below.

  • Univariate samples : Box plots
  • Bivariate data : Regression residuals
  • Multivariate data : PCA. A multivariate outlier need not be extreme in any single component. Outliers are generally sought among the extreme values; this notion is easy to define in the univariate case, but it is not so straightforward in the multivariate case.
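For the univariate case, the box-plot rule can be applied numerically: points beyond the whisker 'fences' at 1.5 times the interquartile range are flagged. A sketch with invented data, using Python's `statistics.quantiles`:

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    """Flag points beyond the Tukey fences used for box-plot whiskers."""
    q1, _, q3 = quantiles(data, n=4)   # lower quartile, median, upper quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.1, 17.5]
print(iqr_outliers(readings))  # [17.5]
```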

What should you do with outliers?

It depends how the outlier achieved its unusual value. There are a number of possibilities:

  • Was the value entered incorrectly? There is no substitute for careful checking. Mistakes often occur when data are being entered or copied, so try to reduce the number of occasions on which this happens. Beware of transcription errors such as double reading, omission and transposition (e.g. 45 becomes 54). If the outlier was due to a data entry error, make the correction.
  • Was there anything unusual about the way in which this observation was obtained? For example, was it collected by another researcher, did someone else make up the standards, was there a problem with the water supply? If you suspect that an observation was associated with an unusual event then you can probably delete it.
  • Is it a real value, from an unusual case? If the outlier was not a consequence of a mistake then you must assume that its value is correct. Unusual observations can provide interesting biological information, so do not dismiss them. It may be possible to apply a transformation which removes the problem. If this is not possible, you may wish to investigate the outlier's effect on any statistical analysis. For example, extreme points in a regression analysis may have a disproportionate effect on the slope and intercept of the fitted line. Typically, you should carry out the analysis with and without the outliers present.
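The regression point is worth demonstrating: refitting with and without a single extreme point shows how far one observation can move the slope. A sketch with invented data and a stdlib-only least-squares fit:

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

xs = [1, 2, 3, 4, 5, 10]
ys = [2.1, 4.0, 6.2, 7.9, 10.1, 3.0]   # the final point is the outlier

slope_all, _ = fit_line(xs, ys)
slope_clean, _ = fit_line(xs[:-1], ys[:-1])
print(f"slope with outlier: {slope_all:.2f}, without: {slope_clean:.2f}")
```

Here a clear trend of roughly 2 units per step almost disappears when the single extreme point is included, which is why the analysis should be run both ways.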

Prediction errors

If you make predictions (for example using a regression equation or standard curve) or estimates, there are often differences between the actual and the predicted values. These differences are called residuals. In general, residuals are regarded as a consequence of experimental error.

Residual = Actual value - Predicted value

Many statistical analyses make assumptions about the residuals. In particular they are important in:

  • χ² (chi-squared) association analysis
  • regression analysis
  • analysis of variance
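Computing residuals is straightforward once a prediction rule exists. A sketch, assuming a hypothetical standard curve y = 2x + 1 and invented observations:

```python
def predict(x):
    return 2 * x + 1   # hypothetical standard curve

observations = [(1, 3.2), (2, 4.9), (3, 7.1), (4, 8.8)]  # (x, actual y)

# Residual = actual value - predicted value
residuals = [y - predict(x) for x, y in observations]
print([round(r, 2) for r in residuals])  # [0.2, -0.1, 0.1, -0.2]
```

Residuals that scatter randomly around zero, as here, are consistent with pure experimental error; a pattern in the residuals would suggest the prediction rule is wrong.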

Classification errors

If you make a prediction about the category to which a case belongs, the prediction may be correct or incorrect. For example, suppose that you are predicting a person's sex (without using the more obvious features!); your prediction will be either male or female. The accuracy of your predictions is summarised in a confusion matrix. For example:

Actual \ Predicted    Male            Female
Male                  70 (correct)    20 (errors)
Female                30 (errors)     80 (correct)
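Summary rates fall straight out of the matrix. A sketch reproducing the table above:

```python
# (actual, predicted) -> count, from the confusion matrix above
matrix = {
    ("male", "male"): 70,   ("male", "female"): 20,
    ("female", "male"): 30, ("female", "female"): 80,
}

total = sum(matrix.values())
correct = matrix[("male", "male")] + matrix[("female", "female")]
accuracy = correct / total

print(f"overall accuracy = {accuracy:.0%}")  # 150 of 200 predictions correct
```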

