Environmental Monitoring

The simplest form of exploratory data analysis (EDA) is a form of data cleansing or clarification.


Why bother?

When you have collected your data it would be helpful if:

  • you removed all the mistakes (mis-entered numbers etc.)
  • you could see the 'wood for the trees'.

EDA provides tools that can be used for both purposes. There are a number of graphical methods that allow you to visualize your data. Four simple approaches to EDA are described below.

Descriptive statistics

Data that have a symmetrical distribution tend to have

  • means, medians and modes that are similar
  • means that are larger than the standard deviation
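As a rough check of the first point, the three measures can be compared directly. The sketch below uses only Python's standard `statistics` module and the small eleven-value data set from the box and whisker example on this page; the variable names are illustrative.

```python
import statistics

# Small example data set (the one used for the box and whisker example).
values = [11, 23, 23, 30, 34, 35, 38, 43, 43, 49, 50]

mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)   # every value tied for most frequent
sd = statistics.stdev(values)          # sample standard deviation (n - 1)

print(f"mean   = {mean:.2f}")
print(f"median = {median}")
print(f"modes  = {modes}")
print(f"sd     = {sd:.2f}")
```

For these values the mean (about 34.5) lies slightly below the median (35) and there are two modes (23 and 43), so the three measures do not coincide as they would for a perfectly symmetrical distribution.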

Confidence intervals can be used to construct error bars. These can be used to provide a quick and convenient visualisation of the relative variability and symmetry of a number of samples. Note that a plot of a mean ±2 standard errors is not equivalent to a plot of 95% confidence intervals. It is a close approximation if the sample size is large (>30). The example below shows 95% confidence intervals for the average girth of trees at sampling points on three Indonesian islands (data from Dr. S. Marsden).
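The difference between mean ±2 standard errors and a 95% confidence interval can be sketched as follows. This is a minimal illustration, not the method used to draw the plot: it assumes the interval is based on Student's t distribution, hard-codes a few critical values from standard t tables so that only the standard library is needed, and uses the eleven-value example data set from this page rather than the tree girth data.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t for a few degrees of freedom,
# copied from standard tables (hard-coded to avoid external libraries).
T_CRIT_95 = {4: 2.776, 9: 2.262, 10: 2.228, 29: 2.045}

def mean_with_half_widths(values):
    """Return the mean, the 2-standard-error half-width and the t-based 95% half-width."""
    n = len(values)
    se = statistics.stdev(values) / math.sqrt(n)   # standard error of the mean
    t = T_CRIT_95[n - 1]                           # KeyError if df is not in the table
    return statistics.mean(values), 2 * se, t * se

values = [11, 23, 23, 30, 34, 35, 38, 43, 43, 49, 50]
mean, two_se, ci95 = mean_with_half_widths(values)
print(f"mean = {mean:.2f}")
print(f"mean ± 2 SE       : {mean - two_se:.2f} to {mean + two_se:.2f}")
print(f"95% CI (t, df=10) : {mean - ci95:.2f} to {mean + ci95:.2f}")
```

With 11 values the multiplier is 2.228 rather than 2, so the t-based interval is about 11% wider. By 30 values (29 degrees of freedom) the multiplier has fallen to 2.045, which is why the approximation is close for large samples.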

C.I. example plot, tree girth on 3 islands

Box and Whisker Plots

This is a pictorial combination of descriptive measures. It conveys more information than an error bar plot. We need five numbers:

  1. median (middle value in the ordered data set);
  2. 1st quartile (see below);
  3. 3rd quartile (see below);
  4. smallest value;
  5. largest value (the smallest and largest values give the range).

Consider the following simple data set:

11,23,23,30,34,35,38,43,43,49,50.

  • median = 35;
  • 1st quartile = 23;
  • 3rd quartile = 43;
  • range = 11 to 50.

What is a quartile? An ordered data set can be split into four quarters. The first split is at the median, which creates two halves (of 5 values each in the above example, once the median itself is set aside). Each half is then subdivided at its own median to give the quarters.
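The 'median of each half' rule can be sketched in a few lines of Python. The function name `five_number_summary` is illustrative, and the handling of an even number of values (splitting the data into two equal halves) is an assumption, since the text only shows an odd-sized example. Other software may use slightly different quartile rules.

```python
from statistics import median

def five_number_summary(values):
    """Return (smallest, 1st quartile, median, 3rd quartile, largest)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # For an odd number of values the median itself is left out of both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

values = [11, 23, 23, 30, 34, 35, 38, 43, 43, 49, 50]
print(five_number_summary(values))
```

For the example data set this gives 11, 23, 35, 43 and 50, matching the values listed above.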

    example box and whisker plot

    The quartiles are called hinges and mark the ends of the box. The whiskers stretch from the hinges to the extremes. The position of the median is marked in the box. An example using the previous data set is shown above. Note that the plot is not symmetrical: the median is not in the middle of the box and the whiskers are of unequal length. The longer lower whisker indicates a distribution that is skewed to the left.

    It is possible to make the display slightly more complicated by setting up inner and outer 'fences'. Minitab uses this approach. Let H-spread = upper hinge - lower hinge. The inner fences are set at 1.5*(H-spread) outside of the hinges and the outer fences at 3*(H-spread) outside of the hinges. The advantage of this is that outliers, i.e. values lying beyond the fences, can be more easily identified.
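The fence arithmetic can be sketched as below. The helper names `fences` and `outside` are illustrative, the outer fences are assumed to be measured outwards from the hinges in the same way as the inner ones, and the value 80 is made up purely to show a point falling beyond a fence.

```python
def fences(lower_hinge, upper_hinge):
    """Return ((inner low, inner high), (outer low, outer high))."""
    h_spread = upper_hinge - lower_hinge
    inner = (lower_hinge - 1.5 * h_spread, upper_hinge + 1.5 * h_spread)
    outer = (lower_hinge - 3 * h_spread, upper_hinge + 3 * h_spread)
    return inner, outer

def outside(values, limits):
    """Return the values lying beyond a (low, high) pair of fences."""
    low, high = limits
    return [v for v in values if v < low or v > high]

values = [11, 23, 23, 30, 34, 35, 38, 43, 43, 49, 50]
inner, outer = fences(23, 43)            # hinges of the example data set
print("inner fences:", inner)
print("outer fences:", outer)
print("beyond inner fences:", outside(values, inner))
print("with a made-up value of 80:", outside(values + [80], inner))
```

With hinges of 23 and 43 the H-spread is 20, so the inner fences fall at -7 and 73 and the outer fences at -37 and 103. None of the eleven example values lies beyond the inner fences.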

    The example below uses the same tree data as the C.I. example. Note that we now have some information about outliers and distributional skew.

    boxplots of tree girth data from 3 islands

Stem and Leaf Plots

    Any number can be split into 2 components. The initial digits are the stem, the final digit is the leaf. For example

    • 34 has a stem of 3 and a leaf of 4,
    • 654 has a stem of 65 and a leaf of 4.

    Using this concept a numerical equivalent of a histogram can be created. In the example below the numbers summarised are:

    11,23,23,30,34,35,38,43,43,49,50.

    Stem and leaf plot

     1  1  1
     3  2  33
     7  3  0458
    10  4  339
    11  5  0

    The first column is a cumulative count, the second is the stem and the third the leaf.
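A table of this kind can be built with a short Python sketch. The function name `stem_and_leaf` is illustrative, and the code assumes non-negative whole numbers with the last digit as the leaf. Stems that contain no values are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return rows of (cumulative count, stem, leaves) for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)       # e.g. 34 -> stem 3, leaf 4
        leaves[stem].append(str(leaf))
    rows = []
    cumulative = 0
    for stem in sorted(leaves):
        cumulative += len(leaves[stem])
        rows.append((cumulative, stem, "".join(leaves[stem])))
    return rows

values = [11, 23, 23, 30, 34, 35, 38, 43, 43, 49, 50]
for count, stem, leaf_string in stem_and_leaf(values):
    print(f"{count:>2}  {stem}  {leaf_string}")
```

Run on the example numbers, this reproduces the five rows of the plot above.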

    Superficially the result is very similar to a histogram. However, you are provided with additional information about the variability of the data within a class. When data are recorded in the field or laboratory it can be useful to record them in this type of table.

Histograms

    Histograms (frequency distributions) are used to picture the 'shape' of the data, e.g. are the data skewed, bimodal, normally distributed etc.? If the data set is small it can be plotted 'as is'. Note that care should be taken when interpreting histograms based on a small number (<20) of observations. Usually it is necessary to split the data into frequency classes prior to plotting. There are some 'rules of thumb'.

    • Aim for 10-15 classes (about 8 for smaller samples).
    • Class boundaries may be found as follows :

    Assume the data run from 24 to 56, giving a range of 56 - 24 = 32

    Number of classes = 10

    Class size = 32 / 10 = 3.2

    Round off the class size to 3. Starting at the smallest value, this gives classes of

    1. 24-26
    2. 27-29
    3. 30-32 etc.
    4. ....

    Because the class size was rounded down, 11 classes are needed to reach 56.

    Decimal values should be rounded down when deciding upon class membership e.g. 26.9 becomes 26 and goes into class 1.
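The class-membership rule can be sketched in Python as follows. It assumes classes of width 3 that start at the smallest value of the stated range (24), and the readings in the example are made up for illustration. The names `class_index` and `frequency_table` are not from any library.

```python
import math
from collections import Counter

def class_index(value, lowest=24, width=3):
    """Zero-based class number; decimal values are rounded down first."""
    return (math.floor(value) - lowest) // width

def frequency_table(values, lowest=24, width=3):
    """Return (class label, frequency) pairs up to the highest occupied class."""
    counts = Counter(class_index(v, lowest, width) for v in values)
    table = []
    for i in range(max(counts) + 1):
        low = lowest + i * width
        table.append((f"{low}-{low + width - 1}", counts.get(i, 0)))
    return table

# Made-up readings between 24 and 56, for illustration only.
readings = [24, 25.5, 26.9, 27, 31, 31.2, 40, 56]
for label, frequency in frequency_table(readings):
    print(f"{label}: {frequency}")
```

Empty classes are kept in the table with a frequency of zero so that gaps in the data remain visible when the histogram is drawn.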

    It is often suggested that the class size should not exceed 0.25*s, where s is the standard deviation. Again, for small samples this may be impracticable.
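A quick calculation shows why this rule can be impracticable for small samples. The sketch below applies it to the eleven-value example data set used earlier on this page; the variable names are illustrative.

```python
import math
import statistics

values = [11, 23, 23, 30, 34, 35, 38, 43, 43, 49, 50]

max_width = 0.25 * statistics.stdev(values)      # largest class size the rule allows
data_range = max(values) - min(values)
classes_needed = math.ceil(data_range / max_width)

print(f"maximum class size = {max_width:.2f}")
print(f"classes needed     = {classes_needed}")
```

For these values the rule allows a class size of about 3, which would need 13 classes, more classes than there are observations.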

    The classes form the x axis and the class frequencies the y axis. An example is shown below in which the class interval is 10.

    Sample histogram


    This page, with acknowledgement, from a web site on univariant statistics by Dr Alan Fielding BSc MSc PhD FLS FHEA, Senior Learning and Teaching Fellow, School of Biology, Chemistry and Health Science, Manchester Metropolitan University. Alan has a new site with information on monitoring and statistics. He may be contacted at alan@alanfielding.co.uk or via his web page.


    Exploratory data analysis can be much more comprehensive and for those who are interested a full paper on EDA is provided.


       
     
