When you have collected your data it would be helpful if:
- you removed all the mistakes (mis-entered numbers etc.)
- you could see the 'wood for the trees'.
Exploratory data analysis (EDA) provides tools that can be used for both
purposes. There are a number of graphical methods that allow you to visualise
your data. Four simple approaches to EDA are described below.
Data that have a symmetrical distribution tend to have
- means, medians and modes that are similar
- means that are larger than the standard deviation
Confidence intervals can be used to construct error bars.
These can be used to provide a quick and convenient visualisation of the
relative variability and symmetry of a number of samples. Note that a plot of a
mean ±2 standard errors is not equivalent to a plot of 95% confidence intervals.
It is a close approximation if the sample size is large (>30). The example
below shows 95% confidence intervals for the average girth of trees at sampling
points on three Indonesian islands (data from Dr. S. Marsden).
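The difference between the two kinds of error bar can be sketched in a few lines of Python. This is only an illustration: the girth values are hypothetical (they are not Dr. Marsden's data), the function names are made up for this example, and the t multipliers are the usual two-sided 95% values from standard tables.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t from standard tables,
# keyed by degrees of freedom (n - 1). Note how they approach 2 as n grows.
T_CRIT_95 = {4: 2.776, 9: 2.262, 29: 2.045, 99: 1.984}

def standard_error(sample):
    """Standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

def interval(sample, multiplier):
    """Return (lower, upper) for mean +/- multiplier * standard error."""
    centre = statistics.mean(sample)
    half_width = multiplier * standard_error(sample)
    return centre - half_width, centre + half_width

# Hypothetical girth measurements (n = 10), for illustration only.
girths = [31.0, 28.5, 40.2, 35.1, 29.9, 33.4, 38.0, 27.6, 36.3, 30.8]

two_se = interval(girths, 2.0)                        # mean +/- 2 SE
ci_95 = interval(girths, T_CRIT_95[len(girths) - 1])  # 95% CI using t, 9 d.f.

print("mean +/- 2 SE:", two_se)
print("95% CI (t)   :", ci_95)
```

With only 10 observations the t multiplier (2.262) is noticeably larger than 2, so the 95% interval is wider than the ±2 standard error bar. By about 30 observations the multiplier has fallen to 2.045 and the two are nearly the same.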
Box and Whisker Plots
This is a pictorial combination of descriptive measures. It
conveys more information than an error bar plot. We need five numbers:
- median (middle value in the ordered data set);
- 1st and 3rd quartiles (see below);
- smallest and largest values (range).
Consider the following simple data set:
What is a quartile? An ordered data set can be
split into 4 quarters. The first split is the median. This creates two halves
(of 5 and 5 values in the above example). Each half can be further subdivided to
give quarters; essentially, the splits are at the medians of each half.
The quartiles are called hinges and mark the
ends of the box. The whiskers stretch from the hinges to the
extremes. The position of the median is marked in the box. An example using the
previous data set is shown above. Note that the plot is not symmetrical; the
median is not in the middle of the box and the whiskers are of unequal length.
This indicates a skewed distribution to the left.
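A minimal sketch of how the five numbers might be computed, following the 'median of each half' description above. The data values are hypothetical, and the function name is made up for this example. For an odd number of values the sketch assumes the median is included in both halves, which is one common convention and is not stated in the text.

```python
import statistics

def five_number_summary(values):
    """Return (smallest, lower hinge, median, upper hinge, largest)."""
    data = sorted(values)
    n = len(data)
    # Size of each half. For odd n the middle value is counted in both
    # halves (an assumed convention; the text only covers the even case).
    half = (n + 1) // 2
    lower_half = data[:half]
    upper_half = data[n - half:]
    return (
        data[0],
        statistics.median(lower_half),
        statistics.median(data),
        statistics.median(upper_half),
        data[-1],
    )

# Hypothetical data set of 10 values, for illustration only.
sample = [2, 3, 5, 6, 8, 9, 12, 15, 18, 40]
print(five_number_summary(sample))  # (2, 5, 8.5, 15, 40)
```

In this hypothetical sample the median (8.5) sits nearer the lower hinge (5) than the upper hinge (15), and the upper whisker (15 to 40) is much longer than the lower one (2 to 5), so the box plot would not be symmetrical.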
It is possible to make the display slightly more complicated
by setting up inner and outer 'fences'. Minitab uses this approach. Let
H-spread = upper hinge - lower hinge. The inner fences are set at
1.5*(H-spread) outside the hinges. The outer fences are set at 3*(H-spread)
outside the hinges. The advantage of this is that outliers can be more easily
identified.
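The fence arithmetic can be sketched as follows. The hinge values and observations are hypothetical, and the function names and the three labels returned by `classify` are made up for this illustration; they are not Minitab terminology.

```python
def fences(lower_hinge, upper_hinge):
    """Return ((inner_low, inner_high), (outer_low, outer_high))."""
    h_spread = upper_hinge - lower_hinge
    inner = (lower_hinge - 1.5 * h_spread, upper_hinge + 1.5 * h_spread)
    outer = (lower_hinge - 3.0 * h_spread, upper_hinge + 3.0 * h_spread)
    return inner, outer

def classify(value, lower_hinge, upper_hinge):
    """Label a value by its position relative to the fences."""
    inner, outer = fences(lower_hinge, upper_hinge)
    if value < outer[0] or value > outer[1]:
        return "beyond outer fence"
    if value < inner[0] or value > inner[1]:
        return "between inner and outer fences"
    return "inside inner fences"

# Hypothetical hinges of 5 and 15 give an H-spread of 10.
print(fences(5, 15))        # ((-10.0, 30.0), (-25.0, 45.0))
print(classify(12, 5, 15))  # inside inner fences
print(classify(40, 5, 15))  # between inner and outer fences
print(classify(50, 5, 15))  # beyond outer fence
```

Values that fall outside the inner fences are candidates for a closer look, and those beyond the outer fences are the most extreme.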
The example below uses the same tree data as the C.I.
example. Note that we now have some information about outliers and
Any number can be split into 2 components. The initial
digits are the stem, the final digit is the leaf. For example
- 34 has a stem of 3 and a leaf of 4,
- 654 has a stem of 65 and a leaf of 4.
Using this concept a numerical equivalent of a histogram can
be created. In the example below the numbers summarised are :
Stem and leaf plot
The first column is a cumulative count, the second is the
stem and the third the leaf.
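A stem and leaf table of this kind can be sketched in Python. The numbers used are hypothetical, and the function name is made up for this example. The sketch assumes non-negative whole numbers and a simple running total from the top for the first column, which may differ in detail from the counts a statistics package prints.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return rows of (cumulative count, stem, leaves) for whole numbers >= 0."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 654 -> stem 65, leaf 4
        groups[stem].append(leaf)
    rows = []
    running_total = 0
    # Include empty stems so that gaps in the data remain visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = groups.get(stem, [])
        running_total += len(leaves)
        rows.append((running_total, stem, "".join(str(d) for d in leaves)))
    return rows

# Hypothetical observations, for illustration only.
observations = [12, 15, 21, 34, 34, 38]
for count, stem, leaves in stem_and_leaf(observations):
    print(f"{count:>3} {stem:>3} | {leaves}")
```

Each row keeps the final digit of every observation, which is why the table shows how values are spread within a class as well as how many fall in it.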
Superficially the result is very similar to a histogram.
However, you are provided with additional information about the variability of
the data within a class. When data are recorded in the field or laboratory it
can be useful to record them in this type of table.
Histograms (frequency distributions) are used to picture the
'shape' of the data, e.g. are the data skewed, bimodal, normally distributed
etc.? If the data set is small it can be plotted 'as is'. Note that care
should be taken when interpreting histograms based on a small number (<20) of
observations. Usually it is necessary to split the data into frequency classes
prior to plotting. There are some 'rules of thumb'.
- Aim for 10-15 classes (about 8 for smaller samples).
- Class boundaries may be found as follows:
Assume the data run from 24 to 56, a range of 32
Number of classes = 10
Class size = 32/10 = 3.2
Round off the class size to 3. This gives classes of
- 17-20 etc.
Decimal values should be rounded down when deciding upon
class membership e.g. 13.9 becomes 13 and goes into class 1.
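The class-size arithmetic and the 'round decimals down' rule can be sketched as below. The minimum of 24, maximum of 56 and target of 10 classes come from the worked figures above. Starting the first class at the minimum value is an assumption of this sketch, as are the function names and the sample values.

```python
import math

def class_width(minimum, maximum, n_classes):
    """Raw class size: range divided by the number of classes."""
    return (maximum - minimum) / n_classes

def class_index(value, minimum, width):
    """Zero-based class number; decimal values are rounded down first."""
    return (math.floor(value) - minimum) // width

def class_frequencies(values, minimum, width):
    """Count how many values fall in each class."""
    counts = {}
    for value in values:
        index = class_index(value, minimum, width)
        counts[index] = counts.get(index, 0) + 1
    return counts

raw = class_width(24, 56, 10)  # 3.2
width = round(raw)             # rounded off to 3

# With a width of 3 the classes are 24-26, 27-29, and so on.
# Because 3.2 was rounded down, 11 classes are needed to reach 56.
print(raw, width)
print(class_index(26.9, 24, width))  # 26.9 is treated as 26, first class
print(class_index(56, 24, width))    # last (eleventh) class, index 10

# Hypothetical values, for illustration only.
print(class_frequencies([24.2, 26.9, 27.0, 41.5, 56], 24, width))
```

The resulting counts per class are the values that would be plotted on the y axis of the histogram.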
It is often suggested that the class size should not exceed
0.25*s (where s is the standard deviation). Again, for small samples this may
be impracticable.
The classes form the x axis and the class frequencies the y
axis. An example is shown below in which the class interval is
This page, with acknowledgement, from a web site on univariant statistics by Dr Alan Fielding BSc MSc PhD FLS FHEA, Senior Learning and Teaching Fellow, School of Biology, Chemistry and Health Science, Manchester Metropolitan University. Alan has a new site with information on monitoring and statistics. He may be contacted at firstname.lastname@example.org or via his web page.