# Environmental Monitoring Power analysis

Power analysis is a technique that we can use to estimate the probability that an experiment will be able to reject a false null hypothesis. If the power is low there may be little point collecting the data since the chance of demonstrating an effect will be too small.

Power values (1 - β) range from 0 to 1 and are directly related to the Type II error rate (β).

| Power | Type II error rate | Comment |
|---|---|---|
| 1.0 | 0.0 | If there is an effect it will be detected |
| 0.8 | 0.2 | If there is an effect it will be detected 80 out of 100 times |
| 0.5 | 0.5 | If there is an effect it will be detected 50 out of 100 times |
| 0.2 | 0.8 | If there is an effect it will be detected 20 out of 100 times |
| 0.0 | 1.0 | If there is an effect we will never find it |

It is obvious from the above table that high power is very desirable. If power is low there may be no point carrying out the experiment.

Consequently we require techniques that enable us to:

• measure power
• maximise power

Maximising power is one of the most important aspects of experimental design. Often the only features that are simple to control are sample size and alpha. It is sometimes possible to decrease the experimental error. If we assume that alpha is fixed at the 'usual' value of 0.05, this leaves only sample size available for manipulation. Deciding on an appropriate sample size depends on an ability to measure power. Unfortunately it is not always easy to estimate power. It is influenced by several factors:

| Low power | High power |
|---|---|
| small sample size | large sample size |
| large experimental error | small experimental error |
| small alpha (e.g. 0.001) | larger alpha (e.g. 0.10) |
| small effect size | large effect size |
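The four factors can be explored numerically. The sketch below is not taken from the original page: it assumes a two-sided, two-sample z-test with a known standard deviation and equal group sizes, and the function name `approx_power` and all input values are hypothetical choices made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect, sd, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with n per group.

    Assumes a known, common standard deviation (an illustrative simplification).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = sd * sqrt(2 / n)      # standard error of the difference in means
    shift = effect / se        # expected value of the test statistic
    # Probability that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical baseline, then one factor changed at a time.
baseline = approx_power(effect=3, sd=4, n=10)
bigger_n = approx_power(effect=3, sd=4, n=40)
smaller_error = approx_power(effect=3, sd=2, n=10)
larger_alpha = approx_power(effect=3, sd=4, n=10, alpha=0.10)
larger_effect = approx_power(effect=6, sd=4, n=10)

for label, value in [("baseline", baseline), ("larger sample", bigger_n),
                     ("smaller error", smaller_error),
                     ("larger alpha", larger_alpha),
                     ("larger effect", larger_effect)]:
    print(f"{label:15s} power = {value:.2f}")
```

Each change moves power in the direction listed in the table; the exact values depend on the simplifying assumptions above.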

Effect size
Effect size is the difference between the null and alternative hypotheses. Essentially, it is the magnitude of effect that you consider to be of biological significance. For example, suppose that you were testing a new drug that was said to reduce blood pressure. How much reduction is required to be of medical significance? Presumably, a drug that reduced blood pressure by 1% would not be worth considering, whereas a reduction of 10% might be worth further investigation. Consequently you need to design an experiment that is capable of detecting a 10% drop. An experiment that was required to detect a 1% decrease would require many more replicates.

In fact this raises some philosophical points about hypothesis testing in general. Consider the following pair of hypotheses:

• H0: Aircraft noise has no effect on sleep.
• H1: Aircraft noise has an effect on sleep.

The null hypothesis states that the presence of aircraft noise has no effect. We might be very surprised if this were true. However, suppose that the noise decreased average sleep by 1%, for example a reduction of approximately 5 minutes over 8 hours. Would we be very interested or concerned? Probably not, because the effect is too small to be of biological significance. In addition, demonstrating such a small effect would require a very large trial. Could you justify the cost of the experiment to demonstrate such a small effect? Presumably we wish to set up a trial that is capable of demonstrating a biologically significant effect, for example a 5% reduction (about 24 minutes over 8 hours). A 5% effect size could be detected with many fewer replicates than a 1% effect size.

If our experiment is designed with a 5% effect size in mind the analysis should be unable to reject the null hypothesis for smaller effects, e.g. a 12 minute reduction. Thus, although our null hypothesis is framed in terms of 'no effect' we actually mean 'no biologically significant effect'. This has led many statisticians to question the unthinking application of significance tests.

Estimating power
Power is quite difficult to measure. Calculations are not easily carried out by hand and few statistical packages provide power calculation options.

Example power curves
The examples demonstrate the trade-offs that exist between the effect size, sample size, variability and power.

In the first example we know that the error variance is 15. Using this information we can investigate the effect of sample size on power. The 9 curves are for different minimum detectable differences (mdd), also known as range of means or effect size. There are several points to note:

• power increases with mdd
• power increases with the sample size
• over the range of tested sample sizes it will be very difficult to detect a difference between means that is smaller than 3
• there is no point using a sample size greater than 7 to detect a difference of 10 or more.

The remaining graphs illustrate some of the other trade-offs.

A. Keeping error variance (15) and sample size (10) constant what is the relationship between the mdd and power? Under these conditions power is adequate to detect differences of 5 and above. Note that for differences above 9 the sample size is inefficient since power is at 1.0. It may be possible to reduce sample size in order to detect differences greater than 10.

B. Keeping the effect size (10) and error variance (15) constant, what is the relationship between sample size and power? The appropriate sample size, for a mdd of 10, is between 5 and 9. Fewer than 5 gives unacceptably low power; more than 9 does not increase power.

C. Keeping the sample size (10) and effect size (10) constant what is the relationship between error variance and power? Note that under these conditions power is only a weak function of error variance. A four-fold increase has not decreased power below 0.8.
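Trade-off B can also be approximated by simulation. The sketch below is an illustration only and is not the method behind the original curves: it assumes a comparison of two groups with a pooled-variance t-test, normally distributed errors, and a two-tailed alpha of 0.05. The function names are hypothetical, and the critical values are standard t-table entries.

```python
import random
from math import sqrt

# Two-tailed 5% critical values of t, keyed by df = 2 * (n - 1) for n = 2..10.
T_CRIT = {2: 4.303, 4: 2.776, 6: 2.447, 8: 2.306, 10: 2.228,
          12: 2.179, 14: 2.145, 16: 2.120, 18: 2.101}


def mean_and_variance(values):
    """Return the sample mean and the (n - 1) sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v


def simulated_power(mdd, error_variance, n, n_sims=4000, seed=1):
    """Fraction of simulated two-group experiments that reject H0."""
    rng = random.Random(seed)
    sd = sqrt(error_variance)
    t_crit = T_CRIT[2 * (n - 1)]
    rejections = 0
    for _ in range(n_sims):
        group_a = [rng.gauss(0.0, sd) for _ in range(n)]
        group_b = [rng.gauss(mdd, sd) for _ in range(n)]
        mean_a, var_a = mean_and_variance(group_a)
        mean_b, var_b = mean_and_variance(group_b)
        pooled_var = (var_a + var_b) / 2          # equal group sizes
        t_stat = (mean_b - mean_a) / sqrt(pooled_var * 2 / n)
        if abs(t_stat) > t_crit:
            rejections += 1
    return rejections / n_sims


# mdd = 10 and error variance = 15, as in example B.
power_by_n = {n: simulated_power(10, 15, n) for n in range(2, 11)}
for n, power in power_by_n.items():
    print(f"n = {n:2d}  power = {power:.2f}")
```

Under these assumptions power climbs steeply over the smallest sample sizes and then flattens, which is the pattern described in B. The exact values need not match the original curves, since the model behind them is not stated here.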

Rules for estimating sample sizes needed to achieve specified power.

1. Determining minimum sample size required to achieve a specified precision in a sample mean.

2. Two samples of continuous data (e.g. an unpaired t-test).

3. The difference between two proportions

4. Detecting temporal trends

5. Estimating power for regression lines

1. Determining minimum sample size required to achieve a specified precision in a sample mean.

Sample means are estimates of population means and it is possible to use sample statistics such as the standard error and confidence intervals to measure the precision of the estimate. However, these are post hoc measures of precision. Suppose you wish to determine, a priori, the sample size needed to reach a specified precision. What method is used?

The process is iterative, in that an estimate of the population standard deviation is needed. This can be obtained from a preliminary sample. The subsequent technique depends on the size of the preliminary sample. If the sample was 'large' (typically >30) a z statistic is used, if the sample was 'small' a t statistic is used.

Example 1

The effect of aircraft noise on the delay in getting to sleep was tested on 15 people. The mean time from lights out to sleep was 18.5 minutes, with a standard deviation of 2.5 minutes. If we wish to estimate the mean to within 1 minute, what sample size is needed?

The equation needed to determine n is

$$n = \left(\frac{t_{\alpha(2)}\, s}{\text{precision}}\right)^2$$

For 14 degrees of freedom and a 95% confidence interval, $t_{0.05(2)}$ is 2.145. Therefore

$$n = \left(\frac{2.145 \times 2.5}{1}\right)^2 = 28.8$$

or, after rounding, 29.

If we had obtained a large sample, for example a mean of 18.25 (s = 2.1) from a sample of 100 we would use:

$$n = \left(\frac{z_{\alpha(2)}\, s}{\text{precision}}\right)^2$$

Using a 95% confidence interval, $z_{0.05(2)}$ is 1.96. Therefore:

$$n = \left(\frac{1.96 \times 2.1}{1}\right)^2 = 16.9$$

or, after rounding, 17.

This result tells us that the sample size used was far too large for our desired level of precision! The smaller sample size in the second example is due to two factors:

• the standard deviation was smaller (using s = 2.5 gives a sample size of 24)
• greater confidence in our estimate of the population standard deviation, this is reflected by our use of a z statistic in place of the t statistic.
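The two calculations in Example 1 can be checked with a few lines of code. This is a minimal sketch: the function name `sample_size_for_precision` is hypothetical, the t value is the tabulated figure quoted in the text, and the z value is computed rather than read from tables.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_precision(critical_value, s, precision):
    """n = (critical value * s / precision) squared, before rounding."""
    return (critical_value * s / precision) ** 2


t_crit = 2.145                          # t for 14 df, two-tailed 5% (from the text)
z_crit = NormalDist().inv_cdf(0.975)    # two-tailed 5% z value, about 1.96

# Preliminary sample of 15 (s = 2.5) and large sample of 100 (s = 2.1).
n_small = sample_size_for_precision(t_crit, s=2.5, precision=1.0)
n_large = sample_size_for_precision(z_crit, s=2.1, precision=1.0)

print(round(n_small, 1), ceil(n_small))   # 28.8 29
print(round(n_large, 1), ceil(n_large))   # 16.9 17
```

Rounding up with `ceil` is a cautious choice made here, since a fractional observation cannot be collected.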

Note that we can use this method to achieve a precision defined in percentage terms. For example, if we needed to estimate the population mean with a precision of ± 10%, the required precision would be 1.85 minutes (10% of our sample mean estimate of 18.5 minutes).

The following two examples are based on Campbell et al (1995).

2. Two samples of continuous data (e.g. an unpaired t-test).

You must first decide on a:

• minimum sample difference that is biologically meaningful (this is the effect size or mdd)
• significance level (alpha)
• power value (1 - beta)

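You also require an estimate of the variability of your observations in the form of a standard deviation (s). The minimum sample size is then

$$m = \frac{2\,(z_{\alpha} + z_{\beta})^2}{d^2} + \frac{z_{\alpha}^2}{4}$$

where d = mdd/s and m is the minimum sample size. The z values are from tables. Some common values are: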

| Probability | z |
|---|---|
| alpha = 0.025 (0.05 2-tailed) | 1.96 |
| alpha = 0.005 (0.01 2-tailed) | 2.58 |
| beta = 0.1 (power = 0.9) | 1.28 |
| beta = 0.2 (power = 0.8) | 0.84 |

Example calculations

The first example is based on the previous example. What is m for a mdd of 10 when s = 3.87 (which is equivalent to a variance of 15)? With alpha = 0.05 (2-tailed) and beta = 0.1, d = 10/3.87 = 2.58 and m = 4.11 or, after rounding up, 5.

The sample size of 5 is in close agreement with the results generated by Power Plant.

What is m for a mdd of 3 when s = 2? Here d = 3/2 = 1.5 and m = 10.29 or, rounding up, 11.

Note that the above calculations can be simplified. For alpha = 0.05 and beta = 0.1 the equation is approximately: $m \approx 21/d^2 + 1$
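The two worked values can be reproduced in code. The sketch below assumes the sample size formula has the form m = 2(z_alpha + z_beta)^2 / d^2 + z_alpha^2 / 4, which is the form that returns 4.11 and 10.29 for the examples above. The function names are hypothetical, and the z values are computed rather than read from tables, so the results differ slightly in the second decimal place.

```python
from math import ceil
from statistics import NormalDist


def two_sample_size(mdd, s, alpha_two_tailed=0.05, beta=0.1):
    """Minimum sample size m for two samples of continuous data.

    Assumed form: m = 2 (z_alpha + z_beta)^2 / d^2 + z_alpha^2 / 4, with d = mdd / s.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha_two_tailed / 2)
    z_beta = std_normal.inv_cdf(1 - beta)
    d = mdd / s
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2 + z_alpha ** 2 / 4


def quick_rule(mdd, s):
    """Shortcut for alpha = 0.05 (2-tailed) and beta = 0.1: 21 / d^2 + 1."""
    d = mdd / s
    return 21 / d ** 2 + 1


m_first = two_sample_size(mdd=10, s=3.87)
m_second = two_sample_size(mdd=3, s=2)

print(round(m_first, 2), ceil(m_first), ceil(quick_rule(10, 3.87)))
print(round(m_second, 2), ceil(m_second), ceil(quick_rule(3, 2)))
```

After rounding up, the full formula and the shortcut agree for both examples (5 and 11).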

3. The difference between two proportions

Campbell et al (1995) provide an approximate formula to determine the sample size required to detect a specified difference in proportions:

$$m = \frac{(z_{\alpha} + z_{\beta})^2\,\left[\,p_A(1 - p_A) + p_B(1 - p_B)\,\right]}{d^2}$$

where d is pA - pB.

Suppose that you know the proportion of patients experiencing a particular infection is 0.2. You have a new treatment that you think may decrease this proportion. You set the effect size at 0.05, i.e. a 25% reduction to 0.15. Thus, d = 0.2 - 0.15 = 0.05. What sample size is needed to detect such a reduction?

Assume, as previously, that alpha is 0.05 and beta is 0.1. The required sample size is approximately 1200!

If we wish to detect a 50% reduction from 0.2 to 0.1 the required sample size is approximately 260.
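The two sample sizes quoted above can be checked with a short calculation. This sketch assumes the approximate formula takes the form m = (z_alpha + z_beta)^2 [pA(1 - pA) + pB(1 - pB)] / d^2, a common approximation that agrees with the figures of roughly 1200 and 260. The function name is hypothetical.

```python
from statistics import NormalDist


def proportions_sample_size(p_a, p_b, alpha_two_tailed=0.05, beta=0.1):
    """Approximate sample size to detect a difference between two proportions.

    Assumed form: (z_alpha + z_beta)^2 * [pA(1 - pA) + pB(1 - pB)] / d^2.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha_two_tailed / 2)
    z_beta = std_normal.inv_cdf(1 - beta)
    d = p_a - p_b
    spread = p_a * (1 - p_a) + p_b * (1 - p_b)
    return (z_alpha + z_beta) ** 2 * spread / d ** 2


m_quarter = proportions_sample_size(0.2, 0.15)   # 25% reduction
m_half = proportions_sample_size(0.2, 0.10)      # 50% reduction

print(round(m_quarter), round(m_half))
```

Halving the proportion rather than cutting it by a quarter reduces the required sample size by a factor of more than four, because d enters the formula squared.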

4. Detecting temporal trends

Detecting temporal trends is an important goal for many studies, for example identifying declining populations of endangered species or increases in disease incidence. The problem is one of picking the 'signal' out of the 'noise' caused by seasonal and stochastic variation. Determining the sampling effort required to identify trends is complex because there are many parameters that can be controlled. For example:

• The number of plots monitored.
• The magnitude of the counts per plot.
• The amount of stochastic variation.
• The length of the monitoring period.
• The interval between counts (months, annual, biennial).
• The magnitude of the trend.
• The significance level.

Because of this complexity it is very difficult to provide simple rules for the estimation of power. Fortunately there is some public domain software available. The following example is taken from the Monitor user manual.

The Dachigam Wildlife Sanctuary, Kashmir, India has a population of Himalayan black bears (Selenarctos thibetanus). Unfortunately, little is known about the population's status or trends. Throughout most of the year the bears are scattered throughout the sanctuary and are very difficult to count. However, during the peak fruiting period for local mast-bearing trees, most bears in the sanctuary travel to a large, central grove of masting trees to forage, and there it is possible to get repeatable counts of the number of bears traveling to and from the grove on any given day.

Baseline data were obtained from 15 separate, day-long counts of the bears. The average was 15.6 bears per day with a standard deviation of 3.6 bears. Would monitoring by one park ranger on 3 separate days, over a 10 year period, be sufficient to detect annual linear trends (positive and negative) of at least 3% in the bear population with a power > 0.90?

The results from a range of simulated conditions using the Monitor software are presented below.

Power to detect trends in a Himalayan black bear population surveyed annually over a 10 year period in Dachigam Wildlife Sanctuary, Kashmir, India. These data were provided by Vasant K. Saberwal.

| Trend (%) | 3 counts/year | 5 counts/year | 10 counts/year |
|---|---|---|---|
| -10 | 0.99 | 1 | 1 |
| -5 | 0.74 | 0.94 | 1 |
| -3 | 0.42 | 0.61 | 0.88 |
| -2 | 0.22 | 0.29 | 0.62 |
| -1 | 0.12 | 0.14 | 0.21 |
| 0 | 0.045 | 0.046 | 0.046 |
| +1 | 0.078 | 0.15 | 0.28 |
| +2 | 0.29 | 0.48 | 0.77 |
| +3 | 0.65 | 0.87 | 0.99 |
| +5 | 0.98 | 1 | 1 |
| +10 | 1 | 1 | 1 |

These results demonstrate that 3 counts per year do not provide sufficient power to detect a 3% trend (0.42 for a decline, 0.65 for an increase). Note that the power differs between increasing and decreasing trends. Increasing the counts to 5 brings power for a 3% increase close to the target (0.87), but 10 counts are needed to reach a similar level for a 3% decline (0.88).
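The general idea of estimating trend power by simulation can be sketched as follows. This is not the Monitor algorithm and will not reproduce the table. It assumes a linear trend expressed as a percentage of the baseline mean, normally distributed daily counts with a constant standard deviation, and an ordinary least-squares test of the slope on the yearly averages. Because the variance is held constant, increasing and decreasing trends have equal power in this sketch, unlike the results above. All names are hypothetical.

```python
import random
from math import sqrt

YEARS = 10
T_CRIT_DF8 = 2.306   # two-tailed 5% critical value of t for YEARS - 2 = 8 df


def trend_power(trend_pct, counts_per_year, mean0=15.6, sd=3.6,
                n_sims=2000, seed=1):
    """Fraction of simulated 10 year surveys in which the slope is significant."""
    rng = random.Random(seed)
    xs = list(range(YEARS))
    x_bar = sum(xs) / YEARS
    sxx = sum((x - x_bar) ** 2 for x in xs)
    true_slope = mean0 * trend_pct / 100.0    # change in bears per year
    detections = 0
    for _ in range(n_sims):
        # Yearly index: the average of the day-long counts made that year.
        ys = []
        for x in xs:
            expected = mean0 + true_slope * x
            counts = [rng.gauss(expected, sd) for _ in range(counts_per_year)]
            ys.append(sum(counts) / counts_per_year)
        # Least-squares slope and its standard error.
        y_bar = sum(ys) / YEARS
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        slope = sxy / sxx
        intercept = y_bar - slope * x_bar
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        se_slope = sqrt(sse / (YEARS - 2) / sxx)
        if abs(slope / se_slope) > T_CRIT_DF8:
            detections += 1
    return detections / n_sims


results = {(trend, k): trend_power(trend, k)
           for trend in (0, 3) for k in (3, 5, 10)}
for (trend, k), power in results.items():
    print(f"trend {trend:+d}%  {k:2d} counts/year  power = {power:.2f}")
```

With no trend the rejection rate should sit near alpha, and for a 3% trend power should rise with the number of counts per year, which mirrors the pattern in the table even though the values differ.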

A final word

The 'take home message' from these examples is that unless sample sizes are large enough you will not be able to detect biological effects. Bob Hayden (1995) provided a very telling metaphor on this subject on the Ed-Stat discussion list. It is paraphrased below.

Researchers can complain when told how large their samples should be. They say this is completely impractical and elect to use a much smaller, more manageable sample size. Imagine being asked what type of instrument is required to measure intermolecular distances: you reply with a make and price ($35 million). They reply 'that is far too expensive, I will use calipers'! The moral is that if you can't achieve (afford) a large enough sample size to detect a particular effect, move on to a different topic that you can afford.

The absolutely final word on this topic is that you can use experimental design to increase power for the same number of replicates.

This page, with acknowledgement, from a web site on univariant statistics by Dr Alan Fielding BSc MSc PhD FLS FHEA, Senior Learning and Teaching Fellow, School of Biology, Chemistry and Health Science, Manchester Metropolitan University. Alan has a new site with information on monitoring and statistics. He may be contacted at alan@alanfielding.co.uk or via his web page.