Education: The baseline data testing howler

Investigators are often anxious to know whether the random treatment allocation process has resulted in evenly balanced treatment groups. A very common mistake is to perform a significance test in order to determine this.

Null hypothesis: Any observed differences in the treatment group means are owing to chance alone.

Alternative hypothesis: There is a systematic difference in treatment group means; chance as the sole explanation is untenable.

Randomization is a chance mechanism, so under proper randomization the null hypothesis is true by construction. Rejecting it makes no sense because randomized data is… random. Embracing the alternative hypothesis makes no sense because randomized data is… random.

Testing baseline data is a misuse of significance testing. Investigators who perform significance testing of randomized baseline data are (unwittingly) testing randomly generated data to see if it is random or not!

…this practice is philosophically unsound, of no practical value and potentially misleading…

(Senn S, 1994)

While no lasting harm to the overall analysis is done by this practice, it sends a clear message to the reader:

We, the investigators, do not understand the basics of significance testing! We have made a mistake in the first step of the analysis (reporting baseline characteristics). Since we got this wrong, you should be very concerned about the (more important) analyses that follow…

Note: if the standard 0.05 level of significance is chosen then, by necessity, 1 in 20 (5%) of the baseline variables will on average be “significantly different”.
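The 1 in 20 figure can be checked with a small simulation. The sketch below is illustrative only: the baseline variable (an age with mean 40 and SD 10), the arm size, the number of simulated trials and the function names are all assumptions, and the p-value uses a large-sample normal approximation rather than an exact t-test.

```python
import random
from statistics import NormalDist, mean, stdev


def baseline_p_value(group_a, group_b):
    """Two-sided p-value for a difference in means (large-sample normal approximation)."""
    se = (stdev(group_a) ** 2 / len(group_a) + stdev(group_b) ** 2 / len(group_b)) ** 0.5
    z = (mean(group_a) - mean(group_b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_alarm_rate(n_trials=2000, n_per_arm=100, alpha=0.05, seed=1):
    """Share of simulated randomized trials with a 'significant' baseline difference."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # Hypothetical baseline variable: age, mean 40, SD 10 (assumed values).
        values = [rng.gauss(40, 10) for _ in range(2 * n_per_arm)]
        rng.shuffle(values)  # random allocation: chance alone decides the arm
        arm_a, arm_b = values[:n_per_arm], values[n_per_arm:]
        if baseline_p_value(arm_a, arm_b) < alpha:
            hits += 1
    return hits / n_trials


rate = false_alarm_rate()
print(f"Share of 'significant' baseline differences: {rate:.3f}")
```

Every simulated trial here is randomized properly, so each “significant” result is a false alarm. The share should land close to the chosen significance level.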

Here’s how it can be misleading:

Trial 1: An RCT investigating the effect of methylphenidate treatment on testosterone levels in men.

Trial 2: An RCT investigating the effect of methylphenidate treatment on testosterone levels in boys.

Which of these trials has a potentially serious imbalance in age?

So how do you know if there are any clinically important imbalances at baseline? Simple: look at the data. Not the statistical data (the p-value column) but the clinical data (the means). Are there any notable differences in the treatment group means on a clinically relevant variable? If so, discuss them and their possible effect in the results section.

Determining if there is a clinically important imbalance in the baseline data is a nuanced task requiring knowledge of the disease and treatment, its mechanism of action and many other factors. It seems unusual to leave this matter up to a t-test.
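Looking at the clinical data can be as plain as tabulating each arm's mean and standard deviation and reading off the difference. The following is a minimal sketch; the ages, the arm names and the `summarize` helper are hypothetical and only stand in for a real baseline table.

```python
from statistics import mean, stdev

# Hypothetical baseline ages (years) for two arms of a small trial.
baseline = {
    "treatment": [11.2, 12.5, 13.1, 12.8, 11.9, 13.4],
    "placebo": [10.1, 11.0, 10.8, 11.5, 10.4, 11.3],
}


def summarize(groups):
    """Return {arm: (n, mean, sd)} for one baseline variable."""
    return {arm: (len(v), mean(v), stdev(v)) for arm, v in groups.items()}


summary = summarize(baseline)
for arm, (n, m, sd) in summary.items():
    print(f"{arm:<10} n={n}  mean={m:.1f}  SD={sd:.1f}")

# The quantity to discuss is the size of the difference, not a p-value.
difference = summary["treatment"][1] - summary["placebo"][1]
print(f"Difference in means: {difference:.1f} years")
```

The code stops at description on purpose. Whether a difference of that size matters is a clinical judgment that depends on the population and the outcome being studied.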

The practice of significance testing of baseline data is common in psychiatry. While it doesn’t cause any harm, it’s usually the calling card of the statistical dilettante.

Suggested reading:
Testing for Baseline Balance in Clinical Trials