Whatever technique is applied, letting statistics decide which variables to include in a model is popular among scientists. However, there is hardly any statistical theory that justifies the use of these techniques.
The five myths
- The number of variables in a model should be reduced until there are 10 events per variable – No!
- Only variables with proven univariable significance should be included – No! -> Although univariable prefiltering is traceable and easy to do with standard software, it is best avoided entirely: it is neither a prerequisite for building multivariable models nor does it provide any benefit
- Insignificant effects should be eliminated from a model – No! -> Not necessarily, as removing them can change the other variables' coefficients and interactions
- The reported P-values quantify the type I error of a variable being falsely selected – No!
- Variable selection simplifies analysis – No! -> Don't just let the data speak: expert background knowledge, formalized for example as a directed acyclic graph (DAG), is usually a much better and more robust guide (a toy sketch follows this list)
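As a toy illustration of the last point, the sketch below encodes hypothetical background knowledge as a DAG in Python. The variable names, edges, and the networkx-based confounder/mediator heuristic are assumptions made for this example and are specific to this small graph; dedicated tools such as DAGitty derive proper adjustment sets in general.

```python
# Toy illustration of encoding expert background knowledge as a DAG before any
# model fitting. Variable names and edges are invented; the heuristic below is
# specific to this toy graph, not the general back-door criterion.
import networkx as nx

dag = nx.DiGraph([
    ("age", "exposure"), ("age", "outcome"),              # common cause
    ("exposure", "outcome"),                               # effect of interest
    ("exposure", "biomarker"), ("biomarker", "outcome"),   # mediating path
])
assert nx.is_directed_acyclic_graph(dag)

# Common causes of exposure and outcome are adjustment candidates;
# descendants of the exposure (mediators) should not be adjusted for.
confounders = nx.ancestors(dag, "exposure") & nx.ancestors(dag, "outcome")
mediators = nx.descendants(dag, "exposure") - {"outcome"}
print("adjust for:", confounders)        # {'age'}
print("do not adjust for:", mediators)   # {'biomarker'}
```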
To remember
- Variable selection should always be accompanied by sensitivity analyses
- For prognostic models, a good start is backward elimination with a selection criterion of 0.157 (the significance level that roughly mimics AIC-based selection), without preceding univariable prefiltering; a minimal sketch follows this list
- For etiologic models, Augmented Backward Elimination, preceded by a careful preselection based on the assumed causal roles of the variables, is a reasonable approach; a simplified sketch of the idea follows as well
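A minimal sketch of the backward-elimination recommendation for prognostic models, assuming a linear model fitted with statsmodels. The simulated data, variable names and helper function are illustrative and not from the original text; 0.157 is the significance level that approximately corresponds to AIC-based selection.

```python
# Minimal sketch of backward elimination for a prognostic model, using
# ordinary least squares from statsmodels. The simulated data and variable
# names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 6)),
                 columns=[f"x{i}" for i in range(1, 7)])
# The outcome depends on x1 and x2 only; x3-x6 are noise.
y = 1.0 + 2.0 * X["x1"] - 1.5 * X["x2"] + rng.normal(size=n)

def backward_eliminate(X, y, alpha=0.157):
    """Repeatedly drop the predictor with the largest p-value above `alpha`.

    alpha = 0.157 is the significance level that roughly mimics AIC-based
    selection; no univariable prefiltering is performed.
    """
    kept = list(X.columns)
    while kept:
        fit = sm.OLS(y, sm.add_constant(X[kept])).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            return fit, kept            # every remaining predictor passes
        kept.remove(worst)              # eliminate the weakest predictor
    # All predictors removed: fall back to an intercept-only model.
    return sm.OLS(y, np.ones(len(y))).fit(), kept

final_fit, selected = backward_eliminate(X, y)
print("selected:", selected)  # x1 and x2 survive; a noise variable occasionally does too
```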
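The idea behind Augmented Backward Elimination can be sketched as backward elimination guarded by a change-in-estimate check: a variable leaves the model only if it is non-significant and its removal barely shifts the coefficients of the variables that remain. The code below is a simplified illustration of that idea, not the published ABE algorithm; the thresholds, the `forced` argument and the relative-change rule are assumptions for the example.

```python
# Simplified sketch of the idea behind Augmented Backward Elimination (ABE):
# a variable may only leave the model if it is non-significant AND its removal
# barely changes the coefficients of the variables that stay in. Thresholds
# and the `forced` mechanism are illustrative, not the published defaults.
import numpy as np
import statsmodels.api as sm

def abe_like_eliminate(X, y, alpha=0.2, cie=0.05, forced=()):
    """Backward elimination guarded by a change-in-estimate check.

    `forced` lists variables of causal interest (e.g. the exposure) that are
    never considered for removal.
    """
    kept = list(X.columns)
    while True:
        fit = sm.OLS(y, sm.add_constant(X[kept])).fit()
        pvals = fit.pvalues.drop("const")
        # Removal candidates: non-forced, non-significant, weakest first.
        candidates = [v for v in pvals.sort_values(ascending=False).index
                      if v not in forced and pvals[v] > alpha]
        dropped = False
        for v in candidates:
            rest = [k for k in kept if k != v]
            if not rest:                 # keep at least one predictor in this sketch
                continue
            refit = sm.OLS(y, sm.add_constant(X[rest])).fit()
            # Relative change of the remaining coefficients if v is removed.
            change = np.abs(refit.params[rest] - fit.params[rest]) \
                     / np.maximum(np.abs(fit.params[rest]), 1e-8)
            if (change < cie).all():
                kept.remove(v)           # neither significant nor a confounder proxy
                dropped = True
                break
        if not dropped:
            return fit, kept

# Usage with the X, y from the previous sketch, forcing a hypothetical exposure:
# fit, kept = abe_like_eliminate(X, y, forced=("x1",))
```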