Five myths about variable selection

Original by Georg Heinze, Daniela Dunkler, Transplant International, 2017, 5 pages
This summary note was posted on 17 January 2020, in Credit risk

Whatever the technique applied, letting statistics decide which variables should be included in a model is popular among scientists. However, there is hardly any statistical theory that justifies the use of these techniques.

The five myths

  1. The number of variables in a model should be reduced until there are 10 events per variable – No!
  2. Only variables with proven univariable significance should be included – No! -> Although univariable prefiltering is traceable and easy to do with standard software, it is better to forget about it completely, as it is neither a prerequisite for nor a benefit to building multivariable models
  3. Insignificant effects should be eliminated from a model – No! -> not necessarily, as removing them may change the weights and interactions of the other variables (see the sketch after this list)
  4. The reported P-value quantifies the type I error of a variable being falsely selected – No!
  5. Variable selection simplifies analysis – No! -> Don’t just let the data speak. Expert background knowledge, formalized for example by directed acyclic graphs, is usually a much better guide and much more robust
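
The sketch below illustrates myth 3 with a small simulation. The variable names (age, smoker, outcome), the simulated data and the logistic model are invented for illustration only: the point is that dropping a covariate, even one that may look non-significant, can shift the coefficient of the variable that remains in the model.

```python
# Sketch of Myth 3: dropping a covariate can shift the remaining coefficients.
# All data and variable names here are made up for illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 500

# Hypothetical data: 'smoker' is correlated with 'age' and both affect the outcome.
age = rng.normal(50, 10, n)
smoker = (rng.uniform(size=n) < 1 / (1 + np.exp(-(age - 50) / 10))).astype(float)
logit = -3 + 0.04 * age + 0.5 * smoker
outcome = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(float)

df = pd.DataFrame({"age": age, "smoker": smoker, "outcome": outcome})

# Full model: age + smoker
full = sm.Logit(df["outcome"], sm.add_constant(df[["age", "smoker"]])).fit(disp=0)

# Reduced model: 'smoker' removed, as a selection procedure might do
reduced = sm.Logit(df["outcome"], sm.add_constant(df[["age"]])).fit(disp=0)

# The coefficient of 'age' shifts between the two fits, because part of the
# smoker effect is now absorbed by age (confounding).
print("age coefficient, full model:   ", round(full.params["age"], 4))
print("age coefficient, reduced model:", round(reduced.params["age"], 4))
```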

To remember

  • Variable selection should always be accompanied by sensitivity analyses
  • For prognostic models, a good start is to use backward elimination with a selection criterion of 0.157, without preceding univariable prefiltering (a minimal sketch follows this list)
  • For etiologic models, Augmented Backward Elimination preceded by a careful preselection, based on assumptions about the causal roles of the variables, is a reasonable approach
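
A minimal sketch of the prognostic-model recommendation, assuming a pandas DataFrame df with a binary outcome column and a list of candidate predictors. The data, the column names and the choice of a logistic model are assumptions for illustration, not the authors' reference code; the 0.157 level is the one quoted above and, for a single degree of freedom, roughly mimics AIC-based selection.

```python
# Minimal sketch of backward elimination at a selection level of 0.157,
# with no univariable prefiltering. 'df', the outcome column and the
# candidate list are hypothetical.
import statsmodels.api as sm


def backward_elimination(df, outcome, candidates, alpha=0.157):
    """Drop, one at a time, the candidate with the largest P-value above alpha."""
    selected = list(candidates)
    while selected:
        model = sm.Logit(df[outcome], sm.add_constant(df[selected])).fit(disp=0)
        pvalues = model.pvalues.drop("const")   # ignore the intercept
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:             # every remaining variable passes
            return model, selected
        selected.remove(worst)                  # eliminate the worst and refit
    return None, []


# Hypothetical usage, assuming df holds the outcome 'event' and the candidates:
# model, kept = backward_elimination(df, "event", ["age", "sex", "bmi", "creatinine"])
# print(kept)
```

Augmented Backward Elimination, recommended above for etiologic models, extends this kind of loop by also retaining non-significant variables whose removal would noticeably change the remaining coefficients (a change-in-estimate criterion).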