- Discretization is typically used as a pre-processing step for machine learning
- Supervised discretization methods will discretize a variable into a single interval if the variable has little to no correlation with the target variable
- Support Vector Machines (SVM) and Random Forest (RF) are favored for their ability to handle high-dimensional data
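To make the pre-processing step concrete, here is a minimal sketch of binning a continuous variable into discrete intervals. This uses unsupervised equal-width binning purely for illustration; the methods in these notes are supervised and choose cut points using the class labels.

```python
# Minimal illustration of discretization as pre-processing: map a
# continuous variable to k equal-width interval indices.
# (Unsupervised example; the supervised methods discussed in these
# notes instead pick cut points using the target variable.)

def equal_width_bins(values, k):
    """Map each value to an interval index in 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against all-equal values
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [23.0, 35.5, 47.2, 51.0, 62.3, 70.1]
print(equal_width_bins(ages, 3))  # three bins over the age range
```

A learner such as NB then treats the bin index as a categorical feature in place of the raw value.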
Discretization method
- Boullé developed a Minimum Optimal Description Length (MODL) method based on the Minimum Description Length (MDL) principle
- Examines all possible discretizations, giving O(n^3) time complexity
- New Efficient Bayesian Discretization (EBD) uses a Bayesian score to evaluate each discretization model
- Runs faster than MODL, with O(n^2) time complexity
- EBD performs better than the commonly used Fayyad and Irani MDLPC discretization algorithm
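The O(n^2) search can be sketched as dynamic programming over cut positions: the best partition of the first i points extends the best partition of some shorter prefix by one final interval. This is a hedged illustration, not the EBD algorithm itself; the per-interval score used here (a penalized log-likelihood) is a placeholder for EBD's actual Bayesian score, which these notes do not spell out.

```python
import math

def best_discretization(labels):
    """Best-scoring partition of class labels (pre-sorted by the
    continuous variable) into contiguous intervals, via dynamic
    programming over cut positions. O(n^2) overall because the last
    interval's class counts are updated incrementally as it grows.

    The interval score below (log-likelihood minus a log penalty) is
    a placeholder, NOT the Bayesian score from the EBD paper."""
    n = len(labels)
    best = [-math.inf] * (n + 1)  # best[i] = best score of first i points
    cut = [0] * (n + 1)           # cut[i] = start of the last interval
    best[0] = 0.0
    for i in range(1, n + 1):
        counts = {}
        for j in range(i - 1, -1, -1):       # last interval = labels[j:i]
            counts[labels[j]] = counts.get(labels[j], 0) + 1
            m = i - j
            ll = sum(c * math.log(c / m) for c in counts.values())
            s = best[j] + ll - math.log(m + 1)  # penalized likelihood
            if s > best[i]:
                best[i], cut[i] = s, j
    cuts, i = [], n               # walk back through stored cut points
    while i > 0:
        i = cut[i]
        if i > 0:
            cuts.append(i)
    return sorted(cuts)
```

On labels already sorted by the variable, e.g. `['a','a','a','b','b','b']`, the search recovers the single cut between the two pure runs; on a pure sequence the penalty keeps the variable in one interval, matching the single-interval behavior noted above for uninformative variables.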
Classification Performance Measure
- Relative Classifier Information (RCI) quantifies the amount of uncertainty in a decision problem that is reduced relative to using only the prior probabilities of each class
- Similar to the area under the ROC curve (AUC), as it measures the discriminatory power of the classifier while minimizing the effect of the class distribution
- Wilcoxon paired-sample signed-rank test is used to compare RCI values
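A sketch of an uncertainty-reduction measure in the spirit of RCI as described above: how much of the prior class uncertainty remains once the classifier's prediction is known, computed from a confusion matrix. This is one plausible reading for illustration, not necessarily the exact RCI definition used in the paper.

```python
# Illustrative "uncertainty reduction" measure in the spirit of RCI:
# fraction of the prior class entropy removed by conditioning on the
# classifier's prediction. An assumed reading, not the paper's exact
# RCI formula.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def uncertainty_reduction(confusion):
    """confusion[i][j] = count of true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    prior = [sum(row) / total for row in confusion]
    h_prior = entropy(prior)
    h_post = 0.0  # expected entropy of the true class given the prediction
    ncls = len(confusion)
    for j in range(ncls):
        col = [confusion[i][j] for i in range(ncls)]
        colsum = sum(col)
        if colsum:
            h_post += (colsum / total) * entropy([c / colsum for c in col])
    return (h_prior - h_post) / h_prior  # 1 = all uncertainty removed

print(uncertainty_reduction([[10, 0], [0, 10]]))  # perfect classifier -> 1.0
```

Like the RCI described above, this stays comparable across class distributions because it is normalized by the prior entropy; a classifier that ignores the input scores 0.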
Results
- EBD resulted in a substantial decrease in the number of selected variables
- EBD improved the performance of all the algorithms tested: SVM, RF, and NB (Naive Bayes)
- Using discrete values over continuous values improved performance of RF and NB but not SVM
- NB benefits from smoothing of the parameters that discretization provides
- Performance gains from discretization accrue to a large extent from variable selection and to a smaller extent from the transformation of the variable from continuous to discrete