- Discretization is typically used as a pre-processing step for machine learning
- Supervised discretization methods will discretize a variable into a single interval if the variable has little to no correlation with the target variable
- Support Vector Machines (SVM) and Random Forest (RF) are favored for their ability to handle high-dimensional data
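To make the pre-processing step concrete, here is a minimal sketch of binning a continuous variable into discrete intervals. This uses unsupervised equal-width binning purely for illustration; the methods in these notes are supervised and choose cut points using the class labels.

```python
# Minimal illustration of discretization as pre-processing: map a
# continuous variable to k equal-width interval indices.
# (Unsupervised example; the supervised methods discussed in these
# notes instead pick cut points using the target variable.)

def equal_width_bins(values, k):
    """Map each value to an interval index in 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against all-equal values
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [23.0, 35.5, 47.2, 51.0, 62.3, 70.1]
print(equal_width_bins(ages, 3))  # three bins over the age range
```

A learner such as NB then treats the bin index as a categorical feature in place of the raw value.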
Discretization method
- Boullé developed a Minimum Optimal Description Length (MODL) method based on the Minimum Description Length (MDL) principle
- Examines all possible discretizations, giving O(n^3) time complexity
- New Efficient Bayesian Discretization (EBD) uses a Bayesian score to evaluate each discretization model
- Runs faster than MODL, with O(n^2) time complexity
- EBD performs better than the commonly used Fayyad and Irani MDLPC discretization algorithm
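The O(n^2) search can be sketched as dynamic programming over cut positions: the best partition of the first i points extends the best partition of some shorter prefix by one final interval. This is a hedged illustration, not the EBD algorithm itself; the per-interval score used here (a penalized log-likelihood) is a placeholder for EBD's actual Bayesian score, which these notes do not spell out.

```python
import math

def best_discretization(labels):
    """Best-scoring partition of class labels (pre-sorted by the
    continuous variable) into contiguous intervals, via dynamic
    programming over cut positions. O(n^2) overall because the last
    interval's class counts are updated incrementally as it grows.

    The interval score below (log-likelihood minus a log penalty) is
    a placeholder, NOT the Bayesian score from the EBD paper."""
    n = len(labels)
    best = [-math.inf] * (n + 1)  # best[i] = best score of first i points
    cut = [0] * (n + 1)           # cut[i] = start of the last interval
    best[0] = 0.0
    for i in range(1, n + 1):
        counts = {}
        for j in range(i - 1, -1, -1):       # last interval = labels[j:i]
            counts[labels[j]] = counts.get(labels[j], 0) + 1
            m = i - j
            ll = sum(c * math.log(c / m) for c in counts.values())
            s = best[j] + ll - math.log(m + 1)  # penalized likelihood
            if s > best[i]:
                best[i], cut[i] = s, j
    cuts, i = [], n               # walk back through stored cut points
    while i > 0:
        i = cut[i]
        if i > 0:
            cuts.append(i)
    return sorted(cuts)
```

On labels already sorted by the variable, e.g. `['a','a','a','b','b','b']`, the search recovers the single cut between the two pure runs; on a pure sequence the penalty keeps the variable in one interval, matching the single-interval behavior noted above for uninformative variables.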
Classification Performance Measure
- Relative Classifier Information (RCI) quantifies the amount of uncertainty in a decision problem that is reduced relative to using only the prior probabilities of each class
- Similar to the area under the ROC curve (AUC), as it measures the discriminatory power of the classifier while minimizing the effect of the class distribution
- Wilcoxon paired-sample signed-rank test is used to compare RCI values
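A sketch of an uncertainty-reduction measure in the spirit of RCI as described above: how much of the prior class uncertainty remains once the classifier's prediction is known, computed from a confusion matrix. This is one plausible reading for illustration, not necessarily the exact RCI definition used in the paper.

```python
# Illustrative "uncertainty reduction" measure in the spirit of RCI:
# fraction of the prior class entropy removed by conditioning on the
# classifier's prediction. An assumed reading, not the paper's exact
# RCI formula.
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def uncertainty_reduction(confusion):
    """confusion[i][j] = count of true class i predicted as class j."""
    total = sum(sum(row) for row in confusion)
    prior = [sum(row) / total for row in confusion]
    h_prior = entropy(prior)
    h_post = 0.0  # expected entropy of the true class given the prediction
    ncls = len(confusion)
    for j in range(ncls):
        col = [confusion[i][j] for i in range(ncls)]
        colsum = sum(col)
        if colsum:
            h_post += (colsum / total) * entropy([c / colsum for c in col])
    return (h_prior - h_post) / h_prior  # 1 = all uncertainty removed

print(uncertainty_reduction([[10, 0], [0, 10]]))  # perfect classifier -> 1.0
```

Like the RCI described above, this stays comparable across class distributions because it is normalized by the prior entropy; a classifier that ignores the input scores 0.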
Results
- EBD resulted in a substantial decrease in the number of selected variables
- EBD improved the performance of all the algorithms tested: SVM, RF, and NB (Naive Bayes)
- Using discrete values over continuous values improved performance of RF and NB but not SVM
- NB benefits from smoothing of the parameters that discretization provides
- Performance gains from discretization accrue to a large extent from variable selection and to a smaller extent from the transformation of the variable from continuous to discrete