Classification rule based on Bayesian naive Bayes models with feature selection bias corrected
Longhai Li, Department of Mathematics and Statistics, University of Saskatchewan
Copyright
Permission is granted for anyone to copy, use, modify, or distribute these programs and accompanying documents for any purpose, provided this copyright notice is retained and prominently displayed, and note is made of any changes made to these programs. These programs and documents are distributed without any warranty, express or implied. As the programs were written for research purposes only, they have not been tested to the degree that would be advisable in any important application. All use of these programs is entirely at the user's own risk.
If you find this software useful in your work, please cite the software or the papers below.
Description
This R package predicts a binary response from high-dimensional binary
features using Bayesian naive Bayes models. The software also accepts
real-valued features, but they are converted to binary by thresholding at the
medians estimated from the data. A small number of features can be selected
based on their correlations with the response, and the bias introduced by this
selection can be corrected.
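The two preprocessing steps described above, median thresholding and correlation-based feature selection, can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function names `binarize_by_median` and `select_by_correlation` are hypothetical and are not part of the package, values equal to the median are mapped to 0 here, and the package may handle ties and constant features differently. The sketch does not include the bias correction.

```r
# Hypothetical helper (not from the package): convert a real-valued matrix
# (rows = cases, columns = features) to 0/1 by thresholding each column
# at its own sample median. Values equal to the median become 0 (assumption).
binarize_by_median <- function(x) {
  x <- as.matrix(x)
  med <- apply(x, 2, median)
  # sweep() compares each column with its median; adding 0L maps TRUE/FALSE to 1/0
  sweep(x, 2, med, FUN = ">") + 0L
}

# Hypothetical helper (not from the package): indices of the k features whose
# sample correlation with the response is largest in absolute value.
select_by_correlation <- function(xbin, y, k) {
  r <- suppressWarnings(cor(xbin, y))  # constant columns give NA
  r[is.na(r)] <- 0                     # treat them as uninformative
  order(abs(r), decreasing = TRUE)[seq_len(k)]
}

# Small made-up data set: 20 cases, 4 features.
y <- rep(c(0, 1), each = 10)
x <- cbind(1:20,                   # separates the two classes perfectly
           rep(c(1, 2), 10),       # unrelated to the response
           rep(7, 20),             # constant feature
           c(1:8, 19, 20, 9:18))   # agrees with the response in 16 of 20 cases
xbin <- binarize_by_median(x)
selected <- select_by_correlation(xbin, y, k = 2)
```

Selecting features by looking at the response in this way is exactly what makes an uncorrected classifier overconfident, which is the bias the package is designed to correct.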
A shortcut function for running cross-validation with the classifier is also
provided.
The software is best suited to data of very high dimension, for example the
diagnosis of cancer from gene expression data.
Source Packages and Documentation
References
Li, L., Zhang, J., and Neal, R. M. (2008), A Method for Avoiding Bias from Feature Selection with Application to Naive Bayes Classification Models, Bayesian Analysis, volume 3, number 1, pp. 171-196.
Li, L. (2007), Bayesian Classification and Regression with High Dimensional Features, Ph.D. thesis, University of Toronto.
Instructions for Installing an R Package and Using R
Examples of classification with Colon gene expression data
The original real-valued colon data in R format: colon.rda. The binary colon data in R format: colon.bin.rda. The data set contains 62 patients (40 vs 22) and
2000 genes. Either file can be loaded into the R workspace with the "load" function:
> load("colon.bin.rda")
Test how well the above method performs with leave-one-out cross-validation:
> cv.bayes(colon.bin, TRUE, 62, 4, 0.4, 30, 0.8, 5, 30, TRUE, 40)
Results:
The output of the above R command is shown in cv-colon-result. The error rate of this analysis is
0.0967742, i.e. 6 out of 62 cases were misclassified. To my knowledge, this is the lowest error
rate for the Colon data. Only 4 features out of 2000 were selected
in each iteration of the cross-validation. The method is also very fast, taking
103 seconds in total for the 62-fold cross-validation, including the time for
feature selection. Finally, the method is
conceptually simple.
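To make the procedure concrete, the following is a minimal sketch of leave-one-out cross-validation in which the features are re-selected inside every fold, so the held-out case never influences the selection. It is not the implementation behind cv.bayes: the classifier is an ordinary naive Bayes with add-one smoothing used as a stand-in for the bias-corrected Bayesian model, and the names `nb_fit`, `nb_predict` and `loocv_error` as well as the toy data are hypothetical.

```r
# Stand-in classifier (assumption): plain naive Bayes for 0/1 features with
# add-one smoothing. NOT the bias-corrected Bayesian model of the package.
nb_fit <- function(x, y) {
  lapply(c(0, 1), function(cl) {
    xc <- x[y == cl, , drop = FALSE]
    list(prior = (nrow(xc) + 1) / (nrow(x) + 2),
         theta = (colSums(xc) + 1) / (nrow(xc) + 2))
  })
}

# Predict the class (0 or 1) of one case from its selected binary features.
nb_predict <- function(fit, xnew) {
  score <- sapply(fit, function(f) {
    log(f$prior) + sum(xnew * log(f$theta) + (1 - xnew) * log(1 - f$theta))
  })
  which.max(score) - 1  # classes are coded 0 and 1
}

# Leave-one-out error rate; the k features are chosen from the training
# cases of each fold only.
loocv_error <- function(x, y, k) {
  n <- nrow(x)
  wrong <- 0
  for (i in seq_len(n)) {
    xtr <- x[-i, , drop = FALSE]
    ytr <- y[-i]
    r <- suppressWarnings(cor(xtr, ytr))  # constant columns give NA
    r[is.na(r)] <- 0
    keep <- order(abs(r), decreasing = TRUE)[seq_len(k)]
    fit <- nb_fit(xtr[, keep, drop = FALSE], ytr)
    pred <- nb_predict(fit, x[i, keep])
    wrong <- wrong + (pred != y[i])
  }
  wrong / n
}

# Made-up binary data: 20 cases, 4 features, the first one equal to the response.
y <- rep(c(0, 1), each = 10)
x <- cbind(y,
           rep(c(0, 1), 10),
           rep(0, 20),
           c(rep(0, 8), 1, 1, 0, 0, rep(1, 8)))
err <- loocv_error(x, y, k = 2)
```

The reported error rate follows the same definition: the number of misclassified held-out cases divided by the number of folds, here 6 / 62.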