A Fully Bayesian Logistic Regression Model for Classification of ZADA Diabetes Dataset

  • Masoud M. Hassan, Dept. of Computer Science, Faculty of Science, University of Zakho, Kurdistan Region, Iraq.
Keywords: Diabetes, Bayesian Logistic Regression, Markov Chain Monte Carlo, Classification, Informative Priors


Classifying diabetes data with existing data mining and machine learning algorithms is challenging, and the predictions are not always accurate. We aim to build a model that reduces misclassification and can accurately diagnose and classify diabetes. In this study, we investigated the use of Bayesian Logistic Regression (BLR) for mining such data to diagnose and classify various diabetes conditions. The approach is fully Bayesian and well suited to automated Markov Chain Monte Carlo (MCMC) simulation. Bayesian methods are useful for analysing medical data because they support rich hierarchical models, quantify uncertainty, and incorporate prior information. The analysis used a real medical dataset of 909 patients in Zakho city, with a binary class label and seven independent variables. Three different prior distributions (Gaussian, Laplace and Cauchy) were investigated for the proposed model, which was fitted by MCMC. The performance and behaviour of the Bayesian approach were illustrated and compared with traditional classification algorithms on this dataset using 10-fold cross-validation. Overall, BLR with informative Gaussian priors outperformed the alternatives on the accuracy metrics considered, achieving an accuracy of 92.53%, a recall of 94.85%, a precision of 91.42% and an F1 score of 93.11%. These results suggest that it is worthwhile to explore the application of BLR with informative prior distributions to predictive modelling tasks in medical studies.
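To make the modelling idea concrete, the following is a minimal sketch of Bayesian logistic regression fitted by MCMC under the three prior families named above. It is not the paper's implementation: the random-walk Metropolis sampler, the zero-centred priors with a common scale of 2.5, the synthetic data, and every function name (`log_prior`, `log_likelihood`, `metropolis`, `posterior_predict`) are assumptions chosen for illustration only.

```python
import numpy as np


def log_prior(beta, kind="gaussian", scale=2.5):
    """Log-density, up to a constant, of independent zero-centred priors.

    The scale value is an illustrative assumption, not taken from the paper.
    """
    if kind == "gaussian":
        return -0.5 * np.sum((beta / scale) ** 2)
    if kind == "laplace":
        return -np.sum(np.abs(beta)) / scale
    if kind == "cauchy":
        return -np.sum(np.log1p((beta / scale) ** 2))
    raise ValueError(f"unknown prior: {kind}")


def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood with a logit link."""
    eta = X @ beta
    # y * eta - log(1 + exp(eta)), evaluated in a numerically stable way
    return np.sum(y * eta - np.logaddexp(0.0, eta))


def metropolis(X, y, kind="gaussian", n_iter=4000, burn_in=1000, step=0.1, seed=0):
    """Random-walk Metropolis sampler for the regression coefficients."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    current = log_likelihood(beta, X, y) + log_prior(beta, kind)
    draws = []
    for i in range(n_iter):
        proposal = beta + step * rng.standard_normal(beta.size)
        candidate = log_likelihood(proposal, X, y) + log_prior(proposal, kind)
        # Accept with probability min(1, posterior ratio)
        if np.log(rng.uniform()) < candidate - current:
            beta, current = proposal, candidate
        if i >= burn_in:
            draws.append(beta.copy())
    return np.array(draws)


def posterior_predict(draws, X):
    """Average the predicted class-1 probability over posterior draws."""
    probs = 1.0 / (1.0 + np.exp(-(X @ draws.T)))
    return probs.mean(axis=1)


# Synthetic stand-in data (NOT the ZADA dataset): intercept plus 3 features
rng = np.random.default_rng(42)
n = 300
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
true_beta = np.array([-0.5, 1.5, -1.0, 0.5])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

results = {}
for kind in ("gaussian", "laplace", "cauchy"):
    draws = metropolis(X, y, kind=kind)
    pred = (posterior_predict(draws, X) >= 0.5).astype(float)
    results[kind] = {
        "draws": draws,
        "mean": draws.mean(axis=0),
        "accuracy": float((pred == y).mean()),
    }
    print(kind, results[kind]["mean"].round(2), round(results[kind]["accuracy"], 3))
```

The accuracy printed here is measured on the training data of a toy problem, so it says nothing about the figures reported in the abstract. A faithful comparison would use informative priors elicited for the real variables and evaluate each model with 10-fold cross-validation, as the study describes.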

Author Biography

Masoud M. Hassan, Dept. of Computer Science, Faculty of Science, University of Zakho, Kurdistan Region, Iraq (Masoud.hassan@uoz.edu.krd)


How to Cite
Hassan, M. (2020). A Fully Bayesian Logistic Regression Model for Classification of ZADA Diabetes Dataset. Science Journal of University of Zakho, 8(3), 105-111. https://doi.org/10.25271/sjuoz.2020.8.3.707
Science Journal of University of Zakho