Forecasting Probability of Bankruptcy from unbalanced data

Pierri, Francesca; Stanghellini, Elena; Bistoni, N.

When analysing the determinants of bankruptcy among small and medium enterprises, one of the most common problems is unbalanced data: the event under study typically occurs in only a small percentage of cases. The aim of this paper is to explore three statistical methods of coping with unbalanced data and to identify which of them has the greatest predictive capability for the bankruptcy event. The dataset comprises all firms that were active in Tuscany in 2006. For each firm we have a five-year series of balance sheet indicators, and bankruptcy is defined by the firm's legal status as of May 2010. We focused on indicators previously identified as predictors of bankruptcy (Pierri 2013; Pierri, Burchi and Stanghellini 2013) and tested the same model using three methods: logistic regression for matched case-control studies, logistic regression on a randomly balanced data sample, and logistic regression on a sample balanced by ROSE (Random OverSampling Examples; Menardi and Torelli 2014). We built a training sample to develop the models and a hold-out sample to compare their discriminatory ability through ROC curves.
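As an illustration of the evaluation design only, the following is a minimal sketch of the second method: a logistic regression fitted on a randomly balanced training sample and scored by the area under the ROC curve on an untouched hold-out sample. Everything in it is an assumption made for the example rather than the authors' implementation: the data are synthetic, there is a single hypothetical indicator `x`, the event rate is set to 5%, the split is 70/30, and the model is fitted by plain gradient descent. The matched case-control and ROSE variants are not shown.

```python
import math
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Synthetic, hypothetical data: one indicator x per firm, y = 1 if bankrupt.
# Bankrupt firms are rare (about 5%) and have a higher x on average.
firms = []
for _ in range(4000):
    y = 1 if rng.random() < 0.05 else 0
    x = rng.gauss(1.0 if y == 1 else 0.0, 1.0)
    firms.append((x, y))

# Split into a training sample and a hold-out sample (70/30, an assumption).
rng.shuffle(firms)
cut = int(0.7 * len(firms))
train, holdout = firms[:cut], firms[cut:]

# Randomly balance the training sample: keep every bankrupt firm and draw
# an equal number of non-bankrupt firms without replacement.
cases = [f for f in train if f[1] == 1]
controls = [f for f in train if f[1] == 0]
balanced = cases + rng.sample(controls, len(cases))

def fit_logistic(data, lr=0.1, epochs=500):
    """Fit P(y=1|x) = 1 / (1 + exp(-(b0 + b1*x))) by gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += p - y
            g1 += (p - y) * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

b0, b1 = fit_logistic(balanced)

# Score the hold-out sample, which keeps its original (unbalanced) mix.
scores = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x, _ in holdout]
labels = [y for _, y in holdout]
auc_value = auc(scores, labels)
print(f"hold-out AUC: {auc_value:.3f}")
```

Balancing is applied to the training sample only; the hold-out sample is left unbalanced so that the comparison reflects the population on which the model would be used. Because balancing changes the event rate seen by the model, the fitted intercept `b0` no longer reflects the population rate, whereas the AUC depends only on how the scores are ranked.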