Genetic algorithms for subset selection in model-based clustering

Scrucca, Luca

Model-based clustering assumes that the observed data can be represented by a finite mixture model, where each cluster is represented by a parametric distribution. In the multivariate continuous case, the Gaussian distribution is often employed. Identifying the subset of relevant clustering variables reduces the number of unknown parameters, thus yielding more efficient estimation, clearer interpretation, and, often, better clustering partitions. This paper discusses variable (feature) selection for model-based clustering. The problem of subset selection is recast as a model comparison problem, and BIC is used to approximate Bayes factors. The potentially vast solution space is searched with genetic algorithms, stochastic search algorithms that use techniques and concepts inspired by evolutionary biology and natural selection.
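The search strategy described above can be illustrated with a minimal sketch of a genetic algorithm over variable subsets, where each candidate is a binary string marking which variables are included. This is not the paper's implementation: the function name `ga_subset_search`, its parameters, and the operators chosen (tournament selection, one-point crossover, bit-flip mutation, elitism) are assumptions made for illustration. The fitness used here, `toy_fitness`, is a hypothetical stand-in; in the setting of the paper the fitness would instead be a BIC-based comparison computed from fitted Gaussian mixture models.

```python
import random
from typing import Callable, List


def ga_subset_search(
    n_vars: int,
    fitness: Callable[[List[int]], float],
    pop_size: int = 30,
    generations: int = 40,
    p_crossover: float = 0.8,
    p_mutation: float = 0.05,
    seed: int = 0,
) -> List[int]:
    """Illustrative GA: maximise `fitness` over 0/1 inclusion masks."""
    rng = random.Random(seed)
    # Random initial population of inclusion masks (1 = variable selected).
    pop = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    for _ in range(generations):
        scores = [fitness(ind) for ind in pop]

        def tournament() -> List[int]:
            # Binary tournament: the fitter of two random individuals wins.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return pop[a] if scores[a] >= scores[b] else pop[b]

        new_pop = [best[:]]  # elitism: the best mask so far always survives
        while len(new_pop) < pop_size:
            p1, p2 = tournament(), tournament()
            if n_vars > 1 and rng.random() < p_crossover:
                cut = rng.randrange(1, n_vars)  # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation, applied independently to each position.
            child = [1 - g if rng.random() < p_mutation else g for g in child]
            new_pop.append(child)

        pop = new_pop
        best = max(pop, key=fitness)

    return best


def toy_fitness(mask: List[int]) -> float:
    # Hypothetical stand-in for a BIC-based criterion: reward the two
    # "relevant" variables (indices 0 and 2), penalise every other one.
    relevant = {0, 2}
    return sum(10 if i in relevant else -1 for i, g in enumerate(mask) if g)


best = ga_subset_search(n_vars=6, fitness=toy_fitness)
print(best, toy_fitness(best))
```

Because the best mask is copied unchanged into each new generation, the best fitness found never decreases. The search is stochastic, so the result depends on the seed and is not guaranteed to be the global optimum.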