Title: Data Anonymisation, Outlier Detection and Fighting Overfitting with Restricted Boltzmann Machines.
Abstract: We propose a novel approach to the anonymisation of datasets
through non-parametric learning of the underlying multivariate
distribution of dataset features and generation of new
synthetic samples from the learned distribution. The main objective
is to ensure that classifiers and regressors trained on the synthetic
datasets perform as well as (or better than) the same
classifiers and regressors trained on the original data. The ability to
generate an unlimited number of synthetic data samples from the
learned distribution can help to counter overfitting when
dealing with small original datasets. When the synthetic data generator
is trained as an autoencoder with a bottleneck information-compression
structure, we can also expect to see a reduced number of
outliers in the generated datasets, thus further improving the
generalization capabilities of the classifiers trained on synthetic
data. We achieve these objectives with the help of the Restricted
Boltzmann Machine, a special type of generative neural network that
possesses all the required properties of a powerful data anonymiser.
Based on joint work with Alexei Kondratyev and Christian Schwarz.
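
Illustration: the idea of learning a distribution and then sampling synthetic data from it can be sketched with a very small Bernoulli Restricted Boltzmann Machine. This is a minimal sketch under stated assumptions, not the method used in the work described above: the features are assumed to be already encoded as binary values, training uses one-step contrastive divergence (CD-1), and the names TinyRBM, fit, generate, original and synthetic are hypothetical. The hidden layer is chosen smaller than the visible layer to mimic the bottleneck structure mentioned in the abstract.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyRBM:
    """Hypothetical minimal Bernoulli RBM trained with CD-1 (illustration only)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)  # visible biases
        self.b = np.zeros(n_hidden)   # hidden biases

    def sample_hidden(self, v):
        # Activation probabilities of hidden units and a binary sample of them.
        p = sigmoid(v @ self.W + self.b)
        return p, (self.rng.random(p.shape) < p).astype(float)

    def sample_visible(self, h):
        # Activation probabilities of visible units and a binary sample of them.
        p = sigmoid(h @ self.W.T + self.a)
        return p, (self.rng.random(p.shape) < p).astype(float)

    def fit(self, data, epochs=200, lr=0.1):
        # Full-batch CD-1: one Gibbs step away from the data per update.
        n = data.shape[0]
        for _ in range(epochs):
            ph0, h0 = self.sample_hidden(data)
            _, v1 = self.sample_visible(h0)
            ph1, _ = self.sample_hidden(v1)
            self.W += lr * (data.T @ ph0 - v1.T @ ph1) / n
            self.a += lr * (data - v1).mean(axis=0)
            self.b += lr * (ph0 - ph1).mean(axis=0)
        return self

    def generate(self, n_samples, n_gibbs=100):
        # Start from random binary vectors and run alternating Gibbs sampling.
        v = (self.rng.random((n_samples, self.a.size)) < 0.5).astype(float)
        for _ in range(n_gibbs):
            _, h = self.sample_hidden(v)
            _, v = self.sample_visible(h)
        return v


# Toy stand-in for an "original" binary-encoded dataset (assumed, not real data).
rng = np.random.default_rng(1)
original = (rng.random((500, 6)) < 0.5).astype(float)

# Bottleneck: 3 hidden units for 6 visible units.
rbm = TinyRBM(n_visible=6, n_hidden=3).fit(original)

# Any number of synthetic rows can be drawn once the model is trained.
synthetic = rbm.generate(1000)
```

In this sketch the synthetic rows are fresh samples from the model and not copies of original rows, and the number of generated rows is independent of the size of the original dataset. Whether downstream classifiers perform equally well on such samples has to be checked empirically for each dataset.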