Title: Data Anonymisation, Outliers Detection and Fighting Overfitting with Restricted Boltzmann Machines.
Abstract: We propose a novel approach to the anonymisation of datasets through non-parametric learning of the underlying multivariate distribution of dataset features and generation of new synthetic samples from the learned distribution. The main objective is to ensure equal (or better) performance of classifiers and regressors trained on the synthetic datasets in comparison with the same classifiers and regressors trained on the original data. The ability to generate an unlimited number of synthetic data samples from the learned distribution can be a remedy for overfitting when dealing with small original datasets. When the synthetic data generator is trained as an autoencoder with a bottleneck information compression structure, we can also expect a reduced number of outliers in the generated datasets, thus further improving the generalization capabilities of classifiers trained on synthetic data. We achieve these objectives with the help of the Restricted Boltzmann Machine, a special type of generative neural network that possesses all the required properties of a powerful data anonymiser.
Based on joint work with Alexei Kondratyev and Christian Schwarz.
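The pipeline described in the abstract (learn the joint distribution of the features, then sample synthetic records from it) can be sketched with an off-the-shelf RBM implementation. The snippet below is a minimal illustration, not the authors' actual method: it uses scikit-learn's `BernoulliRBM` on a hypothetical toy dataset of binary features, with hyperparameters (`n_components`, `n_iter`, the number of Gibbs burn-in steps) chosen arbitrarily for demonstration.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(0)
# Hypothetical toy dataset: 200 original samples of 16 binary features.
X = (rng.random((200, 16)) > 0.5).astype(np.float64)

# Learn the joint distribution of the features non-parametrically.
rbm = BernoulliRBM(n_components=8, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)

# Generate anonymised synthetic samples by running block Gibbs sampling
# from random starting states; after burn-in the chain draws from the
# learned distribution rather than reproducing any original record.
v = (rng.random((100, 16)) > 0.5).astype(np.float64)
for _ in range(50):  # burn-in Gibbs steps
    v = rbm.gibbs(v)
synthetic = np.asarray(v, dtype=int)
print(synthetic.shape)
```

In this sketch the synthetic dataset can be made as large as desired by drawing more chains or more states from a long chain, which is the property the abstract leverages against overfitting on small original datasets.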