Most real world data sets consist of data vectors whose individual components are not statistically independent, that is, they are redundant in the statistical sense. Then it is desirable to create a factorial code of the data, i. e., a new vectorvalued representation of each data vector such that it gets uniquely encoded by the resulting code vector (lossfree coding), but the code components are statistically independent.
Later supervised learning usually works much better when the raw input data is first translated into such a factorial code. For example, suppose the final goal is to classify images with highly redundant pixels. A naive Bayes classifier will assume the pixels are statistically independent random variables and therefore fail to produce good results. If the data are first encoded in a factorial way, however, then the naive Bayes classifier will achieve its optimal performance (compare Schmidhuber et al. 1996).
To create factorial codes, Horace Barlow and coworkers suggested to minimize the sum of the bit entropies of the code components of binary codes (1989). Jürgen Schmidhuber (1992) reformulated the problem in terms of predictors and binary feature detectors, each receiving the raw data as an input. For each detector there is a predictor that sees the other detectors and learns to predict the output of its own detector in response to the various input vectors or images. But each detector uses a machine learning algorithm to become as unpredictable as possible. The global optimum of this objective function corresponds to a factorial code represented in a distributed fashion across the outputs of the feature detectors.
