Learning from nothing, or almost

Last article

In my first article for this blog, I talked about how my teachers and a team of students I joined used the latest Deep Learning (DL) technologies to help fight cancer.
The goal was to segment and colorize the areas of a scanner image that correspond to biological tissues, which can help estimate the patient's health and, in turn, lead to a better-tailored treatment.

 

Not colored scan

Colorization of a scanner image
At the time, in order to focus on how technology can improve medicine,
I had to skip a crucial component of the IODA project: the pre-training of the auto-encoders. By presenting it here, I'll try to illustrate some of the practical problems of deep learning.

So, first, let's do a quick refresher on deep learning.

Lasagna and matrices

Lasagna and matrix multiplication: I swear it'll make sense

Matrices

Last time, I said that a DL model was a stack of Lego bricks, each brick with its own behaviour. This time, let's go a bit further:
a DL model is a stack of lasagna. Each layer of lasagna is a matrix of numbers called the weights of the layer (in reference to the biological inspiration of the technique).
A matrix is just a rectangular array of numbers arranged in rows and columns, like an Excel spreadsheet. These weights define how the layer transforms its input before passing it to the next layer. This linear transformation is called a matrix multiplication and is illustrated in the figure above.
In reality, a deep learning layer contains not only weights but also non-linear components, but I won't delve into those here.
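
To make this concrete, here is a tiny sketch in Python with NumPy (all the sizes here are made up for illustration): a "layer" is a weight matrix, and applying it to an input is a matrix multiplication, usually followed by a non-linearity.

import numpy as np

rng = np.random.default_rng(0)

# A "layer" is just a weight matrix: here it maps 4 input features to 3 outputs
W = rng.normal(size=(4, 3))

# One input sample with 4 features
x = rng.normal(size=(1, 4))

# The layer transforms its input with a matrix multiplication...
z = x @ W

# ...and a real deep-learning layer then applies a non-linear component, e.g. ReLU
a = np.maximum(z, 0)

print(a.shape)  # (1, 3): the 4 input features have been mapped to 3 new ones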

At the core of deep learning lies an optimization problem:
using various methods, we try to find the best set of weights to solve our problem. To do so, we pass examples from our dataset through the model, figure out which weights the errors come from, and change those weights to reduce the error. Passing examples through is fast and easy; figuring out how to change the weights is the slow and complicated part.

So a neural network has two modes:

  • the training: the network is fed with examples from the dataset and we change its weights (i.e. make it _learn_) to minimize an error measure we have defined.
  • the prediction: the actual use of the network; we feed it an input sample, and the network uses the weights it has learned to transform it into an output. (Both modes are sketched below.)
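
Here is a minimal sketch of both modes with NumPy, on a single linear layer and with made-up data, sizes and learning rate; real training relies on backpropagation through many (non-linear) layers, but the idea is the same.

import numpy as np

rng = np.random.default_rng(0)

# A toy dataset: 100 examples with 4 features, and the outputs we want to learn
X = rng.normal(size=(100, 4))
W_true = rng.normal(size=(4, 2))
Y = X @ W_true

# Training mode: start from random weights, pass examples through,
# measure the error and nudge the weights to reduce it
W = rng.normal(size=(4, 2))
learning_rate = 0.05
for _ in range(200):
    predictions = X @ W                 # pass the examples through the layer
    error = predictions - Y             # how far are we from the expected results?
    gradient = X.T @ error / len(X)     # which weights do the errors come from?
    W -= learning_rate * gradient       # change the weights to diminish the error

# Prediction mode: the weights are frozen, we just transform a new input
x_new = rng.normal(size=(1, 4))
print(x_new @ W)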

IODA pre-training

Autoencoder schema

 

Now, back to IODA.
You remember that it is built on an auto-encoder, so three parts (sketched in code right after this list):

  • an encoder neural network that takes the input and reduces its features
  • an intermediate form that contains the condensed information of the input
  • a decoder neural network that takes the intermediate form and tries to recreate the input
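
As a rough NumPy sketch (sizes made up for illustration), the three parts look like this:

import numpy as np

rng = np.random.default_rng(0)

W_encoder = rng.normal(size=(8, 3))   # encoder: reduces 8 features to 3
W_decoder = rng.normal(size=(3, 8))   # decoder: expands the 3 features back to 8

x = rng.normal(size=(1, 8))           # one input sample

code = x @ W_encoder                  # the intermediate, condensed form
x_rebuilt = code @ W_decoder          # the attempted reconstruction of the input

# Training the auto-encoder means adjusting W_encoder and W_decoder
# so that x_rebuilt gets as close as possible to x
print(code.shape, x_rebuilt.shape)    # (1, 3) (1, 8)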

So that's two neural networks to train. And that's bothersome, because a big network takes hours or days to train, even on specialized machines. We would like to reduce this time to a minimum.

Another problem is that not all layers of a neural network learn at the same speed: the layers furthest from the error signal learn much more slowly (the so-called vanishing-gradient problem).
If you were to simply throw your training set at a very deep neural network, you would have to wait ages for it to train.
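
You can get an intuition for this with a small NumPy experiment (depth, width and weight scale are made up for illustration): the error signal shrinks at every sigmoid layer it travels through, so the layers far from it receive almost nothing to learn from.

import numpy as np

rng = np.random.default_rng(0)
depth, width = 10, 32

# Forward pass through a stack of sigmoid layers, keeping the activations
weights = [rng.normal(size=(width, width)) * 0.1 for _ in range(depth)]
activations = [rng.normal(size=(1, width))]
for W in weights:
    activations.append(1.0 / (1.0 + np.exp(-activations[-1] @ W)))

# Backward pass: watch the error signal shrink as it travels towards the input
grad = np.ones((1, width))
for i in range(depth - 1, -1, -1):
    grad = (grad * activations[i + 1] * (1 - activations[i + 1])) @ weights[i].T
    print(f"layer {i + 1}: gradient norm = {np.linalg.norm(grad):.1e}")

With this (made-up) weight scale, the printed norms drop quickly from one layer to the next, which is exactly why naive end-to-end training of a very deep stack is so slow.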

 

The trick

But in the case of an auto-encoder, we don't really need to train 1) both networks at the same time, nor 2) all the layers at the same time.
In the end, we just want an encoder that correctly reduces the dimension of our inputs and a decoder that correctly expands the dimension of the intermediate form.

The authors of IODA proposed to train the encoder and the decoder separately, using a trick.

You can train a layer of the encoder (call it `L_i`) separately by considering an auto-encoder (call it `AE`) whose encoder is `L_i` and whose decoder is a neural network whose weights are those of `L_i` but transposed (switch the row and column indices of the matrix).
Then it goes like this:

  1. Train `AE` like any auto-encoder (note that this training is unsupervised): the inputs are the ones you have for `L_i`, and the desired outputs are those same inputs.
  2. `L_i` has now been trained as the encoder of a whole auto-encoder.
  3. Throw away the decoder of `AE`; you don't need it anymore.
  4. Transform your whole training set with `L_i`: this gives you the inputs for the layer above `L_i`, call it `L_(i+1)`.
  5. Make `L_i` the `i`-th layer of your final encoder.
  6. Apply the same trick to train `L_(i+1)`.

This trick lets you quickly train the layers of your encoder and decoder, but it's not enough to get a model you can use right away. You still need to train the final auto-encoder; it will just be easier because its layers are already pre-trained.
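
Here is a small NumPy sketch of the trick for a single layer (all sizes are made up for illustration; a real training step would actually adjust the weights, which the full sketch at the end of the article does).

import numpy as np

rng = np.random.default_rng(0)

W_i = rng.normal(size=(8, 3))       # the encoder layer L_i: 8 features down to 3
W_dec = W_i.T.copy()                # throw-away decoder: the transpose of W_i

R = rng.normal(size=(100, 8))       # the inputs available for L_i

# Step 1: training this small auto-encoder (unsupervised) would push
# (R @ W_i) @ W_dec to reproduce R itself -- the target is the input
rebuilt = (R @ W_i) @ W_dec

# Steps 3-4: throw the decoder away and transform the whole training set
# with L_i to obtain the inputs of the next layer, L_(i+1)
del W_dec
R_next = R @ W_i
print(rebuilt.shape, R_next.shape)  # (100, 8) (100, 3)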

 

The Algorithm

No, algorithms are not magic invoked by developers when they don't want to explain their work:
they describe a sequence of operations that solves a specific kind of problem, and just as a picture is worth a thousand words, an algorithm is worth a thousand paragraphs.

So, below is the algorithm presented in IODA’s article:

Inputs

  • a training set X: a matrix where each row represents the characteristics of one example (in our case, a row represents a scanner image)
  • the expected results Y: for each row of X, the values we want our model to learn (in our case, a row represents a colorized scanner image)
  • N_input: the number of layers of the encoder
  • N_output: the number of layers of the decoder. So there are N = N_input + N_output layers in total.

 

Outputs

  • [W_1, W_2, ..., W_N]: one learned weight matrix for each layer

Functions

  • Predict([a list of weights], samples) -> predictions: this function transforms the samples into predictions using the list of weights
  • Train([a list of weights], training set, expected results) -> [a list of trained weights]: given an initial list of weights, this function trains them so that they produce the expected results when given the training set

Algorithm

 

Random initialization of [W_1, W_2, ..., W_N]

# Pre-train the encoder layers, from the input side inwards
R = X
for i = 1 .. N_input
    [W_i, W_dummy] = Train([W_i, W_i^t], R, R)   # unsupervised: the target is the input
    Drop W_dummy                                 # the throw-away decoder
    R = Predict([W_i], R)                        # inputs for the next layer

# Pre-train the decoder layers, from the output side inwards
R = Y
for i = N .. N - N_output + 1                    # i goes down
    [U, W_i] = Train([W_i^t, W_i], R, R)         # unsupervised: the target is the input
    R = Predict([U], R)                          # inputs for the next layer
    Drop U                                       # the throw-away encoder

# Final supervised training of the whole pre-trained stack
[W_1, W_2, ..., W_N] = Train([W_1, W_2, ..., W_N], X, Y)
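
For the curious, here is one way the algorithm above could look in Python with NumPy. It is only a sketch under simplifying assumptions: the layers are purely linear, the article's Train is replaced by a few steps of plain gradient descent, and all sizes and hyper-parameters are made up; the real IODA implementation trains proper neural-network layers.

import numpy as np

def train(weights, data, targets, lr=0.01, epochs=200):
    # Stand-in for Train: gradient descent on the squared error
    # of the purely linear stack data @ W_1 @ ... @ W_k -> targets
    n = len(data)
    for _ in range(epochs):
        acts = [data]
        for W in weights:
            acts.append(acts[-1] @ W)        # forward pass
        grad = (acts[-1] - targets) / n      # error at the output
        for i in range(len(weights) - 1, -1, -1):
            grad_W = acts[i].T @ grad        # where the error comes from
            grad = grad @ weights[i].T       # propagate the error backwards
            weights[i] -= lr * grad_W        # change the weights
    return weights

def predict(weights, samples):
    # Stand-in for Predict: chain the matrix multiplications
    out = samples
    for W in weights:
        out = out @ W
    return out

# Made-up problem: N_input = 3 encoder layers, N_output = 2 decoder layers
dims = [16, 8, 4, 2, 4, 16]                  # layer i maps dims[i-1] -> dims[i]
N_input, N_output = 3, 2
N = N_input + N_output

rng = np.random.default_rng(0)
X = rng.normal(size=(200, dims[0]))          # toy training set
Y = rng.normal(size=(200, dims[-1]))         # toy expected results

# Random initialization of [W_1, ..., W_N]
W = [rng.normal(size=(dims[i], dims[i + 1])) * 0.1 for i in range(N)]

# Pre-train the encoder layers, from the input side inwards
R = X
for i in range(N_input):
    W[i], _dummy = train([W[i], W[i].T.copy()], R, R)  # unsupervised; dummy decoder dropped
    R = predict([W[i]], R)

# Pre-train the decoder layers, from the output side inwards
R = Y
for i in range(N - 1, N_input - 1, -1):
    U, W[i] = train([W[i].T.copy(), W[i]], R, R)       # dummy encoder U dropped
    R = predict([U], R)

# Final supervised training of the whole pre-trained stack
W = train(W, X, Y)
print([w.shape for w in W])

The two loops mirror the two halves of the pseudocode above, and the last call is the final supervised pass mentioned earlier.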

 

To conclude, I can only recommend that you check out IODA: it's a remarkable piece of work that shows how two very different fields (medicine and computer science) can intertwine to advance science.