The promise of RProp is fast convergence with no hyperparameter tuning. The penalty is that it is designed to operate on the whole training set as one unit, not on mini-batches. Does it live up to its promise, and can we do anything about the penalty?
The Promise
With SGD we have to tune at least one hyperparameter, the learning rate, and often others like momentum. This can mean a lot of wasted time and compute on trial and error before we ever get around to properly training a network on the data set. RProp is supposed to eliminate this problem, because it effectively auto-tunes a separate step size for each weight.

Using Torch, I tried the RProp implementation from the optim package and could not get convergence at all. I thought there must be a bug; since it seems no one uses RProp, one could easily have gone unnoticed. I examined the source but didn't see a problem, so I wrote my own implementation, and still got no convergence. RProp does in fact have some hyperparameters: etaPlus, etaMinus, maxStepSize, and minStepSize, all of which have good default values worked out by the original RProp authors.[1] There's also the initial step size, and that one you do have to tune. In Torch the default is 0.1, which will probably be too large for a deep network; if this value is too large, your network will not converge. We could make it arbitrarily small, but the smaller it is, the more passes through the data it takes for RProp to really speed up and start converging. So much for not tuning hyperparameters.
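To make the roles of those hyperparameters concrete, here is a rough sketch of the per-weight RProp update, following the iRprop- variant that simply skips the update when a gradient changes sign. The function and field names (rpropStep, initStep, and so on) are mine, not the optim package's internals; the defaults shown are the commonly published ones.

local function rpropStep(x, dfdx, state, config)
   -- x: flattened parameter tensor, dfdx: its gradient
   local etaPlus  = config.etaPlus     or 1.2
   local etaMinus = config.etaMinus    or 0.5
   local maxStep  = config.maxStepSize or 50
   local minStep  = config.minStepSize or 1e-6
   local initStep = config.initStep    or 0.1   -- the one value we still have to tune

   -- per-weight step sizes and the previous gradient, kept between calls
   state.delta    = state.delta    or x.new(x:size()):fill(initStep)
   state.prevGrad = state.prevGrad or x.new(x:size()):zero()

   -- sign agreement between the current and previous gradient, per weight
   local agree = torch.cmul(dfdx, state.prevGrad)

   -- grow the step where the sign held, shrink it where it flipped, then clamp
   local mult = x.new(x:size()):fill(1)
   mult:maskedFill(agree:gt(0), etaPlus)
   mult:maskedFill(agree:lt(0), etaMinus)
   state.delta:cmul(mult):clamp(minStep, maxStep)

   -- iRprop-: skip this step for weights whose gradient just changed sign
   dfdx:maskedFill(agree:lt(0), 0)

   -- each weight moves by its own step size, against the sign of its gradient
   x:add(-1, torch.cmul(torch.sign(dfdx), state.delta))
   state.prevGrad:copy(dfdx)
end

Note that only the sign of the gradient is used, never its magnitude, which is exactly why the per-weight step sizes have to start from a sensible initial value.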
The Penalty
Your intuition might tell you that RProp should still work with smaller batches, as long as each batch is large enough to be fairly representative of the training set. This is in fact correct. I ran some tests with the CIFAR-10 data set and a simple CNN and found that I still got convergence with batch sizes as small as 64. That's just 6 or 7 examples from each of the 10 classes per batch. Of course, with a larger data set like ILSVRC2012, where there are 1000 classes, 6 or 7 examples per class still produces quite a large batch, but it's much smaller than the 1.2 million examples in the training set, so we've made progress.

Unfortunately, quality suffers with smaller batch sizes. The network converges to a point rather far from what we know is possible using SGD as a baseline. As we increase the batch size the point of convergence improves, until we reach a batch size equal to the entire training set. But maybe we can work around this: start with a small batch size and increase it whenever convergence starts to slow. It turns out that this does work, so maybe we can still get some use out of RProp.
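A rough sketch of that schedule follows; trainEpoch, maxEpochs, and the patience and growth settings are placeholders rather than the exact code I ran. The idea is just to grow the batch whenever the loss stops moving.

local batchSize  = 64
local maxBatch   = trainSize            -- full-batch RProp is the limit
local growFactor = 2
local patience   = 3                    -- epochs without improvement before growing
local bestLoss, stalled = math.huge, 0

for epoch = 1, maxEpochs do
   local loss = trainEpoch(model, batchSize)   -- assumed helper: one pass, returns mean loss
   if loss < bestLoss - 1e-4 then
      bestLoss, stalled = loss, 0
   elseif batchSize < maxBatch then
      stalled = stalled + 1
      if stalled >= patience then
         batchSize = math.min(batchSize * growFactor, maxBatch)
         stalled = 0
         print(string.format('epoch %d: convergence slowed, batch size -> %d', epoch, batchSize))
      end
   end
end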
There's yet one more stumbling block when it comes to using RProp. The larger the batch size we use, the better the point to which our network converges; however, it never gets as good as SGD, at least not for as long as I wanted to wait. RProp is very fast out of the gate, improving rapidly until it gets close to the convergence point, and then it suddenly slows way down. In my tests, it continued to show improvement with minimal overfitting at 100 epochs. The same network reached peak convergence with SGD at around 50 epochs and any further training just resulted in overfitting. At 100 epochs, the network trained with RProp was getting close to the one trained with SGD, and might eventually catch it, but the whole point of trying RProp was to converge faster.
Our last fleeting hope is to use RProp at the beginning to converge part of the way rapidly, then switch to SGD. But if we're going to do that, we need to tune both RProp's and SGD's hyperparameters, and in reality, once we have a good set of hyperparameters for SGD, RProp just doesn't have much of an edge. In my tests it bought an advantage of zero to two epochs early in training. That is certainly not enough to justify the extra work needed to combine the two learning strategies. The wisdom of the crowd is correct: SGD beats out RProp.
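For completeness, this is roughly what the hybrid schedule looks like with the optim package. Treat it as a sketch: makeFeval, batches, and the numbers are placeholders, and the config field names follow optim's documentation but should be checked against your version.

require 'optim'

local params, gradParams = model:getParameters()
local rpropEpochs = 5                                          -- assumed switch point
local rpropConfig = { stepsize = 1e-3 }                        -- small initial step size
local sgdConfig   = { learningRate = 0.01, momentum = 0.9 }    -- the already-tuned SGD values

for epoch = 1, maxEpochs do
   for _, batch in ipairs(batches) do
      local feval = makeFeval(model, criterion, batch, params, gradParams)  -- assumed closure
      if epoch <= rpropEpochs then
         optim.rprop(feval, params, rpropConfig)
      else
         optim.sgd(feval, params, sgdConfig)
      end
   end
end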
1. Riedmiller, M. and Braun, H. A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm.
2. Igel, C. and Hüsken, M. Improving the Rprop Learning Algorithm.