This is an introduction to background, implementation and use of Error Curvature Optimization (ECO here for short), a method using Laplacian value for sharpening error curvature with regard to input vector. I didn't finish everything around ECO, but there are two positive things. One, it at least partially works, so it's not completely useless, and that already means it is worth being shared. Two, it provides some fairly interesting results, and is also quite novel (I think; though it could've been that I just didn't find anything related).
What it is
Background on gradient descent
Note how in the usual gradient descent technique we use the derivative/gradient of the accumulated error \(E\) over each parameter we have.
\[ D = \nabla_{\theta} E\left(Y_0, Y \right) \] \[ \theta \mathrel{-}= \gamma_{\mathrm{learn}} D \]
This method provides a good fitting strategy to reduce error, by moving parameters along gradient direction. What it doesn't do, is use specific cases, or understand when the solution is out of the problem domain.
Cipher-lock principle
What we want, is for the neural network to memorize a specific input, and output a defined (trained) value only when this input is recognized. Otherwise, we would like it to stay as far from that output, as possible.
Because of this, we are attempting to maximize the steepness of accumulated error with regard to inputs at the point of the "cipher".
Error Curvature as Laplacian
Thus, we define the following function:
\[ C(Y_0,Y) = \nabla^2_{X} E\left( Y_0, Y \right) \]
Maximizing value of this Laplacian will give us the most articulated error value. We will use gradient to do this with regard to parameters.
\[ D = \nabla_{\theta} C(Y_0,Y) \] \[ \theta \mathrel{-}= \gamma_{\mathrm{learn}} D \]
Discussion
This function is quite different in its behavior, when used for backpropagation learning.
When using common, first-order derivative based gradient descent, we get the most fitting approximation as the result of the neural network. Such approximation introduces the solutions that were not part of the original training set (which is quite desired), but it also has no relation to the size of the domains used. Technically, it is a good mathematical model, but not a best classification learning model: the results there may be a good fit, but the input may be very far from the original domain.
Consider having a handwritten digit classification network. You don't define rubbish
output value along with 0-9. You either accept that it will classify noise as a random value, or train it in such a way, that outputs also imply some overall certainty. In the former you employ the artifact of randomness in initial weights, that has lead to randomness in learned feature points, and overlap in those feature points that leads to low certainty value. You have never defined where proper input ends, and "noise" begins, neither have you defined their relative sizes.
Well, using ECO is different. Because you are maximizing Laplacian, you are implicitly defining input domain. The closer is the fit of the target function at each step, the larger is the error (relative to the outputs from the training set) for inputs outside of the ones you used for training. So that you are not only training the network to output you the values you showed, but you are also telling it to output something different, the more your input is different from the training.
Some plots would be useful for explanation, but I didn't make them anywhere
but my head. If you want them, I can put some effort to making a few
simplified schematic plots to demonstrate the concept.
Implementation
So I implemented this all using Python and Theano, and here's the code: HTML/Jupyter Notebook. I wasn't very thorough on explanations, and the last run that is shown here isn't quite informative. So, again, if you want — help, or at least show some interest in it, I'll do my best to clean it up and update.
[Update (21.31.2018)]: Additional changes were made to the implementation: the de-learning algorithm was adjusted to provide smoother error values and generate better inputs; introduced discrimination-type de-learning, which uses real inputs and random wrong outputs.
Those two changes allowed to finally achieve some stability in the learning process and better increase curvature. Unfortunately though, as of now, the learning process still has a maximum number of samples after which learning quickly deteriorates. Interestingly, it is also the point of highest curvature, which wasn't the case before the update (previously, learning deteriorated at a random point).
The update to the de-learning samples also allowed to reach learning symmetry with the de-learning. Some additional theoretical investigation showed that the network can never become stable with just one layer (zero hidden layers).
The idea for implementation is fairly simple. The only thing you need to override, is the error function — which you define as \(C\) discussed above. There are three approaches that may work:
- Using usual learning methods,
- Reiterating learning until the desired output is achieved,
- Reiterating learning of each sample for a set amount of repeats.
Second and third are implemented there, both work to some degree. What I didn't implement is the reduction of learning step. In effect, there was one problem I encountered: after some amount of samples, the learning has not just stopped, but instead completely destabilized and locked on one signle output value. I have yet to diagnose the specific cause of such behavior.
Noise de-learning
There is also an additional technique, related to ECO, that helps produce better results overall. We define an artificial de-training set, that consists of input noise and some output values that we want for noise. We then feed it into the network, but define the training step as:
\[ \theta \mathrel{+}= \gamma_{\mathrm{de-learn}} D \]
Since we are working with function curvature, the results are sensible, unlike in common backpropagation.
Usage
Well, no much information here yet. I would expect that it behaves better when supplied with an infinite training set, than the usual neural network, and would allow continuous learning, but there is a problem of it locking onto one output value, which I have no idea why appears.
I hope that if you have some experience with neural networks/deep learning and have any ideas on how to perfect this technique or continue its research, this discussion will be useful to you. Also, I welcome any ideas, comments, help, etc.
I'm sorry that this post is not much, but at least it describes the basic idea.
Updates:
- Fixed formulas a bit
- Added a little better intro
- Fixed a typo in Laplacian formula (wrote wrt Y instead of X)
- Adjusted de-learning algorithm (seem update information above)