Neural networks offered a major breakthrough in building nonlinear models. But that alone was not enough. No amount of training or data helps if the model itself is not rich enough. After all, the amount of information a model can capture is limited by the number of weights it contains. A simple network with a hundred weights cannot define a model for a complicated task like face recognition. We need a lot more.
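To make the "number of weights" idea concrete, here is a minimal sketch that counts the trainable parameters of a fully connected network. The layer sizes are illustrative choices, not taken from the text.

```python
# Counting the trainable parameters of a fully connected network.
# Each layer contributes (inputs + 1 bias) * outputs parameters.
def count_parameters(layer_sizes):
    """Weights plus biases for each pair of adjacent layers."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# A tiny network: 4 inputs, one hidden layer of 5 nodes, 3 outputs.
print(count_parameters([4, 5, 3]))  # (4+1)*5 + (5+1)*3 = 43
```

Even this toy network already has 43 parameters; realistic tasks need models with millions.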
It may seem simple: just increasing the number of perceptrons in a network increases the count of weights. What is the big deal about it? But it is not that simple. As the network grows larger, many other problems start creeping in. In general, the capability of the network does not grow linearly with the number of perceptrons. In fact, it can decrease beyond a point, unless we take care of some important aspects.
The capacity of a network is much better when the network is deep rather than wide, that is, when it has many layers rather than many perceptrons in the same layer. Such deep neural networks have enabled remarkable innovations in the past few years. Deep learning is the branch of machine learning that looks into these aspects of neural networks: deep learning stands for learning with deep neural networks.
Some problems are common to all machine learning algorithms; overfitting and underfitting do not spare neural networks either. But deep learning brings some additional problems. As the depth of the network increases, the cost of each training cycle increases drastically. The concepts of forward and backward propagation remain the same, but in a deep network there are far too many parameters to adjust in a single cycle, and things can go wrong while doing this.
As we attempt gradient descent on a deep network, it is difficult to estimate how each individual parameter affects the overall cost. It is possible that one part of the network adds value while another part works in the opposite direction, only at a slower rate. That still appears as an overall descent in the cost function, but it is not the best possible solution. One can think of this as a local minimum caused by a few of the many hidden layers doing what they should not do (such layers are sometimes called swaying layers). In such a case it is very difficult to identify and fix the problem.
Researchers have come up with some interesting solutions to this:
In a deep network, it is logical to expect that each new layer adds more value to the entire network. One can think of the first layer providing a very coarse solution, the second layer refining it, the third layer refining it further, and so on. With this in mind, we can train the network in steps. First train a single layer, to the best it can do. Once we have the weights defined for this layer, we need not change them further.
With this in place, we can add the next layer and train the model only for the second layer (without altering the layer we have already trained). Here, we can be sure that the second layer adds value to what the first layer did. If we go on doing this over multiple layers, we refine the model at every step, with the assurance that each layer adds value. Theoretically this is an ideal solution. The only problem is the amount of computation required for training the network again and again, layer by layer. We can compromise a little by adding a few layers at a time.
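The key mechanism in this layer-wise scheme is freezing: earlier layers keep their trained weights while only the newly added layer is updated. The sketch below illustrates just that mechanism with a toy `Layer` class and a dummy gradient step; the names and the hard-coded gradients are illustrative, not a real training loop or library API.

```python
import random

# A minimal sketch of layer-wise training: earlier layers are frozen,
# and only the newly added layer's weights get the gradient update.
class Layer:
    def __init__(self, n_in, n_out):
        self.w = [[random.uniform(-1, 1) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.frozen = False

def train_step(layers, grads, lr=0.1):
    """Apply one gradient step, skipping frozen layers."""
    for layer, g in zip(layers, grads):
        if layer.frozen:
            continue
        for i in range(len(layer.w)):
            for j in range(len(layer.w[i])):
                layer.w[i][j] -= lr * g[i][j]

random.seed(0)
first = Layer(2, 3)
first.frozen = True      # already trained; keep its weights fixed
second = Layer(3, 1)     # the new layer we are training now

before = [row[:] for row in first.w]
# Dummy gradients standing in for real backpropagation results.
grads = [[[1.0] * 2 for _ in range(3)], [[1.0] * 3]]
train_step([first, second], grads)
assert first.w == before  # the frozen layer is untouched
```

Real frameworks expose the same idea through a trainable/requires-grad flag per layer; the point is only that frozen weights never change.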
Another way to work around this problem is to randomly drop (force to zero) some nodes in the network as we train it. This ensures that the responsibility is divided among all the nodes in the network, so each node is sure to contribute to the overall model.
The concept here is similar to the previous approach: we train a few nodes at a time to make sure the overall model works well. But the big advantage here is the reduction in computational cost. This is also a very good remedy for overfitting, because at every step we train a network that is not so rich, ensuring that the final model is rich as well as effective.
This approach works well for medium-sized networks, but not so well beyond a point.
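A minimal sketch of this dropout idea: during training, each activation is zeroed with probability p, and the survivors are scaled up by 1/(1-p) (the common "inverted dropout" convention, an assumption here, so the expected activation stays the same); at test time, nothing is dropped. The function name is illustrative.

```python
import random

# Dropout sketch: zero each activation with probability p during
# training, and scale the survivors by 1/(1-p) so that the expected
# value of each activation is unchanged. At inference, pass through.
def dropout(activations, p, training=True):
    if not training or p == 0:
        return activations[:]
    keep = 1.0 - p
    return [a / keep if random.random() >= p else 0.0
            for a in activations]

random.seed(42)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)
# Each output is either zero (dropped) or double its input (kept).
```

Because a different random subset of nodes survives each cycle, no single node can dominate, which is exactly the "divided responsibility" described above.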
This is an interesting concept. When we think of layers in a network, we usually think of connections flowing from one layer to the next. But that is not the only possibility. We can have a network in which a connection bypasses a few layers. That is, the output of the first layer may be fed in as an input to the second layer and also as part of the input to the fourth layer (skipping the second and third layers).
This ensures that the input to the fourth layer is at least as good as the output of the first layer. If the third layer offers something worse than that, the weights at the fourth layer can learn to shut off the inputs from the third layer. With this in place, we can be sure that even if the second and third layers do not improve the model, they can never harm it. This was a conceptual revolution in deep learning. ResNet, proposed in 2015, was an elaborate network that grew out of this concept.
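The essence of such a skip connection can be written in one line: the block's output is its input plus whatever transformation the skipped layers compute, so a block that learns nothing simply passes its input through unchanged. The sketch below assumes this additive form (as in residual networks); the function names are illustrative.

```python
# Skip-connection sketch: y = x + F(x). The skip path carries the
# input x past the transform F, so a useless F (outputting zeros)
# leaves the signal intact instead of degrading it.
def residual_block(x, transform):
    return [a + b for a, b in zip(x, transform(x))]

# A block whose transform contributes nothing acts as the identity:
passed_through = residual_block([1.0, 2.0], lambda x: [0.0, 0.0])
# passed_through == [1.0, 2.0]
```

This is why adding such blocks cannot make the model worse: in the worst case, the transform is suppressed and the block reduces to an identity mapping.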