Machine learning is all about data, and if the data is not good, the outcome can never be good. Along with the training set, the dev/test sets also play a significant role; often they have a much larger impact on the outcome.
Some important points that we must note in this regard:
- Choose dev and test sets from a distribution that reflects the data you expect to get in the future and want to do well on. This may not be the same as your training data's distribution. As far as possible, draw the dev set and the test set from the same distribution as each other.
- There are several traditional heuristics for the train/dev/test split: 60:20:20, 70:10:20, and some even say 99:0.5:0.5! The right choice depends on the actual size of the available data. Your dev set should be large enough to detect meaningful changes in the accuracy of your algorithm, but not necessarily much larger. Your test set should be big enough to give you a confident estimate of the final performance of your system. And never forget that there is no point in evaluating a model without training it. The training set should form the significant part of the story; we should extract dev/test sets that are just large enough to serve their purpose.
- The dev set is meant to help us avoid overfitting. But after several iterations of tuning against it, we can end up overfitting the dev set as well. One of the prominent symptoms of overfitting is that the error varies significantly across different data sets drawn from the same distribution. Thus, if we notice a significant disparity between the dev set error and the test set error, it is quite likely that we have overfit the dev set. We should frequently refresh the dev set in order to avoid this problem.
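
To make the sizing point concrete, here is a minimal sketch in Python. It assumes all examples come from a single pool (so the dev and test sets share a distribution), and it picks dev/test sizes as absolute counts rather than fixed ratios. The helper names `split_dataset` and `accuracy_standard_error` are made up for this illustration; the second one uses the usual binomial standard error as a rough guide to how small an accuracy change a set of a given size can resolve.

```python
import math
import random


def split_dataset(examples, dev_size, test_size, seed=0):
    """Shuffle once, carve out fixed-size dev and test sets, keep the rest for training."""
    if dev_size + test_size >= len(examples):
        raise ValueError("dev and test sets would leave no training data")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    dev = shuffled[:dev_size]
    test = shuffled[dev_size:dev_size + test_size]
    train = shuffled[dev_size + test_size:]
    return train, dev, test


def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n independent examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# With a million examples, a 98:1:1 split still leaves 10,000 examples each for dev and test.
train, dev, test = split_dataset(range(1_000_000), dev_size=10_000, test_size=10_000)
print(len(train), len(dev), len(test))  # 980000 10000 10000

# At 90% accuracy, 10,000 examples give a standard error of about 0.3 percentage points.
print(round(accuracy_standard_error(0.90, 10_000), 4))  # 0.003
```

Read this way, a dev set is "large enough" when its standard error is comfortably smaller than the accuracy differences you want to tell apart. If your dev/test data must come from a different source than the training data, you would split that source separately instead of shuffling everything together.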