Deep Double Descent Summary
Deep Double Descent: Where Bigger Models and More Data Hurt (2019)
Double Descent phenomenon
- Bigger models are better.
- Performance first improves, then gets worse, and then improves again with increasing model size, data size, or training time.
- This phenomenon is called “double descent” (a minimal demo is sketched below).
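To make this concrete, here is a minimal sketch (my own toy example, not code from the paper): minimum-norm least squares on random ReLU features of synthetic data. Test error typically peaks when the number of features p is close to the number of training samples n, then falls again as p grows; the exact shape depends on the noise level and random seed.

```python
# Toy model-wise double descent (illustrative only, not from the paper).
# Minimum-norm least squares on random ReLU features: test MSE tends to
# peak when the feature count p is near the sample count n_train.
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test = 20, 100, 2000
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)     # noisy linear target
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for p in [10, 50, 90, 100, 110, 200, 500, 2000]:  # model "size" = # features
    W = rng.normal(size=(d, p)) / np.sqrt(d)      # fixed random projection
    F_tr = np.maximum(X_tr @ W, 0.0)              # ReLU random features
    F_te = np.maximum(X_te @ W, 0.0)
    beta = np.linalg.pinv(F_tr) @ y_tr            # minimum-norm least squares
    test_mse = np.mean((F_te @ beta - y_te) ** 2)
    print(f"p={p:5d}  test_mse={test_mse:10.4f}")
```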
EMC(Effective Model Complexity)
- EMC: the maximum number of samples n on which training procedure T achieves ≈ 0 training error on average (formal definition below)
- Interpolation threshold: the point where EMC(T) = n
- Critical interval: an interval around the interpolation threshold
- Below and above the critical interval: increasing complexity improves performance
- Within the critical interval: increasing complexity can hurt performance
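For reference, the paper's formal definition (arXiv:1912.02292): the EMC of training procedure T, with respect to data distribution D and tolerance ε > 0, is

```latex
% EMC as defined in the paper; Error_S is the mean training error on sample set S
\mathrm{EMC}_{\mathcal{D},\varepsilon}(\mathcal{T}) \;:=\;
  \max\left\{\, n \;\middle|\;
    \mathbb{E}_{S \sim \mathcal{D}^{n}}\!\left[\mathrm{Error}_{S}\big(\mathcal{T}(S)\big)\right] \le \varepsilon
  \,\right\}
```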
Model-wise double descent
- The double-descent peak intensifies as label noise increases (a simple noise-injection sketch follows).
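Label noise of this kind can be injected as below. This is a minimal sketch of symmetric label noise (each label replaced by a uniformly random class with some probability); the function and parameter names are my own, not the paper's code.

```python
# Minimal symmetric label-noise injection (illustrative sketch).
# With probability `noise_prob`, a label is replaced by a uniformly
# random class (which may happen to equal the original label).
import numpy as np

def add_label_noise(labels, num_classes, noise_prob, seed=0):
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(noisy)) < noise_prob        # which labels to corrupt
    noisy[flip] = rng.integers(0, num_classes, flip.sum())
    return noisy

y = np.arange(10) % 5
print(add_label_noise(y, num_classes=5, noise_prob=0.2))
```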
Epoch-wise double descent
- (Left) Both the large and the intermediate models exhibit double descent.
- The large model reaches its double-descent peak at an earlier epoch than the intermediate one.
- (Right) The test error of the large model decreases, then increases, and then decreases again as the number of epochs grows.
- The intermediate model's test error does not decrease again, so early stopping works better for it (a per-epoch tracking sketch follows this list).
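A minimal way to observe epoch-wise behavior is to record the test error after every epoch and remember the best checkpoint. The sketch below does this with a small scikit-learn MLP on synthetic data (my own setup, not the paper's); on a toy dataset the double-descent shape is not guaranteed, but the same per-epoch bookkeeping applies to larger experiments.

```python
# Track test error per epoch with early-stopping bookkeeping (illustrative).
# warm_start=True with max_iter=1 makes each fit() call train one more epoch.
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore", category=ConvergenceWarning)

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(64,), warm_start=True, max_iter=1,
                    learning_rate_init=1e-2, random_state=0)

best_err, best_epoch = 1.0, -1
for epoch in range(1, 101):
    clf.fit(X_tr, y_tr)                       # one more epoch of training
    test_err = 1.0 - clf.score(X_te, y_te)    # test error after this epoch
    if test_err < best_err:                   # keep the best (early-stopping) point
        best_err, best_epoch = test_err, epoch
print(f"best test error {best_err:.3f} at epoch {best_epoch}")
```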
Sample-wise non-monotonicity
- (Left) The double-descent peak abates as more samples are used.
- (Right) There is a regime where more samples hurt performance; beyond roughly 10K samples, the smaller model is the better choice.
- Increasing the number of samples shifts the test-error curve downward toward lower test error.
- More samples also require larger models to fit them, shifting the interpolation peak to the right.
- For intermediate-size models, more samples can therefore hurt performance (see the sketch after this list).
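Sample-wise non-monotonicity can be illustrated with the same toy random-features setup as above: fix the model size (number of features p) and sweep the number of training samples n. Test error tends to get worse as n approaches p before improving again; again, this is my own illustration, not the paper's experiments.

```python
# Toy sample-wise non-monotonicity (illustrative only).
# Fixed model size p; minimum-norm least squares test error tends to
# peak when the sample count n is close to p, then falls again.
import numpy as np

rng = np.random.default_rng(0)
d, p, n_test = 20, 200, 2000
w_true = rng.normal(size=d)
W = rng.normal(size=(d, p)) / np.sqrt(d)        # fixed random projection

def make_split(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)   # noisy linear target
    return np.maximum(X @ W, 0.0), y            # ReLU random features

F_te, y_te = make_split(n_test)
for n in [50, 100, 150, 190, 200, 210, 300, 800]:
    F_tr, y_tr = make_split(n)
    beta = np.linalg.pinv(F_tr) @ y_tr          # minimum-norm fit
    test_mse = np.mean((F_te @ beta - y_te) ** 2)
    print(f"n={n:4d}  test_mse={test_mse:10.4f}")
```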
Conclusion
- In general, the peak of test error appears when models are just barely able to fit the train set.
- Models at the interpolation threshold are the worst, and label noise can easily destroy their global structure.
- However, in the over-parameterized regime, there are many models that fit the train set.
- However, the authors do not have a full explanation for why this happens.
References
- https://arxiv.org/pdf/1912.02292.pdf
- https://openai.com/blog/deep-double-descent/
- https://bluediary8.tistory.com/59