Optimizing Train/Validation/Test Splits in Small Data (e.g., Medical Data)
Introduction
How you create the data subsets for your deep learning project is essential and has major implications for final model performance. To review a few fundamentals, Andrew Ng and Deeplizard have great introductory overviews:
As described in the Deeplizard video above, generating your data splits typically involves randomly shuffling your data and then partitioning it into subsets for train, validation, and test. Many libraries even have this built in (e.g., the Keras fit function). When your sample size is in the thousands (or, if you are lucky, millions!), you are unlikely to randomly generate a skewed distribution, but what if you have a dataset with 40 or 50 patients?
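To make the conventional approach concrete, here is a minimal sketch of a purely random split using scikit-learn's train_test_split. The cohort size, features, seed, and 70/15/15 ratios are all illustrative assumptions, not from any real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical small cohort: 50 patients, 10 features, binary outcome.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(50, 10))
y = rng.integers(0, 2, size=50)

# Conventional approach: shuffle, then slice into roughly 70/15/15.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, shuffle=True, random_state=0)

# With only 7-8 patients in each small subset, chance alone can skew
# how a key feature (say, tumor volume) lands across the splits.
```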
Small Data Problems
Medical datasets are often small in scale, so special consideration is required when developing your splits. With only a few samples, a random split may produce a validation set that differs substantially from your test set and, even worse, a test set that differs from the data you can expect to find in the wild. Furthermore, as Aurélien Géron suggests, it is best practice to evaluate performance on your holdout test subset only after you have trained (and "locked") your optimal model, to avoid bias; his book has a great overview of some other helpful considerations.
Ultimately, this means you might not realize you have a problem with your data distribution across your splits until late in your experimental workflow.
So what is the solution to this Small Data problem? …
Using Multi-label Stratification
As described above, the conventional practice with large datasets is to split your data into subsets after random shuffling; however, this can lead to issues with small samples. To overcome this, we can force our splits to evenly distribute features of key interest across train/validation/test. To see how, review the notebook I wrote below. The notebook works through a simple end-to-end example and can easily be adapted to more complex datasets.
By the end of this notebook you will see how to avoid distribution imbalance when creating your subsets using the scikit-multilearn library.
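As a taste of what the notebook covers, here is a minimal sketch of a two-stage split using scikit-multilearn's iterative_train_test_split. The feature names, thresholds, and 70/15/15 ratios below are hypothetical; the key idea is binarizing the features you want balanced into a multi-label matrix:

```python
import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# Hypothetical cohort: 50 patients, identified by index for simplicity.
rng = np.random.default_rng(seed=0)
n = 50
X = np.arange(n).reshape(-1, 1)

# Binarize the features of key interest into a multi-label matrix,
# e.g. tumor volume above the cohort median and age over 65
# (both features and thresholds are illustrative):
tumor_volume = rng.uniform(0, 100, n)
age = rng.uniform(30, 90, n)
labels = np.column_stack([
    (tumor_volume > np.median(tumor_volume)).astype(int),
    (age > 65).astype(int),
])

# Stage 1: peel off ~15% as the test set; iterative stratification
# keeps each label's prevalence similar in both partitions.
X_rest, y_rest, X_test, y_test = iterative_train_test_split(
    X, labels, test_size=0.15)

# Stage 2: split the remainder into train (~70%) and validation (~15%).
X_train, y_train, X_val, y_val = iterative_train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85)
```

Note that X here only carries patient indices; in practice you would use the returned indices to look up the actual images or records for each subset.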
Conclusion
For small datasets, it is well worth taking the effort to optimize your train/validation/test subsets so that key features of interest are evenly distributed. In the notebook I created, I use tumor volume as a theoretical example, but this could easily extend to other features, such as age or gender. It is probably optimal to focus on a few key features so you still largely respect the randomization process.
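If you want to confirm the stratification behaved as expected, a quick sanity check (continuing the hypothetical names from the sketch above) is to compare per-subset label prevalence:

```python
# Each line prints the fraction of positive labels per binarized feature;
# after iterative stratification these should be close across subsets.
for name, y_sub in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(f"{name:5s}", y_sub.mean(axis=0))
```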
Author’s Note
This post is the first of several planned articles on deep learning and medical research. If you have suggestions related to the content, please share them in the comments below and I will do my best to incorporate them into a revision or a follow-up post. Lastly, if you benefited from this content, please support the article by sharing it, clapping for it (a few times!), or leaving feedback in the comments below.