Building Missing Maps with Machine Learning
By Humanity & Inclusion
9 months ago
I have a doubt about the current competition rules that I would like to share with you, and I hope the organizers can provide me with answers.
Right now participants know nothing at all about the stage 2 test set and aren’t allowed to use any publicly available datasets either. I have two major concerns about this.
We are supposed to generalize our models, but it is impossible to generalize a model based solely on images from one or two very easy locations (clear, high-resolution images without many shadows, etc.), even with strong augmentations. It would be possible if we were allowed to use other datasets. And if the organizers want solutions that can be used in real-life use cases, then why prohibit participants from building truly universal models?
The final leaderboard could be quite random, depending on luck and on the test set. It is possible that great models will perform poorly on a test set with a very different distribution, while bad models that are underfitted in general or overfitted on some features perform better simply because they got lucky on this particular dataset. I understand that the setup is meant to incentivize generalization, but without the option to use external datasets, or even to see a sample of the test set, I am afraid that the final leaderboard won’t really reward the best models. For example, someone could make their models more general by training them on images with artificial shadows, and then the test set might not contain any shadows at all. It shouldn’t be a gamble.
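To make the artificial-shadow example concrete, here is a minimal sketch of such an augmentation. It is only an illustration and is not taken from our pipeline: the function name `add_artificial_shadow`, the rectangular shadow shape and the `darkness` factor are all assumptions.

```python
import numpy as np

def add_artificial_shadow(image, rng, darkness=0.5):
    """Darken a random axis-aligned rectangle of an H x W x C uint8 image.

    Hypothetical helper: real shadows are not rectangular, this only
    illustrates the idea of shadow augmentation.
    """
    h, w = image.shape[:2]
    # Draw two random coordinates per axis; sorting gives y0 <= y1, x0 <= x1.
    y0, y1 = np.sort(rng.integers(0, h + 1, size=2))
    x0, x1 = np.sort(rng.integers(0, w + 1, size=2))
    out = image.astype(np.float32)
    out[y0:y1, x0:x1] *= darkness  # scale brightness inside the rectangle
    return out.clip(0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
tile = np.full((300, 300, 3), 200, dtype=np.uint8)  # dummy uniform tile
shadowed = add_artificial_shadow(tile, rng)
```

A model trained with this kind of augmentation pays a cost on shadow-free data, which is exactly why it would help to know what the test set looks like.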
Our best models (after adding a geometric mean in TTA and erosion combined with dilation in postprocessing) get us to 0.943 AP and 0.954 AR, but this is what the predictions look like on some other random pictures:
I think the current rules make things harder for participants while at the same time leaving the organizers with worse models than they could otherwise receive.
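For readers unfamiliar with the two tricks mentioned above (a geometric mean in TTA, and erosion combined with dilation), a rough sketch could look like the following. This is not the code from our repository: the names `predict_with_tta` and `clean_mask`, the choice of four 90-degree rotations, the 0.5 threshold, and the order "erosion first, then dilation" (a morphological opening) are all assumptions made for illustration.

```python
import numpy as np
from scipy import ndimage

def predict_with_tta(predict_fn, image, eps=1e-6):
    """Geometric mean of predictions over the four 90-degree rotations.

    predict_fn is a hypothetical callable mapping an H x W x C image
    to an H x W array of building probabilities in [0, 1].
    """
    log_sum = np.zeros(image.shape[:2], dtype=np.float64)
    for k in range(4):
        prob = predict_fn(np.rot90(image, k))  # predict on the rotated image
        prob = np.rot90(prob, -k)              # rotate the prediction back
        log_sum += np.log(np.clip(prob, eps, 1.0))
    return np.exp(log_sum / 4)                 # exp(mean(log p)) = geometric mean

def clean_mask(prob, threshold=0.5, iterations=1):
    """Threshold, then erode and dilate to drop small spurious blobs."""
    mask = prob > threshold
    mask = ndimage.binary_erosion(mask, iterations=iterations)
    mask = ndimage.binary_dilation(mask, iterations=iterations)
    return mask

# Dummy "model": treats the red channel as the building probability.
dummy_predict = lambda img: img[..., 0] / 255.0

image = np.zeros((8, 8, 3), dtype=np.uint8)
image[2:6, 2:6] = 255   # one 4x4 "building"
image[0, 7] = 255       # one isolated noisy pixel
prob = predict_with_tta(dummy_predict, image)
mask = clean_mask(prob)
```

In this toy example the isolated pixel is removed while the core of the 4x4 block survives. The geometric mean is stricter than the arithmetic mean: a single rotation with a low probability pulls the combined score down sharply.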
My ideas about what could change this (they are quite orthogonal, so all three can be used, or just one or two):
Allow the use of publicly available external datasets. That way we could truly generalize our models.
In stage two, allow participants to submit two models, with the better score counting on the LB. This method is used in other competitions and helps to “reduce variance” in the gamble I was talking about. For example, you could submit one very general model and one that relies on some assumptions or a risky method.
Release a sample (say 10-20 pictures) from the stage 2 test set. That way we wouldn’t be able to use it for training, but participants would know what to expect (for example new roof colors, shadows, clouds, etc.).
I hope this post contributes to the overall results and impact of this competition, including by providing the best models for Humanity & Inclusion, because this is really what we care about most.
And just as a reminder, our solution is always available for everyone to use freely, discuss, or contribute to at https://github.com/minerva-ml/open-solution-mapping-challenge/
9 months ago
@minerva.ml : I understand your concerns. But because of the nature of the license on the data used for Round-2, we cannot release the data under any circumstances, unfortunately not even 10-20 pictures as a sample.
And regarding your concerns around overfitting, and the role of luck in getting a higher performance, they are very valid, and we have been internally having a lot of discussions around the very same point.
In fact, that is the reason why accepting submissions (and releasing submission instructions) for round-2 is taking a bit long.
In the end, we will allow all participants to train on a randomly sampled subset of the dataset used for Round-2 before they make predictions. The total time limit for training and prediction combined will be 8 hours, and each submission will be provided with a single Titan-X (Maxwell Series) GPU.
While the images for round-2 are still satellite imagery, the visual features of tents in refugee camps are very different from those of buildings in urban areas, so we understand why it wouldn’t make sense to simply take your trained models and predict directly on the new dataset.
Also, participants will be allowed to submit multiple models for round-2, and their best-scoring model will decide their position on the leaderboard.
The 8-hour time limit is mostly there for logistical reasons, and I agree that it might not result in the best models. But the difficulty will still be the same for all participants, and at the end of the challenge the top models can be trained for longer (depending on what each approach requires) to make them more suitable for production use.
Hope this answers all your questions,