Help us share the archives of the League of Nations, a vital part of world history
By United Nations Library
5 months ago
I use the corrected results from ViktorF, and it’s not very hard to get a good validation result (I split the training data randomly into training and validation sets). The validation F1 score is almost 0.98.
But the result on the leaderboard is not good (the leaderboard F1 score is 0.88). I understand that this is because the data distribution in the corrected training set differs from that of the test set.
If the test set has errors, then the best result in this competition comes from simply fitting the wrong label distribution. That may not be what the model is expected to do.
What’s the expectation of this competition?
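The gap described above between validation and leaderboard scores can be reproduced in a toy setting. The sketch below is only an illustration: the labels, the number of flipped documents, and the helper `macro_f1` are all made up for the example, and macro averaging is an assumption, since the thread only names `f1_score` without saying how it is averaged.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1, then the unweighted mean over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gets a false positive
            fn[truth] += 1  # true class gets a false negative
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: 50 English and 50 French documents.
clean = ["en"] * 50 + ["fr"] * 50

# Assumed systematic annotator error: 10 documents of each language
# are consistently labeled as the other language.
noisy = list(clean)
for i in list(range(10)) + list(range(50, 60)):
    noisy[i] = "fr" if clean[i] == "en" else "en"

# A model trained on corrected labels predicts the clean labels;
# a model fitted to the annotators' mistakes predicts the noisy ones.
clean_model = list(clean)
noise_fitted_model = list(noisy)

for name, preds in [("clean model", clean_model),
                    ("noise-fitted model", noise_fitted_model)]:
    print(name,
          "| F1 vs noisy test labels:", round(macro_f1(noisy, preds), 3),
          "| F1 vs clean test labels:", round(macro_f1(clean, preds), 3))
```

With these made-up numbers the clean model scores 0.8 against the noisy labels and 1.0 against the clean ones, and the noise-fitted model scores the reverse. Which model "wins" depends entirely on which labels the test set uses.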
5 months ago
This is why I gave up. This competition does not have proper datasets and has apparently been abandoned by the maintainers. Please correct me if I’m wrong.
5 months ago
I feel the same way.
5 months ago
@ViktorF : Sorry for the delay in the response. I personally was very caught up with the NIPS deadlines, and the main organizers of this challenge also seem to have been caught up.
And true, there do seem to be a lot of errors in the training and test data, as they were collected by crowdsourcing. We will however prefer to not change the rules at this point, and instead go ahead with the current competition as things stand.
The main organizers of this competition are in the process of cleaning up the test set, and then we will re evaluate all the submissions.
And yes, we will use f1_score as the primary evaluation metric.
5 months ago
@spMohanty I’m sorry but I don’t understand this. You first say “We will however prefer to not change the rules at this point, and instead go ahead with the current competition as things stand.” But then the next sentence is “The main organizers of this competition are in the process of cleaning up the test set, and then we will re evaluate all the submissions.” So which is it? And let me explain why, as a competitor, the second sentence is beyond frustrating when you’ve spent time and resources on this competition.
The organizers have been aware of the noisy training set for three months but stated that this was somehow acceptable and have done nothing about it. And this is what we’ve worked with this whole time. While the current data is indeed noisy, it’s not completely random; there are some consistent mistakes. An out-of-the-box OCR + langdetect method scores ~.875, and this is probably the method that currently comes closest to actually correct annotation. But as shown, it’s possible to score .92+ when training models, because the model learns the mistakes the annotators have made. It learned to annotate some French documents as English and the other way around, because that’s how the annotators consistently classified them.
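For readers who have not seen the kind of out-of-the-box OCR + langdetect baseline mentioned above, here is a minimal sketch of what one could look like. It is not the code used in the competition: it assumes the `pytesseract`, `Pillow`, and `langdetect` packages plus a local Tesseract install, and the function names and the `"unknown"` fallback label are hypothetical.

```python
def ocr_image(image_path):
    """Extract text from a scanned page with Tesseract.

    Assumes pytesseract, Pillow, and the tesseract binary are installed.
    """
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path))


def detect_language(text):
    """Guess the language code of a text (for example 'en' or 'fr')."""
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # langdetect is randomized; fix the seed
    return detect(text)


def classify_document(image_path, ocr=ocr_image, detect=detect_language,
                      fallback="unknown"):
    """OCR a page, then detect its language.

    `ocr` and `detect` are injectable so the pipeline can be exercised
    without the external libraries. `fallback` is a hypothetical label
    returned when the page has no usable text.
    """
    text = ocr(image_path).strip()
    if not text:
        return fallback
    try:
        return detect(text)
    except Exception:
        # langdetect raises when the text has no usable features
        # (for example only digits or punctuation).
        return fallback
```

Because no training is involved, a pipeline like this cannot pick up the annotators' consistent mistakes, which is consistent with it scoring below trained models on a noisy leaderboard.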
Now it’s of course everyone’s hope that their models will generalize well, but with this data there’s just no chance that they will. They will do that if they’re retrained with correctly labeled data, but not in their current state. So when you announce, less than 24 hours before the competition ends, that you will change the test set, that basically means that the more time you’ve spent training models to perform well on the leaderboard (our goal this entire time), the worse you will perform in the end.
I realise that, as the current leader, I have something to gain by keeping the competition as is, but I hope that what I say makes sense and that you can understand my perspective. You had months to fix this. Preferably you’d accept that this is how you made the competition and not change anything this late. You can still use the methods and code presented to train new models when you generate new correct data. But if you absolutely have to make that kind of huge last-minute change, I ask that you at the very least let us retrain our models with new representative training data to give us a chance to adjust. Another option could be to launch a fresh competition with this new data.
@dennissv : I do see your point, and sorry for the confusion created by my previous statement. What I meant when I said “We will however prefer to not change the rules at this point, and instead go ahead with the current competition as things stand” was that the current rules and prizes stand for this noisy dataset. You, as the current top scorer, are still entitled to the promised prize.
I understand the frustration with messy data and noisy labels, but even if this was unintended, you could think of this as a more real-world scenario! And great job on still making it work!
Now you could argue that a percentage of the training data (and by extension the test data) had wrong labels. Which is true, but an overwhelming majority of the data still has correct labels.
Then I went on to say, “The main organizers of this competition are in the process of cleaning up the test set, and then we will re evaluate all the submissions” which is also true.
Someone could argue, as @dt does, that all the current submissions are indeed overfitting to a wrong label distribution. So we are working with the organizers to ensure that they provide us a “clean” test set. We will then re-evaluate all the submissions, and if another participant comes out as the top scorer in that case, we will try to negotiate with the organizers to announce another, parallel prize for that participant. If it’s you again, then all is well. In any case, I personally would be curious to see how the leaderboard changes when we test the submitted predictions against cleaner, manually verified labels. I am sure many others would also be curious.
In any case, great job with all your participation in the competition, even against the overwhelming odds of messy data to work with.
@spMohanty Thank you for the clarification, and I also want to apologize, as I think my wording was overly harsh. It was quite a stressful end to a tough week. This does sound very fair and seems like a good solution to this problem. It will indeed be very interesting to see how all solutions check out against a clean test set. Could you perhaps open up late submissions at a later point so we could try training new, more useful models and see how they stack up? I did end up cleaning the training set yesterday when it was indicated that this might be needed to compete; I will post those corrected labels together with my solution code if this might be helpful.
Thank you for the clear response!