League of Nations Archives Digitization Challenge

Help us share the archives of the League of Nations, a vital part of world history


Noisy training datasets

Posted by Ciprian Tomoiaga over 3 years ago

A quick screening of the training dataset reveals that there are mis-labelled examples in both folders, i.e. French documents in the English folder and vice-versa.

Is this known? Do we know the approximate noise ratio?


Posted by spMohanty over 3 years ago

The labels were collected using a crowdsourcing campaign, so I am not surprised that there might be mislabelled data.

@nshreyasvi: Can you tell us more about the expected noise in the datasets?

Posted by nshreyasvi over 3 years ago

Hello, I am not sure about the noise, but if possible, can you provide a list of the mislabelled examples in the French and English folders so that they can be corrected?


Posted by nshreyasvi over 3 years ago

I think the noise in the datasets should not affect more than 20-30 labels (worst case).

Posted by Syaffers over 3 years ago

@Ciprian Tomoiaga I’ve been doing some relabeling to try and clean the data manually. After 1038 documents, there are 77 French documents (about 7.4%) and 7 German documents (about 0.7%). Hope this helps.
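As a quick check of the counts in this post, the short sketch below recomputes the rates (77 of 1038 and 7 of 1038) and attaches a rough 95% Wilson score interval to each. It assumes the 1038 relabelled documents behave like a random sample of the folder, which the post does not state; the helper name `wilson_interval` is made up for this illustration.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Counts reported in the post above.
checked = 1038
counts = {"French": 77, "German": 7}

# Point estimate of the rate for each language.
rates = {lang: k / checked for lang, k in counts.items()}

for lang, k in counts.items():
    low, high = wilson_interval(k, checked)
    print(f"{lang}: {k}/{checked} = {rates[lang]:.2%} "
          f"(95% interval {low:.2%} to {high:.2%})")
```

If the sample was not random (for example, documents were checked in filename order), the interval says little about the rest of the folder.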


Posted by ViktorF over 3 years ago

What should the right prediction be when the document is neither English nor French?

Should we predict English / Non-English instead?

Or should we predict 0.000,0.000?

There are also empty pages and pages with a single country name as a title on it, which is likely the same in English and French…

Posted by ViktorF over 3 years ago

Found 400+ examples of French mislabeled as English.

Found 100+ examples of English mislabeled as French.

Some of them are mixed-language pages, so a 50%-50% prediction would be reasonable, but not many fall into this category.

Please find the lists here: https://1drv.ms/f/s!AqEgz8G_d8TShq0bhMTOwrnNHu1PHA

So the training set is indeed very noisy.

There may still be a few mistakes in the lists.


Posted by ViktorF over 3 years ago

I guess the test set has the same error rate, provided you did a random split.

In that case we won’t be able to get near-perfect scores on the leaderboard.

Please feel free to contact me if you need help fixing the datasets. I offer the full predictions, so you can find the glaring mistakes and manually verify, then relabel them. It would be nice to do it quickly before we go far into the competition.

Please let me know if the competition is abandoned. It is pretty silent here…
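One way to use full predictions for finding glaring mistakes, as offered in this post, is to list the files where a confident prediction disagrees with the given label and hand only those to a human for review. The sketch below is an illustration under assumptions, not the method actually used in the thread: the function name `flag_suspect_labels`, the file names, the `"en"`/`"fr"` keys, the dictionary layout, and the 0.9 threshold are all hypothetical.

```python
def flag_suspect_labels(predictions, labels, threshold=0.9):
    """List files whose confident prediction disagrees with the given label.

    predictions: {filename: {label: probability}}  (assumed layout)
    labels:      {filename: given_label}           (assumed layout)
    Returns (filename, given, predicted, confidence) tuples,
    most confident disagreement first.
    """
    suspects = []
    for name, probs in predictions.items():
        given = labels[name]
        predicted = max(probs, key=probs.get)
        confidence = probs[predicted]
        if predicted != given and confidence >= threshold:
            suspects.append((name, given, predicted, confidence))
    suspects.sort(key=lambda row: row[3], reverse=True)
    return suspects


# Hypothetical example data, not taken from the challenge dataset.
predictions = {
    "doc_001.jpg": {"en": 0.98, "fr": 0.02},
    "doc_002.jpg": {"en": 0.03, "fr": 0.97},
    "doc_003.jpg": {"en": 0.55, "fr": 0.45},
}
labels = {"doc_001.jpg": "fr", "doc_002.jpg": "fr", "doc_003.jpg": "fr"}

suspects = flag_suspect_labels(predictions, labels)
for name, given, predicted, confidence in suspects:
    print(f"{name}: labelled {given}, predicted {predicted} ({confidence:.2f})")
```

Only `doc_001.jpg` is flagged here: `doc_002.jpg` agrees with its label, and `doc_003.jpg` disagrees but below the threshold, which is roughly where mixed-language pages would be expected to land. Predictions made on files the model was trained on tend to echo the noisy labels, so out-of-fold predictions are the safer input.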


Posted by nshreyasvi over 3 years ago

@ViktorF If there are errors, you can remove those examples and train the machine learning model without the noise. Some labels have wrong values because many of them were collected through a crowdsourcing campaign, and volunteers classified some documents incorrectly. I think that even after excluding the wrongly classified ones, there should be plenty of data to train the model on. Let me know if this would work out! Sorry for the late reply :)
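Excluding the wrongly classified examples before training could look like the minimal sketch below. It assumes the training set is available as a list of (filename, label) pairs and that a review produced a mapping from filename to either a corrected label or `None` (meaning "drop it"); the function name `apply_corrections`, the file names, and the labels are hypothetical. Relabelling is included as an option beyond what the post suggests, since a reviewed list usually also records the right language.

```python
def apply_corrections(samples, corrections):
    """Drop or relabel reviewed files and return the cleaned sample list.

    samples:     list of (filename, label) pairs
    corrections: {filename: corrected_label or None}; None drops the file
    """
    cleaned = []
    for name, label in samples:
        if name in corrections:
            fixed = corrections[name]
            if fixed is None:
                continue  # exclude the file from training entirely
            label = fixed  # keep the file under its corrected label
        cleaned.append((name, label))
    return cleaned


# Hypothetical example data, not taken from the challenge dataset.
samples = [("a.jpg", "en"), ("b.jpg", "en"), ("c.jpg", "fr")]
corrections = {"b.jpg": "fr", "c.jpg": None}

cleaned = apply_corrections(samples, corrections)
print(cleaned)
```

Dropping is the conservative choice for mixed-language or empty pages, while relabelling keeps more training data when the correct language is clear.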