Help us share the archives of the League of Nations, a vital part of world history
By United Nations Library
almost 2 years ago
Hello and thanks for the competition! I’m currently in the process of doing a more comprehensive writeup and cleaning up my code, but I wanted to post a quick version here for the curious so you can at least see the methods I used.
I started this competition using Tesseract 3 and Python’s langdetect to get an initial benchmark, which scored ~.875. I played around with different versions of this for a while but never scored significantly better.
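A minimal sketch of what that baseline could look like, under my own assumptions: the `LABELS` mapping and both helper names are illustrative, not the actual code, and actually running the OCR path needs `pytesseract`, `Pillow`, `langdetect`, plus a local Tesseract install.

```python
def classify_page(image_path):
    """OCR a page image and detect the language of the extracted text.
    The heavy imports live inside the function so the sketch loads even
    without Tesseract installed."""
    import pytesseract                 # thin wrapper around the Tesseract CLI
    from PIL import Image
    from langdetect import detect

    text = pytesseract.image_to_string(Image.open(image_path))
    return to_label(detect(text)) if text.strip() else "other"


# Hypothetical mapping from langdetect language codes to competition labels.
LABELS = {"en": "english", "fr": "french"}

def to_label(lang_code):
    """Fall back to a catch-all label for codes outside the mapping."""
    return LABELS.get(lang_code, "other")
```

Pages where OCR produces no usable text fall through to the catch-all label, which is exactly the failure mode the image models address later.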
My second approach was purely image based, as this is what I have the most experience with. I tested a few different models, which scored as follows:
Inception v3: .897
Inception resnet v2: .898
Xception with pseudo labeling: .903 (the one that ended up being useful later)
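A hedged sketch of how the Xception classifier and the pseudo-labeling step could look. The function names, the ImageNet warm start, and the 0.95 confidence threshold are my assumptions, not the author's exact setup; building the model requires TensorFlow, which is imported lazily.

```python
def build_xception(num_classes):
    """Xception backbone with a fresh softmax head (transfer learning).
    TensorFlow is imported inside the function so the sketch loads
    without it installed."""
    from tensorflow.keras import applications, layers, models
    base = applications.Xception(include_top=False, weights="imagenet",
                                 pooling="avg", input_shape=(299, 299, 3))
    head = layers.Dense(num_classes, activation="softmax")(base.output)
    return models.Model(base.input, head)


def pseudo_labels(probs, threshold=0.95):
    """Return (row_index, class_index) pairs for test-set predictions
    whose top probability clears the threshold; these confident examples
    get added to the training data before retraining."""
    confident = []
    for i, row in enumerate(probs):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] >= threshold:
            confident.append((i, best))
    return confident
```

The pseudo-label filter is pure Python on the predicted probability matrix, so it can sit downstream of any of the image models, not just Xception.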
I then built some ensembles of these combined with the original OCR results, which landed me at ~.905, where I stayed for the majority of the competition.
That’s when DataExMachina and Dt came along and pushed the scores up, which motivated me to find new methods. By then I had some experience with text-based models from a Kaggle competition, so I tried to apply some of them here. I upgraded Tesseract to v4 with the new LSTM model, which seemed to produce better text (no numbers to back this up, I’m afraid, just manual inspection). The methods I tried were:
Logistic regression: .885
FTRL / LGBM / XGBoost ensemble: .915
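The FTRL / LGBM / XGBoost combination boils down to averaging the three models' class-probability outputs per document. A minimal blending helper, sketched under the assumption that each model already emits a probability vector; the weights here are illustrative, not the tuned values.

```python
def blend(prob_lists, weights):
    """Weighted average of per-model class-probability vectors for one
    document. prob_lists[k] is model k's probabilities over the classes;
    weights[k] is that model's (illustrative) blending weight."""
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [sum(w * p[i] for w, p in zip(weights, prob_lists)) / total
            for i in range(n_classes)]
```

In practice the weights would be tuned on a validation split rather than fixed up front.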
The final push to .920 came from merging the text-based ensemble with the Xception results. My thought process was that the image-based method could save the text-based ones when they had very little information to work with (little or no text successfully extracted), and this did indeed prove beneficial.
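One way to realise that rescue effect, sketched under my own assumptions (the confidence proxy and the `alpha` parameter are illustrative, not the exact scheme used): shift weight toward the image model whenever the text model's top probability is low, i.e. when little usable text was extracted.

```python
def merge(text_probs, image_probs, alpha=0.5):
    """Blend text and image class probabilities for one document,
    leaning on the image model when the text model is unsure (its max
    probability serves as a crude certainty proxy)."""
    confidence = max(text_probs)   # near 1/num_classes => text is guessing
    w_text = alpha * confidence    # illustrative weighting scheme
    w_image = 1.0 - w_text
    return [w_text * t + w_image * m
            for t, m in zip(text_probs, image_probs)]
```

With a flat text prediction the image scores dominate; with a confident text prediction the blend moves back toward a plain average.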
Some other things I tried were building a better OCR engine and training Tesseract on a typewriter font, but none of this helped the scores, so I abandoned those approaches. I think the image-based methods were better at classifying these hard-to-read documents anyway. Pseudo labeling was very beneficial with some methods but not with others, and ensembling the different methods was somewhat difficult. I think this was due to the somewhat random data.
I’d be very happy to answer questions if you have any!
almost 2 years ago
Hi, thank you for sharing your solution. I did not think about extracting metadata from images using pre-trained models. I’m curious to know what kind of labels Xception found in the images. Can you give us an example? :)
almost 2 years ago
Hi! It’s simply an image classifier, much like you would use in any ordinary image classification task. So the final ensemble is simply an averaging of the text + image scores, with extra weight toward the image one when the text ones are unsure. I guess that even when the text is hard to read and no usable text was extracted with OCR, there are still some features the image classifier can use. Using it to extract features is an interesting idea; I thought about it but never got around to it. Perhaps one could have trained it to detect the font, whether the page is a table, a map or plain text, and features like that, and then used those with CatBoost or a similar method to build a more robust ensemble. I think it would have been a good idea. Next time, I guess :)
Hi, thank you for the clarifications. Can you share a tutorial / git repo or something similar showing how you used Xception? I am curious about how it fits into a data pipeline :)