
League of Nations Archives Digitization Challenge

Help us share the archives of the League of Nations, a vital part of world history



Overview

Starter Kit: https://github.com/crowdAI/league-of-nations-archives-digitization-challenge-starter-kit

The documents in the archives of the League of Nations (1919-1946) contain important historical information about the origins of the United Nations that is relevant to understanding the UN and its many roles today, as well as being a critical resource for the history of international relations during the interwar period. The official documents in particular are a vital source for researchers, as they provide the official output of the various bodies of the League, from the well-known League Council and Assembly to the most technical sub-committees and conferences, including minutes, final reports, and official working papers.

Because these documents exist only as paper archives at the UN Library in Geneva, their digital transformation represents an important challenge. Digitization of these archives will enable far wider access to the documents, which can currently be consulted only by researchers who visit the archives in person. Furthermore, while digitization itself is critical, providing intellectual access to the documents is no less necessary.

Currently, there are no comprehensive indexes or catalogues of the official documents collection, and it is therefore necessary to provide basic points of access to the materials, including the title, document symbol, date, and language of each document. This is currently done in a largely manual process, which is extremely labour intensive and relatively slow.

The documents themselves unfortunately present further challenges to rendering their contents accessible. There is a wide variety in the formats and layouts of title pages, titles may be complex and multi-faceted, and documents exist in multiple languages, including bi- and multi-lingual texts. Moreover, the quality of the printing and reproduction methods used has often resulted in poor quality text, for which it can be difficult to achieve good results using Optical Character Recognition (OCR) analyses.

The training dataset contains more than 4500 documents in English or French. More details can be found in the Dataset section of the challenge.

The challenge is to accurately identify the language of each document, using any predictive model.

Partners

Evaluation

The primary metric for evaluation will be the Mean F1-Score, and the secondary metric will be the Mean Log Loss.

The Mean Log Loss is defined by

$ L = - \frac{1}{N} \sum_{n=1}^N \sum_{c=1}^{C} y_{nc} \ln(p_{nc}), $

where

  • $N= 14216$ is the number of examples in the test set,
  • $C=2$ is the number of class labels, i.e. the languages [en, fr],
  • $y_{nc}$ is a binary value indicating if the n-th instance belongs to the c-th label,
  • $p_{nc}$ is the probability according to your submission that the n-th instance belongs to the c-th label,
  • $\ln$ is the natural logarithm.
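
For illustration, a minimal NumPy sketch of this metric is given below. The array names y_true (one-hot ground truth of shape (N, C)) and y_pred (predicted probabilities of shape (N, C)) are assumptions made for this example, not part of the official evaluation code.

import numpy as np

def mean_log_loss(y_true, y_pred, eps=1e-15):
    # y_true: (N, C) one-hot array, y_true[n, c] = 1 if example n has label c.
    # y_pred: (N, C) array of predicted probabilities for each class.
    # Clip the probabilities to avoid log(0), then average the negative
    # log-likelihood of the true class over all N examples.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(p), axis=1))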

The $F_1$ score for a particular class $c$ is given by

$ F_1^c = 2\frac{p^c r^c}{p^c + r^c}, $

where

  • $p^c = \frac{tp^c}{tp^c + fp^c}$ is the precision for class $c$,
  • $r^c = \frac{tp^c}{tp^c + fn^c}$ is the recall for class $c$,
  • $tp^c$ refers to the number of True Positives for class $c$,
  • $fp^c$ refers to the number of False Positives for class $c$,
  • $fn^c$ refers to the number of False Negatives for class $c$.

The final Mean $F_1$ Score is then defined as

$ F_1 = \frac{1}{C} \sum_{c=1}^{C} F_1^c. $
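
For reference, this quantity corresponds to a macro-averaged F1 score and could be computed, for example, with scikit-learn as sketched below; the label arrays shown are illustrative assumptions, not the organizers' evaluation script.

from sklearn.metrics import f1_score

# Macro averaging computes the F1 score per class and then takes their
# unweighted mean, matching the Mean F1 defined above.
y_true = ["en", "fr", "fr", "en"]   # ground-truth labels (illustrative)
y_pred = ["en", "fr", "en", "en"]   # predicted labels (illustrative)
print(f1_score(y_true, y_pred, average="macro"))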

The participants have to submit a CSV file with the following header:

filename,en,fr

Each subsequent row is an entry for one file in the test set, in sorted order of the filenames. The first column of every row is the filename (the name of the test file, including its ‘.jpg’ extension), and the remaining $C=2$ columns are the predicted probabilities for each class, in the order given in the CSV header above.

A sample row would look like:

58dc3fae0f039187c6615393b31c4978.jpg, 0.765, 0.235

which means that for the image in the file 58dc3fae0f039187c6615393b31c4978.jpg, you are 76.5% confident that it belongs to the class en and 23.5% confident that it belongs to the class fr.
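
As a sketch of how such a file could be produced, the snippet below lists the test images, sorts the filenames, and writes one row per file with pandas; the directory name data/test and the constant probability are placeholders to be replaced by your own paths and model outputs.

import os
import pandas as pd

test_dir = "data/test"  # placeholder: wherever the test images are stored
filenames = sorted(f for f in os.listdir(test_dir) if f.endswith(".jpg"))

rows = []
for name in filenames:
    p_en = 0.765  # placeholder: replace with your model's predicted P(en)
    rows.append({"filename": name, "en": p_en, "fr": 1.0 - p_en})

# Columns must follow the required header order: filename,en,fr
pd.DataFrame(rows, columns=["filename", "en", "fr"]).to_csv("submission.csv", index=False)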

An example of a French document: [image: french_lon]

An example of an English document: [image: english_lon]

Rules

  • Participants are allowed at most 10 submissions per 24h.
  • Participants are welcome to form teams. Teams should submit their predictions under a single account.
  • Participants have to release their solution under an Open Source License of their choice to be eligible for prizes. We encourage all participants to open-source their code!
  • The use of pre-trained models is nevertheless permitted.
  • While submissions by Admins and Organizers can serve as baselines, they won’t be considered in the final leaderboard.
  • In case of conflicts, the decision of the Organizers will be final and binding.
  • Organizers reserve the right to make changes to the rules and timeline.
  • Violation of the rules or other unfair activity may result in disqualification.

Prizes

The winning participant will be extended a travel grant by Citizen Cyberlab to attend the AI for Good summit 2019 in Geneva, Switzerland. The travel grant covers the travel and accommodation expenses up to CHF 1500.

Resources

Contact Us

Use one of the public channels:

We strongly encourage you to use the public channels mentioned above for communication between the participants and the organizers. In extreme cases, if you have queries or comments that you would like to raise through a private communication channel, you can send us an email at: