League of Nations Archives Digitization Challenge

Help us share the archives of the League of Nations, a vital part of world history


Completed | 469 Submissions | 136 Participants | 9364 Views

Solution to get 0.915 on the public leaderboard

Posted by dt almost 2 years ago

It’s a tough game XD

I am going to share something I did. The code is still not ready.

There are two main steps:

Step 1. Extract words from the images: Just like dennissv and DataExMachina did, I did some image pre-processing and used Tesseract with Python to extract words. The image pre-processing includes denoising, enlarging, and sharpening. After this step, I have many text files.

Step 2. A binary classification model: I apply a binary classification model from a framework called ‘ULMFiT’. ULMFiT involves two steps: a. Fine-tune a language model (this takes about 2.5 hours). b. Add a binary classification layer (this takes about 0.5 hours to train).

These two steps give an F1 score of ~0.909 and a log loss of ~0.205.

I also use 10-fold cross-validation on the final binary classification model to estimate the validation result more precisely. I then use these 10 models to make predictions and average the predicted probabilities. The result is an F1 score of ~0.915 and a log loss of ~0.198.
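As a rough illustration of the averaging step, here is a minimal pure-Python sketch. It is not the code used for the submission: the function names, the 0.5 decision threshold, and the toy numbers are all assumptions made for the example, and only three "folds" are shown to keep it short.

```python
import math


def average_fold_probabilities(fold_probs):
    """Average per-sample probabilities across fold models.

    fold_probs holds one list of probabilities per fold model,
    all in the same sample order.
    """
    n_folds = len(fold_probs)
    return [sum(sample) / n_folds for sample in zip(*fold_probs)]


def log_loss(y_true, probs, eps=1e-15):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def f1_score(y_true, probs, threshold=0.5):
    """F1 of the positive class after thresholding the probabilities."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, q in zip(y_true, preds) if y == 1 and q == 1)
    fp = sum(1 for y, q in zip(y_true, preds) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(y_true, preds) if y == 1 and q == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 4 samples, 3 fold models.
y_true = [1, 0, 1, 0]
fold_probs = [
    [0.9, 0.2, 0.4, 0.1],
    [0.7, 0.4, 0.6, 0.3],
    [0.8, 0.3, 0.8, 0.2],
]
avg_probs = average_fold_probabilities(fold_probs)
print(f1_score(y_true, avg_probs), log_loss(y_true, avg_probs))
```

In this toy case the first fold alone misclassifies the third sample, while the averaged probabilities do not. That is only meant to show the mechanics, not to explain the gain reported above.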

This game has some problems, but it’s fun to apply models and compete with all of you.

That’s all. Thanks

Posted by jazivxt almost 2 years ago

Here’s what I was able to do in a day, since I only found out about the competition then. I’m curious about the additional image pre-processing steps. To meet the deadline I could only OCR a small section of each image, and my Intel NUC’s fan was acting up, so I couldn’t keep improving the score, but I wanted to share anyway so others can compare. Looking forward to hearing more about your work. I also tried TextBlob, but I believe it makes an API call to Google, which maxed out and just gave errors. I didn’t have time to set up a proper train/test scenario and use the OCR text from the train images for a proper language classifier.

from multiprocessing import Pool, cpu_count
import glob
import os

import numpy as np
import pandas as pd
import pytesseract as ocr
from PIL import Image, ImageEnhance, ImageFilter


def getLang(path):
    """OCR a thin horizontal strip of the page and return the recognized text."""
    try:
        img = Image.open(path)
        # img = img.rotate(90)  # single rotation
        w, h = img.size
        if w > 800:
            # Downscale to 800 px wide, keeping the aspect ratio.
            ratio = 800 / w
            # Image.LANCZOS is the same filter as the older Image.ANTIALIAS name.
            img = img.resize((800, int(h * ratio)), Image.LANCZOS)
            w, h = img.size
        # Only a small section of the page is OCR'd to save time.
        img = img.crop((300, 500, w - 300, 550))
        return ocr.image_to_string(img)
    except Exception as exc:
        # w and h may not be set if Image.open fails, so report the error instead.
        print(path, exc)
        return 'unknown'


def transform_df(df):
    df = pd.DataFrame(df)
    # df['tb_language'] = df['path'].map(lambda x: TextBlob(ocr.image_to_string(Image.open(x))).detect_language())
    df['tb_language'] = [getLang(x) for x in df['path'].values]
    return df


def multi_transform(df):
    print('Init Shape: ', df.shape)
    # Keep at least one worker, even on a single-core machine.
    p = Pool(max(1, cpu_count() - 1))
    df = p.map(transform_df, np.array_split(df, 10))
    df = pd.concat(df, axis=0, ignore_index=True).reset_index(drop=True)
    p.close(); p.join()
    print('After Shape: ', df.shape)
    return df


if __name__ == "__main__":
    # The forum rendering dropped the wildcards from the train pattern;
    # '*/*' is assumed here from the '/en/' and '/fr/' checks below.
    train = pd.DataFrame({'path': glob.glob('../input/train/*/*.jpg')})
    test = pd.DataFrame({'path': glob.glob('../input/test/**.jpg')})
    train['filename'] = train['path'].map(lambda x: os.path.basename(x))
    test['filename'] = test['path'].map(lambda x: os.path.basename(x))
    train['prob_en'] = train['path'].map(lambda x: 1 if '/en/' in x else 0)
    train['prob_fr'] = train['path'].map(lambda x: 1 if '/fr/' in x else 0)
    print(len(train), len(test))

    test = multi_transform(test)
    test.to_csv('test.csv')
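
For the language-classifier step that there was no time for, one very simple starting point is to count common English and French function words in the OCR text. The sketch below is only an assumption about how that might look. Nobody in this thread used it; the word lists are short and hand-picked, and `guess_language` is a hypothetical helper name.

```python
import re

# Small hand-picked stopword lists (illustrative, far from complete).
EN_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "for", "with", "this"}
FR_STOPWORDS = {"le", "la", "les", "et", "de", "des", "du", "un", "une", "est",
                "que", "pour", "dans"}


def guess_language(text):
    """Return 'en', 'fr', or 'unknown' based on stopword counts."""
    # Lowercase and keep runs of letters, including accented Latin-1 letters.
    words = re.findall(r"[a-zà-ÿ]+", text.lower())
    en_hits = sum(1 for w in words if w in EN_STOPWORDS)
    fr_hits = sum(1 for w in words if w in FR_STOPWORDS)
    if en_hits == fr_hits:
        # Covers empty or garbled OCR output as well as genuine ties.
        return "unknown"
    return "en" if en_hits > fr_hits else "fr"


print(guess_language("The Council of the League met in Geneva"))
print(guess_language("Le Conseil de la Société des Nations"))
```

Returning 'unknown' on ties mirrors the fallback value `getLang` uses when OCR fails. OCR noise on short crops would likely make this unreliable, so it is best read as a baseline to compare a trained model against.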

Posted by dt almost 2 years ago

Thanks for sharing!

I didn’t try many combinations of image preprocessing, because OCR takes a long time to run. The following function contains the preprocessing steps.

import cv2
import numpy as np


def processInput(img_i):
    img = cv2.imread(img_i, cv2.IMREAD_COLOR)
    # resize: enlarge the image 2x
    img = cv2.resize(img, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    # convert color to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # erode
    kernel = np.ones((2, 2), np.uint8)
    img = cv2.erode(img, kernel, iterations=1)
    # blur (median), then binarize with Otsu's threshold
    img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
    # unsharp mask
    gaussian_3 = cv2.GaussianBlur(img, (9, 9), 10.0)
    img = cv2.addWeighted(img, 1.5, gaussian_3, -0.5, 0, img)
    # The posted snippet ended here; returning the processed image is assumed.
    return img

Posted by dt almost 2 years ago

Oh, the typesetting is a disaster…

I did the following things: resize, convert color to gray, erode, blur, unsharp mask.
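
To make the last step concrete: an unsharp mask adds back the difference between the image and a blurred copy, which is what the weights 1.5 and -0.5 in `cv2.addWeighted` do in `processInput`. The sketch below shows that arithmetic on a single row of pixel values in pure Python. It is a simplification: a 3-tap box blur stands in for the Gaussian blur, and the helper names are hypothetical.

```python
def box_blur_row(row):
    """3-tap mean blur on one row of pixels, replicating the edge values."""
    padded = [row[0]] + list(row) + [row[-1]]
    return [(padded[i] + padded[i + 1] + padded[i + 2]) / 3 for i in range(len(row))]


def unsharp_mask_row(row, amount=0.5):
    """Sharpen with (1 + amount) * pixel - amount * blurred, clamped to 0..255."""
    blurred = box_blur_row(row)
    sharpened = []
    for pixel, blur in zip(row, blurred):
        value = round((1 + amount) * pixel - amount * blur)
        sharpened.append(min(255, max(0, value)))
    return sharpened


# A soft edge from 100 to 200: flat areas stay put, the edge gets more contrast.
print(unsharp_mask_row([100, 100, 100, 200, 200, 200]))
```

Pixels in flat regions are unchanged because the blurred value equals the original there. Only pixels next to the edge are pushed further apart, which is what makes character strokes look crisper before OCR.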

Posted by jazivxt almost 2 years ago

Awesome, thanks for the preprocessing function.