
Mapping Challenge

Building Missing Maps with Machine Learning


Different Evaluation Criteria between Local Evaluation and Leaderboard for test images?

Posted by chenyiyong over 3 years ago

As described in the Local Evaluation script (see the accumulate function reference), the reported precision is the average precision over 101 recall thresholds (recThrs [0:.01:1]). That means the precision value should be lower than the recall value, and that is indeed what I see when I run Local Evaluation. On the leaderboard, however, the precision can be higher than the recall, and the Evaluation Criteria on the Overview page say nothing about averaging precision over 101 recall thresholds. So the local evaluation and the leaderboard seem to use different evaluation methods, am I right?

If so, I want to confirm whether the leaderboard evaluation is based on images or on annotations, because the local evaluation is based on annotations. If it is image-based, an image with many annotations and an image with few annotations get the same weight; if it is annotation-based, an image with many annotations carries more weight than an image with few. Could you share the detailed evaluation method used for the leaderboard?
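For concreteness, here is a minimal sketch (an assumed simplification, not the pycocotools implementation itself) of why averaging precision over 101 recall thresholds tends to pull the reported precision below the final recall: any threshold beyond the maximum achieved recall contributes a precision of 0 to the average.

```python
import numpy as np

def coco_style_ap(precisions, recalls):
    """COCO-style AP: interpolate precision at 101 evenly spaced recall
    thresholds [0, 0.01, ..., 1.0] and average them. `precisions` and
    `recalls` are raw PR-curve points sorted by descending score."""
    rec_thrs = np.linspace(0.0, 1.0, 101)
    # Make precision monotonically non-increasing (right-to-left max),
    # as pycocotools does before interpolation.
    prec = np.maximum.accumulate(precisions[::-1])[::-1]
    # For each threshold, take precision at the first recall >= it;
    # thresholds beyond the max achieved recall contribute 0.
    idx = np.searchsorted(recalls, rec_thrs, side="left")
    q = np.zeros(101)
    valid = idx < len(prec)
    q[valid] = prec[idx[valid]]
    return q.mean()

# Toy PR curve (made-up numbers): recall tops out at 0.5, so the 50
# thresholds above 0.5 all contribute 0 and drag the AP below 0.5.
recalls = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
precisions = np.array([1.0, 0.9, 0.8, 0.7, 0.6])
ap = coco_style_ap(precisions=precisions, recalls=recalls)
print(ap)  # 41/101 ≈ 0.4059, below the final recall of 0.5
```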

Posted by chenyiyong  over 3 years ago |  Quote

This is the first time I am using the Discussion, sorry for the bad format.

Posted by spMohanty  over 3 years ago |  Quote

Hi @chenyiyong,

We do not average across different threshold values. We just use the threshold of 0.5. In the Local Evaluation script, it is implemented in this line:

```python
average_precision = cocoEval._summarize(ap=1, iouThr=0.5, areaRng="all", maxDets=100)
```

And for averaging, each annotation has the same weight. So if an image has more annotations in the ground truth, it will have more weight than an image with fewer annotations. Please refer to this function for more details on the actual implementation: https://github.com/crowdAI/mapping-challenge-starter-kit/blob/master/cocoeval.py#L453
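As a toy illustration of the difference (hypothetical counts, not the starter-kit code): pooling all annotations into one PR computation weights each image by its number of ground-truth annotations, unlike averaging per-image recalls.

```python
# Two hypothetical images: one with 10 ground-truth annotations, one with 1.
gt_counts = {"image_a": 10, "image_b": 1}
matched   = {"image_a": 5,  "image_b": 1}   # correctly detected annotations

# Annotation-weighted recall: pool every annotation together, so
# image_a contributes 10x as much as image_b.
pooled_recall = sum(matched.values()) / sum(gt_counts.values())   # 6/11

# Image-weighted recall: average per-image recalls with equal weight.
per_image = [matched[k] / gt_counts[k] for k in gt_counts]
image_recall = sum(per_image) / len(per_image)                    # (0.5 + 1.0)/2

print(pooled_recall, image_recall)  # 0.5454... vs 0.75
```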

And the grading server uses the exact same implementation for evaluating as is provided in the starter-kit.

Hope this answers your questions.

Cheers,
Mohanty

Posted by chenyiyong  over 3 years ago |  Quote

Hi @spMohanty ,

1. Regarding averaging across different threshold values: I mean the 101 recThrs thresholds, not the iouThrs thresholds. The precision array is precision[T,R,K,A,M], i.e. precision[p.iouThrs, p.recThrs, p.catIds, p.areaRng, p.maxDets], and the call returns precision[iouThr=0.5, :, :, areaRngLbl='all', maxDets=100], so it averages over all 101 recThrs thresholds.

2. I just submitted a test result to the LB that contained only 2 correct annotations of 1 image, and the reported precision is 0.00990099009901 (exactly 1/101), because only the precision[p.recThrs = 0] bucket is non-zero. So I think you are right that the grading server uses exactly the same implementation as the starter-kit FOR PRECISION. But I am still not sure about recall: if the precision is averaged over the 101 recThrs values, it must be lower than the recall if we disregard the p.recThrs = 0 bucket.
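The 1/101 value can be reproduced with a small sketch of the averaging step (an assumed simplification of accumulate, not the actual code): when recall never reaches the second threshold (0.01), only the recThrs[0] = 0 bucket is filled.

```python
import numpy as np

# Hypothetical reproduction of the 1/101 observation: with only 2
# correct detections against many ground-truth annotations, recall
# never reaches 0.01, so only the bucket at recall threshold 0 holds
# a precision of 1.0; the other 100 buckets stay 0.
rec_thrs = np.linspace(0.0, 1.0, 101)
q = np.zeros(101)
q[0] = 1.0          # precision interpolated at recall threshold 0
ap = q.mean()
print(ap)           # 1/101 ≈ 0.00990099009901
```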

Posted by taraspiotr  over 3 years ago |  Quote

Hi @spMohanty , @chenyiyong ,

I was also wondering how, given the COCOeval evaluation, it is possible that for some submissions the recall is greater than the precision. My only guess is that there is a small bug in the LB evaluation, just like the one in your mask-rcnn repo: https://github.com/crowdAI/crowdai-mapping-challenge-mask-rcnn/blob/master/mrcnn/evaluate.py

In line 89 you calculate recall as:

```python
ar = cocoEval._summarize(ap=0, areaRng="all", maxDets=100)
```

and _summarize() without the iouThr argument by default averages recall over the iouThr range [0.5:0.05:0.95]. Maybe that is the case.
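To make the suspected mismatch concrete (hypothetical recall numbers, not values from the challenge): the precision call reads the single IoU=0.5 slice, while a _summarize call without iouThr averages over all ten IoU thresholds, so the two numbers are not directly comparable.

```python
import numpy as np

# The ten COCO IoU thresholds 0.50:0.05:0.95 used when no iouThr is given.
iou_thrs = np.linspace(0.5, 0.95, 10)

# Made-up per-threshold recalls: recall drops as the IoU gets stricter.
recall_per_iou = np.linspace(0.9, 0.3, 10)

recall_at_05 = recall_per_iou[0]      # the IoU=0.5 slice only
mean_recall = recall_per_iou.mean()   # averaged across all ten IoUs

print(recall_at_05, mean_recall)      # 0.9 vs 0.6
```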

By the way, I think the evaluation method described in the Overview can be misleading (it was for me), as it does not correspond to the one implemented in pycocotools. Could you confirm that for the evaluation on the LB you are using exactly the same method as in the Local Evaluation script?

Cheers,
taraspiotr from minerva.ml