Location-based species recommendation
Note: Do not forget to read the Rules section on this page
Automatically predicting the list of species that are the most likely to be observed at a given location is useful for many scenarios in biodiversity informatics. First of all, it could improve species identification processes and tools (be they automated, semi-automated or based on classical field guides or floras) by reducing the list of candidate species to those observable at a given location. More generally, it could facilitate biodiversity inventories through the development of location-based recommendation services (typically on mobile phones) as well as the involvement of non-expert nature observers. Last but not least, it might serve educational purposes through biodiversity discovery applications providing functionalities such as contextualized educational pathways.
The aim of the challenge is to predict the list of species that are the most likely to be observed at a given location. To this end, we provide a large training set of species occurrences, each occurrence being associated with a multi-channel image characterizing the local environment. Indeed, it is usually not possible to learn a species distribution model directly from spatial positions because of the limited number of occurrences and the sampling bias. The usual practice in ecology is to predict the distribution from a representation in the environmental space, typically a feature vector composed of climatic variables (average temperature at that location, precipitation, etc.) and other variables such as soil type, land cover, distance to water, etc. The originality of GeoLifeCLEF is to generalize this niche modeling approach to an image-based environmental representation space. Instead of learning a model from environmental feature vectors, the goal of the task is to learn a model from k-dimensional image patches, each of the k channels representing the values of one environmental variable in the neighborhood of the occurrence (see figure below for an illustration). From a machine learning point of view, the challenge can thus be treated as an image classification task.
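Whatever classifier is used, its per-species scores for a location have to be turned into a ranked list of the most likely species. The following is a minimal sketch of that step, assuming a hypothetical dictionary of raw scores keyed by species id. The function name, the softmax normalization and the example ids are illustrative assumptions and are not part of the challenge material:

```python
import math


def rank_species(scores, top_k=100):
    """Turn raw per-species scores into (species_id, probability, rank) tuples.

    `scores` is a hypothetical {species_id: raw_score} mapping, e.g. the
    output layer of an image classifier for one environmental patch.
    """
    # Softmax, subtracting the maximum score for numerical stability.
    max_score = max(scores.values())
    exps = {sid: math.exp(s - max_score) for sid, s in scores.items()}
    total = sum(exps.values())
    probs = {sid: e / total for sid, e in exps.items()}
    # Sort by decreasing probability; ties are broken by species id.
    ordered = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep the top_k species and number them from rank 1.
    return [(sid, p, rank)
            for rank, (sid, p) in enumerate(ordered[:top_k], start=1)]


# Made-up scores for three species ids at one location.
example_scores = {143: 2.0, 2885: 0.5, 17: 1.0}
ranked = rank_species(example_scores, top_k=2)
```

The softmax is only one possible way to obtain probabilities; any model output that induces an ordering over species could be ranked in the same way.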
A detailed description of the data is provided in the Dataset section. In a nutshell, the dataset was built from occurrence data of the Global Biodiversity Information Facility (GBIF), the world’s largest open data infrastructure in this domain, funded by governments. It is composed of 261,176 occurrences of 3,203 plant species observed on the French territory between 1835 and 2017. Each occurrence is characterized by 33 local environmental images of 64x64 pixels (encoded as TIFF images with 33 channels). These environmental images were constructed from various open datasets including Chelsea Climate, ESDB soil pedology data [2,3,4], Corine Land Cover 2012 soil occupation data, CGIAR-CSI evapotranspiration data [5,6], USGS Elevation data (data available from the U.S. Geological Survey) and BD Carthage hydrologic data. The dataset is split into 3/4 for training and 1/4 for testing.
Participants are allowed to use other external training data on the condition that (i) the experiment is entirely reproducible, i.e. the external resource used is clearly referenced and accessible to any other research group in the world, (ii) participants submit at least one run without external training data so that we can study the contribution of such resources, and (iii) the additional resource does not contain any of the test observations.
Each team is allowed to submit a maximum of 10 runs. A run is a .csv file with 4 columns separated by “;”, in the following order:
patch_id ; species_glc_id ; probability ; rank
Here is an example:
Please check the format of your runs. patch_id, species_glc_id and rank should be integers, and probability a float. For each patch_id, one can give up to 100 species, which must be distinct and whose ranks must be strictly consecutive.
WARNING: Even if a run produces an error, it counts toward the 10 allowed.
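The format rules above can be checked before submitting. The sketch below is one possible way to validate and write a run with the Python standard library. The function name and the example values are hypothetical, and writing the file without a header line is an assumption, since the page does not say whether a header is expected:

```python
import csv
import io


def validate_and_write_run(rows, out):
    """Validate (patch_id, species_glc_id, probability, rank) rows, then write them.

    `out` is any text file object; columns are separated by ';'.
    """
    by_patch = {}
    for patch_id, species_id, prob, rank in rows:
        if not all(isinstance(v, int) for v in (patch_id, species_id, rank)):
            raise ValueError("patch_id, species_glc_id and rank must be integers")
        by_patch.setdefault(patch_id, []).append((rank, species_id, float(prob)))

    # Validate every patch before writing anything.
    for patch_id, preds in by_patch.items():
        preds.sort()  # order predictions by rank
        if len(preds) > 100:
            raise ValueError(f"patch {patch_id}: more than 100 species")
        species = [s for _, s, _ in preds]
        if len(set(species)) != len(species):
            raise ValueError(f"patch {patch_id}: duplicate species")
        ranks = [r for r, _, _ in preds]
        if ranks != list(range(ranks[0], ranks[0] + len(ranks))):
            raise ValueError(f"patch {patch_id}: ranks are not consecutive")

    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    for patch_id, preds in by_patch.items():
        for rank, species_id, prob in preds:
            writer.writerow([patch_id, species_id, prob, rank])


# Made-up predictions for a single patch, written to an in-memory buffer.
buffer = io.StringIO()
validate_and_write_run([(1, 143, 0.6, 1), (1, 2885, 0.3, 2)], buffer)
run_text = buffer.getvalue()
```

Replacing the in-memory buffer with `open("run.csv", "w", newline="")` would write the same rows to disk.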
Remark: There is no leaderboard for this task. We removed it to maximise the independence between submitted algorithms and test data, and the usefulness of GeoLifeCLEF results for research.
As soon as submissions open, you will find a “Create Submission” button on this page (just next to the tabs).
The following table and graph sum up the MRR results on the test set per participant:
The metric used is the Mean Reciprocal Rank (MRR). The MRR is a statistical measure for evaluating any process that produces, for each query in a sample, a list of possible responses ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The MRR is the average of the reciprocal ranks over the whole test set:
$$\mathrm{MRR} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{\mathrm{rank}_q}$$
where $Q$ is the total number of query occurrences in the test set and $\mathrm{rank}_q$ is the rank of the first correct answer for the $q$-th query.
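To make the definition concrete, here is a minimal sketch of an MRR computation. It assumes hypothetical dictionaries mapping each patch_id to its ranked species list and to its true species, and it assumes that a query whose true species is absent from the list contributes a reciprocal rank of 0, a convention the page does not state explicitly:

```python
def mean_reciprocal_rank(predictions, truth):
    """Compute the MRR over all queries in `truth`.

    predictions: {patch_id: [species ids ordered from rank 1 onward]}
    truth:       {patch_id: true species id}
    """
    total = 0.0
    for patch_id, true_species in truth.items():
        ranked = predictions.get(patch_id, [])
        if true_species in ranked:
            # list.index is 0-based, ranks are 1-based.
            total += 1.0 / (ranked.index(true_species) + 1)
        # Assumption: a missing true species adds 0 to the sum.
    return total / len(truth)


# Made-up example with three test occurrences.
truth = {1: 143, 2: 17, 3: 9}
predictions = {1: [143, 5], 2: [4, 17], 3: [1, 2]}
mrr = mean_reciprocal_rank(predictions, truth)
```

In this example the true species is found at rank 1 for the first query, at rank 2 for the second and not at all for the third, so the reciprocal ranks are 1, 1/2 and 0.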
The LifeCLEF lab is part of the Conference and Labs of the Evaluation Forum: CLEF 2018. CLEF 2018 consists of independent peer-reviewed workshops on a broad range of challenges in the fields of multilingual and multimodal information access evaluation, and a set of benchmarking activities carried out in various labs designed to test different aspects of mono- and cross-language information retrieval systems. More details about the conference can be found here.
Submitting a working note with the full description of the methods used in each run is mandatory. Any run that cannot be reproduced from its description in the working notes might be removed from the official publication of the results. Working notes are published within CEUR-WS proceedings, resulting in the assignment of an individual DOI (URN) and indexing by many bibliographic systems including DBLP. According to the CEUR-WS policies, a light review of the working notes will be conducted by the LifeCLEF organizing committee to ensure quality. As an illustration, LifeCLEF 2017 working notes (task overviews and participant working notes) can be found within CLEF 2017 CEUR-WS proceedings.
Participants of this challenge will automatically be registered at CLEF 2018. In order to be compliant with the CLEF registration requirements, please edit your profile by providing the following additional information:
Regarding the username, please choose a name that represents your team.
This information will not be publicly visible and will be used exclusively to contact you and to send the registration data to CLEF, which is the main organizer of all CLEF labs.
LifeCLEF 2018 is an evaluation campaign that is being organized as part of the CLEF initiative labs. The campaign offers several research tasks that welcome participation from teams around the world. The results of the campaign appear in the working notes proceedings, published by CEUR Workshop Proceedings (CEUR-WS.org). Selected contributions among the participants will be invited for publication the following year in the Springer Lecture Notes in Computer Science (LNCS), together with the annual lab overviews.
- Technical issues: https://gitter.im/crowdAI/lifeclef-2018-geo
- Discussion Forum: https://www.crowdai.org/challenges/lifeclef-2018-geo/topics
We strongly encourage you to use the public channels mentioned above for communications between the participants and the organizers. In extreme cases, if there are any queries or comments that you would like to make using a private communication channel, you can send us an email at:
- Sharada Prasanna Mohanty: email@example.com
- Christophe Botella: christophe[DOT]botella[AT]gmail[DOT]com
- Alexis Joly: alexis[DOT]joly[AT]inria[DOT]fr
- Ivan Eggel: ivan[DOT]eggel[AT]hevs[DOT]ch
You can find additional information on the challenge here: http://imageclef.org/node/229