WWW 2018 Challenge: Learning to Recognize Musical Genre

Learning to Recognize Musical Genre from Audio on the Web



Like never before, the web has become a place for sharing creative work - such as music - among a global community of artists and art lovers. While music and music collections predate the web, the web enabled much larger collections. Whereas people used to own a handful of vinyl records or CDs, they now have instant access to the whole of published musical content via online platforms. Such a dramatic increase in the size of music collections created two challenges: (i) the need to automatically organize a collection (as users and publishers can no longer manage them manually), and (ii) the need to automatically recommend new songs to users based on their listening habits. An underlying task in both challenges is to group songs into semantic categories.

Music genres are categories that have arisen through a complex interplay of cultures, artists, and market forces to characterize similarities between compositions and organize music collections. Yet, the boundaries between genres still remain fuzzy, making the problem of music genre recognition (MGR) a nontrivial task (Scaringella 2006). While its utility has been debated, mostly because of its ambiguity and cultural definition, it is widely used and understood by end-users who find it useful to discuss musical categories (McKay 2006). As such, it is one of the most researched areas in the Music Information Retrieval (MIR) field (Sturm 2012).

The task of this challenge, one of the four official challenges of the Web Conference (WWW2018) challenges track, is to recognize the musical genre of a piece of music of which only a recording is available. Genres are broad, e.g. pop or rock, and each song only has one target genre. The data for this challenge comes from the recently published FMA dataset (Defferrard 2017), which is a dump of the Free Music Archive (FMA), an interactive library of high-quality and curated audio which is freely and openly available to the public.


You can find the final results and the ranking on the repository and in the slides used to announce them.

In the interest of reproducibility and transparency for interested researchers, you’ll find below links to the source code repositories of all systems submitted by the participants for the second round of the challenge.

  1. Transfer Learning of Artist Group Factors to Musical Genre Classification
  2. Ensemble of CNN-based Models using various Short-Term Input
  3. Detecting Music Genre Using Extreme Gradient Boosting
  4. ConvNet on STFT spectrograms
  5. Xception on mel-scaled spectrograms
  6. Audio Dual Path Networks on mel-scaled spectrograms

The repositories should be self-contained and easily executable. You can execute any of the systems on your own mp3s by following these steps:

  1. Clone the git repository.
  2. Build a docker image with repo2docker.
  3. Execute the docker image.
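
As a rough illustration of these three steps, the sketch below only assembles the commands and prints them; it does not run anything. The repository URL, image name, working directory, and the `/data` mount point are hypothetical placeholders, and the exact way each system expects to receive the mp3s is an assumption: check the README of the repository you want to run.

```python
import shlex


def build_commands(repo_url, image_name, mp3_dir, workdir="system"):
    """Return the three commands (clone, build, run) as argument lists."""
    # Step 1: clone the git repository into a local working directory.
    clone = ["git", "clone", repo_url, workdir]
    # Step 2: build a docker image with repo2docker, without starting it.
    build = ["jupyter-repo2docker", "--no-run", "--image-name", image_name, workdir]
    # Step 3: execute the image. Mounting the mp3s read-only at /data is an
    # assumption of this sketch, not a convention stated by the challenge.
    run = ["docker", "run", "--rm", "-v", f"{mp3_dir}:/data:ro", image_name]
    return [clone, build, run]


# Hypothetical values, for illustration only.
commands = build_commands(
    repo_url="https://example.org/participant/system.git",
    image_name="genre-system",
    mp3_dir="/home/me/mp3s",
)
for command in commands:
    print(shlex.join(command))
```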

You can find more details in the slides used to announce the results and in the overview paper. The overview paper summarizes our experience running a challenge with open data for musical genre recognition: it motivates the task and the challenge design, shows some statistics about the submissions, and presents the results. Please cite our paper in your scholarly work if you want to reference this challenge.

  title = {Learning to Recognize Musical Genre from Audio},
  author = {Defferrard, Micha\"el and Mohanty, Sharada P. and Carroll, Sean F. and Salath\'e, Marcel},
  booktitle = {WWW '18 Companion: The 2018 Web Conference Companion},
  year = {2018},
  url = {https://arxiv.org/abs/1803.05337},


To avoid overfitting and cheating, the challenge takes place in two rounds. The final ranking will be based on results from the second round. In the first round, participants are provided a test set of 35,000 clips of 30 seconds each, and they have to submit their predictions for all 35,000 clips. The platform evaluates the predictions and ranks the participants upon submission. In the second round, all participants will have to wrap their models in a Docker container. We will evaluate those against a new, unseen test set. These 30-second clips will be sampled (at least in part) from new contributions to the Free Music Archive.

For details on how to package your code as a Binder-compatible repository, please read the documentation here: https://github.com/crowdAI/crowdai-musical-genre-recognition-starter-kit/blob/master/Round2_Packaging_Guidelines.md

The primary metric for evaluation will be the Mean Log Loss, and the secondary metric will be the Mean F1-Score.

The Mean Log Loss is defined by

$ L = - \frac{1}{N} \sum_{n=1}^N \sum_{c=1}^{C} y_{nc} \ln(p_{nc}), $

where

  • $N=35000$ is the number of examples in the test set,
  • $C=16$ is the number of class labels, i.e. genres,
  • $y_{nc}$ is a binary value indicating if the n-th instance belongs to the c-th label,
  • $p_{nc}$ is the probability according to your submission that the n-th instance belongs to the c-th label,
  • $\ln$ is the natural logarithmic function.
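
The definition above can be sketched in a few lines of Python. This is an illustration, not the platform's evaluation code: the function name is hypothetical, and clipping probabilities away from 0 and 1 is an assumption made here so that $\ln(0)$ never occurs. Since each clip has a single target genre, only one term of the inner sum is non-zero per clip.

```python
import math


def mean_log_loss(true_labels, probabilities, eps=1e-15):
    """Mean log loss L = -(1/N) * sum_n sum_c y_nc * ln(p_nc).

    true_labels[n] is the index c of the single genre of clip n, so y_nc is 1
    for that index and 0 elsewhere.
    """
    if len(true_labels) != len(probabilities):
        raise ValueError("one row of probabilities is needed per example")
    total = 0.0
    for label, row in zip(true_labels, probabilities):
        # Clipping is an assumption of this sketch: it keeps ln(p) finite.
        p = min(max(row[label], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(true_labels)


# Toy sizes (the challenge uses N=35000 and C=16): two clips, two classes.
loss = mean_log_loss([0, 1], [[0.5, 0.5], [0.25, 0.75]])
print(loss)

# A submission that predicts 1/16 for every genre scores ln(16).
uniform_loss = mean_log_loss([3], [[1.0 / 16] * 16])
print(uniform_loss)
```

The second call gives a useful reference point: any submission with a loss above $\ln(16) \approx 2.77$ does worse than predicting a uniform distribution over the 16 genres.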

The $F_1$ score for a particular class $c$ is given by

$ F_1^c = 2\frac{p^c r^c}{p^c + r^c}, $

where

  • $p^c = \frac{tp^c}{tp^c + fp^c}$ is the precision for class $c$,
  • $r^c = \frac{tp^c}{tp^c + fn^c}$ is the recall for class $c$,
  • $tp^c$ refers to the number of True Positives for class $c$,
  • $fp^c$ refers to the number of False Positives for class $c$,
  • $fn^c$ refers to the number of False Negatives for class $c$.

The final Mean $F_1$ Score is then defined as

$ F_1 = \frac{1}{C} \sum_{c=1}^{C} F_1^c. $
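
A minimal Python sketch of the Mean $F_1$ Score follows. Two choices are assumptions of this sketch rather than statements from the challenge: the predicted class of a clip is taken to be the one with the highest submitted probability, and a class whose precision or recall is undefined (zero denominator) contributes an $F_1^c$ of 0. The function names are hypothetical.

```python
def argmax_labels(probabilities):
    """Turn each row of class probabilities into a single predicted class."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


def mean_f1(true_labels, predicted_labels, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    pairs = list(zip(true_labels, predicted_labels))
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        # Assumption: undefined precision or recall counts as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# Toy example with C=2: class 0 has F1 = 2/3, class 1 has F1 = 0.8.
predicted = argmax_labels([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]])
score = mean_f1([0, 0, 1, 1], predicted, num_classes=2)
print(predicted, score)
```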

The participants have to submit a CSV file with the following header:

file_id,Blues,Classical,Country,Easy Listening,Electronic,Experimental,Folk,Hip-Hop,Instrumental,International,Jazz,Old-Time / Historic,Pop,Rock,Soul-RnB,Spoken

Each row is then an entry for every file in the test set (in the sorted order of the file_ids). The first column in every row represents the file_id (which is the name of the test file without its .mp3 extension) and the rest of the $C=16$ columns are the predicted probabilities for each class in the order mentioned in the above CSV header.
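
The submission format can be sketched with the Python standard library as follows. The function name, the example file_ids, and the six-decimal formatting are hypothetical, and sorting the file_ids as strings is an assumption; the starter kit remains the reference for producing a valid file.

```python
import csv
import io

GENRES = [
    "Blues", "Classical", "Country", "Easy Listening", "Electronic",
    "Experimental", "Folk", "Hip-Hop", "Instrumental", "International",
    "Jazz", "Old-Time / Historic", "Pop", "Rock", "Soul-RnB", "Spoken",
]


def write_submission(predictions, out):
    """Write {file_id: [16 probabilities in GENRES order]} as CSV to `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["file_id"] + GENRES)
    # Assumption: file_ids are sorted as strings.
    for file_id in sorted(predictions):
        probs = predictions[file_id]
        if len(probs) != len(GENRES):
            raise ValueError(f"{file_id}: expected {len(GENRES)} probabilities")
        writer.writerow([file_id] + [f"{p:.6f}" for p in probs])


# Hypothetical file_ids with a uniform prediction over the 16 genres.
uniform = [1.0 / len(GENRES)] * len(GENRES)
buffer = io.StringIO()
write_submission({"000002": uniform, "000001": uniform}, buffer)
lines = buffer.getvalue().splitlines()
print(lines[0])
print(lines[1])
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`.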


The following rules have to be observed by all participants:

  • Participants are allowed at most 5 submissions per day.
  • Participants are welcome to form teams. Teams should submit their predictions under a single account. The submitted paper will mention all members.
  • Participants have to release their solution under an Open Source License of their choice to be eligible for prizes. We encourage all participants to open-source their code!
  • While submissions by Admins and Organizers can serve as baselines, they won’t be considered in the final leaderboard.
  • Training must be done with the audio from the FMA medium subset only. In particular, the large and full subsets must not be used. Neither should fma_features.csv (from fma_metadata.zip), which was computed on the full set.
  • Metadata, e.g. the song title or artist name, cannot be used for the prediction. The submitted algorithms shall learn to map an audio signal, i.e. a time series, to one of the 16 target genres. In particular, no information (audio features or metadata) from external websites or APIs can be used. This will be enforced in the second round, when the submitted systems will only have access to a set of mp3s to make predictions.
  • In case of conflicts, the decision of the Organizers will be final and binding.
  • Organizers reserve the right to make changes to the rules and timeline.
  • Violation of the rules or other unfair activity may result in disqualification.

For the second round, the docker containers will have access to the following resources, with a timeout of 10 hours:

  • 1 Nvidia GeForce GTX 1080 Ti (11 GB GDDR5X)
  • 5 cores of an Intel Xeon E5-2650 v4 (2.20-2.90 GHz)
  • 60 GB of RAM
  • 100 GB of disk
  • no network access

Guidelines for packaging the code and submitting your models for Round 2 are available in the Round 2 packaging guidelines linked above.


The winner will be invited to present their solution at the 3rd Applied Machine Learning Days at EPFL in Switzerland in January 2019, with travel and accommodation covered (up to $2000).

Moreover, all participants are invited to submit a paper to the Web Conference (WWW2018) challenges track. The paper should describe the proposed solution and self-assessments of its performance. Papers must be submitted in PDF on EasyChair for peer-review. The template to use is ACM, selecting the “sigconf” sample (as for the main conference). Submissions should not exceed five pages including any diagrams or appendices, plus unlimited pages of references. As the challenge is run publicly, reviews are not double-blind and papers should not be anonymized. Accepted papers will be published in the official satellite proceedings of the conference. As the challenge will continue after the submission deadline, authors of accepted papers will have the opportunity to submit a camera-ready version which will incorporate their latest tweaks. The event at the conference will be like a workshop, where participants present their solutions and we announce the winners.


Below is the timeline of the challenge:

  • 2017-12-07 Challenge start.
  • 2018-02-09 Paper submission deadline.
  • 2018-02-14 Paper acceptance notification.
  • 2018-03-01 End of the first round. No new participants can enroll.
  • 2018-04-08 Participants have to submit a docker container for the second round.
  • 2018-04-27 Announcement of winners and presentation of accepted papers at the conference.


Please refer to the dataset page for more information about the training and test data, as well as download links.

The starter kit includes code to handle the data and make a submission. Moreover, it features some examples and baselines.

You are encouraged to check out the FMA dataset GitHub repository for Jupyter notebooks showing how to use the data, explore it, and train baseline models. This challenge uses the rc1 version of the data, so make sure to check out that version of the code. The associated paper describes the data.

Additional resources:

Public contact channels:

We strongly encourage you to use the public channels mentioned above for communications between the participants and the organizers. In extreme cases, if there are any queries or comments that you would like to make using a private communication channel, then you can send us an email at: