This challenge aims at predicting height based on genetics (DNA variation).
Genetics is the study of genes, genetic variation, and heredity in living organisms. DNA is the medium that allows most living organisms to pass information from generation to generation. It consists of long strands of nucleotides that form higher-order structures, the chromosomes. There are four different nucleotides, represented by the letters A, T, C and G, that together make up the genetic code.
The human genome is made of more than 3 billion nucleotides, and each individual harbors about 4 million genetic variants (mostly single nucleotide polymorphisms, or SNPs). A specific position on a chromosome is called a genetic locus, and different versions of the same genetic locus are called alleles. Because humans are diploid organisms, they have two copies of the genome, one inherited from each parent, and thus two alleles at each genetic locus. For one particular genetic locus, an individual is homozygous if the two alleles are identical, and heterozygous if the two alleles are different.
The genotype is the DNA sequence of an individual that determines a specific observable characteristic. That characteristic is called the phenotype.
Monogenic phenotypes are under the control of a single gene. For example, if hair color were a monogenic phenotype, inheriting two brown alleles of a hypothetical hair color gene would result in the brown hair phenotype. Conversely, inheriting two ginger hair alleles would result in the ginger hair phenotype.
Polygenic phenotypes, by contrast, are under the control of multiple genetic variants across the genome. If hair color were polygenic, it might for example work like the RGB (Red, Green, Blue) color model. In this case, three different genes would add their effects and interact to control hair color.
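The additive part of a polygenic model can be sketched in a few lines of Python. This is only an illustration: the three loci, their per-allele effect sizes and the baseline value are invented for the example, and interactions between genes are deliberately ignored.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
baseline_cm = 170.0                     # phenotype value with no effect alleles
effects = np.array([1.5, -0.5, 2.0])    # assumed effect per allele copy, in cm

# Number of copies (0, 1 or 2) of the effect allele at each of the three loci.
dosages = np.array([2, 1, 0])

# Purely additive model: each allele copy shifts the phenotype by its effect.
predicted = float(baseline_cm + dosages @ effects)
print(predicted)
```

In this sketch an individual who is homozygous at the first locus and heterozygous at the second gets the sum of the corresponding effects added to the baseline.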
Heritability of a phenotype measures how much of the observed variance of the phenotype in the population is due to genetic factors. Missing heritability represents the difference between the estimated heritability of a given phenotype, and the heritability that is explained by known genetic factors. Heritability of human height is estimated to be as high as 80%, but large genomic studies have so far only been able to explain about 25% of the observed variance. Height is a model phenotype to study complex traits, and here we want to test whether part of the missing heritability can be explained using innovative approaches to genetic datasets, including deep learning.
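To make the numbers above concrete, the following sketch simulates a phenotype as the sum of a known genetic component, an unknown genetic component and an environmental component. The variance shares (0.25, 0.55 and 0.20) are assumptions picked to mirror the 80% and 25% figures quoted above; the simulation is a simplified illustration, not a model of real height data.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # simulated individuals

# Assumed variance shares, chosen to mirror the figures in the text.
known_genetic = rng.normal(0.0, np.sqrt(0.25), n)    # explained by known variants
unknown_genetic = rng.normal(0.0, np.sqrt(0.55), n)  # genetic, but not yet explained
environment = rng.normal(0.0, np.sqrt(0.20), n)      # non-genetic factors

phenotype = known_genetic + unknown_genetic + environment

# Heritability: share of phenotypic variance due to all genetic factors.
heritability = (known_genetic + unknown_genetic).var() / phenotype.var()
# Share of phenotypic variance explained by the known factors only.
explained = known_genetic.var() / phenotype.var()
# Missing heritability: the gap between the two.
missing = heritability - explained

print(f"heritability={heritability:.2f} explained={explained:.2f} missing={missing:.2f}")
```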
The data comes from OpenSNP, which allows customers of direct-to-consumer genetic tests to publicly share their genome-wide genotyping data.
We provide two datasets for a total of 921 samples, divided into a training set of 784 samples in subset_cm_train.npy and a test set of 137 samples in a second numpy file. Each dataset contains 9,894 genetic variants: 9,207 variants known to be associated with height and 687 variants on the Y chromosome. The numpy arrays have shape (784, 9894) for the training set and (137, 9894) for the test set.
Each genetic variant is represented by 0 (homozygous for the reference allele), 1 (heterozygous), 2 (homozygous for the genetic variant) or NA (missing information, or absence of the position in the case of the Y chromosome in women). The first 9,207 columns are the genetic variants known to be associated with height; the last 687 correspond to the Y chromosome.
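As a rough sketch of how this layout can be handled, the example below builds a small synthetic matrix with the same number of variants and splits it into the two blocks. Since the arrays have shape (samples, variants), the blocks lie along the second axis. Two things are assumptions here: that NA values are stored as NaN in a floating-point array, and that mean imputation is an acceptable way to fill them. With the real data, the matrix would instead come from `np.load("subset_cm_train.npy")`.

```python
import numpy as np

N_HEIGHT_VARIANTS = 9207  # from the dataset description
N_Y_VARIANTS = 687        # from the dataset description

rng = np.random.default_rng(0)

# Synthetic stand-in for the training matrix (the real one has 784 rows).
genotypes = rng.integers(
    0, 3, size=(6, N_HEIGHT_VARIANTS + N_Y_VARIANTS)
).astype(float)

# Assumption: NA is encoded as NaN.
genotypes[0, 10] = np.nan                    # one missing call
genotypes[:3, N_HEIGHT_VARIANTS:] = np.nan   # samples without a Y chromosome

# Split the columns into the two blocks described above.
height_block = genotypes[:, :N_HEIGHT_VARIANTS]
y_block = genotypes[:, N_HEIGHT_VARIANTS:]

# Fraction of missing Y-chromosome calls per sample.
y_missing = np.isnan(y_block).mean(axis=1)

# Mean imputation per variant for the height-associated block.
col_means = np.nanmean(height_block, axis=0)
imputed = np.where(np.isnan(height_block), col_means, height_block)

print(height_block.shape, y_block.shape, y_missing)
```

Samples whose Y-chromosome block is entirely missing can be treated separately from samples with occasional missing calls, which is why the two cases are kept apart in this sketch.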
Finally, height is provided in a separate numpy file of shape (784, 1) named openSNP_heights.npy, for the training set only.
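A simple baseline for this kind of data is a linear model with a penalty on the coefficients. The sketch below implements closed-form ridge regression with NumPy and runs it on synthetic genotypes and heights. The choice of ridge regression, the penalty `alpha`, and the synthetic data are all assumptions made for illustration; nothing here is prescribed by the challenge.

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                      # center features
    yc = y - y_mean                      # center target
    A = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(A, Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def predict(X, coef, intercept):
    return X @ coef + intercept

# Synthetic stand-in data: 200 samples, 50 variants coded 0/1/2.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 50)).astype(float)
true_effects = rng.normal(0.0, 0.5, 50)
heights = (170.0 + X @ true_effects + rng.normal(0.0, 1.0, 200)).reshape(-1, 1)

# Heights are stored with shape (n, 1); flatten them before fitting.
y = heights.ravel()
coef, intercept = fit_ridge(X[:150], y[:150], alpha=1.0)
predictions = predict(X[150:], coef, intercept)

correlation = np.corrcoef(predictions, y[150:])[0, 1]
print(predictions.shape, round(float(correlation), 2))
```

On the real data the number of variants far exceeds the number of samples, so the penalty would matter much more than in this small example and would need to be tuned, for instance by cross-validation.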
While we recommend starting with this simplified dataset, more advanced users may try to analyze an extended version of the OpenSNP data, whose description is available here.
    import crowdai

    challenge = crowdai.Challenge("OpenSNPChallenge2017", "YOUR_CROWDAI_API_KEY_HERE")

    # A list of 137 predicted heights, one for each data point in the test set
    data = ...

    challenge.submit(data)
    challenge.disconnect()
More instructions on making submissions, as well as starter code, are available at:
Challenge Image source: https://commons.wikimedia.org/wiki/File:Benzopyrene_DNA_adduct_1JDG.png
The evaluation will be based on two scores, computed between the actual heights of the individuals in the test set and the submitted predictions.
NOTE: During the challenge, the scores will be computed on only 20% of the test dataset. The final standings on the leaderboard will be decided by computing the same scores on 100% of the test dataset after the challenge.
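The practical consequence is that a score seen during the challenge can differ from the final one. The sketch below illustrates this on synthetic data. The metric used, the coefficient of determination, is only a stand-in chosen for the example, and the random choice of the 20% subset is likewise an assumption; neither is taken from the challenge rules.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination, used here purely as a stand-in metric."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n_test = 137  # size of the test set

# Synthetic heights and noisy predictions.
y_true = rng.normal(170.0, 10.0, n_test)
y_pred = y_true + rng.normal(0.0, 8.0, n_test)

# Assumed: a random 20% subset is scored during the challenge.
public_idx = rng.choice(n_test, size=round(0.2 * n_test), replace=False)

public_score = r_squared(y_true[public_idx], y_pred[public_idx])
final_score = r_squared(y_true, y_pred)

print(f"subset score: {public_score:.3f}, full score: {final_score:.3f}")
```

With only about 27 samples in the scored subset, small differences between submissions on the intermediate leaderboard should be read with caution.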
The winner will be invited to the 2nd Applied Machine Learning Days at EPFL in Switzerland on January 29 & 30, 2018, with travel and accommodation covered.
MIT OpenCourseWare can help you go further in understanding the biological concepts related to this challenge.
The most important publications describing associations between genetic factors and human height are listed as references [1] and [2] below.
To transform the VCF files, it can be convenient to use plink.
[1] Wood, Andrew R, Tonu Esko, Jian Yang, Sailaja Vedantam, Tune H Pers, Stefan Gustafsson, Audrey Y Chu, et al. “Defining the Role of Common Variation in the Genomic and Biological Architecture of Adult Human Height.” Nature Genetics, 2014. doi:10.1038/ng.3097.
[2] Marouli, Eirini, Mariaelisa Graff, Carolina Medina-Gomez, Ken Sin Lo, Andrew R. Wood, Troels R. Kjaer, Rebecca S. Fine, et al. “Rare and Low-Frequency Coding Variants Alter Human Adult Height.” Nature, 2017. doi:10.1038/nature21039.