
OpenSNP Height Prediction

Completed · 1275 Submissions · 131 Participants · 13690 Views

Error in subset test data?

Posted by ivokwee over 1 year ago

Hello organizers,

There might be an error in the coding of the subset test data…

  1. I downloaded the ‘subset_cm_train.npy’ and ‘subset_cm_test.npy’, and converted them to CSV (please next time provide a CSV file…)
  2. Also downloaded the subset VCF file.
  3. Followed the instructions from Baranger to combine train+test and get bed, bim, fam files using plink
  4. Converted the genotype calls to integers (for combined set)
  5. Flipped the 0/1/2 integer coding to 2/1/0 to match the training subset CSV file

For 269 SNP probes, I still get a flipped coding in the test set! It seems these are the SNPs where the major allele is flipped between the training and test sets. Below are the first few mismatching SNPs and their counts in training and test (subset data). I guess the major allele has been defined as the most frequent allele separately in training and test, instead of on the combined data set.

I have put my R code in a gist: https://gist.github.com/ivokwee/fbb668a83b826a492c8be8c8486b9305
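
For reference, here is a rough Python sketch of one way to flag these SNPs (the actual analysis is in the R gist above; it assumes the npy arrays are laid out SNPs-by-individuals, as the shapes in the starter-kit transcript later in this thread suggest):

# Heuristic: if the allele counted by the 0/1/2 coding is the major allele in
# one subset and the minor allele in the other, its frequency sits on opposite
# sides of 0.5 in the two sets. Assumes rows = SNPs, columns = individuals.
import numpy as np

x_train = np.load("data/subset_cm_train.npy")
x_test = np.load("data/subset_cm_test.npy")

freq_train = x_train.mean(axis=1) / 2.0   # frequency of the counted allele, train
freq_test = x_test.mean(axis=1) / 2.0     # frequency of the counted allele, test

flipped = (freq_train - 0.5) * (freq_test - 0.5) < 0
print("suspected flipped SNPs:", int(flipped.sum()))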

Can you please check?

Ivo


          mismatch train.0 train.1 train.2 test.0 test.1 test.2
rs7536679     TRUE     218     386     180     37     62     38
rs2622925     TRUE     207     399     178     27     73     37
rs1044457     TRUE     205     403     176     28     74     35
rs2208375     TRUE     201     392     191     31     73     33
rs4375253     TRUE     200     406     178     34     63     40
...

Posted by ivokwee  over 1 year ago |  Quote

The table with mismatching SNPs is here:

https://gist.github.com/ivokwee/84664fad3b770519c2bf781e5935d248

Posted by Olivier  about 1 year ago |  Quote

Hello Ivo,

I have checked and you are right: some SNPs are flipped between the training and testing sets, and this is a real mistake on our side. Plink re-derives the ‘minor’ and ‘major’ status of each allele from the SNP frequencies every time it generates a new file. We extracted our training and testing sets from the full cohort using plink, so if by chance a SNP is minor in one cohort and major in the other (typically a SNP whose frequency is close to 0.5), this created a flip between the two sets. However, 269 mismatches is only 2.7% of all SNPs, and it impacted every challenger in a similar way. For this reason we are going to change the outcome of the challenge. Thank you for bringing this to our attention.
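
A toy sketch of that failure mode (purely illustrative, simulated data, not the organizers' actual plink commands):

# When each subset is recoded against its own allele frequencies, a SNP with
# frequency near 0.5 can end up counted against different alleles in the two
# extracts, i.e. a 0/1/2 vs 2/1/0 flip.
import numpy as np

rng = np.random.default_rng(0)
geno = rng.binomial(2, 0.5, size=1000)   # 0/1/2 counts of one fixed allele
train, test = geno[:700], geno[700:]

def counted_allele_is_minor(g):
    # True if, within this subset alone, the counted allele is the minor one.
    return g.mean() / 2.0 <= 0.5

print(counted_allele_is_minor(train), counted_allele_is_minor(test))
# If these two booleans disagree, recoding each subset against its own minor
# allele flips the coding (x -> 2 - x) of this SNP in one of the sets.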

Cheers,

Olivier

Posted by SergeKrier  about 1 year ago |  Quote

For this reason we are going to change the outcome of the challenge.

Not sure I understand the implication of this mistake (for me those are just numbers to predict). Does it mean the ordering of the leaderboard might change following the rescoring?

thanks

Posted by David_Baranger  about 1 year ago |  Quote

Perhaps it would be more accurate to post a corrected version of the subset datasets and let interested teams re-submit, rather than to rescore? Not every team used the dataset with the error.

Posted by NB  about 1 year ago |  Quote

Hi Olivier,

When do you expect to rescore the submissions? I’d like to know the final outcome of this challenge.

Posted by spMohanty  about 1 year ago |  Quote

Hi everyone,

Just letting you all know that we will be updating the erroneous datasets with the corrected dataset and resuming the challenge for a period of 1 month. More details will be added to the challenge page next week, and we will also send email notifications about this to all participants of this challenge.

Mohanty


Posted by SergeKrier  about 1 year ago |  Quote

Hi Mohanty, I see the challenge is reopened, great. Is it just for “fun”, or is there a new prize for the winner? ;-) Thanks

Posted by spMohanty  about 1 year ago |  Quote

Hi @SergeKrier, as the error in the dataset affected all the participants in a similar way, the results of the challenge still stand, and we are not announcing a new prize for the second phase of the challenge at the moment.

The corrected dataset is nevertheless available for participants who are interested in having their predictions re-evaluated after training on the correct data.

Cheers, Mohanty

Posted by ivokwee  about 1 year ago |  Quote

Just to say: the winning method of Baranger correctly merged the train and test VCF files, ran plink on the merged file, and then split the plink output back into train and test. So, in this way, he was not “affected” by the error. Running plink separately on each VCF would create the error as discussed (which is probably what the organizers did).
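
For reference, a minimal sketch of that split-after-merge step (not Baranger's actual code; the file names, the train_ids.txt sample list, and the plink --recode A .raw column layout are assumptions):

# After running plink --recode A on the *merged* train+test data, split the
# additive-coded output back into the two sample sets so both share one
# definition of the counted allele.
import pandas as pd

raw = pd.read_csv("merged.raw", sep=r"\s+")   # plink --recode A output (assumed name)
train_ids = set(pd.read_csv("train_ids.txt", header=None)[0].astype(str))  # hypothetical sample list

is_train = raw["IID"].astype(str).isin(train_ids)
geno_cols = raw.columns[6:]                   # SNP columns after FID IID PAT MAT SEX PHENOTYPE

x_train = raw.loc[is_train, geno_cols].to_numpy()
x_test = raw.loc[~is_train, geno_cols].to_numpy()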



Posted by aglotero  about 1 year ago |  Quote

Were the new npy files generated correctly? The starter code fails to execute:

andre@lenovo-andre:/export/openai/opensnp-challenge-starter-kit$ python
Python 3.6.2 |Anaconda custom (64-bit)| (default, Jul 20 2017, 13:51:32) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import crowdai
>>> import argparse
>>> from sklearn import linear_model

>>> import numpy as np
>>> x_train = np.load("data/subset_cm_train.npy")
>>> y_train = np.load("data/train_heights.npy")
>>> x_train.shape, y_train.shape
((9894, 785), (784,))
>>> 

Posted by Olivier  about 1 year ago |  Quote

There was indeed a mistake in the dataset; it should be updated very soon. If you want, you can correct it manually by removing the first column of the npy file (it should contain only -1). Our apologies for the inconvenience.
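
For example, a minimal sketch of that manual fix (input path taken from the starter-kit snippet above; the output file name is just a placeholder):

# Drop the erroneous first column (all -1) so the number of individuals
# matches train_heights.npy.
import numpy as np

x_train = np.load("data/subset_cm_train.npy")
assert (x_train[:, 0] == -1).all()                      # sanity check before dropping
np.save("data/subset_cm_train_fixed.npy", x_train[:, 1:])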

Cheers, Olivier


Posted by spMohanty  about 1 year ago |  Quote

Hi,

We have uploaded the corrected dataset from @Olivier; let us know if there are still problems with the data. The old erroneous dataset can also be downloaded from the datasets section.

Cheers,
Mohanty


Posted by mbeisen  about 1 year ago |  Quote

Thanks @Olivier and @spMohanty. But can you also check the heights numpy file? Even with the corrected genotype arrays, I see essentially no relationship between the heights and genotypes in the training set. There is an essentially uniform distribution of p-values when regressing individual SNPs against height, which makes me think the heights file is not in the right order.
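
A minimal sketch of that per-SNP check (paths from the starter kit; it assumes rows are SNPs, columns are individuals, and that the erroneous first column has already been removed):

# Regress height on each SNP separately and look at the p-value distribution.
# With a correct sample ordering, these height-associated SNPs should give an
# excess of small p-values rather than a roughly uniform histogram.
import numpy as np
from scipy.stats import linregress

x_train = np.load("data/subset_cm_train.npy")   # assumed (n_snps, n_individuals)
y_train = np.load("data/train_heights.npy")     # (n_individuals,)

pvals = np.array([linregress(snp, y_train).pvalue
                  for snp in x_train if snp.std() > 0])   # skip monomorphic probes
print(np.histogram(pvals, bins=10, range=(0, 1))[0])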

Posted by ivokwee  about 1 year ago |  Quote

I am finding 57 duplicated probes in the (new?) training subset by checking the RS ids in ‘SNP_subset_detail.txt’. I checked, and they are not duplicated in the file ‘SNP_fullset_detail.txt’. Many map to a minor genotype with a ‘.’ (period) and correspond to a very sparse coding vector.

First few duplicated probes (from subset): 274 395 673 888 1015 1040

Mapping to (duplicated) SNPid: rs2792793 rs1676499 rs6666508 rs1529897 rs10210125 rs3768919
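
For reference, a minimal sketch of that duplicate check (the whitespace-separated chrom/pos/rsid/ref/alt layout and the absence of a header line are assumed from the excerpt quoted further down; file name as given in this post):

# List rsIDs that appear more than once in the subset detail file.
import pandas as pd

cols = ["chrom", "pos", "rsid", "ref", "alt"]   # assumed column layout
snps = pd.read_csv("SNP_subset_detail.txt", sep=r"\s+", header=None, names=cols)

dups = snps[snps["rsid"].duplicated(keep=False)]
print(dups["rsid"].nunique(), "duplicated rsIDs")
print(dups.head())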

It should not be like this, right?

Ivo

Posted by Olivier  about 1 year ago |  Quote

Hello Ivo, this corresponds to the few non-biallelic sites: at a given position, there can be multiple possible alternate alleles. Some of them are indeed very sparse (very rare or absent), and you can, for example, choose to ignore them.

Cheers,

Olivier

Posted by ivokwee  about 1 year ago |  Quote

Here is a double entry in the SNP_subset_details.txt file:

2 134501984 rs17605839 G C
2 134501984 rs17605839 G T

But in the SNP_fullset_details.txt file there is only one entry for rs17605839:

2 134501984 rs17605839 G T

If it is due to non-biallelic SNPs, then it should at least also be present in the fullset, I presume…

Ivo

Posted by Olivier  about 1 year ago |  Quote

Hello Ivo,

When I generated the fullset file, I applied some filters to reduce the number of markers coming out of the imputation process. One of them was a minimum allele frequency threshold. When I generated the subset, I just extracted the markers known to be strongly associated with height, regardless of other parameters. For this reason, you can have a handful of SNPs in the subset that are absent from the fullset.

Cheers,

Olivier