

OpenSNP Height Prediction



My solution

Posted by David_Baranger over 2 years ago

Hi all! I’ve posted an explanation to my solution here: https://davidbaranger.com/2017/10/01/on-predicting-traits-with-genetics/

The post includes a link to a file containing the 5 variables I generated, which made up the entirety of my model.

Feel free to email me if you have questions. I hope that some of you will be inspired to improve on my work!

Best, David


Posted by bolis.mrc over 2 years ago

Hi David,

Impressive result, thanks for sharing!

Best, Marco

Posted by David_Baranger over 2 years ago

A quick thought on how to translate my approach into ‘machine-learning-speak’:

While not exactly the same, the PRS approach might be thought of as an extremely simplified version of ensemble learning. Given N SNPs, we fit N experts, where each expert is a regression trained on a single feature. We then compute a weighted sum of the experts, where the weight vector is the effect size of each feature in its regression. While this approach virtually guarantees that we won't capture the maximum amount of variance possible, it also seems to be very good at reducing overfitting. I think this is an especially important point for genetics research, where each individual feature has a very small effect size. More complex learning algorithms thus run the risk of latching onto effects which appear to be especially predictive, but are actually just noise.
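To make the analogy concrete, here is a minimal sketch of that "N univariate experts, summed with effect-size weights" idea on synthetic data. All data and numbers here are illustrative, not from the challenge dataset; the univariate slopes are computed with plain OLS rather than any particular GWAS software.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic genotype matrix: n individuals x p SNPs (allele counts 0/1/2).
n, p = 500, 200
X = rng.integers(0, 3, size=(n, p)).astype(float)

# A few SNPs carry a small true effect; the rest are pure noise.
true_beta = np.zeros(p)
true_beta[:10] = rng.normal(0, 0.5, 10)
y = X @ true_beta + rng.normal(0, 1.0, n)

# "N experts": one univariate regression per SNP.
# After centering, the OLS slope for SNP j is cov(x_j, y) / var(x_j).
Xc = X - X.mean(axis=0)
yc = y - y.mean()
betas = (Xc * yc[:, None]).sum(axis=0) / (Xc ** 2).sum(axis=0)

# Polygenic score = weighted sum over SNPs, weights = per-SNP effect sizes.
prs = Xc @ betas

# The score should track the phenotype despite the many null SNPs.
r = np.corrcoef(prs, yc)[0, 1]
print(f"correlation between PRS and phenotype: {r:.2f}")
```

Because each weight comes from its own one-feature regression, no expert can exploit interactions or correlated noise across SNPs, which is exactly why the approach tends to underfit rather than overfit.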

Posted by Muhammad_Alfiansyah over 2 years ago

Hi David!

Thank you!!!

Your last comment explains why my best model was actually just a random forest on all SNPs, without any feature engineering :D!


Posted by David_Baranger almost 2 years ago

Hey all! It looks like this challenge is officially finished. I've posted some additional analyses, where I improve my original prediction by 11%, largely through data cleaning.


Best, David