Explore methodology to identify assets with extreme positive or negative returns.
Update: Details for Round 2 can be found at the bottom of this page.
Using the provided datasets of financial predictors and semi-annual returns, participants are challenged to develop a model that will help identify the best-performing stocks in each time period.
Research Question: Which stocks will experience the highest and lowest returns during the next six months?
Out of the thousands of stocks in the market, small groups will experience exceptionally high or low returns. Considering the distribution of stock returns, a portfolio manager must buy the stocks in the right tail of the distribution and avoid the stocks in the left tail. The performance of an entire equity portfolio is often driven by these key investment decisions. The goal of this challenge is to explore methodology that will increase the probability that portfolio managers identify these stocks with extreme positive or negative returns.
Each team must create a model that ranks a set of stocks based on the expected return over a forward 6-month window. This model can be a risk factor-based strategy (multi-factor model), predictive model, or any other data-based heuristic. There are many ways to approach this task and creative, non-traditional solutions are strongly encouraged. The final model will be tested on each 6-month period from 2002 to 2017.
Analysts rely on a mix of quantitative and qualitative methodology to help investors consistently outperform the market. It’s not enough to be investment experts. Having the right data at the right time plays a critical role in successfully anticipating economic and environmental changes that may impact investment performance. Personalized solutions can be designed to provide a tailored mix of risk and return. Current baseline solutions rely on simple regressions and/or random forest solutions. Current approaches have high explanatory value and low predictive value. Improved solutions would increase predictive accuracy.
Teams are provided with predictors and semi-annual returns for a group of stocks from 1996 to 2017. This span of 21 years is represented as 42 non-overlapping 6-month periods. In each of the 42 time periods, roughly 900 stocks with the largest market capitalization (i.e., total market value in USD) were selected. Therefore, the selected set of stocks changes from period to period as companies increase or decrease in value. All stock identifiers have been removed and all numeric variables have been anonymized and normalized. Training and test datasets were created by selecting a random sample of stocks at each time period: 60% of stocks were sampled into the training set and the remaining 40% formed the test set. Finally, all data from the second half of 2017 was allocated to the test set. This 6-month period will provide a final out-of-sample test of a model’s performance.
Note: Please refer to the starter kit to get started quickly with the dataset, train a simple random-forest-based model, and make an example submission to crowdAI.
Consistent performance over time and through varying market conditions is crucial for any financial model. Each team must test their model using an expanding window procedure. For a given time period, $t$, an expanding window test allows the model to incorporate all available information up to time $t$ to generate predictions for time $t+1$. For example, when predicting the stock rankings in the first half of 2016, the model can include all data from 1996 to no later than year-end 2015. Predictions for the second half of 2016 could then also include the data from the first half of 2016. The quality of the predicted rankings at each time period will be evaluated in two ways, described below.
Spearman correlation: This metric will describe the overall relationship between the actual rankings and the predicted rankings from the model. Higher values indicate better performance.
Normalized Discounted Cumulative Gain of Top 20%: In reality, analysts and portfolio managers are not concerned with the entire distribution of stocks. They will instead focus on identifying and buying the best-ranking stocks. Normalized Discounted Cumulative Gain (NDCG) is a metric from the information retrieval domain that considers the relevance and confidence (rank position) to describe a model’s rank quality.
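The expanding window procedure described above can be sketched in a few lines of Python. This is only an illustration: the period labels and the `expanding_window_splits` helper are hypothetical and are not taken from the challenge dataset or the starter kit. It shows that each test period is paired with every period that precedes it, and nothing later.

```python
from typing import Iterator, List, Sequence, Tuple


def expanding_window_splits(
    periods: Sequence[str], first_test_index: int
) -> Iterator[Tuple[List[str], str]]:
    """Yield (training_periods, test_period) pairs in chronological order.

    `periods` must already be sorted from oldest to newest. The training
    window for each test period contains every earlier period.
    """
    for i in range(first_test_index, len(periods)):
        yield list(periods[:i]), periods[i]


# Hypothetical period labels, for illustration only.
periods = ["2014_1", "2014_2", "2015_1", "2015_2", "2016_1", "2016_2"]
splits = list(expanding_window_splits(periods, first_test_index=4))

for train, test in splits:
    print(f"predict {test} using {train[0]} .. {train[-1]} ({len(train)} periods)")
```

In a real workflow the model would be refit (or updated) on each training window before predicting the corresponding test period.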
Spearman correlation describes how well a model ranks the stocks at a given time period. It is calculated using the formula below.
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
where $d_i$ is the difference between the predicted and actual ranking of stock $i$, and $n$ is the number of stocks ranked in the period.
Spearman correlation has a range from -1 to 1. Models that rank stocks more accurately will produce higher Spearman correlation values. Correlation values will be averaged across all time periods.
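As a rough illustration of this metric, the sketch below computes Spearman correlation from rank differences, using the standard form 1 - 6Σd²/(n(n² - 1)), and then averages it across periods. The function names and the toy numbers are assumptions made for the example; the official evaluation script may differ, in particular in how it handles ties (this sketch assumes there are none).

```python
def ranks(values):
    """Rank values from 1 (largest) to n. Assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(predicted, actual):
    """Spearman correlation via the rank-difference formula."""
    n = len(predicted)
    d_squared = sum((p - a) ** 2 for p, a in zip(ranks(predicted), ranks(actual)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def mean_spearman(periods):
    """Average the per-period correlation over (predicted, actual) pairs."""
    return sum(spearman(p, a) for p, a in periods) / len(periods)


# Toy data: model scores and realized returns for four stocks in one period.
predicted_scores = [0.9, 0.1, 0.5, 0.3]
actual_returns = [0.2, -0.1, 0.05, 0.1]
print(spearman(predicted_scores, actual_returns))
```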
Normalized Discounted Cumulative Gain of Top 20%
Normalized Discounted Cumulative Gain (NDCG) is the ratio between the Discounted Cumulative Gain (DCG) and the Ideal Discounted Cumulative Gain (IDCG):
$$\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}$$
The relevance term in DCG is the normalized future 6-month return (Norm_Ret_F6M in the dataset) of the ranked stock. Because DCG discounts each stock's contribution by its predicted rank position, stocks with better (lower) predicted ranks will have more influence on the ranking quality than stocks with worse (higher) predicted ranks. IDCG is the maximum possible DCG, which gives the NDCG score an upper bound of 1. The NDCG will be calculated for each individual 6-month period and then averaged across all periods.
Note that the NDCG is calculated using only the top 20% of a model’s predicted rankings. Therefore, NDCG rewards correctly identifying stocks in the top 20% and ranking them in the correct order. This aligns with the viewpoint of a ‘long-only’ portfolio manager who will focus on buying the best stocks and ignore stocks outside the top 20%.
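The sketch below illustrates the top-20% idea only. It is not the challenge's scoring code: the challenge uses a modified DCG that is not reproduced on this page, so the textbook form (relevance divided by log2 of position + 1) is swapped in here as an assumption. The function names, the rounding of the 20% cutoff, and the toy relevance values (assumed non-negative) are also hypothetical.

```python
import math


def dcg(relevances):
    """Textbook DCG: sum of rel_i / log2(i + 1) over positions i = 1..k.

    Assumption: the challenge's modified DCG differs from this form.
    """
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances, start=1))


def ndcg_top_fraction(predicted_scores, relevances, fraction=0.2):
    """NDCG restricted to the top `fraction` of the predicted ranking."""
    n = len(predicted_scores)
    k = max(1, int(n * fraction))  # cutoff rounding is an assumption
    by_prediction = sorted(range(n), key=lambda i: predicted_scores[i], reverse=True)
    predicted_top = [relevances[i] for i in by_prediction[:k]]
    ideal_top = sorted(relevances, reverse=True)[:k]  # best achievable ordering
    ideal = dcg(ideal_top)
    return dcg(predicted_top) / ideal if ideal > 0 else 0.0


# Toy relevance values for ten stocks, so the top 20% is two stocks.
relevances = [0.1, 0.9, 0.3, 0.8, 0.2, 0.5, 0.4, 0.6, 0.7, 0.0]
perfect = ndcg_top_fraction(relevances, relevances)
inverted = ndcg_top_fraction([-r for r in relevances], relevances)
print(perfect, inverted)
```

A model whose top two picks are the two best stocks, in the right order, reaches the upper bound of 1; a model that picks the two worst stocks scores close to 0.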
A more detailed description of NDCG can be found here. This challenge uses a modified formulation of DCG that is tailored to investment ranking.
Update: When the challenge launched, the evaluation script calculated NDCG incorrectly. This was fixed on 03/28. Solutions submitted before that date would have received incorrect NDCG scores.
Testing your solution
Throughout the competition, teams will be given the opportunity to evaluate their models on the test dataset. Teams can sign in and upload their predictions up to 5 times per day. This will provide an estimate for out-of-sample performance during the competition. Teams should rely on internal model validation procedures and be careful not to optimize results to this one small section of the test dataset.
March 26: Challenge launch and start of Round 1. Contestants create models and upload predictions to crowdAI.
April 30: Deadline for Round 1. All solutions must be submitted by 11:59 GMT. Top solutions from the leaderboard are invited to Round 2.
May 1: Start of Round 2. Contestants explain their methods, results, and conclusions in a short paper. Contestants also package the code of their submitted solution using Docker for testing and evaluation.
May 20: Deadline for Round 2. All solutions must be submitted by 11:59 GMT.
May 21: Top 6 solutions selected. Winners receive a travel stipend (maximum USD 1000) and an invitation to present at the IEEE Data Science Workshop in Lausanne, Switzerland, June 4 - June 6.
Details for Round 2:
Round 2 is open to all challenge participants. Round 1 focused on prototyping models that maximized statistical measures and Round 2 will enhance this with a deeper dive into your methodology and a new set of holdout data from 2017. To compete, all participants must submit the following items:
Final predictions for all time periods
A brief written solution using the IEEE template for conference proceedings. MS Word and LaTeX are both acceptable. At a minimum, the document should include an introduction, description of your methodology, results, and any other information needed to understand your solution and its merit. Tables, charts, and other visuals are highly encouraged.
All code and files needed to reproduce your results, uploaded to GitLab. (More details on this soon)
The top 6 solutions will be selected based on their statistical performance, calculated in the following manner:
Final score = (A+B+C+D)/4
A = Rank of Spearman correlation on holdout data from 2002 – 2016
B = Rank of NDCG score on holdout data from 2002 – 2016
C = Rank of Spearman correlation on holdout data from 2017
D = Rank of NDCG score on holdout data from 2017
All ranks will be determined using 3 significant digits. Performance on the data from 2017 will be used as a tiebreaker if needed.
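To make the arithmetic concrete, here is a small worked sketch of the final score for three hypothetical teams. The metric values are invented for illustration. Two details are assumptions, since the description does not spell them out: metrics are rounded to 3 significant digits before ranking, and teams with equal rounded values share the same rank. Because the score is an average of ranks, a lower final score is presumably better.

```python
from math import floor, log10


def round_sig(x, digits=3):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    return round(x, digits - 1 - floor(log10(abs(x))))


def rank_descending(scores):
    """Rank 1 = highest score; equal rounded scores share a rank (assumption)."""
    rounded = [round_sig(s) for s in scores]
    return [1 + sum(other > mine for other in rounded) for mine in rounded]


def final_scores(spearman_hist, ndcg_hist, spearman_2017, ndcg_2017):
    """Final score = (A + B + C + D) / 4, computed per team."""
    columns = [
        rank_descending(metric)
        for metric in (spearman_hist, ndcg_hist, spearman_2017, ndcg_2017)
    ]
    return [sum(team_ranks) / 4 for team_ranks in zip(*columns)]


# Invented metric values for three hypothetical teams.
scores = final_scores(
    spearman_hist=[0.0512, 0.0498, 0.0431],  # A: 2002-2016 holdout
    ndcg_hist=[0.301, 0.325, 0.3012],        # B: 2002-2016 holdout
    spearman_2017=[0.020, 0.035, 0.010],     # C: 2017 holdout
    ndcg_2017=[0.28, 0.27, 0.29],            # D: 2017 holdout
)
print(scores)
```

In this toy example the first two teams tie on the final score, which is the situation where performance on the 2017 data would act as the tiebreaker.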
Round 2 Submission Details
Round 2 of the IEEE Investment Ranking Challenge is now accepting submissions. Please remember to update your crowdAI client and follow the instructions here before making a submission.
We will accept submissions until May 21. To be eligible for the final leaderboard, you must also upload your code as a private repository to gitlab.crowdai.org.
Please add the following users as members of your private repository: benharlander, spMohanty
In addition to your code, please include a description of your approach using this template.
The leaderboard for this challenge will be shown only at the end of the round on May 21, but the grading status of your submissions can still be checked under the Submissions tab.
The top 6 participants on the leaderboard (excluding the organizers) will be invited to present at the IEEE Data Science Workshop in Lausanne, June 4-6, 2018, with a travel stipend of up to USD 1000 provided by Principal Financial Group.
A starter kit has been prepared that explains how to access the dataset, parse it, train a simple random-forest-based model, and make a submission. It can be accessed at: https://github.com/crowdAI/ieee_investment_ranking_challenge-starter-kit
- Gitter Channel : crowdAI/ieee-investment-ranking-challenge
- Technical issues : https://github.com/crowdAI/ieee_investment_ranking_challenge-starter-kit/issues
- Discussion Forum : https://www.crowdai.org/challenges/ieee-investment-ranking-challenge/topics
We strongly encourage you to use the public channels mentioned above for communications between the participants and the organisers. In extreme cases, if there are any queries or comments that you would like to make using a private communication channel, then you can send us an email at :