Explore methodology to identify assets with extreme positive or negative returns.
By Principal Financial Group
over 2 years ago
It is mentioned in the overview that “Finally, all data from the second half of 2017 was allocated to the test set. This 6-month period will provide a final out-of-sample test of a model’s performance.” However, as far as I can see, the dataset only contains data from the first half of 2017 (2017_1). Does that mean there is another test set with 2017_2 data?
Following point 1, it’s not very clear to me how I’m really supposed to predict. If the final goal is to predict 2017_2, then it makes sense to use all the data before it, which is given as such. However, if, let’s say, we are gonna predict the 2017_1 ranking, does it make sense to use the data X1 to X70? As I checked in the given starter-kit, you are using the info in 2017_1 to predict the ranking of 2017_1:
test_predictions = rf.predict(model_data.loc[(model_data['time_period'] == time) & (model_data['Train'] == 0), 'X1_avg':'X70_avg_pctile'])
Should the data ‘X1_avg’:’X70_avg_pctile’ already be available when making our predictions?
over 2 years ago
Clearly, all my comments are just gone for no reason and I have to rewrite everything. :(
I just wanna ask,
It is mentioned in the overview that “Finally, all data from the second half of 2017 was allocated to the test set. This 6-month period will provide a final out-of-sample test of a model’s performance.” But as far as I can see, there’s only ‘2017_1’ in the dataset. Does that mean there is another test set containing data from 2017_2?
It’s not very clear to me how we are supposed to predict. In the starter-kit, it seems you are using current data to predict the current ranking.
Suppose we are to predict the 2017_1 ranking: are we allowed to use the given data ‘X1_avg’:’X70_avg_pctile’? As I understand it, these data are not available at the time we make predictions.
over 2 years ago
Hi @baseline, excellent questions.
First, a reminder of how the data is organized. For a time period, say 2007_1, you have the future 6-month returns/ranks as of 6/30/2007 and the historic inputs available as of 6/30/2007. Therefore, the data corresponding to 2017_1 is from Jan 2017 to June 2017 and you are predicting the rank for the future 6 months (July to Dec 2017). There is more detail in the Data Description file on the “Datasets” tab above. Hope this helps.
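To make the timing concrete, here is a minimal sketch of how a period label could be mapped to the date its inputs are known and the window its rank covers. The helper name describe_period is hypothetical, and the handling of second-half labels (inputs as of 12/31, target covering January to June of the next year) is an assumption made by symmetry with the first-half case described above, not something taken from the Data Description file.

```python
from datetime import date


def describe_period(label: str) -> dict:
    """Map a label such as '2017_1' to its input as-of date and target window.

    Hypothetical helper for illustration only.
    """
    year, half = (int(part) for part in label.split("_"))
    if half == 1:
        # First half: inputs known as of June 30, target is July to December.
        as_of = date(year, 6, 30)
        target_start, target_end = date(year, 7, 1), date(year, 12, 31)
    elif half == 2:
        # Assumption: second half mirrors the first-half convention.
        as_of = date(year, 12, 31)
        target_start, target_end = date(year + 1, 1, 1), date(year + 1, 6, 30)
    else:
        raise ValueError(f"unexpected half-year index in label: {label!r}")
    return {
        "inputs_as_of": as_of,
        "target_start": target_start,
        "target_end": target_end,
    }


info = describe_period("2017_1")
print(info)
```

Under these assumptions, the inputs for 2017_1 are dated before the first day of the window being ranked, which is why the two do not overlap.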
My answer will be similar to the last question. The return/rank data and the inputs at the same time period are non-overlapping. This means that X1:X70 would have been available to make predictions for this period, as you are predicting the rank for the second half of 2017 (i.e. the six months following the end of 2017_1).
With the current train/test split, you are unable to leak future data into the training set when predicting for 2017_1. However, you need to be more careful for previous periods. For example, when predicting the future 6-month ranks for 2007_1, you should not train on observations from any period after 2006_2. This means the rank+input observations from 2007_1 are also off limits when making predictions for 2007_1. In the random forest notebook, you will see that for predicting time_period == time, the training set included all data up to but not including time.
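As a rough illustration of that rule, the sketch below builds a training set from every period strictly before the one being predicted. It is not the starter-kit code: the helper names period_index and walk_forward_split are hypothetical, the toy values are made up, and it assumes labels always follow the YYYY_H pattern. It also ignores the Train flag used in the notebook.

```python
import pandas as pd


def period_index(label: str) -> int:
    """Convert a label such as '2007_1' to an integer that sorts chronologically."""
    year, half = label.split("_")
    return int(year) * 2 + (int(half) - 1)


def walk_forward_split(df: pd.DataFrame, target_period: str,
                       period_col: str = "time_period"):
    """Return (train, test) where train holds only periods before target_period."""
    idx = df[period_col].map(period_index)
    target = period_index(target_period)
    train = df[idx < target]    # strictly earlier periods only
    test = df[idx == target]    # the period being predicted
    return train, test


# Toy data for illustration; the real dataset has X1..X70 style columns.
toy = pd.DataFrame({
    "time_period": ["2006_1", "2006_2", "2007_1", "2007_2"],
    "X1_avg": [0.1, 0.2, 0.3, 0.4],
})

train, test = walk_forward_split(toy, "2007_1")
print(train["time_period"].tolist())
print(test["time_period"].tolist())
```

Using a strict comparison is the design choice that matters here: the rank+input rows for the target period never reach the training set, and neither does anything later.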
Thanks for your questions. This is the joy of time series data and the financial markets! :) Let me know if further clarification is needed.