Interaction detection is a useful tool for analyzing the structure of the response function. Some features are more important than others, and some of those important features are involved in complex non-additive effects. The TreeExtra package contains tools for detecting and visualizing the effects of single important features together with their joint interaction effects. To better understand how exactly our algorithms work, see the following two papers.
Daria Sorokina, Rich Caruana, Mirek Riedewald, Daniel Fink.
Detecting Statistical Interactions with Additive Groves of Trees.
In Proceedings of the 25th International Conference on Machine Learning (ICML'08).
Daria Sorokina, Rich Caruana, Mirek Riedewald, Wes Hochachka, Steve Kelling.
Detecting and Interpreting Variable Interactions in Observational Ornithology Data.
To appear in the International Workshop on Domain Driven Data Mining (DDDM'09).
Now we are going to explain how to perform the analysis of feature effects and interactions step by step.
All steps described below should be performed with the same fixed training and validation data sets. For many data sets, setting aside 1/5 of the data as a validation set works well.
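A fixed split like the one described above can be prepared once and reused for every step. The following is a minimal sketch, assuming the data is available as a list of rows; the function name `split_rows`, the seed, and the toy rows are illustrative and not part of TreeExtra.

```python
import random

def split_rows(rows, validation_fraction=0.2, seed=1):
    """Shuffle rows once with a fixed seed and set aside a validation set.

    validation_fraction=0.2 mirrors the "1/5 of the data" suggestion.
    """
    rng = random.Random(seed)      # fixed seed, so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_valid = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]

# Toy rows standing in for the lines of a real data file.
rows = [f"row{i}" for i in range(100)]
train, valid = split_rows(rows)
print(len(train), len(valid))
```

Because the seed is fixed, calling the function again returns the same split, which is what the later steps rely on.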
If your data has hundreds or thousands of features, we recommend running bagging with feature evaluation first. Set the -k parameter to a reasonable number of features: 50 can be a good number for complex data sets, while 20 is often enough for simple or noisy ones. An attribute file corresponding to the reduced version of the data set will be generated automatically; use this file in the following two steps.
For descriptive analysis we are going to use a simplified version of Additive Groves. In this version, a complex grove is always built from a grove with the same number of smaller trees. (The exact algorithm is described in Section 2.2 of the Additive Groves paper.)
Train the standard grid of Additive Groves models to determine the best combination of the algorithm's parameters for your data set. To force Additive Groves to be trained in the layered mode, call ag_train with the argument -s layered. (See the AG manual for detailed training instructions.)
In the layered mode, the best model parameters reported in the output correspond to the best model for interaction detection, not to the model with the best performance. Use those recommended values of α and N in the next steps.
The next step is feature selection by backward elimination. It leaves you with a small set of crucially important features: the performance drops significantly if you remove any one of them. The ag_fs command implements backward elimination wrapped around Layered Groves. Call ag_fs with the -a, -n and -b parameters set to the values recommended by Layered Groves in the previous step. As part of its output, ag_fs creates a new attribute file and a new model file that can be used for interaction detection. It also creates an effect visualization file for every important feature. The Visualization manual explains how to interpret the data in those files.
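The general idea of backward elimination can be sketched in a few lines. This is a generic illustration, not the ag_fs implementation: the `score` callback, the `tolerance` threshold, and the toy weights are all hypothetical, and the sketch treats a higher score as better (for an error metric such as rms the comparison would be reversed).

```python
def backward_elimination(features, score, tolerance):
    """Drop features one at a time while the loss in score stays within tolerance."""
    kept = list(features)
    changed = True
    while changed:
        changed = False
        baseline = score(kept)
        for f in list(kept):
            reduced = [g for g in kept if g != f]
            # A feature is dispensable if removing it barely changes the score.
            if baseline - score(reduced) <= tolerance:
                kept = reduced
                changed = True
                break          # re-evaluate the baseline on the smaller set
    return kept

# Toy stand-in for model performance: each feature contributes a fixed amount.
weights = {"a": 5.0, "b": 0.01, "c": 3.0, "d": 0.0}

def toy_score(feats):
    return sum(weights[f] for f in feats)

selected = backward_elimination(weights, toy_score, tolerance=0.1)
print(selected)  # ['a', 'c']
```

Every feature that survives is one whose removal costs more than the tolerance, which matches the description of the remaining set as crucially important.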
There are two commands in the package that you can use to run interaction tests. ag_interactions runs all possible pairwise tests between important features. Run it using the information (attribute file, mean and standard deviation of performance) produced by the last run of ag_fs. The log output shows the results of all tests. In addition, ag_interactions creates partial dependence files for each detected interaction. You can use these files to visualize the joint effect of the two features involved.
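The hand-off from ag_fs to ag_interactions can be made explicit by assembling the command line programmatically. The sketch below only builds the argument list and does not run anything. It assumes the flags listed in the argument table that contains -ave and -std belong to ag_interactions; all file names and numeric values are placeholders, not real output of ag_fs.

```python
import shlex

def ag_interactions_command(train, valid, attr, alpha, n_trees, bagging,
                            mean_perf, std_perf, model="model.bin"):
    """Build an argument list for ag_interactions from the results of ag_fs."""
    options = [
        ("-t", train), ("-v", valid), ("-r", attr),
        ("-a", alpha), ("-n", n_trees), ("-b", bagging),
        ("-ave", mean_perf), ("-std", std_perf),   # taken from the ag_fs output
        ("-m", model),
    ]
    cmd = ["ag_interactions"]
    for flag, value in options:
        cmd += [flag, str(value)]
    return cmd

# Placeholder values: substitute the attribute file, mean and std from your own ag_fs run.
cmd = ag_interactions_command("train.dta", "valid.dta", "selected.attr",
                              0.05, 8, 100, 0.31, 0.004)
print(shlex.join(cmd))
```

Keeping the arguments in a list rather than a single string makes it straightforward to pass them to `subprocess.run` later without quoting problems.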
If the detected pairwise interactions form a clique of 3 or more variables, it is a good idea to test these variables for a higher-order interaction. The ag_nway command performs such a test. To test for an n-way interaction, provide it with a file listing the n variables, one per line. ag_nway also allows you to save a restricted model.
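Finding cliques among the detected pairs is easy to automate when the number of important features is small. The following is a brute-force sketch under that assumption; the feature names and detected pairs are made up, and `cliques` is a hypothetical helper, not part of TreeExtra. The last lines show the one-variable-per-line layout of the file passed to ag_nway.

```python
from itertools import combinations

def cliques(pairs, min_size=3):
    """Return maximal groups of variables in which every pair interacts."""
    edges = {frozenset(p) for p in pairs}
    nodes = sorted({v for p in pairs for v in p})
    found = []
    # Try larger groups first so that sub-cliques of a found clique are skipped.
    for size in range(len(nodes), min_size - 1, -1):
        for group in combinations(nodes, size):
            if all(frozenset(e) in edges for e in combinations(group, 2)):
                if not any(set(group) <= set(big) for big in found):
                    found.append(group)
    return found

# Made-up pairwise interactions, as they might be read from a log.
detected = [("x1", "x2"), ("x1", "x3"), ("x2", "x3"), ("x3", "x4")]
groups = cliques(detected)
print(groups)  # [('x1', 'x2', 'x3')]

# Contents of the interaction file for the first clique: one variable per line.
file_text = "\n".join(groups[0]) + "\n"
```

The exhaustive search grows exponentially with the number of variables, which is acceptable here because feature selection has already reduced the set to a handful of features.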
If there is a lot of variance in the data, the results of the above analysis can be unstable: the exact sets of detected important features and interactions will depend on the train/validation split, the random seed, the order of tests, and so on. In this case we recommend performing the whole process several times, using different splits of the data into training and validation sets.
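One simple way to summarize several repetitions is to count how often each interaction is detected. The sketch below assumes the detected pairs from each run have already been collected by hand or parsed from the logs; the runs shown are toy data, and `interaction_frequency` is a hypothetical helper.

```python
from collections import Counter

def interaction_frequency(runs):
    """Fraction of runs in which each unordered pair of features was detected."""
    counts = Counter()
    for detected in runs:
        # frozenset makes ("x1", "x2") and ("x2", "x1") count as the same pair.
        counts.update({frozenset(pair) for pair in detected})
    return {tuple(sorted(pair)): n / len(runs) for pair, n in counts.items()}

# Toy results from three repetitions with different data splits.
runs = [
    [("x1", "x2"), ("x2", "x3")],
    [("x2", "x1")],
    [("x1", "x2"), ("x3", "x4")],
]
freq = interaction_frequency(runs)
for pair, share in sorted(freq.items()):
    print(pair, round(share, 2))
```

Interactions that appear in every run are the ones most likely to reflect real structure in the data rather than an artifact of one particular split.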
ag_fs arguments:
option | argument | description | default value |
-t | _train_set_ | training set file name | |
-v | _validation_set_ | validation set file name | |
-r | _attr_file_ | attribute file name | |
-a | _alpha_value_ | parameter that controls size of tree | |
-n | _N_value_ | number of trees in a Grove | |
-b | _bagging_iterations_ | number of bagging iterations | |
-m | _model_file_name_ | name of the output file for the model | model.bin |
-i | _init_random_ | init value for random number generator | 1 |
-c | rms|roc | performance metric | rms |
-h | _threads_ | number of threads, linux version only | 6 |
Output:
ag_interactions arguments:
option | argument | description | default value |
-t | _train_set_ | training set file name | |
-v | _validation_set_ | validation set file name | |
-r | _attr_file_ | attribute file name | |
-a | _alpha_value_ | parameter that controls size of tree | |
-n | _N_value_ | number of trees in a Grove | |
-b | _bagging_iterations_ | number of bagging iterations | |
-ave | _mean_performance_ | mean performance of an unrestricted model | |
-std | _std_of_performance_ | std of an unrestricted model's performance | |
-m | _model_file_name_ | input file containing the unrestricted model | model.bin |
-i | _init_random_ | init value for random number generator | 1 |
-c | rms|roc | performance metric | rms |
-h | _threads_ | number of threads, linux version only | 6 |
Output:
ag_nway arguments:
option | argument | description | default value |
-t | _train_set_ | training set file name | |
-v | _validation_set_ | validation set file name | |
-r | _attr_file_ | attribute file name | |
-a | _alpha_value_ | parameter that controls size of tree | |
-n | _N_value_ | number of trees in a Grove | |
-b | _bagging_iterations_ | number of bagging iterations | |
-ave | _mean_performance_ | mean performance of an unrestricted model | |
-std | _std_of_performance_ | std of an unrestricted model's performance | |
-w | _interaction_file_ | list of n features to test for an n-way interaction | |
-m | _model_file_name_ | output file name for the restricted model | restricted_model.bin |
-i | _init_random_ | init value for random number generator | 1 |
-c | rms|roc | performance metric | rms |
-h | _threads_ | number of threads, linux version only | 6 |
Output: