Interaction Detection with Additive Groves

Overview

Interaction detection is a useful tool for analyzing the structure of the response function. Some features are more important than others, and some of those important features are involved in complex non-additive effects. The TreeExtra package contains tools that let you detect and visualize the effects of single important features together with their joint interaction effects. To better understand how the algorithms work, see the following two papers.

Daria Sorokina, Rich Caruana, Mirek Riedewald, Daniel Fink.
Detecting Statistical Interactions with Additive Groves of Trees.
In Proceedings of the 25th International Conference on Machine Learning (ICML'08).

Daria Sorokina, Rich Caruana, Mirek Riedewald, Wes Hochachka, Steve Kelling.
Detecting and Interpreting Variable Interactions in Observational Ornithology Data.
To appear in the International Workshop on Domain Driven Data Mining (DDDM'09).

The following sections explain, step by step, how to analyze feature effects and interactions.

Preparing the data

All steps described below should be performed with the same fixed training and validation data sets. For many data sets, setting aside 1/5 of the data as a validation set works well.
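As an illustration of the split described above, here is a minimal Python sketch that shuffles the rows once with a fixed seed and sets aside 1/5 of them for validation. The function name and the in-memory list of rows are assumptions made for this example; TreeExtra itself only needs the two resulting files.

```python
import random

def split_train_validation(rows, validation_fraction=0.2, seed=1):
    """Shuffle rows once with a fixed seed and hold out a fraction for validation."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_validation = int(round(len(shuffled) * validation_fraction))
    # The first n_validation shuffled rows become the validation set.
    return shuffled[n_validation:], shuffled[:n_validation]

# Placeholder data: 100 dummy rows standing in for the lines of a data file.
rows = [f"row{i}" for i in range(100)]
train, validation = split_train_validation(rows)
print(len(train), len(validation))
```

Because the seed is fixed, rerunning the split yields the same two sets, which keeps every later step on identical data.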

Fast feature selection

If your data has hundreds or thousands of features, we recommend running bagging with feature evaluation first. Set the -k parameter to a reasonable number of features: 50 can be a good choice for complex data sets, and 20 is often enough for simple or noisy ones. An attribute file corresponding to the reduced version of the data set is generated automatically; use this file in the following two steps.

Layered Additive Groves

For descriptive analysis we use a simplified version of Additive Groves. In this version, a complex grove is always built from a grove with the same number of smaller trees. (The exact algorithm is described in Section 2.2 of the Additive Groves paper.)

Train the standard grid of Additive Groves models to determine the best combination of the algorithm's parameters for your data set. To force Additive Groves to train in layered mode, call ag_train with the argument -s layered. (See the AG manual for detailed training instructions.)

In layered mode, the best model parameters reported in the output correspond to the best model for interaction detection, not to the model with the best performance. Use those recommended values of α and N in the next steps.

Thorough feature selection

The next step is feature selection by backward elimination. It leaves you with a small set of crucially important features: the performance drops significantly if you remove any one of them. The ag_fs command implements backward elimination wrapped around Layered Groves. Call ag_fs with the -a, -n and -b parameters set to the values recommended by Layered Groves in the previous step. As part of its output, ag_fs creates a new attribute file and a new model file that can be used for interaction detection. It also creates an effect visualization file for every important feature. The visualization manual explains how to interpret the data in those files.
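The call described above can be assembled programmatically. The following Python sketch builds an ag_fs argument list using only the options listed in the command specification of this manual. The file names and the values of alpha, N and the number of bagging iterations are placeholders, not recommendations; substitute the values reported by your own Layered Groves run.

```python
import shlex

def build_ag_fs_command(train, validation, attr, alpha, n_trees, bagging,
                        model=None, seed=None, metric=None):
    """Assemble the argument list for an ag_fs run (required options first)."""
    cmd = ["ag_fs", "-t", train, "-v", validation, "-r", attr,
           "-a", str(alpha), "-n", str(n_trees), "-b", str(bagging)]
    # Optional arguments are added only when given, so ag_fs defaults apply otherwise.
    if model is not None:
        cmd += ["-m", model]
    if seed is not None:
        cmd += ["-i", str(seed)]
    if metric is not None:
        cmd += ["-c", metric]
    return cmd

# Hypothetical file names and parameter values, for illustration only.
cmd = build_ag_fs_command("data.train", "data.valid", "data.attr",
                          alpha=0.01, n_trees=8, bagging=60, metric="rms")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run if the ag_fs executable is on your PATH.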

Interaction detection

The package provides two commands for running interaction tests. ag_interactions runs all possible pairwise tests between the important features. Run it using the information (attribute file, mean and std of performance) produced by the last run of ag_fs. The log output shows the results of all tests. ag_interactions also creates partial dependence files for each detected interaction. You can use these files to visualize the joint effect of the two features involved.
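To show how the ag_fs results feed into the pairwise tests, here is a hedged Python sketch that assembles an ag_interactions argument list. The options come from the command specification in this manual; the file names and the mean and std values are placeholders that you would copy from your own ag_fs log.

```python
import shlex

def build_ag_interactions_command(train, validation, attr, alpha, n_trees,
                                  bagging, mean_perf, std_perf,
                                  model="model.bin"):
    """Assemble the argument list for ag_interactions from ag_fs results."""
    return [
        "ag_interactions",
        "-t", train, "-v", validation, "-r", attr,
        "-a", str(alpha), "-n", str(n_trees), "-b", str(bagging),
        # Mean and std of the unrestricted model's performance, as reported by ag_fs.
        "-ave", str(mean_perf), "-std", str(std_perf),
        # Here -m names the input file holding the unrestricted model.
        "-m", model,
    ]

# Hypothetical values, for illustration only.
interactions_cmd = build_ag_interactions_command(
    "data.train", "data.valid", "data.fs.attr",
    alpha=0.01, n_trees=8, bagging=60, mean_perf=0.35, std_perf=0.004)
print(shlex.join(interactions_cmd))
```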

If the detected pairwise interactions form a clique of three or more variables, it is a good idea to test these variables for a higher-order interaction. The ag_nway command performs such a test. To test for an n-way interaction, provide it with a file listing the n variables, one per line. ag_nway also allows you to save a restricted model.
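One way to find candidate variable sets for ag_nway is to search the detected pairs for cliques. The Python sketch below uses a brute-force search, which is adequate for the small feature sets that remain after backward elimination; it is not part of TreeExtra, and the feature names are hypothetical. The second function produces the text of an interaction file with one variable per line.

```python
from itertools import combinations

def find_cliques(pairs, min_size=3):
    """Return maximal groups of features in which every pair interacts."""
    edges = {frozenset(p) for p in pairs}
    nodes = sorted({f for p in pairs for f in p})
    cliques = []
    # Search from the largest group size down so larger cliques are found first.
    for size in range(len(nodes), min_size - 1, -1):
        for group in combinations(nodes, size):
            if all(frozenset(e) in edges for e in combinations(group, 2)):
                # Skip groups already contained in a larger clique.
                if not any(set(group) <= set(c) for c in cliques):
                    cliques.append(group)
    return cliques

def nway_file_text(features):
    """Format a list of variables as an interaction file: one name per line."""
    return "\n".join(features) + "\n"

# Hypothetical detected pairwise interactions.
detected = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
cliques = find_cliques(detected)
print(cliques)
print(nway_file_text(cliques[0]), end="")
```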

Variation in the results

If there is a lot of variance in the data, the results of the above analysis can be unstable: the exact sets of detected important features and interactions will depend on the train/validation split, the random seed, the order of tests, and so on. In this case we recommend repeating the whole process several times, each time splitting the data into training and validation sets differently.
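When the process is repeated over several splits, a simple tally shows which features are selected consistently. The following Python sketch computes, for each feature, the fraction of runs in which it was selected. The lists of selected features are made-up placeholders; in practice they would come from the log output of each ag_fs run.

```python
from collections import Counter

def selection_frequency(runs):
    """Fraction of runs in which each feature was selected."""
    # set() guards against a feature being listed twice within one run.
    counts = Counter(f for selected in runs for f in set(selected))
    return {f: counts[f] / len(runs) for f in sorted(counts)}

# Placeholder results of three hypothetical runs on different splits.
runs = [
    ["elev", "temp", "month"],
    ["elev", "temp"],
    ["elev", "month", "lat"],
]
freq = selection_frequency(runs)
for feature, share in freq.items():
    print(f"{feature}: {share:.2f}")
```

Features selected in every run are the most trustworthy; those that appear only occasionally deserve more caution. The same tally can be applied to detected interactions by counting pairs instead of single features.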

Commands specification

ag_fs -t _train_set_ -v _validation_set_ -r _attr_file_ -a _alpha_value_ -n _N_value_ -b _bagging_iterations_ [-m _model_file_name_] [-i _init_random_] [-c rms|roc] [-h _threads_] | -version

option	argument	description	default value
-t	_train_set_	training set file name
-v	_validation_set_	validation set file name
-r	_attr_file_	attribute file name
-a	_alpha_value_	parameter that controls the size of the trees
-n	_N_value_	number of trees in a Grove
-b	_bagging_iterations_	number of bagging iterations
-m	_model_file_name_	name of the output file for the model	model.bin
-i	_init_random_	init value for random number generator	1
-c	rms|roc	performance metric	rms
-h	_threads_	number of threads (Linux version only)	6

Output:

Performs feature selection by backward elimination using layered training of Additive Groves. Shows the process in the log output.
Produces a set of important features. Lists them at the end of the log output.
Estimates mean and standard deviation of the final model performance.
Saves the final model into the specified file.
Saves the attribute file for the final model. The name of the new file has a suffix ".fs" before the file extension.
Creates an effect visualization source file for every important feature. These files have the extension ".effect.txt".
Outputs a file named core_features.txt, containing the list of selected features sorted by importance.
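The naming rule for the new attribute file can be sketched in Python as follows. This is an interpretation of the rule stated above (".fs" inserted before the file extension); how ag_fs handles names with no extension or with several dots is an assumption here, and the file name is hypothetical.

```python
import os

def fs_attr_name(attr_file):
    """Insert ".fs" before the extension of an attribute file name."""
    root, ext = os.path.splitext(attr_file)
    return f"{root}.fs{ext}"

# Hypothetical attribute file name.
new_attr = fs_attr_name("data.attr")
print(new_attr)
```

The derived name is the one to pass as the attribute file in the interaction detection step.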

ag_interactions -t _train_set_ -v _validation_set_ -r _attr_file_ -a _alpha_value_ -n _N_value_ -b _bagging_iterations_ -ave _mean_performance_ -std _std_of_performance_ [-m _model_file_name_] [-i _init_random_] [-c rms|roc] [-h _threads_] | -version

option	argument	description	default value
-t	_train_set_	training set file name
-v	_validation_set_	validation set file name
-r	_attr_file_	attribute file name
-a	_alpha_value_	parameter that controls the size of the trees
-n	_N_value_	number of trees in a Grove
-b	_bagging_iterations_	number of bagging iterations
-ave	_mean_performance_	mean performance of an unrestricted model
-std	_std_of_performance_	std of an unrestricted model's performance
-m	_model_file_name_	input file containing the unrestricted model	model.bin
-i	_init_random_	init value for random number generator	1
-c	rms|roc	performance metric	rms
-h	_threads_	number of threads (Linux version only)	6

Output:

Performs all possible 2-way interaction tests using layered training of Additive Groves. Shows the process and the results in the log output.
Creates a joint effect visualization source file for every pair of interacting features. These files have the extension ".iplot.txt".
Creates a joint density distribution file for every pair of interacting features. These files have the extension ".iplot.dens.txt".
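Since every pair of important features is tested, the number of tests grows quadratically: n features give n(n-1)/2 pairs. The short Python sketch below enumerates the pairs for a hypothetical feature list; it only illustrates the count and does not reproduce the order in which ag_interactions runs its tests.

```python
from itertools import combinations

def pairwise_tests(features):
    """List every unordered pair of features."""
    return list(combinations(features, 2))

# Hypothetical feature names.
features = ["f1", "f2", "f3", "f4"]
pairs = pairwise_tests(features)
print(len(pairs), pairs)
```

This is one reason to keep the feature set small before interaction detection: 10 features require 45 tests, while 50 features would require 1225.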

ag_nway -t _train_set_ -v _validation_set_ -r _attr_file_ -a _alpha_value_ -n _N_value_ -b _bagging_iterations_ -ave _mean_performance_ -std _std_of_performance_ -w _interaction_file_ [-m _model_file_name_] [-i _init_random_] [-c rms|roc] [-h _threads_] | -version

option	argument	description	default value
-t	_train_set_	training set file name
-v	_validation_set_	validation set file name
-r	_attr_file_	attribute file name
-a	_alpha_value_	parameter that controls the size of the trees
-n	_N_value_	number of trees in a Grove
-b	_bagging_iterations_	number of bagging iterations
-ave	_mean_performance_	mean performance of an unrestricted model
-std	_std_of_performance_	std of an unrestricted model's performance
-w	_interaction_file_	list of n features to test for an n-way interaction
-m	_model_file_name_	output file name for the restricted model	restricted_model.bin
-i	_init_random_	init value for random number generator	1
-c	rms|roc	performance metric	rms
-h	_threads_	number of threads (Linux version only)	6

Output:

Performs an n-way interaction test using layered training of Additive Groves. Shows the results in the log output.
Saves a restricted model.