If you want to make a simple test run of Additive Groves on your data without reading the whole manual, see the quick start section.
Additive Groves is a supervised learning algorithm that consistently shows high performance on regression and classification problems. It is based on regression trees, additive models and bagging, and is capable of both fitting the additive structure of the problem and modelling its highly nonlinear components with very large trees at the same time. Since it is based on bagging, it does not overfit when the number of iterations is increased. The combination of these properties makes Additive Groves superior in performance to other existing tree ensemble methods such as bagging, boosting and Random Forests.
Additive Groves consists of bagged additive models, where every single model is a tree. The size of a single Grove (a single additive model) is defined by two parameters: the size of the trees and the number of trees. Each Grove of non-trivial size is iteratively built from smaller Groves, so on every bagging iteration a set of Groves of different sizes is built. A single Grove consisting of large trees can and will overfit heavily to the training set, so bagging is applied on top in order to reduce variance. See [1] for details of the algorithm.
To ensure good performance of the algorithm, it is important to choose the best values for all three parameters: α (which controls the size of the trees), N (the number of trees in a Grove) and the number of bagging iterations. Good values of α and N vary significantly between data sets, so no predefined values of these parameters can be recommended. Estimating these parameters on a validation data set - a separate part of the training data, not used for fitting the models - is therefore crucial for Additive Groves.

In the current implementation the process of finding the best parameter values is built in: models of different sizes that are naturally produced during the training phase are evaluated on the validation set, and the best combination of parameter values is reported along with recommendations on whether more models, or models of larger sizes, should be built for this data set. Note that if you want a fair estimate of how well the algorithm performs, you cannot use the performance on the validation set measured during training - instead you need to save the best model, produce predictions for a separate test set and evaluate the final performance based on those predictions.
This means the user can choose not to deal with the parameters at all and still get the best-performing model. To run the whole sequence automatically, check out the Python script written by Alex Sorokin.
After the model is saved, it can be used by the ag_predict command to produce predictions on new data.
Commands ag_expand and ag_save rely on the information saved by previous runs of ag_train and ag_expand in temporary files, so this sequence of commands should always be run in the same directory. Temporary files are stored in the directory AGTemp, which is created by ag_train. Although referred to as temporary, these files are never deleted by any of the commands, so that you can run ag_save and ag_expand at any point in time. Once the model is saved into a separate file, it no longer relies on the temporary files and can be moved to other directories, where it can be used for producing predictions.
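As an illustration, a typical session, run entirely in one directory, might look like the following sketch. The file names train.txt, valid.txt, test.txt, data.attr and the value 100 are hypothetical placeholders for your own data; the options themselves are documented in the reference tables below.

ag_train -t train.txt -v valid.txt -r data.attr    # creates AGTemp, reports best parameters and recommendations
ag_expand -b 100                                   # optional: expand the grid if more iterations are recommended
ag_save -m model.bin                               # save the best model found on the validation set
ag_predict -p test.txt -r data.attr -m model.bin   # the saved model no longer needs AGTemp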
As running time can be the main issue with Additive Groves, several speed modes are provided.
The slow mode follows the original algorithm exactly: two versions of each complex Grove are built and tested on every bagging iteration. The algorithm provides the best performance in the slow mode.
In the fast mode, Groves of all sizes are still built on every iteration; however, they are built faster, because the best path to build them is determined during the first bagging iteration and then reused by all subsequent iterations. Therefore each Grove is built twice only during the first iteration, and the running time of all other iterations is decreased almost by a factor of two. A model trained in the fast mode performs slightly worse than one trained in the slow mode.
In the layered mode, each Grove is trained from its "left neighbor" - a Grove with the same number of smaller trees. The running time of training in this mode is the same as in the fast mode. The models trained this way are more stable, and the layered mode is required for feature selection and interaction detection analysis. The best model parameters and the expansion recommendations produced in the layered mode correspond to the best model for interaction detection, not to the model providing the best performance on the validation set.
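For example, the training mode is chosen with the -s argument of ag_train (file names here are hypothetical):

ag_train -t train.txt -v valid.txt -r data.attr -s slow     # exact original algorithm, best performance
ag_train -t train.txt -v valid.txt -r data.attr -s layered  # stable models for feature selection / interaction detection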
To further decrease the training time, one can train several model grids in parallel in different directories (using the commands ag_train / ag_expand) and then merge them using the command ag_merge.
Note: the format of ag_merge has changed between versions 2.0 and 2.1.
Each run of ag_merge merges several model grids. They should have the same size with respect to α and N. The number of bagging iterations in the resulting model grid is the sum of the numbers of bagging iterations in the original grids. The models built on different bagging iterations need to be different, so it is very important to make sure that the original model grids are created with different random seeds. ag_merge should be called in a new directory, different from the directories where the original model grids reside. Those directories (where ag_train / ag_expand were run) are passed as input arguments. The resulting merged model grid is located in the directory where ag_merge was called. The output of ag_merge is the same as the output of ag_train / ag_expand, and the resulting model grid can be treated the same way as the output of those commands.
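A minimal sketch of this workflow is shown below; the directory names grid1, grid2 and merged, and the data file names, are hypothetical, and in practice the two training runs would be launched in parallel or on different machines. Note the distinct -i seeds, which the previous paragraph requires:

mkdir grid1 grid2 merged
(cd grid1 && ag_train -t ../train.txt -v ../valid.txt -r ../data.attr -i 1)  # random seed 1
(cd grid2 && ag_train -t ../train.txt -v ../valid.txt -r ../data.attr -i 2)  # a different seed is essential
(cd merged && ag_merge -d ../grid1 ../grid2)                                 # merged grid appears in merged/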
Starting with version 2.3, the Linux version of the TreeExtra package uses multithreading for training the trees: parallel branches are trained at the same time. The default number of threads is 6, but it can be changed using the input argument -h.
ag_train -t _train_set_ -v _validation_set_ -r _attr_file_ [-a _alpha_value_] [-n _N_value_] [-b _bagging_iterations_] [-s slow|fast|layered] [-i _init_random_] [-c rms|roc] [-h _threads_] | -version
option | argument | description | default value
-t | _train_set_ | training set file name | |
-v | _validation_set_ | validation set file name | |
-r | _attr_file_ | attribute file name | |
-a | _alpha_value_ | parameter that controls max size of tree | 0.01 |
-n | _N_value_ | max number of trees in a Grove | 8 |
-b | _bagging_iterations_ | number of bagging iterations | 60 |
-s | slow|fast|layered | training mode | fast
-i | _init_random_ | init value for random number generator | 1 |
-c | rms|roc | performance metric | rms |
-h | _threads_ | number of threads, linux version only | 6 |
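For instance, the following call (all file names and values are illustrative) trains a larger grid in the slow mode, uses ROC as the performance metric and runs 12 threads on the Linux version:

ag_train -t train.txt -v valid.txt -r data.attr -a 0.005 -n 16 -b 100 -s slow -c roc -h 12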
Output: performance of the models on the validation set, the best combination of parameter values, and recommendations on whether models of larger sizes or more bagging iterations should be built. The log output is saved in the log.txt file.
ag_expand [-a _alpha_value_] [-n _N_value_] [-b _bagging_iterations_] [-i _init_random_] [-h _threads_] | -version
option | argument | description | default value
-a | _alpha_value_ | parameter that controls max size of tree | value used in previous train/expand session |
-n | _N_value_ | max number of trees in a Grove | value used in previous train/expand session |
-b | _bagging_iterations_ | number of bagging iterations | value used in previous train/expand session |
-i | _init_random_ | init value for random number generator | 10000 + value used in previous train/expand session |
-h | _threads_ | number of threads, linux version only | 6 |
Output: same as for ag_train. The log output is appended to the log.txt file.
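As an illustration, if a previous ag_train run in the same directory recommended more bagging iterations, the grid might be expanded to 100 iterations (an illustrative value); note that, per the defaults above, the random seed is automatically shifted by 10000 relative to the previous session:

ag_expand -b 100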
ag_merge -d _dir1_ _dir2_ _dir3_ ... | -version
option | argument | description
-d | _dir1_ _dir2_ _dir3_ ... | directories where the input model grids were created
Output: same as for ag_train.
ag_save [-m _model_file_name_] [-a _alpha_value_] [-n _N_value_] [-b _bagging_iterations_] | -version
option | argument | description | default value
-m | _model_file_name_ | name of the output file for the model | model.bin |
-a | _alpha_value_ | parameter that controls max size of tree | value with best results on validation set |
-n | _N_value_ | max number of trees in a Grove | value with best results on validation set |
-b | _bagging_iterations_ | overall number of bagging iterations | value used in previous train/expand session |
Output: saves a model with given parameters in a specified file.
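For example, to override the automatic choice and save the model corresponding to specific parameter values (the values and file name below are illustrative):

ag_save -m model.bin -a 0.01 -n 6 -b 80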
ag_predict -p _test_set_ -r _attr_file_ [-m _model_file_name_] [-o _output_file_name_] [-c rms|roc] | -version
option | argument | description | default value
-p | _test_set_ | cases that need predictions | |
-r | _attr_file_ | attribute file name | |
-m | _model_file_name_ | name of the input file containing the model | model.bin |
-o | _output_file_name_ | name of the output file for predictions | preds.txt |
-c | rms|roc | performance metric | rms |
Output: predictions for the test set cases, saved in the specified output file.
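For example, to produce predictions for a (hypothetical) test file with a previously saved classification model and report ROC instead of RMS:

ag_predict -p test.txt -r data.attr -m model.bin -o preds.txt -c roc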