Bagged Trees with Feature Evaluation
This code provides tools for building bagged tree models with an extremely fast
built-in feature evaluation technique. It is recommended as a preprocessing tool
for Additive Groves when the original data set contains a large number of features.
It can also be used to build bagged tree models in their own right.
Bagging
Bagging was invented by Leo Breiman [1]. This ensemble technique decreases the variance
of the original single model by building every new model on a bootstrap sample of the
training set and averaging the predictions of those models. In this tool bagging is
applied to decision trees.
In this implementation the size of the trees can be controlled through the input parameter α.
α bounds the maximum percentage of training data in a leaf*, so in some sense it is
inversely related to the size of the tree: α = 1 produces a stump, α = 0 a full tree.
The following values of α can be used in training:
1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, ..., 0.
The best strategy is to build several ensembles with different values of α and compare
their performance on the validation set.
Root mean squared error is used both as the splitting criterion and as the performance measure,
therefore this tool can be used for both binary classification and regression data sets.
* In versions 2.3 and up the maximum size of a leaf is defined by both α and the height of the branch. This way lower nodes are less likely to be split than higher nodes, and the resulting trees are more balanced. If you are interested in the exact algorithm, read the code or contact Daria.
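For illustration only, the following Python sketch shows the general idea of bagged trees with an α-like limit on leaf size. This is not the TreeExtra implementation: it assumes numpy and scikit-learn, maps α onto scikit-learn's min_samples_split parameter as an approximation, and ignores the branch-height adjustment described in the footnote above.

# Minimal sketch of bagged regression trees with an alpha-like leaf-size limit.
# Not the TreeExtra implementation; the mapping of alpha onto scikit-learn's
# min_samples_split is an illustrative assumption.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_bagged_trees(X, y, alpha=0.0, n_bags=100, seed=1):
    rng = np.random.default_rng(seed)
    n = len(y)
    # A node with more than roughly alpha * n points is always considered for a
    # split, so leaves hold at most about alpha * n points (alpha = 0 -> full tree).
    min_split = max(2, int(np.ceil(alpha * n)))
    trees = []
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)            # bootstrap sample of the training set
        tree = DecisionTreeRegressor(min_samples_split=min_split)
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_bagged(trees, X):
    # Bagging prediction: average the predictions of the individual trees.
    return np.mean([t.predict(X) for t in trees], axis=0)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Tuning strategy from the text: build ensembles for several alpha values and
# compare their RMSE on a separate validation set.
# for alpha in (1, 0.5, 0.2, 0.1, 0.05, 0.02, 0.01, 0.005, 0):
#     trees = train_bagged_trees(X_train, y_train, alpha=alpha)
#     print(alpha, rmse(y_valid, predict_bagged(trees, X_valid)))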
[1] Leo Breiman. Bagging Predictors. Machine Learning 24(2), 1996.
Feature evaluation technique: multiple counts
Most feature selection techniques require repeated training of models on different
combinations of features. When the number of features in the data set is large,
such an approach can be infeasible in practice.
In [2] we suggested a feature evaluation technique referred to as multiple counts.
It ranks the features based solely on how they are used by a single ensemble of
bagged trees: every feature is scored by the number of data points in the nodes split on that feature.
Empirical results show that this technique produces a ranking very similar to that of a more
expensive sensitivity analysis method commonly used for this purpose.
Starting with TreeExtra 2.4, this score is also normalized by the feature entropy to ensure comparability of features with different numbers of values.
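The sketch below illustrates the scoring idea on scikit-learn trees (for example, trees built as in the sketch above). It is only an illustration: bt_train computes these scores internally, and the exact entropy normalization used in TreeExtra 2.4+ may differ from the simple version shown here.

# Sketch of the multiple counts score: sum, over all internal nodes split on a
# feature, the number of training points reaching that node; then normalize by
# the feature's entropy. Illustrative only, not the TreeExtra code.
import numpy as np

def multiple_counts(trees, n_features):
    scores = np.zeros(n_features)
    for tree in trees:
        t = tree.tree_
        for node in range(t.node_count):
            f = t.feature[node]
            if f >= 0:                                  # internal node split on feature f
                scores[f] += t.n_node_samples[node]     # data points present in that node
    return scores

def feature_entropy(column):
    # Entropy of a feature's empirical value distribution (in bits).
    _, counts = np.unique(column, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def normalized_scores(trees, X):
    raw = multiple_counts(trees, X.shape[1])
    ent = np.array([feature_entropy(X[:, j]) for j in range(X.shape[1])])
    return raw / np.maximum(ent, 1e-12)                 # guard against zero entropy

# np.argsort(-normalized_scores(trees, X_train)) orders features from most to
# least important; keeping the top k gives a fast feature selection step.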
The bt_train command implements bagging with feature evaluation.
The -k argument specifies how many of the top-ranked features should be
provided in the output: k = -1 means that all features should be ranked,
k = 0 means that no ranking is needed. If you want to use this feature evaluation as fast feature
selection, bt_train also generates a new attribute file in which only the top
k features are left active.
Feature evaluation gives similar scores to correlated features. To weed out such correlations, one can use the list of Spearman's correlation scores that is also generated during training.
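As a sketch of how such scores can be used, the snippet below greedily drops features that are strongly Spearman-correlated with a higher-ranked feature. The 0.95 threshold and the greedy rule are illustrative assumptions (bt_train itself only writes the scores to correlations.txt); it assumes numpy and scipy.

# Sketch: prune features that duplicate a higher-ranked feature according to
# Spearman correlation. Threshold and rule are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr

def prune_correlated(X, ranked_features, threshold=0.95):
    kept = []
    for f in ranked_features:                  # features ordered by decreasing score
        duplicate = False
        for g in kept:
            rho, _ = spearmanr(X[:, f], X[:, g])
            if abs(rho) >= threshold:
                duplicate = True
                break
        if not duplicate:
            kept.append(f)
    return kept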
[2] R. Caruana, M. Elhawary, A. Munson, M. Riedewald, D. Sorokina, D. Fink, W. Hochachka, S. Kelling. Mining Citizen Science Data to Predict Prevalence of Wild Bird Species. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006.
Train, test and validation sets
It is highly recommended to use a separate validation set for tuning α. However,
a validation set is not directly required for training: you still need to provide some
data as a validation set to the train command, but it is used solely for analyzing
whether the bagging curve has converged well.
Command specifications
bt_train -t _train_set_ -v _validation_set_ -r _attr_file_ [-a _alpha_value_] [-b _bagging_iterations_]
[-i _init_random_] [-m _model_file_name_] [-k _attributes_to_output_] [-o _output_file_name_] [-l log|nolog] [-c rms|roc] [-h _threads_] | -version
option | argument | description | default value
-t | _train_set_ | training set file name |
-v | _validation_set_ | validation set file name |
-r | _attr_file_ | attribute file name |
-a | _alpha_value_ | parameter that controls max size of tree | 0
-b | _bagging_iterations_ | number of bagging iterations | 100
-i | _init_random_ | init value for random number generator | 1
-m | _model_file_name_ | name of the output file for the model | model.bin
-k | _attributes_to_output_ | number of ranked features to output (-1 = all) | 0 (no feature selection)
-o | _output_file_name_ | name of the output file with the prediction scores on the validation data | preds.txt
-l | log|nolog | amount of log output to stdout | log
-c | rms|roc | performance metric used in the output | rms
-h | _threads_ | number of threads, linux version only | 6
Output:
- Saves the resulting model into the specified file.
- Outputs bagging curve on validation set into bagging_rms.txt (and bagging_roc.txt, if applicable).
- Outputs list of k top ranked features with their scores and column numbers into feature_scores.txt. Set k to -1 to rank all features.
- Saves the attribute file with only top k features as active. The name of the new file has a suffix
".fs" before the file extension.
- Predictions are saved into a specified output file, one prediction value per line.
- Training log is saved in log.txt file. If an old log.txt file already exists in the working directory, its contents are appended to logs.archive.txt
- If the log flag is on, full log is shown in the standard output. If the log flag is off, standard output shows performance on the validation set only.
- If the active attributes don't have missing values, correlations.txt provides a list of Spearman's correlation scores on the training set.
- missing.txt contains a list of active attributes with missing values in the training set. It is created only when missing values are present.
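For example, the α search described earlier could be scripted as follows. This is only a sketch: the file names (train.txt, valid.txt, data.attr, bt_a*.bin) are hypothetical, and only flags documented in the table above are used.

# Sketch: scripting the alpha search with bt_train.
import subprocess

for alpha in ("1", "0.5", "0.2", "0.1", "0.05", "0.02", "0.01", "0.005", "0"):
    subprocess.run(
        ["bt_train",
         "-t", "train.txt",                 # training set
         "-v", "valid.txt",                 # validation set (used for the bagging curve)
         "-r", "data.attr",                 # attribute file
         "-a", alpha,                       # max leaf size parameter
         "-b", "100",                       # bagging iterations
         "-k", "20",                        # output the 20 top-ranked features
         "-m", "bt_a" + alpha + ".bin",     # keep a separate model per alpha value
         "-c", "rms"],
        check=True)
# Compare the validation performance reported for each run (standard output /
# log.txt) and keep the model with the best validation RMSE.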
bt_predict -p _test_set_ -r _attr_file_ [-m _model_file_name_] [-o _output_file_name_] [-l log|nolog] [-c rms|roc] | -version
option | argument | description | default value
-p | _test_set_ | cases that need predictions |
-r | _attr_file_ | attribute file name |
-m | _model_file_name_ | name of the input file containing the model | model.bin
-o | _output_file_name_ | name of the output file for predictions | preds.txt
-l | log|nolog | amount of log output to stdout | log
-c | rms|roc | performance metric used in the output | rms
Output:
- Predictions are saved into a specified output file, one prediction value per line.
- If the true values are specified in the test file, performance on the test set is saved to the log.
- If the log flag is on, the full log is shown in the standard output. If the log flag is off, standard output shows only performance on the test set.
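A prediction run can be scripted in the same way. Again, the file names (test.txt, data.attr, bt_a0.1.bin, test_preds.txt) are hypothetical and only documented bt_predict flags are used.

# Sketch: producing predictions with a previously saved model.
import subprocess

subprocess.run(
    ["bt_predict",
     "-p", "test.txt",            # test cases
     "-r", "data.attr",           # attribute file
     "-m", "bt_a0.1.bin",         # model saved by bt_train
     "-o", "test_preds.txt",      # one prediction value per line
     "-c", "rms"],
    check=True)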