Once important features and interactions in the model have been detected, one may want to know what they actually look like: what effect a feature, or a pair of features, has on the response function. TreeExtra provides several tools for this kind of analysis. The primary output of these tools is text files. Example MatLab code that converts such files into graphical plots is available as well.
The vis_effect command estimates the effect of a single feature on the response. One of its input parameters is Q, the number of points to show on the effect curve. vis_effect separates all values the feature takes on the validation set into Q quantiles*. For the center of each quantile, it calculates the average prediction the model makes for that value of the feature. These average predictions form the feature effect curve.
The curve is then saved to the output file as three tab-delimited columns. The first column holds counts: how many quantiles share the same center value. The second column holds the quantile center values. The third column holds the average model prediction for each center.
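As an illustration of the file layout just described, here is a minimal Python sketch of a reader for the three columns. It is not part of TreeExtra (the bundled plotting code is MatLab/Octave); the names `EffectPoint` and `parse_effect` are made up for this example, and the sample content is invented. The description above does not say whether the file carries a header row, so the sketch simply skips any line whose first field is not numeric.

```python
from typing import List, NamedTuple


class EffectPoint(NamedTuple):
    count: int      # number of quantiles sharing this center value
    center: float   # quantile center (a value of the feature)
    average: float  # average model prediction for that value


def parse_effect(text: str) -> List[EffectPoint]:
    """Parse tab-delimited lines of the form: count, center, average."""
    points = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # blank or malformed line
        try:
            count = int(float(fields[0]))
        except ValueError:
            continue  # assumption: a non-numeric first field is a header
        points.append(EffectPoint(count, float(fields[1]), float(fields[2])))
    return points


# Invented sample content, not real TreeExtra output.
sample = "2\t0.5\t0.31\n1\t1.5\t0.42\n3\t4.0\t0.58\n"
points = parse_effect(sample)
print(sum(p.count for p in points))  # total number of quantiles represented
```

Summing the first column should recover Q, since each row stands for one or more quantiles with the same center.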
* (In fact it separates them into Q + 1 quantiles, throws away half of a quantile from each end, and re-separates the rest into Q quantiles.)
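The procedure can be sketched in Python as follows. This is only one reading of the description and the footnote, not TreeExtra's actual implementation: it assumes that the "center" of a quantile is the middle element of an equal-count slice of the sorted values, and that "the average prediction for this value" means setting the feature to that value in every validation row and averaging the model output. The names `quantile_centers`, `effect_curve` and `predict` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple


def quantile_centers(values: Sequence[float], q: int) -> List[float]:
    """Trim half of a (Q + 1)-quantile from each end, then take Q centers."""
    s = sorted(values)
    trim = len(s) // (2 * (q + 1))       # half of one of the Q + 1 quantiles
    kept = s[trim:len(s) - trim]
    if not kept:
        raise ValueError("not enough values")
    centers = []
    for i in range(q):
        # assumption: the center is the middle element of the i-th slice
        pos = int((i + 0.5) * len(kept) / q)
        centers.append(kept[min(pos, len(kept) - 1)])
    return centers


def effect_curve(
    rows: Sequence[Sequence[float]],
    feature_index: int,
    predict: Callable[[List[float]], float],
    q: int,
) -> List[Tuple[int, float, float]]:
    """Return (count, center, average prediction) triples."""
    centers = quantile_centers([row[feature_index] for row in rows], q)
    curve = []
    # Quantiles with equal centers collapse into one point with a count.
    for center, count in sorted(Counter(centers).items()):
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = center   # fix the feature at the center
            total += predict(modified)
        curve.append((count, center, total / len(rows)))
    return curve


# Toy data and a toy "model", both invented for the example.
rows = [[x, 1.0] for x in range(22)]
curve = effect_curve(rows, 0, lambda r: 2 * r[0] + r[1], q=10)
print(curve[0])
```

The triples have the same shape as the rows of the effect file: a count, a center value, and an average prediction.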
To visualize a 2-way interaction, we need to estimate the joint effect of two features at the same time. This procedure is implemented in the vis_iplot command. Like vis_effect, it calculates the centers of quantiles for each of the two features. It then creates a matrix of joint effect values by calculating the average prediction that the model makes for each pair of quantile centers.
The matrix is then saved to the output file in the following format. The four upper left values are always zeros. The first column contains quantile counts for feature 1 and the second column contains quantile center values for feature 1. The first row contains quantile counts for feature 2 and the second row contains quantile center values for feature 2. The rest is the matrix of average model predictions.
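The layout above can be unpacked with a few slices. The Python sketch below is illustrative only: `parse_iplot` is a made-up name, the sample text is invented, and the code assumes whitespace-delimited numbers with no header line, which the description does not state explicitly.

```python
from typing import Dict, List


def parse_iplot(text: str) -> Dict[str, list]:
    """Split the joint effect file into counts, centers and the matrix."""
    grid: List[List[float]] = [
        [float(x) for x in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]
    body = grid[2:]  # rows below the two header rows
    return {
        "counts2": grid[0][2:],               # first row: counts, feature 2
        "centers2": grid[1][2:],              # second row: centers, feature 2
        "counts1": [row[0] for row in body],  # first column: counts, feature 1
        "centers1": [row[1] for row in body], # second column: centers, feature 1
        "matrix": [row[2:] for row in body],  # average model predictions
    }


# Invented sample: 2 centers for feature 1 (rows), 2 for feature 2 (columns).
sample = (
    "0\t0\t1\t2\n"
    "0\t0\t10.0\t20.0\n"
    "3\t0.1\t0.5\t0.6\n"
    "1\t0.9\t0.7\t0.8\n"
)
iplot = parse_iplot(sample)
# Average prediction for feature 1 = 0.9 and feature 2 = 10.0:
print(iplot["matrix"][1][0])
```

Rows of the matrix follow feature 1 and columns follow feature 2, so `matrix[i][j]` pairs `centers1[i]` with `centers2[j]`.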
You can only trust the interaction plots in the areas where the joint density of the two features is sufficiently high. For this purpose vis_iplot also generates a joint density estimation matrix in a separate file. The density file name is the same as the output file name, with the ".dens" suffix inserted before the extension. Its format is as follows: the first column contains quantile borders for feature 1, the first row contains quantile borders for feature 2, and the matrix contains the proportions of validation set data points in the corresponding cells*. Note that the first column and row contain borders, not centers of quantiles, so they are one number longer than the corresponding dimensions of the density matrix.
Warning: the density plots show the joint distribution of the two features only; they do not take into account the densities of fixed-value features.
* (Some data on edges of the distribution was ignored when calculating quantile borders, so the sum of these numbers might be less than one.)
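To show how borders and proportions relate, the Python sketch below looks up the density cell that contains a given pair of feature values. It works on already-loaded lists rather than on the file itself, because the exact placement of the extra border in the file is not spelled out here. The function name `cell_density`, the sample numbers, and the choice to treat cells as closed on the left are all assumptions of this example.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence


def cell_density(
    borders1: Sequence[float],
    borders2: Sequence[float],
    density: List[List[float]],
    x1: float,
    x2: float,
) -> Optional[float]:
    """Proportion of points in the cell containing (x1, x2), or None if outside."""
    n1, n2 = len(density), len(density[0])
    if len(borders1) != n1 + 1 or len(borders2) != n2 + 1:
        raise ValueError("borders must be one number longer than the matrix")

    def cell_index(borders: Sequence[float], x: float) -> Optional[int]:
        if x < borders[0] or x > borders[-1]:
            return None  # outside the range covered by the borders
        # bisect_right counts the borders <= x; the last border closes
        # the last cell, hence the clamp.
        return min(bisect_right(borders, x) - 1, len(borders) - 2)

    i, j = cell_index(borders1, x1), cell_index(borders2, x2)
    if i is None or j is None:
        return None
    return density[i][j]


# Invented sample: 3 borders per feature describe a 2 x 2 density matrix.
borders1 = [0.0, 1.0, 2.0]
borders2 = [10.0, 20.0, 30.0]
density = [[0.40, 0.05], [0.10, 0.35]]

print(cell_density(borders1, borders2, density, 0.5, 25.0))
total = sum(sum(row) for row in density)  # may be below 1, see the footnote
```

A lookup like this makes it easy to check whether a region of an interaction plot is backed by enough validation data before reading anything into it.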
We provide several MatLab/Octave functions to illustrate how the output of vis_effect and vis_iplot can be used to create actual plots.
plot_effects creates an effect plot for every .effect.txt file it finds in the working directory. X axis shows feature values, Y axis shows the average prediction values.
plot_interactions creates two interaction plots for every .iplot.txt file accompanied by a .iplot.dens.txt file. In the first plot different effect curves correspond to different values of feature 1, X axis shows values of feature 2, Y axis shows the average prediction values. The second, "flipped" plot swaps the features. The areas where the density is dangerously low are marked by red circles. The areas where the density is unusually high are marked by green circles. Absence of any circle means that the density is not too far from expected.
| option | argument | description | default value |
|---|---|---|---|
| -v | _validation_set_ | validation set file name | |
| -r | _attr_file_ | attribute file name | |
| -f | _feature_ | feature of interest name | |
| -q | _#quantile_values_ | number of values to estimate | 10 |
| -m | _model_file_name_ | model file name | model.bin |
| -o | _output_file_name_ | feature effect file name | <_feature_>.effect.txt |
Outputs the feature effect curve into the specified output file. Output file name is composed as f.o.effect.txt, where "f" and "o" are the values of the corresponding arguments.
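When vis_effect is run for many features from a script, it can help to assemble the argument list programmatically. The Python sketch below only uses the options listed in the table above; the helper name `vis_effect_command` and the file and feature names are hypothetical, and the command is printed rather than executed.

```python
import shlex
from typing import List, Optional


def vis_effect_command(
    validation_set: str,
    attr_file: str,
    feature: str,
    quantile_values: Optional[int] = None,
    model_file: Optional[str] = None,
    output_file: Optional[str] = None,
) -> List[str]:
    """Build a vis_effect argument list; omitted options keep their defaults."""
    args = ["vis_effect", "-v", validation_set, "-r", attr_file, "-f", feature]
    if quantile_values is not None:
        args += ["-q", str(quantile_values)]
    if model_file is not None:
        args += ["-m", model_file]
    if output_file is not None:
        args += ["-o", output_file]
    return args


# Hypothetical file and feature names.
cmd = vis_effect_command("valid.dta", "data.attr", "age", quantile_values=20)
print(shlex.join(cmd))
# The list can be handed to subprocess.run(cmd, check=True) where the
# vis_effect binary is available on the PATH.
```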
vis_iplot -v _validation_set_ -r _attr_file_ -f1 _feature1_ -f2 _feature2_ [-q1 _#quantile_values1_] [-q2 _#quantile_values2_] [-m _model_file_name_] [-o _output_file_name_] [-x _fixed_values_file_name_] | -version
| option | argument | description | default value |
|---|---|---|---|
| -v | _validation_set_ | validation set file name | |
| -r | _attr_file_ | attribute file name | |
| -f1 | _feature1_ | first feature of interest name | |
| -f2 | _feature2_ | second feature of interest name | |
| -q1 | _#quantile_values1_ | number of values to estimate for the first feature | 10 |
| -q2 | _#quantile_values2_ | number of values to estimate for the second feature | 10 |
| -m | _model_file_name_ | model file name | model.bin |
| -o | _output_file_name_ | interaction (joint effect) file name | <_feature1_>.<_feature2_>.iplot.txt |
| -x | _fixed_values_file_name_ | fixed attributes-values file name | no fixed value attributes |
Output:
| option | argument | description | default value |
|---|---|---|---|
| -t | _training_set_ | input data set file name | |
| -r | _attr_file_ | attribute file name | |
Outputs Spearman's correlations between all pairs of active attributes into file correlations.txt.
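For readers unfamiliar with the statistic, Spearman's correlation is the Pearson correlation computed on ranks instead of raw values. The Python sketch below is a textbook version, not TreeExtra's code: tied values receive the average of their ranks, which is the usual convention but an assumption as far as this tool is concerned, and the attribute names and data are invented.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks (undefined for a constant column)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


def pairwise_spearman(columns: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], float]:
    """Correlation for every unordered pair of named columns."""
    return {
        (a, b): spearman(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Invented data with hypothetical attribute names.
data = {"age": [1, 2, 3, 4], "income": [10, 20, 25, 100], "debt": [4, 3, 2, 1]}
result = pairwise_spearman(data)
print(result[("age", "income")], result[("age", "debt")])
```

Because only ranks matter, any monotonic relationship gives a correlation of 1 or -1 regardless of its shape, as the "age" and "income" columns show.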