Visualization of Feature Effects and Interactions in TreeExtra Models

Overview

Once the important features and interactions in a model have been detected, the next question is what exactly they look like: what effect a feature, or a pair of features, has on the response function. TreeExtra provides several tools for this kind of analysis. The primary output of these tools is text files. An example of MatLab code that converts such files into graphical plots is available as well.

Feature effects

The vis_effect command produces an estimate of the effect of a single feature on the response. One of its input parameters is Q, the number of points to show on the effect curve. vis_effect separates all values that the feature takes on the validation set into Q quantiles*. For the center of each quantile, it calculates the average prediction that the model makes for this value of the feature. These average predictions form the feature effect curve.

The curve is then saved into the output file in the following format. The file has three tab-delimited columns. The first column contains counts: how many quantiles share the same center value. The second column contains the quantile center values. The third column contains the average model prediction for each center.

* (In fact it separates them into Q + 1 quantiles, throws away half of a quantile from each end, and re-separates the rest into Q quantiles.)
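The procedure above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not TreeExtra's implementation: the function names are hypothetical, the "center" of a quantile is assumed to be the data point in the middle of it, and the average prediction is assumed to be taken after substituting the center value into every validation row.

```python
def quantile_centers(values, q):
    """Return (counts, centers) for q quantile centers of a feature.

    Assumption: half of one of the Q + 1 quantiles is trimmed from each end,
    the rest is split into q equal quantiles, and the "center" of a quantile
    is the data point in the middle of it.
    """
    v = sorted(values)
    n = len(v)
    trim = 0.5 / (q + 1)                 # fraction dropped at each end
    width = (1.0 - 2.0 * trim) / q       # fraction covered by one quantile
    picked = []
    for i in range(q):
        frac = trim + (i + 0.5) * width  # middle of the i-th quantile
        picked.append(v[min(int(frac * n), n - 1)])
    # Merge equal centers and count how many quantiles share each value.
    centers, counts = [], []
    for c in picked:
        if centers and centers[-1] == c:
            counts[-1] += 1
        else:
            centers.append(c)
            counts.append(1)
    return counts, centers


def effect_curve(predict, rows, col, q):
    """Average prediction per quantile center (assumed substitution scheme).

    `predict` is a hypothetical callable mapping one row (a list) to a number.
    """
    counts, centers = quantile_centers([r[col] for r in rows], q)
    averages = []
    for c in centers:
        total = 0.0
        for r in rows:
            modified = list(r)
            modified[col] = c            # fix the feature at the center value
            total += predict(modified)
        averages.append(total / len(rows))
    return counts, centers, averages


def format_effect(counts, centers, averages):
    """Three tab-delimited columns: count, center value, average prediction."""
    return "\n".join("%d\t%g\t%g" % t for t in zip(counts, centers, averages))
```

When several quantiles have the same center, the sketch merges them into one row and records how many were merged, which is what the first column of the output file describes.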

Interactions - joint effects of two features

In order to visualize a 2-way interaction, we need to estimate the joint effect of two features at the same time. This procedure is implemented in the vis_iplot command. Like vis_effect, it calculates the quantile centers for each of the two features. Then it creates a matrix of joint effect values by calculating the average prediction that the model makes for each pair of quantile centers.

The matrix is then saved into the output file in the following format. The four upper-left values in the matrix are always zeros. The first column contains quantile counts for feature 1, and the second column contains quantile center values for feature 1. The first row contains quantile counts for feature 2, and the second row contains quantile center values for feature 2. The rest is the matrix of average model predictions.
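As an illustration of this layout, the following Python sketch splits the contents of a joint effect file into its parts. The function name parse_iplot and the sample numbers are made up for the example; only the layout itself comes from the description above.

```python
def parse_iplot(text):
    """Split the text of a joint effect file into its parts."""
    table = [[float(x) for x in line.split()]
             for line in text.strip().splitlines()]
    counts2 = table[0][2:]                    # first row: counts for feature 2
    centers2 = table[1][2:]                   # second row: centers for feature 2
    counts1 = [row[0] for row in table[2:]]   # first column: counts for feature 1
    centers1 = [row[1] for row in table[2:]]  # second column: centers for feature 1
    # effects[i][j] is the average prediction for the i-th center of
    # feature 1 and the j-th center of feature 2.
    effects = [row[2:] for row in table[2:]]
    return counts1, centers1, counts2, centers2, effects


# Made-up file contents: 3 centers for feature 1, 2 centers for feature 2.
sample = (
    "0\t0\t1\t2\n"
    "0\t0\t0.5\t1.5\n"
    "1\t10\t0.1\t0.2\n"
    "1\t20\t0.3\t0.4\n"
    "2\t30\t0.5\t0.6\n"
)
counts1, centers1, counts2, centers2, effects = parse_iplot(sample)
```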

Higher order interactions

If you want to view a higher order (n-way) interaction, ideally you would need to look at an n-dimensional interaction plot. The -x option of the vis_iplot tool allows you to look at any 2-dimensional slice of such a plot. You need to choose the two features whose joint effect you want to plot, and fix the values of the other n-2 features. The names and fixed values of those n-2 features should be provided in a separate file in a tab-delimited two-column format: feature names in the first column, fixed values in the second column. By fixing the additional features at different values, you can visualize the effect that those features have on the 2-way interaction of the original two features.
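A fixed-values file in this format could be generated with a few lines of Python. This is only a sketch: the helper name and the feature names "age" and "income" are hypothetical, and the resulting text would still need to be saved to a file whose name is passed to the -x option.

```python
def fixed_values_text(fixed):
    """Format {feature name: fixed value} as tab-delimited two-column text."""
    return "".join("%s\t%s\n" % (name, value) for name, value in fixed.items())


# Hypothetical feature names and values, for illustration only.
text = fixed_values_text({"age": 40, "income": 52000.0})
```

Generating several such files with different values for the same features, and running vis_iplot once per file, gives a series of slices that can be compared side by side.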

Joint density

You can only trust the interaction plots in the areas where the joint density of the two features is sufficiently high. For this purpose vis_iplot also generates a joint density estimation matrix in a separate file. The density file name is the same as the output file name with the ".dens" suffix inserted before the extension. Its format is as follows: the first column contains quantile borders for feature 1, the first row contains quantile borders for feature 2, and the matrix contains the proportions of validation set data points that fall into the corresponding cells*. Note that the first column and the first row contain borders, not centers, of quantiles; therefore they are one number longer than the corresponding dimensions of the density matrix.

Warning: the density plots show the joint density of the two plotted features only. They do not take into account the densities of the fixed-value features.

* (Some data at the edges of the distribution is ignored when calculating quantile borders, so these numbers might sum to less than one.)
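The contents of the density matrix can be illustrated with a short Python sketch that counts the proportion of data points in each cell, given the quantile borders. The function names are hypothetical, and the treatment of points lying exactly on a border is an assumption; TreeExtra may handle them differently.

```python
from bisect import bisect_right


def cell_index(borders, x):
    """Index of the cell containing x, or None if x is outside the borders."""
    if x < borders[0] or x > borders[-1]:
        return None
    # Assumption: the last border belongs to the last cell.
    return min(bisect_right(borders, x) - 1, len(borders) - 2)


def joint_density(xs, ys, borders1, borders2):
    """Proportion of all data points that fall into each cell."""
    n1, n2 = len(borders1) - 1, len(borders2) - 1
    dens = [[0.0] * n2 for _ in range(n1)]
    for x, y in zip(xs, ys):
        i, j = cell_index(borders1, x), cell_index(borders2, y)
        if i is not None and j is not None:
            dens[i][j] += 1.0 / len(xs)   # points outside the borders are skipped
    return dens


# Made-up data: the last point lies outside the borders of feature 1.
dens = joint_density([0.5, 1.5, 1.5, 5.0], [0.5, 0.5, 1.5, 0.5],
                     [0, 1, 2], [0, 1, 2])
```

In this example three borders per feature produce a 2 by 2 matrix, and the proportions sum to 0.75 rather than 1 because one point falls outside the borders.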

Correlations

The vis_correlations command works independently of any model: it provides a table of pairwise correlation scores between the features in a data set. The scores are Spearman's rank correlation scores: they take into account only the relative order of values and make no assumptions about their distribution. Scores take on values in the range [-1, 1]. A score of 1 indicates a perfectly monotonic increasing relationship, 0 indicates the absence of correlation, and -1 indicates a perfectly monotonic decreasing (reverse) relationship.
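To make the definition concrete, Spearman's score can be computed as the ordinary (Pearson) correlation of the ranks of the values. The Python sketch below does this with the standard library only. The function names are illustrative, and giving tied values the average of their ranks is a common convention that is assumed here, not something confirmed for vis_correlations.

```python
def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


def spearman(x, y):
    """Spearman's score: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, spearman([1, 2, 3, 4], [1, 4, 9, 16]) gives a score of 1 even though the relationship is not linear, because only the order of the values matters.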

MatLab code for producing plots

We provide several MatLab/Octave functions to illustrate how the output of vis_effect and vis_iplot can be used to create actual plots.

plot_effects creates an effect plot for every .effect.txt file it finds in the working directory. The X axis shows feature values, and the Y axis shows the average prediction values.
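For readers who do not use MatLab or Octave, the same data can be prepared for any plotting tool by reading the three tab-delimited columns of an effect file. The Python sketch below is not part of TreeExtra; the function name and the sample numbers are made up for illustration.

```python
def parse_effect(text):
    """Read the three columns of an effect file: count, center, average."""
    counts, centers, averages = [], [], []
    for line in text.strip().splitlines():
        count, center, average = line.split()
        counts.append(int(float(count)))
        centers.append(float(center))    # X axis: feature values
        averages.append(float(average))  # Y axis: average predictions
    return counts, centers, averages


# Made-up file contents with three points on the curve.
counts, centers, averages = parse_effect("1\t0.5\t0.12\n2\t1.5\t0.30\n1\t2.5\t0.41\n")
```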

plot_interactions creates two interaction plots for every .iplot.txt file that is accompanied by a .iplot.dens.txt file. In the first plot, different effect curves correspond to different values of feature 1, the X axis shows the values of feature 2, and the Y axis shows the average prediction values. The second, "flipped" plot swaps the features. The areas where the density is dangerously low are marked by red circles. The areas where the density is unusually high are marked by green circles. The absence of any circle means that the density is not too far from the expected one.

Commands specification

vis_effect -v _validation_set_ -r _attr_file_ -f _feature_ [-q _#quantile_values_] [-m _model_file_name_] [-o _output_file_name_] | -version

option	argument	description	default value
-v	_validation_set_	validation set file name
-r	_attr_file_	attribute file name
-f	_feature_	feature of interest name
-q	_#quantile_values_	number of values to estimate	10
-m	_model_file_name_	model file name	model.bin
-o	_output_file_name_	feature effect file name	<_feature_>.effect.txt

Outputs the feature effect curve into the specified output file. The output file name is composed as f.o.effect.txt, where "f" and "o" are the values of the corresponding arguments.
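When vis_effect is called from a script, the argument list can be assembled from the options in the table above. The following Python sketch only builds the list; the helper name and the file and feature names in the example are hypothetical. The resulting list could be passed to subprocess.run if the vis_effect executable is on the path.

```python
def vis_effect_args(validation_set, attr_file, feature,
                    quantiles=None, model=None, output=None):
    """Build the vis_effect argument list; optional flags are omitted
    when not given, so the tool falls back to its defaults."""
    args = ["vis_effect", "-v", validation_set, "-r", attr_file, "-f", feature]
    if quantiles is not None:
        args += ["-q", str(quantiles)]
    if model is not None:
        args += ["-m", model]
    if output is not None:
        args += ["-o", output]
    return args


# Hypothetical file and feature names, for illustration only.
args = vis_effect_args("valid.dta", "data.attr", "age", quantiles=20)
```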

vis_iplot -v _validation_set_ -r _attr_file_ -f1 _feature1_ -f2 _feature2_ [-q1 _#quantile_values1_] [-q2 _#quantile_values2_] [-m _model_file_name_] [-o _output_file_name_] [-x _fixed_values_file_name_] | -version

option	argument	description	default value
-v	_validation_set_	validation set file name
-r	_attr_file_	attribute file name
-f1	_feature1_	first feature of interest name
-f2	_feature2_	second feature of interest name
-q1	_#quantile_values1_	number of values to estimate for the first feature	10
-q2	_#quantile_values2_	number of values to estimate for the second feature	10
-m	_model_file_name_	model file name	model.bin
-o	_output_file_name_	interaction (joint effect) file name	<_feature1_>.<_feature2_>.iplot.txt
-x	_fixed_values_file_name_	fixed attributes-values file name	no fixed value attributes

Output:

Outputs the joint effect matrix into the specified output file. The output file name is composed as f1.f2.o.iplot.txt, where "f1", "f2" and "o" are the values of the corresponding arguments.
Outputs the joint density matrix into a separate density file. The name of the density file is the name of the output file with the suffix ".dens" inserted before the last extension.
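The naming rule for the density file can be expressed in a couple of lines of Python. The helper name and the example file name are hypothetical, and the behavior for a name without any extension is an assumption of this sketch.

```python
import os


def density_file_name(output_name):
    """Insert ".dens" before the last extension of the output file name."""
    root, ext = os.path.splitext(output_name)
    return root + ".dens" + ext


# Hypothetical output file name, for illustration only.
name = density_file_name("age.income.iplot.txt")
```

Here name is "age.income.iplot.dens.txt", which matches the .iplot.dens.txt files that plot_interactions looks for.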

vis_correlations -t _training_set_ -r _attr_file_ | -version

option	argument	description	default value
-t	_training_set_	input data set file name
-r	_attr_file_	attribute file name

Outputs Spearman's correlations between all pairs of active attributes into the file correlations.txt.