Additive Groves - model file structure
Note: you need to know the structure of the model file only if you want to write code that converts
an AG model file to other formats.
The model file structure is described below in EBNF notation. In particular, '|' stands for
OR and '+' stands for repetition, that is, X+ means a sequence of one or more X. The types and constants
(int, double, bool, true, false) are C++ types: int takes 4 bytes, double takes 8 bytes, and bool and its
constants take 1 byte. (* *) encloses comments.
<model-file> := <header> <model>
<header> := (201 <path-length> <path>) | 202 | 203 | 204
(* <header> contains information relevant for further training only and can be skipped at prediction. *)
(* The header format has changed between the versions TreeExtra 1.1 and TreeExtra 2.0 *)
<path-length> := int
<path> := bool+
(* <path> is a sequence of boolean values, its length is defined in <path-length>. *)
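As an illustration of the <header> rule, the sketch below reads a header from a binary stream and returns its code together with the path. It assumes that the header constant (201 to 204) is stored as a 4-byte int, that the file is little-endian with no padding between fields, and that each <path> element occupies one byte. None of these details is stated above, and the name read_header is hypothetical.

```python
import io
import struct

def read_header(stream):
    """Read <header> from a binary stream; return (code, path)."""
    # Assumption: the header constant is a 4-byte little-endian int.
    (code,) = struct.unpack("<i", stream.read(4))
    if code == 201:
        (path_length,) = struct.unpack("<i", stream.read(4))
        # Assumption: each bool is one byte, nonzero meaning true.
        path = [byte != 0 for byte in stream.read(path_length)]
        return code, path
    if code in (202, 203, 204):
        # These header variants carry no further fields.
        return code, []
    raise ValueError(f"unknown header code: {code}")

# Usage on a hand-built header: code 201, a path of three booleans.
raw = struct.pack("<ii", 201, 3) + bytes([1, 0, 1])
code, path = read_header(io.BytesIO(raw))
```

After the call the stream is positioned at the start of <model>, so a program that only predicts can discard the returned values.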
<model> := <N> <alpha> (<grove>+)
<N> := int
<alpha> := double
<grove> := <tree>+
(* <N> is the number of trees in every grove. The <alpha> (α) parameter is not used during the prediction stage.
Note that there is no field describing the number of groves in the whole model. *)
<tree> := <leaf> | (<node> <tree> <tree>)
(* Trees are saved in preorder: root node is followed by the left subtree followed by the right subtree. *)
<leaf> := true <prediction>
<prediction> := double
<node> := false <attribute-id> <threshold> <missing-coef>
<attribute-id> := int
<threshold> := double
<missing-coef> := double
(* See text below on prediction algorithm for the meanings of those fields. *)
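The <model> part of the grammar can be read with a small recursive parser. The sketch below is one possible reading, not the TreeExtra implementation. It assumes little-endian fields with no padding, assumes that groves continue until the end of the buffer (since the file stores no grove count), and represents trees as nested tuples, a hypothetical layout chosen only for illustration. The names parse_tree and parse_model are likewise hypothetical.

```python
import struct

def parse_tree(buf, pos):
    """Parse one <tree> in preorder starting at buf[pos]; return (tree, new_pos)."""
    is_leaf = buf[pos] != 0          # 1-byte bool: true marks a leaf
    pos += 1
    if is_leaf:
        (prediction,) = struct.unpack_from("<d", buf, pos)
        return ("leaf", prediction), pos + 8
    # <node>: int attribute-id, double threshold, double missing-coef (4 + 8 + 8 bytes)
    attribute_id, threshold, missing_coef = struct.unpack_from("<idd", buf, pos)
    pos += 20
    left, pos = parse_tree(buf, pos)     # left subtree comes first
    right, pos = parse_tree(buf, pos)    # then the right subtree
    return ("node", attribute_id, threshold, missing_coef, left, right), pos

def parse_model(buf, pos):
    """Parse <model> starting at buf[pos] (the offset just past the header)."""
    n_trees, alpha = struct.unpack_from("<id", buf, pos)
    pos += 12
    groves = []
    # Assumption: groves of n_trees trees each repeat until the data runs out.
    while pos < len(buf):
        grove = []
        for _ in range(n_trees):
            tree, pos = parse_tree(buf, pos)
            grove.append(tree)
        groves.append(grove)
    return n_trees, alpha, groves
```

Very deep trees could exceed Python's default recursion limit; an explicit stack would avoid that at the cost of a longer parser.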
Prediction algorithm
The prediction of an Additive Groves model is the average of the predictions of all groves in the model.
The prediction of a single grove is the sum of the predictions of all trees in that grove.
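The two rules above amount to a sum inside each grove followed by an average across groves. A minimal sketch, assuming the per-tree predictions for one data point have already been computed and are passed as a list of groves, each a list of numbers (the function name combine_predictions is hypothetical):

```python
def combine_predictions(tree_predictions):
    """tree_predictions[g][t] is the prediction of tree t in grove g."""
    grove_sums = [sum(grove) for grove in tree_predictions]   # sum within a grove
    return sum(grove_sums) / len(grove_sums)                  # average over groves

# Two groves of two trees each: (1 + 2) and (3 + 5), averaged.
model_prediction = combine_predictions([[1.0, 2.0], [3.0, 5.0]])
```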
To get the prediction of a single tree for a specific data point, the data point is placed at the
root of the tree and then passed down the tree from node to descendant node. Two situations need to be
distinguished: absence and presence of missing values in the data set.
- No missing values in the data. (Note: the method of dealing with missing values has changed between versions 2.1 and 2.2)
If the value of attribute number <attribute-id> (see below on numbering order) is less than or equal to
<threshold>, the data point goes to the left descendant of the internal node. Otherwise it goes to the
right descendant.
The data point ends up in a single leaf of the tree. The prediction of the tree for this data point is
the <prediction> value of that leaf.
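The traversal for data without missing values can be sketched as a simple loop. The tuple layout used here is a hypothetical in-memory form, not part of the file format: a leaf is ("leaf", prediction) and an internal node is ("node", attribute_id, threshold, missing_coef, left, right). The data point x is assumed to be a sequence indexed by attribute number.

```python
def predict_tree(tree, x):
    """Follow one data point from the root to a leaf; no missing values expected."""
    while tree[0] == "node":
        _, attribute_id, threshold, _missing_coef, left, right = tree
        # Less than or equal goes left, anything greater goes right.
        tree = left if x[attribute_id] <= threshold else right
    return tree[1]   # the leaf's <prediction>

# A two-level example tree splitting first on attribute 1, then on attribute 0.
example_tree = ("node", 1, 0.5, 0.3,
                ("leaf", 10.0),
                ("node", 0, 2.0, 0.5, ("leaf", 20.0), ("leaf", 30.0)))
result = predict_tree(example_tree, [2.0, 0.6])
```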
- Data has missing values.
If <threshold> is set to a valid value (not NaN) and the data point has a valid (non-missing) value for the attribute
<attribute-id>, proceed as above.
If <threshold> is set to NaN, the node is a special split: all data points that have valid values for
<attribute-id> go to the left, while data points with missing values go to the right.
If the value of <attribute-id> is missing for this specific data point and <threshold> is set to a valid
value, the prediction program should take into account the <missing-coef> value. <missing-coef> encodes
the proportion of the data that should go to the left descendant in this situation. There are two ways to handle it:
- With probability <missing-coef> send the data point to the left descendant, and with probability
(1 - <missing-coef>) send it to the right descendant.
- Attach a "presence coefficient" λ to the data point in a node. In the root λ = 1. In the missing
value case the data point goes to both descendants: the presence coefficient λ is multiplied by
<missing-coef> for the left descendant and by (1 - <missing-coef>) for the right descendant. In the end
the data point will end up in several leaves with corresponding presence coefficients summing up to
1. The prediction of the tree is the weighted average (weighted by λ) of the predictions in those leaves.
ag_predict implements the second, more stable option of handling missing values.
Most likely, implementations of the first option will be very similar to ag_predict
in terms of performance on large data sets.
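The second option can be sketched recursively. Because the presence coefficients of the reached leaves sum to 1, the λ-weighted average equals the weighted combination built up on the way back from the leaves, so λ does not need to be tracked explicitly. This is an illustrative sketch, not the ag_predict source. It assumes missing attribute values are represented as NaN in the data point and uses a hypothetical tuple layout: ("leaf", prediction) or ("node", attribute_id, threshold, missing_coef, left, right).

```python
import math

def predict_tree_with_missing(tree, x):
    """Tree prediction for one data point that may contain missing (NaN) values."""
    if tree[0] == "leaf":
        return tree[1]
    _, attribute_id, threshold, missing_coef, left, right = tree
    value_missing = math.isnan(x[attribute_id])
    if math.isnan(threshold):
        # Special split: valid values go left, missing values go right.
        return predict_tree_with_missing(right if value_missing else left, x)
    if not value_missing:
        # Ordinary split, same rule as for data without missing values.
        branch = left if x[attribute_id] <= threshold else right
        return predict_tree_with_missing(branch, x)
    # Missing value at an ordinary split: visit both descendants and weight them.
    return (missing_coef * predict_tree_with_missing(left, x)
            + (1.0 - missing_coef) * predict_tree_with_missing(right, x))

# One ordinary split with missing-coef 0.25 and one special (NaN-threshold) split.
ordinary = ("node", 0, 1.0, 0.25, ("leaf", 4.0), ("leaf", 8.0))
special = ("node", 0, math.nan, 0.0, ("leaf", 1.0), ("leaf", 2.0))
blended = predict_tree_with_missing(ordinary, [math.nan])   # 0.25 * 4 + 0.75 * 8
```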
Numbering order of attributes
Attributes are numbered following the order in the attribute file starting with 0 and excluding the
response feature. For example, attributes described by a sample attribute file:
latitude: cont.
x: cont.
label: 0,1 (class).
region_Pacific: 0,1.
contexts:
x never
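will be numbered in the following way: latitude - 0, x - 1, region_Pacific - 2. The response, label, receives
no number, while x is numbered even though it is marked "never" under contexts.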
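The numbering rule can be illustrated with a short sketch that derives attribute numbers from the text of an attribute file. It relies only on what the sample above shows and is not a full attribute-file parser: it assumes one "name: description" entry per line, that the response is the entry whose description contains "(class)", and that attribute entries end at the "contexts:" line. The function name number_attributes is hypothetical.

```python
def number_attributes(attr_file_text):
    """Map attribute names to 0-based numbers, skipping the response."""
    numbers = {}
    for line in attr_file_text.splitlines():
        line = line.strip()
        if line == "contexts:":
            break                      # attribute entries end here (assumption)
        if not line:
            continue
        name, _, description = line.partition(":")
        if "(class)" in description:
            continue                   # the response gets no number
        numbers[name.strip()] = len(numbers)
    return numbers

sample = """latitude: cont.
x: cont.
label: 0,1 (class).
region_Pacific: 0,1.
contexts:
x never
"""
numbering = number_attributes(sample)
```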