Data Format

Data files

Training, validation, and test data sets for all tools in TreeExtra should be provided in separate tab-delimited text files without headers. Only continuous and Boolean features are supported. Nominal features are allowed in the data file, but they must be explicitly marked as unused in the attribute file (see next section). Missing values should be encoded as question marks. All data sets should have the same number and order of columns. If your test data does not have labels, you can put missing values in the label column, but the column must still be present. Binary classification problems should use 0 and 1 for the response values.
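As an illustration of the layout described above, here is a minimal Python sketch that writes rows as tab-delimited text without a header and encodes missing values as question marks. The helper names write_rows and column_count are hypothetical and are not part of TreeExtra; the sketch only assumes the rules stated in this section.

```python
import csv
import io


def write_rows(rows, out):
    """Write rows as tab-delimited text with no header line.

    Missing values (None) are encoded as '?'.
    Hypothetical helper, not part of TreeExtra.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow(["?" if value is None else value for value in row])


def column_count(text):
    """Return the number of columns, or raise if lines disagree."""
    widths = {len(line.split("\t")) for line in text.splitlines() if line}
    if len(widths) != 1:
        raise ValueError(f"inconsistent column counts: {sorted(widths)}")
    return widths.pop()


# Example: two rows, the first with a missing label in the last column.
buffer = io.StringIO()
write_rows([[1.5, 0, None], [2.0, 1, 1]], buffer)
data_text = buffer.getvalue()
```

Running column_count on the text of the training, validation, and test files and comparing the results is one simple way to check that all data sets have the same number of columns.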

Attribute files

A separate attribute file describing the data is required. I reused the idea of an attribute file from the IND package, so the format of this file should be compatible with IND to some extent.

Each line in the first part of the attribute file corresponds to a single attribute. The order of attributes should be the same as in the data file.

The structure of the attribute description is the following:

_attr_name_: _type_ [(class)|(weight)].

_type_ should be one of cont for continuous features, 0,1 for Boolean features, or nom for nominal features. (class) marks the label; there should be exactly one attribute marked with (class) per attribute file. (weight) marks the column with weights.
The first and the second parts of the attribute file should be separated by the line "contexts:".
The second part lists the attributes that should not be used for training. Each line contains one attribute in the format:

_attr_name_ never

Here is an example of a valid attribute file:

    latitude: cont.
    longtitude: cont.
    x: cont.
    y: cont.
    name: nom.
    label: 0,1 (class).
    region_Pacific: 0,1.
    region_Mountain: 0,1.
    region_NA: 0,1.
    coefficient: cont (weight).
    contexts:
    y never
    x never
    name never
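To make the attribute file structure concrete, the following Python sketch parses a file with one entry per line, as the format description requires. It is a rough illustration under the rules stated in this section, not TreeExtra's own parser; the function name parse_attr_file and the returned structure are assumptions made for the example.

```python
def parse_attr_file(text):
    """Parse attribute-file text into (attributes, unused_names).

    attributes is a list of (name, type, role) tuples in column order,
    where role is 'class', 'weight', or None.
    Hypothetical helper, not part of TreeExtra.
    """
    attributes = []
    unused = set()
    in_contexts = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line == "contexts:":
            in_contexts = True
            continue
        if in_contexts:
            # Second part: "<attr_name> never"
            name, keyword = line.split()
            if keyword != "never":
                raise ValueError(f"unexpected keyword: {keyword!r}")
            unused.add(name)
            continue
        # First part: "<attr_name>: <type> [(class)|(weight)]."
        name, _, rest = line.rstrip(".").partition(":")
        parts = rest.split()
        if not parts or parts[0] not in ("cont", "0,1", "nom"):
            raise ValueError(f"bad attribute line: {line!r}")
        role = parts[1].strip("()") if len(parts) > 1 else None
        attributes.append((name.strip(), parts[0], role))
    # Exactly one attribute must be marked with (class).
    if sum(1 for a in attributes if a[2] == "class") != 1:
        raise ValueError("expected exactly one (class) attribute")
    return attributes, unused


EXAMPLE = """latitude: cont.
longtitude: cont.
x: cont.
y: cont.
name: nom.
label: 0,1 (class).
region_Pacific: 0,1.
region_Mountain: 0,1.
region_NA: 0,1.
coefficient: cont (weight).
contexts:
y never
x never
name never
"""

attributes, unused = parse_attr_file(EXAMPLE)
```

Because the order of attributes matches the order of columns in the data file, the index of each tuple in attributes can be used directly as the column index when reading the data.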