Machine leaning¶

The command-line application for general-purpose machine learning.

Usage¶

python -m nordlys.core.ml.ml <config_file>

Config parameters¶

training_set: nordlys ML instance file format (MIFF)
test_set: nordlys ML instance file format (MIFF); if provided then it’s always used for testing. Can be left empty if cross-validation is used, in which case the remaining split is used for testing.
cross_validation:
- k: number of folds (default: 10); use -1 for leave-one-out
- split_strategy: name of a property (normally query-id for IR problems). If set, the entities with the same value for that property are kept in the same split. if not set, entities are randomly distributed among splits.
- splits_file: JSON file with splits (instance_ids); if the file is provided it is used, otherwise it’s generated
- create_splits: if True, creates the CV splits. Otherwise loads the splits from “split_file” parameter.
model: ML model, currently supported values: rf, gbrt
category: [regression | classification], default: “regression”
parameters: dict with parameters of the given ML model
- If GBRT:
  
  alpha: learning rate, default: 0.1
  
  tree: number of trees, default: 1000
  
  depth: max depth of trees, default: 10% of number of features
- If RF:
  
  tree: number of trees, default: 1000
  
  maxfeat: max features of trees, default: 10% of number of features
save_model: the model is saved to this file
load_model: if True, loads the model
save_feature_imp: Feature importance is saved to this file
output_file: where output is written; default output format: TSV with with instance_id and (estimated) target

Example config¶

{
    "model": "gbrt",
    "category": "regression",
        "parameters":{
                "alpha": 0.1,
                "tree": 10,
                "depth": 5
        },
        "training_set": "path/to/train.json",
        "test_set": "path/to/test.json",
        "save_model": "path/to/model.txt",
    "output_file": "path/to/output.json",
    "cross_validation":{
                "create_splits": true,
                "splits_file": "path/to/splits.json",
        "k": 5,
        "split_strategy": "q_id"
        }
}