Machine learning

The command-line application for general-purpose machine learning.

Usage

python -m nordlys.core.ml.ml <config_file>

Config parameters

  • training_set: nordlys ML instance file format (MIFF)
  • test_set: nordlys ML instance file format (MIFF); if provided, it is always used for testing. Can be left empty if cross-validation is used, in which case the remaining split is used for testing.
  • cross_validation:
    • k: number of folds (default: 10); use -1 for leave-one-out
    • split_strategy: name of a property (normally query-id for IR problems). If set, entities with the same value for that property are kept in the same split. If not set, entities are randomly distributed among splits.
    • splits_file: JSON file with splits (instance_ids); if the file is provided it is used, otherwise it’s generated
    • create_splits: if True, creates the CV splits. Otherwise loads the splits from the file given in the “splits_file” parameter.
  • model: ML model, currently supported values: rf, gbrt
  • category: [regression | classification], default: “regression”
  • parameters: dict with parameters of the given ML model
    • If GBRT:
      • alpha: learning rate, default: 0.1
      • tree: number of trees, default: 1000
      • depth: max depth of trees, default: 10% of number of features
    • If RF:
      • tree: number of trees, default: 1000
      • maxfeat: max features of trees, default: 10% of number of features
  • save_model: the model is saved to this file
  • load_model: if True, loads the model
  • save_feature_imp: feature importance is saved to this file
  • output_file: where output is written; default output format: TSV with instance_id and (estimated) target
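
The effect of split_strategy can be illustrated with a small sketch. This is not the nordlys implementation; the function name make_splits, the shape of the instances dict, and the treatment of k = -1 as one group per fold are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def make_splits(instances, k=10, split_property=None, seed=0):
    """Distribute instance ids over k folds (hypothetical helper).

    instances: dict mapping instance_id -> dict of properties.
    split_property: if given, instances sharing a value for this
        property always land in the same fold.
    """
    rng = random.Random(seed)
    if split_property is None:
        # No grouping: every instance is its own group.
        groups = [[ins_id] for ins_id in instances]
    else:
        by_value = defaultdict(list)
        for ins_id, props in instances.items():
            by_value[props[split_property]].append(ins_id)
        groups = list(by_value.values())

    if k == -1:
        # Assumption: leave-one-out means one group per fold.
        k = len(groups)

    rng.shuffle(groups)
    folds = [[] for _ in range(k)]
    # Round-robin assignment keeps whole groups together.
    for i, group in enumerate(groups):
        folds[i % k].extend(group)
    return folds


if __name__ == "__main__":
    data = {f"i{n}": {"q_id": f"q{n % 7}"} for n in range(40)}
    for fold in make_splits(data, k=5, split_property="q_id"):
        print(sorted(fold))
```

With a grouping property, fold sizes can be uneven because groups are never broken up; if k exceeds the number of groups, some folds stay empty in this sketch.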

Example config

{
    "model": "gbrt",
    "category": "regression",
    "parameters": {
        "alpha": 0.1,
        "tree": 10,
        "depth": 5
    },
    "training_set": "path/to/train.json",
    "test_set": "path/to/test.json",
    "save_model": "path/to/model.txt",
    "output_file": "path/to/output.json",
    "cross_validation": {
        "create_splits": true,
        "splits_file": "path/to/splits.json",
        "k": 5,
        "split_strategy": "q_id"
    }
}
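
A config that omits optional keys relies on the defaults listed under Config parameters. The following sketch shows how those defaults could be resolved before a run. The helpers apply_defaults and build_command are hypothetical and not part of nordlys, and rounding the "10% of number of features" rule down (with a minimum of 1) is an assumption.

```python
import json


def apply_defaults(config, num_features):
    """Return a copy of config with the documented defaults filled in."""
    resolved = dict(config)
    resolved.setdefault("category", "regression")

    # Assumption: "10% of number of features" is rounded down, at least 1.
    tenth = max(1, num_features // 10)

    params = dict(config.get("parameters", {}))
    if resolved.get("model") == "gbrt":
        params.setdefault("alpha", 0.1)   # learning rate
        params.setdefault("tree", 1000)   # number of trees
        params.setdefault("depth", tenth)
    elif resolved.get("model") == "rf":
        params.setdefault("tree", 1000)   # number of trees
        params.setdefault("maxfeat", tenth)
    resolved["parameters"] = params

    if "cross_validation" in config:
        cv = dict(config["cross_validation"])
        cv.setdefault("k", 10)            # number of folds
        resolved["cross_validation"] = cv
    return resolved


def build_command(config_path):
    """Command line from the Usage section, as an argument list."""
    return ["python", "-m", "nordlys.core.ml.ml", config_path]


if __name__ == "__main__":
    minimal = {
        "model": "rf",
        "training_set": "path/to/train.json",
        "cross_validation": {"split_strategy": "q_id"},
    }
    print(json.dumps(apply_defaults(minimal, num_features=50), indent=4))
    print(" ".join(build_command("config.json")))
```

The argument list returned by build_command can be passed to subprocess.run once the config has been written to disk.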