Train
The train command builds predictors that can later be used for prediction. Its input is either precomputed data from the precompute program or compound data read directly from a file in SDF, SMILES or JSON format.
Parameters
The full usage menu can be retrieved by running the command:
> java -jar cpsign-[version].jar train
train
SYNOPSIS
------------------------------------------------------------------------------------------
train [options]
train @/tmp/runconfigs/parameters.txt [options]
train @C:\Users\User\runconfigs\parameters.txt [options]
DESCRIPTION
------------------------------------------------------------------------------------------
Train an Aggregated Conformal Predictor (ACP), Venn-ABERS Predictor (VAP) or
Transductive Conformal Predictor (TCP). The trained models can later be used in
predictions. For ACP and VAP, the underlying scoring algorithm is trained and can be
reused for all further predictions, whereas the TCP predictor only precomputes the data
and sets parameters for future training+predictions (as the underlying scoring algorithm
needs to be trained for each new prediction). Input to train can either be precomputed
data from 'precompute' or compound data in SMILES/SDF/JSON.
OPTIONS
------------------------------------------------------------------------------------------
Input:
-mi | --model-in [URI | path]
Model file with precomputed data
-td | --train-data [format] [opt args] [URI | path]
File with molecules in CSV, SDF or JSON format. run
-e | --endpoint [text]
Endpoint property that should be used for modeling (the endpoint of the model)
-l | --labels [label label]
Label(s) for endpoint values in classification mode. More info can be found running
"explain labels"
Predictor:
-pt | --ptype | --predictor-type [id | text]
Predictor type:
(1) ACP_Classification
(2) ACP_Regression
(3) TCP_Classification
(5) VAP_Classification
Default: 1
-ss | --sampling-strategy [id | text]
Strategy used for sampling data to aggregated models (non TCP):
(1) random
(2) random_stratified (classification only)
(3) folded
(4) folded_stratified (classification only)
Default: 1
-nr | --nr-models [integer]
(ACP/VAP) Number of models that should be aggregated
Default: 1
-cr | --calibration-ratio [number]
(ACP/VAP) Part of training set used as calibration set, range (0,1)
Default: 0.2
--nonconf-measure [text]
Nonconformity measure that should be used, see documentation for clarifications.
Run "explain ncm" to get further information
Options (Regression) :
(1) LogNormalized
(2) Normalized
(3) AbsDiff
Options (Classification):
(11) NegativeDistanceToHyperplane
(12) PositiveDistanceToHyperplane
(13) ProbabilityEstimates (Only for ProbabilisticLibSVM - slower to compute)
Default: 1 or 11 (regression / classification)
--nonconf-beta [number]
If log-normalized nonconformity measure is chosen, optionally set a beta value (>=
0)
Default: 0.0
--pvalue-calc [id | text]
Choose the calculation of p-values (and nonconformity score for regression).
Options:
(1) Standard
(2) Smoothed
(3) Linear_interpolation
(4) Spline_interpolation
Default: 1
--percentiles [integer]
The maximum number of molecules used for calculating percentiles. This is a very
time consuming step in the training. Percentiles are only used for image rendering
and calculating gradients for predictions; to save time, set this flag to 0 if
neither of these will be used.
Default: 1000
--percentiles-data [format] [opt args] [URI | path]
File with molecules that exclusively should be used for calculating percentiles
(used when calculating gradients and generating images). Default behaviour is to
re-use the training file but when using precomputed data this is not an option and
this flag must be given if percentiles should be calculated.
--splits [integer integer ..]
(ACP/VAP) Run only a specific set of training splits. A means of parallelising the
training step. If a folded sampling strategy is used, the random seed *must be
static*! Indexing starts at 1, i.e. allowed indexes are [1,#num_models]
Modeling:
-i | --impl [id | text]
Scoring algorithm (i.e. underlying machine learning implementation):
(1) LibLinear
(2) LibSvm
(3) ProbabilisticLibSvm
Default: 1
--cost [number]
User defined Cost value in SVM training
Default: 50.0
--gamma [number]
User defined Gamma value in SVM training (only used in libsvm)
Default: 0.002
--epsilon [number]
User defined tolerance of termination criterion
Default: 0.001
--epsilon-svr [number]
User defined epsilon in loss function of epsilon-SVR
Default: 0.1
Signature generation:
-hs | --height-start [integer]
Signatures start height
Default: 1
-he | --height-end [integer]
Signatures end height
Default: 3
-sg | --signatures-generator [id | text]
Type of signatures that should be used, note that stereo-signatures take much
longer time to compute. Stereo signatures also require input data to have stereo
information explicitly given in the file. Options:
(1) default | normal
(2) stereo (experimental mode)
Default: 1
Data manipulation:
--duplicates [id | text]
Resolve/remove potential duplicates which can make it difficult for the SVM to find
a good decision plane. Replace duplicates by a single record with a new label or
remove all conflicting records. Regression options:
(1) median
(2) mean
(3) min
(4) max
(5) remove:[maximum allowed difference]
Classification options:
(5) remove
(6) vote
(7) keep:[label]
--filters [id | text]
Filters to apply on the records, currently only filters records based on the
endpoint value for regression. Options:
(1) min:[min]
(2) max:[max]
(3) range:[min]:[max]
Output:
-mo | --model-out [path]
Model file to generate (--model-out or --model-out-dir are required to pass)
--model-out-dir [path]
Specify a directory where the model should be saved, leave naming to cpsign
(--model-out or --model-out-dir are required to pass). Specify '.' if model should
be generated in the current directory.
* -mn | --model-name [text]
The name of the model
-mc | --model-category [text]
The category of the model, will end up as model-endpoint in the model JAR
-mv | --model-version [text]
Optional model version in SemVer versioning format
Default: 1.0.0_{date time string}
Encryption:
--encrypt [URI | path]
Path to the license file that the model should be encrypted by (can be the same as
passed to --license)
General:
* --license [URI | path]
Path or URI to license file
-h | --help | man
Get help text
--short
Use shorter help text (used together with the --help argument)
--logfile [path]
Path to a user-set logfile, will be specific for this run
--silent
Silent mode (only print output to logfile)
--echo
Echo the input arguments given to CPSign
--seed [integer]
Set this flag if an explicit RNG seed should be used in tasks that require a RNG
(randomization of training data, splitting in cross-validation, learning algorithms
etc). Not used by all programs.
--progress-bar
Add a Progress bar in the system error output
--progress-bar-ascii
Add a Progress bar in ASCII in the system error output
--time
Print wall-time for all individual steps in execution
------------------------------------------------------------------------------------------
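The synopsis shows that options can also be read from a parameter file passed as @path. The sketch below writes such a file and prints a matching train command. It assumes that the file holds one argument token per line, which this page does not confirm; the paths and values are the placeholders used in the synopsis and in the usage example on this page, and the command is only echoed, not executed.

```shell
# ASSUMPTION: one argument token per line; the file format is not documented on this page.
PARAMS_FILE="${PARAMS_FILE:-/tmp/runconfigs/parameters.txt}"   # path taken from the synopsis
mkdir -p "$(dirname "$PARAMS_FILE")"

# Options that stay the same between runs go into the parameter file.
cat > "$PARAMS_FILE" <<'EOF'
--license
/path/to/Standard-license.license
-td
sdf
/path/to/datafile.sdf
-e
BIO
-pt
2
-nr
5
EOF

# Dry run: print the command; remove `echo` to execute it.
# Options that differ per run are given after the @file, as in the synopsis.
echo java -jar "cpsign-[version].jar" train "@$PARAMS_FILE" \
  --model-out /tmp/Chang_BIO.cpsign \
  --model-name Chang_BIO
```

Keeping the shared options in one file makes it easier to repeat a run with only the output-related options changed.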
Example Usage (ACP regression)
> java -jar cpsign-[version].jar train \
--license /path/to/Standard-license.license \
-td sdf /path/to/datafile.sdf \
-e "BIO" \
-nr 5 \
-i liblinear \
--model-out /tmp/Chang_BIO.cpsign \
--model-name Chang_BIO \
-pt 2
Running with Standard License registered to [Name] at [Company]. Expiry
date is [Date]
Reading train file and performing signature generation..
Successfully parsed 34 molecules. Generated 286 new signatures.
Training ACP Regression predictor with 5 models
- Trained model 1/5
- Trained model 2/5
- Trained model 3/5
- Trained model 4/5
- Trained model 5/5
Saving model to file..
Finished model saved at:
/private/tmp/Chang_BIO.cpsign
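The --splits option described in the usage menu can be used to train each split in a separate run, for instance on different machines. The following is a minimal sketch of that idea, reusing the placeholders from the example above. The seed value and the per-split output file names are hypothetical, and how the resulting files are combined afterwards is not covered on this page. The commands are only printed, not executed.

```shell
# Dry run: prints one train command per split; remove `echo` to execute them.
# ASSUMPTIONS: the seed value and the per-split output file names are hypothetical.
print_split_commands() {
  nr_models=$1   # total number of models (-nr); split indexes run from 1 to this value
  seed=$2        # fixed seed, must be identical for every split with a folded strategy
  i=1
  while [ "$i" -le "$nr_models" ]; do
    echo java -jar "cpsign-[version].jar" train \
      --license /path/to/Standard-license.license \
      -td sdf /path/to/datafile.sdf \
      -e BIO \
      -pt 2 \
      -i liblinear \
      -ss folded \
      -nr "$nr_models" \
      --seed "$seed" \
      --splits "$i" \
      --model-out "/tmp/Chang_BIO_split_$i.cpsign" \
      --model-name Chang_BIO
    i=$((i + 1))
  done
}

print_split_commands 5 42
```

Every command uses the same --seed and the same -nr value, so that each run derives the same folds and only the trained split index differs.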
“Exclusive” datasets
For the parameters --model-data and --calibration-data the same holds as for the precompute command, see Exclusive datasets.
Important performance note
By default, CPSign computes percentiles (see Molecule Gradient) when training a model. In most cases this is the most time-consuming part of training, because it requires a large number of predictions. Percentiles are only needed for rendering images or calculating molecule gradients, so if you do not intend to use either with the trained model, you can save considerable runtime by setting the --percentiles flag to 0. Alternatively, you can lower the number of molecules used for calculating the percentiles below the default of 1000. With LibLinear the impact on runtime is likely to be smaller, because its predictions are fast.
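As an illustration, the ACP regression example from this page could be run without percentile computation as sketched below. The jar name, paths and endpoint are the placeholders from that example, and the command is only printed, not executed.

```shell
# Dry run: prints the command instead of executing it; remove `echo` to run it.
# The jar name, license path and data path are placeholders, not real files.
train_without_percentiles() {
  echo java -jar "cpsign-[version].jar" train \
    --license /path/to/Standard-license.license \
    -td sdf /path/to/datafile.sdf \
    -e BIO \
    -pt 2 \
    -nr 5 \
    -i liblinear \
    --percentiles 0 \
    --model-out /tmp/Chang_BIO.cpsign \
    --model-name Chang_BIO
}

train_without_percentiles
```

A model trained this way has no percentiles, so it is not suited for rendering images or calculating molecule gradients later on.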
Nonconformity Measures
The available nonconformity measures are described in the section Nonconformity measures.