Input formats in CPSign
Numerical file format
CPSign loads and stores numerical data in the sparse LibSVM/LibLinear file format, with one record per line:
<value> <index>:<occurrences> <index>:<occurrences> ..
<value> <index>:<occurrences> <index>:<occurrences> ..
..
Note that <index> must start at 1, not 0, to conform with LibLinear and LibSVM requirements.
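The line layout above can be illustrated with a minimal Python sketch. This is not CPSign's own code; the helper names `to_libsvm_line` and `parse_libsvm_line` are hypothetical and only show how a record maps to and from a `<value> <index>:<occurrences>` line with 1-based indices. The sketch assumes indices are written in ascending order and that zero counts are left out, as is customary for this sparse format.

```python
def to_libsvm_line(value, features):
    """Format one record as '<value> <index>:<occurrences> ...'.

    `features` maps 1-based feature indices to occurrence counts.
    (Hypothetical helper, not part of CPSign.)
    """
    if any(index < 1 for index in features):
        raise ValueError("feature indices must start at 1")
    # Write indices in ascending order and skip zero counts (sparse format).
    pairs = [
        f"{index}:{count}"
        for index, count in sorted(features.items())
        if count != 0
    ]
    return " ".join([str(value)] + pairs)


def parse_libsvm_line(line):
    """Parse one line back into (value, {index: occurrences})."""
    fields = line.split()
    value = float(fields[0])
    features = {}
    for field in fields[1:]:
        index_text, count_text = field.split(":")
        index = int(index_text)
        if index < 1:
            raise ValueError("feature indices must start at 1")
        features[index] = float(count_text)
    return value, features


line = to_libsvm_line(1, {3: 2, 1: 1, 7: 0})
print(line)
print(parse_libsvm_line(line))
```

Rejecting index 0 in both directions mirrors the requirement stated above, so a file produced with 0-based indices fails early instead of silently shifting features.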
CSV file format
CPSign supports CSV files in a flexible manner: the separator character and other parameters that differ between CSV variants can be specified. Molecules should be encoded as SMILES strings, and the CSV must contain a header row so that the SMILES column can be located. At least one header must contain “smiles” (case insensitive); the first header containing “smiles” is taken as the SMILES column. Example of a supported CSV file, using tab as delimiter:
SMILES Sample_ID Activity Additional_Notes
OC(=O)\C=C/C(O)=O.C[C@]12CC=C3[C@@H](CCC4=CC(=O)C=C[C@]34C)[C@@H]1CC[C@@H]2C(=O)CN1CCN(CC1)C1=NC(=NC(=C1)N1CCCC1)N1CCCC1 NCGC00261900-01 POS Here's some additional information
[Na+].NC1=NC=NC2=C1N=C(Br)N2C1OC2CO[P@]([O-])(=O)O[C@@H]2C1O NCGC00260869-01 NEG More notes
O=C1N2CCC3=C(NC4=C3C=CC=C4)C2=NC2=C1C=CC=C2 NCGC00261776-01 NEG
Cl.FC1=CC=C(C=C1)C(OCCCC1=CNC=N1)C1=CC=C(F)C=C1 NCGC00261380-01 POS
CC1=CC=C(C=C1)S(=O)(=O)N[C@@H](CC1=CC=CC=C1)C(=O)CCl NCGC00261842-01 NEG Not all lines need to contain the additional notes
...
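The header rule described above (the first header containing “smiles”, case insensitive) can be sketched in Python as follows. This is an illustration of the rule only, not CPSign's implementation; `find_smiles_column` is a hypothetical name, and the sample data reuses two rows from the example above with tabs written explicitly.

```python
import csv
import io


def find_smiles_column(header):
    """Return the position of the first header containing 'smiles'.

    The match is case-insensitive. (Hypothetical helper, not part of CPSign.)
    """
    for position, name in enumerate(header):
        if "smiles" in name.lower():
            return position
    raise ValueError("no header containing 'smiles' found")


# Tab-delimited sample; the notes column is missing on both rows.
sample = (
    "SMILES\tSample_ID\tActivity\tAdditional_Notes\n"
    "O=C1N2CCC3=C(NC4=C3C=CC=C4)C2=NC2=C1C=CC=C2\tNCGC00261776-01\tNEG\n"
    "Cl.FC1=CC=C(C=C1)C(OCCCC1=CNC=N1)C1=CC=C(F)C=C1\tNCGC00261380-01\tPOS\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)
smiles_col = find_smiles_column(header)
smiles_list = [row[smiles_col] for row in reader]
print(smiles_col, smiles_list)
```

Because the first matching header wins, a file with both a `Canonical_SMILES` and a `SMILES` column would use whichever of the two appears first in the header row.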
SMILES as single molecule
The predict command can predict single molecules using the --smiles
flag. The flag takes a text string that must start with a valid SMILES,
optionally followed by a whitespace character (tab or space) and an identifier.
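As a rough illustration of the string layout accepted by --smiles, the Python sketch below splits such a string into its SMILES part and optional identifier. The function name `split_smiles_argument` is hypothetical, and the sketch does not check that the first token is a chemically valid SMILES; it only separates the two parts at the first run of whitespace.

```python
def split_smiles_argument(text):
    """Split '<SMILES>[<whitespace><identifier>]' into (smiles, identifier).

    The identifier is None when only a SMILES is given.
    (Hypothetical helper, not part of CPSign; no SMILES validation is done.)
    """
    # Split once on the first run of whitespace (space or tab).
    parts = text.strip().split(None, 1)
    if not parts:
        raise ValueError("empty input: a SMILES string is required")
    smiles = parts[0]
    identifier = parts[1] if len(parts) > 1 else None
    return smiles, identifier


print(split_smiles_argument("SC1=C(C(F)(F)F)C=CC=C1 molecule-1"))
print(split_smiles_argument("SC1=C(C(F)(F)F)C=CC=C1"))
```

Splitting only once keeps any further spaces inside the identifier, so an identifier made of several words stays in one piece.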
JSON file format
CPSign supports a JSON input format. The top level must be a JSON array (meaning that the first character must be an opening square bracket “[”). Each element of the array is one record, and each record must include a key-value pair holding the SMILES of the molecule. The key of this pair must be “SMILES”, “smiles” or “Smiles”. Here are some examples of the file format (the file does not need to be properly indented).
Example classification JSON file:
[
{
"cdk:Title" : "1728-95-6",
"Ames test categorisation" : "mutagen",
"smiles" : "C1(=C(C=2C=CC=CC2)N=C(N1)C3=CC=C(OC)C=C3)C=4C=CC=CC4"
},
{
"cdk:Title" : "91-08-7",
"Ames test categorisation" : "mutagen",
"smiles" : "C=1(C(=C(C=CC1)N=C=O)C)N=C=O"
},
..
]
Example regression JSON file:
[
{
"BIO" : "0.43",
"comment" : "This is a comment",
"smiles" : "SC1=C(C(F)(F)F)C=CC=C1"
},
{
"BIO" : "1.60",
"comment" : "Comment for second molecule",
"smiles" : "SC1=C(C(F)(F)F)C=C([N+]([O-])=O)C=C1"
},
..
]
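The structural requirements illustrated by the examples above (a top-level array, and a SMILES key in every record) can be checked before handing a file to CPSign. The following Python sketch is one possible pre-check, not CPSign's own parser; `load_records` and `smiles_of` are hypothetical names, and the sample text is taken from the regression example.

```python
import json

# The three key spellings accepted for the SMILES field.
SMILES_KEYS = ("SMILES", "smiles", "Smiles")


def smiles_of(record):
    """Return the SMILES value of a record, trying each accepted key."""
    for key in SMILES_KEYS:
        if key in record:
            return record[key]
    raise KeyError("record has no SMILES key")


def load_records(text):
    """Parse JSON text and verify the layout described above.

    (Hypothetical pre-check, not part of CPSign.)
    """
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("top level must be a JSON array")
    for position, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {position} is not a JSON object")
        if not any(key in record for key in SMILES_KEYS):
            raise ValueError(f"record {position} has no SMILES key")
    return records


sample = """[
  {"BIO": "0.43", "comment": "This is a comment",
   "smiles": "SC1=C(C(F)(F)F)C=CC=C1"},
  {"BIO": "1.60", "comment": "Comment for second molecule",
   "smiles": "SC1=C(C(F)(F)F)C=C([N+]([O-])=O)C=C1"}
]"""

for record in load_records(sample):
    print(smiles_of(record), record["BIO"])
```

Reporting the position of the first offending record makes it easier to locate a missing or misspelled key in a large file.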
Compression
CPSign automatically reads files compressed in GZIP format; there is no need to decompress these files first.
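For reference, a GZIP-compressed input file can be produced with Python's standard `gzip` module, as in the sketch below. The file name `molecules.csv.gz`, the two-line sample content, and both helper names are made up for illustration. How CPSign itself detects compression (by file extension or by content) is not stated here, so the magic-byte check is only a generic way to confirm that a file is GZIP data.

```python
import gzip
import os
import tempfile

# Every GZIP stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def write_gzipped(path, text):
    """Write text to a GZIP-compressed file (hypothetical helper)."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(text)


def is_gzipped(path):
    """Return True if the file starts with the GZIP magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


with tempfile.TemporaryDirectory() as tmp:
    # Made-up file name and content, for illustration only.
    path = os.path.join(tmp, "molecules.csv.gz")
    write_gzipped(path, "SMILES\tActivity\nCCO\tNEG\n")
    print(is_gzipped(path))
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        print(handle.read())
```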