Protocol dataset

This protocol automates a repeated calculation performed on a predefined data set of systems and calculates statistics over the set. The data set is described by a YAML file containing the definitions of the systems, the general setup of the calculations to be performed (most importantly the protocol to be applied to each item), and the reference values. Some data sets are provided with Cuby; user-defined data sets can be used by providing a valid path to a YAML file instead of the name of a predefined data set.

The entries in the data sets can be divided into groups and individually tagged. Only part of the data set can be calculated; the selection is defined by the keywords dataset_select_... and dataset_skip_....
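The name-based selection used in the examples on this page (dataset_select_name: "^0[1-4]") reads as a regular expression applied to the item names. The Python sketch below illustrates that filtering idea only; it assumes plain regular-expression matching and is not Cuby's own code. The item names are copied from the A24 example output on this page, and select_by_name is a hypothetical helper.

```python
import re

def select_by_name(names, pattern):
    """Keep the names in which the regular expression `pattern` is found."""
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]

# Item names copied from the A24 example output (a subset)
names = [
    "01 water ... ammonia",
    "02 water dimer",
    "03 HCN dimer",
    "04 HF dimer",
    "05 ammonia dimer",
    "10 water ... ethene",
]

# "^0[1-4]" matches names that start with 01, 02, 03 or 04
selected = select_by_name(names, "^0[1-4]")
print(selected)
```

Because the pattern is anchored at the start of the name, "10 water ... ethene" is not selected even though it contains the digits 1 and 0.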

The individual calculations can be executed in parallel to reduce the overall time.

Important information!

The R160x6 data set contained wrong reference values and has been withdrawn from Cuby until the issue is fixed.

Data sets available

By default, Cuby contains the following data sets:

Non-Covalent Interactions Atlas data sets

New, large data sets from the Non-Covalent Interactions Atlas project.

NCIA_D1200        London dispersion in an extended chemical space [51]
NCIA_D442x10      London dispersion in an extended chemical space, 10-point dissociation curves [52]
NCIA_HB300SPXx10  CCSD(T)/CBS interaction energies of H-bonds featuring S, P and halogens, 10-point dissociation curves [53]
NCIA_HB375x10     CCSD(T)/CBS interaction energies of H-bonds and decoys, 10-point dissociation curves [54]
NCIA_IHB100x10    CCSD(T)/CBS interaction energies of ionic H-bonds, 10-point dissociation curves [55]
NCIA_Rep739x5     CCSD(T)/CBS interaction energies for repulsive contacts in extended chemical space [56]
NCIA_SH250x10     Sigma-hole interactions, 10-point dissociation curves [57]

Other data sets

3B69             CCSD(T)/CBS three-body energies in 23x3 trimers [1]
3B69_dimers      All dimers from the 3B69 set of trimers [2]
A24              Accurate CCSD(T)/CBS interaction energies in small noncovalent complexes [3]
Bauza2013        Halogen, chalcogen and pnicogen bonds [4]
Charge_transfer  CCSD(T)/CBS interaction energies in charge-transfer complexes [5][6]
HB104            Diverse set of hydrogen bonds of O and N in organic molecules [46][47]
Ionic_H-bonds    Ionic hydrogen bonds - dissociation curves [48]
L7               CCSD(T) or QCISD(T) interaction energies in large noncovalent complexes [49]
MPCONF196        Conformation energies of peptides and macrocyclic compounds [50]
Pecina2015       Chalcogen and pnicogen bonds of heteroboranes [58]
Peptide_FGG      CCSD(T)/CBS conformation energies of FGG tripeptide [59]
Peptide_GFA      CCSD(T)/CBS conformation energies of GFA tripeptide [60]
Peptide_GGF      CCSD(T)/CBS conformation energies of GGF tripeptide [61]
Peptide_WG       CCSD(T)/CBS conformation energies of WG dipeptide [62]
Peptide_WGG      CCSD(T)/CBS conformation energies of WGG tripeptide [63]
PLFrag547        PLFrag547 - Protein-ligand fragments [64]
R160x6           Repulsive intermolecular contacts in organic molecules [65]
S12L             Interaction energies in large noncovalent complexes derived from experiment [66]
S66              CCSD(T)/CBS interaction energies in organic noncovalent complexes [67][68]
S66a8            CCSD(T)/CBS interaction energies in organic noncovalent complexes - angular displacements [69]
S66x8            CCSD(T)/CBS interaction energies in organic noncovalent complexes - dissociation curves [70]
Sulfur_x8        CCSD(T)/CBS interaction energies in complexes featuring sulfur [71]
W4-17            High-level theoretical atomization energies [72]
X40              CCSD(T)/CBS interaction energies of halogenated molecules [73]
X40x10           CCSD(T)/CBS interaction energies of halogenated molecules - dissociation curves [74]

GMTKN30 data sets

The GMTKN30 collection of data sets by S. Grimme is available in Cuby. The original data were converted automatically to the format Cuby uses; as a result, the data sets lack some conveniences such as descriptive names of the systems.

We have validated the GMTKN data sets against the original DFT results by Grimme (with the exception of G21EA and WATER27, for which the published data were calculated in a modified basis set). Only in the SIE11 data set is there one point (the last entry) where our result does not agree with Grimme's DFT data (but is closer to the reference).

GMTKN_ACONF     relative energies of alkane conformers [7]
GMTKN_ADIM6     interaction energies of n-alkane dimers [8]
GMTKN_AL2X      dimerization energies of AlX3 compounds [9]
GMTKN_ALK6      fragmentation and dissociation reactions of alkaline and alkaline−cation−benzene complexes [10]
GMTKN_BH76      barrier heights of hydrogen transfer, heavy atom transfer, nucleophilic substitution, unimolecular, and association reactions [11][12]
GMTKN_BH76RC    reaction energies of the BH76 set [13][14]
GMTKN_BHPERI    barrier heights of pericyclic reactions [15]
GMTKN_BSR36     bond separation reactions of saturated hydrocarbons [16][17]
GMTKN_CYCONF    relative energies of cysteine conformers [18]
GMTKN_DARC      reaction energies of Diels−Alder reactions [19]
GMTKN_DC9       nine difficult cases for DFT [20]
GMTKN_G21EA     adiabatic electron affinities [21]
GMTKN_G21IP     adiabatic ionization potentials [22]
GMTKN_G2RC      reaction energies of selected G2-97 systems [23]
GMTKN_HEAVY28   noncovalent interaction energies between heavy element hydrides [24]
GMTKN_IDISP     intramolecular dispersion interactions [25][26]
GMTKN_ISO34     isomerization energies of small and medium-sized organic molecules [27]
GMTKN_ISOL22    isomerization energies of large organic molecules [28]
GMTKN_MB08-165  decomposition energies of artificial molecules [29][30]
GMTKN_NBPRC     oligomerizations and H2 fragmentations of NH3-BH3 systems; H2 activation reactions with PH3-BH3 systems [31][32]
GMTKN_O3ADD6    reaction energies, barrier heights, association energies for addition of O3 to C2H4 and C2H2 [33]
GMTKN_PA        adiabatic proton affinities [34][35]
GMTKN_PCONF     relative energies of phenylalanyl−glycyl−glycine tripeptide conformers [36]
GMTKN_RG6       interaction energies of rare gas dimers [37]
GMTKN_RSE43     radical stabilization energies [38]
GMTKN_S22       binding energies of noncovalently bound dimers [39][40]
GMTKN_SCONF     relative energies of sugar conformers [41][42]
GMTKN_SIE11     self-interaction error related problems [43]
GMTKN_W4-08     atomization energies of small molecules [44]
GMTKN_WATER27   binding energies of water, H+(H2O)n and OH−(H2O)n clusters [45]

Calculation setup: All the entries in the GMTKN data sets are calculated using the reaction protocol. Because of this, the calculation setup must be provided in a separate block in the input named 'calculation' rather than at root level. Here is an example:

job: dataset
dataset: GMTKN_PCONF

calculation:
  job: energy
  interface: mopac
  method: pm6

Alternative reference values

The data set definition file may contain additional sets of reference values, such as energies calculated with other methods or the results of an energy decomposition. These may include later, more accurate recalculations of the benchmark values; the main reference comes from the original publication in which the data set was introduced (unless explicitly noted). These additional data are not yet covered in the documentation but can be found in the data set files.

To use the alternative reference data, set the keyword dataset_reference.

Custom data sets

Use an existing data set file (located in cuby4/data/datasets) as a template. The file can be located anywhere; just provide a valid path to it in the dataset keyword. The default data sets use geometries from Cuby's library, but geometry files can be used as well: the record 'geometry' in the data set file is treated the same as the geometry keyword.

Ad-hoc data sets

A simple data set calculation can be run directly on a set of geometry files by setting the dataset keyword to the value 'from_files'. Here is an example:

job: dataset
dataset: from_files

# Selection of geometry files to be used, shell wildcards allowed
dataset_from_files: "*.xyz"
# What protocol to use for the items
dataset_from_files_job: energy
# Optionally, reference energies can be read from a table
dataset_from_files_reference: "energies.txt"

interface: mopac
method: pm6
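
The wildcard in dataset_from_files follows shell conventions. As an illustration only, the Python sketch below uses the standard fnmatch module to show which file names a pattern such as "*.xyz" would pick up. The file names are made up, and how Cuby expands the pattern internally is not described here.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing
files = ["water_dimer.xyz", "benzene.xyz", "energies.txt", "notes.xyz.bak"]

# Same shell-style pattern as in the input above
pattern = "*.xyz"

# Keep only the names that match the whole pattern
geometries = sorted(f for f in files if fnmatchcase(f, pattern))
print(geometries)
```

The pattern must match the entire file name, so "notes.xyz.bak" is left out.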

Input structure

Optionally, following blocks can be defined in the input:

Keywords used

Keywords specific for this protocol:

Other keywords used by this protocol:


The following examples, along with all other files needed to run them, can be found in the directory cuby4/protocols/dataset/examples

# Dataset example 1: Calculation on a predefined data set

job: dataset

# Dataset selection
# Predefined data set is used, only the name of the set has to be provided
dataset: A24

# Calculation setup
# The interface and method of the calculation are specified; the appropriate
# protocol (in this case interaction energy calculation) is chosen for each
# data set automatically

interface: mopac
method: pm6

Produces output:

      / /      / 
     / / Cuby /   Dataset calculation
name                                             E      Eref     error  error(%)
01 water ... ammonia                        -3.904    -6.493     2.590    39.879
02 water dimer                              -3.922    -5.006     1.084    21.653
03 HCN dimer                                -2.537    -4.745     2.208    46.535
04 HF dimer                                  3.515    -4.581     8.096   176.722
05 ammonia dimer                            -2.333    -3.137     0.804    25.624
06 HF ... methane                           -0.336    -1.654     1.318    79.664
07 ammonia ... methane                      -0.544    -0.765     0.221    28.895
08 water ... methane                        -0.505    -0.663     0.158    23.836
09 formaldehyde dimer                       -3.788    -4.554     0.766    16.826
10 water ... ethene                         -1.272    -2.557     1.285    50.269
11 formaldehyde ... ethene                  -0.614    -1.621     1.007    62.145
12 ethyne dimer                             -0.463    -1.524     1.061    69.609
13 ammonia ... ethene                       -0.756    -1.374     0.618    44.996
14 ethene dimer                             -0.307    -1.090     0.784    71.884
15 methane ... ethene                       -0.176    -0.502     0.326    64.944
16 borane ... methane                       -1.124    -1.485     0.360    24.280
17 methane ... ethane                       -0.154    -0.827     0.673    81.353
18 methane ... ethane                       -0.129    -0.607     0.478    78.711
19 methane dimer                            -0.070    -0.533     0.463    86.895
20 Ar ... methane                            0.758    -0.405     1.162   287.292
21 Ar ... ethene                             0.511    -0.364     0.876   240.349
22 ethene ... ethyne                         0.128     0.821    -0.693   -84.379
23 ethene dimer                              0.149     0.934    -0.785   -84.047
24 ethyne dimer                              0.202     1.115    -0.913   -81.868
RMSE                1.951   kcal/mol
MUE                 1.197   kcal/mol
MSE                 0.998   kcal/mol
min                -0.913   kcal/mol
max                 8.096   kcal/mol
range               9.009   kcal/mol
min abs             0.158   kcal/mol
max abs             8.096   kcal/mol
RMSE              101.844   %
MUE                78.027   %
MSE                57.169   %
min               -84.379   %
max               287.292   %
range             371.671   %
min abs            16.826   %
max abs           287.292   %
H-bond         (5)  RMSE      3.974   MSE       2.956   kcal/mol
dispersion     (7)  RMSE      0.681   MSE       0.620   kcal/mol
other          (9)  RMSE      0.894   MSE       0.802   kcal/mol
stack          (3)  RMSE      0.802   MSE      -0.797   kcal/mol
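
The summary statistics in the output above can be reproduced from the error column. The following Python sketch recomputes RMSE, MUE and MSE from the 24 printed errors (error = E - Eref). Reading MUE as mean unsigned error and MSE as mean signed error, and the formulas used, are the conventional definitions assumed here rather than taken from Cuby's source. The inputs are rounded to three decimals, so agreement with the printed values is expected only to about that precision.

```python
import math

# Signed errors (kcal/mol) copied from the "error" column of the A24 output
errors = [
    2.590, 1.084, 2.208, 8.096, 0.804, 1.318, 0.221, 0.158,
    0.766, 1.285, 1.007, 1.061, 0.618, 0.784, 0.326, 0.360,
    0.673, 0.478, 0.463, 1.162, 0.876, -0.693, -0.785, -0.913,
]

n = len(errors)
mse = sum(errors) / n                              # mean signed error
mue = sum(abs(e) for e in errors) / n              # mean unsigned error
rmse = math.sqrt(sum(e * e for e in errors) / n)   # root mean square error

print(f"RMSE {rmse:.3f}  MUE {mue:.3f}  MSE {mse:.3f} kcal/mol")
print(f"min {min(errors):.3f}  max {max(errors):.3f}  "
      f"range {max(errors) - min(errors):.3f}")
```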

# Dataset example 2: selections and plotting

job: dataset

# Dataset selection
dataset: S66x8 # Dissociation curves for the S66 data set

# Selection
# select only pi-pi dispersion-bound complexes
dataset_select_tag: "dispersion p-p"

# Plotting
# Plot the dissociation curves using gnuplot and merge the images into one file
# with four columns. This requires two external tools to be installed: gnuplot
# and ImageMagick.
dataset_save_plots: gnuplot_tiled
dataset_plot_columns: 4

# Calculation setup
interface: mopac
method: pm6

# Dataset example 3: Custom calculation of each item

# By default, the data set contains information on what calculation protocol
# is applied to each of its items. In this example, we use the S66 data set
# where the calculated quantity is interaction energy in a fixed geometry.

# This example shows how to override that and perform a custom calculation,
# in this case optimizing the geometry of the complex with the
# tested method before the interaction energy is calculated. Additionally,
# the change of the geometry is measured as RMSD and printed.

job: dataset
dataset: S66
dataset_select_name: "^0[1-4]" # only first four items from the data set are used

# The block calculation_overwrite allows defining a custom calculation that is
# performed instead of the default one
calculation_overwrite:
  # The multistep protocol allows running the optimization followed by
  # interaction energy calculation.
  # The multistep protocol returns the result of the last calculation
  # which, in this case yields the quantity we are looking for,
  # the interaction energy.
  job: multistep
  steps: clean, opt, rmsd, int

  # Common setup for all calculations
  calculation_common:
    interface: mopac
    method: pm6

  # Cleanup: remove the old optimized geometry
  calculation_clean:
    job: shell_script
    shell_commands: "rm -f"

  # Optimize geometry of each item in data set
  calculation_opt:
    job: optimize
    geometry: parent_block # The geometry is defined one level above
    opt_quality: 0.1
    optimizer: lbfgs
    optimize_print: steps_as_dots # Simplified printing of steps

  # Calculation of RMSD upon optimization
  calculation_rmsd:
    job: geometry
    geometry_action: rmsd_fit
    geometry: parent_block

  # Calculate interaction energy in the optimized geometry
  calculation_int:
    job: interaction

# Dataset example 4: Combining multiple data sets

# While it is not possible to combine multiple data sets within the data set
# protocol, the multistep protocol can be used to achieve this. The final
# result, in this case the root mean square error over the two data sets, has
# to be calculated from the output of the two steps via a user-defined expression.

# Two data sets are calculated separately using the multistep protocol
job: multistep
steps: set1, set2

# The calculation setup is the same for both steps
calculation_common:
  job: dataset
  interface: mopac
  method: pm6

calculation_set1:
  dataset: s66

calculation_set2:
  dataset: x40

# Calculating the final error in the two data sets requires knowledge of the
# structure of the objects containing the results of the individual steps.
# The overall RMSE cannot be obtained from the RMSEs of the two data sets, but
# it can be calculated from the sums of squares as follows:
multistep_result_expression: "((steps['set1'].errors.sumsq + steps['set2'].errors.sumsq)/(steps['set1'].errors.count + steps['set2'].errors.count))**0.5"
# The final results can have an arbitrary name which will be printed in the output
multistep_result_name: "RMSE"

# If more than one combined result is needed, they can be evaluated in a custom
# code inserted into the input using the keyword multistep_result_eval.
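
Why the RMSEs of the two sets cannot simply be averaged can be seen in a small numerical sketch. The Python snippet below mirrors the expression used in example 4 with made-up error lists; the values and the helper names sumsq and rmse are illustrative and are not Cuby objects. Pooling the sums of squares and the counts gives the same result as computing the RMSE over the concatenated errors, while the plain average of the two RMSEs differs.

```python
import math

def sumsq(errors):
    """Sum of squared errors."""
    return sum(e * e for e in errors)

def rmse(errors):
    """Root mean square error of one list."""
    return math.sqrt(sumsq(errors) / len(errors))

# Made-up signed errors for two data sets of different size
set1 = [0.5, -1.0, 2.0]
set2 = [0.1, 0.2, -0.3, 0.4, 3.0]

# Same structure as the multistep_result_expression in example 4
combined = math.sqrt((sumsq(set1) + sumsq(set2)) / (len(set1) + len(set2)))

# Reference: RMSE over all errors taken together
pooled = rmse(set1 + set2)

# Naive average of the two RMSEs, shown for comparison
naive = (rmse(set1) + rmse(set2)) / 2

print(combined, pooled, naive)
```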

# Dataset example 5: Datasets from the GMTKN database

# All the data sets from the GMTKN database are calculated using the protocol
# "reaction". This protocol is set up automatically, but its use has one
# implication: the setup for the computational method should not be provided
# at the root level of the input, but in a block "calculation".

job: dataset
dataset: GMTKN_PCONF

# Unlike other data sets, the method is specified in a separate block:
calculation:
  job: energy
  interface: mopac
  method: pm6