Protocol dataset

This protocol automates a repeated calculation performed on a predefined data set of systems and calculates the statistics over the set. The data set is described by a YAML file containing the definition of the systems, the general setup of the calculations to be performed (most importantly the protocol to be applied to each item) and the reference values. Some data sets are provided with Cuby; user-defined data sets can be used by providing a valid path to a YAML file instead of the name of a predefined data set.

The entries in a data set can be divided into groups and individually tagged. It is possible to calculate only a part of the data set; the selection is defined by the keywords dataset_select_... and dataset_skip_....

The individual calculations can be executed in parallel to reduce the overall time.

Important information!

The R160x6 data set contained incorrect reference values and has been withdrawn from Cuby until the issue is fixed.

Data sets available

By default, Cuby contains the following data sets:

Non-Covalent Interactions Atlas data sets

New, large data sets from the Non-Covalent Interactions Atlas project.

NCIA_D1200        London dispersion in an extended chemical space[118]
NCIA_D442x10      London dispersion in an extended chemical space, 10-point dissociation curves[119]
NCIA_HB300SPXx10  CCSD(T)/CBS interaction energies of H-bonds featuring S, P and halogens, 10-point dissociation curves[120]
NCIA_HB375x10     CCSD(T)/CBS interaction energies of H-bonds and decoys, 10-point dissociation curves[121]
NCIA_IHB100x10    CCSD(T)/CBS interaction energies of ionic H-bonds, 10-point dissociation curves[122]
NCIA_Rep739x5     CCSD(T)/CBS interaction energies for repulsive contacts in extended chemical space[123]
NCIA_SH250x10     Sigma-hole interactions, 10-point dissociation curves[124]

Other data sets

3B69             CCSD(T)/CBS three-body energies in 23x3 trimers[1]
3B69_dimers      All dimers from the 3B69 set of trimers[2]
A24              Accurate CCSD(T)/CBS interaction energies in small noncovalent complexes[3]
Bauza2013        Halogen, chalcogen and pnicogen bonds[4]
Charge_transfer  CCSD(T)/CBS interaction energies in charge-transfer complexes[5][6]
Dipoles152       Benchmark CCSD(T)/CBS dipole moments in fixed equilibrium geometries[7]
HB104            Diverse set of hydrogen bonds of O and N in organic molecules[113][114]
Ionic_H-bonds    Ionic hydrogen bonds - dissociation curves[115]
L7               CCSD(T) or QCISD(T) interaction energies in large noncovalent complexes[116]
MPCONF196        Conformation energies of peptides and macrocyclic compounds[117]
Pecina2015       Chalcogen and pnicogen bonds of heteroboranes[125]
Peptide_FGG      CCSD(T)/CBS conformation energies of FGG tripeptide[126]
Peptide_GFA      CCSD(T)/CBS conformation energies of GFA tripeptide[127]
Peptide_GGF      CCSD(T)/CBS conformation energies of GGF tripeptide[128]
Peptide_WG       CCSD(T)/CBS conformation energies of WG dipeptide[129]
Peptide_WGG      CCSD(T)/CBS conformation energies of WGG tripeptide[130]
PLFrag547        Protein-ligand fragments[131]
R160x6           Repulsive intermolecular contacts in organic molecules[132]
S12L             Interaction energies in large noncovalent complexes derived from experiment[133]
S66              CCSD(T)/CBS interaction energies in organic noncovalent complexes[134][135]
S66a8            CCSD(T)/CBS interaction energies in organic noncovalent complexes - angular displacements[136]
S66x8            CCSD(T)/CBS interaction energies in organic noncovalent complexes - dissociation curves[137]
Sulfur_x8        CCSD(T)/CBS interaction energies in complexes featuring sulfur[138]
W4-17            High-level theoretical atomization energies[139]
X40              CCSD(T)/CBS interaction energies of halogenated molecules[140]
X40x10           CCSD(T)/CBS interaction energies of halogenated molecules - dissociation curves[141]

GMTKN55 data sets

The GMTKN55 collection of data sets by S. Grimme is available in Cuby. The original data were converted automatically to the format Cuby uses; as a result, the data sets lack some conveniences such as descriptive names of the systems. The conversion was validated by comparing calculations in Cuby to the DFT results from the original paper; in all data sets, no or only negligible differences were observed.

GMTKN55_ACONF      Relative energies of alkane conformers[47]
GMTKN55_ADIM6      Interaction energies of n-alkane dimers[48]
GMTKN55_AHB21      Interaction energies in anion–neutral dimers[49]
GMTKN55_AL2X6      Dimerisation energies of AlX3 compounds[50]
GMTKN55_ALK8       Dissociation and other reactions of alkaline compounds[51]
GMTKN55_ALKBDE10   Dissociation energies in group-1 and -2 diatomics[52]
GMTKN55_Amino20x4  Relative energies in amino acid conformers[53]
GMTKN55_BH76       Barrier heights of hydrogen transfer, heavy atom transfer, nucleophilic substitution, unimolecular and association reactions[54]
GMTKN55_BH76RC     Reaction energies of the BH76[55]
GMTKN55_BHDIV10    Diverse reaction barrier heights[56]
GMTKN55_BHPERI     Barrier heights of pericyclic reactions[57]
GMTKN55_BHROT27    Barrier heights for rotation around single bonds[58]
GMTKN55_BSR36      Bond-separation reactions of saturated hydrocarbons[59]
GMTKN55_BUT14DIOL  Relative energies in butane-1,4-diol conformers[60]
GMTKN55_C60ISO     Relative energies between C60 isomers[61]
GMTKN55_CARBHB12   Hydrogen-bonded complexes between carbene analogues and H2O, NH3, or HCl[62]
GMTKN55_CDIE20     Double-bond isomerisation energies in cyclic systems[63]
GMTKN55_CHB6       Interaction energies in cation–neutral dimers[64]
GMTKN55_DARC       Reaction energies of Diels-Alder reactions[65]
GMTKN55_DC13       13 difficult cases for DFT methods[66][67]
GMTKN55_DIPCS10    Double-ionisation potentials of closed-shell systems[68]
GMTKN55_FH51       Reaction energies in various (in-)organic systems[69][70]
GMTKN55_G21EA      Adiabatic electron affinities[71]
GMTKN55_G21IP      Adiabatic ionization potentials[72]
GMTKN55_G2RC       Reaction energies of selected G2/97 systems[73]
GMTKN55_HAL59      Binding energies in halogenated dimers (incl. halogen bonds)[74][75]
GMTKN55_HEAVY28    Noncovalent interaction energies between heavy element hydrides[76]
GMTKN55_HEAVYSB11  Dissociation energies in heavy-element compounds[77]
GMTKN55_ICONF      Relative energies in conformers of inorganic systems[78]
GMTKN55_IDISP      Intramolecular dispersion interactions[79]
GMTKN55_IL16       Interaction energies in anion–cation dimers[80]
GMTKN55_INV24      Inversion/racemisation barrier heights[81]
GMTKN55_ISO34      Isomerisation energies of small and medium-sized organic molecules[82]
GMTKN55_ISOL24     Isomerisation energies of large organic molecules[83][84]
GMTKN55_MB16-43    Decomposition energies of artificial molecules[85]
GMTKN55_MCONF      Relative energies in melatonin conformers[86]
GMTKN55_NBPRC      Oligomerisations and H2 fragmentations of NH3/BH3 systems, H2 activation reactions with PH3/BH3 systems[87]
GMTKN55_PA26       Adiabatic proton affinities (incl. of amino acids)[88][89][90]
GMTKN55_PArel      Relative energies in protonated isomers[91]
GMTKN55_PCONF21    Relative energies in tri- and tetrapeptide conformers[92][93][94]
GMTKN55_PNICO23    Interaction energies in pnicogen-containing dimers[95]
GMTKN55_PX13       Proton-exchange barriers in H2O, NH3, and HF clusters[96]
GMTKN55_RC21       Fragmentations and rearrangements in radical cations[97]
GMTKN55_RG18       Interaction energies in rare-gas complexes[98]
GMTKN55_RSE43      Radical-stabilisation energies[99]
GMTKN55_S22        Binding energies of noncovalently bound dimers[100]
GMTKN55_S66        Binding energies of noncovalently bound dimers[101]
GMTKN55_SCONF      Relative energies of sugar conformers[102]
GMTKN55_SIE4x4     Self-interaction-error related problems[103]
GMTKN55_TAUT15     Relative energies in tautomers[104]
GMTKN55_UPU23      Relative energies between RNA-backbone conformers[105][106]
GMTKN55_W4-11      Total atomisation energies[107]
GMTKN55_WATER27    Binding energies in (H2O)n, H+(H2O)n and OH-(H2O)n[108][109]
GMTKN55_WCPT18     Proton-transfer barriers in uncatalysed and water-catalysed reactions[110]
GMTKN55_YBDE18     Bond-dissociation energies in ylides[111][112]

Calculation setup: All the entries in the GMTKN55 data sets (and in GMTKN30, listed below) are calculated using the reaction protocol. Because of this, the calculation setup must be provided in a separate block named 'calculation' in the input rather than at the root level. Here is an example:

job: dataset
dataset: GMTKN_PCONF

calculation:
  job: energy
  interface: mopac
  method: pm6

GMTKN30 data sets

Although superseded by GMTKN55, the GMTKN30 data sets are kept in Cuby for backward compatibility. They were previously named just GMTKN. Please note that data sets with the same name may use different reference data in GMTKN30 and GMTKN55. The data sets were validated against the original DFT results by Grimme (with the exception of G21EA and WATER27, for which the published data were calculated in a modified basis set). Only in the SIE11 data set is there one point (the last entry) where our result does not agree with Grimme's DFT data (but is closer to the reference).

GMTKN30_ACONF     relative energies of alkane conformers[8]
GMTKN30_ADIM6     interaction energies of n-alkane dimers[9]
GMTKN30_AL2X      dimerization energies of AlX3 compounds[10]
GMTKN30_ALK6      fragmentation and dissociation reactions of alkaline and alkaline−cation−benzene complexes[11]
GMTKN30_BH76      barrier heights of hydrogen transfer, heavy atom transfer, nucleophilic substitution, unimolecular, and association reactions[12][13]
GMTKN30_BH76RC    reaction energies of the BH76 set[14][15]
GMTKN30_BHPERI    barrier heights of pericyclic reactions[16]
GMTKN30_BSR36     bond separation reactions of saturated hydrocarbons[17][18]
GMTKN30_CYCONF    relative energies of cysteine conformers[19]
GMTKN30_DARC      reaction energies of Diels−Alder reactions[20]
GMTKN30_DC9       nine difficult cases for DFT[21]
GMTKN30_G21EA     adiabatic electron affinities[22]
GMTKN30_G21IP     adiabatic ionization potentials[23]
GMTKN30_G2RC      reaction energies of selected G2-97 systems[24]
GMTKN30_HEAVY28   noncovalent interaction energies between heavy element hydrides[25]
GMTKN30_IDISP     intramolecular dispersion interactions[26][27]
GMTKN30_ISO34     isomerization energies of small and medium-sized organic molecules[28]
GMTKN30_ISOL22    isomerization energies of large organic molecules[29]
GMTKN30_MB08-165  decomposition energies of artificial molecules[30][31]
GMTKN30_NBPRC     oligomerizations and H2 fragmentations of NH3-BH3 systems; H2 activation reactions with PH3-BH3 systems[32][33]
GMTKN30_O3ADD6    reaction energies, barrier heights, association energies for addition of O3 to C2H4 and C2H2[34]
GMTKN30_PA        adiabatic proton affinities[35][36]
GMTKN30_PCONF     relative energies of phenylalanyl−glycyl−glycine tripeptide conformers[37]
GMTKN30_RG6       interaction energies of rare gas dimers[38]
GMTKN30_RSE43     radical stabilization energies[39]
GMTKN30_S22       binding energies of noncovalently bound dimers[40][41]
GMTKN30_SCONF     relative energies of sugar conformers[42][43]
GMTKN30_SIE11     self-interaction error related problems[44]
GMTKN30_W4-08     atomization energies of small molecules[45]
GMTKN30_WATER27   binding energies of water, H+(H2O)n and OH−(H2O)n clusters[46]

Alternative reference values

The data set definition file may contain additional sets of reference values such as energies calculated with other methods or e.g. results of an energy decomposition. This may include later, more accurate recalculations of the benchmark values – the main reference comes from the original publication where the data set was introduced (unless explicitly noted). These additional data are not covered in the documentation yet but can be found in the data set files.

To use the alternative reference data, use the keyword dataset_reference.

Custom data sets

Use an existing data set file (located in cuby4/data/datasets) as a template. The file can be located anywhere; just provide a valid path to it in the dataset keyword. The default data sets use geometries from Cuby's library, but geometry files can be used as well: the record 'geometry' in the data set file is treated the same way as the geometry keyword.

Ad-hoc data sets

A simple data set calculation can be run directly on a set of geometry files by setting the dataset keyword to the value 'from_files'. Here is an example:

job: dataset
dataset: from_files

# Selection of geometry files to be used, shell wildcards allowed
dataset_from_files: "*.xyz"
# What protocol to use for the items
dataset_from_files_job: energy
# Optionally, reference energies can be read from a table
dataset_from_files_reference: "energies.txt"

interface: mopac
method: pm6

Input structure

Optionally, the following blocks can be defined in the input:

Keywords used

Keywords specific for this protocol:

Other keywords used by this protocol:

Examples

The following examples, along with all other files needed to run them, can be found in the directory cuby4/protocols/dataset/examples

#===============================================================================
# Dataset example 1: Calculation on a predefined data set
#===============================================================================

job: dataset

#-------------------------------------------------------------------------------
# Dataset selection
#-------------------------------------------------------------------------------
# Predefined data set is used, only the name of the set has to be provided
dataset: A24

#-------------------------------------------------------------------------------
# Calculation setup
#-------------------------------------------------------------------------------
# Interface and method of the calculation are specified; the appropriate protocol
# (in this case interaction energy calculation) is chosen for each dataset
# automatically

interface: mopac
method: pm6

Produces output:

        _______  
       /\______\ 
      / /      / 
     / / Cuby /   Dataset calculation
     \/______/   
                 
==========================================================================================
name                                             E      Eref     error  error(%)
------------------------------------------------------------------------------------------
01 water ... ammonia                        -3.904    -6.493     2.590    39.879
02 water dimer                              -3.922    -5.006     1.084    21.653
03 HCN dimer                                -2.537    -4.745     2.208    46.535
04 HF dimer                                  3.515    -4.581     8.096   176.722
05 ammonia dimer                            -2.333    -3.137     0.804    25.624
06 HF ... methane                           -0.336    -1.654     1.318    79.664
07 ammonia ... methane                      -0.544    -0.765     0.221    28.895
08 water ... methane                        -0.505    -0.663     0.158    23.836
09 formaldehyde dimer                       -3.788    -4.554     0.766    16.826
10 water ... ethene                         -1.272    -2.557     1.285    50.269
11 formaldehyde ... ethene                  -0.614    -1.621     1.007    62.145
12 ethyne dimer                             -0.463    -1.524     1.061    69.609
13 ammonia ... ethene                       -0.756    -1.374     0.618    44.996
14 ethene dimer                             -0.307    -1.090     0.784    71.884
15 methane ... ethene                       -0.176    -0.502     0.326    64.944
16 borane ... methane                       -1.124    -1.485     0.360    24.280
17 methane ... ethane                       -0.154    -0.827     0.673    81.353
18 methane ... ethane                       -0.129    -0.607     0.478    78.711
19 methane dimer                            -0.070    -0.533     0.463    86.895
20 Ar ... methane                            0.758    -0.405     1.162   287.292
21 Ar ... ethene                             0.511    -0.364     0.876   240.349
22 ethene ... ethyne                         0.128     0.821    -0.693   -84.379
23 ethene dimer                              0.149     0.934    -0.785   -84.047
24 ethyne dimer                              0.202     1.115    -0.913   -81.868
==========================================================================================
RMSE                1.951   kcal/mol
MUE                 1.197   kcal/mol
------------------------------------------------------------------------------------------
MSE                 0.998   kcal/mol
min                -0.913   kcal/mol
max                 8.096   kcal/mol
range               9.009   kcal/mol
min abs             0.158   kcal/mol
max abs             8.096   kcal/mol
==========================================================================================
RMSE              101.844   %
MUE                78.027   %
MSE                57.169   %
min               -84.379   %
max               287.292   %
range             371.671   %
min abs            16.826   %
max abs           287.292   %
==========================================================================================
H-bond         (5)  RMSE      3.974   MSE       2.956   kcal/mol
dispersion     (7)  RMSE      0.681   MSE       0.620   kcal/mol
other          (9)  RMSE      0.894   MSE       0.802   kcal/mol
stack          (3)  RMSE      0.802   MSE      -0.797   kcal/mol
==========================================================================================
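The summary statistics in this output can be reproduced from the E and Eref columns. The following Python sketch is not part of Cuby; it assumes the usual definitions (error = E - Eref, RMSE = root mean square error, MUE = mean unsigned error, MSE = mean signed error) and, for the percentage column, division by the absolute value of the reference, which is inferred from the signs in the table rather than taken from Cuby's source. The assumption that the first five entries form the "H-bond" group is likewise inferred from the printed group statistics. Because the input values are rounded to three decimals, the results agree with the printed ones only to about 0.001.

```python
import math

# (E, Eref) pairs in kcal/mol, copied from the A24 / PM6 output above.
results = [
    (-3.904, -6.493), (-3.922, -5.006), (-2.537, -4.745), (3.515, -4.581),
    (-2.333, -3.137), (-0.336, -1.654), (-0.544, -0.765), (-0.505, -0.663),
    (-3.788, -4.554), (-1.272, -2.557), (-0.614, -1.621), (-0.463, -1.524),
    (-0.756, -1.374), (-0.307, -1.090), (-0.176, -0.502), (-1.124, -1.485),
    (-0.154, -0.827), (-0.129, -0.607), (-0.070, -0.533), (0.758, -0.405),
    (0.511, -0.364), (0.128, 0.821), (0.149, 0.934), (0.202, 1.115),
]


def error_stats(errors):
    """Return the basic error statistics for a list of signed errors."""
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MUE": sum(abs(e) for e in errors) / n,
        "MSE": sum(errors) / n,
        "min": min(errors),
        "max": max(errors),
        "range": max(errors) - min(errors),
    }


# Signed error of each item: calculated value minus reference value.
errors = [e - ref for e, ref in results]
# Relative error in percent; dividing by |Eref| is an assumption (see above).
rel_errors = [100.0 * (e - ref) / abs(ref) for e, ref in results]

stats = error_stats(errors)
rel_stats = error_stats(rel_errors)
# Assumed group membership: the first five entries are the H-bonded complexes.
hbond_stats = error_stats(errors[:5])

for name, value in stats.items():
    print(f"{name:<8s}{value:10.3f}   kcal/mol")
```

Note that a single outlier (the HF dimer, with an error of about 8 kcal/mol) dominates the RMSE, which is why it is so much larger than the MUE here.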

#===============================================================================
# Dataset example 2: selections and plotting
#===============================================================================

job: dataset

#-------------------------------------------------------------------------------
# Dataset selection
#-------------------------------------------------------------------------------
dataset: S66x8 # Dissociation curves for the S66 data set

#-------------------------------------------------------------------------------
# Selection
#-------------------------------------------------------------------------------
# select only pi-pi dispersion-bound complexes
dataset_select_tag: "dispersion p-p"

#-------------------------------------------------------------------------------
# Plotting
#-------------------------------------------------------------------------------
# Plot the dissociation curves using gnuplot and merge the images to one file
# with four columns. This requires two external tools installed, gnuplot and
# imagemagick.
dataset_save_plots: gnuplot_tiled
dataset_plot_columns: 4

#-------------------------------------------------------------------------------
# Calculation setup
#-------------------------------------------------------------------------------
interface: mopac
method: pm6

#===============================================================================
# Dataset example 3: Custom calculation of each item
#===============================================================================

# By default, the data set contains information on what calculation protocol
# is applied to each of its items. In this example, we use the S66 data set
# where the calculated quantity is interaction energy in a fixed geometry.

# This example shows how to override that and perform a custom calculation,
# in this case optimizing the geometry of the complex with the
# tested method before the interaction energy is calculated. Additionally,
# the change of the geometry is measured as RMSD and printed.

job: dataset
dataset: S66
dataset_select_name: "^0[1-4]" # only first four items from the data set are used

# The block calculation_overwrite allows defining a custom calculation that is
# performed instead of the default one
calculation_overwrite:
  # The multistep protocol allows running the optimization followed by
  # interaction energy calculation.
  # The multistep protocol returns the result of the last calculation
  # which, in this case, yields the quantity we are looking for,
  # the interaction energy.
  job: multistep
  steps: clean, opt, rmsd, int

  # Common setup for all calculations
  calculation_common:
    interface: mopac
    method: pm6

  # Cleanup: remove the old optimized geometry
  calculation_clean:
    job: shell_script
    shell_commands: "rm -f optimized.xyz"

  # Optimize geometry of each item in data set
  calculation_opt:
    job: optimize
    geometry: parent_block # The geometry is defined one level above
    opt_quality: 0.1
    optimizer: lbfgs
    optimize_print: steps_as_dots # Simplified printing of steps

  # Calculation of RMSD upon optimization
  calculation_rmsd:
    job: geometry
    geometry_action: rmsd_fit
    geometry: parent_block
    geometry2: optimized.xyz

  # Calculate interaction energy in the optimized geometry
  calculation_int:
    job: interaction
    geometry: optimized.xyz
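
The selection pattern "^0[1-4]" used in example 3 reads like a regular expression anchored at the start of the item name. The following Python sketch only illustrates which names such a pattern would pick; it assumes that dataset_select_name is matched as a regular expression against names that begin with a two-digit index (as in the A24 output shown earlier). The item names below are hypothetical placeholders, not the actual S66 entries.

```python
import re

# Hypothetical item names; real data sets define their own names.
item_names = [
    "01 complex A",
    "02 complex B",
    "03 complex C",
    "04 complex D",
    "05 complex E",
    "10 complex F",
    "40 complex G",
]

# Same pattern as in example 3: names starting with 01, 02, 03 or 04.
pattern = re.compile(r"^0[1-4]")

# Keep only the names in which the pattern is found.
selected = [name for name in item_names if pattern.search(name)]

for name in selected:
    print(name)
```

Under these assumptions, "10 complex F" and "40 complex G" are not selected because the anchor ^ requires the name to start with the digit 0.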


#===============================================================================
# Dataset example 4: Combining multiple data sets
#===============================================================================

# While it is not possible to combine multiple data sets within the data set
# protocol, the multistep protocol can be used to achieve this. The final
# result, in this case root mean square error in the two data sets, has to be
# calculated from the output of the two steps via a user-defined expression.


# Two data sets are calculated separately using the multistep protocol
job: multistep
steps: set1, set2


# The calculation setup is the same for both steps
calculation_common:
  job: dataset
  interface: mopac
  method: pm6

calculation_set1:
  dataset: s66

calculation_set2:
  dataset: x40

# Calculating the final error in the two data sets requires the knowledge of the
# structure of the objects containing the results of the individual steps.
# In the case of RMSE, we cannot get it from RMSEs of the two data sets but it
# can be calculated from the sums of squares as follows:
multistep_result_expression: "((steps['set1'].errors.sumsq + steps['set2'].errors.sumsq)/(steps['set1'].errors.count + steps['set2'].errors.count))**0.5"
# The final results can have an arbitrary name which will be printed in the output
multistep_result_name: "RMSE"

# If more than one combined result is needed, they can be evaluated in a custom
# code inserted into the input using the keyword multistep_result_eval.
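
The reason the expression in example 4 works with sums of squares and counts, rather than with the two RMSE values, can be checked numerically. The Python sketch below is independent of Cuby; the error lists are made-up numbers, and the helper names are hypothetical stand-ins for the errors.sumsq and errors.count attributes used in the expression.

```python
import math


def sumsq_and_count(errors):
    """Return the sum of squared errors and the number of errors."""
    return sum(e * e for e in errors), len(errors)


def rmse(errors):
    """Root mean square error of a single list of errors."""
    sumsq, count = sumsq_and_count(errors)
    return math.sqrt(sumsq / count)


# Made-up errors for two data sets of different size.
set1_errors = [0.5, -1.0, 2.0, 0.25]
set2_errors = [3.0, -0.5]

sumsq1, count1 = sumsq_and_count(set1_errors)
sumsq2, count2 = sumsq_and_count(set2_errors)

# Same formula as in multistep_result_expression above.
combined_rmse = math.sqrt((sumsq1 + sumsq2) / (count1 + count2))

# Reference: RMSE evaluated over all errors pooled into one list.
pooled_rmse = rmse(set1_errors + set2_errors)

# Averaging the two RMSE values is not equivalent.
naive_average = (rmse(set1_errors) + rmse(set2_errors)) / 2

print(f"combined: {combined_rmse:.4f}")
print(f"pooled:   {pooled_rmse:.4f}")
print(f"naive:    {naive_average:.4f}")
```

The combined value matches the pooled one, while the plain average of the two RMSEs differs, because the square root is not linear and the two sets contain different numbers of items.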

#===============================================================================
# Dataset example 5: Datasets from the GMTKN database
#===============================================================================

# All the datasets from the GMTKN database are calculated using the protocol
# "reaction". This protocol is set up automatically, but its use has one
# implication: the setup for the computational method should not be provided
# at the root level of the input, but in a block "calculation".

job: dataset
dataset: GMTKN_PCONF

# Unlike other data sets, the method is specified in a separate block:
calculation:
  job: energy
  interface: mopac
  method: pm6