This protocol automates a repeated calculation performed on a predefined data set of systems and calculates statistics over the set. The data set is described by a YAML file containing the definitions of the systems, the general setup of the calculations to be performed (most importantly, the protocol to be applied to each item), and the reference values. Some data sets are provided with Cuby; a user-defined data set can be used by providing a valid path to a YAML file instead of the name of a predefined data set.

The entries in a data set can be divided into groups and individually tagged. It is possible to calculate only part of the data set; the selection is defined by the keywords dataset_select_... and dataset_skip_....

The individual calculations can be executed in parallel to reduce the overall time.
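
As a minimal sketch of a parallel run, the input below reuses the ad-hoc data set setup shown later on this page and adds a thread count. The keyword name cuby_threads and the value 4 are assumptions that are not confirmed on this page; check the keyword reference before relying on them.

```yaml
job: dataset
dataset: from_files
dataset_from_files: "*.xyz"
dataset_from_files_job: energy

# Assumption: cuby_threads sets how many items are calculated at once;
# the keyword name and the value 4 are illustrative only
cuby_threads: 4

interface: mopac
method: pm6
```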

Important information!

The R160x6 data set contained wrong reference values and has been withdrawn from Cuby until the issue is fixed.

Data sets available

By default, Cuby contains the following data sets:

Non-Covalent Interactions Atlas data sets

New, large data sets from the Non-Covalent Interactions Atlas project.

<%= `$CUBY_PATH/data/datasets/list_datasets.rb | grep NCIA` %>

Other data sets

<%= `$CUBY_PATH/data/datasets/list_datasets.rb | grep -v GMTKN | grep -v NCIA` %>

GMTKN55 data sets

The GMTKN55 collection of data sets by S. Grimme is available in Cuby. The original data were converted automatically to the format Cuby uses; as a result, the data sets lack some convenience features such as descriptive names of the systems. The conversion was validated by comparing calculations in Cuby to the DFT results from the original paper; in all data sets, no difference or only a negligible one was observed.

<%= `$CUBY_PATH/data/datasets/list_datasets.rb | grep GMTKN55` %>

Calculation setup: All the entries in the GMTKN55 data sets (and in the GMTKN30 data sets listed below) are calculated using the reaction protocol. Because of this, the calculation setup must be provided in a separate block in the input named 'calculation' rather than at the root level. Here is an example:

job: dataset
dataset: GMTKN_PCONF

calculation:
  job: energy
  interface: mopac
  method: pm6

GMTKN30 data sets

Although superseded by GMTKN55, the GMTKN30 data sets are also kept in Cuby for backward compatibility. These were previously named just GMTKN. Please note that data sets with the same name may use different reference data in GMTKN30 and GMTKN55. The data sets were validated against the original DFT results by Grimme (with the exception of G21EA and WATER27, for which the published data were calculated in a modified basis set). Only in the SIE11 data set is there one point (the last entry) where our result does not agree with Grimme's DFT data (but is closer to the reference).

<%= `$CUBY_PATH/data/datasets/list_datasets.rb | grep GMTKN30` %>

Alternative reference values

The data set definition file may contain additional sets of reference values, such as energies calculated with other methods or results of an energy decomposition. These may include later, more accurate recalculations of the benchmark values. Unless explicitly noted, the main reference comes from the original publication where the data set was introduced. The additional data are not covered in the documentation yet but can be found in the data set files.

To use the alternative reference data, use the keyword dataset_reference.
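
The following input is a sketch of how this might look. It assumes that dataset_reference takes the name of a reference set as it appears in the data set file; both the file name my_dataset.yaml and the reference name alternative_reference are hypothetical placeholders, not entries shipped with Cuby.

```yaml
job: dataset
# Hypothetical data set file; any predefined data set name would also work here
dataset: my_dataset.yaml

# Hypothetical name of an additional reference set defined in the data set file
dataset_reference: alternative_reference

interface: mopac
method: pm6
```

The names of the reference sets actually available have to be looked up in the data set file itself.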

Custom data sets

Use an existing data set file (located in cuby4/data/datasets) as a template. The file can be located anywhere; just provide a valid path to it in the dataset keyword. The default data sets use geometries from Cuby's library, but geometry files can be used as well: the record 'geometry' in the data set file is treated the same way as the geometry keyword.
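
For illustration, an input using a custom data set could look like the sketch below. The path is a hypothetical placeholder, and the calculation setup is given at the root level on the assumption that the custom data set does not use the reaction protocol.

```yaml
job: dataset
# Hypothetical path to a user-defined data set file
dataset: /home/user/datasets/my_dataset.yaml

# Calculation setup applied to each item of the data set
interface: mopac
method: pm6
```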

Ad-hoc data sets

A simple data set calculation can be run directly on a set of geometry files by setting the dataset keyword to the value 'from_files'. Here is an example:

job: dataset
dataset: from_files

# Selection of geometry files to be used, shell wildcards allowed
dataset_from_files: "*.xyz"
# What protocol to use for the items
dataset_from_files_job: energy
# Optionally, reference energies can be read from a table
dataset_from_files_reference: "energies.txt"

interface: mopac
method: pm6