MQS Search API - Part 2: QMugs and PubchemQC PM6

Posted on Jul 11, 2023

MQS Documentation

This blog is now defunct. Head over to https://docs.mqs.dk


Figure 1 (left): An ouroboros with a benzene ring. The famous dream, or creative process, through which August Kekulé discovered the alternating double bonds in the benzene ring structure (Source: Wikimedia).

Figure 2 (right): The quantum chemical perspective: Cyclic overlapping of six p_z orbitals of the sp2 hybridized C-atoms forming a delocalized pi system (Source: LibreTexts Chemistry).

MQS Search API

The MQS Search API Part 1 Tutorial gave a basic introduction to the MQS Search API. Here we first recap some details to give an overview of the API.

In addition, this Part 2 Tutorial shows how the search field of the MQS Dashboard can be used to filter for molecules in the MQS database and how the resulting search query can then be used with the Search API.

The database now also holds the QMugs data set, which is available with the community plan (free access via the UI) and the basic subscription plan (API access + JupyterLab). At the end of this tutorial we give an overview of the PubchemQC PM6 and QMugs data sets and show how to retrieve data from the QMugs data set.

Overview of API endpoints

When interacting with a REST API, data is sent to the server with a POST call and received from the server with a GET call. The MQS Search API currently provides the following POST and GET methods:

  • POST /token: This endpoint generates an access token that is required to call the other two endpoints. The token is generated using the user’s email and password.
  • GET /search: This endpoint is used to search the chemical compounds data in the database.
  • GET /compound/{id}: This endpoint is used to retrieve detailed information of a specific compound.

Authentication is required for accessing the /search and /compound/{id} endpoints. The user must provide a valid bearer token in the Authorization HTTP header.

Generating the access token

To generate the access token, use the POST /token endpoint. The following parameters should be included in the request body:

  • email: The email address of the user which has been used to set up an account via the MQS Dashboard.
  • password: The password of the user.

See Tutorial I for a code example.

Searching for compounds

To search for compounds, use the GET /search endpoint. The following parameter should be included in the request URL:

  • q: The search query string.

The search query string can be optimized with the MQS Dashboard search interface: test the query string there, check which kind of molecules are found, and refine the query string further.

For example, if you would like to search for the chemical class of polychlorinated biphenyls (PCBs), you can start by typing “polychlorinated biphenyls” or “PCB” in the search field.
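As a minimal sketch, a GET /search request with the q parameter and the bearer token could be assembled with the Python standard library as shown here. The helper name build_search_request is hypothetical, the base URL is the one used in the wavefunction example of this tutorial, and YOUR_TOKEN is a placeholder. The request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_URL = "https://api.mqs.dk"  # base URL as used in the wavefunction example


def build_search_request(query: str, token: str) -> Request:
    """Build (but do not send) an authenticated GET /search request."""
    # urlencode percent-encodes spaces, quotes and other special characters in q
    url = f"{API_URL}/search?{urlencode({'q': query})}"
    headers = {
        "Accept": "application/json",
        "Authorization": f"Bearer {token}",  # token obtained from POST /token
    }
    return Request(url, headers=headers, method="GET")


request = build_search_request("polychlorinated biphenyls", "YOUR_TOKEN")
print(request.full_url)
```

Passing the request object to urllib.request.urlopen would send it; the requests library offers an equivalent call with requests.get.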

Retrieving a compound datasheet

To retrieve the full datasheet of a single compound, use the GET /compound/{id} endpoint. The following parameter should be included in the request URL:

  • id: The ID of the compound.

The difference between the two endpoints GET /search and GET /compound/{id} can be illustrated by how the search dashboard interface is used:

The search field of the dashboard makes use of the GET /search method, whereas a click on a specific compound applies the GET /compound/{id} method.

Dashboard search field query

The search query field of the dashboard allows you to tailor your search queries before retrieving all the data of the molecules. As with many other search engines, one can make use of operators (AND, OR, NOT) and wildcards (*, #, ?, %) to refine the search query so that the search result contains only the structures of interest.

For example, to search for polychlorinated biphenyls (PCBs) one can search for the term:

“PCB”

However, the search result will then also contain structures that include S, N, or O atoms.

Refining the search string to:

“PCB AND C12* NOT formula:*S NOT formula:*N* NOT formula:*O”

retrieves only structures with two rings connected to each other and with Cl atoms bonded to the ring C atoms in different configurations.
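When such a refined query is later sent through the API, it can be convenient to assemble it from its parts. The following is only a sketch: build_query is a hypothetical helper, not part of the MQS API, and it simply reproduces the query string above.

```python
def build_query(required, excluded=()):
    """Join required terms with AND and append one NOT clause per excluded term."""
    query = " AND ".join(required)
    for term in excluded:
        query += f" NOT {term}"
    return query


# Terms taken from the PCB example above
pcb_query = build_query(
    required=["PCB", "C12*"],
    excluded=["formula:*S", "formula:*N*", "formula:*O"],
)
print(pcb_query)
```

Keeping the excluded formula patterns in a list makes it easy to add or drop an element filter while testing the query in the dashboard.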

Figure 3: Search query results in MQS Dashboard

Tip: If you are not satisfied with a search, for example because a SMILES string gives you only a single atom as the top result or the exact match does not show up at the top, try enclosing the SMILES string in double quotation marks:

“CC(=O)OC1=CC=CC=C1C(=O)O”

instead of searching for the SMILES without quotation marks:

CC(=O)OC1=CC=CC=C1C(=O)O

Try it yourself and see how the results differ.
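When building such a quoted query in code, a small helper can add the quotation marks. This sketch assumes that the search backend expects straight ASCII double quotes, so typographic quotes copied from a web page are replaced; as_exact_phrase is a hypothetical helper name.

```python
def as_exact_phrase(term: str) -> str:
    """Wrap a search term in straight double quotes.

    Surrounding straight or typographic quotes are removed first, so the
    result always carries exactly one pair of ASCII double quotes.
    """
    cleaned = term.strip().strip('"\u201c\u201d')
    return f'"{cleaned}"'


smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(as_exact_phrase(smiles))
# The same SMILES pasted with typographic quotes gives the same query string
print(as_exact_phrase("\u201c" + smiles + "\u201d"))
```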

MQS Database: Currently available data sets (as of July 2023)

The current beta release of the MQS database holds the PubchemQC PM6 and QMugs data sets. The PubchemQC PM6 calculation pipeline screens the molecular structures available in the Pubchem database, whereas the QMugs pipeline makes use of the ChEMBL database.

We refer to the publications of the individual data sets for details on how the data has been generated and what kind of data they hold. The following table compares PubchemQC PM6 and QMugs:

                                             PubchemQC PM6       QMugs
Data mining source                           Pubchem database    ChEMBL database
Semi-empirical method                        PM6                 GFN2-xTB
Software applied                             NWChem              RDKit, xtb, PSI4
Availability of conformers data              No                  Yes
Availability of different electronic states  Yes                 No
Number of molecular structures               221,190,415         685,917

The Search UI & API allow searching through both data sets at the same time via the following identifiers:

cid: number
inchi: string
inchi_key: string
chembl_id: string
isomeric_smiles: string
openbabel_canonical_smiles: string
formula: string
cas: string
molecular_weight: number
charge: number
multiplicity: number
title: string
synonyms: [string]
pubchem_version: string

One can also search through each individual data set and zoom further into the schema of the data sets to evaluate what can be searched for.

The PubchemQC PM6 data set holds the following data:

pubchemqc_pm6:
  [
    {
      state: string
      vibrations: {
        frequencies: [number]
        intensities: {
          IR: [number]
        }
      }
      atoms: {
        elements: {
          atom_count: number
          number: [number]
          heavy_atom_count: number
        }
        coords: {
          3d: [number]
        }
        core_electrons: [number]
      }
      properties: {
        number_of_atoms: number
        temperature: number
        total_dipole_moment: number
        multiplicity: number
        energy: {
          alpha: {
            homo: number
            gap: number
            lumo: number
          }
          beta: {
            homo: number
            gap: number
            lumo: number
          }
          total: number
        }
        orbitals: {
          energies: [[number]]
          basis_number: number
          MO_number: number
          homos: [number]
        }
        enthalpy: number
        charge: number
        partial_charges: {
          mulliken: [number]
        }
      }
      openbabel_inchi: string
      openbabel_canonical_smiles: string
    }
  ]

And the QMugs data set consists of:

qmugs:
  [
    {
      conf: string
      atom_count: number
      heavy_atom_count: number
      hetero_atom_count: number
      rotabable_bounds: number
      stereocenters: number
      rings: number
      hbond_acceptors: number
      hbond_donors: number
      significant_negative_wavenumbers: boolean
      nonunique_smiles: boolean
      GFN2_TOTAL_ENERGY: number
      GFN2_ATOMIC_ENERGY: number
      GFN2_FORMATION_ENERGY: number
      GFN2_TOTAL_ENTHALPY: number
      GFN2_TOTAL_FREE_ENERGY: number
      GFN2_DIPOLE_X: number
      GFN2_DIPOLE_Y: number
      GFN2_DIPOLE_Z: number
      GFN2_DIPOLE_TOT: number
      GFN2_QUADRUPOLE_XX: number
      GFN2_QUADRUPOLE_XY: number
      GFN2_QUADRUPOLE_YY: number
      GFN2_QUADRUPOLE_XZ: number
      GFN2_QUADRUPOLE_yz: number
      GFN2_QUADRUPOLE_ZZ: number
      GFN2_ROT_CONSTANT_A: number
      GFN2_ROT_CONSTANT_B: number
      GFN2_ROT_CONSTANT_C: number
      GFN2_ENTHALPY_VIB: number
      GFN2_ENTHALPY_ROT: number
      GFN2_ENTHALPY_TRANSL: number
      GFN2_ENTHALPY_TOT: number
      GFN2_HEAT_CAPACITY_VIB: number
      GFN2_HEAT_CAPACITY_ROT: number
      GFN2_HEAT_CAPACITY_TRANSL: number
      GFN2_HEAT_CAPACITY_TOT: number
      GFN2_ENTROPY_VIB: number
      GFN2_ENTROPY_ROT: number
      GFN2_ENTROPY_TRANSL: number
      GFN2_ENTROPY_TOT: number
      GFN2_HOMO_ENERGY: number
      GFN2_LUMO_ENERGY: number
      GFN2_HOMO_LUMO_GAP: number
      GFN2_FERMI_LEVEL: number
      GFN2_DISPERSION_COEFFICIENT_MOLECULAR: number
      GFN2_POLARIZABILITY_MOLECULAR: number
      DFT_TOTAL_ENERGY: number
      DFT_ATOMIC_ENERGY: number
      DFT_FORMATION_ENERGY: number
      DFT_DIPOLE_X: number
      DFT_DIPOLE_Y: number
      DFT_DIPOLE_Z: number
      DFT_DIPOLE_TOT: number
      DFT_ROT_CONSTANT_A: number
      DFT_ROT_CONSTANT_B: number
      DFT_ROT_CONSTANT_C: number
      DFT_XC_ENERGY: number
      DFT_NUCLEAR_REPULSION_ENERGY: number
      DFT_ONE_ELECTRON_ENERGY: number
      DFT_TWO_ELECTRON_ENERGY: number
      DFT_HOMO_ENERGY: number
      DFT_LUMO_ENERGY: number
      DFT_HOMO_LUMO_GAP: number
    }
  ]

See also these tables from the QMugs paper to get an overview of the data:

https://www.nature.com/articles/s41597-022-01390-7/tables/3

https://www.nature.com/articles/s41597-022-01390-7/tables/4

The user interface of the MQS Dashboard provides an overview of the QMugs data if available for the specific molecule:

Figure 4: Tabs and table view for QMugs data in the MQS Dashboard

Sometimes the QMugs data cannot be selected in the dropdown because only PubchemQC PM6 data is available for the specific molecule, or vice versa.

The available data view for the PubchemQC PM6 dataset can be seen in the following figure:

Figure 5: Tabs and table view for PubchemQC PM6 data in the MQS Dashboard

HOMO-LUMO gap data from the PubchemQC PM6 and QMugs data set

Like PubchemQC PM6, QMugs holds HOMO-LUMO gap data; in QMugs the gap is available from both the GFN2-xTB method and DFT. The following code snippet shows how to retrieve HOMO-LUMO gap data from both data sets. QMugs contains property data for different conformers (local minima on the potential energy surface of the molecule), whereas PubchemQC PM6 holds data for different states.

# data_set and molecule_id_set must be defined beforehand, e.g. from the
# results of a previous search; each data_set entry is one compound record.
smiles_list = []
homo_lumo_gap_list_pubchemqc = []
homo_lumo_gap_list_qmugs_gfn2 = []
homo_lumo_gap_list_qmugs_dft = []
for i in range(len(molecule_id_set)):
    smiles = data_set[i]["isomeric_smiles"]
    print("smiles:", smiles)
    smiles_list.append(smiles)
    try:
        # Alpha-spin HOMO-LUMO gap of the first PubchemQC PM6 entry
        homo_lumo_gap_pubchemqc = data_set[i]["pubchemqc_pm6"][0]["properties"]["energy"]["alpha"]["gap"]
        homo_lumo_gap_list_pubchemqc.append(homo_lumo_gap_pubchemqc)
    except (KeyError, IndexError, TypeError):
        print("No pubchemqc pm6 data available.")
    try:
        # GFN2 and DFT gaps of the first QMugs conformer. The key names are kept
        # as in the original snippet; the schema listing spells them in upper case.
        homo_lumo_gap_qmugs_gfn2 = data_set[i]["qmugs"]["confs"][0]["GFN2_HOMO_LUMO_gap"]
        homo_lumo_gap_qmugs_dft = data_set[i]["qmugs"]["confs"][0]["DFT_HOMO_LUMO_gap"]
        homo_lumo_gap_list_qmugs_gfn2.append(homo_lumo_gap_qmugs_gfn2)
        homo_lumo_gap_list_qmugs_dft.append(homo_lumo_gap_qmugs_dft)
    except (KeyError, IndexError, TypeError):
        print("No qmugs data available.")

print("homo_lumo_gap_list_pubchemqc:", homo_lumo_gap_list_pubchemqc)
print("homo_lumo_gap_list_qmugs_gfn2:", homo_lumo_gap_list_qmugs_gfn2)
print("homo_lumo_gap_list_qmugs_dft:", homo_lumo_gap_list_qmugs_dft)

PubchemQC PM6 holds HOMO-LUMO gap data for different molecular states (anion, cation, singlet state, triplet state), which can be selected within the “pubchemqc_pm6” data set. Note that not every molecule has data for every state. In the original PubchemQC PM6 schema, the available entries are addressed by the numeric indices [0], [1], [2] and [3], but these indices are not tied to particular states: a molecule might only have the entries [0] and [1], where [0] holds the singlet state data and [1] holds the data of any one of the other states (anion, cation or triplet state). In the MQS database the state data is instead addressed by non-numeric entries ([‘singlet state’], [‘anion’], [‘cation’], [‘triplet state’]).
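A state-based lookup could be sketched as follows. The sketch assumes that the pubchemqc_pm6 entries arrive as a list of objects with a state field, as in the schema listing, and that the state labels are the ones named above. The helper names and the numeric values are made up for illustration.

```python
def entry_for_state(pm6_entries, state):
    """Return the first PubchemQC PM6 entry whose state field matches, else None."""
    for entry in pm6_entries:
        if entry.get("state") == state:
            return entry
    return None


def alpha_gap(pm6_entries, state="singlet state"):
    """Return the alpha HOMO-LUMO gap for the given state, or None if unavailable."""
    entry = entry_for_state(pm6_entries, state)
    if entry is None:
        return None
    return entry["properties"]["energy"]["alpha"]["gap"]


# Made-up values, only to illustrate the lookup
example_entries = [
    {"state": "singlet state", "properties": {"energy": {"alpha": {"gap": 9.1}}}},
    {"state": "cation", "properties": {"energy": {"alpha": {"gap": 7.4}}}},
]
print(alpha_gap(example_entries, "cation"))
print(alpha_gap(example_entries, "triplet state"))  # no such entry, so None
```

Selecting by state name instead of by position avoids mixing up states for molecules that lack some of the four entries.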

The above example is also convenient for comparing data points from different data sets. This will be exemplified in our next blog article.

The ChEMBL and SureChEMBL identifiers combined with the retrieval of wavefunction files in the QMugs data set

The ChEMBL identifier has been added to the search IDs which can be used for screening compounds in the MQS database via the search query field in the dashboard or the API. The current supported ID set for searching through the database is summarized in the following:

cid: number
inchi: string
inchi_key: string
chembl_id: string
isomeric_smiles: string
openbabel_canonical_smiles: string
formula: string
cas: string
molecular_weight: number
charge: number
multiplicity: number
title: string
synonyms: [string]
pubchem_version: string

For more information about the ChEMBL identifier, we recommend the following blog article: http://chembl.blogspot.com/2011/08/chembl-identifiers.html

One has to be careful when searching with, for example, “chembl*”, because the results will also include molecules whose synonyms list contains a SureChEMBL (SCHEMBL) identifier. It is therefore best to search directly with chembl_id in the search string: “chembl_id: chembl*”
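If a broad search has already been run, the results can also be narrowed on the client side. The sketch below assumes that each search result is a dictionary with an optional chembl_id field and a synonyms list, as in the identifier listing; the records and identifiers are made up for illustration.

```python
def has_chembl_id(doc: dict) -> bool:
    """True if the record carries a ChEMBL identifier in its chembl_id field."""
    chembl_id = doc.get("chembl_id") or ""
    return chembl_id.upper().startswith("CHEMBL")


# Made-up records: the second one only mentions a SureChEMBL ID in its synonyms
docs = [
    {"title": "example A", "chembl_id": "CHEMBL123", "synonyms": ["example A"]},
    {"title": "example B", "synonyms": ["SCHEMBL456"]},
]
chembl_docs = [doc for doc in docs if has_chembl_id(doc)]
print([doc["title"] for doc in chembl_docs])
```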

Each molecule in the QMugs dataset has up to three conformer structures and is linked to one ChEMBL identifier in the database. To understand the terms “conformation” and “conformer” we quote from the QMugs paper:

“In chemical terminology, the term “conformation” refers to any arrangement of atoms in space, whereas “conformer” refers to a conformation that is a local minimum on the potential energy surface of the molecule.” [40]
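Since a molecule can have up to three conformers, one often wants the conformer with the lowest energy. A minimal sketch, assuming the conformers are available as a list of dictionaries with the conf and GFN2_TOTAL_ENERGY fields from the QMugs schema listing; the conformer labels and energy values are made up.

```python
def lowest_energy_conformer(conformers, energy_key="GFN2_TOTAL_ENERGY"):
    """Return the conformer dict with the lowest value for energy_key, or None."""
    candidates = [c for c in conformers if c.get(energy_key) is not None]
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[energy_key])


# Made-up conformer records, energies in Hartree
example_conformers = [
    {"conf": "conf_00", "GFN2_TOTAL_ENERGY": -42.10},
    {"conf": "conf_01", "GFN2_TOTAL_ENERGY": -42.13},
    {"conf": "conf_02", "GFN2_TOTAL_ENERGY": -42.11},
]
best = lowest_energy_conformer(example_conformers)
print(best["conf"])
```

The same helper works for the DFT energies by passing energy_key="DFT_TOTAL_ENERGY".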

Here is an example of how to retrieve the total energy [E_h] data from the QMugs data set with the ChEMBL and SChEMBL identifiers:

# data_set and molecule_id_set must be defined beforehand, e.g. from the
# results of a previous search.
chembl_list = []
total_energy_list_qmugs = []
for i in range(len(molecule_id_set)):
    chembl = data_set[i]["chembl_id"]
    print("chembl", chembl)
    chembl_list.append(chembl)
    try:
        # GFN2 total energy (in Hartree) of the first QMugs conformer
        total_energy_qmugs = data_set[i]["qmugs"]["confs"][0]["GFN2_TOTAL_ENERGY"]
        total_energy_list_qmugs.append(total_energy_qmugs)
    except (KeyError, IndexError, TypeError):
        print("No qmugs data available for", chembl, ".")

print("total_energy_list_qmugs:", total_energy_list_qmugs)

E_h is the total energy in units of Hartree and can be converted to other energy units. A convenient converter is provided by the Weizmann Institute in Rehovot (Israel): https://www.weizmann.ac.il/oc/martin/tools/hartree.html

A conversion overview is provided by the National Chiao Tung University (NCTU) in Taiwan, which has since merged with the National Yang-Ming University to form the National Yang Ming Chiao Tung University (NYCU): http://wild.life.nctu.edu.tw/class/common/energy-unit-conv-table-detail.html
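The conversion can also be done directly in Python. The following sketch uses the CODATA 2018 conversion factors; the function name convert_hartree and the example energy are illustrative only.

```python
# Conversion factors for 1 Hartree (CODATA 2018)
HARTREE_TO_EV = 27.211386245988
HARTREE_TO_KJ_PER_MOL = 2625.4996394799
HARTREE_TO_KCAL_PER_MOL = 627.5094740631


def convert_hartree(energy_hartree: float) -> dict:
    """Convert an energy given in Hartree to eV, kJ/mol and kcal/mol."""
    return {
        "eV": energy_hartree * HARTREE_TO_EV,
        "kJ/mol": energy_hartree * HARTREE_TO_KJ_PER_MOL,
        "kcal/mol": energy_hartree * HARTREE_TO_KCAL_PER_MOL,
    }


# Made-up total energy in Hartree
converted = convert_hartree(-42.13)
for unit, value in converted.items():
    print(f"{value:.4f} {unit}")
```

For comparing conformers or data sets, it is usually the energy differences that are converted rather than the large absolute total energies.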

The QMugs data set also holds wavefunction-related information, which is stored as .tar.gz files. In the Dashboard one can download the wavefunction files by clicking on the respective button within the selected tab view:

Figure 6: Download button provided in the MQS Dashboard UI for wavefunctions from the QMugs dataset.

The following three steps can be followed to download the wavefunction files via the MQS API:

  1. Search for a compound and retrieve the compound ID

  2. Look up the download URL for the specific compound ID

  3. Download the wavefunction .tar.gz file with the URL, unpack it and look at the content of the file directly in your terminal

The following Python code snippets implement these three steps.

Step 1:

import requests

credentials = {
    "username": "YOUR_EMAIL",
    "password": "YOUR_PW"
}

api_url = "https://api.mqs.dk"

# Request an access token and build the headers for the authenticated calls
token = requests.post(f"{api_url}/token", data=credentials).json()["access_token"]
headers = {"Accept": "application/json", "Authorization": f"Bearer {token}"}

# Search for a compound via its isomeric SMILES
query = "isomeric_smiles:CC"
results = requests.get(f"{api_url}/search", params={"q": query}, headers=headers).json()
print("results:", results)

# Use the ID from the first search result to get the molecule's complete datasheet
id = results["response"]["docs"][0]["id"]
print("id:", id)

Step 2:

data = requests.get(f"{api_url}/compound/{id}/datasets/qmugs/wavefunctions_url", headers=headers).json()
download_url = data["url"]
print("download_url:", download_url)

Step 3:

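import os
import tarfile
from urllib.request import urlretrieve

import numpy

# Download the archive; id and download_url come from steps 1 and 2
filename = str(id) + "_wavefunction.tar.gz"
urlretrieve(download_url, filename)

# Unpack into a dedicated folder so that only the extracted files are read below
folder_path = "./" + str(id) + "_wavefunction"
os.makedirs(folder_path, exist_ok=True)
with tarfile.open(filename) as archive:
    archive.extractall(folder_path)

# Walk through the extracted files and load each one with numpy
for root, _, files in os.walk(folder_path):
    for name in files:
        f = os.path.join(root, name)
        # Assumes NumPy files, as in the original snippet. If numpy refuses a
        # file because it holds pickled objects, allow_pickle=True is needed;
        # only use that option for files you trust.
        wavefunctions_data = numpy.load(f)
        print(wavefunctions_data)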

The properties which can be retrieved from the wave function files are listed in Table 3 of the QMugs paper:

https://www.nature.com/articles/s41597-022-01390-7/tables/4

All Jupyter notebooks and Python scripts are available in the JupyterLab environment with the “Quantum C^n” subscription (formerly named “Basic”) of the MQS Dashboard (https://dashboard.mqs.dk). C stands for chemistry and computing. n stands for the exponential advantage one could possibly retrieve with quantum computing for quantum chemistry applications.