Getting started with the MQS Search API - Part 1: Introduction

Posted on Apr 11, 2023

Figure: REST API - Author: Seobility - License: CC BY-SA 4.0

Microservices and containers

From the founding of MQS we defined a highly modular, microservice-based architecture for our cloud software stack. As a start-up we wanted to remain flexible with our portfolio of algorithms and tools. For that reason we containerize the components of our stack (algorithms, database, API, UI) as much as possible and combine them into the pipelines and workflows we need. The first microservice we are releasing as a beta version is our Search API, which connects to our database system. The database holds data on molecules and will be extended over time with additional data sets published under free licenses such as GPL, MIT or Creative Commons.

We have made our current database freely accessible through the user interface within the MQS Dashboard and on a subscription basis via the MQS Search API.

Below we describe how you can use the Search API once you have subscribed to the basic plan. You can find the subscription overview at the top right of the user menu in the dashboard.

Authentication to the MQS Search API

To authenticate with the Search API you need the email address you use for logging into the dashboard.

In the code below, provide your email address and your password as the values of the "username" and "password" keys of the credentials dictionary, replacing "YOUR_EMAIL" and "YOUR_PW".

import requests

# Credentials: the email address and password of your MQS account
credentials = {
    "username": "YOUR_EMAIL",
    "password": "YOUR_PW"
}

api_url = "https://api.mqs.dk"

# Request an access token and store it in the headers used for all later requests
token = requests.post(f"{api_url}/token", data=credentials).json()["access_token"]
headers = {"Accept": "application/json", "Authorization": f"Bearer {token}"}

# Search for ethane by its isomeric SMILES string
query = "isomeric_smiles:CC"
data = requests.get(f"{api_url}/search", params={"q": query}, headers=headers).json()
print("data:", data)

Running the above Python script retrieves a token from the API via a POST request. The token is stored in the headers dictionary, which is used for every other request to the API.
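
The script above assumes that the token request always succeeds. As a minimal sketch of a more defensive variant, the two helpers below separate reading the token from building the headers. The function names `extract_token` and `bearer_headers` are hypothetical and not part of the MQS API, and the example uses a stand-in payload instead of a live request.

```python
def extract_token(payload):
    """Return the access token from a decoded /token response body.

    Raises ValueError if the body has no "access_token" entry, for example
    when the credentials were rejected (the exact error body is an assumption).
    """
    token = payload.get("access_token") if isinstance(payload, dict) else None
    if not token:
        raise ValueError(f"no access_token in response: {payload!r}")
    return token


def bearer_headers(token):
    """Build the headers dictionary used for all subsequent API requests."""
    return {"Accept": "application/json", "Authorization": f"Bearer {token}"}


# Stand-in payload; in practice this would be
# requests.post(f"{api_url}/token", data=credentials).json()
headers = bearer_headers(extract_token({"access_token": "abc123"}))
print(headers["Authorization"])
```

With a helper like this, a failed login surfaces as a readable ValueError rather than a KeyError on "access_token".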

The last lines in the above script showcase how you can search for a molecule in the database with an isomeric SMILES string.

Printing the retrieved information with the print() function shows that the API has sent back data, which the json() method of the requests package decodes into a dictionary. The data dictionary holds a lot of nested information. You can run the above example in this Python script or Jupyter Notebook which we have compiled covering the examples in this tutorial.

As you will see, the last print() function gives an overview of the complete data set for the ethane (CC) molecule.

Get individual properties of molecules

The next code example shows how to retrieve specific information about a molecule, such as its title (main compound name), the number of atoms, the geometry-optimized xyz coordinates of the individual atoms, the vibrational spectra and orbital information.

Note that the molecule id first has to be retrieved from the data obtained in a previous request. The code below starts where the last code example ended:

import numpy as np

# Use the ID from the last search result to get the molecule's complete datasheet
compound_id = data["response"]["docs"][0]["id"]
# compound_id = "626745bf5297cf60c40d9459"

data = requests.get(f"{api_url}/compound/{compound_id}", headers=headers).json()

# Print information about the molecule
# Title
print("title:", data["title"], "\n")

# Number of atoms
atom_count = data["pubchemqc_pm6"][0]["atoms"]["elements"]["atom_count"]
print("Number of atoms:", atom_count, "\n")

# 3D coordinates, reshaped to one row of x, y, z per atom
coords = np.asarray(data["pubchemqc_pm6"][0]["atoms"]["coords"]["3d"])
print("Geometry optimized xyz coordinates:\n", coords.reshape(atom_count, 3), "\n")

# Vibrational spectra
print("Vibrational spectra data:", data["pubchemqc_pm6"][0]["vibrations"], "\n")

# Orbitals
print("Orbitals data:", data["pubchemqc_pm6"][0]["properties"]["orbitals"], "\n")

As you may have noticed, we have introduced a new request endpoint, {api_url}/compound, and used the id retrieved from the data which we in turn requested from the {api_url}/search endpoint.

We are interacting with the API as we would do in the user interface: first we search for a compound and then we request further information on a searched molecule.

How to obtain detailed information can be confusing at first, since one has not yet seen the complete schema of the database. To start, we will go through the five examples in the above code: title, number of atoms, 3d coordinates, vibrational spectra and orbitals.

To access the title we can make a first-level look-up: data["title"]. That was easy. The number of atoms (atom count) is currently stored in a specific data set which the database holds. One therefore first needs to reference the data set and then the state of the molecule (e.g. 0: anion, 1: cation, 2: S0 (singlet state), 3: T0 (triplet state)), followed by ['atoms']['elements']['atom_count']. Note that the PubchemQC PM6 data set holds information for various states and sometimes no data is available for a specific state. One therefore cannot simply iterate over the index on the assumption that the indices 0, 1, 2 and 3 hold information about the anion, cation, singlet and triplet state. If no information is available for the cation state of a molecule, then index 1 holds information about the singlet state and index 2 about the triplet state. In the last section of this tutorial we show how one can check the state of each entry and automate high-throughput screening of specific state information.

Back to the above code example: data["pubchemqc_pm6"][0]["atoms"]["coords"]["3d"] lets one access the geometry-optimized 3d coordinates, which are stored in a numpy array and reshaped so that the output shows one row per atom with the x, y and z values as columns.

It should now be clear that one needs some idea of how the data set is stored in the database, which is why the schema matters to a user. Below is the full schema of the PubchemQC PM6 data set, which helps when requesting specific data from the API:
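
If numpy is not available, the same reshaping can be sketched in plain Python. The helper name `to_xyz_rows` is hypothetical, and the coordinate values in the example are made up for illustration; they are not real data from the database.

```python
def to_xyz_rows(flat_coords, atom_count):
    """Split a flat [x1, y1, z1, x2, y2, z2, ...] list into one [x, y, z] row per atom."""
    if len(flat_coords) != 3 * atom_count:
        raise ValueError(
            f"expected {3 * atom_count} values for {atom_count} atoms, got {len(flat_coords)}"
        )
    return [flat_coords[i:i + 3] for i in range(0, len(flat_coords), 3)]


# Made-up coordinates for two atoms
rows = to_xyz_rows([0.0, 0.0, 0.0, 1.5, 0.0, 0.0], atom_count=2)
for row in rows:
    print(row)
```

The length check plays the same role as the error numpy raises when reshape(atom_count, 3) does not match the number of values.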

pubchemqc_pm6:
  [
    {
      state: string
      vibrations: {
        frequencies: [number]
        intensities: {
          IR: [number]
        }
      }
      atoms: {
        elements: {
          atom_count: number
          number: [number]
          heavy_atom_count: number
        }
        coords: {
          3d: [number]
        }
        core_electrons: [number]
      }
      properties: {
        number_of_atoms: number
        temperature: number
        total_dipole_moment: number
        multiplicity: number
        energy: {
          alpha: {
            homo: number
            gap: number
            lumo: number
          }
          beta: {
            homo: number
            gap: number
            lumo: number
          }
          total: number
        }
        orbitals: {
          energies: [[number]]
          basis_number: number
          MO_number: number
          homos: [number]
        }
        enthalpy: number
        charge: number
        partial_charges: {
          mulliken: [number]
        }
      }
      openbabel_inchi: string
      openbabel_canonical_smiles: string
    }
  ]

How the vibrational spectra and orbitals information are referenced in the above code example can be understood by comparing the Python code with the schema.
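
Because the schema is deeply nested, a small look-up helper can make the access paths easier to read. The following is only a sketch: `get_path` is a hypothetical helper, not part of the API, and the entry in the example is a made-up fragment that follows the schema above.

```python
def get_path(entry, dotted_path, default=None):
    """Follow a dotted path such as "properties.energy.alpha.homo" through nested dicts.

    Returns `default` if any key along the path is missing.
    """
    current = entry
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up state entry shaped like one element of the pubchemqc_pm6 list
entry = {
    "state": "S0",
    "properties": {"energy": {"alpha": {"homo": -0.4, "lumo": 0.1}}},
}

print(get_path(entry, "properties.energy.alpha.homo"))
print(get_path(entry, "properties.energy.beta.homo", default="not available"))
```

Returning a default instead of raising a KeyError is a design choice that suits screening many compounds, where some entries may lack a given field.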

Other identifiers to search with

In the previous example a search was performed with a SMILES string. There are several other identifiers you can use to search the database, for example the Pubchem CID, the title (main compound name) of the molecule, synonyms of the molecule, the InChI and the InChI key.

Here is a complete overview of the fields one can use in search queries via the API:

cid: number
title: string
synonyms: [string]
inchi: string
inchi_key: string
isomeric_smiles: string
formula: string
cas: string
molecular_weight: number
charge: number
multiplicity: number
pubchemqc_pm6.state: [string]
pubchemqc_pm6.openbabel_canonical_smiles: string

Currently the database consists of the PubchemQC PM6 data set (pre-print arXiv paper).

pubchemqc_pm6.state and pubchemqc_pm6.openbabel_canonical_smiles show how you can search directly inside a data set, for example by the available molecule states (e.g. S0, anion, cation) or by the openbabel_canonical_smiles strings generated in this data set.
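
All queries in this tutorial have the form field:value. As a minimal sketch, the helper below builds such a query string and rejects field names that are not in the overview above. The names `SEARCH_FIELDS` and `build_query` are hypothetical, and the check is purely client-side.

```python
# Search fields copied from the overview above
SEARCH_FIELDS = {
    "cid", "title", "synonyms", "inchi", "inchi_key", "isomeric_smiles",
    "formula", "cas", "molecular_weight", "charge", "multiplicity",
    "pubchemqc_pm6.state", "pubchemqc_pm6.openbabel_canonical_smiles",
}


def build_query(field, value):
    """Return a "field:value" query string for the q parameter of the /search endpoint."""
    if field not in SEARCH_FIELDS:
        raise ValueError(f"unknown search field: {field!r}")
    return f"{field}:{value}"


print(build_query("cid", 6334))
print(build_query("pubchemqc_pm6.state", "S0"))
```

The resulting string would be passed as params={"q": query}, as in the search requests shown earlier.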

In the future more data sets will be available and the API will allow you to seamlessly search with the main identifiers through all data sets.

Get sets of molecules information

The following Python script shows how to use the API for high-throughput screening, retrieving information for a set of molecules. We apply a list comprehension instead of a for loop to iterate over the search requests. The authentication procedure is now defined in a function, and the main part shows how the ids of the found compounds are requested from the Search API and stored in the id_set_list list. The queries list covers carbon chains from C2 to C10 and uses different identifiers. Isomeric SMILES, CID, InChI key and CAS work best for receiving an exact match with the respective identifier.

import requests
import json


def get_headers(credentials, api_url):
    """
    Get an access token which is valid for 2 hours
    """

    token = requests.post(f"{api_url}/token", data=credentials).json()["access_token"]
    headers = {"Accept": "application/json", "Authorization": f"Bearer {token}"}

    return headers
    
if __name__ == "__main__":

    credentials = {
        "username": "YOUR_EMAIL",
        "password": "YOUR_PW"
    }

    api_url = "https://api.mqs.dk"

    headers = get_headers(credentials, api_url)

    queries_list = ["isomeric_smiles:CC", #ethane (C2)
                    "cid:6334", #propane (C3)
                    "isomeric_smiles:CCCC", #butane (C4)
                    "isomeric_smiles:CCCCC", #pentane (C5)
                    "isomeric_smiles:CCCCCC", #hexane (C6)
                    "inchi_key:IMNFDUFMRHMDMM-UHFFFAOYSA-N", #heptane (C7)
                    "isomeric_smiles:CCCCCCCC", #octane (C8)
                    "cas:111-84-2", #nonane (C9)
                    "isomeric_smiles:CCCCCCCCCC"] #decane (C10)

    results = [requests.get(f"{api_url}/search", params={"q": query}, headers=headers).json()
               for query in queries_list]

    number_of_compounds = range(len(queries_list))
    id_set_list = []
    for i in number_of_compounds:
        print(results[i]['response']['docs'][0], "\n")
        id_set_list.append(results[i]["response"]["docs"][0]["id"])

    print("id_set_list:", id_set_list)

    data_set = []
    for compound_id in id_set_list:
        data_set.append(requests.get(f"{api_url}/compound/{compound_id}", headers=headers).json())

    vibrational_spectra_data = [data_set[i]["pubchemqc_pm6"][0]["vibrations"]
                                for i in number_of_compounds]
    print("vibrational_spectra_data:", vibrational_spectra_data)
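
The script above takes the first document of every search result and would fail with an IndexError if a query returned no documents. A more defensive look-up could be sketched as follows. The helper name `first_doc_id` is hypothetical, and the assumption that a search without matches returns an empty docs list has not been verified against the API.

```python
def first_doc_id(result):
    """Return the id of the first document in a search result, or None if there is none."""
    docs = result.get("response", {}).get("docs", [])
    return docs[0]["id"] if docs else None


# Stand-in search results shaped like the responses used in the script above
found = {"response": {"docs": [{"id": "626745bf5297cf60c40d9459"}]}}
empty = {"response": {"docs": []}}

ids = [first_doc_id(result) for result in (found, empty)]
print(ids)

# Keep only the queries that actually matched a compound
id_set_list = [compound_id for compound_id in ids if compound_id is not None]
print(id_set_list)
```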
    

State look-up

The PubchemQC PM6 data set indices for the states can change because some compounds do not hold data for all states.

The following code snippet shows how one can read out the data from different states for a set of molecules:

    for i in number_of_compounds:
        for dict_ in data_set[i]['pubchemqc_pm6']:
            if dict_['state'] == "anion":
                print("Compound:", data_set[i]['title'], "\n", "Anion data:", dict_, "\n")
            if dict_['state'] == "cation":
                print("Compound:", data_set[i]['title'], "\n", "Cation data:", dict_, "\n")
            if dict_['state'] == "S0":
                print("Compound:", data_set[i]['title'], "\n", "S0 data:", dict_, "\n")
            if dict_['state'] == "D0":
                print("Compound:", data_set[i]['title'], "\n", "D0 data:", dict_, "\n")
            if dict_['state'] == "T0":
                print("Compound:", data_set[i]['title'], "\n", "T0 data:", dict_, "\n")
            if dict_['state'] == "Q0":
                print("Compound:", data_set[i]['title'], "\n", "Q0 data:", dict_, "\n")
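
The same look-up can be written more compactly by indexing the entries of a compound by their state name. This is a sketch only: `states_by_name` is a hypothetical helper, and the compound in the example is a made-up record in which the cation entry is missing.

```python
def states_by_name(compound):
    """Map each state name (e.g. "anion", "S0") to its entry in the pubchemqc_pm6 list."""
    return {entry["state"]: entry for entry in compound.get("pubchemqc_pm6", [])}


# Made-up compound without cation data: S0 sits at index 1 instead of index 2
compound = {
    "title": "example",
    "pubchemqc_pm6": [{"state": "anion"}, {"state": "S0"}, {"state": "T0"}],
}

states = states_by_name(compound)
print(sorted(states))
print(states.get("cation"))
print(states["S0"] is compound["pubchemqc_pm6"][1])
```

Looking up entries by name avoids relying on the list index, which shifts whenever a state is missing.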

Before designing a high-throughput pipeline via the MQS Search API and tailored scripts, one can also make use of the MQS Dashboard Search UI to easily check what kind of data one can find in the database.

For example, you can screen for all available S0 state data by typing pubchemqc_pm6.state:S0 into the dashboard's search input field.

Jupyter notebook and Python script

You can find a Jupyter notebook and a Python script implementation of the above examples in our dedicated GitHub tutorials repo.