New Kubeflow Version 1.9 deployed on MQS Infrastructure

Posted on Aug 29, 2024

New Kubeflow Version 1.9

Kubeflow recently released Version 1.9 with a variety of feature improvements such as a centralized model registry, security enhancements, and integration/installation improvements. We have now deployed Kubeflow 1.9 in the MQS infrastructure and you can access Kubeflow with the Quantum & Machine Learning Tier: https://dashboard.mqs.dk/subscriptions

Here is an overview of the important menu items under Pipelines, which we currently use at MQS to design well-thought-out calculation pipelines:

  • Pipelines
  • Experiments
  • Runs
  • Recurring Runs
  • Artifacts
  • Executions

This also allows us to design the Cebule Application Programming Interface (API) in a consistent way and to bridge MLOps with quantum chemistry, quantum computing and machine learning tools.

These are some of the changes between versions 1.8 and 1.9 that we found important for navigating and applying our model library with Kubeflow:

The Central Kubeflow Dashboard UI has been improved, with all Kubeflow Pipelines links now being under the Pipelines tab as shown on the left below:

The UI has now been updated to support visualizing output artifacts such as the training loss plot from an MQS ML model fine-tuning pipeline:

Visualizing images in the UI lets you conveniently view output results (e.g., evaluating a training run with the loss plot) without having to download the output files and view them locally, as was previously needed.

Applying the MQS Model Library

As part of our Quantum & Machine Learning Tier, we provide quantum chemistry and ML models for on-demand use such as geometry optimization with different methods, and graph neural network training/fine-tuning to predict molecular target properties.

These models can be accessed through Pipelines -> Shared on the Central Dashboard:

Additionally, the models included in the Quantum & Machine Learning Tier, plus the Enterprise Matrix-Completed Graph Neural Network (MCGNN) for generating Hamiltonians of molecules, are available through the pay-per-usage Cebule API and can be used in a Jupyter Notebook with our SDK as described in a previous tutorial.

The following models/pipelines are available, with more being continuously added to the model library.

Overview of Kubeflow Pipelines

geometry_opt

Semi-empirical geometry optimization of molecule 3D coords after initial force field optimization.

Inputs:

force_field: str from [mmff94, ghemical].

optimization_method: str from [gfn2_xtb, am1].

smiles_list: List[str] of SMILES as JSON str.

Cebule TaskTypes applied:

GEOMETRY_OPT

Artifacts:

None

Output:

List[List[Tuple[float, float, float]]] as JSON str containing each molecule’s optimized 3D coords.
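
As a rough illustration of the input and output formats listed above, the following sketch prepares the three pipeline parameters and decodes the JSON result. The helper names `build_geometry_opt_params` and `parse_geometry_opt_output` are hypothetical and not part of the MQS SDK or Kubeflow; how the parameters are actually submitted (for example through the Kubeflow UI) is deliberately left out.

```python
import json
from typing import Dict, List, Sequence, Tuple

# Allowed values as listed in the pipeline description above.
FORCE_FIELDS = ("mmff94", "ghemical")
OPTIMIZATION_METHODS = ("gfn2_xtb", "am1")


def build_geometry_opt_params(
    smiles: Sequence[str], force_field: str, optimization_method: str
) -> Dict[str, str]:
    """Hypothetical helper: assemble the geometry_opt pipeline inputs."""
    if force_field not in FORCE_FIELDS:
        raise ValueError(f"force_field must be one of {FORCE_FIELDS}")
    if optimization_method not in OPTIMIZATION_METHODS:
        raise ValueError(f"optimization_method must be one of {OPTIMIZATION_METHODS}")
    return {
        "force_field": force_field,
        "optimization_method": optimization_method,
        # The pipeline expects the SMILES list encoded as a JSON string.
        "smiles_list": json.dumps(list(smiles)),
    }


def parse_geometry_opt_output(output_json: str) -> List[List[Tuple[float, ...]]]:
    """Decode the JSON output into per-molecule lists of (x, y, z) tuples."""
    return [[tuple(atom) for atom in molecule] for molecule in json.loads(output_json)]


params = build_geometry_opt_params(["CCO", "c1ccccc1"], "mmff94", "gfn2_xtb")
```

JSON has no tuple type, so the decoded coordinates arrive as nested lists; converting them back to tuples is only a convenience.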

model_training

Fine-tune a graph neural network on a custom dataset to predict a molecule property of interest.

Inputs:

model_name: str unique to this model (only letters, numbers, underscores).

model_type: str from [mcgnn, delfta].

query: str to select dataset molecules from MQS Database.

target_property: str for the model to learn to predict; currently [homo_lumo_gap] supported.

Cebule TaskTypes applied:

GNN_DATASET_CREATE

GEOMETRY_OPT

GNN_DATASET_EXTEND

GNN_TRAIN

Artifacts:

test_loss_mae: Metrics

loss_history: Metrics

loss_plot: HTML

model: Model

pretrained_model: Model

Output:

None
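
Because `model_name` may only contain letters, numbers and underscores, it can be worth checking the inputs locally before starting a run. The sketch below is one possible client-side check; the function names are hypothetical, the name pattern assumes ASCII letters, and the `query` string is passed through untouched because its syntax is not described here.

```python
import re
from typing import Dict

# Allowed values as listed in the pipeline description above.
MODEL_TYPES = ("mcgnn", "delfta")
TARGET_PROPERTIES = ("homo_lumo_gap",)

# Assumption: "letters" means ASCII letters.
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def is_valid_name(name: str) -> bool:
    """Return True if the name uses only letters, numbers and underscores."""
    return _NAME_PATTERN.fullmatch(name) is not None


def build_model_training_params(
    model_name: str, model_type: str, query: str, target_property: str
) -> Dict[str, str]:
    """Hypothetical helper: assemble the model_training pipeline inputs."""
    if not is_valid_name(model_name):
        raise ValueError("model_name may only contain letters, numbers and underscores")
    if model_type not in MODEL_TYPES:
        raise ValueError(f"model_type must be one of {MODEL_TYPES}")
    if target_property not in TARGET_PROPERTIES:
        raise ValueError(f"target_property must be one of {TARGET_PROPERTIES}")
    return {
        "model_name": model_name,
        "model_type": model_type,
        "query": query,  # passed through as-is; format not covered here
        "target_property": target_property,
    }
```

The same `is_valid_name` check would apply to dataset names, which follow the same naming rule.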

model_prediction

Predict a property of interest with a pre-trained or fine-tuned model.

Inputs:

model_name: str to use for prediction; fine-tuned model or from [mcgnn, delfta] for pre-trained model.

smiles_list: List[str] of SMILES as JSON str.

Cebule TaskTypes applied:

GNN_PREDICT

Artifacts:

model: Model

Output:

List[float] in JSON containing predicted target property for given molecules.

Overview of Cebule TaskTypes (Python SDK)

GEOMETRY_OPT

Semi-empirical geometry optimization of molecule 3D coords after initial force field optimization.

Inputs:

force_field: str from [mmff94, ghemical].

optimization_method: str from [gfn2_xtb, am1].

smiles_list: list[str] of SMILES.

max_processors: Limits concurrency of optimization.

Output:

list containing each molecule’s optimized 3D coords.

GNN_DATASET_CREATE

Create a dataset for training or prediction.

Inputs:

dataset_name: str unique to this dataset (only letters, numbers, underscores).

target_property: str the dataset is meant for from [homo_lumo_gap, eigenvalue] where eigenvalue is for predicted hamiltonians.

includes_target_val: bool whether the dataset includes the ground truth target values.

max_processors: None

Output:

None

GNN_DATASET_EXTEND

Add datapoints to an existing dataset.

Inputs:

dataset_name: str to add data points to.

molecule_chunk: Dict[str, Union[List[str], List[List[Tuple[float, float, float]]], List[float]]] containing the keys smiles, coords (for homo_lumo_gap datasets), and target_val, each containing a list of that data for each molecule.

max_processors: None

Output:

None
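
To make the shape of `molecule_chunk` more concrete, here is a minimal sketch that assembles such a dictionary and checks that all lists have one entry per molecule. The helper `build_molecule_chunk` is hypothetical, and treating `coords` and `target_val` as omittable keys is an assumption based on the description above, not confirmed behaviour of the API.

```python
from typing import Dict, List, Optional, Sequence, Tuple, Union

Coords = List[Tuple[float, float, float]]
MoleculeChunk = Dict[str, Union[List[str], List[Coords], List[float]]]


def build_molecule_chunk(
    smiles: Sequence[str],
    coords: Optional[Sequence[Sequence[Sequence[float]]]] = None,
    target_val: Optional[Sequence[float]] = None,
) -> MoleculeChunk:
    """Hypothetical helper: build a molecule_chunk with consistent lengths."""
    chunk: MoleculeChunk = {"smiles": list(smiles)}
    n_molecules = len(smiles)

    if coords is not None:
        if len(coords) != n_molecules:
            raise ValueError("coords must contain one entry per molecule")
        converted = []
        for molecule in coords:
            atoms = []
            for atom in molecule:
                if len(atom) != 3:
                    raise ValueError("each atom needs exactly three coordinates")
                atoms.append(tuple(float(c) for c in atom))
            converted.append(atoms)
        chunk["coords"] = converted

    if target_val is not None:
        if len(target_val) != n_molecules:
            raise ValueError("target_val must contain one entry per molecule")
        chunk["target_val"] = [float(v) for v in target_val]

    return chunk


# Water with illustrative (not optimized) coordinates and a made-up target value.
water = build_molecule_chunk(
    ["O"],
    coords=[[(0, 0, 0), (0.96, 0, 0), (-0.24, 0.93, 0)]],
    target_val=[0.5],
)
```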

GNN_DATASET_GET

View a chunk of an existing dataset.

Inputs:

dataset_name: str to view.

start: int index to begin (inclusive).

end: int final index (exclusive).

max_processors: None

Output:

Dict[str, Union[List[str], List[List[Tuple[float, float, float]]], List[float]]] containing the selected datapoints.
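
Since `start` is inclusive and `end` is exclusive, a dataset can be read in consecutive chunks without overlap or gaps. The generator below sketches how such index pairs could be produced; `chunk_ranges` is a hypothetical helper, and it assumes the caller already knows the total number of datapoints.

```python
from typing import Iterator, Tuple


def chunk_ranges(total: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) pairs covering range(total), with end exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


ranges = list(chunk_ranges(10, 4))
```

Each pair can then be passed as `start` and `end` of one GNN_DATASET_GET task.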

GNN_DATASET_DELETE

Delete an existing dataset.

Inputs:

dataset_name: str to delete.

max_processors: None

Output:

None

GNN_TRAIN

Train a GNN to predict molecule effective Hamiltonians, or fine-tune a GNN to predict a molecule property.

Inputs:

dataset_name: str to train on.

model_name: str unique to this model (only letters, numbers, underscores).

model_type: str optional from [mcgnn, delfta] (default mcgnn, note that only mcgnn supports eigenvalue datasets).

hamiltonian_len: int optional: only for eigenvalue datasets, defaults to 128 for 128x128 hamiltonians.

hyperparameters: Dict[str, Union[float, int]].

optional keys:

epochs (default 100)

batch_size (default 8)

initial_learning_rate (default 2.0 × 10⁻³)

max_processors: Used to limit concurrency of data loading

Output:

Dict[str, Union[float, Dict[str, List[float]]]] containing the key test_mae for the MAE test loss, and loss_history with the training and val loss (MAE for eigenvalue, MSE for other properties).
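
The optional `hyperparameters` keys and their defaults can be mirrored on the client side to make the effective settings of a training run explicit. This is only a sketch: `resolve_hyperparameters` is a hypothetical helper, the positivity checks are our own assumption, and the service presumably applies the same defaults itself when a key is omitted.

```python
from typing import Dict, Optional, Union

Number = Union[float, int]

# Defaults as listed for GNN_TRAIN above.
DEFAULT_HYPERPARAMETERS: Dict[str, Number] = {
    "epochs": 100,
    "batch_size": 8,
    "initial_learning_rate": 2.0e-3,
}


def resolve_hyperparameters(
    overrides: Optional[Dict[str, Number]] = None,
) -> Dict[str, Number]:
    """Hypothetical helper: merge user overrides over the documented defaults."""
    merged = {**DEFAULT_HYPERPARAMETERS, **(overrides or {})}
    for key in ("epochs", "batch_size"):
        value = merged[key]
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{key} must be a positive integer")
    if not merged["initial_learning_rate"] > 0:
        raise ValueError("initial_learning_rate must be positive")
    return merged


hyperparameters = resolve_hyperparameters({"epochs": 20})
```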

GNN_PREDICT

Predict a target property or hamiltonian of molecules with a trained model.

Inputs:

dataset_name: str to run prediction on.

model_name: str to use for prediction (can be from [mcgnn, delfta] for pre-trained model, for any property other than eigenvalue).

return_upper_triangle: bool, optional; only for eigenvalue datasets. If True, returns the flattened upper triangle of the predicted hamiltonians instead of the full square matrix (default False).

max_processors: Used to limit concurrency of data loading.

Output:

Dict[str, Union[float, List[float], List[List[float]], List[List[List[float]]]]] containing the key predictions for the predicted target values, hamiltonians, or flattened upper triangles of hamiltonians, and optionally the key mae for the MAE of the predictions if the dataset includes ground truth target values.
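
When `return_upper_triangle` is used, the full square matrix can be rebuilt locally. The sketch below assumes the flattened values are ordered row by row, include the diagonal, and describe a symmetric matrix; none of this is confirmed by the description above, so the ordering should be checked against a known example before relying on it. The function name is hypothetical.

```python
from typing import List, Sequence


def unflatten_upper_triangle(flat: Sequence[float], n: int) -> List[List[float]]:
    """Rebuild a symmetric n x n matrix from its flattened upper triangle.

    Assumes row-major ordering including the diagonal, i.e.
    (0,0), (0,1), ..., (0,n-1), (1,1), ..., (n-1,n-1).
    """
    expected = n * (n + 1) // 2
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values for n={n}, got {len(flat)}")
    full = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i, n):
            # Mirror each value across the diagonal.
            full[i][j] = full[j][i] = float(flat[k])
            k += 1
    return full


small = unflatten_upper_triangle([1, 2, 3], 2)
```

Under these assumptions, a 128x128 hamiltonian corresponds to 128 * 129 / 2 = 8256 flattened values, roughly half the size of the full matrix.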

Python SDK notebook examples

For a detailed example of running our Kubeflow pipelines, see our tutorial on the pipelines for training, fine-tuning and using the MCGNN and DelFTa ML models to predict molecule HOMO-LUMO gaps.

Further, we recommend taking a look at all Python SDK example notebooks here: https://gitlab.com/mqsdk/python-sdk/-/tree/main/notebooks

We hope that our integration of Kubeflow’s latest version and our library of quantum chemistry and ML models will be of value for interesting studies with molecular quantum information.

You are always welcome to reach out if you have any feedback, questions, ideas or even collaboration projects to push this kind of work and research further (contact (at) mqs [dot] dk).