There are many other evaluation metrics, and even evaluations of evaluation metrics (Cornell).

Tasks

To run the benchmark, it’s good to start with tasks already tackled in the literature. It is also worth evaluating the same model across a wide variety of tasks (to avoid overfitting to one particular task).

Multi-objective tasks are more realistic, but more difficult than single-objective tasks (for example, getting molecules which are active, non-toxic, and synthesizable). This has been tried recently (Peking University). For a general introduction to multi-objective deep reinforcement learning, see (Oxford).
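To make the multi-objective idea concrete, here is a minimal sketch of two common ways to compare candidates on several objectives at once: a Pareto front and a weighted sum. The molecule names, the scores, and the weights are all made up for illustration (assume each score is in [0, 1] and higher is better); this is not the method of any paper cited here.

```python
from typing import Dict, List

# Hypothetical scores in [0, 1]; higher is better for every objective.
Scores = Dict[str, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Names of the candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


def weighted_sum(scores: Scores, weights: Dict[str, float]) -> float:
    """Collapse several objectives into one number (the weights are a modelling choice)."""
    return sum(weights[k] * scores[k] for k in weights)


# Illustrative values only, not real measurements.
candidates = {
    "mol_A": {"activity": 0.9, "non_toxicity": 0.4, "synthesizability": 0.7},
    "mol_B": {"activity": 0.6, "non_toxicity": 0.8, "synthesizability": 0.6},
    "mol_C": {"activity": 0.5, "non_toxicity": 0.3, "synthesizability": 0.5},
}
weights = {"activity": 0.25, "non_toxicity": 0.5, "synthesizability": 0.25}

front = pareto_front(candidates)
best = max(candidates, key=lambda name: weighted_sum(candidates[name], weights))
print("Pareto front:", front)
print("Best by weighted sum:", best)
```

The Pareto front keeps every trade-off that is not strictly worse than another candidate, while the weighted sum forces a single ranking and therefore depends on how the weights are chosen.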

Here’s a list of tasks (tell me if I omitted your paper):

Drug discovery tasks

Organic materials tasks

Participants can also propose their own favorite objectives. In any case, I think it is better to consider at least one specific real-world application, and not just generate ‘drug-like’ molecules, as in those preliminary papers by Harvard 1, Google 1, Paris-Saclay, Wildcard, Harvard 2, Novartis, Georgia Tech.

Data

This DiversityNet benchmark is based on publicly available data, like all the papers cited above. In most papers, data is taken from:

PubChem

ChEMBL

ExCAPE-DB, which aggregates PubChem and ChEMBL.

ZINC

(Figure: typical small molecule)

Many papers only use small datasets (including mine), and in some ways, that’s bad. Model pre-training should be done on a large dataset. This will require intensive computation, and here, the generosity of cloud sponsors is important.

Even better, different pre-training set sizes could be tested (5K, 10K, 15K, 30K, 50K, 100K, 250K, 1M) to understand how the performance of the generative model changes with the amount of data (that’s a suggestion from an anonymous referee of my paper).
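One possible way to build those pre-training sets is sketched below. The nesting (each smaller set is a prefix of the larger ones) is my own assumption, not something the referee asked for; it simply keeps the comparison between sizes from being blurred by different random draws. The function name and the placeholder dataset are hypothetical.

```python
import random
from typing import Dict, List, Sequence

# Pre-training set sizes listed in the text.
SIZES = [5_000, 10_000, 15_000, 30_000, 50_000, 100_000, 250_000, 1_000_000]


def nested_subsets(
    data: Sequence[str], sizes: Sequence[int], seed: int = 0
) -> Dict[int, List[str]]:
    """Shuffle once, then take prefixes, so each smaller set is contained in every larger one.

    Sizes larger than the dataset are skipped.
    """
    rng = random.Random(seed)  # fixed seed so the subsets are reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes) if n <= len(shuffled)}


# Placeholder strings standing in for SMILES read from a real dataset.
dataset = [f"molecule_{i}" for i in range(12_000)]
subsets = nested_subsets(dataset, SIZES)
for size, subset in subsets.items():
    print(size, len(subset))
```

Each subset would then be used to pre-train the same model with the same settings, so that only the amount of data varies.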

Besides small-molecule chemistry, the same generative models can be used for other tasks related to drug discovery: for RNA sequences (University of Tokyo), for DNA sequences (University of Toronto) and for proteins (Harvard 4, ETH Zurich 3). However, I think it is better to keep those non-chemistry tasks for two separate benchmarks (which can be run in parallel, if participants and sponsors ask): DiversityNet-genetics and DiversityNet-proteins.

Generative models

It’s good to start with models already tried in the literature:

For GANs, there are different flavors: Wasserstein-GAN (University of Montreal), Cramer-GAN (DeepMind), Optimal Transport-GAN (OpenAI), Coulomb-GAN (Linz University), although in the end, maybe they are all equal (Google 2).

You can also find more in the Natural Language Processing literature (and apply them to SMILES):

Due to this flood of publications from non-chemistry fields, I think it would be inefficient to build a dedicated library for DiversityNet. That’s another difference with MoleculeNet, which is building the DeepChem library. The problem with that approach is the long delay between the publication of a model and its integration into the chemistry-specific library. Instead, I suggest that practitioners learn to use a general library like TensorFlow, PyTorch or Keras. There are many beginner-friendly courses and tutorials online.

Finally, it will be interesting to design a systematic procedure for testing hyperparameter values. Generative models are often very sensitive to the choice of hyperparameters (another suggestion from an anonymous referee of my previous paper).
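As a starting point, a systematic procedure could be as simple as an exhaustive grid search, sketched below. The search space and the `toy_evaluate` function are placeholders I made up: in a real run, the evaluation would train the generative model with the given configuration and return a benchmark metric.

```python
import itertools
from typing import Any, Callable, Dict, Iterator, List, Tuple

Config = Dict[str, Any]


def grid(space: Dict[str, List[Any]]) -> Iterator[Config]:
    """Yield every combination of the listed hyperparameter values."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def grid_search(
    space: Dict[str, List[Any]], evaluate: Callable[[Config], float]
) -> List[Tuple[float, Config]]:
    """Score every configuration and return them sorted from best to worst."""
    results = [(evaluate(cfg), cfg) for cfg in grid(space)]
    return sorted(results, key=lambda r: r[0], reverse=True)


def toy_evaluate(cfg: Config) -> float:
    """Stand-in for 'train the model and measure it'; it peaks at lr=1e-3, 256 units."""
    lr_penalty = abs(cfg["learning_rate"] - 1e-3) * 1000
    size_penalty = abs(cfg["hidden_units"] - 256) / 256
    return -(lr_penalty + size_penalty)


# Hypothetical search space, for illustration only.
space = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "hidden_units": [128, 256, 512],
}

ranking = grid_search(space, toy_evaluate)
best_score, best_config = ranking[0]
print(best_config)
```

A grid grows exponentially with the number of hyperparameters, so for larger spaces random search or Bayesian optimization may be more practical; the ranking logic stays the same.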

Computational resources

For computations, you can use GPUs from cloud sponsors, once these resources become available. In the meantime, you can start small experiments using up to 2 GPUs for free, with the Microsoft Azure $200/30 days trial, if you have a Visa/Mastercard payment card. (Notably, Google Cloud excludes GPUs from its free trial [edit: but apparently, you can use the Google ML engine].) I used Microsoft Azure in my paper (and their Data Science Linux VM), and it was fine.

Academic prize: publish in a high-impact journal

If the goals are met, the DiversityNet paper has a reasonable chance of being published in a high-impact journal. Being a co-author of a good paper can be useful at any stage of an academic career, from undergraduate students applying for graduate school, up to tenured professors applying for a research grant. Scientific publications are the lifeblood of academic life.

However, since this challenge is ‘in the wild’, this academic prize is not 100% guaranteed: it’s still possible to get scooped by a traditional lab posting a preprint earlier on ArXiv (for example, Harvard has been working on this topic for months).

Still, this risk remains pretty low. At the flag-planting game, Authorea (or any ‘GitHub for papers’) is a better weapon than ArXiv: it allows micro-contributions to be released quickly and often (like on GitHub), whereas ArXiv requires writing a whole PDF paper, which is a hassle.

Moreover, papers can be ‘forked’, which makes it easier to build upon them. With ArXiv, a new paper needs to be written from scratch.

As a result, the iteration cycle is shorter, and the dissemination of ideas is accelerated.

(Figure: the iteration cycle is shorter with open collaborative writing)

To avoid “idea stealing” within the (potentially large) community of co-authors, it is strongly recommended to use public and timestamped communication channels (GitHub, Telegram…), so that collaboration is possible without the need for mutual trust. These communication records can be used to attribute credit individually (i.e. who planted their micro-flag first).

(Figure: co-authors of the Higgs Boson paper)

Financial prizes: call for sponsors

Financial incentives are helpful, especially for non-academics, who are under no pressure to publish research articles. Money matters: there are huge crowds of data scientists competing for prizes on platforms like Kaggle. On the other hand, the MoleculeNet benchmark does not attract participants beyond its core contributors at Stanford (who are paid to do this job), due to the absence of a financial prize.

Non-financial contributions are also welcome, especially GPU resources on the cloud. It would also be nice to have a cloud infrastructure for multiple GPUs, like big labs have (see OpenAI).

Sponsors can fill in the information form.

Non-profits & philanthropists

Generative AI models in chemistry have the potential to benefit humanity in many different ways. As a result, they can attract funding from various non-profit foundations and philanthropists.

At some point, AI might provide new treatments for various incurable diseases. Moreover, keeping research as free as possible might reduce the price of those AI-generated drugs, and avoid a situation where Pharma and biotech companies are too dependent on costly proprietary platforms.

These AI models can also impact materials science, including organic solar cells, which can help in the fight against global warming.

(Figure: organic solar cells)

Moreover, sponsoring the DiversityNet challenge is a way to accelerate open and decentralized research in Artificial Intelligence. This is valuable at a time when there is a concentration of AI talent within the walls of tech giants, like Google, Facebook and Baidu.

For-profits

Like non-profit organisations and philanthropists, private companies may also want to benefit humanity. At the same time, sponsorship can help their business agenda through:

Precompetitive collaboration: open innovation is a way to pool resources. For Pharma and biotech companies, it can mean better virtual screening, and better leads.

Recruitment and brand exposure: like an online hackathon, it helps to identify, attract and recruit talent. Sponsorship can also increase engagement with a developer ecosystem (API, cloud…).

How money will be spent

It’s a collaboration, so it makes no sense to pick a ‘winner’ as in competitions. Here, money should be spent to encourage fast sharing of information, and accumulation of knowledge.

For example, sponsors can nominate a jury of experts, who split the money among participants in proportion to their contributions. The experts base their evaluations on public and timestamped communication records.
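Once the jury has given each participant a contribution score, the proportional split itself is simple arithmetic. Here is a minimal sketch, assuming non-negative scores and a prize counted in whole cents; the names, the scores, and the rounding rule (largest remainder) are my own illustrative choices.

```python
from typing import Dict


def split_prize(total_cents: int, scores: Dict[str, float]) -> Dict[str, int]:
    """Split `total_cents` in proportion to `scores`, without losing a cent to rounding."""
    total_score = sum(scores.values())
    if total_score <= 0:
        raise ValueError("at least one contribution score must be positive")

    # Exact (fractional) share of each participant.
    exact = {name: total_cents * s / total_score for name, s in scores.items()}
    # Round down first, then hand out the leftover cents one by one,
    # starting with the participants whose shares were cut the most.
    shares = {name: int(amount) for name, amount in exact.items()}
    leftover = total_cents - sum(shares.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Hypothetical jury scores for three participants and a $100 prize.
jury_scores = {"alice": 3.0, "bob": 2.0, "carol": 1.0}
payout = split_prize(10_000, jury_scores)
print(payout)
```

The hard part remains the scoring, which is the jury's job; the code only guarantees that the amounts add up to the full prize.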

The jury needs to dig into the project, and such a review is subjective. For evaluating complex tasks, human judgement is probably unavoidable, and can’t be replaced by an automated metric yet.

On the other hand, this human evaluation is not arbitrary either: to maximize impact, sponsors should treat participants fairly, in order to maximize their motivation and productivity (see Equity theory). That’s the way to get the biggest bang for the buck.

Conclusion

To contribute to the DiversityNet collaborative benchmark, you can edit the draft on Authorea, edit the code on GitHub, or talk on Telegram.

Sponsors can fill in the information form.