What is the Alchemy Dataset?

The Tencent Quantum Lab has recently introduced a new molecular dataset, called Alchemy, to facilitate the development of new machine learning models useful for chemistry and materials science.

The dataset lists 12 quantum mechanical properties of 130,000+ organic molecules comprising up to 12 heavy atoms (C, N, O, S, F and Cl), sampled from the GDBMedChem database. These properties have been calculated using the open-source computational chemistry program Python-based Simulation of Chemistry Framework (PySCF).

The Alchemy dataset expands on the volume and diversity of existing molecular datasets such as QM9.

For more details on Alchemy, please refer to the paper Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models. If you use the dataset in your research, please cite the paper below:

@article{chen2019alchemy,

title={Alchemy: A Quantum Chemistry Dataset for Benchmarking AI Models},

author={Chen, Guangyong and Chen, Pengfei and Hsieh, Chang-Yu and Lee, Chee-Kong and Liao, Benben and Liao, Renjie and Liu, Weiwen and Qiu, Jiezhong and Sun, Qiming and Tang, Jie and Zemel, Richard and Zhang, Shengyu},

journal={arXiv preprint arXiv:1906.09427},

year={2019}

}

Join the Alchemy Contest

Take part and help develop machine learning models that accurately predict the properties of organic molecules!

In this multi-feature learning contest, you are free to use any method you like to predict a set of 12 properties for organic molecules. To help you train your model, a training set with the same 12 molecular properties is provided below. The competition is conducted in two phases.

Phase 1 (Development), 5/22/2019 - 8/7/2019: A period for contest participants to become familiar with Codalab and develop their models. Participants are asked to predict properties for the molecules in the valid.zip file. Five submissions per day are allowed.

Phase 2 (Evaluation), 8/8/2019 - 10/7/2019: The final stage of the competition. Participants are asked to predict properties for the molecules in the test.zip file. One submission per day is allowed, up to a total of twenty submissions in Phase 2.

The contest is evaluated on the mean absolute error averaged over the 12 regression tasks.
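The evaluation metric described above can be sketched as follows. This is only one reading of "mean absolute error averaged over 12 regression tasks": the official scoring script is not reproduced here, and the function name mean_task_mae and the list-of-rows input layout are assumptions made for illustration.

```python
def mean_task_mae(y_true, y_pred):
    """Return (overall score, per-task MAE list).

    y_true and y_pred are lists of rows, one row per molecule and one
    column per property (12 columns in the contest). This layout is an
    assumption for illustration, not the official scorer's input format.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    n_rows = len(y_true)
    n_tasks = len(y_true[0])
    per_task = []
    for task in range(n_tasks):
        # Mean absolute error for a single property over all molecules.
        total = sum(abs(y_true[i][task] - y_pred[i][task]) for i in range(n_rows))
        per_task.append(total / n_rows)
    # The score is the plain average of the per-property errors.
    return sum(per_task) / n_tasks, per_task
```

Returning the per-task errors alongside the averaged score makes it easier to see which of the properties dominates the total during development.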

Rewards

Cash prizes totaling ￥100,000 RMB will be awarded to the top three entries on the Phase 2 leaderboard only.

First Place Prize: ￥50,000

Second Place Prize: ￥30,000

Third Place Prize: ￥20,000

Requirements

Winners of the first, second, and third place prizes must provide clear model documentation and code in accordance with the Declaration of Eligibility, Non-Exclusive License, and Release form. The form will be distributed in Phase 2.

Contest Rules

Please refer to the Contest Rules for full details. Every contest participant must acknowledge having read the contest rules before obtaining the datasets.

Contest Data

Training and Validation

Please download the dev.zip (training), valid.zip (validation in Phase 1) and test.zip (evaluation in Phase 2) files.

For development (Phase 1, 5/22 - 8/7/2019)

The dev.zip archive contains 99,776 SD files, each giving the structural information of one molecule, and a train.csv file giving the 12 properties of all molecules.

dev.zip (updated 7/30)

md5sum: 70086cc2a2ac07f36a3a7c11a305a1a3

The SD filenames correspond to the molecular identification numbers found in the GDBMedChem database. These identification numbers are also used to distinguish molecules in the train.csv file. The molecular files are stored in different directories based on the number of heavy atoms.

The valid.zip archive contains 3,951 SD files. This dataset is to be used for the Phase 1 competition. It will be available for download on 5/22/2019.

valid.zip (updated 7/30)

md5sum: dbe50df5f0b8a2771ed0f6f31481c035

For evaluation (Phase 2, 8/8 - 10/7/2019)

The test.zip archive contains 15,760 SD files. This dataset is to be used for the Phase 2 competition. It will be available for download on 8/8/2019.

test.zip (updated 7/30)

md5sum: e6b6f17882137118e2c323a77e793305
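To confirm that a download is intact, the published md5sum values can be checked with Python's standard hashlib module. The sketch below is a minimal example; the helper names md5_of_file and matches_published_md5 are hypothetical, and the archives are assumed to sit at whatever local path you pass in.

```python
import hashlib

# Checksums as published on this page for the 7/30 versions of the archives.
PUBLISHED_MD5 = {
    "dev.zip": "70086cc2a2ac07f36a3a7c11a305a1a3",
    "valid.zip": "dbe50df5f0b8a2771ed0f6f31481c035",
    "test.zip": "e6b6f17882137118e2c323a77e793305",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path, archive_name):
    """Return True if the file at path has the checksum listed for archive_name."""
    return md5_of_file(path) == PUBLISHED_MD5[archive_name]
```

For example, matches_published_md5("dev.zip", "dev.zip") should return True for an uncorrupted copy of the 7/30 archive saved in the current directory.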

All molecular properties are taken from Tencent Quantum Lab's Alchemy dataset. In this contest, all reported molecular properties are normalized by subtracting the population mean and dividing by the standard deviation.
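The normalization described above is a standard z-score transform. The sketch below shows the idea for a single property column. It assumes the population standard deviation is used and that each property is normalized independently; neither detail is stated on this page, and the mean and standard deviation actually used for the contest data are not published here. The function names are hypothetical.

```python
import statistics


def standardize(values):
    """Z-score a list of floats; returns (normalized values, mean, std).

    Uses the population standard deviation, which is an assumption.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values], mean, std


def destandardize(normalized, mean, std):
    """Map normalized values back to physical units, given mean and std."""
    return [z * std + mean for z in normalized]
```

Because the contest targets are already normalized, predictions should be submitted on the same normalized scale; the inverse transform is only useful if you have the original statistics and want values in physical units.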

Optional Tools

RDKit

For contest participants without prior experience in handling molecular data, we strongly recommend learning to work with RDKit, a cheminformatics toolkit that makes it easy to build molecular graphs from the SD files we provide.

RDKit - Getting Started in Python

Tencent Alchemy Tools

If you do not want to dive into RDKit, we also provide a ready-to-use PyTorch dataloader that helps you handle these molecules easily. You will also find a collection of baselines, including MPNN, from which you can start your journey with Alchemy!

Tencent Alchemy Tools

Submission

answer.csv

The following description applies to both Phase 1 and Phase 2. Once you have built a model that works to your satisfaction, run it on the molecules provided in valid.zip (Phase 1) or test.zip (Phase 2), and save the predicted properties in a file named answer.csv, following the format of the train.csv file in dev.zip. In short, answer.csv should store an N by 13 matrix, where N is the number of molecules in valid.zip during Phase 1 or in test.zip during Phase 2. The first column should be the GDB ID, followed by 12 columns of molecular properties. The entries should be sorted in ascending order of GDB ID. The answer.csv file should then be zipped and named submission.zip before uploading for evaluation.

Join the Alchemy Contest

* A Codalab account is required for your submission.

Available on:

5/22/2019 - 8/7/2019 Phase 1 (Development)

8/8/2019 - 10/7/2019 Phase 2 (Evaluation)
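The packaging steps for answer.csv and submission.zip can be sketched with Python's standard csv and zipfile modules. This is a minimal illustration under several assumptions: GDB IDs are treated as integers so that sorting is numeric, the header row is passed in by the caller (copy it from train.csv, since the exact column names are not listed on this page), and the function name write_submission is hypothetical.

```python
import csv
import zipfile


def write_submission(predictions, header, csv_path="answer.csv",
                     zip_path="submission.zip"):
    """Write predictions to answer.csv and zip it as submission.zip.

    predictions: dict mapping GDB ID (assumed int) to a list of 12 floats.
    header: the 13 column names, assumed to be copied from train.csv.
    """
    if len(header) != 13:
        raise ValueError("header must have 13 columns: GDB ID plus 12 properties")
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        # Rows must appear in ascending order of GDB ID.
        for gdb_id in sorted(predictions):
            row = predictions[gdb_id]
            if len(row) != 12:
                raise ValueError(f"molecule {gdb_id} needs exactly 12 values")
            writer.writerow([gdb_id, *row])
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Store the file under the required name, regardless of csv_path.
        archive.write(csv_path, arcname="answer.csv")
    return zip_path
```

Checking the row width before writing catches a missing property early, which matters when only one submission per day is allowed in Phase 2.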

Have Questions?

For general questions, please ask on the Codalab forum of the Alchemy Contest.

Problems with the datasets and/or dataloaders? Submit an issue to Alchemy on GitHub.

Note: Tencent reserves the right to adjust the competition rules, prize information, schedule, and other aspects of the contest, including the relevant requirements and specifications, according to how the competition is running. All other content involved in the contest is subject to final confirmation by Tencent.