We released two large-scale datasets for research on learning to rank: MSLR-WEB30K, with more than 30,000 queries, and MSLR-WEB10K, a random sample of 10,000 queries drawn from MSLR-WEB30K.

Dataset Descriptions

The datasets are machine learning data in which queries and urls are represented by IDs. They consist of feature vectors extracted from query-url pairs, along with relevance judgment labels:

(1) The relevance judgments are obtained from a retired labeling set of a commercial web search engine (Microsoft Bing) and take 5 values, from 0 (irrelevant) to 4 (perfectly relevant).

(2) The features were extracted by us and are those widely used in the research community.

In the data files, each row corresponds to a query-url pair. The first column is the relevance label of the pair, the second column is the query id, and the following columns are features; each query-url pair is represented by a 136-dimensional feature vector. The larger the relevance label, the more relevant the query-url pair. A minimal parsing sketch is given after the example rows below.

Below are two rows from the MSLR-WEB10K dataset:

==============================================

0 qid:1 1:3 2:0 3:2 4:2 … 135:0 136:0

2 qid:1 1:3 2:3 3:0 4:0 … 135:0 136:0

==============================================
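For readers who want to load the files directly, below is a minimal parsing sketch in Python. The file path Fold1/train.txt is only an example; every train/vali/test file uses the same format.

==============================================

# Minimal sketch for parsing one MSLR data file.
# Each line: <label> qid:<qid> 1:<v1> 2:<v2> ... 136:<v136>

def parse_line(line):
    tokens = line.strip().split()
    label = int(tokens[0])                   # relevance label, 0-4
    qid = int(tokens[1].split(":")[1])       # query id
    features = [0.0] * 136
    for tok in tokens[2:]:
        idx, val = tok.split(":")
        features[int(idx) - 1] = float(val)  # feature ids are 1-based
    return label, qid, features

with open("Fold1/train.txt") as f:           # example path; adjust to your copy
    for line in f:
        label, qid, features = parse_line(line)
        # feed (qid, features, label) into your learning-to-rank code

==============================================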

Dataset Partition

We have partitioned each dataset into five parts with about the same number of queries, denoted as S1, S2, S3, S4, and S5, for five-fold cross validation. In each fold, we propose using three parts for training, one part for validation, and the remaining part for testing (see the following table). The training set is used to learn ranking models. The validation set is used to tune the hyperparameters of the learning algorithms, such as the number of iterations in RankBoost and the combination coefficient in the objective function of Ranking SVM. The test set is used to evaluate the performance of the learned ranking models.

Folds   Training Set   Validation Set   Test Set
Fold1   {S1,S2,S3}     S4               S5
Fold2   {S2,S3,S4}     S5               S1
Fold3   {S3,S4,S5}     S1               S2
Fold4   {S4,S5,S1}     S2               S3
Fold5   {S5,S1,S2}     S3               S4
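The rotation in the table can also be written down programmatically. The following Python sketch only enumerates the part names S1-S5; it does not touch the data files.

==============================================

# Sketch of the five-fold rotation over parts S1..S5:
# each fold trains on three consecutive parts, validates on the next,
# and tests on the remaining one.
parts = ["S1", "S2", "S3", "S4", "S5"]

for i in range(5):
    train = [parts[(i + j) % 5] for j in range(3)]
    vali = parts[(i + 3) % 5]
    test = parts[(i + 4) % 5]
    print(f"Fold{i + 1}: train={train}, vali={vali}, test={test}")

==============================================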

Datasets

The datasets were released on June 16, 2010.

To use the datasets, you must read and accept the online agreement. By using the datasets, you agree to be bound by the terms of that license.

Datasets      Size    MD5
MSLR-WEB10K   ~1.2G   97c5d4e7c171e475c91d7031e4fd8e79
MSLR-WEB30K   ~3.7G   4beae4bee0cd244fc9b2aff355a61555
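To check a downloaded archive against the MD5 values above, a small verification sketch in Python is shown below. The archive file name is an assumption; use whatever name your download has.

==============================================

# Sketch: compute the MD5 checksum of a downloaded archive and compare
# it with the value listed in the table above.
import hashlib

def md5_of(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example file name only; adjust to the actual archive you downloaded.
print(md5_of("MSLR-WEB10K.zip"))  # expect 97c5d4e7c171e475c91d7031e4fd8e79

==============================================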

Evaluation tools

The evaluation script was updated on Jan. 13, 2011. Thank you to Yasser Ganjisaffar for pointing out the bug.

Evaluation script for NDCG (mean NDCG) and Precision (MAP) (an illustrative NDCG sketch is shown below)

Significance test script for algorithm comparison
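The released scripts are the reference implementations for reported numbers. As a rough illustration of what the NDCG evaluation computes, here is a minimal NDCG@k sketch using the common 2^rel - 1 gain and log2 position discount; the official script may differ in details such as how queries with no relevant documents are handled.

==============================================

# Minimal NDCG@k sketch for graded relevance labels (0-4).
# This is an illustration, not the released evaluation script.
import math

def dcg(labels, k):
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(labels[:k]))

def ndcg(labels_in_ranked_order, k=10):
    ideal = dcg(sorted(labels_in_ranked_order, reverse=True), k)
    return dcg(labels_in_ranked_order, k) / ideal if ideal > 0 else 0.0

print(ndcg([2, 0, 3, 1, 0], k=5))

==============================================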

Feature List

Each query-url pair is represented by a 136-dimensional vector.

Feature List of Microsoft Learning to Rank Datasets

For feature ids 1-125, each feature is computed over five streams in this fixed order: body, anchor, title, url, whole document.

Feature IDs   Feature Description
1-5           covered query term number
6-10          covered query term ratio
11-15         stream length
16-20         IDF (inverse document frequency)
21-25         sum of term frequency
26-30         min of term frequency
31-35         max of term frequency
36-40         mean of term frequency
41-45         variance of term frequency
46-50         sum of stream length normalized term frequency
51-55         min of stream length normalized term frequency
56-60         max of stream length normalized term frequency
61-65         mean of stream length normalized term frequency
66-70         variance of stream length normalized term frequency
71-75         sum of tf*idf
76-80         min of tf*idf
81-85         max of tf*idf
86-90         mean of tf*idf
91-95         variance of tf*idf
96-100        boolean model
101-105       vector space model
106-110       BM25
111-115       LMIR.ABS (language model approach for information retrieval (IR) with absolute discounting smoothing)
116-120       LMIR.DIR (language model approach for IR with Bayesian smoothing using Dirichlet priors)
121-125       LMIR.JM (language model approach for IR with Jelinek-Mercer smoothing)
126           Number of slashes in URL
127           Length of URL
128           Inlink number
129           Outlink number
130           PageRank
131           SiteRank (site-level PageRank)
132           QualityScore (the quality score of a web page, output by a web page quality classifier)
133           QualityScore2 (the quality score of a web page, output by a web page quality classifier that measures the badness of a web page)
134           Query-url click count (the click count of a query-url pair at a search engine in a period)
135           Url click count (the click count of a url aggregated from user browsing data in a period)
136           Url dwell time (the average dwell time of a url aggregated from user browsing data in a period)
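As a small illustration of how the 1-based feature ids above map onto a parsed feature vector (Python lists are 0-based, so feature id N sits at position N - 1), the snippet below looks up a few named features. The vector here is a placeholder; in practice it would come from parsing a data row as shown earlier.

==============================================

# Sketch: look up a few named features from a 136-dimensional vector.
# Feature id N (1-based, as in the table above) lives at index N - 1.
FEATURE_IDS = {
    "BM25 (whole document)": 110,
    "PageRank": 130,
    "Query-url click count": 134,
}

def named_features(features):
    return {name: features[fid - 1] for name, fid in FEATURE_IDS.items()}

example = [0.0] * 136  # placeholder; replace with a parsed feature vector
print(named_features(example))

==============================================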

Reference

You can cite these datasets as below.

@article{DBLP:journals/corr/QinL13,
  author    = {Tao Qin and Tie{-}Yan Liu},
  title     = {Introducing {LETOR} 4.0 Datasets},
  journal   = {CoRR},
  volume    = {abs/1306.2597},
  year      = {2013},
  url       = {http://arxiv.org/abs/1306.2597},
  timestamp = {Mon, 01 Jul 2013 20:31:25 +0200},
  biburl    = {http://dblp.uni-trier.de/rec/bib/journals/corr/QinL13},
  bibsource = {dblp computer science bibliography, http://dblp.org}
}

Release Notes