I need to make sure that my XML sitemap contains less than $1\%$ rubbish (broken links). The list of URLs is in the hundreds of thousands, and even if it were feasible to test them all one by one, I'd rather not, for many reasons:

1. Saved bandwidth
2. Faster service for real clients
3. Less noise in visitor statistics (because my tests would count as visits)
4. I could go on...

So I think testing a random subset would be sufficient; the problem is that I don't know much about probability.

Is there a simple function I can use to decide how many links to test?
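
For concreteness, here is the kind of check I have in mind (a minimal Python sketch; the file name `sitemap_urls.txt` and the sample size of 5000 are placeholders, since the sample size is exactly the number I don't know how to choose):

```python
import random
import urllib.request

# Placeholder input file: one URL per line.
with open("sitemap_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

sample_size = 5000  # placeholder: the number I don't know how to pick
sample = random.sample(urls, min(sample_size, len(urls)))

broken = 0
for url in sample:
    try:
        # HEAD request to keep bandwidth low; some servers reject HEAD.
        req = urllib.request.Request(url, method="HEAD")
        urllib.request.urlopen(req, timeout=10)
    except OSError:
        # URLError and HTTPError subclass OSError, so DNS failures, refused
        # connections, timeouts, and 4xx/5xx responses all land here.
        broken += 1

print(f"Broken in sample: {broken}/{len(sample)} = {broken / len(sample):.4%}")
```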

If it helps, we can assume we have a priori information about the probability of a link being broken across runs. Let's say that, across runs, any given link has a $0.75\%$ chance of being broken.
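
From what I have read, the standard sample-size formula for estimating a proportion might be what I need; plugging in $p = 0.0075$ with, say, a $\pm 0.25$ percentage-point margin of error ($e = 0.0025$) at $95\%$ confidence ($z = 1.96$) would give

$$n = \frac{z^2\, p(1-p)}{e^2} = \frac{1.96^2 \times 0.0075 \times 0.9925}{0.0025^2} \approx 4576,$$

but I am not sure this is the right approach, or how the margin $e$ should be chosen in the first place.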