Mind that age! This blog post is 9 years old! Most likely, its content is outdated. Especially if it's technical.

There was a really interesting discussion on the django-users mailing list about the most efficient way to select random elements out of a SQL database. I knew that using a regular RANDOM() in SQL can be very slow on big tables, but I didn't know by how much. I had to run a quick test!
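For reference, Django's `order_by('?')` compiles down to an `ORDER BY RANDOM()` clause. Here is a minimal standalone illustration of that pattern using SQLite; the table name and row count are made up for the demo:

```python
import sqlite3

# In-memory database with a toy table; schema is invented for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [("item-%d" % i,) for i in range(1000)],
)

# ORDER BY RANDOM() forces the database to generate a random number for
# every single row and then sort them all, just to return one row.
row = conn.execute(
    "SELECT id FROM items ORDER BY RANDOM() LIMIT 1"
).fetchone()
print(row[0])
```

That full-table sort is exactly why the approach gets slower as the table grows.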

Cal Leeming discussed a snippet of his for paginating huge tables, which uses the MAX(id) aggregate function.

So, I did a little experiment on a table with 84,000 rows in it. Realistic enough to matter, even though it's less than millions. So, how long would it take to select 10 random items, 10 times? The benchmark code looks like this:

```python
from time import time

TIMES = 10

def using_normal_random(model):
    for i in range(TIMES):
        yield model.objects.all().order_by('?')[0].pk

t0 = time()
for i in range(TIMES):
    list(using_normal_random(SomeLargishModel))
t1 = time()
print t1 - t0, "seconds"
```

Result:

41.8955321312 seconds

Nasty! Also, while running this you'll notice Postgres spiking your CPU like crazy.

A much better approach is to use Python's random.randint(1, <max ID>). It looks like this:

```python
from time import time
from random import randint
from django.db.models import Max

def using_max(model):
    max_ = model.objects.aggregate(Max('id'))['id__max']
    i = 0
    while i < TIMES:
        try:
            yield model.objects.get(pk=randint(1, max_)).pk
            i += 1
        except model.DoesNotExist:
            pass

t0 = time()
for i in range(TIMES):
    list(using_max(SomeLargishModel))
t1 = time()
print t1 - t0, "seconds"
```

Result:

0.63835811615 seconds

Much more pleasant!
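Outside Django, the same trick is just one cheap MAX(id) aggregate followed by fast indexed primary-key lookups. A rough standalone sketch with SQLite (the schema and row count are invented for illustration):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO items (name) VALUES (?)",
    [("item-%d" % i,) for i in range(1000)],
)

def pick_random_ids(conn, times):
    # One cheap aggregate query up front...
    (max_id,) = conn.execute("SELECT MAX(id) FROM items").fetchone()
    picked = 0
    while picked < times:
        # ...then indexed lookups; retry if the random id happens not to exist.
        row = conn.execute(
            "SELECT id FROM items WHERE id = ?",
            (random.randint(1, max_id),),
        ).fetchone()
        if row is not None:
            picked += 1
            yield row[0]

print(list(pick_random_ids(conn, 10)))
```

Each lookup is a primary-key index hit, so the cost stays roughly constant no matter how big the table gets.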

UPDATE

Commentator Ken Swift asked: what if your requirement is to select 100 random items instead of just 10? Won't those 101 database queries be more costly than just 1 query with a RANDOM()? The answer turns out to be no.

I changed the script to select 100 random items 1 time (instead of 10 items 10 times) and the times were the same:

```
using_normal_random() took 41.4467599392 seconds
using_max() took 0.6027739048 seconds
```

And what about 1000 items 1 time:

```
using_normal_random() took 204.685141802 seconds
using_max() took 2.49527382851 seconds
```

UPDATE 2

The algorithm for returning a generator has a few flaws:

- You can't pass in a QuerySet
- You get primary keys back, not ORM instances
- You can't pass in the number of items you want
- Internally, it might randomly pick a number it has already tried

Here's a much more complete function:

```python
import random

from django.db.models import Max, Min

def random_queryset_elements(qs, number):
    assert number <= 10000, 'too large'
    max_pk = qs.aggregate(Max('pk'))['pk__max']
    min_pk = qs.aggregate(Min('pk'))['pk__min']
    ids = set()
    while len(ids) < number:
        next_pk = random.randint(min_pk, max_pk)
        while next_pk in ids:
            next_pk = random.randint(min_pk, max_pk)
        try:
            found = qs.get(pk=next_pk)
            ids.add(found.pk)
            yield found
        except qs.model.DoesNotExist:
            pass
```
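To show the same logic without the ORM, here is a rough port of that function against a plain SQLite table. The table name and setup are invented for the demo, and the ids deliberately contain gaps to exercise the "row missing, try again" branch:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
# Skip every 7th id so the pk sequence has holes, like a table with deletions.
conn.executemany(
    "INSERT INTO items (id, name) VALUES (?, ?)",
    [(i, "item-%d" % i) for i in range(1, 501) if i % 7 != 0],
)

def random_table_elements(conn, number):
    assert number <= 10000, 'too large'
    min_pk, max_pk = conn.execute(
        "SELECT MIN(id), MAX(id) FROM items"
    ).fetchone()
    ids = set()
    while len(ids) < number:
        next_pk = random.randint(min_pk, max_pk)
        if next_pk in ids:
            continue  # already yielded this one, pick again
        row = conn.execute(
            "SELECT id, name FROM items WHERE id = ?", (next_pk,)
        ).fetchone()
        if row is not None:
            ids.add(row[0])
            yield row
        # else: the pk falls in a gap, so loop and try a new random id

rows = list(random_table_elements(conn, 20))
print(len(rows))
```

The set of seen ids guarantees uniqueness, and the retry loop absorbs gaps left by deleted rows, at the cost of extra lookups when the pk range is sparse.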
