Someone in a chat I was in was wondering what osu!’s net worth might be, and it got me thinking: how much revenue does a game like osu! make?

osu! has a lot of registered users, but many of those accounts are inactive, and only a small fraction are supporters (the paying members who provide osu!’s revenue). I figured it’d be a neat afternoon project to see just how many osu! accounts were active, and how many were paying members.

Methods

Now, of course, we could just ask the owner to crunch the numbers for us and give us the answer, as they have access to their database, but that’s no fun.

Instead, we’ll take a random sample of their users and use the results of that sample to draw conclusions about osu’s 6 million+ users. (Yay statistics)

Thanks to the wonders of statistics, we can take a relatively small sample of the population and still get pretty decent results. The [margin of error](https://en.wikipedia.org/wiki/Margin_of_error) tells us how wide the interval around our estimate should be for a given confidence level. For ~95% confidence, a conservative (worst-case) margin of error can be calculated as follows:

Margin of error = 1/sqrt(sample_size)

For large samples, over about 5% of the population, we use a modified formula that includes a finite population correction:

Margin of error = 1/sqrt(sample_size) * sqrt((population_size - sample_size)/(population_size - 1))

With a sample size of 100, we have a 10% margin of error. This means that if 50% of the sample has some property, we can be about 95% confident that between 40% and 60% of the whole population has that property.

With a sample size of 10000, we have a 1% margin of error.

1% seemed like a good enough margin of error so we’ll use 10000 samples.
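As a quick check of the numbers above, here is a minimal sketch of both formulas in Python 3. The function name `margin_of_error` and its parameters are my own for illustration, not part of the original script.

```python
import math


def margin_of_error(sample_size, population_size=None):
    """Approximate ~95% margin of error using the 1/sqrt(n) rule of thumb.

    If population_size is given, the finite population correction
    from the second formula above is applied as well.
    """
    moe = 1 / math.sqrt(sample_size)
    if population_size is not None:
        moe *= math.sqrt((population_size - sample_size) / (population_size - 1))
    return moe


for n in (100, 10000):
    print(n, round(margin_of_error(n), 4))
```

For a sample of 10000 out of roughly 7 million accounts the correction barely changes anything, which is why the simpler formula is good enough here.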

Sampling

osu! accounts are each assigned an integer id that counts up starting from 0. The number of accounts is listed on the home page, and every account can be accessed easily at https://osu.ppy.sh/u/<id>, and via a JSON API at https://osu.ppy.sh/api/get_user?u=<id>

A simple way, then, to get a random profile would be to generate a random number and take it modulo the total number of accounts, which gives us good enough randomness. At time of sampling there were 6969675 accounts.

def getRand(): return struct.unpack('I', os.urandom(4))[0] % 6969675
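Taking a 32-bit value modulo 6969675 is very slightly biased toward lower ids, because 2^32 is not an exact multiple of the account count. The effect is far too small to matter at this sample size, but for reference, here is a sketch of an unbiased alternative. It assumes Python 3.6+ (the original snippets are Python 2), and the names `TOTAL_ACCOUNTS` and `random_account_id` are hypothetical.

```python
import secrets

TOTAL_ACCOUNTS = 6969675  # account count at time of sampling, from the text


def random_account_id(total=TOTAL_ACCOUNTS):
    """Return a uniformly distributed id in the range [0, total)."""
    # secrets.randbelow draws from the OS random source without modulo bias.
    return secrets.randbelow(total)


print(random_account_id())
```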

Extracting Data

Next, we needed to get information about the account id that we selected.

The data we want is on the profile page of users, so we scrape it. Regex proved useful to extract what we needed out of this HTML data. Getting the contents is simple enough.

def getProfile(num): return urllib2.urlopen("https://osu.ppy.sh/u/" + str(num)).read()

Account ids that our random number generator produces might no longer exist, so we first need to determine whether the account exists. This is also an interesting piece of information in itself: osu! doesn’t seem to remove accounts, it only bans them, so the percentage of “non-existent” accounts should be close to the percentage of banned osu! users.

Finding out if an account exists is simple enough - the website tells us! A simple regex gets us there:

def isAccount(content): return re.match(".*The user you are looking for was not found!.*", content, flags=re.DOTALL) == None

Once we’re sure an account exists, we’re after the interesting datum: is the account a supporter? Regex proved useful here as well - the profile page includes a little banner with the css class name “profileSupporter” if you are a supporter!

def isSupporter(content): return re.match(".*profileSupporter.*", content, flags=re.DOTALL) != None

Another datum we were interested in was whether the account was active. The last active date is conveniently supplied on the website, so we can use regex to extract it and parse it using python-dateutil. We can then compare that to a predefined cutoff (28 days ago) to decide whether the account is active.

__active_cutoff__ = (datetime.datetime.now(dateutil.tz.tzlocal()) - datetime.timedelta(days=28))

def getLastActive(content): return dateutil.parser.parse(re.match(".*(<div title='Last Active'><i class='icon-signout'></i><div><time class='timeago' datetime='(.*)'>.*</time></div></div>).*", content, flags=re.DOTALL).group(2))

def isActive(date): return date > __active_cutoff__

This, however, didn’t handle every case: some users had no known last-active time and were marked as “dead”. Regex again saves the day! If the account is dead, we don’t try to extract its last active time.

def isDead(content): return re.match(".*<div title='Last Active'><i class='icon-signout'></i><div>dead</div></div>.*", content, flags=re.DOTALL) != None
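Putting the checks above together, here is a rough sketch of how a single profile page could be classified in one pass. It is written for Python 3 and uses only the standard library. The HTML markers are copied from the snippets above, while the names `classify`, `NOT_FOUND`, `DEAD` and `LAST_ACTIVE` are hypothetical. It also assumes the `datetime` attribute holds an ISO 8601 timestamp with a UTC offset, which the original sidesteps by using python-dateutil.

```python
import re
from datetime import datetime, timedelta, timezone

NOT_FOUND = "The user you are looking for was not found!"
DEAD = "<div title='Last Active'><i class='icon-signout'></i><div>dead</div></div>"
LAST_ACTIVE = re.compile(
    r"<div title='Last Active'><i class='icon-signout'></i><div>"
    r"<time class='timeago' datetime='([^']*)'>"
)


def classify(content, cutoff):
    """Return (status, is_supporter) for one profile page's HTML."""
    if NOT_FOUND in content:
        return ("missing", False)
    supporter = "profileSupporter" in content
    if DEAD in content:
        return ("dead", supporter)
    match = LAST_ACTIVE.search(content)
    if match is None:
        return ("unknown", supporter)
    # Assumption: the timestamp is ISO 8601 with an explicit UTC offset.
    last_active = datetime.fromisoformat(match.group(1))
    return ("active" if last_active > cutoff else "inactive", supporter)


# Accounts seen within the last 28 days count as active.
cutoff = datetime.now(timezone.utc) - timedelta(days=28)
```

Plain substring checks are enough for the fixed markers, and `re.search` avoids wrapping the pattern in `.*` on both sides.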

Results

I ran this a few times on a sample-size of 100 just to test and already got some interesting results:

In the first sample, of the 100 accounts queried, 88 existed (weren’t banned or deleted), just 9 were active within the last 28 days, and 0 were supporters. One of the 88 existing accounts also had an invalid date (“dead”)

In the second sample, 89/100 existed, only 4/100 were active, and again none were supporters.

I then was happy with what I had and decided to take a sample of 10000.

It takes about 0.5 seconds for an HTTP request to return, so 10000 samples should take about 5000 seconds, or roughly an hour and a half. So I let it sit. (Fetching all ~7 million users this way would take about 40 days, and we’d probably be blocked from the website!)
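The back-of-the-envelope timing can be reproduced with a few lines of Python 3. The 0.5 second figure is the rough estimate from the text, and the name `crawl_seconds` is made up for this sketch.

```python
SECONDS_PER_REQUEST = 0.5  # rough per-request estimate from the text
TOTAL_ACCOUNTS = 6969675   # account count at time of sampling


def crawl_seconds(num_requests, seconds_per_request=SECONDS_PER_REQUEST):
    """Estimated wall-clock time for sequential requests, in seconds."""
    return num_requests * seconds_per_request


sample_minutes = crawl_seconds(10000) / 60
full_days = crawl_seconds(TOTAL_ACCOUNTS) / 86400
print(f"10000 samples: ~{sample_minutes:.0f} minutes")
print(f"all accounts:  ~{full_days:.0f} days")
```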

~50 minutes later we got our results:

81.09% of accounts existed

10.10% of accounts were active within the last month

.70% of accounts had an invalid last-active date (“dead”)

.23% of accounts were active supporters.

So we’d estimate that ~700,000 osu! users have played within the last 28 days, and that there are somewhere in the neighborhood of 1,300,000 disabled/deleted accounts (banned or otherwise).

If we take into account our estimated error, 0-1.23% of all osu users are supporters.

A supporter tag costs 4USD/mo if you pay for it monthly, down to 2.16USD/mo if you pay for it yearly. We’ll estimate it at 3USD/mo.

With 0.23% of 6969675 accounts paying 3 USD/mo, that would be an estimated revenue of 48090.76 USD/mo, or 577089.09 USD/year. At the high end of our estimate (1.23%), yearly revenue would be 3086172.09 USD/year. At the low end (0%), our estimated yearly revenue would be 0 USD/year.
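For anyone who wants to redo the arithmetic with different assumptions, here is a small Python 3 sketch of the revenue estimate. The 3 USD/mo price and the ±1% margin come from the text above, and the names (`yearly_revenue` and the constants) are my own.

```python
TOTAL_ACCOUNTS = 6969675   # account count at time of sampling
PRICE_PER_MONTH = 3.0      # assumed average supporter price in USD


def yearly_revenue(supporter_fraction, accounts=TOTAL_ACCOUNTS,
                   price_per_month=PRICE_PER_MONTH):
    """Estimated yearly supporter revenue in USD."""
    return supporter_fraction * accounts * price_per_month * 12


point = 0.0023             # 0.23% of sampled accounts were supporters
margin = 0.01              # 1% margin of error for 10000 samples
low = max(point - margin, 0.0)   # a proportion can't go below zero
high = point + margin

for label, fraction in (("low", low), ("point", point), ("high", high)):
    print(f"{label:>5}: {yearly_revenue(fraction):,.2f} USD/year")
```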

Clearly, since the proportion of supporters is so small, our sample size is not nearly large enough to give a confident result with this margin of error. In fact, if we wanted to be even marginally confident in this result, say within a 0.1% margin of error, we’d need a sample size of 874525 accounts, which is about 12.5% of users. It’s unfortunate that we can’t easily get a precise number here, but it’s still interesting to note just how few players pay for free-to-play games such as osu!
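The 874525 figure follows from solving the finite-population formula from the Methods section for the sample size: with margin e and population N, n = N / (1 + e^2 * (N - 1)). A minimal Python 3 sketch, with a hypothetical function name:

```python
import math


def required_sample_size(margin, population_size):
    """Smallest sample size whose margin of error is at most `margin`.

    Solves margin = 1/sqrt(n) * sqrt((N - n) / (N - 1)) for n.
    """
    n = population_size / (1 + margin ** 2 * (population_size - 1))
    return math.ceil(n)


needed = required_sample_size(0.001, 6969675)
print(needed, f"({needed / 6969675:.1%} of all accounts)")
```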

Neat graphs

We also grabbed the last time players played, and we see an exponential drop-off in last active date for these users. I’m curious what that spike near the end of 2013 is. It’d be neat to compare this to sign-up dates as well. If there was also a spike of users signing up near the end of 2013, that’d be a possible explanation - with more users trying the game, there would also be more users trying it once and dropping it.

When graphing last_active time for the supporters, I noticed that one was very inactive - they hadn’t played since 2013! It turned out our sample included a hall of famer with lifetime supporter status.

Except for our hall of famer, all of our supporters had played within the last few weeks, and most had played within the last day, which makes sense considering the subscription nature of supporter.

This was a fun little time waster.