An urn has 2 red balls, 2 blue balls and 2 yellow balls. A ball is drawn out of the urn at random and then put back in, and another ball of the same color is then added to the urn. This process is repeated arbitrarily many times. It is then noted that there are more blue balls in the urn than there are balls of any other color. What is the expected proportion of balls in the urn that are blue?

This is a problem that I made up, and it is based upon research that I did for my EE (Extended Essay, a requirement for IB students). In showing the solution for it I will also prove several generalizations, bring up other interesting problems and give other possible interpretations of the math being discussed.

This problem introduces a Markov chain, a randomized sequence of states where the probability of moving to each next state is determined by the current state. In this problem each state is an ordered triple of numbers representing the number of red, yellow and blue balls in the urn. Each move increases exactly one of those numbers by 1, and the probability of each number being increased is proportional to its current size. That is, from the state (a,b,c) we move to (a+1,b,c) with probability a/(a+b+c), we move to (a,b+1,c) with probability b/(a+b+c), and we move to (a,b,c+1) with probability c/(a+b+c).
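The transition rule can be tried out directly. The following is a minimal simulation sketch, not part of the original derivation; the function names `step` and `run` and the fixed seed are my own choices for illustration.

```python
import random

def step(state, rng):
    """Add one ball, choosing the color with probability proportional to its count."""
    total = sum(state)
    r = rng.randrange(total)  # index of the drawn ball, uniform over all balls
    new_state = list(state)
    for i, count in enumerate(state):
        if r < count:          # the drawn ball has color i
            new_state[i] += 1
            break
        r -= count
    return tuple(new_state)

def run(state, steps, seed=0):
    """Apply `steps` moves of the chain starting from `state` (hypothetical helper)."""
    rng = random.Random(seed)
    for _ in range(steps):
        state = step(state, rng)
    return state

final = run((2, 2, 2), 1000)
print(final, [c / sum(final) for c in final])
```

Running it with different seeds gives noticeably different final proportions, which is the behavior the rest of the argument describes with a limiting distribution.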

This would be easier to analyze if there were only two possibilities for each step, so we will consider the case where there are only two colors of balls and use our results to extrapolate to more colors. Suppose that we start from the position (a,b), a red balls and b blue balls, and we wish to calculate the probability that a proportion p of the next x steps add red balls. Consider any sequence of x reds and blues with px of them being red. The probability of this exact sequence occurring for the next x steps is the product of the probabilities of each step occurring. Each of these probabilities can be written as a fraction whose denominator is the total number of balls in the urn. As we go along the sequence these denominators range from a+b to a+b+x-1, so the product of all of them is (a+b+x-1)!/(a+b-1)!. The numerators for the points in the sequence which indicate a red draw are the number of red balls in the urn at the time of the draw, so they range from a to a+px-1 and multiply to (a+px-1)!/(a-1)!. Similarly, the product of the numerators for the blue draws is (b+(1-p)x-1)!/(b-1)!. Overall, the probability of such a sequence is (a+b-1)!(a+px-1)!(b+(1-p)x-1)!/[(a-1)!(b-1)!(a+b+x-1)!]. If we multiply this number by the number of possible sequences of length x that contain px reds and (1-p)x blues, which is x!/[(px)!((1-p)x)!], we get that the probability that px of the next x draws are red is

$$\frac{x!}{(px)!\,((1-p)x)!}\cdot\frac{(a+b-1)!\,(a+px-1)!\,(b+(1-p)x-1)!}{(a-1)!\,(b-1)!\,(a+b+x-1)!}$$
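As a sanity check on the counting argument, the probability can be computed exactly with rational arithmetic. This is a sketch; the name `prob_red_draws` and the parameter k (standing for px, the number of red draws) are my own notation.

```python
from fractions import Fraction
from math import comb, factorial

def prob_red_draws(a, b, x, k):
    """Exact probability that k of the next x draws are red, starting from (a, b)."""
    orderings = comb(x, k)  # x! / (k! (x-k)!), the number of sequences with k reds
    numerator = factorial(a + b - 1) * factorial(a + k - 1) * factorial(b + x - k - 1)
    denominator = factorial(a - 1) * factorial(b - 1) * factorial(a + b + x - 1)
    return orderings * Fraction(numerator, denominator)

# The probabilities over all possible k should add up to exactly 1.
print(sum(prob_red_draws(2, 4, 50, k) for k in range(51)))
```

Starting from one ball of each color, every count of red draws comes out equally likely, which is a quick way to see that the proportion does not concentrate on a single value.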

We are interested in the behavior of this proportion as x becomes arbitrarily large. However, it is clear from looking at the above expression that for any p the expression will tend to 0 as x becomes large, so instead we can look at the odds between different proportions. The odds are just the ratio of the two probabilities, so by taking that ratio and cancelling the terms that do not depend on the proportion we find that the odds between the proportions p and q are

$$\frac{(a+px-1)!\,(b+(1-p)x-1)!\,(qx)!\,((1-q)x)!}{(a+qx-1)!\,(b+(1-q)x-1)!\,(px)!\,((1-p)x)!}$$

The expression (a+px-1)!/(px)! represents the product of the a-1 numbers from px+1 to px+a-1. As x (and thus px) becomes large relative to a and b, each of these numbers can be approximated by px, and thus the product can be approximated by (px)^(a-1). Using similar approximations for the other terms, we see that the odds can be approximated by

$$\frac{(px)^{a-1}\,((1-p)x)^{b-1}}{(qx)^{a-1}\,((1-q)x)^{b-1}}.$$

This of course simplifies to

$$\frac{p^{a-1}\,(1-p)^{b-1}}{q^{a-1}\,(1-q)^{b-1}}.$$
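The approximation can be checked numerically by comparing the exact odds for a large x with the limiting ratio p^(a-1)(1-p)^(b-1) / (q^(a-1)(1-q)^(b-1)). The sketch below works with logarithms of factorials via `math.lgamma` to avoid enormous integers; the function names are my own, and the starting position (2, 4) is only an example.

```python
from math import exp, lgamma

def log_prob(a, b, x, k):
    """Log of the probability that k of the next x draws are red, from (a, b).

    Uses lgamma(n + 1) = log(n!) for every factorial in the exact formula.
    """
    return (lgamma(x + 1) - lgamma(k + 1) - lgamma(x - k + 1)
            + lgamma(a + b) + lgamma(a + k) + lgamma(b + x - k)
            - lgamma(a) - lgamma(b) - lgamma(a + b + x))

def exact_odds(a, b, x, p, q):
    """Ratio of the probabilities of proportions p and q after x draws."""
    return exp(log_prob(a, b, x, round(p * x)) - log_prob(a, b, x, round(q * x)))

def limiting_odds(a, b, p, q):
    """Large-x approximation of the same ratio."""
    return (p ** (a - 1) * (1 - p) ** (b - 1)) / (q ** (a - 1) * (1 - q) ** (b - 1))

for x in (100, 10_000, 1_000_000):
    print(x, exact_odds(2, 4, x, 0.25, 0.5), limiting_odds(2, 4, 0.25, 0.5))
```

The printed exact odds should move toward the limiting value as x grows.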

From this expression we know that as the number of draws grows large the proportion of draws that are red approaches a distribution with density

$$f(p) = C\,p^{a-1}\,(1-p)^{b-1}.$$

C must be chosen such that the integral over all possible values, in this case the integral from 0 to 1, is one. It is known that

$$\int_0^1 p^{a-1}\,(1-p)^{b-1}\,dp = \frac{(a-1)!\,(b-1)!}{(a+b-1)!}.$$

Thus C must equal

$$C = \frac{(a+b-1)!}{(a-1)!\,(b-1)!}.$$
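For integer a and b the normalization can be verified exactly: expand (1-p)^(b-1) with the binomial theorem, integrate term by term, and compare with (a-1)!(b-1)!/(a+b-1)!. This is a sketch of that check, with function names of my own choosing.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Exact integral of p^(a-1) (1-p)^(b-1) over [0, 1] for positive integers a, b."""
    # (1-p)^(b-1) = sum_j C(b-1, j) (-p)^j, and the integral of p^(a-1+j) is 1/(a+j).
    return sum(Fraction((-1) ** j * comb(b - 1, j), a + j) for j in range(b))

def normalizing_constant(a, b):
    """C = (a+b-1)! / ((a-1)! (b-1)!)."""
    return Fraction(factorial(a + b - 1), factorial(a - 1) * factorial(b - 1))

print(normalizing_constant(2, 4))                        # constant for 2 balls of one color versus 4 others
print(beta_integral(2, 4) * normalizing_constant(2, 4))  # should be exactly 1
```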

We now have a distribution for 2 colors and we wish to extrapolate to more colors. If we have n colors, call them 1, 2, 3, …, n, we can instead look at only 2 possibilities: 1 and not 1. Thus, given initial conditions, we can calculate a distribution for the proportion of balls that are color 1 after a large number of steps. This distribution is calculated treating all balls of colors other than 1 as the same, so it will be independent of any probabilities that we calculate involving only the differences between those balls. Furthermore, if we are only looking at probabilities involving colors 2 through n, we can calculate them in the same way that we would if there were no balls of color 1. If a ball of color 1 is drawn it doesn’t affect the numbers of balls of any other color, and given that the ball drawn is not of color 1 the probabilities are exactly the same as if there were no color 1. We can then calculate a distribution function for the proportion of draws not of color 1 that are color 2, then a distribution for the proportion of draws not of color 1 or color 2 that are color 3, and so on. We end up with n-1 distribution functions representing independent random variables, and these can be multiplied to get a joint distribution function. We can use one such joint distribution function to answer the original problem.
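The color-by-color construction can be sketched as a sampler: draw the share of color 1 from its two-color distribution, then the share of color 2 among what remains, and so on. The sketch assumes the two-color limit is the distribution with density proportional to p^(a-1)(1-p)^(b-1), which is what Python's `random.betavariate(a, b)` samples from; the function name `sample_limit_proportions` is hypothetical.

```python
import random

def sample_limit_proportions(counts, rng):
    """Sample limiting color proportions from the initial ball counts, one color at a time."""
    remaining = 1.0   # fraction of the urn not yet assigned to a color
    proportions = []
    for i, c in enumerate(counts[:-1]):
        rest = sum(counts[i + 1:])        # all later colors lumped together as "not color i"
        share = rng.betavariate(c, rest)  # share of color i among the colors not yet assigned
        proportions.append(remaining * share)
        remaining *= 1.0 - share
    proportions.append(remaining)         # the last color takes whatever is left
    return proportions

rng = random.Random(0)
print(sample_limit_proportions((2, 2, 2), rng))
```

Each call returns one possible long-run composition of the urn, with the entries adding up to 1.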

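We calculate the relevant joint distribution function to be f(x,y) = 120x(1-x)^3 y(1-y), where x represents the proportion of balls that are blue and y represents the proportion of non-blue balls that are red. Here 20x(1-x)^3 is the two-color distribution for 2 blue balls against 4 non-blue balls, and 6y(1-y) is the two-color distribution for 2 red balls against 2 yellow balls. We are then given the information that there are more blue balls than there are balls of any other color. This means that the region we are interested in is the one where x>(1-x)y and x>(1-x)(1-y). This region is 1/2<x<1 and 0<y<1, OR 1/3<x<1/2 and (1-2x)/(1-x)<y<x/(1-x). The expected value of the proportion is given by the integral of xf(x,y) over the desired region divided by the integral of f(x,y) over the desired region. Thus the answer to the original problem is

$$\frac{\int_{1/2}^{1}\int_{0}^{1} x\,f(x,y)\,dy\,dx + \int_{1/3}^{1/2}\int_{(1-2x)/(1-x)}^{x/(1-x)} x\,f(x,y)\,dy\,dx}{\int_{1/2}^{1}\int_{0}^{1} f(x,y)\,dy\,dx + \int_{1/3}^{1/2}\int_{(1-2x)/(1-x)}^{x/(1-x)} f(x,y)\,dy\,dx}$$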
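The ratio of integrals can be evaluated numerically as a cross-check. The sketch below assumes the joint density 120x(1-x)^3 y(1-y) and the region x>(1-x)y, x>(1-x)(1-y); it does the y integral with the antiderivative 3y^2 - 2y^3 of 6y(1-y) and the x integral with Simpson's rule. All function names are my own. By my own calculation, which is not taken from the original post, the exact value of the ratio is 347/648, roughly 0.5355, and the denominator is 1/3 as symmetry between the three colors requires.

```python
def density_x(x):
    """Density of the blue proportion: 20 x (1 - x)^3."""
    return 20 * x * (1 - x) ** 3

def y_mass(lo, hi):
    """Integral of 6 y (1 - y) from lo to hi, via the antiderivative 3y^2 - 2y^3."""
    def antiderivative(y):
        return 3 * y ** 2 - 2 * y ** 3
    return antiderivative(hi) - antiderivative(lo)

def blue_largest_mass(x):
    """Probability, given blue proportion x, that blue beats both red and yellow."""
    if x >= 0.5:
        return 1.0  # blue holds more than half, so it beats any split of the rest
    return y_mass((1 - 2 * x) / (1 - x), x / (1 - x))

def simpson(f, lo, hi, n=2000):
    """Composite Simpson's rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3

def integrate_region(g):
    """Integrate g(x) times the joint density over the region where blue is largest."""
    def weight(x):
        return g(x) * density_x(x) * blue_largest_mass(x)
    return simpson(weight, 1 / 3, 0.5) + simpson(weight, 0.5, 1.0)

denominator = integrate_region(lambda x: 1.0)  # probability that blue ends up largest
numerator = integrate_region(lambda x: x)
expected_blue = numerator / denominator
print(denominator, expected_blue)
```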

This was evaluated using paper, so if anyone finds the calculation to be in error please comment. We have solved the original problem, and our method also allows us to say many other things about the system described. The problem could also be solved for non-symmetric initial conditions, or other problems could be solved, like “given that after a large number of draws blue is the most common, what is the probability that the first draw was blue?”

Although the system discussed here was interpreted as balls in an urn, it could also be interpreted as a random walk, with the number of balls of each color representing coordinates in different dimensions. This makes non-integer coordinates reasonable.