I've been on Stack Overflow since 2012 and have seen a fairly steady decline in the quality of questions asked in the php tag over the years. At the same time, I've noticed a significant increase in the sheer volume of these questions in the last couple of years.

A lot of the very low-quality questions stem from everyday problems: syntax/parse errors, "why isn't this code working" PEBKACs, and other poorly researched questions that could have been answered by a quick Google search, by spending a little more time playing with the code, or by searching Stack Overflow for similar questions that already have useful answers.

Sometimes even closing a question as a duplicate leads the OP to re-ask the same question with "this did not answer my question" in the title/body, when the duplicate clearly did. It seems that if the answer is not specifically tailored to their use case/code, a general answer is unacceptable from their point of view.

I developed a theory that the majority of these questions come from students or hobbyists who have been misled into believing that Stack Overflow is a general help forum, where anyone can ask for and get some quick help with their code.

So I decided to head over to SEDE to see if there was any data to support my theory. Sure enough, I dug up some interesting data, but I'm still not quite sure what to make of it.

The Hourly Trends

Looking at a histogram of when php questions are asked, it seems that historically, the peak hours are usually between 8 AM and 4 PM UTC.

So far it seems that, generally, daylight hours bring in the most php questions. That might make sense if we go by the theory that the bulk of these questions come from students spending their days looking for help with their projects/homework.

The Daily Trends

Looking at a histogram of php questions by day of the week supports this notion even further. There are almost twice as many questions asked during weekdays as there are during weekends.
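For anyone who wants to reproduce the weekday/weekend split from an exported list of question timestamps, here is a minimal sketch. The `creationDates` sample and the `countByUtcWeekday` helper are hypothetical names for illustration, not the SEDE query behind the chart, and the sketch assumes the timestamps are ISO 8601 strings in UTC.

```javascript
// Hypothetical sample of question creation timestamps (ISO 8601, UTC).
const creationDates = [
  "2015-06-01T09:15:00Z", // Monday
  "2015-06-02T13:40:00Z", // Tuesday
  "2015-06-05T15:05:00Z", // Friday
  "2015-06-06T11:30:00Z", // Saturday
  "2015-06-07T20:45:00Z", // Sunday
];

// Count questions per UTC day of the week.
// Index 0 is Sunday and index 6 is Saturday, as returned by getUTCDay().
function countByUtcWeekday(dates) {
  const counts = new Array(7).fill(0);
  for (const iso of dates) {
    counts[new Date(iso).getUTCDay()] += 1;
  }
  return counts;
}

const byDay = countByUtcWeekday(creationDates);
const weekend = byDay[0] + byDay[6];
const weekday = byDay.slice(1, 6).reduce((a, b) => a + b, 0);

console.log(byDay, { weekday, weekend });
```

Note that there are five weekdays and only two weekend days, so comparing raw totals overstates the gap. Dividing each total by its number of days (5 and 2) gives a per-day figure that is fairer to compare.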

The Monthly Trends

Looking at a histogram of php questions by month further supports this theory as you can spot a significant ~20% drop, historically, during the start of the school year, and 5-10% upticks during holidays and summer months.

The Yearly Trends

Looking at the data by year, you can clearly see that question scores are trending downwards, while the number of questions is trending upwards very quickly.

In 2011, the number of questions tagged php nearly doubled compared to 2010. By 2012 we can see that aggregate question scores start to suffer and trend downwards as the number of posted questions rises.

Even though php is the 4th most popular tag on Stack Overflow, it's oddly only the 7th highest scoring tag out of the top 10, lagging way behind less popular tags like python, android, and c++. This leads me to believe that the low scores the php tag suffers from must be the result of a lack of precedence.

Most people tag their questions as php just because PHP is involved in virtually every part of their web development stack. So while the question might really be about JavaScript, Apache httpd, or even just HTML/CSS, the fact that PHP is somehow involved means the question likely gets tagged php as well. As a result, php-tagged questions take a share of the blame for problems that aren't really about PHP.

If it's not the result of masses being misled to believe that they can turn to Stack Overflow whenever they run into a problem with their PHP code, then what is it? What can be done to improve the overall quality of the PHP tag score on Stack Overflow?

Should I even care? Is this really a sign of a bad trend in the works that's detrimental to the PHP ecosystem on Stack Overflow or just typical behavior that's to be expected as the site grows?

Update

I'm including the monthly standard deviations scatter chart based on further discussion in the comments to see if there is any more meaningful data there that contradicts or strengthens my theory. It's come to my attention that the histogram of question distribution by month may not be as meaningful as it is presented to be.

So this scatter only includes data for complete years (2009 - 2015).

Showing my math

Here's a gist with the CSV dump of the data in the chart.

Here's a gist with the CSV dump of the aggregate monthly question data.

A standard deviation is calculated as the square root of the variance. The variance is the average of the squared differences from the mean. The mean is the sum of all members in the set divided by the number of members in the set.

So, for example, the year 2009 has 12 months, and the total number of questions asked is 20548. This comes from s = [645, 775, 909, 976, 1189, 1586, 2042, 2205, 2279, 2385, 2714, 2843]. Thus the mean of the set s is 20548 / 12, which gives us 1712.333.

The variance is then calculated by the following function.

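    function variance(set, mean) {
      var sum = 0;
      // Sum the squared differences from the mean.
      for (var i = 0; i < set.length; i++) {
        sum += (set[i] - mean) ** 2;
      }
      // Population variance: divide by the number of members.
      return sum / set.length;
    }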

So variance(s, 1712.333) gives us 575158.222.

The standard deviation is then the square root of the variance, giving us Math.sqrt(575158.222) ≈ 758.391.
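As a check on the arithmetic, here is a minimal, self-contained sketch that computes the mean, variance, and standard deviation for the 2009 counts in one go. The helper names (`mean`, `populationVariance`, `populationStdDev`) are my own for illustration. It uses the population form of the variance (dividing by N rather than N - 1), which is what the numbers above correspond to.

```javascript
// Monthly php question counts for 2009, as listed above.
const s = [645, 775, 909, 976, 1189, 1586, 2042, 2205, 2279, 2385, 2714, 2843];

// Arithmetic mean: sum of the members divided by the number of members.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population variance: average of the squared differences from the mean.
function populationVariance(values) {
  const m = mean(values);
  return values.reduce((sum, v) => sum + (v - m) ** 2, 0) / values.length;
}

// Population standard deviation: square root of the variance.
function populationStdDev(values) {
  return Math.sqrt(populationVariance(values));
}

console.log(mean(s).toFixed(3));               // 1712.333
console.log(populationVariance(s).toFixed(3)); // 575158.222
console.log(populationStdDev(s).toFixed(2));   // 758.39
```

Computing the mean inside the variance function avoids passing in a rounded value such as 1712.333, although for this data the rounding makes no visible difference.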

So for each month of 2009, if we take the number of questions asked, subtract the number of questions asked in the previous month, and divide the difference by the standard deviation, we can see how many standard deviations the current month is removed from its previous month.

Of course, for the first month we have no previous month so it's (645 - 0) / 758.391 for January of 2009. Then (775 - 645) / 758.391 for February of 2009... so on and so forth.
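The month-over-month calculation described above can be sketched as follows. The function name `monthOverMonthInStdDevs` is hypothetical, and the sketch follows the same convention of treating the month before January as 0.

```javascript
// Monthly php question counts for 2009, as listed above.
const s = [645, 775, 909, 976, 1189, 1586, 2042, 2205, 2279, 2385, 2714, 2843];

// Population standard deviation (divides by N), as in the worked example.
function populationStdDev(values) {
  const m = values.reduce((sum, v) => sum + v, 0) / values.length;
  const variance = values.reduce((sum, v) => sum + (v - m) ** 2, 0) / values.length;
  return Math.sqrt(variance);
}

// For each month, express the change from the previous month in units of
// the year's standard deviation. The first month has no predecessor, so its
// "previous" value is taken to be 0, mirroring the description above.
function monthOverMonthInStdDevs(counts) {
  const sd = populationStdDev(counts);
  return counts.map((count, i) => {
    const previous = i === 0 ? 0 : counts[i - 1];
    return (count - previous) / sd;
  });
}

const changes = monthOverMonthInStdDevs(s);
console.log(changes.map((c) => c.toFixed(3)));
```

Using 0 as the month before January makes the first value look like a large jump that has nothing to do with seasonality, so it may be cleaner to drop that first entry or to use December of the previous year instead. Also note that this measures the size of the change between consecutive months relative to the year's spread, which is not the same thing as how far a given month sits from the year's mean.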

So far this shows that some months tend to deviate more from the previous month than others, but not consistently enough. There's probably some math error in my calculations here that I'm not aware of, so please do feel free to point out where I might have gone wrong.

I'm by no means a data scientist, nor do I have any mathematical skills beyond those of the average Joe. So constructive criticism is both valued and welcomed.