Please be more careful when interpreting SO Developer Survey data

These types of surveys are interesting and useful, but each year I find myself pulling my hair out at poor analyses by the press and internal analysts. As an example:

The analysis of the Evaluating Competence question:

We asked respondents to evaluate their own competence, for the specific work they do and years of experience they have, and almost 70% of respondents say they are above average while less than 10% think they are below average. This is statistically unlikely with a sample of over 70,000 developers who answered this question, to put it mildly.

is seriously flawed, and represents a misunderstanding of what "statistically likely" means.

First of all, there are no inferential statistics computed here, only summary statistics. Implicit in this analysis is a comparison between the distribution of competence in the population and a distribution of competence in the sample. See below for a brief discussion of the implied comparison. You cannot say whether the difference between your sample distribution and the population is "statistically likely" or not without inferential statistics.

If you did run an analysis using inferential statistics, you could make a statement about how likely it is that a random sample of the population would produce a distribution with the characteristics this sample has. Even then, you could not conclude whether respondents are biased in evaluating their own competence or whether your sample is biased. Given your methodology, we must assume a biased sample. Inferential statistics on SO survey data have minimal value in this context (comparing sample distributions to the population of developers) because respondents were not selected by random sampling.
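For what it's worth, here is a minimal sketch of the kind of inferential statistic that could back a phrase like "statistically unlikely": a one-sample binomial test of the observed proportion (roughly 70% of ~70,000 respondents rating themselves above average, per the quoted paragraph) against the 50% expected under the implied comparison. The exact figures are approximations from the quote, and the test is only meaningful under random sampling, which is precisely what this survey lacks.

```python
from scipy.stats import binomtest

# Approximate figures from the quoted analysis: ~70,000 respondents,
# ~70% of whom rated themselves above average.
n = 70_000
k = int(n * 0.70)  # respondents saying "above average"

# Under the implied comparison, a random sample from the population
# should have ~50% above the median ("average") by definition.
result = binomtest(k, n, p=0.5, alternative="two-sided")
print(f"p-value: {result.pvalue:.3g}")  # so small it underflows to 0.0

# The p-value only quantifies sampling variability under *random*
# sampling. With a self-selected sample it cannot distinguish
# "respondents overrate themselves" from "the sample is biased",
# so it licenses neither conclusion.
```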

This is a simple but crucial principle that, apparently, we don't hammer on enough in introductory statistics courses. Everyone seems able to parrot "correlation isn't causation", but an equally important principle gets ignored: you cannot generalize from a non-random sample!

Sample size doesn't save a biased sample:

Consider the case of the 1936 Literary Digest election poll of Landon vs. Roosevelt. A huge sample (2.4 million respondents) was used to generalize to the electorate at large, and it predicted Landon would win with 57% of the vote. In fact, the opposite occurred: Roosevelt won in a landslide, with 61% of the vote and a 24-point margin of victory. A much smaller Gallup sample (about 50,000) used sampling methods that allowed for generalization, and correctly predicted the Roosevelt landslide.

The challenges to generalization and inference here are the same ones the 1936 Literary Digest poll faced -- selection and non-response bias. 70,000 respondents is a lot, but you cannot generalize from a non-random sample, even a big one. Consider that, with over 20 million developers globally, you would need about 1 million respondents to match the proportion of the population of interest that the Literary Digest sample had. And we know how that turned out.
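To make that arithmetic explicit (note: the figure of roughly 45 million votes cast in 1936 is my own assumption for illustration, not from the original discussion):

```python
# Back-of-envelope check of the proportions above. The ~45 million
# figure for votes cast in 1936 is an assumption for illustration.
digest_sample = 2_400_000
electorate_1936 = 45_000_000
digest_fraction = digest_sample / electorate_1936  # ~0.053

developers = 20_000_000
equivalent_sample = developers * digest_fraction
print(f"Digest sampled ~{digest_fraction:.1%} of the electorate")
print(f"Equivalent developer sample: ~{equivalent_sample:,.0f}")
# ~1.1 million -- far more than 70,000, and even that would not fix bias.
```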

A comment on the response by the analyst:

In @JuliaSilge's answer, she says

That paragraph I wrote was intended to be a little light-hearted, but I'm willing to stick by it.

This is disappointing. The reasoning in the answer is mostly about the plausibility of the hypothesis, and whether there is data about any association between the known bias and the variable of interest. While I agree it is plausible, and even likely, that developers overestimate their abilities, that is not at all the point; we could make that argument without the survey. The most basic point here has little to do with the conclusion: the analysis itself contains an error, and is incorrect regardless of whether the conclusion it supports is true. It is an error to use the sample size of a non-random sample to support the underlying comparison with the population of interest. Sample size can decrease random error, but not bias. I would hope Dr. Silge considers carefully why she thinks the sample size of ~70,000 provides additional support for her comparison with the population, and what exactly is "statistically unlikely".
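A quick simulation makes the "sample size can decrease random error, but not bias" point concrete. This is an illustrative sketch with made-up numbers, not survey data: we estimate a population proportion of 0.50 from a self-selected sample in which one group is more likely to respond, at increasing sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: exactly 50% are "above average", by the
# median definition. Suppose (a made-up selection effect) that people
# who consider themselves above average are 3x as likely to respond.
P_TRUE = 0.50
RATE_ABOVE, RATE_BELOW = 0.30, 0.10

for n in (100, 10_000, 1_000_000):
    above = rng.random(n) < P_TRUE                 # true status
    rates = np.where(above, RATE_ABOVE, RATE_BELOW)
    responded = rng.random(n) < rates              # self-selection
    estimate = above[responded].mean()
    print(f"n={n:>9,}: estimated share above average = {estimate:.3f}")

# The estimate converges to ~0.75, not 0.50: a larger sample shrinks
# the error bars around the *wrong* value. Bias is untouched by n.
```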

Please note that I'm not coming at this from the perspective that there is nothing useful to be learned here. The SO Developer Survey is a useful undertaking; I would just suggest that more care be taken when interpreting the data. And, in particular, please own your errors when someone points them out.

The implied comparison:

Average competence is the midpoint of the distribution of competence. In this case, "average" is the median of the distribution, and, in any valid measure of competence, the median happens to have the same value as the mean. "Average" is explicitly defined in this literature (see Kruger and Dunning, 1999) to be the 50th percentile, the median.

The analysis of the proportion of respondents who said they were above average (70%) rests on the expectation that, by definition, 50% of the population is above average in competence, and that the sample's distribution of competence should resemble the population's.
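As a small illustration of why the 50% expectation holds by definition (a toy example with simulated scores, not survey data): for a sample of continuous competence scores, half the values fall above the sample median, whatever the shape of the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Competence measured on any continuous scale, under three very
# different hypothetical distributions:
samples = {
    "normal":    rng.normal(100, 15, 70_000),
    "lognormal": rng.lognormal(0, 1, 70_000),
    "uniform":   rng.uniform(0, 10, 70_000),
}

for name, scores in samples.items():
    frac_above = (scores > np.median(scores)).mean()
    print(f"{name:>9}: {frac_above:.1%} above the median")

# Each prints ~50.0%: "above average" in the median sense is 50% by
# construction, which is what makes the reported 70% figure notable.
```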