Right now, 1,638 people and/or ponies have answered the fimfiction survey. Thanks very much! I didn't expect to hit 1000 for a long time.

You should’ve gotten a link to the results after taking the survey, but some people were dazed by the time they finished it. The results are here.

The proximate cause of the survey was my application to Princeton, for which I needed a 25-page writing sample by midnight Dec. 15. I had something written about fan-fiction, but I asserted things about ficdom which I didn’t know or couldn’t prove. So I made a survey.

(I plan to break that writing sample up into articles & send it off to journals, so I'm not supposed to post it here first. I'll post some of the highlights.)

Fortunately, literary journals don't have ethics committees, so I shouldn't have to worry about the exact nature of your consent to use your data, which would not hold up if I wanted to use these results in a psychology paper.

Unfortunately, it turns out how people answer depends a lot on exactly how I phrase the question, what answers are possible, and the background of the person reading the question. So much so that I don’t trust survey results anymore. More on that in a later post.

I got a lot of comments about the survey. Some were constructive criticism, including valid points about ambiguous questions. Some were about things I had no control over, like the linear scale running from 1 to 5 instead of 1 to 10, or the lack of more elaborate definitions of the scale endpoints.

But most of the angry complaints were about things that I did deliberately. (The things I screwed up the worst, you mostly couldn't see. I’ll explain those in another post.)

The problem is that people taking a survey want to express themselves as fully and accurately as possible, while the people giving a survey want to extract usable information from their answers that is clear, objective, and has simple statistical properties. These goals often conflict. I only need one to three bits of information from each question, so it's much more important for the meaning and statistical properties of answers to be clear than for their values to be precise or their information content to be large. As a survey giver, I would much rather get one-tenth of a bit of objective information about a binary choice from your answer than two bits of information about your position in a high-dimensional space with subjectively-defined co-ordinates.

Things I Did Deliberately

The most-common kind of complaint was about questions that restricted answers. They gave two choices and not ‘both’; they used single-choice instead of checkboxes; they didn’t include an ‘Other’ option. That was deliberate.

Consider some typical cases:

Are you in the fandom more for…

(a) the show?

(b) the fandom?

A lot of people wanted option (c), ‘both?’ But the point of the question is to provide information. Information is literally a distinction between alternatives. That’s why it says ‘more for,’ not ‘for.’ If you're in it 49% for the show and 51% for the fandom, I want you to enter "the fandom," and I will be able to tell roughly what the average importance of each is, and what the variance of that importance is, from the answers. A ‘both’ answer provides zero information, and prevents me from figuring out the average and variance, because I no longer have data on the entire distribution and don't know how much of it I'm missing.

Information is not measured by the number of answers. It is measured in bits, and bits are measured by the probability of giving one choice or the other on a two-choice question. If I’d provided a 'both', most people would have chosen it, even people in it 60% for the fandom and 40% for the show. The people who didn’t choose it, but gave a useful answer, would be questionably representative of fandom. Providing ‘both’ as an answer just bleeds the data away. By restricting it to 2 options, I got a useful and interesting answer: 73% more for the fandom, 27% more for the show.
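
To put a number on that--a sketch, using the standard Shannon entropy formula on the observed split:

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The observed two-choice split: 73% fandom, 27% show.
print(round(entropy_bits([0.73, 0.27]), 3))  # about 0.84 bits per answer

# A 50/50 split would carry the maximum of 1 bit per answer.
print(entropy_bits([0.5, 0.5]))
```

Each forced answer carries a well-defined fraction of a bit; a 'both' answer carries none, because it collapses the distinction the question exists to measure.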

"But that doesn't really mean that 73% of people are in it for the fandom, because--" No. It doesn't. It means what it says, not what it doesn't say. We wouldn't know what the results meant if there were a "both" option, because that would mean "the importance of these things is within some fraction N of each other," where N is different for each person.

“But you’re not capturing how important it is that these things are really close for me, because--” no; I am capturing the fraction of a bit of information in your slight preference about the one distinction I am gathering information on. When I add all these fractions of bits up, I get lots of information. If I include the “both” answer, I don’t get that information. I instead get additional information about a distinction between one of the original choices and not making a distinction. That's too meta, folks. That's not the information I'm looking for.

If you “make a mistake,” a proportional number of people on the other side of the issue made the same mistake, and it will cancel out. If there is no real preference in the population, the answer will come out 50/50.

And what if I want to do a binomial test on the results? Oh, shit, that "both" answer just threw out an unknown chunk from the middle of the distribution.
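
For what it's worth, here's what that test looks like--a sketch with hypothetical counts (73 of 100 answers on one side), built from the exact binomial probabilities rather than any stats library:

```python
from math import comb

def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial test p-value: sum the probabilities of
    every outcome no more likely than the observed one."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    obs = pmf[k]
    return sum(x for x in pmf if x <= obs + 1e-12)

# E.g. 73 of 100 answers on one side of a two-choice question:
print(binom_two_sided(73, 100))  # tiny p-value: a real preference
```

The test only works if every respondent lands in one of the two bins; a 'both' option removes an unknown slice from the middle and the null distribution no longer applies.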

Putting a 1 to 5 scale there instead of two choices would still leave the interpretation of how wide the middle bin should be up to the survey-taker. A 1 to 4 scale, maybe.

This one kind of complaint accounts for most of the complaints. Look, it's a complicated subject, and I could be wrong. But I know a lot about extracting information from weak data. It’s math. If I ask, "Which has higher entropy, a normal distribution or a flat distribution?" and you hesitate before answering, you aren't in a position to tell me I'm doing it wrong.
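
If you hesitated: over a fixed set of outcomes, the flat distribution has the maximum entropy. A sketch, comparing the flat distribution on five answer bins against a hypothetical bell-shaped answer distribution on the same bins:

```python
import math

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 5  # five answer bins, like a 1-to-5 scale
flat = [1 / n] * n
# A hypothetical bell-shaped response distribution over the same bins:
bell = [0.05, 0.20, 0.50, 0.20, 0.05]

print(entropy_bits(flat))  # log2(5), about 2.32 bits: the maximum
print(entropy_bits(bell))  # strictly less
```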

The intensity of feeling and conviction some people have for their mathematically wrong beliefs about this answer is, I think, indicative of a split between the sciences and the humanities, which I hope to discuss at length in a later post. Basically, many people in the humanities think in whole numbers, especially those in art and philosophy. This is because they're rationalists. The word "rationality" is derived from "ratio", because the Greeks were also rationalists and thought in whole numbers. Rationalism and Platonism are closely intertwined; both have as central beliefs that the integers "exist" in some transcendent reality. Ratios also exist; they are relations between integers. Fractions do not "exist." No one even invented a way of writing down fractions until about 1600--by "fraction" I mean a finite decimal expansion, though strictly the word is a synonym for ratio; English still doesn't have a word for a finite approximation of a ratio. That's why people before then never developed science, but only logic. Most people in the humanities today, and all art theorists, are still under the illusion that logic and rationality are similar to science and reason, because they haven't checked with the scientists in the past 300 years, and that is why we have modernism today. Because artists and philosophers mistake the fanatical dogmatism and excluded middle of rationalists for science.

But I digress.

Americans only: In 2016, you voted for

o Trump

o Clinton

o Johnson

o Stein

o Castle

o did not vote

o Other

I got a lot of complaints for having this question at all. I didn't want to know people's political views. I wanted to do 2 things:

- I wanted to look for differences between people who read different authors, and between writers and readers. Turned out there were huge differences in voting patterns between people who read different things.

- I wanted to see who would be willing to share their answers on touchy subjects with others. I could've used "Do you masturbate while reading fanfic?" instead, but I might not have gotten enough "yes" answers. Again, there were differences, and not what I expected.

I made some big mistakes here: I didn't include "American but not eligible to vote," I should've made "did not vote" specify "American," and I included "Other."

Some people complained their candidates weren’t listed there, so I added ‘Other’, thinking people would fill in the names of other candidates. Instead, people gave explanations of why they voted for Trump or Clinton, or why they didn’t vote, or that they weren't American, and I got 19% of answers as “Other.” This means I have to copy the data into another spreadsheet to remove the "Other" category, and that I don't know how many answers are under "Other" that should be in some other entry.

I have some really interesting results, like that 76% of people who liked the “literary” authors voted, while only 39% of people who liked fluff romance voted. Or do I? It could be that people who like “literary” authors are more likely to write long explanations of why they didn't vote. I won’t know unless I go through all the “Other” answers one-by-one and file them where they should go.

I have maybe 10,000 ‘Other’ answers in the spreadsheet. Except for the ones where "Other" is exploratory, like "How do you usually choose stories?", I’m not going to read them. Unless I'm trying to discover what people do, all that the ‘Other’ entry does is stop people from complaining, at the cost of throwing their answers away, making the pie charts that Google Forms produces useless, and screwing up the other results because I don't know what's misfiled under "Other."

Writers: What inspires you most to write fan-fiction? Pick up to 4.

This question was a disaster. I made some big mistakes here, including letting people pick more than one answer. Some people picked one; some picked 4; some (surprise!) picked all 11. The top answer was “adding details, background, or continuations to canon stories,” at a whopping 49%. But wait--if I just count people who chose that as one of 1 or 2 answers, it’s only 43 out of 836. If I count just people who chose “other things outside MLP” as one of 1 or 2 answers, which got 40% of the multiple-choice checkbox answers, I get 60--increasing its relative share of answers by a factor of 1.7. It turns out, looking at that and another question, that the people who were most-interested in the canon also checked lots of boxes.
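
To see the box-checker effect in miniature--a sketch with made-up responses, not my actual data:

```python
# Hypothetical checkbox responses: each respondent's set of chosen options.
responses = [
    {"canon details"},                        # single-pickers
    {"outside MLP"},
    {"outside MLP", "canon details"},
    {"canon details", "plot holes", "characters", "worldbuilding"},  # box-checkers
    {"canon details", "plot holes", "characters", "worldbuilding", "outside MLP"},
]

def share(option, rs):
    """Fraction of respondents who checked the given option."""
    return sum(option in r for r in rs) / len(rs)

focused = [r for r in responses if len(r) <= 2]  # picked only 1 or 2 boxes

print(share("canon details", responses))  # inflated by box-checkers
print(share("canon details", focused))    # lower among focused respondents
```

Box-checkers inflate every option they touch, so the raw percentages overweight whatever the check-everything crowd happens to like.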

What if I want to include this question in a multiple regression? I can’t, because the answers aren’t numerically comparable to each other.

The most-disturbing thing about this question is how much it contradicted the results on a similar question on Afalstein’s 2014 survey. For example, “filling in plot holes” got a mere 0.3% there, versus “filling in gaps in the canon”, which got 35% here--but only 16 people chose it as one of up to 2 choices, and only 2 chose it as their only answer.

The moral of all these cases--and this is repeated throughout the results--is to avoid at all costs answers of ‘both’, fill-in-the-blank, or checkboxes that allow multiple choices, because they wreck the data and make analysis much more time-consuming.

Animal rights? (from 1 to 5)

1. It's okay to kill animals for fun.

5. Medical experiments on animals should be outlawed.

From reddit:

When it comes to animal rights, "it's OK to kill animals for fun" is not a neutral way to refer to game hunting. Nor is "medical experiments on animals should be outlawed" a fair measurement (or representation) of people who care about animal rights.

I’m not speaking for the NRA or for PETA. I’m representing extremist positions. I deliberately chose the words "it's OK to kill animals for fun" to be more extremist than “it’s okay to hunt for sport.”

I took a lot of flak on reddit for being “biased”, by which they meant they didn’t like the descriptions I used on the 1 to 5 questions. One guy wanted me to write out long, detailed descriptions of each end of the scale, which wouldn’t have fit and would have been more subjective and confusing. Another wanted me not to have a scale at all, but to have a wide variety of choices to represent the complexity of real life. Some, I suppose, wanted me to use the usual supposedly objective descriptions like “extremely conservative … extremely liberal”.

When I put choices on a scale of 1-5, that reduces the data to one dimension, but it gives you points along that one dimension. The principal component of any complex phenomenon usually captures at least 50% of the variance. So you’ve still got at least 50% of the data, and you're not throwing away the numeric component of it. You've probably saved more than you've thrown away, and it’s in a useful form.
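
A sketch of what I mean, with simulated (not real) answers: when several 1-to-5 items all load on one underlying attitude, the first principal component soaks up most of the variance:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated survey answers: 3 correlated items per respondent,
# each driven by one underlying attitude plus noise.
latent = rng.normal(size=(200, 1))            # one underlying attitude
noise = rng.normal(scale=0.6, size=(200, 3))
X = latent @ np.ones((1, 3)) + noise          # each item loads on the attitude

X = X - X.mean(axis=0)                        # center before PCA
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)               # variance explained per component
print(explained[0])  # first component captures most of the variance
```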

(This is part of why some questions were America-centric. If I tried to incorporate political views from, say, Europe, their principal component would be different, and it might wreck what data I had.)

Supposing that instead of using a scale along one dimension, you came up with a list of 8 choices on economic policy that didn't ignore the many different ways people’s opinions vary in real life--what would you do with the results? How would you include them in your multiple regression? You’d have to include each answer as a separate binary variable, and run a logistic regression instead of a linear regression. The weakness of the data and the crappiness of using logistic regression on 8 different variables instead of linear regression on 1 would throw out more information than you saved by having all those answers. And you couldn't get linear regression answers, which are usually simpler to understand.
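
To make that concrete--a sketch with hypothetical choice labels: the 8-choice question forces eight binary regressors, while the scale question yields a single numeric one:

```python
# Hypothetical: one respondent's answer to an 8-choice question vs a 1-5 scale.
choices = ["A", "B", "C", "D", "E", "F", "G", "H"]

def one_hot(answer):
    """Eight binary regressors, one per choice: 8 parameters to fit,
    and only a logistic-style model can use them sensibly."""
    return [1 if c == answer else 0 for c in choices]

def scale(answer):
    """One numeric regressor from a 1-to-5 item: a single parameter."""
    return [int(answer)]

print(one_hot("C"))  # [0, 0, 1, 0, 0, 0, 0, 0]
print(scale("4"))    # [4]
```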

The generic “extremely conservative … extremely liberal” is worst of all. Who’s going to say on a survey that they’re extremists? I doubt even people in ISIS call themselves extremists.

I wanted to contrast answers between people in different social groups. But if there is a difference, that difference erases itself in that highly subjective scale. I live in a town that votes extremely conservative, but they don’t know that. They think they vote normal. Most of them would say someone who voted for Clinton was “extremely liberal,” while somebody in Portland would call that same person “moderate.” People here would consider “it’s okay to kill animals for sport” the moderate position and “it’s okay to kill animals for the meat” a liberal position.

Saying that you don't like how I described the endpoints is just partisan bickering. The important point is that they're clear and will be interpreted pretty much the same way in different social groups.

Next post: Things I Screwed Up