An example from Seth Stephens-Davidowitz illustates the problem. He looked at the gender of Katy Perry Facebook fans and found that they were overwhelmingly female. However, Spotify listening data revealed the gender split was much more balanced: Perry was in the top ten artists for both genders. If the music label used the Facebook data to target their advertising they’d be way out.

Does that mean the new data streams are junk and best ignored?

Not at all. Observed data is an improvement on claimed data, but it’s still flawed. To understand customers we need a balanced approach, using multiple techniques. If each technique tells us the same story then we can give it greater credence. If they jar then we need to generate a hypothesis to explain the contradiction.

Let’s go back to the Katy Perry example. A simple explanation would be that while both genders enjoy listening to her, far more women are comfortable expressing that publicly. If a record label wants to sell Katy Perry songs or encourage streaming, then Spotify data would be ideal. However, if they want to promote her concerts, it would be better to use the Facebook numbers. Neither data set is right in any absolutist sense – they are right in certain circumstances.