Descriptive statistics on IMDB scores for episodes of Star Trek
Benjamin Carlisle

The relative quality of the different Star Trek series is a matter of much debate. Among Star Trek fans, The Original Series, The Next Generation and Deep Space Nine tend to be preferred to Voyager and Enterprise, but it is not clear that these reputations are deserved on the basis of the quality of the programmes. This analysis seeks to quantify the distribution of subjective quality assessments for each series, using a set of episode ratings from IMDB, in order to explain why Voyager and Enterprise are less well-liked and whether that reputation is deserved.

Star Trek episodes from all live-action television programmes were scored by IMDB users. The score for each episode was compiled into a single spreadsheet in order to produce an infographic and kindly posted to Reddit. This analysis does not include The Animated Series or any of the Star Trek movies. Descriptive statistics were calculated using R version 3.2.2. We define \(p < 0.05\) to be statistically significant.

The following figure represents box plots for the IMDB scores of each of the live-action television series of Star Trek, with dashed lines at IMDB scores of 6 and 9 for reference.

The following two tables report all the episodes of Star Trek from any series with an IMDB score less than or equal to 6 or greater than or equal to 9. For all the figures that follow, I have added grey dashed lines at IMDB scores 6 and 9. These are arbitrary cut-offs, provided for reference only, but arguably any episode with a score of 9 or above could be regarded as a “highlight” of that series, whereas any episode with a score of 6 or below could be regarded as “very bad.” Among the 697 episodes of Star Trek, there were only 26 episodes with a rating of 6 or below (3.7%), and there were only 11 episodes with a rating of 9 or above (1.6%).
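As a check on these proportions, the shares can be recomputed from the counts of 26 and 11 episodes out of 697. The following is a minimal Python sketch of that arithmetic; the original analysis used R, and the variable and function names here are illustrative assumptions.

```python
# Counts reported in the text: 697 episodes in total,
# 26 with an IMDB score <= 6 and 11 with an IMDB score >= 9.
total_episodes = 697
low_rated = 26
high_rated = 11


def share_percent(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


low_share = share_percent(low_rated, total_episodes)
high_share = share_percent(high_rated, total_episodes)

print(low_share)   # 3.7
print(high_share)  # 1.6
```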

The number of episodes (sample size), median, mean, standard deviation, and range for each series of Star Trek has been tabulated below. The final column, “Series finale,” indicates the IMDB score given to the final episode broadcast in that series.
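For readers who want to reproduce a table of this kind, here is a minimal Python sketch of the per-series summary. The original analysis used R, so this is not the code that produced the table. The scores and series names below are hypothetical placeholders, not the real IMDB data, and the sketch assumes each list is in broadcast order so that the last entry is the series finale.

```python
import statistics

# Hypothetical scores for illustration only; these are NOT the real IMDB data.
scores_by_series = {
    "Series A": [7.1, 8.0, 6.5, 9.0, 7.4],
    "Series B": [6.8, 7.2, 7.0, 7.5, 5.4],
}


def describe(scores):
    """Summarise a list of episode scores given in broadcast order."""
    return {
        "n": len(scores),
        "median": statistics.median(scores),
        "mean": round(statistics.mean(scores), 2),
        "sd": round(statistics.stdev(scores), 2),  # sample standard deviation
        "range": (min(scores), max(scores)),
        "finale": scores[-1],  # assumes the last entry is the series finale
    }


summary = {name: describe(scores) for name, scores in scores_by_series.items()}
for name, row in summary.items():
    print(name, row)
```

The sample standard deviation (dividing by n − 1) is used here because that is what R's `sd()` computes.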

To test the hypothesis that there is a significant increase in quality between seasons 1 and 2 of Deep Space Nine, we performed a one-tailed \(t\)-test. The difference in mean IMDB scores is 0.37, which is statistically significant (\(p = 0.043\)).

To test the hypothesis that there is a significant increase in quality between seasons 2 and 3 of The Next Generation, we performed a one-tailed \(t\)-test. The difference in mean IMDB scores is 0.69, which is statistically significant (\(p = 0.017\)).
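The text does not say which variant of the \(t\)-test was used. As a sketch only, the following Python code shows one way a one-tailed, two-sample comparison of this kind could be computed, assuming Welch's unequal-variance test (the default in R's `t.test`). The season scores are hypothetical placeholders rather than the real IMDB data, and the function names are my own. The \(p\)-value is obtained by numerically integrating the density of Student's \(t\) distribution, so that only the standard library is needed.

```python
import math
import statistics


def welch_t(earlier, later):
    """Return (t, df) for the difference mean(later) - mean(earlier)."""
    var_a = statistics.variance(earlier) / len(earlier)
    var_b = statistics.variance(later) / len(later)
    diff = statistics.mean(later) - statistics.mean(earlier)
    t = diff / math.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(earlier) - 1) + var_b ** 2 / (len(later) - 1)
    )
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def upper_tail_p(t, df, steps=2000):
    """One-tailed p-value P(T > t), using Simpson's rule on [0, |t|]."""
    h = abs(t) / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # integral of the density from 0 to |t|
    return 0.5 - area if t >= 0 else 0.5 + area


# Hypothetical episode scores for two consecutive seasons (NOT real data).
season_a = [6.5, 7.0, 6.8, 7.2, 6.9, 7.1]
season_b = [7.3, 7.6, 7.0, 7.8, 7.4, 7.5]

t_stat, df = welch_t(season_a, season_b)
p_value = upper_tail_p(t_stat, df)
print(f"t = {t_stat:.2f}, df = {df:.1f}, one-tailed p = {p_value:.4f}")
```

The test is one-tailed because the hypothesis is directional: the alternative is that the later season scores higher, so only the upper tail of the distribution counts as evidence.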

To indicate the change in series quality from season to season, the following five figures give boxplots representing the IMDB scores for each televised Star Trek series, stratified by season. For ease of comparison, the scale of the y-axis is equal among all five figures (ranging from 3 to 10) and dashed grey lines have been added at IMDB scores 6 and 9.
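The numbers behind a per-season boxplot can be sketched as a five-number summary for each season. The Python example below is an illustration under stated assumptions, not the code used for the figures: the (season, score) pairs are hypothetical, and the quartiles use the `inclusive` method of `statistics.quantiles`, which may differ slightly from the hinges that R's `boxplot` draws.

```python
import statistics

# Hypothetical (season, score) pairs for one series; NOT the real IMDB data.
episodes = [
    (1, 6.2), (1, 6.8), (1, 7.0), (1, 7.4), (1, 5.9),
    (2, 7.1), (2, 7.6), (2, 8.0), (2, 8.3), (2, 6.9),
]


def five_number_summary(scores):
    """Return the minimum, quartiles and maximum of a list of scores."""
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(scores),
    }


# Group the scores by season, then summarise each group.
by_season = {}
for season, score in episodes:
    by_season.setdefault(season, []).append(score)

summaries = {
    season: five_number_summary(scores)
    for season, scores in sorted(by_season.items())
}
for season, row in summaries.items():
    print(season, row)
```

Comparing the medians and quartiles across seasons gives the same season-to-season picture as the boxplots, without the plotting step.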

Discussion

This analysis is limited in that it is a statistical analysis of independent subjective evaluations of individual episodes. This means there are two sources of potential bias: self-selection bias, and standalone episode bias.

Self-selection bias refers to the bias among the users who choose to rate an episode. Namely, only those who feel most strongly about an episode or series will rate it, for better or for worse. What I have called “standalone episode bias” is the fact that this analysis will, for example, reward a series for having episodes that rank highly individually, and may not capture the quality of a series that contains a serialised story-arc in which no individual episode excels by itself. This may artificially reduce the score of Deep Space Nine, for example, which contained very serialised storytelling, and much of the quality of the programme came from well-explored characters, back-stories and careful world-building. Conversely, the fact that there was no overall assessment of serialisation apart from the sum of IMDB scores for individual episodes may have helped Enterprise, which also had serialised story-arcs, but often failed to develop some major parts of the story (e.g. the identity, back-story, motivation or even the name of the recurring villain who was introduced in the very first episode).

The series that was least consistent in IMDB scores was The Next Generation, with a standard deviation of 0.92. Both the highest-rated and the lowest-rated episodes can be found in this series, with scores ranging from 3.3 to 9.3. The quality improves starting in season 3, with a statistically significant increase in quality as compared to season 2 (\(p = 0.017\)). Quality dips slightly in its final season. TNG ties for the highest-rated series finale at 8.5. Despite the fact that it comes from a very bad season of TNG (season 2), the worst episode in Star Trek, “Shades of Gray” (a clip show, rated 3.3), is a clear outlier.

The Original Series has a higher mean and median score than TNG, and its standard deviation was smaller, indicating a tighter distribution around the mean. The Original Series tied with The Next Generation for the episode with the maximum score of 9.3, but each succeeding season scores lower, and its series finale was not rated as favourably.

The best episodes of Deep Space Nine were rated nearly as highly as those of TOS or TNG, with a range of 5.4–9.2. Its mean score is between that of TOS and TNG, but its standard deviation is much lower. This indicates that the quality is, on average, similar to the two preceding programmes, but much more consistent. There is a statistically significant jump in quality between seasons 1 and 2 (\(p = 0.043\)), which it maintains through the remainder of its 7 seasons. DS9 tied with TNG for best series finale.

Voyager’s IMDB scores were, on average, worse than DS9’s, with a lower standard deviation; that is to say, they were worse, and consistently worse. No episode of Voyager was given a score greater than 9; the series reached a maximum of 8.9. The series received mostly middling scores, with no major changes in subjective appraisal of quality between seasons. Voyager’s finale scored better than The Original Series’, and slightly lower than DS9’s and TNG’s.

Enterprise has a higher mean and median than any other series, as well as a smaller standard deviation. As with Voyager, no episode of Enterprise was given a score greater than 9; the series reached a maximum of 8.9. Despite a general upward trend in IMDB scores from seasons 1 to 4, the series finale was given a rating of 5.4, which would make it the worst episode of Star Trek of all time, except for the 6 indicated in the table above (see table: “All Star Trek episodes with an IMDB rating less than or equal to 6”), one of which is a clip show from TNG.

Overall, Voyager had episode scores reported on IMDB that were lower, and consistently lower. On the other hand, despite its reputation, Enterprise had higher scores. These data support the hypothesis that Voyager’s reputation as a sub-par Star Trek offering is deserved, while Enterprise’s reputation may be explained as the consequence of the poor quality of its series finale, and the lack of redeeming episodes with scores greater than 9.

Subjective appraisals of a series of Star Trek therefore seem to be most sensitive to the number of excellent episodes, rather than median or mean score, and least sensitive to the number of very bad episodes, except in the case where that episode is a series finale.