The Star Wars expanded universe is huge. Really huge. Like, you just won’t believe how vastly, hugely, mind-bogglingly huge it really is. To grasp the full extent of this hugeness, a team of data scientists used a new computer program to analyze it, revealing some unexpected things about the extended saga.




A research team from Switzerland’s École Polytechnique Fédérale de Lausanne (EPFL) used a new computer program to analyze hundreds of web pages devoted to the Expanded Star Wars Universe, drawn primarily from Wookieepedia. To make sense of all this data, the researchers applied graph theory, which let them get a handle on all the characters, communities, and timelines involved in the story, and on how everything fits together. The project was led by French data scientist and Star Wars enthusiast Kirell Benzi, assisted by Pierre Vandergheynst of EPFL’s Signal Processing Laboratory 2 (LTS2).



No doubt, one of the more remarkable things about Star Wars is the way it lends itself to universe building. In addition to the seven full-length feature films, the saga has grown in other ways, including through numerous books, television series, video games, and other creative outlets. Here’s what the researchers learned when they put it all together.




Connections between the 7,563 main characters. Via K. Benzi LTS2/EPFL.

The Expanded Star Wars Universe consists of a whopping 21,647 characters. That number drops to 19,612 if every character listed as “unidentified” is removed. Of these, an astounding 7,563 play an important role. Among those drawn to the ways of the Force, 1,367 are Jedi and 724 are Sith. These characters are dispersed among 640 distinct communities on 294 planets. Surprisingly, 78 percent of the galaxy’s population is human.



That’s a lot of humans. Note: This graph does not include every species.

Not content to stop there, Benzi and Vandergheynst also placed each character within the timeline of the story. Star Wars takes place over the course of 36,000 years, which is broken down into six main periods: before the Republic, the Old Republic, the Empire, the Rebellion, the New Republic, and the Jedi Order. Analysis shows that Star Wars characters aren’t evenly distributed throughout the length of the story.




Distribution of characters across Star Wars eras. Some of the characters have been discarded (no era info). The extra colors highlight characters living in different eras.

“We see that the most popular eras are from the films: Rise of the Empire and the Rebellion era,” Benzi explained on his blog. “The Old Republic era is also popular (despite having fewer people) thanks to the MMORPG (video game) Star Wars: The Old Republic.” The extra colors on the chart above highlight characters living in different eras. Darth Vader, for example, makes an appearance in both Rise of the Empire and in the Rebellion era.


This visualization shows how communities of characters interact together.

“As you can see the whole Star Wars universe is coherent and fun details can be revealed using graph theory,” noted Benzi on his blog. “However let’s not forget that we need data to do data-science. In our case, wiki contributors are of paramount importance as they actually create the content we use to blog about. May the Force be with them, always.”


A section of the Star Wars character graph. The orange-red nodes are from Rise of the Empire Era (episodes 1-3), the blue nodes represent the Rebellion era (episodes 4-6), and the green ones represent both eras. Black nodes represent missing data. On the right, black nodes have been replaced by the best compromise using their neighbors.

Benzi’s team also mapped the most connected characters in the Star Wars Universe. Not surprisingly, Anakin Skywalker tops the list. Super interesting to see Boba Fett at number 10; that bounty hunter clearly gets around. And check out Revan, a character from the Old Republic Era, who makes an appearance in the 13th slot.
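For readers curious how a "most connected" ranking can be produced, here is a minimal sketch. The article does not say which measure of connectedness the team used, so this assumes the simplest one: counting how many distinct characters each character is linked to. The function name `most_connected` and the handful of links are invented for illustration and are not taken from the EPFL dataset.

```python
from collections import Counter

def most_connected(edges, top_n=10):
    """Rank characters by the number of distinct characters they are linked to."""
    # Treat links as undirected and drop duplicates and self-links.
    unique_links = {frozenset(pair) for pair in edges if pair[0] != pair[1]}
    degree = Counter()
    for link in unique_links:
        degree.update(link)  # each endpoint of the link gains one connection
    # Highest count first; ties are broken alphabetically for a stable order.
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))[:top_n]

# Toy links, made up for this example only.
toy_links = [
    ("Anakin", "Obi-Wan"),
    ("Anakin", "Padmé"),
    ("Anakin", "Palpatine"),
    ("Obi-Wan", "Padmé"),
    ("Obi-Wan", "Anakin"),  # duplicate of the first link, counted once
]

top_two = most_connected(toy_links, top_n=2)
print(top_two)
```

On the toy links, the character with the most distinct neighbors comes first. A real analysis might weight links or use a different centrality measure.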




“To put some order into this massive forest of data, we based our approach on network analysis. In other words, all the connections that one character has with all the others,” noted LTS2 researcher Xavier Bresson in a press statement. “Using these cross-references, we are able to accurately determine the time period of the character almost without fail, when this information is not directly provided in the books or movies.”
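The idea Bresson describes, inferring a character's time period from the characters they are connected to, can be sketched as a simple neighbor vote. This is an illustration only and assumes a plain majority rule. The team's actual algorithm is not described here, and the function name `infer_eras` and the toy data are hypothetical.

```python
from collections import Counter

def infer_eras(edges, known_eras):
    """Assign each unlabeled character the most common era among its labeled neighbors."""
    # Build an undirected adjacency map from the list of links.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    inferred = dict(known_eras)
    for character in sorted(neighbors):
        if character in known_eras:
            continue
        # Only neighbors whose era was known from the start get a vote.
        votes = Counter(
            known_eras[n] for n in sorted(neighbors[character]) if n in known_eras
        )
        if votes:
            inferred[character] = votes.most_common(1)[0][0]
    return inferred

# Toy data, made up for this example only.
toy_links = [
    ("Luke", "Leia"),
    ("Luke", "Han"),
    ("Unidentified pilot", "Luke"),
    ("Unidentified pilot", "Han"),
    ("Unidentified pilot", "Revan"),
]
toy_eras = {
    "Luke": "Rebellion",
    "Leia": "Rebellion",
    "Han": "Rebellion",
    "Revan": "Old Republic",
}

result = infer_eras(toy_links, toy_eras)
print(result["Unidentified pilot"])
```

Here the unlabeled character has two neighbors from the Rebellion era and one from the Old Republic, so the vote places it in the Rebellion era. A single pass like this leaves characters with no labeled neighbors unassigned.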


This entire project may seem indulgent, but the data-parsing technique could be applied elsewhere. As Benzi explained, this “program maps out connections in the mass of unorganized data available on the net.” The algorithms do more than extract data according to precise criteria: they can also link data points, sort, quantify, and interpret them, and seek out missing information. In the future, a program like this could be used for historical and sociological research, among other scientific fields.

[EPFL]

All images: Kirell Benzi/EPFL/LTS2

