Many towns in Sweden share the same atomic parts in their names. -hult is common in Småland -arp is common in Skåne. I got curious of how they cluster geographically. Let’s load up a DataFrame with the positions of Swedish towns that I made by querying Google for the position of every Swedish town I could find on Wikipedia.

This notebook is also available as a Jupyter Notebook here if you want to execute the code yourself.

Scroll past the code if you’re only here for the maps.

import pandas as pd df = pd.read_csv('swe_towns_latlon.csv', encoding='utf-8') df['town'] = df['town'].str.lower()

And then do some crappy character n-gram extractions.

def find_features(s): feats = [s[i:i+7] for i in xrange(len(s)-6)] feats += [s[i:i+6] for i in xrange(len(s)-5)] feats += [s[i:i+5] for i in xrange(len(s)-4)] feats += [s[i:i+4] for i in xrange(len(s)-3)] feats += [s[i:i+3] for i in xrange(len(s)-2)] return feats features = [] for town in df['town'].values: features.extend(find_features(town))

To be able to count the occurances.

features = list(set(features)) final_features = [] occurances = [] for feature in features: try: occurances.append(sum(df['town'].str.contains(feature))) final_features.append(feature) except: pass df_features = pd.DataFrame({ 'feature': final_features, 'occurances': occurances })

So now we get interesting parts of town names. Sanity check (at least if you’re Swedish) is that sta , hult and so on are present.

print ", ".join(df_features.sort('occurances', ascending=False).head(100)['feature'].values) sta, tor, ing, erg, nge, orp, oc, och , och , och, ch , och, vik, torp, ra , ber, ors, berg, und, sjö, tra, and, str, näs, ter, inge, stra, tad, stad, den, for, fors, mar, ken, sto, olm, ste, äst, storp, stor, lla, lle, all, dal, hol, ngs, holm, äll, lst, tra , ång, ers, orr, stra , red, ran, nda, arp, ham, ill, est, ung, äck, ten, ster, gen, jör, rby, jär, nne, cke, ult, amm, mma, lin, sun, ryd, ack, sby, äng, eby, byn, len, lan, sund, ling, ker, nna, mmar, löv, lun, ike, bäck, bro, tan, nga, bäc, ård, rst, tte

I went ahead and took out the ones I deemed interesting. Now let’s plot them to se how they group geographically.

from mpl_toolkits.basemap import Basemap import matplotlib.pyplot as plt from matplotlib import rcParams rcParams['font.family'] = 'sans-serif' rcParams['font.sans-serif'] = ['Helvetica Neue'] %matplotlib inline def plot(data, mult=0.9): """ Take data with 'part-of-towns-name' and color and plot it. Use `mult` for easy scaling of plot """ def get_coordinates(townpart): try: townpart = townpart.decode('utf-8') except: pass return df[df['town'].str.contains(townpart)]['lat'].values, df[df['town'].str.contains(townpart)]['lon'].values fig = plt.figure(figsize=(20*mult, 16*mult), dpi=200) m = Basemap( projection='merc', resolution='i', area_thresh=250, llcrnrlon=9.5, llcrnrlat=54.5, urcrnrlon=24.5, urcrnrlat=69.5 ) m.drawcoastlines(linewidth=0, color="#000000") m.drawcountries() m.drawstates() m.drawmapboundary() m.fillcontinents(color='black', lake_color='white', zorder=0) m.drawmapboundary(fill_color='white') title = plt.title(u'Town names containing:', fontsize=16*mult) title.set_y(1.01) for townpart in data.keys(): lats, lons = get_coordinates(townpart) x, y = m(lons, lats) m.scatter(x, y, marker='o', s=30*mult, alpha=1, label=townpart, edgecolors='none', c=data[townpart]) plt.legend(loc=2, fontsize=16*mult) plt.show() plot({ 'arp': '#00ff00', 'holm': '#ff0000', })

plot({ 'sta': '#ff00ff', 'ing': '#00ffff', })

plot({ 'hult': '#ffff00', 'fors': '#00ff00', })

plot({ 'vik': '#4f96c5', 'torp': '#00ff00', })

plot({ u'näs': '#b15928', 'ryd': '#08a060', })

plot({ 'tuna': '#33cc33', 'hammar': '#ff0066', })

plot({ u'köping': '#cc99ff', u'bruk': '#ffff00', })

plot({ 'stor': '#00ccff', 'sund': '#e3c471', })

plot({ 'stad': '#ffcc00', 'berg': '#9933ff', })

Berg is really common in the old mining regions called Bergslagen.