This means that we are ready to organize this data to see how each job sector has been performing.

Performing word embedding on each of the job titles, descriptions, and job posts

Let’s load all of the relevant libraries for this project.
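Something like the following covers everything we need, assuming pandas and NLTK for the text handling, gensim for the word vectors, and the umap-learn and hdbscan packages for the clustering later on. The dataset file name is an assumption, so adjust it to your copy:

```python
import re
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gensim.downloader
import umap      # pip install umap-learn
import hdbscan   # pip install hdbscan
from nltk.corpus import stopwords  # nltk.download('stopwords') on first run

# File name is an assumption; adjust to wherever your copy lives.
df = pd.read_csv('data job posts.csv')
```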

Before we decide to generate a vector for each text element, we’ll remove all of the generic English stop words (“if”, “I”, “because”…), and other generic words for each category. We do this by looking at the most common words for each entry type, generating a custom dictionary, and removing the selected words.
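Here is a minimal sketch of that counting pass; the `jobpost` column name is an assumption about the dataset:

```python
english_stops = set(stopwords.words('english'))

def tokenize(text):
    # Lowercase and keep alphabetic tokens only.
    return re.findall(r'[a-z]+', text.lower())

# Count every non-stop word across all postings.
counts = Counter(
    word
    for post in df['jobpost'].dropna()
    for word in tokenize(post)
    if word not in english_stops
)
print(counts.most_common(25))
```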

After we’ve removed the first list of stop words and printed the word counts, we notice that many words repeat across all job postings. This gives us a clear list of “boring” words:

[('responsible', 4711),
 ('looking', 4290),
 ('incumbent', 3840),
 ('position', 3717),
 ('development', 3538),
 ('seeking', 3015)...

This would’ve been a really good opportunity to use TF-IDF to weed out words based on how often they appear, but this dataset was a bit too small for it to be worthwhile, and the edits can be performed manually. (Whatever gets the job done faster is what we’re fans of.)

Let’s remove a select few so that the word embeddings that follow share fewer common words across the dataset.
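A sketch of that cleanup, reusing `tokenize()` and `english_stops` from above; the exact removal list was hand-picked, so these entries are illustrative:

```python
# Hand-picked "boring" words; extend as needed.
custom_stops = {'responsible', 'looking', 'incumbent',
                'position', 'development', 'seeking'}

def strip_words(text, words):
    # Keep only the tokens that are not in the removal set.
    return ' '.join(t for t in tokenize(text) if t not in words)

df['jobpost_clean'] = df['jobpost'].fillna('').apply(
    lambda post: strip_words(post, english_stops | custom_stops)
)
```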

Now we’re at the beginning of the fun stuff: word embeddings. Although they’re not recommended for long statements or documents (the Yoon Kim CNN, for example, works best with a per-word vector when classifying statements), we know that removing the low-signal, generic words will give us better separation between the resulting vectors.
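The embedding model itself isn’t shown here, so as a sketch, assume a pre-trained gensim model and average the per-word vectors into one vector per title. The model choice and the `Title` column name are assumptions:

```python
# Any pre-trained English model from gensim's downloader works the same way.
wv = gensim.downloader.load('word2vec-google-news-300')

def embed(text):
    # Average the vectors of the in-vocabulary words; fall back to
    # a zero vector when nothing matches.
    vecs = [wv[t] for t in tokenize(text) if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

title_vectors = np.vstack(df['Title'].fillna('').apply(embed).tolist())
```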

Frankenstein lives: UMAP + HDBSCAN

On typical numerical or categorical data, K-Means makes a lot of sense for creating clusters. We can also use it when separating simple word embeddings (1 to 4 words), but it loses signal once we combine vectors of longer strings, where the cosine similarities between embeddings end up much closer to one another. A typical approach for sorting clusters with K-Means involves the following:

We will choose a varying number of clusters (usually a slider from 1 to 20).

Using the elbow method, we will find the optimal number of clusters.

To confirm that the elbow method is pointing at a sensible number of clusters, we will use the silhouette score to see how well each point fits in its assigned cluster (see the sketch after this list).
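As a reference point, that loop looks roughly like this, as a sketch using scikit-learn on the title vectors from above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

inertias, silhouettes = [], []
for k in range(2, 21):  # silhouette needs at least 2 clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(title_vectors)
    inertias.append(km.inertia_)  # input for the elbow plot
    silhouettes.append(silhouette_score(title_vectors, km.labels_))

# Eyeball the elbow, then sanity-check it against the silhouette curve.
plt.plot(range(2, 21), inertias)
plt.xlabel('k')
plt.ylabel('inertia')
plt.show()
```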

However, there are some potential issues with this approach. These are the reasons we want HDBSCAN, rather than K-Means, to handle the clustering:

Not all points can or should be clustered.

Clusters may not be round or spherical (cigar-shaped, half-moon, etc.).

Why isn’t K-Means ideal for clustering large word embeddings? Because it tries to find the center of a spherical cluster. If a cluster is oblong, its extremities get cut off and its centroid gets dragged around by outliers.

Instead of tweaking the K-Means parameters until the cows come home, let’s merge UMAP and HDBSCAN. This involves:

Redefining the space in which the vectors exist, by using UMAP to build a new embedding space on a Riemannian manifold.

Using HDBSCAN to cluster points that are close together, but not necessarily in spherical clusters, while ignoring outliers.

What is UMAP?

The Uniform Manifold Approximation and Projection (UMAP) is a new technique in dimensionality reduction.

In PCA, dimensionality reduction is performed by creating vectors that can adequately recreate the location of every point. If my data lives in ℝ³⁰⁰ (hello, word embeddings), I can define 10 (or 2, or 20) eigenvectors that reproduce the points as faithfully as possible. There’s always a loss of information, but PCA can usually retain 70-80% of the variance.
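You can see that loss directly; a quick check with scikit-learn’s PCA, assuming the title vectors from earlier:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=10).fit(title_vectors)
# Fraction of the original variance retained by 10 components.
print(pca.explained_variance_ratio_.sum())
```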

In UMAP, the approach is to create a Riemannian manifold: an arbitrary shape that is curved, smooth, and connects back to itself. I now have a shape on which all of my points sit fairly close to the surface, especially if the strand is long enough, and that I can untwist and flatten. Think of a ball of twine where both ends meet; that would be a Riemannian manifold in ℝ³.

All I need to do at this point is unravel the string, flatten out the cylinder, and scale the data. Boom! You now have a 2D projection.

Another way to think about it is how we put the Earth on a map: Mercator and polar projections have different skews, and those skews can be ignored (or not) depending on the data we are looking at.

Its GitHub page explains it best:

“Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

1. The data is uniformly distributed on a Riemannian manifold;
2. The Riemannian metric is locally constant (or can be approximated as such);
3. The manifold is locally connected.”

But why use a Riemannian manifold? Fairly simple.

3D space is defined by rules (gravity, time, distance, etc.). If I say, “Let’s imagine a world where distance and angles are not an issue”, you say, “Great! Now a cube is a sphere is a prism, because distances and angles don’t matter.” Now that we are living in ElastiSpace™, we can solve certain problems much more easily. But you quickly realize sharp edges on cubes and prisms mess up your equations. So, I create ElastiSpaceV2™, where spheres and ellipsoids are the same thing, but cubes now have their own category. But what about all of the math you learned in high school? Shouldn’t trigonometry still work after all the time you spent practicing those exercises? ElastiSpaceV2™ can’t help you, but Mr. Riemann’s got chu, fam. In his space, spheres and ellipses are different, and all manner of fancy math still works, so you can do all of the math you want.

(This section was paraphrased from this Quora post.)

Now back to job postings.

Reduce and cluster the vectors

Here’s the code to reduce the dimensionality and start clustering the assembled vectors.
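A minimal sketch of that step; the hyperparameters here are illustrative rather than the exact values used:

```python
# Step 1: squash the high-dimensional vectors down with UMAP.
reducer = umap.UMAP(n_neighbors=15, n_components=2,
                    metric='cosine', random_state=42)
embedding = reducer.fit_transform(title_vectors)

# Step 2: let HDBSCAN find dense, arbitrarily-shaped clusters.
clusterer = hdbscan.HDBSCAN(min_cluster_size=15)
labels = clusterer.fit_predict(embedding)

print(labels)  # -1 marks the points HDBSCAN refused to cluster
```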

There are two things to notice here:

The number of clusters was determined by HDBSCAN, not by us. If you re-run it, you will get some volatility, so save the clusters once you have something that works.

There are some unclustered values (labeled “-1”), about one-third of the points. This means that the clustering only applies to the remaining values.

[21 -1 -1 -1 4 -1 -1 25 25 -1 7 12 19 -1 25 -1 25 4 25
8 25 8 -1 -1 -1 -1 -1 -1 -1 -1 -1 5 -1 9 -1 -1 9 1
25 -1 20 -1 -1 -1 -1 -1 -1 -1 12 -1 -1 24 -1 2 25 -1 1
-1 7 12 -1 -1 14 -1 11 25 -1 -1 21 -1 -1 -1 -1 2 9 -1
11 12 9 12 25 11 7 -1 12 25 25 7 -1 -1 -1 -1 -1 -1 25
7 -1 21 -1 -1 19 -1 14 -1 -1 25 25 -1 25 8 25 25 -1 -1
-1 -1 -1 -1 25 12 -1 -1 25 8 -1 25 -1 -1 -1 -1 7 -1 25
-1 22 -1 25 16 9 -1 12 9 -1 7 12 -1 -1 8 -1 -1 8 24
24 7 25 -1 -1 7 7 7 25 7 1 -1 24 12 -1 -1 25 -1 7
-1 -1 8 13 -1 8 8 1 -1 -1 13 13 13 13 16 16 18 0 -1
16 25 -1 -1 -1 4 -1 -1 -1 15 25 -1 -1 6 7 -1 8 -1 25
-1 -1 -1 4 -1 -1 -1 -1 18 25 25 -1 25 20 25 7 7 7 25
-1 -1 25 25 -1 -1 -1 -1 27 -1 11 25 -1 25 25 18 -1 8 25
25 25 2 -1 -1 25 0 -1 -1 25 -1 12 25 -1 -1 -1 -1 7 -1
-1 -1 7 -1 27 25 -1 27 -1 -1 16 -1 -1 -1 -1 -1 -1 25 -1
-1 25 -1 25 14 -1 1 -1 -1 8 -1 1 25 25 7 7 -1 7 25
-1 25 -1 -1 -1 7 -1 7 -1 25 -1 25 12 -1 25 4 7 7 7
7 6 -1 7 -1 6 -1 11 16 19 -1 -1 -1 25 24 -1 16 -1 -1
-1 -1 -1 -1 -1 -1 -1 2 -1 4 7 25 -1 25 4 7 -1 25 25
1 -1 25 -1 -1 4 -1 7 -1 9 25 -1 -1 7 8 8 25 14 -1
-1 4 25 25 -1 8 25 19 -1 -1 25 15 16 -1 -1 15 21 -1 -1
25 -1 7 7 -1 -1 7 7 8 -1 18 2 16 4 -1 -1 9 9 -1
-1 -1 -1 -1 25 -1 -1 12 -1 7 20 7 -1 -1 25 -1 25 25 -1
25 6 25 7 17 -1 7 16 17 7 7 -1 4 15 -1 7 16 25 -1
25 25 -1 -1 -1 1 25 -1 25 -1 16 25 25 24 -1 12 -1 -1 -1
6 -1 25 1 11 -1 7 6 25 25 -1 25 25 -1 -1 25 1 -1 7
1 -1 -1 4 12 14]

So what do these clusters represent? Let’s count the top words for each cluster to determine what they contain.
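A sketch of the counting, assuming the `labels` array and the `Title` column from above:

```python
titles = df['Title'].fillna('')
for cluster_id in sorted(set(labels) - {-1}):
    # Count the words in the titles assigned to this cluster.
    words = Counter(w for t in titles[labels == cluster_id]
                    for w in tokenize(t) if w not in english_stops)
    print(f'Cluster: {cluster_id}')
    print(words.most_common(5))
    print('---')
```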

Look at how well it broke down!

Cluster: 0
[('lawyer', 125), ('advocate', 3), ('attorney', 1)]
---
Cluster: 1
[('accountant', 259), ('auditor', 22), ('outsourcing', 4), ('cheef', 1), ('bookkeeper', 1)]
---
Cluster: 2
[('software', 130), ('developer', 125), ('developers', 4), ('programmer', 4), ('programmers', 1)]
---
Cluster: 3
[('developer', 166), ('android', 70), ('ios', 64), ('senior', 48), ('application', 32)]
---
Cluster: 4
[('accountant', 233), ('chief', 225), ('assistant', 29), ('deputy', 26), ('engineer', 11)]
--- ... ---
Cluster: 25
[('specialist', 227), ('expert', 226), ('development', 217), ('coordinator', 157), ('program', 147)]
---
Cluster: 26
[('service', 78), ('manager', 55), ('customer', 49), ('department', 17), ('corporate', 15)]
---

We can even get a more compact view:
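This is the same loop, printing just the top four words per cluster:

```python
for cluster_id in sorted(set(labels) - {-1}):
    words = Counter(w for t in titles[labels == cluster_id]
                    for w in tokenize(t) if w not in english_stops)
    top = ' / '.join(w for w, _ in words.most_common(4))
    print(f'Cluster {cluster_id}: {top}')
```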

Cluster 0: lawyer / advocate / attorney

Cluster 1: accountant / auditor / outsourcing / cheef

Cluster 2: software / developer / developers / programmer

Cluster 3: developer / android / ios / senior

Cluster 4: accountant / chief / assistant / deputy

Cluster 5: medical / representative / yerevan / armenia

Cluster 6: developer / java / php / senior

Cluster 7: developer / senior / software / web

Cluster 8: preseller / merchandiser / methodologist / 3d

Cluster 9: language / translator / english / interpreter

Cluster 10: hr / manager / assistant / generalist

Cluster 11: manager / project / construction / senior

Cluster 12: administrative / assistant / office / manager

Cluster 13: quality / assurance / engineer / tester

Cluster 14: assistant / executive / director / ceo

Cluster 15: administrator / system / network / systems

Cluster 16: engineer / software / senior / support

Cluster 17: qa / engineer / senior / manager

Cluster 18: analyst / financial / business / senior

Cluster 19: marketing / specialist / pr / digital

Cluster 20: legal / lawyer / senior / consultant

Cluster 21: finance / officer / chief / financial

Cluster 22: manager / branch / operator / teller

Cluster 23: manager / sales / assistant / retail

Cluster 24: agent / sales / manager / international

Cluster 25: specialist / expert / development / coordinator

Cluster 26: service / manager / customer / department

And here’s a pretty graph, split by cluster.
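The plotting code, as a sketch; the `Year` column name is an assumption about the dataset:

```python
df['cluster'] = labels
# Count postings per cluster per year, ignoring the -1 noise points.
demand = (df[df['cluster'] != -1]
          .groupby(['Year', 'cluster'])
          .size()
          .unstack(fill_value=0))

demand.plot(figsize=(12, 6), legend=False)
plt.xlabel('Year')
plt.ylabel('Number of postings')
plt.title('Postings per job cluster over time')
plt.show()
```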

We now have a pretty graph showing us the demand for each job sector over time. Yay!