Word Embedding

Alongside word tagging, word embedding is another important aspect of NLP. A word embedding maps strings to vector representations, so that strings separated by a small vector distance are deemed semantically similar.

The following diagram showcases a few random strings placed in the coordinate space. You can see that the semantically similar ones are clustered together:

Word embedding is a crucial part of search engines and search indexing, as it's pretty common to search for terms that are not directly present in the search index. In such cases, word embedding lets us retrieve the closest possible matches.
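As a rough sketch of this fallback idea (the index contents and query term below are made-up examples, not part of any framework), we could look up the neighbors of a missing query term and keep the ones that do appear in the index:

```swift
import NaturalLanguage

// Hypothetical search index that lacks the exact query term.
let index: Set<String> = ["dog", "lion", "horse"]
let query = "puppy"

if !index.contains(query),
   let embedding = NLEmbedding.wordEmbedding(for: .english) {
    // Collect the closest terms to the query word...
    var candidates: [String] = []
    embedding.enumerateNeighbors(for: query, maximumCount: 10) { neighbor, _ in
        candidates.append(neighbor)
        return true
    }
    // ...and keep only those that are actually present in the index.
    let fallbackMatches = candidates.filter { index.contains($0) }
    print(fallbackMatches)
}
```

The exact neighbors returned depend on the OS embedding assets installed on the device, so the matches should be treated as suggestions rather than exact results.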

Currently, the Natural Language framework ships built-in OS embeddings for seven languages: English, Spanish, French, Italian, German, Portuguese, and Simplified Chinese. However, we can also create our own custom word embeddings, as we'll see shortly.

An NLEmbedding instance is created as follows:

let embedding = NLEmbedding.wordEmbedding(for: .english)

Currently, the OS embeddings require words to be lowercase; looking up a word containing uppercase characters returns no result. The following method retrieves the vector representation of a word:

embedding?.vector(for: "cat") // returns a [Double] vector
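Because of the lowercase requirement, it can be convenient to normalize input before a lookup. The helper below is our own convenience wrapper, not part of the framework; it lowercases the word and uses the embedding's contains(_:) check before retrieving the vector:

```swift
import NaturalLanguage

// Hypothetical convenience wrapper: lowercases the input and verifies
// the embedding actually contains the word before looking it up.
func vector(for word: String, in embedding: NLEmbedding) -> [Double]? {
    let normalized = word.lowercased()
    guard embedding.contains(normalized) else { return nil }
    return embedding.vector(for: normalized)
}

if let embedding = NLEmbedding.wordEmbedding(for: .english) {
    // "Cat" would fail a direct lookup; the wrapper normalizes it first.
    let v = vector(for: "Cat", in: embedding)
    print(v?.count ?? 0) // vector length matches embedding.dimension
}
```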

The distance computed between two words is a cosine distance, and in cases where it can't be computed (for example, when a word doesn't exist in the built-in OS embedding), the value returned is 2.0.

embedding?.distance(between: "cat", and: "dog") //0.71
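To see where the 2.0 sentinel comes from, note that cosine distance is commonly defined as 1 minus the cosine similarity, which ranges from 0.0 (same direction) to 2.0 (opposite direction). The plain-Swift function below illustrates that range; it is a sketch of the standard formula, not the framework's internal implementation:

```swift
// Cosine distance as 1 - cosine similarity: 0.0 for vectors pointing the
// same way, 2.0 for opposite vectors — which is why 2.0 can double as a
// "no result" value.
func cosineDistance(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).map(*).reduce(0, +)
    let normA = a.map { $0 * $0 }.reduce(0, +).squareRoot()
    let normB = b.map { $0 * $0 }.reduce(0, +).squareRoot()
    return 1 - dot / (normA * normB)
}

print(cosineDistance([1, 0], [1, 0]))  // 0.0
print(cosineDistance([1, 0], [-1, 0])) // 2.0
```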

To find the top K most similar words, we can enumerate a word's neighbors in the following way:

embedding?.enumerateNeighbors(for: "HeartBeat".lowercased(), maximumCount: 5) { (string, distance) -> Bool in
    print("\(string) - \(distance)")
    return true
}

For specific use cases, custom word embeddings can be built from datasets such as GloVe, Word2Vec, BERT, or fastText. For demonstration purposes, we create a small vector dictionary, as shown below:

let vectors = [
    "dog": [-0.4, 0.37],
    "cat": [-0.15, -0.02],
    "lion": [0.19, -0.4]
]
let embedding = try MLWordEmbedding(dictionary: vectors)
try embedding.write(to: URL(fileURLWithPath: "modelPathHere"))

MLWordEmbedding applies an automatic compression technique that can shrink gigabytes of data into a very small Core ML model.

To use the Core ML model with NLEmbedding, we need to pass the URL of the compiled model, as shown below: