Vered Shwartz is a final-year PhD student researching lexical semantic relations in the Natural Language Processing lab at Bar-Ilan University. Alongside her studies, she has also worked on natural language processing as part of the R&D teams at Google and IBM. Away from the lab, Vered is the creative mind behind Probably Approximately a Scientific Blog, where she discusses key NLP concepts for a general audience.

We sat down with her to discuss a range of AI topics, from the lexical problems causing a buzz in the academic community to where she gets her training data.

Lionbridge AI: What was it that led you to specialize in NLP?

Vered: I took two NLP courses as an undergrad, and when I decided to do my Master’s it was an easy answer to “what’s the most interesting field in computer science I’ve been exposed to so far?” Before that I also had a little experience with very naive text processing in Hebrew, but in my research I work in English. I think that being a non-native speaker of English makes me especially interested in investigating ways to teach machines to understand the language, since I see the parallels with my own language learning journey.

L: Which issues does your research focus on?

V: My research focuses on lexical semantics. In particular, I’m interested in recognizing semantic relations between words and phrases, or as I once described it: I teach the computer that a cat is an animal. It sounds trivial to people outside the field, but the truth is that even a seemingly simple task like this is not yet solved. NLP applications today are informed by word similarity, so they know that “cat” is similar to “animal”, “tail”, and “dog”, but they don’t know the exact relationship between “cat” and each of the other words. Of course, it gets more complicated if you consider the context – “cat” in an article discussing Linux is more likely to refer to the command than the animal, and a generic verb like “get” can have the same meaning as the more specific verb “sentenced” in the context of “get X years in jail”.

L: It’s interesting how these tiny connections that we take for granted are so challenging for machines to learn. You talk in depth on your blog about similar language difficulties that machines run into, such as paraphrasing and ambiguity. What is it that makes these so problematic?

V: Variability and ambiguity are considered big obstacles because they make mapping between language and meaning difficult. Rather than a simple function, it is a many-to-many mapping, where a text can have multiple meanings, and a meaning can be expressed in different words. There has been a lot of progress on lexical variability and ambiguity in recent years: word embeddings map words to vectors, so that words with similar meanings have similar vectors. More recently, contextualized word representations such as ELMo have improved the performance of many applications by mitigating the ambiguity issue.
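
To make the idea of "similar meanings, similar vectors" concrete, here is a minimal Python sketch that compares words by cosine similarity. The three-dimensional vectors are hand-made toy values chosen for illustration, not the output of any trained model, and the names `TOY_EMBEDDINGS`, `cosine_similarity`, and `most_similar` are hypothetical.

```python
import math

# Hand-made toy vectors (hypothetical values, NOT from a trained embedding model).
TOY_EMBEDDINGS = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.1],
    "animal": [0.7, 0.7, 0.2],
    "linux": [0.1, 0.0, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings=TOY_EMBEDDINGS):
    """Rank every other word by its cosine similarity to `word`, highest first."""
    target = embeddings[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_similar("cat"):
        print(f"cat ~ {other}: {score:.3f}")
```

With these toy values, “dog” and “animal” both score close to “cat”, while “linux” scores far lower. The score is a single number, though, so it says nothing about whether the relation is "is a kind of" or "is a sibling of", which is the gap Vered describes in her research.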

L: What do you think are the most interesting lexical problems facing NLP researchers?

V: To name a few interesting unsolved problems: first, building better word representations that can provide more accurate semantic information than plain word similarity. Second, inferring implicit information – for example, depending on the context and knowledge about the speaker, an utterance like “the president” can be resolved to a specific president. And finally, dealing with multiword expressions – there is so much more work to do there!

“Although there is an extensive pool of literature on multiword expressions, many NLP applications today make a naive assumption that the meaning of a sentence is derived from a combination of the meanings of its words.”

In extreme cases, there are idioms (“look what the cat dragged in”) and non-compositional compounds (“cat walk”) where this assumption does not hold. Even in the “easy” compositional cases, different words combine differently. “Baby oil” is not oil made of babies, despite the many other examples like “olive oil” and “coconut oil” that the algorithm can learn from. Sometimes even pairs of similar words combine into completely different expressions, as in “pharmaceutical salesperson” vs. “drug dealer”… not something you’d want your algorithm to err on.

L: Considering how fine these semantic lines are, it must be important to have great quality data. What are the key features that you look for in your data when tackling these challenges?

V: I’ll refer separately to training and test data. For training data, the straightforward answer is a “large enough” dataset, although this is probably a necessary-but-not-sufficient condition, and it’s also not well defined. I mean, how large is large? It will never include every possible linguistic phenomenon. Instead, I’d settle for a reasonably sized training set and invest more time in designing features that would make the model generalize and deal with unobserved data.

As for test sets, I have other expectations of them. The test set needs to be annotated carefully, with high inter-annotator agreement, be diverse enough with respect to the phenomena that it tests, and hopefully not easily solvable by memorizing biases in the training data. It’s not easy to design good datasets, but I think there is a growing interest in the community. After several papers in the last year pointed out problems in existing datasets, now some papers suggest how to build good datasets that don’t suffer from the same issues.
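
Inter-annotator agreement is often reported with a chance-corrected statistic. As an illustration only (the interview does not say which measure Vered uses), here is a minimal Python sketch of Cohen's kappa for two annotators. The function name `cohens_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")

    n = len(labels_a)
    # Fraction of items on which the two annotators gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["hypernym", "hypernym", "other", "other"]
    annotator_2 = ["hypernym", "other", "other", "other"]
    print(cohens_kappa(annotator_1, annotator_2))
```

In the toy example the annotators agree on three of four items (0.75), but half of that agreement is expected by chance, so kappa comes out at 0.5. That is why raw percentage agreement alone can overstate how carefully a test set was annotated.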

L: How do you normally find your data?

V: I normally follow previous work and use the standard datasets for the task. I only collect data if I’m convinced that there are no suitable existing datasets, but when I do, I usually use crowdsourcing.

L: NLP is progressing at a rapid pace right now. As an insider, what are some hot research topics that you’ll be keeping an eye on over the next year?

V: One of the recent trends, which I think is going to continue, is creating difficult datasets and challenging evaluation settings. In the last few years the community focused on a few easy datasets which are now almost solved. This is not because the task itself is solved, but rather because the datasets are too easy. It took a few years, but now the cycle of releasing a new dataset and having it criticized by other researchers for its simplicity is shorter than ever.

“I’m glad to see that there are brave researchers working on new datasets, which would hopefully require more effort to solve. We can expect to see interesting innovations in the models following the release of these datasets.”

L: Are there any industry applications of NLP that particularly excite you?

V: Although I haven’t worked directly on NLP in industry, as a user of NLP technologies I’m excited about the progress in personal assistants and chatbots. They involve everything in NLP: the system needs to convert the speech to text, analyze the utterance, understand the user’s intent, resolve any mention of entities like people or places to the entity’s representation in the system, fetch the answer, phrase it like a human, and convert it to speech.

Besides that, machine translation is also pretty exciting. Despite being far from perfect, it has really improved for the languages I use. And last, I’m still waiting for someone to develop an application that would understand my emails, take action items and reply automatically, so that I can focus on other work!

L: Finally, do you have any advice for anyone out there looking to build an algorithm that uses NLP?

V: Don’t reinvent the wheel. Read the existing literature before you design a new algorithm, and not just the recent literature. I’m thinking about this advice from the point of view of an academic whose goal is to publish papers, but I’m sure this is even more crucial in the industry, where you wouldn’t want to waste time and money researching something that already has a solution.

Second, I feel like I’m repeating what many others have said before me, but look at the data. Looking at the training data before you start designing the algorithm is important; otherwise you may be expecting your algorithm to learn something for which there is no signal in the data. Analyzing the outputs of your algorithm on the validation set can give you insights into the weaknesses and strengths of the algorithm, which you can’t always get from performance metrics. Ultimately, looking at the data can help you design better algorithms.
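
One simple way to act on this advice is to list the validation examples a model gets wrong and count which label confusions occur most often. The Python sketch below is only an illustration under assumed inputs: the function name `summarize_errors`, the label names, and the toy examples are all hypothetical.

```python
from collections import Counter


def summarize_errors(examples, gold_labels, predicted_labels):
    """Return the misclassified examples and a count of (gold, predicted) confusions."""
    errors = [
        (text, gold, predicted)
        for text, gold, predicted in zip(examples, gold_labels, predicted_labels)
        if gold != predicted
    ]
    confusions = Counter((gold, predicted) for _, gold, predicted in errors)
    return errors, confusions


if __name__ == "__main__":
    # Toy validation set of noun compounds (made-up labels and predictions).
    examples = ["olive oil", "coconut oil", "baby oil", "drug dealer"]
    gold = ["made_of", "made_of", "used_for", "sells"]
    predicted = ["made_of", "made_of", "made_of", "sells"]

    errors, confusions = summarize_errors(examples, gold, predicted)
    for text, gold_label, predicted_label in errors:
        print(f"{text!r}: gold={gold_label}, predicted={predicted_label}")
    print(confusions.most_common())
```

An accuracy figure of 75% on this toy set would hide the fact that the single error is exactly the “baby oil” case from earlier in the interview. Reading the misclassified examples directly is what surfaces that kind of pattern.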