I recently moved to San Francisco to join the Insight Health Data Science program as a Fellow. One of the first things that I needed to do in a new city was to find a primary care physician, but I didn’t have time to read all of the reviews of doctors online.

I realized it would be nice to have a tool that could highlight the pros and cons of each doctor in concise snippets of information, saving me needless hours diving into lengthy reviews. I decided to build such a tool for my Insight project and call it DoctorSnapshot.

DoctorSnapshot: developing the product vision

When people search for doctors on business review websites, they naturally choose among the doctors with the highest ratings and a large number of reviews supporting those ratings. These highly-rated doctors can have hundreds or even thousands of reviews under their profiles, and comparing them to each other becomes a tedious task.

Furthermore, even if there is only one highly-rated doctor, one may still want to read the reviews to see why people like this doctor and whether the reviewers raised the same concerns one has. This, again, can be time-consuming. In both cases, some sort of review summarizer would be helpful.

Web services such as Zocdoc and Yelp have offered their own versions of “doctor snapshots” to help users quickly see what reviewers have said about a doctor. Zocdoc rates doctors in three categories: “overall rating,” “bedside manner,” and “wait time.” However, these fixed categories miss other useful points that users make in their reviews. Yelp automatically highlights representative review sentences that share common phrases with other sentences (see example), but no explicit rating is given for the topics mentioned in those sentences.

I decided that my tool would combine the best of both approaches. DoctorSnapshot first detects the topics discussed in the reviews (e.g. bedside manner), then analyzes whether people talked about those topics positively or negatively, and finally assigns appropriate ratings to the topics.

Obtaining data on doctors’ reviews through web-scraping

My first step in building DoctorSnapshot was collecting a large number of reviews. As far as I knew, there was no existing dataset with reviews of doctors available online, so the only way to acquire these data was through web-scraping. I started by looking at a database of physicians called BetterDoctor, whose API allows for easy querying of doctors’ profiles using geographical locations.

Although BetterDoctor itself does not contain reviews, it provides a doctor’s Yelp URL when he or she has a Yelp page. I retrieved all doctors in San Francisco from BetterDoctor based on the longitudes and latitudes of their addresses and ended up with 187 doctors with their own Yelp pages. While 187 doesn’t sound like a large number, it gave me 5,088 reviews totaling about 700,000 words. (For comparison, all seven Harry Potter books comprise about 1 million words.)
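Parsing the Yelp links out of the query results can be sketched as below; note that the `profile`/`yelp_url` field names are my assumptions about the payload shape, not the documented BetterDoctor schema:

```python
def yelp_profile_urls(doctor_records):
    """Collect Yelp profile URLs from BetterDoctor-style doctor records.

    The 'profile' / 'yelp_url' field names are assumed for illustration,
    not taken from the documented BetterDoctor schema.
    """
    urls = []
    for record in doctor_records:
        url = record.get("profile", {}).get("yelp_url")
        if url:  # keep only doctors that actually have a Yelp page
            urls.append(url)
    return urls

# Illustrative records: only the second doctor has a Yelp page.
records = [
    {"profile": {"first_name": "Ann"}},
    {"profile": {"first_name": "Bo", "yelp_url": "https://www.yelp.com/biz/dr-bo"}},
]
print(yelp_profile_urls(records))  # -> ['https://www.yelp.com/biz/dr-bo']
```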

Algorithm development and workflow

Step 1. Latent Dirichlet Allocation to identify topics from text

Latent Dirichlet Allocation (LDA) is a popular Natural Language Processing (NLP) tool that can automatically identify topics in a corpus. LDA assumes each topic is a bag of words with certain probabilities, and each document is a bag of topics with certain probabilities; this concept is illustrated in the figure below (see here for a more detailed explanation). The goal of LDA is to learn the word and topic distributions underlying the corpus. Gensim is an NLP package particularly well suited for LDA and other topic-modeling and word-embedding algorithms, so I used Gensim to implement my project.

How Latent Dirichlet Allocation (LDA) sees the review texts.

LDA yielded 11 meaningful topics, which I manually categorized into “general topics” and “doctor specialty topics.”

The general topics are:

Payments
Appointments and visits
Positive comments

The doctor specialty topics are:

Dental care
Women’s health
Surgery
Allergy treatments
Skin procedures
Eye care
Reconstructive surgery
Urology treatments

As I mentioned above, in LDA each topic is composed of words with probabilities. For example, the highest-probability words in the “Payments” topic are insurance, pay, company, cover, charge, cost, paid, pocket, price, office, payment, medicine, service, amount, and claim. I should point out that LDA only groups words into topics; it takes a human to interpret a topic and assign it a meaning by looking at its words.

Step 2. Assign sentiment scores to sentences under identified topics

In order to score doctors on the topics mentioned in their reviews, I needed to analyze the sentiment of those reviews. I defined a topic’s percentage rating as the percentage of reviews that commented positively when they mentioned the topic (similar to Rotten Tomatoes), and used this metric to assign sentiment scores to topics.

More specifically, I used my trained LDA model to determine the topic composition of each sentence in a doctor’s reviews. If a single topic accounted for 70% or more of a sentence, I assigned the sentence to that topic. I then classified each sentence’s sentiment as positive or negative, and finally computed the percentage of positive sentences within each topic as the doctor’s rating for that topic.
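The scoring logic can be sketched as below; the per-sentence topic distribution (from LDA) and compound sentiment score (from VADER) are taken as given inputs, and the helper names are mine:

```python
THRESHOLD = 0.70  # a sentence must be >= 70% one topic to count toward it

def assign_topic(topic_dist):
    """Return the dominant topic id if it covers >= 70% of the sentence, else None.

    topic_dist: list of (topic_id, probability) pairs, as LDA produces.
    """
    topic, prob = max(topic_dist, key=lambda pair: pair[1])
    return topic if prob >= THRESHOLD else None

def topic_ratings(sentences):
    """Compute the Rotten-Tomatoes-style rating for each topic.

    sentences: list of (topic_dist, compound_score) pairs, one per sentence.
    Returns {topic_id: percent of positive sentences under that topic}.
    """
    counts, positives = {}, {}
    for topic_dist, compound in sentences:
        topic = assign_topic(topic_dist)
        if topic is None:
            continue  # no dominant topic: sentence does not count
        counts[topic] = counts.get(topic, 0) + 1
        if compound > 0:
            positives[topic] = positives.get(topic, 0) + 1
    return {t: 100.0 * positives.get(t, 0) / n for t, n in counts.items()}

example = [
    ([(0, 0.9), (1, 0.1)], 0.6),   # clearly topic 0, positive
    ([(0, 0.8), (1, 0.2)], -0.4),  # clearly topic 0, negative
    ([(0, 0.5), (1, 0.5)], 0.7),   # no dominant topic -> skipped
]
print(topic_ratings(example))  # -> {0: 50.0}
```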

DoctorSnapshot machine learning pipeline.

To supplement the ratings by topic, I also added highlights from the reviews for users to read: the three most positive and three most negative sentences in a doctor’s reviews, based on their sentiment scores.
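Picking the highlights is a simple sort by sentiment score; a sketch with hypothetical helper names:

```python
def review_highlights(scored_sentences, k=3):
    """Return the k most positive and k most negative sentences.

    scored_sentences: list of (sentence, sentiment_score) pairs.
    """
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    return ranked[:k], ranked[-k:]

scored = [
    ("Dr. X was wonderful.", 0.9),
    ("The wait was too long.", -0.5),
    ("Billing was a nightmare.", -0.8),
    ("Very caring staff.", 0.7),
    ("Office was fine.", 0.1),
    ("Rude receptionist.", -0.6),
]
positives, negatives = review_highlights(scored, k=3)
print([s for s, _ in positives])
print([s for s, _ in negatives])
```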

Validation of LDA topics and sentiments

Qualitative validation of learned doctors’ specialties

Before using my trained models to generate doctor snapshots, I validated them in a couple of ways. First, I checked that medical specialty topics appeared only in the reviews of doctors with the corresponding specialty. The specialty topics learned automatically by my LDA model indeed aligned with the doctors’ recorded specialties (found via the BetterDoctor API), as the figure below illustrates. For example, the topic “Skin procedures” appears only in the reviews of dermatologists.

Topics representing a doctor’s specialty correspond correctly to the doctor’s actual specialty.

word2vec for validating learned general topics

For the topics that can appear in any doctor’s reviews, i.e. the “general topics,” I used a different validation method. word2vec is another widely used algorithm for representing text, a word-embedding method, which makes it a good independent check on LDA. I trained a word2vec neural network and projected the top words of the LDA general topics into the word2vec space, then visualized the projection in two dimensions using the t-SNE dimensionality-reduction method, shown below. The LDA general topics separate nicely into clusters in the word2vec space, further validation that they are meaningful.

The words in LDA topics separate into different clusters in the word2vec space.

Qualitative validation of VADER for sentiment analysis

I also tested my chosen sentiment analyzer, VADER. Before settling on VADER, I tried another sentiment analyzer called TextBlob. I plotted the sentiment scores of reviews (-1 meaning most negative, 1 meaning most positive) against the star ratings associated with those reviews. TextBlob gave very similar sentiment scores across different review ratings, whereas VADER gave more positive sentiments for higher ratings and vice versa, which made it a good fit for DoctorSnapshot.