
Yelp Refiner

Using natural language processing to identify sub-topical sentiments in Yelp reviews

Problem Statement

What does a 3.5-star Yelp rating really mean? Without reading through the (often numerous, often long) reviews, it's impossible to tell whether the rating reflects mediocre food with excellent service, good food that is way over-priced, or excellent food that you'll have to wait three hours for.

Data

Yelp is a website where people can write reviews of just about any business. Reviewers give an overall star rating and write a narrative review explaining it. Every couple of years, Yelp releases a set of its data so that data scientists and others can do something interesting with it. The latest dataset includes 2.7 million reviews and 649,000 "Tips" (short reviews) for 86,000 businesses from 687,000 users. For some quick visualizations, check out my Tableau Story here. I focused only on restaurants that were not also night clubs or fast food places.

Goal

My goal is to identify the sub-topics of each review (e.g. food, service, price, wait time, etc.) and the sentiment regarding those sub-topics (e.g. good food, bad service, bad price, good wait time, etc.) to help people make informed decisions about what restaurant they want to go to without having to read through all the reviews.

Methods

I started with a lot of research on approaches for unsupervised (un-labeled) Natural Language Processing (NLP), since my data weren't labeled with sub-topical sentiments. Most NLP techniques focus on overall polarity (positive or negative sentiment) and/or overall topic. To identify sub-topics, I needed a hierarchical approach and found Latent Dirichlet Allocation (LDA). LDA identifies groups of words that co-occur frequently and treats each group as a unique "topic". One unique thing about LDA is that it allows the same word to be associated with multiple topics. It also estimates what proportion of each document belongs to each topic. A common way to interpret an LDA model is to look at the top words associated with a particular topic; for instance, pizza, dough, sauce, cheese, and toppings would likely co-occur frequently and would be easily interpreted as a topic about pizza restaurants. Because LDA is a processing-intensive and time-consuming approach, I used a 10,000-review subset randomly selected from the whole dataset to create LDA models.
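For readers who want to see what this workflow looks like in code, here is a minimal sketch of fitting an LDA model with gensim. The toy corpus, topic count, and preprocessing are illustrative assumptions, not the settings used in this project.

```python
# Minimal LDA sketch with gensim; toy reviews and num_topics are placeholders.
from gensim import corpora, models
from gensim.utils import simple_preprocess

reviews = [
    "The pizza dough and sauce were amazing, great toppings",
    "Service was slow but the burger was cooked perfectly",
    "Our waiter was friendly and the gyro was delicious",
    "Way too expensive for such a small portion of sushi",
]

# Tokenize and build the bag-of-words corpus
tokenized = [simple_preprocess(r) for r in reviews]
dictionary = corpora.Dictionary(tokenized)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# Fit LDA; the number of topics here is a hypothetical choice
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)

# Inspect the top words per topic (the usual way to interpret an LDA model)
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])

# Topic proportions for a single review
print(lda.get_document_topics(bow_corpus[0]))
```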

An initial LDA approach created topics that were easily identified as a type of restaurant (Greek, Italian, sushi, pizza, burgers, etc.), but didn't identify sentiment or organize around themes like "food" and "service". A Yelp Challenge winner approaching a similar problem included "codewords" in their LDA model to help it identify sentiments along with topics: whenever there was a positive or negative word, a codeword was appended. So, for example, "The food was bad, but the waiter was very nice" would become "The food was bad BADREVIEW, but the waiter was very nice GOODREVIEW." I took a similar approach, but modified it so that "not bad" would turn into "not bad GOODREVIEW" instead of "not bad BADREVIEW" (I also used POSITIVEWORD and NEGATIVEWORD instead of GOODREVIEW and BADREVIEW). This did help to identify sentiments, but my LDA model was still settling on topics related to types of restaurants instead of topics like "food" and "service".
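A rough sketch of this codeword injection, including the negation flip, is below. The word lists, negation handling, and function name are illustrative assumptions, not the project's actual implementation.

```python
# Hypothetical sentiment-codeword injection with a simple negation flip,
# so "not bad" contributes POSITIVEWORD rather than NEGATIVEWORD.
POSITIVE = {"good", "great", "awesome", "nice", "delicious"}
NEGATIVE = {"bad", "terrible", "slow", "awful", "bland"}
NEGATIONS = {"not", "never", "no"}

def add_sentiment_codewords(text):
    tokens = text.lower().split()
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        word = tok.strip(".,!?")
        negated = i > 0 and tokens[i - 1].strip(".,!?") in NEGATIONS
        if word in POSITIVE:
            out.append("NEGATIVEWORD" if negated else "POSITIVEWORD")
        elif word in NEGATIVE:
            out.append("POSITIVEWORD" if negated else "NEGATIVEWORD")
    return " ".join(out)

print(add_sentiment_codewords("The food was bad, but the waiter was very nice"))
# ... bad NEGATIVEWORD ... nice POSITIVEWORD
print(add_sentiment_codewords("The pizza was not bad at all"))
# ... not bad POSITIVEWORD ...
```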

I figured the issue might be that people use different vocabulary when talking about different types of food items. To help the LDA model realize that "The pizza had lots of pepperonis" isn't so different from "The hamburger had lots of tomatoes", I added a food-related codeword. If a word had something to do with food (as identified by a tool in the Natural Language Toolkit: http://www.nltk.org/howto/wordnet.html), I added the codeword "FOODWORD". So, "The pizza had lots of pepperonis" became "The pizza FOODWORD had lots of pepperonis FOODWORD". This produced topics that seemed more related to things like "service" or "food". However, it wasn't clear how closely a particular topic corresponded to what a human would identify as "Good service," "Bad food," "Bad price/value," etc. If I wanted to fine-tune the parameters, how would I know that a topic was getting closer to what a human might interpret? So, to help my model out, I manually coded about 205 reviews to see if there were correlations between my manually-coded reviews and particular LDA topics.
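One way such a food filter could be written with NLTK's WordNet interface is sketched below: check whether any noun sense of a word has "food" among its hypernyms. The specific root synsets and tokenization are assumptions for illustration.

```python
# Sketch of FOODWORD tagging using WordNet hypernyms (requires the NLTK
# wordnet corpus: nltk.download("wordnet")).
from nltk.corpus import wordnet as wn

FOOD_ROOTS = [wn.synset("food.n.01"), wn.synset("food.n.02")]

def is_food_word(word):
    """Return True if any noun sense of the word has 'food' as a hypernym."""
    for syn in wn.synsets(word, pos=wn.NOUN):
        hypernyms = set(syn.closure(lambda s: s.hypernyms()))
        if any(root in hypernyms for root in FOOD_ROOTS):
            return True
    return False

def add_food_codewords(text):
    out = []
    for tok in text.lower().split():
        out.append(tok)
        if is_food_word(tok.strip(".,!?")):
            out.append("FOODWORD")
    return " ".join(out)

print(add_food_codewords("The pizza had lots of pepperonis"))
# e.g. "the pizza FOODWORD had lots of ..."
```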

I was relieved to see that there were strong correlations between some of the LDA topics and my manual coding! I decided to use that to fine-tune my model. I started with a few different tokenization techniques to see which created a stronger correlation with the manually-coded topics. I then tuned the hyper-parameters (number of topics, alpha, and beta/eta), keeping the LDA model whose topics correlated most strongly with my manually-coded topics as my best model.
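The model-selection idea can be sketched as follows: for each candidate model, compute per-review topic proportions and correlate each topic against a manually coded label, then keep the model whose best topic aligns most strongly. Variable names, data shapes, and the use of Pearson correlation are assumptions for illustration.

```python
# Sketch: correlate LDA topic proportions with 0/1 manual codes.
import numpy as np

def topic_matrix(lda, bow_corpus, num_topics):
    """Dense (n_reviews x num_topics) matrix of topic proportions."""
    mat = np.zeros((len(bow_corpus), num_topics))
    for i, bow in enumerate(bow_corpus):
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            mat[i, topic_id] = prob
    return mat

def best_topic_correlation(topic_props, manual_labels):
    """Highest |Pearson r| between any topic column and a 0/1 manual code."""
    corrs = [np.corrcoef(topic_props[:, t], manual_labels)[0, 1]
             for t in range(topic_props.shape[1])]
    return max(corrs, key=abs)

# Usage sketch: compare candidate models fit with different hyper-parameters
# for model in candidate_models:
#     props = topic_matrix(model, coded_bow_corpus, model.num_topics)
#     print(best_topic_correlation(props, good_food_labels))
```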

Results

To test whether my LDA model worked, I applied it to four randomly chosen restaurants whose reviews I had manually coded. There was a strong overlap between the LDA and manual codings on the following topics: Good Food, Bad Food, and Bad Service. Unfortunately, the other overlaps weren't strong enough and, in particular, lacked the precision necessary for my purposes (after all, flagging a restaurant as having bad food when, in fact, few people said they disliked the food would be very damaging). See my Keynote Presentation for some visualizations of the overlaps, and check out my Tableau Story here for more information about restaurant-level predictions.

Risks and Assumptions

Before applying this technique, it's important to identify the risks and assumptions inherent in it. The first is an assumption that the patterns in the reviews provided in the dataset are representative of, and transferable to, locations not covered by the dataset. It's possible that people in places like New York City review restaurants very differently than in, for instance, Las Vegas (where much of this dataset comes from). The second big assumption is that the way people talk about restaurants is not changing over time. For instance, if people become more and more sarcastic in their reviews, this model may no longer be appropriate, since it likely doesn't pick up on that sort of thing. Next, I have to assume that the reviews I manually coded and the restaurants I randomly chose were not unique or anomalous, since they played a major role in judging my model. If they are anomalies, then I can't assume that my model works well on the complete dataset. Finally, I am assuming that my 10,000-review subset is representative of the whole. I was able to repeat my process on another random 10,000-review subset of the data and came to similar, though not identical, conclusions. Therefore, it may be worth repeating on a larger sample to ensure better representation.

Application/Further Work

I believe that my model is strong enough to begin implementing. Any restaurant with over 20% of its reviews predicted to be about a specific topic (Good Food, Bad Food, or Bad Service) could be flagged. For example, there could be an icon indicating "People especially like the food here" or "People often have problems with the service here" (a sketch of this flagging rule follows below). However, the model could definitely improve through human correction. For instance, if users were periodically asked to verify the LDA labeling (e.g. "Does this review speak poorly of the service?") or future reviewers were asked to rate certain aspects (e.g. an overall thumbs up or down on food, service, value, wait time, etc.), this feedback could train the model to become better at identifying sub-topical sentiments. Also, I believe that a non-LDA approach may be more appropriate for identifying sentiments around price/value, wait time, and some other topics, because the way they are talked about may be more consistent (e.g. while there may be a million ways to say food is good, there are probably fewer ways to say that the wait time was too long). A multi-model, learning approach could make this a very effective way to improve Yelp's usability moving forward.
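The proposed flagging rule is simple enough to sketch directly. The messages, labels, and data shape below are assumptions for illustration; only the 20% threshold and the three reliable topics come from the discussion above.

```python
# Sketch of the 20% flagging rule for restaurant-level summaries.
FLAG_THRESHOLD = 0.20
MESSAGES = {
    "good_food": "People especially like the food here",
    "bad_food": "People often have problems with the food here",
    "bad_service": "People often have problems with the service here",
}

def flag_restaurant(predicted_topics_per_review):
    """predicted_topics_per_review: one set of predicted topic labels per review."""
    n = len(predicted_topics_per_review)
    flags = []
    for topic, message in MESSAGES.items():
        share = sum(topic in topics for topics in predicted_topics_per_review) / n
        if share > FLAG_THRESHOLD:
            flags.append(message)
    return flags

reviews = [{"good_food"}, {"good_food"}, set(), {"good_food"}, {"bad_service"}]
print(flag_restaurant(reviews))  # -> ["People especially like the food here"]
```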
