Visualizing and analyzing user reviews of polarizing movies on Rotten Tomatoes
This is an exploration of user reviews on Rotten Tomatoes, inspired by the famous user backlash against the critically successful Star Wars: The Last Jedi. This just scratches the surface of this topic, but it provides some direction for a larger project, and it was a fun excuse to play around with various R packages, in addition to my usual Python tools.
To get the code (the R code and also the Python code for the bit at the end), visit my GitHub page for this project.
To view the R code interspersed with the text of this post, check out my Kaggle notebook for this project.
The Last Jedi, the newest Star Wars movie, caused quite a stink when it came out last December. Initial reviews from critics, as well as some audiences (like those polled by CinemaScore), were overwhelmingly positive. But when the Rotten Tomatoes audience reviews started to roll in, it was clear that many fans had a different take. The Rotten Tomatoes backlash was the subject of many analyses and thinkpieces, and things are unlikely to settle down soon, as there is already a new Star Wars movie coming out in May.
This got me thinking about what is actually going on when users disagree with critics. On average, Rotten Tomatoes user ratings tend to accord with critic scores. When user scores are significantly lower than critic scores, what is it about the movie that many users don't like, but which critics seem not to mind? Are user backlashes to critically successful movies predictable from what users are saying about the film? I decided to conduct a series of breakfast experiments to explore this.
To this end, I created a data set of Rotten Tomatoes user reviews of 16 high-grossing films from the past year. This was a fun excuse for me to acquaint myself better with R's web scraping abilities. The dataset consists of the text of every review left by users along with the rating (0.5 to 5 stars) each user left for that film. Reviews with no associated numerical score are left out. (Rotten Tomatoes allows users to leave reviews but mark the score as 'not interested' or 'want to see it'.)
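The scraping itself amounts to walking through each film's user-review pages and pulling out the review text and star rating. Below is a rough sketch of that loop with rvest; the URL pattern and the CSS selectors are placeholders (Rotten Tomatoes changes its markup from time to time), so treat this as the shape of the code rather than something to run verbatim.

```r
library(rvest)
library(dplyr)

# Sketch of the scraping loop. The URL pattern and the selectors
# (".audience-review", ".review-text", ".review-stars") are placeholders,
# not Rotten Tomatoes' actual markup.
scrape_reviews <- function(movie_slug, n_pages = 50) {
  pages <- lapply(seq_len(n_pages), function(p) {
    url   <- paste0("https://www.rottentomatoes.com/m/", movie_slug,
                    "/reviews/?type=user&page=", p)
    html  <- read_html(url)
    nodes <- html_nodes(html, ".audience-review")
    data.frame(
      movie  = rep(movie_slug, length(nodes)),
      text   = nodes %>% html_node(".review-text") %>% html_text(trim = TRUE),
      rating = nodes %>% html_node(".review-stars") %>%
                 html_attr("data-rating") %>% as.numeric(),
      stringsAsFactors = FALSE
    )
  })
  # Drop reviews where the user left text but no numerical score
  bind_rows(pages) %>% filter(!is.na(rating))
}

reviews <- bind_rows(lapply(c("star_wars_the_last_jedi", "black_panther"),
                            scrape_reviews))
```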
After using R's rvest package to scrape Rotten Tomatoes and create the data set, I started plotting distributions over user ratings for the different movies. Here is a density plot of the reviews for the two biggest movies of the last year, The Last Jedi and Black Panther. These are two films for which the average critic score is the same (4.1 stars out of 5), but we see very different distributions of scores, with The Last Jedi receiving a much lower rating on average (1.6 compared to 3.2).
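The plot itself is just an overlaid density estimate. With the reviews in a single data frame (as in the sketch above), something along these lines does it with ggplot2:

```r
library(dplyr)
library(ggplot2)

# Overlaid density plots of user ratings for the two films
reviews %>%
  filter(movie %in% c("star_wars_the_last_jedi", "black_panther")) %>%
  ggplot(aes(x = rating, fill = movie)) +
  geom_density(alpha = 0.5) +
  scale_x_continuous(breaks = seq(0.5, 5, by = 0.5)) +
  labs(x = "User rating (stars)", y = "Density", fill = "Film")
```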
What's interesting here is that, while the distributions are clearly very different, both of these movies are polarizing—we see that review scores tend toward the ends of the scale, not the middle.
Now let's compare Black Panther to Sherlock Gnomes, a movie with a similar average user score (3.4 stars compared to Black Panther's 3.2) but a very different critic score (2.1 stars compared to Black Panther's 4.1).
Black Panther has more 4- and 5-star reviews as well as more 0.5- and 1-star reviews. One might hypothesize that user backlash to critically successful movies happens when movies are polarizing. The linking hypothesis here is that the polarization is under-represented among movie critics, who are perhaps a more homogeneous group of people than Rotten Tomatoes users at large.
To explore this further, we can look at the content of the 0.5-star reviews of these movies. What is it about some of these movies that some users hated enough to leave 0.5-star reviews? We can visualize this using word clouds.
Combining R's tm and wordcloud packages, we can extract word frequencies and use them to visualize the content of 0.5-star reviews.
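The basic recipe is roughly the following (a simplified sketch; the package names are the real ones, but the cleanup steps and options here are illustrative rather than exactly what produced the clouds in this post):

```r
library(dplyr)
library(tm)
library(wordcloud)

# 0.5-star reviews of one film, using the reviews data frame from the scraping sketch
half_star <- reviews %>%
  filter(movie == "star_wars_the_last_jedi", rating == 0.5)

# Clean the text and count word frequencies
corpus <- VCorpus(VectorSource(half_star$text)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace)

tdm   <- TermDocumentMatrix(corpus)
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Draw the cloud, scaled by frequency
wordcloud(names(freqs), freqs, max.words = 100, random.order = FALSE)
```

Here is the resulting word cloud for the 0.5-star reviews of The Last Jedi: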
The core of the word cloud contains a lot of Star Wars-specific terms; it seems many people were not thrilled with how the character of Luke Skywalker was handled. The periphery, on the other hand, contains many more general terms, including some politically charged terms such as "SJW" ("social justice warrior", often used as a derogatory term for politically liberal activists with certain social concerns), "political" and "agenda". I mention these terms specifically, because we also see such terms (e.g. "SJW", "liberal" and "propaganda") in a word cloud for The Shape of Water, the movie in the data set with the second-highest disparity between critical success and user score (avg. user review rating of 3.1 stars, vs. 4.2 stars from critics).
To see whether this might indicate anything about lower-than-expected user ratings, let's calculate what I call the "delta score" for each movie in the data set, where delta is critic score minus user score:
Delta(film) = AverageCriticRating(film) - AverageUserReviewRating(film)
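Computing this per film is a small grouped summary of the review data joined against the average critic ratings; in this sketch the critic ratings are assumed to sit in a hand-entered lookup table (the values shown are the ones quoted in this post):

```r
library(dplyr)

# Assumed lookup table of average critic ratings (out of 5 stars)
critic_ratings <- tibble::tribble(
  ~movie,                    ~critic_rating,
  "star_wars_the_last_jedi", 4.1,
  "black_panther",           4.1,
  "the_shape_of_water",      4.2
  # ... and so on for the other films in the data set
)

deltas <- reviews %>%
  group_by(movie) %>%
  summarise(user_rating = mean(rating)) %>%
  left_join(critic_ratings, by = "movie") %>%
  mutate(delta = critic_rating - user_rating)
```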
Using this, we can separate movies into "positive delta" movies (movies that critics liked better than users on average) and "negative delta" movies (movies that users liked better than critics on average). Then, we can make a word cloud for all of the positive delta movies (movies like The Last Jedi, Black Panther and The Shape of Water) which excludes words that are also associated with negative delta movies. The following word cloud visualizes the relative frequency of words in 0.5-star reviews of positive delta movies which do not occur more than once in the set of 0.5-star reviews of negative delta movies. It also excludes words that are too specific to one film (a word must occur at least 4 times in reviews of other films) as well as words in the movie titles.
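The filtering itself boils down to two frequency tables and an anti-join. Here is a sketch: word_freqs() is a hypothetical stand-in for the tm pipeline above, returning a data frame with a word column and a count column n for a given set of reviews, and the per-film threshold and title-word exclusion are noted in comments rather than spelled out.

```r
library(dplyr)
library(wordcloud)

# Per-review data with each film's delta joined on (deltas from the earlier sketch)
reviews_delta <- left_join(reviews, deltas, by = "movie")

# word_freqs(): hypothetical helper wrapping the tm pipeline shown above
pos_freqs <- word_freqs(filter(reviews_delta, delta > 0,  rating == 0.5))
neg_freqs <- word_freqs(filter(reviews_delta, delta <= 0, rating == 0.5))

cloud_words <- pos_freqs %>%
  # drop words occurring more than once in 0.5-star reviews of negative delta movies
  anti_join(filter(neg_freqs, n > 1), by = "word")
  # (the full version also requires each word to appear at least 4 times in
  #  reviews of other films, and drops words appearing in the movie titles)

wordcloud(cloud_words$word, cloud_words$n, max.words = 100, random.order = FALSE)
```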
Again, we are only looking at the 0.5-star reviews—we're looking at what people who hated the movies hated about them. And we see from this that the negative delta movies, which users like better than critics on average, do not tend to evoke the politically charged words we are talking about. This makes sense if a primary source of user backlash is polarization—what could be more polarizing than matters of politics and social and cultural identity? Under this view, average user score will tend to undershoot average critic score in cases where a significant proportion of users have a visceral unfavorable reaction to the film for social/political/cultural reasons, and accordingly flood Rotten Tomatoes with 0.5-star reviews.
Indeed it is true that the number/proportion of 0.5-star reviews is highly correlated with the delta score. However, this is not as telling as we might like, because delta score and user score are themselves inversely correlated: the highest delta scores occur when user scores are low. Ideally, what we would want to probe is whether having a higher critic score correlates with having a higher number of 0.5-star reviews than we would expect given the average user rating. This would require modeling the probability distributions over user scores, and for that we would probably want a larger data set, with more movies represented.
For now, let's just run with the idea that the reviews that we see for high-delta movies are meaningfully different in their distributions of certain words than the reviews we see for low-delta movies.
The graph below plots delta score as a function of the proportion of reviews that contain the word "SJW". I've used the log of the proportion (with smoothing to prevent negative infinities for those movies where "SJW" is absent) to make the graph easier to look at, since the actual proportions are very small across the board, except for The Last Jedi, for which 2% of all reviews contain this word.
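The graph comes from a per-movie summary along these lines; the smoothing constant is arbitrary (anything small enough keeps the log finite for films with zero "SJW" mentions):

```r
library(dplyr)
library(ggplot2)

sjw <- reviews %>%
  group_by(movie) %>%
  summarise(prop_sjw = mean(grepl("\\bsjw\\b", tolower(text)))) %>%
  left_join(deltas, by = "movie") %>%            # deltas from the earlier sketch
  mutate(log_prop_sjw = log(prop_sjw + 1e-4))    # smoothing constant is arbitrary

ggplot(sjw, aes(x = log_prop_sjw, y = delta, label = movie)) +
  geom_point() +
  geom_text(vjust = -0.8, size = 3) +
  labs(x = "log proportion of reviews containing \"SJW\" (smoothed)",
       y = "Delta (average critic rating minus average user rating)")
```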
If we run some correlation tests, we find both a significant positive correlation between "SJW" and delta score and a significant negative correlation between "SJW" and average user score. This is to be expected, because the delta and user scores are inversely correlated, as discussed above. However, the correlation is stronger for delta score (r-squared = 0.57, p < 0.001) than it is for user score (r-squared = 0.51, p < 0.01).
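These are plain Pearson correlation tests on the per-movie summary; the sketch below runs them on the smoothed log proportion used in the plot (the raw proportion works the same way), with r-squared just being the squared correlation coefficient that cor.test reports.

```r
# Correlations of "SJW" usage with delta score and with average user rating
ct_delta <- cor.test(sjw$log_prop_sjw, sjw$delta)
ct_user  <- cor.test(sjw$log_prop_sjw, sjw$user_rating)

ct_delta$estimate^2; ct_delta$p.value   # r-squared and p-value vs. delta score
ct_user$estimate^2;  ct_user$p.value    # r-squared and p-value vs. user score
```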
This suggests that it may be possible to use occurrences of individual words such as "SJW" (and many others) in early user reviews as a predictor of when a film will have an unexpectedly low audience score on Rotten Tomatoes. To the extent that this is possible, we should see some degree of success with relatively simple machine learning techniques (e.g., techniques based on decision trees) that use word frequencies to predict differences between critic and user scores on Rotten Tomatoes. A final breakfast experiment (this one implemented in Python) provides a small bit of evidence for this.
I used Python's XGBoost package (which is also available for R, but Python was much easier for me, because I could adapt code I already had). XGBoost is a fast gradient tree boosting library. I trained two extremely simple models, both of which just use word counts as features. The first uses these features to predict an individual delta score for each user review, where delta in this case is a measure of how an individual user's review score differs from the average critic score for that movie:
Delta(review of film) = AverageCriticRating(film) - UserReviewRating(review of film)
The second model uses word counts to predict the score left by the user (0.5 to 5 stars), rather than delta. Both models were trained on the same sample of 80% of the data set, then tested on the other 20%. If the above discussion is on the right track, then delta should be at least as easy to predict as the user score itself, if not easier. In other words, where traditional sentiment analysis techniques would predict how many stars were given by a user by analyzing the text of the review, it might be easier to predict whether the user's rating is above or below the average rating of critics on Rotten Tomatoes.
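The full Python code is on the GitHub page linked above; as a rough illustration of the setup, here is what the same two models look like with the xgboost R package, using a plain document-term matrix as the word-count features. The parameters and preprocessing here are illustrative only, not what the Python models used.

```r
library(tm)
library(xgboost)

# reviews_delta: per-review data with each film's average critic rating joined on
# (see the earlier delta sketch); features are raw word counts per review
dtm <- DocumentTermMatrix(VCorpus(VectorSource(reviews_delta$text)),
                          control = list(tolower = TRUE, removePunctuation = TRUE))
X <- as.matrix(dtm)

# Targets for the two models
y_delta <- reviews_delta$critic_rating - reviews_delta$rating  # Model #1: per-review delta
y_user  <- reviews_delta$rating                                # Model #2: the user's own rating

# 80/20 train/test split
set.seed(1)
train <- sample(nrow(X), size = floor(0.8 * nrow(X)))

fit <- function(y) {
  xgboost(data = X[train, ], label = y[train],
          nrounds = 100, objective = "reg:squarederror", verbose = 0)
}
model_delta <- fit(y_delta)
model_user  <- fit(y_user)

# R-squared on the held-out 20%
r_squared <- function(model, y) {
  pred <- predict(model, X[-train, ])
  1 - sum((y[-train] - pred)^2) / sum((y[-train] - mean(y[-train]))^2)
}
r_squared(model_delta, y_delta)
r_squared(model_user,  y_user)
```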
I did minimal tinkering with these models, used no syntactic or semantic features, and the data set is not huge, so we should not expect anything industrial-strength; however, we can compare the two models simply to make a point. Model #1 is a regression model for learning delta scores from word counts, and on the test set we get an r-squared of 0.15. That's not a great r-squared value, but it is positive, which means the regression is doing some prediction. Here is a graphical representation of the fit of the model, made with matplotlib:
Noisy, but we can imagine that with more features and more data, we could actually predict this somewhat reliably. Moreover, it might make more sense to predict ranges of delta values rather than exact numbers. In any case, Model #2, which is identical except that it tries to predict user scores, fares a bit worse, with an r-squared of 0.06, despite the range of possible predictions being smaller (0.5 to 5 stars rather than -5 to 5 for delta). Here's the prediction-reality graph for that one:
This is purely suggestive at this point, but it hints that we could know from early user reviews how well a movie will end up doing with Rotten Tomatoes audiences. Critic reviews are typically available first, because critics get pre-screenings for most movies, so we could try to predict overall user reaction to a film by looking at the first handful of user reviews that come in, estimating the average delta score for that film, and then subtracting it from the average critic score. This might be particularly useful given that many users leave text reviews but do not leave a numerical score. And if nothing else, analyzing user review text can tell us more about why users sometimes disagree with critics on Rotten Tomatoes.