Predicting IMDb Ratings

November 01, 2016
Jocelyn Ong

What affects the rating of a movie on IMDb?

Introduction

This week we’re working with movie data! We’ll be getting information from IMDb with some help from the OMDb API.

IMDb is now the world’s most popular and authoritative source for movie, TV and celebrity content. We offer a searchable database of more than 185 million data items including more than 3.5 million movies, TV and entertainment programs and 7 million cast and crew members. (source: IMDb)

We’ll be looking at IMDb ratings and some of the things that affect them. Brainstorming before we actually obtain the data, some possible factors are:

Who stars in the movie
Who directed the movie
Length of the movie
What genre(s) is the movie

TL;DR

We scraped data from IMDb with some help from the OMDb API, turned some of the text into features, and tried to predict actual ratings (failed) and whether ratings would be higher than 8.5 (didn’t do too badly). We found that the director of a movie has a pretty big impact on whether its rating will be higher than 8.5 on IMDb.

About the data

How we obtained our data

Hello, BeautifulSoup! No, I didn’t use much of it this time around. (I tried, but everything took too long to run. Maybe I’ll try it again when I have more time.)

We had a lot of guidance for obtaining the data this week in our labs. Each movie’s page on IMDb can be accessed by a unique IMDb ID that starts with “tt” followed by seven or eight digits. Using a combination of the requests and re libraries, we pulled the movie IDs of the top 250 rated movies.

import re
import requests

r2 = requests.get("http://www.imdb.com/chart/top")
# use r2.text (a decoded string) so the pattern also works on Python 3
id_list = re.findall(r"tt[0-9]{7,8}", r2.text)

# set removes duplicates
# change it back to a list so we can iterate through it later
id_list = list(set(id_list))
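To see what the pattern picks up without hitting the live chart page, here is a minimal sketch run against a made-up HTML snippet. The snippet and the helper name extract_ids are illustrative and not from the original notebook. Unlike set(), this version also keeps the IDs in the order they first appear on the page.

```python
import re

# Hypothetical stand-in for the HTML of the Top 250 page
sample_html = (
    '<a href="/title/tt0111161/">The Shawshank Redemption</a>'
    '<a href="/title/tt0068646/">The Godfather</a>'
    '<a href="/title/tt0111161/">poster link to the same movie</a>'
)

def extract_ids(html):
    """Return unique IMDb IDs in first-seen order."""
    seen = set()
    ordered = []
    for movie_id in re.findall(r"tt[0-9]{7,8}", html):
        if movie_id not in seen:
            seen.add(movie_id)
            ordered.append(movie_id)
    return ordered

ids = extract_ids(sample_html)
print(ids)
```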

With our IDs in hand, we used the OMDb API to pull some basic data on each movie.

import json
import pandas as pd

# OMDb API URL
api_url = "http://www.omdbapi.com/?i={}&plot=full&r=json"

# Create a function to get the data for each movie
# into a format we can work with
def get_content(id_num):
    r = requests.get(api_url.format(id_num))
    tmp = json.loads(r.text)
    return tmp

# Populate our data
df = pd.DataFrame([get_content(i) for i in id_list])
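Pulling 250 records one at a time can fail part-way through, so a slightly more defensive loop may be worth having. The sketch below is only an illustration: collect_records and fake_fetch are hypothetical names, and it assumes that a failed OMDb lookup comes back as JSON with "Response" set to "False", which is worth checking against the API's current behaviour. The fetch function is passed in as a parameter so the example runs without a network connection.

```python
import time

def collect_records(id_list, fetch, pause=0.0):
    """Call fetch(id) for each ID, keeping successes and failures apart."""
    records, failed = [], []
    for movie_id in id_list:
        try:
            data = fetch(movie_id)
        except Exception:  # network errors, bad JSON, ...
            failed.append(movie_id)
            continue
        # Assumption: a failed lookup comes back with "Response": "False"
        if data.get("Response") == "False":
            failed.append(movie_id)
        else:
            records.append(data)
        time.sleep(pause)  # be polite to the API
    return records, failed

# Stand-in for get_content so the sketch runs offline
def fake_fetch(movie_id):
    if movie_id == "tt0000000":
        return {"Response": "False", "Error": "not found"}
    return {"Response": "True", "imdbID": movie_id}

records, failed = collect_records(["tt0111161", "tt0000000"], fake_fetch)
print(len(records), failed)
```

With the real function, something like collect_records(id_list, get_content, pause=0.1) followed by pd.DataFrame(records) could stand in for the list comprehension, and the failed list tells you which IDs to retry.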

Here’s the data we pulled using the API.

We were then asked to pull data on gross earnings of each movie as well. (Here’s where I used BeautifulSoup.)

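import bs4
import numpy as np

# Create a function to pull Gross Earnings information
# and return np.nan if it's not available
def get_gross(id_num):
    r = requests.get("http://www.imdb.com/title/{}/".format(id_num))
    soup = bs4.BeautifulSoup(r.text, "lxml")
    try:
        for i in soup.findAll("div", class_="txt-block"):
            for j in i.findAll("h4", class_="inline"):
                if "Gross" in j.text:
                    text = j.parent.text.split()[1]
                    num = text.replace(",", "").strip("$")
                    return float(num)
    except (IndexError, ValueError):
        # the Gross block was there but not in the expected format
        pass
    # no Gross block found, or it could not be parsed
    # (without this line the function would return None, not np.nan)
    return np.nan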

# Create a column in the dataframe for Gross Earnings
df["Gross_earnings"] = df["imdbID"].map(get_gross)
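The string clean-up inside get_gross can be pulled out and checked on its own, without any scraping. This is a small sketch, assuming the scraped values look like "$28,341,469"; parse_gross is a hypothetical helper name.

```python
import math

def parse_gross(text):
    """Turn a string such as '$28,341,469' into a float, or NaN if that fails."""
    cleaned = text.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return float("nan")

print(parse_gross("$28,341,469"))
print(math.isnan(parse_gross("N/A")))
```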

Description of the data (data dictionary)

Column	Data type	Description
Actors	String	List of top-billed actors
Awards	String	List of awards won or nominations
Country	String	Countries where the movie was shown
Director	String	Director(s) of the movie
Genre	String	Genre(s) of the movie
Language	String	Language(s) the movie is available in
Metascore	String	Score from metacritic.com
Plot	String	Summary of the movie
Poster	String	URL for the movie poster
Rated	String	Viewer advisory rating for the movie
Released	String	Release date for the movie
Runtime	String	Length of the movie in minutes
Title	String	Movie title
Writer	String	Writers for the movie
Year	String	Year in which the movie was released
imdbID	String	IMDb ID of the movie
imdbRating	String	IMDb rating of the movie
imdbVotes	String	Number of votes received
Gross_earnings	Float	Gross earnings of the movie

Data munging/data wrangling

As you can see, most of our columns are strings, which is not something we can use in our models (not that I’m aware of at least), so we’ll have to clean and work on our data so that everything is numerical.

Columns like Year, imdbRating etc. should have been either integers or floats, and we converted them using df["col_name"] = df["col_name"].astype(float).
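A plain astype(float) raises an error on values such as "142 min" or "N/A", so some columns need a little cleaning first. Here is a minimal sketch on made-up rows, assuming the API returns runtimes with a " min" suffix, vote counts with thousands separators, and "N/A" for missing values.

```python
import pandas as pd

# Made-up rows in the format the OMDb API is assumed to return
toy = pd.DataFrame({
    "Runtime": ["142 min", "175 min", "N/A"],
    "imdbVotes": ["1,000,000", "500,000", "N/A"],
    "imdbRating": ["9.3", "9.2", "N/A"],
})

# Strip units and thousands separators, then let pandas coerce
# anything non-numeric (such as "N/A") to NaN
toy["Runtime"] = pd.to_numeric(
    toy["Runtime"].str.replace(" min", "", regex=False), errors="coerce")
toy["imdbVotes"] = pd.to_numeric(
    toy["imdbVotes"].str.replace(",", "", regex=False), errors="coerce")
toy["imdbRating"] = pd.to_numeric(toy["imdbRating"], errors="coerce")

print(toy.dtypes)
```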

We also want to convert some of the text columns to dummies so that we can use them in our model.

Oscars won

def oscars_won(i):
    try:
        i_list = i.split()
        i_list = [x.strip(".") for x in i_list]
        if "Oscars" in i_list and i_list[i_list.index("Oscars")-2] == "Won":
            return float(i_list[i_list.index("Oscars")-1])
        elif "Oscar" in i_list and i_list[i_list.index("Oscar")-2] == "Won":
            return float(i_list[i_list.index("Oscar")-1])
        else:
            return 0.0
    except (AttributeError, IndexError, ValueError):
        # not a string (e.g. NaN), or the text is not in the expected format
        return 0.0
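The same idea can be written more compactly with a regular expression. This is a sketch rather than the code used for the project, and it assumes the Awards text follows the pattern "Won 3 Oscars. Another ..."; oscars_won_re is a hypothetical name.

```python
import re

def oscars_won_re(awards):
    """Return the number of Oscars won according to an awards string."""
    match = re.search(r"Won (\d+) Oscars?", str(awards))
    return float(match.group(1)) if match else 0.0

print(oscars_won_re("Won 3 Oscars. Another 10 wins & 5 nominations."))
print(oscars_won_re("Nominated for 7 Oscars. Another 19 wins."))
```

Nominations are deliberately not counted, which matches the check for "Won" in the function above.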

Languages, countries, actors, directors

# Let's consider the language the movie is available in
# For model simplicity, we'll consider just the top 10 languages
all_languages = []
for i in df2["Language"]:
  lang_list = str(i).split(",")
  all_languages.extend([j.strip() for j in lang_list])
top_10_languages = [i[0] for i in Counter(all_languages).most_common(10)]
for i in top_10_languages:
  df2["Language_"+i] = df2["Language"].map(lambda x: 1 if i in str(x) else 0)
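One thing to watch with the substring test (i in str(x)) is that it would also match a name that happens to be contained in a longer one. A possible alternative, sketched below on made-up data, splits each cell into exact tokens first. The helper name top_n_dummies is hypothetical, and the same function could be pointed at the country, actor, or director columns.

```python
from collections import Counter
import pandas as pd

def top_n_dummies(series, n, prefix):
    """One 0/1 column per top-n value in a comma-separated string column."""
    tokens = series.fillna("").map(
        lambda s: [t.strip() for t in str(s).split(",") if t.strip()])
    counts = Counter(t for row in tokens for t in row)
    top = [name for name, _ in counts.most_common(n)]
    return pd.DataFrame(
        {prefix + name: tokens.map(lambda row, name=name: int(name in row))
         for name in top},
        index=series.index)

# Made-up values standing in for the Language column
toy = pd.Series(["English, French", "English", "Japanese, English", "French", None])
dummies = top_n_dummies(toy, 2, "Language_")
print(dummies)
```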

Using TfidfVectorizer, we also generated word features from our plot summaries.

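from sklearn import feature_extraction

tfidf = feature_extraction.text.TfidfVectorizer(stop_words="english", ngram_range=(1,1), max_features=1000)
# on scikit-learn 1.0 and later, use tfidf.get_feature_names_out() instead of get_feature_names()
plot_df = pd.DataFrame(tfidf.fit_transform(df6["Plot"]).toarray(), columns=tfidf.get_feature_names())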
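For a feel of what comes out of the vectorizer, here is a small self-contained sketch on three made-up one-line plots. The sentences are invented for illustration, and the hasattr check is only there so the example runs on both older and newer scikit-learn releases.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up plot summaries
plots = pd.Series([
    "A banker is sent to prison",
    "A family runs a crime empire",
    "A hobbit leaves home on a quest",
])

tfidf = TfidfVectorizer(stop_words="english", max_features=1000)
matrix = tfidf.fit_transform(plots)

# get_feature_names() was replaced by get_feature_names_out() in newer
# scikit-learn releases, so use whichever one is available
if hasattr(tfidf, "get_feature_names_out"):
    names = list(tfidf.get_feature_names_out())
else:
    names = list(tfidf.get_feature_names())

# One row per plot, one column per word that survived stop-word removal
plot_features = pd.DataFrame(matrix.toarray(), columns=names)
print(plot_features.shape)
```

Each row is scaled to unit length by default, so the values are weights relative to the other words in the same plot rather than raw counts.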

Visualizations

Before we get to modeling, we’ll take a look at the relationships within our data.

Modeling: generating the model and drawing conclusions

So I tried both regression (predicting actual ratings) and classification (predicting whether a movie will have a rating higher than 8.5) with this dataset. I wanted to test several models, so I created a function to evaluate them:

import matplotlib.pyplot as plt
from sklearn import metrics, model_selection

def fit_model(model, name, X_train, X_test, y_train, y_test, mtype="r"):
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    score = model.score(X_test, y_test)
    print("{} Score: {:.2f}".format(name, score))

    if mtype=="r":
        print("{} MSE: {:.2f}".format(name, metrics.mean_squared_error(y_test, y_pred)))

    return model, y_pred

def evaluate_model(model, name, mtype="r", X=df4.iloc[:, 8:]):
    print(name)
    if mtype=="r":
        # set X and y
        y = df4["imdbRating"]
        X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=77)

        # fit the model
        model, y_pred = fit_model(model, name, X_train, X_test, y_train, y_test, mtype)

        # Plotting
        plt.plot([min(y_test), max(y_test)],[min(y_test), max(y_test)],'-');
        plt.scatter(y_pred, y_test);
        plt.title("{}\nActual and predicted ratings".format(name));
        plt.xlabel("Predicted");
        plt.ylabel("Actual");
        plt.show();
        return model

    elif mtype=="c":
        # set X and y
        y_class = df4["rating_higher_8.5"]
        X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y_class, stratify=y_class,
                                                                      test_size=0.3, random_state=77)

        # fit the model
        model, y_pred = fit_model(model, name, X_train, X_test, y_train, y_test, mtype)

        # confusion matrix
        conmat = metrics.confusion_matrix(y_test, y_pred)
        conmat = pd.DataFrame(conmat)
        print(conmat)

        # classification report
        print(metrics.classification_report(y_test, y_pred))

        # plotting
        try:
            y_score = model.decision_function(X_test)
        except AttributeError:
            # models without decision_function (e.g. tree-based ones)
            y_score = model.predict_proba(X_test)[:,1]

        fpr, tpr, thresholds = metrics.roc_curve(y_test, y_score)
        plt.plot(fpr,tpr);
        plt.title("{}\nROC curve".format(name));
        plt.ylim((-0.1, 1.1));
        plt.xlim((-0.1, 1.1));
        plt.xlabel("FPR");
        plt.ylabel("TPR");
        plt.text(0.8,0.05,"AUC: {:0.2f}".format(metrics.roc_auc_score(y_test, y_score)));
        plt.show();

        # y is not defined in this branch; the classification target is y_class
        return model, conmat, X, y_class
    else:
        return "Wrong type"

Here are a couple of the plots from the above:

We scored each model with the functions above. Regression was terrible, while classification did quite well, with an average accuracy of about 80%.
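An accuracy figure is easier to judge next to the share of the majority class, since a model that always predicts the most common label already gets that much right. The sketch below shows how the binary target and that baseline could be computed. The ratings are made up for illustration and are not the project data.

```python
import pandas as pd

# Made-up ratings standing in for df4["imdbRating"]
ratings = pd.Series([8.1, 8.3, 8.6, 9.0, 8.2, 8.4, 8.0, 8.7, 8.5, 8.3])

# 1 if the rating is strictly higher than 8.5, else 0
target = (ratings > 8.5).astype(int)

# Accuracy of always predicting the most common class
baseline_accuracy = target.value_counts(normalize=True).max()
print("Baseline accuracy: {:.2f}".format(baseline_accuracy))
```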

We also looked at the coefficients of our classification model:

One of the things that I noted (before looking at the coefficients) was that:

Actors don’t matter
Plot summary doesn’t matter
DIRECTORS MATTER!

I happened to run my models a few times (without text features, with actors and plot summary, and with directors) and the only improvement in scores came when directors were added in.
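That kind of comparison can be made a bit more systematic by scoring the same model on different groups of columns with cross-validation. The sketch below runs on synthetic data in which only the director column is built to carry signal, so it illustrates the procedure and does not reproduce the results above. The column names and feature groups are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(77)
n = 200

# Synthetic stand-ins for the real feature groups: only the "director"
# column is built to carry signal about the target
features = pd.DataFrame({
    "Runtime": rng.normal(120, 20, n),
    "Actor_A": rng.randint(0, 2, n),
    "Director_A": rng.randint(0, 2, n),
})
# Target follows Director_A, with roughly 10% of the labels flipped
noise = rng.rand(n) < 0.1
target = np.where(noise, 1 - features["Director_A"], features["Director_A"])

feature_sets = {
    "base": ["Runtime"],
    "base + actors": ["Runtime", "Actor_A"],
    "base + directors": ["Runtime", "Director_A"],
}

scores = {}
for label, cols in feature_sets.items():
    model = LogisticRegression(max_iter=1000)
    # Mean accuracy over 5 cross-validation folds
    scores[label] = cross_val_score(model, features[cols], target, cv=5).mean()
    print("{}: {:.2f}".format(label, scores[label]))
```

Averaging over folds makes it less likely that an apparent improvement is just down to one lucky train/test split, which matters with only 250 movies.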

Next Steps

I wish we could have trained our model on a bigger and more varied training set. (Recall: we only worked on the all-time top 250 movies on IMDb for this project.) I did try to get more information - I managed to get more movie IDs but it took too long to pull some of the other information. Maybe I’ll try it out when I get a more powerful machine.

Until Next Time

I hope you enjoyed reading about movies (I’d rather just watch them), and I do hope I’ll have time to work on those next steps! Meanwhile, take a look at the detailed code for this project here.