INTRODUCTION¶

This project analyzes several IMDb datasets, presenting summary statistics, distributions, correlations, and more. The raw source files are imported from IMDb Data Files, and documentation for them can be found at IMDb Non-Commercial Datasets. I learned about these files from the resources page of our class book, Data Science 1 by Toby Driscoll.

Here is the code to import our libraries and the raw datasets. The only datasets we will use are title.basics.tsv, title.crew.tsv, title.ratings.tsv, and name.basics.tsv. The raw data takes a long time to import; in preprocessing, we will pare it down so that later steps run quickly.

(NOTE: the columns of title_basics are all read in as strings to avoid dtype errors. We will correct the types later.)
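
Since the remote TSV files are large, one optional speed-up (a sketch, not used in the rest of this notebook) is to cache the first download locally with pandas' pickle round-trip, so re-running the import is fast; the file name title_basics.pkl below is purely illustrative.

In [ ]:
import os
import pandas as pd

CACHE = "title_basics.pkl"
if os.path.exists(CACHE):
    # Reload the cached copy instead of re-downloading the full TSV
    title_basics = pd.read_pickle(CACHE)
else:
    title_basics = pd.read_csv("https://datasets.imdbws.com/title.basics.tsv.gz",
                               delimiter="\t", dtype=str)
    title_basics.to_pickle(CACHE)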

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
In [2]:
title_basics = pd.read_csv("https://datasets.imdbws.com/title.basics.tsv.gz", delimiter="\t", 
                           dtype={"tconst": str, "titleType": str, "primaryTitle": str, "originalTitle": str, 
                                  "isAdult": str, "startYear": str, "endYear": str, "runtimeMinutes": str, 
                                  "genres": str})
title_basics.head(3)
Out[2]:
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
0 tt0000001 short Carmencita Carmencita 0 1894 \N 1 Documentary,Short
1 tt0000002 short Le clown et ses chiens Le clown et ses chiens 0 1892 \N 5 Animation,Short
2 tt0000003 short Pauvre Pierrot Pauvre Pierrot 0 1892 \N 4 Animation,Comedy,Romance
In [3]:
title_crew = pd.read_csv("https://datasets.imdbws.com/title.crew.tsv.gz", delimiter="\t")
title_crew.head(3)
Out[3]:
tconst directors writers
0 tt0000001 nm0005690 \N
1 tt0000002 nm0721526 \N
2 tt0000003 nm0721526 \N
In [4]:
title_ratings = pd.read_csv("https://datasets.imdbws.com/title.ratings.tsv.gz", delimiter="\t")
title_ratings.head(3)
Out[4]:
tconst averageRating numVotes
0 tt0000001 5.7 2036
1 tt0000002 5.7 272
2 tt0000003 6.5 1986
In [5]:
name_basics = pd.read_csv("https://datasets.imdbws.com/name.basics.tsv.gz", delimiter="\t")
name_basics.head(3)
Out[5]:
nconst primaryName birthYear deathYear primaryProfession knownForTitles
0 nm0000001 Fred Astaire 1899 1987 actor,miscellaneous,producer tt0072308,tt0050419,tt0053137,tt0027125
1 nm0000002 Lauren Bacall 1924 2014 actress,soundtrack,archive_footage tt0037382,tt0075213,tt0117057,tt0038355
2 nm0000003 Brigitte Bardot 1934 \N actress,music_department,producer tt0057345,tt0049189,tt0056404,tt0054452

PREPROCESSING¶

Here are the steps we will take to preprocess our data. In the interest of being concise, the datasets won't be displayed until they are merged.

1. Drop Unnecessary Rows and Columns¶

In this project, we will only analyze movies, no TV shows or shorts. So, we will drop the rows of title_basics that don't have the titleType "movie". We will also drop columns from all the dataframes that are unnecessary when analyzing movies.

In [6]:
movie_basics = title_basics[title_basics.titleType == "movie"]
movie_basics = movie_basics.drop(columns=["titleType", "originalTitle", "isAdult", "endYear"])
movie_crew = title_crew.drop(columns=["writers"])
movie_ratings = title_ratings
names = name_basics.drop(columns=["birthYear", "deathYear", "primaryProfession", "knownForTitles"])

2. Replace and Remove Missing Values with NaN¶

Before any further preprocessing, we must replace and remove the rows with missing values. This will allow us to specify types.

In [7]:
movie_basics = movie_basics.replace("\\N", np.nan)
movie_basics = movie_basics.dropna()
movie_crew = movie_crew.replace("\\N", np.nan)
movie_crew = movie_crew.dropna()
movie_ratings = movie_ratings.replace("\\N", np.nan)
movie_ratings = movie_ratings.dropna()
names = names.replace("\\N", np.nan)
names = names.dropna()

3. Specify Types¶

Next, we must specify the types of some columns. This is done so we can merge the dataframes properly.

In [8]:
movie_basics["startYear"] = movie_basics["startYear"].astype(int)
movie_basics["runtimeMinutes"] = movie_basics["runtimeMinutes"].astype(int)
movie_crew["directors"] = movie_crew["directors"].str.split(",")
movie_ratings["averageRating"] = movie_ratings["averageRating"].astype(float)
movie_ratings["numVotes"] = movie_ratings["numVotes"].astype(int)

4. Merge the Dataframes¶

Now we will merge the four dataframes into all_movies. The movie dataframes merge easily on "tconst"; the names dataframe, however, requires an extra step. We create a helper frame, "directors_names", that explodes the "directors" column, merges with the names dataframe, and is then grouped back by "tconst". The helper frame is then merged back into all_movies. This simply matches each list of "nconst" IDs with the list of the directors' actual names.

In [9]:
all_movies = pd.merge(movie_basics, movie_crew, on="tconst")
all_movies = pd.merge(all_movies, movie_ratings, on="tconst")
# Create helper variable to match the lists of nconsts with lists of names
directors_names = all_movies[["tconst", "directors"]]
directors_names = directors_names.explode("directors")
directors_names = pd.merge(directors_names, names, left_on="directors", right_on="nconst")
directors_names = directors_names.groupby("tconst").agg({"directors": list, "primaryName": list}).reset_index()
# Merge the helper variable back into the main dataframe
all_movies = pd.merge(all_movies, directors_names, on="tconst")
all_movies.head()
Out[9]:
tconst primaryTitle startYear runtimeMinutes genres directors_x averageRating numVotes directors_y primaryName
0 tt0000009 Miss Jerry 1894 45 Romance [nm0085156] 5.3 209 [nm0085156] [Alexander Black]
1 tt0000147 The Corbett-Fitzsimmons Fight 1897 100 Documentary,News,Sport [nm0714557] 5.2 506 [nm0714557] [Enoch J. Rector]
2 tt0000574 The Story of the Kelly Gang 1906 70 Action,Adventure,Biography [nm0846879] 6.0 876 [nm0846879] [Charles Tait]
3 tt0000591 The Prodigal Son 1907 90 Drama [nm0141150] 5.5 23 [nm0141150] [Michel Carré]
4 tt0000679 The Fairylogue and Radio-Plays 1908 120 Adventure,Fantasy [nm0091767, nm0877783] 5.2 71 [nm0091767, nm0877783] [Francis Boggs, Otis Turner]

5. Polish and Reorder the Frame¶

For this project, we are only going to analyze the top 1000 highest rated action, comedy, or drama movies that have a minimum number of 50,000 votes. Here are the steps taken to polish the frame:

  • Drop the rows in which "numVotes" is less than 50,000.
  • Drop any rows whose genres do not include Action, Comedy, or Drama, and keep only a single one of those genres per movie so that the column becomes categorical. We will use iteration for this.
  • Order the data according to the "averageRating" column.
  • Create a "rank" column and set it as the index.
  • Create a column called "totalRating" that multiplies the averageRating and the numVotes.
  • Delete, alter, and move some of our columns to polish our finished set of data.
In [10]:
# Work on a copy so that modifying the slice below doesn't trigger chained-assignment warnings
movies = all_movies[all_movies["numVotes"] >= 50000].copy()
movies = movies[movies["genres"].str.contains("Action") | movies["genres"].str.contains("Comedy") | movies["genres"].str.contains("Drama")]
# Collapse each genre string to a single category (Comedy takes priority, then Action, then Drama)
for index, movie in movies.iterrows():
    if "Comedy" in movie["genres"]:
        movies.at[index, "genres"] = "Comedy"
    elif "Action" in movie["genres"]:
        movies.at[index, "genres"] = "Action"
    elif "Drama" in movie["genres"]:
        movies.at[index, "genres"] = "Drama"
movies = movies.sort_values(by="averageRating", ascending=False)
movies = movies.iloc[:1000]
movies["rank"] = range(1, len(movies) + 1)
movies.index = movies["rank"]
movies["totalRating"] = movies["averageRating"] * movies["numVotes"]
movies = movies.drop(columns=["tconst", "directors_x", "directors_y", "rank"])
movies = movies.rename(columns={"primaryTitle": "title", "startYear": "year", "runtimeMinutes": "runtime", 
                                "primaryName": "director[s]", "genres": "genre"})
movies["director[s]"] = movies["director[s]"].str.join(",")
movies = movies[["title", "year", "runtime", "director[s]", "genre", "averageRating", "numVotes", "totalRating"]]
movies.head()
Out[10]:
title year runtime director[s] genre averageRating numVotes totalRating
rank
1 The Shawshank Redemption 1994 142 Frank Darabont Drama 9.3 2874036 26728534.8
2 The Godfather 1972 175 Francis Ford Coppola Drama 9.2 2001828 18416817.6
3 12th Fail 2023 147 Vidhu Vinod Chopra Drama 9.0 106078 954702.0
4 The Lord of the Rings: The Return of the King 2003 201 Peter Jackson Action 9.0 1969425 17724825.0
5 The Godfather Part II 1974 202 Francis Ford Coppola Drama 9.0 1358038 12222342.0

6. Determine Outliers¶

Our last preprocessing step is an outlier analysis of the quantitative columns. An outlier is any value that falls outside the 1.5·IQR rule. The IQRs are computed all at once, but the outliers are identified separately for each of the five quantitative columns.

In [11]:
quantitative_columns = movies[["year", "runtime", "averageRating", "numVotes", "totalRating"]]
Q1 = quantitative_columns.quantile(0.25)
Q3 = quantitative_columns.quantile(0.75)
IQR = Q3 - Q1
print("Here is the IQR for each column:")
print(IQR)
Here is the IQR for each column:
year                  25.000
runtime               34.000
averageRating          0.500
numVotes          359853.500
totalRating      2835101.025
dtype: float64
In [12]:
year_outliers = (quantitative_columns["year"] < (Q1["year"] - 1.5 * IQR["year"])) | (quantitative_columns["year"] > (Q3["year"] + 1.5 * IQR["year"]))
print("Here are the year outliers in the dataset:")
movies[year_outliers].head(3)
Here are the year outliers in the dataset:
Out[12]:
title year runtime director[s] genre averageRating numVotes totalRating
rank
26 It's a Wonderful Life 1946 130 Frank Capra Drama 8.6 497496 4278465.6
44 Modern Times 1936 87 Charles Chaplin Comedy 8.5 258403 2196425.5
49 City Lights 1931 87 Charles Chaplin Comedy 8.5 195320 1660220.0
In [13]:
runtime_outliers = (quantitative_columns["runtime"] < (Q1["runtime"] - 1.5 * IQR["runtime"])) | (quantitative_columns["runtime"] > (Q3["runtime"] + 1.5 * IQR["runtime"]))
print("Here are the runtime outliers in the dataset:")
movies[runtime_outliers].head(3)
Here are the runtime outliers in the dataset:
Out[13]:
title year runtime director[s] genre averageRating numVotes totalRating
rank
4 The Lord of the Rings: The Return of the King 2003 201 Peter Jackson Action 9.0 1969425 17724825.0
5 The Godfather Part II 1974 202 Francis Ford Coppola Drama 9.0 1358038 12222342.0
8 Schindler's List 1993 195 Steven Spielberg Drama 9.0 1443603 12992427.0
In [14]:
averageRating_outliers = (quantitative_columns["averageRating"] < (Q1["averageRating"] - 1.5 * IQR["averageRating"])) | (quantitative_columns["averageRating"] > (Q3["averageRating"] + 1.5 * IQR["averageRating"]))
print("Here are the averageRating outliers in the dataset:")
movies[averageRating_outliers].head(3)
Here are the averageRating outliers in the dataset:
Out[14]:
title year runtime director[s] genre averageRating numVotes totalRating
rank
1 The Shawshank Redemption 1994 142 Frank Darabont Drama 9.3 2874036 26728534.8
2 The Godfather 1972 175 Francis Ford Coppola Drama 9.2 2001828 18416817.6
3 12th Fail 2023 147 Vidhu Vinod Chopra Drama 9.0 106078 954702.0
In [15]:
numVotes_outliers = (quantitative_columns["numVotes"] < (Q1["numVotes"] - 1.5 * IQR["numVotes"])) | (quantitative_columns["numVotes"] > (Q3["numVotes"] + 1.5 * IQR["numVotes"]))
print("Here are the numVotes outliers in the dataset:")
movies[numVotes_outliers].head(3)
Here are the numVotes outliers in the dataset:
Out[15]:
title year runtime director[s] genre averageRating numVotes totalRating
rank
1 The Shawshank Redemption 1994 142 Frank Darabont Drama 9.3 2874036 26728534.8
2 The Godfather 1972 175 Francis Ford Coppola Drama 9.2 2001828 18416817.6
4 The Lord of the Rings: The Return of the King 2003 201 Peter Jackson Action 9.0 1969425 17724825.0
In [16]:
totalRating_outliers = (quantitative_columns["totalRating"] < (Q1["totalRating"] - 1.5 * IQR["totalRating"])) | (quantitative_columns["totalRating"] > (Q3["totalRating"] + 1.5 * IQR["totalRating"]))
print("Here are the totalRating outliers in the dataset:")
movies[totalRating_outliers].head(3)
Here are the totalRating outliers in the dataset:
Out[16]:
title year runtime director[s] genre averageRating numVotes totalRating
rank
1 The Shawshank Redemption 1994 142 Frank Darabont Drama 9.3 2874036 26728534.8
2 The Godfather 1972 175 Francis Ford Coppola Drama 9.2 2001828 18416817.6
4 The Lord of the Rings: The Return of the King 2003 201 Peter Jackson Action 9.0 1969425 17724825.0

Our outlier analysis is quite revealing. Notice that the year outliers are all older movies; this is because the majority of the top 1000 movies are relatively recent.

The runtime outliers are all long movies. This makes sense: there is a practical lower bound on how short a feature film can be, but no comparable upper bound on runtime.

The averageRating outliers are perhaps the most interesting, since they are exactly the top-ranked movies. The top movies rate well above the rest of the dataset, where most ratings sit between roughly 7.5 and 8.0.

Notice how the numVotes and totalRating outlier sets are mostly the same. This is because totalRating is computed directly from numVotes. Both sets are also dominated by the top ranks, since the better a movie is, the more people watch it and vote on it (an idea explored further later).
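
As a quick numeric check (a sketch, not part of the original analysis), we can rebuild all five outlier masks with one small helper and count how many movies are outliers in both numVotes and totalRating:

In [ ]:
# Helper: 1.5*IQR outlier mask for one column, using the Q1, Q3, and IQR computed above
def iqr_outliers(col):
    low = Q1[col] - 1.5 * IQR[col]
    high = Q3[col] + 1.5 * IQR[col]
    return (quantitative_columns[col] < low) | (quantitative_columns[col] > high)

masks = {col: iqr_outliers(col) for col in quantitative_columns.columns}
shared = (masks["numVotes"] & masks["totalRating"]).sum()
print(shared, "movies are outliers in both numVotes and totalRating")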

SUMMARY DATA ANALYSIS¶

Now that our "movies" dataframe is preprocessed, it is ready for analysis. In our summary analysis, we will identify summary statistics like means, medians, standard deviations, and z-scores. Then, we will showcase several distributions, including ECDFs, histograms, facet plots, box plots, and violin plots. Finally, we will explore correlations using line plots, scatter plots, and correlation coefficients.

1. Summary Statistics¶

The basic summary statistics are means, medians, and standard deviations. We will compute them (along with the mins and maxes) using the describe method and then interpret them:

In [17]:
movies_statistics = movies.describe()
movies_statistics = movies_statistics.loc[["mean", "min", "50%", "max", "std"]]
movies_statistics = movies_statistics.rename(index={"50%": "median"})
movies_statistics = movies_statistics.round(2)
print(movies_statistics)
    
           year  runtime  averageRating    numVotes  totalRating
mean    1996.81   125.16           7.88   341631.07   2756261.85
min     1921.00    45.00           7.50    50054.00    379132.50
median  2003.00   122.00           7.80   191176.50   1484413.90
max     2024.00   321.00           9.30  2874036.00  26728534.80
std       21.29    27.94           0.32   387117.15   3303533.92

These summary statistics show us a lot about the data.

Notice how the mean year is much closer to the max year than to the min year, further evidence that most movies on this list are recent.

Similarly, the mean averageRating is much closer to the min rating than to the max rating, further showing that most movies here hover around an averageRating of 7.5 to 8.0, with the top-ranked movies as outliers.

Notice how totalRating is still roughly equal to averageRating × numVotes for the mean, min, median, and max rows, but not for the std row: the standard deviation of a product is not the product of the standard deviations.

The stds are, unsurprisingly, much larger for numVotes and totalRating. Unlike year and runtime, these values are not confined to a span of roughly 100 to 300 units; they range from 50,000 into the millions.
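
As a quick sanity check (a sketch, not in the original notebook), we can compare the product of the column means with the mean of totalRating, and the product of the column stds with the std of totalRating:

In [ ]:
# The means multiply out to roughly the mean of totalRating...
print(movies["averageRating"].mean() * movies["numVotes"].mean(), movies["totalRating"].mean())
# ...but the stds do not multiply out to the std of totalRating
print(movies["averageRating"].std() * movies["numVotes"].std(), movies["totalRating"].std())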

The z-scores are a separate summary statistic. They are computed and interpreted here:

In [18]:
means = quantitative_columns.mean()
stds = quantitative_columns.std()
z_scores = (quantitative_columns - means) / stds
print(z_scores)
          year   runtime  averageRating  numVotes  totalRating
rank                                                          
1    -0.131776  0.602522       4.493881  6.541702     7.256554
2    -1.164943  1.783588       4.176696  4.288616     4.740546
3     1.230126  0.781472       3.542325 -0.608480    -0.545343
4     0.290883  2.714124       3.542325  4.204913     4.531076
5    -1.071018  2.749914       3.542325  2.625580     2.865441
...        ...       ...            ...       ...          ...
996   0.666581 -1.151181      -1.215454  0.532851     0.409574
997  -0.131776 -0.077485      -1.215454  0.011177    -0.048910
998   0.431770  0.638312      -1.215454 -0.064730    -0.115623
999  -0.178738 -0.149065      -1.215454 -0.637588    -0.619091
1000 -0.178738 -1.008021      -1.215454 -0.716659    -0.688584

[1000 rows x 5 columns]

The z-scores tell a different story.

Notice that since the data is sorted by averageRating, the z-scores in that column decrease monotonically down the list.

Notice how the z-scores of numVotes closely track those of totalRating, which makes sense because totalRating is nearly proportional to numVotes (the averageRating factor varies comparatively little).

2. Distributions¶

Distributions of our movie data can be seen graphically through ECDFs, histograms, facet plots, box plots, and violin plots. Creating these visuals with certain conditions will reveal fascinating new aspects of the data.

First, we will plot an ECDF of the year column to confirm that most movies on this list are recent. The ECDF will show that a far larger proportion of movies come from 2000-2024 than from earlier years.

In [19]:
sns.displot(movies["year"], kind="ecdf", height=3, aspect=2)
Out[19]:
<seaborn.axisgrid.FacetGrid at 0x103c26f50>
[Figure: ECDF of the year column]

Now, we will use a histogram (with a kde density line) to confirm that most movies are rated between 7.5 and 8.0, and that most of the high-ranked movies are outliers. The histogram and the density line will show that the majority of the count lies between 7.5 and 8.0.

In [20]:
sns.displot(movies["averageRating"], bins=20, kind="hist", kde=True, height=3, aspect=2)
Out[20]:
<seaborn.axisgrid.FacetGrid at 0x374ef6e10>
[Figure: histogram of averageRating with KDE curve]

These ECDFs and histograms apply to the entire dataset, as did the summary data statistics in the first part of this section. But what if we start restricting the data based on certain conditions?

Christopher Nolan is a highly regarded director with several films in the top 1000. Here, we will calculate the mean averageRating of his movies and compare it to the mean of the entire dataset. Then, we will display the histogram and kde density line for his averageRatings, which we can compare with the histogram we just made for all the averageRatings.

In [21]:
nolan_movies = movies[movies["director[s]"].str.contains("Christopher Nolan")]
nolan_mean = nolan_movies["averageRating"].mean()
print("Here is the mean averageRating of Christopher Nolan's movies:")
print(nolan_mean)
print("Here is the mean averageRating of all movies:")
print(means["averageRating"])
sns.displot(nolan_movies["averageRating"], bins=20, kind="hist", kde=True, height=3, aspect=2)
Here is the mean averageRating of Christopher Nolan's movies:
8.4625
Here is the mean averageRating of all movies:
7.8831999999999995
Out[21]:
<seaborn.axisgrid.FacetGrid at 0x374ff1550>
[Figure: histogram of Christopher Nolan's averageRatings with KDE curve]

As you can see, the mean averageRating of Christopher Nolan's movies is much higher than the mean averageRating of all the movies in the dataset. Similarly, the density in the Christopher Nolan histogram is centered around 8.5, whereas it was centered around 7.75 for all movies. His films are consistently rated near the top of the list.
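
As a hedged follow-up (a sketch, not in the original analysis), we can extend this comparison to every frequently appearing director. Note that multi-director entries are treated as a single combined credit here, which is a simplification:

In [ ]:
# Mean averageRating for directors (or director teams) with at least 5 films in the top 1000
director_counts = movies["director[s]"].value_counts()
frequent = director_counts[director_counts >= 5].index
print(movies[movies["director[s]"].isin(frequent)]
      .groupby("director[s]")["averageRating"]
      .mean()
      .sort_values(ascending=False)
      .head(10))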

Besides looking at specific directors, we can also restrict our movies based on genre. This is where facet, box, and violin plots are very useful, since genre is a strictly categorical column.

First, we will print the count of each genre in the dataset. Then, we will analyze the numVotes in a facet plot with that count in mind.

In [22]:
print(movies["genre"].value_counts())
sns.displot(data=movies, x="numVotes", col="genre", height=3, aspect=1)
genre
Drama     539
Comedy    249
Action    212
Name: count, dtype: int64
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x374f79990>
[Figure: facet plot of numVotes distributions by genre]

We can draw several conclusions from these graphs. First, movies in the Drama genre collect far more votes overall than Action and Comedy movies. This is partly because there are simply more Drama movies in the top 1000, and partly because Drama is a broad genre, so more movie reviewers watch and rate those films.

Another conclusion is that Comedy movies are generally overlooked by movie reviewers. Although Comedy and Action have similar counts, far more Comedy movies sit below 1,000,000 votes, while Action movies exceed 1,000,000 votes more often.
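
A quick numeric companion to the facet plot (a sketch, not in the original notebook) backs this up by summarizing numVotes per genre:

In [ ]:
# Count, median, and maximum number of votes for each genre
print(movies.groupby("genre")["numVotes"].agg(["count", "median", "max"]))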

To compare runtimes, we can use a box plot.

In [23]:
sns.catplot(data=movies, x="genre", y="runtime", kind="box", height=4, aspect=2)
Out[23]:
<seaborn.axisgrid.FacetGrid at 0x37528a250>
[Figure: box plot of runtime by genre]

This box plot is genuinely interesting. The stand-out feature is the Comedy box, which sits noticeably lower (and is more compact) than the others, showing that Comedy movies tend to be shorter than Drama and Action movies. Also note that Action movies have a much higher minimum runtime than the rest.
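
A short numeric check of the same reading (a sketch, not in the original notebook), using per-genre runtime quartiles:

In [ ]:
# Five-number-style summary of runtime within each genre
print(movies.groupby("genre")["runtime"].describe()[["min", "25%", "50%", "75%", "max"]])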

A violin plot can be used to represent our last column, totalRating.

In [24]:
sns.catplot(data=movies, x="totalRating", y="genre", kind="violin", height=4, aspect=2)
Out[24]:
<seaborn.axisgrid.FacetGrid at 0x374fcdf50>
[Figure: violin plot of totalRating by genre]

The total ratings, which combine the number of votes and the average ratings, are a good indicator of a genre's popularity. As you can see, Comedy films are typically less popular with movie reviewers, since their totalRating distribution is concentrated at low values. It is also clear that Action movies span a very wide range of popularity.

3. Correlations¶

Unlike distributions, which show trends of our quantitative columns based on categorical data, correlations of our movie data will show how quantitative columns relate to each other. We will show several correlations through line plots, scatter plots, and correlation coefficients.

First, we will use a line plot (with a standard deviation error bar) to compare the average ratings to the number of votes.

In [25]:
sns.relplot(data=movies, x="averageRating", y="numVotes", kind="line", errorbar="sd", height=4, aspect=2)
Out[25]:
<seaborn.axisgrid.FacetGrid at 0x3753a9010>
[Figure: line plot of numVotes versus averageRating with standard-deviation error band]

This line plot shows an interesting connection. On the surface, a movie's average rating should not be directly tied to its number of votes; however, an upward-sloping line makes sense, because a highly rated movie tempts more reviewers to watch and rate it.
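
To put a number on that relationship (a sketch, not in the original notebook), we can compute a rank correlation between the two columns:

In [ ]:
# Spearman rank correlation between averageRating and numVotes
print(movies["averageRating"].corr(movies["numVotes"], method="spearman"))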

Next, we will use a scatter plot (with genres as the hue) to compare movie years to their runtimes.

In [26]:
sns.relplot(data=movies, x="year", y="runtime", hue="genre", height=4, aspect=2)
Out[26]:
<seaborn.axisgrid.FacetGrid at 0x37549e090>
[Figure: scatter plot of runtime versus year, colored by genre]

This scatter plot shows that movie runtimes have generally gotten longer over time. As the genre hue shows, the growing share of Action movies in recent years was a major factor pushing runtimes up, since those movies are typically longer.
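
A hedged numeric check of the trend (not part of the original notebook): mean runtime by decade, where the decade is a derived helper value rather than a column of the frame:

In [ ]:
# Group the movies by decade of release and average their runtimes
decade = (movies["year"] // 10) * 10
print(movies.groupby(decade)["runtime"].mean().round(1))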

Graphs are not the only way to display correlation; we can use a Pearson correlation coefficient as well. Here, we compute the Pearson correlation between the number of votes and the total ratings. It will be close to 1, since totalRating is computed directly from numVotes.

In [27]:
movies[["numVotes", "totalRating"]].corr()
Out[27]:
numVotes totalRating
numVotes 1.00000 0.99795
totalRating 0.99795 1.00000

The Spearman correlation coefficient is less sensitive to outliers than Pearson's. We will use it to compare the movies' years to their average ratings.

In [28]:
print(movies["year"].corr(movies["averageRating"], "spearman"))
-0.13562520803957584

This correlation is negative, though weak (about -0.14), which suggests that within this top-1000 list, average ratings have drifted slightly downward over the years. That is an intriguing result!

DISCUSSION¶

After our summary data analysis, it is worth posing questions about the dataset that can be explored in future work. One question concerns the prediction of a categorical outcome, and the other concerns the prediction of a quantitative outcome.

1. Can the year, runtime, average rating, and total rating of a movie, along with the name of the director, predict a movie's genre?¶

This question relates to a categorical outcome. We've already deduced during our analysis that higher runtimes are common for Action movies. We also found that Comedy movies tend to have lower runtimes and total ratings (which shows that they are less popular and taken less seriously). Using these observations, combined with observations regarding years, average ratings, and the director's name (certain directors only focus on a select few genres), we may be able to predict a movie's genre in a future project.
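
One minimal sketch of how this question might be attacked, assuming scikit-learn is available (it is not used anywhere else in this project); the director's name is omitted for simplicity, and no real model selection is attempted:

In [ ]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Predict genre from the quantitative columns only (a deliberately simple baseline)
X = movies[["year", "runtime", "averageRating", "totalRating"]]
y = movies["genre"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
clf.fit(X_train, y_train)
print("Held-out accuracy:", clf.score(X_test, y_test))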

2. Do a movie's year, average rating, number of votes, and total rating, combined with the director's name and the genre, predict the movie's runtime?¶

This question relates to a quantitative outcome. During our summary data analysis, we determined that as time went on (as the years got more modern), movie runtimes generally got longer. It was also found that Comedy movies are much shorter in length than Drama and Action movies. We even noted that Action movies have a much higher minimum runtime than the rest. Using these observations about years and genres, along with observations regarding average ratings, numbers of votes, total ratings, and director trends, we should look into predicting a movie's runtime in the future.
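
A comparable sketch for the quantitative question, again assuming scikit-learn and again leaving the director's name out for brevity; the genre is one-hot encoded so it can feed a linear model:

In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# One-hot encode genre alongside the quantitative predictors
features = pd.get_dummies(movies[["year", "averageRating", "numVotes", "totalRating", "genre"]],
                          columns=["genre"])
target = movies["runtime"]
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
print("Held-out R^2:", reg.score(X_test, y_test))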