INTRODUCTION
This project analyzes the data in various IMDb datasets, showing summary statistics, distributions, correlations, and more. The raw datasets that were used as sources are imported from IMDb Data Files, and documentation surrounding the files can be found at IMDb Non-Commercial Datasets. I learned about these files from the resources page of our class book, Data Science 1 by Toby Driscoll.
Here is the code to import our libraries and the raw datasets. The only datasets we will use are title.basics.tsv, title.crew.tsv, title.ratings.tsv, and name.basics.tsv. The raw data takes a long time to import. In preprocessing, we will reduce the data so that later steps run more quickly.
(NOTE: every column of title_basics is read as a string to avoid type errors on import. We will correct the types later.)
import numpy as np
import pandas as pd
import seaborn as sns
title_basics = pd.read_csv("https://datasets.imdbws.com/title.basics.tsv.gz", delimiter="\t",
                           dtype={"tconst": str, "titleType": str, "primaryTitle": str, "originalTitle": str,
                                  "isAdult": str, "startYear": str, "endYear": str, "runtimeMinutes": str,
                                  "genres": str})
title_basics.head(3)
| | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
---|---|---|---|---|---|---|---|---|---|
0 | tt0000001 | short | Carmencita | Carmencita | 0 | 1894 | \N | 1 | Documentary,Short |
1 | tt0000002 | short | Le clown et ses chiens | Le clown et ses chiens | 0 | 1892 | \N | 5 | Animation,Short |
2 | tt0000003 | short | Pauvre Pierrot | Pauvre Pierrot | 0 | 1892 | \N | 4 | Animation,Comedy,Romance |
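The `\N` entries in the endYear column above are the missing-value marker used in these files. As a hedged aside, the marker can also be handled at read time instead of afterwards. The sketch below uses a tiny in-memory stand-in for the file (the `tt9999999` row is made up for illustration) and is not the approach taken in the rest of this project:

```python
import io
import pandas as pd

# Tiny stand-in for title.basics.tsv: tab-separated, "\N" marks a missing value.
# The last row is hypothetical and exists only to show the missing-value handling.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\truntimeMinutes\n"
    "tt0000001\tshort\tCarmencita\t1894\t1\n"
    "tt0000009\tmovie\tMiss Jerry\t1894\t45\n"
    "tt9999999\tmovie\tNA\t\\N\t\\N\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    # Read only the columns that are needed later.
    usecols=["tconst", "titleType", "primaryTitle", "startYear", "runtimeMinutes"],
    # Treat only "\N" as missing, so a title such as "NA" stays a string.
    na_values=["\\N"],
    keep_default_na=False,
)
print(sample.dtypes)
print(sample)
```

With `keep_default_na=False`, only the listed marker becomes NaN, and the numeric columns are parsed as floats because they contain a missing value.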
title_crew = pd.read_csv("https://datasets.imdbws.com/title.crew.tsv.gz", delimiter="\t")
title_crew.head(3)
| | tconst | directors | writers |
---|---|---|---|
0 | tt0000001 | nm0005690 | \N |
1 | tt0000002 | nm0721526 | \N |
2 | tt0000003 | nm0721526 | \N |
title_ratings = pd.read_csv("https://datasets.imdbws.com/title.ratings.tsv.gz", delimiter="\t")
title_ratings.head(3)
| | tconst | averageRating | numVotes |
---|---|---|---|
0 | tt0000001 | 5.7 | 2036 |
1 | tt0000002 | 5.7 | 272 |
2 | tt0000003 | 6.5 | 1986 |
name_basics = pd.read_csv("https://datasets.imdbws.com/name.basics.tsv.gz", delimiter="\t")
name_basics.head(3)
| | nconst | primaryName | birthYear | deathYear | primaryProfession | knownForTitles |
---|---|---|---|---|---|---|
0 | nm0000001 | Fred Astaire | 1899 | 1987 | actor,miscellaneous,producer | tt0072308,tt0050419,tt0053137,tt0027125 |
1 | nm0000002 | Lauren Bacall | 1924 | 2014 | actress,soundtrack,archive_footage | tt0037382,tt0075213,tt0117057,tt0038355 |
2 | nm0000003 | Brigitte Bardot | 1934 | \N | actress,music_department,producer | tt0057345,tt0049189,tt0056404,tt0054452 |
PREPROCESSING
Here are the steps we will take to preprocess our data. In the interest of being concise, the datasets won't be displayed until they are merged.
1. Drop Unnecessary Rows and Columns
In this project, we will only analyze movies, no TV shows or shorts. So, we will drop the rows of title_basics that don't have the titleType "movie". We will also drop columns from all the dataframes that are unnecessary when analyzing movies.
movie_basics = title_basics[title_basics.titleType == "movie"]
movie_basics = movie_basics.drop(columns=["titleType", "originalTitle", "isAdult", "endYear"])
movie_crew = title_crew.drop(columns=["writers"])
movie_ratings = title_ratings
names = name_basics.drop(columns=["birthYear", "deathYear", "primaryProfession", "knownForTitles"])
2. Replace Missing Values with NaN and Drop Incomplete Rows
Before any further preprocessing, we must replace the "\N" missing-value markers with NaN and then drop every row that contains one. This will allow us to specify types.
movie_basics = movie_basics.replace("\\N", np.nan)
movie_basics = movie_basics.dropna()
movie_crew = movie_crew.replace("\\N", np.nan)
movie_crew = movie_crew.dropna()
movie_ratings = movie_ratings.replace("\\N", np.nan)
movie_ratings = movie_ratings.dropna()
names = names.replace("\\N", np.nan)
names = names.dropna()
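Because dropna removes every row with at least one missing value, it can be worth checking how much data is lost. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the real files):

```python
import numpy as np
import pandas as pd

# Toy frame using the same "\N" marker as the IMDb files.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "startYear": ["1994", "\\N", "1972"],
    "runtimeMinutes": ["142", "90", "\\N"],
})

toy = toy.replace("\\N", np.nan)

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()
print(missing_per_column)

rows_before = len(toy)
toy = toy.dropna()
rows_dropped = rows_before - len(toy)
print(f"Dropped {rows_dropped} of {rows_before} rows")
```

Here two of the three rows are dropped even though each column is missing only one value, which shows how quickly row-wise dropping can add up.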
3. Specify Types
Next, we convert the numeric columns to numeric types and split the comma-separated "directors" strings into lists. The lists are what allow the director IDs to be matched with names during merging.
movie_basics["startYear"] = movie_basics["startYear"].astype(int)
movie_basics["runtimeMinutes"] = movie_basics["runtimeMinutes"].astype(int)
movie_crew["directors"] = movie_crew["directors"].str.split(",")
movie_ratings["averageRating"] = movie_ratings["averageRating"].astype(float)
movie_ratings["numVotes"] = movie_ratings["numVotes"].astype(int)
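The astype(int) calls above raise an error if any non-numeric string survives the earlier cleaning. As a hedged alternative, `pd.to_numeric` with `errors="coerce"` turns such strings into NaN so they can be inspected first. The value "abc" below is a hypothetical bad entry:

```python
import pandas as pd

# "abc" is a made-up malformed value used only for illustration.
raw = pd.Series(["142", "175", "abc", "201"], name="runtimeMinutes")

# Unparseable strings become NaN instead of raising an error.
as_numbers = pd.to_numeric(raw, errors="coerce")

# Inspect the original strings that failed to parse.
bad_rows = raw[as_numbers.isna()]
print(bad_rows)

# After dropping the failures, the column can safely become integer.
clean = as_numbers.dropna().astype(int)
print(clean)
```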
4. Merge the Dataframes
Now, we will merge the four dataframes together into all_movies. The movie dataframes merge easily on "tconst"; the names dataframe, however, requires a specific method of merging. This method involves creating a helper variable "directors_names" that explodes the "directors" column, merges with the names dataframe, and then gets grouped by its tconst. The helper variable then merges back with all_movies. This is just to match the lists of "nconst"s with lists of the directors' actual names.
all_movies = pd.merge(movie_basics, movie_crew, on="tconst")
all_movies = pd.merge(all_movies, movie_ratings, on="tconst")
# Create helper variable to match the lists of nconsts with lists of names
directors_names = all_movies[["tconst", "directors"]]
directors_names = directors_names.explode("directors")
directors_names = pd.merge(directors_names, names, left_on="directors", right_on="nconst")
directors_names = directors_names.groupby("tconst").agg({"directors": list, "primaryName": list}).reset_index()
# Merge the helper variable back into the main dataframe
all_movies = pd.merge(all_movies, directors_names, on="tconst")
all_movies.head()
| | tconst | primaryTitle | startYear | runtimeMinutes | genres | directors_x | averageRating | numVotes | directors_y | primaryName |
---|---|---|---|---|---|---|---|---|---|---|
0 | tt0000009 | Miss Jerry | 1894 | 45 | Romance | [nm0085156] | 5.3 | 209 | [nm0085156] | [Alexander Black] |
1 | tt0000147 | The Corbett-Fitzsimmons Fight | 1897 | 100 | Documentary,News,Sport | [nm0714557] | 5.2 | 506 | [nm0714557] | [Enoch J. Rector] |
2 | tt0000574 | The Story of the Kelly Gang | 1906 | 70 | Action,Adventure,Biography | [nm0846879] | 6.0 | 876 | [nm0846879] | [Charles Tait] |
3 | tt0000591 | The Prodigal Son | 1907 | 90 | Drama | [nm0141150] | 5.5 | 23 | [nm0141150] | [Michel Carré] |
4 | tt0000679 | The Fairylogue and Radio-Plays | 1908 | 120 | Adventure,Fantasy | [nm0091767, nm0877783] | 5.2 | 71 | [nm0091767, nm0877783] | [Francis Boggs, Otis Turner] |
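The explode, merge, and group steps are easier to follow on a toy example. The sketch below uses made-up IDs and names, and it is a variant rather than the exact code above: it aggregates only the name column (called `directorNames` here, a hypothetical name), which avoids the duplicated directors_x and directors_y columns.

```python
import pandas as pd

# Made-up movies and people, for illustration only.
toy_movies = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "directors": [["nm1"], ["nm2", "nm3"]],
})
toy_names = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3"],
    "primaryName": ["Ann Lee", "Bo Chen", "Cy Diaz"],
})

# 1. One row per (movie, director) pair.
exploded = toy_movies.explode("directors")

# 2. Attach each director's name by ID.
matched = exploded.merge(toy_names, left_on="directors", right_on="nconst", how="left")

# 3. Collapse back to one row per movie, collecting the names into a list.
director_names = (
    matched.groupby("tconst", sort=False)["primaryName"]
    .agg(list)
    .rename("directorNames")
    .reset_index()
)

# 4. Merge the lists of names back onto the movies.
result = toy_movies.merge(director_names, on="tconst")
print(result)
```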
5. Polish and Reorder the Frame
For this project, we are only going to analyze the top 1000 highest rated action, comedy, or drama movies that have a minimum number of 50,000 votes. Here are the steps taken to polish the frame:
- Drop the rows in which "numVotes" is less than 50,000.
- Drop any rows that don't have a genre of Action, Comedy, or Drama. For each remaining row, keep only one of those genres so that the column becomes categorical. When a movie lists more than one, the code prefers Comedy, then Action, then Drama. We will use iteration to help with that.
- Sort the data by the "averageRating" column in descending order and keep the first 1,000 rows.
- Create a "rank" column and set it as the index.
- Create a column called "totalRating" that multiplies the averageRating and the numVotes.
- Delete, alter, and move some of our columns to polish our finished set of data.
movies = all_movies[all_movies["numVotes"] >= 50000]
movies = movies[movies["genres"].str.contains("Action") | movies["genres"].str.contains("Comedy") | movies["genres"].str.contains("Drama")]
for index, movie in movies.iterrows():
    if "Comedy" in movie["genres"]:
        movies.at[index, "genres"] = "Comedy"
    elif "Action" in movie["genres"]:
        movies.at[index, "genres"] = "Action"
    elif "Drama" in movie["genres"]:
        movies.at[index, "genres"] = "Drama"
movies = movies.sort_values(by="averageRating", ascending=False)
movies = movies.iloc[:1000]
movies["rank"] = range(1, len(movies) + 1)
movies.index = movies["rank"]
movies["totalRating"] = movies["averageRating"] * movies["numVotes"]
movies = movies.drop(columns=["tconst", "directors_x", "directors_y", "rank"])
movies = movies.rename(columns={"primaryTitle": "title", "startYear": "year", "runtimeMinutes": "runtime",
"primaryName": "director[s]", "genres": "genre"})
movies["director[s]"] = movies["director[s]"].str.join(",")
movies = movies[["title", "year", "runtime", "director[s]", "genre", "averageRating", "numVotes", "totalRating"]]
movies.head()
rank | title | year | runtime | director[s] | genre | averageRating | numVotes | totalRating |
---|---|---|---|---|---|---|---|---|
1 | The Shawshank Redemption | 1994 | 142 | Frank Darabont | Drama | 9.3 | 2874036 | 26728534.8 |
2 | The Godfather | 1972 | 175 | Francis Ford Coppola | Drama | 9.2 | 2001828 | 18416817.6 |
3 | 12th Fail | 2023 | 147 | Vidhu Vinod Chopra | Drama | 9.0 | 106078 | 954702.0 |
4 | The Lord of the Rings: The Return of the King | 2003 | 201 | Peter Jackson | Action | 9.0 | 1969425 | 17724825.0 |
5 | The Godfather Part II | 1974 | 202 | Francis Ford Coppola | Drama | 9.0 | 1358038 | 12222342.0 |
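The genre assignment uses row-by-row iteration, which is easy to read but slow on large frames. As a sketch of an alternative (not the method used above), `np.select` applies the same Comedy, then Action, then Drama priority in one vectorized step. The toy genres and the "Other" fallback are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up genre strings in the same comma-separated format.
toy = pd.DataFrame({"genres": ["Action,Comedy", "Action,Drama", "Drama,Romance", "Horror"]})

# Conditions are checked in order, so earlier entries take priority.
conditions = [
    toy["genres"].str.contains("Comedy"),
    toy["genres"].str.contains("Action"),
    toy["genres"].str.contains("Drama"),
]
choices = ["Comedy", "Action", "Drama"]

# Rows matching none of the conditions receive the fallback label.
toy["genre"] = np.select(conditions, choices, default="Other")
print(toy)
```

The first row lists both Action and Comedy and is labeled Comedy, which matches the priority order of the loop.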
6. Determine Outliers
Our last step of preprocessing is an outlier analysis of all quantitative columns. A value is flagged as an outlier if it falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1. The IQRs are computed for all columns at once, but the outliers are identified separately for each of the 5 quantitative columns.
quantitative_columns = movies[["year", "runtime", "averageRating", "numVotes", "totalRating"]]
Q1 = quantitative_columns.quantile(0.25)
Q3 = quantitative_columns.quantile(0.75)
IQR = Q3 - Q1
print("Here is the IQR for each column:")
print(IQR)
Here is the IQR for each column:
year                  25.000
runtime               34.000
averageRating          0.500
numVotes          359853.500
totalRating      2835101.025
dtype: float64
year_outliers = (quantitative_columns["year"] < (Q1["year"] - 1.5 * IQR["year"])) | (quantitative_columns["year"] > (Q3["year"] + 1.5 * IQR["year"]))
print("Here are the year outliers in the dataset:")
movies[year_outliers].head(3)
Here are the year outliers in the dataset:
rank | title | year | runtime | director[s] | genre | averageRating | numVotes | totalRating |
---|---|---|---|---|---|---|---|---|
26 | It's a Wonderful Life | 1946 | 130 | Frank Capra | Drama | 8.6 | 497496 | 4278465.6 |
44 | Modern Times | 1936 | 87 | Charles Chaplin | Comedy | 8.5 | 258403 | 2196425.5 |
49 | City Lights | 1931 | 87 | Charles Chaplin | Comedy | 8.5 | 195320 | 1660220.0 |
runtime_outliers = (quantitative_columns["runtime"] < (Q1["runtime"] - 1.5 * IQR["runtime"])) | (quantitative_columns["runtime"] > (Q3["runtime"] + 1.5 * IQR["runtime"]))
print("Here are the runtime outliers in the dataset:")
movies[runtime_outliers].head(3)
Here are the runtime outliers in the dataset:
rank | title | year | runtime | director[s] | genre | averageRating | numVotes | totalRating |
---|---|---|---|---|---|---|---|---|
4 | The Lord of the Rings: The Return of the King | 2003 | 201 | Peter Jackson | Action | 9.0 | 1969425 | 17724825.0 |
5 | The Godfather Part II | 1974 | 202 | Francis Ford Coppola | Drama | 9.0 | 1358038 | 12222342.0 |
8 | Schindler's List | 1993 | 195 | Steven Spielberg | Drama | 9.0 | 1443603 | 12992427.0 |
averageRating_outliers = (quantitative_columns["averageRating"] < (Q1["averageRating"] - 1.5 * IQR["averageRating"])) | (quantitative_columns["averageRating"] > (Q3["averageRating"] + 1.5 * IQR["averageRating"]))
print("Here are the averageRating outliers in the dataset:")
movies[averageRating_outliers].head(3)
Here are the averageRating outliers in the dataset:
rank | title | year | runtime | director[s] | genre | averageRating | numVotes | totalRating |
---|---|---|---|---|---|---|---|---|
1 | The Shawshank Redemption | 1994 | 142 | Frank Darabont | Drama | 9.3 | 2874036 | 26728534.8 |
2 | The Godfather | 1972 | 175 | Francis Ford Coppola | Drama | 9.2 | 2001828 | 18416817.6 |
3 | 12th Fail | 2023 | 147 | Vidhu Vinod Chopra | Drama | 9.0 | 106078 | 954702.0 |
numVotes_outliers = (quantitative_columns["numVotes"] < (Q1["numVotes"] - 1.5 * IQR["numVotes"])) | (quantitative_columns["numVotes"] > (Q3["numVotes"] + 1.5 * IQR["numVotes"]))
print("Here are the numVotes outliers in the dataset:")
movies[numVotes_outliers].head(3)
Here are the numVotes outliers in the dataset:
rank | title | year | runtime | director[s] | genre | averageRating | numVotes | totalRating |
---|---|---|---|---|---|---|---|---|
1 | The Shawshank Redemption | 1994 | 142 | Frank Darabont | Drama | 9.3 | 2874036 | 26728534.8 |
2 | The Godfather | 1972 | 175 | Francis Ford Coppola | Drama | 9.2 | 2001828 | 18416817.6 |
4 | The Lord of the Rings: The Return of the King | 2003 | 201 | Peter Jackson | Action | 9.0 | 1969425 | 17724825.0 |
totalRating_outliers = (quantitative_columns["totalRating"] < (Q1["totalRating"] - 1.5 * IQR["totalRating"])) | (quantitative_columns["totalRating"] > (Q3["totalRating"] + 1.5 * IQR["totalRating"]))
print("Here are the totalRating outliers in the dataset:")
movies[totalRating_outliers].head(3)
Here are the totalRating outliers in the dataset:
rank | title | year | runtime | director[s] | genre | averageRating | numVotes | totalRating |
---|---|---|---|---|---|---|---|---|
1 | The Shawshank Redemption | 1994 | 142 | Frank Darabont | Drama | 9.3 | 2874036 | 26728534.8 |
2 | The Godfather | 1972 | 175 | Francis Ford Coppola | Drama | 9.2 | 2001828 | 18416817.6 |
4 | The Lord of the Rings: The Return of the King | 2003 | 201 | Peter Jackson | Action | 9.0 | 1969425 | 17724825.0 |
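The five outlier masks above repeat the same expression with a different column name each time. One way to avoid the repetition, sketched here with a hypothetical helper `iqr_outlier_mask` and made-up data, is to compute the mask for every column at once:

```python
import pandas as pd

def iqr_outlier_mask(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a boolean frame that is True where a value breaks the k * IQR rule."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparing a frame with a Series aligns the Series index with the columns.
    return (frame < lower) | (frame > upper)

# Made-up values: one runtime is far above the rest.
toy = pd.DataFrame({
    "runtime": [90, 95, 100, 105, 110, 300],
    "averageRating": [7.5, 7.6, 7.7, 7.8, 7.9, 8.0],
})

mask = iqr_outlier_mask(toy)
outlier_counts = mask.sum()
print(outlier_counts)

# Rows that are outliers in a particular column.
print(toy[mask["runtime"]])
```

Counting the True values per column also shows how many outliers each column has, which head(3) alone does not reveal.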
The outlier analysis reveals several patterns. The year outliers shown are all older movies, because the majority of the top 1000 movies are from recent decades.
The runtime outliers shown are movies with long runtimes. This is because there is a practical lower limit on how short a feature film can be, but no comparable upper limit.
The averageRating outliers are perhaps the most interesting, as they are the top-ranked movies. Most ratings in the dataset fall between roughly 7.5 and 8.0, so the highest ratings sit far above the bulk of the data.
Notice that the numVotes and totalRating outliers are mostly the same movies. This is because totalRating is computed from numVotes. They are also mostly top-ranked movies, which suggests that higher-rated movies tend to attract more votes (an idea that will be explored more later).
SUMMARY DATA ANALYSIS
Now that our "movies" dataframe is preprocessed, it is ready for analysis. In our summary analysis, we will identify summary statistics like means, medians, standard deviations, and z-scores. Then, we will showcase several distributions, including ECDFs, histograms, facet plots, box plots, and violin plots. Finally, we will explore correlations using line plots, scatter plots, and correlation coefficients.
1. Summary Statistics
The basic summary statistics are means, medians, and standard deviations. We will compute them (as well as the minimums and maximums) using the describe method and then interpret them:
movies_statistics = movies.describe()
movies_statistics = movies_statistics.loc[["mean", "min", "50%", "max", "std"]]
movies_statistics = movies_statistics.rename(index={"50%": "median"})
movies_statistics = movies_statistics.round(2)
print(movies_statistics)
           year  runtime  averageRating    numVotes  totalRating
mean    1996.81   125.16           7.88   341631.07   2756261.85
min     1921.00    45.00           7.50    50054.00    379132.50
median  2003.00   122.00           7.80   191176.50   1484413.90
max     2024.00   321.00           9.30  2874036.00  26728534.80
std       21.29    27.94           0.32   387117.15   3303533.92
These summary statistics show us a lot about the data.
Notice how the mean year is much closer to the max year than the min year, further showing that most movies on this list are from recent decades.
Similarly, the mean averageRating is much closer to the min rating than the max rating, further showing that most movies here hover around an averageRating of 7.5 to 8.0, and the top-ranked movies are outliers.
Notice that totalRating is roughly equal to averageRating * numVotes in the mean, min, median, and max rows but not in the std row, because the standard deviation of a product is not the product of the standard deviations.
The stds are much larger for numVotes and totalRating. Unlike year and runtime, which span ranges of only about 100 years and about 300 minutes, these values range from 50,000 into the millions.
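The word "roughly" matters for the mean row: 7.88 * 341631.07 is about 2.69 million, while the reported mean totalRating is about 2.76 million. The gap is the covariance between the two columns, since the mean of a product equals the product of the means plus the (population) covariance. A small check of that identity on made-up numbers:

```python
import pandas as pd

# Made-up ratings and vote counts, for illustration only.
toy = pd.DataFrame({
    "averageRating": [9.0, 8.0, 7.5, 7.5],
    "numVotes": [2_000_000, 500_000, 100_000, 60_000],
})
toy["totalRating"] = toy["averageRating"] * toy["numVotes"]

mean_of_product = toy["totalRating"].mean()
product_of_means = toy["averageRating"].mean() * toy["numVotes"].mean()

# Population covariance: average product of deviations (divide by n, not n - 1).
rating_dev = toy["averageRating"] - toy["averageRating"].mean()
votes_dev = toy["numVotes"] - toy["numVotes"].mean()
covariance = (rating_dev * votes_dev).mean()

print(mean_of_product, product_of_means, covariance)
```

A positive gap, as in the table above, is what one would expect if higher-rated movies tend to have more votes.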
The z-scores are a separate summary statistic. They are computed and interpreted here:
means = quantitative_columns.mean()
stds = quantitative_columns.std()
z_scores = (quantitative_columns - means) / stds
print(z_scores)
          year   runtime  averageRating  numVotes  totalRating
rank
1    -0.131776  0.602522       4.493881  6.541702     7.256554
2    -1.164943  1.783588       4.176696  4.288616     4.740546
3     1.230126  0.781472       3.542325 -0.608480    -0.545343
4     0.290883  2.714124       3.542325  4.204913     4.531076
5    -1.071018  2.749914       3.542325  2.625580     2.865441
...        ...       ...            ...       ...          ...
996   0.666581 -1.151181      -1.215454  0.532851     0.409574
997  -0.131776 -0.077485      -1.215454  0.011177    -0.048910
998   0.431770  0.638312      -1.215454 -0.064730    -0.115623
999  -0.178738 -0.149065      -1.215454 -0.637588    -0.619091
1000 -0.178738 -1.008021      -1.215454 -0.716659    -0.688584

[1000 rows x 5 columns]
The z-scores tell a different story for each column.
Notice that since the data is ordered by averageRating, the z-scores in that column never increase as the list goes down (tied ratings, such as ranks 3 through 5, share the same z-score).
Notice how the z-scores of numVotes are very similar to those of totalRating. This makes sense because averageRating varies so little that totalRating is nearly proportional to numVotes.
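One detail worth knowing when comparing these z-scores with other tools: the pandas std method divides by n - 1 by default (ddof=1), whereas some libraries default to dividing by n. A short sketch on made-up ratings shows the difference:

```python
import pandas as pd

# Made-up ratings, for illustration only.
ratings = pd.Series([9.3, 9.2, 9.0, 7.6, 7.5], name="averageRating")

# Sample standard deviation (pandas default, ddof=1), as used above.
z_sample = (ratings - ratings.mean()) / ratings.std()

# Population standard deviation (ddof=0) gives slightly larger magnitudes.
z_population = (ratings - ratings.mean()) / ratings.std(ddof=0)

print(pd.DataFrame({"z_sample": z_sample, "z_population": z_population}))
```

With 1000 rows the two versions differ only slightly, so the choice does not change any conclusion here.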
2. Distributions
Distributions of our movie data can be seen graphically through ECDFs, histograms, facet plots, box plots, and violin plots. Creating these visuals with certain conditions will reveal fascinating new aspects of the data.
First, we will plot an ECDF of the year column in order to fully confirm that most movies on this list are from the modern day. The ECDF will show that a higher proportion of movies are from 2000-2024, rather than from earlier years.
sns.displot(movies["year"], kind="ecdf", height=3, aspect=2)
[Figure: ECDF of the year column]
Now, we will use a histogram (with a kde density line) to confirm that most movies are rated between 7.5 and 8.0, and that most of the high-ranked movies are outliers. The histogram and the density line will show that the majority of the count lies between 7.5 and 8.0.
sns.displot(movies["averageRating"], bins=20, kind = "hist", kde = True, height=3, aspect=2)
[Figure: histogram of averageRating with KDE curve]
These ECDFs and histograms apply to the entire dataset, as did the summary data statistics in the first part of this section. But what if we start restricting the data based on certain conditions?
Christopher Nolan is a director with several films in the top 1000. Here, we will calculate the mean averageRating of his movies and compare it to the mean of the entire dataset. Then, we will display the histogram and kde density line for his averageRatings, which we can compare with the histogram we just made for all the averageRatings.
nolan_movies = movies[movies["director[s]"].str.contains("Christopher Nolan")]
nolan_mean = nolan_movies["averageRating"].mean()
print("Here is the mean averageRating of Christopher Nolan's movies:")
print(nolan_mean)
print("Here is the mean averageRating of all movies:")
print(means["averageRating"])
sns.displot(nolan_movies["averageRating"], bins=20, kind = "hist", kde = True, height=3, aspect=2)
Here is the mean averageRating of Christopher Nolan's movies:
8.4625
Here is the mean averageRating of all movies:
7.8831999999999995
[Figure: histogram of averageRating for Christopher Nolan's movies, with KDE curve]
As you can see, the mean averageRating of Christopher Nolan's movies is much higher than the mean averageRating of all the movies in the dataset. Similarly, the density shown in the Christopher Nolan histogram is centered around 8.5, while it was centered around 7.75 for all movies. This makes sense because he makes a lot of high-rated movies!
Besides looking at specific directors, we can also restrict our movies based on genre. This is where facet, box, and violin plots are very useful, since genre is a strictly categorical column.
First, we will print the count of each genre in the dataset. Then, we will analyze the numVotes in a facet plot with that count in mind.
print(movies["genre"].value_counts())
sns.displot(data=movies, x="numVotes", col="genre", height=3, aspect=1)
genre
Drama     539
Comedy    249
Action    212
Name: count, dtype: int64
[Figure: histograms of numVotes, one panel per genre]
We can draw several conclusions based on these graphs. First, the Drama panel shows far more votes overall than the Action and Comedy panels. This is partly because there are more Drama movies than the others, and possibly also because Drama is a very broad genre that reaches more movie reviewers.
Another conclusion is that Comedy movies are generally overlooked by movie reviewers. While Comedy and Action have similar counts, Comedy has far more movies with numVotes below 1,000,000, while Action movies exceed 1,000,000 more often.
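Impressions read off a facet plot can be checked numerically with a grouped summary. The sketch below uses made-up vote counts (not the real data) to show one possible way of doing so: the count, median, and total of numVotes per genre, plus the share of movies above 1,000,000 votes.

```python
import pandas as pd

# Made-up vote counts, for illustration only.
toy = pd.DataFrame({
    "genre": ["Drama", "Drama", "Comedy", "Comedy", "Action", "Action"],
    "numVotes": [2_000_000, 100_000, 150_000, 60_000, 1_500_000, 300_000],
})

# Count, median, and total votes for each genre.
summary = toy.groupby("genre")["numVotes"].agg(["count", "median", "sum"])
print(summary)

# Fraction of each genre's movies with more than 1,000,000 votes.
share_over_million = toy["numVotes"].gt(1_000_000).groupby(toy["genre"]).mean()
print(share_over_million)
```

The median and the share above a threshold do not depend on how many movies a genre has, so they separate "more movies" from "more votes per movie".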
To compare runtimes, we can use a box plot.
sns.catplot(data=movies, x="genre", y="runtime", kind="box", height=4, aspect=2)
[Figure: box plot of runtime by genre]
The standout observation of this box plot is the Comedy box. It is compact, which means Comedy runtimes vary less than the others (a short box is a small IQR), and judging by its position, Comedy movies are also shorter than Drama and Action movies. Also note that Action movies have a much higher minimum runtime than the rest.
A violin plot can be used to represent our last column, totalRating.
sns.catplot(data=movies, x="totalRating", y="genre", kind="violin", height=4, aspect=2)
[Figure: violin plot of totalRating by genre]
The total rating, which combines the number of votes and the average rating, is a rough indicator of the popularity of a genre. As you can see, Comedy films are typically less popular with movie reviewers, since their violin covers a much narrower range of total ratings. It is also clear that Action movies cover a very wide range of popularity.
3. Correlations
Unlike distributions, which show trends of our quantitative columns based on categorical data, correlations of our movie data will show how quantitative columns relate to each other. We will show several correlations through line plots, scatter plots, and correlation coefficients.
First, we will use a line plot (with a standard deviation error bar) to compare the average ratings to the number of votes.
sns.relplot(data=movies, x="averageRating", y="numVotes", kind="line", errorbar="sd", height=4, aspect=2)
[Figure: line plot of numVotes against averageRating with a standard-deviation band]
This line plot shows an interesting connection. At first glance, there is no reason for the average rating to be tied to the number of votes; however, the upward-sloping line makes sense, because a highly rated movie may tempt more reviewers to watch and rate it.
Next, we will use a scatter plot (with genres as the hue) to compare movie years to their runtimes.
sns.relplot(data=movies, x="year", y="runtime", hue="genre", height=4, aspect=2)
[Figure: scatter plot of runtime against year, colored by genre]
This scatter plot shows that movie runtimes generally got longer over time. As the genre colors show, the rising popularity of Action movies in recent years was a main contributor to the longer runtimes, since those movies are typically longer.
Graphs are not the only way to display correlation. We can use a Pearson correlation coefficient as well. Here, we compute the Pearson correlation coefficient between the number of votes and the total ratings. It should be close to 1, since totalRating is computed from numVotes.
movies[["numVotes", "totalRating"]].corr()
| | numVotes | totalRating |
---|---|---|
numVotes | 1.00000 | 0.99795 |
totalRating | 0.99795 | 1.00000 |
The Spearman correlation coefficient is not as sensitive to outliers. We will use it to compare the movie years to their average ratings.
print(movies["year"].corr(movies["averageRating"], method="spearman"))
-0.13562520803957584
This correlation is weakly negative (about -0.14), which suggests a slight tendency for newer movies on this list to have lower average ratings. This is really interesting!
DISCUSSION
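To see why the Spearman coefficient is less sensitive to outliers, recall that it is the Pearson correlation of the ranks rather than of the raw values. A minimal sketch on made-up numbers, computing Spearman directly from that definition:

```python
import pandas as pd

# Made-up data: y always increases with x, but the last value is extreme.
x = pd.Series([1, 2, 3, 4, 5, 6])
y = pd.Series([2, 4, 5, 7, 8, 100])

# Pearson works on the raw values, so the extreme point distorts it.
pearson = x.corr(y)

# Spearman is the Pearson correlation of the ranks.
spearman = x.rank().corr(y.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because the relationship is perfectly monotonic, the Spearman coefficient is 1 even though the Pearson coefficient is pulled well below 1 by the single extreme value.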
After our summary data analysis, it is crucial to pose questions about the dataset that can be explored in future expositions. One of these questions will be about the prediction of a categorical outcome, and one of them will be about the prediction of a quantitative outcome.
1. Can the year, runtime, average rating, and total rating of a movie, along with the name of the director, predict a movie's genre?
This question relates to a categorical outcome. We've already deduced during our analysis that higher runtimes are common for Action movies. We also found that Comedy movies tend to have lower runtimes and total ratings (which suggests that they are less popular and taken less seriously). Using these observations, combined with observations regarding years, average ratings, and the director's name (certain directors only focus on a select few genres), we may be able to predict a movie's genre in a future project.
2. Do a movie's year, average rating, number of votes, and total rating, combined with the director's name and the genre, predict the movie's runtime?
This question relates to a quantitative outcome. During our summary data analysis, we determined that as time went on (as the years got more modern), movie runtimes generally got longer. It was also found that Comedy movies are much shorter in length than Drama and Action movies. We even noted that Action movies have a much higher minimum runtime than the rest. Using these observations about years and genres, along with observations regarding average ratings, numbers of votes, total ratings, and director trends, we should look into predicting a movie's runtime in the future.