Hey, IMDB, Accurate Data Entry is Important for Representation!

Caleb Elgut
Jun 22, 2020

I recently completed my first end-of-module project in my data analysis course at The Flatiron School. My project partner and I explored questions about film data that is available online via IMDB, Rotten Tomatoes, The Movie Database, and Box Office Mojo. It was a delightful assignment that I took great joy in! One of the questions I was able to analyze concerned the representation of non-English languages in films and what that representation means for investors’ ROI.

ROI stands for Return on Investment. It is a percentage that tells an investor whether a film was worth their time and money. It is an especially useful data point when judging a film’s financial success because it puts independent filmmakers on the same scale as major studios. A blockbuster with a budget of $100M can earn a profit of $200M and still post an ROI of only 200%. Meanwhile, an indie film like Paranormal Activity can reach an ROI of nearly 20,000% because it made almost $90M on a $450,000 budget. Learning in my course how to use Python to analyze the ROI of categories like genre and language was a delight for me, and I was excited to determine the ROI for non-English language films.
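For readers who want to check the arithmetic, here is a minimal sketch of the calculation. It assumes ROI is defined as profit divided by budget, expressed as a percentage; the function name `roi_percent` is hypothetical, and the dollar figures are the rounded ones quoted above.

```python
def roi_percent(budget: float, gross: float) -> float:
    """Return on investment as a percentage: profit divided by budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    profit = gross - budget
    return profit / budget * 100


# A $100M blockbuster that clears a $200M profit (so $300M in gross revenue).
blockbuster = roi_percent(budget=100_000_000, gross=300_000_000)

# An indie film grossing roughly $90M on a $450,000 budget.
indie = roi_percent(budget=450_000, gross=90_000_000)

print(f"blockbuster ROI: {blockbuster:,.0f}%")
print(f"indie ROI: {indie:,.0f}%")
```

Because the budget sits in the denominator, a small film with a modest gross can outrank a blockbuster by a wide margin.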

I was disappointed to discover that IMDB’s data entry methods leave out many films and filmmakers by allowing the language field to remain empty for many of the films entered into its database. The dataset I worked with to find information about the prevalence of languages was a truncated version of the publicly available imdb_title_aka data frame, which holds information such as a film’s language and the region in which it was produced. The original data frame held 331,703 rows; however, only 41,715 rows contained a language. For those of you keeping score at home, that’s about 12.6%. It’s worth noting, additionally, that the number of rows containing a region was 278,410, which comes out to about 84%.
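The shares can be recomputed directly from the row counts quoted above. This is only a quick arithmetic check; the variable names are my own.

```python
# Row counts quoted in the paragraph above.
total_rows = 331_703
rows_with_language = 41_715
rows_with_region = 278_410

# Share of rows where each field is filled in, as a percentage.
language_share = rows_with_language / total_rows * 100
region_share = rows_with_region / total_rows * 100

# Everything that is not filled in is missing.
missing_language_share = 100 - language_share

print(f"language filled in: {language_share:.1f}%")
print(f"region filled in:   {region_share:.1f}%")
print(f"language missing:   {missing_language_share:.1f}%")
```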

It is here that I would like to point out the obvious: region/country does not equal language! This concept is especially relevant in ethnically, culturally, and racially diverse countries such as the United States, the United Kingdom, Germany, and more! As the world shrinks due to globalization, language is a topic we should be more aware of now than ever. In the United States, we are filling out the census this year to provide accurate data for our government so that communities are represented accurately and cared for accordingly for the next ten years.

I understand that the needs met by an accurate census dwarf the issue of underrepresented filmmaker communities. I would posit, however, that representation matters in all fields of life. Precise representation of communities that speak a specific language cannot happen when about 87% of the rows of a data frame that is supposed to give us this information do not contain a value for language.

After joining my data frame with title information to a table that had information about budget & gross (and therefore profitability), the number of rows dwindled to 10,608.

It is quite common to join two data frames with an inner join and lose rows in the process. An inner join combines two tables on a column that they have in common. The only rows we keep are those where the value in that column of Table A matches a value in the same column of Table B; rows without a match on either side are dropped. In this case, I inner-joined the two tables on the column “movie”, so the resulting table only contains movies that appear in both of the original tables. There were only 10,608 films in common between the two tables that I joined together.
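To make the row loss concrete, here is a small sketch in pandas. The two toy tables and every column name other than “movie” are hypothetical; they only stand in for the title table and the budget table described above.

```python
import pandas as pd

# Hypothetical stand-in for the title table (language may be missing).
titles = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C"],
    "language": ["fr", None, "he"],
})

# Hypothetical stand-in for the budget and gross table.
budgets = pd.DataFrame({
    "movie": ["Film B", "Film C", "Film D"],
    "budget": [1_000_000, 250_000, 5_000_000],
    "gross": [3_000_000, 900_000, 4_000_000],
})

# Inner join: keep only movies present in BOTH tables.
inner = titles.merge(budgets, on="movie", how="inner")

# An outer join with indicator=True labels where each row came from,
# which shows exactly which movies the inner join dropped.
audit = titles.merge(budgets, on="movie", how="outer", indicator=True)
dropped = audit[audit["_merge"] != "both"]

print(inner)
print(dropped[["movie", "_merge"]])
```

Checking the dropped rows this way is a cheap habit that shows how much data a join discards before any analysis starts.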

10,608 rows is still a sizable sample. Under normal circumstances, losing rows to the join would not be such an adverse outcome. However, the problem came when I checked how many rows did not have a value in their language column. Of the 10,608 rows, 9,067 did not have a language assigned to them. That means 85% of the rows were missing an entry for their language.
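A check like this takes only a couple of lines in pandas. The sketch below uses a tiny made-up table; the frame name `joined` and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical joined table where most films have no language recorded.
joined = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C", "Film D", "Film E"],
    "language": ["en", None, None, "fr", None],
})

# Boolean mask: True where the language cell is empty.
missing = joined["language"].isna()

missing_count = int(missing.sum())        # number of rows without a language
missing_share = missing.mean() * 100      # the same figure as a percentage

# Keep only the rows that can be used for a language analysis.
with_language = joined[~missing]

print(f"{missing_count} of {len(joined)} rows ({missing_share:.0f}%) lack a language")
print(with_language)
```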

This situation reduced my sample for language-ROI analysis to 1,541 rows from which to examine a film’s language and the resulting ROI. After accounting for movies with multiple languages entered, the number of individual language entries was 2,571. The English language took up over 40% of this table, with over 1,110 rows. Among the non-English languages, only French (545), Hebrew (425), Turkish (365), and Swedish (114) appeared at least 100 times.
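One way to count films with several languages is to give each language its own row. The sketch below assumes the languages are stored as a comma-separated string, which may not match the real dataset; the table and its values are made up.

```python
import pandas as pd

# Hypothetical table; the comma-separated format is an assumption.
films = pd.DataFrame({
    "movie": ["Film A", "Film B", "Film C"],
    "language": ["en, fr", "he", "en"],
})

# Split each string into a list, then give every language its own row.
per_language = films.assign(
    language=films["language"].str.split(",")
).explode("language")

# Remove the whitespace left over from splitting on commas.
per_language["language"] = per_language["language"].str.strip()

# How often each language appears once multi-language films are expanded.
counts = per_language["language"].value_counts()

print(per_language)
print(counts)
```

Note that a film listed under two languages is counted once for each, which is why the number of language entries can exceed the number of films.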

When calculating the ROI of the non-English languages, I was able to find some pretty exciting results! Hebrew, Turkish, French, and Swedish all had an ROI of over 150%, with Hebrew and Turkish coming in at around 200%. These numbers suggest that investing in films made in a non-English language can be quite lucrative! This result could be an argument for increasing the number of languages into which films in the USA are dubbed and subtitled. At the end of the day, though, this suggestion can be left to further research.
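A per-language ROI summary can be sketched with a pandas groupby. The numbers below are invented, ROI is assumed to be profit divided by budget, and the choice of the median as the summary statistic is also an assumption rather than a description of the original analysis.

```python
import pandas as pd

# Hypothetical films with one language each; budgets and grosses are made up.
films = pd.DataFrame({
    "language": ["en", "en", "fr", "fr", "he"],
    "budget": [100, 200, 50, 100, 40],
    "gross": [150, 500, 150, 200, 120],
})

# ROI per film as a percentage: profit divided by budget.
films["roi_pct"] = (films["gross"] - films["budget"]) / films["budget"] * 100

# Summarize by language; the count shows how much data backs each figure.
summary = films.groupby("language")["roi_pct"].agg(["count", "median"])

print(summary)
```

Keeping the count next to the median makes it easy to spot languages whose ROI rests on only a handful of films.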

I want to rejoice over such results; however, the sample size is much smaller than I believe it should be. There were a great many languages left out due to data entry issues.

We could explain this issue away as user error, since entering a film on IMDB requires the filmmaker, or someone on their team, to enter the film’s information into the database. However, if this is the case, then it is on IMDB to do a better job of verifying the entries. On their site, they state that when adding a title to their database, one must wait for approval after submission.

I believe further research needs to be done by others on how IMDB carries out this approval. IMDB should probably make language a mandatory piece of data for filmmakers to report when entering their film into the database. This approach could allow for more representation. If there are languages that are not as prevalent as others, accurate data entry will let us know! Perhaps filmmakers could take chances on making films in those languages as they see that maybe the market share for that language is quite low, but possibly the ROI is high. Accurate data entry leads to representation; IMDB needs to do better.
