Data Critique


Our dataset has 25 columns of descriptive information, including title, original title, tagline, overview, genres, keywords, release date, runtime, status, original language, spoken language, production companies, cast, budget, revenue, popularity, vote average, vote count, average rating, rating count, adult (18+), imdb_id, and director. These fields give us the material for our analysis and a basis for collecting additional information. We are currently working on adding the gender of each director to the dataset. We hope to investigate how a movie's popularity, revenue, genre, and director gender relate to one another and to the overarching themes in the movie descriptions.
Our dataset can show how factors like runtime, release year, and popularity relate to each other. It can also reveal trends across studios and languages, helping us understand how filmmaking and audience interests may have changed over time. Additionally, it may reveal connections between directors and the genres they tend to work in. It may also show which genres tend to have higher revenue or larger budgets. Finally, we may be able to see which movies are most popular based on the themes in their descriptions.
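As a rough illustration of what "relate to each other" could mean in practice, the sketch below computes a Pearson correlation between two numeric columns. The rows are made-up placeholder values, not figures from our dataset, and the function name `pearson` and the helper `paired_values` are our own hypothetical names.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a column is constant")
    return cov / sqrt(var_x * var_y)

def paired_values(rows, col_a, col_b):
    """Collect the two columns, skipping rows where either value is missing."""
    pairs = [(r[col_a], r[col_b]) for r in rows
             if r.get(col_a) is not None and r.get(col_b) is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]

# Placeholder rows for illustration only (NOT real values from the dataset).
rows = [
    {"runtime": 90, "popularity": 4.0},
    {"runtime": 120, "popularity": 9.5},
    {"runtime": None, "popularity": 3.0},  # skipped: runtime is missing
    {"runtime": 150, "popularity": 7.0},
]

runtimes, popularity = paired_values(rows, "runtime", "popularity")
print(round(pearson(runtimes, popularity), 3))
```

A value near 1 or -1 would suggest a strong linear relationship and a value near 0 a weak one. Either way, the number describes an association only and says nothing about why it exists.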
Our dataset cannot reveal why certain directors favor certain genres. It cannot reveal why certain directors, crews, or genres receive more funding than others. It also cannot reveal the reasons behind a movie's success and popularity, or its lack of popularity. The data may show correlations among these categories, but it does not contain the evidence needed to establish the underlying causes or to answer research questions about them. Future studies could try to include data on the reasons behind funding decisions.
The dataset was assembled for a capstone project. Its information was gathered from already established sources to provide context for the “Full MovieLens Dataset” and to support building various types of recommender systems.
The dataset draws on two original sources, TMDB and GroupLens. The movie details, credits, and keywords were collected from the TMDB Open API. The movie links and ratings were gathered from the official GroupLens website as part of an additional dataset. The original datasets are from “The Story of Film” and “Movie Recommender Systems.”
The data is in the public domain, as released by its creator. All of the information was also pulled from public domain sites. The creator of the dataset as a whole is Rounak Banik, a data science fellow at McKinsey & Company. Some of the questions he intended the data to answer are: Can movie revenue and/or movie success be predicted based on a certain metric? What movies tend to get higher vote counts and vote averages on TMDB?
The merged spreadsheet does not include the gender of directors. Although the first JSON file includes “Gender” encoded as numerical values, we may run into issues converting these values to male, female, or unknown categories. We are discussing this with the TA to determine how to proceed. Additionally, the dataset does not include the race or ethnicity of the directors, which limits the questions we can ask about effects on particular groups of people.
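One possible way to handle the conversion is sketched below. It rests on assumptions we have not yet confirmed: that the crew field is stored as a string containing a list of records with `job` and `gender` keys, and that the codes follow the convention 0 = unknown, 1 = female, 2 = male. The mapping `GENDER_LABELS`, the function `director_genders`, and the sample record are hypothetical and would need to be checked against the TMDB documentation and our actual file.

```python
import ast

# ASSUMED code-to-label mapping; verify against the TMDB documentation.
GENDER_LABELS = {0: "unknown", 1: "female", 2: "male"}

def director_genders(crew_field):
    """Return a gender label for each crew member whose job is 'Director'."""
    # Assumption: the field is a string holding a Python-style list of dicts.
    # If the file holds strict JSON instead, json.loads would be the parser to use.
    try:
        crew = ast.literal_eval(crew_field)
    except (ValueError, SyntaxError):
        return []  # unparseable or missing field: no directors found
    if not isinstance(crew, list):
        return []
    return [
        # Any code outside the assumed mapping falls back to "unknown".
        GENDER_LABELS.get(member.get("gender"), "unknown")
        for member in crew
        if isinstance(member, dict) and member.get("job") == "Director"
    ]

# Made-up sample record for illustration only (not taken from the dataset).
sample_crew = (
    "[{'name': 'Director A', 'job': 'Director', 'gender': 1}, "
    "{'name': 'Editor B', 'job': 'Editor', 'gender': 2}, "
    "{'name': 'Director C', 'job': 'Director', 'gender': 0}]"
)
print(director_genders(sample_crew))  # ['female', 'unknown']
```

Returning a list rather than a single label keeps movies with more than one credited director from being silently collapsed, and mapping unexpected codes to "unknown" avoids guessing a category we cannot support.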
Although we selected only one dataset, it was compiled from multiple sources: TMDB (through its Open API) and GroupLens. Because it was created for a capstone project, we do not expect any information to have been left out intentionally. The creator's intention was to dive into the history and story of cinema, so we see no obvious bias that would have led to information being omitted.
There was no use of AI for this.