

This spring I worked with Brian Salter and Rich Jeffery to complete a semester-long project using relational databases to predict upcoming movie outcomes. We wanted to do something with an exceedingly large amount of data, just for fun, and decided that movies were a good example of something with large amounts of data that would fit well in relations.

We came up with a motivation for our research: using existing data on past movies and available data on upcoming movies, can we accurately predict how well upcoming movies will do? More specifically, using metrics of past movies such as cast, director, release date, rating, duration, subject matter, etc., can we accurately predict whether a movie will be 'good' or 'bad' (perhaps measured in ratings, box office sales, etc.)? What patterns in these metrics would ensure a quality prediction?

For our data, we used what anyone would expect us to use: the Internet Movie Database. IMDb is an exceedingly large database, and getting all of the data into a database for our analysis was a bit of a challenge. To extract all of the information from the packaged text data dumps that IMDb provides and store it in our database, we used an open source Python project called IMDbPY.

The IMDb user review score is the value we used as the criterion for a good movie. We evaluated the data concerning movies released between 20 in an attempt to train a perceptron (a classification model), and to use the perceptron to predict the IMDb user review score of new movies based on the chosen metrics. TL;DR: We were able to get a classifier working with an accuracy rate of over 60%, and ~80% with the threshold rating increased. Brian has started a GitHub project called PUMA (Probably Über Movie Advice) where you can see our classifier in action. It's an interesting project that we are considering improving on this summer and turning into some sort of hosted web app.
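For readers who haven't met a perceptron before, here is a minimal sketch of the kind of classifier this post describes: a movie counts as 'good' when its IMDb user rating meets a threshold, and the perceptron learns weights that separate good from bad. This is not the PUMA code. The feature values, the 7.0 threshold, and the toy rows are all made-up assumptions for illustration.

```python
# Minimal perceptron sketch. Everything here is illustrative: the features,
# the threshold, and the data are assumptions, not values from our project.

GOOD_THRESHOLD = 7.0  # hypothetical rating cutoff for a "good" movie


def label(rating, threshold=GOOD_THRESHOLD):
    """Turn an IMDb-style rating into a binary class: 1 = good, 0 = bad."""
    return 1 if rating >= threshold else 0


def predict(weights, bias, features):
    """Fire (return 1) when the weighted sum of the features is positive."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else 0


def train(samples, epochs=20, learning_rate=0.1):
    """Classic perceptron rule: nudge the weights only on misclassified rows."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in samples:
            error = target - predict(weights, bias, features)
            if error:
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
    return weights, bias


# Made-up rows: ([hypothetical numeric features in 0..1], rating)
toy_movies = [
    ([1.0, 0.9], 8.1),
    ([0.9, 0.8], 7.4),
    ([0.2, 0.1], 4.9),
    ([0.1, 0.3], 5.6),
]

samples = [(features, label(rating)) for features, rating in toy_movies]
weights, bias = train(samples)
accuracy = sum(predict(weights, bias, f) == t for f, t in samples) / len(samples)
print("training accuracy:", accuracy)
```

Raising `GOOD_THRESHOLD` changes which movies count as 'good', which is the knob the "threshold rating" in the TL;DR refers to. A toy set like this is trivially separable, so its accuracy says nothing about real movie data.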
