May 29, 2020

Lights, Camera, KNN: Machine Learning for Movie Recommendations

Machine Learning KNN Python Scikit-learn
Written during BloomTech (formerly Lambda School), 2020. My first production ML application — where I learned to ship.
Movie collage

Do you spend countless hours trying to find the right movie to watch? The premise of this project is to predict movie recommendations based on user ratings for films rated 75 or higher. By inputting a movie title, we use the K-Nearest Neighbors algorithm to predict ten recommendations.

The Setup

The core stack: scikit-learn for the KNN model, FuzzyWuzzy for string matching, scipy for sparse matrices, and the MovieLens dataset for ratings data.

import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from fuzzywuzzy import fuzz

Data Preparation

We load the MovieLens dataset containing movies and ratings, then pivot the DataFrame to create a matrix of movie ratings by users. This matrix is converted to a sparse format — essential because most users haven't rated most movies, making the data naturally sparse.

# Pivot to set features: movies as rows, users as columns
features = ratings.pivot(
    index='movieId',
    columns='userId',
    values='rating'
).fillna(0)

# Convert to sparse matrix
matrix_movie_features = csr_matrix(features.values)

The Algorithm

K-Nearest Neighbors works by calculating distances between vectors — in our case, the distance between movie rating patterns. We use cosine similarity as the metric, which measures the angle between two vectors rather than their magnitude. This is particularly effective for recommendation systems where the pattern of ratings matters more than absolute values.

# Define and fit the model
model_knn = NearestNeighbors(
    metric='cosine',
    algorithm='brute',
    n_neighbors=10,
    n_jobs=-1
)
model_knn.fit(movie_user_matrix_sparse)

Fuzzy Matching

Since we're not using NLP, we employ FuzzyWuzzy — a fuzzy string matching library that calculates similarity ratios using the Levenshtein Distance algorithm. This handles typos and partial matches when users input movie titles.

The Results

The recommendation function takes a favorite movie, finds it in our matrix using fuzzy matching, then returns the 10 nearest neighbors based on cosine similarity of rating patterns. The output: ten films that users with similar taste also rated highly.

def recommendation(model_knn, data, mapper, favorite_movie, n=10):
    model_knn.fit(data)
    index = fuzzy_matcher(mapper, favorite_movie)
    distances, indices = model_knn.kneighbors(
        data[index], n_neighbors=n+1
    )
    # Reverse mapping and sort by distance
    raw = sorted(
        list(zip(indices.squeeze().tolist(),
                 distances.squeeze().tolist())),
        key=lambda x: x[1]
    )[:0:-1]
    return raw

Takeaway

As someone with a background in film, this was where data science and my domain expertise first intersected. The model isn't perfect — but it demonstrated that shipping a working ML application is about pragmatic engineering choices, not academic perfection. Build it, test it, ship it. That lesson carried forward into everything I've built since.

Original article on Medium · View the code

← All posts