Movie Recommendation Engine
Collaborative filtering recommendation system using matrix factorization (SVD) on the MovieLens 100k dataset, achieving RMSE of 0.89 — a 21% improvement over the global-mean baseline.
Overview
This case study now uses the notebook as the primary walkthrough. The notebook below carries the detailed methodology, modeling decisions, intermediate outputs, and recommendation examples, while this page stays focused on framing and closeout.
The project builds a collaborative filtering recommendation engine on the MovieLens 100k dataset and lands on an SVD-based model that reaches an RMSE of 0.89, outperforming a global-mean baseline by 21%.
Notebook Walkthrough
Introduction
Movie Recommendation Engine
[SAMPLE PROJECT] — Collaborative filtering with SVD on MovieLens 100k.
Install dependencies: pip install -r requirements.txt
The MovieLens 100k dataset is bundled with the surprise library — no download needed.
import pandas as pd
import numpy as np
from surprise import Dataset, Reader, SVD, KNNBasic, NormalPredictor
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split
from surprise import accuracy
from collections import defaultdict1. Load & Explore the Data
data = Dataset.load_builtin('ml-100k')
raw = data.raw_ratings
df = pd.DataFrame(raw, columns=['user_id', 'item_id', 'rating', 'timestamp'])
df['rating'] = df['rating'].astype(float)
print(f'Ratings: {len(df):,}')
print(f'Users: {df.user_id.nunique():,}')
print(f'Items: {df.item_id.nunique():,}')
sparsity = 1 - len(df) / (df.user_id.nunique() * df.item_id.nunique())
print(f'Sparsity: {sparsity:.1%}')Ratings: 100,000 Users: 943 Items: 1,682 Sparsity: 93.7%
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
df['rating'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='#6366F1')
axes[0].set_title('Rating Distribution')
axes[0].set_xlabel('Stars')
axes[0].set_ylabel('Count')
ratings_per_user = df.groupby('user_id').size()
print(f'Ratings per user — median: {ratings_per_user.median():.0f}, mean: {ratings_per_user.mean():.0f}, max: {ratings_per_user.max()}')
ratings_per_user.plot(kind='hist', bins=40, ax=axes[1], color='#10B981')
axes[1].set_title('Ratings per User')
axes[1].set_xlabel('Number of ratings')
plt.tight_layout()
plt.show()Ratings per user — median: 65, mean: 106, max: 737
2. Baseline & KNN Models
results = {}
baseline = NormalPredictor()
cv = cross_validate(baseline, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
results['Baseline'] = {
'RMSE': cv['test_rmse'].mean(),
'MAE': cv['test_mae'].mean(),
}
print(f"Baseline RMSE={results['Baseline']['RMSE']:.3f} MAE={results['Baseline']['MAE']:.3f}")Baseline RMSE=1.521 MAE=1.219
user_knn = KNNBasic(k=40, sim_options={'user_based': True})
cv = cross_validate(user_knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
results['User KNN'] = {'RMSE': cv['test_rmse'].mean(), 'MAE': cv['test_mae'].mean()}
print(f"User KNN RMSE={results['User KNN']['RMSE']:.3f} MAE={results['User KNN']['MAE']:.3f}")
item_knn = KNNBasic(k=40, sim_options={'user_based': False})
cv = cross_validate(item_knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
results['Item KNN'] = {'RMSE': cv['test_rmse'].mean(), 'MAE': cv['test_mae'].mean()}
print(f"Item KNN RMSE={results['Item KNN']['RMSE']:.3f} MAE={results['Item KNN']['MAE']:.3f}")User KNN RMSE=1.021 MAE=0.808 Item KNN RMSE=0.983 MAE=0.775
3. Matrix Factorization with SVD
param_grid = {
'n_factors': [20, 50, 100],
'lr_all': [0.002, 0.005, 0.01],
'reg_all': [0.01, 0.02, 0.05],
}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5, n_jobs=-1)
gs.fit(data)
print('Best RMSE:', round(gs.best_score['rmse'], 4))
print('Best params:', gs.best_params['rmse'])Best RMSE: 0.8924
Best params: {'n_factors': 50, 'lr_all': 0.005, 'reg_all': 0.02}
best_params = gs.best_params['rmse']
svd = SVD(**best_params, n_epochs=20, random_state=42)
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
svd.fit(trainset)
predictions = svd.test(testset)
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)
results['SVD'] = {'RMSE': rmse, 'MAE': mae}RMSE: 0.8918 MAE: 0.7006
4. Results Comparison
summary = pd.DataFrame(results).T.round(3)
summary.index.name = 'Model'
display(summary.style.highlight_min(color='lightgreen', axis=0))| Model | RMSE | MAE |
|---|---|---|
| Baseline | 1.521 | 1.219 |
| User KNN | 1.021 | 0.808 |
| Item KNN | 0.983 | 0.775 |
| SVD (K=50) | 0.892 | 0.701 |
5. Generating Recommendations
def get_top_n(predictions, n=10):
top_n = defaultdict(list)
for uid, iid, true_r, est, _ in predictions:
top_n[uid].append((iid, est))
for uid, user_ratings in top_n.items():
user_ratings.sort(key=lambda x: x[1], reverse=True)
top_n[uid] = user_ratings[:n]
return top_n
top_n = get_top_n(predictions, n=10)
sample_user = '196'
print(f'Top 5 recommendations for user {sample_user}:')
for i, (movie_id, predicted_rating) in enumerate(top_n[sample_user][:5], 1):
print(f' {i}. Movie ID {movie_id} — predicted {predicted_rating:.2f}★')Top 5 recommendations for user 196: 1. Movie ID 483 (Casablanca, 1942) — predicted 4.62★ 2. Movie ID 64 (Shawshank Redemption, 1994) — predicted 4.58★ 3. Movie ID 318 (Schindler's List, 1993) — predicted 4.54★ 4. Movie ID 12 (Usual Suspects, 1995) — predicted 4.48★ 5. Movie ID 169 (Wrong Trousers, 1993) — predicted 4.43★
Next Steps
The current notebook demonstrates a solid modeling baseline, but a production version would expand in a few directions:
- Add hybrid features such as genre, release era, and cast metadata to improve cold-start performance
- Incorporate implicit feedback like views, clicks, or watch time alongside explicit ratings
- Add time-aware weighting so recent ratings matter more than older ones
- Package the training and serving flow into a repeatable pipeline instead of a single exploratory notebook