Sample Project — This is a demonstration entry showing what portfolio case studies will look like. The scenario, data, and results are illustrative.

📓 Jupyter Notebook · February 20, 2024 · SAMPLE

Movie Recommendation Engine

Collaborative filtering recommendation system using matrix factorization (SVD) on the MovieLens 100k dataset, achieving an RMSE of 0.89, about 41% lower than the 1.521 RMSE of the notebook's random-prediction baseline (NormalPredictor).

Python · scikit-learn · pandas · Jupyter · Machine Learning

Overview

The notebook below is the primary walkthrough for this case study. It carries the detailed methodology, modeling decisions, intermediate outputs, and recommendation examples; this page covers only the framing and the next steps.

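The project builds a collaborative filtering recommendation engine on the MovieLens 100k dataset and settles on an SVD-based model with an RMSE of 0.89. That is about 41% below the 1.521 RMSE of the NormalPredictor baseline, which draws predictions at random from the training rating distribution, and about 9% below the best KNN model (0.983).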

Notebook Walkthrough

movie-recommendation-engine.ipynb · 🐍 Python 3

Sections: Introduction · 1. Load & Explore the Data · 2. Baseline & KNN Models · 3. Matrix Factorization with SVD · 4. Results Comparison · 5. Generating Recommendations

Introduction

Movie Recommendation Engine

[SAMPLE PROJECT] — Collaborative filtering with SVD on MovieLens 100k.

Install dependencies: pip install -r requirements.txt

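The MovieLens 100k dataset is a built-in dataset of the surprise library: Dataset.load_builtin('ml-100k') offers to download it on first use and caches it locally, so no manual download is needed.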

In [1]:

import pandas as pd
import numpy as np
from surprise import Dataset, Reader, SVD, KNNBasic, NormalPredictor
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split
from surprise import accuracy
from collections import defaultdict

1. Load & Explore the Data

In [2]:

data = Dataset.load_builtin('ml-100k')

raw = data.raw_ratings
df = pd.DataFrame(raw, columns=['user_id', 'item_id', 'rating', 'timestamp'])
df['rating'] = df['rating'].astype(float)

print(f'Ratings:  {len(df):,}')
print(f'Users:    {df.user_id.nunique():,}')
print(f'Items:    {df.item_id.nunique():,}')
sparsity = 1 - len(df) / (df.user_id.nunique() * df.item_id.nunique())
print(f'Sparsity: {sparsity:.1%}')

Ratings:  100,000
Users:    943
Items:    1,682
Sparsity: 93.7%

In [3]:

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

df['rating'].value_counts().sort_index().plot(kind='bar', ax=axes[0], color='#6366F1')
axes[0].set_title('Rating Distribution')
axes[0].set_xlabel('Stars')
axes[0].set_ylabel('Count')

ratings_per_user = df.groupby('user_id').size()
print(f'Ratings per user — median: {ratings_per_user.median():.0f}, mean: {ratings_per_user.mean():.0f}, max: {ratings_per_user.max()}')
ratings_per_user.plot(kind='hist', bins=40, ax=axes[1], color='#10B981')
axes[1].set_title('Ratings per User')
axes[1].set_xlabel('Number of ratings')

plt.tight_layout()
plt.show()

Ratings per user — median: 65, mean: 106, max: 737

2. Baseline & KNN Models

In [4]:

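results = {}  # model name -> mean cross-validated RMSE / MAE

# NormalPredictor draws each prediction at random from a normal distribution
# fitted to the training ratings, so it is a deliberately weak reference point.
baseline = NormalPredictor()
cv = cross_validate(baseline, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
results['Baseline'] = {
    'RMSE': cv['test_rmse'].mean(),
    'MAE':  cv['test_mae'].mean(),
}
print(f"Baseline  RMSE={results['Baseline']['RMSE']:.3f}  MAE={results['Baseline']['MAE']:.3f}")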

Baseline  RMSE=1.521  MAE=1.219
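
A constant global-mean predictor is another common reference point alongside the random baseline above. The sketch below is not part of the notebook: the helper name and the toy ratings are made up for illustration, and it only shows how the RMSE of such a baseline would be computed.

```python
import math

def global_mean_rmse(train_ratings, test_ratings):
    """Predict the training-set mean for every test rating and return (mean, RMSE)."""
    mu = sum(train_ratings) / len(train_ratings)
    squared_errors = [(r - mu) ** 2 for r in test_ratings]
    return mu, math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy ratings invented for illustration (not MovieLens data).
toy_train = [5, 4, 3, 4, 2, 5, 3, 4]
toy_test = [4, 2, 5, 3]

mu, rmse_value = global_mean_rmse(toy_train, toy_test)
print(f"global mean={mu:.2f}  RMSE={rmse_value:.3f}")
```

Because it always predicts the same value, this baseline's RMSE is roughly the spread of the ratings around their mean, which makes it a stricter yardstick than random predictions.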

In [5]:

# User-based KNN: predict from the ratings of the 40 most similar users.
# verbose=False suppresses the per-fold similarity-matrix log messages.
user_knn = KNNBasic(k=40, sim_options={'user_based': True}, verbose=False)
cv = cross_validate(user_knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
results['User KNN'] = {'RMSE': cv['test_rmse'].mean(), 'MAE': cv['test_mae'].mean()}
print(f"User KNN  RMSE={results['User KNN']['RMSE']:.3f}  MAE={results['User KNN']['MAE']:.3f}")

# Item-based KNN: predict from the user's ratings of the 40 most similar items.
item_knn = KNNBasic(k=40, sim_options={'user_based': False}, verbose=False)
cv = cross_validate(item_knn, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)
results['Item KNN'] = {'RMSE': cv['test_rmse'].mean(), 'MAE': cv['test_mae'].mean()}
print(f"Item KNN  RMSE={results['Item KNN']['RMSE']:.3f}  MAE={results['Item KNN']['MAE']:.3f}")

User KNN  RMSE=1.021  MAE=0.808
Item KNN  RMSE=0.983  MAE=0.775
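
To make the neighbourhood idea concrete, here is a minimal pure-Python sketch of a user-based prediction. It assumes a mean-squared-difference similarity and a similarity-weighted average of neighbour ratings, which is how KNNBasic's defaults are understood here; the function names and toy data are hypothetical and not taken from the notebook.

```python
def msd_similarity(ratings_u, ratings_v):
    """1 / (1 + mean squared difference) over the items both users rated."""
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    msd = sum((ratings_u[i] - ratings_v[i]) ** 2 for i in common) / len(common)
    return 1.0 / (1.0 + msd)

def predict_user_knn(ratings, user, item, k=2):
    """Similarity-weighted mean rating of the k most similar users who rated `item`."""
    neighbours = [
        (msd_similarity(ratings[user], ratings[other]), ratings[other][item])
        for other in ratings
        if other != user and item in ratings[other]
    ]
    top_k = sorted(neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return None  # no usable neighbours; a real system needs a fallback
    return sum(sim * r for sim, r in top_k) / total_sim

# Toy ratings invented for illustration.
toy = {
    "alice": {"m1": 5, "m2": 3},
    "bob":   {"m1": 5, "m2": 3, "m3": 4},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
}
prediction = predict_user_knn(toy, "alice", "m3")
print(f"predicted rating for alice on m3: {prediction:.2f}")
```

Bob agrees with Alice on every shared movie, so his rating dominates the prediction; Carol's dissimilar taste contributes only a small weight.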

3. Matrix Factorization with SVD

In [6]:

param_grid = {
    'n_factors': [20, 50, 100],
    'lr_all':    [0.002, 0.005, 0.01],
    'reg_all':   [0.01, 0.02, 0.05],
}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5, n_jobs=-1)
gs.fit(data)
print('Best RMSE:', round(gs.best_score['rmse'], 4))
print('Best params:', gs.best_params['rmse'])

Best RMSE: 0.8924
Best params: {'n_factors': 50, 'lr_all': 0.005, 'reg_all': 0.02}

In [7]:

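# Refit the best configuration and score it on a single 80/20 hold-out split.
# Note: the grid search above saw all ratings, so this hold-out score is an
# optimistic estimate rather than a fully independent test.
best_params = gs.best_params['rmse']
svd = SVD(**best_params, n_epochs=20, random_state=42)

trainset, testset = train_test_split(data, test_size=0.2, random_state=42)
svd.fit(trainset)
predictions = svd.test(testset)

# accuracy.rmse / accuracy.mae print the metric and return it.
rmse = accuracy.rmse(predictions)
mae  = accuracy.mae(predictions)
results['SVD'] = {'RMSE': rmse, 'MAE': mae}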

RMSE: 0.8918
MAE:  0.7006
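
The three tuned hyperparameters map onto a simple training loop. The following is a minimal pure-Python sketch of biased matrix factorization trained by stochastic gradient descent, assuming the usual prediction rule (global mean + user bias + item bias + dot product of latent factors). It is an illustration on invented toy data, not Surprise's implementation; `train_funk_svd` and its arguments are hypothetical names, with `n_factors`, `lr`, and `reg` playing the roles of `n_factors`, `lr_all`, and `reg_all`.

```python
import math
import random

def train_funk_svd(ratings, n_factors=2, lr=0.01, reg=0.02, n_epochs=200, seed=0):
    """Fit biases and latent factors by SGD; returns a predict(user, item) function."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    bu = {u: 0.0 for u in users}                       # user biases
    bi = {i: 0.0 for i in items}                       # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient step on the regularized squared error.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict

def rmse_of(predict, ratings):
    """Root mean squared error of `predict` over (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Toy (user, item, rating) triples invented for illustration.
toy = [("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 4),
       ("u2", "c", 1), ("u3", "b", 2), ("u3", "c", 5)]

untrained_rmse = rmse_of(train_funk_svd(toy, n_epochs=0), toy)
trained_rmse = rmse_of(train_funk_svd(toy, n_epochs=200), toy)
print(f"training RMSE before: {untrained_rmse:.3f}  after: {trained_rmse:.3f}")
```

A larger `n_factors` gives the model more capacity, `lr` controls the step size, and `reg` pulls biases and factors toward zero to limit overfitting on sparse data.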

4. Results Comparison

In [8]:

summary = pd.DataFrame(results).T.round(3)
summary.index.name = 'Model'
display(summary.style.highlight_min(color='lightgreen', axis=0))

Model	RMSE	MAE
Baseline	1.521	1.219
User KNN	1.021	0.808
Item KNN	0.983	0.775
SVD	0.892	0.701
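
The relative gains can be read straight off the table. This short sketch, which is not a notebook cell, recomputes each model's RMSE reduction against the Baseline row using the values printed above; the variable names are illustrative.

```python
# RMSE values copied from the results table above.
rmse_by_model = {
    "Baseline": 1.521,
    "User KNN": 1.021,
    "Item KNN": 0.983,
    "SVD": 0.892,
}

baseline_rmse = rmse_by_model["Baseline"]
reduction = {
    model: 1 - value / baseline_rmse
    for model, value in rmse_by_model.items()
    if model != "Baseline"
}

for model, fraction in reduction.items():
    print(f"{model:<9} RMSE reduction vs. baseline: {fraction:.1%}")
```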

5. Generating Recommendations

In [9]:

def get_top_n(predictions, n=10):
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n

top_n = get_top_n(predictions, n=10)

sample_user = '196'
print(f'Top 5 recommendations for user {sample_user}:')
for i, (movie_id, predicted_rating) in enumerate(top_n[sample_user][:5], 1):
    print(f'  {i}. Movie ID {movie_id} — predicted {predicted_rating:.2f}★')

Top 5 recommendations for user 196:
  1. Movie ID 483 (Casablanca, 1942) — predicted 4.62★
  2. Movie ID 64  (Shawshank Redemption, 1994) — predicted 4.58★
  3. Movie ID 318 (Schindler's List, 1993) — predicted 4.54★
  4. Movie ID 12  (Usual Suspects, 1995) — predicted 4.48★
  5. Movie ID 169 (Wrong Trousers, 1993) — predicted 4.43★
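
Two caveats apply to the cell above. It ranks predictions from the held-out test set, so every candidate is a movie the user has already rated, and the titles in the printed output need an ID-to-title lookup that the cell does not show. The sketch below illustrates ranking only unseen items and attaching titles. `recommend_unseen`, `predict`, `titles`, and the toy data are hypothetical stand-ins rather than the notebook's objects; in Surprise, `trainset.build_anti_testset()` is the usual way to generate the unseen user-item pairs.

```python
def recommend_unseen(predict, user, all_items, rated_by_user, titles, n=5):
    """Rank items the user has not rated by predicted score, highest first."""
    candidates = [item for item in all_items if item not in rated_by_user]
    ranked = sorted(candidates, key=lambda item: predict(user, item), reverse=True)
    return [
        (item, titles.get(item, "unknown title"), predict(user, item))
        for item in ranked[:n]
    ]

# Toy stand-ins invented for illustration.
toy_scores = {"1": 4.9, "2": 3.2, "3": 4.1, "4": 4.4}
toy_titles = {"2": "Hypothetical Film B", "4": "Hypothetical Film D"}

def toy_predict(user, item):
    # A trained model's predict call would go here.
    return toy_scores[item]

recommendations = recommend_unseen(
    toy_predict, user="196", all_items=["1", "2", "3", "4"],
    rated_by_user={"1", "3"}, titles=toy_titles, n=5,
)
for rank, (item_id, title, score) in enumerate(recommendations, 1):
    print(f"  {rank}. Movie ID {item_id} ({title}): predicted {score:.2f}")
```

Items "1" and "3" are skipped even though they score highly, because the user has already rated them.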

Next Steps

The current notebook demonstrates a solid modeling baseline, but a production version would expand in a few directions:

Add hybrid features such as genre, release era, and cast metadata to improve cold-start performance
Incorporate implicit feedback like views, clicks, or watch time alongside explicit ratings
Add time-aware weighting so recent ratings matter more than older ones
Package the training and serving flow into a repeatable pipeline instead of a single exploratory notebook
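
As one illustration of the time-aware weighting idea in the list above, a rating's influence could decay exponentially with its age. This is a sketch under assumptions, not a design the project has committed to: the one-year half-life, the function name, and the timestamps are all hypothetical.

```python
SECONDS_PER_DAY = 86_400

def recency_weight(rating_ts, now_ts, half_life_days=365.0):
    """Weight that halves every `half_life_days`; timestamps are Unix seconds."""
    age_days = max(0.0, (now_ts - rating_ts) / SECONDS_PER_DAY)
    return 0.5 ** (age_days / half_life_days)

now = 1_000_000_000  # arbitrary reference time for the example
fresh_weight = recency_weight(now, now)
year_old_weight = recency_weight(now - 365 * SECONDS_PER_DAY, now)
print(f"fresh: {fresh_weight:.2f}  one year old: {year_old_weight:.2f}")
```

Such weights could scale each rating's contribution to the training loss, with the half-life treated as one more hyperparameter to tune.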
