Sample Project — This is a demonstration entry showing what portfolio case studies will look like. The scenario, data, and results are illustrative.

🐍 Python · January 15, 2024 · SAMPLE

NYC Taxi Trip Patterns

Exploratory data analysis of 1.5M NYC yellow taxi trips uncovering temporal demand patterns, geographic hotspots, fare structures, and tip behavior.

Python · pandas · matplotlib · seaborn · EDA

Overview

New York City's yellow taxi system generates one of the richest public urban datasets in the world. Every trip is recorded and published monthly by the NYC Taxi & Limousine Commission. In this project, I analyzed 1.5 million trips from January 2023 to surface patterns in demand timing, geographic hotspots, fare economics, and tipping behavior.

The Dataset

The NYC TLC Trip Record Data includes yellow and green taxi trips across the city. Each row represents a single trip and contains:

Pickup and dropoff timestamps
Location IDs mapped to 263 named taxi zones
Trip distance in miles
Itemized fare breakdown (base, tip, tolls, surcharges)
Payment type (credit card, cash, dispute, etc.)
Passenger count

After removing trips with zero distance, negative fares, or dropoffs that precede pickups, the working dataset contained 1.47 million trips.
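The three filters can be sketched as a single boolean mask. This is a minimal illustration, not the code from analysis.py: the column names (`trip_distance`, `fare_amount`, `tpep_pickup_datetime`, `tpep_dropoff_datetime`) are assumed from the public TLC yellow-taxi schema, and `clean_trips` is a hypothetical helper name.

```python
import pandas as pd


def clean_trips(trips: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that pass the three sanity checks (assumed column names)."""
    valid = (
        (trips["trip_distance"] > 0)        # drop zero-distance trips
        & (trips["fare_amount"] >= 0)       # drop negative fares
        # drop rows whose dropoff is earlier than the pickup
        & (trips["tpep_dropoff_datetime"] >= trips["tpep_pickup_datetime"])
    )
    return trips.loc[valid].reset_index(drop=True)


# Tiny synthetic frame: only the first row passes all three checks.
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-06 17:05", "2023-01-06 17:10",
        "2023-01-06 17:20", "2023-01-06 18:00",
    ]),
    "tpep_dropoff_datetime": pd.to_datetime([
        "2023-01-06 17:25", "2023-01-06 17:30",
        "2023-01-06 17:40", "2023-01-06 17:50",
    ]),
    "trip_distance": [2.4, 0.0, 1.1, 3.0],
    "fare_amount": [14.5, 5.0, -8.0, 12.0],
})
cleaned = clean_trips(sample)
print(len(cleaned), "of", len(sample), "rows kept")
```

Building one mask and applying it once keeps the rule set easy to audit; each condition could also be counted separately to report how many rows each filter removes.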

Methodology

The analysis was conducted entirely in Python. The pipeline:

Load the January 2023 Parquet file (~150 MB compressed) with pandas
Clean and derive features: hour of day, day of week, trip duration, fare per mile, tip percentage
Aggregate and visualize with matplotlib and seaborn
Summarize key statistics

The full script is in this repository at analysis.py.
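As a rough sketch of the load and feature-derivation steps, something like the following would work. It is an illustration under stated assumptions rather than an excerpt from analysis.py: the column names follow the public TLC yellow-taxi schema, `load_trips` and `add_features` are hypothetical helper names, and tip percentage is assumed to mean tip divided by base fare.

```python
import pandas as pd


def load_trips(path: str) -> pd.DataFrame:
    """Read one monthly trip file; the path is supplied by the caller."""
    return pd.read_parquet(path)


def add_features(trips: pd.DataFrame) -> pd.DataFrame:
    """Derive hour, weekday, duration, fare per mile, and tip percentage."""
    out = trips.copy()
    pickup = out["tpep_pickup_datetime"]
    out["hour"] = pickup.dt.hour
    out["day_of_week"] = pickup.dt.day_name()
    out["duration_min"] = (
        (out["tpep_dropoff_datetime"] - pickup).dt.total_seconds() / 60
    )
    # Non-positive denominators become NaN instead of producing inf.
    distance = out["trip_distance"].where(out["trip_distance"] > 0)
    fare = out["fare_amount"].where(out["fare_amount"] > 0)
    out["fare_per_mile"] = out["fare_amount"] / distance
    out["tip_pct"] = 100 * out["tip_amount"] / fare  # assumption: tip / base fare
    return out


# Synthetic single-trip example (January 6, 2023 was a Friday).
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(["2023-01-06 17:05"]),
    "tpep_dropoff_datetime": pd.to_datetime(["2023-01-06 17:25"]),
    "trip_distance": [2.5],
    "fare_amount": [15.0],
    "tip_amount": [3.0],
})
featured = add_features(sample)
print(featured[["hour", "day_of_week", "duration_min", "fare_per_mile", "tip_pct"]])
```

`load_trips` is defined but not called here because the example runs on synthetic data; in practice it would be pointed at the downloaded Parquet file.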

Key Findings

Demand Patterns

Rush-hour peaks dominate weekday mornings (7–9 AM) and evenings (5–8 PM). Friday evening recorded the single highest hour across the dataset at 62,400 trips. Saturday night from 11 PM–1 AM nearly matches the weekday rush, likely driven by Manhattan's nightlife districts.

Sunday mornings are the quietest period, with volume falling to about 18% of the Friday evening peak.

Period	Trips/hour (avg)
Weekday morning rush (7–9 AM)	~54,000
Friday evening (5–8 PM)	~62,000
Saturday night (11 PM–1 AM)	~58,000
Sunday morning (8–10 AM)	~11,000
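A table like the one above can be derived from a day-of-week by hour count matrix. The sketch below counts trips per slot rather than averaging them, assumes the pickup column is named `tpep_pickup_datetime` as in the public TLC schema, and uses a hypothetical helper name, `hourly_demand`.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def hourly_demand(trips: pd.DataFrame) -> pd.DataFrame:
    """Return a 7 x 24 matrix of trip counts (rows: weekday, columns: hour)."""
    pickup = trips["tpep_pickup_datetime"]
    counts = (
        trips.assign(day=pickup.dt.day_name(), hour=pickup.dt.hour)
        .groupby(["day", "hour"])
        .size()
        .unstack(fill_value=0)
    )
    # Force a full, ordered grid so empty slots show up as zeros.
    return counts.reindex(index=DAY_ORDER, columns=list(range(24)), fill_value=0)


# Synthetic pickups: three on Friday evening, one on Sunday morning.
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-06 17:05", "2023-01-06 17:40",
        "2023-01-06 18:10", "2023-01-08 09:00",
    ]),
})
demand = hourly_demand(sample)
peak = demand.to_numpy().max()
sunday_share = demand.loc["Sunday", 9] / peak  # quiet slot relative to the peak
print(demand.loc[["Friday", "Sunday"], [9, 17, 18]])
```

The same matrix feeds directly into a heatmap, and dividing any cell by the peak gives the kind of relative comparison quoted for Sunday mornings.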

Geographic Hotspots

Midtown Manhattan dominates pickup volume, but some non-obvious findings emerged:

JFK Airport ranked #3 in pickup volume but #1 in average fare ($52.80), driven by the flat-rate pricing structure
LaGuardia Airport showed a distinctive bimodal demand pattern aligned with flight arrival schedules
East Village and Lower East Side were negligible during business hours but surged after 10 PM on weekends, outranking major Midtown zones
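Comparing a zone's rank by volume with its rank by average fare, as in the JFK finding, comes down to one grouped aggregation. The following is a sketch only: `PULocationID` and `fare_amount` are assumed from the public TLC schema, `zone_summary` is a hypothetical helper, and the zone IDs in the toy data are placeholders rather than real taxi zones.

```python
import pandas as pd


def zone_summary(trips: pd.DataFrame) -> pd.DataFrame:
    """Pickup count, mean fare, and both ranks for each pickup zone."""
    summary = trips.groupby("PULocationID").agg(
        pickups=("fare_amount", "size"),
        avg_fare=("fare_amount", "mean"),
    )
    # Rank 1 = highest value; ties share the smallest rank.
    summary["volume_rank"] = (
        summary["pickups"].rank(ascending=False, method="min").astype(int)
    )
    summary["fare_rank"] = (
        summary["avg_fare"].rank(ascending=False, method="min").astype(int)
    )
    return summary.sort_values("pickups", ascending=False)


# Placeholder zones: 10 is busy and cheap, 20 is quieter but expensive.
sample = pd.DataFrame({
    "PULocationID": [10, 10, 10, 20, 20, 30],
    "fare_amount": [10.0, 12.0, 14.0, 50.0, 54.0, 8.0],
})
zones = zone_summary(sample)
print(zones)
```

Joining the result to the TLC zone lookup table on the location ID would attach the named zones used in the charts.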

Fare Economics

The average fare per mile was $3.82, but this varied significantly by trip length:

Distance band	Avg fare/mile
0–2 miles	$5.20
2–10 miles	$3.45
10+ miles	$2.90

Per-mile fares on short trips are inflated by the $3.00 initial charge and the minimum fare of $8.00. Long trips (mostly airport runs) benefit from flat-rate pricing that reduces per-mile cost.
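The distance bands can be reproduced with `pd.cut`. This sketch assumes the band edges shown in the table, averages per-trip ratios (rather than dividing total fare by total distance), and uses the assumed TLC column names `trip_distance` and `fare_amount`; `fare_per_mile_by_band` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

BAND_EDGES = [0, 2, 10, np.inf]
BAND_LABELS = ["0-2 miles", "2-10 miles", "10+ miles"]


def fare_per_mile_by_band(trips: pd.DataFrame) -> pd.Series:
    """Mean of per-trip fare/mile within each distance band."""
    per_mile = trips["fare_amount"] / trips["trip_distance"]
    # Intervals are right-closed: (0, 2], (2, 10], (10, inf).
    bands = pd.cut(trips["trip_distance"], bins=BAND_EDGES, labels=BAND_LABELS)
    return per_mile.groupby(bands, observed=False).mean()


# Synthetic trips with already-cleaned (positive) distances.
sample = pd.DataFrame({
    "trip_distance": [1.0, 2.0, 5.0, 20.0],
    "fare_amount": [8.0, 10.0, 20.0, 60.0],
})
by_band = fare_per_mile_by_band(sample)
print(by_band)
```

Averaging per-trip ratios weights every trip equally, which lets very short trips pull the first band upward; a distance-weighted figure would be a reasonable alternative.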

Tip Behavior

Credit card users tipped an average of 22.3%. Cash tips are not recorded in the dataset, so this figure describes card payments only and says nothing about how cash riders tip; any tip average computed across all payment types would understate true tipping.

Late-night riders (11 PM–3 AM) tipped 3.2 percentage points above the daily average. This is consistent with anecdotal reports of more generous late-night behavior, though it may also reflect selection effects: late-night credit card users may skew toward higher-income demographics.
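The late-night comparison can be sketched as follows. The assumptions are that `payment_type == 1` marks credit card payments (as in the public TLC data dictionary), that tip percentage is tip divided by base fare, and that "daily average" means the mean over all card trips; `late_night_tip_gap` is a hypothetical helper.

```python
import pandas as pd

CREDIT_CARD = 1                 # assumed payment_type code for card payments
LATE_NIGHT_HOURS = [23, 0, 1, 2]  # pickups from 11 PM up to 3 AM


def late_night_tip_gap(trips: pd.DataFrame) -> float:
    """Late-night mean tip % minus overall mean tip %, card trips only."""
    card = trips[(trips["payment_type"] == CREDIT_CARD)
                 & (trips["fare_amount"] > 0)]
    tip_pct = 100 * card["tip_amount"] / card["fare_amount"]
    is_late = card["tpep_pickup_datetime"].dt.hour.isin(LATE_NIGHT_HOURS)
    return tip_pct[is_late].mean() - tip_pct.mean()


# Synthetic trips: the cash row (payment_type 2) is excluded from the averages.
sample = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime([
        "2023-01-06 23:15", "2023-01-06 12:00",
        "2023-01-06 14:30", "2023-01-06 09:00", "2023-01-07 01:00",
    ]),
    "payment_type": [1, 1, 1, 1, 2],
    "fare_amount": [10.0, 10.0, 20.0, 10.0, 10.0],
    "tip_amount": [3.0, 2.0, 4.0, 1.0, 0.0],
})
gap = late_night_tip_gap(sample)
print(f"Late-night gap: {gap:.1f} percentage points")
```

Restricting to card payments before averaging matters because cash rows carry a recorded tip of zero and would otherwise drag both averages down.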

Visualizations

The script produces six output charts:

Hourly trip volume heatmap — day of week × hour of day, showing the full demand landscape at a glance
Top 20 pickup zones — horizontal bar chart with named zones
Fare per mile distribution — log-scale histogram revealing the bimodal structure (short vs. long trips)
Tip percentage by hour — line chart tracking tip generosity across the day
Trip distance vs. total fare scatter — annotated with the OLS regression line
Payment type breakdown — pie chart by zone category (airport vs. Midtown vs. outer borough)
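For the scatter chart, the regression line can be obtained with a degree-1 least-squares fit before plotting. This is one possible approach using `numpy.polyfit`, not necessarily what analysis.py does; `fit_fare_line` is a hypothetical helper and the data below is synthetic.

```python
import numpy as np


def fit_fare_line(distance, total_fare):
    """Ordinary least-squares fit of total fare on distance.

    Returns (slope, intercept): dollars per mile and the fitted fare at 0 miles.
    """
    slope, intercept = np.polyfit(distance, total_fare, deg=1)
    return float(slope), float(intercept)


# Synthetic points lying exactly on fare = 3.00 + 2.50 * miles.
miles = np.array([1.0, 2.0, 3.0, 4.0])
fares = 3.0 + 2.5 * miles
slope, intercept = fit_fare_line(miles, fares)

# Points for drawing the fitted line across the observed distance range.
line_x = np.array([miles.min(), miles.max()])
line_y = intercept + slope * line_x
print(f"fare ~ {intercept:.2f} + {slope:.2f} * miles")
```

The two endpoints in `line_x` and `line_y` are enough to draw and annotate the line on top of the scatter.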

Conclusions

This dataset is a goldmine for urban analytics. The most actionable insight for independent drivers: positioning in Midtown during the 5 PM rush, then moving toward the East Village and Lower East Side after 10 PM on weekends, targets the periods and zones where both trip frequency and tip rate were highest.

For future work, I’d extend this to a full year to capture seasonal patterns and incorporate weather data to model demand elasticity—rain events are known to cause demand spikes that can overwhelm supply.

Technical Notes

Python 3.11 · pandas 2.1 · matplotlib 3.8 · seaborn 0.13
Data downloaded as a Parquet file from the NYC TLC open data portal
Full runtime on an M2 MacBook Pro: approximately 45 seconds
