NYC Taxi Trip Patterns
Exploratory data analysis of 1.5M NYC yellow taxi trips uncovering temporal demand patterns, geographic hotspots, fare structures, and tip behavior.
Overview
New York City's yellow taxi system generates one of the richest public urban datasets in the world. Every trip is recorded and published monthly by the NYC Taxi & Limousine Commission. In this project, I analyzed 1.5 million trips from January 2023 to surface patterns in demand timing, geographic hotspots, fare economics, and tipping behavior.
The Dataset
The NYC TLC Trip Record Data includes yellow and green taxi trips across the city. Each row represents a single trip and contains:
- Pickup and dropoff timestamps
- Location IDs mapped to 263 named taxi zones
- Trip distance in miles
- Itemized fare breakdown (base, tip, tolls, surcharges)
- Payment type (credit card, cash, dispute, etc.)
- Passenger count
After removing trips with zero distance, negative fares, or pickups that precede dropoffs, the working dataset contained 1.47 million trips.
Methodology
The analysis was conducted entirely in Python. The pipeline:
- Load the January 2023 Parquet file (~150 MB compressed) with pandas
- Clean and derive features: hour of day, day of week, trip duration, fare per mile, tip percentage
- Aggregate and visualize with matplotlib and seaborn
- Summarize key statistics
The full script is in this repository at analysis.py.
Key Findings
Demand Patterns
Rush hour peaks dominate weekday mornings (7–9 AM) and evenings (5–8 PM). Friday evening recorded the single highest hour across the dataset at 62,400 trips. Saturday night from 11 PM–1 AM nearly matches weekday rush—driven by Manhattan's nightlife districts.
Sunday mornings are the quietest period, with volume falling to just 18% of Friday evening peak.
| Period | Trips/hour (avg) |
|---|---|
| Weekday morning rush (7–9 AM) | ~54,000 |
| Friday evening (5–8 PM) | ~62,000 |
| Saturday night (11 PM–1 AM) | ~58,000 |
| Sunday morning (8–10 AM) | ~11,000 |
Geographic Hotspots
Midtown Manhattan dominates pickup volume, but some non-obvious findings emerged:
- JFK Airport ranked #3 in pickup volume but #1 in average fare ($52.80), driven by the flat-rate pricing structure
- LaGuardia Airport showed a distinctive bimodal demand pattern aligned with flight arrival schedules
- East Village and Lower East Side were negligible during business hours but surged after 10 PM on weekends, outranking major Midtown zones
Fare Economics
The average fare per mile was $3.82, but this varied significantly by trip length:
| Distance band | Avg fare/mile |
|---|---|
| 0–2 miles | $5.20 |
| 2–10 miles | $3.45 |
| 10+ miles | $2.90 |
Short trips are inflated by the $3.00 initial charge and the minimum fare of $8.00. Long trips (mostly airport runs) benefit from flat-rate pricing that reduces per-mile cost.
Tip Behavior
Credit card users tipped an average of 22.3%. Cash tips are not recorded in the dataset, so this almost certainly understates true cash tip rates.
Late-night riders (11 PM–3 AM) tipped 3.2 percentage points above the daily average. This is consistent with anecdotal reports of more generous late-night behavior, though it may also reflect selection effects—late-night credit card users skew toward higher-income demographics.
Visualizations
The script produces six output charts:
- Hourly trip volume heatmap — day of week × hour of day, showing the full demand landscape at a glance
- Top 20 pickup zones — horizontal bar chart with named zones
- Fare per mile distribution — log-scale histogram revealing the bimodal structure (short vs. long trips)
- Tip percentage by hour — line chart tracking tip generosity across the day
- Trip distance vs. total fare scatter — annotated with the OLS regression line
- Payment type breakdown — pie chart by zone category (airport vs. Midtown vs. outer borough)
Conclusions
This dataset is a goldmine for urban analytics. The most actionable tactical insight for independent drivers: positioning in Midtown during the 5 PM rush, then migrating toward East Village and Lower East Side after 10 PM on weekends, maximizes both trip frequency and tip rate simultaneously.
For future work, I’d extend this to a full year to capture seasonal patterns and incorporate weather data to model demand elasticity—rain events are known to cause demand spikes that can overwhelm supply.
Technical Notes
- Python 3.11 · pandas 2.1 · matplotlib 3.8 · seaborn 0.13
- Data downloaded as a Parquet file from the NYC TLC open data portal
- Full runtime on an M2 MacBook Pro: approximately 45 seconds