DoorDash System Design Interview: A Complete Guide
Designing a food delivery platform means coordinating digital systems with physical logistics. Unlike purely virtual services such as URL shorteners or chat applications, a logistics platform must account for traffic, weather, and unreliable GPS signals. The interview tests your ability to balance high-throughput data ingestion with low-latency, region-aware behavior, and to design a system resilient to real-world unpredictability.
The DoorDash System Design interview evaluates your ability to architect a three-sided marketplace involving customers, merchants, and dashers. You should demonstrate competence in real-time state management, geospatial indexing, and fault tolerance, and go beyond basic CRUD-style request handling to design systems that manage time-sensitive state machines.
The interview format and expectations
The DoorDash System Design interview is typically a 45- to 60-minute technical session that follows the behavioral or coding rounds. The prompt usually asks you to design a core component, such as the order placement backend or dispatch engine. Interviewers focus less on a perfect solution and more on scoping and load estimation. They evaluate your ability to navigate trade-offs between reliability, performance, and cost.
You should demonstrate a strong grasp of streaming events and idempotent design in this environment. These concepts are operational requirements for delivery platforms. Evaluators look for your ability to model time-sensitive state machines, such as the transition from preparation to pickup. You should also appropriately prioritize consistency versus availability based on the data path. Payment processing typically requires strong consistency and idempotent operations, while dasher location updates often tolerate eventual consistency.
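To make the payment-path requirement concrete, here is a minimal Python sketch of idempotent request handling keyed by a client-supplied idempotency key. The `PaymentService` class and its in-memory store are illustrative stand-ins for a real payment service backed by durable storage, not a description of DoorDash's actual implementation.

```python
import uuid

class PaymentService:
    """Toy idempotent payment capture keyed by a client-supplied idempotency key."""

    def __init__(self):
        # In-memory stand-in for a durable store of processed requests.
        self._processed: dict[str, dict] = {}

    def capture(self, order_id: str, amount_cents: int, idempotency_key: str) -> dict:
        # A retried request with the same key returns the original result
        # instead of charging the customer a second time.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]

        charge_id = str(uuid.uuid4())  # stand-in for a payment-gateway charge ID
        result = {"charge_id": charge_id, "order_id": order_id, "amount_cents": amount_cents}
        self._processed[idempotency_key] = result
        return result

svc = PaymentService()
first = svc.capture("order-42", 1899, idempotency_key="key-abc")
retry = svc.capture("order-42", 1899, idempotency_key="key-abc")
assert first["charge_id"] == retry["charge_id"]  # duplicate request is a no-op
```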
Tip: Ask clarifying questions early to confirm the priority. Determine if the goal is optimizing for user latency, dasher efficiency, or system scalability. Your assumptions guide your architecture and the subsequent questions.
The following diagram outlines the high-level interactions between customers, merchants, and dashers. It visualizes the problem’s primary scope.
Scoping the use case
Define the system’s boundaries before drawing any architecture. A typical prompt might ask you to design a backend for placing and tracking orders. Identify the primary actors, including the customer, dasher, merchant, and support team. Once the actors are defined, outline the end-to-end workflow and derive functional goals from each actor’s responsibilities. These goals include creating orders, assigning merchants, dispatching dashers, and tracking deliveries.
Non-functional requirements (NFRs) are equally important. Latency is critical for tracking updates where users expect fast feedback. Consistency is required for financial transactions. The system must reliably handle surge traffic during lunch or dinner rushes. State your focus explicitly, for example by narrowing the scope to real-time order placement and tracking. This framing prepares you to transition into load estimation.
Estimating load and traffic
Load estimation demonstrates the ability to derive infrastructure requirements from business metrics. Assume a daily active user (DAU) count of 20 million, 5 million orders per day, and roughly 2.5 million dashers concurrently online during peak windows. The traffic pattern combines a high write volume from dasher location updates with a high read volume from customer and merchant status checks. Dashers sending updates every 2 seconds create a significant write load.
If 2.5 million dashers are actively sending location updates every 2 seconds during peak windows, this results in roughly 1.25 million write operations per second. Customers and merchants polling every 5 seconds generate approximately 2 million read operations per second. At this scale, a standard relational database is likely to become a bottleneck on the high-frequency location-tracking hot path, motivating the use of in-memory caching and write-optimized, sharded storage for location data.
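A quick back-of-envelope script makes these figures reproducible. The 50-byte payload size, the 10 million concurrent pollers, and the simplification of sustaining the peak write rate all day are assumptions introduced here only to show how the numbers (including the ~5.4 TB daily-ingestion figure in the table below) can be derived.

```python
# Back-of-envelope load estimation matching the assumptions above.
concurrent_dashers = 2_500_000
update_interval_s = 2
write_qps = concurrent_dashers / update_interval_s       # 1,250,000 writes/sec

concurrent_watchers = 10_000_000   # customers + merchants polling (assumption)
poll_interval_s = 5
read_qps = concurrent_watchers / poll_interval_s         # 2,000,000 reads/sec

bytes_per_ping = 50                # dasher ID, lat/lng, timestamp (assumption)
seconds_per_day = 86_400
# Simplification: peak write rate sustained for a full day.
daily_ingestion_tb = write_qps * seconds_per_day * bytes_per_ping / 1e12

print(f"write QPS: {write_qps:,.0f}")
print(f"read QPS:  {read_qps:,.0f}")
print(f"daily location ingestion: ~{daily_ingestion_tb:.1f} TB")  # ~5.4 TB
```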
Watch out: Do not forget peak-hour multipliers. Food delivery traffic fluctuates significantly. Lunch and dinner rushes can generate traffic 2x to 3x the daily average. Capacity planning must account for these spikes.
The following table summarizes the estimated traffic and storage requirements for the tracking subsystem.
| Metric | Estimate | Implication |
|---|---|---|
| Daily Orders | 5 Million | Transactional order store required (often relational), with separate scaling for tracking telemetry |
| Concurrent Dashers | ~2.5 Million | High concurrency connection handling |
| Write QPS (Location) | ~1.25 Million | Requires a write-optimized ingestion pipeline (e.g., Kafka) and hot storage for latest state (e.g., Redis), with a durable high-write store (e.g., Cassandra) if historical retention is required |
| Read QPS (Tracking) | ~2 Million | Heavy caching layer needed |
| Daily Data Ingestion | ~5.4 TB (assumption-based) | Cold storage offloading is often needed for retained traces and analytics |
Based on these constraints, we can design a high-level architecture that supports this volume of data.
High-level architecture
The architecture must be decoupled to handle order management and real-time tracking requirements. Traffic enters through a Load Balancer that routes requests to an API Gateway. The gateway serves as the entry point for clients and handles authentication and rate limiting. It routes traffic to specific microservices. The core application layer consists of stateless services that handle business logic, such as order creation.
A dedicated Location Service ingests GPS pings and publishes them to an event stream such as Kafka. This decouples ingestion from processing, allowing the system to buffer surges. A WebSocket service (or SSE for server-to-client updates) maintains persistent connections with client apps for low-latency streaming updates. Data storage is tiered with Redis or Memcached, providing fast access to the latest locations. A persistent database stores order history while cold storage archives completed trips.
Real-world context: Large delivery platforms commonly use geo-partitioning, sharding services and data by region to keep traffic local and limit blast radius during regional incidents.
The diagram below illustrates how these components connect to form a cohesive system.
Real-time tracking subsystem deep dive
The real-time tracking subsystem demands high throughput and low latency. The flow begins when a dasher’s device sends a GPS coordinate to the Ingestion Service. This stateless service validates the request and adds metadata, such as the order ID. The data is forwarded to a Kafka topic partitioned by a stable key (such as order ID), potentially within a regional topic, so updates for the same delivery can be processed in order within a partition.
Location processing workers consume events to calculate ETAs and detect geofence crossings. The latest location is written to a Redis cluster with a short Time-to-Live (TTL). Historical location data quickly becomes irrelevant for live tracking. A WebSocket Fanout Service consumes location events (for example, from Kafka or a dedicated fanout topic) and pushes new coordinates to subscribed clients. This push-based model reduces server load compared to polling.
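A minimal sketch of such a worker, assuming a Kafka topic named `dasher-locations` and the kafka-python and redis-py client libraries; the message schema, key format, and 30-second TTL are illustrative choices rather than a prescribed design.

```python
import json
from kafka import KafkaConsumer   # pip install kafka-python
import redis                      # pip install redis

r = redis.Redis(host="localhost", port=6379)
consumer = KafkaConsumer(
    "dasher-locations",                       # topic name is an assumption
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

LATEST_LOCATION_TTL_S = 30  # live tracking only needs the latest point; stale entries expire

for message in consumer:
    ping = message.value  # e.g. {"order_id": ..., "dasher_id": ..., "lat": ..., "lng": ..., "ts": ...}
    key = f"loc:{ping['order_id']}"
    r.set(
        key,
        json.dumps({"lat": ping["lat"], "lng": ping["lng"], "ts": ping["ts"]}),
        ex=LATEST_LOCATION_TTL_S,
    )
    # Downstream steps (ETA recalculation, geofence checks, WebSocket fanout)
    # would be triggered here or handled by separate consumer groups.
```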
Note: Early delivery systems used HTTP polling, where the app requested driver location every few seconds. This consumed significant bandwidth. Modern systems use WebSockets or Server-Sent Events (SSE) to reduce repeated polling; WebSockets are bidirectional, while SSE streams updates from server to client over a long-lived connection.
Consider the following ingestion pipeline to visualize the data flow from the driver’s phone to the customer’s screen.
Order lifecycle state machine
Managing order state involves multiple parties. An order transitions through states such as Created, Confirmed, Preparing, Assigned, Picked Up, and Delivered. This flow is modeled as a Finite State Machine (FSM). Transitions are triggered by specific events, like a merchant sending a “Food Ready” signal.
You must ensure transitions are idempotent and atomic in a distributed system. The system must handle a signal idempotently, even if a network error causes a duplicate “Picked Up” request. Using a database like PostgreSQL with optimistic concurrency control allows the system to safely guard state transitions and prevent race conditions. Storing a history of state transitions aids debugging and customer support disputes.
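One way to express such a guarded transition is a conditional UPDATE. The sketch below assumes a hypothetical PostgreSQL `orders` table with `state` and `version` columns and an `order_state_history` audit table, accessed via psycopg2; schema and connection details are illustrative.

```python
import psycopg2  # pip install psycopg2-binary

# Legal predecessor states for each transition (simplified subset).
ALLOWED_PREDECESSORS = {
    "PICKED_UP": ("ASSIGNED",),
    "DELIVERED": ("PICKED_UP",),
}

def transition(conn, order_id: str, new_state: str) -> bool:
    predecessors = ALLOWED_PREDECESSORS[new_state]
    with conn, conn.cursor() as cur:
        # The WHERE clause enforces the legal predecessor state, so a duplicate
        # "Picked Up" request updates zero rows instead of corrupting the order.
        cur.execute(
            """
            UPDATE orders
               SET state = %s, version = version + 1, updated_at = now()
             WHERE id = %s AND state = ANY(%s)
            """,
            (new_state, order_id, list(predecessors)),
        )
        changed = cur.rowcount == 1
        if changed:
            # Append-only audit trail for debugging and support disputes.
            cur.execute(
                "INSERT INTO order_state_history (order_id, state) VALUES (%s, %s)",
                (order_id, new_state),
            )
        return changed  # False => duplicate or out-of-order event; safe to log and ignore

conn = psycopg2.connect("dbname=orders")  # connection string is an assumption
ok = transition(conn, "order-42", "PICKED_UP")
# A second, duplicate "Picked Up" event returns False and changes nothing.
```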
Watch out: Avoid orders that get stuck in a state. Implement background cron jobs or sweepers to scan for stalled orders. These jobs check for orders that have been in “Preparing” or “Searching for Dasher” states for too long. The system should trigger an alert or escalation workflow in these cases.
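A sweeper along these lines can be a short query run on a schedule (cron, a Celery beat task, or similar). The thresholds and the `orders` schema below are assumptions carried over from the earlier sketch.

```python
# Periodic sweeper that flags orders stuck in a state for too long.
STALL_THRESHOLDS = {
    "PREPARING": "25 minutes",
    "SEARCHING_FOR_DASHER": "10 minutes",
}

def find_stalled_orders(conn):
    stalled = []
    with conn, conn.cursor() as cur:
        for state, threshold in STALL_THRESHOLDS.items():
            cur.execute(
                """
                SELECT id FROM orders
                 WHERE state = %s AND updated_at < now() - %s::interval
                """,
                (state, threshold),
            )
            stalled.extend((state, row[0]) for row in cur.fetchall())
    return stalled  # feed into an alerting or escalation workflow
```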
The following diagram depicts the state transitions and triggers for each step.
Dasher matching and ETA estimation
The matching engine assigns orders to dashers. This latency-sensitive process relies on geospatial queries. The system triggers a matching event when an order is confirmed. The engine queries a geospatial index using technologies such as Redis Geo or Google S2 to find dashers within a given radius. The algorithm considers proximity, dasher mode, historical reliability, and estimated food prep time.
The system may batch orders to optimize efficiency in markets and scenarios where stacking reduces travel time and improves utilization. This involves assigning multiple deliveries from the same restaurant to a single driver. Optimization algorithms run asynchronously to determine these batches. The offer is pushed to the dasher with a strict timeout once a match is identified. The system retries with the next candidate if the dasher declines or the timer expires. Retry logic prevents orders from remaining unassigned.
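The retry mechanics can be sketched as a simple loop with a strict acceptance timeout. `send_offer` and `wait_for_response` below are hypothetical integration points (push delivery and a response channel) passed in as parameters; the 20-second timeout is illustrative.

```python
import time

OFFER_TIMEOUT_S = 20  # illustrative; real systems tune this per market

def dispatch_with_retries(order_id: str, ranked_candidates: list[str],
                          send_offer, wait_for_response) -> str | None:
    for dasher_id in ranked_candidates:
        send_offer(dasher_id, order_id)
        deadline = time.monotonic() + OFFER_TIMEOUT_S
        while time.monotonic() < deadline:
            response = wait_for_response(dasher_id, order_id, timeout_s=1.0)
            if response == "accept":
                return dasher_id          # assignment finalized elsewhere, atomically
            if response == "decline":
                break                     # move to the next candidate immediately
        # Timeout or decline: fall through to the next-best dasher.
    return None  # exhausted candidates; re-enqueue the order for matching
```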
Tip: Use geohashing or a hierarchical spatial index (such as S2) to partition location data into cells for efficient nearby-candidate queries. This avoids scanning the entire database. This technique is essential for scaling geospatial searches.
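As a concrete example of the filtering step, the following sketch uses Redis GEO commands (Redis 6.2+ and redis-py assumed); the key name, radius, and candidate limit are illustrative.

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

def record_dasher_location(dasher_id: str, lng: float, lat: float) -> None:
    # GEOADD stores members in a sorted set with a geohash-derived score.
    r.geoadd("dashers:geo", (lng, lat, dasher_id))

def nearby_dashers(restaurant_lng: float, restaurant_lat: float, radius_km: float = 5.0):
    # GEOSEARCH returns candidates within the radius, closest first,
    # avoiding a scan over every dasher in the system.
    return r.geosearch(
        "dashers:geo",
        longitude=restaurant_lng,
        latitude=restaurant_lat,
        radius=radius_km,
        unit="km",
        sort="ASC",
        count=50,
        withdist=True,
    )

record_dasher_location("dasher-17", -122.4194, 37.7749)
print(nearby_dashers(-122.4089, 37.7837))  # candidates ranked by distance
```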
Advanced subsystems for pricing and notifications
A complete system requires dynamic pricing and a reliable notification pipeline. Dynamic pricing influences supply and demand by adjusting delivery fees in near real-time. It compares active orders against available dashers in a specific geohash cell. A dedicated service aggregates metrics from order and location streams to calculate demand. This service periodically updates pricing configurations in the cache.
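A toy version of the per-cell calculation might look like the following; the thresholds, linear scaling, and cap are illustrative assumptions rather than production values.

```python
def surge_multiplier(active_orders: int, idle_dashers: int,
                     base: float = 1.0, cap: float = 2.5) -> float:
    """Compare demand (unassigned orders) with supply (idle dashers) in one geohash cell."""
    if idle_dashers == 0:
        return cap
    ratio = active_orders / idle_dashers
    if ratio <= 1.0:
        return base                      # supply covers demand, no surge
    # Scale linearly with the imbalance, but never beyond the cap.
    return min(base + 0.5 * (ratio - 1.0), cap)

print(surge_multiplier(active_orders=120, idle_dashers=40))   # 2.0
print(surge_multiplier(active_orders=30, idle_dashers=60))    # 1.0
```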
The notification pipeline serves as the communication link between the system and users. It handles millions of events per minute and distributes them to channels such as SMS, email, and push notifications. A decoupled architecture using message queues ensures delays in one channel do not block order processing. This subsystem handles user preferences and routing. It ensures users receive notifications via their preferred method without redundancy.
Handling failures and recovery
Failure is an expectation in distributed systems. Design for scenarios where components fail or networks partition. The system must atomically update the order status to “Unassigned” in the source-of-truth store if a dasher cancels mid-delivery. This should immediately re-trigger the matching workflow. A timeout event should auto-cancel or escalate orders if a merchant fails to confirm within a set window.
Resilience patterns like circuit breakers prevent cascading failures. A circuit breaker opens to fail requests fast if a payment gateway times out. This prevents the order service from stalling. Dead Letter Queues (DLQs) capture events that repeatedly fail processing, enabling later analysis or replay without blocking the main pipeline and preventing silent data loss.
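A minimal circuit-breaker sketch for an outbound dependency such as a payment gateway could look like the following; the failure threshold and reset timeout are illustrative.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker last tripped

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise RuntimeError("circuit open: failing fast")  # do not stall the order flow
            self.opened_at = None  # half-open: allow one trial request
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```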
Real-world context: Observability is essential for system stability. Engineers rely on tools like Prometheus and Grafana to monitor Service Level Indicators (SLIs). Metrics include “Matching Latency” or “Order Acceptance Rate.” On-call engineers are paged if these metrics dip below the agreed Service Level Objective (SLO).
DoorDash System Design interview questions
The prompts below are representative examples; structure responses by starting with the happy path and then layering in failure modes, scale constraints, and trade-offs.
1. Design DoorDash from scratch
This prompt tests end-to-end system design skills. Break the problem into subdomains, including the consumer app, merchant portal, and dasher app. Define the MVP flow for placing and delivering an order. Discuss the high-level architecture and emphasize the separation between the transactional order engine and the real-time tracking engine. Conclude by discussing database scaling using sharding based on city or region.
2. How would you design the Dasher assignment engine?
Focus on the trade-off between optimality and latency. A sufficient match found in milliseconds is often better than a perfect match found slowly. Discuss using geospatial indexing to filter candidates and a scoring algorithm to rank them. Mention the push-notification model for sending offers. Explain how to handle race conditions if multiple dashers can accept overlapping offers, ensuring only one assignment is finalized.
3. What happens if a dasher goes offline mid-delivery?
This scenario tests resilience design. Propose a heartbeat mechanism, explicit or inferred from location updates, to detect when a dasher is offline. The system marks the dasher as potentially offline if the expected heartbeat or location updates are missed. A “Rescue Dispatch” workflow assigns a new driver after a grace period. Mention the need for atomic conditional updates in the source-of-truth store to prevent assigning the order to two drivers simultaneously.
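A sketch of the final reassignment step, reusing the hypothetical PostgreSQL `orders` table from the state-machine example; the conditional WHERE clause ensures only one writer wins, so a rescue worker and a late-returning original dasher cannot both hold the order.

```python
def reassign_order(conn, order_id: str, offline_dasher_id: str, rescue_dasher_id: str) -> bool:
    with conn, conn.cursor() as cur:
        # Reassign only if the order is still attached to the unresponsive dasher.
        cur.execute(
            """
            UPDATE orders
               SET dasher_id = %s, state = 'ASSIGNED', updated_at = now()
             WHERE id = %s AND dasher_id = %s
            """,
            (rescue_dasher_id, order_id, offline_dasher_id),
        )
        return cur.rowcount == 1  # False => someone else already handled the order
```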
Tip: Avoid absolutes when discussing trade-offs. Explain the rationale behind technology choices. For example, state that Kafka was chosen over SQS for stream replayability despite the added operational complexity.
Conclusion
Designing a system like DoorDash requires considering physical logistics alongside software architecture. You must balance real-time tracking demands with transactional integrity in the core ordering flow. Understanding geo-partitioning, idempotent state machines, and resilient event-driven architectures demonstrates operational maturity.
Platforms may evolve to include additional delivery modalities, introducing new challenges in telemetry and routing. System Design prompts require architects to design a dynamic marketplace rather than just a database schema.