OpenAI System Design Interview: A step-by-step guide
Designing infrastructure for OpenAI differs from building a standard e-commerce backend. Traditional System Design interviews often emphasize data consistency, request throughput, and general scalability concerns. An interview for an LLM-centric role focuses on probabilistic model behavior and massive GPU resource constraints. You must manage the economic and latency trade-offs of generating intelligence in real time. Success requires demonstrating architectures that are LLM-aware and aligned with safety protocols.
The following diagram illustrates the high-level architecture of an LLM inference system. It highlights the flow from user request to model serving.
Step 1: Clarifying scope, latency, and user requirements
Candidates often fail by diving into distributed systems theory without establishing context. In an OpenAI System Design interview, upfront constraints dictate the architecture. You must determine if you are building a real-time chat interface or a batch summarization job. Real-time chat requires streaming and low latency. Batch jobs prioritize throughput over speed. Clarifying target models is critical. Designing for GPT-4 requires different GPU provisioning than a lighter model such as GPT-3.5 Turbo because of its larger size, memory footprint, and lower per-GPU throughput.
Define performance metrics and safety requirements explicitly. Ask if the system needs token-by-token streaming responses using Server-Sent Events (SSE). Safety is a core non-functional requirement in this domain with direct functional implications. Clarify if the system requires a moderation layer to filter toxicity before the prompt reaches the model. Establish infrastructure needs regarding usage tracking and billing quotas. Determine if the system requires a prompt cache or an embedding layer.
Tip: Explicitly ask about the user tier when clarifying scope. Designing for free users often involves aggressive caching and smaller models. Enterprise tiers require strict SLAs and dedicated GPU capacity.
A clear, concise response clarifies specific details. State that you are designing a real-time summarization API for third-party developers. Each call should target a time-to-first-token (TTFT) of under 500ms. The system must support up to 4,000 context tokens and output a single paragraph. You must enforce strict quota limits and log all interactions for safety review. This level of specificity demonstrates engineering maturity and familiarity with LLM-specific constraints.
Estimate computational load and cost drivers next.
Step 2: Estimating load, model behavior, and cost drivers
Token economics govern scalable LLM systems rather than just request volume. Interviewers expect calculations for throughput and cost based on token usage. Start by estimating the request volume. Assume 100,000 daily active users generate 300,000 requests per day. If each request consumes roughly 4,000 total tokens (input plus output), you achieve a throughput of 1.2 billion tokens per day.
Peak traffic is often estimated as roughly 10% of daily volume occurring within a single hour. Under this assumption, you would need to support approximately 8.3 requests per second during peak hours, which at 4,000 tokens per request is roughly 33,000 tokens per second on average. A model like GPT-4 might generate tens of tokens per second per request, depending on hardware and configuration. Provision GPU capacity with headroom above that average, on the order of 100,000 to 150,000 tokens per second, to absorb short bursts several times the peak-hour mean.
Cost is a primary architectural constraint in generative AI. Estimate inference cost using standard pricing models. A 1.2 billion-token day could cost on the order of tens of thousands of dollars based on published GPT-4 pricing, depending on the input/output mix. This calculation justifies architectural optimizations like caching layers or tiered model serving. You might route simpler queries to GPT-3.5 or implement early exit strategies.
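To make the token math concrete, the short sketch below reproduces these estimates in Python. The per-token prices and the 3:1 input-to-output split are illustrative assumptions, not current published rates.

```python
# Back-of-the-envelope token and cost math (illustrative numbers only).
REQUESTS_PER_DAY = 300_000
TOKENS_PER_REQUEST = 4_000            # input + output
INPUT_RATIO = 0.75                    # assumed 3:1 input-to-output split

# Assumed example prices per 1M tokens; substitute current published rates.
PRICE_INPUT_PER_M = 30.0
PRICE_OUTPUT_PER_M = 60.0

tokens_per_day = REQUESTS_PER_DAY * TOKENS_PER_REQUEST        # 1.2B tokens
peak_rps = (REQUESTS_PER_DAY * 0.10) / 3600                   # ~8.3 RPS
avg_peak_tps = peak_rps * TOKENS_PER_REQUEST                  # ~33K TPS

input_tokens = tokens_per_day * INPUT_RATIO
output_tokens = tokens_per_day - input_tokens
daily_cost = (input_tokens / 1e6) * PRICE_INPUT_PER_M \
           + (output_tokens / 1e6) * PRICE_OUTPUT_PER_M

print(f"Tokens/day: {tokens_per_day:,.0f}")
print(f"Peak RPS: {peak_rps:.1f}, average peak TPS: {avg_peak_tps:,.0f}")
print(f"Estimated daily inference cost: ${daily_cost:,.0f}")
```

Running numbers like these in the interview is what justifies the caching and tiering decisions that follow.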
Account for the latency overhead of safety checks. Every prompt passing through a moderation model can add hundreds of milliseconds of latency, depending on deployment and batching. This directly impacts your Time to First Token (TTFT) targets.
Watch out: Avoid confusing requests per second (RPS) with tokens per second (TPS). TPS determines GPU saturation and cost in LLM systems. RPS determines your API gateway scaling.
The following table summarizes the key differences between standard web-scale and LLM-scale metrics.
| Metric | Standard Web App | LLM / GenAI App |
|---|---|---|
| Primary Unit | Requests / API Calls | Tokens (Input + Output) |
| Bottleneck | Database I/O, Network | GPU Compute, VRAM Bandwidth |
| Latency Target | < 100ms (Total) | Hundreds of ms (First Token), Seconds (Total) |
| Cost Driver | Data Transfer, Storage | Inference Compute |
Sketch a high-level architecture that supports these constraints.
Step 3: High-level System Design architecture
A baseline system for OpenAI-style inference requires a clean flow from the client to the model. The architecture typically begins with a Client Request hitting an API Gateway. This gateway handles standard authentication and rate limiting. The request passes to a Prompt Moderation layer to ensure safety compliance. It then enters a Prompt Router.
The router directs traffic to a multi-tiered model pool. It sends complex queries to GPT-4 and lighter tasks to GPT-3.5. A Streaming Response Handler manages the response. The system delivers content to the client while logging and usage metering happen asynchronously.
The Prompt Router and Inference Tiering are critical components. The router should make decisions based on user tier and prompt characteristics. A free-tier user might be routed to a distilled model to save costs. Rate limiting must be token-aware. Use a Redis-backed, token-aware leaky bucket algorithm rather than raw request counts.
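As a rough illustration of token-aware rate limiting, here is a minimal leaky-bucket sketch backed by Redis. The capacity, leak rate, and key layout are assumptions, and a production version would wrap the read-modify-write in a Lua script to make it atomic.

```python
# Minimal sketch of a token-aware leaky bucket backed by Redis.
# Limits are expressed in tokens per window, not raw request counts.
import time
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

CAPACITY_TOKENS = 100_000     # bucket size per user (assumed tier limit)
LEAK_RATE_TPS = 1_000         # tokens drained per second

def allow_request(user_id: str, requested_tokens: int) -> bool:
    """Return True if the user's token budget can absorb this request."""
    key = f"bucket:{user_id}"
    now = time.time()
    state = r.hgetall(key)
    level = float(state.get("level", 0.0))
    last = float(state.get("ts", now))

    # Drain the bucket according to elapsed time since the last update.
    level = max(0.0, level - (now - last) * LEAK_RATE_TPS)

    if level + requested_tokens > CAPACITY_TOKENS:
        return False  # over budget: throttle or queue the request

    r.hset(key, mapping={"level": level + requested_tokens, "ts": now})
    r.expire(key, 3600)
    return True
```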
Streaming support is essential to achieving low perceived latency. Design the system to return token-by-token responses using Server-Sent Events (SSE). Emit the first token as soon as the inference engine generates it.
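A minimal SSE endpoint might look like the following FastAPI sketch, where `generate_tokens` is a hypothetical stand-in for the inference client.

```python
# Minimal Server-Sent Events streaming sketch with FastAPI.
import asyncio
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(prompt: str):
    # Placeholder: a real implementation would stream from the inference engine.
    for token in ["Summarizing", " your", " document", "..."]:
        await asyncio.sleep(0.05)   # simulate per-token generation latency
        yield token

@app.get("/v1/stream")
async def stream(prompt: str):
    async def event_stream():
        async for token in generate_tokens(prompt):
            # SSE framing: each chunk is prefixed with "data:" and ends with a blank line.
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```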
Real-world context: Companies like OpenAI and Anthropic use semantic routing. A small model classifies the difficulty of a prompt before routing it. A poem request goes to a cheaper model. A code debugging request goes to the flagship model.
Detail how data flows through these components.
Step 4: Prompt flow, moderation, and LLM wrapping
The prompt lifecycle involves significant processing before reaching a GPU. The flow begins with preprocessing to trim whitespace and normalize formatting. The system then injects the user input into a prompt template to structure the request. This stage also validates the input to reduce the risk of basic prompt injection attacks. The input then undergoes safety moderation.
Run text through a specialized moderation endpoint to detect hate speech or jailbreak attempts. The system must reject flagged content immediately. Log the incident with redacted metadata for future analysis.
Handle the output with care once the prompt reaches the inference engine. The inference engine streams tokens that must be monitored in real time. Output moderation re-scans the generated response to catch policy violations or unsafe content. Post-processing cleans the formatting and appends citations if using Retrieval-Augmented Generation (RAG).
Wrap the content in a structured JSON format with metadata. This structured wrapping helps improve system safety and resilience. It handles cases where underlying models behave unpredictably.
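A minimal sketch of this wrapping step is shown below; the envelope fields and the `passes_output_moderation` helper are illustrative assumptions rather than a fixed schema.

```python
# Sketch of wrapping raw model output in a structured JSON envelope.
import json
import uuid
from datetime import datetime, timezone

def passes_output_moderation(text: str) -> bool:
    # Placeholder: call your output moderation endpoint here.
    return "forbidden" not in text.lower()

def wrap_completion(model: str, prompt_tokens: int, completion_text: str,
                    citations: list[str] | None = None) -> str:
    flagged = not passes_output_moderation(completion_text)
    envelope = {
        "id": f"resp_{uuid.uuid4().hex}",
        "created": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "flagged": flagged,
        # Never return raw text for flagged content; return nothing instead.
        "content": None if flagged else completion_text,
        "citations": citations or [],
        "usage": {"prompt_tokens": prompt_tokens,
                  "completion_tokens": len(completion_text.split())},
    }
    return json.dumps(envelope)

print(wrap_completion("gpt-4", 128, "Here is your summary.", ["doc-17"]))
```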
Historical note: Early LLM deployments often lacked robust output moderation. Users could trick models into bypassing safety filters. Modern architectures treat the model as an untrusted component by verifying both its inputs and outputs.
Implement intelligent caching and routing strategies to optimize for cost.
Step 5: Caching, token efficiency, and routing
Tokens equate to dollars in an LLM system. Propose a two-layer caching strategy. The first layer is an exact-match cache keyed by the prompt’s hash. This handles repeat queries, such as those hitting public tools. The second layer is a semantic cache.
Use a vector database to store embeddings of previous prompts. The system identifies near-duplicate queries based on cosine similarity. Return the stored response if a user asks a sufficiently similar question and the cached response is still valid for the current context. This saves the cost of a full inference run.
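The sketch below illustrates the two-layer lookup under simplifying assumptions: an in-memory dictionary and list stand in for the exact-match cache and the vector database, and `embed` is a placeholder for a real embedding model.

```python
# Sketch of a two-layer prompt cache: exact match by hash, then semantic
# lookup by cosine similarity.
import hashlib
import numpy as np

SIMILARITY_THRESHOLD = 0.95   # keep this high for code or math prompts

exact_cache: dict[str, str] = {}                     # prompt hash -> response
semantic_store: list[tuple[np.ndarray, str]] = []    # (embedding, response)

def embed(text: str) -> np.ndarray:
    # Placeholder embedding; use a real embedding model in practice.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

def lookup(prompt: str) -> str | None:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in exact_cache:                           # layer 1: exact match
        return exact_cache[key]
    query = embed(prompt)                            # layer 2: semantic match
    for vec, response in semantic_store:
        if float(np.dot(query, vec)) >= SIMILARITY_THRESHOLD:
            return response
    return None

def store(prompt: str, response: str) -> None:
    exact_cache[hashlib.sha256(prompt.encode()).hexdigest()] = response
    semantic_store.append((embed(prompt), response))
```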
Token optimization and routing further enhance performance. Truncating context or using windowed summarization reduces the number of input tokens. Model routing should be dynamic based on input length or priority tier. Offload low-priority traffic during peak hours.
A confidence threshold router can auto-detect low-complexity prompts. Route them to a faster model like GPT-4o-mini. Reserve heavy-duty GPT-4 models for complex reasoning tasks.
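A simple heuristic version of such a router might look like the sketch below; the keyword hints, length threshold, and tier rules are illustrative assumptions.

```python
# Heuristic routing sketch: send low-complexity prompts to a cheaper model.
REASONING_HINTS = ("prove", "debug", "refactor", "step by step", "analyze")

def route(prompt: str, user_tier: str) -> str:
    long_prompt = len(prompt.split()) > 500
    needs_reasoning = any(h in prompt.lower() for h in REASONING_HINTS)
    if user_tier == "enterprise" or long_prompt or needs_reasoning:
        return "gpt-4"          # flagship model for complex work
    return "gpt-4o-mini"        # cheap, fast model for simple prompts

print(route("Write a haiku about the ocean", "free"))           # gpt-4o-mini
print(route("Debug this race condition step by step", "pro"))   # gpt-4
```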
The following diagram visualizes a semantic caching workflow using vector embeddings.
Tip: Semantic caching is powerful but carries risks. Set a high similarity threshold for code generation or math queries. Small differences in the prompt can require vastly different answers.
User privacy, like efficiency, requires dedicated architectural consideration.
Step 6: Data privacy, logging, and safe retention
Privacy and safety are foundational to OpenAI infrastructure. Implement a flexible logging service that supports opt-in mechanisms for data retention. This ensures sensitive user data is not inadvertently used for model training. The architecture should include a redaction filter that masks PII. Logging pipelines must be asynchronous to avoid latency impacts.
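As one possible shape for the redaction filter, the sketch below masks a few common PII patterns before logs are written; the regexes are illustrative, not an exhaustive PII policy.

```python
# Sketch of a PII redaction filter applied before logs are persisted.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII with labeled placeholders before logging."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

print(redact("Contact me at jane@example.com or 415-555-0100."))
```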
Data retention policies and access controls are important. Define Time-To-Live (TTL) settings for prompts and responses. Raw text logs should auto-expire after a set period while retaining metadata. Role-Based Access Control (RBAC) is essential.
Only safety reviewers should have access to flagged outputs. Engineering teams should only see aggregate or anonymized logs. This separation allows for security investigations without compromising user privacy.
Watch out: Never log the full prompt payload in application metrics. High-cardinality text data inflates storage costs and can violate privacy policies. Use structured logs for text and metrics for counts.
Establish robust observability to monitor system health.
Step 7: Observability, metrics, and LLM monitoring
Observability for LLMs demands tracking signals that traditional web systems ignore. You must still monitor standard API availability and latency, but the health of an AI system is also measured by token usage and model behavior. Track Time to First Token (TTFT) to measure perceived latency. Monitor Time Per Output Token (TPOT) for GPU generation speed.
Tracking the ratio of prompt tokens to completion tokens helps analyze cost efficiency. A sudden spike in completion length might indicate a jailbreak attempt or a shift in user behavior.
Safety and business metrics are part of the observability stack. Monitor the percentage of responses flagged as unsafe. Track the rate of jailbreak attempts and cache hit rates. Business analytics should break down usage by tier.
Tools like Prometheus and Grafana must be fed with custom metrics. Tracing is vital for debugging. Assign a unique trace ID to every request so you can follow a prompt from the gateway, through moderation, routing, and any embedding or retrieval layer, to the inference engine and back.
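A sketch of how these custom metrics might be exported with `prometheus_client` is shown below; the metric names, labels, and buckets are illustrative choices.

```python
# Sketch of LLM-specific metrics exported via prometheus_client.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

TTFT = Histogram("llm_time_to_first_token_seconds",
                 "Time from request arrival to first streamed token",
                 ["model", "tier"],
                 buckets=(0.1, 0.25, 0.5, 1.0, 2.0, 5.0))
TPOT = Histogram("llm_time_per_output_token_seconds",
                 "Average generation time per output token",
                 ["model"])
TOKENS = Counter("llm_tokens", "Tokens processed",
                 ["model", "direction"])   # direction: prompt | completion

def record_request(model: str, tier: str, ttft: float,
                   prompt_tokens: int, completion_tokens: int,
                   generation_seconds: float) -> None:
    TTFT.labels(model=model, tier=tier).observe(ttft)
    TPOT.labels(model=model).observe(generation_seconds / max(completion_tokens, 1))
    TOKENS.labels(model=model, direction="prompt").inc(prompt_tokens)
    TOKENS.labels(model=model, direction="completion").inc(completion_tokens)

if __name__ == "__main__":
    start_http_server(9100)     # expose /metrics for Prometheus to scrape
    while True:
        record_request("gpt-4", "pro", random.uniform(0.2, 0.8),
                       3000, 400, random.uniform(5, 15))
        time.sleep(1)
```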
The following image depicts an observability dashboard tailored for LLM operations.
Real-world context: Engineers often maintain golden prompt sets, which are standardized evaluation prompts run against the system periodically. They detect regressions in model quality that standard metrics like latency miss.
Design for resilience against specific LLM failure modes.
Step 8: Failure modes, timeouts, and guardrails
LLM systems are prone to unique failures, such as non-deterministic timeouts. GPT-4 may take too long to generate a complex completion. Implement a timeout with a fallback strategy. Serve a cached response if generation exceeds a strict limit.
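A minimal version of this timeout-plus-fallback pattern is sketched below, where `call_model` and `cached_answer` are hypothetical stand-ins for the inference client and cache lookup, and the timeout value is shortened for the demo.

```python
# Sketch of a generation timeout with a cached fallback response.
import asyncio

GENERATION_TIMEOUT_S = 2.0   # short value for the demo; tune per SLA

async def call_model(prompt: str) -> str:
    await asyncio.sleep(5)           # simulate an unusually slow completion
    return "full model answer"

def cached_answer(prompt: str) -> str:
    return "Here is a shorter cached summary while the system is busy."

async def answer(prompt: str) -> str:
    try:
        return await asyncio.wait_for(call_model(prompt), GENERATION_TIMEOUT_S)
    except asyncio.TimeoutError:
        # Strict limit exceeded: degrade gracefully instead of failing.
        return cached_answer(prompt)

print(asyncio.run(answer("Explain the history of distributed consensus")))
```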
GPU resource exhaustion occurs during traffic bursts. Adaptive throttling is necessary in this scenario. The system should gradually slow down low-priority traffic. Queue requests to preserve capacity for premium users.
Guardrails against prompt overload are critical for stability. Reject prompts that exceed the context window immediately. Attempt to summarize the input before inference if feasible and capacity permits. A moderation overlay should post-process all outputs.
This secondary filter acts as a final safety gate if a model generates unsafe content. It aborts the response before it reaches the client. Logging these failures with trace IDs allows for failure replay, so engineers can debug the specific prompts that triggered aborted or failed responses.
Tip: Implement graceful degradation strategies. Route traffic to a faster model like GPT-3.5 if the primary model is overloaded. Notify the user that they are receiving a lite response.
Apply these concepts to specific interview questions.
OpenAI System Design interview questions and answers
These questions simulate real-world scenarios. They test your ability to combine infrastructure knowledge with product intuition.
1. Design a ChatGPT-style product that streams responses in real time
This question tests your understanding of low-latency design. Propose an architecture that uses Server-Sent Events (SSE) or WebSockets. Push tokens to the client as they are generated. The design must prioritize a low Time-to-First-Token (TTFT).
The flow involves a client request hitting an API Gateway. It passes through a moderation layer and reaches an inference server. A circular buffer between the generation loop and the connection handler can efficiently push token chunks to the client. Input moderation runs before generation, and output moderation should run incrementally or in parallel with the stream; blocking on post-hoc moderation of the full response defeats the streaming experience.
2. Design an internal prompt management tool for OpenAI engineers
The interviewer is looking for internal tooling design. Core components should include a Prompt Registry Service. Include a playground UI for live evaluation. Advocate for a Git-style version control system for prompts.
Allow engineers to branch, test, and merge prompt changes. The system needs to track metadata such as the owner and cost impact. Highlight traceability as a key feature. Connect every prompt version to its historical token usage and output quality.
3. Design a vector search system to support Retrieval-Augmented Generation (RAG)
This is a high-priority topic given the rise of RAG applications. Design a pipeline that ingests documents and chunks them into manageable sizes. Generate embeddings using an API like OpenAI `text-embedding-3`. Store these embeddings in a Vector Database.
The retrieval flow involves converting a user query into a vector. Perform a cosine similarity search to find relevant chunks. Re-rank them for precision. Inject these chunks into the LLM context window. Discuss the trade-offs of chunk size and re-indexing latency.
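The sketch below illustrates the retrieval flow under simplifying assumptions: an in-memory chunk list replaces the vector database, and `embed` is a placeholder for a real embedding model such as `text-embedding-3`.

```python
# Minimal RAG retrieval sketch: embed the query, rank chunks by cosine
# similarity, and build the context for the LLM prompt.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding; swap in a real embedding model in practice.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

# Pre-chunked documents with precomputed embeddings.
chunks = ["Refunds are processed within 5 days.",
          "Shipping takes 2-4 business days.",
          "Support is available 24/7 via chat."]
chunk_vectors = np.stack([embed(c) for c in chunks])

def retrieve(query: str, top_k: int = 2) -> list[str]:
    q = embed(query)
    scores = chunk_vectors @ q                # cosine similarity (unit vectors)
    best = np.argsort(scores)[::-1][:top_k]   # rank chunks by score
    return [chunks[i] for i in best]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long do refunds take?"))
```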
4. How would you enforce safety and moderation for code-generation use cases?
This question targets ethical awareness and domain knowledge. Code generation poses risks, including the creation of malware. Your design should include a dual-layer moderation system. Scan the input prompt for malicious intent.
Output code should undergo static analysis to detect dangerous patterns. Propose a dynamic sandbox for high-security environments. Execute the code in an isolated container to verify behavior. Emphasize the feedback loop in which community reporting helps retrain safety classifiers.
5. What is your plan for handling inference overload during peak GPT-4 traffic?
This tests your ability to manage resource prioritization. The strategy should revolve around load shedding and model tiering. Prioritize Pro users and enterprise clients. Route free-tier traffic to GPT-3.5 or a cached response layer.
Implement token-based rate limiting to prevent capacity monopolization. Propose using an exponential backoff queue for retries. Auto-scale GPU clusters based on inference queue depth rather than CPU load.
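One simple way to express this prioritization is a queue-depth-aware admission function, sketched below with illustrative thresholds and tier names.

```python
# Sketch of priority-aware load shedding keyed to inference queue depth.
QUEUE_SOFT_LIMIT = 500     # start shedding free-tier traffic
QUEUE_HARD_LIMIT = 2000    # only enterprise traffic keeps GPT-4 access

def admit(user_tier: str, queue_depth: int) -> str:
    if queue_depth >= QUEUE_HARD_LIMIT:
        return "gpt-4" if user_tier == "enterprise" else "cached_or_gpt-3.5"
    if queue_depth >= QUEUE_SOFT_LIMIT and user_tier == "free":
        return "cached_or_gpt-3.5"     # shed low-priority load first
    return "gpt-4"

print(admit("free", 800))          # cached_or_gpt-3.5
print(admit("enterprise", 2500))   # gpt-4
```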
Conclusion
Mastering the OpenAI System Design interview requires a shift in how you view compute. You are managing tokens, probability, and safety rather than just database connections. Strong candidates anchor their designs in the reality of LLM economics. Optimize for first-token latency and leverage semantic caching.
Implement rigorous guardrails to prevent abuse. The field is evolving toward more autonomous agents and multimodal inputs. Designing scalable, ethically robust systems will define future infrastructure leaders. Walk into the room ready to calculate token math and defend your trade-offs.