System Design Interview Questions
1. Fundamentals and Approach
How do you approach a system design interview question? [core]
- What are the steps you follow when given an open-ended design problem?
- How do you handle ambiguity in requirements during a design interview?
- How do you prioritize features when you cannot build everything?
How do you gather and clarify requirements for a system? [core]
- What is the difference between functional and non-functional requirements?
- How do you estimate scale and traffic requirements from scratch?
- What questions do you ask to determine read-heavy vs write-heavy workloads?
How do you estimate scale (users, requests per second, storage)? [core]
- How do you calculate requests per second from daily active users (DAU)?
- How do you estimate storage needs for a system like Instagram?
- What is a back-of-the-envelope calculation and how do you practice it?
What are the key trade-offs to consider in any system design? [core]
- How do you decide between consistency and availability?
- When do you trade latency for throughput?
- How do you evaluate build vs buy decisions?
How do you define SLAs, SLOs, and SLIs for a system? [core]
- What is the difference between SLA, SLO, and SLI?
- How do you set a realistic uptime SLA (e.g., 99.9% vs 99.99%)?
- What does 99.99% availability mean in terms of downtime per year?
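As an illustration for the estimation and availability questions in this section, here is a minimal Python sketch of the arithmetic. The input figures (10 million DAU, 20 requests per user per day, a peak factor of 2) and the function names are assumptions chosen for the example, not recommended values.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY


def average_rps(daily_active_users: int, requests_per_user_per_day: float) -> float:
    """Average requests per second implied by DAU and per-user activity."""
    return daily_active_users * requests_per_user_per_day / SECONDS_PER_DAY


def peak_rps(avg_rps: float, peak_factor: float = 2.0) -> float:
    """Rough peak estimate; the peak factor is an assumption, not a measurement."""
    return avg_rps * peak_factor


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per 365-day year allowed by an availability target."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * SECONDS_PER_YEAR / 60


if __name__ == "__main__":
    avg = average_rps(10_000_000, 20)  # hypothetical: 10M DAU, 20 requests each
    print(f"average RPS: {avg:,.0f}, peak RPS: {peak_rps(avg):,.0f}")
    for target in (99.9, 99.99):
        print(f"{target}% -> {downtime_minutes_per_year(target):.1f} min/year")
```

Under these assumptions the average load is roughly 2,300 requests per second. A 99.9% target allows about 525.6 minutes (under 9 hours) of downtime per year, and 99.99% allows about 52.6 minutes.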
2. Scalability
What is scalability and why does it matter in system design? [core]
- What is the difference between horizontal scaling and vertical scaling?
- What is the ceiling of vertical scaling?
- What does it mean for a system to scale linearly?
What is horizontal vs vertical scaling and when do you use each? [core]
- What are the cost implications of horizontal vs vertical scaling?
- Which cloud services support automatic horizontal scaling?
- When does horizontal scaling introduce complexity that vertical scaling avoids?
What is database sharding and how does it help with scalability? [core]
- What are the different sharding strategies (range-based, hash-based, directory-based)?
- What is the hot shard problem and how do you solve it?
- How do you handle cross-shard queries?
- What happens when you need to re-shard a database?
How do you scale a relational database? [core]
- What is a read replica and how does it offload read traffic?
- What is connection pooling and why is it important at scale?
- When should you move from a relational database to a NoSQL database for scale?
What is the role of a load balancer in a scalable system? [core]
- What load balancing algorithms exist (round robin, least connections, consistent hashing)?
- What is the difference between a hardware load balancer and a software load balancer?
- How do you ensure the load balancer itself doesn't become a single point of failure?
What is consistent hashing and where is it used? [advanced]
- How does consistent hashing minimize data redistribution when nodes are added or removed?
- What is a virtual node (vnode) in consistent hashing?
- Which real-world systems use consistent hashing (Cassandra, Dynamo, CDNs)?
How do you design a system that can auto-scale? [advanced]
- What metrics trigger auto-scaling (CPU, memory, request queue depth)?
- What is the difference between proactive scaling and reactive scaling?
- What are the risks of auto-scaling too aggressively or too slowly?
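For the consistent hashing questions in this section, a small sketch can make the idea concrete. The class below is a hypothetical, single-process illustration of a hash ring with virtual nodes. The use of MD5, the default of 100 virtual nodes per physical node, and all names are assumptions for the example and do not reflect how any particular database implements it.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Hash ring with virtual nodes; keys map to the next vnode clockwise."""

    def __init__(self, vnodes=100):
        self.vnodes = vnodes   # virtual nodes per physical node (assumed value)
        self._points = []      # sorted positions of all virtual nodes
        self._owner = {}       # position -> physical node name

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            point = self._hash(f"{node}#{i}")
            if point in self._owner:
                continue  # position already taken; skip this virtual node
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove_node(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First vnode strictly after the key's position, wrapping around.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

When a node is added, the only keys that change owner are the ones the new node takes over; every other key stays where it was. Spreading each physical node over many virtual nodes evens out how much of the ring each one owns.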
3. Databases
How do you choose between SQL and NoSQL databases? [core]
- What workloads favor a relational database?
- When is a document store (MongoDB) better than a key-value store (Redis)?
- What does ACID compliance mean and why does it matter?
What are the different types of NoSQL databases and their use cases? [core]
- When do you use a wide-column store like Cassandra vs a document store like MongoDB?
- What is a time-series database and when is it appropriate?
- What is a graph database and what problems does it solve?
What is database replication and what are the different replication strategies? [core]
- What is the difference between synchronous and asynchronous replication?
- What is master-slave vs master-master replication?
- How does replication lag affect system behavior?
What is database indexing and how does it improve query performance? [core]
- What is the difference between a clustered and non-clustered index?
- What are the downsides of over-indexing a database?
- What is a composite index and when should you use one?
What is the CAP theorem? [core]
- Can a distributed system be both consistent and available at the same time?
- What does partition tolerance mean in practice?
- Which databases choose CP and which choose AP?
What is eventual consistency and when is it acceptable? [core]
- What is the difference between strong consistency, eventual consistency, and causal consistency?
- How does Amazon DynamoDB implement eventual consistency?
- What real-world scenarios can tolerate eventual consistency?
What is ACID vs BASE and how do they apply to system design? [advanced]
- What does BASE stand for (Basically Available, Soft state, Eventually consistent)?
- How do you achieve ACID guarantees in a distributed system?
- What is a distributed transaction and what are its challenges?
How do you handle database migrations in a production system? [advanced]
- What is a zero-downtime migration strategy?
- What is the expand-contract pattern for schema changes?
- How do you roll back a failed database migration?
What is a data warehouse and how does it differ from a transactional database? [advanced]
- What is OLTP vs OLAP?
- When would you use Snowflake, BigQuery, or Redshift?
- What is the difference between a star schema and a snowflake schema?
4. Caching
What is caching and why is it important in system design? [core]
- What types of data are good candidates for caching?
- What are the trade-offs of caching (staleness, memory cost, complexity)?
- Where can you apply caching in a system (client, CDN, server, database)?
What are the different caching strategies (cache-aside, write-through, write-back)? [core]
- What is cache-aside (lazy loading) and when do you use it?
- What is write-through caching and what are its trade-offs?
- What is write-back (write-behind) caching and when is it risky?
- What is read-through caching?
What are cache eviction policies (LRU, LFU, TTL)? [core]
- How does LRU (Least Recently Used) eviction work?
- When would you prefer LFU (Least Frequently Used) over LRU?
- What is TTL-based eviction and when is it most appropriate?
What is a cache stampede (thundering herd) and how do you prevent it? [advanced]
- What is probabilistic early recomputation for cache stampede prevention?
- How does mutex locking prevent cache stampede?
- What is request coalescing?
What is the difference between Redis and Memcached? [core]
- What data structures does Redis support that Memcached does not?
- When would you choose Memcached over Redis?
- How does Redis persistence (RDB vs AOF) work?
How do you design a distributed cache? [advanced]
- How do you handle cache invalidation in a distributed system?
- What is cache coherence and why is it hard in distributed systems?
- How does consistent hashing apply to distributing cache keys across nodes?
What is CDN caching and how does it differ from server-side caching? [core]
- How do cache-control headers control CDN behavior?
- What is the difference between edge caching and origin caching?
- How do you invalidate a CDN cache for a specific resource?
5. Messaging and Queues
What is a message queue and why is it used in distributed systems? [core]
- What problems does asynchronous messaging solve?
- What is the difference between a queue and a topic (pub/sub)?
- How does a message queue decouple producers from consumers?
What is the difference between RabbitMQ and Kafka? [core]
- When would you choose Kafka over RabbitMQ?
- What is Kafka's log-based storage model?
- What is a consumer group in Kafka and how does it enable parallel processing?
- At what level does Kafka guarantee message ordering?
What is pub/sub messaging and when do you use it? [core]
- What are fan-out patterns in pub/sub systems?
- How does Google Cloud Pub/Sub differ from Kafka?
- What is the difference between push and pull delivery in pub/sub?
How do you ensure exactly-once delivery in a message queue? [advanced]
- What is at-least-once vs at-most-once vs exactly-once delivery semantics?
- How do idempotency keys help ensure exactly-once processing?
- How does Kafka implement exactly-once semantics?
What is a dead letter queue (DLQ) and why is it important? [core]
- When does a message end up in a dead letter queue?
- How do you monitor and process messages in a DLQ?
- What retry strategies work well with a DLQ?
How do you handle backpressure in a messaging system? [advanced]
- What is backpressure and why does it occur?
- What strategies can a consumer use to signal backpressure to a producer?
- How does Kafka handle slow consumers?
6. API Design
What are the principles of good API design? [core]
- What makes an API RESTful?
- What is the Richardson Maturity Model?
- How do you version a public API?
- What is HATEOAS?
What is the difference between REST, GraphQL, and gRPC? [core]
- When would you choose GraphQL over REST?
- What are the performance benefits of gRPC over REST?
- What is the N+1 query problem in GraphQL and how is it solved?
How do you design a pagination API for large datasets? [core]
- What is the difference between offset pagination and cursor-based pagination?
- Why is cursor-based pagination preferred for real-time feeds?
- How do you handle page size limits in a public API?
What is rate limiting in APIs and how do you implement it? [core]
- What algorithms are used for rate limiting (token bucket, leaky bucket, sliding window log)?
- How do you implement distributed rate limiting?
- How do you communicate rate limit status to clients (HTTP headers)?
How do you design an idempotent API? [advanced]
- Which HTTP methods are idempotent by definition?
- How do you make a POST endpoint idempotent using idempotency keys?
- What are the storage implications of idempotency key tracking?
How do you handle API versioning? [core]
- What are the different API versioning strategies (URL path, header, query param)?
- What are the trade-offs of each versioning approach?
- How long should you maintain backward compatibility in a versioned API?
What is an API gateway and what does it provide? [core]
- What is the difference between an API gateway and a reverse proxy?
- How does an API gateway handle authentication and authorization?
- What is a BFF (Backend for Frontend) pattern?
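Of the rate limiting algorithms named in this section, the token bucket is probably the easiest to sketch. The class below is a hypothetical single-process illustration, and the class name, parameters, and injectable clock are assumptions for the example. It is not thread-safe and does not cover the distributed case, where the bucket state would typically live in a shared store.

```python
import time


class TokenBucket:
    """Allow a request if a token is available; tokens refill at a fixed rate."""

    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self._clock = clock             # injectable so tests can control time
        self._tokens = capacity         # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would create one bucket per client or API key and return HTTP 429 when `allow()` is false. The capacity bounds the size of a burst, while the refill rate sets the sustained request rate.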
7. Distributed Systems
What are the fallacies of distributed computing? [advanced]
- What are the 8 fallacies of distributed computing?
- How does each fallacy affect system design decisions?
- What is the "network is reliable" fallacy and how do you design around it?
What is a distributed transaction and how do you handle it? [advanced]
- What is the two-phase commit (2PC) protocol?
- What is the SAGA pattern and how does it compare to 2PC?
- What is a compensating transaction and when is it used?
What is the SAGA pattern in microservices? [advanced]
- What is the difference between choreography and orchestration in SAGA?
- How do you handle partial failures in a SAGA?
- What are the trade-offs of using SAGA vs 2PC?
What is leader election and why is it needed in distributed systems? [advanced]
- How does the Raft consensus algorithm work?
- How does ZooKeeper implement leader election?
- What happens during a split-brain scenario?
What is a distributed lock and how do you implement one? [advanced]
- What is the Redlock algorithm for distributed locking with Redis?
- What are the risks of distributed locks?
- When should you use optimistic locking instead of a distributed lock?
How do you design a system for fault tolerance and high availability? [core]
- What is the difference between fault tolerance and high availability?
- What is the circuit breaker pattern?
- What is the retry pattern and what are its risks?
- What is bulkhead isolation?
What is the Two Generals' Problem and what does it illustrate? [advanced]
- Why can you not guarantee consensus in an unreliable network?
- How does this relate to real-world distributed systems?
- What is the Byzantine Generals Problem?
8. Microservices
What are microservices and how do they differ from monoliths? [core]
- What are the benefits and drawbacks of microservices?
- When should you choose a monolith over microservices?
- What is a modular monolith?
How do microservices communicate with each other? [core]
- What is the difference between synchronous (REST/gRPC) and asynchronous (message queue) communication?
- When should microservices communicate synchronously vs asynchronously?
- What is a service mesh and what problem does it solve?
What is service discovery and how does it work in microservices? [core]
- What is the difference between client-side and server-side service discovery?
- How do Consul, Eureka, and Kubernetes DNS handle service discovery?
- How does a health check integrate with service discovery?
How do you handle data management in a microservices architecture? [advanced]
- What is the database-per-service pattern?
- How do you handle data consistency across services without a shared database?
- What is event sourcing and how does it help with microservices data?
What is the strangler fig pattern? [advanced]
- How do you migrate a monolith to microservices using the strangler fig pattern?
- What are the risks of incremental migration?
- What is an anti-corruption layer?
How do you monitor and debug a microservices system? [core]
- What is distributed tracing and how do Jaeger and Zipkin work?
- What is a correlation ID and how is it used?
- What are the key metrics to monitor for each microservice?
What is a service mesh and what does it provide? [advanced]
- What is the difference between Istio and Linkerd?
- What is a sidecar proxy pattern?
- How does a service mesh handle mTLS between services?
9. Storage Systems
How do you design a distributed file storage system? [core]
- How do Google File System (GFS) and HDFS work?
- How does erasure coding compare with replication for data durability?
- How do you handle file chunking and metadata management?
What is object storage and how does it differ from block storage? [core]
- What is the difference between object storage (S3), block storage (EBS), and file storage (EFS)?
- When would you use object storage over a relational database?
- How does Amazon S3 achieve durability?
How do you design a blob storage system like S3? [advanced]
- What data structures are used to store and retrieve blobs efficiently?
- How do you handle large file uploads (multipart upload)?
- How do you implement access control for stored objects?
What is a columnar storage format and when is it used? [advanced]
- What is the difference between row-oriented and column-oriented storage?
- When does columnar storage (Parquet, ORC) outperform row storage?
- How does columnar storage improve compression?
How do you handle data replication for durability? [core]
- What is the replication factor and how do you choose it?
- What is quorum-based replication?
- How does Cassandra handle replication across data centers?
10. Real-world System Design Problems
How would you design a URL shortener (like bit.ly)? [core]
- How do you generate a unique short code?
- How do you handle collisions in short code generation?
- How do you scale reads since redirects are very frequent?
- How do you handle custom aliases?
How would you design a rate limiter? [core]
- Which algorithm would you use (token bucket, sliding window)?
- How do you make the rate limiter work across multiple servers?
- How do you store rate limit state (Redis, in-memory)?
How would you design a notification system? [core]
- How do you handle push notifications to millions of users?
- How do you ensure notifications are delivered exactly once?
- How do you handle user preference settings (opt-out, channels)?
How would you design a social media news feed (like Twitter or Facebook)? [core]
- What is the difference between the fan-out-on-write and fan-out-on-read approaches?
- How do you handle celebrity users with millions of followers in fan-out?
- How do you rank and filter a user's feed?
How would you design a ride-sharing system (like Uber)? [advanced]
- How do you match drivers with riders efficiently?
- How do you handle geolocation and proximity queries at scale?
- How do you handle surge pricing?
How would you design a distributed key-value store? [advanced]
- How does consistent hashing distribute keys across nodes?
- How do you handle node failures and data recovery?
- How does Amazon DynamoDB or Apache Cassandra implement a key-value store?
How would you design a video streaming platform (like YouTube or Netflix)? [advanced]
- How do you handle video upload, transcoding, and storage?
- How does adaptive bitrate streaming work?
- How do you use a CDN to serve video content globally?
How would you design a chat application (like WhatsApp)? [core]
- How do you implement real-time message delivery using WebSocket?
- How do you store and retrieve chat history efficiently?
- How do you handle group chats and delivery receipts?
How would you design a search autocomplete system? [core]
- What data structure is used for prefix search (trie)?
- How do you rank autocomplete suggestions by popularity?
- How do you update suggestions in real time as trends change?
How would you design a distributed job scheduler? [advanced]
- How do you ensure a job runs exactly once?
- How do you handle job failures and retries?
- How do you schedule jobs with dependencies?
How would you design a payment processing system? [advanced]
- How do you ensure idempotency in payment APIs?
- How do you handle double charges and refunds?
- What compliance considerations (PCI-DSS) affect payment system design?
How would you design a web crawler? [advanced]
- How do you avoid crawling the same page twice?
- How do you handle crawl politeness (robots.txt, rate limiting)?
- How do you scale a crawler to billions of pages?
11. Reliability and Fault Tolerance
What is the circuit breaker pattern and how does it work? [core]
- What are the three states of a circuit breaker (closed, open, half-open)?
- What metrics trigger a circuit breaker to open?
- How does the circuit breaker pattern relate to the bulkhead pattern?
What is the retry pattern and what are its risks? [core]
- What is exponential backoff with jitter?
- What is the difference between retry at the client vs retry at the proxy layer?
- When should you not retry (non-idempotent operations)?
What is the bulkhead pattern in system design? [advanced]
- How does bulkhead isolation prevent cascading failures?
- What is thread pool isolation vs semaphore isolation?
- How does Netflix's Hystrix implement bulkheads?
How do you design for graceful degradation? [core]
- What is the difference between graceful degradation and failover?
- How do you implement a fallback response when a dependency fails?
- What is feature flagging and how does it support graceful degradation?
What is chaos engineering and why do companies practice it? [advanced]
- What is Netflix's Chaos Monkey?
- How do you design a chaos experiment?
- What is the difference between chaos engineering and load testing?
What are the different types of system failures (hardware, software, network)? [core]
- What is a cascading failure and how does it start?
- What is a gray failure and why is it harder to detect than a hard failure?
- How do you design a system to detect and recover from partial failures?
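The retry questions in this section mention exponential backoff with jitter. A minimal sketch of the "full jitter" variant follows. The function names, default values, and the choice of which exceptions count as retryable are assumptions for illustration, and the wrapped operation is assumed to be idempotent.

```python
import random
import time


def backoff_delays(base=0.1, cap=5.0, retries=5, rng=random.random):
    """Yield one delay per retry: random in [0, min(cap, base * 2**attempt)]."""
    for attempt in range(retries):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(operation, retryable=(ConnectionError, TimeoutError),
                      sleep=time.sleep, **backoff):
    """Call operation(); after a retryable error, wait a jittered delay and retry."""
    for delay in backoff_delays(**backoff):
        try:
            return operation()
        except retryable:
            sleep(delay)
    # Final attempt after the last delay; any error propagates to the caller.
    return operation()
```

The randomization spreads out retries from many clients that failed at the same moment, so they do not all hit the recovering dependency together. The cap keeps the wait bounded once the exponential term grows large.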
12. Security in System Design
How do you design a secure authentication system? [core]
- What is the difference between session-based and token-based authentication?
- How do you store passwords securely (bcrypt, Argon2)?
- How do you implement multi-factor authentication (MFA)?
What is OAuth 2.0 and how does it work in system design? [core]
- What are the OAuth 2.0 grant types (authorization code, client credentials, implicit)?
- What is the difference between OAuth 2.0 and OpenID Connect?
- How do you implement token refresh and revocation?
How do you secure inter-service communication in microservices? [advanced]
- What is mutual TLS (mTLS) and how does it work?
- How does a service mesh like Istio enforce mTLS?
- What is a service account and how is it used for authorization?
How do you design a system to protect against DDoS attacks? [core]
- What is rate limiting and how does it mitigate DDoS?
- How do CDNs absorb volumetric DDoS attacks?
- What is a WAF (Web Application Firewall) and how does it help?
What is encryption at rest vs encryption in transit? [core]
- How do you implement encryption at rest in a cloud database?
- What is envelope encryption?
- What key management strategies exist (KMS, HSM)?
How do you handle secrets management in a distributed system? [advanced]
- What is HashiCorp Vault and how does it manage secrets?
- What are the risks of storing secrets in environment variables?
- How do you rotate secrets without downtime?
13. Observability and Monitoring
What are the three pillars of observability? [core]
- What is the difference between logs, metrics, and traces?
- How do logs, metrics, and traces complement each other?
- What tools are used for each pillar (ELK, Prometheus, Jaeger)?
How do you design a logging system for a distributed application? [core]
- What is structured logging and why is it preferred over unstructured logging?
- How do you aggregate logs from hundreds of services (ELK stack, Loki)?
- How do you handle log sampling to reduce cost?
What is distributed tracing and how does it work? [core]
- What are a trace, a span, and a trace ID?
- How does a correlation ID propagate through a distributed system?
- What is OpenTelemetry and why is it important?
What metrics should you monitor for a backend system? [core]
- What are the four golden signals (latency, traffic, errors, saturation)?
- How do you set meaningful alert thresholds?
- What is p99 latency and why is it more meaningful than average latency?
How do you design an alerting system? [advanced]
- What is the difference between symptom-based and cause-based alerting?
- How do you reduce alert fatigue?
- What is a runbook and how does it relate to alerts?
What is a health check endpoint and how is it used? [core]
- What is the difference between a liveness check and a readiness check?
- How does Kubernetes use health checks?
- What should a health check endpoint actually verify?
14. Cloud and Infrastructure
What is the difference between IaaS, PaaS, and SaaS? [core]
- What are examples of each (EC2, Elastic Beanstalk, Salesforce)?
- When would you choose PaaS over IaaS?
- What is FaaS (Function as a Service) and how does serverless fit in?
What is serverless computing and when is it appropriate? [core]
- What is the cold start problem and how do you mitigate it?
- What are the cost implications of serverless vs always-on servers?
- What types of workloads are a poor fit for serverless?
What is a VPC and how do you design network security with it? [core]
- What is the difference between a public subnet and a private subnet?
- What is a NAT gateway and when is it needed?
- What is VPC peering and what are its limitations?
What is Infrastructure as Code (IaC) and why is it important? [core]
- What is the difference between Terraform and CloudFormation?
- What is idempotency in the context of IaC?
- How do you manage secrets in IaC configurations?
How do you design a multi-region architecture? [advanced]
- What is active-active vs active-passive multi-region deployment?
- How do you handle data replication across regions?
- How do you handle DNS failover for multi-region deployments?
What is a content delivery network (CDN) and how do you integrate it into a system design? [core]
- What types of content should be served from a CDN?
- How does a CDN reduce latency for a globally distributed user base?
- How do you handle CDN cache invalidation?