What is a Cascading Failure?
A cascading failure is when the failure of one component causes a chain reaction that brings down other components — until the entire system collapses.
It's the digital equivalent of a power grid blackout: one substation trips, overloads the next, which trips, overloads the next — until entire cities go dark.
User Request
     │
     ▼
┌──────────┐      ┌──────────────┐      ┌──────────┐
│   App    │─────▶│   Razorpay   │─────▶│ Bank API │
│  Server  │      │   Gateway    │  ✗   │  (slow)  │
└──────────┘      └──────────────┘      └──────────┘
     │                    │
  Threads              Threads
  blocking             blocking
  (waiting)            (waiting for bank)
     │
     ▼
New requests pile up
Memory fills up
Thread pool exhausted
     │
     ▼
┌──────────────────────────┐
│    ENTIRE APP CRASHES    │ ← because of ONE slow dependency
└──────────────────────────┘
Why It Happens — Thread Exhaustion
Modern web servers use a thread pool to handle requests. When Razorpay slows down, here's what happens at the thread level:
// What thread exhaustion looks like in metrics
Thread pool size:             200
Threads waiting on Razorpay:  200      ← all blocked
Threads available:            0        ← nothing left
Request queue length:         4800     ← backing up fast
P99 latency:                  30000ms  ← everything timing out

// Result: your /home, /login, /search endpoints all return 503
// because of a payment provider outage
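To make this concrete, here's a sketch of the kind of code that causes it. The class name and URL are hypothetical; the pattern is what matters: a blocking HTTP call with no timeout, running on the servlet thread pool.

// The vulnerable pattern: a blocking call with no timeout configured.
// Every request thread that enters here parks until the bank responds.
// With a 200-thread pool, 200 slow payments freeze the whole app.
@Service
public class UnprotectedPaymentService {

    // Default RestTemplate: no connect timeout, no read timeout
    private final RestTemplate http = new RestTemplate();

    public PaymentResult pay(PaymentRequest req) {
        // The worker thread blocks here for as long as Razorpay
        // (and the bank behind it) takes to answer.
        return http.postForObject(
                "https://api.razorpay.example/payments", req, PaymentResult.class);
    }
}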
The Circuit Breaker Pattern
The Circuit Breaker is a design pattern borrowed from electrical engineering. In your home, a circuit breaker trips when it detects dangerous current — it breaks the circuit to prevent fire and damage.
In software: when a dependency starts failing, the Circuit Breaker stops sending requests to it. Instead of letting threads pile up waiting for timeouts, it fails fast — immediately returning an error or fallback response.
┌────────────────────────────────────────────────────────┐
│                      Your Service                      │
│                                                        │
│  ┌──────────┐    ┌─────────────────┐    ┌──────────┐   │
│  │   Your   │───▶│ Circuit Breaker │───▶│ Razorpay │   │
│  │   Code   │    │ (proxy/wrapper) │    │   API    │   │
│  └──────────┘    └────────┬────────┘    └──────────┘   │
│                           │ (when OPEN)                │
│                           ▼                            │
│                  ┌─────────────────┐                   │
│                  │ Fallback Logic  │                   │
│                  │ (use Paytm /    │                   │
│                  │  return error)  │                   │
│                  └─────────────────┘                   │
└────────────────────────────────────────────────────────┘
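For contrast, here's what the same outage looks like with the breaker in place (illustrative numbers, mirroring the metrics block above):

// Same outage, circuit breaker OPEN
Thread pool size:             200
Threads waiting on Razorpay:  0       ← calls fail fast, nothing blocks
Threads available:            197
Request queue length:         12
P99 latency (non-payment):    45ms    ← /home, /login, /search unaffected
// Payments degrade to the fallback; the rest of the app stays up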
The 3 States Every Engineer Must Know
CLOSED: Everything is healthy. Requests flow through to the dependency normally, while the CB monitors the failure rate over a sliding time window.
If the failure rate exceeds the threshold (e.g. 50% of calls fail within 10 seconds) → trips to OPEN.
OPEN: The CB stops sending ALL requests to the broken dependency. Instead, it immediately returns a fallback response: no waiting, no thread blocking.
After a configured timeout (e.g. 30 seconds) → transitions to HALF-OPEN to test recovery.
HALF-OPEN: The CB allows a small number of test requests through. If they succeed → the service has recovered → trips back to CLOSED. If they fail → trips back to OPEN.
This is the self-healing mechanism. No engineer needs to manually reset anything.
               failures > threshold
 ┌─────────┐ ─────────────────────────────▶ ┌──────────┐
 │ CLOSED  │                                │   OPEN   │
 │   ●●●   │                                │   ○○○    │
 └─────────┘                                └────┬─────┘
      ▲                                          │
      │                                    after timeout
      │                                          │
      │  test succeeds                  ┌────────▼────┐
      └─────────────────────────────────│  HALF-OPEN  │
         test fails → back to OPEN      │     ◑◑◑     │
                                        └─────────────┘
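If it helps to see the state machine as code, here's a deliberately simplified, single-threaded sketch. It counts consecutive failures instead of tracking a windowed failure rate, and all names are made up; real breakers (see below) also handle concurrency, sliding windows, and metrics.

import java.util.function.Supplier;

// Toy circuit breaker: illustrates the three states, nothing more.
public class ToyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private long openedAt = 0;

    private static final int FAILURE_THRESHOLD = 5;   // trip after 5 straight failures
    private static final long WAIT_MILLIS = 30_000;   // stay OPEN for 30s

    public <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - openedAt < WAIT_MILLIS) {
                return fallback.get();                // fail fast: no thread blocks
            }
            state = State.HALF_OPEN;                  // timeout elapsed: probe
        }
        try {
            T result = action.get();                  // CLOSED traffic or HALF_OPEN probe
            state = State.CLOSED;                     // success: (re)close the circuit
            consecutiveFailures = 0;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (state == State.HALF_OPEN || consecutiveFailures >= FAILURE_THRESHOLD) {
                state = State.OPEN;                   // trip (back) open
                openedAt = System.currentTimeMillis();
            }
            return fallback.get();
        }
    }
}

// Usage: breaker.call(() -> razorpay.charge(req), () -> paytm.charge(req));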
Real World: Razorpay → Paytm Fallback
Here's the exact scenario from the reel (Razorpay degrades, the breaker trips, traffic fails over to Paytm), implemented with full engineering detail below:
Code Implementation
The Java example below uses the recommended library for Spring Boot, explains the configuration, and implements the Razorpay → Paytm fallback pattern. (Equivalents for other stacks are listed in the takeaways at the end.)
Resilience4j is the standard choice for Spring Boot microservices. How it works: a sliding window tracks the last N calls. When the failure rate in that window exceeds the threshold, the CB trips OPEN. After waitDurationInOpenState it goes HALF-OPEN and lets a configured number of test requests through. If they succeed → CLOSED.
/* ── 1. Add dependency (build.gradle) ─────────────────── */
implementation 'io.github.resilience4j:resilience4j-spring-boot3:2.2.0'

# ── 2. Configure in application.yml ────────────────────
resilience4j:
  circuitbreaker:
    instances:
      razorpay:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 10              # track last 10 calls
        failureRateThreshold: 50           # trip if >50% fail
        waitDurationInOpenState: 30s       # stay OPEN 30 seconds
        permittedNumberOfCallsInHalfOpenState: 3
        slowCallDurationThreshold: 2s      # slow = failure too
        slowCallRateThreshold: 80          # trip if >80% of calls are slow
/* ── 3. Use in your service ────────────────────────────── */
@Service
public class PaymentService {

    // Primary: Razorpay. Fallback: Paytm.
    // (razorpayClient / paytmClient are your injected gateway wrappers)
    @CircuitBreaker(name = "razorpay", fallbackMethod = "payViaPaytm")
    public PaymentResult payViaRazorpay(PaymentRequest req) {
        // This call is protected by the CB
        return razorpayClient.processPayment(req);
    }

    // Called automatically when the CB is OPEN or the call fails
    public PaymentResult payViaPaytm(PaymentRequest req, Exception ex) {
        log.warn("Razorpay CB open, using Paytm fallback", ex);
        return paytmClient.processPayment(req);
    }
}
/* ── 4. Monitor CB state changes ──────────────────────── */
// Resilience4j events come from the registry's event publisher
// (they are not delivered as Spring application events)
@Component
public class CBEventListener {

    public CBEventListener(CircuitBreakerRegistry registry, AlertService alertService) {
        registry.circuitBreaker("razorpay").getEventPublisher()
                .onStateTransition(e -> {
                    if (e.getStateTransition().getToState() == CircuitBreaker.State.OPEN) {
                        alertService.page("Razorpay CB tripped OPEN");
                    }
                });
    }
}
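A quick sketch of how the same call site behaves in each state (request/result names hypothetical):

// One call site, three behaviors depending on CB state:
PaymentResult result = paymentService.payViaRazorpay(request);

// CLOSED    → the call reaches Razorpay; outcomes are recorded in the window
// OPEN      → Resilience4j throws CallNotPermittedException internally and
//             routes straight to payViaPaytm(): immediate fallback, no blocking
// HALF-OPEN → only 3 probe calls (permittedNumberOfCallsInHalfOpenState)
//             reach Razorpay; the rest get the fallback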
Key Takeaways
- A Cascading Failure happens when one slow dependency exhausts your thread pool, taking down the entire app — even parts unrelated to the failing service.
- The Circuit Breaker wraps calls to external dependencies. When failures spike, it "trips open" and stops making calls — failing fast instead of waiting for timeouts.
- CLOSED = normal traffic. OPEN = dependency broken, use fallback. HALF-OPEN = testing if service recovered.
- In OPEN state, requests get an immediate fallback response — no threads blocked, no memory pressure, rest of app stays healthy.
- HALF-OPEN is the self-healing mechanism — it periodically probes the broken service and automatically closes the circuit when it recovers. Zero manual intervention.
- Use Resilience4j (Java), opossum (Node.js), or Polly (.NET) — don't roll your own in production.