Enterprise applications are rarely islands. They usually have many dependencies, and most often those dependencies are other services. A typical example is an order use case that calls out to several services, a payment service among them.

If any of the services that the use case depends on becomes slow, it has a cascading effect on the entire application. Picture a fine day when everything is looking good. All of a sudden, for reasons unknown, the payment service, which usually responds in 2s, starts taking 8s. Let us theorize the impact for a second:

  • Say the web container is working with 50 threads, a number that past experience says is a good one to go with.
  • Because of the four-fold increase in the latency of one service, every container thread that handles an order ends up waiting on the payment service.
  • Rather quickly, all 50 container threads are busy. Busy doing what? Mostly waiting on the payment service when they could have been doing something useful.
  • From a CPU utilization perspective, since the threads are waiting rather than doing real work, CPU utilization typically goes down.
  • Soon enough, there are no threads left to serve new requests (not necessarily order requests), and the entire site is slow.
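
To put rough numbers on the scenario above, here is a back-of-the-envelope sketch. It assumes every request holds one container thread for the full duration of the payment call and does nothing else, which is a simplification; the class and method names are made up for illustration.

```java
// Rough capacity check using the numbers from the example above.
final class ThreadPoolCapacity {
    // If each request blocks one thread for latencySeconds, a pool of
    // `threads` threads can complete at most threads / latencySeconds
    // requests per second (Little's Law, rearranged).
    static double maxRequestsPerSecond(int threads, double latencySeconds) {
        return threads / latencySeconds;
    }

    public static void main(String[] args) {
        System.out.println(maxRequestsPerSecond(50, 2.0)); // prints 25.0
        System.out.println(maxRequestsPerSecond(50, 8.0)); // prints 6.25
    }
}
```

Under that assumption, the same 50 threads that could sustain about 25 requests per second at 2s can sustain only about 6 per second at 8s. Any traffic above that rate piles up waiting for a free thread.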

A few architectural mistakes are behind this.

  • All the eggs are in one basket. In a perfect world where everyone plays nicely, this is not an issue. But there is often little control over services accessed over the network, even within the same enterprise, let alone when they come from a different provider altogether.
  • No timeouts. At a minimum, connect and read timeouts must be set on the connection so that a container thread never waits indefinitely for any service.
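
As a minimal sketch of the timeout point, this is how connect and read timeouts can be set with the JDK's own HttpURLConnection. The class name, the URL in the usage note, and the timeout values are assumptions for illustration; other HTTP clients expose equivalent settings under different names.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class PaymentConnections {
    // Hypothetical limits; pick values based on the service's normal response time.
    static final int CONNECT_TIMEOUT_MS = 1_000;
    static final int READ_TIMEOUT_MS = 3_000;

    // Returns a configured HTTP connection. Nothing is sent over the
    // network until the caller connects or reads the response.
    static HttpURLConnection open(String url) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(CONNECT_TIMEOUT_MS); // max wait to establish the connection
            conn.setReadTimeout(READ_TIMEOUT_MS);       // max time a read may block waiting for data
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With these set, a call such as PaymentConnections.open("http://payments.example.invalid/charge").getInputStream() fails with a SocketTimeoutException if the connection cannot be established or a read stalls past the limit, instead of holding the container thread indefinitely.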

The proposal is as follows.

  • Isolate each dependency. Each dependency (or group of related dependencies) should work from a thread pool of its own so that the rest of the system is not impacted.
  • Yes, a slow dependency will still impact parts of the application, but it will not impact the system as a whole.
  • Provide a fallback. There may be an alternate service to connect to, or the payment requests can be queued for later processing, so that the user experience is not affected.
  • With the Java concurrency libraries (java.util.concurrent), setting up the various thread pools and using them is straightforward, especially if the pools are injected via a DI framework like Spring.
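
The isolation idea above can be sketched with plain java.util.concurrent. This is one possible shape, not the exact code we run: the class name IsolatedDependency, the pool and queue sizes, and the fallback value are all assumptions for illustration. In a Spring application, one such object per dependency could be defined as a bean and injected.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Runs calls to one dependency on that dependency's own bounded pool.
final class IsolatedDependency {
    private final ExecutorService pool;
    private final long timeoutMillis;

    IsolatedDependency(int threads, int queueCapacity, long timeoutMillis) {
        // A bounded queue makes the pool reject work once it is saturated,
        // instead of letting requests pile up behind a slow dependency.
        this.pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                runnable -> {
                    Thread t = new Thread(runnable, "dependency-pool");
                    t.setDaemon(true); // do not keep the JVM alive for this pool
                    return t;
                });
        this.timeoutMillis = timeoutMillis;
    }

    // Returns the remote result, or the fallback if the pool is full,
    // the call fails, or it does not finish within the timeout.
    <T> T call(Callable<T> remoteCall, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(remoteCall);
        } catch (RejectedExecutionException e) {
            return fallback; // pool and queue are full: shed the load
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it is freed up
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the remote call itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

The container thread now waits at most timeoutMillis, and a slow payment service can tie up only its own small pool. The fallback here is just a value; in practice it could be a call to an alternate service or a marker meaning "queued for later processing".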

We have used the above approach successfully to handle network dependencies.


But here is an even better alternative. Netflix, apparently faced with similar issues, has built a comprehensive framework to do just this. The Hystrix framework, built and open-sourced by the Netflix team, does what was discussed above and more.

  • Hystrix wraps all calls to external systems and executes each one in a separate thread.
  • The call times out after a pre-configured interval.
  • It can shed load and fail fast rather than bring the entire system down.
  • It trips a circuit breaker if the error percentage is high.
  • It has reporting capabilities.
  • It comes with dashboards.
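
To make the circuit breaker idea concrete, here is a deliberately simplified sketch in plain Java. It is not Hystrix code and does not use the Hystrix API; the class name and parameters are made up for illustration. It counts requests cumulatively since the last reset and simply closes again once the sleep window has passed. A production library such as Hystrix measures errors over a rolling window and probes the dependency more carefully before closing the circuit.

```java
import java.util.function.Supplier;

// Simplified circuit breaker: trips when the error percentage gets too high.
final class SimpleCircuitBreaker {
    private final int minimumRequests;          // do not trip on too small a sample
    private final double errorThresholdPercent; // trip at or above this error rate
    private final long sleepWindowMillis;       // how long to stay open once tripped

    private int requests;
    private int failures;
    private long openedAtMillis = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int minimumRequests, double errorThresholdPercent,
                         long sleepWindowMillis) {
        this.minimumRequests = minimumRequests;
        this.errorThresholdPercent = errorThresholdPercent;
        this.sleepWindowMillis = sleepWindowMillis;
    }

    // Runs the call unless the circuit is open; returns the fallback on
    // failure or when the circuit is open.
    <T> T execute(Supplier<T> call, T fallback) {
        if (!allowRequest()) {
            return fallback; // fail fast without touching the dependency
        }
        try {
            T result = call.get();
            record(false);
            return result;
        } catch (RuntimeException e) {
            record(true);
            return fallback;
        }
    }

    synchronized boolean isOpen() {
        return openedAtMillis >= 0
                && System.currentTimeMillis() - openedAtMillis < sleepWindowMillis;
    }

    private synchronized boolean allowRequest() {
        if (openedAtMillis < 0) {
            return true;
        }
        if (System.currentTimeMillis() - openedAtMillis < sleepWindowMillis) {
            return false;
        }
        // Sleep window elapsed: close the circuit and start counting afresh.
        openedAtMillis = -1;
        requests = 0;
        failures = 0;
        return true;
    }

    private synchronized void record(boolean failed) {
        requests++;
        if (failed) {
            failures++;
        }
        if (requests >= minimumRequests
                && failures * 100.0 / requests >= errorThresholdPercent) {
            openedAtMillis = System.currentTimeMillis();
        }
    }
}
```

Once the circuit is open, callers get the fallback immediately and the struggling service gets a break from traffic, which is the fail-fast behaviour described in the list above.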

Definitely a useful tool to have in the toolbox.