Enterprise applications are rarely islands. They usually have many dependencies, and most often those dependencies are other services. A typical example of an order use case may be as follows.

If any of the services that the use case depends on becomes slow, the effect cascades through the entire application. Say it is a fine day and everything is looking good. Then all of a sudden, for reasons unknown, the payment service, which usually responds in 2s, starts taking 8s. Let us theorize about the impact for a second:

  • Say the web container is working with 50 threads, which past experience has shown to be a good number.
  • Because of the fourfold increase in the latency of one service, the web container threads pile up waiting on the payment service.
  • Rather quickly, all 50 container threads are busy. Busy doing what? Mostly waiting on the payment service when they could have been doing something useful.
  • From a CPU utilization perspective, since the threads are waiting rather than doing actual work, CPU utilization typically goes down.
  • Soon enough, there are no threads left to serve new requests (not necessarily order requests), and the entire site is slow.
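
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes, for simplicity, that every request holds a container thread for the full duration of the payment call; the class and method names are made up for illustration.

```java
final class ThreadPoolCapacity {
    /**
     * Upper bound on requests per second when each request
     * holds one thread for latencySeconds.
     */
    static double maxThroughput(int threads, double latencySeconds) {
        return threads / latencySeconds;
    }

    public static void main(String[] args) {
        // 50 threads, payment service at its usual 2s
        System.out.println(maxThroughput(50, 2.0)); // 25.0
        // 50 threads, payment service degraded to 8s
        System.out.println(maxThroughput(50, 8.0)); // 6.25
    }
}
```

Under that assumption, the same 50 threads that could sustain 25 requests per second now sustain about 6. Anything arriving faster than that has to wait for a free thread.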

A few architectural mistakes:

  • All the eggs are in one basket. In a perfect world where everyone plays nicely, this is not an issue. But many times there is little control over services accessed over the network, even within the same enterprise, not to mention services from a different provider altogether.
  • Timeouts are missing. At a minimum, connect and read timeouts must be set on the connection so that a container thread does not wait indefinitely for any service.
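
As a minimal sketch of the timeout point, this is how both timeouts can be set with the JDK's own HttpURLConnection. The class name and the timeout values are assumptions for illustration; pick values based on the normal latency of the service being called.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class PaymentConnections {
    // Hypothetical values, not recommendations.
    static final int CONNECT_TIMEOUT_MS = 1_000;
    static final int READ_TIMEOUT_MS = 3_000;

    /** Prepares (but does not yet open) a connection with both timeouts set. */
    static HttpURLConnection open(String url) {
        try {
            HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
            // Maximum time to wait while establishing the connection.
            conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
            // Maximum time a read may block waiting for data.
            conn.setReadTimeout(READ_TIMEOUT_MS);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With these set, a stalled service surfaces as a java.net.SocketTimeoutException after a bounded wait instead of holding the container thread indefinitely.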

The proposal is to

  • Isolate each dependency. Each dependency (or group of related dependencies) should work from a thread pool of its own so that the rest of the system is not impacted.
  • Yes, a slow dependency will still impact parts of the application, but it will not impact the system as a whole.
  • There may be an alternate service that we can connect to, or the payment requests can be queued up for later processing, so that the user experience is not affected.
  • With the Java concurrency libraries (java.util.concurrent), setting up various thread pools and using them is straightforward, especially if the pools are injected via a DI framework like Spring.
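
The proposal above can be sketched with plain java.util.concurrent. This is a minimal illustration, not the exact code we run: the class name IsolatedDependency, the pool size, and the fallback value are all assumptions.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class IsolatedDependency<T> {
    private final ExecutorService pool; // threads reserved for this dependency only
    private final long timeoutMs;       // longest a caller will wait for a result

    IsolatedDependency(int poolSize, long timeoutMs) {
        this.pool = Executors.newFixedThreadPool(poolSize);
        this.timeoutMs = timeoutMs;
    }

    /** Runs the remote call on the dedicated pool; returns the fallback on timeout or error. */
    T call(Callable<T> remoteCall, T fallback) {
        Future<T> future = pool.submit(remoteCall);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it can be reused
            return fallback;
        } catch (ExecutionException e) {
            return fallback;     // the remote call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

A container thread would then call something like payment.call(() -> chargeCard(order), "QUEUED"), where chargeCard is a hypothetical method, and never wait longer than the timeout. One design note: newFixedThreadPool uses an unbounded queue, so a bounded queue with a rejection policy would be the next step if load should be shed outright rather than queued.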

We have used the above approach successfully to handle network dependencies.

Hystrix

But here is an even better alternative. Netflix, apparently faced with similar issues, has built a comprehensive framework to do just this. The Hystrix framework, built and open-sourced by the Netflix team, does what was discussed above and more.

  • Hystrix wraps all calls to external systems and executes them in a separate thread.
  • Each call times out at a pre-configured interval.
  • It can shed load and fail fast rather than bring the entire system down.
  • It trips a circuit breaker if the error percentage is high.
  • Reporting capabilities.
  • Dashboards.
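
To make the circuit breaker idea concrete, here is a deliberately simplified sketch in plain Java. It is not Hystrix's implementation or API: Hystrix trips on an error percentage, whereas this sketch trips on a count of consecutive failures, and every name in it is made up for illustration.

```java
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long to fail fast before trying again
    private int consecutiveFailures = 0;
    private long openedAt = -1;         // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    /** Calls the remote service unless the circuit is open; returns the fallback otherwise. */
    <T> T call(Supplier<T> remoteCall, T fallback) {
        if (isOpen()) {
            return fallback; // fail fast: the remote service is not touched
        }
        try {
            T result = remoteCall.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            return fallback;
        }
    }

    synchronized boolean isOpen() {
        if (openedAt < 0) {
            return false;
        }
        if (System.currentTimeMillis() - openedAt >= openMillis) {
            // Cool-down elapsed: close the circuit and let calls through again.
            openedAt = -1;
            consecutiveFailures = 0;
            return false;
        }
        return true;
    }

    private synchronized void recordSuccess() {
        consecutiveFailures = 0;
    }

    private synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            openedAt = System.currentTimeMillis();
        }
    }
}
```

With a threshold of 3, the fourth and later calls to a failing service return the fallback immediately without touching the service, which gives it room to recover.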

Definitely a useful tool to have in the toolbox.