Enterprise applications are rarely islands. They usually have many dependencies, most often on other services. A typical order use case, for example, calls several such services, a payment service among them.
If any of the services that the use case depends on becomes latent, the effect cascades through the entire application. Say it is a fine day and everything is looking good. Then, all of a sudden and for reasons unknown, the payment service, which usually responds in 2s, starts taking 8s. Let us work through the impact:
- Say the web container is working with 50 threads, which past experience suggests is a good number to go with.
- Because of the 4-fold increase in the latency of one service, the web container threads all end up waiting on the payment service.
- Rather quickly, all 50 container threads are busy. Busy doing what? Mostly waiting on the payment service when they could have been doing something useful.
- From a CPU utilization perspective, since the threads are waiting rather than doing work, CPU utilization typically goes down.
- Soon enough, there are no threads left to serve new requests (not necessarily order requests), and the entire site is slow.
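To put rough numbers on the scenario above, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every request holds one container thread for the full duration of the payment call; the class and method names are made up for this example.

```java
/** Back-of-the-envelope capacity check for a fixed-size request thread pool. */
final class ThreadPoolCapacity {

    /**
     * Upper bound on requests per second when each request holds one thread
     * for latencySeconds (throughput = concurrency / latency).
     */
    static double maxRequestsPerSecond(int threads, double latencySeconds) {
        return threads / latencySeconds;
    }

    public static void main(String[] args) {
        int threads = 50; // container thread count from the scenario above
        System.out.println(maxRequestsPerSecond(threads, 2.0)); // prints 25.0
        System.out.println(maxRequestsPerSecond(threads, 8.0)); // prints 6.25
    }
}
```

Under that assumption, the same 50 threads that could sustain about 25 requests per second at 2s can sustain only about 6 at 8s. Any traffic above that rate finds every thread occupied.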
A few architectural mistakes are at work here:
- All the eggs are in one basket. In a perfect world, where everyone plays nicely, this is not an issue. But there is often little control over services accessed over the network, even within the same enterprise, let alone services from a different provider altogether.
- At a minimum, connect and read timeouts must be set on the connection so that a container thread does not wait indefinitely for any service.
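As a minimal sketch of the timeout point, the standard `HttpURLConnection` exposes both settings directly. The URL and the timeout values below are placeholders chosen for illustration, not recommendations, and `PaymentConnections` is a hypothetical helper.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URL;

/** Hypothetical helper that never hands out a connection without timeouts. */
final class PaymentConnections {

    static HttpURLConnection open(URL url, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        // openConnection() only creates the object; no network I/O happens yet.
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(connectTimeoutMs); // max wait to establish the connection
        conn.setReadTimeout(readTimeoutMs);       // max wait for data once connected
        return conn;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder endpoint; ".invalid" is reserved and never resolves.
        URL url = URI.create("http://payments.example.invalid/charge").toURL();
        HttpURLConnection conn = open(url, 1_000, 3_000);
        System.out.println(conn.getConnectTimeout() + " ms connect, "
                + conn.getReadTimeout() + " ms read");
    }
}
```

With these set, a stalled service produces a `SocketTimeoutException` after a bounded wait instead of holding the container thread indefinitely.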
The proposal is to:
- Isolate each dependency. Each dependency (or group of related dependencies) should work from a thread pool of its own so that the rest of the system is not impacted.
- Yes, a misbehaving dependency will still impact parts of the application, but it will not impact the system as a whole.
- Fall back where possible. There may be an alternate service that we can connect to, or the payment requests can be queued up for later processing, so that the user experience is not affected.
- With the Java concurrency libraries (java.util.concurrent), setting up various thread pools and using them is straightforward, especially if the pools are injected via a DI framework like Spring.
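The proposal above can be sketched with plain `java.util.concurrent`. This is only an illustration under stated assumptions: `IsolatedDependency` is a made-up name, the pool size, queue size and timeout are arbitrary, and the fallback is a fixed value standing in for "queue it for later".

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Runs calls to one dependency on its own bounded pool, with a timeout and a fallback. */
final class IsolatedDependency<T> {
    private final ExecutorService pool;
    private final long timeoutMs;

    IsolatedDependency(int poolSize, int queueSize, long timeoutMs) {
        // Bounded pool and bounded queue: excess calls are rejected, not piled up.
        this.pool = new ThreadPoolExecutor(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
        this.timeoutMs = timeoutMs;
    }

    T call(Callable<T> primary, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(primary);
        } catch (RejectedExecutionException e) {
            return fallback; // pool and queue are full: shed load immediately
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it frees its thread
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the dependency itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return fallback;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) {
        IsolatedDependency<String> payments = new IsolatedDependency<>(10, 10, 500);
        // A slow payment call (simulated with sleep) is abandoned after 500 ms.
        String status = payments.call(() -> { Thread.sleep(8_000); return "PAID"; }, "QUEUED");
        System.out.println(status); // prints QUEUED
        payments.shutdown();
    }
}
```

The container thread still waits, but only up to the timeout, and at most `poolSize + queueSize` payment calls can be outstanding at once. In a Spring application, one such pool per dependency would typically be defined as a bean and injected.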
We have used the above approach successfully to handle network dependencies.
But here is an even better alternative. Netflix, faced with similar issues, has built a comprehensive framework to do just this. The Hystrix framework, built and open-sourced by the Netflix team, does what was discussed above and more.
- Hystrix wraps each call to an external system and executes it on a separate thread.
- The call times out after a pre-configured interval.
- It can shed load and fail fast rather than bring the entire system down.
- It trips a circuit breaker if the error percentage is high.
- It provides reporting capabilities.
Definitely a useful tool to have in the toolbox.
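To make the circuit breaker idea concrete, here is a deliberately simplified sketch in plain Java. It is not Hystrix's implementation or API: the class name, the thresholds and the use of cumulative counts (instead of a rolling time window) are all assumptions made to keep the example short.

```java
import java.util.function.Supplier;

/** Toy circuit breaker: opens when the error percentage gets too high. */
final class SimpleCircuitBreaker {
    private final int minCalls;              // do not judge on too few samples
    private final int errorThresholdPercent; // trip at or above this error rate
    private final long openMillis;           // how long to fail fast once tripped

    private int calls;
    private int failures;
    private long openedAt = -1; // -1 means the circuit is closed

    SimpleCircuitBreaker(int minCalls, int errorThresholdPercent, long openMillis) {
        this.minCalls = minCalls;
        this.errorThresholdPercent = errorThresholdPercent;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen() {
        return openedAt >= 0 && System.currentTimeMillis() - openedAt < openMillis;
    }

    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // fail fast: do not touch the dependency
        }
        try {
            T result = primary.get();
            record(true);
            return result;
        } catch (RuntimeException e) {
            record(false);
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        if (openedAt < 0) {
            return true;
        }
        if (System.currentTimeMillis() - openedAt >= openMillis) {
            // Open period is over: close the circuit and start counting afresh.
            openedAt = -1;
            calls = 0;
            failures = 0;
            return true;
        }
        return false;
    }

    private synchronized void record(boolean success) {
        calls++;
        if (!success) {
            failures++;
        }
        if (calls >= minCalls && failures * 100 >= errorThresholdPercent * calls) {
            openedAt = System.currentTimeMillis(); // trip the breaker
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(4, 50, 60_000);
        for (int i = 0; i < 4; i++) {
            breaker.call(() -> { throw new IllegalStateException("payment down"); },
                    () -> "QUEUED");
        }
        System.out.println(breaker.isOpen()); // prints true
    }
}
```

Once the breaker is open, callers get the fallback immediately, so a struggling payment service is neither waited on nor hit with more traffic until the open period elapses.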