The ongoing trouble with timeouts
Date posted
5 July 2017
We have all been there: late at night, writing some critical component for a piece of software that absolutely must be deployed to production within the next 24 hours to salvage product deadlines. It usually involves some last-minute integration of a service or a component. It might even sound something like "The Order Processing service needs to call the Billing Service's RESTful API"...we all want to get paid, right? "Straightforward stuff: 10-20 lines of code, fire up the debugger and it all works...great...commit...don't forget to write a funny commit message."
Not so fast...the developer just fell into the distributed computing trap! His team lead reviews and comments "what if the billing service has crashed, or is extremely busy and taking longer than usual to respond, or all the processing threads have hung due to a synchronisation bug, or"...well, you get the idea.
"We need to somehow handle all possible error conditions to ensure the order processing service remains stable; the first is straightforward, we will get an error response straightaway as the billing service is unreachable; what about the other error conditions...we may never get a response so we need a timeout to govern how long the order processing service waits for a response from the billing service"...enter the humble timeout.
Because timeouts are so hard, they are sometimes overlooked, often ignored, configured either too high or too low, or worse, left at the dreaded default value, all resulting in system instability and undefined behaviour when problems (the only certainty in distributed computing) occur. Timeouts are hard!
What is a Timeout?
As a concept, a timeout is actually quite simple: it's just a number, usually expressed in seconds, used to determine how long to wait for an operation (typically I/O) to complete. In modern technology, timeouts are ubiquitous: they help to turn an unreliable IP network into a reliable TCP/IP network, the backbone of the internet; they help mobile applications remain responsive when communicating in low-signal areas; and they help to keep systems stable in the face of failure. It's this last scenario I want to discuss in more detail.
Why do I need a timeout anyway?
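To make the definition concrete, here is a minimal Python sketch of a timeout bounding a network read. The local "silent" server, the `call_with_timeout` helper and the 0.5 second value are all assumptions made up for illustration; they stand in for a billing service that accepts a connection but never answers.

```python
import socket
import time


def call_with_timeout(host: str, port: int, timeout_seconds: float) -> str:
    """Send a request and wait at most timeout_seconds for each I/O step."""
    # The timeout passed here applies to the connect and to later send/recv calls.
    with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
        conn.sendall(b"charge order 42\n")  # hypothetical request payload
        try:
            reply = conn.recv(1024)
        except socket.timeout:
            # No reply arrived in time; the caller gets control back.
            return "timed out"
        return reply.decode()


# Hypothetical "billing service": it listens, but never reads or replies.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
server.listen(1)
port = server.getsockname()[1]

started = time.monotonic()
result = call_with_timeout("127.0.0.1", port, timeout_seconds=0.5)
elapsed = time.monotonic() - started
server.close()

print(result, f"after {elapsed:.1f}s")
```

Without the `timeout` argument, the `recv` call in this sketch would simply block for as long as the silent server stays up, which is exactly the situation the order processing service needs to avoid.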
Timeouts are a response to a wider problem, unreliable systems, for which there is no panacea. It is important, however, that we use timeouts in the right way. Typically, timeouts are used in request/reply style interactions, for example an HTTP call or a database update; however, they are also used in asynchronous systems, such as message brokers. Messages typically have a time-to-live (TTL): in effect, a timeout governing how long they may remain unprocessed before disappearing. Coding and configuring a timeout in response to finding a problem at integration test is almost always too late; it will have unintended consequences in any system of sufficient complexity. Instead, we need to carefully consider and design for the unreliable nature of distributed systems. As developers and architects, we need to consider how and when a system will fail, typically referred to in engineering practice as a "failure mode". It's discussed in great detail in the book "Release It!" by Michael Nygard which, incidentally, is an excellent read. I've used the word "when" intentionally here: even if the chance of failure occurring is 1 in 100 million transactions, if a system processes 10 million transactions a day, the chance of at least one failure on any given day is roughly 1 in 10 (1 - (1 - 10^-8)^(10^7) ≈ 0.095). If only the lottery had those odds.
Why are they so hard then?
What makes timeouts so difficult in practice is typically the questions that arise once we introduce one:
- How long should the timeout wait before expiring?
- How should the system behave if the timeout is exceeded?
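Both questions show up as explicit decisions in code. The sketch below is one possible shape, not a recommended answer: `charge_customer`, `process_order`, the 0.2 second value and the "payment status unknown" policy are all hypothetical, chosen only to show where each decision lives.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Question 1: how long should we wait? (assumed value, for illustration only)
BILLING_TIMEOUT_SECONDS = 0.2


def charge_customer(order_id: int, delay_seconds: float) -> str:
    """Hypothetical stand-in for the billing call; the delay simulates a slow service."""
    time.sleep(delay_seconds)
    return f"order {order_id}: charged"


def process_order(pool: ThreadPoolExecutor, order_id: int, delay_seconds: float) -> str:
    future = pool.submit(charge_customer, order_id, delay_seconds)
    try:
        return future.result(timeout=BILLING_TIMEOUT_SECONDS)
    except FutureTimeout:
        # Question 2: how should we behave when the timeout is exceeded?
        # One possible policy: report that the outcome is unknown.
        # The charge itself keeps running in the worker thread.
        return f"order {order_id}: payment status unknown"


with ThreadPoolExecutor(max_workers=2) as pool:
    fast = process_order(pool, 1, delay_seconds=0.0)
    slow = process_order(pool, 2, delay_seconds=1.0)

print(fast)
print(slow)
```

Note that the timeout here only stops the waiting, not the work: the slow charge may still complete after the caller has given up, which is part of why the second question is harder than it first looks.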
