The ongoing trouble with timeouts
We have all been there: late at night, writing some critical component that absolutely must be deployed to production within the next 24 hours to salvage product deadlines. It usually involves some last-minute integration of a service or a component…it might even sound something like “The Order Processing service needs to call the Billing service’s RESTful API”…we all want to get paid, right?…“Straightforward stuff – 10-20 lines of code, fire up the debugger and it all works…great…commit…don’t forget to write a funny commit message.”
Not so fast…the developer has just fallen into the distributed computing trap! Their team lead reviews the code and comments: “what if the billing service has crashed, or is extremely busy and taking longer than usual to respond, or all the processing threads have hung due to a synchronisation bug, or”…well, you get the idea.
“We need to somehow handle all possible error conditions to ensure the order processing service remains stable. The first is straightforward: we will get an error response straightaway because the billing service is unreachable. For the other error conditions we may never get a response at all, so we need a timeout to govern how long the order processing service waits for a response from the billing service”…enter the humble timeout.
As a concept, timeouts are actually quite simple: it’s just a number, usually expressed in seconds, that determines how long to wait for an operation (typically I/O) to complete. In modern technology, timeouts are ubiquitous; they help to turn an unreliable IP network into a reliable TCP/IP network – the backbone of the internet. They help mobile applications remain responsive when communicating in low-signal areas; they help to keep systems stable in the face of failure – it’s this last scenario I want to discuss in more detail.
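To make that concrete, here is a minimal sketch using Python’s standard library – the host, port and request are purely hypothetical stand-ins for the billing service:

```python
import socket

# Hypothetical call to the billing service. The timeout bounds how long we
# block on the connect and on each read, so the caller can fail fast instead
# of hanging the order processing thread indefinitely.
def call_billing_service(host, port, timeout_seconds=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
            conn.settimeout(timeout_seconds)  # also bound each send/recv
            conn.sendall(b"GET /invoices HTTP/1.1\r\nHost: billing\r\n\r\n")
            return conn.recv(4096)
    except OSError:  # covers socket.timeout and connection errors alike
        return None
```

Real code would distinguish a timed-out call from a refused connection; the point is simply that, without `timeout_seconds`, this call could block forever.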
Timeouts are a response to a wider problem: unreliable systems; there is no panacea. It is important however, that we use timeouts in the right way. Typically, timeouts are used in a request / reply style interaction, for example an HTTP call or a Database update, however, they are also used in asynchronous systems, such as message brokers. Messages typically have a time-to-live (TTL) – in effect, a timeout governing how long they should remain unprocessed before disappearing.
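A consumer-side TTL check might look like the sketch below – the field names `enqueued_at` and `ttl_seconds` are assumptions, standing in for whatever headers a real broker attaches:

```python
import time

# Hypothetical TTL check: drop messages that have sat unprocessed for longer
# than their time-to-live, rather than processing stale work.
def is_expired(message, now=None):
    now = time.time() if now is None else now
    return now - message["enqueued_at"] > message["ttl_seconds"]
```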
Coding and configuring a timeout in response to finding a problem at integration test is almost always too late – it will have unintended consequences in any system of sufficient complexity. Instead, we need to carefully consider and design for the unreliable nature of distributed systems. As developers and architects, we need to consider how and when a system will fail, typically referred to in engineering practices as a “failure mode” – it’s discussed in great detail in the book “Release It!” by Michael Nygard – which, incidentally, is an excellent read.
I’ve used the word when intentionally here; even if the chance of a failure occurring is 1 in 100 million transactions, a system processing 10 million transactions a day has roughly a 1 in 10 chance of failure on any given day – if only the lottery had those odds.
What makes timeouts so difficult in practice are the questions that arise once we introduce one: what value should the timeout be, and what should the system do when it fires – retry, give up, or something else entirely?
Distributed systems, particularly those using a microservices architecture, increase the complexity further, as the call stack typically contains many requests to remote services, each requiring its own timeout. Mismatched timeouts, as outlined below, are one of the main sources of problems.
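One way to avoid mismatched timeouts across a deep call chain is deadline propagation: rather than each hop picking its own number, the caller passes its remaining time budget downstream, so an inner call can never outlive the outer one. A minimal sketch (the class and function names are illustrative, not from any particular library):

```python
import time

class Deadline:
    """Tracks the remaining time budget for a chain of downstream calls."""

    def __init__(self, budget_seconds):
        self.expires_at = time.monotonic() + budget_seconds

    def remaining(self):
        return max(0.0, self.expires_at - time.monotonic())

def call_downstream(deadline):
    remaining = deadline.remaining()
    if remaining == 0.0:
        raise TimeoutError("deadline already exceeded; fail fast")
    # Use `remaining` as the timeout for the outbound request, e.g.
    # requests.get(url, timeout=remaining)
    return remaining
```

In practice the budget travels with the request, for example as an HTTP header, so every service in the chain shares one consistent deadline.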
Because timeouts are so hard, they are sometimes overlooked or ignored, configured either too high or too low, or, worse, left at the dreaded default value – all resulting in system instability and undefined behaviour when problems (the only certainty in distributed computing) occur. Timeouts are hard!
Arbitrarily picking values, or worse, using the default values for timeouts is extremely risky: too low, and even a small volume of traffic will cause timeouts to occur; too high, and we risk the dreaded, colourfully but aptly named “brown-out”, where all requests start to take longer and longer and the system slowly grinds to a halt.
The aim of a timeout is to fail fast. To do that, we need to measure system performance. Performance testing with realistic volumes and detailed instrumentation allows us to measure how long a response should take for any given request, so we can configure timeouts intelligently, based on hard evidence.
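As a sketch of what “based on hard evidence” might mean: take the response times measured during a performance test and set the timeout near a high percentile plus some headroom. The percentile and headroom multiplier here are assumptions to be tuned per service:

```python
# Suggest a timeout from measured latencies: pick a high percentile of the
# observed response times and add headroom, rather than guessing a value.
def suggest_timeout(latencies_ms, percentile=0.99, headroom=1.5):
    ordered = sorted(latencies_ms)
    index = min(len(ordered) - 1, int(percentile * len(ordered)))
    return ordered[index] * headroom
```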
System behaviour really depends on the failure causing the timeout. One approach is to reduce the number of timeouts that occur in the first place. The concept of back-pressure, or load shedding – essentially reducing the amount of traffic that reaches the failing system – helps alleviate one of the main causes of timeouts: busy systems. Busy systems are a breeding ground for timeouts; response times are usually directly proportional to the amount of work a system needs to do, or how many requests it is handling in parallel.
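A minimal load-shedding sketch: cap the number of in-flight requests and reject the excess immediately, rather than letting them queue until everything times out. The concurrency limit is illustrative:

```python
import threading

class LoadShedder:
    """Rejects work beyond a fixed concurrency limit instead of queueing it."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def submit(self, work):
        if not self._slots.acquire(blocking=False):
            return "rejected"  # shed load: a fast failure beats a slow timeout
        try:
            return work()
        finally:
            self._slots.release()
```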
It might be something as simple as a transient network issue caused by a system administrator updating firewall rules (during hours of low utilisation, of course). In this case, a simple retry may result in success…of course, there’s nothing really simple about a retry. What if the original request was processed successfully and only the response was lost – can we still retry? What if the original request was to debit a customer’s bank account – is the request idempotent?
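A retry-with-backoff sketch that makes the idempotency question explicit – the `idempotent` flag is the caller’s promise that replaying the request is safe (a read, say, rather than a bank debit):

```python
import time

def retry(operation, attempts=3, base_delay=0.1, idempotent=True):
    """Retry a timed-out operation with exponential backoff, but only if it
    is safe to replay; non-idempotent requests get exactly one attempt."""
    if not idempotent:
        return operation()  # never blindly replay a non-idempotent request
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # back off between tries
```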
Timeouts, retries and back-pressure are hard, but idempotent gateways and compensating transactions are harder, and probably left to be the subject of another blog post.
When talking about system behaviour, it’s also wise to call out how the system shouldn’t behave. Distributed transactions might seem really tempting; however, while they might solve some of the problems above, they introduce more issues than they solve. Hidden inter-dependencies, ordered startups (the transaction coordinator needs to be running before anything that needs to use it), slow performance and potential deadlocks are all symptoms of a problem no-one wants to have.
Ultimately, understanding system behaviour is a conversation between the developer, the technical architect and the product owner. The aim is to improve stability and keep the system responsive – unfortunately, as with most problems in computing, there is no one right answer.
Software developers love to take inspiration from engineering, and the circuit breaker is no exception. Netflix’s Hystrix library is probably the most famous implementation of the pattern available today (other implementations do exist, but aren’t as fully featured). It monitors response times and failure rates, and if they fall outside agreed thresholds, “opens” the circuit, instantly shedding load and relieving pressure on the failing downstream system. Periodic tracer-bullet requests are allowed to reach the downstream system to measure its health; eventually, the circuit is closed again, allowing all traffic to reach the downstream system.
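A toy version of the pattern – this is the general idea, not the Hystrix API, and the thresholds are illustrative: after a run of consecutive failures the circuit opens and calls fail fast; once the reset interval passes, a single tracer request is let through, and a success closes the circuit again.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow one tracer request through to test health
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```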
Once we start employing circuit breakers, with all the monitoring and isolation techniques that come along with them, such as bulkheads, we quickly move away from the simple timeout and into the world of antifragile software; building self-healing, fault-tolerant, resilient applications – the aspirational goal for many architects, developers, ops engineers and system administrators alike.