The two most common causes of software failure are, in order, human error and external services. Working extensively with external APIs, we often run into tricky issues in keeping our end-user services responsive, in terms of both speed and plain availability. Many teams address those issues case by case, most often with a homemade patchwork of external libraries and ad hoc handling of failing cases, and we used to do the same. Over time, we have come to rethink our approach to this problem.
We will present the usual suspects and their consequences: timeouts, HTTP errors, cascading failures, unclear or changing contracts, and the difficulty of forensic analysis after an incident whose root cause lies in external data or calls.
Then, we'll show various approaches that we use or have seen used by teams of different sizes.
We'll finish by presenting an innovative approach that delegates these issues to a forward proxy, so that the development team can avoid spending time reinventing resilience and reliability patterns while still having the tools to act quickly when things go wrong.