Writing reliable software while depending on hazardous APIs

Romain Dorgueil

Thursday 14:20 in Hassium

The two most common causes of software failure are, in order, human error and external services. Working extensively with external APIs, we often run into tricky issues in keeping our end-user services responsive, in terms of both speed and plain availability. Many teams address these issues case by case, most often with a homemade patchwork of external libraries and special handling for individual failure cases, and we used to do the same. Over time, we have come to rethink our approach to this problem.

We will present the usual suspects we face, along with their consequences: timeouts, HTTP errors, cascading failures, unclear or changing contracts, and the difficulty of forensic analysis after an incident whose root cause stems from external data or calls.

Then, we'll show various approaches that we use or have seen used by teams of different sizes.

We'll finish by presenting an innovative approach that delegates these issues to a forward proxy, so that the development team avoids spending time reinventing resilience and reliability patterns while still having the tools to act quickly when things go wrong.

Romain Dorgueil

My first pieces of code ran on Atari ST machines. I had the chance to see the Internet baby say its first words while I was starting to get interested in building software. I'm a proud Software Craftsman and Open-Source Software advocate. I spend a few of my other lives playing Afro-Cuban and jazz music, or playing Go.