Backs off smartly, typed, self-documenting: How I write scripts for orchestrating APIs
I work as a software engineer at a tech company. My team has outgrown many infra and platform teams, so most of our internal tooling falls into our own hands. To be precise, mine.
If that sounds like you and your team, you may find this post helpful. Here are some best practices I’ve learned while building internal tools that automate day-to-day tasks.
An orchestra of APIs
In most tech companies, infrastructure and platform teams maintain internal APIs for various actions in the DevOps cycle. In addition to the APIs that come with off-the-shelf solutions (such as GitHub and Jenkins), in-house APIs are responsible for:
- containers / pool management,
- software building & testing,
- compliance validations,
- batch job scheduling,
- …
Thanks to how the “separation of concerns” principle manifests in organization structures, different teams own disjoint subsets of those APIs. It’s unrealistic to count on them to collaboratively build a unified tool that streamlines a complete user journey for developers.
No blame on them, though; my team has some really rare usage patterns. Even the least intriguing example, testing a revision with live traffic, takes quite a few steps. To achieve this, an engineer on my team needs to:
- get the current revision of the codebase on GitHub,
- get the version currently live in production,
- compare the two version numbers, only continuing if they don’t match,
- query the CI for the manifest ID associated with this latest revision (potentially waiting for the CI to finish building the revision first),
- have the cloud service deploy it to a specific pool,
- fetch a list of hosts in the prod pool and the test pool,
- kick off traffic-mirroring jobs (what is this?),
- repeatedly check the progress till they finish, and then
- examine the error rates and latency numbers.
This procedure involves at least 6 different services. Each service has its own web UI, which makes it more approachable to humans. On the flip side, web UIs hide away the underlying APIs, making them harder to discover. To make things worse, developers may expect everyone to use the graphical interfaces, so they often do not bother to document their APIs.
How I write scripts that orchestrate APIs
Studying the API calls. To understand which APIs are called, and how, when I click on something, I often have to reverse-engineer the UI from the browser’s developer tools. On Chromium-based browsers, the “Network” tab in the Developer Console shows all network calls being made in the current tab. This is what I do:
- I wait for the page to complete rendering before opening up the Console, so as to avoid the noise of async loads.
- I perform the action by hand and wait for the desired output to show up.
- I immediately stop recording the network calls in that tab, again for minimizing noise.
- I browse through them and pick the one that is doing the job. It’s usually the one that is taking the longest, which you can observe in the waterfall diagram on the right.
- I study its request, particularly the headers, the query parameters, and the request body.
- I read the response, paying special attention to how the status is returned; I will be writing an assertion on it in the API-calling function later (see the sketch below).
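Once I understand the request and the response, replaying the call from Python is straightforward. Here is a minimal sketch of what such a reverse-engineered call might look like; the endpoint, header, and field names below are placeholders for whatever the Network tab reveals:

import requests

def get_job_status(job_id, token):
    # Hypothetical endpoint and parameters, reconstructed from what the
    # Network tab showed; replace them with the real ones you discover.
    response = requests.get(
        "https://ci.internal.example.com/api/jobs",
        headers={"Authorization": f"Bearer {token}"},
        params={"jobId": job_id},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    # The assertion mirrors how I saw the status returned in the response.
    assert "job_status" in data
    return data["job_status"]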
Python is excellent for glue work like orchestrating APIs.
Must-have for calling APIs: backoff
For calling in-house APIs, I use the well-trusted requests library. But I also want to avoid DDoSing the server when I fire those "is the job done yet" queries, so I use the backoff library to retry my requests with exponentially growing wait durations.
Using backoff also simplifies response validations. Without it, I would write:
import requests
from json import JSONDecodeError
from time import sleep

def check_status(...):
    is_job_done = False
    num_tries = 0
    while num_tries < 3 and not is_job_done:
        num_tries += 1
        response = requests.get(...)
        if response.status_code != 200:
            # Server failed to give a JSON response.
            sleep(3)
            continue
        try:
            data = response.json()
        except JSONDecodeError:
            sleep(3)
            continue
        if "job_status" not in data:
            # Malformed response.
            sleep(3)
            continue
        is_job_done = data["job_status"] == "done"
With backoff, I'd write:
import backoff
import requests
from json import JSONDecodeError

@backoff.on_exception(
    backoff.expo,
    (
        # Looks out for server execution issues.
        requests.exceptions.RequestException,
        # Looks out for malformed response issues.
        JSONDecodeError,
        KeyError,
        # Looks out for data values that violate expectations.
        AssertionError,
    ),
    max_tries=3,
)
def check_status(...):
    response = requests.get(...)
    response.raise_for_status()
    data = response.json()
    assert data["job_status"] == "done"
Notice that:
- I no longer have to implement the retry loop or the num_tries counter.
- Plus, backoff serves as a try-except clause.
The beauty of defining a list of exceptions to catch is that I can reuse this list across multiple API-calling functions:
__EXCEPTIONS_TO_BACK_OFF_FOR = (
    requests.exceptions.RequestException,
    JSONDecodeError,
    KeyError,
    AssertionError,
)

@backoff.on_exception(
    backoff.expo,
    __EXCEPTIONS_TO_BACK_OFF_FOR,
)
def check_status_from_service_1(...):
    response = requests.get(...)
    response.raise_for_status()
    data = response.json()
    assert data["job_status"] == "done"

@backoff.on_exception(
    backoff.expo,
    __EXCEPTIONS_TO_BACK_OFF_FOR,
)
def check_result_from_service_2(...):
    response = requests.get(...)
    response.raise_for_status()
    data = response.json()
    assert data["result"] == "OK"
The idea is that, as a caller of the API, I should spend as little verbiage (in lines of code) as possible on handling its exceptions. In the function body, I should focus on the happy path of execution, leaving all server-side problems to the server itself. It is like dining out: if the cook burnt your bread, just ask them to make another; no need to ask whether they set the timer too long or the temperature too high.
backoff also plays well with libraries that provide typed access to off-the-shelf systems, such as GitHub and Jenkins. This makes handling ephemeral errors a breeze:
@backoff.on_exception(
    backoff.expo,
    github.GithubException.RateLimitExceededException,
)
def get_repo(repo_name):
    return github_client.get_repo(repo_name)
Prefer typed interactions
As shown above, I use a typed library whenever possible. For instance,
- To interact with the CI, I use the jenkinsapi package.
- To interact with SCM systems, I use gitpython when the repo is local (example of me doing this) and pygithub when the repo in question lives on a GitHub Enterprise server (example of me doing that).
- Even for things as native as globbing for files, I use pathlib instead of glob; see the sketch right after this list.
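To illustrate that last point, compare the two globbing styles; the directory and pattern here are made up for illustration:

from pathlib import Path
import glob

# Typed: every match is a Path object with methods like .stem and .stat().
for config in Path("deploy/configs").glob("*.yaml"):
    print(config.stem, config.stat().st_size)

# Untyped: glob returns bare strings I would have to slice and dice myself.
for config_path in glob.glob("deploy/configs/*.yaml"):
    print(config_path)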
I prefer the typed approach over executing shell commands or crafting my own HTTP requests because:
- They give me the safety of type checks. Mind the difference between “typed” and “object-oriented” here: you can have type checks even in non-OO programming languages like Go.
- They usually allow me to chain method calls together, e.g., Jenkins(JENKINS_URL).get_job(JOB_NAME).get_build(7).get_status() (sketched right after this list). It reads like English and gives clarity of mind.
- Reading their documentation informs me of all the possible exceptions I need to handle. Taking the example of the last code block above, I don’t have to bombard GitHub with 10k requests before learning that I need to watch out for rate limits.
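As a quick illustration of that chaining, here is a minimal sketch; it assumes a reachable Jenkins instance and jenkinsapi's snake_case accessors, and the URL and job name are placeholders:

from jenkinsapi.jenkins import Jenkins

JENKINS_URL = "https://ci.example.com"  # placeholder
JOB_NAME = "build-and-test"             # placeholder

# Each call returns a typed object (Jenkins -> Job -> Build), so a typo
# in a method name fails fast instead of surfacing as a cryptic HTTP error.
status = Jenkins(JENKINS_URL).get_job(JOB_NAME).get_build(7).get_status()
print(status)  # e.g. "SUCCESS"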
All in all, using typed libraries is all about minimizing the risk of interacting with complex, enterprise-level software systems.
Write self-documenting code
The second point above brings us to the topic of self-documenting code. I’ve observed a general sentiment regarding API orchestration scripts: many think they are too volatile to justify writing documentation for. Hence, it’s crucial to write them as readably as possible in the first place.
In this regard, besides preferring libraries that offer typed access, I have a couple of other habits:
I use pandas to manipulate data. Don’t think of pandas as purely a numerical tool; it works nicely with JSONs, too. It’s common for APIs to wrap most information in a list. For example, a getHosts API may return:
{
    "status": "OK",
    "hosts": [
        {"hostName": "a.example.com", "status": "idle"},
        {"hostName": "b.example.com", "status": "down"}
    ]
}
I always convert that into a DataFrame first and apply transformations there. This reduces the usage of nested for-loops and makes the effect of each step observable (by printing intermediate DataFrames as tables during debugging).
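As a sketch, filtering the getHosts response above for idle hosts might look like this:

import pandas as pd

response_json = {
    "status": "OK",
    "hosts": [
        {"hostName": "a.example.com", "status": "idle"},
        {"hostName": "b.example.com", "status": "down"},
    ],
}

# One line turns the list of JSON objects into a table.
hosts = pd.DataFrame(response_json["hosts"])

# Each step stays observable: print(hosts) renders it as a table.
idle_hosts = hosts[hosts["status"] == "idle"]["hostName"].tolist()
print(idle_hosts)  # ['a.example.com']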
For accepting inputs, I use docopt. This library parses the declarations of CLI arguments from the docstring, which you write like a help message. Yes, argparse and click offer syntactic safety on the argument definitions, but only with docopt can I show a help message without executing the script.
This is important due to two facts:
- users may not have the dependencies installed when they download my script, and
- users usually want to be convinced that a script will fit their needs before investing time to satisfy its dependencies. They usually confirm so by reading the help messages.
Since argparse and click only kick in after the import statements, those developers won't even have a chance to read the help message:
$ ./script.py --help
Traceback (most recent call last):
  File "...", line 1X, in <module>
    import lorem
ModuleNotFoundError: No module named 'lorem'
With a docstring, people can simply less the script, and the first lines are exactly what they would see when they run script.py --help:
$ less script.py
#! /usr/bin/env python3
"""
Javadoc Auto-writer
===================
If a GitHub PR modifies Java methods that lack Javadocs, generate the Javadocs via GPT and suggest adding them via PR review comments.

Usage:
    ./autojavadoc.py [options] <owner> <repo> <pr_number>

Options:
    --endpoint=<e>  GitHub API endpoint. [default: https://api.github.com]
"""
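For completeness, here is how such a script would consume that docstring. This is standard docopt usage; the dict keys match the Usage pattern above:

from docopt import docopt

if __name__ == "__main__":
    # docopt builds the parser from the "Usage:" section of __doc__ and
    # prints the docstring itself when the user passes --help.
    args = docopt(__doc__)
    owner = args["<owner>"]
    repo = args["<repo>"]
    pr_number = int(args["<pr_number>"])
    endpoint = args["--endpoint"]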
I also write unit tests. For data manipulation procedures (such as transforming a JSON response from one API to fit the input schema of another), no matter how trivial, I make a point of writing a unit test. It's not much overhead: using pytest, the test code really boils down to assert transform_data(mock_input) == expected_output:
def test_pair_hosts():
    got = pair_hosts({
        'all_hosts': [1, 2, 3, 4],
        'hosts_in_use': [4],
    })
    want = [{'left': 1, 'right': 2}]
    assert got == want
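For context, a hypothetical pair_hosts implementation satisfying this test could look like the following; the pairing logic here is made up for illustration:

def pair_hosts(data):
    # Exclude hosts that are already serving traffic.
    free_hosts = [h for h in data['all_hosts'] if h not in data['hosts_in_use']]
    # Pair up neighbors; a host without a partner is left out.
    return [
        {'left': left, 'right': right}
        for left, right in zip(free_hosts[::2], free_hosts[1::2])
    ]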
Unit tests give two advantages:
- Writing them forces me to extract the data-transforming steps into separate functions, thus improving readability.
- More importantly, the test data serve as examples for the API calls. The mock input describes what the previous API call might emit, while the expected output illustrates the request schema of the next. This is useful when another developer picks up the task of maintaining the script, or for detecting that an API's schema has changed.
Conclusion
In this post, I’ve shared some best practices I’ve developed while writing automation scripts for orchestrating multiple APIs. They boil down to three main ideas:
- Use backoff to free yourself from handling server-side failures.
- Look for existing API libraries that give you the safety of type checks and extra documentation on API behaviors.
- Protect other engineers from messing up or misusing your script by writing self-documenting code.
While none of these ideas is groundbreaking, together they offer a specific perspective on building internal tools and on why these practices pay off in this context. I hope you find them helpful in your work.
Any other tricks you find effective? I’d love to learn them from the comments!