Batching API Calls


MP 168: An obvious speedup, with a surprising side benefit.

I've been working steadily on gh-profiler, while attending PyCon US and working on other projects as well. The project makes a number of calls to GitHub's API, then analyzes the results to generate indicators of whether the targeted user has been engaging in problematic open source behavior. These calls were being made serially, so one obvious, low-hanging optimization was to parallelize them.

This had the expected effect: most gh-profiler runs were significantly faster once the calls ran in parallel. But it also had a much more significant effect: any additional call we want to make to gather more information about the user's activity is pretty much free. If a new call is faster than the current slowest API call, it shouldn't noticeably affect the overall execution time.

In this post I'll show what kinds of changes were necessary to make parallel API calls, and discuss the unexpected benefits of making this change. A tool that runs reasonably quickly is always going to be more useful than one that's slower than it needs to be.

The old (serial) way

When I started this project I wasn't sure how much information I'd need about the user in order to start producing a meaningful signal. So I just started making API calls, and then analyzing the results of each call. The first thing I wanted to look at was how old the user's account was, because a newer account can be a sign the user is a bot that's spamming open source issues and PRs. An early version of gh-profiler looked something like this:

ensure_gh()
ensure_authenticated()

check_account_age()

The first two steps make sure the GitHub CLI tool gh is installed, and that the user is running an authenticated session of gh. Then a call is made to get information about the new contributor's account, and that information is processed. The output looked like this:

$ uvx gh-profiler ehmatthes
GitHub user: ehmatthes
🟢 Account age: 13 years
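To make the account-age step concrete, here is a minimal sketch of how such a check could work. It is not gh-profiler's actual code: the function names, the 365-day threshold, and the two-color rule are all assumptions made for illustration. It relies on the `gh api` command and the `created_at` field that GitHub's users endpoint returns.

```python
import json
import subprocess
from datetime import datetime, timezone


def fetch_user(username):
    """Fetch a user's public profile through the gh CLI (needs gh and a network)."""
    result = subprocess.run(
        ["gh", "api", f"users/{username}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def account_age_days(created_at, now=None):
    """Return the account age in whole days, given an ISO 8601 timestamp."""
    # GitHub timestamps end in "Z"; swap it for an explicit offset so
    # datetime.fromisoformat() also accepts it on older Python versions.
    created = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - created).days


def age_indicator(days, threshold_days=365):
    """Hypothetical rule: flag accounts younger than threshold_days."""
    return "🟢" if days >= threshold_days else "🟡"
```

Keeping the date arithmetic in `account_age_days()` separate from the network call in `fetch_user()` means the calculation can be tested without hitting the API, for example `account_age_days(fetch_user("ehmatthes")["created_at"])` in real use, or a fixed timestamp and a fixed `now` in a test.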

This was a good start. But I ended up grabbing a bit more information before building out the first useful version of gh-profiler. The core of the project expanded piece by piece until it looked more like this:

ensure_gh()
ensure_authenticated()

check_account_age()
check_profile_info()

check_pr_activity()

check_issue_activity()

That was enough information to get meaningful signals about whether a user was likely to be a well-intentioned human contributor, a bot, or a human using AI to spam a bunch of repos:

$ uvx gh-profiler
GitHub user:
🟡 Some concerns found with user's profile.
🟡 Account age: 6 months
🟢 Profile information: ...

🟢 No concerns found with recent PR activity.
🟢 Fewer than 10 PRs opened in the last 21 days.

🔴 Significant concerns found with recent issue activity.
🔴 79 new issues opened in the last 21 days.
🟢 1 issues closed as NOT_PLANNED.
🔴 71 issues opened with the same title:
📋 Documentation Enhancement Suggestion (71)
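The repeated-title signal in the output above is straightforward to compute once the issue data is in hand. The sketch below is only an illustration of the idea, not the project's implementation: it assumes each issue is a dict with a `"title"` key, and both the function name and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def repeated_titles(issues, min_count=2):
    """Return (title, count) pairs for titles used at least min_count times.

    Assumes each issue is a dict with a "title" key. Results are ordered
    from most to least frequent.
    """
    counts = Counter(issue["title"] for issue in issues)
    return [
        (title, count)
        for title, count in counts.most_common()
        if count >= min_count
    ]
```

A user who opens dozens of issues with an identical title across many repos stands out immediately in this kind of count, while a user with a handful of distinct titles returns an empty list.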

This was quite useful! But there were some problems with the approach I had started with.

Growing pains

A number of problems were clear at this point, as the project was starting to see some actual usage:

It was getting slower with each new piece of information that was being included.

People were starting to identify additional patterns of behavior that we should check for. But every new API call would mean the program takes longer and longer to run.

In the last post, I described addressing this by getting rid of the ensure_authenticated() call, and instead checking whether the first necessary API call was successful. That turned out to be unreliable, in part because there are several ways a user can be unauthenticated. For example, the user may have logged out explicitly, or they may have an expired token. It turns out an explicit check for whether the user is authenticated was quite useful after all. But adding that call back in would slow the program down by ~0.3 seconds. That's not much, but it was a trend I didn't want to resume.

Re-architecting for parallel calls

To parallelize the API calls, I needed to restructure the project so that fetching the necessary data was separate from processing it. Before introducing any parallel code, I restructured the project to look like this:

ensure_gh()

def get_data():
    fetch_status()
    fetch_age()
    fetch_profile_data()
    fetch_pr_data()
    fetch_issue_data()

def process_data():
    process_status()
    process_age()
    ...

The function to check whether gh is installed is entirely local, so it's quick and can be run before anything else. All external data is first fetched by get_data(), and then all the fetched data is processed by process_data(). There was no change in the project's behavior. It just did all the fetching first, and all the processing second. Here's the full main() function from gh_profiler.py:

def main(): # Generate new workflow,...
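Once fetching is separated from processing, the fetch phase can be run concurrently. The sketch below shows one way that could look, using threads from Python's standard library. This is an assumption, since the text above doesn't say which mechanism gh-profiler uses. The `fake_fetch()` function and the latencies in `CALLS` are hypothetical stand-ins for the real fetch_*() functions and their network delays.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-call latencies in seconds; real GitHub API timings differ.
CALLS = {"status": 0.05, "age": 0.05, "profile": 0.05, "prs": 0.1, "issues": 0.2}


def fake_fetch(name, delay):
    """Stand-in for a network-bound fetch_*() function."""
    time.sleep(delay)
    return f"{name} data"


def get_data_serial(calls):
    """Run each fetch one after another; total time is the sum of the delays."""
    return {name: fake_fetch(name, delay) for name, delay in calls.items()}


def get_data_parallel(calls):
    """Run all fetches in threads; total time is roughly the slowest delay."""
    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = {
            name: pool.submit(fake_fetch, name, delay)
            for name, delay in calls.items()
        }
        # result() blocks until that call finishes and re-raises its exception.
        return {name: future.result() for name, future in futures.items()}


def timed(func, calls):
    """Return (result, elapsed seconds) for one get_data variant."""
    start = time.perf_counter()
    result = func(calls)
    return result, time.perf_counter() - start
```

With these made-up delays, the serial version takes about the sum of all five calls, while the parallel version takes about as long as the slowest one. That's why adding another call that is quicker than the slowest existing call costs almost nothing in the parallel version.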
