I Built an Indexing Platform for an External News Aggregation API


One of my clients is a WordPress VIP agency. Over the last few months, I have been working with them to help launch a new version of a site for one of their VIP clients. Most of my effort has gone into extracting the numerous bits of custom functionality that were baked into the existing WordPress theme and giving them a proper home as a mu-plugin. Of the dozens of features I migrated, organized, and modernized, one was an integration with an external API that provides articles to display on the site. These articles are not part of the WordPress installation; they are syndicated from another source via a REST API.

When we initially integrated this API, we used a simple caching policy: cache each API response and expire it after a while. It was a nice default approach that seemed sufficient for our initial needs.
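In practice, that first pass looked something like the sketch below: cache the raw API response in a transient and let it expire. The function name, endpoint, and expiry here are illustrative placeholders, not the actual client code.

```php
// Simplified sketch of the original cache policy. Names and the endpoint
// are hypothetical placeholders, not the real integration.
function get_remote_articles( array $args ) {
	$cache_key = 'remote_articles_' . md5( wp_json_encode( $args ) );

	// Serve the cached response if we still have one.
	$cached = get_transient( $cache_key );
	if ( false !== $cached ) {
		return $cached;
	}

	// Otherwise hit the API and cache the result for an hour.
	$response = wp_remote_get( add_query_arg( $args, 'https://api.example.com/articles' ) );
	if ( is_wp_error( $response ) ) {
		return [];
	}

	$articles = json_decode( wp_remote_retrieve_body( $response ), true );
	set_transient( $cache_key, $articles, HOUR_IN_SECONDS );

	return $articles;
}
```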

One day, the integration suddenly started to fail. Various website crawlers would crawl the site, hammering the API with requests and causing it to hit its rate limit. Pages that relied on the external API would then load with missing content.

I was presented with this project, and in spite of the fact that I was in the midst of moving into a new camper, I was able to design the integration, implement it, and advise as we worked through making the system reliable. I have built many integrations like this before, including an e-commerce setup a few years prior.

The Problem

Our cache policy wasn't reducing API requests adequately. Website crawlers would hit very old pages that don't get viewed frequently, so those pages were never in the cache, and every crawl triggered fresh API requests. That inadvertently pushed us past our rate limit and caused problems on the site.

Along with this problem, the agency also wanted to be able to query the content this API provided and intermingle it with stories written on the site itself. That wasn't something we could do: the data came from disparate sources, and merging them into a single list of content driven by the varied query parameters WP_Query supports was impossible.

Proposed Solution

At first blush, I thought this project was going to be a perfect fit for my content ingestion library, Adiungo, which is incomplete, but far enough along that it could have been used here with success. But after some back and forth with the client, it became apparent that they did not want all of the historical data in the API, just current content as it comes in. I'm sure Adiungo could support something like this, but the timeline on this project was rather tight, and I didn't feel comfortable using it yet. Another time, perhaps.

Based on the above, I proposed that we build a system that stores the stories retrieved from the API as a WordPress custom post type. These posts are treated in much the same way as a traditional cache; the difference is that they're saved as posts instead of in transients, or Redis, or whatever. This gives the agency the benefit of querying the content through WP_Query alongside other content, while also reducing the number of API requests the site needs to make.
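To make the query-side benefit concrete, here's a minimal sketch of what the theme can do once the stories live in a custom post type. The external_story post type name is a hypothetical placeholder; the real post type and arguments may differ.

```php
// Query native posts and ingested stories in one loop, sorted by date.
// 'external_story' is a placeholder post type name.
$query = new WP_Query( [
	'post_type'      => [ 'post', 'external_story' ],
	'posts_per_page' => 10,
	'orderby'        => 'date',
	'order'          => 'DESC',
] );

while ( $query->have_posts() ) {
	$query->the_post();
	the_title( '<h2>', '</h2>' );
}
wp_reset_postdata();
```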

[Alex] was definitely opinionated about the architecture. We brought him a problem that definitely had some constraints on it and we didn’t have to go back and forth with him a lot about how the thing should be built. He understood the problem and suggested a solution. From there we were able to discuss some specifics and make some minor adjustments but overall, that initial architecture is what we ended up shipping and that saved us a lot of time.

Implementation

To build this, I needed several components:

  1. A cache strategy. The site is using Underpin now, which has a handy object caching trait that I was able to lean on. I just needed to build a custom cache strategy that uses the WP post as the cached object.
  2. A model for the stories. This allowed me to ensure the returned data is consistent regardless of whether it came from the API or from the WordPress database, which keeps the entire mechanism opaque from the perspective of the theme.
  3. A place to store it. In this case, that's the custom post type, with good ol' fashioned post meta (sketched just after this list).
  4. An API implementation. I already built this in a previous project, again with the help of a class in Underpin.
  5. Request middleware to throttle the number of requests we can make to the API in a single minute, as well as in a single day. This ensures we aren't too voracious with our ingestion processes. (The system needs a diet, haha.) Again, Underpin has something to help with this; a generic sketch of the idea follows the list.
  6. A task scheduling mechanism. This ensures that content ingestion happens in the background, retries when it fails, and has minimal impact on the site. There are a few libraries out there that do this, but I went with Delicious Brains' background processing library, mostly because it's what I know (see the sketch after this list).
  7. A set of WP-CLI commands to interact with, troubleshoot, and check the status of the ingestion.
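
Items 1 and 3 boil down to treating a WordPress post as the cached object. The sketch below shows the rough shape of storing and finding a cached story; the post type, meta keys, and function names are illustrative assumptions of mine, not Underpin's actual cache strategy API.

```php
// Rough sketch of using a post + post meta as the cached object.
// Post type name, meta keys, and function names are hypothetical.
function save_story_to_cache( array $story ) {
	$post_id = wp_insert_post( [
		'post_type'    => 'external_story',
		'post_status'  => 'publish',
		'post_title'   => $story['title'],
		'post_content' => $story['body'],
		'post_date'    => $story['published_at'],
	] );

	if ( $post_id && ! is_wp_error( $post_id ) ) {
		update_post_meta( $post_id, 'external_id', $story['id'] );
		update_post_meta( $post_id, 'last_synced', time() );
	}

	return $post_id;
}

function find_cached_story( string $external_id ) {
	$posts = get_posts( [
		'post_type'   => 'external_story',
		'meta_key'    => 'external_id',
		'meta_value'  => $external_id,
		'numberposts' => 1,
	] );

	return $posts ? $posts[0] : null;
}
```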
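For item 5, Underpin provides the middleware itself; conceptually, though, throttling comes down to counting requests in fixed windows and refusing to send once a budget is spent. Here's a generic sketch of that idea, not Underpin's actual implementation, and the limits are made up.

```php
// Hypothetical fixed-window request throttle. Each window gets its own
// transient key, so the count resets naturally when the window rolls over.
function request_allowed( string $name, int $window_seconds, int $limit ): bool {
	// Bucket the current time into a fixed window (e.g. this minute, this day).
	$bucket = (int) floor( time() / $window_seconds );
	$key    = "api_throttle_{$name}_{$bucket}";
	$count  = (int) get_transient( $key );

	if ( $count >= $limit ) {
		return false;
	}

	set_transient( $key, $count + 1, $window_seconds );

	return true;
}

// Example limits: at most 10 requests per minute and 500 per day.
$can_request = request_allowed( 'minute', MINUTE_IN_SECONDS, 10 )
	&& request_allowed( 'day', DAY_IN_SECONDS, 500 );
```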
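For item 6, Delicious Brains' library has you extend its WP_Background_Process class and implement a task() method that runs once per queued item. Below is a minimal sketch under that pattern; the class name, payload shape, and ingest_story() helper are hypothetical, not the project's actual code.

```php
// Minimal sketch built on Delicious Brains' WP_Background_Process.
// Class name, payload, and ingest_story() are placeholders.
class Story_Ingest_Process extends WP_Background_Process {

	protected $action = 'story_ingest';

	/**
	 * Runs once per queued item, in the background, outside the page request.
	 * Returning false removes the item from the queue; returning the item
	 * re-queues it, which is how failed ingestions get retried.
	 */
	protected function task( $item ) {
		$saved = ingest_story( $item['external_id'] );

		return $saved ? false : $item;
	}
}

// Queue a story for ingestion without blocking the current request.
$process = new Story_Ingest_Process();
$process->push_to_queue( [ 'external_id' => '12345' ] );
$process->save()->dispatch();
```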

With these pieces combined, I was able to create a caching mechanism that only pulls stories when they are new enough to justify pulling, and only refreshes a story when its cached copy has gone long enough without an update to justify asking the API again. This creates a "Goldilocks region": recent content gets pulled and occasionally refreshed, but past a certain age it stops being updated entirely. This prevents crawlers from inadvertently flooding the system with ingestion requests and causing rate limit errors, while still ensuring that new content remains up-to-date.
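The exact rules are specific to the project, but the shape of that decision looks roughly like this. The age thresholds, meta key, and function name are illustrative assumptions, not the production values.

```php
// Illustrative sketch of the "Goldilocks" refresh decision.
function story_needs_refresh( int $post_id ): bool {
	$published   = (int) get_post_time( 'U', true, $post_id );
	$last_synced = (int) get_post_meta( $post_id, 'last_synced', true );

	$age_since_publish = time() - $published;
	$age_since_sync    = time() - $last_synced;

	// Too old: the story is past the window where updates are expected,
	// so crawlers hitting it never trigger an API request.
	if ( $age_since_publish > WEEK_IN_SECONDS ) {
		return false;
	}

	// Too fresh: we synced recently, so the cached copy is good enough.
	if ( $age_since_sync < HOUR_IN_SECONDS ) {
		return false;
	}

	// Just right: recent story, stale copy. Ask the API for an update.
	return true;
}
```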

Conclusion

As of today, the site in question is quietly pulling content from the remote source, and the number of requests to the original API has dropped from thousands per day to fewer than 100. The agency has a lot more transparency in their relationship with the API, and has tools at their disposal to interact with it reliably.

This was a fun project that made extensive use of Underpin. It felt good to put the library to work in this context, and it was exciting to see it perform reliably in action.