Solving Memory Issues with Custom Crawlers
Spantree recently worked with a company whose goal is to simplify regulatory compliance. One of their core value propositions is to identify, retrieve, and notify their clients of new regulatory documents published on the websites of organizations like the FTC and the SEC, on the day those documents are published. They called Spantree because the critical part of their infrastructure that coordinated the identification and retrieval of these documents was failing. This is the story of how we made that system an order of magnitude more scalable while simultaneously improving its reliability.
Architecture
This company identifies and retrieves regulatory documents using a homegrown web crawler written in Python that runs every night. We were initially contacted after this web crawler had stopped identifying any regulatory documents.
The architecture of the crawler was composed of four primary components:
- The crawler, which was deployed using AWS Spot Fleet Instances operating in an ECS cluster
- A scheduling system running on Heroku that seeds the crawl
- A MongoDB database that stores regulatory documents and some crawl state
- A Redis instance that the Python application interacts with through a queueing library called RQ
The lifecycle of a crawl is as follows:
- The scheduler dyno was always running and would execute specific tasks at specific times or after specific intervals. Its most important task was to seed a crawl every night: it queried the MongoDB database for a set of seed URLs and enqueued these into a "urls" queue on Redis (see the sketch after this list).
- The ECS cluster would scale up dozens of worker machines, and these instances would check the queue for URLs to crawl.
- If the queue was non-empty, a worker would pop a URL off the queue and visit it. Visiting a URL looked like this:
  - First, the system spun up a copy of the Firefox browser using Selenium. Selenium was used to scrape all pages, regardless of the type of content.
  - Then the contents of the page were downloaded and all links were identified.
  - Every single link would be added to the "urls" Redis queue.
  - If any of these links matched specific regexes indicating a regulatory document (e.g. a PDF), the link would be offloaded onto a separate queue for additional processing.
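For readers unfamiliar with RQ, here is a minimal sketch of what seeding a crawl like this might look like. The queue name, the visit_url task, and the Mongo collection names are hypothetical stand-ins, not the client's actual code:

# seed_crawl.py - a minimal sketch of nightly crawl seeding with RQ (hypothetical names)
from redis import Redis
from rq import Queue
from pymongo import MongoClient

from crawler.tasks import visit_url  # hypothetical worker task that scrapes a single URL

redis_conn = Redis(host="localhost", port=6379)
urls_queue = Queue("urls", connection=redis_conn)

# Pull the seed URL for every regulator from Mongo
mongo = MongoClient("mongodb://localhost:27017")
regulators = mongo.compliance.regulators.find({}, {"seed_url": 1})

for regulator in regulators:
    # Each enqueue pushes a job onto the "urls" queue; the ECS workers pop and execute them
    urls_queue.enqueue(visit_url, regulator["seed_url"])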
Getting Started with the Crawler
When we got access to the code, it was not possible to start the application locally. One of the previous developers had “resolved” a Selenium memory leak by killing the crawl worker process after it finished visiting and scraping a single page, relying on the underlying AWS ECS cluster to spin up a new machine with a new worker process. As a result, the codebase was unusable locally. The previous developers of this application had been under such pressure to keep this critical system online that a series of hacks and bandaids had been applied over the course of years, and the system was now struggling under the weight of its technical debt.
After addressing that memory leak and a number of similar issues so that the application could run locally, our first task was to increase the system's debuggability.
Observability - Structured Logging
One of the primary challenges with debugging this crawler was that its operations were completely opaque. It was merely a set of inputs (the seed URLs), a set of outputs (some changes to Mongo and saved documents in S3), and hours' worth of inaccessible intermediate state. There was no way to monitor the crawler's progress or verify its behavior at a granular level.
Part of the issue stemmed from the use of unstructured logs. The crawler was programmed to output messages like this:
INFO: Visiting new url: www.sec.gov
ERROR: crawl failure: www.ftc.gov
INFO: Found 25 new documents on www.sec.gov
Each log was an unstructured string. The problem with this approach is that the messages are exceptionally difficult to parse, and thus, without investing effort into parsing, impossible to run sophisticated queries against. Debugging is mostly limited to grep.
We instead adopted a technique called structured logging, with the help of a library called python-json-logger. In case you’re unfamiliar with structured logging, it’s the practice of emitting log messages with a well-defined, machine-parseable structure. The above messages might be written as structured logs like this:
{"level": "info", "url": "www.sec.gov", "event": "visiting new url"}
{"level": "error", "url": "www.ftc.gov", "event": "crawl failure"}
{"level": "info", "url": "www.sec.gov", "event": "documents found", "document_count": 25}
Notice:
- All the log messages are in JSON.
- Our data and our messages are not intermingled. For example, in the last log, we separate out the url, the event, and the document_count.
These two qualities make structured log messages easy to parse and thus easy to analyze. Using structured logs, we were able to follow the entire lifecycle of a URL in a product called Sumo Logic with a query like:
_sourceHost="crawler" // Give us logs from the crawler
| json field=_raw "message.url" as url // Parse out the relevant fields from the logs
| where url = "www.ftc.gov" // Only retrieve logs for this specific URL
This allowed us to quickly see every major business event that happened for a given URL. These logs equipped us to tackle the scalability problem we were initially called in to resolve. They soon confirmed that the crux of the issue was that Redis was running out of memory. We did a number of things to address this.
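For reference, here is a minimal sketch of how log lines like the ones above might be emitted with python-json-logger. The logger name and fields are illustrative, not the client's actual configuration:

# logging_setup.py - a minimal structured logging sketch using python-json-logger (illustrative)
import logging
from pythonjsonlogger import jsonlogger

logger = logging.getLogger("crawler")
handler = logging.StreamHandler()
# Emit every record as a JSON object instead of a formatted string
handler.setFormatter(jsonlogger.JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Extra fields become top-level keys in the JSON log line
logger.info("visiting new url", extra={"url": "www.sec.gov"})
logger.error("crawl failure", extra={"url": "www.ftc.gov"})
logger.info("documents found", extra={"url": "www.sec.gov", "document_count": 25})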
Vertical Scaling
The simplest solution to this problem is to use a Redis instance with more memory. This client was hosting their Redis cluster on AWS ElastiCache, which made provisioning a larger instance easy. We did this, but it failed to address the crux of their problem.
This client was originally using a 4GB instance, but they were only scraping 60,000 unique URLs a night. A little math highlights the absurdity of this situation. If we conservatively estimate that each URL is 150 characters long, and since URLs are ASCII-encoded, that translates to 150 bytes per URL. Given 60,000 URLs, that amounts to roughly 9MB even in the unrealistic scenario where every URL is enqueued at once. So how on earth were we exhausting the available memory of our Redis instance?
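Here is that back-of-the-envelope estimate as a quick sanity check:

# back_of_the_envelope.py - the queue should only ever need a few megabytes
url_count = 60_000       # unique URLs scraped per night
bytes_per_url = 150      # conservative estimate: 150 ASCII characters per URL

total_mb = url_count * bytes_per_url / 1_000_000
print(f"{total_mb:.1f} MB")  # 9.0 MB, nowhere near 4 GB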
Lightweight Messages
The crawler uses a library called RQ to coordinate all enqueueing and dequeueing with Redis. By default, RQ uses Python's pickle module to serialize and deserialize messages between the Redis queue and the application. Pickle lets you take an arbitrary Python object, translate it into a series of bytes, and then deserialize those bytes back into a Python object on the other side. This functionality is incredibly powerful; it was also the root cause of our client's problem.
The crawler interoperated with Mongo using an ORM called MongoEngine. One ORM model tracked the various crawler settings related to each regulator. The crawler was written in such a way that each time a new link was discovered, it was placed on the queue along with the Mongo object representing the regulator. These Mongo objects were serialized using pickle and placed on the Redis queue. Pickle serialized not just the ORM model itself but every related model as well. The serialized bytes included entire website HTML contents, class/function definitions, and more, resulting in an average message size of over 100KB.
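A toy example illustrates how quickly this adds up. The classes below are hypothetical stand-ins for the client's models, but the effect is the same: pickling a rich object drags its entire object graph along with it, while pickling just an id costs a few dozen bytes:

# message_size.py - comparing pickled message sizes (toy classes, not the client's schema)
import pickle

class CrawlResult:
    def __init__(self, html):
        self.html = html  # the full page HTML comes along for the ride

class Regulator:
    def __init__(self, regulator_id, name, recent_results):
        self.id = regulator_id
        self.name = name
        self.recent_results = recent_results  # related objects get pickled too

regulator = Regulator(
    regulator_id="regulator-123",
    name="SEC",
    recent_results=[CrawlResult("<html>" + "x" * 100_000 + "</html>")],
)

print(len(pickle.dumps(regulator)))     # ~100 KB: the whole object graph
print(len(pickle.dumps(regulator.id)))  # ~30 bytes: just the id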
Our solution to this problem was simple and non-invasive, an important requirement in an unstable system. Every regulator record can be looked up by its id. So that's what we did: instead of serializing the entire regulator object, we placed only its id on the queue, and during deserialization we ran a find query against Mongo to retrieve the full regulator record.
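Here is a minimal sketch of that pattern, assuming a MongoEngine model named Regulator and a hypothetical visit_url task; the real code differs, but the shape is the same:

# enqueue and dequeue by id rather than by object (hypothetical model and task names)
from mongoengine import Document, StringField
from redis import Redis
from rq import Queue

class Regulator(Document):
    name = StringField()
    seed_url = StringField()

urls_queue = Queue("urls", connection=Redis())

def enqueue_url(url, regulator):
    # Before: urls_queue.enqueue(visit_url, url, regulator)  <- pickles the whole object graph
    # After: only the url and a small id string are pickled onto the queue
    urls_queue.enqueue(visit_url, url, str(regulator.id))

def visit_url(url, regulator_id):
    # Rehydrate the full record with a single find query on the worker side
    regulator = Regulator.objects.get(id=regulator_id)
    ...  # scrape the page using the regulator's settings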
URL Deduplication
Even with these changes, Redis was still running out of memory. The next issue is probably the simplest but also the most significant.
The way the crawl worked, as mentioned above, is that every time a new URL was discovered, it was added to the Redis queue. There was some code in place to blacklist specific common domains (like Google, Twitter, and Facebook), and some other code ensured that we only recursed a limited number of links away from the seed URL. But other than that, the recursive process had no way to ensure it would actually end: in other words, it had no base case.
The behavior we observed in the logs is that we would crawl the same URLs over and over again. Certain links, like those in headers and footers, were getting crawled hundreds of times. And most importantly, they all linked to themselves, which ensured that the crawl would run forever. Our solution to this problem was straightforward.
We needed a centralized set that every crawler could add URLs to. We did not want to introduce additional infrastructure into this project, to avoid adding complexity, which made our choice pretty clear: we would use Redis's set data structure to maintain the set of unique URLs that had previously been added to the queue.
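A minimal sketch of that dedup check, using redis-py's set commands (the key name and function names are illustrative):

# dedup.py - deduplicating URLs with a Redis set (illustrative key and function names)
from redis import Redis
from rq import Queue

from crawler.tasks import visit_url  # hypothetical task from the earlier sketch

redis_conn = Redis()
urls_queue = Queue("urls", connection=redis_conn)

def enqueue_if_new(url, regulator_id):
    # SADD returns 1 if the member was newly added, 0 if it was already in the set,
    # so the membership check and the insert happen in a single atomic operation
    if redis_conn.sadd("crawled_urls", url):
        urls_queue.enqueue(visit_url, url, regulator_id)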
Additional Recommendations
At Spantree, we spend quite a bit of time writing web crawlers. Based on some of the things we’ve discussed, I thought I’d offer some additional thoughts on how to improve this architecture without making drastic changes:
- Judicious Selenium Usage - Selenium should only be used for sites whose HTML is generated with JavaScript. If this crawler knew which sites needed JavaScript rendering and which did not, a simple HTTP request could be made for sites with static HTML (see the sketch after this list). This would drastically reduce the cost, runtime, and flakiness (Selenium throws a lot of errors) of the crawl. You might remember that this system uses dozens of instances to crawl tens of thousands of pages; this is one of the main reasons why.
- Whitelist Domains - I mentioned that this crawler has a blacklist of domains it does not visit, in an effort to avoid crawling unnecessary pages. A more effective technique is to whitelist the set of domains your crawler should visit. That way you will never crawl URLs you aren't supposed to.
- Async or Multiprocessing Support - Each machine crawls only one URL at a time, spending much of its time idle waiting on network requests. Ideally this code would be written asynchronously or with threading, and then multi-processed, so that each machine could work on many URLs at once.
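As a sketch of the first recommendation, here is roughly what choosing between a plain HTTP request and Selenium might look like. The requires_javascript flag is an assumption about where such a setting could live, not something the client's code had:

# fetch.py - only fall back to Selenium when a site actually needs JavaScript rendering
import requests
from selenium import webdriver

def fetch_page(url, requires_javascript=False):
    if not requires_javascript:
        # Static HTML: a plain HTTP request is cheaper, faster, and far less flaky
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.text

    # JavaScript-rendered HTML: pay the cost of a real browser only when necessary
    options = webdriver.FirefoxOptions()
    options.add_argument("--headless")
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()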
And lastly - and most importantly - in this day and age, you should probably never be writing a crawler to begin with. Today, frameworks like Scrapy exist to handle the nuances and challenges of writing custom crawlers.