AWS Lambda For Great Victory

Posted by Jason Michael

Lambda is a way to run short code snippets in response to events, without the need to worry about infrastructure.

The Use Case

This past Tuesday, one of our PMs dropped by the engineering pit to discuss a problem he was having. He explained that, as part of some compliance work, he needed a workflow to convert incoming HTML documents into PDFs. This needed to happen not just once, but on an ongoing basis. We didn’t have any natural home for such an endpoint in any of our services, and building a new one just for this workflow seemed like overkill. What to do?

Enter Lambda

I had been experimenting with Amazon’s Lambda service, available as part of AWS. Lambda provides a runtime environment for executing small, stateless code snippets (called “Lambda functions”, which irks me to no end…) in response to certain events within the AWS environment. There is no “backend” to worry about, no EC2 instances to provision, deploy, or manage. You simply upload your Lambda function, map it to a set of events, and AWS will handle the execution, monitoring, etc. It seemed like the perfect fit for our use case!

Our initial strategy seemed pretty simple: HTML would be uploaded to a specific S3 bucket. This would trigger an event, causing our Lambda function to execute. The function would pull the HTML down from S3 and pipe it through an HTML-to-PDF binary (we are using the html-pdf module for Node.js, which itself calls into the phantomjs binary). The resulting PDF document would then be uploaded to a separate, target S3 bucket. Piece of cake, right?
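To make that flow concrete, here is a rough sketch of how the pieces could fit together. This is not our production function: the names describeJob and makeHandler, the bucket names, and the injected fetchHtml/renderPdf/storePdf helpers are all made up for illustration, and the async handler style assumes a current Node.js runtime. The only part taken directly from AWS is the shape of the S3 event record (Records[0].s3.bucket.name and Records[0].s3.object.key).

```javascript
// Sketch only: describeJob, makeHandler and the deps object are hypothetical
// names for illustration, not part of any AWS or html-pdf API.

// Read the source bucket/key from an S3 "object created" event and decide
// where the resulting PDF should be written.
function describeJob(event, targetBucket) {
  const record = event.Records[0];
  // S3 URL-encodes object keys in event payloads and uses "+" for spaces.
  const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
  return {
    sourceBucket: record.s3.bucket.name,
    sourceKey: key,
    targetBucket: targetBucket,
    // Swap a trailing .html/.htm extension for .pdf.
    targetKey: key.replace(/\.html?$/i, '') + '.pdf',
  };
}

// Wire the three steps together. The helpers are injected so the flow can be
// exercised without AWS; in Lambda they would wrap an S3 download, the
// html-pdf conversion, and an S3 upload respectively.
function makeHandler(deps, targetBucket) {
  return async function handler(event) {
    const job = describeJob(event, targetBucket);
    const html = await deps.fetchHtml(job.sourceBucket, job.sourceKey);
    const pdf = await deps.renderPdf(html);
    await deps.storePdf(job.targetBucket, job.targetKey, pdf);
    return job;
  };
}
```

Keeping the S3 and PDF calls behind injected helpers is just a design choice for the sketch; it lets you run the handler locally with stubs before uploading anything.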

There were a few wrinkles we had to deal with when working with Lambda. First of all, each Lambda function is built and deployed as a zip archive, and the max size of this archive, including modules/dependencies and any binaries, is 20MB. The phantomjs binary clocks in at just over 36MB, so shipping it along with our code obviously wouldn’t work. We worked around this problem by uploading the binary to its own S3 bucket. The first step in the Lambda function is then to pull the binary down and place it in /tmp (Lambda only gives you access to /tmp on the local filesystem). This takes maybe 3-4 seconds, so it’s not a big deal. The second wrinkle was getting html-pdf to use the binary in such a non-standard location. At first, this amounted to replacing the binary in my zipfile with a symlink to /tmp/phantomjs, and preserving symlinks when building the zip. I found out later that you can pass the binary path into the call to html-pdf in your Node.js code, which is much cleaner.
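Here is roughly what the "pull the binary into /tmp, then point html-pdf at it" step might look like. Treat it as a sketch under a few assumptions: ensureBinary and pdfOptions are names I made up, the download function is injected (in Lambda it would wrap an S3 download of the phantomjs binary), and phantomPath is the html-pdf option for the binary location as I understand its README. Skipping the download when the file already exists is an addition of mine, on the assumption that /tmp can survive between invocations of a warm function.

```javascript
const fs = require('fs');

// Hypothetical helper: make sure the binary exists at localPath.
// `download` is an injected function returning a Promise of a Buffer.
async function ensureBinary(localPath, download) {
  // Assumption: /tmp may still hold the file from an earlier invocation,
  // in which case the 3-4 second download can be skipped.
  if (!fs.existsSync(localPath)) {
    const bytes = await download();
    // The file has to be executable for html-pdf to launch it.
    fs.writeFileSync(localPath, bytes, { mode: 0o755 });
  }
  return localPath;
}

// Hypothetical helper: build the options passed to html-pdf so it uses the
// downloaded binary instead of the one it would normally find in node_modules.
function pdfOptions(binaryPath) {
  return { phantomPath: binaryPath };
}

// Inside the Lambda function this would be used along these lines
// (requires the html-pdf module, so it is left as a comment here):
//   const binary = await ensureBinary('/tmp/phantomjs', downloadFromS3);
//   pdf.create(html, pdfOptions(binary)).toBuffer(callback);
```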

Small Victories!

Within a single business day, I had the workflow up and running, and was able to demo it for our PM. He was excited to get such a fast turnaround, with no fussing about with infrastructure of any kind. And while I’m hard-pressed to jump up and down over 20-30 lines of code, I am excited to find other uses for Lambda within the organization. For example, we are starting to build a data warehouse, and one thing I’m interested in trying out is decomposing our ETL/data movement workflows into a series of small, stateless Lambda functions. These would operate on data change events streaming in from our production systems, freeing us from updating the warehouse in a batch manner. If we can make it work, we’d have the Holy Grail in warehousing: a rich, integrated dataset that is updated in near real-time! I’m also looking forward to seeing which runtimes besides Node.js Amazon adds support for. If they can provide an environment with a JVM, that opens the door for any number of fun and useful JVM-based languages, not to mention plain old Java.

Sound intriguing? Come join us!