How to Compress Files Without Filesystem Support

BY Luca Di Stefano

Zipping files is essential to save space and memory when working on a platform as big and complex as BOOM, which handles thousands of files every day and therefore must optimize space and resources to provide an efficient experience. However, compressing files is a tough task, especially when you don’t know in advance how much space you will need. Working in a cloud environment also brings new challenges when handling data: a file system is not included by default in every cloud environment, and adding one affects the price of the services we’re using.

As developers, it’s our responsibility to get things done efficiently while being mindful of costs, and luckily there are ways to get around the file system!

I recently found myself having to translate an old procedure from Python to Java (Kotlin, actually). Its purpose was to download pictures from a third-party supplier and save them in our S3 bucket, zipping them together once the process was done.

The old procedure simply wasn’t efficient. First of all, it required a temporary filesystem to store the images and the zip file. Secondly, since we deal with a lot of data at once (each photoshoot involves tens of gigabytes), we had to make sure that the container doing the job had enough space and memory to work successfully at any given moment.

Since we could move this procedure into a single microservice that took care of the whole picture editing process, there surely had to be a better way to handle this use case without relying on a filesystem and while using as little memory as possible. I then embarked on a journey of trial and error, which proved to be a valuable learning experience.

The first instinctive approach that came to mind was using Java’s streams, of course. This would surely have reduced the impact on the JVM and gotten rid of our data persistence needs. 

In a way, it made sense. The first part - downloading pictures as a data stream and uploading them to S3 - worked perfectly thanks to the S3 SDK and a bit of abstraction.

The code to upload something with the SDK looks like this:

Now, the S3 SDK documentation states the following:

“Content length must be specified before data can be uploaded to Amazon S3. If the caller doesn't provide it, the library will make the best effort to compute the content length by buffering the contents of the input stream into the memory because Amazon S3 explicitly requires that the content length be sent in the request headers before any of the data is sent.”

As such, it’s quite apparent that the “metadata” object needs to have the content length specified. In this first stage that was not an issue at all: an HttpOpener class provided both the data stream and its length, which could then be passed on to the S3 API:
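
As a minimal sketch of this first stage, using only the JDK: `SizedStream`, `ObjectStore`, `CountingStore` and `Transfer` are hypothetical names introduced for illustration, standing in for whatever HttpOpener returns and for the S3 client. With the AWS SDK for Java v1, the corresponding pieces would presumably be `ObjectMetadata.setContentLength` and the `putObject` overload that takes an `InputStream`.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical pairing of a download stream with the length reported by its source. */
final class SizedStream {
    final InputStream data;
    final long length;

    SizedStream(InputStream data, long length) {
        this.data = data;
        this.length = length;
    }
}

/** Hypothetical stand-in for the S3 client: the length is declared before any data is read. */
interface ObjectStore {
    void put(String key, InputStream data, long contentLength) throws IOException;
}

/** Test double that consumes the stream in small chunks and records how many bytes arrived. */
final class CountingStore implements ObjectStore {
    final Map<String, Long> sizes = new HashMap<>();

    @Override
    public void put(String key, InputStream data, long contentLength) throws IOException {
        byte[] chunk = new byte[8192];
        long total = 0;
        int read;
        while ((read = data.read(chunk)) != -1) {
            total += read; // a real client would send this chunk over the network
        }
        if (total != contentLength) {
            throw new IOException("Declared " + contentLength + " bytes but read " + total);
        }
        sizes.put(key, total);
    }
}

final class Transfer {
    /** Streams the source into the store, passing the known length along with the data. */
    static void copyToStore(SizedStream source, ObjectStore store, String key) throws IOException {
        try (InputStream in = source.data) {
            store.put(key, in, source.length);
        }
    }
}
```

Because the length travels with the stream, the store never has to buffer more than one 8 KB chunk at a time, whatever the size of the picture.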

This proved to work well: during the download/upload phase, the memory usage was constant.

The real issue came when trying to zip everything back to S3. In theory, the working principle went as follows: 

  1. Get the image from S3 (because I don’t trust the image to be still available at the old location)

  2. Feed that data stream into a zip stream

  3. Feed that zip stream back into S3

In practice, there were two major drawbacks / learning moments:

  1. First of all, as noted before, if we don’t pass the final size of the object to the S3 API through the putObject method, the method effectively processes the input stream twice: once to calculate the actual size, and once to send the data.
    This proved devastating for the JVM, because to calculate the size the stream needs to be materialized into a byte array in memory. In other words, the allocated memory had to be as big as the zip we were going to handle, so this solution couldn’t work.
    The real problem, in this case, is that we don’t know the size of the final zip beforehand.

  2. Secondly, and just as important, the “getObject” method of the S3 SDK - used to retrieve our images from S3 - keeps an HTTP connection open for each file we want to process. Getting all files beforehand therefore wouldn’t work, as the HTTP connection pool would quickly be exhausted and exceptions would be thrown all over the place.

The second point was quickly solved by using the getObject function lazily (so only when we had to process that particular file), and manually closing the resource afterwards.
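
The lazy approach can be sketched as follows. `ObjectSource` and `LazyFetch` are hypothetical names, with `ObjectSource.open` standing in for the SDK's getObject call; the point is only that each stream is opened right before it is needed and closed before the next key is touched.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

/** Hypothetical stand-in for the S3 "get object" call; each open stream holds one connection. */
interface ObjectSource {
    InputStream open(String key) throws IOException;
}

final class LazyFetch {
    /** Copies every object to the target, keeping at most one source stream open at a time. */
    static long copyAll(List<String> keys, ObjectSource source, OutputStream target)
            throws IOException {
        long total = 0;
        for (String key : keys) {
            // try-with-resources releases the connection even if the copy fails
            try (InputStream in = source.open(key)) {
                total += in.transferTo(target); // InputStream.transferTo needs Java 9+
            }
        }
        return total;
    }
}
```

With this shape the number of simultaneously open connections stays at one per worker, regardless of how many keys are in the list.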

The first problem was solved by exploiting the multipart upload and Java’s utilities PipedInputStream and PipedOutputStream in the following way:

First of all, let’s summarize the elements that we need to get the job done:

Here we have our main zip stream (zip), which feeds its contents to the main output stream (outputStream), which in turn pipes its contents to the final input stream (inputStream) that will be used by the upload thread.
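
A minimal sketch of how these three streams could be wired together with the JDK classes. The field names follow the ones used in the text; the `ZipPipe` holder class and the configurable pipe buffer size are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.zip.ZipOutputStream;

/** Hypothetical holder for the three connected streams. */
final class ZipPipe {
    final PipedInputStream inputStream;   // read by the upload thread
    final PipedOutputStream outputStream; // receives the compressed bytes
    final ZipOutputStream zip;            // written to by the zipping thread

    ZipPipe(int pipeBufferBytes) throws IOException {
        // The pipe buffer is the only place where compressed data waits in memory.
        inputStream = new PipedInputStream(pipeBufferBytes);
        outputStream = new PipedOutputStream(inputStream);
        zip = new ZipOutputStream(outputStream);
    }
}
```

Closing `zip` also closes `outputStream`, which is how the reader of `inputStream` learns that the archive is complete.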

The zipping thread is spawned by means of a CompletableFuture to simplify things, and manages the data that goes inside the zip stream:

As mentioned before, for each key in our list, we lazily get the object and put the data from the entry to the zip stream.
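
A sketch of what such a zipping task might look like. `ImageSource` and `ZipTask` are hypothetical names, with `ImageSource.open` standing in for the lazy getObject call; the sketch uses `CompletableFuture.runAsync` on the default executor, which may differ from how the production code spawns its thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical stand-in for a lazy S3 "get object" call. */
interface ImageSource {
    InputStream open(String key) throws IOException;
}

final class ZipTask {
    /** Starts zipping on another thread; the future completes once the zip stream is closed. */
    static CompletableFuture<Void> start(List<String> keys, ImageSource source, ZipOutputStream zip) {
        return CompletableFuture.runAsync(() -> {
            // Closing the zip writes the central directory and closes the underlying stream,
            // which signals end-of-stream to whoever reads the other side of the pipe.
            try (ZipOutputStream out = zip) {
                for (String key : keys) {
                    out.putNextEntry(new ZipEntry(key));
                    try (InputStream in = source.open(key)) { // fetched only when needed
                        in.transferTo(out);
                    }
                    out.closeEntry();
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }
}
```

When the zip stream wraps a piped output stream, the returned future should be joined only after the reading side has started, otherwise the writer blocks as soon as the pipe buffer fills up.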

And finally, the upload thread is spawned with a CompletableFuture as well, and takes care of using the data from the input stream, after initiating a multipart upload request towards S3.

This is a bit more complex yet pretty straightforward: 

  • At the beginning of the process we initiate the multipart upload session

  • We use our “inStream” as resource (the piped input stream with data flowing from the zip)

  • We calculate a rough estimate of the size of the final zip. In this particular case we don’t need absolute precision, as we just need to know how many parts we are going to send, based on the size of each part, in our case 20 MB

  • Once we have the total size, we calculate the number of parts that we’re going to send

  • For each part, we read that many bytes from the input stream and send them as an S3 part, collecting the resulting eTags

  • Once we finish, we use the list of eTags to tell S3 to build up the zip and terminate the multipart upload.

  • Any error in the process will abort the multipart upload
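
The steps above can be sketched with plain JDK code. `MultipartTarget` is a hypothetical interface standing in for the S3 multipart calls (initiate, upload part, complete, abort), and `MultipartUploader` is an illustrative name. Note one deliberate difference from the description: `partCount` shows the arithmetic for the estimated number of parts, but the upload loop itself simply reads until the stream ends instead of relying on that estimate.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the S3 multipart upload calls. */
interface MultipartTarget {
    String initiate(String key) throws IOException;
    String uploadPart(String uploadId, int partNumber, byte[] data, int length) throws IOException;
    void complete(String uploadId, List<String> eTags) throws IOException;
    void abort(String uploadId);
}

final class MultipartUploader {
    /** Number of parts needed for an estimated size (ceiling division). */
    static long partCount(long estimatedSize, long partSize) {
        return (estimatedSize + partSize - 1) / partSize;
    }

    /** Reads the stream in partSize chunks, uploads each one, and returns the collected eTags. */
    static List<String> upload(InputStream in, MultipartTarget target, String key, int partSize)
            throws IOException {
        String uploadId = target.initiate(key);
        try {
            List<String> eTags = new ArrayList<>();
            byte[] buffer = new byte[partSize]; // the only large allocation, reused for every part
            int partNumber = 1;
            int filled;
            // readNBytes (Java 9+) blocks until the buffer is full or the stream ends
            while ((filled = in.readNBytes(buffer, 0, partSize)) > 0) {
                eTags.add(target.uploadPart(uploadId, partNumber++, buffer, filled));
            }
            target.complete(uploadId, eTags);
            return eTags;
        } catch (IOException | RuntimeException e) {
            target.abort(uploadId); // any failure cancels the multipart session
            throw e;
        }
    }
}
```

Memory use is bounded by one part buffer, so with 20 MB parts a single upload holds roughly 20 MB of zip data at any moment, independent of the final archive size.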

The main question you might ask at this point is: why use two different threads? The answer is simple: to use the piped streams efficiently we need data to be constantly “pumped” in and out of the stream, so that memory consumption remains low.
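
To illustrate the point, here is a small sketch (the class name `PipePump` and its parameters are hypothetical) in which one thread writes far more data than the pipe buffer can hold while another thread drains it. With a single thread, the writer would block forever as soon as the buffer filled up.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.CompletableFuture;

final class PipePump {
    /** Pushes totalBytes through a pipe holding only pipeBufferBytes; returns the bytes received. */
    static long pump(long totalBytes, int pipeBufferBytes) throws IOException {
        PipedInputStream in = new PipedInputStream(pipeBufferBytes);
        PipedOutputStream out = new PipedOutputStream(in);

        // Producer: blocks whenever the pipe buffer is full, until the consumer makes room.
        CompletableFuture<Void> producer = CompletableFuture.runAsync(() -> {
            byte[] block = new byte[512];
            try (out) {
                for (long sent = 0; sent < totalBytes; sent += block.length) {
                    out.write(block, 0, (int) Math.min(block.length, totalBytes - sent));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });

        // Consumer: runs on the calling thread and drains the pipe until the producer closes it.
        long received = 0;
        byte[] chunk = new byte[512];
        int read;
        try (in) {
            while ((read = in.read(chunk)) != -1) {
                received += read;
            }
        }
        producer.join();
        return received;
    }
}
```

For example, `PipePump.pump(1_000_000L, 1024)` moves about 1 MB through a 1 KB buffer, which is the same mechanism that keeps memory flat while the zip flows from the zipping thread to the upload thread.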

Tests were run by spawning many threads executing the upload function, with a total image size of roughly 40 GB for each run, both on a development machine and in a Docker container with just 1 GB of memory available. No out-of-memory issues occurred.

The following is the memory consumption of the service in our production environment while it was zipping a ~40 GB file, in parallel with other smaller compression jobs:

As you can see, the memory usage during the whole process (which started at 8:30 and ended at 9:30) was constant and had no peaks. The application runs on the Spring Boot framework in a Docker container, and the baseline memory occupied by the server is in the 600 MB range when idle.
