Homework 3. Multithreaded gzip compression filter Background
You’re working for a web server house that regularly needs to generate some fairly large data streams and compress them using gzip. The compression part of this is turning into a CPU bottleneck and the users are starting to complain. Your company can easily afford to throw hardware at the problem, so your boss asks you whether you can reprogram your servers to do the compression in parallel, taking advantage of the fact that your servers are multiprocessor.
You look into the matter, and discover that there’s a C implementation of a program called pigz that does something along the lines that you want. (For convenience, a copy of a recent version of this program is available on SEASnet in the /usr/local/cs/src/pigz directory, and an executable is installed in /usr/local/cs/bin/pigz.) The pigz program can be used as a filter that reads programs from standard input and writes to standard output. Unfortunately, it has a problem: it’s written in C. Your company has standardized on Java, so that it can use just one version of each executable and run it on a wide variety of servers that you have: some are x86-64, some are ARM, and some use secret RISC-V-based hardware whose full nature isn’t disclosed even to your group.
You tell this to your boss, who responds, “OK, so do what pigz is doing, but do it in Java.” Her suggestion is to use standard Java classes and see how well your substitute does, compared to standard pigz. She also wants you to compare OpenJDK’s default compiler to the native-image command of GraalVM, to see whether GraalVM can give you better performance.
The gzip format lets you partition an input stream, compress each partition separately, and concatenate the compressed versions of each partition; the resulting compressed stream can be decompressed by pigz, by gzip, or by any other gzip-format-understanding program. Unfortunately, this approach doesn’t work for your application, because the data are generated dynamically in in the form of a stream, and you want the data to be compressed and delivered to its destination on the fly. You do not want to generate all the data into a huge file, partition the file, compress each partition separately, and send the concatenation of the compressed partitions.
What you want instead, is to do what pigz does: divide the input into fixed-size blocks (with block size equal to 128 KiB), and have P threads that are each busily compressing a block. That is, pigz starts by reading P blocks and starting a compression thread on each block. It then waits for the first thread to finish, outputs its result, and then can reuse that thread to compress the (P+1)st block.
For better compression, pigz does something else. Instead of compressing each block independently, it uses the last 32 KiB of the previous block to prime the compression dictionary for the next block. That way, each block other than the first is compressed better, in the typical case. You want to do that too.
You search around the net some more to see whether someone has done this, and find that there’s a package by Cédrik Lime called MessAdmin that has a similar feature in MessAdmin-Core. It has a lot of code that you don’t need, though, and doesn’t have a simple standalone application to try out. You’d like a stripped down version that just does pigz-style multithreaded compression, so that you compare the two applications’ performances.
Assignment
Write a Java program called Pigzj that behaves like the C pigz implementation, in the sense that it operates with multiple compression threads to improve wall-clock performance. Each compression thread acts on an input data block of size 128 KiB. Each thread uses as its dictionary the last 32 KiB of the previous input data block. Compressed output blocks are generated in the same order that their uncompressed blocks were input. The number of compression threads defaults to the number of available processors, but this can be overridden. Your program may also use a small, fixed number of threads to control the compression threads or to do input/output.
Your implementation can be a simplification of pigz, in the following ways:
Pigzj need not decompress; you have to implement only the compression part.
Pigzj needs to support only the -p processes options of pigz. The latter option must be spelled with the space between the option and the value: for example, -p3 (without a space) need not be recognized. Pigzj can report an error if any other option syntax is used. Pigzj need not and should not support attempts to use more than four times the number of available processors.
Pigzj always reads from standard input and writes to standard output. It can report an error if you specify a file name.
Pigzj’s behavior is not specified if the input or the output is a terminal. For example, it can unconditionally read standard input and write to standard output without checking whether these streams are connected to a terminal.
When an error occurs, Pigzj need not issue exactly the same diagnostic message as pigz, so long as it detects and reports the error to standard error, and exits with nonzero status.
Pigzj should behave like pigz in the following respects:
If you decompress the output with gzip or with pigz, you get a copy of the input.
The output is compressed, about as well as pigz would compress it.
The output follows the GZIP file format standard, Internet RFC 1952.
The output contains just a single member, that is, it does not contain the concatenation of two or more members. For a definition of “member” please see RFC 1952 §2.3. If you have trouble implementing this, then for partial credit you can generate output with multiple members. Pigzj runs faster than gzip, when the number of processors is greater than 1. It is competitive in speed with pigz.
The default value for processes is the number of available processors; see the Java standard library’s availableProcessors method.
Read errors and write errors are detected. For example, the command “pigz /dev/full” reports a write error and exits with nonzero exit status, and the same should be true for “java Pigzj /dev/full”. The input and output need not be a regular file; they may be pipes. For example, the command “cat /etc/passwd | java Pigzj | cat” should output the same thing as the command “java Pigzj gzip.gz
time pigz <$input >pigz.gz
time java Pigzj <$input >Pigzj.gz
time ./pigzj <$input >pigzj.gz
ls -l gzip.gz pigz.gz Pigzj.gz pigzj.gz
# This checks Pigzj’s and pigzj’s output.
gzip -d