Btune: Making Compression Better
There is no one-size-fits-all solution for data compression. Depending on the type of data (e.g. numeric, text), its structure, intended use, acquisition context, and other features, the optimal combination of compression parameters will differ substantially, and choosing a suboptimal combination could render an application unusable. Worse, owing to phenomena such as concept drift, which are ubiquitous in data science, the optimal compression parameters may change over time.
All this causes quite a headache for data handlers, since searching the parameter space is time-consuming. There is a clear need to automate the process, and also establish quantifiable measures to determine what constitutes the best compression in a given context. Fortunately, ironArray has developed a powerful tool to respond to this need: Btune.
What Is Btune?
Btune is a Blosc2 plugin that uses machine- and deep-learning techniques to find the optimal compression parameters for your datasets. We offer three tiers of the tool to our clients:
- Btune Community: Free for personal use. Uses a genetic algorithm to test parameter combinations and find the best settings for your dataset.
- Btune Models: A commercial license for workgroups: using your sample data, we train and deploy a neural network model optimized for it. Best for workgroups with limited data variety.
- Btune Studio: A commercial license that includes our training software, giving you full control to create your own models on-site for unlimited datasets. Best for organizations that need to optimize a wide range of data.
Why Btune?
The main trade-off axis for data compression is between compression ratio and speed. For example, high-speed data acquisition prioritizes fast compression, while frequently accessed datasets benefit from fast decompression. Btune helps you optimize for what matters most.
The following figures illustrate these trade-offs for different codecs and filters using chunks of weather data:
And here, the different codecs and filters are compared in terms of compression ratio:
With Btune, you can find the optimal combination of compression parameters (in the Pareto sense) for your datasets, allowing you to achieve the best possible compression ratio and speed for your specific needs.
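The ratio-versus-speed trade-off that Btune navigates can be sketched with a toy parameter sweep. The snippet below uses the standard-library zlib codec purely as an illustration (Btune itself searches the much larger Blosc2 space of codecs, filters, and levels, per chunk):

```python
# Illustrative only: sweep a tiny parameter space (zlib levels) to show
# the compression-ratio vs. speed trade-off that Btune automates for
# the much larger Blosc2 codec/filter space.
import time
import zlib

# Repetitive "weather-like" bytes stand in for a real data chunk.
data = b"temperature,humidity,pressure\n" * 20000

results = []
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    results.append((level, len(data) / len(compressed), elapsed))

for level, ratio, elapsed in results:
    print(f"level={level}  ratio={ratio:.1f}x  time={elapsed * 1000:.2f} ms")
```

Higher levels generally compress better but take longer; which point on that curve is "best" depends on your workload, which is exactly the judgment Btune automates.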
How To Use
Ready to optimize your compression? Getting started with Btune is simple. Install the plugin directly from PyPI:
pip install blosc2-btune
This single plugin supports both Btune Community and Btune Models. For detailed instructions, check out the Btune README, or contact us. To use Btune Studio, you will need additional software for on-site model training; please contact us to get set up.
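In practice, the plugin is driven by environment variables that Blosc2 reads when Btune is active. The names below follow the blosc2-btune README; treat the exact values as illustrative assumptions, and see the README for the full list:

```python
# Configure Btune through environment variables before compressing.
# (Variable names per the blosc2-btune README; values are examples.)
import os

# 0.0 favors speed, 1.0 favors compression ratio.
os.environ["BTUNE_TRADEOFF"] = "0.5"
# Optimize for compression (COMP), decompression (DECOMP), or BALANCED.
os.environ["BTUNE_PERF_MODE"] = "DECOMP"
# For Btune Models: directory containing the trained model files.
os.environ["BTUNE_MODELS_DIR"] = "./models/"

# With the variables set, importing blosc2_btune registers the tuner,
# and subsequent Blosc2 compression calls pick parameters automatically:
#   import blosc2
#   import blosc2_btune
#   arr = blosc2.asarray(my_numpy_array, urlpath="data.b2nd")
```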
Currently, the Btune plugin is available for Linux and macOS on Intel architectures, with support for more platforms coming soon.
Explore our hands-on tutorials to see Btune in action.
NB: to complete the Studio tutorial, you will need to contact us to obtain additional software for model training.
What's in a Model?
Neural networks have proven to be very effective parametrized models for learning from a wide variety of data. During "training", the neural network is repeatedly fed examples of paired inputs/outputs (e.g. compression parameters and data features/compression ratio) and adjusts automatically to better predict outputs from inputs in succeeding training steps. Once trained, it can quickly predict outputs on new, unseen input data.
The product of this training process is a "model", saved as a set of small files (in JSON and TensorFlow formats). You can place these files anywhere on your system for Btune to use. Btune leverages the model to instantly predict the best compression parameters for each chunk of your data, based on its characteristics. This rapid prediction is ideal for optimizing compression while handling large data streams.
A Starry Example
The figures below illustrate Btune's optimization for decompression speed on a 7.3 TB subset of the Gaia dataset. The first image shows the predicted optimal codec and filter combinations for this task, as a function of the desired trade-off between decompression speed and degree of compression.
The following image plots the I/O speed (in GB/s) achieved when accessing multiple multidimensional slices of the Gaia dataset along different axes when using these combinations (higher values are better).
The results show that the codec/compression level combinations BloscLZ (level 5) and Zstd (level 9) are fastest. Since their performance is not heavily dependent on the number of threads, they perform well even on machines with fewer CPU cores.
Finally, the last figure compares the resulting file sizes (in GB); lower values are better.
In this case, the trained model recommends Zstd (level 9) for a good balance between file size and decompression speed. While adding the BitShuffle filter achieves the highest compression ratio, it is not recommended for general use.
For more details, see our paper for SciPy 2023 (slides). The data and scripts are also available on GitHub.
Pricing
Visit our pricing page for more information on the different licensing options available for Btune.
Testimonials
Blosc2 and Btune are fantastic tools that allow us to efficiently compress and load large volumes of data for the development of AI algorithms for clinical applications. In particular, the new NDarray structure became immensely useful when dealing with large spectral video sequences.
-- Leonardo Ayala, Div. Intelligent Medical Systems, German Cancer Research Center (DKFZ)
Btune is a simple and highly effective tool. We tried this out with @LEAPSinitiative data and found some super useful spots in the parameter space of Blosc2 compression arguments! Awesome work, @Blosc and @ironArray teams!
-- Peter Steinbach, Helmholtz AI Consultants Team Lead for Matter Research @HZDR_Dresden
Contact
If you are interested in Btune and have any further questions, please contact us at contact@ironarray.io.