
5 posts tagged with "compression"


Bringing Blosc2 to Heel

· 7 min read
Luke Shaw
Product Manager at ironArray SLU

There are many array libraries in the scientific and data ecosystem that provide native array types (NumPy, PyTorch, Zarr, h5py, PyTables, JAX, Blosc2), and an even larger set that "consume" these array types (scikit-learn, Parcels, Dask, Pillow, scikit-image). Moreover, the division between the two groups is not clear-cut: PyTorch tensors, for example, interoperate closely with NumPy arrays, and thus straddle the boundary between array provider and consumer.

Such a high degree of interdependency makes it crucial that array objects be portable between libraries. This requires not only that the array objects themselves be standardised, but also that each library expose a minimal set of functions, with the same names and signatures across the ecosystem, that know how to ingest, produce and process those arrays. The ideal would be to be able to write code that works with arrays

import array_lib as xp
#
# Do array things with library
#

and then simply swap in any conforming array library as array_lib and have the code run unchanged.

From this set of concerns has sprung an open-source effort to develop the array API standard, along with an extensive associated test suite, to drive the array ecosystem towards this holy grail of interoperability.
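The portability idea above can be made concrete with a small sketch. Here NumPy (whose main namespace is largely array-API compatible as of NumPy 2.0) stands in for `array_lib`; the function body uses only array-API-style calls, so in principle another conforming library could be substituted at the import line. The `standardize` function is a hypothetical example, not part of any library:

```python
import numpy as xp  # swap in any array-API-compatible library here


def standardize(x):
    # Uses only array-API-style functions, so the code stays
    # agnostic to which library actually provides the arrays.
    return (x - xp.mean(x)) / xp.std(x)


data = xp.asarray([1.0, 2.0, 3.0, 4.0])
z = standardize(data)  # zero mean, unit standard deviation
```

The point is that `standardize` never mentions a concrete library: it only relies on names and signatures the standard guarantees.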

Blosc2 and the array API

Blosc2 has been developed with the array API in mind from an early stage, but it is only now that ironArray has been able to dedicate development time to integration efforts. While the standard and test suite are still evolving (the latest version was released in December 2024), it is sufficiently stable to form the basis for ironArray's work.

Compress Better, Compute Bigger

· 10 min read
Francesc Alted
CEO ironArray SLU

Have you ever experienced the frustration of not being able to analyze a dataset because it's too large to fit in memory? Or perhaps you've encountered the memory wall, where computation is hindered by slow memory access? These are common challenges in data science and high-performance computing. The developers of Blosc and Blosc2 have consistently focused on achieving compression and decompression speeds that approach or even exceed memory bandwidth limits.

Moreover, with the introduction of a new compute engine in Blosc2 3.0, the guiding principle has evolved to "Compress Better, Compute Bigger." This enhancement enables computations on datasets more than 100 times larger than the available RAM, all while maintaining high performance. Read on to learn how to operate on datasets of 8 TB in human timeframes, using your own hardware.
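The essence of computing on data larger than RAM is streaming: only a small window of the dataset is resident (and, in Blosc2's case, decompressed) at any moment. A minimal, Blosc2-independent sketch of this chunk-wise pattern, using plain NumPy:

```python
import numpy as np


def chunked_sum(arr, chunk_len=1_000):
    # Stream over the array one chunk at a time, so peak working
    # memory is proportional to chunk_len, not to the full array.
    total = 0.0
    for start in range(0, arr.shape[0], chunk_len):
        total += float(np.sum(arr[start:start + chunk_len]))
    return total


a = np.arange(10_000, dtype=np.float64)
assert chunked_sum(a) == float(a.sum())
```

Blosc2's compute engine applies the same principle to compressed chunks, which is what lets the working set stay far smaller than the logical dataset.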

The Importance of Better Compression

Data compression typically requires a trade-off between speed and compression ratio. Blosc2 allows users to fine-tune this balance. They can select from a variety of codecs and filters to maximize compression, and even introduce custom ones via its plugin system. For optimal speed, it's crucial to understand and utilize modern CPU capabilities. Multicore processing, SIMD, and cache hierarchies can significantly boost compression performance. Blosc2 leverages these features to achieve speeds close to memory bandwidth limits, and sometimes even surpassing them, particularly with contemporary CPUs.
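The speed-versus-ratio trade-off described above can be observed with nothing more than the standard library's zlib, which exposes a compression level much like the `clevel` knob in Blosc2's compression parameters (this sketch uses zlib purely for illustration, not Blosc2's own codecs):

```python
import zlib

# A highly repetitive payload compresses well at any level.
payload = b"compression trades speed for ratio " * 2_000

fast = zlib.compress(payload, level=1)   # fastest, usually larger output
small = zlib.compress(payload, level=9)  # slowest, usually smallest output

# Higher effort should never produce a larger result on data like this,
# and both are far smaller than the original.
assert len(small) <= len(fast) < len(payload)
```

Blosc2 goes further than a single level knob: codecs, filters and even user-supplied plugins can be combined, and SIMD/multithreading keep the faster settings close to memory-bandwidth speeds.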

Lazy Expressions in Caterva2

· 5 min read
Francesc Alted
CEO ironArray SLU

What is Caterva2?

Caterva2 is a Free/Open Source distributed system written in Python for sharing Blosc2 datasets (either native or converted on-the-fly from HDF5) among different hosts. It uses a publish–subscribe messaging pattern in which a publisher's data can be replicated by an unlimited number of subscribers. In addition, every subscriber exposes a REST interface that allows clients to access the datasets.

Suppose we have a large dataset that we want to share with a group of people. Instead of sending the data to each person individually, we can publish it once with Caterva2 and let any number of subscribers access it, saving both time and resources.

Lazy expressions in Caterva2

Besides the data sharing utility, Caterva2 can also perform operations on the data via Python-Blosc2 v3. These operations range from arithmetic expressions to reductions, filters and broadcasting, and are performed lazily, that is, only when a part of the result is needed.
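The "only when a part of the result is needed" behaviour can be sketched independently of Caterva2's actual implementation. The toy `Lazy` class below (a hypothetical illustration, not Caterva2 or Blosc2 code) records operations instead of performing them, and only runs the deferred work when `compute()` is called:

```python
class Lazy:
    """Minimal lazy expression: record the computation, run it on demand."""

    def __init__(self, fn):
        self.fn = fn  # a zero-argument callable producing the value

    def __add__(self, other):
        return Lazy(lambda: self.fn() + other.fn())

    def __mul__(self, other):
        return Lazy(lambda: self.fn() * other.fn())

    def compute(self):
        # Nothing has been evaluated before this point.
        return self.fn()


a = Lazy(lambda: 2)
b = Lazy(lambda: 3)
expr = a + b * a  # builds a deferred expression; no arithmetic happens yet
result = expr.compute()  # 2 + 3 * 2
```

Caterva2 applies the same idea to whole Blosc2 arrays, so that slicing a lazy result only triggers computation over the chunks that slice actually touches.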

Computing Expressions in Blosc2

· 7 min read
Oumaima Ech Chdig
Intern at ironArray SLU

What are expressions?

The forthcoming version of Blosc2 will bring a powerful tool for performing mathematical operations on pre-compressed arrays, that is, on arrays whose data has been reduced in size using compression techniques. This functionality provides a flexible and efficient way to perform a wide range of operations, such as addition, subtraction, multiplication and other mathematical functions, directly on compressed arrays. This approach saves time and resources, especially when working with large datasets.

An example of expression computation in Blosc2 might be:

import numpy as np
import blosc2

# Compression parameters for the Blosc2 arrays (codec and level are illustrative)
cparams = {"codec": blosc2.Codec.LZ4, "clevel": 5}

dtype = np.float64
shape = [30_000, 4_000]
size = shape[0] * shape[1]
a = np.linspace(0, 10, num=size, dtype=dtype).reshape(shape)
b = np.linspace(0, 10, num=size, dtype=dtype).reshape(shape)
c = np.linspace(0, 10, num=size, dtype=dtype).reshape(shape)

# Convert NumPy arrays to compressed Blosc2 arrays
a1 = blosc2.asarray(a, cparams=cparams)
b1 = blosc2.asarray(b, cparams=cparams)
c1 = blosc2.asarray(c, cparams=cparams)

# Build the mathematical expression lazily
expr = a1 + b1 * c1  # a LazyExpr object; nothing is computed yet
expr += 2            # expressions can be modified before evaluation
output = expr.compute(cparams=cparams)  # compute! (output is compressed too)

Compressed arrays (a1, b1, c1) are created from existing NumPy arrays (a, b, c) using Blosc2, and mathematical operations are then performed on them with ordinary algebraic expressions. The computation is lazy: the expressions are not evaluated immediately, but deferred until the result is needed. Finally, the expression is actually computed (via .compute()) and the desired output (compressed as well) is obtained.

How it works

Unlocking Big Data Potential with Blosc Compression

· 3 min read
Francesc Alted
CEO ironArray SLU

Dear valued community,

Two years ago, ironArray embarked on an ambitious journey with the launch of our groundbreaking ironArray product, designed to revolutionize computations with compressed data. While our aspirations were high, we faced challenges in gaining traction and failed to meet our sales targets.

However, every setback is an opportunity for growth and transformation. Today, we are thrilled to announce a strategic shift in our business focus towards consulting services, leveraging the power of compression for big data, specifically through the acclaimed Blosc compressor.