Lazy Expressions in Caterva2
What is Caterva2?
Caterva2 is a Free/Open Source distributed system written in Python for sharing Blosc2 datasets (either native or converted on-the-fly from HDF5) among different hosts. It uses a publish–subscribe messaging pattern in which a publisher's data can be replicated by an unlimited number of subscribers. In addition, every subscriber exposes a REST interface that allows clients to access the datasets.
Let's suppose that we have a large dataset that we want to share with a group of people. Instead of sending the data to each person individually, we can use Caterva2 to publish the data once and allow multiple subscribers to access it. This way, we can save time and resources by avoiding the need to send the data to each person separately.
Lazy expressions in Caterva2
Besides the data sharing utility, Caterva2 can also perform operations on the data via Python-Blosc2 v3. These operations range from arithmetic expressions to reductions, filters and broadcasting, and are performed lazily, that is, only when a part of the result is needed.
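To make the idea of lazy evaluation concrete, here is a minimal pure-Python sketch. It is not how Caterva2 or Python-Blosc2 implement lazy expressions; the LazyExpr class and its evaluated counter are hypothetical names used only for illustration. It shows that creating the expression does no arithmetic, and that asking for a region computes only the elements in that region.

```python
import operator


class LazyExpr:
    """Toy lazy element-wise expression over a list of lists (illustrative only)."""

    def __init__(self, func, operand, scalar):
        # Creation only stores references; nothing is computed here.
        self.func = func
        self.operand = operand
        self.scalar = scalar
        self.evaluated = 0  # number of elements actually computed so far

    def __getitem__(self, key):
        # key is a (row_slice, col_slice) pair; only that region is computed.
        row_slice, col_slice = key
        out = []
        for row in self.operand[row_slice]:
            computed = [self.func(v, self.scalar) for v in row[col_slice]]
            self.evaluated += len(computed)
            out.append(computed)
        return out


# A 10x10 stand-in "dataset" holding the values 0..99.
data = [[r * 10 + c for c in range(10)] for r in range(10)]

plusone = LazyExpr(operator.add, data, 1)  # cheap: no arithmetic yet
part = plusone[0:2, 4:8]                   # computes a 2x4 region only
print(part)               # [[5, 6, 7, 8], [15, 16, 17, 18]]
print(plusone.evaluated)  # 8
```

Of the 100 elements in the operand, only 8 are ever touched, which is the property that keeps small requests fast regardless of the operand size.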
The Caterva2 subscriber lets you create so-called lazy expressions whose operands are array datasets accessible in the subscriber. These expressions are stored in the user's own scratch space (an always-subscribed pseudo-root named @scratch), so working with them requires user authentication.
Lazy expressions are very cheap to create, as this operation only requires knowing the metadata of the involved operands. The resulting data is not computed on creation; the operations only take place at the subscriber when you request access to the resulting data itself (e.g. via fetch or download operations).
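As an illustration of why metadata is enough at creation time, the sketch below derives the shape of a result from the shapes of its operands alone, without reading any data. It assumes NumPy-style broadcasting rules; the broadcast_shape function is a hypothetical helper written for this example, not part of Caterva2.

```python
from itertools import zip_longest


def broadcast_shape(*shapes):
    """Return the result shape for operands with the given shapes.

    Assumes NumPy-style broadcasting: shapes are aligned on their trailing
    dimensions, and the sizes in each position must be equal or be 1.
    """
    result = []
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError(f"shapes {shapes} cannot be broadcast together")
        result.append(sizes.pop() if sizes else 1)
    return tuple(reversed(result))


# A 2D operand combined with a scalar (empty shape) keeps the 2D shape.
print(broadcast_shape((1000, 2000), ()))       # (1000, 2000)
# A 2D operand combined with a 1D row broadcasts along the first axis.
print(broadcast_shape((1000, 2000), (2000,)))  # (1000, 2000)
```

Only a few integers per operand are inspected here, however large the underlying arrays are.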
For example, the following code uses the Caterva2 API to create a lazy expression named plusone from a 2D dataset, while handling user authentication via the auth_cookie argument:
import caterva2
caterva2.lazyexpr('plusone', 'x + 1', {'x': 'foo/dir1/ds-2d.b2nd'},
                  auth_cookie=...)
The path of the new dataset is returned: @scratch/plusone.b2nd. Now you can access it as a normal dataset, e.g.:
caterva2.fetch('@scratch/plusone.b2nd', slice_='0:2, 4:8',
               auth_cookie=...)
Note that, as the data is not computed until you request it, only the fetch operation triggers the computation of the expression (in the subscriber), and only then is the resulting dataset created (in the client). Furthermore, only the requested slice is computed, so even if your operands are very large, you will still get very good response times when asking for small slices of the result.
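The slice_ argument in the fetch call above is a string that mirrors Python slicing syntax. The sketch below shows one way such a string could map to a region of a 2D array. It is an assumption-laden illustration: parse_slice and take are hypothetical helpers, not functions of the caterva2 package, and Caterva2 may parse the string differently.

```python
def parse_slice(spec):
    """Turn a string such as '0:2, 4:8' into a tuple of slices and ints.

    Hypothetical helper for illustration; not part of the caterva2 package.
    """
    def parse_part(part):
        if ":" not in part:
            return int(part)  # a plain index, e.g. '3'
        # Empty fields (as in ':5' or '2:') become None, as in Python slicing.
        fields = [int(f) if f.strip() else None for f in part.split(":")]
        return slice(*fields)

    return tuple(parse_part(p) for p in spec.split(","))


def take(rows, key):
    """Apply a (row_slice, col_slice) pair to a list of lists."""
    row_slice, col_slice = key
    return [row[col_slice] for row in rows[row_slice]]


key = parse_slice("0:2, 4:8")
print(key)  # (slice(0, 2, None), slice(4, 8, None))

# A 10x10 stand-in "dataset" holding the values 0..99.
data = [[r * 10 + c for c in range(10)] for r in range(10)]
print(take(data, key))  # [[4, 5, 6, 7], [14, 15, 16, 17]]
```

The string '0:2, 4:8' therefore selects rows 0 and 1 and columns 4 through 7, a 2x4 region, which is all that needs to be computed for that request.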
Operating with Caterva2's web GUI
The Caterva2 web interface makes it easy to create lazy expressions. You can select the operands from the dataset browser using their associated tags (short string identifiers), and write the expression in a text box (the Prompt). In the screenshot, we are filtering for 2D datasets (with the Search box), then selecting the example/dir1/ds-2d.b2nd dataset (marked with tag b), and finally writing a lazy expression (named plusone) that adds 1 to the dataset:
When the GO button is pressed, the system checks the expression for correctness and then creates the lazy expression in your @scratch area. Here is how it looks in the Main tab of the Details panel:
Once the lazy expression is created, you can access the data as if it were a normal dataset. The system will compute the result on-the-fly, and you can fetch or download the data as you would with any other dataset. For example, here is how the data looks when using the data browser:
Experiment with our demo sites
We have set up a demo site where you can experiment with browsing, visualizing and fetching an assortment of datasets straight away. Besides that, we have the authenticated demo site, where you can register a temporary user (which will last until the next day) and then upload your own datasets, create lazy expressions, visualize their results and download their data.
If you are interested in learning more about Caterva2, please visit the Caterva2 documentation for tutorials, API reference, and more.
Conclusion
Caterva2 is a powerful tool for sharing and operating on large datasets, even when they do not fit in memory. It is designed to be easy to use, with a simple and intuitive interface that allows you to create and work with lazy expressions in a few steps, either programmatically or via its web interface. If you are working with large datasets and need a way to share and operate on them efficiently, Caterva2 is the tool for you.