Caterva2: On-demand access to Blosc2 data repositories
Documentation and tutorials
Go to the Caterva2 documentation for tutorials, API reference and more.
What is it?
Caterva2 is a distributed system written in Python meant for sharing Blosc2 datasets (either native or converted on-the-fly from HDF5) among different hosts by using a publish–subscribe messaging pattern.
Watch this video to get a basic idea of what Caterva2 is and what its main features are:
In Caterva2, a publisher's data can be replicated by an unlimited number of subscribers. Publishers organize datasets into root groups, which are announced to the broker and propagated to subscribers. In addition, every subscriber exposes a REST interface that allows clients to access the datasets.
Advanced features
Besides data sharing, Caterva2 offers more advanced features to authenticated users. For example, you can upload and delete datasets, and perform operations on them via Python-Blosc2 v3. These operations range from arithmetic expressions to reductions, filters and broadcasting, and they are performed lazily, that is, only when a part of the result is needed.
Watch this video explaining how to create a lazy expression in Caterva2 straight from its web interface:
Of course, all this functionality is also accessible directly from its Python API, so you can automate your workflows with Caterva2.
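To illustrate what "performed lazily" means, here is a minimal, self-contained sketch in plain Python. It is not the Python-Blosc2 or Caterva2 API: the `LazyArray` class and all of its methods are hypothetical names made up for this example. It only shows the general idea that building an expression costs nothing, and that values are computed just for the part of the result that is requested.

```python
import operator


class LazyArray:
    """Toy lazy 1-D array (hypothetical, for illustration only)."""

    def __init__(self, length, getter):
        self.length = length
        self._getter = getter  # function mapping an index to a value
        self.evaluations = 0   # number of items actually computed so far

    @classmethod
    def from_list(cls, values):
        values = list(values)
        return cls(len(values), values.__getitem__)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Only the items inside the requested slice are evaluated
            return [self[i] for i in range(*key.indices(self.length))]
        if not -self.length <= key < self.length:
            raise IndexError("index out of range")
        self.evaluations += 1
        return self._getter(key % self.length)

    def _combine(self, other, op):
        if isinstance(other, LazyArray):
            if other.length != self.length:
                raise ValueError("operands must have the same length")
            return LazyArray(self.length, lambda i: op(self[i], other[i]))
        # A scalar operand is applied to every item (a trivial broadcast)
        return LazyArray(self.length, lambda i: op(self[i], other))

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __sub__(self, other):
        return self._combine(other, operator.sub)

    def __mul__(self, other):
        return self._combine(other, operator.mul)


a = LazyArray.from_list(range(10))
b = LazyArray.from_list(range(10, 20))

expr = a * 2 + b                    # builds the expression, computes nothing
evaluated_before = a.evaluations    # still 0 at this point

part = expr[2:4]                    # evaluates only items 2 and 3
print(part)                         # [16, 19]
print(a.evaluations)                # 2
```

In this toy model, asking for two items of the result touches only two items of each operand. A real system applies the same principle to compressed chunks of large n-dimensional datasets rather than to individual list items.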
Components of Caterva2
A typical Caterva2 deployment includes:
- A broker service to enable the communication between publishers and subscribers.
- Publishers, each providing subscribers with access to one root (and the datasets that it contains). The root may be a native Caterva2 directory with Blosc2 and plain files, or an HDF5 file (support for other formats may be added in the future).
- Subscribers, for tracking changes in multiple roots and datasets from publishers, and caching them locally for efficient reuse.
- Clients, each one asking a subscriber to track roots and datasets, and to provide access to their data and metadata.
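The way these components relate to each other can be sketched with a toy, in-process model. This is only an illustration under stated assumptions: the `Broker`, `Publisher` and `Subscriber` classes below, their methods, and the root and dataset names are all hypothetical and do not correspond to Caterva2's actual services or API, which communicate over the network. The sketch also never invalidates its cache, whereas real subscribers track changes in roots and datasets.

```python
class Broker:
    """Records which publisher serves each announced root (toy model)."""

    def __init__(self):
        self.roots = {}

    def announce(self, publisher):
        self.roots[publisher.root] = publisher

    def lookup(self, root):
        return self.roots[root]  # raises KeyError for unknown roots


class Publisher:
    """Provides access to the datasets of a single root."""

    def __init__(self, root, datasets):
        self.root = root
        self.datasets = dict(datasets)
        self.requests = 0  # how many times data was actually served

    def fetch(self, name):
        self.requests += 1
        return self.datasets[name]


class Subscriber:
    """Tracks roots via the broker and caches fetched datasets locally."""

    def __init__(self, broker):
        self.broker = broker
        self.subscribed = set()
        self.cache = {}

    def list_roots(self):
        return sorted(self.broker.roots)

    def subscribe(self, root):
        self.broker.lookup(root)  # make sure the root has been announced
        self.subscribed.add(root)

    def get(self, root, name):
        if root not in self.subscribed:
            raise LookupError(f"not subscribed to root {root!r}")
        key = (root, name)
        if key not in self.cache:
            # First access: ask the publisher, then keep a local copy
            self.cache[key] = self.broker.lookup(root).fetch(name)
        return self.cache[key]


# A client would talk to the subscriber only, never to the publisher
broker = Broker()
publisher = Publisher("example-root", {"ds-1d.b2nd": [0, 1, 2, 3]})
broker.announce(publisher)

subscriber = Subscriber(broker)
subscriber.subscribe("example-root")
first = subscriber.get("example-root", "ds-1d.b2nd")
second = subscriber.get("example-root", "ds-1d.b2nd")  # served from the cache
print(first, publisher.requests)
```

The second request never reaches the publisher, which is the point of caching datasets on the subscriber side.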
Programming interface
Besides providing a REST interface, Caterva2 also comes with a lightweight Python API that lets you build your own clients in a simple but effective way.
Web interface
Subscribers also expose a web interface for users to interact with the data. You can take a look at this web interface in our demo site. There, the list of roots appears on the left, and you can subscribe to a root by clicking on it. Once subscribed, you can browse and download the data files in that root and read their metadata, whether they are Blosc2, HDF5 or any other kind of file (files are automatically converted to Blosc2 format).
Furthermore, you can visualize the contents of a file if it is an n-dimensional dataset (NDArray, HDF5, ...) or a Markdown file. Moreover, if a dataset is a stack of images, you can also view it with the integrated viewer. For example, a 3-dimensional array of ints can be displayed as a stack of grayscale images, and a 4-dimensional array of ints as a stack of color images.
Finally, there is an authenticated demo site where you can upload your own datasets to the server (as well as copy, move or delete them), and then run computations such as complex expressions, filters, reductions, and more.
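As a rough sketch of how the dimensionality of an array could map to an image stack, consider the helper below. The function name is hypothetical, and the axis order is an assumption made for this example (frames first, then height and width, with color channels last); the original text does not specify which layout the viewer expects.

```python
def describe_image_stack(shape):
    """Say how an integer array of this shape could be shown as images.

    Assumed layouts (not confirmed by the Caterva2 documentation):
      3-D: (frames, height, width)            -> grayscale stack
      4-D: (frames, height, width, channels)  -> color stack
    """
    if len(shape) == 3:
        frames, height, width = shape
        return {"mode": "grayscale", "frames": frames,
                "height": height, "width": width}
    if len(shape) == 4:
        frames, height, width, channels = shape
        if channels not in (3, 4):  # RGB or RGBA
            raise ValueError("expected 3 or 4 color channels")
        return {"mode": "color", "frames": frames,
                "height": height, "width": width, "channels": channels}
    raise ValueError("only 3-D and 4-D arrays can be shown as image stacks")


gray = describe_image_stack((10, 256, 256))
color = describe_image_stack((10, 256, 256, 3))
print(gray["mode"], color["mode"])
```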