Dataset class#

A dataset is a Blosc2-encoded file on a root repository (thus a File) representing either a flat string of bytes or an n-dimensional array.

class caterva2.Dataset(root, path)#

Bases: File, Operand

Attributes:
blocks

The blockshape of the compressed dataset.

chunks

The chunkshape of the compressed dataset.

device

Hardware device the array data resides on.

dtype

The data type of the dataset.

info

Get information about the Operand.

ndim

Get the number of dimensions of the Operand.

shape

The shape of the dataset.

vlmeta

Returns a mapping of metalayer names to their respective values.

This is used to access variable-length metalayers (user attributes) associated with the file.

>>> import caterva2 as cat2
>>> client = cat2.Client('https://demo.caterva2.net')
>>> root = client.get('example')
>>> file = root['ds-sc-attr.b2nd']
>>> file.vlmeta
{'a': 1, 'b': 'foo', 'c': 123.456}

Methods

all([axis, keepdims])

Test whether all array elements along a given axis evaluate to True.

any([axis, keepdims])

Test whether any array element along a given axis evaluates to True.

append(data)

Appends data to the dataset.

argmax([axis, keepdims])

Returns the indices of the maximum values along a specified axis.

argmin([axis, keepdims])

Returns the indices of the minimum values along a specified axis.

copy(dst)

Copies the file to a new location.

download([localpath])

Downloads the file to local storage.

get_download_url()

Retrieves the download URL for the file.

item()

Copy an element of an array to a standard Python scalar and return it.

max([axis, keepdims])

Return the maximum along a given axis.

mean([axis, dtype, keepdims])

Return the arithmetic mean along the specified axis.

min([axis, keepdims])

Return the minimum along a given axis.

move(dst)

Moves the file to a new location.

prod([axis, dtype, keepdims])

Return the product of array elements over a given axis.

remove()

Removes the file from the remote repository.

slice(key[, as_blosc2])

Get a slice of a File/Dataset.

std([axis, dtype, ddof, keepdims])

Return the standard deviation along the specified axis.

sum([axis, dtype, keepdims])

Return the sum of array elements over a given axis.

to_device(device)

Copy the array from the device on which it currently resides to the specified device.

unfold()

Unfolds the file in a remote directory.

var([axis, dtype, ddof, keepdims])

Return the variance along the specified axis.

where([value1, value2])

Select value1 or value2, depending on whether each element of self is True or False.

Special Methods:

__init__(root, path)

Represents a dataset within a Blosc2 container.

__getitem__(item)

Retrieves a slice of the dataset.

Constructor#

__init__(root, path)#

Represents a dataset within a Blosc2 container.

This class is not intended to be instantiated directly; it should be accessed through a Root instance.

Parameters:
  • root (Root) – The root repository.

  • path (str) – The path of the dataset.

Examples

>>> import caterva2 as cat2
>>> client = cat2.Client('https://demo.caterva2.net')
>>> root = client.get('example')
>>> ds = root['ds-1d.b2nd']
>>> ds.dtype
'int64'
>>> ds.shape
(1000,)
>>> ds.chunks
(100,)
>>> ds.blocks
(10,)
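
The shape, chunk shape, and block shape shown above fix how the 1000 elements are partitioned. As a rough sketch in plain Python (no Caterva2 calls are involved; the helper `partitions` is hypothetical and assumes that a trailing partial partition counts as one), the counts follow from ceiling division per dimension:

```python
import math

def partitions(shape, part_shape):
    """Number of partitions per dimension, counting a trailing partial one."""
    return tuple(math.ceil(s / p) for s, p in zip(shape, part_shape))

# Values taken from the example above.
shape, chunks, blocks = (1000,), (100,), (10,)

chunks_per_dataset = partitions(shape, chunks)
blocks_per_chunk = partitions(chunks, blocks)

print(chunks_per_dataset)  # (10,)
print(blocks_per_chunk)    # (10,)
```

Under these assumptions the dataset holds 10 chunks, each made of 10 blocks.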

Utility Methods#

__getitem__(item)#

Retrieves a slice of the dataset.

Parameters:

item (int, slice, tuple of ints and slices, or None) – Specifies the slice to fetch.

Returns:

The requested slice of the dataset.

Return type:

numpy.ndarray

Examples

>>> import caterva2 as cat2
>>> client = cat2.Client('https://demo.caterva2.net')
>>> root = client.get('example')
>>> ds = root['ds-1d.b2nd']
>>> ds[1]
array(1)
>>> ds[:1]
array([0])
>>> ds[0:10]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

all(axis=None, keepdims=False, **kwargs)#

Test whether all array elements along a given axis evaluate to True.

The parameters are documented under min.

Returns:

all_along_axis – The result of the evaluation along the axis.

Return type:

np.ndarray or NDArray or scalar

References

np.all

Examples

>>> import numpy as np
>>> import blosc2
>>> data = np.array([True, True, False, True, True, True])
>>> ndarray = blosc2.asarray(data)
>>> # Test if all elements are True along the default axis (flattened array)
>>> result_flat = blosc2.all(ndarray)
>>> print("All elements are True (flattened):", result_flat)
All elements are True (flattened): False
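
The example above covers only the flattened case. The sketch below uses `np.all` (listed under References) as a local stand-in to illustrate the axis and keepdims parameters; it assumes the reduction follows NumPy semantics, and the variable names are illustrative:

```python
import numpy as np

data = np.array([[True, True, False],
                 [True, True, True]])

all_cols = np.all(data, axis=0)                 # one result per column
all_rows = np.all(data, axis=1)                 # one result per row
all_kept = np.all(data, axis=1, keepdims=True)  # reduced axis kept with size 1

print(all_cols.tolist())  # [True, True, False]
print(all_rows.tolist())  # [False, True]
print(all_kept.shape)     # (2, 1)
```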

any(axis=None, keepdims=False, **kwargs)#

Test whether any array element along a given axis evaluates to True.

The parameters are documented under min.

Returns:

any_along_axis – The result of the evaluation along the axis.

Return type:

np.ndarray or NDArray or scalar

References

np.any

Examples

>>> import blosc2
>>> import numpy as np
>>> data = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]])
>>> # Convert the NumPy array to a Blosc2 NDArray
>>> ndarray = blosc2.asarray(data)
>>> print("NDArray data:", ndarray[:])
NDArray data: [[1 0 0]
                [0 1 0]
                [0 0 0]]
>>> any_along_axis_0 = blosc2.any(ndarray, axis=0)
>>> print("Any along axis 0:", any_along_axis_0)
Any along axis 0: [ True  True False]
>>> any_flattened = blosc2.any(ndarray)
>>> print("Any in the flattened array:", any_flattened)
Any in the flattened array: True

append(data)#

Appends data to the dataset.

Parameters:

data (blosc2.NDArray, numpy.ndarray, sequence) – The data to append to the dataset.

Returns:

The new shape of the dataset.

Return type:

tuple

Examples

>>> import caterva2 as cat2
>>> import numpy as np
>>> # To append data to a dataset you need to be a registered user
>>> client = cat2.Client("https://cat2.cloud/demo", ("joedoe@example.com", "foobar"))
>>> data = client.copy('@public/examples/ds-1d.b2nd', '@personal/ds-1d.b2nd')
>>> dataset = client.get('@personal')['ds-1d.b2nd']
>>> dataset.append([1, 2, 3])
(1003,)

argmax(axis=None, keepdims=False, **kwargs)#

Returns the indices of the maximum values along a specified axis.

When the maximum value occurs multiple times, only the indices corresponding to the first occurrence are returned.

Parameters:
  • x (blosc2.Array) – Input array. Should have a real-valued data type.

  • axis (int | None) – Axis along which to search. If None, the index of the maximum value of the flattened array is returned. Default: None.

  • keepdims (bool) – If True, the reduced axis is included in the result as a singleton dimension; otherwise it is omitted. Default: False.

Returns:

out – If axis is None, a zero-dimensional array containing the index of the first occurrence of the maximum value; otherwise, a non-zero-dimensional array containing the indices of the maximum values.

Return type:

blosc2.Array
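
The first-occurrence rule and the effect of axis and keepdims can be illustrated with a small sketch. It uses `np.argmax` as a local stand-in, on the assumption that the method follows the same semantics as described above; the array is made up for the illustration:

```python
import numpy as np

x = np.array([[1, 9, 9],
              [7, 2, 7]])

flat_index = np.argmax(x)                           # index into the flattened array
per_row = np.argmax(x, axis=1)                      # first maximum in each row
per_row_kept = np.argmax(x, axis=1, keepdims=True)  # reduced axis kept with size 1

print(int(flat_index))     # 1
print(per_row.tolist())    # [1, 0]
print(per_row_kept.shape)  # (2, 1)
```

The value 9 appears twice in the first row and 7 twice in the second, yet only the index of the first occurrence is reported in each case.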

argmin(axis=None, keepdims=False, **kwargs)#

Returns the indices of the minimum values along a specified axis.

When the minimum value occurs multiple times, only the indices corresponding to the first occurrence are returned.

Parameters:
  • x (blosc2.Array) – Input array. Should have a real-valued data type.

  • axis (int | None) – Axis along which to search. If None, the index of the minimum value of the flattened array is returned. Default: None.

  • keepdims (bool) – If True, the reduced axis is included in the result as a singleton dimension; otherwise it is omitted. Default: False.

Returns:

out – If axis is None, a zero-dimensional array containing the index of the first occurrence of the minimum value; otherwise, a non-zero-dimensional array containing the indices of the minimum values.

Return type:

blosc2.Array

copy(dst)#

Copies the file to a new location.

Parameters:

dst (Path) – The destination path for the file.

Returns:

The new path of the copied file.

Return type:

Path

Examples

>>> import caterva2 as cat2
>>> import numpy as np
>>> # For copying a file you need to be a registered user
>>> client = cat2.Client("https://cat2.cloud/demo", ("joedoe@example.com", "foobar"))
>>> root = client.get('@personal')
>>> root.upload('root-example/dir2/ds-4d.b2nd')
<Dataset: @personal/root-example/dir2/ds-4d.b2nd>
>>> file = root['root-example/dir2/ds-4d.b2nd']
>>> file.copy('@personal/root-example/dir2/ds-4d-copy.b2nd')
PurePosixPath('@personal/root-example/dir2/ds-4d-copy.b2nd')
>>> 'root-example/dir2/ds-4d.b2nd' in root
True
>>> 'root-example/dir2/ds-4d-copy.b2nd' in root
True

download(localpath=None)#

Downloads the file to local storage.

Parameters:

localpath (Path, optional) – The destination path for the downloaded file. If not specified, the file will be downloaded to the current working directory.

Returns:

The path to the downloaded file.

Return type:

Path

Examples

>>> import caterva2 as cat2
>>> client = cat2.Client('https://demo.caterva2.net')
>>> root = client.get('example')
>>> file = root['ds-1d.b2nd']
>>> file.download()
PosixPath('example/ds-1d.b2nd')
>>> file.download('mydir/myarray.b2nd')
PosixPath('mydir/myarray.b2nd')

get_download_url()#

Retrieves the download URL for the file.

Returns:

The file’s download URL.

Return type:

str

Examples

>>> import caterva2 as cat2
>>> client = cat2.Client('https://demo.caterva2.net')
>>> root = client.get('example')
>>> file = root['ds-1d.b2nd']
>>> file.get_download_url()
'https://demo.caterva2.net/api/fetch/example/ds-1d.b2nd'

item() → float | bool | complex | int#

Copy an element of an array to a standard Python scalar and return it.

max(axis=None, keepdims=False, **kwargs)#

Return the maximum along a given axis.

The parameters are documented under min.

Returns:

max_along_axis – The maximum of the elements along the axis.

Return type:

np.ndarray or NDArray or scalar

References

np.max

Examples

>>> import blosc2
>>> import numpy as np
>>> data = np.array([[11, 2, 36, 24, 5, 69], [73, 81, 49, 6, 73, 0]])
>>> ndarray = blosc2.asarray(data)
>>> print("NDArray data:", ndarray[:])
NDArray data:  [[11  2 36 24  5 69]
                [73 81 49  6 73  0]]
>>> # Compute the maximum along axis 0 and 1
>>> max_along_axis_0 = blosc2.max(ndarray, axis=0)
>>> print("Maximum along axis 0:", max_along_axis_0)
Maximum along axis 0: [73 81 49 24 73 69]
>>> max_along_axis_1 = blosc2.max(ndarray, axis=1)
>>> print("Maximum along axis 1:", max_along_axis_1)
Maximum along axis 1: [69 81]
>>> max_flattened = blosc2.max(ndarray)
>>> print("Maximum of the flattened array:", max_flattened)
Maximum of the flattened array: 81

mean(axis=None, dtype=None, keepdims=False, **kwargs)#

Return the arithmetic mean along the specified axis.

The parameters are documented under sum.

Returns:

mean_along_axis – The mean of the elements along the axis.

Return type:

np.ndarray or NDArray or scalar

References

np.mean

Examples

>>> import numpy as np
>>> import blosc2
>>> # Example array
>>> array = np.array([[1, 2, 3], [4, 5, 6]])
>>> nd_array = blosc2.asarray(array)
>>> # Compute the mean of all elements in the array (axis=None)
>>> overall_mean = blosc2.mean(nd_array)
>>> print("Mean of all elements:", overall_mean)
Mean of all elements: 3.5

min(axis=None, keepdims=False, **kwargs)#

Return the minimum along a given axis.

Parameters:
  • ndarr (NDArray or NDField or C2Array or LazyExpr) – The input array or expression.

  • axis (int or tuple of ints, optional) – Axis or axes along which to operate. By default, flattened input is used.

  • keepdims (bool, optional) – If set to True, the axes which are reduced are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

  • kwargs (dict, optional) – Keyword arguments that are supported by the empty() constructor.

Returns:

min_along_axis – The minimum of the elements along the axis.

Return type:

np.ndarray or NDArray or scalar

References

np.min

Examples

>>> import numpy as np
>>> import blosc2
>>> array = np.array([1, 3, 7, 8, 9, 31])
>>> nd_array = blosc2.asarray(array)
>>> min_all = blosc2.min(nd_array)
>>> print("Minimum of all elements in the array:", min_all)
Minimum of all elements in the array: 1
>>> # Compute the minimum along axis 0 with keepdims=True
>>> min_keepdims = blosc2.min(nd_array, axis=0, keepdims=True)
>>> print("Minimum along axis 0 with keepdims=True:", min_keepdims)
Minimum along axis 0 with keepdims=True:  [1]
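
The statement that a keepdims=True result broadcasts correctly against the input can be checked with a short sketch. It uses `np.min` (listed under References) as a local stand-in and assumes the same reduction semantics; the array is made up for the illustration:

```python
import numpy as np

a = np.array([[4, 2, 8],
              [3, 9, 1]])

row_min = np.min(a, axis=1)                      # shape (2,)
row_min_kept = np.min(a, axis=1, keepdims=True)  # shape (2, 1)

# The (2, 1) result broadcasts against the (2, 3) input,
# so each row can be shifted by its own minimum.
shifted = a - row_min_kept

print(row_min.tolist())  # [2, 1]
print(shifted.tolist())  # [[2, 0, 6], [2, 8, 0]]
```

Without keepdims, the shape (2,) result would not broadcast against the (2, 3) input, and the subtraction would raise an error.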

move(dst)#

Moves the file to a new location.

Parameters:

dst (Path) – The destination path for the file.

Returns:

The new path of the file after the move.

Return type:

Path

Examples

>>> import caterva2 as cat2
>>> # For moving a file you need to be a registered user
>>> client = cat2.Client("https://cat2.cloud/demo", ("joedoe@example.com", "foobar"))
>>> root = client.get('@personal')
>>> root.upload('root-example/dir2/ds-4d.b2nd')
<Dataset: @personal/root-example/dir2/ds-4d.b2nd>
>>> file = root['root-example/dir2/ds-4d.b2nd']
>>> file.move('@personal/root-example/dir1/ds-4d-moved.b2nd')
PurePosixPath('@personal/root-example/dir1/ds-4d-moved.b2nd')
>>> 'root-example/dir2/ds-4d.b2nd' in root
False
>>> 'root-example/dir1/ds-4d-moved.b2nd' in root
True

prod(axis=None, dtype=None, keepdims=False, **kwargs)#

Return the product of array elements over a given axis.

The parameters are documented under sum.

Returns:

product_along_axis – The product of the elements along the axis.

Return type:

np.ndarray or NDArray or scalar

References

np.prod

Examples

>>> import numpy as np
>>> import blosc2
>>> # Create an instance of NDArray with some data
>>> array = np.array([[11, 22, 33], [4, 15, 36]])
>>> nd_array = blosc2.asarray(array)
>>> # Compute the product of all elements in the array
>>> prod_all = blosc2.prod(nd_array)
>>> print("Product of all elements in the array:", prod_all)
Product of all elements in the array: 17249760
>>> # Compute the product along axis 1 (rows)
>>> prod_axis1 = blosc2.prod(nd_array, axis=1)
>>> print("Product along axis 1:", prod_axis1)
Product along axis 1: [7986 2160]

remove()#

Removes the file from the remote repository.

Returns:

The path of the removed file.

Return type:

str

Examples

>>> import caterva2 as cat2
>>> import numpy as np
>>> # To remove a file you need to be a registered user
>>> client = cat2.Client('https://cat2.cloud/demo', ("joedoe@example.com", "foobar"))
>>> root = client.get('@personal')
>>> path = 'root-example/dir2/ds-4d.b2nd'
>>> root.upload(path)
<Dataset: @personal/root-example/dir2/ds-4d.b2nd>
>>> file = root[path]
>>> file.remove()
'@personal/root-example/dir2/ds-4d.b2nd'
>>> path in root
False

slice(key: int | slice | Sequence[slice], as_blosc2: bool = True) → NDArray | SChunk | ndarray#

Get a slice of a File/Dataset.

Parameters:
  • key (int, slice, or sequence of slices) – The slice to retrieve. If a single slice is provided, it will be applied to the first dimension. If a sequence of slices is provided, each slice will be applied to the corresponding dimension.

  • as_blosc2 (bool) – If True (default), the result will be returned as a Blosc2 object (either a SChunk or NDArray). If False, it will be returned as a NumPy array (equivalent to self[key]).

Returns:

The requested slice, as a new Blosc2 object, or as a NumPy array if as_blosc2 is False.

Return type:

NDArray or SChunk or numpy.ndarray

Examples

>>> import caterva2 as cat2
>>> client = cat2.Client('https://demo.caterva2.net')
>>> root = client.get('example')
>>> ds = root['ds-1d.b2nd']
>>> ds.slice(1)
<blosc2.ndarray.NDArray object at 0x10747efd0>
>>> ds.slice(1)[()]
array(1)
>>> ds.slice(slice(0, 10))[:]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
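
The difference between passing a single slice and a sequence of slices as key shows up only on multidimensional data. The sketch below uses a local NumPy array as a stand-in for a 2-D dataset, on the assumption that the key is interpreted as in NumPy indexing; no remote call is made:

```python
import numpy as np

# Local stand-in for a 2-D dataset; a real Dataset would be fetched remotely.
a = np.arange(12).reshape(3, 4)

first_dim_only = a[slice(0, 2)]                # rows 0-1, all columns
per_dimension = a[(slice(0, 2), slice(1, 3))]  # rows 0-1, columns 1-2

print(first_dim_only.shape)    # (2, 4)
print(per_dimension.tolist())  # [[1, 2], [5, 6]]
```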

std(axis=None, dtype=None, ddof=0, keepdims=False, **kwargs)#

Return the standard deviation along the specified axis.

Parameters:
  • ndarr (NDArray or NDField or C2Array or LazyExpr) – The input array or expression.

  • axis (int or tuple of ints, optional) – Axis or axes along which the standard deviation is computed. By default, axis=None computes the standard deviation of the flattened array.

  • dtype (np.dtype or list str, optional) – Type to use in computing the standard deviation. For integer inputs, the default is float32; for floating point inputs, it is the same as the input dtype.

  • ddof (int, optional) – Delta degrees of freedom. The divisor used in the calculation is N - ddof, where N is the number of elements. By default, ddof is zero.

  • keepdims (bool, optional) – If set to True, the reduced axes are left in the result as dimensions with size one. This ensures that the result will broadcast correctly against the input array.

  • kwargs (dict, optional) – Additional keyword arguments that are supported by the empty() constructor.

Returns:

std_along_axis – The standard deviation of the elements along the axis.

Return type:

np.ndarray or NDArray or scalar

References

np.std

Examples

>>> import numpy as np
>>> import blosc2
>>> # Create an instance of NDArray with some data
>>> array = np.array([[1, 2, 3], [4, 5, 6]])
>>> nd_array = blosc2.asarray(array)
>>> # Compute the standard deviation of the entire array
>>> std_all = blosc2.std(nd_array)
>>> print("Standard deviation of the entire array:", std_all)
Standard deviation of the entire array: 1.707825127659933
>>> # Compute the standard deviation along axis 0 (columns)
>>> std_axis0 = blosc2.std(nd_array, axis=0)
>>> print("Standard deviation along axis 0:", std_axis0)
Standard deviation along axis 0: [1.5 1.5 1.5]
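
The role of ddof can be worked through by hand. The following plain-Python sketch applies the N - ddof divisor described above to the same six values used in the example; the variable names are illustrative:

```python
import math

values = [1, 2, 3, 4, 5, 6]
n = len(values)
mean = sum(values) / n
squared_dev = sum((v - mean) ** 2 for v in values)  # 17.5

std_ddof0 = math.sqrt(squared_dev / (n - 0))  # divisor N, as with ddof=0
std_ddof1 = math.sqrt(squared_dev / (n - 1))  # divisor N - 1, as with ddof=1

print(std_ddof0)
print(std_ddof1)
```

With ddof=0 the result agrees with the value printed for the entire array above; with ddof=1 the smaller divisor gives a slightly larger value. The quantity under the square root is the corresponding variance.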

sum(axis=None, dtype=None, keepdims=False, **kwargs)#

Return the sum of array elements over a given axis.

Parameters:
  • ndarr (NDArray or NDField or C2Array or LazyExpr) – The input array or expression.

  • axis (int or tuple of ints, optional) – Axis or axes along which a sum is performed. By default, axis=None, sums all the elements of the input array. If axis is negative, it counts from the last to the first axis.

  • dtype (np.dtype or list str, optional) – The type of the returned array and of the accumulator in which the elements are summed. The dtype of ndarr is used by default unless it has an integer dtype of less precision than the default platform integer.

  • keepdims (bool, optional) – If set to True, the reduced axes are left in the result as dimensions with size one. With this option, the result will broadcast correctly against the input array.

  • kwargs (dict, optional) – Additional keyword arguments supported by the empty() constructor.

Returns:

sum_along_axis – The sum of the elements along the axis.

Return type:

np.ndarray or NDArray or scalar

References

np.sum

Examples

>>> import numpy as np
>>> import blosc2
>>> # Example array
>>> array = np.array([[1, 2, 3], [4, 5, 6]])
>>> nd_array = blosc2.asarray(array)
>>> # Sum all elements in the array (axis=None)
>>> total_sum = blosc2.sum(nd_array)
>>> print("Sum of all elements:", total_sum)
Sum of all elements: 21
>>> # Sum along axis 0 (columns)
>>> sum_axis_0 = blosc2.sum(nd_array, axis=0)
>>> print("Sum along axis 0 (columns):", sum_axis_0)
Sum along axis 0 (columns): [5 7 9]
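
The dtype parameter matters most for small integer types. The sketch below shows the accumulator behaviour in NumPy, whose `np.sum` is listed under References; whether the method behaves identically in every case is an assumption based on the parameter description above:

```python
import numpy as np

small = np.array([100, 100, 100], dtype=np.int8)

default_sum = np.sum(small)                # accumulates in the platform integer
forced_sum = np.sum(small, dtype=np.int8)  # accumulates in int8 and wraps around

print(int(default_sum))  # 300
print(int(forced_sum))   # 44
```

Forcing an int8 accumulator makes the total wrap around (300 - 256 = 44), which is why the default promotes narrow integer types.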

to_device(device: str)#

Copy the array from the device on which it currently resides to the specified device.

Parameters:
  • self (NDArray) – Array instance.

  • device (str) – Device to move the array object to. Raises an error unless device == 'cpu'.

Returns:

out – If device == 'cpu', the same array; otherwise an error is raised.

Return type:

NDArray

unfold()#

Unfolds the file in a remote directory.

Returns:

The path to the unfolded directory.

Return type:

Path

Examples

>>> import caterva2 as cat2
>>> client = cat2.Client('https://demo.caterva2.net')
>>> root = client.get('example')
>>> file = root['ds-1d.h5']
>>> file.unfold()
PurePosixPath('example/ds-1d.h5')

var(axis=None, dtype=None, ddof=0, keepdims=False, **kwargs)#

Return the variance along the specified axis.

The parameters are documented under std.

Returns:

var_along_axis – The variance of the elements along the axis.

Return type:

np.ndarray or NDArray or scalar

References

np.var

Examples

>>> import numpy as np
>>> import blosc2
>>> # Create an instance of NDArray with some data
>>> array = np.array([[1, 2, 3], [4, 5, 6]])
>>> nd_array = blosc2.asarray(array)
>>> # Compute the variance of the entire array
>>> var_all = blosc2.var(nd_array)
>>> print("Variance of the entire array:", var_all)
Variance of the entire array: 2.9166666666666665
>>> # Compute the variance along axis 0 (columns)
>>> var_axis0 = blosc2.var(nd_array, axis=0)
>>> print("Variance along axis 0:", var_axis0)
Variance along axis 0: [2.25 2.25 2.25]

where(value1=None, value2=None)#

Select value1 or value2, depending on whether each element of self is True or False.

Parameters:
  • value1 (array_like, optional) – The value to select when element of self is True.

  • value2 (array_like, optional) – The value to select when element of self is False.

Returns:

out – A new expression with the where condition applied.

Return type:

LazyExpr
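
The element-wise selection can be pictured with `np.where`, used here purely as an eager local stand-in. The method itself returns a LazyExpr, so the actual values would only be produced once that expression is evaluated; the arrays below are made up for the illustration:

```python
import numpy as np

condition = np.array([True, False, True, False])  # plays the role of self
value1 = np.array([1, 2, 3, 4])
value2 = np.array([10, 20, 30, 40])

# Take value1 where the condition is True, value2 where it is False.
selected = np.where(condition, value1, value2)

print(selected.tolist())  # [1, 20, 3, 40]
```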

property blocks#

The blockshape of the compressed dataset.

property chunks#

The chunkshape of the compressed dataset.

property device#

Hardware device the array data resides on. Always equal to ‘cpu’.

property dtype#

The data type of the dataset.

abstract property info: InfoReporter#

Get information about the Operand.

Returns:

out – A printable class with information about the Operand.

Return type:

InfoReporter

abstract property ndim: int#

Get the number of dimensions of the Operand.

Returns:

out – The number of dimensions of the Operand.

Return type:

int

property shape#

The shape of the dataset.

property vlmeta#

Returns a mapping of metalayer names to their respective values.

This is used to access variable-length metalayers (user attributes) associated with the file.

>>> import caterva2 as cat2
>>> client = cat2.Client('https://demo.caterva2.net')
>>> root = client.get('example')
>>> file = root['ds-sc-attr.b2nd']
>>> file.vlmeta
{'a': 1, 'b': 'foo', 'c': 123.456}