Creating arrays

ironArray for Python is a package that implements a multi-dimensional, compressed data container and an optimized computational engine to manage large arrays.

In this tutorial we will create a simple ironArray array and then set properties on the array object. We will also see how to set default properties by changing global and contextual configuration settings.

Creating an array

Let’s start by creating a simple array whose elements are inside the [-1, 1] interval:

[1]:
import numpy as np
import iarray as ia

arr = ia.linspace((5, 5), -1, 1, dtype=np.float64)
print(arr)
<IArray (5, 5) np.float64>

Voilà, the object arr contains our first ironArray array.

To create an array, we first have to define its shape. The array is then instantiated by the linspace constructor, where you specify the start and stop values. Functions in ironArray are written to map closely to NumPy functions; you can consult the NumPy documentation for more information on the functions and their parameters.
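To make the role of the shape, start and stop values concrete, here is a minimal pure-Python sketch of the computation. It assumes that the evenly spaced values fill the array in row-major order, as NumPy's `np.linspace(-1, 1, 25).reshape(5, 5)` would; the helper `linspace_grid` is hypothetical and not part of ironArray.

```python
def linspace_grid(shape, start, stop):
    """Hypothetical helper: evenly spaced values laid out row by row."""
    rows, cols = shape
    n = rows * cols
    step = (stop - start) / (n - 1)  # spacing between consecutive values
    flat = [start + i * step for i in range(n)]
    # Split the flat sequence into rows of `cols` elements each
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


grid = linspace_grid((5, 5), -1, 1)
for row in grid:
    print(" ".join(f"{v: .8f}" for v in row))
```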

The ironArray library is designed to operate on floating point numerical data. Consequently, the arrays currently support two data types: double (np.float64) and float (np.float32).

Let’s convert the arr object into a NumPy array and inspect the data:

[2]:
ia.iarray2numpy(arr)
[2]:
array([[-1.        , -0.91666667, -0.83333333, -0.75      , -0.66666667],
       [-0.58333333, -0.5       , -0.41666667, -0.33333333, -0.25      ],
       [-0.16666667, -0.08333333,  0.        ,  0.08333333,  0.16666667],
       [ 0.25      ,  0.33333333,  0.41666667,  0.5       ,  0.58333333],
       [ 0.66666667,  0.75      ,  0.83333333,  0.91666667,  1.        ]])

You can also use the .data attribute to do the same:

[3]:
arr.data
[3]:
array([[-1.        , -0.91666667, -0.83333333, -0.75      , -0.66666667],
       [-0.58333333, -0.5       , -0.41666667, -0.33333333, -0.25      ],
       [-0.16666667, -0.08333333,  0.        ,  0.08333333,  0.16666667],
       [ 0.25      ,  0.33333333,  0.41666667,  0.5       ,  0.58333333],
       [ 0.66666667,  0.75      ,  0.83333333,  0.91666667,  1.        ]])

Properties

Besides the shape and data type, we can set more properties on the array. For example, let’s make it persistent:

[4]:
pers_arr = ia.linspace((5, 5), -1, 1, dtype=np.float64, urlpath="myarr.iarr", mode="w")
[5]:
%%bash
ls -l myarr.iarr
-rw-rw-r-- 1 faltet2 faltet2 852 dic 22 09:08 myarr.iarr

Now let’s read the persistent object back from disk. We use open() instead of load() so that the data is read lazily, as needed (a topic covered in a later tutorial):

[6]:
arr2 = ia.open("myarr.iarr")
print(arr2.data)
[[-1.         -0.91666667 -0.83333333 -0.75       -0.66666667]
 [-0.58333333 -0.5        -0.41666667 -0.33333333 -0.25      ]
 [-0.16666667 -0.08333333  0.          0.08333333  0.16666667]
 [ 0.25        0.33333333  0.41666667  0.5         0.58333333]
 [ 0.66666667  0.75        0.83333333  0.91666667  1.        ]]

Config

The Config class is used to tune the storage (together with some other parameters) for your arrays. The urlpath property is just one of many properties that can be set in a Config object. See the Config documentation for more details on how ironArray configuration can be optimized to improve performance and decrease array size.

[7]:
cfg = ia.Config()
print(cfg)
Config(codec=<Codec.LZ4: 1>, clevel=9, favor=<Favor.BALANCE: 0>, filters=[<Filter.SHUFFLE: 1>], fp_mantissa_bits=0, use_dict=False, nthreads=32, eval_method=<Eval.AUTO: 1>, seed=1, random_gen=<RandomGen.MERSENNE_TWISTER: 0>, btune=True, dtype=<class 'numpy.float64'>, split_mode=<SplitMode.AUTO_SPLIT: 3>, chunks=None, blocks=None, urlpath=None, mode='w-', contiguous=None)

We can also set multiple properties in a single Config instance. For example, this Config object has properties for the shape of the chunks and the blocks:

ia.Config(chunks=(3000, 1000), blocks=(100, 100))

The following example shows how to create a Config object with several properties set, then pass it via the cfg argument when creating a larger ironArray array:

[8]:
cfg = ia.Config(chunks=(3000, 1000), blocks=(100, 100), urlpath="large_arr.iarr", mode="w", fp_mantissa_bits=30)
arr = ia.linspace((10000, 7000), -1, 1, dtype=np.float64, cfg=cfg)
[9]:
%%bash
ls -lh large_arr.iarr
-rw-rw-r-- 1 faltet2 faltet2 134M dic 22 09:08 large_arr.iarr

We have just created an array containing more than 500 MB of data. Thanks to integrated compression, the size of the serialized array on disk is less than 150 MB.
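The uncompressed size follows directly from the shape and the data type, as the short calculation below shows. The on-disk figure is taken from the `ls -lh` output above, under the assumption that `ls -lh` reports sizes in binary units (1M = 2**20 bytes), so the resulting ratio is only approximate.

```python
shape = (10000, 7000)
itemsize = 8  # bytes per float64 element

nbytes = shape[0] * shape[1] * itemsize
print(f"uncompressed: {nbytes / 1e6:.0f} MB ({nbytes / 2**20:.0f} MiB)")

# Size reported by `ls -lh`; assumed to be in binary units (MiB)
disk_bytes = 134 * 2**20
print(f"approximate compression ratio: {nbytes / disk_bytes:.1f}x")
```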

In addition, and in contrast to other chunked and compressed data container libraries that support just a single level of data partitioning (such as HDF5 and Zarr), ironArray allows for two levels: chunks and blocks. As we’ll see later, two levels offer more flexibility and options for tuning performance on modern CPU architectures.
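To get a feel for the two levels, the following sketch counts how many chunks and blocks result from the shapes used in this tutorial. It only does arithmetic; how ironArray actually handles partially filled chunks at the array edges is not covered here, and the helper `partitions` is hypothetical.

```python
import math


def partitions(shape, part):
    """Hypothetical helper: number of partitions needed along each dimension."""
    return tuple(math.ceil(s / p) for s, p in zip(shape, part))


shape, chunks, blocks = (10000, 7000), (3000, 1000), (100, 100)
itemsize = 8  # bytes per float64 element

chunks_per_dim = partitions(shape, chunks)   # chunks needed to cover the array
blocks_per_dim = partitions(chunks, blocks)  # blocks needed to cover one chunk

print("chunks per dimension:", chunks_per_dim, "total:", math.prod(chunks_per_dim))
print("blocks per chunk dimension:", blocks_per_dim, "total:", math.prod(blocks_per_dim))
print("uncompressed chunk size:", math.prod(chunks) * itemsize, "bytes")
print("uncompressed block size:", math.prod(blocks) * itemsize, "bytes")
```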

You may set many other properties when creating an ironArray array. Here we set some compression properties:

[10]:
cfg = ia.Config(chunks=(3000, 1000), blocks=(100, 100), urlpath="large_arr2.iarr", mode="w")
arr = ia.linspace((10000, 7000), -1, 1, dtype=np.float64, cfg=cfg, btune=False, clevel=5, codec=ia.Codec.ZSTD, fp_mantissa_bits=30)

Note that when we set codec, filters or clevel, we have to disable btune; otherwise btune will override those values. Conversely, if we want to set favor, btune has to be enabled (the default) for the chosen resource to actually be favored.

[11]:
%%bash
ls -lh large_arr2.iarr
-rw-rw-r-- 1 faltet2 faltet2 29M dic 22 09:08 large_arr2.iarr

As you can see, we created an array that holds more than 500 MB of data, as before. But now the serialized data takes less than 30 MB of disk space. Two properties help shrink the storage size:

  1. codec=ia.Codec.ZSTD: ZSTD offers better compression.

  2. fp_mantissa_bits=30: The IEEE Standard for Floating-Point Arithmetic (IEEE 754) sets the number of significand bits to 24 for float32 and 53 for float64 (including the hidden bit). By keeping just 30 bits in the mantissa (or significand) instead of the usual 53 bits for float64, we are setting the other 23 bits to zero, which improves the compression ratio. You can set fp_mantissa_bits to any precision between 1 and 24 bits (float32) or 53 bits (float64); the compression engine will compress the data to fit the specified precision.
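The effect of reducing the mantissa can be sketched in pure Python by zeroing the low-order bits of a float64. This is only an illustration: it assumes plain truncation (no rounding), which may differ from what ironArray does internally, and the helper `truncate_mantissa` is hypothetical.

```python
import struct


def truncate_mantissa(value, keep_bits):
    """Hypothetical helper: keep `keep_bits` significand bits of a float64.

    `keep_bits` counts the hidden bit, so 53 leaves the value unchanged.
    Assumes simple truncation rather than rounding.
    """
    drop = 53 - keep_bits  # number of low-order mantissa bits to zero
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    bits &= ~((1 << drop) - 1) & 0xFFFFFFFFFFFFFFFF
    (result,) = struct.unpack("<d", struct.pack("<Q", bits))
    return result


x = -0.91666667
y = truncate_mantissa(x, 30)
print(x, "->", y)
print("relative error:", abs(x - y) / abs(x))
```

The runs of trailing zero bits are what make the data easier to compress; the price, in this sketch, is a relative error below 2**-29 per value.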

You can see the complete set of supported properties and their defaults by examining an instance of ia.Config:

[12]:
cfg = ia.Config()
print(cfg)
Config(codec=<Codec.LZ4: 1>, clevel=9, favor=<Favor.BALANCE: 0>, filters=[<Filter.SHUFFLE: 1>], fp_mantissa_bits=0, use_dict=False, nthreads=32, eval_method=<Eval.AUTO: 1>, seed=1, random_gen=<RandomGen.MERSENNE_TWISTER: 0>, btune=True, dtype=<class 'numpy.float64'>, split_mode=<SplitMode.AUTO_SPLIT: 3>, chunks=None, blocks=None, urlpath=None, mode='w-', contiguous=None)

Conclusion

You can create arbitrarily large arrays either in memory or on disk, and you can tailor arrays to your own needs using ironArray configuration properties. The dedicated tutorial on Configuring ironArray is recommended reading for working comfortably with the rich set of properties in ironArray.