Slicing Datasets and Creating Views#

In this tutorial, we will cover the steps to load a large array of data, then slice it into chunks to create a view that can be examined and manipulated. Views are simply references to the parts of a larger array that contain the data of interest to you at the moment.

To see how slicing works in ironArray, we’ll use an open source dataset from the Pangeo project containing precipitation data. In this case, we are going to work with data covering a period of one month.

Opening the Dataset#

Before we can begin slicing the data and creating a view, we first need to load the data into an ironArray structure. We’ll start by configuring our ironArray environment, as covered in the previous tutorial.

[1]:
import zarr
import xarray as xr
import s3fs
import numpy as np
import iarray as ia
import os
import shutil

ia.set_config_defaults(btune=False, codec=ia.Codec.LZ4, filters=[ia.Filter.SHUFFLE])

%load_ext memprofiler

Even though how we download the data is not central to this tutorial, we will show how we do it. First we define a function that loads the data from a Zarr store on an S3 filesystem into an xarray object. The dataset is split into sets organized by year and month, so we’ll make year and month parameters of our function.

We’ll also have a starting date and ending date as parameters, so that we can further reduce the amount of data we use to populate our initial ironArray object.

[2]:
def open_zarr(year, month, datestart, dateend):
    # Anonymous (public) access to the ERA5 bucket on S3
    fs = s3fs.S3FileSystem(anon=True)

    datestring = 'era5-pds/zarr/{year}/{month:02d}/data/'.format(year=year, month=month)

    # Open the precipitation Zarr store lazily with xarray
    precip_zarr = xr.open_dataset(s3fs.S3Map(datestring + 'precipitation_amount_1hour_Accumulation.zarr/',
                                             s3=fs),
                                  engine="zarr")
    # Restrict to the requested date range
    precip_zarr = precip_zarr.sel(time1=slice(np.datetime64(datestart), np.datetime64(dateend)))

    return precip_zarr.precipitation_amount_1hour_Accumulation

Next, we’ll check whether we already have an ironArray dataset saved to disk; if we don’t, we’ll import the data from the Zarr store. When we call our open_zarr() function, we’ll import the one-hour accumulated precipitation data for October 1987, selecting the desired date range. Finally, we’ll turn the loaded dataset into an ironArray object and save it to disk.

[3]:
%%time
if os.path.exists("precip1.iarr"):
    precip_ia = ia.load("precip1.iarr")
else:
    precip = open_zarr(1987, 10, '1987-10-01', '1987-10-30 23:59').data
    precip_ia = ia.numpy2iarray(precip)
    ia.save("precip1.iarr", precip_ia)
print(precip_ia.info)
type   : IArray
shape  : (720, 721, 1440)
chunks : (128, 128, 256)
blocks : (16, 32, 64)
cratio : 14.62

CPU times: user 449 µs, sys: 64.2 ms, total: 64.6 ms
Wall time: 64.5 ms

In this simple case, the chunk shape and block shape have been automatically balanced, which is the default setting for ironArray. Setting the appropriate chunk shape and block shape for your array is very important, as we will demonstrate below in Optimization Tips.

Slicing Notation#

In slicing tuples, ironArray supports integers, start:stop slices (the step is not yet implemented) and the ellipsis object (...).

As in Python, all indices are zero-based: for the \(i\)-th index \(n_i\), the valid range is \(0 \le n_i < d_i\), where \(d_i\) is the \(i\)-th element of the shape of the array. Negative indices are interpreted as counting from the end of the array (i.e., if \(n_i < 0\), it means \(d_i + n_i\)).
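Since ironArray follows NumPy’s indexing arithmetic here, the rule can be checked with a plain NumPy array standing in for an IArray (a small hypothetical example, not part of the dataset above):

```python
import numpy as np

# A small stand-in array: shape (4, 5), so d_0 = 4 and d_1 = 5
a = np.arange(20).reshape(4, 5)

# A negative index n_i is interpreted as d_i + n_i
assert a[-1, -2] == a[4 - 1, 5 - 2]  # both address element [3, 3]
assert a[-1, -2] == 18
```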

The simplest way to obtain a slice is to use integers to access a single value. In this case, ironArray returns a NumPy 0-dimensional array.

[4]:
s1 = precip_ia[5, 234, -55]
s1, type(s1)
[4]:
(array(0., dtype=float32), numpy.ndarray)

As mentioned, ironArray also supports the start:stop notation. If start is not specified, the slice starts at the beginning of the array; likewise, if stop is not specified, the slice stops at the end of the array.

[5]:
sl1 = precip_ia[34:555, 211:311, 300:]
sl2 = precip_ia[:, :-250, 300:500]
sl3 = precip_ia[:, :, :]
sl1, sl2, sl3
[5]:
(<IArray (521, 100, 1140) np.float32>,
 <IArray (720, 471, 200) np.float32>,
 <IArray (720, 721, 1440) np.float32>)

Another interesting feature in slicing is the ellipsis object (...). This symbol expands to as many : objects as needed to index all dimensions. Only a single ellipsis may be present.

[6]:
sl1 = precip_ia[...]
sl2 = precip_ia[..., 5]
sl3 = precip_ia[500:100, ..., 400]
sl1, sl2, sl3
[6]:
(<IArray (720, 721, 1440) np.float32>,
 <IArray (720, 721) np.float32>,
 <IArray (0, 721) np.float32>)

In ironArray, slicing tuples need not index all dimensions. In that case, ironArray completes the remaining dimensions with : (the full slice):

[7]:
sl1 = precip_ia[:400, -500]
sl2 = precip_ia[5]
sl1, sl2
[7]:
(<IArray (400, 1440) np.float32>, <IArray (721, 1440) np.float32>)

Note that when an integer is used, the dimensionality of the array is reduced by one. To keep the dimension, we can specify a slice of length one instead (e.g. [5:6]).

[8]:
precip_ia[5].shape, precip_ia[5:6].shape
[8]:
((721, 1440), (1, 721, 1440))

These two slices contain the same data, but the first dimension has been removed from the first slice because we indexed it with an integer.

Views#

When a slice is performed in ironArray, a view of the container is returned (as in NumPy). You can always check whether an array is a view with the is_view attribute:

[9]:
s1 = precip_ia[2:300, 40:310, 500:1000]
precip_ia.is_view, s1.is_view
[9]:
(False, True)

If we don’t want a view, we can do a copy of the slice or get a NumPy array using the data attribute:

[10]:
%%time

s1 = precip_ia[2:300, 40:310, 500:1000].copy()
type(s1)
CPU times: user 625 ms, sys: 188 ms, total: 813 ms
Wall time: 119 ms
[10]:
iarray.iarray_container.IArray
[11]:
%%time

s1 = precip_ia[2:300, 40:310, 500:1000].data
type(s1)
CPU times: user 279 ms, sys: 226 ms, total: 505 ms
Wall time: 94.9 ms
[11]:
numpy.ndarray

So, retrieving the interesting data out of your IArray closely follows the NumPy convention.

At any rate, whenever you want to use NumPy’s advanced indexing features, you can always get a NumPy array out of an IArray (or a view of it) and apply the desired indexing there. Remember that ironArray is meant for handling very large arrays, so there is no shame in extracting the interesting slice as a NumPy object and then working on it.
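For instance, once a slice has been pulled out as a NumPy array, boolean and fancy indexing work as usual. Here is a sketch with a synthetic array standing in for data extracted via the data attribute (the shapes and threshold are illustrative only):

```python
import numpy as np

# Synthetic stand-in for a 2-d slice obtained with something like
# precip_ia[0, 100:110, 200:220].data
rng = np.random.default_rng(0)
precip_np = rng.random((10, 20)).astype(np.float32)

# Boolean indexing: keep only values above a threshold
heavy = precip_np[precip_np > 0.9]
assert (heavy > 0.9).all()

# Fancy (integer-array) indexing: pick specific rows
rows = precip_np[[0, 3, 7]]
assert rows.shape == (3, 20)
```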

Finally, indexing also applies to arrays stored persistently on disk. ironArray uses the information about the data you want to read and decompress only the necessary parts. For example:

[12]:
precip_ia2 = ia.open("precip1.iarr")
sl1 = precip_ia2[:400, -500]
sl2 = precip_ia2[5]
sl1, sl2
[12]:
(<IArray (400, 1440) np.float32>, <IArray (721, 1440) np.float32>)

Due to the double partitioning and fast compression codecs, the slicing on disk is in general very efficient (see the benchmarks section).

Optimization Tips#

In this section we are going to fine-tune some of the ironArray parameters to obtain better performance. The first things we can adjust are the chunk shape and the block shape of the array.

Let’s suppose that we are always going to slice the array along the same dimension. For example, here we slice along the second dimension (the latitude axis).

[13]:
%%time
for i in range(precip_ia.shape[1]):
    _ = precip_ia[:, i, :].data
CPU times: user 1min 29s, sys: 12 s, total: 1min 41s
Wall time: 7.67 s

What happens if we optimize the chunks and the blocks for that data access pattern?

[14]:
chunks = (precip_ia.shape[0], 1, precip_ia.shape[2])
blocks = (128, 1, 128)

precip_ia_op = precip_ia.copy(chunks=chunks, blocks=blocks)
[15]:
%%time
for i in range(precip_ia.shape[1]):
    _ = precip_ia_op[:, i, :].data
CPU times: user 4.4 s, sys: 944 ms, total: 5.34 s
Wall time: 1.64 s

This is more than 4x faster than with the unoptimized settings. As demonstrated, if you are mostly going to access the array along a specific dimension, it is very important to optimize the chunks and the blocks of the array accordingly.
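The speedup can be understood by counting how many chunks a given slice intersects: every touched chunk must be read and decompressed in full. The helper below is a simplified cost model (not ironArray’s actual I/O accounting), applied to the slice precip_ia[:, i, :] with the two chunk shapes used above:

```python
def chunks_touched(chunks, starts, stops):
    """Count how many chunks the hyperslab [start:stop, ...] intersects."""
    n = 1
    for chunk_len, start, stop in zip(chunks, starts, stops):
        first = start // chunk_len       # first chunk index touched
        last = (stop - 1) // chunk_len   # last chunk index touched
        n *= last - first + 1
    return n

# The slice precip_ia[:, i, :] for a single index i (here i = 0)
starts, stops = (0, 0, 0), (720, 1, 1440)

# Default chunk shape from the tutorial vs. the access-optimized one
default = chunks_touched((128, 128, 256), starts, stops)
optimized = chunks_touched((720, 1, 1440), starts, stops)
print(default, optimized)  # → 36 1
```

With the default chunks, each slice drags in 36 chunks’ worth of decompression even though only a thin sheet of each chunk is needed; with chunks aligned to the access pattern, exactly one chunk is read per slice.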