Expression Evaluation (On-Disk)

ironArray has transparent support for the evaluation of expressions whose operands are disk-based. The main advantage of this is that you can perform operations with data that exceeds your available memory (even in compressed state).

On the other hand, disks are much slower than memory (although SSDs have narrowed the gap considerably in recent years), so you should expect evaluation to slow down noticeably. Thanks to on-the-fly compression, however, the slowdown may be smaller than you would guess.
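To make the compression argument concrete, here is a minimal sketch using only the Python standard library. It is not how ironArray compresses data: `zlib` stands in for whatever codec is actually used, the sample values are made up, and the 500 MB/s disk speed is an assumed figure for illustration. Decompression time is ignored. The point is only that fewer bytes have to travel from disk when the data compresses well.

```python
import zlib
from array import array

# Hypothetical data: 1,000,000 float32 values, mostly zeros with a small
# repetitive tail. Real datasets will compress differently.
values = [0.0] * 900_000 + [0.5 * (i % 16) for i in range(100_000)]
raw = array("f", values).tobytes()

# zlib is a stand-in codec (assumption), used at its fastest level.
compressed = zlib.compress(raw, 1)
ratio = len(raw) / len(compressed)

# Assumed sustained read speed of the disk, for illustration only.
disk_bytes_per_s = 500e6
raw_read_s = len(raw) / disk_bytes_per_s
compressed_read_s = len(compressed) / disk_bytes_per_s

print(f"raw size:        {len(raw)} bytes")
print(f"compressed size: {len(compressed)} bytes ({ratio:.1f}x smaller)")
print(f"estimated read time: {raw_read_s * 1e3:.2f} ms raw, "
      f"{compressed_read_s * 1e3:.2f} ms compressed")
```

How much is gained in practice depends on how compressible the data is and on how fast the codec decompresses relative to the disk.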

In this tutorial we are going to exercise on-disk expression evaluation and compare it with in-memory evaluation.

%load_ext memprofiler
%matplotlib inline
import matplotlib.pyplot as plt
import iarray as ia
import os

Let’s start with some hints about the speed you can expect from ironArray with on-disk data. To show this, we are going to use our original on-disk array and create an on-disk array to hold the result of our operations. As in previous tutorials, we are going to evict the files from the page cache to better assess out-of-core evaluation:

precip1 = ia.open("precip1.iarr")
precip2 = ia.open("precip2.iarr")
precip3 = ia.open("precip3.iarr")
memprofiler: used 0.49 MiB RAM (peak of 0.49 MiB) in 0.0016 s, total RAM usage 240.82 MiB

In this case, we are just getting views of the larger array that is on disk. Remember that views do not create new containers, which is why the above operation is fast and consumes little memory. Now, let’s build the expression for the mean values:

!vmtouch -e "precip1.iarr" "precip2.iarr" "precip3.iarr"
           Files: 3
     Directories: 0
   Evicted Pages: 180234 (704M)
         Elapsed: 0.024078 seconds
precip_mean = (precip1 + precip2 + precip3) / 3
memprofiler: used 0.00 MiB RAM (peak of 0.00 MiB) in 0.0010 s, total RAM usage 240.98 MiB

As usual, this is a very fast operation. Now let’s evaluate the expression and make sure that the result is created on disk:

%%mprof_run mean
precip_mean_disk = precip_mean.eval(urlpath="mean-3m.iarr", mode="w")
<iarray.lazy_expr.LazyExpr at 0x7f1d9db69580>
memprofiler: used 66.57 MiB RAM (peak of 98.39 MiB) in 2.4155 s, total RAM usage 307.74 MiB
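The split between building `precip_mean` (nearly instantaneous) and calling `.eval()` (where the time is spent) is the usual lazy-expression pattern. The toy sketch below illustrates that pattern with plain Python lists. The `Lazy` class and `leaf` helper are hypothetical names invented for this illustration; they are not ironArray's `LazyExpr` and say nothing about its internals.

```python
class Lazy:
    """Toy lazy expression node (hypothetical; not iarray's LazyExpr)."""

    def __init__(self, op, args):
        self.op = op
        self.args = args

    def __add__(self, other):
        # Only records the operation; no arithmetic happens here.
        return Lazy("add", (self, other))

    def __truediv__(self, scalar):
        # Only records the division by a scalar.
        return Lazy("div", (self, scalar))

    def eval(self):
        # The actual work happens here, walking the recorded tree.
        if self.op == "leaf":
            return self.args[0]
        if self.op == "add":
            left, right = (a.eval() for a in self.args)
            return [x + y for x, y in zip(left, right)]
        if self.op == "div":
            operand, scalar = self.args
            return [x / scalar for x in operand.eval()]
        raise ValueError(f"unknown operation: {self.op}")


def leaf(values):
    """Wrap concrete values as a leaf node of the expression tree."""
    return Lazy("leaf", (values,))


a, b, c = leaf([1, 2, 3]), leaf([4, 5, 6]), leaf([7, 8, 9])
expr = (a + b + c) / 3   # cheap: builds a small tree, computes nothing
result = expr.eval()     # the element-wise mean is computed only now
print(result)
```

Deferring the work this way lets an evaluator see the whole expression before touching any data, which is what makes it possible to decide where the result should live (in memory or on disk) at evaluation time.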

We see that evaluation from disk takes considerably more time than operating in memory, which is to be expected. What interests us more here is that the RAM needed to perform the evaluation peaks at around 100 MiB, whereas the output array is considerably larger than this:

%ls -lh mean-3m.iarr
-rw-rw-r-- 1 marta marta 694M nov  7 18:30 mean-3m.iarr

This is well above the memory consumed. Here is a more graphical view of memory consumption:

%mprof_plot mean -t "Mean computation (on-disk)"
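A memory footprint well below the size of the output is what you would expect from an evaluation that proceeds block by block. As a rough illustration of that general technique (not of ironArray's actual engine, which also handles compression and multidimensional chunks), the sketch below computes the mean of three on-disk files using only the standard library. The file names, the raw float64 layout, and the `BLOCK_ITEMS` size are all assumptions made for the example.

```python
import os
import tempfile
from array import array

BLOCK_ITEMS = 4096  # values per block (assumed); bounds memory used per step


def write_values(path, values):
    """Write an iterable of floats to `path` as raw float64 values."""
    with open(path, "wb") as f:
        array("d", values).tofile(f)


def mean_of_three(paths, out_path, block_items=BLOCK_ITEMS):
    """Compute (a + b + c) / 3 block by block, writing the result to disk."""
    itemsize = array("d").itemsize
    n_items = os.path.getsize(paths[0]) // itemsize
    with open(paths[0], "rb") as fa, open(paths[1], "rb") as fb, \
            open(paths[2], "rb") as fc, open(out_path, "wb") as fout:
        remaining = n_items
        while remaining > 0:
            count = min(block_items, remaining)
            blocks = []
            for f in (fa, fb, fc):
                block = array("d")
                block.fromfile(f, count)  # read only `count` values
                blocks.append(block)
            result = array("d", ((x + y + z) / 3 for x, y, z in zip(*blocks)))
            result.tofile(fout)  # flush this block before reading the next
            remaining -= count
    return n_items


# Create three small hypothetical input files in a temporary directory.
tmpdir = tempfile.mkdtemp()
inputs = [os.path.join(tmpdir, name) for name in ("a.bin", "b.bin", "c.bin")]
n = 10_000
for k, path in enumerate(inputs):
    write_values(path, (float(i + k) for i in range(n)))

out = os.path.join(tmpdir, "mean.bin")
n_written = mean_of_three(inputs, out)

# Read the result back to check it.
with open(out, "rb") as f:
    mean = array("d")
    mean.fromfile(f, n_written)
print(n_written, mean[0], mean[-1])
```

At any moment only four blocks of `BLOCK_ITEMS` values are held in memory, regardless of how large the input and output files are.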