Tutorial: Using xpersist to Cache Data#

xpersist is a Python package for caching data. It is designed to be used alongside analysis packages such as xarray and pandas, which provide convenient interfaces for saving data to disk and loading it back.

This notebook walks through an example of how xpersist is currently used. It is not intended to be a complete tutorial. For more information, see the rest of the xpersist documentation.

Let’s import the packages we will use.

import xpersist
import xarray as xr
import tempfile

Set cache location#

To use xpersist, we must set the location of the cache. This is done by instantiating an xpersist.cache.CacheStore object. The cache store points to a POSIX directory or a cloud storage bucket where all cached data will be stored. In this example, we will use a local directory.

store = xpersist.CacheStore(f'{tempfile.gettempdir()}/my-cache')
store

CacheStore(path='/tmp/my-cache', readonly=False, on_duplicate_key=<DuplicateKeyEnum.skip: 'skip'>, storage_options={})
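The cache path above is assembled with an f-string. As a minimal sketch, the same location can be built with pathlib, which avoids hard-coding the path separator. The name cache_dir is illustrative. Whether CacheStore accepts a Path object directly is not shown in this tutorial, so the sketch converts the path back to a string before it would be passed on.

```python
import tempfile
from pathlib import Path

# Build <system temp dir>/my-cache without hard-coding the separator.
cache_dir = Path(tempfile.gettempdir()) / 'my-cache'

# The tutorial passes a string to CacheStore, so convert explicitly.
cache_path = str(cache_dir)
print(cache_path)
```

The resulting string could then be used in place of the f-string, as in xpersist.CacheStore(cache_path).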

Put data in the Cache#

Now we can use the store object to cache some data. The data can be any Python object.

ds = xr.tutorial.open_dataset('rasm').isel(time=slice(0, 2))
value = {'bar': 'baz'}

Get the full list of available serializers#

By default, xpersist uses some heuristics based on an object’s type to determine the right serializer to use. Instead of relying on xpersist’s heuristics, we can specify the name of the serializer in the serializer argument. To get the list of available serializers, we can use xpersist.registry.serializers.get_all(). This will return a dictionary of serializer names and their associated xpersist.serializers.Serializer instances.

serializers = xpersist.registry.serializers.get_all().keys()
list(serializers)

['xarray.zarr',
 'xarray.netcdf',
 'xcollection',
 'joblib',
 'pandas.csv',
 'pandas.parquet']
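To make the idea of type-based serializer selection concrete, here is a minimal sketch using only the standard library. It is not xpersist's implementation: the SERIALIZERS dictionary, the pick_serializer function, and the 'json' and 'pickle' names are all hypothetical and were chosen only to illustrate how an explicit name can override a heuristic.

```python
import json
import pickle

# Hypothetical registry: serializer name -> (dump, load) functions.
# These names are illustrative and are not xpersist's serializers.
SERIALIZERS = {
    'json': (
        lambda obj: json.dumps(obj).encode('utf-8'),
        lambda raw: json.loads(raw.decode('utf-8')),
    ),
    'pickle': (pickle.dumps, pickle.loads),
}


def pick_serializer(obj, name=None):
    """Return the explicit serializer name if given, else guess from the type."""
    if name is not None:
        if name not in SERIALIZERS:
            raise ValueError(f'Unknown serializer: {name!r}')
        return name
    # Heuristic: common built-in types go to 'json', everything else to 'pickle'.
    if obj is None or isinstance(obj, (dict, list, str, int, float, bool)):
        return 'json'
    return 'pickle'


value = {'bar': 'baz'}
name = pick_serializer(value)   # chosen by the heuristic
dump, load = SERIALIZERS[name]
restored = load(dump(value))    # round trip through the chosen serializer
```

A heuristic this simple can guess wrong (for example, a dictionary holding values that JSON cannot encode), which is one reason an explicit override is useful.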

Once we know the name of the serializer we want to use, we can specify it in the serializer argument.

_ = store.put('foo', value)
_ = store.put('my-dataset.zarr', ds, serializer='xarray.zarr', dump_kwargs={'mode': 'w'})

Get data from the Cache#

To find the list of keys in the cache, use the xpersist.cache.CacheStore.keys() method.

store.keys()

['my-dataset.zarr', 'Tuesday-sample_flow-foo_task.zarr', 'bar.zarr', 'foo']

To retrieve the data from the cache, use the xpersist.cache.CacheStore.get() method. The get method returns a deserialized object. Let’s retrieve our dataset and the dictionary values we previously cached.

value_from_cache = store.get('foo')
print(value_from_cache)

{'bar': 'baz'}

ds_from_cache = store.get('my-dataset.zarr')
print(ds_from_cache)

<xarray.Dataset>
Dimensions:  (time: 2, y: 205, x: 275)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 1980-10-17 00:00:00
    xc       (y, x) float64 dask.array<chunksize=(103, 275), meta=np.ndarray>
    yc       (y, x) float64 dask.array<chunksize=(103, 275), meta=np.ndarray>
Dimensions without coordinates: y, x
Data variables:
    Tair     (time, y, x) float64 dask.array<chunksize=(1, 103, 275), meta=np.ndarray>
Attributes:
    NCO:                       netCDF Operators version 4.7.9 (Homepage = htt...
    comment:                   Output from the Variable Infiltration Capacity...
    convention:                CF-1.4
    history:                   Fri Aug  7 17:57:38 2020: ncatted -a bounds,,d...
    institution:               U.W.
    nco_openmp_thread_number:  1
    output_frequency:          daily
    output_mode:               averaged
    references:                Based on the initial model of Liang et al., 19...
    source:                    RACM R1002RBRxaaa01a
    title:                     /workspace/jhamman/processed/R1002RBRxaaa01a/l...

To confirm that the data is the same, we can use an assert statement for the dictionary and the xarray.testing.assert_equal() function for the dataset:

assert value == value_from_cache
xr.testing.assert_equal(ds, ds_from_cache)
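The put, keys, and get round trip shown so far can be mimicked with a small file-backed store. This is a sketch under stated assumptions, not how xpersist works internally: ToyStore is a hypothetical class that writes one pickle file per key.

```python
import pickle
import tempfile
from pathlib import Path


class ToyStore:
    """Hypothetical stand-in for a cache store; not xpersist's implementation."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)

    def put(self, key, value):
        # Serialize the value into one file named after the key.
        (self.path / key).write_bytes(pickle.dumps(value))

    def get(self, key):
        # Read the file back and return the deserialized object.
        return pickle.loads((self.path / key).read_bytes())

    def keys(self):
        # Every file in the directory corresponds to one cached key.
        return sorted(p.name for p in self.path.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    toy = ToyStore(tmp)
    toy.put('foo', {'bar': 'baz'})
    stored_keys = toy.keys()
    value_back = toy.get('foo')

# The object read back compares equal to the one that was stored.
assert value_back == {'bar': 'baz'}
```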

Inspect the cache#

There are a few other methods that can be used to inspect the cache. For example, the xpersist.cache.CacheStore.get_artifact() method returns an xpersist.cache.Artifact object. An artifact object is a Python object that contains metadata about the data stored in the cache.

artifact = store.get_artifact('my-dataset.zarr')
artifact

Artifact(key='my-dataset.zarr', serializer='xarray.zarr', load_kwargs={}, dump_kwargs={'mode': 'w'}, additional_metadata={}, created_at=datetime.datetime(2022, 9, 6, 3, 25, 32, 166607))
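The fields in the repr above suggest a simple metadata record. The sketch below models such a record with a dataclass and serializes it to JSON. ToyArtifact and its to_json method are hypothetical. The field names are copied from the repr, but the on-disk format xpersist actually uses is an assumption here.

```python
import dataclasses
import datetime
import json


@dataclasses.dataclass
class ToyArtifact:
    """Hypothetical record with the fields shown in the Artifact repr."""

    key: str
    serializer: str
    load_kwargs: dict = dataclasses.field(default_factory=dict)
    dump_kwargs: dict = dataclasses.field(default_factory=dict)
    additional_metadata: dict = dataclasses.field(default_factory=dict)
    created_at: datetime.datetime = dataclasses.field(
        default_factory=datetime.datetime.now
    )

    def to_json(self):
        record = dataclasses.asdict(self)
        # datetime objects are not JSON serializable, so store an ISO 8601 string.
        record['created_at'] = self.created_at.isoformat()
        return json.dumps(record, indent=2)


toy_artifact = ToyArtifact(
    key='my-dataset.zarr',
    serializer='xarray.zarr',
    dump_kwargs={'mode': 'w'},
)
payload = json.loads(toy_artifact.to_json())
```

Keeping the serializer name and the dump and load keyword arguments next to the key is what would let a store reload an object without the caller repeating those details.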

Delete data from the cache#

To delete data from the cache, use the xpersist.cache.CacheStore.delete() method and pass the key of the data to delete.

store.delete('my-dataset.zarr')

DRY RUN: would delete items with the following paths:

* /tmp/my-cache/xpersist_metadata_store/my-dataset.zarr.artifact.json
* /tmp/my-cache/my-dataset.zarr

To delete these items, call `delete(key, dry_run=False)`

By default, the delete method runs in dry-run mode: it reports which paths would be removed but leaves the cache untouched. To actually delete the data, pass dry_run=False.

store.delete('my-dataset.zarr', dry_run=False)

To confirm that the data was deleted, we can check the available keys in the cache:

store.keys()

['Tuesday-sample_flow-foo_task.zarr', 'bar.zarr', 'foo']

Trying to delete a key that does not exist in the cache raises a KeyError.

store.delete('my-dataset.zarr', dry_run=False)

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 store.delete('my-dataset.zarr', dry_run=False)

File ~/checkouts/readthedocs.org/user_builds/xpersist/checkouts/latest/xpersist/cache.py:114, in CacheStore.delete(self, key, dry_run)
    103 """Deletes the key and corresponding artifact from the cache store.
    104 
    105 Parameters
   (...)
    110     If True, the key is not deleted from the cache store. This is useful for debugging.
    111 """
    113 if key not in self:
--> 114     raise KeyError(f'Key `{key}` not found in cache store.')
    115 paths = [self._artifact_meta_full_path(key), self._construct_item_path(key)]
    116 if not dry_run:

KeyError: 'Key `my-dataset.zarr` not found in cache store.'
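The delete behaviour in this section (dry run by default, KeyError for a missing key) can be sketched with a small dictionary-backed class. ToyStore is hypothetical and is not xpersist's code. It only mirrors the order of checks visible in the traceback above, where the key lookup happens before the dry_run flag is consulted.

```python
class ToyStore:
    """Hypothetical dict-backed store; not xpersist's implementation."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def keys(self):
        return list(self._items)

    def delete(self, key, dry_run=True):
        # Missing keys fail loudly, in dry-run mode too.
        if key not in self._items:
            raise KeyError(f'Key `{key}` not found in cache store.')
        if dry_run:
            print(f'DRY RUN: would delete {key!r}')
            return
        del self._items[key]


toy = ToyStore()
toy.put('foo', {'bar': 'baz'})

toy.delete('foo')                 # dry run: nothing is removed
keys_after_dry_run = toy.keys()

toy.delete('foo', dry_run=False)  # actually removes the key
keys_after_delete = toy.keys()

try:
    toy.delete('foo', dry_run=False)  # the key is gone, so this raises
except KeyError as err:
    missing_key_error = err
```

Making the destructive path opt-in means that a mistyped key or an accidental call reports what would happen instead of removing data.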