Tutorial: Using xpersist to Cache Data#
xpersist is a Python package for caching data. It is designed to be used alongside analysis packages such as xarray and pandas, which provide convenient interfaces for saving data to disk and loading it back.
This notebook gives an example of how xpersist is currently used. It is not intended to be a complete tutorial. For more information, see the rest of the xpersist documentation.
Let’s import the packages we will use.
import xpersist
import xarray as xr
import tempfile
Set cache location#
To use xpersist, we must first set the location of the cache. This is done by instantiating the xpersist.cache.CacheStore class. The cache store points to a POSIX directory or a cloud storage bucket where all cached data will be stored. In this example, we will use a local directory.
store = xpersist.CacheStore(f'{tempfile.gettempdir()}/my-cache')
store
CacheStore(path='/tmp/my-cache', readonly=False, on_duplicate_key=<DuplicateKeyEnum.skip: 'skip'>, storage_options={})
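The cache location in the cell above is an ordinary directory path under the system temporary directory. As a minimal sketch, the same path can be built with pathlib instead of an f-string. The helper name cache_path is hypothetical and not part of xpersist; only the directory name my-cache comes from the example.

```python
import tempfile
from pathlib import Path


def cache_path(name: str = 'my-cache') -> Path:
    """Return a directory path under the system temporary directory.

    Hypothetical helper for illustration; not part of xpersist.
    """
    return Path(tempfile.gettempdir()) / name


path = cache_path()
print(path)  # e.g. /tmp/my-cache on many Linux systems
```

The temporary directory differs between platforms, so the printed path may not match the repr shown above.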
Put data in the Cache#
Now we can use the store object to cache some data. The data can be any Python object. Here we prepare an xarray dataset and a plain dictionary.
ds = xr.tutorial.open_dataset('rasm').isel(time=slice(0, 2))
value = {'bar': 'baz'}
Get the full list of available serializers#
By default, xpersist uses heuristics based on an object’s type to choose a serializer. Instead of relying on these heuristics, we can pass the name of a serializer in the serializer argument. To list the available serializers, use xpersist.registry.serializers.get_all(), which returns a dictionary mapping serializer names to their associated xpersist.serializers.Serializer instances.
serializers = xpersist.registry.serializers.get_all().keys()
list(serializers)
['xarray.zarr',
'xarray.netcdf',
'xcollection',
'joblib',
'pandas.csv',
'pandas.parquet']
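The type-based heuristic described above can be pictured as a lookup from object type to serializer name, with an explicit name taking precedence. The sketch below is a toy illustration using only the standard library. The function guess_serializer_name, the stand-in classes, the mapping, and the joblib fallback are all assumptions made for illustration, not xpersist's actual dispatch logic. Only the serializer names are taken from the list above.

```python
# Toy illustration of type-based serializer selection.
# NOTE: this is NOT xpersist's implementation; the classes, the mapping,
# and the fallback below are hypothetical stand-ins.


class FakeDataset:
    """Stand-in for an xarray.Dataset."""


class FakeDataFrame:
    """Stand-in for a pandas.DataFrame."""


# Map object types to serializer names (names taken from the list above).
TYPE_TO_SERIALIZER = {
    FakeDataset: 'xarray.zarr',
    FakeDataFrame: 'pandas.parquet',
}


def guess_serializer_name(value, serializer=None):
    """Return an explicit serializer name, else guess one from the type."""
    if serializer is not None:
        return serializer  # an explicit choice always wins
    for kind, name in TYPE_TO_SERIALIZER.items():
        if isinstance(value, kind):
            return name
    return 'joblib'  # assumed generic fallback for arbitrary Python objects


print(guess_serializer_name(FakeDataset()))
print(guess_serializer_name({'bar': 'baz'}))
print(guess_serializer_name(FakeDataset(), serializer='xarray.netcdf'))
```

Naming the serializer explicitly removes any dependence on the heuristic, which is useful when one object type has several possible formats, such as Zarr and netCDF for a dataset.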
Once we know the name of the serializer we want to use, we can specify it in the serializer argument. If the argument is omitted, as in the first call below, xpersist chooses a serializer itself.
_ = store.put('foo', value)
_ = store.put('my-dataset.zarr', ds, serializer='xarray.zarr', dump_kwargs={'mode': 'w'})
Get data from the Cache#
To list the keys in the cache, use the xpersist.cache.CacheStore.keys() method.
store.keys()
['my-dataset.zarr', 'Tuesday-sample_flow-foo_task.zarr', 'bar.zarr', 'foo']
To retrieve data from the cache, use the xpersist.cache.CacheStore.get() method, which returns the deserialized object. Let’s retrieve the dictionary and the dataset we previously cached.
value_from_cache = store.get('foo')
print(value_from_cache)
{'bar': 'baz'}
ds_from_cache = store.get('my-dataset.zarr')
print(ds_from_cache)
<xarray.Dataset>
Dimensions: (time: 2, y: 205, x: 275)
Coordinates:
* time (time) object 1980-09-16 12:00:00 1980-10-17 00:00:00
xc (y, x) float64 dask.array<chunksize=(103, 275), meta=np.ndarray>
yc (y, x) float64 dask.array<chunksize=(103, 275), meta=np.ndarray>
Dimensions without coordinates: y, x
Data variables:
Tair (time, y, x) float64 dask.array<chunksize=(1, 103, 275), meta=np.ndarray>
Attributes:
NCO: netCDF Operators version 4.7.9 (Homepage = htt...
comment: Output from the Variable Infiltration Capacity...
convention: CF-1.4
history: Fri Aug 7 17:57:38 2020: ncatted -a bounds,,d...
institution: U.W.
nco_openmp_thread_number: 1
output_frequency: daily
output_mode: averaged
references: Based on the initial model of Liang et al., 19...
source: RACM R1002RBRxaaa01a
title: /workspace/jhamman/processed/R1002RBRxaaa01a/l...
To confirm that the data is the same, we can use the assert statement and the xarray.testing.assert_equal() function:
assert value == value_from_cache
xr.testing.assert_equal(ds, ds_from_cache)
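Conceptually, putting a value into a cache and getting it back is a serialize-then-deserialize round trip. The sketch below mimics that with the standard library's pickle module and a temporary directory. The TinyStore class is a hypothetical stand-in written for this illustration. It does not reflect how xpersist stores data internally.

```python
import pickle
import tempfile
from pathlib import Path


class TinyStore:
    """Hypothetical, minimal file-backed cache (not xpersist's CacheStore)."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)

    def put(self, key, value):
        # Serialize the value to a file named after the key.
        with open(self.path / key, 'wb') as f:
            pickle.dump(value, f)

    def get(self, key):
        # Deserialize and return the stored object.
        with open(self.path / key, 'rb') as f:
            return pickle.load(f)

    def keys(self):
        # Each file in the cache directory corresponds to one key.
        return sorted(p.name for p in self.path.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    tiny = TinyStore(tmp)
    tiny.put('foo', {'bar': 'baz'})
    restored = tiny.get('foo')
    stored_keys = tiny.keys()

# The round trip yields an object equal to the one that was stored.
assert restored == {'bar': 'baz'}
print(stored_keys)
```

Plain equality is enough for a dictionary. For xarray objects, a dedicated comparison such as xarray.testing.assert_equal() is the usual choice, as in the cell above.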
Inspect the cache#
There are a few other methods for inspecting the cache. For example, the xpersist.cache.CacheStore.get_artifact() method returns an xpersist.cache.Artifact object, which holds metadata about an item stored in the cache.
artifact = store.get_artifact('my-dataset.zarr')
artifact
Artifact(key='my-dataset.zarr', serializer='xarray.zarr', load_kwargs={}, dump_kwargs={'mode': 'w'}, additional_metadata={}, created_at=datetime.datetime(2022, 9, 6, 3, 25, 32, 166607))
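The fields in the repr above suggest the kind of record kept alongside each cached item. As a sketch only, such a record can be modelled as a dataclass and serialized to JSON. ArtifactRecord is a hypothetical name, and the JSON layout is an assumption rather than xpersist's on-disk format. The field names mirror the repr.

```python
import dataclasses
import datetime
import json


@dataclasses.dataclass
class ArtifactRecord:
    """Hypothetical metadata record; field names mirror the repr above."""

    key: str
    serializer: str
    load_kwargs: dict = dataclasses.field(default_factory=dict)
    dump_kwargs: dict = dataclasses.field(default_factory=dict)
    additional_metadata: dict = dataclasses.field(default_factory=dict)
    created_at: datetime.datetime = dataclasses.field(
        default_factory=datetime.datetime.now
    )


record = ArtifactRecord(
    key='my-dataset.zarr',
    serializer='xarray.zarr',
    dump_kwargs={'mode': 'w'},
)

# datetime objects are not JSON-serializable, so convert them to ISO strings.
payload = json.dumps(dataclasses.asdict(record), default=lambda o: o.isoformat())
print(payload)
```

Keeping the serializer name and the dump and load keyword arguments with each key is what lets a later get call pick the matching deserializer without being told which one to use.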
Delete data from the cache#
To delete data from the cache, use the xpersist.cache.CacheStore.delete() method and pass the key of the data to delete.
store.delete('my-dataset.zarr')
DRY RUN: would delete items with the following paths:
* /tmp/my-cache/xpersist_metadata_store/my-dataset.zarr.artifact.json
* /tmp/my-cache/my-dataset.zarr
To delete these items, call `delete(key, dry_run=False)`
By default, the delete method runs in dry-run mode: it reports what would be deleted but does not remove anything from the cache. To actually delete the data, pass the dry_run=False argument.
store.delete('my-dataset.zarr', dry_run=False)
To confirm that the data was deleted, we can check the available keys in the cache:
store.keys()
['Tuesday-sample_flow-foo_task.zarr', 'bar.zarr', 'foo']
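Defaulting to a dry run is a common safety pattern for destructive operations: the caller must opt in explicitly before anything is removed. The sketch below shows the pattern with the standard library only. The function delete_paths is a hypothetical helper and not xpersist's implementation. Its message format simply imitates the output shown above.

```python
import tempfile
from pathlib import Path


def delete_paths(paths, dry_run=True):
    """Delete files, or only report them when dry_run is True.

    Hypothetical helper for illustration; not xpersist's implementation.
    """
    if dry_run:
        print('DRY RUN: would delete items with the following paths:')
        for path in paths:
            print(f'* {path}')
        return []
    deleted = []
    for path in paths:
        Path(path).unlink()  # actually remove the file
        deleted.append(path)
    return deleted


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'example.bin'
    target.write_bytes(b'data')

    delete_paths([target])  # default: nothing is removed
    exists_after_dry_run = target.exists()

    delete_paths([target], dry_run=False)  # explicit opt-in removes the file
    exists_after_delete = target.exists()

print(exists_after_dry_run, exists_after_delete)
```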
Trying to delete a key that does not exist in the cache raises a KeyError.
store.delete('my-dataset.zarr', dry_run=False)
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 store.delete('my-dataset.zarr', dry_run=False)
File ~/checkouts/readthedocs.org/user_builds/xpersist/checkouts/latest/xpersist/cache.py:114, in CacheStore.delete(self, key, dry_run)
103 """Deletes the key and corresponding artifact from the cache store.
104
105 Parameters
(...)
110 If True, the key is not deleted from the cache store. This is useful for debugging.
111 """
113 if key not in self:
--> 114 raise KeyError(f'Key `{key}` not found in cache store.')
115 paths = [self._artifact_meta_full_path(key), self._construct_item_path(key)]
116 if not dry_run:
KeyError: 'Key `my-dataset.zarr` not found in cache store.'
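Because deleting a missing key raises a KeyError, code that is unsure whether a key exists can catch the exception instead of failing. The helper below is a hypothetical sketch. It relies only on behaviour visible above: the delete(key, dry_run=...) call and the KeyError. FakeStore is a dict-backed stand-in so that the example runs without xpersist.

```python
def delete_if_present(store, key):
    """Delete `key` from `store`, returning False if it was not there.

    Hypothetical helper; `store` is any object whose `delete` method
    raises KeyError for a missing key, as in the traceback above.
    """
    try:
        store.delete(key, dry_run=False)
    except KeyError:
        return False
    return True


class FakeStore:
    """Dict-backed stand-in used only to make this sketch runnable."""

    def __init__(self):
        self._data = {'foo': {'bar': 'baz'}}

    def delete(self, key, dry_run=True):
        if key not in self._data:
            raise KeyError(f'Key `{key}` not found in cache store.')
        if not dry_run:
            del self._data[key]


fake = FakeStore()
print(delete_if_present(fake, 'foo'))  # first call removes the key
print(delete_if_present(fake, 'foo'))  # second call finds nothing to remove
```

Catching the exception avoids a separate membership check, so the lookup and the deletion cannot disagree if the cache changes between the two steps.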