Adding a new Backend¶
Q: How do I add new nodes to the odo graph?
Extend Functions¶
We extend Odo by implementing a few functions for each new type:
- discover: Return the DataShape of an object
- convert: Convert data to a new type
- append: Append data onto an existing data source
- resource: Identify data by a string URI
We extend each of these by writing small new functions that we decorate with types. Odo then picks these up, integrates them into the network, and uses them when appropriate.
Discover¶
Discover returns the DataShape of an object. Datashape is a potentially nested combination of shape and datatype. It helps us to migrate metadata consistently as we migrate the data itself. This enables us to emerge with the right dtypes even if we have to transform through potentially lossy formats.
Example¶
>>> discover([1, 2, 3])
dshape("3 * int32")
>>> import numpy as np
>>> x = np.empty(shape=(3, 5), dtype=[('name', 'O'), ('balance', 'f8')])
>>> discover(x)
dshape("3 * 5 * {name: string, balance: float64}")
Extend¶
We import discover from the datashape library and extend it with a type.
import pandas as pd
from datashape import discover, from_numpy

# Register through .register, as with the other extension functions
@discover.register(pd.DataFrame)
def discover_dataframe(df, **kwargs):
    shape = (len(df),)
    dtype = df.values.dtype
    return from_numpy(shape, dtype)
In this simple example we rely on convenience functions within datashape to form a datashape from a numpy shape and dtype. For more complex situations (e.g. databases) it may be necessary to construct datashapes manually.
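As a rough illustration of constructing a datashape by hand, the sketch below assembles a record datashape string in the same syntax as the discover output shown earlier. The helper name record_dshape_string and the column metadata are hypothetical and not part of datashape; the resulting string is the kind of text that datashape's dshape function parses.

```python
def record_dshape_string(columns, nrows=None):
    """Build a datashape string such as '3 * {name: string, balance: float64}'.

    columns: iterable of (field name, datashape type name) pairs.
    nrows:   number of rows, or None for a variable-length dimension ('var').
    """
    fields = ", ".join(f"{name}: {typ}" for name, typ in columns)
    dim = "var" if nrows is None else str(nrows)
    return f"{dim} * {{{fields}}}"


# Hypothetical column metadata, e.g. as read from a database schema
columns = [("name", "string"), ("balance", "float64")]

print(record_dshape_string(columns, nrows=3))
print(record_dshape_string(columns))  # row count not known up front
```

Using var for the leading dimension is a common choice for sources such as database tables, where the row count is not known without a query.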
Convert¶
Convert copies your data into a new object with a different type.
Example¶
>>> x = np.arange(5)
>>> x
array([0, 1, 2, 3, 4])
>>> convert(list, x)
[0, 1, 2, 3, 4]
>>> import pandas as pd
>>> convert(pd.Series, x)
0 0
1 1
2 2
3 3
4 4
dtype: int64
Extend¶
Import convert from odo and register it with two types, one for the target and one for the source.
import numpy as np
import pandas as pd
from odo import convert

@convert.register(list, np.ndarray)
def array_to_list(x, **kwargs):
    return x.tolist()

@convert.register(pd.Series, np.ndarray)
def array_to_series(x, **kwargs):
    return pd.Series(x)
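Each registered conversion acts as an edge in a graph of types, so a conversion between two types with no direct function can still succeed by chaining edges. The sketch below illustrates that idea with a plain breadth-first search from the standard library standing in for networkx. The names edges, register, find_path, and toy_convert are hypothetical and this is not odo's implementation.

```python
from collections import deque

# Hypothetical registry: edges[source_type][target_type] = conversion function
edges = {}


def register(target, source):
    """Decorator that records func as the edge source -> target."""
    def decorator(func):
        edges.setdefault(source, {})[target] = func
        return func
    return decorator


@register(list, tuple)
def tuple_to_list(x):
    return list(x)


@register(set, list)
def list_to_set(x):
    return set(x)


@register(frozenset, set)
def set_to_frozenset(x):
    return frozenset(x)


def find_path(source, target):
    """Breadth-first search for the shortest chain of conversions."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] is target:
            return path
        for nxt in edges.get(path[-1], {}):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    raise NotImplementedError(
        f"no path from {source.__name__} to {target.__name__}")


def toy_convert(target, obj):
    """Convert obj to target by applying each edge along the path in turn."""
    path = find_path(type(obj), target)
    for src, dst in zip(path, path[1:]):
        obj = edges[src][dst](obj)
    return obj


# No tuple -> frozenset function exists, but a three-step path does
print(toy_convert(frozenset, (1, 2, 2, 3)))
```

The practical consequence for a new backend is that one or two conversions to well-connected types are usually enough to reach the rest of the graph.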
Append¶
Append copies your data into an existing dataset.
Example¶
>>> x = np.arange(5)
>>> x
array([0, 1, 2, 3, 4])
>>> L = [10, 20, 30]
>>> _ = append(L, x)
>>> L
[10, 20, 30, 0, 1, 2, 3, 4]
Extend¶
Import append from odo and register it with two types, one for the target and one for the source. Usually we teach odo how to append from one preferred type and then use convert for all others.
from odo import append, convert

@append.register(list, list)
def append_list_to_list(tgt, src, **kwargs):
    tgt.extend(src)
    return tgt

@append.register(list, object)  # anything else
def append_anything_to_list(tgt, src, **kwargs):
    # Convert to the preferred type, then re-dispatch to the (list, list) case
    source_as_list = convert(list, src, **kwargs)
    return append(tgt, source_as_list, **kwargs)
Resource¶
Resource creates objects from string URIs matched against regular expressions.
Example¶
>>> resource('myfile.hdf5')
<HDF5 file "myfile.hdf5" (mode r+)>
>>> resource('myfile.hdf5::/data', dshape='10 * 10 * int32')
<HDF5 dataset "data": shape (10, 10), type "<i4">
The objects it returns are h5py.File and h5py.Dataset respectively. In the second case resource found that the dataset did not exist so it created it.
Extend¶
We import resource from odo and register it with regular expressions.
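from odo import resource
import h5py

# A raw string keeps the regex backslash literal; a distinct function name
# avoids rebinding the imported resource object
@resource.register(r'.*\.hdf5')
def resource_hdf5(uri, **kwargs):
    return h5py.File(uri)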
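To make the matching step concrete, here is a minimal sketch of a regex-keyed registry built with the standard library. The names register_resource and toy_resource are hypothetical, the handlers return placeholder tuples rather than real file objects, and the first-match-wins rule with a full match is an assumption of this sketch, not a description of how odo resolves competing patterns.

```python
import re

# Hypothetical registry of (compiled pattern, factory function) pairs
_resource_handlers = []


def register_resource(pattern):
    """Decorator that associates a URI regular expression with a factory."""
    def decorator(func):
        _resource_handlers.append((re.compile(pattern), func))
        return func
    return decorator


@register_resource(r'.*\.csv')
def open_csv(uri, **kwargs):
    return ('csv', uri)  # placeholder for a real CSV object


@register_resource(r'.*\.json')
def open_json(uri, **kwargs):
    return ('json', uri)  # placeholder for a real JSON object


def toy_resource(uri, **kwargs):
    """Call the first registered factory whose pattern matches the whole URI."""
    for pattern, func in _resource_handlers:
        if pattern.fullmatch(uri):
            return func(uri, **kwargs)
    raise NotImplementedError(f"no resource handler matches {uri!r}")


print(toy_resource('accounts.csv'))
print(toy_resource('events.json'))
```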
General Notes¶
We pass all keyword arguments from the top-level call to odo to all functions. This allows special keyword arguments to trickle down to the right place, e.g. delimiter=';' makes it to the pd.read_csv call when interacting with CSV files, but also means that all functions that you write must expect and handle unwanted keyword arguments. This often requires some filtering on your part.
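One possible way to do that filtering is to inspect the signature of the function you are about to call and forward only the keywords it accepts. The sketch below assumes this approach; keep_supported_kwargs, read_delimited, and text_to_rows are hypothetical names, and read_delimited is only a stand-in for a reader such as pd.read_csv.

```python
import inspect


def keep_supported_kwargs(func, kwargs):
    """Return only the entries of kwargs that func accepts by name."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)  # func takes **kwargs itself, so pass everything
    return {k: v for k, v in kwargs.items() if k in params}


def read_delimited(text, delimiter=','):
    """Stand-in for a real reader; splits each line on delimiter."""
    return [line.split(delimiter) for line in text.splitlines()]


def text_to_rows(text, **kwargs):
    """Backend function: tolerate unrelated keywords, forward relevant ones."""
    supported = keep_supported_kwargs(read_delimited, kwargs)
    return read_delimited(text, **supported)


# chunksize is meant for some other function and is silently dropped here
print(text_to_rows("a;b\n1;2", delimiter=';', chunksize=1024))
```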
Even though all four of our abstract functions have a .register method they operate in very different ways. Convert is managed by networkx and path finding, append and discover are managed by multipledispatch, and resource is managed by regular expressions.
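To show what dispatch on several argument types means in practice, here is a small standard-library sketch that selects the most specific registered signature for a call. ToyDispatcher and toy_append are hypothetical and much simpler than multipledispatch; the sketch assumes ordinary classes whose registered types appear in the argument's method resolution order.

```python
class ToyDispatcher:
    """Hypothetical two-argument type dispatcher (not multipledispatch itself)."""

    def __init__(self):
        self.funcs = {}  # maps a tuple of types to the registered function

    def register(self, *types):
        def decorator(func):
            self.funcs[types] = func
            return func
        return decorator

    def __call__(self, *args, **kwargs):
        arg_types = tuple(type(a) for a in args)
        candidates = [
            sig for sig in self.funcs
            if len(sig) == len(arg_types)
            and all(issubclass(a, s) for a, s in zip(arg_types, sig))
        ]
        if not candidates:
            names = ", ".join(t.__name__ for t in arg_types)
            raise NotImplementedError(f"no implementation for ({names})")

        def distance(sig):
            # Position of each registered type in the argument's MRO;
            # 0 means an exact match, larger means a more general type
            return [a.__mro__.index(s) for a, s in zip(arg_types, sig)]

        best = min(candidates, key=distance)
        return self.funcs[best](*args, **kwargs)


toy_append = ToyDispatcher()


@toy_append.register(list, list)
def append_list_to_list(tgt, src, **kwargs):
    tgt.extend(src)
    return tgt


@toy_append.register(list, object)  # anything else
def append_anything_to_list(tgt, src, **kwargs):
    return toy_append(tgt, list(src), **kwargs)


# A tuple source only matches (list, object), which then re-dispatches
print(toy_append([10, 20], (1, 2)))
```

The (list, object) registration matches every source, yet a list source still reaches append_list_to_list because the exact match is ranked as more specific.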
Examples are useful. You may want to look at some of the odo source for simple backends for help.