DataProvider
An important part of any algorithm is the data it’s working over and the data that it produces.
An important part of working with large scales of data is where the data is stored and how it’s accessed.
The smqtk_dataprovider module contains interfaces and plugins for various core data structures, allowing plugin implementations to decide where and how the underlying raw data should be stored and accessed.
This potentially allows algorithms to handle more data than would otherwise be feasible on a single machine.
DataProvider Structures
The following are the core data representation interfaces included in this package.
- Note:
It is required that implementations have a common serialization format so that they may be stored or transported by other structures in a general way without caring what the specific implementation is. For this we require that all implementations be serializable via the pickle module functions.
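As a minimal sketch of what this requirement means in practice, the following round-trips a small object through the standard pickle functions. The InMemoryElement class is a hypothetical stand-in written for this example; it is not part of smqtk_dataprovider and only mimics the shape of a byte container.

```python
import pickle


class InMemoryElement:
    """Hypothetical byte container used only for this illustration."""

    def __init__(self, data: bytes, content_type: str = "application/octet-stream") -> None:
        self._data = data
        self._content_type = content_type

    def get_bytes(self) -> bytes:
        return self._data

    def content_type(self) -> str:
        return self._content_type


original = InMemoryElement(b"hello world", "text/plain")

# Serialize and deserialize with the standard pickle module functions.
payload = pickle.dumps(original)
restored = pickle.loads(payload)

# The restored object is a distinct instance carrying the same state.
print(restored.get_bytes(), restored.content_type())
```

An implementation that holds unpicklable state (open file handles, locks, database connections) would likely need to define __getstate__ and __setstate__ to satisfy the same round trip.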
DataElement
- class smqtk_dataprovider.DataElement[source]
Abstract interface for a byte data container.
The primary “value” of a DataElement is the byte content wrapped. Since this can technically change due to external forces, we cannot guarantee that an element is immutable. Thus DataElement instances are not considered generally hashable. Specific implementations may define a __hash__ method if that implementation reflects a data source that guarantees immutability.
UUIDs should be cast-able to a string and maintain uniqueness after conversion.
- clean_temp() None [source]
Clean any temporary files created by this element. This does nothing if no temporary files have been generated for this element yet.
- abstract content_type() Optional[str] [source]
- Returns
Standard type/subtype string for this data element, or None if the content type is unknown.
- Return type
str or None
- classmethod from_uri(uri: str) DataElement [source]
Construct a new instance based on the given URI.
This function may not be implemented for all DataElement types.
- Parameters
uri (str) – URI string to resolve into an element instance
- Raises
NoUriResolutionError – This element type does not implement URI resolution.
InvalidUriError – This element type could not resolve the provided URI string.
- Returns
New element instance of our type.
- Return type
DataElement
- abstract is_empty() bool [source]
Check if this element contains no bytes.
The intent of this method is to quickly check if there is any data behind this element, ideally without having to read all/any of the underlying data.
- Returns
If this element contains 0 bytes.
- Return type
bool
- md5() str [source]
Get the MD5 checksum of this element’s binary content.
- Returns
MD5 hex checksum of the data content.
- Return type
str
- abstract set_bytes(b: bytes) None [source]
Set bytes to this data element.
Not all implementations may support setting bytes (check writable method return).
This base abstract method should be called by sub-class implementations first. We check for mutability based on writable() method return.
- Parameters
b (bytes) – bytes to set.
- Raises
ReadOnlyError – This data element can only be read from / does not support writing.
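The "call the base method first" pattern described for set_bytes can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: ReadOnlyError, ElementSketch, and MemoryElementSketch are all hypothetical names defined locally for the example.

```python
from abc import ABC, abstractmethod


class ReadOnlyError(Exception):
    """Hypothetical local stand-in for the documented ReadOnlyError."""


class ElementSketch(ABC):
    """Hypothetical base class showing only the writable/set_bytes contract."""

    @abstractmethod
    def writable(self) -> bool:
        """Return True if this instance supports setting bytes."""

    @abstractmethod
    def set_bytes(self, b: bytes) -> None:
        # Abstract, but still carries the shared mutability check that
        # sub-classes invoke through super().set_bytes(b).
        if not self.writable():
            raise ReadOnlyError("This element does not support writing.")


class MemoryElementSketch(ElementSketch):
    def __init__(self, data: bytes = b"", read_only: bool = False) -> None:
        self._data = data
        self._read_only = read_only

    def writable(self) -> bool:
        return not self._read_only

    def set_bytes(self, b: bytes) -> None:
        super().set_bytes(b)  # raises ReadOnlyError before any state changes
        self._data = b

    def get_bytes(self) -> bytes:
        return self._data


element = MemoryElementSketch()
element.set_bytes(b"abc")

frozen = MemoryElementSketch(b"x", read_only=True)
try:
    frozen.set_bytes(b"y")
except ReadOnlyError as err:
    print("rejected:", err)
```

Calling the base method before touching any state keeps a rejected write from partially modifying the element.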
- sha1() str [source]
Get the SHA1 checksum of this element’s binary content.
- Returns
SHA1 hex checksum of the data content.
- Return type
str
- sha512() str [source]
Get the SHA512 checksum of this element’s binary content.
- Returns
SHA512 hex checksum of the data content.
- Return type
str
- to_buffered_reader() BytesIO [source]
Wrap this element’s bytes in an io.BufferedReader instance for use as a file-like object for reading.
As we use the get_bytes function, this element’s bytes must safely fit in memory for this method to be usable.
- Returns
New BufferedReader instance
- Return type
io.BufferedReader
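For orientation, the snippet below shows one way in-memory bytes can be exposed as a buffered, file-like reader using only the standard io module. It is a sketch of the general idea rather than the method's actual body; note that the signature above lists BytesIO while the description mentions io.BufferedReader, and both behave as readable binary file objects.

```python
import io

# Bytes that would come from an element's get_bytes() call.
data = b"line one\nline two\n"

# io.BytesIO holds the bytes in memory; io.BufferedReader adds buffered
# file-like reading on top of it.
reader = io.BufferedReader(io.BytesIO(data))

first_line = reader.readline()
rest = reader.read()
reader.close()
```

Because the whole payload is materialized in memory first, this approach is only suitable when the element's content is small enough to hold at once, as the description above states.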
- uuid() Hashable [source]
UUID for this data element.
This may take different forms from integers to strings to a uuid.UUID instance. This must return a hashable data type.
By default, this ends up being the hex stringification of the SHA1 hash of this data’s bytes. Specific implementations may provide other UUIDs, however.
- Returns
UUID value for this data element. This return value should be hashable.
- Return type
collections.abc.Hashable
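The documented default (the hex string of the SHA1 hash of the data's bytes) can be reproduced with the standard hashlib module. This is a sketch of the described behaviour only; the variable names are illustrative, and the md5 and sha512 digests are computed the same way for comparison.

```python
import hashlib

content = b"hello world"

# Hex digests corresponding to the md5(), sha1() and sha512() methods.
md5_hex = hashlib.md5(content).hexdigest()
sha1_hex = hashlib.sha1(content).hexdigest()
sha512_hex = hashlib.sha512(content).hexdigest()

# The described default UUID is the SHA1 hex string: hashable, and
# already a string, so it stays unique after str() conversion.
default_uuid = sha1_hex
lookup = {default_uuid: "element for 'hello world'"}

# Any change to the bytes produces a different digest, which is why
# UUID-keyed containers assume the content does not change.
changed_uuid = hashlib.sha1(content + b"!").hexdigest()
```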
- abstract writable() bool [source]
- Returns
True if this instance supports setting bytes.
- Return type
bool
- write_temp(temp_dir: Optional[str] = None) str [source]
Write this data’s bytes to a temporary file on disk, returning the path to the written file, whose extension is guessed based on this data’s content type.
It is not guaranteed that the returned file path does not point to the original data, i.e. writing to the returned filepath may modify the original data.
- NOTE:
The file path returned should not be explicitly removed by the user. Instead, the
clean_temp()
method should be called on this object.
- Parameters
temp_dir (None or str) – Optional directory to write temporary file in, otherwise we use the platform default temporary files directory. If this is an empty string, we count it the same as having provided None.
- Returns
Path to the temporary file
- Return type
str
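A rough sketch of the write_temp and clean_temp behaviour described above, using only the standard library. The helper names write_temp_sketch and clean_temp_sketch are hypothetical, and the use of mimetypes.guess_extension is an assumption about how an extension could be guessed from a content type; the actual implementation may differ.

```python
import mimetypes
import os
import tempfile
from typing import Optional


def write_temp_sketch(
    data: bytes, content_type: Optional[str], temp_dir: Optional[str] = None
) -> str:
    """Write bytes to a temporary file and return its path."""
    # Guess an extension from the content type; fall back to none.
    ext = (mimetypes.guess_extension(content_type) if content_type else None) or ""
    # An empty temp_dir string is treated the same as None, which makes
    # tempfile use the platform default temporary directory.
    fd, path = tempfile.mkstemp(suffix=ext, dir=temp_dir or None)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


def clean_temp_sketch(path: str) -> None:
    """Remove the temporary file if it exists; otherwise do nothing."""
    if os.path.isfile(path):
        os.remove(path)


path = write_temp_sketch(b"some text", "text/plain", temp_dir="")
with open(path, "rb") as f:
    round_trip = f.read()

clean_temp_sketch(path)
clean_temp_sketch(path)  # second call is a no-op
```

Keeping removal behind a dedicated cleanup call mirrors the note above: the element, not the caller, knows whether the returned path is a true temporary copy or the original data.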
DataSet
- class smqtk_dataprovider.DataSet[source]
Abstract interface for data sets that contain an arbitrary number of DataElement instances of arbitrary implementation type, keyed on DataElement UUID values.
This should only be used with DataElements whose byte content is expected not to change. If it does change, the UUID keys may no longer represent the elements associated with them.
- abstract add_data(*elems: DataElement) None [source]
Add the given data element instance(s) to this data set.
NOTE: Implementing methods should check that input elements are in fact DataElement instances.
- Parameters
elems (smqtk.representation.DataElement) – Data element(s) to add
- abstract get_data(uuid: Hashable) DataElement [source]
Get the data element the given uuid references, or raise an exception if the uuid does not reference any element in this set.
- Raises
KeyError – If the given uuid does not refer to an element in this data set.
- Parameters
uuid (collections.abc.Hashable) – The uuid of the element to retrieve.
- Returns
The data element instance for the given uuid.
- Return type
smqtk.representation.DataElement
- abstract has_uuid(uuid: Hashable) bool [source]
Test if the given uuid refers to an element in this data set.
- Parameters
uuid (collections.abc.Hashable) – Unique ID to test for inclusion. This should match the type that the set implementation expects or cares about.
- Returns
True if the given uuid matches an element in this set, or False if it does not.
- Return type
bool
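To make the three abstract methods concrete, here is a minimal in-memory sketch keyed on element UUIDs. Both classes are hypothetical and self-contained: ElementStub stands in for a DataElement implementation, and MemoryDataSetSketch does not subclass the real interface.

```python
import hashlib
from typing import Dict, Hashable


class ElementStub:
    """Hypothetical immutable element used in place of a DataElement."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def get_bytes(self) -> bytes:
        return self._data

    def uuid(self) -> Hashable:
        # Mirrors the documented default: SHA1 hex digest of the bytes.
        return hashlib.sha1(self._data).hexdigest()


class MemoryDataSetSketch:
    """Hypothetical dict-backed data set keyed on element UUIDs."""

    def __init__(self) -> None:
        self._elements: Dict[Hashable, ElementStub] = {}

    def add_data(self, *elems: ElementStub) -> None:
        # Check every input before storing any of them.
        for e in elems:
            if not isinstance(e, ElementStub):
                raise TypeError(f"Expected ElementStub, got {type(e).__name__}")
        for e in elems:
            self._elements[e.uuid()] = e

    def get_data(self, uuid: Hashable) -> ElementStub:
        # The dict lookup raises KeyError for an unknown UUID.
        return self._elements[uuid]

    def has_uuid(self, uuid: Hashable) -> bool:
        return uuid in self._elements


data_set = MemoryDataSetSketch()
first = ElementStub(b"first")
second = ElementStub(b"second")
data_set.add_data(first, second)
```

Because keys are derived from content, adding two elements with identical bytes stores a single entry in this sketch.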
KeyValueStore
- class smqtk_dataprovider.KeyValueStore[source]
Interface for general key/value storage.
Implementations may impose restrictions on the types of keys or values allowed, depending on the backend used.
Data access and manipulation should be thread-safe.
- abstract add(key: Hashable, value: Any) KeyValueStore [source]
Add a key-value pair to this store.
NOTE: Implementing sub-classes should call this super-method. This super method should not be considered a critical section for thread safety unless ``is_read_only`` is not thread-safe.
- Parameters
key (Hashable) – Key for the value. Must be hashable.
value (object) – Python object to store.
- Raises
ReadOnlyError – If this instance is marked as read-only.
- Returns
Self.
- Return type
KeyValueStore
- abstract add_many(d: Mapping[Hashable, Any]) KeyValueStore [source]
Add multiple key-value pairs at a time into this store as represented in the provided dictionary d.
- Parameters
d (dict[Hashable, object]) – Dictionary of key-value pairs to add to this store.
- Raises
ReadOnlyError – If this instance is marked as read-only.
- Returns
Self.
- Return type
KeyValueStore
- abstract clear() KeyValueStore [source]
Clear this key-value store.
NOTE: Implementing sub-classes should call this super-method. This super method should not be considered a critical section for thread safety.
- Raises
ReadOnlyError – If this instance is marked as read-only.
- Returns
Self.
- Return type
KeyValueStore
- abstract count() int [source]
- Returns
The number of key-value relationships in this store.
- Return type
int
- abstract get(key: Hashable, default: Any = <smqtk_dataprovider.interfaces.key_value_store.KeyValueStoreNoDefaultValueType object>) Any [source]
Get the value for the given key.
NOTE: Implementing sub-classes are responsible for raising a ``KeyError`` where appropriate.
- Parameters
key – Key to get the value of.
default – Optional default value if the given key is not present in this store. This may be any value except for the NO_DEFAULT_VALUE constant (custom anonymous class instance).
- Raises
KeyError – The given key is not present in this store and no default value given.
- Returns
Deserialized python object stored for the given key.
- get_many(keys: Iterable[Hashable], default: Any = <smqtk_dataprovider.interfaces.key_value_store.KeyValueStoreNoDefaultValueType object>) Iterable[Any] [source]
Get the values for the given keys.
NOTE: Implementing sub-classes are responsible for raising a ``KeyError`` where appropriate.
- Parameters
keys (collections.abc.Iterable[Hashable]) – The keys for which associated values are requested.
default (object) – Optional default value if a given key is not present in this store. This may be any value except for the NO_DEFAULT_VALUE constant (custom anonymous class instance).
- Raises
KeyError – A given key is not present in this store and no default value given.
- Returns
Iterable of deserialized python objects stored for the given keys in the order that the corresponding keys were provided.
- Return type
collections.abc.Iterable
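The default handling described for get and get_many relies on a sentinel object, so that None (or any other value) remains usable as a caller-supplied default. The sketch below shows that pattern over a plain dict; _NoDefaultSketch, SKETCH_NO_DEFAULT, get_sketch, and get_many_sketch are hypothetical names, not the library's own definitions.

```python
from typing import Any, Dict, Hashable, Iterable, Iterator


class _NoDefaultSketch:
    """Hypothetical sentinel type; its single instance means 'no default'."""


SKETCH_NO_DEFAULT = _NoDefaultSketch()


def get_sketch(
    table: Dict[Hashable, Any], key: Hashable, default: Any = SKETCH_NO_DEFAULT
) -> Any:
    try:
        return table[key]
    except KeyError:
        # Identity comparison: only the sentinel itself means "raise".
        if default is SKETCH_NO_DEFAULT:
            raise
        return default


def get_many_sketch(
    table: Dict[Hashable, Any],
    keys: Iterable[Hashable],
    default: Any = SKETCH_NO_DEFAULT,
) -> Iterator[Any]:
    # Values are yielded in the order the keys were provided. As a
    # generator, a KeyError surfaces only once iteration reaches the
    # missing key.
    for key in keys:
        yield get_sketch(table, key, default)


table = {"a": 1, "b": 2}
ordered = list(get_many_sketch(table, ["b", "a"]))
with_default = list(get_many_sketch(table, ["a", "z"], default=None))
```

Using a dedicated sentinel instead of None is what lets a caller distinguish "key missing" from "key present with value None".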
- abstract has(key: Hashable) bool [source]
Check if this store has a value for the given key.
- Parameters
key (Hashable) – Key to check for a value for.
- Returns
If this store has a value for the given key.
- Return type
bool
- abstract is_read_only() bool [source]
- Returns
True if this instance is read-only and False if it is not.
- Return type
bool
- abstract keys() Iterator[Hashable] [source]
- Returns
Iterator over keys in this store.
- Return type
collections.abc.Iterator[Hashable]
- abstract remove(key: Hashable) KeyValueStore [source]
Remove a single key-value entry.
- Parameters
key (Hashable) – Key to remove.
- Raises
ReadOnlyError – If this instance is marked as read-only.
KeyError – The given key is not present in this store.
- Returns
Self.
- Return type
KeyValueStore
- abstract remove_many(keys: Iterable[Hashable]) KeyValueStore [source]
Remove multiple keys and associated values.
- Parameters
keys (collections.abc.Iterable[Hashable]) – Iterable of keys to remove. If this is empty this method does nothing.
- Raises
ReadOnlyError – If this instance is marked as read-only.
KeyError – A given key is not present in this store. The store is not modified if any key is invalid.
- Returns
Self.
- Return type
KeyValueStore
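Several of the behaviours documented above (returning self for chaining, the read-only check, thread-safe access, and remove_many leaving the store unmodified when any key is invalid) can be sketched with a lock-guarded dict. Everything here is hypothetical and self-contained: MemoryKeyValueStoreSketch does not subclass the real interface, and ReadOnlyError is a local stand-in.

```python
import threading
from typing import Any, Dict, Hashable, Iterable, Iterator, Mapping


class ReadOnlyError(Exception):
    """Hypothetical local stand-in for the documented ReadOnlyError."""


class MemoryKeyValueStoreSketch:
    def __init__(self, read_only: bool = False) -> None:
        self._table: Dict[Hashable, Any] = {}
        self._lock = threading.RLock()
        self._read_only = read_only

    def is_read_only(self) -> bool:
        return self._read_only

    def _check_writable(self) -> None:
        if self.is_read_only():
            raise ReadOnlyError("This store is marked as read-only.")

    def add(self, key: Hashable, value: Any) -> "MemoryKeyValueStoreSketch":
        self._check_writable()
        with self._lock:
            self._table[key] = value
        return self

    def add_many(self, d: Mapping[Hashable, Any]) -> "MemoryKeyValueStoreSketch":
        self._check_writable()
        with self._lock:
            self._table.update(d)
        return self

    def count(self) -> int:
        with self._lock:
            return len(self._table)

    def keys(self) -> Iterator[Hashable]:
        # Iterate over a snapshot so later writes do not disturb callers.
        with self._lock:
            return iter(list(self._table))

    def remove_many(self, keys: Iterable[Hashable]) -> "MemoryKeyValueStoreSketch":
        self._check_writable()
        wanted = set(keys)
        with self._lock:
            # Validate all keys first so a failure leaves the store intact.
            missing = [k for k in wanted if k not in self._table]
            if missing:
                raise KeyError(missing)
            for k in wanted:
                del self._table[k]
        return self


store = MemoryKeyValueStoreSketch()
store.add("a", 1).add_many({"b": 2, "c": 3})
```

Validating inside the same lock that performs the deletion is what makes the all-or-nothing guarantee hold when several threads modify the store.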