DataProvider
An important part of any algorithm is the data it’s working over and the data that it produces.
An important part of working with large scales of data is where the data is stored and how it’s accessed.
The smqtk_dataprovider module contains interfaces and plugins for various core data structures, allowing plugin implementations to decide where and how the underlying raw data should be stored and accessed.
This potentially allows algorithms to handle more data than would otherwise be feasible on a single machine.
DataProvider Structures
The following are the core data representation interfaces included in this package.
- Note:
Implementations are required to share a common serialization format so that they may be stored or transported by other structures in a general way, regardless of the specific implementation. For this, all implementations must be serializable via the pickle module functions.
DataElement
- class smqtk_dataprovider.DataElement[source]
Abstract interface for a byte data container.
The primary “value” of a DataElement is the byte content wrapped. Since this can technically change due to external forces, we cannot guarantee that an element is immutable. Thus DataElement instances are not considered generally hashable. Specific implementations may define a __hash__ method if that implementation reflects a data source that guarantees immutability.
UUIDs should be cast-able to a string and maintain uniqueness after conversion.
- clean_temp() None [source]
Clean any temporary files created by this element. This does nothing if no temporary files have been generated for this element yet.
- abstract content_type() str | None [source]
- Returns:
Standard type/subtype string for this data element, or None if the content type is unknown.
- Return type:
str or None
- classmethod from_uri(uri: str) DataElement [source]
Construct a new instance based on the given URI.
This function may not be implemented for all DataElement types.
- Parameters:
uri (str) – URI string to resolve into an element instance
- Raises:
NoUriResolutionError – This element type does not implement URI resolution.
InvalidUriError – This element type could not resolve the provided URI string.
- Returns:
New element instance of our type.
- Return type:
DataElement
- abstract get_bytes() bytes [source]
- Returns:
The bytes of this data element.
- Return type:
bytes
- abstract is_empty() bool [source]
Check if this element contains no bytes.
The intent of this method is to quickly check if there is any data behind this element, ideally without having to read all/any of the underlying data.
- Returns:
If this element contains 0 bytes.
- Return type:
bool
- md5() str [source]
Get the MD5 checksum of this element’s binary content.
- Returns:
MD5 hex checksum of the data content.
- Return type:
str
- abstract set_bytes(b: bytes) None [source]
Set bytes to this data element.
Not all implementations may support setting bytes (check the writable() method return).
This base abstract method should be called by sub-class implementations first. We check for mutability based on the writable() method return.
- Parameters:
b (bytes) – bytes to set.
- Raises:
ReadOnlyError – This data element can only be read from / does not support writing.
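The calling convention described for set_bytes can be sketched as follows. This is a minimal illustration and not the library's implementation: ReadOnlyError here is a local stand-in for the library's exception, and SketchElement and MemoryElement are hypothetical names introduced only for this example.

```python
import abc


class ReadOnlyError(Exception):
    """Local stand-in for the library's ReadOnlyError (assumption)."""


class SketchElement(abc.ABC):
    """Hypothetical base class mirroring only writable() and set_bytes()."""

    @abc.abstractmethod
    def writable(self) -> bool:
        """Return True if this instance supports setting bytes."""

    @abc.abstractmethod
    def set_bytes(self, b: bytes) -> None:
        # Base check: refuse writes when the element reports it is read-only.
        if not self.writable():
            raise ReadOnlyError("This element does not support writing.")


class MemoryElement(SketchElement):
    """Hypothetical in-memory implementation."""

    def __init__(self, data: bytes = b"", read_only: bool = False) -> None:
        self._data = data
        self._read_only = read_only

    def writable(self) -> bool:
        return not self._read_only

    def get_bytes(self) -> bytes:
        return self._data

    def set_bytes(self, b: bytes) -> None:
        # Call the base method first so the mutability check runs
        # before any state is changed.
        super().set_bytes(b)
        self._data = b


elem = MemoryElement()
elem.set_bytes(b"new content")

frozen = MemoryElement(b"fixed", read_only=True)
try:
    frozen.set_bytes(b"ignored")
    write_rejected = False
except ReadOnlyError:
    write_rejected = True
```

Because the check lives in the base method, a sub-class that calls it first cannot partially modify its state before the read-only condition is detected.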
- sha1() str [source]
Get the SHA1 checksum of this element’s binary content.
- Returns:
SHA1 hex checksum of the data content.
- Return type:
str
- sha512() str [source]
Get the SHA512 checksum of this element’s binary content.
- Returns:
SHA512 hex checksum of the data content.
- Return type:
str
- to_buffered_reader() BytesIO [source]
Wrap this element’s bytes in an io.BufferedReader instance for use as a file-like object for reading.
As we use the get_bytes function, this element’s bytes must safely fit in memory for this method to be usable.
- Returns:
New BufferedReader instance
- Return type:
io.BufferedReader
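As a rough illustration of what wrapping in-memory bytes in a buffered reader looks like, the following uses only the standard library. The data variable stands in for whatever get_bytes() would return; how the library builds its reader internally is not shown here and is not assumed.

```python
import io

# Stand-in for the bytes an element would return from get_bytes().
data = b"line one\nline two\n"

# io.BytesIO holds the whole payload in memory, which is why the bytes
# must fit in memory; io.BufferedReader adds a buffered file-like API.
reader = io.BufferedReader(io.BytesIO(data))

first = reader.readline()  # reads up to and including the first newline
rest = reader.read()       # reads everything that remains
```

The resulting object can be passed to any API that expects a readable binary file object.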
- uuid() Hashable [source]
UUID for this data element.
This may take different forms, from integers to strings to a uuid.UUID instance. This must return a hashable data type.
By default, this ends up being the hex stringification of the SHA1 hash of this data’s bytes. Specific implementations may provide other UUIDs, however.
- Returns:
UUID value for this data element. This return value should be hashable.
- Return type:
collections.abc.Hashable
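The checksum methods and the default UUID described above correspond to standard hex digests. A minimal sketch with hashlib, where data is a placeholder for an element's bytes and not a value taken from the library:

```python
import hashlib

# Placeholder for the bytes an element would return from get_bytes().
data = b"example bytes"

md5_hex = hashlib.md5(data).hexdigest()        # 32 hex characters
sha1_hex = hashlib.sha1(data).hexdigest()      # 40 hex characters
sha512_hex = hashlib.sha512(data).hexdigest()  # 128 hex characters

# Per the description, the default UUID is the SHA1 hex string.
# A str is hashable, so it can be used directly as a dictionary key.
default_uuid = sha1_hex
index = {default_uuid: "some element"}
```

Since the default UUID is derived from content, two elements wrapping identical bytes would share a UUID, and an element whose bytes change would yield a different one.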
- abstract writable() bool [source]
- Returns:
True if this instance supports setting bytes.
- Return type:
bool
- write_temp(temp_dir: str | None = None) str [source]
Write this data’s bytes to a temporary file on disk, returning the path to the written file, whose extension is guessed based on this data’s content type.
It is not guaranteed that the returned file path does not point to the original data, i.e. writing to the returned filepath may modify the original data.
- NOTE:
The file path returned should not be explicitly removed by the user. Instead, the clean_temp() method should be called on this object.
- Parameters:
temp_dir (None or str) – Optional directory to write temporary file in, otherwise we use the platform default temporary files directory. If this is an empty string, we count it the same as having provided None.
- Returns:
Path to the temporary file
- Return type:
str
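The behavior described for write_temp and clean_temp can be approximated with the standard library as below. This is a sketch under stated assumptions, not the library's code: write_temp_sketch and clean_temp_sketch are hypothetical helper names, and the extension guess relies on mimetypes, which may differ from what the library does.

```python
import mimetypes
import os
import tempfile
from typing import List, Optional


def write_temp_sketch(data: bytes, content_type: Optional[str],
                      temp_dir: Optional[str] = None) -> str:
    """Write bytes to a temporary file and return its path."""
    # An empty string is treated the same as None (platform default dir).
    directory = temp_dir or None
    # Guess a file extension from the content type, if one is known.
    ext = ""
    if content_type:
        ext = mimetypes.guess_extension(content_type) or ""
    fd, path = tempfile.mkstemp(suffix=ext, dir=directory)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


def clean_temp_sketch(paths: List[str]) -> None:
    """Remove previously written temporary files; no-op if none exist."""
    for p in paths:
        if os.path.isfile(p):
            os.remove(p)
    paths.clear()


written = [write_temp_sketch(b"some text", "text/plain", temp_dir="")]
temp_path = written[0]
with open(temp_path, "rb") as f:
    round_trip = f.read()

clean_temp_sketch(written)
```

Tracking the written paths in one place is what makes a single cleanup call sufficient, which is the reason callers are asked not to delete the returned file themselves.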
DataSet
- class smqtk_dataprovider.DataSet[source]
Abstract interface for data sets that contain an arbitrary number of DataElement instances of arbitrary implementation type, keyed on DataElement UUID values.
This should only be used with DataElements whose byte content is expected not to change. If the content does change, UUID keys may no longer represent the elements associated with them.
- abstract add_data(*elems: DataElement) None [source]
Add the given data element instance(s) to this data set.
NOTE: Implementing methods should check that input elements are in fact DataElement instances.
- Parameters:
elems (smqtk_dataprovider.DataElement) – Data element(s) to add
- abstract get_data(uuid: Hashable) DataElement [source]
Get the data element the given uuid references, or raise an exception if the uuid does not reference any element in this set.
- Raises:
KeyError – If the given uuid does not refer to an element in this data set.
- Parameters:
uuid (collections.abc.Hashable) – The uuid of the element to retrieve.
- Returns:
The data element instance for the given uuid.
- Return type:
smqtk_dataprovider.DataElement
- abstract has_uuid(uuid: Hashable) bool [source]
Test if the given uuid refers to an element in this data set.
- Parameters:
uuid (collections.abc.Hashable) – Unique ID to test for inclusion. This should match the type that the set implementation expects or cares about.
- Returns:
True if the given uuid matches an element in this set, or False if it does not.
- Return type:
bool
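To make the three abstract methods concrete, here is a small dictionary-backed sketch. It does not import the library: BytesElement and DictDataSet are hypothetical stand-ins that mirror only the method names documented above, and the SHA1-based uuid follows the default described for DataElement.

```python
import hashlib
from typing import Dict, Hashable


class BytesElement:
    """Hypothetical minimal element exposing get_bytes() and uuid()."""

    def __init__(self, data: bytes) -> None:
        self._data = data

    def get_bytes(self) -> bytes:
        return self._data

    def uuid(self) -> Hashable:
        # Mirrors the documented default: SHA1 hex digest of the bytes.
        return hashlib.sha1(self._data).hexdigest()


class DictDataSet:
    """Hypothetical in-memory data set keyed on element UUIDs."""

    def __init__(self) -> None:
        self._elements: Dict[Hashable, BytesElement] = {}

    def add_data(self, *elems: BytesElement) -> None:
        # Validate every input before storing anything.
        for e in elems:
            if not isinstance(e, BytesElement):
                raise TypeError("Expected BytesElement instances.")
        for e in elems:
            self._elements[e.uuid()] = e

    def get_data(self, uuid: Hashable) -> BytesElement:
        # A plain dict lookup raises KeyError for unknown UUIDs.
        return self._elements[uuid]

    def has_uuid(self, uuid: Hashable) -> bool:
        return uuid in self._elements


ds = DictDataSet()
a, b = BytesElement(b"first"), BytesElement(b"second")
ds.add_data(a, b)
fetched = ds.get_data(a.uuid())
```

Because the keys are derived from content in this sketch, mutating an element's bytes after insertion would leave it filed under a stale key, which is the situation the warning above describes.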
KeyValueStore
- class smqtk_dataprovider.KeyValueStore[source]
Interface for general key/value storage.
Implementations may restrict the types of keys or values allowed, depending on the backend used.
Data access and manipulation should be thread-safe.
- abstract add(key: Hashable, value: Any) KeyValueStore [source]
Add a key-value pair to this store.
NOTE: Implementing sub-classes should call this super-method. This super method should not be considered a critical section for thread safety unless ``is_read_only`` is not thread-safe.
- Parameters:
key (Hashable) – Key for the value. Must be hashable.
value (object) – Python object to store.
- Raises:
ReadOnlyError – If this instance is marked as read-only.
- Returns:
Self.
- Return type:
KeyValueStore
- abstract add_many(d: Mapping[Hashable, Any]) KeyValueStore [source]
Add multiple key-value pairs at a time into this store as represented in the provided dictionary d.
- Parameters:
d (dict[Hashable, object]) – Dictionary of key-value pairs to add to this store.
- Raises:
ReadOnlyError – If this instance is marked as read-only.
- Returns:
Self.
- Return type:
KeyValueStore
- abstract clear() KeyValueStore [source]
Clear this key-value store.
NOTE: Implementing sub-classes should call this super-method. This super method should not be considered a critical section for thread safety.
- Raises:
ReadOnlyError – If this instance is marked as read-only.
- Returns:
Self.
- Return type:
KeyValueStore
- abstract count() int [source]
- Returns:
The number of key-value relationships in this store.
- Return type:
int
- abstract get(key: Hashable, default: Any = <smqtk_dataprovider.interfaces.key_value_store.KeyValueStoreNoDefaultValueType object>) Any [source]
Get the value for the given key.
NOTE: Implementing sub-classes are responsible for raising a ``KeyError`` where appropriate.
- Parameters:
key – Key to get the value of.
default – Optional default value if the given key is not present in this store. This may be any value except for the NO_DEFAULT_VALUE constant (custom anonymous class instance).
- Raises:
KeyError – The given key is not present in this store and no default value given.
- Returns:
Deserialized python object stored for the given key.
- get_many(keys: Iterable[Hashable], default: Any = <smqtk_dataprovider.interfaces.key_value_store.KeyValueStoreNoDefaultValueType object>) Iterable[Any] [source]
Get the values for the given keys.
NOTE: Implementing sub-classes are responsible for raising a ``KeyError`` where appropriate.
- Parameters:
keys (collections.abc.Iterable[Hashable]) – The keys for which associated values are requested.
default (object) – Optional default value if a given key is not present in this store. This may be any value except for the NO_DEFAULT_VALUE constant (custom anonymous class instance).
- Raises:
KeyError – A given key is not present in this store and no default value given.
- Returns:
Iterable of deserialized python objects stored for the given keys in the order that the corresponding keys were provided.
- Return type:
collections.abc.Iterable
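The reason for a dedicated "no default" constant is that None is itself a valid default, so it cannot be used to signal that no default was supplied. The sentinel pattern can be sketched as below; _NoDefaultType, NO_DEFAULT, get_value and get_values are hypothetical names for illustration and are not the library's objects.

```python
from typing import Any, Dict, Hashable, Iterable, Iterator


class _NoDefaultType:
    """Hypothetical sentinel type; its single instance means 'no default'."""


NO_DEFAULT = _NoDefaultType()


def get_value(table: Dict[Hashable, Any], key: Hashable,
              default: Any = NO_DEFAULT) -> Any:
    try:
        return table[key]
    except KeyError:
        # Identity comparison: only the sentinel itself means "not given".
        if default is NO_DEFAULT:
            raise
        return default


def get_values(table: Dict[Hashable, Any], keys: Iterable[Hashable],
               default: Any = NO_DEFAULT) -> Iterator[Any]:
    # Values are yielded in the same order as the requested keys.
    for k in keys:
        yield get_value(table, k, default)


table = {"a": 1, "b": 2}
present = get_value(table, "a")
fallback = get_value(table, "missing", None)  # None is a legal default
ordered = list(get_values(table, ["b", "a", "missing"], default=0))

try:
    get_value(table, "missing")
    key_error_raised = False
except KeyError:
    key_error_raised = True
```

One consequence of writing get_values as a generator in this sketch is that a KeyError for a missing key surfaces only when the result is iterated, not when the function is called.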
- abstract has(key: Hashable) bool [source]
Check if this store has a value for the given key.
- Parameters:
key (Hashable) – Key to check for a value for.
- Returns:
If this store has a value for the given key.
- Return type:
bool
- abstract is_read_only() bool [source]
- Returns:
True if this instance is read-only and False if it is not.
- Return type:
bool
- abstract keys() Iterator[Hashable] [source]
- Returns:
Iterator over keys in this store.
- Return type:
collections.abc.Iterator[Hashable]
- abstract remove(key: Hashable) KeyValueStore [source]
Remove a single key-value entry.
- Parameters:
key (Hashable) – Key to remove.
- Raises:
ReadOnlyError – If this instance is marked as read-only.
KeyError – The given key is not present in this store.
- Returns:
Self.
- Return type:
KeyValueStore
- abstract remove_many(keys: Iterable[Hashable]) KeyValueStore [source]
Remove multiple keys and associated values.
- Parameters:
keys (collections.abc.Iterable[Hashable]) – Iterable of keys to remove. If this is empty this method does nothing.
- Raises:
ReadOnlyError – If this instance is marked as read-only.
KeyError – A given key is not present in this store. The store is not modified if any key is invalid.
- Returns:
Self.
- Return type:
KeyValueStore
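Several properties of this interface (methods returning self, read-only rejection, thread-safe access, and remove_many leaving the store untouched when any key is invalid) can be illustrated with a small in-memory sketch. MemoryStore is a hypothetical class and ReadOnlyError is a local stand-in; neither is the library's implementation.

```python
import threading
from typing import Any, Dict, Hashable, Iterable, Mapping


class ReadOnlyError(Exception):
    """Local stand-in for the library's ReadOnlyError (assumption)."""


class MemoryStore:
    """Hypothetical dictionary-backed store mirroring the documented API."""

    def __init__(self, read_only: bool = False) -> None:
        self._table: Dict[Hashable, Any] = {}
        self._lock = threading.RLock()
        self._read_only = read_only

    def is_read_only(self) -> bool:
        return self._read_only

    def _check_writable(self) -> None:
        if self.is_read_only():
            raise ReadOnlyError("This store is read-only.")

    def add(self, key: Hashable, value: Any) -> "MemoryStore":
        self._check_writable()
        with self._lock:
            self._table[key] = value
        return self  # returning self allows call chaining

    def add_many(self, d: Mapping[Hashable, Any]) -> "MemoryStore":
        self._check_writable()
        with self._lock:
            self._table.update(d)
        return self

    def remove_many(self, keys: Iterable[Hashable]) -> "MemoryStore":
        self._check_writable()
        unique_keys = set(keys)
        with self._lock:
            # Validate all keys first so a bad key changes nothing.
            missing = [k for k in unique_keys if k not in self._table]
            if missing:
                raise KeyError(missing)
            for k in unique_keys:
                del self._table[k]
        return self

    def count(self) -> int:
        with self._lock:
            return len(self._table)


store = MemoryStore().add("a", 1).add_many({"b": 2, "c": 3})

try:
    store.remove_many(["a", "missing"])
except KeyError:
    pass
count_after_failed_remove = store.count()

store.remove_many(["a", "b"])
count_after_remove = store.count()

try:
    MemoryStore(read_only=True).add("x", 1)
    read_only_rejected = False
except ReadOnlyError:
    read_only_rejected = True
```

Checking every key under the lock before deleting any of them is one simple way to get the all-or-nothing behavior documented for remove_many; a persistent backend would more likely rely on its own transaction mechanism.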