DataProvider

An important part of any algorithm is the data it works over and the data it produces. When working with data at large scale, where the data is stored and how it is accessed matter just as much. The smqtk_dataprovider module contains interfaces and plugins for various core data structures, allowing plugin implementations to decide where and how the underlying raw data should be stored and accessed. This potentially allows algorithms to handle more data than would otherwise be feasible on a single machine.

DataProvider Structures

The following are the core data representation interfaces included in this package.

Note:

It is required that implementations have a common serialization format so that they may be stored or transported by other structures in a general way without caring what the specific implementation is. For this we require that all implementations be serializable via the pickle module functions.

DataElement

class smqtk_dataprovider.DataElement[source]

Abstract interface for a byte data container.

The primary “value” of a DataElement is the byte content wrapped. Since this can technically change due to external forces, we cannot guarantee that an element is immutable. Thus DataElement instances are not considered generally hashable. Specific implementations may define a __hash__ method if that implementation reflects a data source that guarantees immutability.

UUIDs should be castable to a string and remain unique after conversion.

clean_temp() None[source]

Clean any temporary files created by this element. This does nothing if no temporary files have been generated for this element yet.

abstract content_type() Optional[str][source]
Returns

Standard type/subtype string for this data element, or None if the content type is unknown.

Return type

str or None

classmethod from_uri(uri: str) DataElement[source]

Construct a new instance based on the given URI.

This function may not be implemented for all DataElement types.

Parameters

uri (str) – URI string to resolve into an element instance

Raises
  • NoUriResolutionError – This element type does not implement URI resolution.

  • InvalidUriError – This element type could not resolve the provided URI string.

Returns

New element instance of our type.

Return type

DataElement

abstract get_bytes() bytes[source]
Returns

The bytes of this data element.

Return type

bytes

abstract is_empty() bool[source]

Check if this element contains no bytes.

The intent of this method is to quickly check whether there is any data behind this element, ideally without having to read all or any of the underlying data.

Returns

If this element contains 0 bytes.

Return type

bool

is_read_only() bool[source]
Returns

If this element can only be read from.

Return type

bool

md5() str[source]

Get the MD5 checksum of this element’s binary content.

Returns

MD5 hex checksum of the data content.

Return type

str

abstract set_bytes(b: bytes) None[source]

Set bytes to this data element.

Not all implementations support setting bytes; check the return value of writable() first.

Sub-class implementations should call this base abstract method first, as it checks for mutability based on the return value of writable().

Parameters

b (bytes) – bytes to set.

Raises

ReadOnlyError – This data element can only be read from / does not support writing.
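The call order described for set_bytes() can be sketched as follows. This is a minimal illustration and not the package's code: ElementSketch, MemoryElement, and the local ReadOnlyError are hypothetical stand-ins that only mirror the documented behaviour of writable() and set_bytes().

```python
import abc


class ReadOnlyError(Exception):
    """Local stand-in for the package's ReadOnlyError (assumption)."""


class ElementSketch(abc.ABC):
    """Hypothetical base class mirroring the documented contract."""

    @abc.abstractmethod
    def writable(self) -> bool:
        """Return True if this instance supports setting bytes."""

    @abc.abstractmethod
    def set_bytes(self, b: bytes) -> None:
        # Base behaviour: refuse the write when the element is not writable.
        if not self.writable():
            raise ReadOnlyError("This element can only be read from.")


class MemoryElement(ElementSketch):
    """Hypothetical in-memory implementation used only for illustration."""

    def __init__(self, content: bytes = b"", read_only: bool = False) -> None:
        self._content = content
        self._read_only = read_only

    def writable(self) -> bool:
        return not self._read_only

    def set_bytes(self, b: bytes) -> None:
        # Call the base method first so the mutability check runs
        # before any state is changed.
        super().set_bytes(b)
        self._content = b

    def get_bytes(self) -> bytes:
        return self._content


element = MemoryElement(b"old")
element.set_bytes(b"new")

frozen = MemoryElement(b"fixed", read_only=True)
try:
    frozen.set_bytes(b"changed")
    write_rejected = False
except ReadOnlyError:
    write_rejected = True
```

Because the base check runs before the assignment, a rejected write leaves the stored content untouched.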

sha1() str[source]

Get the SHA1 checksum of this element’s binary content.

Returns

SHA1 hex checksum of the data content.

Return type

str

sha512() str[source]

Get the SHA512 checksum of this element’s binary content.

Returns

SHA512 hex checksum of the data content.

Return type

str

to_buffered_reader() BytesIO[source]

Wrap this element’s bytes in an io.BufferedReader instance for use as a file-like object for reading.

As we use the get_bytes function, this element’s bytes must safely fit in memory for this method to be usable.

Returns

New BufferedReader instance

Return type

io.BufferedReader
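As a rough sketch of what such wrapping looks like with the standard library alone: the bytes are placed in an in-memory io.BytesIO and that object is wrapped in an io.BufferedReader. Whether the package builds its reader exactly this way is an assumption (the signature above is annotated BytesIO while the description names io.BufferedReader); both are file-like and support the reads shown. The `content` value stands in for the result of get_bytes().

```python
import io

# Stand-in for the result of get_bytes() on some element (assumption).
content = b"line one\nline two\n"

# Wrap the in-memory bytes so they can be consumed like an open binary file.
reader = io.BufferedReader(io.BytesIO(content))

first_line = reader.readline()  # reads up to and including the first newline
rest = reader.read()            # reads whatever remains
```

Since the whole payload is held in memory, this approach suits content that comfortably fits in RAM.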

uuid() Hashable[source]

UUID for this data element.

This may take different forms, from integers to strings to a uuid.UUID instance. This must return a hashable data type.

By default, this ends up being the hex stringification of the SHA1 hash of this data’s bytes. Specific implementations may provide other UUIDs, however.

Returns

UUID value for this data element. This return value should be hashable.

Return type

collections.abc.Hashable
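The default described above, the hex string of the SHA1 hash of the element's bytes, can be reproduced with hashlib. This is a sketch of the idea only: `default_uuid` is a hypothetical helper and not a function of this package.

```python
import hashlib


def default_uuid(content: bytes) -> str:
    """Hex string of the SHA1 digest of the content (hypothetical helper)."""
    return hashlib.sha1(content).hexdigest()


uid = default_uuid(b"hello")

# The value is a plain string, so it is hashable and usable as a dict key.
lookup = {uid: "some element"}

# Identical bytes always map to the same identifier.
same_again = default_uuid(b"hello")
different = default_uuid(b"hello!")
```

A content-derived identifier like this changes whenever the bytes change, which is why containers keyed on it expect element content to stay fixed.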

abstract writable() bool[source]
Returns

If this instance supports setting bytes.

Return type

bool

write_temp(temp_dir: Optional[str] = None) str[source]

Write this data’s bytes to a temporary file on disk, returning the path to the written file, whose extension is guessed based on this data’s content type.

It is not guaranteed that the returned file path does not point to the original data, i.e. writing to the returned filepath may modify the original data.

NOTE:

The file path returned should not be explicitly removed by the user. Instead, the clean_temp() method should be called on this object.

Parameters

temp_dir (None or str) – Optional directory in which to write the temporary file. If not given, the platform default temporary files directory is used. An empty string is treated the same as None.

Returns

Path to the temporary file

Return type

str
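The pairing of write_temp() and clean_temp() can be illustrated with a small stand-in. The sketch below is an assumption about one reasonable way to implement the documented behaviour (extension guessed from the content type, empty temp_dir treated as None, cleanup owned by the element); TempWritingElement is hypothetical and not part of this package.

```python
import mimetypes
import os
import tempfile
from typing import List, Optional


class TempWritingElement:
    """Hypothetical stand-in showing the write_temp/clean_temp pairing."""

    def __init__(self, content: bytes, content_type: Optional[str]) -> None:
        self._content = content
        self._content_type = content_type
        self._temp_paths: List[str] = []

    def write_temp(self, temp_dir: Optional[str] = None) -> str:
        # An empty string is treated the same as None.
        directory = temp_dir or None
        # Guess a file extension from the content type when one is known.
        ext = ""
        if self._content_type:
            ext = mimetypes.guess_extension(self._content_type) or ""
        fd, path = tempfile.mkstemp(suffix=ext, dir=directory)
        with os.fdopen(fd, "wb") as f:
            f.write(self._content)
        # Remember the path so clean_temp() can remove it later.
        self._temp_paths.append(path)
        return path

    def clean_temp(self) -> None:
        # Does nothing if no temporary files were written.
        for path in self._temp_paths:
            if os.path.isfile(path):
                os.remove(path)
        self._temp_paths.clear()


elem = TempWritingElement(b"hello", "text/plain")
path = elem.write_temp()
with open(path, "rb") as f:
    round_trip = f.read()
elem.clean_temp()
file_removed = not os.path.exists(path)
```

Leaving removal to clean_temp() keeps the element's bookkeeping consistent, which is the reason callers are asked not to delete the returned path themselves.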

DataSet

class smqtk_dataprovider.DataSet[source]

Abstract interface for data sets that contain an arbitrary number of DataElement instances of arbitrary implementation type, keyed on DataElement UUID values.

This should only be used with DataElements whose byte content is expected not to change. If they do, then UUID keys may no longer represent the elements associated with them.

abstract add_data(*elems: DataElement) None[source]

Add the given data element instance(s) to this data set.

NOTE: Implementing methods should check that input elements are in fact DataElement instances.

Parameters

elems (DataElement) – Data element(s) to add

abstract count() int[source]
Returns

The number of data elements in this set.

Return type

int

abstract get_data(uuid: Hashable) DataElement[source]

Get the data element the given uuid references, or raise an exception if the uuid does not reference any element in this set.

Raises

KeyError – If the given uuid does not refer to an element in this data set.

Parameters

uuid (collections.abc.Hashable) – The uuid of the element to retrieve.

Returns

The data element instance for the given uuid.

Return type

DataElement

abstract has_uuid(uuid: Hashable) bool[source]

Test if the given uuid refers to an element in this data set.

Parameters

uuid (collections.abc.Hashable) – Unique ID to test for inclusion. This should match the type that the set implementation expects or cares about.

Returns

True if the given uuid matches an element in this set, or False if it does not.

Return type

bool

abstract uuids() Set[Hashable][source]
Returns

A new set of uuids represented in this data set.

Return type

set
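To make the DataSet contract concrete, here is a minimal dictionary-backed sketch. It is not one of the package's implementations: BytesElement and DictDataSet are hypothetical classes that follow the method names and behaviours documented above, with the element UUID assumed to be the SHA1 hex digest of its bytes.

```python
import hashlib
from typing import Dict, Hashable, Set


class BytesElement:
    """Hypothetical element whose UUID is the SHA1 hex digest of its bytes."""

    def __init__(self, content: bytes) -> None:
        self._content = content

    def get_bytes(self) -> bytes:
        return self._content

    def uuid(self) -> Hashable:
        return hashlib.sha1(self._content).hexdigest()


class DictDataSet:
    """Hypothetical in-memory data set keyed on element UUIDs."""

    def __init__(self) -> None:
        self._elements: Dict[Hashable, BytesElement] = {}

    def add_data(self, *elems: BytesElement) -> None:
        for e in elems:
            # Check the input type, as the interface note asks.
            if not isinstance(e, BytesElement):
                raise TypeError("Expected a BytesElement instance.")
            self._elements[e.uuid()] = e

    def count(self) -> int:
        return len(self._elements)

    def get_data(self, uuid: Hashable) -> BytesElement:
        # A plain dict lookup raises KeyError for unknown UUIDs.
        return self._elements[uuid]

    def has_uuid(self, uuid: Hashable) -> bool:
        return uuid in self._elements

    def uuids(self) -> Set[Hashable]:
        # Return a new set so callers cannot mutate internal state.
        return set(self._elements)


data_set = DictDataSet()
first = BytesElement(b"first")
data_set.add_data(first, BytesElement(b"second"))

snapshot = data_set.uuids()
snapshot.clear()  # mutating the returned set leaves the data set untouched
```

Because the keys are derived from content, adding two elements with identical bytes results in a single entry.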

KeyValueStore

class smqtk_dataprovider.KeyValueStore[source]

Interface for general key/value storage.

Implementations may restrict the types allowed for keys or values, depending on the backend used.

Data access and manipulation should be thread-safe.

abstract add(key: Hashable, value: Any) KeyValueStore[source]

Add a key-value pair to this store.

NOTE: Implementing sub-classes should call this super-method. This super method should not be considered a critical section for thread safety unless is_read_only is not thread-safe.

Parameters
  • key (Hashable) – Key for the value. Must be hashable.

  • value (object) – Python object to store.

Raises

ReadOnlyError – If this instance is marked as read-only.

Returns

Self.

Return type

KeyValueStore

abstract add_many(d: Mapping[Hashable, Any]) KeyValueStore[source]

Add multiple key-value pairs at a time into this store as represented in the provided dictionary d.

Parameters

d (dict[Hashable, object]) – Dictionary of key-value pairs to add to this store.

Raises

ReadOnlyError – If this instance is marked as read-only.

Returns

Self.

Return type

KeyValueStore

abstract clear() KeyValueStore[source]

Clear this key-value store.

NOTE: Implementing sub-classes should call this super-method. This super method should not be considered a critical section for thread safety.

Raises

ReadOnlyError – If this instance is marked as read-only.

Returns

Self.

Return type

KeyValueStore

abstract count() int[source]
Returns

The number of key-value relationships in this store.

Return type

int

abstract get(key: ~typing.Hashable, default: ~typing.Any = <smqtk_dataprovider.interfaces.key_value_store.KeyValueStoreNoDefaultValueType object>) Any[source]

Get the value for the given key.

NOTE: Implementing sub-classes are responsible for raising a KeyError where appropriate.

Parameters
  • key – Key to get the value of.

  • default – Optional default value if the given key is not present in this store. This may be any value except for the NO_DEFAULT_VALUE constant (custom anonymous class instance).

Raises

KeyError – The given key is not present in this store and no default value given.

Returns

Deserialized python object stored for the given key.
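The reason for a dedicated "no default" constant is that None (or any other ordinary value) may itself be a legitimate default. A minimal sketch of the pattern follows; DictStore, _NoDefault, and NO_DEFAULT are hypothetical names that only imitate the documented behaviour and are not the package's own objects.

```python
from typing import Any, Dict, Hashable


class _NoDefault:
    """Private marker type; its single instance means 'no default given'."""


NO_DEFAULT = _NoDefault()


class DictStore:
    """Hypothetical dictionary-backed key/value store."""

    def __init__(self) -> None:
        self._table: Dict[Hashable, Any] = {}

    def add(self, key: Hashable, value: Any) -> "DictStore":
        self._table[key] = value
        return self  # returning self allows chained calls

    def get(self, key: Hashable, default: Any = NO_DEFAULT) -> Any:
        if key in self._table:
            return self._table[key]
        # Identity comparison against the marker distinguishes
        # "no default passed" from "default=None".
        if default is NO_DEFAULT:
            raise KeyError(key)
        return default


store = DictStore().add("a", 1).add("b", None)

present = store.get("a")
stored_none = store.get("b")               # a stored None is returned as-is
fallback = store.get("missing", None)      # None is accepted as a default
try:
    store.get("missing")
    missing_raised = False
except KeyError:
    missing_raised = True
```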

get_many(keys: ~typing.Iterable[~typing.Hashable], default: ~typing.Any = <smqtk_dataprovider.interfaces.key_value_store.KeyValueStoreNoDefaultValueType object>) Iterable[Any][source]

Get the values for the given keys.

NOTE: Implementing sub-classes are responsible for raising a KeyError where appropriate.

Parameters
  • keys (collections.abc.Iterable[Hashable]) – The keys for which associated values are requested.

  • default (object) – Optional default value if a given key is not present in this store. This may be any value except for the NO_DEFAULT_VALUE constant (custom anonymous class instance).

Raises

KeyError – A given key is not present in this store and no default value given.

Returns

Iterable of deserialized python objects stored for the given keys in the order that the corresponding keys were provided.

Return type

collections.abc.Iterable

abstract has(key: Hashable) bool[source]

Check if this store has a value for the given key.

Parameters

key (Hashable) – Key whose presence in this store is checked.

Returns

If this store has a value for the given key.

Return type

bool

abstract is_read_only() bool[source]
Returns

True if this instance is read-only and False if it is not.

Return type

bool

abstract keys() Iterator[Hashable][source]
Returns

Iterator over keys in this store.

Return type

collections.abc.Iterator[Hashable]

abstract remove(key: Hashable) KeyValueStore[source]

Remove a single key-value entry.

Parameters

key (Hashable) – Key to remove.

Raises
  • ReadOnlyError – If this instance is marked as read-only.

  • KeyError – The given key is not present in this store.

Returns

Self.

Return type

KeyValueStore

abstract remove_many(keys: Iterable[Hashable]) KeyValueStore[source]

Remove multiple keys and associated values.

Parameters

keys (collections.abc.Iterable[Hashable]) – Iterable of keys to remove. If this is empty this method does nothing.

Raises
  • ReadOnlyError – If this instance is marked as read-only.

  • KeyError – A given key is not present in this store. The store is not modified if any key is invalid.

Returns

Self.

Return type

KeyValueStore
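The all-or-nothing behaviour documented for remove_many() (no modification if any key is invalid) can be obtained by validating every key before deleting anything. The function below is a sketch over a plain dict; it is a hypothetical helper, not the package's implementation, and it omits the locking a thread-safe store would need around the check and the deletes.

```python
from typing import Any, Dict, Hashable, Iterable


def remove_many(table: Dict[Hashable, Any], keys: Iterable[Hashable]) -> None:
    # Materialize first: the iterable may be a one-shot generator, and
    # duplicates should not cause a second delete of the same key.
    unique_keys = set(keys)
    missing = [k for k in unique_keys if k not in table]
    if missing:
        # Fail before touching the table so a bad key leaves it unchanged.
        raise KeyError(missing)
    for k in unique_keys:
        del table[k]


table = {"a": 1, "b": 2, "c": 3}

try:
    remove_many(table, ["a", "missing"])
except KeyError:
    pass
unchanged_after_failure = dict(table)

remove_many(table, ["a", "b"])
```

An empty iterable passes the validation step trivially and removes nothing, matching the documented no-op case.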

values() Iterator[source]
Returns

Iterator over values in this store. Values are not guaranteed to be in any particular order.

Return type

collections.abc.Iterator