Dataset

Module superwise.controller.dataset

This module implements dataset functionality.

Functions

create_file_from_dataframe(dataset: superwise.models.dataset.Dataset, dataframe: pandas.core.frame.DataFrame)

Classes

DatasetController(client, sw, internal_bucket)

Datasets controller class. Implements the functionality of the dataset API.

Args:

client: superwise client object

sw: superwise object

Ancestors (in MRO)

  • superwise.controller.base.BaseController
  • abc.ABC

Methods

create(self, model: superwise.models.dataset.Dataset, return_model=True, gcs_service_account: dict = None, aws_access_key_id: str = None, aws_secret_access_key: str = None, aws_role_arn: str = None, azure_connection_string: str = None, wait_until_complete: bool = True, timeout_seconds: int = 300, on_failure='raise', **kwargs)

Description:

Create a new dataset.

Args:

model: Dataset model.

return_model: If True, return the model; if False, return response.body. Default True.

gcs_service_account: GCP service account object used to authenticate and pull dataset files from a customer GCS bucket. If not provided, will be inferred from the environment. (See Google Cloud auth)

aws_access_key_id: AWS access key ID used to authenticate and pull dataset files from a customer S3 bucket. If not provided, will be inferred from the environment. (Used together with aws_secret_access_key parameter)

aws_secret_access_key: AWS secret access key used to authenticate and pull dataset files from a customer S3 bucket. If not provided, will be inferred from the environment. (Used together with aws_access_key_id parameter)

aws_role_arn: AWS role ARN used to authenticate and pull dataset files from a customer S3 bucket. If not provided, the authentication will use the aws_access_key_id and aws_secret_access_key parameters.

azure_connection_string: Azure Blob Storage connection string used to authenticate and pull dataset files from a customer blob storage container. MUST be provided in order to pull files from Azure.

wait_until_complete: If True, wait until the dataset is fully processed in the system, and return the final object. If False, return immediately after the dataset is created and the given dataset files are validated, without waiting for the processing; in that case a partially populated Dataset object is returned, without the processed fields, and the status can be checked afterwards with the 'get_by_id' method. Default True.

timeout_seconds: Maximum time, in seconds, to wait for dataset processing. Only relevant if 'wait_until_complete' is True. Default 300 (5 minutes).

on_failure: Action to take in case the dataset processing failed. Only relevant if 'wait_until_complete' is True. Possible values are:

  • 'ignore': Don't raise an exception, and return the object.
  • 'raise': Raise a 'SuperwiseDatasetFailureError' exception.

Default 'raise'.
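
The interaction between 'wait_until_complete', 'timeout_seconds' and 'on_failure' can be illustrated with a small stand-alone sketch. This is not the SDK's implementation: the helper wait_for_dataset, the fetch_status callable, the poll_interval parameter, the status strings 'pending', 'completed' and 'failed', and the DatasetFailureError class are all hypothetical names chosen for illustration. Raising TimeoutError when the deadline passes is also an assumption, since the timeout behaviour is not specified here.

```python
import time
from typing import Callable


class DatasetFailureError(Exception):
    """Hypothetical stand-in for the SDK's SuperwiseDatasetFailureError."""


def wait_for_dataset(
    fetch_status: Callable[[], str],
    timeout_seconds: float = 300,
    on_failure: str = "raise",
    poll_interval: float = 1.0,
) -> str:
    """Poll fetch_status() until processing ends or the timeout expires.

    fetch_status is a hypothetical callable that returns one of the
    assumed status strings: 'pending', 'completed' or 'failed'.
    """
    if on_failure not in ("raise", "ignore"):
        raise ValueError("on_failure must be 'raise' or 'ignore'")

    # A monotonic clock is unaffected by system clock adjustments.
    deadline = time.monotonic() + timeout_seconds

    while True:
        status = fetch_status()
        if status == "completed":
            return status
        if status == "failed":
            if on_failure == "raise":
                raise DatasetFailureError("dataset processing failed")
            # on_failure == 'ignore': hand the failed status back to the caller.
            return status
        if time.monotonic() >= deadline:
            # Assumption: a timeout is reported as an exception.
            raise TimeoutError(
                f"dataset not processed within {timeout_seconds} seconds"
            )
        time.sleep(poll_interval)
```

With 'wait_until_complete' set to False, a caller would skip this kind of loop entirely and check the status later, for example through 'get_by_id'.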