Datasets

A dataset is a central item in the pod that organizes your project data and label annotations. To make Dataset items easier to use in a data science workflow, the Dataset class provides methods to convert the data to a popular data science format, or to save a dataset to disk.

class Dataset

Dataset(**kwargs) :: Dataset

The main Dataset class

Dataset.to

Dataset.to(dtype:str, columns:List[str])

Converts Dataset to a different format.
Available formats:
list: a 2-dimensional list, containing one dataset entry per row
dict: a list of dicts, where each dict contains {column: value} for each column
pd: a Pandas dataframe

Args:
    dtype (str): Datatype of the returned dataset
    columns (List[str]): Column names of the dataset

Returns:
    Any: Dataset formatted according to dtype
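
To make the return shapes concrete, the sketch below builds the list and dict shapes by hand from two hypothetical entries. It does not call Dataset.to; the rows and column choices are invented for illustration, and whether the real output also carries an id field or a header row is not specified here.

```python
# Hypothetical data: two dataset entries, two columns each.
columns = ["data.content", "annotation.labelValue"]
rows = [
    ("content_0", "label_0"),
    ("content_1", "label_1"),
]

# "list"-style shape: a 2-dimensional list, one dataset entry per row.
as_list = [list(row) for row in rows]

# "dict"-style shape: a list of dicts, one {column: value} dict per entry.
as_dicts = [dict(zip(columns, row)) for row in rows]

print(as_list)
print(as_dicts)
```

The pd format holds the same information in tabular form; passing a list of dicts such as as_dicts to pandas.DataFrame yields a comparable dataframe.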

Dataset.save

Dataset.save(path:Union[Path, str], columns:List[str])

Save dataset to CSV.
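
The exact CSV layout that Dataset.save produces is not described here. As a rough sketch, assuming one header row with the column names followed by one row per entry, the same kind of file can be written and read back with the standard library. The helper names write_csv and read_csv and the sample entries are hypothetical, not part of the Dataset API.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical entries in the list-of-dicts shape described for dtype "dict".
columns = ["data.content", "annotation.labelValue"]
entries = [
    {"data.content": "content_0", "annotation.labelValue": "label_0"},
    {"data.content": "content_1", "annotation.labelValue": "label_1"},
]

def write_csv(path, columns, entries):
    """Write a header row with the column names, then one row per entry."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(entries)

def read_csv(path):
    """Read the file back as a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Round-trip through a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "example-dataset.csv"
    write_csv(csv_path, columns, entries)
    loaded = read_csv(csv_path)

print(loaded)
```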

Usage

To convert the data in the pod to a different format, Dataset implements the Dataset.to method. In the columns argument, you can define which features will be included in your dataset. A column is either a property of an entry in the dataset, or a property of an item connected to an entry in the dataset.

The pod uses the following schema for Dataset items. Note that a DatasetEntry item is always included, and the actual data can be found by traversing the entry.data edge.

[Figure: dataset schema]

For example, if a dataset is a set of Message items and the message content has to be included as a column, data.content would be the column name. If the name of the sender of a message has to be included, data.sender.handle would be a valid column name.
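
One way to picture how a dotted column name maps onto the schema is as a chain of attribute lookups that starts at the dataset entry. The sketch below is only an illustration under that assumption: the SimpleNamespace objects stand in for real pod items, resolve_column is a hypothetical helper rather than a pymemri function, and it assumes each edge points to exactly one item.

```python
from functools import reduce
from types import SimpleNamespace

# Hypothetical stand-ins for pod items: entry -> data (Message) -> sender (Account).
sender = SimpleNamespace(handle="account_0")
message = SimpleNamespace(content="content_0", sender=sender)
entry = SimpleNamespace(data=message)

def resolve_column(entry, column):
    """Follow a dotted column name one attribute at a time, starting at the entry."""
    return reduce(getattr, column.split("."), entry)

print(resolve_column(entry, "data.content"))
print(resolve_column(entry, "data.sender.handle"))
```

Here data.content takes one step from the entry to the message and reads its content, while data.sender.handle takes a second step to the connected sender before reading the handle.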

The following example retrieves an example dataset of Message items and converts it to a Pandas dataframe:

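# Connect to the pod
client = PodClient()
# Add the Dataset and DatasetEntry item types to the pod schema
client.add_to_schema(Dataset, DatasetEntry)
True
# Retrieve the dataset by name
dataset = client.get_dataset("example-dataset")
# Column names are paths that start at each dataset entry
columns = ["data.content", "data.sender.handle", "annotation.labelValue"]
dataframe = dataset.to("pd", columns=columns)
dataframe.head()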
                                 id  data.content  data.sender.handle  annotation.labelValue
0  507b0036d7a94d1a918696ce2735a3a1     content_0           account_0                label_0
1  e71467d84f9f400fbb40abde1db8cca1     content_1           account_1                label_1
2  1f71cf2ba429427baaad0eea441081d6     content_2           account_2                label_2
3  15c346152be549f099522606dd54ce4c     content_3           account_3                label_3
4  51c8e90f3432421a97eb939e6a022c94     content_4           account_4                label_4