FeatureUtils: Simplified Feature Management for AI Model Training

FeatureUtils is a lightweight Python library that simplifies handling (precomputed) features in AI model training pipelines. It focuses on IO and storage management for feature datasets, making it easy to scale feature storage while maintaining flexibility and performance.

Whether you're dealing with large-scale machine learning workflows or managing features across multiple experiments, FeatureUtils provides intuitive APIs for saving, loading, staging, and querying features. It is designed to integrate seamlessly with PyTorch and supports efficient storage backends that limit inode usage when working with large-scale datasets.

Key Features

Flexible Feature Management: Save, load, and manage multiple features for training instances with minimal effort.
Efficient Storage: Supports optimized storage backends that aggregate shard-based data partitioning.
Staging for Speed: Leverage fast storage (e.g., SSDs) to stage data for rapid random access.
PyTorch Integration: Directly create PyTorch datasets from your features.
Scalable Metadata Handling: Efficiently list, query, and manage metadata for large feature collections.

Installation

To install the latest version of FeatureUtils from the repository, use pip to install the package and its dependencies:

pip install git+https://github.com/JonaRuthardt/featureutils.git

Getting Started

Here’s a quick example to demonstrate how to use FeatureUtils in your AI training pipeline.

Initialization

from featureutils.core import FeatureUtils

# Initialize FeatureUtils with a ZIP storage backend
feature_utils = FeatureUtils(
    base_dir="path/to/storage", # directory where features will be saved to
    staging_dir="path/to/staging", # (optional) directory for data staging
    feature_num=1, # number of individual features for a given key
    shard_size=10000 # maximum size of individual shards
)

Save Features

import torch

# Save features for a specific key
key = "example_id"
feature_utils.save_feature(key, feature1=torch.rand(256), feature2=torch.rand(512))

Load Features

# Load features for a specific key
features = feature_utils.load_feature("example_id", feature_names=["feature1", "feature2"])

Create a PyTorch Dataset

# Get a Torch dataset for selected keys and features
keys = ["example_id1", "example_id2"] # (optional) specifications of keys to include
features = ["feature1", "feature2"] # (optional) specification of features to include
dataset = feature_utils.get_dataset(keys=keys, features=features)

Query and Manage Features

# List all keys
keys = feature_utils.list_keys()

# Check if a key exists
exists = feature_utils.exists("example_id")

# Delete a feature
feature_utils.delete_feature("example_id")
feature_utils.delete_features(["example_id1", "example_id2"])

Citation

If you use FeatureUtils in your research, please consider citing it with the following BibTeX entry:

@misc{featureutils,
  author = Jona Ruthardt,
  title = FeatureUtils: Simplified Feature Management for AI Training,
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/JonaRuthardt/FeatureUtils}},
}

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact and Contributing

For questions or support, open an issue on the GitHub repository. Contributions and feedback are always welcome.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
featureutils		featureutils
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FeatureUtils: Simplified Feature Management for AI Model Training

Key Features

Installation

Getting Started

Initialization

Save Features

Load Features

Create a PyTorch Dataset

Query and Manage Features

Citation

License

Contact and Contributing

About

Releases

Packages

Languages

License

JonaRuthardt/FeatureUtils

Folders and files

Latest commit

History

Repository files navigation

FeatureUtils: Simplified Feature Management for AI Model Training

Key Features

Installation

Getting Started

Initialization

Save Features

Load Features

Create a PyTorch Dataset

Query and Manage Features

Citation

License

Contact and Contributing

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages