Client Side S3 Part High

Advanced S3 Partitioned Data Loader

Module: sovai.utils.client_side_s3_part_high

This module provides a high-performance interface for loading partitioned data from S3 with support for ticker and date-based partitioning schemes, parallel loading, and comprehensive filtering capabilities.

Classes

PathBuilder

class PathBuilder

Utility class for building and managing S3 data paths.

Methods

clean_path()

def clean_path(path: str) -> str

Remove s3:// prefix if present for consistent path handling.

Parameters

path (str): S3 path, with or without a leading s3:// prefix.

Returns: str
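The prefix handling described above can be sketched as follows. Only the signature comes from this page; the body is an assumption about how the prefix might be stripped, written here as a standalone function rather than a method of PathBuilder.

```python
def clean_path(path: str) -> str:
    """Strip a leading 's3://' so that paths compare consistently."""
    prefix = "s3://"
    # Assumed behavior: only a leading prefix is removed; the rest is untouched.
    return path[len(prefix):] if path.startswith(prefix) else path


print(clean_path("s3://bucket/data/prices"))
print(clean_path("bucket/data/prices"))
```

Both calls yield the same bucket-relative path, which is the point of normalizing before any comparison or join.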


build_ticker_path()

Build a complete ticker-partitioned path.

Parameters

base_path (str): Base S3 path to ticker partitions.
ticker: Ticker symbol.
has_year: Whether ticker partitions include year subdirectories.
year: Optional year for year-partitioned data.

Returns

Complete S3 path for the ticker partition
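As an illustration of how such a path might be assembled, here is a minimal sketch. The directory layout (`ticker=<SYMBOL>/year=<YEAR>`), the function name, and the default values are all assumptions made for this example; the page does not state the actual naming scheme.

```python
from typing import Optional


def build_ticker_path_sketch(
    base_path: str,
    ticker: str,
    has_year: bool = False,
    year: Optional[int] = None,
) -> str:
    """Join base path, ticker, and (optionally) year into one partition path."""
    # Hypothetical Hive-style layout: <base>/ticker=<T>[/year=<Y>]
    path = f"{base_path.rstrip('/')}/ticker={ticker}"
    if has_year and year is not None:
        path += f"/year={year}"
    return path


print(build_ticker_path_sketch("bucket/prices", "AAPL"))
print(build_ticker_path_sketch("bucket/prices/", "AAPL", has_year=True, year=2023))
```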



PartitionFinder

Methods for discovering data partitions in the S3 storage.

Methods

find_ticker_partitions()

Find all valid ticker partitions with optional year filtering.

Parameters

ticker_base_path (str): Base path to ticker partitions.
tickers: List of tickers to search for.
has_year: Whether ticker partitions include year subdirectories.
start_year: Optional starting year to filter by.
end_year: Optional ending year to filter by.

Returns

List of tuples (path, ticker) for loading
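The discovery step can be pictured with an in-memory listing standing in for S3. This is a sketch under assumptions: the `existing` argument, the function name, and the `ticker=`/`year=` layout are hypothetical, and a real implementation would list objects through the filesystem instead.

```python
from typing import Iterable, List, Optional, Tuple


def find_ticker_partitions_sketch(
    existing: Iterable[str],
    ticker_base_path: str,
    tickers: List[str],
    has_year: bool = False,
    start_year: Optional[int] = None,
    end_year: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Return (path, ticker) pairs for partitions present in `existing`."""
    known = set(existing)
    base = ticker_base_path.rstrip("/")
    found: List[Tuple[str, str]] = []
    for ticker in tickers:
        ticker_dir = f"{base}/ticker={ticker}"
        if not has_year:
            if ticker_dir in known:
                found.append((ticker_dir, ticker))
            continue
        # Year-partitioned layout: keep only year subdirectories in range.
        prefix = f"{ticker_dir}/year="
        for path in sorted(known):
            suffix = path[len(prefix):] if path.startswith(prefix) else ""
            if not suffix.isdigit():
                continue
            year = int(suffix)
            if start_year is not None and year < start_year:
                continue
            if end_year is not None and year > end_year:
                continue
            found.append((path, ticker))
    return found


listing = [
    "b/p/ticker=AAPL/year=2021",
    "b/p/ticker=AAPL/year=2022",
    "b/p/ticker=AAPL/year=2023",
    "b/p/ticker=MSFT/year=2022",
]
pairs = find_ticker_partitions_sketch(
    listing, "b/p", ["AAPL", "TSLA"], has_year=True, start_year=2022
)
print(pairs)
```

Tickers with no matching partition (TSLA in the listing above) simply contribute nothing to the result.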


find_date_partitions()

Find all date partitions within the specified range.

Parameters

date_base_path (str): Base path to date partitions.
start_date: Optional start date in YYYY-MM-DD format.
end_date: Optional end date in YYYY-MM-DD format. If not provided and start_date is provided, defaults to today + 7 days to include recent and near-future data.

Returns

List of S3 paths for matching date partitions
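The end-date default is the subtle part of this method, so a small sketch may help. It assumes partition directories are named `date=YYYY-MM-DD` and takes the candidate paths and a `today` override as arguments; both the function name and those arguments are hypothetical and exist only to make the example self-contained and deterministic.

```python
from datetime import date, timedelta
from typing import Iterable, List, Optional


def filter_date_partitions_sketch(
    paths: Iterable[str],
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    today: Optional[date] = None,
) -> List[str]:
    """Keep paths whose date falls inside [start_date, end_date]."""
    today = today or date.today()
    start = date.fromisoformat(start_date) if start_date else None
    if end_date:
        end = date.fromisoformat(end_date)
    elif start is not None:
        # Documented default: a start date without an end date means today + 7 days.
        end = today + timedelta(days=7)
    else:
        end = None

    marker = "date="  # assumed partition naming
    matches: List[str] = []
    for path in sorted(paths):
        if marker not in path:
            continue
        day = date.fromisoformat(path.rsplit(marker, 1)[1][:10])
        if start is not None and day < start:
            continue
        if end is not None and day > end:
            continue
        matches.append(path)
    return matches


candidates = ["b/d/date=2024-01-01", "b/d/date=2024-01-05", "b/d/date=2024-01-20"]
selected = filter_date_partitions_sketch(
    candidates, start_date="2024-01-03", today=date(2024, 1, 10)
)
print(selected)
```

With `today` fixed at 2024-01-10, the implied end date is 2024-01-17, so only the 2024-01-05 partition is selected.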



DataLoader

Core data loading functionality with filtering and parallelism.

Methods

load_partition()

Load and filter data from a single partition. Column selection is logged to aid debugging.

Parameters

path (str): S3 path to the partition.
ticker_filter: Optional list of tickers to filter by.
columns: Optional list of columns to load.
start_date: Optional start date for filtering.
end_date: Optional end date for filtering.

Returns

Filtered pandas DataFrame
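The filtering half of this method can be sketched on an in-memory frame. The sketch assumes the data carries `ticker` and `date` columns and skips the S3 read entirely; the function name and column names are hypothetical.

```python
import pandas as pd


def filter_frame_sketch(df, ticker_filter=None, columns=None,
                        start_date=None, end_date=None):
    """Apply ticker, date, and column filters to an already-loaded frame."""
    if ticker_filter is not None:
        df = df[df["ticker"].isin(ticker_filter)]
    if start_date is not None:
        df = df[df["date"] >= pd.Timestamp(start_date)]
    if end_date is not None:
        df = df[df["date"] <= pd.Timestamp(end_date)]
    if columns is not None:
        # Select columns last so the filter columns are still available above.
        df = df[list(columns)]
    return df.reset_index(drop=True)


frame = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "date": pd.to_datetime(["2024-01-02", "2024-02-01", "2024-01-15"]),
    "value": [1.0, 2.0, 3.0],
})
result = filter_frame_sketch(
    frame,
    ticker_filter=["AAPL"],
    columns=["date", "value"],
    start_date="2024-01-01",
    end_date="2024-01-31",
)
print(result)
```

Applying the column selection after the row filters is a design choice of this sketch: it lets a caller request columns that exclude `ticker` without breaking the ticker filter.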


load_data_parallel()

Process loading tasks in parallel with progress tracking.

Parameters

tasks (List[Tuple])
max_workers (int): Default: 8

Returns: List[pd.DataFrame]
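A thread pool is one common way to run such tasks concurrently. The sketch below uses Python's standard `concurrent.futures`; the `loader` argument, the function name, and the printed progress line are assumptions, and the stand-in loader returns strings instead of DataFrames so that the example runs without S3 access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, List, Tuple


def run_tasks_parallel_sketch(
    tasks: List[Tuple],
    loader: Callable[..., Any],
    max_workers: int = 8,
) -> List[Any]:
    """Run loader(*task) for every task in a thread pool and collect results."""
    results: List[Any] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(loader, *task) for task in tasks]
        # as_completed yields futures as they finish, not in submission order.
        for done, future in enumerate(as_completed(futures), start=1):
            results.append(future.result())
            print(f"loaded {done}/{len(futures)} partitions")
    return results


def fake_loader(path: str, ticker: str) -> str:
    # Stand-in for a real partition read.
    return f"{ticker}@{path}"


loaded = run_tasks_parallel_sketch(
    [("b/p/ticker=AAPL", "AAPL"), ("b/p/ticker=MSFT", "MSFT")],
    fake_loader,
    max_workers=2,
)
```

Results arrive in completion order, so any caller that needs a stable order should sort or key the output afterwards.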



Functions

get_s3_filesystem()

Get cached S3 filesystem for the specified provider.

Parameters

provider (str): Cloud provider identifier (default: "digitalocean").

Returns

Authenticated S3FileSystem instance with caching
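The caching behavior can be illustrated with `functools.lru_cache`. This is only one possible approach, not necessarily the one the module uses. `FakeFileSystem` is a stand-in for an authenticated S3FileSystem so that the sketch runs without credentials, and "other-provider" is a made-up identifier.

```python
from functools import lru_cache


class FakeFileSystem:
    """Stand-in for an authenticated filesystem client."""

    def __init__(self, provider: str) -> None:
        self.provider = provider


@lru_cache(maxsize=None)
def get_filesystem_sketch(provider: str = "digitalocean") -> FakeFileSystem:
    # The body runs once per distinct argument; later calls reuse the instance.
    return FakeFileSystem(provider)


first = get_filesystem_sketch("digitalocean")
second = get_filesystem_sketch("digitalocean")
other = get_filesystem_sketch("other-provider")
print(first is second, first is other)
```

Reusing one client per provider avoids repeating authentication and connection setup on every load.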


load_data()

Load data from S3 using both ticker and date partitioning schemes.

Parameters

ticker_path (str): Base path for ticker partitions.
date_path: Base path for date partitions.
has_year: Whether ticker partitions include year subdirectories.
tickers: Optional list of tickers to filter by.
start_date: Optional start date in YYYY-MM-DD format.
end_date: Optional end date in YYYY-MM-DD format. If not provided and start_date is provided, defaults to today + 7 days to include recent and near-future data.
columns: Optional list of columns to load.
max_workers: Maximum number of concurrent loading threads.
post_process: Optional function to apply to the final DataFrame.

Returns

Combined pandas DataFrame with the requested data
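The final combine step, including the optional `post_process` hook, might look like the sketch below. It assumes the per-partition frames are simply concatenated and that empty frames are skipped; the function name and both of those choices are assumptions, since this page does not describe how the frames are merged.

```python
from typing import Callable, List, Optional

import pandas as pd


def combine_frames_sketch(
    frames: List[pd.DataFrame],
    post_process: Optional[Callable[[pd.DataFrame], pd.DataFrame]] = None,
) -> pd.DataFrame:
    """Concatenate partition frames, then apply post_process if given."""
    non_empty = [frame for frame in frames if not frame.empty]
    if not non_empty:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    combined = pd.concat(non_empty, ignore_index=True)
    return post_process(combined) if post_process is not None else combined


parts = [
    pd.DataFrame({"ticker": ["AAPL"], "value": [1]}),
    pd.DataFrame(),
    pd.DataFrame({"ticker": ["MSFT"], "value": [2]}),
]
combined = combine_frames_sketch(
    parts,
    post_process=lambda df: df.sort_values("value", ascending=False).reset_index(drop=True),
)
print(combined)
```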


load_frame_s3_partitioned_high()

Load data for the specified endpoint with ticker and date filtering.

Parameters

endpoint (str): Name of the data endpoint (e.g., "clinical_trials").
tickers: Optional ticker or list of tickers to filter by.
columns: Optional list of columns to load.
start_date: Optional start date in YYYY-MM-DD format.
end_date: Optional end date in YYYY-MM-DD format. If not provided and start_date is provided, defaults to today + 7 days to include recent and near-future data.
post_process: Optional function to apply to the final DataFrame.

Returns

DataFrame with the requested data

Examples
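The call shape can be sketched with a stand-in that mirrors the documented parameters and returns the normalized request instead of loading data. `describe_request` is hypothetical and is not part of the module; the keyword names follow the parameter list above, and the defaults are assumed. It also shows one way a single ticker string could be accepted alongside a list.

```python
from typing import Callable, Dict, List, Optional, Union


def describe_request(
    endpoint: str,
    tickers: Union[str, List[str], None] = None,
    columns: Optional[List[str]] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    post_process: Optional[Callable] = None,
) -> Dict[str, object]:
    """Mirror the documented arguments and report how they would be read."""
    if isinstance(tickers, str):
        # A single ticker is wrapped so later code can always iterate a list.
        tickers = [tickers]
    return {
        "endpoint": endpoint,
        "tickers": list(tickers) if tickers is not None else None,
        "columns": list(columns) if columns is not None else None,
        "start_date": start_date,
        "end_date": end_date,
        "has_post_process": post_process is not None,
    }


request = describe_request("clinical_trials", tickers="AAPL", start_date="2024-01-01")
print(request)
```

A call to the real `load_frame_s3_partitioned_high` would presumably pass the same arguments, with `end_date` left unset to pick up the documented today + 7 days default.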

