# Client Side S3 Part High

**Module:** `sovai.utils.client_side_s3_part_high`

Advanced S3 Partitioned Data Loader

This module provides a high-performance interface for loading partitioned data from S3 with support for ticker and date-based partitioning schemes, parallel loading, and comprehensive filtering capabilities.

## Classes

### `PathBuilder`

```python
class PathBuilder
```

Utility class for building and managing S3 data paths.

**Methods**

### `clean_path()`

```python
def clean_path(path: str) -> str
```

Remove the `s3://` prefix, if present, for consistent path handling.

**Parameters**

| Parameter | Type  | Description |
| --------- | ----- | ----------- |
| `path`    | `str` | S3 path, with or without the `s3://` prefix |

**Returns:** `str`
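
The behavior described above can be illustrated with a minimal standalone sketch. The function name `strip_s3_prefix` is hypothetical and is not part of this module; the actual `clean_path()` implementation may differ in details.

```python
def strip_s3_prefix(path: str) -> str:
    """Return the path without a leading 's3://' scheme, if one is present.

    Hypothetical stand-in for illustration only, not the module's code.
    """
    prefix = "s3://"
    return path[len(prefix):] if path.startswith(prefix) else path


# Both spellings normalize to the same bucket-relative path.
print(strip_s3_prefix("s3://my-bucket/data/prices"))
print(strip_s3_prefix("my-bucket/data/prices"))
```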

***

### `build_ticker_path()`

```python
def build_ticker_path(
    base_path: str,
    ticker: str,
    has_year: bool = True,
    year: Optional[int] = None,
) -> str
```

Build a complete ticker-partitioned path.

**Parameters**

| Parameter   | Type            | Description                                                                   |
| ----------- | --------------- | ----------------------------------------------------------------------------- |
| `base_path` | `str`           | Base S3 path to ticker partitions                                             |
| `ticker`    | `str`           | Ticker symbol                                                                 |
| `has_year`  | `bool`          | Whether ticker partitions include year subdirectories. Default: `True`        |
| `year`      | `Optional[int]` | Optional year for year-partitioned data                                       |

**Returns**

Complete S3 path for the ticker partition
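
As a rough illustration of how a ticker path with an optional year subdirectory could be assembled, consider the sketch below. The directory naming (`ticker=.../year=...`) and the function name `sketch_ticker_path` are assumptions made for this example; this page does not document the actual on-disk layout, and the behavior when `has_year` is `True` but `year` is omitted is also assumed.

```python
from typing import Optional


def sketch_ticker_path(
    base_path: str,
    ticker: str,
    has_year: bool = True,
    year: Optional[int] = None,
) -> str:
    """Join a base path, a ticker directory, and an optional year directory.

    ASSUMPTION: Hive-style directory names are used purely for illustration;
    the real partition layout may be different.
    """
    path = f"{base_path.rstrip('/')}/ticker={ticker}"
    # Only append a year level when the dataset has one and a year was given.
    if has_year and year is not None:
        path += f"/year={year}"
    return path


print(sketch_ticker_path("bucket/prices", "AMGN", year=2020))
print(sketch_ticker_path("bucket/prices/", "PFE", has_year=False))
```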

***

***

### `PartitionFinder`

```python
class PartitionFinder
```

Methods for discovering data partitions in the S3 storage.

**Methods**

### `find_ticker_partitions()`

```python
def find_ticker_partitions(
    ticker_base_path: str,
    tickers: List[str],
    has_year: bool = True,
    start_year: Optional[int] = None,
    end_year: Optional[int] = None,
) -> List[Tuple[str, str]]
```

Find all valid ticker partitions with optional year filtering.

**Parameters**

| Parameter          | Type            | Description                                                            |
| ------------------ | --------------- | ---------------------------------------------------------------------- |
| `ticker_base_path` | `str`           | Base path to ticker partitions                                         |
| `tickers`          | `List[str]`     | List of tickers to search for                                          |
| `has_year`         | `bool`          | Whether ticker partitions include year subdirectories. Default: `True` |
| `start_year`       | `Optional[int]` | Optional starting year to filter by                                    |
| `end_year`         | `Optional[int]` | Optional ending year to filter by                                      |

**Returns**

List of tuples (path, ticker) for loading

***

### `find_date_partitions()`

```python
def find_date_partitions(
    date_base_path: str,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> List[str]
```

Find all date partitions within the specified range.

**Parameters**

| Parameter        | Type            | Description                                                                                                                                                       |
| ---------------- | --------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `date_base_path` | `str`           | Base path to date partitions                                                                                                                                      |
| `start_date`     | `Optional[str]` | Optional start date in YYYY-MM-DD format                                                                                                                          |
| `end_date`       | `Optional[str]` | Optional end date in YYYY-MM-DD format. If not provided and `start_date` is provided, defaults to today + 7 days to include recent and near-future data          |

**Returns**

List of S3 paths for matching date partitions
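
The `end_date` default described in the parameter table can be sketched as follows. This is a standalone illustration, not the module's code: `resolve_date_range` and its `today` argument are hypothetical names, and returning `None` for both bounds when no dates are given is an assumption.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def resolve_date_range(
    start_date: Optional[str],
    end_date: Optional[str],
    today: Optional[date] = None,
) -> Tuple[Optional[date], Optional[date]]:
    """Parse YYYY-MM-DD strings and apply the 'today + 7 days' end default."""
    today = today or date.today()
    start = date.fromisoformat(start_date) if start_date else None
    if end_date:
        end = date.fromisoformat(end_date)
    elif start is not None:
        # No explicit end: extend one week past today, as documented above.
        end = today + timedelta(days=7)
    else:
        # ASSUMPTION: with no dates at all, the range is left unbounded.
        end = None
    return start, end


# `today` is pinned here so the result is reproducible.
print(resolve_date_range("2023-01-01", None, today=date(2023, 6, 1)))
```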

***

***

### `DataLoader`

```python
class DataLoader
```

Core data loading functionality with filtering and parallelism.

**Methods**

### `load_partition()`

```python
def load_partition(
    path: str,
    ticker_filter: Optional[List[str]] = None,
    columns: Optional[List[str]] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
) -> pd.DataFrame
```

Load and filter data from a single partition. Logs column selection details for debugging.

**Parameters**

| Parameter       | Type                  | Description                              |
| --------------- | --------------------- | ---------------------------------------- |
| `path`          | `str`                 | S3 path to the partition                 |
| `ticker_filter` | `Optional[List[str]]` | Optional list of tickers to filter by    |
| `columns`       | `Optional[List[str]]` | Optional list of columns to load         |
| `start_date`    | `Optional[str]`       | Optional start date for filtering        |
| `end_date`      | `Optional[str]`       | Optional end date for filtering          |

**Returns**

Filtered pandas DataFrame

***

### `load_data_parallel()`

```python
def load_data_parallel(tasks: List[Tuple], max_workers: int = 8) -> List[pd.DataFrame]
```

Process loading tasks in parallel with progress tracking.

**Parameters**

| Parameter     | Type          | Description  |
| ------------- | ------------- | ------------ |
| `tasks`       | `List[Tuple]` | Loading tasks to process, one tuple per partition     |
| `max_workers` | `int`         | Maximum number of concurrent workers. Default: `8`    |

**Returns:** `List[pd.DataFrame]`
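
The general pattern of running loading tasks concurrently while reporting progress might look like the sketch below, which uses only the standard library. It is not the module's implementation: `run_tasks_parallel` and `fake_loader` are hypothetical, the use of a thread pool is assumed, and the loader returns strings where the real one would return DataFrames.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, List, Tuple


def run_tasks_parallel(
    loader: Callable[..., Any],
    tasks: List[Tuple],
    max_workers: int = 8,
) -> List[Any]:
    """Call loader(*task) for each task in a thread pool, keeping input order."""
    results: List[Any] = [None] * len(tasks)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the index of the task that produced it.
        futures = {pool.submit(loader, *task): i for i, task in enumerate(tasks)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            print(f"loaded {done}/{len(tasks)} partitions")
    return results


def fake_loader(path: str, ticker: str) -> str:
    # Hypothetical stand-in for a real partition load.
    return f"{ticker}@{path}"


print(run_tasks_parallel(fake_loader, [("p1", "AMGN"), ("p2", "PFE")], max_workers=2))
```

Storing results by task index keeps the output order stable even though tasks may finish in any order.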

***

***

## Functions

### `get_s3_filesystem()`

```python
def get_s3_filesystem(provider: str = 'digitalocean') -> S3FileSystem
```

Get cached S3 filesystem for the specified provider.

**Parameters**

| Parameter  | Type  | Description                                         |
| ---------- | ----- | --------------------------------------------------- |
| `provider` | `str` | Cloud provider identifier (default: "digitalocean") |

**Returns**

Authenticated S3FileSystem instance with caching
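
One common way to cache an object per provider is `functools.lru_cache`, sketched below. Whether this module caches its filesystem this way is not stated on this page, so treat it as an assumption; `FakeFileSystem` and `get_filesystem` are hypothetical stand-ins and no real S3 connection is made.

```python
from functools import lru_cache


class FakeFileSystem:
    """Hypothetical placeholder for an authenticated filesystem object."""

    def __init__(self, provider: str) -> None:
        self.provider = provider


@lru_cache(maxsize=None)
def get_filesystem(provider: str) -> FakeFileSystem:
    # The body runs once per distinct provider; later calls reuse the instance.
    return FakeFileSystem(provider)


first = get_filesystem("digitalocean")
second = get_filesystem("digitalocean")
print(first is second)
```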

***

### `load_data()`

```python
def load_data(
    ticker_path: str = '',
    date_path: str = '',
    has_year: bool = True,
    tickers: Optional[List[str]] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    columns: Optional[List[str]] = None,
    max_workers: int = 8,
    post_process: Optional[Callable[[pd.DataFrame], pd.DataFrame]] = None,
) -> pd.DataFrame
```

Load data from S3 using both ticker and date partitioning schemes.

**Parameters**

| Parameter      | Type                                                | Description                                                                                                                                               |
| -------------- | --------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ticker_path`  | `str`                                               | Base path for ticker partitions. Default: `''`                                                                                                            |
| `date_path`    | `str`                                               | Base path for date partitions. Default: `''`                                                                                                              |
| `has_year`     | `bool`                                              | Whether ticker partitions include year subdirectories. Default: `True`                                                                                    |
| `tickers`      | `Optional[List[str]]`                               | Optional list of tickers to filter by                                                                                                                     |
| `start_date`   | `Optional[str]`                                     | Optional start date in YYYY-MM-DD format                                                                                                                  |
| `end_date`     | `Optional[str]`                                     | Optional end date in YYYY-MM-DD format. If not provided and `start_date` is provided, defaults to today + 7 days to include recent and near-future data  |
| `columns`      | `Optional[List[str]]`                               | Optional list of columns to load                                                                                                                          |
| `max_workers`  | `int`                                               | Maximum number of concurrent loading threads. Default: `8`                                                                                                |
| `post_process` | `Optional[Callable[[pd.DataFrame], pd.DataFrame]]`  | Optional function to apply to the final DataFrame                                                                                                         |

**Returns**

Combined pandas DataFrame with the requested data

***

### `load_frame_s3_partitioned_high()`

```python
def load_frame_s3_partitioned_high(
    endpoint: str,
    tickers: Optional[Union[str, List[str]]] = None,
    columns: Optional[List[str]] = None,
    start_date: Optional[str] = None,
    end_date: Optional[str] = None,
    post_process: Optional[Callable[[pd.DataFrame], pd.DataFrame]] = None,
) -> pd.DataFrame
```

Load data for the specified endpoint with ticker and date filtering.

**Parameters**

| Parameter      | Type                                                | Description                                                                                                                                               |
| -------------- | --------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `endpoint`     | `str`                                               | Name of the data endpoint (e.g., "clinical\_trials")                                                                                                      |
| `tickers`      | `Optional[Union[str, List[str]]]`                   | Optional ticker or list of tickers to filter by                                                                                                           |
| `columns`      | `Optional[List[str]]`                               | Optional list of columns to load                                                                                                                          |
| `start_date`   | `Optional[str]`                                     | Optional start date in YYYY-MM-DD format                                                                                                                  |
| `end_date`     | `Optional[str]`                                     | Optional end date in YYYY-MM-DD format. If not provided and `start_date` is provided, defaults to today + 7 days to include recent and near-future data  |
| `post_process` | `Optional[Callable[[pd.DataFrame], pd.DataFrame]]`  | Optional function to apply to the final DataFrame                                                                                                         |

**Returns**

DataFrame with the requested data

**Examples**

```python
>>> from sovai.utils.client_side_s3_part_high import load_frame_s3_partitioned_high
>>>
>>> # Ticker and date query
>>> df = load_frame_s3_partitioned_high(
...     "clinical_trials",
...     tickers=["AMGN", "PFE"],
...     start_date="2020-01-01",
...     end_date="2020-12-31",
... )
>>>
>>> # Date-only query (from January 1, 2023 to today + 7 days)
>>> df = load_frame_s3_partitioned_high(
...     "spending/awards",
...     start_date="2023-01-01",
... )
>>>
>>> # Date-only query with explicit end date
>>> df = load_frame_s3_partitioned_high(
...     "spending/awards",
...     start_date="2023-01-01",
...     end_date="2023-12-31",
... )
```
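
Because `tickers` accepts either a single string or a list, a caller-side normalization step could look like the sketch below. `normalize_tickers` is a hypothetical helper written for illustration; how the module itself handles a single string internally is not documented here.

```python
from typing import List, Optional, Union


def normalize_tickers(
    tickers: Optional[Union[str, List[str]]],
) -> Optional[List[str]]:
    """Wrap a single ticker string in a list; pass lists and None through."""
    if tickers is None:
        return None
    if isinstance(tickers, str):
        return [tickers]
    # Copy so later changes to the caller's list do not affect the result.
    return list(tickers)


print(normalize_tickers("AMGN"))
print(normalize_tickers(["AMGN", "PFE"]))
```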

***

