# Pandas Extensions

**Module:** `sovai.extensions.pandas_extensions`

## Classes

### `CustomDataFrame`

```python
class CustomDataFrame(pd.DataFrame)
```

**Attributes**

* `attrs`

**Methods**

### `filter()`

```python
def filter(
    self,
    conditions: Union[str, List[str]],
    verbose: bool = False,
) -> CustomDataFrame
```

Filter the DataFrame based on given conditions.

**Parameters**

| Parameter    | Type                    | Description                                                      |
| ------------ | ----------------------- | ---------------------------------------------------------------- |
| `conditions` | `Union[str, List[str]]` | A string or list of strings describing the filtering conditions. |
| `verbose`    | `bool`                  | If True, print detailed information about the filtering process. |

**Returns**

A filtered `CustomDataFrame`.
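
This page does not specify the condition syntax that `filter()` accepts. As a rough illustration only, the sketch below assumes each condition is a pandas `query` expression and applies the conditions one after another. The helper `filter_sketch` and the sample columns are hypothetical and are not part of `sovai`.

```python
from typing import List, Union

import pandas as pd


def filter_sketch(df: pd.DataFrame, conditions: Union[str, List[str]]) -> pd.DataFrame:
    """Hypothetical helper: apply each condition in turn with DataFrame.query."""
    if isinstance(conditions, str):
        conditions = [conditions]  # accept a single string as well as a list
    out = df
    for cond in conditions:
        out = out.query(cond)  # each condition narrows the previous result
    return out


# Made-up sample data, for illustration only.
prices = pd.DataFrame(
    {
        "ticker": ["AAPL", "MSFT", "TSLA", "NVDA"],
        "close": [190.0, 410.0, 175.0, 880.0],
        "volume": [50, 20, 90, 40],
    }
)

filtered = filter_sketch(prices, ["close > 180", "volume >= 40"])
print(filtered["ticker"].tolist())  # ['AAPL', 'NVDA']
```

Applying the conditions sequentially is equivalent to joining them with a logical AND.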

***

### `merge_data()`

```python
def merge_data(self, column: str) -> CustomDataFrame
```

Merge the current DataFrame with the combined DataFrame based on ticker and a specified column.

**Parameters**

| Parameter | Type  | Description                                      |
| --------- | ----- | ------------------------------------------------ |
| `column`  | `str` | The column from the combined DataFrame to merge. |

**Returns**

A new `CustomDataFrame` with the merged data.

***

### `cointegration()`

```python
def cointegration(self, on = 'ticker', shift = 12)
```

Calculate an approximate cointegration proxy using shifted cosine similarity.

**Parameters**

| Parameter | Type | Description                                                                |
| --------- | ---- | -------------------------------------------------------------------------- |
| `on`      | —    | The level of the MultiIndex to group by (default `'ticker'`).              |
| `shift`   | —    | The number of periods to shift for the lagged comparison (default `12`).   |

**Returns**

A DataFrame of shifted cosine similarities.
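
The exact computation behind `cointegration()` is not documented here. One plausible reading of "shifted cosine similarity" is sketched below: pivot the panel so each ticker is a column, then take the cosine similarity between every current series and every series lagged by `shift` periods. The function `shifted_cosine_similarity` and the synthetic data are assumptions made for illustration, not the library's implementation.

```python
import numpy as np
import pandas as pd


def shifted_cosine_similarity(wide: pd.DataFrame, shift: int = 12) -> pd.DataFrame:
    """Hypothetical helper: cosine similarity of each column vs. each lagged column."""
    if not 0 < shift < len(wide):
        raise ValueError("shift must be between 1 and len(wide) - 1")
    current = wide.iloc[shift:].to_numpy(dtype=float)   # values at time t
    lagged = wide.iloc[:-shift].to_numpy(dtype=float)   # values at time t - shift
    # Scale every column to unit length so the dot product is a cosine.
    current = current / np.linalg.norm(current, axis=0, keepdims=True)
    lagged = lagged / np.linalg.norm(lagged, axis=0, keepdims=True)
    return pd.DataFrame(current.T @ lagged, index=wide.columns, columns=wide.columns)


# Synthetic monthly panel indexed by (ticker, date).
dates = pd.date_range("2020-01-01", periods=36, freq="MS")
index = pd.MultiIndex.from_product([["AAA", "BBB"], dates], names=["ticker", "date"])
t = np.arange(36)
values = np.concatenate([np.sin(2 * np.pi * t / 12), np.cos(2 * np.pi * t / 12)])
panel = pd.Series(values, index=index, name="value")

wide = panel.unstack("ticker")  # one column per ticker
similarity = shifted_cosine_similarity(wide, shift=12)
print(similarity.round(3))
```

In this toy data `AAA` repeats every 12 months, so its similarity with its own 12-month lag is close to 1, while its similarity with the lagged `BBB` series is close to 0.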

***

### `normalize_min_max()`

```python
def normalize_min_max(matrix)
```

Apply Min-Max normalization.

**Parameters**

| Parameter | Type | Description |
| --------- | ---- | ----------- |
| `matrix`  | —    | The matrix to normalize. |
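
Min-max normalization rescales values to the range 0 to 1 via `(x - min) / (max - min)`. The sketch below is a minimal NumPy version; it assumes the minimum and maximum are taken over the whole matrix and that a constant matrix maps to zeros. Whether `normalize_min_max()` scales globally or per column is not stated on this page, and `min_max` is a hypothetical name.

```python
import numpy as np


def min_max(matrix) -> np.ndarray:
    """Hypothetical helper: rescale all entries of `matrix` to [0, 1]."""
    matrix = np.asarray(matrix, dtype=float)
    lo, hi = matrix.min(), matrix.max()
    if hi == lo:
        # Assumption: a constant matrix has no spread, so return zeros.
        return np.zeros_like(matrix)
    return (matrix - lo) / (hi - lo)


scaled = min_max([[2.0, 4.0], [6.0, 10.0]])
print(scaled)  # smallest entry becomes 0.0, largest becomes 1.0
```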

***

### `select_features()`

```python
def select_features(
    self,
    method = 'random_projection',
    n_components = None,
    variability = 0.9,
)
```

Selects features based on importance scores from various methods.

**Parameters**

| Parameter      | Type | Description                                                                                                                 |
| -------------- | ---- | --------------------------------------------------------------------------------------------------------------------------- |
| `method`       | —    | The method to use for calculating feature importance ('random\_projection', 'fourier', 'ica', 'svd', 'sparse\_projection'). |
| `n_components` | —    | Number of components to keep. If specified, this takes precedence over variability.                                         |
| `variability`  | —    | The explained variance threshold (default 0.90).                                                                            |

**Returns**

A `CustomDataFrame` with the selected features.

***

### `ticker()`

```python
def ticker(self, ticker = 'AAPL')
```

Selects the data for the given ticker from the DataFrame.

**Parameters**

| Parameter | Type | Description                                      |
| --------- | ---- | ------------------------------------------------ |
| `ticker`  | —    | The ticker symbol to select (default `'AAPL'`).  |

**Returns**

A `CustomDataFrame` with the selected data.

***

### `date()`

```python
def date(self, date_inputs = ())
```

Selects data for a specific date or date range from the DataFrame.

**Parameters**

| Parameter     | Type | Description                                                                                   |
| ------------- | ---- | --------------------------------------------------------------------------------------------- |
| `date_inputs` | —    | A date string, a tuple of date strings, or several date strings. Any date format is accepted. |

**Returns**

A `CustomDataFrame` with the selected data.
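
How `date()` interprets its inputs is not spelled out above. The sketch below shows one possible behavior on a `(ticker, date)` MultiIndex using plain pandas: a single input selects that day, and two inputs select the inclusive range between them. The helper `select_dates`, the index level names, and the sample data are assumptions for illustration.

```python
import pandas as pd


def select_dates(df: pd.DataFrame, *date_inputs: str, level: str = "date") -> pd.DataFrame:
    """Hypothetical helper: one input picks a day, two pick an inclusive range."""
    dates = df.index.get_level_values(level)
    parsed = sorted(pd.to_datetime(d) for d in date_inputs)  # tolerate mixed formats
    if len(parsed) == 1:
        mask = dates == parsed[0]
    elif len(parsed) == 2:
        mask = (dates >= parsed[0]) & (dates <= parsed[1])
    else:
        raise ValueError("pass one date or a (start, end) pair")
    return df[mask]


# Made-up panel: two tickers, five consecutive days.
index = pd.MultiIndex.from_product(
    [["AAPL", "MSFT"], pd.date_range("2024-01-01", periods=5, freq="D")],
    names=["ticker", "date"],
)
panel = pd.DataFrame({"close": range(10)}, index=index)

one_day = select_dates(panel, "2024-01-03")               # one row per ticker
window = select_dates(panel, "2024-01-02", "2024/01/04")  # three days per ticker
print(len(one_day), len(window))  # 2 6
```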

***

### `select_stocks()`

```python
def select_stocks(self, market_cap = 'mega')
```

Select stocks based on market capitalization category.

**Parameters**

| Parameter    | Type  | Description                                                            |
| ------------ | ----- | ---------------------------------------------------------------------- |
| `market_cap` | `str` | Market capitalization category (e.g., "mega", "large", "mid", "small") |

**Returns**

CustomDataFrame: Filtered dataframe containing only stocks of the specified market cap

***

### `date_range()`

```python
def date_range(self, date_inputs = ())
```

Selects data for a specific date range from the DataFrame.

**Parameters**

| Parameter     | Type | Description                                            |
| ------------- | ---- | ------------------------------------------------------ |
| `date_inputs` | —    | One or more date strings. Any date format is accepted. |

**Returns**

A `CustomDataFrame` with the selected data.

***

### `extract_features()`

```python
def extract_features(
    self,
    entity_col = 'ticker',
    date_col = 'date',
    lookback = None,
    features = None,
    every = 'all',
    verbose = False,
)
```

Extracts features from the CustomDataFrame and returns a new CustomDataFrame with the extracted features.

**Parameters**

| Parameter    | Type | Description         |
| ------------ | ---- | ------------------- |
| `entity_col` | —    | Default: `'ticker'` |
| `date_col`   | —    | Default: `'date'`   |
| `lookback`   | —    | Default: `None`     |
| `features`   | —    | Default: `None`     |
| `every`      | —    | Default: `'all'`    |
| `verbose`    | —    | Default: `False`    |

***

### `reduce_dimensions()`

```python
def reduce_dimensions(
    self,
    method = 'pca',
    explained_variance = 0.95,
    verbose = False,
    n_components = None,
)
```

Perform dimensionality reduction on the CustomDataFrame.

**Parameters**

| Parameter            | Type    | Description                                                                                                                    |
| -------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `method`             | `str`   | Dimensionality reduction method. Options: 'pca', 'truncated\_svd', 'factor\_analysis', 'gaussian\_random\_projection', 'umap'. |
| `explained_variance` | `float` | Amount of variance to be explained (0 to 1).                                                                                   |
| `verbose`            | `bool`  | If True, print additional information.                                                                                         |
| `n_components`       | —       | Default: `None`                                                                                                                |

**Returns**

CustomDataFrame: Reduced data in panel format
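
To make the `explained_variance` threshold concrete, here is a minimal PCA sketch in NumPy. It assumes the threshold means "keep the smallest number of principal components whose cumulative explained-variance ratio reaches the target"; that interpretation, the function `pca_by_explained_variance`, and the synthetic data are illustrative assumptions, and the sketch does not model `n_components` or the other methods.

```python
import numpy as np


def pca_by_explained_variance(X, explained_variance: float = 0.95):
    """Hypothetical helper: PCA via SVD, keeping enough components for the target."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                              # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratio = S**2 / np.sum(S**2)                          # explained-variance ratio
    # First position where the cumulative ratio reaches the target.
    k = int(np.searchsorted(np.cumsum(ratio), explained_variance)) + 1
    k = min(k, len(S))                                   # guard against round-off
    return Xc @ Vt[:k].T, ratio[:k]


# Four features built from two underlying signals, so two components suffice.
t = 2 * np.pi * np.arange(100) / 100
f1, f2 = np.sin(t), np.cos(t)
X = np.column_stack([2 * f1, 2 * f2, f1 + f2, f1 - f2])

reduced, kept_ratio = pca_by_explained_variance(X, explained_variance=0.95)
print(reduced.shape)  # (100, 2)
```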

***

### `weight_optimization()`

```python
def weight_optimization(self)
```

Perform weight optimization on the multi-index DataFrame.

This method takes no parameters.

***

### `signal_evaluator()`

```python
def signal_evaluator(self, verbose = False)
```

Create a `SignalEvaluator` for the input multi-index DataFrame.

**Parameters**

| Parameter | Type | Description                                                |
| --------- | ---- | ---------------------------------------------------------- |
| `verbose` | —    | If True, print additional information (default `False`).  |

**Returns**

SignalEvaluator: A SignalEvaluator object with optimized weights

***

### `feature_importance()`

```python
def feature_importance(self, num_simulations = 4, clustering_method = 'KMEANS')
```

Computes feature importance using SHAP values based on multiple simulations.

**Parameters**

| Parameter           | Type | Description                                                                   |
| ------------------- | ---- | ----------------------------------------------------------------------------- |
| `num_simulations`   | —    | The number of simulations to run (default `4`).                               |
| `clustering_method` | —    | The clustering method to use: 'OPTICS' or 'KMEANS' (default `'KMEANS'`).      |

**Returns**

A DataFrame with average SHAP values per feature.

***

