Pharma Clinical Trials

This section covers a unique dataset that tags clinical trials with their predicted likelihood of success.

Data is updated weekly on Fridays, as it becomes available from regulatory filers.

Tutorials are the best documentation: Clinical Trials Tutorial

Input Datasets

Regulatory Filings; Biochemical Data

Models Used

Deep Learning Encoders; Language Models

Model Outputs

Success prediction; Expected duration

Description

We predict the success of a clinical trial, its duration, and its expected economic impact, including potential market reactions, using state-of-the-art machine learning models. The dataset also provides detailed metadata about each trial, which allows us to predict regulatory phase success and/or approval rates, so users can anticipate outcomes with greater accuracy.

The model achieves an ROC-AUC of 87%, the highest among commercially available solutions, so clients can rely on its predictions to make informed decisions. With an average of 1,052 new clinical trials launched each week, the platform lets you screen them and focus on the most promising opportunities.
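For readers less familiar with the metric, ROC-AUC can be read as the probability that a randomly chosen successful trial receives a higher predicted score than a randomly chosen unsuccessful one, with ties counted as half. The sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; they are not part of the sovai package and do not reproduce the evaluation behind the reported figure.

```python
from itertools import product


def roc_auc(labels, scores):
    """Pairwise ROC-AUC: P(positive score > negative score), ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p, n in product(positives, negatives):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 = trial succeeded, 0 = trial failed.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
print(f"ROC-AUC = {auc:.3f}")
```

The pairwise form is quadratic in the number of trials, so it is only meant to show what the number measures; on real data a library routine would be the usual choice.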

Data Access

Prediction Data:

import sovai as sov
df_clinical = sov.data("clinical/predict", full_history=True)

Description Data:

import sovai as sov
df_trials = sov.data("clinical/trials", full_history=True)

Accessing Specific Tickers

You can also retrieve data for specific tickers. For example:

import sovai as sov
df_pfizer = sov.data("clinical/predict", tickers=["PFE"]) 

Data Dictionary

Type: sectorial (pharma/biotech)
Endpoints: clinical/predict, clinical/trials
Frequency: weekly updates (typical)
Index (often): ticker, date


clinical/predict — Prediction outputs

Column | Type | Description
ticker | string | Mapped company ticker (or source label)
date | date | Record/snapshot date
success_prediction | float | Probability of trial success (0–1)
economic_effect | float | Modeled economic impact (unitless index)
duration_prediction | float | Predicted trial duration (days)
success_composite | float | Composite success score (0–1)
class | string | Sponsor class (e.g., INDUSTRY/NIH/OTHER)

Notes

  • Values are model outputs; ranges typically 0–1 for probabilities/scores.

  • duration_prediction is in days (e.g., 732 ≈ 2 yrs).
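As a rough illustration of how these columns might be used together, the sketch below converts duration_prediction from days to years and shortlists the latest snapshot by success_prediction. It runs on a small synthetic frame rather than a live download, and it assumes the result is a pandas DataFrame indexed by ticker and date as described above. All values and the 0.7 threshold are made up.

```python
import pandas as pd

# Synthetic stand-in for the frame returned by sov.data("clinical/predict");
# the values below are made up for illustration.
df_predict = pd.DataFrame(
    {
        "ticker": ["PFE", "PFE", "MRK", "MRK"],
        "date": pd.to_datetime(
            ["2024-01-05", "2024-01-12", "2024-01-05", "2024-01-12"]
        ),
        "success_prediction": [0.62, 0.81, 0.35, 0.40],
        "duration_prediction": [732.0, 365.25, 1095.75, 1461.0],
    }
).set_index(["ticker", "date"])

# Convert predicted duration from days to years.
df_predict["duration_years"] = df_predict["duration_prediction"] / 365.25

# Keep the most recent snapshot date, then rows at or above the threshold.
latest_date = df_predict.index.get_level_values("date").max()
latest = df_predict.xs(latest_date, level="date")
THRESHOLD = 0.7  # hypothetical cut-off, not a recommended value
shortlist = latest[latest["success_prediction"] >= THRESHOLD]
print(shortlist)
```

If the real frame arrives with ticker and date as ordinary columns instead of an index, calling `set_index(["ticker", "date"])` first gives the same shape as the synthetic example.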


clinical/trials — Trial descriptions & metadata (≈75 fields)

A) Source & sponsor

Column | Type | Description
source | string | Record source class (e.g., government/private/listed)
subsidiary | string | Sponsor subsidiary (if any)
sponsor | string | Normalized sponsor organization
class | string | Sponsor class (e.g., INDUSTRY/NIH/OTHER)

B) Identifiers & titles

Column | Type | Description
trial_id | string | Registry ID (e.g., NCT number)
sponsor_study_id | string | Sponsor's internal study ID
official_title | string | Official study title
brief_title | string | Short study title

C) Lead sponsor

Column | Type | Description
lead_sponsor | string | Lead sponsor label
lead_sponsor_name | string | Lead sponsor name
sponsor_type | string | Sponsor type (e.g., INDUSTRY/NIH/OTHER/NETWORK)
lead_sponsor_type | string | Lead sponsor type (same coding)

D) Study classification

Column | Type | Description
study_type | string | INTERVENTIONAL/OBSERVATIONAL/etc.
phase_category | string | phase_1/2/3/other
enrollment_type | string | ACTUAL/ESTIMATED
enrollment_count | int | Planned/actual enrollment
study_size_category | string | Small/Medium/Large/Very Large
healthy_volunteers | bool | Healthy volunteers included

E) Conditions & interventions

Column | Type | Description
condition_keywords | string | Keyword list (semicolon-delimited)
primary_condition | string | Primary condition/disease
intervention_type | string | DRUG/BIOLOGICAL/PROCEDURE/etc. (list)
primary_intervention | string | Primary intervention label
intervention_name | string | Intervention name(s)
intervention_arm_group_labels | string | Arm/group labels
intervention_description | string | Brief arm/intervention description

F) Oversight & responsibility

Column | Type | Description
has_data_monitoring_committee | bool | Data monitoring committee (DMC) present
responsible_party_investigator_affiliation | string | Responsible party (RP) investigator affiliation
responsible_party_investigator_title | string | RP investigator title
responsible_party_investigator_name | string | RP investigator name

G) Key dates

Column | Type | Description
first_posted_date | date | First posted date
last_update_posted_date | date | Last posted update
start_date | date | Study start date
primary_completion_date | date | Primary endpoint completion
study_completion_date | date | Final completion date

H) Locations

Column | Type | Description
study_locations_city | string | City list (semicolon-delimited)
study_locations_state | string | State/region list
study_locations_country | string | Country list
study_locations_zip | string | ZIP/postal list
study_locations_facility | string | Facility/site list
study_locations_geopoint | string | lat/lon pairs (semicolon-delimited)

I) Eligibility

Column | Type | Description
standard_age_groups | string | ADULT/OLDER_ADULT/CHILD
sex | string | ALL/MALE/FEMALE
minimum_age | int | Minimum age (yrs)
maximum_age | int | Maximum age (yrs or NA)

J) Status

Column | Type | Description
overall_status | string | RECRUITING/COMPLETED/etc.
status_category | string | Active/completed/terminated, etc.
status_verified_date | date | Status verified date

K) Outcomes — primary

Column | Type | Description
primary_outcomes_measures | string | Primary measure(s)
primary_outcomes_timeframes | string | Timeframe(s)
primary_outcomes_descriptions | string | Measure description(s)

L) Outcomes — secondary

Column | Type | Description
secondary_outcomes_measures | string | Secondary measure(s)
secondary_outcomes_timeframes | string | Timeframe(s)
secondary_outcomes_descriptions | string | Description(s)

M) Results & narrative

Column | Type | Description
has_results | bool | Results posted flag
conditions | string | Condition list
brief_summary | string | Short summary
detailed_description | string | Detailed description

N) Design

Column | Type | Description
masking | string | NONE/SINGLE/DOUBLE
allocation | string | RANDOMIZED/NON_RANDOMIZED/NA
intervention_model | string | PARALLEL/SINGLE_GROUP/etc.
primary_purpose | string | TREATMENT/PREVENTION/etc.
has_expanded_access | bool | Expanded access flag

O) Duration & references

Column | Type | Description
study_duration_days | int | Duration (days)
trial_duration | float | Duration (days, numeric)
references_type | string | BACKGROUND/RESULT/DERIVED/etc.
references_citation | string | Publication citations
references_pmid | string | PMIDs (semicolon-delimited)

P) Collaborators & sharing

Column | Type | Description
collaborators_name | string | Collaborator names
collaborators_class | string | NIH/INDUSTRY/OTHER_GOV/etc.
ipd_sharing | string | YES/NO/UNKNOWN

Q) Model outputs (on trials table)

Column | Type | Description
success_prediction | float | Probability of success (0–1)
economic_effect | float | Modeled economic impact index
duration_prediction | float | Predicted duration (days)
success_composite | float | Composite success score

R) Index (often present as index columns)

Column | Type | Description
ticker | string | Mapped company ticker (public) or source label (e.g., GOV/PRIVATE/SGP)
date | date | Record/snapshot date

Common derivations

  • links (derived): https://clinicaltrials.gov/study/ + trial_id

  • Location fields often contain semicolon-separated lists.
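The two derivations above can be sketched with a pair of small helpers. The names `trial_link` and `split_list` are hypothetical and not part of the sovai package, and the record below uses placeholder values. The separator is assumed to be a semicolon, as the data dictionary indicates for the list-valued fields.

```python
CLINICALTRIALS_BASE = "https://clinicaltrials.gov/study/"


def trial_link(trial_id):
    """Build the registry URL for one trial_id (e.g. an NCT number)."""
    return CLINICALTRIALS_BASE + trial_id.strip()


def split_list(field, sep=";"):
    """Split a delimited field into clean items; empty or missing input gives []."""
    if not field:
        return []
    return [item.strip() for item in field.split(sep) if item.strip()]


# Hypothetical record; the ID and locations are placeholders, not real data.
row = {
    "trial_id": "NCT00000000",
    "study_locations_country": "United States; Germany;France",
}
print(trial_link(row["trial_id"]))
print(split_list(row["study_locations_country"]))
```

On a pandas DataFrame the same helpers could be applied column-wise, for example `df["trial_id"].map(trial_link)`, assuming trial_id is present as a string column.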



Use Cases

  1. Trial Screening: Rank newly registered trials by success_prediction or success_composite to focus on the most promising opportunities.

  2. Pipeline Risk Assessment: Review predicted outcomes for the trials mapped to a given company ticker.

  3. Timeline Planning: Use duration_prediction alongside the registered start and completion dates to anticipate when results may arrive.

  4. Economic Impact Analysis: Use economic_effect to gauge the modeled economic impact of a trial, including potential market reactions.

  5. Sponsor Comparison: Compare predictions across sponsor classes and sponsor types (e.g., INDUSTRY/NIH/OTHER).

  6. Geographic Trend Analysis: Use the study location fields to identify regional patterns in trial activity.

The resulting dataset provides a comprehensive view of clinical trials, combining trial metadata with model predictions of success, duration, and economic impact.
