Pharma Clinical Trials
This section covers a dataset that tags clinical trials with a predicted probability of success.
The dataset covers 850+ tickers, with history available from 1999-11-01 onwards.
Tutorials are the best documentation: Clinical Trials Tutorial
Input Datasets: Regulatory Filings; Biochemical Data
Models Used: Deep Learning Encoders; Language Models
Model Outputs: Success prediction; Expected duration
Description
We use state-of-the-art machine learning models to predict the success of a clinical trial, its duration, and its expected economic impact, including potential market reactions. The dataset also provides detailed metadata about each trial, which the models use to predict regulatory phase success and/or approval rates, so users can anticipate outcomes more accurately.
Our models achieve an ROC-AUC of 87%, the highest among commercially available solutions, so clients can rely on the predictions to make informed decisions. With an average of 1,052 new clinical trials launched each week, the platform lets you screen for and focus on the most promising opportunities.
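For readers unfamiliar with the metric, ROC-AUC is the probability that a randomly chosen successful trial receives a higher predicted score than a randomly chosen unsuccessful one. The sketch below computes it by direct pairwise comparison. The function name and the sample labels and scores are illustrative assumptions, not the provider's evaluation code or data.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = trial succeeded) and predicted probabilities.
example_labels = [1, 0, 1, 0, 1]
example_scores = [0.9, 0.4, 0.35, 0.2, 0.8]
example_auc = roc_auc(example_labels, example_scores)  # 5 of 6 pairs ranked correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking.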
Data Access
Prediction Data:
import sovai as sov
df_clinical = sov.data("clinical/predict", full_history=True)

Description Data:
import sovai as sov
df_clinical = sov.data("clinical/trials", full_history=True)

Accessing Specific Tickers
You can also retrieve data for specific tickers. For example:
import sovai as sov
df_pfizer = sov.data("clinical/predict", tickers=["PFE"])
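Once loaded, the predictions can be screened with ordinary pandas operations. The sketch below assumes that `sov.data` returns a pandas DataFrame indexed by `ticker` and `date` with the columns listed in the data dictionary. It runs on a small made-up frame so that it works offline; the 0.7 threshold, the helper name `screen_trials`, and the ticker `XYZ` are arbitrary illustrative choices.

```python
import pandas as pd


def screen_trials(df, min_success=0.7):
    """Keep rows at or above a success threshold, then summarize per ticker."""
    flat = df.reset_index()  # move ticker/date out of the index if they are there
    promising = flat[flat["success_prediction"] >= min_success]
    return (
        promising.groupby("ticker")["success_prediction"]
        .agg(["count", "mean"])
        .sort_values("mean", ascending=False)
    )


# Made-up rows shaped like clinical/predict; for real data, swap in
# sov.data("clinical/predict", tickers=["PFE"]).
sample = pd.DataFrame(
    {
        "ticker": ["PFE", "PFE", "XYZ", "XYZ"],
        "date": pd.to_datetime(
            ["2024-01-05", "2024-01-12", "2024-01-05", "2024-01-12"]
        ),
        "success_prediction": [0.82, 0.55, 0.91, 0.74],
        "duration_prediction": [732.0, 410.0, 365.0, 900.0],
    }
).set_index(["ticker", "date"])

summary = screen_trials(sample)
```

The result has one row per ticker with the number of trials that passed the threshold and their average predicted success probability.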
Data Dictionary
Type: sectorial (pharma/biotech)
Endpoints: clinical/predict, clinical/trials
Frequency: weekly updates (typical)
Index (often): ticker, date
clinical/predict: prediction outputs
ticker (string): Mapped company ticker (or source label)
date (date): Record/snapshot date
success_prediction (float): Probability of trial success (0–1)
economic_effect (float): Modeled economic impact (unitless index)
duration_prediction (float): Predicted trial duration (days)
success_composite (float): Composite success score (0–1)
class (string): Sponsor class (e.g., INDUSTRY/NIH/OTHER)
Notes
Values are model outputs; probabilities and scores typically range from 0 to 1.
duration_prediction is in days (e.g., 732 ≈ 2 yrs).
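The day count can be converted to years or projected onto a calendar. This is a minimal sketch that assumes a 365.25-day average year and a known start date. The helper names and the example start date are hypothetical, and pairing `duration_prediction` with a trial's `start_date` is an assumption rather than a documented workflow.

```python
from datetime import date, timedelta

DAYS_PER_YEAR = 365.25  # average year length, an approximation


def duration_years(duration_days):
    """Convert a predicted duration in days to approximate years."""
    return duration_days / DAYS_PER_YEAR


def projected_end(start, duration_days):
    """Add the predicted duration (rounded to whole days) to a start date."""
    return start + timedelta(days=round(duration_days))


years = duration_years(732)  # roughly two years
end = projected_end(date(2024, 1, 1), 732)  # hypothetical start date
```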
clinical/trials: trial descriptions & metadata (≈75 fields)
A) Source & sponsor
source (string): Record source class (e.g., government/private/listed)
subsidiary (string): Sponsor subsidiary (if any)
sponsor (string): Normalized sponsor org
class (string): Sponsor class (e.g., INDUSTRY/NIH/OTHER)
B) Identifiers & titles
trial_id (string): Registry ID (e.g., NCT number)
sponsor_study_id (string): Sponsor's internal study ID
official_title (string): Official study title
brief_title (string): Short study title
C) Lead sponsor
lead_sponsor (string): Lead sponsor label
lead_sponsor_name (string): Lead sponsor name
sponsor_type (string): Sponsor type (e.g., INDUSTRY/NIH/OTHER/NETWORK)
lead_sponsor_type (string): Lead sponsor type (same coding)
D) Study classification
study_type (string): INTERVENTIONAL/OBSERVATIONAL/etc.
phase_category (string): phase_1/2/3/other
enrollment_type (string): ACTUAL/ESTIMATED
enrollment_count (int): Planned/actual enrollment
study_size_category (string): Small/Medium/Large/Very Large
healthy_volunteers (bool): Healthy volunteers included
E) Conditions & interventions
condition_keywords (string): Keyword list (semicolon-delimited)
primary_condition (string): Primary condition/disease
intervention_type (string): DRUG/BIOLOGICAL/PROCEDURE/etc. (list)
primary_intervention (string): Primary intervention label
intervention_name (string): Intervention name(s)
intervention_arm_group_labels (string): Arm/group labels
intervention_description (string): Brief arm/intervention description
F) Oversight & responsibility
has_data_monitoring_committee (bool): DMC presence
responsible_party_investigator_affiliation (string): RP investigator affiliation
responsible_party_investigator_title (string): RP investigator title
responsible_party_investigator_name (string): RP investigator name
G) Key dates
first_posted_date (date): First posted date
last_update_posted_date (date): Last posted update
start_date (date): Study start date
primary_completion_date (date): Primary endpoint completion
study_completion_date (date): Final completion date
H) Locations
study_locations_city (string): City list (semicolon-delimited)
study_locations_state (string): State/region list
study_locations_country (string): Country list
study_locations_zip (string): ZIP/postal list
study_locations_facility (string): Facility/site list
study_locations_geopoint (string): lat/lon pairs (semicolon-delimited)
I) Eligibility
standard_age_groups (string): ADULT/OLDER_ADULT/CHILD
sex (string): ALL/MALE/FEMALE
minimum_age (int): Minimum age (yrs)
maximum_age (int): Maximum age (yrs or NA)
J) Status
overall_status (string): RECRUITING/COMPLETED/etc.
status_category (string): Active/completed/terminated, etc.
status_verified_date (date): Status verified date
K) Outcomes, primary
primary_outcomes_measures (string): Primary measure(s)
primary_outcomes_timeframes (string): Timeframe(s)
primary_outcomes_descriptions (string): Measure description(s)
L) Outcomes, secondary
secondary_outcomes_measures (string): Secondary measure(s)
secondary_outcomes_timeframes (string): Timeframe(s)
secondary_outcomes_descriptions (string): Description(s)
M) Results & narrative
has_results (bool): Results posted flag
conditions (string): Condition list
brief_summary (string): Short summary
detailed_description (string): Detailed description
N) Design
masking (string): NONE/SINGLE/DOUBLE
allocation (string): RANDOMIZED/NON_RANDOMIZED/NA
intervention_model (string): PARALLEL/SINGLE_GROUP/etc.
primary_purpose (string): TREATMENT/PREVENTION/etc.
has_expanded_access (bool): Expanded access flag
O) Duration & references
study_duration_days (int): Duration (days)
trial_duration (float): Duration (days, numeric)
references_type (string): BACKGROUND/RESULT/DERIVED/etc.
references_citation (string): Pub citations
references_pmid (string): PMIDs (semicolon-delimited)
P) Collaborators & sharing
collaborators_name (string): Collaborator names
collaborators_class (string): NIH/INDUSTRY/OTHER_GOV/etc.
ipd_sharing (string): YES/NO/UNKNOWN
Q) Model outputs (on trials table)
success_prediction (float): Probability of success (0–1)
economic_effect (float): Modeled economic impact index
duration_prediction (float): Predicted duration (days)
success_composite (float): Composite success score
R) Index (often present as index columns)
ticker (string): Mapped company ticker (public) or source label (e.g., GOV/PRIVATE/SGP)
date (date): Record/snapshot date
Common derivations
links (derived): https://clinicaltrials.gov/study/ + trial_id
Location fields often contain semicolon-separated lists.
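Both derivations can be sketched in plain Python. The helper names are hypothetical, and the NCT number and city values are placeholders rather than real records.

```python
CLINICALTRIALS_BASE = "https://clinicaltrials.gov/study/"


def trial_link(trial_id):
    """Build the registry URL for a trial from its trial_id."""
    return CLINICALTRIALS_BASE + trial_id.strip()


def split_list(value, sep=";"):
    """Split a semicolon-delimited field into a clean list of items.

    Non-string input (for example None or a missing-value float) yields [].
    """
    if not isinstance(value, str):
        return []
    return [item.strip() for item in value.split(sep) if item.strip()]


link = trial_link("NCT00000000")  # placeholder ID
cities = split_list("Boston; Chicago;Houston")  # placeholder values
```

On a DataFrame, the same helpers could be applied column-wise, for example with `df["trial_id"].map(trial_link)`.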
Use Cases
Trial Screening: Rank newly launched trials by success_prediction to focus on the most promising opportunities.
Risk Assessment: Evaluate the pipeline risk of pharma and biotech companies from the predicted success of their trials.
Regulatory Outlook: Anticipate phase success and approval outcomes using the predictions alongside the trial metadata.
Timeline Planning: Use duration_prediction to estimate how long a trial is likely to run.
Competitive Analysis: Compare predicted success across sponsors working on the same condition or intervention type.
Geographic Trend Analysis: Identify regional patterns in trial activity from the study location fields.
Economic Impact Evaluation: Use economic_effect to gauge the modeled economic impact of a trial.
Together, the prediction and description tables provide a detailed view of clinical trial activity in the pharma and biotech sector.