How to Read Health Research Like a Pro | 1st Optimal

At a glance
- Evidence hierarchy: systematic reviews and RCTs sit at the top; expert opinion sits at the bottom
- P-value threshold: P<0.05 means a result this extreme would show up less than 1 time in 20 if there were no real effect; it is not proof of causation
- Effect size: a statistically significant finding can still be clinically meaningless if the effect is tiny
- Funding bias: industry-funded trials are 3.4x more likely to report favorable outcomes than independent trials
- Confidence interval: a wide CI signals an imprecise estimate, even when the p-value looks impressive
- Sample size: a trial with N<100 is usually underpowered for detecting modest real-world effects
- Surrogate vs. hard endpoints: a change in a blood marker is not the same as fewer heart attacks or longer life
- Absolute vs. relative risk: "50% reduction in risk" can mean 2% down to 1%, which changes the whole story
- Pre-registration: trials registered at ClinicalTrials.gov before enrollment leave far less room to cherry-pick results
- Replication: a single study, however large, should never be the sole basis for a clinical decision
Why Reading Research Is a Skill, Not Just Fact-Finding
Most health articles online present a single study as settled science. The process of actually evaluating a paper is different from reading its abstract. You need to check the study design, the population enrolled, who paid for it, and whether the outcome that was measured is the one that actually matters to patients.
The gap between a study's headline claim and what it actually proves is where most health misinformation lives. Learning to spot that gap is the single most useful skill you can develop as a patient or a clinician deciding on a therapy.
The Evidence Pyramid: Where Every Study Fits
Not all evidence is created equal. The National Institutes of Health outlines a clear hierarchy that clinicians use every day [1].
At the top sit systematic reviews and meta-analyses, which pool data from multiple trials. Below those are randomized controlled trials (RCTs), then cohort studies, then case-control studies, then case reports, and at the very bottom, expert opinion. A single observational study, regardless of how it is reported in the press, sits in the middle of this pyramid, not at the top.
When a news article says "a new study suggests," your first question should be: what kind of study? A phase 3 RCT with 1,500 participants and a two-year follow-up is categorically different from a 12-person pilot trial run over eight weeks.
Why Study Design Determines Everything
The reason RCTs rank above observational studies comes down to one concept: confounding. In an observational study, people who take a supplement or drug also differ from non-users in dozens of other ways, such as diet, income, baseline health, and healthcare access. Randomization breaks those confounders apart so the only systematic difference between groups is the intervention itself [2].
The PREDIMED trial on Mediterranean diet and cardiovascular outcomes, for example, was retracted and republished after randomization problems were discovered, which changed some of its headline conclusions. That correction happened precisely because scientists applied strict design criteria to the work [3].
Understanding P-Values Without Getting Fooled
A p-value tells you the probability of seeing a result at least as extreme as the one observed if the null hypothesis (no effect) were true. A p-value of 0.05 means that, if the treatment truly did nothing, a result this extreme would still appear about 5% of the time. It is not a 5% chance that the finding is random noise, and it is not a 95% chance that the drug works.
What P<0.05 Actually Means
The 0.05 threshold is a convention, not a law of nature. The American Statistical Association released a formal statement in 2016 clarifying that "a p-value does not measure the probability that the studied hypothesis is true" [4]. Many real effects produce p-values above 0.05, and plenty of false positives produce p-values well below it.
In large trials (N>10,000), even a trivially small, clinically meaningless difference can produce a p-value far below 0.05 simply because the study has enough statistical power to detect tiny differences. This is why effect size must always accompany any p-value discussion.
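To make the sample-size point concrete, here is a minimal Python sketch. It assumes a simple two-sample z-test with equal arms and a known, shared standard deviation; the 0.5 mmHg blood pressure difference and the SD of 10 mmHg are hypothetical numbers chosen only for illustration, and `two_sided_p` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff: float, sd: float, n_per_arm: int) -> float:
    """Two-sided p-value for a two-sample z-test (equal arms, known shared SD)."""
    standard_error = sd * sqrt(2 / n_per_arm)
    z = abs(mean_diff) / standard_error
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical effect: a 0.5 mmHg difference with SD 10 mmHg (Cohen's d = 0.05).
# The effect never changes; only the number of participants per arm does.
for n in (100, 1_000, 10_000, 100_000):
    print(f"n per arm = {n:>7,}  p = {two_sided_p(0.5, 10.0, n):.4f}")
```

With 100 people per arm this tiny difference is nowhere near significant; with 10,000 per arm the same difference clears P<0.001. The effect did not get bigger, only the trial did.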
Effect Size: The Number That Actually Matters Clinically
Effect size answers the question: how big is the difference? Cohen's d, odds ratios, hazard ratios, and mean differences are all measures of effect size. A hazard ratio of 0.98 for all-cause mortality, even with P<0.001, tells you the intervention probably does not matter much in the real world.
Compare that to semaglutide 2.4 mg (Wegovy) in the STEP-1 trial (N=1,961): participants lost a mean of 14.9% of body weight versus 2.4% in the placebo arm at 68 weeks, an absolute difference of 12.5 percentage points [5]. That is a large effect size by any clinical standard, which is why the finding changed prescribing behavior.
Confidence Intervals Tell You About Precision
Every estimate in a trial comes with a confidence interval (CI). A 95% CI of 0.80 to 1.20 for a hazard ratio includes 1.0, meaning the drug might help, might hurt, or might do nothing. That is a wide and imprecise estimate. A 95% CI of 0.60 to 0.72 is much tighter and more informative.
The EMPEROR-Reduced trial of empagliflozin (Jardiance) in heart failure with reduced ejection fraction reported a hazard ratio for cardiovascular death or hospitalization of 0.75 (95% CI 0.65 to 0.86, P<0.001), a narrow interval that told clinicians the estimate was precise and meaningful [6].
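One way to get a feel for CI precision is to work backward from a reported interval. The sketch below assumes the interval was built symmetrically on the log scale (the usual approach for hazard ratios) and uses the EMPEROR-Reduced figures quoted above; the function names are illustrative, not from any statistics package.

```python
from math import exp, log
from statistics import NormalDist


def se_from_ci(lower: float, upper: float, level: float = 0.95) -> float:
    """Standard error of log(HR), recovered from a reported CI (log-scale symmetry assumed)."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return (log(upper) - log(lower)) / (2 * z_crit)


def ci_from_hr(hr: float, se_log_hr: float, level: float = 0.95) -> tuple[float, float]:
    """Rebuild the CI from a hazard ratio and the standard error of its log."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return exp(log(hr) - z_crit * se_log_hr), exp(log(hr) + z_crit * se_log_hr)


def excludes_no_effect(lower: float, upper: float) -> bool:
    """True when the interval does not contain HR = 1.0 (no difference between arms)."""
    return not (lower <= 1.0 <= upper)


se = se_from_ci(0.65, 0.86)            # EMPEROR-Reduced interval from the text
print(round(se, 4))                    # spread of the estimate on the log scale
print(ci_from_hr(0.75, se))            # should land close to (0.65, 0.86)
print(excludes_no_effect(0.65, 0.86))  # narrow interval, entirely below 1.0
print(excludes_no_effect(0.80, 1.20))  # wide interval that straddles 1.0
```

A smaller standard error means a tighter interval, and an interval that stays on one side of 1.0 is what separates "probably helps" from "might help, might hurt."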
Absolute Risk vs. Relative Risk: The Most Misused Statistics in Health Media
Relative risk reduction makes a drug sound impressive. Absolute risk reduction tells you whether it is worth taking.
A Worked Example
Suppose a drug reduces the annual risk of a heart attack from 2% to 1%. The relative risk reduction is 50%. The absolute risk reduction is 1 percentage point. The number needed to treat (NNT) is 100, meaning 100 people must take the drug for one year for one person to avoid a heart attack.
That NNT of 100 may be entirely acceptable for a drug that is cheap and safe, but it looks very different from a headline reading "Drug Cuts Heart Attack Risk in Half."
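The worked example above comes down to three lines of arithmetic. A minimal Python sketch, with `risk_summary` as an illustrative name and the 2% and 1% risks taken from the example:

```python
from math import ceil


def risk_summary(control_risk: float, treated_risk: float) -> dict:
    """Absolute risk reduction, relative risk reduction, and number needed to treat."""
    arr = control_risk - treated_risk   # absolute risk reduction (as a proportion)
    rrr = arr / control_risk            # relative risk reduction
    # NNT is conventionally rounded up; round() first to absorb float noise.
    nnt = ceil(round(1 / arr, 9)) if arr > 0 else None
    return {"arr": arr, "rrr": rrr, "nnt": nnt}


summary = risk_summary(control_risk=0.02, treated_risk=0.01)
print(f"ARR: {summary['arr'] * 100:.1f} percentage points")
print(f"RRR: {summary['rrr'] * 100:.0f}%")
print(f"NNT: {summary['nnt']}")
```

Running the same function on a baseline risk of 20% falling to 10% gives the same 50% relative reduction but an NNT of 10, which is why the baseline risk matters as much as the headline percentage.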
How to Find Absolute Numbers in a Paper
Look for the event rates in each arm, not just the hazard ratio or relative risk in the results table. Many papers bury event rates in supplementary tables. The CONSORT reporting guidelines require RCTs to report absolute event rates, so any trial following CONSORT will have them somewhere [7].
Funding, Conflicts of Interest, and Industry Bias
A 2017 Cochrane systematic review found that industry-funded drug studies were 3.4 times more likely to report favorable efficacy results than independently funded studies [8]. That does not mean industry-funded research is always wrong. It means you should read it more carefully and look for the same result in independent replications.
Where to Check Funding and Conflicts
Every journal article published in a reputable peer-reviewed outlet includes a disclosure statement near the end. Look for "Declaration of competing interests" or "Conflict of interest statement." On PubMed, the full text (when available) will list these disclosures. ClinicalTrials.gov listings also name the study sponsor.
When a testosterone therapy study is sponsored by AbbVie (maker of AndroGel) and shows favorable outcomes, the finding may well be real, but independent replication, such as the NIH-funded Testosterone Trials (TTrials) program, provides a cleaner signal [9].
The Role of Pre-Registration
Trials registered on ClinicalTrials.gov before enrollment begins must specify their primary and secondary endpoints in advance. This prevents researchers from running 20 analyses and then reporting only the one that came out positive, a practice called p-hacking or outcome-switching.
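The "20 analyses" problem is easy to quantify. The sketch below assumes the analyses are statistically independent and that no real effect exists in any of them, which is a simplification (real secondary outcomes are usually correlated); `familywise_error` is an illustrative name.

```python
def familywise_error(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests of true nulls."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall false-positive rate near alpha."""
    return alpha / n_tests


print(f"{familywise_error(1):.2f}")       # one pre-specified analysis
print(f"{familywise_error(20):.2f}")      # twenty analyses, report the best one
print(f"{bonferroni_threshold(20):.4f}")  # stricter cutoff if all 20 are tested
```

Under these assumptions, a researcher who runs 20 analyses on an ineffective treatment has roughly a 64% chance of finding at least one result with P<0.05. Pre-registration removes the freedom to choose which one to report.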
If you want to check whether a trial pre-specified its primary endpoint, look up the NCT number on ClinicalTrials.gov and compare the protocol endpoints to what the published paper calls its primary outcome. Discrepancies are a red flag.
Surrogate Endpoints vs. Hard Clinical Outcomes
A surrogate endpoint is a lab value or imaging finding used as a proxy for something that actually matters to patients. HbA1c is a surrogate for diabetes complications. Bone mineral density is a surrogate for fractures. PSA is a surrogate for prostate cancer mortality.
Why Surrogates Can Mislead
The FDA has approved drugs based on surrogate endpoint improvement that later failed to show benefit on hard outcomes. The ACCORD trial tested aggressive glucose lowering (targeting HbA1c below 6%) in type 2 diabetes and found it actually increased all-cause mortality compared with standard control, despite improving the surrogate [10]. Better numbers on a lab panel did not translate to longer life.
When you read a TRT or GLP-1 study that reports improvements in testosterone levels, PSA, HbA1c, or insulin resistance, ask: did this trial also measure what happened to cardiovascular events, fractures, hospitalizations, or mortality? Those hard endpoints are what matter to you as a patient.
Hard Endpoints in Hormone and Metabolic Research
The SELECT trial (N=17,604) of semaglutide 2.4 mg in adults with overweight or obesity and established cardiovascular disease but without diabetes reported a 20% reduction in the composite of cardiovascular death, nonfatal myocardial infarction, or nonfatal stroke (hazard ratio 0.80, 95% CI 0.72 to 0.90, P<0.001) [11]. That is a hard endpoint trial. The drug's effect on actual cardiovascular events, not just body weight, is what made clinicians take it seriously for secondary prevention discussions.
How to Read a Methods Section Without a PhD
The methods section is where a study either earns its conclusions or exposes its weaknesses. Most readers skip it. Do not skip it.
Sample Size and Power Calculations
Look for the power calculation or sample size justification. This tells you whether the trial was designed large enough to detect the effect it was looking for. A trial that was powered to detect a 20% difference but found only a 10% difference should be interpreted cautiously even if p<0.05, because that borderline result might disappear in a larger replication.
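To see why a smaller-than-planned effect is a warning sign, it helps to look at how sample size scales with the effect a trial is built to detect. The sketch below uses the standard normal-approximation formula for comparing two proportions; the 10% control-arm event rate is a hypothetical baseline, and `n_per_arm` is an illustrative name rather than a function from any trial-design package.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control: float, p_treated: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Participants needed per arm to detect p_control vs p_treated (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_power) ** 2 * variance / (p_control - p_treated) ** 2)


# Hypothetical control-arm event rate of 10%.
n_for_20_percent = n_per_arm(0.10, 0.08)  # 20% relative reduction
n_for_10_percent = n_per_arm(0.10, 0.09)  # 10% relative reduction
print(n_for_20_percent, n_for_10_percent)
```

Halving the target effect roughly quadruples the required enrollment. A trial sized for the larger effect has far less than its planned power against the smaller one, so a borderline significant result there is more fragile than the p-value alone suggests.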
The Testosterone Trials (TTrials) enrolled 790 men aged 65 and older across seven coordinated sub-trials, each powered for specific endpoints such as sexual function, bone density, and anemia, rather than lumping everything into a single underpowered primary outcome [9]. That design let each sub-trial answer its specific question cleanly.
Blinding and Control Conditions
Double-blind means neither participants nor researchers know who is receiving the active drug versus placebo. Single-blind means only participants are blinded. Open-label means everyone knows. The less blinding, the more susceptible the results are to placebo effects and researcher expectation bias.
For TRT research specifically, open-label trials often show larger subjective improvements in energy and libido than blinded trials, which is consistent with a meaningful placebo component in patient-reported outcomes.
Inclusion and Exclusion Criteria: Who Was Actually Studied
A trial that enrolled only men aged 45 to 65 with total testosterone between 200 and 350 ng/dL and no cardiovascular disease at baseline cannot tell you what happens when you treat a 72-year-old with testosterone at 150 ng/dL who has had a prior MI. Generalizability, called external validity, depends entirely on whether the trial population matches your patient.
Read the inclusion and exclusion criteria carefully. They tell you who the results actually apply to.
Meta-Analyses and Systematic Reviews: Powerful but Not Perfect
A meta-analysis pools data from multiple trials to produce a more precise overall estimate. Done well, it is the strongest form of clinical evidence. Done poorly, it mixes incomparable trials and produces a precise-looking but meaningless number, sometimes called "garbage in, garbage out."
How to Evaluate a Meta-Analysis
Check the I-squared statistic (I²). This measures heterogeneity: how much of the variation between individual trial results reflects real differences among the trials rather than chance. An I² above 75% means the trials are so different from each other that pooling them into a single estimate may not be scientifically justified. The Cochrane Handbook recommends reporting and explaining any I² above 50% [12].
Also check whether the meta-analysis searched multiple databases (PubMed, Embase, Cochrane), had pre-registered its protocol on PROSPERO, and used GRADE criteria to rate the overall quality of evidence. A meta-analysis that did none of these things is far less reliable than one that did all of them.
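For readers who want to see where I² comes from, here is a minimal sketch of the usual calculation: inverse-variance weights, Cochran's Q, then I² = (Q - df) / Q, floored at zero. The effect estimates below are hypothetical log hazard ratios, and `i_squared` is an illustrative name.

```python
def i_squared(effects: list[float], std_errors: list[float]) -> float:
    """I² (percent) from per-trial effect estimates and their standard errors."""
    weights = [1 / se ** 2 for se in std_errors]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))  # Cochran's Q
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100


# Hypothetical log hazard ratios from three trials with equal precision.
consistent = i_squared([-0.20, -0.22, -0.18], [0.10, 0.10, 0.10])
conflicting = i_squared([-0.50, 0.00, 0.50], [0.10, 0.10, 0.10])
print(f"{consistent:.0f}%  {conflicting:.0f}%")
```

Three trials that roughly agree give an I² of zero, while three equally precise trials that point in different directions give an I² well above the 75% mark, which is the situation where a single pooled number hides more than it reveals.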
Publication Bias and the Funnel Plot
Studies with positive results are more likely to be published than studies showing no effect. This publication bias inflates effect sizes in meta-analyses. A funnel plot, which displays each trial's effect size against its precision (standard error or sample size), should show a symmetric distribution if no publication bias exists. An asymmetric funnel plot suggests that small trials with negative results were not published.
The Cochrane Library (cochranelibrary.com) provides free access to systematic reviews that apply these standards rigorously. When a therapy you are considering has a Cochrane review, read it [12].
Applying This to TRT, GLP-1, and Peptide Research Specifically
Hormone therapy and metabolic drug research present some specific interpretation challenges worth naming directly.
The HealthRX medical team uses a five-question framework before citing any study in a clinical protocol:
- Was this an RCT, and was it blinded?
- What was the primary endpoint, and is it a hard clinical outcome or a surrogate?
- Was the trial pre-registered, and did the published primary outcome match the registered one?
- Who funded the study, and has the result been independently replicated?
- Does the enrolled population match the patients we are treating?
A study that scores well on all five questions gets high weight in clinical decision-making. A study that fails two or more of them is treated as hypothesis-generating at best.
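The five questions can be written down as a simple checklist. This is only a sketch of the rule as stated above, not the team's actual tooling: the class and field names are hypothetical, and the "intermediate" label for a study that fails exactly one question is an assumption, since the text does not say how that case is handled.

```python
from dataclasses import dataclass, fields


@dataclass
class StudyAppraisal:
    blinded_rct: bool                       # RCT, and blinded?
    hard_primary_endpoint: bool             # hard clinical outcome, not a surrogate?
    preregistered_and_matching: bool        # registered, and published outcome matches?
    independent_or_replicated: bool         # funding checked, result replicated?
    population_matches: bool                # enrolled population matches the patient?


def evidence_weight(appraisal: StudyAppraisal) -> str:
    """Map the number of failed questions to a weight category."""
    failures = sum(not getattr(appraisal, f.name) for f in fields(appraisal))
    if failures == 0:
        return "high weight"
    if failures >= 2:
        return "hypothesis-generating at best"
    return "intermediate"  # assumption: the one-failure case is not defined in the text


# Hypothetical appraisal: open-label series, surrogate endpoint, never registered.
small_series = StudyAppraisal(False, False, False, True, True)
print(evidence_weight(small_series))
```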
TRT Research: Specific Considerations
Testosterone research is complicated by the fact that normal testosterone ranges vary by assay, by lab, and by age-specific reference intervals. The Endocrine Society's 2018 clinical practice guideline on male hypogonadism specifies that diagnosis requires "unequivocally low serum testosterone concentration" confirmed on at least two morning measurements using an accurate assay [13]. A single low reading on a direct immunoassay (rather than liquid chromatography-mass spectrometry) is insufficient for diagnosis and insufficient as a study baseline.
When reading a TRT trial, check whether testosterone was measured by mass spectrometry or immunoassay, and whether samples were taken in the morning (when levels peak) or at random times. Studies with afternoon or random-draw baselines likely enrolled men whose testosterone appeared lower than their true morning values, which biases the enrolled population.
GLP-1 and Peptide Research: Specific Considerations
The literature on GLP-1 receptor agonists and the related dual GIP/GLP-1 agonist tirzepatide is currently among the strongest bodies of evidence in metabolic medicine. The SURMOUNT-1 trial of tirzepatide (Mounjaro/Zepbound) 15 mg in adults with obesity (N=2,539) showed 22.5% mean weight loss at 72 weeks versus 2.4% with placebo, P<0.001 [14]. The SELECT trial of semaglutide 2.4 mg extended the weight-loss findings for this drug class to hard cardiovascular outcomes [11].
Peptides outside the approved GLP-1 agents, such as BPC-157 or CJC-1295, have not been studied in phase 3 RCTs in humans. The existing literature consists almost entirely of rodent models and small open-label series. Applying the five-question framework above to most peptide research quickly reveals its current limitations. That does not make the therapies useless, but it means the evidence level is low and informed consent conversations must reflect that honestly.
Reading Research From Primary Sources vs. Press Coverage
Science journalists work under deadline pressure and often lack the training to interpret methods sections accurately. A 2016 analysis in PLOS ONE found that 40% of press releases about biomedical research contained exaggerated or misleading claims compared to the underlying papers [15].
Going Directly to PubMed
PubMed (pubmed.ncbi.nlm.nih.gov) indexes over 36 million citations. You can search by drug name, condition, and study type. Filter by "Clinical Trial" or "Randomized Controlled Trial" under Article Types to cut out animal and in-vitro studies immediately.
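The same filter can be applied programmatically through NCBI's E-utilities search endpoint. The sketch below only builds the request URL and does not send it. The endpoint, the `db`, `term`, `retmode`, and `retmax` parameters, and the `[Publication Type]` field tag reflect the E-utilities interface as I understand it, so check the current NCBI documentation before relying on them; `pubmed_rct_search_url` is an illustrative name.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_rct_search_url(drug: str, condition: str, retmax: int = 20) -> str:
    """Build a PubMed search URL restricted to randomized controlled trials."""
    term = (
        f"{drug} AND {condition} "
        'AND "randomized controlled trial"[Publication Type]'
    )
    query = urlencode(
        {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    )
    return f"{ESEARCH_URL}?{query}"


url = pubmed_rct_search_url("semaglutide", "obesity")
print(url)
```

Pasting the printed URL into a browser should return a list of PubMed IDs, which can then be looked up individually on PubMed.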
The abstract gives you the study design, population, primary outcome, and main results. The full text (often freely available or via PubMed Central) gives you the methods, tables, and supplementary data you need to fully evaluate the work.
Using the GRADE Framework as a Mental Checklist
GRADE (Grading of Recommendations Assessment, Development and Evaluation) is the most widely used system for rating the overall quality of a body of evidence. It rates evidence as high, moderate, low, or very low based on study design, risk of bias, inconsistency, indirectness, and imprecision [16].
The Endocrine Society uses GRADE in its clinical practice guidelines [13]. When a guideline says a recommendation is based on "low-quality evidence," that tells you the recommendation could change with better data, which should inform how confidently you and your clinician act on it.
The Endocrine Society's 2018 guideline on male hypogonadism states directly: "We suggest against making a diagnosis of androgen deficiency in men with low serum testosterone concentrations who have not had symptoms of hypogonadism" [13]. That quote reflects a deliberate grading decision based on the quality of evidence available, not just clinical opinion.
References
1. National Institutes of Health, National Library of Medicine. Study designs: evidence hierarchy. Available at: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6235166/
2. Schulz KF, Altman DG, Moher D; CONSORT Group. CONSORT 2010 statement: updated guidelines for reporting parallel group randomized trials. Ann Intern Med. 2010;152(11):726-732. https://www.ncbi.nlm.nih.gov/pubmed/20335313
3. Estruch R, Ros E, Salas-Salvadó J, et al. Primary prevention of cardiovascular disease with a Mediterranean diet supplemented with extra-virgin olive oil or nuts (PREDIMED, retraction and republication). N Engl J Med. 2018;378(25):e34. https://www.nejm.org/doi/full/10.1056/NEJMoa1800389
4. Wasserstein RL, Lazar NA. The ASA's statement on p-values: context, process, and purpose. Am Stat. 2016;70(2):129-133. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5187603/
5. Wilding JPH, Batterham RL, Calanna S, et al. Once-weekly semaglutide in adults with overweight or obesity (STEP 1). N Engl J Med. 2021;384(11):989-1002. https://pubmed.ncbi.nlm.nih.gov/33567185/
6. Packer M, Anker SD, Butler J, et al. Cardiovascular and renal outcomes with empagliflozin in heart failure (EMPEROR-Reduced). N Engl J Med. 2020;383(15):1413-1424. https://pubmed.ncbi.nlm.nih.gov/32865377/
7. CONSORT Group. CONSORT 2010 explanation and elaboration: updated guidelines for reporting parallel group randomised trials. BMJ. 2010;340:c869. https://www.bmj.com/content/340/bmj.c869
8. Lundh A, Lexchin J, Mintzes B, Schroll JB, Bero L. Industry sponsorship and research outcome. Cochrane Database Syst Rev. 2017;2:MR000033. https://pubmed.ncbi.nlm.nih.gov/28207928/
9. Snyder PJ, Bhasin S, Cunningham GR, et al. Effects of testosterone treatment in older men (Testosterone Trials). N Engl J Med. 2016;374(7):611-624. https://pubmed.ncbi.nlm.nih.gov/26886521/
10. Action to Control Cardiovascular Risk in Diabetes Study Group; Gerstein HC, Miller ME, et al. Effects of intensive glucose lowering in type 2 diabetes (ACCORD). N Engl J Med. 2008;358(24):2545-2559. https://pubmed.ncbi.nlm.nih.gov/18539917/
11. Lincoff AM, Brown-Frandsen K, Colhoun HM, et al. Semaglutide and cardiovascular outcomes in obesity without diabetes (SELECT). N Engl J Med. 2023;389(24):2221-2232. https://pubmed.ncbi.nlm.nih.gov/37952131/
12. Higgins JPT, Thomas J, Chandler J, et al. Cochrane Handbook for Systematic Reviews of Interventions. Version 6.4. Cochrane, 2023. https://www.cochranelibrary.com/about/about-cochrane-reviews
13. Bhasin S, Brito JP, Cunningham GR, et al. Testosterone therapy in men with hypogonadism: an Endocrine Society clinical practice guideline. J Clin Endocrinol Metab. 2018;103(5):1715-1744. https://pubmed.ncbi.nlm.nih.gov/29562364/
14. Jastreboff AM, Aronne LJ, Ahmad NN, et al. Tirzepatide once weekly for the treatment of obesity (SURMOUNT-1). N Engl J Med. 2022;387(3):205-216. https://pubmed.ncbi.nlm.nih.gov/35658024/
15. Sumner P, Vivian-Griffiths S, Boivin J, et al. Exaggerations and caveats in press releases and health-related science news. PLOS ONE. 2016;11(12):e0168217. https://pubmed.ncbi.nlm.nih.gov/27997556/
16. Guyatt GH, Oxman AD, Vist GE, et al. GRADE: an emerging consensus on rating quality of evidence and strength of recommendations. BMJ. 2008;336(7650):924-926. https://pubmed.ncbi.nlm.nih.gov/18436948/