Inside the Testosterone Trials (T-Trials) Methodology: What Most Summaries Skip

At a glance
| Parameter | Detail |
|---|---|
| Trial name | The Testosterone Trials (TTrials) |
| N | 790 (394 testosterone, 396 placebo) |
| Intervention | 1% testosterone gel (AndroGel), dose-titrated to mid-normal range |
| Comparator | Matching placebo gel |
| Duration | 12 months |
| Primary endpoints | Sexual function (PDQ-Q4), vitality (FACIT-Fatigue), physical function (6-minute walk distance) |
| Key result | Significant improvement in sexual desire and erectile function; modest gain in 6-minute walk distance; no significant vitality benefit vs. placebo |
| Registry | NCT00799617 |
Why the T-Trials Exist: Filling a Specific Evidence Gap
Before 2016, most testosterone replacement data in older men came from small, short trials or observational registries. The Institute of Medicine's 2003 report called for adequately powered RCTs in men over 65 with low testosterone and age-related symptoms. The National Institute on Aging funded the T-Trials specifically to answer whether raising serum testosterone to the mid-normal range for younger men would produce measurable symptomatic benefit over one year.
This origin matters. The trial was not designed to answer whether TRT prevents fractures, reduces cardiovascular events, or extends life. It was designed to measure symptom-level endpoints in a symptomatic population. Every interpretation of the results should start from that boundary.
The Coordinated Multi-Trial Architecture
The most unusual feature of the T-Trials is its structure. Seven sub-trials (Sexual Function, Physical Function, Vitality, Cognitive Function, Bone, Anemia, Cardiovascular) shared a single randomization scheme. Each participant enrolled in one to three sub-trials based on their symptom profile, but all received the same testosterone or placebo gel regardless of which sub-trial(s) they joined.
This had practical consequences:
- Efficiency. One screening pipeline fed multiple research questions. A man with both low libido and slow walking speed contributed data to both the Sexual Function and Physical Function trials.
- Correlation. Because the same participants appear in multiple sub-trials, the results are not independent. A systemic effect of testosterone (improved mood, for instance) could influence outcomes across sub-trials simultaneously.
- Power trade-offs. The three "main" trials (Sexual Function, Vitality, Physical Function) were powered for their primary endpoints. The four "subsidiary" trials were explicitly exploratory, with smaller sample sizes and wider confidence intervals.
Eligibility: Who Got In, Who Didn't
Inclusion required men aged 65 or older with two morning serum testosterone levels <275 ng/dL (average), plus symptoms qualifying them for at least one sub-trial. The symptom gates were specific:
- Sexual Function Trial: low libido or erectile difficulty on the DISF-M-II questionnaire.
- Physical Function Trial: difficulty walking two blocks or climbing ten stairs, confirmed by a 6-minute walk test of <500 meters.
- Vitality Trial: FACIT-Fatigue score ≤30.
Key exclusions: prostate cancer history, PSA >4.0 ng/mL, severe lower urinary tract symptoms (AUA-SI >19), BMI >40, hematocrit >48%, unstable cardiovascular disease within the prior 3 months, and uncontrolled heart failure.
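The inclusion and exclusion rules above can be read as a two-stage filter: a global gate (age, testosterone, safety exclusions) followed by per-sub-trial symptom gates. The sketch below encodes the thresholds exactly as listed in this section. The `Candidate` class, its field names, and the `qualifying_subtrials` function are hypothetical names chosen for illustration, not anything taken from the trial protocol.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    # Field names are illustrative; thresholds come from the text above.
    age: int
    testosterone_1: float  # first morning draw, ng/dL
    testosterone_2: float  # second morning draw, ng/dL
    psa: float             # ng/mL
    aua_si: int            # lower urinary tract symptom score
    bmi: float
    hematocrit: float      # percent
    prostate_cancer_history: bool = False
    unstable_cvd_last_3_months: bool = False
    uncontrolled_heart_failure: bool = False
    low_libido_or_ed: bool = False            # per DISF-M-II screening
    reports_mobility_difficulty: bool = False
    six_minute_walk_m: Optional[float] = None
    facit_fatigue: Optional[float] = None


def qualifying_subtrials(c: Candidate) -> List[str]:
    """Return the main sub-trials a candidate qualifies for (empty if ineligible)."""
    mean_t = (c.testosterone_1 + c.testosterone_2) / 2
    if c.age < 65 or mean_t >= 275:
        return []
    excluded = (
        c.prostate_cancer_history
        or c.psa > 4.0
        or c.aua_si > 19
        or c.bmi > 40
        or c.hematocrit > 48
        or c.unstable_cvd_last_3_months
        or c.uncontrolled_heart_failure
    )
    if excluded:
        return []
    trials = []
    if c.low_libido_or_ed:
        trials.append("Sexual Function")
    if (c.reports_mobility_difficulty
            and c.six_minute_walk_m is not None
            and c.six_minute_walk_m < 500):
        trials.append("Physical Function")
    if c.facit_fatigue is not None and c.facit_fatigue <= 30:
        trials.append("Vitality")
    return trials
```

An empty list covers both outright exclusion and the case of a man who passes the global gate but has no qualifying symptom, which matches the requirement that every participant qualify for at least one sub-trial.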
Why these cutoffs matter. The testosterone threshold of <275 ng/dL is lower than some clinical guidelines use for hypogonadism diagnosis. The Endocrine Society's 2018 guideline sets a decisional threshold around 300 ng/dL with repeat confirmation. The T-Trials' stricter cutoff means this population had genuinely low testosterone, not borderline levels. It also means the results may not generalize to men in the 275 to 350 ng/dL gray zone where many prescriptions actually occur.
The cardiovascular and prostate exclusions created a relatively "clean" safety population. Men with the highest theoretical risk from TRT were excluded. This was appropriate for a proof-of-concept efficacy trial but limits what the study can tell us about safety in real-world prescribing.
Randomization and Blinding
Randomization was 1:1, stratified by clinical site and by sub-trial combination. The stratification by sub-trial combination is a detail most summaries omit. Because a participant could be enrolled in one, two, or three sub-trials, the randomization ensured balance within each combination stratum, not just overall.
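To make "balance within each combination stratum" concrete, here is a minimal sketch of stratified allocation using permuted blocks. This section does not say which allocation algorithm the trial used, so permuted blocks are an assumption chosen because they are a common way to implement stratified 1:1 randomization. The class name, block size, and arm labels are all hypothetical.

```python
import random
from collections import defaultdict


class StratifiedBlockRandomizer:
    """1:1 allocation with permuted blocks inside each stratum.

    A stratum is the pair (clinical site, set of sub-trials joined).
    Block size is an illustrative assumption, not a trial parameter.
    """

    def __init__(self, block_size=4, seed=None):
        if block_size % 2 != 0:
            raise ValueError("block_size must be even for 1:1 allocation")
        self.block_size = block_size
        self._rng = random.Random(seed)
        self._pending = defaultdict(list)  # stratum -> remaining assignments

    def assign(self, site, subtrials):
        stratum = (site, frozenset(subtrials))
        queue = self._pending[stratum]
        if not queue:
            # Start a fresh block: equal numbers of each arm, shuffled.
            half = self.block_size // 2
            block = ["testosterone"] * half + ["placebo"] * half
            self._rng.shuffle(block)
            queue.extend(block)
        return queue.pop()


randomizer = StratifiedBlockRandomizer(block_size=4, seed=2016)
arm = randomizer.assign("Site A", {"Sexual Function", "Vitality"})
```

Because each stratum keeps its own block queue, a site that enrolls mostly men in the Sexual Function plus Vitality combination still ends up with near-equal arms inside that combination, rather than relying on balance across the whole trial.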
The testosterone and placebo gels were packaged identically (pump bottles, same appearance and texture). Investigators, participants, and outcome assessors were all blinded. To maintain the blind during dose titration, a separate unblinded pharmacist at each site reviewed testosterone levels and adjusted the gel dose. This pharmacist had no role in outcome assessment.
This double-blind design with pharmacist-managed titration is stronger than many TRT trials, where side effects (acne, erythrocytosis, mood changes) can functionally unblind participants. The T-Trials acknowledged this risk but did not formally test whether participants guessed their assignment. Such a test would have strengthened the blinding assessment, and its absence is a missed opportunity.
Dose Titration Protocol
All testosterone-arm participants started at 5 g/day of 1% testosterone gel. Serum testosterone was measured at months 1, 2, 3, 6, and 9. The unblinded pharmacist titrated the dose to achieve a target serum testosterone of 400 to 798 ng/dL (the mid-normal range for men aged 19 to 40 per the trial's reference laboratory).
Dose adjustments were made in fixed increments. If testosterone remained below target, the dose went up; if above, it went down. Placebo participants received sham dose adjustments to maintain blinding.
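The titration rule described above reduces to a small decision function. The sketch below uses the starting dose and target range given in this section. The step size and the dose floor and ceiling are assumptions, since the text says only that increments were fixed, and `next_dose` is a hypothetical name.

```python
TARGET_LOW_NG_DL = 400    # lower bound of target range (from the text)
TARGET_HIGH_NG_DL = 798   # upper bound of target range (from the text)
START_DOSE_G = 5.0        # starting dose in g/day (from the text)

STEP_G = 2.5              # ASSUMPTION: increment size is not stated in the text
MIN_DOSE_G = 2.5          # ASSUMPTION: illustrative floor
MAX_DOSE_G = 15.0         # ASSUMPTION: illustrative ceiling


def next_dose(current_dose_g: float, serum_t_ng_dl: float) -> float:
    """Return the gel dose after one titration visit."""
    if serum_t_ng_dl < TARGET_LOW_NG_DL:
        return min(current_dose_g + STEP_G, MAX_DOSE_G)
    if serum_t_ng_dl > TARGET_HIGH_NG_DL:
        return max(current_dose_g - STEP_G, MIN_DOSE_G)
    return current_dose_g  # within target: no change


# Walk one hypothetical participant through the scheduled draws
# (months 1, 2, 3, 6, 9). The serum values are made up for illustration.
dose = START_DOSE_G
for month, serum_t in [(1, 310), (2, 380), (3, 520), (6, 830), (9, 610)]:
    dose = next_dose(dose, serum_t)
```

For placebo participants the same function would be driven by sham values, so that the pattern of bottle changes looks alike in both arms.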
Median achieved level: approximately 460 ng/dL at month 3 in the testosterone arm vs. approximately 230 ng/dL in the placebo arm. The separation was clean, confirming the gel achieved its pharmacologic target.
This titration-to-target approach differs from fixed-dose designs. It increases the probability of showing efficacy (because most men reach physiologic levels) but introduces variability in actual dose received. Some men were on 5 g/day, others on 10 g/day. The primary analyses were intention-to-treat and did not stratify by achieved dose, which is methodologically appropriate but means the "average treatment effect" blends different exposure levels.
Primary Endpoint Definitions
The three co-primary endpoints each used a different instrument:
| Sub-trial | Endpoint | Instrument | MCID used |
|---|---|---|---|
| Sexual Function | Sexual activity | PDQ-Q4 (Psychosexual Daily Questionnaire, question 4) | Not pre-specified; effect size approach |
| Vitality | Fatigue score | FACIT-Fatigue (0-52 scale) | 3-point change |
| Physical Function | Walking capacity | 6-minute walk distance (meters) | 50-meter change |
The PDQ-Q4 asks participants to rate daily sexual activity on a 0-4 scale, averaged over the observation period. This is a patient-reported outcome with established sensitivity to hormonal changes. The trial reported a statistically significant increase in sexual activity, desire, and erectile function across multiple instruments.
The FACIT-Fatigue scale is widely validated in oncology and chronic disease. A 3-point change is generally accepted as clinically meaningful. The T-Trials found a statistically significant but clinically modest improvement (approximately 2.4-point difference). The investigators concluded this did not meet the pre-specified threshold for clinical significance, a commendably honest interpretation.
The 6-minute walk test showed a statistically significant but small improvement (approximately 6 meters over placebo at month 6, with attenuation by month 12). The pre-specified MCID of 50 meters was not met. The improvement was real but not clinically significant by the trial's own standards.
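The distinction between statistical and clinical significance in the last two paragraphs is simple arithmetic: compare the between-arm difference with the MCID. The sketch below applies that comparison to the approximate figures quoted above. The function name and the category labels are illustrative, not terminology from the trial.

```python
def classify_effect(estimate: float, mcid: float,
                    statistically_significant: bool) -> str:
    """Label an effect by whether it is significant and whether it reaches the MCID."""
    clinically_meaningful = abs(estimate) >= mcid
    if statistically_significant and clinically_meaningful:
        return "significant and clinically meaningful"
    if statistically_significant:
        return "significant but below MCID"
    if clinically_meaningful:
        return "reaches MCID but not statistically significant"
    return "neither significant nor clinically meaningful"


# (endpoint, approximate difference vs. placebo, MCID), as quoted in this section
endpoints = [
    ("FACIT-Fatigue (points)", 2.4, 3.0),
    ("6-minute walk (meters)", 6.0, 50.0),
]

for name, estimate, mcid in endpoints:
    label = classify_effect(estimate, mcid, statistically_significant=True)
    share = estimate / mcid
    print(f"{name}: {label} ({share:.0%} of MCID)")
```

Expressing each effect as a share of its MCID makes the two results easier to compare: the fatigue difference comes close to its threshold, while the walk-distance difference covers only a small fraction of its own.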
Statistical Approach and Multiplicity Correction
The three main trials were analyzed using a hierarchical testing procedure. Sexual Function was tested first; only if it was significant could Vitality proceed; only if Vitality was significant could Physical Function proceed. This gate-keeping approach controls the family-wise Type I error rate at 0.05 without requiring Bonferroni-style adjustment.
In practice, Sexual Function was significant, Vitality did not clear its gate (the observed difference fell short of the clinical significance criterion), and the hierarchy stopped there. The Physical Function p-value, while nominally <0.05, therefore cannot be interpreted as confirmatory evidence. Most media coverage missed this distinction.
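The gate-keeping logic can be sketched as a generic fixed-sequence procedure: each hypothesis is tested at the full alpha, but only while every earlier hypothesis has passed. The p-values below are placeholders chosen to reproduce the pattern described above and are not the trial's reported values. The function name and status labels are hypothetical.

```python
def fixed_sequence_test(ordered_pvalues, alpha=0.05):
    """Fixed-sequence (hierarchical) testing.

    ordered_pvalues: list of (name, p_value) in the pre-specified order.
    Each hypothesis is tested at the full alpha, but only if all earlier
    ones were rejected. Once one fails, everything after it is untested.
    """
    results = {}
    gate_open = True
    for name, p in ordered_pvalues:
        if not gate_open:
            results[name] = "not tested (gate closed)"
        elif p < alpha:
            results[name] = "confirmatory"
        else:
            results[name] = "failed (closes gate)"
            gate_open = False
    return results


# Placeholder p-values for illustration only.
outcome = fixed_sequence_test([
    ("Sexual Function", 0.001),
    ("Vitality", 0.30),
    ("Physical Function", 0.03),
])
```

In this illustration Physical Function ends up as "not tested" even though its placeholder p-value is below 0.05, which is exactly the situation where a nominally significant result carries no confirmatory weight.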
The primary analysis model was a mixed-effects repeated-measures model adjusting for baseline values, clinical site, age, and sub-trial stratum. Missing data were handled under a missing-at-random assumption. Sensitivity analyses included pattern-mixture models and tipping-point analyses to test the robustness of findings to informative dropout.
Dropout was approximately 14% in each arm, balanced and mostly due to participant decision or intercurrent illness, not treatment-related adverse events. The sensitivity analyses supported the primary conclusions.
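A tipping-point analysis asks how much worse the missing outcomes in the treatment arm would have to be before the conclusion changes. The sketch below is a deliberately crude version under stated assumptions: single mean imputation, a normal-approximation confidence interval, and an integer grid of penalties. It is not the trial's actual sensitivity model, and the data in the usage lines are synthetic.

```python
import math
from statistics import mean, stdev


def ci_lower_bound(treated, control, z=1.96):
    """Lower 95% bound for the difference in means (normal approximation)."""
    diff = mean(treated) - mean(control)
    se = math.sqrt(stdev(treated) ** 2 / len(treated)
                   + stdev(control) ** 2 / len(control))
    return diff - z * se


def tipping_point(treated_obs, control_obs, n_treated_missing,
                  n_control_missing, deltas):
    """Smallest penalty delta at which the benefit is no longer significant.

    Dropouts in the control arm are imputed at the observed control mean.
    Dropouts in the treated arm are imputed at the observed treated mean
    minus delta. Returns None if no delta in the grid tips the result.
    """
    for delta in deltas:
        treated = list(treated_obs) + [mean(treated_obs) - delta] * n_treated_missing
        control = list(control_obs) + [mean(control_obs)] * n_control_missing
        if ci_lower_bound(treated, control) <= 0:
            return delta
    return None


# Synthetic change scores, not trial data.
treated_obs = [3, 4, 5, 6, 7] * 4
control_obs = [1, 2, 3, 4, 5] * 4
tip = tipping_point(treated_obs, control_obs,
                    n_treated_missing=5, n_control_missing=5,
                    deltas=range(0, 11))
```

The returned penalty is then judged for plausibility: if dropouts would need to fare implausibly worse than completers to erase the effect, the finding is considered robust to informative dropout.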
What the Estimand Framework Reveals
The T-Trials predated the ICH E9(R1) addendum on estimands, but we can apply that framework retrospectively. The implicit estimand was a treatment-policy estimand: the effect of assigning testosterone gel versus placebo, regardless of adherence or dose achieved. This is clinically relevant because it reflects what a clinician can expect when prescribing.
An alternative estimand, the effect in men who achieve target testosterone levels, would likely show larger benefits. The trial did not formally report this (though subgroup analyses by testosterone level were included in supplementary materials). From a regulatory and clinical standpoint, the treatment-policy estimand is the more conservative and generalizable choice.
Comparator Choice: Why Placebo Was Right (and Limiting)
A placebo comparator was the correct choice for this trial's question. No prior therapy was established as standard-of-care for age-related testosterone decline in men over 65. A comparison against, say, exercise or PDE5 inhibitors would answer a different question.
The limitation is that clinicians often want head-to-head data. A man with low libido and testosterone of 250 ng/dL might reasonably ask: "Should I try testosterone gel or sildenafil first?" The T-Trials cannot answer that question. Nor can they compare gel to injections, pellets, or transdermal patches, all of which differ in pharmacokinetics and patient experience.
Limitations the Authors Acknowledged
The original publication and the coordinating center's design paper explicitly noted:
- One-year duration. Too short to assess long-term cardiovascular or prostate safety. The subsequent T-Trials cardiovascular sub-study found a greater increase in noncalcified coronary artery plaque volume in the testosterone arm than in the placebo arm, a concerning signal that prompted further investigation and contributed to the design of the TRAVERSE trial.
- Symptom-based eligibility. Results apply to older men with confirmed low testosterone and specific symptoms, not to the broader population of older men with mildly low levels.
- Single formulation. AndroGel 1% (AbbVie) was the only product tested. Bioequivalence across testosterone gel products is not guaranteed, and the FDA label for testosterone products carries product-specific dosing instructions.
- Predominantly White cohort. Approximately 75% of participants were White, limiting generalizability to other racial and ethnic groups where testosterone metabolism, baseline levels, and symptom expression may differ.
- No long-term follow-up. The benefits observed at 12 months may not persist. Whether testosterone gel needs to be continued indefinitely, and what happens at discontinuation, was not addressed.
What Came Next: The T-Trials in Context
The T-Trials' cardiovascular plaque finding directly informed the TRAVERSE trial (N = 5,246), which was powered for major adverse cardiovascular events (MACE) and reported in 2023. TRAVERSE found no increased risk of MACE over a mean follow-up of 33 months, partially allaying the concerns raised by the T-Trials sub-study.
The Endocrine Society's 2018 clinical practice guideline cited the T-Trials as Level 1 evidence for symptomatic benefit in sexual function but maintained a conditional recommendation due to the limited safety data available at that time.
The Bottom Line for Clinicians
The T-Trials demonstrated that testosterone gel improves sexual function in symptomatic older men with confirmed low testosterone. The evidence for vitality and physical function was weaker. The trial's coordinated design was efficient but created statistical dependencies that require careful interpretation.
The methodology was rigorous: double-blind, placebo-controlled, titration-to-target, with hierarchical multiplicity control. Its boundaries are equally clear: one year, one formulation, one demographic skew, and exclusion of the men at highest cardiovascular and prostate risk. Read any headline about "testosterone benefits in older men" through those boundaries before applying it to a patient in front of you.
References
- Snyder PJ, Bhasin S, Cunningham GR, et al. Effects of Testosterone Treatment in Older Men. N Engl J Med. 2016;374(7):611-624.
- Budoff MJ, Ellenberg SS, Lewis CE, et al. Testosterone Treatment and Coronary Artery Plaque Volume in Older Men With Low Testosterone. JAMA. 2017;317(7):708-716.
- Lincoff AM, Bhasin S, Flevaris P, et al. Cardiovascular Safety of Testosterone-Replacement Therapy. N Engl J Med. 2023;389(2):107-117.
- Bhasin S, Brito JP, Cunningham GR, et al. Testosterone Therapy in Men With Hypogonadism: An Endocrine Society Clinical Practice Guideline. J Clin Endocrinol Metab. 2018;103(5):1715-1744.
- FDA. AndroGel (testosterone gel) 1% prescribing information.
- ICH E9(R1) Addendum on Estimands and Sensitivity Analysis in Clinical Trials.