Q. I make a point of reading myeloma-related articles in medical journals, but I have a hard time understanding statistics. Can you help me interpret this information?
A. In the Summer 2016 edition of Myeloma Today, the InfoLine column was dedicated to the art of reading medical journalism, offering some tips on how to separate facts from hype in stories that tout a miracle cure or a revolutionary new treatment. Now we turn to the art of reading articles from peer-reviewed medical journals, and how to wade through the jargon of medical statistics to get to the bottom line: Does a particular treatment actually show benefit or not?
Dr. Brian G.M. Durie addressed this issue, focusing on the meaning of p-value, in a March 2016 segment of Ask Dr. Durie (visit askdrdurie.myeloma.org). Dr. Durie, like all researchers who conduct and publish results from clinical trials, must use statistical methods to establish the reliability and authenticity of data. Here is an excerpt from a recent article about the VRd versus Rd study published by Durie et al. in the pre-eminent hematology journal, Blood*:
The pre-specified significance level of 0.02 was reached in the log rank testing. The stratified hazard ratio (HR) was 0.735 (96% Wald confidence interval: 0.573, 0.941), and the one-sided stratified log rank p-value for PFS (VRd vs. Rd) was 0.00995. The OS was also improved for VRd vs. Rd with HR = 0.666; two-sided log-rank p-value = 0.0114.
The publication of the SWOG S0777 study of VRd vs. Rd in newly diagnosed myeloma was a highly significant event, providing long-awaited documentation that the three-drug VRd combination of a proteasome inhibitor (Velcade®, bortezomib), an immunomodulatory agent (Revlimid®, lenalidomide), and a steroid (dexamethasone) is more effective than the doublet therapy of Revlimid + dexamethasone (Rd). This study has had, and will continue to have, a major impact on treatment choices around the globe. In presenting the data that demonstrate the significant benefit of adding Velcade to Rd, the authors used abundant and careful statistical methods: Log-rank testing (one-sided and two-sided), Hazard Ratio (HR, stratified and not), Confidence Interval (CI), and p-value. (Are you non-statisticians and non-mathematicians lost yet?) Following is a brief explanation of some statistical terms used in the above and other journal articles that may enable you to boldly go where few patients and caregivers (and InfoLine columnists) have gone before.
Log-rank test: A log-rank test is the most popular method of comparing the survival of groups, one which takes the whole follow-up period into account instead of looking at a single time point. An alternate test is the generalized Wilcoxon (Breslow) test, which places emphasis on the early part of the survival curves. The log-rank test gives equal weight to events throughout follow-up, so by comparison it is more sensitive to differences that appear at the later points of the survival curves. It is widely used in clinical trials to establish the efficacy of a new treatment in comparison with a control treatment when the measurement is the time to an event. In the above study, the outcomes measured were progression-free survival (PFS, or the time until the disease progresses or the patient dies) and overall survival (OS, or the time until death from any cause). Log-rank tests can be one-sided or two-sided (also called “one-tailed” and “two-tailed”). A one-sided log-rank test asks whether there is a difference in one pre-specified direction only, for example, whether the new treatment is better than the control. A two-sided log-rank test asks whether there is a difference in either direction, that is, whether the new treatment is either better or worse than the control. In the excerpt above, Dr. Durie et al. report a one-sided log-rank test for PFS and a two-sided log-rank test for OS.
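For readers who like to see the arithmetic, here is a minimal sketch of a two-group log-rank test in Python. At each time an event occurs, it compares the number of events observed in one group with the number that would be expected if the two groups were truly the same. The function name, the variable names, and the deliberately exaggerated toy data are all made up for illustration; none of it comes from the VRd vs. Rd study, and it leaves out the stratification that the published analysis used.

```python
import math

def log_rank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test (hypothetical helper, unstratified).

    events_* is True when the event was observed and False when the
    patient was still event-free at last follow-up (censored).
    Returns (observed_a, expected_a, chi_square, p_two_sided).
    """
    # Pool both groups, tagging each patient with a group label (0 = A, 1 = B).
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({t for t, e, _ in data if e}):
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk in A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk in B
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d = sum(1 for u, e, _ in data if u == t and e)  # events in both groups
        n = n_a + n_b
        observed_a += d_a
        expected_a += d * n_a / n  # events expected in A if no real difference
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_two_sided = math.erfc(math.sqrt(chi_square / 2))
    return observed_a, expected_a, chi_square, p_two_sided

# Toy data in months (invented): group A stands in for a new treatment.
new_times = [20, 22, 24, 26, 28, 30]
new_events = [True, True, False, True, True, True]
control_times = [1, 2, 3, 4, 5, 6]
control_events = [True] * 6

obs, exp, chi2, p_two = log_rank(new_times, new_events,
                                 control_times, control_events)
# One-sided version: only "group A has fewer events than expected" counts.
p_one = p_two / 2 if obs < exp else 1 - p_two / 2
print(f"observed {obs:.0f}, expected {exp:.2f}, "
      f"two-sided p = {p_two:.4f}, one-sided p = {p_one:.4f}")
```

When the observed difference runs in the predicted direction, the one-sided p-value is half the two-sided one, which is why a paper should always say which kind it is reporting.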
P-value: The “p” in p-value stands for “probability.” It is used as a measure of how compatible the observed results are with a hypothesis. The hypothesis in a clinical trial, for statistical purposes, is usually that there is no difference between two treatments. This is known as the “null hypothesis.” The p-value gives the probability that a difference at least as large as the one observed between the groups studied could have happened by chance if the null hypothesis were true. It is not the probability that the null hypothesis itself is true. A p-value of 0.5 means that the probability of a difference this large or larger having happened by chance is 0.5 in 1, or 50:50. A p-value of 0.05 means that the probability of a difference this large or larger having happened by chance is 0.05 in 1, or 1 in 20. The lower the p-value, the less likely it is that a difference this large happened by chance alone, and so the higher the statistical significance of the finding. P=0.01 (1 in 100) is often considered to be “highly significant.” In the VRd vs. Rd article, the p-value for the improved PFS with VRd is very low indeed, 0.00995, meaning that if VRd truly made no difference, the chance of seeing a PFS improvement this large would be less than 1 in 100. The p-value for OS with VRd is almost as low, 0.0114, just slightly greater than 1 in 100 (about 1 in 88), so the improved OS with VRd is also highly significant.
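The “1 in N” way of reading a p-value is simple division, and a few lines of Python can do it for any p-value you meet in a journal article. This is only a sketch; the helper name one_in_n is invented for this example.

```python
def one_in_n(p_value: float) -> float:
    """Express a p-value as '1 in N' by returning N = 1 / p (hypothetical helper)."""
    if not 0 < p_value <= 1:
        raise ValueError("a p-value must be greater than 0 and at most 1")
    return 1 / p_value

# The examples from the text, plus the two p-values quoted from the article.
for p in (0.5, 0.05, 0.01, 0.00995, 0.0114):
    print(f"p = {p}: about 1 in {one_in_n(p):.1f}")
```

A larger N means the observed difference would be a rarer fluke if the two treatments were truly equivalent.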
Stratified: A stratified sample is one that has been split into a number of sub-groups. In the above example, the data were stratified by the stage of each patient’s myeloma and whether or not the patient intended to have an autologous stem cell transplant after induction therapy. In other myeloma clinical trials, patients may be stratified by age, by number of prior treatments (in the example above, all patients were newly diagnosed, so none had had any prior treatment), or by whether or not they’ve had certain prior therapies, or by risk status (as determined by the genetics of the myeloma cells). A stratified log-rank test allows researchers to compare treatment groups, but also to adjust for such variables as disease stage, patient age, and so on.
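Hazard Ratio (HR): “Hazard” is the statistical term for the rate at which the event being measured occurs in a group over time. In the study above, that event is disease progression or death for PFS, and death for OS. HR is formed by dividing the hazard rate of the experimental group by that of the control group. A treatment that does better than the control will have a hazard ratio that is less than one. The HR of 0.735 for PFS quoted above means that, at any given time, patients receiving VRd had about a 26.5% lower rate of progression or death than patients receiving Rd.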
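The division behind a hazard ratio can be sketched in a few lines of Python. This is a simplification: it divides crude event rates, whereas published hazard ratios are typically estimated with a statistical model. The function names and the event counts below are invented for illustration; only the HR values of 0.735 and 0.666 come from the excerpt above.

```python
def hazard_ratio(events_exp, time_exp, events_ctl, time_ctl):
    """Crude rate ratio: events per unit of follow-up time, experimental / control."""
    rate_exp = events_exp / time_exp
    rate_ctl = events_ctl / time_ctl
    return rate_exp / rate_ctl

def percent_reduction(hr):
    """Translate a hazard ratio into a percentage reduction in the event rate."""
    return (1 - hr) * 100

# Invented example: 30 events vs. 40 events, each over 1,000 patient-months.
toy_hr = hazard_ratio(30, 1000, 40, 1000)
print(f"toy HR = {toy_hr:.2f}, a {percent_reduction(toy_hr):.0f}% lower event rate")

# The two hazard ratios quoted from the article.
for label, hr in (("PFS", 0.735), ("OS", 0.666)):
    print(f"{label}: HR = {hr}, about {percent_reduction(hr):.1f}% lower rate")
```

An HR of exactly 1 would give a reduction of zero, meaning no difference between the two treatments.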
Kaplan-Meier Survival Curve: Kaplan-Meier survival curves are used to graph the survival of a group of patients. The Y axis of the graph shows the proportion of patients still surviving (or, for PFS, still free of progression), while the X axis shows the time elapsed since the start of the study. Each time an event such as a death occurs, the curve steps downward to reflect that event at that point in time.
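To show how the steps of such a curve are calculated, here is a minimal Python sketch of the Kaplan-Meier estimate. At each time an event occurs, the surviving proportion is multiplied by the fraction of patients still being followed who did not have the event. Patients who are still event-free at their last follow-up (statisticians call them “censored”) do not cause a step, but they do leave the group being followed. The function name and the five-patient data set are invented for illustration.

```python
def kaplan_meier(times, events):
    """Return the steps of a Kaplan-Meier curve as (time, proportion surviving).

    events[i] is True if patient i had the event at times[i], and False if
    the patient was still event-free at last follow-up (censored).
    """
    data = sorted(zip(times, events))
    at_risk = len(data)   # patients still being followed
    survival = 1.0        # the curve starts at 100%
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Handle every patient whose follow-up ends at this same time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve

# Invented data: follow-up in months for five patients, two of them censored.
toy_times = [2, 3, 3, 5, 8]
toy_events = [True, True, False, True, False]
toy_curve = kaplan_meier(toy_times, toy_events)
for t, s in toy_curve:
    print(f"month {t}: {s:.0%} surviving")
```

Plotting these points as a staircase, with time on the X axis and the surviving proportion on the Y axis, gives the familiar Kaplan-Meier picture.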
Confidence interval: Statisticians can calculate a range (interval) in which we can be fairly sure (confident) that the true value would lie if there were data for the whole population. The larger the number of participants in a study, the narrower the confidence interval. The narrower the confidence interval, the more precise the estimate. In the above VRd vs. Rd study, the 96% confidence interval for the hazard ratio is fairly narrow, between 0.573 and 0.941, and, importantly, it lies entirely below 1, so even the least favorable value in the range still favors VRd.
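As a rough illustration of where such an interval comes from, the Python sketch below assumes the usual “Wald” recipe: take the logarithm of the hazard ratio, go a fixed number of standard errors to either side, and convert back. The standard error is not given in the excerpt, so the sketch backs it out of the reported interval; that step, the function names, and the “four times as many events” scenario are assumptions made for illustration only.

```python
import math
from statistics import NormalDist

def z_for(level):
    """Number of standard errors to either side for a two-sided interval."""
    return NormalDist().inv_cdf(0.5 + level / 2)

def se_from_ci(lower, upper, level):
    """Back out the standard error of log(HR) from a reported Wald interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z_for(level))

def wald_ci(hr, se_log_hr, level):
    """Wald confidence interval for a hazard ratio, computed on the log scale."""
    half_width = z_for(level) * se_log_hr
    log_hr = math.log(hr)
    return math.exp(log_hr - half_width), math.exp(log_hr + half_width)

# Values quoted in the excerpt: HR 0.735, 96% interval 0.573 to 0.941.
se = se_from_ci(0.573, 0.941, 0.96)
low, high = wald_ci(0.735, se, 0.96)
print(f"rebuilt interval: {low:.3f} to {high:.3f}")

# Hypothetical: with four times as many events, the standard error would be
# roughly halved (an approximation), and the interval would narrow.
low4, high4 = wald_ci(0.735, se / 2, 0.96)
print(f"with 4x the events (hypothetical): {low4:.3f} to {high4:.3f}")
print("interval excludes 1:", high < 1)
```

The rebuilt interval matches the published one to within rounding, and the hypothetical larger study shows why more data leads to a narrower range.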
While these definitions alone will not enable anyone to pass a statistics exam, those of us who have not studied statistics and who read medical literature can, at least, become familiar with some of these difficult terms and read with an eye to narrow confidence intervals, hazard ratios that are less than one, and p-values that are 0.01 or less. For clarification of non-statistical myeloma terms and definitions, visit glossary.myeloma.org or contact the IMF InfoLine.
* “Bortezomib, Lenalidomide and Dexamethasone Vs. Lenalidomide and Dexamethasone in Patients (Pts) with Previously Untreated Multiple Myeloma without an Intent for Immediate Autologous Stem Cell Transplant (ASCT): Results of the Randomized Phase III Trial SWOG S0777.” Brian Durie, Antje Hoering, S. Vincent Rajkumar, Muneer H. Abidi, Joshua Epstein, Stephen P. Kahanic, Mohan C. Thakuri, Frederic J. Reu, Christopher M. Reynolds, Rachael Sexton, Robert Z. Orlowski, Bart Barlogie, Angela Dispenzieri. Blood 2015 126:25.