Why Kill Controls?

“The effects of chance are the most accurately calculable, and the least doubtful of all factors in the evolutionary situation.”
R. A. Fisher, ca. 1953

COVID-19 vaccination claims have changed from “prevention” to “reduced severity.” The FDA authorized Pfizer’s vaccine on the basis of 95% efficacy relative to the placebo control sample. Yet Pfizer’s placebo sample itself showed 86% “efficacy” relative to the US population case rate! Sample subjects resembled each other but not the US population!

Ronald Fisher deserves credit for current randomized clinical trials practice and for the method of enumeration of alternative outcomes’ probabilities. Why would reliability people care?

Reliability testing is supposed to show what might happen in the field. Field reliability is useful in life testing as well as in other applications: warranty reserves, spares stocks, diagnostics, recalls, etc.

I had to do life testing on something and found that the Kolmogorov-Smirnov (K-S) test was commonly used. However, it was for complete samples, not censored ones. So I modified the K-S test to deal with censored samples. I also used the likelihood ratio test, because nonparametric reliability functions could be estimated from test samples and population data with or without life data.

In case there is no population life data, just period ships and returns counts (without identifying which cohort returns came from), I used Ronald Fisher’s enumeration of simulated, grouped life data that matched the observed ships and returns counts. This is an example of “neutrosophic” statistics.

Test results on one sample produce the equivalent of the sound made by one hand clapping. Why not compare test reliability results with population reliability?

Randomized Clinical Trials vs. Single-Arm Trials

Vaccine efficacy is 1 − Risk(vaccinated)/Risk(unvaccinated), where Risk() is infections/sample size. Pfizer received emergency use authorization with vaccine efficacy = 1 − (8/21720)/(162/21728) = 95.06% (8 cases among the vaccinated vs. 162 among the placebo controls). Suppose that, instead of the 21728-subject placebo control sample, we compare with the unvaccinated US population Risk(). Then placebo (saline) efficacy = 1 − (162/21728)/(17.8M/328.2M) = 1 − 0.007456/0.054235 = 86.25%! The difference between the placebo case rate of 0.75% and the US population case rate of 5.42% shows that Pfizer’s sample is not representative of the US population.
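
The placebo-vs.-population arithmetic can be checked with a few lines of Python. This is only a sketch of the formula quoted above; the function names are mine, and the counts are the ones given in the text.

```python
def risk(cases: float, size: float) -> float:
    """Risk() as defined in the text: infections divided by sample size."""
    return cases / size


def efficacy(risk_treated: float, risk_reference: float) -> float:
    """Efficacy = 1 - Risk(treated) / Risk(reference)."""
    return 1.0 - risk_treated / risk_reference


# Counts quoted in the text.
risk_placebo = risk(162, 21728)          # placebo arm of the trial
risk_population = risk(17.8e6, 328.2e6)  # US cases / US population

placebo_vs_population = efficacy(risk_placebo, risk_population)

print(f"placebo risk    = {risk_placebo:.6f}")
print(f"population risk = {risk_population:.6f}")
print(f"placebo 'efficacy' vs. population = {placebo_vs_population:.2%}")
```

The same `efficacy` function applies to the vaccinated arm by passing its risk as the first argument and either the placebo risk or the population risk as the reference.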

Others recognize this problem [Averitt et al.]. David Moore (former statistics group leader of Abbott Laboratories) told me, “We’re lucky to find 100 subjects with the disease, and we have to split them into control and treatment blocks.” What are the consequences? [Deeney] Translating clinical trials evidence into medical practice may be facilitated by representative sample vs. population comparisons. Comparing a treated sample vs. untreated population avoids the ethical dilemma of killing controls and removes the bias due to convenience sampling.

The FDA says, “Real-world data and real-world evidence are playing an increasing role in health care decisions.” https://www.fda.gov/science-research/science-and-research-special-topics/real-world-evidence. Why not compare treated sample life statistics with untreated population statistics? [Deeney]

Lady Tasting Tea Led to Randomized Clinical Trials Practice

Ronald A. Fisher did experiments at Rothamsted Research Station, on plants! At UC Berkeley, I took engineering statistics from Elizabeth Scott. She taught Ronald Fisher’s lesson about “The Lady Tasting Tea.” [https://www.nbi.dk/~petersen/Teaching/Stat2009/Fisher_ExactTest_LadyTastingTea.pdf/]

The Lady claimed she could tell whether the milk was poured into the tea cup before or after the tea. Fisher proposed two sets of cups: one set had the milk poured in first. The lady passed the test. I asked Professor Scott what this had to do with engineering. I am ashamed that I asked the question, but I got the answer. The Fisher exact test observes yes-no results, computes the probabilities of all possible combinations, and rejects the null hypothesis of guessing if the observed number of correct answers (identifying whether milk was added first) would be improbable under guessing.
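
Fisher’s enumeration is easy to reproduce. The sketch below assumes the classic design of 8 cups, 4 with milk first, and a lady who must name 4 cups as milk-first; those counts are not stated above, so treat them as an assumption.

```python
from fractions import Fraction
from itertools import combinations

N_CUPS = 8        # assumed total number of cups (not stated in the text)
N_MILK_FIRST = 4  # assumed number of cups with milk poured first

true_milk_first = set(range(N_MILK_FIRST))  # label the milk-first cups 0..3

# Enumerate every way to name 4 of the 8 cups as "milk first" and count
# how many of the named cups are correct. Under pure guessing, every
# selection is equally likely.
counts = {}
for guess in combinations(range(N_CUPS), N_MILK_FIRST):
    correct = len(true_milk_first & set(guess))
    counts[correct] = counts.get(correct, 0) + 1

total = sum(counts.values())
prob = {k: Fraction(v, total) for k, v in sorted(counts.items())}


def p_value(observed_correct: int) -> Fraction:
    """P[at least this many correct | guessing]."""
    return sum((p for k, p in prob.items() if k >= observed_correct), Fraction(0))


for k, p in prob.items():
    print(f"{k} correct: {p} = {float(p):.4f}")
print("p-value if all 4 are correct:", p_value(4))
```

Under these assumptions there are 70 equally likely selections, and only one of them is completely correct.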

There is at least one clinical trial to see whether plasma-antibody treatment improves the corona-virus case fatality rate (deaths/cases) [Joyner et al]. The clinical trial is to see whether treatment prolongs lifetime (survival) and reduces time to recovery:

Ho: survival function of a treated sample is the same as that of the untreated population vs.

Ha: survival function of the treated sample is stochastically better than that of the untreated population; i.e., P[Life>t|treated sample] > P[Life>t|untreated population] for some t. Why not compare with other populations?

Typical randomized clinical trials presume similar data from randomly selected treated and untreated samples. Treated sample life data differs from untreated population case and death or recovery counts data, although both contain survival function information. Treated sample subjects produce (censored) life data, times from infection to death or recovery, by patient name or unique identifier. The Kaplan-Meier nonparametric maximum likelihood estimator could be used to estimate the treated survival function, P[Life>t|Treated].
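
For readers who have not coded it, here is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. The times and censoring indicators are hypothetical, made up only to exercise the function.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed endpoint.

    times:  observed time for each subject (endpoint or censoring time)
    events: 1 if the endpoint was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # everyone leaving the risk set at time t
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            surv *= 1.0 - d / at_risk  # product-limit step
            curve.append((t, surv))
        at_risk -= exits[t]            # deaths and censored both leave
    return curve


# Hypothetical treated-sample data: days from infection to death (1) or
# to end of follow-up without death (0).
times = [2, 3, 3, 5, 6, 7, 7, 9]
events = [1, 1, 0, 1, 0, 1, 1, 0]

km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"P[Life > {t}] = {s:.3f}")
```

Subjects censored at the same time as a death are counted as still at risk for that death, which is the usual convention.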

What does this have to do with reliability? Engineers do life tests to determine whether changes in design, process, or other factors improve reliability, P[Life > t] for reasonable life t. The life test null hypothesis is that the change(s) cause no difference in life between the changed sample and the unchanged controls. Why have controls when there is a product population already in service, with a field reliability function that you could estimate from the ships and returns count data required by generally accepted accounting principles?

The untreated population produces case and death or recovery counts, without lifetime data [George and Agrawal]. Periodic cohort case and death or recovery counts are statistically sufficient to make nonparametric estimates of population survival functions, P[Life>t|Untreated], https://sites.google.com/site/fieldreliability/corona-virus-survival-analysis/.
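
To see why period counts carry survival information, here is a simple moment-matching recursion. It is a sketch, not necessarily the maximum likelihood estimator of [George and Agrawal]: it assumes every cohort shares one age-at-event distribution and solves events[t] = Σ cases[s]·p[t−s] one period at a time. The counts are the ships and column sums that appear later in table 3.

```python
def pmf_from_counts(cases, events):
    """Estimate p[a] = P[endpoint occurs in the (a+1)-th period of life].

    Solves events[t] = sum_{s <= t} cases[s] * p[t - s] for p[t],
    period by period. Negative solutions are clipped to zero, which is
    an ad hoc choice made only for this sketch.
    """
    p = []
    for t in range(len(events)):
        # Events in period t already explained by later cohorts.
        explained = sum(cases[s] * p[t - s] for s in range(1, t + 1))
        p.append(max(0.0, (events[t] - explained) / cases[0]))
    return p


def survival_from_pmf(p):
    """S(a) = 1 - (p[0] + ... + p[a-1]) for a = 1, 2, ..."""
    surv, total = [], 0.0
    for pa in p:
        total += pa
        surv.append(1.0 - total)
    return surv


cases = [100, 100, 100]  # period cohorts (ships in table 3)
events = [2, 5, 9]       # period event counts (bottom row of table 3)

pmf = pmf_from_counts(cases, events)
population_surv = survival_from_pmf(pmf)
print("pmf      :", [round(x, 4) for x in pmf])
print("survival :", [round(x, 4) for x in population_surv])
```

With these counts the recursion recovers the age-specific events of table 3 (2, 3, and 4 per 100), because that table was built from a single age-at-event distribution.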

The clinical trial hypothesis test could be done by comparing the sample survival function estimate with the population survival function estimate, using the Kolmogorov-Smirnov (K-S) maximum absolute difference, likelihood ratio, or other test statistic. The FDA would call this a “single-arm” trial and the population an “external” control. Dan Moore (real biostatistician) says, “The FDA does accept ‘historical controls’ as a comparison to treated in phase II trials. You have to show that there has been no change in your endpoint over chronological time.” [LeBlanc and Tangen [2012], Belin et al. [2017], Dean et al. [2020], and others] Death is a clear endpoint, although recovery from corona virus may not be as clear.
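
The K-S statistic itself is a one-liner once both survival function estimates are evaluated at the same ages. The sketch below uses hypothetical survival values; it computes only the distance, not the critical value, which for censored data needs the modification discussed in the next section.

```python
def ks_distance(surv_a, surv_b):
    """Maximum absolute difference between two survival functions
    evaluated on the same grid of ages."""
    if len(surv_a) != len(surv_b):
        raise ValueError("survival functions must share the same age grid")
    return max(abs(a - b) for a, b in zip(surv_a, surv_b))


# Hypothetical estimates of P[Life > t] at ages t = 1, 2, 3 periods.
sample_surv = [0.99, 0.96, 0.93]      # e.g., Kaplan-Meier from a treated sample
population_surv = [0.98, 0.95, 0.91]  # e.g., estimate from population counts

distance = ks_distance(sample_surv, population_surv)
print(f"K-S distance = {distance:.3f}")
```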

Kolmogorov-Smirnov Life Test for Censored Data

My 1999 paper with the same title presumed that both the sample and population data consisted of cases and death counts, without lifetime data. It uses a likelihood ratio test. But life tests generate lifetime data, because sample subjects are tracked by name or unique identifier. Lifetimes give more precise survival function estimates than case and death counts; e.g., the Kaplan-Meier nonparametric maximum likelihood estimator for censored, grouped life data vs. nonparametric maximum likelihood estimator for case and death counts [https://sites.google.com/site/fieldreliability/random-tandem-queues-and-reliability-estimation-without-life-data/].

The references by Grover and by Fleming and Harrington deal with censored life tests. Grover’s paper and my 1997 presentation assumed equal size treated and untreated samples of life data. What if you had a small, treated sample of censored life data and a huge untreated population of case and endpoint event count data, in which the case cohorts started at different times (“staggered start”), and the event counts did not identify the cohort they came from?

This problem falls in the realm of “neutrosophic” statistics [Smarandache], because the population case cohort and endpoint event counts could have come from a variety of lives with the same periodic event counts. Table 1 shows grouped life data and event counts from two cohorts started in two periods. Table 2 shows alternative life data that result in the same event counts in the bottom row. These alternatives don’t have the same probabilities, assuming the population survival function estimate from population case and event counts.

Table 1. Grouped life data and case and endpoint event counts. The period 1 cohort has 2 deaths in period 1 and 3 in period 2. The period 2 cohort has 2 deaths in period 2. The bottom row gives the sums of endpoint event counts by period. More than one period cohort (cross-section) of population cases is needed to reduce length-bias without life data [Chan].

Period	Cases	Deaths Period 1	Deaths Period 2
1	98	2	3
2	100		2
Period Sums	198	2	5

Table 2. Alternative grouped life data endpoint event counts that could have resulted in the same event counts as in the bottom row of table 1. Each pair of columns headed 1 and 2 shows one alternative set of grouped life data that gives the same column sums as in table 1.

Period	1	2	1	2	1	2	1	2	1	2
1	2	2	2	1	2	0	2	4	2	5
2		3		4		5		1		0
Sums	2	5	2	5	2	5	2	5	2	5
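
The alternatives in table 2 can be listed by brute-force enumeration. This sketch uses the cohort sizes and period sums of table 1; the variable names are mine.

```python
from itertools import product

cases = [98, 100]    # cohort sizes from table 1
period_sums = [2, 5]  # bottom row of table 1

# d11, d12: period-1 cohort deaths in periods 1 and 2.
# d22:      period-2 cohort deaths in period 2.
alternatives = []
for d11, d12, d22 in product(range(max(period_sums) + 1), repeat=3):
    if d11 != period_sums[0]:        # only cohort 1 exists in period 1
        continue
    if d12 + d22 != period_sums[1]:  # both cohorts contribute in period 2
        continue
    if d11 + d12 > cases[0] or d22 > cases[1]:
        continue                     # cannot exceed cohort size
    alternatives.append((d11, d12, d22))

for d11, d12, d22 in alternatives:
    print(f"cohort 1: {d11}, {d12}   cohort 2: {d22}")
```

The enumeration returns six sets of grouped life data: the one in table 1 plus the five alternatives in table 2.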

Problem statement

From “To the Man with a Hammer,…” [George 1997]

“I compute nonparametric, age-specific reliability estimators from ships and failures data, without life data. Although they are maximum likelihood estimators, they are not Kaplan-Meier (K-M) estimators because failures are grouped by calendar time intervals regardless of ages-at-failures. Fortunately they are population, not sample, estimators, so their only uncertainty is due to censoring.”

“The modified (for censored data) Kolmogorov-Smirnov (K-S) test applies to K-M estimators (from life data), not ships and returns estimators. What is the asymptotic distribution of the maximum difference between two reliability functions estimated from grouped ships and returns data [George 1996]? Is the modification in [Gnedenko] still appropriate? Is only power affected, not P[type I error]? I conjecture the modified K-S test still has the same asymptotic distribution, but the numbers of observed failures should be replaced by the numbers of time intervals containing failures. The references by Nikiforov derive the asymptotic distribution of the K-S test statistic and provide a robust program for the K-S test statistic, but not for the modification for table 2 data. Reference by Fleming and Harrington, 1991, describes log-rank statistic alternatives to the K-S test, which may be more powerful than K-S tests when reliability estimates cross.”

Muhammad Aslam [2020] proposed one- and two-sample “neutrosophic” K-S tests (NK-S) where observations are contained in intervals, not known “crisply.” The test is based on an interval containing the K-S difference statistic instead of its exact value. Aslam did not specify how to deal with functions of interval observations: enumeration, interval arithmetic, or simulation.

Solution

Simulate population life data with the same column sums or event counts as in the population data. Compute the K-M estimator from the simulated population life data and its K-S distance from the sample K-M survival function estimate. If the sample K-M estimator’s K-S distance is less than a chosen percentile (e.g., the 95th) of the simulated |population−sample| K-S distances, do not reject the null hypothesis. Naturally, I call this an SNK-S test.
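
A rough sketch of such a simulation follows. It makes several assumptions that are mine, not the article’s: each period’s events are split at random among the cohorts in service, with every surviving unit equally likely to fail; the reference distances are taken between the population estimate and each simulated grouped K-M estimate (my reading of the figure 1 caption); and the sample survival values are hypothetical. The counts are those of table 3.

```python
import random


def simulate_life_data(ships, returns, rng):
    """Split each period's returns among cohorts already shipped so that
    the column sums equal `returns`. deaths[s][a] is the number of
    cohort-s failures at age a (in periods)."""
    n = len(ships)
    deaths = [[0] * (n - s) for s in range(n)]
    remaining = list(ships)
    for t in range(n):
        for _ in range(returns[t]):
            # Assumption: every unit still in service is equally likely to fail.
            s = rng.choices(range(t + 1), weights=remaining[: t + 1])[0]
            deaths[s][t - s] += 1
            remaining[s] -= 1
    return deaths


def grouped_km(ships, deaths):
    """Kaplan-Meier estimate of S(1), S(2), ... from grouped cohort life data."""
    n = len(ships)
    surv, curve = 1.0, []
    for a in range(n):
        cohorts = range(n - a)  # cohorts observed through age a
        at_risk = sum(ships[s] - sum(deaths[s][:a]) for s in cohorts)
        d = sum(deaths[s][a] for s in cohorts)
        surv *= 1.0 - d / at_risk
        curve.append(surv)
    return curve


def ks_distance(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


ships = [100, 100, 100]
returns = [2, 5, 9]                   # bottom row of table 3
population_surv = [0.98, 0.95, 0.91]  # assumed estimate from the counts
sample_surv = [0.985, 0.96, 0.93]     # hypothetical sample K-M estimate

rng = random.Random(0)
sim = sorted(
    ks_distance(population_surv, grouped_km(ships, simulate_life_data(ships, returns, rng)))
    for _ in range(20)
)
critical = sim[(95 * len(sim) + 99) // 100 - 1]  # nearest-rank 95th percentile
observed = ks_distance(population_surv, sample_surv)
decision = "do not reject" if observed < critical else "reject"
print(f"observed = {observed:.4f}, 95th percentile = {critical:.4f}: {decision} Ho")
```

Twenty replicates match the number used in the text; more replicates would give a steadier percentile.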

I simulated life data from the population data in table 3 and ran 20 simulations of the K-S distance between population and sample data. Figure 1 shows that a lognormal distribution fits pretty well, especially near the upper end. The simulated mean of ln(K-S distance) was -4.23 and the standard deviation was 5.2. The 95th percentile was 0.032. If the K-S distance between a population nonparametric maximum likelihood estimator, from case and death counts, and the sample K-M estimator is less than 0.032, do not reject the null hypothesis at the 5% significance level (95% confidence).

However, the sets of simulated life data are not equally likely, assuming the population survival function estimate from population case and event counts. So I weighted each simulated K-S distance by a normalized Kullback-Leibler (K-L) divergence of its simulated K-M estimator from the population survival function. Figure 2 shows the weighted alternative to figure 1, for the same simulated K-S distances. The simulated mean of ln(weighted K-S distance) was 0.00115 and the standard deviation was 0.00112. The 95th percentile was 0.00304. If the weighted |population−sample| K-S distance between a population nonparametric maximum likelihood estimator, from cases and deaths, and the sample K-M estimator is less than 0.00304, do not reject the null hypothesis at the 5% significance level (95% confidence).
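
The weighting can be sketched as follows. The direction of the K-L divergence (simulated relative to population) and the three simulated survival curves are assumptions for illustration only; the weights follow the figure 2 caption, (K-L divergence)/Σ(K-L divergences).

```python
import math


def surv_to_pmf(surv):
    """Convert S(1..k) to a pmf over ages 1..k plus a 'survived past k' cell."""
    pmf, prev = [], 1.0
    for s in surv:
        pmf.append(prev - s)
        prev = s
    pmf.append(prev)
    return pmf


def kl_divergence(p, q):
    """D(p || q) = sum p_i * ln(p_i / q_i); cells with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


population_surv = [0.98, 0.95, 0.91]
# Hypothetical K-M estimates from three simulated sets of life data.
simulated_survs = [
    [0.98, 0.96, 0.91],
    [0.98, 0.94, 0.90],
    [0.97, 0.95, 0.92],
]

pop_pmf = surv_to_pmf(population_surv)
ks = [max(abs(a - b) for a, b in zip(population_surv, s)) for s in simulated_survs]
kl = [kl_divergence(surv_to_pmf(s), pop_pmf) for s in simulated_survs]

weights = [d / sum(kl) for d in kl]              # normalized K-L divergences
weighted_ks = [w * d for w, d in zip(weights, ks)]

for d, w, wd in zip(ks, weights, weighted_ks):
    print(f"K-S = {d:.4f}  weight = {w:.4f}  weighted K-S = {wd:.5f}")
```

Because the weights sum to one, the weighted distances are much smaller than the raw ones, which is consistent with the change of horizontal scale between figures 1 and 2.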

Table 3. Life data for simulation to give same bottom row

Period	Ships	1	2	3
1	100	2	3	4
2	100		2	3
3	100			2
Sums	300	2	5	9

Figure 1. Simulated K-S distances from table 3 data. Distance is maximum absolute difference between nonparametric maximum likelihood estimator from bottom row and the Kaplan-Meier estimator from simulated event counts.

Figure 2. Simulated K-S distances from table 3 data, weighted by K-L divergence from population survival function. Horizontal axis differs from figure 1, because K-S distances are multiplied by ratio of (K-L divergence)/Σ(K-L divergences).

Afterthoughts: Multiple inference and COVID-19 vaccine

“Recognize that any frequentist statistical test has a random chance of indicating significance when it is not really present. Running multiple tests on the same data set at the same stage of an analysis increases the chance of obtaining at least one invalid result. Selecting the one “significant” result from a multiplicity of parallel tests poses a grave risk of an incorrect conclusion. Failure to disclose the full extent of tests and their results in such a case would be highly misleading.” Professionalism Guideline 8, Ethical Guidelines for Statistical Practice, American Statistical Association, 1997. [https://web.ma.utexas.edu/users/mks/statmistakes/multipleinference.html/]
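
The guideline’s warning can be put in numbers. The sketch below assumes the tests are independent, which tests run on the same data usually are not, so read the result as a rough indication rather than an exact error rate.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """P[at least one false rejection] among n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test level that keeps the family-wise error at or below alpha."""
    return alpha / n_tests


for k in (1, 2, 5):
    print(
        f"{k} tests at 5%: family-wise error = {familywise_error(0.05, k):.3f}, "
        f"Bonferroni per-test level = {bonferroni_alpha(0.05, k):.3f}"
    )
```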

1. Simulate the K-S distance for all simulated population K-M estimates and do not reject the null hypothesis if all the K-S distances are small.

2. Do the likelihood ratio test too [George 1999 and 2021], using the column sums from the sample life data and the population counts. If somebody gives me some treatment life data and I can find corresponding population case and death (event) counts, I will run both tests.

3. Do log-rank and Gehan-Wilcoxon tests too? [Ed Gehan suggested that to me, 1976.]

4. Do the weighted-difference in cumulative failure rate functions proposed by Fleming and Harrington [1980]. This deals with crossing failure rate functions. Their test statistic has known asymptotic properties.

5. Does the 95% Pfizer COVID-19 vaccine trial efficacy apply to the population? Vaccine efficacy = (Cases(unvacc.)/TTT(unvacc.) − Cases(vacc.)/TTT(vacc.))/(Cases(unvacc.)/TTT(unvacc.)), where TTT() stands for total time on test. TTT() is time since vaccination for the treated subjects; for the unvaccinated it is total time since February or March 2020, when COVID-19 started, or a comparable time since June, when vaccination trials started.
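
The person-time version of vaccine efficacy in item 5 can be sketched as below. The case counts are the trial’s, but the total times on test are hypothetical placeholders, chosen equal so that the result reduces to 1 − 8/162.

```python
def incidence_rate(cases: float, total_time_on_test: float) -> float:
    """Cases per unit of total time on test (TTT)."""
    return cases / total_time_on_test


def efficacy_ttt(cases_vacc, ttt_vacc, cases_unvacc, ttt_unvacc):
    """(rate_unvacc - rate_vacc) / rate_unvacc, as in item 5."""
    rate_vacc = incidence_rate(cases_vacc, ttt_vacc)
    rate_unvacc = incidence_rate(cases_unvacc, ttt_unvacc)
    return (rate_unvacc - rate_vacc) / rate_unvacc


# Hypothetical person-years on test; only the case counts come from the text.
ttt_vaccinated = 2000.0
ttt_unvaccinated = 2000.0

eff = efficacy_ttt(8, ttt_vaccinated, 162, ttt_unvaccinated)
print(f"efficacy with equal TTT = {eff:.2%}")

# If the unvaccinated had been exposed twice as long, the same counts
# would imply a lower efficacy.
eff_longer = efficacy_ttt(8, ttt_vaccinated, 162, 2 * ttt_unvaccinated)
print(f"efficacy with doubled unvaccinated TTT = {eff_longer:.2%}")
```

The second call shows why the choice of TTT for the untreated population matters: the efficacy estimate moves with the exposure time even when the case counts do not.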

Are you running sample life tests? What are you comparing to the sample reliability function estimate? Compare sample with population reliability function estimates; the latter doesn’t have any sample uncertainty!

References

Aslam, Muhammad, “Introducing Kolmogorov−Smirnov Tests under Uncertainty: An Application to Radioactive Data,” ACS Omega 5, 914−917, http://pubs.acs.org/journal/acsodf, 2020

Averitt, Amelia J., Chunhua Weng, Patrick Ryan, and Adler Perotte, “Translating evidence into practice: eligibility criteria fail to eliminate clinically significant differences between real-world and study populations,” npj Digital Medicine 3:67, https://doi.org/10.1038/s41746-020-0277-8, 2020

Belin, Lisa, Yann De Rycke, and Phillippe Broët, “A two-stage design for phase II trials with time-to-event endpoint using restricted follow-up,” Contemporary Clinical Trials Communications, Volume 8, Pages 127-134, https://doi.org/10.1016/j.conctc.2017.09.010, December 2017

Chan, Kwun Chuen Gary, “Survival analysis without survival data: connecting length-biased and case-control data,” Biometrika 100 (3): 764-770, 2013

Dean, N., Gsell, P.S., Brookmeyer, R., Crawford, F., Donnelly, C., Ellenberg, S., Fleming, T., Halloran, M. E., Horby, P., Jaki, T., Krause, P., Longini, I., Mulangu, S., Muyembe-Tamfum, J.J., Nason, M., Smith, P., Wang, R., Henao-Restrepo, A., and De Gruttola, V. “Creating a Framework for Conducting Randomized Clinical Trials During Disease Outbreaks.” The New England Journal of Medicine, 382, 1366-1369, 2020

Dianna Deeney, “How Many Controls Do We Need to Reduce Risk?” https://accendoreliability.com/podcast/the-reliability-fm-network/qdd-027-how-many-controls-do-we-need-to-reduce-risk/#more-449094, Sept. 2021

FDA, “Submitting Documents Using Real-World Data and Real-World Evidence to FDA for Drugs and Biologics Guidance for Industry,” May 2019

Fleming, Thomas R. and David P. Harrington, “A Class Of Hypothesis Tests For One and Two Sample Censored Survival Data,” Technical Report Series, No. 9, August 1980

Fleming, T. R. and D. P. Harrington, Counting Processes and Survival Analysis, Wiley-Interscience, New York, 1991

Gehan, E. A., “A generalized Wilcoxon test for comparing arbitrarily singly-censored samples.” Biometrika 52, 203-223, 1965

George, L. L., and A. C. Agrawal, “Estimation of a hidden service distribution of an M/G/∞ system,” Naval Research Logistics, 20: 549–555. doi: 10.1002/nav.3800200314, https://sites.google.com/site/fieldreliability/home/m-g-infinity-service-distribution, 1973

George, L. L., “Ergodic Theory, Nyquist Samples, and Field Reliability,” Triad Systems Corp., March 1996

George, L. L. “Product Reliability Comparison with Censored Data,” or “To the Man With a Hammer, Everything Looks Like a Nail,” ASQ Reliability Review, Vol. 17, No. 1, March 1997

George, L. L., “Compare Population and Customer Reliability,” Quality and Productivity Research Conference, ASQ and UC Berkeley, Santa Rosa, CA May 1998

George, L. L., “Why Kill Controls?” https://www.linkedin.com/feed/update/urn:li:activity:6704848960865103872, 1999

Gnedenko, B. V., Yu. K. Belyayev, and A. D. Solovyev, Mathematical Methods of Reliability Theory, Academic Press, New York, pp. 274-276, 1969

Grover, N. B., “Two-sample Kolmogorov-Smirnov test for truncated data,” https://doi.org/10.1016/0010-468X(77)90039-3

Joyner, Michael, et al., “Effect of Convalescent Plasma on Mortality among Hospitalized Patients with COVID-19: Initial Three Month Experience,” MedRxiv preprint, https://doi.org/10.1101/2020.08.12.20169359, Aug. 2020

Koziol, James A. and David P. Byar, “Percentage Points of the Asymptotic Distributions of One and Two Sample K-S Statistics for Truncated or Censored Data,” Technometrics, Vol. 17, No. 4, pp. 507-510, doi = 10.1080/00401706.1975.10489380, https://www.tandfonline.com/doi/abs/10.1080/00401706.1975.10489380, 1975

LeBlanc, Michael and Catherine Tangen, “Choosing Phase II Endpoints and Designs: Evaluating the possibilities,” Clin. Cancer Res. 18(8): 2130–2132. Published online 2012 Mar 8. doi: 10.1158/1078-0432.CCR-12-0454, 2012 Apr 15

Nikiforov, A. M. “Algorithm AS288, Exact Smirnov Two-sample Tests for Arbitrary Distributions,” Appl. Statist, v. 43, No. 1, pp. 265-284, 1994

Nikiforov, A. M., “Subroutine GSMIRN,” statlib@lib.stat.cmu.edu

Smarandache, Florentin, Introduction to Neutrosophic Statistics, Sitech & Education Publishing, Columbus, Ohio, 2014

Comments

Larry George says
February 23, 2022 at 4:19 PM
Just read…AAAS Scientific Freedom and Responsibility award given to Ronald Jones…
“Jones is being honored for his role in exposing one of the biggest medical scandals in New Zealand’s history. He was a part of a group of three Kiwi doctors who exposed ethical abuses in a study examining cervical carcinoma in situ, or CIS.”
“In 1973, Jones joined the staff of National Women’s Hospital in Auckland as a junior obstetrician and gynecologist. At this time, Professor Herbert Green had been conducting a study into CIS that had been in progress for seven years. Despite common knowledge at the time that CIS was a precursor to cancer, Green had embarked on a study of women with CIS, without their consent, that involved merely observing rather than treating them.”
“Sadly, many of the women subsequently developed cancer and some died.”