Age-specific reliability of a standby system depends on components’ failure rates. Reliability computation is interesting when part failure rates depend on age, which is what motivates having a standby system. A Markov chain approximates the age-specific reliability and availability, which are complicated to compute exactly unless you assume constant failure rates. Why not use age-specific (actuarial) rates? They are Markov chain transition rates.
Markov chain models of standby systems are not new [Carer et al., Chakravarthy, Pattavina, El-Damcese et al., George 1973 and 2007, Manglik and Ram, and others]. Most of the Markov chain references assume constant transition rates and compute steady state behavior and MTBF, not age-specific system reliability or availability. The reference by El-Damcese et al. describes a standby system with a partial failure mode. The reference by Manglik and Ram uses constant failure rates but general repair time distributions. This article describes a transient Markov chain workbook with age-specific (actuarial) failure rates to approximate age-specific, transient standby system reliability and availability.
The workbook computes the age-specific system reliability, availability, and MTBF of the cold-standby system in figure 1. It includes a discrete-time Markov approximation and exact solution for a continuous-time system [George 1973].
Markov chain transitions include, from left to right in figure 2:
- Successful operation of part 1 for mission time
- Failure of part 1 and successful operation of standby part 2 for remaining mission time
- Failure of both parts before mission completion
Circular arrows on operating states represent part survival through one transition. The circular arrow on the failure state means there is no repair. This transition diagram doesn’t show that transition rates depend on age.
For a standby system with a finite mission time, transition from part 1 operating to success occurs when mission time or useful life is over, assuming part 1 doesn’t fail. If part 1 fails, transition from part 2 operating to success cannot occur until the remaining mission time is over, assuming part 2 doesn’t fail.
Computations with constant failure rates
If failure rates were constant, then a three-state (part 1 operating, part 2 standing by or operating, failure) Markov transition matrix would be sufficient to describe the system and compute age-specific reliability and availability.
Table 1. Markov chain transition matrix P with constant transition (failure) rates “a()” for both parts when operating.
|States||Part 1 Operating||Part 2 Operating||Failure|
|Part 1 operating||1-a()||a()||0|
|Part 2 standby or operating||0||1-a()||a()|
The Markov chain approximation multiplies the state probability vector p(t-1) by the transition matrix P: p(t) = p(t-1)P, t = 1, 2,…, mission time, where p(t) is the state probability vector after t transitions and P is the transition probability matrix with constant transition (failure) rates. System reliability R(t) at age t is the complement of the sum of the failure-state probabilities p(s|Failure): R(t) = 1-SUM[p(s|Failure)]; s = 1, 2,…, t. (Age-specific availability is the sum of the part 1 and part 2 operating-state probabilities at any time t.) Table 2 shows the data and the discrete Markov chain p(t‑1)P system failure rate, reliability, and exact continuous-time reliability.
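The recursion p(t) = p(t-1)P is easy to sketch outside the workbook. The Python below is a minimal illustration, not the workbook's VBA: the function name and the example rate 0.1 are made up for illustration, and the absorbing Failure row (0, 0, 1), which Table 1 does not list, is assumed from the no-repair condition. Because the failure state is absorbing in this sketch, its probability after t transitions already accumulates the failure probabilities from steps 1 through t.

```python
import numpy as np

def constant_rate_reliability(a, mission_time):
    """Return [R(1), ..., R(mission_time)] for a two-part cold standby system.

    a is the constant per-period failure probability of the operating part.
    """
    # Rows and columns: part 1 operating, part 2 operating, failure.
    # The last row is an assumption: failure is absorbing (no repair).
    P = np.array([
        [1.0 - a, a,       0.0],
        [0.0,     1.0 - a, a],
        [0.0,     0.0,     1.0],
    ])
    p = np.array([1.0, 0.0, 0.0])  # p(0): start new with part 1 operating
    reliability = []
    for _ in range(mission_time):
        p = p @ P                        # p(t) = p(t-1) P
        reliability.append(1.0 - p[2])   # complement of failure probability
    return reliability

if __name__ == "__main__":
    for age, r in enumerate(constant_rate_reliability(0.1, 8), start=1):
        print(age, round(r, 4))
```

With no repair, availability at time t (the sum of the two operating-state probabilities) equals R(t) in this sketch.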
Table 2. Failure rates (FR) and Markov chain system reliability R(t) for cold standby independent and identical parts. “Exact R(t)” is a numerical integration of R(t)+∫f(s)R(t‑s)ds, where the integral is from 0 to t.
|Age||Part FR||System FR||System R(t)||Exact R(t)|
Computations with age-specific, actuarial failure rates
If transition rates depend on age, then the state space could be expanded to include ages of components and actuarial (transition) rates. The computation becomes p(t) = p(t-1)P(t), where the transition matrix P(t) contains actuarial rates conditional on the state of the system at age t. Actuarial rates are failure rates conditional on survival to age t, so they satisfy the Markov property that the transition at time t depends only on the state of the Markov process at age t-1.
Table 3 shows the transition probability matrix for a two-period mission. The first three rows and three columns define the Markov chain states. “Cal Time” stands for calendar time into the mission, and “Res Time” stands for residual mission time. The other rows and columns are the transition matrix, P(t), in terms of the age-specific part failure rates a(1) and a(2), conditional on survival up to the beginning of each age, a(t) = P[t<Life≤t+1|Life>t]. The matrix represents one event per transition.
Table 3. Markov transition matrix for a cold standby system with two-period mission time and actuarial failure rates a(1) and a(2) for parts at ages 1 and 2
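One way to picture the expanded state space is to index states by which part is operating and how old it is. The Python below is a sketch under stated assumptions, not a reproduction of the Table 3 layout (which uses calendar and residual time): the function name is hypothetical, both parts are assumed identical, and a[k] is taken to be the probability that a part fails during its next period given that it has survived k periods.

```python
import numpy as np

def age_specific_reliability(a, mission_time):
    """Return [R(1), ..., R(mission_time)] with actuarial rates a[0..T-1].

    Assumed indexing: a[k] = P[fail in next period | survived k periods].
    """
    n = mission_time + 1       # ages 0..T for each part
    fail = 2 * n               # index of the single absorbing failure state
    P = np.zeros((2 * n + 1, 2 * n + 1))
    for k in range(mission_time):
        P[k, k + 1] = 1.0 - a[k]          # part 1 survives and ages one period
        P[k, n] = a[k]                    # part 1 fails; part 2 starts at age 0
        P[n + k, n + k + 1] = 1.0 - a[k]  # part 2 survives and ages one period
        P[n + k, fail] = a[k]             # part 2 fails: system failure
    P[mission_time, mission_time] = 1.0   # part 1 at age T: mission is over
    P[2 * n - 1, 2 * n - 1] = 1.0         # part 2 at age T (unreachable in T steps)
    P[fail, fail] = 1.0                   # no repair
    p = np.zeros(2 * n + 1)
    p[0] = 1.0                            # start new: part 1 at age 0
    reliability = []
    for _ in range(mission_time):
        p = p @ P                         # at most one event per transition
        reliability.append(1.0 - p[fail])
    return reliability
```

Putting age into the state makes the expanded matrix constant over time, so the same p(t-1)P multiplication applies; the age dependence lives in which row each a[k] occupies.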
Exact system reliability with age-dependent failure rates in the transition matrix is
Rsys(t) = R1(t)+∫f1(s)R2(t-s)ds, t ≥ 0,
where integration is from 0 to t, Rsys(t) represents reliability, P[System Life > t], and f1(t) is the probability density function of part 1 life. R1(t) is the probability part 1 survives the mission. The second term is the probability of failure of part 1 at age s and successful operation of part 2, R2(t-s) for the remainder of mission time t-s. The Markov approximation and exact solutions don’t agree exactly, because the Markov approximation allows at most one event per transition.
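The exact formula can be evaluated numerically for any continuous life distribution. The sketch below uses a plain trapezoid rule; it is not the workbook's VBA convolution function, and the function name, the argument names, and the step count are assumptions chosen for illustration. The caller supplies the part survival functions R1 and R2 and the part 1 density f1 as vectorized callables.

```python
import numpy as np

def exact_standby_reliability(t, survival_1, density_1, survival_2, steps=2000):
    """Approximate Rsys(t) = R1(t) + integral_0^t f1(s) R2(t - s) ds."""
    s = np.linspace(0.0, t, steps + 1)
    integrand = density_1(s) * survival_2(t - s)
    # Trapezoid rule written out explicitly to avoid version-specific NumPy names.
    integral = np.sum((integrand[1:] + integrand[:-1]) * np.diff(s)) / 2.0
    return survival_1(t) + integral

if __name__ == "__main__":
    lam = 0.1  # illustrative constant failure rate, not from the article's data
    surv = lambda x: np.exp(-lam * x)
    dens = lambda x: lam * np.exp(-lam * x)
    print(exact_standby_reliability(8.0, surv, dens, surv))
    print(np.exp(-lam * 8.0) * (1.0 + lam * 8.0))  # closed form for this case
```

For identical exponential parts the result should match the closed form (1 + λt)exp(-λt), which gives a quick check on the integration before trying age-dependent distributions.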
Table 4. Failure rates (FR) and Markov chain system reliability R(t) for cold standby independent and identical parts but with age-specific failure rates.
|Age||Part FR||System FR||System R(t)||Exact R(t)|
Open the Markov2.xls workbook and enable the VBA computer program (convolution function), if you want the exact solution. The workbook computes state probabilities for a finite-time mission of eight time units, system reliability, and failure rates as functions of age, and MTBF (mean time between failures of successive missions). It graphs failure rates (figure 3).
Table 1 of the Markov2 spreadsheet contains input and output data. Put your discrete, age-specific (actuarial) part failure rates in column B. Columns C, D, and E contain results. If your mission time differs, rescale the rates to eight time units, or add more rows.
Table 2 of the Markov2 workbook contains the state probability vectors, p(t), starting with p(0), the initial probability vector. It currently represents starting new with eight time units to go. You could change it to represent other starting conditions. The other rows of table 2 contain p(t) vectors after time 0, computed by matrix multiplications p(t-1)P. Table 3 of Markov2.xlsx contains the Markov transition matrix, P.
Table 1 of the Exact spreadsheet implements a discrete integral approximation for the exact solution. Table 2 computes the expected failure time during a failed mission, and
MTBF = 8*E[Number of missions before failure] + E[Time to failure|Mission failure].
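The MTBF formula can be sketched from the reliability column alone. The Python below is one reading of the formula, under assumptions that are not stated in the workbook description: successive missions are independent and each starts new, so the number of successful missions before a failure is geometric, and a failure during period t is counted as occurring at time t. The function name is hypothetical.

```python
def mtbf_successive_missions(reliability_by_age, mission_time):
    """MTBF = T * E[missions before failure] + E[time to failure | mission failure].

    reliability_by_age[t - 1] is R(t) for t = 1..mission_time.
    """
    R = [1.0] + list(reliability_by_age[:mission_time])  # R(0) = 1
    p_success = R[mission_time]
    p_fail = 1.0 - p_success
    if p_fail <= 0.0:
        raise ValueError("mission never fails; MTBF is unbounded")
    # Geometric count of successful missions before the first failed one.
    expected_missions = p_success / p_fail
    # P[failure in period t] = R(t-1) - R(t); condition on mission failure.
    expected_failure_time = sum(
        t * (R[t - 1] - R[t]) for t in range(1, mission_time + 1)
    ) / p_fail
    return mission_time * expected_missions + expected_failure_time
```

For example, a one-period mission with R(1) = 0.5 gives one expected successful mission plus one period to failure, for an MTBF of two periods.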
Generalizations and limitations
The Markov chain approximation generalizes to: different failure rates for parts 1 and 2, “warm” standby, more redundant parts, other mission times, repair, and parallel subsystems in series. Exact solution of these generalizations is impractical.
A VBA computer program makes the transition matrix from failure rates, part counts, and mission time, does the matrix multiplication, and prints the results in columns. In the real world, failure rates change as parts age, and age-specific failure rates require no unjustifiable, mathematically convenient assumptions to estimate and use.
If you want the Markov2 workbook for doing these computations, request it or send field data to email@example.com, and I will send back estimates of your age-specific failure rates and an implementation of your standby system, free of charge. Please refer to https://sites.google.com/site/fieldreliability/ for field data alternatives.
Carer, P., J. Bellvis, M. Bouissou, J. Domergue, and J. Pestourie, “A new method for reliability assessment of electrical power supplies with standby redundancies,” https://www.semanticscholar.org/paper/A-new-method-for-reliability-assessment-of-power-Carer-Bellvis/c7a284a711b7db4cf15962437c050585dd9165f0, 2002
Chakravarthy, Srinivas R., “Analysis of a k-out-of-n system with spares, repairs and a probabilistic rule,” J. Appl. Math. and Stochastic Analysis, Vol. 2006, Article ID 39093, pp. 1-23, https://www.hindawi.com/journals/ijsa/2006/039093/, 2006
Medhat Ahmed El-Damcese, Naglaa Hassan El-Sodany, “Discrete Time Semi-Markov Model of a Two Non-Identical Unit Cold Standby System with Preventive Maintenance with Three Modes,” American Journal of Theoretical and Applied Statistics, Volume 4, Issue 4, pp. 277-290, doi: 10.11648/j.ajtas.20150404.18, July 2015
George, L. L., “Diffusion Approximation for Two Channel, Poisson-Exponential Service Systems with Dependence,” Ph. D. thesis, University of California, Berkeley, 1973
George, L. L., “Markov Approximation of Standby System Redundancy,” ASQ R&M Tech Briefs, Vol. 1, No. 2, pp. 2-5, Jan. 2007
James Li, “Reliability Comparative Evaluation of Active Redundancy vs. Standby Redundancy,” International Journal of Mathematical, Engineering and Management Sciences, Vol. 1, No. 3, pp. 122–129, https://dx.doi.org/10.33889/IJMEMS.2016.1.3-013, 2016
Monika Manglik and Mangey Ram, “Reliability Analysis of a Two Unit Cold Standby System Using Markov Process,” Mathematical Sciences Research Journal, December 2013
Pattavina, Jeffrey S., “Tutorial on Analyzing High Reliability: Part 2,” Comms. Design, https://www.eetimes.com/tutorial-on-analyzing-high-reliability-part-2/, March 11, 2004
Larry George says
Thanks for publishing the article on using age-specific failure rate functions in Markov models. The next article should be about making simultaneous, nonparametric estimates of age-specific failure rate functions, by failure mode, without life data, for use as transition rates in Markov models. It shows competing-risk modeling without assuming independent competing risks. Face it, competing risks are dependent: on age (of course), process, environment, usage, customer, etc.
BTW: R(t)= 1-Sp(s|Failure) should have been R(t)= 1-SUM[p(s|Failure)]; s=1,2,…,mission time. I’ll try to use English instead of Greek.