
In reliability engineering and data-driven maintenance strategies, understanding different failure modes is crucial for designing robust systems. Survival analysis is a powerful statistical tool that allows us to analyze time-to-event data and assess the reliability of components over time. In this article, we’ll explore how survival analysis can be applied to multi-modal failure scenarios using R.
What is Survival Analysis?
Survival analysis helps in modeling and predicting the time until an event of interest occurs, such as equipment failure. It is widely used in fields like healthcare, manufacturing, and engineering. The Kaplan-Meier estimator is a popular non-parametric method used to estimate the survival function from lifetime data.
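To make the estimator concrete, here is a minimal sketch that computes the Kaplan-Meier product by hand and compares it with `survfit` from the survival package. The five lifetimes are hypothetical values chosen only for illustration; `km_manual` and the other variable names are not from any dataset in this article.

```r
library(survival)

# Hypothetical lifetimes (e.g. distance to failure)
# status: 1 = failure observed, 0 = censored
time   <- c(5000, 10000, 15000, 20000, 25000)
status <- c(1, 0, 1, 1, 0)

# Kaplan-Meier by hand: at each distinct failure time t_i the survival
# estimate is multiplied by (1 - d_i / n_i), where d_i is the number of
# failures at t_i and n_i is the number of units still at risk at t_i
event_times <- sort(unique(time[status == 1]))
n_at_risk   <- sapply(event_times, function(t) sum(time >= t))
n_failed    <- sapply(event_times, function(t) sum(time == t & status == 1))
km_manual   <- cumprod(1 - n_failed / n_at_risk)

# The same estimate from the survival package
fit        <- survfit(Surv(time, status) ~ 1)
km_survfit <- summary(fit, times = event_times)$surv

print(data.frame(event_times, n_at_risk, n_failed, km_manual, km_survfit))
```

Censored units never trigger a drop in the curve; they only reduce the number at risk for later failure times.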
Multi-Modal Failure Analysis
In real-world applications, systems often exhibit multiple failure modes due to varying stress factors and operational conditions. Analyzing these failure modes separately allows for better insights into system performance and potential improvements.
The figure below illustrates a survival analysis by failure modes performed in R:
Example: Key Insights from the Plot [Generated Using R]
- Kaplan-Meier Curves:
The plot shows separate survival curves for two failure modes (Mode1 in green, Mode2 in blue), along with censored data in red.
Mode1 appears to have better survival performance compared to Mode2, as its survival probability remains higher over longer distances.
- Censoring Information:
The red marks indicate censored observations, meaning that for those instances, the failure did not occur within the observed time.
- Number at Risk:
The middle section of the plot displays the number of units still at risk at different distance intervals.
- Statistical Significance:
The p-value (< 0.0001) suggests a significant difference between the failure modes, reinforcing the need for tailored maintenance strategies for each failure type.
Performing Survival Analysis in R
To replicate this analysis, the survival and survminer packages in R can be used. Below is a basic code snippet to analyze multi-modal failure data:
library(survival)
library(survminer)

# Sample data
data <- data.frame(
  distance = c(5000, 10000, 15000, 20000, 25000),
  status = c(1, 0, 1, 1, 0), # 1 = event occurred, 0 = censored
  failure_mode = c('Mode1', 'Mode2', 'Mode1', 'Mode2', 'Mode1')
)

# Fit survival model (one Kaplan-Meier curve per failure_mode group)
fit <- survfit(Surv(distance, status) ~ failure_mode, data = data)

# Plot survival curves with log-rank p-value and number-at-risk table
ggsurvplot(fit, data = data, pval = TRUE, risk.table = TRUE,
           conf.int = TRUE, legend.title = "Failure Modes",
           palette = c("green", "blue"))
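The number-at-risk table drawn under the plot can also be read directly off the fitted object. The following is a small sketch using the same sample data; the distances passed to `times` and the name `risk_table` are arbitrary choices for illustration.

```r
library(survival)

# Same illustrative sample data as in the snippet above
data <- data.frame(
  distance = c(5000, 10000, 15000, 20000, 25000),
  status = c(1, 0, 1, 1, 0), # 1 = event occurred, 0 = censored
  failure_mode = c("Mode1", "Mode2", "Mode1", "Mode2", "Mode1")
)
fit <- survfit(Surv(distance, status) ~ failure_mode, data = data)

# Number at risk and survival estimate for each group at chosen distances
s <- summary(fit, times = c(0, 10000, 20000), extend = TRUE)
risk_table <- data.frame(
  group    = s$strata,
  distance = s$time,
  n_risk   = s$n.risk,
  survival = s$surv
)
print(risk_table)
```

Note that each group starts with its own count at distance zero, because the formula treats the groups as separate populations.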
Conclusion
Survival analysis provides valuable insights into system reliability and helps identify different failure behaviors. By leveraging such techniques, engineers can implement data-driven maintenance strategies, optimize component usage, and enhance overall operational efficiency.
If you’re looking to dive deeper into reliability modeling, R offers a comprehensive set of tools to perform survival analysis effectively.
Feel free to share your thoughts and experiences with survival analysis in the comments!
Hello Laxman, congratulations on posting your first article on the Accendo website! Looking forward to seeing more from you on the topic of Design for Reliability.
Here are some comments on this article that you might find useful. I see that you have also used the “shock absorber data” to plot the figure in your article. If you have used the above code snippet (modifying it for the shock absorber data), I don’t think it is giving you what you think it is giving you. The code just treats the 3 different groups (Censored, Mode 1 and Mode 2) as 3 separate populations and plots 3 Kaplan-Meier curves on the same plot (the curve for the “Censored” group is the red horizontal line at survival probability 1). This is similar to plotting, on the same plot, a Kaplan-Meier curve for the time to fatigue failure of an automotive component and one for the time to voltage-surge failure of an electronic component in a washing machine. The electronic component cannot experience a fatigue failure, and the automotive part cannot experience a failure due to voltage surge. The “number at risk” is different for each group, because there is no connection between the two. But this is not the case when we are talking about multiple failure modes of a single component. The same component can fail due to mode 1, fail due to mode 2, or be censored. We do not know at time zero what will happen to the component in the future. Hence, the “number at risk” at time zero includes all the components, and we cannot separate it by failure mode. I suggest taking a look at the concept of the cumulative incidence function (also called the subdistribution function) when dealing with competing risks/multiple failure modes.
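As a rough sketch of what that looks like in R: the survival package estimates cumulative incidence when the status passed to `Surv` is a factor whose first level denotes censoring. The data below are hypothetical (not the shock absorber data), and the names `d`, `outcome` and `cif` are placeholders for illustration.

```r
library(survival)

# Hypothetical data: every unit is exposed to both failure modes,
# and we record whichever outcome came first
d <- data.frame(
  distance = c(5000, 8000, 10000, 12000, 15000, 18000, 20000, 25000),
  outcome  = c("Mode1", "Mode2", "Censored", "Mode1",
               "Mode2", "Mode2", "Censored", "Mode1")
)

# Surv() treats the first factor level as censoring
d$outcome <- factor(d$outcome, levels = c("Censored", "Mode1", "Mode2"))

# With a factor status, survfit returns state probabilities for all units
# together: one column for "not yet failed" and one per failure mode
cif <- survfit(Surv(distance, outcome) ~ 1, data = d)

est <- data.frame(distance = cif$time, round(cif$pstate, 3))
names(est)[-1] <- cif$states
print(est)
```

Here all units share one risk set, and at every distance the probabilities of the three states add up to 1, unlike separate Kaplan-Meier curves fitted per mode.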
Regards,
Shishir.
Hi Shishir,
Thank you for your thoughtful feedback! I really appreciate your insights, especially regarding the interpretation of multiple failure modes in the Kaplan-Meier analysis.
You’re absolutely right that in a strict competing risks scenario, all components should be considered together at time zero, rather than treating different failure modes as separate populations. However, in this example, where there are Commercial Off-The-Shelf (COTS) components involved, predominant failure modes of sub-components are often analyzed separately, as different failure modes may be attributed to distinct operational conditions. In contrast, for actual manufacturers designing and testing their own components, multiple failure modes are typically analyzed within the same reliability framework.
The graph and code snippet in the article aren’t meant to be an exact competing risks representation but rather a more generic approach to illustrate survival trends under different conditions. That said, I agree that using the cumulative incidence function would provide a more precise representation of competing failure modes. I’ll look into incorporating that perspective in future analyses.
I really appreciate your detailed input—discussions like this help refine our approaches and improve how we communicate reliability concepts. Looking forward to more such exchanges!
Regards,
Laxman
I see. If I understood you correctly, mode 1 and mode 2 failure risks in this case are not acting on a component at the same time (mode 1 could be a brittle fracture failure due to extreme cold operating conditions and mode 2 could be a stress-related failure due to extreme heat conditions). In this case, I agree that this is not a competing risks framework and that these are 2 different populations. Thank you for the clarification! The cumulative incidence function is useful when both risks are active at the same time. I noticed the words “multi modal failure” and the shock absorber data in the Kaplan-Meier curve and automatically assumed that this is a competing risks framework, which it is not.
On a side note, I see that you have used the “survminer” package to get the p-value from a log-rank test. The “survminer” plots look better than the default ones from the “survival” package, but they can sometimes show incorrect p-values (when the “weights” argument is used in “survfit”). I wrote a blog article on it last year, which you might find useful: https://rpubs.com/shishir909/1199328
I am not sure if there is a different way to use the “ggsurvplot” function to incorporate case weights, or if they have fixed the bug. This is something that could be easily missed. You might find it informative.
Regards,
Shishir.
Hi Shishir,
Thank you for the clarification! I completely agree: since failure modes 1 and 2 act on different populations rather than simultaneously on the same component, this isn’t a competing risks framework. Your point about the cumulative incidence function being useful when risks are active at the same time is well noted.
Also, I appreciate you sharing your blog article on the survminer package and its potential issue with the weights argument in survfit. I wasn’t aware of this specific limitation, and it’s definitely something to keep in mind when using ggsurvplot. I’ll check out your post—always great to learn from real-world findings like this! Have you come across any alternative approaches or recent updates that address this issue?
Regards,
Laxman
The GitHub page for the “survminer” package shows that the latest version was released on Oct 30, 2024, and I wrote the blog sometime in June 2024. I don’t know whether this bug has been fixed; I haven’t checked. (I should do it someday and raise an issue on GitHub if it hasn’t been done already.)
I would rely on the “survival” package for the log-rank test p-values, since this is a very stable package with regular updates. I verified it by calculating the p-value manually (using Excel) for the example in the blog. You can conduct the log-rank test using the “survdiff” function, and the p-value that you get there can be hard-coded in ggsurvplot by using the “pval” argument. The blog has details on this.
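A minimal sketch of that workflow, reusing the small sample data from the article (so the resulting p-value is purely illustrative); `lr` and `p_value` are placeholder names:

```r
library(survival)
library(survminer)

# Same illustrative sample data as in the article
data <- data.frame(
  distance = c(5000, 10000, 15000, 20000, 25000),
  status = c(1, 0, 1, 1, 0), # 1 = event occurred, 0 = censored
  failure_mode = c("Mode1", "Mode2", "Mode1", "Mode2", "Mode1")
)

fit <- survfit(Surv(distance, status) ~ failure_mode, data = data)

# Log-rank test from the survival package
lr <- survdiff(Surv(distance, status) ~ failure_mode, data = data)

# p-value from the chi-square statistic, with (number of groups - 1) df
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)

# Pass the externally computed p-value to the plot instead of pval = TRUE
ggsurvplot(fit, data = data, pval = signif(p_value, 3), risk.table = TRUE)
```

Computing the p-value outside the plotting call keeps the test and the figure in agreement, whichever way the plotting package computes its own value.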