Introduction to Ongoing Reliability Testing
This type of reliability testing may go by different names. A quick search of a few references in my library did not turn up ongoing reliability testing, ORT, in any of them.
It does exist and you may have heard of it before or even use some form of ORT. Or not.
Ongoing reliability testing or ORT is the continued evaluation of your product typically using samples drawn from production. The testing evaluates the reliability performance of recent production units.
The focus is on finding anomalies or changes in the design, supply chain, or production process that significantly change field reliability performance.
What ORT Is and Is Not
ORT is some form of life test. It may be an accelerated life test, ALT, or it may rely on operating the test samples in as close to use conditions as possible, i.e. real-time testing.
ORT is not the same as burn-in or HASS approaches as ORT is not a screen to expose early life failures. ORT is not a verification of engineering changes, although it may be used in part for that function.
ORT is not lot sampling, although the sampling aspect of ORT may resemble lot sampling.
ORT draws samples on a regular basis from production.
The testing evaluates the samples for adverse changes in reliability.
It provides an early warning in most cases of unwanted changes that impact the durability or longevity of products recently introduced to the field.
An accelerated life test ORT example
A relatively high volume hand-held game controller began production with the attachment of the pop filter enjoying a large amount of variability.
We knew from testing during development that too little or too much adhesive along with variations in the component diameter would lead to premature failure.
We also knew the largest risk of damage during use was due to drops and the resulting shock, vibration, and associated accumulated damage.
Combining the uncertainty concerning how many drops and under what conditions, along with the range of damage that may occur per drop, we didn’t have a clear sense the design had sufficient margin to meet reliability targets.
We decided to conduct ORT, in this case, drop testing, in order to monitor production stability until upstream adhesive and dimension control processes improved.
We also wanted to monitor the durability of the device in order to detect adverse changes to the robustness to drop damage.
The testing sampled two units from each production line at random each week. The test replicated the drop testing done during development: drop each unit from 2 meters with random alignment until failure.
We counted the drops until screen separation and the drops until functional failure.
We used a CUSUM control chart to detect significant changes in the number of drops to these two failures.
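As an illustration of how such a chart can flag a drop in durability, here is a minimal sketch of a one-sided (lower) tabular CUSUM. The weekly values, target, slack, and threshold are hypothetical numbers chosen for the example, not the data or settings from the actual program.

```python
def lower_cusum(observations, target, slack, threshold):
    """One-sided tabular CUSUM that accumulates shortfalls below target.

    Returns the CUSUM value after each observation and the index of the
    first observation at which the statistic exceeds the threshold
    (None if it never does).
    """
    s = 0.0
    values = []
    signal_at = None
    for i, x in enumerate(observations):
        # Accumulate only shortfalls larger than the slack; never go below 0.
        s = max(0.0, s + (target - slack) - x)
        values.append(s)
        if signal_at is None and s > threshold:
            signal_at = i
    return values, signal_at


# Hypothetical weekly average drops-to-failure (assumed data).
weekly_drops = [52, 49, 51, 50, 48, 41, 39, 40, 38]

# target, slack, and threshold are assumed settings, in units of drops.
values, signal_at = lower_cusum(weekly_drops, target=50, slack=2, threshold=15)
print(values)
print("first signal at week index:", signal_at)
```

In practice the target would come from the development drop-test results, and the slack and threshold would be tuned to trade false alarms against detection delay.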
Upon a signal of an adverse change, we conducted a detailed failure analysis to determine the root cause of the change. When possible we then made adjustments or changes within the design, supply chain, or production process to re-establish prior robustness or to make improvements.
The sampling and testing provided information about changes that may have occurred up to a week (or more) ago, thus placing all production during that time frame under suspicion of higher than expected failure rates.
Sampling more often would have reduced the number of units at risk.
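The trade-off between sampling interval and exposure can be sketched with simple arithmetic. The function name, the production rate, and the test delay below are assumptions for illustration only. The sketch also assumes the change is caught at the first sample drawn after it occurs, which is optimistic.

```python
def units_at_risk(units_per_day, sampling_interval_days, test_delay_days=0):
    """Worst-case units built between an adverse change and its detection.

    Assumes the change happens right after a sample is drawn and is
    detected by the very next sample, once testing is complete.
    """
    if units_per_day < 0 or sampling_interval_days <= 0 or test_delay_days < 0:
        raise ValueError("rates and durations must be non-negative, interval > 0")
    return units_per_day * (sampling_interval_days + test_delay_days)


# Hypothetical line building 1,000 units per day with 2 days to test a sample.
weekly = units_at_risk(1000, sampling_interval_days=7, test_delay_days=2)
daily = units_at_risk(1000, sampling_interval_days=1, test_delay_days=2)
print(weekly, daily)
```

With these assumed numbers, moving from weekly to daily sampling cuts the suspect population by two thirds, at the cost of seven times as many test samples.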
Within a few weeks, this ORT detected the unknown introduction of a manual adhesive application process and a change in stiffness in the circuit board due to a board manufacturer change to a new site and slight changes in their lamination process.
Without ORT both changes would have only been detected through reported field problems.
Given the time until those failures appeared, many more units would have shipped with the undesired changes, resulting in far more field failures.
A real-time ORT example
Inkjet printers have different use patterns by different customers.
Some print rarely, maybe a couple of pages a week. Others print full-color pages all day (think realtor flyers for a home for sale).
Different failure mechanisms occur depending, in part, on the use of the printer. The number and complexity of the potential failure mechanisms make evaluating each using a focused ALT approach impractical.
While some failure mechanisms were expected to be more prominent than others for the various use cases, the desire was to evaluate current production for as many failure mechanisms as possible.
Thus the team decided on a real-time operation of a sample of production units.
The primary concern was failures that occur within the first 2 months of installation. Therefore, test scripts were created to replicate use patterns for the 5 different customer profiles defined in the engineering and marketing requirements documents.
Some printers printed two black and white pages per week on average, while others printed hundreds of full-color flyers per day.
A possible sample plan may draw five units per week from production.
The test facility would require sufficient capacity to operate 5 new units per week for 8 weeks each, and thus would need berths for 40 units. In practice, given the volume and complexity of production, the testing had the capacity to test many more units at a time.
As units finished the 8 weeks they were replaced by units from that week’s production.
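The capacity figure follows from the intake rate multiplied by the time each unit stays on test. A minimal sketch, using hypothetical function names, shows the steady-state requirement and how the berths fill during the first weeks:

```python
def berths_required(units_per_week, test_duration_weeks):
    """Steady-state number of test berths: intake rate times time on test."""
    return units_per_week * test_duration_weeks


def units_on_test(week, units_per_week, test_duration_weeks):
    """Units occupying berths at the end of a given week (week 1 is the first)."""
    return units_per_week * min(week, test_duration_weeks)


# The sample plan described above: 5 units per week, 8 weeks on test.
capacity = berths_required(5, 8)
ramp = [units_on_test(w, 5, 8) for w in range(1, 11)]
print(capacity)
print(ramp)
```

The ramp shows the facility reaching full occupancy in week 8, after which each week's intake replaces the units finishing their 8 weeks.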
Failed units were examined for the root cause of the failure and remedial action taken to minimize field failures.
The testing caught a change to the pick roller which went from assembling the tacky rubber roller onto a shaft to molding the part directly onto the shaft.
The change included a mold release agent which eventually would bloom to the surface of the roller and prevent the pick roller from picking up one sheet of paper to start the paper movement from the paper tray into the printer.
This failure mechanism would impact a large proportion of all printers.
The heavy users may not experience the issue as the bloomed material would rub off with regular printing, while lighter use would allow the release agent to accumulate and cause the failure.
In this case, the ORT finding provided just a few weeks of lead time to respond to the field failures. The team found a solution for fielded units and alerted call centers and customers to a maintenance plan of action to remedy the issue.
Plus the team changed the design away from using the mold release agent underlying the issue.
ORT is not a specific type of reliability test.
It has to focus the testing on the risk of failure. This may be a specific failure mechanism, or a type of use stress, or an exploratory test looking for anomalies.
Nearly any type of reliability test could be used for ORT.
What makes ORT unique is it focuses on the recent production units and attempts to detect changes in produced units that impact field reliability.
Sampling consideration and risk management
The testing is destructive, or at least ages the product enough that the units are unfit for sale.
This typically reduces the number of samples allocated for ORT.
The balance of sample size and risk of undetected failures is further complicated by the ongoing nature of the testing. Pulling samples more often lessens the risk of undetected changes impacting a significant amount of production.
Sampling less often means more units may carry the adverse changes that lead to premature failures.
The risk of increased failures versus the cost of sampling and testing requires careful consideration and planning.
Any sampling plan has an inherent limit on its ability to detect changes in a population: the lower the failure rate that should signal a change, the more samples are required.
The detection threshold could be set at the point where the team would take action to resolve the problem, or at the level of a problem large enough to warrant a recall or a production stop.
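One common way to put numbers on this relationship is the zero-failure (success-run) calculation: the number of samples needed so that at least one failure shows up with a chosen confidence, if the true failure probability over the test has risen to a given level. This is a swapped-in textbook approximation for illustration, not the plan used in the examples above. It assumes independent units and a binomial model, and the function name and values are hypothetical.

```python
import math


def samples_to_detect(failure_probability, confidence):
    """Smallest n such that 1 - (1 - p)**n >= confidence.

    failure_probability: assumed chance a single unit fails during the test.
    confidence: desired chance of seeing at least one failure among n units.
    """
    if not 0 < failure_probability < 1:
        raise ValueError("failure_probability must be between 0 and 1")
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - failure_probability))


# Detecting a 10% failure probability versus a 1% failure probability,
# each with 90% confidence (assumed values).
n_ten_percent = samples_to_detect(0.10, 0.90)
n_one_percent = samples_to_detect(0.01, 0.90)
print(n_ten_percent, n_one_percent)
```

Under these assumptions, detecting a tenfold smaller failure probability takes roughly ten times as many samples, which is why the detection threshold has to be weighed against the cost of destroyed units.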
Again, this is a discussion to have with your team to balance risk and cost in order to craft the right ORT test plan.
ORT is a useful means to minimize the risk of unwanted changes leading to unacceptable field failure rates.
Completing the design and starting production does not mean the variability stops changing.
Changes in your supplied parts, in your production process, and even due to engineering modifications to the design, all have a risk of causing significant field failures.
ORT is one way to detect the changes before your customer alerts you to the issue.
Also published on Medium.