When is the biggest ‘improvement’ in the reliability of a new ‘type’ of product? This is a very broad question with, no doubt, lots of answers that aren’t wrong. But what we do know is that the more experience we have with building something, the more reliable it gets. And this effect is most marked at the start of the product’s life.
Studies have shown that the biggest increase in small satellite reliability occurs between the first and second product produced by a manufacturer. In fact, the first thing any small satellite designer does in response to a compressed schedule is to reuse as much of their previous designs as possible. These modules and components are described as having ‘heritage.’
What does this mean? Well, it is another argument against a compliance-based approach to reliability. If there were some sort of checklist of activities that would make a satellite reliable, then there would be nothing to be gained from experience or ‘heritage.’ If the first product ever built ticked all the ‘reliability boxes,’ there would be no scope for improvement.
But there always is. Because reliability, risk and safety are children of critical thinking. Not compliance.
This is the sixth article in a series that talks about reliability in emerging technologies. This series was initiated by my experiences in helping those involved in the manufacture of small satellites improve their reliability. Small satellites are far enough removed from their massive, multi-billion dollar predecessors for them to be effectively considered a new technology. And they are currently facing some of the typical challenges any new product faces.
The previous five articles have looked at how ‘we’ traditionally deal with mission assurance through compliance. And the results aren’t pretty. Thousands of standards, textbooks and guidebooks have been authored that include prescriptive lists of things the design team has to do to assure the thing is reliable. But these checklists are generic, always outdated, overly prescriptive and generally not helpful.
Why? Because they replace critical thought. Engineers are too scared to do what the science says is right when it conflicts with what the standard says is the law. And there are too many examples of nuclear power plants exploding, ships sinking, gas plants catching fire and so on to claim that regulatory compliance works.
Instead, we need to revisit mission assurance and focus on performance. And by this we mean looking at what the product does as opposed to what the design team does. And when we do, we can start to improve reliability and safety in an informed way.
We know that satellites tend to fail due to ‘Region 0’ or Dead on Arrival (DOA) and ‘Region I’ or infant mortality failures. The data is unequivocal. But in our most recent article, we looked at how the satellite industry is almost exclusively focusing on wear-out failures in its compliance-based frameworks. They are fixing failures that don’t happen. This is madness.
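One common way to see why a wear-out focus misses these failures is the Weibull hazard function, whose shape parameter β distinguishes a falling failure rate (β < 1, infant mortality) from a rising one (β > 1, wear-out). The sketch below is illustrative only: the characteristic life `ETA` and the β values are assumptions, not fitted to any satellite data, and DOA (Region 0) failures are not modelled because they occur at deployment rather than accumulating over time.

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Instantaneous failure rate h(t) of a two-parameter Weibull distribution."""
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta and eta must all be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


# Hypothetical parameters chosen only to show the shape of each regime.
ETA = 1000.0  # characteristic life in days (illustrative, not fitted to data)

for label, beta in [("infant mortality", 0.5), ("random", 1.0), ("wear-out", 3.0)]:
    rates = [weibull_hazard(t, beta, ETA) for t in (10.0, 100.0, 1000.0)]
    if rates[0] > rates[-1]:
        trend = "falling"
    elif rates[0] < rates[-1]:
        trend = "rising"
    else:
        trend = "flat"
    print(f"{label:16s} beta={beta}: hazard {trend}")
```

Activities aimed at wear-out (β > 1) only address the right-hand part of this picture; if the data says early failures dominate, the hazard that matters is the falling one.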
So what is the answer? This article talks about a new approach. ‘Evolutionary Production’ essentially means that critical thinking is required at all stages of design and manufacture. As opposed to adhering to a checklist of activities, everything supports the mission. If something doesn’t support the mission, then everyone has the right to ignore it. And we do this by continually evolving the design and manufacturing process as a single entity.
What does this look like? Read on.
What good stuff is happening now?
There are some novel approaches that satellite designers are starting to explore despite an overbearing compliance framework. One approach takes inspiration from ‘stem’ cells. Stem cells are biological cells that most organisms use when growing from microscopic embryos. Stem cells can be ‘reprogrammed’ by DNA through the synthesis of proteins to become virtually any cell in our body.
When this concept is applied to small satellite design, the ‘cells’ created comprise non-volatile memory and generic input/output interfaces that allow interaction with payloads, other components and buses. Each cell contains ‘proteins’ that communicate with each other, controlled by the ‘macromolecular machinery’ (MM). The MM is one of three ‘proteins’ and, amongst other things, monitors the health of all the others. If the MM fails, another assumes its role.
Because each cell’s microcontroller (MCU) stores a substantial amount of functionality in its non-volatile memory, there is less need for software code. And as each cell fails, another can take its functional place. Because each cell can communicate with multiple components (such as momentum wheels and sun sensors), a level of functionality is always maintained. Experimental results have shown great potential for reliability improvement.
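The failover behavior described above can be pictured with a toy model. This is a minimal sketch under loud assumptions: `Cell`, `CellNetwork`, the rule that the lowest-numbered healthy cell takes over monitoring, and the ‘least-loaded cell’ reassignment rule are all hypothetical simplifications, not the design used by the researchers behind the stem cell approach.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical generic cell: an MCU with non-volatile memory and generic I/O."""
    cell_id: int
    healthy: bool = True
    functions: set[str] = field(default_factory=set)


class CellNetwork:
    """Toy model of cells that monitor each other and reassign work on failure."""

    def __init__(self, cells: list[Cell], monitor_id: int) -> None:
        self.cells = {c.cell_id: c for c in cells}
        self.monitor_id = monitor_id  # the cell currently acting as the monitor (MM role)

    def healthy_cells(self) -> list[Cell]:
        return [c for c in self.cells.values() if c.healthy]

    def fail(self, cell_id: int) -> None:
        failed = self.cells[cell_id]
        failed.healthy = False
        survivors = self.healthy_cells()
        if not survivors:
            raise RuntimeError("no healthy cells remain")
        # Assumed rule: if the monitoring cell fails, the lowest-numbered healthy cell takes over.
        if cell_id == self.monitor_id:
            self.monitor_id = min(c.cell_id for c in survivors)
        # Assumed rule: each orphaned function moves to the least-loaded healthy cell.
        for fn in sorted(failed.functions):
            target = min(survivors, key=lambda c: (len(c.functions), c.cell_id))
            target.functions.add(fn)
        failed.functions.clear()

    def active_functions(self) -> set[str]:
        return set().union(*(c.functions for c in self.healthy_cells()))


# Usage: the monitoring cell fails, yet every function remains active.
net = CellNetwork(
    [
        Cell(0, functions={"telemetry"}),
        Cell(1, functions={"sun_sensor"}),
        Cell(2, functions={"momentum_wheel"}),
    ],
    monitor_id=0,
)
net.fail(0)
print(net.monitor_id, sorted(net.active_functions()))
```

The point of the toy is the property the text highlights: because cells are interchangeable, losing any one of them (including the monitor) degrades capacity rather than removing a function outright.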
This is a fantastic concept (noting it is yet to be practically proven). It has obvious utility for all failures. But there is an additional benefit that most people (including the pioneers of the approach) may not yet have identified.
By using the same cells, we don’t redesign the component for each satellite or function. So our manufacturing techniques are allowed to evolve and improve because we are producing more and more of the same cells regardless of function or application. And because the cells are being reused, system-level functionality testing becomes less important, as these tests are effectively being demonstrated by operational satellites. So we quickly gain experience at the tiniest level.
This is good. This is evolutionary.
Evolving corporate satellite design knowledge
If we value experience, then there is (an obvious) value in large scale production, albeit on a relative scale. Mass production is a manufacturing concept that allows efficiencies in manufacture. More importantly, it allows a retention of corporate knowledge. What we learn from the previous model’s strengths and weaknesses informs the next. From smartphones to automobiles, mass production has been largely responsible for enabling the vast improvements in reliability that we enjoy today.
Mass production involves single production lines. The antithesis of mass production is job production, which is essentially developmental design in response to a customer’s specification and involves one (or very few) systems in a single project. Between the two sits batch production, in which elements of the system are manufactured in production runs separated by reconfigurations of the production line.
Mass production inherently tackles the main reliability-based challenge of small satellite design: infant mortality. By allowing continual improvement in manufacturing techniques and through sheer scale, defects can be eliminated. Design flaws can be iteratively identified and improved upon. It also allows components that are used on actual missions (not just laboratory environments) to help us understand what needs to be changed for the next one.
Continually developing each component or module from scratch during each design process does not allow this to happen.
But satellite manufacture needs a different concept from the mass, batch and job production methodologies described above. This is evolutionary production. It is characterized by satellite design that treats each component and the way it is manufactured as a single entity. This entity is then recorded and iteratively built upon in each design iteration. This allows corporate knowledge to be retained in a new way, one that does not depend on the single large-scale production line of mass production, which may run for months or years before any change is made.
The culture of assurance
This brings us to the nexus of the challenge. Traditional mission assurance approaches focus on enforcing activities in the hope that they will create a culture of ‘engineering assurance.’ This is using brute administrative force to make designers create a reliable system. The higher-level intent is never to enforce the activities for their own sake; it is to create the culture of engineering assurance. But a culture cannot be enforced as easily as an activity, so we see a tendency to stick with activities.
This needs to change. Key small satellite compliance documents such as Aerospace Corporation’s Mission Assurance Guide (MAG) and NASA’s Mission Assurance Requirements (MAR) were discussed in Article 2 of this series. They are big documents, containing many standards and mandatory activities. But it is well known that satellite manufacturers tend to treat the intent of these documents as a secondary consideration at best. Which means reliability is not a priority. Looking like they are doing something about reliability is.
The need for a return of critical thinking
Returning to Article 2 of this series, we again quote Douglas Bader:
Rules are for the guidance of wise men and the obedience of fools.
Bader was a very successful World War II pilot for the Royal Air Force despite two amputated legs. And the point he is making (at least in how it relates to reliability and mission assurance) is that standards, guides and handbooks can never be assumed to represent best practice.
Designers, coders, developers and engineers need to be empowered to make their own decisions about how to achieve the intent (and only the intent) of standards, guides and handbooks. If those decisions align with the prescriptions contained therein, great. But if not, experts involved in small satellite design must be compelled (not just encouraged) to create a better process. And if you come across an engineer who steadfastly clings to doing things by the book, you may have a fool on your hands.
If you don’t trust your designers and engineers … you have hired the wrong people. That is your fault, whether you hired them yourself or have kept them on even though they never earned your trust.
To get to a state where we are (finally) allowing designers and engineers to make critical decisions, we need a new culture.
The need for a champion
Every organization that has excelled in terms of reliability and mission assurance has had a ‘champion.’ The disasters and catastrophes we see every night on the news have a common thread: a lack of a reliability champion and a workable mission assurance framework. The champion can be called many things, but has the same purpose in any organization.
There are success stories in the small satellite domain. The Johns Hopkins University Applied Physics Laboratory (APL) called its small satellite reliability champion the Systems Assurance Manager (SAM). The SAM had the functional freedom and authority to interact with all elements of the project, with direct access to management independent of project management. APL cited this independence as a key enabling factor for reliable satellites. So when it happens, it works!
Instead of a compliance checklist, customers (not just for small satellites) need to focus on understanding the skills, experience and personality of this independent champion.
The need to understand value
There is limited literature on satellite reliability and value. A 2005 study focused on large communications satellites but was based on ‘wear-out’ assumptions of satellite reliability that we know are not true (and certainly not applicable to small satellites; please read the previous articles in this series if there is any doubt). And this study claimed that its modelling approach was ‘novel.’ But it simply replicated what is already known about reliability functions based on a nonhomogeneous Poisson process, something any reliability engineer should have done many times. Perhaps it is novel in the satellite domain, but it is not novel to reliability engineering.
Notwithstanding this, the study makes a useful contribution in that, perhaps for the first time, it puts forward a logical framework for concluding that redundancy was not worth it for large satellites. Before setting reliability requirements and doing things like incorporating redundancy, we need to understand the net value this yields. Otherwise you may be simply buying ‘magic beans.’
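The standard NHPP result is that the probability of zero failures in an interval (t1, t2] is exp(-[Λ(t2) − Λ(t1)]), where Λ is the cumulative intensity. The sketch below assumes a power-law (Crow-AMSAA style) intensity Λ(t) = (t/θ)^β; that choice, and the parameter values in the usage lines, are ours for illustration and are not taken from the 2005 study.

```python
import math


def cumulative_intensity(t: float, beta: float, theta: float) -> float:
    """Expected cumulative number of failures by time t for a power-law NHPP."""
    return (t / theta) ** beta


def mission_reliability(t1: float, t2: float, beta: float, theta: float) -> float:
    """Probability of zero failures in the interval (t1, t2] for a power-law NHPP."""
    if not 0 <= t1 < t2:
        raise ValueError("require 0 <= t1 < t2")
    expected = cumulative_intensity(t2, beta, theta) - cumulative_intensity(t1, beta, theta)
    return math.exp(-expected)


# Illustrative only: beta < 1 means the failure intensity falls over time (infant mortality),
# so a 100-day window early in the mission is less reliable than one late in the mission.
early = mission_reliability(0.0, 100.0, beta=0.5, theta=1000.0)
late = mission_reliability(900.0, 1000.0, beta=0.5, theta=1000.0)
print(f"early window: {early:.2f}, late window: {late:.2f}")
```

With β > 1 the comparison flips, which is exactly why the assumed failure regime (wear-out versus infant mortality) drives the conclusions such a model produces.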
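A net-value check on redundancy can be as simple as comparing the expected value it protects against what it costs. This is a minimal sketch under stated assumptions: units fail independently and sit in active parallel (any one suffices), and `mission_value` and `extra_cost` are hypothetical figures in the same currency, with `extra_cost` standing in for the mass, power, volume and money the redundant unit consumes. Independence is a strong assumption; a shared workmanship defect that causes a DOA failure can take out both units at once.

```python
def parallel_reliability(r_unit: float, n_units: int) -> float:
    """Reliability of n independent, identical units in active parallel."""
    if not 0.0 <= r_unit <= 1.0 or n_units < 1:
        raise ValueError("r_unit must be in [0, 1] and n_units must be >= 1")
    return 1.0 - (1.0 - r_unit) ** n_units


def net_value_of_redundancy(
    r_unit: float, mission_value: float, extra_cost: float, n_units: int = 2
) -> float:
    """Expected value gained from redundancy minus its cost (positive means worth it)."""
    gain = mission_value * (parallel_reliability(r_unit, n_units) - r_unit)
    return gain - extra_cost


# Hypothetical numbers: redundancy pays off for a mediocre unit but not for a mature one.
print(net_value_of_redundancy(r_unit=0.90, mission_value=1_000_000, extra_cost=50_000))
print(net_value_of_redundancy(r_unit=0.99, mission_value=1_000_000, extra_cost=50_000))
```

The same redundant unit can be good value or ‘magic beans’ depending on how reliable the single unit already is, which is why the number has to be worked out rather than mandated.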
Emerging technology manufacturers need to put a price on complying with a standard along with the scientifically predicted benefit (or lack thereof) in doing so. Manufacturers (as a bare minimum) need to communicate to customers that they know the intent of a standard, and how they intend to achieve it.
Customers (for their part) need to allow this conversation. They need to be open to this dialogue, and not fear ‘non-compliance.’ Instead, they should only fear paying fools to build their satellite.
The need for useful (and perhaps novel) requirements
Systems engineering in its most inspiring form involves proactive and imaginative linkages between what the customer wants and what the designers need to do. Too often this is perceived as a complicated administrative function of ever more subordinate requirements cascading from ‘high-level’ specifications. While traceability is key, it is not the answer.
The APL experience involved the establishment of a Performance Assurance Implementation Plan, which formalized a two-way conversation between designers and assurance staff on what needed to be done to create a reliable system. This created a sense of buy-in which is necessary for a cultural focus on assurance.
The need for Evolutionary Production (and Plug and Play)
Small satellites are typically manufactured using ‘job production.’ Most satellites are designed for a specific customer each time. And the two main issues of job production are higher costs and a lack of continual manufacturing improvement in support of quality and reliability goals.
Small satellites will never be truly mass produced. They will never be as numerous as smart phones, automobiles or other consumer devices. But we need to incorporate some of the benefits of mass production when making small satellites or any new product.
Reliable products are not made using ‘job production.’ Small satellites can’t all be ‘one-offs.’ Corporate knowledge never accumulates. Continual improvement never happens as no satellite ‘continues’ to be made after it has been provided to a customer. Standardization (or perhaps versioning) needs to occur within actual build states. Small satellite components need to be incremental generational improvements on the previous production.
This requires the repeated use of common, critical modules and components. Designs are refined and improved for each small satellite customer, but never designed from scratch. This approach requires some form of standard design practices (more on that later). But trying to replicate manufacturing processes that already exist for unrelated applications does not work.
The implied outcome of this approach is standardized interfaces. Some manufacturers have used things like the UNISEC electrical bus specification to develop modules that allow extraordinarily compressed test and assembly schedules. The University of Würzburg has used the UNISEC specification and believes that the inherent robustness of the architecture has contributed to its UWE-3 satellite (launched in 2013) remaining operational at the time of writing.
The outcome of this is ‘plug and play,’ but with a small satellite flavor.
The need for resilience
Usefully, the ‘bits’ of small satellites that are likely to be repeatedly used from one mission to the next are the critical ones that allow the small satellite to be controlled and adapted from Earth. So these become the focus of the majority of design effort. Payloads and other modules that have reprogrammable software and memory units afford a level of flexibility. They then naturally need to ‘give up’ some of the design-for-reliability resources that would otherwise be dedicated to them.
A need for a Quality Management System (QMS) – but not the boring kind
Anyone experienced with QMS will probably know that this is often considered to be a process that is ‘compliant’ with Standard AS 9100D Quality Management Systems – Requirements for Aviation, Space and Defense Organizations. But the intent of any quality standard is to focus on the end state and not the process (notwithstanding the good number of organizations who make a living from the alternative interpretation).
History has shown that making novel products (including small satellites) needs to align with the intent and not the letter of the standard. And an important principle is:
devolving responsibility to the lowest possible level.
This means (for example) that configuration levels for designs and drawings are reduced. Engineers need to be able to make changes without waging a bureaucratic war to complete complex engineering change orders. This supports flexibility and adaptability. Keeping change orders limited to those changes that need higher level involvement means that higher level managers only focus on important changes. Managers who are swamped with reviewing every single change cannot do their jobs. Again, this supports the intent of the quality standard. And importantly, software assurance becomes part of the QMS, not a separate task.
A need for testing to learn – not testing to pass
We have discussed how a ‘culture of compliance’ is not and has never been truly helpful in terms of optimizing quality and reliability. The reality is that passing a qualification test provides a sense of reassurance to the designer. It does not then inform the designer on what he or she could do to improve reliability. This needs to change.
Some opportunities for better testing due to small satellite size and mass have been identified. These include the incorporation of standard interfaces and automation. An ongoing problem with small satellites is the difficulty of placing sensors in a system that is already highly constrained in terms of size. This is an element of testability, and testability should be designed into software as well.
Things get more complicated for cooperative satellites, which require a new approach to testing. Test and simulation platforms need to be established to examine satellite cooperative behavior, with some developments in this regard underway. And a more automated approach to testing, one that will dramatically reduce developmental timeframes, is more than possible.
A need for vendors to ‘come on board’
Manufacturers in the small satellite industry and many others are delusional when it comes to vendors. There is an ill-founded expectation (hope?) that component suppliers will spontaneously design more robust elements for small satellites. Phrases like this one are common in small satellite design literature:
For [small satellite] and CubeSat missions of the future, we [the paper’s authors] expect an overall maturing of the vendor and supply base.
You get what you demand and not what you expect (or hope for). Consider the following two quotes from suppliers to North American automobile manufacturers:
In my opinion, [Ford] seems to send its people to ‘hate school’ so that they learn how to hate suppliers. The company is extremely confrontational. After dealing with Ford, I decided not to buy its cars … Senior executive, supplier to Ford
The Big Three [US automakers] set annual cost-reduction targets [for the parts they purchase]. To realize those targets, they’ll do anything. [They’ve unleashed] a reign of terror, and it gets worse every year. You can’t trust anyone [in those companies] … Director, interior systems supplier to Ford, GM and Chrysler
Now consider the following quotes describing known automobile industry leaders in terms of reliability and quality:
Honda is a demanding customer, but it is loyal to us. [American] automakers have us work on drawings, ask other suppliers to bid on them, and give the job to the lowest bidder. Honda never does that … CEO, industrial fasteners supplier to Ford, GM, Chrysler, and Honda
Toyota helped us dramatically improve our production system. We started by making one component, and as we improved, [Toyota] rewarded us with orders for more components. Toyota is our best customer … Senior executive, supplier to Ford, GM, Chrysler, and Toyota
So how do small satellite manufacturers intend to reward their suppliers? How will they know when to do it? Why would you (as a supplier) do anything to improve reliability, when this is likely to make the component more expensive and there is no indication that manufacturers will pay this premium?
A common theme that applies to components and parts suppliers is that the marketplace drives the products. If there is a consistent focus on cost, then suppliers are being ‘told’ to not focus on reliability. Any improvement in reliability (which comes at a cost) means that the components will not be purchased. If there is no ramification for substandard components, then suppliers are being ‘told’ that focusing on creating ‘acceptable’ components is a waste of money. And if manufacturers do not engage suppliers, then suppliers are being ‘told’ there is nothing wrong.
But there is a clear issue with the Toyota and Honda analogies being used above. Small satellite manufacturers do not have the buying power to influence design in the same way. So there is a need to break down suppliers into two categories:
- Deep Supplier Partnerships with suppliers who value small satellite manufacturers as significant customers who are willing to work collaboratively in a way that improves all organizations based on trust and free exchange of information; and
- Shallow Supplier Relationships with suppliers whose market is large enough that there is little perceived value in modifying or improving components for small satellite manufacturers, causing a focus on parts screening and analysis.
Mass produced consumer electronics don’t align with the unique demands of small satellites. But consumer electronics are the key driver of many electronic part suppliers’ business value proposition.
We have been manufacturing relatively simple components such as transistors and capacitors for a long time now. So if they aren’t reliable now – why should we presume they will be in the future if we do nothing?
Everyone needs to change. Previous articles have shown how the satellite industry should now focus on DOA and infant mortality failures only, and not the wear-out failures it is currently focusing on. This means studying radiation, cyclic loads or any other mechanism that accumulates damage won’t do anything to improve reliability.
Nothing can really be done to change the way standards and assurance guides are authored. Consensus drives the outcomes, which is the antithesis of how emerging technologies work. We don’t want to focus on what has worked in the past when we are dealing with an emerging technology.
The change needs to come in terms of the ‘customer-manufacturer’ continuum. Customers demand compliance – not manufacturers. And if customers demand compliance, they are demanding an unreliable small satellite that is ‘certified’ as being reliable.
So customers and manufacturers need to work together to move toward a performance-based assurance framework. This prioritizes critical thinking in a culturally supportive way. And we will know this has occurred once we start talking only about DOA and infant mortality satellite failures.
Some progress has been made on small satellite operator ‘norms’ of behavior. These non-binding discussions primarily deal with how small satellites affect those around them, including things like de-orbit, tracking and so on. There is little in terms of collective efforts to share information regarding building better satellites.
But it can be reasonably concluded that:
- Evolutionary Production as a concept needs to be implemented. Satellite design should treat each component and the way it is manufactured as a single entity, creating an iterative baseline for incremental improvement.
- A move away from compliance and toward performance, in which customers value ‘heritage,’ science-based analysis, and the extent to which satellites are characterized by ‘mature’ components.
- Best practices focus on manufacturing quality and system level integration testing and not wear-out failure mechanisms (or whatever the failure mechanism that reliability data analysis shows is not dominant).
- Customers and manufacturers understand the value proposition of small satellites to better inform all stakeholders about what reliability design decisions need to be made.
- Manufacturing organizations are valued by their organizational approach to critical thinking.
And to hold true to the ideals suggested in this article, these conclusions themselves need to be continually reevaluated with critical thinking. There may come a time when we tackle DOA and infant mortality failures to the extent that (our now highly reliable) satellites only fail due to wear-out. At that point, refocusing on wear-out failure mechanisms is precisely what the philosophy recommended here would call for.
The only thing that doesn’t change is change itself.
The next and final article discusses some very specific things that were implemented within a small satellite manufacturer to initiate evolutionary production. Only the following years will tell whether these changes were successfully implemented and worth it. But hopefully it will help you understand what is going on, and how these ideas could be applied to your business.