An Overview
ISO 26262, the Automotive Functional Safety standard, provides safety concepts, definitions, analysis methodologies, safety qualification guidelines, processes, management practices, and much more. Analysis methodologies such as FMEA were originally developed by the military in the late 1940s and later adapted and refined by NASA and the automotive industry. These are systematic methods used to identify and address potential failures. At the core of the standard is understanding safety-related faults, how to analyze them, detect them, and implement effective mitigation strategies. It is important to clearly define safety requirements and design inputs to ensure effective verification and validation. Engineering practices play a key role in systematic risk mitigation and safety analysis.
ISO 26262 provides theoretical definitions for faults and failures. To apply them in practice, it is essential to understand how faults and failures are classified based on their behavior: which ones directly impact the safety goal, which ones indirectly contribute to a violation, which ones are inherently safe and can be ignored, and which ones may emerge during the system’s lifetime.
Our aim is to explain these concepts clearly using simple and realistic examples.
NOTE: You can find the theoretical definitions in ISO 26262 Part 1, Vocabulary.
The ISO 26262 official website: https://www.iso.org/standard/68383.html
Despite the benefits of these methodologies, there are challenges in applying them, such as detecting failures, comprehensively assessing risks, and addressing the limitations of the methods in complex or organizational scenarios. Recognizing these challenges is important for improving risk assessment accuracy.
Difference Between Fault, Failure and Error
Fault: A fault is an abnormal condition that can cause an element or item to fail, e.g., a relay open circuit or a microcontroller pin shorted to ground.
Failure: A failure is the termination of the intended behavior of an element or an item due to a fault, e.g., a relay open circuit stops the power conversion in an HVDCDC system.
Error: An error is a discrepancy between the computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition, e.g., the control unit reads incorrect sensor data due to the short circuit at the pin.
Overview of Faults and Failures
This section describes the various categories of safety relevant faults and failures.
Understanding the different types of faults and failures is essential for performing safety analysis, defining safety mechanisms, and ensuring compliance with Automotive Safety Integrity Level (ASIL) requirements. ISO 26262 categorizes faults and failures based on their origin, behavior, and persistence, helping engineers systematically assess risks and design effective detection, mitigation, and recovery strategies. Understanding the specific characteristics of components or systems is essential for accurate risk assessment, as these characteristics directly influence potential risks. It is also important to define the scope of the analysis—whether at the component, subsystem, or system level—to ensure all relevant aspects are considered. The analysis helps identify potential hazards that could impact safety and reliability. Engineers determine the overall risk level by analyzing the effects and likelihood of different faults and failures.
So, let us get ready to dig deeper into the details!
Systematic Faults and Failure
This section describes faults related to specification, design, implementation, or process.
Systematic Fault: It is a fault whose failure is manifested in a deterministic way that can only be prevented by applying process or design measures.
To make such measures more concrete, here is an example.
Suppose you receive a customer specification and some of the requirements are incorrect. These incorrect requirements can be identified through reviews or inspections and corrected by following the formal change request process.
Systematic Failure: It is a failure related to a cause that can only be eliminated by a change of design, manufacturing process, procedures, documentation, or other relevant factors.
Examples
- A system failure can be due to a bug in the SW code.
- There could be missed or incorrect requirements in the safety requirement specifications.
- There could be a missing or incorrect connection between two hardware components in the schematic.
- There could be an incorrectly defined or missed interface for an ASIL Component in the SW architecture.
- An incorrectly performed safety analysis.
- An incorrectly performed code, design, or documentation review that missed a major safety requirement.
- Components such as resistors or capacitors that are not populated during manufacturing.
- An incorrect test case or procedure for testing a safety requirement.
Random Hardware Fault and Failure
This section describes faults that occur randomly over time.
Random Hardware fault: It is a hardware fault with a probabilistic distribution.
Random Hardware failure: A failure that can occur unpredictably during the lifetime of a hardware element and that follows a probability distribution. The failure rate of hardware components is a key parameter in reliability analysis and safety assessments.
Example
Aging or stress failure of electronic components, including contact failure, solder joint failure, and PCB or semiconductor failure. The mean time to failure (MTTF) quantifies how long components typically operate before experiencing a failure. Engineers estimate failure rates using historical data, standards, or reliability models.
Residual Faults and Failure
This section describes faults that indicate weaknesses in diagnostic coverage.
Residual fault: It is a portion of a random hardware fault that by itself leads to the violation of a safety goal, occurring in a hardware element where that portion of the random hardware fault is not controlled by a safety mechanism. Components may continue operating normally until a random failure occurs.
Note: If a safety mechanism has a coverage of 60% of faults in an item/element, then the remaining 40% are residual faults.
Example
Consider a hardware element (e.g., a register) that has three types of faults: open, short-to-ground, and short-to-high.
If safety mechanisms are implemented to cover the open and short-to-ground faults, but the short-to-high fault is not covered by any safety mechanism, then this uncovered fault is a residual fault: it is not detected or mitigated by any safety mechanism and could lead to a violation of the specified safety goal.
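The coverage note above can be written as a small calculation. The following is a minimal sketch, assuming the failure rate of the safety-goal-violating faults is given in FIT (failures per 10^9 hours) and the coverage as a fraction; the function name and the numbers are illustrative and not taken from the standard.

```python
from typing import Tuple


def split_failure_rate(lambda_fit: float, diagnostic_coverage: float) -> Tuple[float, float]:
    """Split a failure rate into a covered part and a residual part.

    lambda_fit: failure rate of the faults that could violate the safety
                goal, in FIT (hypothetical input value).
    diagnostic_coverage: fraction of those faults controlled by the
                safety mechanism, between 0 and 1.
    """
    if not 0.0 <= diagnostic_coverage <= 1.0:
        raise ValueError("diagnostic_coverage must be between 0 and 1")
    covered = lambda_fit * diagnostic_coverage
    residual = lambda_fit - covered  # the part no safety mechanism controls
    return covered, residual


# Illustrative numbers only: 10 FIT with 60% coverage.
covered, residual = split_failure_rate(10.0, 0.60)
print(f"covered: {covered:.1f} FIT, residual: {residual:.1f} FIT")
```

With 60% coverage, the remaining 40% of the failure rate is what counts as residual.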
Single-point / Dual-point / Multiple-point / Latent Fault and Failure
This fault classification helps to evaluate architectural robustness and calculate hardware metrics such as SPFM and LFM.
Single-point Fault: A hardware fault in an element that leads directly to the violation of a safety goal, where no fault in that element is covered by any safety mechanism.
Single-point Failure: This failure results from a single-point fault.
Example
- An unsupervised resistor for which an open circuit has the potential to directly violate the safety goal.
- A fault in the external power supply can lead the MCU to behave in an unpredictable manner and directly lead to the violation of a safety goal. Therefore, faults related to supply voltages are treated as single-point faults.
Dual-Point Fault: An individual fault that, in combination with another independent fault, leads to a dual-point failure.
Dual-Point Failure: A failure resulting from a combination of two independent hardware faults that leads to the violation of a safety goal.
Examples
One fault affects a safety-related element, another fault affects the corresponding safety mechanism, and the combined effect of these faults leads to a safety goal violation.
- Consider an HVDCDC system where the primary over-voltage protection comparator is stuck in the “OK” state due to an internal analog failure. At the same time, the output voltage sensing resistor has aged, leading to incorrect voltage feedback.
- As the comparator is stuck and the feedback signal is incorrect, the controller fails to detect the output over-voltage condition, which leads to a safety goal violation.
Multiple-Point Fault: An individual fault that, in combination with other independent faults if undetected and not perceived, could lead to a multiple-point failure.
Multiple-Point Failure: A failure, resulting from the combination of several independent hardware faults, which leads directly to the violation of a safety goal.
Example
- In a brake-by-wire system, a biased signal from the brake pedal position sensor can occur due to a sensor fault.
- At the same time, a software logic error may cause the plausibility check between redundant pedal sensors to fail.
- Due to the combination of these two faults, the incorrect brake demand is not detected by the system.
- As a result, braking assistance can be reduced or delayed, which may lead to a hazardous situation.
- This scenario represents a multiple-point failure caused by a sensor fault combined with a safety mechanism failure.
Latent Fault: A multiple-point fault whose presence is neither detected by a safety mechanism nor perceived by the driver within the multiple-point fault detection time interval.
Example
- A fault in the window watchdog can disable its ability to detect and control microcontroller failure modes.
- If this fault is not detected by any safety mechanism (for example, the watchdog startup test) and is not perceived by the driver, it is considered a latent multiple-point fault.
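The SPFM and LFM metrics mentioned at the start of this section are ratios of failure rates. The sketch below follows the usual ISO 26262 Part 5 formulation: SPFM is one minus the share of single-point and residual failure rates in the total, and LFM is one minus the share of latent multiple-point failure rates in what remains after removing single-point and residual faults. The element names and FIT values are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ElementRates:
    """Failure rates of one safety-related hardware element, in FIT."""
    name: str
    total: float         # total failure rate of the element
    single_point: float  # portion classified as single-point faults
    residual: float      # portion classified as residual faults
    latent: float        # portion classified as latent multiple-point faults


def spfm(elements: List[ElementRates]) -> float:
    """Single-point fault metric: 1 - sum(SPF + RF) / sum(total)."""
    total = sum(e.total for e in elements)
    spf_rf = sum(e.single_point + e.residual for e in elements)
    return 1.0 - spf_rf / total


def lfm(elements: List[ElementRates]) -> float:
    """Latent fault metric: 1 - sum(latent) / sum(total - SPF - RF)."""
    total = sum(e.total for e in elements)
    spf_rf = sum(e.single_point + e.residual for e in elements)
    latent = sum(e.latent for e in elements)
    return 1.0 - latent / (total - spf_rf)


# Hypothetical elements and rates, chosen only to make the arithmetic visible.
elements = [
    ElementRates("sense resistor", total=2.0, single_point=0.0, residual=0.2, latent=0.1),
    ElementRates("comparator", total=10.0, single_point=0.0, residual=0.5, latent=1.0),
    ElementRates("microcontroller", total=88.0, single_point=0.0, residual=0.3, latent=0.9),
]

print(f"SPFM = {spfm(elements):.2%}")
print(f"LFM  = {lfm(elements):.2%}")
```

Both metrics improve when faults move out of the single-point, residual, or latent categories, which is why the classification of each fault matters before any numbers are computed.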
Detected, Perceived, Safe Faults
This classification helps determine diagnostic coverage and supports hardware safety metric calculations in ISO 26262 projects.
Detected fault: A fault whose presence is detected within a prescribed timeframe by a safety mechanism.
Example
- Suppose an ADC is used to measure the 12 V battery input. Due to an internal ADC fault, the measured voltage becomes stuck at a constant value (for example, 12.0 V), even though the actual battery voltage changes.
- If a safety mechanism such as a plausibility check is implemented, which compares the ADC measurement with an independent reference, the fault can be detected when the ADC value remains constant beyond the allowed time window.
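A plausibility check of this kind can be sketched as follows. This is a simplified illustration, not a production diagnostic: the class name, the tolerance, and the sample-count window are assumptions, and the independent reference is simply passed in as a second voltage value.

```python
class AdcPlausibilityMonitor:
    """Flags a fault when the ADC disagrees with an independent reference
    for too many consecutive samples (hypothetical example class)."""

    def __init__(self, tolerance_v: float, max_bad_samples: int):
        self.tolerance_v = tolerance_v          # allowed deviation in volts
        self.max_bad_samples = max_bad_samples  # allowed time window, in samples
        self._bad_count = 0
        self.fault_detected = False

    def update(self, adc_v: float, reference_v: float) -> bool:
        """Process one sample pair and return the current fault status."""
        if abs(adc_v - reference_v) > self.tolerance_v:
            self._bad_count += 1
        else:
            self._bad_count = 0  # readings agree again, restart the window
        if self._bad_count >= self.max_bad_samples:
            self.fault_detected = True
        return self.fault_detected


# The ADC is stuck at 12.0 V while the real battery voltage rises.
monitor = AdcPlausibilityMonitor(tolerance_v=0.5, max_bad_samples=3)
reference_trace = [12.0, 12.2, 12.6, 13.0, 13.4, 13.8]
results = [monitor.update(adc_v=12.0, reference_v=ref) for ref in reference_trace]
print(results)
```

The fault counts as detected only once the deviation has persisted for the whole window, which is what keeps a single noisy sample from triggering the diagnostic.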
Perceived fault: A fault that may be perceived indirectly by the driver (through deviating behavior at the vehicle level).
Example
- Consider a Battery Management System (BMS) with an initial fault in the form of degraded coolant pump performance. This degradation leads to insufficient cooling, causing the battery module temperature to rise.
- As a result, the temperature sensor readings approach their operational limits, which in turn affects the internal resistance estimation performed by the BMS. Based on this estimation, the BMS gradually degrades the allowable discharge current to protect the battery.
- This current limitation results in a noticeable loss of vehicle acceleration and reduced driving range.
In this case, the driver’s perception would be that the EV feels weak and its range has dropped. While the root cause (coolant pump degradation) is not known to the driver, the fault is perceived through the degraded behavior.
Safe fault: A fault whose occurrence will not significantly increase the probability of violation of a safety goal. A fault is considered safe when no hazardous behavior occurs because of its presence.
Example
- The diagnostic LED driver inside the DC-DC ECU fails such that the LED indicates “converter ON,” while the actual converter output is already OFF.
- This fault only affects the indication; the converter’s functional and safety paths remain unaffected.
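The categories covered so far can be arranged as a sequence of questions asked about each fault. The sketch below is a simplified reading of that sequence, loosely modeled on the classification flow used for hardware metrics; the function, its parameter names, and the boolean inputs are all assumptions made for illustration, and a real analysis works with failure-mode fractions rather than yes/no answers.

```python
from enum import Enum


class FaultClass(Enum):
    SAFE = "safe fault"
    SINGLE_POINT = "single-point fault"
    RESIDUAL = "residual fault"
    DETECTED = "detected multiple-point fault"
    PERCEIVED = "perceived multiple-point fault"
    LATENT = "latent multiple-point fault"


def classify_fault(
    *,
    can_violate_goal_directly: bool,
    element_has_safety_mechanism: bool = False,
    fault_is_covered: bool = False,
    can_contribute_with_other_fault: bool = False,
    detected: bool = False,
    perceived: bool = False,
) -> FaultClass:
    """Simplified classification of one fault (hypothetical helper)."""
    if can_violate_goal_directly:
        if not element_has_safety_mechanism:
            return FaultClass.SINGLE_POINT
        if not fault_is_covered:
            return FaultClass.RESIDUAL
        # Covered fault: a violation now needs a second, independent fault
        # (for example in the safety mechanism), so treat it as multiple-point.
        can_contribute_with_other_fault = True
    if not can_contribute_with_other_fault:
        return FaultClass.SAFE
    if detected:
        return FaultClass.DETECTED
    if perceived:
        return FaultClass.PERCEIVED
    return FaultClass.LATENT


# Examples mirroring the ones in this article.
unsupervised_resistor = classify_fault(can_violate_goal_directly=True)
uncovered_short_to_high = classify_fault(
    can_violate_goal_directly=True, element_has_safety_mechanism=True
)
silent_watchdog_fault = classify_fault(
    can_violate_goal_directly=False, can_contribute_with_other_fault=True
)
indicator_led_fault = classify_fault(can_violate_goal_directly=False)

for name, result in [
    ("unsupervised resistor open", unsupervised_resistor),
    ("uncovered short-to-high", uncovered_short_to_high),
    ("undetected watchdog fault", silent_watchdog_fault),
    ("diagnostic LED fault", indicator_led_fault),
]:
    print(f"{name}: {result.value}")
```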
Permanent Fault
This section describes faults caused by physical damage or hardware degradation.
Permanent fault: A fault that occurs and stays until removed or repaired.
Example
- Suppose an output voltage feedback resistor is damaged, so the ADC always reads an incorrect voltage.
- The safety monitor detects an implausible voltage continuously, so the system disables the DC-DC converter and latches the fault.
- This fault remains in the system until the hardware is repaired or replaced. A power cycle may clear the latch, but the monitor detects the fault again as long as the damaged resistor is in place.
Transient Fault
Transient fault: This is a fault that occurs once and subsequently disappears. Transient faults can appear due to electromagnetic interference.
Example
- Strong electromagnetic interference can lead to bit flips, for example a temporary bit error on the CAN bus.
- One or two CAN frames are corrupted, but communication returns to normal in the next cycle.
- No hardware is permanently damaged.
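Diagnostics often need to tell a transient disturbance from a permanent fault before latching anything. One common approach, which is neither prescribed by ISO 26262 nor described in this article, is a debounce counter: a fault is latched only after a check has failed for several consecutive cycles. The class name and threshold below are hypothetical.

```python
class FaultQualifier:
    """Latches a fault after a number of consecutive failed checks."""

    def __init__(self, latch_threshold: int):
        self.latch_threshold = latch_threshold
        self._consecutive_failures = 0
        self.latched = False

    def update(self, check_failed: bool) -> str:
        """Process one diagnostic cycle and return the qualifier state."""
        if self.latched:
            return "latched"  # stays latched until repair or reset
        if check_failed:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.latch_threshold:
                self.latched = True
                return "latched"
            return "suspected"
        self._consecutive_failures = 0  # disturbance disappeared on its own
        return "ok"


# Transient: two corrupted CAN frames, then communication recovers.
transient = FaultQualifier(latch_threshold=5)
transient_states = [transient.update(f) for f in [False, True, True, False, False]]

# Permanent: a damaged feedback resistor fails the check in every cycle.
permanent = FaultQualifier(latch_threshold=5)
permanent_states = [permanent.update(True) for _ in range(6)]

print(transient_states)
print(permanent_states)
```

The transient case returns to "ok" without any latch, while the permanent case reaches the threshold and stays latched.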
Dependent Failures
Failures that are not statistically independent, i.e., the probability of the combined occurrence of the failures is not equal to the product of the individual probabilities of occurrence of the failures considered.
Dependent failures include common cause failures and cascading failures.
Whether a given failure is a cascading failure or a common cause failure may depend on the hierarchical structure of the elements.
Common Cause Failures
A failure of two or more elements of an item resulting directly from a single specific event or root cause which is either internal or external to all of these elements.
Note: Common cause failures are dependent failures that are not cascading failures.
Example
- Suppose an HVDCDC system has two independent voltage monitoring paths:
  - Main MCU ADC (control path)
  - Independent safety monitor IC (safety path)
- Both are intended to independently detect output over-voltage, but both monitoring paths use the same 5 V reference.
Suppose the shared 5 V reference supply becomes unstable due to a PCB solder crack or regulator degradation. The ADC readings in the MCU and the safety monitor then shift in the same direction, and an over-voltage condition is not detected by either path.
Here, the redundancy is defeated by a single common cause.
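The effect of a common cause on the "product of probabilities" rule can be shown with a few lines of arithmetic. The sketch assumes a deliberately simple model: each monitoring path can fail on its own, and a shared event (such as the unstable reference) makes both fail at once. The probabilities are invented and only serve to show the order of magnitude.

```python
def marginal_failure_probability(p_own: float, p_common: float) -> float:
    """Probability that one path fails, from its own cause or the shared one."""
    return p_common + (1.0 - p_common) * p_own


def joint_failure_probability(p_a: float, p_b: float, p_common: float) -> float:
    """Probability that both paths fail.

    Either the shared event occurs (both fail together), or it does not
    and both paths fail independently.
    """
    return p_common + (1.0 - p_common) * p_a * p_b


# Illustrative values only (per some fixed mission interval).
p_adc_path = 1e-3      # main MCU ADC path fails on its own
p_monitor_path = 1e-3  # safety monitor IC path fails on its own
p_shared_ref = 1e-4    # shared 5 V reference fails

joint = joint_failure_probability(p_adc_path, p_monitor_path, p_shared_ref)
product = (
    marginal_failure_probability(p_adc_path, p_shared_ref)
    * marginal_failure_probability(p_monitor_path, p_shared_ref)
)

print(f"P(both fail), with shared reference: {joint:.3e}")
print(f"Product of individual probabilities: {product:.3e}")
```

Under these assumptions the joint probability is dominated by the shared reference and is far larger than the product of the individual probabilities, which is exactly what makes the two failures dependent.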
Cascading Failures
A failure of an element of an item resulting from a root cause [inside or outside of the element] and then causing a failure of another element or elements of the same or different item.
Example
- Consider an initial fault where the camera lens is partially obstructed by dirt or glare. This reduces lane detection confidence.
- The fault propagates further, causing the lane model to become unstable and leading to oscillations in steering correction.
- As a result, the system disables the Lane Keeping Assist (LKA) function. The driver perceives a sudden loss of lane-keeping assistance. In this case, a small environmental fault cascades into a loss of function.
Final Thoughts
This blog emphasizes the importance of understanding fault terminologies, with examples that simplify complex concepts and provide a clear view of how different failures occur.
By understanding these fault categories, it becomes evident that a single-point fault is more critical than dual-point or multiple-point faults, as it can directly lead to a safety goal violation.
Latent faults are particularly dangerous because they are not detected by diagnostics and can silently disable safety mechanisms until another fault occurs.
At runtime, random hardware faults are more critical due to their unpredictable nature, whereas during development, systematic faults are more critical since they can be present across all units and repeatedly violate safety goals.
Among dependent failures, common cause failures are the most critical, as multiple elements can fail simultaneously due to the same root cause, potentially defeating redundancy in the system.
Finally, a clear understanding of ISO 26262 fault classifications is essential for performing effective safety analysis, defining robust diagnostics, and ensuring reliable ASIL compliance across the automotive safety lifecycle.




