What is FMEA?
Failure Mode and Effects Analysis (FMEA) is a bottom-up analysis approach. It is a systematic method for identifying potential failures in a design or process, evaluating how those failures could affect the outcome, and determining what actions are required to reduce their likelihood or impact.
The Importance of FMEA
When FMEA is not used, many problems go unnoticed until the product is assessed, and some appear only after the product has been released. The cost of developing countermeasures in later phases is much higher than in the earlier stages.
Serious design flaws can severely harm a company’s reputation or cause massive losses, while minor design issues can lead to customer dissatisfaction and delayed product launches or releases, all of which burden the company.
By using FMEA early in product development, however, it is possible to spot problems in a system, process, or product and understand what could go wrong. It is all about staying ahead of issues by catching them early and devising ways to avoid them.
Types of FMEA
- System FMEA (SFMEA)
This FMEA is applied to the entire system to identify and evaluate potential failures at the system level.
- Design FMEA (DFMEA)
Design FMEAs are performed on products, systems, hardware, or software during the design phase to identify potential design failures. These FMEAs focus on individual components or subsystems. They help identify failure modes related to the design of specific parts.
- Process FMEA (PFMEA)
Process FMEAs are applied to manufacturing and assembly processes to identify potential failures before those processes are implemented.
Typically, the System FMEA is performed first to assess high-level system risks, followed by the Design FMEA, which focuses on detailed component-level failures, and finally the Process FMEA, which addresses risks in manufacturing and assembly. Each step goes one level deeper into the product lifecycle.
The AIAG & VDA FMEA Handbook (2019)
Is there a standard that outlines the FMEA process in a structured and detailed way? Yes, the AIAG & VDA FMEA Handbook (2019) is the most widely recognized standard for Failure Mode and Effects Analysis (FMEA).
It was jointly developed by the Automotive Industry Action Group (AIAG) and the German Association of the Automotive Industry (VDA). The handbook outlines a 7-step approach, which we will explore below. It covers System FMEA, Design FMEA, Process FMEA, and the Action Priority (AP) concept, which replaces the traditional Risk Priority Number (RPN) approach.
FMEA: The 7-step Approach
Step 1 – Planning and Preparation
Define:
- Scope – Outline the specific tasks, deliverables, timelines, and responsibilities.
- Objectives – Define the goals and objectives to achieve.
- Team – Allocate people to the tasks. The FMEA moderator should lead the FMEA together with a cross-functional team.
- System or Process – Understand the system or process that needs to be analyzed.
Step 2 – Structure Analysis
- Break down the system into its elements.
- This helps map the interactions and boundaries of all components.
Step 3 – Function Analysis
- List out the functions of each element.
Step 4 – Failure Analysis
- Determine the different failure modes for each function.
- Determine the associated causes and effects of each failure.
Step 5 – Risk Analysis
- Rate Severity, Occurrence, and Detection by analyzing the effects of each failure and its probable causes.
- Calculate the Risk Priority Number (RPN) from the S, O, and D values for every failure.
Here is the formula:
RPN = S × O × D
In the latest AIAG & VDA handbook, the RPN method has been replaced by Action Priority (AP).
Severity, Occurrence and Detection
- Severity (S): How severe is the impact of the failure?
- Occurrence (O): How frequently does the failure occur?
- Detection (D): How likely is it that the failure will be detected?
NOTE: The Severity (S), Occurrence (O), and Detection (D) rating tables are provided in the AIAG & VDA FMEA Handbook.
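The RPN calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming each rating is an integer on a 1 to 10 scale; the function name `risk_priority_number` is made up for this example and does not come from the handbook or any tool.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """Return RPN = S x O x D for ratings on an assumed 1-10 scale."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        # Reject anything outside the assumed rating scale.
        if not isinstance(value, int) or not 1 <= value <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10, got {value!r}")
    return severity * occurrence * detection


print(risk_priority_number(7, 3, 4))  # 7 * 3 * 4 = 84
```

Under this assumption the RPN ranges from 1 to 1000, with higher values indicating higher risk.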
Step 6 – Optimization
- Identify the existing prevention and detection controls.
- Depending on the risk rating (RPN or AP), the existing measures may be sufficient to mitigate the failures; otherwise, additional prevention and detection actions may be required.
- Define and implement actions to enhance the design, process, and detection methods to reduce risk.
Step 7 – Results Documentation
- Documentation includes recording the findings, decisions, and outcomes of the implemented actions.
- This provides clear traceability throughout the FMEA process and a valuable reference for future reviews, improvements, audits, or related projects.
The FMEA is a living document, so it needs to be updated whenever the design changes or new issues arise.
Why Action Priority replaced the Risk Priority Number (RPN)
Various combinations of S, O, and D can produce the same RPN value even though the risk levels differ. The RPN therefore does not always reflect how serious a problem is: a severe failure may not be flagged as critical if its occurrence or detection ratings are low. The RPN also does not explain what action to take for a given score, which can lead to inconsistent risk-management decisions.
On the other hand, Action Priority puts more focus on Severity, making sure that the more serious risks are not overlooked. It also gives clearer guidance on when and how to act. It helps avoid misleading risk rankings that can result from simple multiplication of severity, occurrence, and detection scores.
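The ambiguity of the RPN is easy to demonstrate. The short Python sketch below, assuming ratings on a 1 to 10 scale, lists every S, O, D combination that multiplies to the same RPN of 180. For instance, a failure rated S=10, O=2, D=9 and one rated S=2, O=9, D=10 both score 180, although the first is far more severe. The variable names are illustrative only.

```python
from itertools import product

TARGET_RPN = 180

# Every (S, O, D) triple on an assumed 1-10 scale whose product equals the target.
same_rpn = [
    (s, o, d)
    for s, o, d in product(range(1, 11), repeat=3)
    if s * o * d == TARGET_RPN
]

# Sort by Severity, highest first, to show how far apart the underlying risks are.
same_rpn.sort(key=lambda triple: triple[0], reverse=True)

print(f"{len(same_rpn)} combinations give RPN = {TARGET_RPN}")
print("highest severity:", same_rpn[0])
print("lowest severity: ", same_rpn[-1])
```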
FMEAs can be done entirely in Excel, but specialized tools such as APIS IQ-FMEA PRO make the whole process easier. Here are a few snapshots of the tool.
[Figure: Function Net]
[Figure: Failure Net]
Here is an example to gain some practical insight.
Consider an Automotive Battery Management System (BMS). While it performs several functions, we will focus on analyzing just one function to illustrate the concept.
Function: Voltage sensing
Failure: Wrong voltage reading
Potential effects: Cell damage, thermal runaway, overcharging or undercharging of cells
Potential causes: Sensor drift, faulty connection
Current detection controls: Redundancy, calibration checks
Once these details are gathered and analyzed, the next step is to assign the Severity, Occurrence, and Detection ratings using the tables in the AIAG & VDA handbook.
The Severity Rating Assigned is 9 (High Severity).
Reasons behind the High Severity rating of 9:
- Voltage sensing directly impacts cell protection.
- An incorrect voltage reading can cause overcharging, which can lead to a safety-critical event such as thermal runaway, while undercharging leads to cell degradation, warranty issues, and reduced driving range.
- Although this failure is very serious, safety features such as backup sensors and hardware cutoffs limit how dangerous it can become. That is why it is rated 9, not 10.
The Occurrence Rating Assigned is 4 (Moderate).
Reasons behind the Moderate rating of 4:
- Voltage sensor failures or drift do not occur frequently. When they do, they are typically caused by errors in the Analog-to-Digital Converter (ADC), sensors slowly losing accuracy over time, or loose or corroded connections.
- These issues can be reduced by using high-quality automotive parts and proper manufacturing practices. Regular maintenance and backup circuits help reduce risk, so the likelihood of failure is considered moderate.
The Detection Rating Assigned is 5 (Medium).
Reasons behind the Medium rating of 5:
- Some controls are in place, such as periodic calibration, cross-verification with other data (e.g., SOC estimation and current flow), and diagnostic trouble codes (DTCs) in the BMS.
- However, early-stage drift or silent failures may not be detected immediately, especially if the failure is gradual or within error tolerance limits.
- Hence, detection is not certain and is rated mid-range.
Here the RPN value is 9 × 4 × 5 = 180, but we need to determine the AP from the S, O, and D values for more accurate decision-making.
According to the Action Priority chart, when Severity is close to its maximum, even moderate Occurrence and Detection ratings can result in a high action priority.
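The voltage-sensing example can also be captured as a small data record, similar to one row of an FMEA worksheet. The Python sketch below is illustrative: the class `FmeaEntry` and its field names are assumptions made for this example, not the schema of the handbook or of any FMEA tool. The Action Priority is deliberately left empty, because it has to be looked up in the handbook's AP table, which is not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FmeaEntry:
    """One row of an FMEA worksheet (field names are illustrative)."""
    function: str
    failure_mode: str
    effects: List[str]
    causes: List[str]
    detection_controls: List[str]
    severity: int
    occurrence: int
    detection: int
    # To be filled in from the AIAG & VDA Action Priority table.
    action_priority: Optional[str] = None

    @property
    def rpn(self) -> int:
        """Legacy Risk Priority Number: S x O x D."""
        return self.severity * self.occurrence * self.detection


voltage_sensing = FmeaEntry(
    function="Voltage sensing",
    failure_mode="Wrong voltage reading",
    effects=["Cell damage", "Thermal runaway",
             "Overcharging or undercharging of cells"],
    causes=["Sensor drift", "Faulty connection"],
    detection_controls=["Redundancy", "Calibration checks"],
    severity=9,
    occurrence=4,
    detection=5,
)

print(voltage_sensing.rpn)  # 9 * 4 * 5 = 180
```

Keeping the ratings alongside the causes and controls makes it easy to re-evaluate the entry after recommended actions have been implemented.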
Next, we need to identify recommended actions that lower the Action Priority. Some examples are outlined below.
- Implement dual-redundant voltage sensors.
- Develop real-time voltage validation algorithms.
- Increase diagnostic monitoring frequency.
- Perform rigorous calibration and testing.
It is not necessary to implement all the options mentioned above. The ISO 26262 standard provides guidelines that help determine which safety measures are necessary and which ones are not required.
We can implement safety measures based on the system’s ASIL level. For higher ASIL levels, more advanced safety measures are required, while for lower ASIL levels, we can adopt simpler measures with reduced implementation complexity.