SCADA Alarm Analysis: Tips & Standards for Mastery

Jason Israelsen & Alyssa
October 22, 2024

Understanding Alarm System Overload

  • The high frequency of alarms in supervisory control and data acquisition (SCADA) systems can lead to desensitization and complacency among operators, reducing the effectiveness of the alarms in signaling genuine emergencies.
  • To maintain a healthy alarm system, a holistic and proactive approach is recommended, including regular assessment, multidisciplinary review, targeted action, and continuous re-evaluation to ensure operational efficiency and safety.

Alarm Fatigue: A Hidden Threat to Operational Efficiency

While acquainting themselves with a new plant’s operations, a controls engineer delved into the alarm summary to gather insights. What they discovered was staggering: The system was inundated with hundreds of alarms. Some were being triggered more than 50 times a day while others remained in a constant state of alarm for weeks at a time.

Further inquiry revealed this scenario was considered normal within the day-to-day operations of the plant. The plant staff were comfortable with the state of their system and the volume of alerts.

However, as an outsider, the controls engineer felt overwhelmed and struggled to effectively navigate the system. The disconnect between experiences prompted the engineer to consider how to bridge this gap in understanding and propose a feasible solution to address the issue.

The Role of Alarms in SCADA Systems

Alarming serves a critical function within a SCADA system, alerting operations (either audibly or visibly) to items that need to be addressed, such as process deviation, abnormal conditions, or equipment malfunction. However, the function of an alarm system is often compromised when high frequencies of alarms occur, leading to habituation, desensitization, and complacency among operators. This overstimulation diminishes the significance of alarms and hinders the system’s ability to bring attention to genuine emergencies.

The classic tale “The Boy Who Cried Wolf” offers an analogy for the consequences of a poorly performing alarm system. In the story, the boy repeatedly raises false alarms with the townspeople. Over time, the people become indifferent to the alarms. Consequently, when a real wolf threatens the town, the people no longer heed the boy’s cries and the sheep suffer the consequences. Unfortunately, comparable situations are not uncommon in SCADA systems, where operators’ desensitization to authentic alarms endangers operations and threatens safety.

Figure 1 (Alarm Counts Priority Distribution): A sample distribution of alarm priorities. The bar chart shows the distribution in percentages and the table shows it by counts. Note that in this sample the distribution is weighted toward the higher-priority alarms (“HIGH” and “MEDIUM”) rather than the lower-priority alarms (“LOW” and “INFO”).

How to Evaluate SCADA Alarm System Health

Because the scenario described above is unfortunately common in SCADA systems, alarm system health is best approached holistically.

Crises can be averted by conducting an objective assessment of an alarm system’s state and addressing areas of concern. This proactive stance allows users to identify potential issues before they escalate, enabling timely intervention and effective implementation.

Building an Effective Alarm Management Process

Through systematic evaluation and targeted action, organizations can mitigate risks and ensure continued safety, efficiency and reliability of critical industrial processes. With a well-defined process, an unbiased perspective of alarm system health can be achieved.

This four-step process identifies, addresses and maintains the health of an alarm system:

Step 1: Analyze Alarms with Industry Standards

Analysis: Use standards to measure the alarm system’s health. This serves as a baseline for the evaluation, removing subjectivity.

Step 2: Collaborate Across Departments

Review: Review the results with a multidisciplinary group (e.g., programmers, operations personnel, engineers, etc.). During this phase:

  • Prioritize action: Focus on addressing a manageable subset of alarms, rather than attempting to tackle all issues simultaneously.
  • Form solutions: There isn’t a single solution to alarm system problems. It’s likely a combination of approaches will be necessary.

Step 3: Implement Targeted Solutions

Act: Implement solutions based on the results of the alarm analysis and review steps.

Step 4: Repeat for Continuous Improvement

Repeat: Run the analysis, review and act steps again, and keep repeating the cycle for ongoing improvement and maintenance of alarm system health.

Keep In Mind:

Regarding this process, it’s important to keep in mind the following:

  • The value of a healthy alarm system includes a reduced operator workload, increased responsiveness to urgent alarms and improved overall system performance.
  • Each alarm system is unique, varying in size, complexity, personnel and cohesion, which affects how easy or difficult each step will be.
  • If the alarm system is in a critical state, achieving and maintaining system health may require regular attention and involvement from the control systems team. It’s important to acknowledge that the system’s health didn’t deteriorate overnight and that restoring it to an acceptable level requires time and effort.

Measuring Alarm Performance Using Key Metrics

An effective way to assess the health of an alarm system is to measure it against a standard. This helps remove subjectivity and mitigate habituation, also known as “tuning out.” One such standard for alarm systems is IEC 62682, a set of guidelines developed by the International Electrotechnical Commission (IEC, 2022).

Figure 2 (Alarm Floods): A sample of alarm floods occurring in a system, where the size of each bubble represents the number of alarms in the flood and the height of the bubble indicates how long the flood lasted. Note the frequency of the floods, the number of “large” floods and the percentage that last longer than an hour.

This standard offers recommendations for the design, implementation, operation and management of industrial alarm systems. It sets out principles for alarm management covering aspects such as design, prioritization and documentation, with the aim of enhancing safety, efficiency and situational awareness in industrial settings. Adherence to IEC 62682 helps establish best practices for alarm systems, providing direction on items such as the following:

Priority Distribution: Operational Desensitization

  • Priority distribution (see Alarm Counts Priority Distribution figure)
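
As a rough illustration of this measurement, the sketch below computes each priority's share of the alarm count and checks whether higher priorities occur more often than lower ones. The priority names and the sample data are assumptions made for the example; the actual target percentages should come from the standard and the site's own alarm philosophy.

```python
from collections import Counter

# Assumed priority names, ordered from most to least urgent.
PRIORITY_ORDER = ["HIGH", "MEDIUM", "LOW"]

def priority_distribution(priorities):
    """Return each priority's share of the total alarm count."""
    counts = Counter(priorities)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}

def is_inverted(shares, order=PRIORITY_ORDER):
    """True when a more urgent priority occurs more often than a less urgent one."""
    ordered = [shares.get(name, 0.0) for name in order]
    return any(a > b for a, b in zip(ordered, ordered[1:]))

# Hypothetical sample resembling the top-heavy distribution described in the text.
sample = ["HIGH"] * 50 + ["MEDIUM"] * 30 + ["LOW"] * 20
shares = priority_distribution(sample)
print(shares)
print("Inverted:", is_inverted(shares))
```

A report like this gives the review group a number to discuss rather than an impression.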

Alarm Floods: Identifying Critical Patterns

  • Maximum number of alarms in a period of time
  • Acceptable duration for “flood” conditions (see Alarm Floods figure)
  • The amount of “chattering” and “fleeting” alarms
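
One way to find flood conditions in an alarm journal is sketched below. It buckets alarm timestamps into fixed windows and merges consecutive over-threshold windows into a single flood. The 10-minute window and the threshold of more than 10 alarms per window are assumptions for illustration and should be confirmed against IEC 62682 and the site's alarm philosophy; the function name and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)  # assumed window length
FLOOD_THRESHOLD = 10            # assumed: a window with more than 10 alarms is a flood

def find_floods(timestamps, start):
    """Return (flood_start, flood_end, alarm_count) for each run of flooded windows."""
    # Count alarms per window; the window index is the whole number of
    # windows elapsed since `start`.
    counts = Counter((t - start) // WINDOW for t in timestamps)
    runs = []
    current = None  # [first_window, last_window, alarm_count]
    for idx in sorted(counts):
        if counts[idx] <= FLOOD_THRESHOLD:
            continue
        if current and idx == current[1] + 1:
            # Adjacent flooded window: extend the current flood.
            current[1] = idx
            current[2] += counts[idx]
        else:
            if current:
                runs.append(current)
            current = [idx, idx, counts[idx]]
    if current:
        runs.append(current)
    return [(start + a * WINDOW, start + (b + 1) * WINDOW, n) for a, b, n in runs]

# Hypothetical journal: two busy windows back to back, then a quiet period.
start = datetime(2024, 1, 1)
stamps = (
    [start + timedelta(seconds=30 * i) for i in range(12)]
    + [start + timedelta(minutes=10, seconds=30 * i) for i in range(15)]
    + [start + timedelta(minutes=40 + i) for i in range(3)]
)
floods = find_floods(stamps, start)
for begin, end, count in floods:
    print(begin, "to", end, "-", count, "alarms")
```

Fixed windows keep the sketch short. A sliding window would catch floods that straddle a window boundary, at the cost of more bookkeeping.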

Frequent Alarms (Bad Actors): Finding the Worst Offenders

  • Allowable percentage of total alarms attributable to the most frequent alarms (see Alarm Bad Actors figure)

Figure 3 (Alarm Bad Actors): A sample of the top 10 most frequent alarms in a system by percentage (also known as “Bad Actors”). Note that in this sample the top 10 most frequent alarms account for 50% of the alarms in the system. This is a problem.
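
The bad-actor share can be computed directly from a list of alarm occurrences, as in the minimal sketch below. The tag names and counts are invented for the example, and `top_n_share` is a hypothetical helper rather than a function from any SCADA product.

```python
from collections import Counter

def top_n_share(tags, n=10):
    """Return the n most frequent alarms and their combined share of all alarms."""
    counts = Counter(tags)
    total = sum(counts.values())
    top = counts.most_common(n)
    share = sum(count for _, count in top) / total if total else 0.0
    return top, share

# Hypothetical journal: two noisy alarms plus 50 alarms that each occurred once.
journal = (
    ["LT-101 HI"] * 40
    + ["PT-202 LO"] * 10
    + [f"TAG-{i}" for i in range(50)]
)
top, share = top_n_share(journal, n=2)
for tag, count in top:
    print(f"{tag}: {count}")
print(f"Top 2 share of all alarms: {share:.0%}")
```

Sorting by count also gives the review group a ready-made work list, starting with the worst offender.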

Analyzing Results: Multidisciplinary Review

With the alarm analysis process outlined, the next step is reviewing the results, which should be conducted with a multidisciplinary group. The group should include programmable logic controller (PLC) programmers, human-machine interface (HMI) developers, engineers, operators and others who could provide a valuable perspective.

The group’s perspectives, expertise and insights lead to more thorough results compared to individual efforts.

As listed above, the analysis results are broken down into several measurable categories. The group should set priorities based on need. Smaller, more frequent meetings tend to be more effective than trying to tackle the whole task at once. Keep in mind there isn’t a single solution to every alarm problem. Different alarm issues will require different approaches and solutions to create a healthy alarm system.

Implementing Effective Solutions

The next crucial step is to translate the insights gathered into actionable strategies and ensure the identified concerns are effectively addressed. The actions taken in this stage will vary, but the examples below offer guidance and inspiration.

The following examples pair commonly found deficiencies with the actions taken to resolve them:

Priority distribution

  • Issue: The distribution of alarm severity was inverted from the IEC recommendations (IEC, 2022). Analysis showed that, from most to least frequent, the alarms ranked high, medium and low. For a proper distribution, low-priority alarms should occur the most and high-priority alarms the least.
  • Resolution: The alarm priorities were reviewed and a new standard of alarm priority categorization was established. The new categorization focuses on a simple “need to respond in ‘x’ minutes” metric and was developed during the review based on need. The alarm review process consisted of training on alarm prioritization, a discussion of alarm priorities and input from various supervisors and operators on re-prioritizing.
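
A “need to respond in ‘x’ minutes” rule can be expressed as a small lookup, sketched below. The cutoff values and priority names are hypothetical placeholders; in practice the review group sets them.

```python
# Hypothetical cutoffs: (maximum response time in minutes, priority).
PRIORITY_CUTOFFS = [(5, "HIGH"), (30, "MEDIUM"), (120, "LOW")]

def priority_for(response_minutes):
    """Map the time an operator has to respond to an alarm priority."""
    for cutoff, priority in PRIORITY_CUTOFFS:
        if response_minutes <= cutoff:
            return priority
    # Nothing needs a response within the longest cutoff: informational only.
    return "INFO"

for minutes in (2, 30, 60, 600):
    print(minutes, "min ->", priority_for(minutes))
```

Keeping the rule this simple makes it easy for operators and supervisors to apply consistently when re-prioritizing existing alarms.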

Bad actors

  • Issue: The top 10 worst offenders accounted for more than 75% of the alarms. By comparison, the IEC standard indicates the share should be in the realm of 1 to 5% (IEC, 2022).
  • Resolution: After identifying and researching the worst offenders, a plan was made to address each alarm. Solutions included adjusting alarm setpoints and deadbands and creating an “info” category.
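
To show why a deadband adjustment can quiet a bad actor, the sketch below evaluates a high alarm against a signal hovering near its setpoint, with and without a deadband. The setpoint, deadband and readings are made-up values, and real PLC or SCADA alarm blocks may implement the logic differently.

```python
def high_alarm_states(values, setpoint, deadband):
    """Alarm sets above the setpoint and clears only below setpoint - deadband."""
    active = False
    states = []
    for value in values:
        if not active and value > setpoint:
            active = True
        elif active and value < setpoint - deadband:
            active = False
        states.append(active)
    return states

def activations(states):
    """Count transitions from inactive to active."""
    previous = [False] + states
    return sum(1 for prev, cur in zip(previous, states) if cur and not prev)

# Hypothetical level readings hovering around a setpoint of 80.
readings = [79, 81, 79.5, 80.5, 79.8, 81, 77, 79]
without_deadband = activations(high_alarm_states(readings, setpoint=80, deadband=0))
with_deadband = activations(high_alarm_states(readings, setpoint=80, deadband=2))
print("Activations without deadband:", without_deadband)
print("Activations with deadband of 2:", with_deadband)
```

In this sample the same process behavior produces three alarms without a deadband and one with it, which is the kind of reduction a deadband review is looking for.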

Stale alarms

  • Issue: Several alarms were identified that had been in an active state for weeks or months.
  • Resolution: Many of these alarms came from equipment that had been down for an extended period due to construction, maintenance or failure. An “out of service” state was created to disable alarms from equipment not currently in use.
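
A stale-alarm report with an “out of service” exclusion might look like the sketch below. The 24-hour threshold, the data layout and the equipment names are assumptions for illustration only.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=24)  # assumed threshold for calling an alarm stale

def stale_alarms(active_alarms, now, out_of_service=frozenset()):
    """List alarms active longer than STALE_AFTER, skipping out-of-service equipment.

    active_alarms maps an alarm tag to (equipment name, activation time).
    """
    return sorted(
        tag
        for tag, (equipment, activated_at) in active_alarms.items()
        if equipment not in out_of_service and now - activated_at > STALE_AFTER
    )

# Hypothetical snapshot of currently active alarms.
now = datetime(2024, 10, 22, 12, 0)
active = {
    "PUMP-3 FAIL": ("PUMP-3", datetime(2024, 10, 1, 8, 0)),
    "TANK-1 LEVEL HI": ("TANK-1", datetime(2024, 10, 22, 11, 30)),
    "BLOWER-2 FAULT": ("BLOWER-2", datetime(2024, 9, 15, 0, 0)),
}
print(stale_alarms(active, now))
print(stale_alarms(active, now, out_of_service={"BLOWER-2"}))
```

Marking BLOWER-2 as out of service removes its alarm from the report without deleting the alarm itself, so it returns when the equipment does.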

Re-analysis

  • Issue: Over time, new processes, seasonal changes and shifting demands on equipment significantly change how often alarms occur, so the previously established alarm categorization no longer fits.
  • Resolution: A regular schedule was established to analyze the health of the alarm system. Each scheduled analysis is intended to produce an actionable plan that addresses alarms in manageable, bite-sized amounts. An unexpected benefit was that many obsolete alarms, left behind by changes to processes and equipment, were found and removed during this maintenance process. The re-analysis provided a means of reviewing and identifying these obsolete points.

Elevating Safety and Efficiency: The Vital Role of Alarm System Health

Alarm system health is crucial for the efficient and safe operation of industrial processes controlled by SCADA systems. Adopting a holistic approach to assessing, reviewing and acting upon alarm system issues helps organizations to mitigate risks and ensure the continued reliability of critical processes.

Leveraging standards such as IEC 62682 provides a framework for evaluating alarm system performance and identifying areas for improvement. Collaboration within multidisciplinary teams fosters diverse perspectives and leads to more comprehensive solutions.

Continuous monitoring and periodic re-evaluation are essential for maintaining alarm system health over time. By following a structured process and remaining proactive in addressing alarm system challenges, organizations can enhance operational efficiency, promote safety, and safeguard workers and facilities against potential disruptions.
