Redundancy for OPC

Early one morning, Mel Farnsworth was sitting in the control booth at the Hardy Automotive Parts assembly line, drinking his final cup of coffee before the end of the shift. Watching the line meter graph in his HMI console, he noticed that the yield and efficiency trends for Line 3 had dropped to zero. He looked down through the control-room window, but Line 3 seemed to be rolling right along. What was the problem?

The line was running smoothly, but Mel wasn’t getting the data he needed. Somewhere between the PLCs and his HMI display there was a data disconnect. Maybe it was a fieldbus problem, or a bad network connection. Perhaps it was caused by his OPC server, or possibly even his HMI system itself. Whatever the reason, since Mel’s data connection was a single chain, one break in the chain meant that he didn’t get his data. To minimize this kind of risk and ensure the highest possible availability, mission-critical systems often use redundancy.

What is Redundancy?

Redundancy in a process control system means that some or all of the system is duplicated, or redundant. The goal is to eliminate, as much as possible, any single point of failure. When a piece of equipment or a communication link goes down, a similar or identical component is ready to take over. There are three types of redundant systems, categorized by how quickly a replacement (or standby) can be brought online. These are cold standby, warm standby, and hot standby.

Cold standby implies that there will be a significant time delay in getting the replacement system up and running. The hardware and software are available, but may have to be booted up and loaded with the appropriate data. Picture the olden days of steam locomotives. The cold standby was the extra engine in the roundhouse that had to be fired up and brought into service.   Cold standby is not usually used for control systems unless the data changes very infrequently.

Warm standby has a faster response time, because the backup (redundant) system is always running and is regularly updated with a recent copy of the data set. When a failure occurs on the primary system, the clients (or a redundancy manager acting on their behalf) can disconnect from the failed system and connect instead to the backup system. This allows the system to recover fairly quickly (usually within seconds) and continue working. Some data will be lost during the disconnect/reconnect cycle, but warm standby can be an acceptable solution where some data loss can be tolerated.

Hot standby means that both the primary and secondary data systems run simultaneously, and both are providing identical data streams to the downstream client. The underlying physical system is the same, but the two data systems use separate hardware to ensure that there is no single point of failure. When the primary system fails, the switchover to the secondary system is intended to be completely seamless, or “bumpless”, with no data loss. Hot standby is the best choice for systems that cannot tolerate the data loss of a cold or warm standby system.

A Typical Redundant OPC System

What does redundancy look like in an OPC-based system? A typical scenario would have two OPC servers connected either to a single device or PLC, or possibly duplicate devices or PLCs. Those two OPC servers would then connect to some kind of OPC redundancy management software which, in turn, offers a single connection to the OPC client, such as an HMI. The redundancy manager is responsible for switching to the secondary OPC server when any problem arises with the data coming from the primary OPC server. This scenario creates a redundant data stream from the physical system all the way to the HMI.
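
To make the topology concrete, here is a minimal sketch that describes such a setup as configuration data. The host names and ProgIDs are placeholders invented for illustration, not references to any real product or site.

```python
# Illustrative only: host names and ProgIDs below are hypothetical.
redundant_opc_system = {
    "sources": [
        {"role": "primary",   "host": "plant-pc-1", "progid": "Vendor.OPCServer.1"},
        {"role": "secondary", "host": "plant-pc-2", "progid": "Vendor.OPCServer.1"},
    ],
    # The redundancy manager connects to both sources and exposes a single
    # OPC server interface of its own for the HMI to use.
    "redundancy_manager": {"host": "hmi-pc", "exposes": "Redundancy.Manager.1"},
    "client": {"type": "HMI", "connects_to": "Redundancy.Manager.1"},
}
```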

The most common use of redundancy for OPC is with OPC DA or UA, but it is possible to configure redundant OPC A&E systems as well. The principles are the same.  Sometimes, on large systems, it is necessary to configure multiple redundant pairs. Redundancy can also be configured over a network, using DCOM or OPC tunneling. For a networked configuration, the redundancy manager would normally reside on the OPC client machine, to minimize the number of potential points of failure.

Although cold or warm standby may be useful under some circumstances, typically an engineer or system integrator implementing a redundant OPC system is looking for hot standby. This is the most useful kind of redundancy in a process control system, and at the same time the most difficult to achieve. Let’s look a little more closely at that all-important task of the OPC redundancy manager in a hot-standby system—making the switch.

Making the Switch

Put simply, a hot-standby redundancy manager receives data from two identical inputs, and sends a single output to the OPC client. It is the redundancy manager’s job to determine at all times which of the two data streams is the best, and to switch from one to the other as soon as possible whenever the status changes. The switch can be triggered by a number of different kinds of events, as illustrated in the sketch after this list:

  • Single point value change – to or from a certain value, achieving a threshold, etc.
  • Single point quality change – for example, from “Good” to any other OPC quality.
  • Multiple item monitoring – if the quality or value of any point in a group goes bad.
  • Rate of change monitoring – if points change value more slowly than expected.
  • Network breaks and timeouts – checked with some kind of heartbeat mechanism.
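
As a rough illustration of how a redundancy manager might evaluate these triggers, here is a minimal Python sketch. The tag name, threshold, and timeout values are hypothetical; a real product would expose them as configuration.

```python
from dataclasses import dataclass
import time

@dataclass
class Point:
    value: float
    quality: str       # e.g. "Good", "Bad", "Uncertain"
    timestamp: float   # time of the last update, in seconds

def should_switch(points, last_heartbeat,
                  watchdog_tag="Line3.Watchdog", threshold=100.0,
                  max_staleness=2.0, heartbeat_timeout=1.0):
    """Return True if any of the trigger conditions listed above has fired."""
    now = time.time()

    # Single point value change: a designated watchdog tag crosses a threshold.
    wd = points.get(watchdog_tag)
    if wd is not None and wd.value >= threshold:
        return True

    # Single point / multiple item quality change: any monitored point leaves "Good".
    if any(p.quality != "Good" for p in points.values()):
        return True

    # Rate-of-change monitoring: points are updating more slowly than expected.
    if any(now - p.timestamp > max_staleness for p in points.values()):
        return True

    # Network break or timeout: no heartbeat seen within the allowed window.
    return now - last_heartbeat > heartbeat_timeout
```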

Once the switch has occurred, the system or the redundancy manager itself might have the ability to send an alarm or email message, or even launch some kind of diagnostic or investigative program. It might also be able to log diagnostic information about the state of the primary OPC server or network connection. And in a system that distinguishes between primary and secondary inputs, there will often be a means to favor the primary input, and switch back to it when possible, sometimes referred to as a fallback.

Practical Considerations

The idea of redundancy for OPC is not difficult to grasp, but implementing it takes some thought. An initial decision on cold, warm or hot standby will impact all aspects of the implementation. The choice of proper hardware and software is critical for a well-functioning system. Robust system architecture is also important, especially if the connection is across a network. In addition to selecting OPC servers and planning the network infrastructure (if necessary), an important decision will be the software used to manage the redundancy. Good redundancy management software should be easy to use, with no programming necessary. The technology should be up to date, capable of running on the latest version of Windows. There should be an absolute minimum chance of data loss during a switchover, even over a network.

The Timer Pitfall

In practice it is not possible to achieve a completely seamless switchover in all cases, even with a hot standby system. For example, if a network failure occurs on the primary connection, a certain amount of time will pass before a redundancy manager can detect that failure. Data transmitted during this period will fail to arrive, and until the failure is detected the redundancy manager cannot distinguish it from a normal pause in data flow.

To minimize this delay, many redundancy managers implement timers that periodically check the network connection status, but a switchover mechanism based on periodic timers will always suffer some data loss. Systems with multiple timing parameters often produce additive delays, where the fastest possible switchover is the sum of those timing delays. For example, if the link check runs every 5 seconds and item qualities are polled every 2 seconds, the worst-case switchover approaches 7 seconds before any reconnection work even begins. In addition, using timers to detect network failure creates a configuration problem: the system integrator must trade off switchover latency against false-positive failure detection, which is effectively a trade-off between system stability and responsiveness.

Using timers to periodically check data values or qualities, or to poll the OPC servers, is also problematic because the timers introduce unnecessary latency into the system. Whereas a network failure must be detected based on timing, a data value or quality change can be detected immediately, as the event occurs. It is generally best to avoid time-based value-change detection and use event-based object monitoring instead.
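
A minimal sketch of the event-based approach, assuming the OPC client library delivers point updates through a subscription callback; the class and tag names are illustrative only:

```python
class EventBasedMonitor:
    """Event-driven sketch: the switch decision is taken inside the update
    callback itself, so no polling timer adds latency (illustrative only)."""

    def __init__(self, on_failover):
        self._on_failover = on_failover

    # Invoked by the OPC subscription whenever a point update arrives.
    def on_point_update(self, name, value, quality):
        if quality != "Good":
            # No timer is involved: the failover is triggered by the same
            # event that reported the bad quality.
            self._on_failover(f"{name} quality changed to {quality}")

# Usage sketch with a placeholder failover action:
monitor = EventBasedMonitor(on_failover=lambda reason: print("switching:", reason))
monitor.on_point_update("Line3.Speed", 0.0, "Bad")
```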

Object and Link Monitoring

A good redundancy manager should be able to support both object monitoring and link monitoring. Object monitoring means the ability to monitor individual points, and make a switchover based on an event. For example, if a designated watchdog tag changes in a significant way, such as turning negative or going over a specified threshold, it can trigger a switch to the secondary OPC server. Or maybe you’d like to monitor a group of points, and if the quality of any of them goes to “Bad” or “Unconnected”, you can switch.

Link monitoring is especially useful for networked connections. Your system will need a way to detect a network break very quickly, to prevent data loss. For hot standby on high-speed systems with fast data update rates, timeout detection with a sub-second response rate is essential. In any event, the system should be able to detect a timeout for a failed network connection, as well as a failure to receive data. This distinction is important. It may take seconds or even minutes to detect a communication failure, but a redundancy manager should be able to detect a stoppage of data flow in an amount of time very close to the true data rate from the physical system. The redundancy manager should be able to switch from one source to the other based solely on an observation that data has not arrived from the primary connection, but has arrived from the backup system.
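
The following sketch illustrates one way a data-flow stall might be detected at close to the true data rate, by comparing arrival times from the primary and backup streams. The class and parameter names are hypothetical, not taken from any product.

```python
import time

class LinkMonitor:
    """Minimal sketch: detect a stall in data flow from the primary source
    by comparing arrival times from the primary and backup streams."""

    def __init__(self, expected_period, slack=0.25):
        # Allow a small margin above the true data rate of the physical system.
        self.stall_limit = expected_period * (1.0 + slack)
        self.last_primary = time.monotonic()
        self.last_backup = time.monotonic()

    def data_arrived(self, source):
        now = time.monotonic()
        if source == "primary":
            self.last_primary = now
        else:
            self.last_backup = now

    def primary_stalled(self):
        now = time.monotonic()
        # Switch only if the backup is still delivering while the primary is not:
        # that points to a primary-side failure rather than a plant-wide pause.
        return (now - self.last_primary > self.stall_limit and
                now - self.last_backup <= self.stall_limit)
```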

Some systems use COM timeouts for link monitoring. This may be acceptable for circumstances where relatively long data outages are tolerable, but we do not recommend relying on COM timeouts for hot or warm standby.

Smart Switchover

The behavior of the redundancy system during a switchover can be significant. For example, suppose the primary and secondary connections have both failed for some reason. A typical redundancy manager will begin a cycle of attempting to attach to one and then the other OPC server until one of them responds. The redundancy manager will flip-flop between the two indefinitely, injecting sleep periods between each flip-flop to reduce system resource load. This sleep period is itself a source of latency. A smarter switchover model is to maintain a source health status that allows the redundancy manager to only switch over when a source status changes. This allows the redundancy manager to effectively idle, or perform simultaneous reconnection attempts, until a source status changes, then immediately respond without introducing extra latency. Smarter switching logic can result in substantially reduced system load and switchover times.
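
Here is a rough sketch of status-driven switching, where the manager reacts only to changes in source health rather than flip-flopping on a timer. The names and states are illustrative, not any product’s API.

```python
from enum import Enum

class Health(Enum):
    UP = "up"
    DOWN = "down"

class SmartSwitch:
    """Status-driven switching sketch: the manager idles until a source's
    health actually changes, instead of flip-flopping between sources."""

    def __init__(self):
        self.health = {"primary": Health.DOWN, "secondary": Health.DOWN}
        self.active = None

    def on_health_change(self, source, status):
        if self.health[source] == status:
            return                              # no change: nothing to do, no busy loop
        self.health[source] = status

        if status is Health.DOWN and source == self.active:
            self.active = self._first_healthy() # fail over immediately
        elif status is Health.UP and self.active is None:
            self.active = source                # recover as soon as any source is up

    def _first_healthy(self):
        return next((s for s, h in self.health.items() if h is Health.UP), None)
```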

Forced Switching vs Preferred Source

It is useful to be able to select one data source over another, even if the currently attached source is healthy. A naïve redundancy manager offers only a “forced” switch, which it will attempt even if the backup system is not available. This again results in flip-flop behavior as the redundancy manager tries to switch to the unavailable backup source. A much better approach is for the redundancy manager to understand the concept of a preferred source that can be changed at runtime. If the preferred source is available, the redundancy manager switches to it. If the user wants to switch from one source to another, the user simply changes the preferred source. If that source is available, the switch is made immediately; if it is not, the redundancy manager makes the switch only when the source becomes available. This avoids the flip-flop behavior and, at the same time, the data loss associated with the minimum of two switch cycles that a naïve redundancy manager imposes.
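
A minimal sketch of the preferred-source idea described above; the names are illustrative, and a real implementation would also handle fallback and notification:

```python
class PreferredSourceSwitch:
    """Sketch: honour a runtime-selectable preferred source, switching to it
    only when it is actually available (illustrative names only)."""

    def __init__(self, preferred="primary"):
        self.preferred = preferred
        self.available = {"primary": False, "secondary": False}
        self.active = None

    def set_preferred(self, source):
        self.preferred = source
        self._reevaluate()

    def set_available(self, source, is_up):
        self.available[source] = is_up
        self._reevaluate()

    def _reevaluate(self):
        if self.available.get(self.preferred):
            self.active = self.preferred        # preferred and healthy: use it
        elif self.active is None or not self.available.get(self.active, False):
            # Otherwise fall back to any available source; never force a switch
            # to a source that is known to be down.
            self.active = next((s for s, up in self.available.items() if up), None)
```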

Accessing Raw Data

A good hot-standby redundancy system will give the client application access not just to the redundant data, but also to the raw data from both sources. This gives the client application the option of presenting diagnostic information about the system on the “far side” of the redundancy manager. Most redundancy managers hide this information, so a client application has to make and manage multiple connections to access the raw data, if it can be accessed at all.

Other Options and Features

In addition to the above capabilities, a good redundancy manager may offer additional features for your convenience. It might provide the option to refresh the entire data set at switchover. Maybe it will send out emails or even launch additional programs at each switchover. This can be useful for notifying key personnel of the system status. It may log diagnostics to provide valuable information about the reasons for making the switch. Some redundancy managers can connect to multiple servers, and create multiple redundant connections. Others can let you work with subsets of the data. Another desirable feature is the ability to assign the primary and secondary data sources, and to trigger a fallback from the secondary to the primary data source once the problem that caused the switchover has been resolved.

As control systems continue to grow in complexity, and as we rely more and more on them, Mel Farnsworth’s situation will become more common, and more costly. If data connectivity is crucial to the success of the company, it would be wise to consider the possibility of installing a redundant system, and to take into account the above considerations when implementing redundancy for OPC.

Advanced Tunnelling for OPC with Cogent DataHub

OPC has become a leading standard for industrial process control and automation systems.  Among several OPC standards, the one most widely used throughout the world is OPC DA, or OPC Data Access. Many hardware manufacturers offer an OPC DA interface to their equipment, and OPC DA servers are also offered by third-party suppliers.  Likewise, most HMI vendors build OPC DA client capabilities into their software.  Thus data from most factory floor devices and equipment can connect to most HMIs and other OPC DA clients.  This universal connectivity has greatly enhanced the flexibility and efficiency of industrial automation systems.

But OPC DA has a major drawback: it does not network well. OPC DA is based on the COM protocol, which uses DCOM (Distributed COM) for networking. DCOM was not designed for real-time industrial applications. It is neither as robust nor as secure as industrial systems require, and it is very difficult to configure. To overcome these limitations, Cogent offers a “tunnelling” solution, as an alternative to DCOM, to transfer OPC data over a network. Let’s take a closer look at how tunnelling solves the issues associated with DCOM, and how the Cogent DataHub from Cogent Real-Time Systems provides a secure, reliable, and easy-to-use tunnelling solution with many advanced features.

Making Configuration Easy and Secure

The first problem you will encounter with DCOM is that it is difficult to configure. It can take a DCOM expert hours, and sometimes days, to get everything working properly. It is difficult to find good documentation on DCOM because configuration is not a simple, step-by-step process. Even if you are successful, the next Windows Update or an additional new setting may break your working system. Although it is not recommended practice, many companies “solve” the problem by simply bypassing DCOM security settings altogether. But granting broad access permissions in this way is becoming less and less viable in today’s security-conscious world, and most companies cannot risk lowering their guard just to allow DCOM to function.

Tunnelling with the Cogent DataHub eliminates DCOM completely, along with all of its configuration and security issues.  The Cogent DataHub uses the industry standard TCP/IP protocol to network data between an OPC server on one computer and an OPC client on another computer, thus avoiding all of the major problems associated with using the DCOM protocol.

The Cogent DataHub offers this tunnelling feature by effectively ‘mirroring’ data from one Cogent DataHub running on the OPC server computer, to another Cogent DataHub running on the OPC client computer, as shown in the image above.  This method results in very fast data transfer between Cogent DataHub nodes.

Better Network Communication

When a DCOM connection is broken, there are very long timeout delays before either side is notified of the problem, because DCOM has hard-coded timeout periods that cannot be adjusted by the user. In a production system, these long delays without warning can be a very real problem. Some OPC clients and OPC client tools have internal timeouts to work around this one problem, but that approach does not deal with the other issues discussed in this paper.

The Cogent DataHub has a user-configurable heartbeat and timeout feature that allows it to react immediately when a network break occurs. As soon as a break is detected, the Cogent DataHub begins to monitor the network connection, and when the link is re-established, the local Cogent DataHub automatically reconnects to the remote Cogent DataHub and refreshes the data set with the latest values. Systems with slow polling rates over long-distance lines can also benefit from the user-configurable timeout, because DCOM timeouts might be too short for these systems.
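
As a generic illustration (not the DataHub’s actual protocol or API), the sketch below shows a heartbeat-driven TCP client that detects a dead link through a timeout, reconnects when the link returns, and asks the remote side for a full refresh. The PING/REFRESH messages, host, and port are hypothetical.

```python
import socket
import time

def tunnel_client(host, port, heartbeat=1.0, timeout=2.0):
    """Illustrative reconnect loop for a generic TCP tunnel link."""
    while True:
        try:
            with socket.create_connection((host, port), timeout=timeout) as sock:
                sock.settimeout(timeout)
                sock.sendall(b"REFRESH\n")       # resynchronise the data set on (re)connect
                while True:
                    sock.sendall(b"PING\n")      # user-configurable heartbeat
                    if not sock.recv(4096):      # peer closed the connection cleanly
                        break
                    time.sleep(heartbeat)
        except OSError:
            pass                                 # broken link or timeout: fall through and retry
        time.sleep(heartbeat)                    # wait briefly before reconnecting
```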

Whenever there is a network break, it is important to protect the client systems that depend on data being delivered.  Because each end of the tunnelling connection is an independent Cogent DataHub, the client programs are protected from network failures and can continue to run in isolation using the last known data values.  This is much better than having the client applications lose all access to data when the tunnelling connection goes down.

The Cogent DataHub uses an asynchronous messaging system that further protects client applications from network delays. In most tunnelling solutions, the synchronous nature of DCOM is preserved over the TCP link. This means that when a client accesses data through the tunnel, it must block while waiting for a response. If a network error occurs, the client will continue to block until a network timeout occurs. The Cogent DataHub removes this limitation by releasing the client immediately and then delivering the data over the network. If a network error occurs, the data will be delivered once the network connection is re-established.
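
The following sketch shows the general idea of asynchronous delivery, where the client hands a value to a queue and returns immediately while a background thread handles the network. It is illustrative only, not the Cogent DataHub’s implementation.

```python
import queue
import threading
import time

class AsyncSender:
    """Asynchronous delivery sketch: writes never block on the network."""

    def __init__(self, send_over_network):
        self._outbox = queue.Queue()
        self._send = send_over_network
        threading.Thread(target=self._pump, daemon=True).start()

    def write(self, point, value):
        # Returns immediately; the client is released before any network I/O happens.
        self._outbox.put((point, value))

    def _pump(self):
        while True:
            point, value = self._outbox.get()
            delivered = False
            while not delivered:
                try:
                    self._send(point, value)
                    delivered = True
                except OSError:
                    # Network error: keep the message and retry once the link is back.
                    time.sleep(1.0)
```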

Here is how the Cogent DataHub compares with other tunnelling products:

  • The Cogent DataHub keeps all OPC transactions local to the computer, thus fully protecting the client programs from any network irregularities. Other products expose OPC transactions to network irregularities, making client programs subject to timeouts, delays, and blocking behavior. Link monitoring can reduce these effects, while the Cogent DataHub eliminates them.
  • The Cogent DataHub mirrors data across the network, so that both sides maintain a complete set of all the data. This shields the clients from network breaks, because it lets them continue to work with the last known values from the server. When the connection is re-established, both sides synchronize the data set. Other products pass data across the network on a point-by-point basis and maintain no knowledge of the current state of the points in the system. A network break leaves the client applications stuck with no data to work with.
  • A single tunnel can be shared by multiple client applications. This significantly reduces network bandwidth and means the customer can reduce licensing costs, since all clients (or servers) on the same computer share a single tunnel connection. Other tunnelling products require a separate network connection for each client-server connection, which increases the load on the system and the network, and increases licensing costs.

These features make it much easier for client applications to behave in a robust manner when communications are lost, saving time and reducing frustration.  Without these features, client applications can become slow to respond or completely unresponsive during connection losses or when trying to make synchronous calls.

Securing the System

Recently, DCOM networking has been shown to have serious security flaws that make it vulnerable to hackers and viruses. This is particularly worrying to companies who network data across Internet connections or other links outside the company.

To properly secure your communication channel, the Cogent DataHub offers secure SSL connections over the TCP/IP network.  An SSL tunnel is fully encrypted, which means the data is safe for transmission over open network links outside the company firewalls.  In addition, the Cogent DataHub provides access control and user authentication through optional password protection.  This ensures that only authorized users can establish tunnelling connections.  Having these features built into the Cogent DataHub is a significant advantage, since other methods of data encryption can require complicated operating system configuration and more expensive server hardware, neither of which is needed with the Cogent DataHub.
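
For illustration, here is how a generic TCP link can be wrapped in SSL/TLS using standard library calls. The host name, port, and certificate verification setup are placeholders; the DataHub’s own SSL tunnelling is configured through its interface rather than code like this.

```python
import socket
import ssl

# Placeholder endpoint for an SSL-encrypted tunnel link (illustrative only).
HOST, PORT = "tunnel.example.com", 9999

context = ssl.create_default_context()   # verifies the server certificate by default

with socket.create_connection((HOST, PORT)) as raw_sock:
    with context.wrap_socket(raw_sock, server_hostname=HOST) as tls_sock:
        tls_sock.sendall(b"hello over an encrypted link\n")
        print(tls_sock.version())         # e.g. "TLSv1.3"
```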

Advanced Tunnelling for OPC

While there are a few other products on the market that offer tunnelling capabilities to replace DCOM, the Cogent DataHub is unique in combining tunnelling with a wide range of advanced and complementary features that provide even more benefits.

Significant reduction in network bandwidth

The Cogent DataHub reduces the amount of data being transmitted across the network in two ways (a sketch follows the list):

  1. Rather than using a polling cycle to transmit the data, the Cogent DataHub only sends a message when a new data value is received.  This significantly improves performance and reduces bandwidth requirements.
  2. The Cogent DataHub can aggregate both client and server connections.  This means that the Cogent DataHub can collect data from multiple OPC servers and send it across the network using a single connection.  On the client side, any number of OPC clients can attach to the Cogent DataHub and they all receive the latest data as soon as it arrives.  This eliminates the need for each OPC client to connect to each OPC server using multiple connections over the network.
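
Here is a brief sketch of those two ideas, change-driven transmission and a single shared connection fanned out to local clients. It is illustrative only, with hypothetical names, and is not the DataHub’s implementation.

```python
class ChangeDrivenAggregator:
    """Sketch: forward a point over the single shared tunnel connection only
    when its value actually changes, and fan each update out to every local
    client that has subscribed."""

    def __init__(self, send_over_tunnel):
        self._last_sent = {}           # last value transmitted, per point
        self._send = send_over_tunnel  # one network connection shared by all clients
        self._subscribers = []         # local OPC clients attached to this node

    def subscribe(self, callback):
        self._subscribers.append(callback)

    # Called for every update collected from any of the aggregated OPC servers.
    def on_update(self, point, value):
        if self._last_sent.get(point) == value:
            return                     # value unchanged: nothing crosses the network
        self._last_sent[point] = value
        self._send(point, value)       # one message on the shared connection
        for deliver in self._subscribers:
            deliver(point, value)      # every local client receives the new value
```
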
Non-Blocking

While it may seem simple enough to replace DCOM with TCP/IP for networking OPC data, the Cogent DataHub also replaces the inherent blocking behaviour experienced in DCOM communication.  Client programs connecting to the Cogent DataHub are never blocked from sending new information.  Some vendors of tunnelling solutions for OPC still face this blocking problem, even though they are using TCP/IP.

Supports slow network and Internet links

Because the Cogent DataHub reduces the amount of data that needs to be transmitted over the network, it can be used over a slow network link.  Any interruptions are dealt with by the Cogent DataHub while the OPC client programs are effectively shielded from any disturbance caused by the slow connection.

Access to data on network computers running Linux

Another unique feature of the Cogent DataHub is its ability to mirror data between Cogent DataHubs running on other operating systems, such as Linux and QNX.  This means you can have your own custom Linux programs act as OPC servers, providing real-time data to OPC client applications running on networked Windows computers.  The reverse is also true.  You can have your Linux program access data from OPC servers running on networked Windows computers.

Load balancing between computers

The Cogent DataHub also offers the unique ability to balance the load on the OPC server computers.  You may have a system where multiple OPC clients are connecting to the OPC server at the same time, causing the server computer to experience high CPU loads and slower performance.  The solution is to mirror data from the Cogent DataHub on the OPC server computer to a Cogent DataHub on another computer, and then have some of your OPC clients connect to this second ‘mirrored’ computer.  This reduces the load on the original OPC server computer and provides faster response to all OPC client computers.

Advanced Tunnelling for OPC Example – TEVA Pharmaceuticals (Hungary)

TEVA Pharmaceuticals in Hungary recently used the Cogent DataHub to combine tunnelling and aggregation, moving OPC data across the network and through the company firewall.

Laszlo Simon is the Engineering Manager for the TEVA API plant in Debrecen, Hungary. He had a project that sounded simple enough. He needed to connect new control applications through several OPC stations to an existing SCADA network. The plant was already running large YOKOGAWA DCS and GE PLC control systems, connected to a number of distributed SCADA workstations. However, Mr. Simon did face a couple of interesting challenges in this project:

  • The OPC servers and SCADA systems were on different computers, separated by a company firewall. This made it extremely difficult to connect OPC over a network, because of the complexities of configuring DCOM and Windows security permissions.
  • Each SCADA system needed to access data from all of the new OPC server stations. This meant Mr. Simon needed a way to aggregate data from all the OPC stations into a single common data set on each SCADA computer.

After searching the web, Mr. Simon downloaded and installed the Cogent DataHub. Very quickly he had connected the Cogent DataHub to his OPC servers and determined that he was reading live process data from the new control systems. He was also able to easily set up the tunnelling link between the OPC server stations and the SCADA workstations, by simply installing another Cogent DataHub on the SCADA computer and configuring it to connect to the OPC server stations.

“I wanted to reduce and simplify the communication over the network because of our firewall. It was very easy with the Cogent DataHub,” said Mr. Simon after the system was up and running. Currently about 7,000 points are being transferred across the network, in real time, using the Cogent DataHub. “In the future, the additional integration of the existing or new OPC servers will be with the Cogent DataHub.”