Posts

Which Quality of Service (QoS) is Right for IIoT?

Part 6 of Data Communication for Industrial IoT

Quality of Service (QoS) is a general term to indicate the delivery contract from a sender to a receiver.  In some applications QoS talks about delivery time, reliability, latency or throughput.  In IIoT, QoS generally refers to the reliability of delivery.

Using MQTT as an example, there are three common Quality of Service levels for IIoT:

  • Level 0 – At most once.  Every message will be delivered on a best-effort basis, similar to UDP.  If the message is lost in transit for whatever reason, it is abandoned―the receiver never receives it, and the sender does not know that it was lost.
  • Level 1 – At least once.  Every message will be delivered to a receiver, though sometimes the same message will be delivered two or more times.  The receiver may be able to distinguish the duplicates, but perhaps not.  The sender is not aware that the receiver received multiple copies of the message.
  • Level 2 – Exactly once.  Every message will be delivered exactly once to the receiver, and the sender will be aware that it was received.

These QoS levels actually miss something important that comes up a lot in industrial systems, but let’s look at these three quickly.

First, QoS level 0 is simply unacceptable.  It is fine to lose a frame of a video once in a while, but not fine to lose a control signal that safely shuts down a stamping machine.  If the sender is transmitting data more quickly than the receiver can handle it, there will come a point where in-flight messages will fill the available queue positions, and new messages will be lost.  Essentially, QoS 0 will favor old messages over new ones.  In IIoT, this is a fatal flaw.  There’s no reason to discuss QoS 0 further.

QoS level 1 seems pretty reasonable at first glance.  Message duplication is not a problem in most cases, and where there is an issue the duplicates can be identified by the receiver and eliminated, assuming the client maintains enough history to be able to identify them.

However, problems arise when the sender is transmitting data more quickly than the receiver can process it.  Since there is a delivery guarantee at QoS 1, the sender must be able to queue an infinite number of packets waiting for an opportunity to deliver them.  Longer queues mean longer latencies.  For example, if I turn a light on and off three times, and the delivery latency is 5 seconds simply due to the queue volume, then it will take 30 seconds for the receiver to see that the light has settled into its final state.  In the meantime the client will be acting on false information.  In the case of a light, this many not matter much (unless it is a visual alarm), but in industrial systems timeliness matters.  The problem becomes even more severe if the client is aggregating data from multiple sources.  If some sources are delayed by seconds or minutes relative to other, then the client will be performing logic on data that are not only inconsistent with reality but also with each other.

Ultimately, QoS 1 cannot be used where any client could produce data faster than the slowest leg of the communication path can handle.  Beyond a certain data rate, the system will effectively “fall off a cliff” and become unusable.  I’ve personally seen this exact thing happen in a municipal waste treatment facility.  It wasn’t pretty.  The solution was to completely replace the communication mechanism.

QoS level 2 is similar to QoS 1, but more severe.  QoS 2 is designed for transactional systems, where every message matters, and duplication is equivalent to failure.  For example, a system that manages invoices and payments would not want to record a payment twice or emit multiple invoices for a single sale.  In that case, latency matters far less than guaranteed unique delivery.

Since QoS level 2 requires more communication to provide its guarantee, it requires more time to deliver each message.  It will exhibit the same problems under load as QoS level 1, but at a lower data rate.  That is, the maximum sustained data rate for QoS 2 will be lower than for QoS 1.  The “cliff” just happens sooner.

QoS Levels 1 and 2 Don’t Propagate

Both QoS level 1 and level 2 suffer from another big flaw – they don’t propagate.  Consider a trivial system where two clients, A and B, are connected to a single broker.  The goal is to ensure that B receives every message that A transmits, meaning that QoS 1 or 2 should apply between A and B.  Looking at QoS 1, A would send a message and wait for a delivery confirmation.  The broker would need to transmit the message to B before sending the confirmation to A.  That would imply that the broker knows that A needs to wait for B to respond.  Two problems arise: first, A cannot know that B is even connected to the broker.  That is a fundamental property of a one-to-many broker like MQTT.  Second, the broker cannot know that the intention of the system is to provide reliable communication between A and B.  Even if the broker were somehow programmed to wait like that, how would it deal with a third client, C, also listening for that message.  Would it wait for delivery on all clients?  What would it do about clients that are temporarily disconnected?  The answer is that it cannot.  If the intention of the system is to offer QoS 1 or 2 among clients then that QoS promise cannot be kept.

Some brokers have a server-to-server, or daisy-chain, mechanism that allows brokers to transfer messages to each other.  This allows clients on different brokers to intercommunicate.  In this configuration the QoS promise cannot be maintained beyond the connection between the original sender and the first broker in the chain.

Guaranteed Consistency

None of these QoS levels is really right for IIoT.  We need something else, and that is guaranteed consistency.  In a typical industrial system there are analog data points that move continuously, like flows, temperatures and levels.  A client application would like to see as much detail as it can, but most critical is the current value of these points.  If it misses a value that is already superseded by a new measurement, that is not generally a problem.  However, the client cannot accept missing the most recent value for a point.  For example, if I flick a light on and off 3 times, the client does not need to know how many times I did it, but it absolutely must know that the switch ended in the off position.  The communication path needs to guarantee that the final “off” message gets through, even if some intermediate states are lost.  This is the critical insight in IIoT.  The client is mainly interested in the current state of the system, not in every transient state that led up to it.

Guaranteed consistency for QoS is actually slightly more complex than that.  There are really three critical aspects that are too often ignored:

  1. Message queues must be managed for each data point and client. When communication is slow, old messages must be dropped from the queue in favor of new messages to avoid ever-lengthening latencies.  This queuing must occur on a per-point, per-client basis.  Only messages that are superseded for a specific point destined for a specific client can be dropped.  If we drop messages blindly then we risk dropping the last message in a sequence, as in the final switch status above.
  2. Event order must be preserved.  When a new value for a point enters the queue, it goes to the back of the queue even if it supersedes a message near the front of the queue.  If we don’t do this, the client could see the light turn on before the switch is thrown.  Ultimately the client gets a consistent view of the data, but for a short time it may have been inconsistent.
  3. The client must be notified when a value is no longer current.  For the client to trust its data, it must know when data consistency is no longer being maintained.  If a data source is disconnected for any reason, its data will no longer be updated in the client.  The physical world will move on, and the client will not be informed.  Although the data delivery mechanism cannot stop hardware from breaking, it can guarantee that the client knows that something is broken.  The client must be informed, on a per-point basis, whether the point is currently active and current or inaccessible and thus invalid.  In the industrial world this is commonly done using data quality, a per-point indication of the trustworthiness of each data value.

For those instances where it is critical to see every change in a process (that is, where QoS 1 or 2 is required), that critical information should be handled as close as possible to the data source, whether it’s a PLC or an embedded device.  That is, time-critical and event-critical information should be processed at its source, not transmitted via the network to a remote system for processing where that transmission could introduce latency or drop intermediate values. We will discuss this more when we talk about edge processing.

For the IIoT, the beauty of guaranteed consistency for QoS is that it can respond to changes in network conditions without slowing down, backing up or invalidating the client’s view of the system state.  It has a bounded queue size and is thus suitable for resilient embedded systems.  This quality of service can propagate through any number of intermediate brokers and still maintain its guarantee, as well as notify the client when any link in the chain is broken.

So there’s the answer.  For IIoT, you definitely don’t want QoS 0, and probably cannot accept the limitations and failure modes of QoS 1 or 2.  You want something altogether different—guaranteed consistency.

Remote Control without a Direct Connection

Part 5 of Data Communication for Industrial IoT

As discussed previously, the idea of using a cloud service as an intermediary for data resolves the problems of securing the device and securing the network.  If both the device and the user make outbound connections to a secure cloud server, there is no need to open ports on firewalls, and no need for a VPN. But this approach brings up two important questions for anyone interested in remote control:

  1. Is it fast enough?
  2. Does it still permit a remote user to control his device?

The answer to the first question is fairly simple.  It’s fast enough if the choice of communication technology is fast enough.  Many cloud services treat IoT communication as a data storage problem, where the device populates a database and then the client consults the contents of the database to populate web dashboards.  The communication model is typically a web service over HTTP(S).  Data transmission and retrieval both essentially poll the database.

The Price of Polling

Polling introduces an inevitable trade-off between resource usage on the server and polling rate, where the polling rate must be set with a reasonable delay to avoid overloading the cloud server or the user’s network.  This polling does two things – it introduces latency, a gap in time between an event occurring on the device and the user receiving notification of it, and it uses network bandwidth in proportion to the number of data items being handled.  Remote control of the device is still possible through polling if you are willing to pay the latency and bandwidth penalty of having the device poll the cloud.  This might be fine for a device with 4 data values, but it scales exceptionally poorly for an industrial device with hundreds of data items, or for an entire plant with tens of thousands of data items.

Publish/Subscribe Efficiency

By contrast, some protocols implement a publish/subscribe mechanism where the device and user both inform the cloud server that they have an interest in a particular data set.  When the data changes, both the device and user are informed without delay.  If no data changes, no network traffic is generated.  So, if the device updates a data value, the user gets a notification.  If the user changes a data value the device gets a notification.  Consequently, you have bi-directional communication with the device without requiring a direct connection to it.

This kind of publish/subscribe protocol can support bidirectional communication with latencies as low as a few milliseconds over the background network latency.  On a reasonably fast network or Internet connection, this is faster than human reaction time.  Thus, the publish/subscribe approach has the potential to support remote control without a direct connection.

Access the Data, Not the Network

Part 4 of Data Communication for Industrial IoT

The idea of a client/server relationship where the server is the source of information is ingrained strongly into the typical software available today.  As a system design that is very difficult to eliminate.  Some companies try to make this into a “secure” mechanism by trying to add a layer of security on top of the client/server connection.  That layer of security is generally a VPN or (in rare cases) a point-to-point tunnel like SSH tunneling.  Since a VPN is typically the answer, it deserves a little more examination.

The purpose of a VPN is to create a virtual IP subnet that is shared only by computers that are authorized to join that subnet.  Packets transmitted on the subnet are automatically encrypted, even if neither the sender nor receiver is consciously using encryption.  That definitely makes it harder for an outsider to intercept communication among members of the VPN.

The big concern with a VPN is that once a computer or device is a member of the VPN, it is effectively like being on a local area network containing all other members of the VPN.  If a computer is inside the VPN it is inside the trusted perimeter.  This exposes the other VPN members to attack from within, even if they are safe from attack from without.  This is similar to what happened in a big box store in 2013, where attackers gained access to the LAN by breaching a third-party company who had “secure” access to the store’s internal network.  The larger the number of computers on a VPN, the more points of entry through the secure perimeter you have.

In the Internet of Things, security concerns have been pushing away from VPNs for a while.  A blog posting at Microsoft from 2013 takes a look at VPNs and the issues surrounding them.  If you haven’t seen it, it’s worth a read.

When we are talking about collections of devices, plant control systems or data acquisition systems on a larger network a VPN might seem like a compelling solution, but it inevitably exposes your network to attack, either due to a compromise in a VPN member, a compromise in the VPN server or simple theft of network credentials.  Once you have any of those, every machine on the VPN becomes a sitting duck.

There is no valid reason why you should provide external access to the whole network any more than you should provide external access to an embedded device.  In exactly the same way that you protect your devices by having them transmit data outbound to a middleman you can protect any data source, like an industrial control system, using the same mechanism.  You can have remote access to your data without exposing your internal network.

In the world of IIoT you should aim to access your data, not the network.

Access the Data, Not the Device

Part 3 of Data Communication for Industrial IoT

Front and center to data communication for the IIoT is the idea that IoT devices never stop talking.  They are always connected to the Internet, and always accessible.  The accepted wisdom is that for IoT devices to be accessible they must be data “servers”—always listening for somebody to contact them to request information or perform an action.  If we cannot reach the device remotely, how can we access the data it contains?  However, this presents a security problem. If the device is always listening and reachable from the Internet, it is exposing an “attack surface”—a point of contact that a hacker can try to use to compromise the device.

This thinking comes from an entrenched understanding that a client-server architecture is the right model for information sharing.  It’s the basis of the World-Wide Web, after all, and we have all seen how successful that is.  Web browsers (clients) talk to web servers.  The web servers contain the information and the clients consume it.  The analogy with IoT devices is perfect.  IoT devices contain information, and smart phones, web browsers and other IoT devices consume it.  The device is the server, right?  But, if the device is always listening for client connections then it is therefore exposing an entry point for attacks from the Internet.  The big issue, in this world view, is that device makers must do a really good job of network security.  Every device manufacturer, whether making a car or a toaster, must employ highly-specialized and rare experts in network security to ensure that hackers don’t imprint images of Elvis in your breakfast or shut off your car engine on the highway.  Alright, hackers didn’t do the toast.  But they did hack the car.

A Fundamental Misunderstanding of the Problem

Frankly, the device-as-server world view is insane.  Why would you ever need to put a web server in a car and then expose it to the cellular network?  Why must the car be listening for any connection at all, ever?   Why must IoT devices be listeners?  Why must they expose an attack surface to the Internet?  The answer lies in a fundamental misunderstanding of the problem.  You want to access the data that the device contains.  You don’t want the world to have access to it.  Just you.  So, you don’t need the device to listen for you (and the world) to contact it.  You can tell the device where to send the information and pick it up from there, so you never need to talk to the device directly.  Effectively, the device transmits its information to a middleman, and when you want to know what your device is up to, visit the middleman to find out.

Then the question becomes – who is this middleman?  That is the role of a cloud service.  It’s a secure point of contact between you and your device.  Yes, it listens for communication from both you and the device.  Yes, it exposes an attack surface to the Internet.  But it relieves that responsibility from both you and the device.  There are far fewer IoT cloud services (tens or hundreds of them) than there are devices (billions of them), thus reducing the number of rare experts we need to achieve decent security.

Using a middleman like this does not mean that you will have to put up with slow communication, or that you will be unable to control your device.  It just means that there is an extra hop in the communication chain between you and your device, eliminating the need for the device to be directly visible on the Internet.  Questions of speed and bi-directionality are answered by the design of the cloud service.

Requirements for IIoT Data Communication

Part 2 of Data Communication for Industrial IoT

When we look at the IIoT there are three main requirements:

1. Access to device data: Our connected devices generate information that we want to use.  It can be as simple as a door sensor in a home security system or as complex as an oil drilling platform’s SCADA system.  The common theme is that remote access to the information has value. There are different scenarios within the industrial sector for this kind of access.

  • Plant to field device
  • Device to device (M2M)
  • Central office to process systems (IT to OT)
  • Central location to remote locations (data gathering, possible supervisory control)
  • Vendors / OEMs to in-plant or field devices (monitoring)

2. Remote control of the device: Sometimes it is necessary to control the system we are accessing remotely.  If a door sensor says the door is unlocked then we might like to lock it.  If our SCADA system says that a machine is malfunctioning we would like to turn it off.  Remote control should of course be optional.  The determination of whether control is allowed should be made at the device, so attempts to control the system remotely will fail if the device is not configured to allow it.

3. Security: We have all heard of security breaches in connected systems.  Hackers turn home appliances into bots on spam networks.  SCADA systems are remotely hacked to shut down or otherwise damage industrial systems.  A couple of years ago a power plant in the Ukraine was hacked.  Attackers “used stolen VPN credentials to reach the industrial control systems network, and remote access tools to control the HMIs and pull the breakers,” according to an article published on the DarkReading website, and other reports.

Commenting on the recent WannaCry attack and its implications for the Industrial IoT, Brad Hegrat in an IOActive blog wrote, “It may be time to rethink critical infrastructure cybersecurity engineering because if MS17-010 exploiting malware variants are successful, we are clearly doing something wrong.”

The security for IIoT data communication systems must be improved to prevent these kinds of attacks.  And, as we will explain later, securing an industrial system for the IoT is fundamentally different from traditional industrial network security.  A new approach is needed.  IIoT must be secure by design.

Data Communication for Industrial IoT

What is the IoT?  Is it really just a fancy word for the Internet?  Yes and no.  The Internet of Things is the promise of a world where billions of connected devices are connected to us and to each other, making decisions for us, coordinating among themselves, collecting and collating information, and generally relieving us of the mundane aspects of living in the physical world.

We’ve had the Internet for enough time now that it has become embedded in our lives.  My (adult) kids don’t remember a world without it.  The IoT is, at its most basic level, a continuation of that embedding.  Instant communication is taken for granted among people, and plenty of mature products provide it.  Is there anything really novel about devices participating in that communication alongside people?

In a word, yes.  Not novel in the sense that we need entirely new technologies to achieve this data communication among devices, but novel in the sense that a whole raft of new problems arise from it.  The IoT is going to remain nothing but a promise until those problems are solved (shameless plug here: I’m writing this from a backward-facing perspective.  We at Skkynet have solutions for the problems I will discuss in this series).

So what is the Industrial IoT (IIoT)?  Does it require a different way of thinking about IoT, relative to the “regular” IoT?  Not really, the IIoT just has greater consequences.  If somebody hacks your refrigerator, your food gets too hot or cold, or you become an unwitting source of spam email.  If somebody hacks your industrial process they could shut down an expensive line, damage equipment, injure people, or even put critical infrastructure out of service.  That said, the data communication, network security, privacy, speed, latency and accessibility issues surrounding the IoT are the same in the IIoT, just with more urgency.

On the other hand, is the IIoT simply the application of IoT technology to industrial applications?  Not really; rather, it is the application of IoT concepts to industrial applications.  This series of articles will examine some of these concepts related to communication for the Industrial IoT.  Even that is a very big topic, covering data acquisition, protocol gateways, cloud protocols, data storage, big data analysis, reliability, fault tolerance and security.  To keep things short we will narrow the conversation further to look at data acquisition, communication and security.