Posts

Tech Talk and Action in IIoT Data Communications

Is summer over already?  It may be hard to accept, but on my morning walks the sun rises later each day, the wind is more brisk, and the leaves are turning yellow and red.  Before fall arrives in earnest, I’d like to share a bountiful harvest of summer activity here at Skkynet.  While most of the world was on holiday and taking it easy for a few weeks, our technical team took the opportunity to jot down some of their thoughts on our specialty: data communication for Industrial IoT.

In this first installment of a new series of Tech Talk blogs, lead developer and company CEO Andrew Thomas discusses IIoT security, data protocols, best practices, and common pitfalls.  He starts by introducing the unique requirements for Industrial IoT, and he challenges the assumptions that lead to inherently insecure system design.  He then discusses each of the data protocols often suggested for use in the IIoT: UDP, MQTT, OPC UA, and REST, pointing out the strengths and weaknesses of each.  The best approach, he argues, exhibits the best qualities of these and more, as well as supporting edge and fog processing and public, private, or hybrid clouds.

This is the thinking that underlies SkkyHub, Skkynet’s secure-by-design approach to Industrial IoT.  Combined with our ETK and Cogent DataHub, the result is Industrial IoT that actually works.  You can install it in greenfield or brownfield projects, connect to new or existing systems, use open protocols, and get secure, robust, real-time performance at speeds not much slower than Internet propagation speeds.  And it is available today, right now.

This fall we are putting SkkyHub, DataHub, and ETK on display and into play in several arenas.  We will be at conferences and trade shows in North America, Europe and the Far East, including OPC Foundation Seminars in Vancouver and Toronto, Industry of Things World 2017 in Berlin, Sensors Midwest in Chicago, ARM TechCon in Santa Clara, SPS Drives in Nuremberg, and SCF in Tokyo.  If you are attending any of these, please stop by.

In the field, SkkyHub customers are enjoying the benefits of the service, and some have expressed an interest in sharing their experiences.  We will be blogging about those soon.  Meanwhile, the tech team has shifted back into development mode, and we expect some exciting news from them soon as well.  Summer may be winding down, but Skkynet continues to move rapidly ahead.

What is Edge Processing anyway?

Part 12 of Data Communication for Industrial IoT

Edge processing refers to the execution of aggregation, data manipulation, bandwidth reduction and other logic directly on an IoT sensor or device.  The idea is to put basic computation as close as possible to the physical system, making the IoT device as “smart” as possible.

Is this a way to take advantage of all of the spare computing power in the IoT device?  Partially.  The more work the device can do to prepare the data for the cloud, the less work the cloud needs to do.  The device can convert its information into the natural format for the cloud server, and can implement the proper communication protocols.  There is more, though.

Data Filter

Edge processing means not having to send everything to the cloud.  An IoT device can deal with some activities itself.  It can’t rely on a cloud server to implement a control algorithm that would need to survive an Internet connection failure.  Consequently, it should not need to send to the cloud all of the raw data feeding that algorithm.

Let’s take a slightly contrived example.  Do you need to be able to see the current draw of the compressor in your smart refrigerator on your cell phone?  Probably not.  You might want to know whether the compressor is running constantly – that would likely indicate that you left the door ajar.  But really, you don’t even need to know that.  Your refrigerator should recognize that the compressor is running constantly, and it should decide on its own that the door is ajar.  You only need to know that final piece of information, the door is ajar, which is two steps removed from the raw input that produces it.
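The refrigerator example can be sketched in a few lines of Python.  This is only an illustration: the class name, the 0.5 A threshold and the six-sample window are all made-up assumptions, not taken from any real appliance.  The point is that the raw current readings stay on the device and only the derived "door ajar" state would ever be transmitted.

```python
from collections import deque


class DoorAjarDetector:
    """Infers 'door ajar' from compressor current samples, on the device itself."""

    def __init__(self, running_amps=0.5, window=6):
        self.running_amps = running_amps      # assumed threshold: above this, the compressor is on
        self.samples = deque(maxlen=window)   # most recent raw readings, kept locally
        self.door_ajar = False

    def add_sample(self, amps):
        """Feed one raw reading; return the new door state only when it changes."""
        self.samples.append(amps)
        window_full = len(self.samples) == self.samples.maxlen
        ajar = window_full and all(a > self.running_amps for a in self.samples)
        if ajar != self.door_ajar:
            self.door_ajar = ajar
            return ajar       # the only thing worth sending to the cloud
        return None           # nothing changed, nothing to transmit


detector = DoorAjarDetector()
readings = [0.0, 1.2, 1.3, 0.0, 1.2, 1.2, 1.3, 1.2, 1.3, 1.2, 0.0]
events = [e for e in (detector.add_sample(a) for a in readings) if e is not None]
print(len(readings), "raw readings ->", events)   # events is [True, False]
```

Eleven raw readings collapse into two messages: the door was found ajar, and later it was closed again.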

Privacy

This has privacy and information security implications.  If you don’t send the information to the Internet, you don’t expose it.  The more processing you can do on the device, the less you need to transmit on the Internet.  That may not be a big distinction for a refrigerator, but it matters a lot when the device is a cell tower, a municipal water pumping station or an industrial process.

Bandwidth

Edge processing also has network bandwidth implications.  If the device can perform some of the heavy lifting before it transmits its information it has the opportunity to reduce the amount of data it produces.  That may be something simple, like applying a deadband to a value coming from an A/D converter, or something complex like performing motion detection on an image.  In the case of the deadband, the device reduces bandwidth simply by not transmitting every little jitter from the A/D converter.  In the case of the motion detection, the device can avoid sending the raw images to the cloud and instead just send an indication of whether motion was detected.  Instead of requiring a broadband connection the device could use a cellular connection and never get close to its monthly data quota.
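A deadband is simple enough to show in full.  The sketch below is a minimal version, assuming integer A/D counts and a hypothetical band of 5 counts; a real device would pick the band from the sensor's noise characteristics.

```python
def deadband(samples, band):
    """Yield only samples that differ from the last transmitted value by more than `band`."""
    last_sent = None
    for value in samples:
        if last_sent is None or abs(value - last_sent) > band:
            last_sent = value
            yield value


raw = [100, 101, 100, 102, 99, 110, 111, 109, 110, 100]   # jittery A/D counts
sent = list(deadband(raw, band=5))
print(sent)   # [100, 110, 100]
```

Ten raw samples become three transmissions, and the receiver still sees every change larger than the band.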

Data Protocol

There is just one thing to watch for.  In our example of the motion detection, the device probably wants to send one image frame to the cloud when it detects motion.  That cannot be represented as a simple number.  Generally, the protocol being used to talk to the cloud server needs to be rich enough to accept the processed data the device wants to produce.  That counts out most industrial protocols like Modbus, but fits most REST-based protocols as well as the higher level protocols like OPC UA and MQTT.

Continue reading, or go back to Table of Contents

Where does Blockchain fit into the IIoT?

Part 11 of Data Communication for Industrial IoT

Nothing I’ve read suggests that blockchain will replace SSL for IoT security.  Blockchains are “distributed ledgers” that are generally described as tamper-proof (though in practice they can be tampered with by anyone who controls enough of the computing power that validates the transactions).  This design works fine for certain Internet applications like bitcoin, but I don’t see the blockchain fitting well into the IIoT.

Size matters

First of all, since there is no central ledger, all participating devices must contain, or have access to, the entire ledger.  No entry can ever be removed from the ledger.  As the number of devices grows, and with it the number of transactions the ledger contains, the size of the ledger grows geometrically.  The size of the bitcoin blockchain is roughly doubling every year and currently is over 60GB.  For an IoT node to fully trust the blockchain it would need a geometrically growing amount of storage.  That’s obviously not possible.
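To put numbers on "geometrically", here is a quick extrapolation of the two figures quoted above (60GB today, doubling every year).  It is only arithmetic on those figures, assuming the doubling simply continues; it is not a forecast.

```python
START_GB = 60   # ledger size quoted in the text

# Size after 0, 2, 4, 6 and 8 years if the ledger keeps doubling annually.
sizes = {year: START_GB * 2 ** year for year in range(0, 9, 2)}
for year, gb in sizes.items():
    print(f"year {year}: {gb} GB")
```

On those assumptions the ledger passes 15,000 GB after eight years, far beyond what an embedded device could store.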

So, individual devices can prune the blockchain and store only the last few minutes or seconds of it, hoping that nearby peer devices will provide independent confirmation that their little piece of the blockchain is cryptographically secure.  That produces a possible line of attack on the device, where nearby devices could lie, and produce a satisfactory probability of truth in the “mind” of the target device.

Thus security is based on the availability of massive storage, and attempts to reduce that storage requirement diminish security.  As far as I can tell this is an unsolved problem right now.

Too much connectivity?

The second problem with blockchains is that they assume that every transaction in the system must be transmitted to every participant in the blockchain.  Yes, when somebody’s fridge turns on in Paris, every one of the billions of devices participating in the blockchain must be told.  If they are not, then their local copy of the blockchain is inconsistent and they cannot trust the next transaction, which they might actually be interested in.  As the number of devices and transactions rises, the amount of worldwide network bandwidth required to maintain the integrity of the blockchain grows geometrically.  One article I read says that on a 10Mbit Internet connection the theoretical maximum number of transactions in the entire bitcoin universe that connection could sustain would be 7 transactions per second.  Seven.

The result of these two limitations is that a blockchain probably cannot be used to carry the actual data that the devices produce.  Instead it is more likely to be used as an authentication mechanism.  That is, a device that is legitimately on the blockchain can be verified as being itself based on something that the blockchain knows.  My personal opinion is that it sounds very much like the blockchain would become a distributed certificate authority.  Instead of having the current SSL “chain of trust” of certificates, you would have a “blockchain of trust”.  But since an individual device could not contain the entire blockchain you would still need a server to provide the equivalent of certificate validation, so there’s your point of attack.

Some claimed examples of IoT devices using blockchains, like a washing machine that buys detergent using bitcoins, rely on misdirection.  Yes, they are using blockchains in their bitcoin transactions because that’s how bitcoin works, but the maintenance data they produce (the real point of the blockchains-for-IoT conversation) are not being transmitted via blockchain at all.

I have yet to see a practical application of blockchains to IoT data or even to IoT authentication.  The conversation at the moment is in the realm of “it would be nice” but the solutions to the implementation problems are not clear.  Incidentally the same problems exist for bitcoin and there are no clear solutions in that space either.

Is REST the Answer for IIoT?

Part 10 of Data Communication for Industrial IoT

As we’ve stated previously, the IIoT is imagined as a client-server architecture where the “things” can be smart devices with embedded micro-controllers.  The devices generate data based on sensors, and send that data to a server that is usually elsewhere on the Internet.  Similarly, a device can be controlled by retrieving data from the server and acting upon it, say to turn on an air conditioner.

The communication mechanism typically used for devices to communicate with the servers over the Internet is called REST (Representational State Transfer) using HTTP.  Every communication between the device and server occurs as a distinct HTTP request.  When the device wants to send data to the server it makes an HTTP POST call.  When it wants to get data (like a new thermostat setting) it makes an HTTP GET call.  Each HTTP call opens a distinct socket, performs the transaction, and then closes the socket.  The protocol is said to be “connectionless”.  Every transaction includes all of the socket set-up time and communication overhead.  Since there is no connection, all transactions must take the form of “request/response” where the device sends a request to the server and collects the response.  The server generally does not initiate a transaction with the device, as that would expose the device to attack from the Internet.

HTTP does define a keep-alive connection, where several HTTP transactions are sent on a single socket.  This definitely reduces the amount of time spent creating and destroying TCP connections, but does not change the basic request/response behaviour of the HTTP protocol.  Scalability issues and trade-offs between latency and bandwidth still overwhelm any benefit gained from a keep-alive connection.

One of the identifying features of the IIoT is the data volume.  Even a simple industrial system contains thousands of data points.  REST APIs might be fine for a toaster, but at industrial scale they run into problems:

Bandwidth

REST messages typically pay the cost for socket setup on every message or group of messages.  Then they send HTTP headers before transmitting the data payload.  Finally, they demand a response, which contains at least a few required headers.  Writing a simple number requires hundreds of bytes to be transmitted on multiple IP packets.
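To make the overhead concrete, the sketch below builds one deliberately small HTTP exchange by hand and counts the bytes.  The host name, URL path and choice of headers are invented for illustration; real clients usually send more headers (user agent, authentication and so on), and none of this counts the TCP or TLS handshake.

```python
payload = b"72.5"   # the "simple number" being written

request = b"\r\n".join([
    b"POST /points/tank1/level HTTP/1.1",
    b"Host: iot.example.com",
    b"Content-Type: text/plain",
    b"Content-Length: " + str(len(payload)).encode("ascii"),
    b"Connection: close",
    b"",            # blank line ends the headers
    payload,
])

response = b"\r\n".join([
    b"HTTP/1.1 204 No Content",
    b"Date: Mon, 01 Jan 2018 00:00:00 GMT",
    b"Connection: close",
    b"",
    b"",            # headers end with an empty line; there is no body
])

overhead = len(request) + len(response) - len(payload)
print(f"payload: {len(payload)} bytes, protocol overhead: {overhead} bytes")
```

Even in this stripped-down case the overhead comes to well over a hundred bytes for a four-byte value.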

Latency

Latency measures the amount of time that passes between an event occurring and the user receiving notification.  In a REST system, the latency is the sum of:

  • The client’s polling rate
  • Socket set-up time
  • Network transmission latency to send the request
  • Transmission overhead for HTTP headers
  • Transmission time for the request body
  • Network transmission latency to send the response
  • Socket take-down time

By comparison an efficient persistent connection measures latency as:

  • Network transmission latency to send the request
  • Transmission time for the request body
  • Network transmission time for an optional response body

The largest sources of latency in a REST system (polling rate, socket set-up, response delivery) are all eliminated by a persistent connection.  This allows it to achieve transmission latencies that are mere microseconds above network latencies.
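The two sums can be compared with some placeholder numbers.  Every figure below is a made-up, illustrative value in microseconds (a 1 second polling interval, a 20 ms one-way network latency and so on); only the structure of the two sums comes from the lists above.

```python
# All values are illustrative assumptions, in microseconds.
rest_us = {
    "average wait for the next poll (1 s interval)": 500_000,
    "socket set-up": 30_000,
    "network latency, request": 20_000,
    "HTTP header transmission": 1_000,
    "request body transmission": 100,
    "network latency, response": 20_000,
    "socket take-down": 10_000,
}
persistent_us = {
    "network latency, request": 20_000,
    "request body transmission": 100,
}

rest_total = sum(rest_us.values())
persistent_total = sum(persistent_us.values())
print(f"REST:       {rest_total / 1000:.1f} ms")
print(f"persistent: {persistent_total / 1000:.1f} ms")
```

With these assumed values the polling wait alone dwarfs everything else, which is why shortening the other terms does little for a REST client.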

REST’s latency problems become even clearer in systems where two devices are communicating with one another through an IoT server.  Low-latency event-driven systems can achieve practical data rates hundreds or thousands of times faster than REST.  REST was never designed for the kind of data transmission IIoT requires.

Scalability

One of the factors in scalability is the rate at which a server responds to transactions from the device.  In a REST system a device must constantly poll the server to retrieve new data.  If the device polls the server quickly then it causes many transactions to occur, most of which produce no new information.  If the device polls the server slowly, it may miss important data or will experience a significant time lag (latency) due to the polling rate.  As the number of devices increases, the server quickly becomes overloaded and the system must make a choice between the number of devices and the latency of transmission.
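The trade-off can be written down as a two-line model.  This is a simplification under stated assumptions: every device polls at the same fixed interval, and an event is equally likely to occur at any moment within that interval, so it waits half an interval on average.  The device count of 1,000 is arbitrary.

```python
def polling_cost(devices, poll_interval_s):
    """Return (server requests per second, average added latency in seconds)."""
    requests_per_s = devices / poll_interval_s
    avg_wait_s = poll_interval_s / 2   # an event lands, on average, mid-interval
    return requests_per_s, avg_wait_s


for interval in (0.1, 1, 10, 60):
    rate, wait = polling_cost(devices=1000, poll_interval_s=interval)
    print(f"poll every {interval:>4} s: {rate:>8.1f} requests/s, {wait:>5.2f} s average delay")
```

Whichever interval is chosen, one column gets better only as the other gets worse, and adding devices multiplies the request rate without improving the delay.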

All systems have a maximum load.  The question is, how quickly does the system approach this maximum and what happens when the maximum is reached?  We have all seen suddenly-popular web sites become inaccessible due to overloading.  While those systems experienced unexpectedly high transaction volumes, a REST system in an IIoT setting will be exposed to that situation in the course of normal operation.  Web systems suffer in exactly the scenarios in which IIoT is likely to be useful.  Event-driven systems scale much more gradually, as adding clients does not necessarily add significant resource cost.  For example, we have been able to push REST systems to about 3,000 transactions per minute.  We have pushed event-driven systems to over 5,000,000 transactions per minute on the same hardware.

Symmetry

REST APIs generally assume that the data flow will be asymmetrical.  That is, the device will send a lot of data to the server, but retrieve data from the server infrequently.  In order to maintain reasonable efficiency, the device will typically transmit frequently, but poll the server infrequently.  This causes additional latency, as discussed earlier.  In some systems this might be a reasonable sacrifice, but in IIoT systems it usually is not.

For example, a good IIoT server should be capable of accepting, say, 10,000 data points per second from an industrial process and retransmitting that entire data set to another industrial process, simulator, or analytics system without introducing serious alterations to the data timing.  To do that, the server must be capable of transmitting data just as quickly as it receives it.  A good way to achieve this is through establishing persistent, bidirectional connections to the server.  This way, if the device or another client needs to receive 10,000 data changes per second the communication mechanism will support it.

Robustness

Industrial applications are often mission-critical to their owners.  This is one of the big issues holding back the IIoT.  What happens if the Internet connection goes down?

In the typical IoT scenario, the device is making REST calls to a server running in the cloud.  In some ways this is a by-product of the cloud vendor’s business model, and in some ways it is due to the REST implementation.  A REST server is typically a web server with custom URL handlers, tightly coupled to a proprietary server-side application and database.  If the Internet connection is lost, the device is cut off, even if the device is inside an industrial plant providing data to a local control system via the cloud.  If the cloud server is being used to issue controls to the device, then control becomes impossible, even locally.  This could be characterized as “catastrophic degradation” when the Internet connection is lost.

Ideally, the device should be able to make a connection to a local computer inside the plant, integrate directly with a control system using standard protocols like OPC and DDE, and also transmit its data to the cloud.  If the Internet connection is lost, the local network connection to the control system is still available.  The device is not completely cut off, and control can continue.  This is a “graceful degradation” when the Internet connection is lost.
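A minimal sketch of that graceful degradation follows.  The class and the two send functions are hypothetical stand-ins: in practice the local path would be something like an OPC connection to the control system, and the cloud path an outbound connection to the IoT service.

```python
class Device:
    """Delivers each update locally first, then to the cloud if it is reachable."""

    def __init__(self, send_local, send_cloud):
        self.send_local = send_local
        self.send_cloud = send_cloud
        self.cloud_ok = True

    def update(self, point, value):
        self.send_local(point, value)      # plant network: does not depend on the Internet
        try:
            self.send_cloud(point, value)
            self.cloud_ok = True
        except ConnectionError:
            self.cloud_ok = False          # degrade, do not stop


local_log = []


def cloud_down(point, value):
    # Simulates a lost Internet connection.
    raise ConnectionError("Internet link is down")


device = Device(lambda p, v: local_log.append((p, v)), cloud_down)
device.update("pump1.speed", 1450)
print(local_log, device.cloud_ok)   # [('pump1.speed', 1450)] False
```

The local control system still receives the value even though the cloud call failed, which is the difference between graceful and catastrophic degradation.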

In conclusion, REST systems work reasonably well in low-speed transactional systems.  However, they have a number of disadvantages when applied to high speed, low latency systems, and to systems where data transfer from the server to the device is frequent.  Industrial IoT systems are characterized by exactly these requirements, making REST an inappropriate communication model for IIoT.

Which Quality of Service (QoS) is Right for IIoT?

Part 6 of Data Communication for Industrial IoT

Quality of Service (QoS) is a general term to indicate the delivery contract from a sender to a receiver.  In some applications QoS talks about delivery time, reliability, latency or throughput.  In IIoT, QoS generally refers to the reliability of delivery.

Using MQTT as an example, there are three common Quality of Service levels for IIoT:

  • Level 0 – At most once.  Every message will be delivered on a best-effort basis, similar to UDP.  If the message is lost in transit for whatever reason, it is abandoned: the receiver never receives it, and the sender does not know that it was lost.
  • Level 1 – At least once.  Every message will be delivered to a receiver, though sometimes the same message will be delivered two or more times.  The receiver may be able to distinguish the duplicates, but perhaps not.  The sender is not aware that the receiver received multiple copies of the message.
  • Level 2 – Exactly once.  Every message will be delivered exactly once to the receiver, and the sender will be aware that it was received.

These QoS levels actually miss something important that comes up a lot in industrial systems, but let’s look at these three quickly.

First, QoS level 0 is simply unacceptable.  It is fine to lose a frame of a video once in a while, but not fine to lose a control signal that safely shuts down a stamping machine.  If the sender is transmitting data more quickly than the receiver can handle it, there will come a point where in-flight messages will fill the available queue positions, and new messages will be lost.  Essentially, QoS 0 will favor old messages over new ones.  In IIoT, this is a fatal flaw.  There’s no reason to discuss QoS 0 further.

QoS level 1 seems pretty reasonable at first glance.  Message duplication is not a problem in most cases, and where there is an issue the duplicates can be identified by the receiver and eliminated, assuming the client maintains enough history to be able to identify them.

However, problems arise when the sender is transmitting data more quickly than the receiver can process it.  Since there is a delivery guarantee at QoS 1, the sender must be able to queue an infinite number of packets waiting for an opportunity to deliver them.  Longer queues mean longer latencies.  For example, if I turn a light on and off three times (six state changes), and the delivery latency is 5 seconds simply due to the queue volume, then it will take 30 seconds for the receiver to see that the light has settled into its final state.  In the meantime the client will be acting on false information.  In the case of a light, this may not matter much (unless it is a visual alarm), but in industrial systems timeliness matters.  The problem becomes even more severe if the client is aggregating data from multiple sources.  If some sources are delayed by seconds or minutes relative to others, then the client will be performing logic on data that are not only inconsistent with reality but also with each other.
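The light example can be replayed as a tiny simulation.  It assumes the slow link takes 5 seconds to deliver each queued message, which is one way to read the 5 second latency above; the queue itself is an ordinary first-in, first-out queue, as QoS 1 requires.

```python
from collections import deque

SECONDS_PER_MESSAGE = 5   # assumed time for the slow link to deliver one message

# Three on/off cycles, all queued before the receiver has seen any of them.
queue = deque(["on", "off", "on", "off", "on", "off"])

clock = 0
seen = []                 # (time in seconds, state) as observed by the receiver
while queue:
    state = queue.popleft()
    clock += SECONDS_PER_MESSAGE
    seen.append((clock, state))

print(seen[-1])   # (30, 'off')
```

For most of those 30 seconds the receiver believes something about the light that is no longer true.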

Ultimately, QoS 1 cannot be used where any client could produce data faster than the slowest leg of the communication path can handle.  Beyond a certain data rate, the system will effectively “fall off a cliff” and become unusable.  I’ve personally seen this exact thing happen in a municipal waste treatment facility.  It wasn’t pretty.  The solution was to completely replace the communication mechanism.

QoS level 2 is similar to QoS 1, but more severe.  QoS 2 is designed for transactional systems, where every message matters, and duplication is equivalent to failure.  For example, a system that manages invoices and payments would not want to record a payment twice or emit multiple invoices for a single sale.  In that case, latency matters far less than guaranteed unique delivery.

Since QoS level 2 requires more communication to provide its guarantee, it requires more time to deliver each message.  It will exhibit the same problems under load as QoS level 1, but at a lower data rate.  That is, the maximum sustained data rate for QoS 2 will be lower than for QoS 1.  The “cliff” just happens sooner.

QoS Levels 1 and 2 Don’t Propagate

Both QoS level 1 and level 2 suffer from another big flaw – they don’t propagate.  Consider a trivial system where two clients, A and B, are connected to a single broker.  The goal is to ensure that B receives every message that A transmits, meaning that QoS 1 or 2 should apply between A and B.  Looking at QoS 1, A would send a message and wait for a delivery confirmation.  The broker would need to transmit the message to B before sending the confirmation to A.  That would imply that the broker knows that A needs to wait for B to respond.  Two problems arise: first, A cannot know that B is even connected to the broker.  That is a fundamental property of a one-to-many broker like MQTT.  Second, the broker cannot know that the intention of the system is to provide reliable communication between A and B.  Even if the broker were somehow programmed to wait like that, how would it deal with a third client, C, also listening for that message?  Would it wait for delivery on all clients?  What would it do about clients that are temporarily disconnected?  The answer is that it cannot.  If the intention of the system is to offer QoS 1 or 2 among clients then that QoS promise cannot be kept.

Some brokers have a server-to-server, or daisy-chain, mechanism that allows brokers to transfer messages to each other.  This allows clients on different brokers to intercommunicate.  In this configuration the QoS promise cannot be maintained beyond the connection between the original sender and the first broker in the chain.

Guaranteed Consistency

None of these QoS levels is really right for IIoT.  We need something else, and that is guaranteed consistency.  In a typical industrial system there are analog data points that move continuously, like flows, temperatures and levels.  A client application would like to see as much detail as it can, but most critical is the current value of these points.  If it misses a value that is already superseded by a new measurement, that is not generally a problem.  However, the client cannot accept missing the most recent value for a point.  For example, if I flick a light on and off 3 times, the client does not need to know how many times I did it, but it absolutely must know that the switch ended in the off position.  The communication path needs to guarantee that the final “off” message gets through, even if some intermediate states are lost.  This is the critical insight in IIoT.  The client is mainly interested in the current state of the system, not in every transient state that led up to it.

Guaranteed consistency for QoS is actually slightly more complex than that.  There are really three critical aspects that are too often ignored:

  1. Message queues must be managed for each data point and client. When communication is slow, old messages must be dropped from the queue in favor of new messages to avoid ever-lengthening latencies.  This queuing must occur on a per-point, per-client basis.  Only messages that are superseded for a specific point destined for a specific client can be dropped.  If we drop messages blindly then we risk dropping the last message in a sequence, as in the final switch status above.
  2. Event order must be preserved.  When a new value for a point enters the queue, it goes to the back of the queue even if it supersedes a message near the front of the queue.  If we don’t do this, the client could see the light turn on before the switch is thrown.  Ultimately the client gets a consistent view of the data, but for a short time it may have been inconsistent.
  3. The client must be notified when a value is no longer current.  For the client to trust its data, it must know when data consistency is no longer being maintained.  If a data source is disconnected for any reason, its data will no longer be updated in the client.  The physical world will move on, and the client will not be informed.  Although the data delivery mechanism cannot stop hardware from breaking, it can guarantee that the client knows that something is broken.  The client must be informed, on a per-point basis, whether the point is currently active and current or inaccessible and thus invalid.  In the industrial world this is commonly done using data quality, a per-point indication of the trustworthiness of each data value.
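The three aspects above can be sketched as a small per-client queue.  This is a minimal illustration, not how any particular product implements it: the class name, the "good"/"bad" quality strings and the method names are all assumptions made for the example.

```python
from collections import OrderedDict


class ConsistentQueue:
    """Outbound queue for ONE client: at most one pending message per point."""

    def __init__(self):
        self._pending = OrderedDict()   # point name -> (value, quality), oldest first
        self._last = {}                 # last known value per point

    def put(self, point, value, quality="good"):
        # Aspect 1: drop only the superseded message for this point and client.
        self._pending.pop(point, None)
        # Aspect 2: the new value goes to the back, preserving event order.
        self._pending[point] = (value, quality)
        self._last[point] = value

    def source_lost(self):
        # Aspect 3: tell the client that every point is no longer current.
        for point, value in list(self._last.items()):
            self.put(point, value, quality="bad")

    def get(self):
        """Return the oldest pending (point, value, quality), or None if empty."""
        if not self._pending:
            return None
        point, (value, quality) = self._pending.popitem(last=False)
        return point, value, quality


q = ConsistentQueue()
q.put("light", "on")
q.put("level", 41)
q.put("light", "off")   # supersedes "on" and moves behind "level"
q.put("level", 42)      # supersedes 41 and moves behind "light"
drained = [q.get(), q.get(), q.get()]
print(drained)          # [('light', 'off', 'good'), ('level', 42, 'good'), None]

q.source_lost()
print(q.get())          # ('light', 'off', 'bad')
```

Because there is never more than one pending message per point, the queue length is bounded by the number of points, however slow the client is.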

For those instances where it is critical to see every change in a process (that is, where QoS 1 or 2 is required), that critical information should be handled as close as possible to the data source, whether it’s a PLC or an embedded device.  That is, time-critical and event-critical information should be processed at its source, not transmitted via the network to a remote system for processing where that transmission could introduce latency or drop intermediate values. We will discuss this more when we talk about edge processing.

For the IIoT, the beauty of guaranteed consistency for QoS is that it can respond to changes in network conditions without slowing down, backing up or invalidating the client’s view of the system state.  It has a bounded queue size and is thus suitable for resilient embedded systems.  This quality of service can propagate through any number of intermediate brokers and still maintain its guarantee, as well as notify the client when any link in the chain is broken.

So there’s the answer.  For IIoT, you definitely don’t want QoS 0, and probably cannot accept the limitations and failure modes of QoS 1 or 2.  You want something altogether different—guaranteed consistency.

Remote Control without a Direct Connection

Part 5 of Data Communication for Industrial IoT

As discussed previously, the idea of using a cloud service as an intermediary for data resolves the problems of securing the device and securing the network.  If both the device and the user make outbound connections to a secure cloud server, there is no need to open ports on firewalls, and no need for a VPN. But this approach brings up two important questions for anyone interested in remote control:

  1. Is it fast enough?
  2. Does it still permit a remote user to control his device?

The answer to the first question is fairly simple.  It’s fast enough if the choice of communication technology is fast enough.  Many cloud services treat IoT communication as a data storage problem, where the device populates a database and then the client consults the contents of the database to populate web dashboards.  The communication model is typically a web service over HTTP(S).  Data transmission and retrieval both essentially poll the database.

The Price of Polling

Polling introduces an inevitable trade-off between resource usage on the server and polling rate, where the polling rate must be set with a reasonable delay to avoid overloading the cloud server or the user’s network.  This polling does two things – it introduces latency, a gap in time between an event occurring on the device and the user receiving notification of it, and it uses network bandwidth in proportion to the number of data items being handled.  Remote control of the device is still possible through polling if you are willing to pay the latency and bandwidth penalty of having the device poll the cloud.  This might be fine for a device with 4 data values, but it scales exceptionally poorly for an industrial device with hundreds of data items, or for an entire plant with tens of thousands of data items.

Publish/Subscribe Efficiency

By contrast, some protocols implement a publish/subscribe mechanism where the device and user both inform the cloud server that they have an interest in a particular data set.  When the data changes, both the device and user are informed without delay.  If no data changes, no network traffic is generated.  So, if the device updates a data value, the user gets a notification.  If the user changes a data value the device gets a notification.  Consequently, you have bi-directional communication with the device without requiring a direct connection to it.
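Here is a toy, in-memory version of that exchange.  The Hub class stands in for the cloud server, and the callbacks stand in for messages pushed down the outbound connections that the device and user have already opened; the names are hypothetical.  One design choice in this sketch is that a change is not echoed back to whoever made it.

```python
from collections import defaultdict


class Hub:
    """Stand-in for the cloud server: relays changes to subscribers of a point."""

    def __init__(self):
        self._values = {}
        self._subscribers = defaultdict(list)   # point -> list of (name, callback)

    def subscribe(self, point, name, callback):
        self._subscribers[point].append((name, callback))

    def publish(self, point, value, sender):
        if point in self._values and self._values[point] == value:
            return                      # no change, so no traffic
        self._values[point] = value
        for name, callback in self._subscribers[point]:
            if name != sender:          # do not echo back to the originator
                callback(point, value)


hub = Hub()
device_inbox, user_inbox = [], []
hub.subscribe("setpoint", "device", lambda p, v: device_inbox.append((p, v)))
hub.subscribe("setpoint", "user", lambda p, v: user_inbox.append((p, v)))
hub.subscribe("temperature", "user", lambda p, v: user_inbox.append((p, v)))

hub.publish("temperature", 21.5, sender="device")   # device -> user
hub.publish("setpoint", 20, sender="user")          # user -> device
hub.publish("setpoint", 20, sender="user")          # unchanged: nothing is sent

print(device_inbox)   # [('setpoint', 20)]
print(user_inbox)     # [('temperature', 21.5)]
```

Neither side ever connects to the other directly, yet data flows in both directions and only when something changes.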

This kind of publish/subscribe protocol can support bidirectional communication with latencies as low as a few milliseconds over the background network latency.  On a reasonably fast network or Internet connection, this is faster than human reaction time.  Thus, the publish/subscribe approach has the potential to support remote control without a direct connection.
