The cloud is good, except when it’s not.

Part 14 of Data Communication for Industrial IoT

Cloud computing can be quite useful in industrial systems to gather data and do supervisory control in some application spaces.  “Big data” services help managers and engineers to locate inefficiencies, coordinate predictive maintenance, and boost productivity.  The cloud model of software as a service (SaaS) offers a convenient way to add new functionality to existing systems, and it shifts costs from capital to operating expenses.

Despite the advantages of cloud systems, system integrators and key decision-makers in industrial facilities are often reluctant to adopt them.  Some of the reasons for this reluctance include:

  • License enforcement — “Will this cloud-based system be used to ensure software license compliance, in the same way my kids need an internet connection to play a single-player computer game?”
  • Vendor lock-in — “If all the processing power of the system is in the cloud service, how can I switch services?”
  • No edge processing — “There are too many cloud services that are basically just Internet-accessible databases. That’s not flexible enough for me.”
  • Security — “Once my data leaves my plant, is it safe from prying eyes?  And if I connect my plant to the cloud, will my plant be open to attack?”
  • Loss of connectivity — “If my Internet connection goes down, will I lose my ability to control my plant?”

So should we avoid cloud services altogether?  No.  They provide capability and efficiency you can’t get any other way.  In addition to data-gathering, cloud services can be used to support remote connectivity over the Internet.

Cloud as Intermediary

If we link an operation center in one city to a production system in another, there must be a network.  If we make a direct connection, then one or the other must accept an inbound connection from the Internet.  Using a cloud system as an intermediary means that neither the operation center nor the production system needs to open its firewall, thereby improving security by moving the point of attack outside either system.
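
To make the connection pattern concrete, here is a minimal sketch in Python, assuming a hypothetical intermediary at relay.example.com (host and port are placeholders): both sites dial out over TLS, and neither one ever listens for an inbound connection.

```python
import socket
import ssl

# Both sites make only outbound, TLS-wrapped connections to the intermediary.
# Neither calls listen()/accept(), so no inbound firewall port is opened.
# "relay.example.com" and the port are placeholders for the actual cloud service.
def dial_intermediary(host="relay.example.com", port=8883):
    raw = socket.create_connection((host, port))      # outbound TCP only
    ctx = ssl.create_default_context()                # verify the server's certificate
    return ctx.wrap_socket(raw, server_hostname=host)

plant_link = dial_intermediary()   # production system pushes its data up
ops_link = dial_intermediary()     # operation center pulls the data down
```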

Limited Data Sets

Should IIoT devices send all of their data to the cloud?  No, it’s usually not necessary.  Only the data necessary for remote monitoring and control needs to be accessible to the cloud.  Device information is not monolithic – you should be able to pick and choose what the cloud has access to.
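
As a rough illustration, a device or gateway might keep an explicit whitelist of the points the cloud is allowed to see.  The point names below are invented for the sketch.

```python
# Only whitelisted points ever leave the plant; everything else stays local.
CLOUD_POINTS = {"line1.flow", "line1.temperature", "line1.alarm_state"}

def filter_for_cloud(all_points: dict) -> dict:
    """Return just the subset of device data selected for remote access."""
    return {name: value for name, value in all_points.items() if name in CLOUD_POINTS}

device_data = {
    "line1.flow": 12.7,
    "line1.temperature": 88.1,
    "line1.valve_tuning": 0.43,   # internal detail, never sent to the cloud
    "line1.alarm_state": "OK",
}
cloud_update = filter_for_cloud(device_data)
```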

Support for Local Capability

But what happens when the cloud is not available?  What happens if a cloud provider goes out of business (think Google/Nest Revolv)?  The system should degrade in such a way that essential functions remain available.  The goal should be to support fundamental local capability, enhanced with cloud services.  We should still be able to use our devices when the Internet is not available.

Like most things in life, the cloud has its strong points and its weaknesses.  The most successful implementations will take full advantage of the strong points, and design around the weaknesses.  For industrial applications, that means keeping remote devices and in-plant systems behind closed firewalls and protecting them from any network slowdowns or outages.  This can be accomplished through the edge and fog processing mentioned previously, and/or by implementing a hybrid cloud, which we will discuss next.

Continue reading, or go back to Table of Contents

Cloud, Edge, and Fog Processing

Part 13 of Data Communication for Industrial IoT

With the huge number of devices comprising the IoT, we can imagine a looming problem with processing power.  If a user wants to perform some logic specific to a device, he usually has two choices: perform that logic in a cloud application, or perform that processing directly on the device (edge processing).  There are clearly advantages to both strategies.

Cloud

Moving complex logic to the cloud means that the device can be made with lower power requirements and less expensive parts.  The device just needs to be able to read raw values from transducers or A/D converters and transmit them using a low-level protocol like Modbus.  The cloud application can decode the raw data, apply scaling, deadbands, alerts and closed-loop control.  Since you should have no direct access to the device (see Access the Data, not the Device), you may not be able to change the logic at the device.  Placing processing in the cloud lets you modify your system logic without endangering device security.  In addition, cloud applications let you write logic that uses information from multiple devices, since all of the device information is available at the cloud server.
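
For instance, the cloud application might turn raw A/D counts into engineering units and check an alarm limit, roughly as in the sketch below.  The scaling constants and limits are invented for illustration.

```python
# Cloud-side sketch: scale a raw 12-bit A/D count into engineering units
# and check it against an alarm limit. All constants are illustrative.
RAW_FULL_SCALE = 4095      # 12-bit converter
ENG_FULL_SCALE = 150.0     # e.g. degrees C at full scale
ALARM_LIMIT = 120.0

def process_raw_count(raw_count: int) -> float:
    value = raw_count / RAW_FULL_SCALE * ENG_FULL_SCALE   # scaling
    if value > ALARM_LIMIT:
        print(f"ALERT: temperature {value:.1f} exceeds {ALARM_LIMIT}")
    return value

reading = process_raw_count(3400)   # about 124.5, triggers the alert
```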

On the other hand, placing logic in the cloud means that the device is dumb.  If the network connection is lost, the device can do nothing but collect data.  If the logic in the cloud is important to the operation of the device or the system in which it is running, then the system is effectively crippled until the connection is re-established (see The Cloud is Good, Except when it Isn’t—coming next).  In many instances, particularly in industrial systems, this is simply not acceptable.

In addition, placing logic in the cloud presents a scaling issue.  If the cloud server must handle a few data change events for a handful of data items, no problem.  If it must handle millions of data changes for hundreds of thousands of data items, which is more in line with industrial processes, then the simple availability of CPU, memory and network bandwidth becomes an issue.  The devices already have a certain amount of processing capability on them, so why not use it to reduce the load at the server?

Edge

It is easy to see how the device can perform computations like engineering unit conversion, simple alarm generation, and even some closed-loop control before ever transmitting its information to the cloud.  With a little more power on the device it could, for example, read an image from a camera, analyze it for a production error, and then send the cloud only an indication of whether the image contains an error.  Instead of transmitting 100 KB of image, it can send a few bytes of result.
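
A sketch of that pattern might look like the following, where analyze_frame() and send_to_cloud() are stand-ins for the real image analysis and the real cloud transport.

```python
# Edge sketch: analyze the frame on the device and transmit only the verdict.
def analyze_frame(frame: bytes) -> bool:
    # placeholder for real on-device defect detection
    return False

def inspect_and_report(frame: bytes, send_to_cloud) -> None:
    defect_found = analyze_frame(frame)          # runs locally on ~100 KB of pixels
    send_to_cloud({"camera": "line1-cam3", "defect": defect_found})  # a few bytes
```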

In addition to reducing network bandwidth, this approach also improves response time for closed-loop control.  The latency of data arriving at the device from the physical system can be a few milliseconds.  The device can process the data and determine a response within microseconds, resulting in an event-to-response time of just a few milliseconds.  Compare this to the latencies typical of Internet round-trip times, and it’s clear that edge processing for device control can save two orders of magnitude in response time.  The difference between 2 milliseconds and 200 may not always matter, but in industrial applications it is commonly a consideration.

Fog

There is, commonly, a third option between the edge and cloud—fog.  In industrial applications it is rare, and usually undesirable, to have plant-floor equipment connect directly to the cloud.  It is far more likely that devices and equipment connect first to a local aggregator or gateway.  Even in home automation applications the devices usually talk to a gateway rather than directly to a cloud server.  Processing at the gateway or concentrator is called fog processing.

The purpose of the gateway is to collect information from multiple devices using their native protocols and to convert that information to a form suitable for transmission to a cloud service.  In industrial applications these protocols are typically standards like Modbus, Profibus, EtherNet/IP, or OPC.  In home automation you might see something like ECHONET Lite.  The communication from the gateway to the cloud is determined by the cloud service, and may bear no resemblance to the device protocol.  In essence, the gateway acts as a collection point between the devices and the cloud service.
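
In code, a gateway loop might look something like the sketch below.  Here read_modbus_registers() and publish_to_cloud() are stand-ins for a real Modbus library and for whatever API the cloud service provides; the address and scaling are invented.

```python
import json

def read_modbus_registers(device_addr: str, start: int, count: int) -> list:
    # placeholder for a real Modbus read of holding registers
    return [0] * count

def publish_to_cloud(topic: str, payload: str) -> None:
    # placeholder for the cloud service's own protocol (MQTT, REST, ...)
    print(topic, payload)

def poll_and_forward() -> None:
    raw = read_modbus_registers("192.168.1.20", start=0, count=2)
    data = {"flow": raw[0] / 10.0, "temperature": raw[1] / 10.0}  # simple fog-side scaling
    publish_to_cloud("plant/line1", json.dumps(data))
```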

There is no reason why the gateway cannot perform fog processing on the device data before sending it to the cloud.  In fact, since the gateway is likely to have far more computing power than the edge devices, it often makes sense to perform complex tasks there instead of at the device itself.  Latencies are low, since the device and gateway share a local connection, whether it is Ethernet, WiFi, Zigbee or some other short-range network.  Response times for fog processing are higher than for edge processing, but the difference is now a few milliseconds, not tens or hundreds of milliseconds as it would be with cloud processing.

For larger systems, there is even a fourth option—a hybrid cloud—which I will touch on later.

Continue reading, or go back to Table of Contents

What is Edge Processing anyway?

Part 12 of Data Communication for Industrial IoT

Edge processing refers to the execution of aggregation, data manipulation, bandwidth reduction and other logic directly on an IoT sensor or device.  The idea is to put basic computation as close as possible to the physical system, making the IoT device as “smart” as possible.

Is this a way to take advantage of all of the spare computing power in the IoT device?  Partially.  The more work the device can do to prepare the data for the cloud, the less work the cloud needs to do.  The device can convert its information into the natural format for the cloud server, and can implement the proper communication protocols.  There is more, though.

Data Filter

Edge processing means not having to send everything to the cloud.  An IoT device can deal with some activities itself.  It can’t rely on a cloud server to implement a control algorithm that would need to survive an Internet connection failure.  Consequently, it should not need to send to the cloud all of the raw data feeding that algorithm.

Let’s take a slightly contrived example.  Do you need to be able to see the current draw of the compressor in your smart refrigerator on your cell phone?  Probably not.  You might want to know whether the compressor is running constantly – that would likely indicate that you left the door ajar.  But really, you don’t even need to know that.  Your refrigerator should recognize that the compressor is running constantly, and it should decide on its own that the door is ajar.  You only need to know that final piece of information, the door is ajar, which is two steps removed from the raw input that produces it.
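
A minimal sketch of that two-step inference, running on the appliance itself, might look like this.  The current threshold and run-time limit are invented for the example.

```python
import time

RUNNING_AMPS = 0.8             # above this, treat the compressor as running
CONTINUOUS_SECONDS = 30 * 60   # running this long without a pause suggests an open door

_running_since = None

def door_ajar(compressor_amps: float, now: float = None) -> bool:
    """Raw current draw -> 'compressor running constantly' -> 'door is ajar'."""
    global _running_since
    now = time.time() if now is None else now
    if compressor_amps < RUNNING_AMPS:
        _running_since = None          # compressor cycled off; all is well
        return False
    if _running_since is None:
        _running_since = now           # compressor just started running
    return (now - _running_since) >= CONTINUOUS_SECONDS
```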

Privacy

This has privacy and information security implications.  If you don’t send the information to the Internet, you don’t expose it.  The more processing you can do on the device, the less you need to transmit on the Internet.  That may not be a big distinction for a refrigerator, but it matters a lot when the device is a cell tower, a municipal water pumping station or an industrial process.

Bandwidth

Edge processing also has network bandwidth implications.  If the device can perform some of the heavy lifting before it transmits its information, it has the opportunity to reduce the amount of data it produces.  That may be something simple, like applying a deadband to a value coming from an A/D converter, or something complex like performing motion detection on an image.  In the case of the deadband, the device reduces bandwidth simply by not transmitting every little jitter from the A/D converter.  In the case of the motion detection, the device can avoid sending the raw images to the cloud and instead just send an indication of whether motion was detected.  Instead of requiring a broadband connection, the device could use a cellular connection and never get close to its monthly data quota.
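
A deadband filter of the kind described is only a few lines.  This sketch keeps the last transmitted value and suppresses anything that has not moved by more than the band; the values are invented.

```python
# Minimal deadband sketch: transmit only when the value moves by more than
# the band, so A/D jitter never reaches the network.
class Deadband:
    def __init__(self, band: float):
        self.band = band
        self.last_sent = None

    def should_send(self, value: float) -> bool:
        if self.last_sent is None or abs(value - self.last_sent) > self.band:
            self.last_sent = value
            return True
        return False

db = Deadband(0.25)
readings = [10.01, 10.02, 10.03, 10.40, 10.41, 9.90]
to_transmit = [r for r in readings if db.should_send(r)]   # only 3 of the 6 readings go out
```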

Data Protocol

There is just one thing to watch for.  In our example of the motion detection, the device probably wants to send one image frame to the cloud when it detects motion.  That cannot be represented as a simple number.  Generally, the protocol being used to talk to the cloud server needs to be rich enough to accept the processed data the device wants to produce.  That rules out most industrial protocols like Modbus, but fits most REST-based protocols as well as higher-level protocols like OPC UA and MQTT.
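
As a rough example, a motion event carrying its single triggering frame could be packaged as a structured payload for a REST or MQTT based service; a register-oriented protocol like Modbus has no natural way to carry this.  The function below is a sketch only.

```python
import base64
import json

def motion_event_payload(camera_id: str, frame_jpeg: bytes) -> str:
    """Build a self-describing payload: the detection result plus one frame."""
    return json.dumps({
        "camera": camera_id,
        "motion": True,
        "frame_jpeg_b64": base64.b64encode(frame_jpeg).decode("ascii"),
    })
```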

Continue reading, or go back to Table of Contents

Which Quality of Service (QoS) is Right for IIoT?

Part 6 of Data Communication for Industrial IoT

Quality of Service (QoS) is a general term to indicate the delivery contract from a sender to a receiver.  In some applications QoS covers delivery time, reliability, latency or throughput.  In IIoT, QoS generally refers to the reliability of delivery.

Using MQTT as an example, there are three common Quality of Service levels for IIoT (a short publishing sketch follows the list):

  • Level 0 – At most once.  Every message will be delivered on a best-effort basis, similar to UDP.  If the message is lost in transit for whatever reason, it is abandoned―the receiver never receives it, and the sender does not know that it was lost.
  • Level 1 – At least once.  Every message will be delivered to a receiver, though sometimes the same message will be delivered two or more times.  The receiver may be able to distinguish the duplicates, but perhaps not.  The sender is not aware that the receiver received multiple copies of the message.
  • Level 2 – Exactly once.  Every message will be delivered exactly once to the receiver, and the sender will be aware that it was received.
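
From the sender’s side, the three levels look like this, using the paho-mqtt client as an illustration.  The broker address and topics are placeholders, and the client constructor details differ between paho 1.x and 2.x.

```python
import paho.mqtt.client as mqtt

client = mqtt.Client()                        # paho 2.x also expects a callback API version argument here
client.connect("broker.example.com", 1883)    # placeholder broker address
client.loop_start()

client.publish("plant/line1/frame_count", "1042", qos=0)       # at most once
client.publish("plant/line1/temperature", "72.4", qos=1)       # at least once
client.publish("plant/line1/batch_complete", "B-7731", qos=2)  # exactly once
```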

These QoS levels actually miss something important that comes up a lot in industrial systems, but let’s look at these three quickly.

First, QoS level 0 is simply unacceptable.  It is fine to lose a frame of a video once in a while, but not fine to lose a control signal that safely shuts down a stamping machine.  If the sender is transmitting data more quickly than the receiver can handle it, there will come a point where in-flight messages will fill the available queue positions, and new messages will be lost.  Essentially, QoS 0 will favor old messages over new ones.  In IIoT, this is a fatal flaw.  There’s no reason to discuss QoS 0 further.

QoS level 1 seems pretty reasonable at first glance.  Message duplication is not a problem in most cases, and where there is an issue the duplicates can be identified by the receiver and eliminated, assuming the client maintains enough history to be able to identify them.

However, problems arise when the sender is transmitting data more quickly than the receiver can process it.  Since there is a delivery guarantee at QoS 1, the sender must be able to queue an unlimited number of packets waiting for an opportunity to deliver them.  Longer queues mean longer latencies.  For example, if I turn a light on and off three times, and the delivery latency is 5 seconds simply due to the queue volume, then those six messages take 30 seconds to arrive, and only then does the receiver see that the light has settled into its final state.  In the meantime the client will be acting on false information.  In the case of a light, this may not matter much (unless it is a visual alarm), but in industrial systems timeliness matters.  The problem becomes even more severe if the client is aggregating data from multiple sources.  If some sources are delayed by seconds or minutes relative to others, then the client will be performing logic on data that are not only inconsistent with reality but also with each other.

Ultimately, QoS 1 cannot be used where any client could produce data faster than the slowest leg of the communication path can handle.  Beyond a certain data rate, the system will effectively “fall off a cliff” and become unusable.  I’ve personally seen this exact thing happen in a municipal waste treatment facility.  It wasn’t pretty.  The solution was to completely replace the communication mechanism.

QoS level 2 is similar to QoS 1, but more severe.  QoS 2 is designed for transactional systems, where every message matters, and duplication is equivalent to failure.  For example, a system that manages invoices and payments would not want to record a payment twice or emit multiple invoices for a single sale.  In that case, latency matters far less than guaranteed unique delivery.

Since QoS level 2 requires more communication to provide its guarantee, it requires more time to deliver each message.  It will exhibit the same problems under load as QoS level 1, but at a lower data rate.  That is, the maximum sustained data rate for QoS 2 will be lower than for QoS 1.  The “cliff” just happens sooner.

QoS Levels 1 and 2 Don’t Propagate

Both QoS level 1 and level 2 suffer from another big flaw – they don’t propagate.  Consider a trivial system where two clients, A and B, are connected to a single broker.  The goal is to ensure that B receives every message that A transmits, meaning that QoS 1 or 2 should apply between A and B.  Looking at QoS 1, A would send a message and wait for a delivery confirmation.  The broker would need to transmit the message to B before sending the confirmation to A.  That would imply that the broker knows that A needs to wait for B to respond.  Two problems arise: first, A cannot know that B is even connected to the broker.  That is a fundamental property of a one-to-many broker like MQTT.  Second, the broker cannot know that the intention of the system is to provide reliable communication between A and B.  Even if the broker were somehow programmed to wait like that, how would it deal with a third client, C, also listening for that message?  Would it wait for delivery on all clients?  What would it do about clients that are temporarily disconnected?  There are no good answers.  If the intention of the system is to offer QoS 1 or 2 among clients, then that QoS promise cannot be kept.

Some brokers have a server-to-server, or daisy-chain, mechanism that allows brokers to transfer messages to each other.  This allows clients on different brokers to intercommunicate.  In this configuration the QoS promise cannot be maintained beyond the connection between the original sender and the first broker in the chain.

Guaranteed Consistency

None of these QoS levels is really right for IIoT.  We need something else, and that is guaranteed consistency.  In a typical industrial system there are analog data points that move continuously, like flows, temperatures and levels.  A client application would like to see as much detail as it can, but most critical is the current value of these points.  If it misses a value that is already superseded by a new measurement, that is not generally a problem.  However, the client cannot accept missing the most recent value for a point.  For example, if I flick a light on and off 3 times, the client does not need to know how many times I did it, but it absolutely must know that the switch ended in the off position.  The communication path needs to guarantee that the final “off” message gets through, even if some intermediate states are lost.  This is the critical insight in IIoT.  The client is mainly interested in the current state of the system, not in every transient state that led up to it.

Guaranteed consistency for QoS is actually slightly more complex than that.  There are really three critical aspects that are too often ignored (a short code sketch follows the list):

  1. Message queues must be managed for each data point and client. When communication is slow, old messages must be dropped from the queue in favor of new messages to avoid ever-lengthening latencies.  This queuing must occur on a per-point, per-client basis.  Only messages that are superseded for a specific point destined for a specific client can be dropped.  If we drop messages blindly then we risk dropping the last message in a sequence, as in the final switch status above.
  2. Event order must be preserved.  When a new value for a point enters the queue, it goes to the back of the queue even if it supersedes a message near the front of the queue.  If we don’t do this, the client could see the light turn on before the switch is thrown.  Ultimately the client gets a consistent view of the data, but for a short time it may have been inconsistent.
  3. The client must be notified when a value is no longer current.  For the client to trust its data, it must know when data consistency is no longer being maintained.  If a data source is disconnected for any reason, its data will no longer be updated in the client.  The physical world will move on, and the client will not be informed.  Although the data delivery mechanism cannot stop hardware from breaking, it can guarantee that the client knows that something is broken.  The client must be informed, on a per-point basis, whether the point is currently active and current or inaccessible and thus invalid.  In the industrial world this is commonly done using data quality, a per-point indication of the trustworthiness of each data value.
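
To make the first two points concrete, here is a minimal sketch of such a queue: each point has at most one pending update per client, a newer value drops the superseded entry and re-enters at the back, and a per-point quality flag carries the “no longer current” signal.  It is an illustration of the idea, not any particular product’s implementation.

```python
from collections import OrderedDict

class ConsistencyQueue:
    """Per-client queue: one pending update per point, bounded size, order preserved."""

    def __init__(self):
        self._pending = OrderedDict()            # point name -> (value, quality)

    def push(self, point: str, value, quality: str = "GOOD") -> None:
        self._pending.pop(point, None)           # drop the superseded message, if any
        self._pending[point] = (value, quality)  # newest value goes to the back of the queue

    def mark_stale(self, point: str) -> None:
        # data source lost: tell the client this value can no longer be trusted
        last_value = self._pending.get(point, (None, None))[0]
        self.push(point, last_value, quality="NOT_CONNECTED")

    def pop_next(self):
        """Deliver the oldest pending update, or None if nothing is waiting."""
        if not self._pending:
            return None
        point, (value, quality) = self._pending.popitem(last=False)
        return point, value, quality
```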

For those instances where it is critical to see every change in a process (that is, where QoS 1 or 2 is required), that critical information should be handled as close as possible to the data source, whether it’s a PLC or an embedded device.  That is, time-critical and event-critical information should be processed at its source, not transmitted via the network to a remote system for processing where that transmission could introduce latency or drop intermediate values. We will discuss this more when we talk about edge processing.

For the IIoT, the beauty of guaranteed consistency for QoS is that it can respond to changes in network conditions without slowing down, backing up or invalidating the client’s view of the system state.  It has a bounded queue size and is thus suitable for resilient embedded systems.  This quality of service can propagate through any number of intermediate brokers and still maintain its guarantee, as well as notify the client when any link in the chain is broken.

So there’s the answer.  For IIoT, you definitely don’t want QoS 0, and probably cannot accept the limitations and failure modes of QoS 1 or 2.  You want something altogether different—guaranteed consistency.

Continue reading, or go back to Table of Contents

Skkynet at CSIA 2017

Several of us at Skkynet had the pleasure of attending the Control Systems Integrators Association annual conference (CSIA 2017) last week, in Fort Lauderdale, Florida.  Everyone appreciated the beach-side venue and great food, and the balmy weather was a welcome change from Ontario’s cold, rainy spring.  The theme of the conference this year was “From Best Practices to Transformative Business Models,” which set the tone and direction of many of the presentations and resulting conversations.

The idea of transformative business models was presented by Mike Harvath, CEO of Revenue Rocket Consulting Group, who offered a vision of the way digital technologies and the IoT are changing how business will be done by system integrators over the next few years.  One of the main differences he and others foresee is a shift from projects and products to services.  Citing recent trends, such as companies providing lighting as a service, Harvath foresees system integrators designing projects and providing products on a service-based model.

Many of the integrators we talked to at CSIA 2017 understood the Industrial IoT in terms of cloud-based data storage and analytics.  Offering their customers this kind of cloud service would fit the transformative business model, they felt, but a number of questions were raised about how to implement the vision.  In a special “Unconference” on transformative business models, we had a chance to brainstorm and bounce ideas off one another in a peer-to-peer environment.

Top Concerns

Among the top concerns were how to start moving towards a service-model business in general, and how to provide secure IoT services in particular.  Most of the customers for these system integrators are large manufacturing or infrastructure companies, like energy or wastewater facilities, and tend to be conservative in adopting new business models.  Likewise, being engineers and responsible for multi-million dollar budgets and mission-critical systems, the system integrators themselves are being cautious.

I spoke with a number of them about business transformation and the IoT, and most indicated that they are open to the idea, but that seeing is believing.   They and their customers want to see examples of secure IT to OT connectivity, cloud-based data collection, and good return on investment.  We had some enlightening conversations about Skkynet’s secure-by-design approach to the IIoT, and showed them on some demonstration hardware how to monitor and control a system from a web page or smart phone.  The revenue-sharing opportunities of the SkkyHub service struck a welcome chord with those who were getting serious about shifting towards a more service-oriented approach to their business.

Overall, CSIA 2017 was a good experience—a chance to meet those in a position to use or recommend the DataHub and SkkyHub, and find out whether their customers can benefit from this kind of technology.  It turns out that many of them can, and they are starting to realize it.

AWS Outage Calls Attention to Hybrid Cloud

At the end of February Amazon Web Services (AWS) slowed to a crawl for about four hours, causing a major loss of service for hundreds of thousands of websites in North America.  Sites with videos, images, and data files stored on the AWS cloud server suddenly lost much or all of their content, and/or shut down altogether.

After the initial weeping, moaning, and outrage died down, a lively discussion ensued among IT technicians, managers, and concerned citizens about how to deal with this kind of incident in the future.  The comment section on a story at The Register gives a sample of the kinds of ideas put forward, and there is a clear consensus on a number of them.  Most experts agree that the occasional service outage is one of the inherent risks of using the Internet and cloud services, and that if you need high reliability for your data, you’d better have some kind of redundant or backup solution.

There are normal, accepted ways of building redundancy into a data communications system, including IoT and cloud applications.  One approach mentioned frequently is “hybrid cloud”, a public and a private cloud running simultaneously.  A public cloud is a service offered to anyone, typically by a company for paying customers, like AWS.  A private cloud is a service operated and maintained by an individual or company for its own internal use.  To achieve redundancy in this past outage, a private cloud would have been up and running with a copy of all the company’s data and software, the same as on AWS, but not serving the live site.  When AWS stopped serving data, the system would have automatically switched to the private cloud, and someone using the website would not even have noticed.
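
In its simplest form, the client-side failover amounts to trying the public endpoint first and falling back to the private replica, as in this sketch.  Both URLs are placeholders, and real deployments usually automate the switch at the DNS or load-balancer level rather than in application code.

```python
import urllib.error
import urllib.request

ENDPOINTS = [
    "https://public-cloud.example.com/data",    # e.g. the publicly hosted copy
    "https://private-cloud.internal/data",      # the in-house replica
]

def fetch_data(timeout: float = 2.0) -> bytes:
    last_error = None
    for url in ENDPOINTS:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, OSError) as err:
            last_error = err                    # this endpoint is down; try the next one
    raise RuntimeError(f"all endpoints unavailable: {last_error}")
```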

This is how it works in theory, but building and maintaining a hybrid cloud system that can perform this kind of redundant operation is no small task.  Depending on the level of data and functional replication, in addition to the speed of error detection and switch-over capability, the hybrid site could cost as much as, or even more than, the cloud site.  Companies considering such an option would need to do a cost/benefit analysis based on their specific circumstances.

For Industrial IoT applications a hybrid cloud approach to redundancy may be useful.  Although low-level process control systems should typically not be dependent on the Internet or cloud services, companies that use the IIoT for process monitoring, data collection, or high-level control applications may find it worthwhile to maintain a hybrid cloud.

Skkynet’s SkkyHub service lends itself particularly well to hybrid cloud solutions.  It is possible, and not very difficult, to run a replica system on an in-house server, using the DataHub. Although the DataHub is different from SkkyHub in some respects, for the primary task of data connectivity the two function in an equivalent way.  Readers interested in trying this out are encouraged to contact Cogent for technical tips to ensure a secure and robust implementation.