The NIST Cybersecurity Framework is among the most followed guidelines for managing cyber risk and resiliency. The core of the framework is five functions – identify, protect, detect, respond, and recover.
The first two functions, identify and protect, receive a lot of attention; after all, knowing what you have and identifying ways to protect the most at-risk assets is foundational to good cybersecurity risk management. (This is a topic I’ve written about in more depth here.)
Also, understandably, detect and respond garner a large share of cybersecurity team attention and budget focus. There are a number of tools in the operational technology (OT) cybersecurity space, for example, that can assist with detection of anomalous network activity and possible breaches.
But detection in the industrial sector involves more than packet sniffers, because knowing someone is on your process control network is good but not sufficient. In the industrial context, having an inventory of your control system configurations and being able to detect a change is equally (if not, arguably, even more) important – because it is when changes affect processes and how molecules move that the worst can happen (e.g., explosions, safety and environmental incidents). Even short of those catastrophic consequences, there is still the risk of cyber foul play intended to disrupt operations by causing equipment failures, shutting down a unit, or tripping a safety system. Detecting and responding to configuration changes is, thus, critical to effective OT cybersecurity – but may be overlooked, especially by IT security teams unfamiliar with OT. (Both PAS Cyber Integrity and PAS Automation Integrity include the ability to baseline configuration settings and detect deviations from them.)
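To make the idea of baselining concrete, here is a minimal Python sketch of comparing a current configuration snapshot against a stored baseline. It is an illustration only and does not represent how PAS Cyber Integrity or PAS Automation Integrity work internally. The tag names, settings, and the flat dictionary layout are all hypothetical assumptions; real control system configurations are far richer.

```python
import hashlib
import json

# Placeholder used when a tag exists in one snapshot but not the other.
MISSING = "<absent>"


def fingerprint(config):
    """Return a SHA-256 digest of a configuration snapshot.

    Keys are sorted before hashing so that two snapshots with the same
    content always produce the same digest, regardless of key order.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def detect_changes(baseline, current):
    """Return {tag: (baseline_value, current_value)} for every deviation."""
    changes = {}
    for tag in sorted(set(baseline) | set(current)):
        before = baseline.get(tag, MISSING)
        after = current.get(tag, MISSING)
        if before != after:
            changes[tag] = (before, after)
    return changes


# Hypothetical tags and values, for illustration only.
baseline = {"FIC-101.SP": 42.0, "FIC-101.MODE": "AUTO", "PSH-200.TRIP": 150.0}
current = {"FIC-101.SP": 42.0, "FIC-101.MODE": "MANUAL", "PSH-200.TRIP": 175.0}

# A cheap digest comparison tells us whether anything changed at all;
# the detailed comparison then says exactly what changed.
if fingerprint(baseline) != fingerprint(current):
    changes = detect_changes(baseline, current)
    for tag, (before, after) in changes.items():
        print(f"{tag}: {before} -> {after}")
```

In this sketch the digest acts as a quick "did anything change?" check, while the per-tag comparison identifies which settings deviated, such as a loop switched to manual or a trip point that was moved.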
The function in the NIST Framework, however, that does not seem to get as much attention as it deserves is the recover function. While reducing the risk of being disrupted in the first place is, obviously, important – it is equally important to be able to recover accurately and quickly. This is where being proactive vs. simply reactive can make all the difference. Let me explain via a real-world example.
Recently, one of our customers experienced a major loss of configuration data for one of the most important units in one of their largest plants. While this incident was not specifically related to a cyber attack, the organization’s recovery illustrates both the pitfalls of being reactive and the benefits of being proactive. Here’s what happened.
What was thought to be a harmless IT change (updating a static IP address on an engineering workstation that was hosting the distributed control system – or DCS – configuration) caused the loss of all of the control strategy information, along with tag references and programs. Because no one had anticipated that an IP address change could have such a negative impact, no backup of the workstation and configuration files was taken before the change. And, to make matters worse, no good backup was available from an earlier time.
Thankfully, the unit had already been taken offline for the maintenance procedure, so this did not cause a safety incident; however, it meant that a major unit in the plant could not be brought back online, which had significant implications for revenue given the criticality of the unit. Without a good backup, the organization was facing the very likely prospect of spending months recreating the many years of engineering effort and intellectual property that had gone into setting and fine-tuning the control system strategies and configurations.
The good news in this bad situation was that the organization had already implemented Automation Integrity and so they were able to use the configuration baseline data contained in it to restore the lost data in a matter of hours.
What can be learned from this real-world occurrence? First, take the time to produce a backup before undertaking maintenance tasks, even ones that are not expected to carry risk. Second, having another source of your OT configuration data is a must-have for business resiliency – whether that is to recover from human error, as in this case, or from a cyber attack.
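The first lesson can be sketched in a few lines of Python: copy the configuration export somewhere safe before the change, and confirm the copy is readable and identical before proceeding. This is a simplified illustration under stated assumptions, not a procedure for any particular DCS. The file name, its contents, and the `.bak` naming are hypothetical, and a real workstation backup would cover far more than a single file.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large exports do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_before_change(source, backup_dir):
    """Copy `source` into `backup_dir` and verify the copy matches the original."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / (source.name + ".bak")
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Backup of {source} failed verification; do not proceed")
    return target


# Demonstration in a temporary directory with a hypothetical export file.
workdir = Path(tempfile.mkdtemp())
config_file = workdir / "dcs_config.export"
config_file.write_text("FIC-101.SP=42.0\n")

backup = backup_before_change(config_file, workdir / "backups")

# Simulate the "harmless" change wiping the configuration, then restore it.
config_file.write_text("")
shutil.copy2(backup, config_file)
restored = config_file.read_text()
```

The verification step is the point of the sketch: a backup that has never been checked is only a hope, which is why the function refuses to return until the copy matches the original.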
While this organization fell into the trap of reactive response and recovery, they were fortunate to have been proactive in implementing Automation Integrity, which got them out of what would otherwise have been a much more serious situation. The moral of the story is that when it comes to response and recovery, being proactive makes all the difference.