In this post, we will discuss the phases of closed loop automation (CLA), the technologies and tools involved in each phase, and the advantages of CLA, and conclude with some use cases.
Closed Loop Automation Overview:
Every network engineer knows the basic network elements required to form a topology, such as switches, routers, firewalls, and load balancers. To get a network to function as designed, administrators must go through several phases: Plan, Document (Design), Deploy (Installation), Day 0 Configuration, Network Service Configuration, Verification, Monitoring, and Maintenance.
During the monitoring and maintenance phases, we can sustain network health and compliance by building an intelligent system that knows the likely issues and their automatic recovery strategies. This intelligent layer on top of the network, combined with automation, is known as Closed Loop Automation.
Closed Loop Automation Phases:
To achieve Closed Loop Automation, the automation software must go through several phases:
1. Data Collection: Collect and parse the operational and configuration data from network devices using SNMP, streaming telemetry, NetFlow, Syslog, sFlow, IPFIX, NETCONF, etc. For this, you can use data collection engines such as Telegraf, Logstash, or Fluentd.
2. Persistence: To persist the data collected in phase 1, we can use files, a time series or search database (e.g., InfluxDB, Prometheus, Elasticsearch, OpenTSDB), or a distributed streaming platform such as Kafka.
3. Correlation Engine: In this phase, the automation performs descriptive analytics, using data aggregation and data mining to provide insight into the past. It can help answer questions such as “What metrics have deviated from the baseline behavior?” For instance, if an interface enters the err-disable state more than once in 3 minutes, treat it as a severity-2 trigger; if it does so more than five times in 5 minutes, treat it as a severity-1 trigger. To achieve this, we can use Kapacitor, Kafka Streams, or an engine of your own. The basic principle is the classification or clustering of issues, whether by fixed rules like these or by machine learning.
4. Assurance, Prescriptive, Optimization, or Remediation Engine: In this phase, the software carries out the corrective actions. For example, if it is a severity-2 issue, raise an alarm. If it is a severity-1 issue, shut down the interface and email the operations team. This logic can be implemented in Python or any other scripting language. Sometimes this phase is limited to visualization and basic actions such as reporting, using Kibana, Grafana, Chronograf, etc.
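The err-disable example from phases 3 and 4 can be sketched in a few lines of Python. This is only an illustration, not how Kapacitor or Kafka Streams implement it: the class name `ErrDisableClassifier`, the function `plan_actions`, and the action strings are hypothetical, timestamps are assumed to be seconds arriving in order, and the actions are returned as text rather than executed against a device.

```python
from collections import deque


class ErrDisableClassifier:
    """Sliding-window counter of err-disable events for one interface."""

    SEV1_WINDOW_S = 5 * 60  # severity-1: more than five events in 5 minutes
    SEV1_THRESHOLD = 5
    SEV2_WINDOW_S = 3 * 60  # severity-2: more than one event in 3 minutes
    SEV2_THRESHOLD = 1

    def __init__(self):
        self._events = deque()

    def record(self, timestamp):
        """Record one event (time in seconds) and return 1, 2, or None."""
        self._events.append(timestamp)
        # Forget events that fall outside the longest window.
        while timestamp - self._events[0] > self.SEV1_WINDOW_S:
            self._events.popleft()
        in_5_min = len(self._events)
        in_3_min = sum(
            1 for t in self._events if timestamp - t <= self.SEV2_WINDOW_S
        )
        if in_5_min > self.SEV1_THRESHOLD:
            return 1
        if in_3_min > self.SEV2_THRESHOLD:
            return 2
        return None


def plan_actions(severity, interface):
    """Map a severity to the corrective actions described in phase 4."""
    if severity == 1:
        return [f"shut down {interface}", f"email operations team about {interface}"]
    if severity == 2:
        return [f"raise alarm for {interface}"]
    return []


classifier = ErrDisableClassifier()
for ts in (0, 10, 20, 30, 40, 50):  # six events within one minute
    severity = classifier.record(ts)
    print(ts, severity, plan_actions(severity, "Gi1/0/1"))
```

Keeping the classification separate from the action mapping means the thresholds can be tuned without touching the remediation logic.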
This complete process is called Closed Loop Automation. In simple terms, to achieve a stable networking environment with minimal manual intervention, we must collect, store, process, and remediate. This concept is popularly known as Closed Loop Assurance, Intent-Based Networking, Self-Healing Networks, Auto-Healing Networks, or any other buzzword you find suitable.
The above phases can be combined or divided further depending on your requirements. Keep in mind that each component you add increases the latency between collecting data and acting on it. But if a single component performs everything, it might not be capable of handling heavy loads. So, design your platform wisely.
Closed Loop Automation involves several technologies such as networking, DevOps (Docker, Kubernetes, monitoring tools, CI/CD tools, etc.), scripting, and machine learning.
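One way to picture the collect, store, process, and remediate loop is as four pluggable stages, which also makes it easy to merge or split stages later. The sketch below is a toy under stated assumptions: `run_closed_loop_once`, the stage functions, the sample data, and the queue depth limit of 500 are all hypothetical stand-ins for real components such as a collector, a database, and a correlation engine.

```python
def run_closed_loop_once(collect, store, process, remediate):
    """Run one pass of the loop; each argument is a callable for one phase."""
    samples = collect()                       # phase 1: data collection
    store(samples)                            # phase 2: persistence
    findings = process(samples)               # phase 3: correlation
    return [remediate(f) for f in findings]   # phase 4: remediation


# Hypothetical in-memory stand-ins for the real components.
history = []


def collect():
    return [
        {"interface": "eth0", "queue_depth": 950},
        {"interface": "eth1", "queue_depth": 12},
    ]


def store(samples):
    history.extend(samples)


def process(samples):
    # Flag interfaces whose queue depth exceeds an assumed limit of 500.
    return [s["interface"] for s in samples if s["queue_depth"] > 500]


def remediate(interface):
    # A real engine would push a configuration change; here we only describe it.
    return f"apply QoS policy on {interface}"


actions = run_closed_loop_once(collect, store, process, remediate)
print(actions)
```

Because each stage is just a callable, two stages can be fused into one function to cut latency, or one stage can be moved behind a queue when load grows.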
Closed Loop Automation Advantages:
1. Provides speed and agility with little manual intervention.
2. Reduces risk by simplifying troubleshooting and avoiding manual errors.
3. Reduces downtime and keeps the network compliant through rich automation.
4. Increases business value with a focus on optimization instead of fault identification/isolation.
Closed Loop Automation Pitfalls:
1. Requires cross-functional resources to build the system.
2. Takes time to stabilize and ensure a good platform for the end users.
We need to realize that the entire system can’t be built in one day. We should start small, show the business value, and then expand opportunistically.
Closed Loop Automation is an evolving area, and I am excited about improvements in machine learning that will unlock new use cases.
Closed Loop Automation Use Cases:
1. Bandwidth Monitoring, Traffic Engineering, or Health Monitoring (e.g., if an interface's queue depth, latency, or congestion is too high, change the QoS policies, enable load sharing, or migrate traffic off the high-bandwidth link).
2. Link Issues (e.g., if a link flaps continuously, make sure redundant or failover links are in place, and alert on SFP or line card failures).
3. Security Issues or Threat Detection (e.g., detect traffic that is disrupting the network and change the policies to mitigate it, or check for expiring certificates).
4. Predictive Analysis: Observe the network behavior for some time and decide on corrective actions (e.g., capacity management: ordering new line cards to extend the network).
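The predictive analysis use case can be illustrated with a simple trend fit. This is a minimal sketch, assuming one utilization sample per day and a straight-line trend; the function name `days_until_threshold`, the 80% threshold, and the sample values are hypothetical, and a production system would likely use a more robust forecasting model.

```python
def days_until_threshold(daily_utilization, threshold=80.0):
    """Estimate days from the last sample until utilization reaches threshold.

    Fits a least-squares line to one sample per day (values in percent).
    Returns None when the trend is flat or falling, and 0.0 when the fitted
    line has already reached the threshold.
    """
    n = len(daily_utilization)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_utilization) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(daily_utilization)
    )
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - (n - 1))


# Utilization growing by two points per day, starting from 50%.
print(days_until_threshold([50, 52, 54, 56, 58]))  # 11.0
```

If the estimate is shorter than the lead time for new hardware, the loop could open a procurement ticket instead of changing device configuration.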
However, we can’t do everything through automation. We need human intervention sometimes.
Everything should be made as simple as possible but not simpler.
Albert Einstein
Note: This post first appeared at https://www.linkedin.com/pulse/what-closed-loop-automation-how-achieve-networking-rajesh-reddy-n/