TestAndDiagnostic

Created on June 21, 2022

This page is under development

Introduction

TAD (Test And Diagnostic) monitors the amount of free memory available in the system at run time. When free memory drops below a configurable threshold, it triggers the memory/Resource Reclamation (RR) process in TDM. The RR process can also be triggered by memory allocation failures, which result in a notification being sent to TDM to try to reclaim memory.

Self-heal is another feature implemented in the Test And Diagnostic component.

Self-heal monitors:

  • CPU usage
  • Memory Usage
  • Critical RDK-B processes

Self-heal stores the Reset Count and Reboot Count.
Self-heal takes corrective action, such as rebooting the device or restarting a process, based on predefined conditions.
Self-heal runs a connectivity test.


Feature

Selfheal – Resource Monitoring

Monitors resources periodically (e.g., every 15 minutes). If “Average Memory Used” reaches the threshold value, the necessary corrective action is executed.

The “resource_monitor.sh” script monitors memory and CPU usage.
It is located on the device at “/fss/gw/usr/ccsp/tad/resource_monitor.sh”.

Selfheal – Process Monitoring

Monitors processes periodically (e.g., every 15 minutes) based on process ID (PID). Depending on whether the PID is present, the required action is taken, such as restarting the process or rebooting the device.

The “task_health_monitor.sh” script monitors RDK-B processes and is located at “/fss/gw/usr/ccsp/tad/task_health_monitor.sh”. Any RDK-B process can be monitored by adding a PID check for it to this script.

Self-heal stores Reset Count and Reboot Count

Selfheal – Connectivity Test

Self-heal runs a connectivity test by pinging a server IP/URI, which must be configured. If no server IP/URI is configured, the ping test is not executed and no action is taken. If a server is configured and the ping test fails, the reboot action is executed.

The “self_heal_connectivity_test.sh” script runs the ping test.

Selfheal – Action

Self-heal takes the required action through the “corrective_action.sh” script, which implements the actions.

Some of the actions are:

rebootNeeded – Reboots the device
resetNeeded – Restarts the required process
storeInformation – Stores Memory and CPU usage
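
The action names listed above map naturally onto a shell `case` dispatch. The following is a minimal sketch of that pattern, not the contents of corrective_action.sh: the helpers `restart_process` and `store_usage`, the `DRY_RUN` variable, and the function names are hypothetical and exist only for illustration. With `DRY_RUN=1` (the default in this sketch) nothing is executed; the command is only printed.

```shell
#!/bin/sh
# Illustrative dispatcher only; the real corrective_action.sh may differ.
# DRY_RUN=1 (default here) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# $1 = action name, $2 = optional process name (used by resetNeeded)
take_action() {
    case "$1" in
        rebootNeeded)     run reboot ;;
        resetNeeded)      run restart_process "$2" ;;  # hypothetical helper
        storeInformation) run store_usage ;;           # hypothetical helper
        *)                echo "unknown action: $1" >&2; return 1 ;;
    esac
}
```

Keeping the dry-run switch in one place makes it possible to exercise the dispatch logic on a development host without rebooting anything.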

On Raspberry Pi, the self-heal functionality is provided by systemd.


Code Flow

Resource Monitoring – resource_monitor.sh

  • resource_monitor.sh monitors the Memory and CPU usage
  • Average memory and CPU thresholds are read from syscfg.db (defaults: avg_cpu_threshold: 100, avg_memory_threshold: 100)


  • Memory Usage Monitor
    • Gets the total, free and used memory details using the free command
    • AvgMemUsed = usedMem*100 / totalMem
    • If AvgMemUsed > memory_threshold, the device is rebooted
  • CPU Usage Monitor
    • Active CPU time is the sum of the user, system, iowait, irq, softirq and steal CPU times
    • The CPU usage difference, sampled every 30 seconds over a period of 5 minutes, is taken as Curr_CPULoad_Avg.
    • If Curr_CPULoad_Avg > cpu_threshold, corrective action will be taken



Process Monitoring – task_health_monitor.sh

  • task_health_monitor.sh monitors the status of various tasks periodically and takes corrective action
  • The default monitoring interval is 15 minutes and can be modified using resource_monitor_interval in syscfg.db
  • Monitors
    • Health of peer processor, in case of dual core processors
    • Other tasks added as part of the script
  • New tasks can be added by editing the script
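
A PID-based check of the kind described above might look like the sketch below. It is an assumption-laden illustration rather than an excerpt from task_health_monitor.sh: `pid_alive`, `check_task` and the process name `example_daemon` are hypothetical, and the sketch only reports the decision instead of restarting anything.

```shell
#!/bin/sh
# Returns 0 if at least one of the PIDs in $1 (space separated) is alive.
# kill -0 sends no signal; it only tests whether the process can be signalled,
# so it is reliable when the monitor runs as root.
pid_alive() {
    for p in $1; do
        kill -0 "$p" 2>/dev/null && return 0
    done
    return 1
}

# $1 = task name, $2 = PID list for that task (empty if none was found)
# Prints "<name>:ok" or "<name>:restart"; a real monitor would invoke its
# corrective action instead of printing.
check_task() {
    if pid_alive "$2"; then
        echo "$1:ok"
    else
        echo "$1:restart"
    fi
}

# Example use on a target (not executed here):
#   check_task example_daemon "$(pidof example_daemon)"
```

Adding a new task to such a monitor then amounts to one more `check_task` line for the process in question.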


Connectivity Test – self_heal_connectivity_test.sh

  • self_heal_connectivity_test.sh runs ping and DNS tests.
  • ConnTest_PingInterval in syscfg.db specifies the frequency of the connectivity test.
  • If nothing is specified, the default is 60 seconds.


  • runPingTest
    • Gets the IP (default_router IP) from syscfg.db
    • If no IP is specified, it pings the default gateway
    • If the ping fails, takes the corrective action, which is none by default.
  • runDNSPingTest
    • This is disabled by default and can be enabled with selfheal_dns_pingtest_enable in syscfg.db
    • Gets the urlToVerify from syscfg.db
    • If nslookup fails, takes the corrective action, which is none by default
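
The selection logic above (configured value first, default otherwise) can be sketched as below. This is not the code of self_heal_connectivity_test.sh: all function names are hypothetical, configuration values are passed in as arguments rather than read from syscfg.db, and the default gateway lookup assumes the iproute2 `ip` command is available.

```shell
#!/bin/sh
# Interval in seconds: the configured value, or 60 when nothing is configured.
ping_interval() {
    echo "${1:-60}"
}

# Ping target: the configured IP ($1) if set, otherwise the default gateway ($2).
ping_target() {
    if [ -n "$1" ]; then echo "$1"; else echo "$2"; fi
}

# Default gateway of the running system, taken from "ip route" output
# of the form "default via <gateway> dev <interface> ...".
default_gateway() {
    ip route show default 2>/dev/null | awk '/^default/ { print $3; exit }'
}

# Returns 0 if the target answers the pings.
run_ping_test() {
    ping -c 3 "$1" >/dev/null 2>&1
}

# Returns 0 if nslookup reports success for the name; intended to run only
# when the DNS test has been enabled.
run_dns_test() {
    nslookup "$1" >/dev/null 2>&1
}

# Example use on a target (not executed here):
#   target=$(ping_target "$configured_ip" "$(default_gateway)")
#   run_ping_test "$target" || echo "ping to $target failed"
```

Separating target selection from the network calls keeps the fallback rules easy to verify without a live network.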

Objects

Self-heal objects in the DML layer:

Device.SelfHeal.X_RDKCENTRAL-COM

Self-heal can be enabled or disabled through the data model parameter below. By default, it is enabled.

$ dmcli eRT getv Device.SelfHeal.X_RDKCENTRAL-COM_Enable
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
getv from/to component(eRT.com.cisco.spvtg.ccsp.tdm): Device.SelfHeal.X_RDKCENTRAL-COM_Enable
Execution succeed.
Parameter    1 name: Device.SelfHeal.X_RDKCENTRAL-COM_Enable
               type:       bool,    value: true


Verify that the self-heal feature is running:

$ ps -Af | grep -i self
 4449 root       0:00 {self_heal_conne} /bin/sh /usr/ccsp/tad/self_heal_connectivity_test.sh
18921 root       0:00 grep -i self

Resource monitoring

The data model (DM) parameter below is used to check the average CPU threshold. By default, the value is set to 100.

$ dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
getv from/to component(eRT.com.cisco.spvtg.ccsp.tdm): Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
Execution succeed.
Parameter    1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold
               type:       uint,    value: 100


The data model (DM) parameter below is used to check the average memory threshold. By default, the value is set to 100.

$ dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
CR component name is: eRT.com.cisco.spvtg.ccsp.CR
subsystem_prefix eRT.
getv from/to component(eRT.com.cisco.spvtg.ccsp.tdm): Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
Execution succeed.
Parameter    1 name: Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgMemoryThreshold
               type:       uint,    value: 100
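
When these parameters are read from a script, only the value field of the dmcli output is usually of interest. The helper below is a small sketch that assumes the output format shown above (a line containing `value: <something>`); the name `dm_value` is hypothetical.

```shell
#!/bin/sh
# Prints the text after "value:" from dmcli getv output read on stdin,
# with trailing whitespace removed. Lines without "value:" are ignored.
dm_value() {
    sed -n -e 's/[[:space:]]*$//' -e 's/^.*value:[[:space:]]*//p'
}

# Example use on a target (not executed here):
#   dmcli eRT getv Device.SelfHeal.ResourceMonitor.X_RDKCENTRAL-COM_AvgCPUThreshold | dm_value
```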



