This guide is written for the Island's operators, so that they can quickly find answers and follow the best practices for problem resolution. The guide is organized according to the CMFs.
CMFs: OCF, OXA, OMF
To help the operator find the proper answer, the common problems are summarized below by component:
Components: Expedient, FlowVisor, NetFPGA, OXA
The user's slice is unable to start. What should I do?
#1 - Verify whether the FlowVisor process has stopped working.
1.1 Analysis: The FlowVisor process has stopped.
Sometimes the user will be unable to start the slice. This is most often because the FlowVisor process has stopped.
To verify whether FlowVisor has stopped, run the following command:
ps ax | grep -i flowvisor
If the command does not return any FlowVisor processes, FlowVisor has crashed and needs to be started again.
1.1 Solution:
For safety, instead of just starting the service, we usually restart the FlowVisor service:
/etc/init.d/flowvisor restart
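For convenience, the check and the restart can be combined into a small script. This is just a sketch based on the commands above; adjust it to your installation:
#!/bin/sh
# Restart FlowVisor only if no FlowVisor process is found.
# The [f] trick keeps the grep command itself out of the results.
if ! ps ax | grep -i "[f]lowvisor" > /dev/null; then
    echo "FlowVisor is not running; restarting it."
    /etc/init.d/flowvisor restart
fi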
1.2 Analysis: The process is stuck at 99% CPU usage
The process can also be stuck at 99% CPU usage. To verify whether this is the case, we strongly recommend using the top tool:
top
If a Java process is running at 99% CPU usage, the FlowVisor process is stuck.
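For a non-interactive check (e.g. from a script), the top CPU consumers can also be listed with ps; this is a generic GNU/Linux approach, not specific to FlowVisor:
# List the five most CPU-hungry processes; look for a java process near 99%.
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 6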
1.2 Solution:
As in the previous case, restart the FlowVisor service to resolve the issue:
/etc/init.d/flowvisor restart
1.3 Analysis: FlowVisor is not working properly
In some cases FlowVisor will appear in the process list and will not be stuck at 99% CPU usage, but it still won't work properly. To verify this kind of situation, execute the following command:
fvctl-xml getLinks
If there is no output, the FlowVisor service has stopped working.
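Since a misbehaving FlowVisor may hang instead of returning, it can help to bound the check with a time limit. timeout is part of GNU coreutils; the 10-second limit here is an arbitrary choice:
# Fail after 10 seconds if FlowVisor does not answer the call.
timeout 10 fvctl-xml getLinks || echo "FlowVisor did not respond"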
1.3 Solution:
If the analysis above returns no output, the FlowVisor service must be restarted:
/etc/init.d/flowvisor restart
#2 - The hard disk of the FlowVisor VM is full.
2.1 Analysis:
This situation happens when the FlowVisor process's logs fill the entire hard disk.
To check the space left on disk, use this command:
df -h
It is highly advisable, however, to use a monitoring tool such as ZenOSS or Zabbix, or the NOC's monitoring tool.
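To confirm that the FlowVisor logs are what is consuming the space, the log directory can be measured directly (standard tools; the path is the one used elsewhere in this guide):
# Total size of the FlowVisor log directory.
du -sh /var/log/flowvisor/
# Largest individual log files.
ls -lhS /var/log/flowvisor/ | head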
2.2 Solution:
To solve this situation, perform the following procedure:
Stop the FlowVisor service:
/etc/init.d/flowvisor stop
Then enter FlowVisor's log directory and remove the logs:
cd /var/log/flowvisor/
rm *.log
It is also advisable to configure the logrotate tool to prevent this problem from recurring. An example logrotate configuration is shown below.
Create a file in this directory:
touch /etc/logrotate.d/flowvisor
Then use this configuration as a template for the logrotate operation:
/var/log/flowvisor/flowvisor-db.log /var/log/flowvisor/flowvisor-stderr.log {
    weekly
    size 1M
    copytruncate
    rotate 10
    compress
    maxage 100
    missingok
}
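Before relying on the new configuration, it can be tested with logrotate itself: the -d flag does a dry run that only prints what would happen, and -f forces an immediate rotation:
# Dry run: show what would be rotated, without touching the logs.
logrotate -d /etc/logrotate.d/flowvisor
# Force an immediate rotation to test the configuration end to end.
logrotate -f /etc/logrotate.d/flowvisor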
Be aware that when this kind of situation happens in a federated environment, at least three (3) FlowVisors must be verified: your island's, the NOC's, and the other island's.
If your FlowVisor is not the faulty one, you must contact the NOC operator or the other Island's operator.
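The three FlowVisors can be checked in one pass over SSH; this is only a sketch, and the hostnames below are placeholders for your island's, the NOC's, and the other island's FlowVisor VMs:
# Placeholder hostnames -- replace with the actual FlowVisor VMs.
for host in fv-local fv-noc fv-remote; do
    echo "== $host =="
    ssh "$host" "fvctl-xml getLinks" || echo "$host: no response"
done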
The user starts the slice, but the experiment doesn't work. What should I do?
#1 - Verify the user's experiment
1.1 Analysis: The user is creating a loop topology.
Though this is not exactly a problem, depending on the application running on top of the controller (e.g. a Learning Switch), a loop topology may be the reason the experiment is unable to run.
1.1 Solution:
If necessary, guide the user through the correct creation of the slice (First Experiment Doc to be created).
1.2 Analysis: The controller is not using the correct port
Depending on the chosen controller, the default port may be different from 6633.
1.2 Solution:
For the list of correct ports and supported controllers, follow this link (doc of controllers to be created).
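To verify which port the controller is actually listening on, the listening sockets on the controller host can be inspected (netstat shown here; ss -tlnp is the modern equivalent):
# Look for the controller among the listening TCP sockets;
# grep for 6633 or the controller's own default port.
netstat -tlnp | grep 6633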
1.3 Analysis: The chosen VLAN is incorrect
Depending on where the experiment is running, VLAN restrictions may apply. The rules for this scenario are listed below:
1.3 Solution:
It is necessary to rebook the OpenFlow resources, choosing the correct VLAN, and then update the slice again.
If the experiment is still not working after all these analyses, other scenarios must be verified, such as problems with the NetFPGA servers, or the ToR switch and the Pronto switch being stuck.
The user's virtual machine won't start. What should I do?
#1 - Verify the OCF configuration
1.1 Analysis:
Sometimes the misbehaviour of the user's virtual machine is related to how the configuration was done in the OCF's Virtual Aggregate Manager (VTAM) and the OFELIA Xen Agent (OXA). To verify this, it is necessary to compare the configuration of both components.
On the OCF VM (10.XXX.0.100, where XXX stands for the Island ID), check the VTAM configuration:
vim /opt/ofelia/vt_manager/src/python/vt_manager/mySettings.py
And verify these fields:
XMLRPC_USER = "admin" XMLRPC_PASS = "12345678" VTAM_IP = "10.XXX.0.100" VTAM_PORT = "8445" |
As previously mentioned, it's necessary to verify OXA's configuration:
vim /opt/ofelia/oxa/bin/mySettings.py
Verify these fields:
VTAM_IP = "10.XXX.0.100" VTAM_PORT = "8445" XMLRPC_USER = "admin" XMLRPC_PASS = "12345678" |
If these configuration files do not match in the fields mentioned above, this is the probable cause of the user's virtual machine misbehaving.
1.1 Solution:
If any of the fields is misconfigured, it is necessary to make the configuration files match each other.
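A quick way to put the relevant fields of both files side by side (paths taken from this section; the pattern is simply the four field names):
# Print the four fields from each file, prefixed with the file name.
grep -E 'XMLRPC_USER|XMLRPC_PASS|VTAM_IP|VTAM_PORT' \
    /opt/ofelia/vt_manager/src/python/vt_manager/mySettings.py \
    /opt/ofelia/oxa/bin/mySettings.py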
#2 - Verify the OXA service
Sometimes the OXA service may crash.
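As with FlowVisor, a first check is whether the OXA process is still alive. The process name match below is an assumption; adjust it to how OXA appears in your process table:
# Look for the OXA agent in the process table.
ps ax | grep -i oxa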
If the procedures above do not work, it may be necessary to restart the whole virtualization server (dom0). Use this option only as a last resort. The procedure for this scenario is listed below:
The user starts the slice, but the experiment doesn't work. What should I do?
#1 - Verify NetFPGA's OpenFlow service