Troubleshooting InfiniBand Connections
The following is a brief troubleshooting guide for the InfiniBand networks commonly found in HPC Linux clusters. Running these commands requires the OFED package, version 1.5.2 or later, installed on your systems. Additionally, the “pdsh” (parallel shell) command used here is part of the HP CMU cluster management software (version 4.2.1 in our example) installed on the head node. If you don’t have CMU installed, you will find a simple scripted alternative below for running commands on multiple cluster nodes. In our cluster, the compute nodes are named “node1” through “node32”.
Identify the hardware module used for the IB interface:
ls /sys/class/infiniband
Sample output:
mlx4_0
Check the state of the IB port:
cat /sys/class/infiniband/mlx4_0/ports/1/state
Sample output:
4: ACTIVE
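If the host has more than one adapter or port, the same check can be looped over everything under /sys/class/infiniband. The following is a minimal sketch, not part of the original procedure: the function name check_ib_ports and the IB_SYSFS override (useful for trying it against a copy of the directory tree) are hypothetical, and it assumes a healthy port reads exactly "4: ACTIVE" as in the sample output above.

```shell
#!/bin/bash
# Print every IB port whose state is not "4: ACTIVE".
# IB_SYSFS is a hypothetical override; by default the real sysfs path is used.
check_ib_ports() {
    local root="${IB_SYSFS:-/sys/class/infiniband}"
    local state_file dev port state bad=0
    for state_file in "$root"/*/ports/*/state; do
        [ -r "$state_file" ] || continue   # glob did not match, or file unreadable
        state=$(cat "$state_file")
        # Path layout: <root>/<device>/ports/<port>/state
        port=$(basename "$(dirname "$state_file")")
        dev=$(basename "$(dirname "$(dirname "$(dirname "$state_file")")")")
        case "$state" in
            "4: ACTIVE") ;;                # healthy port, nothing to report
            *) echo "$dev port $port: $state"; bad=1 ;;
        esac
    done
    return "$bad"                          # 0 if all ports are ACTIVE, 1 otherwise
}
```

Calling check_ib_ports with no arguments prints nothing and returns 0 when every port is active, so it can be used directly in an if statement.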
Check the status of the subnet manager and start it if necessary:
if [ "$(/etc/init.d/opensmd status | grep -c "not running")" -gt 0 ]
then
    /etc/init.d/opensmd start
fi
Sample output:
Starting opensm: done
Check the state of the IB port on the compute nodes:
pdsh -w node[1-32] cat /sys/class/infiniband/mlx4_0/ports/1/state
Sample output:
node1: 4: ACTIVE
node5: 4: ACTIVE
node2: 4: ACTIVE
node6: 4: ACTIVE
node7: 4: ACTIVE
...
Here’s a way of running the command above without pdsh:
i=1
while [ $i -le 32 ]
do
    echo -n "node$i: "
    ssh node$i "cat /sys/class/infiniband/mlx4_0/ports/1/state"
    (( i = i + 1 ))
done
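With 32 nodes answering in arbitrary order, a missing or inactive node is easy to overlook. The filter below is a rough sketch for picking those out. It assumes the "nodeN: state" line format shown in the sample output and the node1 through node32 naming of our cluster; the function name report_bad_nodes is made up for this example.

```shell
#!/bin/bash
# Read "nodeN: <state>" lines on stdin and print every expected node that
# either did not answer or did not report "4: ACTIVE".
# $1 is the number of nodes (defaults to 32, as in our cluster).
report_bad_nodes() {
    awk -v count="${1:-32}" '
        {
            name = $1
            sub(/:$/, "", name)          # "node5:" becomes "node5"
            line = $0
            sub(/^[^ ]+ /, "", line)     # drop the "nodeN: " prefix
            state[name] = line
        }
        END {
            for (i = 1; i <= count; i++) {
                n = "node" i
                if (!(n in state))
                    print n ": no answer"
                else if (state[n] != "4: ACTIVE")
                    print n ": " state[n]
            }
        }'
}
```

Pipe the output of either the pdsh command or the ssh loop into report_bad_nodes; an empty result means all nodes reported an active port.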
The next step is to check the speed of IB ports on the head node:
cat /sys/class/infiniband/mlx4_0/ports/1/rate
Sample output:
40 Gb/sec (4X QDR)
… and on the compute nodes:
pdsh -w node[1-32] cat /sys/class/infiniband/mlx4_0/ports/1/rate
And here’s how you do this without pdsh:
i=1
while [ $i -le 32 ]
do
    echo -n "node$i: "
    ssh node$i "cat /sys/class/infiniband/mlx4_0/ports/1/rate"
    (( i = i + 1 ))
done
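A quick way to spot a node that negotiated a lower speed is to count how many nodes report each rate. This is only a sketch: it assumes the "nodeN: rate" line format produced by the commands above, and rate_summary is a name chosen for this example.

```shell
#!/bin/bash
# Read "nodeN: <rate>" lines on stdin and print "<count><TAB><rate>",
# most common rate first.
rate_summary() {
    awk '
        { sub(/^[^:]*: /, ""); n[$0]++ }          # strip the node prefix, count each rate
        END { for (r in n) print n[r] "\t" r }
    ' | sort -rn
}
```

On a healthy cluster the summary is a single line; any additional line points to nodes running at a different rate, which can then be found by searching the full output for that rate.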
More detailed analysis of the IB connection can be performed with the ibdiagnet command:
ibdiagnet -pc -c 1000
Sample output:
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.5.4
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib64/ibdm1.5.4
-I- Using port 1 as the local port.
-I- Discovering ... 39 nodes (3 Switches & 36 CA-s) discovered.
-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found
-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found
-I---------------------------------------------------
-I- General Device Info
-I---------------------------------------------------
-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-W- lid=0x0001 guid=0x0008f10500200898 dev=23130 Port=4
    Performance Monitor counter : Value
    symbol_error_counter : 0x4 (Increase by 4 during ibdiagnet scan.)
-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I- PKey:0x7fff Hosts:36 full:36 limited:0
-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps
-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                                    Errors Warnings
    Bad GUIDs/LIDs Check                     0      0
    Link State Active Check                  0      0
    General Devices Info Report              0      0
    Performance Counters Report              0      1
    Partitions Check                         0      0
    IPoIB Subnets Check                      0      1
Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------
-I- Done. Run time was 11 seconds.
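The report is long, so it can help to pull out only the stages that recorded errors or warnings. The following sketch assumes the "Stages Status Report" layout shown in the sample output (a stage name followed by an error count and a warning count); the function name failed_stages is hypothetical, and the layout may differ in other ibdiagnet versions.

```shell
#!/bin/bash
# Print the stages of an ibdiagnet report that have a non-zero error or
# warning count. Reads the files given as arguments, or stdin if none.
failed_stages() {
    awk '
        /Stages Status Report/ { in_report = 1; next }
        /^Please see/          { in_report = 0 }
        in_report && NF >= 3 && $NF ~ /^[0-9]+$/ && $(NF-1) ~ /^[0-9]+$/ {
            if ($NF + $(NF-1) == 0) next          # clean stage, skip it
            name = $1
            for (i = 2; i <= NF - 2; i++) name = name " " $i
            print name ": " $(NF-1) " error(s), " $NF " warning(s)"
        }' "$@"
}
```

For the run above, piping the ibdiagnet output into failed_stages would list the Performance Counters Report and the IPoIB Subnets Check, each with one warning.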
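The report identifies ports by LID. To see which LID belongs to which host, read the port LID on the head node and on the compute nodes, then run the scan again:
cat /sys/class/infiniband/mlx4_0/ports/1/lid
pdsh -w node[1-32] cat /sys/class/infiniband/mlx4_0/ports/1/lid
ibdiagnet -pc -c 1000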
The final step is to check the error state of each port:
ibcheckerrors
Sample output:
#warn: counter SymbolErrors = 35 (threshold 10) lid 1 port 255
#warn: counter RcvSwRelayErrors = 512 (threshold 100) lid 1 port 255
Error check on lid 1 (Voltaire 4036 - 36 QDR ports switch) port all: FAILED
#warn: counter SymbolErrors = 35 (threshold 10) lid 1 port 4
Error check on lid 1 (Voltaire 4036 - 36 QDR ports switch) port 4: FAILED
## Summary: 39 nodes checked, 0 bad nodes found
## 134 ports checked, 1 ports have errors beyond threshold
In our case, the “ibcheckerrors” command revealed a problem with the IB switch. This turned out to be a hardware problem and the switch needed to be replaced.
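On a larger fabric the warning lines are easier to review as a table. Here is a small sketch that turns them into tab-separated columns. It assumes the "#warn: counter ..." line format shown in the sample output above, and warn_table is a name invented for this example.

```shell
#!/bin/bash
# Convert ibcheckerrors "#warn: counter ..." lines into tab-separated columns:
# lid, port, counter, value, threshold.
# Reads the files given as arguments, or stdin if none.
warn_table() {
    awk '$1 == "#warn:" && $2 == "counter" {
        thr = $7
        sub(/\)$/, "", thr)                       # "10)" becomes "10"
        printf "%s\t%s\t%s\t%s\t%s\n", $9, $11, $3, $5, thr
    }' "$@"
}
```

Piping ibcheckerrors into warn_table, and the result into sort, groups the warnings by LID. In the sample output, port 255 appears next to the "port all" check, so those rows look like per-device totals, not a physical port.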