
WHITE PAPER APRIL 2021

Configuring a Mellanox® Network Adapter for Lossless Networking
Western Digital® Platforms

Abstract

This document provides an overview of how to configure lossless Ethernet settings on Mellanox based Ethernet network adapters.

Notices
Western Digital Technologies, Inc. or its affiliates' (collectively “Western Digital”) general policy does not recommend the use of
its products in life support applications wherein a failure or malfunction of the product may directly threaten life or injury. Per
Western Digital Terms and Conditions of Sale, the user of Western Digital products in life support applications assumes all risk of
such use and indemnifies Western Digital against all damages.
This document is for information use only and is subject to change without prior notice. Western Digital assumes no responsibility
for any errors that may appear in this document, nor for incidental or consequential damages resulting from the furnishing,
performance or use of this material.
Absent a written agreement signed by Western Digital or its authorized representative to the contrary, Western Digital explicitly
disclaims any express and implied warranties and indemnities of any kind that may, or could, be associated with this document
and related material, and any user of this document or related material agrees to such disclaimer as a precondition to receipt and
usage hereof.
Each user of this document or any product referred to herein expressly waives all guaranties and warranties of any kind associated
with this document any related materials or such product, whether expressed or implied, including without limitation, any implied
warranty of merchantability or fitness for a particular purpose or non-infringement. Each user of this document or any product
referred to herein also expressly agrees Western Digital shall not be liable for any incidental, punitive, indirect, special, or
consequential damages, including without limitation physical injury or death, property damage, lost data, loss of profits or costs
of procurement of substitute goods, technology, or services, arising out of or related to this document, any related materials or
any product referred to herein, regardless of whether such damages are based on tort, warranty, contract, or any other legal
theory, even if advised of the possibility of such damages.
This document and its contents, including diagrams, schematics, methodology, work product, and intellectual property rights
described in, associated with, or implied by this document, are the sole and exclusive property of Western Digital. No intellectual
property license, express or implied, is granted by Western Digital associated with the document recipient's receipt, access and/or
use of this document or the products referred to herein; Western Digital retains all rights hereto.
Western Digital, the Western Digital logo, and OpenFlex are registered trademarks or trademarks of Western Digital Corporation
or its affiliates in the US and/or other countries. Mellanox and ConnectX are registered trademarks of Mellanox Technologies, Ltd.
Red Hat, CentOS, Fedora, and Red Hat Enterprise Linux are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries
in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries. The
NVMe and NVMe-oF word marks are trademarks of NVMExpress, Inc. All other marks are the property of their respective owners.
Product specifications subject to change without notice. Pictures shown may vary from actual products. Not all products are
available in all regions of the world.
Western Digital
5601 Great Oaks Parkway
San Jose, CA 95119

© 2021 Western Digital Corporation or its affiliates. All Rights Reserved.


Table of Contents
Notices
Introduction
    Configuration Process Summary
    Example Hardware Specifications
    Configuration Table
Download and Install Software Driver
Configure PFC
    Summary of CLI Commands
Configure ECN
    Summary of CLI Commands
Show Pertinent Network Counters
Configure Boot Time Scripts
Appendix
    Contributors
    References
    Document Feedback
    Version History


Introduction
NVMe-oF™ based storage offers the promise of low latency shared storage. To obtain the performance potential of this
technology, Ethernet Network Adapters in initiator hosts must be configured for lossless networking using standard Data Center
Bridging (DCB) technologies. While these settings may be unfamiliar to many readers, they are not complicated to understand or
follow.

Configuration Process Summary


The process of enabling lossless networking functionality on Mellanox® based Ethernet Network Adapters can be broken down
into the following steps:
1. Download and Install Software Driver
2. Configure PFC (Priority Flow Control)
3. Configure ECN (Explicit Congestion Notification)
4. Show Pertinent Network Counters
5. Configure Boot Time Scripts
The Configure PFC and Configure ECN sections each end with a single code block containing all of the CLI commands used in that
section. This is a convenience that lets the reader copy and paste the commands into their own script in a single action instead of
copying each CLI command individually.

Example Hardware Specifications


In this guide, the following network adapters were used. Terminal output shown in this guide may vary based on your product and
firmware version.
• Model(s): ConnectX®-4, ConnectX-5, and ConnectX-6
• Firmware Version: CX4-12.25.1020, CX5-16.25.1020, CX6-16.27.2008
• Driver Version: 4.5-1.0.1.0, 5.0-2.1.8

Configuration Table
Included below for convenience is a table to record the lossless configuration values for your deployment.

Description            Variable         Example                     Your Value
Ethernet Device Name   <DEVICE>         p2p1
RoCE Device Name       <ROCE_DEVICE>    mlx5_0
PFC Priority           <ROCE_PRI>       3
PFC DSCP               <ROCE_DSCP>      24
CNP Priority           <CNP_PRI>        6
CNP DSCP               <CNP_DSCP>       48
MST Device Path        <MST_DEVICE>     /dev/mst/mt4121_pciconf0
Type of Service        <TOS>            96
PCIe Device            <PCIE_DEVICE>    8f:00.0
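
The example values in the table can be sanity-checked before they are used in any command. The following is a minimal sketch, not part of the Mellanox tooling: the variable names mirror the table's placeholders, check_range and check_pair are hypothetical helpers, and the consistency check assumes the relationship described in the Configure PFC section, where the priority is the three most significant bits of the DSCP value.

```shell
# Example values from the configuration table; replace with your own.
ROCE_PRI=3
ROCE_DSCP=24
CNP_PRI=6
CNP_DSCP=48

# check_range NAME VALUE MIN MAX (hypothetical helper)
# Succeeds only if VALUE is a whole number between MIN and MAX inclusive.
check_range () {
    case "$2" in
        ''|*[!0-9]*)
            echo "$1=$2 is not a whole number" 1>&2
            return 1 ;;
    esac
    if [ "$2" -lt "$3" ] || [ "$2" -gt "$4" ]; then
        echo "$1=$2 is outside $3..$4" 1>&2
        return 1
    fi
}

# check_pair PRIORITY DSCP (hypothetical helper)
# Succeeds only if both values are in range and PRIORITY equals the
# three most significant bits of the 6-bit DSCP value.
check_pair () {
    check_range "priority" "$1" 0 7 || return 1
    check_range "DSCP" "$2" 0 63 || return 1
    if [ "$(( $2 >> 3 ))" -ne "$1" ]; then
        echo "DSCP $2 carries priority $(( $2 >> 3 )), not $1" 1>&2
        return 1
    fi
}

check_pair "${ROCE_PRI}" "${ROCE_DSCP}" &&
    check_pair "${CNP_PRI}" "${CNP_DSCP}" &&
    echo "PFC and CNP values are consistent"
```

With the example values, both pairs pass: DSCP 24 carries priority 3 and DSCP 48 carries priority 6.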


Download and Install Software Driver


1. First verify that the system has a Mellanox adapter installed. After the host has booted into the operating system (Linux®),
run the following command to verify the Mellanox Network Adapter shows up on the PCIe bus
$ lspci -v | grep -i Mellanox

2. Visit Mellanox's website and navigate to "Products->Software->InfiniBand/VPI Drivers"

Because manufacturers' websites change frequently, this navigation path is only an example of how to obtain the current driver


3. Select “Mellanox OFED Linux - MLNX_OFED” and then under “Download” select your OS, OS Version and Architecture

4. Download the resultant "tgz" driver file


5. Copy the driver .tgz to the host

If the host has access to the internet the driver installation performs a firmware update. If the host does not
have access to the internet please refer to Mellanox documentation to properly update the RNIC to the most
current firmware.

6. Install the pre-requisite packages

This is not a comprehensive list of prerequisite packages; the packages required depend heavily on how the OS
installation was performed. Attempt the installation of the driver and examine the output to discover any
additional prerequisite packages that are needed

$ yum install python-devel kernel-devel redhat-rpm-config rpm-build gcc


$ yum install tcl gcc-gfortran tk

7. Extract the driver .tgz file on the host server


$ tar zxvf MLNX_OFED_LINUX-x-x.x.x.x-x86_64.tgz

8. Run the installation script from the folder where the tar file was extracted. Make sure to use the --with-nvmf and
--add-kernel-support options
$ ./mlnxofedinstall --add-kernel-support --with-nvmf

9. Create or edit the file /etc/modules-load.d/modules.conf to contain the following

#
# Mellanox nvmeOF
rdma_cm
ib_uverbs
rdma_ucm
nvme
nvme-rdma

10. Restart the driver


$ /etc/init.d/openibd restart

11. Check ibv_devices and verify that the Mellanox Network Adapter is in the list
$ ibv_devices
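
After the driver restart, it can also be useful to confirm that the modules listed in /etc/modules-load.d/modules.conf are actually loaded. The following is a minimal sketch, assuming standard lsmod output (module name in the first column, with hyphens shown as underscores); missing_modules is a hypothetical helper and not part of MLNX_OFED.

```shell
# Modules listed in /etc/modules-load.d/modules.conf in step 9
REQUIRED_MODULES="rdma_cm ib_uverbs rdma_ucm nvme nvme-rdma"

# missing_modules (hypothetical helper)
# Reads lsmod-style text on stdin and prints each required module
# that does not appear in the first column.
missing_modules () {
    loaded=$(awk '{ print $1 }')
    for mod in ${REQUIRED_MODULES}; do
        # lsmod reports hyphens in module names as underscores
        name=$(echo "${mod}" | tr '-' '_')
        echo "${loaded}" | grep -qxF "${name}" || echo "${mod}"
    done
}

# Usage on a live host:
#   lsmod | missing_modules
```

No output means every listed module is loaded; any module name printed still needs to be loaded, for example with modprobe.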


Configure PFC
The following is a list of things to know before following this process:
• All of these commands must be executed for each port on the RDMA Network Adapter that needs to be configured.
• The majority of these commands are not persistent through reboot and will have to be scripted to run at every boot.
PFC set up process:
1. Use MST (Mellanox Software Tools) to obtain the MST device path and the mapping between PCIe address, Ethernet device, and IB device

These commands are informational and not required to be run at boot

a. Start MST
$ mst start

b. Get MST device path


$ mst status -v


2. Ensure LLDP and DCBx are disabled

This command is persistent and is not required to be run at boot

$ mlxconfig -d <MST_DEVICE> set \
    LLDP_NB_DCBX_P1=FALSE LLDP_NB_TX_MODE_P1=2 LLDP_NB_RX_MODE_P1=2 \
    LLDP_NB_DCBX_P2=FALSE LLDP_NB_TX_MODE_P2=2 LLDP_NB_RX_MODE_P2=2

3. Reboot the host server


4. Verify LLDP and DCBx are disabled


This command is informational and not required to be run at boot

$ mlxconfig -d <MST_DEVICE> q | grep LLDP


5. Configure the trust state for DSCP


$ mlnx_qos -i <DEVICE> --trust dscp

6. Enable PFC on desired Priority

Parameter --pfc takes eight comma separated values. The values passed are Boolean (0 or 1) to indicate
whether or not PFC is enabled while the position indicates priority. The priorities are 0 - 7 from left to right. In
this example we are enabling PFC on priority "3"

$ mlnx_qos -i <DEVICE> --pfc 0,0,0,1,0,0,0,0

7. Allocate memory to secondary buffer


$ mlnx_qos -i <DEVICE> --buffer_size 130944,130944,0,0,0,0,0,0

8. Map desired priority to newly created buffer 1

Parameter --prio2buffer takes eight comma separated values. The values passed are the buffer to assign while
the position indicates priority. The priorities are 0 - 7 from left to right. In this example we are assigning priority
"3" to use the newly created buffer 1

$ mlnx_qos -i <DEVICE> --prio2buffer 0,0,0,1,0,0,0,0


9. Set traffic class transmission algorithm and bandwidth allocation to match the default on the Mellanox switch

Parameter --tsa takes eight comma separated values. The values passed are the egress scheduling method to
assign while the position indicates priority. The priorities are 0 - 7 from left to right. In this example we are
assigning priority "6" to use the strict scheduling method

Parameter --tcbw takes eight comma separated values. The values passed are a percentage of the total
bandwidth while the position indicates priority. The priorities are 0 - 7 from left to right. In this example we are
mimicking the default bandwidth assignments done at a Mellanox switch adjusting for the strict scheduling on
priority 6

$ mlnx_qos -i <DEVICE> --tsa ets,ets,ets,ets,ets,ets,strict,ets --tcbw 14,15,14,15,14,14,0,14

10. Set default RoCE mode for RNICs


$ cma_roce_mode -d <ROCE_DEVICE> -m 2

11. Verify that the RoCE mode is now set to RoCE v2

This command is informational and not required to be run at boot

$ cma_roce_mode -d <ROCE_DEVICE>

12. Determine the Type of Service (TOS) value to assign to RoCE v2 traffic by taking the DSCP value chosen for your
configuration and shifting it left by two bits (TOS = DSCP × 4)
TOS is an 8-bit field.
DSCP is a 6-bit field occupying the 6 most significant bits of TOS.
Priority is a 3-bit field occupying the 3 most significant bits of DSCP.
For example, DSCP 24 (binary 011000) corresponds to TOS 96 (binary 01100000).
13. Assign RoCE v2 traffic to use derived TOS value
$ cma_roce_tos -d <ROCE_DEVICE> -t <TOS>
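
The TOS derivation in step 12 can be sketched in shell arithmetic. This is an illustration only: dscp_to_tos is a hypothetical helper name, and the only relationship it relies on is the two-bit left shift described above.

```shell
# dscp_to_tos DSCP (hypothetical helper)
# Prints the TOS value that carries DSCP in its six most significant bits.
dscp_to_tos () {
    case "$1" in
        ''|*[!0-9]*)
            echo "DSCP must be a whole number" 1>&2
            return 1 ;;
    esac
    if [ "$1" -gt 63 ]; then
        echo "DSCP must be in 0..63" 1>&2
        return 1
    fi
    # Shift left by two bits, leaving the two low-order TOS bits at zero
    echo $(( $1 << 2 ))
}

dscp_to_tos 24    # prints 96

# Usage on a live host:
#   cma_roce_tos -d <ROCE_DEVICE> -t "$(dscp_to_tos 24)"
```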

Summary of CLI Commands


mlxconfig -d <MST_DEVICE> set \
    LLDP_NB_DCBX_P1=FALSE LLDP_NB_TX_MODE_P1=2 LLDP_NB_RX_MODE_P1=2 \
    LLDP_NB_DCBX_P2=FALSE LLDP_NB_TX_MODE_P2=2 LLDP_NB_RX_MODE_P2=2
mlnx_qos -i <DEVICE> --trust dscp
mlnx_qos -i <DEVICE> --pfc 0,0,0,1,0,0,0,0
mlnx_qos -i <DEVICE> --buffer_size 130944,130944,0,0,0,0,0,0
mlnx_qos -i <DEVICE> --prio2buffer 0,0,0,1,0,0,0,0
mlnx_qos -i <DEVICE> --tsa ets,ets,ets,ets,ets,ets,strict,ets --tcbw 14,15,14,15,14,14,0,14
cma_roce_mode -d <ROCE_DEVICE> -m 2
cma_roce_tos -d <ROCE_DEVICE> -t <TOS>
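
Several of the mlnx_qos parameters above take eight comma separated values whose position indicates the priority. If a priority other than the example values is used, the lists can be generated rather than typed by hand. The following is a minimal sketch; prio_list and tcbw_total are hypothetical helpers, and the expectation that the --tcbw percentages total 100 is an assumption based on the values being percentages of the total bandwidth.

```shell
# prio_list PRIORITY [ON] [OFF] (hypothetical helper)
# Prints eight comma separated values with ON at the position of
# PRIORITY (0 - 7, left to right) and OFF at every other position.
prio_list () {
    on=${2:-1}
    off=${3:-0}
    list=""
    i=0
    while [ "${i}" -le 7 ]; do
        if [ "${i}" -eq "$1" ]; then v=${on}; else v=${off}; fi
        list="${list}${list:+,}${v}"
        i=$((i + 1))
    done
    echo "${list}"
}

# tcbw_total LIST (hypothetical helper)
# Prints the sum of a comma separated list of bandwidth percentages.
tcbw_total () {
    total=0
    for bw in $(echo "$1" | tr ',' ' '); do
        total=$((total + bw))
    done
    echo "${total}"
}

prio_list 3                          # prints 0,0,0,1,0,0,0,0
prio_list 6 strict ets               # prints ets,ets,ets,ets,ets,ets,strict,ets
tcbw_total 14,15,14,15,14,14,0,14    # prints 100
```

For example, `mlnx_qos -i <DEVICE> --pfc "$(prio_list 3)"` is equivalent to the --pfc command shown in the summary.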


Configure ECN

By default, Mellanox configures CNPs to priority 6 using a DSCP value of 48. If this is not your desired priority
for ECN CNP traffic use these instructions to change the CNP DSCP value

The majority of these commands are not persistent through reboot and will have to be scripted to run at every
boot

1. Ensure that the debugfs is mounted


$ mount | grep debugfs

2. If the path isn't available, a mount is required


$ mount -t debugfs none /sys/kernel/debug

3. Set CNP mode to manual configuration


$ echo 0 > /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params/np_cnp_prio_mode

4. Verify CNP mode is set to manual configuration


$ cat /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params/np_cnp_prio_mode

5. Change the CNP DSCP value


$ echo <CNP_DSCP> > /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params/np_cnp_dscp

6. Change the CNP Class of Service


$ echo <CNP_PRI> > /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params/np_cnp_prio

7. Verify the values were set correctly


$ cat /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params/np_cnp*

Summary of CLI Commands


mount -t debugfs none /sys/kernel/debug
echo 0 > /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params/np_cnp_prio_mode
cat /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params/np_cnp_prio_mode
echo <CNP_DSCP> > /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params/np_cnp_dscp
echo <CNP_PRI> > /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params/np_cnp_prio
cat /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params/np_cnp*
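
The write-then-verify sequence above can be wrapped in a single function so that a failed write is reported instead of passing unnoticed. The following is a minimal sketch; set_cnp_params is a hypothetical helper, and it assumes only the three file names used in the commands above. The directory is passed as an argument so that the function can be tried against a scratch directory before it is pointed at debugfs.

```shell
# set_cnp_params DIR DSCP PRIO (hypothetical helper)
# Writes the CNP settings into the cc_params directory DIR and reads
# them back. Succeeds only if every value was stored as written.
set_cnp_params () {
    dir=$1
    if [ ! -d "${dir}" ]; then
        echo "set_cnp_params: ${dir} not found" 1>&2
        return 1
    fi
    # 0 selects manual configuration of the CNP priority
    echo 0    > "${dir}/np_cnp_prio_mode" || return 1
    echo "$2" > "${dir}/np_cnp_dscp"      || return 1
    echo "$3" > "${dir}/np_cnp_prio"      || return 1
    # Read the values back to confirm they were accepted
    [ "$(cat "${dir}/np_cnp_prio_mode")" = "0" ] &&
        [ "$(cat "${dir}/np_cnp_dscp")" = "$2" ] &&
        [ "$(cat "${dir}/np_cnp_prio")" = "$3" ]
}

# Usage on a live host (requires root and a mounted debugfs):
#   set_cnp_params /sys/kernel/debug/mlx5/<PCIE_DEVICE>/cc_params <CNP_DSCP> <CNP_PRI>
```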


Show Pertinent Network Counters

As of August 2020, the CNP counters on the Mellanox RNICs do not appear to be accurate

1. Show PFC counters for a specific interface


$ ethtool -S <DEVICE> | grep 'pause.*prio\|prio.*pause'

2. Show Traffic Class counters on a specific interface


$ ethtool -S <DEVICE> | grep prio


3. Show drop, discard, pause, and abort counters on a specific interface. This command only shows non-zero counts
$ grep "" /sys/class/infiniband/mlx5_*/ports/1/counters/* | grep -i -e drop -e dis -e err -e pau -e abort | awk -F: '$NF != 0 { print $0 }'


4. Show ECN CNP counters on specific interface


$ grep "" /sys/class/infiniband/mlx5_*/ports/1/hw_counters/* | grep -i -e cnp -e ecn
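
The PFC, traffic class, and CNP commands above print every matching counter, including those that are still zero. The non-zero filter used in step 3 can be reused for them. The following is a minimal sketch; nonzero_counters is a hypothetical helper, and it assumes only that each counter is printed as a name, a colon, and a numeric value, as in the outputs of the commands above.

```shell
# nonzero_counters (hypothetical helper)
# Reads "name: value" lines on stdin and prints, without leading
# whitespace, only the lines whose value is a non-zero number.
nonzero_counters () {
    awk -F: 'NF >= 2 {
        v = $NF
        gsub(/[ \t]/, "", v)
        if (v ~ /^[0-9]+$/ && v + 0 != 0) {
            line = $0
            sub(/^[ \t]+/, "", line)
            print line
        }
    }'
}

# Usage on a live host:
#   ethtool -S <DEVICE> | grep prio | nonzero_counters
```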


Configure Boot Time Scripts


The prior sections of this document provided instructions on how to configure RDMA capable Network Adapters for lossless
networking. Many of the commands used do not persist across a reboot. This section details the method for making the lossless
settings persistent. To that end, a combination of a systemd service and a script that this service invokes is
used to apply the lossless configuration to the host at boot time.

The instructions below are a specific example for Red Hat® based operating systems with systemd. If your
system differs, use the scripts below as a template to create equivalent services and
scripts for your operating system

1. Create a lossless configuration script named /usr/local/sbin/lossless.sh with the following contents:

This script is specific to Mellanox based network adapters and may require customization depending on
the RNIC that is being used. Pay particular attention to the comments for instructions on how to customize it

#!/bin/bash

# Mellanox specific lossless network configuration script

# Print the RDMA device name (for example mlx5_0) that shares a PCIe device
# with the given Ethernet interface, or "none" if there is no match
ibdev_from_nic ()
{
    local _nic=${1} _pcidev _lnks _ibdev;
    if [ -z "${_nic}" ]; then
        echo "${FUNCNAME}(): NIC must be specified -- aborting" 1>&2;
        exit 1;
    fi;
    local _path="/sys/class/net/${_nic}/device";
    if [ ! -h "${_path}" ]; then
        echo "${FUNCNAME}(): \"${_path}\" not found -- aborting" 1>&2;
        exit 1;
    fi;
    _pcidev=`readlink ${_path} | sed -e 's/.*\///'`;
    _lnks=`find /sys/class/infiniband -maxdepth 1 -type l -print -exec readlink {} \; | grep ${_pcidev}`;
    _ibdev=`echo "${_lnks}" | head -1 | sed -e 's/.*\///'`;
    if [ -n "${_ibdev}" ]; then
        echo ${_ibdev};
    else
        echo "none";
    fi
}

# Ethernet interfaces to configure, space separated (edit for your host)
NICS="enp65s0f0 enp65s0f1"
# Set to 0 to skip the ECN (CNP) configuration
ECN=1

# Calculate the total available Rx buffer size for all ports on each RNIC
declare -A nics nics_c
for nic in ${NICS}; do
    # Ports are grouped by the interface name up to the first "s",
    # so enp65s0f0 and enp65s0f1 both yield enp65
    this_nic=`echo ${nic} | cut -ds -f1`
    if [ -z "${nics[${this_nic}]}" ]; then
        nics[${this_nic}]=0
        nics_c[${this_nic}]=1
    else
        ((nics_c[${this_nic}]++))
    fi
    rbt=${nics[${this_nic}]}
    for rb in `mlnx_qos -i ${nic} | grep Receive | tr -dc '[0-9],' | tr ',' ' '`; do
        rbt=$((rbt+rb))
    done
    # Store in hash for later use
    nics[${this_nic}]=${rbt}
done
# Take the total available Rx buffer size for all ports and divide by 2 * Nports
for this_nic in ${!nics[@]}; do
    nics[${this_nic}]=$((nics[${this_nic}]/(2 * nics_c[${this_nic}])))
done

for nic in ${NICS}; do
    mlnx_qos -i ${nic} --trust dscp --dcbx os >/dev/null 2>&1
    mlnx_qos -i ${nic} --pfc 0,0,0,1,0,0,0,0 2>&1
    this_nic=`echo ${nic} | cut -ds -f1`
    rbt=${nics[${this_nic}]}
    # Set the Rx buffer size for 0 and 1
    mlnx_qos -i ${nic} --buffer_size ${rbt},${rbt},0,0,0,0,0,0 2>&1
    mlnx_qos -i ${nic} --prio2buffer 0,0,0,1,0,0,0,0 2>&1
    mlnx_qos -i ${nic} --tsa ets,ets,ets,ets,ets,ets,strict,ets --tcbw 14,15,14,15,14,14,0,14 2>&1

    # ibdev is empty if ibdev_from_nic aborted, "none" if no RDMA device matched
    ibdev=`ibdev_from_nic ${nic}`
    if [ -n "${ibdev}" ] && [ "${ibdev}" != "none" ]; then
        # TOS 96 corresponds to DSCP 24
        cma_roce_tos -d ${ibdev} -t 96 >/dev/null 2>&1
    fi

    if ((ECN)); then
        pcidev=`readlink /sys/class/net/${nic}/device | sed -e 's/.*\///'`
        if ! grep -q debugfs /proc/mounts; then
            mount -t debugfs none /sys/kernel/debug
        fi
        # Skip this port if its cc_params directory is missing, so that the
        # values below are never written into the current directory
        pushd /sys/kernel/debug/mlx5/${pcidev}/cc_params/ >/dev/null || continue
        echo 0 > np_cnp_prio_mode
        echo 48 > np_cnp_dscp
        echo 6 > np_cnp_prio
        popd >/dev/null
    fi
done

exit 0


2. Create a service control script /etc/systemd/system/lossless.service with the following contents:


# Service to configure lossless on system startup
[Unit]
Description=Script to set DSCP mode and default CMA TOS value
After=network.target
# Requires=openibd.service

[Service]
Type=simple
ExecStart=/usr/local/sbin/lossless.sh
TimeoutStartSec=0

[Install]
WantedBy=default.target

3. Register the new service


# systemctl enable lossless

4. Start the new service


# systemctl start lossless
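
The receive buffer arithmetic in lossless.sh may be easier to follow in isolation. The script adds up the receive buffer sizes reported for every port of an RNIC and divides the total by 2 * Nports, so that each port ends up with two equally sized buffers. The following is a minimal sketch of that calculation; rx_buffer_total and per_buffer_size are hypothetical helpers, and the sample "Receive buffer size" line is an assumed example of mlnx_qos output, not captured from a real adapter.

```shell
# rx_buffer_total LINE (hypothetical helper)
# Prints the sum of the comma separated sizes found in a
# "Receive buffer size" line.
rx_buffer_total () {
    total=0
    for size in $(echo "$1" | tr -dc '0-9,' | tr ',' ' '); do
        total=$((total + size))
    done
    echo "${total}"
}

# per_buffer_size TOTAL NPORTS (hypothetical helper)
# Prints the size of each of the two buffers given to every port.
per_buffer_size () {
    echo $(( $1 / (2 * $2) ))
}

# Assumed sample line for one port
line='Receive buffer size (bytes): 130944,130944,0,0,0,0,0,0'
one_port=$(rx_buffer_total "${line}")
echo "${one_port}"                            # prints 261888
per_buffer_size "${one_port}" 1               # prints 130944
# Two ports reporting the same sizes
per_buffer_size "$(( one_port * 2 ))" 2       # prints 130944
```

With these sample numbers the result matches the 130944 byte value used in the --buffer_size example in the Configure PFC section.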


Appendix
Contributors
Name              Company           Title
Jon Flynn         Western Digital   Technologist, Applications Engineering
Niall Macleod     Western Digital   Sr. Manager, Applications Engineering
Barrett Edwards   Western Digital   Director, Technical Marketing

References
Document Title
Lossless Networking Technology Overview
OpenFlex Data24 Deployment Considerations
OpenFlex F3200 Deployment Considerations
Configuring an Arista Switch for Lossless Networking
Configuring a Cisco Switch for Lossless Networking
Configuring a Lenovo Switch for Lossless Networking
Configuring a Mellanox Switch for Lossless Networking
Configuring a Switch Running Cumulus for Lossless Networking
Configuring a Broadcom Network Adapter for Lossless Networking
Configuring a Mellanox Network Adapter for Lossless Networking

Document Feedback
For feedback, questions, and suggestions for improvements to this document send an email to the Data Center Systems (DCS)
Technical Marketing Engineering (TME) team distribution list at pdl-dcs-tm@wdc.com.

Version History
Version   Date          Notes
1.0       7 Apr 2021    Initial release


5601 Great Oaks Parkway


San Jose, CA 95119, USA
US (Toll-Free): 800.801.4618
International: 408.717.6000

www.westerndigital.com
