Professional Documents
Culture Documents
Abstract
This document provides an overview of how to configure lossless Ethernet settings on Mellanox based Ethernet network adapters.
Configuring a Mellanox Network Adapter for Lossless Networking
Notices
Western Digital Technologies, Inc. or its affiliates' (collectively “Western Digital”) general policy does not recommend the use of
its products in life support applications wherein a failure or malfunction of the product may directly threaten life or injury. Per
Western Digital Terms and Conditions of Sale, the user of Western Digital products in life support applications assumes all risk of
such use and indemnifies Western Digital against all damages.
This document is for information use only and is subject to change without prior notice. Western Digital assumes no responsibility
for any errors that may appear in this document, nor for incidental or consequential damages resulting from the furnishing,
performance or use of this material.
Absent a written agreement signed by Western Digital or its authorized representative to the contrary, Western Digital explicitly
disclaims any express and implied warranties and indemnities of any kind that may, or could, be associated with this document
and related material, and any user of this document or related material agrees to such disclaimer as a precondition to receipt and
usage hereof.
Each user of this document or any product referred to herein expressly waives all guaranties and warranties of any kind associated
with this document any related materials or such product, whether expressed or implied, including without limitation, any implied
warranty of merchantability or fitness for a particular purpose or non-infringement. Each user of this document or any product
referred to herein also expressly agrees Western Digital shall not be liable for any incidental, punitive, indirect, special, or
consequential damages, including without limitation physical injury or death, property damage, lost data, loss of profits or costs
of procurement of substitute goods, technology, or services, arising out of or related to this document, any related materials or
any product referred to herein, regardless of whether such damages are based on tort, warranty, contract, or any other legal
theory, even if advised of the possibility of such damages.
This document and its contents, including diagrams, schematics, methodology, work product, and intellectual property rights
described in, associated with, or implied by this document, are the sole and exclusive property of Western Digital. No intellectual
property license, express or implied, is granted by Western Digital associated with the document recipient's receipt, access and/or
use of this document or the products referred to herein; Western Digital retains all rights hereto.
Western Digital, the Western Digital logo, and OpenFlex are registered trademarks or trademarks of Western Digital Corporation
or its affiliates in the US and/or other countries. Mellanox and ConnectX are registered trademarks of Mellanox Technologies, Ltd.
Red Hat, CentOS, Fedora, and Red Hat Enterprise Linux are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries
in the United States and other countries. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries. The
NVMe and NVMe-oF word marks are trademarks of NVMExpress, Inc. All other marks are the property of their respective owners.
Product specifications subject to change without notice. Pictures shown may vary from actual products. Not all products are
available in all regions of the world.
Western Digital
5601 Great Oaks Parkway
San Jose, CA 95119
Table of Contents
Notices ................................................................................................................................................................................................. 2
Introduction ......................................................................................................................................................................................... 4
Configuration Process Summary ...................................................................................................................................................... 4
Example Hardware Specifications .................................................................................................................................................... 4
Configuration Table .......................................................................................................................................................................... 4
Download and Install Software Driver ................................................................................................................................................. 5
Configure PFC ....................................................................................................................................................................................... 8
Summary of CLI Commands ........................................................................................................................................................... 10
Configure ECN .................................................................................................................................................................................... 11
Summary of CLI Commands ........................................................................................................................................................... 11
Show Pertinent Network Counters .................................................................................................................................................... 12
Configure Boot Time Scripts............................................................................................................................................................... 14
Appendix ............................................................................................................................................................................................ 17
Contributors ................................................................................................................................................................................... 17
References ..................................................................................................................................................................................... 17
Document Feedback ...................................................................................................................................................................... 17
Version History ............................................................................................................................................................................... 17
Introduction
NVMe-oF™ based storage offers the promise of low latency shared storage. To obtain the performance potential of this
technology, Ethernet Network Adapters in initiator hosts must be configured for lossless networking using standard Data Center
Bridging (DCB) technologies. While these settings may be unfamiliar to many readers, they are not complicated to understand or
follow.
Configuration Table
Included below for convenience is a table to record the lossless configuration values for your deployment.
As manufacturer’s websites continually change this can only be an example of how to acquire the current driver
3. Select “Mellanox OFED Linux - MLNX_OFED” and then under “Download” select your OS, OS Version and Architecture
If the host has access to the internet the driver installation performs a firmware update. If the host does not
have access to the internet please refer to Mellanox documentation to properly update the RNIC to the most
current firmware.
This is not a comprehensive list of pre-requisite packages. This is highly dependent on how the OS installation
was performed. Attempt the installation of the driver and examine the output to discover any additional pre-
requisite packages that are needed
8. Run the installation script from the folder where the tar file was extracted. Make sure to use the --with-nvmf and --add-
kernelsupport command modifiers
$ ./mlnxofedinstall --add-kernel-support --with-nvmf
#
# Mellanox nvmeOF
rdma_cm
ib_uverbs
rdma_ucm
nvme
nvme-rdma
11. Check ibv_devices and verify that the Mellanox Network Adapter is in the list
$ ibv_devices
Example Output:
Configure PFC
The following is a list of things to know before following this process:
All of these commands must be executed for each port on the RDMA Network Adapter that needs to be configured
The majority of these commands are not persistent through reboot and will have to be scripted to run at every boot
PFC set up process:
1. Use MST to acquire PCIe address to Ethernet Device to IB Device mapping as well as the MST device path
a. Start MST
$ mst start
Example Output:
Example Output:
Parameter --pfc takes eight comma separated values. The values passed are Boolean (0 or 1) to indicate
whether or not PFC is enabled while the position indicates priority. The priorities are 0 - 7 from left to right. In
this example we are enabling PFC on priority "3"
Parameter --prio2buffer takes eight comma separated values. The values passed are the buffer to assign while
the position indicates priority. The priorities are 0 - 7 from left to right. In this example we are assigning priority
"3" to use the newly created buffer 1
9. Set traffic class transmission algorithm and bandwidth allocation to match the default on the Mellanox switch
Parameter --tsa takes eight comma separated values. The values passed are the egress scheduling method to
assign while the position indicates priority. The priorities are 0 - 7 from left to right. In this example we are
assigning priority "6" to use the strict scheduling method
Parameter --tcbw takes eight comma separated values. The values passed are a percentage of the total
bandwidth while the position indicates priority. The priorities are 0 - 7 from left to right. In this example we are
mimicking the default bandwidth assignments done at a Mellanox switch adjusting for the strict scheduling on
priority 6
$ cma_roce_mode -d <ROCE_DEVICE>
12. Determine Type of Service (TOS) value to assign to RoCE v2 traffic by using the DSCP value and perform a bitwise left shift
by 2 depending on the desired configuration
TOS is an 8-bit construct.
DSCP is a 6-bit field residing in TOS (6 most significant bits).
Priority is a 3-bit field residing in DSCP (3 most significant bits).
For example: DSCP24: 011000 with corresponding TOS96: 01100000
13. Assign RoCE v2 traffic to use derived TOS value
$ cma_roce_tos -d <ROCE_DEVICE> -t <TOS>
Configure ECN
By default, Mellanox configures CNPs to priority 6 using a DSCP value of 48. If this is not your desired priority
for ECN CNP traffic use these instructions to change the CNP DSCP value
The majority of these commands are not persistent through reboot and will have to be scripted to run at every
boot
As of August 2020, the CNP counters on the Mellanox RNICs do not appear to be accurate
Example Output:
3. Show drop, discard, pause, and abort counters on a specific interface. This command only shows non-zero counts
$ grep "" /sys/class/infiniband/mlx5_*/ports/1/counters/* | grep -i -e drop -e dis -e err -e pau
-e abort | awk -F: '$NF != 0 { print $0 }'
Example Output:
The instructions below are specific example using RedHat® based operating systems with systemd. If your
system does not meet these specifications use the scripts below as a template to create operable services and
scripts for your operating system
1. Create a lossless configuration script named /usr/local/sbin/lossless.sh with the following contents:
This is a script specific to Mellanox based network adapters. This script may require customization depending on
the RNIC that is being used. Pay particular attention to the bold comments for instruction on how to customize
#!/bin/bash
ibdev_from_nic ()
{
local _nic=${1} _pcidev _lnks _ibdev;
if [ -z "${_nic}" ]; then
echo "${FUNCNAME}(): NIC must be specified -- aborting" 1>&2;
exit 1;
fi;
local _path="/sys/class/net/${_nic}/device";
if [ ! -h "${_path}" ]; then
echo "${FUNCNAME}(): \"${_path}\" not found -- aborting" 1>&2;
exit 1;
fi;
_pcidev=`readlink ${_path} | sed -e 's/.*\///'`;
_lnks=`find /sys/class/infiniband -maxdepth 1 -type l -print -exec readlink {} \; | grep
${_pcidev}`;
_ibdev=`echo "${_lnks}" | head -1 | sed -e 's/.*\///'`;
if [ -n "${_ibdev}" ]; then
echo ${_ibdev};
else
echo "none";
fi
}
NICS="enp65s0f0 enp65s0f1"
ECN=1
# Calculate the total available Rx buffer size for all ports on each RNIC
declare -A nics nics_c
for nic in ${NICS}; do
this_nic=`echo ${nic} | cut -ds -f1`
if [ -z "${nics[${this_nic}]}" ]; then
nics[${this_nic}]=0
nics_c[${this_nic}]=1
else
((nics_c[${this_nic}]++))
fi
rbt=${nics[${this_nic}]}
for rb in `mlnx_qos -i ${nic} | grep Receive | tr -dc '[0-9],' | tr ',' ' '`; do
rbt=$((rbt+rb))
done
# Store in hash for later use
nics[${this_nic}]=${rbt}
done
# Take the total available Rx buffer size for all ports and divide by 2 * Nports
for this_nic in ${!nics[@]}; do
nics[${this_nic}]=$((nics[${this_nic}]/(2 * nics_c[${this_nic}])))
done
ibdev=`ibdev_from_nic ${nic}`
if [ "${ibdev}" != "none" ]; then
cma_roce_tos -d ${ibdev} -t 96 >/dev/null 2>&1
fi
if ((ECN)); then
pcidev=`readlink /sys/class/net/${nic}/device | sed -e 's/.*\///'`
if ! grep -q debugfs /proc/mounts; then
mount -t debugfs none /sys/kernel/debug
fi
pushd /sys/kernel/debug/mlx5/${pcidev}/cc_params/ >/dev/null
echo 0 > np_cnp_prio_mode
echo 48 > np_cnp_dscp
echo 6 > np_cnp_prio
popd >/dev/null
fi
done
exit 0
[Service]
Type=simple
ExecStart=/usr/local/sbin/lossless.sh
TimeoutStartSec=0
[Install]
WantedBy=default.target
Appendix
Contributors
Name Company Title
Jon Flynn Western Digital Technologist, Applications Engineering
Niall Macleod Western Digital Sr. Manager, Applications Engineering
Barrett Edwards Western Digital Director, Technical Marketing
References
Document Title
Lossless Networking Technology Overview
OpenFlex Data24 Deployment Considerations
OpenFlex F3200 Deployment Considerations
Configuring an Arista Switch for Lossless Networking
Configuring a Cisco Switch for Lossless Networking
Configuring a Lenovo Switch for Lossless Networking
Configuring a Mellanox Switch for Lossless Networking
Configuring a Switch Running Cumulus for Lossless Networking
Configuring a Broadcom Network Adapter for Lossless Networking
Configuring a Mellanox Network Adapter for Lossless Networking
Document Feedback
For feedback, questions, and suggestions for improvements to this document send an email to the Data Center Systems (DCS)
Technical Marketing Engineering (TME) team distribution list at pdl-dcs-tm@wdc.com.
Version History
Version Date Notes
1.0 7 Apr 2021 Initial release
www.westerndigital.com