====== Why does my MPI job complain about connection errors or addresses already in use? ======

Depending on the number of MPI processes you attempt to start in your Slurm jobs, you may see errors such as:

<code>
mca_btl_tcp_endpoint_start_connect] bind on local address (aaa.bb.cc.dd:0) failed: Address already in use (98)
</code>

Or:

<code>
pmix_ptl_base: send_msg: write failed: Broken pipe (32) [sd = xyz]
</code>

Or words to the effect of:

<code>
/usr/share/openmpi/help-mpi-btl-tcp.txt: Too many open files.  Sorry!
</code>

This occurs because MPI processes open network sockets to talk to each other. Depending on the MPI communications mechanism your application is configured to use, //the number of required network sockets can vary//.

==== Why This Happens ====

In most cases this is because your job has run out of available open network sockets. Network sockets share the same (consumable) system resources as open files; Linux tracks the use of these resources to ensure that the system can still read and write essential system files and respond to incoming network requests (e.g. logins).

The maximum number of network port numbers on a Linux system is **65536** - but this space is shared with all of the essential system network services such as SSH, NFS, email, etc. The real number available to users is lower.

You can verify how many files or sockets //your// account is allowed to open with the following command:

<code>
$ ulimit -n
1024
</code>

This shows that you are allowed to open **1024** files or network sockets (the normal Linux default).

==== Requesting Increased Sockets or Files ====

You may request an increase to the number of open files or sockets by supplying an argument to the ''ulimit -n'' command. For example, to increase the limit to **2048**:

<code>
$ ulimit -n 2048
$ ulimit -n
2048
</code>

This limit change only persists within the session in which you run it.
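If a job genuinely needs a modest increase, the limit can be raised in the batch script itself before the MPI program is launched. The following is a minimal sketch, assuming a Slurm batch script that starts an Open MPI program with ''mpirun''; the job name, node and task counts, requested limit and executable name are all placeholders, and whether the raised limit reaches ranks on //other// nodes depends on how Slurm and your MPI library propagate limits on your cluster - check with your local support team.

<code bash>
#!/bin/bash
#SBATCH --job-name=mpi-sockets     # placeholder job name
#SBATCH --nodes=2                  # placeholder node count
#SBATCH --ntasks-per-node=32       # placeholder tasks per node

# Raise the soft limit on open files / network sockets for this script and
# its child processes. This only works up to the hard limit configured by
# the system administrators ('ulimit -Hn' shows the hard limit).
ulimit -n 4096

# Record the limits in the job output to help with later debugging.
echo "soft limit: $(ulimit -Sn)  hard limit: $(ulimit -Hn)"

# Launch the MPI program (placeholder executable name).
mpirun ./my_mpi_program
</code>

Requests above the hard limit fail with the ''Operation not permitted'' error shown in the next section.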
==== What is the maximum system limit for files or sockets? ====

You can check what the system is currently configured to allow with the following command:

<code>
$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768    60999
</code>

This indicates that local ports in the range ''32768 - 60999'' may be used, giving a figure of ''28232'' - a seemingly huge number, but it is shared by **all** users and **all** processes on the system which want to open a network socket.

If you attempt to request an increase to the open files number (via ''ulimit -n'') //beyond// what is allowed for individual users, you will see an error. For example:

<code>
$ ulimit -n 3001
-bash: ulimit: open files: cannot modify limit: Operation not permitted
</code>

Here we can see that 3001 open files or sockets is **more** than is allowed for a single user.

**Think about your use case!** It is not normal for a single user to need more than a few thousand open files / network sockets. //Tens of thousands// would be an __extremely__ unusual use case. If you do hit this issue, you should probably reconsider how you are approaching the problem.

==== MPI Edge Cases ====

There are some specific cases in which MPI jobs may want to open many thousands of network sockets.

If, for example, you are using the ''alltoall'' inter-process communication method in your MPI code, then the number of network sockets needed on each node can be calculated as: ''c x ((n x c) - c)''

Where:

  * ''c'' is the number of MPI processes per node
  * ''n'' is the number of nodes

So in the case of using all 256 cores on a node, and running on two nodes: ''256 x ((2 x 256) - 256) == 65536''

Four nodes would increase that to: ''256 x ((4 x 256) - 256) == 196608''

(A small helper script for estimating these numbers is sketched at the end of this page.)

This makes it **impossible** to use the ''alltoall'' inter-process communication method beyond a small number of MPI processes per node, as it requires more network sockets on each host than can ever be available (at an __absolute minimum__, 1024 network sockets are always reserved for essential network services).

This problem is explored in more detail in a useful Open MPI GitHub issue:

  * https://github.com/open-mpi/ompi/issues/7246

Note that, as discussed there, there is //no specific fix// for this; there are a number of //possible// workarounds, but as the socket calculation above shows, there is an __absolute maximum ceiling__ on the size of jobs which use ''alltoall'' communications.

**Configuring your MPI jobs**

If any MPI code you are using expects to use ''alltoall'' communication, you should configure it to use an alternative - or cap the number of MPI processes at a //sensible// value which does not require tens of thousands of open network sockets. In //most// cases every single MPI process does //not// need to communicate with every other process running across every other node in the entire job.

----

[[:faq:index|Back to FAQ index]]
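As referenced in the MPI Edge Cases section above, the following is a minimal (hypothetical) helper for estimating the socket requirements of an ''alltoall'' job before you submit it. The script name is a placeholder, and the estimate covers only the inter-node connections described by the formula above, not any other files or sockets your application may open.

<code bash>
#!/bin/bash
# alltoall-sockets.sh - rough estimate of socket usage for an MPI alltoall job.
# Usage: ./alltoall-sockets.sh <processes-per-node> <nodes>

c=${1:?usage: $0 <processes-per-node> <nodes>}   # MPI processes per node
n=${2:?usage: $0 <processes-per-node> <nodes>}   # number of nodes

# Connections each process opens to ranks on other nodes: (n x c) - c
per_process=$(( (n * c) - c ))

# Total sockets needed on each node: c x ((n x c) - c)
per_node=$(( c * per_process ))

echo "Connections per MPI process : ${per_process} (compare with 'ulimit -n')"
echo "Sockets needed on each node : ${per_node} (the port space ceiling is 65536)"
</code>

Running ''./alltoall-sockets.sh 256 2'' reproduces the ''65536'' per-node figure from the two-node example above.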