File Transfers and RDW

Monitoring RDW shares

RDW is file storage for research, hosted in our campus data centre. Files can be viewed from both Windows and Linux. Permissions on the files are set using your campus login, so you must be logged in as the correct campus user in order to see your files.

Checking the size of data

Check the size of your data before transferring; this will give you an idea of how much space you’ll need on the destination and how long the transfer may take. Windows File Explorer will give you an estimate of directory size (right-click and choose ‘Properties’), but you can get much better information at the Linux command line using du. Try this tutorial: https://www.geeksforgeeks.org/du-command-linux-examples/
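As a starting point, the commands below sketch a typical size check with du. The demo directory and file names are hypothetical stand-ins created in a temporary location so the commands can be tried safely; point du at your own data instead. The --max-depth option and sort -h assume the GNU versions of these tools found on most Linux systems.

```shell
# Hypothetical demo directory; substitute the directory you plan to transfer.
demo=$(mktemp -d)
mkdir -p "$demo/raw" "$demo/results"
head -c 1048576 /dev/zero > "$demo/raw/big.dat"      # 1 MiB test file
head -c 1024 /dev/zero > "$demo/results/small.dat"   # 1 KiB test file

# Total size of the whole directory, in human-readable units
du -sh "$demo"

# Size of each immediate sub-directory, smallest first
du -h --max-depth=1 "$demo" | sort -h

# Number of files to be transferred
file_count=$(find "$demo" -type f | wc -l)
echo "files: $file_count"
```

The total gives an idea of the space needed at the destination, while the per-directory breakdown helps when deciding whether to split a transfer into several runs.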

Principles for setting up large file transfers

Large data transfer tips

Consider the route the data takes. If you run a copy command on a laptop at home to copy data from Rocket to RDW, the data will have to move off campus to your laptop and back again via the (much slower) internet connection of your laptop. If you log in to Rocket from your laptop and run a copy command to RDW, the data will move straight from Rocket to RDW without leaving the (much faster) data centre network.

Avoid graphical copy/paste. It's slow and error-prone. Instead of a graphical file manager, use a file synchronisation tool, which can keep track of the copy and only move data which isn’t already present at the destination. This means that a transfer which fails part-way through is not wasted: the next time the command is run, the transfer will pick up from where it stopped.

Viewing and copying to RDW on Linux Command Line

Viewing your RDW share on Rocket

When an RDW share is set up, you will be provided with its Windows share name. Navigate to your RDW share on Rocket from the login node. RDW is split up ‘behind the scenes’ into numbered blocks, for admin purposes. This means there will be a number between 01 and 08 after /rdw in the path, so the path will be like: /rdw/05/share_name

To find a new share called “share-name”, use:

$ cd /rdw
$ find -maxdepth 2 -type d -name "share-name"

NB: Groups with more than one project may have a super-directory on RDW, giving a path like /rdw/02/group/share_name
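The effect of the search depth can be tried out on a mock directory tree. This sketch builds a hypothetical layout in a temporary directory standing in for /rdw (the share and group names are invented for illustration) and shows that a share inside a group super-directory is only found when -maxdepth is raised to 3.

```shell
# Mock layout standing in for /rdw; all names here are hypothetical.
rdw=$(mktemp -d)
mkdir -p "$rdw/05/share-name" "$rdw/02/group/other-share"
cd "$rdw"

# Depth 2 reaches shares directly under a numbered block
direct=$(find . -maxdepth 2 -type d -name "share-name")
echo "direct share: $direct"

# Depth 2 misses a share that sits inside a group super-directory
missed=$(find . -maxdepth 2 -type d -name "other-share")
echo "depth 2 result for nested share: '$missed'"

# Depth 3 reaches it
nested=$(find . -maxdepth 3 -type d -name "other-share")
echo "nested share: $nested"
```

On the real system, the same find commands would be run from /rdw as shown earlier.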

Using rsync to copy to RDW

For large data, you may find the scp command limiting. The rsync utility provides advanced features for file transfer and is typically faster compared to both scp and sftp. It is especially useful for transferring large and/or many files. The syntax is similar to cp and scp. Rsync can be used on a locally mounted filesystem or a remote filesystem.

rsync is a powerful command

  • It's possible to over-write, duplicate or delete data accidentally with rsync
  • Ensure you understand the options you use
  • Check source and destination are the right way round
  • Check whether a trailing slash / is needed on the source and destination paths (a trailing slash on the source changes what is copied)
  • Always do a 'dry run'
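The trailing-slash check can be tried out safely on scratch data. In this sketch, which uses hypothetical directory names under a temporary directory, a source path without a trailing slash copies the directory itself, while a source path with a trailing slash copies only its contents.

```shell
# Scratch area with hypothetical names; nothing here touches real data.
work=$(mktemp -d)
mkdir -p "$work/TestDir" "$work/dest_a" "$work/dest_b"
touch "$work/TestDir/testfile1" "$work/TestDir/testfile2"

# No trailing slash on the source: the directory itself is copied,
# giving dest_a/TestDir/testfile1 and dest_a/TestDir/testfile2
rsync -rlt "$work/TestDir" "$work/dest_a"

# Trailing slash on the source: only the contents are copied,
# giving dest_b/testfile1 and dest_b/testfile2
rsync -rlt "$work/TestDir/" "$work/dest_b"

ls "$work/dest_a"
ls "$work/dest_b"
```

Getting this wrong is a common way to end up with an unwanted extra directory level, or with files spilled into the wrong place, at the destination.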

Try out a dry run:

[userid@login01 ~]$ cd /nobackup/proj/training/userid/
[userid@login01 userid]$ rsync -trlv --inplace TestDir /rdw/03/rse-hpc/training/userid --dry-run

sending incremental file list
TestDir/
TestDir/testfile1
TestDir/testfile2

sent 121 bytes  received 26 bytes  294.00 bytes/sec
total size is 0  speedup is 0.00 (DRY RUN)

Run ‘for real’:

[userid@login01 userid]$ rsync -trlv --inplace TestDir /rdw/03/rse-hpc/training/userid

sending incremental file list
created directory /rdw/03/rse-hpc/training/userid
TestDir/
TestDir/testfile1
TestDir/testfile2

sent 197 bytes  received 415 bytes  408.00 bytes/sec
total size is 0  speedup is 0.00

Re-run the rsync transfer command. This gives you confidence that nothing was missed. The second run should be very fast, as rsync only needs to compare the files at each end and should find nothing left to copy.
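One way to confirm that a re-run found nothing left to copy is to add --itemize-changes, which prints a line for each item rsync updates. This is a minimal sketch on hypothetical scratch directories, not a command to run against your share as written; an empty result from the second run indicates the destination already matches the source.

```shell
# Scratch source and destination with hypothetical names.
work=$(mktemp -d)
mkdir -p "$work/TestDir" "$work/dest"
echo "data" > "$work/TestDir/testfile1"

# First run copies the data
rsync -rlt "$work/TestDir" "$work/dest"

# Second run: capture the itemized list of changes, which should be empty
second_run=$(rsync -rlt --itemize-changes "$work/TestDir" "$work/dest")

if [ -z "$second_run" ]; then
    echo "destination is up to date"
else
    echo "items still needing transfer:"
    echo "$second_run"
fi
```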

Output to a log file

Try out a dry run:

rsync --dry-run -rltv --inplace --itemize-changes --progress --stats --whole-file --size-only /nobackup/myusername/source /rdw/path/to/my/share/destination/ 2>&1 | tee /home/myusername/meaningful-log-name.log1

Run ‘for real’:

rsync -rltv --inplace --itemize-changes --progress --stats --whole-file --size-only /nobackup/myusername/source /rdw/path/to/my/share/destination/ 2>&1 | tee /home/myusername/meaningful-log-name.log2

rsync options

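Those familiar with rsync will often use the -av option. The -a flag is shorthand for -rlptgoD, so it tries to preserve permissions, owner and group as well as timestamps, which leads to group modification errors on RDW. For Rocket and RDW, replace -av with -rltv, which keeps recursion (-r), symbolic links (-l) and modification times (-t) but does not try to set permissions, owner or group.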

Note: RDW has a super-fast connection to Rocket, which means that compressing and uncompressing the data takes more resource than the transfer itself. For this reason the examples here do not use the compression option (-z).

Troubleshooting

Read the error messages - Not all errors are a cause for concern:

files/attrs were not transferred

This error may be returned because RDW doesn't 'know' about Rocket's groups.

Long transfers being halted by permissions errors

This usually means the ‘Kerberos ticket’ for your user has expired. Kerberos tickets allow the system to know what your user is allowed to do. For security, they automatically expire at a set time after you log in. Most of the time these tickets are automatically renewed while you’re working, but they can expire during long copy commands. You need to take two steps to avoid timeouts:

  • Firstly, you will periodically need to renew your Kerberos authentication ticket, which controls your access to '/rdw' and expires after 10 hours. The 'krenew' command will do the renewal automatically for up to a week.
  • Secondly, to stop your process being killed if you are logged out, run it within a tmux session, then detach from your login. You can reattach later if necessary.

An example session might look like this. Start a new tmux session, then run the command to copy your data to /rdw:

$ tmux
$ krenew -v -- bash -c 'rsync -trlv --inplace /nobackup/myuser/ /rdw/myshare/ >> mylogfile'

Data Transfer over Fast Connections

Transfers from Rocket to RDW don’t leave our fast data centre network. The options needed are the same as for using rsync for disk-to-disk transfers within the same machine: leave out compression (-z) and add --inplace.

Why not -z? Compression uses lots of CPU and this becomes a bottleneck once network speed is fast enough.

Why --inplace? By default, rsync writes each file to a temporary file at the destination and moves it into place once complete, which places extra load on the CPU and disk. --inplace tells rsync to skip the temporary file and write straight to the destination file. It doesn’t matter if the connection is interrupted, because re-running rsync picks up from where it stopped.

Rsync over Slow Connections

For rsync over a slow connection like the internet:

