====== File Transfers and RDW ======

===== Monitoring RDW shares =====

RDW is file storage for research, hosted on our campus datacentre. It is possible to view the files from both Windows and Linux. Permissions on the files are set using your campus login, so it’s essential that you are logged in as the correct campus user in order to see your files.

==== Checking the size of data ====

Check the size of your data before transferring; this will give you an idea of how much space you’ll need on the destination and how long the transfer may take. [[Windows File Explorer]] will give you an estimate of directory size with right-click and ‘properties’, but you can get much better information at the Linux command line using ''du''. Try this tutorial: https://www.geeksforgeeks.org/du-command-linux-examples/

===== Principles for setting up large file transfers =====

  * Prioritise the most important data and plan the directory structure to help you find data later.
  * Keep what you need - consider whether it's necessary to keep large data that can be easily re-downloaded or re-created.
  * Include a ‘README’ text file at the top level directory for a project or section, with information about the data, how it was generated and its expected uses.
  * //Take a look// at source and destination files before you start.
  * Use a file synchronisation tool, like rsync (Linux) or robocopy (Windows), which can keep track of the copy and only move data which isn’t already present at the destination.
  * Start small
    * Try a ‘dry run’ to check your command
    * Try a small amount of data / number of files
    * Try a small number of directories
  * Try a ‘dry run’ for the whole transfer and output to a log file //(this will be quick because no data is moved)//
  * Run the transfer and output to a log file
  * Run the transfer again to confirm success

=== Large data transfer tips ===

**Consider the route the data takes**

If you run a copy command on a laptop at home to copy data from Rocket to RDW, the data will have to move off campus to your laptop and back again via the (much slower) internet connection of your laptop. If you log in to Rocket from your laptop and run a copy command to RDW, the data will move straight from Rocket to RDW without leaving the (much faster) data centre network.

**Avoid graphical copy/paste.** It's slow and error-prone; instead of a graphical file manager, use a //**file synchronisation**// tool, which can keep track of the copy and only move data which isn’t already present at the destination. This means that a transfer which fails part-way through is not wasted. Next time the command is run, the transfer will pick up from where it stopped.

===== Viewing and copying to RDW on Linux Command Line =====

==== Viewing your RDW share on Rocket ====

When an RDW share is set up, you will be provided with its Windows share name. Navigate to your RDW share on Rocket from the login node. RDW is split up ‘behind the scenes’ into numbered blocks, for admin purposes. This means there will be a number between 01 and 08 after ''/rdw'' in the path, so the path will be like: ''/rdw/05/share_name''

To find a new share called “share-name”, use:

  $ cd /rdw
  $ find -maxdepth 2 -type d -name "share-name"

NB: Groups with more than 1 project may have a super-directory on RDW like ''/rdw/02/group/share_name''. In that case, increase ''-maxdepth'' to 3 so that ''find'' looks one level deeper.

==== Using rsync to copy to RDW ====

For large data, you may find the ''scp'' command limiting.
The [[https://rsync.samba.org/|rsync]] utility provides advanced features for file transfer and is typically faster than both ''scp'' and ''sftp''. It is especially useful for transferring large and/or many files. The syntax is similar to ''cp'' and ''scp''. Rsync can be used on a locally mounted filesystem or a remote filesystem.

=== rsync is a powerful command ===

  * It's possible to //**over-write**, **duplicate** or **delete data**// accidentally with ''rsync''
  * Ensure you understand the options you use
  * Check source and destination are the right way round
  * Check whether a trailing slash ''/'' is needed in the source and destination paths (on the source, ''TestDir'' copies the directory itself, while ''TestDir/'' copies only its contents)
  * Always do a 'dry run'

=== Try out a dry run: ===

  [userid@login01 ~]$ cd /nobackup/proj/training/userid/
  [userid@login01 userid]$ rsync -trlv --inplace TestDir /rdw/03/rse-hpc/training/userid --dry-run
  sending incremental file list
  TestDir/
  TestDir/testfile1
  TestDir/testfile2
  sent 121 bytes received 26 bytes 294.00 bytes/sec
  total size is 0 speedup is 0.00 (DRY RUN)

=== Run ‘for real’: ===

  [userid@login01 userid]$ rsync -trlv --inplace TestDir /rdw/03/rse-hpc/training/userid
  sending incremental file list
  created directory /rdw/03/rse-hpc/training/userid
  TestDir/
  TestDir/testfile1
  TestDir/testfile2
  sent 197 bytes received 415 bytes 408.00 bytes/sec
  total size is 0 speedup is 0.00

**Re-run the rsync transfer command:** This gives you confidence that nothing was missed. The second run should be very fast, as rsync only needs to compare the source and destination and should find no data left to copy.
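The dry run, real run and re-run sequence above can be tried safely away from Rocket and RDW. The following is a minimal sketch, assuming ''rsync'' and ''mktemp'' are available; the temporary directories and the variable names (''work'', ''pending'' and so on) are made up for illustration and stand in for your real source and RDW share.

```shell
# Work in a throwaway directory so that no real data is touched.
work=$(mktemp -d)
mkdir -p "$work/source/TestDir" "$work/dest"
echo "one" > "$work/source/TestDir/testfile1"
echo "two" > "$work/source/TestDir/testfile2"

# 1. Dry run: lists what would be copied, but moves no data.
rsync -rltv --inplace --dry-run "$work/source/TestDir" "$work/dest"
files_after_dry_run=$(find "$work/dest" -type f | wc -l)

# 2. Real run: the same command without --dry-run.
rsync -rltv --inplace "$work/source/TestDir" "$work/dest"
files_after_real_run=$(find "$work/dest" -type f | wc -l)

# 3. Check run: without -v, --itemize-changes prints one line per item
#    that still differs, so no output means nothing is left to copy.
pending=$(rsync -rlt --inplace --dry-run --itemize-changes \
    "$work/source/TestDir" "$work/dest" | wc -l)

echo "files at destination after dry run:  $files_after_dry_run"
echo "files at destination after real run: $files_after_real_run"
echo "items still to transfer on re-run:   $pending"
```

The first count should stay at zero because a dry run copies nothing, and the last count should be zero once the real run has succeeded. Remove the temporary directory (''rm -r "$work"'') when you have finished.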
==== Output to a log file ====

=== Try out a dry run: ===

  rsync --dry-run -rltv --inplace --itemize-changes --progress --stats --whole-file --size-only /nobackup/myusername/source /rdw/path/to/my/share/destination/ 2>&1 | tee /home/myusername/meaningful-log-name.log1

=== Run ‘for real’ ===

  rsync -rltv --inplace --itemize-changes --progress --stats --whole-file --size-only /nobackup/myusername/source /rdw/path/to/my/share/destination/ 2>&1 | tee /home/myusername/meaningful-log-name.log2

=== rsync options ===

Those familiar with ''rsync'' will often use the ''-av'' option, which preserves permissions but leads to group modification errors on RDW. For Rocket and RDW, replace ''-av'' with ''-rltv'':

  * ''-r'' = recurse through subdirectories
  * ''-l'' = copy symlinks
  * ''-t'' = preserve timestamps
  * ''-v'' = verbose
  * ''--inplace --whole-file --size-only'' speed up transfer and prevent rsync filling up space with a large temporary directory
  * ''--itemize-changes --progress --stats'' for more informative output
  * ''| tee'' sends output both to the screen and to a log file
  * Use ''man rsync'' for more information on options.

**Note:** RDW has a super-fast connection to Rocket, which means that it takes more resource to compress and un-compress the data than it does to do the transfer. This is why the commands above do not use compression (''-z'').

===== Troubleshooting =====

**Read the error messages** - Not all errors are a cause for concern:

==== files/attrs were not transferred ====

''files/attrs were not transferred'': This error may be returned because RDW doesn't 'know' about Rocket's groups.

  * Applying group permissions from Rocket will fail because RDW has 'trumped' our local permissions and imposed its own permissions.
  * This would not prevent the transfer if only the ''group'' attribute of the file couldn't be transferred: ''rsync: chgrp ... failed: Invalid argument (22)''
  * RDW shares are set up to [[https://services.ncl.ac.uk/itservice/core-services/filestore/grouper-permissions/|allow access for users permitted by the PI]].
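When a log file contains many ''chgrp'' messages, it can help to separate them from errors that do need attention. The sketch below is only an illustration: the log lines are made-up samples modelled on the message quoted above (abbreviated, with placeholder paths), and the variable names are hypothetical. In practice you would point ''log'' at the file written by ''tee''.

```shell
# Illustrative sample log; real logs will contain your own paths.
log=$(mktemp)
cat > "$log" <<'EOF'
sending incremental file list
rsync: chgrp "/rdw/path/to/my/share/destination/source/file1" failed: Invalid argument (22)
rsync: chgrp "/rdw/path/to/my/share/destination/source/file2" failed: Invalid argument (22)
rsync: send_files failed to open "/nobackup/myusername/source/file3": Permission denied (13)
rsync error: some files/attrs were not transferred (see previous errors) (code 23)
EOF

# Count the group-ownership errors, which are expected on RDW.
chgrp_errors=$(grep -c '^rsync: chgrp .* failed' "$log")

# Any other per-file error line deserves a closer look.
other_errors=$(grep '^rsync: ' "$log" | grep -v '^rsync: chgrp .* failed')

echo "chgrp errors (expected on RDW): $chgrp_errors"
echo "other errors to investigate:"
echo "$other_errors"
```

If the second list is empty, the ''files/attrs were not transferred'' summary most likely refers only to the group attribute; re-running the transfer is still the best confirmation.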
==== Long transfers being halted by permissions errors ====

This usually means the ‘Kerberos ticket’ for your user has expired. Kerberos tickets allow the system to know what your user is allowed to do. For security, they automatically expire at a set time after you log in. Most of the time these tickets are automatically renewed while you’re working, but they can expire during long copy commands.

You need to take two steps to avoid timeouts.

  * Firstly, you will periodically need to renew your Kerberos authentication ticket, which controls your access to ''/rdw'' and expires after 10 hours. The ''krenew'' command will do the renewal automatically for up to a week.
  * Secondly, to stop your process being killed if you are logged out, run it within a tmux session, then detach from your login. You can reattach later if necessary.

An example session might look like this:

Start a new tmux session: ''tmux''

== Run the command to copy your data to /rdw: ==

  $ tmux
  $ krenew -v -- bash -c 'rsync -trlv --inplace /nobackup/myuser/ /rdw/myshare/ >> mylogfile'

  * Detach the tmux session with ''Ctrl-b'' then ''d'' (the tmux default key binding)
  * If necessary, start tmux again and attach to your previous session: ''tmux attach''

=== Data Transfer over Fast Connections ===

Transfers from Rocket to RDW don’t leave our fast data centre network. The options needed are the same as for using rsync for disk to disk transfers in the same machine:

  * DON'T use compression ''-z''
  * DO use ''--inplace''

Why not ''-z''? Compression uses lots of CPU and this becomes a bottleneck once network speed is fast enough.

Why ''--inplace''? Rsync usually writes each file to a temporary copy at the destination and renames it once it is complete, which places extra load on the CPU and disk. ''--inplace'' tells rsync not to create the temp file but to write the data straight into the destination file. It doesn’t matter if the connection is interrupted, because re-running the same command picks up from where the transfer stopped.

=== Rsync over Slow Connections ===

For rsync over a slow connection like the internet:

  * DO use compression ''-z''
  * DON’T use ''--inplace''.
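The two sets of advice above can be summarised in a small helper. This is a sketch only: ''rsync_opts'' is a hypothetical function name, not part of rsync or of Rocket, and ''SOURCE'' and ''DEST'' are placeholders. It prints the suggested command rather than running it.

```shell
# rsync_opts is a hypothetical helper that returns the option set
# suggested in this page for each kind of connection.
rsync_opts() {
    case "$1" in
        fast) printf '%s\n' "-rltv --inplace" ;;  # data centre network: no -z
        slow) printf '%s\n' "-rltv -z" ;;         # internet: compress, no --inplace
        *)    echo "usage: rsync_opts fast|slow" >&2; return 1 ;;
    esac
}

fast_opts=$(rsync_opts fast)
slow_opts=$(rsync_opts slow)

# Print the commands instead of running them; start with a dry run.
echo "Rocket to RDW:     rsync $fast_opts --dry-run SOURCE DEST"
echo "Over the internet: rsync $slow_opts --dry-run SOURCE DEST"
```

Keeping the choice in one place makes it harder to carry ''-z'' into a Rocket to RDW transfer by accident.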
-----

[[faq:020|Back to RDW FAQ]]