2 Remote Computing
2.1 Learning Objectives
- Understand the basic architecture of computer networks
- Learn how to connect to a remote computer via a shell
- Become familiarized with Bash Shell programming to navigate your computer’s file system, manipulate files and directories, and automate processes
2.2 Introduction
Scientific synthesis and our ability to effectively and efficiently work with big data depends on the use of computers and the internet. Working on a personal computer may be sufficient for many tasks, but as data get larger and analyses more computationally intensive, scientists often find themselves needing more computing resources than they have available locally. Remote computing, or the process of connecting to a computer(s) in another location via a network link is becoming more and more common in overcoming big data challenges.
In this lesson, we’ll learn about the architecture of computer networks and explore some of the different remote computing configurations that you may encounter, we’ll learn how to securely connect to a remote computer via a shell, and we’ll become familiarized with using Bash Shell to efficiently manipulate files and directories. We will begin working in the VS Code IDE (integrated development environment), which is a versatile code editor that supports many different languages.
2.3 Servers & Networking
Remote computing typically involves communication between two or more “host” computers. Host computers connect via networking equipment and can send messages to each other over communication protocols (aka an Internet Protocol, or IP). Host computers can take the role of client or server, where servers share their resources with the client. Importantly, these client and server roles are not inherent properties of a host (i.e. the same machine can play either role).
- Client: the host computer intiating a request
- Server: the host computer responding to a request
Fig 1. Examples of different remote computing configurations. (a) A client uses secure shell protocol (SSH) to login/connect to a server over the internet. (b) A client uses SSH to login/connect to a computing cluster (i.e. a set of computers (nodes) that work together so that they can be viewed as a single system) over the internet. In this example, servers A - I are each nodes on this single cluster. The connection is first made through a gateway node (i.e. a computer that routes traffic from one network to another). (c) A client uses SSH to login/connect to a computing cluser where each node is a virtual machine (VM). In this example, the cluster comprises three servers (A, B, and C). VM1 (i.e. node 1) runs on server A while VM4 runs on server B, etc. The connection is first made through a gateway node.
2.4 IP addressing
Hosts are assigned a unique numerical address used for all communication and routing called an Internet Protocol Address (IP Address). They look something like this: 128.111.220.7. Each IP Address can be used to communicate over various “ports”, which allow multiple applications to communicate with a host without mixing up traffic.
Because IP addresses can be difficult to remember, they are also assigned hostnames, which are handled through the global Domain Name System (DNS). Clients first look up a hostname in the DNS to find the IP address, then open a connection to the IP address.
2.5 Bash Shell Programming
What is a shell? From Wikipedia:
“a computer program which exposes an operating system’s services to a human user or other programs. In general, operating system shells use either a command-line interface (CLI) or graphical user interface (GUI), depending on a computer’s role and particular operation.”
What is Bash? Bash, or Bourne-again Shell, is a command line tool (language) commonly used to manipulate files and directories. Accessing and using bash is slightly different depending on what type of machine you work on:
Mac: bash via the Terminal, which comes ready-to-use with all Macs and Linux machines
Windows: running bash depends on which version of Windows you have – newer versions may ship with bash or may require a separate install (e.g. Windows Subsystem for Linux (WSL) or Git Bash), however there are a number of different (non-bash) shell options as well (they all vary slightly; e.g. PowerShell, Command Prompt).
2.5.1 Some commonly used (and very helpful) bash commands:
Below are just a few bash commands that you’re likely to use. Some may be extended with options (more on that in the next section) or even piped together (i.e. where the output of one command gets sent to the next command, using the |
operator). You can also find some nice bash cheat sheets online, like this one. Alternatively, the Bash Reference Manual has all the content you need, albeit a bit dense.
bash command | what it does |
---|---|
pwd |
print your current working directory |
cd |
change directory |
ls |
list contents of a directory |
tree |
display the contents of a directory in the form of a tree structure (not installed by default) |
echo |
print text that is passed in as an argument |
mv |
move or rename a file |
cp |
copy a file(s) or directory(ies) |
touch |
create a new empty file |
mkdir |
create a new directory |
rm /rmdir |
remove a file/ empty directory (be careful – there is no “trash” folder!) |
grep |
searches a given file(s) for lines containing a match to a given pattern list |
awk |
a text processing language that can be used in shell scripts or at a shell prompt for actions like pattern matching, printing specified fields, etc. |
sed |
stands for Stream Editor; a versatile command for editing files |
cut |
extract a specific portion of text in a file |
join |
join two files based on a key field present in both |
top , htop |
view running processes in a Linux system (press Q to quit) |
2.5.2 General command syntax
Bash commands are typically are written as: command [options] [arguments]
where the command must be an executable on your PATH and where options (settings that change the shell and/or script behavior) take one of two forms: short form (e.g. command -option-abbrev
) or long form (e.g. command --option-name
or command -o option-name
). An example:
# the `ls` command lists the files in a directory
ls file/path/to/directory
# adding on the `-a` or `--all` option lists all files (including hidden files) in a directory
ls -a file/path/to/directory # short form
ls --all file/path/to/directory # long form
ls -o all file/path/to/directory # long form
2.5.3 Some useful keyboard shortcuts
It can sometimes feel messy working on the command line. These keyboard shortcuts can make it a little easier:
Ctrl
+L
: clear your terminal windowCtrl
+U
: delete the current lineCtrl
+C
: abort a command- up & down arrow keys: recall previously executed commands in chronological order
- TAB key: autocompletion
2.6 Connecting to a remote computer via a shell
In addition to navigating your computer/manipulating your files, you can also use a shell to gain accesss to and remotely control other computers. To do so, you’ll need the following:
- a remote computer (e.g. server) which is turned on
- client and server ssh clients installed/enabled
- the IP address or name of the remote computer
- the necessary permissions to access the remote computer
Secure Shell, or SSH, is a network communication protocol that is often used for securely connecting to and running shell commands on a remote host, tremendously simplifying remote computing.
2.7 Git via a shell
Git, a popular version control system and command line tool can be accessed via a shell. While there are lots of graphical user interfaces (GUIs) that faciliatate version control with Git, they often only implement a small subset of Git’s most-used functionality. By interacting with Git via the command line, you have access to all Git commands. While all-things Git is outside the scope of this workshop, we will use some basic Git commands in the shell to clone GitHub (remote) repositories to the server and save/store our changes to files. A few important Git commands:
Git command | what it does |
---|---|
git clone |
create a copy (clone) of repository in a new directory in a different location |
git add |
add a change in the working directory to the staging area |
git commit |
record a snapshot of a repository; the -m option adds a commit message |
git push |
send commits from a local repository to a remote repository |
git fetch |
downloads contents (e.g. files, commits, refs) from a remote repo to a local repo |
git pull |
fetches contents of a remote repo and merges changes into the local repo |
2.8 Let’s practice!
We’ll now use bash commands to do the following:
- connect to the server (included-crab) that we’ll be working on for the remainder of this course
- navigate through directories on the server and add/change/manipulate files
- clone a GitHub repository to the server
- automate some of the above processes by writing a bash script
2.8.1 Exercise 1: Connect to a server using the ssh
command (or using VS Code’s command palette)
Let’s connect to a remote computer (included-crab) and practice using some of above commands.
- Launch a terminal in VS Code
There are two options to open a terminal window, if a terminal isn’t already an open pane at the bottom of VS Code
Click on
Terminal > New Terminal
in top menu barClick on the
+ (dropdown menu) > bash
in the bottom right corner
- Connect to a remote server
- You can choose to SSH into the server (included-crab.nceas.ucsb.edu) through (a) the command line by using the
ssh
command, or (b) through VS Code’s command palette. If you prefer the latter, please refer back to the Log in to the server section. To do so via the command line, use thessh
command followed byyourusername@included-crab.nceas.ucsb.edu
. You’ll be prompted to type/paste your password to complete the login. It should look something like this:
yourusername:~$ ssh yourusername@included-crab.nceas.ucsb.edu
yourusername@included-crab.nceas.ucsb.edu's password:
yourusername@included-crab:~$
2.8.2 Exercise 2: Practice using some common bash commands
Use the
pwd
command to print your current location, or working directory. You should be in your home directory on the server (e.g./home/yourusername
).Use the
ls
command to list the contents (any files or subdirectories) of your home directoryUse the
mkdir
command to create a new directory namedbash_practice
:
mkdir bash_practice
- Use the
cd
command to move into your newbash_practice
directory:
# move from /home/yourusername to home/yourusername/bash_practice
cd bash_practice
- To move up a directory level, use two dots,
..
:
# move from /home/yourusername/bash_practice back to /home/yourusername
$ cd ..
- Add some
.txt
files (file1.txt
,file2.txt
,file3.txt
) to yourbash_practice
subdirectory using thetouch
command (Note: be sure tocd
intobash_practice
if you’re not already there):
# add one file at a time
touch file1.txt
touch file2.txt
touch file3.txt
# or add all files simultanously like this:
touch file{1..3}.txt
# or like this:
touch file1.txt file2.txt file3.txt
- You can also add other file types (e.g.
.py
,.csv
, etc.)
touch mypython.py mycsv.csv
- Print out all the
.txt
files inbash_practice
using a wildcard,*
:
ls *.txt
- Count the number of
.txt
files inbash_practice
by combining thels
andwc
(word count) funtions using the pipe,|
, operator:
# `wc` returns a word count (lines, words, chrs)
# the `-l` option only returns the number of lines
# use a pipe, `|`, to send the output from `ls *.txt` to `wc -l`
ls *.txt | wc -l
- Delete
mypython.py
using therm
command:
rm mypython.py
- Create a new directory inside
bash_practice
calleddata
and movemycsv.csv
into it.
mkdir data
mv mycsv.csv ~/bash_practice/data
# add the --interactive option (-i for short) to prevent a file from being overwritten by accident (e.g. in case there's a file with the same name in the destination location)
mv -i mycsv.csv ~/bash_practice/data
- Use
mv
to renamemycsv.csv
tomydata.csv
mv mycsv.csv mydata.csv
- Add column headers
col1
,col2
,col3
tomydata.csv
usingecho
+ the>
operator
echo "col1, col2, col3" > mydata.csv
- Append a row of data to
mydata.csv
usingecho
+ the>>
operator
# using `>` will overwrite the contents of an existing file; `>>` appends new information to an existing file
echo "1, 2, 3" >> mydata.csv
2.8.3 Exercise 3: Clone a GitHub repository to the server
IDEs commonly have helper buttons for cloning (i.e. creating a copy of) remote repositories to your local computer (or in this case, a server), but using git commands in a terminal can be just as easy. We can practice that now, following the steps below:
- Go to the
scalable-computing-examples
repository on GitHub at https://github.com/NCEAS/scalable-computing-examples – this repo contains example files for you to edit and practice in throughout this course. Fork (make your own copy of the repository) this repo by clicking on the Fork button (top right corner of the repository’s page).
- Once forked, click on the green Code button (from the forked version of the GitHub repo) and copy the URL to your clipboard.
- In the VS Code terminal, use the
git clone
command to create a copy of thescalable-computing-examples
repository in the top level of your user directory (i.e. your home directory) on the server (Note: usepwd
to check where you are; usecd ~/
to navigate back to your home directory if you find that you’re somewhere else).
git clone <url-of-forked-repo>
- You should now have a copy of the
scalable-computing-examples
repository to work on on the server. Use thetree
command to see the structure of the repo (you need to be in thescalable-computing-examples
directory for this to work) – there should be a subdirectory called02-bash-babynames
that contains (i) aREADME.MD
file, (ii) aKEY.sh
file (this is a functioning bash script available for reference; we’ll be recreating it together in the next exercise) and (iii) anamesbystate
folder containing 51.TXT
files and aStateReadMe.pdf
file with some metadata.
2.8.4 Bonus Exercise: Automate data processing with a Bash script
As we just demonstrated, we can use bash commands in the terminal to accomplish a variety of tasks like navigating our computer’s directories, manipulating/creating/adding files, and much more. However, writing a bash script allows us to gather and save our code for automated execusion.
We just cloned the scalable-computing-examples
GitHub repository to the server in Exercise 3 above. This contains a 02-bash-babynames
folder with 51 .TXT
files (one for each of the 50 US states + The District of Columbia), each with the top 1000 most popular baby names in that state. We’re going to use some of the bash commands we learned in Exercise 2 to concatenate all rows of data from these 51 files into a single babynames_allstates.csv
file.
Let’s begin by creating a simple bash script that when executed, will print out the message, “Hello, World!” This simple script will help us determine whether or not things are working as expected before writing some more complex (and interesting) code.
Open a terminal window and determine where you are by using the
pwd
command – we want to be inscalable-computing-examples/02-bash-babynames
. If necessary, navigate here using thecd
command.Next, we’ll create a shell script called
mybash.sh
using thetouch
command:
$ touch mybash.sh
- There are a number of ways to edit a file or script – we’ll use Nano, a terminal-based text editor, as we did earlier. Open your
mybash.sh
with nano by running the following in your terminal:
$ nano mybash.sh
- We can now start to write our script. Some important considerations:
- Anything following a
#
will not be executed as code – these are useful for adding comments to your scripts - The first line of a Bash script starts with a shebang,
#!
, followed by a path to the Bash interpreter – this is used to tell the operating system which interpreter to use to parse the rest of the file. There are two ways to use the shebang to set your interpreter (read up on the pros & cons of both methods on this Stack Overflow post):
# (option a): use the absolute path to the bash binary
#!/bin/bash
# (option b): use the env untility to search for the bash executable in the user's $PATH environmental variable
#!/usr/bin/env bash
- We’ll first specify our bash interpreter using the shebang, which indicates the start of our script. Then, we’ll use the
echo
command, which when executed, will print whatever text is passed as an argument. Type the following into your script (which should be opened with nano), then save (Use the keyboard shortcutcontrol
+X
to exit, then typeY
when it asks if you’d like to save your work. Pressenter/return
to exit nano).
# specify bash as the interpreter
#!/bin/bash
# print "Hello, World!"
echo "Hello, World!"
- To execute your script, use the
bash
command followed by the name of your bash script (be sure that you’re in the same working directory as yourmybash.sh
file or specify the file path to it). If successful, “Hello, World!” should be printed in your terminal window.
bash mybash.sh
- Now let’s write our script. Re-open your script in nano by running
nano mybash.sh
. Using what we practiced above and the hints below, write a bash script that does the following:
- prints the number of
.TXT
files in thenamesbystate
subdirectory - prints the first 10 rows of data from the
CA.TXT
file (HINT: use thehead
command) - prints the last 10 rows of data from the
CA.TXT
file (HINT: use thetail
command) - creates an empty
babynames_allstates.csv
file in thenamesbystate
subdirectory (this is where the concatenated data will be saved to) - adds the column names,
state
,gender
,year
,firstname
,count
, in that order, to thebabynames_allstates.csv
file - concatenates data from all
.TXT
files in thenamesbystate
subdirectory and appends those data to thebabynames_allstates.csv
file (HINT: use thecat
command to concatenate files)
Here’s a script outline to fill in (Note: The echo
statements below are not necessary but can be included as progress indicators for when the bash script is executed – these also make it easier to diagnose where any errors occur during execution):
#!bin/bash
echo "THIS IS THE START OF MY SCRIPT!"
echo "-----Verify that we have .TXT files for all 50 states + DC-----"
<add your code here>
echo "-----Printing head of CA.TXT-----"
<add your code here>
echo "-----Printing tail of CA.TXT-----"
<add your code here>
echo "-----Creating empty .csv file to concatenate all data-----"
<add your code here>
echo "-----Adding column headers to csv file-----"
<add your code here>
echo "-----Concatenating files-----"
<add your code here>
echo "DONE!"