Using your Big Data Cluster Account

This tutorial will walk you through your home directory, some useful Linux information, what to expect in your new account, the module utility for loading software, and the specialized storage areas available to your account.

Connecting to the Big Data Cluster

The only nodes with a connection to the outside network are the login nodes, also called edge nodes. From the outside, they are reached at the hostname login.hadoop.umbc.edu. To log in to the system, you must use a secure shell client such as ssh from Unix/Linux, PuTTY from Windows, or similar. For example, suppose we are connecting to the Big Data Cluster from taki:

[reetam1@taki-usr1 ~]$ ssh reetam1@login.hadoop.umbc.edu
WARNING: UNAUTHORIZED ACCESS to this computer is in violation of Criminal Law
         Article section 8-606 and 7-302 of the Annotated Code of MD.

NOTICE:  This system is for the use of authorized users only. Individuals using
         this computer system without authority, or in excess of their authority
         , are subject to having all of their activities on this system
         monitored and recorded by system personnel.

reetam1@login.hadoop.umbc.edu's password:
Last login: Tue Sep 10 22:01:03 2019 from taki-usr1.rs.umbc.edu

UMBC Division of Information Technology                    http://doit.umbc.edu/
--------------------------------------------------------------------------------
If you have any questions or problems regarding these systems, please call the
DoIT Technology Support Center at 410-455-3838, or submit your request on the
web by visiting http://my.umbc.edu/help/request

Remember that the Division of Information Technology will never ask for your
password. Do NOT give out this information under any circumstances.
--------------------------------------------------------------------------------

-bash-4.2$

Replace “reetam1” with your UMBC username (the one you use to log in to myUMBC and taki). You will be prompted for your password when connecting; your password is your myUMBC/taki password.

As another example, suppose we’re SSHing to the Big Data Cluster from a Windows machine with PuTTY. When setting up a connection, use “login.hadoop.umbc.edu” as the hostname. Once you connect, you will be prompted for your username and password, as mentioned above.

How to copy files to and from the Big Data Cluster

Note: This is out of date and will be edited.

Probably the most general way to transfer files between machines is with Secure Copy (scp). Because some remote filesystems may be mounted on taki, it may also be possible to transfer files using “local” file operations like cp, mv, etc.

The taki cluster only allows secure connection from the outside. Secure Copy is the file copying program that is part of Secure Shell (SSH). To transfer files to and from taki, you must use scp or compatible software (such as WinSCP or SSHFS). On Unix machines such as Linux or MacOS X, you can execute scp from a terminal window.
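
For example, to copy a file from taki to the /scratch area on the edge node, and to copy a results file back, you could run commands along the following lines from a taki terminal (the file names mydata.csv and results.txt are only placeholders; replace the username and directory with your own):

[reetam1@taki-usr1 ~]$ scp mydata.csv reetam1@login.hadoop.umbc.edu:/scratch/reetam/
[reetam1@taki-usr1 ~]$ scp reetam1@login.hadoop.umbc.edu:/scratch/reetam/results.txt .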

Setting bash as your default shell on the cluster

When you log in for the first time, the default shell will not be the bash shell you might be familiar with from taki. However, bash needs to be set as the default before you can use all of the cluster’s resources. This is a one-time change made through your myUMBC settings:

  1. On your web browser, navigate to https://webadmin.umbc.edu/.
  2. Under the My Account section, click on the link titled Edit my Shell and Unix Settings.
  3. At this point, if you are not logged in already to myUMBC, you will be asked to do so.
  4. On the Change Your UNIX Account Settings page that you are led to, select the radio button for bash – the last option under Login Shells.
  5. Click on the Change your Unix Shell button to save, and exit the page.
  6. You will need to log back on to the Big Data Cluster for the changes to take effect.

You can also verify whether your current shell is bash:

-bash-4.2$ echo $SHELL
/bin/bash

Updating your .bashrc with environment variables

At this point, your default shell has been switched to bash, and the next step is to edit your ~/.bashrc before you can start using the cluster. The ~/.bashrc is likely empty right now; edit it using an editor like vi or nano and add the following content:

export JAVA_HOME='/usr/java/jdk1.8.0_181-cloudera'
export PYSPARK_PYTHON=python3

Save and exit once you are done. To make the changes take effect, either source your bash file by using the command source ~/.bashrc or simply log out and log back into the cluster.
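
To confirm that the variables are set, you can source the file and echo them (an illustrative session; the values shown are simply those added above):

-bash-4.2$ source ~/.bashrc
-bash-4.2$ echo $JAVA_HOME
/usr/java/jdk1.8.0_181-cloudera
-bash-4.2$ echo $PYSPARK_PYTHON
python3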

A brief tour of your account

This section assumes that you have logged in as described above. It also assumes that you are a member of a research group; otherwise you might not have some of the storage areas.

Home directory

At any given time, the directory that you are currently in is referred to as your current working directory. Since you just logged in, your home directory is your current working directory. Your AFS storage area serves as the home directory for the big data cluster. The command “pwd” tells you the full path of the current working directory, so let us run pwd to see where your home directory really is:

-bash-4.2$ pwd
/afs/umbc.edu/users/r/e/reetam1/home

Scratch storage

Each of the 8 big data nodes, as well as the edge node, has local /scratch storage. The space in this area is shared among the current users of the node, so the amount available will vary based on system usage. The total space is 800 GB, and it is intended to be the location from which large data transfers into HDFS originate.

-bash-4.2$ cd /scratch/
-bash-4.2$ pwd
/scratch
-bash-4.2$ df -h /scratch/
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-scratchlv  800G   22G  779G   3% /scratch

Hadoop Distributed File System (HDFS)

The HDFS serves as the base location where the data needs to be placed before any sort of analysis can be done on it. You can view the files you have deposited onto HDFS by using the command “hdfs dfs -ls”.

-bash-4.2$ hdfs dfs -ls
Found 4 items
drwx------   - reetam1 reetam1          0 2019-09-02 10:00 .Trash
drwxr-xr-x   - reetam1 reetam1          0 2019-09-24 11:54 .sparkStaging
drwx------   - reetam1 reetam1          0 2019-09-06 18:58 .staging
drwxr-xr-x   - reetam1 reetam1          0 2019-09-01 08:51 test

The following sections will go into more details on how to access and work with files in HDFS using both a GUI as well as from the command line.

Accessing files and folders on the Big Data Cluster

General Linux commands work on the cluster much as they do on taki, but there are some key differences when it comes to accessing your data. Most importantly, anything that requires access to HDFS requires specialized commands. Additionally, with some configuration, you can interact with several aspects of the cluster through a web user interface in the form of a GUI.

Accessing the Web UI for Hadoop, Spark, Yarn and Hive

Configuring Firefox proxy to view the Web UIs

While the following steps should in theory work with any browser, they have currently only been tested with Firefox. We want to apply the following settings to the browser only; other browsers often apply such settings at the system level, which is not desirable.

  1. Go to the Settings for your Firefox browser – on Windows platforms it is accessible via a hamburger menu near the top right of the browser.
  2. Scroll down under General settings until you find Network Settings near the end. Alternatively, search for Network Settings using the search bar on the top of the page [screenshot].
  3. Click on the Settings button.
  4. In the window that opens, make the following changes [screenshot]:
    1. Select the radio button for Manual proxy configuration.
    2. Set SOCKS Host to localhost, and its port to 9090.
    3. Select the radio button for SOCKS v5.
    4. Tick the checkbox for Proxy DNS when using SOCKS v5.
    5. Click OK and close the Settings page.

Once the proxy settings have been applied to the browser, you need to SSH to the cluster with port forwarding enabled. Log in to the cluster using the following command:

 ssh reetam1@login.hadoop.umbc.edu -D 9090

You can now access the UIs from your Firefox web browser while you remain logged into the cluster.

Accessing HDFS UI

The HDFS Web UI can be accessed by opening the following link from your Firefox browser – http://10.3.0.2:9870/dfshealth.html#tab-overview

 

HDFS commands on the cluster

Running jobs on the cluster requires your data to be present on HDFS, a high-throughput file system suited to working with large quantities of data. When you first log in, however, your data will usually be sitting on the /scratch partition and needs to be moved to HDFS. Similarly, once you have results, you will want to move them back to the /scratch partition. Both of these tasks require knowledge of HDFS commands, which are similar to their Unix counterparts. A list of commands, with examples of their usage, can be found online in Hadoop’s documentation. A demonstration of some of the common commands follows.

First we go over to the /scratch partition, where one of the subfolders has some test files and folders we can work with.

-bash-4.2$ cd /scratch/reetam
-bash-4.2$ ls -l
total 1421524
drwxr-xr-x 2 reetam1 rpc 88 Sep 6 19:01 grep-wiki-out-2 
drwxr-xr-x 2 reetam1 rpc 6 Sep 26 18:50 testdir 
-rwxrwx--- 1 reetam1 rpc 1455637416 Aug 2 17:52 testfile

Let’s also take a quick look at the files I have already put on HDFS. The commands are very similar to the ones above, but HDFS commands need to start with ‘hdfs dfs’:

-bash-4.2$ hdfs dfs -ls
Found 6 items
drwx------ - reetam1 reetam1 0 2019-09-02 10:00 .Trash
drwxr-xr-x - reetam1 reetam1 0 2019-09-24 11:54 .sparkStaging
drwx------ - reetam1 reetam1 0 2019-09-06 18:58 .staging
drwxr-xr-x - reetam1 reetam1 0 2019-09-06 18:58 grep-wiki-out-2
drwxr-xr-x - reetam1 reetam1 0 2019-09-01 08:51 test
drwxr-xr-x - reetam1 reetam1 0 2019-09-06 18:54 wordcount

Let’s create an empty file and see how much space the files I have placed on HDFS take up.

-bash-4.2$ hdfs dfs -touchz sample
-bash-4.2$ hdfs dfs -du -h
0        0        .Trash
0        0        .sparkStaging
0        0        .staging
9.2 M    27.6 M   grep-wiki-out-2
0        0        sample
0        0        test
350.0 M  1.0 G    wordcount

Now let’s move some files and folders from /scratch to HDFS.

-bash-4.2$ hdfs dfs -moveFromLocal testdir testdir_on_hdfs 
-bash-4.2$ hdfs dfs -moveFromLocal testfile testdir_on_hdfs/testfile_on_hdfs
-bash-4.2$ hdfs dfs -ls
Found 8 items
drwx------ - reetam1 reetam1 0 2019-09-26 19:23 .Trash
drwxr-xr-x - reetam1 reetam1 0 2019-09-24 11:54 .sparkStaging
drwx------ - reetam1 reetam1 0 2019-09-06 18:58 .staging
drwxr-xr-x - reetam1 reetam1 0 2019-09-06 18:58 grep-wiki-out-2
-rw-r--r-- 3 reetam1 reetam1 0 2019-09-26 18:58 sample
drwxr-xr-x - reetam1 reetam1 0 2019-09-01 08:51 test
drwxr-xr-x - reetam1 reetam1 0 2019-09-26 19:26 testdir_on_hdfs
drwxr-xr-x - reetam1 reetam1 0 2019-09-06 18:54 wordcount
-bash-4.2$ hdfs dfs -ls testdir_on_hdfs
Found 1 items
-rw-r--r-- 3 reetam1 reetam1 1455637416 2019-09-26 19:26 testdir_on_hdfs/testfile_on_hdfs

Now let’s copy the files back to /scratch, check that they were copied, and remove the files from HDFS.

-bash-4.2$ hdfs dfs -copyToLocal testdir_on_hdfs testdir_scratch
-bash-4.2$ ls -l
total 0
drwxr-xr-x 2 reetam1 rpc 88 Sep 6 19:01 grep-wiki-out-2 
drwxr-xr-x 2 reetam1 rpc 30 Sep 26 19:38 testdir_scratch 
-bash-4.2$ ll testdir_scratch/
total 1421524
-rw-r--r-- 1 reetam1 rpc 1455637416 Sep 26 19:38 testfile_on_hdfs

As always, please be very careful when using ‘rm -r’ to delete directories that are not empty!

-bash-4.2$ hdfs dfs -rm -r testdir_on_hdfs
19/09/26 19:40:01 INFO fs.TrashPolicyDefault: Moved: 'hdfs://worker2.hdp-internal:8020/user/reetam1/testdir_on_hdfs' to 
trash at: hdfs://worker2.hdp-internal:8020/user/reetam1/.Trash/Current/user/reetam1/testdir_on_hdfs1569541201407 
-bash-4.2$ hdfs dfs -ls
Found 7 items
drwx------ - reetam1 reetam1 0 2019-09-26 19:23 .Trash
drwxr-xr-x - reetam1 reetam1 0 2019-09-24 11:54 .sparkStaging
drwx------ - reetam1 reetam1 0 2019-09-06 18:58 .staging
drwxr-xr-x - reetam1 reetam1 0 2019-09-06 18:58 grep-wiki-out-2
-rw-r--r-- 3 reetam1 reetam1 0 2019-09-26 18:58 sample
drwxr-xr-x - reetam1 reetam1 0 2019-09-01 08:51 test
drwxr-xr-x - reetam1 reetam1 0 2019-09-06 18:54 wordcount

Things to check on your new account

Note: while these remain generally the same, they need to be tested and updated with new examples from within the cluster.

Group membership

Your account has membership in one or more Unix groups. On taki, groups are usually (but not always) organized by research group and named after the PI; students in a class are in the ‘student’ group. The primary purpose of these groups is to facilitate sharing of files with other users, through the Unix permissions system. To see your Unix groups, use the groups command:

[gobbert@taki-usr1 ~]$ groups
pi_gobbert 

In the example above, the user is a member of the pi_gobbert group.

If any of the symbolic links to storage areas do not exist, you may create them using the following commands. You only need to do this once. We suggest that you repeat it for each PI if you are a member of multiple research groups.

[gobbert@taki-usr1 ~]$ ln -s /umbc/xfs1/gobbert/common ~/gobbert_common
[gobbert@taki-usr1 ~]$ ln -s /umbc/xfs1/gobbert/users/gobbert ~/gobbert_user

Umask

By default, your account will have a line in ~/.bashrc which sets your “umask”:

umask 007

The umask determines the default permissions for new files and directories you create. Usually when you create a file, you do not specify its permissions explicitly. A umask of 007 ensures that users outside your group have no access to your new files. See the Wikipedia entry for umask for more explanation and examples. On taki, the storage areas’ permissions are already set up to enforce specific styles of collaboration. We have selected 007 as the default umask so that sharing with your group remains possible while access by outside users is prevented. If you generally want to prevent your group from modifying your files (for example), even in the shared storage areas, you may want to use a more restrictive umask.
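
For example, with umask 007 in effect, a newly created file is readable and writable by you and your group, but carries no permissions for other users (an illustrative session; the file name, group, and date are placeholders):

[gobbert@taki-usr1 ~]$ umask
0007
[gobbert@taki-usr1 ~]$ touch newfile
[gobbert@taki-usr1 ~]$ ls -l newfile
-rw-rw---- 1 gobbert pi_gobbert 0 Jun 14 17:45 newfile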

Please run the following command to check your .bashrc file:

[gobbert@taki-usr1 ~]$ more .bashrc
# .bashrc
# Set the permissions to limit the default read privs to only the user
umask 007

# Source global definitions
if [ -f /etc/bashrc ]; then
        . /etc/bashrc
fi

# Loads the default module environment on taki
module load default-environment

# User specific aliases and functions

More about permissions

Standard Unix permissions are used on taki to control which users have access to your files. We have already seen some examples of this. It is important to emphasize that this is the mechanism that determines the degree of sharing, and conversely the privacy, of your work on this system. In setting up your account, we have taken a few steps to simplify things, assuming you use the storage areas for the basic purposes they were designed for. This should be sufficient for many users, but you can also customize your use of the permissions system if you need additional privacy, want to share with additional users, etc.

Changing a file’s permissions

For existing files and directories, you can modify permissions with the “chmod” command. As a very basic example:

[gobbert@taki-usr1 ~]$ ll tmpfile 
-rwxrwxr-x 1 gobbert pi_gobbert 0 Jun 14 17:50 tmpfile
[gobbert@taki-usr1 ~]$ chmod o-rwx tmpfile 
[gobbert@taki-usr1 ~]$ chmod ug-x tmpfile
[gobbert@taki-usr1 ~]$ ll tmpfile 
-rw-rw---- 1 gobbert pi_gobbert 0 Jun 14 17:50 tmpfile

See “man chmod” for more information, or the Wikipedia page for chmod.

Changing a file’s group

For users in multiple groups, you may find the need to change a file’s ownership to a different group. This can be accomplished on a file-by-file basis with the “chgrp” command:

[araim1@taki-usr1 ~]$ touch tmpfile
[araim1@taki-usr1 ~]$ ll tmpfile
-rw-rw---- 1 araim1 pi_nagaraj 0 Jun 14 18:00 tmpfile
[araim1@taki-usr1 ~]$ chgrp pi_gobbert tmpfile
[araim1@taki-usr1 ~]$ ll tmpfile
-rw-rw---- 1 araim1 pi_gobbert 0 Jun 14 18:00 tmpfile

Setting permissions and changing groups for folders

For folders containing files owned by multiple users within the same group, file and folder permissions need to be set appropriately and need to propagate correctly through sub-directory trees. The following example shows how to change the ownership of a folder to a different group, and then allow users from that group to access the files and create sub-folders with the correct permissions. Since only the owner of a file can change its group or permissions, each user should execute these commands on their own files from time to time. In particular, the chgrp may be needed if files are copied in from other Unix systems with other groups, or from other groups on taki.

[reetam1@taki-usr1 ~]$ mkdir tmpfolder
[reetam1@taki-usr1 ~]$ ll
drwxrwx--- 2 reetam1 pi_nagaraj  2 Sep  4 17:53 tmpfolder/
[reetam1@taki-usr1 ~]$ chgrp -R pi_gobbert tmpfolder
[reetam1@taki-usr1 ~]$ chmod -R g+rwX tmpfolder
[reetam1@taki-usr1 ~]$ chmod -R o-rwx tmpfolder
[reetam1@taki-usr1 ~]$ ll
drwxrwx--- 2 reetam1 pi_gobbert  2 Sep  4 17:53 tmpfolder/
[reetam1@taki-usr1 ~]$ find tmpfolder -type d -exec chmod g+rws {} \;
[reetam1@taki-usr1 ~]$ ll
drwxrws--- 2 reetam1 pi_gobbert  2 Sep  4 17:53 tmpfolder/

The “-R” is recursive, so it fixes entire sub-directory trees. The capital X in chmod -R g+rwX is purposeful: it sets the execute bit for the group only on directories and on files that already have it set for the user. The last command sets the setgid bit, shown as “s” in drwxrws---, which is vital for group ownership and permissions to propagate correctly to newly created files and sub-folders.
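
To illustrate the effect of the setgid bit (the file name and output below are only hypothetical), a file created inside the folder now inherits the folder’s group, pi_gobbert, rather than the creator’s primary group:

[reetam1@taki-usr1 ~]$ touch tmpfolder/newfile
[reetam1@taki-usr1 ~]$ ll tmpfolder/newfile
-rw-rw---- 1 reetam1 pi_gobbert 0 Sep  4 18:05 tmpfolder/newfile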

Checking disk usage vs. storage limits

Users can check the amount of storage remaining for their own account and for their primary group by loading the hpc_toolkit module and then running the hpc_checkStorage script:

[normalUser@taki-usr1 ~]$ module load hpc_toolkit
[normalUser@taki-usr1 ~]$ hpc_checkStorage 

The user normalUser has currently used  284M  of  300M  of storage 

The group normalGroup has currently used  65B  of the total group storage 100G  
[normalUser@taki-usr1 ~]$ 

 

Storage areas

The directory structure that DoIT will set up as part of your account creation is designed to facilitate the work of research groups consisting of several users and also reflects the fact that all HPCF accounts must be sponsored by a faculty member at UMBC. This sponsor will be referred to as PI (short for principal investigator) in the following. A user may be a member of one or several research groups on taki. Each research group has several storage areas on the system in the following specified locations. See the System Description for a higher level overview of the storage and the cluster architecture.

Note that some special users, such as students in a class, may not belong to a research group and therefore may not have all of the group storage areas set up.

User Home
Location: /home/username/
Description: This is where the user starts after logging in to taki. Only accessible to the user by default. The default size is 100 MB, and the storage is located on the management node. This area is backed up nightly.

User Workspace
Location: Symlink: /home/username/pi_name_user; Mount point: /umbc/xfs1/pi_name/users/username/
Description: A central storage area for the user’s own data, accessible only to the user and with read permission for the PI, but not accessible to other group members by default. Ideal for storing output of parallel programs, for example. This area is not backed up.

Group Workspace
Location: Symlink: /home/username/pi_name_common; Mount point: /umbc/xfs1/pi_name/common/
Description: The same functionality and intent for use as the User Workspace, except this area is accessible with read and write permission to all members of the research group.

Scratch Space
Location: /scratch/NNNNN
Description: Each compute node on the cluster has local /scratch storage. The space in this area is shared among the current users of the node, so the amount available will vary based on system usage. This storage is convenient temporary space to use while your job is running, but note that your files here persist only for the duration of the job. Use of this area is encouraged over /tmp, which is also needed by critical system processes. Note that a subdirectory NNNNN (e.g., 22704) is created for your job by the scheduler at runtime.

Tmp Space
Location: /tmp/
Description: Each machine on the cluster has its own local /tmp storage, as is customary on Unix systems. This scratch area is shared with other users and is purged periodically by the operating system, so it is only suitable for temporary scratch storage. Use of /scratch is encouraged over /tmp (see above).

AFS
Location: /afs/umbc.edu/users/u/s/username/
Description: Your AFS storage is conveniently available on the cluster, but can only be accessed from the user node. The “/u/s” in the directory name should be replaced with the first two letters of your username (for example, user “straha1” would have the directory /afs/umbc.edu/users/s/t/straha1).

“Mount point” indicates the actual location of the storage on the filesystem. Traditionally, many users prefer to have a link to the storage from their home directory for easier navigation. The field “Symlink” gives a suggested location for this link. For example, once the link is created, you may use the command “cd ~/pi_name_user” to get to the User Workspace for the given PI. These links may be created for users as part of the account creation process; however, if they do not yet exist, simple instructions for creating them yourself are given above under Group membership.

The amount of space available in the PI-specific areas depends on the allocation given to your research group. Your AFS quota is determined by DoIT. The quota for everyone’s home directory is generally the same.

Note that listing the contents of /umbc/xfs1 may not show storage areas for all PIs. This is because PI storage is only loaded when it is in use. If you attempt to access a PI’s subdirectory in /umbc/xfs1, it should be loaded (seamlessly) if it was previously offline.