This tutorial walks you through your home directory, reviews some useful Linux information, and explains what to expect in your new account.
Connecting to the Big Data Cluster
The only nodes with a connection to the outside network are the login nodes, also called edge nodes. From the outside, the cluster is reached at the hostname login.hadoop.umbc.edu. To log in to the system, use a secure shell client such as ssh from Unix/Linux, PuTTY from Windows, or similar; you connect to the login node, which is the only node visible from the internet. For example, suppose we are connecting to the Big Data Cluster from our personal computer (pc_name).
[reetam1@pc_name]$ ssh firstname.lastname@example.org
WARNING: UNAUTHORIZED ACCESS to this computer is in violation of Criminal Law
Article section 8-606 and 7-302 of the Annotated Code of MD.
NOTICE: This system is for the use of authorized users only. Individuals using
this computer system without authority, or in excess of their authority, are
subject to having all of their activities on this system monitored and recorded
by system personnel.
email@example.com's password:
Last login: Tue Sep 10 22:01:03 2019 from 188.8.131.52
UMBC Division of Information Technology
http://doit.umbc.edu/
--------------------------------------------------------------------------------
If you have any questions or problems regarding these systems, please call the
DoIT Technology Support Center at 410-455-3838, or submit your request on the
web by visiting http://my.umbc.edu/help/request

Remember that the Division of Information Technology will never ask for your
password. Do NOT give out this information under any circumstances.
--------------------------------------------------------------------------------
-bash-4.2$
Replace “reetam1” with your UMBC username (that you use to log into myUMBC). You will be prompted for your password when connecting; your password is your myUMBC account password.
As another example, suppose we’re SSHing to the Big Data Cluster from a Windows machine with PuTTY. When setting up a connection, use “login.hadoop.umbc.edu” as the hostname. Once you connect, you will be prompted for your username and password, as mentioned above.
Setting bash as your default shell on the cluster
When you log in for the first time, bash needs to be set as your default shell before you can use all the cluster's resources. This is a one-time change that is made through your myUMBC settings:
- On your web browser, navigate to https://webadmin.umbc.edu/.
- Under the My Account section, click on the link titled Edit my Shell and Unix Settings.
- At this point, if you are not logged in already to myUMBC, you will be asked to do so.
- On the Change Your UNIX Account Settings page that you are led to, select the radio button for bash – the last option under Login Shells.
- Click on the Change your Unix Shell button to save, and exit the page.
- You will need to log back on to the Big Data Cluster for the changes to take effect.
You can also verify whether your current shell is bash:
-bash-4.2$ echo $SHELL
/bin/bash
Updating your .bash_profile with environment variables
At this point, your default shell has been switched to bash. The next step is to create and edit your ~/.bash_profile before you can start using the cluster. Follow the steps below.
Create the .bash_profile with your favorite text editor, or use the command below.
-bash-4.2$ nano .bash_profile
Copy and paste the following commands into your .bash_profile:
export JAVA_HOME='/usr/java/jdk1.8.0_181-cloudera'
export PYSPARK_PYTHON=python3
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export PATH
Save and exit once you are done. To make the changes take effect, either source your bash file by using the command source ~/.bash_profile or simply log out and log back into the cluster.
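After reloading the profile, you can confirm the variables took effect. A minimal check (the expected values are simply the ones set in the profile above):

```shell
# Reload the profile in the current session, then confirm the variables are set
source ~/.bash_profile
echo "$JAVA_HOME"        # /usr/java/jdk1.8.0_181-cloudera
echo "$PYSPARK_PYTHON"   # python3
```

If either echo prints an empty line, re-check the contents of ~/.bash_profile for typos before continuing.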
Hadoop Distributed File System (HDFS)
This section assumes that you have logged in as described above.
At any given time, the directory that you are currently in is referred to as your current working directory. Since you just logged in, your home directory is your current working directory. Your AFS storage area serves as the home directory for the big data cluster. The command “pwd” tells you the full path of the current working directory, so let us run pwd to see where your home directory really is:
-bash-4.2$ pwd
/afs/umbc.edu/users/r/e/reetam1/home
Each of the 8 big data nodes, as well as the edge node, has local /scratch storage. The space in this area is shared among the current users of the node, so the amount available will vary based on system usage. The total space is 500 GB, and /scratch is intended to be the staging location from which large data transfers into HDFS originate.
-bash-4.2$ cd /scratch/
-bash-4.2$ pwd
/scratch
-bash-4.2$ df -h /scratch/
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-scratchlv  500G   22G  779G   3% /scratch
Since the scratch directory will be the staging area, it is best to make a directory there to store your files and folders before transferring them to HDFS.
Making directories in scratch
-bash-4.2$ mkdir newfolder
-bash-4.2$ ls
data603  data603_virtualenv  data603_vritualenv  newfolder  yk98337
-bash-4.2$
Accessing files and folders on the Big Data Cluster
While general Linux commands work on the cluster, there are some key differences when it comes to accessing your data. Most importantly, anything that requires you to access HDFS requires specialized commands.
All our working files will be stored in "newfolder" before moving them to HDFS for processing.
Before moving files to HDFS, it is necessary to create a directory on HDFS to store the files that will be processed. We'll call this "folder_on_hdfs" and use the list command to verify that the directory was created.
-bash-4.2$ hdfs dfs -mkdir folder_on_hdfs
-bash-4.2$ hdfs dfs -ls
drwxr-xr-x   - oodunsi1 hadoop          0 2020-07-16 01:01 folder_on_hdfs
To demonstrate the various ways files can be moved in and out of HDFS, we'll create a few text files in the /scratch/newfolder directory.
-bash-4.2$ touch file1 file2 file3 file4_with_text
-bash-4.2$ ls
file1  file2  file3  file4_with_text
Using the nano editor, we'll add some text to file4_with_text.
-bash-4.2$ nano file4_with_text
Output: the text "The Quick brown fox jumps over the lazy dog" was entered into file4_with_text on four lines, and the file was saved.
  GNU nano 2.3.1                File: file4_with_text

The Quick brown fox jumps over the lazy dog
The Quick brown fox jumps over the lazy dog
The Quick brown fox jumps over the lazy dog
The Quick brown fox jumps over the lazy dog

                                [ Read 4 lines ]
^G Get Help  ^O WriteOut  ^R Read File  ^Y Prev Page  ^K Cut Text   ^C Cur Pos
^X Exit      ^J Justify   ^W Where Is   ^V Next Page  ^U UnCut Text ^T To Spell
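If you prefer to skip the interactive editor, the same four-line file can be produced directly from the shell (a simple loop; the text matches what was typed into nano above):

```shell
# Write the sample sentence four times into file4_with_text, then confirm
# the line count matches what nano reported ("Read 4 lines")
for i in 1 2 3 4; do
  echo 'The Quick brown fox jumps over the lazy dog'
done > file4_with_text
wc -l file4_with_text    # 4 file4_with_text
```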
Copy from Scratch to HDFS
-bash-4.2$ hdfs dfs -put /scratch/newfolder/file4_with_text folder_on_hdfs
The command below verifies that the file was copied from scratch to HDFS:
-bash-4.2$ hdfs dfs -ls folder_on_hdfs/
Found 1 items
-rw-r--r--   3 oodunsi1 hadoop        184 2020-07-16 14:30 folder_on_hdfs/file4_with_text
Most Linux file commands have HDFS counterparts: prefix the usual command name with "hdfs dfs -" to run it against HDFS instead of the local filesystem.
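As a short cheat sheet, here are a few of the standard hdfs dfs subcommands next to their Linux equivalents. The paths reuse the directory and file created above, except that the -mv and -rm lines use a hypothetical file1 that you would first need to -put into folder_on_hdfs:

```shell
hdfs dfs -ls folder_on_hdfs                       # list a directory, like ls -l
hdfs dfs -cat folder_on_hdfs/file4_with_text      # print a file, like cat
hdfs dfs -du -h folder_on_hdfs                    # space used, like du -h
hdfs dfs -mv folder_on_hdfs/file1 folder_on_hdfs/file1_renamed   # rename, like mv
hdfs dfs -rm folder_on_hdfs/file1_renamed         # delete, like rm
```

Run `hdfs dfs -help` on the cluster for the full list of subcommands and their options.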
SCP (macOS/Linux/Windows Subsystem for Linux)
"scp" and "sftp" can be used from a Terminal window. The syntax of scp is "scp [from] [to]", where the from and to can each be either a filename or a directory/folder on the host computer or the cluster.
Transfer a File from Your Computer to a Cluster
scp myfile.txt umbcID@login.hadoop.umbc.edu:/scratch/newfolder
In this case, myfile.txt is copied to the /scratch/newfolder directory on the big data cluster. This example assumes that myfile.txt is in the current directory; the full path of myfile.txt can also be specified.
Transfer a Directory to a Cluster
scp -r directory umbcID@login.hadoop.umbc.edu:/scratch/newfolder
In this example, the directory and its contents are transferred; the -r flag makes the copy recursive.
Transfer Files from the Cluster to Your Computer
scp umbcID@login.hadoop.umbc.edu:/scratch/newfolder/myfile.txt .
Note: "." represents the current working directory; it can be replaced with a full path.
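Keep in mind that scp only sees the regular Linux filesystem on the login node, not HDFS. To bring a file that lives in HDFS down to your own computer, it takes two hops; a sketch using the directories created earlier:

```shell
# Step 1 - run on the cluster: copy the file out of HDFS into local scratch
hdfs dfs -get folder_on_hdfs/file4_with_text /scratch/newfolder/

# Step 2 - run on your own machine: pull the file down over scp
scp umbcID@login.hadoop.umbc.edu:/scratch/newfolder/file4_with_text .
```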
How to copy files to and from the Big Data Cluster
Graphical Transfer Tool
FileZilla is free, cross-platform FTP software, consisting of FileZilla Client and FileZilla Server. Client binaries are available for Windows, Linux, and macOS.
Step 1: Download and Install FileZilla from https://filezilla-project.org/download.php for your particular OS.
Step 2: Specify the hostname as the URL of the cluster: "login.hadoop.umbc.edu"
Step 3: Specify the username and password as your UMBC ID and password.
Step 4: Specify the port as 22
Step 5: Click Quickconnect.
Step 6: Once the connection to the cluster is established, the Remote site pane on the right will display the files and directories the user can access on the cluster.
Step 7: Files can be moved back and forth between the host system (Local site) and the cluster (Remote site) by dragging and dropping.
Accessing the Web UI for Hadoop, Spark, Yarn and Hive
Configuring Firefox proxy to view the Web UIs
While the following steps should, in theory, work with any browser, they have only been tested with Firefox. We want to apply the proxy settings to the browser alone; other browsers often apply such settings at the system level, which is not desirable.
- Go to the Settings for your Firefox browser – on Windows platforms, it is accessible via a hamburger menu near the top right of the browser.
- Scroll down under General settings until you find Network Settings near the end. Alternatively, search for Network Settings using the search bar on the top of the page [screenshot].
- Click on the Settings button.
- In the window that opens, make the following changes [screenshot]:
- Select the radio button for Manual proxy configuration.
- Set SOCKS Host to localhost, and its port to 9090.
- Select the radio button for SOCKS v5.
- Tick the checkbox for Proxy DNS when using SOCKS v5.
- Click OK and close the Settings page.
Once the proxy settings have been applied to the browser, you need to SSH into the cluster with port forwarding enabled. Log in to the cluster using the following command:
ssh firstname.lastname@example.org -D 9090
You can now access the UIs from your Firefox web browser while you remain logged into the cluster.
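As an optional sanity check before opening the browser, and assuming curl is installed on your machine, you can confirm the tunnel works by requesting one of the UI pages through the SOCKS proxy from a second local terminal while the ssh session stays open:

```shell
# --socks5-hostname routes both the connection and DNS lookups through the
# tunnel on port 9090, matching the "Proxy DNS when using SOCKS v5" setting
curl --socks5-hostname localhost:9090 http://10.3.0.2:9870/dfshealth.html
```

If curl returns HTML instead of a connection error, the tunnel and proxy port are working.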
Accessing HDFS UI
The HDFS Web UI can be accessed by opening the following link from your Firefox browser – http://10.3.0.2:9870/dfshealth.html#tab-overview
To view the submitted Hadoop jobs running on the cluster, visit the Hadoop webpage at http://worker2.hdp-internal:19888/jobhistory
Other Web Interfaces
To view the other web UIs, see the Web Interfaces page on the Apache Hadoop website.