Using Spark on the Big Data Cluster

What is Spark?

Spark is a general-purpose framework for distributed computing that offers high performance for both batch and interactive processing. It exposes APIs in Java, Scala, Python, and R.

Steps to access and use Spark on the Big Data cluster:

Step 1: Create an SSH session to the Big Data cluster (see how here)

Step 2: Log in with your user credentials

Step 3: Enter “pyspark” to start the Python interpreter for Spark, as in the sample session below:

-bash-4.2$ pyspark
Python 3.6.8 (default, Apr  2 2020, 13:34:55) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/08/17 16:28:23 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/08/17 16:28:28 WARN lineage.LineageWriter: Lineage directory /data/hadoop/drive-b/spark-log doesn't exist or is not writable. Lineage for this application will be disabled.
20/08/17 16:28:29 WARN lineage.LineageWriter: Lineage directory /data/hadoop/drive-b/spark-log doesn't exist or is not writable. Lineage for this application will be disabled.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.0-cdh6.2.0
      /_/

Using Python version 3.6.8 (default, Apr  2 2020 13:34:55)
SparkSession available as 'spark'.
>>> 

The session above shows that the user has successfully started Spark on the Big Data cluster and can now run any Spark operation using Python.
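As a quick sanity check, you can run a small computation at the >>> prompt. The following is a minimal sketch using the SparkContext sc and the SparkSession spark that the PySpark shell creates for you (both are mentioned in the startup output above); the numbers are arbitrary sample data:

>>> nums = sc.parallelize([1, 2, 3, 4, 5])    # distribute a small Python list as an RDD
>>> nums.map(lambda x: x * x).sum()           # square each element and sum across the cluster
55
>>> spark.range(100).count()                  # the same kind of check through the SparkSession API
100

Press Ctrl-D or type exit() to leave the interpreter and return to the shell.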

For more on Spark's functionality and capabilities, refer to the official Apache Spark documentation (see the reference below).

Reference

Apache Spark Documentation: https://spark.apache.org/docs/latest/index.html