Using Kafka on the big data cluster

What is Kafka?

Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. (Wikipedia)

A Walk-through of Using Kafka on the Big Data Cluster

The purpose of this walkthrough is to give a basic introduction to getting up and running with Kafka on the big data cluster. How Kafka is used on the cluster varies case by case, but there are basic building blocks fundamental to any use of Kafka on the cluster. These concepts are discussed below and used for demonstration.

Each concept discussed has a plethora of use cases, and how it is applied depends on the action, project, or task a user hopes to accomplish. The Kafka and Confluent websites contain the full Kafka documentation for your curiosity and entertainment. Watching YouTube videos or taking an online class on Kafka and its concepts should help shed light on any gray areas in the understanding and use of Kafka.

Note: Kafka is a high-level technical tool. Following this walkthrough guarantees, from a cluster administration standpoint, that the resources needed to perform a task or run an application are available. If you encounter any challenge beyond running this simple test of resource allocation, please debug and test your code, check the official documentation website, and perform a couple of Google searches on the error before putting in a ticket.

Thank you.

Kafka Walk-through

Requirements:

  1. User has the appropriate permission to access the cluster
  2. Open two terminal sessions

Step 1: Start an SSH login to the cluster in one of the open terminals

    • Mac and Linux users: run ssh UMBC_ID@login.hadoop.umbc.edu   #Note: UMBC_ID is your myUMBC username
    • Windows users will need to use PuTTY to access the cluster.

NOTE: To connect to the cluster remotely, please use your VPN connection.

You will then be prompted for your password, as shown below.

YOUR PC ~ % ssh UMBC_ID@login.hadoop.umbc.edu
WARNING: UNAUTHORIZED ACCESS to this computer is in violation of Criminal Law
         Article section 8-606 and 7-302 of the Annotated Code of MD.

NOTICE:  This system is for the use of authorized users only. Individuals using
         this computer system without authority, or in excess of their authority
         , are subject to having all of their activities on this system
         monitored and recorded by system personnel.

UMBC_ID@login.hadoop.umbc.edu's password: 
Last login: Tue Jun  9 11:44:12 2020 from 130.85.46.245

UMBC Division of Information Technology                    http://doit.umbc.edu/
--------------------------------------------------------------------------------
If you have any questions or problems regarding these systems, please call the
DoIT Technology Support Center at 410-455-3838, or submit your request on the
web by visiting http://my.umbc.edu/help/request

Remember that the Division of Information Technology will never ask for your
password. Do NOT give out this information under any circumstances.
--------------------------------------------------------------------------------

-bash-4.2$ 

A prompt like this means you are logged in to the Hadoop cluster and have been granted the necessary permissions.

For this walk-through, the first step is creating a Topic.

A Topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

Step 2: For this walk-through, create a topic by navigating to the root directory and using the command below

/usr/bin/kafka-topics --create --zookeeper worker2.hdp-internal:9092 --replication-factor 1 --partitions 1 --topic your_topic

There are many such commands, each specific to the type of operation a user wishes to perform; getting familiar with how and when to use them is the joy of any Kafka user. A couple of additional examples are sketched after the sample output below.

-bash-4.2$ cd /
-bash-4.2$ /usr/bin/kafka-topics --create --zookeeper worker2.hdp-internal:9092 --replication-factor 1 --partitions 1 --topic your_topic
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "your_topic".
-bash-4.2$
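
For example, once a topic exists, its configuration and partition assignment can be inspected with the --describe option. A minimal sketch, assuming the same ZooKeeper host used above:

/usr/bin/kafka-topics --describe --zookeeper worker2.hdp-internal:9092 --topic your_topic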

Step 3: To see existing or previously created topics, use the command below

/usr/bin/kafka-topics --list --zookeeper worker2.hdp-internal:9092
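
If many topics exist, the output can be filtered with standard shell tools; for example, a quick check that your_topic shows up in the list:

/usr/bin/kafka-topics --list --zookeeper worker2.hdp-internal:9092 | grep your_topic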

Once a topic has been successfully created, the next step is to create a producer that sends data to the topic.

A Producer: Producers publish data to the topics of their choice.

Step 4: To test the producer functionality, use the following command to create a console producer.

/usr/bin/kafka-console-producer --broker-list worker2.hdp-internal:9092 --topic your_topic
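
The console producer reads lines from standard input, so messages can also be sent non-interactively by piping text into it. A minimal sketch, using the same broker and topic as above:

echo "hello from the producer" | /usr/bin/kafka-console-producer --broker-list worker2.hdp-internal:9092 --topic your_topic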

A Consumer: Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group.

Step 5: To test the consumer functionality, open a new terminal, log in to the cluster as in Step 1, then use the command below to create a consumer that consumes data from the topic.

/usr/bin/kafka-console-consumer --bootstrap-server worker2.hdp-internal:9092 --topic your_topic --from-beginning

If everything goes as expected, anything typed in the producer terminal should appear in the consumer terminal.
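
To see consumer groups in action, the console consumer can also be started with an explicit group name; records published to the topic are then divided among all consumer instances sharing that group. A sketch, where my_group is an arbitrary name chosen for illustration:

/usr/bin/kafka-console-consumer --bootstrap-server worker2.hdp-internal:9092 --topic your_topic --group my_group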

To delete a topic, use the following command

/usr/bin/kafka-topics --delete --zookeeper worker2.hdp-internal:9092 --topic your_topic
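
To confirm the topic has been removed (or marked for deletion, depending on the cluster's settings), list the topics again as in Step 3:

/usr/bin/kafka-topics --list --zookeeper worker2.hdp-internal:9092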