Apache Hadoop is an open-source, Java-based software platform for managing and processing large datasets in applications that require fast, scalable data processing. It runs across clusters of distributed machines, delivering high availability for your applications. There are many ways to install Apache Hadoop, and custom installations are also available. However, for most system administrators, the method below should be the standard way to install it in their environment.
How to install and use Apache Hadoop on Ubuntu Linux
As mentioned above, Apache Hadoop is an open-source, Java-based software platform that can be used to manage and process large datasets for applications that require fast and scalable data processing. Below is how to install it on Ubuntu Linux.
Install Java OpenJDK
Apache Hadoop is a Java-based application and therefore requires the JDK to be installed. The open-source version of Java works great with Hadoop. Run the commands below to install the default OpenJDK packages from Ubuntu's repositories. Once Java is installed, verify its version with the second command. Additional resources on installing OpenJDK on Ubuntu Linux can be found in the post below. How to install OpenJDK on Ubuntu Linux
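A sketch of those commands, assuming the default-jdk package from Ubuntu's standard repositories:

```bash
# Install the default OpenJDK packages from Ubuntu's repositories
sudo apt update
sudo apt install default-jdk -y

# Verify the installed Java version
java -version
```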
Create a user account for Hadoop
As I said above, there are multiple ways to install Apache Hadoop. In this article, we're going to create a dedicated user account to run the Hadoop services and to log in to its web interface. The account will have a password and will also be able to log in to Ubuntu. Run the commands below to create a new user called hadoop, and complete the user details when prompted. Once the user is created, switch to the newly created account. When prompted for a password, type the password created when creating the user account above. Next, generate an SSH key for the hadoop user and copy the key to the authorized_keys file in the user profile. When prompted to enter a passphrase, press ENTER to leave it empty for no passphrase. Finally, log in over SSH and accept the server's key when prompted.
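A sketch of those commands, assuming the user is named hadoop and the OpenSSH server is installed:

```bash
# Create a dedicated hadoop user; complete the details when prompted
sudo adduser hadoop

# Switch to the newly created account
su - hadoop

# Generate an SSH key pair (press ENTER for an empty passphrase)
ssh-keygen -t rsa

# Authorize the key for passwordless logins
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 640 ~/.ssh/authorized_keys

# Log in once to accept the SSH server's key
ssh localhost
```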
Install Apache Hadoop
We are now ready to install Apache Hadoop. At the time of writing this post, the latest Hadoop version is 3.3.4. Run the commands below to download the current version, then extract the downloaded file and move it into a new folder called hadoop inside the user's home directory.

Next, open the bashrc file to set up the Hadoop environment variables. You will have to add the JAVA_HOME location inside the file. You can quickly find the default JAVA_HOME location by running the commands below, which should output a line similar to the example. Copy all the export lines and append them to the bashrc file, making sure the JAVA_HOME line matches your path. Save the file, then load the new config.

You will also want to confirm JAVA_HOME is correct in the hadoop-env.sh file. Under the generic settings for HADOOP, find the export JAVA_HOME line, uncomment it, add the JAVA_HOME path, then save and exit.
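A sketch of these steps follows. The download URL and the JAVA_HOME path shown are assumptions; substitute the current mirror, version, and the path printed on your own system:

```bash
# Download and extract Hadoop 3.3.4, then move it into ~/hadoop
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz
tar -xvzf hadoop-3.3.4.tar.gz
mv hadoop-3.3.4 ~/hadoop

# Quickly find the default JAVA_HOME location
dirname $(dirname $(readlink -f $(which java)))
# Example output: /usr/lib/jvm/java-11-openjdk-amd64

# Open the bashrc file to set up the environment variables
nano ~/.bashrc
```

Append these lines to the bashrc file, making sure the JAVA_HOME line matches the path printed above:

```bash
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
```

Then reload the config and set the same JAVA_HOME in hadoop-env.sh:

```bash
source ~/.bashrc
nano ~/hadoop/etc/hadoop/hadoop-env.sh
# Uncomment and set: export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```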
Configure Apache Hadoop
At this point, we're ready to configure Hadoop to begin accepting connections.

First, create two folders (namenode and datanode) inside the hdfs directory.

Next, edit the core-site.xml file, and between the configuration tags enter a hostname value that references your system name. Save the file and exit.

Next, edit the hdfs-site.xml file and create configuration properties that reference the namenode and datanode folders created above. Save and exit.

Next, edit the mapred-site.xml file and create the configuration property for the MapReduce framework. Save and exit.

Finally, edit the yarn-site.xml file and create the configuration property for YARN. Save and exit. A sketch of all four files is shown below.
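Here is a sketch of those edits. It assumes Hadoop lives in /home/hadoop/hadoop, the config files sit under ~/hadoop/etc/hadoop/, and a single-node setup with a replication factor of 1; the hostname and paths are assumptions to adjust for your environment.

```bash
# Create the namenode and datanode folders inside the hdfs directory
mkdir -p ~/hadoop/hdfs/{namenode,datanode}
```

In core-site.xml, between the configuration tags (replace localhost with your system's hostname):

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

In hdfs-site.xml, referencing the folders created above:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/hadoop/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/hadoop/hadoop/hdfs/datanode</value>
  </property>
</configuration>
```

In mapred-site.xml, pointing the MapReduce framework at YARN:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

In yarn-site.xml:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```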
Start and access the Apache Hadoop portal
Finally, we're ready to start and access the Hadoop portal. First, format the Hadoop namenode; if it formats successfully, you will see a confirmation message in the output. Next, start the Hadoop services and check the output for a successful start. A sketch of these commands is included after the conclusion below.

Once Hadoop is running, open your web browser and browse to the server's hostname or IP address followed by port number 9870 to view the Hadoop summary portal. The Hadoop application portal can be accessed at the server's hostname or IP address followed by port number 8088. That should be it!

Conclusion

This post showed you how to install and use Apache Hadoop on Ubuntu Linux. If you find any errors above or have something to add, please use the comment form below.
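For reference, here is a sketch of the start-up commands described above. The script names are the standard Hadoop 3.x helpers and assume the PATH entries added to bashrc earlier:

```bash
# Format the HDFS namenode (run once, as the hadoop user)
hdfs namenode -format

# Start the HDFS daemons: NameNode, DataNode, and SecondaryNameNode
start-dfs.sh

# Start YARN: ResourceManager and NodeManager
start-yarn.sh

# List the running Java processes to confirm a successful start
jps
```

With everything running, the summary portal is at http://server-hostname:9870 and the application portal at http://server-hostname:8088, substituting your server's hostname or IP address.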