This tutorial covers how to compile and run a MapReduce Version 2 (MRv2) job written in Java using Hadoop. All the code and some of the commands used in this tutorial are derived from the “Map-Reduce Part 2 — Developing First MapReduce Job” section of the Hadoop Tutorial: Developing Big-Data Applications with Apache Hadoop by coreservlets.com.
Note: I’m running Hadoop 2.3.0 in the following Cloudera QuickStart VM: a VMware virtual machine of the Cloudera Distribution Including Apache Hadoop (CDH) version 5.0, last updated on 02 Apr 2014. I’m running this virtual machine in VMware Player 6.0.2 build 1744117, released on 2014-04-17. My host machine is the first version of the Microsoft Surface Pro tablet running Windows 8.1 Pro (64-bit).
Prerequisites:
- A properly configured installation of Hadoop 2.3.0 (including all dependencies) … good luck! ;-)

or

- A 64-bit operating system, such as Windows 8.1 Pro (64-bit), that can run VMware Player 6.0.2.
- (free) VMware Player 6.0.2 for 64-bit operating systems.
- (free) A Cloudera QuickStart VM of CDH 5.0.
Outline:
- (Optional) Format the Hadoop Distributed File System (HDFS).
- Create the folder structure for this MapReduce project on the local file system.
- Create the folder structure for this MapReduce project on the HDFS.
- Download the data to the local file system.
- Copy the data from the local file system to the HDFS.
- Write the following *.java files for this MapReduce project on the local file system:
- Compile *.java files to *.class files on the local file system.
- Archive the *.class files to a *.jar on the local file system.
- Run this MapReduce project using the *.jar on the local file system and the data on the HDFS.
- Copy the results of this MapReduce project from the HDFS to the local file system.
Note: In order for you … and me (when I come back to this procedure) … to figure out where we are in the local file system, each step in the procedure starts with a call to cd.
Procedure:
- (Optional) Format the Hadoop Distributed File System (HDFS).
This step doesn’t need to be done if we’re using the CDH 5.0 Cloudera QuickStart VM; however, I’ve included it in case we’re using our own Hadoop installation for the first time.
Launch the Terminal and execute the following:

```bash
cd ~
hdfs namenode -format
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/our-username
```

where our-username is the username that we are currently logged in under. If we’re using the CDH 5.0 Cloudera QuickStart VM, our username is cloudera and we would execute the following:

```bash
cd ~
hdfs namenode -format
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/cloudera
```

If any of these folders already exist, we’ll receive the following type of error:

```
mkdir: `/user': File exists
mkdir: `/user/cloudera': File exists
```
- Create the folder structure for this MapReduce project on the local file system.
We’ll need somewhere to put our input data files, our Java MapReduce code, and the eventual results (i.e. our output data files), so let’s create a folder named StartsWithCount for this project under the current user’s home directory:

```bash
cd ~
mkdir ~/StartsWithCount
```
- Create the folder structure for this MapReduce project on the HDFS.
Similar to the step above, we’ll need a place to put our input data and the eventual results (i.e. our output data files) in the HDFS, so let’s create a folder named StartsWithCount and nest a folder inside called input … the output folder will be created by a later step:

```bash
cd ~/StartsWithCount
hdfs dfs -mkdir StartsWithCount
hdfs dfs -mkdir StartsWithCount/input
hdfs dfs -rm -r StartsWithCount/output
```
Note: Upon the successful completion of a MapReduce job, results are written to a file named part-r-00000 inside the output folder. If that output folder (and the part-r-00000 file inside it) is still around the next time we run our MapReduce job, it won’t be overwritten … instead, our MapReduce job will fail, because Hadoop refuses to write to an output folder that already exists … and we will be sad that we wasted our time. :-(
Therefore, the last command above (hdfs dfs -rm -r StartsWithCount/output) recursively removes the StartsWithCount/output folder (if it exists) in the HDFS and everything in that output folder … including any files that might be named part-r-00000. ;-)
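As an aside, if we’d rather not have to remember to run hdfs dfs -rm -r by hand before every run, the output folder can also be removed programmatically with Hadoop’s FileSystem API. The following is a minimal sketch of my own (the OutputCleaner class is not part of the downloaded sample code), shown here only to illustrate the API:

```java
package mr.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper (not part of the downloaded sample code): removes an HDFS
// folder recursively, the programmatic equivalent of `hdfs dfs -rm -r`.
public class OutputCleaner {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path path = new Path(args[0]); // e.g. StartsWithCount/output

        if (fs.exists(path)) {
            fs.delete(path, true); // true = delete the folder and everything in it
        }
    }
}
```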
- Download the data to the local file system.
Use wget to download the text file pg1524.txt from Project Gutenberg. pg1524.txt is Shakespeare’s Hamlet … and will be our input data file for the rest of this Hadoop tutorial. Since pg1524.txt isn’t very descriptive, let’s rename pg1524.txt to hamlet.txt:

```bash
cd ~/StartsWithCount
wget http://www.gutenberg.org/cache/epub/1524/pg1524.txt
mv pg1524.txt hamlet.txt
```
- Copy the data from the local file system to the HDFS.
Upload (put) hamlet.txt to the HDFS. Since hamlet.txt is our input data file, we’ll put that text file in the StartsWithCount/input folder:

```bash
cd ~/StartsWithCount
hdfs dfs -put hamlet.txt StartsWithCount/input
```
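For the curious, the same upload can be done from Java with the FileSystem API. This is a minimal sketch (the PutHamlet class name is mine, not part of the downloaded sample code, and it assumes it is launched with the cluster’s configuration on the classpath, e.g. via yarn jar):

```java
package mr.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper (not part of the downloaded sample code): the programmatic
// equivalent of `hdfs dfs -put hamlet.txt StartsWithCount/input`.
public class PutHamlet {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(new Path("hamlet.txt"),
                             new Path("StartsWithCount/input/hamlet.txt"));
    }
}
```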
Under Construction – Begin
- Write the following *.java files for this MapReduce project on the local file system:
Create an mr/wordcount package folder under the project folder, then download the three *.java source files (the mapper, the reducer, and the job driver) from the tutorial’s sample code repository:

```bash
cd ~/StartsWithCount
mkdir ~/StartsWithCount/mr
mkdir ~/StartsWithCount/mr/wordcount
cd ~/StartsWithCount/mr/wordcount
wget http://hadoop-course.googlecode.com/svn/trunk/HadoopSamples/src/main/java/mr/wordcount/StartsWithCountMapper.java
wget http://hadoop-course.googlecode.com/svn/trunk/HadoopSamples/src/main/java/mr/wordcount/StartsWithCountReducer.java
wget http://hadoop-course.googlecode.com/svn/trunk/HadoopSamples/src/main/java/mr/wordcount/StartsWithCountJob.java
```
StartsWithCountMapper.java splits each line of input into tokens and, for every token, emits the token’s first character as the key with a count of 1 as the value:

```java
package mr.wordcount;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StartsWithCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable countOne = new IntWritable(1);
    private final Text reusableText = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            // emit (first character of the token, 1)
            reusableText.set(tokenizer.nextToken().substring(0, 1));
            context.write(reusableText, countOne);
        }
    }
}
```
StartsWithCountReducer.java sums the counts for each starting character (the job below also uses it as the combiner):

```java
package mr.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.log4j.Logger;

public class StartsWithCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    Logger log = Logger.getLogger(StartsWithCountMapper.class);

    @Override
    protected void reduce(Text token, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        // add up all of the 1s emitted for this starting character
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(token, new IntWritable(sum));
    }
}
```
StartsWithCountJob.java configures and submits the job; it expects two command-line arguments, the HDFS input path (args[0]) and the HDFS output path (args[1]):

```java
package mr.wordcount;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class StartsWithCountJob extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "StartsWithCount");
        job.setJarByClass(getClass());

        // configure output and input source
        TextInputFormat.addInputPath(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);

        // configure mapper and reducer
        job.setMapperClass(StartsWithCountMapper.class);
        job.setCombinerClass(StartsWithCountReducer.class);
        job.setReducerClass(StartsWithCountReducer.class);

        // configure output
        TextOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new StartsWithCountJob(), args);
        System.exit(exitCode);
    }
}
```
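To see what these classes compute without involving Hadoop at all, here is a tiny stand-alone illustration of my own (StartsWithCountLocalCheck is not part of the downloaded sample code): it applies the mapper’s tokenize-and-take-the-first-character logic to one famous line of Hamlet and sums the counts the way the reducer does:

```java
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone illustration (not part of the downloaded sample code).
public class StartsWithCountLocalCheck {

    public static void main(String[] args) {
        String line = "To be, or not to be: that is the question:";
        TreeMap<String, Integer> counts = new TreeMap<String, Integer>();

        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            // what StartsWithCountMapper emits: (first character of the token, 1)
            String first = tokenizer.nextToken().substring(0, 1);
            // what StartsWithCountReducer does: sum the 1s per starting character
            Integer current = counts.get(first);
            counts.put(first, current == null ? 1 : current + 1);
        }

        System.out.println(counts); // prints {T=1, b=2, i=1, n=1, o=1, q=1, t=3}
    }
}
```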
- Compile *.java files to *.class files on the local file system.
Compile the three *.java files against the Hadoop libraries. The yarn classpath command prints the classpath Hadoop itself uses, which gives javac everything it needs to resolve the Hadoop classes:

```bash
cd ~/StartsWithCount
javac -classpath `yarn classpath` mr/wordcount/*.java
```
- Archive the *.class files to a *.jar on the local file system.
Package the compiled *.class files into a JAR file named HadoopSamples.jar (c = create a new archive, v = verbose output, f = write to the named file):

```bash
cd ~/StartsWithCount
jar cvf HadoopSamples.jar mr/wordcount/*.class
```
- Run this MapReduce project using the *.jar on the local file system and the data on the HDFS.
Submit the job with yarn jar, passing the fully qualified name of the driver class (mr.wordcount.StartsWithCountJob), the HDFS input folder (args[0]), and the HDFS output folder (args[1]):

```bash
cd ~/StartsWithCount
yarn jar HadoopSamples.jar mr.wordcount.StartsWithCountJob StartsWithCount/input StartsWithCount/output
```
Under Construction – End
- Copy the results of this MapReduce project from the HDFS to the local file system.
Download (get) the resulting output file (part-r-00000) of our successful MapReduce job from the HDFS and display that downloaded file using cat … each line contains a starting character, a tab, and the number of tokens that start with that character:

```bash
cd ~/StartsWithCount
hdfs dfs -get StartsWithCount/output/part-r-00000
cat part-r-00000
```

Alternatively, we can leave the resulting output file where it is on the HDFS and display it from there using hdfs dfs -cat:

```bash
cd ~/StartsWithCount
hdfs dfs -cat StartsWithCount/output/part-r-00000
```
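If we’d rather read the results from Java than from the hdfs command-line client, a small sketch like the following would work (the PrintResults class name is mine, not part of the downloaded sample code, and it assumes it is launched with the cluster’s configuration on the classpath, e.g. with yarn jar after rebuilding the JAR):

```java
package mr.wordcount;

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper (not part of the downloaded sample code): prints the job's
// results straight from the HDFS, like `hdfs dfs -cat`.
public class PrintResults {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path results = new Path("StartsWithCount/output/part-r-00000");

        BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(results)));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // each line: starting character <TAB> count
            }
        } finally {
            reader.close();
        }
    }
}
```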
References:
- Hadoop Tutorial: Developing Big-Data Applications with Apache Hadoop by coreservlets.com
- The “Map-Reduce Part 2 — Developing First MapReduce Job” tutorial section on SlideShare
- The “Map-Reduce Part 2 — Developing First MapReduce Job” tutorial section in PDF
- Source code for the “Map-Reduce Part 2 — Developing First MapReduce Job” tutorial section