This tutorial shows how to compile and run the MaxTemperature example from Chapter 2 (MapReduce) of Hadoop: The Definitive Guide, 3rd Edition, using the Microsoft HDInsight Emulator for Windows Azure (Hadoop on Windows).
Prerequisites:
- A supported Windows operating system:
- Windows 8,
- Windows 7,
- Windows Vista SP2,
- Windows XP SP3+,
- Windows Server 2003 SP2+,
- Windows Server 2008,
- Windows Server 2008 R2, or
- Windows Server 2012
- An Internet connection.
- Administrator privileges.
- 7-Zip, gzip -d, or some other way of decompressing *.gz files.
Outline:
- Install the Microsoft HDInsight Emulator for Windows Azure (Hadoop on Windows).
- (Optional) Format the Hadoop Distributed File System (HDFS).
- Create the folder structure for this MapReduce project on the local file system.
- Create the folder structure for this MapReduce project on the HDFS.
- Download the data to the local file system.
- Copy the data from the local file system to the HDFS.
- Write the following *.java files for this MapReduce project on the local file system: MaxTemperatureMapper.java, MaxTemperatureReducer.java, and MaxTemperature.java.
- Compile *.java files to *.class files on the local file system.
- Archive the *.class files to a *.jar on the local file system.
- Run this MapReduce project using the *.jar on the local file system and the data on the HDFS.
- Copy the results of this MapReduce project from the HDFS to the local file system.
Procedure:
- Install the Microsoft HDInsight Emulator for Windows Azure (Hadoop on Windows).
Follow the instructions at http://azure.microsoft.com/en-us/documentation/articles/hdinsight-get-started-emulator/#install to install Apache Hadoop in a single-node cluster deployment using the Hortonworks Data Platform (HDP) for Windows.
Note: The Microsoft HDInsight Emulator will be installed using the Microsoft Web Platform Installer (Web PI) launched from http://www.microsoft.com/web/gallery/install.aspx?appid=HDINSIGHT.
After installing, you will have the following:
- Apache Hadoop 1.0.3
- Apache Pig 0.9.3
- Apache HCatalog 0.4.1
- Apache Templeton 0.1.4
- Apache Hive 0.9.0
- Apache Sqoop 1.4.2
- Apache Oozie 3.2.0
These versions are a few years old; however, they’re good enough to get Hadoop up and running on Windows with minimal effort.
The Microsoft HDInsight Emulator for Windows Azure makes the following system modifications:
- Installs the following:
- Python 2.7.3 (32-bit)
- Hortonworks Data Platform for Windows
- Microsoft HDInsight Emulator for Windows Azure
- Creates a new Local Group named “HadoopUsers”
- Creates a new Local User named “hadoop” that is a member of the following Local Groups: HadoopUsers and HomeUsers
- Creates the following services and automatically starts these services under the context of the new Local User hadoop:
- Apache Hadoop datanode
- Apache Hadoop derbyserver
- Apache Hadoop historyserver
- Apache Hadoop hiveserver
- Apache Hadoop hiveserver2
- Apache Hadoop hwi
- Apache Hadoop jobtracker
- Apache Hadoop metastore
- Apache Hadoop namenode
- Apache Hadoop oozieservice
- Apache Hadoop secondarynamenode
- Apache Hadoop tasktracker
- Apache Hadoop templeton
- (Optional) Format the Hadoop Distributed File System (HDFS).
This step isn’t needed if you have installed the Microsoft HDInsight Emulator for Windows Azure; however, I’ve included it in case you’re reusing these instructions on Linux, as I do.
Launch the Hadoop Command Line shortcut and execute the following:
    hadoop namenode -format
    hadoop fs -mkdir /user
    hadoop fs -mkdir /user/your-username
If any of these folders already exist, then you will receive the following type of error:
mkdir: cannot create directory /user: File exists
- Create the folder structure for this MapReduce project on the local file system.
Create a folder on the C drive called “Temp” ( i.e. C:\Temp\ ).
Inside this folder, create another folder called “MaxTemp” ( i.e. C:\Temp\MaxTemp\ ).
- Create the folder structure for this MapReduce project on the HDFS.
Using the Hadoop Command Line, execute the following:
    hadoop fs -mkdir MaxTemp
    hadoop fs -mkdir MaxTemp/input
In HDFS, this will make a folder named “MaxTemp” under the /user/your-username/ folder ( i.e. /user/your-username/MaxTemp/ ). Then, it will make a folder named “input” under the /user/your-username/MaxTemp/ folder ( i.e. /user/your-username/MaxTemp/input/ ).
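The way a relative HDFS path ends up under /user/your-username/ can be sketched in plain Java. This is only an illustration of the rule described above: it uses ordinary string handling and does not call any Hadoop API, and the class and method names are made up for this example.

```java
public class HdfsPathSketch {

    // Mimics the rule described above: a relative path is placed under
    // /user/<user name>/, while an absolute path is used as-is.
    // Illustration only; this is not how Hadoop implements it internally.
    public static String resolve(String userName, String path) {
        if (path.startsWith("/")) {
            return path;
        }
        return "/user/" + userName + "/" + path;
    }

    public static void main(String[] args) {
        System.out.println(resolve("your-username", "MaxTemp"));       // /user/your-username/MaxTemp
        System.out.println(resolve("your-username", "MaxTemp/input")); // /user/your-username/MaxTemp/input
        System.out.println(resolve("your-username", "/user"));         // /user
    }
}
```

This is why the commands above can say MaxTemp/input instead of spelling out the full /user/your-username/MaxTemp/input path.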
- Download the data to the local file system.
In C:\Temp\MaxTemp\, download the following files:
- https://github.com/tomwhite/hadoop-book/raw/3e/input/ncdc/all/1901.gz
- https://github.com/tomwhite/hadoop-book/raw/3e/input/ncdc/all/1902.gz
Extract each *.gz file to C:\Temp\MaxTemp\ so that you have the following:
- C:\Temp\MaxTemp\1901
- C:\Temp\MaxTemp\1902
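If you have neither 7-Zip nor gzip at hand, the JDK itself can decompress *.gz files. The following is a minimal sketch using java.util.zip.GZIPInputStream; the class name GunzipSketch and its two command-line arguments are my own choices for this example, not something the emulator provides.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPInputStream;

public class GunzipSketch {

    // Copies the decompressed content of a gzip stream to the given output stream.
    public static void gunzip(InputStream compressed, OutputStream out) throws IOException {
        GZIPInputStream in = new GZIPInputStream(compressed);
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: GunzipSketch <file.gz> <output file>");
            return;
        }
        InputStream source = new FileInputStream(args[0]);
        try {
            OutputStream target = new FileOutputStream(args[1]);
            try {
                gunzip(source, target);
            } finally {
                target.close();
            }
        } finally {
            source.close();
        }
    }
}
```

After compiling it, something like java GunzipSketch C:\Temp\MaxTemp\1901.gz C:\Temp\MaxTemp\1901 should produce the extracted file.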
If you’re using 7-Zip, choose “Extract Here” from the 7-Zip contextual menu on each *.gz file.
- Copy the data from the local file system to the HDFS.
Using the Hadoop Command Line, execute the following:
    hadoop fs -copyFromLocal C:\Temp\MaxTemp\1901 MaxTemp/input
    hadoop fs -copyFromLocal C:\Temp\MaxTemp\1902 MaxTemp/input
- Write the following *.java files for this MapReduce project on the local file system:
In the folder C:\Temp\MaxTemp\, write the following into a file named MaxTemperatureMapper.java ( i.e. C:\Temp\MaxTemp\MaxTemperatureMapper.java ):
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class MaxTemperatureMapper extends Mapper<
        /* input key type: */ LongWritable,
        /* input value type: */ Text,
        /* output key type: */ Text,
        /* output value type: */ IntWritable
    > {

        private static final int MISSING = 9999;

        @Override
        public void map (
            /* input key: */ LongWritable key,
            /* input value: */ Text value,
            Context context
        ) throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring( 15, 19 );
            int airTemperature;
            if ( line.charAt( 87 ) == '+' ) { // parseInt doesn't like leading plus signs
                airTemperature = Integer.parseInt( line.substring( 88, 92 ) );
            } else {
                airTemperature = Integer.parseInt( line.substring( 87, 92 ) );
            }
            String quality = line.substring( 92, 93 );
            if ( airTemperature != MISSING && quality.matches( "[01459]" ) ) {
                context.write(
                    /* output key: */ new Text( year ),
                    /* output value: */ new IntWritable( airTemperature )
                );
            }
        }
    }
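To see what the mapper's substring offsets do without starting a Hadoop job, here is a small plain-Java sketch that applies the same offsets (15 to 19 for the year, 87 to 92 for the signed temperature, 92 for the quality code) to a record. The record is synthetic: syntheticRecord is a hypothetical helper that pads with zeros and fills in only the three fields the mapper reads, so it is not real NCDC data.

```java
public class NcdcRecordSketch {

    static final int MISSING = 9999;

    // Hypothetical helper: builds a 93-character record of filler zeros with only
    // the year (4 chars), signed temperature (5 chars) and quality code set.
    public static String syntheticRecord(String year, String signedTemp, char quality) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 93; i++) {
            sb.append('0');
        }
        sb.replace(15, 19, year);
        sb.replace(87, 92, signedTemp);
        sb.setCharAt(92, quality);
        return sb.toString();
    }

    public static String year(String line) {
        return line.substring(15, 19);
    }

    public static int airTemperature(String line) {
        // Skip a leading plus sign, as the mapper does.
        int start = line.charAt(87) == '+' ? 88 : 87;
        return Integer.parseInt(line.substring(start, 92));
    }

    // Same filter as the mapper: drop missing readings and bad quality codes.
    public static boolean isValid(String line) {
        return airTemperature(line) != MISSING && line.substring(92, 93).matches("[01459]");
    }

    public static void main(String[] args) {
        String record = syntheticRecord("1901", "+0317", '1');
        System.out.println(year(record));           // 1901
        System.out.println(airTemperature(record)); // 317
        System.out.println(isValid(record));        // true

        String missing = syntheticRecord("1901", "+9999", '9');
        System.out.println(isValid(missing));       // false
    }
}
```

Only records for which isValid would return true are written to the context by the mapper; everything else is silently skipped.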
In the folder C:\Temp\MaxTemp\, write the following into a file named MaxTemperatureReducer.java ( i.e. C:\Temp\MaxTemp\MaxTemperatureReducer.java ):
    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxTemperatureReducer extends Reducer<
        /* input key type: */ Text,
        /* input value type: */ IntWritable,
        /* output key type: */ Text,
        /* output value type: */ IntWritable
    > {

        @Override
        public void reduce (
            /* input key: */ Text key,
            /* input values: */ Iterable< IntWritable > values,
            Context context
        ) throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for ( IntWritable value : values ) {
                maxValue = Math.max( maxValue, value.get() );
            }
            context.write(
                /* output key: */ key,
                /* output value: */ new IntWritable( maxValue )
            );
        }
    }
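The reducer only makes sense once you picture what it receives: all the temperatures for one year, grouped together. The following plain-Java sketch imitates that grouping with a TreeMap and then applies the same maximum loop as the reducer. It does not use Hadoop at all, the class and method names are mine, and the (year, temperature) pairs are made-up values, not taken from the 1901 and 1902 files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxByYearSketch {

    // Groups temperatures by year, roughly what happens between map and reduce.
    public static Map<String, List<Integer>> groupByYear(String[] years, int[] temperatures) {
        Map<String, List<Integer>> grouped = new TreeMap<String, List<Integer>>();
        for (int i = 0; i < years.length; i++) {
            List<Integer> values = grouped.get(years[i]);
            if (values == null) {
                values = new ArrayList<Integer>();
                grouped.put(years[i], values);
            }
            values.add(temperatures[i]);
        }
        return grouped;
    }

    // Same loop as the reducer: start at Integer.MIN_VALUE and keep the largest value.
    public static int max(Iterable<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        for (int value : values) {
            maxValue = Math.max(maxValue, value);
        }
        return maxValue;
    }

    public static void main(String[] args) {
        // Made-up mapper output for illustration.
        String[] years = {"1901", "1901", "1901", "1902", "1902"};
        int[] temperatures = {0, 22, -11, 111, 78};
        for (Map.Entry<String, List<Integer>> entry : groupByYear(years, temperatures).entrySet()) {
            System.out.println(entry.getKey() + "\t" + max(entry.getValue()));
        }
        // Prints 1901 with 22, then 1902 with 111 (tab-separated).
    }
}
```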
In the folder C:\Temp\MaxTemp\, write the following into a file named MaxTemperature.java ( i.e. C:\Temp\MaxTemp\MaxTemperature.java ):
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTemperature {

        public static void main ( String[] args ) throws Exception {
            if ( args.length != 2 ) {
                System.err.println( "Usage: MaxTemperature <input path> <output path>" );
                System.exit( -1 );
            }

            Job job = new Job();
            job.setJarByClass( MaxTemperature.class );
            job.setJobName( "Max temperature" );

            FileInputFormat.addInputPath( job, new Path( args[ 0 ] ) );
            FileOutputFormat.setOutputPath( job, new Path( args[ 1 ] ) );

            job.setMapperClass( MaxTemperatureMapper.class );
            job.setReducerClass( MaxTemperatureReducer.class );

            job.setOutputKeyClass( Text.class );
            job.setOutputValueClass( IntWritable.class );

            System.exit( job.waitForCompletion( true ) ? 0 : 1 );
        }
    }
- Compile *.java files to *.class files on the local file system.
The Microsoft HDInsight Emulator installs an older version of the JDK (i.e. 1.6.0_31) in the folder C:\Hadoop\java\ … so we’ll use this JDK to compile the *.java files.
Using the Hadoop Command Line, execute the following:
    cd C:\Temp\MaxTemp
    C:\Hadoop\java\bin\javac.exe -classpath C:\Hadoop\hadoop-1.1.0-SNAPSHOT\hadoop-core-1.1.0-SNAPSHOT.jar;C:\Hadoop\hadoop-1.1.0-SNAPSHOT\lib\commons-cli-1.2.jar -d C:\Temp\MaxTemp MaxTemperatureMapper.java MaxTemperatureReducer.java MaxTemperature.java
- Archive the *.class files to a *.jar on the local file system.
As mentioned above, the Microsoft HDInsight Emulator installs an older version of the JDK (i.e. 1.6.0_31) in the folder C:\Hadoop\java\ … so we’ll use this JDK to archive the *.class files.
Using the Hadoop Command Line, execute the following:
    cd C:\Temp\MaxTemp
    C:\Hadoop\java\bin\jar.exe cvf MaxTemperature.jar -C C:\Temp\MaxTemp MaxTemperatureMapper.class MaxTemperatureReducer.class MaxTemperature.class
- Run this MapReduce project using the *.jar on the local file system and the data on the HDFS.
Using the Hadoop Command Line, execute the following:
    hadoop jar C:\Temp\MaxTemp\MaxTemperature.jar MaxTemperature MaxTemp/input MaxTemp/output
- Copy the results of this MapReduce project from the HDFS to the local file system.
Using the Hadoop Command Line, execute the following:
    hadoop fs -copyToLocal MaxTemp/output/part-r-00000 C:\Temp\MaxTemp
Then, open part-r-00000 in WordPad or Notepad to see the results.
The results should be:
    1901 317
    1902 244
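The values in the results are in tenths of a degree Celsius, so 317 means 31.7 degrees. Here is a minimal sketch for turning a result line into something more readable. It assumes the year and the value are separated by a tab, which is Hadoop's default for text output; the class and method names are hypothetical.

```java
public class ResultLineSketch {

    // Returns the year from a "year<TAB>value" line.
    public static String year(String line) {
        return line.trim().split("\t")[0];
    }

    // Converts the value (tenths of a degree Celsius) to degrees Celsius.
    public static double degreesCelsius(String line) {
        String[] fields = line.trim().split("\t");
        return Integer.parseInt(fields[1]) / 10.0;
    }

    public static void main(String[] args) {
        String[] lines = {"1901\t317", "1902\t244"};
        for (String line : lines) {
            System.out.println(year(line) + ": " + degreesCelsius(line) + " C");
        }
        // Prints "1901: 31.7 C" and "1902: 24.4 C".
    }
}
```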
Remember that WordPad will interpret “\n” as a new line, whereas Notepad will not: Notepad only interprets “\r\n” as a new line. Therefore, if you’re set on opening the part-r-00000 file in Notepad, you can first open the file in WordPad, save and close it, and then open it in Notepad.
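If you’d rather not round-trip through WordPad, the conversion itself is a one-liner. The sketch below shows only the string transformation; reading part-r-00000 and writing it back out is left to you, and the class name is made up for this example.

```java
public class LineEndingSketch {

    // Converts "\n" line endings to "\r\n".
    // Existing "\r\n" pairs are normalized first so they are not doubled.
    public static String toCrLf(String text) {
        return text.replace("\r\n", "\n").replace("\n", "\r\n");
    }

    public static void main(String[] args) {
        String unixStyle = "1901\t317\n1902\t244\n";
        String windowsStyle = toCrLf(unixStyle);
        // Show the line endings explicitly instead of printing raw control characters.
        System.out.println(windowsStyle.replace("\r", "\\r").replace("\n", "\\n").replace("\t", "\\t"));
        // Prints: 1901\t317\r\n1902\t244\r\n
    }
}
```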
References:
- Hadoop: The Definitive Guide, 3rd Edition
- Microsoft HDInsight Emulator for Windows Azure
- Install the HDInsight Emulator
- Develop Java MapReduce programs for HDInsight
- Website for Hadoop: The Definitive Guide
- Code for Hadoop: The Definitive Guide, 3rd Edition
- Hadoop
- Hortonworks Data Platform (HDP) for Windows Version 1.1.0 Documentation