Big Data: the classic Hadoop starter example, WordCount, implemented in Java on Windows 10 with the IntelliJ IDEA IDE.
Appendix 1 shows how to run word counting on Hadoop from the command line. Here we implement the same functionality in our own Java code. This example is based on Hadoop 2.8.3 and 64-bit Windows 10, developed in IntelliJ IDEA on Windows.
1. First add the Maven dependencies required for Hadoop development to the project's pom.xml, then sync:
<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.8.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.8.3</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.8.3</version>
    </dependency>
</dependencies>
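(A side note: in Hadoop 2.x, hadoop-client should already pull in most of these modules transitively, so the explicit hadoop-common and hadoop-hdfs entries may be redundant. Listing them does no harm and makes the toolchain explicit.)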
2. Write the Java code for the word count.
The main class, WordCountMain.java:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMain {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration, "word_count");

        // Tell Hadoop which jar to ship to the cluster.
        job.setJarByClass(WordCountMain.class);

        // Wire up the mapper and reducer implementations.
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Both the map output and the final output are <Text, LongWritable> pairs.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // args[0] is the HDFS input path; args[1] is the output path (must not exist yet).
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and block until it finishes.
        boolean success = job.waitForCompletion(true);
        System.out.println(success ? "Job succeeded" : "Job failed");
        System.exit(success ? 0 : 1);
    }
}
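As a design note, Hadoop drivers are often written against the Tool interface and launched through ToolRunner, which parses generic options such as -D key=value before handing the remaining arguments to run(). Below is a minimal sketch of that pattern; WordCountDriver is a hypothetical name, not a class used elsewhere in this article:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCountDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() returns the Configuration already populated by ToolRunner.
        Job job = Job.getInstance(getConf(), "word_count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner strips generic options, then calls run() with what is left.
        System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args));
    }
}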
The Mapper class, MyMapper.java:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line in the file; value is the line itself.
        String line = value.toString();
        // Split on runs of whitespace so repeated spaces do not yield empty "words".
        String[] words = line.split("\\s+");
        for (String word : words) {
            if (word.isEmpty()) {
                continue;
            }
            // Emit the word as the key and a count of 1 as the value.
            context.write(new Text(word), new LongWritable(1));
        }
    }
}
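To make the map phase concrete: for a made-up input line `hello world hello`, the mapper emits the intermediate pairs
(hello, 1)
(world, 1)
(hello, 1)
The framework then sorts and groups these pairs by key during the shuffle, so the reduce call for `hello` receives the value list [1, 1].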
The Reducer class, MyReducer.java:
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all of the 1s emitted by the mappers for this word.
        // Use long, not int: a compound assignment into an int would
        // silently truncate the long returned by value.get().
        long count = 0;
        for (LongWritable value : values) {
            count += value.get();
        }
        context.write(key, new LongWritable(count));
    }
}
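Because reduce here does nothing but sum longs, the same class can also serve as a combiner, pre-aggregating counts on the map side before the shuffle to cut network traffic. This is optional and not part of the code above; to try it, add one line to the driver after setReducerClass:
job.setCombinerClass(MyReducer.class);
This only works because the reducer's input and output types are identical and summation is associative and commutative.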
3. Export a runnable jar directly from IntelliJ IDEA (see https://zhangphil.blog.csdn.net/article/details/99434450 for the detailed steps). Note that under Hadoop 2.8.3 on 64-bit Windows 10, the exported jar contains two directories, META-INF and license, and running it as-is fails. Open the jar with an archive tool and delete those two directories, as shown in the figure:
If they are not removed, the jar throws an error at runtime, typically a SecurityException about an invalid signature file digest caused by signed-jar metadata copied in from the dependencies, and the run fails.
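Alternatively, Maven can strip the offending signature files at package time instead of editing the jar by hand. A sketch using the standard maven-shade-plugin follows; the plugin version is an assumption, so pick whatever is current for you:
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <!-- Version is an assumption; use the latest available. -->
            <version>3.2.4</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <!-- Drop signature files copied from signed dependency jars. -->
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>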
4. Start Hadoop with the start-all command, as in Appendix 1. This time, however, the jar is not one of the examples bundled with Hadoop but the one built from our own Java code:
hadoop jar E:/code/IdeaProjects/bigdata/out/artifacts/bigdata_jar/bigdata_jar.jar WordCountMain /test_dir/myfile /test_dir/result
bigdata_jar.jar is the jar generated from the Java code above in step 3 in IntelliJ IDEA.
The output produced by the run is the same as in Appendix 1.
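To inspect the result (assuming the output path /test_dir/result used above), list the output directory and print the reducer output; part-r-00000 is the default name of the first reducer's output file:
hadoop fs -ls /test_dir/result
hadoop fs -cat /test_dir/result/part-r-00000
Also note that FileOutputFormat refuses to overwrite existing output: if /test_dir/result already exists, the job aborts with a FileAlreadyExistsException, so remove it first (hadoop fs -rm -r /test_dir/result) before re-running.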
Appendix:
1. Counting words from the command line with Hadoop's bundled word-count jar