Big Data: the Classic Hadoop WordCount Example Implemented in Java

Big data: the classic Hadoop starter example, WordCount, implemented in Java on Windows 10 with the IntelliJ IDEA integrated development environment.

Appendix 1 below shows how to count words on the Hadoop platform from the command line, using the example jar that ships with Hadoop. This article implements the same job in hand-written Java code. The example targets Hadoop 2.8.3 on 64-bit Windows 10, with IntelliJ IDEA as the development environment.

1. First, add the Maven dependencies that Hadoop development needs to the project's pom.xml, then sync:

    <dependencies>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.8.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.8.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.8.3</version>
        </dependency>

    </dependencies>


2. Write the Java code for the word count job.

The main class, WordCountMain.java:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountMain {
    public WordCountMain(String[] args) throws Exception {
        Configuration configuration = new Configuration();

        Job job = Job.getInstance(configuration, "word_count");

        job.setJarByClass(WordCountMain.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // The map output and the final job output use the same key/value types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // args[0] is the HDFS input path; args[1] is the output path, which must not exist yet.
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.out.println(job.waitForCompletion(true) ? "Job succeeded" : "Job failed");
    }

    public static void main(String[] args) {
        try {
            new WordCountMain(args);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}


The Mapper class:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();

        // Split on runs of whitespace so that consecutive spaces do not produce empty "words".
        String[] words = line.split("\\s+");

        for (String word : words) {
            if (word.isEmpty()) {
                continue;
            }
            // Emit the word as the key and the count 1 as the value.
            context.write(new Text(word), new LongWritable(1));
        }
    }
}


The Reducer class:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;

        // Sum all of the 1s the mappers emitted for this word.
        for (LongWritable value : values) {
            count += value.get();
        }

        context.write(key, new LongWritable(count));
    }
}
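The split between the two classes can be hard to visualize at first. As a plain-Java sanity check (no Hadoop required, and the class name `LocalWordCount` is just for illustration), the same tokenize-then-aggregate logic over a couple of in-memory lines looks like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalWordCount {
    // Mimics MyMapper + MyReducer in memory: tokenize each line on whitespace,
    // treat each token as a (word, 1) pair, then sum the counts per word --
    // the grouping that Hadoop's shuffle phase would otherwise perform.
    public static Map<String, Long> count(List<String> lines) {
        Map<String, Long> totals = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                totals.merge(word, 1L, Long::sum); // the reducer's summation
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("hello hadoop");
        lines.add("hello world");
        System.out.println(count(lines)); // {hadoop=1, hello=2, world=1}
    }
}
```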

3. Export a runnable jar directly from IntelliJ IDEA (for the detailed steps, see https://zhangphil.blog.csdn.net/article/details/99434450). Note that under Hadoop 2.8.3 on 64-bit Windows 10, the exported jar contains two folders, META-INF and license, and running the jar as-is will fail. Open the jar with an archive tool and delete these two folders, as shown in the figure:

If these two folders are not removed, the jar throws an error at run time and the job fails.
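If you would rather script this cleanup than open the jar in an archive tool, the same effect can be had in plain Java with `java.util.zip` from the JDK. This is only a sketch; the class name `StripJarFolders` is made up for illustration, and it simply copies the jar while skipping every entry under the two offending folders:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class StripJarFolders {
    // Copies srcJar to destJar, dropping every entry under META-INF/ and license/.
    public static void strip(Path srcJar, Path destJar) throws IOException {
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(srcJar));
             ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(destJar))) {
            ZipEntry entry;
            byte[] buf = new byte[8192];
            while ((entry = in.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("META-INF/") || name.startsWith("license/")) {
                    continue; // skip the folders that break `hadoop jar` on Windows
                }
                out.putNextEntry(new ZipEntry(name));
                int n;
                while ((n = in.read(buf)) > 0) {
                    out.write(buf, 0, n);
                }
                out.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        strip(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Because the jar is launched as `hadoop jar <jar> WordCountMain …` with an explicit main class, losing the manifest inside META-INF does not matter.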

4. Start Hadoop with the start-all command, as in Appendix 1. This time, however, the jar being run is not the one from the Hadoop example code but the one built from my own Java code:

hadoop jar E:/code/IdeaProjects/bigdata/out/artifacts/bigdata_jar/bigdata_jar.jar  WordCountMain /test_dir/myfile  /test_dir/result

bigdata_jar.jar is the jar generated in IntelliJ IDEA in step 3 from the Java code above.

After the job runs, the output is the same as in Appendix 1.
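For completeness, the surrounding HDFS commands look roughly like the following. This is a sketch that assumes the paths from the command above and a local file myfile.txt as input; it must be run against a live cluster, so adjust the paths to your setup:

```shell
# Put the input file into HDFS (the output directory must NOT exist yet).
hdfs dfs -mkdir -p /test_dir
hdfs dfs -put myfile.txt /test_dir/myfile

# Run the job, then inspect the reducer output.
hadoop jar E:/code/IdeaProjects/bigdata/out/artifacts/bigdata_jar/bigdata_jar.jar WordCountMain /test_dir/myfile /test_dir/result
hdfs dfs -cat /test_dir/result/part-r-00000
```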

Appendix:

1. Counting words from the command line with Hadoop's bundled word count example jar

https://zhangphil.blog.csdn.net/article/details/98982399
