Hadoop MapReduce basic example 1: word count
Published: 2019-06-25


This post implements a simple word count with MapReduce.

1. Preparation: install the Hadoop plugin for Eclipse

Download the build of hadoop-eclipse-plugin-2.2.0.jar that matches your Hadoop version and place it in eclipse/plugins.

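2. Implementation

Create a new MapReduce project.

The map step splits each input line into words and emits a (word, 1) pair for every word; the reduce step sums the pairs for each word to get its count.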

package tank.demo;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author tank
 * @date 2015-01-05 10:03:43 AM
 * @description Word counter
 * @version 0.1
 */
public class WordCount {

    // Mapper: splits each input line on whitespace and emits (word, 1).
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums all the 1s emitted for a word and writes (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        if (args.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
        Job job = Job.getInstance(conf, "word count");
        // Main class, used to locate the jar
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        // Map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // Final output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
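The mapper relies on StringTokenizer with its default delimiters, which decides what counts as a "word" in this job. The following is a minimal sketch in plain Java (no Hadoop classes) that mirrors just that tokenization step; the class TokenizeDemo and its tokens method are hypothetical helpers for illustration, not part of the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

/** Hypothetical helper that mirrors the mapper's tokenization; not part of the job. */
final class TokenizeDemo {

    /** Splits a line the way the mapper does: StringTokenizer with default delimiters. */
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Runs of spaces and tabs are skipped, so no empty tokens are produced.
        System.out.println(tokens("hello   world\tmy deer"));
        // Case and punctuation are preserved, so these become three distinct keys.
        System.out.println(tokens("Hello hello hello,"));
    }
}
```

Because tokens are compared exactly as written, a job like this treats differently cased or punctuated spellings as separate words. Normalizing case or stripping punctuation would have to be added in the mapper if that is not wanted.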

 

Package the project as world-count.jar.

3. Prepare the input data

hadoop fs -mkdir /user/hadoop/input    # create the input directory

# write a couple of small data files

echo hello my hadoop this is my first application>file1

echo hello world my deer my applicaiton >file2

# copy them into HDFS

hadoop fs -put file* /user/hadoop/input

hadoop fs -ls /user/hadoop/input    # list the directory to check

 

4. Run

Upload the jar to the cluster environment and run it:

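hadoop jar world-count.jar tank.demo.WordCount input output

(The class name is fully qualified because WordCount is declared in package tank.demo. This assumes the jar's manifest does not set a Main-Class; if it does, omit the class name.)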

An excerpt of the output:

15/01/05 11:14:36 INFO mapred.Task: Task:attempt_local1938802295_0001_r_000000_0 is done. And is in the process of committing

15/01/05 11:14:36 INFO mapred.LocalJobRunner:
15/01/05 11:14:36 INFO mapred.Task: Task attempt_local1938802295_0001_r_000000_0 is allowed to commit now
15/01/05 11:14:36 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1938802295_0001_r_000000_0' to hdfs://192.168.183.130:9000/user/hadoop/output/_temporary/0/task_local1938802295_0001_r_000000
15/01/05 11:14:36 INFO mapred.LocalJobRunner: reduce > reduce
15/01/05 11:14:36 INFO mapred.Task: Task 'attempt_local1938802295_0001_r_000000_0' done.
15/01/05 11:14:36 INFO mapreduce.Job: Job job_local1938802295_0001 running in uber mode : false
15/01/05 11:14:36 INFO mapreduce.Job:  map 100% reduce 100%
15/01/05 11:14:36 INFO mapreduce.Job: Job job_local1938802295_0001 completed successfully
15/01/05 11:14:36 INFO mapreduce.Job: Counters: 32
        File System Counters
                FILE: Number of bytes read=17706
                FILE: Number of bytes written=597506
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=205
                HDFS: Number of bytes written=85
                HDFS: Number of read operations=25
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=5
        Map-Reduce Framework
                Map input records=2
                Map output records=14
                Map output bytes=136
                Map output materialized bytes=176
                Input split bytes=232
                Combine input records=0
                Combine output records=0
                Reduce input groups=10
                Reduce shuffle bytes=0
                Reduce input records=14
                Reduce output records=10
                Spilled Records=28
                Shuffled Maps =0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=67
                CPU time spent (ms)=0
                Physical memory (bytes) snapshot=0
                Virtual memory (bytes) snapshot=0
                Total committed heap usage (bytes)=456536064
        File Input Format Counters
                Bytes Read=80
        File Output Format Counters
                Bytes Written=85
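Several of the counters above can be checked by hand against the two input files. The sketch below, in plain Java with no Hadoop classes, recomputes four of them from the lines written by the echo commands. It assumes the job uses the default TextInputFormat (one record per line) and that echo appends a single newline to each line; the class name CounterCheck and its methods are hypothetical helpers for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

/** Hypothetical helper that recomputes a few job counters from the sample input. */
final class CounterCheck {

    // Contents of file1 and file2, without the newline that echo appends.
    static final String[] LINES = {
        "hello my hadoop this is my first application",
        "hello world my deer my applicaiton"
    };

    /** "Map input records": one record per input line. */
    static int mapInputRecords() {
        return LINES.length;
    }

    /** "Map output records": one (word, 1) pair per token. */
    static int mapOutputRecords() {
        int n = 0;
        for (String line : LINES) {
            n += new StringTokenizer(line).countTokens();
        }
        return n;
    }

    /** "Reduce input groups": the number of distinct words. */
    static int reduceInputGroups() {
        Set<String> distinct = new HashSet<>();
        for (String line : LINES) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                distinct.add(itr.nextToken());
            }
        }
        return distinct.size();
    }

    /** File input "Bytes Read": the bytes of each line plus its trailing newline. */
    static int bytesRead() {
        int n = 0;
        for (String line : LINES) {
            n += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println("Map input records=" + mapInputRecords());
        System.out.println("Map output records=" + mapOutputRecords());
        System.out.println("Reduce input groups=" + reduceInputGroups());
        System.out.println("Bytes Read=" + bytesRead());
    }
}
```

Under those assumptions the four values agree with the log: 2 input records, 14 map output records, 10 reduce input groups, and 80 bytes read. The counters "Combine input records=0" and "Combine output records=0" are consistent with the job not setting a combiner.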

Check the file in the output directory:

[hadoop@tank1 ~]$ hadoop fs -cat /user/hadoop/output/part-r-00000

applicaiton     1
application     1
deer    1
first   1
hadoop  1
hello   2
is      1
my      4
this    1
world   1

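The word counts are correct. Note that the misspelled "applicaiton" in file2 is counted as its own word, separate from "application".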
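One way to cross-check this listing without a cluster is to replay the same map, shuffle, and reduce steps in plain Java. The sketch below is only a local stand-in under that assumption: it uses no Hadoop classes, the class LocalWordCount and its methods are hypothetical names, and a TreeMap stands in for the sorted grouping that the framework performs between map and reduce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

/** Hypothetical local stand-in for the job; it uses no Hadoop classes. */
final class LocalWordCount {

    /** Map phase: emit a (word, 1) pair for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    /** Shuffle phase: group the emitted values by key, keeping keys sorted. */
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }

    /** Reduce phase: sum the values collected for one key. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Runs all three phases over the given input lines. */
    static TreeMap<String, Integer> run(String... lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> g : shuffle(pairs).entrySet()) {
            result.put(g.getKey(), reduce(g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> result = run(
            "hello my hadoop this is my first application",
            "hello world my deer my applicaiton");
        // Print one "word<TAB>count" line per key, in sorted key order.
        for (Map.Entry<String, Integer> e : result.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Run on the two sample lines, this yields the same ten words and counts as part-r-00000, in the same order.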

 

Reposted from: http://lsfqo.baihongyu.com/
