java-使用Map Reduce的最小最大计数
内容导读
互联网集市收集整理的这篇技术教程文章主要介绍了java-使用Map Reduce的最小最大计数,小编现在分享给大家,供广大互联网技能从业者学习和参考。文章包含8323字,纯文字阅读大概需要12分钟。
内容图文
![java-使用Map Reduce的最小最大计数](/upload/InfoBanner/zyjiaocheng/676/387bc23877cc44cebbc2d53b7167b832.jpg)
我开发了一个Map reduce应用程序,用于根据Donald Miner编写的书来确定用户的第一次和最后一次评论以及该用户的评论总数.
但是我的算法的问题是减速器.我已根据用户ID对评论进行了分组.我的测试数据包含两个用户ID,每个用户ID在不同的日期发布3条评论.因此共有6行.
因此,我的reducer输出应打印两条记录,每条记录分别显示用户的第一次和最后一次评论以及每个用户ID的总评论.
但是,我的减速器正在打印六个记录.有人可以指出以下代码有什么问题吗?
import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.arjun.mapreduce.patterns.mapreducepatterns.MRDPUtils;
import com.sun.el.parser.ParseException;
public class MinMaxCount {
public static class MinMaxCountMapper extends
Mapper<Object, Text, Text, MinMaxCountTuple> {
private Text outuserId = new Text();
private MinMaxCountTuple outTuple = new MinMaxCountTuple();
private final static SimpleDateFormat sdf =
new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSS");
@Override
protected void map(Object key, Text value,
org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException {
Map<String, String> parsed =
MRDPUtils.transformXMLtoMap(value.toString());
String date = parsed.get("CreationDate");
String userId = parsed.get("UserId");
try {
Date creationDate = sdf.parse(date);
outTuple.setMin(creationDate);
outTuple.setMax(creationDate);
} catch (java.text.ParseException e) {
System.err.println("Unable to parse Date in XML");
System.exit(3);
}
outTuple.setCount(1);
outuserId.set(userId);
context.write(outuserId, outTuple);
}
}
public static class MinMaxCountReducer extends
Reducer<Text, MinMaxCountTuple, Text, MinMaxCountTuple> {
private MinMaxCountTuple result = new MinMaxCountTuple();
protected void reduce(Text userId, Iterable<MinMaxCountTuple> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException {
result.setMin(null);
result.setMax(null);
result.setCount(0);
int sum = 0;
int count = 0;
for(MinMaxCountTuple tuple: values )
{
if(result.getMin() == null ||
tuple.getMin().compareTo(result.getMin()) < 0)
{
result.setMin(tuple.getMin());
}
if(result.getMax() == null ||
tuple.getMax().compareTo(result.getMax()) > 0) {
result.setMax(tuple.getMax());
}
System.err.println(count++);
sum += tuple.getCount();
}
result.setCount(sum);
context.write(userId, result);
}
}
/**
* @param args
*/
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String [] otherArgs = new GenericOptionsParser(conf, args)
.getRemainingArgs();
if(otherArgs.length < 2 )
{
System.err.println("Usage MinMaxCout input output");
System.exit(2);
}
Job job = new Job(conf, "Summarization min max count");
job.setJarByClass(MinMaxCount.class);
job.setMapperClass(MinMaxCountMapper.class);
//job.setCombinerClass(MinMaxCountReducer.class);
job.setReducerClass(MinMaxCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(MinMaxCountTuple.class);
FileInputFormat.setInputPaths(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
boolean result = job.waitForCompletion(true);
if(result)
{
System.exit(0);
}else {
System.exit(1);
}
}
}
Input:
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-07-30T07:29:33.343" UserId="831878" />
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-01T07:29:33.343" UserId="831878" />
<row Id="8189677" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-02T07:29:33.343" UserId="831878" />
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-06-30T07:29:33.343" UserId="931878" />
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-07-01T07:29:33.343" UserId="931878" />
<row Id="8189678" PostId="6881722" Text="Have you looked at Hadoop?" CreationDate="2011-08-02T07:29:33.343" UserId="931878" />
output file contents part-r-00000:
831878 2011-07-30T07:29:33.343 2011-07-30T07:29:33.343 1
831878 2011-08-01T07:29:33.343 2011-08-01T07:29:33.343 1
831878 2011-08-02T07:29:33.343 2011-08-02T07:29:33.343 1
931878 2011-06-30T07:29:33.343 2011-06-30T07:29:33.343 1
931878 2011-07-01T07:29:33.343 2011-07-01T07:29:33.343 1
931878 2011-08-02T07:29:33.343 2011-08-02T07:29:33.343 1
job submission output:
12/12/16 11:13:52 INFO input.FileInputFormat: Total input paths to process : 1
12/12/16 11:13:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/12/16 11:13:52 WARN snappy.LoadSnappy: Snappy native library not loaded
12/12/16 11:13:52 INFO mapred.JobClient: Running job: job_201212161107_0001
12/12/16 11:13:53 INFO mapred.JobClient: map 0% reduce 0%
12/12/16 11:14:06 INFO mapred.JobClient: map 100% reduce 0%
12/12/16 11:14:18 INFO mapred.JobClient: map 100% reduce 100%
12/12/16 11:14:23 INFO mapred.JobClient: Job complete: job_201212161107_0001
12/12/16 11:14:23 INFO mapred.JobClient: Counters: 26
12/12/16 11:14:23 INFO mapred.JobClient: Job Counters
12/12/16 11:14:23 INFO mapred.JobClient: Launched reduce tasks=1
12/12/16 11:14:23 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12264
12/12/16 11:14:23 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/12/16 11:14:23 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/12/16 11:14:23 INFO mapred.JobClient: Launched map tasks=1
12/12/16 11:14:23 INFO mapred.JobClient: Data-local map tasks=1
12/12/16 11:14:23 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10124
12/12/16 11:14:23 INFO mapred.JobClient: File Output Format Counters
12/12/16 11:14:23 INFO mapred.JobClient: Bytes Written=342
12/12/16 11:14:23 INFO mapred.JobClient: FileSystemCounters
12/12/16 11:14:23 INFO mapred.JobClient: FILE_BYTES_READ=204
12/12/16 11:14:23 INFO mapred.JobClient: HDFS_BYTES_READ=888
12/12/16 11:14:23 INFO mapred.JobClient: FILE_BYTES_WRITTEN=43479
12/12/16 11:14:23 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=342
12/12/16 11:14:23 INFO mapred.JobClient: File Input Format Counters
12/12/16 11:14:23 INFO mapred.JobClient: Bytes Read=761
12/12/16 11:14:23 INFO mapred.JobClient: Map-Reduce Framework
12/12/16 11:14:23 INFO mapred.JobClient: Map output materialized bytes=204
12/12/16 11:14:23 INFO mapred.JobClient: Map input records=6
12/12/16 11:14:23 INFO mapred.JobClient: Reduce shuffle bytes=0
12/12/16 11:14:23 INFO mapred.JobClient: Spilled Records=12
12/12/16 11:14:23 INFO mapred.JobClient: Map output bytes=186
12/12/16 11:14:23 INFO mapred.JobClient: Total committed heap usage (bytes)=269619200
12/12/16 11:14:23 INFO mapred.JobClient: Combine input records=0
12/12/16 11:14:23 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
12/12/16 11:14:23 INFO mapred.JobClient: Reduce input records=6
12/12/16 11:14:23 INFO mapred.JobClient: Reduce input groups=2
12/12/16 11:14:23 INFO mapred.JobClient: Combine output records=0
12/12/16 11:14:23 INFO mapred.JobClient: Reduce output records=6
12/12/16 11:14:23 INFO mapred.JobClient: Map output records=6
解决方法:
抓到了罪魁祸首,只需将您的reduce方法的签名更改为以下内容:
受保护的void reduce(文本userId,Iterable< MinMaxCountTuple>值,
上下文上下文)
引发IOException,InterruptedException {
基本上,您只需要具有Context而不是org.apache.hadoop.mapreduce.Reducer.Context
现在输出如下:
831878 2011-07-30T07:29:33.343 2011-08-02T07:29:33.343 3
931878 2011-06-30T07:29:33.343 2011-08-02T07:29:33.343 3
我为您在本地进行了测试,而这一更改就成功了.虽然这是一种奇怪的行为,但如果有人能阐明这一点,那就太好了.它与泛型有关.当使用org.apache.hadoop.mapreduce.Reducer.Context时,它表示:
"Reducer.Context is a raw type. References to generic type Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>.Context should be parameterized"
但是当只使用“上下文”时,没关系.
内容总结
以上是互联网集市为您收集整理的java-使用Map Reduce的最小最大计数全部内容,希望文章能够帮你解决java-使用Map Reduce的最小最大计数所遇到的程序开发问题。 如果觉得互联网集市技术教程内容还不错,欢迎将互联网集市网站推荐给程序员好友。
内容备注
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 gblab@vip.qq.com 举报,一经查实,本站将立刻删除。
内容手机端
扫描二维码推送至手机访问。