Designing a MapReduce job to find all landmarks within 1 km of every hotel

This is an interesting question. Here are my thoughts.

1) We only need one MR job.
2) The key point is how to design a custom key class that sorts the data the way we want. Remember that in a MapReduce job, keys are compared with one another for grouping, partitioning, and sorting. This is a core feature Hadoop provides, and we should take advantage of it.

The input of the mapper will be (LongWritable, Text), the default for TextInputFormat, but the output of the mapper will be (KeyClass, NullWritable).

The most important part is designing this custom KeyClass.

Here is the important information about this key class, in pseudocode:

Class Key {
     enum type        // hotel or landmark
     double longitude
     double latitude
}
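A minimal sketch of this key in plain Java (without the WritableComparable serialization plumbing a real Hadoop key needs; the enum and field names here are my own choices):

```java
// Composite key emitted by the mapper: the record type plus coordinates.
// A real Hadoop key would also implement WritableComparable so it can be
// serialized and compared by the framework; that plumbing is omitted here.
enum RecordType { HOTEL, LANDMARK }

class RecordKey {
    final RecordType type;
    final double longitude;
    final double latitude;
    final String name; // e.g. "hotel1" or "landmark41"

    RecordKey(RecordType type, double longitude, double latitude, String name) {
        this.type = type;
        this.longitude = longitude;
        this.latitude = latitude;
        this.name = name;
    }
}
```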

For this job, we need a custom sort comparator, a grouping comparator, and a partitioner. Here are the tricky parts of each:

1) For the sort comparator:
     For the same type, compare longitude and latitude.
     For different types, make sure every hotel is less than every landmark, so hotels sort first, then landmarks.

2) For the grouping and partitioning comparators, this is the crucial part.
     a) Make sure each hotel is a distinct key, so records for the same hotel go to the same reducer.
     b) For landmark data: if the compared object is another landmark, just compare their longitude and latitude, so different landmarks remain distinct. But if the compared object is a hotel, and the two (longitude, latitude) pairs are within 1 km of each other, treat the two objects as equal. That landmark then falls into the same group as the hotel, both are treated as the same key, and both are sent to the same reducer.
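The two comparison rules above might be sketched as follows, as plain Java static methods. In a real job these would be WritableComparator subclasses registered via Job.setSortComparatorClass and Job.setGroupingComparatorClass, and the haversine formula is my assumption for how "within 1 km" is measured:

```java
// type encoding for this sketch: 0 = hotel, 1 = landmark, so hotels sort first.
class KeyComparators {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance between two points, in kilometres (haversine).
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                   * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Sort comparator: hotels (0) before landmarks (1); within one type,
    // order by coordinates.
    static int sortCompare(int typeA, double latA, double lonA,
                           int typeB, double latB, double lonB) {
        if (typeA != typeB) return Integer.compare(typeA, typeB);
        int c = Double.compare(latA, latB);
        return c != 0 ? c : Double.compare(lonA, lonB);
    }

    // Grouping comparator: a hotel and a landmark within 1 km of each other
    // compare as equal, so they land in the same reduce() call.
    static int groupCompare(int typeA, double latA, double lonA,
                            int typeB, double latB, double lonB) {
        if (typeA != typeB && haversineKm(latA, lonA, latB, lonB) <= 1.0) {
            return 0;
        }
        return sortCompare(typeA, latA, lonA, typeB, latB, lonB);
    }
}
```

One caveat worth checking: this "within 1 km counts as equal" relation is not transitive (a single landmark can lie within 1 km of two different hotels), so it is not a true ordering in the sense Hadoop's comparators normally assume.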

All of this needs just one MR job. This interview question is really checking whether you can design a custom key and override the grouping and partitioning comparators (making sure that landmark records within 1 km of a hotel are grouped and partitioned together with it, so they reach the same reducer). After the map stage you will have 1 million reducer group keys, since there are 1 million hotel records, but every landmark within 1 km of a hotel is treated as the same key as that hotel and sent to the same reducer.
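Given the sort order described above, each reduce group should see the hotel record first, followed by its nearby landmarks. A plain-list sketch of that reduce logic (in a real job, reduce() receives one group key plus a value iterator, so the record names would have to travel in the key or value; the method name reduceGroup is mine):

```java
import java.util.ArrayList;
import java.util.List;

class ReduceSketch {
    // keysInGroup: the records of one reduce group, in sorted order,
    // so the hotel comes first and the landmarks within 1 km follow.
    static List<String> reduceGroup(List<String> keysInGroup) {
        List<String> out = new ArrayList<>();
        String hotel = keysInGroup.get(0);
        for (int i = 1; i < keysInGroup.size(); i++) {
            out.add(hotel + " " + keysInGroup.get(i)); // e.g. "hotel1 landmark41"
        }
        return out;
    }
}
```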

I hope I have explained this clearly in English; typing Chinese is too slow for me. I think this will be a perfect solution.
But I'm not sure how to do it with only one MR job.

If the final reduce output looks like this:
hotel99 landmark31
hotel2 landmark1034
hotel1 landmark41
hotel322 landmark9
hotel12 landmark199
......
......
hotel98 landmark199122
hotel998 landmark122
hotel3 landmark223
i.e. the final output is not sorted by hotel, then I think one MR job is OK.


But if the final output needs the hotel data sorted, I think it takes two MR jobs:
hotel1 landmark1
hotel1 landmark4
hotel1 landmark12
hotel1 landmark31
hotel2 landmark455
hotel2 landmark1525
.....
.....
hotel n-1 landmark1211
hotel n landmark242323
hotel n landmark4563432
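If the sorted layout above is required, a second pass over the first job's output would do it. As a plain-Java illustration of that second sort (in Hadoop this would be a second MR job, or a TotalOrderPartitioner on the first; the "hotelN landmarkM" line format is taken from the examples above):

```java
import java.util.Arrays;
import java.util.Comparator;

class SortByHotel {
    // Sort output lines of the form "hotel99 landmark31" by numeric hotel id.
    static String[] sortLines(String[] lines) {
        String[] out = lines.clone();
        Arrays.sort(out, Comparator.comparingInt(
                (String line) -> Integer.parseInt(line.split(" ")[0].replaceAll("\\D", ""))));
        return out;
    }
}
```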

Quoting point 2b above: "For landmark data: if the compared object is another landmark, just compare their longitude and latitude, so different landmarks remain distinct. But if the compared object is a hotel, and the two (longitude, latitude) pairs are within 1 km of each other, treat the two objects as equal. That landmark then falls into the same group as the hotel, both are treated as the same key, and both are sent to the same reducer."
If my idea is wrong, this interpretation is the key part. Would you show your code for this interpretation?
