This is an interesting question. Here are my thoughts.
1) We only need one MR job. 2) The key point is how to design a custom key class that sorts the data the way we want. Remember that in a MapReduce job, keys are constantly compared with each other for partitioning, sorting and grouping. This is the core feature Hadoop provides, and we should take advantage of it.
The input of the mapper will be (LongWritable, Text), as with the default TextInputFormat, but the output of the mapper will be (KeyClass, NullWritable).
The most important part is designing this custom KeyClass.
Here is some important information about this key class. I will use pseudocode:
class Key { enum type; double altitude; double latitude; }
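A minimal sketch of the key from the pseudocode above, written in plain Java so that it runs without Hadoop on the classpath. The names PlaceType, PlaceKey and roundTrip are made up for illustration. In a real job the class would presumably implement Hadoop's WritableComparable, whose write and readFields methods have the same shape as the ones below.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical names; the pseudocode only calls this field "type".
enum PlaceType { HOTEL, LANDMARK }

// Plain-Java stand-in for the custom key (not a real Hadoop class).
final class PlaceKey {
    PlaceType type;
    double altitude;  // field name kept from the pseudocode
    double latitude;

    // No-argument constructor so an empty key can be filled by readFields.
    PlaceKey() { }

    PlaceKey(PlaceType type, double altitude, double latitude) {
        this.type = type;
        this.altitude = altitude;
        this.latitude = latitude;
    }

    // Serializes the three fields in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeInt(type.ordinal());
        out.writeDouble(altitude);
        out.writeDouble(latitude);
    }

    // Reads the fields back in the same order they were written.
    void readFields(DataInput in) throws IOException {
        type = PlaceType.values()[in.readInt()];
        altitude = in.readDouble();
        latitude = in.readDouble();
    }

    // Illustration helper: serialize a key to bytes and read it back.
    static PlaceKey roundTrip(PlaceKey key) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            key.write(new DataOutputStream(buffer));
            PlaceKey copy = new PlaceKey();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PlaceKey copy = roundTrip(new PlaceKey(PlaceType.HOTEL, 116.40, 39.90));
        System.out.println(copy.type + " " + copy.altitude + " " + copy.latitude);
    }
}
```

The mapper would emit one such key per input record, with NullWritable as the value, so all the information travels in the key.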
For this job we need a custom sort comparator, grouping comparator and partitioner. Here are the tricky parts of each:
1) For the sort comparator: for the same type, compare altitude and latitude. For different types, make sure every hotel is less than every landmark, so that hotels are sorted first, then landmarks.
2) For the grouping comparator and the partitioner, this is the most important part. a) Make sure each hotel is a distinct key, so that different hotels are treated as different and records for the same hotel go to the same reducer. b) For landmark data, if the compared object is another landmark, just compare their altitude and latitude, so that different landmarks come out as different. But if the compared object is a hotel and the two (altitude, latitude) pairs are within 1 km of each other, mark the two objects as the same. This landmark is then treated as being in the same group as the hotel, both are treated as the same key, and both are sent to the same reducer.
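The two comparison rules above could be sketched like this, again in plain Java rather than against the Hadoop API. Everything here is an assumption made for illustration: the names PlaceType, PlaceKey and PlaceComparators, the reading of "altitude" as longitude in degrees, and the use of the haversine formula for the 1 km check. In a real job the two compare methods would presumably live in classes extending WritableComparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical names; the post only calls this field "type".
enum PlaceType { HOTEL, LANDMARK }

// Compact copy of the key so that this example stands alone.
final class PlaceKey {
    final PlaceType type;
    final double altitude;  // treated as longitude in degrees (assumption)
    final double latitude;  // degrees

    PlaceKey(PlaceType type, double altitude, double latitude) {
        this.type = type;
        this.altitude = altitude;
        this.latitude = latitude;
    }
}

final class PlaceComparators {
    static final double EARTH_RADIUS_KM = 6371.0;
    static final double MAX_DISTANCE_KM = 1.0;

    // Sort rule: every hotel before every landmark, then altitude, then latitude.
    static int sortCompare(PlaceKey a, PlaceKey b) {
        if (a.type != b.type) {
            return a.type == PlaceType.HOTEL ? -1 : 1;
        }
        int byAltitude = Double.compare(a.altitude, b.altitude);
        return byAltitude != 0 ? byAltitude : Double.compare(a.latitude, b.latitude);
    }

    // Great-circle distance in kilometres (haversine formula).
    static double distanceKm(PlaceKey a, PlaceKey b) {
        double lat1 = Math.toRadians(a.latitude);
        double lat2 = Math.toRadians(b.latitude);
        double sinHalfLat = Math.sin((lat2 - lat1) / 2);
        double sinHalfLon = Math.sin(Math.toRadians(b.altitude - a.altitude) / 2);
        double h = sinHalfLat * sinHalfLat
                + Math.cos(lat1) * Math.cos(lat2) * sinHalfLon * sinHalfLon;
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(Math.min(1.0, h)));
    }

    // Grouping rule: a hotel and a landmark within 1 km count as the same key;
    // every other pair falls back to the sort rule.
    static int groupCompare(PlaceKey a, PlaceKey b) {
        if (a.type != b.type && distanceKm(a, b) <= MAX_DISTANCE_KM) {
            return 0;
        }
        return sortCompare(a, b);
    }

    public static void main(String[] args) {
        PlaceKey hotel = new PlaceKey(PlaceType.HOTEL, 116.400, 39.900);
        PlaceKey near = new PlaceKey(PlaceType.LANDMARK, 116.405, 39.900);
        PlaceKey far = new PlaceKey(PlaceType.LANDMARK, 116.500, 39.900);

        List<PlaceKey> keys = new ArrayList<>(Arrays.asList(far, near, hotel));
        keys.sort(PlaceComparators::sortCompare);  // order: hotel, near, far
        System.out.println(keys.get(0).type);
        System.out.println(groupCompare(hotel, near) == 0);
        System.out.println(groupCompare(hotel, far) == 0);
    }
}
```

Two things to keep in mind when mapping this onto Hadoop: the grouping comparator is applied to neighbouring keys in the sorted stream, so the 1 km check only has an effect when a landmark ends up next to its hotel after sorting, and a partitioner sees one key at a time rather than a pair.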
All of this needs just one MR job. This interview question is really checking whether you can design a custom key and override the grouping comparator and partitioner (making sure that landmark data within 1 km of hotel data is grouped and partitioned together with it, so both go to the same reducer). So after the mapper stage you will have 1 million reducer group keys, since you have 1 million hotel records, but all the landmark data within 1 km of a hotel will be treated as having the same key as that hotel and will be sent to the same reducer.
I hope I have explained this clearly in English, as typing Chinese is too slow for me.

I think it's going to be a perfect solution, but I'm not sure how to use only one MR job.
When the final reduce output looks like this:
hotel99 landmark31
hotel2 landmark1034
hotel1 landmark41
hotel322 landmark9
hotel12 landmark199
......
......
hotel98 landmark199122
hotel998 landmark122
hotel3 landmark223
the final output is not sorted by hotel, and I think one MR job is OK.
But if the final output needs to be sorted by hotel, I think it needs two MR jobs:
hotel1 landmark1
hotel1 landmark4
hotel1 landmark12
hotel1 landmark31
hotel2 landmark455
hotel2 landmark1525
.....
.....
hotel n-1 landmark1211
hotel n landmark242323
hotel n landmark4563432
b) For landmark data, if the compared object is another landmark, just compare their altitude and latitude, so that different landmarks come out as different. But if the compared object is a hotel and the two (altitude, latitude) pairs are within 1 km of each other, mark the two objects as the same. This landmark is then treated as being in the same group as the hotel, both are treated as the same key, and both are sent to the same reducer.
If my idea is wrong, then this interpretation is the key. Would you be willing to show your code for this interpretation?