Partitioning – Equal Distribution of work


It is not advisable to use a single reducer, as this degrades the overall performance of the MapReduce job. A rule of thumb is that each partition should take at least 5 minutes to process its data. Sometimes the data is not equally distributed across partitions, and the performance of the MapReduce job degrades as a result.

To distribute the work evenly among all the reducers, we should use partitioning. By default, Hadoop uses hash partitioning, but we can write our own custom partitioner.


The hash partitioner uses the hash function below to distribute the data evenly.

public int getPartition(K key, V value, int numReduceTasks)
{
     return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
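To see how this formula behaves, here is a minimal standalone sketch that applies the same masking-and-modulo rule to plain `String` keys outside Hadoop (the class name and sample keys are illustrative, not from Hadoop itself):

```java
public class HashPartitionDemo {
    // Same formula the hash partitioner uses: clear the sign bit so the
    // result of hashCode() is non-negative, then take the remainder
    // modulo the number of reduce tasks.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"alice", "bob", "carol", "dave"};
        for (String k : keys) {
            int p = getPartition(k, 3);
            // The partition index is always in the range [0, numReduceTasks)
            System.out.println(k + " -> reducer " + p);
        }
    }
}
```

Because `hashCode()` can be negative, the `& Integer.MAX_VALUE` mask is what guarantees the modulo result is a valid non-negative partition index.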

Before writing our own custom partitioner, we should be 200% sure that it will distribute the work evenly among all the reducers; otherwise, writing the partitioner is a waste of time and leads to under-utilization of resources.

In the driver class, we have to add the statement below:

job.setPartitionerClass(AgePartitioner.class);

Suppose we have age data, we want to partition it on age, and we want to process our data on 3 reducers. The code below will assign records with age <= 20 to reducer 1, records with age between 21 and 50 to reducer 2, and records with age > 50 to reducer 3.

public static class AgePartitioner extends Partitioner<Text, Text> {

        @Override
        public int getPartition(Text key, Text value, int numReduceTasks) {

            // With no reduce tasks, everything goes to partition 0.
            if (numReduceTasks == 0)
                return 0;

            // Value format: name \t age \t score
            String[] nameAgeScore = value.toString().split("\t");
            int ageInt = Integer.parseInt(nameAgeScore[1]);

            if (ageInt <= 20) {
                return 0;
            } else if (ageInt <= 50) {
                return 1 % numReduceTasks;
            } else {
                return 2 % numReduceTasks;
            }
        }
    }
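To sanity-check the bucketing rule outside a Hadoop cluster, here is a minimal standalone sketch that reproduces the same logic on plain tab-separated strings (the Hadoop `Text` and `Partitioner` types are replaced with `String` and a static method purely for illustration):

```java
public class AgePartitionerDemo {
    // Same rule as AgePartitioner, applied to a plain "name\tage\tscore" string.
    static int getPartition(String value, int numReduceTasks) {
        if (numReduceTasks == 0)
            return 0;
        String[] nameAgeScore = value.split("\t");
        int age = Integer.parseInt(nameAgeScore[1]);
        if (age <= 20)
            return 0;                     // reducer 1
        else if (age <= 50)
            return 1 % numReduceTasks;    // reducer 2
        else
            return 2 % numReduceTasks;    // reducer 3
    }

    public static void main(String[] args) {
        System.out.println(getPartition("alice\t18\t90", 3));  // 0
        System.out.println(getPartition("bob\t35\t75", 3));    // 1
        System.out.println(getPartition("carol\t60\t80", 3));  // 2
    }
}
```

Note that the `% numReduceTasks` guard matters: if the job is configured with fewer than 3 reducers, the returned index still stays within the valid range instead of pointing at a reducer that does not exist.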
