Wednesday, April 6, 2011

Calculating GE Job Priorities

Introduction

Sun Grid Engine (now called just Grid Engine) is one of the most popular job queuing systems in computing centers. Despite the recent changes in the status of the GE project, there are still many places where this system is deployed. However, administrators who have tried to fine-tune a GE system have probably found out that not all questions concerning the details of its implementation and configuration can be answered by reading the documentation. Among them, one of the most interesting as well as most cryptic aspects is how job priorities are calculated and how pending jobs are sorted to determine the order in which they are going to be executed.

This post is, in a way, an attempt to rediscover the known - although there must be people who know how the priorities are calculated (at least some of the GE developers), this knowledge is generally hidden from the user. At least, I have found almost nothing on the web, and I think I was googling hard. Of course, there is a whole chapter in the documentation describing how to tune the system of policies; unfortunately, the description mostly stays on the level of "to change this, click on that". The quantitative effect and the details of the priority calculation remain unexplained.

I spent some time trying to get more information on how GE policies work, and I hope my "discoveries" can be useful to others. To follow the article below, it is essential to understand (at least partially) the official documentation, i.e. to know what the policies and the policy parameters are and how they can be changed.

Important note: All statements presented below are based solely on my observations of how my installation of the GE system version 6.2 behaves and on some peeks into the source code; they may be inaccurate or even completely wrong.

General

As described in the documentation, the job priority is calculated using the formula

job_priority = weight_urgency * normalized_urgency_value
+ weight_ticket * normalized_ticket_value
+ weight_priority * normalized_priority_value

Let's assume that we have not changed the default (post-install) settings so the above weights are

weight_urgency = 0.1
weight_ticket = 0.01
weight_priority = 1

(to see the details of your setup, run qconf -msconf) and the "only" thing that remains to get the job priority is to evaluate the normalized values corresponding to the urgency, ticket-based, and explicit (POSIX) priority policies.

The first amazing fact one discovers when studying the policies and priority calculations is that if a policy is not active (i.e. the value corresponding to this policy is equal to 0 for all jobs), the normalized value for every job is equal to 0.5. We will see later where this exceptional normalization may have come from, but that does not make it any less surprising - I am sure that nobody would expect a set of zeroes to be normalized into a set of 0.5's. Unfortunately, this mathematical miracle does not stand alone in the GE priority calculations - a deus ex machina seems to appear in them from time to time elsewhere as well.

This (ab)normalization becomes clearly visible when, for example, all policies are ineffective ("turned off"). To simulate such a situation:

  • make sure all jobs have explicit (POSIX) priority equal to 0 (set the priority using qalter -p 0 <jobid> command),
  • set weight_waiting_time and weight_deadline to 0 (using qconf -msconf),
  • set all resource-based urgency values to 0 (using qconf -mc),
  • set weight_tickets_functional and weight_tickets_share to 0 (using qconf -msconf) and make sure that no entity has explicitly granted any override tickets.

Then run qstat -pri - you should see that all normalized values (nurg, npprior, and ntckts) are equal to 0.5 and that the priority of every job is 0.555 = 0.5 * 1 + 0.5 * 0.1 + 0.5 * 0.01.
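
The 0.555 figure above follows directly from the weighted sum described in the previous paragraphs. The following sketch only mirrors the documented formula with the default weights; the function name and signature are mine, not any real GE API:

```python
# Default (post-install) scheduler weights, as shown by qconf -msconf
WEIGHT_URGENCY = 0.1
WEIGHT_TICKET = 0.01
WEIGHT_PRIORITY = 1.0

def job_priority(nurg: float, ntckts: float, npprior: float) -> float:
    """Combine the three normalized policy values into the final priority."""
    return (WEIGHT_URGENCY * nurg
            + WEIGHT_TICKET * ntckts
            + WEIGHT_PRIORITY * npprior)

# With all policies inactive, every normalized value defaults to 0.5:
print(round(job_priority(0.5, 0.5, 0.5), 3))  # → 0.555
```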

Urgency

The default policy configuration is usually quite simple. Let's suppose that the ticket-based policies have not been configured and are ineffective, and that the explicit (POSIX) job priorities have not been changed and are equal to zero for all jobs. Then the only policy that plays a role in the priority calculation is the urgency policy.

The urgency policy value is calculated as a sum of three contributions:

urgency = SUM(resource_requirement * resource_urgency) + weight_deadline / job_free_time + weight_waiting_time * waiting_time

The resource_urgency can be set for every resource in the complex configuration (qconf -mc); normally, only the value for the slots resource is set, and it equals 1000. The weight_deadline and weight_waiting_time can be set in the scheduler configuration (qconf -msconf) or, if using qmon, in the Policy Configuration dialog.

I have never used the deadline feature, so I will not include it in the following examples, but it is not very complicated to take it into account if you use it (job_free_time is the number of seconds remaining until the job deadline initiation time specified by the -dl parameter of qsub). If weight_waiting_time is equal to zero, the only value that remains is the product of the number of slots allocated (requested) by the job and the slots resource urgency value (1000 by default). To see the actual urgency values, run qstat -urg; you should get a table of jobs similar to the following (the middle parts of the lines were removed and the table format was edited to fit the page width):

job-ID prior   nurg        urg rrcontr wtcontr dlcontr .. slots
------------------------------------------------------ .. -----
 66622 0.25000 0.00000    1000    1000       0       0 ..     1
 66722 0.25000 0.00000    1000    1000       0       0 ..     1
 66847 0.25000 0.00000    1000    1000       0       0 ..     1
 67652 0.46429 0.42857    4000    4000       0       0 ..     4
 63285 0.75000 1.00000    8000    8000       0       0 ..     8
 66699 0.75000 1.00000    8000    8000       0       0 ..     8
 63284 0.75000 1.00000    8000    8000       0       0 ..     8
 66700 0.75000 1.00000    8000    8000       0       0 ..     8
 66623 0.25000 0.00000    1000    1000       0       0 ..     1

While the calculation of the raw values (rrcontr and urg) is clear here, the normalized values may seem strange - why, for example, is the normalized urgency value for job 67652 not equal to 0.5? We will discuss this later.

Setting weight_waiting_time to some nonzero value makes the job waiting time contribution (wtcontr) relevant, and the resulting table then looks like this:

job-ID prior   nurg        urg rrcontr wtcontr dlcontr .. slots
------------------------------------------------------ .. -----
 66622 0.37199 0.24398  314545    1000  313545       0 ..     1
 66722 0.36206 0.22412  289075    1000  288075       0 ..     1
 66847 0.30784 0.11567  149994    1000  148994       0 ..     1
 67652 0.25116 0.00232    4624    4000     624       0 ..     4
 63285 0.75000 1.00000 1284083    8000 1276083       0 ..     8
 66699 0.36619 0.23237  299656    8000  291656       0 ..     8
 63284 0.75000 1.00000 1284085    8000 1276085       0 ..     8
 66700 0.36619 0.23237  299654    8000  291654       0 ..     8
 66623 0.37199 0.24397  314531    1000  313531       0 ..     1

The waiting time is calculated as the number of seconds between the job submission time and the current time, for both running and pending jobs. However, the qstat -urg report displays the job submission time for pending jobs and the job start time (completely useless for urgency calculations) for running jobs. The only way I was able to find to display the submission time of a running job (and to check the waiting time contribution calculation) was to run qstat -j <job>.

The raw (unnormalized) urgency values are calculated for all jobs (running as well as pending) and then normalized. As seen above, the normalization is not entirely straightforward - the values are not normalized to the maximum, but rather according to the formula

nurg(job) = (urg(job) - MIN(urg)) / (MAX(urg) - MIN(urg))

The resulting nurg value then enters the formula for the overall priority calculation (see the previous chapter) as normalized_urgency_value. Run qstat -pri to see the urgency contribution to the overall job priority.
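
The normalization formula can be checked against the first qstat -urg table above. This is a reconstruction of the observed behavior, not GE code; the fallback to 0.5 for a degenerate (inactive) policy matches the behavior described in the General section:

```python
def normalize(values):
    """Min-max normalization as apparently used for nurg."""
    lo, hi = min(values), max(values)
    if hi == lo:  # inactive policy: every job gets the surprising 0.5
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# urg column from the first table (slots * 1000)
urg = [1000, 1000, 1000, 4000, 8000, 8000, 8000, 8000, 1000]
nurg = normalize(urg)
# job 67652: (4000 - 1000) / (8000 - 1000)
print(round(nurg[3], 5))  # → 0.42857
```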

As for the normalization, there are at least two questions. First, it is not very clear why the described formula was used instead of simple normalization to the maximum. Of course, this formula ensures that there is always a job with nurg = 0 and a job with nurg = 1. But do we really need that? This type of normalization is not linear - if the raw urgency doubles, the normalized urgency does not double with it. This complicates estimates and makes the policy mathematics unnecessarily complicated.

The second question (which will appear again later) is: why are the urgency values of running jobs considered in the normalization? Running jobs are just running; they will not become pending any more (except when they are suspended, but that is another case). If we want to sort pending jobs to determine which one is most important and should be run first, why do we consider the urgency values of running jobs? And even more: what meaning does an urgency value have for a running job? A running job cannot be run any sooner, no matter how big its urgency is.

Explicit (POSIX) Priority

Users and administrators can specify explicit (also called POSIX) job priorities using the -p argument of the qsub or qalter command. The default priority value is 0; while users can only decrease the priority of their jobs (assign them negative priority values), administrators can also increase a job's priority by setting the value to a positive integer. The allowed values are between -1023 and 1024.

Job priorities can be displayed using the qstat -pri command - see the ppri column for the raw priority value and npprior for the normalized value. POSIX priorities seem to be normalized using the simple formula

npprior = (ppri + 1024) / 2048

(please note that the priorities are not normalized to a value determined from the set of actual jobs' priorities, as in the urgency calculation, but to the fixed value 2048).
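
Because the denominator is fixed, this mapping is easy to verify against the table below; the function name here is mine:

```python
def npprior(ppri: int) -> float:
    """Map an explicit (POSIX) priority from -1023..1024 onto 0..1."""
    return (ppri + 1024) / 2048

print(npprior(1024))  # → 1.0
print(npprior(0))     # → 0.5  (the default lands exactly in the middle)
print(npprior(-960))  # → 0.03125
```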

As an example, the following table displays the result of setting the priorities to a set of values from -960 to 1024 in steps of 64 (the table was reformatted and only the beginnings of the lines are shown to fit into the column):

job-ID  prior   nurg    npprior ntckts   ppri ...
--------------------------------------------- ...
  63300 1.10500 1.00000 1.00000 0.50000  1024 ...
  63301 1.07375 1.00000 0.96875 0.50000   960 ...
  63302 1.04250 1.00000 0.93750 0.50000   896 ...
  63303 1.01125 1.00000 0.90625 0.50000   832 ...
  63304 0.98000 1.00000 0.87500 0.50000   768 ...
  63305 0.94875 1.00000 0.84375 0.50000   704 ...
  63306 0.91750 1.00000 0.81250 0.50000   640 ...
  63307 0.88625 1.00000 0.78125 0.50000   576 ...
  63308 0.85500 1.00000 0.75000 0.50000   512 ...
  63309 0.82375 1.00000 0.71875 0.50000   448 ...
  63310 0.79250 1.00000 0.68750 0.50000   384 ...
  63311 0.76125 1.00000 0.65625 0.50000   320 ...
  63312 0.73000 1.00000 0.62500 0.50000   256 ...
  63313 0.69875 1.00000 0.59375 0.50000   192 ...
  63314 0.66750 1.00000 0.56250 0.50000   128 ...
  63315 0.63625 1.00000 0.53125 0.50000    64 ...
  63316 0.60500 1.00000 0.50000 0.50000     0 ...
  63317 0.57375 1.00000 0.46875 0.50000   -64 ...
  63318 0.54250 1.00000 0.43750 0.50000  -128 ...
  63319 0.51125 1.00000 0.40625 0.50000  -192 ...
  63320 0.48000 1.00000 0.37500 0.50000  -256 ...
  63321 0.44875 1.00000 0.34375 0.50000  -320 ...
  63322 0.41750 1.00000 0.31250 0.50000  -384 ...
  63323 0.38625 1.00000 0.28125 0.50000  -448 ...
  63324 0.35500 1.00000 0.25000 0.50000  -512 ...
  63325 0.32375 1.00000 0.21875 0.50000  -576 ...
  63326 0.29250 1.00000 0.18750 0.50000  -640 ...
  63327 0.26125 1.00000 0.15625 0.50000  -704 ...
  63328 0.23000 1.00000 0.12500 0.50000  -768 ...
  63329 0.19875 1.00000 0.09375 0.50000  -832 ...
  63330 0.16750 1.00000 0.06250 0.50000  -896 ...
  63331 0.13625 1.00000 0.03125 0.50000  -960 ...

There is one aspect of POSIX priorities that is probably the root of the previously discussed 0.5 normalization - the fact that the standard priority 0 can be decreased as well as increased places the default value in the middle of the allowed interval, equal to 0.5 after normalization. One can imagine that because of this, the developers decided to make 0.5 the default value for all three contributions to the priority calculation. However, that does not change the fact that this approach is fundamentally wrong and makes the priority calculations inconsistent and unpredictable. The problem would deserve a deeper mathematical analysis, but I have a strong feeling that selecting 0.5 as the average (default, normal) value is one of the biggest mistakes in the priority calculation design. It would be much more consistent to allow negative values in all priority calculations and set the default equal to 0.

Ticket-based policies

The last and most powerful part of the job priority calculation are the ticket-based policies, which are composed of the functional policy, the share-tree policy, and the override policy:

tickets(job) = functional_tickets(job) + share_tickets(job) + override_tickets(job)

In the following text I will concentrate on the functional policy only (and leave the other two policies to be examined by someone else), so I will assume

tickets(job) = functional_tickets(job)

Why? First of all, the functional policy will probably be the most used of the three. Second, the override policy seems to be a sort of exceptional tool for manual interventions, which I wanted to avoid as long as possible. Third, the deeply configurable share-based (share-tree) policy involves accumulated past usage adjusted by a decay factor in its calculations, which makes it very hard to understand and check.

And last but not least, I spent a lot of time studying the functional policy, and to be honest - there is enough complexity in the functional policy calculations to kill an average administrator like me. I just did not have the energy for more dimensions.

To activate the functional or share-tree policy, one has to assign some tickets to the selected policy, either using the qconf -msconf command (weight_tickets_functional and weight_tickets_share - the naming of these parameters is somewhat inconsistent and confusing; I would prefer e.g. total_tickets_functional instead of weight_tickets_functional) or using qmon and the Policy Configuration dialog (Total Functional Tickets and Total Share Tree Tickets).

The number of assigned tickets is usually high: tickets are divided into portions among jobs, and to retain some decent accuracy it is better to have the total number of tickets on the higher side. However, the absolute number is not as important as the relative number - the ratio between the tickets assigned to the different policies; see the documentation for an explanation of this aspect as well as for a description of override ticket assignments.

Let's suppose that, following the recommendation described above, we have assigned

weight_tickets_functional = 1000000
weight_tickets_share = 0

and that we have not granted any override tickets. The functional tickets now have to be distributed among jobs using a set of four categories: user, project, department, and job. The portion of the total tickets granted to each category is determined by the four corresponding factors - weights. Let's consider the default setting of these weights (see qconf -msconf):

weight_user = 0.250000
weight_project = 0.250000
weight_department = 0.250000
weight_job = 0.250000

(the sum of all weights must be 1). With this configuration, the number of tickets assigned to each category is

weight_tickets_functional_user = 250000
weight_tickets_functional_department = 250000
weight_tickets_functional_project = 250000
weight_tickets_functional_job = 250000

To make the analysis even more digestible, let us assume that there are no projects and no departments defined and that no jobs have been assigned extra tickets. Then the only thing that remains is to divide 250000 tickets among jobs according to the user category.

Before we proceed to the calculation, I would like to note that this setup is probably what most administrators would like to start with. It seems to be one way to introduce some "fair sharing" among users - to give users that have fewer jobs running and/or pending more priority than those that are using the system more intensively.

But back to the calculation. To divide tickets among users, we first have to define who a user is. In the standard configuration (if the administrator has not defined any users manually with qconf -auser), users are registered as GE users automatically: a user becomes a registered GE user as soon as he/she submits a job and is unregistered automatically once auto_user_delete_time (see qconf -mconf) has passed since his/her last job disappeared from the system. To get the list of registered users, run the qconf -suserl command.

By default, an automatically registered user gets no share of the total number of tickets. The administrator can either assign a share manually to an existing user with the qconf -muser command (the fshare attribute) or with qmon (the Functional Shares column under Policy Settings / Functional Policy), or configure the system to assign a default share to automatically created users (run qconf -mconf and change the auto_user_fshare parameter). The user shares defined in these automatic and/or manual ways then determine the relative portion of the total tickets assigned to each user.

Let us construct an artificial example that will make the description simpler and more straightforward. Let's assume that the administrator has set

auto_user_fshare = 100

after the system installation, so that all automatically registered users get an equal functional share of 100. Let's also suppose there are three registered users - userA, userB, and userC - and that they have submitted jobs that are now in the following state:

jobA1   userA   running
jobA2   userA   running
jobB1   userB   running

jobB2   userB   pending
jobB3   userB   pending
jobB4   userB   pending
jobC1   userC   pending
jobC2   userC   pending

As all these users have been automatically granted an equal share of the total functional tickets (it does not matter whether it was 100 or 1000, just that it was equal), each of them gets an equal portion of the total number of tickets:

userA_functional_tickets = 250000 * 100 / 300 = 83333
userB_functional_tickets = 250000 * 100 / 300 = 83333
userC_functional_tickets = 250000 * 100 / 300 = 83333

Let's calculate the ticket-based policy contribution to the pending job priority. First, we will do the calculation, comments will follow.

1. Count the number of users that have at least one job running.

users_running = 2

2. Divide all functional tickets by this number.

tickets_per_user_running = 1000000 / 2 = 500000

3. For each user with running jobs, assign an equal number of tickets from his/her portion to every running job:

ftckt_jobA1 = 250000
ftckt_jobA2 = 250000
ftckt_jobB1 = 500000

4. Sort all pending jobs according to the submission time (first submitted goes first).

5. Loop through the users and their pending jobs, sorted as described above, and assign each pending job a number of tickets equal to the user's share of functional tickets divided by the number of the user's jobs considered so far, including the user's running jobs and the current job:

ftckt_jobB2 = 83333 / 2 = 41666 (2 = 1 running + this one)
ftckt_jobB3 = 83333 / 3 = 27777 (3 = 1 running + 1 pending + this one)
ftckt_jobB4 = 83333 / 4 = 20833 (4 = 1 running + 2 pending + this one)
ftckt_jobC1 = 83333 / 1 = 83333 (1 = this one)
ftckt_jobC2 = 83333 / 2 = 41666 (2 = 1 pending + this one)

6. Normalize the tickets by the maximum number of tickets among all (running as well as pending) jobs:

ntckts_jobB2 = 41666 / 500000 = 0.08333
ntckts_jobB3 = 27777 / 500000 = 0.05555
ntckts_jobB4 = 20833 / 500000 = 0.04166
ntckts_jobC1 = 83333 / 500000 = 0.16666
ntckts_jobC2 = 41666 / 500000 = 0.08333

Easy as it gets.
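
Steps 1-6 above can be sketched as follows. This is my reconstruction of the observed behavior, not code taken from Grid Engine, and the integer rounding is a guess that happens to match the figures above:

```python
TOTAL_FUNCTIONAL = 1_000_000   # weight_tickets_functional
USER_CATEGORY = 250_000        # user portion with weight_user = 0.25

# (job, user, state) in submission order; pending jobs keep that order
jobs = [("jobA1", "userA", "running"), ("jobA2", "userA", "running"),
        ("jobB1", "userB", "running"), ("jobB2", "userB", "pending"),
        ("jobB3", "userB", "pending"), ("jobB4", "userB", "pending"),
        ("jobC1", "userC", "pending"), ("jobC2", "userC", "pending")]

shares = {"userA": 100, "userB": 100, "userC": 100}  # auto_user_fshare = 100
per_user = {u: USER_CATEGORY * s // sum(shares.values())
            for u, s in shares.items()}              # 83333 tickets each

# Steps 1-3: split ALL functional tickets among users with running jobs,
# then equally among each such user's running jobs.
running_users = {u for _, u, st in jobs if st == "running"}
per_running_user = TOTAL_FUNCTIONAL // len(running_users)  # 500000
ftckt = {}
for user in running_users:
    rjobs = [j for j, u, st in jobs if u == user and st == "running"]
    for j in rjobs:
        ftckt[j] = per_running_user // len(rjobs)

# Steps 4-5: walk the pending jobs in submission order; each gets the
# user's portion divided by the number of that user's jobs seen so far.
seen = {u: sum(1 for _, u2, st in jobs if u2 == u and st == "running")
        for u in shares}
for j, u, st in jobs:
    if st == "pending":
        seen[u] += 1
        ftckt[j] = per_user[u] // seen[u]

# Step 6: normalize by the maximum over ALL jobs (running included).
max_tckt = max(ftckt.values())
ntckts = {j: t / max_tckt for j, t in ftckt.items()}
print(ftckt["jobB2"], round(ntckts["jobB2"], 5))  # → 41666 0.08333
```

Running this reproduces the ftckt and ntckts values listed in steps 3-6.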

To play with the policies, I recommend opening the qmon Policy Configuration window, changing various parameters there, and watching the effects of the changes in the qstat -ext report - see the ftckt, tckts, and ntckts columns.

As far as I understood the comments in the source file, if there are nonzero department, project, and job contributions to the functional ticket distribution, GE sorts the pending jobs according to the tickets gained so far and repeats step 5 for each of the contributing categories, summing all the tickets each job has gained. Fortunately, I have no experience with such a configuration, as I have never gone that far.

While the general idea of the functional ticket distribution seems nice, the implementation is, again, at least questionable. First of all, there is the problem of involving running jobs in the normalization - why are running jobs considered at all? Why are the values normalized by the maximum over all (running and pending) jobs? Second, there is one big question (or rather exclamation) mark over step 5: what is the idea behind the procedure that calculates the number of tickets assigned to a pending job? Why is the user's share divided by the number of jobs considered so far, and not by the number of all jobs?

One would expect the sum of all tickets to give the total number of tickets granted for the policy, but that is not the case here, of course, due to the method used to calculate the ticket shares. In our example, there are 1000000 tickets in total assigned to running jobs and 215275 (!?) tickets assigned to pending jobs. While one can somehow accept the first number, the second can never be predicted or estimated without performing the full, exact calculation. Moreover, it changes with the number of pending jobs - if userB submitted 1000 pending jobs instead of 2, he/she would be granted 540051 tickets instead of 90276, which is already more than twice the total number of tickets granted to the whole user category.
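
The growth of userB's pending total behaves roughly like a harmonic series. The sketch below assumes integer division at each step, which matches the 90276 figure exactly and lands near (though not exactly on) the 540051 figure, so the actual rounding in GE may differ slightly:

```python
def pending_total(share: int, running: int, pending: int) -> int:
    """Sum of tickets over a user's pending jobs per the step-5 rule:
    the k-th pending job gets share // (running + k)."""
    return sum(share // (running + k) for k in range(1, pending + 1))

# userB: share 83333, 1 running job
print(pending_total(83333, 1, 3))     # jobB2..jobB4 → 90276
print(pending_total(83333, 1, 1000))  # ~540000, more than twice the
                                      # 250000 tickets of the user category
```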

Conclusion

I started studying the GE setup because I wanted to understand the job priority calculations and suspected that the documentation does not tell us everything we need. Now I know a lot more, but I am pretty disappointed - in its current state, the priority calculation is overcomplicated and somewhat irrational, and the resulting priorities are almost unpredictable in any setup a little more complicated than the default.

I think that the policy logic in GE needs a redesign that would simplify the calculation and make it more linear. Although I know I have only a limited knowledge of all aspects of the queuing system, I suggest considering the following ideas for future versions:

  • allow negative priorities and make the standard (default) priority equal to 0,
  • do not include running jobs in normalizations,
  • replace all unnecessarily complex operations (like the urgency normalization and/or step 5 of functional tickets share calculation) with simple linear formulas.

If the policies were simpler, they would be easier to manage; administrators would have a better understanding of the relations, and the GE behavior would be more predictable. Moreover, it would be much easier to explain the policies to users and thus decrease the number of complaints about the resulting priorities of their jobs compared to the jobs of others.