queuing_time

Prototype and Statistics

Storing Scheduler State

Each site has to be able to provide the following information for each job submitted to the system:

CREATE TABLE jobs (
  site TEXT,			-- computing site (smuc, psnc, etc)
  username TEXT,		-- user name of the user who submitted the job
  class TEXT,			-- queue, partition, class, etc.
  when_submitted BIGINT,	-- timestamp when the job was submitted (s)
  time_requested BIGINT,	-- how much wallclock time requested (s)
  cpu_time_requested BIGINT,	-- how much cpu time was requested (s)
  tasks_requested BIGINT,	-- how many tasks requested
  nodes_requested BIGINT,	-- how many nodes requested
  sys_jobs_running BIGINT,	-- how many jobs the whole system has running
  sys_jobs_queued BIGINT,	-- how many jobs the whole system has queued
  sys_tasks_running BIGINT,	-- how many tasks the whole system has running
  sys_tasks_queued BIGINT,	-- how many tasks the whole system has queued
  sys_nodes_running BIGINT,	-- how many nodes the whole system has running
  sys_nodes_queued BIGINT,	-- how many nodes the whole system has queued
  sys_time_running BIGINT,	-- how much wallclock time will the system's running jobs take (s)
  sys_time_used BIGINT, 	-- how much wallclock time have the system's jobs already used up (s)
  sys_time_queued BIGINT,	-- how much wallclock time will the system's queued jobs take (s)
  sys_cpu_time_running BIGINT,	-- how much cpu time will the system's running jobs take (s)
  sys_cpu_time_used BIGINT,	-- how much cpu time have the system's running jobs already used up (s)
  sys_cpu_time_queued BIGINT,	-- how much cpu time will the system's queued jobs take (s)
  class_jobs_running BIGINT,	-- how many jobs are running in that queue
  class_jobs_queued BIGINT,	-- how many jobs are queued for that queue
  class_tasks_running BIGINT,	-- how many tasks are running in that queue
  class_tasks_queued BIGINT,	-- how many tasks are queued for that queue
  class_nodes_running BIGINT,	-- how many nodes are currently running jobs
  class_nodes_queued BIGINT,	-- how many nodes are queued
  class_time_running BIGINT,	-- how much wallclock time will be used by currently running jobs (s)
  class_time_used BIGINT,	-- how much wallclock time is already used up by running jobs (s)
  class_time_queued BIGINT,	-- how much wallclock time will be used by jobs in the queue (s)
  class_cpu_time_running BIGINT,	-- how much cpu time will be used by currently running jobs (s)
  class_cpu_time_used BIGINT,	-- how much cpu time is already used up by running jobs (s)
  class_cpu_time_queued BIGINT,	-- how much cpu time will be used by jobs in the queue (s)
  user_jobs_running BIGINT,	-- how many jobs does this user have running
  user_jobs_queued BIGINT,	-- how many jobs does this user have queued
  user_tasks_running BIGINT,	-- how many tasks does this user have running
  user_tasks_queued BIGINT,	-- how many tasks does this user have queued
  user_nodes_running BIGINT,	-- how many nodes does this user have running
  user_nodes_queued BIGINT,	-- how many nodes does this user have queued
  user_time_running BIGINT,	-- how much wallclock time will the user's running jobs take (s)
  user_time_used BIGINT, 	-- how much wallclock time have this user's running jobs already used up (s)
  user_time_queued BIGINT,	-- how much wallclock time will the user's queued jobs take (s)
  user_cpu_time_running BIGINT,	-- how much cpu time will the user's running jobs take (s)
  user_cpu_time_used BIGINT,	-- how much cpu time have this user's running jobs already used up (s)
  user_cpu_time_queued BIGINT,	-- how much cpu time will the user's queued jobs take (s)
  time_spent_queued BIGINT,	-- how much wallclock time did the job stay in the queue (s)
  job_id BIGINT AUTO_INCREMENT PRIMARY KEY
);
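
As a rough illustration of how such records might feed the regression component described below, the following Python sketch turns job records into rows with the value to predict in the last position. It assumes that time_spent_queued is the dependent variable and that records arrive as dictionaries keyed by column name; the selection of predictor columns and the names FEATURES and to_training_rows are hypothetical, not taken from the actual service.

```python
# Hypothetical choice of predictor columns; any subset of the schema could be used.
FEATURES = [
    "time_requested", "tasks_requested", "nodes_requested",
    "sys_jobs_queued", "class_jobs_queued", "user_jobs_queued",
]
# Assumption: the queuing time is the value the regression should predict.
TARGET = "time_spent_queued"


def to_training_rows(jobs, features=FEATURES, target=TARGET):
    """Turn job records (dicts keyed by column name) into rows with the target last."""
    rows = []
    for job in jobs:
        # Skip records whose queuing time is not known yet (e.g. job still queued).
        if job.get(target) is None:
            continue
        rows.append([job[name] for name in features] + [job[target]])
    return rows
```

Records without a known queuing time are dropped here because they cannot serve as training examples.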

Queuing Time Prediction Service Architecture

The service consists of the following components:

  • A MySQL database created according to the schema above. It currently resides on the nagios-compat.drg.lrz.de machine, but should probably be moved to a dedicated VM.
  • A web service for the queuing time prediction service with statistics, etc. It currently resides on the nagios-compat.drg.lrz.de machine, but should probably be moved to a dedicated VM (possibly the same one as the database).
  • A cloud VM and a web service for training the non-linear regression component of the system using genetic programming (GP). A single GP run can last a couple of hours even on 8 cores. It might make sense to submit these jobs to the cluster, since GP scales well with the number of cores.

Genetic Programming Service For Non-Linear Regression

https://gitlab.lrz.de/di73kuj2/gpservice

This asynchronous web service performs non-linear regression using genetic programming. Given a table of entries with the dependent variable last in each row, it returns the regression expression. On the LRZ internal network it runs on 10.155.209.37:5000.

Submitting A Regression Task

Data has to be submitted in the following format:

{
    "data" : [[x1_1, x1_2, x1_3, ..., x1_n, y1],
              [x2_1, x2_2, x2_3, ..., x2_n, y2],
              ...
              [xm_1, xm_2, xm_3, ..., xm_n, ym]],
    "population_size" : integer,
    "iterations" : integer, 
    "tournament_size" : integer
}

The parameters “population_size”, “iterations” and “tournament_size” control the size of the genetic programming population, the number of algorithm iterations and the tournament size used for tournament selection, respectively.

Note that the dependent variable has to be last in each row. Any number of attributes is supported. With curl, a request could look like this:

curl -d "{\"data\" : [[1, 2, 3, 4], [5, 6, 7, 8]]}" http://10.155.209.37:5000
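
The same submission can be sketched in Python using only the standard library. The helper name build_submission and the example parameter values are illustrative assumptions, not part of the service. The request mirrors the curl call above, including its default form-encoded content type, since it is not stated here which content types the service accepts.

```python
import json
import urllib.request

SERVICE_URL = "http://10.155.209.37:5000"  # address given in this document


def build_submission(rows, population_size, iterations, tournament_size,
                     url=SERVICE_URL):
    """Build (but do not send) a POST request for a regression task."""
    # Every row holds n attributes followed by the dependent variable,
    # so all rows must have the same length.
    if not rows or len({len(row) for row in rows}) != 1:
        raise ValueError("rows must be non-empty and of equal length")
    payload = {
        "data": rows,
        "population_size": population_size,
        "iterations": iterations,
        "tournament_size": tournament_size,
    }
    # No explicit Content-Type: like `curl -d`, urllib falls back to the
    # form-encoded default when the request is sent.
    return urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"), method="POST"
    )


# The parameter values below are arbitrary placeholders.
request = build_submission([[1, 2, 3, 4], [5, 6, 7, 8]], 100, 50, 3)
# Sending is left to the caller: urllib.request.urlopen(request)
```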

The service then returns a job ID in the form of a UUID4. The response takes the form:

{
  "jobid": "1380ab36-fd49-4b94-8f0c-ecd08bff0823"
}

The ID is unique to each job. It is then used to retrieve the results for this job and to query the job status.
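
A client might want to check the returned ID before storing it. The sketch below assumes the response body is exactly the JSON object shown above; the function name parse_job_id is hypothetical.

```python
import json
import uuid


def parse_job_id(response_text):
    """Extract the job ID from the submission response and check that it is a UUID4."""
    job_id = json.loads(response_text)["jobid"]
    parsed = uuid.UUID(job_id)  # raises ValueError if the string is not a UUID
    if parsed.version != 4:
        raise ValueError(f"expected a UUID4, got version {parsed.version}")
    return str(parsed)


job_id = parse_job_id('{"jobid": "1380ab36-fd49-4b94-8f0c-ecd08bff0823"}')
```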

Querying Results

To check the results, use the /results/<jobid> URL, for example:

http://10.155.209.37:5000/results/1380ab36-fd49-4b94-8f0c-ecd08bff0823
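
A minimal Python sketch for building this URL and fetching the result follows. The format of the result body is not documented here, so the sketch returns it as plain text rather than guessing at its structure; the function names results_url and fetch_results are hypothetical.

```python
import urllib.parse
import urllib.request

SERVICE_URL = "http://10.155.209.37:5000"  # address given in this document


def results_url(job_id, base_url=SERVICE_URL):
    """Return the /results/<jobid> URL for a job."""
    return f"{base_url.rstrip('/')}/results/{urllib.parse.quote(job_id, safe='')}"


def fetch_results(job_id, timeout=10):
    """Fetch the raw response body for a job and return it as text."""
    with urllib.request.urlopen(results_url(job_id), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Since the service is asynchronous, a caller would presumably repeat fetch_results until the job has finished; how an unfinished job is reported is not specified here.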

QTPS API

Request:

{
  number_of_cores,
  number_of_nodes,
  wallclock,
  user,
  node_type
}
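
The request fields could be modelled in Python roughly as follows. The field names come from the request above; the types, the unit of wallclock and all example values are assumptions, and the class name QtpsRequest is hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class QtpsRequest:
    number_of_cores: int
    number_of_nodes: int
    wallclock: int  # assumed to be in seconds, as in the jobs table
    user: str
    node_type: str

    def to_json(self):
        """Serialize the request as a JSON object keyed by field name."""
        return json.dumps(asdict(self))


# All values below are made-up placeholders.
example = QtpsRequest(number_of_cores=56, number_of_nodes=2, wallclock=3600,
                      user="someuser", node_type="thin")
```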

Reply:

{
  {
    q1: [(interval1, probability1), ...],
    q2: [(interval1, probability1), ...],
    ...
  }
}
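
One way a client might consume such a reply is to pick the most probable interval for each queue. The Python sketch below assumes the reply has been decoded into a mapping from queue name to a list of (interval, probability) pairs; intervals are treated as opaque values because their format is not specified here, and the sample labels and the function name most_likely_intervals are made up.

```python
def most_likely_intervals(reply):
    """Map each queue name to the interval with the highest probability."""
    best = {}
    for queue, pairs in reply.items():
        if not pairs:
            continue  # no prediction available for this queue
        interval, _probability = max(pairs, key=lambda pair: pair[1])
        best[queue] = interval
    return best


# Hypothetical decoded reply; the interval labels are placeholders.
sample = {
    "q1": [("0-1h", 0.7), ("1-4h", 0.3)],
    "q2": [("0-1h", 0.2), ("1-4h", 0.8)],
}
```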
queuing_time.txt · Last modified: 2017/10/19 13:43 by di73kuj2