Scheduling Jobs Involving Both RAM and CPU

Overview

The problem was part of a system design question, in which I had to design two functions:

Schedule a job
Check whether a job is finished

The criterion for scheduling a job on a rack of machines was that each job requires a certain amount of CPU and RAM, and each machine has a different amount of CPU and RAM.

So how do I go about scheduling jobs optimally on these machines?

Multiple jobs can be scheduled on the same machine, as long as it has enough free CPU and RAM for all of them.

The second part was to check whether a job has finished.

Solution

Categorize jobs by their CPU and memory requirements, for example S, M, L, XL.
Prioritize the jobs.
Use message queues to publish jobs. Meanwhile, provide an API layer to get and update job info (including status).
Each worker node subscribes to the message queues.
Each worker node has its own logic to determine which kinds of job it can take, based on its current CPU and memory consumption.
Jobs running on a worker node are isolated. The worker node has a daemon that monitors each job's resource consumption and status, and can kill a job that is dead or exceeds its resource quota.
The worker node daemon periodically reports job status to the API layer.
Have a monitor check the message queue length and scale the number of machines dynamically.
Have monitors track job progress. If a job has had no update for a long time, mark the host worker node as unhealthy and release the job. These monitors can be co-located on the worker nodes and monitor each other.
Clients poll the API layer directly to get the job status. If traffic is heavy, put a cache layer behind the API, with invalidation logic.
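
The worker-side steps above (size classes, and a worker deciding from its free capacity which jobs it can take) might be sketched as follows. This is a minimal single-process sketch under stated assumptions, not the design itself: the JobSize quotas, the WorkerNode class, and the in-memory Deque standing in for the message queue are all hypothetical names and values chosen for illustration.

```java
import java.util.Deque;
import java.util.Iterator;

// Hypothetical size classes; the CPU/RAM quotas are illustrative assumptions.
enum JobSize {
    S(1, 2), M(2, 4), L(4, 16), XL(8, 64);

    final int cpu;    // CPU units reserved for a job of this class
    final int ramGb;  // RAM in GB reserved for a job of this class

    JobSize(int cpu, int ramGb) {
        this.cpu = cpu;
        this.ramGb = ramGb;
    }

    // Returns the smallest class whose quota covers the requested CPU and RAM.
    static JobSize classify(int cpu, int ramGb) {
        for (JobSize size : values()) {
            if (cpu <= size.cpu && ramGb <= size.ramGb) {
                return size;
            }
        }
        throw new IllegalArgumentException("Job exceeds the largest size class");
    }
}

// Hypothetical worker node that tracks its own free capacity.
class WorkerNode {
    private int freeCpu;
    private int freeRamGb;

    WorkerNode(int cpu, int ramGb) {
        this.freeCpu = cpu;
        this.freeRamGb = ramGb;
    }

    boolean canAccept(JobSize size) {
        return size.cpu <= freeCpu && size.ramGb <= freeRamGb;
    }

    // Takes the first queued job that fits and reserves its quota.
    // Returns null when nothing in the queue fits right now.
    JobSize poll(Deque<JobSize> queue) {
        for (Iterator<JobSize> it = queue.iterator(); it.hasNext(); ) {
            JobSize size = it.next();
            if (canAccept(size)) {
                it.remove();
                freeCpu -= size.cpu;
                freeRamGb -= size.ramGb;
                return size;
            }
        }
        return null;
    }

    // Returns a finished (or killed) job's quota to the node.
    void finish(JobSize size) {
        freeCpu += size.cpu;
        freeRamGb += size.ramGb;
    }
}
```

Because the worker scans the queue in order, keeping the queue sorted by priority would make it take the highest-priority job that currently fits. With a real message broker the worker cannot usually peek past the head of a queue, so one queue per size class is a common alternative.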

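import java.util.Comparator;
import java.util.TreeSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class Scheduler {
    // Machines ordered by free CPU, then free RAM, then free cores. The id tie-breaker
    // keeps two machines with identical free capacity from being treated as duplicates.
    private final TreeSet<Machine> machinePool = new TreeSet<>(
            Comparator.comparingInt((Machine m) -> m.cpu)
                    .thenComparingInt(m -> m.ram)
                    .thenComparingInt(m -> m.core)
                    .thenComparingInt(m -> m.id));

    public synchronized void addMachine(Machine machine) {
        machinePool.add(machine);
    }

    public synchronized <T> Future<T> submit(Task<T> task) {
        // Search key only; it sorts before every real machine with the same free capacity.
        Machine required = new Machine(Integer.MIN_VALUE, task.cpu, task.ram, 1);
        // ceiling(required) alone is not enough: the order is lexicographic, so a machine
        // with more free CPU can sort after the key and still have too little RAM.
        Machine found = null;
        for (Machine m : machinePool.tailSet(required, true)) {
            if (m.ram >= task.ram && m.core >= 1) {
                found = m;
                break;
            }
        }
        if (found == null) throw new RuntimeException("All machines busy");
        machinePool.remove(found);   // remove before the sort keys change
        Future<T> future = found.submit(task, this);
        machinePool.add(found);      // re-insert at the new position
        return future;
    }

    // Called from the machine's worker thread when a task ends.
    synchronized void release(Machine machine, Task<?> task) {
        machinePool.remove(machine);
        machine.cpu += task.cpu;
        machine.ram += task.ram;
        machine.core++;
        machinePool.add(machine);    // reorder
    }
}

class Machine {
    final int id;
    int cpu, ram, core;                 // available CPU/RAM/cores; guarded by the Scheduler lock
    final int cfgCPU, cfgRAM, cfgCore;  // configured CPU/RAM/cores
    private final ExecutorService executor = Executors.newWorkStealingPool();

    public Machine(int id, int cpu, int ram, int core) {
        this.id = id;
        this.cpu = this.cfgCPU = cpu;
        this.ram = this.cfgRAM = ram;
        this.core = this.cfgCore = core;
    }

    // Caller must hold the Scheduler lock and have removed this machine from the pool.
    <T> Future<T> submit(final Task<T> task, final Scheduler scheduler) {
        if (cpu < task.cpu || ram < task.ram || core < 1) {
            throw new RuntimeException("Not enough resources");
        }
        cpu -= task.cpu;
        ram -= task.ram;
        core--;
        return executor.submit(() -> {
            try {
                return task.execute();
            } finally {
                scheduler.release(this, task);  // free the resources even if the task throws
            }
        });
    }
}

abstract class Task<T> {
    public final int cpu, ram;  // required CPU and RAM for this task

    protected Task(int cpu, int ram) {
        this.cpu = cpu;
        this.ram = ram;
    }

    abstract T execute();  // execute the task and return the computed result of type T
}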
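
For the second part of the question, checking whether a job has finished, one option is to keep the Future returned at submission and ask it. The following is a minimal sketch, assuming a hypothetical JobTracker class, job ids handed out by a counter, and a small local thread pool standing in for the rack. None of these names come from the code above.

```java
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical tracker: maps a job id to the Future of the running job.
class JobTracker {
    private final ExecutorService executor = Executors.newFixedThreadPool(2);
    private final Map<Long, Future<?>> jobs = new ConcurrentHashMap<>();
    private final AtomicLong nextId = new AtomicLong();

    // Schedules the job and returns an id the client can poll with.
    long schedule(Callable<?> job) {
        long id = nextId.incrementAndGet();
        jobs.put(id, executor.submit(job));
        return id;
    }

    // True once the job has completed, failed, or been cancelled.
    boolean isFinished(long id) {
        Future<?> future = jobs.get(id);
        if (future == null) {
            throw new IllegalArgumentException("Unknown job " + id);
        }
        return future.isDone();
    }

    void shutdown() {
        executor.shutdown();
    }
}
```

Note that Future.isDone() also returns true when the job threw an exception or was cancelled, so a status API that must tell success from failure would need to inspect the Future further (for example by calling get() and catching ExecutionException).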