Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

google.location == us-east5 in Nextflow config leads to wrong machine type choices #5602

Open
aaronegolden opened this issue Dec 12, 2024 · 16 comments
Assignees
Labels

Comments

@aaronegolden
Copy link
Contributor

cloudinfo.seqera.io seems to be missing information for us-east5

https://cloudinfo.seqera.io/api/v1/providers/google/services/compute/regions/us-east5/products

yields

{
  "type": "about:blank",
  "title": "validation problem",
  "status": 400,
  "detail": "Key: 'GetRegionPathParams.Region' Error:Field validation for 'Region' failed on the 'region' tag"
}

and as a consequence the GoogleBatchMachineTypeSelector can't find any available machine types for us-east5 here:

https://github.com/nextflow-io/nextflow/blob/master/plugins/nf-google/src/main/nextflow/cloud/google/batch/GoogleBatchMachineTypeSelector.groovy#L144

which can lead to NextFlow submitting Batch jobs with incorrect machine types.

@pditommaso
Copy link
Member

Is it reporting an error? can you please include the .nextflow.log file?

@tony-vs
Copy link

tony-vs commented Dec 16, 2024

When comparing the returned data for the us-east5 region to the GCP Compute product listing for the us-east1 region (e.g., us-east1 region data), it becomes evident that the us-east5 data contains invalid information.

@pditommaso
Copy link
Member

@jorgee worth to have a look at this

@jorgee
Copy link
Contributor

jorgee commented Dec 17, 2024

In this case, the job should fail instead of submitting it, is it correct?

@pditommaso
Copy link
Member

Likely it should report a warning and fallback on the default mechanism that's this

final memoryGB = Math.ceil(memoryMB / 1024.0 as float) as int

@tony-vs
Copy link

tony-vs commented Dec 17, 2024

Are we expecting a valid response from the API call? The us-east5 region is valid, so the data returned should reflect this.

In this case, the job should fail instead of submitting it, is it correct?

@pditommaso
Copy link
Member

It should but, for some reason that we are investigating, it's failing. In any case, Nextflow should be resilient to these error conditions.

@jorgee
Copy link
Contributor

jorgee commented Dec 18, 2024

I have run a pipeline using us-east5 and I see there is already a fallback. Looking at the .nextflow.log, I can see the cloudInfo call returns error 400. It generates an exception that is catched and the job has been scheduled and running with an instance type selected by Google Batch in the selected region.

The log file shows the following:

Dec-17 20:47:02.215 [Task submitter] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Cannot select machine type using Seqera Cloudinfo for task: `TEST (1)` - Server returned HTTP response code: 400 for URL: https://cloudinfo.seqera.io/api/v1/providers/google/services/compute/regions/us-east5/products
Dec-17 20:47:02.217 [Task submitter] TRACE n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] new job request > task_groups {
  task_spec {
    compute_resource {
      cpu_milli: 1000
    }
    volumes {
      gcs {
        remote_path: "jorgee-test"
      }
      mount_path: "/mnt/disks/jorgee-test"
      mount_options: "-o rw"
      mount_options: "-implicit-dirs"
    }
    runnables {
      container {
        image_uri: "nextflow/rnaseq-nf"
        commands: "/bin/bash"
        commands: "-o"
        commands: "pipefail"
        commands: "-c"
        commands: "trap \"{ cp .command.log /mnt/disks/jorgee-test/work/e0/b3bbe04c961de9304e1623bfd95638/.command.log; }\" ERR; /bin/bash /mnt/disks/jorgee-test/work/e0/b3bbe04c961de9304e1623bfd95638/.command.run 2>&1 | tee .command.log"
        volumes: "/mnt/disks/jorgee-test/work/e0/b3bbe04c961de9304e1623bfd95638:/mnt/disks/jorgee-test/work/e0/b3bbe04c961de9304e1623bfd95638:rw"
      }
      environment {
      }
    }
  }
}
allocation_policy {
  instances {
    policy {
      provisioning_model: SPOT
    }
  }
}
logs_policy {
  destination: CLOUD_LOGGING
}

Dec-17 20:47:02.866 [Task submitter] DEBUG n.c.g.batch.GoogleBatchTaskHandler - [GOOGLE BATCH] Process `TEST (1)` submitted > job=nf-e0b3bbe0-1734464817649; uid=nf-e0b3bbe0-173446-e5a1ec47-c3a6-44160; work-dir=gs://jorgee-test/work/e0/b3bbe04c961de9304e1623bfd95638
Dec-17 20:47:02.867 [Task submitter] INFO  nextflow.Session - [e0/b3bbe0] Submitted process > TEST (1)

and the Google Batch service dashboard I can see the selected instance_type for the job:
image

In this part of the code, Nextflow invoke the GoogleBatchMachineTypeSelector.bestMachineType method that calls the CloudInfo service in GoogleBatchMachineTypeSelector.getAvailableMachineTypes throwing the exception when error 400 is produced. The exception is catch return null to do not include the instance type in the Jobs's allocation policy. When no instance type is provided Google Batch automatically decides which instance to use.

From the Nextflow point of view, I would just add the warning message to notify this situation to the user. I think there is no need to modify the fallback mechanism.

@tony-vs
Copy link

tony-vs commented Dec 18, 2024

If a user specifies a particular machineType to use, the fallback machine will not align with their request. Should the process fail in such cases to ensure accurate provisioning?

@jorgee
Copy link
Contributor

jorgee commented Dec 18, 2024

If the user specifies a particular machineType, then there is no fallback. the machineType is it is set to the BatchJob.
However, the case where I can see an issue is when the user specifies a family with a pattern such as 'n1-*'. Maybe it is the case that @tony-vs is referring to. In this case, Google Batch can not fulfill the family directive. I can't see any way to specify an instance family/series in the Job request description, so I think Nextflow should raise the failure in this case. What do you think @pditommaso?

@tony-vs
Copy link

tony-vs commented Dec 18, 2024

You're correct. Currently, for any machineType with a pattern like n1-* or a2-* for examples that includes GPU requirements, it always falls back to e2-* machines, which do not satisfy the user's GPU requests.

If the user specifies a particular machineType, then there is no fallback. the machineType is it is set to the BatchJob. However, the case where I can see an issue is when the user specifies a family with a pattern such as 'n1-*'. Maybe it is the case that @tony-vs is referring to. In this case, Google Batch can not fulfill the family directive. I can't see any way to specify an instance family/series in the Job request description, so I think Nextflow should raise the failure in this case.

@pditommaso
Copy link
Member

However, the case where I can see an issue is when the user specifies a family with a pattern such as 'n1-*'.

@jorgee can you provide more details about this?

@jorgee
Copy link
Contributor

jorgee commented Dec 19, 2024

Here, it is checking if the MachineType is a particular type, a list (separated by commas) or a family (with *). It only sets the MachineType if the user provides a particular type. When there is a list, we could choose one of them (for instance the first one) but when the user provides a family, we can not set any hint to Google Batch because the API only support to set a particular machine type in the allocation policy. So the selected instance by Google Batch could not match with the family specified by the user.

There is an Google Compute API call to get the types, that we could use but I think we will re-implement the CloudInfo functionality. Therefore, I think the best in this case is to raise a failure to the user suggesting to specify a single instance.

@pditommaso
Copy link
Member

My point is that if a concrete machineType is provided (no list, not *) it should retained if cloudinfo is disabled or fails, because it will take this branch

return new GoogleBatchMachineTypeSelector.MachineType(
type: machineType,
location: location,
priceModel: priceModel
)

and here

Is that correct?

@jorgee
Copy link
Contributor

jorgee commented Dec 19, 2024

Yes, it is correct in that case there is no issue. The issue is if the user provides a family. It will continue without any machine type in the instance policy, so the selected instance could not fit with the user directive and could produce a failure. We could warn and continue (assuming that the job could fail) or raise a failure without submitting the job.

@pditommaso
Copy link
Member

I would limit to report a warning when Cloud info fails to fetch the required metadata.

Regarding this comment, the problem looks related when the it's specified a machine type and a GPU.

However I was wonder if the problem is only when cloud info is failing or in any case because I don't see any condition in the logic below related to GPU filtering

if (families == null)
families = Collections.<String>emptyList()
// Check if a specific machine type was defined
if (families.size() == 1) {
final familyOrType = families.get(0)
if (familyOrType.contains("custom-"))
return new MachineType(type: familyOrType, family: 'custom', cpusPerVm: cpus, memPerVm: memoryMB, location: region, priceModel: spot ? PriceModel.spot : PriceModel.standard)
final machineType = getAvailableMachineTypes(region, spot).find { it.type == familyOrType }
if( machineType )
return machineType
}
final memoryGB = Math.ceil(memoryMB / 1024.0 as float) as int
if (!families ) {
families = fusionEnabled
? DEFAULT_FAMILIES_FOR_FUSION
: DEFAULT_FAMILIES
}
// All types are valid if no families are defined, otherwise at least it has to start with one of the given values
final matchMachineType = {String type -> !families || families.find { matchType(it, type) }}
// find machines with enough resources and SSD local disk
final validMachineTypes = getAvailableMachineTypes(region, spot).findAll {
it.cpusPerVm >= cpus &&
it.memPerVm >= memoryGB &&
matchMachineType(it.type)
}.collect()
final sortedByCost = validMachineTypes.sort {
(it.cpusPerVm > 2 || it.memPerVm > 2 ? FAMILY_COST_CORRECTION.get(it.family, 1.0) : 1.0) * (spot ? it.spotPrice : it.onDemandPrice)
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

When branches are created from issues, their pull requests are automatically linked.

4 participants