Start a custom job in Google Vertex AI.
For more details, check out the custom job documentation.
type: "io.kestra.plugin.gcp.vertexai.CustomJob"

id: gcp_vertexai_custom_job
namespace: company.team

tasks:
  - id: custom_job
    type: io.kestra.plugin.gcp.vertexai.CustomJob
    projectId: my-gcp-project
    region: europe-west1
    displayName: Start Custom Job
    spec:
      workerPoolSpecs:
        - containerSpec:
            imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
          machineSpec:
            machineType: n1-standard-4
          replicaCount: 1
The job display name.
The GCP region.
The job specification.
Delete the job at the end. Default: true.
The GCP service account to impersonate.
The GCP project ID.
The GCP scopes to be used. Default: ["https://www.googleapis.com/auth/cloud-platform"].
The GCP service account.
Wait for the end of the job. Default: true.
Waiting allows the task to capture the job status and logs.
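As an illustration of the two flags described above, the fragment below submits a job without blocking on its completion and keeps the job afterwards. The property names `wait` and `delete` are assumptions inferred from the descriptions, not confirmed by this page, so check them against the plugin's property list. The task id `custom_job_async` is hypothetical.

```yaml
tasks:
  - id: custom_job_async            # hypothetical task id
    type: io.kestra.plugin.gcp.vertexai.CustomJob
    projectId: my-gcp-project
    region: europe-west1
    displayName: Start Custom Job
    wait: false                     # assumed property name: do not wait for the job to end
    delete: false                   # assumed property name: keep the job after the task ends
    spec:
      workerPoolSpecs:
        - containerSpec:
            imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
          machineSpec:
            machineType: n1-standard-4
          replicaCount: 1
```

Since waiting is what allows the job status and logs to be captured, a task configured this way would presumably not report them.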
Time when the CustomJob was created. Format: date-time.
Time when the CustomJob ended. Format: date-time.
Resource name of a CustomJob.
The detailed state of the CustomJob. Possible values: JOB_STATE_UNSPECIFIED, JOB_STATE_QUEUED, JOB_STATE_PENDING, JOB_STATE_RUNNING, JOB_STATE_SUCCEEDED, JOB_STATE_FAILED, JOB_STATE_CANCELLING, JOB_STATE_CANCELLED, JOB_STATE_PAUSED, JOB_STATE_EXPIRED, JOB_STATE_UPDATING, JOB_STATE_PARTIALLY_SUCCEEDED, UNRECOGNIZED.
Time when the CustomJob was updated. Format: date-time.
The URI of a container image in the Container Registry that is to be run on each worker replica.
Must be hosted on Google Container Registry, for example: gcr.io/{{ project }}/{{ dir }}/{{ image }}:{{ tag }}
The arguments to be passed when starting the container.
The command to be invoked when the container is started.
It overrides the entrypoint instruction in Dockerfile when provided.
Environment variables to be passed to the container.
Maximum limit is 100.
The Cloud Storage location to store the output of this job.
Whether you want Vertex AI to enable interactive shell access to training containers.
The full name of the Compute Engine network to which the Job should be peered.
The format is projects/{project}/global/networks/{network}, where {project} is a project number, as in 12345, and {network} is a network name.
For example, projects/12345/global/networks/myVPC.
To specify this field, you must have already configured VPC Network Peering for Vertex AI.
If this field is left unspecified, the job is not peered with any network.
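A minimal sketch of how the peered network might be set on the job specification. The key name `network` is an assumption based on the description above. The value reuses the example path from this page.

```yaml
spec:
  # Assumed property name. Requires VPC Network Peering for Vertex AI
  # to be configured beforehand.
  network: projects/12345/global/networks/myVPC
  workerPoolSpecs:
    - containerSpec:
        imageUri: gcr.io/my-gcp-project/my-dir/my-image:latest
      machineSpec:
        machineType: n1-standard-4
      replicaCount: 1
```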
Scheduling options for a CustomJob.
Specifies the service account for workload run-as account.
Users submitting jobs must have act-as permission on this run-as account.
If unspecified, the [Vertex AI Custom Code Service Agent](https://cloud.google.com/vertex-ai/docs/general/access-control#service-agents) for the CustomJob's project is used.
The name of a Vertex AI Tensorboard resource to which this CustomJob
will upload Tensorboard logs. Format: projects/{project}/locations/{location}/tensorboards/{tensorboard}
Google Cloud Storage URI to output directory.
If the URI doesn't end with '/', a '/' is automatically appended. The directory is created if it doesn't exist.
The custom container task.
The specification of a single machine.
The specification of the disk.
The Python package specs.
The specification of the disk.
The Google Cloud Storage location of the Python package files which are the training program and its dependent packages.
The maximum number of package URIs is 100.
Environment variables to be passed to the Python module.
Maximum limit is 100.
The Google Cloud Storage location of the Python package files which are the training program and its dependent packages.
The maximum number of package URIs is 100.
Size in GB of the boot disk. Default: 100.
Type of the boot disk. Default: PD_SSD. Possible values: PD_SSD, PD_STANDARD.
The type of the machine.
The number of accelerators to attach to the machine.
The type of accelerator(s) that may be attached to the machine. Possible values: ACCELERATOR_TYPE_UNSPECIFIED, NVIDIA_TESLA_K80, NVIDIA_TESLA_P100, NVIDIA_TESLA_V100, NVIDIA_TESLA_P4, NVIDIA_TESLA_T4, NVIDIA_TESLA_A100, NVIDIA_A100_80GB, NVIDIA_L4, NVIDIA_H100_80GB, NVIDIA_H100_MEGA_80GB, NVIDIA_H200_141GB, NVIDIA_B200, NVIDIA_GB200, NVIDIA_RTX_PRO_6000, TPU_V2, TPU_V3, TPU_V4_POD, TPU_V5_LITEPOD, UNRECOGNIZED.
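The machine type, accelerator count, and accelerator type described above could be combined in a machine specification along these lines. This is a sketch only: `machineType` appears in the example on this page, while `acceleratorType` and `acceleratorCount` are assumed key names inferred from the descriptions.

```yaml
machineSpec:
  machineType: n1-standard-4
  acceleratorType: NVIDIA_TESLA_T4   # assumed property name; value taken from the list above
  acceleratorCount: 1                # assumed property name
```

Not every accelerator can be attached to every machine type, so the pairing should be checked against what Compute Engine offers in the chosen region.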
Restarts the entire CustomJob if a worker gets restarted.
This feature can be used by distributed training jobs that are not resilient to workers leaving and joining a job.
The maximum job running time. Format: duration. The default is 7 days.
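For illustration, the scheduling options might be written as below. The key names `scheduling`, `timeOut`, and `restartJobOnWorkerRestart` are assumptions inferred from the descriptions and are not confirmed by this page. The duration is written in ISO 8601 form, which is how Kestra usually expresses durations.

```yaml
spec:
  scheduling:                         # assumed property name
    timeOut: PT2H                     # assumed property name; caps the run at two hours
    restartJobOnWorkerRestart: true   # assumed property name; restart the whole job if a worker restarts
```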