AWS Batch
User Guide
What is AWS Batch?
AWS Batch helps you to run batch computing workloads on the AWS Cloud. Batch computing is a common way for developers, scientists, and engineers to access large amounts of compute resources. AWS Batch removes the undifferentiated heavy lifting of configuring and managing the required infrastructure, similar to traditional batch computing software. The service can efficiently provision resources in response to submitted jobs to eliminate capacity constraints, reduce compute costs, and deliver results quickly.
As a fully managed service, AWS Batch helps you run batch computing workloads of any scale. AWS Batch automatically provisions compute resources and optimizes the workload distribution based on the quantity and scale of the workloads. With AWS Batch, there's no need to install or manage batch computing software, so you can focus your time on analyzing results and solving problems.
AWS Batch provides all of the necessary functionality to run high-scale, compute-intensive workloads on top of AWS managed container orchestration services, Amazon ECS and Amazon EKS. AWS Batch can scale compute capacity on Amazon EC2 instances and Fargate resources. AWS Batch provides a fully managed service for batch workloads, and delivers the operational capabilities to optimize these types of workloads for throughput, speed, resource efficiency, and cost.
AWS Batch also enables SageMaker Training job queuing, which lets data scientists and ML engineers submit Training jobs with priorities to configurable queues. You can configure queues with specific policies to optimize cost, performance, and resource allocation for your ML Training workloads. Queued ML workloads run automatically as soon as resources become available, eliminating the need for manual coordination and improving resource utilization.
This provides a shared responsibility model: administrators set up the infrastructure and permissions, while data scientists focus on submitting and monitoring their ML training workloads. Jobs are automatically queued and run based on configured priorities and resource availability.
Are you a first-time AWS Batch user?
If you are a first-time user of AWS Batch, we recommend that you begin by reading the following sections:
• Components of AWS Batch
• Create IAM account and administrative user
• Setting up AWS Batch
• Getting started with AWS Batch tutorials
• Getting started with AWS Batch on SageMaker AI
Related services
AWS Batch is a fully managed batch computing service that plans, schedules, and runs your containerized batch ML, simulation, and analytics workloads across the full range of AWS compute offerings, such as Amazon ECS, Amazon EKS, AWS Fargate, and Spot or On-Demand Instances. For more information about each managed compute service, see:
• Amazon EC2 User Guide
• AWS Fargate Developer Guide
• Amazon EKS User Guide
• Amazon SageMaker AI Developer Guide
Accessing AWS Batch
You can access AWS Batch using the following:
AWS Batch console
The web interface where you create and manage resources.
AWS Command Line Interface
Interact with AWS services using commands in your command line shell. The AWS Command Line Interface is supported on Windows, macOS, and Linux. For more information about the AWS CLI, see the AWS Command Line Interface User Guide. You can find the AWS Batch commands in the AWS CLI Command Reference.
AWS SDKs
If you prefer to build applications using language-specific APIs instead of submitting a request over HTTP or HTTPS, use the libraries, sample code, tutorials, and other resources provided by AWS. These libraries provide basic functions that automate tasks such as cryptographically signing your requests, retrying requests, and handling error responses. These functions make it more efficient for you to get started. For more information, see Tools to Build on AWS.
Components of AWS Batch
AWS Batch simplifies running batch jobs across multiple Availability Zones within a Region. You can create AWS Batch compute environments within a new or existing VPC. After a compute environment is up and associated with a job queue, you can define job definitions that specify which Docker container images run your jobs. Container images are stored in and pulled from container registries, which can exist within or outside of your AWS infrastructure.
Compute environment
A compute environment is a set of managed or unmanaged compute resources that are used to run jobs. With managed compute environments, you can specify the desired compute type (Fargate or EC2) at several levels of detail. You can set up compute environments that use a particular type of EC2 instance, such as a specific model like c5.2xlarge or m5.12xlarge. Or, you can choose to specify only that you want to use the newest instance types. You can also specify the minimum, desired, and maximum number of vCPUs for the environment, along with the amount that you're
willing to pay for a Spot Instance as a percentage of the On-Demand Instance price, and a target set of VPC subnets. AWS Batch efficiently launches, manages, and terminates compute resources as needed. You can also manage your own compute environments. In that case, you're responsible for setting up and scaling the instances in the Amazon ECS cluster that AWS Batch creates for you. For more information, see Compute environments for AWS Batch.
Job queues
When you submit an AWS Batch job, you submit it to a particular job queue, where the job resides until it's scheduled onto a compute environment. You associate one or more compute environments with a job queue. You can also assign priority values for these compute environments, and even across job queues themselves. For example, you can have a high-priority queue that you submit time-sensitive jobs to, and a low-priority queue for jobs that can run anytime when compute resources are cheaper. For more information, see Job queues.
Job definitions
A job definition specifies how jobs are to be run. You can think of a job definition as a blueprint for the resources in your job. You can supply your job with an IAM role to provide access to other AWS resources. You also specify both memory and CPU requirements. The job definition can also control container properties, environment variables, and mount points for persistent storage. Many of the specifications in a job definition can be overridden by specifying new values when submitting individual jobs. For more information, see Job definitions.
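As an illustration, a minimal container job definition might look like the following (the definition name, image, and resource values here are placeholders, not values from this guide):

```json
{
  "jobDefinitionName": "my-hello-world",
  "type": "container",
  "containerProperties": {
    "image": "public.ecr.aws/amazonlinux/amazonlinux:latest",
    "command": ["echo", "hello world"],
    "resourceRequirements": [
      { "type": "VCPU", "value": "1" },
      { "type": "MEMORY", "value": "2048" }
    ]
  }
}
```

You could register a definition like this with aws batch register-job-definition --cli-input-json file://definition.json; values such as the command and resource requirements can then be overridden per job at submission time.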
Jobs
A unit of work (such as a shell script, a Linux executable, or a Docker container image) that you submit to AWS Batch. A job has a name, and runs as a containerized application on AWS Fargate or Amazon EC2 resources in your compute environment, using parameters that you specify in a job definition. Jobs can reference other jobs by name or by ID, and can be dependent on the successful completion of other jobs or the availability of resources that you specify. For more information, see Jobs.
Scheduling policy
You can use scheduling policies to configure how compute resources in a job queue are allocated between users or workloads. Using fair-share scheduling policies, you can assign different share identifiers to workloads or users. The AWS Batch job scheduler defaults to a first-in, first-out (FIFO) strategy. For more information, see Fair-share scheduling policies.
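For example, a fair-share scheduling policy that weights two hypothetical share identifiers differently might look like this (the policy name, identifiers, and values are illustrative):

```json
{
  "name": "my-fairshare-policy",
  "fairsharePolicy": {
    "shareDecaySeconds": 3600,
    "computeReservation": 10,
    "shareDistribution": [
      { "shareIdentifier": "teamA*", "weightFactor": 1 },
      { "shareIdentifier": "teamB*", "weightFactor": 2 }
    ]
  }
}
```

In this sketch, jobs under teamB* would receive half the compute share of teamA* jobs, because a larger weight factor corresponds to a smaller share.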
Consumable resources
A consumable resource is a resource that is needed to run your jobs, such as a third-party license token, database access bandwidth, or a throttled quota for calls to a third-party API. You specify the consumable resources that are needed for a job to run, and AWS Batch takes these resource dependencies into account when it schedules the job. Because only jobs that have all their required resources available are allocated, you can reduce the under-utilization of compute resources. For more information, see Resource-aware scheduling.
Service environment
A service environment defines how AWS Batch integrates with SageMaker for job execution. Service environments enable AWS Batch to submit and manage jobs on SageMaker while providing the queuing, scheduling, and priority management capabilities of AWS Batch. Service environments define capacity limits for specific service types, such as SageMaker Training jobs. The capacity limits control the maximum resources that can be used by service jobs in the environment. For more information, see Service environments for AWS Batch.
Service job
A service job is a unit of work that you submit to AWS Batch to run on a service environment. Service jobs use AWS Batch's queuing and scheduling capabilities while delegating actual execution to the external service. For example, SageMaker Training jobs submitted as service jobs are queued and prioritized by AWS Batch, but the Training job execution occurs within SageMaker AI infrastructure. This integration enables data scientists and ML engineers to benefit from AWS Batch's automated workload management and priority queuing for their SageMaker AI Training workloads. Service jobs can reference other jobs by name or ID and support job dependencies. For more information, see Service jobs in AWS Batch.
Setting up AWS Batch
If you've already signed up for Amazon Web Services (AWS) and are using Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Elastic Container Service (Amazon ECS), you can soon use AWS Batch. The setup process for these services is similar because AWS Batch uses Amazon ECS container instances in its compute environments. To use the AWS CLI with AWS Batch, you must use a version of the AWS CLI that supports the latest AWS Batch features. If you don't see support for an AWS Batch feature in the AWS CLI, upgrade to the latest version. For more information, see http://aws.amazon.com/cli/.
Note
Because AWS Batch uses components of Amazon EC2, you use the Amazon EC2 console for many of these steps.
Complete the following tasks to get set up for AWS Batch.
Topics
• Create IAM account and administrative user
• Create IAM roles for your compute environments and container instances
• Create a key pair for your instances
• Create a VPC
• Create a security group
• Install the AWS CLI
Create IAM account and administrative user
To get started, you need to create an AWS account and a single user that is typically granted administrative rights. To accomplish this, complete the following tutorials:
Sign up for an AWS account
If you do not have an AWS account, complete the following steps to create one.
Getting started with AWS Batch tutorials
You can use the AWS Batch first-run wizard to get started quickly with AWS Batch. After you complete the Prerequisites, you can use the first-run wizard to create a compute environment, a job definition, and a job queue, and then submit a sample "Hello World" job to test your configuration. If you already have a Docker image that you want to launch in AWS Batch, you can use that image to create a job definition.
Getting started with Amazon EC2 orchestration using the Wizard
Amazon Elastic Compute Cloud (Amazon EC2) provides scalable computing capacity in the AWS Cloud. Using Amazon EC2 eliminates your need to invest in hardware up front, so you can develop and deploy applications faster.
You can use Amazon EC2 to launch as many or as few virtual servers as you need, configure security and networking, and manage storage. Amazon EC2 enables you to scale up or down to handle changes in requirements or spikes in popularity, reducing your need to forecast traffic.
Overview
This tutorial demonstrates how to set up AWS Batch with the Wizard to configure Amazon EC2 and run Hello World.
Intended Audience
This tutorial is designed for system administrators and developers responsible for setting up, testing, and deploying AWS Batch.
Features Used
This tutorial shows you how to use the AWS Batch console wizard to:
• Create and configure an Amazon EC2 compute environment
• Create a job queue
• Create a job definition
• Create and submit a job to run
• View the output of the job in CloudWatch
Time Required
It should take about 10–15 minutes to complete this tutorial.
Regional Restrictions
There are no country or regional restrictions associated with using this solution.
Resource Usage Costs
There's no charge for creating an AWS account. However, by implementing this solution, you might incur some or all of the costs that are listed in the following table.
Description: Amazon EC2 instance
Cost (US dollars): You pay for each Amazon EC2 instance that is created. For more information about pricing, see Amazon EC2 Pricing.
Prerequisites
Before you begin:
• Create an AWS account if you don't have one.
• Create the ecsInstanceRole instance role.
Step 1: Create a compute environment
Important
To get started as simply and quickly as possible, this tutorial includes steps with default settings. Before creating resources for production use, we recommend that you familiarize yourself with all settings and deploy with the settings that meet your requirements.
To create a compute environment for an Amazon EC2 orchestration, do the following:
Best practices for AWS Batch
You can use AWS Batch to run a variety of demanding computational workloads at scale without managing a complex architecture. AWS Batch jobs can be used in a wide range of use cases in areas such as epidemiology, gaming, and machine learning.
This topic covers the best practices to consider while using AWS Batch, along with guidance on how to run and optimize your workloads when using AWS Batch.
Topics
• When to use AWS Batch
• Checklist to run at scale
• Optimize containers and AMIs
• Choose the right compute environment resource
• Amazon EC2 On-Demand or Amazon EC2 Spot
• Use Amazon EC2 Spot best practices for AWS Batch
• Common errors and troubleshooting
When to use AWS Batch
AWS Batch runs jobs at scale and at low cost, and provides queuing services and cost-optimized scaling. However, not every workload is suitable to be run using AWS Batch.
• Short jobs – If a job runs for only a few seconds, the overhead of scheduling the batch job might take longer than the runtime of the job itself. As a workaround, binpack your tasks together before you submit them to AWS Batch. Then, configure your AWS Batch jobs to iterate over the tasks. For example, stage the individual task arguments into an Amazon DynamoDB table or as a file in an Amazon S3 bucket. Consider grouping tasks so that each job runs for 3–5 minutes. After you binpack the jobs, loop through your task groups within your AWS Batch job.
• Jobs that must be run immediately – AWS Batch can process jobs quickly. However, AWS Batch is a scheduler and optimizes for cost performance, job priority, and throughput. AWS Batch might require time to process your requests. If you need a response in under a few seconds, then a service-based approach using Amazon ECS or Amazon EKS is more suitable.
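The binpacking workaround described above can be sketched as a job entrypoint that loops over a group of staged tasks. The bucket, file names, and per-task command below are hypothetical; a real job would pull its task list from Amazon S3 or DynamoDB first.

```shell
# Sketch: a Batch job entrypoint that iterates over a group of staged tasks.
# In a real job, you'd first pull the task list, for example:
#   aws s3 cp "s3://my-task-bucket/group-${AWS_BATCH_JOB_ARRAY_INDEX:-0}.txt" tasks.txt
# Here we create a stand-in list locally so the loop logic is visible.
printf 'task-a\ntask-b\ntask-c\n' > tasks.txt

while read -r task; do
  # Replace echo with the real per-task command.
  echo "running ${task}"
done < tasks.txt
```

Grouping tasks this way amortizes the scheduling overhead of one Batch job across many short units of work.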
Checklist to run at scale
Before you run a large workload on 50 thousand or more vCPUs, consider the following checklist.
Note
If you plan to run a large workload on a million or more vCPUs or need guidance running at large scale, contact your AWS team.
• Check your Amazon EC2 quotas – Check your Amazon EC2 quotas (also known as limits) in the Service Quotas panel of the AWS Management Console. If necessary, request a quota increase for your peak number of Amazon EC2 instances. Remember that Amazon EC2 Spot and Amazon EC2 On-Demand instances have separate quotas. For more information, see Getting started with Service Quotas.
• Verify your Amazon Elastic Block Store quota for each Region – Each instance uses a gp2 or gp3 volume for the operating system. By default, the quota for each AWS Region is 300 TiB. However, each instance's volume counts as part of this quota. So, make sure to factor this in when you verify your Amazon Elastic Block Store quota for each Region. If your quota is reached, you can't create more instances. For more information, see Amazon Elastic Block Store endpoints and quotas.
• Use Amazon S3 for storage – Amazon S3 provides high throughput and helps to eliminate the guesswork on how much storage to provision based on the number of jobs and instances in each Availability Zone. For more information, see Best practices design patterns: optimizing Amazon S3 performance.
• Scale gradually to identify bottlenecks early – For a job that runs on a million or more vCPUs, start lower and gradually increase so that you can identify bottlenecks early. For example, start by running on 50 thousand vCPUs. Then, increase the count to 200 thousand vCPUs, then 500 thousand vCPUs, and so on. In other words, continue to gradually increase the vCPU count until you reach the desired number of vCPUs.
• Monitor to identify potential issues early – To avoid potential breaks and issues when running at scale, make sure to monitor both your application and architecture. Breaks might occur even when scaling from 1 thousand to 5 thousand vCPUs. You can use Amazon CloudWatch Logs to review log data or use CloudWatch Embedded Metrics using a client library. For more information, see CloudWatch Logs agent reference and aws-embedded-metrics.
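For the Amazon EBS quota point above, a back-of-the-envelope check is useful. The 30 GiB root volume size below is an assumption for illustration; your AMI's volume size may differ.

```shell
# Back-of-the-envelope: how many instances fit under the default
# 300 TiB per-Region EBS quota, assuming a 30 GiB root volume each.
quota_gib=$(( 300 * 1024 ))        # 300 TiB expressed in GiB
volume_gib=30                      # assumed root volume size (adjust for your AMI)
max_instances=$(( quota_gib / volume_gib ))
echo "${max_instances}"            # prints 10240
```

If your planned fleet exceeds that count, request an EBS quota increase before launching.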
Optimize containers and AMIs
Container size and structure are important for the first set of jobs that you run. This is especially true if the container is larger than 4 GB. Container images are built in layers. The layers are retrieved in parallel by Docker using three concurrent threads. You can increase the number of concurrent threads using the max-concurrent-downloads parameter. For more information, see the dockerd documentation.
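For example, raising the download concurrency is a one-line change in the Docker daemon configuration (typically /etc/docker/daemon.json on Linux hosts; the value 8 is only an illustration):

```json
{
  "max-concurrent-downloads": 8
}
```

After editing the file, restart the Docker daemon for the setting to take effect.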
Although you can use larger containers, we recommend that you optimize container structure and size for faster startup times.
• Smaller containers are fetched faster – Smaller containers can lead to faster application start times. To decrease container size, offload libraries or files that are updated infrequently to the Amazon Machine Image (AMI). You can also use bind mounts to give your containers access to them. For more information, see Bind mounts.
• Create layers that are even in size and break up large layers – Each layer is retrieved by one thread, so a large layer might significantly impact your job startup time. We recommend a maximum layer size of 2 GB as a good tradeoff between container size and startup time. You can run the docker history your_image_id command to check your container image structure and layer sizes. For more information, see the Docker documentation.
• Use Amazon Elastic Container Registry as your container repository – When you run thousands of jobs in parallel, a self-managed repository can fail or throttle throughput. Amazon ECR works at scale and can handle workloads of a million vCPUs or more.
Choose the right compute environment resource
AWS Fargate requires less initial setup and configuration than Amazon EC2 and is likely easier to use, particularly if it's your first time. With Fargate, you don't need to manage servers, handle capacity planning, or isolate container workloads for security.
If you have the following requirements, we recommend that you use Fargate:
• Your jobs must start quickly, specifically in less than 30 seconds.
• Your jobs require 16 vCPUs or less, no GPUs, and 120 GiB of memory or less.
For more information, see When to use Fargate.
If you have the following requirements, we recommend that you use Amazon EC2 instances:
• You require increased control over instance selection or need to use specific instance types.
• Your jobs require resources that AWS Fargate can't provide, such as GPUs, more memory, a custom AMI, or the Amazon Elastic Fabric Adapter.
• You require a high level of throughput or concurrency.
• You need to customize your AMI or Amazon EC2 launch template, or need access to special Linux parameters.
With Amazon EC2, you can more finely tune your workload to your specific requirements and run at scale if needed.
Amazon EC2 On-Demand or Amazon EC2 Spot
Most AWS Batch customers use Amazon EC2 Spot instances because of the savings over On-Demand instances. However, if your workload runs for multiple hours and can't be interrupted, On-Demand instances might be more suitable for you. You can always try Spot instances first and switch to On-Demand if necessary.
If you have the following requirements and expectations, use Amazon EC2 On-Demand instances:
• The runtime of your jobs is more than an hour, and you can't tolerate interruptions to your workload.
• You have a strict SLO (service-level objective) for your overall workload and can't increase computational time.
• The instances that you require are more likely to see interruptions.
If you have the following requirements and expectations, use Amazon EC2 Spot instances:
• The runtime for your jobs is typically 30 minutes or less.
• You can tolerate potential interruptions and job rescheduling as a part of your workload. For more information, see Spot Instance advisor.
• Long-running jobs can be restarted from a checkpoint if interrupted.
You can mix both purchasing models by submitting to Spot instances first and using On-Demand instances as a fallback option. For example, submit your jobs to a queue that's connected to compute environments running on Amazon EC2 Spot instances. If a job gets interrupted, catch the event from Amazon EventBridge and correlate it to a Spot instance reclamation. Then, resubmit the job to an On-Demand queue using an AWS Lambda function or AWS Step Functions. For more information, see Tutorial: Sending Amazon Simple Notification Service alerts for failed job events, Best practices for handling Amazon EC2 Spot Instance interruptions, and Manage AWS Batch with Step Functions.
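The catch-and-resubmit flow above starts from an EventBridge rule. As a sketch, a minimal event pattern that matches failed Batch jobs might look like the following; your rule's target (a Lambda function or Step Functions state machine) would still need to correlate the failure with a Spot reclamation before resubmitting:

```json
{
  "source": ["aws.batch"],
  "detail-type": ["Batch Job State Change"],
  "detail": {
    "status": ["FAILED"]
  }
}
```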
Important
Use different instance types, sizes, and Availability Zones for your On-Demand compute environment to maintain Amazon EC2 Spot instance pool availability and decrease the interruption rate.
Use Amazon EC2 Spot best practices for AWS Batch
When you choose Amazon Elastic Compute Cloud (EC2) Spot instances, you can likely optimize your workflow to save costs, sometimes significantly. For more information, see Best practices for Amazon EC2 Spot.
To optimize your workflow to save costs, consider the following Amazon EC2 Spot best practices for AWS Batch:
• Choose the SPOT_CAPACITY_OPTIMIZED allocation strategy – AWS Batch chooses Amazon EC2 instances from the deepest Amazon EC2 Spot capacity pools. If you're concerned about interruptions, this is a suitable choice. For more information, see Instance type allocation strategies for AWS Batch.
• Diversify instance types – To diversify your instance types, consider compatible sizes and families, then let AWS Batch choose based on price or availability. For example, consider c5.24xlarge as an alternative to c5.12xlarge, or the c5a, c5n, c5d, m5, and m5d families. For more information, see Be flexible about instance types and Availability Zones.
• Reduce job runtime or checkpoint – We advise against running jobs that take an hour or more on Amazon EC2 Spot instances, to avoid interruptions. If you divide or checkpoint your jobs into smaller parts that run in 30 minutes or less, you can significantly reduce the possibility of interruptions.
• Use automated retries – To avoid disruptions to AWS Batch jobs, set automated retries for jobs. Batch jobs can be disrupted for any of the following reasons: a non-zero exit code is returned, a service error occurs, or an instance reclamation occurs. You can set up to 10 automated retries. To start, we recommend that you set at least 1-3 automated retries. For information about tracking Amazon EC2 Spot interruptions, see Spot Interruption Dashboard.
For AWS Batch, if you set the retry parameter, the job is placed at the front of the job queue. That is, the job is given priority. You can configure a retry strategy when you create the job definition or when you submit the job in the AWS CLI. For more information, see submit-job.

$ aws batch submit-job --job-name MyJob \
    --job-queue MyJQ \
    --job-definition MyJD \
    --retry-strategy attempts=2
• Use custom retries – You can configure a job retry strategy for a specific application exit code or instance reclamation. In the following example, if the host causes the failure, the job can be retried up to five times. However, if the job fails for a different reason, the job exits and the status is set to FAILED.

"retryStrategy": {
  "attempts": 5,
  "evaluateOnExit": [
    {
      "onStatusReason": "Host EC2*",
      "action": "RETRY"
    },
    {
      "onReason": "*",
      "action": "EXIT"
    }
  ]
}
• Use the Spot Interruption Dashboard – You can use the Spot Interruption Dashboard to track Spot interruptions. The dashboard provides metrics on Amazon EC2 Spot instances that are reclaimed and on which Availability Zones those instances are in. For more information, see Spot Interruption Dashboard.
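To apply the instance-diversification advice above, a managed compute environment's computeResources can list several instance families and let AWS Batch pick among them. The following is a fragment with illustrative values; subnets, security groups, and other required fields are omitted:

```json
{
  "type": "SPOT",
  "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
  "bidPercentage": 100,
  "minvCpus": 0,
  "maxvCpus": 256,
  "instanceTypes": ["c5", "c5a", "c5d", "c5n", "m5", "m5d"],
  "instanceRole": "ecsInstanceRole"
}
```

Specifying whole families (such as c5 rather than c5.12xlarge) gives AWS Batch more Spot capacity pools to draw from.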
Common errors and troubleshooting
Errors in AWS Batch often occur at the application level or are caused by instance configurations that don't meet your specific job requirements. Other issues include jobs getting stuck in the RUNNABLE status or compute environments getting stuck in an INVALID state. For more information about troubleshooting jobs stuck in the RUNNABLE status, see Jobs stuck in a RUNNABLE status. For information about troubleshooting compute environments in an INVALID state, see INVALID compute environment.
• Check Amazon EC2 Spot vCPU quotas – Verify that your current service quotas meet the job requirements. For example, suppose that your current service quota is 256 vCPUs and the job requires 10,000 vCPUs. Then, the service quota doesn't meet the job requirement. For more information and troubleshooting instructions, see Amazon EC2 service quotas and How do I increase the service quota of my Amazon EC2 resources?.
• Jobs fail before the application runs – Some jobs might fail because of a DockerTimeoutError error or a CannotPullContainerError error. For troubleshooting information, see How do I resolve the "DockerTimeoutError" error in AWS Batch?.
• Insufficient IP addresses – The number of IP addresses in your VPC and subnets can limit the number of instances that you can create. Use Classless Inter-Domain Routing (CIDR) blocks that provide more IP addresses than are required to run your workloads. If necessary, you can also build a dedicated VPC with a large address space. For example, you can create a VPC with multiple CIDRs in 10.x.0.0/16 and a subnet in every Availability Zone with a CIDR of 10.x.y.0/17, where x is between 1 and 4 and y is either 0 or 128. This configuration provides about 32,000 usable IP addresses in every subnet.
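The subnet sizing above can be checked with a quick calculation. AWS reserves 5 addresses in every subnet, and the prefix length here is the /17 from the example:

```shell
# Addresses in an IPv4 subnet of a given prefix length, minus the
# 5 addresses AWS reserves in every subnet.
prefix=17
total=$(( 1 << (32 - prefix) ))    # 2^(32-17) = 32768 addresses
usable=$(( total - 5 ))            # 32763 usable
echo "/${prefix}: ${total} total, ${usable} usable"
```

Running the same arithmetic for other prefix lengths helps you size subnets before you create the VPC.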