This week I want to demonstrate how we can use AWS Batch to analyse the Exif data of several images. (Exchangeable image file format, officially Exif according to the JEIDA/JEITA/CIPA specifications, is a standard that specifies the formats for images, sound, and ancillary tags used by digital cameras (including smartphones), scanners and other systems handling image and sound files recorded by digital cameras. Source: Wikipedia.) In my demo, I will upload several images to an S3 bucket, which will trigger a Lambda function. This Lambda function will submit a batch job that downloads the images and analyses them in parallel. If they contain any Exif data, it will be recorded in DynamoDB.
Before proceeding with the demo, I want to highlight the features of the service. First of all, the service is in preview, so if you want to use it you should contact AWS.
AWS Batch is a service that helps us run batch computing workloads without installing and managing batch software, and it scales automatically depending on the workload.
To use AWS Batch, we need to understand its basics. First, we need a compute environment that will run our jobs. AWS Batch compute environments contain Amazon ECS container instances, and a compute environment can be mapped to one or more job queues. We can create either a “managed compute environment” or an “unmanaged compute environment”. As the names suggest, if we select the managed one, the resources are managed by AWS; if you prefer the unmanaged one, you have to take care of the resources yourself. You can select either on-demand or spot instances when you create your compute environment. Keep in mind that the ECS instances must be able to reach the Amazon ECS service endpoint, so you should configure your subnets accordingly before creating the environment.
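A managed, on-demand compute environment like the one described above could also be created through boto3 instead of the console. The following is only a minimal sketch: the role names, subnet ID and security group ID are placeholders I made up, and a spot environment would need additional settings that are not shown here.

```python
def build_compute_environment_request(name, service_role, instance_role,
                                      subnets, security_group_ids, max_vcpus=10):
    """Build the keyword arguments for batch.create_compute_environment."""
    return {
        "computeEnvironmentName": name,
        "type": "MANAGED",       # AWS manages the underlying instances
        "state": "ENABLED",
        "serviceRole": service_role,
        "computeResources": {
            "type": "EC2",       # on-demand; spot needs extra settings
            "minvCpus": 0,
            "maxvCpus": max_vcpus,
            "instanceTypes": ["optimal"],
            "subnets": list(subnets),
            "securityGroupIds": list(security_group_ids),
            "instanceRole": instance_role,
        },
    }


def create_compute_environment(request):
    """Send the request to AWS Batch (requires credentials and boto3)."""
    import boto3  # imported lazily so the builder works without AWS access
    return boto3.client("batch").create_compute_environment(**request)


# Placeholder values, not the ones used in the demo.
request = build_compute_environment_request(
    name="Awsome-environment",
    service_role="AWSBatchServiceRole",
    instance_role="ecsInstanceRole",
    subnets=["subnet-00000000"],
    security_group_ids=["sg-00000000"],
)
```

Keeping the request builder separate from the API call makes it easy to inspect the settings before anything is created.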
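After we create our compute environment, we need a job queue that will hold our jobs until they can be scheduled onto a compute environment. Job queues are prioritised with an integer value: when several queues share a compute environment, the queue with the higher value is evaluated first.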
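The mapping between a job queue and its compute environments can be sketched with boto3 as well. This is an illustration under assumptions, not the exact setup of the demo: the helper names are my own, and the order in which the environments are listed is simply used as the order in which they are tried.

```python
def build_job_queue_request(name, compute_environments, priority=1):
    """Build the keyword arguments for batch.create_job_queue."""
    return {
        "jobQueueName": name,
        "state": "ENABLED",
        "priority": priority,
        # Each environment gets an explicit position in the queue's list.
        "computeEnvironmentOrder": [
            {"order": position, "computeEnvironment": env}
            for position, env in enumerate(compute_environments, start=1)
        ],
    }


def create_job_queue(request):
    """Send the request to AWS Batch (requires credentials and boto3)."""
    import boto3  # imported lazily so the builder works without AWS access
    return boto3.client("batch").create_job_queue(**request)


queue_request = build_job_queue_request("Awsome-queue", ["Awsome-environment"])
```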
Now that we have both a job queue and a compute environment, our next step is creating a job definition. In the job definition we configure how our job will run: the Docker image we want to use, the CPU and memory settings for the container, the command that runs when the container starts, and so on. After creating the job definition, we submit a job that will run on our compute environment.
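To make the job definition settings concrete, here is a rough boto3 sketch. The image name and the job definition name are taken from this post; the role ARN, the command and the CPU/memory values are hypothetical and only there to show where each setting goes.

```python
def build_job_definition_request(name, image, job_role_arn, command,
                                 vcpus=1, memory_mib=128):
    """Build the keyword arguments for batch.register_job_definition."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,            # Docker image to run
            "vcpus": vcpus,            # CPU setting for the container
            "memory": memory_mib,      # memory in MiB
            "command": list(command),  # command executed when the container starts
            "jobRoleArn": job_role_arn,
        },
    }


def register_job_definition(request):
    """Send the request to AWS Batch (requires credentials and boto3)."""
    import boto3  # imported lazily so the builder works without AWS access
    return boto3.client("batch").register_job_definition(**request)


definition_request = build_job_definition_request(
    name="Awsome-batch",
    image="osalkk/aws-batch-example",
    job_role_arn="arn:aws:iam::123456789012:role/example-batch-job-role",  # placeholder
    command=["python", "analyse.py"],  # hypothetical command
)
```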
For the demo, let’s create the requirements step by step.
I create the compute environment by selecting the “managed” type. I name it “Awsome-environment” and select the roles for both AWS Batch and ECS. The environment should also be “enabled”.
I select the “on-demand” provisioning model and “optimal” for the allowed instance types, so AWS Batch will select the optimal instance family and type for me. I also set the maximum vCPUs to 10.
For the networking settings, I select my VPC and subnets.
Now my compute environment is being created.
Next, I create the job queue. Again I set a name for it and enable the queue. I also select 1 as the priority and map my “awsome-environment” compute environment to this queue.
I create a job definition. Here the role has the permissions for reading from S3 and writing to DynamoDB. The Docker image I prepared basically installs the boto3 and Pillow libraries and downloads the code from GitHub. When we run the code, it downloads the images from S3, analyses them, puts the Exif info into the database (if it is not already there) and deletes the images.
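The analysis step inside the container can be sketched roughly as follows. This is not the code from the repository, just an illustration of the idea: the function names and the `image` attribute are my own, and the numeric keys are the standard Exif tag IDs for the fields recorded in this demo.

```python
# Standard Exif tag IDs for the fields we want to record.
EXIF_FIELDS = {271: "Make", 272: "Model", 305: "Software", 34853: "GPSInfo"}


def exif_to_item(image_key, exif):
    """Turn a {tag_id: value} mapping into a flat record, or None if empty."""
    item = {name: str(exif[tag]) for tag, name in EXIF_FIELDS.items() if tag in exif}
    if not item:
        return None  # the image carries none of the fields we care about
    item["image"] = image_key
    return item


def read_exif(path):
    """Read the raw Exif mapping of a local image file with Pillow."""
    from PIL import Image  # imported lazily; only needed when reading real files
    with Image.open(path) as img:
        return dict(img.getexif())


sample = exif_to_item("holiday.jpg", {271: "Canon", 272: "EOS 70D", 999: "ignored"})
```

Depending on the Pillow version, the GPS entry may need extra decoding before it is useful, so treat the string conversion above as a simplification.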
You can find the code here (The code is a sample, so it can be improved).
(The Docker image below would be “osalkk/aws-batch-example”.)
To start the batch job, I use the Lambda code below, with my S3 bucket set as the trigger.
The Lambda code:
import boto3

client = boto3.client('batch')

def lambda_handler(event, context):
    # Submit one job to the queue each time the S3 trigger fires.
    response = client.submit_job(
        jobName='analyse-exif',
        jobQueue='Awsome-queue',
        jobDefinition='Awsome-batch:3',  # I use the 3rd revision here
    )
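As a possible variation (not what the demo does), the handler could pass the uploaded object's bucket and key to the job, so the container knows which image triggered it. The environment variable names `IMAGE_BUCKET` and `IMAGE_KEY` are hypothetical, and the container code would have to read them.

```python
from urllib.parse import unquote_plus


def build_submit_requests(event, job_queue="Awsome-queue",
                          job_definition="Awsome-batch:3"):
    """Build one submit_job request per S3 record in the Lambda event."""
    requests = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        requests.append({
            "jobName": "analyse-exif",
            "jobQueue": job_queue,
            "jobDefinition": job_definition,
            "containerOverrides": {
                "environment": [
                    {"name": "IMAGE_BUCKET", "value": bucket},  # hypothetical name
                    {"name": "IMAGE_KEY", "value": key},        # hypothetical name
                ]
            },
        })
    return requests


def lambda_handler(event, context):
    import boto3  # imported lazily so the builder can be tested offline
    client = boto3.client("batch")
    return [client.submit_job(**req)["jobId"] for req in build_submit_requests(event)]


example_event = {"Records": [
    {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "my+photo.jpg"}}}
]}
example_requests = build_submit_requests(example_event)
```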
It’s time to see AWS Batch in action…
I upload several images to my S3 bucket.
The Lambda function is triggered.
The jobs are submitted to AWS Batch.
The jobs are in the runnable status now.
And they succeeded.
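Instead of watching the console, the job status could also be polled with `describe_jobs`. Below is a small sketch assuming a boto3 Batch client is passed in; the function name, the polling interval and the retry limit are arbitrary choices of mine.

```python
import time

TERMINAL_STATES = {"SUCCEEDED", "FAILED"}


def wait_for_job(client, job_id, poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Poll a Batch job until it succeeds, fails, or the poll limit is reached."""
    status = None
    for _ in range(max_polls):
        jobs = client.describe_jobs(jobs=[job_id])["jobs"]
        status = jobs[0]["status"] if jobs else None
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)  # e.g. still SUBMITTED, RUNNABLE or RUNNING
    return status  # last status seen if the job never finished in time
```

With a real client this would be called as `wait_for_job(boto3.client('batch'), job_id)`, where `job_id` comes from the `submit_job` response.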
This is the ECS cluster that runs our jobs. Currently there are 2 EC2 instances and we can see the stopped jobs.
And the Exif data of the images is recorded in our table. As you can see, I record the brand name, model, software and also the GPSInfo of the images.
This was the basic usage of the AWS Batch service; the next step would be analysing the Exif data. If you want to learn more about the service, you can read about it on the AWS documentation page. I hope you find this post useful. If you have any questions or comments, please feel free to write, and please don’t forget to share this post.