One line S3 cleaner

The Amazon S3 Object Expiration feature allows you to define rules that schedule the removal of your objects after a pre-defined time period. However, I have S3 data that I want to remove only after a data ingestion process has completed successfully.
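
For reference, the built-in approach looks something like the snippet below, a minimal sketch of a lifecycle rule that expires every object two days after creation (the bucket name and rule ID are placeholders). Its limitation is exactly my problem: it runs unconditionally.

aws s3api put-bucket-lifecycle-configuration \
  --bucket path-to-your-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-after-two-days",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Expiration": {"Days": 2}
    }]
  }'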

For example, my bucket has directories with a timestamp in the name. I want to remove everything that is older than two days, and only if my process has successfully imported the data.

A simple combination of bash and the AWS CLI does the job. You can test the removal with --dryrun (note the quotes around the --exclude patterns, which stop the shell from glob-expanding them before the AWS CLI sees them):

aws s3 rm --dryrun s3://path-to-your-bucket/ --recursive --exclude "$(date --date='1 day ago' +%Y-%m-%d)*" --exclude "$(date +%Y-%m-%d)*"

I use Jenkins to orchestrate my ETL jobs, so I simply added the shell command below to the pipeline as a conditional build step:

aws s3 rm s3://path-to-your-bucket/ --recursive --exclude "$(date --date='1 day ago' +%Y-%m-%d)*" --exclude "$(date +%Y-%m-%d)*"

Quick and easy.
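
If you are not using Jenkins, the same "only after a successful import" guard can live in the script itself. Here is a minimal sketch, assuming the ingestion job drops a hypothetical _SUCCESS marker into today's directory when it finishes cleanly (the marker name and bucket path are placeholders, not my actual setup):

#!/bin/bash
set -euo pipefail

bucket="s3://path-to-your-bucket"
today=$(date +%Y-%m-%d)

# Only clean up if today's import left its success marker behind.
if aws s3 ls "$bucket/$today/_SUCCESS" > /dev/null 2>&1; then
  aws s3 rm "$bucket/" --recursive \
    --exclude "$(date --date='1 day ago' +%Y-%m-%d)*" \
    --exclude "$(date +%Y-%m-%d)*"
else
  echo "No _SUCCESS marker for $today, skipping cleanup" >&2
fi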

Resizing photos in bulk

I recently had to create a simple batch script to resize a massive number of images quickly. The server the job was going to run on had 16 CPUs, so I wanted to ensure that all of the processing power would be utilised.

I started off with Python but soon realised there was a much easier way of doing it. The server had a minimal Linux install, but surprisingly ImageMagick was there. A quick look at the man pages and I came up with:

find . -name "*.[Jj][Pp][Gg]" -type f -print0 | xargs -0 -P16 -I'{}' convert -verbose -quality 100 -resize 10% {} {}
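
Because the input and output paths are both {}, this overwrites the originals in place. If you want to keep them, a small variation writes the resized copies into a separate directory instead (a sketch; the resized/ directory is my own naming, and -maxdepth 1 keeps the relative paths simple):

mkdir -p resized
find . -maxdepth 1 -name "*.[Jj][Pp][Gg]" -type f -print0 | xargs -0 -P16 -I'{}' convert -verbose -quality 100 -resize 10% {} resized/{}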

Some people install GNU Parallel for this, but xargs already has a -P option. The bonus is that find and xargs are core components of virtually every Linux distro out there.
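
And if the script needs to run on machines with different core counts, you can swap the hard-coded 16 for whatever the current box has, using nproc from coreutils:

find . -name "*.[Jj][Pp][Gg]" -type f -print0 | xargs -0 -P"$(nproc)" -I'{}' convert -verbose -quality 100 -resize 10% {} {}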
