Using Placeholder Templates With Xargs In The Pipeline

Using placeholder templates with xargs gives you a lot more power than simply using xargs to append args onto the end of a command.

I previously blogged about how to use xargs to append arguments to another command in the pipeline. This post goes into a bit more detail and shows you a more powerful way of using xargs.

Basic Xargs Example

With a typical, simple xargs command, you might append one argument onto the end of an existing command in the pipeline like this:

echo "this is a basic xargs example" | xargs echo "you said:"

The above command results in the output:

you said: this is a basic xargs example
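
When more than one item arrives on the pipeline, plain xargs gathers them all and appends them to a single command line. The following is a minimal sketch with made-up input (the words one, two and three are only sample data):

```shell
# Three input lines become three arguments appended to one echo call.
appended=$(printf 'one\ntwo\nthree\n' | xargs echo "you said:")
printf '%s\n' "$appended"
# prints: you said: one two three
```
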

Placeholder Templates With Xargs Example

Now, here is a simple example of using a placeholder (or template) in your command and passing your argument into it.

echo "this is a basic xargs example" | xargs -I {} echo "you said: {}"

The output is exactly the same as with the basic example. However, you can now use the curly braces placeholder to move the argument placement to anywhere in your command.

You’re no longer constrained to it being on the end of the echo (or whichever command you’re using). You can also do multiple placements of the argument in your command.

For example:

echo "FOO" | xargs -I {} echo "you said: {}. Here is another usage of your sample argument: {}. And here is yet another: {}"
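
It may help to see what repeated placements produce with more than one input line. With -I, xargs runs the command once per input line and keeps each line as a single argument, spaces included. A small sketch, using sample text that is not from the original post:

```shell
# Each input line is substituted whole into every {} in the template.
lines=$(printf 'two words\nthree more words\n' | xargs -I {} echo "start [{}] middle [{}] end")
printf '%s\n' "$lines"
# prints:
# start [two words] middle [two words] end
# start [three more words] middle [three more words] end
```
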

A Slightly More Practical Example

Enough of the simple echo examples though. How about using this for a more practical, real world example?

In the following example, we want to list a bunch of AWS S3 buckets, and then output a summary of each bucket's total size in human-readable units (GiB for larger buckets). We extract the bucket name with cut from the initial listing returned by aws s3 ls.

aws s3 ls | cut -d' ' -f3 | xargs -n1 aws s3 ls --summarize --human-readable --recursive s3://

Using xargs to append the bucket name from the pipeline looks like it would work, as we only need it right at the end of the aws s3 ls command. There is an issue though: xargs inserts a space before each appended argument, and we want the bucket name joined to s3:// with no space in between.
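
The stray space is easy to see with a dry run. In this sketch the bucket names bucket-one and bucket-two are hypothetical stand-ins for the real listing, and echo is placed in front of the command so that xargs prints what it would run instead of calling AWS:

```shell
# Dry run: print the command lines xargs builds, without contacting AWS.
# bucket-one and bucket-two are made-up names used only for illustration.
dry_run=$(printf 'bucket-one\nbucket-two\n' | xargs -n1 echo aws s3 ls --summarize s3://)
printf '%s\n' "$dry_run"
# prints:
# aws s3 ls --summarize s3:// bucket-one
# aws s3 ls --summarize s3:// bucket-two
```
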

Using The Template or Placeholder

This is where using a placeholder or template with xargs can come in handy.

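aws s3 ls | cut -d' ' -f3 | xargs -I {} aws s3 ls --summarize --human-readable --recursive s3://{}

Note that -n1 is dropped here: -I already runs the command once per input line, and xargs treats -n and -I as mutually exclusive.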

It’s also worth noting that you can change your template or placeholder token with the -I parameter. It doesn’t have to be {} as in the examples above.
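
As a rough illustration of a custom token, the sketch below uses BUCKET as the placeholder. The token name and the bucket names are arbitrary choices for this example, and echo again makes it a dry run that only prints the commands:

```shell
# Any string can serve as the placeholder; BUCKET is an arbitrary choice here.
# bucket-one and bucket-two are made-up names; echo keeps this a dry run.
templated=$(printf 'bucket-one\nbucket-two\n' | xargs -I BUCKET echo aws s3 ls --summarize s3://BUCKET)
printf '%s\n' "$templated"
# prints:
# aws s3 ls --summarize s3://bucket-one
# aws s3 ls --summarize s3://bucket-two
```

A descriptive token like this can be easier to read than {} in a long command, but pick one that does not appear anywhere else in the command line.
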

In summary, your usage of xargs can be levelled up by using the -I parameter to leverage placeholder or template tokens.

Fast Batch S3 Bucket Object Deletion From The Shell

This is a quick post showing a nice and fast batch S3 bucket object deletion technique.

I recently had an S3 bucket that needed cleaning up. It had a few million objects in it. Counting the path prefixes created by forward slashes, this meant there were around 5 million or so keys to iterate.

The goal was to delete every object that did not have a .zip file extension. Effectively I wanted to leave only the .zip file objects behind (of which there were only a few thousand), but get rid of all the other millions of objects.

My first attempt was straightforward and naive: iterate every single key, check whether it is a .zip file, and delete it if not. However, every one of these iterations ended up being an HTTP request, and this turned out to be a very slow process. Definitely not fast batch S3 bucket object deletion…

I fired up about 20 shells all iterating over objects and deleting like this but it still would have taken days.

I then stumbled upon a really cool technique on serverfault that you can use in two stages.

  1. Iterate the bucket objects and stash all the keys in a file.
  2. Iterate the lines in the file in batches of 1000 and call delete-objects on each batch, effectively deleting the objects 1000 at a time (the maximum for a single delete request).

Between stage 1 and stage 2 I just had to clean up the large text file of object keys to remove any lines that were .zip objects. For this I used Sublime Text and a simple regex search and replace (replacing with an empty string to remove those lines).

So here is the process I used to delete everything in the bucket except the .zip objects. This took around 1-2 hours for the object key path collection and then the delete run.

Get all the object key paths

Note you will need to have Pipe Viewer (pv) installed first. Pipe Viewer is a great little utility that you can place into any normal pipeline between two processes. It gives you a progress indicator so you can monitor the pipeline from the shell.

aws s3api list-objects --output text --bucket the-bucket-name-here --query 'Contents[].[Key]' | pv -l > all-the-stuff.keys

 

Remove any object key paths you don’t want to delete

Open your all-the-stuff.keys file in Sublime or any other text editor with regex find and replace functionality.

The regex search for Sublime Text (note the escaped dot, so that only keys ending in a literal .zip are matched):

^.*\.zip\n

Find and replace all .zip object paths with the above regex string, replacing results with an empty string. Save the file when done. Make sure you use the correctly edited file for the following deletion phase!
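
If you would rather stay in the shell, the same filtering can probably be done with grep -v instead of an editor. This is a sketch under assumptions, not the process used above: the sample keys and the temporary file are made up for illustration.

```shell
# Build a small sample key file (hypothetical keys, for illustration only).
keys_file=$(mktemp)
printf '%s\n' 'logs/2019/app.log' 'backups/site.zip' 'images/logo.png' 'archive/old.zip' > "$keys_file"

# -v inverts the match, so only keys that do NOT end in .zip are kept.
filtered=$(grep -v '\.zip$' "$keys_file")
printf '%s\n' "$filtered"

rm -f "$keys_file"
```

Against the real file this would look something like grep -v '\.zip$' all-the-stuff.keys > keys-to-delete.keys, where keys-to-delete.keys is a hypothetical output name. Writing to a new file keeps the original listing intact in case the filter needs adjusting.
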

Iterate all the object keys in batches and call delete

tail -n+0 all-the-stuff.keys | pv -l | grep -v -e "'" | tr '\n' '\0' | xargs -0 -P1 -n1000 bash -c 'aws s3api delete-objects --bucket the-bucket-name-here --delete "Objects=[$(printf "{Key=%q}," "$@")],Quiet=false"' _

This one-liner effectively:

  • tails the large text file (mine was around 250MB) of object keys
  • passes this into pipe viewer for progress indication
  • filters out (grep -v) any key containing a single quote, so those keys are skipped rather than deleted
  • translates (tr) all newline characters (effectively every line ending) into a null character ‘\0’
  • chops these up into groups of 1000 and passes the 1000 key paths as arguments with xargs to the aws s3api delete-objects command. This command accepts an Objects array parameter, which is where the 1000 object key paths are fed in.
  • finally, quiet mode is disabled to show the result of the delete requests in the shell, but you can also set this to true to remove that output.

Effectively you end up calling aws s3api delete-objects, passing in 1000 objects to delete at a time.

This is how it can get through the work so quickly.
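
To get a feel for the batching without touching a bucket, the same xargs and printf pattern can be run as a dry run on a handful of keys. This sketch makes some assumptions: the key names are made up, the batch size is 2 instead of 1000 to keep the output short, and the aws call is replaced by a printf that only shows the --delete value each batch would receive.

```shell
# Dry run of the batching step: five made-up keys, grouped two at a time.
# No AWS call is made; each output line is the --delete value for one batch.
batches=$(printf '%s\n' a.txt b.txt c.txt d.txt e.txt \
  | tr '\n' '\0' \
  | xargs -0 -P1 -n2 bash -c 'printf "Objects=[%s],Quiet=false\n" "$(printf "{Key=%q}," "$@")"' _)
printf '%s\n' "$batches"
# prints:
# Objects=[{Key=a.txt},{Key=b.txt},],Quiet=false
# Objects=[{Key=c.txt},{Key=d.txt},],Quiet=false
# Objects=[{Key=e.txt},],Quiet=false
```

The trailing comma inside the brackets comes from the printf format repeating once per key, exactly as in the one-liner above. The final batch simply holds whatever keys are left over.
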

Nice!