Fast Batch S3 Bucket object deletion from the shell

This is a quick post showing a nice and fast batch S3 bucket object deletion technique.

I recently had an S3 bucket that needed cleaning up. It had a few million objects in it. Counting the forward-slash-separated path prefixes as well, this meant there were around 5 million keys to iterate.

The goal was to delete every object that did not have a .zip file extension. Effectively I wanted to leave only the .zip file objects behind (of which there were only a few thousand), but get rid of all the other millions of objects.

My first attempt was straightforward and naive: iterate every single key, check that it is not a .zip file, and delete it if not. However, every one of these iterations ended up being an HTTP request, and this turned out to be a very slow process. Definitely not fast batch S3 bucket object deletion…

I fired up about 20 shells, all iterating over objects and deleting like this, but it still would have taken days.

I then stumbled upon a really cool technique on Server Fault that works in two stages.

  1. Iterate the bucket objects and stash all the keys in a file.
  2. Iterate the lines in the file in batches of 1000 and call delete-objects on each batch, effectively deleting the objects 1000 at a time (the maximum for a single delete request).

In between stage 1 and stage 2 I just had to clean up the large text file of object keys to remove any lines that were .zip objects. For this I used Sublime Text and a simple regex search and replace (replacing with an empty string to remove those lines).

So here is the process I used to delete everything in the bucket except the .zip objects. This took around 1-2 hours for the object key path collection and then the delete run.

Get all the object key paths

Note that you will need Pipe Viewer (pv) installed first. Pipe Viewer is a great little utility that you can place into any normal pipeline between two processes. It gives you a progress indicator so you can monitor the pipeline in the shell.

aws s3api list-objects --output text --bucket the-bucket-name-here --query 'Contents[].[Key]' | pv -l > all-the-stuff.keys
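Before editing the key file, a quick sanity check of what it contains can be useful. The following is a minimal sketch, not part of the original technique: summarize_keys is a hypothetical helper name, and it assumes one key per line, as produced by the command above.

```shell
# Hypothetical helper: report how many keys are in the file, how many
# end in ".zip" (to keep), and how many would be left to delete.
summarize_keys() {
  keys_file="$1"
  # Total number of lines (one key per line is assumed).
  total=$(wc -l < "$keys_file")
  # Count lines ending in a literal ".zip"; grep -c prints 0 if none match.
  zips=$(grep -c '\.zip$' "$keys_file")
  echo "total=$((total)) zip=$zips to_delete=$((total - zips))"
}
```

Running summarize_keys all-the-stuff.keys prints the three counts on one line, which gives a rough idea of how long the delete run will be.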

 

Remove any object key paths you don’t want to delete

Open your all-the-stuff.keys file in Sublime or any other text editor with regex find and replace functionality.

The regex search for Sublime Text:

^.*\.zip\n

Find and replace all .zip object paths with the above regex string, replacing results with an empty string. Save the file when done. Make sure you use the correctly edited file for the following deletion phase!
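If you would rather stay in the shell, the same filtering can probably be done with grep instead of a text editor. This is a sketch of an alternative, not what I actually ran: filter_non_zip_keys and the output file name keys-to-delete.keys are made up for illustration.

```shell
# Hypothetical helper: copy every key that does NOT end in ".zip"
# from the input file into a new output file.
filter_non_zip_keys() {
  in_file="$1"
  out_file="$2"
  # -v inverts the match; '\.zip$' matches a literal ".zip" at end of line.
  # The match is case-sensitive, so keys ending in ".ZIP" are not excluded.
  grep -v '\.zip$' "$in_file" > "$out_file"
}
```

For example, filter_non_zip_keys all-the-stuff.keys keys-to-delete.keys leaves the original file untouched and writes the filtered list to a new file, which you would then feed into the deletion phase in place of all-the-stuff.keys.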

Iterate all the object keys in batches and call delete

tail -n+0 all-the-stuff.keys | pv -l | grep -v -e "'" | tr '\n' '\0' | xargs -0 -P1 -n1000 bash -c 'aws s3api delete-objects --bucket the-bucket-name-here --delete "Objects=[$(printf "{Key=%q}," "$@")],Quiet=false"' _

This one-liner effectively:

  • tails the large text file (mine was around 250MB) of object keys
  • passes this into pipe viewer for progress indication
  • filters out (grep -v) any keys containing a single quote, so those keys are skipped and will not be deleted by this run
  • translates (tr) every newline character (effectively every line ending) into a null character ‘\0’, so that xargs -0 can split on it
  • chops these up into groups of 1000 and passes each group of 1000 key paths as arguments, via xargs, to the aws s3api delete-objects command. This delete command takes an Objects array parameter, which is where the 1000 object key paths are fed in.
  • finally, quiet mode is disabled (Quiet=false) to show the result of the delete requests in the shell, but you can also set this to true to remove that output.
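To see how the batching works without touching a bucket, the aws call can be swapped for echo. This is a dry-run sketch under that assumption: preview_delete_batches is a hypothetical helper, and it only prints the --delete argument that each batch would produce.

```shell
# Dry run: group keys into batches and print the --delete argument that
# would be passed to "aws s3api delete-objects". Nothing is deleted.
preview_delete_batches() {
  keys_file="$1"
  batch_size="${2:-1000}"   # defaults to 1000, the batch size used above
  # Newlines become NUL bytes so that xargs -0 splits on line endings only.
  tr '\n' '\0' < "$keys_file" |
    xargs -0 -n"$batch_size" bash -c \
      'echo "Objects=[$(printf "{Key=%q}," "$@")],Quiet=false"' _
}
```

With a three-line file containing a, b and c, preview_delete_batches keys.txt 2 prints one line for the batch holding a and b and a second line for the batch holding c. The printf format is reused for every argument, which is how a whole batch of keys ends up in a single Objects list.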

Effectively you end up calling aws s3api delete-objects, passing in 1000 objects to delete at a time.

This is how it can get through the work so quickly.

Nice!

Issue using the Count method to count objects with PowerShell 2.0

 

I came across a small issue with a little helper script I wrote to count vSphere objects using PowerCLI this morning. It’s been a couple of weeks since I last did a blog post. Things have been very busy, so I have not been able to commit much time to blogging over the last few weeks. As such, I thought I would do a quick post about this small issue. It has more than likely been covered elsewhere, but it will be a good reference point to come back to if I ever forget!

 

So, to the issue I saw. Essentially, if a cmdlet returns only a single object, that object is returned as its own type rather than wrapped in an array (and if nothing is returned, you get $null). For example, where only one Distributed Virtual Switch exists in a vSphere environment, and we use the cmdlet Get-VirtualSwitch -Distributed, a single object is returned of BaseType “VMware.VimAutomation.ViCore.Impl.V1.VIObjectImpl”.

 

However, if we had more than one dvSwitch, then we would get a BaseType of “System.Array” returned.

 

With PowerShell version 2.0 we are able to use the Count property on an array, but we are not able to use it on a single object. The workaround I found (when using PowerShell 2.0) is to explicitly wrap the result in an array.

So in the case of our dvSwitch example above, originally we would have done:

$dvSwitchCount = (Get-VirtualSwitch -Distributed).Count

 

To wrap this in an array, and therefore get an accurate count of the objects whether there are no members, one member, or more, we would use:

$dvSwitchCount = @(Get-VirtualSwitch -Distributed).Count

 

Note the addition of the “@” sign in front of the parentheses. @( ) is the array subexpression operator, which always returns the result as an array.

Jonathan Medd also kindly pointed out that this is fixed with PowerShell 3.0 – have a read of the new features to see the addition that allows .Count to be used on any type of object over here.