03 December 2013

The ruby s3sync tool (not the similar python option) has proven extremely valuable over the years. However, the emergence of an official AWS CLI and web-based management console, a couple of minor glitches with our installed version of s3sync, and the pause in s3sync development after the original developer stopped work all led to the decision to make a change. Unfortunately, the AWS CLI had problems downloading files that s3sync had uploaded because of how s3sync recorded directories.

Version note: I currently have s3sync 1.2.5 installed.

The challenge of directories on S3 is well documented. In fact, the s3sync documentation even acknowledges, "In S3 there's no actual concept of folders, just keys and nodes. So, every tool uses its own proprietary way of storing dir info (my scheme being the best naturally) and in general the methods are not compatible". As the s3fs community discusses, Amazon has effectively created a spec by implementation at this point. Unfortunately, that spec is different from the one s3sync uses.

Essentially, each S3 folder should have its metadata stored in an object named "dirname_$folder$". This is a zero-byte file that simply provides a place for any metadata to be attached. If no such file is created, then there is effectively no metadata stored for that directory, which should still be acceptable.
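
For illustration, a marker in that style can be created with the AWS CLI's put-object command and no body, which yields a zero-byte object. The bucket and directory names below are placeholders:

# Creates a zero-byte "photos_$folder$" object as the directory marker for "photos".
# Single quotes keep the shell from expanding $folder.
aws s3api put-object --bucket my-bucket --key 'photos_$folder$'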

On the other hand, s3sync uses a file named "dirname" that contains a UUID. The directory attributes are still stored as metadata, from what I can tell, but the naming convention has the unfortunate side effect of making downloads with a different program (like the AWS CLI) difficult, because there is now a file with the exact same path as a directory. A file and a directory cannot both exist at the same path on your filesystem. The result (at least with the AWS CLI) is that the metadata file is downloaded, and every subsequent download into that directory fails because the directory was not, and cannot be, created.
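
If you want to see one of these s3sync markers for yourself, you can inspect a suspect key directly; the 38-byte ContentLength (the UUID mentioned above) plus any attached metadata is the giveaway. The bucket and key below are placeholders:

# Shows ContentLength and Metadata for the object whose key matches a directory name.
aws s3api head-object --bucket BUCKETNAME --key backups/photos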

The key point is that the metadata items are effectively optional under both schemes. Unless you have a very specific use case in mind, simply deleting the metadata items from S3 is sufficient to allow the AWS CLI sync to work correctly from your bucket.
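
If you only have a handful of these markers, you do not even need a script; deleting the offending keys by hand and re-running the sync is enough. The bucket and key below are placeholders:

# Removes only the object whose key is exactly "backups/photos" (no --recursive),
# i.e., the s3sync marker, not the contents under backups/photos/.
aws s3 rm s3://BUCKETNAME/backups/photos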

The Solution: A Dirty BASH Script

Assumptions

  1. You have already installed the AWS CLI and run `aws configure`.
  2. You know your bucket name and have appropriate access credentials.
  3. You do not use periods in your directory names. If you do, you will need to modify the script or manually handle those cases.
  4. You do not store files without file extensions. If you do, you have reason to believe that those files will never be exactly 38 bytes (a quick sanity check for this follows the list).
  5. You are proficient enough in BASH to understand what the script is doing and modify it accordingly. Minimally, you will need to change it to work with your bucket name. If you are proficient, you will likely note inefficiencies -- it really is a quick script.
  6. You understand that this script (without modifications) is written to automatically delete items from S3, which cannot be undone.
  7. USE AT YOUR OWN RISK! THIS AUTOMATICALLY DELETES FILES FROM S3!!!

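Before running the destructive version below, a harmless way to check assumption 4 is to list every object in the bucket that is exactly 38 bytes and review the output. This is the same listing and filtering the script uses, minus the extension check and the delete:

aws s3api list-objects --bucket=BUCKETNAME --max-items=100000000 --output=text | \
  egrep "CONTENTS" | sed 's/\t/~/g' | egrep "Z~38~" | \
  cut --delimiter=~ --fields=3
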
The Script

  1. Line 1: List all objects in a bucket.
  2. Line 2: Keep only the rows whose object size is exactly 38 bytes.
  3. Line 3: Extract the object key and ignore any keys that have a dot after the last slash (i.e., skip anything that looks like a regular file).
  4. Line 4: Remove each of the remaining items from S3.
aws s3api list-objects --bucket=BUCKETNAME --max-items=100000000 --output=text | \
  egrep "CONTENTS" | sed 's/\t/~/g' | egrep "Z~38~" | \
  cut --delimiter=~ --fields=3 | egrep -v '\.[^/]*$' | \
  while read obj; do echo "$obj"; aws s3 rm "s3://BUCKETNAME/$obj"; done

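Once the metadata objects are gone, the original problem should go away with them. A sync into an empty local directory is a reasonable way to confirm; the local path below is a placeholder:

# Downloads the whole bucket; every directory should now be created without conflicts.
aws s3 sync s3://BUCKETNAME /tmp/restore-test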
