AWS

Setup

First, you’ll need some AWS credentials; without these you can only access public S3 buckets. Once you have credentials, S3 interaction will work. For other services such as Redshift, the setup is a bit more involved.

Once you have your AWS credentials, you’ll need to put them in a config file. Boto has a nice doc page on how to set this up.
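
The boto config is just an INI-style file. As a rough sketch (the section and option names are boto’s; the values shown are placeholders, not real credentials), a ~/.boto file might look like:

[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY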

Now that you have a boto config, we’re ready to interact with AWS.

Interface

odo provides access to the following AWS services:

  • S3
  • Redshift

URIs

To access an S3 key, simply provide the path to the key with the s3:// prefix:

>>> csvfile = resource('s3://bucket/key.csv')

S3 commonly uses a prefix to limit an operation to a subset of keys. We can simulate a glob of keys by combining a prefix with the * character:

>>> csv_glob = resource('s3://bucket/prefix*.csv')

This will match all keys starting with prefix and ending with the .csv extension. The resulting csv_glob can be used just like a glob of files on your local disk.
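
The glob can be passed to odo like any other source. As an illustrative sketch (the bucket and prefix above are placeholders), loading every matching key into a single pandas DataFrame might look like:

>>> import pandas as pd
>>> from odo import odo
>>> df = odo(csv_glob, pd.DataFrame)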

Accessing a Redshift database is the same as accessing it through SQLAlchemy:

>>> db = resource('redshift://user:pass@host:port/database')

To access an individual table, simply append :: followed by the table name:

>>> table = resource('redshift://user:pass@host:port/database::table')
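
From here the usual odo machinery applies. As one illustrative (not authoritative) example, you could inspect the table’s datashape with odo’s discover function, assuming the table exists and your credentials allow access:

>>> from odo import discover
>>> ds = discover(table)  # datashape describing the table’s columns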

Conversions

odo can take advantage of Redshift’s fast S3 COPY command, and it does so transparently. For example, to upload a local CSV file called users.csv to a Redshift table:

>>> table = odo('users.csv', 'redshift://user:pass@host:port/db::users')
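
The same conversion should work when the CSV already lives in S3, letting Redshift COPY it directly. A hedged sketch (the bucket, key, and table names are placeholders):

>>> table = odo('s3://mybucket/users.csv',
...             'redshift://user:pass@host:port/db::users')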

Remember that these are just additional nodes in the odo network, and as such they can take advantage of conversions to types that don’t have an explicit path defined for them. This allows us to do things like convert a CSV file on S3 to a pandas DataFrame:

>>> import pandas as pd
>>> from odo import odo
>>> df = odo('s3://mybucket/myfile.csv', pd.DataFrame)

TODO

  • Multipart uploads for huge files
  • GZIP’d files
  • JSON to Redshift (JSONLines would be easy)
  • boto get_bucket hangs on Windows