AWS

Setup

First, you’ll need AWS credentials; without them you can only access public S3 buckets. With credentials in place, S3 interaction works out of the box. For other services such as Redshift, the setup is a bit more involved.

Once you have your AWS credentials, put them in a config file that boto can find. Boto has a nice doc page on how to set this up.
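For reference, a minimal ~/.boto file looks like the following; the section and key names are boto’s standard ones, and the values are placeholders:

[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY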

Now that you have a boto config, we’re ready to interact with AWS.
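A quick way to sanity-check the setup is to open an S3 connection directly with boto, which reads credentials from the config automatically:

>>> import boto
>>> conn = boto.connect_s3()  # raises NoAuthHandlerFound if no credentials are configured

This step is optional; odo does the equivalent internally.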

Interface

odo provides access to the following AWS services:

  • S3, via boto
  • Redshift, via a SQLAlchemy dialect

URIs

To access data on S3, provide the bucket and key path prefixed with s3://:

>>> from odo import resource
>>> csvfile = resource('s3://bucket/key.csv')

Accessing a Redshift database works the same way as accessing it through SQLAlchemy:

>>> db = resource('redshift://user:pass@host:port/database')

To access an individual table, simply append :: plus the table name:

>>> table = resource('redshift://user:pass@host:port/database::table')
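As with any odo resource, you can then inspect the table with odo’s discover function:

>>> from odo import discover
>>> ds = discover(table)  # datashape describing the table's columns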

Conversions

odo can take advantage of Redshift’s fast COPY command for loading data from S3, and it does so transparently. For example, to load a local CSV file called users.csv into a Redshift table:

>>> table = odo('users.csv', 'redshift://user:pass@host:port/db::users')
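The same path works when the CSV already lives on S3 (the bucket and table names here are made up):

>>> table = odo('s3://mybucket/users.csv', 'redshift://user:pass@host:port/db::users')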

Remember that S3 and Redshift are just additional nodes in the odo network, so they can take advantage of conversions to types that don’t have an explicit path defined for them. This allows us to do things like convert a CSV on S3 to a pandas DataFrame:

>>> import pandas as pd
>>> from odo import odo
>>> df = odo('s3://mybucket/myfile.csv', pd.DataFrame)
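Going the other direction should work as well, for example uploading a local CSV to S3 (hypothetical bucket name; see the multipart-upload item in the TODO below for very large files):

>>> s3csv = odo('users.csv', 's3://mybucket/users.csv')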

TODO

  • Multipart uploads for huge files
  • GZIP’d files
  • JSON to Redshift (JSONLines would be easy)
  • boto get_bucket hangs on Windows