First, you’ll need some AWS credentials. Without these you can only access public S3 buckets. Once you have those, S3 interaction will work. For other services such as Redshift, the setup is a bit more involved.
Once you have your AWS credentials, you'll need to put them in a config file. Boto has a nice doc page on how to set this up.
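For reference, a minimal boto config file (typically at ~/.boto or /etc/boto.cfg) looks like the fragment below; the two key values are placeholders you'd replace with your own credentials:

```ini
; Minimal boto configuration -- values here are placeholders
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```

Boto reads this file automatically, so no extra setup is needed in your Python code once it is in place.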
Now that you have a boto config, we’re ready to interact with AWS.
odo provides access to the following AWS services:
- S3, via boto
- Redshift, via a SQLAlchemy dialect
To access an S3 bucket, simply provide the path to the S3 bucket prefixed with s3://:

>>> from odo import resource
>>> csvfile = resource('s3://bucket/key.csv')
Accessing a Redshift database is the same as accessing it through SQLAlchemy:
>>> db = resource('redshift://user:pass@host:port/database')
To access an individual table, simply append :: plus the table name:
>>> table = resource('redshift://user:pass@host:port/database::table')
odo can take advantage of Redshift's fast S3 COPY command, and it works transparently. For example, to upload a local CSV file called users.csv to a Redshift table:
>>> table = odo('users.csv', 'redshift://user:pass@host:port/db::users')
Remember that these are just additional nodes in the
odo network, and as
such, they are able to take advantage of conversions to types that don’t have
an explicit path defined for them. This allows us to do things like convert an
S3 CSV to a pandas DataFrame
>>> import pandas as pd
>>> from odo import odo
>>> df = odo('s3://mybucket/myfile.csv', pd.DataFrame)
Still on the TODO list:

- Multipart uploads for huge files
- GZIP'd files
- JSON to Redshift (JSONLines would be easy)
- get_bucket hangs on Windows