AWS¶
Dependencies¶
Setup¶
First, you’ll need some AWS credentials. Without these you can only access public S3 buckets. Once you have those, S3 interaction will work. For other services such as Redshift, the setup is a bit more involved.
Once you have some AWS credentials, you’ll need to put those in a config file. Boto has a nice doc page on how to set this up.
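As a rough illustration of what such a config file holds, the sketch below parses a boto-style config with a `[Credentials]` section using Python's standard `configparser`. The key values are placeholders, and `has_credentials` is a hypothetical helper written for this example, not part of boto or odo.

```python
import configparser

# Placeholder values only; substitute your own credentials.
SAMPLE_BOTO_CONFIG = """
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""


def has_credentials(config_text):
    """Return True if the text has both keys in a [Credentials] section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section('Credentials'):
        return False
    keys = ('aws_access_key_id', 'aws_secret_access_key')
    return all(parser.has_option('Credentials', key) for key in keys)


print(has_credentials(SAMPLE_BOTO_CONFIG))
```

Refer to the boto documentation for where the file is expected to live on your platform.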
Now that you have a boto config, we’re ready to interact with AWS.
Interface¶
odo provides access to the following AWS services:
- S3 via boto
- Redshift via a SQLAlchemy dialect
URIs¶
To access an S3 bucket, provide the path to the bucket prefixed with s3://
>>> from odo import resource
>>> csvfile = resource('s3://bucket/key.csv')
A Redshift database is addressed with the same kind of URI you would pass to SQLAlchemy:
>>> db = resource('redshift://user:pass@host:port/database')
To access an individual table, append :: followed by the table name:
>>> table = resource('redshift://user:pass@host:port/database::table')
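To make the `::` convention concrete, here is a minimal sketch of how such a URI could be separated into a database part and a table part. `split_table` is a hypothetical helper for illustration only; it is not odo's actual parsing code, and it assumes the first `::` in the string is the table separator.

```python
def split_table(uri):
    """Split 'dburi::table' into (dburi, table); table is None if absent."""
    db_uri, sep, table = uri.partition('::')
    return db_uri, (table if sep else None)


# A URI with a table suffix yields both parts.
print(split_table('redshift://user:pass@host:port/database::table'))

# A URI without '::' comes back unchanged, with no table.
print(split_table('redshift://user:pass@host:port/database'))
```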
Conversions¶
odo can take advantage of Redshift’s fast S3 COPY command, and it does so
transparently. For example, to upload a local CSV file called users.csv to a
Redshift table:
>>> table = odo('users.csv', 'redshift://user:pass@host:port/db::users')
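For readers unfamiliar with COPY, the sketch below builds the kind of SQL statement Redshift accepts for loading a CSV from S3. It is an illustration under assumptions, not the statement odo generates: `build_copy_statement` is a hypothetical helper, the bucket and key names are made up, and the plain string formatting is unsuitable for untrusted input.

```python
def build_copy_statement(table, s3_uri, access_key, secret_key, header=True):
    """Return a Redshift COPY statement that loads a CSV file from S3."""
    credentials = 'aws_access_key_id={};aws_secret_access_key={}'.format(
        access_key, secret_key)
    parts = [
        "COPY {}".format(table),
        "FROM '{}'".format(s3_uri),
        "CREDENTIALS '{}'".format(credentials),
        "CSV",
    ]
    if header:
        # Skip the first line of the file when it holds column names.
        parts.append("IGNOREHEADER 1")
    return ' '.join(parts) + ';'


# Hypothetical bucket and placeholder keys, for illustration only.
print(build_copy_statement('users', 's3://mybucket/users.csv',
                           'YOUR_ACCESS_KEY_ID', 'YOUR_SECRET_ACCESS_KEY'))
```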
Remember that these are just additional nodes in the odo network, and as
such, they are able to take advantage of conversions to types that don’t have
an explicit path defined for them. This allows us to do things like convert an
S3 CSV to a pandas DataFrame:
>>> import pandas as pd
>>> from odo import odo
>>> df = odo('s3://mybucket/myfile.csv', pd.DataFrame)
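The idea of routing through intermediate nodes can be sketched with a toy graph search. The network below is invented for illustration: its node names and edges are assumptions, and odo's real conversion network and the way it selects a path may differ.

```python
from collections import deque

# Toy conversion network; every edge here is hypothetical.
TOY_NETWORK = {
    's3 csv': ['local csv'],
    'local csv': ['DataFrame', 'redshift table'],
    'DataFrame': ['local csv'],
    'redshift table': [],
}


def find_path(network, source, target):
    """Breadth-first search for a shortest chain of conversions."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbor in network.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no chain of conversions connects the two nodes


print(find_path(TOY_NETWORK, 's3 csv', 'DataFrame'))
# prints ['s3 csv', 'local csv', 'DataFrame']
```

No edge links 's3 csv' directly to 'DataFrame' in this toy network, yet the search still finds a route through the intermediate 'local csv' node.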
TODO¶
- Multipart uploads for huge files
- GZIP’d files
- JSON to Redshift (JSONLines would be easy)
- boto get_bucket hangs on Windows