Bulk Load Data Files in S3 Bucket into Aurora RDS

We typically receive data feeds from our clients (usually about 5–20 GB of data). We download these data files to our lab environment and use shell scripts to load the data into Aurora RDS. We wanted to avoid unnecessary data transfers, so we decided to set up a data pipeline to automate the process and use S3 buckets for file uploads from the clients.

In theory, loading data from an S3 bucket into an Aurora instance is a simple process. In practice, setting it up is a convoluted, multi-step process; it is not as simple as it sounds. Welcome to the managed services world.

STEPS INVOLVED:

  • Create ROLE and attach S3 bucket policy
  • Create cluster parameter group
  • Modify custom parameter group to use the ROLE
  • Reboot Aurora instance

GRANT AURORA INSTANCE ACCESS TO S3 BUCKET

By default, Aurora cannot access S3 buckets. This is a sensible default that reduces the attack surface and improves security.

For EC2 instances, you can attach a role, and the instance can then access other AWS services on behalf of that role. The same method applies to Aurora RDS: you can associate a role that has the required permissions on the S3 bucket with the Aurora cluster.

There is plenty of documentation on how to create a role and attach policies; it is a widely adopted best practice in the AWS world. Based on the AWS documentation, AWS rotates the credentials for these roles automatically. From a security standpoint, this is much better than using hard-coded access keys.
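
As a rough sketch, here is how that role could be created with boto3. The role name is a placeholder, and the managed AmazonS3FullAccess policy matches the screenshots below (a bucket-scoped custom policy would be tighter):

import json
import boto3

iam = boto3.client("iam")

# Trust policy that allows the RDS service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "rds.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# "aurora-s3-access-role" is a placeholder name for illustration.
role = iam.create_role(
    RoleName="aurora-s3-access-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed S3 full-access policy (as in the screenshots below).
iam.attach_role_policy(
    RoleName="aurora-s3-access-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess",
)

# This ARN is what goes into the cluster parameter group later.
print(role["Role"]["Arn"])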

In the traditional datacenter world, you would typically run a few configuration commands to change configuration options (think of sp_configure in SQL Server).

In the AWS RDS world, it is trickier. By default, a default cluster parameter group is attached to your Aurora cluster. If you need to override any default configuration, you have to create your own DB cluster parameter group and modify your RDS instance to use the custom DB cluster parameter group you created. Only then can you edit the configuration values.
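
A minimal boto3 sketch of this step; the group name, parameter group family, and cluster identifier are assumptions for illustration (use the family that matches your Aurora engine version):

import boto3

rds = boto3.client("rds")

# Create a custom DB cluster parameter group.
rds.create_db_cluster_parameter_group(
    DBClusterParameterGroupName="aurora-s3-cluster-params",
    DBParameterGroupFamily="aurora5.6",
    Description="Custom cluster parameter group for S3 load/unload",
)

# Point the Aurora cluster at the custom parameter group.
rds.modify_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",
    DBClusterParameterGroupName="aurora-s3-cluster-params",
)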

The way you attach a ROLE to Aurora RDS is through the cluster parameter group.
These three configuration options control interaction with S3 buckets:

aws_default_s3_role
aurora_load_from_s3_role
aurora_select_into_s3_role

Get the ARN of your role and change the above configuration values from their default empty string to the role's ARN.
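
A rough boto3 sketch of setting those three values; the parameter group name and role ARN are placeholders carried over from the earlier sketches:

import boto3

rds = boto3.client("rds")
role_arn = "arn:aws:iam::123456789012:role/aurora-s3-access-role"  # placeholder ARN

# Set all three S3-related parameters to the role ARN; they apply at the next reboot.
rds.modify_db_cluster_parameter_group(
    DBClusterParameterGroupName="aurora-s3-cluster-params",
    Parameters=[
        {
            "ParameterName": name,
            "ParameterValue": role_arn,
            "ApplyMethod": "pending-reboot",
        }
        for name in (
            "aws_default_s3_role",
            "aurora_load_from_s3_role",
            "aurora_select_into_s3_role",
        )
    ],
)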

Then you need to modify your Aurora cluster and select the role to use. It should show up in the drop-down menu on the modify page.
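
The same association can be scripted; a sketch with boto3, assuming the placeholder cluster identifier and role ARN from above:

import boto3

rds = boto3.client("rds")

# Associate the IAM role with the Aurora cluster.
rds.add_role_to_db_cluster(
    DBClusterIdentifier="my-aurora-cluster",
    RoleArn="arn:aws:iam::123456789012:role/aurora-s3-access-role",
)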

GRANT AURORA LOGIN LOAD FILE PERMISSION

GRANT LOAD FROM S3 ON *.* TO 'user'@'domain-or-ip-address';
GRANT LOAD FROM S3 ON *.* TO 'aurora-load-svc'@'%';

REBOOT AURORA INSTANCE

Without a reboot, you will spend a lot of time troubleshooting. You need to reboot the Aurora instance for the new cluster parameter values to take effect.
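
For completeness, the reboot can also be scripted; the instance identifier below is a placeholder:

import boto3

rds = boto3.client("rds")

# Reboot the writer instance so the pending cluster parameter changes take effect.
rds.reboot_db_instance(DBInstanceIdentifier="my-aurora-instance-1")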

After this, you will be able to execute LOAD DATA FROM S3 against Aurora.

Screenshots:

Create ROLE and Attach Policy:

Create RDS Role
ROLE and S3 POLICY

Attach S3 Bucket Policy:

S3 FULL Access Policy

Create Parameter Group:

AURORA CLUSTER PARAMETER GROUP

Modify Custom Parameter Groups  

Modify Parameter Group

Modify AURORA RDS Instance to use ROLE

Modify AURORA Instance to use IAM ROLE

Troubleshooting:

Errors:

Error Code: 1871. S3 API returned error: Missing Credentials: Cannot instantiate S3 Client
This usually means the Aurora instance has no credentials to reach the S3 bucket. Make sure you have applied the role, set the cluster parameters, and rebooted the instance.
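
One quick sanity check (cluster identifier is a placeholder) is to confirm the role is listed as ACTIVE on the cluster:

import boto3

rds = boto3.client("rds")

# The role should appear in AssociatedRoles with Status == "ACTIVE".
cluster = rds.describe_db_clusters(DBClusterIdentifier="my-aurora-cluster")["DBClusters"][0]
for role in cluster.get("AssociatedRoles", []):
    print(role["RoleArn"], role["Status"])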

Sample BULK LOAD Command:

You can use the following sample scripts to test your setup.

LOAD DATA FROM S3 's3://yourbucket/allusers_pipe.txt'
INTO TABLE ETLStage.users
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
(@var1, @var2, @var3, @var4, @var5, @var6, @var7, @var8, @var9, @var10, @var11, @var12, @var13, @var14, @var15, @var16, @var17, @var18)
SET
userid = @var1,
username = @var2,
firstname = @var3,
lastname = @var4,
city=@var5,
state=@var6,
email=@var7,
phone=@var8,
likesports=@var9,
liketheatre=@var10,
likeconcerts=@var11,
likejazz=@var12,
likeclassical=@var13,
likeopera=@var14,
likerock=@var15,
likevegas=@var16,
likebroadway=@var17,
likemusicals=@var18;

Sample file in a public S3 bucket: s3://awssampledbuswest2/tickit/allusers_pipe.txt

You can also unload query results back to S3 with SELECT ... INTO OUTFILE S3 (this is what the aurora_select_into_s3_role parameter is for):

SELECT * FROM ETLStage.users INTO OUTFILE S3 's3-us-west-2://s3samplebucketname/outputestdata'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
MANIFEST ON
OVERWRITE ON;

DDL for a sample table matching the pipe-delimited file layout:

create table users_01(
userid integer not null primary key,
username char(8),
firstname varchar(30),
lastname varchar(30),
city varchar(30),
state char(2),
email varchar(100),
phone char(14),
likesports varchar(100),
liketheatre varchar(100),
likeconcerts varchar(100),
likejazz varchar(100),
likeclassical varchar(100),
likeopera varchar(100),
likerock varchar(100),
likevegas varchar(100),
likebroadway varchar(100),
likemusicals varchar(100));

References:

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Authorizing.AWSServices.S3CreatePolicy.html

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.LoadFromS3.html

https://stackoverflow.com/questions/40246937/amazon-aurora-1-8-load-data-from-s3-cannot-instantiate-s3-client

http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Aurora.Authorizing.AWSServices.html#Aurora.Authorizing.AWSServices.CreateRole

http://www.cshanes.com/loading-a-csv-file-from-s3-to-an-aurora-rds-instance/

http://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html

Comments:

  1. sk

    Hi Raju, could you clarify this error? – ERROR 1871 (HY000): S3 API returned error: No Such Key: The specified key does not exist. I am trying to load data from two different buckets (same region, us-east-1) into an Aurora MySQL table. The data load works well from a single bucket, i.e., I am able to load data from both buckets separately using a manifest file, but if I put both files' details in one manifest file, I get this error.

  2. MUHAMMED FALAKH TE

    Where do I execute these SQL queries? LOAD DATA FROM S3 's3://yourbucket/allusers_pipe.txt'
    INTO TABLE ETLStage.users
    FIELDS TERMINATED BY '|'
    LINES TERMINATED BY '\n'
    When I tried it in Workbench, it throws a syntax error.
    Is it possible to load data into RDS MySQL the same way?

  3. Kejvi Doko

    I am getting an error like the one below:

    Error Code: 1871. S3 API returned error: Unknown: Unable to parse ExceptionName: ExpiredToken Message: The provided token has expired.

    Do you have any idea why this is happening?
