AWS Aurora RDS – Learnings from Loading a Billion Rows into Aurora
Business Use Case:
We do online assessment and reporting for the K-12 education industry ( http://emetric.net/ ). Our business needed to load K-3 through K-12 students' historical test reports for the entire state for the past 10+ years, which ended up being close to a billion rows of historical test report data.
Product Selection:
We have been using SQL Server for the majority of our database needs, and we wanted to see whether Aurora would satisfy our business needs. The primary reason is that SQL Server's licensing model is prohibitively expensive and scaling out has been challenging.
The features we liked out of the box from Aurora are:
Elasticity – we can go up to 244 GB of RAM on an 8x instance.
Lower cost
Simple licensing model and pay-as-you-go pricing
We ended up using Aurora Reserved Instances, since our demand doesn't change and we have pretty predictable and stable traffic.
Data Processing:
Our source data is CSV files. We had several programs, and the data layout is different for each program. Initially we decided to create target tables in the Aurora instance and then use the mysqlimport tool to load the CSV files. This worked well, but as the number of files grew we had to create more and more tables and load the data into each of them. It was getting unwieldy, so we decided to try a different approach.
The steps are:
Use Bash Scripts to Parse CSV and Fixed Length Data Files
Use Bash Scripts to Load into Staging Tables
Stored Procedures to Move Data from Staging Tables to Target Tables
Call the Stored Proc in multiple threads
LOAD DATA INFILE OPTION
LOAD DATA LOCAL INFILE 'G:\\TexasDI\\StudentPortal\\StudentPortalData_01(10).csv'
INTO TABLE StudentPortalDataBCP
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

LOAD DATA LOCAL INFILE 'sample.csv'
INTO TABLE sampletable
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
ESCAPED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;

LOAD DATA LOCAL INFILE 'G:/FileFolder/filename.txt'
INTO TABLE `filename`
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
IGNORE 2 LINES;

-- PASS NULL FOR EMPTY STRINGS
-- https://bugs.mysql.com/bug.php?id=23212
LOAD DATA LOCAL INFILE 'G:/TexasDI/DataFiles/LookupFiles/testminmax.csv'
INTO TABLE LookupTables.LookupSTAAR38TestCodeMinMaxCut_NULLTest
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES
(`ADMIN_CODE`, `TEST_CODE`, `TEST_VERSION`, `TEST_LANGUAGE`, `STANDARD_CODE`,
 `MIN_SCALE_SCORE`, `MAX_SCALE_SCORE`, `CUT_MIN`, `CUT_LEVEL1`, `CUT_LEVEL2`,
 `CUT_LEVEL3`, `MAX_POINTS`, `CreatedBy`, `CreatedDateTime`)
SET CUT_MIN = NULLIF(`CUT_MIN`,''),
    CUT_LEVEL1 = NULLIF(`CUT_LEVEL1`,''),
    CUT_LEVEL2 = NULLIF(`CUT_LEVEL2`,''),
    CUT_LEVEL3 = NULLIF(`CUT_LEVEL3`,''),
    MAX_POINTS = NULLIF(`MAX_POINTS`,''),
    CreatedBy = USER(),
    CreatedDateTime = CURRENT_TIMESTAMP();
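One operational prerequisite worth noting (a general MySQL requirement rather than anything specific to our setup): the LOAD DATA LOCAL INFILE statements above only work if local infile is permitted on both the server and the client.

-- Check whether the server permits LOAD DATA LOCAL INFILE
SHOW VARIABLES LIKE 'local_infile';
-- The mysql client may also need it enabled explicitly, e.g.:
--   mysql --local-infile=1 -h <host> -u <user> -p <database>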
A nice thing about the LOAD DATA INFILE option is that we can create one staging table per program, since the schema is the same within a program. It helped us trim our database down to very few tables.
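For reference, a minimal sketch of what such a staging table could look like (this is not our actual DDL; the column names simply mirror the placeholders used in the loading script below, plus the auto-increment ID and Processed flag that the slicing script later in this post relies on):

-- Illustrative staging table sketch; table and column names are placeholders.
CREATE TABLE StudentProgramStaging (
    ID              BIGINT NOT NULL AUTO_INCREMENT,  -- used to slice the data into ID ranges
    col1            VARCHAR(100) NULL,
    col2            VARCHAR(100) NULL,
    col3            VARCHAR(100) NULL,
    col4            VARCHAR(100) NULL,
    col5            VARCHAR(100) NULL,
    ComputedColumn1 CHAR(40) NULL,                   -- SHA1 hash built during LOAD DATA
    ComputedColumn2 CHAR(40) NULL,
    ComputedColumn3 CHAR(40) NULL,
    FileName        VARCHAR(255) NULL,               -- source file name, set during LOAD DATA
    CreatedBy       VARCHAR(100) NULL,
    CreatedDateTime DATETIME NULL,
    Processed       TINYINT NOT NULL DEFAULT 1,      -- flag checked by the move-to-ODS script
    PRIMARY KEY (ID)
) ENGINE=InnoDB;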
PARALLEL LOADING:
Loading 100+ GB of data serially takes a while. We tried that with the largest possible instance and it took about 14 hours.
We started evaluating other options for loading the data in parallel and wrote a simple bash script to run the loads in parallel.
## HOSTNAME OF AURORA INSTANCE
HOST_NAME=awsauroracluster.cluster-afdbacdd.us-east-1.rds.amazonaws.com
DATABASE_NAME="ETLStage"
USER_NAME=username
PASSWORD="password"
INPUT_DIR="G:/root/DataFiles/"
TARGET_TABLE="TARGET_TABLE"
TODAY=`date +%Y_%m_%d_%H_%M_%S`   # or whatever pattern you desire
LOG_FILE="G:/root/DataFiles/Logs/Loading_Data_Staging_Table_$TODAY.log"

echo "Started the Process of Loading Files into $TARGET_TABLE Tables DateTime `date +%Y-%m-%d%H:%M:%S`">>$LOG_FILE

for filenames in ${INPUT_DIR}/*.txt
do
    filename=${filenames##*/}
    echo $filename
    echo "`date`">>$LOG_FILE

    # SQL_STMT="LOAD DATA LOCAL INFILE '$filenames' INTO TABLE $TARGET_TABLE FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' IGNORE 2 LINES ; "
    SQL_STMT="LOAD DATA LOCAL INFILE '$filenames' INTO TABLE $TARGET_TABLE
              FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' IGNORE 1 LINES
              (col1, col2, col3, col4, col5, CreatedDateTime, CreatedBy,
               ComputedColumn1, ComputedColumn2, ComputedColumn3, FileName)
              SET col1 = NULLIF(col1,''),
                  col2 = NULLIF(col2,''),
                  col3 = NULLIF(col3,''),
                  col4 = NULLIF(col4,''),
                  col5 = NULLIF(col5,''),
                  CreatedDateTime = CURRENT_TIMESTAMP(),
                  CreatedBy = USER(),
                  ComputedColumn1 = SHA(CONCAT(IFNULL(col1,''), IFNULL(col2,''))),
                  ComputedColumn2 = SHA(CONCAT(IFNULL(col3,''), IFNULL(col4,''))),
                  FileName = '$filename' ; "
    echo "$SQL_STMT"
    echo "$SQL_STMT">>$LOG_FILE

    # Each load runs in the background so multiple files load in parallel.
    mysql -vvv -h $HOST_NAME -u$USER_NAME -p$PASSWORD $DATABASE_NAME -e "$SQL_STMT" --reconnect 1>>$LOG_FILE 2>&1 &

    # Note: because the load is backgrounded, $? only reflects whether the
    # background job was launched, not whether the load itself succeeded.
    if [ "$?" -eq 0 ]; then
        echo "Successfully started the processing of file $filenames ; ">>$LOG_FILE
    else
        echo "Error occurred while processing file $filenames ; ">>$LOG_FILE
    fi
done

echo "Completed the Process of Loading Files into Staging Tables `date`">>$LOG_FILE
HANDLING NULLS:
While loading the data, if the CSV file doesn't contain a value for a particular field, Aurora loads an empty value instead of NULL. For INT data types it loads zero (0), which skews the data averages. Averages are an important metric for us, indicating the performance of a campus, district and state. To overcome this problem, you can set the value while you insert the data using a NULLIF expression. We found this out the hard way.
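A quick illustration of the skew (a made-up example with a hypothetical Score column, not our actual schema): MySQL's AVG() ignores NULLs but counts zeros, so an empty field loaded as 0 drags the average down.

-- Three real scores plus one missing value loaded as 0 vs. as NULL
SELECT AVG(Score)
  FROM (SELECT 80 AS Score UNION ALL SELECT 90
        UNION ALL SELECT 100 UNION ALL SELECT 0) AS t;     -- returns 67.5
SELECT AVG(Score)
  FROM (SELECT 80 AS Score UNION ALL SELECT 90
        UNION ALL SELECT 100 UNION ALL SELECT NULL) AS t;  -- returns 90.0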
Moving Data to Target Tables:
We wanted to move the data from the staging tables to the target tables. This is performed as a typical upsert (update if exists, else insert). Moving a billion rows sequentially is not even an option, so we wrote a stored procedure that moves the data from the staging table to the target table based on a StartID and EndID. We had an auto-increment ID column in the staging tables, which makes it easy to process slices of the data.
## This is database for processing latest Student Files for Parent Portal.
HOST_NAME=awsaurora.cluster.us-east-1.rds.amazonaws.com
DATABASE_NAME=ETLStage
USER_NAME=username
PASSWORD="password"
TARGET_TABLE="TARGET_TABLE"
TODAY=`date +%Y_%m_%d_%H_%M_%S`   # or whatever pattern you desire
LOG_FILE="G:/ROOT/Logs/Move_To_ODS_STORED_PROC_$TODAY.log"

echo "Started the Move_To_ODS_STORED_PROC Process DateTime `date`">>$LOG_FILE

SQL_STMT="SELECT MIN(ID) FROM $TARGET_TABLE WHERE Processed = 1 ; "
MIN_ID=$(mysql -h $HOST_NAME -u$USER_NAME -p$PASSWORD $DATABASE_NAME -ss -N -e "$SQL_STMT") >>$LOG_FILE 2>&1
echo "$SQL_STMT">>$LOG_FILE

SQL_STMT="SELECT MAX(ID) FROM $TARGET_TABLE WHERE Processed = 1 ; "
MAX_ID=$(mysql -h $HOST_NAME -u$USER_NAME -p$PASSWORD $DATABASE_NAME -ss -N -e "$SQL_STMT") >>$LOG_FILE 2>&1
echo "$SQL_STMT">>$LOG_FILE

echo "MIN_ID Processing ID from $TARGET_TABLE " $MIN_ID >>$LOG_FILE
echo "MAX_ID Processing ID from $TARGET_TABLE " $MAX_ID >>$LOG_FILE
echo "Moving the Data from Staging Table to Target Tables ">>$LOG_FILE

job=0
echo "Start `date`"

#for ((i=700000; i<109642390; i=i+100000))
for ((i=$MIN_ID; i<$MAX_ID; i=i+100000))
do
    echo "Executing Move_To_ODS_STORED_PROC "
    j=$(($i+250000))
    k=25000
    echo `date +%Y_%m_%d_%H_%M_%S` >>$LOG_FILE
    echo "CALL Move_To_ODS_STORED_PROC($i,$j, $k)" `date` >>$LOG_FILE 2>&1

    # Each slice runs in its own background session so slices are processed in parallel.
    mysql -vvv -h $HOST_NAME -u$USER_NAME -p$PASSWORD $DATABASE_NAME -e "UNLOCK TABLES ; CALL Move_To_ODS_STORED_PROC($i,$j, $k)" >>$LOG_FILE 2>&1 &

    echo "Query $i submitted..... "
    job=$(($job+1))

    # Throttle: after every 15 background calls, wait for them to finish.
    if (( $job%15 == 0 )); then
        echo "Inside $job waiting for rest of guys"
        wait
        # sleep 1;
        echo "slept 10"
    fi
done
wait

echo " End `date`"
echo "Completed the Move_To_ODS_STORED_PROC Process DateTime `date +%Y-%m-%d%H:%M:%S`">>$LOG_FILE
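The actual Move_To_ODS_STORED_PROC is not included in this post. Purely as an illustration of the pattern described above (walk the ID range from StartID to EndID in batches of BatchSize and upsert each batch into the target table), a minimal sketch could look like the following; the table and column names are hypothetical, and it assumes a unique key on the target table so that ON DUPLICATE KEY UPDATE acts as the upsert.

-- Illustrative sketch only, not the production procedure.
-- StudentProgramStaging, TargetStudentReport, BusinessKey, col1, col2 are placeholders.
DELIMITER $$
CREATE PROCEDURE Move_To_ODS_STORED_PROC(IN p_StartID BIGINT, IN p_EndID BIGINT, IN p_BatchSize INT)
BEGIN
    DECLARE v_From BIGINT DEFAULT p_StartID;

    WHILE v_From < p_EndID DO
        -- Upsert one batch: insert new rows, update existing ones on key collision.
        INSERT INTO TargetStudentReport (BusinessKey, col1, col2, UpdatedDateTime)
        SELECT ComputedColumn1, col1, col2, CURRENT_TIMESTAMP()
          FROM StudentProgramStaging
         WHERE ID >= v_From
           AND ID < LEAST(v_From + p_BatchSize, p_EndID)
        ON DUPLICATE KEY UPDATE
            col1 = VALUES(col1),
            col2 = VALUES(col2),
            UpdatedDateTime = CURRENT_TIMESTAMP();

        SET v_From = v_From + p_BatchSize;
    END WHILE;
END $$
DELIMITER ;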
Performance Improvement Configurations
We modified the configuration for innodb_autoinc_lock_mode:
innodb_autoinc_lock_mode = 2 (“interleaved” lock mode)
In this lock mode, no "INSERT-like" statements use the table-level AUTO-INC lock, and multiple statements can execute at the same time. This is the fastest and most scalable lock mode, but it is not safe when using statement-based replication, or in recovery scenarios where SQL statements are replayed from the binary log.
Parameter Group Configuration Overrides
Name | Value | Allowed Values | Is Modifiable | Source | Description |
---|---|---|---|---|---|
innodb_autoinc_lock_mode | 2 | 0-2 | TRUE | user | The locking mode to use for generating auto-increment values |
Parameter | OurCustomParameters | default.aurora5.6 |
---|---|---|
slow_query_log | 1 | engine-default |
tx_isolation | READ-COMMITTED | engine-default |
general_log | 1 | engine-default |
long_query_time | 5 | engine-default |
group_concat_max_len | 20000000 | engine-default |
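Once the parameter group is applied, the effective values can be confirmed from any session with standard MySQL commands (a quick sanity check, nothing Aurora-specific):

-- Confirm the parameter group overrides are in effect on the instance
SHOW VARIABLES
WHERE Variable_name IN ('innodb_autoinc_lock_mode', 'tx_isolation',
                        'slow_query_log', 'general_log',
                        'long_query_time', 'group_concat_max_len');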
Comments:
nice article. really helpful.
After data loading, do you have a fast way to create index?
Hi.
You say your load time was 14 hours when done serially. Can you tell us how much did your load time drop in parallel loading.
Thanks
Manuel
Hi Manuel,
We were able to bring our load time down to 2-3 hours with the parallel load process. It improved the load time significantly.
Hi Raju,
Thank you for this write up.
Could you tell us what was the reserved instance type you used for your aurora db? (e.g. db.r3.large, db.r3.xlarge, etc).
Thanks
Hello Yanet,
We have been using db.r3.4xlarge for past few years.
Thank you Raju!
Very helpful your post.
Yanet
Hi Raju, Good article. Could you please share your stored proc – “Move_To_ODS_STORED_PROC”
Thanks.