
AWS AURORA RDS – Learnings from Loading Billion Rows to Aurora

Business Use Case:
We do online assessment and reporting for the K-12 education industry (http://emetric.net/). Our business needed to load K-3 through K-12 students' historical test reports for the entire state for the past 10+ years, which ended up being close to a billion rows of historical test report data.

Product Selection:
We have been using SQL Server for the majority of our database needs, and we wanted to see whether Aurora would satisfy our business requirements. The primary reasons are that SQL Server's licensing model is prohibitively expensive and that scaling out has been challenging.

The features we liked out of the box from Aurora are:
Elasticity – we can go up to an 8xlarge instance with 244 GB of RAM
Lower cost
Simple licensing model and pay-as-you-go pricing
We ended up using Aurora Reserved Instances, since our demand doesn't change and we have pretty predictable and stable traffic.

Data Processing:
Our source data is CSV files. We had several programs, and the data layout is different for each program. Initially we decided to create target tables in the Aurora instance and then use the mysqlimport tool to load the CSV files. This worked well, but as the number of files grew, we had to create multiple tables and load the data, which was getting unwieldy, so we decided to try a different approach.
The steps are :
Use Bash Scripts to Parse CSV and Fixed Length Data Files
Use Bash Scripts to Load into Staging Tables
Stored Procedures to Move Data from Staging Tables to Target Tables
Call the Stored Proc in multiple threads

LOAD DATA INFILE OPTION

LOAD DATA LOCAL INFILE 'G:\\TexasDI\\StudentPortal\\StudentPortalData_01(10).csv'
INTO TABLE StudentPortalDataBCP
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
;

LOAD DATA
LOCAL INFILE 'sample.csv'
INTO TABLE sampletable
FIELDS
TERMINATED BY ','
ENCLOSED BY '"'
ESCAPED BY '"'
LINES
TERMINATED BY '\n'
IGNORE 1 LINES
;

LOAD DATA LOCAL INFILE 'G:/FileFolder/filename.txt'
INTO TABLE `filename`
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
IGNORE 2 LINES ;

-- PASS NULL FOR EMPTY STRINGS
-- https://bugs.mysql.com/bug.php?id=23212

LOAD DATA LOCAL INFILE 'G:/TexasDI/DataFiles/LookupFiles/testminmax.csv' INTO TABLE LookupTables.LookupSTAAR38TestCodeMinMaxCut_NULLTest
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES
(`ADMIN_CODE`,
`TEST_CODE`,
`TEST_VERSION`,
`TEST_LANGUAGE`,
`STANDARD_CODE`,
`MIN_SCALE_SCORE`,
`MAX_SCALE_SCORE`,
`CUT_MIN`,
`CUT_LEVEL1`,
`CUT_LEVEL2`,
`CUT_LEVEL3`,
`MAX_POINTS`,
`CreatedBy`,
`CreatedDateTime`
)
SET CUT_MIN = nullif(`CUT_MIN`,'') ,
CUT_LEVEL1 = nullif(`CUT_LEVEL1`,'') ,
CUT_LEVEL2 = nullif(`CUT_LEVEL2`,'') ,
CUT_LEVEL3 = nullif(`CUT_LEVEL3`,'') ,
MAX_POINTS = nullif(`MAX_POINTS`,'') ,
CreatedBy = user() ,
CreatedDateTime = current_timestamp();

A nice thing about the LOAD DATA INFILE option is that we can create one staging table per program, since the schema is the same for all files within a program. This helped us trim our database down to very few tables.
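For reference, a staging table in this setup might look roughly like the following. The column names are illustrative (the post doesn't show the actual DDL), but the AUTO_INCREMENT ID, the Processed flag, and the audit/computed columns are the pieces the later scripts rely on.

CREATE TABLE ProgramStagingTable (
    ID              BIGINT       NOT NULL AUTO_INCREMENT,  -- used to slice the data for parallel processing
    col1            VARCHAR(50)  NULL,
    col2            VARCHAR(50)  NULL,
    col3            VARCHAR(255) NULL,
    ComputedColumn1 CHAR(40)     NULL,   -- SHA1 hash built in the SET clause of LOAD DATA
    FileName        VARCHAR(255) NULL,   -- source file name, also set in the SET clause
    CreatedBy       VARCHAR(100) NULL,
    CreatedDateTime DATETIME     NULL,
    Processed       TINYINT      NOT NULL DEFAULT 1,       -- flag read by the mover script later
    PRIMARY KEY (ID)
) ENGINE=InnoDB;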

PARALLEL LOADING:

Loading 100+ GB of data serially takes a while. We tried that with the largest available instance and it took about 14 hours.
We started evaluating other options for loading the data in parallel and wrote a simple bash script to load the files in parallel.

## HOSTNAME OF AURORA INSTANCE
HOST_NAME=awsauroracluster.cluster-afdbacdd.us-east-1.rds.amazonaws.com
DATABASE_NAME="ETLStage"
USER_NAME=username
PASSWORD="password"
INPUT_DIR="G:/root/DataFiles/"

TARGET_TABLE="TARGET_TABLE"
TODAY=`date +%Y_%m_%d_%H_%M_%S` # or whatever pattern you desire
LOG_FILE="G:/root/DataFiles/Logs/Loading_Data_Staging_Table_$TODAY.log"

echo "Started the Process of Loading Files into $TARGET_TABLE Tables DateTime `date +%Y-%m-%d%H:%M:%S`">>$LOG_FILE

for filenames in ${INPUT_DIR}/*.txt
do
filename=${filenames##*/};
echo $filename
echo "`date`">>$LOG_FILE
# SQL_STMT="LOAD DATA LOCAL INFILE '$filenames' INTO TABLE $TARGET_TABLE FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' IGNORE 2 LINES ; "

SQL_STMT="LOAD DATA LOCAL INFILE '$filenames' INTO TABLE $TARGET_TABLE FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' IGNORE 1 LINES
(col1,
col2,
col3,
col4,
col5
CreatedDateTime,
CreatedBy,
ComputedColumn1,
ComputedColumn2,
ComputedColumn3,
FileName)

SET col1 = NULLIF(col1,'') ,
col2 = NULLIF(col2,'') ,
col3 ,
col4 ,
col5 ,
CreatedDateTime = current_timestamp(),
CreatedBy= USER()
ComputedColumn1 = SHA(CONCAT(IFNULL(col1,''), IFNULL(col2 , ''))) ,
ComputedColumn2 = SHA(CONCAT(IFNULL(col3,''), IFNULL(col4 , ''))) ,
FileName = '$filename' ;
"

echo "$SQL_STMT"
echo "$SQL_STMT">>$LOG_FILE
mysql -vvv -h $HOST_NAME -u$USER_NAME -p$PASSWORD $DATABASE_NAME -e "$SQL_STMT" --reconnect 1>>$LOG_FILE 2>&1 &

if [ "$?" -eq 0 ]; then
echo "Successfully started the processing of file $filenames ; ">>$LOG_FILE
else
echo "Error occured while processing file $filenames ; ">>$LOG_FILE
fi

done

# Wait for all backgrounded LOAD statements to finish before logging completion.
wait

echo "Completed the Process of Loading Files into Staging Tables `date`">>$LOG_FILE

HANDLING NULLS:

While loading the data, if the CSV file doesn't contain a value for a particular field, Aurora loads an empty value instead of NULL. For INT data types it loads zero (0), which skews averages. Averages are an important metric for us, since they indicate the performance of a campus, a district, and the state. To overcome this problem, you can set the value while you insert the data using NULLIF. We found this out the hard way.
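The examples above set each column from itself. The form documented in the MySQL manual reads nullable fields into user variables first and converts empty strings in the SET clause. A minimal sketch, with an illustrative file path, table, and column names:

LOAD DATA LOCAL INFILE '/path/to/scores.csv'
INTO TABLE ScoresStaging
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES
(StudentId, TestCode, @scale_score, @raw_score)
SET ScaleScore = NULLIF(@scale_score, ''),
    RawScore   = NULLIF(@raw_score, '');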

Moving Data to Target Tables:

We wanted to move the data from the staging tables to the target tables. This is performed as a typical upsert (update if exists, else insert). Moving a billion rows sequentially is not even an option, so we wrote a stored procedure to move the data from the staging table to the target table based on a startID and an endID. We had an auto-increment ID column in the staging tables, which makes it easy to process slices of the data.
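The post does not include Move_To_ODS_STORED_PROC itself. Below is a minimal sketch of the pattern it describes, assuming hypothetical staging and target table names, a unique key on the target table for the upsert, and that the third argument is an inner batch size. The Processed-flag semantics are also an assumption.

DELIMITER $$
CREATE PROCEDURE Move_To_ODS_Sketch(IN p_start_id BIGINT, IN p_end_id BIGINT, IN p_batch_size INT)
BEGIN
    DECLARE v_from BIGINT DEFAULT p_start_id;
    WHILE v_from < p_end_id DO
        -- "Update if exists, else insert" via INSERT ... ON DUPLICATE KEY UPDATE;
        -- this relies on a unique key (the business key) on the target table.
        INSERT INTO TargetTable (BusinessKey, ScaleScore, CreatedBy, CreatedDateTime)
        SELECT BusinessKey, ScaleScore, USER(), CURRENT_TIMESTAMP()
        FROM StagingTable
        WHERE ID >= v_from AND ID < v_from + p_batch_size
          AND Processed = 1
        ON DUPLICATE KEY UPDATE
            ScaleScore = VALUES(ScaleScore);

        -- Flag the slice so it is not picked up again.
        UPDATE StagingTable
        SET Processed = 2
        WHERE ID >= v_from AND ID < v_from + p_batch_size
          AND Processed = 1;

        SET v_from = v_from + p_batch_size;
    END WHILE;
END$$
DELIMITER ;

The bash script that follows calls the procedure for many ID windows at once (15 background jobs before waiting), which is where the parallelism comes from.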

## This is database for processing latest Student Files for Parent Portal.
HOST_NAME=awsaurora.cluster.us-east-1.rds.amazonaws.com
DATABASE_NAME=ETLStage
USER_NAME=username
PASSWORD="password"
TARGET_TABLE="TARGET_TABLE"
TODAY=`date +%Y_%m_%d_%H_%M_%S` # or whatever pattern you desire
LOG_FILE="G:/ROOT/Logs/Move_To_ODS_STORED_PROC_$TODAY.log"
echo "Started the Move_To_ODS_STORED_PROC Process DateTime `date`">>$LOG_FILE

SQL_STMT="SELECT MIN(ID) FROM $TARGET_TABLE WHERE Processed = 1; "
MIN_ID=$(mysql -h $HOST_NAME -u$USER_NAME -p$PASSWORD $DATABASE_NAME -ss -N -e "$SQL_STMT") >>$LOG_FILE 2>&1
echo "$SQL_STMT">>$LOG_FILE

SQL_STMT="SELECT MAX(ID) FROM $TARGET_TABLE WHERE Processed = 1 ; "
MAX_ID=$(mysql -h $HOST_NAME -u$USER_NAME -p$PASSWORD $DATABASE_NAME -ss -N -e "$SQL_STMT") >>$LOG_FILE 2>&1
echo "$SQL_STMT">>$LOG_FILE

echo "MIN_ID Processing ID from $TARGET_TABLE " $MIN_ID >>$LOG_FILE
echo "MAX_ID Processing ID from $TARGET_TABLE " $MAX_ID >>$LOG_FILE

echo "Moving the Data from Staging Table to Target Tables ">>$LOG_FILE

job=0;
echo "Start `date`"
#for ((i=700000; i<109642390; i=i+100000))
for ((i=$MIN_ID; i<$MAX_ID; i=i+100000))
do
echo "Executing Move_To_ODS_STORED_PROC "
j=$(($i+250000))
k=25000
echo `date +%Y_%m_%d_%H_%M_%S` >>$LOG_FILE
echo "CALL Move_To_ODS_STORED_PROC($i,$j, $k)" `date` >>$LOG_FILE 2>&1
mysql -vvv -h $HOST_NAME -u$USER_NAME -p$PASSWORD $DATABASE_NAME -e "UNLOCK TABLES ; CALL Move_To_ODS_STORED_PROC($i,$j, $k)" >>$LOG_FILE 2>&1 &

echo "Query $i submitted..... "
job=$(($job+1))
if (( $job%15 == 0 )); then
echo "Inside $job waiting for rest of guys"
wait;
# sleep 1;
echo "slept 10"
fi

done
wait
echo " End `date`"

echo "Completed the Move_To_ODS_STORED_PROC Process DateTime `date +%Y-%m-%d%H:%M:%S`">>$LOG_FILE

Performance Improvement Configurations
We modified the configuration for innodb_autoinc_lock_mode:

innodb_autoinc_lock_mode = 2 (“interleaved” lock mode)

In this lock mode, no “INSERT-like” statements use the table-level AUTO-INC lock, and multiple statements can execute at the same time. This is the fastest and most scalable lock mode, but it is not safe when using statement-based replication or recovery scenarios when SQL statements are replayed from the binary log.

Parameter Group Configuration Overrides

Name | Value | Allowed Values | Is Modifiable | Source | Description
innodb_autoinc_lock_mode | 2 | 0-2 | TRUE | user | The locking mode to use for generating auto-increment values

Parameter | Our Custom Parameters | default.aurora5.6
slow_query_log | 1 | engine-default
tx_isolation | READ-COMMITTED | engine-default
general_log | 1 | engine-default
long_query_time | 5 | engine-default
group_concat_max_len | 20000000 | engine-default
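A quick way to confirm from a client session that the parameter group overrides actually took effect:

SHOW VARIABLES LIKE 'innodb_autoinc_lock_mode';
SHOW VARIABLES LIKE 'tx_isolation';
SHOW VARIABLES LIKE 'long_query_time';
SHOW VARIABLES LIKE 'group_concat_max_len';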


(7) Comments

  1. mike

    nice article. really helpful.

    After data loading, do you have a fast way to create index?

  2. Manuel Sarmiento

    Hi.
    You say your load time was 14 hours when done serially. Can you tell us how much did your load time drop in parallel loading.
    Thanks
    Manuel

  3. Raju Venkataraman

    Hi Manuel,
    We were able to bring down our load time to 2-3 hours after parallel load process. It did improve the load time significantly

  4. Yanet

    Hi Raju,
    Thank you for this write up.
    Could you tell us what was the reserved instance type you used for your aurora db? (e.g. db.r3.large, db.r3.xlarge, etc).
    Thanks

  5. Raju Venkataraman

    Hello Yanet,
    We have been using db.r3.4xlarge for past few years.

  6. Yanet

    Thank you Raju!
    Very helpful your post.
    Yanet

  7. Abhishek Anand

    Hi Raju, Good article. Could you please share your stored proc – “Move_To_ODS_STORED_PROC”
    Thanks.
