BCP of large dataset to SQL DB on Azure Virtual Machine
Sunday, February 15, 2015
by jsalvo
During the past six months, I’ve been working on re-designing an app that uses BCP to move data from thousands of similar (but not identical) databases, each storing sales and other related data for an individual bicycle dealer, into a unified database on an Azure Virtual Machine (currently a D13 VM with 8 cores, 56 GB of RAM and a local SSD) that stores the data for all dealers. The unified database is roughly 500 GB in size and is reloaded on a weekly basis, so the movement and loading of the data needed to be optimized as much as possible.
Originally this process was implemented via PowerShell scripts using BCP, file compression (.zip) and FTP. The data was first BCP’d from tables in the dealer’s database into flat files. The dealer’s flat files were then compressed into .zip format and staged for FTP. Another set of PowerShell scripts running on the receiving server in Azure pulled the files down via FTP, unzipped them and executed BCP commands to load the data from the flat files into the database. The scripts processed one table at a time for a single dealer and, once all of that dealer’s tables completed, moved on to the next dealer in the queue. Overall, the total dataset was taking roughly 15-16 hours to load, which was unsustainable.
To optimize the process further, the PowerShell scripts were replaced with a C# application. The C# application is multi-threaded so that it loads multiple tables in parallel as well as multiple dealers at once; the number of dealers that can be loaded in parallel is controlled by a parameter. After some benchmark testing, the parameter was set to 10 parallel dealers, since that appeared to provide the best performance. The C# application uses the SqlBulkCopyService, which BCPs data directly from tables in the dealer’s database to the corresponding tables in the unified database in Azure. This eliminated the flat files and the need to compress and FTP them. The C# application provided a significant improvement: the data now loads in roughly 9 hours versus 15-16 hours with the PowerShell scripts, thanks in large part to the parallel loading of tables and dealers.
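To give a rough idea of how the parallel load fits together, here is a minimal C# sketch built on the .NET SqlBulkCopy class and Parallel.ForEach. The connection strings, table list and batch settings are placeholders rather than the production values, and the real application wraps this logic in its own service, so treat this as an illustration of the approach rather than the actual implementation.

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

class DealerLoader
{
    // Number of dealers loaded concurrently; 10 gave the best results in benchmarking.
    const int MaxParallelDealers = 10;

    static void LoadAllDealers(IEnumerable<string> dealerConnectionStrings,
                               string unifiedConnectionString,
                               IList<string> tableNames)
    {
        Parallel.ForEach(
            dealerConnectionStrings,
            new ParallelOptions { MaxDegreeOfParallelism = MaxParallelDealers },
            dealerConnStr =>
            {
                // Each dealer's tables are also loaded in parallel.
                Parallel.ForEach(tableNames, table =>
                    CopyTable(dealerConnStr, unifiedConnectionString, table));
            });
    }

    static void CopyTable(string sourceConnStr, string destConnStr, string tableName)
    {
        using (var source = new SqlConnection(sourceConnStr))
        using (var cmd = new SqlCommand("SELECT * FROM " + tableName, source))
        {
            source.Open();
            using (var reader = cmd.ExecuteReader())
            using (var bulkCopy = new SqlBulkCopy(destConnStr))
            {
                bulkCopy.DestinationTableName = tableName;
                bulkCopy.BulkCopyTimeout = 0;   // no timeout for the largest tables
                bulkCopy.BatchSize = 10000;     // commit rows in batches
                bulkCopy.WriteToServer(reader); // stream rows straight from source to destination
            }
        }
    }
}
```

The key point is that SqlBulkCopy can consume a SqlDataReader directly, so rows stream from the dealer database to the unified database without ever touching a flat file.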
In addition to re-working the code that moves the data, several optimizations were also made to the unified database and the virtual machine in Azure. During initial performance testing while BCP’ing the data, the unified database log file was a bottleneck. The log file was originally on a persistent VHD stored in Azure blob storage. To improve log performance, the log file is temporarily moved to the D: drive (the local, non-persistent SSD drive) for the duration of the weekly load; once the load completes, the log file is moved back to a persistent drive. The database data file also lives on a volume spread across 10 disks to improve IO performance and allow for future growth, and the index file is split across two disks for performance as well.
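For reference, relocating the log file is the standard sequence of ALTER DATABASE … MODIFY FILE, taking the database offline, moving the physical file and bringing the database back online. A simplified C# sketch of that step is below; the database name, logical file name and paths are placeholders, not the production values.

```csharp
using System.Data.SqlClient;
using System.IO;

class LogFileMover
{
    static void MoveLogToLocalSsd(string masterConnectionString)
    {
        const string oldPath = @"F:\SQLLogs\Unified_log.ldf"; // persistent VHD (placeholder path)
        const string newPath = @"D:\SQLLogs\Unified_log.ldf"; // local non-persistent SSD

        using (var conn = new SqlConnection(masterConnectionString))
        {
            conn.Open();

            // Point the logical log file at the local SSD path.
            Exec(conn, "ALTER DATABASE [Unified] MODIFY FILE " +
                       "(NAME = Unified_log, FILENAME = '" + newPath + "');");

            // Take the database offline so the physical file can be moved.
            Exec(conn, "ALTER DATABASE [Unified] SET OFFLINE WITH ROLLBACK IMMEDIATE;");

            Directory.CreateDirectory(@"D:\SQLLogs");
            File.Move(oldPath, newPath);

            // Bring the database back online using the log file on D:.
            Exec(conn, "ALTER DATABASE [Unified] SET ONLINE;");
        }
    }

    static void Exec(SqlConnection conn, string sql)
    {
        using (var cmd = new SqlCommand(sql, conn))
            cmd.ExecuteNonQuery();
    }
}
```

Once the weekly load finishes, the same steps run in reverse to move the log file back to a persistent drive.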
I’m sure this implementation will continue to evolve over time and I will write more blog posts as things progress.