Local vs Remote MySQL Inserts

I wrote this really long blog post and then said to myself TL;DR; Here are some take aways

  • Ruby does not have great support for reading large Excel (100MB+) files
    • Roo, SpreadSheet, ParseExcel all try to load whole report into memory -> 5GB+
  • Python has excellent Excel parsing support
    • OpenPyXL supports row-by-row iteration
  • Insert to a local MySQL database then dump & load into your remote MySQL database
    • ((24000000 rows/250 remote inserts a second)/60 seconds)/60 minutes = 26.6 hours
    • vs ((24000000 rows/3000 local inserts a second)/60 seconds)/60 minutes = 2.2 hours

Ultimately, dividing and conquering a problem is a great strategy and can significantly reduce your time to process large data sets. However, be careful that the co-routine you’re using in your divisions of labor are as fast as possible. It is way faster to fill up a local MySQL DB and then transfer it to a Remote host.