Data Importer Optimizations


Nobody wants a slow application. Optimizing your application effectively takes experience and restraint. You don’t want to optimize prematurely, but you also don’t want to write something subpar that adds to your technical debt and puts your app in the grave early.

There is a balance to be struck. Knowing when and how to optimize needs to become second nature so that it doesn’t interfere with the development workflow. In many cases, producing something that can be demonstrated is more important than optimizing it.

This post covers a few optimization techniques associated with reading from files and running SQL queries. The tests are written in PHP, an interpreted language, though the techniques apply to other languages as well.

The Script

The program is basically a data importer. It takes item data from the World of Warcraft API and updates a MySQL database. Before this script runs, the Warcraft JSON data is saved to a single file, with records delimited by newline characters. This single file, which acts as a cache, lets me focus on optimizing the local operations without network overhead.

XDebug Setup

XDebug for PHP has a useful profiling feature. When profiling is enabled, XDebug stores the profiling data in the specified directory, and the resulting file can be read with KCacheGrind (Linux) or WinCacheGrind (Windows).

[xdebug]
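; Xdebug 2-style profiler settings (php.ini): profile every request and write
; cachegrind.out.* files to the output directory below.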
xdebug.profiler_enable=1
xdebug.profiler_output_dir=\Users\Justin\Misc Code\Profiling\php7

Optimizations and Tests

The complete dataset consists of 99,353 records. Empty records still take up a line of their own, so the text file contains roughly 127,000 lines.

Base Test

(Link)

I started off with a PHP script that was purposely designed to be slow. Each item was stored in its own cache file.

For each item (a sketch of this loop follows the list):

  • Open the item specific data file
  • Check if there is JSON data (move onto next line of JSON if none)
  • Parse JSON
  • Query database for existing record
  • Execute insert if no existing record
  • Execute update if record exists
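Roughly, the per-item loop looked like the following sketch. The database credentials, table name, and column names here are assumptions for illustration, not the actual schema.

<?php
// Simplified sketch of the slow base version; credentials, table, and columns are assumptions.
$pdo = new PDO('mysql:host=127.0.0.1;dbname=warcraft', 'user', 'pass');
$itemIds = range(1, 99353); // item IDs to import (assumption)

foreach ($itemIds as $id) {
    // One file handle per item.
    $json = file_get_contents("cache/items/{$id}.json");
    if (trim($json) === '') {
        continue; // empty record, move on to the next item
    }
    $item = json_decode($json, true);

    // One round trip to check for an existing record...
    $check = $pdo->prepare('SELECT id FROM items WHERE id = ?');
    $check->execute([$item['id']]);

    // ...and another round trip to insert or update it.
    if ($check->fetch() === false) {
        $pdo->prepare('INSERT INTO items (id, name, data) VALUES (?, ?, ?)')
            ->execute([$item['id'], $item['name'], $json]);
    } else {
        $pdo->prepare('UPDATE items SET name = ?, data = ? WHERE id = ?')
            ->execute([$item['name'], $json, $item['id']]);
    }
}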

This took 2 hours and 37 minutes, which is absolutely horrible. The good news is there are quite a few things that can be optimized here.

Here is one thing to keep in mind with regard to speed: the closer your data is stored to the processor, the faster it can be processed.

CPU Cache < Memory < Hard Drive < Local Daemon < Network/Internet.

MySQL Optimizations

(link)

Communicating with another application such as MySQL has overhead; it’s a bottleneck. Think about the steps involved in making that communication happen. A socket has to be opened, which is not instantaneous. Then your application sends the query to MySQL over a Unix domain socket or TCP/IP, MySQL does its own crunching, and the response comes back the same way. That’s a lot of work compared to reading from memory or the hard drive.

I could have made the first test even worse by opening a new MySQL connection per query.

Upsert
So how do we get around this bottleneck? Given that we must use MySQL, the answer is to reduce the amount of back and forth. MySQL supports an “upsert” syntax (INSERT ... ON DUPLICATE KEY UPDATE) that will either insert or update a record with a single statement.
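Continuing the earlier sketch (same assumed table and columns), the SELECT plus INSERT/UPDATE pair collapses into one statement:

<?php
// One statement replaces the SELECT + INSERT/UPDATE pair; table and column names are assumptions.
$sql = 'INSERT INTO items (id, name, data) VALUES (?, ?, ?)
        ON DUPLICATE KEY UPDATE name = VALUES(name), data = VALUES(data)';
$pdo->prepare($sql)->execute([$item['id'], $item['name'], $json]);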

Batch Queries
The PDO library allows queries to be batched so that multiple queries can be sent to the MySQL server at once. For inserts and updates, this is faster than sending one query at a time.

Keep in mind that the script blocks and waits for MySQL’s response to each query. The MySQL server is quite efficient at processing queries and storing data, so any extra overhead in local memory or server-side processing from batching is smaller than the overhead caused by blocking on individual queries.
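Here is a minimal sketch of how the batching might look with PDO, again with assumed table and column names and an arbitrary chunk size: build one multi-row upsert per chunk and send it in a single execute() call.

<?php
// Send many rows per round trip by building one multi-row upsert per chunk.
// Table/column names and the chunk size of 500 are assumptions.
foreach (array_chunk($items, 500) as $chunk) {
    $placeholders = rtrim(str_repeat('(?, ?, ?),', count($chunk)), ',');
    $sql = "INSERT INTO items (id, name, data) VALUES $placeholders
            ON DUPLICATE KEY UPDATE name = VALUES(name), data = VALUES(data)";

    $params = [];
    foreach ($chunk as $item) {
        $params[] = $item['id'];
        $params[] = $item['name'];
        $params[] = json_encode($item); // store the raw record alongside the columns
    }

    $pdo->prepare($sql)->execute($params); // one round trip per chunk instead of per row
}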

This test executed in 10 minutes (compared to 157 minutes). That’s a pretty significant improvement, but it’s still slow.

Single Cache File

(link)

While opening a file pointer is fast compared to downloading the file from the internet, it still has a real cost, and that cost adds up quickly across nearly 100,000 item files. So I moved all the data into a single file.
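Reading the combined cache then comes down to one fopen() and a loop over the lines (a sketch; the file path is an assumption):

<?php
// One file handle for the whole dataset instead of one per item.
$fh = fopen('cache/items.json', 'r'); // path is an assumption
$items = [];
while (($line = fgets($fh)) !== false) {
    $line = trim($line);
    if ($line === '') {
        continue; // an empty record is just a blank line in the cache
    }
    $items[] = json_decode($line, true);
}
fclose($fh);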

This test executed in 2 minutes and 24 seconds. Another pretty nice jump in speed.

Caveat
One downside to this approach is that it makes it more difficult to update individual records in the cache. While records are visually distinguishable in a text editor, in the raw data the newline is just another character, which means finding and updating a record becomes much more complex. That becomes a file structure problem, which is out of scope for the topic at hand.

Summary

These were a few basic examples of optimizations to make your script or program run faster. Resources such as disk and network I/O are expensive; it takes time to open connections and file pointers, during which a program blocks and processes nothing else.

With the optimizations covered here, I was able to bring my script down from 157 minutes to a little below 2.5 minutes.

I’m currently working on both C++ and Java versions to see if I can get this time down even further, and I will cover that in another post.