Nobody wants a slow application. Optimizing your application efficiently takes experience and restraint. You don’t want to optimize prematurely, but you also don’t want to write something subpar that adds to your technical debt and puts your app in the grave early.
There is a balance to strike. Knowing when and how to optimize needs to become second nature so that it doesn’t interfere with the development workflow. In many cases, producing something that can be demonstrated matters more than optimizing it.
This post will cover a few optimization techniques associated with reading from files and running SQL queries. These tests are written in PHP, an interpreted language, though the techniques apply to other languages as well.
The program is basically a data importer. It takes item data from the World of Warcraft API and updates a MySQL database. Before this script runs, the Warcraft JSON data is saved in a single file, one record per line. This single file, which acts as a cache, allows me to focus on optimizing the local operations without network overhead.
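Reading a newline-delimited JSON cache like this is straightforward. Here is a minimal sketch, assuming a filename of `items_cache.txt` (the filename and the skipping of blank lines are illustrative, not the actual importer):

```php
<?php
// Minimal sketch: stream the cache file line by line.
// The filename is an assumption for illustration.
$handle = fopen('items_cache.txt', 'r');
if ($handle === false) {
    die('Could not open cache file');
}

while (($line = fgets($handle)) !== false) {
    $line = trim($line);
    if ($line === '') {
        continue; // skip the empty records mentioned above
    }
    $item = json_decode($line, true);
    if ($item === null) {
        continue; // skip malformed JSON
    }
    // ... update the database with $item here ...
}

fclose($handle);
```

Streaming with `fgets()` keeps memory flat regardless of file size, which matters once the cache grows to six figures of records.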
XDebug for PHP has a useful profiling feature. When profiling is enabled, XDebug stores the profiling data in the directory you specify, and the resulting file can be read with KCacheGrind (Linux) or WinCacheGrind (Windows).
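With Xdebug 3, profiling can be enabled in `php.ini` along these lines (the output path is an example):

```ini
; Enable Xdebug's profiler (Xdebug 3 syntax)
xdebug.mode=profile
; cachegrind.out.* files are written here (example path)
xdebug.output_dir=/tmp/xdebug
; Profile every script; set to "trigger" to profile on demand instead
xdebug.start_with_request=yes
```

The generated `cachegrind.out.*` files are what KCacheGrind and WinCacheGrind open.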
The complete dataset consists of 99,353 records. There are also empty records, each of which still occupies a line, so the text file contains 127,000 lines.
I started off with a PHP script that was purposely designed to be slow. Each item was stored in its own cache file.
For each item:
This took 2 hours and 37 minutes. That is absolutely horrible. The good news is there are quite a few things that can be optimized here.
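To illustrate the kind of anti-pattern involved, here is a hypothetical sketch of a per-item loop that reads a separate cache file and opens a fresh database connection for every item. All names, credentials, and the table schema are assumptions, not the actual code:

```php
<?php
// Hypothetical sketch of the deliberately slow approach: one cache
// file read and one database connection per item. Names and schema
// are assumptions for illustration.
foreach ($itemIds as $id) {
    // Read this item's own cache file from disk.
    $json = file_get_contents("cache/item_{$id}.json");
    $item = json_decode($json, true);

    // Open a brand-new connection for every single item.
    $db = new PDO('mysql:host=localhost;dbname=wow', 'user', 'pass');

    // One statement and one round trip per item, with no batching.
    $stmt = $db->prepare(
        'REPLACE INTO items (id, name, data) VALUES (?, ?, ?)'
    );
    $stmt->execute([$id, $item['name'], $json]);

    $db = null; // tear the connection down again
}
```

Every iteration pays for a file open, a TCP/socket handshake to MySQL, and a statement parse, which is exactly the overhead the rest of this post works to eliminate.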
Here is one thing to keep in mind with regard to speed: the closer your data is stored to the processor, the faster it can be processed.
CPU Cache < Memory < Hard Drive < Local Daemon < Network/Internet (ordered from fastest to slowest access).