Parallel MySQL Queries
Unlike Oracle, MySQL has no built-in support for parallel SQL query execution. A bummer for sure, in light of all the awesome multi-coreness we enjoy these days. There have been some attempts (1, 2) to get around this unfortunate limitation, both relying on running multiple queries in the background with little or no flow control.
The downside of this technique is the likelihood of overloading the server: contrary to the expected improvement in performance, you may crash your whole box. And even if you don’t, contention for system resources may slow down the query – not speed it up. The basic idea is sound, but flow control is essential to keep the system from sinking into the depths of context switching and system interrupts.
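If you do roll your own background-job loop, even a crude concurrency cap goes a long way. Here is a minimal bash-only sketch of the idea – the query is simulated with "sleep", and MAX_JOBS is an arbitrary illustration value, not a recommendation:

```shell
#!/bin/bash
# Crude flow control without GNU Parallel: cap the number of
# concurrent background jobs instead of firing one per keyword.
MAX_JOBS=4

run_one() {
    sleep 0.1                  # stand-in for the real mysql call
    echo "done: $1"
}

for kw in keyword1 keyword2 keyword_3 "keyword 04" keyword5; do
    # Block until we drop below the cap ("wait -n" needs bash 4.3+)
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        wait -n
    done
    run_one "$kw" &
done
wait
```

The point is simply that something must push back when all the slots are busy; GNU Parallel just does this far more gracefully.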
For some time now I’ve been playing with GNU Parallel and it is the perfect choice for the job. Without further ado, here’s a quick practical example. We have a text file with a list of search keywords – one per line. For each keyword we need to run a SELECT query and count the number of results. If the number of hits is one or more, then we need to display the keyword along with the count.
Here’s an example of the keyword file:
# cat search_string_one_per_line.txt
keyword1
keyword2
keyword_3
keyword 04
The common approach would be to loop through the file and execute a SELECT query for each keyword:
#!/bin/bash

db_user="dbuser"
db_pass="dbpasswd"
db_name="dbname"
tbl_name="tblname"

MYSQL="/usr/bin/mysql --batch --skip-column-names --max_allowed_packet=100M -u${db_user} -p${db_pass} ${db_name} -e"

# Read line by line, so keywords containing spaces (like "keyword 04")
# stay intact
while IFS= read -r search_string
do
    count=$($MYSQL "SELECT COUNT(*) FROM ${tbl_name} WHERE keyword LIKE '%${search_string}%'")
    if [ "${count}" -gt 0 ]
    then
        echo -e "${search_string}:\t${count}"
    fi
done < search_string_one_per_line.txt
This should work just fine, but it will probably take a while. Annoyingly, this approach will utilize just a single core of your fancy multi-core server. It will be like cutting your hair – one hair at a time.
Here’s pretty much the same process, but the keyword input is slightly massaged using “sed” and piped through “parallel”. After “sed” is done with it, the list of keywords will look something like this: “keyword1\nkeyword2\nkeyword_3\nkeyword 04”. Just the way “parallel” likes it.
The GNU Parallel utility will figure out how many cores your system has and will put them all to good use. You can run “htop” in the background and watch it light up like a Christmas tree.
#!/bin/bash

select_query() {
    search_string="$1"
    db_user="dbuser"
    db_pass="dbpasswd"
    db_name="dbname"
    tbl_name="tblname"

    MYSQL="/usr/bin/mysql --batch --skip-column-names --max_allowed_packet=100M -u${db_user} -p${db_pass} ${db_name} -e"

    count=$($MYSQL "SELECT COUNT(*) FROM ${tbl_name} WHERE keyword LIKE '%${search_string}%'")
    if [ "${count}" -gt 0 ]
    then
        echo -e "${search_string}:\t${count}"
    fi
}
# Export the function so the subshells spawned by parallel can see it
export -f select_query

query_run() {
    # Join the file into one line with literal \n separators;
    # echo -e then expands them back into newlines for parallel
    search_string=$(sed ':a;N;$!ba;s/\n/\\n/g' search_string_one_per_line.txt)
    echo -e "${search_string}" | parallel --gnu --will-cite select_query "{}"
}
query_run
Will this be faster? In most cases, yes – a whole lot faster. However, there are some gotchas to consider. First, just because your server has 128 cores doesn't mean the hard drive hosting your database will be able to keep up. You may run into an I/O wait scenario, and the end result will be disappointing. So make sure your datastore can keep up with the demand.
Second – memory. It may even be first. If your server has 128 cores, chances are it has plenty of RAM as well, but it's worth checking. If the server starts hitting disk swap, your SQL query will not set any speed records.
Finally, just like rubber ducks, MySQL is not actually made out of rubber: it won’t stretch forever. At the very least you may need to check my.cnf and, perhaps, up the “max_connections” and point “tmpdir” to a filesystem with plenty of space (it defaults to /tmp, which may not be sufficient). So, have fun, but not on the production server.
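For reference, those knobs live in the [mysqld] section of my.cnf; the values below are placeholders to illustrate the settings, not recommendations:

```ini
[mysqld]
# One connection per parallel worker, plus normal traffic
max_connections = 500
# Temp tables can get large under concurrent load; point this
# at a filesystem with room to spare
tmpdir = /var/tmp/mysql
```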