Fun with xargs: parallelized FLAC to MP3 conversion

I’ve been wanting to use xargs for a practical application for some time. I’ve read many articles that mention it, but Ted Dziuba’s Taco Bell Programming was the tipping point.

Here’s the situation: I want to convert my FLAC collection to an MP3 collection so that I can upload my CD collection to Amazon. Further, I’d like to store the MP3s in a different folder, while maintaining the original subfolder structure. I also want to keep the metadata from my FLAC files; the flac command line utility only seems to support decoding to .wav, which drops the metadata.

Finally, I have 8 virtual cores, and I want to use them all to convert everything as quickly as possible.

Creating a mirrored directory structure, using xargs

First, let’s create the directories to store our MP3 files in:

find Music_CDs_FLAC/ -iname '*.flac' \
    | sed -re 's#^Music_CDs_FLAC/(.*)/[^/]*$#Music_CDs_MP3s/\1#' \
    | sort -u \
    | xargs -r -P1 -n1 -d'\n' mkdir -pv

This command collects the unique folder names under Music_CDs_FLAC and creates matching directories under Music_CDs_MP3s. I used -P1 here, which runs only one process at a time. At that setting xargs is not really necessary; a simple bash for loop would do the same job.
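As a rough illustration of the loop-based alternative, here is a minimal sketch. The function name mirror_dirs is made up for this example, and it assumes GNU or BSD find (for -print0) and bash. It reads NUL-delimited paths, so odd characters in file names are not a problem.

```shell
#!/bin/bash
# Sketch: mirror the directory tree of $1 under $2, but only for
# directories that contain .flac files.
# mirror_dirs is a hypothetical helper name, not part of the original post.
mirror_dirs() {
    local src="${1%/}" dst="${2%/}" file dir
    find "$src" -iname '*.flac' -print0 |
        while IFS= read -r -d '' file; do
            dir="$(dirname "$file")"          # e.g. src/Artist/Album
            mkdir -p "$dst${dir#"$src"}"      # e.g. dst/Artist/Album
        done
}

# Usage: mirror_dirs Music_CDs_FLAC Music_CDs_MP3s
```

Calling mkdir -p once per file is redundant but harmless, which is why this version needs no sort or uniq step.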

Creating a FLAC to MP3 script, using xargs

Next, let’s convert a single FLAC file to an MP3. This is a complex process since we can’t easily convert straight from FLAC to MP3. (Well, I’m sure there are programs out there that do it, but I don’t know which ones are good, and this makes a good example.)

Oh, but wait, there’s more: lame expects tags on the command line, and what do we do about tags containing quotes or apostrophes? Do we escape them? I’d rather not bother here; let’s allow xargs to do the work for us.
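To see why this works, here is a small sketch of how GNU xargs treats its input when given -d'\n': every line becomes exactly one argument, and quotes or apostrophes inside it are passed through untouched. The helper name show_args and the sample tag value are invented for the demonstration.

```shell
#!/bin/bash
# Sketch: with -d'\n', GNU xargs splits input on newlines only and does
# no quote processing, so each line arrives as a single argument.
# show_args is a hypothetical helper used only for this demonstration.
show_args() {
    xargs -r -d'\n' printf '[%s]\n'
}

# A made-up tag value containing both apostrophes and double quotes.
printf '%s\n' '--tt' "Rock 'n' Roll \"Live\"" | show_args
# prints:
# [--tt]
# [Rock 'n' Roll "Live"]
```

Without -d, xargs would try to interpret the quotes itself and complain about unmatched ones.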

Let’s make a script which takes an input and output file name, and which performs the needed conversion, using xargs for passing tags:

#!/bin/bash

# Author: Paul Goins
# License: Public Domain
# File: flac2mp3.sh
#
# Script to convert FLAC files to MP3, using xargs for passing tag
# arguments to lame.
#
# Why do we do this?  It's for the metatags which may contain
# characters requiring shell quotes.  This gets around having to do
# any escaping.
#
# Note: This script "works for me"; it may not keep all metadata from
# your FLAC files.  It does keep most of the data from my FLAC files
# which were generated via abcde.

set -o nounset
set -o errexit

infile="$1"
outfile="$2"
argfile="$(mktemp)"

# Extract metadata and push tags to argfile
metafile="$(mktemp)"
metaflac --export-tags-to "$metafile" "$infile"

# Append a lame tag option and its value to argfile, one per line, if
# the tag is present.  The match is anchored to the start of the line
# so that ARTIST= does not also pick up ALBUMARTIST=, only the first
# value of a repeated tag is used, and printf (rather than echo -e)
# keeps any backslashes in the value intact.
add_tag() {
    local flag="$1" name="$2" value
    value=$(sed -n -re "s/^${name}=(.*)\$/\1/p" "$metafile" | head -n1)
    if [ -n "$value" ]; then
        printf '%s\n%s\n' "$flag" "$value" >> "$argfile"
    fi
}

add_tag --ta ARTIST
add_tag --tl ALBUM
add_tag --tt TITLE
add_tag --ty DATE
add_tag --tg GENRE
add_tag --tn TRACKNUMBER

rm "$metafile"

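# Push the input and output as final args ("-" tells lame to read from stdin)
echo "-" >> "$argfile"
echo "$outfile" >> "$argfile"

# Perform the conversion.  Since xargs takes its arguments from
# --arg-file, it leaves stdin alone, so lame reads the decoded audio
# from the pipe.
flac --decode --stdout --silent "$infile" \
    | xargs --arg-file="$argfile" -r -d'\n' lame --quiet -b320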

rm "$argfile"

Okay, this bit of code does have some real benefit from using xargs. However, it’s not what really makes me excited.
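One thing the script above does not cover: because it runs with errexit, a failing step ends the script before the rm lines are reached, leaving temp files behind. A possible way around that is a trap on EXIT. The sketch below is an assumption about how one might do it, not part of the original script, and with_scratch is a made-up name.

```shell
#!/bin/bash
# Sketch: run a command with a scratch file that is removed when the
# work is done, whether the command succeeds or fails.
# with_scratch is a hypothetical helper name.
with_scratch() (
    set -o errexit
    scratch="$(mktemp)"
    trap 'rm -f "$scratch"' EXIT   # fires when this subshell exits
    "$@" "$scratch"                # scratch path is passed as the last argument
)

# Usage: with_scratch some_command arg1 arg2
```

The function body is a subshell, so the trap is scoped to one call and does not interfere with traps set elsewhere in a larger script.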

A sidenote: xargs and multiple commands

xargs runs a single command. For complex behavior involving pipes, it is possible to run something like the following (contrived) example:

# Contrived example!  Yay!!
cat /dev/null | xargs -n1 bash -c "echo foo | grep -e \"oo\""

For complex behavior like the above file conversion script, the benefits of writing a dedicated script are rather obvious. However, even when piping a couple of commands together, it may be worth writing a quick and dirty script and calling it via xargs, rather than messing with shell quoting. This is especially true if you’re dealing with lots of data which could contain characters requiring special quoting.
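If an inline pipeline is still preferable, one way to sidestep most of the quoting trouble is to hand each input line to bash as a positional parameter instead of pasting it into the command string. The sketch below assumes GNU xargs; grep_each and the sample lines are invented for illustration.

```shell
#!/bin/bash
# Sketch: run a small pipeline once per input line.  The line reaches
# bash as "$1" (the "_" fills $0), so its contents are never parsed as
# shell syntax and need no escaping.
# grep_each is a hypothetical helper name.
grep_each() {
    # "|| true" keeps a non-matching grep from making xargs report failure.
    xargs -r -n1 -d'\n' bash -c 'printf "%s\n" "$1" | grep -e "oo" || true' _
}

printf '%s\n' 'foo' 'bar' "it's a \"zoo\"" | grep_each
# prints:
# foo
# it's a "zoo"
```

The command string is in single quotes, so the outer shell leaves "$1" alone and the inner bash expands it.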

Using xargs for parallelization: fully utilizing 8 virtual cores

This brings us to the final part: to automatically convert all my FLAC files to MP3, and to do so in parallel. I run a Core i7, which means I have 4 physical cores with hyperthreading, coming out to 8 virtual cores. I’m going to run 12 parallel processes to perform the conversion, ensuring my processor stays as busy as possible.

Here’s the script:

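# "&" is the whole matched .flac path; the I flag (GNU sed) makes the
# match case-insensitive, in line with find's -iname.  Otherwise a
# file named x.FLAC would produce one line instead of two and throw
# off the -n2 pairing.
find Music_CDs_FLAC -iname '*.flac' \
    | sed -re 's#^Music_CDs_FLAC/(.*)\.flac$#&\nMusic_CDs_MP3s/\1.mp3#I' \
    | xargs -t -r -P12 -n2 -d'\n' ~/scripts/flac2mp3.sh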

There we go. The above will print out each command as it’s executed, and it will quickly take up nearly 100% of all my CPU cores. It may hit a bottleneck when writes need to be flushed to disk, but this should (hopefully) take less time overall than doing things one-by-one.

A quick description of what I did:

  • I found all .flac files and printed out one per line. (find command)
  • I took each .flac file and output two lines: the original .flac and the target .mp3. (sed command)
  • I launched concurrent conversion scripts, up to 12 at a time, taking two arguments per script call (input/output files), separating input args on newlines. The commands were echoed to the console as they were run. (xargs command)
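Before starting a long conversion run, it can help to check the pairing step on its own. The sketch below isolates the sed stage, written here with & and GNU sed's I flag so that upper-case extensions matched by -iname are paired too, and feeds it into a dry run that only prints each input/output pair. flac_mp3_pairs and the sample path are made up for the example.

```shell
#!/bin/bash
# Sketch: the sed stage in isolation.  Each FLAC path on stdin becomes
# two lines: the input path, then the mirrored MP3 path.
# flac_mp3_pairs is a hypothetical name; "\n" in the replacement and
# the I flag both rely on GNU sed.
flac_mp3_pairs() {
    sed -re 's#^Music_CDs_FLAC/(.*)\.flac$#&\nMusic_CDs_MP3s/\1.mp3#I'
}

# Dry run: show the argument pairs instead of converting anything.
printf '%s\n' 'Music_CDs_FLAC/Artist/Album/01 Intro.flac' \
    | flac_mp3_pairs \
    | xargs -r -n2 -d'\n' printf '%s -> %s\n'
# prints:
# Music_CDs_FLAC/Artist/Album/01 Intro.flac -> Music_CDs_MP3s/Artist/Album/01 Intro.mp3
```

The space in the sample file name stays inside a single argument because xargs splits on newlines only.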

Final note: after actually running the above snippet, converting 36 GiB of FLACs (1283 files) clocked in as follows:

real    41m56.118s
user    321m43.014s
sys     3m18.788s

Note: the real time was only 42 minutes, yet the user time was 5 hours and 22 minutes! A plain for loop working through the files one at a time would have taken far longer. This workload clearly benefited from parallelization via xargs.
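To put a number on it, the ratio of CPU time (user plus sys) to wall-clock time gives a rough idea of how many cores were busy on average. The sketch below works that out from the timings above; to_seconds and cpu_to_wall_ratio are hypothetical helper names. With hyperthreading, CPU time is not the same as the time a serial run would take, so this is an indication, not a measured speedup.

```shell
#!/bin/bash
# Sketch: CPU-to-wall-clock ratio from the output of "time".
# to_seconds and cpu_to_wall_ratio are hypothetical helper names.
to_seconds() {   # "41m56.118s" -> 2516.118
    awk -v t="$1" 'BEGIN { split(t, p, /[ms]/); printf "%.3f\n", p[1] * 60 + p[2] }'
}

cpu_to_wall_ratio() {   # args: real user sys
    awk -v r="$(to_seconds "$1")" \
        -v u="$(to_seconds "$2")" \
        -v s="$(to_seconds "$3")" \
        'BEGIN { printf "%.2f\n", (u + s) / r }'
}

cpu_to_wall_ratio 41m56.118s 321m43.014s 3m18.788s
# prints: 7.75
```

A ratio of about 7.75 on 8 virtual cores suggests the CPU was kept almost fully busy for the whole run.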
