Craft your own audiobook

Make your own audiobook using synthetic speech. You can follow the steps below for a ‘manual’ process or use these Python scripts for a semi-automated version; the latter requires Python installed on your computer. The audiobook we’ll make is an m4b file, a container format just like mp4 that also supports chapter metadata and bookmarking. Bookmarking means that the listening position is saved and used as the starting point the next time the file is opened. It is a proprietary format (developed by Apple) but can be read by open software such as VLC.


  • one or more text files in markdown format (each chapter should have its own file)
  • a cover image (jpg or png)
  • pandoc
  • flite
  • sox
  • ffmpeg
  • ffprobe (installed with ffmpeg)
  • MP4Box
  • mp4chaps


The first thing we should do is to convert our markdown (md) files into plain txt. The text-to-speech (TTS) engine we’re using does not give a pretty output for a Heading written like this:


By using plain text we’ll have exactly what we need.

Note: in the semi-automated process, images will be translated into their corresponding Alt Text and footnotes will be placed inline (removed from the end of the file). In the manual process, references and footnotes are identified by their numbers and remain in their original positions.

We’ll use pandoc to make this conversion. If you don’t have pandoc installed, have a look at its installation page. With pandoc installed, open your terminal, navigate to the directory that contains the md file(s) and type:

pandoc -s -t plain -o chapter01.txt chapter01.md

Note: in this recipe we’ll refer to one file for the sake of simplicity, but you should run every command for every chapter file that you want to add to your audiobook.
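If you have many chapters, a small shell loop can run the conversion for each file. The following is only a sketch: the function names md_to_txt_name and convert_chapters are hypothetical, and it assumes your files are named chapter01.md, chapter02.md and so on.

```shell
# Derive the output name: chapter01.md -> chapter01.txt
md_to_txt_name() {
  printf '%s\n' "${1%.md}.txt"
}

# Convert every chapter*.md file in the current directory to plain text.
# Assumes the chapter*.md naming scheme; adjust the pattern to your files.
convert_chapters() {
  for md in chapter*.md; do
    [ -e "$md" ] || continue   # skip if the pattern matched nothing
    pandoc -s -t plain -o "$(md_to_txt_name "$md")" "$md"
  done
}
```

Paste this into your terminal and then call convert_chapters in the directory that holds the md files. The same loop pattern can be reused for the later per-chapter commands.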

Next, we’ll convert our text to speech. For that, we’ll use flite, a lightweight version of the open source TTS engine Festival. There are many voices available; we chose rms. You can follow flite’s installation instructions to install it. On macOS we had to edit the Makefile to be able to install it, so keep that in mind if you experience the same. Using the terminal, type:

flite -f chapter01.txt -voice rms -o chapter01.wav

Ideally, there is some interval of silence between chapters. Our TTS engine does not add one when converting from text to audio, so we need to add it ourselves. We’ll use sox, a very handy command line tool for manipulating audio. You can install it with brew or compile it. If you don’t want to install sox just for that, you can create 1 second of silence with any other audio software and then use ffmpeg to concatenate the pieces (silence + audio file + silence). But we’ll go with sox. Go back to the Terminal and type:

sox chapter01.wav chapter01-pad.wav pad 1 1

The line above adds 1 second of silence at the beginning and end of the audio file and saves it as a new file (chapter01-pad.wav). We’re now ready to convert the wav file into mp3 (mp3 files are faster to manipulate than wav). For that, we’ll use ffmpeg. Installation instructions and files can be found on the ffmpeg website. Once ffmpeg is installed, go back to your Terminal and type:

ffmpeg -i chapter01-pad.wav -q:a 0 chapter01.mp3

The parameter -q:a 0 selects the highest variable bitrate quality for mp3, which keeps the bitrate between 220 and 260 kbit/s. We can now run ffprobe (installed when you install ffmpeg) to retrieve the duration of each file. In the Terminal, type:

ffprobe -of default=noprint_wrappers=1:nokey=1 -show_entries format=duration -sexagesimal chapter01.mp3

This command will output the duration of the audio file on your Terminal. You can copy the output and save it in a txt file. You will get something like this:

0:12:36.144000

This formatting is achieved by using the parameter -sexagesimal. By flagging noprint_wrappers and nokey we get just the duration value. Besides all the chapters’ durations, we also need their starting times: each chapter starts at the sum of the durations of the chapters before it. You can use LibreOffice or MS Excel to help you with that.

Before we go on with the audio files, let’s make our chapters file. It’s simply a text file that lists each chapter’s title and starting time as a timecode. It looks like this:

CHAPTER1=00:00:00.000
CHAPTER1NAME=Chapter 001 Title Goes Here
CHAPTER2=00:12:36.144
CHAPTER2NAME=Chapter 002 Title Goes Here
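If you would rather skip the spreadsheet, the starting times can also be summed in the shell. The sketch below is one possible approach, not part of the original recipe: make_chapters is a hypothetical helper that reads one line per chapter, with the duration as printed by ffprobe -sexagesimal, a tab, and the chapter title. It prints each chapter’s starting time on a CHAPTERn line followed by its title on a CHAPTERnNAME line, which is the layout assumed here for the chapters file.

```shell
# Read "duration<TAB>title" lines (duration as H:MM:SS.ffffff) from stdin
# and print a chapters file with cumulative starting times.
make_chapters() {
  awk -F'\t' '
    BEGIN { total_ms = 0 }
    {
      # This chapter starts where all previous chapters end.
      ms = total_ms
      h = int(ms / 3600000); ms -= h * 3600000
      m = int(ms / 60000);   ms -= m * 60000
      s = int(ms / 1000);    ms -= s * 1000
      printf "CHAPTER%d=%02d:%02d:%02d.%03d\n", NR, h, m, s, ms
      printf "CHAPTER%dNAME=%s\n", NR, $2

      # Add this chapter duration, rounded to whole milliseconds.
      split($1, t, ":")
      total_ms += int((t[1] * 3600 + t[2] * 60 + t[3]) * 1000 + 0.5)
    }'
}
```

For example, if you saved the durations and titles in a tab-separated file called durations.tsv (a hypothetical name), running make_chapters < durations.tsv > MyBookTitle.chapters would write the chapters file. Check the first few timecodes by hand before relying on it.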

Fill it in, using the starting times you got from your spreadsheet and the chapter titles of your book. Save it using the extension .chapters (for example ‘MyBookTitle.chapters’). Now, let’s go back to our audio files: concatenate the existing mp3 files into one single file.

ffmpeg -i "concat:chapter01.mp3|chapter02.mp3|chapter03.mp3" -q:a 0 MyAudiobook.mp3
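With many chapters, typing the concat list by hand gets error-prone. As a sketch, the hypothetical helper concat_arg below joins any number of file names into the concat: string that ffmpeg expects. It assumes the file names contain no | character.

```shell
# Join the given file names into ffmpeg's "concat:a|b|c" input string.
concat_arg() {
  list=""
  for f in "$@"; do
    # Append a "|" separator before every name except the first.
    list="${list:+$list|}$f"
  done
  printf 'concat:%s\n' "$list"
}
```

You could then run ffmpeg -i "$(concat_arg chapter*.mp3)" -q:a 0 MyAudiobook.mp3. The shell expands chapter*.mp3 in alphabetical order, so zero-padded names (chapter01, chapter02, ...) keep the chapters in sequence.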

The next step requires MP4Box; check its installation instructions if you don’t have it yet. With MP4Box installed, we simply need to add the chapters data to our audio file. At the same time, we convert our single mp3 file into mp4. We do that by typing in the Terminal:

MP4Box -add MyAudiobook.mp3 -chap MyBookTitle.chapters audiobook.mp4

In order for the chapters to work in iTunes as well, we need to convert the chapter data into QuickTime format. We’ll need another application for that, this time mp4chaps, which you need to download and install. The next command in the Terminal will be:

mp4chaps --convert --chapter-qt audiobook.mp4

At this point, you should be able to listen to your audiobook with working bookmark functionality. But you’ll notice that there is no metadata: the media player shows neither the book’s title nor its author. We’ll fix that in the next step:

ffmpeg -i audiobook.mp4 -metadata album_author="Author Name Goes Here" -metadata album="Book Title Goes Here" -metadata year="2016" audiobook-tags.m4a

Note that we also renamed the file and gave it a new extension, m4a. M4a is very similar to m4b, except it does not allow bookmarking – we’re not losing that feature, though: it will be back as soon as we rename the file again. Simply type:

mv audiobook-tags.m4a audiobook.m4b

And as the final touch, let’s add a cover image to the audiobook. Again, in the Terminal, type:

MP4Box -itags cover=cover.jpg audiobook.m4b

The cover file can be jpg or png. Make sure it is in the directory you’re currently in (or use the appropriate path to the file).

Voilà! Your audiobook is ready, enjoy!

This process is of course labor intensive and not the most appropriate for production. The idea behind this recipe is that of ‘crafting’ (digitally but manually) an audiobook. The aim of this research, however, is to produce such a production workflow. As a first step towards it, three Python scripts have been made. They are not yet the final output, but feel free to experiment with them. And, of course, comments and feedback are welcome and always appreciated. : )
