Home-brewed Audiobooks — a guest post by Sabrina Chase

*As a way of explanation — do I need to explain? — I have my internet turned off this weekend.  This is a means of trying to finish Through Fire and getting it to Baen this week, so I can write Darkship Revenge, and then bowl of red.  I hear that in the old days, in soccer teams trainers would separate the players from their wives, to make them more keen to play.  I’m not sure how that worked, unless, of course the wives were shipped to a small deserted island and forever out of reach if they lost the match, which I don’t think is likely, but who knows?  I will be responding to comments on the Fire (tablet) which is the only permitted internet access this weekend.  This means of course there will be specactulabulous auto correct errors, like Mucous for Lucius, but never mind.  I shall report on Tuesday on whether this worked, and by then we can decide whether to send soccer players’ wives to little deserted islands or not.  Meanwhile, Ms. Sabrina Chase, she who forges ahead where no indie has gone before — okay, where no hun has gone before — is here to tell us about the joys and wonders of producing your own indie audiobooks! I’d tell you to be gentle with her, but she’s one of you and can dish it out better than you can, so never mind.*


Greetings, O Huns and Hoydens! I come to tell you of my adventures recording an audio version of my short story collection The Bureau of Substandards Annual Report. I picked this as my test project because it was short (only five stories) and I wanted to see how difficult it would be to do a full book. Apparently I did it right because Audible just accepted the project and it is available!

My equipment consisted of the following: a digital recorder, remote for recorder, headphones, software that came with recorder, Roxio software for final mp3 conversion, a closet, a shelf, and some memory foam. Total cost ~$300 (business expense, written off on taxes. Having a business can be a good thing.)




The digital recorder came with audio processing software, WaveLab LE7. It probably isn’t the best, but it was free and did 90% of what I needed to do, to the limits of my equipment. The memory foam was free, courtesy of my sister’s dog who decided her memory foam mattress harbored ill intent and needed to be destroyed. I cut chunks from the carcass and assembled them into a little foam quonset hut for sound isolation. The digital recorder went on a tripod stand inside the foam hut, which was placed on a shelf in my closet so the microphone was at head level while standing (I prefer to stand while recording, for better breath support and to avoid chair squeaks).

Recording Process:

First off, I put my book on my Kindle Fire. Adjustable font size, no paper rustling for page changes. Once in my “studio”, I followed the ACX suggestions and recorded over a minute of dead sound. This is what you use for cutting and pasting over coughs, sirens, etc. instead of using a “blank” sound. Then I started recording the actual text. Whenever I made a mistake, I hit the “mark” button on the remote, waited a second or so, then repeated the bad section and continued on as normal. The marks show up in the file but not as sounds, so they are easy to find in post-production and fix.


Now you get to open the recorded files in the audio software. I learned many interesting things looking at my recorded files. For one thing, when I take a breath it looks like a whale. Really easy to find the ones I needed to remove. Also, the pops from words that begin with “p” or end with “t” are amazing and can be toned down or even simply deleted in the software. You just select with the mouse, and click delete. Insta-fix! Same thing with all the marked errors in your file. You figure out, using the cursor and the “play from here” function, where the last good word was that you repeated in your fix. Select the section with the error *including* that good repeated word and up to the same word in the fixed section, and delete. Play the section again to make sure you have enough space and you didn’t double words by accident. I was surprised how easy it was to fix errors this way.

You can remove spikes, rustling, etc. and you can also even the volume using the normalizer and level tools in the software.

Now we get to the technical aspects required by ACX http://www.acx.com/help/rules-for-audiobook-production/200485520

Each chapter, or section, must have a certain amount of dead space at the beginning, the end, and after the section title. You must also record beginning and end credits. Once you have all the various pieces, and the obvious errors are removed, you need to normalize the files so everything has a similar volume level (compression). Note that if you have loud spikes (yelling, or whatever) in your recording the vanilla compression or equalization available in software won’t work correctly. The *average* will look OK, but only because everything else is too soft. Hence the pre-cleaning step above to dampen unusual loud spots.

Now ACX doesn’t want .wav files. WAV is easier to work with for the software, but you need to convert the finished files to .mp3 format. That was the only thing the WaveLab software did NOT do for me, but I had Roxio that did, and could even convert in a batch.

You are so done with this: It takes a long time to clean up audio. 3x the recording time is pretty much what I had to do. I am sure an experienced voice artist with pro software can do better, but not me. So, I finished the files. Then I went to ACX, started a new project for my existing book, and said I had the audio files myself.

In the next step, you upload every separate file, in order, and here is where you can name them whatever you want (I had little intro-bits to each story.) Once uploaded, ACX does their review process. My first attempt failed because of the normalization volume problem, so they told me about it and a few other errors (wrong length of dead space at the end of a few files). I fixed them, uploaded the new files per their directions, and the second time was successful.

Pros and cons of doing your own recording:


-WAY too much work for the return. Much better off writing, for book-length projects anyway.

-I have a lot to learn about audio, and I now have greater appreciation of voice artists and their skills

-ACX is not very clear and some say contradictory in their guidelines about audio production. I did a lot of “try this and see if they notice” in my attempt

-cats hear you talking and want to join in.

Out of consideration for Sarah’s blog and length and such I did not include screenshots of the editing in action, but if there is interest I can do that and maybe share it using dropbox or Google docs or something like that.

(My previous post on ACX in general is here.)


42 thoughts on “Home-brewed Audiobooks — a guest post by Sabrina Chase”

  1. Three times the recording time to clean it up I would think was very good. I’ve done a little video work, and while I’m admittedly anything but a professional, I found I needed two to three full days spent editing, adding music, etc. to produce one hour of video. (I could slam one out for myself or friends to watch in no time, but if I wanted one that looked at least semi-professional, and I wasn’t embarrassed to sell, it took time.)

  2. ACX is not very clear and some say contradictory in their guidelines about audio production. I did a lot of “try this and see if they notice” in my attempt.

    Sounds like they mostly didn’t notice…

  3. “…there will be specactulabulous auto correct errors, like Mucous for Lucius,”

    I always thought he was a little snot.

    1. This post is Sabrina’s, but yes. We have an audio booth & will be producing my non Barn books as audio. Also, if Steve Green has time, he’ll be doing a few of the shorts

      1. Excellent! I can’t wait! Does that include A Few Good Men?
        I’m blind so I read books either with a screen reader or by listening to a human narrator. The Audible versions of Darkship Thieves and Renegades were great. I’d much rather hear your books in a real voice rather than a computer one.
        Thanks for getting back to me.

          1. Aww, that was pretty tame for a first autocorrect error. Barn for Baen is even logical, one letter off and it’s next door on the keyboard. Might even be an actual typo!

  4. Sabrina –

    Congratulations on successfully recording an audiobook. It sounds like you did extremely well for a first time project.

    As a bit of background, I am an audio and acoustics consultant and have designed many professional recording studios. With that said I just wished to clarify a few of your comments. You wrote:

    “The memory foam was free, courtesy of my sister’s dog who decided her memory foam mattress harbored ill intent and needed to be destroyed. I cut chunks from the carcass and assembled them into a little foam quonset hut for sound isolation.”

    There are two very different types of acoustic materials. Soft materials such as foam, fiberglass, or mineral wool are common sound absorbing materials. They absorb and control sound inside a space, but by themselves do almost nothing to prevent outside sounds from entering a space. Sound isolation materials tend to be heavy and stiff such as drywall that surfaces most residential walls. Installed properly with all the little gaps sealed with caulk, they can be very effective at keeping outside sounds out.

    Most foam materials, while they can be effective sound absorbers have one huge downside. They burn extremely easily. When most foams burn, they give off poison gas. Therefore it is an extremely bad idea to use common foams in places where they might catch fire such as if someone smokes near them.

    “Once you have all the various pieces, and the obvious errors are removed, you need to normalize the files so everything has a similar volume level (compression).”

    Normalization is not the same thing as compression. I do not have the time right now to explain the differences, but I will try to get back and do so later.

    “It takes a long time to clean up audio. 3x the recording time is pretty much what I had to do.”

    That is doing pretty good. While music recording is much more involved than the spoken word recording you were doing, it is not uncommon to spend weeks editing and post producing a music CD. Spending only 3x the recording time is not bad at all and shows that you are working efficiently.

    1. Thanks for adding clarification, Ray. As should be obvious, I knew nothing about matters audio before I began and know next to nothing now! 😉

      The sound isolation is being handled by the closet walls (plywood, sheetrock, and fiberglass insulation). I made the little foam hut in imitation of the audio recording cubes I saw for sale. From other feedback from people with more sound experience, I should probably have some kind of rug or mat on the floor to soften the reflected sound (wood floors, and he could HEAR that in the sample. Wild.)

      Please do explain all the differences between the audio doohickeys. ACX did NOT and while my google-fu is strong finding an idiot’s guide to audio processing was not easy.

      1. I recall a book out a few years ago called “Guerrilla Home Recording: How to Get Great Sound from Any Studio (No Matter How Weird or Cheap Your Gear Is)” that covers a lot of this for the novice.

      2. “The sound isolation is being handled by the closet walls (plywood, sheetrock, and fiberglass insulation).”

        Yes, you understand that correctly. One bad thing about using the typical closet is air circulation. Depending on the size of the closet you will have a limited amount of time you can record without taking a break, opening the door, and letting in fresh air. Professional recording studios go to significant effort to provide continuous air circulation while still providing excellent sound isolation, but that is an effort and cost many home recording studios don’t take.

        “From other feedback from people with more sound experience, I should probably have some kind of rug or mat on the floor to soften the reflected sound (wood floors, and he could HEAR that in the sample. Wild.)”

        It is amazing what careful listening to a recording can tell you about the acoustic environment in which it was recorded. This is particularly true of a simple voice recording. We hear live voices in different environments every day, and our ear/brain has learned to interpret what we hear to tell us a lot about the environment.

        An acoustically “dead” environment allows great intelligibility and clarity, and makes editing easy. This is one big reason an isolation booth of some sort is so often used for voice. However, even with the use of acoustic treatments such as the foam you used, it is hard to hide the acoustic signature of being in a small room.

        A closet is not the ideal environment to speak, nor is it the ideal environment in which to listen to someone talking. Both as talkers and listeners we are more comfortable in larger rooms, and this reflects in how well we speak. The ideal environment is one in which we are comfortable speaking (or playing an instrument). The downside is that it may be more difficult to find a room that is large enough to be comfortable yet is also quiet enough so we don’t waste a lot of time and effort editing out unwanted sounds. This is one advantage of a well designed professional recording studio.

        “ACX is not very clear and some say contradictory in their guidelines about audio production.”

        I have just read their guidelines, and would love to have you point me to portions that were unclear to you. I would be glad to try to interpret portions that are less than clear.

        I will have more later.

      3. “Please do explain all the differences between the audio doohickeys. ACX did NOT and while my google-fu is strong finding an idiot’s guide to audio processing was not easy.”

        First I need to explain what a “dB” is. ACX uses this term and so does your recorder and sound editing program. Basically a dB is a ratio between two different sound levels. If there is a 3 dB difference in level the average person will probably notice the difference in a direct comparison. If there is a 10 dB difference in level the louder sound will be perceived as being twice as loud as the sound that is 10 dB lower.
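For readers who think better in code, here is a minimal Python sketch of the dB arithmetic described above. The function names are made up for illustration and come from neither ACX nor the recorder software. The amplitude conversion uses the standard 20·log10 definition, and the loudness function just encodes the "10 dB sounds about twice as loud" rule of thumb.

```python
import math

def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to a ratio of amplitudes."""
    return 10 ** (db / 20)

def amplitude_ratio_to_db(ratio):
    """Convert a ratio of amplitudes to a level difference in dB."""
    return 20 * math.log10(ratio)

def perceived_loudness_ratio(db):
    """Rule of thumb only: every +10 dB is heard as roughly twice as loud."""
    return 2 ** (db / 10)

if __name__ == "__main__":
    for diff in (3, 6, 10, 20):
        print(diff, "dB ->",
              round(db_to_amplitude_ratio(diff), 2), "x amplitude,",
              round(perceived_loudness_ratio(diff), 2), "x perceived loudness")
```

Perceived loudness varies from listener to listener, so treat that last function as a rough guide rather than a measurement.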

        All modern sound recorders use a digital recording process. These digital recorders have an absolute maximum level they can store digitally that is called 0 dB Full Scale, or 0 dB FS. In the ACX Mastering section they talk about dB or dB RMS (or db which is a typo), but what they are really talking about is some number of dB FS, or how much lower the recorded level is relative to the maximum possible recorded level. You will notice that all the levels they talk about are negative because they are all lower in level than the maximum possible recorded level.
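As a rough illustration of dB FS, the sketch below measures how far the loudest sample in a recording sits below the maximum storable value. It assumes 16-bit samples, where full scale is 32768; that is an assumption for the example, since your recorder may use a different bit depth. The name `peak_dbfs` is hypothetical.

```python
import math

def peak_dbfs(samples, full_scale=32768):
    """Level of the loudest sample relative to full scale, in dB FS.

    Assumes integer PCM samples; full_scale=32768 corresponds to 16-bit audio.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # pure digital silence has no finite level
    return 20 * math.log10(peak / full_scale)

if __name__ == "__main__":
    # A peak at half of full scale is about 6 dB below the maximum.
    print(round(peak_dbfs([0, 16384, -8192]), 2))
```

The result is zero or negative for any valid recording, which matches the point that all the ACX levels are negative numbers.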

        Please note that there is no fixed relationship between the maximum possible recorded level and actual acoustic sound levels.

        Acoustic sound levels are also expressed in dB, but in this case it is dB Sound Pressure Level or dB SPL. 0 dB SPL was picked to be about the softest sound level a young person with excellent hearing could hear. Someone talking at an “average” speech level produces roughly 60 to 70 dB SPL at 1 meter from the talker’s mouth. Soft speech can be lower (maybe 40 or 50 dB SPL) and loud speech or screaming can be a lot higher (maybe 80 or 90 dB SPL).

        All of the above dB SPL numbers are at 1 meter. If the mic is closer the level will be higher. For the same speech level each time the distance drops in half the level rises 6 dB. So someone talking at an average level of 65 dB SPL at 1 meter, will be producing 71 dB at 1/2 meter and 77 dB at 1/4 meter, and so on.
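The distance rule can be written as a one-line formula. This sketch assumes an idealized open space with no reflections, which a closet certainly is not, so treat the numbers as approximate. The function name is illustrative only.

```python
import math

def level_at_distance(level_at_1m_db, distance_m):
    """Estimate dB SPL at a given distance from a level measured at 1 meter.

    Idealized free-field assumption: the level rises about 6 dB each time
    the distance is halved.
    """
    return level_at_1m_db - 20 * math.log10(distance_m)

if __name__ == "__main__":
    for d in (1.0, 0.5, 0.25):
        print(d, "m ->", round(level_at_distance(65, d), 1), "dB SPL")
```

The exact figure per halving is 6.02 dB, which is why the rounded 71 and 77 dB values in the comment are close enough for practical purposes.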

        ACX has a section on audio distortion. The point is that different people talk at different levels, plus there are variations due to talking softer or louder, and variations due to the distance from the microphone. You need to adjust the recording level so that even with the loudest voice you expect to use, the meter on the recorder will still indicate well under 0 dB FS.

        Since none of us can talk at an absolutely even level, and it is easy once we get involved in reading the book to raise the voice level, it is best to record at a bit lower level so that even if we get involved in what we are reading and unconsciously increase our speech level, we still will not hit that 0 dB FS maximum possible recorded level and cause distortion.

        If the recording is clean and otherwise OK but a bit too low in recorded level, it is possible to adjust the level in the editing program, and to use what is called “dynamics processing” to get the levels that ACX specifies in the finished recording.

        I will get into dynamics processing in the next post.

      4. A good reader does not read in a monotone or with an absolutely constant level. A good reader puts feeling and expression into their reading. Part of this is tonal changes in the voice, and part of this is differences in sound level, or what I would call natural dynamics.

        ACX for a number of good reasons asks that this natural dynamic range be reduced in the finished recordings you submit. What they are asking for is a judgement call on their part that trades off some of the natural dynamics of good speech for a recording that will sound good when played in a less than ideal playback environment. In particular the reduced dynamic range helps when the listener is in a noisy environment.

        Levels that are said to be some number of dB RMS are average levels of the voice. Levels that are said to be some number of dB Peak are measured differently, and are mostly important in making sure the maximum possible recording level is not exceeded. Some meters show RMS levels while other types of meter show Peak levels. Very often recorders show just Peak levels, while editing programs might show both Peak and RMS.

        Unprocessed speech will have Peak levels that are roughly 12 to 15 dB higher than the RMS level.

        ACX says “Your submitted files should measure between -23dB and -18dB RMS, with peaks hovering around -3dB.” With a peak level of -3 dB FS, and average speech, the RMS level will be roughly -17 dB FS (14 dB lower). Since they want average RMS levels no lower than -23 dB FS, that means a dynamic range (the difference between the louder and softer speech levels) of roughly 6 dB which is not much at all.
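To make the RMS and Peak numbers concrete, here is a small sketch that measures both for a list of samples and compares them with the figures quoted above. It assumes floating-point samples in the range -1.0 to 1.0, and it reads "peaks hovering around -3dB" as "peaks no higher than -3 dB FS"; both are assumptions for the example. Editing programs can define RMS slightly differently, so their meters may not match this exactly.

```python
import math

def peak_dbfs(samples, full_scale=1.0):
    """Highest instantaneous level, in dB FS."""
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak / full_scale) if peak > 0 else float("-inf")

def rms_dbfs(samples, full_scale=1.0):
    """Average (root-mean-square) level, in dB FS."""
    mean_square = sum(s * s for s in samples) / len(samples)
    rms = math.sqrt(mean_square)
    return 20 * math.log10(rms / full_scale) if rms > 0 else float("-inf")

def check_levels(samples, rms_range=(-23.0, -18.0), peak_ceiling=-3.0):
    """Compare measured levels with the quoted ACX targets (assumed reading)."""
    rms = rms_dbfs(samples)
    peak = peak_dbfs(samples)
    return {
        "rms_dbfs": rms,
        "peak_dbfs": peak,
        "rms_ok": rms_range[0] <= rms <= rms_range[1],
        "peak_ok": peak <= peak_ceiling,
    }

if __name__ == "__main__":
    # Synthetic test signal, not real speech.
    print(check_levels([0.1, -0.1] * 100))
```

Real speech has peaks well above its RMS level, so a recording can pass the peak check and still fail the RMS check, or the other way around.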

        Even making a conscious effort to moderate your natural speech dynamics it would be very hard to get the dynamic range down to 6 dB. This is where dynamics processing and in particular compression and limiting come in.

        ACX says “Compression should be applied with a fast attack and release, around a ratio of 3:1.” Compressors (either hardware devices or virtual compressors that are part of your editing software) have a “threshold level”. This threshold level is something you can adjust to get the best sound. Below the threshold level of a compressor no change is made to the sound. If the threshold is set too low, the compressor will boost the background noise between words. If the threshold is set too high, it will do very little processing since only sounds that exceed the threshold will be changed by the compressor.

        ACX suggests a ratio of 3:1. This means that for sounds above the threshold level, for every 3 dB change in the input level, the output will only change 1 dB. If the threshold is set to the softest level in your recording, with a 3:1 ratio it will take an 18 dB dynamic range in your unprocessed recording and reduce it to 6 dB. In other words it can take a more natural dynamic range and reduce it to the narrower dynamic range ACX wants for their recordings.
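The 3:1 arithmetic can be sketched as a simple level mapping. This shows only the static input-to-output relationship; a real compressor also has attack and release behavior, which this ignores. The function name and the -30 dB threshold are made up for the example.

```python
def compress_level_db(input_db, threshold_db, ratio=3.0):
    """Map an input level to an output level for a simple compressor.

    Below the threshold nothing changes. Above it, every `ratio` dB of
    input change produces only 1 dB of output change.
    """
    if input_db <= threshold_db:
        return input_db
    return threshold_db + (input_db - threshold_db) / ratio

if __name__ == "__main__":
    threshold = -30.0          # hypothetical: set at the softest speech level
    softest, loudest = -30.0, -12.0   # an 18 dB input dynamic range
    out_range = (compress_level_db(loudest, threshold)
                 - compress_level_db(softest, threshold))
    print("output dynamic range:", out_range, "dB")
```

With the threshold at the softest level, the 18 dB input range comes out as 6 dB, matching the figures in the comment.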

        When you use compression be sure to listen carefully to the end results. Done right the processing will not be audible as an artifact in the recording. Done wrong or overdone it can make the recording sound worse. Taken to an extreme, compression can take all the life out of a recording. So don’t ever adjust dynamics processing including compression going only by meters. You must listen and always ask yourself if it is making the sound better or worse.

        After compression you can “normalize” the recording. This process looks for the very highest peak level in the recording and adjusts the level for the entire track so that the highest peak level matches some specified level. This is often -0.5 dB FS, but ACX asks for -3 dB FS so you want to set that level when you normalize. Note that normalization does not alter the dynamics of the recording, but merely adjusts the level in a fixed way for the entire recorded track so the highest peak level is what is specified.
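Peak normalization as described here amounts to one gain value applied to every sample. The sketch below assumes floating-point samples between -1.0 and 1.0, and the function name is hypothetical; editing software does the same thing internally with its own options.

```python
def normalize_peak(samples, target_dbfs=-3.0, full_scale=1.0):
    """Scale the whole track so its highest peak lands at target_dbfs.

    The same gain is applied to every sample, so the relative levels
    (the dynamics) are unchanged.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence stays silence
    target_amplitude = full_scale * 10 ** (target_dbfs / 20)
    gain = target_amplitude / peak
    return [s * gain for s in samples]

if __name__ == "__main__":
    track = [0.1, -0.5, 0.25]
    print([round(s, 3) for s in normalize_peak(track)])
```

Because the gain is set by the single highest peak, one stray loud spike holds the rest of the track down. That is the problem Sabrina ran into, and it is why spikes should be tamed before normalizing.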

  5. I like this (though it is a lot of work for y’all) as I have heard the Audible version of Weber’s HH stuff and the reader does bad accents that don’t really fit the characters or those who should have them do not. Hard to argue with the author on accents (though Sarah’s will all have a Porto accent (~_^)) or a lack thereof.

    1. This is one thing I can’t try, not only I have an obvious Finnish accent,

      (okay, I don’t think I’m quite as bad as this guy either is or is pretending to be 🙂 https://www.youtube.com/watch?v=cUqTM6kCR_U)

      due to the fact that it’s been nearly two decades since I last had the chance to really speak English I have an opposite problem than native speakers, I may know how to spell but have no idea how to pronounce a lot of words.

      1. You know, there is nothing wrong with an accent as long as you are understandable. It would be a selling point, since a lot of voice artists aim for the standard voice. Just aim for clarity, projection, and understanding what your work means.

          1. Familiar with this?:

            Hiski Salomaa’s songs were still played pretty often in the radio here when I was a child. I think they were recorded in New York sometime during the 1930’s.

              1. I was just pasting in the link, not the embed code. I forgot we can embed here, not all places will do it.

            1. sounds like something from Finland Calling. Suomi Kutsuu is the longest running Finnish language tv show in the world, and I used to watch it from time to time when I was really bored and channel6 was the only thing coming through (yes kids, we had 3 tv stations as a kid. a CBS, NBC and a PBS, and later once dad installed a 20 ft tall tower on the roof and if weather was right, ABC from Green Bay). I understand it is still on though the host is “retired” and only works on the show (used to be a news guy and other stuff at the station).

              1. I have read that although he was popular among the immigrants there were several other singers and other musicians around that time who were more well known than Salomaa, but he seems to have been the one of them whose songs became the best known and liked in Finland, although only after the wars. Who knows why, possibly just some sort of lucky coincidence.

                1. BTW, Finnish has several dialects – or used to have, they are phasing out nowadays – and Salomaa is originally from Savo – or Savonia, which seems to be the English name – one of the areas with a well known, and often mocked, dialect. People from there are supposed to be rather easy going and a bit lazy, but also wily and not to be trusted. To the east of that is Karelia – lively and talkative, and a bit weird since they are almost like foreigners. And to the west first Tavastia and then Ostrobothnia, in the north, and in the south Finland Proper, and the people in all three are the more ‘typical’ Finns, kind of slow talking, reliable and stolid.

                  I guess that whoever first figured out those descriptions was from west Finland and rather suspicious of the more eastern people.

  6. I have yet to put anything up in ACX, but was regularly podcasting fiction a while ago – short stories, mostly. And it was a big help in the final push to finish Battlehymn. It gave me a final draft / walkthrough for the entire book.

  7. Very timely, this post. I’d like to offer a different experience (sorry it’s so long). I’m just in the middle of recording the first book (To Carry the Horn) in my first series. It’s a long work so I knew it would be expensive, no matter what (estimate: 16 finished hours). I don’t have the proper space to do an audio booth. If you’re interested, here are the economics of what I ended up with.

    First, I explored ACX for the royalty-share option, but (1) they want exclusive for 7 years, and (2) the narrators I wanted were not inclined to risk-share with a newbie. It would have cost me circa $200-300/finished hour times 16.

    More importantly, ACX (Audible/iTunes) only distributes Digital Downloads (MP3s). There are two other important formats: Digital CDs and Audio CDs (think libraries and truck stops and cars).

    So I priced a local studio with serious equipment. $40/hr, with an estimated 1.5 to 1.75 multiplier for finished hours, or $60-70/finished hour times 16. We correct in place whenever there’s a fumble so that it’s easier for the edit process. And I found a distributor for all audio formats (hard to do): Spoken Word.
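For anyone comparing the two options, the arithmetic in this comment works out as below. This is only a sketch using the figures quoted above (a 16-hour finished length, $40/hr studio time, a 1.5 to 1.75 multiplier, and $200-300 per finished hour for a hired narrator); the function name is made up.

```python
def production_cost(rate_per_hour, multiplier, finished_hours):
    """Total cost = hourly rate x studio hours per finished hour x finished hours."""
    return rate_per_hour * multiplier * finished_hours

if __name__ == "__main__":
    finished_hours = 16
    studio_low = production_cost(40, 1.5, finished_hours)
    studio_high = production_cost(40, 1.75, finished_hours)
    narrator_low = production_cost(200, 1, finished_hours)
    narrator_high = production_cost(300, 1, finished_hours)
    print("studio, self-narrated:", studio_low, "to", studio_high)
    print("hired narrator:", narrator_low, "to", narrator_high)
```

These totals leave out the author's own time and any editing charged separately, neither of which is itemized in the comment.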

    So far, I’ve recorded half the book, and it’s going very smoothly. The only real problem (within my narrator-skills limits) is that I can hoarsen up after a couple of hours, and the contrasts are a problem that will require a small amount of rerecording to keep the recording-day boundary transitions less noticeable. It only costs me perhaps twice the read time – 16 hours to record, 16 hours to listen and check.

    I used a teleprompter program (PromptDog). It cost money, but it made the scroll-through easier to read and control, and I figure I’ll recoup it over many books.

    One thing to be aware of: I was pushed to do this when I heard from a reader that Amazon’s Whispersync was activated and butchering my Welsh character names. Once you have a legit digital recording, that gets used for Whispersync and there are rules about how much of the book needs to be included. You get a pass for Front and Back Matter, generally, but I was told explicitly that if you have, say, an excerpt from another work in the Back Matter, that needs to be recorded, too.

    I wanted a real narrator but it was just too expensive. At least I know I’ll always be available for later books in the series. I do worry about how well it will be received by readers (listeners), though — I’ve seen some bad reviews for poor narration by authors.

    This is a very expensive channel, any way you look at it, and I look upon this as an experiment. If the reviews are good, even if the sales start small, I will do the other books in the series.

    BTW, I turned up errata at the rate of about one per every two chapters, just in the process of reading aloud. I would never have believed that, at this point. When they tell you to read out loud as part of the proofing process, that’s good advice.

    1. Wow. Thanks for doing your own book as well. I’m excited to read it.

      I found a company:
      that puts authors and narrators together. Their aim is to make it cheaper for independent authors to produce their books on audio.
      *Off to get Sabrina Chase’s book now*

      1. Like ACX, Iambik is digital download only. I couldn’t see a list of who they distribute to, but I recommend checking that the list (and costs) are significantly better than ACX’s.

  8. On a tangent:

    IMHO anybody who has to give a significant presentation and does not rehearse via recorded audio is leaving the door open to trouble.

    The same applies to video, for those who can afford it. Not me at this time, but hopefully the price will drop as tech advances.
