You’ve recorded your samples, you’ve put everything together in a folder, now it’s time to get into the real nitty gritty of making an UTAU!
The oto.ini is a vital configuration file. Think of it as a map of each sound, telling the UTAU program where consonants are in relation to vowels on each sample, what part of the recording to stretch and hold for long notes, and so on. Even the most basic CV UTAU will NOT function correctly without an oto!
Choose a section below and let’s get learning!
Parts of the Oto.ini
To start off, I’m going to explain the different parts of the oto and various functions within the oto editor. These are about the same between SetParam and UTAU/UTAU-Synth, with some color differences in the visual editors. The parameters in SetParam are clearly labeled, so for this tutorial, I will be referencing the colors of the parameters in UTAU and UTAU-Synth.
Offset is the blue shaded area before the beginning of the sample. Anything in this blue area will NOT be heard; its edge indicates the very start of the sample as it will play in UTAU. Moving the Offset will move everything after it as well, but keep the item positions relative to each other. (This is how otoing VCV usually works; you use a base oto with standard parameters for everything, but scoot the Offset over so the Preutterance is in the correct place on each sample!)
Overlap is the green vertical line. The space behind this line is the space where the sample fades into the one before it. This ensures a smooth connection between the different syllables and phonemes when they play in UTAU.
Top Of Note / Preutterance
The Top Of Note or Preutterance is the vertical red line. This indicates the point on the sample that will match with the very start of a beat/note. For CV samples, it will ALWAYS go directly between the consonant and vowel.
The Consonant is the pink shaded area. Anything within this area is treated specifically as a “consonant” rather than a “vowel” and therefore will not be stretched for long notes. Anything within this area is affected by the Consonant Velocity property. Typically, the Consonant area will extend a few milliseconds into the vowel to cover any pitch fluctuations that may happen right after the actual consonant.
Stretched Area (Vowel / White Space)
This is anything between the pink Consonant area and the blue Cutoff/Blank area. For most samples, this is the vowel that is stretched for long notes. For ending samples like breaths or VC, this is empty, silent white space after the end of the sample.
Cutoff / Blank
The Cutoff or Blank is the blue area that indicates the end of the audible sample. Anything within this blue area, like the Offset before it, will be ignored by the program. The left edge of this area indicates the actual cutoff where the sample will end.
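If it helps to see the parameters above as plain numbers, here is a minimal Python sketch that turns one oto entry into absolute positions on the sample. It assumes the commonly used oto.ini conventions, which this tutorial does not spell out: Consonant, Preutterance and Overlap are stored in milliseconds measured from the Offset, a negative Cutoff is measured from the Offset, and a positive Cutoff is measured back from the end of the file. The function name and the example values are made up for illustration.

```python
def marker_positions(offset, consonant, cutoff, preutterance, overlap, wav_length_ms):
    """Return where each marker sits on the sample, in ms from the start of the file."""
    # Assumed convention: negative cutoff counts forward from the offset,
    # positive cutoff counts backward from the end of the recording.
    if cutoff < 0:
        cutoff_pos = offset - cutoff
    else:
        cutoff_pos = wav_length_ms - cutoff
    return {
        "offset": offset,
        "overlap": offset + overlap,
        "preutterance": offset + preutterance,
        "consonant_end": offset + consonant,
        "cutoff": cutoff_pos,
    }

# Hypothetical values for a one-second CV sample.
positions = marker_positions(45, 120, -380, 90, 30, wav_length_ms=1000)
for name, ms in positions.items():
    print(f"{name}: {ms} ms")
```

Anything between "consonant_end" and "cutoff" in that result is the stretched area described above.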
And finally, Aliases are names for each oto line, and are what you type into the notes to call to that specific sample and setting. In the case of CV, these can be used to add romaji or kana capability to a bank (ka vs か) and in strung-sound methods, these are used to reference different sections of long recordings (turning kakakikakukeka into -ka, a ka, a ki, and so on).
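To make the idea of an "oto line" concrete, here is a rough Python sketch that reads one line into named fields. It assumes the usual layout of `filename=alias,offset,consonant,cutoff,preutterance,overlap`; that layout, the `OtoEntry` and `parse_oto_line` names, and the sample values are assumptions for illustration and are not taken from this tutorial.

```python
from dataclasses import dataclass

@dataclass
class OtoEntry:
    filename: str
    alias: str
    offset: float
    consonant: float
    cutoff: float
    preutterance: float
    overlap: float

def parse_oto_line(line: str) -> OtoEntry:
    """Split one oto.ini line into its filename, alias and five numeric values."""
    filename, _, rest = line.strip().partition("=")
    alias, *numbers = rest.split(",")
    # Treat a blank field as 0 so half-finished lines still load.
    offset, consonant, cutoff, preutterance, overlap = (
        float(n) if n.strip() else 0.0 for n in numbers
    )
    return OtoEntry(filename, alias, offset, consonant, cutoff, preutterance, overlap)

# Hypothetical line: the same ka.wav sample could carry both "ka" and "か" aliases.
entry = parse_oto_line("ka.wav=か,45,120,-380,90,30")
print(entry.alias, entry.offset, entry.preutterance)
```

Several lines can point at the same file with different aliases, which is how one recording can serve both romaji and kana input.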
The Oto Editor
There are multiple ways you can create and modify oto.ini files. Some advanced users even use Notepad to create their base file. However, I’m just going to introduce you to the standard visual oto editors to make things a bit easier!
Editing in UTAU (PC)
It is possible to oto a bank in UTAU by using Tools > Voice Bank Settings. This will bring up a list of all the samples in the voicebank, and allow for entering the parameters via the numerical values on the right, and visual editing by clicking
Editing in UTAU-Synth (Mac)
Editing the oto in UTAU-Synth for Mac is almost exactly like editing the oto in PC UTAU, with one major difference: rather than saving the configuration file as an oto.ini, UTAU-Synth favors oto_ini.txt instead. This file functions the same as an oto.ini but only for UTAU-Synth and must be converted into an oto.ini to function on PC.
Editing in SetParam (PC)
SetParam is a program built for batch-editing oto.ini files. It is made as companion software to OREMO, and is especially recommended for editing lengthy banks outside of CV. With SetParam, you can standardize the oto file by entering the same value for multiple samples, then tweak each oto line individually afterwards in the visual editor.
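The "standardize first, tweak later" workflow can be pictured with a short Python sketch. This is not how SetParam itself works internally; it only illustrates the idea of one shared set of values with a per-sample Offset. It assumes the usual `filename=alias,offset,consonant,cutoff,preutterance,overlap` line layout, and the base values and file names are hypothetical.

```python
# Shared base values in milliseconds (hypothetical numbers).
# Everything except the offset is assumed to be measured from the offset,
# so these stay the same no matter where the offset lands.
BASE = {"consonant": 400.0, "cutoff": -500.0, "preutterance": 250.0, "overlap": 80.0}

def oto_line(filename, alias, offset, base=BASE):
    """Build one oto.ini line from the shared base values and a per-sample offset."""
    values = [offset, base["consonant"], base["cutoff"], base["preutterance"], base["overlap"]]
    return f"{filename}={alias}," + ",".join(f"{v:g}" for v in values)

# Hypothetical strung recording with an offset measured for each alias.
measured_offsets = {"a ka": 612, "a ki": 1212, "a ku": 1812}
for alias, offset in measured_offsets.items():
    print(oto_line("_akakikaku.wav", alias, offset))
```

After generating lines like these, each one would still be checked by eye in the visual editor.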
CV banks are considered the easiest to record, as the samples are single syllable. Otoing these takes a bit of special attention as there is no one-size-fits-all oto for CV, unlike with strung-sound banks like VCV or CVVC.
Phonemes are divided up below into varying categories based on the type of sound, consonants in particular. I have used phonetic terms loosely as a categorizing method.
“Starting” Vowels: [a] [e] [i] [o] [u] [n]
These are vowels that come after a rest, usually starting a phrase or verse. They are otoed to begin right at the start of the note without any fading before. Sometimes these are multi-aliased with a hyphen to indicate they are specifically for the start of a phrase, such as [- あ] .
SPECIAL NOTE: When otoing these, watch for any throat noise (vocal fry or glottal sounds) that may precede the vowel. Oto it much like a consonant, with the Top Of Note AFTER the distortion. Unless it is intentional vocal fry, this kind of distortion should be rather short. It is usually visible in the waveform or spectrogram. Don’t worry if you find some; it usually adds some nice character to the voice!
Crossfading Vowels: [* a] [* e] [* i] [* o] [* u] [* n]
These are vowels that fade into the note before them, such as “a” to “i” in “ai”. Using an asterisk in the alias to indicate a crossfade vowel is recommended as these crossfade automatically in UTAU-Synth (it’s how Defoko is otoed there!)
To oto these, scoot the Offset a bit further into the vowel so that there is no fade in on the sample. Then place the Preutterance a bit into the sample, with Overlap around 1/2 between Offset and Preutterance. My go-to ratio is 90/60 Preutterance/Overlap, but you can go up to 300/150 as that is the value UTAU uses for its “Omakase” tuning automation tool.
Unvoiced Plosive Consonants: [k] [p] [t] [ch] [ts]
These are consonants that have a hard, sharp sound, and due to their pronunciation naturally have a gap between the preceding vowel and the start of the consonant. These consonants should NOT blend into the vowel before them; instead, leave some white space and place the overlap between the offset and the beginning of the consonant.
Voiced Plosive Consonants: [b] [d] [g]
These consonants also do not blend into the vowel before them, but have less of a hard gap between the preceding vowel and consonant than the unvoiced plosives. The same oto layout is recommended, but with a smaller white space between offset and the start of consonant.
“Soft” Consonants (Nasal, Fricative, Approximant, etc.): [m] [n] [h] [f] [j] [r] [v]
These consonants DO blend into the preceding vowel. Usually, I recommend placing the offset right at the start of the consonant, and the overlap about 1/3 between the offset and Top Of Note/Preutterance.
Tricky Approximant Consonants: [w] [y]
These consonants are like the “Soft” Consonants above, but require careful placement of the Top Of Note/Preutterance, as it is not usually obvious where the consonant ends and the vowel begins. To check this, be sure to look at the spectrogram of the sample by clicking the “s” button in the Voice Bank Editor, as that will make it easier to see the divide.
Sibilant Consonants: [s] [sh] [z]
These consonants are like the “Soft” Consonants but usually longer. Again, put the Offset at the start of the consonant (or sometimes partially into it if the Sibilants are particularly long) and then place the Overlap between 1/3 and 1/2 of the way from the Offset to the Top Of Note/Preutterance.
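The Overlap fractions mentioned so far can be collected into a small helper. This is only a rough Python sketch for getting a starting value, assuming Preutterance and Overlap are both counted in milliseconds from the Offset. The category names are made up, and the sibilant value is simply the midpoint of the 1/3 to 1/2 range, which is an arbitrary choice for this example. Plosives are left out because their Overlap depends on where the consonant starts, not on a fixed fraction.

```python
# Overlap as a fraction of the Offset-to-Preutterance distance.
OVERLAP_FRACTION = {
    "crossfade_vowel": 1 / 2,
    "soft": 1 / 3,
    "sibilant": 5 / 12,  # midpoint of 1/3 and 1/2, chosen only for this sketch
}

def suggest_overlap(category: str, preutterance_ms: float) -> float:
    """Return a starting Overlap value (ms from Offset) for the given category."""
    return round(preutterance_ms * OVERLAP_FRACTION[category], 1)

print(suggest_overlap("soft", 90))
print(suggest_overlap("sibilant", 120))
print(suggest_overlap("crossfade_vowel", 300))
```

Treat the result as a first guess and adjust by ear and by eye in the editor.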
Glides: (ex. [kya] [gwo])
These samples should follow the same otoing conventions as their non-glide counterparts (i.e. ka vs kya). The Top Of Note/Preutterance should go between the starting consonant and the “y” or “w”. However, be especially sure to open the spectrogram viewer and place the Consonant (pink area) well over the “y/w” and into the vowel.
RECORDING NOTE: The glide should NOT be drawn out. While people usually pronounce “Tokyo” as having 3 syllables (to-ki-yo), the glide in Japanese is actually treated as a single syllable and is a rather quick sound (to-kyo). The shorter and more cleanly recorded the glides, the better!
Extras (VV, Breaths, Ending Breaths, English R and L)
These are non-conventional samples for CV banks. However, it has become popular practice to include them to add human expression and smoother transitions between vowels.
Vowel to vowel transition samples sound better than their standalone [* v] counterparts. The Top Of Note/Preutterance should be placed directly between the two vowels after the transition into the second one, which is most easily seen by the spectrogram image of the sample. The Overlap should be around halfway between the Preutterance and Offset.
Standalone breaths (inhaled) are tricky, as they are purely white noise and UTAU does not usually play well with white noise except for specific samplers. To increase the odds of a breath being picked up by UTAU, include a bit of sung vowel after the breath and oto it out using Cutoff. The Offset, Overlap and Preutterance should all be in the same space, right at the start of the breath, and the Consonant should cover the ENTIRE breath, leaving a bit of empty white space between the end of Consonant and Cutoff at the end of the sample.
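Here is a rough Python sketch of that breath layout written out as an oto line. It assumes the usual `filename=alias,offset,consonant,cutoff,preutterance,overlap` layout, with values in milliseconds measured from the Offset and a negative Cutoff counted from the Offset. The function name, the `padding_ms` parameter, and the example timings are hypothetical.

```python
def breath_oto(filename, alias, breath_start_ms, breath_end_ms, padding_ms=20.0):
    """Build an oto line for a standalone breath followed by a sung vowel to cut off."""
    # The Consonant covers the entire breath.
    consonant = breath_end_ms - breath_start_ms
    # Cutoff lands a little after the breath, before the sung vowel begins.
    # padding_ms must fit inside the gap between the breath and that vowel.
    cutoff = -(consonant + padding_ms)
    # Preutterance and Overlap are both 0: they sit right on the Offset.
    return f"{filename}={alias},{breath_start_ms:g},{consonant:g},{cutoff:g},0,0"

# Hypothetical breath running from 100 ms to 600 ms in the recording.
print(breath_oto("breath1.wav", "br1", 100, 600))
```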
Vowel Endings & Ending Breaths
Vowel Endings and Ending Breaths are rather like VC samples that either end the vowel naturally or add a trailed-out breath to the end for expressive effect. These should be otoed with the Top Of Note/Preutterance right on the end of the voiced vowel and the Consonant covering the entire trailed-out breath, with white space between the Consonant and Cutoff. The Overlap should be placed around halfway between the Preutterance and Offset. My preference is around 90/60 for Preutterance/Overlap, depending on sample length, and I would personally go no larger than 200/100.
English R and L
These are to be treated as “Soft” consonants.