Moene's Main Site

Game audio, retro synth, machine learning, and random thoughts

ROFLcopter on Kokoro

Due to how Microsoft Sam was implemented to run on low-spec hardware, it had quite a few funny quirks in its synthesis. Today I'd like to experiment with weird and funny speech synthesis phrases using Kokoro, a TTS model that runs via ONNX Runtime. This is purely experimental; I simply wanted to see how I could break an AI TTS model.

So here’s my code and some absolutely weird results:

Model used:

hexgrad. (2025). *misaki: G2P*. GitHub. https://github.com/hexgrad/misaki

onnx-community. (2025). *Kokoro-82M-v1.0-ONNX*. Hugging Face. https://huggingface.co/onnx-community/Kokoro-82M-v1.0-ONNX

Results

ROFLcopter

This is a classic one!

My ROFLCopter goes soi soi soi soi soi soi soi soi soi soi soi soi soi

From Kokoro:

My Steamroller

My steamroller goes lol lol lol lol lol lol lol lol lol lol lol lol lol lol

From Kokoro:

However, this time there was no audio at all for the "lol" part. I realised I needed to provide a fallback phoneme source for Misaki, in this case espeak-ng.
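Conceptually, the fallback works something like this. Here is a toy sketch, not Misaki's actual implementation: both lexicon dicts below are made up for illustration, with a plain dict standing in for espeak-ng. When the primary lexicon has no entry for a word, the phonemizer consults the fallback instead of returning nothing.

```python
# Toy phonemizer with a fallback, mimicking the Misaki + espeak-ng setup
# conceptually. Neither dict reflects real Misaki or espeak-ng output.
primary_lexicon = {"my": "mˈaɪ", "goes": "ɡˈoʊz"}
fallback_lexicon = {"lol": "lˈɑl", "soi": "sˈɔɪ"}

def phonemize(word, fallback=None):
    if word in primary_lexicon:
        return primary_lexicon[word]
    if fallback is not None and word in fallback:
        return fallback[word]
    return ""  # no phonemes for this word -> no audio for it

print(phonemize("lol"))                             # empty without a fallback
print(phonemize("lol", fallback=fallback_lexicon))  # resolved by the fallback
```

With no fallback, "lol" simply yields an empty phoneme string, which matches the silent output I got above.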

Kokoro with espeak-ng:

Now that's more like it! Let's try the ROFLcopter with the espeak-ng fallback:

Now we are talking like SAM!

Why

Looking at Kokoro's code, there are two components that can get confused: Misaki and Kokoro itself. Misaki is the tokenizer; it chokes on words it cannot phonemize, which the espeak-ng fallback fixes. Kokoro, meanwhile, can be confused by uncommon token sequences.
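The tokenizer side of this is easy to see in the `phonemes_to_tokens` function below: any character with no vocab entry is silently dropped. Here is a minimal sketch of that behaviour using a tiny made-up subset of the vocab (the IDs match the full table in the code, but the vocab itself is truncated for illustration):

```python
# Tiny subset of Kokoro's phoneme vocab, just to show the lookup behaviour.
toy_vocab = {"l": 54, "ɑ": 69, "ˈ": 156, " ": 16}

def to_tokens(phonemes):
    # Characters missing from the vocab are silently skipped, which is
    # how an un-phonemizable word ends up producing no audio at all.
    return [toy_vocab[c] for c in phonemes if c in toy_vocab]

print(to_tokens("lˈɑl"))  # known phonemes map to token IDs
print(to_tokens("✗✗✗"))   # unknown characters vanish: empty token list
```

An empty token list means there is nothing for the model to synthesize, which is exactly the silent "lol" failure above.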

Code used in this experiment

(converted from a Jupyter Notebook)

import IPython
import os
import numpy as np
from onnxruntime import InferenceSession
from misaki import en, espeak
import scipy.io.wavfile as wavfile
def phonemes_to_tokens(phonemes):
    vocab = {
        ";": 1,
        ":": 2,
        ",": 3,
        ".": 4,
        "!": 5,
        "?": 6,
        "—": 9,
        "…": 10,
        "\"": 11,
        "(": 12,
        ")": 13,
        "“": 14,
        "”": 15,
        " ": 16,
        "\u0303": 17,
        "ʣ": 18,
        "ʥ": 19,
        "ʦ": 20,
        "ʨ": 21,
        "ᵝ": 22,
        "\uAB67": 23,
        "A": 24,
        "I": 25,
        "O": 31,
        "Q": 33,
        "S": 35,
        "T": 36,
        "W": 39,
        "Y": 41,
        "ᵊ": 42,
        "a": 43,
        "b": 44,
        "c": 45,
        "d": 46,
        "e": 47,
        "f": 48,
        "h": 50,
        "i": 51,
        "j": 52,
        "k": 53,
        "l": 54,
        "m": 55,
        "n": 56,
        "o": 57,
        "p": 58,
        "q": 59,
        "r": 60,
        "s": 61,
        "t": 62,
        "u": 63,
        "v": 64,
        "w": 65,
        "x": 66,
        "y": 67,
        "z": 68,
        "ɑ": 69,
        "ɐ": 70,
        "ɒ": 71,
        "æ": 72,
        "β": 75,
        "ɔ": 76,
        "ɕ": 77,
        "ç": 78,
        "ɖ": 80,
        "ð": 81,
        "ʤ": 82,
        "ə": 83,
        "ɚ": 85,
        "ɛ": 86,
        "ɜ": 87,
        "ɟ": 90,
        "ɡ": 92,
        "ɥ": 99,
        "ɨ": 101,
        "ɪ": 102,
        "ʝ": 103,
        "ɯ": 110,
        "ɰ": 111,
        "ŋ": 112,
        "ɳ": 113,
        "ɲ": 114,
        "ɴ": 115,
        "ø": 116,
        "ɸ": 118,
        "θ": 119,
        "œ": 120,
        "ɹ": 123,
        "ɾ": 125,
        "ɻ": 126,
        "ʁ": 128,
        "ɽ": 129,
        "ʂ": 130,
        "ʃ": 131,
        "ʈ": 132,
        "ʧ": 133,
        "ʊ": 135,
        "ʋ": 136,
        "ʌ": 138,
        "ɣ": 139,
        "ɤ": 140,
        "χ": 142,
        "ʎ": 143,
        "ʒ": 147,
        "ʔ": 148,
        "ˈ": 156,
        "ˌ": 157,
        "ː": 158,
        "ʰ": 162,
        "ʲ": 164,
        "↓": 169,
        "→": 171,
        "↗": 172,
        "↘": 173,
        "ᵻ": 177
    }

    tokens = [vocab[char] for char in phonemes if char in vocab]
    return tokens
fallback = espeak.EspeakFallback(british=False) # en-us
g2p = en.G2P(trf=False, british=False, fallback=fallback)
def tttk(s):
    phonemes, tokens = g2p(s)
    return phonemes_to_tokens(phonemes)
tokens = tttk("My ROFLCopter goes soi soi soi soi soi soi soi soi soi soi soi soi soi")

# Context length is 512, but leave room for the pad token 0 at the start & end
assert len(tokens) <= 510, len(tokens)

# Style vector based on len(tokens), ref_s has shape (1, 256)
voices = np.fromfile('./voices/af.bin', dtype=np.float32).reshape(-1, 1, 256)
ref_s = voices[len(tokens)]
tokens = [[0, *tokens, 0]]

model_name = 'model.onnx' # Options: model.onnx, model_fp16.onnx, model_quantized.onnx, model_q8f16.onnx, model_uint8.onnx, model_uint8f16.onnx, model_q4.onnx, model_q4f16.onnx
sess = InferenceSession(os.path.join('onnx', model_name))
audio = sess.run(None, dict(
    input_ids=tokens,
    style=ref_s,
    speed=np.ones(1, dtype=np.float32),
))[0]
wavfile.write('audio.wav', 24000, audio[0])
IPython.display.Audio('audio.wav')

