Story Mode: The audio is generated from a sequence of text prompts supplied one after another. Each new prompt influences how the model continues the semantic tokens derived from the previous caption.
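The continuation loop described for Story Mode can be sketched roughly as follows. This is a minimal illustration, not the actual model: `toy_continue` is a hypothetical stand-in for a text-conditioned semantic-token generator, and the vocabulary size, segment lengths, and prompts are all assumed values chosen for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Signature of a (hypothetical) continuation function:
# (prompt, tokens generated so far, number of new tokens) -> new tokens
TokenFn = Callable[[str, Sequence[int], int], List[int]]

VOCAB_SIZE = 1024  # assumed size of the semantic-token vocabulary


def toy_continue(prompt: str, prefix: Sequence[int], n_tokens: int) -> List[int]:
    """Hypothetical stand-in for a real model: returns pseudo-random tokens
    that depend on both the current prompt and the existing prefix."""
    last = prefix[-1] if prefix else -1
    rng = random.Random(f"{prompt}|{last}|{len(prefix)}")
    return [rng.randrange(VOCAB_SIZE) for _ in range(n_tokens)]


def story_mode(
    schedule: Sequence[Tuple[str, int]],
    continue_fn: TokenFn = toy_continue,
) -> List[int]:
    """Generate one token stream from a sequence of (prompt, length) pairs.

    Each segment is conditioned on its own prompt but continues the tokens
    produced under the previous prompts, so the stream is never restarted.
    """
    tokens: List[int] = []
    for prompt, n_tokens in schedule:
        new_tokens = continue_fn(prompt, tokens, n_tokens)
        tokens.extend(new_tokens)
    return tokens


# Illustrative prompts and segment lengths (assumptions, not from the source).
schedule = [("calm piano", 50), ("upbeat drums", 50), ("full orchestra", 75)]
tokens = story_mode(schedule)
```

The key design point is that the prompt changes between segments while the token history carries over, which is what lets the music transition rather than restart at each prompt.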
Text and Melody Conditioning: By adding melody embeddings to the conditioning, we can generate music that respects the text prompt while following the provided melody.
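One plausible reading of "adding melody embeddings to the conditioning" is that discretized melody tokens are concatenated with the text-derived conditioning tokens to form a single conditioning prefix. The sketch below assumes that reading; the function name, the vocabulary size, and the id-offset scheme are hypothetical and only illustrate the idea.

```python
from typing import List, Sequence

TEXT_VOCAB_SIZE = 1024  # assumed vocabulary size for text-conditioning tokens


def build_conditioning(
    text_tokens: Sequence[int],
    melody_tokens: Sequence[int] = (),
) -> List[int]:
    """Build a conditioning prefix from text tokens and optional melody tokens.

    Melody ids are shifted past the text vocabulary so that the two token
    types cannot collide in a shared embedding table (an assumed convention).
    """
    shifted_melody = [TEXT_VOCAB_SIZE + m for m in melody_tokens]
    return list(text_tokens) + shifted_melody


# Text-only conditioning versus text plus melody conditioning.
text_only = build_conditioning([5, 17, 900])
text_and_melody = build_conditioning([5, 17, 900], melody_tokens=[3, 3, 8])
```

With no melody tokens the prefix reduces to the text-only case, so the same generator can serve both conditioning modes.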