Supported domain / task: audio / ttsv2 (speech synthesis).
CosyVoice speech synthesis is built on Tongyi Lab's generative speech model (CosyVoice). Backed by a large-scale pretrained language model, it deeply integrates text understanding with speech generation, so it can accurately parse and interpret all kinds of text and render it as natural, human-like speech. It supports streaming input and streaming output for text-to-speech.
Besides the traditional interaction of "submit a text, receive the audio (optionally streamed)", CosyVoice also provides a fully streaming mode of "stream text in, stream audio out", which can synthesize text in real time as it is generated by an LLM.
Prerequisites
You have activated the service and obtained an API key: see Obtain an API-KEY.
You have installed the latest SDK: see Install DashScope SDK.
Synchronous invocation
Submit a single speech synthesis task without registering a callback (no streaming of intermediate results); the complete result is returned all at once when synthesis finishes.
Request example
The following example shows how to use the synchronous interface to call the CosyVoice speech model with the voice longxiaochun (龍小淳), synthesize the text "今天天氣怎么樣" into MP3 audio at a 22050 Hz sample rate, and save it to a file named output.mp3.
Replace your-dashscope-api-key in the example with your own API key before running the code.
The synchronous interface blocks the current thread until synthesis completes or an error occurs.
# coding=utf-8
import dashscope
from dashscope.audio.tts_v2 import *

# Replace your-dashscope-api-key with your own API key.
dashscope.api_key = "your-dashscope-api-key"
model = "cosyvoice-v1"
voice = "longxiaochun"

synthesizer = SpeechSynthesizer(model=model, voice=voice)
audio = synthesizer.call("今天天氣怎么樣?")
print('requestId: ', synthesizer.get_last_request_id())
with open('output.mp3', 'wb') as f:
    f.write(audio)
package SpeechSynthesisDemo;

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

public class Tts2File {
    /**
     * Replace your-dashscope-api-key with your own API key.
     */
    private static String apikey = "your-dashscope-api-key";
    private static String model = "cosyvoice-v1";
    private static String voice = "longxiaochun";

    public static void synthesizeToFile() {
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        .apiKey(apikey)
                        .model(model)
                        .voice(voice)
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        ByteBuffer audio = synthesizer.call("今天天氣怎么樣?");
        File file = new File("output.mp3");
        System.out.print("requestId: " + synthesizer.getLastRequestId());
        try (FileOutputStream fos = new FileOutputStream(file)) {
            fos.write(audio.array());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        synthesizeToFile();
        System.exit(0);
    }
}
Request parameter description
Parameter | Type | Required | Default | Description |
model | string | Yes | None | The model to use for synthesis (set to cosyvoice-v1). |
voice | string | Yes | None | The voice (timbre) to use for synthesis. For more information, see the voice list. |
text | string | Yes | None | The text to synthesize. |
format | AudioFormat | No | The default sample rate and audio format of the selected voice, as given in the model list. | The encoding format of the synthesized audio (see the AudioFormat enum for the supported formats). |
volume | int | No | 50 | Volume of the synthesized audio, in the range 0~100. |
speech_rate | double | No | 1.0 | Speech rate of the synthesized audio, in the range 0.5~2. |
pitch_rate | double | No | 1.0 | Pitch of the synthesized audio, in the range 0.5~2. |
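The official examples below only set model and voice. As a sketch of how the optional parameters in this table might be supplied through the Java SDK builder: the volume, speechRate, and pitchRate builder methods are assumptions inferred from the parameter names above and are not confirmed by this document; only apiKey, model, voice, and format appear in the examples.

SpeechSynthesisParam param =
        SpeechSynthesisParam.builder()
                .apiKey("your-dashscope-api-key")
                .model("cosyvoice-v1")
                .voice("longxiaochun")
                // Encoding format; this enum constant also appears in the streaming example below.
                .format(SpeechSynthesisAudioFormat.PCM_22050HZ_MONO_16BIT)
                .volume(50)       // volume, 0~100, default 50 (assumed builder method)
                .speechRate(1.0f) // speech rate, 0.5~2, default 1.0 (assumed builder method)
                .pitchRate(1.0f)  // pitch, 0.5~2, default 1.0 (assumed builder method)
                .build();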
Response description
The response is the synthesized binary audio data.
Interface details
"""
Speech synthesis.
If callback is set, the audio will be returned in real-time through the on_event interface.
Otherwise, this function blocks until all audio is received and then returns the complete audio data.
Parameters:
-----------
text: str
utf-8 encoded text
return: bytes
If a callback is not set during initialization, the complete audio is returned as the function's return value. Otherwise, the return value is null.
"""
def call(self, text:str):
/**
 * Speech synthesis.<br>
 * If a callback is set, the audio will be returned in real time through the onEvent interface.<br>
 * Otherwise, this function blocks until all audio is received and then returns the complete audio data.
 *
 * @param text utf-8 encoded text
 * @return If a callback was not set during initialization, the complete audio is returned as the
 *     function's return value. Otherwise, the return value is null.
 */
public ByteBuffer call(String text)
Asynchronous invocation
Submit a single speech synthesis task and stream the intermediate results; the synthesis result is delivered incrementally through the callback functions of a ResultCallback.
Invocation example
The following example shows how to use the asynchronous interface to call the CosyVoice speech model with the voice longxiaochun (longxiaochun) and synthesize the text "今天天氣怎么樣" into MP3 audio at a 22050 Hz sample rate.
Replace your-dashscope-api-key in the example with your own API key before running the code. The asynchronous interface does not block the current thread; listen for the onComplete event to know when all audio has been received.
# coding=utf-8
import dashscope
from dashscope.audio.tts_v2 import *

# Replace your-dashscope-api-key with your own API key.
dashscope.api_key = "your-dashscope-api-key"
model = "cosyvoice-v1"
voice = "longxiaochun"


class Callback(ResultCallback):
    def on_open(self):
        self.file = open("output.mp3", "wb")
        print("websocket is open.")

    def on_complete(self):
        print("speech synthesis task completed successfully.")

    def on_error(self, message: str):
        print(f"speech synthesis task failed, {message}")

    def on_close(self):
        print("websocket is closed.")
        self.file.close()

    def on_event(self, message):
        print(f"recv speech synthesis message {message}")

    def on_data(self, data: bytes) -> None:
        print("audio result length:", len(data))
        self.file.write(data)


callback = Callback()

synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    callback=callback,
)

synthesizer.call("今天天氣怎么樣?")
print('requestId: ', synthesizer.get_last_request_id())
package com.alibaba.dashscope;

import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import java.util.concurrent.CountDownLatch;

public class StreamInputTtsPlayableDemo {
    /**
     * Replace your-dashscope-api-key with your own API key.
     */
    private static String apikey = "your-dashscope-api-key";
    private static String model = "cosyvoice-v1";
    private static String voice = "longxiaochun";

    public static void streamAudioDataToSpeaker() {
        CountDownLatch latch = new CountDownLatch(1);
        // Configure the callbacks.
        ResultCallback<SpeechSynthesisResult> callback =
                new ResultCallback<SpeechSynthesisResult>() {
                    @Override
                    public void onEvent(SpeechSynthesisResult result) {
                        System.out.println("Received message: " + result);
                        if (result.getAudioFrame() != null) {
                            // TODO: process the audio frame
                            System.out.println("Received audio");
                        }
                    }

                    @Override
                    public void onComplete() {
                        System.out.println("Received Complete");
                        latch.countDown();
                    }

                    @Override
                    public void onError(Exception e) {
                        System.out.println("Received error: " + e.toString());
                        latch.countDown();
                    }
                };
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        .apiKey(apikey)
                        .model(model)
                        .voice(voice)
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        // With a callback set, the call method does not block the current thread.
        synthesizer.call("今天天氣怎么樣?");
        System.out.print("requestId: " + synthesizer.getLastRequestId());
        // Wait for synthesis to complete.
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}
Request parameter description
The parameters are the same as for synchronous invocation. When a callback is set at initialization, the call function becomes an asynchronous interface and immediately returns null; the audio is delivered in real time through the callback. For details, see the request parameter description above.
Response description
The data is carried by the SpeechSynthesisResult object passed to the on_event callback, which provides the following member method for retrieving it:
Member method | Signature | Description |
getAudioFrame | ByteBuffer getAudioFrame() | Returns one incremental binary audio frame of the streaming synthesis; may be null. |
The call function itself returns no data.
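The examples above leave the actual frame handling as a TODO. One possible way to consume the frames, as a minimal sketch in the style of the asynchronous example above: the file-writing logic is illustrative glue, not SDK API, and additionally needs java.io.FileOutputStream, java.io.IOException, and java.nio.ByteBuffer imports.

// Sketch: a callback that appends every streamed frame to output.mp3.
// The enclosing method must handle the FileNotFoundException from FileOutputStream.
FileOutputStream fos = new FileOutputStream("output.mp3");
ResultCallback<SpeechSynthesisResult> fileWriter =
        new ResultCallback<SpeechSynthesisResult>() {
            @Override
            public void onEvent(SpeechSynthesisResult result) {
                ByteBuffer frame = result.getAudioFrame();
                if (frame != null) {
                    byte[] bytes = new byte[frame.remaining()];
                    frame.get(bytes); // copy the frame into a plain byte array
                    try {
                        fos.write(bytes);
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }
            }

            @Override
            public void onComplete() {
                try {
                    fos.close(); // all audio received; flush and close the file
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            }

            @Override
            public void onError(Exception e) {
                try {
                    fos.close();
                } catch (IOException ignored) {
                }
            }
        };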
Streaming input invocation
Invocation example
Submit text in multiple pieces within a single speech synthesis task and receive the synthesis result in real time through the callbacks.
The following example shows how to use the streaming interface to call the CosyVoice speech synthesis model with the voice longxiaochun, send the text in several pieces, synthesize PCM audio at a 22050 Hz sample rate, and play it in real time with an audio player.
# coding=utf-8
#
# Installation instructions for pyaudio:
# APPLE Mac OS X
#   brew install portaudio
#   pip install pyaudio
# Debian/Ubuntu
#   sudo apt-get install python-pyaudio python3-pyaudio
#   or
#   pip install pyaudio
# CentOS
#   sudo yum install -y portaudio portaudio-devel && pip install pyaudio
# Microsoft Windows
#   python -m pip install pyaudio

import time
import pyaudio
import dashscope
from dashscope.audio.tts_v2 import *

# Replace your-dashscope-api-key with your own API key.
dashscope.api_key = "your-dashscope-api-key"
model = "cosyvoice-v1"
voice = "longxiaochun"


class Callback(ResultCallback):
    _player = None
    _stream = None

    def on_open(self):
        print("websocket is open.")
        self._player = pyaudio.PyAudio()
        self._stream = self._player.open(
            format=pyaudio.paInt16, channels=1, rate=22050, output=True
        )

    def on_complete(self):
        print("speech synthesis task completed successfully.")

    def on_error(self, message: str):
        print(f"speech synthesis task failed, {message}")

    def on_close(self):
        print("websocket is closed.")
        # Stop the player.
        self._stream.stop_stream()
        self._stream.close()
        self._player.terminate()

    def on_event(self, message):
        print(f"recv speech synthesis message {message}")

    def on_data(self, data: bytes) -> None:
        print("audio result length:", len(data))
        self._stream.write(data)


callback = Callback()

test_text = [
    "流式文本語音合成SDK,",
    "可以將輸入的文本",
    "合成為語音二進制數據,",
    "相比于非流式語音合成,",
    "流式合成的優勢在于實時性",
    "更強。用戶在輸入文本的同時",
    "可以聽到接近同步的語音輸出,",
    "極大地提升了交互體驗,",
    "減少了用戶等待時間。",
    "適用于調用大規模",
    "語言模型(LLM),以",
    "流式輸入文本的方式",
    "進行語音合成的場景。",
]

synthesizer = SpeechSynthesizer(
    model=model,
    voice=voice,
    format=AudioFormat.PCM_22050HZ_MONO_16BIT,
    callback=callback,
)

for text in test_text:
    synthesizer.streaming_call(text)
    time.sleep(0.5)
synthesizer.streaming_complete()
print('requestId: ', synthesizer.get_last_request_id())
package com.alibaba.dashscope;

import com.alibaba.dashscope.audio.tts.SpeechSynthesisResult;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisAudioFormat;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.common.ResultCallback;
import java.util.concurrent.CountDownLatch;

public class StreamInputTtsPlayableDemo {
    private static String[] textArray = {"流式文本語音合成SDK,",
            "可以將輸入的文本", "合成為語音二進制數據,", "相比于非流式語音合成,",
            "流式合成的優勢在于實時性", "更強。用戶在輸入文本的同時",
            "可以聽到接近同步的語音輸出,", "極大地提升了交互體驗,",
            "減少了用戶等待時間。", "適用于調用大規模", "語言模型(LLM),以",
            "流式輸入文本的方式", "進行語音合成的場景。"};
    /**
     * Replace your-dashscope-api-key with your own API key.
     */
    private static String apikey = "your-dashscope-api-key";
    private static String model = "cosyvoice-v1";
    private static String voice = "longxiaochun";

    public static void streamAudioDataToSpeaker() {
        CountDownLatch latch = new CountDownLatch(1);
        // Configure the callbacks.
        ResultCallback<SpeechSynthesisResult> callback =
                new ResultCallback<SpeechSynthesisResult>() {
                    @Override
                    public void onEvent(SpeechSynthesisResult result) {
                        System.out.println("Received message: " + result);
                        if (result.getAudioFrame() != null) {
                            // TODO: process the audio frame
                            System.out.println("Received audio");
                        }
                    }

                    @Override
                    public void onComplete() {
                        System.out.println("Received Complete");
                        latch.countDown();
                    }

                    @Override
                    public void onError(Exception e) {
                        System.out.println("Received error: " + e.toString());
                        latch.countDown();
                    }
                };
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        .apiKey(apikey)
                        .model(model)
                        .voice(voice)
                        .format(SpeechSynthesisAudioFormat
                                .PCM_22050HZ_MONO_16BIT) // use PCM or MP3 for streaming synthesis
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, callback);
        // streamingCall does not block the current thread.
        for (String text : textArray) {
            synthesizer.streamingCall(text);
        }
        synthesizer.streamingComplete();
        System.out.print("requestId: " + synthesizer.getLastRequestId());
        // Wait for synthesis to complete.
        try {
            latch.await();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}
Interface details
Send text
""" Streaming input mode: You can call the stream_call function multiple times to send text. A session will be created on the first call. The session ends after calling streaming_complete. Parameters: ----------- text: str utf-8 encoded text """ def streaming_call(self, String text):
/**
 * Streaming input mode: you can call the streamingCall function multiple times to send text.
 * A session is created on the first call, and the session ends after streamingComplete is called.
 *
 * @param text utf-8 encoded text
 */
public void streamingCall(String text)
Finish the task stream synchronously
""" Synchronously stop the streaming input speech synthesis task. Wait for all remaining synthesized audio before returning Parameters: ----------- complete_timeout_millis: int Throws TimeoutError exception if it times out. """ def streaming_complete(self, complete_timeout_millis=10000):
/**
 * Synchronously stop the streaming-input speech synthesis task; waits for all remaining
 * synthesized audio before returning. If it does not complete within 10 seconds, a timeout
 * occurs and a TimeoutError exception is thrown.
 */
public void streamingComplete()

/**
 * Synchronously stop the streaming-input speech synthesis task; waits for all remaining
 * synthesized audio before returning.
 *
 * @param completeTimeoutMillis The timeout for the wait. Throws a TimeoutError exception if it times out.
 */
public void streamingComplete(long completeTimeoutMillis)
Finish the task stream asynchronously
""" Asynchronously stop the streaming input speech synthesis task, returns immediately. You need to listen and handle the STREAM_INPUT_TTS_EVENT_SYNTHESIS_COMPLETE event in the on_event callback. Do not destroy the object and callback before this event. """ def async_streaming_complete(self):
/**
 * Asynchronously stop the streaming-input speech synthesis task; returns immediately.
 * You need to listen for and handle the STREAM_INPUT_TTS_EVENT_SYNTHESIS_COMPLETE event
 * in the onEvent callback. Do not destroy the synthesizer or the callback before this event.
 */
public void asyncStreamingComplete()
Cancel the current task
""" Immediately terminate the streaming input speech synthesis task and discard any remaining audio that is not yet delivered. """ def streaming_cancel(self):
/**
 * Immediately terminate the streaming-input speech synthesis task and discard any
 * remaining audio that has not yet been delivered.
 */
public void streamingCancel()
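To show where streamingCancel fits, here is a minimal sketch that aborts an in-flight streaming task when sending fails partway. It uses only the streamingCall, streamingCancel, and streamingComplete methods documented above, and assumes synthesizer and textArray are set up as in the streaming example.

// Sketch: cancel an in-flight streaming task if sending is interrupted.
try {
    for (String text : textArray) {
        synthesizer.streamingCall(text);
    }
    synthesizer.streamingComplete();
} catch (RuntimeException e) {
    // Abort the task and discard any audio not yet delivered, instead of draining it.
    synthesizer.streamingCancel();
    throw e;
}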
Invocation via Flowable
The Java SDK additionally provides Flowable-based streaming invocation for speech synthesis. After the Flowable's onComplete() fires, the complete result can be retrieved via the SpeechSynthesizer object's getAudioData().
Non-streaming input invocation example
The following example uses the Flowable's blockingForEach interface to obtain, in a blocking manner, each streamed SpeechSynthesisResult message.
package com.alibaba.dashscope;

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.exception.NoApiKeyException;

public class StreamInputTtsPlayableDemo {
    /**
     * Replace your-dashscope-api-key with your own API key.
     */
    private static String apikey = "your-dashscope-api-key";
    private static String model = "cosyvoice-v1";
    private static String voice = "longxiaochun";

    public static void streamAudioDataToSpeaker() throws NoApiKeyException {
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        .apiKey(apikey)
                        .model(model)
                        .voice(voice)
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        synthesizer.callAsFlowable("今天天氣怎么樣?").blockingForEach(result -> {
            System.out.println("Received message: " + result);
            if (result.getAudioFrame() != null) {
                // TODO: process the audio frame
                System.out.println("Received audio");
            }
        });
    }

    public static void main(String[] args) throws NoApiKeyException {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}
Interface details
/**
* Stream output speech synthesis using Flowable features (non-streaming input)
* @param text Text to be synthesized
* @return The output event stream, including real-time audio
* @throws ApiException
* @throws NoApiKeyException
*/
public Flowable<SpeechSynthesisResult> callAsFlowable(String text)
throws ApiException, NoApiKeyException
Streaming input invocation example
The following example passes the text stream in as a Flowable input parameter and, on the returned Flowable, uses the blockingForEach interface to obtain each streamed SpeechSynthesisResult message in a blocking manner.
package com.alibaba.dashscope;

import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesisParam;
import com.alibaba.dashscope.audio.ttsv2.SpeechSynthesizer;
import com.alibaba.dashscope.exception.NoApiKeyException;
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;

public class StreamInputTtsPlayableDemo {
    private static String[] textArray = {"流式文本語音合成SDK,",
            "可以將輸入的文本", "合成為語音二進制數據,", "相比于非流式語音合成,",
            "流式合成的優勢在于實時性", "更強。用戶在輸入文本的同時",
            "可以聽到接近同步的語音輸出,", "極大地提升了交互體驗,",
            "減少了用戶等待時間。", "適用于調用大規模", "語言模型(LLM),以",
            "流式輸入文本的方式", "進行語音合成的場景。"};
    /**
     * Replace your-dashscope-api-key with your own API key.
     */
    private static String apikey = "your-dashscope-api-key";
    private static String model = "cosyvoice-v1";
    private static String voice = "longxiaochun";

    public static void streamAudioDataToSpeaker() throws NoApiKeyException {
        // Simulate streaming input.
        Flowable<String> textSource = Flowable.create(emitter -> {
            new Thread(() -> {
                for (int i = 0; i < textArray.length; i++) {
                    emitter.onNext(textArray[i]);
                    try {
                        Thread.sleep(1000);
                    } catch (InterruptedException e) {
                        throw new RuntimeException(e);
                    }
                }
                emitter.onComplete();
            }).start();
        }, BackpressureStrategy.BUFFER);
        SpeechSynthesisParam param =
                SpeechSynthesisParam.builder()
                        .apiKey(apikey)
                        .model(model)
                        .voice(voice)
                        .build();
        SpeechSynthesizer synthesizer = new SpeechSynthesizer(param, null);
        synthesizer.streamingCallAsFlowable(textSource).blockingForEach(result -> {
            if (result.getAudioFrame() != null) {
                // TODO: send the audio frame to a player
                System.out.println(
                        "audio result length: " + result.getAudioFrame().capacity());
            }
        });
    }

    public static void main(String[] args) throws NoApiKeyException {
        streamAudioDataToSpeaker();
        System.exit(0);
    }
}
Response description
This interface delivers streaming results mainly through the returned Flowable<SpeechSynthesisResult>. After all streamed data has been returned, the complete synthesis result can also be retrieved through the corresponding SpeechSynthesizer object's getAudioData. For how to use Flowable, see the rxjava API documentation.
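As a sketch of that second path, assuming getAudioData() returns the complete audio as a ByteBuffer, mirroring the return type of call(); the document does not spell out its signature:

// Drain the stream first; frames could also be played here in real time.
synthesizer.callAsFlowable("今天天氣怎么樣?").blockingForEach(result -> { });
// Assumed: getAudioData() returns the full synthesized audio as a ByteBuffer (unverified).
ByteBuffer fullAudio = synthesizer.getAudioData();
try (FileOutputStream fos = new FileOutputStream("output.mp3")) {
    fos.write(fullAudio.array());
} catch (IOException e) {
    throw new RuntimeException(e);
}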
Interface details
/**
* Stream input and output speech synthesis using Flowable features
* @param textStream The text stream to be synthesized
* @return The output event stream, including real-time audio
* @throws ApiException
* @throws NoApiKeyException
*/
public Flowable<SpeechSynthesisResult> streamingCallAsFlowable(
Flowable<String> textStream)