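While streaming audio, the real-time long-text speech synthesis service can also output the position in the audio of each Chinese character or English word, that is, its timestamp. The timestamp feature is also called the character-level phoneme boundary interface. The timing information can be used to drive a virtual human's lip movements, to generate subtitles for video dubbing, and so on.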
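Feature overview
The service splits a long text into sentences and, sentence by sentence, streams the timestamp of the sentence and of each character in it together with the audio. Timestamps are grouped into one block per sentence, and each block returns the timestamps of the characters in that sentence. Timestamps stay in sync with the synthesized audio: a timestamp can be computed accurately only after the corresponding audio has been synthesized.
For example, “阿里巴巴。達摩院。” is one long text that consists of two sentences:
Sentence 1: “阿里巴巴。”
Sentence 2: “達摩院。”
Example timestamp output (for illustration only; it does not mean that each subtitles element carries the audio of exactly one character):
// "sentence":true marks a sentence timestamp; "sentence":false marks a character timestamp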
{"subtitles":[{"begin_index":0,"end_index":1,"begin_time":0,"end_time":0,"phoneme":"null","text":"","sentence":true},{"begin_index":0,"end_index":1,"begin_time":0,"end_time":150,"phoneme":"null","text":"阿","sentence":false}]}
{"subtitles":[{"begin_index":0,"end_index":1,"begin_time":0,"end_time":0,"phoneme":"null","text":"","sentence":true},{"begin_index":0,"end_index":1,"begin_time":0,"end_time":150,"phoneme":"null","text":"阿","sentence":false},{"begin_index":1,"end_index":2,"begin_time":150,"end_time":325,"phoneme":"null","text":"里","sentence":false}]}
{"subtitles":[{"begin_index":0,"end_index":1,"begin_time":0,"end_time":0,"phoneme":"null","text":"","sentence":true},{"begin_index":0,"end_index":1,"begin_time":0,"end_time":150,"phoneme":"null","text":"阿","sentence":false},{"begin_index":1,"end_index":2,"begin_time":150,"end_time":325,"phoneme":"null","text":"里","sentence":false},{"begin_index":2,"end_index":3,"begin_time":325,"end_time":525,"phoneme":"null","text":"巴","sentence":false}]}
// Once the whole sentence has been synthesized, the sentence timestamp is computed accurately (its end_time is now 850).
{"subtitles":[{"begin_index":0,"end_index":1,"begin_time":0,"end_time":850,"phoneme":"null","text":"","sentence":true},{"begin_index":0,"end_index":1,"begin_time":0,"end_time":150,"phoneme":"null","text":"阿","sentence":false},{"begin_index":1,"end_index":2,"begin_time":150,"end_time":325,"phoneme":"null","text":"里","sentence":false},{"begin_index":2,"end_index":3,"begin_time":325,"end_time":525,"phoneme":"null","text":"巴","sentence":false},{"begin_index":3,"end_index":4,"begin_time":525,"end_time":788,"phoneme":"null","text":"巴","sentence":false}]}
// The begin_index and end_index of a sentence timestamp are the same as those of the first character in that sentence
{"subtitles":[{"begin_index":4,"end_index":5,"begin_time":850,"end_time":850,"phoneme":"null","text":"","sentence":true},{"begin_index":4,"end_index":5,"begin_time":850,"end_time":1025,"phoneme":"null","text":"達","sentence":false}]}
{"subtitles":[{"begin_index":4,"end_index":5,"begin_time":850,"end_time":850,"phoneme":"null","text":"","sentence":true},{"begin_index":4,"end_index":5,"begin_time":850,"end_time":1025,"phoneme":"null","text":"達","sentence":false},{"begin_index":5,"end_index":6,"begin_time":1025,"end_time":1200,"phoneme":"null","text":"摩","sentence":false}]}
{"subtitles":[{"begin_index":4,"end_index":5,"begin_time":850,"end_time":1512,"phoneme":"null","text":"","sentence":true},{"begin_index":4,"end_index":5,"begin_time":850,"end_time":1025,"phoneme":"null","text":"達","sentence":false},{"begin_index":5,"end_index":6,"begin_time":1025,"end_time":1200,"phoneme":"null","text":"摩","sentence":false},{"begin_index":6,"end_index":7,"begin_time":1200,"end_time":1450,"phoneme":"null","text":"院","sentence":false}]}
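As the example shows, each message repeats the characters already reported for the current sentence and appends the newly synthesized ones. A minimal sketch of how a client might keep only the latest snapshot of each sentence is given below. The class SubtitleAccumulator and its Item type are hypothetical helpers written for this illustration, not part of the SDK, and keying sentences by the begin_index of the sentence item is an assumption drawn from the example output. String literals use Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; it is not part of the NLS SDK.
class SubtitleAccumulator {

    // Minimal stand-in for one element of the "subtitles" array.
    static final class Item {
        final int beginIndex, endIndex, beginTime, endTime;
        final String text;
        final boolean sentence;

        Item(int beginIndex, int endIndex, int beginTime, int endTime, String text, boolean sentence) {
            this.beginIndex = beginIndex;
            this.endIndex = endIndex;
            this.beginTime = beginTime;
            this.endTime = endTime;
            this.text = text;
            this.sentence = sentence;
        }
    }

    // Latest snapshot per sentence, keyed by the sentence item's begin_index (assumption).
    // TreeMap keeps the sentences in text order.
    private final Map<Integer, List<Item>> latestBySentence = new TreeMap<>();

    // Stores one message's "subtitles" array, replacing any older snapshot of the same sentence.
    void onSubtitles(List<Item> subtitles) {
        for (Item item : subtitles) {
            if (item.sentence) {
                latestBySentence.put(item.beginIndex, new ArrayList<>(subtitles));
                return;
            }
        }
        // Messages without a sentence item are ignored in this sketch.
    }

    // Character items of all sentences seen so far, in text order and without repeats.
    List<Item> characters() {
        List<Item> out = new ArrayList<>();
        for (List<Item> snapshot : latestBySentence.values()) {
            for (Item item : snapshot) {
                if (!item.sentence) {
                    out.add(item);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SubtitleAccumulator acc = new SubtitleAccumulator();
        // Two cumulative messages for the first sentence, then one for the second.
        acc.onSubtitles(Arrays.asList(
                new Item(0, 1, 0, 0, "", true),
                new Item(0, 1, 0, 150, "\u963f", false)));
        acc.onSubtitles(Arrays.asList(
                new Item(0, 1, 0, 0, "", true),
                new Item(0, 1, 0, 150, "\u963f", false),
                new Item(1, 2, 150, 325, "\u91cc", false)));
        acc.onSubtitles(Arrays.asList(
                new Item(4, 5, 850, 850, "", true),
                new Item(4, 5, 850, 1025, "\u9054", false)));
        for (Item c : acc.characters()) {
            System.out.println(c.beginIndex + ": " + c.beginTime + "-" + c.endTime + " ms");
        }
    }
}
```

With the three messages fed in main, the accumulator holds three character items (indices 0, 1 and 4), because the second message for the first sentence replaces the first one instead of being appended to it.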
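Only voices that support the character-level phoneme boundary interface provide this feature.
The subtitles returned by the TTS service are based on pronunciation, so they cannot be displayed on screen directly; use your original text for display.
For on-screen display, use the returned results to locate the start and end timestamps of each sentence.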
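One way to follow this advice is to take only the time range of each sentence from the service and pair it with the caller's own text. The sketch below is an illustration under stated assumptions: SentenceCues and Cue are hypothetical names, the caller is assumed to have split the original text into the same sentences as the service so that the i-th time range belongs to the i-th sentence, and the SRT-style time format is chosen here only as a familiar subtitle notation. The time ranges are the final sentence timestamps from the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; it is not part of the NLS SDK.
class SentenceCues {

    // One on-screen cue: the caller's own text plus the sentence time range in ms.
    static final class Cue {
        final String text;
        final int beginMs, endMs;

        Cue(String text, int beginMs, int endMs) {
            this.text = text;
            this.beginMs = beginMs;
            this.endMs = endMs;
        }
    }

    // sentenceTimes.get(i) holds {begin_time, end_time} of the i-th item whose "sentence" is true.
    // Assumption: originalSentences is split the same way the service splits the text.
    // Extra entries on either side are dropped.
    static List<Cue> build(List<String> originalSentences, List<int[]> sentenceTimes) {
        int n = Math.min(originalSentences.size(), sentenceTimes.size());
        List<Cue> cues = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            int[] t = sentenceTimes.get(i);
            cues.add(new Cue(originalSentences.get(i), t[0], t[1]));
        }
        return cues;
    }

    // Formats milliseconds as HH:MM:SS,mmm (the notation used by SRT subtitle files).
    static String srtTime(int ms) {
        return String.format(Locale.ROOT, "%02d:%02d:%02d,%03d",
                ms / 3600000, ms / 60000 % 60, ms / 1000 % 60, ms % 1000);
    }

    public static void main(String[] args) {
        // The two sentences of the example text, written as Unicode escapes.
        List<String> sentences = Arrays.asList(
                "\u963f\u91cc\u5df4\u5df4\u3002",
                "\u9054\u6469\u9662\u3002");
        // Final sentence timestamps from the example output: 0-850 ms and 850-1512 ms.
        List<int[]> times = Arrays.asList(new int[] {0, 850}, new int[] {850, 1512});
        for (Cue c : build(sentences, times)) {
            System.out.println(srtTime(c.beginMs) + " --> " + srtTime(c.endMs));
            System.out.println(c.text);
        }
    }
}
```

For the example data the two time lines are 00:00:00,000 --> 00:00:00,850 and 00:00:00,850 --> 00:00:01,512. Use the sentence item from the last message of each sentence, since its end_time is final only after the whole sentence has been synthesized.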
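Parameter settings
Set the request parameter enable_subtitle to true on the client to enable the timestamp feature.
Taking the Java SDK as an example, the parameter is set as follows.
// Whether to enable the subtitle feature (returns the timestamps for the corresponding text). Disabled by default.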
synthesizer.addCustomedParam("enable_subtitle", true);
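Server response
The server returns the subtitle information in the MetaInfo response event.
Parameter | Type | Description |
subtitles | List | Timestamp information (a list of SubtitleItem). |
The format of SubtitleItem is as follows.
Parameter | Type | Description |
text | String | Text. |
begin_time | Integer | Start timestamp of the text in the TTS audio, in ms. |
end_time | Integer | End timestamp of the text in the TTS audio, in ms. |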
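phoneme | String | Printing the phoneme sequence of the character is not supported; the default output is "null", as in the examples in this document. |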
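begin_index | Integer | Start position of the character, counted from 0. In the examples in this document the count continues across sentences and skips punctuation. |
end_index | Integer | End position of the character, counted from 0. |
sentence | Boolean | Sentence timestamp flag: true means the current timestamp is for a sentence. |
Response example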
{
"header":{
"namespace":"SpeechLongSynthesizer",
"name":"MetaInfo",
"status":20000000,
"message_id":"49818960d4ca40d88ebxxxxxxxxxxx",
"task_id":"326f3b9d9cfa47f3a692xxxxxxxxxx",
"status_text":"Gateway:SUCCESS:Success."
},
"payload":{
"subtitles":[
{
"text":"",
"phoneme":"null",
"sentence":true,
"begin_index":0,
"end_index":1,
"begin_time":0,
"end_time":498
},
{
"text":"你",
"phoneme":"null",
"sentence":false,
"begin_index":0,
"end_index":1,
"begin_time":0,
"end_time":118
},
{
"text":"好",
"phoneme":"null",
"sentence":false,
"begin_index":1,
"end_index":2,
"begin_time":118,
"end_time":439
}
]
}
}
{
"header":{
"namespace":"SpeechLongSynthesizer",
"name":"MetaInfo",
"status":20000000,
"message_id":"bb4e791f1dff464e9997xxxxxxxxxxxxxx",
"task_id":"326f3b9d9cfa47f3a6921xxxxxxxxxxxx",
"status_text":"Gateway:SUCCESS:Success."
},
"payload":{
"subtitles":[
{
"text":"",
"phoneme":"null",
"sentence":true,
"begin_index":2,
"end_index":3,
"begin_time":498,
"end_time":1067
},
{
"text":"明",
"phoneme":"null",
"sentence":false,
"begin_index":2,
"end_index":3,
"begin_time":498,
"end_time":687
},
{
"text":"天",
"phoneme":"null",
"sentence":false,
"begin_index":3,
"end_index":4,
"begin_time":687,
"end_time":1008
}
]
}
}
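For lip-sync driving, a client typically needs to know which character is being spoken at the current playback position. The sketch below shows one possible lookup with a binary search over the character items of the two responses above. ActiveCharacter and Item are hypothetical names, and treating each range as half-open, [begin_time, end_time), is an assumption made here so that adjacent characters do not overlap. The items must be sorted by begin_time.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; it is not part of the NLS SDK.
class ActiveCharacter {

    // Character-level timestamp: text plus its time range in ms.
    static final class Item {
        final String text;
        final int beginTime, endTime;

        Item(String text, int beginTime, int endTime) {
            this.text = text;
            this.beginTime = beginTime;
            this.endTime = endTime;
        }
    }

    // Returns the position in items of the character whose [beginTime, endTime) range
    // contains playbackMs, or -1 if no character is being spoken at that time.
    // Assumes items are sorted by beginTime and do not overlap.
    static int indexAt(List<Item> items, int playbackMs) {
        int lo = 0;
        int hi = items.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            Item it = items.get(mid);
            if (playbackMs < it.beginTime) {
                hi = mid - 1;
            } else if (playbackMs >= it.endTime) {
                lo = mid + 1;
            } else {
                return mid;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Character items from the two example responses, written as Unicode escapes.
        List<Item> items = Arrays.asList(
                new Item("\u4f60", 0, 118),
                new Item("\u597d", 118, 439),
                new Item("\u660e", 498, 687),
                new Item("\u5929", 687, 1008));
        System.out.println(indexAt(items, 100)); // 0
        System.out.println(indexAt(items, 450)); // -1, between the two sentences
        System.out.println(indexAt(items, 700)); // 3
    }
}
```

The -1 result at 450 ms reflects the gap between the end of the last character of the first sentence (439 ms) and the start of the second sentence (498 ms); a lip-sync driver could map it to a closed-mouth pose.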
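Character-level timestamp code example
The example uses the public-network URL of the default speech synthesis service built into the SDK. If you use an ECS instance in the Alibaba Cloud Shanghai region and need to access the service over the internal network, set the internal-network URL when creating the NlsClient object: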
client = new NlsClient("ws://nls-gateway.cn-shanghai-internal.aliyuncs.com/ws/v1", accessToken);
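The example saves the synthesized audio to a file. If you need to play the audio and latency matters, use streaming playback, that is, play the audio data as it is received, to reduce delay.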
package com.alibaba.nls.client;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
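import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;
// Assumption: JSONArray used in onMetaInfo is fastjson's class (toJavaList is a fastjson method); the original omits this import.
import com.alibaba.fastjson.JSONArray;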
import com.alibaba.nls.client.protocol.NlsClient;
import com.alibaba.nls.client.protocol.OutputFormatEnum;
import com.alibaba.nls.client.protocol.SampleRateEnum;
import com.alibaba.nls.client.protocol.tts.SpeechSynthesizer;
import com.alibaba.nls.client.protocol.tts.SpeechSynthesizerListener;
import com.alibaba.nls.client.protocol.tts.SpeechSynthesizerResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
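/**
* This example demonstrates:
* Calling the long-text speech synthesis API (setLongText).
* Streaming TTS synthesis.
* Measuring first-packet latency.
*
* Note: this example is not the same as SpeechSynthesizerLongTextDemo under nls-example-tts. Long-text speech synthesis is a separate product feature that sends one long text directly to the server for synthesis,
* whereas SpeechSynthesizerLongTextDemo splits a long text on the caller side and then calls the speech synthesis interface segment by segment.
*/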
public class SpeechLongSynthesizerDemo {
private static final Logger logger = LoggerFactory.getLogger(SpeechLongSynthesizerDemo.class);
private static long startTime;
private String appKey;
NlsClient client;
public SpeechLongSynthesizerDemo(String appKey, String token, String url) {
this.appKey = appKey;
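//Create one NlsClient instance for the whole application; its lifecycle can match the application's. The default service address is the Alibaba Cloud online service address.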
if(url.isEmpty()) {
client = new NlsClient(token);
} else {
client = new NlsClient(url, token);
}
}
private static SpeechSynthesizerListener getSynthesizerListener() {
SpeechSynthesizerListener listener = null;
try {
listener = new SpeechSynthesizerListener() {
File f=new File("ttsForLongText.wav");
FileOutputStream fout = new FileOutputStream(f);
private boolean firstRecvBinary = true;
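//Speech synthesis finished
@Override
public void onComplete(SpeechSynthesizerResponse response) {
// When onComplete is called, all TTS data has been received, so the latency measured here covers the whole synthesis. It may be large and may not suit real-time scenarios.
System.out.println("name: " + response.getName() + ", status: " + response.getStatus()+", output file :"+f.getAbsolutePath());
// Release the file handle now that all audio has been written.
try {
fout.close();
} catch (IOException e) {
e.printStackTrace();
}
}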
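//Binary audio data of the synthesized speech
@Override
public void onMessage(ByteBuffer message) {
try {
if(firstRecvBinary) {
// Measure the latency of the first audio packet here. Playback can start as soon as the first packet arrives, which improves responsiveness (especially in real-time interaction scenarios).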
firstRecvBinary = false;
long now = System.currentTimeMillis();
logger.info("tts first latency : " + (now - SpeechLongSynthesizerDemo.startTime) + " ms");
}
byte[] bytesArray = new byte[message.remaining()];
message.get(bytesArray, 0, bytesArray.length);
//System.out.println("write array:" + bytesArray.length);
fout.write(bytesArray);
} catch (IOException e) {
e.printStackTrace();
}
}
@Override
public void onMetaInfo(SpeechSynthesizerResponse response) {
System.out.println("name: " + response.getName() + ", taskId: " + response.getTaskId());
JSONArray subtitles = (JSONArray)response.getObject("subtitles");
List<Map> subtitleList = subtitles.toJavaList(Map.class);
for (Map word : subtitleList) {
System.out.println("current subtitle: " + word);
}
}
@Override
public void onFail(SpeechSynthesizerResponse response){
// task_id is the unique identifier for communication between the caller and the server; provide it when troubleshooting a problem.
System.out.println(
"task_id: " + response.getTaskId() +
//status code
", status: " + response.getStatus() +
//error message
", status_text: " + response.getStatusText());
}
};
} catch (Exception e) {
e.printStackTrace();
}
return listener;
}
public void process(String text) {
SpeechSynthesizer synthesizer = null;
try {
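//Create the instance and establish the connection.
synthesizer = new SpeechSynthesizer(client, getSynthesizerListener());
synthesizer.setAppKey(appKey);
//Set the encoding format of the returned audio.
synthesizer.setFormat(OutputFormatEnum.WAV);
//Set the sample rate of the returned audio.
synthesizer.setSampleRate(SampleRateEnum.SAMPLE_RATE_16K);
//Voice. Note that the Java SDK cannot call voices for the ultra-HD scenario (for example "zhiqi"); use the RESTful API for those.
synthesizer.setVoice("siyue");
//Pitch, range -500 to 500, optional, default 0.
synthesizer.setPitchRate(0);
//Speech rate, range -500 to 500, default 0.
synthesizer.setSpeechRate(0);
//Enable the timestamp feature so that onMetaInfo receives subtitles (enable_subtitle is off by default).
synthesizer.addCustomedParam("enable_subtitle", true);
//Set the text to synthesize.
// This calls the setLongText interface (the original speech synthesis interface is setText).
synthesizer.setLongText(text);
//This method serializes the settings above to JSON, sends them to the server, and waits for the server's confirmation.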
long start = System.currentTimeMillis();
synthesizer.start();
logger.info("tts start latency " + (System.currentTimeMillis() - start) + " ms");
SpeechLongSynthesizerDemo.startTime = System.currentTimeMillis();
//Wait for speech synthesis to finish
synthesizer.waitForComplete();
logger.info("tts stop latency " + (System.currentTimeMillis() - start) + " ms");
} catch (Exception e) {
e.printStackTrace();
} finally {
//Close the connection
if (null != synthesizer) {
synthesizer.close();
}
}
}
public void shutdown() {
client.shutdown();
}
public static void main(String[] args) throws Exception {
String appKey = "";
String token = "fill in your token";
// Use the default URL
String url = "wss://nls-gateway.cn-shanghai.aliyuncs.com/ws/v1";
if (args.length == 2) {
appKey= args[0];
token = args[1];
} else if (args.length == 3) {
appKey = args[0];
token = args[1];
url = args[2];
} else {
System.err.println("run error, need params(url is optional): " + "<app-key> <token> [url]");
System.exit(-1);
}
String ttsTextLong = "百草堂與三味書屋 魯迅 \n" +
"我家的后面有一個很大的園,相傳叫作百草園。現在是早已并屋子一起賣給朱文公的子孫了,連那最末次的相見也已經隔了七八年,其中似乎確鑿只有一些野草;但那時卻是我的樂園。\n" +
"不必說碧綠的菜畦,光滑的石井欄,高大的皂莢樹,紫紅的桑葚;也不必說鳴蟬在樹葉里長吟,肥胖的黃蜂伏在菜花上,輕捷的叫天子(云雀)忽然從草間直竄向云霄里去了。\n" +
"單是周圍的短短的泥墻根一帶,就有無限趣味。油蛉在這里低唱,蟋蟀們在這里彈琴。翻開斷磚來,有時會遇見蜈蚣;還有斑蝥,倘若用手指按住它的脊梁,便會啪的一聲,\n" +
"從后竅噴出一陣煙霧。何首烏藤和木蓮藤纏絡著,木蓮有蓮房一般的果實,何首烏有臃腫的根。有人說,何首烏根是有像人形的,吃了便可以成仙,我于是常常拔它起來,牽連不斷地拔起來,\n" +
"也曾因此弄壞了泥墻,卻從來沒有見過有一塊根像人樣! 如果不怕刺,還可以摘到覆盆子,像小珊瑚珠攢成的小球,又酸又甜,色味都比桑葚要好得遠......";
SpeechLongSynthesizerDemo demo = new SpeechLongSynthesizerDemo(appKey, token, url);
demo.process(ttsTextLong);
demo.shutdown();
}
}