As one of the pivotal technologies leading towards embodied intelligence, audio-visual segmentation aims at precise segmentation of sounding objects and has broad application prospects in scenarios such as emergency rescue and natural exploration. Nevertheless, its performance is limited by two challenges: the adaptation and fusion of cross-modal information during encoding, and the decoding and generation of masks. To address these issues, this paper explores the adaptation of multi-modal information based on a shared encoder, employing a neural architecture search method to design a hierarchical encoder cooperation module for enhanced information interaction. An intermediate loss is used to help the encoder retain spatial knowledge. Furthermore, an audio-guided class-aware decoder is devised to guide the generation of masks. Our approach yields competitive experimental results across multiple datasets, which supports its effectiveness.
ISBN: (print) 9783030323813; 9783030323806
In this paper, we present a neural model that maps a structured table into document-scale descriptive text. Most existing neural approaches encode a table record by record and generate long summaries with an attentional encoder-decoder model, which leads to two problems: (1) portions of the generated text are incoherent due to the mismatch between a row and its corresponding records; (2) a lot of irrelevant information is described in the generated text due to the incorrect selection of redundant records. Our approach addresses both problems by modeling the row representation as an intermediate structure of the table. In the encoding phase, we first learn record-level representations via a transformer encoder. Afterwards, we obtain each row's representation from the representations of its corresponding records and model row-level dependencies via another transformer encoder. In the decoding phase, we first attend to the row-level representations to find important rows. Then, we attend to specific records to generate text. Experiments were conducted on ROTOWIRE, a dataset that aims at producing a document-scale NBA game summary given a structured table of game statistics. Our approach improves a strong baseline's BLEU score from 14.19 to 15.65 (a relative gain of 10.29%). Furthermore, three extractive evaluation metrics and human evaluation also show that our model is able to select salient records and that the generated game summaries are more accurate.
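The two-stage decoding attention described above (rows first, then records within rows) can be illustrated with a minimal sketch. It is not the authors' implementation: mean pooling stands in for the row-level transformer encoder, dot-product scoring is assumed, and the names `hierarchical_attention`, `mean_pool`, and the toy `table` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_pool(vectors):
    """Assumption: a row is summarized by averaging its record vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hierarchical_attention(query, table):
    """table is a list of rows; each row is a list of record vectors.

    Returns one attention weight per record, grouped by row. Each weight is
    the row-level weight multiplied by the record-level weight inside that
    row, so all weights together sum to 1.
    """
    row_vectors = [mean_pool(row) for row in table]
    row_weights = softmax([dot(query, r) for r in row_vectors])
    weights = []
    for row, row_weight in zip(table, row_weights):
        record_weights = softmax([dot(query, rec) for rec in row])
        weights.append([row_weight * w for w in record_weights])
    return weights


# Toy table: two rows (e.g. two players), two records each.
table = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.0, 1.0], [0.1, 0.9]],
]
weights = hierarchical_attention([2.0, 0.0], table)
```

Because the row weight scales every record in that row, records in a row judged unimportant receive little attention regardless of their individual scores, which is the selection behaviour the abstract attributes to row-level attention.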
For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) plays an important role in producing natural and intelligible speech. Although inter-utterance linguistic information can influence the speech interpretation of the target utterance, previous work on PSP mainly uses intra-utterance linguistic information from the current utterance only. This work proposes to use inter-utterance linguistic information to improve the performance of PSP. Multi-level contextual information, which includes both inter-utterance and intra-utterance linguistic information, is extracted by a hierarchical encoder from the character level, utterance level, and discourse level of the input text. A multi-task learning (MTL) decoder then predicts prosodic boundaries from this multi-level contextual information. Objective evaluation results on two datasets show that our method achieves better F1 scores in predicting prosodic word (PW), prosodic phrase (PPH), and intonational phrase (IPH) boundaries, which demonstrates the effectiveness of using multi-level contextual information for PSP. Subjective preference tests also indicate that the naturalness of the synthesized speech is improved.
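As an illustration of the objective metric mentioned above, the following sketch computes a per-level F1 score for boundary prediction. The labeling scheme (one label per character, with "O" meaning no boundary) and the function name `boundary_f1` are assumptions made for the example; the abstract does not specify how boundaries are encoded.

```python
def boundary_f1(gold, pred, level):
    """F1 for one boundary level (e.g. "PW") over aligned label sequences."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == level and p == level)
    fp = sum(1 for g, p in pairs if g != level and p == level)
    fn = sum(1 for g, p in pairs if g == level and p != level)
    if tp == 0:
        return 0.0  # also avoids division by zero when nothing is matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: one per character of a short utterance.
gold = ["PW", "O", "PPH", "O", "PW", "IPH"]
pred = ["PW", "O", "PW", "O", "PW", "IPH"]

scores = {level: boundary_f1(gold, pred, level) for level in ("PW", "PPH", "IPH")}
```

Reporting the score separately for each level, as the abstract does, shows whether gains come from fine-grained (PW) or coarse-grained (IPH) boundaries.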
ISBN: (print) 9783030868901; 9783030868895
Due to the dynamic evolution of network traffic, open-world traffic classification has become a vital problem. Traditional traffic classification methods have achieved a degree of success but fail at unknown traffic detection because they assume a closed world. Existing techniques for unknown traffic detection suffer from unsatisfactory accuracy and robustness because they are not designed around the hierarchical structure of network flows. Meanwhile, diverse flow patterns within the same attack and similar flow patterns across different attacks produce hard examples, which degrade classification performance. As a solution, we present a Siamese Hierarchical Encoder Network (SHE-Net) for traffic classification in an open-world setting. We introduce a hierarchical encoder mechanism that mines the latent sequential and spatial characteristics of traffic, and adopt a Siamese structure with a newly designed complementary loss function that focuses on mining hard paired examples and speeds up convergence. Together, these two designs learn intra-class compactness and inter-class separateness in the feature space, leaving more space for unknown traffic. Our comprehensive experiments on real-world datasets covering intrusion detection and malware detection indicate that SHE-Net achieves excellent performance and outperforms state-of-the-art methods.
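The idea of pulling same-class pairs together, pushing different-class pairs apart, and rejecting samples that fall far from every known class can be sketched as follows. The abstract does not define the complementary loss, so the standard margin-based contrastive loss is used here as a stand-in; the nearest-centroid rejection rule, the function names, and the toy centroids are likewise assumptions for illustration.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_class, margin=1.0):
    """Standard contrastive loss on a pair of embeddings (stand-in loss).

    Same-class pairs are penalized by their squared distance; different-class
    pairs are penalized only when they are closer than the margin.
    """
    d = euclidean(u, v)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


def classify_open_world(embedding, centroids, threshold):
    """Nearest-centroid label, or "unknown" if no centroid is close enough."""
    label, dist = min(
        ((name, euclidean(embedding, c)) for name, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return label if dist <= threshold else "unknown"


# Hypothetical class centroids in a 2-D embedding space.
centroids = {"benign": [0.0, 0.0], "dos": [5.0, 5.0]}
near_known = classify_open_world([0.2, 0.1], centroids, threshold=1.0)
far_away = classify_open_world([10.0, -10.0], centroids, threshold=1.0)
```

Compact, well-separated classes make the distance threshold meaningful: the tighter each known class is, the more of the feature space is left over for flows that should be labeled unknown.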
ISBN: (print) 9781728140865
In conversation, people often draw upon their rich world knowledge in addition to the dialogue context. Commonsense world facts can facilitate natural language understanding. In this paper, we present a rich knowledge cognition hierarchical (RKC-H) multi-turn dialogue model for the open domain to improve language generation. Given the input, the model selects the corresponding seed-graphs and encodes the seed-graph nodes with a seed-graph attention mechanism. Then, the hierarchical encoder captures multi-granularity text features of the current utterance and the dialogue history. We apply a graph-to-sequence generator to produce the responses and introduce an Exponential Maximum Mutual Information loss function. Automatic and human evaluations show that the proposed model can produce meaningful and coherent multi-turn dialogue. Our model outperforms the baseline.
Lung nodule segmentation is usually considered a 3D semantic segmentation task. Due to the small size, diverse morphology, and low recognition of lung nodules, it is hard to segment any nodule precisely. To solve this...