ISBN (Print): 9781424427796
Word sense acquisition and distinction are key issues for both lexicography and lexical semantic processing. However, it is quite difficult to automatically acquire word senses and to further evaluate the results against lexica, which are likely to embody different findings on word sense distinction and granularity. In this paper, we put forward the idea of measuring word polysemousness and sense granularity at the language level. Two methods, viz. MECBC and THEM, are first employed to extract Chinese word senses from corpora. Automatic mapping of word senses to the lexica and evaluation of the results are devised and realized afterwards. Our experiments show a rather good fit of Chinese word polysemousness between the extracted results and the lexica at the whole-language level. Comparison of sense granularity between different lexical semantic resources can hence be made on a sound basis.
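The language-level comparison the abstract describes can be illustrated with a minimal sketch: average the number of senses per word in an induced sense inventory and in a reference lexicon, then compare the two figures. The toy data below is hypothetical; the paper's MECBC and THEM induction methods are not reproduced here.

```python
# Sketch: language-level polysemousness comparison between an induced
# sense inventory and a reference lexicon. Data is illustrative only.

def avg_polysemy(inventory):
    """Average number of senses per word, given word -> list of senses."""
    return sum(len(senses) for senses in inventory.values()) / len(inventory)

induced = {"da": ["hit", "play", "make"], "shu": ["book"], "xing": ["walk", "OK"]}
lexicon = {"da": ["hit", "play"], "shu": ["book", "letter"], "xing": ["walk", "row", "OK"]}

print(avg_polysemy(induced))  # 2.0
print(avg_polysemy(lexicon))  # ≈ 2.33
```

Comparing such averages over a whole vocabulary, rather than word by word, is what allows sense granularity to be judged at the language level even when individual sense distinctions differ between resources.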
In this paper we suggest that efficient processing of natural language should be considered as a complex procedure, which comprises such stages as knowledge encoding by means of natural language units, its digitalizat...
ISBN (Print): 9798891760615
An important aspect of developing LLMs that interact with humans is aligning the models' behavior to their users. It is possible to prompt an LLM into behaving as a certain persona, especially a user group or ideological persona the model captured during its pretraining stage. However, how best to align an LLM with a specific user, rather than a demographic or ideological group, remains an open question. Mining public opinion surveys (by Pew Research), we find that a user's opinions and their demographics and ideology are not mutual predictors. We use this insight to align LLMs by modeling relevant past user opinions in addition to user demographics and ideology, achieving accuracy gains of up to 7 points in predicting public opinions from survey questions across a broad set of topics. Our work opens up research avenues for treating user opinions as an important ingredient in aligning language models.
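The conditioning strategy the abstract describes, modeling past user opinions alongside demographics and ideology, can be sketched as a prompt-construction step. The prompt template and field names below are illustrative assumptions, not the paper's actual format.

```python
# Sketch: building an LLM prompt that conditions on a user's past survey
# answers in addition to demographics and ideology. Template is hypothetical.

def build_user_prompt(demographics, ideology, past_opinions, question):
    past = "\n".join(f"- Q: {q} A: {a}" for q, a in past_opinions)
    return (
        f"A person with demographics [{demographics}] and ideology "
        f"[{ideology}] previously answered:\n{past}\n"
        f"How would this person answer: {question}"
    )

prompt = build_user_prompt(
    "age 45, rural",
    "moderate",
    [("Should taxes rise?", "No")],
    "Do you support policy X?",
)
print(prompt)
```

The point of the design is that the past-opinion lines carry user-specific signal that, per the abstract's finding, the demographic and ideology fields alone do not predict.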
Several statistical methods have already been proposed to detect and correct real-word errors in context. However, to the best of our knowledge, none of them has yet been applied to the Persian language. In this pap...
ISBN (Print): 9783319234373; 9783319234366
The following work investigates the use of GPGPU technology for natural language processing. Natural language processing involves analysing very large volumes of data with sophisticated algorithms, a process that can only be performed on computers with significant computing power. Parallel computing and utilisation of the processing capacity of graphics cards can help meet these requirements. The work presents the problem of building n-gram models of natural language from a given text. Two algorithms were developed: a sequential one for a typical CPU and a parallel one that uses the capacity of a GPU. The GPU algorithm was implemented using Nvidia CUDA technology. Experiments were carried out to compare the effectiveness of the developed algorithms depending on the size of the analysed text and the number of words in the n-grams. The results showed that a parallel algorithm is better suited to a GPU environment.
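The sequential CPU baseline that the paper's CUDA variant parallelizes amounts to sliding a window of n tokens over the text and counting each tuple. A minimal sketch (not the authors' code):

```python
# Sequential (CPU) n-gram counting: slide a window of n words over the
# token stream and tally each n-gram. This is the baseline a GPU version
# would parallelize by assigning window positions to threads.
from collections import Counter

def ngram_counts(text, n):
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

counts = ngram_counts("the cat sat on the mat the cat ran", 2)
print(counts[("the", "cat")])  # 2
```

Each window position is independent of the others, which is precisely what makes the counting step a good fit for GPU parallelism; only the final tally requires synchronization.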
ISBN (Print): 9781467390057
Social media interactions have become increasingly important in today's world. A survey conducted in 2014 among adult Americans found that a majority of those surveyed use at least one social media site. Twitter, in particular, serves 310 million monthly active users, and thousands of tweets are published every second. The public nature of this data makes it a prime candidate for data mining. Twitter users publish 140-character messages and can geo-tag these tweets using a variety of methods: GPS coordinates, IP geolocation and user-declared location. However, few users disclose their location; only between 1% and 3% of users provide location data, according to our empirical findings. In this article, we aim to aggregate information from different sources to estimate the location of any Twitter user. We use a hybrid approach combining techniques from natural language processing and network theory. Tests were conducted on two datasets, inferring the location of each individual user and then comparing it against the actual known location of users with geolocation information. The estimation error is the distance in kilometers between the estimate and the actual location. We also compare the relative average error per country, to account for differences in country sizes. Our results improve on those reported in the literature. A distinctive feature of our approach is that it is independent of the language used by the user, whereas most works in the literature use just one language or a reduced set of languages. The article also showcases the evolution of our estimation approach and the impact of successive modifications on the results.
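The error metric in this abstract, distance in kilometers between the estimated and actual coordinates, is typically computed as a great-circle distance. A standard haversine implementation (the abstract does not specify the formula the authors used, so this is an assumption):

```python
# Great-circle (haversine) distance in km between two lat/lon points,
# the usual way to score a geolocation estimate against ground truth.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius

# Paris -> London is roughly 340 km
print(round(haversine_km(48.8566, 2.3522, 51.5074, -0.1278)))
```

Averaging this quantity over all users in a country, then normalizing per country, gives the relative average error the abstract compares.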
This paper discusses efficient parameter estimation methods for joint (unconditional) maximum entropy language models such as whole-sentence models. Such models are a sound framework for formalizing arbitrary linguistic knowledge in a consistent manner. It has been shown that general-purpose gradient-based optimization methods are among the most efficient algorithms for estimating parameters of maximum entropy models in several domains of natural language processing. This paper applies gradient methods to whole-sentence language models and other domains whose sample spaces are infinite or practically innumerable and require simulation. It also presents open-source software for easily fitting and testing joint maximum entropy models.
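The gradient-based estimation the abstract refers to rests on a simple identity: the gradient of the log-likelihood of a maximum entropy model is the empirical feature expectation minus the model feature expectation. The sketch below fits one weight by plain gradient ascent on a tiny, fully enumerable sample space; the whole-sentence setting in the paper replaces the exact model expectation with a simulated (sampled) one. This is an illustration, not the paper's software.

```python
# Gradient ascent for a joint maximum entropy model p_w(x) ∝ exp(w·f(x))
# on a small enumerable sample space. Gradient = E_data[f] - E_model[f].
import math

samples = ["aa", "ab", "ba", "bb"]      # full sample space (enumerable here)
data = ["aa", "aa", "ab"]               # observed data
f = lambda x: x.count("a")              # single feature: number of 'a's

w = 0.0
emp = sum(f(x) for x in data) / len(data)        # empirical expectation, 5/3
for _ in range(500):
    scores = [math.exp(w * f(x)) for x in samples]
    Z = sum(scores)                              # partition function
    mod = sum(s / Z * f(x) for s, x in zip(scores, samples))
    w += 0.5 * (emp - mod)                       # ascend the log-likelihood

# w converges toward log(5) ≈ 1.609, where E_model[f] matches E_data[f]
print(round(w, 3))
```

When the sample space is infinite, as for whole sentences, `Z` and `mod` cannot be enumerated; that is where the simulation the abstract mentions comes in, with the rest of the update unchanged.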
Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in textual mathematical problem-solving. However, existing open-source image instruction fine-tuning datasets, containing ...
Achieving consistent high-quality machine translation (MT) across diverse domains remains a significant challenge, primarily due to the limited and imbalanced parallel training data available in various domains. While...
Pretrained language models (PLM) have recently advanced graph-to-text generation, where the input graph is linearized into a sequence and fed into the PLM to obtain its representation. However, efficiently encoding th...