Search results - Inner Mongolia University Library

Variance-reduced gradient estimation via noise-reuse in online evolution strategies

Proceedings of the 37th International Conference on Neural Information Processing Systems

Authors: Oscar Li; James Harrison; Jascha Sohl-Dickstein; Virginia Smith; Luke Metz. Affiliations: Machine Learning Department, School of Computer Science, Carnegie Mellon University; Google DeepMind

Unrolled computation graphs are prevalent throughout machine learning but present challenges to automatic differentiation (AD) gradient estimation methods when their loss functions exhibit extreme local sensitivity, discontinuity, or blackbox characteristics. In such scenarios, online evolution strategies methods are a more capable alternative, while being more parallelizable than vanilla evolution strategies (ES) by interleaving partial unrolls and gradient updates. In this work, we propose a general class of unbiased online evolution strategies methods. We analytically and empirically characterize the variance of this class of gradient estimators and identify the one with the least variance, which we term Noise-Reuse Evolution Strategies (NRES). Experimentally, we show NRES results in faster convergence than existing AD and ES methods in terms of wall-clock time and number of unroll steps across a variety of applications, including learning dynamical systems, meta-training learned optimizers, and reinforcement learning.
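For orientation, the sketch below illustrates the vanilla antithetic ES gradient estimator that the abstract uses as its point of comparison. It is not NRES and is not taken from the paper: the function names `es_gradient` and `quadratic`, the smoothing scale `sigma`, and the toy loss are all assumptions chosen for illustration. It treats the loss as a black box and estimates the gradient of its Gaussian-smoothed version from paired perturbations.

```python
import random

def es_gradient(loss, theta, sigma=0.1, num_pairs=4000, seed=0):
    """Antithetic ES estimate of the gradient of a Gaussian-smoothed loss.

    Hypothetical helper for illustration only; `loss` is any black-box
    function mapping a list of floats to a float.
    """
    rng = random.Random(seed)
    dim = len(theta)
    grad = [0.0] * dim
    for _ in range(num_pairs):
        # Draw one perturbation and evaluate the loss at theta + eps and theta - eps.
        eps = [rng.gauss(0.0, sigma) for _ in range(dim)]
        plus = loss([t + e for t, e in zip(theta, eps)])
        minus = loss([t - e for t, e in zip(theta, eps)])
        # Each antithetic pair contributes (L(+) - L(-)) / (2 * sigma^2) * eps.
        scale = (plus - minus) / (2.0 * sigma ** 2)
        for i in range(dim):
            grad[i] += scale * eps[i] / num_pairs
    return grad

def quadratic(theta):
    """Toy loss whose exact gradient is 2 * theta."""
    return sum(t * t for t in theta)

# The estimate should be close to the exact gradient [2.0, -4.0], up to sampling noise.
estimate = es_gradient(quadratic, [1.0, -2.0])
```

The estimate is noisy for any finite number of pairs; reducing that variance in the online, partially unrolled setting is the problem the abstract describes.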

Three Towers: Flexible Contrastive Learning with Pretrained Image Models

arXiv, 2023

Authors: Kossen, Jannik; Collier, Mark; Mustafa, Basil; Wang, Xiao; Zhai, Xiaohua; Beyer, Lucas; Steiner, Andreas; Berent, Jesse; Jenatton, Rodolphe; Kokiopoulou, Efi. Affiliations: OATML, Department of Computer Science, University of Oxford, United Kingdom; Google Research; Google DeepMind, United Kingdom

We introduce Three Towers (3T), a flexible method to improve the contrastive learning of vision-language models by incorporating pretrained image classifiers. While contrastive models are usually trained from scratch, LiT [85] has recently shown performance gains from using pretrained classifier embeddings. However, LiT directly replaces the image tower with the frozen embeddings, excluding any potential benefits from training the image tower contrastively. With 3T, we propose a more flexible strategy that allows the image tower to benefit from both pretrained embeddings and contrastive training. To achieve this, we introduce a third tower that contains the frozen pretrained embeddings, and we encourage alignment between this third tower and the main image-text towers. Empirically, 3T consistently improves over LiT and the CLIP-style from-scratch baseline for retrieval tasks. For classification, 3T reliably improves over the from-scratch baseline, and while it underperforms relative to LiT for JFT-pretrained models, it outperforms LiT for ImageNet-21k and Places365 pretraining. Copyright © 2023, The Authors. All rights reserved.

Keywords: Towers

Insufficient Statistics Perturbation: Stable Estimators for Private Least Squares

37th Annual Conference on Learning Theory, COLT 2024

Authors: Brown, Gavin; Hayase, Jonathan; Hopkins, Samuel; Kong, Weihao; Liu, Xiyang; Oh, Sewoong; Perdomo, Juan C.; Smith, Adam. Affiliations: Paul G. Allen School of Computer Science and Engineering, University of Washington, United States; Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, United States; Google Research, United States; Harvard University, United States; Department of Computer Science, Boston University, United States

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

arXiv, 2024

Authors: Brown, Bradley; Juravsky, Jordan; Ehrlich, Ryan; Clark, Ronald; Le, Quoc V.; Ré, Christopher; Mirhoseini, Azalia. Affiliations: Department of Computer Science, Stanford University, United States; University of Oxford, United Kingdom; Google DeepMind, United Kingdom

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, we observe that coverage – the fraction of problems that are solved by any generated sample – scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-sample state-of-the-art of 43%. In domains without automatic verifiers, we find that common methods for picking from a sample collection (majority voting and reward models) plateau beyond several hundred samples and fail to fully scale with the sample budget. © 2024, CC BY.
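As a rough illustration of the coverage metric defined in the abstract, the sketch below computes the fraction of problems solved by any sample. It also includes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k). Whether the paper computes coverage this way is an assumption here, since the abstract does not say. The function names and the toy data are hypothetical.

```python
from math import comb

def coverage(results):
    """Fraction of problems for which at least one sample is correct.

    `results` is a list with one entry per problem; each entry is a list of
    booleans, one per generated sample (hypothetical data layout).
    """
    return sum(any(samples) for samples in results) / len(results)

def pass_at_k(n, c, k):
    """Unbiased estimate of the chance that at least one of k samples is
    correct, given that c of n drawn samples were correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy data: 4 problems, 3 samples each; problems 1 and 3 have a correct sample.
toy_results = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, False],
]
toy_coverage = coverage(toy_results)   # 2 of 4 problems solved -> 0.5
single_sample = pass_at_k(10, 3, 1)    # 1 - 7/10 = 0.3
```

Coverage only measures whether a correct sample exists; as the abstract notes, turning it into task performance still requires a way to pick that sample, such as an automatic verifier.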

Keywords: Budget control

Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text

arXiv, 2023

Authors: Caswell, Isaac; Wang, Lisa; Papadimitriou, Isabel. Affiliations: Google Research, United States; Google DeepMind, United Kingdom; Computer Science Department, Stanford University, United States

Data quality is a problem that perpetually resurfaces throughout the field of NLP, regardless of task, domain, or architecture, and remains especially severe for lower-resource languages. A typical and insidious issue, affecting both training data and model output, is data that is repetitive and dominated by linguistically uninteresting boilerplate, such as price catalogs or computer-generated log files. Though this problem permeates many web-scraped corpora, there has yet to be a benchmark to test against, or a systematic study to find simple metrics that generalize across languages and agree with human judgements of data quality. In the present work, we create and release BREAD, a human-labeled benchmark on repetitive boilerplate vs. plausible linguistic content, spanning 360 languages. We release several baseline CRED (Character REDundancy) scores along with it, and evaluate their effectiveness on BREAD. We hope that the community will use this resource to develop better filtering methods, and that our reference implementations of CRED scores can become standard corpus evaluation tools, driving the development of cleaner language modeling corpora, especially in low-resource languages. © 2023, CC BY.

Keywords: Redundancy

Interpretability Illusions in the Generalization of Simplified Models

arXiv, 2023

Authors: Friedman, Dan; Lampinen, Andrew; Dixon, Lucas; Chen, Danqi; Ghandeharioun, Asma. Affiliations: Department of Computer Science, Princeton University, United States; Google DeepMind, United Kingdom; Google Research, United States

A common method to study deep learning systems is to use simplified model representations—for example, using singular value decomposition to visualize the model’s hidden states in a lower dimensional space. This approach assumes that the results of these simplifications are faithful to the original model. Here, we illustrate an important caveat to this assumption: even if the simplified representations can accurately approximate the full model on the training set, they may fail to accurately capture the model’s behavior out of distribution. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits, including the Dyck balanced-parenthesis languages and a code completion task. We simplify these models using tools like dimensionality reduction and clustering, and then explicitly test how these simplified proxies match the behavior of the original model. We find consistent generalization gaps: cases in which the simplified proxies are more faithful to the original model on the in-distribution evaluations and less faithful on various tests of systematic generalization. This includes cases where the original model generalizes systematically but the simplified proxies fail, and cases where the simplified proxies generalize better. Together, our results raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations. © 2023, CC BY.

Keywords: Singular value decomposition

How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?

arXiv, 2024

Authors: Liu, Ryan; Sumers, Theodore R.; Dasgupta, Ishita; Griffiths, Thomas L. Affiliations: Department of Computer Science, Princeton University, United States; Google DeepMind, United Kingdom; Department of Psychology, Princeton University, United States

In day-to-day communication, people often approximate the truth - for example, rounding the time or omitting details - in order to be maximally helpful to the listener. How do large language models (LLMs) handle such nuanced trade-offs? To address this question, we use psychological models and experiments designed to characterize human behavior to analyze LLMs. We test a range of LLMs and explore how optimization for human preferences or inference-time reasoning affects these trade-offs. We find that reinforcement learning from human feedback improves both honesty and helpfulness, while chain-of-thought prompting skews LLMs towards helpfulness over honesty. Finally, GPT-4 Turbo demonstrates human-like response patterns including sensitivity to the conversational framing and listener's decision context. Our findings reveal the conversational values internalized by LLMs and suggest that even these abstract values can, to a degree, be steered by zero-shot prompting. Copyright © 2024, The Authors. All rights reserved.

Keywords: Reinforcement learning

Decoupling Semantic Similarity from Spatial Alignment for Neural Networks

38th Conference on Neural Information Processing Systems, NeurIPS 2024

Authors: Wald, Tassilo; Ulrich, Constantin; Köhler, Gregor; Zimmerer, David; Denner, Stefan; Baumgartner, Michael; Isensee, Fabian; Jaini, Priyank; Maier-Hein, Klaus H. Affiliations: Heidelberg, Germany; Helmholtz Imaging, DKFZ, Heidelberg, Germany; Faculty of Mathematics and Computer Science, University of Heidelberg, Germany; Medical Faculty Heidelberg, University of Heidelberg, Germany; Google DeepMind, United Kingdom; Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg, Germany

What representation do deep neural networks learn? How similar are images to each other for neural networks? Despite the overwhelming success of deep learning methods, key questions about their internal workings still remain largely unanswered, due to their internal high dimensionality and complexity. To address this, one approach is to measure the similarity of activation responses to various inputs. Representational Similarity Matrices (RSMs) distill this similarity into scalar values for each input pair. These matrices encapsulate the entire similarity structure of a system, indicating which inputs lead to similar responses. While the similarity between images is ambiguous, we argue that the spatial location of semantic objects influences neither human perception nor deep learning classifiers. Thus, this should be reflected in the definition of similarity between image responses for computer vision systems. Revisiting the established similarity calculations for RSMs, we expose their sensitivity to spatial alignment. In this paper, we propose to solve this through semantic RSMs, which are invariant to spatial permutation. We measure semantic similarity between input responses by formulating it as a set-matching problem. Further, we quantify the superiority of semantic RSMs over spatio-semantic RSMs through image retrieval and by comparing the similarity between representations to the similarity between predicted class probabilities. © 2024 Neural information processing systems foundation. All rights reserved.

RoboTAP: Tracking Arbitrary Points for Few-Shot Visual Imitation

arXiv, 2023

Authors: Vecerik, Mel; Doersch, Carl; Yang, Yi; Davchev, Todor; Aytar, Yusuf; Zhou, Guangyao; Hadsell, Raia; Agapito, Lourdes; Scholz, Jon. Affiliations: Google DeepMind, United Kingdom; Department of Computer Science, University College London, United Kingdom

For robots to be useful outside labs and specialized factories we need a way to teach them new useful behaviors quickly. Current approaches lack either the generality to onboard new tasks without task-specific engineering, or else lack the data-efficiency to do so in an amount of time that enables practical use. In this work we explore dense tracking as a representational vehicle to allow faster and more general learning from demonstration. Our approach utilizes Track-Any-Point (TAP) models to isolate the relevant motion in a demonstration, and parameterize a low-level controller to reproduce this motion across changes in the scene configuration. We show this results in robust robot policies that can solve complex object-arrangement tasks such as shape-matching, stacking, and even full path-following tasks such as applying glue and sticking objects together, all from demonstrations that can be collected in minutes. Copyright © 2023, The Authors. All rights reserved.

Keywords: Demonstrations