tasty transformer papers - september 2024

Emu3: Next-Token Prediction is All You Need
what: one transformer decoder that generates videos via plain next-token prediction.
- uses a VQ-VAE to tokenize frames into discrete vision tokens.
- [EOL] and [EOF] tokens are inserted into the vision tokens to mark line breaks and frame boundaries.
- training sample layout (assembly sketch below):
[BOS] {caption text} [SOV] {meta text} [SOT] {vision tokens} [EOV] [EOS].
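here's a minimal sketch of how one such training sequence could be assembled, following the layout above; the helper names and the exact placement of [EOL]/[EOF] are my guesses for illustration, not the paper's code.

```python
# A minimal sketch of assembling one Emu3-style training sequence.
# Token names follow the post; the VQ-VAE grid format is a hypothetical placeholder.
BOS, SOV, SOT, EOV, EOS, EOL, EOF = "[BOS]", "[SOV]", "[SOT]", "[EOV]", "[EOS]", "[EOL]", "[EOF]"

def build_sample(caption_tokens, meta_tokens, frames):
    """frames: list of 2D grids of VQ-VAE codebook indices, one grid per video frame."""
    vision = []
    for grid in frames:
        for row in grid:
            vision.extend(row)
            vision.append(EOL)        # end of a line of vision tokens
        vision.append(EOF)            # end of a frame
    return [BOS, *caption_tokens, SOV, *meta_tokens, SOT, *vision, EOV, EOS]

# usage: a 2-frame "video" with 2x3 grids of codebook ids
sample = build_sample(["a", "cat"], ["h=2", "w=3"], [[[5, 9, 1], [7, 7, 2]], [[5, 8, 1], [6, 7, 3]]])
print(sample)  # one flat sequence, trained with plain next-token prediction
```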


Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
what: Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction.
- parallel decoding: 8 tokens at step [t-1] -> 8 tokens at step [t] (text and audio streams generated together; sketch below).
- audio responses arrive as fast as text generation.
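a rough sketch of what "8 tokens per step" parallel decoding could look like: one text token plus several audio codec tokens emitted from the same hidden state at each step. the 1-text + 7-audio split and the separate linear heads are my assumptions for illustration, not the paper's exact code.

```python
import torch
import torch.nn as nn

class ParallelHead(nn.Module):
    def __init__(self, d_model, text_vocab, audio_vocab, n_audio_streams=7):
        super().__init__()
        self.text_head = nn.Linear(d_model, text_vocab)
        self.audio_heads = nn.ModuleList(
            nn.Linear(d_model, audio_vocab) for _ in range(n_audio_streams)
        )

    def forward(self, h_t):                                   # h_t: (batch, d_model) at step t
        text_tok = self.text_head(h_t).argmax(-1)             # greedy decoding for brevity
        audio_toks = [head(h_t).argmax(-1) for head in self.audio_heads]
        return text_tok, torch.stack(audio_toks, dim=-1)      # 1 + 7 = 8 tokens per step

head = ParallelHead(d_model=512, text_vocab=32000, audio_vocab=2048)
text_tok, audio_toks = head(torch.randn(1, 512))
print(text_tok.shape, audio_toks.shape)  # torch.Size([1]) torch.Size([1, 7])
```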

Were RNNs All We Needed?
what: architecture matters less than we think; simplified variants seem to work similarly.
- strip the RNN gates of their hidden-state dependence so training parallelizes, and performance stays similar (minGRU sketch below).
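the core trick, as i understand it, is a minimal GRU whose gate and candidate depend only on the current input, so the recurrence can be computed with a parallel scan at training time. sequential toy version below (the real speedup comes from the scan, which this sketch omits).

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)       # gate depends only on x_t
        self.to_h = nn.Linear(d_in, d_hidden)       # candidate depends only on x_t

    def forward(self, x):                           # x: (batch, seq, d_in)
        h = torch.zeros(x.size(0), self.to_h.out_features, device=x.device)
        outs = []
        for t in range(x.size(1)):
            z = torch.sigmoid(self.to_z(x[:, t]))
            h_tilde = self.to_h(x[:, t])
            h = (1 - z) * h + z * h_tilde           # no nonlinearity applied to h_{t-1}
            outs.append(h)
        return torch.stack(outs, dim=1)

y = MinGRU(16, 32)(torch.randn(2, 10, 16))
print(y.shape)  # torch.Size([2, 10, 32])
```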


my thoughts
- data is key in model training. whether you're using transformers, rnns, next-token prediction, or diffusion models is becoming less important.
- predicting text and audio in parallel is promising for next-gen brain-computer interfaces.
- focus less on architecture and more on the quality of data and the objectives you want to optimize.

#digest
Since we're on the topic of video, here's a fresh LeCun talk for your feed

Lecture Series in AI: “How Could Machines Reach Human-Level Intelligence?”
https://www.youtube.com/watch?v=xL6Y0dpXEwc

Animals and humans understand the physical world, have common sense, possess a persistent memory, can reason, and can plan complex sequences of subgoals and actions. These essential characteristics of intelligent behavior are still beyond the capabilities of today's most powerful AI architectures, such as Auto-Regressive LLMs.

I will present a cognitive architecture that may constitute a path towards human-level AI. The centerpiece of the architecture is a predictive world model that allows the system to predict the consequences of its actions and to plan sequences of actions that fulfill a set of objectives. The objectives may include guardrails that guarantee the system's controllability and safety. The world model employs a Joint Embedding Predictive Architecture (JEPA) trained with self-supervised learning, largely by observation.

The JEPA simultaneously learns an encoder, that extracts maximally-informative representations of the percepts, and a predictor that predicts the representation of the next percept from the representation of the current percept and an optional action variable.

We show that JEPAs trained on images and videos produce good representations for image and video understanding. We show that they can detect unphysical events in videos. Finally, we show that planning can be performed by searching for action sequences that produce a predicted end state that matches a given target state.

Slides:
https://drive.google.com/file/d/1F0Q8Fq0h2pHq9j6QIbzqhBCfTXJ7Vmf4/view

I should finally dig into JEPA and its variations. It's been in the queue for a while; a minimal sketch of the core idea is below in the meantime.
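this is the JEPA training step as i read the abstract: encode the current and the next percept, predict the next representation from the current one plus an action, and take the loss in embedding space. the stop-gradient target is a common anti-collapse trick i'm assuming, not something the abstract specifies.

```python
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))      # percept -> representation
pred = nn.Sequential(nn.Linear(64 + 8, 256), nn.ReLU(), nn.Linear(256, 64))  # (repr, action) -> next repr

def jepa_step(x_t, action, x_next):
    s_t = enc(x_t)
    with torch.no_grad():                        # stop-grad target (assumption, not in the abstract)
        s_next = enc(x_next)
    s_pred = pred(torch.cat([s_t, action], dim=-1))
    return ((s_pred - s_next) ** 2).mean()       # loss in representation space, not pixel space

loss = jepa_step(torch.randn(4, 128), torch.randn(4, 8), torch.randn(4, 128))
loss.backward()
```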
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

blogpost
tasty diffusion papers - september 2024

OmniGen: Unified Image Generation
what: one transformer for text-to-image diffusion.
- rectified flow objective (sketch below).
- multimodal conditioning: text and images.
- one model processes the full context and performs the diffusion steps.
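for reference, the rectified-flow objective in its generic form (not OmniGen's actual code): interpolate on a straight line between noise and data and regress the constant velocity.

```python
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))
    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t], dim=-1))

model = TinyVelocityNet(dim=32)
x1 = torch.randn(16, 32)                      # stand-in for data (image latents)
x0 = torch.randn(16, 32)                      # noise
t = torch.rand(16, 1)
x_t = (1 - t) * x0 + t * x1                   # straight-line interpolation between noise and data
loss = ((model(x_t, t) - (x1 - x0)) ** 2).mean()   # regress the constant velocity x1 - x0
loss.backward()
```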

Diffusion Policy Policy Optimization
what: a set of best practices for fine-tuning diffusion-based policies on continuous control and robot learning tasks.

Rejection Sampling IMLE: Designing Priors for Better Few-Shot Image Synthesis
what: a modification that helps IMLE generalize better from only a few samples.

my thoughts
- i like how the AI community keeps trying to simplify everything. surprisingly, it sometimes works well; for example, rectified flow is a simplified version of diffusion.
- rl + diffusion => next step in brain stimulation?
diffusion models could mimic brain patterns for smoother stimulation. with RL, they’d adapt in real-time, making treatments more precise and personalized. shifting from rigid protocols to dynamic brain interventions.

#digest
How effective is human-AI collaboration?

A meta-analysis of 106 studies just published in Nature reports an interesting result:

On average, there was no synergy: human–AI combinations did not perform better than the better of the human alone or the AI alone.

In particular, when the AI alone outperformed the human alone, the human–AI combination led to performance losses, likely because humans were unable to integrate the suggestions provided by the AI.

Conversely, when the human outperformed the AI alone, there was some synergy, and the human–AI combination led to performance gains, likely because this time humans were better at integrating the AI suggestions.
enhancing intuition in diffusion

Tutorial on Diffusion Models for Imaging and Vision
what: a good, comprehensive tutorial on modern diffusion models.

Diffusion is spectral autoregression
what: blog post (with a Python notebook) arguing that image diffusion behaves like autoregression in frequency space, generating coarse (low-frequency) structure first and fine detail later.
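the gist, in a toy sketch of my own (not the author's notebook): Gaussian noise buries high spatial frequencies first, so denoising from high to low noise fills in the spectrum coarse to fine.

```python
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged power spectrum of a square grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)
    return np.bincount(r.ravel(), power.ravel()) / np.bincount(r.ravel())

x = np.linspace(0, 1, 128)
xx, yy = np.meshgrid(x, x)
img = np.sin(4 * np.pi * xx) * np.cos(2 * np.pi * yy)    # smooth, low-frequency stand-in "image"

rng = np.random.default_rng(0)
for sigma in [0.0, 0.1, 0.5, 1.0]:                        # increasing diffusion noise levels
    spec = radial_power_spectrum(img + sigma * rng.standard_normal(img.shape))
    # noise floods the high-frequency bins first; the low-frequency content stays dominant
    print(f"sigma={sigma:.1f}  low-freq={spec[:5].mean():.1f}  high-freq={spec[-10:].mean():.1f}")
```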

Diffusion Models are Evolutionary Algorithms
what: another view on diffusion
- the denoising process in diffusion models can be seen as a combination of selection and mutation.

if you know good blogs, please share them in the comments

#knowledge
Philosophy of mind - worth it in 2024?

A relatively recent Giulio Tononi talk (IIT theory) at my university (not recorded) sparked my interest in consciousness research. Turns out this area is exactly what I find most fun, even though I hadn't realized it.

After all, all there is to catching the Zen of this life is to make sense of thy own mind...

2022-2023 presented us with lively debates about the neural basis of consciousness, with the most prominent opponents being integrated information theory (IIT) and global workspace theory (GWT). I recommend checking out this Nature letter for more info on the clash between the theories. Additionally, there is a cool review by A. Seth and T. Bayne on neuroscientific theories of consciousness.

I came to the conclusion that, to even start digging into the neural basis of consciousness, it might be useful to look at what our fellow philosophers of mind have done (over the course of the last 300 years...). To my surprise, John Searle (the author of the famous Chinese room thought experiment) does not seem at all incompetent on the neural side of mind-body approaches.

I have to admit that, as a snobbish computational neuroscientist, I am highly skeptical of non-empirical reasoning. However, the philosophy of mind course by J. Searle seems super fun and comprehensive. So far it has discussed the relationship between syntax and semantics, the computational theory of mind, different forms and shapes of materialism, connectionism and more, with a focus on heavy critiques of all of them 😁

So for anyone like me, trying to get a grasp on mind-body relationships while staying true to rational, empirical neuroscience, I highly recommend this course (I listen to it like a podcast). Super fun and informative, an easy listen.

▶️ [youtube playlist link]

The plan is to finally understand IIT and neural theories of consciousness after being armed with philosophical arguments. Good luck to me/us 🕺
[1/4]
tasty diffusion papers - october 2024

Let's dive into video generation and "world models"

Diffusion for World Modeling: Visual Details Matter in Atari
what: DIAMOND: DIffusion As a Model Of eNvironment Dreams
- a diffusion world model is used to train an RL agent
- you can also play the game yourself.
- Atari + CS:GO
link: https://diamond-wm.github.io/

Oasis: A Universe in a Transformer
what: turns player actions into Minecraft frames.
- takes the latest N frames plus an action as input.
- a DiT transformer generates the next frame.
- Trained on Minecraft environment.
link: https://www.decart.ai/articles/oasis-interactive-ai-video-game-model

MarDini: Masked Auto-Regressive Diffusion for Video Generation at Scale
what: a video generator built from two models.
- planning model: masked auto-regression operating at low resolution.
- DM focuses on detailed spatial modelling.
- image-to-video, video interpolation
link: https://mardini-vidgen.github.io/

Pyramidal Flow Matching for Efficient Video Generative Modeling
what: pyramidal flow matching with a DiT to generate videos.
- "first steps are usually very noisy and less informative", so generation can start at a low resolution and then increase it (toy sketch below).
link: https://arxiv.org/abs/2410.05954
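a toy illustration of that intuition (the schedule, stage boundaries, and re-noising step are made up; this is not the paper's algorithm): spend the early, noisy steps at low resolution and upsample as the sample gets cleaner.

```python
import torch
import torch.nn.functional as F

def pyramidal_sampling(denoise_step, steps=30, base=16, full=64):
    x = torch.randn(1, 3, base, base)                        # start small and noisy
    stages = [(0, 10, base), (10, 20, full // 2), (20, steps, full)]
    for start, end, res in stages:
        if x.shape[-1] != res:
            x = F.interpolate(x, size=(res, res), mode="bilinear", align_corners=False)
            x = x + 0.1 * torch.randn_like(x)                # re-noise after upsampling (assumption)
        for t in range(start, end):
            x = denoise_step(x, t / steps)
    return x

out = pyramidal_sampling(lambda x, t: 0.95 * x)              # dummy denoiser, just for shape-checking
print(out.shape)  # torch.Size([1, 3, 64, 64])
```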
tasty transformer papers | october 2024
[2/4]

Differential Transformer
what: a small modification to the self-attention mechanism.
- focuses on the most important information, ignoring unnecessary details.
- it does this by subtracting one attention map from another to cancel out "noise" (sketch below).
link: https://arxiv.org/abs/2410.05258
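a condensed single-head sketch of the idea (omitting the paper's multi-head layout, GroupNorm, and λ re-parameterization): two attention maps, one subtracted from the other.

```python
import torch
import torch.nn.functional as F

def diff_attention(x, wq1, wk1, wq2, wk2, wv, lam=0.5):
    d = wq1.shape[1]
    a1 = F.softmax((x @ wq1) @ (x @ wk1).transpose(-2, -1) / d**0.5, dim=-1)
    a2 = F.softmax((x @ wq2) @ (x @ wk2).transpose(-2, -1) / d**0.5, dim=-1)
    return (a1 - lam * a2) @ (x @ wv)        # lam is a learned scalar in the paper

x = torch.randn(2, 8, 64)                    # (batch, seq, dim)
w = lambda: torch.randn(64, 64) / 8
out = diff_attention(x, w(), w(), w(), w(), w())
print(out.shape)  # torch.Size([2, 8, 64])
```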

Pixtral-12B
what: a good multimodal model with a simple architecture.
- Vision Encoder with ROPE-2D: Handles any image resolution/aspect ratio natively.
- Break Tokens: Separates image rows for flexible aspect ratios.
- Sequence Packing: Batch-processes images with block-diagonal masks, no info "leaks" (sketch below).
link: https://arxiv.org/abs/2410.07073
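the sequence-packing part is easy to picture with a block-diagonal mask: several images share one packed sequence, but tokens can only attend within their own image. a minimal sketch:

```python
import torch

def block_diagonal_mask(lengths):
    """lengths: number of tokens per packed image. True = attention allowed."""
    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in lengths:
        mask[start:start + n, start:start + n] = True    # each image attends only to itself
        start += n
    return mask

mask = block_diagonal_mask([4, 3, 5])        # three images packed into one sequence
print(mask.int())
# feed as attn_mask (or additive -inf where False) to standard attention
```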

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
what: MaskGIT-style generation with continuous tokens.
- use a VAE tokenizer trained with a quantization loss, but drop quantization in the decoder (as in Stable Diffusion).
- proposes a BERT-like model that generates tokens in random order.
- ablations show the BERT-like setup beats the GPT-like one for images (tbh, small improvements).
link: https://arxiv.org/abs/2410.13863

UniMTS: Unified Pre-training for Motion Time Series
what: one model to handle different device positions, orientations, and activity types.
- uses a graph convolution encoder to work across devices
- contrastive learning with text from LLMs to “get” motion context.
- rotation-invariance: doesn’t care about device angle.
link: https://arxiv.org/abs/2410.19818

my thoughts

I'm really impressed with the Differential Transformer metrics. They made such a simple and clear modification. Basically, they let the neural network find not only the most similar tokens but also the irrelevant ones. Then they subtract one from the other to get exactly what's needed.

This approach could really boost brain signal processing. After all, brain activity contains lots of unnecessary information, and filtering it out would be super helpful. So it looks promising.

Mistral has really nailed how to build and explain models. Clear, brief, super understandable. They removed everything unnecessary, kept just what's needed, and got better results. The simpler, the better!
tasty neuro bci papers - october 2024
[3/4]

Synthetic touch for brain-controlled bionic hands: tactile edges and motion via patterned microstimulation of the human somatosensory cortex

what: complex touch sensations using patterned brain stimulation. Participants felt edges, shapes, and motion.
- Uses multiple electrodes firing in patterns in somatosensory cortex (S1)
- Creates edge and shape sensations
- Controls motion direction and speed
- Winner of BCI AWARD 2024
video: https://youtu.be/ipojAWqTxAA

Measuring instability in chronic human intracortical neural recordings towards stable, long-term brain-computer interfaces

what: a metric to track distribution shift in chronic recordings
- applies KL divergence to the neural recordings (sketch below)
- shows that it correlates well with decoder performance.
- a good signal for deciding when to recalibrate.
link: https://www.nature.com/articles/s42003-024-06784-4
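a back-of-the-envelope sketch of the idea (my own toy version, not the paper's pipeline): histogram each channel's features on a reference day and on a new day, then average the per-channel KL divergences into a drift score.

```python
import numpy as np

def kl_drift_score(ref, new, bins=30, eps=1e-9):
    """ref, new: arrays of shape (samples, channels) of neural features (e.g. binned firing rates)."""
    scores = []
    for ch in range(ref.shape[1]):
        lo, hi = ref[:, ch].min(), ref[:, ch].max()
        p, _ = np.histogram(ref[:, ch], bins=bins, range=(lo, hi))
        q, _ = np.histogram(new[:, ch], bins=bins, range=(lo, hi))
        p = p / p.sum() + eps
        q = q / max(q.sum(), 1) + eps
        scores.append(np.sum(p * np.log(p / q)))       # KL(reference || new) for this channel
    return float(np.mean(scores))

rng = np.random.default_rng(0)
day0 = rng.normal(0.0, 1.0, size=(5000, 96))           # reference session, 96 channels
day30 = rng.normal(0.4, 1.2, size=(5000, 96))          # drifted session
print(kl_drift_score(day0, day0[2500:]), kl_drift_score(day0, day30))  # small vs. clearly larger
```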


Accurate neural control of a hand prosthesis by posture-related activity in the primate grasping circuit

what: hand prosthesis control using neural posture signals instead of traditional velocity decoding. Achieves precision grip control in macaques.
- Uses posture transitions vs standard velocity control
- Works with 3 brain areas (AIP, F5, M1)
- Matches natural hand control patterns
link: https://www.cell.com/neuron/abstract/S0896-6273(24)00688-3

my thoughts

The shift from "feeling dots" to "feeling objects" is amazing, like upgrading from Morse code to actual writing for touch sensations. It's not perfect yet, of course, and we have to keep going. In my view, we should focus on "smart" stimulation that can use diverse feedback from the participant, maybe a mix of RL and SFT.

Measuring changes in the neural recording is a must-have in any BCI application, and KL divergence is a good starting point. The plots show smooth performance degradation, so potentially we could capture this shift day by day and correct it. For example, it would be interesting to build a "stabilizer model" that maps shifted data back into the original distribution: flow matching, diffusion, or just an AE with a KL loss.
⚡️❗️ Breaking Ground in BCI: Science (Neuralink's Competitor) Unveils Revolutionary Biohybrid Neural Technology

Science, a neurotechnology company founded by former Neuralink President Max Hodak, has revealed a revolutionary approach to brain-computer interfaces (BCIs) that could fundamentally transform how we interact with the human brain.

Unlike traditional BCIs, including those developed by Neuralink, Science's innovative biohybrid approach utilizes living neurons instead of conventional electrodes.

The company has developed a unique technology where specially engineered neurons, derived from stem cells, are integrated with electronics before being implanted into the brain. The key innovation lies in keeping the neuron cell bodies within the device while allowing their axons and dendrites to naturally grow into the brain tissue, forming new connections with existing neurons.

This breakthrough approach offers several revolutionary advantages:

1. Natural Integration:
- A single implant of one million neurons can create over a billion synaptic connections
- The device occupies less than a cubic millimeter
- Forms genuine chemical synapses with brain cells

2. Versatility:
- Capability to use various neuron types (dopaminergic, cholinergic, glutamatergic)
- Ability to stimulate the brain using natural neurotransmitters
- Superior signal quality with lower power consumption

3. Scalability Potential:
- Technology can be scaled to millions of neurons
- Theoretical bandwidth comparable to the corpus callosum (the structure connecting brain hemispheres)

The development team is addressing several technical challenges:

1. Immunological Compatibility:
- Need to create immune-invisible cells
- Current personalized cell creation process is costly ($1M+) and time-consuming (months)

2. Cell Viability:
- Neurons must survive glycemic shock
- Protection from hypoxia is essential
- Proper glial support required
- Cells must mature within an active electronic device

Science has already published their first paper demonstrating this technology's capabilities.

While their biohybrid approach is still in early development, its potential is immense. It could solve the fundamental limitations of traditional BCIs - brain tissue damage during electrode implantation and limited long-term stability.

This development represents a significant departure from conventional BCI approaches, including those of Neuralink and other competitors. While Neuralink has focused on developing advanced electrode arrays, Science's biohybrid approach could potentially offer a more natural and sustainable solution for brain-computer integration.

The implications of this breakthrough extend beyond just technological advancement. It opens new possibilities for treating neurological conditions, restoring lost brain functions, and creating more natural brain-computer interfaces. If the technical challenges can be overcome, this technology could form the foundation for the next generation of neuroprosthetics and therapeutic devices.

This innovation underscores the rapid advancement in neurotechnology, with companies like Science and Neuralink pushing the boundaries of what's possible in brain-computer interfacing. The competition between these companies, led by visionary entrepreneurs like Max Hodak, continues to drive innovation in this crucial field, potentially bringing us closer to a future where seamless brain-computer integration becomes a reality.

Science's approach represents not just an incremental improvement but a paradigm shift in how we think about brain-computer interfaces, potentially offering a more biocompatible and sustainable solution for long-term neural interfacing.
Prostheses lag behind robotic hands - and what to do about it

Let's talk about hands - both real and artificial. You use your own every day without even thinking about it. Artificial ones, though... they're just around the corner, and the progress in this area is genuinely impressive! Check out the latest videos from Tesla and Figure - their robot hands are already almost indistinguishable from human hands in dexterity.

Why does this matter?
Our entire world is built around hands - from door handles to smartphones. So robots that are supposed to help us in everyday life simply have to learn to operate in our hand-centric world.

Over the last two years, robots have made a huge leap in control. It works roughly like this: take a transformer, feed it a pile of videos of human movements, and teach it to imitate them. Essentially, the robot learns from examples.

What about prostheses?
This is where it gets interesting (and sad). You'd expect prostheses to be advancing as impressively as robots, or even more so. But no. Unfortunately, prosthetics lag far behind, especially in control.

How it works today: a prosthesis is attached to the residual limb and reads electrical signals from the muscles. The person contracts their muscles and the prosthesis starts to move. For now, control is limited to a small set of gestures you can switch between. It's like playing a game with two buttons.

There are, of course, experiments with implanted electrodes - the results there are fire! But those solutions haven't reached the market yet.


What can we do about it?
I'd like the gap between robots and prostheses not to be this wide. I believe this can be achieved through active use of AI.

What if a person with an amputation could control individual fingers? Could type on a keyboard? Could play the piano?

Let's sketch out how this could be done. To start, we'll limit ourselves to control in VR, and later transfer it to prostheses. Let's go.

Task 1. Finger control in VR

Augmented Mirror Hand (MIRANDA): Advanced Training System for New Generation Prosthesis

old poster: link
new video: youtube

Last year, our team at ALVI Labs showed that, using muscle signals (EMG), a person without a hand can control individual fingers in VR.

Essentially, we took technologies from robotics, added our own tricks, and it worked! (q-former pre-train for imitation learning and fast instant finetuning.)

This approach needs to be extended with information about hand position to make the model more robust.

Task 2. Typing in VR


TouchInsight: Uncertainty-aware Rapid Touch and Text Input for Mixed Reality from Egocentric Vision
https://arxiv.org/abs/2410.05940

The authors propose an improved hand-tracking system for text input. They combine hand tracking with transformers that smartly aggregate all the information coming from the VR headset and detect the moment a surface is touched. These characters are then processed by a language model that understands the structure of the language and avoids silly mistakes.

They built a pipeline fully tailored to one specific task: typing. And that's the most interesting part: they focused on a single scenario and polished it properly. We need to apply the same approach to the various prosthesis-control scenarios.


Task 3. Playing the piano in VR

A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands

https://rp1m.github.io/

Moving on to playing an instrument. Here we can teach a model to play the piano, then combine it with muscle signals from a person, in roughly the same way as was done for typing.

So, these papers can serve as a foundation for new research in prosthetics. We'll keep our finger on the pulse and share the news with you.

Let's start believing in the impossible and keep trying, failing, and trying again. That's the only way to break through the limits of our assumptions and build what today looks like science fiction.
Optimus Hand by Tesla

Controlled by a human operator in real time.

Now there are 22 degrees of freedom in the hand and 3 in the wrist.

Looks very natural.

There's a separate post about hands - check it out if you haven't seen it yet:

https://www.group-telegram.com/neural_cell.com/209
tasty neuro bci papers i liked in november 2024
[1/3]

🔘Speech motor cortex enables BCI cursor control and click

tl;dr: demonstrated that ventral motor cortex (typically used for speech) can enable high-performance cursor control
• rapid calibration (40 seconds) and accurate control (2.90 bits/sec) from vPCG neural signals
• all 4 arrays showed click-related activity, with best cursor control from dorsal 6v area
• system enabled real-world computer use including Netflix browsing and gaming
link: https://doi.org/10.1101/2024.11.12.623096

🔘Optogenetic stimulation of a cortical biohybrid implant guides goal directed behavior

tl;dr: novel BCI approach using living neurons on brain surface instead of invasive electrodes
• achieves 50% neuron survival by avoiding vascular damage during implantation
• transplanted neurons naturally integrate and show spontaneous activity
• mice successfully detect optogenetic stimulation to perform reward task
link: https://www.biorxiv.org/content/10.1101/2024.11.22.624907v1
press: https://science.xyz/technologies/biohybrid/

my thought:
speech motor cortex enabling netflix browsing in 40 seconds of calibration? that's the kind of real-world usability we've been waiting for. not just lab demos, but actual everyday control.

the biohybrid approach is tackling the integration problem from a completely different angle. getting living neurons to interface with the brain might sound complex, but it could be the elegant solution we need.

Pretty exciting to see BCI tech moving from "can we do it?" to "how do we make it better?"
Gemini 2.0 Flash Thinking Experimental

Watching its reasoning is very interesting. Recommended!

Free for now.

https://aistudio.google.com/
tasty visual bci papers i liked in november 2024
[2/3]

MonkeySee: decoding natural images straight from primate brain activity

tl;dr: CNN decoder reconstructs what a monkey sees from its brain signals in V1, V4, and IT areas.
• neural signals from 576 electrodes across V1/V4/IT record the monkey's responses to visual stimuli
• the decoder is essentially a U-Net with an additional learned Gaussian layer that maps electrode signals into 2D space (sketch below)
• model trained on 22,248 images from THINGS dataset achieves high correlation with ground truth
• results show hierarchical processing: V1 better at low-level features, IT at high-level semantics
link: https://openreview.net/forum?id=OWwdlxwnFN
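my reading of that "learned Gaussian layer", as a rough sketch (not the authors' code): each electrode gets a learnable 2D position and width and splats its signal onto a grid that a U-Net-style decoder can consume.

```python
import torch
import torch.nn as nn

class GaussianElectrodeMap(nn.Module):
    def __init__(self, n_electrodes, grid=64):
        super().__init__()
        self.centers = nn.Parameter(torch.rand(n_electrodes, 2))          # learnable (x, y) in [0, 1]
        self.log_sigma = nn.Parameter(torch.full((n_electrodes,), -2.0))  # learnable per-electrode width
        ys, xs = torch.meshgrid(torch.linspace(0, 1, grid), torch.linspace(0, 1, grid), indexing="ij")
        self.register_buffer("grid_xy", torch.stack([xs, ys], dim=-1))    # (grid, grid, 2)

    def forward(self, signals):                                           # signals: (batch, n_electrodes)
        d2 = ((self.grid_xy[None, None] - self.centers[None, :, None, None]) ** 2).sum(-1)
        weights = torch.exp(-d2 / (2 * torch.exp(self.log_sigma)[None, :, None, None] ** 2))
        return (signals[..., None, None] * weights).sum(1, keepdim=True)  # (batch, 1, grid, grid) map

m = GaussianElectrodeMap(n_electrodes=576)
print(m(torch.randn(2, 576)).shape)  # torch.Size([2, 1, 64, 64])
```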


Precise control of neural activity using dynamically optimized electrical stimulation

tl;dr: new optimization approach for neural implants that uses temporal and spatial separation for precise control of neural activity
• the array was placed on retinal ganglion cells (RGCs).
• developed a greedy algorithm that selects an optimal sequence of simple stimuli (toy sketch below).
• uses temporal dithering and spatial multiplexing to avoid nonlinear electrode interactions
• improves visual stimulus reconstruction accuracy by 40% compared to existing methods
link: https://doi.org/10.7554/eLife.83424
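a toy sketch of the greedy flavor of this (heavily simplified; the temporal dithering and spatial multiplexing constraints are omitted): given a dictionary of simple stimuli with known expected evoked responses, repeatedly pick the one that moves the accumulated response closest to the target.

```python
import numpy as np

def greedy_stim_sequence(dictionary, target, n_steps=20):
    """dictionary: (n_stimuli, n_cells) expected response per simple stimulus."""
    achieved = np.zeros_like(target)
    sequence = []
    for _ in range(n_steps):
        errors = np.linalg.norm((achieved + dictionary) - target, axis=1)
        best = int(np.argmin(errors))
        if np.linalg.norm(achieved - target) <= errors[best]:
            break                                    # no stimulus improves the fit; stop early
        achieved += dictionary[best]
        sequence.append(best)
    return sequence, achieved

rng = np.random.default_rng(0)
D = rng.poisson(1.0, size=(40, 100)).astype(float)   # 40 simple stimuli, 100 cells (toy numbers)
target = rng.poisson(5.0, size=100).astype(float)
seq, achieved = greedy_stim_sequence(D, target)
print(len(seq), np.linalg.norm(achieved - target))
```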


my thoughts
The MonkeySee decoder effectively reconstructs images by mirroring how our brain processes information, from basic features in V1 to deeper meanings in IT. While not entirely novel, their experiments are well-designed, using multiple electrodes to cover various visual areas, which is impressive.
Meanwhile, the electrical stimulation work is making significant strides, using clever timing and placement strategies to improve stimulation. They reduce nonlinear electrode interactions by adjusting the timing of stimulation. Perhaps incorporating reinforcement learning could push this further?