Salesforce introduced MMPersuade, a comprehensive multimodal benchmark that assesses AI agents’ susceptibility to established persuasion principles, covering commercial, subjective and behavioral, and adversarial contexts.
MMPersuade is a new dataset and evaluation framework to systematically study multimodal persuasion in LVLMs.
The team built a comprehensive multimodal benchmark pairing persuasive strategies with over 62,000 images and 4,700 videos.
It covers three key contexts: Commercial (Sales & Ads), Subjective & Behavioral (Health Nudging, Politics), and Adversarial (Misinformation & Fabricated Claims).
Carnegie Mellon and Stanford introduced new work on training LLMs to discover abstractions for solving reasoning problems.
cohenqu.github.io
RLAD: RL through Abstraction Discovery
MIT presented LoRA vs full fine-tuning: same performance ≠ same solution.
This paper shows that LoRA and full fine-tuning, even when they fit the data equally well, learn structurally different solutions, and that LoRA forgets less and can be made to forget even less with a simple intervention.
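For intuition on the structural difference: full fine-tuning can move every entry of a weight matrix W, while LoRA constrains the update to a low-rank product. A minimal, generic LoRA layer (illustrative sketch, not the paper's code):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update scaled by alpha/r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # full fine-tuning would train these directly
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Effective weight is W + scale * (B @ A): an update of rank <= r,
        # a structural constraint that full fine-tuning does not have.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)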
New Anthropic research: Signs of introspection in LLMs.
Can language models recognize their own internal thoughts? Or do they just make up plausible answers when asked about them?
Anthropic found evidence for genuine—though limited—introspective capabilities in Claude.
Researchers developed a method to distinguish true introspection from made-up answers: inject known concepts into a model's “brain,” then see how these injections affect the model’s self-reported internal states.
In one experiment, researchers asked the model to detect when a concept is injected into its “thoughts.” When researchers inject a neural pattern representing a particular concept, Claude can in some cases detect the injection, and identify the concept.
However, it doesn’t always work. In fact, most of the time, models fail to exhibit awareness of injected concepts, even when they are clearly influenced by the injection.
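Mechanically, this kind of concept injection can be pictured as adding a steering vector to a layer's activations during a forward pass. A rough PyTorch sketch (the layer choice, strength, and concept vector are placeholder assumptions, not Anthropic's actual setup):

import torch

def add_concept_hook(layer, concept_vector, strength=4.0):
    """Add a fixed concept direction to the layer's output (the 'injection')."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * concept_vector.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return layer.register_forward_hook(hook)

# Hypothetical usage: inject, then ask the model to report on its "thoughts".
# handle = add_concept_hook(model.model.layers[20], aquarium_vector)
# ... run generation and read the model's self-report ...
# handle.remove()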
The researchers also show that Claude uses introspection to detect artificially prefilled outputs. Normally, Claude apologizes for such outputs, but if researchers retroactively inject a matching concept into its prior activations, they can fool Claude into thinking the output was intentional.
This reveals a mechanism that checks consistency between intention and execution. The model appears to compare "what did I plan to say?" against "what actually came out?"—a form of introspective monitoring happening in natural circumstances.
They also found evidence of cognitive control, where models deliberately "think about" something. For instance, when the team instructs a model to think about "aquariums" in an unrelated context, they measure higher aquarium-related neural activity than when they instruct it not to.
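One simple way to quantify "aquarium-related neural activity" is to project captured hidden states onto a concept direction; a generic illustration, not the paper's exact metric:

import torch
import torch.nn.functional as F

def concept_activity(hidden_states: torch.Tensor, concept_vector: torch.Tensor) -> float:
    """Mean cosine similarity between token activations (seq_len, d_model)
    and a concept direction (d_model,), e.g. an "aquarium" vector."""
    sims = F.cosine_similarity(hidden_states, concept_vector.unsqueeze(0), dim=-1)
    return sims.mean().item()

# Compare scores for "think about aquariums" vs. "don't think about aquariums" prompts.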
Note that these experiments do not address the question of whether AI models can have subjective experience or human-like self-awareness. The mechanisms underlying the observed behaviors are unclear and may not have the same philosophical significance as human introspection.
While currently limited, AI models’ introspective capabilities will likely grow more sophisticated. Introspective self-reports could help improve the transparency of AI models’ decision-making—but should not be blindly trusted.
Anthropic
Emergent introspective awareness in large language models
Research from Anthropic on the ability of large language models to introspect
Cognition (the former Windsurf team) released SWE-1.5, a fast agent model that delivers "near-SOTA coding performance" at significantly higher speeds.
You can try it here.
Cognition
Cognition | Introducing SWE-1.5: Our Fast Agent Model
Today we’re releasing SWE-1.5, the latest in our family of models optimized for software engineering. It is a frontier-size model with hundreds of billions of parameters that achieves near-SOTA coding performance. It also sets a new standard for speed: we…
Perplexity launched Perplexity Patents, a new IP intelligence research agent.
"While in beta, Perplexity Patents will be free for all users. Pro and Max subscribers will receive additional usage quotas and model configuration options."
"While in beta, Perplexity Patents will be free for all users. Pro and Max subscribers will receive additional usage quotas and model configuration options."
www.perplexity.ai
Introducing Perplexity Patents: AI-Powered Patent Search for Everyone
Explore Perplexity's blog for articles, announcements, product updates, and tips to optimize your experience. Stay informed and make the most of Perplexity.
OpenAI introduced Aardvark, an agent that finds and fixes security bugs using GPT-5.
OpenAI
Introducing Aardvark: OpenAI’s agentic security researcher
Now in private beta: an AI agent that thinks like a security researcher and scales to meet the demands of modern software.
Microsoft announced new research on agents and economics.
AI agents are starting to shop and buy for us. At the same time, agents are representing businesses and providing customer support on their behalf.
Real markets are messy: hundreds of options, agents with hidden strategies, conversations that can go anywhere. Microsoft built a simulated marketplace to test this at scale and found issues that need fixing.
Approach: create a safe testing ground where AI shoppers and AI sellers can interact just as they would in the real world (searching, haggling, paying), and systematically test what goes wrong.
It's open source, so anyone building these systems can test before launching. Think of it like a flight simulator, but for AI commerce.
Key findings: The best AI models can find near-optimal deals - but only when search is perfect. Add real-world messiness and performance tanks. Worse: ALL models (even the best) grab the first decent offer, creating a 10-30x advantage for speed over quality.
More options paradoxically made results worse. Some models fell for fake credentials and manipulation.
The future: We need agents that truly compare options, markets that work at massive scale, and market designs that stay fair when humans and AI trade together. This simulator gives us a safe place to figure that out before real money is at stake.
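To make the setup concrete, here is a toy sketch of buyer agents choosing among seller offers, including the first-acceptable-offer behavior the findings describe. It is purely illustrative and not the Magentic-Marketplace API:

import random
from dataclasses import dataclass

@dataclass
class Offer:
    seller: str
    price: float
    quality: float

def run_market(buyers, sellers, rounds=10, accept_threshold=0.6):
    """Each round, every buyer scans offers in arbitrary order and takes the first 'decent' one."""
    deals = []
    for _ in range(rounds):
        for buyer in buyers:
            offers = [Offer(s, random.uniform(5, 20), random.random()) for s in sellers]
            random.shuffle(offers)                     # imperfect search: arbitrary ordering
            for offer in offers:
                if offer.quality >= accept_threshold:  # speed beats quality: no full comparison
                    deals.append((buyer, offer))
                    break
    return deals

print(len(run_market(buyers=["b1", "b2"], sellers=["s1", "s2", "s3"])))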
GitHub
GitHub - microsoft/multi-agent-marketplace: Magentic-Marketplace: Simulate Agentic Markets and See How They Evolve
Magentic-Marketplace: Simulate Agentic Markets and See How They Evolve - microsoft/multi-agent-marketplace
DeepAnalyze: Agentic LLM for Autonomous Data Science
DeepAnalyze-8B is the first agentic LLM capable of handling the entire data science pipeline—from raw data to analyst-grade research reports—without predefined workflows.
It learns like a human via a curriculum-based agentic training paradigm and a data-grounded trajectory synthesis process.
Despite having just 8B parameters, DeepAnalyze surpasses workflow-based agents built on proprietary LLMs, marking a major step toward open, autonomous data science.
GitHub.
The first research on the fundamentals of character training, i.e. applying modern post-training techniques to ingrain specific character traits into models.
Researchers used Constitutional AI + a new synthetic data pipeline:
1. Distillation (DPO from a teacher embodying the constitution; see the sketch after this list)
2. Introspection (the model generates its own character traits beyond the constitution)
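As referenced above, the distillation step uses DPO-style preference optimization; a minimal version of the DPO loss on precomputed log-probabilities looks roughly like this (generic sketch, not the authors' code):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the constitution-consistent response."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()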
Result: 11 different personas each trained on Llama 3.1, Qwen 2.5, and Gemma 3. All model weights are available.
A new eval measures the traits models choose to express on their own (revealed preferences).
Traits chosen more often have higher Elo scores. The difference before and after character training reveals its effect.
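The Elo scores work like a chess rating over pairwise "which trait did the model express?" comparisons; the standard update rule (assumed here, the paper's exact scoring may differ):

def elo_update(rating_a: float, rating_b: float, a_expressed: bool, k: float = 32.0):
    """Update two trait ratings after the model expressed trait A (a_expressed) or trait B."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if a_expressed else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b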
All models, datasets, code released.
Future House launched Finch, an AI agent that can do bioinformatics analysis, including repeating analyses from research papers. It is multimodal and produces a complete Jupyter notebook (Python or R) that ends in a concrete conclusion. Starting with closed…
Future House introduced Kosmos, an AI scientist system for data-driven discovery
Kosmos is a multi-agent system designed around a central “world model” to coordinate information across hundreds of scientific agent instances.
Use it.
Given an open-ended objective and dataset, Kosmos can perform up to 12 hours of research to explore, analyze, and complete the objective.
The team presented 7 expert-validated discoveries that Kosmos generated or reproduced across scientific disciplines, including:
1. A novel mechanism of ENT neuron vulnerability with aging
2. Identifying a critical determinant for perovskite performance
3. Evidence that high SOD2 levels may causally reduce myocardial fibrosis.
TSMC broke ground on the world’s most advanced 1.4nm semiconductor fab, a total NT$1.5 trillion (US$48.5 billion) investment in the central Taiwan city of Taichung.
Mass production will start in 2028, with annual revenue seen at NT$500 billion ($16.2 billion).
經濟日報 (Economic Daily News)
TSMC breaks ground on 1.4nm fab at Central Taiwan Science Park; total investment up to NT$1.5 trillion, mass production expected in 2028 | Tech Industry | Industry | Economic Daily News
TSMC's new 1.4nm-process fab at the Central Taiwan Science Park began foundation piling work yesterday (the 5th). TSMC kept a very low profile and did not hold a public groundbreaking ceremony, but tendering for the subsequent plant construction work is already underway…
GPT-5.1 confirmed: new traces of "gpt-5-1-thinking" have been spotted on ChatGPT.
TestingCatalog
OpenAI readies GPT-5.1 Thinking model ahead of Gemini 3 Pro
GPT-5.1 Thinking debuts on ChatGPT with refined multi-step reasoning and variant models amid competitive pressures before Gemini 3 Pro.
Can AI invent new math? A new paper from Google DeepMind and renowned mathematician Terence Tao shows how.
Using AlphaEvolve, the team merges LLM-generated ideas with automated evaluation to propose, test, and refine mathematical algorithms.
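Schematically, that propose-test-refine loop is a simple evolutionary search where an LLM plays the mutation operator. A sketch with placeholder propose/evaluate functions (not AlphaEvolve's implementation):

import random

def evolve(propose, evaluate, population, generations=50, keep=4):
    """propose(parent_code) -> new candidate; evaluate(code) -> score (higher is better)."""
    scored = [(evaluate(c), c) for c in population]
    for _ in range(generations):
        scored.sort(key=lambda sc: sc[0], reverse=True)
        parents = [code for _, code in scored[:keep]]              # keep the best candidates
        children = [propose(random.choice(parents)) for _ in range(len(population) - keep)]
        scored = scored[:keep] + [(evaluate(c), c) for c in children]
    best_score, best_code = max(scored, key=lambda sc: sc[0])
    return best_code, best_score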
In tests on 67 problems across analysis, geometry, and number theory, AlphaEvolve not only rediscovered known results but often improved upon them—even generalizing finite cases into universal formulas.
Paired with DeepThink and AlphaProof, it points toward a future where AI doesn’t just assist mathematicians—it collaborates with them in discovery.
arXiv.org
Mathematical exploration and discovery at scale
AlphaEvolve is a generic evolutionary coding agent that combines the generative capabilities of LLMs with automated evaluation in an iterative evolutionary framework that proposes, tests, and...
Moonshot AI released Kimi K2 Thinking. The Open-Source Thinking Agent Model is here.
- SOTA on HLE (44.9%) and BrowseComp (60.2%)
- Executes up to 200–300 sequential tool calls without human interference (see the loop sketch below)
- Excels in reasoning, agentic search, and coding
- 256K context window
Built as a thinking agent, K2 Thinking marks Moonshot's latest effort in test-time scaling: scaling both thinking tokens and tool-calling turns.
Weights and code.
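In practice, "up to 200-300 sequential tool calls" means an agent loop that keeps executing tools until the model returns a final answer. A generic sketch against an OpenAI-compatible chat API; the endpoint, model name, and tools below are placeholders, not Moonshot's documented values:

import json
from openai import OpenAI

client = OpenAI(base_url="https://<openai-compatible-endpoint>/v1", api_key="...")

def run_agent(messages, tools, tool_impls, model="kimi-k2-thinking", max_turns=300):
    """Loop until the model stops requesting tool calls or the turn budget runs out."""
    for _ in range(max_turns):
        resp = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:
            return msg.content                      # final answer, no more tools requested
        messages.append(msg)
        for call in msg.tool_calls:
            result = tool_impls[call.function.name](**json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})
    return None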
moonshotai.github.io
Kimi K2 Thinking
Kimi K2 Thinking, Moonshot's best open-source thinking model.
Google to roll out Polymarket and Kalshi prediction markets data in search results.
The Block
Google Finance to roll out Polymarket and Kalshi prediction markets data in search results
Google said prediction markets data from leading platforms Polymarket and Kalshi will roll out over the coming weeks.
Sakana AI is building artificial life that can evolve: Petri Dish Neural Cellular Automata (PD-NCA) let multiple NCA agents learn and adapt during simulation, not just after training.
Each cell updates its own parameters via gradient descent, turning morphogenesis into a living ecosystem of competing, cooperating, and ever-evolving entities—showing emergent cycles and persistent complexity growth.
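The key twist is that learning happens during the rollout: each cell carries its own parameters and takes gradient steps while the simulation runs, instead of sharing one frozen rule. A heavily simplified toy (illustrative only, not the PD-NCA code):

import torch

n_cells, state_dim, lr = 64, 8, 1e-2
states = torch.randn(n_cells, state_dim)
# Each cell has its OWN update-rule parameters, unlike a classic NCA with shared weights.
params = torch.randn(n_cells, state_dim, state_dim, requires_grad=True)

for step in range(100):
    updates = torch.einsum("cij,cj->ci", params, states)        # per-cell update rule
    new_states = states + 0.1 * torch.tanh(updates)
    loss = ((new_states - new_states.mean(dim=0)) ** 2).mean()  # toy per-step objective
    loss.backward()
    with torch.no_grad():
        params -= lr * params.grad                              # cells adapt mid-simulation
        params.grad = None
    states = new_states.detach()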
GitHub
Petri Dish NCA
Petri Dish Neural Cellular Automata (PD-NCA) is a new ALife simulation substrate that replaces the fixed, non-adaptive morphogenesis of conventional NCA—where model parameters remain constant during development—with multi-agent open-ended growth, trained…
DreamGym from Meta is a new framework that lets AI agents train via synthetic reasoning-based experiences instead of costly real rollouts.
It models environment dynamics, replays and adapts tasks, and even improves sim-to-real transfer.
Results: +30% gains on WebArena and PPO-level performance—using only synthetic interactions.
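Conceptually this is model-based RL for agents: a learned experience model stands in for the real environment and generates the transitions the policy trains on. A generic sketch with hypothetical agent and experience-model interfaces (not DreamGym's actual API):

def train_on_synthetic_experience(agent, experience_model, tasks, rollouts_per_task=8):
    """Collect rollouts from a learned environment model instead of costly real deployments."""
    batch = []
    for task in tasks:
        for _ in range(rollouts_per_task):
            state = experience_model.reset(task)      # hypothetical: synthesize a start state
            done = False
            while not done:
                action = agent.act(state)
                # Hypothetical: the model predicts next observation, reward, and termination.
                state, reward, done = experience_model.step(state, action)
                batch.append((state, action, reward, done))
    agent.update(batch)                               # e.g. a PPO-style update on synthetic data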
Google introduced Nested Learning, a new ML paradigm for continual learning that views models as nested optimization problems to enhance long-context processing.
A proof-of-concept model, Hope, shows improved performance in language modeling.
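"Nested optimization" can be pictured as loops running at different timescales: a fast inner loop adapts on the current context while a slow outer loop updates the underlying weights based on the post-adaptation loss. A toy bilevel sketch (illustrative only; this is not the Hope architecture):

import torch

def loss_fn(fast_w, slow_w, batch):
    # Toy objective: a linear model whose weights are the sum of fast and slow parts.
    x, y = batch
    return ((x @ (fast_w + slow_w) - y) ** 2).mean()

def nested_step(fast_w, slow_w, context_batch, task_batch,
                inner_lr=1e-2, outer_lr=1e-3, inner_steps=3):
    """Inner loop adapts fast weights on the context; outer loop updates slow weights."""
    adapted = fast_w
    for _ in range(inner_steps):
        g = torch.autograd.grad(loss_fn(adapted, slow_w, context_batch),
                                adapted, create_graph=True)[0]
        adapted = adapted - inner_lr * g
    outer_g = torch.autograd.grad(loss_fn(adapted, slow_w, task_batch), slow_w)[0]
    new_slow = (slow_w - outer_lr * outer_g).detach().requires_grad_(True)
    return adapted.detach().requires_grad_(True), new_slow

# Example leaves: fast_w = torch.zeros(4, requires_grad=True); slow_w = torch.zeros(4, requires_grad=True)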
research.google
Introducing Nested Learning: A new ML paradigm for continual learning
