CMT reduces the training cost of diffusion-based flow map models by up to 90% while reaching SOTA performance
A framework for identifying which training examples influenced specific concepts within a diffusion model
Learning conditional, unconditional, and matching-aware discriminators with an adaptive weighting mechanism (cSAN)
A tensor-decomposition-based PEFT method, shown to be effective on text-to-image generation tasks
Theoretical analysis of the limitations of current discrete diffusion, and a method for effectively capturing element-wise dependencies
A method that efficiently leverages online human feedback to fine-tune Stable Diffusion for a wide range of tasks
An enhanced multimodal representation using weighted point clouds and its theoretical benefits
A 64x64 pre-trained diffusion model is all you need for 1-step high-resolution SOTA generation
A unified framework enabling diverse samplers and SOTA 1-step generation
Applications:
[SoundGen]
DeepResonance: Enhancing Multimodal Music Understanding via Music-centric Multi-way Instruction Tuning
[EMNLP] [arXiv] [data]
CARE: Assessing the Impact of Multilingual Human Preference Learning on Cultural Awareness
[MRR@ICCV25] [arXiv]
Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
[TMLR] [arXiv]
GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
Large-Scale Training Data Attribution for Music Generative Models via Unlearning
SOTA Fx representation: extracting instrument-wise audio effects representations from music mixtures
Reverse Engineering of Music Mixing Graphs with Differentiable Processors and Iterative Pruning
Supervised contrastive learning from weakly-labeled audio segments for musical version matching
DiffVox: A Differentiable Model for Capturing and Analysing Professional Effects Distributions
Improving Unsupervised Clean-to-Rendered Guitar Tone Transformation Using GANs and Integrated Unaligned Clean Data
DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability
MMDisCo: Multi-Modal Discriminator-Guided Cooperative Diffusion for Joint Audio and Video Generation
Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models
SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond
STARSS23: An Audio-Visual Dataset of Spatial Recordings of Real Scenes with Spatiotemporal Annotations of Sound Events