Technical Staff, Audio Omni
Job Description
About Anuttacon
Anuttacon is an independent research lab pursuing humanistic general intelligence that you can experience in every real-time interaction—seamlessly understanding and expressing through text, voice, visuals and beyond.
We see AI and humans as equal partners in virtual world creation and discovery. Our mission is to build multimodal AI with genuine emotional understanding and expressive communication: technology that not only thinks but feels, connecting with you authentically through rich, nuanced interactions that enhance your experience.
Key Responsibilities:
- Design and develop a unified Any-to-Any multimodal architecture, with a primary focus on native Audio-in/Audio-out modeling.
- Develop high-performance Neural Audio Codecs, exploring the optimal balance between continuous representations and discrete tokens.
- Leverage large-scale multimodal data (speech, music, environmental audio, video) to lead distributed pre-training of ultra-large-scale models.
- Explore instruction fine-tuning and reinforcement learning algorithms tailored to the audio modality, optimizing emotional expression, interruption handling, paralinguistic features (e.g., laughter, pauses), and perceptual audio quality in speech interaction.
Qualifications:
- PhD in Computer Science, Artificial Intelligence, Electronic Engineering, or a related field;
- Hands-on experience training large-scale models (LLM or multimodal), with deep understanding of Transformer architectures and distributed training frameworks (Megatron-LM, DeepSpeed, TorchTitan, etc.);
- Deep expertise in at least one of the following areas:
  - Audio/Text Interleaved Pretraining
  - Multimodal Alignment & RL
  - End-to-End Speech Dialogue Modeling
- Proficiency in PyTorch with extensive experience in large-scale data processing;
- Ability to collaborate across time zones with strong communication skills; results-driven and highly accountable.
Preferred Qualifications:
- Publications at top-tier venues such as NeurIPS, ICML, ICASSP, or ISMIR;
- Experience managing heterogeneous audio datasets at the scale of one million hours or more;
- Core contributor to an industry-scale Omni or Multimodal Foundation Model.