AHCI RESEARCH GROUP

Publications

Papers published in international journals,
proceedings of conferences, workshops and books.

Scientific Publications

Here you can find the complete list of our publications.

2025

Hu, Y.-H.; Matsumoto, A.; Ito, K.; Narumi, T.; Kuzuoka, H.; Amemiya, T.

Avatar Motion Generation Pipeline for the Metaverse via Synthesis of Generative Models of Text and Video Proceedings Article

In: Proc. - IEEE Conf. Virtual Real. 3D User Interfaces Abstr. Workshops, VRW, pp. 767–771, Institute of Electrical and Electronics Engineers Inc., 2025, ISBN: 9798331514846.

Tags: Ambient intelligence, Design and evaluation methods, Distributed computer systems, Human-centered computing, Language Model, Metaverses, Processing capability, Text-processing, Treemap, Treemaps, Visualization, Visualization design and evaluation method, Visualization design and evaluation methods, Visualization designs, Visualization technique, Visualization techniques

@inproceedings{hu_avatar_2025,

title = {Avatar Motion Generation Pipeline for the Metaverse via Synthesis of Generative Models of Text and Video},

author = {Y.-H. Hu and A. Matsumoto and K. Ito and T. Narumi and H. Kuzuoka and T. Amemiya},

url = {https://www.scopus.com/inward/record.uri?eid=2-s2.0-105005158851&doi=10.1109%2FVRW66409.2025.00155&partnerID=40&md5=fd66da570bb0639fc76ca7aa1f136b82},

doi = {10.1109/VRW66409.2025.00155},

isbn = {9798331514846},

year  = {2025},

date = {2025-01-01},

booktitle = {Proc. - IEEE Conf. Virtual Real. 3D User Interfaces Abstr. Workshops, VRW},

pages = {767--771},

publisher = {Institute of Electrical and Electronics Engineers Inc.},

abstract = {Efforts to integrate AI avatars into the metaverse to enhance interactivity have progressed in both research and commercial domains. AI avatars in the metaverse are expected to exhibit not only verbal responses but also avatar motions, such as non-verbal gestures, to enable seamless communication with users. Large Language Models (LLMs) are known for their advanced text processing capabilities, such as user input, avatar actions, and even entire virtual environments as text, making them a promising approach for planning avatar motions. However, generating the avatar motions solely from the textual information often requires extensive training data whereas the configuration is very challenging, with results that often lack diversity and fail to match user expectations. On the other hand, AI technologies for generating videos have progressed to the point where they can depict diverse and natural human movements based on prompts. Therefore, this paper introduces a novel pipeline, TVMP, that synthesizes LLMs with advanced text processing capabilities and video generation models with the ability to generate videos containing a variety of motions. The pipeline first generates videos from text input, then estimates the motions from the generated videos, and lastly exports the estimated motion data into the avatars in the metaverse. Feedback on the TVMP prototype suggests further refinement is needed, such as speed control, display of the progress, and direct edition for contextual relevance and usability enhancements. The proposed method enables AI avatars to perform highly adaptive and diverse movements to fulfill user expectations and contributes to developing a more immersive metaverse.},

keywords = {Ambient intelligence, Design and evaluation methods, Distributed computer systems, Human-centered computing, Language Model, Metaverses, Processing capability, Text-processing, Treemap, Treemaps, Visualization, Visualization design and evaluation method, Visualization design and evaluation methods, Visualization designs, Visualization technique, Visualization techniques},

pubstate = {published},

tppubtype = {inproceedings}

}

Kai, W.-H.; Xing, K.-X.

Video-driven musical composition using large language model with memory-augmented state space Journal Article

In: Visual Computer, vol. 41, no. 5, pp. 3345–3357, 2025, ISSN: 0178-2789; 1432-2315, (Publisher: Springer Science and Business Media Deutschland GmbH).

Tags: 'current, Associative storage, Augmented Reality, Augmented state space, Computer simulation languages, Computer system recovery, Distributed computer systems, HTTP, Language Model, Large language model, Long-term video-to-music generation, Mamba, Memory architecture, Memory-augmented, Modeling languages, Music, Musical composition, Natural language processing systems, Object oriented programming, Performance, Problem oriented languages, State space, State-space

@article{kai_video-driven_2025,

title = {Video-driven musical composition using large language model with memory-augmented state space},

author = {W.-H. Kai and K.-X. Xing},

url = {https://www.scopus.com/inward/record.uri?eid=2-s2.0-105001073242&doi=10.1007%2Fs00371-024-03606-w&partnerID=40&md5=71a40ea7584c5a5f210afc1c30aac468},

doi = {10.1007/s00371-024-03606-w},

issn = {0178-2789; 1432-2315},

year  = {2025},

date = {2025-01-01},

journal = {Visual Computer},

volume = {41},

number = {5},

pages = {3345--3357},

abstract = {The current landscape of research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, videos, etc. However, the research work on LLMs for music inspiration is still in its infancy. To fill the gap in this field and break through the dilemma that LLMs can only understand short videos with limited frames, we propose a large language model with state space for long-term video-to-music generation. To capture long-range dependency and maintaining high performance, while further decrease the computing cost, our overall network includes the Enhanced Video Mamba, which incorporates continuous moving window partitioning and local feature augmentation, and a long-term memory bank that captures and aggregates historical video information to mitigate information loss in long sequences. This framework achieves both subquadratic-time computation and near-linear memory complexity, enabling effective long-term video-to-music generation. We conduct a thorough evaluation of our proposed framework. The experimental results demonstrate that our model achieves or surpasses the performance of the current state-of-the-art models. Our code released on https://github.com/kai211233/S2L2-V2M.},

note = {Publisher: Springer Science and Business Media Deutschland GmbH},

keywords = {'current, Associative storage, Augmented Reality, Augmented state space, Computer simulation languages, Computer system recovery, Distributed computer systems, HTTP, Language Model, Large language model, Long-term video-to-music generation, Mamba, Memory architecture, Memory-augmented, Modeling languages, Music, Musical composition, Natural language processing systems, Object oriented programming, Performance, Problem oriented languages, State space, State-space},

pubstate = {published},

tppubtype = {article}

}