AHCI RESEARCH GROUP
Publications
Papers published in international journals,
proceedings of conferences, workshops and books.
Scientific Publications
How to
Here you can find the complete list of our publications.
You can use the tag cloud to select only the papers dealing with specific research topics.
You can expand the Abstract, Links and BibTex record for each paper.
2025
Garine, B. M.; Rajeshwari, R.; Motukuri, S.; Reddy, R. S. R.; Srilatha, S.
A NeRF-Based Captioning Framework for Spatially Rich and Context-Aware Image Descriptions (Journal Article)
In: Journal Europeen des Systemes Automatises, vol. 58, no. 5, pp. 1059–1064, 2025, ISSN: 1269-6935; 2116-7087, (Publisher: International Information and Engineering Technology Association).
Abstract | Links | BibTeX | Tags: 3D reconstruction, implicit representation, Neural Radiance Fields (NeRF), photorealistic rendering, RayTracing, scene representation, volumetric rendering
@article{garine_nerf-based_2025,
title = {A NeRF-Based Captioning Framework for Spatially Rich and Context-Aware Image Descriptions},
author = {B. M. Garine and R. Rajeshwari and S. Motukuri and R. S. R. Reddy and S. Srilatha},
url = {https://www.scopus.com/inward/record.uri?eid=2-s2.0-105010157458&doi=10.18280%2Fjesa.580518&partnerID=40&md5=4959d11c61d611f0ec75c5e2a6adbc88},
doi = {10.18280/jesa.580518},
issn = {1269-6935; 2116-7087},
year = {2025},
date = {2025-01-01},
journal = {Journal Europeen des Systemes Automatises},
volume = {58},
number = {5},
pages = {1059–1064},
abstract = {Traditional captioning models depend mainly on 2D visual features, which limits their ability to understand and describe spatial relations, depth, and three-dimensional structure in images. These models struggle to capture object interactions, occlusions, and lighting variations, all of which matter for generating relevant, spatially aware descriptions. To address these limitations, we introduce Neural Radiance Fields Captioning (NeRF-Cap), a new NeRF-based multimodal image-captioning framework that integrates 3D visual reconstruction with natural language processing (NLP). NeRF's ability to build a continuous volumetric representation of a scene from several 2D views enables the recovery of depth-aware and geometrically accurate features, which improves the descriptive power of the generated captions. Our approach also integrates advanced vision-language models such as Bootstrapping Language-Image Pre-training (BLIP), Contrastive Language-Image Pre-training (CLIP), and Large Language Model Meta AI (LLaMA), which incorporate semantic object relations, depth cues, and lighting effects into the captioning process. By taking advantage of NeRF's high-fidelity 3D representation, NeRF-Cap improves on traditional captioning by producing spatially consistent, photorealistic, and geometrically coherent descriptions. We evaluate our method on synthetic and real-world datasets, demonstrating its ability to handle complex spatial properties and its effectiveness in capturing visual dynamics. Experimental results indicate that NeRF-Cap outperforms existing captioning models in spatial awareness, contextual accuracy, and natural language fluency, as measured by standard benchmarks such as Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit Ordering (METEOR), and Consensus-based Image Description Evaluation (CIDEr), together with a novel Depth-Awareness Score. Our work highlights the potential of 3D-aware multimodal captioning, paving the way for more advanced applications in robotic perception, augmented reality, and assistive vision systems.},
note = {Publisher: International Information and Engineering Technology Association},
keywords = {3D reconstruction, implicit representation, Neural Radiance Fields (NeRF), photorealistic rendering, RayTracing, scene representation, volumetric rendering},
pubstate = {published},
tppubtype = {article}
}
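As background for the volumetric rendering this abstract builds on, the sketch below illustrates the standard NeRF rendering quadrature from Mildenhall et al. (ECCV 2020), in which per-sample densities and colors along a camera ray are composited into one pixel color via C = Σᵢ Tᵢ (1 − exp(−σᵢ δᵢ)) cᵢ, with transmittance Tᵢ = exp(−Σⱼ<ᵢ σⱼ δⱼ). This is not code from the paper; all names and values are illustrative.

import numpy as np

def composite_ray(sigmas, colors, t_vals):
    """Composite per-sample densities and RGB colors along one ray
    into a single pixel color (standard NeRF quadrature; illustrative).

    sigmas : (N,)   non-negative volume densities at the N samples
    colors : (N, 3) RGB color predicted at each sample
    t_vals : (N,)   sorted distances of the samples along the ray
    """
    # Distances between adjacent samples; the final interval is open-ended.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity of each interval: alpha_i = 1 - exp(-sigma_i * delta_i).
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance T_i = prod_{j<i} (1 - alpha_j): light surviving to sample i.
    trans = np.cumprod(np.append(1.0, 1.0 - alphas[:-1]))
    weights = trans * alphas                         # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0)   # composited pixel color

# Example: two samples on a ray; the nearer sample is denser and red,
# so the composited color is dominated by red.
rgb = composite_ray(np.array([2.0, 0.5]),
                    np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]),
                    np.array([0.5, 1.0]))

The same per-sample weights also yield a depth estimate (the expected ray-termination distance Σᵢ wᵢ tᵢ), which is the kind of depth-aware feature the abstract describes feeding into the captioning stage.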