Please use this identifier to cite or link to this item: http://dspace.aiub.edu:8080/jspui/handle/123456789/2919
Title: A multimodal deep learning framework for integrating visual, textual and categorical features in retail price estimation
Authors: Kazi, Redwan
Sourav, Datto
Mustakim, Ahmed
Habibur Rahman, Masum
Md. Faruk Abdullah Al, Sohan
Abu, Shufian
Keywords: Category embedding
Functional API
GloVe embedding
Multimodal price prediction
Sequential API
Issue Date: 30-Oct-2025
Publisher: Elsevier Array
Citation: 932
Series/Report no.: 28;3
Abstract: Accurate product price prediction is a major challenge in modern e-commerce. Product value depends on images, textual descriptions, and categorical attributes, yet many methods underutilize these modalities jointly. This paper presents a multimodal deep learning framework for price regression using two architectures. The first is an attention-based Functional API model that applies EfficientNetB1 for visual features, pretrained GloVe embeddings with a bidirectional LSTM for text, and trainable embeddings for categorical inputs, combined via late fusion. The second is a lightweight Sequential API model that uses a compact convolutional network for image features and a dense layer to merge modalities, targeting computational efficiency for resource-limited deployments. Both models are trained on the same category-filtered dataset (≥20 samples per class) with a single price-decile-stratified train/validation/test split (seed 42). Evaluation uses Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²) on the original price scale after inverse transformation. Headline numbers are aggregated over repeated initializations on the same split (seeds {13, 21, 42, 77, 123}) and reported as mean ± SD with 95% confidence intervals. The Functional API model shows stronger performance under the same split and seeds, achieving MAE = 5.37 ± 0.08 (95% CI [5.27, 5.47]), RMSE = 7.32 ± 0.08 (95% CI [7.22, 7.42]), and R² = 0.702 ± 0.007 (95% CI [0.693, 0.711]). The Sequential API model attains lower accuracy (MAE = 8.13, RMSE = 10.88, R² = 0.43) but reduced training time and memory footprint. Ablation studies on the same split with repeated seeds isolate the effects of visual backbones, text encoders, fusion with and without attention, and loss functions. Preprocessing details, calibration checks, and decile- and category-wise error summaries support transparency and reproducibility. The results establish a clear benchmark for multimodal regression in retail pricing, balancing predictive accuracy with operational feasibility.
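The abstract describes a late-fusion design built with the Keras Functional API: an EfficientNetB1 visual branch, a GloVe-initialized bidirectional LSTM text branch, and a trainable categorical embedding, concatenated before a regression head. The sketch below shows one way such a model could be wired; it is not the authors' implementation. The input resolution, vocabulary size (VOCAB_SIZE), sequence length, embedding dimensions, number of categories, the use of a Keras Attention layer for the attention step, and the MAE loss are illustrative assumptions.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

# Assumed, illustrative sizes (not taken from the paper)
VOCAB_SIZE, SEQ_LEN, GLOVE_DIM = 20_000, 100, 100
NUM_CATEGORIES, IMG_SIZE = 50, 240

# --- Visual branch: EfficientNetB1 used as a frozen feature extractor ---
image_in = layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3), name="image")
backbone = tf.keras.applications.EfficientNetB1(
    include_top=False, weights="imagenet", pooling="avg")
backbone.trainable = False
img_feat = backbone(image_in)                              # (batch, 1280)

# --- Text branch: GloVe-initialized embedding + bidirectional LSTM + attention ---
text_in = layers.Input(shape=(SEQ_LEN,), dtype="int32", name="text")
text_emb = layers.Embedding(VOCAB_SIZE, GLOVE_DIM, trainable=False, name="glove")
txt = text_emb(text_in)
txt = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(txt)
txt = layers.Attention()([txt, txt])                       # one simple self-attention choice
txt_feat = layers.GlobalAveragePooling1D()(txt)            # (batch, 128)

# --- Categorical branch: trainable category embedding ---
cat_in = layers.Input(shape=(1,), dtype="int32", name="category")
cat_feat = layers.Flatten()(layers.Embedding(NUM_CATEGORIES, 16)(cat_in))

# --- Late fusion and regression head ---
fused = layers.Concatenate()([img_feat, txt_feat, cat_feat])
x = layers.Dense(256, activation="relu")(fused)
x = layers.Dropout(0.3)(x)
price = layers.Dense(1, name="price")(x)                   # predicts the (scaled) price

model = Model(inputs=[image_in, text_in, cat_in], outputs=price)

# Placeholder GloVe matrix; in practice row i would hold the GloVe vector of
# token i in the tokenizer's vocabulary, loaded from the GloVe text file.
glove_matrix = np.random.normal(size=(VOCAB_SIZE, GLOVE_DIM)).astype("float32")
text_emb.set_weights([glove_matrix])

model.compile(optimizer="adam", loss="mae",
              metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.summary()

As in the abstract, predictions from such a model would be mapped back to the original price scale (inverting whatever target scaling was applied) before computing MAE, RMSE, and R².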
URI: http://dspace.aiub.edu:8080/jspui/handle/123456789/2919
ISSN: 0142-0615
Appears in Collections:Publications From Faculty of Engineering

Files in This Item:
File: Shufian_2025_Elsevier (Array).docx
Description: Shufian_2025_Elsevier (Array)
Size: 4.14 MB
Format: Microsoft Word XML

