Please use this identifier to cite or link to this item:
http://dspace.aiub.edu:8080/jspui/handle/123456789/2919
| Title: | A multimodal deep learning framework for integrating visual, textual and categorical features in retail price estimation |
| Authors: | Kazi, Redwan; Sourav, Datto; Mustakim, Ahmed; Habibur Rahman, Masum; Md. Faruk Abdullah Al, Sohan; Abu, Shufian |
| Keywords: | Category embedding; Functional API; GloVe embedding; Multimodal price prediction; Sequential API |
| Issue Date: | 30-Oct-2025 |
| Publisher: | Elsevier Array |
| Citation: | 932 |
| Series/Report no.: | 28;3 |
| Abstract: | Accurate product price prediction is a major challenge in modern e-commerce. Product value depends on images, textual descriptions, and categorical attributes, yet many methods underutilize these modalities jointly. This paper presents a multimodal deep learning framework for price regression using two architectures. The first is an attention-based Functional API model that applies EfficientNetB1 for visual features, pretrained GloVe embeddings with a bidirectional LSTM for text, and trainable embeddings for categorical inputs, combined via late fusion. The second is a lightweight Sequential API model that uses a compact convolutional network for image features and a dense layer to merge modalities, targeting computational efficiency for resource-limited deployments. Both models are trained on the same category-filtered dataset (≥20 per class) with a single price-decile-stratified train/validation/test split (seed 42). Evaluation uses Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and the coefficient of determination (R²) on the original price scale after inverse transformation. Headline numbers are aggregated over repeated initializations on the same split (seeds {13, 21, 42, 77, 123}) and reported as mean ± SD with 95% confidence intervals. The Functional API model shows stronger performance under the same split and seeds, achieving MAE 5.37 ± 0.08 (95% CI [5.27, 5.47]), RMSE 7.32 ± 0.08 (95% CI [7.22, 7.42]), and R² = 0.702 ± 0.007 (95% CI [0.693, 0.711]). The Sequential API model attains lower accuracy (MAE = 8.13, RMSE = 10.88, R² = 0.43) but reduced training time and memory footprint. Ablation studies on the same split with repeated seeds isolate the effects of visual backbones, text encoders, fusion with/without attention, and loss functions. Preprocessing details, calibration checks, and decile and category-wise error summaries support transparency and reproducibility.
The results establish a clear benchmark for multimodal regression in retail pricing, balancing predictive accuracy with operational feasibility. |
| URI: | http://dspace.aiub.edu:8080/jspui/handle/123456789/2919 |
| ISSN: | 0142-0615 |
| Appears in Collections: | Publications From Faculty of Engineering |
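
The abstract reports each headline metric as mean ± SD over five seeds with a 95% confidence interval, but it does not say how the interval was computed. The sketch below is one plausible reading, not the authors' stated method: it assumes a two-sided Student-t interval on the mean with n = 5 runs (4 degrees of freedom, critical value about 2.776). The function name `t_interval` and its parameters are hypothetical. Under that assumption, the intervals quoted in the abstract are reproduced to the reported precision.

```python
import math

# Assumption: two-sided 95% Student-t critical value for 4 degrees of freedom
# (five seeds). The abstract does not state which interval method was used.
T_CRIT_DF4 = 2.776


def t_interval(mean, sd, n=5, t_crit=T_CRIT_DF4, ndigits=2):
    """Return (lower, upper) of mean ± t_crit * sd / sqrt(n), rounded to ndigits."""
    half_width = t_crit * sd / math.sqrt(n)
    return (round(mean - half_width, ndigits), round(mean + half_width, ndigits))


# Mean and SD values as quoted in the abstract for the Functional API model.
mae_ci = t_interval(5.37, 0.08)
rmse_ci = t_interval(7.32, 0.08)
r2_ci = t_interval(0.702, 0.007, ndigits=3)

print("MAE  95% CI:", mae_ci)
print("RMSE 95% CI:", rmse_ci)
print("R^2  95% CI:", r2_ci)
```

Because the SD values in the abstract are themselves rounded, agreement at two or three decimals is consistent with a t-interval but does not confirm it; the paper's full text remains the authority on the method.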
Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| Shufian_2025_Elsevier (Array).docx | Shufian_2025_Elsevier (Array) | 4.14 MB | Microsoft Word XML | View/Open |