Multimodal Price Regressor

Smart product price prediction - Amazon ML Challenge 2025

[Code] [Data]

As part of team SPAM_LLMs, we achieved 6th place overall in the Amazon ML Challenge 2025, a national competition focused on smart product pricing. Our solution secured the 3rd position on the public leaderboard and 5th on the private leaderboard.

This project showcases a multimodal learning approach to predicting product prices from text descriptions and images. Using large, pre-trained embedding models without any fine-tuning, our architecture effectively handled complex, real-world e-commerce data and achieved high accuracy.


Our solution combined text and image modalities using separate networks before a final regression head.

Competition Highlights

Key Achievements

  • Top National Ranking: Finished 3rd (Public LB) and 5th (Private LB), placing us among the top 6 teams in a highly competitive national challenge.
  • Advanced Multimodal Architecture: Designed and implemented a novel system that processes text and image data in parallel to predict prices.
  • Strategic Model Selection: Demonstrated that large, frozen pre-trained models (Qwen-3-4B, SigLIP2, DINOv3) can outperform fine-tuned smaller models, providing a crucial performance edge.
  • Effective Data Preprocessing: Developed a comprehensive pipeline for cleaning and normalizing noisy catalog data, which was critical for model performance.

Methodology

1. Business Problem

In e-commerce, setting the optimal price for a product is vital for success. The challenge was to build a machine learning model that analyzes a product's text and image details to accurately predict its price, tackling complexities like brand value, specifications, and quantity.

2. Data Handling & Preprocessing

  • Transformation: Applied a log1p transformation to the right-skewed price data to normalize its distribution.
  • Text Cleaning: Parsed the catalog_content field into description, bullet points, and quantity. We also standardized inconsistent units (e.g., “g” and “gm” to “grams”) and spelled out numbers as words to improve embedding quality.
  • Feature Selection: To handle large text fields, we prioritized the product description when available, otherwise falling back to the top five bullet points. A sketch of these steps is shown after this list.
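A minimal sketch of the preprocessing above, assuming a pandas DataFrame with `catalog_content` and `price` columns as in the competition data; helper names like `normalize_units` and the unit map are our own illustration, not the exact pipeline:

```python
import re
import numpy as np
import pandas as pd

# Hypothetical unit map; the real pipeline would cover the variants in the catalog.
UNIT_MAP = {r"\b(g|gm|gms)\b": "grams", r"\b(ml|mls)\b": "milliliters"}

def normalize_units(text: str) -> str:
    """Standardize inconsistent unit strings (e.g., 'g', 'gm' -> 'grams')."""
    text = text.lower()
    for pattern, unit in UNIT_MAP.items():
        text = re.sub(pattern, unit, text)
    return text

def select_text(description: str, bullets: list[str]) -> str:
    """Prefer the full description; otherwise fall back to the top five bullets."""
    if description and description.strip():
        return description
    return " ".join(bullets[:5])

df = pd.read_csv("train.csv")                    # assumed file layout
df["target"] = np.log1p(df["price"])             # log1p tames the right-skewed prices
df["text"] = df["catalog_content"].map(normalize_units)
```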

3. Multimodal Architecture

  • Embedding Generation: Utilized powerful, pre-trained models to generate high-quality embeddings for each modality:
    • Text: Qwen-3-4B
    • Text + Image: SigLIP2 Giant
    • Image: DINOv3
  • Modality-Specific Networks: Fed the embeddings from each modality into separate, dedicated neural networks. This allowed the model to learn modality-specific representations before fusing the information.
  • Final Regressor: Concatenated the outputs from the modality-specific networks and passed them through a final regression head to predict the log-transformed price (see the sketch after this list).
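A minimal PyTorch sketch of this fusion design, as referenced above. The embedding dimensions and hidden sizes are illustrative assumptions, not the exact competition configuration:

```python
import torch
import torch.nn as nn

class ModalityNet(nn.Module):
    """Small MLP mapping one modality's frozen embedding to a shared size."""
    def __init__(self, in_dim: int, out_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(512, out_dim), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class PriceRegressor(nn.Module):
    """Fuses per-modality representations and predicts log1p(price)."""
    def __init__(self, text_dim: int, clip_dim: int, img_dim: int):
        super().__init__()
        self.text_net = ModalityNet(text_dim)   # Qwen text embeddings
        self.clip_net = ModalityNet(clip_dim)   # SigLIP2 text+image embeddings
        self.img_net = ModalityNet(img_dim)     # DINOv3 image embeddings
        self.head = nn.Sequential(
            nn.Linear(3 * 256, 128), nn.ReLU(),
            nn.Linear(128, 1),                  # scalar log-price output
        )

    def forward(self, text, clip, img):
        fused = torch.cat(
            [self.text_net(text), self.clip_net(clip), self.img_net(img)], dim=-1
        )
        return self.head(fused).squeeze(-1)

# Assumed embedding sizes; the real values depend on the chosen checkpoints.
model = PriceRegressor(text_dim=2560, clip_dim=1536, img_dim=1024)
```

Projecting each modality to a shared size before concatenation keeps any single embedding (here, the largest text vector) from dominating the fused representation.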

4. Training and Evaluation

  • Loss Function: Trained with MSE on the log1p-transformed prices, which is equivalent to mean squared logarithmic error on raw prices and matches the transformed target.
  • Evaluation Metric: The competition ranked submissions by Symmetric Mean Absolute Percentage Error (SMAPE), where a lower score is better; a reference implementation follows below.
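For reference, SMAPE as commonly defined (the competition's exact variant, e.g., its handling of zeros, may differ slightly):

```python
import numpy as np

def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """SMAPE in percent: mean absolute error over the average magnitude of each pair."""
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    return float(np.mean(np.abs(y_pred - y_true) / denom) * 100.0)

# Model outputs live in log1p space; invert with expm1 before scoring raw prices.
y_true = np.array([100.0, 250.0, 40.0])
y_pred = np.expm1(np.array([4.7, 5.4, 3.8]))
print(f"SMAPE: {smape(y_true, y_pred):.2f}%")
```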

Insights & Conclusion

  • Power of Large Models: Our results showed that larger, state-of-the-art pre-trained models, even without fine-tuning, can provide superior feature representations compared to fine-tuned smaller, distilled models.
  • Multimodality is Key: Integrating image data provided a significant signal that purely text-based models might miss, confirming the value of a multimodal approach for complex real-world problems.
  • Architecture Matters: Using modality-specific networks to process embeddings before concatenation was a key architectural decision that improved training stability and overall performance.

Team Members

This was a collaborative effort by SPAM_LLMs (IIT ISM Dhanbad):