AI-powered catalog enrichment platform for a B2B tech eCommerce business
Design and implementation of an internal platform combining AI, external source integration and human validation to govern product data at scale for a B2B technology eCommerce business.
Summary
A B2B distributor with a large technical catalog needed to improve the quality of its product data: inconsistent names, sparse SEO metadata and incomplete technical attributes were limiting filtering and conversion. An internal AI-assisted enrichment platform was designed and built, combining controlled generation, integration with a structured external source and human validation workflows, all governed through an operational interface for the catalog team.
Context
The client operates a B2B eCommerce business in the technology sector, with thousands of references from multiple manufacturers and suppliers. The corporate ERP, based on Firebird, acts as the commercial source of truth: it holds the products, categories and canonical attributes that are eventually published on the online store.
The catalog grows continuously and heterogeneously. Data arrives with inconsistent commercial names, variable-quality descriptions and partial technical attributes. In a sector where buyers filter by specifications (RAM, storage, connectivity, panel type, GPU, etc.), missing structure directly impacts the buying experience and SEO.
The internal team was making these improvements manually and without clear traceability, an approach that could not scale at the pace the catalog demanded.
Problem
The symptoms were consistent and well-known to the business, but their root cause was spread across data, processes and tooling:
- Inconsistent product names, poorly optimized for SEO or for the buyer.
- Missing meta descriptions and no structured tags for internal search.
- Incomplete or scattered technical attributes, often with non-normalized values (“Red”, “red”, “RED” for the same concept).
- External data from suppliers and sources like Icecat with their own taxonomies, with no formal mapping to the internal model.
- Slow manual processes, hard to audit and dependent on specific individuals.
A purely manual approach was no longer sustainable. Equally, a solution based solely on generative AI without data governance would have introduced new problems: hallucinations, out-of-vocabulary values and no control over what enters the ERP.
Goals
- Improve the quality of names, metadata and attributes at scale.
- Keep clear human control over what gets written to the catalog.
- Reduce the repetitive manual workload on the catalog team.
- Make all bulk changes to the catalog fully auditable.
- Integrate structured external data without breaking the internal canonical model.
- Allow rules and prompts to be changed without deploying code.
Technical approach
A modular Python platform was designed with a Flask web interface, clearly separating the AI engine, business logic, operational persistence and integration with external systems.
Two-layer data architecture. The Firebird ERP was kept as the commercial source of truth. On top of it, an operational layer in PostgreSQL concentrates configuration, prompts, execution logs, response caches, pending candidates and mappings with external sources. This allows iterating on the enrichment process without touching the ERP until writeback time.
Prompt governance in the database. Prompts and model configuration do not live in code; they live in dedicated tables. This allows versioning them, adjusting them without deploying, and resolving them hierarchically by section > family > subfamily, with global templates that can be combined with category-specific rules.
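The hierarchical resolution described above can be sketched roughly as follows. This is a minimal illustration, not the platform's actual code: the in-memory PROMPTS dictionary stands in for the prompt tables, and the key layout, the function name resolve_prompt and the sample prompt texts are all assumptions made for the example.

```python
# Hypothetical stand-in for the prompt table. Keys are
# (section, family, subfamily); None means "applies to any".
GLOBAL_KEY = (None, None, None)

PROMPTS = {
    GLOBAL_KEY: "Write a clear, accurate product name.",
    ("computing", None, None): "Lead with brand and model.",
    ("computing", "laptops", None): "Include screen size, CPU and RAM.",
}


def resolve_prompt(section, family=None, subfamily=None, prompts=PROMPTS):
    """Combine the global template with the most specific matching rule."""
    # Try the most specific level first, then fall back level by level.
    candidates = [
        (section, family, subfamily),
        (section, family, None),
        (section, None, None),
    ]
    specific = next(
        (prompts[key] for key in candidates
         if key in prompts and key != GLOBAL_KEY),
        None,
    )
    global_template = prompts.get(GLOBAL_KEY, "")
    # Skip empty parts so a missing level does not leave blank lines.
    return "\n".join(part for part in (global_template, specific) if part)


if __name__ == "__main__":
    print(resolve_prompt("computing", "laptops", "gaming"))
```

In this sketch a subfamily with no rule of its own inherits the family rule, and a section with no rules at all receives only the global template.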
Specialized AI pipelines. Three differentiated engines were built sharing the same internal structure: SEO renaming, meta description and tag generation, and technical attribute discovery/extraction. Each one operates against controlled vocabularies, distinguishes literal from interpreted attributes and supports measurable and boolean attributes.
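One way to picture engines that share an internal structure is a small base class with per-engine steps. The sketch below is an assumption about shape only: the class names, the build_prompt and parse steps, and the generate callable (a stand-in for the model call) are hypothetical and not taken from the project.

```python
from abc import ABC, abstractmethod


class EnrichmentPipeline(ABC):
    """Shared skeleton: build a prompt, call the model, parse the answer."""

    def run(self, product, generate):
        # `generate` is any callable taking a prompt string and returning
        # text; here it stands in for the real model client.
        prompt = self.build_prompt(product)
        return self.parse(generate(prompt))

    @abstractmethod
    def build_prompt(self, product):
        ...

    @abstractmethod
    def parse(self, raw):
        ...


class RenamePipeline(EnrichmentPipeline):
    def build_prompt(self, product):
        return f"Rewrite this product name for search: {product['name']}"

    def parse(self, raw):
        return {"name": raw.strip()}


class TagPipeline(EnrichmentPipeline):
    def build_prompt(self, product):
        return f"List search tags, comma separated, for: {product['name']}"

    def parse(self, raw):
        # Normalize to lowercase and drop empty fragments.
        return {"tags": [t.strip().lower() for t in raw.split(",") if t.strip()]}


if __name__ == "__main__":
    fake_model = lambda prompt: " Laptop, 16GB RAM , "
    print(TagPipeline().run({"name": "Laptop X"}, fake_model))
```

Keeping the model call behind a plain callable makes each engine testable without network access.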
Closed vocabulary and slug validation. To prevent the model from inventing values, attribute extraction works against an approved vocabulary. Unknown values are not written; they are sent to a human review queue. This preserves catalog integrity at the cost of some additional approval work, an intentional trade-off.
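A minimal sketch of this kind of check, assuming values are compared as normalized slugs. The slugify helper, the APPROVED_VOCABULARY contents and the function name validate_values are illustrative assumptions, not the platform's real vocabulary or API.

```python
import re
import unicodedata


def slugify(value):
    """Lowercase, strip accents and collapse non-alphanumerics to hyphens."""
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


# Hypothetical approved vocabulary: attribute -> set of allowed slugs.
APPROVED_VOCABULARY = {"color": {"red", "black", "space-gray"}}


def validate_values(attribute, raw_values, vocabulary=APPROVED_VOCABULARY):
    """Split model output into accepted slugs and values needing review."""
    allowed = vocabulary.get(attribute, set())
    accepted, review_queue = [], []
    for raw in raw_values:
        slug = slugify(raw)
        if slug in allowed:
            accepted.append(slug)
        else:
            # Unknown values are never written; a person decides later.
            review_queue.append(raw)
    return accepted, review_queue


if __name__ == "__main__":
    print(validate_values("color", ["RED", "Space Gray", "Teal"]))
```

Because comparison happens on slugs, “Red”, “red” and “RED” all resolve to the same approved value, while anything outside the vocabulary keeps its original spelling for the reviewer.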
Icecat integration. An external enrichment layer was built with raw response caching, category synchronization, a feature explorer, alias management and a candidate approval flow before writeback. It converts heterogeneous external data into usable internal data.
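The alias and candidate steps could look something like the sketch below. It does not call Icecat or reproduce its response format: the feature names, the ALIASES table, the Candidate fields and build_candidates are all hypothetical, and the input is assumed to be an already-parsed mapping of external feature names to values.

```python
from dataclasses import dataclass

# Hypothetical alias table: external feature name -> internal attribute.
ALIASES = {
    "internal memory": "ram",
    "display diagonal": "screen_size",
}


@dataclass
class Candidate:
    sku: str
    attribute: str
    value: str
    source: str = "icecat"
    status: str = "pending"  # nothing reaches the ERP until approved


def build_candidates(sku, external_features, aliases=ALIASES):
    """Map external features to internal attributes via the alias table."""
    candidates, unmapped = [], []
    for name, value in external_features.items():
        internal = aliases.get(name.strip().lower())
        if internal is None:
            # No alias yet: surface the feature so someone can map it.
            unmapped.append(name)
        else:
            candidates.append(Candidate(sku, internal, value))
    return candidates, unmapped


if __name__ == "__main__":
    features = {"Internal memory": "16 GB", "Cable lock slot": "Yes"}
    print(build_candidates("SKU-1", features))
```

Features without an alias are reported rather than dropped, which is what lets the mapping grow over time without guessing.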
Preview and execute modes. All critical processes can be run in simulation before writing to the ERP, reducing the risk of bulk changes.
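As an illustration of the two modes, the sketch below computes the same change report in both cases and only calls the writer when execution is requested. The function name apply_changes, the tuple layout and the write callable (a stand-in for the ERP update) are assumptions for the example.

```python
def apply_changes(changes, write, execute=False):
    """Report proposed changes; write them only when execute is True.

    changes: iterable of (sku, field, old_value, new_value) tuples.
    write:   callable (sku, field, new_value) performing the real update.
    """
    report = []
    for sku, field, old, new in changes:
        if old == new:
            continue  # nothing to change, keep the report focused
        if execute:
            write(sku, field, new)
        report.append(
            {"sku": sku, "field": field, "old": old, "new": new,
             "written": execute}
        )
    return report


if __name__ == "__main__":
    proposed = [("SKU-1", "name", "lptp 15", "Laptop 15-inch")]
    # Preview: the writer is never called.
    print(apply_changes(proposed, write=lambda *args: None))
```

Sharing one code path for both modes means the preview shows exactly what an execute run would do.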
Automation with audit trail. A cron-based batch system allows running renaming, meta, attribute extraction and gap fill unattended, leaving detailed logs per product and per execution.
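A rough sketch of per-product, per-execution logging in an unattended run, under the assumption that each run gets an identifier shared by all its entries. The run_batch function, the entry fields and the process callable are hypothetical, and the entries are returned as a list here rather than stored in the operational database.

```python
import uuid
from datetime import datetime, timezone


def run_batch(job, skus, process):
    """Run `process` for each SKU, recording one log entry per product."""
    run_id = str(uuid.uuid4())  # ties every entry to this execution
    entries = []
    for sku in skus:
        try:
            detail, status = process(sku), "ok"
        except Exception as exc:
            # One failing product must not stop the rest of the batch.
            detail, status = str(exc), "error"
        entries.append(
            {
                "run_id": run_id,
                "job": job,
                "sku": sku,
                "status": status,
                "detail": detail,
                "at": datetime.now(timezone.utc).isoformat(),
            }
        )
    return run_id, entries


if __name__ == "__main__":
    run_id, log = run_batch("rename", ["SKU-1", "SKU-2"], lambda sku: "renamed")
    for entry in log:
        print(entry)
```

Catching errors per product keeps a scheduled run going and leaves a record of exactly which references failed and why.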
Decisions and trade-offs
A solution based solely on LLMs without a schema was ruled out, because it would have amplified consistency problems. Instead, the AI operates within a closed-vocabulary framework with human validation, trading some automatic coverage for reliability.
Two databases (Firebird as commercial truth, PostgreSQL as operational layer) were chosen instead of extending the ERP. This adds synchronization complexity but protects the critical system and allows fast iteration on the AI layer.
Database-driven configuration was prioritized over code configuration. This makes the initial setup more expensive, but allows the catalog team to adjust behavior without touching deployments.
Fully automating the attribute writeback was deliberately avoided. While it would be faster, human-assisted review is key to maintaining the business’s trust in the system.
Result
- Significant reduction in manual work on names, metas and attributes.
- Greater consistency across categories and manufacturers.
- Much higher technical attribute coverage, with better filtering experience.
- Auditable processes with per-product and per-batch logs.
- Capacity to process thousands of references per batch in scheduled runs.
- Stable integration with structured external data without polluting the canonical model.
- Ability to change rules and prompts live without deploying code.
Tech stack
- Python (Flask)
- PostgreSQL
- Firebird (ERP)
- OpenAI API
- Icecat integration
- Gunicorn, Docker, Docker Compose
- Cron for unattended execution
What this case demonstrates
This case shows the ability to integrate applied AI into a real business system without building something fragile: clean architecture, data governance, integration with an existing ERP, structured external sources and human validation working together. It reflects judgment about what to automate, what to leave in the team’s hands and how to protect operations throughout the process.
If your company manages a large catalog and is starting to notice that data quality is limiting SEO, filtering or conversion, I can help you design a realistic, auditable enrichment process that works with your current ERP.