Procurement Benchmark Webapp
2025 · Sentence Transformers · BERTopic · OpenAI API · ChromaDB · Streamlit
Impact
- 1 buyer in active use, reducing a manual PO benchmarking task from 30 minutes to under 5 minutes
- Iterating on search calibration based on buyer feedback before wider rollout
Business Problem
Industrial procurement data consists of free-text line items spread across thousands of purchase orders, with no shared category structure. The same part (a pump seal, a hex bolt) appears under dozens of different descriptions depending on buyer, supplier, and year.
As a result, spend cannot be aggregated, prices cannot be compared, and benchmarking is impossible. The problem is not missing data; it is the absence of a shared semantic structure that would make raw line items comparable.
Solution Design
The system has four components built around a three-level taxonomy of item clusters: L1 categories, L2 subclusters, and L3 leaf clusters.
Hierarchical clustering builds the taxonomy through a deterministic-first, LLM-assisted pipeline. TF-IDF and token overlap handle the common case; GPT handles naming and edge cases, with names scored on a 5-point scale and regenerated below threshold.
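The deterministic-first routing described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses only token overlap (Jaccard similarity) and leaves out TF-IDF weighting, and the names `tokens`, `jaccard`, `route`, the sample clusters, and the 0.5 threshold are all assumptions made for the example.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a line-item description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Token overlap: |A & B| / |A | B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def route(description: str, clusters: dict[str, set[str]], threshold: float = 0.5):
    """Return (cluster_name, score) when the overlap is decisive.

    Returns (None, score) otherwise, so the caller can defer the item
    to the LLM-assisted step. The threshold is a hypothetical value.
    """
    item = tokens(description)
    best_name, best_score = None, 0.0
    for name, vocab in sorted(clusters.items()):  # sorted for deterministic ties
        score = jaccard(item, vocab)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score

# Hypothetical cluster vocabularies for illustration only.
clusters = {
    "pump seals": tokens("mechanical pump seal"),
    "hex bolts": tokens("hex bolt m8 steel"),
}

print(route("Pump seal mechanical 40mm", clusters))  # decisive overlap
print(route("gasket kit", clusters))                 # no overlap: defer
```

Only items that fall through with `None` would reach the LLM, which keeps the common case cheap and reproducible.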
Misfit detection flags leaves in the wrong category using two signals: intra-subcluster vocabulary profiling (leaves matching zero or one core keyword) and supplier domain profiling (items whose supplier exclusively operates in a different product domain).
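As a rough sketch of the two signals, assuming a "core keyword" is a token that appears in at least half of a subcluster's leaves and that each supplier's observed product domains are already known. The function names, the 0.5 share, and the sample data are hypothetical.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase alphanumeric tokens of a leaf name."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def core_keywords(leaf_names, min_share=0.5):
    """Tokens present in at least `min_share` of the subcluster's leaves."""
    counts = Counter(tok for name in leaf_names for tok in tokens(name))
    cutoff = min_share * len(leaf_names)
    return {tok for tok, n in counts.items() if n >= cutoff}

def vocabulary_misfits(leaf_names, min_share=0.5, max_hits=1):
    """Flag leaves that match zero or one core keyword."""
    core = core_keywords(leaf_names, min_share)
    return [name for name in leaf_names if len(tokens(name) & core) <= max_hits]

def supplier_misfits(assignments, supplier_domains):
    """Flag leaves whose supplier operates exclusively in another domain.

    assignments: iterable of (leaf, supplier, assigned_domain)
    supplier_domains: supplier -> set of domains the supplier is seen in
    """
    flagged = []
    for leaf, supplier, domain in assignments:
        domains = supplier_domains.get(supplier, set())
        if len(domains) == 1 and domain not in domains:
            flagged.append(leaf)
    return flagged

# Hypothetical subcluster for illustration only.
leaves = [
    "pump seal mechanical",
    "pump seal rubber",
    "mechanical pump seal kit",
    "hex bolt m8",
]
print(vocabulary_misfits(leaves))
```

Both functions only flag candidates; whether a flagged leaf actually moves remains a separate decision.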
Hybrid search queries the full PO corpus by free-text description. BM25 catches exact part numbers; dense vector search via ChromaDB catches semantic equivalents across languages. Reciprocal rank fusion (RRF) merges the two ranked lists; a cross-encoder reranker then rescores the top 100 candidates before results are returned with cluster label, price statistics, supplier list, and order history.
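The fusion step can be illustrated with a small, self-contained sketch of reciprocal rank fusion, where each document scores the sum of 1 / (k + rank) over the lists it appears in. The constant k = 60 is the value commonly used for RRF and is an assumption here, as are the function name and the sample PO ids; the BM25 and ChromaDB retrieval calls themselves are not shown.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document ids with reciprocal rank fusion.

    rankings: list of lists, each ordered best-first.
    Returns document ids ordered by fused score (ties broken by id).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical result lists from the two retrievers.
bm25_hits = ["po3", "po1", "po7"]
dense_hits = ["po1", "po5", "po3"]
print(rrf_fuse([bm25_hits, dense_hits]))
```

Because RRF uses ranks rather than raw scores, BM25 and cosine similarities never need to be normalized onto a common scale.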
Streamlit benchmark UI surfaces results as a procurement decision tool: cluster summary cards, historical PO table, price trend scatter with inflation comparison, and supplier breakdown. Officers search by description or upload a supplier invoice for automatic line-by-line price comparison.
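A line-by-line comparison of this kind might look like the sketch below. It assumes the benchmark for a line is the median of the matched cluster's historical unit prices; the source only says "price statistics", so the choice of median, the function name, and the sample figures are all hypothetical.

```python
from statistics import median

def compare_line(quoted_price, historical_prices):
    """Compare one quoted unit price against the cluster's price history.

    Returns the benchmark (median, an assumption) and the percentage
    deviation of the quote from that benchmark.
    """
    benchmark = median(historical_prices)
    deviation_pct = (quoted_price - benchmark) / benchmark * 100
    return {"benchmark": benchmark, "deviation_pct": round(deviation_pct, 1)}

# Hypothetical invoice line and price history.
print(compare_line(12.0, [9.0, 10.0, 11.0]))
```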
Technical Challenges
Free-text variance at scale. The same part can appear under 20 or more descriptions. Keyword rules cover 80% of cases; semantic similarity handles the long tail. LLM calls on every item would be non-deterministic and expensive, so they are reserved for naming and edge cases.
Taxonomy correctness under iteration. Fixing one misfit can introduce another. A validation suite runs after every script: no orphaned leaves, no duplicate subcluster names, leaf count invariant across runs.
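The three checks named above could be sketched as a single function. This assumes the taxonomy is held as a nested dict of the form {L1 name: {L2 name: [leaf ids]}} and that the full set of leaf ids is known up front; the data shape and the name `validate_taxonomy` are illustrative assumptions.

```python
from collections import Counter

def validate_taxonomy(taxonomy, all_leaves):
    """Return a list of error strings; an empty list means all checks pass.

    taxonomy: {l1_name: {l2_name: [leaf_id, ...]}}  (assumed shape)
    all_leaves: every leaf id that must remain assigned
    """
    errors = []
    assigned = [
        leaf
        for subclusters in taxonomy.values()
        for leaves in subclusters.values()
        for leaf in leaves
    ]

    # Check 1: no orphaned leaves.
    orphans = set(all_leaves) - set(assigned)
    if orphans:
        errors.append(f"orphaned leaves: {sorted(orphans)}")

    # Check 2: no duplicate subcluster names across the whole taxonomy.
    names = Counter(l2 for subclusters in taxonomy.values() for l2 in subclusters)
    dupes = sorted(name for name, count in names.items() if count > 1)
    if dupes:
        errors.append(f"duplicate subcluster names: {dupes}")

    # Check 3: leaf count is invariant.
    if len(assigned) != len(all_leaves):
        errors.append(f"leaf count changed: {len(assigned)} != {len(all_leaves)}")
    return errors

# Hypothetical taxonomy with a duplicate name and a missing leaf.
broken = {"Rotating equipment": {"seals": ["leaf-1"]}, "Fasteners": {"seals": ["leaf-2"]}}
print(validate_taxonomy(broken, ["leaf-1", "leaf-2", "leaf-3"]))
```

Running a function like this at the end of every script turns "fixing one misfit introduced another" into an immediate failure rather than a later surprise.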
Leaf cluster immutability. Individual items are never reassigned out of their leaf; only whole leaf clusters move. L3 is treated as fixed, and the L1/L2 hierarchy is built around it.
Search strictness calibration. BM25 and dense retrieval have different precision-recall tradeoffs on procurement text. A search that is too strict drops valid comparables; one that is too loose pollutes the benchmark with irrelevant prices.
No ground truth. Quality is assessed through SME review and semantic coherence metrics. The first real signal comes from officers using the tool against live supplier quotes.
Status
- 1 buyer in active use, benchmarking industrial PO line items against cluster price history
- Search calibration ongoing based on buyer feedback; wider rollout pending acceptance threshold
- Streamlit UI deployed with invoice upload and line-by-line price comparison live
Next Steps
- Measure search approval rate: the share of results buyers accept as valid comparables; this is the primary quality signal for calibration
- Measure invoice match correctness: how accurately incoming line items resolve to the correct cluster and price family
- Live rollout: onboard remaining buyers once match correctness meets the acceptance threshold
- Dockerize: package the app as a container for deployment on the company cloud platform