Abstract
The first paper in the LinkedCulture series argued that keyword search alone is insufficient for cultural heritage discovery, because the vocabulary of the cataloger and the vocabulary of the searcher often do not overlap. The second paper described the implementation of a hybrid retrieval system combining keyword search and semantic similarity across a multi-institutional cultural heritage corpus. That implementation adopted a multilingual embedding model after an English-centered embedding setup produced a visible retrieval failure when non-English records entered the index.
This paper reports what happened after that change was tested at larger scale. The central finding is that multilingual retrieval and cross-lingual representation are not the same thing. The multilingual embedding model used by LinkedCulture made records retrievable within multiple languages, but it did not produce a unified cross-language semantic space. French-language records overwhelmingly clustered with other French-language records, while English-language records overwhelmingly clustered with English-language records, even when records from different languages described conceptually related material.
The finding emerged from analysis of a corpus of 302,798 cultural heritage records, including a French-language bloc large enough to make language partitioning visible as structure rather than noise. Nearest-neighbor analysis showed near-total language siloing: French records’ nearest neighbors were 99.9% French despite French records comprising only 17.0% of the corpus. Model bake-off experiments further showed that alternative multilingual text embedding models reduced measured cross-language distance but did not eliminate the partition or produce reliable cross-language ranking. Query translation, rather than model substitution, produced the strongest cross-lingual retrieval improvement.
The paper also documents a related operational evolution in LinkedCulture’s hybrid retrieval layer: the replacement of Reciprocal Rank Fusion with normalized-score fusion and a coordination factor after production use revealed that RRF could elevate weak candidates in sparse-metadata environments. Together, these findings suggest that practical cross-institutional cultural heritage retrieval depends less on selecting a single superior model than on designing orchestration layers that compensate for the structural behavior of models, metadata, and ranking systems. For multilingual cultural heritage indexes, cross-language discovery is presently an orchestration problem, not a solved representation problem.
1. Introduction
The first paper [1] in this series posed a discovery problem. Cultural heritage search interfaces depend heavily on keyword matching, but keyword matching assumes overlap between the terms used by catalogers and the terms used by researchers. That assumption often fails. A collection may contain relevant material, but if the researcher does not know the institution’s vocabulary, the material remains practically undiscoverable.
The second paper [2] described the implementation of the LinkedCulture demonstration platform [3], an experimental cross-institutional retrieval prototype built to investigate that problem. LinkedCulture combines keyword search and semantic similarity over a shared cultural heritage corpus. It translates structured metadata into embeddable prose, generates semantic vectors using Nomic embedding models [4][5], indexes the constructed text in OpenSearch [6], stores vectors in Qdrant [7], and exposes an adjustable hybrid retrieval interface that lets researchers shift between keyword precision and semantic breadth.
At the time of the second paper, the system had already encountered one important failure. The Rijksmuseum collection was initially ingested from a Dutch-language endpoint. English-language queries returned confident-looking but semantically wrong Dutch records. The failure was not obvious from system behavior: results were returned, similarity scores appeared high, and the system gave no indication that it had misunderstood the query. The problem became visible only through human inspection of the results. The solution adopted at that stage was to re-embed the corpus using a multilingual embedding model.
That change appeared promising. Multilingual embeddings seemed to offer the representational layer needed for a cross-institutional and multilingual discovery system. If objects from different institutions and languages could occupy the same semantic space, then a researcher could search across collections without knowing the cataloging language of each source.
The second paper therefore ended with two open questions. First, how would per-institution variation in metadata quality shape the emergent cluster structure of the combined index? Second, would the multilingual representational space surface genuine conceptual relationships across objects described in different languages, or would those associations prove to be artifacts of the embedding model’s training distribution?
This paper answers the second question more directly than expected.
The answer is that the multilingual model did not create a shared cross-language semantic space in the way the project initially hoped. It made the corpus more usable across languages at query time, but the document space itself remained strongly partitioned by cataloging language. French records clustered with French records. English records clustered with English records. The partition persisted across model tests. Even models that reduced the numerical distance between languages did not mix the clusters or produce reliable English-to-French ranking.
That distinction matters. Multilingual retrieval and cross-lingual representation are different capabilities. A model can support multiple languages without placing semantically equivalent records from different languages close enough together for discovery, clustering, or ranking. For cultural heritage systems, where cross-language discovery is a central public-access goal, this difference is not technical trivia. It changes how such systems should be designed.
The argument of this paper is therefore simple:
LinkedCulture’s multilingual embedding made the corpus multilingual, but not genuinely cross-lingual. Cross-language discovery improved only when the system translated the query into the cataloging languages represented in the corpus and searched again. The successful mechanism was not a better static representation of the corpus. It was retrieval orchestration.
A secondary finding, discussed later in the paper, concerns the hybrid retrieval layer itself: continued operation of LinkedCulture showed that Reciprocal Rank Fusion could elevate weak candidates in sparse-metadata conditions, leading to its replacement by normalized-score fusion and a coordination factor.
2. The Blue Bird Probe
The observation that motivated this paper began with a dress.
As LinkedCulture expanded to include Paris Musées, a French-language collection entered the corpus in meaningful quantity. Paris Musées aggregates records from multiple museums operated by the City of Paris, including the Palais Galliera, Musée de la Mode de la Ville de Paris. One early discovery from this dataset was a Jean Paul Gaultier dress titled L’Oiseau Bleu.
The record was exactly the kind of discovery LinkedCulture was built to support. It came from a museum the researcher had not previously known. It surfaced through exploratory search. It connected a specific object to a broader cross-institutional discovery environment. The object became a useful probe because its title has a direct English equivalent: “Blue Bird.”
Under the project’s initial expectation, the multilingual embedding model should have placed “blue bird” and L’Oiseau Bleu close enough together for an English query to surface the French record. The language differs. The concept does not.
That did not happen.
A French query for “oiseau bleu” reached the French record far more effectively than an English query for “blue bird.” In one measurement, the English query reached only 0.23 cosine similarity with the French record, while the French query reached 0.45. The exact values matter less than the direction: the model reached the French record far better from French than from English.
This was not an isolated quirk of one query. English-language queries tended to retrieve English-language records. French-language queries tended to retrieve French-language records. The multilingual model supported retrieval in both languages, but equivalent concepts expressed across languages did not reliably converge in ranking.
At first, this appeared to be a retrieval-fusion problem. LinkedCulture combined keyword and semantic result sets, and hybrid ranking can introduce artifacts. But the same pattern persisted in semantic-only tests. The issue was not merely how results were combined. It was where records and queries lived in the representational space.
The Blue Bird probe turned a practical search problem into a research question:
Does a multilingual embedding model create a unified semantic space for cultural heritage metadata, or does it create a space that remains partitioned by cataloging language?
The rest of this paper investigates that question.
3. What Clustering Means Here
The second paper in this series described the emergent semantic cluster structure of the LinkedCulture index as future work. That phrase needs explanation before the evidence is presented.
An embedding model turns each object’s constructed metadata text into a numerical vector. In practical terms, each object becomes a point in a high-dimensional space. Objects the model considers similar are placed near one another. Objects the model considers less related are placed farther apart. Distance in this space is treated as a proxy for semantic difference.
Clustering is the process of identifying dense neighborhoods of nearby points. A cluster might contain Japanese woodblock prints, ancient coins, illuminated manuscripts, ceremonial textiles, fashion plates, or botanical illustrations. No cataloger assigned those groups. They emerge from the geometry of the embedding space.
LinkedCulture uses the galaxy visualization [8] to make semantic neighborhoods visible. Figure 1 shows a selected cluster and its surrounding semantic neighbors in the two-dimensional projection, representing both a cluster’s internal object membership and its relationship to surrounding neighborhoods.
This distinction matters because the paper is not primarily concerned with the human-readable names assigned to clusters. It is concerned with the structure beneath those names.
If a multilingual embedding model creates a genuinely shared semantic space, then records should cluster by conceptual proximity rather than by cataloging language. French fashion records should be able to appear near English fashion records. French bird-related records should appear near English bird-related records. Language may introduce noise, but it should not dominate the geometry.
If the space is language-partitioned, then records will cluster first by language and only secondarily by content. In that case, a French fashion object may be closer to other French records than to English records describing similar material.
That is what the evidence shows.
4. From a Fix to a Finding
Paper 2 treated multilingual embedding as a practical fix for an English-centered retrieval failure. That was a reasonable assumption at the time. The corpus was still dominated by English-language records, and the French-language component was small enough that deeper structural partitioning was not yet visible.
The corpus has since changed. Paris Musées expanded substantially, from a small initial ingest to 41,564 records. The Minneapolis Institute of Art was also added to the corpus. The full index grew to 302,798 objects. More importantly, the French-language bloc grew to 17.0% of the corpus. At that scale, language partitioning was no longer background noise. It became visible as structure.
The central question changed accordingly. The question was no longer whether multilingual embeddings improved retrieval over an English-centered embedding setup. They did. The question was whether they produced the kind of shared semantic space that cross-language cultural heritage discovery requires.
Three forms of evidence were used:
- Direct nearest-neighbor measurement over the full 302,798-object corpus.
- Cluster and visualization analysis using the galaxy layout.
- A retrieval bake-off over a 13,949-document sample using multiple candidate embedding models and manually defined relevance judgments.
Together, these measurements show that the corpus is strongly organized by cataloging language.
5. Method
The analysis used three complementary methods.
5.1 Corpus and Language Definition
The full corpus analyzed in the primary language-rift measurement contained 302,798 records. Records were drawn from the institutions already described in Papers 1 and 2, with subsequent additions and expanded French-language coverage.
| Institution | Country | Language proxy | Source type | Indexed records |
|---|---|---|---|---|
| Getty Museum | United States | English | SPARQL | 59,979 |
| Art Institute of Chicago | United States | English | REST API | 58,443 |
| Rijksmuseum | Netherlands | English* | REST API | 47,156 |
| Paris Musées | France | French | GraphQL | 41,564 |
| Cleveland Museum of Art | United States | English | REST API | 41,279 |
| Minneapolis Institute of Art | United States | English | REST API | 33,591 |
| Joconde | France | French | CSV Bulk Data | 10,000 |
| Harvard Art Museums | United States | English | REST API | 6,509 |
| Metropolitan Museum of Art | United States | English | REST API | 4,277 |
| Total | | | | 302,798 |
For the purposes of this analysis, “French-language records” refers to records from Paris Musées and Joconde. Both sources contain metadata primarily in French. “English-language records” refers to the remaining institutional sources whose metadata is primarily in English.
This is an institutional proxy for language rather than a record-level language classifier. That limitation should be acknowledged. Some records may contain multilingual fields, translated titles, or proper names that cross language boundaries. However, at the scale of the analysis, the institutional-language proxy is sufficient to reveal the dominant structure of the embedding space.
5.2 Nearest-Neighbor Measurement
The primary measurement was nearest-neighbor purity. For each record, the system retrieved its 20 nearest neighbors in the embedding space and measured the language composition of those neighbors.
If language were not a dominant organizing principle, French records should have nearest-neighbor sets that roughly reflect the corpus distribution, adjusted by content. Since French records comprise 17.0% of the corpus, a neighborhood drawn at random would be expected to contain roughly 17% French records, not 99.9%.
Nearest-neighbor purity therefore provides a direct measure of language siloing. The same method was applied to English records. The analysis also measured the cosine cost of crossing the language boundary by comparing within-language nearest-neighbor similarity to cross-language nearest-neighbor similarity.
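The purity measurement described above can be sketched in a few lines. This is a minimal brute-force illustration, not the LinkedCulture implementation: the function names `cosine` and `nn_purity` are hypothetical, the two-dimensional vectors are invented, and a corpus of this size would normally use a vector index rather than an exhaustive comparison.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nn_purity(vectors, languages, target_lang, k=20):
    """Mean share of target_lang records among the k nearest neighbors
    of every target_lang record (the record itself is excluded)."""
    shares = []
    for i, vec in enumerate(vectors):
        if languages[i] != target_lang:
            continue
        # Brute-force similarity to every other record, highest first.
        sims = sorted(
            ((cosine(vec, other), j) for j, other in enumerate(vectors) if j != i),
            reverse=True,
        )
        top = sims[:k]
        shares.append(sum(1 for _, j in top if languages[j] == target_lang) / len(top))
    if not shares:
        raise ValueError(f"no records tagged {target_lang!r}")
    return sum(shares) / len(shares)

# Invented toy vectors: three "fr" records and three "en" records (50% base rate).
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]]
languages = ["fr", "fr", "fr", "en", "en", "en"]
print(nn_purity(vectors, languages, "fr", k=2))  # 1.0
```

In the toy data the French share of the corpus is 50%, yet purity is 1.0. Purity far above the base rate is the signature of siloing that the measurement is designed to detect.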
5.3 Model Bake-Off
A separate retrieval evaluation tested whether alternative multilingual embedding models could fix the problem. Candidate models included e5-small, e5-base, e5-large, bge-m3, and qwen3-0.6b. The evaluation used a 13,949-document sample and a set of 64 queries with relevance judgments focused on cross-language retrieval behavior.
The evaluation had two stages. The first stage tested pairwise cross-lingual reach. Some models appeared promising at this stage, showing 81–94% pairwise cross-lingual reach, and on that basis looked as though they might solve the problem.
The second stage tested full retrieval ranking. This is the harder and more important test. A model does not merely need to place an English query and a French record closer together in isolation. It must rank the relevant French records high enough to appear in the top results in the presence of thousands of competing English records. That is where the candidate models failed.
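A minimal sketch of the ranking test, assuming a simple "relevant documents in the top k" count. The helper names `hits_at_k` and `rank_by_score` are hypothetical and the scores are synthetic, chosen only to show how a relevant French record can score well and still miss the visible result set.

```python
def hits_at_k(ranked_ids, relevant_ids, k=10):
    """Count relevant documents that appear in the first k ranked results."""
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)

def rank_by_score(scores):
    """Order document ids by descending similarity score."""
    return sorted(scores, key=scores.get, reverse=True)

# Synthetic scores: twelve English records edge out the one relevant French record.
scores = {f"en-{i}": 0.90 - 0.001 * i for i in range(12)}
scores["fr-relevant"] = 0.87

ranking = rank_by_score(scores)
print(hits_at_k(ranking, ["fr-relevant"], k=10))  # 0
print(ranking.index("fr-relevant") + 1)           # 13
```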
5.4 Translation-Augmented Retrieval
Finally, query translation was tested as an orchestration strategy. Instead of relying on a single query embedding to cross the language boundary, the system translated the query into the languages represented in the corpus and executed retrieval separately for each version.
For example, an English query for “blue bird” generated a French query for “oiseau bleu.” A French query for “oiseau bleu” generated an English query for “blue bird.” The system then ran semantic and keyword retrieval for each query form and fused the results. This approach does not claim to fix the embedding space. It bridges the language partition at query time.
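The orchestration can be sketched as one retrieval leg per cataloging language followed by a merge. This is an illustration under stated assumptions: `translated_search`, `toy_translate`, and `toy_search` are hypothetical names, the translation and search backends are stubs, the scores are invented, and merging by best score per record is an assumption rather than the fusion method LinkedCulture uses.

```python
from typing import Callable, Dict, List, Tuple

Hit = Tuple[str, float]  # (record id, score)

def translated_search(
    query: str,
    query_lang: str,
    corpus_langs: List[str],
    translate: Callable[[str, str, str], str],
    search: Callable[[str], List[Hit]],
    top_k: int = 10,
) -> List[Hit]:
    """Run one retrieval leg per corpus language, keep the best score per record."""
    best: Dict[str, float] = {}
    for lang in corpus_langs:
        # The leg in the query's own language needs no translation.
        leg_query = query if lang == query_lang else translate(query, query_lang, lang)
        for record_id, score in search(leg_query):
            if score > best.get(record_id, float("-inf")):
                best[record_id] = score
    merged = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return merged[:top_k]

# Stub backends for illustration only.
GLOSSARY = {("blue bird", "en", "fr"): "oiseau bleu", ("oiseau bleu", "fr", "en"): "blue bird"}
INDEX = {"blue bird": [("en-print", 0.62)], "oiseau bleu": [("fr-dress", 0.45)]}

def toy_translate(text: str, src: str, dst: str) -> str:
    return GLOSSARY[(text, src, dst)]

def toy_search(text: str) -> List[Hit]:
    return INDEX.get(text, [])

results = translated_search("blue bird", "en", ["en", "fr"], toy_translate, toy_search)
print(results)  # [('en-print', 0.62), ('fr-dress', 0.45)]
```

The French record enters the result set only because the second leg asked for it in French. Nothing about the document vectors has changed.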
6. The Language Partition
The central result is straightforward. French records overwhelmingly neighbor French records.
Using K=20 nearest-neighbor purity over the full 302,798-object corpus, a French object’s nearest neighbors were 99.9% French. This occurred despite French records comprising only 17.0% of the corpus. English objects showed the inverse pattern: only 0.1% of their nearest neighbors were French. This is not a weak bias. It is near-total siloing.
The cost of crossing the language boundary was also substantial. For French records, the nearest French neighbor averaged 0.933 cosine similarity, while the nearest English neighbor averaged 0.706. For English records, the nearest English neighbor averaged 0.905, while the nearest French neighbor averaged 0.646. The gap between same-language and cross-language neighbors therefore falls in the range of approximately 0.22–0.26 cosine.
That gap matters because clusters form at similarity levels close to the within-language range. Same-language links around 0.90–0.93 behave like “the same kind of thing.” Cross-language links around 0.65–0.70 behave more like loose association. As a result, cross-language links rarely become strong enough to form shared neighborhoods.
This structure is visible in the galaxy visualization. The French-language institutions appear as a detached continent rather than as fully integrated regions of the broader corpus. Paris Musées and Joconde sit close to one another. English-language institutions, despite substantial differences in content, sit closer to one another than to the French bloc.
That is the most important interpretive point. The partition is language, not content.
Getty antiquities, Art Institute of Chicago prints, Minneapolis Institute of Art Asian art, Cleveland Museum of Art objects, Harvard records, and Met records differ substantially in subject matter and institutional history. Yet English-language museums neighbor one another strongly. Paris Musées and Joconde also neighbor one another strongly. The cross-language relationship is weaker.
Group means show the same structure:
| Relationship | Mean similarity |
|---|---|
| English to English | 0.910 |
| French to French | 0.917 |
| French to English | 0.838 |
The precise numbers are less important than the pattern. Institutions described in the same language remain closer than institutions described in different languages, even where cross-language conceptual relationships should exist.
This answers one of the questions carried forward from Paper 2. The multilingual representational space does surface some cross-language associations at query time, but the document space itself does not form genuine shared cross-language conceptual clusters. The apparent cross-language discovery observed earlier was largely query-mediated, language-proximate, or otherwise insufficient to overcome the deeper partition.
7. Per-Institution Cluster Density
The language partition is the central finding of this paper, but it intersects with another observation anticipated in Paper 2: metadata density shapes cluster quality.
Institutions with richer descriptive metadata tend to form denser and more coherent neighborhoods. This is consistent with the implementation account given in Paper 2. Embeddings can only encode the information they receive. A record with title, object type, culture, period, medium, description, creator, and subject terms provides more semantic material than a record containing only a title and a date.
The Cleveland Museum of Art remains a useful example. Informal testing suggests that its records often perform particularly well on conceptual queries, likely because many of its records contain rich descriptive metadata across diverse collection areas. Sparse records, by contrast, occupy less differentiated regions of the space and are less likely to surface for conceptual queries.
This per-institution density effect should not be confused with the language partition. They are related but distinct forms of structure. Metadata density affects how sharply a record can be represented. Cataloging language affects where that representation sits relative to records in other languages.
A richly described French record may be well represented within the French-language region while still failing to cluster with semantically related English records. Similarly, an English record with sparse metadata may be poorly differentiated while still remaining within the English-language mass.
The clustering process also revealed over-fragmentation. HDBSCAN produced near-duplicate clusters representing closely related themes. A centroid-and-label deduplication process identified 763 near-duplicate cluster pairs and reduced the cluster set to 1,337 merged groupings. This finding is operational rather than central, but it reinforces the broader lesson: cluster structure is not a finished taxonomy. It is an analytical artifact that requires interpretation, deduplication, naming, and audit.
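One way such a deduplication pass could work is sketched below. The details are assumptions, not the project's actual procedure: the function name `merge_near_duplicates`, the 0.95 centroid threshold, and the shared-label-word test are all hypothetical, and the centroids and labels are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def merge_near_duplicates(centroids, labels, threshold=0.95):
    """Group clusters whose centroids are near-identical and whose labels share a word.
    Returns (number of near-duplicate pairs, merged groups of cluster indices)."""
    parent = list(range(len(centroids)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = 0
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            shared = set(labels[i].lower().split()) & set(labels[j].lower().split())
            if shared and cosine(centroids[i], centroids[j]) >= threshold:
                pairs += 1
                parent[find(j)] = find(i)

    groups = {}
    for i in range(len(centroids)):
        groups.setdefault(find(i), []).append(i)
    return pairs, list(groups.values())

# Invented example: two overlapping print clusters and one unrelated cluster.
centroids = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
labels = ["Japanese woodblock prints", "Woodblock prints", "Ancient coins"]
pairs, groups = merge_near_duplicates(centroids, labels)
print(pairs, groups)  # 1 [[0, 1], [2]]
```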
The cluster map should therefore not be read as a neutral classification system. It is the model’s organization of available metadata, filtered through institutional description, language, and embedding behavior.
8. Can a Different Model Fix It?
The natural response to language partitioning is to try a better model.
Several candidate embedding models were tested. The expectation was that stronger multilingual models might reduce the language gap sufficiently to produce cross-language retrieval and clustering. Candidate models included multilingual E5 small, base, and large [9], BGE-M3 [10], and Qwen3-0.6B [11]. The results were more disappointing and more informative than expected.
Initial pairwise tests looked promising. Several models showed 81–94% pairwise cross-lingual reach, suggesting that they could place cross-language equivalents near one another in controlled comparisons. This appeared to indicate that model substitution might solve the problem.
The full ranking test showed otherwise. On a retrieval test involving approximately 14,000 documents and 94 English-to-French relevance judgments, every candidate model placed at most one relevant French document in the top 10. In practical retrieval terms, this is failure. A researcher does not benefit from a relevant record being somewhat closer if it remains outside the visible result set.
Linear debiasing was also tested by subtracting a language direction from the embeddings. The improvement was measurable but not operationally sufficient. It suggests that the language direction can be measured and adjusted, but the remaining rank gap shows that debiasing alone did not solve retrieval.
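A minimal sketch of subtracting a language direction, assuming the direction is estimated as the difference between the French and English mean vectors. The source does not specify how the direction was estimated, so that choice, the function names, and the toy vectors are all illustrative.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def language_direction(french_vecs, english_vecs):
    """Unit vector pointing from the English mean to the French mean (assumed estimator)."""
    diff = [f - e for f, e in zip(mean_vector(french_vecs), mean_vector(english_vecs))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]

def remove_direction(vec, direction):
    """Subtract the component of vec that lies along the unit-length direction."""
    projection = sum(v * d for v, d in zip(vec, direction))
    return [v - projection * d for v, d in zip(vec, direction)]

# Toy data: the first axis carries language, the second carries content.
french = [[1.0, 0.2], [1.0, 0.4]]
english = [[0.0, 0.2], [0.0, 0.4]]
direction = language_direction(french, english)
print(direction)                               # [1.0, 0.0]
print(remove_direction(french[0], direction))  # [0.0, 0.2]
```

In this toy case language sits on a single clean axis, so the subtraction removes it entirely. A single linear direction is unlikely to capture everything that separates languages in a real embedding space, which is consistent with the partial improvement reported here.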
The bake-off sample intentionally contained a much higher French share than the full corpus in order to stress-test cross-language behavior under conditions where French records were not rare. The result is therefore conservative for the language-partition claim: even when French records were abundant in the candidate pool, the models still failed to produce reliable English-to-French ranking or mixed nearest-neighbor neighborhoods.
A clustering-specific test made the result clearer. The same 13,949-document sample, with a French base rate of 62.8%, was re-embedded using multiple models. French nearest-neighbor purity remained effectively total.
| Model | French NN-purity | French to nearest English cosine | Within-language cosine | Gap |
|---|---|---|---|---|
| nomic-v2, same sample | 99.9% | 0.665 | 0.895 | 0.230 |
| e5-base | 100% | 0.868 | 0.963 | 0.095 |
| qwen3-0.6b | 100% | 0.635 | 0.893 | 0.258 |
| nomic-v2, full corpus | 99.9% | 0.706 | 0.933 | 0.227 |
The e5-base result is the most revealing. It substantially narrowed the cross-language gap. French-to-nearest-English similarity rose to 0.868, compared with 0.665 under nomic-v2 on the same sample. But within-language similarity rose in parallel to 0.963. Same-language neighbors remained closer. The cluster partition held.
This shows why pairwise alignment is not enough. A model can move records from “far” to “near” without changing retrieval behavior. The target French record may become closer to an English query, but if a crowd of English-language records remains equally close or closer, the French record still fails to enter the top results.
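The point can be made concrete with a toy calculation. The sketch below borrows the nomic-v2 and e5-base similarity values from the table above purely as illustrative magnitudes. They are document-to-document averages, not query scores, and the crowd of fifty competitors and the function name `rank_of` are assumptions.

```python
def rank_of(target_score, competitor_scores):
    """1-based rank of the target when results are ordered by descending similarity."""
    return 1 + sum(1 for score in competitor_scores if score > target_score)

# Fifty same-language competitors at the within-language similarity level.
rank_before = rank_of(0.665, [0.895] * 50)  # cross-language target, nomic-v2 magnitudes
rank_after = rank_of(0.868, [0.963] * 50)   # cross-language target, e5-base magnitudes

print(rank_before, rank_after)  # 51 51
```

The target moves much closer in absolute terms, but every competitor moves too, so its rank does not change.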
The Blue Bird probe illustrates the mechanism. Under e5-base, the top results for the English query “blue bird” remained entirely English-source: Art Institute of Chicago, Cleveland, Met, Harvard, Getty. The French record L’Oiseau Bleu remained outside the top 50. English-to-French recall@10 was 0 of 94.
The conclusion is not that these models are bad. The conclusion is that cross-lingual discovery in this corpus is not solved by model substitution alone. Even the model that reduced cross-language distance most effectively did not mix the neighborhoods or produce useful cross-language ranking. Nearer is not the same as retrievable.
9. Why Query Translation Works
The method that did work was simpler than the model bake-off suggested. Translate the query.
When an English query for “blue bird” failed to surface L’Oiseau Bleu, translating the query into French and searching for “oiseau bleu” moved the query into the French-language region of the embedding space. The French query reached French records directly. When the English and French result sets were merged, the system returned a more complete cross-language result set.
This mechanism is important because it does not pretend to fix the document space. The corpus remains partitioned. The French records remain close to French records. English records remain close to English records. Query translation simply asks the question again in the language of the relevant region of the corpus.
This suggests a different architecture for cross-lingual discovery. Instead of relying on a single multilingual embedding to do all cross-language work, the retrieval system should identify or infer the cataloging languages present in the corpus, translate the user’s query into those languages, run retrieval in each language, and merge the results.
For a corpus containing English and French records, this is straightforward. An English query can generate French and English retrieval legs. A French query can generate English and French retrieval legs. A Spanish or Italian query can be translated into both English and French before retrieval.
The approach becomes more complex as more cataloging languages are added. If the corpus contains English, French, Dutch, German, Spanish, Italian, and Japanese records, query translation may require multiple retrieval legs. This introduces operational cost and ranking complexity. However, it also reflects the actual structure of the corpus more honestly than assuming that one embedding space has already unified those languages.
The practical lesson is that cross-lingual discovery is an orchestration problem. The system does not need a smarter map alone. It needs to ask the question again in the languages of the map and merge the answers.
This distinction also reframes the role of multilingual embeddings. They remain useful. They allow multiple languages to be embedded and searched within a single infrastructure. They reduce some cross-language distance. They may improve same-language retrieval across non-English corpora. But they should not be assumed to create language-neutral conceptual neighborhoods in cultural heritage metadata. The retrieval layer must compensate for that limitation.
10. System Evolution Since Paper 2
The language-partition finding was not the only change that emerged from operating LinkedCulture after Paper 2. The hybrid retrieval layer also changed, in response to the same kind of production evidence that motivated the cross-language analysis.
Paper 1 identified an open problem with Reciprocal Rank Fusion [12]. RRF is attractive because it combines independently ranked lists without requiring score calibration. This made it a useful early choice for LinkedCulture, where keyword and semantic scores came from different systems and different scoring regimes.
Over time, however, RRF’s limitation became operationally visible. It treats ranked lists as equally trustworthy inputs. A weak keyword result set can contribute low-relevance candidates to the final ranking. If those candidates appear in both keyword and semantic lists, they can receive a compounded benefit even when neither system considers them especially strong. In sparse-metadata environments, this effect becomes more common. The result was that some records bubbled up in rankings despite weak apparent relevance.
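The compounding effect is easy to reproduce. The sketch below uses the standard RRF formula, in which each list contributes 1 / (k + rank) to a document's score. The constant k = 60 is the commonly cited default, not a value confirmed for LinkedCulture, and the document ids are invented.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "weak" is only third in each list, but it is the only document present in both.
keyword_list = ["k1", "k2", "weak"]
semantic_list = ["s1", "s2", "weak"]
fused = reciprocal_rank_fusion([keyword_list, semantic_list])
print(fused[0])  # weak
```

Here `weak` scores 2/63 (about 0.032), while the top result of each list scores only 1/61 (about 0.016). Because RRF sees rank positions and not scores, it cannot tell whether third place was a strong match or a marginal one.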
LinkedCulture therefore moved from RRF to normalized-score fusion with an added coordination factor. Normalized-score fusion uses score information that RRF discards. Instead of relying only on rank position, scores from keyword and semantic systems are normalized onto comparable scales before combination. The coordination factor rewards query-term coverage, helping prevent weak partial matches from being overpromoted.
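A minimal sketch of this style of fusion follows. Several details are assumptions rather than the LinkedCulture implementation: min-max normalization, an equal 0.5 weighting, a coordination factor defined as the fraction of query terms present in the record text, and applying that factor to the keyword component only. All scores and texts are invented.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping onto [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def coordination(query_terms, doc_text):
    """Fraction of distinct query terms that occur in the document text."""
    terms = {t.lower() for t in query_terms}
    tokens = set(doc_text.lower().split())
    return len(terms & tokens) / len(terms) if terms else 0.0

def fuse(keyword_scores, semantic_scores, query_terms, doc_texts, keyword_weight=0.5):
    """Blend normalized scores; the keyword part is scaled by query-term coverage."""
    kw, sem = min_max(keyword_scores), min_max(semantic_scores)
    fused = {}
    for doc in set(kw) | set(sem):
        coord = coordination(query_terms, doc_texts.get(doc, ""))
        fused[doc] = (keyword_weight * kw.get(doc, 0.0) * coord
                      + (1.0 - keyword_weight) * sem.get(doc, 0.0))
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

doc_texts = {"a": "blue bird on a branch", "b": "blue vase", "c": "bird cage"}
keyword_scores = {"a": 8.0, "b": 7.5, "c": 2.0}
semantic_scores = {"a": 0.8, "b": 0.4, "c": 0.6}
ranking = fuse(keyword_scores, semantic_scores, ["blue", "bird"], doc_texts)
print([doc for doc, _ in ranking])  # ['a', 'c', 'b']
```

Record `b` has a high raw keyword score but matches only one of the two query terms, so the coordination factor halves its keyword contribution and it drops below `c`. Without the factor, `b` would have ranked second.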
This change should be understood as system maturation rather than correction. Paper 1 already identified weighting and weak candidate promotion as open problems. Paper 2 documented the RRF-based implementation as it existed at publication. Subsequent operation produced enough evidence to replace that mechanism with a more stable fusion strategy.
The system also evolved operationally. Int8 quantization substantially improved performance, reducing some operations from the range of 11–60 seconds to approximately 0.5 seconds. This updates the performance picture described in Paper 2, where local inference and retrieval latency were documented as constraints of the low-cost open-source stack.
These changes reinforce the broader argument of this paper. Retrieval quality is not determined by embeddings alone. It depends on the orchestration layer: how queries are translated, how result sets are fused, how scores are normalized, how weak candidates are handled, and how infrastructure choices shape what is operationally possible.
11. Implications for Cultural Heritage Discovery
The findings in this paper have several implications for multi-institutional cultural heritage retrieval.
First, institutions should not assume that multilingual embeddings produce language-neutral discovery. A multilingual model can support multiple languages while still organizing records into language-specific neighborhoods. This matters because cultural heritage collections are frequently multilingual, while the conceptual relationships between objects do not respect cataloging-language boundaries.
Second, cross-language discovery should be evaluated directly. It is not enough to test same-language retrieval in English, French, Dutch, or Spanish separately. A system intended to support cross-institutional discovery must test whether an English query can surface French records, whether a French query can surface English records, and whether records described in different languages cluster together when they describe related material.
Third, evaluation should be per-language and per-institution, not only averaged across the corpus. Language partitioning can disappear in aggregate metrics, especially when one language dominates the index. In LinkedCulture, French records comprised 17.0% of the corpus after expansion. That was large enough to reveal the partition. In smaller proportions, the same behavior might appear only as occasional retrieval failure.
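The comparison between a language's share of the corpus and its share of nearest neighbors can be sketched as follows. This is a minimal illustration using brute-force cosine similarity in NumPy on toy vectors; the function name and the data are assumptions, and a corpus of several hundred thousand records would need an approximate nearest-neighbor index in place of the full similarity matrix.

```python
import numpy as np


def same_language_neighbor_share(vectors, langs, k=10):
    """Per language: corpus share versus share of k nearest neighbors in that language."""
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit length, so dot = cosine
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)                   # a record is not its own neighbor
    nearest = np.argsort(-sims, axis=1)[:, :k]        # indices of the k most similar records
    langs = np.asarray(langs)
    same = langs[nearest] == langs[:, None]           # True where neighbor shares the language
    report = {}
    for lang in np.unique(langs):
        mask = langs == lang
        report[str(lang)] = {
            "corpus_share": float(mask.mean()),
            "same_language_neighbor_share": float(same[mask].mean()),
        }
    return report


# Toy data: three "en" records and two "fr" records in two tight groups.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
langs = ["en", "en", "en", "fr", "fr"]
report = same_language_neighbor_share(vectors, langs, k=1)
```

In these terms, the LinkedCulture result corresponds to a French corpus share of 17.0% against a French same-language neighbor share of 99.9%. If language played no role in the geometry, the two values would be expected to sit much closer together.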
Fourth, retrieval orchestration deserves as much attention as model selection. The strongest improvement in this study came not from replacing the embedding model but from translating the query and searching again. Similarly, the hybrid retrieval layer improved not by adopting a more fashionable rank-fusion technique, but by replacing RRF with a method better suited to this corpus.
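The translate-and-search-again pattern can be sketched in a few lines. The example below is a hedged illustration, not the LinkedCulture pipeline: `translate` and `search` are hypothetical callables standing in for whatever translation service and retrieval backend a system uses, and merging by the best score per record assumes that scores from the separate searches are comparable.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Hit = Tuple[str, float]  # (record id, score)


def search_with_query_translation(
    query: str,
    source_lang: str,
    corpus_langs: Sequence[str],
    translate: Callable[[str, str, str], str],  # hypothetical: (text, from, to) -> text
    search: Callable[[str, int], List[Hit]],    # hypothetical: (text, k) -> hits
    k: int = 10,
) -> List[Hit]:
    """Run the query once per corpus language and keep each record's best score."""
    best: Dict[str, float] = {}
    for lang in corpus_langs:
        text = query if lang == source_lang else translate(query, source_lang, lang)
        for doc_id, score in search(text, k):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]


# Stub translation table and stub index, for illustration only.
def toy_translate(text: str, src: str, dst: str) -> str:
    return {("peacock", "en", "fr"): "paon"}[(text, src, dst)]


toy_index = {
    "peacock": [("en1", 0.9), ("en2", 0.5)],
    "paon": [("fr1", 0.8), ("en1", 0.3)],
}


def toy_search(text: str, k: int) -> List[Hit]:
    return toy_index.get(text, [])[:k]


results = search_with_query_translation(
    "peacock", "en", ["en", "fr"], toy_translate, toy_search, k=3
)
```

With the stubs above, the French record reaches the merged list only because the translated query is searched separately. That is the compensating behavior this paragraph describes.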
Finally, language partitioning should be understood as a further axis of retrieval bias alongside metadata density and institutional imbalance. Papers 1 and 2 discussed how institutional silos, metadata richness, CC0 availability, and collection size affect discovery. This paper adds language as a structural force within the embedding space itself. A cross-institutional index that mixes cataloging languages should expect a language-partitioned representational space unless proven otherwise.
12. The Forward Bet: Multimodal Representation
If text embeddings remain language-partitioned, one possible route around the problem is not better text embeddings but image embeddings.
Many cultural heritage records describe visual objects. If two records depict visually similar objects but describe them in different languages, a text embedding may separate them by language. An image embedding may not. A peacock cataloged in French and a peacock cataloged in English may be represented similarly because they look alike, regardless of cataloging language.
This suggests that multimodal embeddings may offer a representation-level path that text embeddings do not. Models such as SigLIP-2 embed images into spaces where visual similarity, rather than cataloging language, becomes the organizing principle. For visual collections, this may allow cross-language clustering that text embeddings fail to produce. A fashion object cataloged in French and one cataloged in English could potentially neighbor each other through visual form even when their text descriptions remain separated by language.
This is not yet demonstrated in LinkedCulture. Preliminary probing has begun, but the multimodal path remains future work. It also introduces new risks. Visual similarity is not cultural meaning. A system that clusters objects by appearance may ignore provenance, function, material, maker, use, or community significance. For cultural heritage, visual proximity can be useful, but it must not be mistaken for interpretive equivalence.
The likely path forward is not replacing text embeddings with image embeddings. It is combining them. Text carries cataloged meaning. Images carry visual structure. Hybrid multimodal retrieval may be able to bridge some cross-language gaps while preserving the institutional metadata that grounds the record.
The language-rift finding makes this direction more important. If text alone partitions by language, then cross-language discovery may require representation channels that are not language-bound.
13. Limitations
The observations in this paper are specific to LinkedCulture’s corpus, pipeline, and evaluation conditions. The findings should not be read as universal claims about all multilingual embedding models or all cultural heritage corpora.
The language definition used in the primary analysis is institution-based rather than record-level. Paris Musées and Joconde are treated as French-language sources, while the remaining institutions are treated as English-language sources. This is appropriate for the broad structure being measured but does not capture record-level multilingual variation.
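The institution-based definition amounts to a lookup on the source institution, as the sketch below illustrates. The record layout, the `institution` field name, the example institution string, and the language codes are assumptions made for illustration; only the two French-language source names come from the analysis described above.

```python
# Sources treated as French-language in the primary analysis.
FRENCH_SOURCES = {"Paris Musées", "Joconde"}


def language_for(institution: str) -> str:
    """Assign a language by source institution, not by inspecting record text."""
    return "fr" if institution in FRENCH_SOURCES else "en"


# Hypothetical records. The second contains French text but comes from a source
# treated as English-language, so it is labeled "en" under this definition.
records = [
    {"id": "r1", "institution": "Joconde", "title": "Paon"},
    {"id": "r2", "institution": "Example Museum", "title": "Étude de paon"},
]
labels = {r["id"]: language_for(r["institution"]) for r in records}
```

The second record shows the limitation in miniature: any record whose text does not match its institution's assumed language is mislabeled by this rule.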
The evaluation set is limited. The retrieval bake-off used 64 queries and relevance judgments focused on English-to-French retrieval behavior. This is sufficient to reveal a major failure mode, but it is not a comprehensive benchmark for all cross-language cultural heritage retrieval.
The model comparison is also limited. Only a subset of candidate text embedding models was tested, and models were evaluated within the practical constraints of the LinkedCulture infrastructure. Larger proprietary embedding models or specialized cross-lingual models may behave differently. However, the constraint is deliberate: LinkedCulture is designed around low-cost, open infrastructure that small institutions and independent researchers can plausibly reproduce.
The cluster visualization is analytical, not authoritative. Clusters emerge from embedding geometry and are shaped by metadata, model behavior, dimensionality reduction, clustering parameters, and naming choices. They should be interpreted as evidence of representational structure, not as a replacement for curatorial taxonomy.
Finally, query translation introduces its own risks. Translation may flatten culturally specific terminology, introduce ambiguity, or privilege dominant language equivalents. It improves retrieval in the observed cases, but it should not be treated as a culturally neutral operation.
14. Conclusion
Paper 1 argued that keyword search alone cannot solve cultural heritage discovery. Paper 2 described LinkedCulture’s hybrid retrieval implementation and documented the transition to multilingual embedding. This paper, the third in the series, tested the assumption that followed from that transition.
The assumption was that multilingual embeddings would create a shared cross-language semantic space. They did not.
The evidence shows that the document space remains strongly partitioned by cataloging language. French records overwhelmingly neighbor French records. English records overwhelmingly neighbor English records. Alternative multilingual text embedding models reduce the measured distance between languages in some cases, but they do not reliably mix clusters or produce useful cross-language ranking. Pairwise alignment does not guarantee retrieval. Nearer is not the same as retrievable.
The practical solution that worked was query translation. By translating a query into the corpus languages and searching within each language’s region of the space, the system bridged the partition at query time. The corpus representation remained language-partitioned, but the retrieval process compensated for it.
This finding changes the interpretation of LinkedCulture’s multilingual layer. The multilingual embedding model remains useful. It supports multiple cataloging languages within a single infrastructure. But cross-lingual discovery requires orchestration beyond the model.
The broader lesson is that cultural heritage retrieval systems should be evaluated not only by what they retrieve, but by how their underlying representational spaces are structured. A system may appear to work at the interface while remaining partitioned underneath. That partition matters because it determines what kinds of connections the system can surface without intervention.
For cultural heritage institutions, the implication is direct: multilingual collections require more than multilingual models. They require retrieval designs that actively test and bridge the language boundaries embedded in their own data.
The next phase of LinkedCulture will investigate whether multimodal representation can provide a language-independent path through visual similarity while preserving the authority of institutional metadata. For now, the conclusion is narrower and more immediate:
Multilingual is not cross-lingual. A model that can search in many languages does not necessarily connect them.
References
- [1] Monk, Edward. 2026. “Discovery Architecture for Cultural Heritage: Layered Retrieval, Institutional Authority, and the Limits of Keyword Search.” Zenodo. May 27, 2026. https://doi.org/10.5281/zenodo.20418999
- [2] Monk, Edward. 2026. “Embedding Cultural Heritage Metadata: Pipeline Design, Multilingual Retrieval, and Hybrid Search in LinkedCulture.” Zenodo. June 10, 2026. https://doi.org/10.5281/zenodo.20633107
- [3] Monk, Edward. 2026. LinkedCulture Demonstration Platform. Accessed 2026. https://linkedculture.org
- [4] Nussbaum, Zach, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. 2024. “Nomic Embed: Training a Reproducible Long Context Text Embedder.” arXiv:2402.01613. https://arxiv.org/abs/2402.01613
- [5] Nussbaum, Zach, and Brandon Duderstadt. 2025. “Training Sparse Mixture of Experts Text Embedding Models.” arXiv:2502.07972. https://arxiv.org/abs/2502.07972
- [6] OpenSearch Project. n.d. “OpenSearch Documentation.” Accessed 2026. https://docs.opensearch.org/
- [7] Qdrant. n.d. “Qdrant Documentation.” Accessed 2026. https://qdrant.tech/documentation/
- [8] Monk, Edward. 2026. LinkedCulture Galaxy Visualization. Accessed 2026. https://galaxy.linkedculture.org
- [9] Wang, Liang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. “Multilingual E5 Text Embeddings: A Technical Report.” arXiv:2402.05672. https://arxiv.org/abs/2402.05672
- [10] Chen, Jianlv, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. “BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation.” arXiv:2402.03216. https://arxiv.org/abs/2402.03216
- [11] Zhang, Yanzhao, et al. 2025. “Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models.” arXiv:2506.05176. https://arxiv.org/abs/2506.05176
- [12] Cormack, Gordon V., Charles L. A. Clarke, and Stefan Buettcher. 2009. “Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.” In Proceedings of the 32nd International ACM SIGIR Conference, 758–759. https://doi.org/10.1145/1571941.1572114
How to cite
Monk, Edward. 2026. “Semantic Clusters and Language Partitioning in a Multi-Institutional Cultural Heritage Index.” LinkedCulture Research Paper Series No. 3. Zenodo. https://doi.org/10.5281/zenodo.20827947
This is the HTML edition. The archival, citable version of record is the Zenodo deposit (DOI 10.5281/zenodo.20827947).