Embedding Cultural Heritage Metadata: Pipeline Design, Multilingual Retrieval, and Hybrid Search in LinkedCulture

Edward Monk

doi:10.5281/zenodo.20633107

Abstract

The first paper in this series described a layered discovery architecture for cultural heritage collections and documented three observations from its deployment in LinkedCulture, an experimental cross-institutional retrieval prototype. Central to that architecture is a hybrid retrieval system combining keyword search and semantic similarity. This paper describes how that system is implemented and why its design choices were made: the methodology used to translate structured cultural heritage metadata into embeddable prose, the embedding model selected and the reasoning behind it, the OpenSearch and Qdrant infrastructure that powers keyword and semantic retrieval, the Reciprocal Rank Fusion mechanism that combines them, and the ingestion pipeline that normalizes records from eight institutions with heterogeneous source APIs into a canonical form. It documents a transition from English-centered to multilingual embedding prompted by an observed cross-language retrieval failure, and reports on cross-lingual retrieval behavior in the extended index. It also examines where the pipeline’s design choices introduce potential bias and where the embedding model fails in predictable ways. The paper concludes with a consolidated account of the system’s limitations and their implications for how the observations in this series should be interpreted.

1. From architecture to implementation

The first paper in this series made an architectural argument: that layered retrieval combining keyword matching and semantic similarity addresses vocabulary failures that keyword search alone cannot resolve, and that the appropriate role of AI in this architecture is at the representation layer rather than the interpretation layer [3]. It documented observations from LinkedCulture, a prototype built across eight cultural heritage institutions and more than two hundred thousand records [5], and identified three behavioral patterns that warrant further investigation.

That paper intentionally deferred the question of implementation. Describing the architecture and describing the machinery that produces it are different tasks, and conflating them would have obscured both. This paper addresses the implementation: how the hybrid retrieval system is actually built, what decisions shaped it, and what those decisions mean for the retrieval behavior the first paper described.

The choices made at the pipeline layer are not purely technical. The text construction methodology determines what information the embedding model receives about each object. The model selection determines how that information is encoded. The field mapping decisions for each institution determine what survives the translation from heterogeneous source metadata into a canonical form. The design of the hybrid retrieval system determines how keyword and semantic signals are combined. Each of these decisions has consequences for retrieval behavior, and each introduces tradeoffs and potential distortions. This paper describes them in enough detail that the observations reported in the first paper can be situated in the implementation that produced them, and so that practitioners interested in reproducing or adapting the approach can see what it actually requires.

This paper contributes a documented implementation pattern for hybrid cultural heritage retrieval across heterogeneous institutional metadata sources: prose-based metadata construction, local open-source embedding, multilingual model migration, and adjustable fusion between keyword and semantic retrieval.

2. The text construction problem

Cultural heritage metadata is structured. A catalog record typically consists of discrete named fields: title, object type, culture or origin, period or date, medium or material, department, subject tags, free-text description, creator or maker. This structure is intentional and valuable: it supports faceted filtering, authority control, and the stewardship functions for which catalog systems were designed.

Embedding models are trained on natural language text. They encode semantic relationships that emerge from patterns in continuous prose: the way concepts co-occur, the contexts in which words appear, the associative relationships that accumulate across large bodies of written language. A model trained in this way does not have a native understanding of structured key-value pairs. The structural argument for presenting metadata as prose rather than as concatenated field values follows from this. A model trained on prose has learned to interpret the relationships between concepts when they appear in prose-like contexts. Those learned relationships are likely to be weakened when the same concepts appear in semicolon-separated sequences that are structurally unlike most of the model's training distribution.

A concrete illustration of the difference:

Concatenated: Prothesis; black-figure; Greek; 5th century BCE; ceramic; funerary; Achilles Painter

Prose: this object is titled prothesis. it is classified as a ceramic vessel. it is associated with the greek culture. it dates to the 5th century bce. it was made using ceramic. it is associated with themes of funerary ritual. it is attributed to the achilles painter.

The tokens in both versions are drawn from the same source fields. The prose version does not add new catalog facts. It changes the structure in which those facts appear. Instead of presenting the model with a sequence of isolated field values, the prose template places the values in simple grammatical relationships: title, classification, culture, date, material, subject, and attribution. The design assumption is that this structure gives the embedding model a context closer to the prose patterns on which it was trained. In the concatenated version, the same tokens are present, but the relational cues that connect them are weaker.

This paper does not report a controlled comparison between the two approaches. Such a comparison would require a test set of labeled queries and judgments that does not currently exist for this collection. The argument for prose over concatenation is structural and theoretical rather than empirically demonstrated in this work. It is stated as a design rationale, not a finding.

A note on synthetic connective language. The prose template inserts English grammatical structure that is not present in the original record. For example, the template may produce "It is associated with the Māori culture. It dates to the nineteenth century." where the original metadata contains only the field values "Māori" and "1800–1900." The template adds the connective language. This is beneficial in the sense that it produces input closer to the model's training distribution. It is also a form of synthetic text construction: the pipeline is producing language that the cataloger never wrote, based on field values the cataloger did record. For most descriptive fields (culture, period, medium, object type) the connective language is relatively low-risk and semantically stable. For fields where the cataloger's precise framing carries interpretive weight, including provenance fields and community attribution language in collections like Te Papa's Taonga Māori holdings, the template's standardizing grammar may flatten distinctions the original record maintained. This is worth acknowledging as a design consequence even where it cannot be fully resolved.

The remaining text construction decisions are as follows. The canonical field order is fixed across all institutions, so the embedding model encounters structurally similar inputs regardless of how source APIs present the data. Lowercasing reduces surface variation across institutions; it introduces a tradeoff for proper nouns, since models may encode capitalized and lowercased forms differently, but the practical impact is expected to be small for models trained on mixed-case corpora. Empty fields are omitted rather than filled with placeholder language, which would introduce tokens with no semantic relationship to the object.
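The construction rules described in this section (fixed field order, lowercasing, omission of empty fields) can be sketched in a few lines. This is a minimal illustration, not the LinkedCulture template itself: the field keys, the function name build_page_content, and the phrasings for the department and description fields are assumptions made for the example; the other phrasings follow the prose illustration above.

```python
from typing import List, Mapping, Optional, Tuple

# Canonical field order with one sentence template per field.
# Keys and the "department"/"description" phrasings are hypothetical.
TEMPLATES: List[Tuple[str, str]] = [
    ("title", "this object is titled {}."),
    ("object_type", "it is classified as a {}."),
    ("culture", "it is associated with the {} culture."),
    ("period", "it dates to the {}."),
    ("medium", "it was made using {}."),
    ("department", "it is held in the {} department."),
    ("tags", "it is associated with themes of {}."),
    ("description", "{}"),
    ("creator", "it is attributed to {}."),
]


def build_page_content(record: Mapping[str, Optional[str]]) -> str:
    """Render a canonical record as lowercased prose in a fixed field order."""
    sentences = []
    for field, template in TEMPLATES:
        value = (record.get(field) or "").strip()
        if not value:
            # Empty fields are skipped rather than filled with placeholders.
            continue
        sentences.append(template.format(value).lower())
    return " ".join(sentences)


example = {
    "title": "Prothesis",
    "culture": "Greek",
    "period": "5th century BCE",
    "medium": "ceramic",
    "department": None,
    "tags": "",
}
page_content = build_page_content(example)
```

Because the template list is ordered, every institution's records produce sentences in the same sequence regardless of how the source API orders its fields.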

The same constructed text, referred to below as the page content, is used both for embedding and to populate the OpenSearch keyword index. Both retrieval systems therefore operate on identical input, so observed behavioral differences between them reflect the retrieval mechanism rather than different inputs.

3. Scope and data selection

LinkedCulture indexes records from eight cultural heritage institutions. The current index contains 229,394 records distributed as follows:

Institution	Country	Source Type	Indexed Records
Getty Museum	United States	SPARQL	59,979
Art Institute of Chicago	United States	REST API	58,443
Rijksmuseum	Netherlands	REST API	47,156
Cleveland Museum of Art	United States	REST API	41,279
Joconde	France	CSV Bulk Data	10,000
Harvard Art Museums	United States	REST API	6,509
Metropolitan Museum of Art	United States	REST API	4,277
Paris Musées	France	GraphQL	1,751
Total			229,394

Records indexed per institution (229,394 total).

Inclusion in the index is governed by two criteria: the institution must provide open access to its collection data through a publicly accessible API or bulk data export, and individual records must be accompanied by images released under a CC0 license [9]. The CC0 requirement is not incidental. It reflects a commitment to working only with material that institutions have explicitly placed in the public domain. This constraint also shapes the record counts: institutions with large collections may contribute fewer records than their total holdings would suggest, because only a portion of those holdings carry CC0 image licenses.

The Metropolitan Museum of Art's contribution warrants specific explanation. The Met provides a public API and has released a substantial open-access dataset, but API access from the server hosting the LinkedCulture pipeline was blocked during the collection period due to infrastructure-level access restrictions. The 4,277 records indexed represent a partial dataset and should not be taken as indicative of the Met's total CC0 holdings, which are considerably larger. This is a collection-period constraint, not an architectural decision.

The geographic and institutional range is intentional. Five institutions are based in the United States, two in France, and one in the Netherlands. The French institutions were added specifically to test cross-lingual retrieval behavior, an extension described in Section 7.

Institutional size imbalance. The distribution of records across institutions is substantially uneven: the three largest collections (Getty, Art Institute of Chicago, Rijksmuseum) contribute 165,578 records, roughly 72% of the total index. This imbalance has implications for how the shared representational space is structured. Semantic neighborhoods within the vector space are influenced by the objects present in it, and the objects present are disproportionately from institutions with large CC0 holdings. Retrieval behavior on queries related to the subjects, periods, and cultural traditions well-represented in those collections is likely to differ from behavior on queries related to traditions represented only by the smaller collections. The clusters observed in the third paper in this series will reflect this distribution. The index does not claim to represent the breadth of world cultural heritage equitably; it represents what is accessible through CC0 APIs at this collection scale.
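The concentration figure can be recomputed directly from the record counts in the table above. The short calculation below uses only those published counts; the variable names are illustrative.

```python
# Indexed records per institution, copied from the table in this section.
counts = {
    "Getty Museum": 59_979,
    "Art Institute of Chicago": 58_443,
    "Rijksmuseum": 47_156,
    "Cleveland Museum of Art": 41_279,
    "Joconde": 10_000,
    "Harvard Art Museums": 6_509,
    "Metropolitan Museum of Art": 4_277,
    "Paris Musées": 1_751,
}

total = sum(counts.values())
top_three = sum(sorted(counts.values(), reverse=True)[:3])
top_three_share = top_three / total

print(f"total={total}, top three={top_three}, share={top_three_share:.1%}")
```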

4. Embedding model selection

The embedding model converts the page content text into a fixed-length numerical vector. All semantic retrieval behavior is downstream of this conversion. The choice of model determines the structure of the representational space and therefore the kinds of conceptual proximity the system can express.

LinkedCulture uses models from the Nomic Embed family, run locally via Ollama [8]. The infrastructure is entirely open-source: no proprietary embedding APIs, no external data transmission, no per-token cost. Cultural heritage institutions are cautious about third-party data handling, and a system that sends collection metadata to external commercial APIs would face institutional adoption barriers that an entirely local stack does not. The open-source and local inference constraints were imposed deliberately and shape the available model choices.

The initial deployment used nomic-embed-text, a reproducible open-source text embedder with a 768-dimensional output space [2]. The 768-dimensional cosine space proved adequate at the prototype's scale for representing the conceptual diversity of cultural heritage metadata across multiple institutions and subject areas. All vectors are stored with cosine distance as the similarity metric, which measures angular proximity regardless of vector magnitude. This is appropriate for semantic similarity tasks where directional relationship matters more than length.
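The property that cosine distance ignores vector magnitude can be shown with a small pure-Python sketch. The function names are illustrative; in the deployed system this computation is performed inside the vector store, not in application code.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance form: 0 for identical direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(a, b)


v = [1.0, 2.0, 3.0]
scaled = [10.0 * x for x in v]  # same direction, ten times the length
```

Here cosine_distance(v, scaled) is zero up to floating-point error, since only the direction of the two vectors is compared.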

The batch size for embedding is 32 records per inference call. All embedding inference runs on CPU hardware without GPU acceleration. This reflects both the cost constraint of an independent research project and a deliberate demonstration that the approach is feasible without specialized hardware. The performance implications are discussed in Section 9.
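Batching at 32 records per call can be sketched as follows. The embedding call itself is passed in as a function so that the example stays independent of any particular inference server; embed_batch and the helper names are hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

BATCH_SIZE = 32  # records per inference call, as stated in the text


def batched(items: Iterable[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def embed_all(
    texts: Iterable[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Embed all texts, one inference call per batch, preserving order."""
    vectors: List[List[float]] = []
    for batch in batched(texts):
        vectors.extend(embed_batch(batch))
    return vectors
```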

5. Per-institution ingestion

The eight institutions provide their collection data through four interface types: REST APIs with JSON responses, GraphQL APIs, SPARQL endpoints, and bulk file downloads in CSV format. No two institutions present their data in the same structure, and no shared schema governs available fields, accepted values, or handling of missing information.

The pipeline handles this heterogeneity through a connector architecture. Each institution is represented by a source class implementing two methods: fetch, which retrieves raw records and yields them one at a time with pagination handled internally; and transform, which maps a raw record to a canonical schema. The canonical schema defines the fields available to the text construction template: title, object type, culture, period, medium, department, tags, description, creator, image URL, and source identifier. Fields the institution does not provide are silently omitted.

This architecture isolates institutional specificity in the connector layer. The embedding pipeline, text construction template, and search indexers are identical for every institution. A new institution is added by writing one new source class; nothing else changes.
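A minimal sketch of the connector pattern is shown below. The fetch and transform method names come from the description above; the base class name Source, the ExampleMuseumSource connector, and the raw field names it maps (objectTitle, img, id) are invented for illustration and do not correspond to any real institutional API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List


class Source(ABC):
    """Uniform interface that every institutional connector implements."""

    @abstractmethod
    def fetch(self) -> Iterator[Dict[str, Any]]:
        """Yield raw records one at a time; pagination is handled internally."""

    @abstractmethod
    def transform(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        """Map one raw record to the canonical schema."""


class ExampleMuseumSource(Source):
    """Hypothetical connector reading pre-fetched pages of raw records."""

    def __init__(self, pages: List[List[Dict[str, Any]]]) -> None:
        self._pages = pages

    def fetch(self) -> Iterator[Dict[str, Any]]:
        for page in self._pages:  # stands in for paginated API calls
            yield from page

    def transform(self, raw: Dict[str, Any]) -> Dict[str, Any]:
        mapped = {
            "title": raw.get("objectTitle"),
            "medium": raw.get("medium"),
            "image_url": raw.get("img"),
            "source_id": raw.get("id"),
        }
        # Fields the institution does not provide are omitted.
        return {key: value for key, value in mapped.items() if value}


def ingest(source: Source) -> Iterator[Dict[str, Any]]:
    """Institution-agnostic loop: the rest of the pipeline sees only this."""
    for raw in source.fetch():
        yield source.transform(raw)
```

Everything downstream of ingest is unaware of which institution a record came from, which is what allows a new institution to be added with a single class.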

The Getty Museum publishes its open access collection through a SPARQL endpoint, requiring a query-based approach to retrieval. Its data is linked to external authority vocabularies, which provides richer controlled vocabulary coverage than is available from institutions relying primarily on free-text description.

The Joconde database, maintained by the French Ministry of Culture, is available as a bulk CSV download covering objects from hundreds of French regional museums. Metadata is primarily in French and coverage is uneven.

Paris Musées provides its data through a GraphQL API. Like Joconde, records are primarily in French.

Record filtering. Records are filtered at the transform stage before embedding. Across all connectors, a record is excluded if it lacks an identifier or lacks a resolvable image URL. Image availability is the dominant filter, reflecting the pipeline's image-first design. A subset of connectors, including the Art Institute of Chicago and Rijksmuseum, additionally verifies CC0 or public-domain rights metadata at the individual record level. The remaining connectors rely on the source API or dataset already being scoped to open-access collections. Filtered records are logged with exclusion reasons and are not embedded or indexed.
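The filtering rules can be expressed as a function that returns an exclusion reason or None. This is a sketch under stated assumptions: the field names (source_id, image_url, rights), the normalized rights labels, and the logger name are hypothetical, and the check for whether an image URL actually resolves is reduced here to a presence check.

```python
import logging
from typing import Any, Mapping, Optional

logger = logging.getLogger("linkedculture.filter")  # hypothetical logger name

# Assumed normalized labels; real rights metadata varies by institution.
OPEN_RIGHTS = {"cc0", "public domain"}


def exclusion_reason(
    record: Mapping[str, Any], verify_rights: bool = False
) -> Optional[str]:
    """Return why a record should be excluded, or None if it should be kept."""
    if not record.get("source_id"):
        return "missing identifier"
    if not record.get("image_url"):
        # A full implementation would also confirm that the URL resolves.
        return "missing image URL"
    if verify_rights:
        rights = str(record.get("rights") or "").strip().lower()
        if rights not in OPEN_RIGHTS:
            return "rights not CC0 or public domain"
    return None


def keep(record: Mapping[str, Any], verify_rights: bool = False) -> bool:
    """Log the exclusion reason for filtered records; True means embed it."""
    reason = exclusion_reason(record, verify_rights)
    if reason is not None:
        logger.info("excluded %s: %s", record.get("source_id"), reason)
        return False
    return True
```

The verify_rights flag stands in for the difference between connectors that check rights per record and those that rely on the source already being scoped to open-access material.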

6. Hybrid retrieval: combining keyword and semantic search

Both retrieval modes operate on the same page content text. OpenSearch provides the keyword retrieval layer; Qdrant provides the semantic retrieval layer [6, 7]. When a query is submitted, it is sent to both systems in parallel. The two ranked result lists are merged through Reciprocal Rank Fusion [1] into a single ranking.

OpenSearch indexes the page content text for BM25-based retrieval. It handles exact term matching and Boolean logic, the properties that make keyword search effective for known-item retrieval. Qdrant stores the vector representations and supports approximate nearest-neighbor search. A query submitted to the semantic layer is first embedded using the same model and prefix convention used during ingestion, then compared against stored vectors by cosine distance.

Reciprocal Rank Fusion merges the two ranked lists without requiring score calibration between systems. Each result receives a fused score based on its rank position in each list. Results appearing in both lists receive a compounded contribution; results appearing in only one list still contribute to the merged ranking.
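In its standard form, Reciprocal Rank Fusion assigns each result a score of 1 / (k + rank) for every list in which it appears and sums those contributions. The sketch below implements that standard formulation. The constant k = 60 is the value commonly used in the literature and is an assumption here, since this paper does not state the value used in LinkedCulture; the function name and the alphabetical tie-break are also illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60
) -> List[str]:
    """Merge ranked lists of document ids using rank positions only."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # No raw scores are needed, so BM25 and cosine values
            # never have to be calibrated against each other.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_results = ["a", "b", "c"]
semantic_results = ["b", "d", "a"]
fused = reciprocal_rank_fusion([keyword_results, semantic_results])
# fused == ["b", "a", "d", "c"]
```

Documents "a" and "b" appear in both lists and therefore outrank "c" and "d", each of which appears in only one.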

Transparent retrieval control. The relative weight of the two retrieval modes in the final ranking is adjustable in real time through a UI slider. A researcher can shift from pure keyword retrieval to pure semantic retrieval or hold any position along the continuum. This is atypical: in many deployed hybrid retrieval systems, fusion weights are fixed at deployment time and invisible to users.
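One plausible way to connect a slider to rank fusion is to scale each list's reciprocal-rank contribution by a weight. The paper does not specify how the slider value enters the fusion, so the scheme below is an assumption: alpha is a hypothetical slider value in [0, 1], where 0 means keyword only and 1 means semantic only, and k = 60 is likewise assumed.

```python
from typing import Dict, List, Sequence


def weighted_fusion(
    keyword: Sequence[str],
    semantic: Sequence[str],
    alpha: float,
    k: int = 60,
) -> List[str]:
    """Rank fusion with an adjustable balance between two ranked lists."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    scores: Dict[str, float] = {}
    for weight, ranking in ((1.0 - alpha, keyword), (alpha, semantic)):
        if weight == 0.0:
            continue  # at either extreme, one list is ignored entirely
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_results = ["a", "b", "c"]
semantic_results = ["b", "d", "a"]
keyword_only = weighted_fusion(keyword_results, semantic_results, alpha=0.0)
balanced = weighted_fusion(keyword_results, semantic_results, alpha=0.5)
semantic_only = weighted_fusion(keyword_results, semantic_results, alpha=1.0)
```

Running the same pair of lists at several alpha values reproduces, in miniature, the comparison a researcher makes by moving the slider.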

The motivation for exposing this control is observational rather than operational. Making the weighting adjustable makes the behavioral difference between the two retrieval modes directly visible: a researcher can run the same query at different settings and watch how the result set changes. The first paper's observation that neither retrieval mode dominates unconditionally across query types is not a theoretical claim; it is observable through this mechanism. Queries using specific institutional or disciplinary terminology show clearly better keyword results; queries using thematic or conceptual language show clearly better semantic results. The slider makes this contrast accessible to a researcher who may not have a prior intuition about which mode to prefer for a given query.

Exposing retrieval weights also produces a form of qualitative evidence: if no difference is observable between the two extremes of the slider for a given query, that is informative about the query type and the metadata density for the relevant collection. This interpretive use of adjustable fusion weights is underexplored in the retrieval literature and is noted here as a direction worth investigating more formally.

7. The multilingual extension

The initial version of the LinkedCulture index used nomic-embed-text, with English-language retrieval as the primary assumed use case. A specific retrieval failure observed during early development prompted the transition to a multilingual model.

The Rijksmuseum collection was initially ingested from its Dutch-language OAI-PMH endpoint, before the institution's English-language REST API registration was discovered. Records carried titles and descriptive fields in Dutch. The consequences were visible immediately in retrieval. An English-language query for "an object that signals authority without text" returned a result set dominated by Rijksmuseum prints with Dutch-language titles: Advocaat met de Dood (Lawyer with Death), Amor wordt gekortwiekt (Amor is clipped), Lucius Junius Brutus laat zijn zonen onthoofden (Brutus has his sons beheaded). None of these records are conceptually related to the query. The English-centered embedding setup was finding spurious proximity between the English query and Dutch vocabulary that shares Germanic roots and lexical patterns with English, producing confident-looking similarity scores for semantically wrong results.

This failure mode is particularly difficult to detect because the system shows no sign of having misunderstood the query. Results are returned. Scores are high. The problem is only visible when a person examines the results and recognizes that the returned objects are wrong. This is the same class of invisible failure that the first paper described for keyword search: the system appears to work while producing results that serve the researcher poorly.

This failure likely generalizes to other multi-institutional indexes that combine English queries with non-English European collection metadata under an English-centered embedding setup. French, German, Dutch, and Spanish share enough vocabulary with English through shared roots and loan words that spurious proximity should be treated as a predictable structural risk rather than an isolated edge case.

The current index uses nomic-embed-text-v2-moe, a multilingual sparse mixture-of-experts model from the same Nomic Embed family [4]. This model is trained for multilingual retrieval and is intended to place semantically related content from different languages in proximity in the shared representational space. The multilingual model introduces an asymmetric embedding approach: documents are embedded with a search_document: prefix and queries with a search_query: prefix. This prefix convention is a documented feature of the model's training; applying the wrong prefix or omitting it produces degraded retrieval performance [4].
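The prefix convention amounts to prepending a fixed task string before the text is sent to the model. A minimal sketch follows; the helper names are illustrative, and the prefix strings are the two named above.

```python
DOCUMENT_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "


def as_document(page_content: str) -> str:
    """Prepare a record's constructed text for embedding at ingestion time."""
    return DOCUMENT_PREFIX + page_content.strip()


def as_query(query: str) -> str:
    """Prepare a user query for embedding at search time."""
    return QUERY_PREFIX + query.strip()


doc_input = as_document("this object is titled prothesis.")
query_input = as_query("funerary ritual")
```

Keeping the two helpers separate makes it harder to embed a query with the document prefix by accident, which is the misuse the model documentation warns against.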

Observed cross-lingual behavior. Preliminary observations from operating the extended index suggest the following pattern. Same-language queries (French queries against French records, English queries against English records) behave similarly in terms of coherence and relevance. Cross-lingual queries (English queries against French records) return results that are relevant in most cases but not identical to what same-language retrieval produces. The representational space appears to support cross-lingual discovery to a meaningful degree. The relationship between query language and result quality is not symmetric, however, and has not been formally evaluated. A structured comparison across representative query types remains the most important piece of evidence this paper lacks.

The transition from the monolingual to the multilingual index required re-embedding the entire collection. Records are stored under versioned collection names in Qdrant (linkedculture_unified_multilingual_v2 for the current index), so the previous index remained available during the transition for comparison purposes.

8. What the pipeline does and does not control

The pipeline controls the information that reaches the retrieval systems: which fields are extracted, how they are assembled into page content, which model converts that text to a vector, and how the resulting vectors are stored alongside an OpenSearch index. These are consequential decisions, but they operate on the input to the representational space. They do not control what the embedding model does with that input.

The embedding model's internal representations are the product of its training data and training methodology. The associations the model has learned (between concepts, periods, materials, cultural contexts) emerge from patterns across large text corpora not necessarily drawn from cataloging contexts. The pipeline shapes the input; the model shapes the encoding. This distinction matters for understanding both capabilities and limitations.

Metadata density and retrieval quality. Where the metadata is rich and the model's training has exposed it to the relevant vocabulary, the representational space is dense and semantically organized. The Cleveland Museum of Art's collection, which carries unusually descriptive metadata across a culturally diverse range of objects, performs consistently well on conceptual queries. The embedding model has enough material to work with, and the relevant concepts are well-represented in its training distribution.

Where the metadata is sparse (a title, a date, and nothing else) the embedding has less to encode. The representational space around sparse records is less differentiated, and these records are less likely to surface for queries that rely on conceptual proximity. No pipeline design compensates for records that contain insufficient descriptive information. Metadata quality is the primary determinant of retrieval quality, and it is determined at the institution before the pipeline ever sees the record.

Embedding model failures. A specific and consistent failure mode observed across the index concerns queries involving negation, paradox, or compound meaning. A query for "a throne that is not a throne" returns literal thrones and ceremonial chairs with high similarity scores. A query for "sacred but not religious" returns objects with explicit religious metadata. The embedding model does not handle the negation or the paradoxical relationship; it encodes the dominant noun and retrieves records that match it.

This is a known limitation of current dense retrieval approaches and is not specific to cultural heritage data. It is documented here because it is consistently observed across the index and because the high similarity scores returned for these cases create a false impression of confidence. A similarity score above 0.75 does not indicate conceptual correctness; it indicates proximity in the representational space, which for negation and paradox queries does not correspond to the intended meaning. Users and researchers relying on the system for this class of query should treat high-confidence results with particular skepticism.

9. Operational performance

A single search request passes through four computational stages: the query string is embedded into a 768-dimensional vector; keyword retrieval (OpenSearch BM25) and semantic retrieval (Qdrant approximate nearest-neighbor) execute in parallel; Reciprocal Rank Fusion merges the two result lists; and results are supplemented with vision metadata from a separate collection. The full pipeline runs on a single-CPU virtual private server (Hetzner CX33, 1 vCPU, 8GB RAM). No GPU is used at inference time. Latency measurements were taken on production infrastructure in June 2026.

Three observations from warm-state instrumentation carry implications for the architecture.

Local embedding inference is the dominant latency source. At 350ms per novel query on a single CPU, embedding is four to six times slower than all retrieval steps combined. This is a direct consequence of the design choice to run all inference locally. For systems with stricter real-time requirements, the tradeoff between local inference (no external dependency, no data transmission) and latency would need reconsideration. At low-to-moderate query volumes with caching in place, local inference is adequate.
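Because embedding dominates latency only for novel queries, a small in-memory cache in front of the embedding call removes that cost for repeated queries. The sketch below is one way to do this with the standard library; the wrapper name, the cache size, and the whitespace normalization are assumptions, and embed stands for whatever function performs local inference.

```python
from functools import lru_cache
from typing import Callable, Sequence, Tuple


def make_cached_embedder(
    embed: Callable[[str], Sequence[float]], maxsize: int = 1024
) -> Callable[[str], Tuple[float, ...]]:
    """Wrap an embedding function so repeated queries skip inference."""

    @lru_cache(maxsize=maxsize)
    def cached(normalized_query: str) -> Tuple[float, ...]:
        # Tuples are immutable, so cached vectors cannot be altered by callers.
        return tuple(embed(normalized_query))

    def embed_query(query: str) -> Tuple[float, ...]:
        # Collapse whitespace so trivially different strings share an entry.
        return cached(" ".join(query.split()))

    return embed_query
```

An in-process cache of this kind is lost on restart; whether that matters depends on how often the service is redeployed.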

Model cold start is a deployment risk. Embedding models unloaded from memory between requests incur a full reload cost (approximately 18 seconds for a 475M-parameter model) on the first query after any idle period. This latency is not signaled by any error; the request simply takes 18 seconds. For a system used in demonstration contexts, this is a significant operational hazard addressable through model keep-alive configuration, but the mitigation must be explicitly applied rather than assumed.
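With Ollama, one way to apply the keep-alive mitigation is to send a keep_alive value with each embedding request; a value of -1 asks the server to keep the model loaded indefinitely. The sketch below assumes a default local Ollama address and uses the model name given in Section 7 as the model tag; both are assumptions about the deployment, and the helper names are illustrative.

```python
import json
import urllib.request
from typing import Any, Dict, List, Sequence

OLLAMA_EMBED_URL = "http://localhost:11434/api/embed"  # assumed local default
MODEL = "nomic-embed-text-v2-moe"  # assumed to match the locally pulled tag


def build_embed_payload(
    texts: Sequence[str], model: str = MODEL, keep_alive: int = -1
) -> Dict[str, Any]:
    """Request body for an embedding call that keeps the model in memory."""
    return {"model": model, "input": list(texts), "keep_alive": keep_alive}


def embed(texts: Sequence[str], timeout: float = 60.0) -> List[List[float]]:
    """Send the request to a running Ollama server and return the vectors."""
    body = json.dumps(build_embed_payload(texts)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_EMBED_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["embeddings"]
```

The same effect can be configured server-wide through the OLLAMA_KEEP_ALIVE environment variable, which avoids relying on every client to send the parameter.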

Facet aggregations should be decoupled from the primary search path. Running aggregation sub-queries on every search adds approximately 144ms (roughly 70% of total OpenSearch time) for results that change only when data changes. Decoupling them from the primary retrieval path and caching them against the query eliminates this overhead for the common case without sacrificing correctness.
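The decoupling can be sketched as a cache that computes facet counts for a query once and reuses them until the index changes. The class and method names are hypothetical, and compute stands for whatever function issues the aggregation request to the search backend.

```python
from typing import Callable, Dict

FacetCounts = Dict[str, int]


class FacetCache:
    """Cache facet aggregations per query until the index is updated."""

    def __init__(self, compute: Callable[[str], FacetCounts]) -> None:
        self._compute = compute
        self._store: Dict[str, FacetCounts] = {}

    def get(self, query: str) -> FacetCounts:
        """Return cached counts, running the aggregation only on a miss."""
        if query not in self._store:
            self._store[query] = self._compute(query)
        return self._store[query]

    def invalidate(self) -> None:
        """Call after each ingestion run, since counts change only with data."""
        self._store.clear()
```

The primary search path then returns ranked results without waiting on aggregations, and facet counts are fetched from the cache separately.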

After these changes, end-to-end latency for a warm-model, cached-query scenario is approximately 184–204ms; for a warm model processing a novel query, it is approximately 497ms.

10. Limitations

The observations in this paper are qualitative and exploratory. They describe implementation decisions and their observed consequences, not the results of controlled experiments with independent variables. This section consolidates the limitations that appear throughout the paper.

Prose construction is not empirically validated. The decision to convert structured metadata to natural language prose before embedding rests on a structural argument about model training distributions. It has not been tested against field concatenation on a labeled query set. The argument is sound but unverified in this context.

Synthetic connective language introduces bias. The text construction template adds English grammatical structure that the original cataloger did not write. For most descriptive fields this is a reasonable normalization. For fields where the cataloger's precise framing carries interpretive or cultural weight, the template may inadvertently flatten distinctions the original record maintained.

Institutional size imbalance may distort the representational space. The three largest collections account for approximately 72% of the index. Semantic neighborhoods, retrieval behavior, and cluster structure are likely influenced by this concentration. The index does not represent world cultural heritage equitably; it represents what is accessible through CC0 APIs at this scale.

Cross-lingual retrieval is not formally evaluated. The multilingual section describes observed behavior qualitatively. No structured comparison between monolingual and multilingual retrieval quality has been conducted on a labeled query set. The Rijksmuseum Dutch failure is documented from observation; its resolution under the multilingual model is asserted but not demonstrated with systematic evidence.

Embedding model failures are catalogued but not bounded. The failures described in Section 8 (sparse metadata, negation, paradox) are observed failure modes, not an exhaustive characterization of where the system fails. There are likely other failure classes not yet encountered or documented.

Only CC0 collections with images are included. The selection constraint produces an index that overrepresents institutions with strong open access programs and underrepresents institutions whose holdings are equally relevant but whose licensing positions have not permitted CC0 release.

The system has not been evaluated by end users. No user study, usability evaluation, or formal relevance assessment involving museum professionals or researchers has been conducted. The observations are the researcher's own, based on operating the system.

11. Implications for practitioners

Several aspects of the LinkedCulture pipeline design have direct implications for institutions or researchers considering similar implementations.

The text construction principle of translating structured metadata into natural language prose before embedding is transferable and adaptable. The specific template used in LinkedCulture reflects the fields available across the indexed institutions. An institution with a richer description field and sparser subject tagging might construct prose that weights those fields differently. The principle is more durable than any particular field ordering.

The connector architecture, in which institutional specificity is isolated in a source class that exposes a uniform interface to the rest of the pipeline, is a practical pattern for managing heterogeneous data sources at the scale a single researcher or small team can maintain. Adding a new institution requires writing one new class.

Building keyword and semantic indexes from the same constructed text simplifies the pipeline and ensures that observed behavioral differences reflect the retrieval mechanism rather than different inputs. Improvements to the text construction methodology benefit both retrieval systems simultaneously.

The open-source stack (Python, Ollama, OpenSearch, Qdrant) is sufficient for an index at this scale with no proprietary licenses or external API subscriptions. The dominant operational constraint is embedding inference latency, which is addressable through caching and model keep-alive configuration rather than infrastructure upgrades.

The CC0 data selection constraint is worth noting separately. Limiting the index to records with CC0 images reduces the addressable collection size at participating institutions but eliminates questions about data rights that would otherwise complicate institutional adoption. For a research prototype seeking to engage institutional partners, this is a deliberate and worthwhile tradeoff.

For any multi-institutional index that mixes collections in different languages, cross-language retrieval interference under an English-centered embedding setup should be expected. The failure produces confident-looking results rather than empty result sets and is therefore detectable only through examination of results, not through system monitoring. A multilingual embedding model is the most direct mitigation. Institutions building indexes that include collections described in languages other than English should plan for this failure mode from the outset.
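
Because the failure is visible only in the results themselves, a simple audit over sampled queries can help direct manual review. The sketch below assumes each result carries a hypothetical `language` field and flags a query when most of its top results are in a language other than the query's. The threshold is arbitrary, and under a multilingual model a high off-language share may be desirable rather than a fault, so a flag is a prompt for inspection and not a verdict.

```python
from collections import Counter
from typing import Dict, Sequence


def language_share(results: Sequence[dict], top_k: int = 10) -> Dict[str, float]:
    """Fraction of the top_k results in each language (hypothetical field)."""
    top = results[:top_k]
    if not top:
        return {}
    counts = Counter(r.get("language", "unknown") for r in top)
    return {lang: n / len(top) for lang, n in counts.items()}


def flag_for_review(
    results: Sequence[dict],
    query_language: str,
    top_k: int = 10,
    threshold: float = 0.5,
) -> bool:
    """True when more than `threshold` of the top results are off-language."""
    share = language_share(results, top_k)
    if not share:
        return False  # an empty result set is a different, visible failure
    return (1.0 - share.get(query_language, 0.0)) > threshold
```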

Metadata quality variation across institutions is the largest single factor in retrieval performance variation and is not addressable within the pipeline. Evaluations of cross-institutional discovery infrastructure should account for this: per-institution performance variation is expected and informative, and it should be reported rather than averaged away.
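
As a sketch of what such reporting could look like, the function below groups per-query scores by institution instead of collapsing them into one mean. It assumes that some per-query score (for example, a precision-at-k judgment) already exists; no such judgments have been collected for LinkedCulture, so the input shape and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def per_institution_report(
    scores: Iterable[Tuple[str, float]],
) -> Dict[str, Dict[str, float]]:
    """Summarize (institution, score) pairs per institution.

    Reporting n, mean, min, and max for each institution keeps the
    variation visible instead of hiding it inside a single average.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for institution, score in scores:
        grouped[institution].append(score)
    return {
        inst: {"n": len(vals), "mean": mean(vals), "min": min(vals), "max": max(vals)}
        for inst, vals in sorted(grouped.items())
    }
```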

12. Conclusion

This paper has described the implementation underlying the retrieval behavior reported in the first paper in this series. The hybrid retrieval system is the product of a sequence of design decisions: a text construction methodology that translates structured metadata into natural language prose; an open-source embedding model that encodes conceptual relationships from that prose into a shared vector space; an OpenSearch keyword index built from the same constructed text; a Reciprocal Rank Fusion mechanism with an adjustable weighting control that makes the behavioral difference between retrieval modes directly observable; a connector architecture that normalizes heterogeneous institutional data into a canonical form; a transition from English-centered to multilingual embedding in response to a specific retrieval failure; and infrastructure that operates entirely on commodity hardware without proprietary services.

None of these decisions are the only correct ones. They are the decisions that were made, with the reasoning documented so that they can be evaluated, reproduced, and improved upon. The limitations section documents where the reasoning is structural rather than empirically verified, and where the evidence base for key claims remains incomplete.

Two observations from this paper carry forward to the third paper in this series, which will describe the emergent semantic cluster structure observed within the combined index. First, the per-institution variation in metadata quality produces per-institution variation in embedding quality, which is visible in the cluster structure: institutions with richer metadata produce denser, more coherent semantic neighborhoods. Second, the multilingual representational space surfaces cross-institutional connections that a monolingual space does not, including associations across objects described in different languages. Whether those connections correspond to genuine conceptual relationships or to artifacts of the embedding model's training distribution is a question that cluster analysis makes tractable.

References

[1] Cormack, Gordon V., Charles L. A. Clarke, and Stefan Buettcher. 2009. “Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods.” In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 758–759. https://doi.org/10.1145/1571941.1572114
[2] Nussbaum, Zach, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. 2024. “Nomic Embed: Training a Reproducible Long Context Text Embedder.” arXiv:2402.01613. https://arxiv.org/abs/2402.01613
[3] Monk, Edward. 2026. “Discovery Architecture for Cultural Heritage: Layered Retrieval, Institutional Authority, and the Limits of Keyword Search.” Zenodo. May 27, 2026. https://doi.org/10.5281/zenodo.20418999
[4] Nussbaum, Zach, and Brandon Duderstadt. 2025. “Training Sparse Mixture of Experts Text Embedding Models.” arXiv:2502.07972. https://arxiv.org/abs/2502.07972
[5] Monk, Edward. 2026. LinkedCulture Demonstration Platform. Accessed 2026. https://linkedculture.org
[6] OpenSearch Project. n.d. OpenSearch Documentation. Accessed 2026. https://docs.opensearch.org/
[7] Qdrant. n.d. Qdrant Documentation. Accessed 2026. https://qdrant.tech/documentation/
[8] Ollama. n.d. Ollama Documentation. Accessed 2026. https://docs.ollama.com/
[9] Creative Commons. n.d. “CC0 1.0 Universal Public Domain Dedication.” Accessed 2026. https://creativecommons.org/publicdomain/zero/1.0/

How to cite

Monk, Edward. 2026. “Embedding Cultural Heritage Metadata: Pipeline Design, Multilingual Retrieval, and Hybrid Search in LinkedCulture.” LinkedCulture Research Paper Series No. 2. Zenodo. https://doi.org/10.5281/zenodo.20633107

This is the HTML edition. The archival, citable version of record is the Zenodo deposit (DOI 10.5281/zenodo.20633107).