Navigation

Specify a Language for Text Index

This tutorial describes how to specify the default language associated with the text index and also how to create text indexes for collections that contain documents in different languages.

Specify the Default Language for a text Index

The default language associated with the indexed data determines the rules to parse word roots (i.e. stemming) and ignore stop words. The default language for the indexed data is english.

To specify a different language, use the default_language option when creating the text index. See Text Search Languages for the languages available for default_language.

The following example creates for the quotes collection a text index on the content field and sets the default_language to spanish:

db.quotes.createIndex(
   { content : "text" },
   { default_language: "spanish" }
)

Create a text Index for a Collection in Multiple Languages

Specify the Index Language within the Document

If a collection contains documents or embedded documents that are in different languages, include a field named language in the documents or embedded documents and specify as its value the language for that document or embedded document.

MongoDB will use the specified language for that document or embedded document when building the text index:

  • The specified language in the document overrides the default language for the text index.
  • The specified language in an embedded document override the language specified in an enclosing document or the default language for the index.

See Text Search Languages for a list of supported languages.

For example, a collection quotes contains multi-language documents that include the language field in the document and/or the embedded document as needed:

{
   _id: 1,
   language: "portuguese",
   original: "A sorte protege os audazes.",
   translation:
     [
        {
           language: "english",
           quote: "Fortune favors the bold."
        },
        {
           language: "spanish",
           quote: "La suerte protege a los audaces."
        }
    ]
}
{
   _id: 2,
   language: "spanish",
   original: "Nada hay más surrealista que la realidad.",
   translation:
      [
        {
          language: "english",
          quote: "There is nothing more surreal than reality."
        },
        {
          language: "french",
          quote: "Il n'y a rien de plus surréaliste que la réalité."
        }
      ]
}
{
   _id: 3,
   original: "is this a dagger which I see before me.",
   translation:
   {
      language: "spanish",
      quote: "Es este un puñal que veo delante de mí."
   }
}

If you create a text index on the quote field with the default language of English.

db.quotes.createIndex( { original: "text", "translation.quote": "text" } )

Then, for the documents and embedded documents that contain the language field, the text index uses that language to parse word stems and other linguistic characteristics.

For embedded documents that do not contain the language field,

  • If the enclosing document contains the language field, then the index uses the document’s language for the embedded document.
  • Otherwise, the index uses the default language for the embedded documents.

For documents that do not contain the language field, the index uses the default language, which is English.

Use any Field to Specify the Language for a Document

To use a field with a name other than language, include the language_override option when creating the index.

For example, give the following command to use idioma as the field name instead of language:

db.quotes.createIndex( { quote : "text" },
                       { language_override: "idioma" } )

The documents of the quotes collection may specify a language with the idioma field:

{ _id: 1, idioma: "portuguese", quote: "A sorte protege os audazes" }
{ _id: 2, idioma: "spanish", quote: "Nada hay más surrealista que la realidad." }
{ _id: 3, idioma: "english", quote: "is this a dagger which I see before me" }