Solr 7: match exactly one TextField, fuzzy search the others

Disclaimer: I just recently started tinkering with Solr, so this may not be the “textbook” solution to this problem — but it works!

Say you wish to categorize your documents in a hierarchy. For example, you want to have a pet search engine, and you want to bucket the pets accordingly:

Dog (mammal)

Dogs are domesticated mammals, not natural wild animals. They were originally bred from wolves. They have been bred by humans for a long time, and were the first animals ever to be domesticated.

Cat (mammal)

Cats, also called domestic cats (Felis catus), are carnivorous (meat-eating) mammals of the family Felidae. Cats have been domesticated (tame) for nearly 10,000 years. They are currently the most popular pets in the world. Their origin is probably the African Wildcat Felis silvestris lybica.

Turtle (reptile)

Turtles are the reptile order Testudines. They have a special bony or cartilaginous shell developed from their ribs that acts as a shield.

Lizard (reptile)

Lizards are reptiles. Together with snakes, they make up the order Squamata. There are about 6,000 species, which live all over the world, except in cold climates. They range across all continents except Antarctica, as well as most oceanic island chains.

In this scenario, we can use Solr 7’s dynamic fields for our documents. For our buckets mammal and reptile, we want one TextField in particular whose index analyzer uses solr.KeywordTokenizerFactory, which indexes the whole value as a single token. This will enforce exact matches for that TextField.

In this case, we can use the already defined fieldType ancestor_path, one that comes in the default managed-schema file (an XML file) in the conf folder of your Solr core. It is predefined in the XML schema as follows:

  <fieldType name="ancestor_path" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
    </analyzer>
  </fieldType>
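
To see why this pairing gives exact matching for flat values such as mammal, here is a rough Python approximation of the two tokenizers. It is only a sketch, not Lucene's implementation: the function names are made up for illustration, and it assumes values without leading or trailing slashes.

```python
def keyword_tokens(value):
    """Index side: KeywordTokenizer keeps the whole value as one token."""
    return [value]


def path_hierarchy_tokens(value, delimiter="/"):
    """Query side: approximate PathHierarchyTokenizer by emitting every
    leading path prefix, e.g. "a/b/c" gives "a", "a/b", "a/b/c"."""
    parts = value.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


def would_match(indexed_value, query_value):
    """A document matches when a query token equals its single indexed token."""
    indexed = set(keyword_tokens(indexed_value))
    queried = set(path_hierarchy_tokens(query_value))
    return bool(indexed & queried)


print(would_match("mammal", "mammal"))      # True
print(would_match("mammal", "mammals"))     # False
print(would_match("mammal", "mammal/dog"))  # True
```

The last line hints at where the name ancestor_path comes from: a query for a deeper path also matches documents indexed at one of its ancestors. With flat values like mammal and reptile, only the exact value matches.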

A dynamic field that utilizes the ancestor_path fieldType is suffixed with _ancestor_path. The dynamic field is predefined in the XML schema as follows:

  <dynamicField name="*_ancestor_path" type="ancestor_path" indexed="true" stored="true"/>

We want to match the animal type exactly, for grouping purposes. For the other text fields we just want a looser, “fuzzy” full-text search, which the analyzed English text fields give us. Therefore, when we add our docs to Solr, we can use the following dynamic field names in our document structure:

  type_ancestor_path
  animal_txt_en
  description_txt_en

To illustrate with our example, the dog document we give to Solr resembles the following in JSON:

  {
    "type_ancestor_path": "mammal",
    "animal_txt_en": "Dog",
    "description_txt_en": "Dogs are domesticated mammals [...]"
  }
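
As a minimal sketch of how all four pets could be sent to Solr, the following Python builds a JSON update request with the standard library only. The host, port, and the core name pets are assumptions, as is the helper name build_update_request. Depending on your schema you may also need a unique id per document; that is left out here to mirror the example above.

```python
import json
import urllib.request

# Assumptions: a local Solr on the default port and a core named "pets".
SOLR_UPDATE_URL = "http://localhost:8983/solr/pets/update?commit=true"

# Descriptions are abbreviated with [...] as in the article.
PETS = [
    {"type_ancestor_path": "mammal", "animal_txt_en": "Dog",
     "description_txt_en": "Dogs are domesticated mammals, not natural wild animals. [...]"},
    {"type_ancestor_path": "mammal", "animal_txt_en": "Cat",
     "description_txt_en": "[...] Cats have been domesticated (tame) for nearly 10,000 years. [...]"},
    {"type_ancestor_path": "reptile", "animal_txt_en": "Turtle",
     "description_txt_en": "Turtles are the reptile order Testudines. [...]"},
    {"type_ancestor_path": "reptile", "animal_txt_en": "Lizard",
     "description_txt_en": "Lizards are reptiles. [...]"},
]


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Wrap a list of documents in a POST request with a JSON body."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_update_request(PETS)
# Uncomment to actually send it to a running Solr:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```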

Now, we can run our queries and limit the search to one particular type, or otherwise take advantage of other grouping capabilities Solr has to offer.

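To group the search results by type on a generic query, we can send Solr the following search parameters. The qf parameter is only read by the DisMax family of query parsers, so defType is set to edismax:

{ q: 'domesticated',
  defType: 'edismax',
  qf: 'description_txt_en',
  group: true,
  'group.field': 'type_ancestor_path'
}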
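
If you are not going through a client library, the same parameters can be turned into a request URL by hand. The sketch below assumes a local Solr with a core named pets, and build_select_url is a made-up helper. It also sets defType to edismax on the assumption that qf should take effect, since qf is only read by the DisMax family of parsers.

```python
from urllib.parse import urlencode

# Assumptions: a local Solr on the default port and a core named "pets".
SOLR_SELECT_URL = "http://localhost:8983/solr/pets/select"


def build_select_url(params, base_url=SOLR_SELECT_URL):
    """Encode search parameters into a /select URL."""
    # Solr expects lowercase true/false, so convert Python booleans explicitly.
    encoded = {
        key: (str(value).lower() if isinstance(value, bool) else value)
        for key, value in params.items()
    }
    return base_url + "?" + urlencode(encoded)


group_by_type = {
    "q": "domesticated",
    "defType": "edismax",
    "qf": "description_txt_en",
    "group": True,
    "group.field": "type_ancestor_path",
}

url = build_select_url(group_by_type)
print(url)
```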

Similarly, if we want to limit our search to just reptiles:

{ q: 'shell',
  defType: 'edismax',
  qf: 'description_txt_en',
  group: true,
  'group.query': 'type_ancestor_path:reptile'
}

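In these examples, the first query will return dog and cat grouped under mammal; no reptile group comes back, because neither reptile description matches. (By default Solr lists only the top document of each group; raise group.limit to see more.) In the second query, the group.query restricts the group to reptile documents, so the dog and cat descriptions are left out and Solr will return the turtle document.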
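
Once a grouped response comes back, it can be flattened into something easier to read. The sketch below is a rough illustration: summarize_groups is a made-up helper, and the sample dictionary is hand-written to mirror the shape of a group.field response rather than captured from a real Solr. It assumes group.limit was raised above its default of 1 and that animal_txt_en comes back as a plain string.

```python
def summarize_groups(response, field):
    """Map each group value to the animal names listed in that group."""
    groups = response["grouped"][field]["groups"]
    return {
        group["groupValue"]: [doc["animal_txt_en"] for doc in group["doclist"]["docs"]]
        for group in groups
    }


# Hand-written stand-in shaped like a grouped response; not real Solr output.
sample = {
    "grouped": {
        "type_ancestor_path": {
            "matches": 2,
            "groups": [
                {
                    "groupValue": "mammal",
                    "doclist": {
                        "numFound": 2,
                        "start": 0,
                        "docs": [
                            {"animal_txt_en": "Dog"},
                            {"animal_txt_en": "Cat"},
                        ],
                    },
                }
            ],
        }
    }
}

print(summarize_groups(sample, "type_ancestor_path"))  # {'mammal': ['Dog', 'Cat']}
```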

For more things you can do with grouping, and limiting your search to exact matches on particular terms and doing standard matching on others, you can look at some other examples posted in the Solr documentation.

*     *     *

If you found this helpful, be sure to leave a comment and take a look at my other work.
