Simple Search

Revision as of 16:07, 15 February 2024 by Serena Cericola

Introduction

The purpose of this page is to describe the peculiarities of the "simple" search workflow.

ShareVDE provides a simple search API through dedicated REST (/resources) and GraphQL (resources) endpoints. What qualifies a given search as "simple" is its query string, which consists only of an unstructured set of terms, ordered or unordered.

Note the examples below use GraphQL.

Main Workflow: Full vs Partial match

The q parameter in simple search is composed of zero, one or more search terms.

If q is empty or absent, a "match everything" query is executed. Note that since 2.1.0 a type filter parameter is required, so "match everything" is always constrained to one of the entity sets described below (e.g. only agents, agents and opuses, only instances).

This is useful for getting aggregations (i.e. facets) related to entities on the whole catalog.

For example, at the time of writing, the following request on SIT 2.x:

{
    resources(filters:["type:INSTANCE"]) {
        facets {
            ... on FieldFacet {
                name
                buckets {
                    id
                    label
                    count
                }
            }
            ... on StatsFacet {
                name
                min
                max
            }
        }
        totalMatches
    }
}

produces the following instance aggregations on the whole dataset:

{
  "data": {
    "resources": {
      "facets": [
        {
          "name": "contributor",
          "buckets": [
            {
              "id": "https://svde.org/agents/203",
              "label": "Carroll, Adam (Adam Paul)",
              "count": 10
            },
            {
              "id": "https://svde.org/agents/230",
              "label": "Rowling, J. K.",
              "count": 9
            },
            {
              "id": "https://svde.org/agents/201",
              "label": "Carroll, Lewis",
              "count": 8
            },
            {
              "id": "https://svde.org/agents/211",
              "label": "Scholes, Robert E.",
              "count": 6
            },
            {
              "id": "https://svde.org/agents/238",
              "label": "Williams, John",
              "count": 6
            },
            {
              "id": "https://svde.org/agents/204",
              "label": "Dodgson, Campbell",
              "count": 5
            },
            {
              "id": "https://svde.org/agents/229",
              "label": "Tolkien, J. R. R.",
              "count": 5
            },
            {
              "id": "https://svde.org/agents/241",
              "label": "Murakami, Haruki",
              "count": 5
            },
            {
              "id": "https://svde.org/agents/8",
              "label": "Administrative Radio Conference",
              "count": 5
            },
            {
              "id": "https://svde.org/agents/109",
              "label": "De Gruyter Saur",
              "count": 4
            },
            {
              "id": "https://svde.org/agents/116",
              "label": "ACM Special Interest Group for Automata and Computability Theory",
              "count": 4
            },
            {
              "id": "https://svde.org/agents/202",
              "label": "Carroll, Alfred Ludlow",
              "count": 4
            },
            {
              "id": "https://svde.org/agents/208",
              "label": "Slusser, George Edgar",
              "count": 4
            },
            {
              "id": "https://svde.org/agents/210",
              "label": "Rabkin, Eric S.",
              "count": 4
            },
            {
              "id": "https://svde.org/agents/216",
              "label": "Roth, Joe",
              "count": 4
            },
            {
              "id": "https://svde.org/agents/246",
              "label": "Euripides",
              "count": 4
            },
            {
              "id": "https://svde.org/agents/254",
              "label": "Aretino, Pietro",
              "count": 4
            },
            {
              "id": "https://svde.org/agents/7",
              "label": "Northwest Anthropological Conference",
              "count": 4
            },
            {
              "id": "https://svde.org/agents/9",
              "label": "ACM Symposium on Principles of Programming Languages.",
              "count": 4
            },
            {
              "id": "https://svde.org/agents/215",
              "label": "Latimer, Karen",
              "count": 3
            }
          ]
        },
        {
          "name": "publicationPlace",
          "buckets": [
            {
              "id": "https://svde.org/places/5128581",
              "label": "New York City",
              "count": 14
            },
            {
              "id": "https://svde.org/places/4930956",
              "label": "Boston",
              "count": 11
            },
            {
              "id": "https://svde.org/places/2643743",
              "label": "London",
              "count": 10
            },
            {
              "id": "https://svde.org/places/2988506",
              "label": "Paris",
              "count": 7
            },
            {
              "id": "https://svde.org/places/3169070",
              "label": "Rome",
              "count": 7
            },
            {
              "id": "https://svde.org/places/3173435",
              "label": "Milan ",
              "count": 6
            },
            {
              "id": "https://svde.org/places/6252001",
              "label": "United States ",
              "count": 5
            },
            {
              "id": "https://svde.org/places/2950159",
              "label": "Berlin",
              "count": 4
            },
            {
              "id": "https://svde.org/places/3168673",
              "label": "Salerno",
              "count": 4
            },
            {
              "id": "https://svde.org/places/4140963",
              "label": "Washington",
              "count": 4
            },
            {
              "id": "https://svde.org/places/2660646",
              "label": "Geneva",
              "count": 3
            },
            {
              "id": "https://svde.org/places/293397",
              "label": "Tel Aviv",
              "count": 3
            },
            {
              "id": "https://svde.org/places/3176854",
              "label": "Foligno",
              "count": 3
            },
            {
              "id": "https://svde.org/places/5746545",
              "label": "Portland",
              "count": 3
            },
            {
              "id": "https://svde.org/places/4235193",
              "label": "Carbondale",
              "count": 2
            },
            {
              "id": "https://svde.org/places/1692192",
              "label": "Quezon City",
              "count": 1
            },
            {
              "id": "https://svde.org/places/1796236",
              "label": "Shanghai",
              "count": 1
            },
            {
              "id": "https://svde.org/places/1835848",
              "label": "Seoul",
              "count": 1
            },
            {
              "id": "https://svde.org/places/2017370",
              "label": "Russia",
              "count": 1
            },
            {
              "id": "https://svde.org/places/2653941",
              "label": "Cambridge",
              "count": 1
            }
          ]
        },
        {
          "name": "library",
          "buckets": [
            {
              "id": "https://svde.org/agents/UPENN",
              "label": "University of Pennsylvania",
              "count": 23
            },
            {
              "id": "https://svde.org/agents/STANFORD",
              "label": "Stanford University",
              "count": 17
            },
            {
              "id": "https://svde.org/agents/BL",
              "label": "The British Library",
              "count": 15
            },
            {
              "id": "https://svde.org/agents/LOC",
              "label": "Library of Congress",
              "count": 14
            },
            {
              "id": "https://svde.org/agents/NLN",
              "label": "National Library of Norway",
              "count": 14
            },
            {
              "id": "https://svde.org/agents/YALE",
              "label": "Yale University",
              "count": 11
            },
            {
              "id": "https://svde.org/agents/NYU",
              "label": "New York University",
              "count": 7
            },
            {
              "id": "https://svde.org/agents/UALBERTA",
              "label": "University of Alberta",
              "count": 7
            },
            {
              "id": "https://svde.org/agents/DUKE",
              "label": "Duke University",
              "count": 6
            },
            {
              "id": "https://svde.org/agents/CORNELL",
              "label": "Cornell University",
              "count": 4
            }
          ]
        },
        {
          "name": "opusType",
          "buckets": [
            {
              "id": "https://svde.org/opusTypes/T002",
              "label": "volume",
              "count": 16
            },
            {
              "id": "https://svde.org/opusTypes/T005",
              "label": "article",
              "count": 15
            },
            {
              "id": "https://svde.org/opusTypes/T004",
              "label": "journal",
              "count": 6
            },
            {
              "id": "https://svde.org/opusTypes/T007",
              "label": "Review",
              "count": 5
            },
            {
              "id": "https://svde.org/opusTypes/T003",
              "label": "series",
              "count": 3
            },
            {
              "id": "https://svde.org/opusTypes/T001",
              "label": "multi-volume",
              "count": 2
            }
          ]
        },
        {
          "name": "format",
          "buckets": [
            {
              "id": "https://svde.org/formats/nc",
              "label": "Volume",
              "count": 113
            },
            {
              "id": "https://svde.org/formats/cr",
              "label": "Online resource",
              "count": 13
            },
            {
              "id": "https://svde.org/formats/vd",
              "label": "Videodisc",
              "count": 7
            },
            {
              "id": "https://svde.org/formats/cd",
              "label": "Computer disc",
              "count": 3
            },
            {
              "id": "https://svde.org/formats/sd",
              "label": "Audio disc",
              "count": 2
            },
            {
              "id": "https://svde.org/formats/ss",
              "label": "Audiocassette",
              "count": 2
            },
            {
              "id": "https://svde.org/formats/vf",
              "label": "Videocassette",
              "count": 2
            },
            {
              "id": "https://svde.org/formats/nr",
              "label": "Object",
              "count": 1
            },
            {
              "id": "https://svde.org/formats/nz",
              "label": "Other unmediated carrier",
              "count": 1
            },
            {
              "id": "https://svde.org/formats/pp",
              "label": "Microscope slide",
              "count": 1
            }
          ]
        },
        {
          "name": "auctionExhibition",
          "buckets": []
        },
        {
          "name": "language",
          "buckets": [
            {
              "id": "https://svde.org/languages/eng",
              "label": "English",
              "count": 72
            },
            {
              "id": "https://svde.org/languages/ita",
              "label": "Italian",
              "count": 23
            },
            {
              "id": "https://svde.org/languages/fre",
              "label": "French",
              "count": 11
            },
            {
              "id": "https://svde.org/languages/ger",
              "label": "German",
              "count": 7
            },
            {
              "id": "https://svde.org/languages/grc",
              "label": "Greek, Ancient (to 1453)",
              "count": 4
            },
            {
              "id": "https://svde.org/languages/heb",
              "label": "Hebrew",
              "count": 3
            },
            {
              "id": "https://svde.org/languages/rus",
              "label": "Russian",
              "count": 3
            },
            {
              "id": "https://svde.org/languages/spa",
              "label": "Spanish",
              "count": 3
            },
            {
              "id": "https://svde.org/languages/gre",
              "label": "Greek",
              "count": 2
            },
            {
              "id": "https://svde.org/languages/cat",
              "label": "Catalan",
              "count": 1
            }
          ]
        },
        {
          "name": "type",
          "buckets": [
            {
              "id": "INSTANCE",
              "label": "INSTANCE",
              "count": 146
            },
            {
              "id": "OPUS",
              "label": "OPUS",
              "count": 122
            },
            {
              "id": "AGENT",
              "label": "AGENT",
              "count": 95
            }
          ]
        },
        {
          "name": "publicationYear",
          "min": 1500,
          "max": 2021
        },
        {
          "name": "printOnlineChoice",
          "buckets": [
            {
              "id": "print",
              "label": "print",
              "count": 113
            },
            {
              "id": "online",
              "label": "online",
              "count": 13
            }
          ]
        }
      ],
      "totalMatches": 146
    }
  }
}
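
A client consuming this response has to tell the two facet shapes apart: field facets carry a list of buckets, stats facets carry a min and a max. The following Python sketch shows one possible way to do that, assuming the response has already been parsed into a dictionary. The function name summarize_facets is illustrative only and is not part of the ShareVDE API; the sample is an abridged copy of the response above.

```python
from typing import Any


def summarize_facets(response: dict[str, Any]) -> dict[str, Any]:
    """Map each facet name to its bucket counts or to its (min, max) range."""
    summary: dict[str, Any] = {}
    for facet in response["data"]["resources"]["facets"]:
        if "buckets" in facet:
            # FieldFacet: keep label -> count pairs (may be empty).
            summary[facet["name"]] = {
                bucket["label"]: bucket["count"] for bucket in facet["buckets"]
            }
        else:
            # StatsFacet: keep the numeric range.
            summary[facet["name"]] = (facet["min"], facet["max"])
    return summary


# Abridged copy of the response shown above.
sample = {
    "data": {
        "resources": {
            "facets": [
                {
                    "name": "printOnlineChoice",
                    "buckets": [
                        {"id": "print", "label": "print", "count": 113},
                        {"id": "online", "label": "online", "count": 13},
                    ],
                },
                {"name": "auctionExhibition", "buckets": []},
                {"name": "publicationYear", "min": 1500, "max": 2021},
            ],
            "totalMatches": 146,
        }
    }
}

summary = summarize_facets(sample)
```

Checking for the presence of "buckets" works here because the request selects different fields for each facet type; requesting __typename in the GraphQL query would be a more explicit alternative.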

The default behaviour of the simple search is to execute the query using full match logic across the clauses derived from the entered terms. In other words, an entity appears in the search results only if all terms in the query string occur in its definition.

If the full match strategy produces 0 results, a second query is executed using a partial match strategy (i.e. at least 1 term must match). The response contains an attribute called "matchMode" which indicates the logic that has been applied. Here is an example:

{
  "_embedded": {
    "resourceList": [
      ... (paged resource list)
    ]
  },
  ...
  "meta": {
    "matchMode": "FULL"
  }
}

Possible values of the matchMode meta attribute are:

  • FULL: it indicates that an AND logic between query terms has been applied
  • PARTIAL: it indicates that an OR logic between query terms has been applied
  • SERVER_DEFINED: (advanced search only) when the search logic that has been executed cannot be summarised/simplified using the mnemonic codes above.
  • USER_DEFINED: in case of simple search where at least one query term is prefixed by a mandatory (+) or unwanted (-) modifier.

Through the partialMatch API parameter it is possible to skip the full match logic and go straight to partial match execution.
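
The fallback from full to partial match can be illustrated with a small in-memory sketch. It is a toy model under stated assumptions, not the ShareVDE implementation: entities are plain strings, matching is whitespace-token based, the spellchecker step is ignored, and the names simple_search, tokenize and TOY_ENTITIES are hypothetical. Only the matchMode values and the idea behind the partialMatch parameter come from this page.

```python
def tokenize(text: str) -> set[str]:
    """Lower-case a string and split it on whitespace."""
    return set(text.lower().split())


def simple_search(q: str, entities: list[str], partial_match: bool = False) -> dict:
    """Try an AND match first; fall back to an OR match when it finds nothing."""
    terms = tokenize(q)
    if not terms:
        raise ValueError("this sketch only covers queries with at least one term")
    if not partial_match:
        # FULL: every query term must occur in the entity.
        full = [e for e in entities if terms <= tokenize(e)]
        if full:
            return {"results": full, "matchMode": "FULL"}
    # PARTIAL: at least one query term must occur in the entity.
    partial = [e for e in entities if terms & tokenize(e)]
    return {"results": partial, "matchMode": "PARTIAL"}


# Toy data, made up for this example.
TOY_ENTITIES = [
    "alice in wonderland lewis carroll",
    "adam carroll",
    "the hobbit tolkien",
]

full_hit = simple_search("alice carroll", TOY_ENTITIES)
fallback = simple_search("alice tolkien", TOY_ENTITIES)
forced = simple_search("alice carroll", TOY_ENTITIES, partial_match=True)
```

In this model "alice carroll" is answered in FULL mode, "alice tolkien" falls back to PARTIAL because no entity contains both terms, and passing partial_match=True skips the full match attempt altogether.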

Terms Modifiers

Query terms can be prepended by the following modifiers:

Modifier        Description
(no modifier)   Term is optional
+               Term is mandatory
-               Term mustn't be in results

When at least one + or - modifier is detected in the query string, the partial/full match workflow described in the previous section is discarded in favour of the logic expressed through the explicit modifiers. In that case the matchMode attribute has the value USER_DEFINED.
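
As an illustration of the rule above, the following Python sketch splits a query string into mandatory, unwanted and optional terms and reports whether any modifier was used. The function names are hypothetical, and the handling of a lone "+" or "-" (treated here as an ordinary term) is an assumption; the actual parser may behave differently.

```python
def parse_query(q: str) -> dict[str, list[str]]:
    """Group whitespace-separated terms by their leading modifier."""
    parsed: dict[str, list[str]] = {"mandatory": [], "unwanted": [], "optional": []}
    for token in q.split():
        if token.startswith("+") and len(token) > 1:
            parsed["mandatory"].append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            parsed["unwanted"].append(token[1:])
        else:
            # No modifier (or a bare "+" / "-", which is an assumption of this sketch).
            parsed["optional"].append(token)
    return parsed


def uses_modifiers(parsed: dict[str, list[str]]) -> bool:
    """True when the query would be answered with matchMode USER_DEFINED."""
    return bool(parsed["mandatory"] or parsed["unwanted"])


with_modifiers = parse_query("+alice carroll -adam")
plain = parse_query("alice carroll")
```

Here with_modifiers holds one mandatory, one optional and one unwanted term, so uses_modifiers returns True for it and False for plain.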

Spellchecker (aka Did You Mean?)

The Spellchecker component executes as part of the simple search workflow and provides the following features:

  • term suggestions: for each token extracted from the user query, the suggested corrections are terms that, when executed in isolation as a single-term query, produce at least 1 result. In the following example, each term suggestion pairs the misspelled term with its corrections. Note that the DidYouMean type in the GraphQL response offers the same structure.
{
  "_embedded": {
    "resourceList": [
      ... (paged resource list)
    ]
  },
  ...
  "didYouMean": {
    "termSuggestions": [
        {
            "term": "levis",
            "corrections": [
                "lewis",
                "lives",
                "luiss"
            ]
        },
        {
            "term": "windreland",
            "corrections": [ "wonderland" ]
        }
    ],
    "querySuggestions": [
        ...
    ]
  }
}
  • collations / query-based suggestions: collations are the best combinations of term suggestions that produce at least 1 result.
{
  "_embedded": {
    "resourceList": [
      ... (paged resource list)
    ]
  },
  ...
  "didYouMean": {
    "termSuggestions": [
        {
            "term": "levis",
            "corrections": [
                "lewis",
                "lives",
                "luiss"
            ]
        },
        {
            "term": "windreland",
            "corrections": [ "wonderland" ]
        }
    ],
    "querySuggestions": [
        {
            "query": "lewis wonderland"
        }, 
        {
            "query": "luiss wonderland"
        }
    ]
  }
}
  • automatic query correction and (re)execution: if there is only one suggested collation, it is executed automatically. In this case the "meta" section of the response contains the original (user) query and the query suggestion that has been automatically executed:
{
  "_embedded": {
    "resourceList": [
      ... (paged resource list)
    ]
  },
  ...
  "meta": {
    "matchMode": "FULL",
    "userQuery": "amercan libaries",
    "executedQuery": "american libraries"
  }
}
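
The relationship between term suggestions and collations can be sketched as follows: every combination of corrections is a candidate query, and only the candidates that return at least 1 result are kept. This is a simplified model, not the actual spellchecker. The toy index, the helper names and the absence of any ranking ("best" combinations) are assumptions made for the example.

```python
from itertools import product
from typing import Callable

# Toy index, made up so that the output mirrors the example above.
TOY_INDEX = [
    "lewis carroll alice wonderland",
    "luiss wonderland guide",
    "lives of the saints",
]


def has_results(query: str) -> bool:
    """True when some toy document contains every term of the query."""
    terms = set(query.split())
    return any(terms <= set(doc.split()) for doc in TOY_INDEX)


def collations(
    user_terms: list[str],
    term_suggestions: dict[str, list[str]],
    probe: Callable[[str], bool],
) -> list[str]:
    """Combine per-term corrections and keep the queries that match something."""
    # A term without suggestions is kept unchanged.
    options = [term_suggestions.get(term, [term]) for term in user_terms]
    candidates = (" ".join(combo) for combo in product(*options))
    return [candidate for candidate in candidates if probe(candidate)]


suggestions = {
    "levis": ["lewis", "lives", "luiss"],
    "windreland": ["wonderland"],
}
query_suggestions = collations(["levis", "windreland"], suggestions, has_results)
```

With this toy index the candidate "lives wonderland" is discarded because no document contains both terms, which leaves the two query suggestions shown in the example response.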

The following diagram depicts the simple search workflow, including the spellchecker component.

[Diagram: Simple Search Flow.png]

The following picture illustrates the same flow from a user interface perspective:

[Image: 1635277981739.png]

To summarise:

  • in case of 0 results, the full match phase provides collations (and term suggestions, although these are not useful at this stage)
    • in case there's just one collation, a new query is executed automatically and transparently, and the results are returned
    • in case there are multiple collations, an empty response is returned. The response contains the available collations, so the requestor can ask the user to choose one of them.
  • in case there's no collation, the partial match logic is executed
    • if there are results, they are returned
    • if there are no results, the system computes term and query-based suggestions
    • in case there's just one collation, it is used for building and issuing a new query automatically, and the results are returned
    • in case there are multiple collations, an empty response is returned. The response contains the available terms and collations
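
The summary above can be read as a two-phase decision procedure. The Python sketch below models it with injected callables, so it makes no claim about how searching or suggesting is actually implemented. The names run_phase and simple_search_workflow, the (query, mode) signatures and the shape of the returned dictionaries are assumptions; term suggestions are left out for brevity.

```python
from typing import Callable, Optional

Search = Callable[[str, str], list[str]]   # (query, mode) -> results
Suggest = Callable[[str, str], list[str]]  # (query, mode) -> collations


def run_phase(q: str, mode: str, search: Search, suggest: Suggest) -> Optional[dict]:
    """Run one phase; return None when it yields neither results nor collations."""
    results = search(q, mode)
    if results:
        return {"results": results, "meta": {"matchMode": mode}}
    found = suggest(q, mode)
    if len(found) == 1:
        # Exactly one collation: execute it automatically.
        corrected = found[0]
        meta = {"matchMode": mode, "userQuery": q, "executedQuery": corrected}
        return {"results": search(corrected, mode), "meta": meta}
    if len(found) > 1:
        # Several collations: empty response, the user has to choose.
        return {"results": [], "querySuggestions": found}
    return None


def simple_search_workflow(q: str, search: Search, suggest: Suggest) -> dict:
    """Full match phase first, then the partial match phase."""
    for mode in ("FULL", "PARTIAL"):
        response = run_phase(q, mode, search, suggest)
        if response is not None:
            return response
    return {"results": []}


# Stub back end, made up for this example.
INDEX = {
    ("american libraries", "FULL"): ["r1"],
    ("alice tolkien", "PARTIAL"): ["r2", "r3"],
}
COLLATIONS = {
    ("amercan libaries", "FULL"): ["american libraries"],
    ("levis windreland", "FULL"): ["lewis wonderland", "luiss wonderland"],
}


def stub_search(q: str, mode: str) -> list[str]:
    return INDEX.get((q, mode), [])


def stub_suggest(q: str, mode: str) -> list[str]:
    return COLLATIONS.get((q, mode), [])


corrected = simple_search_workflow("amercan libaries", stub_search, stub_suggest)
ambiguous = simple_search_workflow("levis windreland", stub_search, stub_suggest)
partial = simple_search_workflow("alice tolkien", stub_search, stub_suggest)
```

In this model the misspelled query "amercan libaries" is corrected and re-executed, "levis windreland" comes back empty with two query suggestions, and "alice tolkien" is answered by the partial match phase.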

Which kinds of resources can I get back?

Starting from Share-VDE 2.1.0, the simple search service requires a mandatory type filter parameter which constrains the entities returned in the response. The following sections describe the available options.

Agents + Opuses

The type filter includes Opuses and Agents; it can have one of the following forms:

  • type:(OPUS AGENT)
  • type:(AGENT OPUS)
  • type:"AGENT" OR type:"OPUS"
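
A client may want to build the filter string programmatically instead of hard-coding it. The following Python sketch produces the parenthesised form and rejects combinations that this page does not document. The helper name type_filter and the validation policy are assumptions of this example; whether the service accepts other combinations is not stated here.

```python
# Entity sets described on this page.
DOCUMENTED_TYPE_SETS = [
    frozenset({"AGENT", "OPUS"}),
    frozenset({"AGENT"}),
    frozenset({"OPUS"}),
    frozenset({"INSTANCE"}),
]


def type_filter(*types: str) -> str:
    """Build a filter such as type:(OPUS AGENT) from the given entity types."""
    unique = list(dict.fromkeys(types))  # drop duplicates, keep the given order
    if frozenset(unique) not in DOCUMENTED_TYPE_SETS:
        raise ValueError(f"undocumented type combination: {types}")
    return "type:(" + " ".join(unique) + ")"


mixed = type_filter("OPUS", "AGENT")
agents_only = type_filter("AGENT")
```

The resulting string is meant to be passed as one element of the filters argument.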

Example request (GraphQL)

{
    resources(q:"alice carroll", filters:["type:(OPUS AGENT)"]) {
        resources {
            ... on Opus {
                (opus fields)
            }
            ... on Person {
                (person fields)
            }
            ...other agents 
        }
        facets {
            ... on FieldFacet {
                name
                buckets {
                    id
                    label
                    count
                }
            }
            ... on StatsFacet {
                name
                min
                max
            }
        }
    }
}

The following facets are available in this result-set. Being a mixed result-set, some facets belong to agents, some others to opuses.

In the case of a field facet, the facet usually represents a Share-VDE cluster type: each bucket includes its preferred name (or label), its Share-VDE URI and the occurrences count.

In the case of a stats facet, the underlying attribute is a numeric literal (e.g. year). In this case the facet provides the min and max attribute values across the current result-set.

  • contributor: the top 20 contributors of the matching opuses.
  • opusType: the opus types of the matching opuses.
  • genre: the top 20 genres of the matching opuses
  • year: the min and max year of the matching opuses
  • agentType: the agent types of the matching agents.
  • location: the top 20 places related to the matching agents. This attribute groups things that can represent different concepts depending on the matching entity. For example, for a person it could be a birth or a death place, for an organisation the location of its headquarters.
  • type: the type of the matching entities (AGENT or OPUS) and the corresponding occurrences count.
  • beginningDate: the min and max beginning date of the matching agents. A date has a different meaning depending on the agent type. For example, a person could have a birth date, an organisation a founding year
  • endingDate: the min and max ending date of the matching agents. Same grouping logic as before: for a person this is the death date, for a meeting the end date, for an organization the dissolution year.

Agents

The type filter includes only agents; it can have one of the following forms:

  • type:(AGENT)
  • type:"AGENT"

Example request (GraphQL)

{
    resources(q:"alice carroll", filters:["type:\"AGENT\""]) {
        resources {
            ... on Person {
                (person fields)
            }
            ...other agents 
        }
        facets {
            ... on FieldFacet {
                name
                buckets {
                    id
                    label
                    count
                }
            }
            ... on StatsFacet {
                name
                min
                max
            }
        }
    }
}

The following facets are available in this result-set.

In the case of a field facet, the facet usually represents a Share-VDE cluster type: each bucket includes its preferred name (or label), its Share-VDE URI and the occurrences count.

In the case of a stats facet, the underlying attribute is a numeric literal (e.g. year). In this case the facet provides the min and max attribute values across the current result-set.

  • agentType: the agent types of the matching agents.
  • location: the top 20 places related to the matching agents. This attribute groups things that can represent different concepts depending on the matching entity. For example, for a person it could be a birth or a death place, for an organisation the location of its headquarters.
  • beginningDate: the min and max beginning date of the matching agents. A date has a different meaning depending on the agent type. For example, a person could have a birth date, an organisation a founding year
  • endingDate: the min and max ending date of the matching agents. Same grouping logic as before: for a person this is the death date, for a meeting the end date, for an organization the dissolution year.
  • type: the type of the matching entities and the corresponding occurrences count. Note this facet ignores the type filter and provides an aggregation over the three available entity types: instances (publications), agents and opuses.

Opuses

The type filter includes only opuses; it can have one of the following forms:

  • type:(OPUS)
  • type:"OPUS"

Example request (GraphQL)

{
    resources(q:"alice carroll", filters:["type:\"OPUS\""]) {
        resources {
            ... on Opus {
                (opus fields)
            }
        }
        facets {
            ... on FieldFacet {
                name
                buckets {
                    id
                    label
                    count
                }
            }
            ... on StatsFacet {
                name
                min
                max
            }
        }
    }
}

The following facets are available in this result-set.

In the case of a field facet, the facet usually represents a Share-VDE cluster type: each bucket includes its preferred name (or label), its Share-VDE URI and the occurrences count.

In the case of a stats facet, the underlying attribute is a numeric literal (e.g. year). In this case the facet provides the min and max attribute values across the current result-set.

  • contributor: the top 20 contributors of the matching opuses.
  • opusType: the opus types of the matching opuses.
  • genre: the top 20 genres of the matching opuses
  • year: the min and max year of the matching opuses
  • agentType: the agent types of the matching agents.
  • location: the top 20 places related to the matching agents. This attribute groups things that can represent different concepts depending on the matching entity. For example, for a person it could be a birth or a death place, for an organisation the location of its headquarters.
  • type: the type of the matching entities and the corresponding occurrences count. Note this facet ignores the type filter and provides an aggregation over the three available entity types: instances (publications), agents and opuses.

Publications

The type filter includes only publications; it can have one of the following forms:

  • type:(INSTANCE)
  • type:"INSTANCE"

Example request (GraphQL)

{
    resources(q:"alice carroll", filters:["type:\"INSTANCE\""]) {
        resources {
            ... on PublicationFlatCollection {
                resources {
                    uri
                    instance {
                        (instance fields)
                    }
                }
            }
        }
        facets {
            ... on FieldFacet {
                name
                buckets {
                    id
                    label
                    count
                }
            }
            ... on StatsFacet {
                name
                min
                max
            }
        }
    }
}

The following facets are available in this result-set.

In the case of a field facet, the facet usually represents a Share-VDE cluster type: each bucket includes its preferred name (or label), its Share-VDE URI and the occurrences count.

In the case of a stats facet, the underlying attribute is a numeric literal (e.g. year). In this case the facet provides the min and max attribute values across the current result-set.

  • contributor: the top 20 contributors of the matching publications.
  • publicationPlace: the top 20 publication places of the matching publications.
  • opusType: the opus types of the parent opuses of the matching publications.
  • library: the libraries (and the corresponding counts) of the matching publications.
  • format: the top 20 formats of the matching publications.
  • auctionExhibition: (Kubikat only)
  • language: the top 20 languages of the matching publications.
  • publicationYear: the min and max publication year of the matching publications.
  • printOnlineChoice (Kubikat only): a two-value attribute that allows filtering between "print" and "online" publications
  • type: the type of the matching entities and the corresponding occurrences count. Note this facet ignores the type filter and provides an aggregation over the three available entity types: instances (publications), agents and opuses.

Exact Match Suggestions

There's another feature, only available as a GraphQL operation (no REST API), which accepts a query string composed only of terms and returns all entities which have an exact match in:

  • identifiers (e.g. local id, viaf id, isni id, ISSN, ISBN, EAN, ISMN, Barcode)
  • headings (e.g. titles, names)

The exactMatch operation does its best to detect whether the query string contains multiple "exact matches".

See here for a detailed description about covered and uncovered cases.