Cloud Indexing schema

To allow efficient storage and effective retrieval of items, we use a highly optimized indexing schema. Documents pushed to the index must follow the schema described here. Documents that do not follow it are rejected with an error message and are not indexed.

We support two types of documents: products and content pages. If you need other types of documents, please get in touch with us.

Products

Document ID and type

Each document has a document id and a type. The type described here is product; for the content page schema, see below. The id should be unique across all documents.

{
  "id": string,      # unique identifier of the document
  "type": "product"  # document type
}

Search result data

The search_result_data component contains basic item information that can be used to visualize results, but that is not used to perform the actual search. In other words, data that is only present in search_result_data will not lead to any matches. Examples are the price and the number of variants of a product.

{
  "search_result_data": {
    "productId": string,         # document identifier
    "name": string,              # document name or title
    "number_of_variants": int,   # number of product variants (e.g., different colors or sizes)
    "final_price": double,       # actual sales price of the product
    "base_price": double         # price before discounts etc.
  }
}

Searchable data

For an item to be findable, we need data that we can search through. This data is stored in search_data, which is a list of dictionaries with one entry per product variant. Product variants are, for example, different colors of the same product. Each entry in search_data has two required fields: full_text and full_text_boosted. The former is a concatenation of most product fields (e.g., product name, description, brand, model, size, and color), whereas the latter is a concatenation of only the most important fields (for example, the product name, brand, and model).

Besides the two full text fields, searchable product data consists of product_fields, string_facet, and number_facet, which are explained below.

{
  "search_data": [                  # list of searchable data for each product variant (e.g., color)
    {
      "full_text": string,          # concatenation of textual document data (e.g., title, description, tags)
      "full_text_boosted": string,  # concatenation of important document data (e.g., name, brand, model)

      "product_fields": [ ],
      "string_facet": [ ],
      "number_facet": [ ]
    },
    ...
  ]
}
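
As a minimal sketch of how the two full text fields could be assembled, the Python snippet below joins the fields of a single variant. The function name build_full_text and both field lists are assumptions made for illustration; choose the fields that matter for your own catalogue.

```python
def build_full_text(variant):
    """Concatenate the fields of one variant into the two full text fields."""
    # Hypothetical field lists, loosely following the examples in the text.
    all_fields = ["name", "description", "brand", "model", "size", "color"]
    boosted_fields = ["name", "brand", "model"]

    def join(fields):
        # Skip missing or empty values so no stray spaces end up in the text.
        return " ".join(str(variant[f]) for f in fields if variant.get(f))

    return {
        "full_text": join(all_fields),
        "full_text_boosted": join(boosted_fields),
    }


variant = {
    "name": "Apple iPhone X",
    "description": "The world's most personal device.",
    "brand": "Apple",
    "color": "white",
}
texts = build_full_text(variant)
```

The resulting dictionary can be merged into one search_data entry, next to product_fields, string_facet, and number_facet.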

Note

Storing product variants as separate products is also possible. Simply set number_of_variants to 1 and add each variant as a separate document. Note that adding variants separately makes it very difficult to group product variants in the UI.
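
The approach in this note could be sketched as follows in Python. The function name split_variants and the id scheme (original id plus a counter) are assumptions; any scheme works as long as ids stay unique across all documents. The sketch also assumes that productId should follow the new id.

```python
import copy


def split_variants(document):
    """Turn one product document with N variants into N single-variant documents."""
    singles = []
    for index in range(len(document["search_data"])):
        single = copy.deepcopy(document)  # leave the original document untouched
        # Hypothetical id scheme: append a 1-based counter to the original id.
        single["id"] = f"{document['id']}-{index + 1}"
        single["search_data"] = [single["search_data"][index]]
        single["search_result_data"]["productId"] = single["id"]
        single["search_result_data"]["number_of_variants"] = 1
        singles.append(single)
    return singles


document = {
    "id": "10001",
    "type": "product",
    "search_result_data": {"productId": "10001", "name": "Shirt", "number_of_variants": 2},
    "search_data": [
        {"full_text": "Shirt red", "full_text_boosted": "Shirt"},
        {"full_text": "Shirt blue", "full_text_boosted": "Shirt"},
    ],
}
singles = split_variants(document)
```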

Product fields

Product fields are searchable fields describing one specific product. Product fields are not part of aggregations, which means that these fields cannot be used for faceted search or filtering. Typical product fields include product description and product name, but not price or brand (because those fields are probably part of your filters).

{
  "product_fields": [           # list of searchable product fields (e.g., name, description)
    {
      "field-name": string,     # field name (e.g., name)
      "field-value": string     # field value (e.g., Apple iPhone)
    },
    ...
  ]
}

Facets

Facets are searchable fields that contain product characteristics, which can be used to filter search results to include only products that have a particular facet value. Typical examples of facets include brand, price, size, and color. Each facet consists of a facet name (e.g., color) and the facet value (e.g., red). We allow for string_facet (facets for which the values are strings, like brand and color) and number_facet (facets for which the values are numbers, like price and size). Numerical facets allow for more advanced ways of filtering, for example by using in-between, minimum, or maximum values ("select products that are cheaper than X").

{
  "string_facet": [             # list of document string facets and their values (e.g., brand, type, model)
    {
      "facet-name": string,     # facet name (e.g., brand)
      "facet-value": string     # facet value (e.g., Apple)
    },
    ...
  ],
  "number_facet": [             # list of document number facets and their values (e.g., price, size)
    {
      "facet-name": string,     # facet name (e.g., size)
      "facet-value": double     # facet value (e.g., 41)
    },
    ...
  ]
}
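
To illustrate the split between the two facet types, here is a small Python sketch that sorts a flat dictionary of product attributes into string_facet and number_facet lists. The function name build_facets and the input layout are assumptions for illustration only.

```python
from numbers import Number


def build_facets(attributes):
    """Split a flat attribute dictionary into string and number facets."""
    string_facet, number_facet = [], []
    for name, value in attributes.items():
        # bool is a subclass of int in Python, so treat it as a string facet.
        if isinstance(value, Number) and not isinstance(value, bool):
            number_facet.append({"facet-name": name, "facet-value": float(value)})
        else:
            string_facet.append({"facet-name": name, "facet-value": str(value)})
    return {"string_facet": string_facet, "number_facet": number_facet}


facets = build_facets({"brand": "Apple", "color": "white", "price": 980.34, "storage": 64})
```

Numeric values are converted to float so that they match the double type in the schema.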

Categories

Documents often reside in a category structure. This structure contains important information for searching, and is a requirement for browsing. We require the direct parent category of a document, a list of all parents, and the path for the document. All these fields should be actual strings and not identifiers / references to category names stored somewhere else.

{
  "category": {
    "direct_parents": [  # a list of direct parent categories (usually one)
      string,
      ...
    ],
    "all_parents": [     # a list of all ancestors (parents, grandparents)
      string,
      ...
    ],
    "paths": [           # a list of the full paths to the document
      string,
      ...
    ]
  }
}
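
The three category fields can usually be derived from the category paths alone. The Python sketch below assumes each path is available as a list of category names ordered from root to direct parent. The function name build_category is hypothetical, and the " - " separator is simply the one used in the example documents on this page.

```python
def build_category(paths, separator=" - "):
    """Derive direct_parents, all_parents, and paths from root-to-leaf name lists."""
    direct_parents, all_parents = [], []
    for path in paths:
        # The last element of a path is the direct parent of the document.
        if path[-1] not in direct_parents:
            direct_parents.append(path[-1])
        # Walk from the direct parent up to the root, skipping duplicates.
        for name in reversed(path):
            if name not in all_parents:
                all_parents.append(name)
    return {
        "category": {
            "direct_parents": direct_parents,
            "all_parents": all_parents,
            "paths": [separator.join(path) for path in paths],
        }
    }


category = build_category([["phones", "smartphones", "iphones"]])
```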

Product data

The best search (and browse) results are not only determined by searchable data like product descriptions. Additional product data plays an important role, which is why we require such data to achieve the optimal search and browse performance. Below is an example list of product_data, but you are free to remove or add product data. Get in touch with us to learn which data is most important and what other data sources could be useful.

{
  "product_data": {
    "sales_7_days": double,         # revenue from product in last 7 days
    "sales_14_days": double,        # revenue from product in last 14 days
    "sales_30_days": double,        # revenue from product in last 30 days
    "sales_60_days": double,        # revenue from product in last 60 days
    "sales_90_days": double,        # revenue from product in last 90 days
    "sales_180_days": double,       # revenue from product in last 180 days
    "average_rating": double,       # average product rating
    "number_of_ratings": int,       # number of product ratings
    "on_sale": bool,                # true if product is on sale
    "days_since_promotion": int,    # number of days since the product was promoted
    "features_on_frontpage": bool,  # true if product features on frontpage
    "features_on_banners": bool,    # true if product features on banners
    "general_popularity": int,      # general popularity indicator (e.g., views, purchases)
    "delivery_speed": int,          # indicator of how fast the product can be delivered
    "stock": int                    # number of items in stock
  }
}
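
The sales_N_days fields could be computed from order history along the following lines. This is only a sketch: the function name sales_windows, the (order_date, revenue) input layout, and the window boundaries (an order counts if it falls within the last N days up to and including today) are all assumptions.

```python
from datetime import date, timedelta

WINDOWS = (7, 14, 30, 60, 90, 180)


def sales_windows(orders, today):
    """Sum revenue per time window for one product.

    orders is a list of (order_date, revenue) tuples.
    """
    result = {}
    for days in WINDOWS:
        cutoff = today - timedelta(days=days)
        # Count orders placed after the cutoff, up to and including today.
        total = sum(revenue for order_date, revenue in orders if cutoff < order_date <= today)
        result[f"sales_{days}_days"] = float(total)
    return result


orders = [
    (date(2024, 1, 30), 100.0),
    (date(2024, 1, 20), 50.0),
    (date(2023, 12, 1), 25.0),
]
windows = sales_windows(orders, today=date(2024, 1, 31))
```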

Synonyms

904Labs A.I. for Search has a module that automatically learns synonyms from search behavior. However, if synonyms are already available, it would be a waste not to use them. The synonyms component is a list of terms that are synonyms for the current product.

{
  "synonyms": [
    string, string, string, ...
  ]
}

Sorting

Search results are usually sorted by relevance, but we might also want to sort by different facets. To allow for this, we need to know which fields are eligible for sorting. string_sort is a dictionary of facet name and string value pairs that can be used for sorting. number_sort is the same, but for facets with numerical values.

{
  "number_sort": {  # facets with numerical values (e.g., price)
    string: double, 
    ...: ...
  },
  "string_sort": {  # facets with string values (e.g., name)
    string: string,
    ...: ...
  }
}
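
A minimal Python sketch of filling both dictionaries from a flat set of values is shown below. The function name build_sort_fields and the explicit list of sortable field names are assumptions. Only fields named in that list end up in the sort dictionaries.

```python
def build_sort_fields(values, sortable):
    """Build number_sort and string_sort for the fields listed in sortable."""
    number_sort, string_sort = {}, {}
    for name in sortable:
        if name not in values:
            continue  # nothing to sort on for this document
        value = values[name]
        # bool is a subclass of int in Python, so exclude it from number_sort.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            number_sort[name] = float(value)
        else:
            string_sort[name] = str(value)
    return {"number_sort": number_sort, "string_sort": string_sort}


sort_fields = build_sort_fields(
    {"price": 980.34, "name": "Apple iPhone X", "brand": "Apple", "color": "white"},
    sortable=["price", "name", "brand"],
)
```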

Putting it all together

Combining all fields described above and adding example data, we get the following product document.

{
  "id": "10001-1001-1",
  "type": "product",

  "search_result_data": {
    "productId": "10001-1001-1",
    "name": "Apple iPhone X",
    "number_of_variants": 1,
    "final_price": 980.34,
    "base_price": 999.00
  },

  "search_data": [
    {
      "full_text": "Apple iPhone X The world's most personal device. Apple white 64GB",
      "full_text_boosted": "Apple iPhone X Apple",

      "product_fields": [
        {
          "field-name": "name",
          "field-value": "Apple iPhone X"
        },
        {
          "field-name": "description",
          "field-value": "The world's most personal device."
        }
      ],
      "string_facet": [
        {
          "facet-name": "brand",
          "facet-value": "Apple"
        },
        {
          "facet-name": "color",
          "facet-value": "white"
        }
      ],
      "number_facet": [
        {
          "facet-name": "price",
          "facet-value": 980.34
        },
        {
          "facet-name": "storage",
          "facet-value": 64
        }
      ]
    }
  ],

  "category": {
    "direct_parents": [
      "iphones"
    ],
    "all_parents": [
      "iphones", "smartphones", "phones"
    ],
    "paths": [
      "phones - smartphones - iphones"
    ]
  },

  "product_data": {
    "sales_7_days": 500,
    "sales_14_days": 1100,
    "sales_30_days": 4500,
    "sales_60_days": 10300,
    "sales_90_days": 12000,
    "average_rating": 4.3,
    "number_of_ratings": 2500,
    "on_sale": false,
    "days_since_promotion": 50,
    "features_on_frontpage": true,
    "features_on_banners": true,
    "general_popularity": 5600000,
    "delivery_speed": 1,
    "stock": 56
  },

  "synonyms": [
    "iphonex", "applex", "iphone 10"
  ],

  "number_sort": {
    "price": 980.34
  },
  "string_sort": {
    "name": "Apple iPhone X",
    "brand": "Apple"
  }
}
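
Before pushing a document, it can help to run a quick local check on the fields this page marks as required. The Python sketch below is not the validation performed by the indexing service. The function name validate_product and the error messages are hypothetical, and only id, type, and the two required full text fields are checked.

```python
def validate_product(document):
    """Return a list of problems found in a product document (empty if none)."""
    errors = []
    if not isinstance(document.get("id"), str) or not document["id"]:
        errors.append("id must be a non-empty string")
    if document.get("type") != "product":
        errors.append('type must be "product"')

    search_data = document.get("search_data")
    if not isinstance(search_data, list) or not search_data:
        errors.append("search_data must be a non-empty list")
        return errors

    for index, variant in enumerate(search_data):
        if not isinstance(variant, dict):
            errors.append(f"search_data[{index}] must be a dictionary")
            continue
        for field in ("full_text", "full_text_boosted"):
            if not isinstance(variant.get(field), str):
                errors.append(f"search_data[{index}].{field} must be a string")
    return errors


errors = validate_product(
    {"id": "10001-1001-1", "type": "product", "search_data": [{"full_text": "Apple iPhone X"}]}
)
```

An empty list means the checked fields look fine. Anything else can be logged or fixed before the document is sent.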

Pages

Sometimes you don't just want to search through the product catalogue, but also include the FAQ, customer service pages, return policy, newsletter information, and other content pages. To allow for this, we support a page document type.

Document ID and type

Each document has a document id and a type. The type described here is page; for the product schema, see above. The id should be unique across all documents.

{
  "id": string,     # unique identifier of the document
  "type": "page"    # document type
}

Search result data

The search_result_data component contains basic item information that can be used to visualize results, but that is not used to perform the actual search. In other words, data that is only present in search_result_data will not lead to any matches. Examples are the image URL and the page ID.

{
  "search_result_data": {
    "pageId": string,         # document identifier
    "name": string,           # document name or title
    "thumb_url": string,      # url of a thumbnail image
    "publication_date": date  # publication date of page
  }
}
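
The schema types publication_date as date without spelling out a wire format. The Python sketch below assumes an ISO 8601 UTC timestamp with millisecond precision, which is the shape of the value in the page example on this page. The function name format_publication_date is hypothetical.

```python
from datetime import datetime, timezone


def format_publication_date(moment):
    """Format a timezone-aware datetime as an ISO 8601 UTC string with milliseconds."""
    utc = moment.astimezone(timezone.utc)
    # strftime has no millisecond directive, so append the milliseconds by hand.
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


published = format_publication_date(
    datetime(2016, 5, 23, 17, 21, 43, 511000, tzinfo=timezone.utc)
)
```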

Searchable data

For an item to be findable, we need data that we can search through. This data is stored in search_data, which is a list of dictionaries. Each entry in search_data has two required fields: full_text and full_text_boosted. The former is a concatenation of most page fields (e.g., title, abstract, all text), whereas the latter is a concatenation of only the most important fields (for example, title and abstract).

Besides the two full text fields, searchable data consists of page_fields, string_facet, and number_facet, which are explained below.

{
  "search_data": [                  # list of searchable data for a page
    {
      "full_text": string,          # concatenation of textual document data (e.g., title, description, tags)
      "full_text_boosted": string,  # concatenation of important document data (e.g., title and description)

      "page_fields": [ ],
      "string_facet": [ ],
      "number_facet": [ ]
    }
  ]
}

Page fields

Page fields are searchable fields describing one specific page. Page fields are not part of aggregations, which means that these fields cannot be used for faceted search or filtering. Typical page fields include page title and page abstract, but not author (because that field is probably part of your filters).

{
  "page_fields": [              # list of searchable page fields (e.g., title, abstract)
    {
      "field-name": string,     # field name (e.g., title)
      "field-value": string     # field value (e.g., How to reach us)
    },
    ...
  ]
}

Facets

Facets are searchable fields that contain page characteristics, which can be used to filter search results to include only pages that have a particular facet value. Typical examples of facets include author, source, and age restriction. Each facet consists of a facet name (e.g., author) and the facet value (e.g., Shakespeare). We allow for string_facet (facets for which the values are strings, like author) and number_facet (facets for which the values are numbers, like age restriction). Numerical facets allow for more advanced ways of filtering, for example by using in-between, minimum, or maximum values ("select pages that have minimum age restrictions of X").

{
  "string_facet": [             # list of document string facets and their values (e.g., author, publication status)
    {
      "facet-name": string,     # facet name (e.g., author)
      "facet-value": string     # facet value (e.g., Shakespeare)
    },
    ...
  ],
  "number_facet": [             # list of document number facets and their values (e.g., age restriction)
    {
      "facet-name": string,     # facet name (e.g., age limit)
      "facet-value": double     # facet value (e.g., 12)
    },
    ...
  ]
}

Categories

Documents often reside in a category structure. This structure contains important information for searching, and is a requirement for browsing. We require the direct parent category of a document, a list of all parents, and the path for the document. All these fields should be actual strings and not identifiers / references to category names stored somewhere else.

{
  "category": {
    "direct_parents": [  # a list of direct parent categories (usually one)
      string,
      ...
    ],
    "all_parents": [     # a list of all ancestors (parents, grandparents)
      string,
      ...
    ],
    "paths": [           # a list of the full paths to the document
      string,
      ...
    ]
  }
}

Page data

The best search (and browse) results are not only determined by searchable data like descriptions. Additional page data plays an important role, which is why we require such data to achieve the optimal search and browse performance. Below is an example list of page_data, but you are free to remove or add page data. Get in touch with us to learn which data is most important and what other data sources could be useful.

{
  "page_data": {
    "average_rating": double,       # average page/answer rating
    "number_of_ratings": int,       # number of page/answer ratings
    "general_popularity": int       # General popularity indicator (e.g., views)
  }
}

Synonyms

904Labs A.I. for Search has a module that automatically learns synonyms from search behavior. However, if synonyms are already available, it would be a waste not to use them. The synonyms component is a list of terms that are synonyms for the current page.

{
  "synonyms": [
    string, string, string, ...
  ]
}

Sorting

Search results are usually sorted by relevance, but we might also want to sort by different facets. To allow for this, we need to know which fields are eligible for sorting. string_sort is a dictionary of facet name and string value pairs that can be used for sorting. number_sort is the same, but for facets with numerical values.

{
  "number_sort": {  # facets with numerical values (e.g., publication date)
    string: double, 
    ...: ...
  },
  "string_sort": {  # facets with string values (e.g., name)
    string: string,
    ...: ...
  }
}

Putting it all together

Combining all fields described above and adding example data, we get the following page document.

{
  "id": "1233-2424",
  "type": "page",

  "search_result_data": {
    "pageId": "1233-2424",
    "name": "Othello",
    "thumb_url": "/img/1233-2424/thumb.png",
    "publication_date": "2016-05-23T17:21:43.511Z"
  },

  "search_data": [
    {
      "full_text": "Othello Shakespeare Tush! never tell me; I take it much unkindly That thou, Iago, who hast had my purse",
      "full_text_boosted": "Othello Shakespeare",

      "page_fields": [
        {
          "field-name": "title",
          "field-value": "Othello"
        }
      ],
      "string_facet": [
        {
          "facet-name": "author",
          "facet-value": "Shakespeare"
        }
      ],
      "number_facet": [
        {
          "facet-name": "age restriction",
          "facet-value": 12
        }
      ]
    }
  ],

  "category": {
    "direct_parents": [ "plays" ],
    "all_parents": [ "plays", "theatre", "culture" ],
    "paths": [ "culture - theatre - plays" ]
  },

  "page_data": {
    "average_rating": 3.2,
    "number_of_ratings": 145,
    "general_popularity": 5643
  },

  "synonyms": [ "roderigo", "iago", "more", "moore" ],

  "number_sort": {

  },
  "string_sort": {
    "title": "Othello",
    "author": "Shakespeare"
  }
}