Azure Search: Introduction (part 1 of 3)
Azure Search is a search-as-a-service cloud solution that gives developers an API for adding powerful search functionality to an application without having to install or manage the underlying search technology. The service is managed and queried through a REST API, which hides its complexity.
To get started, you can use a free trial in the Azure portal to test different configurations and solutions and discover the full potential of Azure Search. Once the free trial is set up, there are four main steps: creating the service, creating the index, loading the data and searching. I will explain these steps along with example code in the section ‘Using Azure Search’.
A typical scenario is outlined in the diagram below. Azure Search is used in conjunction with an application’s database, which can be used to populate and sync the data held within the search service. Azure Search sits beside the data store, which could be either a relational database or a NoSQL database. Data can either be sent directly to Azure Search via the REST API, or it can be added by crawling your data store via indexers and data sources.
Azure Search offers many features including:
These features can all be used via the REST API and some examples are provided in the ‘Using Azure Search’ section below. One of the main benefits of using Azure’s search service is that it provides all these features pretty much out of the box and it is managed by Microsoft. Another benefit is that all the complexity is hidden behind a REST API. A developer can simply connect to the search service using this REST API and set up a powerful search that can handle spelling mistakes and provide search suggestions.
Microsoft also boasts about the performance of the search. Depending on the service plan used it can handle millions of documents with ease. Finally, the service can easily be scaled on demand, although of course at a cost, so your application can handle peak loads and be scaled down when necessary.
Within Azure Search the index is the store of data, and it holds documents. The index defines the structure of that data: each field in the index has a name, a data type and a set of attributes. In simple database terms, the index is equivalent to a table, documents are equivalent to rows and fields are equivalent to columns. The JSON object below shows an example index definition. This example is used to demonstrate the different field attributes and types, as well as more complex features that can be defined as part of the index.
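As a rough reconstruction of what such an index definition could look like, the Python sketch below builds one as a dictionary and prepares (but does not send) the POST request. The field names come from the descriptions in this post; the service name, index name, API key, API version and the `standard_v2` tokeniser name are assumptions for illustration and should be checked against your own service.

```python
import json
import urllib.request

# Hypothetical values: replace with your own service name, admin key and API version.
SERVICE_NAME = "my-search-service"
API_KEY = "<admin-api-key>"
API_VERSION = "2017-11-11"

index_definition = {
    "name": "homes",  # hypothetical index name
    "fields": [
        # Key field: unique string identifier, hidden from query results.
        {"name": "houseId", "type": "Edm.String", "key": True, "retrievable": False},
        {"name": "description", "type": "Edm.String", "searchable": True},
        # French text analysed with the French Lucene analyser.
        {"name": "description_fr", "type": "Edm.String", "searchable": True,
         "analyzer": "fr.lucene"},
        {"name": "tags", "type": "Collection(Edm.String)", "filterable": True},
        {"name": "gardenIncluded", "type": "Edm.Boolean", "filterable": True},
        {"name": "lastRenovationDate", "type": "Edm.DateTimeOffset",
         "filterable": True, "sortable": True},
    ],
    "analyzers": [
        {
            # Custom analyser: standard tokeniser plus three token filters.
            "name": "phonetic_ascii_analyzer",
            "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
            "tokenizer": "standard_v2",  # assumed predefined tokeniser name
            "tokenFilters": ["lowercase", "asciifolding", "phonetic"],
        }
    ],
}


def build_create_index_request(definition):
    """Build the POST request that would create the index (nothing is sent here)."""
    url = f"https://{SERVICE_NAME}.search.windows.net/indexes?api-version={API_VERSION}"
    return urllib.request.Request(
        url,
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        method="POST",
    )


request = build_create_index_request(index_definition)
print(request.get_method(), request.full_url)
print(json.dumps(index_definition, indent=2))
```

Passing `request` to `urllib.request.urlopen` would actually submit it; that step is left out so the sketch can be run without a live service.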
The JSON above can be sent as the body of a POST request to the Azure Search REST API. This example outlines an index that could be used for searching details of homes from an estate agent’s website.
The data type dictates how the field can be mapped between the index and the data source, and also defines how the field can be filtered or searched. Example data types include string, collection of strings, boolean, integer, double, date time offset and geographic location. In the example above the tags (which could be used as tags to describe the house) are represented as a collection of strings, the gardenIncluded field is represented as a boolean and the description field is represented by a string.
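To make the data types more concrete, here is a minimal sketch of how a single house record might be shaped into a document for upload. The `to_search_document` helper and the sample values are hypothetical; the `value` array and `@search.action` property follow the document upload format of the REST API as I understand it.

```python
import json
from datetime import datetime, timezone


def to_search_document(house_id, description, tags, garden_included, last_renovated):
    """Shape one record so each value matches the field type in the index."""
    return {
        "@search.action": "upload",           # add the document, or replace it if the key exists
        "houseId": str(house_id),              # key fields must be strings
        "description": description,            # Edm.String
        "tags": list(tags),                    # Collection(Edm.String)
        "gardenIncluded": bool(garden_included),  # Edm.Boolean
        # Edm.DateTimeOffset, written here as an ISO 8601 timestamp in UTC.
        "lastRenovationDate": last_renovated.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
    }


batch = {
    "value": [
        to_search_document(
            42,
            "Spacious house with large garden",
            ["detached", "garden"],
            True,
            datetime(2015, 6, 1, tzinfo=timezone.utc),
        )
    ]
}
print(json.dumps(batch, indent=2))
```

A body like this would be posted to the index's documents endpoint; the sketch only builds the payload.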
When defining fields, all field attributes (excluding the key attribute) default to true unless explicitly set to false.
Key – This is used as the unique identifier for a document in the index. The key field can be used to look up a document directly and is used to sync data between a data source and the index via an indexer. Key fields must be of type string.
Retrievable – Fields marked as retrievable will be returned when the index is queried. In the example above, “houseId” is not retrievable. This means that when Azure Search is queried the response will not contain the id for any of the returned objects.
Searchable – Fields marked as searchable will be searchable through the REST API. When a field is marked as searchable it is tokenised and analysed. In the example above the description of the house is searchable. Therefore, if the description was “Spacious house with large garden” then the text will be broken down into words and undergo other lexical analysis, such as handling word inflections. If the user searched for “gardening” or “gardens” then the example description would match, as they are inflections of “garden”. Note that this kind of matching depends on the analyser in use: language analysers handle inflections, whereas the default standard analyser does not stem words. An important point about searchable fields is that they take up more space in the index, because Azure stores the tokens produced by this analysis.
Filterable – Fields marked as filterable are fields which can be filtered with classic filters, such as equals, less than or more than. In the example above “lastRenovationDate” is marked as filterable. This allows the user to filter only for houses that have been renovated during a certain time frame, or houses that have been renovated recently.
Sortable – Fields marked as sortable can be specified to tell Azure Search how to order the results returned. By default Azure will return results in the order of the search score (based on how closely the search text matches a result in the index).
Analysers – Different analysers can be specified to tell Azure Search how to analyse the inputted data. For example, “fr.lucene” is used in the above example for the “description_fr” field. This means that tokenisation and other analysis will be performed in a way that better suits the French language. Various languages can be chosen, as well as various analysers.
In some scenarios you may want to analyse text differently to the standard approach taken by Azure Search. This can be done with one of the predefined analysers or with a custom analyser. An analyser is a configuration that combines a tokeniser with filters that remove, replace or normalise characters and tokens in the input text. The example above defines a custom analyser called “phonetic_ascii_analyzer”. It uses a standard tokeniser followed by three token filters: lowercase (so that search matching ignores case), ASCII folding (normalises characters such as ö or ê to allow for easier matching) and phonetic (matches on phonetically similar words).
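As a small illustration of how the filterable and sortable attributes surface at query time, the sketch below builds a query string that filters on “lastRenovationDate” and orders by it. The `build_search_query` helper and the API version are assumptions for illustration; `$filter` and `$orderby` are the OData-style parameters the search endpoint accepts, and the field must be marked filterable and sortable for them to work.

```python
from urllib.parse import parse_qs, urlencode

API_VERSION = "2017-11-11"  # assumed API version


def build_search_query(search_text, renovated_since=None, newest_first=False):
    """Build the query string for a search request against the example index."""
    params = {"api-version": API_VERSION, "search": search_text}
    if renovated_since:
        # Only houses renovated on or after the given ISO 8601 timestamp.
        params["$filter"] = f"lastRenovationDate ge {renovated_since}"
    if newest_first:
        # Override the default ordering by search score.
        params["$orderby"] = "lastRenovationDate desc"
    return urlencode(params)


query = build_search_query("garden", "2015-01-01T00:00:00Z", newest_first=True)
print(query)
print(parse_qs(query))
```

The resulting string would be appended to the documents endpoint of the index; without `$orderby`, results come back in order of search score.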
As well as custom analysers, a custom tokeniser can be created. A tokeniser defines how the input text can be split into independent tokens. For example, separating a sentence into words.
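To give a feel for what tokenising and filtering do to a piece of text, here is a simplified local approximation in Python. It is not the Lucene implementation that Azure Search runs: it splits on word characters, lower-cases, and strips accents via Unicode decomposition, and it leaves out the phonetic step entirely.

```python
import re
import unicodedata


def tokenise(text):
    """Split the input into word tokens (a rough stand-in for a standard tokeniser)."""
    return re.findall(r"\w+", text)


def ascii_fold(token):
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyse(text):
    """Tokenise, then lower-case and accent-fold each token."""
    return [ascii_fold(token.lower()) for token in tokenise(text)]


print(analyse("Spacious Hôtel, with a large garden"))
# ['spacious', 'hotel', 'with', 'a', 'large', 'garden']
```

Because the same analysis is applied to the query text, a search for “hotel” or “HÔTEL” would end up comparing the same token.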
A data source can be used (alongside an indexer) to sync data between a database and the Azure Search index. This can be done manually as a one-off job, or as a scheduled job that runs as often as every 5 minutes. When defining a data source, you are defining the connection information for your database. This connection information is used by the indexer to sync the data.
At the time of writing, there are 4 different types of data sources that can be used: “azuresql”, “documentdb” (Azure Cosmos DB), “azureblob” and “azuretable”. A more advanced feature that can be specified as part of the data source definition is the high watermark change detection policy. This policy names a column whose value increases every time a row changes, so the indexer can pick up only the rows that have changed since its last run. This can be the row version or a last updated column (such as a timestamp). Another policy that can be specified is the SQL integrated change tracking policy. This is the most efficient change detection policy but can only be used by data sources that support change tracking (e.g. Azure SQL DB V12). This policy does not require a column name, because changes are detected automatically.
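A data source definition along these lines might be put together as follows. This is a sketch under assumptions: the `build_data_source` helper, the data source name, table name and column name are hypothetical, and the connection string is a placeholder; the two `@odata.type` values are the policy identifiers the REST API uses as far as I am aware.

```python
import json

HIGH_WATER_MARK_POLICY = "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy"
SQL_INTEGRATED_POLICY = "#Microsoft.Azure.Search.SqlIntegratedChangeTrackingPolicy"


def build_data_source(name, connection_string, table, high_water_mark_column=None):
    """Build an 'azuresql' data source definition.

    If a column name is given, the high watermark policy is used; otherwise the
    definition falls back to SQL integrated change tracking, which needs no column.
    """
    if high_water_mark_column:
        policy = {
            "@odata.type": HIGH_WATER_MARK_POLICY,
            "highWaterMarkColumnName": high_water_mark_column,
        }
    else:
        policy = {"@odata.type": SQL_INTEGRATED_POLICY}
    return {
        "name": name,
        "type": "azuresql",
        "credentials": {"connectionString": connection_string},
        "container": {"name": table},  # the table or view to crawl
        "dataChangeDetectionPolicy": policy,
    }


data_source = build_data_source(
    "homes-datasource",            # hypothetical data source name
    "<your-connection-string>",    # placeholder, never commit real credentials
    "Homes",                       # hypothetical table name
    high_water_mark_column="RowVersion",
)
print(json.dumps(data_source, indent=2))
```

Calling the helper without `high_water_mark_column` produces the integrated change tracking variant instead.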
Once the data source has been defined, an indexer can be defined. The indexer will extract information from the data source by crawling through it. A schedule is added as a parameter when creating the indexer, which will tell Azure how often to run the indexer and check for changes. The shortest interval is 5 minutes. There are also some additional settings that can be stated, such as ‘batchSize’ (the number of items in a batch, which can be tweaked to improve performance), and ‘maxFailedItems’ and ‘maxFailedItemsPerBatch’ (the number of failures tolerated, which can be set to 0 for no errors allowed or -1 for an unlimited number of errors).
In case the fields in the index and fields in the data source do not match, field mappings can be defined. These field mappings can map names of fields in the data source to differently named fields in the index. Through the REST API, actions such as create, delete, update and list can be performed on indexers and data sources. You can also check on the status of the indexer, to view information on any failures that occurred during indexing.
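The indexer settings described above could be assembled like this. It is a sketch rather than a definitive definition: the indexer, data source, index and source column names are hypothetical, and `schedule_interval` is a helper written for this example that formats the interval as an ISO 8601 duration, which is the format the schedule expects as far as I know.

```python
import json


def schedule_interval(minutes):
    """Format an interval in minutes as an ISO 8601 duration such as 'PT5M'."""
    if minutes < 5:
        raise ValueError("The indexer cannot be scheduled more often than every 5 minutes")
    days, remainder = divmod(minutes, 24 * 60)
    hours, mins = divmod(remainder, 60)
    date_part = f"{days}D" if days else ""
    time_part = (f"{hours}H" if hours else "") + (f"{mins}M" if mins else "")
    return "P" + date_part + ("T" + time_part if time_part else "")


indexer_definition = {
    "name": "homes-indexer",                # hypothetical indexer name
    "dataSourceName": "homes-datasource",   # hypothetical data source name
    "targetIndexName": "homes",             # hypothetical index name
    "schedule": {"interval": schedule_interval(5)},
    "parameters": {
        "batchSize": 500,             # items per batch, tune for performance
        "maxFailedItems": 0,          # 0 = no failures allowed, -1 = unlimited
        "maxFailedItemsPerBatch": 0,
    },
    # Map a differently named source column onto the index's key field.
    "fieldMappings": [
        {"sourceFieldName": "HouseIdentifier", "targetFieldName": "houseId"}
    ],
}
print(json.dumps(indexer_definition, indent=2))
```

Keeping the 5-minute check in a helper means an invalid schedule fails locally before the definition is ever sent to the service.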
Next up I will describe the basics of how to use Azure Search through the REST API in my ‘Using Azure Search‘ blog.