Implementing a search engine with elasticsearch and Symfony (part 2)

Published on 2019-10-28 • Modified on 2019-10-28

This is the second part of the tutorial. In this post, we will see how to improve our search engine to make it much more relevant. We will use an alias and create a custom provider to populate the index. We will tune the search by applying boosts to some fields, and finally we will paginate the result list. Let's go! 😎

» Published in "A week of Symfony 670" (28 October - 3 November 2019).

Tutorial

This post is the second part of the tutorial "Implementing a search engine with elasticsearch and Symfony".

Prerequisites

The prerequisites are the same as for the first part. It is of course recommended to read it before continuing with this one.

Configuration

  • PHP 7.2
  • Symfony 4.4
  • elasticsearch 6.8

Code review and debugging

First, we will check the code and configuration we implemented in the previous post. It's a good habit, before developing something new, to check what was done previously and see whether anything could be improved or cleaned up before adding code. If you look at the configuration we added, you surely noticed that we put the host and the port of the elasticsearch server directly in the fos_elastica.yaml file. That's a bad practice! So let's move these parameters into the .env file. Add the following lines to it:

ES_HOST=localhost
ES_PORT=9209

Then we must expose these environment variables as application parameters. Add the following two lines to your config/services.yaml file:

# config/posts/48.yaml (imported in config/services.yaml)
parameters:
  es_host: '%env(ES_HOST)%'
  es_port: '%env(ES_PORT)%'

Finally, we can use these two new parameters in the fos_elastica configuration file. Modify the default client line to use them (we could also use %env()% directly in this file):

# config/packages/fos_elastica.yaml
fos_elastica:
    clients:
        default: { host: '%es_host%', port: '%es_port%' }

That's a first step. Now, let's see if we can fix a bug, as there is a very annoying one. Until now, when doing the search, we were passing the raw user input to the findHybrid() method (see the first part). The problem is that some characters, like [ and ], are reserved by the elasticsearch query syntax, and using one of them leads to a 500 error:

Failed to parse query [doctrine[]] [index: app] [reason: all shards failed]

To fix this, we will use the escaping function provided by the elastica library: Util::escapeTerm()

We can call this function in the search controller to avoid the error:

$results = !empty($q) ? $articlesFinder->findHybrid(Util::escapeTerm($q)) : [];
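
To get a feel for what the escaping does, here is a minimal, self-contained sketch. The escapeQueryStringTerm() function and its character list are assumptions made for this illustration, modelled on the reserved characters of the query string syntax. It is not the Elastica implementation, and in the application you should keep calling Util::escapeTerm().

```php
<?php declare(strict_types=1);

/**
 * Hypothetical stand-in for Util::escapeTerm(): prefixes each reserved
 * query-string character with a backslash.
 */
function escapeQueryStringTerm(string $term): string
{
    // The backslash comes first so that the backslashes added for the
    // other characters are not escaped a second time.
    $reserved = ['\\', '+', '-', '&&', '||', '!', '(', ')', '{', '}', '[', ']', '^', '"', '~', '*', '?', ':', '/'];
    foreach ($reserved as $char) {
        $term = str_replace($char, '\\'.$char, $term);
    }

    return $term;
}

echo escapeQueryStringTerm('doctrine[]'), PHP_EOL; // doctrine\[\]
```

With the brackets escaped, the input that triggered the parse error above is treated as plain text instead of query syntax.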

Now that we have cleaned up and fixed what we did in the previous post, let's see how to improve the elasticsearch environment.

Using an elasticsearch alias

Until now we were directly using the main index to populate data. But what happens if the mapping changes, if fields are added or removed? It could be dangerous... Using an alias allows us to avoid downtime and errors, as the index switch only occurs once all the data has been re-indexed. This is especially valuable if you have a large number of documents that take time to reindex. First, delete the current index; otherwise we won't be able to create the alias, as the "app" name would already exist. You can do this with cURL (you can also do this with the head manager: actions -> delete...):

curl -i -X DELETE 'http://localhost:9209/app'
{"acknowledged":true}

Let's add the alias option "use_alias: true" in the fos_elastica config:

# config/packages/fos_elastica.yaml
fos_elastica:
    clients:
        default: { host: '%es_host%', port: '%es_port%' }
    indexes:
        app:
            use_alias: true
            types:
                articles:
                    # ...

Now launch the fos:elastica:populate command. This time we can see that the created index is no longer named "app": a timestamp suffix has been added. Moreover, an alias has automatically been associated with this index, and it is this alias that now has the "app" name. At this point, your elasticsearch cluster should look like this:

[Figure: the index now has an alias]

Launch the populate command again, but this time with the --no-delete option. You will see that there are now two indexes and that the alias points to the most recent one. The alias handling is done automatically by the bundle, so you don't have to do it manually. The old index has been "closed": the data is still there, but the index can't be accessed anymore and read/write operations are blocked. The cluster now looks like this:

[Figure: the oldest index has been closed]

Imagine you have a critical bug and the new index is the cause. You could open the old index, remove the alias from the new index and then assign it to the old one to make your application work again. That's all for the alias part. Now let's see how to add a custom provider to index data.
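
The rollback described above could be sketched as follows. This only builds the two requests (the open index call and the _aliases call of the elasticsearch REST API) without sending them. The buildAliasRollbackRequests() function and the index names are hypothetical; use the real names shown by your cluster.

```php
<?php declare(strict_types=1);

/**
 * Builds the two HTTP requests needed to point an alias back at an older,
 * closed index. Nothing is sent: the caller decides how to execute them.
 *
 * @return array<int, array<string, string|null>>
 */
function buildAliasRollbackRequests(string $alias, string $currentIndex, string $previousIndex): array
{
    // Both alias actions go in a single _aliases call so the switch is atomic.
    $actions = ['actions' => [
        ['remove' => ['index' => $currentIndex, 'alias' => $alias]],
        ['add' => ['index' => $previousIndex, 'alias' => $alias]],
    ]];

    return [
        // 1. Reopen the closed index so that it can serve requests again.
        ['method' => 'POST', 'path' => '/'.$previousIndex.'/_open', 'body' => null],
        // 2. Move the alias from the faulty index to the reopened one.
        ['method' => 'POST', 'path' => '/_aliases', 'body' => json_encode($actions)],
    ];
}

// The index names below are made up for the example.
foreach (buildAliasRollbackRequests('app', 'app_new', 'app_old') as $request) {
    echo $request['method'], ' ', $request['path'], ' ', $request['body'] ?? '', PHP_EOL;
}
```

Each printed line can then be replayed with cURL against the elasticsearch server, in the same way as the DELETE call used earlier.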

Creating a custom provider

As we saw in the first part, until now we were not indexing all the data, because most of the texts you can read on this blog are stored in translation files, not in the database. So, let's see how to index these texts and make the search much more relevant. We need to create a custom data provider. This service needs to access the database through the ORM and the translation files through the i18n component. Compared to the standard provider, ours does two things differently: it ignores the inactive articles and it fetches the i18n content of the articles. Here it is:

<?php declare(strict_types=1);

// src/Elasticsearch/Provider/ArticleProvider.php

namespace App\Elasticsearch\Provider;

use App\Entity\ArticleRepository;
use Doctrine\Common\Collections\ArrayCollection;
use FOS\ElasticaBundle\Provider\PagerProviderInterface;
use FOS\ElasticaBundle\Provider\PagerfantaPager;
use Pagerfanta\Adapter\DoctrineCollectionAdapter;
use Pagerfanta\Pagerfanta;
use Symfony\Component\Yaml\Yaml;
use Symfony\Contracts\Translation\TranslatorInterface;

class ArticleProvider implements PagerProviderInterface
{
    private $articleRepository;
    private $translation;

    public function __construct(ArticleRepository $articleRepository, TranslatorInterface $translation)
    {
        $this->articleRepository = $articleRepository;
        $this->translation = $translation;
    }

    public function provide(array $options = array())
    {
        $articles = $this->articleRepository->findActive();
        foreach ($articles as $article) {
            $domain = $article->isArticle() ? 'blog' : 'snippet';
            foreach (['En', 'Fr'] as $locale) {
                // keywords
                $fct = 'setKeyword'.$locale;
                $keywords = [];
                foreach (explode(',', $article->getKeyword() ?? '') as $keyword) {
                    $keywords[] = $this->translation->trans($keyword, [], 'breadcrumbs', strtolower($locale));
                }
                $article->$fct(implode(',', $keywords));

                // title
                $fct = 'setTitle'.$locale;
                $article->$fct($this->translation->trans('title_'.$article->getId(), [], $domain, strtolower($locale)));

                // headline
                $fct = 'setHeadline'.$locale;
                $headlineKey = $article->isArticle() ? 'headline' : 'intro'; // @fixme should be the same key
                $article->$fct($this->translation->trans($headlineKey.'_'.$article->getId(), [], $domain, strtolower($locale)));

                // Only articles have their full content stored in dedicated i18n files
                if ($article->isArticle()) {
                    $i18nFile = 'post_'.$article->getId().'.'.strtolower($locale).'.yaml';
                    $file = dirname(__DIR__, 3).'/translations/blog/'.$i18nFile;
                    $translations = Yaml::parse((string) file_get_contents($file));
                    $translations = array_map('strip_tags', $translations); // tags are useless, only keep texts
                    $translations = array_map('html_entity_decode', $translations);
                    $fct = 'setContent'.$locale;
                    $article->$fct(implode(' ', $translations));
                }
            }
        }

        return new PagerfantaPager(new Pagerfanta(new DoctrineCollectionAdapter(new ArrayCollection($articles))));
    }
}
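
The content-flattening step of the provider above can be tried in isolation. Here is a small sketch, assuming the translation file parses to a flat key => string map. The flattenTranslations() function name and the sample entries are made up for the illustration.

```php
<?php declare(strict_types=1);

/**
 * Turns a flat key => HTML translation map into one plain-text string,
 * following the same steps as the provider: strip tags, decode entities, join.
 *
 * @param array<string, string> $translations
 */
function flattenTranslations(array $translations): string
{
    $texts = array_map('strip_tags', $translations);
    $texts = array_map('html_entity_decode', $texts);

    return implode(' ', $texts);
}

// Made-up translation entries, only there to show the transformation.
$sample = [
    'intro' => '<p>Elasticsearch &amp; Symfony</p>',
    'step_1' => 'Use an <b>alias</b>.',
];

echo flattenTranslations($sample), PHP_EOL; // Elasticsearch & Symfony Use an alias.
```

Only the text is kept, which is what we want in the index: the markup would add noise without helping the search.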

Some explanations: we have created a new findActive function in the article repository in order to retrieve only the active articles. Then, for each one, depending on its type, we get the translations, which are stored in the snippet.XX.yaml files for the snippets and in the post_ID.XX.yaml files for the blog posts (all snippet texts are in the same translation file, whereas there is one file per blog post). The virtual properties are populated, then the document is ready to be indexed. At the end we pass a Doctrine collection to the DoctrineCollectionAdapter. We are now ready to use the new data provider. After running the populate command, in my case there is one fewer article indexed (47 instead of 48), because there is an inactive article which I use in the functional tests to verify that an inactive article neither appears in the list nor can be accessed on the show page. Let's check the new elasticsearch documents:

[Figure: the elasticsearch document with the new fields]

As you can see, there is more data than previously and the new fields are correctly indexed. We can verify that the *En fields contain the English translations while the *Fr ones contain the French ones. Now we can test the search. For example, we can use the keyword "interface", which isn't in the keywords or the titles of any article or snippet. It works: several articles match, including this one and the first part of this tutorial. This text is indeed present in the post content. You can also search for foobar, which is only present in this post.

Tuning the search relevance

Now that we have all the data we need, we can do some tuning. Let's try to add some boosts. We can rank the fields we have by importance. The most important ones seem to be the keywords, whereas the least important is the content, as it contains a lot of text. The default boost value is 1, so let's keep it for the content fields and add some boost to the title, headline and keyword fields like this:

# config/packages/fos_elastica.yaml
fos_elastica:
    clients:
        default: { host: '%es_host%', port: '%es_port%' }
    indexes:
        app:
            use_alias: true
            types:
                articles:
                    properties:
                        type: ~
                        keywordFr: { boost: 4 }
                        keywordEn: { boost: 4 }
                        # i18n
                        titleEn: { boost: 3 }
                        titleFr: { boost: 3 }
                        headlineEn: { boost: 2 }
                        headlineFr: { boost: 2 }
                        contentEn: ~ # The default boost value is 1
                        contentFr: ~
                    persistence:
                        driver: orm
                        model: App\Entity\Article
                        provider:
                            service: App\Elasticsearch\Provider\ArticleProvider
                        listener:
                            insert: false
                            update: false
                            delete: false
        suggest:
            use_alias: true
            settings:
                index:
                    analysis:
                        analyzer:
                            suggest_analyzer:
                                type: custom
                                tokenizer: standard
                                filter: [lowercase, asciifolding]
            types:
                keyword:
                    properties:
                        locale:
                            type: keyword
                        suggest:
                            type: completion
                            analyzer: suggest_analyzer
                            contexts:
                                - name: locale
                                  type: category
                                  path: locale

But which boosts should be used? Well, there is no easy answer to this. It depends on the search you want to implement and on the data contained in the different fields. You will have to run a lot of tests to get something that fits your needs and your users.

“Improving relevance is hard, it really is.” - elasticsearch's blog
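
To reason about boost values before testing them for real, a toy model can help. The sketch below simply multiplies a made-up raw score per field by its boost. It is a deliberate simplification, not the actual elasticsearch scoring: how the per-field scores are then combined depends on the query type and is not modelled here. The applyBoosts() function and the numbers are assumptions for the illustration.

```php
<?php declare(strict_types=1);

/**
 * Toy model: multiplies each field's raw score by its boost (default 1.0)
 * and sorts the fields from the strongest to the weakest contribution.
 *
 * @param array<string, float> $rawScores score of the query against each field
 * @param array<string, float> $boosts    boost per field, as in the mapping
 *
 * @return array<string, float>
 */
function applyBoosts(array $rawScores, array $boosts): array
{
    $boosted = [];
    foreach ($rawScores as $field => $score) {
        $boosted[$field] = $score * ($boosts[$field] ?? 1.0);
    }
    arsort($boosted); // highest contribution first, keys preserved

    return $boosted;
}

// Made-up raw scores: the content matches better than the keyword...
$rawScores = ['contentEn' => 1.2, 'keywordEn' => 0.5];
// ...but the boosts from the configuration reverse the order.
$boosts = ['keywordEn' => 4.0, 'titleEn' => 3.0, 'headlineEn' => 2.0];

print_r(applyBoosts($rawScores, $boosts));
```

In this example a weak keyword match ends up ahead of a stronger content match, which is the effect we are after when we declare that keywords matter most.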

Adding the pagination

Now let's see how to add the pagination so that we can access all the articles matching the keywords. The FOSElastica bundle already provides some methods for this. We will use createHybridPaginatorAdapter, which returns an adapter ready to be used with the Pagerfanta library. We will also install the PagerfantaBundle, as it provides several helpers to display the pager. Here is the code of the new controller:

<?php declare(strict_types=1);

// src/Controller/SearchPart2Controller.php

namespace App\Controller;

use Elastica\Util;
use FOS\ElasticaBundle\Finder\TransformedFinder;
use FOS\ElasticaBundle\Paginator\FantaPaginatorAdapter;
use Pagerfanta\Pagerfanta;
use Symfony\Bundle\FrameworkBundle\Controller\AbstractController;
use Symfony\Component\HttpFoundation\Request;
use Symfony\Component\HttpFoundation\Response;
use Symfony\Component\HttpFoundation\Session\SessionInterface;
use Symfony\Component\Routing\Annotation\Route;

/**
 * You know, for search.
 *
 * @Route("/{_locale}", name="search_", requirements={"_locale"="%locales_requirements%"})
 */
class SearchPart2Controller extends AbstractController
{
    /**
     * @Route({"en": "/search", "fr": "/recherche"}, name="main")
     */
    public function search(Request $request, SessionInterface $session, TransformedFinder $articlesFinder): Response
    {
        $q = (string) $request->query->get('q', '');
        $pagination = $this->findHybridPaginated($articlesFinder, Util::escapeTerm($q));
        $pagination->setCurrentPage($request->query->getInt('page', 1));
        $session->set('q', $q);

        return $this->render('search/search.html.twig', compact('pagination', 'q'));
    }

    /**
     * I made a PR to have this function in the bundle.
     *
     * @see https://github.com/FriendsOfSymfony/FOSElasticaBundle/pull/1567/files
     */
    private function findHybridPaginated(TransformedFinder $articlesFinder, string $query): Pagerfanta
    {
        $paginatorAdapter = $articlesFinder->createHybridPaginatorAdapter($query);

        return new Pagerfanta(new FantaPaginatorAdapter($paginatorAdapter));
    }
}

We replaced the findHybrid function by a new findHybridPaginated function that returns a pager instead of a result set. Then we set the active page from the page query parameter (try to access a page that doesn't exist to see what happens... 🤔). The list template didn't change much: we iterate over the pagination object instead of the results variable. The total number of results is now retrieved with pagination.nbResults, and finally we display the pager only if there is more than one page (pagination.haveToPaginate). You can give it a try with the search form. For example, the "symfony" keyword returns more than one page, so we can see the pager in action. Here is the new list template:

{% extends 'layout.html.twig' %}

{# templates/search/search.html.twig #}

{% trans_default_domain 'search' %}

{% set esArticles = article_es() %} {# Don't do this! This is to avoid polluting the SearchController #}

{% block content %}
    <div class="col-md-12">
        <div class="card">
            <div class="card-header card-header-primary">
                <p class="h3">{{ 'your_search_for'|trans}} <b>"{{ q }}"</b>, <b>{{ pagination.nbResults }}</b> {{ 'results'|trans}}.</p>
            </div>
            <div class="card-body">
                <p class="h4">&raquo; {{ 'get_back'|trans}} "<a href="{{ path('blog_show', {'slug': esArticles[1].slug|a_slug(locale), 'q': q}) }}#search_form">{{ ('title_'~esArticles[1].id)|trans({}, 'blog') }}</a>"</p>
                <p class="h4">&raquo; {{ 'get_back'|trans}} "<a href="{{ path('blog_show', {'slug': esArticles[2].slug|a_slug(locale), 'q': q}) }}#search_form">{{ ('title_'~esArticles[2].id)|trans({}, 'blog') }}</a>"</p>
                <p class="h4">&raquo; {{ 'get_back'|trans}} "<a href="{{ path('blog_show', {'slug': esArticles[3].slug|a_slug(locale), 'q': q}) }}#search_form">{{ ('title_'~esArticles[3].id)|trans({}, 'blog') }}</a>"</p>
            </div>
        </div>
    </div>

    {% for result in pagination %}
        {% set hit = result.result.hit %}
        {% set article = result.transformed %}
        {% if article.isArticle %}
            {% set tag_route = 'blog_list_tag' %}
            {% set pathEn = path('blog_show', {'_locale': 'en','slug': article.slug|a_slug('en')}) %}
            {% set pathFr = path('blog_show', {'_locale': 'fr','slug': article.slug|a_slug('fr')}) %}
            {% set title = ('title_'~article.id)|trans({}, 'blog') %}
        {% else %}
            {% set tag_route = 'snippet_list_tag' %}
            {% set pathEn = path('snippet_show', {'_locale': 'en', 'slug': article.slug|s_slug('en') }) %}
            {% set pathFr = path('snippet_show', {'_locale': 'fr', 'slug': article.slug|s_slug('fr') }) %}
            {% set title = ('title_'~article.id)|trans({}, 'snippet') %}
        {% endif %}
        <div class="card">
            <div class="card-header">
                <h2 class="h3">
                    [{{ ('type_'~article.type.id)|trans({}, 'blog') }}] {{ title }} &raquo; {{ 'score'|trans }} <b>{{ hit._score }}</b>
                </h2>
            </div>

            <div class="card-body">
                <div class="blog-tags">
                    {% for tag in article.keywords %}<a class="badge badge-{{ random_class() }}" href="{{ path(tag_route, {'tag': tag}) }}"><i class="far fa-tag"></i> &nbsp;{{ tag|trans({}, 'breadcrumbs') }}</a> {% endfor %}
                </div>
                <br/>
                <p class="card-text text-center">
                    <a href="{{ pathEn }}" class="btn btn-primary card-link">🇬🇧 {{ 'read_in_english'|trans({}, 'blog') }}</a>
                    <a href="{{ pathFr }}" class="btn btn-primary card-link">🇫🇷 {{ 'read_in_french'|trans({}, 'blog') }}</a>
                </p>
            </div>
        </div>
    {% endfor %}
    <div class="col-md-12">
        {% if 0 == pagination.nbResults %}
            <p class="h3">{{ 'no_results'|trans }}</p>
        {% endif %}
    </div>

    <div class="col-md-12">
        {% include 'search/_form.html.twig' %}
    </div>
{% endblock %}

{% block pagination %}
    {% if pagination.haveToPaginate %}
        {{ pagerfanta(pagination, { omitFirstPage: true, css_container_class: 'pagination justify-content-center'}) }}
    {% endif %}
{% endblock %}

{% block javascripts %}
   {{ parent() }}
   {% include 'blog/posts/_51_js.html.twig' %}
{% endblock %}
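
The arithmetic behind the pager is simple enough to sketch on its own. The functions below are hypothetical helpers, not part of Pagerfanta. They show how the number of pages is derived, when a pager is worth displaying, and one possible way to bring an out-of-range page number back into the valid range before handing it to the pager.

```php
<?php declare(strict_types=1);

/** Number of pages needed for the results; there is always at least one page. */
function countPages(int $nbResults, int $maxPerPage): int
{
    return max(1, (int) ceil($nbResults / $maxPerPage));
}

/** The pager is only worth displaying when the results do not fit on one page. */
function haveToPaginate(int $nbResults, int $maxPerPage): bool
{
    return $nbResults > $maxPerPage;
}

/** Brings the requested page back into the [1, last page] range. */
function clampPage(int $requestedPage, int $nbResults, int $maxPerPage): int
{
    return min(max(1, $requestedPage), countPages($nbResults, $maxPerPage));
}

// 47 results with 10 results per page (both values are only examples).
echo countPages(47, 10), PHP_EOL;    // 5
echo clampPage(12, 47, 10), PHP_EOL; // 5
echo clampPage(0, 47, 10), PHP_EOL;  // 1
var_dump(haveToPaginate(47, 10));    // bool(true)
```

Clamping is only one option: depending on the application, answering with a 404 for a page that doesn't exist can be a better choice, as it avoids serving the same content under several URLs.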

That's it for the pagination part. We can verify that the first article of the second page has a lower score than the last article of the first page. We are done with the second part of this tutorial. There are at least two more interesting things I'd like to add, but we will see them in the third and last part of this elasticsearch tutorial.

That's it! I hope you like it. Check out the links below to have additional information related to the post. As always, feedback, likes and retweets are welcome. (see the box below) See you! COil. 😊


They gave feedback and helped me to fix errors and typos in this article, many thanks to Nico.F (Slack Symfony), jmsche. 😊


