Generating Google search snippets using BERT-QA
A simple solution using a pre-trained BERT-QA model to return an HTML snippet

Practically speaking, when it comes to using AI in production, not everything is about training. With the recent advances in AI, especially in NLP, it is much easier now to use pre-trained models to build a baseline for your project. However, you won’t always find exactly what you need off the shelf; you will have to craft something that works for your case.
In my case, I wanted to use a question-answering model with HTML content and generate an HTML card/snippet from the predicted answer, similar to Google's featured search snippets.
The easiest solution here is to use a pre-trained BERT model (the SQuAD version), but it suffers from two problems:
- This BERT version can only predict short answers.
- It only deals with normal text, not HTML.
While there are already some solutions and new papers addressing these problems, they are not publicly available, and the lack of training data remains a big blocker for any such project.
Solution:
Since I have the HTML source of the text, I can access its structure and retrieve the full chunk that holds the short answer.
Suppose you have a result like this:

Here the model predicted only one item as the answer, which is not the full answer we want. But if we look at the HTML code for this answer, we find that all the items share the same <ul> parent tag, so if we can retrieve that tag, the problem is solved!

Now, how to do that:
- First, we extract the text from the HTML context using BeautifulSoup or any other tool; I used this function with a little modification.
- Then we pass this text, together with the question, to the model and get its prediction.
- Now we need to find the HTML tag that holds this prediction. This can be done by parsing the HTML as a tree of elements and simply traversing the tree looking for the branch that holds the answer, then returning that branch.
Here is the code for those two steps:
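As a rough sketch of the traversal step, assuming a small, well-formed snippet (I use the standard library's xml.etree here for brevity; messy real-world pages would need a forgiving parser such as BeautifulSoup, which the post actually uses). The `find_answer_branch` name and the sample HTML are illustrative, not the post's original code:

```python
import xml.etree.ElementTree as ET

def find_answer_branch(html, answer):
    """Find the deepest tag containing the model's short answer,
    then climb one level so sibling items (e.g. the other <li>
    entries of a list) come along with it."""
    root = ET.fromstring(html)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}

    deepest = None
    for element in root.iter():  # document order: ancestors before descendants
        if answer in "".join(element.itertext()):
            deepest = element    # keep overwriting -> ends at the deepest match
    if deepest is None:
        return None
    branch = parents.get(deepest, deepest)  # climb to the parent if any
    return ET.tostring(branch, encoding="unicode")

context = ("<div><p>How to bake a cake:</p><ul>"
           "<li>Preheat the oven</li>"
           "<li>Mix the batter</li>"
           "<li>Bake for 30 minutes</li></ul>"
           "</div>")

# Pretend the QA model predicted only one list item as the short answer:
print(find_answer_branch(context, "Mix the batter"))
# -> the whole <ul>, not just the matching <li>
```

Climbing exactly one level is a simplification; the idea is the same either way: return the branch that holds the full chunk, not just the leaf where the short answer sits.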
We could stop here and use the HTML result. However, websites have very different structures, and there is no guarantee that one function will work on all of them. To get around this, we post-process the result and simplify it.
In my situation, I didn’t need the styles, only the document structure (lists, titles, etc.), so I used the html2text library to convert the HTML to Markdown and then parsed the returned structure. The images in the HTML snippet can also be returned along with the text snippet if needed.
All the code for this project with details can be found here.
With that, you have a simple, low-resource baseline for building a QA model that works with HTML.
Lastly, this is more or less about how you can craft a fast and effective solution with what you have, without waiting days for data and training, by taking advantage of the pool of pre-trained models that exists today.
I hope this was short and useful for some of you. Thanks!