The engines will keep failing. More checking at runtime can help detect errors, but it won't solve the underlying issue: we lack the time and hands to review all the changes.
Searx online IDE
Currently, to update an engine you have to: `git clone` searx, install the dependencies, edit some files, `make run`, `git commit`, create a pull request. The review is similar: get the code, check the different options, approve or comment.
What if we could make it simpler for new contributors? Something like editing a Jupyter notebook?
More precisely, the idea is to have the GitHub "edit this file" button on steroids:
- it starts an online IDE with autocompletion.
- lxml and the other dependencies are available.
- there is a "run" button which allows trying the code (with all the parameters).
- a save button creates a PR in the git repository.
- like GitHub comments, the code is automatically saved in the browser.
The JSON and XPath engines won't even need an online IDE: just some text inputs and a "try" button (see the sketch below). Later, a tool similar to wuzz could help, for example.
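To make that concrete, here is a minimal sketch of what the "try" button could do for an XPath engine, assuming httpx and lxml as the HTTP client and parser; the URL and selectors are placeholders, not a real site's markup:

```python
from urllib.parse import quote_plus

import httpx
from lxml import html

# Placeholder values a "try" form could collect for an XPath engine;
# the search_url and selectors are hypothetical.
form = {
    "search_url": "https://example.org/search?q={query}",
    "url_xpath": "//a[@class='result']/@href",
    "title_xpath": "//a[@class='result']/text()",
    "content_xpath": "//p[@class='snippet']/text()",
}


def try_engine(query: str) -> list[dict]:
    """Fetch one result page and apply the configured XPath expressions."""
    resp = httpx.get(form["search_url"].format(query=quote_plus(query)))
    dom = html.fromstring(resp.text)
    urls = dom.xpath(form["url_xpath"])
    titles = dom.xpath(form["title_xpath"])
    contents = dom.xpath(form["content_xpath"])
    return [
        {"url": url, "title": title, "content": content}
        for url, title, content in zip(urls, titles, contents)
    ]
```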
Technically:
- Python can run inside a browser using WASM, but it may not work as expected. So a Python interpreter inside a container would do the job:
  - here we have to be very careful about security: https://jupyter-notebook.readthedocs.io/en/stable/security.html
  - the sandbox's IP addresses will be blocked by some engines.
  - there is a running cost.
- The idea here is to provide an online sandbox to try searx engines online (see the sketch after this list). To avoid abuse, GitHub OAuth authentication would be required (so there is no need for an additional account).
- The IDE: a fork of Jupyter, or Monaco with a language server?
- The result can be saved as a PR in the git repository.
- Some engines reuse code from other engines, for example the google engine and duckduckgo_images.
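A rough sketch of what the sandbox's "run" button could execute, assuming the usual searx engine interface (request(query, params) fills params['url'] and friends, response(resp) returns the result list) and httpx as the HTTP client; the container and security aspects discussed above are out of scope here:

```python
import importlib.util

import httpx


def run_engine(engine_path: str, query: str) -> list[dict]:
    """Load an edited engine file and run a single query through it."""
    # Import the candidate engine from the file the user just edited.
    spec = importlib.util.spec_from_file_location("candidate_engine", engine_path)
    engine = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(engine)

    # The exact set of keys is an assumption; real engines may expect more
    # (cookies, time_range, ...), and some read attributes such as
    # resp.search_params that would need a thin wrapper around the response.
    params = {"url": "", "method": "GET", "headers": {}, "data": {}, "pageno": 1}
    engine.request(query, params)

    resp = httpx.request(params["method"], params["url"],
                         headers=params["headers"],
                         data=params["data"] or None)
    return engine.response(resp)
```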
Searx engine database
Looking at the issue "What is the data lifecycle? #2052", there is another way to deal with this.
Why not make something similar to Wikidata but only for searx: one entry per engine (and entries for currencies, DOI). The code could even be one field of an entry.
From the engine developer point of view:
- a "page" per engine (similar to https://www.wikidata.org/wiki/Q12805 for example, but only with the Searx revelant data).
- there are different fields, like engine name, bang shortcut name, categories, etc...
- the code is one the field. An online IDE allows to edit the code and try it online.
- save
Someone reviews the change request and approves or denies it.
Authentication can use GitHub OAuth.
The purpose here is to simplify the maintenance/edit process as much as possible.
Disclaimers. I'm aware that:
- Even if a framework like Django can help a lot, it is still a lot of work.
- The 10 "write/update engine" tasks may turn into more than 10 "review" tasks.
Database model
Class Engine:
- official URL
- bang URL
- autocomplete URL
- Category / Subcategory
- Short name
- Engine name
- favicon.ico
- engine definition
Class Engine definition:
- version
- default timeout
- languages (equivalent of searx/data/engines_languages.json)
Class Engine XPath(Engine definition):
- search_url
- url_xpath
- title_xpath
- content_xpath
Class Engine JSON(Engine definition):
- search_url
- url_query
- content_query
- title_query
- paging
- suggestion_query
- results_query
Class Engine Python(Engine definition):
- code
Class Engine DSL(Engine definition):
- code
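The model above, transcribed as Python dataclasses to make the fields concrete; the field names follow the lists above, the types and default values are guesses:

```python
from dataclasses import dataclass, field


@dataclass
class EngineDefinition:
    version: int = 1
    default_timeout: float = 3.0
    # equivalent of searx/data/engines_languages.json
    languages: list[str] = field(default_factory=list)


@dataclass
class XPathEngineDefinition(EngineDefinition):
    search_url: str = ""
    url_xpath: str = ""
    title_xpath: str = ""
    content_xpath: str = ""


@dataclass
class JSONEngineDefinition(EngineDefinition):
    search_url: str = ""
    url_query: str = ""
    content_query: str = ""
    title_query: str = ""
    paging: bool = False
    suggestion_query: str = ""
    results_query: str = ""


@dataclass
class PythonEngineDefinition(EngineDefinition):
    code: str = ""


@dataclass
class DSLEngineDefinition(EngineDefinition):
    code: str = ""


@dataclass
class Engine:
    official_url: str
    bang_url: str
    autocomplete_url: str
    category: str
    subcategory: str
    short_name: str
    engine_name: str
    favicon: bytes
    definition: EngineDefinition
```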
Database storage
The storage can be a public git repository, so backups come for free.
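A minimal sketch of that storage: one JSON file per engine, committed on every change, so the public git history doubles as the backup (the engines/&lt;name&gt;.json layout is an assumption, not an existing convention):

```python
import json
import subprocess
from pathlib import Path


def save_engine_entry(repo: Path, name: str, entry: dict) -> None:
    """Write one engine entry as a JSON file and record the change in git."""
    path = repo / "engines" / f"{name}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(entry, indent=2, sort_keys=True))
    subprocess.run(["git", "add", str(path)], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-m", f"Update engine: {name}"],
                   cwd=repo, check=True)
```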
Scenario
Scenario - A user adds a new engine
A user adds a search engine by providing:
- URL (mandatory)
- Category (and Subcategory, as in DuckDuckGo) (optional)
- Searx name & shortcut (optional)
A backend then looks for (see the sketch after this scenario):
- opensearch.xml or a sitelinks search box --> free external bang update/addition
- favicon.ico
- web site name
It then adds to the engine definition:
- search URL
- autocomplete URL
- favicon
- name
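A sketch of the discovery step the backend would run on the submitted URL, again assuming httpx and lxml; error handling and the sitelinks search box case are left out:

```python
from urllib.parse import urljoin

import httpx
from lxml import html


def discover_site(url: str) -> dict:
    """Fetch the submitted URL and extract OpenSearch, favicon and site name."""
    resp = httpx.get(url, follow_redirects=True)
    dom = html.fromstring(resp.text)

    opensearch = dom.xpath(
        '//link[@rel="search" and @type="application/opensearchdescription+xml"]/@href'
    )
    icon = dom.xpath('//link[contains(@rel, "icon")]/@href')
    title = dom.xpath('//title/text()')

    return {
        "opensearch_url": urljoin(url, opensearch[0]) if opensearch else None,
        # fall back to the conventional /favicon.ico location
        "favicon_url": urljoin(url, icon[0]) if icon else urljoin(url, "/favicon.ico"),
        "site_name": title[0].strip() if title else None,
    }
```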