searx/searx/engines/google_images.py

"""
 Google (Images)

 @website     https://www.google.com
 @provide-api yes (https://developers.google.com/custom-search/)

 @using-api   no
 @results     HTML
 @stable      no
 @parse       url, title, img_src, thumbnail_src
"""

from datetime import date, timedelta
from lxml import html
from searx.url_utils import urlencode, urlparse, parse_qs


# engine dependent config
categories = ['images']
paging = True
safesearch = True
time_range_support = True
number_of_results = 100

search_url = 'https://www.google.com/search'\
    '?{query}'\
    '&tbm=isch'\
    '&gbv=1'\
    '&sa=G'\
    '&{search_options}'
time_range_attr = "qdr:{range}"
time_range_custom_attr = "cdr:1,cd_min:{start},cd_max:{end}"
time_range_dict = {'day': 'd',
                   'week': 'w',
                   'month': 'm'}


# do search-request
def request(query, params):
    search_options = {
        'start': (params['pageno'] - 1) * number_of_results
    }

    if params['time_range'] in time_range_dict:
        search_options['tbs'] = time_range_attr.format(range=time_range_dict[params['time_range']])
    elif params['time_range'] == 'year':
        now = date.today()
        then = now - timedelta(days=365)
        start = then.strftime('%m/%d/%Y')
        end = now.strftime('%m/%d/%Y')
        search_options['tbs'] = time_range_custom_attr.format(start=start, end=end)

    if safesearch and params['safesearch']:
        search_options['safe'] = 'active'

    params['url'] = search_url.format(query=urlencode({'q': query}),
                                      search_options=urlencode(search_options))

    return params


# get response from search-request
def response(resp):
    dom = html.fromstring(resp.text)

    results = []
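    for element in dom.xpath('//div[@id="search"] //td'):
        # skip table cells that do not wrap a result link
        links = element.xpath('./a')
        if not links:
            continue
        link = links[0]

        # the link is a Google redirect; the source page is in its `q` parameter
        href = next(iter(link.xpath('.//@href')), None)
        if href is None:
            continue
        google_url = urlparse(href)
        query = parse_qs(google_url.query)
        source_url = next(iter(query.get('q', [])), None)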

        title_parts = element.xpath('./cite//following-sibling::*/text()')
        title_parts.extend(element.xpath('./cite//following-sibling::text()')[:-1])

        result = {
            'title': ''.join(title_parts),
            'content': '',
            'template': 'images.html',
            'url': source_url,
            'img_src': source_url,
            'thumbnail_src': next(iter(link.xpath('.//img //@src')), None)
        }

        if not source_url or not result['thumbnail_src']:
            continue

        results.append(result)
    return results
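
The two less obvious steps in this engine are the custom date range used for the "year" time range and the extraction of the source URL from Google's redirect link. The following is a minimal standalone sketch of both, not part of the engine itself. It assumes Python 3 and uses `urllib.parse` directly instead of `searx.url_utils`; the helper names `build_tbs_for_last_year` and `extract_source_url` and the example href are hypothetical.

```python
from datetime import date, timedelta
from urllib.parse import urlencode, urlparse, parse_qs


def build_tbs_for_last_year(today):
    """Return a `tbs` custom date range covering the 365 days before `today`."""
    then = today - timedelta(days=365)
    return "cdr:1,cd_min:{start},cd_max:{end}".format(
        start=then.strftime('%m/%d/%Y'),
        end=today.strftime('%m/%d/%Y'))


def extract_source_url(href):
    """Return the first `q` query parameter of a redirect link, or None."""
    query = parse_qs(urlparse(href).query)
    return next(iter(query.get('q', [])), None)


# fixed date so the result is reproducible
tbs = build_tbs_for_last_year(date(2019, 4, 12))
# urlencode percent-escapes the colons, commas and slashes in the tbs value
options = urlencode({'start': 0, 'tbs': tbs})
# hypothetical href in the shape the engine expects
source = extract_source_url('/url?q=https://example.org/cat.png&sa=U')
```

Here `tbs` is `cdr:1,cd_min:04/12/2018,cd_max:04/12/2019`, and `source` is `https://example.org/cat.png`. A link without a `q` parameter yields `None`, which is why `response` skips such results.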