
Compare commits

master..0.0.1

No commits in common. "master" and "0.0.1" have entirely different histories.

7 changed files with 43 additions and 87 deletions

View File

@@ -76,18 +76,16 @@ type: kubernetes
 name: publish-pypi
 trigger:
-  branch:
-  - master
   event:
   - tag
 steps:
-- name: fetch tags
-  image: alpine/git
-  commands:
-  - git fetch --tags
 - name: publish pypi
   image: plugins/pypi
   settings:
     username:
       from_secret: pypi_username
     password:
-      from_secret: pypi_password
+      from_secret: pypi_username

View File

@@ -1 +0,0 @@
-3.10

View File

@@ -1,4 +1,4 @@
-FROM python:3.10-bullseye
+FROM python:3.10
 COPY . /tmp/package
 RUN pip install --no-cache-dir /tmp/package && \
     rm -r /tmp/package

View File

@@ -4,8 +4,6 @@ ecommerce-exporter is a [prometheus](https://prometheus.io/) exporter that websc
 ## Install
-### Using docker
 An aarch64 and an amd64 docker images are available on [docker hub](https://hub.docker.com/r/badjware/ecommerce-exporter). You can pull it using:
 ``` sh
 docker pull badjware/ecommerce-exporter
@@ -13,13 +11,6 @@ docker pull badjware/ecommerce-exporter
 This is the recommended way of running the exporter.
-### Using pip
-Alternatively, if you prefer to avoid having to use docker, you can install ecommerce-exporter as a standard python package.
-``` sh
-pip install ecommerce-exporter
-```
 ## Usage
 Download the [example configuration file](ecommerce-exporter.example.yml) and edit it to configure the e-commerce sites you wish to scrape. You can configure multiple products and multiple targets in the same configuration file.
@@ -58,7 +49,7 @@ options:
 Finding the correct value for a selector will require some effort. Once you find the correct selector to use, you should be able to use the same one across the whole site.
-### html parser
+## html parser
 The general procedure to figure out the selector for a site using an html parser is as follow:
 1. Open up the product page in your browser.
@@ -76,7 +67,7 @@ Below is a table with examples of some CSS selectors that match the html element
 | canadacomputer.com | `.price-show-panel .h2-big strong::text` |
 | memoryexpress.com | `.GrandTotal` |
-### json parser
+## json parser
 The general procedure to figure out the selector for a site using an json parser is as follow:
 1. Open up the development tool of your browser using the F12 key.
@@ -93,20 +84,3 @@ Below is a table with examples of some jq selectors that match the json field co
 | --- | --- | --- |
 | newegg.ca | `.MainItem.UnitCost` | https://www.newegg.ca/product/api/ProductRealtime?ItemNumber=19-118-343&RecommendItem=&BestSellerItemList=9SIAA4YGC82324%2C9SIADGEGMY7603%2C9SIAVH1J0A6685&IsVATPrice=true |
 | bestbuy.ca | `.[] \| .salePrice,.regularPrice` | https://www.bestbuy.ca/api/offers/v1/products/15778672/offers |
-## Developing
-Setup a virtualenv and activate it:
-``` sh
-python -m venv env
-source env/bin/activate
-```
-Setup the project as an editable install:
-``` sh
-pip install -e .
-```
-You can now run the exporter using `ecommerce-exporter` while your virtualenv is active.
-Happy hacking!
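The README hunks above explain how to hunt down a CSS selector for the html parser and a jq expression for the json parser. As a rough sketch (not part of either branch; the URLs and selector values are only placeholders), a candidate selector can be sanity-checked with the same libraries the exporter uses, parsel and pyjq:

``` python
# Hedged sketch: check a candidate selector before adding it to the config.
# The URLs and selectors are placeholders, not real scrape targets.
import json

import httpx    # the 0.0.1 side fetches pages with httpx; master uses requests instead
import parsel
import pyjq

# 'html' parser: apply a CSS selector (optionally ending in ::text) to the product page
page = httpx.get("https://example.com/some-product", follow_redirects=True).text
print(parsel.Selector(text=page).css(".price-show-panel .h2-big strong::text").get())

# 'json' parser: apply a jq expression to the JSON returned by the site's price API
data = json.loads(httpx.get("https://example.com/api/offers", follow_redirects=True).text)
print(pyjq.first(".MainItem.UnitCost", data))
```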

View File

@@ -1,19 +1,13 @@
 import argparse
 import os
 import time
-import logging
 import yaml
+from httpx import RequestError
 from prometheus_client import start_http_server, Gauge, Counter
-from ecommerce_exporter.scrape_target import ScrapeTarget
+from ecommerce_exporter.scrape_target import ScrapeError, ScrapeTarget
-logging.basicConfig(
-    format=os.environ.get('LOG_FORMAT', '[%(asctime)s] [%(levelname)-8s] %(message)s'),
-    level=os.environ.get('LOG_LEVEL', 'INFO')
-)
-logger = logging.getLogger(__name__)
 ECOMMERCE_SCRAPE_TARGET_VALUE = Gauge(
     'ecommerce_scrape_target_value',
@@ -49,7 +43,7 @@ def main():
         '--user-agent',
         help='The user-agent to spoof. (default: %(default)s)',
         type=str,
-        default='Mozilla/5.0 (X11; Linux x86_64; rv:131.0) Gecko/20100101 Firefox/131.0',
+        default='Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
     )
     parser.add_argument(
         '-p', '--listen-port',
@@ -65,17 +59,23 @@
     )
     args = parser.parse_args()
-    scrape_targets = parse_config(os.path.abspath(args.config), user_agent=args.user_agent)
+    scrape_targets = parse_config(os.path.abspath(args.config))
+    # setup the headers for each scrape targets
+    for scrape_target in scrape_targets:
+        scrape_target.headers = {
+            'Accept': '*/*',
+            'User-Agent': args.user_agent,
+        }
     # start the http server to server the prometheus metrics
-    logger.info("serving metrics on http://%s:%s/metrics", args.listen_address, args.listen_port)
     start_http_server(args.listen_port, args.listen_address)
     # start the main loop
     while True:
         for scrape_target in scrape_targets:
             try:
-                logger.info("Starting scrape. product: '%s', target '%s'", scrape_target.product_name, scrape_target.target_name)
+                print("Starting scrape. product: '%s', target '%s'" % (scrape_target.product_name, scrape_target.target_name))
                 value = scrape_target.query_target()
                 ECOMMERCE_SCRAPE_TARGET_VALUE.labels(
                     product_name=scrape_target.product_name,
@@ -85,10 +85,8 @@ def main():
                     product_name=scrape_target.product_name,
                     target_name=scrape_target.target_name,
                 ).inc()
-            except KeyboardInterrupt:
-                return
-            except Exception as e:
-                logger.error("Failed to scrape! product: '%s', target: '%s', message: '%s'" , scrape_target.product_name, scrape_target.target_name, e)
+            except (RequestError, ScrapeError) as e:
+                print("Failed to scrape! product: '%s', target: '%s', message: '%s'" % (scrape_target.product_name, scrape_target.target_name, e))
                 ECOMMERCE_SCRAPE_TARGET_FAILURE.labels(
                     product_name=scrape_target.product_name,
                     target_name=scrape_target.target_name,
@@ -96,9 +94,9 @@ def main():
                 ).inc()
         time.sleep(args.interval * 60)
-def parse_config(config_filename, user_agent):
+def parse_config(config_filename):
     result = []
-    logger.info('Loading configurations from %s', config_filename)
+    print('Loading configurations from %s' % config_filename)
     with open(config_filename, 'r') as f:
         config = yaml.safe_load(f)
@@ -118,9 +116,6 @@ def parse_config(config_filename, user_agent):
                 target_name=target.get('name'),
                 regex=target.get('regex'),
                 parser=target.get('parser'),
-                headers = {
-                    'User-Agent': user_agent,
-                },
             ))
     return result
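For context on the hunks above: on the 0.0.1 side main() assigns the headers to each scrape target after parse_config(), while on master the user agent is passed into ScrapeTarget through parse_config(config_filename, user_agent). A minimal sketch of the 0.0.1-style flow, with made-up product, URL and selector values:

``` python
# Hedged sketch of the 0.0.1-style flow: build a target, attach headers, scrape once.
# The product name, URL and selector below are invented for illustration.
from ecommerce_exporter.scrape_target import ScrapeTarget

target = ScrapeTarget(
    product_name='some-gpu',
    url='https://example.com/some-product',
    selector='.price-show-panel .h2-big strong::text',
    parser='html',
)
# in 0.0.1, main() sets the headers on each target after parse_config()
target.headers = {
    'Accept': '*/*',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:106.0) Gecko/20100101 Firefox/106.0',
}
# query_target() returns the scraped price as a float,
# which main() feeds into the ecommerce_scrape_target_value gauge
print(target.query_target())
```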

View File

@@ -3,27 +3,19 @@ import re
 from urllib.parse import urlparse
-# import httpx
-import requests
+import httpx
 import parsel
 import pyjq
-import logging
-logger = logging.getLogger(__name__)
 class ScrapeTarget:
-    def __init__(self, product_name, url, selector, target_name=None, regex=None, parser=None, headers={}):
+    def __init__(self, product_name, url, selector, target_name=None, regex=None, parser=None):
         self.product_name = product_name
         self.target_name = target_name if target_name else urlparse(url).hostname
         self.url = url
         self.selector = selector
-        self.regex = re.compile(regex if regex else r'([0-9]+,?)+(\.[0-9]{2})?')
+        self.regex = re.compile(regex if regex else r'[0-9]+(\.[0-9]{2})?')
         self.parser = parser if parser else 'html'
-        if 'Referer' not in headers:
-            headers['Referer'] = 'google.com'
-        if 'DNT' not in headers:
-            headers['DNT'] = '1'
-        self.headers = headers
-        self.session = requests.Session()
+        self.headers = {}
         # sanity check
         valid_parsers = ('html', 'json')
@@ -31,24 +23,23 @@ class ScrapeTarget:
             raise ValueError("Invalid parser configured (got '%s' but need one of %s) product: '%s', target: '%s'" % (self.parser, valid_parsers, self.product_name, self.target_name))
     def query_target(self):
-        query_response = self.session.get(
-            self.url,
+        # some sites get suspicious if we talk to them in HTTP/1.1 (maybe because it doesn't match our user-agent?)
+        # we use httpx to have HTTP2 support and circumvent that issue
+        query_response = httpx.get(
+            url=self.url,
             headers=self.headers,
-        )
-        logger.info('Status: %s', query_response.status_code)
-        # self.client.cookies.update(query_response.cookies)
-        query_response_text = query_response.text
-        logger.debug('Response: %s', query_response_text)
+            follow_redirects=True,
+        ).text
         # parse the response and match the selector
         selector_match = ''
         if self.parser == 'html':
             # parse response as html
-            selector = parsel.Selector(text=query_response_text)
+            selector = parsel.Selector(text=query_response)
             selector_match = selector.css(self.selector).get()
         elif self.parser == 'json':
             # parse response as json
-            query_response_json = json.loads(query_response_text)
+            query_response_json = json.loads(query_response)
             selector_match = str(pyjq.first(self.selector, query_response_json))
         else:
             raise ScrapeError('Invalid parser!')
@@ -59,7 +50,7 @@ class ScrapeTarget:
         # match the regex
         regex_match = self.regex.search(selector_match)
         if regex_match:
-            str_result = regex_match.group(0).replace(',', '')
+            str_result = regex_match.group(0)
             # convert the result to float
             float_result = float(str_result)
             return float_result
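One behavioural difference hiding in the hunks above is the default price regex. A small illustration, assuming a price string that contains a thousands separator, of why the master side also strips commas before converting to float:

``` python
# Illustration of the two default regexes from the hunk above, applied to a
# made-up price string containing a thousands separator.
import re

sample = '$1,299.99'

# 0.0.1 default: stops at the first comma, so the scraped value becomes 1.0
tag_default = re.compile(r'[0-9]+(\.[0-9]{2})?')
print(float(tag_default.search(sample).group(0)))                        # 1.0

# master default: matches the full figure, then the comma is stripped before float()
master_default = re.compile(r'([0-9]+,?)+(\.[0-9]{2})?')
print(float(master_default.search(sample).group(0).replace(',', '')))    # 1299.99
```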

View File

@@ -1,9 +1,8 @@
 [metadata]
 name = ecommerce-exporter
 description = ecommerce-exporter is a prometheus exporter that export the price of products in e-commerce site as prometheus metrics.
-url = https://code.badjware.dev/badjware/ecommerce-exporter
 author = badjware
-author_email = marchambault@badjware.dev
+author_email = marchambault.badjware.dev
 licence = MIT Licence
 classifers =
     Programming Language :: Python
@@ -16,7 +15,7 @@ setup_requires =
     setuptools_scm
 install_requires=
     PyYAML~=6.0
-    requests~=2.32.0
+    httpx~=0.23.0
     parsel~=1.6.0
     pyjq~=2.6.0
     prometheus-client~=0.15.0