From ac3816dd390d29e4ee7d796644dadef18d879297 Mon Sep 17 00:00:00 2001
From: Ric Arcila
Date: Fri, 18 May 2018 10:59:07 +0100
Subject: [PATCH] Added more description info, Elasticsearch quick setup and flag info

---
 README.md | 47 ++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 36 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 382bba1..41ec8aa 100644
--- a/README.md
+++ b/README.md
@@ -1,18 +1,19 @@
 # BIOSCHEMAS.ORG GO CRAWL IT!
 
-Minimal crawler and extractor of microdata and JSON-LD metadata.
+Crawls and extracts bioschemas.org/schema.org JSON-LD and Microdata from a given website. The extracted information is stored in a JSON file and can optionally be stored in a local Elasticsearch service.
 
 ## How to use it:
-
+---
 Use example:
 
-```
-./bioschemas-gocrawlit -p -u "https://www.ebi.ac.uk/biosamples/samples"
-./bioschemas-gocrawlit -q -u https://tess.elixir-europe.org/sitemaps/events.xml
-./bioschemas-gocrawlit -u http://159.149.160.88/pscan_chip_dev/
+```bash
+./bioschemas-gocrawlit_mac_64 -p -u "https://www.ebi.ac.uk/biosamples/samples"
+./bioschemas-gocrawlit_mac_64 -q -u https://tess.elixir-europe.org/sitemaps/events.xml
+./bioschemas-gocrawlit_mac_64 -u http://159.149.160.88/pscan_chip_dev/
 ```
 
-A folder "bioschemas_gocrawlit_cache" will be created on the current path of execution;This folder contains crawled website information in order to prevent multiple download of pages. Is safe to delete this folder.
+A folder "bioschemas_gocrawlit_cache" will be created in the current working directory. This folder caches crawled website information to avoid downloading pages more than once. It is safe to delete this folder.
+
 
 ### Output
 
@@ -26,21 +27,45 @@ Scraped data will be stored in a json file named ```_schema.json``
 - **-e**: Adds crawled data to an Elasticsearch (v6) service at http://127.0.0.1:9200.
 - **-u**: Start page to start crawling.
 - **-q**: Remove query section from the link URL found.
-- **--query**: Use with **-q** so it follows only links that contain the query word provided, e.g., ```./bioschemas-gocrawlit -u https://tess.elixir-europe.org/events -q --page page```
+- **--query**: Use with **-q** so it follows only links that contain the query word provided, e.g., ```./bioschemas-gocrawlit_mac_64 -u https://tess.elixir-europe.org/events -q --page page```
 - **-h**: Print Help and exit.
 
 ## Building binaries
-
+----
 To create a binary for your current SO use:
 
-```make build```
+```bash
+make build
+```
 
 To create a binary for windows, macos and linux SO use:
 
-```make build-all```
+```bash
+make build-all
+```
 
 The binaries would be placed under build/ path.
 
+## Elasticsearch quick setup [DOCKER](https://www.docker.com/)
+---
+Steps for starting dockerized [elasticsearch](https://www.elastic.co/products/elasticsearch) and [kibana](https://www.elastic.co/products/kibana) locally. This requires [Docker](https://store.docker.com/search?type=edition&offering=community).
+
+#### Create a custom network for your elastic-stack:
+
+```docker network create elastic-stack```
+
+#### Pull and run an elasticsearch image:
+
+```docker run -it --network=elastic-stack -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name elasticsearch docker.elastic.co/elasticsearch/elasticsearch:6.2.4```
+> Avoid changing the container's name, since the Kibana Docker image points by default to `http://elasticsearch:9200`.
+
+#### Pull and run a kibana image:
+
+```docker run --network=elastic-stack --rm -it -p 5601:5601 --name kibana docker.elastic.co/kibana/kibana:6.2.4```
+
+> Remember that the --rm flag will delete the container once it is stopped.
+
+
 ## ToDo
 
 - [x] Crawl website