Added more description info, Elasticsearch quick setup and flag info
ricardoaat committed May 18, 2018
1 parent 66a3fea commit ac3816d
Showing 1 changed file with 36 additions and 11 deletions.
47 changes: 36 additions & 11 deletions README.md
@@ -1,18 +1,19 @@
# BIOSCHEMAS.ORG GO CRAWL IT!

Minimal crawler and extractor of microdata and JSON-LD metadata.
Crawls a given website and extracts its bioschemas.org/schema.org JSON-LD and Microdata. The extracted information is stored in a JSON file and can optionally be stored in a local Elasticsearch service.


## How to use it:

---
Use example:
```
./bioschemas-gocrawlit -p -u "https://www.ebi.ac.uk/biosamples/samples"
./bioschemas-gocrawlit -q -u https://tess.elixir-europe.org/sitemaps/events.xml
./bioschemas-gocrawlit -u http://159.149.160.88/pscan_chip_dev/
```bash
./bioschemas-gocrawlit_mac_64 -p -u "https://www.ebi.ac.uk/biosamples/samples"
./bioschemas-gocrawlit_mac_64 -q -u https://tess.elixir-europe.org/sitemaps/events.xml
./bioschemas-gocrawlit_mac_64 -u http://159.149.160.88/pscan_chip_dev/
```

A folder "bioschemas_gocrawlit_cache" will be created on the current path of execution;This folder contains crawled website information in order to prevent multiple download of pages. Is safe to delete this folder.
A folder named "bioschemas_gocrawlit_cache" will be created in the directory where the command is run. It stores the crawled pages so that they are not downloaded more than once. It is safe to delete this folder.
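
Since the cache folder can be deleted safely, a small shell helper can clear it before a fresh crawl. This is only a sketch: the function name `clear_crawl_cache` and its optional directory argument are hypothetical and not part of bioschemas-gocrawlit; only the folder name comes from the text above.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of bioschemas-gocrawlit): delete the crawl
# cache so that the next run downloads every page again.
clear_crawl_cache() {
  # Directory the crawler was run from (assumption: defaults to the current one).
  local dir="${1:-.}"
  local cache="$dir/bioschemas_gocrawlit_cache"
  if [ -d "$cache" ]; then
    rm -rf -- "$cache"
    echo "Removed $cache"
  else
    echo "No cache folder at $cache"
  fi
}

# Usage: clear the cache in the current directory.
# clear_crawl_cache
```

Running it when no cache exists is harmless; it only reports that there was nothing to remove.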


### Output

@@ -26,21 +27,45 @@ Scraped data will be stored in a json file named ```<website_host>_schema.json``
- **-e**: Adds crawled data to an Elasticsearch (v6) service at http://127.0.0.1:9200.
- **-u**: URL of the page where crawling starts.
- **-q**: Remove the query section from the link URLs found.
- **--query**: Use with **-q** so it follows only links that contain the query word provided, e.g., ```./bioschemas-gocrawlit -u https://tess.elixir-europe.org/events -q --page page```
- **--query**: Use with **-q** to follow only links that contain the given query word, e.g., ```./bioschemas-gocrawlit_mac_64 -u https://tess.elixir-europe.org/events -q --query page```
- **-h**: Print Help and exit.


## Building binaries

----
To create a binary for your current OS use:
```make build```
```bash
make build
```

To create binaries for Windows, macOS and Linux use:
```make build-all```
```bash
make build-all
```

The binaries will be placed under the build/ path.


## Elasticsearch quick setup [DOCKER](https://www.docker.com/)
---
Steps for starting dockerized [elasticsearch](https://www.elastic.co/products/elasticsearch) and [kibana](https://www.elastic.co/products/kibana) locally. This requires [Docker](https://store.docker.com/search?type=edition&offering=community).

#### Create a custom network for your elastic-stack:

```docker network create elastic-stack```

#### Pull and run an elasticsearch image:

```docker run -it --network=elastic-stack -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" --name elasticsearch docker.elastic.co/elasticsearch/elasticsearch:6.2.4```
> Avoid changing the container's name, since the Kibana docker image points to `http://elasticsearch:9200` by default.

#### Pull and run a kibana image:

```docker run --network=elastic-stack --rm -it -p 5601:5601 --name kibana docker.elastic.co/kibana/kibana:6.2.4```

> Remember that the --rm flag will delete the container once it is stopped.
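
The Elasticsearch container may need some time before it accepts requests, so it can help to wait for it before running the crawler with **-e**. The sketch below polls the address used by the **-e** flag with `curl`. The function name `wait_for_es` and its default attempt count and delay are assumptions made for illustration, not part of bioschemas-gocrawlit.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of bioschemas-gocrawlit): poll an
# Elasticsearch URL until it answers or the attempts run out.
wait_for_es() {
  local url="${1:-http://127.0.0.1:9200}"  # address used by the -e flag
  local tries="${2:-30}"                   # assumed number of attempts
  local delay="${3:-2}"                    # assumed seconds between attempts
  local i
  for ((i = 1; i <= tries; i++)); do
    # -s: silent, -f: fail on HTTP errors, -o /dev/null: discard the body
    if curl -sf -o /dev/null "$url"; then
      echo "Elasticsearch is answering at $url"
      return 0
    fi
    sleep "$delay"
  done
  echo "No answer from $url after $tries attempts" >&2
  return 1
}

# Usage: start the crawl only once Elasticsearch is reachable.
# wait_for_es && ./bioschemas-gocrawlit_mac_64 -e -u "https://www.ebi.ac.uk/biosamples/samples"
```

A successful HTTP answer only shows that the service is reachable. It does not check the Elasticsearch version or the cluster health.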

## ToDo

- [x] Crawl website