This repository has been archived by the owner on May 3, 2020. It is now read-only.

t28hub/graph-dom


Graph DOM

Extract web data using GraphQL and a DOM-style API.
⚠️ This application is a proof of concept and might not be suitable for production.

Key Features

  • APIs modeled on the DOM API
  • Custom HTTP Headers and HTTP Cookies
  • Emulate devices and User-Agent
  • Render JavaScript
  • Support robots.txt
  • Expand short URLs
  • Protect your privacy using incognito mode

Example

The following example extracts the attributes of featured presentations (title, URL, thumbnail URL, and meta) from Speaker Deck.

{
  page(url: "https://speakerdeck.com/p/featured") {
    decks: querySelectorAll(selector: "div.container div.mb-5") {
      title: querySelector(selector: "a") {
        text: getAttribute(name: "title")
      }
      link: querySelector(selector: "a") {
        href: getAttribute(name: "href")
      }
      image: querySelector(selector: "div.deck-preview") {
        thumbnail: getAttribute(name: "data-cover-image")
      }
      meta: querySelectorAll(selector: "div.deck-preview-meta > div.py-3") {
        value: innerText
      }
    }
  }
}
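
A query like the one above is sent to the endpoint as a JSON POST body, the same way the ping check in the Development section does with curl. The following TypeScript sketch builds such a request and flattens a response into plain records. The helper names (buildGraphQLRequest, flattenDecks) and the nullability of each field are assumptions made for illustration and are not part of GraphDOM; the response shape relies only on GraphQL mirroring the aliases used in the query.

```typescript
// Hypothetical helper: wraps a GraphQL query in the JSON POST request
// that a GraphQL-over-HTTP endpoint expects.
interface GraphQLRequest {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildGraphQLRequest(query: string): GraphQLRequest {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  };
}

// Assumed shape of one entry under data.page.decks. The keys mirror the
// aliases in the query; treating every value as nullable is an assumption.
interface DeckNode {
  title: { text: string | null } | null;
  link: { href: string | null } | null;
  image: { thumbnail: string | null } | null;
  meta: Array<{ value: string | null }> | null;
}

interface Deck {
  title: string;
  url: string;
  thumbnail: string;
  meta: string[];
}

// Flattens the nested selection into one plain record per deck,
// replacing missing values with empty strings.
function flattenDecks(decks: DeckNode[]): Deck[] {
  return decks.map((d) => ({
    title: d.title?.text ?? "",
    url: d.link?.href ?? "",
    thumbnail: d.image?.thumbnail ?? "",
    meta: (d.meta ?? []).map((m) => (m.value ?? "").trim()),
  }));
}
```

In a runtime that provides fetch, the request object can be passed as the second argument to fetch together with the endpoint URL, and the decks array from the parsed response can then be handed to flattenDecks.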

See examples for more detailed queries.

Supported APIs

Document

  • title
  • head
  • body
  • children
  • childNodes
  • innerText
  • getElementById
  • getElementsByClassName
  • getElementsByTagName
  • querySelector
  • querySelectorAll

Element

  • attributes
  • children
  • childNodes
  • innerText
  • innerHTML
  • outerHTML
  • getAttribute
  • getElementById
  • getElementsByClassName
  • getElementsByTagName
  • querySelector
  • querySelectorAll

Other APIs

GraphQL is introspective. You can query a GraphQL schema using __schema and __type as below.

  • __schema lists all types defined in the schema.
query {
  __schema {
    types {
      name
      kind
      description
      fields {
        name
      }
    }
  }
}
  • __type gets details about a specific type.
query {
  __type(name: "Document") {
    name
    kind
    description
    fields {
      name
    }
  }
}
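
The result of the __type query above can be reduced to a list of field names, for example to check which of the supported APIs a deployed instance exposes. The sketch below is in TypeScript; fieldNames is a hypothetical helper, and the interfaces cover only the fields selected in the query. Per the GraphQL specification, __type is null for an unknown type and fields is null for types that have no fields, so both cases yield an empty list.

```typescript
// Introspection result limited to the fields selected in the query above.
interface TypeInfo {
  name: string;
  kind: string;
  description: string | null;
  fields: Array<{ name: string }> | null;
}

interface TypeResponse {
  data?: { __type: TypeInfo | null };
}

// Hypothetical helper: returns the sorted field names of the queried type,
// or an empty array when the type is unknown or has no fields.
function fieldNames(response: TypeResponse): string[] {
  const fields = response.data?.__type?.fields ?? [];
  return fields.map((f) => f.name).sort();
}
```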

Development

docker-compose up

GraphDOM will be running at http://localhost:8080, and the GraphQL endpoint will be available at the URL below. If NODE_ENV is development, the Playground is also served at that URL.

http://localhost:8080/graphql

The 'ping' query is useful to check whether GraphDOM works.

curl \
-H 'Content-Type: application/json' \
-X POST http://localhost:8080/graphql \
-d '{"query":"{ping}"}'

If GraphDOM is working correctly, the request returns the following response.

{
  "data": {
    "ping": "pong"
  }
}
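
The same check can be automated, for instance in a container health probe. The TypeScript sketch below validates a response body against the expected shape; isPong is a hypothetical helper name, and treating anything other than the exact payload above as a failure is an assumption rather than documented GraphDOM behavior.

```typescript
// Hypothetical helper: returns true only when the body parses as JSON
// and contains data.ping === "pong".
function isPong(responseBody: string): boolean {
  try {
    const parsed: unknown = JSON.parse(responseBody);
    if (typeof parsed !== "object" || parsed === null) return false;
    const data = (parsed as { data?: unknown }).data;
    if (typeof data !== "object" || data === null) return false;
    return (data as { ping?: unknown }).ping === "pong";
  } catch {
    // The body was not valid JSON.
    return false;
  }
}
```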

Environment Variables

The following environment variables are supported. Every variable is optional.

  • NODE_ENV: development or production. (Defaults to development)
  • SERVER_PORT: Port that GraphDOM listens on. (Defaults to 8080)
  • LOG_LEVEL: DEBUG, INFO, WARN, ERROR or TRACE. (Defaults to INFO)
  • APOLLO_API_KEY: API key for the Apollo GraphManager.
  • APOLLO_SCHEMA_TAG: Tag name of a GraphQL schema.
  • BROWSER_PATH: Path to a browser. (Defaults to automatic detection)
  • BROWSER_HEADLESS: Whether to launch the browser in headless mode. (Defaults to true)
  • QUERY_COMPLEXITY_LIMIT: Maximum allowed complexity for a query. (Defaults to 15)
  • QUERY_DEPTH_LIMIT: Maximum allowed depth for a query. (Defaults to 5)
  • REDIS_URL: URL used to connect to Redis. If this variable is not set, GraphDOM uses an in-memory cache.

See .env.example for more detailed variables.
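
To make the documented defaults concrete, the TypeScript sketch below resolves a set of environment variables into a typed configuration object. It is not GraphDOM's own loader: loadConfig, intOr, and the Config interface are hypothetical names, and the parsing rules (falling back to the default on a non-numeric value, treating only the string "false" as disabling headless mode) are assumptions. The Apollo variables are omitted because no defaults are documented for them.

```typescript
type Env = Record<string, string | undefined>;

interface Config {
  nodeEnv: string;
  serverPort: number;
  logLevel: string;
  browserPath: string | undefined; // undefined means detect automatically
  browserHeadless: boolean;
  queryComplexityLimit: number;
  queryDepthLimit: number;
  redisUrl: string | undefined; // undefined means use the in-memory cache
}

// Parses a base-10 integer, falling back when the value is missing or invalid.
function intOr(value: string | undefined, fallback: number): number {
  const n = Number.parseInt(value ?? "", 10);
  return Number.isNaN(n) ? fallback : n;
}

// Hypothetical loader that applies the defaults listed in this section.
function loadConfig(env: Env): Config {
  return {
    nodeEnv: env["NODE_ENV"] ?? "development",
    serverPort: intOr(env["SERVER_PORT"], 8080),
    logLevel: env["LOG_LEVEL"] ?? "INFO",
    browserPath: env["BROWSER_PATH"],
    // Assumption: any value other than "false" keeps headless mode on.
    browserHeadless: env["BROWSER_HEADLESS"] !== "false",
    queryComplexityLimit: intOr(env["QUERY_COMPLEXITY_LIMIT"], 15),
    queryDepthLimit: intOr(env["QUERY_DEPTH_LIMIT"], 5),
    redisUrl: env["REDIS_URL"],
  };
}
```

Taking the environment as a parameter, rather than reading a global, keeps the function easy to test; in Node.js it could be called with process.env.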

License

FOSSA Status