Revive Hits - Better (Much Faster!) than Ever #50

Merged
merged 45 commits into from
Sep 4, 2017
168dc3f
update console.log to be more descriptive
nelsonic Jan 22, 2016
bad4407
fix merge conflict in server.js
nelsonic Jul 18, 2016
f21abf7
update dependencies fixes https://github.com/dwyl/hits/issues/41
nelsonic Aug 23, 2017
f15cb24
adds browser language to hit row fixes https://github.com/dwyl/hits/i…
nelsonic Aug 24, 2017
d9aab87
revive the idea of creating our own SVG for https://github.com/dwyl/h…
nelsonic Aug 24, 2017
fd22568
create single-purpose function to extract request data for https://gi…
nelsonic Aug 24, 2017
63d5c54
adds code to create our own SVG file instead of relying on Shields.io…
nelsonic Aug 24, 2017
815fc10
remove dependency on Wreck as no longer making HTTP request to Shield…
nelsonic Aug 24, 2017
60bc455
move headers to their own .json file to reduce clutter
nelsonic Aug 24, 2017
1896695
use latest version of redis-connection which does not Throw Error if …
nelsonic Aug 25, 2017
fd51035
split redis saving function into dedicated file with its own tests fo…
nelsonic Aug 25, 2017
4432e51
adds hash.js (with tests & fixtures) borrow hash code from https://gi…
nelsonic Aug 25, 2017
e0b9877
clarify/simplify unique data being stored fixes https://github.com/dw…
nelsonic Aug 25, 2017
5a805f9
tidy up tests for #44
nelsonic Aug 25, 2017
70810e2
adds implementation of filesystem db for when Redis is unavailable #42
nelsonic Aug 25, 2017
bbb4375
tidy up "db" files for consistency #42
nelsonic Aug 25, 2017
ddcbd68
use filesystem when Redis is unavailable for #42
nelsonic Aug 25, 2017
c35cd6b
update test/hits.test.js to use both "Datastores" fixes https://githu…
nelsonic Aug 25, 2017
544aea6
adds temporary debug console.log for heroku
nelsonic Aug 26, 2017
3699032
console.log(process.env.REDISCLOUD_URL) to check if Redis is availabl…
nelsonic Aug 26, 2017
50131ce
tidy up server.js and revive socket.io "live updates" https://github.…
nelsonic Aug 26, 2017
b3a6c3c
update server.js to show "LAN IP Address" fixes #42
nelsonic Aug 26, 2017
8351c42
adds human-friendly format of hit for displaying in browser UI fixes …
nelsonic Aug 26, 2017
51c46b4
use human-friendly format in UI via socket.io #17 & #45
nelsonic Aug 26, 2017
f03194b
tidy up client.js for #17
nelsonic Aug 26, 2017
f4f2230
remove child process exec (not being used) from db_filesystem.js
nelsonic Aug 26, 2017
d461979
update to latest version of JQuery (3.2.1) fixes https://github.com/d…
nelsonic Aug 27, 2017
97f595a
[WiP] re-writing client.js to not use JQuery for https://github.com/d…
nelsonic Aug 28, 2017
30a28e2
marginally better error logging fixes https://github.com/dwyl/hits/is…
nelsonic Aug 28, 2017
f710efd
update lib/db_fs to use error_logger #47 ... now to completely re-wri…
nelsonic Aug 28, 2017
2ab9c00
use a single log file for all logs ... https://github.com/dwyl/hits/i…
nelsonic Aug 28, 2017
7947244
working 100% now to use "readline" for "big files" ... #44
nelsonic Aug 28, 2017
806e61d
use "readline" for streaming log files if they are large see: https:/…
nelsonic Aug 28, 2017
69fda35
godbye jquery fixes https://github.com/dwyl/hits/issues/46
nelsonic Aug 28, 2017
0ffd463
one log file per endpoint fixes https://github.com/dwyl/hits/issues/48
nelsonic Aug 28, 2017
c60e01b
wrap socket.io code in client.js in setTimeout function with minor de…
nelsonic Aug 28, 2017
606c10a
put log files in /logs folder (duh!) for https://github.com/dwyl/hits…
nelsonic Aug 28, 2017
f025dda
update badges in readme to use "flat style" :wink:
nelsonic Aug 28, 2017
a6a4b1d
remove mkdirp from list of dependencies as no longer used! see: #14
nelsonic Aug 28, 2017
a49b0c2
adds hit badge to index.html so we can count how many people visit th…
nelsonic Aug 29, 2017
5c2b20f
adds UI for creating badges fixes https://github.com/dwyl/hits/issues/51
nelsonic Sep 4, 2017
ceddc22
ensure UI works when JS disabled for #51
nelsonic Sep 4, 2017
2ea3f17
adds instructions for when JS is disabled see: https://github.com/dwy…
nelsonic Sep 4, 2017
a376336
use table layout to align input elements https://github.com/dwyl/hits…
nelsonic Sep 4, 2017
3a8a2f8
reduce width of inputs in UI #51
nelsonic Sep 4, 2017
3 changes: 2 additions & 1 deletion .gitignore
@@ -26,6 +26,7 @@ build/Release
# https://www.npmjs.org/doc/misc/npm-faq.html#should-i-check-my-node_modules-folder-into-git
node_modules

config.env
*.env
dump.rdb
npm-debug.log
data/
195 changes: 154 additions & 41 deletions README.md
@@ -1,78 +1,168 @@
# hits

What if there was a *simple+easy* way to see how many people have viewed your GitHub Repository?
A _simple & easy_ way to see how many people have _viewed_ your GitHub Repository.

[![Build Status](https://travis-ci.org/dwyl/hits.svg)](https://travis-ci.org/dwyl/hits)
[![HitCount](https://hitt.herokuapp.com/nelsonic/hits.svg)](https://github.com/nelsonic/hits)
[![Code Climate](https://codeclimate.com/github/dwyl/hits/badges/gpa.svg)](https://codeclimate.com/github/dwyl/hits)
[![codecov.io](http://codecov.io/github/dwyl/hits/coverage.svg?branch=master)](http://codecov.io/github/dwyl/hits?branch=master)
[![Dependency Status](https://david-dm.org/dwyl/hits.svg)](https://david-dm.org/dwyl/hits)
[![devDependency Status](https://david-dm.org/dwyl/hits/dev-status.svg)](https://david-dm.org/dwyl/hits#info=devDependencies)
[![Build Status](https://img.shields.io/travis/dwyl/hits.svg?style=flat-square)](https://travis-ci.org/dwyl/hits)
[![HitCount](http://hits.dwyl.io/dwyl/hits.svg)](https://github.com/dwyl/hits)
[![codecov.io](https://img.shields.io/codecov/c/github/dwyl/hits/master.svg?style=flat-square)](http://codecov.io/github/dwyl/hits?branch=master)
[![Dependency Status](https://img.shields.io/david/dwyl/hits.svg?style=flat-square)](https://david-dm.org/dwyl/hits)
[![devDependency Status](https://img.shields.io/david/dev/dwyl/hits.svg?style=flat-square)](https://david-dm.org/dwyl/hits#info=devDependencies)


## Why?

We have a few repos on GitHub ... but sadly, we have no idea how many people
are looking at the repos unless they star/watch them; GitHub does not share
any stats with people using their site.
We have a _few_ projects on GitHub ... <br />
_Sadly_, we ~~have~~ _had_ no idea how many people
are _reading/using_ the projects because GitHub only shares "[traffic](https://github.com/blog/1672-introducing-github-traffic-analytics)" stats
for the [_past 14 days_](https://github.com/dwyl/hits/issues/49) and **not** in "***real time***".
(_unless people star/watch the repo_) Also, _manually_ checking who has viewed a
project is _exceptionally_ tedious when you have more than a handful of projects.

We would like to *know* the popularity of each of our repos
to know where we need to be investing our time.
We want to *know* the popularity of _each_ of our repos
to know what people are finding _useful_ and help us
decide where we need to be investing our time.

## What?

A simple way to add (*very basic*) analytics to your GitHub repos.

There are already *many* "Badges" available which people put in their repos: https://github.com/dwyl/repo-badges
There are already *many* "badges" that people use in their repos.
See: [github.com/dwyl/**repo-badges**](https://github.com/dwyl/repo-badges) <br />
But we haven't seen one that gives a "***hit counter***"
of the number of times a page has been viewed ...
of the number of times a GitHub page has been viewed ... <br />
So, in today's mini project we're going to _create_ a _basic **Web Counter**_.

## How?

Place a badge (*image*) in your repo `README.md` so others can
can see how popular the page is and you can track it.
https://en.wikipedia.org/wiki/Web_counter

### Implementation
### What Data to Capture/Store?

What is the ***minimum possible*** amount of data we can store?
The _first_ question we asked ourselves was:
What is the ***minimum possible*** amount of (_useful/unique_)
**info** we can store ***per visit*** (_to one of our projects_)?

+ **date+time** the person visited the site.
1. **date + time** (_timestamp_) ***when***
the person visited the site/page. <br />
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Date/now
+ **user-agent** the browser or crawler visiting the page

2. **url** being visited. i.e. which project was viewed.

3. **user-agent** the browser/device (_or "crawler"_) visiting the site/page
https://en.wikipedia.org/wiki/User_agent
+ **referer** url of the page where the image is requested from?
https://en.wikipedia.org/wiki/HTTP_referer

Log entries are stored as a `String` which can be parsed and re-formatted into
any other format:
4. IP Address of the client. (_for checking uniqueness_)

5. **Language** of the person's web browser.
_Note: While not "essential", we added **Browser Language**
as the **5th** piece of data (when it is set/sent by the browser/device)
because it's **insightful** to know what language people are using
so that we can determine if we should be **translating**/"**localising**"
our content._
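
The five fields listed above could be collected from an incoming Node.js request roughly like this. This is a minimal sketch, not the project's actual extraction code: the function name `extractHit`, the `x-forwarded-for` fallback and the `'unknown'` defaults are assumptions made for illustration.

```js
// Hypothetical helper: collect the five pieces of "hit" data from a request.
// `req` is assumed to look like a Node.js http.IncomingMessage.
function extractHit (req) {
  var headers = req.headers || {};
  // behind a proxy the client IP is usually the first entry in x-forwarded-for
  var forwarded = (headers['x-forwarded-for'] || '').split(',')[0].trim();
  var socket = req.socket || req.connection || {};
  // accept-language looks like "en-GB,en;q=0.9"; keep only the first tag
  var language = (headers['accept-language'] || '').split(',')[0].trim();
  return {
    timestamp: Date.now(),                              // 1. date + time
    url: req.url,                                       // 2. url being visited
    agent: headers['user-agent'] || 'unknown',          // 3. user-agent
    ip: forwarded || socket.remoteAddress || 'unknown', // 4. IP address
    language: language || 'unknown'                     // 5. browser language (optional)
  };
}

// usage with a stand-in request object:
var hit = extractHit({
  url: '/dwyl/the-book.svg',
  headers: {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)',
    'accept-language': 'en-GB,en;q=0.9'
  },
  socket: { remoteAddress: '84.91.136.21' }
});
console.log(hit);
```

Because the language header is optional, the sketch falls back to a placeholder rather than failing when it is absent.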

### "Common Log Format" (CLF) ?

We initially _considered_ using the "Common Log Format" (CLF)
because it's well-known/understood.
see: https://en.wikipedia.org/wiki/Common_Log_Format

An example log entry:
```
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
```

Real example:
```
84.91.136.21 Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) 007 [05/Aug/2017:16:50:51 -0000] "GET github.com/dwyl/phase-two HTTP/1.0" 200 42247
```

The data makes sense when viewed as a table:

| IP Address of Client | User Identifier | User ID | Date+Time of Request | Request "Verb" and URL of Request | HTTP Status Code | Size of Response |
| -------------|:-----------|:--|:------------:|:--------:|:--|--|
| 84.91.136.21 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) | 007 | [05/Aug/2017:16:50:51 -0000] | "GET github.com/dwyl/phase-two HTTP/1.0" | 200 | 42247 |

On further reflection, we think the "Common Log Format" is _inefficient_
as it contains a lot of _duplicate_ and some _useless_ data.

We can do better.

### Alternative Log Format ("ALF")

From the CLF we can remove:

+ **IP Address**, **User Identifier** and **User ID** can be condensed into a single hash (_see below_).
+ "**GET**" - the word is implied by the service we are running (_we only accept GET requests_)
+ **Response size** is _irrelevant_ and will be the same for most requests.

| Timestamp | URL | User Agent | IP Address | Language | Hit Count |
| ------------- |:------------|:------------|:------------:|:--------:|:--:|
| 1436570536950 | github.com/dwyl/the-book | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) | 84.91.136.21 | EN-GB | 42 |


In the log entry (_example_) described above the first 3 bits of data will
identify the "user" requesting the page/resource, so rather than duplicating the data in an inefficient string, we can _hash_ it!

Any repeating user-identifying data should be concatenated into a single `String` before it is hashed.

Log entries are stored as a (_"pipe" delimited_) `String`
which can be parsed and re-formatted into any other format:

```sh
1436570536950 x7uapo9 84.91.136.21
1436570536950|github.com/dwyl/phase-two|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US|42
```
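
Writing and reading such a pipe-delimited line could look like the sketch below. The helper names `formatHit` and `parseHit` and the `FIELDS` array are hypothetical; the field order simply follows the six-column example line shown above.

```js
// Field order follows the example line above (an assumption, not taken from the source code).
var FIELDS = ['timestamp', 'url', 'agent', 'ip', 'language', 'count'];

// join the fields of a hit into one pipe-delimited line
function formatHit (hit) {
  return FIELDS.map(function (key) { return String(hit[key]); }).join('|');
}

// split a line back into an object; timestamp and count become numbers
function parseHit (line) {
  var parts = line.split('|');
  var hit = {};
  FIELDS.forEach(function (key, i) { hit[key] = parts[i]; });
  hit.timestamp = Number(hit.timestamp);
  hit.count = Number(hit.count);
  return hit;
}

var line = '1436570536950|github.com/dwyl/phase-two|Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US|42';
var parsed = parseHit(line);
console.log(parsed.url, parsed.count);
console.log(formatHit(parsed) === line); // the line survives a round trip
```

One caveat with this sketch: a field that itself contains a `|` would break the split, so a real implementation would need to escape or reject that character.
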
| Timestamp | User Agent | IP Address |
| ------------- |:------------|:------------:|
| 1436570536950 | x7uapo9 | 84.91.136.21 |

We then have a user-agent hash where we can lookup the by id:
### Reducing Storage (_Costs_)

If a person views _multiple_ pages, _three_ pieces of data are duplicated:
User Agent, IP Address and Language.
Rather than storing this data multiple times, we _hash_ the data
and store the hash as a lookup.

#### Hash Long Repeating (Identical) Data

If we run the following `Browser|IP|Language` `String`:
```sh
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|84.91.136.21|EN-US'
```
through a **SHA** hash function we get: `8HKg3NB5Cf` (_always_)<sup>1</sup>.

_Sample_ code:
```js
{
"x7uapo9":"Mozilla/5.0 (iPad; U; CPU OS 3_2_1 like Mac OS X; en-us) AppleWebKit/531.21.10",
"N03v1lz":"Googlebot/2.1 (+http://www.google.com/bot.html)"
}
var hash = require('./lib/hash.js');
var user_agent_string = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|88.88.88.88|EN-US';
var agent_hash = hash(user_agent_string, 10); // 8HKg3NB5Cf
```

<sup>1</sup>Note: SHA hash is _always_ 40 characters,
but we _truncate_ it because 10 alphanumeric characters (_selected from a set of 26 letters + 10 digits_)
means there are 36<sup>10</sup> = [3,656,158,440,062,976](http://www.wolframalpha.com/input/?i=36%5E10)
(_three and a half [**Quadrillion**](http://www.wolframalpha.com/input/?i=3,656,158,440,062,976+in+english)_)
possible strings which we consider "_enough_" entropy.
(_if you disagree, tell us why in an
[issue](https://github.com/dwyl/hits/issues)_!)
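
To make the "hash then truncate" idea concrete, here is a minimal sketch using Node's built-in `crypto` module. It is _not_ the project's `lib/hash.js`: the function name `shortHash`, the choice of SHA-512 and the base64 encoding are assumptions, so the output is not expected to match `8HKg3NB5Cf`.

```js
var crypto = require('crypto');

// Hypothetical stand-in for lib/hash.js: a deterministic, truncated digest.
function shortHash (str, length) {
  return crypto.createHash('sha512')
    .update(str)
    .digest('base64')
    .replace(/[^A-Za-z0-9]/g, '') // drop "+", "/" and "=" so only letters and digits remain
    .slice(0, length || 10);      // keep the first `length` characters (default 10)
}

var agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)|84.91.136.21|EN-US';
var first = shortHash(agent, 10);
var second = shortHash(agent, 10);
console.log(first, first === second); // the same input always gives the same hash
```

The property that matters for the lookup is determinism: the same `Browser|IP|Language` string always maps to the same short key.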

#### Hit Data With Hash

```
1436570536950|github.com/dwyl/the-book|8HKg3NB5Cf|42
```
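
The saving comes from storing the long `Browser|IP|Language` string once and referring to it by its hash in every hit. A rough sketch with in-memory stand-ins for the datastore follows; `recordHit`, `agents`, `hits` and `toyHash` are all hypothetical names, and `toyHash` is for demonstration only.

```js
// In-memory stand-ins for the datastore (assumption: any key/value store would do).
var agents = {}; // hash -> "Browser|IP|Language"
var hits = [];   // compact hit lines

// `hash` is passed in so the sketch does not depend on a particular hash function
function recordHit (hit, hash) {
  var agentString = [hit.agent, hit.ip, hit.language].join('|');
  var key = hash(agentString);
  agents[key] = agentString; // stored once, however many hits share it
  var line = [hit.timestamp, hit.url, key, hit.count].join('|');
  hits.push(line);
  return line;
}

// toy hash for demonstration only; NOT suitable for real use
function toyHash (str) {
  var n = 0;
  for (var i = 0; i < str.length; i++) { n = (n * 31 + str.charCodeAt(i)) >>> 0; }
  return n.toString(36);
}

// the same visitor views two different pages
var visitor = { agent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5)', ip: '84.91.136.21', language: 'EN-US' };
recordHit(Object.assign({ timestamp: 1436570536950, url: 'github.com/dwyl/the-book', count: 42 }, visitor), toyHash);
recordHit(Object.assign({ timestamp: 1436570536951, url: 'github.com/dwyl/phase-two', count: 7 }, visitor), toyHash);
console.log(Object.keys(agents).length, hits.length);
```

Two hits are recorded, but the visitor's details are stored only once.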


## How?

Place a badge (*image*) in your repo `README.md` so others
can see how popular the page is and you can track it.

### Fetch SVG from shields.io and serve it just-in-time

Given that shields.io has a badge creation service,
and it has acceptable latency, we are proxying their service.

## Run it!
## _Run_ it Your_self_!

Download (clone) the code to your local machine:

```sh
git clone https://github.com/dwyl/hits.git && cd hits
```
> Note: you will need to have Redis running on your localhost,
> if you are new to Redis see: https://github.com/dwyl/learn-redis

> Note: you will need to have Node.js running on your localhost.

Install dependencies:
```sh
@@ -85,6 +175,20 @@ npm run dev
Visit: http://localhost:8000/any/url/count.svg


# Data Storage

Recording the "hit" data is _essential_
for this app to _work_ and be _useful_.

We have built it to work with _two_ "data stores":
Filesystem and Redis <!-- and PostgreSQL. --> <br />
> _**Note**: you only need **one** storage option to be available_.
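
Choosing between the two stores at start-up could be sketched as below. This is an illustration under stated assumptions, not the project's code: `pickStore` and `saveToFile` are hypothetical names, and treating a set `REDISCLOUD_URL` environment variable as "Redis is available" is an assumption.

```js
var fs = require('fs');
var os = require('os');
var path = require('path');

// Assumption: a Redis URL in the environment means Redis is available.
function pickStore (env) {
  return env.REDISCLOUD_URL ? 'redis' : 'filesystem';
}

// filesystem fallback: append one hit per line to a log file
function saveToFile (file, line) {
  fs.appendFileSync(file, line + '\n');
  return fs.readFileSync(file, 'utf8').trim().split('\n').length; // hit count so far
}

var store = pickStore({}); // no Redis configured
var file = path.join(os.tmpdir(), 'hits-sketch-' + process.pid + '-' + Date.now() + '.log');
var count1 = saveToFile(file, '1436570536950|github.com/dwyl/the-book|8HKg3NB5Cf');
var count2 = saveToFile(file, '1436570536951|github.com/dwyl/the-book|8HKg3NB5Cf');
console.log(store, count1, count2);
fs.unlinkSync(file); // clean up the temporary file
```

Re-reading the whole file to count lines is fine for a sketch but would be slow for large logs.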

## Filesystem




## Research

### User Agents
@@ -108,3 +212,12 @@ http://www.monitorware.com/en/logsamples/apache.php
### Node.js http module headers

https://nodejs.org/api/http.html#http_message_rawheaders

## Running the Test Suite locally

The test suite includes tests for 3 databases
therefore running the tests on your `localhost`
requires all 3 to be running.

Deploying and _using_ the app only requires _one_
of the databases to be available.
53 changes: 48 additions & 5 deletions lib/client.js
@@ -1,9 +1,52 @@
// connect to websocket server
$( document ).ready(function() {
console.log('Ready!', window.location.host);
var root = document.getElementById("hits");
console.log('Ready!', window.location.host);

setTimeout(function(){
var socket = io(window.location.host);
socket.on('news', function (data) {
console.log(data);
socket.emit('my other event', { my: 'data' });
socket.emit('hello', { msg: 'Hi!' });
});
});

socket.on('hit', function (data) {
var previous = root.childNodes[0];
root.insertBefore(div(Date.now(), data.hit), previous);
});

// borrowed from: https://git.io/v536m
function div(divid, text) {
var div = document.createElement('div');
div.id = divid;
div.className = divid;
if(text !== undefined) { // if text is passed in render it in a "Text Node"
var txt = document.createTextNode(text);
div.appendChild(txt);
}
return div;
}
document.getElementById("how").classList.remove('dn'); // show form if JS available (progressive enhancement)
document.getElementById("nojs").classList.add('dn'); // hide the "no JS" instructions when JS is available
display_badge_markdown(); // render initial markdown template
}, 500);

// Markdown Template
var mt = '[![HitCount](http://hits.dwyl.io/{user}/{repo}.svg)](http://hits.dwyl.io/{user}/{repo})';

function generate_markdown () {
var user = document.getElementById("username").value || '{username}';
var repo = document.getElementById("repo").value || '{project}';
// console.log('user: ', user, 'repo: ', repo);
return mt.replace(/{user}/g, user).replace(/{repo}/g, repo);
}

function display_badge_markdown() {
var md = generate_markdown()
var pre = document.getElementById("badge").innerHTML = md;
}

var get = document.getElementsByTagName('input');
for (var i = 0; i < get.length; i++) {
get[i].addEventListener('keyup', display_badge_markdown, false);
get[i].addEventListener('keyup', display_badge_markdown, false);

}
21 changes: 0 additions & 21 deletions lib/climate.svg

This file was deleted.
