Team: Kevin, Lydia, Fanny, Xiaotai
Data: the SNAP (Stanford Network Analysis Project) Wikipedia revision-history dump, an ~8GB gzipped file (~290GB unzipped) containing every Wikipedia revision from the site's inception in 2001 through January 2008.
- Can we cluster articles by their revision life cycles?
- What were the most revised articles of all time?
- What attributes of article edit behavior can we glean from article categories?
- How does edit behavior reflect major events in news, culture, tech, or history?
- Generate a list of unique article_ids (`/unique-articles`; see the mrjob sketch below)
- Collect all article revisions associated with each unique article_id (`/random-subsample`; see the reservoir-sampling sketch below)
- Normalize revisions (`/normalized-revision`; see the timeline sketch below)
- Look at the creation timeline (`/creation-timeline`)
- Repeat the steps above on the entire dataset instead of the sample
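Extracting unique article_ids is a natural MapReduce job. A minimal sketch, assuming mrjob (the repo ships `mrjob.conf`) and the SNAP dump's record layout, in which each revision record starts with a `REVISION <article_id> <rev_id> <title> <timestamp> ...` line; the real `unique_article.py` may parse differently.

```python
from mrjob.job import MRJob

class MRUniqueArticleIDs(MRJob):
    """Emit each article_id in the dump exactly once."""

    def mapper(self, _, line):
        # Revision records in the SNAP dump begin with a line like:
        #   REVISION <article_id> <rev_id> <title> <timestamp> ...
        if line.startswith('REVISION'):
            parts = line.split()
            if len(parts) > 1:
                yield parts[1], None

    def reducer(self, article_id, _values):
        # The shuffle groups duplicate ids together; emit each id once.
        yield article_id, None

if __name__ == '__main__':
    MRUniqueArticleIDs.run()
```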
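The random subsample can then be drawn in one pass over that ID list with reservoir sampling. A sketch assuming the list is gzipped with one article_id per line; the function name and defaults are illustrative, not the repo's.

```python
import gzip
import random

def sample_article_ids(path, k=1000, seed=42):
    """Reservoir-sample k article_ids from a gzipped one-id-per-line file."""
    rng = random.Random(seed)
    reservoir = []
    seen = 0
    with gzip.open(path, 'rt') as f:
        for line in f:
            article_id = line.strip()
            if not article_id:
                continue  # skip blank lines (cf. test/blank_line_check.py)
            seen += 1
            if len(reservoir) < k:
                reservoir.append(article_id)
            else:
                j = rng.randrange(seen)  # uniform over [0, seen)
                if j < k:
                    reservoir[j] = article_id  # replace with probability k/seen
    return set(reservoir)
```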
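For the normalize step, one plausible shape for `create_normalized_revision_count_timeline.py` (a sketch of the idea, not the repo's actual code): anchor each article's revisions to its creation time and divide per-bucket counts by the total, so articles of different ages and popularity become comparable life-cycle curves for clustering. Timestamps are assumed to be in the dump's ISO format, e.g. `2002-02-25T15:00:22Z`.

```python
from collections import Counter
from datetime import datetime

def normalized_revision_timeline(timestamps, bucket_days=30):
    """One article's revision timestamps -> normalized activity curve.

    Buckets revisions into fixed-width windows measured from the article's
    first revision and divides by the total count, so curves from young
    and old, popular and obscure articles are directly comparable.
    """
    times = sorted(datetime.strptime(t, '%Y-%m-%dT%H:%M:%SZ') for t in timestamps)
    created = times[0]
    buckets = Counter((t - created).days // bucket_days for t in times)
    total = len(times)
    return [buckets.get(b, 0) / total for b in range(max(buckets) + 1)]
```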
See this repo.
See slides here.
Run `python3 unique_article.py`. Otherwise the base Python 2.7 install will scream at you.
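If `unique_article.py` is a standard mrjob job, it takes its input files as command-line arguments, picks up a runner config via `--conf-path`, and writes results to stdout; a typical invocation (input and output paths here are illustrative) would be:

```
$ python3 unique_article.py --conf-path mrjob.conf enwiki-20080103.good.gz > unique_all_articleids.txt
```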
To peek at the raw dump and the generated ID list:

```
$ gzip -cd enwiki-20080103.good.gz | head -n 1
$ gzip -cd unique_all_articleids_3.gz | head -n 10
```
Repository layout:

```
.
├── README.md
├── creation-timeline
│   └── creation_timelines_toCSV.py
├── data
│   ├── all_revisions_1000_articles.txt
│   ├── all_revisions_1000_articles_caratseparated.txt
│   ├── all_revisions_1000_articles_commaseparated.txt
│   └── unique_all_articleids.gz
├── mrjob.conf
├── mrjob2.conf
├── normalized-revision
│   ├── create_normalized_revision_count_timeline.py
│   ├── create_normalized_revision_lengths_timeline.py
│   └── revision_count_timeline_2.py
├── random-subsample
│   ├── get_random_subsample.py
│   ├── get_random_subsample_toCSV.py
│   └── test_random_subsample.py
├── test
│   ├── blank_line_check.py
│   └── checking_format.py
└── unique-articles
    ├── unique-article
    ├── unique_article.py
    └── unique_article_ids.py
```