Organize Guinea digitization project #37
Comments
I don't think there is any good, dependable OCR software that can do this task, and a dedicated team just for this is not feasible. Crowdsourcing seems like the best way if we are expecting more PDF documents like these. In the long run, one could also decouple the data collection and curation (tasks which can be crowdsourced to non-math, non-coders) from the modeling and analysis work. To help the crowd, we could migrate to a less geeky, Google Docs kind of alternative. Just giving it a try: I collated the table pages from the Guinea dataset PDF and created a Google spreadsheet with the pivot column/row information (from 26th Aug onwards, the format is the same). It is accessible at bit.ly/ebola_guinea. I have also added the 16th Sept and 1st Oct .csv information. A few moderators could proofread and 'freeze' cells which are confirmed (revision histories help too). We've seen it on Wiki. We've seen it on reddit. Can we expect the Internet to do its magic here again?
@cmrivers Have you looked into Tabula (http://tabula.nerdpower.org/)? If your PDFs aren't scanned images, it may be able to help you parse the data into tables faster. I'm familiarizing myself with your process -- I am a journalist in NY and am hoping to build some visualizations and an open API for this data. Do you have a list of your data sources? I may build a scraper to fetch new data every 15 minutes to power the API.
Yes, I use Tabula for most of my digitization efforts. Some of the Guinea data
Cool, thanks. I was clicking through those -- just wanted to make sure that list was exhaustive. Ugh, government data makes me nuts lol -- no consistency. Thanks again!
Agree completely. I should add that efforts to build an API are underway.
Make sure you first look at this file, which has tons of detailed Guinea sub-national records: https://data.hdx.rwlabs.org/dataset/rowca-ebola-cases#
Figure out a way to make clear which Guinea situation reports have already been converted from PDF, and which still need to be done. I want to keep the PDFs in the repo even after they are digitized, since they are not easily available online like the SL and Liberia sitreps.
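One possible way to track this, as a rough sketch only: if each digitized report were saved as a CSV with the same base name as its PDF, a short script could list which PDFs are done and which are still pending. The naming convention and the directory layout here are assumptions for illustration, not how the repo is currently organized.

```python
from pathlib import Path


def digitization_status(pdf_dir, csv_dir):
    """Split PDF base names into (done, todo).

    Assumes (hypothetically) that a digitized report is stored as a CSV
    whose base name matches the PDF it was transcribed from.
    """
    pdf_stems = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    csv_stems = {p.stem for p in Path(csv_dir).glob("*.csv")}
    done = sorted(pdf_stems & csv_stems)   # PDFs that have a matching CSV
    todo = sorted(pdf_stems - csv_stems)   # PDFs still waiting to be digitized
    return done, todo
```

Calling `digitization_status` on the PDF folder and the CSV folder returns two sorted lists, which could be pasted into the README or into this issue as a checklist. The PDFs stay in the repo either way; the script only reads file names.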