Organize Guinea digitization project #37

Open
cmrivers opened this issue Sep 30, 2014 · 6 comments

Comments

@cmrivers
Owner

Figure out a way to make clear which Guinea situation reports have already been digitized from PDF and which still need to be done. I want to keep the PDFs in the repo even after they are digitized, since they are not easily available online like the SL and Liberia sitreps.
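
One possible way to track this is sketched below. It is a minimal sketch, not the repo's actual tooling: it assumes each sitrep PDF and its digitized CSV share a file stem (for example `2014-09-16.pdf` and `2014-09-16.csv`), and the function names and directory layout are hypothetical.

```python
from pathlib import Path


def digitization_status(pdf_dir, csv_dir):
    """Return (pdf_name, is_digitized) pairs, sorted by PDF file name.

    Assumption: a PDF counts as digitized when a CSV with the same
    file stem exists in csv_dir.
    """
    csv_stems = {p.stem for p in Path(csv_dir).glob("*.csv")}
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"), key=lambda p: p.name)
    return [(p.name, p.stem in csv_stems) for p in pdfs]


def as_checklist(status):
    """Render the status pairs as a Markdown task list."""
    return "\n".join(
        f"- [{'x' if done else ' '}] {name}" for name, done in status
    )
```

Calling `as_checklist(digitization_status(pdf_dir, csv_dir))` with the real directories would give a task list that could be pasted into a README or an issue, so the PDFs can stay in the repo while their status remains visible.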

@cmrivers changed the title from "Organized Guinea digitization project" to "Organize Guinea digitization project" on Sep 30, 2014

@srinivvenkat

I don't think there is any good, dependable OCR software that can do this task, and a dedicated team just for this is not feasible.

I think crowdsourcing is the best way, if we are expecting more such PDF documents. In the long run, one could also decouple the data collection and curation (tasks that can be crowdsourced to non-math, non-coders) from the modeling and analysis work. To help the crowd, we could migrate to a less geeky, Google Docs kind of alternative.

Just giving it a try: I collated the table pages alone from the Guinea dataset PDF and created a Google spreadsheet with the pivot column/row information (from 26th Aug onwards, the format is the same). It is accessible at bit.ly/ebola_guinea. I have also added the 16th Sept and 1st Oct .csv information. A few moderators could proofread and 'freeze' cells which are confirmed (revision histories help too).

We've seen it on Wiki. We've seen it on reddit. Can we expect the Internet to do its magic here again?

@tc-mccarthy

@cmrivers Have you looked into Tabula (http://tabula.nerdpower.org/)? If your PDFs aren't scanned images, it may be able to help you parse the data into tables faster. I'm familiarizing myself with your process -- I am a journalist in NY and am hoping to build some visualizations and an open API for this data. Do you have a list of your data sources? I may build a scraper to fetch new data every 15 minutes to power the API.

@cmrivers
Owner Author

Yes, I use Tabula for most of my digitization efforts. Some of the Guinea data are images, though, not data-embedded PDFs, and the tables are irregular from day to day. Data sources are linked in the top-level README.

@tc-mccarthy

Cool, thanks. I was clicking through those -- just wanted to make sure that list was exhaustive. Ugh government data makes me nuts lol -- no consistency. Thanks again!

@cmrivers
Owner Author

Agree completely. I should add that efforts to build an API are underway.
You can email me if you need more details.

@olarosling

Make sure you first look at this file, which has tons of detailed Guinea sub-national records: https://data.hdx.rwlabs.org/dataset/rowca-ebola-cases#
