not processed volumes #15

Open
liyakun opened this issue Jul 28, 2016 · 7 comments

Comments

@liyakun

liyakun commented Jul 28, 2016

Some volumes that were not processed:

http://ceur-ws.org/Vol-41/

http://ceur-ws.org/Vol-1549/

@S6savahd

S6savahd commented Jul 28, 2016

Good that we are listing these. Please try to add them to the dataset manually, so that we have a complete dataset as of today.

@clange

clange commented Jul 28, 2016

Vol-1549 is interesting; @S6savahd would you be able to talk to Sarven about this one? I wonder why it's not working, because I thought it was technically the same as Vol-1550 and Vol-1551. The source code has changed completely, and so has the way the layout is computed, so the developers of the information extraction tool had no chance to adapt their tool to it. Still, these new volumes look the same as the old volumes, so they should work. These volumes are important because this will soon be the new standard format. Many of them (those created with ceur-make and Rohan's soon-finished web UI frontend) will have RDFa and so won't require sophisticated information extraction, but others (those created manually) won't have RDFa. For the latter, an adaptation of one of the other information extraction tools might work better; in any case, all of these new volumes will have a very clean, uniform structure.

@liyakun
Author

liyakun commented Jul 29, 2016

@clange you are right: after filtering out the layout-related information that is not relevant, Vol-1549, Vol-1550 and Vol-1551 will not have any information left. Sorry that the list was not complete. Below is some of the data before filtering:

<http://ceur-ws.org/Vol-1549/> <http://fitlayout.github.io/ontology/segmentation.owl#country> <http://dbpedia.org/resource/Australia> ;
    <http://fitlayout.github.io/ontology/segmentation.owl#icoloc> "" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#idateplace> "Proceedings of the 1st International Workshop on Semantic Statistics co-located with 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 11th, 2013" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ienddate> "2013-10-11" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#iproceedings> "Proceedings of the 1st International Workshop on Semantic Statistics co-located with 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 11th, 2013" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#istartdate> "2013-10-11" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#isubmitted> "2016-03-15" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ititle> "Semantic Statistics 2013" .

<http://ceur-ws.org/Vol-1550/> <http://fitlayout.github.io/ontology/segmentation.owl#related> <http://ceur-ws.org/Vol-1549/> ;
    <http://fitlayout.github.io/ontology/segmentation.owl#country> <http://dbpedia.org/resource/Italy> ;
    <http://fitlayout.github.io/ontology/segmentation.owl#icoloc> "" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#idateplace> "Proceedings of the 2nd International Workshop on Semantic Statistics co-located with 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19th, 2014" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ienddate> "2014-10-19" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#iproceedings> "Proceedings of the 2nd International Workshop on Semantic Statistics co-located with 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19th, 2014" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#istartdate> "2014-10-19" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#isubmitted> "2016-04-23" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ititle> "Semantic Statistics 2014" .

<http://ceur-ws.org/Vol-1551/> <http://fitlayout.github.io/ontology/segmentation.owl#related> <http://ceur-ws.org/Vol-1549/> , <http://ceur-ws.org/Vol-1550/> ;
    <http://fitlayout.github.io/ontology/segmentation.owl#icoloc> "" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#idateplace> "Proceedings of the 3rd International Workshop on Semantic Statistics, co-located with 14th International Semantic Web Conference (ISWC 2015), Bethlehem, U.S., October 11th, 2015" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ienddate> "2015-10-11" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#iproceedings> "Proceedings of the 3rd International Workshop on Semantic Statistics, co-located with 14th International Semantic Web Conference (ISWC 2015), Bethlehem, U.S., October 11th, 2015" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#istartdate> "2015-10-11" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#isubmitted> "2016-03-15" ;
    <http://fitlayout.github.io/ontology/segmentation.owl#ititle> "Semantic Statistics 2015" .

The original tool does extract some "related volume" information for Vol-1550 and Vol-1551, as this information comes from index.html. If we remove the layout information for these three volumes as we do for other volumes, no information will be left.
I have tried to change the original tool before, without success. I will spend some more time on fixing it. Alternatively, I could write a separate script for these three volumes, which should also work for similar volumes, provided they are well structured.
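
To illustrate why nothing is left, here is a minimal sketch. It assumes the filter simply drops every predicate in the FitLayout segmentation namespace; the actual filter rules in the post-processing script may differ. The triples are a hand-copied subset of the Vol-1549 data above, and `drop_layout_predicates` is a hypothetical helper name, not part of any existing tool.

```python
SEG_NS = "http://fitlayout.github.io/ontology/segmentation.owl#"

# Subset of the unfiltered Vol-1549 output, as (subject, predicate, object) tuples.
triples = [
    ("http://ceur-ws.org/Vol-1549/", SEG_NS + "country",
     "http://dbpedia.org/resource/Australia"),
    ("http://ceur-ws.org/Vol-1549/", SEG_NS + "istartdate", "2013-10-11"),
    ("http://ceur-ws.org/Vol-1549/", SEG_NS + "ititle", "Semantic Statistics 2013"),
]


def drop_layout_predicates(triples, layout_ns=SEG_NS):
    """Keep only triples whose predicate lies outside layout_ns (assumed rule)."""
    return [t for t in triples if not t[1].startswith(layout_ns)]


remaining = drop_layout_predicates(triples)
print(len(remaining))  # prints 0: every predicate is in the segmentation namespace
```

Under that assumption, any volume whose data consists only of segmentation-namespace predicates ends up empty, which matches what happens for these three volumes.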

@liyakun
Author

liyakun commented Jul 30, 2016

Some information for Vol-1549, Vol-1550 and Vol-1551 has been added to the new dataset from the index page, along with all the information from Vol-41.

@liyakun
Author

liyakun commented Jul 31, 2016

@S6savahd @clange I wrote a tool to process these three volumes; it should also work for volumes with the same structure. The tool is ceurws.py, its output is 1549-1551.ttl, and it can be extended to process other volumes as well.
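
For readers who have not opened ceurws.py, the general idea of reading a well-structured index.html directly can be sketched roughly as follows. This is not the code of ceurws.py. It assumes the volume header is marked up with `span` elements carrying classes such as `CEURVOLTITLE` and `CEURLOCTIME`, which should be checked against the actual pages; the sample HTML is made up for illustration, and `VolumeHeaderParser` and `extract_header` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed class names mapped to output keys; verify against the real index.html.
FIELDS = {"CEURVOLTITLE": "title", "CEURLOCTIME": "loctime"}


class VolumeHeaderParser(HTMLParser):
    """Collect the text of <span> elements whose class is listed in FIELDS."""

    def __init__(self):
        super().__init__()
        self.current = None   # key of the field being read, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in FIELDS:
            self.current = FIELDS[cls]
            self.fields.setdefault(self.current, "")

    def handle_data(self, data):
        if self.current is not None:
            self.fields[self.current] += data

    def handle_endtag(self, tag):
        # Simplification: any closing </span> ends the field (no nested spans).
        if tag == "span":
            self.current = None


def extract_header(html):
    parser = VolumeHeaderParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the HTML layout.
    return {key: " ".join(value.split()) for key, value in parser.fields.items()}


# Made-up sample in the assumed structure, using values from the data above.
sample = """
<h1><span class="CEURVOLTITLE">Semantic Statistics 2013</span></h1>
<h3><span class="CEURLOCTIME">Sydney, Australia,
  October 11th, 2013</span></h3>
"""
header = extract_header(sample)
print(header["title"])
print(header["loctime"])
```

Because such a parser relies on class names rather than on rendered layout, it only works for volumes that follow the uniform structure; other volumes would still need the layout-based tool.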

@S6savahd

Great!

I haven't looked at the code, but is there a way to embed it in the main code, so that in future we do not run them separately but all at once?

@liyakun
Author

liyakun commented Jul 31, 2016

@S6savahd It is possible to embed it into the post-processing script we already have, but I need to check how to embed it into the original tool, as they are written in different languages. We can also extend this tool to process other structures in the future. Since the original tool uses one common strategy for all volumes, it will not always be able to extract the information of every volume completely.
