Convert HTML file with table(s) to DataFrame #71

Open
scls19fr opened this issue Jun 20, 2019 · 12 comments

Comments

@scls19fr
Contributor

Hello,

I have an HTML file with a table and would like to convert it to a Julia DataFrame.

I was looking for a function similar to pandas' read_html in Python (which directly returns a list of DataFrames).

Unfortunately, I don't see a similar function in the Julia ecosystem.

In the Gumbo docs I was looking for an example that iterates over the rows and columns of each table.

Here is a basic HTML source file with two tables:

<!DOCTYPE >
<HTML>
  <head></head>
  <body>

    <h1>First table</h1>
    <table>
      <tbody>
        <tr>
          <th>
            A
          </th>
          <th>
            B
          </th>
        </tr>
        <tr>
          <td>
            1
          </td>
          <td>
            1.1
          </td>
        </tr>
        <tr>
          <td>
            2
          </td>
          <td>
            2.1
          </td>
        </tr>
      </tbody>
    </table>

    <h1>Second table</h1>
    <table>
      <tbody>
        <tr>
          <th>
            AA
          </th>
          <th>
            BB
          </th>
        </tr>
        <tr>
          <td>
            10
          </td>
          <td>
            10.1
          </td>
        </tr>
        <tr>
          <td>
            20
          </td>
          <td>
            20.1
          </td>
        </tr>
      </tbody>
    </table>

  </body>
</HTML>

I'm not sure whether such an example should be part of Gumbo, Cascadia, or even EzXML.jl.

In any case, none of these projects shows an example with HTML tables, so there is probably room for doc improvement.

Kind regards

PS : related SO post https://stackoverflow.com/questions/42915962/extracting-and-constructing-tables-from-html-files-using-julia

@scls19fr scls19fr changed the title from "Convert HTML file with table to DataFrame" to "Convert HTML file with table(s) to DataFrame" on Jun 20, 2019
@scls19fr
Contributor Author

scls19fr commented Jun 20, 2019

I wrote this code, which may help those who are looking for a similar feature. It is only a (very) quick implementation, though, and probably won't work with more complex HTML pages containing tables.
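
For illustration only (this is not the code referred to in the comment), a quick approach of this kind could look like the sketch below. It assumes Gumbo.jl, Cascadia.jl and DataFrames.jl are installed, that the first row of each table holds the column names, and that every row has the same number of cells. The helper names `cell_texts`, `table_to_dataframe` and `read_html_tables` are made up for this example.

```julia
using Gumbo, Cascadia, DataFrames

# Stripped text of every cell (<th> or <td>) found under one <tr>.
cell_texts(tr) = String[strip(nodeText(c)) for c in eachmatch(sel"th, td", tr)]

# Convert one <table> element into a DataFrame of strings.
# Assumption: the first row is the header and the table is not empty.
function table_to_dataframe(table)
    rows = Vector{String}[cell_texts(tr) for tr in eachmatch(sel"tr", table)]
    header, body = rows[1], rows[2:end]
    # Build one column vector per header entry.
    columns = [String[row[j] for row in body] for j in eachindex(header)]
    return DataFrame(columns, Symbol.(header))
end

# Rough analogue of pandas' read_html: one DataFrame per <table> in the document.
function read_html_tables(html::AbstractString)
    doc = parsehtml(html)
    return [table_to_dataframe(t) for t in eachmatch(sel"table", doc.root)]
end
```

All values stay as strings; parsing them into numbers is left to the caller. Nested tables, `colspan`/`rowspan`, and tables without a header row are not handled.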

@porterjamesj
Collaborator

Hi Sébastien,

Thanks for opening the issue, I agree this would be a good thing to have. I'd rather not have a dependency on DataFrames in this package, since it's a large dependency that's not necessary for Gumbo's core functionality.

My impression is that the best way to do this would be to implement the Tables.jl interface for HTMLElement{:table}, and then we'll be able to construct DataFrames from HTML tables in a very direct, natural way.

I'm not sure when I'll have time to do this, but I don't think it would be very difficult; if someone else wants to take a crack at it I'd happily accept a pull request. I'm happy to add a dependency on Tables.jl, since it's pretty small.

@scls19fr
Contributor Author

I really like the idea of implementing the Tables.jl interface for HTMLElement{:table}.
Pinging @quinnj @davidanthoff

@quinnj
Member

quinnj commented Jun 20, 2019

Yeah, it sounds like a great idea. Happy to help support however I can here. Currently, Tables.jl doesn't have a concept of streaming multiple tables at a time, but as long as there's a way to "select" a single table tag and "stream" that, then it should work pretty well. Happy to chat on Slack if anyone wants to brainstorm this.

@porterjamesj
Collaborator

@quinnj yeah, I think we're on the same page. I'm imagining that it's up to the user to locate a single <table> element in their HTML and pass that into the DataFrame constructor (or whatever else that uses the tables interface).

I'm actually pretty excited about this idea, since this is a feature request that's come up before, and I love the smooth interoperability across the whole ecosystem that packages like Tables can provide! I'll try to find time to work on it soon, and I'll ask on Slack if I get stuck on anything Tables-related.
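
As a rough sketch of the "locate a single <table>" step, assuming Cascadia.jl is used for selection (the thread does not prescribe a selection mechanism), picking one table out of a parsed document might look like this. The HTML string is a made-up stand-in.

```julia
using Gumbo, Cascadia

# Tiny stand-in document with two tables (contents are illustrative).
html = """
<html><body>
  <table><tr><th>A</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>AA</th></tr><tr><td>10</td></tr></table>
</body></html>
"""

doc = parsehtml(html)

# Every <table> element matched under the document root.
tables = eachmatch(sel"table", doc.root)

# The caller decides which one to convert, e.g. the second.
second_table = tables[2]
```

The selected element is an `HTMLElement{:table}`, which is what a Tables.jl-aware wrapper or constructor would then receive.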

@quinnj
Member

quinnj commented Jun 20, 2019

Cool, yeah, just let me know if you run into any issues. Just to get the ball rolling, some things to think about include:

  • I'm not sure it makes sense to define the Tables.jl interface on HTMLElement{:table} directly, perhaps you'd want a dedicated HTMLTable type that could wrap the element node
  • Feel free to overload Tables.table(x::HTMLElement{:table}) for this, or just use your own constructor
  • The initial setup is pretty simple, including:
        Tables.istable(::Type{<:HTMLTable}) = true
        Tables.rowaccess(::Type{<:HTMLTable}) = true
        Tables.rows(table::HTMLTable) = table
  • The trickier part will be implementing Tables.schema(x::HTMLTable), since it doesn't seem like you'll necessarily have the notion of a "schema" in an HTML table; for starters, you could just do Tables.schema(x::HTMLTable) = nothing, which introduces a little performance hit for sinks, but in the case of HTML tables, I don't think it should be significant
  • Apart from that, the other meat is to define proper iteration on HTMLTable; it's probably simplest to just iterate NamedTuples. Again, I'm not sure if there will be issues with HTML tables that don't have column names (you might have to auto-generate them if not), but implementing the iteration protocol should be pretty straightforward

Anyway, hopefully that gets the ball rolling and again, just let me know if you run into any issues.
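
Putting the points above together, a minimal sketch of such a wrapper might look like the following. It is not an existing Gumbo API: the `HTMLTable` type and the `cell_texts` helper are hypothetical, Cascadia.jl is assumed for selecting rows and cells, the first row is assumed to hold unique column names, and every row is assumed to have as many cells as the header.

```julia
using Gumbo, Cascadia, Tables

# Hypothetical wrapper around the extracted contents of one <table> element.
struct HTMLTable
    names::Vector{Symbol}
    rows::Vector{Vector{String}}
end

# Stripped text of every <th>/<td> cell found under one <tr>.
cell_texts(tr) = String[strip(nodeText(c)) for c in eachmatch(sel"th, td", tr)]

# Build the wrapper from a Gumbo table element; the first row becomes the header.
function HTMLTable(table::HTMLElement{:table})
    rows = Vector{String}[cell_texts(tr) for tr in eachmatch(sel"tr", table)]
    isempty(rows) && return HTMLTable(Symbol[], Vector{String}[])
    return HTMLTable(Symbol.(rows[1]), rows[2:end])
end

# Tables.jl setup as outlined in the list above.
Tables.istable(::Type{<:HTMLTable}) = true
Tables.rowaccess(::Type{<:HTMLTable}) = true
Tables.rows(t::HTMLTable) = t
Tables.schema(::HTMLTable) = nothing  # schema unknown up front

# Iteration protocol: one NamedTuple of strings per data row.
Base.length(t::HTMLTable) = length(t.rows)
function Base.iterate(t::HTMLTable, i::Int = 1)
    i > length(t.rows) && return nothing
    row = NamedTuple{Tuple(t.names)}(Tuple(t.rows[i]))
    return row, i + 1
end
```

With this in place, `Tables.rowtable(HTMLTable(table))` collects the rows as NamedTuples, and a sink such as `DataFrame(HTMLTable(table))` should be able to consume the wrapper. Cell values remain strings, and missing or duplicate column names would need extra handling.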

@porterjamesj
Collaborator

Thanks! That all makes sense. I agree there are some tricky parts and some places that'll have to use heuristics and guessing (for schemas, types, etc.). I think it's fine to just "do our best" and then people can clean things up themselves if they end up with messy data. The only thing I'm curious about is what the utility of the wrapper type (HTMLTable) is vs. defining the Tables interface directly on HTMLElement{:table}?

@quinnj
Member

quinnj commented Jun 20, 2019

The main decision there is whether you're comfortable defining iterate(x::HTMLElement{:table}) to iterate NamedTuples. To me, that seems maybe a little weird, hence the suggestion to use an explicit wrapper type that provides a proper object to iterate NamedTuples. But then again, I'm not familiar w/ the details of the package very well, so feel free to make the call.

@porterjamesj
Collaborator

Ahh, that makes sense. I didn't realize the Tables interface required overriding the Base iterate function. Given that, I agree a wrapper type makes sense; we probably want iterate to iterate child elements for all HTMLElements.

@Nosferican

Any update on this? It would be great to get a Table from HTMLElement{:table}.

@aminya

aminya commented Jun 25, 2020

If we fix #85, we can just use AcuteML which already supports Tables.jl.

https://github.com/aminya/AcuteML.jl

@Nosferican

Nothing yet?

5 participants