Unable to parse HTML #78

Open
harshzalavadiya opened this issue Jul 27, 2020 · 1 comment

Comments

@harshzalavadiya

I was trying to extract content from multiple document formats. It works for other formats (PDF, DOC, etc.) but not for HTML files.

Below is a minimal example with a sample HTML file.

main.go

package main

import (
	"fmt"
	"log"

	"code.sajari.com/docconv"
)

func main() {
	// Convert the file and extract its plain-text content
	txt, err := docconv.ConvertPath("test.html")
	if err != nil {
		log.Fatal(err)
	}

	fmt.Println(txt.Body)
}

test.html

<!DOCTYPE html>
<html>
  <body>
    <h1>This is heading 1</h1>
    <h2>This is heading 2</h2>
    <h3>This is heading 3</h3>
    <h4>This is heading 4</h4>
    <h5>This is heading 5</h5>
    <h6>This is heading 6</h6>
  </body>
</html>

The output is currently blank, and no error is returned.

I also noticed that there has been no release since February 2019, so code.sajari.com might be serving an older version of the library. Is there any way to publish a pre-release version, or to configure CI to do that?

@stuta

stuta commented May 13, 2022

I have the same problem on Ubuntu x64 and on macOS on an ARM (M1) Mac. There are no errors, and no meta info or content is returned.
