Use phantom.js for extraction #15

@quyin

Right now, our extraction uses CyberNeko to parse HTML and provide a DOM.

However, because CyberNeko has no JS engine, content loaded through JS cannot be extracted.

One way to overcome this is to interface with phantom.js, which provides both a DOM and a JS engine, for extraction.

This will require designing an extraction engine API in BSJava and implementing a wrapper for phantom.js.
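
As a starting point for discussion, here is a rough sketch of what such an API and wrapper might look like. Everything in it is an assumption, not existing BSJava code: the names `ExtractionEngine`, `PhantomJsEngine`, and `renderedHtml` are hypothetical, and it assumes a `phantomjs` executable plus a user-supplied script (called `render.js` here) that loads the URL passed as its first argument and prints the resulting page HTML to stdout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical engine abstraction: returns a page's HTML after the engine has processed it. */
interface ExtractionEngine {
    String renderedHtml(String url) throws IOException, InterruptedException;
}

/** Hypothetical wrapper that runs phantom.js as an external process. */
final class PhantomJsEngine implements ExtractionEngine {
    private final String executable; // assumed: path to the phantomjs binary
    private final String scriptPath; // assumed: script that prints the rendered HTML to stdout

    PhantomJsEngine(String executable, String scriptPath) {
        this.executable = executable;
        this.scriptPath = scriptPath;
    }

    /** Builds the command line: phantomjs <script> <url>. */
    List<String> command(String url) {
        return List.of(executable, scriptPath, url);
    }

    @Override
    public String renderedHtml(String url) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(url))
                // Pass stderr through so a full stderr buffer cannot block the child.
                .redirectError(ProcessBuilder.Redirect.INHERIT)
                .start();
        // Read stdout to the end before waiting, so the child never blocks on a full pipe.
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("phantomjs exited with code " + exitCode + " for " + url);
        }
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

With an interface like this, the current CyberNeko-based path could be offered as a second implementation, and the extraction code would depend only on `ExtractionEngine`. Whether the engine should return HTML text, as sketched here, or a DOM object directly is an open design question.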
