public class Jsoup
extends java.lang.Object
Modifier | Constructor and Description |
---|---|
private | Jsoup() |

Modifier and Type | Method and Description |
---|---|
static java.lang.String | clean(java.lang.String bodyHtml, Safelist safelist) - Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a safe-list of permitted tags and attributes. |
static java.lang.String | clean(java.lang.String bodyHtml, java.lang.String baseUri, Safelist safelist) - Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through an allow-list of safe tags and attributes. |
static java.lang.String | clean(java.lang.String bodyHtml, java.lang.String baseUri, Safelist safelist, Document.OutputSettings outputSettings) - Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a safe-list of permitted tags and attributes. |
static Connection | connect(java.lang.String url) - Creates a new Connection (session), with the defined request URL. |
static boolean | isValid(java.lang.String bodyHtml, Safelist safelist) - Test if the input body HTML has only tags and attributes allowed by the Safelist. |
static Connection | newSession() - Creates a new Connection to use as a session. |
static Document | parse(java.io.File file) - Parse the contents of a file as HTML. |
static Document | parse(java.io.File file, java.lang.String charsetName) - Parse the contents of a file as HTML. |
static Document | parse(java.io.File file, java.lang.String charsetName, java.lang.String baseUri) - Parse the contents of a file as HTML. |
static Document | parse(java.io.File file, java.lang.String charsetName, java.lang.String baseUri, Parser parser) - Parse the contents of a file as HTML. |
static Document | parse(java.io.InputStream in, java.lang.String charsetName, java.lang.String baseUri) - Read an input stream, and parse it to a Document. |
static Document | parse(java.io.InputStream in, java.lang.String charsetName, java.lang.String baseUri, Parser parser) - Read an input stream, and parse it to a Document. |
static Document | parse(java.lang.String html) - Parse HTML into a Document. |
static Document | parse(java.lang.String html, Parser parser) - Parse HTML into a Document, using the provided Parser. |
static Document | parse(java.lang.String html, java.lang.String baseUri) - Parse HTML into a Document. |
static Document | parse(java.lang.String html, java.lang.String baseUri, Parser parser) - Parse HTML into a Document, using the provided Parser. |
static Document | parse(java.net.URL url, int timeoutMillis) - Fetch a URL, and parse it as HTML. |
static Document | parseBodyFragment(java.lang.String bodyHtml) - Parse a fragment of HTML, with the assumption that it forms the body of the HTML. |
static Document | parseBodyFragment(java.lang.String bodyHtml, java.lang.String baseUri) - Parse a fragment of HTML, with the assumption that it forms the body of the HTML. |
public static Document parse(java.lang.String html, java.lang.String baseUri)
Parse HTML into a Document.
Parameters:
html - HTML to parse
baseUri - The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs when they occur before the HTML declares a <base href> tag.

public static Document parse(java.lang.String html, java.lang.String baseUri, Parser parser)
Parse HTML into a Document, using the provided Parser.
Parameters:
html - HTML to parse
baseUri - The URL where the HTML was retrieved from. Used to resolve relative URLs to absolute URLs when they occur before the HTML declares a <base href> tag.
parser - alternate parser to use.

public static Document parse(java.lang.String html, Parser parser)
Parse HTML into a Document, using the provided Parser. As no base URI is specified, absolute URL resolution, if required, relies on the HTML including a <base href> tag.
Parameters:
html - HTML to parse
parser - alternate parser to use.

public static Document parse(java.lang.String html)
Parse HTML into a Document. As no base URI is specified, absolute URL resolution, if required, relies on the HTML including a <base href> tag.
Parameters:
html - HTML to parse
See also: parse(String, String)
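The effect of the baseUri parameter is easiest to see on a relative link. The following is a minimal sketch, assuming jsoup is on the classpath; the class name ParseWithBaseUri, the helper firstLinkAbsUrl, and the example.com URLs are illustrative and not part of the jsoup API.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParseWithBaseUri {
    // Hypothetical helper: parses the HTML against the given base URI and
    // returns the absolute URL of the first link, or "" if none can be built.
    static String firstLinkAbsUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.selectFirst("a[href]");
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs/intro.html\">Intro</a></p>";
        // With a base URI, the relative href can be resolved.
        System.out.println(firstLinkAbsUrl(html, "https://example.com/guide/"));
        // With an empty base URI and no <base href> tag, absUrl has nothing to resolve against.
        System.out.println(firstLinkAbsUrl(html, ""));
    }
}
```

When the base URI is empty and the HTML carries no <base href> tag, Element.absUrl returns an empty string, so callers that need absolute links should supply the URL the HTML was fetched from.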
public static Connection connect(java.lang.String url)
Creates a new Connection (session), with the defined request URL. Use to fetch and parse an HTML page.
Use examples:
Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();
Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();
Parameters:
url - URL to connect to. The protocol must be http or https.
See also: newSession(), Connection.newRequest()
public static Connection newSession()
Creates a new Connection to use as a session. Connection settings (user-agent, timeouts, URL, etc.), and cookies will be maintained for the session. Use examples:
Connection session = Jsoup.newSession()
.timeout(20 * 1000)
.userAgent("FooBar 2000");
Document doc1 = session.newRequest()
.url("https://jsoup.org/").data("ref", "example")
.get();
Document doc2 = session.newRequest()
.url("https://en.wikipedia.org/wiki/Main_Page")
.get();
Connection con3 = session.newRequest();
For multi-threaded requests, it is safe to use this session between threads, but take care to call Connection.newRequest()
per request and not share that instance between threads when executing or parsing.
public static Document parse(java.io.File file, @Nullable java.lang.String charsetName, java.lang.String baseUri) throws java.io.IOException
Parse the contents of a file as HTML.
Parameters:
file - file to load HTML from. Supports gzipped files (ending in .z or .gz).
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
Throws:
java.io.IOException - if the file could not be found, or read, or if the charsetName is invalid.

public static Document parse(java.io.File file, @Nullable java.lang.String charsetName) throws java.io.IOException
Parse the contents of a file as HTML.
Parameters:
file - file to load HTML from. Supports gzipped files (ending in .z or .gz).
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
Throws:
java.io.IOException - if the file could not be found, or read, or if the charsetName is invalid.
See also: parse(file, charset, baseUri)

public static Document parse(java.io.File file) throws java.io.IOException
Parse the contents of a file as HTML. The charset used to read the file is determined by the byte-order mark (BOM) or a <meta charset> tag; if neither is present, UTF-8 is used.
This is the equivalent of calling parse(file, null)
Parameters:
file - the file to load HTML from. Supports gzipped files (ending in .z or .gz).
Throws:
java.io.IOException - if the file could not be found or read.
See also: parse(file, charset, baseUri)
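As an illustration of the file overloads above, here is a minimal sketch that writes a small page to a temporary file and parses it with a null charsetName so that jsoup detects the encoding itself. It assumes jsoup is on the classpath; the class name ParseFileExample, the helper titleOf, and the sample page are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFileExample {
    // Hypothetical helper: parses the file with charset detection
    // (charsetName == null) and returns the document title.
    static String titleOf(File file) throws IOException {
        String charsetName = null; // let jsoup detect the charset, falling back to UTF-8
        Document doc = Jsoup.parse(file, charsetName);
        return doc.title();
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("jsoup-example", ".html");
        try {
            String html = "<html><head><meta charset=\"UTF-8\"><title>Saved page</title></head>"
                    + "<body><p>Hello</p></body></html>";
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
            System.out.println(titleOf(tmp.toFile()));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Passing an explicit charset name instead of null skips detection, which is worth doing when the encoding is known from elsewhere (for example, from stored HTTP headers).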
public static Document parse(java.io.File file, @Nullable java.lang.String charsetName, java.lang.String baseUri, Parser parser) throws java.io.IOException
Parse the contents of a file as HTML.
Parameters:
file - file to load HTML from. Supports gzipped files (ending in .z or .gz).
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
parser - alternate parser to use.
Throws:
java.io.IOException - if the file could not be found, or read, or if the charsetName is invalid.

public static Document parse(@WillClose java.io.InputStream in, @Nullable java.lang.String charsetName, java.lang.String baseUri) throws java.io.IOException
Read an input stream, and parse it to a Document.
Parameters:
in - input stream to read. The stream will be closed after reading.
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
Throws:
java.io.IOException - if the file could not be found, or read, or if the charsetName is invalid.

public static Document parse(java.io.InputStream in, @Nullable java.lang.String charsetName, java.lang.String baseUri, Parser parser) throws java.io.IOException
Read an input stream, and parse it to a Document.
Parameters:
in - input stream to read. Make sure to close it after parsing.
charsetName - (optional) character set of file contents. Set to null to determine from http-equiv meta tag, if present, or fall back to UTF-8 (which is often safe to do).
baseUri - The URL where the HTML was retrieved from, to resolve relative links against.
parser - alternate parser to use.
Throws:
java.io.IOException - if the file could not be found, or read, or if the charsetName is invalid.

public static Document parseBodyFragment(java.lang.String bodyHtml, java.lang.String baseUri)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
Parameters:
bodyHtml - body HTML fragment
baseUri - URL to resolve relative URLs against.
See also: Document.body()

public static Document parseBodyFragment(java.lang.String bodyHtml)
Parse a fragment of HTML, with the assumption that it forms the body of the HTML.
Parameters:
bodyHtml - body HTML fragment
See also: Document.body()
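To show what treating input as a body fragment means in practice, the sketch below parses an unbalanced snippet and reads content back through Document.body(). It assumes jsoup is on the classpath; the class name BodyFragmentExample and the helper fragmentText are illustrative only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyFragmentExample {
    // Hypothetical helper: parses the fragment into a document shell and
    // returns the text of the elements in the body that match the CSS query.
    static String fragmentText(String bodyHtml, String cssQuery) {
        Document doc = Jsoup.parseBodyFragment(bodyHtml);
        return doc.body().select(cssQuery).text();
    }

    public static void main(String[] args) {
        // The fragment has unclosed <p> and <b> tags; the parser balances them.
        System.out.println(fragmentText("<p>Hello <b>world", "b"));
    }
}
```

The returned Document still has html, head, and body elements; the fragment's nodes are placed inside body, which is why Document.body() is the natural entry point.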
public static Document parse(java.net.URL url, int timeoutMillis) throws java.io.IOException
Fetch a URL, and parse it as HTML. In most cases, use connect(String) instead.
The encoding character set is determined by the content-type header or http-equiv meta tag, or falls back to UTF-8.
Parameters:
url - URL to fetch (with a GET). The protocol must be http or https.
timeoutMillis - Connection and read timeout, in milliseconds. If exceeded, IOException is thrown.
Throws:
java.net.MalformedURLException - if the request URL is not an HTTP or HTTPS URL, or is otherwise malformed
HttpStatusException - if the response is not OK and HTTP response errors are not ignored
UnsupportedMimeTypeException - if the response mime type is not supported and those errors are not ignored
java.net.SocketTimeoutException - if the connection times out
java.io.IOException - if a connection or read error occurs
See also: connect(String)

public static java.lang.String clean(java.lang.String bodyHtml, java.lang.String baseUri, Safelist safelist)
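Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through an allow-list of safe tags and attributes.
Parameters:
bodyHtml - input untrusted HTML (body fragment)
baseUri - URL to resolve relative URLs against
safelist - list of permitted HTML elements
See also: Cleaner.clean(Document)
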
public static java.lang.String clean(java.lang.String bodyHtml, Safelist safelist)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a safe-list of permitted tags and attributes.
Note that as this method does not take a base href URL to resolve attributes with relative URLs against, those URLs will be removed, unless the input HTML contains a <base href> tag. If you wish to preserve those, use the clean(String html, String baseHref, Safelist) method instead, and enable Safelist.preserveRelativeLinks(boolean).
Note that the output of this method is still HTML even when using the TextNode-only Safelist.none(), and so any HTML entities in the output will be appropriately escaped. If you want plain text, not HTML, you should use a text method such as Element.text() instead, after cleaning the document.
Example:
String sourceBodyHtml = "<p>5 is < 6.</p>";
String html = Jsoup.clean(sourceBodyHtml, Safelist.none());
Cleaner cleaner = new Cleaner(Safelist.none());
String text = cleaner.clean(Jsoup.parse(sourceBodyHtml)).text();
// html is: 5 is &lt; 6.
// text is: 5 is < 6.
Parameters:
bodyHtml - input untrusted HTML (body fragment)
safelist - list of permitted HTML elements
See also: Cleaner.clean(Document)
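The handling of relative URLs described above can be compared side by side. This is a minimal sketch, assuming jsoup is on the classpath and using Safelist.basic() as the safelist; the class name CleanRelativeLinks, its three helper methods, and the example.com base URI are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanRelativeLinks {
    // No base URI: a relative href cannot be resolved, so the attribute is removed.
    static String cleanNoBase(String html) {
        return Jsoup.clean(html, Safelist.basic());
    }

    // Base URI supplied: the relative href is resolved and emitted as an absolute URL.
    static String cleanAbsolute(String html, String baseUri) {
        return Jsoup.clean(html, baseUri, Safelist.basic());
    }

    // Base URI supplied and preserveRelativeLinks enabled: the href is validated
    // against the base URI but written out in its original relative form.
    static String cleanPreserving(String html, String baseUri) {
        return Jsoup.clean(html, baseUri, Safelist.basic().preserveRelativeLinks(true));
    }

    public static void main(String[] args) {
        String html = "<a href=\"/about\">About</a>";
        String base = "https://example.com/";
        System.out.println(cleanNoBase(html));
        System.out.println(cleanAbsolute(html, base));
        System.out.println(cleanPreserving(html, base));
    }
}
```

Safelist.basic() also enforces attributes of its own on links, so the printed output contains more than the href; the point of the comparison is only which form of the href survives.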
public static java.lang.String clean(java.lang.String bodyHtml, java.lang.String baseUri, Safelist safelist, Document.OutputSettings outputSettings)
Get safe HTML from untrusted input HTML, by parsing input HTML and filtering it through a safe-list of permitted tags and attributes.
The HTML is treated as a body fragment; it's expected the cleaned HTML will be used within the body of an existing document. If you want to clean full documents, use Cleaner.clean(Document) instead, and add structural tags (html, head, body, etc.) to the safelist.
Parameters:
bodyHtml - input untrusted HTML (body fragment)
baseUri - URL to resolve relative URLs against
safelist - list of permitted HTML elements
outputSettings - document output settings; use to control pretty-printing and entity escape modes
See also: Cleaner.clean(Document)

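One common use of the outputSettings parameter is to switch off pretty-printing so the cleaned fragment comes back on a single line. The sketch below assumes jsoup is on the classpath; the class name CleanWithOutputSettings, the helper cleanCompact, the choice of Safelist.basic(), and the example.com base URI are illustrative assumptions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Safelist;

public class CleanWithOutputSettings {
    // Hypothetical helper: cleans the fragment with the basic safelist and
    // disables pretty-printing, so no indentation or line breaks are added.
    static String cleanCompact(String bodyHtml) {
        Document.OutputSettings settings = new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(bodyHtml, "https://example.com/", Safelist.basic(), settings);
    }

    public static void main(String[] args) {
        String untrusted = "<p>Hi <script>alert(1)</script><b>there</b></p>";
        // The script element is not in the basic safelist and is dropped.
        System.out.println(cleanCompact(untrusted));
    }
}
```

Document.OutputSettings also exposes the escape mode and charset, which is where entity escaping would be adjusted if the cleaned HTML has to be embedded in a document with a different encoding.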
public static boolean isValid(java.lang.String bodyHtml, Safelist safelist)
Test if the input body HTML has only tags and attributes allowed by the Safelist.
This method is intended to be used in a user interface as a validator for user input. Note that regardless of the output of this method, the input document must always be normalized using a method such as clean(String, String, Safelist), and the result of that method used to store or serialize the document before later reuse such as presentation to end users. This ensures that enforced attributes are set correctly, and that any differences between how a given browser and how jsoup parses the input HTML are normalized.
Example:
Safelist safelist = Safelist.relaxed();
boolean isValid = Jsoup.isValid(sourceBodyHtml, safelist);
String normalizedHtml = Jsoup.clean(sourceBodyHtml, "https://example.com/", safelist);
Assumes the HTML is a body fragment (i.e. will be used in an existing HTML document body).
Parameters:
bodyHtml - HTML to test
safelist - safelist to test against
See also: clean(String, Safelist)