Package org.w3c.tidy
Class EncodingUtils
- java.lang.Object
-
- org.w3c.tidy.EncodingUtils
-
public final class EncodingUtils extends java.lang.Object
- Version:
- $Revision: 622 $ ($Author: fgiust $)
-
-
Nested Class Summary
Nested Classes
Modifier and Type  Class  Description
(package private) static interface
EncodingUtils.GetBytes
Getter callback: called to retrieve one or more additional UTF-8 bytes.
(package private) static interface
EncodingUtils.PutBytes
Putter callback: called to store one or more additional UTF-8 bytes.
-
Field Summary
Fields
Modifier and Type  Field  Description
static int
FSM_ASCII
States for ISO 2022. A document in an ISO-2022 based encoding uses ESC sequences called "designators" to switch character sets.
static int
FSM_ESC
State ESC.
static int
FSM_ESCD
State ESCD.
static int
FSM_ESCDP
State ESCDP.
static int
FSM_ESCP
State ESCP.
static int
FSM_NONASCII
State NONASCII.
static int
HIGH_UTF16_SURROGATE
UTF-16 high surrogate.
static int
LOW_UTF16_SURROGATE
UTF-16 low surrogate.
private static int[]
MAC2UNICODE
John Love-Jensen contributed this table for mapping the MacRoman character set to Unicode.
static int
MAX_UTF16_FROM_UCS4
Max UTF-16 value.
static int
MAX_UTF8_FROM_UCS4
Max UTF-8 valid char value.
private static int
NUM_UTF8_SEQUENCES
Number of valid UTF-8 sequences.
private static int[]
OFFSET_UTF8_SEQUENCES
Offset for UTF-8 sequences.
private static int[]
SYMBOL2UNICODE
Table to map symbol font characters to Unicode; undefined characters are mapped to 0x0000 and characters without any Unicode equivalent are mapped to '?'.
static int
UNICODE_BOM
The default (big-endian) Unicode BOM.
static int
UNICODE_BOM_BE
The big-endian (default) Unicode BOM.
static int
UNICODE_BOM_LE
The little-endian Unicode BOM.
static int
UNICODE_BOM_UTF8
The UTF-8 Unicode BOM.
static int
UTF16_HIGH_SURROGATE_BEGIN
UTF-16 surrogate pair areas: high surrogates begin.
static int
UTF16_HIGH_SURROGATE_END
UTF-16 surrogate pair areas: high surrogates end.
static int
UTF16_LOW_SURROGATE_BEGIN
UTF-16 surrogate pair areas: low surrogates begin.
static int
UTF16_LOW_SURROGATE_END
UTF-16 surrogate pair areas: low surrogates end.
static int
UTF16_SURROGATES_BEGIN
UTF-16 surrogates begin.
private static int
UTF8_BYTE_SWAP_NOT_A_CHAR
UTF-8 byte swap: invalid char.
private static int
UTF8_NOT_A_CHAR
UTF-8 invalid char.
private static ValidUTF8Sequence[]
VALID_UTF8
Array of valid UTF-8 sequences.
private static int[]
WIN2UNICODE
Mapping for Windows Western character set (128-159) to Unicode.
-
Constructor Summary
Constructors
Modifier  Constructor  Description
private
EncodingUtils()
Not meant to be instantiated.
-
Method Summary
Modifier and Type  Method  Description
protected static int
decodeMacRoman(int c)
Function to convert from MacRoman to Unicode.
(package private) static int
decodeSymbolFont(int c)
Function to convert from Symbol Font chars to Unicode.
(package private) static boolean
decodeUTF8BytesToChar(int[] c, int firstByte, byte[] successorBytes, EncodingUtils.GetBytes getter, int[] count, int startInSuccessorBytesArray)
Decodes an array of bytes to a char.
protected static int
decodeWin1252(int c)
Function for conversion from Windows-1252 to Unicode.
(package private) static boolean
encodeCharToUTF8Bytes(int c, byte[] encodebuf, EncodingUtils.PutBytes putter, int[] count)
Encodes a char to an array of bytes.
-
-
-
Field Detail
-
UNICODE_BOM_BE
public static final int UNICODE_BOM_BE
The big-endian (default) Unicode BOM.
- See Also:
- Constant Field Values
-
UNICODE_BOM
public static final int UNICODE_BOM
The default (big-endian) Unicode BOM.
- See Also:
- Constant Field Values
-
UNICODE_BOM_LE
public static final int UNICODE_BOM_LE
The little-endian Unicode BOM.
- See Also:
- Constant Field Values
-
UNICODE_BOM_UTF8
public static final int UNICODE_BOM_UTF8
The UTF-8 Unicode BOM.
- See Also:
- Constant Field Values
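
The BOM constants describe the three byte order marks a reader may find at the start of a stream. As a rough illustration, the sketch below recognises them from the leading bytes. `BomSniffer`, its constants, and `detect` are hypothetical names; the values are the standard Unicode BOM byte patterns and are assumed here, since the actual values of the `EncodingUtils` fields are listed only under Constant Field Values.

```java
// Hypothetical helper, not part of org.w3c.tidy: identifies a byte order mark
// from the first bytes of a stream using the standard Unicode BOM patterns.
final class BomSniffer {
    // U+FEFF read as one big-endian 16-bit unit (bytes FE FF). Assumed value.
    static final int BOM_BE = 0xFEFF;
    // The same mark with its two bytes in the opposite order (bytes FF FE). Assumed value.
    static final int BOM_LE = 0xFFFE;
    // U+FEFF encoded as UTF-8 (bytes EF BB BF), packed into one int. Assumed value.
    static final int BOM_UTF8 = 0xEFBBBF;

    /** Returns "UTF-8", "UTF-16BE", "UTF-16LE", or null if no BOM is present. */
    static String detect(byte[] head) {
        if (head.length >= 3) {
            int three = ((head[0] & 0xFF) << 16) | ((head[1] & 0xFF) << 8) | (head[2] & 0xFF);
            if (three == BOM_UTF8) {
                return "UTF-8";
            }
        }
        if (head.length >= 2) {
            int two = ((head[0] & 0xFF) << 8) | (head[1] & 0xFF);
            if (two == BOM_BE) {
                return "UTF-16BE";
            }
            if (two == BOM_LE) {
                return "UTF-16LE";
            }
        }
        return null;
    }
}
```

The UTF-8 pattern is checked first because it is the longest; the two-byte checks cannot match its first two bytes (EF BB), so the order only matters for readability.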
-
FSM_ASCII
public static final int FSM_ASCII
States for ISO 2022. A document in an ISO-2022 based encoding uses ESC sequences called "designators" to switch character sets. The designators defined and used in ISO-2022-JP are:
- "ESC" + "(" + ? for ISO646 variants
- "ESC" + "$" + ? and "ESC" + "$" + "(" + ? for multibyte character sets
State ASCII.
- See Also:
- Constant Field Values
-
FSM_ESC
public static final int FSM_ESC
State ESC.
- See Also:
- Constant Field Values
-
FSM_ESCD
public static final int FSM_ESCD
State ESCD.
- See Also:
- Constant Field Values
-
FSM_ESCDP
public static final int FSM_ESCDP
State ESCDP.
- See Also:
- Constant Field Values
-
FSM_ESCP
public static final int FSM_ESCP
State ESCP.
- See Also:
- Constant Field Values
-
FSM_NONASCII
public static final int FSM_NONASCII
State NONASCII.
- See Also:
- Constant Field Values
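
To make the six FSM_* states more concrete, here is a minimal sketch of a state machine that follows the ISO-2022-JP designator forms described under FSM_ASCII. The transitions are an assumption inferred from those forms, not taken from the JTidy source, and `Iso2022Sketch`, `State`, `next`, and `run` are hypothetical names.

```java
// Hypothetical sketch, not JTidy's implementation: tracks ISO-2022-JP
// designator escape sequences with six states named after the FSM_* fields.
final class Iso2022Sketch {
    enum State { ASCII, ESC, ESCD, ESCDP, ESCP, NONASCII }

    /** Returns the state reached after consuming byte c in state s. */
    static State next(State s, int c) {
        switch (s) {
            case ASCII:
            case NONASCII:
                // Only ESC (0x1B) starts a designator; other bytes keep the state.
                return c == 0x1B ? State.ESC : s;
            case ESC:
                if (c == '$') {
                    return State.ESCD;      // ESC $ ...
                }
                if (c == '(') {
                    return State.ESCP;      // ESC ( ...
                }
                return State.ASCII;         // not a designator this sketch tracks
            case ESCD:
                // ESC $ ( ? needs one more byte; ESC $ ? is complete at this byte.
                return c == '(' ? State.ESCDP : State.NONASCII;
            case ESCDP:
                return State.NONASCII;      // final byte of ESC $ ( ?
            case ESCP:
                return State.ASCII;         // final byte of ESC ( ?
            default:
                throw new IllegalStateException("unknown state: " + s);
        }
    }

    /** Feeds all bytes through the machine, starting in ASCII. */
    static State run(int... bytes) {
        State s = State.ASCII;
        for (int b : bytes) {
            s = next(s, b);
        }
        return s;
    }
}
```

For example, feeding ESC, '$', 'B' leaves this machine in NONASCII (a multibyte set was designated), while ESC, '(', 'B' returns it to ASCII.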
-
MAX_UTF8_FROM_UCS4
public static final int MAX_UTF8_FROM_UCS4
Max UTF-8 valid char value.
- See Also:
- Constant Field Values
-
MAX_UTF16_FROM_UCS4
public static final int MAX_UTF16_FROM_UCS4
Max UTF-16 value.
- See Also:
- Constant Field Values
-
LOW_UTF16_SURROGATE
public static final int LOW_UTF16_SURROGATE
UTF-16 low surrogate.
- See Also:
- Constant Field Values
-
UTF16_SURROGATES_BEGIN
public static final int UTF16_SURROGATES_BEGIN
UTF-16 surrogates begin.
- See Also:
- Constant Field Values
-
UTF16_LOW_SURROGATE_BEGIN
public static final int UTF16_LOW_SURROGATE_BEGIN
UTF-16 surrogate pair areas: low surrogates begin.
- See Also:
- Constant Field Values
-
UTF16_LOW_SURROGATE_END
public static final int UTF16_LOW_SURROGATE_END
UTF-16 surrogate pair areas: low surrogates end.
- See Also:
- Constant Field Values
-
UTF16_HIGH_SURROGATE_BEGIN
public static final int UTF16_HIGH_SURROGATE_BEGIN
UTF-16 surrogate pair areas: high surrogates begin.
- See Also:
- Constant Field Values
-
UTF16_HIGH_SURROGATE_END
public static final int UTF16_HIGH_SURROGATE_END
UTF-16 surrogate pair areas: high surrogates end.
- See Also:
- Constant Field Values
-
HIGH_UTF16_SURROGATE
public static final int HIGH_UTF16_SURROGATE
UTF-16 high surrogate.
- See Also:
- Constant Field Values
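
The surrogate constants mark the boundaries of the UTF-16 surrogate areas. The sketch below shows the arithmetic such boundaries are typically used for: combining a pair into one code point and splitting it again. It uses the ranges from the Unicode Standard under its own hypothetical names (`SurrogateSketch`, `LEAD_*`, `TRAIL_*`); how the `EncodingUtils` constants map onto these ranges is not shown on this page, so no correspondence is assumed.

```java
// Hypothetical helper using Unicode-standard surrogate ranges; the values and
// naming of the EncodingUtils constants are not shown here and may differ.
final class SurrogateSketch {
    static final int LEAD_BEGIN = 0xD800;           // first unit of a pair
    static final int LEAD_END = 0xDBFF;
    static final int TRAIL_BEGIN = 0xDC00;          // second unit of a pair
    static final int TRAIL_END = 0xDFFF;
    static final int SUPPLEMENTARY_BEGIN = 0x10000; // first code point needing a pair
    static final int MAX_CODE_POINT = 0x10FFFF;     // last code point UTF-16 can express

    static boolean isLead(int unit) {
        return unit >= LEAD_BEGIN && unit <= LEAD_END;
    }

    static boolean isTrail(int unit) {
        return unit >= TRAIL_BEGIN && unit <= TRAIL_END;
    }

    /** Combines a surrogate pair into a single code point. */
    static int combine(int lead, int trail) {
        if (!isLead(lead) || !isTrail(trail)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return SUPPLEMENTARY_BEGIN + ((lead - LEAD_BEGIN) << 10) + (trail - TRAIL_BEGIN);
    }

    /** Splits a supplementary code point into { lead, trail }. */
    static int[] split(int codePoint) {
        if (codePoint < SUPPLEMENTARY_BEGIN || codePoint > MAX_CODE_POINT) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - SUPPLEMENTARY_BEGIN;
        return new int[] { LEAD_BEGIN + (v >> 10), TRAIL_BEGIN + (v & 0x3FF) };
    }
}
```

As a check, `combine(0xD83D, 0xDE00)` yields 0x1F600, and `split(0x1F600)` returns the same two units.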
-
UTF8_BYTE_SWAP_NOT_A_CHAR
private static final int UTF8_BYTE_SWAP_NOT_A_CHAR
UTF-8 byte swap: invalid char.
- See Also:
- Constant Field Values
-
UTF8_NOT_A_CHAR
private static final int UTF8_NOT_A_CHAR
UTF-8 invalid char.
- See Also:
- Constant Field Values
-
WIN2UNICODE
private static final int[] WIN2UNICODE
Mapping for Windows Western character set (128-159) to Unicode.
-
MAC2UNICODE
private static final int[] MAC2UNICODE
John Love-Jensen contributed this table for mapping MacRoman character set to Unicode.
-
SYMBOL2UNICODE
private static final int[] SYMBOL2UNICODE
Table to map symbol font characters to Unicode; undefined characters are mapped to 0x0000 and characters without any Unicode equivalent are mapped to '?'. Is this appropriate?
-
VALID_UTF8
private static final ValidUTF8Sequence[] VALID_UTF8
Array of valid UTF8 sequences.
-
NUM_UTF8_SEQUENCES
private static final int NUM_UTF8_SEQUENCES
Number of valid UTF-8 sequences.
-
OFFSET_UTF8_SEQUENCES
private static final int[] OFFSET_UTF8_SEQUENCES
Offset for utf8 sequences.
-
-
Method Detail
-
decodeWin1252
protected static int decodeWin1252(int c)
Function for conversion from Windows-1252 to Unicode.
- Parameters:
c - char to decode
- Returns:
decoded char
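
As an illustration of what a Windows-1252 to Unicode conversion involves, the sketch below remaps the 128-159 range, which is where Windows-1252 differs from ISO-8859-1, and passes every other value through. `Win1252Sketch` and `decode` are hypothetical names, the table follows the published Windows-1252 mapping, and representing the five unassigned bytes as 0 is an assumption of this sketch rather than documented behaviour of `decodeWin1252`.

```java
// Hypothetical sketch, not the JTidy implementation: maps the Windows-1252
// bytes 128..159 to Unicode and leaves all other values unchanged.
final class Win1252Sketch {
    // Code points for bytes 0x80..0x9F. 0x0000 marks the five bytes that
    // Windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D): an
    // assumption made for this sketch.
    private static final int[] HIGH = {
        0x20AC, 0x0000, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
        0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x0000, 0x017D, 0x0000,
        0x0000, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
        0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x0000, 0x017E, 0x0178
    };

    /** Returns the Unicode code point for the Windows-1252 value c. */
    static int decode(int c) {
        if (c >= 128 && c <= 159) {
            return HIGH[c - 128];
        }
        // Outside 128..159 Windows-1252 coincides with ISO-8859-1 and Unicode.
        return c;
    }
}
```

With this table, `decode(0x80)` gives 0x20AC (the euro sign) and `decode(0xE9)` is returned unchanged.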
-
decodeMacRoman
protected static int decodeMacRoman(int c)
Function to convert from MacRoman to Unicode.
- Parameters:
c - char to decode
- Returns:
decoded char
-
decodeSymbolFont
static int decodeSymbolFont(int c)
Function to convert from Symbol Font chars to Unicode.
- Parameters:
c - char to decode
- Returns:
decoded char
-
decodeUTF8BytesToChar
static boolean decodeUTF8BytesToChar(int[] c, int firstByte, byte[] successorBytes, EncodingUtils.GetBytes getter, int[] count, int startInSuccessorBytesArray)
Decodes an array of bytes to a char.
- Parameters:
c - will contain the decoded char
firstByte - first input byte
successorBytes - array containing successor bytes (can be null if a getter is provided)
getter - callback used to get new bytes if successorBytes doesn't contain enough bytes
count - will contain the number of bytes read
startInSuccessorBytesArray - starting offset for bytes in successorBytes
- Returns:
true if error
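
The decoding step can be sketched as follows. This is a simplified, hypothetical stand-in, not the JTidy code: it reads the whole sequence from one array instead of taking `firstByte`, `successorBytes`, and a `getter` separately, but it keeps the same conventions of out-parameters in single-element arrays and `true` meaning error. The rejection of overlong forms, surrogates, and values above 0x10FFFF is an assumption based on the Unicode definition of well-formed UTF-8.

```java
// Hypothetical sketch of decoding one UTF-8 sequence; simplified signature.
final class Utf8DecodeSketch {
    /**
     * Decodes the sequence starting at bytes[start]. On success out[0] holds
     * the code point and count[0] the number of bytes read.
     * Returns true on error, false on success.
     */
    static boolean decode(byte[] bytes, int start, int[] out, int[] count) {
        count[0] = 0;
        if (start < 0 || start >= bytes.length) {
            return true;
        }
        int first = bytes[start] & 0xFF;
        int len;  // total bytes in the sequence
        int cp;   // payload bits collected so far
        int min;  // smallest code point allowed for this length
        if (first < 0x80) {
            len = 1; cp = first; min = 0;
        } else if ((first & 0xE0) == 0xC0) {
            len = 2; cp = first & 0x1F; min = 0x80;
        } else if ((first & 0xF0) == 0xE0) {
            len = 3; cp = first & 0x0F; min = 0x800;
        } else if ((first & 0xF8) == 0xF0) {
            len = 4; cp = first & 0x07; min = 0x10000;
        } else {
            return true; // stray continuation byte or invalid lead byte
        }
        if (start + len > bytes.length) {
            return true; // sequence is truncated
        }
        for (int i = 1; i < len; i++) {
            int b = bytes[start + i] & 0xFF;
            if ((b & 0xC0) != 0x80) {
                return true; // successor byte is not of the form 10xxxxxx
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        boolean surrogate = cp >= 0xD800 && cp <= 0xDFFF;
        if (cp < min || cp > 0x10FFFF || surrogate) {
            return true; // overlong form, out of range, or surrogate
        }
        out[0] = cp;
        count[0] = len;
        return false;
    }
}
```

Returning the status as a boolean and the results through arrays mirrors the shape of the documented signature, which lets a caller learn both the decoded value and how many bytes were consumed from one call.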
-
encodeCharToUTF8Bytes
static boolean encodeCharToUTF8Bytes(int c, byte[] encodebuf, EncodingUtils.PutBytes putter, int[] count)
Encodes a char to an array of bytes.
- Parameters:
c - char to encode
encodebuf - will contain the encoded bytes
putter - if not null it will be called to write bytes to out
count - number of bytes written
- Returns:
false = ok, true = error
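
For comparison, a minimal sketch of the encoding direction is shown below. It is a hypothetical stand-in rather than the JTidy implementation: it omits the `putter` callback and always writes into the buffer, while keeping the documented convention that `false` means success and `true` means error. Treating surrogates and values above 0x10FFFF as errors is an assumption of this sketch.

```java
// Hypothetical sketch of encoding one code point as UTF-8; no putter callback.
final class Utf8EncodeSketch {
    /**
     * Writes the UTF-8 form of code point c into buf (at least 4 bytes long)
     * and stores the number of bytes written in count[0].
     * Returns false on success, true on error.
     */
    static boolean encode(int c, byte[] buf, int[] count) {
        count[0] = 0;
        boolean surrogate = c >= 0xD800 && c <= 0xDFFF;
        if (c < 0 || c > 0x10FFFF || surrogate) {
            return true; // not a Unicode scalar value
        }
        if (c < 0x80) {
            buf[0] = (byte) c;
            count[0] = 1;
        } else if (c < 0x800) {
            buf[0] = (byte) (0xC0 | (c >> 6));
            buf[1] = (byte) (0x80 | (c & 0x3F));
            count[0] = 2;
        } else if (c < 0x10000) {
            buf[0] = (byte) (0xE0 | (c >> 12));
            buf[1] = (byte) (0x80 | ((c >> 6) & 0x3F));
            buf[2] = (byte) (0x80 | (c & 0x3F));
            count[0] = 3;
        } else {
            buf[0] = (byte) (0xF0 | (c >> 18));
            buf[1] = (byte) (0x80 | ((c >> 12) & 0x3F));
            buf[2] = (byte) (0x80 | ((c >> 6) & 0x3F));
            buf[3] = (byte) (0x80 | (c & 0x3F));
            count[0] = 4;
        }
        return false;
    }
}
```

A caller would allocate a 4-byte buffer once, call `encode`, and then copy `count[0]` bytes to its output.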
-
-