Encoding between GB2312 and UTF-8
Author
Zhou Renjian
Create@
2004-11-12 09:10
If I transform GB2312 bytes to UTF-8 bytes, and then the UTF-8 bytes back to GB2312 bytes, the resulting GB2312 bytes are not the same as the original GB2312 bytes. But if I transform GB2312 bytes to ISO-8859-1 bytes, and then the ISO-8859-1 bytes back to GB2312 bytes, the result is still the same as the original GB2312 bytes. I don't know why. I am using Linux with JDK 1.4.2. So I have used the following nasty code to transform GB2312 bytes into UTF-8 bytes. After this transformation, I can turn the UTF-8 bytes into GB2312 bytes, and those GB2312 bytes are the same as the original GB2312 bytes.
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;

/**
 * In some tests under Linux, I found that <code>new String(String.getBytes(), "utf-8")</code> did not
 * work. That is why this #nativeToUTF8 is here.
 *
 * @param str gb2312/iso-8859-1 encoded String
 * @return utf-8 encoded String
 * @throws Exception an IOException or UnsupportedEncodingException may occur.
 * Exceptions are thrown rather than caught inside so that developers use this
 * method carefully.
 */
public static String nativeToUTF8(String str) throws Exception {
    // Write the UTF-8 bytes to a uniquely named file in java.io.tmpdir.
    File f = File.createTempFile("nativeToUTF8", ".tmp");
    try {
        FileOutputStream fos = new FileOutputStream(f);
        try {
            fos.write(str.getBytes("utf-8"));
        } finally {
            fos.close();
        }
        // Read the bytes back into memory.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        FileInputStream fis = new FileInputStream(f);
        try {
            byte[] buf = new byte[1024];
            int readLength;
            while ((readLength = fis.read(buf)) != -1) {
                buffer.write(buf, 0, readLength);
            }
        } finally {
            fis.close();
        }
        return new String(buffer.toByteArray(), "utf-8");
    } finally {
        // Remove the temporary file even if reading or writing failed.
        f.delete();
    }
}
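
One likely explanation for the difference, assuming the failing conversion decoded the raw GB2312 bytes directly as UTF-8: GB2312 byte pairs are usually not valid UTF-8, so the decoder replaces them with U+FFFD and the original bytes are lost. ISO-8859-1 maps every byte value to exactly one character, so that round trip loses nothing. The sketch below illustrates this and shows an in-memory conversion that decodes with the real source charset first. The class name Gb2312RoundTrip and its helper methods are hypothetical, and the code assumes Java 7 or later (for StandardCharsets) and a runtime that ships the GB2312 charset, so it will not compile on JDK 1.4.2 as written.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Gb2312RoundTrip {
    // Assumption: the runtime provides the GB2312 charset (full JDK builds normally do).
    static final Charset GB2312 = Charset.forName("GB2312");

    /** Decodes the bytes with the given charset, then encodes the result with the same charset. */
    static byte[] roundTrip(byte[] bytes, Charset charset) {
        return new String(bytes, charset).getBytes(charset);
    }

    /** Hypothetical helper: GB2312 bytes to UTF-8 bytes, entirely in memory. */
    static byte[] gb2312ToUtf8(byte[] gbBytes) {
        return new String(gbBytes, GB2312).getBytes(StandardCharsets.UTF_8);
    }

    /** Hypothetical helper: UTF-8 bytes back to GB2312 bytes. */
    static byte[] utf8ToGb2312(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8).getBytes(GB2312);
    }

    public static void main(String[] args) {
        // Two Chinese characters (U+4E2D, U+6587) encoded as four GB2312 bytes.
        byte[] original = "\u4e2d\u6587".getBytes(GB2312);

        // Misreading the GB2312 bytes as UTF-8 replaces malformed input, so the bytes change.
        System.out.println("decoded as UTF-8:      "
                + Arrays.equals(original, roundTrip(original, StandardCharsets.UTF_8)));

        // ISO-8859-1 maps each byte to one character, so the bytes survive.
        System.out.println("decoded as ISO-8859-1: "
                + Arrays.equals(original, roundTrip(original, StandardCharsets.ISO_8859_1)));

        // Decoding with the real source charset first gives a lossless conversion.
        System.out.println("GB2312 -> UTF-8 -> GB2312: "
                + Arrays.equals(original, utf8ToGb2312(gb2312ToUtf8(original))));
    }
}
```

Running main should report false for the first comparison and true for the other two. If this explanation matches the original problem, the temporary file is not needed: the key point is to name the charset the bytes are actually in when building the String.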