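GB18030 (GB2312-compatible) encoding validation and correction
A while ago, while writing an app that fetches HTML encoded in GB2312, I found that some of the HTML files contained bytes that are not valid in GB18030 (a superset that includes the GB2312 character set).
The following code failed: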
//self.gb18030Encoding = CFStringConvertEncodingToNSStringEncoding (kCFStringEncodingGB_18030_2000);
// The original page is encoded in GB2312
NSString *result = [[NSString alloc] initWithData:data encoding:self.gb18030Encoding];
Following the steps for diagnosing encoding problems in Apple's documentation, I wrote data to a file and then called
[NSString stringWithContentsOfURL:baseURL encoding:self.gb18030Encoding error:&encodeError];
which reported this error:
Error: Operation could not be completed. (Cocoa error 261.)
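This is the same error another poster on Stack Overflow ran into, so I wrote the method below, based on the GB18030 byte ranges, to replace bytes that are not valid in the encoding: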
/*!
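GB18030 (GB2312-compatible) encoding validation and correction
Byte structure:
Single byte: value from 0x00 to 0x7F.
Two bytes: first byte from 0x81 to 0xFE, second byte from 0x40 to 0xFE (excluding 0x7F).
Four bytes: first byte from 0x81 to 0xFE, second byte from 0x30 to 0x39, third byte from 0x81 to 0xFE, fourth byte from 0x30 to 0x39.
Because GB2312 web pages use a double-byte encoding, only the single-byte and two-byte forms of GB18030 are validated and corrected here.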
https://zh.wikipedia.org/wiki/GB_18030
https://zh.wikipedia.org/wiki/GB_2312
*/
- (NSData *)replaceInvalidGB18030Charset:(NSData *)data {
    const unsigned char replacement = 'A'; // byte written over each invalid byte
    NSMutableData *md = [NSMutableData dataWithData:data];
    unsigned char *bytes = (unsigned char *)[md mutableBytes];
    NSUInteger length = [md length];
    NSUInteger loc = 0; // cursor
    while (loc < length) {
        unsigned char first = bytes[loc];
        if (first <= 0x7F) {
            // Single-byte ASCII
            loc++;
            continue;
        }
        // 0x81-0xFE starts a two-byte or four-byte sequence; the next byte decides which.
        // 0x80 and 0xFF are never valid lead bytes.
        if (first >= 0x81 && first <= 0xFE && loc + 1 < length) {
            unsigned char second = bytes[loc + 1];
            if (second >= 0x40 && second <= 0xFE && second != 0x7F) {
                // Valid two-byte sequence
                loc += 2;
                continue;
            }
            // Four-byte sequences (second byte 0x30-0x39) are not implemented,
            // so their lead byte is replaced like any other invalid byte.
        }
        //NSLog(@"find fail match at loc=%lu", (unsigned long)loc);
        // Invalid lead byte, invalid second byte, or a lead byte cut off at the end of the data
        bytes[loc] = replacement;
        loc++;
    }
    return md;
}
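The method above deliberately skips the four-byte form. As a rough sketch of how the byte ranges listed in the comment could cover all three forms, here is a plain C version (C compiles as-is inside an Objective-C file). The names `gb18030_seq_len` and `gb18030_sanitize` are hypothetical, and the check only looks at byte ranges; it does not verify that a well-formed sequence maps to an assigned character.

```c
#include <stddef.h>

/* Hypothetical helper: returns the length (1, 2 or 4) of the well-formed
 * GB18030 sequence starting at p, judged by byte ranges only, or 0 if the
 * bytes at p do not start a well-formed sequence. */
size_t gb18030_seq_len(const unsigned char *p, size_t remaining)
{
    if (remaining == 0) return 0;
    if (p[0] <= 0x7F) return 1;                 /* single byte */
    if (p[0] < 0x81 || p[0] > 0xFE) return 0;   /* 0x80 and 0xFF never lead */
    if (remaining < 2) return 0;                /* lead byte cut off */
    if (p[1] >= 0x40 && p[1] <= 0xFE && p[1] != 0x7F) return 2;
    if (p[1] >= 0x30 && p[1] <= 0x39 && remaining >= 4 &&
        p[2] >= 0x81 && p[2] <= 0xFE &&
        p[3] >= 0x30 && p[3] <= 0x39) return 4;
    return 0;
}

/* Hypothetical helper: overwrites every byte that does not start a
 * well-formed sequence with `replacement`, in place. Returns the number of
 * bytes replaced. */
size_t gb18030_sanitize(unsigned char *buf, size_t len, unsigned char replacement)
{
    size_t replaced = 0;
    size_t i = 0;
    while (i < len) {
        size_t n = gb18030_seq_len(buf + i, len - i);
        if (n == 0) {
            buf[i] = replacement;   /* replace one byte, then re-check the next */
            replaced++;
            i++;
        } else {
            i += n;
        }
    }
    return replaced;
}
```

From Objective-C this could be called on `[md mutableBytes]` with `[md length]`, which keeps the replacement policy of the method above (one `'A'` per bad byte) while accepting four-byte sequences.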
The approach is adapted from a UTF-8 version of the same idea.
Version 1:
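// Replace bytes that are not valid UTF-8.
// Note: if the second byte of a three-byte UTF-8 sequence is wrong, the first byte is replaced first (it is treated as a corrupted three-byte lead byte), and the remaining two bytes are then checked on their own.
// Four-byte sequences are not recognized: their bytes fall into the final else branch and are replaced.
- (NSData *)replaceNoUtf8:(NSData *)data
{
    char aa[] = {'A','A','A','A','A','A'}; // UTF-8 was originally specified with up to 6 bytes per character; only the first 'A' is used here
    NSMutableData *md = [NSMutableData dataWithData:data];
    NSUInteger length = [md length];
    NSUInteger loc = 0;
    while (loc < length)
    {
        unsigned char buffer;
        [md getBytes:&buffer range:NSMakeRange(loc, 1)];
        if ((buffer & 0x80) == 0)
        {
            // Single-byte ASCII
            loc++;
            continue;
        }
        else if ((buffer & 0xE0) == 0xC0)
        {
            // Two-byte lead: the next byte must exist and be a continuation byte (10xxxxxx)
            loc++;
            buffer = 0; // a missing byte at the end of the data counts as invalid
            if (loc < length) [md getBytes:&buffer range:NSMakeRange(loc, 1)];
            if ((buffer & 0xC0) == 0x80)
            {
                loc++;
                continue;
            }
            loc--;
            // Invalid: replace this one byte with 'A'
            [md replaceBytesInRange:NSMakeRange(loc, 1) withBytes:aa length:1];
            loc++;
            continue;
        }
        else if ((buffer & 0xF0) == 0xE0)
        {
            // Three-byte lead: the next two bytes must both be continuation bytes
            loc++;
            buffer = 0;
            if (loc < length) [md getBytes:&buffer range:NSMakeRange(loc, 1)];
            if ((buffer & 0xC0) == 0x80)
            {
                loc++;
                buffer = 0;
                if (loc < length) [md getBytes:&buffer range:NSMakeRange(loc, 1)];
                if ((buffer & 0xC0) == 0x80)
                {
                    loc++;
                    continue;
                }
                loc--;
            }
            loc--;
            // Invalid: replace the lead byte with 'A'
            [md replaceBytesInRange:NSMakeRange(loc, 1) withBytes:aa length:1];
            loc++;
            continue;
        }
        else
        {
            // Invalid: replace this one byte with 'A'
            [md replaceBytesInRange:NSMakeRange(loc, 1) withBytes:aa length:1];
            loc++;
            continue;
        }
    }
    return md;
}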
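The bit-mask tests used in this UTF-8 version can be condensed into one small classifier. The C sketch below is only an illustration of those masks, extended with the four-byte lead pattern (`11110xxx`); the name `utf8_seq_len` is hypothetical, and the function checks bit patterns only, so it does not reject overlong encodings or surrogate code points.

```c
#include <stddef.h>

/* Hypothetical helper: returns the number of bytes (1 to 4) in the UTF-8
 * sequence starting at p, judged by bit patterns only, or 0 if the lead
 * byte is invalid, a continuation byte is wrong, or the data is cut off. */
size_t utf8_seq_len(const unsigned char *p, size_t remaining)
{
    size_t need;
    if (remaining == 0) return 0;
    if ((p[0] & 0x80) == 0x00) return 1;        /* 0xxxxxxx: ASCII */
    else if ((p[0] & 0xE0) == 0xC0) need = 2;   /* 110xxxxx */
    else if ((p[0] & 0xF0) == 0xE0) need = 3;   /* 1110xxxx */
    else if ((p[0] & 0xF8) == 0xF0) need = 4;   /* 11110xxx */
    else return 0;                              /* stray continuation or invalid lead */

    if (remaining < need) return 0;             /* sequence cut off */
    for (size_t i = 1; i < need; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;    /* must be 10xxxxxx */
    }
    return need;
}
```

A caller would replace one byte and advance by one whenever the result is 0, and otherwise advance by the returned length, which matches the replacement policy of the method above.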
Later, while writing some PHP, I remembered a function called iconv (it is also available on iOS) that can convert with the //IGNORE and //TRANSLIT suffixes. Its documentation says:
ENCODINGS
The values permitted for --from-code and --to-code can be listed by the iconv --list command, and all combinations of the listed values are supported. Furthermore the following two suffixes are supported:
//TRANSLIT
When the string "//TRANSLIT" is appended to --to-code, transliteration is activated. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking
characters.
//IGNORE
When the string "//IGNORE" is appended to --to-code, characters that cannot be represented in the target character set will be silently discarded.
I'll leave this as a placeholder for now and try it when I have time.
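As a starting point for that experiment, here is an untested-on-iOS sketch of the iconv C API converting GB18030 to UTF-8. The quoted documentation describes //IGNORE in terms of characters that cannot be represented in the target set, so whether it also skips malformed input bytes is something I would not assume; the sketch therefore opens a plain conversion and steps over rejected input bytes by hand. The function name `gb18030_to_utf8_lossy`, the `'?'` placeholder, and the assumption that the platform's iconv knows the name "GB18030" are all mine.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts GB18030 bytes to UTF-8, writing '?' for each
 * input byte that iconv rejects. Returns the number of bytes written to out,
 * or (size_t)-1 if the encoding is unsupported or out is too small. */
size_t gb18030_to_utf8_lossy(const char *in, size_t in_len,
                             char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-8", "GB18030");   /* (to, from) */
    if (cd == (iconv_t)-1) return (size_t)-1;

    char *src = (char *)in;      /* iconv takes char **, input is not modified */
    size_t src_left = in_len;
    char *dst = out;
    size_t dst_left = out_cap;
    int failed = 0;

    while (src_left > 0) {
        if (iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1)
            continue;            /* everything converted; src_left is now 0 */

        /* EILSEQ: invalid sequence at src. EINVAL: sequence cut off at the end. */
        if ((errno == EILSEQ || errno == EINVAL) && dst_left > 0) {
            *dst++ = '?';
            dst_left--;
            src++;               /* step over one offending byte */
            src_left--;
            iconv(cd, NULL, NULL, NULL, NULL);   /* reset conversion state */
        } else {
            failed = 1;          /* E2BIG (output full) or an unexpected error */
            break;
        }
    }

    iconv_close(cd);
    return failed ? (size_t)-1 : out_cap - dst_left;
}
```

On macOS and iOS this would need to link against libiconv. Skipping one byte at a time mirrors the hand-written methods above, at the cost of possibly emitting several `'?'` for one damaged multi-byte character.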