Validating and Repairing GB18030 (GB2312-Compatible) Encoded Data

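A while ago I was writing an app that fetched HTML pages encoded as GB2312, and I found that some of the HTML files contained bytes that are not valid in GB18030 (a superset of the GB2312 character set).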

The following code failed:

//self.gb18030Encoding = CFStringConvertEncodingToNSStringEncoding (kCFStringEncodingGB_18030_2000);

// The original page is encoded as GB2312
NSString *result = [[NSString alloc] initWithData:data encoding:self.gb18030Encoding];


Following the steps in Apple's documentation for diagnosing encoding problems, I wrote data to a file and then read it back with

[NSString stringWithContentsOfURL:baseURL encoding:self.gb18030Encoding error:&encodeError];

which reported this error:

Error: Operation could not be completed. (Cocoa error 261.)

Someone on Stack Overflow had hit the same error, so I wrote the method below, based on the GB18030 byte ranges, to replace the invalid bytes:

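/*!
 Validating and repairing GB18030 (GB2312-compatible) encoded data

 Byte structure
 Single byte: value from 0x00 to 0x7F.
 Two bytes: first byte from 0x81 to 0xFE, second byte from 0x40 to 0xFE (excluding 0x7F).
 Four bytes: first byte from 0x81 to 0xFE, second byte from 0x30 to 0x39, third byte from 0x81 to 0xFE, fourth byte from 0x30 to 0x39.
 Web pages in GB2312 use a two-byte encoding, so I only implemented validation and repair for the single-byte and two-byte forms of GB18030.

 https://zh.wikipedia.org/wiki/GB_18030
 https://zh.wikipedia.org/wiki/GB_2312
 */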
- (NSData *)replaceInvalidGB18030Charset:(NSData *)data {
    const uint8_t replacement = 'A'; // byte written over each invalid byte
    NSMutableData *md = [NSMutableData dataWithData:data];
    uint8_t *bytes = (uint8_t *)md.mutableBytes;
    NSUInteger length = md.length;
    NSUInteger loc = 0; // cursor
    while (loc < length) {
        uint8_t first = bytes[loc];
        if (first <= 0x7F) {
            // Single-byte ASCII
            loc++;
            continue;
        }
        if (first >= 0x81 && first <= 0xFE && loc + 1 < length) {
            // Lead byte of a two-byte or four-byte sequence; the next byte decides
            uint8_t second = bytes[loc + 1];
            if (second >= 0x40 && second <= 0xFE && second != 0x7F) {
                // Valid two-byte sequence
                loc += 2;
                continue;
            }
            // Four-byte sequences (second byte 0x30-0x39) are not handled here
        }
        // Invalid: 0x80, 0xFF, a lead byte at the very end of the data,
        // or a lead byte without a valid second byte.
        // Replace this one byte and carry on from the next one.
        //NSLog(@"find fail match at loc=%lu", (unsigned long)loc);
        bytes[loc] = replacement;
        loc++;
    }

    return md;
}
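
The method above leaves the four-byte case unhandled. As a minimal sketch of how all three byte-range patterns from the comment could be checked, here is a plain C version (C compiles as-is inside an Objective-C file). The names gb18030_sequence_length and gb18030_replace_invalid are placeholders of my own, and the check only looks at byte ranges, not at whether a code point is actually assigned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the length (1, 2 or 4) of the GB18030 sequence starting at
   buf[pos], or 0 if the bytes there match none of the three patterns.
   Hypothetical helper: only byte ranges are checked. */
size_t gb18030_sequence_length(const uint8_t *buf, size_t len, size_t pos)
{
    assert(buf != NULL && pos < len);
    uint8_t b1 = buf[pos];
    if (b1 <= 0x7F) return 1;               /* single byte */
    if (b1 < 0x81 || b1 > 0xFE) return 0;   /* 0x80 and 0xFF are not lead bytes */
    if (pos + 1 >= len) return 0;           /* lead byte at the end of the data */

    uint8_t b2 = buf[pos + 1];
    if (b2 >= 0x40 && b2 <= 0xFE && b2 != 0x7F) return 2;   /* two bytes */

    if (b2 >= 0x30 && b2 <= 0x39) {         /* candidate four-byte sequence */
        if (pos + 3 >= len) return 0;       /* truncated */
        uint8_t b3 = buf[pos + 2];
        uint8_t b4 = buf[pos + 3];
        if (b3 >= 0x81 && b3 <= 0xFE && b4 >= 0x30 && b4 <= 0x39) return 4;
    }
    return 0;
}

/* Overwrites every byte that does not start a valid sequence with
   `replacement` and returns how many bytes were replaced. */
size_t gb18030_replace_invalid(uint8_t *buf, size_t len, uint8_t replacement)
{
    size_t replaced = 0;
    size_t pos = 0;
    while (pos < len) {
        size_t n = gb18030_sequence_length(buf, len, pos);
        if (n == 0) {
            buf[pos] = replacement;
            replaced++;
            n = 1;                          /* retry from the next byte */
        }
        pos += n;
    }
    return replaced;
}
```

From Objective-C this could be called on a mutable copy of the data, for example gb18030_replace_invalid((uint8_t *)md.mutableBytes, md.length, 'A').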

The idea for replaceInvalidGB18030Charset: came from this UTF-8 version:

//Replace bytes that are not valid UTF-8
//Note: if the second byte of a three-byte UTF-8 sequence is wrong, the first byte is replaced first (it is treated as a corrupted three-byte lead byte), and the remaining two bytes are then checked on their own.
- (NSData *)replaceNoUtf8:(NSData *)data
{
    const uint8_t replacement = 'A';    //byte written over each invalid byte
    NSMutableData *md = [NSMutableData dataWithData:data];
    uint8_t *bytes = (uint8_t *)md.mutableBytes;
    NSUInteger length = md.length;
    NSUInteger loc = 0;
    while (loc < length)
    {
        uint8_t lead = bytes[loc];
        if ((lead & 0x80) == 0)
        {
            //Single-byte ASCII
            loc++;
            continue;
        }
        else if ((lead & 0xE0) == 0xC0)
        {
            //Two-byte sequence: needs one continuation byte (10xxxxxx)
            if (loc + 1 < length && (bytes[loc + 1] & 0xC0) == 0x80)
            {
                loc += 2;
                continue;
            }
        }
        else if ((lead & 0xF0) == 0xE0)
        {
            //Three-byte sequence: needs two continuation bytes
            if (loc + 2 < length
                && (bytes[loc + 1] & 0xC0) == 0x80
                && (bytes[loc + 2] & 0xC0) == 0x80)
            {
                loc += 3;
                continue;
            }
        }
        //Invalid byte (this includes the lead bytes of four-byte sequences, which this method does not handle): replace this one byte with A
        bytes[loc] = replacement;
        loc++;
    }

    return md;
}
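
The UTF-8 method above has no branch for lead bytes of the form 11110xxx, so four-byte sequences (emoji, for example) end up replaced byte by byte. A rough sketch in plain C that adds that branch might look like this. The function names are hypothetical, and the check is structural only: overlong forms and surrogate code points are not rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns how many bytes the UTF-8 sequence at buf[pos] occupies (1 to 4),
   or 0 if the lead byte or one of its continuation bytes is malformed.
   Hypothetical helper: structural check only. */
size_t utf8_sequence_length(const uint8_t *buf, size_t len, size_t pos)
{
    assert(buf != NULL && pos < len);
    uint8_t lead = buf[pos];
    size_t need;                                  /* continuation bytes required */
    if ((lead & 0x80) == 0x00) need = 0;          /* 0xxxxxxx */
    else if ((lead & 0xE0) == 0xC0) need = 1;     /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0) need = 2;     /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0) need = 3;     /* 11110xxx */
    else return 0;                                /* stray continuation or 0xF8..0xFF */

    if (need > len - pos - 1) return 0;           /* sequence runs past the end */
    for (size_t i = 1; i <= need; i++) {
        if ((buf[pos + i] & 0xC0) != 0x80) return 0;   /* not 10xxxxxx */
    }
    return need + 1;
}

/* Overwrites each byte that does not start a well-formed sequence with
   `replacement` and returns how many bytes were replaced. */
size_t utf8_replace_invalid(uint8_t *buf, size_t len, uint8_t replacement)
{
    size_t replaced = 0;
    size_t pos = 0;
    while (pos < len) {
        size_t n = utf8_sequence_length(buf, len, pos);
        if (n == 0) {
            buf[pos] = replacement;
            replaced++;
            n = 1;                                /* re-examine the following byte */
        }
        pos += n;
    }
    return replaced;
}
```

As in the original, a broken sequence costs only its first byte; the bytes after it are examined again on their own.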

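Later, while writing some PHP, I suddenly remembered a function called iconv (which can also be used on iOS). It can convert with the //IGNORE and //TRANSLIT suffixes; the documentation says: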

ENCODINGS
       The values permitted for --from-code and --to-code can be listed by the iconv --list command, and all combinations of the listed values are supported. Furthermore the following two suffixes are supported:

       //TRANSLIT
              When the string "//TRANSLIT" is appended to --to-code, transliteration is activated.  This means that when a character cannot be represented in the target character set, it can be approximated through one or several similarly looking
              characters.

       //IGNORE
              When the string "//IGNORE" is appended to --to-code, characters that cannot be represented in the target character set will be silently discarded.

I'll leave this as a placeholder for now and try it out when I have time.
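
For reference, a minimal sketch of what the iconv route might look like in C, which I have not tried on iOS. Instead of relying on //IGNORE, it skips each byte that iconv rejects and writes a '?' in its place, so the result does not depend on how a particular iconv build treats that suffix. The function name gb18030_to_utf8_lossy is hypothetical, and I am assuming the iconv build recognises the encoding name "GB18030" (on Apple platforms this would also mean linking against libiconv).

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Converts `len` bytes of GB18030 text to a NUL-terminated UTF-8 string.
   Every input byte that iconv rejects is dropped and replaced by '?'.
   Returns a malloc'd buffer (the caller frees it) or NULL on failure. */
char *gb18030_to_utf8_lossy(const char *src, size_t len)
{
    assert(src != NULL || len == 0);
    iconv_t cd = iconv_open("UTF-8", "GB18030");    /* to-code, from-code */
    if (cd == (iconv_t)-1) return NULL;             /* encoding not supported */

    /* One input byte never yields more than 4 UTF-8 bytes. */
    size_t cap = len * 4 + 1;
    char *out = malloc(cap);
    if (out == NULL) { iconv_close(cd); return NULL; }

    char *in = (char *)src;       /* iconv wants char **, the input is only read */
    size_t in_left = len;
    char *dst = out;
    size_t out_left = cap - 1;    /* keep one byte for the terminator */

    while (in_left > 0) {
        if (iconv(cd, &in, &in_left, &dst, &out_left) != (size_t)-1)
            break;                                  /* all input converted */
        if (errno == EILSEQ || errno == EINVAL) {
            /* Invalid (EILSEQ) or truncated (EINVAL) sequence at `in`:
               drop one byte, emit '?', reset the state and continue. */
            in++;
            in_left--;
            *dst++ = '?';
            out_left--;
            iconv(cd, NULL, NULL, NULL, NULL);
        } else {
            free(out);                              /* E2BIG or unexpected error */
            iconv_close(cd);
            return NULL;
        }
    }
    *dst = '\0';
    iconv_close(cd);
    return out;
}
```

The returned string could then be handed to [NSString stringWithUTF8String:] and freed afterwards.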

Tags: utf8, gb2312, gb18030