【踩坑笔记】Tesseract-OCR-04-使用 jTessBoxEditor 提高文字识别准确率_python笔记

发表评论
1,431

A+

所属分类：python笔记

本文信息本文由方法SEO顾问发表于2019-09-2300:04:23，共 2068 字，转载请注明：【踩坑笔记】Tesseract-OCR-04-使用 jTessBoxEditor 提高文字识别准确率_【方法SEO顾问】

参考文献：

1、https://blog.csdn.net/a745233700/article/details/80175883

大概顺序：

1、下载验证码图片

2、用PIL库将验证码图片放大，太小的图片识别不了，from PIL import Image

import pytesseract
from PIL import Image

#####For CentOS 7 run the following as root:
#####yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
#####sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
#####yum update
#####yum install tesseract 
#####yum install tesseract-langpack-deu

def ocr(img_path):
    img = Image.open(img_path)
    thumb = img.resize((420, 180))
    img_gray = thumb.convert('L')
    # img_gray.save('code_gray.png')
    # 转成黑白图片
    img_black_white = img_gray.point(lambda x: 0 if x > 200 else 255)
    img_black_white.save('img_black_white.png', quality=100)
    testdata_dir_config = '--tessdata-dir "./tessdata"'
    textCode = pytesseract.image_to_string(img_black_white, lang='训练完的数据集', config=testdata_dir_config)
    # 去掉非法字符，只保留字母数字
    textCode = re.sub("\W", "", textCode)
    print textCode
    return textCode

3、大坑：在合并TIF文件完了，生成.box文件后，如果标记图片时有字符不能出现坐标值，说明该张图片识别不了，赶紧把该图删除，然后重新合并TIF，否则标记完了也无法进行下一步。

手动标记完成后，逐条输入下面的命令来生成最终的训练集

echo "test 0 0 0 0 0" >font_properties
tesseract zwp.test.exp0.tif zwp.test.exp0 nobatch box.train
unicharset_extractor zwp.test.exp0.box
shapeclustering -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr
mftraining -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr
cntraining zwp.test.exp0.tr
rename normproto zwp.normproto
rename inttemp zwp.inttemp
rename pffmtable zwp.pffmtable
rename shapetable zwp.shapetable
combine_tessdata zwp.

最后将生成的zwp.traineddata 拷贝到相应目录调用，关键代码如下：

def ocr(img_path):
    img = Image.open(img_path)
    thumb = img.resize((940, 400))####放大验证码
    img_gray = thumb.convert('L')
    # img_gray.save('code_gray.png')
    # 转成黑白图片
    img_black_white = img_gray.point(lambda x: 0 if x > 200 else 255)
    img_black_white.save('724img_black_white.png', quality=100)
    # testdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
    testdata_dir_config = '--tessdata-dir "./tessdata"'
    textCode = pytesseract.image_to_string(img_black_white, lang='zwp', config=testdata_dir_config)    
    textCode = re.sub("\W", "", textCode)### 去掉非法字符，只保留字母数字
    print textCode
    return textCode

发表评论取消回复