【踩坑笔记】Tesseract-OCR-04-使用 jTessBoxEditor 提高文字识别准确率

  • A+
所属分类:python笔记
本文信息本文由方法SEO顾问发表于2019-09-2300:04:23,共 2068 字,转载请注明:【踩坑笔记】Tesseract-OCR-04-使用 jTessBoxEditor 提高文字识别准确率_【方法SEO顾问】

参考文献:

1、https://blog.csdn.net/a745233700/article/details/80175883

大概顺序:

1、下载验证码图片

2、用PIL库将验证码图片放大,太小的图片识别不了,from PIL import Image

import pytesseract
from PIL import Image

#####For CentOS 7 run the following as root:
#####yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
#####sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
#####yum update
#####yum install tesseract 
#####yum install tesseract-langpack-deu

def ocr(img_path):
    img = Image.open(img_path)
    thumb = img.resize((420, 180))
    img_gray = thumb.convert('L')
    # img_gray.save('code_gray.png')
    # 转成黑白图片
    img_black_white = img_gray.point(lambda x: 0 if x > 200 else 255)
    img_black_white.save('img_black_white.png', quality=100)
    testdata_dir_config = '--tessdata-dir "./tessdata"'
    textCode = pytesseract.image_to_string(img_black_white, lang='训练完的数据集', config=testdata_dir_config)
    # 去掉非法字符,只保留字母数字
    textCode = re.sub("\W", "", textCode)
    print textCode
    return textCode

3、大坑:在合并TIF文件完了,生成.box文件后,如果标记图片时有字符不能出现坐标值,说明该张图片识别不了,赶紧把该图删除,然后重新合并TIF,否则标记完了也无法进行下一步。

手动标记完成后,逐条输入下面的命令来生成最终的训练集

echo "test 0 0 0 0 0" >font_properties
tesseract zwp.test.exp0.tif zwp.test.exp0 nobatch box.train
unicharset_extractor zwp.test.exp0.box
shapeclustering -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr
mftraining -F font_properties -U unicharset -O zwp.unicharset zwp.test.exp0.tr
cntraining zwp.test.exp0.tr
rename normproto zwp.normproto
rename inttemp zwp.inttemp
rename pffmtable zwp.pffmtable
rename shapetable zwp.shapetable
combine_tessdata zwp.

最后将生成的zwp.traineddata 拷贝到相应目录调用,关键代码如下:

def ocr(img_path):
    img = Image.open(img_path)
    thumb = img.resize((940, 400))####放大验证码
    img_gray = thumb.convert('L')
    # img_gray.save('code_gray.png')
    # 转成黑白图片
    img_black_white = img_gray.point(lambda x: 0 if x > 200 else 255)
    img_black_white.save('724img_black_white.png', quality=100)
    # testdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
    testdata_dir_config = '--tessdata-dir "./tessdata"'
    textCode = pytesseract.image_to_string(img_black_white, lang='zwp', config=testdata_dir_config)    
    textCode = re.sub("\W", "", textCode)### 去掉非法字符,只保留字母数字
    print textCode
    return textCode

发表评论

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: