{"id":2204,"date":"2015-04-17T07:26:44","date_gmt":"2015-04-17T10:26:44","guid":{"rendered":"http:\/\/tangopardo.com.ar\/2cf7\/?p=2204"},"modified":"2015-04-17T16:26:05","modified_gmt":"2015-04-17T19:26:05","slug":"breaking-a-simple-captcha-with-python-and-pillow","status":"publish","type":"post","link":"https:\/\/tangopardo.com.ar\/2cf7\/2015\/04\/17\/breaking-a-simple-captcha-with-python-and-pillow","title":{"rendered":"Breaking a simple captcha with Python and Pillow"},"content":{"rendered":"<p>A while ago one of our long time customers approached us to automate tasks on a government portal. At least here most of them are kind of ugly, work on a specific set of browser versions and are painfully slow. We already helped him with problems like this before, so instead of having someone enter manually all the data they just populate a database and then our robot does all the work, simulating the actions on the web portal.<\/p>\n<p>This one is a bit different, because they introduced a captcha in order to infuriate users (seriously, it looks like they don&#8217;t want people logging in).<\/p>\n<p>Most of the time they look like this:<\/p>\n\n\t\t<style type=\"text\/css\">\n\t\t\t#gallery-1 {\n\t\t\t\tmargin: auto;\n\t\t\t}\n\t\t\t#gallery-1 .gallery-item {\n\t\t\t\tfloat: left;\n\t\t\t\tmargin-top: 10px;\n\t\t\t\ttext-align: center;\n\t\t\t\twidth: 100%;\n\t\t\t}\n\t\t\t#gallery-1 img {\n\t\t\t\tborder: 2px solid #cfcfcf;\n\t\t\t}\n\t\t\t#gallery-1 .gallery-caption {\n\t\t\t\tmargin-left: 0;\n\t\t\t}\n\t\t\t\/* see gallery_shortcode() in wp-includes\/media.php *\/\n\t\t<\/style>\n\t\t<div data-carousel-extra='{\"blog_id\":1,\"permalink\":\"https:\\\/\\\/tangopardo.com.ar\\\/2cf7\\\/2015\\\/04\\\/17\\\/breaking-a-simple-captcha-with-python-and-pillow\"}' id='gallery-1' class='gallery galleryid-2204 gallery-columns-1 gallery-size-full'><dl class='gallery-item'>\n\t\t\t<dt class='gallery-icon landscape'>\n\t\t\t\t<a 
href='https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/fail_976763.bmp'><img loading=\"lazy\" decoding=\"async\" width=\"86\" height=\"21\" src=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/fail_976763.bmp\" class=\"attachment-full size-full\" alt=\"\" data-attachment-id=\"2205\" data-permalink=\"https:\/\/tangopardo.com.ar\/2cf7\/2015\/04\/17\/breaking-a-simple-captcha-with-python-and-pillow\/fail_976763\" data-orig-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/fail_976763.bmp\" data-orig-size=\"86,21\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"captcha fail_976763\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/fail_976763.bmp\" data-large-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/fail_976763.bmp\" \/><\/a>\n\t\t\t<\/dt><\/dl><br style=\"clear: both\" \/>\n\t\t<\/div>\n\n<p>The first thing I tried was to remove the lines and feed the result into an ocr engine. 
So I made a very simple filter using Pillow:<\/p>\n<pre><code>\r\n#!\/usr\/bin\/python\r\n\r\nfrom PIL import Image\r\nimport sys, os\r\n\r\ndef filter_lines(src):\r\n    w,h = src.size\r\n\r\n    stripes = []\r\n    ss = {}\r\n\r\n    for x in range(w):\r\n        count = 0\r\n        for y in range(h):\r\n            if src.getpixel( (x,y) ) != (248, 255, 255):\r\n                count += 1\r\n        if count == h:\r\n            stripes.append(x)\r\n\r\n    for x in stripes:\r\n        for y in range(h):\r\n            src.putpixel( (x,y),  (248, 255, 255) )\r\n    return src\r\n\r\nif __name__ == '__main__':\r\n    src = Image.open(sys.argv[1])\r\n    region = filter_lines(src)\r\n    region.save(sys.argv[2])\r\n<\/code><\/pre>\n<p>Now it looks better but after trying gocr and tesseract it still needs more work:<\/p>\n\n\t\t<style type=\"text\/css\">\n\t\t\t#gallery-2 {\n\t\t\t\tmargin: auto;\n\t\t\t}\n\t\t\t#gallery-2 .gallery-item {\n\t\t\t\tfloat: left;\n\t\t\t\tmargin-top: 10px;\n\t\t\t\ttext-align: center;\n\t\t\t\twidth: 100%;\n\t\t\t}\n\t\t\t#gallery-2 img {\n\t\t\t\tborder: 2px solid #cfcfcf;\n\t\t\t}\n\t\t\t#gallery-2 .gallery-caption {\n\t\t\t\tmargin-left: 0;\n\t\t\t}\n\t\t\t\/* see gallery_shortcode() in wp-includes\/media.php *\/\n\t\t<\/style>\n\t\t<div data-carousel-extra='{\"blog_id\":1,\"permalink\":\"https:\\\/\\\/tangopardo.com.ar\\\/2cf7\\\/2015\\\/04\\\/17\\\/breaking-a-simple-captcha-with-python-and-pillow\"}' id='gallery-2' class='gallery galleryid-2204 gallery-columns-1 gallery-size-full'><dl class='gallery-item'>\n\t\t\t<dt class='gallery-icon landscape'>\n\t\t\t\t<a href='https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/fail_976763_filtered_1.bmp'><img loading=\"lazy\" decoding=\"async\" width=\"86\" height=\"21\" src=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/fail_976763_filtered_1.bmp\" class=\"attachment-full size-full\" alt=\"\" data-attachment-id=\"2208\" 
data-permalink=\"https:\/\/tangopardo.com.ar\/2cf7\/2015\/04\/17\/breaking-a-simple-captcha-with-python-and-pillow\/fail_976763_filtered_1\" data-orig-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/fail_976763_filtered_1.bmp\" data-orig-size=\"86,21\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"fail_976763_filtered_1\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/fail_976763_filtered_1.bmp\" data-large-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/fail_976763_filtered_1.bmp\" \/><\/a>\n\t\t\t<\/dt><\/dl><br style=\"clear: both\" \/>\n\t\t<\/div>\n\n<p>Just for kicks I decided to filter 100 images and overlap them, this is what I got:<\/p>\n\n\t\t<style type=\"text\/css\">\n\t\t\t#gallery-3 {\n\t\t\t\tmargin: auto;\n\t\t\t}\n\t\t\t#gallery-3 .gallery-item {\n\t\t\t\tfloat: left;\n\t\t\t\tmargin-top: 10px;\n\t\t\t\ttext-align: center;\n\t\t\t\twidth: 100%;\n\t\t\t}\n\t\t\t#gallery-3 img {\n\t\t\t\tborder: 2px solid #cfcfcf;\n\t\t\t}\n\t\t\t#gallery-3 .gallery-caption {\n\t\t\t\tmargin-left: 0;\n\t\t\t}\n\t\t\t\/* see gallery_shortcode() in wp-includes\/media.php *\/\n\t\t<\/style>\n\t\t<div data-carousel-extra='{\"blog_id\":1,\"permalink\":\"https:\\\/\\\/tangopardo.com.ar\\\/2cf7\\\/2015\\\/04\\\/17\\\/breaking-a-simple-captcha-with-python-and-pillow\"}' id='gallery-3' class='gallery galleryid-2204 gallery-columns-1 gallery-size-full'><dl class='gallery-item'>\n\t\t\t<dt class='gallery-icon landscape'>\n\t\t\t\t<a 
href='https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/overlapeada_original.bmp'><img loading=\"lazy\" decoding=\"async\" width=\"86\" height=\"21\" src=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/overlapeada_original.bmp\" class=\"attachment-full size-full\" alt=\"\" data-attachment-id=\"2209\" data-permalink=\"https:\/\/tangopardo.com.ar\/2cf7\/2015\/04\/17\/breaking-a-simple-captcha-with-python-and-pillow\/overlapeada_original\" data-orig-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/overlapeada_original.bmp\" data-orig-size=\"86,21\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"overlapeada_original\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/overlapeada_original.bmp\" data-large-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/overlapeada_original.bmp\" \/><\/a>\n\t\t\t<\/dt><\/dl><br style=\"clear: both\" \/>\n\t\t<\/div>\n\n<p>That is interesting&#8230; I used this script (not the most efficient approach, but still..)<\/p>\n<pre><code>\r\n#!\/usr\/bin\/python\r\n\r\nfrom PIL import Image\r\nimport sys, os\r\n\r\ndst = Image.new('RGB', (86, 21) )\r\n\r\nw,h = 86, 21\r\n\r\nfor x in range(w):\r\n    for y in range(h):\r\n        dst.putpixel( (x,y),  (255, 255, 255) )\r\n\r\nfor idx in range(30):\r\n    src = Image.open('filtradas\/%i.bmp'%idx)\r\n\r\n    for x in range(w):\r\n        for y in range(h):\r\n            if src.getpixel( (x,y) ) != (248, 255, 255):\r\n         
       dst.putpixel( (x,y),  (255, 0, 0) )\r\n\r\ndst.save('overlapeada.bmp')\r\n<\/code><\/pre>\n<p>With this piece of information I can focus my efforts on that area only.<br \/>\nThat font, even distorted, looks quite familiar to me. And indeed it is: it&#8217;s Helvetica.<br \/>\nThis makes the problem a lot easier.<\/p>\n<p>I grabbed a bitmapped version of the same size and made a grid that shows where a number can land, assuming 8&#215;13 symbols:<\/p>\n\n\t\t<style type=\"text\/css\">\n\t\t\t#gallery-4 {\n\t\t\t\tmargin: auto;\n\t\t\t}\n\t\t\t#gallery-4 .gallery-item {\n\t\t\t\tfloat: left;\n\t\t\t\tmargin-top: 10px;\n\t\t\t\ttext-align: center;\n\t\t\t\twidth: 100%;\n\t\t\t}\n\t\t\t#gallery-4 img {\n\t\t\t\tborder: 2px solid #cfcfcf;\n\t\t\t}\n\t\t\t#gallery-4 .gallery-caption {\n\t\t\t\tmargin-left: 0;\n\t\t\t}\n\t\t\t\/* see gallery_shortcode() in wp-includes\/media.php *\/\n\t\t<\/style>\n\t\t<div data-carousel-extra='{\"blog_id\":1,\"permalink\":\"https:\\\/\\\/tangopardo.com.ar\\\/2cf7\\\/2015\\\/04\\\/17\\\/breaking-a-simple-captcha-with-python-and-pillow\"}' id='gallery-4' class='gallery galleryid-2204 gallery-columns-1 gallery-size-full'><dl class='gallery-item'>\n\t\t\t<dt class='gallery-icon landscape'>\n\t\t\t\t<a href='https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/grillas.png'><img loading=\"lazy\" decoding=\"async\" width=\"45\" height=\"18\" src=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/grillas.png\" class=\"attachment-full size-full\" alt=\"\" data-attachment-id=\"2211\" data-permalink=\"https:\/\/tangopardo.com.ar\/2cf7\/2015\/04\/17\/breaking-a-simple-captcha-with-python-and-pillow\/grillas\" data-orig-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/grillas.png\" data-orig-size=\"45,18\" data-comments-opened=\"1\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"grillas\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/grillas.png\" data-large-file=\"https:\/\/tangopardo.com.ar\/2cf7\/wp-content\/uploads\/2015\/04\/grillas.png\" \/><\/a>\n\t\t\t<\/dt><\/dl><br style=\"clear: both\" \/>\n\t\t<\/div>\n\n<p>This shows that there is a slight overlap between digits.<br \/>\nI went for a brute force approach, dividing the captcha into cells and comparing each one with every digit of the font, allowing a small amount of overlap between cells.<br \/>\nThe symbols are smaller than the cell, so for each cell I build candidate regions at several offsets and score every region by the fraction of pixels that match the digit.<br \/>\nThe digit with the highest score is (likely) the correct number.<\/p>\n<p>This is really simple and, even though we do a lot of comparisons, it performs OK (the images are quite small). Without tuning we got about a 30% success rate (the server also adds noise and more aggressive distortions from time to time).<\/p>\n<p>Have a difficult or unconventional problem? 
Give us a call: we are like the A-Team of technology.<\/p>\n<p>This is the complete algorithm (the identifiers are in Spanish, but it shouldn&#8217;t be hard to follow). It can also be found here: <a href=\"https:\/\/gist.github.com\/pardo-bsso\/a6ab7aa41bad3ca32e30\">https:\/\/gist.github.com\/pardo-bsso\/a6ab7aa41bad3ca32e30<\/a><\/p>\n<pre><code>\r\n#!\/usr\/bin\/python\r\n\r\nfrom PIL import Image\r\nimport sys, os\r\n\r\n\r\n# Reference digit images (patrones\/0.png to patrones\/9.png) and their pixel lists.\r\nimgpatrones = []\r\npixelpatrones = []\r\n\r\nfor idx in range(10):\r\n    img = Image.open(\"patrones\/%i.png\" % idx).convert('RGB')\r\n    imgpatrones.append(img)\r\n    pixelpatrones.append( list(img.getdata()) )\r\n\r\n\r\ndef compara(region, patron):\r\n    # Score: fraction of pixels that are equal in the region and the pattern.\r\n    pixels = list(region.getdata())\r\n    size = min(len(pixels), len(patron))\r\n\r\n    res = 0.0\r\n    for idx in range(size):\r\n        if pixels[idx] == patron[idx]:\r\n            res = res + 1\r\n\r\n    return res \/ size\r\n\r\n\r\ndef elimina_lineas(src):\r\n    # Crop to the area where the digits land, then look for full-height stripes.\r\n    cropeada = src.crop( (4, 1, 49, 19) )\r\n    w,h = cropeada.size\r\n    stripes = []\r\n\r\n    for x in range(w):\r\n        count = 0\r\n        for y in range(h):\r\n            if cropeada.getpixel( (x,y) ) != (248, 255, 255):\r\n                count += 1\r\n\r\n        if count == h:\r\n            stripes.append(x)\r\n\r\n    # Paint every full-height stripe column red.\r\n    for x in stripes:\r\n        for y in range(h):\r\n            cropeada.putpixel( (x,y),  (255, 0, 0) )\r\n\r\n    return cropeada\r\n\r\ndef crear_crops(src, celda):\r\n    # Candidate 8x13 boxes for one cell: shifted up to 3 px sideways and 0 to 5 px down.\r\n    limites = range(38)\r\n    xceldas = [0, 8, 16, 24, 32, 40]\r\n    xoffsets = range(-3,4)\r\n    yceldas = range(6)\r\n    boxes = []\r\n    crops = []\r\n\r\n    x = xceldas[celda]\r\n    x = [ (x+off) for off in xoffsets if (x+off) in limites ]\r\n\r\n    for left in x:\r\n        for top in yceldas:\r\n            boxes.append( (left, top, left+8, top+13) )\r\n\r\n    for box in boxes:\r\n        crops.append( src.crop(box) )\r\n\r\n    return crops\r\n\r\ndef 
compara_crops_con_patron(crops, patron):\r\n    # Best score of one digit pattern over all candidate crops of a cell.\r\n    scores = []\r\n    for crop in crops:\r\n        scores.append( compara(crop, pixelpatrones[patron] ))\r\n    return max(scores)\r\n\r\ndef decodifica_celda(src, celda):\r\n    # The digit whose pattern scores highest wins the cell.\r\n    pesos = []\r\n    crops = crear_crops(src, celda)\r\n\r\n    for patron in range(10):\r\n        pesos.append( compara_crops_con_patron(crops, patron) )\r\n\r\n    return pesos.index( max(pesos) )\r\n\r\ndef decodifica(filename):\r\n    # Convert to RGB so pixels compare as (r, g, b) tuples.\r\n    original = Image.open(filename).convert('RGB')\r\n    src = elimina_lineas(original)\r\n    res = []\r\n\r\n    for celda in range(6):\r\n        res.append( decodifica_celda(src, celda) )\r\n\r\n    return ''.join( str(x) for x in res )\r\n\r\nif __name__ == '__main__':\r\n    print(decodifica(sys.argv[1]))\r\n\r\n<\/code><\/pre>\n","protected":false},"excerpt":{"rendered":"<p>A while ago one of our long-time customers approached us to automate tasks on a government portal. At least here most of them are kind of ugly, work on a specific set of browser versions and are painfully slow. 
We already helped him with problems like this before, so instead of having someone enter [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[24,4,3],"tags":[63,45,64],"class_list":["post-2204","post","type-post","status-publish","format-standard","hentry","category-hacks","category-python","category-software","tag-captcha","tag-crear","tag-python"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/pmnbC-zy","_links":{"self":[{"href":"https:\/\/tangopardo.com.ar\/2cf7\/wp-json\/wp\/v2\/posts\/2204","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/tangopardo.com.ar\/2cf7\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/tangopardo.com.ar\/2cf7\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/tangopardo.com.ar\/2cf7\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/tangopardo.com.ar\/2cf7\/wp-json\/wp\/v2\/comments?post=2204"}],"version-history":[{"count":8,"href":"https:\/\/tangopardo.com.ar\/2cf7\/wp-json\/wp\/v2\/posts\/2204\/revisions"}],"predecessor-version":[{"id":2216,"href":"https:\/\/tangopardo.com.ar\/2cf7\/wp-json\/wp\/v2\/posts\/2204\/revisions\/2216"}],"wp:attachment":[{"href":"https:\/\/tangopardo.com.ar\/2cf7\/wp-json\/wp\/v2\/media?parent=2204"}],"wp
:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/tangopardo.com.ar\/2cf7\/wp-json\/wp\/v2\/categories?post=2204"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/tangopardo.com.ar\/2cf7\/wp-json\/wp\/v2\/tags?post=2204"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}