Installing and using Tesseract 4 on Windows 10

Bharath Sivakumar, Quantrium Guides, Jul 8, 7 min read

Tesseract is an optical character recognition (OCR) engine that can be used on various operating systems. It is free software, released under the Apache License. Tesseract was originally developed by Hewlett-Packard as proprietary software in the 1980s and was released as open source software in 2005. Since 2006, its development has been sponsored by Google.

In this guide, I will take you through the steps that I followed to install Tesseract on my Windows 10 machine. I will also show you how to use Tesseract from the command line once you have successfully installed it.

Installing Tesseract 4 on a Windows Machine using .exe File

To install Tesseract 4 on a Windows system, go to the following link:

Index of /tesseract
These executables are provided by Mannheim University Library. Licensed under the Apache License, Version 2.0 (the...
digi.bib.uni-mannheim.de

Download the Windows executable by clicking the hyperlink titled tesseract-ocr-w64-setup-v4.1.0.20190314.exe. A prompt asking you to save a file called "tesseract-ocr-w64-setup-v4.1.0.20190314.exe" will appear. Save this .exe file wherever you have enough storage space, then open it. If Windows asks "Do you want to allow this software to make changes to your system", click Yes. You will be taken to the installer.
Click Next, agree to the terms and conditions, choose who you want to install Tesseract for (anyone using this computer, or just you; either one works), and click Next. Tick the boxes labelled "ScrollView", "Training Tools", "Shortcuts creation" and, importantly, "Language data". These should be ticked by default, but tick them yourself if they are not. If you want to recognise text in languages other than English, such as Japanese, Chinese, Kurdish, or Indian languages like Hindi, Tamil and Bengali, tick "Additional script data" and "Additional language data" as well. If you only need English, you can leave these two unticked. Click Next.

Select the directory where you want to install Tesseract. By default it showed C:\Program Files\Tesseract-OCR for me, and that is where I installed it. You can install it wherever you like, but take note of the path you chose. This is important, because you will need it when setting up the Path variable.

Now select the Start Menu folder in which you would like to create the program's shortcuts. I created them in a folder called "Tesseract-OCR". If you want a new folder, type its name in the blank field right under the "Select the Start Menu folder in which you would like ...." text. You can also tick the "Do not create shortcuts" box at the bottom left if you do not want any shortcuts. Once you have chosen your preferred option, click Install. The installation should take a few minutes.

Once the installation is over, go to the directory where you installed Tesseract. We want to use Tesseract from the Windows command line, and to do that we have to add Tesseract to the Path environment variable. Click the Start button and search for "environment variable". You will see a result called "Edit the system environment variables". Click on it.
You should now be in the "Advanced" tab of "System Properties", with a button called "Environment Variables..." visible at the bottom right. Click that button. You will see two tables. One is named "User variables for <username>", where <username> stands for the user currently logged in to the PC. The other is called "System variables". In the "System variables" table, click the variable called "Path" and then click the "Edit" button just above the "OK" button.

[Screenshot: Set path variable for Tesseract on Windows]

You will now see a window called "Edit environment variable". Click the "New" button at the top right. A blank entry appears where you can add text. Enter the directory where all your Tesseract-OCR files are stored, that is, the install path you noted earlier. Press Enter and check that the directory has been added to the list in "Edit environment variable". Once it has, click "OK". Click "OK" again in the "Environment Variables" window, and once more in the "System Properties" window. You should now have exited all the settings windows.

Open a new Command Prompt (one that was already open before the change will not see the updated Path), type tesseract --version and press Enter. You will see something like this:

[Screenshot: Output for tesseract --version command after tesseract was successfully installed]

If instead you see an error saying that the tesseract command was not found (on Windows this typically reads "'tesseract' is not recognized as an internal or external command"), you have most probably made a mistake while following this guide.
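When the command is not found, the usual cause is that the Tesseract directory never made it into Path. The following is a minimal Python sketch of that check. Python is not part of the original guide, and the helper names path_entries and find_on_path are hypothetical; the sketch relies only on the standard library's shutil.which.

```python
import os
import shutil


def path_entries(path_value=None):
    """Split a PATH-style string into its non-empty directory entries."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    return [entry for entry in path_value.split(os.pathsep) if entry]


def find_on_path(program, path_value=None):
    """Return the full path of `program` if it is on the search path, else None."""
    return shutil.which(program, path=path_value)


if __name__ == "__main__":
    exe = find_on_path("tesseract")
    if exe:
        print("tesseract found at:", exe)
    else:
        # Print what the current session actually sees, so a missing or
        # mistyped Tesseract directory is easy to spot.
        print("tesseract not found. Current Path entries:")
        for entry in path_entries():
            print("  ", entry)
```

Run it from a newly opened Command Prompt, because the Path value is read once when the session starts.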
Go back and see where you went wrong and try to fix it. Alternatively, you can repeat the whole process.

Great! Now you have Tesseract installed on your machine. You can start playing around with it and explore it further.

How to use Tesseract 4 using Command Line on a Windows Machine

First, make sure you have a handwritten or typed document in the form of an image. Let's say you have a PNG photo called handwritten_photo_1.png on your Desktop and want to test Tesseract with it. Open Command Prompt. You will start in this directory:

    C:\Users\username>

where username is your username on that system. I need to go to the Desktop directory, so I use the following command:

    C:\Users\username> cd Desktop

Now I am in the Desktop directory, where my image is located. You can see the text that Tesseract predicts for the document using the following command:

    C:\Users\username\Desktop> tesseract handwritten_photo_1.png stdout -l eng

Tesseract will print the text directly in the command line. The -l parameter (a lowercase L) specifies the language. Here we have specified English, which is the default anyway, so -l eng was redundant in this case. If you want to use another language for OCR, check the link below, which has all the .traineddata files, one per language:

tesseract-ocr/tessdata
These language data files only work with Tesseract 4.0.0. They are based on the sources in tesseract-ocr/langdata on...
github.com

Say you have a text document written in Hindi. Go to the link above, click on the file titled hin.traineddata and download it. Once it has downloaded, move it to the "tessdata" folder, which is inside the directory where you originally installed Tesseract.
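Before running OCR in a new language, it is worth confirming that the language file really landed in the tessdata folder. Tesseract can report this itself with tesseract --list-langs. As an alternative, here is a small Python sketch that inspects the folder directly. The function names are hypothetical, and the path in the example is the default install location mentioned in this guide, which is an assumption about your machine.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the language codes that have a .traineddata file in tessdata_dir."""
    folder = Path(tessdata_dir)
    if not folder.is_dir():
        return []
    # "hin.traineddata" -> "hin"
    return sorted(p.stem for p in folder.glob("*.traineddata"))


def has_language(tessdata_dir, lang):
    """True if <lang>.traineddata is present in tessdata_dir."""
    return lang in installed_languages(tessdata_dir)


if __name__ == "__main__":
    # Assumed default install location; change it if you installed elsewhere.
    tessdata = r"C:\Program Files\Tesseract-OCR\tessdata"
    print("Installed languages:", installed_languages(tessdata))
    print("Hindi available:", has_language(tessdata, "hin"))
```

If "hin" is missing from the list, the file is most likely still in your Downloads folder or was saved under a different name.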
Once you have done that, you can perform OCR on Hindi documents using the following command:

    C:\Users\username\Desktop> tesseract hindi_image.png stdout -l hin

Instead of displaying the OCR output on the command line, let's say you want it stored in a text file. In that case you can enter the following command instead:

    tesseract handwritten_photo_1.png output

The text in handwritten_photo_1.png will be stored in a text file called output.txt (Tesseract adds the .txt extension to the output name itself), which will be located in your present working directory, Desktop in my case.

Tesseract can also take a text file as input, where the file contains the absolute paths of all the images that you want to process, one per line. This is especially useful when, say, you have two images handwritten in English called handwritten_photo_1.png and handwritten_photo_2.png in the C:\Program Files directory. In your present working directory, you have a text file called input.txt whose contents are:

    C:\Program Files\handwritten_photo_1.png
    C:\Program Files\handwritten_photo_2.png

in the first and second line respectively. To store the contents of these two handwritten photos in a text file, run the following:

    tesseract input.txt output -l eng

output.txt will have the OCR contents of both handwritten_photo_1.png and handwritten_photo_2.png, in that order. Note that input.txt was in the current working directory here. You can also use a text file that is not in your present working directory by including its location (in quotes if the path contains a space):

    tesseract "C:\Program Files\input.txt" output -l eng

output.txt will again be located in the present working directory. You can do this for more than two photos as well.
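The list-file workflow can also be scripted. The sketch below, in Python, writes the list file, builds the Tesseract arguments, and splits a combined result back into one piece per image. All three function names are hypothetical. The split assumes the pieces are separated by a form feed character, which is my understanding of Tesseract's usual page separator; if your output uses a different marker, pass it as separator.

```python
from pathlib import Path


def write_image_list(image_paths, list_file):
    """Write one image path per line to list_file and return the lines written."""
    lines = [str(p) for p in image_paths]
    Path(list_file).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def build_command(list_file, output_base, lang="eng"):
    """Build the argument list for: tesseract <list_file> <output_base> -l <lang>."""
    return ["tesseract", str(list_file), str(output_base), "-l", lang]


def split_pages(text, separator="\f"):
    """Split combined OCR text into one entry per image, dropping empty pieces.

    Assumption: images are separated by a form feed ("\\f") in the output.
    """
    return [part.strip() for part in text.split(separator) if part.strip()]
```

The list returned by build_command can be passed to subprocess.run(..., check=True). Passing the arguments as a list rather than one string avoids quoting problems with paths that contain spaces, such as C:\Program Files.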
Note that in the output text file, the prediction for each new photo is preceded by a symbol, as in this example:

[Screenshot: Tesseract output of an input text file with 5 lines of image locations]

In this case, "Viral Calic" is the prediction for the first image, "CY am the king of the world" the prediction for the second image, "Com and Serr" the prediction for the third image, and so on. You can go through the output for all your input images and check the accuracy of the predictions.

That's it! Congratulations, you are now all set and ready to use Tesseract on your Windows 10 system.
