How to Use Tesseract for OCR from the Linux Command Line

November 24, 2021

798

Here we can see, “How to Use Tesseract for OCR from the Linux Command Line”

The Tesseract OCR engine can extract text from photos via the Linux command line. It’s quick, accurate, and supports over 100 languages. Here’s how to put it to good use.

OCR stands for Optical Character Recognition.

The capacity to look at and detect words in an image and then extract them as editable text is known as optical character recognition (OCR). This simple task is extremely tough for computers to do. The early attempts were clumsy. If the typeface or size was not to the OCR software’s liking, computers were frequently puzzled.

Nonetheless, the forerunners in this sector were held in high regard. If you lose the electronic copy but still have the printed original, OCR can re-create an electronic, editable version of a document. Even if the outcomes weren’t perfect, it was still a significant time-saver.

Also See: OnePlus 9 Pro Media Storage app bug eats up too much storage space

You could get your document back with some manual cleanup. People forgave the errors it made because they recognized the difficulty of the task an OCR program faced. It was also preferable to retyping the entire document.

Since then, things have vastly improved. Hewlett Packard’s Tesseract OCR application was first released as a commercial product in the 1980s. It was released as open-source software in 2005 and is now sponsored by Google. It supports multiple languages, is widely considered one of the most accurate OCR systems available, and is entirely free to use.

Tesseract OCR Installation

Use the following command to install Tesseract OCR on Ubuntu:

sudo apt-get install tesseract-ocr

The command in Fedora is:

sudo dnf install tesseract

You must type the following on Manjaro:

sudo pacman -Syu tesseract

Tesseract OCR (Optical Character Recognition)

Tesseract OCR will be faced with a series of obstacles. An excerpt from Recital 63 of the General Data Protection Regulations is our first image with text. Let’s see if OCR can make sense of this (and stay awake).

Because each phrase begins with a faint superscript number, which is common in legislative documents, it’s a challenging image.

We must provide the following information to the tesseract command:

The name of the image file that we’d like it to work with.
The name of the text file in which the extracted text will be saved. We don’t have to give the file extension if we don’t want to (it will always be .txt). If a file with an identical name already exists, it will be overwritten.
The —dpi option can be used to tell Tesseract what the image’s dots per inch (dpi) resolution is. Tesseract will try to figure out the dpi value if we don’t offer one.

Our picture file is called “recital-63.png,” and it has a 150-dpi resolution. We’re going to save it as “recital.txt” in a text file.

This is what our command looks like:

tesseract recital-63.png recital --dpi 150

The outcomes are excellent. The only problem is that the superscripts are too faint to read correctly. To get good results, you’ll need a high-quality photograph.

Tesseract has read the superscript numerals as quote marks (“) and degree symbols (°), but the actual content has been perfectly retrieved (the right side of the image had to be trimmed to fit here).

The final character is a carriage return, represented by a byte with the hexadecimal value of 0x0C.

Another graphic with text in various sizes and bold and italics are shown below.

“bold-italic.png” is the name of the file. We wish to make a text file called “bold.txt,” thus we’ll use the following command:

tesseract bold-italic.png bold --dpi 150

This one went without a hitch, and the text was perfectly extracted.

Using a Variety of Languages

Tesseract OCR supports around 100 languages. You must first install a language before you can use it. Make a note of the abbreviation for the language you want to use in the list. We’re going to enable Welsh support. The abbreviation “cym” stands for “Cymru,” which means “Welsh.”

Also See: You Don’t Need a Product Key to Install and Use Windows 10

“tesseract-ocr-” is the installation package’s name, with the language abbreviation appended at the end. We’ll use the following commands to install the Welsh language file on Ubuntu:

sudo apt-get install tesseract-ocr-cym

Below is an image with the words. It’s the Welsh national anthem’s opening verse.

Let’s check if Tesseract OCR can handle the task. To tell Tesseract whatever language we want to work in, we’ll use the -l (language) option:

tesseract hen-wlad-fy-nhadau.png anthem -l cym --dpi 150

As evidenced in the excerpted text below, Tesseract performs admirably. Tesseract OCR, da iawn.

You can use a plus sign (+) to inform Tesseract to add another language if your document contains two or more languages (for example, a Welsh-to-English dictionary).

tesseract image.png textfile -l eng+cym+fra

Working with PDFs with Tesseract OCR

The tesseract command was created to operate with picture files, but it cannot read PDF files. If you need to extract text from a PDF, you can generate photos with another program first. A single image will be used to represent a single PDF page.

Your Linux PC should already have the pdf program installed. We’ll use a PDF of Alan Turing’s fundamental article on artificial intelligence, “Computing Machinery and Intelligence,” as our example.

To signal that we wish to create PNG files, we use the -png option. “turing.pdf” is the name of our PDF file. Our picture files will be named “turing-01.png,” “turing-02.png,” and so on:

pdftoppm -png turing.pdf turing

We need to utilize a for loop to run Tesseract on each image file with a single command. We use Tesseract to create a text file named “text-” plus “turing-nn” as part of the image file name for each of our “turing-nn.png” files:

for i in turing-??.png; do tesseract "$i" "text-$i" -l eng; done;

To combine all the text files into one, we can use cat:

cat text-turing* > complete.txt

So, how’d it turned out? As you can see below, everything went swimmingly. However, the first page appears to be extremely difficult. It has a variety of text styles and sizes, as well as adornment. On the right border of the page, there’s also a vertical “watermark.”

The output, on the other hand, is very close to the original. The formatting has been gone, but the text is correct.

At the bottom of the page, the vertical watermark was transcribed as a line of gibberish. The text was too small for Tesseract to read accurately, but it would be simple to locate and remove. Having stray characters at the end of each line would have been the worst-case scenario.

Surprisingly, the single letters at the beginning of the second page’s list of questions and answers have been omitted. The following is a passage from the PDF.

The questions are still there, but the “Q” and “A” at the beginning of each line have vanished.

Diagrams will also be incorrectly transcribed. Examine what occurs when we attempt to extract the one displayed below from the Turing PDF.

The characters were real, but the diagram format was lost, as you can see in our result below.

The modest size of the subscripts caused Tesseract to struggle once more, and they were drawn erroneously.

But, to be fair, that was still a good result. We could not extract simple text, but this case was picked specifically because it was a problem.

Conclusion

I hope you found this information helpful. Please fill out the form below if you have any queries or comments.

User Questions:

What is Tesseract Linux, and how can I use it?

On an Ubuntu installation, Tesseract is used. Installing Dependencies (1.1)
It’s being run. Select a picture containing text and run the following command in the console (assuming the input filename is img.png): out Tesseract img.png
Using Tesseract with Python. Python-tesseract is a Python wrapper for Tesseract-OCR from Google.

Is this an OCR input or an OCR output?

OCR stands for optical character recognition, and it is a device that reads printed text. Character by character, OCR scans the text optically, turns it into a machine-readable code, and saves it in the system memory.

How does the Tesseract algorithm work?

This technique is capable of decrypting and extracting text from a wide range of sources! It makes use of an improved version of the tesseract open source OCR technology, as its name suggests. We also use binarization to automatically binarize and preprocess photos so that Tesseract has an easier time decrypting them.

Also See: How to: Fix Vpn Error 720 on Windows 10 Using 7 Safe Solutions

Can someone please tell me how to use tesseract OCR

Can someone please tell me how to use tesseract OCR from linux4noobs

tesseract png > pdf stopped working

[help] tesseract png > pdf stopped working from commandline

Images credits: howtogeek

Table of Contents

OCR stands for Optical Character Recognition.

Tesseract OCR Installation

Tesseract OCR (Optical Character Recognition)

Using a Variety of Languages

Working with PDFs with Tesseract OCR

Conclusion

User Questions:

RELATED ARTICLESMORE FROM AUTHOR

Chamberlain Michigan | Nursing Schools – Chamberlain University

500 Internal Server Error NGINX Fix

How to: Fix Runtime Broker High Cpu Usage Error

Popular Category

RELATED ARTICLES MORE FROM AUTHOR