Querying or retrieving data from scanned project documents is one of the toughest nuts to crack in an engineering project life cycle. Customers normally provide the consultant or contractor with existing engineering and design documents at the beginning of every project. Quite often, they supply only scanned PDF documents instead of the native file formats. This causes real headaches for the project team: a scanned PDF is typically just a set of page images with no searchable text, so engineers' searches for documents containing critical project information return nothing. Automating data retrieval from scanned engineering drawings and documents is one simple and practical way to tackle this issue. Let us see how we can go about it using Python.
Sample Data Collection
Let’s take a Piping & Instrumentation Diagram (P&ID) as a sample. P&IDs contain critical equipment and process information required at the very beginning of a project. I did a quick Google search and took one of the first few images that appeared in the results. The image was not of good quality, which turned out to be a plus: it makes it interesting to see how our code handles a partially blurred image in the test run.
The Coding Part
For the test run I converted the image to a PDF file, since most scanned documents come as PDFs. We are going to use the pytesseract library, a Python wrapper around the Tesseract OCR engine, to convert images to text. The catch is that pytesseract only accepts images as input for the OCR step. So I am also going to use the pdf2image library to convert the PDF pages to images before the OCR run. Both libraries rely on external programs, Tesseract and Poppler, which have to be installed separately. Here is the code for converting the PDF to images and then extracting the text from the converted images.
!pip install pytesseract
!pip install pdf2image

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
import pdf2image

# Define executable paths for pytesseract and poppler
# Need to keep these programs ready before running the code
pt_path = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
pop_path = 'C:\\Users\\admin\\Downloads\\poppler-0.90.0\\Library\\bin'
pytesseract.pytesseract.tesseract_cmd = pt_path

def extract_pdf_text(pdfpath):
    # extract images from the PDF file for feeding image_to_string()
    images = pdf2image.convert_from_path(pdfpath, poppler_path=pop_path)
    # define text variable for collecting text from all images
    pdftext = ''
    # process images one by one
    for pg, img in enumerate(images):
        pdftext = pdftext + "\n" + pytesseract.image_to_string(img)
    return pdftext

pdf_text = extract_pdf_text('C:\\temp\\sample_PID.pdf')
That’s all it takes to convert a PDF to images, run OCR and extract the text from those images. Needless to say, I have already fallen in love with Python after watching what it can accomplish with a few lines of code! Now for the testing part. Let’s test the accuracy of the extraction by spot-checking a few equipment tags and line numbers.
# Test code prints whether the queried tag is available in the extracted text
tags = ['P-102A', 'P-102B', 'T-101', 'V-104', 'E-104',
        '12" Sch 10 CS', '2" Sch 40 CS', '6" Sch 40 CS']
for tag in tags:
    print(tag + ' is found' if tag in pdf_text else tag + ' is not found')
It is interesting to see that the OCR failed to capture a few of the tags, highlighted in yellow in the above output. One reason could be the poor quality of the image used. You can see the blurred text and graphic content in the zoomed-in part of the image below. Another reason could be that I haven’t used any of the Tesseract output quality improvement methods in my code, such as rendering the pages at a higher resolution, cleaning up the image before OCR, or choosing a page segmentation mode better suited to a drawing. I expect the results to improve considerably with these tweaks, although I have not tested that yet.
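Apart from image quality and Tesseract settings, the exact substring test used above is quite strict: a single misread character or an extra space is enough to report a tag as not found. Here is a minimal sketch of a more forgiving lookup. The character mapping and the helper names normalize() and tag_found() are my own illustrative assumptions, not something taken from the test run above, and I have not checked which misreads actually occurred in the sample P&ID.

```python
import re

# Look-alike characters that OCR output often mixes up.
# NOTE: this mapping is an illustrative assumption; adjust it after
# inspecting the raw text extracted from your own documents.
_CONFUSABLE = str.maketrans({
    "O": "0", "o": "0",             # letter O read in place of zero
    "I": "1", "l": "1", "|": "1",   # I, lowercase L or a bar read in place of one
    "\u2013": "-", "\u2014": "-",   # en/em dash read in place of a hyphen
    "\u201c": '"', "\u201d": '"',   # curly quotes read in place of the inch mark
})

def normalize(text):
    """Fold look-alike characters, drop all whitespace and upper-case the rest."""
    text = text.translate(_CONFUSABLE)
    return re.sub(r"\s+", "", text).upper()

def tag_found(tag, ocr_text):
    """Return True if the tag occurs in the OCR text after normalization."""
    return normalize(tag) in normalize(ocr_text)

# Hypothetical OCR output with typical misreads, for demonstration only
sample_text = 'P\u20141O2A pump\n12\u201d Sch  10 CS\nT-l01'
for tag in ['P-102A', '12" Sch 10 CS', 'T-101', 'V-104']:
    print(tag + ' is found' if tag_found(tag, sample_text) else tag + ' is not found')
```

The trade-off is a higher chance of false matches. Folding O into 0 and removing whitespace means that text which merely looks similar, or that runs across two neighbouring labels, can also match, so a hit from this kind of lookup is best treated as a candidate to verify against the drawing.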
Document content extraction is one of the AI possibilities I discussed in my earlier blog post. In effect, the above code lets me do some of the most common tasks I used to perform with a licensed PDF editor application. My plan is to extend the code with additional functionality such as saving the images in their original formats and extracting non-image data from the PDF file. Meanwhile, please feel free to share your thoughts on making this idea suitable for real-world implementation.