Automating Data Retrieval from Scanned Engineering Drawings and Documents

A programmatic approach to easily process engineering and design raster data by automating data retrieval from scanned engineering drawings and documents

Querying or retrieving data from scanned project documents is one of the toughest nuts to crack in an engineering project life cycle. Customers normally provide the consultant/contractor with existing engineering and design documents at the beginning of every project. Quite often, they supply only scanned PDF documents instead of the native file formats. This leads to several nightmares for the project team: engineers find themselves helpless as their searches for documents containing critical project information return nothing. Automating data retrieval from scanned engineering drawings and documents is one simple and practical way to tackle this issue. Let us see how we can go about it using Python.

Sample Data Collection

Let’s take a Piping & Instrument Diagram (P&ID) as a sample. P&IDs contain critical equipment and process information required at the very beginning of a project. I did a quick Google search and took one of the first few images that appeared in the search results. It turned out to be a plus that the image was not of good quality, because it makes it interesting to see how our code handles a partially blurred image in the test run.

Sample P&ID Image

The Coding Part

For the test run, I have converted the image to a PDF file, as most scanned documents come in PDF format. We are going to use the pytesseract library, a Python wrapper for the Tesseract OCR engine, to convert images to text. The funny part is that pytesseract only accepts images as input for OCR, so I am also going to use the pdf2image library, which relies on Poppler, to convert the PDF document to images prior to the OCR run. Here is the code for converting the PDF to images and then extracting the text from the converted images.

!pip install pytesseract
!pip install pdf2image

try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
import pdf2image

# Define executable paths for pytesseract and poppler
# Both programs must be installed before running the code
pt_path = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
# Folder with the Poppler binaries; None lets pdf2image look them up on PATH
pop_path = None

pytesseract.pytesseract.tesseract_cmd = pt_path

def extract_pdf_text(pdfpath):
    # extract images from the PDF file for feeding image_to_string()
    images = pdf2image.convert_from_path(pdfpath, poppler_path=pop_path)
    # define text variable for collecting text from all images
    pdftext = ""
    # process images one by one
    for img in images:
        pdftext = pdftext + "\n" + pytesseract.image_to_string(img)
    return pdftext


That’s all it takes to extract images from a PDF, then run OCR and extract text from those images. Needless to say, I have already fallen in love with Python after watching what it could accomplish with a few lines of code! Now it’s the testing part. Let’s test the accuracy of the extraction by randomly checking a few equipment tags and line numbers.

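# Test code prints whether the queried tag is available in the extracted text
# 'sample_pid.pdf' is a placeholder name for the PDF created from the sample image
pdf_text = extract_pdf_text('sample_pid.pdf')
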
print('P-102A is found' if ('P-102A' in pdf_text) else 'P-102A is not found' )
print('P-102B is found' if ('P-102B' in pdf_text) else 'P-102B is not found' )
print('T-101 is found' if ('T-101' in pdf_text) else 'T-101 is not found' )
print('V-104 is found' if ('V-104' in pdf_text) else 'V-104 is not found' )
print('E-104 is found' if ('E-104' in pdf_text) else 'E-104 is not found' )
print('12" Sch 10 CS is found' if ('12" Sch 10 CS' in pdf_text) else '12" Sch 10 CS is not found' )
print('2" Sch 40 CS is found' if ('2" Sch 40 CS' in pdf_text) else '2" Sch 40 CS is not found' )
print('6" Sch 40 CS is found' if ('6" Sch 40 CS' in pdf_text) else '6" Sch 40 CS is not found' )
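
The same checks can also be written as a loop, which makes the tag list easier to extend. This is only a sketch: the helper name check_tags and the stand-in text are my own illustration, and in practice you would pass in the text returned by the OCR run.

```python
def check_tags(text, tags):
    """Return a dict mapping each tag to True if it occurs in text."""
    return {tag: tag in text for tag in tags}


tags_to_check = [
    'P-102A', 'P-102B', 'T-101', 'V-104', 'E-104',
    '12" Sch 10 CS', '2" Sch 40 CS', '6" Sch 40 CS',
]

# Stand-in text for illustration only; replace with the real OCR output.
sample_text = 'P-102A ... T-101 ... 2" Sch 40 CS'

for tag, found in check_tags(sample_text, tags_to_check).items():
    print(f"{tag} is {'found' if found else 'not found'}")
```

Keep in mind that this is plain substring matching, so a query for 2" Sch 40 CS would also hit inside 12" Sch 40 CS.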

Program Output

Test code output

It is interesting to see that the OCR has failed to capture a few tags, highlighted in yellow in the above output. One reason could be that the quality of the image used is not very good. You can see the blurred text and graphics in the zoomed-in part of the image below. Another reason is that I haven’t used any of the Tesseract output quality improvement methods in my code. I am pretty sure that we can get much closer to a perfect result by tweaking these settings.

Zoomed in image
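
As a rough idea of what such quality improvements could look like, here is a minimal preprocessing sketch using Pillow. It has not been tuned on the sample P&ID: the function names, the scale factor, the threshold value and the choice of Tesseract's sparse-text page segmentation mode (--psm 11) are all my assumptions and would need experimenting with on real drawings.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(img, scale=2, threshold=160):
    """Grayscale, upscale and binarize a page image before OCR."""
    gray = ImageOps.grayscale(img)
    # Upscaling gives Tesseract more pixels per character on small text.
    big = gray.resize((gray.width * scale, gray.height * scale), Image.LANCZOS)
    # Simple global threshold: pixels brighter than `threshold` become white,
    # everything else becomes black.
    return big.point(lambda p: 255 if p > threshold else 0)


def ocr_page(img, psm=11):
    """Run Tesseract on a preprocessed page image (psm value is an assumption)."""
    # Imported here so the preprocessing step can be tried without Tesseract.
    import pytesseract
    return pytesseract.image_to_string(preprocess_for_ocr(img), config=f'--psm {psm}')
```

Rendering the PDF at a higher resolution, for example by passing dpi=300 to pdf2image.convert_from_path(), is another setting worth trying alongside this.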

Moving Forward

Document content extraction is one of the AI possibilities I discussed in my earlier blog post. In effect, the above code helps me do some of the most common activities I used to perform using a licensed PDF editor application. My plan is to extend the code with additional functions such as saving images in their original formats, extracting non-image data from the PDF file and so on. Meanwhile, please feel free to share your thoughts on making this idea suitable for real-world implementation.

Exploring AI Possibilities in the Field of Engineering Design and Documentation

The recent advancements in Artificial Intelligence have opened multiple opportunities for process improvements in the engineering design and documentation fields. Here are a few glimpses of various possibilities lying ahead.

Ever since I started exploring the foundations of Artificial Intelligence (thanks to the pandemic outbreak), I have found it super exciting to relate my past engineering experience to AI capabilities and visualize some possible applications of AI, especially Machine Learning, in the engineering design and documentation fields. A few thoughts on this topic are discussed below. Chances are that many of these applications are already in use. But hopefully you will come across something new in this article as well.

Use of Generative Design for Developing Complex Structures
Generative Design has already exhibited its potential by solving some complex design challenges in the Aeronautics and Automobile industries. It looks highly beneficial to use this AI based design process for developing optimal solutions for highly complex structures such as offshore platforms and bridges.

Document Type Identification and Categorization
Most of you working in the engineering domain might have come across this requirement multiple times in your career. For example, document controllers perform this categorization very often for enabling proper document distribution. For some cases like brownfield engineering modification projects or proposals, the customer might have thrown a bunch of existing documents at you without proper categorization or indexing. You end up putting in a lot of manual effort for proper identification of documents before starting your actual engineering/proposal works. With the help of Machine Learning, we could automate this process and save valuable time and effort. This method would be highly advantageous for design/construction firms repeatedly offering services to Energy/Resources/Infrastructure customers.
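
To make the idea a bit more concrete, here is a minimal sketch of a text classifier (multinomial naive Bayes with add-one smoothing) that guesses a document type from its extracted text. The class name, the three document types and the training snippets are all invented for illustration; a real setup would train on many documents per category and would most likely use an established ML library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesDocClassifier:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training snippets, invented for illustration only.
train_texts = [
    "piping and instrument diagram pump vessel control valve",
    "process flow diagram pump exchanger instrument loop",
    "isometric drawing pipe spool weld elbow flange",
    "piping isometric spool weld bill of material",
    "data sheet design pressure design temperature material",
    "equipment data sheet rated capacity design pressure",
]
train_labels = ["P&ID", "P&ID", "Isometric", "Isometric", "Datasheet", "Datasheet"]

clf = NaiveBayesDocClassifier().fit(train_texts, train_labels)
print(clf.predict("control valve and pump shown on instrument diagram"))
```

For scanned documents, the input text would come from an OCR step first, so the classifier has to cope with whatever errors the OCR leaves behind.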

Document Content Extraction from Vector/Hybrid/PDF Drawings
This is another headache that engineering and design professionals come across in their day-to-day life. While it is possible to get accurate information (for example, engineering schedules, material take-offs, document indexes etc.) from intelligent design systems, it takes a lot of effort to do the same in projects based on non-intelligent design systems. Moreover, the quality of existing documents, such as vector drawings with exploded attributes, hybrid drawings/documents that combine vector and image data, and PDF-only documents, leaves the design team helpless in preparing accurate reports/schedules. This could be overcome by using Machine Learning techniques that identify information in any document format with a similar level of accuracy.

Data Input for Digital Twins from Dumb Documents
This is again related to the document content identification technique using Machine Learning discussed above. Once we are able to identify the document content and map the interrelations between documents using the identified data, it becomes a lot easier to feed this data to Digital Twins (for example, by automatically mapping similar tags at both ends) for real-time monitoring.
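
The tag mapping step could be sketched roughly as below. This is plain pattern matching rather than Machine Learning, and everything in it is an assumption on my part: the tag pattern (one or two capital letters, a hyphen, three digits and an optional suffix letter), the function names, the document snippets and the list of twin asset tags.

```python
import re
from collections import defaultdict

# Assumed tag pattern: one or two capital letters, a hyphen, three digits,
# and an optional suffix letter (e.g. P-102A, T-101).
TAG_PATTERN = re.compile(r"\b[A-Z]{1,2}-\d{3}[A-Z]?\b")


def build_tag_index(documents):
    """Map each tag to the sorted list of document names that mention it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for tag in TAG_PATTERN.findall(text):
            index[tag].add(name)
    return {tag: sorted(names) for tag, names in index.items()}


def match_to_twin(index, twin_tags):
    """Split twin asset tags into those with and without document references."""
    linked = {t: index[t] for t in twin_tags if t in index}
    unlinked = [t for t in twin_tags if t not in index]
    return linked, unlinked


# Made-up snippets standing in for the OCR output of two documents.
documents = {
    "pid_sheet_1": "Pump P-102A feeds tank T-101 via E-104",
    "equipment_list": "P-102A centrifugal pump; P-102B spare; V-104 vessel",
}
index = build_tag_index(documents)
linked, unlinked = match_to_twin(index, ["P-102A", "V-104", "C-201"])
print(linked)
print(unlinked)
```

The unlinked list is the useful by-product here: it points at twin assets for which no supporting document was found.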

Automated Pipe Routing and Plant Design Optimization
I have seen some intelligent 3D design systems doing automated pipe routing, but I am not sure to what extent these systems offer multiple options for the designer to choose from. Using AI, it should be possible to present the designer with many design options based on the design constraints and factors. I believe that in future many plant design activities, such as equipment location and orientation, pipe routing and many other tasks, will be mostly handled by AI.

Real-time Designing and Visualization for Model Reviews
Okay, your customer doesn’t want the design option you chose to present. So what are the other options? Using AI, it looks quite feasible to present multiple design suggestions in real time. This would also help the customer and the design consultant/contractor understand the material take-off (MTO) variations in real time, giving an idea of the cost differences on the spot for feasibility discussions.
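
The MTO comparison itself is simple arithmetic once the quantities for each option are available. Here is a small sketch, with the function name, the item names, the quantities and the unit costs all invented as placeholders:

```python
def mto_delta(option_a, option_b, unit_costs):
    """Compare two material take-offs.

    Returns the per-item quantity differences (b minus a) and the
    total cost difference between the two options.
    """
    items = sorted(set(option_a) | set(option_b))
    qty_delta = {i: option_b.get(i, 0) - option_a.get(i, 0) for i in items}
    cost_delta = sum(qty_delta[i] * unit_costs[i] for i in items)
    return qty_delta, cost_delta


# All quantities and unit costs below are invented placeholders.
option_a = {'6" pipe (m)': 120, '6" elbow (ea)': 14}
option_b = {'6" pipe (m)': 95, '6" elbow (ea)': 20, '6" tee (ea)': 2}
unit_costs = {'6" pipe (m)': 40.0, '6" elbow (ea)': 55.0, '6" tee (ea)': 90.0}

qty_delta, cost_delta = mto_delta(option_a, option_b, unit_costs)
print(qty_delta)
print(cost_delta)
```

A negative cost difference means the second option needs less material by value than the first.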

Replacing Rule Based Standard Compliance Checking with AI Based Ones
How often have you seen your drawings/documents rejected by your customer for not adhering to their design/drafting standards? Quite often, I would say. The biggest challenge here is that it is really hard to develop such a rule based standard checking utility for both the customer and the design consultants. Instead, we could use Machine Learning for preparing algorithms that learn the customer standards from thousands of existing customer drawings. This method would also make it easy to implement standard revisions without much headache by employing continuous learning.

These are only glimpses of AI applications in the engineering design and documentation fields. I believe there are many more applications yet to be explored. Please expect more blog posts on other applications in the future as and when ideas pop up. Meanwhile, please stay in touch by subscribing to the RSS / email feed of this blog using the widgets in the right-side menu.