Automating Data Retrieval from Scanned Engineering Drawings and Documents

A programmatic approach to easily process engineering and design raster data by automating data retrieval from scanned engineering drawings and documents

Querying or retrieving data from scanned project documents is one of the toughest nuts to crack in an engineering project life cycle. Customers normally hand the consultant/contractor their existing engineering and design documents at the beginning of every project, and quite often they supply only scanned PDF documents in place of the actual native file formats. This in turn leads to several nightmares for the project team: engineers find themselves helpless when their searches for documents containing critical project information return nothing. Automating data retrieval from scanned engineering drawings and documents is one simple and practical way to solve this issue. Let us see how we can go about it using Python programming.

Sample Data Collection

Let’s take a Piping and Instrumentation Diagram (P&ID) as a sample. P&IDs contain critical equipment and process information required at the very beginning of a project. I did a quick Google search and took one of the first few images that appeared in the results. It turned out to be a plus point that the image was not of good quality: it makes it interesting to see how our code handles a partially blurred image in the test run.

Sample P&ID Image

The Coding Part

For the test run I have converted the image to a PDF file, as most scanned documents are found in PDF format. We are going to use the highly efficient pytesseract library for converting the images to text. The funny part here is that pytesseract only accepts images as input for the OCR activity, so I am also going to use the pdf2image library to convert the PDF document to images prior to the OCR run. Here is the code for converting the PDF to images and then extracting the text data from the converted images.

!pip install pytesseract
!pip install pdf2image
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
import pdf2image

# Define executable paths for pytesseract and poppler
# Need to keep these programs ready before running the code 
pt_path='C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
pop_path='C:\\Users\\admin\\Downloads\\poppler-0.90.0\\Library\\bin'

pytesseract.pytesseract.tesseract_cmd = pt_path

def extract_pdf_text(pdfpath):
    # extract page images from the PDF file for feeding image_to_string()
    images = pdf2image.convert_from_path(pdfpath, poppler_path=pop_path)

    # collect the OCR text from all pages into a single string
    pdftext = ''

    # process the page images one by one
    for img in images:
        pdftext = pdftext + "\n" + pytesseract.image_to_string(img)
    return pdftext

pdf_text = extract_pdf_text('C:\\temp\\sample_PID.pdf')
 


That’s all it takes to extract images from a PDF, run OCR on them and pull out the text. Needless to say, I have already fallen in love with Python after watching what it can accomplish with a few lines of code! Now for the testing part. Let’s test the accuracy of the extraction by randomly checking a few equipment tags and line numbers.

# Test code prints whether each queried tag is available in the extracted text

tags = ['P-102A', 'P-102B', 'T-101', 'V-104', 'E-104',
        '12" Sch 10 CS', '2" Sch 40 CS', '6" Sch 40 CS']
for tag in tags:
    print(f'{tag} is found' if tag in pdf_text else f'{tag} is not found')
 

Program Output

Test code output


It is interesting to see that the OCR has failed to capture a few tags, highlighted in yellow in the above output. One of the reasons could be that the quality of the image used is not very good; you can see the blurred text and graphic content in the zoomed-in part of the image below. Another reason would be that I haven’t used any of the Tesseract output quality improvement methods in my code. I am pretty sure that we could obtain an almost perfect result by tweaking these settings; a minimal preprocessing sketch along those lines follows the zoomed-in image below.

Zoomed in image
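As an illustration of such tweaks, here is a minimal preprocessing sketch (not part of my original test run) that reuses the imports and paths from the code above, upscales and binarizes each page image, and passes a page-segmentation hint to Tesseract. The threshold value and the '--psm 6' setting are assumptions that would need tuning for a given drawing.

from PIL import Image

def preprocess_for_ocr(img, threshold=160, scale=2):
    # convert to grayscale so thresholding works on a single channel
    gray = img.convert('L')
    # upscale the page image; small P&ID text often OCRs better at a higher resolution
    w, h = gray.size
    gray = gray.resize((w * scale, h * scale), Image.LANCZOS)
    # binarize: keep the dark linework/text, drop the lighter background noise
    return gray.point(lambda p: 255 if p > threshold else 0)

def extract_pdf_text_cleaned(pdfpath):
    images = pdf2image.convert_from_path(pdfpath, poppler_path=pop_path)
    pdftext = ''
    for img in images:
        clean = preprocess_for_ocr(img)
        # '--psm 6' asks Tesseract to treat the page as a single block of text
        pdftext = pdftext + "\n" + pytesseract.image_to_string(clean, config='--psm 6')
    return pdftext

Calling extract_pdf_text_cleaned in place of extract_pdf_text and rerunning the tag checks would show whether the missed tags are now picked up.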

Moving Forward

Document content extraction is one of the AI possibilities I discussed in my earlier blog post. In effect, the above code lets me do some of the most common activities I used to perform with a licensed PDF editor application. My plan is to extend the code with additional functionality such as saving the images in their original formats, extracting non-image data from the PDF file and so on (a rough sketch of the non-image part is given below). Meanwhile, please feel free to share your thoughts on making this idea suitable for real-world implementations.
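For the non-image part, a library such as PyPDF2 (not used in the code above) can read text that is already embedded in the PDF, so OCR is only needed for the truly scanned pages. Here is a minimal sketch, assuming PyPDF2 version 3 or later is installed:

!pip install PyPDF2
from PyPDF2 import PdfReader

def extract_embedded_text(pdfpath):
    # read text that is stored natively in the PDF (no OCR involved)
    reader = PdfReader(pdfpath)
    pages = []
    for page in reader.pages:
        # extract_text() returns little or nothing for purely scanned pages
        pages.append(page.extract_text() or '')
    return '\n'.join(pages)

embedded_text = extract_embedded_text('C:\\temp\\sample_PID.pdf')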

My First Python Application – A Word Cloud Based Resume Optimizer

A real-world application of Python for Data Visualization that I learnt from the recently completed Artificial Intelligence Foundation course.

It has been a couple of months since I started learning the Foundations of Artificial Intelligence program from SkillUp Online. After putting in some dedicated effort, I successfully completed the well-crafted course last Wednesday. So I thought of applying my learning to a real-world scenario by developing a Python application.

Here is my first application of the Python programming that I learnt through the course. Well, it is not a CAD/Design AI application as I mentioned in my earlier post; rather, I would start with a simple yet really useful application for now. As an aspiring job seeker, I thought of analyzing my resume for relevant keywords by visualizing how prominently they appear in it. Moreover, it is a well-known fact that Applicant Tracking Systems (ATS) scrutinize resumes based on keywords and filter them accordingly. So I wanted to analyze the current state of my resume and optimize it.

Here is the result I got from the first run of the Python program on my resume:

Word Cloud Before Resume Optimization

A quick analysis of the word cloud revealed that my resume did not properly represent my key skills. I wanted to represent key strengths such as Information Management, Automation, Innovation and VBA in a better way, although other skills such as CAD, Data Control, Design and Engineering were already well represented. After several iterations of the resume, I was able to come up with a balanced one. The word cloud generated by the same application after the fine-tuning is shown below.

Word Cloud After Resume Optimization

And here is the tiny (and admittedly dirty) Python application that I wrote to accomplish this. Since this is my first program, I haven’t cared much about following best programming practices; I hope to improve that part in future though. Nevertheless, it was really amazing to see what such a small Python program could achieve! Here is the link to the GitHub Gist page if you want to play with the code.

"""
Word Cloud Resume Optimizer
By Mohamed Haris
GitHub ID: @harismohamed

"""
#install and import required libraries
!pip install python-docx 
!pip install wordcloud 
from docx import Document
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

#set path for resume file
myresume="C:\\temp\\my_resume.docx"

#read the document
doc = Document(myresume)

#collect text from each paragraph
resumetext = '\n'.join(para.text for para in doc.paragraphs)

#start with the default stop words
stopwords = set(STOPWORDS)

#add custom stop words for better results
#custom stopwords may be stored in a text file
custstopwordsfile="C:\\temp\\stopwords.txt"
with open(custstopwordsfile, 'r') as f:
    custstopwords = f.read().splitlines()

#add custom stop words to the default ones
stopwords.update(custstopwords)

#instantiate a word cloud object
resume_wc = WordCloud(
    background_color='black',
    stopwords=stopwords)

#adjust the image rendering area
fig = plt.figure()
fig.set_figwidth(12) # set width
fig.set_figheight(16) # set height

#generate the word cloud
resume_wc.generate(resumetext)
plt.imshow(resume_wc, interpolation='bilinear')
plt.axis('off')
plt.show()

It would be fun to host this program online so that readers can analyze their own resumes, but right now I don’t have a clear idea of how to go about it. I shall post an update once it is configured online for public testing; till then, I advise you to run the code on your own system to test the output.
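One rough possibility, purely a sketch of something I have not actually deployed, is a small Streamlit app that accepts a .docx upload and renders the word cloud in the browser; the file name resume_wc_app.py and the layout here are my own assumptions:

# resume_wc_app.py
import streamlit as st
from docx import Document
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

st.title("Word Cloud Resume Optimizer")
uploaded = st.file_uploader("Upload your resume (.docx)", type="docx")

if uploaded is not None:
    # python-docx accepts a file-like object, so the upload can be read directly
    doc = Document(uploaded)
    resumetext = '\n'.join(para.text for para in doc.paragraphs)

    # default stop words only here; custom ones could be added via a text box
    wc = WordCloud(background_color='black', stopwords=set(STOPWORDS))
    wc.generate(resumetext)

    fig = plt.figure(figsize=(12, 16))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    st.pyplot(fig)

Running it with "streamlit run resume_wc_app.py" serves the page locally; actual public hosting would be a separate step.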

Exploring AI Possibilities in the Field of Engineering Design and Documentation

The recent advancements in Artificial Intelligence have opened multiple opportunities for process improvements in the engineering design and documentation fields. Here are a few glimpses of various possibilities lying ahead.

Ever since I started exploring the foundations of Artificial Intelligence – thanks to the pandemic outbreak – I have found it super exciting to relate my past engineering experience with AI capabilities and visualize possible applications of AI, especially Machine Learning, in the engineering design and documentation fields. A few thoughts around this topic are discussed below. Chances are that many of these applications are already in action at the moment, but hopefully you may come across something very new in this article as well.

Use of Generative Design for Developing Complex Structures
Generative Design has already exhibited its potential by solving some complex design challenges in the Aeronautics and Automobile industries. It looks highly beneficial to use this AI-based design process to develop optimal solutions for highly complex structures such as offshore platforms and bridges.

Document Type Identification and Categorization
Most of you working in the engineering domain might have come across this requirement multiple times in your career. For example, document controllers perform this categorization very often to enable proper document distribution. In some cases, such as brownfield engineering modification projects or proposals, the customer might throw a bunch of existing documents at you without proper categorization or indexing, and you end up putting in a lot of manual effort to identify the documents properly before starting your actual engineering/proposal work. With the help of Machine Learning, we could automate this process and save valuable time and effort (a rough sketch of the idea is shown below). This method would be highly advantageous for design/construction firms repeatedly offering services to Energy/Resources/Infrastructure customers.
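As a rough illustration of the idea (a sketch, not a production implementation), the text extracted from each document, for instance via OCR as in the first post above, could be fed to a simple scikit-learn text classifier trained on documents whose types are already known; the labels and training snippets below are entirely hypothetical:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical training data: extracted text and the known document type
train_texts = [
    "pump P-102A centrifugal discharge line instrument loop",
    "cable schedule feeder breaker panel lighting layout",
    "datasheet design pressure temperature material of construction",
]
train_labels = ["P&ID", "Electrical", "Datasheet"]

# TF-IDF features plus a linear classifier is a reasonable first baseline
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# classify a newly received, uncategorized document from its extracted text
new_doc_text = "control valve transmitter vessel nozzle line number"
print(model.predict([new_doc_text])[0])

Even a simple TF-IDF baseline like this can serve as a useful first pass once a reasonable number of labelled documents per category is available.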

Document Content Extraction from Vector/Hybrid/PDF Drawings
This is another headache that engineering and design professionals come across in their day-to-day work. While it is possible to get accurate information (for example, engineering schedules, material take-offs, document indexes, etc.) from intelligent design systems, it takes a lot of effort to get this done in projects based on non-intelligent design systems. Moreover, the quality of existing documents, such as vector drawings with exploded attributes, hybrid drawings/documents that combine vector and image data, and PDF-only documents, leaves design people helpless when preparing accurate reports/schedules. This could be overcome by using Machine Learning techniques that can identify information in any document format with the same level of accuracy.

Data Input for Digital Twins from Dumb Documents
This is again related to the document content identification technique using Machine Learning discussed above. Once we are able to identify the document content and map the interrelations between these documents using the identified data, it would be a lot easier to feed this data to Digital Twins (for example, by automatically mapping matching tags at both ends) for real-time monitoring; a toy example of such tag mapping is shown below.
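As a toy illustration of the tag-mapping idea (the tag pattern and the digital-twin tag register below are made up for the example), tags recognized in extracted document text could be matched against the tag list exposed by a digital twin:

import re

# hypothetical tag register exposed by a digital twin / asset database
twin_tags = {"P-102A", "P-102B", "T-101", "E-104"}

# pull candidate equipment tags out of extracted document text with a simple pattern
extracted_text = "pump P-102A discharges to tower T-101 via exchanger E-104 and valve V-104"
found_tags = set(re.findall(r'\b[A-Z]{1,2}-\d{3}[A-Z]?\b', extracted_text))

# tags present in both places can be linked automatically; the rest need manual review
auto_mapped = found_tags & twin_tags
needs_review = found_tags - twin_tags
print("auto-mapped:", sorted(auto_mapped))
print("needs review:", sorted(needs_review))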

Automated Pipe Routing and Plant Design Optimization
I have seen some intelligent 3D design systems in action doing automated pipe routing, but I am not sure to what extent these systems offer multiple options for the designer to choose from. Using AI, it should be possible to present the designer with many design options based on the design constraints and factors. I believe that in future many plant design activities, such as equipment location, orientation, pipe routing and many other tasks, will be mostly handled by AI.

Real-time Designing and Visualization for Model Reviews
Okay, your customer doesn’t want the design option you chose to present. So what are the other options? Using AI, it looks highly feasible to present multiple design suggestions on a real-time basis. This would also help the customer and the design consultant/contractor understand the material take-off (MTO) variations in real time, giving an idea of the cost differences on the spot for feasibility discussions.

Replacing Rule Based Standard Compliance Checking with AI Based Ones
How often have you seen your drawings/documents rejected by your customer for not adhering to their design/drafting standards? Quite often, I would say. The biggest challenge here is that it is really hard for both the customer and the design consultants to develop such a rule-based standard-checking utility. Instead, we could use Machine Learning to prepare algorithms that learn the customer standards from thousands of existing customer drawings. This method would also make it easy to implement standard revisions without much headache by employing continuous learning.

These are only glimpses of AI applications in the engineering design and documentation fields. I believe there are many more applications yet to be explored. Please expect more blog posts on various other applications in the future as and when ideas pop up. Meanwhile, please stay in touch by subscribing to the RSS/Email feeds of this blog using the right-side menu widgets.

Application of Artificial Intelligence for Early Detection of Pandemic Outbreak

Some random thoughts on how Artificial Intelligence could be used for early identification of pandemic outbreaks

Today I stumbled upon an interesting BBC article titled ‘Treating cancer, stopping violence… How AI protects us‘ and that suddenly sparked a few thoughts on how AI could help us with early prediction of pandemics/epidemics in the coming days. A Google search turned up a few initiatives in this regard. I would like to discuss these initiatives first and then talk about my personal views on further opportunities in this direction.

The BBC article mentions two interesting systems in use for early prediction of infectious diseases such as dengue fever, yellow fever, Zika and chikungunya. The first, named Artificial Intelligence in Medical Epidemiology (AIME), uses case reports pulled in from local hospitals, combined with weather and social factors, to predict an outbreak well in advance. Another project from Microsoft, called Microsoft Premonition, employs scalable monitoring of the environment to detect disease threats early using robotics and genomics. Its cloud-scale genomic analysis tries to identify all the species of organisms and viruses in environmental samples to spot new transmission patterns.

More recently, an article titled ‘AI could help with the next pandemic—but not with this one‘ in MIT Technology Review explains how companies like BlueDot and Metabiota used a range of natural language processing (NLP) algorithms to monitor news outlets and official health care reports in different languages around the world to provide early indications of the COVID-19 outbreak. They also applied air travel data analysis to predict the pace of spread, with reasonably accurate results.

One problem with the big data analysis method described above is that not all countries allow transparent sharing of information on news channels or social media platforms. In that case, depending heavily on news/social media analysis for pandemic prediction may produce inconsistent results. We might need additional techniques as well to come up with a more reliable result. For example, we could use Computer Vision along with thermal imaging sensors in crowded places such as bus stations, railway stations and airports to identify abnormal patterns. If we can apply AI techniques such as Natural Language Processing (NLP), speech and Computer Vision across multiple areas, news channels, social media, video/voice communications and crowded places, then we would be in a very good position to pick up the hints of a pandemic at a very early stage. Needless to say, only early identification and subsequent rapid preventive measures can ensure the containment of the disease within a small geographical area.

Of course, data privacy is a major concern here, but it is not limited to this case; most AI applications share the same concern due to the heavy volume of data they use. It needs to be mitigated at appropriate levels to ensure the survival of human beings. At some point we need to stop thinking in absolute terms and start perceiving things in a comparative mode, to ensure that the rules and regulations do not compromise our existence in this world.

Micro Workshare System – Revisiting an Old Innovation Idea in the Context of COVID-19

An old innovation idea proposed in an entirely different context fits very well for handling the effect of COVID-19 on workload imbalances

Micro Workshare System (MWS) is one of my favourite ideas in my innovation history and the first one to receive any kind of seed funding in my entire career. It was proposed in 2014, primarily targeting the workload imbalance resulting from recession in various geographies, especially for multinational companies. At the time, it was taken as one of the two base ideas for the most eventful innovation venture in Worley, my previous organization. I guess it would be inappropriate to disclose the details of that idea outside the seed-funded organization, so I would rather investigate how the concept would apply to resolving the workload imbalance caused by COVID-19.

The core of the concept is to break up the conventional way of work-sharing, i.e. enabling individuals to take over work from remote offices without being part of a dedicated workshare team, or rather, enabling individuals to accept work from remote teams when there is not enough work available at their own locations. Whatever the scenario, MWS basically investigates the systems and procedures required to enable individuals to work remotely with distant project teams, and that is what makes this idea highly significant in this era of COVID-19 remote working. Although the idea originally targeted the workshare domain, the same principles are equally applicable to the work-from-home scenario happening in organizations today.

It is not as easy as it sounds though. There are many barriers to overcome to make it possible. When it comes to individuals distributed over multiple geographies, many complications come up in areas such as finance, legal, etc. (data protection, for example). In summary, it requires a lot of cross-functional work at the corporate level to bring such a channel to life.

While it is true that it takes a lot of effort to establish such an individual-based channel, what COVID-19 has recently exposed is the pathetic situation many multinational companies are in for not having such a system in place. We saw senior management struggling to respond to the unexpected situation and compromising drastically on their delivery. It is a harsh reality that there is always uncertainty surrounding us in the form of natural disasters, pandemics, epidemics, etc., and we see those things happen more often than ever in the current era. So it is high time that organizations bring these systems into the mainstream so that work carries on as normal despite calamities happening in any region around the world.

Interestingly, there seems to have been some good progress in this area since the COVID-19 days began. Some companies have already declared that a major share of their workforce will be working from home post COVID-19 as well. While it would be a comparatively easy task for some industries, such as Information Technology, to adapt to this changing situation, it would be a real challenge for industries such as Engineering, Construction and Manufacturing, to mention a few. For some industries, it may require redefining work methodologies from scratch to adapt to this challenge.

One thing is for sure: we are going to see many Micro Workshare Systems taking over from conventional office work methods in the days to come. The sooner organizations adapt to this changing scenario, the better equipped they will be to handle unforeseen situations. Let us wait and watch how it is going to solve our work challenges in the future.

Stay safe!

A Plant with a Passport

An innovation idea that aims to balance human intervention on nature by developing an environment-friendly mindset in kids

‘A Plant with a Passport’ (PwP) is a humble attempt to sow a few seeds of innovation in the future generations while preparing them for the climate change challenges that are getting more intense day by day. I thought of starting it with my own kids for two reasons. First, even though I was very active in my last company's innovation initiatives for more than 11 years, I had spent only a little time with my kids nurturing their talents and guiding them in their extra-curricular activities. That makes it indispensable for me to start some initiatives with them to develop their attitude (by the way, my perspective of innovation is this: “Innovation is an attitude for a person, whereas it is a culture for a company”). Secondly, I realize that innovation should not be restricted to the walls of your organization only; it should be embedded in all dimensions of your life. A few weeks after stepping down from my last company, I think this is the best time for me to cater to the needs of my family while at the same time spreading the light of innovation in the society I live in.

The idea is to nurture the habit of planting in the younger generations, especially kids, making it an integral part of their entire life cycle. At the very basic level, each planting is recorded in a document called a Plant Passport, and each life-cycle stage of that specific plant is recorded in the same document till the end. Even though I have started the first few passports using pocket notes, my actual intention is to have an app for doing the same, making it very simple for the plant owner to maintain multiple plants simultaneously. Additionally, an app would make it easy to spread the initiative beyond boundaries, evaluate the overall progress and recognize individuals based on their progress. My aspiration is to embed this in the school curriculum, recognizing each student's performance through a grading system, and to further extend the activities beyond schools and colleges by making it a desired qualification for jobs, promotions, government funding, etc. This way, we can prompt every individual to follow the practice of planting throughout their entire life, thereby planting millions or billions of plants all around the world.

So here is the first step (a small step though; I am always inspired by the Neil Armstrong quote ;-). We bought two Malabar Plum (Syzygium cumini) plants from our nearby nursery. Ownership of each plant was assigned to my kids, Farhan and Faiha, respectively.

So my responsibility as the Plant Sponsor is done. Now it is the Plant Owners' responsibility to plant them and take care of them from this stage throughout the entire plant life cycle. Of course, I still have a few more responsibilities as a father, such as helping them dig pits of sufficient dimensions :-).

Farhan and Faiha with their plants
Faiha planting
Faiha watering

Now it’s Farhan’s turn as the Plant Owner. Here he goes:

Farhan planting
Farhan watering

And here are the first two passports, Farhan's and Faiha's. As mentioned, this is just an initial format to keep things rolling. It needs to be digitalized (yes, as mentioned in my introductory blog post, digitalization is my favourite topic ;-) for bulk application, and the passport fields may change based on the digital transformation.

Pocket notes as initial passports
Front page of first Plant Passport

A few days after planting, both plants look very healthy and a few new leaves seem to be growing on each of them. While Faiha is very diligent about watering her plant, Farhan needs a little bit of a push to do it promptly. Anyway, there is not much progress to update the passports with yet. I hope to update you on the progress after some time.

Like all my previous innovation initiatives, I dream big on this one and hope to see the initiative spread beyond my kids, prompting millions of people to plant billions of trees, thus protecting our nature and future generations from the climate change challenges. I thank you all in advance for your invaluable comments and suggestions on improving this idea and shaping it into a perfect one.

#Innovation #ClimateChange #GlobalWarming #PwP

Hello world!

Hello! Welcome to InnDiEn.com, a place where Innovation, Digitalization and Engineering come together. My name is Mohamed Haris, and I am a professional with extensive experience in Engineering Information Management, CAD Design & Systems. The name InnDiEn comes from the first few letters of Innovation, Digitalization and Engineering. Initially my plan is to publish blog posts, mostly based on my personal experiments in innovation, together with discussions on the latest digitalization and digital transformation happenings, particularly in the Engineering fields. Later on, I would like to expand this site beyond blog posts. Please stay tuned for further updates. Thank you very much for visiting this site!