Technology Powered Knowledge Base

Extract Text Content from Rich Documents using Simple PHP

It’s much easier to extract text from rich documents with a PHP script by using exec() on Linux/Windows or COM() on Windows. But if we don’t have a dedicated/VPS server, or the supporting applications (MS Word, AntiDOC, Adobe PDF, etc.) are not installed, then exec() and COM() are not an option.
This is a technical document explaining how we can extract text from DOC, PDF, ODT, HTML, DOCX, XLSX, PPTX, etc. with a block of general PHP code. Microsoft uses the Open XML format for its rich documents in MS Office 2007 and higher. That’s why there are known solutions for extracting raw text from DOCX, XLSX and PPTX files, but it is much more painful for DOC files (MS Office 2003 and below). We have a working solution for DOC, but the accuracy of text extraction is 70% – 95%.
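
To give an idea of why the Open XML formats are the easy case, here is a minimal sketch for DOCX only. A DOCX file is a ZIP archive whose main body is stored in word/document.xml, so PHP's bundled ZipArchive class is enough and no exec() or COM() is needed. The function name extractDocxText() is our own assumption for illustration, not a built-in PHP function, and the tag stripping is deliberately crude: it ignores headers, footers, footnotes and tables.

```php
<?php
// Hypothetical helper (name chosen for this sketch): pulls raw text out of
// a DOCX file using only the zip extension that ships with PHP.
function extractDocxText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // missing file or not a readable ZIP container
    }

    // The main document body lives in word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        return null; // a ZIP file, but not a DOCX
    }

    // Turn paragraph ends and tab elements into whitespace before
    // stripping the markup, so paragraphs do not run together.
    $xml = str_replace('</w:p>', "\n", $xml);
    $xml = preg_replace('/<w:tab\s*\/>/', "\t", $xml);

    // Remove all remaining tags, then decode XML entities such as &amp;.
    $text = strip_tags($xml);
    return trim(html_entity_decode($text, ENT_QUOTES | ENT_XML1, 'UTF-8'));
}
```

Calling extractDocxText('report.docx') (the file name is just an example) returns the text with one line per paragraph, or null when the file cannot be read as a DOCX. The same open-the-ZIP-and-read-the-XML idea applies to XLSX and PPTX, only with different internal file names. The old binary DOC format has no such container, which is why it is the painful case.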