Technology Powered Knowledge Base

Extract Text Content from Rich Documents using Simple PHP


It is much easier to extract text from rich documents by calling external tools from a PHP script, through exec() on Linux/Windows or COM() on Windows. But if we don't have a dedicated/VPS server, or the supporting applications (MS Word, AntiDOC, Adobe PDF etc) are not installed, then exec() and COM() are not an option.
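Whether the exec() or COM() route is open on a given host can be checked from PHP itself. The following is a minimal sketch, not a complete capability test: it assumes a disabled exec() shows up either as a missing function or in the disable_functions ini setting, and the function name available_backends() and its array keys are made up for illustration.

```php
<?php
// Hypothetical helper: reports which extraction routes this PHP install could use.
function available_backends(): array
{
    // Shared hosts often list exec in the disable_functions ini setting.
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));

    return [
        // exec() must exist and must not be disabled
        'exec' => function_exists('exec') && !in_array('exec', $disabled, true),
        // The COM class is only provided by the Windows-only com_dotnet extension
        'com'  => class_exists('COM'),
        // ZipArchive (zip extension) opens ZIP-based documents such as DOCX or ODT
        'zip'  => class_exists('ZipArchive'),
        // DOMDocument (dom extension) parses the XML stored inside those archives
        'dom'  => class_exists('DOMDocument'),
    ];
}
```

Calling print_r(available_backends()); on the target host shows at a glance which of the approaches in this article can work there.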
 
This is a technical document explaining how we can extract text from DOC, PDF, ODT, HTML, DOCX, XLSX, PPTX etc with a block of plain PHP code. Microsoft uses the Open XML format for its rich documents in MS Office 2007 and higher. That's why there are known solutions for extracting raw text from DOCX, XLSX and PPTX files, but it is much more painful for DOC (MS Office 2003 and below) files. We have a working solution for DOC, but its text extraction accuracy is only 70% – 95%.
 
DOC to TEXT:


<?php
function doc_to_text($input_file){
	// Read the whole file as raw bytes; returns false if it cannot be read
	$stream_text = @file_get_contents($input_file);
	if($stream_text === false){
		return "";
	}
	// Split the byte stream on carriage returns (0x0D)
	$stream_line = explode(chr(0x0D), $stream_text);
	$output_text = "";
	foreach($stream_line as $single_line){
		// Skip empty chunks and chunks containing a NUL byte (treated as binary data)
		if((strpos($single_line, chr(0x00)) !== FALSE) || (strlen($single_line) == 0)){
			continue;
		}
		$output_text .= $single_line." ";
	}
	// Keep only letters, digits, whitespace and a few punctuation characters
	$output_text = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $output_text);
	return $output_text;
}

echo doc_to_text("sample.doc");
?>
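To see why the DOC approach is only approximate, it helps to run the same filtering steps on a small hand-made byte string. The sketch below is illustrative only: filter_doc_stream() is a hypothetical stand-in for the body of doc_to_text(), and the sample bytes are invented, not taken from a real DOC file. It shows that any chunk containing a NUL byte is dropped whole, which would also discard text stored as two-byte characters.

```php
<?php
// Hypothetical helper: the same filtering steps as doc_to_text(), applied to a string.
function filter_doc_stream(string $stream_text): string
{
    $output_text = "";
    foreach (explode("\r", $stream_text) as $chunk) {
        // Drop empty chunks and any chunk that contains a NUL byte
        if ($chunk === "" || strpos($chunk, "\0") !== false) {
            continue;
        }
        $output_text .= $chunk . " ";
    }
    // Keep only letters, digits, whitespace and a few punctuation characters
    return preg_replace('/[^a-zA-Z0-9\s,.\-@\/_()]/', "", $output_text);
}

// Invented sample: a binary chunk, an 8-bit text chunk, and "Hi" written as two-byte characters
$binary_chunk = "\x01\x02\0\0";
$ascii_chunk  = "Hello world";
$utf16_chunk  = "H\0i\0";
$sample   = $binary_chunk . "\r" . $ascii_chunk . "\r" . $utf16_chunk . "\r";
$filtered = filter_doc_stream($sample);
```

Here $filtered ends up holding only the 8-bit chunk; the two-byte "Hi" is lost together with the binary data, which is one way the extraction can miss real text.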

 

For every rich document format that is a ZIP archive of XML files (the Open XML formats, and ODT as well) we can use a common mechanism: open the archive and pull the text out of its content XML file.
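
That common mechanism can be sketched as one small pair of functions. This is a sketch under assumptions, not the article's own code: the names xml_to_plain_text() and zip_entry_to_text() are hypothetical, and it reads the text through the DOM textContent property, which decodes entities such as &amp;amp; instead of leaving them in the output.

```php
<?php
// Hypothetical helper: strips the markup from an XML string and returns the text.
function xml_to_plain_text(string $xml_datas): string
{
    if ($xml_datas === '') {
        return '';
    }
    $xml_handle = new DOMDocument();
    // Entity substitution is left off; LIBXML_NONET blocks network access while parsing.
    if (!$xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NONET)) {
        return '';
    }
    // textContent of the root element is every text node joined in document order.
    return $xml_handle->documentElement->textContent;
}

// Hypothetical helper: reads one XML entry out of a ZIP-based document.
function zip_entry_to_text(string $input_file, string $xml_filename): string
{
    $zip_handle = new ZipArchive();
    if ($zip_handle->open($input_file) !== true) {
        return '';
    }
    $xml_datas = $zip_handle->getFromName($xml_filename);
    $zip_handle->close();
    return ($xml_datas === false) ? '' : xml_to_plain_text($xml_datas);
}
```

With these in place, a call such as zip_entry_to_text("sample.docx", "word/document.xml") would cover the same ground as a format-specific function; only the entry name changes per format.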
ODT to TEXT:


<?php
function odt_to_text($input_file){
	$xml_filename = "content.xml"; //content file name
	$zip_handle = new ZipArchive;
	$output_text = "";
	if(true === $zip_handle->open($input_file)){
		if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
			$xml_datas = $zip_handle->getFromIndex($xml_index);
			// loadXML() is an instance method; calling it statically is an error on PHP 8.
			// LIBXML_NOENT and LIBXML_XINCLUDE are dropped so an untrusted file cannot pull in external entities.
			$xml_handle = new DOMDocument();
			if($xml_datas !== false && $xml_datas !== "" && $xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NONET)){
				$output_text = strip_tags($xml_handle->saveXML());
			}
		}
		$zip_handle->close();
	}
	return $output_text;
}

echo odt_to_text("sample.odt");
?>
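
Stripping tags from content.xml runs all paragraphs together into one string. If paragraph boundaries matter, one option is to query the paragraph and heading elements directly. The sketch below assumes the standard OpenDocument text namespace; odt_xml_to_paragraphs() is a hypothetical name, and it expects the XML string of content.xml rather than a file path. Nested paragraphs (for example inside text boxes) would be reported twice, so treat it as a starting point.

```php
<?php
// Hypothetical helper: returns one array entry per heading or paragraph in content.xml.
function odt_xml_to_paragraphs(string $xml_datas): array
{
    $xml_handle = new DOMDocument();
    if ($xml_datas === '' || !$xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NONET)) {
        return [];
    }
    $xpath = new DOMXPath($xml_handle);
    // Namespace URI of the OpenDocument "text" vocabulary
    $xpath->registerNamespace('text', 'urn:oasis:names:tc:opendocument:xmlns:text:1.0');
    $paragraphs = [];
    // One location path matches headings (text:h) and paragraphs (text:p) in document order.
    foreach ($xpath->query('//*[self::text:h or self::text:p]') as $node) {
        $paragraphs[] = $node->textContent;
    }
    return $paragraphs;
}
```

The result can then be joined with implode("\n", ...) to get one line per paragraph.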

 

DOCX to TEXT:


<?php
function docx_to_text($input_file){
	$xml_filename = "word/document.xml"; //content file name
	$zip_handle = new ZipArchive;
	$output_text = "";
	if(true === $zip_handle->open($input_file)){
		if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
			$xml_datas = $zip_handle->getFromIndex($xml_index);
			// loadXML() is an instance method; calling it statically is an error on PHP 8.
			// LIBXML_NOENT and LIBXML_XINCLUDE are dropped so an untrusted file cannot pull in external entities.
			$xml_handle = new DOMDocument();
			if($xml_datas !== false && $xml_datas !== "" && $xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NONET)){
				$output_text = strip_tags($xml_handle->saveXML());
			}
		}
		$zip_handle->close();
	}
	return $output_text;
}

echo docx_to_text("sample.docx");
?>
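
In word/document.xml the visible text sits in w:t elements inside runs (w:r), grouped into paragraphs (w:p). A slightly more structured extraction can keep one line per paragraph. The sketch below assumes the usual (transitional) WordprocessingML namespace; docx_xml_to_text() is a hypothetical name and takes the XML string, not the file path. Only w:t and w:tab are handled, so other run content is ignored.

```php
<?php
// Hypothetical helper: turns word/document.xml into text with one line per paragraph.
function docx_xml_to_text(string $xml_datas): string
{
    $xml_handle = new DOMDocument();
    if ($xml_datas === '' || !$xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NONET)) {
        return '';
    }
    $xpath = new DOMXPath($xml_handle);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');
    $lines = [];
    foreach ($xpath->query('//w:p') as $paragraph) {
        $line = '';
        // Only direct children of runs are read, so tab-stop definitions in w:pPr are skipped.
        foreach ($xpath->query('.//w:r/*[self::w:t or self::w:tab]', $paragraph) as $node) {
            $line .= ($node->localName === 'tab') ? "\t" : $node->textContent;
        }
        $lines[] = $line;
    }
    return implode("\n", $lines);
}
```

Compared with stripping tags, this keeps words from adjacent paragraphs from running together.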

 

XLSX to TEXT:


<?php
function xlsx_to_text($input_file){
	$xml_filename = "xl/sharedStrings.xml"; //content file name
	$zip_handle = new ZipArchive;
	$output_text = "";
	if(true === $zip_handle->open($input_file)){
		if(($xml_index = $zip_handle->locateName($xml_filename)) !== false){
			$xml_datas = $zip_handle->getFromIndex($xml_index);
			// loadXML() is an instance method; calling it statically is an error on PHP 8.
			// LIBXML_NOENT and LIBXML_XINCLUDE are dropped so an untrusted file cannot pull in external entities.
			$xml_handle = new DOMDocument();
			if($xml_datas !== false && $xml_datas !== "" && $xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NONET)){
				$output_text = strip_tags($xml_handle->saveXML());
			}
		}
		$zip_handle->close();
	}
	return $output_text;
}

echo xlsx_to_text("sample.xlsx");
?>
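
Note that xl/sharedStrings.xml holds only the string values of a workbook; numeric cells are stored in the sheet XML files and are not picked up this way. Each string is one si element, so the strings can also be returned as a list instead of one run-together blob. The sketch below assumes the usual SpreadsheetML namespace; xlsx_shared_strings() is a hypothetical name and takes the XML string of sharedStrings.xml.

```php
<?php
// Hypothetical helper: returns the shared strings of a workbook as an array.
function xlsx_shared_strings(string $xml_datas): array
{
    $xml_handle = new DOMDocument();
    if ($xml_datas === '' || !$xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NONET)) {
        return [];
    }
    $xpath = new DOMXPath($xml_handle);
    // The file uses a default namespace, so a prefix has to be registered for XPath.
    $xpath->registerNamespace('s', 'http://schemas.openxmlformats.org/spreadsheetml/2006/main');
    $strings = [];
    foreach ($xpath->query('//s:si') as $item) {
        $value = '';
        // Join plain and rich-text runs; phonetic hints (s:rPh) are left out.
        foreach ($xpath->query('.//s:t[not(ancestor::s:rPh)]', $item) as $text_node) {
            $value .= $text_node->textContent;
        }
        $strings[] = $value;
    }
    return $strings;
}
```

Joining the result with implode("\n", ...) gives one string per line.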

 

PPTX to TEXT:


<?php
function pptx_to_text($input_file){
	$zip_handle = new ZipArchive;
	$output_text = "";
	if(true === $zip_handle->open($input_file)){
		$slide_number = 1; //loop through slide files
		while(($xml_index = $zip_handle->locateName("ppt/slides/slide".$slide_number.".xml")) !== false){
			$xml_datas = $zip_handle->getFromIndex($xml_index);
			// loadXML() is an instance method; calling it statically is an error on PHP 8.
			// LIBXML_NOENT and LIBXML_XINCLUDE are dropped so an untrusted file cannot pull in external entities.
			$xml_handle = new DOMDocument();
			if($xml_datas !== false && $xml_datas !== "" && $xml_handle->loadXML($xml_datas, LIBXML_NOERROR | LIBXML_NOWARNING | LIBXML_NONET)){
				$output_text .= strip_tags($xml_handle->saveXML());
			}
			$slide_number++;
		}
		$zip_handle->close();
	}
	return $output_text;
}

echo pptx_to_text("sample.pptx");
?>
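
The slide loop above stops at the first missing slide number. If a presentation's slide files are not numbered contiguously (an assumption; this may happen after slides are deleted), later slides would be skipped. One way around that is to list every entry in the archive and pick out the slide files by name. In the sketch below, pptx_slide_entries() and zip_entry_names() are hypothetical helpers; note that the file number is not necessarily the on-screen slide order, which is defined in ppt/presentation.xml.

```php
<?php
// Hypothetical helper: picks the slide XML entries out of a list of archive entry names.
function pptx_slide_entries(array $entry_names): array
{
    $slides = [];
    foreach ($entry_names as $name) {
        // Match ppt/slides/slideN.xml only, not the .rels files next to it
        if (preg_match('#^ppt/slides/slide(\d+)\.xml$#', $name, $match)) {
            $slides[(int) $match[1]] = $name;
        }
    }
    ksort($slides); // numeric order, so slide10 comes after slide2
    return array_values($slides);
}

// Hypothetical helper: lists every entry name stored in a ZIP archive.
function zip_entry_names(string $input_file): array
{
    $names = [];
    $zip_handle = new ZipArchive();
    if ($zip_handle->open($input_file) === true) {
        for ($i = 0; $i < $zip_handle->numFiles; $i++) {
            $names[] = $zip_handle->getNameIndex($i);
        }
        $zip_handle->close();
    }
    return $names;
}
```

Looping over pptx_slide_entries(zip_entry_names("sample.pptx")) and reading each entry from the archive would then visit every slide file, gaps or not.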