Reading text and images from a Word document with PHP, and saving them
1. Install PHPWord with Composer
composer require phpoffice/phpword
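Link: https://packagist.org/packages/phpoffice/phpword
2. Read a docx document with PHPWord (note: docx only, the doc format will not work)
If your file is in doc format, simply save it again as docx. If you have many doc files, you can download a batch conversion tool: http://www.batchwork.com/en/doc2doc/download.htm
If you have not set up autoloading yet, do that first: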
require './vendor/autoload.php';
Load the document:
$dir = str_replace('\\', '/', __DIR__) . '/';
$source = $dir . 'test.docx';
$phpWord = \PhpOffice\PhpWord\IOFactory::load($source);
3. Key points
1) Alignment: \PhpOffice\PhpWord\Style\Paragraph -> getAlignment()
2) Font name: \PhpOffice\PhpWord\Style\Font -> getName()
3) Font size: \PhpOffice\PhpWord\Style\Font -> getSize()
4) Bold or not: \PhpOffice\PhpWord\Style\Font -> isBold()
5) Read an image: \PhpOffice\PhpWord\Element\Image -> getImageStringData() (pass true to get base64-encoded data)
6) Save base64 image data as an image file: file_put_contents($imageSrc, base64_decode($imageData))
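Point 6 can be wrapped in a small helper. This is only a sketch: saveBase64Image() is a hypothetical name, not part of PHPWord, and it assumes the data may carry a "data:...;base64," prefix and may have been split into lines.

```php
<?php
/**
 * Decode base64 image data and write it to $path.
 * saveBase64Image() is a hypothetical helper, not part of PHPWord.
 */
function saveBase64Image(string $base64, string $path): bool
{
    // Drop an optional "data:image/png;base64," prefix.
    $base64 = (string) preg_replace('~^data:[^;,]+;base64,~', '', $base64);

    // In its default (non-strict) mode base64_decode() skips line breaks,
    // which matters when the data was split into lines.
    $binary = base64_decode($base64);
    if ($binary === false || $binary === '') {
        return false;
    }

    // Create the target directory on first use.
    $dir = dirname($path);
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        return false;
    }

    return file_put_contents($path, $binary) !== false;
}
```

Usage would look like saveBase64Image($ele2->getImageStringData(true), $imageSrc), which also covers the case where the images directory does not exist yet.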
4. Complete code
require './vendor/autoload.php';

function docx2html($source)
{
    $phpWord = \PhpOffice\PhpWord\IOFactory::load($source);
    // Create the image directory on first use.
    if (!is_dir('images')) {
        mkdir('images', 0777, true);
    }
    $html = '';
    foreach ($phpWord->getSections() as $section) {
        foreach ($section->getElements() as $ele1) {
            // Not every element (a table, for example) has a paragraph style.
            $paragraphStyle = method_exists($ele1, 'getParagraphStyle') ? $ele1->getParagraphStyle() : null;
            if ($paragraphStyle instanceof \PhpOffice\PhpWord\Style\Paragraph) {
                // Word's "both" corresponds to "justify" in CSS.
                $align = $paragraphStyle->getAlignment() === 'both' ? 'justify' : $paragraphStyle->getAlignment();
                $html .= '<p style="text-align:' . $align . ';text-indent:20px;">';
            } else {
                $html .= '<p>';
            }
            if ($ele1 instanceof \PhpOffice\PhpWord\Element\TextRun) {
                foreach ($ele1->getElements() as $ele2) {
                    if ($ele2 instanceof \PhpOffice\PhpWord\Element\Text) {
                        $style = $ele2->getFontStyle();
                        $styleString = '';
                        // getFontStyle() may return a style name or null instead of a Font object.
                        if ($style instanceof \PhpOffice\PhpWord\Style\Font) {
                            $fontFamily = $style->getName();
                            $fontSize = $style->getSize();
                            $isBold = $style->isBold();
                            $fontFamily && $styleString .= "font-family:{$fontFamily};";
                            // Word font sizes are given in points.
                            $fontSize && $styleString .= "font-size:{$fontSize}pt;";
                            $isBold && $styleString .= "font-weight:bold;";
                        }
                        // Escape the text so characters such as < and & do not break the markup.
                        $html .= sprintf('<span style="%s">%s</span>', $styleString, htmlspecialchars((string) $ele2->getText()));
                    } elseif ($ele2 instanceof \PhpOffice\PhpWord\Element\Image) {
                        $imageSrc = 'images/' . md5($ele2->getSource()) . '.' . $ele2->getImageExtension();
                        $imageData = $ele2->getImageStringData(true);
                        // $imageData = 'data:' . $ele2->getImageType() . ';base64,' . $imageData;
                        file_put_contents($imageSrc, base64_decode($imageData));
                        $html .= '<img src="' . $imageSrc . '" style="width:100%;height:auto">';
                    }
                }
            }
            $html .= '</p>';
        }
    }
    return $html;
}

$dir = str_replace('\\', '/', __DIR__) . '/';
$source = $dir . 'test.docx';
echo docx2html($source);
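5. Notes
Obviously, this is a bare-bones example of reading a Word file. It only reads the paragraph alignment, the font name, size and bold flag of the text, and the images. Everything else, such as text color and line height, is ignored. If you need more, look through the PHPWord source and check which getter methods the \PhpOffice\PhpWord\Style\xxx and \PhpOffice\PhpWord\Element\xxx classes provide.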
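As an illustration of reading a few more properties, here is a minimal sketch that turns color, italic and underline into inline CSS. fontStyleToCss() is a hypothetical helper. It assumes $font is a \PhpOffice\PhpWord\Style\Font and relies on its getColor(), isItalic() and getUnderline() getters. The parameter is left untyped so the sketch runs on its own. Line height would come from the paragraph style instead and is not covered here.

```php
<?php
/**
 * Build an inline CSS string from a PHPWord font style.
 * fontStyleToCss() is a hypothetical helper; $font is expected to be a
 * \PhpOffice\PhpWord\Style\Font instance.
 */
function fontStyleToCss($font): string
{
    $css = '';

    // Expected to be a hex string such as "FF0000", or null when unset.
    $color = $font->getColor();
    if ($color && preg_match('/^[0-9A-Fa-f]{6}$/', $color)) {
        $css .= "color:#{$color};";
    }

    if ($font->isItalic()) {
        $css .= 'font-style:italic;';
    }

    // "none" means the text is not underlined.
    $underline = $font->getUnderline();
    if ($underline && $underline !== 'none') {
        $css .= 'text-decoration:underline;';
    }

    return $css;
}
```

Inside the loop of the complete code, the result could simply be appended to the span style, for example $styleString .= fontStyleToCss($style).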
6. Addendum, 2020-07-21
The following gets the complete HTML directly:
$phpWord = \PhpOffice\PhpWord\IOFactory::load('xxx.docx');
$xmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, "HTML");
$html = $xmlWriter->getContent();
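Note: the HTML includes the head section. If you only need the style and the body, you have to extract them yourself. The images are embedded as base64, so saving them to files also needs your own handling.
To save base64 data as an image, see the code above.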
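One possible way to do the extraction mentioned in the note is a pair of regular expressions over the generated string. This is a rough sketch: splitHtml() is a hypothetical helper, and it assumes the page contains ordinary <style> and <body> elements. A real HTML parser would be more robust on unusual markup.

```php
<?php
/**
 * Split a full HTML page into its <style> rules and its <body> content.
 * splitHtml() is a hypothetical helper based on regular expressions.
 *
 * @return array{style: string, body: string}
 */
function splitHtml(string $html): array
{
    $style = '';
    // Collect the contents of every <style> element.
    if (preg_match_all('~<style\b[^>]*>(.*?)</style>~is', $html, $m)) {
        $style = trim(implode("\n", $m[1]));
    }

    $body = '';
    // Take everything between <body ...> and the last </body>.
    if (preg_match('~<body\b[^>]*>(.*)</body>~is', $html, $m)) {
        $body = trim($m[1]);
    }

    return ['style' => $style, 'body' => $body];
}
```

Called as $parts = splitHtml($html), it leaves $parts['style'] ready to wrap in your own <style> tag and $parts['body'] ready to embed in an existing page.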
If you only want the content of the body, you can follow the write method in \PhpOffice\PhpWord\Writer\HTML\Part\Body:
$phpWord = \PhpOffice\PhpWord\IOFactory::load('xxxx.docx');
$htmlWriter = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, "HTML");
$content = '';
foreach ($phpWord->getSections() as $section) {
    $writer = new \PhpOffice\PhpWord\Writer\HTML\Element\Container($htmlWriter, $section);
    $content .= $writer->write();
}
echo $content;
exit;
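As for the images, I have not yet found a good way to handle them without modifying the library source. If you do modify the source, the relevant code is in \PhpOffice\PhpWord\Writer\HTML\Element\Image: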
public function write()
{
    if (!$this->element instanceof ImageElement) {
        return '';
    }
    $content = '';
    $imageData = $this->element->getImageStringData(true);
    if ($imageData !== null) {
        $styleWriter = new ImageStyleWriter($this->element->getStyle());
        $style = $styleWriter->write();
        // $imageData = 'data:' . $this->element->getImageType() . ';base64,' . $imageData;
        $imageSrc = 'images/' . md5($this->element->getSource()) . '.' . $this->element->getImageExtension();
        // Handle the image however you like here, e.g. upload it to OSS.
        file_put_contents($imageSrc, base64_decode($imageData));
        $content .= $this->writeOpening();
        $content .= "<img border=\"0\" style=\"{$style}\" src=\"{$imageSrc}\"/>";
        $content .= $this->writeClosing();
    }
    return $content;
}
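If patching the library is not an option, another route that might work is to post-process the generated HTML string and swap each base64 data URI for a saved file. The sketch below is an assumption-laden illustration, not a PHPWord feature: externalizeImages() is a hypothetical helper, and it assumes images appear as src="data:image/<type>;base64,..." attributes in double quotes.

```php
<?php
/**
 * Replace base64 data URIs in src="..." attributes with files saved under $dir.
 * externalizeImages() is a hypothetical helper, not a PHPWord API.
 */
function externalizeImages(string $html, string $dir = 'images'): string
{
    if (!is_dir($dir) && !mkdir($dir, 0777, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory: {$dir}");
    }

    // Group 1: image subtype (png, jpeg, ...). Group 2: base64 payload,
    // which may contain line breaks.
    $pattern = '~src="data:image/([a-z0-9]+);base64,([^"]*)"~i';

    $result = preg_replace_callback($pattern, function (array $m) use ($dir) {
        $binary = base64_decode($m[2]); // non-strict mode skips line breaks
        if ($binary === false || $binary === '') {
            return $m[0];               // leave the attribute untouched
        }
        $ext = strtolower($m[1]) === 'jpeg' ? 'jpg' : strtolower($m[1]);
        $file = rtrim($dir, '/') . '/' . md5($binary) . '.' . $ext;
        file_put_contents($file, $binary); // or upload elsewhere here
        return 'src="' . $file . '"';
    }, $html);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $html;
}
```

Naming files by the md5 of their contents means the same image embedded twice is written only once. The trade-off against patching the writer is that the whole HTML, including every base64 payload, is held in memory first.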
The end.