PHPWord: TemplateProcessor cloneBlock breaks my document

Created on 20 Nov 2015 · 4 Comments · Source: PHPOffice/PHPWord

Hi

This is my first use of PHPWord. I created a docx file with LibreOffice (do not ask why I'm not using MS Office :) )

The document contains a block (a single paragraph) to be cloned. The resulting docx file appears to be empty (still using LibreOffice).

I noticed that the paragraphs in the sample_23 document provided by PHPWord differ from those in my document created by LibreOffice.

The following XML has been pretty-printed (with Eclipse) to make the structure easier to check.

Sample_23 provided by PHPWord repository (block tag only):

        <w:p w:rsidR="00C0566D" w:rsidRPr="003B08B6" w:rsidRDefault="00C0566D"
            w:rsidP="00C0566D">
            <w:r>
                <w:t>${CLONEME}</w:t>
            </w:r>
        </w:p>

Paragraph generated by LibreOffice (block tag only):

        <w:p>
            <w:pPr>
                <w:pStyle w:val="Corpsdetexte" />
                <w:pageBreakBefore w:val="false" />
                <w:rPr></w:rPr>
            </w:pPr>
            <w:r>
                <w:rPr></w:rPr>
                <w:t>${itemtypeBlock}</w:t>
            </w:r>
        </w:p>

Output file (the cloned blocks appear to be nested):

        <w:p>
            <w:pPr>
                <w:pStyle w:val="Corpsdetexte" />
                <w:p>
                    <w:pPr>
                        <w:pStyle w:val="Corpsdetexte" />
                        <w:rPr></w:rPr>
                    </w:pPr>
                    <w:r>
                        <w:rPr></w:rPr>
                        <w:t>cloned paragraph</w:t>
                    </w:r>
                </w:p>
                <w:p>
                    <w:pPr>
                        <w:p>
                            <w:pPr>
                                <w:pStyle w:val="Corpsdetexte" />
                                <w:rPr></w:rPr>
                            </w:pPr>
                            <w:r>
                                <w:rPr></w:rPr>
                                <w:t>cloned paragraph</w:t>
                            </w:r>
                        </w:p>
                        <w:p>
                            <w:pPr>
                                <w:p>
                                    <w:pPr>
                                        <w:pStyle w:val="Normal" />
                                        <w:rPr></w:rPr>
                                    </w:pPr>

I began to debug and found the following:

  • the XML tree is broken in the output file: cloned blocks are nested, and some XML tags are not properly closed (this is not visible in the snippets because they would be too big; please trust the indentation)
  • the paragraphs generated by LibreOffice contain extra tags, which break the regex used in cloneBlock()
  • I'm far from being a docx specification expert, but the regex appears (to me) to be easy to break because it matches on different opening and closing tags (no backreference is used). I also noticed a couple of constructs in it that make no sense to me (but I admit I may be wrong)
  • I believe PHPWord should also be compatible with documents generated by software other than MS Office, as long as they are properly built.
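
To make the second point concrete, here is a minimal sketch that runs the pattern cloneBlock() used at the time (it is quoted as a comment in the fix further down) against a LibreOffice-style paragraph. The XML string is a hand-written, simplified stand-in for a real document.xml, and the variable names are only for illustration.

```php
<?php
// Simplified stand-in for word/document.xml as LibreOffice writes it
// (an assumption for illustration, not a real file).
$xml = '<?xml version="1.0" encoding="UTF-8"?><w:document><w:body>'
    . '<w:p><w:pPr><w:pStyle w:val="Corpsdetexte"/><w:pageBreakBefore w:val="false"/><w:rPr></w:rPr></w:pPr>'
    . '<w:r><w:rPr></w:rPr><w:t>${itemtypeBlock}</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>cloned paragraph</w:t></w:r></w:p>'
    . '<w:p><w:pPr><w:pStyle w:val="Corpsdetexte"/><w:rPr></w:rPr></w:pPr>'
    . '<w:r><w:rPr></w:rPr><w:t>${/itemtypeBlock}</w:t></w:r></w:p>'
    . '</w:body></w:document>';

$blockname = 'itemtypeBlock';

// The original pattern. Its "<w:p" prefix also matches the start of
// w:pPr, w:pStyle and w:pageBreakBefore elements.
$original = '/(<\?xml.*)(<w:p.*>\${' . $blockname . '}<\/w:.*?p>)(.*)(<w:p.*\${\/' . $blockname . '}<\/w:.*?p>)/is';

preg_match($original, $xml, $matches);

// Where the "opening paragraph" group really starts, and what would be cloned.
$openingStart = substr($matches[2], 0, 18);
$clonedXml = $matches[3];

echo $openingStart, PHP_EOL;
echo $clonedXml, PHP_EOL;
```

On this input the opening group starts at `<w:pageBreakBefore` rather than at `<w:p>`, and the captured block ends with a dangling `<w:p><w:pPr>`, which is consistent with the nesting seen in the output file above.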


Most helpful comment

I finally found a regex that works on my document. It is designed to also work with the sample_23 document.

    public function cloneBlock($blockname, $clones = 1, $replace = true)
    {
        $xmlBlock = null;
        preg_match(
            // Original pattern, kept for reference:
            //'/(<\?xml.*)(<w:p.*>\${' . $blockname . '}<\/w:.*?p>)(.*)(<w:p.*\${\/' . $blockname . '}<\/w:.*?p>)/is',
            '/(<\?xml.*)(<w:p( [^>]*)?>([\s]*<.*>)?\${' . $blockname . '}(<.*?>[\s]*)?<\/w:p>)(.*)(<w:p( [^>]*)?>([\s]*<.*>)?\${\/' . $blockname . '}(<.*?>[\s]*)?<\/w:p>)/is',
            $this->tempDocumentMainPart,
            $matches
        );

        // $matches[2]: paragraph holding the begin tag, $matches[6]: XML to clone,
        // $matches[7]: paragraph holding the end tag.
        if (isset($matches[6])) {
            $xmlBlock = $matches[6];
            $cloned = array();
            for ($i = 1; $i <= $clones; $i++) {
                $cloned[] = $xmlBlock;
            }

            if ($replace) {
                // Replace both tag paragraphs and the block with the cloned copies.
                $this->tempDocumentMainPart = str_replace(
                    $matches[2] . $matches[6] . $matches[7],
                    implode('', $cloned),
                    $this->tempDocumentMainPart
                );
            }
        }

        return $xmlBlock;
    }

I submit it here for review, and if it passes the tests I'll make a pull request.

As the regex is not easy to read, here is how it works:

(<\?xml.*)

greedily consumes the XML from the declaration onward and, through backtracking, stops as close as possible to the beginning of the searched tag (see the next sub-regex).

(<w:p( [^>]*)?>([\s]*<.*>)?\${' . $blockname . '}(<.*?>[\s]*)?<\/w:p>)

This second part handles attributes that may be found in w:p, and matches the fewest tags needed to reach the beginning of the block tag.
As few XML tags as possible are matched after the block tag, until we reach /w:p.
Note that I left some [\s] in for debugging purposes; they should not affect a real-life document (but we may consider removing them).

The beginning of a block should be in its own paragraph, strictly alone (I mean: without any other text), if I understand the original regex correctly.

(.*) 

This sub-regex matches the XML code to be cloned, up to the paragraph containing the end block tag (see below).

(<w:p( [^>]*)?>([\s]*<.*>)?\${\/' . $blockname . '}(<.*?>[\s]*)?<\/w:p>)

This sub-regex is similar to the one for the begin block tag: it matches from the nearest preceding w:p to the nearest following /w:p, with the end-of-block tag as the fixed point in the document.

Hope this helps to improve PHPWord.
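
The walkthrough above can be tried outside PHPWord on a small input. The following is only a sketch: `cloneBlockInXml()` is a hypothetical standalone helper (it is not part of PHPWord), the XML is a hand-written, simplified stand-in for a LibreOffice document.xml, and the `preg_quote()` call is an addition that the posted code does not have.

```php
<?php
/**
 * Hypothetical standalone helper (not part of PHPWord): applies the proposed
 * pattern to an XML string and returns the XML with the block cloned.
 */
function cloneBlockInXml(string $xml, string $blockname, int $clones = 1): string
{
    // preg_quote() is an addition of this sketch; the posted code inserts the name as-is.
    $name = preg_quote($blockname, '/');
    $pattern = '/(<\?xml.*)(<w:p( [^>]*)?>([\s]*<.*>)?\${' . $name . '}(<.*?>[\s]*)?<\/w:p>)'
        . '(.*)(<w:p( [^>]*)?>([\s]*<.*>)?\${\/' . $name . '}(<.*?>[\s]*)?<\/w:p>)/is';

    if (preg_match($pattern, $xml, $matches) !== 1) {
        return $xml; // no block found: leave the document untouched
    }

    // Group 2: paragraph with the begin tag, group 6: block content,
    // group 7: paragraph with the end tag.
    return str_replace(
        $matches[2] . $matches[6] . $matches[7],
        str_repeat($matches[6], $clones),
        $xml
    );
}

// Simplified LibreOffice-style input (an assumption for illustration).
$xml = '<?xml version="1.0" encoding="UTF-8"?><w:document><w:body>'
    . '<w:p><w:pPr><w:pStyle w:val="Corpsdetexte"/><w:pageBreakBefore w:val="false"/><w:rPr></w:rPr></w:pPr>'
    . '<w:r><w:rPr></w:rPr><w:t>${itemtypeBlock}</w:t></w:r></w:p>'
    . '<w:p><w:r><w:t>cloned paragraph</w:t></w:r></w:p>'
    . '<w:p><w:pPr><w:pStyle w:val="Corpsdetexte"/><w:rPr></w:rPr></w:pPr>'
    . '<w:r><w:rPr></w:rPr><w:t>${/itemtypeBlock}</w:t></w:r></w:p>'
    . '</w:body></w:document>';

$result = cloneBlockInXml($xml, 'itemtypeBlock', 3);
echo $result, PHP_EOL;
```

With this input the result contains three copies of the "cloned paragraph" paragraph and neither of the two tag paragraphs.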

All 4 comments

This regex works better, but I'm afraid the pattern is prone to catastrophic backtracking (see http://www.regular-expressions.info/catastrophic.html).
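
If backtracking does get out of hand, preg_match() returns false instead of 0 or 1, and preg_last_error() reports why (for example PREG_BACKTRACK_LIMIT_ERROR when the pcre.backtrack_limit setting is exceeded). The posted cloneBlock() only checks isset($matches[6]), so it would treat such a failure like "block not found". A minimal sketch of making the difference visible, where `matchOrFail()` is a hypothetical wrapper and not part of PHPWord:

```php
<?php
/**
 * Hypothetical wrapper (not part of PHPWord): runs preg_match() and turns an
 * engine failure into an exception instead of a silent "no match".
 */
function matchOrFail(string $pattern, string $subject): ?array
{
    $result = preg_match($pattern, $subject, $matches);

    if ($result === false) {
        // preg_last_error() returns a PREG_*_ERROR constant, e.g.
        // PREG_BACKTRACK_LIMIT_ERROR when pcre.backtrack_limit was exceeded.
        throw new RuntimeException('preg_match() failed, error code ' . preg_last_error());
    }

    // 1 means a match was found, 0 means the pattern simply did not match.
    return $result === 1 ? $matches : null;
}
```

Whether the proposed pattern actually hits the limit depends on the size and shape of the document, so this only reports the failure; it does not prevent it.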

@btry hey man, your regex is so :fire: :fire: :fire:

Thank you so much

Hi

I'm no longer using PHPWord. It seems some other proposals were made. If they have been merged into a release, it should work.

@nicoder, when I designed my regex I was able to repeatedly grow a document full of tables, from a 5-page template to a document of about 40 pages, without any problem. I may still have samples. If you wish to see them, just ask (I would need to redact them for confidentiality reasons).
