I need to perform some modifications to PHP files (PHTML files to be exact, but they are still valid PHP files) from a Bash script. My original thought was to use sed or a similar utility with a regex, but after reading some of the replies here to other HTML parsing questions, it seems that there might be a better solution.
The problem I was facing with the regex was a lack of support for detecting if the string I wanted to match: (src|href|action)=["']/ was in <?php ?> tags or not, so that I could then either perform string concatenation if the match was in PHP tags, or add in new PHP tags should it not be. For example:
(1) <img id="icon-loader-small" src="/css/images/loader-small.gif" style="vertical-align:middle; display:none;"/>
(2) <li><span class="name"><?php echo $this->loggedInAs()?></span> | <a href="/Login/logout">Logout</a></li>
(3) <?php echo ($watched_dir->getExistsFlag())?"":"<span class='ui-icon-alert'><img src='/css/images/warning-icon.png'></span>"?><span><?php echo $watched_dir->getDirectory();?></span></span><span class="ui-icon ui-icon-close"></span>
(EDIT: 4) <form method="post" action="/Preference/stream-setting" enctype="application/x-www-form-urlencoded" onsubmit="return confirm('<?php echo $this->confirm_pypo_restart_text ?>');">
In (1) there is a src="/css, and as it is not in PHP tags I want that to become src="<?php echo $baseUrl?>/css. In (2), there is a PHP tag but it is not around the href="/Login, so it also becomes href="<?php echo $baseUrl?>/Login.
Unfortunately, (3) has src='/css but inside the PHP tags (it is an echoed string). It is also quoted by " in the PHP code, so the modification needs to pick up on that too. The final result would look something like: src='".$baseUrl."/css.
All the other modifications to my HTML and PHP files have been done using a regex (I know, I know…). If regexes could support matching everything except a certain pattern, like [^(<\?php)(\?>)]* then I would be flying through this part. Unfortunately it seems that this is Type 2 grammar territory. So – what should I use?
Ideally it needs to be installed by default with the GNU suite, but other tools like PHP itself or other interpreters are fine too, just not preferred. Of course, if someone could structure a regex that would work on the above examples, then that would be excellent.
EDIT: (4) is the nasty match, where most regexes will fail.
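To make the failure concrete, here is a minimal sketch of the context-blind substitution. It is in Python purely for illustration (any regex tool behaves the same way here), and the name naive_rewrite and the shortened sample strings are made up for this example.

```python
import re

# Root-relative src/href/action attribute, with either quote style.
ATTR = re.compile(r"""(src|href|action)=(["'])/""")

def naive_rewrite(line):
    # Always inserts a new PHP tag, without checking whether the
    # match already sits inside <?php ... ?>.
    return ATTR.sub(
        lambda m: f'{m.group(1)}={m.group(2)}<?php echo $baseUrl?>/', line)

# Shortened version of example (1): plain HTML, so the result is correct.
html_only = '<img id="icon-loader-small" src="/css/images/loader-small.gif"/>'

# Shortened version of example (3): the attribute is inside an echoed string.
inside_php = """<?php echo ($ok)?"":"<img src='/css/images/warning-icon.png'>"?>"""

print(naive_rewrite(html_only))
print(naive_rewrite(inside_php))
```

The first line comes out as wanted, but the second ends up with a <?php tag nested inside an existing PHP block, which is exactly the case the regex cannot tell apart on its own.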
The way I solved this problem was by separating my file into sections delimited by <?php ?> tags. The script keeps track of the ‘context’ it is currently in: html by default, switching to php when it hits those tags. An operation (not necessarily a regex) is then performed on each section, which is appended to the output buffer. When the file has been completely processed, the output buffer is written back to the file.
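The idea can be sketched roughly as follows. This is a Python illustration of the approach, not the Perl script itself, and the names (rewrite, rewrite_html, rewrite_php, HANDLERS) are made up for the example. It assumes that every PHP block is closed with ?>, that ?> never appears inside a PHP string or comment, and that any match inside PHP sits in a double-quoted string, as in example (3).

```python
import re

# Splitting on a capturing group keeps the PHP blocks in the result:
# even indices are HTML sections, odd indices are <?php ... ?> blocks.
PHP_BLOCK = re.compile(r"(<\?php.*?\?>)", re.DOTALL)
ATTR = re.compile(r"""(src|href|action)=(["'])/""")

def rewrite_html(section):
    # Outside PHP: open a new PHP tag that echoes the base URL.
    return ATTR.sub(
        lambda m: f"{m.group(1)}={m.group(2)}<?php echo $baseUrl?>/", section)

def rewrite_php(section):
    # Inside PHP: assumes a double-quoted PHP string, so close it,
    # concatenate $baseUrl, and reopen it.
    return ATTR.sub(
        lambda m: f'{m.group(1)}={m.group(2)}".$baseUrl."/', section)

# One handler per context; supporting another context means adding an entry.
HANDLERS = {"html": rewrite_html, "php": rewrite_php}

def rewrite(text):
    out = []  # output buffer
    for i, section in enumerate(PHP_BLOCK.split(text)):
        context = "php" if i % 2 else "html"
        out.append(HANDLERS[context](section))
    return "".join(out)
```

With this split, the action attribute in example (4) is rewritten as ordinary HTML while the PHP tag inside onsubmit is passed through untouched, and the src in example (3) gets the string concatenation form.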
I attempted to do this with sed, but I could not control where newlines would be printed. The context-based logic was also hardcoded, meaning it would be tedious to add a new context, such as ASP.NET support. My current solution is written in Perl and mitigates both problems, although I am having a bit of trouble getting my regex to actually do something; this might just be me writing the regex incorrectly.
Script is as follows:
I hope that this can be used and modified by others to suit their needs.