I am trying to harvest all inclusion directives from a PHP file using a regular expression (in Java).
The expression should pick up only those whose file names are expressed as unconcatenated string literals. Ones that use constants or variables do not need to be matched.
Detection should work for both single and double quotes, for both include and require, for the _once variants, and, last but not least, for both keyword- and function-style invocations.
A rough input sample:
<?php
require('a.php');
require 'b.php';
require("c.php");
require "d.php";
include('e.php');
include 'f.php';
include("g.php");
include "h.php";
require_once('i.php');
require_once 'j.php';
require_once("k.php");
require_once "l.php";
include_once('m.php');
include_once 'n.php';
include_once("o.php");
include_once "p.php";
?>
And output:
["a.php","b.php","c.php","d.php","e.php","f.php","g.php","h.php","i.php","j.php","k.php","l.php","m.php","n.php","o.php","p.php"]
Any ideas?
To do this accurately, you really need to fully parse the PHP source code. This is because a text sequence such as require('a.php'); can appear in places where it is not really an include at all, such as in comments, strings, and HTML markup. A commented-out require, for example, is NOT a real PHP include, but it will still be matched by a regex.
That said, if you are happy with getting a few false positives, a single regex will do a pretty good job of scraping the filenames from all of the PHP include variations.
Addendum (2011-07-24): It turns out the OP wants a solution in Java, not PHP. Here is a tested Java program which is nearly identical. Note that I am not a Java expert and don’t know how to dynamically size an array. Thus, the solution below (crudely) sets a fixed-size array (100) to hold the filenames.
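For reference, a minimal Java sketch of the single-regex approach might look like the following. It is not the tested program mentioned above: the class name IncludeScraper, the method name extract, and the exact pattern are assumptions made for illustration. It also assumes that file names contain no quotes or $ characters, and it collects results in an ArrayList, which grows on demand, rather than in a fixed-size array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and structure are illustrative only.
final class IncludeScraper {

    // Matches include/require, with or without _once, in keyword or
    // function style, followed by one quoted literal and then ';'.
    // Requiring ';' right after the literal (and optional ')') rejects
    // concatenated expressions; excluding '$' rejects interpolated strings.
    private static final Pattern INCLUDE = Pattern.compile(
        "(?<![\\w$])(?:require|include)(?:_once)?\\s*\\(?\\s*"
            + "(['\"])([^'\"$\\r\\n]+)\\1\\s*\\)?\\s*;",
        Pattern.CASE_INSENSITIVE); // PHP keywords are case-insensitive

    // Returns the file names in the order they appear in the source text.
    static List<String> extract(String phpSource) {
        List<String> names = new ArrayList<>();
        Matcher m = INCLUDE.matcher(phpSource);
        while (m.find()) {
            names.add(m.group(2)); // group 2 is the text between the quotes
        }
        return names;
    }

    public static void main(String[] args) {
        String sample = "<?php\n"
            + "require('a.php');\n"
            + "include_once \"b.php\";\n"
            + "require 'c' . $suffix;\n" // concatenated, so it is skipped
            + "?>";
        System.out.println(extract(sample));
    }
}
```

Because this is plain text matching rather than parsing, a directive inside a comment or a string is still reported, which is the false-positive trade-off described earlier.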