Using Java, I have a class which retrieves a web page as a byte array. I then need to strip out some content if it exists. (The application monitors web pages for changes, but needs to remove the session IDs that PHP writes into the HTML; otherwise a change would be detected on every visit to the page.)
Some of the resulting byte arrays could be tens of thousands of bytes long. They’re not stored like this; only a 16-byte MD5 of the page is stored. However, it is the original full-size byte array which needs to be processed.
(UPDATE – the code does not work. See comment from A.H. below)
A test showing my code:
public void testSessionIDGetsRemovedFromData() throws IOException
{
    byte[] forumContent = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a></li>".getBytes();
    byte[] sidPattern = "&sid=".getBytes();
    int sidIndex = ArrayCleaner.getPatternIndex(forumContent, sidPattern);
    assertEquals(54, sidIndex);
    // start of cleaning code
    ArrayList<Byte> forumContentList = new ArrayList<Byte>();
    forumContentList.addAll(forumContent);
    forumContentList.removeAll(Arrays.asList(sidPattern));
    byte[] forumContentCleaned = new byte[forumContentList.size()];
    for (int i = 0; i < forumContentCleaned.length; i++)
    {
        forumContentCleaned[i] = (byte)forumContentList.get(i);
    }
    // end of cleaning code
    sidIndex = ArrayCleaner.getPatternIndex(forumContentCleaned, sidPattern);
    assertEquals(-1, sidIndex);
}
This all works fine, but I’m worried about the efficiency of the cleaning section. I had hoped to operate solely on arrays, but ArrayList has nice built-in functions to remove a collection from the list, which is just what I need. So I have had to create an ArrayList of Byte, as I can’t have an ArrayList of the primitive byte (can anyone tell me why?), and convert the pattern to remove into another list (I suppose this could be an ArrayList all along) to pass to removeAll(). I then need to create another byte[], cast each element of the ArrayList of Bytes to a byte, and add it to the byte[].
Is there a more efficient way of doing all this?
Can it be performed using arrays?
UPDATE
This is the same functionality using strings:
public void testSessionIDGetsRemovedFromDataUsingStrings() throws IOException
{
    String forumContent = "<li class=\"icon-logout\"><a href=\"./ucp.php?mode=logout&sid=3a4043284674572e35881e022c68fcd8\" title=\"Logout [ barry ]\" accesskey=\"x\">Logout [ barry ]</a></li>";
    String sidPattern = "&sid=";
    int sidIndex = forumContent.indexOf(sidPattern);
    assertEquals(54, sidIndex);
    forumContent = forumContent.replaceAll(sidPattern, "");
    sidIndex = forumContent.indexOf(sidPattern);
    assertEquals(-1, sidIndex);
}
Is this as efficient as the array/ArrayList method?
Thanks,
Barry
Really? Did you inspect the resulting “string”? On my machine the data in forumContentCleaned still contains the &sid=... data.
That’s because
forumContentList.removeAll(Arrays.asList(sidPattern));
tries to remove a List<byte[]> from a List<Byte>. This will do nothing. And even if you replace the argument of removeAll with a real List<Byte> containing the bytes of "&sid=", you will remove ALL occurrences of each of those bytes, anywhere in the page. The resulting data will look like this:
Well, strictly speaking, the &sid= part is gone, but I’m quite sure this is not what you wanted.
Therefore take a step back and think: you are doing string manipulation here, so use a StringBuilder, feed it with new String(forumContent) and do your manipulation there.
Edit
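A minimal sketch of the StringBuilder approach, written so that it drops the sid value along with the key. The class name SidStripper and the method stripSid are hypothetical. The sketch assumes that the value ends at the next & or double quote, and that decoding the page as ISO-8859-1 (one char per byte) is acceptable for the round trip.

```java
import java.nio.charset.StandardCharsets;

final class SidStripper {
    // Hypothetical helper: removes every "&sid=<value>" from the page bytes.
    static byte[] stripSid(byte[] page) {
        // ISO-8859-1 maps each byte to exactly one char, so no bytes are lost.
        StringBuilder sb = new StringBuilder(new String(page, StandardCharsets.ISO_8859_1));
        String key = "&sid=";
        int start = sb.indexOf(key);
        while (start >= 0) {
            int end = start + key.length();
            // Assumption: the value runs until the next '&', '"', or end of input.
            while (end < sb.length() && sb.charAt(end) != '&' && sb.charAt(end) != '"') {
                end++;
            }
            sb.delete(start, end);
            // Continue searching from where the removed text began.
            start = sb.indexOf(key, start);
        }
        return sb.toString().getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] page = "<a href=\"./ucp.php?mode=logout&sid=3a40\" title=\"x\">"
                .getBytes(StandardCharsets.ISO_8859_1);
        // prints <a href="./ucp.php?mode=logout" title="x">
        System.out.println(new String(stripSid(page), StandardCharsets.ISO_8859_1));
    }
}
```

Because the decode and encode steps use the same single-byte charset, all bytes outside the removed ranges come back unchanged, so the cleaned array can be hashed as before.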
Looking at the given example input string, I guess that the value of sid should also be removed, not only the key. This code should do it efficiently without regular expressions:
Edit 2
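A rough sketch of how StringBuilder and String.replaceAll could be timed against each other when removing the literal "&sid=". The class ReplaceBenchmark, its method names, the input size and the round count are all hypothetical choices. A naive System.nanoTime loop like this is only indicative; a harness such as JMH would give more trustworthy numbers.

```java
import java.util.regex.Pattern;

final class ReplaceBenchmark {
    // Removes every literal occurrence of pattern using StringBuilder only.
    static String viaStringBuilder(String input, String pattern) {
        StringBuilder sb = new StringBuilder(input);
        int i = sb.indexOf(pattern);
        while (i >= 0) {
            sb.delete(i, i + pattern.length());
            i = sb.indexOf(pattern, i);
        }
        return sb.toString();
    }

    // replaceAll takes a regex, so the pattern is quoted to keep it literal.
    static String viaReplaceAll(String input, String pattern) {
        return input.replaceAll(Pattern.quote(pattern), "");
    }

    public static void main(String[] args) {
        StringBuilder page = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            page.append("<a href=\"./ucp.php?mode=logout&sid=3a40\">x</a>");
        }
        String input = page.toString();
        String pattern = "&sid=";
        int rounds = 200;
        long sink = 0; // consume results so the JIT cannot discard the work

        // Warm-up pass so both variants are compiled before timing.
        for (int i = 0; i < rounds; i++) {
            sink += viaStringBuilder(input, pattern).length();
            sink += viaReplaceAll(input, pattern).length();
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += viaStringBuilder(input, pattern).length();
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += viaReplaceAll(input, pattern).length();
        }
        long t2 = System.nanoTime();

        System.out.println("StringBuilder: " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("replaceAll:    " + (t2 - t1) / 1_000_000 + " ms");
        System.out.println("(sink " + sink + ")");
    }
}
```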
Here is a small benchmark between StringBuilder and String.replaceAll:
Results: