I have a regular expression (the pattern in the `preg_match` call below) that I have tested and confirmed works. It behaves correctly on smaller arrays, but it does not appear to work on a very large one. What gives? The script below is intended to remove array entries that contain uncommon characters, meaning anything outside printable ASCII. Look at what happens when you try both arrays: the first one successfully removes the odd array entry, but the larger one does not!
See here:
http://pastebin.com/raw.php?i=KpNZHbrv
$sentences_working = array(
'Egypt, officially the Arab Republic of Egypt, is a country mainly in North Africa, with the Sinai Peninsula forming a land bridge in Southwest Asia.',
'Egypt is thus a transcontinental country, and a major power in Africa, the Mediterranean Basin, the Middle East and the Muslim world.',
'The English name Egypt was borrowed from Middle French Egypte, from Latin, from ancient Greek Aígyptos, from earlier Linear B a-ku-pi-ti-yo.',
'In later years, the dynasty became a British puppet.',
);
$sentences_notworking = unserialize(urldecode("a%3A30%3A%7Bi%3A0%3Bs%3A160%3A%22Egypt++%2C+officially+the+Arab+Republic+of+Egypt%2C+Arabic%3A+%2C+is+a+country+mainly+in+North+Africa%2C+with+the+Sinai+Peninsula+forming+a+land+bridge+in+Southwest+Asia.%22%3Bi%3A1%3Bs%3A133%3A%22Egypt+is+thus+a+transcontinental+country%2C+and+a+major+power+in+Africa%2C+the+Mediterranean+Basin%2C+the+Middle+East+and+the+Muslim+world.%22%3Bi%3A2%3Bs%3A223%3A%22Covering+an+area+of+about+1%2C010%2C000+square+kilometers+%2C+Egypt+is+bordered+by+the+Mediterranean+Sea+to+the+north%2C+the+Gaza+Strip+and+Israel+to+the+northeast%2C+the+Red+Sea+to+the+east%2C+Sudan+to+the+south+and+Libya+to+the+west.%22%3Bi%3A3%3Bs%3A74%3A%22Egypt+is+one+of+the+most+populous+countries+in+Africa+and+the+Middle+East.%22%3Bi%3A4%3Bs%3A171%3A%22The+great+majority+of+its+over+81+million+people+live+near+the+banks+of+the+Nile+River%2C+in+an+area+of+about+40%2C000+square+kilometers+%2C+where+the+only+arable+land+is+found.%22%3Bi%3A5%3Bs%3A60%3A%22The+large+areas+of+the+Sahara+Desert+are+sparsely+inhabited.%22%3Bi%3A6%3Bs%3A177%3A%22About+half+of+Egypt%27s+residents+live+in+urban+areas%2C+with+most+spread+across+the+densely+populated+centres+of+greater+Cairo%2C+Alexandria+and+other+major+cities+in+the+Nile+Delta.%22%3Bi%3A7%3Bs%3A118%3A%22Monuments+in+Egypt+such+as+the+Giza+pyramid+complex+and+its+Great+Sphinx+were+constructed+by+its+ancient+civilization.%22%3Bi%3A8%3Bs%3A155%3A%22Its+ancient+ruins%2C+such+as+those+of+Memphis%2C+Thebes%2C+and+Karnak+and+the+Valley+of+the+Kings+outside+Luxor%2C+are+a+significant+focus+of+archaeological+study.%22%3Bi%3A9%3Bs%3A83%3A%22The+tourism+industry+and+the+Red+Sea+Riviera+employ+about+12%25+of+Egypt%27s+workforce.%22%3Bi%3A10%3Bs%3A170%3A%22The+economy+of+Egypt+is+one+of+the+most+diversified+in+the+Middle+East%2C+with+sectors+such+as+tourism%2C+agriculture%2C+industry+and+service+at+almost+equal+production+levels.%22%3Bi%3A11%3Bs%3A133%3A%22In+early+2011%2C+Egypt+underwent+a+revolut
ion%2C+which+resulted+in+the+ousting+of+President+Hosni+Mubarak+after+nearly+30+years+in+power.%22%3Bi%3A12%3Bs%3A50%3A%22Presidential+elections+are+scheduled+for+May+2012.%22%3Bi%3A13%3Bs%3A190%3A%22The+English+name+Egypt+was+borrowed+from+Middle+French+Egypte%2C+from+Latin+%2C+from+ancient+Greek+A%26iacute%3Bgyptos+%2C+from+earlier+Linear+B+%26%2365601%3B%26%2365555%3B%26%2365568%3B%26%2365588%3B%26%2365549%3B+a-ku-pi-ti-yo.%22%3Bi%3A14%3Bs%3A277%3A%22The+adjective+aig%26yacute%3Bpti-%2C+aig%26yacute%3Bptios+was+borrowed+into+Coptic+as+%26%2311397%3B%26%2311433%3B%26%2311425%3B%26%231007%3B%26%2311411%3B%26%2311423%3B%26%2311429%3B%2F%26%2311413%3B%26%2311433%3B%26%2311425%3B%26%231007%3B%26%2311411%3B%26%2311423%3B%26%2311429%3B+gyptios%2C+kyptios%2C+and+from+there+into+Arabic+as+%2C+back+formed+into+%2C+whence+English+Copt.%22%3Bi%3A15%3Bs%3A209%3A%22The+Greek+forms+were+borrowed+from+Late+Egyptian++Hikuptah+%22Memphis%22%2C+a+corruption+of+the+earlier+Egyptian+name+Hwt-ka-Ptah+%2C+meaning+%22home+of+the+ka++of+Ptah%22%2C+the+name+of+a+temple+to+the+god+Ptah+at+Memphis.%22%3Bi%3A16%3Bs%3A130%3A%22Strabo+attributed+the+word+to+a+folk+etymology+in+which+A%26iacute%3Bgyptos++evolved+as+a+compound+from++%2C+meaning+%22below+the+Aegean%22.%22%3Bi%3A17%3Bs%3A288%3A%22%2C+the+Arabic+and+modern+official+name+of+Egypt+%2C+is+of+Semitic+origin%2C+directly+cognate+with+other+Semitic+words+for+Egypt+such+as+the+Hebrew+%26lrm%3B+%2C+literally+meaning+%22the+two+straits%22+.+The+word+originally+connoted+%22metropolis%22+or+%22civilization%22+and+means+%22country%22%2C+or+%22frontier-land%22.%22%3Bi%3A18%3Bs%3A232%3A%22The+ancient+Egyptian+name+of+the+country+is+Kemet++%5B%26%2378222%3B%26%2378163%3B%26%2378799%3B%26%2378486%3B%5D%2C+which+means+%22black+land%22%2C+referring+to+the+fertile+black+soils+of+the+Nile+flood+plains%2C+distinct+from+the+deshret+%2C+or+%22red+land%22+of+the+desert.%22%3Bi%3A19%3Bs%3A152%3A%22The+name+is+realized+as++and++in+the+Coptic+stage+of+the+Egy
ptian+language%2C+and+appeared+in+early+Greek+as++.+Another+name+was++%22land+of+the+riverbank%22.%22%3Bi%3A20%3Bs%3A105%3A%22The+names+of+Upper+and+Lower+Egypt+were+Ta-Sheme%27aw++%22sedgeland%22+and+Ta-Mehew++%22northland%22%2C+respectively.%22%3Bi%3A21%3Bs%3A79%3A%22There+is+evidence+of+rock+carvings+along+the+Nile+terraces+and+in+desert+oases.%22%3Bi%3A22%3Bs%3A103%3A%22In+the+10th+millennium+BC%2C+a+culture+of+hunter-gatherers+and+fishers+replaced+a+grain-grinding+culture.%22%3Bi%3A23%3Bs%3A117%3A%22Climate+changes+and%2For+overgrazing+around+8000+BC+began+to+desiccate+the+pastoral+lands+of+Egypt%2C+forming+the+Sahara.%22%3Bi%3A24%3Bs%3A129%3A%22Early+tribal+peoples+migrated+to+the+Nile+River+where+they+developed+a+settled+agricultural+economy+and+more+centralized+society.%22%3Bi%3A25%3Bs%3A63%3A%22By+about+6000+BC+a+Neolithic+culture+rooted+in+the+Nile+Valley.%22%3Bi%3A26%3Bs%3A104%3A%22During+the+Neolithic+era%2C+several+predynastic+cultures+developed+independently+in+Upper+and+Lower+Egypt.%22%3Bi%3A27%3Bs%3A108%3A%22The+Badarian+culture+and+the+successor+Naqada+series+are+generally+regarded+as+precursors+to+dynastic+Egypt.%22%3Bi%3A28%3Bs%3A100%3A%22The+earliest+known+Lower+Egyptian+site%2C+Merimda%2C+predates+the+Badarian+by+about+seven+hundred+years.%22%3Bi%3A29%3Bs%3A198%3A%22Contemporaneous+Lower+Egyptian+communities+coexisted+with+their+southern+counterparts+for+more+than+two+thousand+years%2C+remaining+culturally+distinct%2C+but+maintaining+frequent+contact+through+trade.%22%3B%7D"));
$sentences = $sentences_working; // change here for testing
foreach ($sentences as $sentence_key => $sentence)
{
    if (preg_match('/[^\x20-\x7E]/', $sentence))
    {
        unset($sentences[$sentence_key]);
    }
}
echo "<pre>";
print_r($sentences);
echo "</pre>";
To test both arrays (the smaller and the larger), simply change the `$sentences = $sentences_working;` assignment. What's going on?
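The long string you provided is not only URL-encoded and PHP-serialized; its special characters are also encoded as HTML entities. After `urldecode` and `unserialize`, the sentences contain sequences such as `&iacute;` and `&#65601;`, which consist entirely of printable ASCII, so `/[^\x20-\x7E]/` never matches them.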
html_entity_decode should fix it, but I would also think a bit about whether a triple encoding is really what you want.
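As a rough sketch of that suggestion, the loop could decode each sentence before testing it. The two sample sentences are shortened stand-ins for the entries in the question, and the choice of `ENT_QUOTES | ENT_HTML5` with `'UTF-8'` is an assumption about how the data was encoded, not something confirmed by the original post.

```php
<?php
// Hypothetical sample: one plain sentence, one with HTML entities
// (shortened from the entries in the question's serialized array).
$sentences = array(
    'Egypt is one of the most populous countries in Africa and the Middle East.',
    'from ancient Greek A&iacute;gyptos, from earlier Linear B &#65601; a-ku-pi-ti-yo.',
);

foreach ($sentences as $sentence_key => $sentence) {
    // Turn entities back into real UTF-8 bytes so the byte-range check can see them.
    $decoded = html_entity_decode($sentence, ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // Same pattern as in the question: any byte outside printable ASCII.
    if (preg_match('/[^\x20-\x7E]/', $decoded)) {
        unset($sentences[$sentence_key]);
    }
}

print_r($sentences);
```

With the decoding step in place, only the plain ASCII sentence should remain in the array.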
Edit: Just saw your comment:
A few things here. This is very easy to find out: just do a hexdump. You would also have spotted it earlier if you had run the script in a terminal, where `&iacute;` shows up literally instead of being rendered as í. If you're going to check the output of a PHP script where encoding is relevant, at least use 'view source' rather than the rendered page in your browser.
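For a quick hexdump from within PHP itself, `bin2hex` is one option. The snippet below is only an illustration: the sample string is taken from the question's data, and decoding as UTF-8 is an assumption.

```php
<?php
// Sample fragment from the question's data, still entity-encoded.
$encoded = 'A&iacute;gyptos';
// Assumption: the target encoding is UTF-8.
$decoded = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

// Hex of the raw string: every byte is printable ASCII (0x26 is '&', 0x3b is ';').
$encoded_hex = bin2hex($encoded);
// Hex after decoding: the accented letter becomes a multi-byte UTF-8 sequence.
$decoded_hex = bin2hex($decoded);

echo $encoded_hex, "\n";
echo $decoded_hex, "\n";
```

The first line of output contains only byte values between 20 and 7e, which is why the original pattern finds nothing to remove.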