I was looking into multi-byte characters and how they are used, but how many different identifiers/patterns are used to write them?
e.g. &nbsp;, &#160;, U+0026, %20
How many different identifiers, such as &, &#, U+, %, etc., are there?
I'm trying to screen inputs: if an input contains a word more than 255 characters long, it is probably a multi-byte hack attempt. I can then split the word, check whether it contains one of the multi-byte identifiers, and stop the hack attempt if it does.
%XX format: a URL-encoded (percent-encoded) value for embedding into URLs, e.g. %20 is a space (ASCII hex 20, decimal 32).
&nbsp;: a named character entity, a non-breaking space in this case.
U+0026: a Unicode code point in hex notation, an & in this case.
&#...;: a numbered character entity in decimal (base 10): &#38; = &
&#x...;: a numbered character entity in hex (base 16): &#x26; = &
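To make the notations above concrete, here is a minimal Python sketch that decodes each one with the standard library and spots them in an input string. The names ENCODED, find_encoded and decode_code_point are made up for illustration, and the regular expression is an assumption that covers only the notations listed here, not every encoding an attacker could use.

```python
import html
import re
from urllib.parse import unquote

# Hypothetical pattern: matches only the notations listed above.
ENCODED = re.compile(
    r"%[0-9A-Fa-f]{2}"            # URL percent-encoding, e.g. %20
    r"|&#[0-9]+;"                 # decimal numeric entity, e.g. &#38;
    r"|&#[xX][0-9A-Fa-f]+;"       # hex numeric entity, e.g. &#x26;
    r"|&[A-Za-z][A-Za-z0-9]*;"    # named entity, e.g. &nbsp;
    r"|U\+[0-9A-Fa-f]{4,6}"       # code point notation, e.g. U+0026
)


def find_encoded(text):
    """Return every encoded sequence found in text, in order."""
    return ENCODED.findall(text)


def decode_code_point(notation):
    """Turn a 'U+XXXX' notation into the character it names."""
    return chr(int(notation[2:], 16))


# Decoding each notation with the standard library.
space = unquote("%20")                # URL-decoding
nbsp = html.unescape("&nbsp;")        # named entity
amp_dec = html.unescape("&#38;")      # decimal numeric entity
amp_hex = html.unescape("&#x26;")     # hex numeric entity
amp_cp = decode_code_point("U+0026")  # code point notation

sample = "a%20b&nbsp;c&#38;d&#x26;e U+0026"
print(find_encoded(sample))
```

For screening input, it is usually safer to decode first (unquote, then html.unescape) and validate the decoded text, rather than rejecting anything that merely contains one of these sequences, since legitimate input often contains them too.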