I have a file with two columns: one holds the content type of HTTP objects (text/html, application/rar, etc.) and the other holds the size in bytes.
Content Type Size
video/x-flv 100
image/jpeg 150
text/html 160
application/octet-stream 200
application/x-shockwave-flash ...
text/plain
application/x-javascript
text/xml
text/css
text/html; charset=utf-8
application/x-javascript; charset=utf-8 ...
As you can see, there are many variations of the same content type, such as application/x-javascript and application/x-javascript; charset=utf-8. So I would like to create another column that categorizes them more generically, so that these two would both just be web/javascript, and so on.
Content Type Size Category
video/x-flv 100 web/video
image/jpeg 150 web/image
text/html 160 web/html
application/octet-stream 200 web/binary
application/x-shockwave-flash ... web/flash
text/plain web/plaintext
application/x-javascript web/javascript
video/x-msvideo web/video
text/xml web/xml
text/css web/css
text/html; charset=utf-8 web/html
video/quicktime web/video
application/x-javascript; charset=utf-8 web/javascript
How would I accomplish this in R? I presume I need to use regular expressions of some sort for this.
There are several ways you can simplify your variable. Here I will use the stringr package for string manipulation functions.
First, copy your content type variable into a new character variable:
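A minimal sketch of that copy step. The data frame name `df` and the column names `type` and `category` are assumptions made for illustration, and only a few of the content types from the question are reproduced (sizes are left out):

```r
# Hypothetical data frame standing in for the file from the question
df <- data.frame(
  type = c("video/x-flv", "image/jpeg", "text/html",
           "application/x-javascript", "text/html; charset=utf-8",
           "application/x-javascript; charset=utf-8"),
  stringsAsFactors = FALSE
)

# Copy the column as character, so later replacements are not
# restricted by factor levels
df$category <- as.character(df$type)
```

Converting with `as.character()` matters if the column was read in as a factor, since assigning a value that is not an existing level would otherwise produce `NA`.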
This just gives you the same values as the original column.
Then you can work on your new variable. You can manually replace a given type value with another:
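For instance, an exact-match replacement could look like the sketch below. The vector `category` is a hypothetical stand-in for the new column, and the `web/html` label is taken from the question:

```r
# Hypothetical stand-in for the new category column
category <- c("text/html", "image/jpeg", "text/plain")

# Replace only the elements that are exactly "text/html"
category[category == "text/html"] <- "web/html"
```

This touches only exact matches, so a value like `text/html; charset=utf-8` would be left alone.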
You can use regexp matching to replace all the values that match a pattern, for example “video”:
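One way to do this with stringr is `str_detect()`, which returns a logical vector that can be used for subsetting. This is a sketch on a hypothetical `category` vector, not necessarily the exact call the answer originally used:

```r
library(stringr)

# Hypothetical stand-in for the new category column
category <- c("video/x-flv", "video/quicktime", "image/jpeg")

# Every value containing "video" is collapsed to one label
category[str_detect(category, "video")] <- "web/video"
```

Base R's `grepl("video", category)` would give the same logical vector if you prefer not to load stringr.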
Or you can use regexp replacement to clean certain values, for example by removing everything from the “;” onward in your content types:
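A sketch of that cleanup using `str_replace()`, again on a hypothetical `category` vector. The pattern `;.*$` is an assumption about what counts as the part to drop: the semicolon and everything after it.

```r
library(stringr)

# Hypothetical stand-in for the new category column
category <- c("text/html; charset=utf-8",
              "application/x-javascript; charset=utf-8",
              "text/css")

# Drop the first ";" and everything after it; values without ";" are unchanged
category <- str_replace(category, ";.*$", "")
```

After this step, the charset variants and the plain variants of a content type are identical strings, so a single exact-match replacement covers both.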
Be careful with the order of your instructions, though, as the result depends heavily on it.
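To illustrate the point about ordering, here is a small sketch that applies the same two steps in both orders. The vector names `raw`, `a`, and `b` are hypothetical:

```r
library(stringr)

raw <- c("text/html", "text/html; charset=utf-8")

# Order A: strip the parameters first, then recode exact values
a <- str_replace(raw, ";.*$", "")
a[a == "text/html"] <- "web/html"

# Order B: recode exact values first, then strip the parameters
b <- raw
b[b == "text/html"] <- "web/html"
b <- str_replace(b, ";.*$", "")
```

With order A both elements end up as `web/html`. With order B the second element is still `text/html`, because it did not match exactly at the time of the recode.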