I have a large database of resumes (CV), and a certain table skills grouping all users skills.
inside that table there’s a field skill_text that describes the skill in full text.
I’m looking for an algorithm/software/method to extract significant terms/phrases from that table in order to build a new table with standarized skills..
Here are some examples skills extracted from the DB :
- Sectoral and competitive analysis
- Business Development (incl. in international settings)
- Specific structure and road design software – Microstation, Macao, AutoCAD (basic knowledge)
- Creative work (Photoshop, In-Design, Illustrator)
- checking and reporting back on campaign progress
- organising and attending events and exhibitions
- Development : Aptana Studio, PHP, HTML, CSS, JavaScript, SQL, AJAX
- Discipline: One to one marketing, E-marketing (SEO & SEA, display, emailing, affiliate program) Mix marketing, Viral Marketing, Social network marketing.
The output shoud be something like :
- Sectoral and competitive analysis
- Business Development
- Specific structure and road design software –
- Macao
- AutoCAD
- Photoshop
- In-Design
- Illustrator
- organising events
- Development
- Aptana Studio
- PHP
- HTML
- CSS
- JavaScript
- SQL
- AJAX
- Mix marketing
- Viral Marketing
- Social network marketing
- emailing
- SEO
- One to one marketing
As you see only skills remains no other representation text.
I know this is possible using text mining technics but how to do it ?
the database is realy large.. it’s a good thing because we can calculate text frequency and decide if it’s a real skill or just meaningless text…
The big problem is .. how to determin that “blablabla” is a skill ?
Edit :
please don’t tell me to use standard things like a text tokinzer, or regex .. because users input skills in a very arbitrary way !!
thanks
If I was doing this programmatically I would:
Extract all punctuation delimited data (or perhaps just brackets and commas) into a new table (with no primary key, just skill) so
Creative work (Photoshop, In-Design, Illustrator)becomesThen, after you’ve proceed all CVs, query for the most common skills (this is MySQL)
Which may look like this contrived example
Then you decide, from the top X skills, which you want to capture, which must map to other skills (
IndesignandIn-designshould map to the same skill, for example) and which to discard, then script the process using a data map.Use the data map to write a new word frequency table (this time skill_id, skill, frequency) and the second time when parsing the data also write to a lookup table (cv_id,skill_id). Your data will then be in a state where each CV is mapped to a number of skills, and each skill to a number of CVs. You can query for the most popular skills, CVs matching certain criteria etc.