I have some data that needs to be cleaned before inserting it into a DB. Each row represents a publication, and some of them are formatted differently. The only similarity is that each record is on a single row. E.g.:
5: Aghasadeghi MR, Salmani AS, Sadat SM, Javadi F, Memarnejadian A, Vahabpour R, Zabihollahi R, Moshiri A, Siadat SD. Application of outer membrane vesicle ofNeisseria meningitidis serogroup B as a new adjuvant to induce stronglyTh1-oriented responses against HIV-1. Curr HIV Res. 2011 Dec 1;9(8):630-5. PubMedPMID: 22211657.
6: Ramezani A; Banifazl M; Mohraz M; Rasoolinejad M; Aghakhani A; Occulthepatitis B virus infection: A major concern in HIV-infected patients: Occult HBVin HIV. Hepat Mon. 2011 Jan 1;11(1):7-10. PubMed PMID: 22087108; PubMed CentralPMCID: PMC3206662.
7: Roohvand, F., Kossari, N. Advances in hepatitis C virus vaccines, Part one:Advances in basic knowledge for hepatitis C virus vaccine design. Expert OpinTher Pat. 2011 Dec;21(12):1811-30. Epub 2011 Oct 25. Review. PubMed PMID:22022980.
8: Chinikar, S., Javadi, A., Ataei, B., Shakeri, H., Moradi, M., Mostafavi, E., Ghiasi, S.M.Detection of West Nile virus genome and specific antibodies in Iranianencephalitis patients. Epidemiol Infect. 2011 Oct 19:1-5. [Epub ahead of print]PubMed PMID: 22008154.
You can see that some of the authors are separated by semicolons and others by commas. Rows 7 and 8 also have a comma that separates the last name from the initials. I would like to group all of the authors and put them in an author field, OR maybe even place them in their own columns. What would be the best way to separate each of these authors to do this? This is not an easy task 😉
This can get tricky when the format is not consistent, because you need to make some assumptions. The assumption I’m making for this solution is that no author name is longer than 20 characters, and that every title is at least 20 characters long and contains no commas, semicolons, or periods.
Here is a version that will insert a tab after the final author:
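As a minimal sketch of that idea in Python (my own illustration under the assumptions above, not necessarily the code originally posted): the author block is taken to end at the first comma, semicolon, or period that is followed by at least 20 characters containing none of those three. The function name `insert_tab_after_authors` is made up for this example.

```python
import re

# Heuristic: the author block ends at the first delimiter (, ; .) that is
# followed by 20 or more characters containing no comma, semicolon, or period.
AUTHORS_END = re.compile(r"^(.*?[,;.])\s*(?=[^,;.]{20,})")

def insert_tab_after_authors(line):
    """Replace the whitespace after the final author with a single tab."""
    return AUTHORS_END.sub(lambda m: m.group(1) + "\t", line, count=1)

row = ("7: Roohvand, F., Kossari, N. Advances in hepatitis C virus vaccines, "
       "Part one:Advances in basic knowledge for hepatitis C virus vaccine design.")
print(insert_tab_after_authors(row))
# The tab lands between "Kossari, N." and "Advances in hepatitis C ..."
```

Lines that do not fit the heuristic (for example, a title shorter than 20 characters) are returned unchanged, so it is worth spot-checking the output before loading it into the DB.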
And here is a way to get a list of authors for each book:
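A rough Python sketch of this step, again my own illustration rather than the original code: it isolates the author block with the same 20-character heuristic, splits it on commas and semicolons, and glues a token made only of dotted initials (such as `F.` or `S.M.`) back onto the preceding surname. The names `split_authors`, `AUTHOR_BLOCK`, and `INITIALS` are hypothetical, and the leading `N:` record number is assumed to be present on every row.

```python
import re

# Author block: everything after the "N: " prefix up to the first delimiter
# followed by 20+ characters with no comma, semicolon, or period.
AUTHOR_BLOCK = re.compile(r"^\d+:\s*(.*?[,;.])\s*(?=[^,;.]{20,})")
# A token consisting only of dotted initials, e.g. "F." or "S.M."
INITIALS = re.compile(r"^(?:[A-Z]\.)+$")

def split_authors(line):
    """Return the list of authors found at the start of one record."""
    match = AUTHOR_BLOCK.match(line)
    if match is None:
        return []
    authors = []
    for token in re.split(r"[,;]", match.group(1)):
        token = token.strip()
        if not token:
            continue
        if INITIALS.match(token) and authors:
            # "Roohvand" + "F." -> "Roohvand, F."
            authors[-1] = f"{authors[-1]}, {token}"
        else:
            authors.append(token.rstrip("."))
    return authors

row = ("7: Roohvand, F., Kossari, N. Advances in hepatitis C virus vaccines, "
       "Part one:Advances in basic knowledge for hepatitis C virus vaccine design.")
print(split_authors(row))  # ['Roohvand, F.', 'Kossari, N.']
```

From here the list can be joined into a single author field or written out one author per column, whichever suits the table design.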
Result: