I have a column on an SQL Server 2005 table called BIO – the data in the BIO column is formatted like this:
<HTML><HEAD><TITLE></TITLE></HEAD><BODY><STRONG><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>, </STRONG><A name=HO>B.Sc.</A>, <A name=HO>M.Sc.</A>, <A name=HO>Ph.D.</A>; <A name=OC>scientist, professor</A>; b. <A name=BC>St. Marys</A>, Ont. <A name=BY>1970</A>; <A name=PA>d. Wm. and H. Aarts</A>; <A name=ED>e. Univ. of Western Ont. B.Sc.(Hons.) 1994, M.Sc. 1997</A>; <A name=ED>McGill Univ. Ph.D. 2002</A>; <A name=MA>m. L. MacManus</A>; two children; <A name=PO>CANADA RESEARCH CHAIR IN SIGNAL TRANSDUCTION IN ISCHEMIA</A> and <A name=PO>ASST. PROF., DEPT. OF BIOL. SCI., UNIV. OF TORONTO SCARBOROUGH 2006– </A>; Postdoctoral Fellow, Toronto Western Hosp. 2000–06; Expert Cons., Auris Med. SAS, Montpellier, France; mem., Centre for the Neurobiol. of Stress; named INMHA Brainstar of the Year 2003; Bd. of Dirs. & Fundraising Chair, N'Sheemaehn Childcare; mem., Soc. for Neurosci.; Cdn. Physiol. Soc.; Cdn. Assn. for Neurosci.; <A name=WK>co-author: 'Therapeutic Tools in Brain Damage' in <EM>Proteomics and Protein Interactions: Biology, Chemistry, Bioinformatics and Drug Design </EM>2005; 18 pub. journal articles</A>; Office: <A name=OF1_L1>1265 Military Trail</A>, <A name=OF1_CT>Scarborough</A>, <A name=OF1_PR>Ont.</A> <A name=OF1_PC>M1C 1A4</A>. </BODY></HTML>
I need to extract values from each of the anchor tags ie:
<A name=SN>AARTS</A>
I would need to have AARTS in a column called SN in the result set
This is what I have so far…
SELECT CONTACT_ID
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=SN>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=SN>', [BIO])) - CHARINDEX('<A name=SN>', [BIO])-11))) AS 'SN'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=GN>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=GN>', [BIO])) - CHARINDEX('<A name=GN>', [BIO])-11))) AS 'GN'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=HO>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=HO>', [BIO])) - CHARINDEX('<A name=HO>', [BIO])-11))) AS 'HO'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=OC>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=OC>', [BIO])) - CHARINDEX('<A name=OC>', [BIO])-11))) AS 'OC'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=PO>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=PO>', [BIO])) - CHARINDEX('<A name=PO>', [BIO])-11))) AS 'PO'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=BD>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=BD>', [BIO])) - CHARINDEX('<A name=BD>', [BIO])-11))) AS 'BD'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=PA>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=PA>', [BIO])) - CHARINDEX('<A name=PA>', [BIO])-11))) AS 'PA'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=BY>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=BY>', [BIO])) - CHARINDEX('<A name=BY>', [BIO])-11))) AS 'BY'
,dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=ED>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=ED>', [BIO])) - CHARINDEX('<A name=ED>', [BIO])-11))) AS 'ED'
FROM [cww].[dbo].[Contacts]
ORDER BY CONTACT_ID
The results I get from that look like this:
CONTACT_ID SN GN HO OC PO DB PA BY ED
3 AARON Raymond Leonard B.Sc. business coach, professional speaker, real estate entrepreneur D>AARON
5 AATAMI Pita C.Q. business executive; Kuujjuaq
7 ABBOTT Anthony C. P.C. lawyer Montreal
8 ABBOTT Elizabeth M.A. historian Ottawa
9 ABBOTT (Caroline) Louise D>ABBOTT writer, photographer, filmmaker Montreal
I can keep going and manually add all of the substrings for each differently named anchor but the problem with this is that I do not know all of ‘names’ that are used in the anchors and there are 22000+ records in this table that I would have to look through to make sure I catch them all. As well, not all BIOs have all the anchors so if you look at the result for ‘ABBOTT (Caroline) Louise’ she doesn’t have an ‘HO’ anchor so it returns incorrect data ‘D>ABBOTT’ and I haven’t seen this yet with the limited results I’m bringing up but some records have multiple anchors such as 2 ‘HO’s which I imagine will cause problems..
One last problem is that not all anchor names are 2 letters so the 11 I’m using in the charindex would be wrong for those ones..
Is there a better way to do this? Any help would be appreciated.
UPDATE – I’ve added CASE statements to remove incorrect data when the anchor name doesn’t exist for the current record.
SELECT CONTACT_ID
,'SN' =
CASE
WHEN CHARINDEX('<A name=SN>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=SN>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=SN>', [BIO])) - CHARINDEX('<A name=SN>', [BIO])-11)))
END
,'GN' =
CASE
WHEN CHARINDEX('<A name=GN>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=GN>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=GN>', [BIO])) - CHARINDEX('<A name=GN>', [BIO])-11)))
END
,'HO' =
CASE
WHEN CHARINDEX('<A name=HO>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=HO>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=HO>', [BIO])) - CHARINDEX('<A name=HO>', [BIO])-11)))
END
,'OC' =
CASE
WHEN CHARINDEX('<A name=OC>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=OC>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=OC>', [BIO])) - CHARINDEX('<A name=OC>', [BIO])-11)))
END
,'PO' =
CASE
WHEN CHARINDEX('<A name=PO>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=PO>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=PO>', [BIO])) - CHARINDEX('<A name=PO>', [BIO])-11)))
END
,'BD' =
CASE
WHEN CHARINDEX('<A name=BD>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=BD>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=BD>', [BIO])) - CHARINDEX('<A name=BD>', [BIO])-11)))
END
,'PA' =
CASE
WHEN CHARINDEX('<A name=PA>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=PA>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=PA>', [BIO])) - CHARINDEX('<A name=PA>', [BIO])-11)))
END
,'BY' =
CASE
WHEN CHARINDEX('<A name=BY>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=BY>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=BY>', [BIO])) - CHARINDEX('<A name=BY>', [BIO])-11)))
END
,'ED' =
CASE
WHEN CHARINDEX('<A name=ED>', [BIO]) = 0 THEN NULL
ELSE dbo.udf_StripHTML(SUBSTRING([BIO], (CHARINDEX('<A name=ED>', [BIO]) + 11), (CHARINDEX('</A>', [BIO], CHARINDEX('<A name=ED>', [BIO])) - CHARINDEX('<A name=ED>', [BIO])-11)))
END
--INTO [cww].[dbo].[BioDetails]
FROM [cww].[dbo].[Contacts]
ORDER BY CONTACT_ID
I don’t know how you could do this purely in T-SQL.
If you can retrieve the CONTACT_ID and BIO columns into an application, you could iterate over the result set, parse the BIO data as XML, then use XPath to get the name attribute value and the anchor body, building a map of the data to be inserted into your new table. Since you don’t know all the different names that could exist, you’ll probably need to recreate the table each time it’s run, so store names found in a Set and after iterating over all the rows use the Set to generate your
create tablestatement.The DB code is pure fantasy, but here’s a snippet showing how you could do it using the XOM XML library for Java. I’m not positive this would work since your attribute values aren’t quoted, but you might be able to find a parser that isn’t too picky, and I’m sure you could do something similar in .NET.