I want to build a scraper that parses transcripts from the Leveson Inquiry, which are plain text in the following format:
1 Thursday, 2 February 2012
2 (10.00 am)
3 LORD JUSTICE LEVESON: Good morning.
4 MR BARR: Good morning, sir. We're going to start today
5 with witnesses from the mobile phone companies,
6 Mr Blendis from Everything Everywhere, Mr Hughes from
7 Vodafone and Mr Gorham from Telefonica.
8 LORD JUSTICE LEVESON: Very good.
9 MR BARR: We're going to listen to them all together, sir.
10 Can I ask that the gentlemen are sworn in, please.
11 MR JAMES BLENDIS (affirmed)
12 MR ADRIAN GORHAM (sworn)
13 MR MARK HUGHES (sworn)
14 Questions by MR BARR
15 MR BARR: Can I start, please, Mr Hughes, with you. Could
16 you tell us the position that you hold and a little bit
17 about your professional background, please?
18 MR HUGHES: Yes, sure. I'm currently head of fraud risk and
19 security for Vodafone UK. I have been in that position
20 since August 2011 and I've worked in the fraud risk and
21 security department in Vodafone since October 2006.
22 Q. Mr Gorham, if I could ask you the same question, please.
23 MR GORHAM: I'm the head of fraud and security for
24 Telefonica O2, I've been in that role for ten years and
25 have been in the industry for 13.
1
Ultimately I want to build an XML file structured as follows:
<hearing date="2012-02-02" time="10:00">
<quote speaker="Lord Justice Leveson" page="1" line="3">Good morning.</quote>
<quote speaker="Mr Barr" page="1" line="4">Good morning, sir. We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</quote>
<quote speaker="Lord Justice Leveson" page="1" line="8">Very good.</quote>
[... and on ...]
</hearing>
…Any help?
(Also note that “MR BARR:” changes to simply “Q.” at a certain point.)
Many thanks!
Let me start by saying that this is not a foolproof script; there may well be something I forgot or overlooked.
It is a proof of concept for you to improve and expand upon, or just to get an idea from.
There are enough regularities in the text layout for us to work with. The script splits the
transcript into an array of lines and matches each line against a few patterns to identify those
regularities and determine what role each line plays.
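As a rough illustration of that line-matching idea, here is a minimal sketch in Python (the choice of language is an assumption, and this is not the script referred to in this answer). The patterns, the `parse_transcript` name and the rules are all inferred from the sample page in the question only: a bare number on its own line is treated as the footer that closes a page, "Q." is attributed to whoever was named in the last "Questions by" heading, and names are simply title-cased. "A." lines, empty numbered lines and other layouts are not handled.

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime

# Patterns inferred from the sample page only; real transcripts may need more.
NUMBERED = re.compile(r"^(\d{1,2}) (.*)$")      # "4 MR BARR: Good morning, ..."
PAGE_FOOTER = re.compile(r"^\d+$")              # bare page number closing a page
CLOCK = re.compile(r"^\((\d{1,2})\.(\d{2}) (am|pm)\)$")
HEADING = re.compile(r"^Questions by (.+)$")
NOTE = re.compile(r"^\(.*\)$|\((?:sworn|affirmed)\)$")
SPEAKER = re.compile(r"^([A-Z][A-Z'.-]*(?: [A-Z][A-Z'.-]*)*): (.*)$")
QUESTION = re.compile(r"^Q\. (.*)$")


def parse_transcript(text):
    """Return a <hearing> element with one <quote> per speaker turn."""
    hearing = ET.Element("hearing")
    page, examiner, current = 1, None, None
    for raw in text.splitlines():
        raw = raw.strip()
        if PAGE_FOOTER.match(raw):
            page = int(raw) + 1  # footer closes page N, so page N + 1 starts
            continue
        m = NUMBERED.match(raw)
        if not m or not m.group(2).strip():
            continue
        line_no, body = m.group(1), m.group(2).strip()

        # The first line that parses as a date becomes the hearing date.
        if "date" not in hearing.attrib:
            try:
                day = datetime.strptime(body, "%A, %d %B %Y")
                hearing.set("date", day.strftime("%Y-%m-%d"))
                continue
            except ValueError:
                pass
        clock = CLOCK.match(body)
        if clock:
            hour = int(clock.group(1)) % 12 + (12 if clock.group(3) == "pm" else 0)
            # Only the first timestamp is kept as the start time.
            hearing.attrib.setdefault("time", f"{hour:02d}:{clock.group(2)}")
            continue
        heading = HEADING.match(body)
        if heading:
            examiner, current = heading.group(1).title(), None
            continue
        if NOTE.search(body):  # "(sworn)", "(affirmed)", parenthetical notes
            current = None
            continue

        question, labelled = QUESTION.match(body), SPEAKER.match(body)
        if question:
            speaker, spoken = examiner or "Unknown", question.group(1)
        elif labelled:
            speaker, spoken = labelled.group(1).title(), labelled.group(2)
        elif current is not None:
            current.text += " " + body  # continuation of the previous turn
            continue
        else:
            continue
        current = ET.SubElement(hearing, "quote", speaker=speaker,
                                page=str(page), line=line_no)
        current.text = spoken
    return hearing
```

With the transcript text in a string, `ET.tostring(parse_transcript(text), encoding="unicode")` should give the XML as a single string. Each `<quote>` records the page and line on which the turn starts, as in the structure the question asks for.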
Example Script:
I will update the comments and script later today.
Output Sample:
By the way, just out of curiosity, what do you need this for?