The Archive Base

Editorial Team
Asked: May 30, 2026

I’m wanting to build a scraper that parses through transcripts from the Leveson Inquiry

I’m wanting to build a scraper that parses through transcripts from the Leveson Inquiry, which are in the following format as plaintext:

         1                                      Thursday, 2 February 2012

         2   (10.00 am)

         3   LORD JUSTICE LEVESON:  Good morning.

         4   MR BARR:  Good morning, sir.  We're going to start today

         5       with witnesses from the mobile phone companies,

         6       Mr Blendis from Everything Everywhere, Mr Hughes from

         7       Vodafone and Mr Gorham from Telefonica.

         8   LORD JUSTICE LEVESON:  Very good.

         9   MR BARR:  We're going to listen to them all together, sir.

        10       Can I ask that the gentlemen are sworn in, please.

        11                   MR JAMES BLENDIS (affirmed)

        12                     MR ADRIAN GORHAM (sworn)

        13                      MR MARK HUGHES (sworn)

        14                       Questions by MR BARR

        15   MR BARR:  Can I start, please, Mr Hughes, with you.  Could

        16       you tell us the position that you hold and a little bit

        17       about your professional background, please?

        18   MR HUGHES:  Yes, sure.  I'm currently head of fraud risk and

        19       security for Vodafone UK.  I have been in that position

        20       since August 2011 and I've worked in the fraud risk and

        21       security department in Vodafone since October 2006.

        22   Q.  Mr Gorham, if I could ask you the same question, please.

        23   MR GORHAM:  I'm the head of fraud and security for

        24       Telefonica O2, I've been in that role for ten years and

        25       have been in the industry for 13.


                                         1

(Full example)

Ultimately I want to build an XML file structured as follows:

<hearing date="2012-02-02" time="10:00">
    <quote speaker="Lord Justice Leveson" page="1" line="3">Good morning.</quote>
    <quote speaker="Mr Barr" page="1" line="4">Good morning, sir. We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</quote>
    <quote speaker="Lord Justice Leveson" page="1" line="8">Very good.</quote>
[... and on ...]
</hearing>

…Any help?

(Also note that “MR BARR:” changes to simply “Q.” at a certain point.)

Many thanks!

1 Answer

  1. Editorial Team
    Added an answer on May 30, 2026 at 12:44 pm

    Let me start by saying that this is not a foolproof script; there may well be something I forgot or overlooked.
    It is a proof of concept for you to improve and expand upon, or just to get an idea from.

    There are enough regularities in the text layout for us to work with. The script splits the transcript
    into an array of lines and matches each line against a few patterns to identify those regularities and
    determine the role of each line (date, time, page number, event, new speaker, or follow-up text).

    Example Script:

    <?php
    /*
    Proof of Concept : Transcript to XML by Robjong
    
    ? :
        - action on date change (what to do when the date changes?)
        - what to do with lines like "MR MARK HUGHES (sworn)" (make it a note?!)
        - what to do with lines like "Questions by MR BARR" (make it a note?!)
        - detect events/notes in quotes better? (e.g: MR BLENDIS: (Nods head).)
    
    
    Notes :
    
        - desperately needs error checking/handling!!!! (for now it just got in the way)
        - it might well be that the configuration of PHP will cause file_get_contents to fail,
          try curl or download it manually and read the local file
        - if you are running PHP < 5.2.4, change the \h in the pattern to \s or [\t ]
    
    */
    
    # basic usage
    // get the transcript as plain text
    $txt = file_get_contents( 'http://www.levesoninquiry.org.uk/wp-content/uploads/2012/02/Transcript-of-Morning-Hearing-2-February-2012.txt' );
    // convert transcript to XML
    $xml = transcriptToXML_beta( $txt );
    // we have the transcript as XML, now what?
    file_put_contents( 'transcript.xml', $xml ); // let's write it to a file
    echo $xml;
    
    
    function transcriptToXML_beta( $string ) { // "beta" is just to emphasize the lack of thorough testing
        $lines = explode( "\n", $string ); // split text into an array of lines
        if( !is_array( $lines ) ) { // the provided string was not multiline
            return false;
        }
    
        // these vars will hold the data we need to build our XML
        $date = ''; // the date of the transcript
        $time = ''; // transcript time
        $page = 1; // this will hold the current page number
    
        $linenr = ''; // this will hold the line nr
        $speaker = ''; // the name of the speaker
        $text = ''; // transcribed text attributed to the speaker
        $new = false; // will be true if a new item has been matched
        $event = ''; // this will hold notes that are in a quote but need to be stored separately (events)
    
        $xml = ''; // this will be the XML string
        $i = 0; // count the lines to display actual line number for debugging
        foreach( $lines as $line ) { // loop over lines
            $i++;
            if( !preg_match( "/[[:graph:]]/", $line ) ) { // line is empty, does not contain printable characters....
                continue; // ....so we skip to the next one
            }
    
            if( preg_match( "/^\h*\d+\h+(?P<date>[a-z]+,\h+\d+\h+[a-z]+\h\d{4})\s*$/i", $line, $match ) ) { # it looks like a date
                $date = $match['date']; // set date
                $speaker = ''; // reset vars
                $text = '';
                continue;// no need to handle this line any further
            } elseif( preg_match( "/^\h*\d+\h+([A-Z]+(?:\s+[A-Z]+){0,4}\h+\(.*?\)|(?i:questions\h+by)[A-Z\h]+)\s*$/", $line, $match ) ) { # (queued) event, uppercase text followed by text between parentheses
                $event .= "    <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry to the queue, to be added after the current quote
                continue;// no need to handle this line any further
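            } elseif( preg_match( "/^\h*(\d*)\h*\(\h*(?P<time>\d{1,2}[:.]\d{1,2}\h*[ap]m)\)\s*$/i", $line, $match ) ) { # seems we have a time entry
                if( $new && $speaker ) { // flush the open quote first, otherwise the reset below would silently drop it
                    $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
                    $xml .= $event; // add any queued events after the quote
                    $event = '';
                    $new = false;
                }
                $time = $match['time']; // set time
                $xml .= "    <time page=\"{$page}\" line=\"{$match[1]}\">" . strtoupper( str_replace( '.', ':', $match['time'] ) ) . "</time>\n"; // add time as entry
                $speaker = ''; // reset vars
                $text = '';
                continue;// no need to handle this line any further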
            } elseif( preg_match( "/^\h*(\d+)\s*$/", $line, $match ) ) { # line has just one or more digits, we assume it's a page number
                if( $match[1] < $page ) { // if the number is lower than the current page number ignore it, this avoids triggering errors for 'empty lines' that only have a line number
                    continue;
                }
                $page = (int) $match[1] + 1; // set pagenr, add one because the nr is at the bottom of the page (comparing with < rather than <= so the footer of the current page is not skipped)
                continue;// no need to handle this line any further
            } elseif( preg_match( "/^\h*\d+\s+\(([[:print:]]+)\)\s*$/", $line, $match ) && !$speaker ) { # note, text is between parentheses
                $xml .= "    <event page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note
                continue;// no need to handle this line any further
            } elseif( preg_match( "/^\h*\d+\h+([A-Z\h]+\(.*?\))\s*$/", $line, $match ) && !$speaker ) { # note, uppercase text followed by text between parentheses, only if not in quote (captured as group 1 so that $match[1] is defined)
                $xml .= "    <event type=\"note\" speaker=\"\" page=\"{$page}\" line=\"{$linenr}\">{$match[1]}</event>\n"; // add entry as note
                continue;// no need to handle this line any further
            } elseif( preg_match("/^\h*(?P<linenr>\d+)\h+(?P<speaker>(?:\h[A-Z]+(?:\h[A-Z]+){0,4}))[:.]\h*(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # new speaker entry
                if( $new && $speaker ) { // if we have one open we need to add it first
                    $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n"; // add quote
                    $new = false; // reset
                    if( $event ) { // if we have a queued note we need to add that too
                        $xml .= $event; // add entry to XML string
                        $event = ''; // clear the queue
                    }
                }
                $speaker = trim( $match['speaker'] ); // assign match to speaker var
                $linenr = (int) $match['linenr']; // assign line number
                $text = trim( $match['text'] ); // assign text
                $new = true; // set new match bool
            } elseif( preg_match( "/^\h*(?P<linenr>\d+)\h+(?P<text>[[:print:]]+?)\s*$/", $line, $match ) ) { # follow up text
                $text .= ' ' . trim( $match['text'] ); // append text
            } else { # unknown line (add a check for linenr-only lines or remove the $match[1] comparison from the pagenr match conditional)
                // not sure what kind of line this is... add it as a separate 'type'?!
                trigger_error( "Unable to parse line {$i} \"{$line}\"" ); // throw exception / trigger error
                continue; // no need to handle this line any further
            }
    
            if( !$new && $speaker ) {
                $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
                $speaker = ''; // reset vars
                $text = '';
                $new = false;
                if( $event ) { // if we have a queued note we need to add that too
                    $xml .= $event; // add entry to XML string
                    $event = ''; // clear the queue
                }
            }
        }
    
        // if we have a match open we need to handle it, this might happen because we do not assign the match in the same iteration as we matched it
        if( $new ) {
            $xml .= "    <entry type=\"quote\" speaker=\"{$speaker}\" page=\"{$page}\" line=\"{$linenr}\">$text</entry>\n";
        }
    
        if( !trim( $xml ) ) { // no text found so $xml is still an empty string
            return false;
        }
    
        $date = new DateTime( $date ); // instantiate datetime with the time from the transcript
        $date = date( 'Y-m-d', $date->getTimestamp() ); // format date
        // now we need to wrap the nodes in a root node
        $xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<hearing date=\"{$date}\">\n{$xml}</hearing>\n";
    
        return $xml; // return the XML
    }
    ?>
    

    I will update the comments and script later today
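
One thing to watch for: the script above interpolates `$speaker` and `$text` straight into the XML string, so a transcript line containing `&` or `<` would produce a file that is not well-formed. Below is a minimal sketch of one way to handle that. It assumes a hypothetical helper named `quoteEntry()` (not part of the script above) and PHP 5.4 or later for the `ENT_XML1` and `ENT_SUBSTITUTE` flags of the built-in `htmlspecialchars()`.

```php
<?php
// Hypothetical helper (not in the original script): builds one <entry> line
// with the speaker attribute and the quote text escaped for XML.
function quoteEntry( $speaker, $page, $linenr, $text ) {
    // ENT_COMPAT escapes double quotes but leaves apostrophes alone,
    // ENT_XML1 selects XML entity names, ENT_SUBSTITUTE replaces invalid
    // byte sequences instead of returning an empty string
    $flags = ENT_COMPAT | ENT_XML1 | ENT_SUBSTITUTE;
    return sprintf(
        "    <entry type=\"quote\" speaker=\"%s\" page=\"%d\" line=\"%d\">%s</entry>\n",
        htmlspecialchars( $speaker, $flags, 'UTF-8' ),
        $page,
        $linenr,
        htmlspecialchars( $text, $flags, 'UTF-8' )
    );
}

// made-up text, only to show the escaping
echo quoteEntry( 'MR BARR', 1, 4, 'Fraud & security <team>' );
```

The three places in the script that build an `<entry ...>` string could call such a helper instead. Building the document with `DOMDocument` or `XMLWriter` would be another option, as both escape values for you.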

    Output Sample:

    <hearing date="2012-02-02"> 
        <time page="1" line="2">10:00 AM</time> 
        <entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="3">Good morning.</entry> 
        <entry type="quote" speaker="MR BARR" page="1" line="4">Good morning, sir.  We're going to start today with witnesses from the mobile phone companies, Mr Blendis from Everything Everywhere, Mr Hughes from Vodafone and Mr Gorham from Telefonica.</entry> 
        <entry type="quote" speaker="LORD JUSTICE LEVESON" page="1" line="8">Very good.</entry> 
        <entry type="quote" speaker="MR BARR" page="1" line="9">We're going to listen to them all together, sir. Can I ask that the gentlemen are sworn in, please.</entry> 
        <event page="1" line="9">MR JAMES BLENDIS (affirmed)</event> 
        <event page="1" line="9">MR ADRIAN GORHAM (sworn)</event> 
        <event page="1" line="9">MR MARK HUGHES (sworn)</event> 
        <event page="1" line="9">Questions by MR BARR</event> 
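
The sample above keeps speaker names in upper case and writes the time as "10:00 AM", while the question asks for `speaker="Mr Barr"` and `time="10:00"`; it also leaves "Q" as a speaker name once the transcript switches to "Q." lines. Here is a rough sketch of a post-processing step under a few assumptions. The helper names (`normalizeSpeaker()`, `questionerFromEvent()`, `resolveSpeaker()`, `normalizeTime()`) are hypothetical and not part of the script above, names are assumed to be plain ASCII, and "Q" is assumed to always mean the person named in the most recent "Questions by ..." event, which may not hold everywhere in the transcripts.

```php
<?php
// Hypothetical helpers, not part of the original script.

// "LORD JUSTICE LEVESON" -> "Lord Justice Leveson"
// (ucwords only capitalizes after whitespace, so "O'BRIEN" becomes "O'brien")
function normalizeSpeaker( $speaker ) {
    return ucwords( strtolower( trim( $speaker ) ) );
}

// Returns the normalized questioner for a "Questions by MR X" event, or null
function questionerFromEvent( $event ) {
    if( preg_match( '/^questions\s+by\s+(.+)$/i', trim( $event ), $m ) ) {
        return normalizeSpeaker( $m[1] );
    }
    return null;
}

// Replaces the abbreviated "Q" speaker with the last announced questioner
function resolveSpeaker( $speaker, $questioner ) {
    if( strtoupper( trim( $speaker ) ) === 'Q' && $questioner !== null ) {
        return $questioner;
    }
    return normalizeSpeaker( $speaker );
}

// "10.00 am" -> "10:00", "2.05 pm" -> "14:05"
function normalizeTime( $time ) {
    return date( 'H:i', strtotime( str_replace( '.', ':', $time ) ) );
}

// usage: walk over (type, value) pairs in transcript order
$questioner = null;
$items = array(
    array( 'event', 'Questions by MR BARR' ),
    array( 'quote', 'MR HUGHES' ),
    array( 'quote', 'Q' ),
);
foreach( $items as $item ) {
    list( $type, $value ) = $item;
    if( $type === 'event' ) {
        $found = questionerFromEvent( $value );
        if( $found !== null ) {
            $questioner = $found; // remember who is asking the questions
        }
        continue;
    }
    echo resolveSpeaker( $value, $questioner ), "\n"; // prints "Mr Hughes", then "Mr Barr"
}
```

Lines the transcript marks with "A." are left untouched here, since with several witnesses sworn in together it is not obvious from the text alone who is answering.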
    

    By the way, just out of curiosity, what do you need this for?
