I am trying to extract some specific data from a text file using regular expressions in a shell script,
that is, with a multiline grep. The tool I am using is pcregrep, so that I get compatibility with Perl’s regular expressions.
[58]Walid Chamoun Architects WLL
* [59]Map
* [60]Website
* [61]Email
* [62]Profile
* [63]Display Ad
Walid Chamoun Architects WLL
PO Box:
55803, Doha, Qatar
Location:
D-Ring Road, New Salata Shamail 40, Villa 340, Doha, Qatar
Tel:
(00974) 44568833
Fax:
(00974) 44568811
Mob:
(00974) 44568822
* Accurate Budget Costing
* Eco-Friendly Structural Design
* Exclusive & Unique Design
* Quality Architecture & Design
Company Profile
Walid Chamoun Architects (WCA) was founded in Beirut, Lebanon, in 1992,
committed to the concept of fully integrated design-build delivery of
projects. In late '90s, company established in-house architectural and
engineering services. As a full service provider, WCA expanded from
multi-family projects to industrial and office construction, which
added development services, including site acquisition and financing.
In 2001, WCA had opportunity and facilities to experience European
market and establish office in Puerto Banus, Marbella, Spain. By 2005,
WCA refined its structure to focus on specific market segments and new
office was opened in Doha, state of Qatar. From a solid foundation and
reputation built over eighteen years, WCA continually to provide
leadership in design-build through promotion of benefits and education
to its practitioners.
Project Planning: Project planning and investigation occurs before
design begins has greatest impact on cost, schedule and ultimately the
success of project. Creativity in Design: You can rely on our in-house
designers for design excellence in all aspects of the project. Our
designs have received recommendations and appreciations on national and
international levels. Creativity in Execution: Experienced in close
collaboration with the designers as part of the integrated team, our
construction managers, superintendents and field staff create value
throughout the project. Post Completion Services: Your needs can be
served through our skills and experience long after the last
construction crew has left the site. Performance: Corporate and
institutional clients, developers and public agencies repeatedly select
WCA on the basis of its consistent record of performance excellence.
Serving clients throughout the Middle East and GCC, WCA provides
complete planning for architectural, interior design and construction
on a single-responsibility basis. Our expertise spans industrial,
commercial, institutional, public and residential projects. Benefits of
Design-Build: Design-build is a system of contracting under which one
entity performs both design and construction. Benefits of design-build
project delivery include: Single point responsibility Early knowledge
of cost Time and Cost savings
Classification:
Architects - [64]Architects
[65]Al Ali Consulting & Engineering
* [66]Map
* Website
* Email
* Profile
* Display Ad
Is this your company?
[67]Upgrade this free listing here
PO Box:
467, Doha, Qatar
Tel:
(00974) 44360011
Company Profile
Classification:
Architects - [68]Architects
[69]Al Gazeerah Consulting Engineering
* [70]Map
* Website
* Email
* Profile
* Display Ad
Is this your company?
[71]Upgrade this free listing here
PO Box:
22414, Doha, Qatar
Tel:
(00974) 44352126
Company Profile
Classification:
Architects - [72]Architects
[73]Al Murgab Consulting Engineering
* [74]Map
* Website
* Email
* Profile
* Display Ad
Is this your company?
[75]Upgrade this free listing here
PO Box:
2856, Doha, Qatar
Tel:
(00974) 44448623
Company Profile
Classification:
Architects - [76]Architects
References
Visible links
1. http://www.qatcom.com/useraccounts/login
2. http://www.qatcom.com/useraccounts/register
3. http://www.qatcom.com/
4. http://www.qatcom.com/
5. http://www.qatcom.com/qataryellowpages/map-of-doha
6. http://www.qatcom.com/qataryellowpages/about-qatcom
7. http://www.qatcom.com/qataryellowpages/advertise-with-qatcom
8. http://www.qatcom.com/qataryellowpages/advertiser_testimonials
9. http://www.qatcom.com/useraccounts/login
10. http://www.qatcom.com/useraccounts/register
11. http://www.qatcom.com/contact-qatcom
12. http://www.qatcom.com/qataryellowpages/companies
13. http://www.qatcom.com/classifications/index/A
14. http://www.qatcom.com/classifications/index/B
15. http://www.qatcom.com/classifications/index/C
16. http://www.qatcom.com/classifications/index/D
17. http://www.qatcom.com/classifications/index/E
18. http://www.qatcom.com/classifications/index/F
19. http://www.qatcom.com/classifications/index/G
20. http://www.qatcom.com/classifications/index/H
21. http://www.qatcom.com/classifications/index/I
22. http://www.qatcom.com/classifications/index/J
23. http://www.qatcom.com/classifications/index/K
24. http://www.qatcom.com/classifications/index/L
25. http://www.qatcom.com/classifications/index/M
26. http://www.qatcom.com/classifications/index/N
27. http://www.qatcom.com/classifications/index/O
28. http://www.qatcom.com/classifications/index/P
For sample data like this, I am trying to grab the details of each company, namely:
company name
PO box
tel
fax
mobile
company profile
into a .csv file.
I am new to regular expressions, and to Linux too.
All I could manage to come up with was something like this:
\[\d*\][^\.]*[\(\d*\)\s\d*)]
Can anyone help me out with this, please?
Improvements:
I figured out something like this:
$ awk '/^\[/ && ! /Upgrade this free listing/ {print $0} /:$/ && ! /Classification/ {printf $0 ; getline x ; print x}' file
But that still isn’t what I want it to be…
You can do this in awk, but you’ll be better off parsing the HTML instead. A good tool to do that with would be Python using the Beautiful Soup module. But that’s not very exciting, so here’s how to do it the awkward (hah!) way:
Save as parse.awk and then invoke with ./parse.awk < sample.txt. Out comes a CSV, like this:
There are comments that should hopefully explain what’s going on. This will run in plain old awk and doesn’t require fancy gawk features. Keep in mind that awk arrays are arbitrarily ordered. This is prone to breaking a whole bunch with varying input data, which is just one of the many reasons why you really should parse the HTML instead of relying on such lynx -dump shenanigans.
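For illustration, here is a minimal sketch of what an awk-based parser for this kind of dump could look like. It is not the parse.awk mentioned above, only one possible take, and it rests on assumptions read off the sample: a record starts at a line of the form [NN]Name that is not an "Upgrade this free listing" link, each of "PO Box:", "Tel:", "Fax:" and "Mob:" sits alone on a line with its value on the next line, the profile is everything between "Company Profile" and "Classification:", and the "References" line ends the data. The function name listing_to_csv and the CSV header names are made up for this example. Columns are printed in an explicit order instead of looping with for (k in array), which sidesteps the arbitrary array ordering.

```shell
#!/bin/sh
# Sketch only: listing_to_csv is a hypothetical helper name, and the rules
# below are guesses based on the sample dump, not a general parser.
listing_to_csv() {
  awk '
    # Quote one CSV field, doubling any embedded double quotes.
    function q(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }

    # Print the current record (if any) in a fixed column order, then reset.
    function emit() {
      if (name != "")
        print q(name) "," q(f["PO Box"]) "," q(f["Tel"]) "," q(f["Fax"]) "," q(f["Mob"]) "," q(profile)
      name = ""; profile = ""; label = ""; inprofile = 0
      split("", f)    # portable way to empty an array
    }

    # Header names are assumptions chosen for this example.
    BEGIN { print "name,po_box,tel,fax,mob,profile" }

    # Trim leading and trailing whitespace on every line.
    { sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }

    # The link list starts here; exit still runs the END rule.
    /^References$/ { exit }

    # "[58]Company name" opens a record; "[67]Upgrade ..." links do not.
    /^\[[0-9]+\]/ && !/Upgrade this free listing/ {
      emit()
      name = $0
      sub(/^\[[0-9]+\]/, "", name)
      next
    }

    /^Classification:$/ { inprofile = 0; next }
    /^Company Profile$/ { inprofile = 1; next }

    # Profile text is wrapped over many lines: join them with spaces.
    inprofile { profile = profile (profile == "" ? "" : " ") $0; next }

    # A known label alone on a line means the next line holds its value.
    /^(PO Box|Tel|Fax|Mob):$/ { label = substr($0, 1, length($0) - 1); next }
    label != "" { f[label] = $0; label = ""; next }

    END { emit() }
  '
}
```

After sourcing the function, something like listing_to_csv < sample.txt > companies.csv would write the CSV. Every field is quoted because addresses and profiles contain commas; companies without a fax, mobile or profile get empty fields.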