Parsing a text file into an array

I am new to Perl and have not found any good examples of parsing to help me out. I have a text file that I am reading into an array that has to be parsed out and put into another ...
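The question is truncated, so the exact format is unknown, but the usual pattern looks like this; a minimal sketch, assuming a hypothetical comma-delimited file named data.txt:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input file name; adjust to suit.
my $file = 'data.txt';

open my $fh, '<', $file or die "Can't open $file: $!";
chomp( my @lines = <$fh> );    # slurp all lines, strip newlines
close $fh;

# Split each line on commas into its own array reference;
# change the pattern to match the real delimiter.
my @records = map { [ split /,/ ] } @lines;

# e.g. field 2 of the first record:
print $records[0][1], "\n" if @records;
```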

Posted On: Monday 26th of November 2012 08:47:45 PM Total Views:  269

Related Messages:

Which Perl module for parsing ambiguous grammars?   (82 Views)
I am looking for a module that can parse ambiguous grammars and that gives me all the possible parse trees. Also, it should have 'actions'. I looked at Parse::Yapp and Parse::RecDescent, which both look great and have actions, but neither will give multiple solutions in case of ambiguity (or will they?). I saw Parse::Earley, but it doesn't support actions. Any other suggestions?
HTML::TokeParser parsing   (100 Views)
I am using the HTML::TokeParser module on an HTML file that contains movie title, description, library call number, year, duration etc. The HTML files are not consistently marked up, so I have to rely on the paragraph and line break tags. I am trying to convert this into a SQL database. This is a sample HTML chunk; after it the next record starts:

```
124- Chocolat Santa Monica, CA, 106 minutes MGM Home Entertainment, 2001 DVD 791.43653 C451 A young woman returns to Cameroon to trace her past. Soon the sights, sounds, and smells sweep her back to her childhood and memories of the people who populated her youth. (French with English and Spanish subtitles.)
```

This is what I have so far:

```perl
use strict;
use HTML::TokeParser;

my $p = HTML::TokeParser->new("movie.html") || die "Can't open: $!";
while (my $token = $p->get_token) {
    if ($p->get_tag("p")) {
        my $title = $p->get_trimmed_text;
        print "$title\n";
    }
    elsif ($p->get_tag("br")) {
        my $desc = $p->get_trimmed_text;
        print "$desc\n";
    }
}
```

I was able to extract the title and description; how do I get the rest of the information?

```
# perl
124- Chocolat
A young woman returns to Cameroon to trace her past. Soon the sights, sounds, and smells sweep her back to her childhood and memories of the people who populated her youth. (French with English and Spanish subtitles.)
```
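For what it's worth, mixing get_token and get_tag in one loop is problematic, since both consume tokens and records get skipped. A common alternative is to drive the loop with get_tag alone; a sketch, with the field-splitting left open since the real markup isn't shown:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

my $p = HTML::TokeParser->new('movie.html') or die "Can't open: $!";
while ( $p->get_tag('p') ) {
    # everything up to the close of the paragraph, as one string
    my $record = $p->get_trimmed_text('/p');

    # the remaining fields (year, call number, duration, ...) would
    # then be split out of $record with regexes matching the real text
    print "$record\n";
}
```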
parsing and storing with Mechanize & DBI   (126 Views)
jobst mller wrote:
> hello list, hello Rob
>
> many thanks for the reply.
>
> to avoid confusion - I try a first reply to your address - not to the list. I am aware that I have to explain the issue, the problem and the needs more clearly - I try to do so.
>
> Rob, please give me feedback on that - if you need more input then please let me know. I try to do all I can do!
>
> so I start here to describe the problems:
>
> I need to collect some of the data out of a site - here is an example. This is very similar to the site I am interested in...!
>
> Why do I need to harvest and collect some data, you may ask: I am a researcher and I want to do some socio-ethnographic research (see the research field - described at and ). Therefore I need the data: I want to harvest the data.
>
> Harvest is an integrated set of tools to gather, extract, organize, search, cache, and replicate relevant information. I need to gather information out of a phpBB2. The question is: can we tailor httrack to harvest and to digest information in some different formats? I need to fetch data out of an online forum (a phpBB board) and to store it locally in a mysql-db. Is this possible with perl?
>
> first snippets were available to solve it
>
> You already reviewed it - at a first glance... Now the problem is: I have to get the site above - in an almost full and complete data set.
>
> According to my view, the problem is twofold: it has two major issues:
>
> 1. Grabbing the data out of the site and then parsing it; finally
> 2. storing the data in the new (local) database...
>
> Well, the question of restoring is not too hard, if I can pull almost a full thread-data-set out of the site. The tables are shown here in this site. Well, if we are able to do the first job very well:
>
> 1. Grabbing the data out of the site and then parsing it; then the second job would be not too hard.
>
> Then I have as a result a large file of CSV data, don't I? The final question was: how can the job of restoring be done? Then I am able to have a full set of data. Well, I guess that it can be done with some help from the -Team.
>
> The question is: how should I get the data with the robot USER-AGENT - does the agent give me back most of the data, so that I can use it for an investigation? BTW - the investigation needs to be done with some retrieval operations. Therefore I need to store the gathered data in a mysql-db.
>
> Well, that's it. I need to build up an almost 100 per cent COPY of the original site - I need to store it locally, here on my machine. I need to collect some of the data out of the site which I am interested in.
>
> If the data is gained with a script - I have to set up some Perl DBI and try to store the data in a phpBB-DB.
>
> Rob, what do you think about it? Are we able to do so!
>
> Rob, perhaps with a good converter, or at least a part of a converter, I can restore the whole csv-dump with ease. What do you think? So if we do the first job, then I think the second part can be done also.
>
> Rob, I look forward to hearing from you. best regards
>
> martin aka jobst

Hi Martin

I'm sorry, but your question is still unapproachable. We are happy to help you, but please summarize your problem in a brief message that doesn't include external links.

Perhaps it would help to take your design further and approach step 1 first. Have you been able to download the data you need from the site? Have you made sure that the owners of the site are happy with what you are doing? It is important that you get permission to copy other people's data, and in particular there are site rules for robot access that you must adhere to.

If you show us a code fragment that simply downloads the site you are interested in, and also say what problems you have, then we can help.
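A hedged sketch of the two steps under discussion (fetch a page with WWW::Mechanize, store it with DBI); the URL, credentials, and `posts` table are placeholders, not anything from the thread:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use DBI;

# Fetch one forum page (placeholder URL).
my $mech = WWW::Mechanize->new( agent => 'research-bot/0.1' );
$mech->get('http://example.com/viewtopic.php?t=1');

# Connect to a hypothetical local MySQL database.
my $dbh = DBI->connect( 'DBI:mysql:database=forum', 'user', 'password',
                        { RaiseError => 1 } );
my $sth = $dbh->prepare('INSERT INTO posts (title, body) VALUES (?, ?)');

# Store the page title and raw HTML; real code would first parse out
# the individual posts (e.g. with HTML::TokeParser).
$sth->execute( $mech->title, $mech->content );
$dbh->disconnect;
```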
Need help parsing one text file based on data from a second text file   (88 Views)
A few comments to start:

- you should post your code (and data) between [code]your code here[/code] tags; this will make it more readable for us
- I assume, consistently with your code, that each record in the sequences file is a single line with fields separated by |
- it could be important to know if your identifying numbers (I suppose they are fixed length) might start with leading zeroes (I'll assume they don't): in that case perl could drop the leading 0's depending on how they are used in the code
- the simplest solution to your problem (though not the only one) is to use a hash where the keys are your numbers

This is a simplified version of what you need (untested):

```perl
open(LISTFILE, $listdata) or die $!;
my (%hashlist, $num);
while (<LISTFILE>) {
    chomp;
    $hashlist{$_} = 1;
}
open(FAAFILE, $faaname) or die $!;
open(FAAFILEOUT, ">>test6.txt");
while (<FAAFILE>) {
    (undef, $num) = split /\|/;
    if (exists $hashlist{$num}) {
        print FAAFILEOUT;
    }
}
```

It is as simple as that! In the code above I used some tricks of perl that might be difficult for you to understand (particularly the use of the special variable $_), however they make coding so much easier and faster! More particularly:

- while (<LISTFILE>) reads into $_
- chomp operates on $_
- split /\|/ operates on $_ (note that | must be escaped in the pattern, or split treats it as alternation)
- print FAAFILEOUT writes out $_ (which already includes the line terminator)

Franco
parsing CSV files with control and extended ASCII characters   (102 Views)
I have some CSV input files that contain control and extended ASCII characters, including:

- vertical tabs (0x0B)
- acute and grave accents
- tildes
- circumflexes
- umlauts
- nonbreaking spaces (0xA0)

The Text::CSV or Tie::Handle::CSV modules don't like these characters; the snippets below both return errors when they get to one. Is there some other method for stuffing comma-separated ASCII (*any* ASCII) into a hash or list?
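A sketch of the usual fix with Text::CSV: pass `binary => 1` so that bytes outside the printable ASCII range (vertical tabs, accented characters, 0xA0, ...) are accepted. The file name input.csv is hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# binary => 1 is the key setting for embedded control characters.
my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag;

open my $fh, '<', 'input.csv' or die "input.csv: $!";
while ( my $row = $csv->getline($fh) ) {
    print join( ' | ', @$row ), "\n";
}
$csv->eof or $csv->error_diag;   # report a parse error, if any
close $fh;
```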
Link parsing (was: Getting error...)   (92 Views)
hotkitty wrote:
> I ultimately want to go to politics, follow all links under
> the "Election Coverage" headline and, w/in those links, save all the
> links under the "Don't Miss" sections that appear in those stories.
> However, after many hours and trial & error I've yet to complete the
> task. I know mechanize can do this somehow but I've yet to figure out
> how to put it all together.

It's not so much about putting it together; it's more like writing Perl code step by step...

> Here's the script I have so far, which gets me to only step one:

Actually, I'm not sure that the code you have even gets you to step one. As a parsing exercise, I wrote the code below. I chose to make use of LWP::Simple and HTML::TokeParser. Please study the docs for the latter:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser;

my $domain = '';
my $uri = $domain . '/POLITICS/';
my $html = get($uri) or die "Fetching $uri failed";
my $p = HTML::TokeParser->new(\$html);

# go to start position in the document
while ( $p->get_tag('div') ) {
    last if $p->get_text eq 'Election coverage';
}

# extract links
my @links;
while ( my $token = $p->get_token ) {
    if ( $token->[0] eq 'S' and $token->[1] eq 'a' ) {
        push @links, $token->[2]{href};
    }
    last if $token->[0] eq 'E' and $token->[1] eq 'ul';
}

foreach my $uri ( map $domain . $_, @links ) {
    my $html = get($uri) or warn "Fetching $uri failed" and next;
    my $p = HTML::TokeParser->new(\$html);

    # go to start position in the document
    $p->get_tag('h4');
    unless ( $p->get_text eq "Don't Miss" ) {
        warn "Didn't find section \"Don't Miss\"";
        next;
    }
    print "$uri\n";

    # extract links
    while ( my $token = $p->get_token ) {
        if ( $token->[0] eq 'S' and $token->[1] eq 'a' ) {
            print ' ', $p->get_text, "\n";
            my $uri = substr($token->[2]{href}, 0, 4) eq 'http'
                ? $token->[2]{href}
                : $domain . $token->[2]{href};
            print "  $uri\n\n";
        }
        last if $token->[0] eq 'E' and $token->[1] eq 'ul';
    }
}
```

--
Gunnar Hjalmarsson
Time::Piece capturing parsing problems   (129 Views)
Myf White wrote:
> I think my question relates to STDOUT rather than Time::Piece but I'm not
> sure.
>
> I am trying to use Time::Piece to process and convert a string which may be
> a bit dodgy. What I can't understand is how to capture the problem. The
> following code only captures the problem with the second test in the
> $EVAL_ERROR ($@). The problem with the first one ("garbage at end of string
> in strptime: ...") just goes to the screen - but I need to be able to handle
> it.

I for one think it's a result of how Time::Piece works... One option might be to use 'good old' Date::Parse instead:

```perl
use Date::Parse;

my @tests = ('28 FEB 2008', 'garbage 28 FEB 2008', '28 FEB 2008 garbage');

foreach my $i ( 0..$#tests ) {
    print "TEST $i\n";
    if ( my ($d, $m, $y) = ( strptime $tests[$i] )[3..5] ) {
        printf "%02d/%02d/%d\n", $d, $m+1, $y+1900;
    }
    else {
        warn "Parsing of '$tests[$i]' failed";
    }
}
```

Running it:

```
TEST 0
28/02/2008
TEST 1
Parsing of 'garbage 28 FEB 2008' failed at line 8.
TEST 2
Parsing of '28 FEB 2008 garbage' failed at line 8.
```

--
Gunnar Hjalmarsson
HTML parsing   (70 Views)
June Lee wrote:
> any good way to extract the data
> I want to parse the following HTML page

As has been mentioned many, many times in this NG: if you want to parse HTML then use an HTML ...
Need help parsing file for output   (91 Views)
Running Perl 5.x on Solaris 9 SPARC. No modules other than what comes with the basic perl installation are available. I have a file, file1.csv, as such:

```
Day of the year, value
1,4144.34
2,4144.38
3,4144.38
4,4144.38
5,4144.44
6,4144.48
7,4144.48
8,4144.50
```

...
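The question is cut off, but given the stated "core modules only" constraint, a core-Perl starting point for reading the day/value pairs into a hash might look like this:

```perl
#!/usr/bin/perl
# Core Perl only: read file1.csv into a hash keyed by day of the year.
use strict;
use warnings;

open my $fh, '<', 'file1.csv' or die "file1.csv: $!";
my $header = <$fh>;             # skip "Day of the year, value"
my %value_for;
while (<$fh>) {
    chomp;
    my ($day, $value) = split /,/;
    $value_for{$day} = $value;
}
close $fh;

print "Day 5: $value_for{5}\n" if exists $value_for{5};
```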
looking at parsing procedures   (125 Views)
I have posted in the memorable past about using the Perl Programming Language to make an unBuckeyesque yikes. Instead of something that has to do with Ohio, I'm looking to develop my first if statement. Hot here in the desert. ...
Name for a webalizer.hist-parsing module?   (96 Views)
(previously posted to , but didn't get a reaction, so I'm also trying here before I do stupid things. =) I was really surprised when I couldn't find a module for parsing the famous webalizer.hist file, which contains a ...
Logging into and parsing a website using Perl   (52 Views)
I'm trying to create a perl script that will log into a website (the login form uses POST), navigate to several pages, and append the (html) content parsed from those pages to a separate log file. I'm not very ...
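A hedged sketch of the usual WWW::Mechanize approach; the URLs, log file name, and the form field names `username`/`password` are assumptions to be replaced with the real site's values:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

# Log in via the site's POST form (field names are hypothetical).
$mech->get('http://example.com/login');
$mech->submit_form(
    form_number => 1,
    fields      => { username => 'me', password => 'secret' },
);

# Visit several pages, appending their content to one log file.
open my $log, '>>', 'pages.log' or die "pages.log: $!";
for my $page ('http://example.com/page1', 'http://example.com/page2') {
    $mech->get($page);
    print {$log} $mech->content, "\n";
}
close $log;
```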
Perl expression for parsing CSV (ignoring parsing commas when in double quotes)   (93 Views)
I can't figure out the expression needed to parse a string. This problem arises from parsing Excel CSV files ... The expression must parse a string based upon comma delimiters, but if a comma appears in double quotes it should not ...
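Rather than hand-rolling the regex, one option is the core Text::ParseWords module, which already understands double-quoted fields; a small sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::ParseWords;

my $line = 'one,"two, with comma",three';

# parse_line(delimiter, keep, line); keep = 0 strips the quotes.
my @fields = parse_line(',', 0, $line);
print join('|', @fields), "\n";   # one|two, with comma|three
```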
XML::Atom::Feed - parsing at all?   (67 Views)
I'm trying to parse up a feed but XML::Atom::Feed doesn't seem to recognize that anything at all is in the XML string I'm feeding to it. Here's a complete example of how it is not working:

```perl
#!/usr/bin/perl -w
use strict;
...
```
Juniper config file parsing approach   (108 Views)
I want to gather some information from a config file, but I am struggling with the way to get the information I need. The config file looks similar to this:

```
groups {
    re0 {
        system {
            host-name my-re0;
        }
        interfaces {
            fxp0 {
                description "10/100 Management interface";
                unit 0 {
                    family inet {
                        address;
                    }
                }
            }
        }
    }
    re1 {
        system {
            host-name my-re1;
        }
        interfaces {
            fxp0 {
                description "10/100 Management interface";
                unit 0 {
                    family inet {
                        address;
                    }
                }
            }
        }
    }
}
```

I am not directly asking for a solution but more for an approach on handling files like this. For instance, how would I go about grabbing the re0 information (highlighted in the original post)?

Much neater solution! Annihilannic.
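One possible approach, sketched here under the assumption that the file has at most one `{` or `}` per structural line (as in the sample): track brace depth while scanning, and print everything inside the block whose name you ask for:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $target = 're0';
my $depth  = 0;    # current brace-nesting depth
my $in     = 0;    # depth at which we entered the target block (0 = outside)

while (my $line = <DATA>) {
    if (!$in and $line =~ /^\s*\Q$target\E\s*\{/) {
        $in = $depth + 1;          # remember where the block starts
    }
    print $line if $in;
    $depth++ if $line =~ /\{\s*$/;
    if ($line =~ /^\s*\}/) {
        $depth--;
        $in = 0 if $in and $depth < $in;   # left the target block
    }
}

__DATA__
groups {
    re0 {
        system {
            host-name my-re0;
        }
    }
    re1 {
        system {
            host-name my-re1;
        }
    }
}
```

Run against the embedded sample, this prints the `re0 { ... }` block only; for anything fancier (multiple braces per line, quoted strings containing braces) a real parser would be safer.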
Processing and not just parsing a file -- Please help !   (89 Views)
all, I have a file similar to this (about 1G):

```
startpoint: AAA
endpoint: BBB
# body for every start/end combo
a/Xxxxxx yyy num1 num2 num3 & num4
...
a/ttttttt fff num4 num5 num6 * num722
b/ttttttt fff num44 num50 non_zero * num734
...
b/yyyyyy fff num43 num52 num65 num745
startpoint: CCC
endpoint: DDD
zzzzzz yyy numa numb numc & num4
...
a4/nnnnn yyy num4 num5 num6 * num44
bb/nnnnn yyy num4 num5 num6 * num9
bb/kkkkk yyy num4 num5 non_zero * num8
...
```

For every startpoint/endpoint combo, I need to go into the body and find the lines with "*" in their 6th column. (As I have pointed out, not all lines will have a "*" there.) There will be multiple lines inside the body with "*" in column 6, but I need to pick the last one of them. This last line holds a non-zero number in column 5; the rest of the lines with "*" in column 6 will have zeros in their respective column 5s. Finally, print the startpoint/endpoint combo along with its non-zero column 5 value and column 7.

Here is another complication for me: I also need to print column 1 of the line immediately before the last "*" line for every start/end combo. For instance, if you look at the 2nd start/end combo in the above example, I want to print "bb/nnnnn".

Output file:

```
Startpoint endpoint column5 column7 column1(of previous line)
```

I'm a complete newbie to perl; any help will be appreciated!
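A sketch of one way to do this in a single pass, with simplified sample data embedded for illustration: remember the last line per block whose 6th column is "*", plus the line before it, and flush a report whenever a new block starts (and once at EOF):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($start, $end, $prev, $starred, $before_starred);

# Print one result line for the block just finished.
sub report {
    return unless defined $start and defined $starred;
    my @f = split ' ', $starred;
    my @p = split ' ', ( $before_starred // '' );
    print join( "\t", $start, $end // '', $f[4], $f[6], $p[0] // '' ), "\n";
}

while (my $line = <DATA>) {
    chomp $line;
    if ( $line =~ /^startpoint:\s*(\S+)/ ) {
        my $new_start = $1;
        report();                          # flush the previous block
        ($start, $end, $starred, $before_starred, $prev)
            = ($new_start, undef, undef, undef, undef);
    }
    elsif ( $line =~ /^endpoint:\s*(\S+)/ ) {
        $end = $1;
    }
    else {
        my @f = split ' ', $line;
        if ( defined $f[5] and $f[5] eq '*' ) {
            $starred        = $line;       # keep only the last "*" line
            $before_starred = $prev;       # and the line before it
        }
        $prev = $line;
    }
}
report();                                  # flush the final block

__DATA__
startpoint: AAA
endpoint: BBB
a/Xxxxxx yyy 1 2 3 & 4
a/ttttttt fff 4 0 6 * 722
b/ttttttt fff 44 50 7 * 734
b/yyyyyy fff 43 52 65 745
startpoint: CCC
endpoint: DDD
a4/nnnnn yyy 4 0 6 * 44
bb/nnnnn yyy 4 5 6 * 9
```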
Text parsing   (51 Views)
I just started Perl a couple of days back and am trying to parse the output of dir to obtain just the filenames, e.g.:

```
12/17/2003  11:09 AM    <DIR>          .
12/17/2003  11:09 AM    <DIR>          ..
11/24/2003  10:32 AM             1,340 cracker.cmd
```

...
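One simple way, assuming the default Windows `dir` layout shown above: keep only lines that start with a date and take the last whitespace-separated field (note this breaks on file names containing spaces):

```perl
#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <DATA>) {
    next unless $line =~ m{^\d{2}/\d{2}/\d{4}};   # skip headers/footers
    my @fields = split ' ', $line;
    next if $fields[-1] eq '.' or $fields[-1] eq '..';
    print "$fields[-1]\n";                        # prints cracker.cmd
}

__DATA__
12/17/2003  11:09 AM    <DIR>          .
12/17/2003  11:09 AM    <DIR>          ..
11/24/2003  10:32 AM             1,340 cracker.cmd
```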
regex parsing-Beginner   (50 Views)
Hi, I am quite new to perl so please bear with me. I am trying to do some bioinformatics analysis. A part of my file looks like this:

```
gene            410..1750
                /gene="dnaA"
                /db_xref="EMBL:2632267"
CDS             410..1750
                /gene="dnaA"
                /function="initiation of chromosome replication (DNA synthesis)"
```

...
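As a starting point, a small regex sketch that pulls the quoted values of `/gene="..."` qualifiers out of a GenBank/EMBL-style feature table like the one above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <DATA>) {
    # capture whatever sits between the quotes of a /gene qualifier
    if ( $line =~ m{/gene="([^"]+)"} ) {
        print "$1\n";                     # prints dnaA
    }
}

__DATA__
gene            410..1750
                /gene="dnaA"
                /db_xref="EMBL:2632267"
```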
can't find xml-parsing module...   (94 Views)
I need to parse XML, but am not allowed to install modules on the server. I have no idea which of the modules installed on the server will do the job. This is the list of installed modules: ...
parsing oddity   (51 Views)
With OO-style coding, I use a form with GET method; it parses a textfield with name attribute "participant". I validate that parameter with Perl's substitute and match operators in order to make a file in "/tmp/SOMEDIR/" with names ...