Tag Archives: perl

Text Manipulation: Multiple Line Regex Deletions Using Perl

Removing portions of files that match a multiple line regular expression can be tricky, unless you’re using perl. Let’s take an example file:

header 1
========
some data
header 2
========
header 3
========
some data
header 4
========
some data
some more data
even more data
header 5
========
header 6
========
some data

We would like to remove all headers that are not followed by any data. It is not much of an example, but it will demonstrate the technique nevertheless! The easiest way to do this with perl is to read the entire file into a single scalar variable and then run one substitution over it, replacing every match of our multiple line regular expression with nothing. Observe:

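#!/usr/bin/perl
use strict;
use warnings;

# Slurp the whole file into one scalar so the regex can span lines.
open( my $fh, '<', "/path/to/the/file" ) or die "Couldn't open file: $!\n";
my $data = do { local $/; <$fh> };
close( $fh );

# Delete each header (title line plus ===== underline) that is followed
# directly by another header or by the end of the file.
$data =~ s/^[^\n]*\n=+\n(?=[^\n]*\n=+\n|\z)//mg;
print $data;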

Obviously you’ll need to redirect the output of this script, or just write $data out to a new file within the perl script itself.

Running the script on the example data gives the expected output:

$ ./parse_it.pl
header 1
========
some data
header 3
========
some data
header 4
========
some data
some more data
even more data
header 6
========
some data

Using this method, you can easily modify the regular expression in the perl script to suit your needs.

Text Manipulation: How to Delete the First Line of Text in a Large File

Editing a very large file can be a resource- (and time-) consuming nightmare. If you need to delete the first line of such a file in-place without opening it in SomeEditor(TM), there are several ways to do it, each with a different resource overhead.

Let me introduce you to the three methods we’ll be trying. The first uses GNU sed and its -i (in-place) option to edit the file in-place.

sed -i '1d' large_file

The second method uses perl to get the job done.

perl -pi -e '$_ = "" if ( $. == 1 );' large_file

The final example uses printf (so use a shell that supports it) and ex (command-line vi).

printf ":1d\n:wq\n" | ex large_file

Let’s use these methods and time them…

$ wc -l large_file
15711759 large_file
$ time sed -i '1d' large_file
real    3m1.870s
user    2m2.406s
sys     0m23.906s
$ time perl -pi -e '$_ = "" if ($. == 1);' large_file
real    1m46.065s
user    0m0.015s
sys     0m0.016s
$ time printf ":1d\n:wq\n" | ex large_file
real    6m53.646s
user    1m52.295s
sys     0m19.374s

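So the moral of this tip? On this 15.7-million-line file perl finished in under two minutes of real time, against about three minutes for sed and nearly seven for ex, so reach for perl when performing edits on extremely large files!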

Text Manipulation: How to Globally Delete Lines Matching a Certain Pattern

In order to delete complete lines that match a certain pattern, you can use various tools. I find that the easiest tools to use in this situation are perl or our humble elderly friend ed.

For example, say we want to delete all lines that begin with the string “delete_me”. Either of the following two commands will delete the required lines.

Using perl:

perl -pi -e "s/^delete_me.*\n//" my_file

Using ed:

printf 'g/^delete_me/d\nw\nq\n' | ed -s my_file

Of course, there are many more ways of achieving our goal but these are my two personal favourites. It goes without saying that you can modify the search pattern to meet your exact requirements.

Toki Winter

Advanced UNIX for the experienced system administrator
