SED - Regular Expression over multiple lines

Posted by herrherr on Stack Overflow
Published on 2010-12-22T15:41:09Z

Hi there,

I've been stuck on this for several hours now and have cycled through a wealth of different tools to get the job done, without success. It would be fantastic if someone could help me out with this.

Here is the problem:

I have a very large CSV file (400 MB+) that is not formatted correctly. Right now it looks something like this:

Alan Smithee ist ein Anagramm von „The
[...]
„Alan Smythee“, und „Adam Smithee“."   
,Alan Smithee  
Die Aussagenlogik ist
der Bereich der Logik, der sich mit
[...]
ihrer Teilaussagen bestimmen.   
,Aussagenlogik

As you can probably see, the fields ",Alan Smithee" and ",Aussagenlogik" should actually be on the same line as the preceding sentence. The file would then look something like this:

Alan Smithee ist ein Anagramm von „The Smitheeeee
[...]
„Alan Smythee“, und „Adam Smithee“.,Alan Smithee  
Die Aussagenlogik ist
der Bereich der Logik, der sich mit
[...]
ihrer Teilaussagen bestimmen.,Aussagenlogik

Please note that the end of the sentence may or may not contain a closing quote. If it does, the quote should be removed as well.

Here is what I came up with so far:

sed -n '1h;1!H;${;g;s/\."?.*,//g;p;}' out.csv > out1.csv

This is supposed to match the expression across multiple lines. Unfortunately, it doesn't :)

The expression looks for the dot at the end of the sentence, followed by the optional quote and the newline character, which I'm trying to match with .*.
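
For what it's worth, three things seem to work against that expression. Without -E, sed reads ? as a literal question mark rather than "optional". The .* is greedy and, once the whole file sits in the pattern space, it also matches newlines in GNU sed, so it runs to the last comma in the file. And the empty replacement deletes the dot and the comma together with the line break. Below is a minimal sketch of a narrower substitution, assuming a sed that supports -E (GNU and BSD sed both do) and that only spaces or tabs sit between the end of the sentence and the newline. The function name fix_with_sed is made up for illustration.

```shell
# fix_with_sed: hypothetical helper name, not part of the original post.
# Reads the files given as arguments (or stdin) and writes the result to stdout.
fix_with_sed() {
  # :a / N / $!ba  -> append every following line to the pattern space (slurp the input)
  # s/.../.,/g     -> replace: dot, optional quote, blanks, newline, comma
  #                   with:    dot, comma (the line break and the quote are dropped)
  sed -E -e ':a' -e 'N' -e '$!ba' \
      -e 's/\."?[[:blank:]]*\n,/.,/g' "$@"
}
```

It would be called as fix_with_sed out.csv > out1.csv. Note that this still holds the entire input in memory, which may or may not be acceptable for a 400 MB+ file.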

Any help is much appreciated. It doesn't really matter which tool gets the job done (awk, perl, sed, tr, etc.).
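
Since the tool is open, one streaming possibility is awk, which only needs to remember the previous line instead of loading the whole file. This is a sketch under two assumptions: a misplaced field always starts with a comma in the first column, and anything to be removed at the end of the preceding line is at most one double quote followed by spaces, tabs, or a carriage return. The function name join_comma_lines is hypothetical.

```shell
# join_comma_lines: hypothetical helper name, not part of the original post.
# Appends every line that starts with a comma to the line before it.
join_comma_lines() {
  awk '
    NR > 1 && /^,/ {
      # This line belongs to the previous one: drop an optional closing
      # quote and trailing blanks there, then glue the two together.
      sub(/"?[ \t\r]*$/, "", prev)
      prev = prev $0
      next
    }
    NR > 1 { print prev }             # previous line is complete, emit it
           { prev = $0 }              # remember the current line
    END    { if (NR > 0) print prev } # flush the last remembered line
  ' "$@"
}
```

It would be called as join_comma_lines out.csv > out1.csv. Unlike the sed attempt, this sketch does not require a dot before the quote, so a line ending in a bare quote would lose that quote too.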

Thanks, Chris

© Stack Overflow or respective owner
