Specific Paragraph Matching with Inner Matches  
MuscleHead





PostPosted: Regular Expressions, Specific Paragraph Matching with Inner Matches Top

Let's say that I have a bunch of input strings as such:

-----------------------

blah blah blah

Attendees:
Mr. Jones Alfred
Mrs. Smith John
Mr. Black Peter
Mr. White Arnold

blah blah blah

------------------

And I want to extract the last names of the attendees (i.e., from the above input, I want four matches: Jones, Smith, Black, White).

If a string does not have the "Attendees:" keyword, then I don't care about that message at all. Likewise, I don't care about people's names if they don't follow the "Attendees:" keyword. For example, in the two following input strings, "Chump" should not be returned:

---------------------

blah blah blah

Losers:
Mr. Chump Chimp
Mr. Chump Chimp

blah blah blah

----------------------

blah blah blah

Mr. Chump Chimp

Attendees:
Mr. Jones Alfred
Mrs. Smith John
Mr. Black Peter
Mr. White Arnold

blah blah blah

-----------------------

I thought the following might work:

Attendees:(\s|\S)*(Mr\.|Mrs\.)\s(?<name>\w+)+

but that only matches the last name in the list (White).

This matches only the first name in the list (Jones):

Attendees:(\s|\S)*?(Mr\.|Mrs\.)\s(?<name>\w+)+

So then I thought I should be using an assertion for Attendees, but that didn't make much of a difference. The following returned only the last name (White):

(?=Attendees:)((\s|\S)*?(Mr\.|Mrs\.)\s(?<name>\w+))+

I've been using regular expressions for a while (albeit infrequently, so I always feel I forgot stuff I used to know :) ), but this time it feels like I'm missing something conceptual. It is of course trivial to get multiple matches from a string in general (for example, if "Attendee:" preceded each attendee's name), but the difference in this case is the requirement that 'Attendees:' appear once before any of the matches.

Any help would be appreciated.

(Moderator: Thread moved to this forum for better responses)
(Moderator: Thread Title changed from "Regex Help" to "Specific Paragraph Matching with Inner Matches" for quicker thread understanding during a search)


 
 
OmegaMan






This will get you closer. Regex has look-ahead and look-behind constructs. They act as bookends, or anchor points, around the text you actually want.

So we specify the look-behind to be "Attendees:" and the look-ahead to be two consecutive line breaks (a blank line). The look-behind and look-ahead are not included in the match, so what you end up with is the block of attendee lines. You will then need to sort out the names you need from the match.

Note: the options are important, so keep them. The regex below was built with a tool I use; I am not associated with it, and its author makes it free to use.


// using System.Text.RegularExpressions;

/// <summary>
/// Regular expression built for C# on: Sat, Jan 13, 2007, 02:36:31 PM
/// Using Expresso Version: 3.0.2559, http://www.ultrapico.com
///
/// Find the group of attendees names for later processing.
///
/// A description of the regular expression:
///
/// Match a prefix but exclude it from the capture. [Attendees:\r\n]
/// Attendees:\r\n
/// Attendees:
/// Carriage return
/// New line
/// [1]: A numbered capture group. [.*]
/// Any character, any number of repetitions
/// Match a suffix but exclude it from the capture. [(?:\r\n){2,}]
/// Match expression but don't capture it. [\r\n], at least 2 repetitions
/// \r\n
/// Carriage return
/// New line
///
///
/// </summary>
public Regex Attendees = new Regex(
@"(?<=Attendees:\r\n)(.*)(?=(?:\r\n){2,})",
RegexOptions.Singleline
| RegexOptions.ExplicitCapture
| RegexOptions.CultureInvariant
| RegexOptions.Compiled
);




 
 
MuscleHead






Thanks Omegaman.

With the singleline option turned on, the regex in your post captures everything after Attendees: as one big match, as opposed to multiple matches (unless I screwed it up)...

I did try to tweak it: (?<=Attendees:\r\n)((?<=(?:Mr.|Mrs.))\s(?<lastname>\w+)\s(?<firstname>\w+)(?=\r\n))

I was hoping that forcing each match to start with Mr. or Mrs., and preventing each match from having carriage returns in the middle, would solve the problem, but the above didn't match at all. If I just take out the part that enforces Attendees:

((?<=(?:Mr.|Mrs.))\s(?<lastname>\w+)\s(?<firstname>\w+)(?=\r\n))

then it does match, but it would include a match before Attendees:, which is no good... I just can't seem to get the first anchor point to work.

Thanks!
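
A likely reason the tweaked pattern never matches: both look-behinds are tested at the same position, so the text right before that position would have to end with "Attendees:\r\n" and with "Mr." or "Mrs." at once, which cannot happen. The sketch below is only an illustration under assumptions: the class name, the sample text, and the escaped dots after Mr and Mrs are not from the posts above. It counts the matches of both variants.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StackedLookbehindDemo
{
    // Hypothetical sample: one titled name sits before the "Attendees:" block.
    public const string Sample =
        "blah\r\n\r\nMr. Chump Chimp\r\n\r\nAttendees:\r\n" +
        "Mr. Jones Alfred\r\nMrs. Smith John\r\n\r\nblah";

    // Both look-behinds apply at the same position, so they contradict each other.
    public const string Stacked =
        @"(?<=Attendees:\r\n)(?<=Mr\.|Mrs\.)\s(?<lastname>\w+)\s(?<firstname>\w+)(?=\r\n)";

    // Without the "Attendees:" look-behind, every titled name matches,
    // including the one that precedes the block.
    public const string TitleOnly =
        @"(?<=Mr\.|Mrs\.)\s(?<lastname>\w+)\s(?<firstname>\w+)(?=\r\n)";

    public static int Count(string pattern)
    {
        return Regex.Matches(Sample, pattern).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count(Stacked));   // prints 0
        Console.WriteLine(Count(TitleOnly)); // prints 3
    }
}
```

Under these assumptions, the "Attendees:" condition has to cover everything between the keyword and each name, not just the characters directly before the title.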



 
 
OmegaMan






What you have found is exactly right. The regex provided only acquires the block of names; it does not return the individual names. I was remiss in not mentioning that... my bad. <g> The thought was that once that was done, another regex would extract the names.

As you have found, trying to have one regex do all the work at once clouds the issue by matching outside the attendees block, which is why I am now suggesting a two-part process.

Here is an example. Note that I changed the original regex to place the capture into a group named AttendeesRaw, using the named-group syntax "(?<GroupNameHere> ... )":


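// using System;
// using System.Collections.Generic;
// using System.Text.RegularExpressions;

string data = @"
Attendees:
Mr. Jones Alfred
Mrs. Smith John
Mr. Black Peter
Mr. White Arnold

blah blah blah";

// The pattern text was lost from this post. This is a reconstruction: the earlier
// regex with its capture moved into the named group AttendeesRaw.
// It assumes the input uses \r\n line endings.
string attendeesPattern = @"(?<=Attendees:\r\n)(?<AttendeesRaw>.*)(?=(?:\r\n){2,})";

Match m = Regex.Match(data, attendeesPattern, RegexOptions.Singleline
    | RegexOptions.ExplicitCapture
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled);

// Regex.Match never returns null, so test Success instead.
if (m.Success)
{
    List<string> names = new List<string>();

    // Also a reconstruction (the original text was lost). The group "Last" takes
    // the final word on each line, which is what the console output below shows.
    string pattern = @"(Mr|Mrs)\.\s+\w+\s+(?<Last>\w+)";
    Regex LastNames = new Regex(pattern,
        RegexOptions.Compiled |
        RegexOptions.CultureInvariant |
        RegexOptions.ExplicitCapture);

    // Second pass: run the name regex over the captured block only.
    MatchCollection mts = LastNames.Matches(m.Groups["AttendeesRaw"].Value);

    foreach (Match m2 in mts)
        names.Add(m2.Groups["Last"].Value);

    foreach (string name in names)
        Console.WriteLine(name);
}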



Note the above code will fail if the attendees section doesn't have \n\n or falls at the end of the document.

Console Output:
Alfred
John
Peter
Arnold
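
The two-part process, and the caveat about a block that ends the document, can be sketched as follows. This is a minimal sketch under assumptions, not the code from the post: the class and method names are hypothetical, line endings may be \r\n or \n, the block is taken to end at the first blank line or at the end of the input, and "last name" means the word right after the title (Jones, Smith, ...), as in the original question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AttendeeBlock
{
    // Step 1: everything after "Attendees:" up to a blank line or the end of input.
    private static readonly Regex Block = new Regex(
        @"(?<=Attendees:\r?\n)(?<AttendeesRaw>.*?)(?=(?:\r?\n){2,}|\s*\z)",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant);

    // Step 2: a title followed by the surname (the first word after the title).
    private static readonly Regex Surname = new Regex(
        @"\b(Mr|Mrs)\.\s+(?<Last>\w+)",
        RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant);

    public static List<string> LastNames(string text)
    {
        var names = new List<string>();
        Match block = Block.Match(text);
        if (!block.Success)
            return names; // no "Attendees:" block: nothing to report

        // Only the captured block is searched, so names elsewhere are ignored.
        foreach (Match m in Surname.Matches(block.Groups["AttendeesRaw"].Value))
            names.Add(m.Groups["Last"].Value);
        return names;
    }

    public static void Main()
    {
        // The block runs to the end of the text, with no trailing blank line.
        string text = "Mr. Chump Chimp\n\nAttendees:\nMr. Jones Alfred\nMrs. Smith John";
        Console.WriteLine(string.Join(", ", LastNames(text))); // prints Jones, Smith
    }
}
```

Searching only the captured block in the second step is what keeps names outside the "Attendees:" section from being reported.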


 
 
MuscleHead






Thanks. Doing it in two parts could work.

Is it really true though that a single regex couldn't do it?



 
 
OmegaMan






MuscleHead wrote: "Thanks. Doing it in two parts could work. Is it really true though that a single regex couldn't do it?"



I couldn't find a way...but that doesn't mean a holy grail regex is not out there. <g>


 
 
MuscleHead






Thanks Omega Man.

Does anyone know if that 'Holy Grail' Regex exists? :)



 
 
Sergei Z






u can try this

(?<!Attendees)\s*?(?:mrs?.*?\w+\s+(?<LAST_NAME>\w+)(?:\s+|\Z))+

SingleLine On, IgnoreCase ON

vs input

Attendees:
Mr. Jones Alfred
Mr. John Calder

last names will be captured in one step in group LAST_NAME: [metadata from regexBuddy]

Match offset: 10
Match length: 35
Group "LAST_NAME": Calder
Group "LAST_NAME" offset: 39
Group "LAST_NAME" length: 6


 
 
Sergei Z






only now realized that LAST_Name comes before first name in the input:

Attendees:
Mr. Jones Alfred
Mrs. Smith John
Mr. Black Peter
Mr. White Arnold

the regex to capture all LAST_NAME needs to be changed to

(?<=Attendees:)\s*?(?:mrs?.*?(?<LAST_NAME>\w+)\s+\w+(?:\s+|\Z))+

SingleLine On, IgnoreCase ON

Match offset: 10
Match length: 74
Group "LAST_NAME": White
Group "LAST_NAME" offset: 68
Group "LAST_NAME" length: 5


 
 
OmegaMan






Hi Sergei Z,

Welcome to the MSDN forums and thanks for dropping by.

I was a little confused by the use of the negative look-behind (?<! in your first post. But the second post clarified it by using the positive look-behind (?<=, which made more sense and worked! Thanks.


 
 
MuscleHead






Thanks for the help Sergei. But did it really work for you? I am using RegexBuddy, and have both "dot matches newline" and "case insensitive" checked (but "^$ match at line breaks" unchecked), and all that gets selected is the last name in the list ("White")...

 
 
Sergei Z






no, it works OK. Group LAST_NAME actually captures all 4 last_names. The problem is that regexBuddy deceives u, showing u only the last one. To get all last_names , u'll have to access CaptureCollection of the group LAST_Name. The code below gives u an example how to access members of CaptureCollection of a capturing Group. There will be 4 captures in LAST_NAME group:

Jones

Smith

Black

White

u can use Expresso to see them all: it shows all captures of a group.

<C# code>

//how group/capture collection works:
using System;
using System.Text.RegularExpressions;

public class RegexTest
{
    public static void RunTest()
    {
        int counter;
        Match m;
        CaptureCollection cc;
        GroupCollection gc;
        // declare Regex, Regex pattern (filled in here with the pattern from this thread)
        Regex r = new Regex(
            @"(?<=Attendees:)\s*?(?:mrs?.*?(?<LAST_NAME>\w+)\s+\w+(?:\s+|\Z))+",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);

        // Define the string to analyze, find matches (filled in with the sample list)
        m = r.Match("Attendees:\r\nMr. Jones Alfred\r\nMrs. Smith John\r\n"
            + "Mr. Black Peter\r\nMr. White Arnold\r\n");

        //get a group collection
        gc = m.Groups;
        // Print the number of groups.
        Console.WriteLine("Captured groups = " + gc.Count.ToString());
        // Loop through each group.
        for (int i = 0; i < gc.Count; i++)
        {
            //get a collection of captures
            cc = gc[i].Captures;
            counter = cc.Count;
            // Print number of captures in this group.
            Console.WriteLine("Captures count = " + counter.ToString());
            // Loop through each capture in group. Print all the captures
            for (int ii = 0; ii < counter; ii++)
            {
                // Print capture
                Console.WriteLine(cc[ii]);
            }
        }
    }

    public static void Main()
    {
        RunTest();
        Console.WriteLine("\nPress Enter to Exit");
        Console.ReadLine();
    }
}



 
 
Sergei Z






Omegaman,

thanx for kind words. It's good to be here. Hopefully I can interact with more .NET people here, got tired of PHP/Perl folks in regexadvice. :=)

see u around


 
 
OmegaMan






It is subtle, but yes they are in the capture collection of one match. I confirm what Sergei states.
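
For readers who want one surname per match instead of one match with several captures, a further option may be worth trying. .NET allows variable-length look-behinds, so the "Attendees:" requirement can cover the keyword plus any number of non-blank lines before the current name. The sketch below is an illustration under assumptions, not something posted in this thread: the class and method names are hypothetical, the attendee lines are assumed to follow "Attendees:" directly with no blank line in between, and each name is assumed to start its line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SingleRegexAttendees
{
    // The look-behind requires "Attendees:", a line break, and then zero or more
    // complete non-blank lines ending exactly where the title begins.
    private static readonly Regex Surname = new Regex(
        @"(?<=Attendees:\r?\n(?:[^\r\n]+\r?\n)*)(?:Mr|Mrs)\.\s+(?<Last>\w+)",
        RegexOptions.CultureInvariant);

    public static List<string> LastNames(string text)
    {
        var names = new List<string>();
        // Each attendee is a separate Match here, not a Capture within one Match.
        foreach (Match m in Surname.Matches(text))
            names.Add(m.Groups["Last"].Value);
        return names;
    }

    public static void Main()
    {
        string text =
            "blah\r\n\r\nMr. Chump Chimp\r\n\r\nAttendees:\r\n" +
            "Mr. Jones Alfred\r\nMrs. Smith John\r\nMr. Black Peter\r\n" +
            "Mr. White Arnold\r\n\r\nMr. Chump Chimp\r\n";
        Console.WriteLine(string.Join(", ", LastNames(text))); // prints Jones, Smith, Black, White
    }
}
```

The look-behind is re-evaluated for every candidate title, so on long inputs the CaptureCollection approach or the two-part process may be cheaper. Regex engines that restrict look-behind to a fixed length will reject this pattern.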