Let's say that I have a bunch of input strings as such:
-----------------------
blah blah blah
Attendees: Mr. Jones Alfred Mrs. Smith John Mr. Black Peter Mr. White Arnold
blah blah blah
------------------
And I want to extract the last names of the attendees (i.e., in the above string, I want four matches: Jones, Smith, Black, White).
If a string does not have the "Attendees:" keyword, then I don't care about that message at all. Likewise, I don't care about people's names if they don't follow the "Attendees:" keyword. For example, in the two following input strings, "Chump" should not be returned:
---------------------
blah blah blah
Losers: Mr. Chump Chimp Mr. Chump Chimp
blah blah blah
----------------------
blah blah blah
Mr. Chump Chimp
Attendees: Mr. Jones Alfred Mrs. Smith John Mr. Black Peter Mr. White Arnold
blah blah blah
-----------------------
I thought the following might work:
Attendees:(\s|\S)*(Mr\.|Mrs\.)\s(?<name>\w+)+
but that only matches the last name in the list (White).
This matches only the first name in the list (Jones):
Attendees:(\s|\S)*?(Mr\.|Mrs\.)\s(?<name>\w+)+
So then I thought I should be using an assertion for Attendees, but that didn't make much of a difference. The following returned only the last name (White):
I've been using regular expressions for a while (albeit infrequently, so I always feel I forgot stuff I used to know :) ), but this time it feels like I'm missing something conceptual. It is of course trivial to get multiple matches from a string in general (for example, if "Attendee:" preceded each attendee's name), but the difference in this case is the requirement that 'Attendees:' appear once before any of the matches.
Any help would be appreciated.
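For anyone following along, the different results from the first two patterns in the question look like the usual greedy versus lazy behavior. A greedy `(\s|\S)*` swallows the whole input and backtracks to the last title, while a lazy `(\s|\S)*?` stops at the first one. Either way the pattern produces a single match, so only one name comes back. Below is a minimal sketch, assuming the second pattern used `*?` and that the input uses `\r\n` line breaks; the class and method names are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVersusLazy
{
    // Sample input modeled on the question; the exact line breaks are an assumption.
    public const string Data =
        "blah blah blah\r\n" +
        "Attendees: Mr. Jones Alfred Mrs. Smith John Mr. Black Peter Mr. White Arnold\r\n" +
        "blah blah blah";

    // Returns the value of the "name" group from the single match, or "" if nothing matched.
    public static string NameFromSingleMatch(string pattern)
    {
        Match m = Regex.Match(Data, pattern);
        return m.Success ? m.Groups["name"].Value : string.Empty;
    }

    public static void Main()
    {
        // Greedy: (\s|\S)* consumes everything, then backtracks to the LAST title.
        Console.WriteLine(NameFromSingleMatch(@"Attendees:(\s|\S)*(Mr\.|Mrs\.)\s(?<name>\w+)+"));  // White

        // Lazy: (\s|\S)*? expands one character at a time and stops at the FIRST title.
        Console.WriteLine(NameFromSingleMatch(@"Attendees:(\s|\S)*?(Mr\.|Mrs\.)\s(?<name>\w+)+")); // Jones
    }
}
```

In both cases "Attendees:" is part of the one overall match, which is why a second match can never start after it.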
(Moderator: Thread moved to this forum for better responses)
(Moderator: Thread Title changed from "Regex Help" to "Specific Paragraph Matching with Inner Matches" for quicker thread understanding during a search)
This will get you closer. The idea is that regex has lookahead and lookbehind constructs. They just bookend what you really want; anchor points, so to speak.
So we specify the lookbehind to be Attendees: and the lookahead to be two blank lines. The lookbehind/lookahead text will not be captured, so what you end up with is the attendees. You will then need to sort out the names you need from the match.
Note: the RegexOptions flags are important, so keep them. Also, below is output from a tool I use; I am not associated with it. The author makes it free to use.
// using System.Text.RegularExpressions;

/// <summary>
/// Regular expression built for C# on: Sat, Jan 13, 2007, 02:36:31 PM
/// Using Expresso Version: 3.0.2559, http://www.ultrapico.com
///
/// Find the group of attendees names for later processing.
///
/// A description of the regular expression:
///
/// Match a prefix but exclude it from the capture. [Attendees:\r\n]
///     Attendees:\r\n
///         Attendees:
///         Carriage return
///         New line
/// [1]: A numbered capture group. [.*]
///     Any character, any number of repetitions
/// Match a suffix but exclude it from the capture. [(?:\r\n){2,}]
///     Match expression but don't capture it. [\r\n], at least 2 repetitions
///         \r\n
///             Carriage return
///             New line
///
///
/// </summary>
public Regex Attendees = new Regex(
    @"(?<=Attendees:\r\n)(.*)(?=(?:\r\n){2,})",
    RegexOptions.Singleline
    | RegexOptions.ExplicitCapture
    | RegexOptions.CultureInvariant
    | RegexOptions.Compiled
);
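To make the bookend idea concrete, here is a small sketch that runs a lookbehind/lookahead pattern of this shape against sample text. The sample data is an assumption: it puts the names on the lines after "Attendees:" and ends the block with two blank lines, which is the layout the pattern expects. The class name `AttendeeBlock` and the `Extract` helper are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttendeeBlock
{
    // Lookbehind anchors on "Attendees:" plus a line break; lookahead requires two or more line breaks.
    private static readonly Regex Attendees = new Regex(
        @"(?<=Attendees:\r\n)(.*)(?=(?:\r\n){2,})",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture |
        RegexOptions.CultureInvariant | RegexOptions.Compiled);

    // Assumed layout: names follow the keyword on their own lines, then two blank lines.
    public const string SampleData =
        "blah blah blah\r\n" +
        "Attendees:\r\n" +
        "Mr. Jones Alfred\r\n" +
        "Mrs. Smith John\r\n" +
        "\r\n" +
        "\r\n" +
        "blah blah blah";

    // Returns the text between the bookends (trimmed), or "" when the block is absent.
    public static string Extract(string data)
    {
        Match m = Attendees.Match(data);
        // With ExplicitCapture the (.*) group is not numbered, so read the whole match.
        return m.Success ? m.Value.Trim() : string.Empty;
    }

    public static void Main()
    {
        // Prints the two name lines; the bookend text itself is not part of the match.
        Console.WriteLine(Extract(SampleData));
    }
}
```

Note that the result is one block of text, not one match per name, so the names still have to be pulled out of it afterwards.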
MuscleHead
Posted: Regular Expressions, Specific Paragraph Matching with Inner Matches
With the singleline option turned on, the regex in your post captures everything after Attendees: as one big match, as opposed to multiple matches (unless I screwed it up)...
I did try to tweak it: (?<=Attendees:\r\n)((?<=(?:Mr.|Mrs.))\s(?<lastname>\w+)\s(?<firstname>\w+)(?=\r\n))
I was hoping that forcing each match to start with Mr. or Mrs., and preventing each match from containing carriage returns in the middle, would solve the problem, but the above didn't match at all. If I just take out the part that enforces Attendees:
What you have found is exactly right. The regex provided only acquires the set of names; it does not return the names themselves. I was remiss in not mentioning that... my bad. <g> The thought was that, once that was done, another regex would extract the names.
As you have found, trying to have one regex do all the work at once clouds the issue by matching outside the attendees block, which is why I am now suggesting a two-part process.
Here is an example. Note that I changed the original regex to place the capture into a group named AttendeesRaw, as denoted by the regex grammar "(?<GroupNameHere> ... )":
string
Attendees: Mr. Jones Alfred Mrs. Smith John Mr. Black Peter Mr. White Arnold
blah blah blah";
string attendeesPattern
Match m = Regex.Match(data, attendeesPattern, RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant | RegexOptions.Compiled);
if (m.Success) { List<string> names = new List<string>();
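A minimal sketch of the two-part process described above, not the poster's exact code. It assumes the names sit on the same line as "Attendees:" (as in the question's sample) and that every attendee is introduced by "Mr." or "Mrs."; both patterns and the `TwoStepAttendees` class are illustrative assumptions. Only the group name AttendeesRaw comes from the thread.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TwoStepAttendees
{
    public static List<string> GetLastNames(string data)
    {
        var names = new List<string>();

        // Step 1: isolate the rest of the "Attendees:" line in a named group.
        Match block = Regex.Match(
            data,
            @"Attendees:(?<AttendeesRaw>[^\r\n]*)",
            RegexOptions.ExplicitCapture | RegexOptions.CultureInvariant);

        // Regex.Match never returns null, so test Success instead.
        if (!block.Success)
        {
            return names; // no "Attendees:" keyword, nothing to report
        }

        // Step 2: run a second regex over the isolated text only.
        string raw = block.Groups["AttendeesRaw"].Value;
        foreach (Match m in Regex.Matches(raw, @"(?:Mr\.|Mrs\.)\s(?<lastname>\w+)"))
        {
            names.Add(m.Groups["lastname"].Value);
        }
        return names;
    }

    public static void Main()
    {
        string data =
            "blah blah blah\r\n" +
            "Mr. Chump Chimp\r\n" +
            "Attendees: Mr. Jones Alfred Mrs. Smith John Mr. Black Peter Mr. White Arnold\r\n" +
            "blah blah blah";

        // Prints: Jones, Smith, Black, White
        Console.WriteLine(string.Join(", ", GetLastNames(data)));
    }
}
```

Because the second regex only ever sees the text captured in step 1, names outside the attendees block (such as "Chump") cannot leak into the result.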
Welcome to the MSDN forums and thanks for dropping by.
I was a little confused by your first post with the use of <!. But the second post clarified it to use the positive look behind <= which made more sense and worked! Thanks.
MuscleHead
Thanks for the help, Sergei. But did it really work for you? I am using RegexBuddy, and have both "dot matches newline" and "case insensitive" checked (but "^$ match at line breaks" unchecked), and all that gets selected is the last name in the list ("White")...
Sergei Z
no, it works OK. Group LAST_NAME actually captures all 4 last names. The problem is that RegexBuddy deceives u, showing u only the last one. To get all the last names, u'll have to access the CaptureCollection of the group LAST_NAME. The code below gives u an example of how to access the members of the CaptureCollection of a capturing Group. There will be 4 captures in the LAST_NAME group:
Jones
Smith
Black
White
u can use Expresso to see them all: it shows all captures of a group.
<C# code>
//how group/capture collection works:
using System;
using System.Text.RegularExpressions;

public class RegexTest
{
    public static void RunTest()
    {
        int counter;
        Match m;
        CaptureCollection cc;
        GroupCollection gc;

        // declare Regex , Regex pattern

        // Define the string to analyze, find matches

        //get a group collection
        gc = m.Groups;
        // Print the number of groups.
        Console.WriteLine("Captured groups = " + gc.Count.ToString());
        // Loop through each group.
        for (int i = 0; i < gc.Count; i++)
        {
            //get a collection of captures
            cc = gc[i].Captures;
            counter = cc.Count;
            // Print number of captures in this group.
            Console.WriteLine("Captures count = " + counter.ToString());
            // Loop through each capture in group. Print all the captures
            for (int ii = 0; ii < counter; ii++)
            {
                // Print capture
                Console.WriteLine(cc[ii]);
            }
        }
    }

    public static void Main()
    {
        RunTest();
        Console.WriteLine("\nPress Enter to Exit");
        Console.ReadLine();
    }
}
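Here is a self-contained sketch of the single-regex approach using .NET's CaptureCollection. The pattern is an assumption written for this example, not the pattern from the earlier post; only the group name LAST_NAME comes from the thread. It repeats one "title, last name, first name" unit after "Attendees:", so a single match holds one capture per attendee. The `CaptureDemo` class is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // Assumed pattern: "Attendees:" once, then one or more "Mr./Mrs. Last First" units.
    private const string Pattern =
        @"Attendees:(?:\s*(?:Mr\.|Mrs\.)\s(?<LAST_NAME>\w+)\s\w+)+";

    public static List<string> GetLastNames(string data)
    {
        var names = new List<string>();
        Match m = Regex.Match(data, Pattern);
        if (!m.Success)
        {
            return names;
        }

        // Group.Value would hold only the final capture ("White");
        // Group.Captures keeps every capture made while the group repeated.
        foreach (Capture c in m.Groups["LAST_NAME"].Captures)
        {
            names.Add(c.Value);
        }
        return names;
    }

    public static void Main()
    {
        string data =
            "blah blah blah\r\n" +
            "Attendees: Mr. Jones Alfred Mrs. Smith John Mr. Black Peter Mr. White Arnold\r\n" +
            "blah blah blah";

        foreach (string name in GetLastNames(data))
        {
            Console.WriteLine(name); // Jones, Smith, Black, White on separate lines
        }
    }
}
```

Capture collections of this kind are specific to the .NET regex engine, which would explain why a tool that shows only each group's final value reports just "White".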
Sergei Z