Regex saves the day, again

August 26th, 2009 Chuck

I was supposed to build a “search” feature and a “related items” feature for the module I was assigned to. This was where my regex skills came in handy yet again.

I made a very bare-bones prototype of it, which runs the search query against some stored content, in this case a full-text article (although the final version should search a database).

The prototype is still very rough around the edges. It starts by extracting all keywords from the search query using this expression:

$regex = '/\b([\p{L}_]+)\b/u';

That rule can be translated as follows:

take any run of Unicode letters and underscores that is enclosed in word boundaries

Having done that, I place all matched data into an array and call that my “keywords” array, to be used later.
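The extraction step could be sketched roughly like this. The function name extract_keywords() is made up for illustration and is not part of the prototype, and dropping duplicate keywords is an assumption on my part rather than something the prototype is known to do.

```php
<?php
// Hypothetical helper for illustration only; not the prototype's actual code.
function extract_keywords(string $query): array
{
    // Runs of Unicode letters and underscores between word boundaries.
    // The u modifier makes PCRE treat the pattern and subject as UTF-8.
    $regex = '/\b([\p{L}_]+)\b/u';

    if (!preg_match_all($regex, $query, $matches)) {
        return []; // no keywords found (or the pattern failed to run)
    }

    // $matches[1] holds the text captured by the first group.
    // Assumption: duplicates are dropped so each keyword appears once.
    return array_values(array_unique($matches[1]));
}

print_r(extract_keywords('regex search_tool, regex tips'));
```

Punctuation and digits are not part of the character class, so they never end up in the keywords array.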

The next thing I did was chop the article into smaller pieces, currently by sentence. (I only use periods as a delimiter for now; I should probably include other punctuation marks as well.) Then I ran preg_match() on each piece to quickly check whether it matches any of the keywords, and compiled the pieces that do.
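A minimal sketch of the chopping and filtering step might look like the following. All three function names are hypothetical. Note that this version also splits on "!" and "?", which goes beyond the period-only delimiter described above, and it matches keywords case-insensitively, which is an assumption.

```php
<?php
// Hypothetical helpers for illustration only.
function split_sentences(string $text): array
{
    // Split on whitespace that follows ., ! or ? (the punctuation stays
    // attached to its sentence thanks to the lookbehind).
    $pieces = preg_split('/(?<=[.!?])\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return $pieces === false ? [] : $pieces;
}

function keyword_pattern(array $keywords): string
{
    // preg_quote() escapes regex metacharacters so keywords match literally.
    $escaped = array_map(function ($k) {
        return preg_quote($k, '/');
    }, $keywords);
    return '/(' . implode('|', $escaped) . ')/iu';
}

function matching_pieces(string $text, array $keywords): array
{
    if ($keywords === []) {
        return []; // an empty alternation would match every piece
    }
    $pattern = keyword_pattern($keywords);
    $hits = array_filter(split_sentences($text), function ($piece) use ($pattern) {
        return preg_match($pattern, $piece) === 1;
    });
    return array_values($hits); // reindex from 0
}

print_r(matching_pieces('Regex is handy. I like tea! Do you use regex?', ['regex']));
```

Building one alternation pattern from all keywords means each piece needs only a single preg_match() call.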

For each compiled piece, I assign a corresponding weight. This weight is my arbitrary way of picking the best match. My current implementation sets the weight equal to the number of characters in the piece that are matched by keywords. I still have to refine these rules later on.
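The weighting rule could be sketched as below. piece_weight() is a hypothetical name, and the sketch counts with strlen(), which measures bytes, so it only equals the character count for single-byte (ASCII) matches; that simplification is an assumption, not the prototype's behaviour.

```php
<?php
// Hypothetical helper for illustration only.
function piece_weight(string $piece, array $keywords): int
{
    if ($keywords === []) {
        return 0;
    }
    $escaped = array_map(function ($k) {
        return preg_quote($k, '/');
    }, $keywords);
    $pattern = '/' . implode('|', $escaped) . '/iu';

    // preg_match_all() returns non-overlapping matches, so no character
    // is counted twice.
    if (!preg_match_all($pattern, $piece, $matches)) {
        return 0;
    }

    // Assumption: strlen() counts bytes, which equals characters only
    // for ASCII text.
    return array_sum(array_map('strlen', $matches[0]));
}

echo piece_weight('Regex saves the day with regex.', ['regex', 'day']), "\n";
```

In this example "Regex" and "regex" contribute 5 each and "day" contributes 3.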

When all pieces have their weight assigned, I sort them according to weight in descending order (highest to lowest weight). Then I display the results and highlight the matched characters.
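Sorting and highlighting might be sketched like this. The function names, the ['piece' => ..., 'weight' => ...] array layout, and the use of <strong> tags for highlighting are all assumptions made for the example.

```php
<?php
// Hypothetical helpers for illustration only.
// $results is assumed to be a list of ['piece' => string, 'weight' => int].
function sort_by_weight(array $results): array
{
    usort($results, function ($a, $b) {
        return $b['weight'] <=> $a['weight']; // descending order
    });
    return $results;
}

function highlight(string $piece, array $keywords): string
{
    if ($keywords === []) {
        return htmlspecialchars($piece, ENT_QUOTES, 'UTF-8');
    }
    $escaped = array_map(function ($k) {
        return preg_quote($k, '/');
    }, $keywords);
    $pattern = '/(' . implode('|', $escaped) . ')/iu';

    // With PREG_SPLIT_DELIM_CAPTURE the result alternates between
    // unmatched text (even indexes) and matched keywords (odd indexes).
    $parts = preg_split($pattern, $piece, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        // Escape each part separately so keywords never match inside
        // HTML entities such as &amp;.
        $safe = htmlspecialchars($part, ENT_QUOTES, 'UTF-8');
        $out .= ($i % 2 === 1) ? "<strong>$safe</strong>" : $safe;
    }
    return $out;
}

echo highlight('Regex & regex', ['regex']), "\n";
```

Escaping the text before output keeps any markup in the article from breaking the results page.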

Here’s the prototype I made: search tool.

I’m still thinking of the refinements I could make so I’ll just post them here as I go along.

Also, it’s good to know that MySQL supports regex in SQL queries through its REGEXP operator. This should save me a lot of time!