Using SQL-CLR to create a Regular Expression Replace function

{outline||<1> - |Step <1>} Contributed by: Michael Morrison

Background

On a project I was on that used Sql Server 2005, we had a screen in which a user could search the database on many different variables. On of those variables were a street address. Well, because a user who enters a record with address information could have spelled it differently or used an abbreviation, we needed to still pull back records that were somewhat like what was being searched for. Well this is relatively simple using the DIFFERENCE function in SQL Server. I won't go in to much detail on that function here, but click on the link if you'd liked to learn how it works. Our problem was in the fact that these addresses obviously had street numbers and that we needed to somehow "strip" them out when comparing them to the given search parameter.

The Solution

To accomplish this, I built a SQL Clr function that takes 3 arguments. 1) The string expression to be searched, 2) a Regular Expression pattern to match, and 3) a string expression to replace matches with.

=

= In Visual Studio, create a SQL-CLR project
Image


=

= Add a User-Defined Function item.
Image
Image

=

= Add a using statement for System.Text.RegularExpressions

=

= Add 3 SqlString parameters to the RegExReplace function: expression, pattern, replace

public static SqlString RegExReplace(SqlString expression, SqlString pattern, SqlString replace)
{
}

=

= Define the function as follows:

public static SqlString RegExReplace(SqlString expression, SqlString pattern, SqlString replace)
{
  if (expression.IsNull || pattern.IsNull || replace.IsNull)
    return SqlString.Null;
    
  Regex r = new Regex(pattern.ToString());
  return new SqlString(r.Replace(expression.ToString(), replace.ToString()));
}

=

= Save and deploy your assembly to Sql Server

Now how to use it

Say we have a table with Address information defined as so

CREATE TABLE [dbo].[AddressRef](
	[AddressId] [int] IDENTITY(1,1) NOT NULL,
	[Address1] [varchar](25) NULL,
	[Address2] [varchar](25) NULL,
	[City] [varchar](17) NULL,
	[StateCode] [char](2) NULL,
	[Zip] [int] NULL,
 CONSTRAINT [PK_AddressRef] PRIMARY KEY CLUSTERED 
(
	[AddressId] ASC
)
) 

In this table, we have 4 records and the Address1 field holds street information. The 4 records we'll use for testing have values in the Address1 of:

123 Lucky Street
345 Luckie St.
222 Lucky Str.
227 E. Ponce de Leon Ave.

If we want to search on any addresses that could be located on the given street name of "Lucky St" we could do the following,

DECLARE @searchOn varchar(25)
DECLARE @address1 varchar(25)
DECLARE @add1Num varchar(25)

SET @searchOn = LTRIM(RTRIM(rccms.dbo.RegExReplace('445 Lucky St', '[0-9]', '')))

;WITH a1 AS (
    SELECT
         a.AddressId,
         a.Address1,
         'Diff' = DIFFERENCE(LTRIM(RTRIM(rccms.dbo.RegExReplace(a.Address1, '[0-9]', ''))), @searchOn)
    FROM AddressRef a WITH (NOLOCK)
    WHERE 1=1
         AND (COALESCE(a.Address1, '') <> '')
)
SELECT
    AddressId,
    Address1
FROM a1
WHERE 1=1
    AND (Diff >= 3)


Notice I set my DIFFERENCE tolerance level to 3. At this level, street names that are similar based on their Soundex value will make it into the result set but street names that are completely different won't. You can adjust your tolerance as needed between 0 and 4 with 4 being a strong match.

So with a tolerance of 3 I get back the following resultset:
AddressId Address1
----------- -------------------------
1 123 Lucky Street
2 345 Luckie St.
3 222 Lucky Str.

Let's add another record to the AddressRef table:
345 Easy St.

Notice running our query with tolerance of 3 returns:
AddressId Address1
----------- -------------------------
1 123 Lucky Street
2 345 Luckie St.
3 222 Lucky Str.
5 345 Easy St.

Notice how we have the new record show up? Why is this? Well it just happens that the Soundex for Easy St. and Lucky St. are similar enough to warrant a DIFFERENCE of 3. How could that be? Well let's look at their SOUNDEX values:
print SOUNDEX('Easy St.') -- E200
print SOUNDEX('Lucky St.') -- L200

So the DIFFERENCE looks at the two values and returns the number of characters that match, in this case 3.

Well let's see what happens if we up our tolerance level to 4 (strong similarity or identical strings):
AddressId Address1
----------- -------------------------
1 123 Lucky Street
2 345 Luckie St.
3 222 Lucky Str.

Ah, now we're back to our better result. So what should we use, 3 or 4? Well it really depends on how loose you want your "fuzzy" search to be. For me 3 is usually a good enough match. Of course there are also many more exact algorithms for comparing the differences of strings, but they don't exist inside of T-SQL and would definitely be fodder for another article. See you then!