I got a couple of customer records that need to bechecked for duplicate house numbers in the address / street field. Due to a frontend error, several entries have been created that have duplicate house numbers like this:
'Sesame Street 312 312' or 'ArlingtonRoad 7 7' or 'Queens Boulevard 22 22'
My task is to identify those entries with a duplicate street number like the ones above. Tried several combinations with substr, stuff, charindex but did not find a solution...
asked 07 Nov '14, 06:20
This might not be the easiest way, and it requires v11 or above as it does make use of regular expressions, particularly of REGEX_SUBSTR.
I'm assuming the house numbers consist of digits only and follow any other non-digit parts, otherwise you will have to adapt the pattern.
The REGEXP pattern "\s[0-9]+" looks for a blank followed by one or more digits, and the "positive lookahead zero-width assertion" expression "(?=\s[0-9]+)" makes sure there's one more pattern following the current position.
The first WHERE condition finds all entries with at least 2 trailing blocks of digits (whether identical or not) and is intended to speed up the query (though I don't know if it really does), the second condition compares both blocks. You might omit the fist condition.
That's a starting point, I'm sure a regex expert can certainly do better...
Here is some sample data:
Provided that you know that the house number is numeric and last in the address, you could use something like the following:
The above finds the last space in the string (locate with negative offset returns last occurrence), uses that to find the last word in the address, then finds the first occurrence of the last word. If the first occurrence is earlier, then there is a duplicate. I check too that the last word is numeric. You could refine this to check that the first occurrence immediately precedes the last space so that you don't incorrectly identify the same number earlier in the string as a duplicate.
If your data doesn't necessarily have the house number as the last word, it is a harder problem. The following would work with SQLA (since about 9.0 I believe) by first splitting words on space boundaries (using openstring; sa_split_list works well too) then looking for duplicated numbers:
I suspect this would work in IQ but perhaps slowly as OPENSTRING is implemented only in the system dbspace and the query would likely run in partial passthrough mode. Depending on the size of the data it might be satisfactory, or you might get better performance by copying the table or a portion of it to the system dbspace.