[PRL] Tagging project

Mitchell Wand wand at ccs.neu.edu
Tue Feb 5 07:43:16 EST 2008


Here is an opportunity for somebody to earn a little extra money.  See me if
you're interested.  --Mitch

On Jan 14, 2008 9:15 AM, a friend of mine wrote:

Mitchell,

I'd love your feedback on a narrow problem I'm trying to solve. At heart, it
is a computational language theory problem. It's been years since my formal
training and the job of the VC hardly gives me the time to sink in my teeth
into the problem. Any pointers to good solution approaches or even people (a
great grad student who wants to do a consulting project for me?) would be
much appreciated.

The problem is the automated and efficient tagging of short pieces of text.
Each piece of text can be tagged with multiple tags. Whether a tag should be
applied to the text or not is determined by whether a given regular
expression (RE) matches the text. More formally, given a body of text B and
a one-to-one and onto map of regular expressions to tags (strings) M:{RE(i)
<- T(i)}, create a function F such that F(B, M) returns the subset S of tags
such that T(j) is in S if and only if RE(j) matches B.

The text is likely to be less than a few hundred characters in length more
than 95% of the time. The size of M is likely to vary from a few dozen to a
couple of hundred. There may be an opportunity to restrict the use of some
regular expression features in the RE(i) expressions, however, they will
definitely use OR groups, e.g., (Bob)|(Sam). The REs will definitely NOT use
back references, which (if I recall correctly) should reduce them to being
regular languages. I expect there is a way to compile a single DFA/NFA that
resolves the matching with a single pass over the text.

Thanks in advance!

On Monday, January 14, 2008 10:31 AM I replied:

 What you describe sounds pretty straightforward.  There are lots of
 technologies you could use, depending on what the data source is, how
 you want to handle the output, how often the regexps change, and (of
 course) performance requirements.

 My first reaction would be to simply write it in Perl, which has
 regular-expression matching built in.  If that doesn't do the job, you
 could try something fancier.

 Hope that helps.

On Jan 14, 2008 11:02 AM, my friend wrote:

Mitchell, the problem is the performance requirements. This will be used to
process incoming/outgoing email and RSS feeds. A naïve implementation that
iterates the REs would be too slow most likely. I'd love to find a more
efficient approach, e.g., compiling all the REs into a big DFA, caching that
and then processing the text only once to extract all the tags. Do you have
a student who'd be interested in doing something like this?

On Monday, January 14, 2008 11:29 AM, I wrote:

Ahh, ok.  I can certainly look for someone.   Can you give me a short
job description I can pass along?

On Jan 14, 2008 11:59 AM, my friend wrote:

The problem statement [above] is what I have. It reminds of a theory of
computation final project. ;-)
Happy to discuss further with an interested person.
Step one would be to come up with a design/algorithm. Implementation can be
flexible but Microsoft technologies will be easier to work with.
-------------- next part --------------
HTML attachment scrubbed and removed


More information about the PRL mailing list