Extract number from cell when specific format is found

paul c · Jul 31, 2009

Is there a formula that will extract only the numbers that immediately follow
"E" in a string?

For example, the formula would get "1787" from this string
http://www.glassdoor.com/Jobs/First-Data-Account-Executive-Jobs-EI_IE1787.0,10_KO11,28.htm

Jacob Skaria · Jul 31, 2009

If the number is always after "_IE" then the below formula would work

=LEFT(MID(A1,FIND("_IE",A1)+3,99),FIND(".",MID(A1,FIND("_IE",A1)+3,99))-1)

If the number is the first numeric in the string the below should work

=LEFT(MID(A1,MIN(SEARCH({0,1,2,3,4,5,6,7,8,9},A1&"0123456789")),LEN(A1)),FIND(".",MID(A1,MIN(SEARCH({0,1,2,3,4,5,6,7,8,9},A1&"0123456789")),LEN(A1)))-1)

If this post helps click Yes

Ron Rosenfeld · Jul 31, 2009

Is there a formula that will extract only the numbers that immediately follow
"E" in a string?

For example, the formula would get "1787" from this string:
http://www.glassdoor.com/Jobs/First-Data-Account-Executive-Jobs-EI_IE1787.0,10_KO11,28.htm

Do you want only the digits? Or could the number include the decimal -- e.g.
should it be 1787.0 in this instance?

In either event, this can be done easily with a User Defined Function. The one
below is annotated so you can set it up for either option. And it will return
a #NUM! error if there are no digits after any "E".

To enter this User Defined Function (UDF), <alt-F11> opens the Visual Basic
Editor.
Ensure your project is highlighted in the Project Explorer window.
Then, from the top menu, select Insert/Module and
paste the code below into the window that opens.

To use this User Defined Function (UDF), enter a formula like

=NumAfterE(cell_ref

in some cell.

==================================
Option Explicit
Function NumAfterE(s As String) As Variant
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.Pattern = "E(\d+(\.\d+)?)"
'for only digits and no decimals, change to
're.Pattern = "E(\d+)"
If re.test(s) = True Then
Set mc = re.Execute(s)
NumAfterE = CDbl(mc(0).submatches(0))
Else: NumAfterE = CVErr(xlErrNum)
End If
End Function
==================================
--ron

Harlan Grove · Jul 31, 2009

Ron Rosenfeld said:
....
In either event, this can be done easily with a User Defined Function. . .

paul c · Jul 31, 2009

Thanks, this worked for all of the cases in my file.

Ron Rosenfeld · Jul 31, 2009

Thanks, this worked for all of the cases in my file.

"Ron Rosenfeld" wrote:

Glad to help.

Thanks for the feedback.
--ron

Ron Rosenfeld · Jul 31, 2009

Tangent: I just can't understand the 'logic' of not indenting
statements inside procedures.

There's no logic. It's personal preference for where one starts indenting.

Back to the point, it's more efficient to replace the unwanted
contents with nothing than to use the match collection.

What do you mean by "efficient"? Easier to code? Faster execution? Something
else?

Without actually measuring, I can accept that it might have faster execution.
But a native VBA function would probably be even faster. So, from one who has
never taken a course in computer science, how do you make a trade-off between
"easier to code" and "faster execution" (or whatever parameter you are using to
define "efficient". I've done assembly language programming which can often be
optimized to run even more quickly, but can certainly be more difficult to
code.

Also somewhat
picky (but you started this by musing about fractional parts), your
formula fails to match numbers with no integer part, e.g., E.123,
negative numbers or signed positive numbers. Lurk around Perl or Unix
shell newsgroups and you'll learn robust regex idioms.

The OP's example was with an integer, so I left that mandatory in both
patterns.

Anyway, how about

Function foo(s As String) As Variant
Dim re As Object
Set re = CreateObject("VBScript.Regexp")
re.Pattern = ".*?E([-+]?\d*\.?\d+).*"
If re.Test(s) Then foo = CDbl(re.Replace(s, "$1")) _
Else foo = CVErr(xlErrNum)

Set re = Nothing 'paranoia

When is the "set re = Nothing" required?

I had thought it would occur, in this sort of construction, at the End Function
line.

So far as your regex is concerned, I believe the lazy qualifier following the
initial "." is only required if there is more than one Ennnn substring AND the
OP wants to return the first one. With it absent, the function will return the
last Ennn; and I believe the regex engine will come to the solution in fewer
steps with much less back-tracking.

--ron

Harlan Grove · Aug 2, 2009

Ron Rosenfeld said:
What do you mean by "efficient"? Easier to code? Faster execution? Something
else?

....

Regex engine does less, so entire regex use takes less time and uses
less memory. Regex .Replace method simply returns a string, but
it's .Execute method create a Matches collection object.

When is the "set re = Nothing" required?

I said paranoia. I don't entirely trust CreateObject. It does no harm
if it's not needed.

So far as your regex is concerned, I believe the lazy qualifier following the
initial "." is only required if there is more than one Ennnn substring AND the
OP wants to return the first one. . . .

Correct. It matches only up to the first E\d... pattern.

. . . With it absent, the function will return the last Ennn; and I believe the regex
engine will come to the solution in fewer steps with much less back-tracking.

Unclear. You mean that greedy * would find the last match faster than
non-greedy * would find the first? Unlikely. Also irrelevant if the
goal is the first match.

Maybe there's more backtracking in my function (picky: unlikely
there's any true backtracking, but there's A LOT of look ahead
checking), but it seems to run faster than your function. Maybe the
Matches collection overhead is that much of a drag. Also, rewriting
the regex to avoid non-greedy quantifiers requires something along the
lines of

re.Pattern = "^([^E]*(E\D)?)*E(\d+(\.\d+)?).*$"
t = re.Replace(s, "$3")
If t <> vbNullString Then hg = CDbl(t) Else hg = CVErr(xlErrNum)

which involves back-to-back-to-back 0-or-more quantifiers, so probably
even more backtracking.

FWLIW, it's a pity VBScript regexs lack a simple .Find method that
returns the position of the first match in the string as a Long, or
even better something like the awk approach

result = match(s, /E[0-9]*\.?[0-9]+/) ? substr(s, RSTART, RLENGTH) :
""

Constructing and dereferencing the Matches collection takes a lot of
time.

Ron Rosenfeld · Aug 2, 2009

. . . With it absent, the function will return the last Ennn; and I believe the regex
engine will come to the solution in fewer steps with much less back-tracking.

Click to expand...

Unclear. You mean that greedy * would find the last match faster than
non-greedy * would find the first? Unlikely. Also irrelevant if the
goal is the first match.

Maybe there's more backtracking in my function (picky: unlikely
there's any true backtracking, but there's A LOT of look ahead
checking), but it seems to run faster than your function. Maybe the
Matches collection overhead is that much of a drag. Also, rewriting
the regex to avoid non-greedy quantifiers requires something along the
lines of

re.Pattern = "^([^E]*(E\D)?)*E(\d+(\.\d+)?).*$"
t = re.Replace(s, "$3")
If t <> vbNullString Then hg = CDbl(t) Else hg = CVErr(xlErrNum)

which involves back-to-back-to-back 0-or-more quantifiers, so probably
even more backtracking.

FWLIW, it's a pity VBScript regexs lack a simple .Find method that
returns the position of the first match in the string as a Long, or
even better something like the awk approach

result = match(s, /E[0-9]*\.?[0-9]+/) ? substr(s, RSTART, RLENGTH) :
""

Constructing and dereferencing the Matches collection takes a lot of
time.

As I wrote before, I can certainly accept that the .Replace function will work
faster.

However, with regard to your regex, testing it in another program, against the
OP's test string, revealed that it took significantly more steps to match than
if you omit the lazy qualifier. However, upon closer examination, I discovered
that the number of steps was more dependent on the location of the matched
substring than on the use of a lazy vs greedy ".". Oh well, live and learn.

Now being curious, I ran some procedure timings. I did a run of 100 calls and
measured the elapsed time for four different routines. I repeated this several
times and averaged the results.

I then changed the position of the Ennn substring from near the end to near the
beginning of the string and did more runs.

The string was in A1; and the four function calls were in B1:E1

The four routines were:

foo: Your original with the lazy ".?"
foobar: Yours with the greedy "."
NumAfterE: Mine with the match collection
DoItInVBA: A VBA routine that I cobbled together (see below)

If my method of obtaining timing is accurate, the differences in the timing of
the various regex approaches seems trivial. The VBA routine is MUCH faster.

On my machine, the hi resolution counter frequency is 266,680 and these were
the results with the four different routines:

Ennn near
Start End
foo 1.914E-02 1.980E-02
foobar 1.832E-02 1.881E-02
NumAfterE 1.851E-02 1.888E-02
DoItInVBA 1.912E-04 1.825E-04

This is unexpected as I would have expected my regex routine to take
significantly longer. As a matter of fact, if I place the timer within the
function, it does seem to. However, the execution is so quick that we are
talking about differences of only a few counts of the timer.

One of these days, I'll do a bit more extensive testing.

======================================================
Option Explicit

Sub CheckTiming()
Dim oTime As New CHiResTimer
Const s As String =
"1http://www.glassdoor.com/Jobs/First-Data-Account-Executive-Jobs-EI_IE1783.0,10_KO11,28.htm"
Const i As Long = 100
Dim j As Long
Dim res

oTime.StartTimer
For j = 1 To i
res = foo(s)
Next j
oTime.StopTimer
Debug.Print "foo took " & Format(oTime.Elapsed, "#.0000000") & " seconds."

oTime.StartTimer
For j = 1 To i
res = foobar(s)
Next j
oTime.StopTimer
Debug.Print "foobar took " & Format(oTime.Elapsed, "#.0000000") & " seconds."

oTime.StartTimer
For j = 1 To i
res = NumAfterE(s)
Next j
oTime.StopTimer
Debug.Print "NumAfterE took " & Format(oTime.Elapsed, "#.0000000") & "
seconds."

oTime.StartTimer
For j = 1 To i
res = DoItInVBA(s)
Next j
oTime.StopTimer
Debug.Print "DoItInVBA took " & Format(oTime.Elapsed, "#.0000000") & "
seconds."

End Sub
'----------------------------------------
Option Explicit
Function foo(s As String) As Variant
Dim re As Object
Set re = CreateObject("VBScript.Regexp")

re.Pattern = ".*?E([-+]?\d*\.?\d+).*"
If re.test(s) Then foo = CDbl(re.Replace(s, "$1")) _
Else foo = CVErr(xlErrNum)
Set re = Nothing 'paranoia
End Function
'-------------------------------------------
Function foobar(s As String) As Variant
Dim re As Object
Set re = CreateObject("VBScript.Regexp")
re.Pattern = ".*E([-+]?\d*\.?\d+).*"
If re.test(s) Then foobar = CDbl(re.Replace(s, "$1")) _
Else foobar = CVErr(xlErrNum)
Set re = Nothing 'paranoia
End Function
'----------------------------------------
Function NumAfterE(s As String) As Variant
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.Pattern = "E(\d+(\.\d+)?)"
If re.test(s) = True Then
Set mc = re.Execute(s)
NumAfterE = CDbl(mc(0).submatches(0))
Else: NumAfterE = CVErr(xlErrNum)
End If
Set re = Nothing
End Function
'-------------------------------------------
Function DoItInVBA(s As String) As Variant
Dim lStart As Long
lStart = 1
Do
lStart = InStr(lStart, s, "E") + 1
Loop Until Mid(s, lStart, 1) Like "#"
DoItInVBA = Val(Mid(s, lStart))
End Function
'------------------------------

The following is in a Class Module:

==================================
Option Explicit

'How many times per second is the counter updated?
Private Declare Function QueryFrequency Lib "kernel32" _
Alias "QueryPerformanceFrequency" ( _
lpFrequency As Currency) As Long

'What is the counter's value
Private Declare Function QueryCounter Lib "kernel32" _
Alias "QueryPerformanceCounter" ( _
lpPerformanceCount As Currency) As Long

'Variables to store the counter information
Dim cFrequency As Currency
Dim cOverhead As Currency
Dim cStarted As Currency
Dim cStopped As Currency

Private Sub Class_Initialize()
Dim cCount1 As Currency, cCount2 As Currency

'Get the counter frequency
QueryFrequency cFrequency

'Call the hi-res counter twice, to check how long it takes
QueryCounter cCount1
QueryCounter cCount2

'Store the call overhead
cOverhead = cCount2 - cCount1

End Sub

Public Sub StartTimer()
'Get the time that we started
QueryCounter cStarted
End Sub

Public Sub StopTimer()
'Get the time that we stopped
QueryCounter cStopped
End Sub

Public Property Get Elapsed() As Double

Dim cTimer As Currency

'Have we stopped or not?
If cStopped = 0 Then
QueryCounter cTimer
Else
cTimer = cStopped
End If

'If we have a frequency, return the duration, in seconds
If cFrequency > 0 Then
Elapsed = (cTimer - cStarted - cOverhead) / cFrequency
End If

End Property
=================================
--ron

Ron Rosenfeld · Aug 2, 2009

As I wrote before, I can certainly accept that the .Replace function will work
faster.

However, with regard to your regex, testing it in another program, against the
OP's test string, revealed that it took significantly more steps to match than
if you omit the lazy qualifier. However, upon closer examination, I discovered
that the number of steps was more dependent on the location of the matched
substring than on the use of a lazy vs greedy ".". Oh well, live and learn.

Now being curious, I ran some procedure timings. I did a run of 100 calls and
measured the elapsed time for four different routines. I repeated this several
times and averaged the results.

I then changed the position of the Ennn substring from near the end to near the
beginning of the string and did more runs.

The string was in A1; and the four function calls were in B1:E1

Ooops. I hit Send too soon.

I first tried the timing with the string and function calls as noted above. But
the timings were too quick -- only a few counts of the high resolution timer.
So I changed to calling the functions from a Sub, as you see in the routine I
posted previously.

Sorry about the confusion.
--ron

Extract number from cell when specific format is found

paul c

Jacob Skaria

Ron Rosenfeld

Harlan Grove

paul c

Ron Rosenfeld

Ron Rosenfeld

Harlan Grove

Ron Rosenfeld

Ron Rosenfeld