Normalizing some imported data

I have an interesting problem that I hope someone can help me with. (Note
that "interesting" is not always a good thing.)

Some well-meaning person set up automated data logging on a process in my
company. The resulting information, however, is very difficult to work with.

A "snapshot" of the status at each station is taken every half-hour. Each
line of data begins with the date and time, then a counter of the cumulative
number of units produced. Then there’s a pair of columns for each station,
with the station number in the headers (data in the field names! Aaargh!).
The first column is a (cumulative again) count of “incidents†at that
station, and the second is the (cumulative, of course) amount of time lost.

I think the way I want to see the data is to have a file with the date,
time, and units produced info. A second file would have date, time, station
number, incidents, and time lost.

The data is presented in a .CSV file. There are about 70 stations, by the way.

Does anyone have a way of "de-pivoting" this data, so to speak? I'd rather
have individual values for each time period, instead of the cumulative ones,
too.

It could be done in Excel and then imported into Access, or it could be
imported directly into Access if I could set up some routines to do it there.

Thoughts?

TIA,
Randy
 
Randy:

I'll assume the columns in the original table are called LogDate, LogTime,
Units, Station1Incidents, Station1TimeLost, Station2Incidents,
Station2TimeLost etc. I don't see the need for two new tables, however; one
should be enough. You can always view the columns independently in queries
if required.

To insert rows into the table, link to the text file. A series of 'append'
queries, something like the one below, should then do it (note that I've
combined the date and time into a single column, as the Access date/time data
type holds both).

INSERT INTO NewTable (LogDateTime, StationNumber, Units, Incidents, TimeLost)
SELECT
    LogDate + LogTime,
    1,
    Units - (SELECT NZ(SUM(Units),0)
             FROM OldTable AS T2
             WHERE T2.LogDate + T2.LogTime < T1.LogDate + T1.LogTime),
    Station1Incidents - (SELECT NZ(SUM(Station1Incidents),0)
             FROM OldTable AS T2
             WHERE T2.LogDate + T2.LogTime < T1.LogDate + T1.LogTime),
    Station1TimeLost - (SELECT NZ(SUM(Station1TimeLost),0)
             FROM OldTable AS T2
             WHERE T2.LogDate + T2.LogTime < T1.LogDate + T1.LogTime)
FROM OldTable AS T1;

LogDateTime and StationNumber should be the composite primary key of the
table. Repeat the query, amending the SQL appropriately, for each station.

Ken Sheridan
Stafford, England
 
Ken,
Thanks for taking a shot at this.

A couple of points:

First, did you notice that I said there are 70 stations? I suppose once the
queries are set up, that's not a big deal. The first time is going to be very
tedious, though. The stations are numbered 10 through 690, by 10, but I think
there are cases where there's an A and a B for the same number (e.g., 190A and
190B). And there may be some gaps. I was hoping to extract these from the
field names, but if there's a dedicated query for each station I guess that
won't matter.

I shouldn't have a missing data problem at all, so the NZ functions probably
aren't necessary.

I am confused by your use of SUM, though. Each data record's values are
cumulative. So if I want to know how many incidents occurred in this
half-hour time period, I need to subtract the previous record's value. If no
incidents occurred, the result will be zero. If the number has been 1 for the
last 8 periods, I'll be subtracting 8 from 1 by using the sum, won't I? Maybe
the logic just hasn't filtered into my brain yet... I may have to play with
it a little.

I still think I'll want 2 tables, since the units produced relates to the
process as a whole, not to the individual stations. And there's no field for
time lost for the process, just the individual stations. Different data,
different tables, no? Well maybe not, but that's really a minor part of my
problem at this point.

You have no idea how much I appreciate the time you spend on this (my
questions and others'). Thanks!

Randy
 
Randy:

1. You don't need 70 separate queries. Write a VBA procedure with a For n
= 10 To 690 Step 10 loop and execute the SQL statement in each iteration of
the loop, incrementing the station number etc. each time (there's a sketch of
such a routine after this list). Trap the error which will result from any
gaps and move on to the next iteration of the loop. If the number might in
future exceed 690 you could loop to an artificially high number, e.g. 1000,
and let the error handling cope with the redundant numbers.

2. The NZ function is there to handle the first row in the original table,
for which there won't be any earlier ones, so keep it.

3. You are right about the SUM operator; it should have been MAX, assuming
every log point contains data in all columns, of course.

4. I misunderstood about the units produced. Two tables is right, so
you'll need two routines, one for each table. For the table of units
produced you just need to execute one 'append' query of course, so no looping
is required. Make the LogDateTime column the primary key of this table.
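
A minimal sketch of what such a routine might look like, assuming the table
and column names used earlier in the thread, with MAX in place of SUM as per
point 3. The per-station table here omits Units, which per point 4 goes to
its own table; UnitsTable is an assumed name for that second table.

Public Sub AppendStationRows()
    ' Loops over the station numbers, building and running one append
    ' query per station. A gap in the numbering makes db.Execute raise an
    ' error, which is skipped so the loop moves on to the next station.
    Dim db As DAO.Database
    Dim n As Long
    Dim strSQL As String

    Set db = CurrentDb
    On Error Resume Next    ' skip station numbers that have no columns

    For n = 10 To 690 Step 10
        strSQL = "INSERT INTO NewTable (LogDateTime, StationNumber, Incidents, TimeLost) " & _
            "SELECT LogDate + LogTime, " & n & ", " & _
            "Station" & n & "Incidents - (SELECT NZ(MAX(Station" & n & "Incidents),0) " & _
                "FROM OldTable AS T2 " & _
                "WHERE T2.LogDate + T2.LogTime < T1.LogDate + T1.LogTime), " & _
            "Station" & n & "TimeLost - (SELECT NZ(MAX(Station" & n & "TimeLost),0) " & _
                "FROM OldTable AS T2 " & _
                "WHERE T2.LogDate + T2.LogTime < T1.LogDate + T1.LogTime) " & _
            "FROM OldTable AS T1"
        db.Execute strSQL
    Next n

    On Error GoTo 0
    Set db = Nothing
End Sub

Public Sub AppendUnitsRows()
    ' Point 4: a single append query (no loop) for the units-produced table.
    CurrentDb.Execute _
        "INSERT INTO UnitsTable (LogDateTime, Units) " & _
        "SELECT LogDate + LogTime, " & _
        "Units - (SELECT NZ(MAX(Units),0) FROM OldTable AS T2 " & _
        "WHERE T2.LogDate + T2.LogTime < T1.LogDate + T1.LogTime) " & _
        "FROM OldTable AS T1"
End Sub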

Ken Sheridan
Stafford, England

 
Ken,
Excellent. This is very helpful to me.

I took a closer look at the data, and there's no "A" and "B" version of any
station like I thought there might be, so your looping solution should do the
trick.

I did find that there are three stations with an additional field of data,
however. The info is a count of failures of an automated test. I should be
able to pick those up with an If...Then inside the loop (sketched below). Or
I could simply skip that data, as I'm not really interested in it at this
point. I hate to throw away information, though.
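
A rough sketch of how that branch might look. The three station numbers and
the extra column's name aren't given above, so Station<n>Failures and the
TestFailures column in NewTable are placeholders:

Private Function StationSQL(n As Long, blnHasFailures As Boolean) As String
    ' Builds the per-station append SQL; when blnHasFailures is True it also
    ' carries the extra test-failure count for that station.
    Dim s As String
    s = "INSERT INTO NewTable (LogDateTime, StationNumber, Incidents, TimeLost"
    If blnHasFailures Then s = s & ", TestFailures"
    s = s & ") SELECT LogDate + LogTime, " & n & ", " & _
        "Station" & n & "Incidents - (SELECT NZ(MAX(Station" & n & "Incidents),0) FROM OldTable AS T2 " & _
        "WHERE T2.LogDate + T2.LogTime < T1.LogDate + T1.LogTime), " & _
        "Station" & n & "TimeLost - (SELECT NZ(MAX(Station" & n & "TimeLost),0) FROM OldTable AS T2 " & _
        "WHERE T2.LogDate + T2.LogTime < T1.LogDate + T1.LogTime)"
    If blnHasFailures Then
        s = s & ", Station" & n & "Failures - (SELECT NZ(MAX(Station" & n & "Failures),0) FROM OldTable AS T2 " & _
            "WHERE T2.LogDate + T2.LogTime < T1.LogDate + T1.LogTime)"
    End If
    StationSQL = s & " FROM OldTable AS T1"
End Function

Inside the loop you would then call db.Execute StationSQL(n, ...), passing
True as the second argument for the three stations concerned.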

The other annoying thing that I noticed is that the name of the .CSV file
changes each day (date embedded in the name). Simply linking to the file is
not going to work. Looks like I'll need to import it manually. Of course, if
I do a little searching here maybe I can find some tips to make it easier.
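
One way to handle the date-stamped name is to build the file name in code and
import it with TransferText; a sketch, where the folder and naming pattern are
assumptions (adjust the path and the Format mask to match the real files):

Public Sub ImportTodaysLog()
    ' Builds today's file name and imports the CSV into OldTable.
    Dim strFile As String
    strFile = "C:\Logs\ProcessLog_" & Format(Date, "yyyymmdd") & ".csv"
    DoCmd.TransferText acImportDelim, , "OldTable", strFile, True    ' True = first row holds field names
End Sub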

Well, now to tackle the job. I understand why I need the NZ function now; I
wouldn't have thought of the first record problem and would have been
scratching my head for hours wondering why my query didn't work.

So again, my thanks. I learn more this way than I ever could from a book.

Randy
 