Parsing a text file
-
THE POST BELOW IS MORE THAN 5 YEARS OLD. RELATED SUPPORT INFORMATION MIGHT BE OUTDATED OR DEPRECATED
On 02/11/2007 at 13:17, xxxxxxxx wrote:
User Information:
Cinema 4D Version: 10.1
Platform: Mac OSX ;
Language(s) : C.O.F.F.E.E ;
---------
Hello,
I've been using C4D for years, but I'm new to C.O.F.F.E.E. and scripting in general, so pardon my simple questions.
I'm roughing together a number of file parsers and have run into a limitation, found a workaround, and then discovered that workaround's limits too. I simply need a function analogous to readlines() in Python, which would let me step through an imported string one line at a time and isolate information from lines that start with the word "ATOM". Unable to find such a function in the SDK, I tried to hunt for a specific string using:

var phrase = stradd(tostring(0x000A),"ATOM ");
var posMove0 = strstr(str,phrase,pos);
...
var posMove = strstr(str,phrase,(posAdvance + posMove0));

This yielded random results, usually pulling any word ATOM from the middle of lines, etc., telling me that 0x000A and 0x000D did not work as proper carriage returns. To work around this, I grabbed a copy of the 80th character and pasted it into my search string:

var phrase = stradd(strmid(myString,80,1), "ATOM ");

This works great on a properly formatted PDB file, which is supposed to have a carriage return at position 80 and should thus work for Unix- or Mac-encoded returns. However, I quickly discovered that many files available in the Protein Data Bank are not correctly formatted: the first line often has fewer than 80 characters, and my parser fails.
Is there a readlines() function in COFFEE that I missed? If not, is there a better way to incorporate a carriage return into my search string? Also, since I'm new to this, I imagine my clumsy code could be made more efficient, so I'd love to hear any suggestions; I'll paste the code below. For instance, could I fill an array first, or is it faster to fill one line of the polygon object's point list at a time as I'm doing? I'll also paste the much more elegant Python script below for comparison of what I'd like to achieve directly in C4D.
Thank you,
Graham

main(doc,op)
{
    // Lets the user select a file
    var filename = GeGetStartupPath();
    filename->FileSelect("Please select a file.", FALSE);
    // println("filename = ", filename->GetFullString());

    // Opens the same file and reads it
    var file2 = new(BaseFile);
    file2->Open(filename, GE_READ, FILE_DIALOG);
    var myString = file2->ReadString(file2->GetLength());

    var PointCloud = doc->FindObject("PointCloud");
    var op = PointCloud;
    op->SetScale(vector(1,1,-1));
    var cnt = op->GetPointCount(); // Get actual point count.
    var MaxPoints = 10000;
    var str = myString;
    var pos = 0;

    // Roundabout way to get the end-of-line or linefeed character used to find
    // the first atom, in the search string created in "phrase" just below.
    var phrase = stradd(strmid(myString,80,1), "ATOM ");
    var posMove0 = strstr(str,phrase,pos);

    for (cnt = 0; cnt < MaxPoints; cnt++)
    {
        if (!instanceof(op,PointObject)) return FALSE; // Not a point object.
        var vc = new(VariableChanged); if (!vc) return FALSE; // Allocate class.
        var cnt = op->GetPointCount(); // Get actual point count.
        vc->Init(cnt,cnt+1); // Resize point count from cnt to cnt+1.
        // Just for adding a point at the end of
        // the point list one doesn't need a map.

        var pos = cnt;
        var posAdvance = (pos*79);
        var posMove = strstr(str,phrase,(posAdvance + posMove0));
        posMove = posMove + 1;
        if (posMove == 0) break;

        var len = 8;
        var x = strmid(str,(posMove + 30), len);
        var y = strmid(str,(posMove + 38), len);
        var z = strmid(str,(posMove + 46), len);

        //****** z must be (-) because C4D uses a left-handed coordinate system ******
        var ok = op->MultiMessage(MSG_POINTS_CHANGED,vc);
        var p = vector(evaluate(x), evaluate(y), evaluate(z));
        op->SetPoint(cnt, p); // Set data p for new point
    }
}

IN PYTHON
#!/usr/bin/env python
import glob, os

filelist = glob.glob("*.pdb")
for f in filelist:
    ctr = 1
    name = os.path.splitext(os.path.basename(f))[0]
    outputfilename = name + '.txt'
    fptr = open(outputfilename, 'w')
    ostr = "Point X Y Z\n"
    fptr.write(ostr)
    ptr = open(f)
    lines = ptr.readlines()
    for l in lines:
        if l.find("ATOM")==0 or l.find("HETA")==0:
            xcoord = float(l[30:38])
            ycoord = float(l[38:46])
            zcoord = float(l[46:54])
            ostr = "%5d %.3f %.3f %.3f\n" %(ctr, xcoord, ycoord, zcoord)
            fptr.write(ostr)
            ctr += 1
    fptr.close()
    print "wrote ", outputfilename

Thanks,
Graham -
On 02/11/2007 at 17:15, xxxxxxxx wrote:
1. Unless you wrote the text file using C4D WriteString(), ReadString() is inappropriate. ReadString() reads C4D Strings ONLY (they are written with a particular Binary format).
2. Your best bet is to read CHAR by CHAR into a line buffer until you reach an End-of-Line character.
3. Tokenization is not part of COFFEE (or the C++ SDK for that matter). You'll have to extract tokens (keywords) based on surrounding delimiters (such as whitespace, quotes, commas, etc.). This is Parsing 101.
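To illustrate point 2, here is a rough sketch in Python rather than COFFEE (purely for illustration; the helper name is made up): accumulate characters into a line buffer until an end-of-line character appears, handling CR, LF, and CR+LF endings alike.

```python
def read_lines_char_by_char(text):
    """Collect characters into a line buffer until an end-of-line
    character is reached, handling CR, LF, and CR+LF endings."""
    lines = []
    buf = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == '\r' or ch == '\n':          # End-of-Line reached
            lines.append(''.join(buf))        # emit the buffered line
            buf = []
            if ch == '\r' and i + 1 < len(text) and text[i + 1] == '\n':
                i += 1                        # swallow the LF of a CR+LF pair
        else:
            buf.append(ch)
        i += 1
    if buf:                                   # final line without a trailing EOL
        lines.append(''.join(buf))
    return lines
```

The equivalent COFFEE loop would read one CHAR at a time from the BaseFile into the buffer, as described above.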
-
On 02/11/2007 at 22:36, xxxxxxxx wrote:
Hi Robert,
Thanks for the rapid reply. Re: Parsing 101, I could definitely use such a class, but I thought I WAS extracting my token based on the surrounding delimiters. A carriage return followed by the word ATOM is the only efficient, unique way to describe these lines: they are followed by variable whitespace ranging from 1 to 7 spaces before a number, and are preceded by 0-2 spaces that in turn follow a newline symbol. Elsewhere in the file, ATOM often appears with a variety of spaces, characters, and/or numbers after it, but never at the start of a new line except where I need it: in the description of the coordinates. Perhaps if I create a WriteFile document I can see how to better describe a return.
Questions I still have: 1) If ReadString() reads only C4D binary, why does it have no problem picking up "ATOM"? 2) If I go CHAR by CHAR, how will I know when I reach an End-of-Line character when I don't know how to express that character in the first place? 3) I'm curious about this: my script, as inefficient as it looks to me (using searches instead of simple next-line commands), imports this data about 4 times faster than the Import ASCII File command in the native Structure Manager. Any thoughts on why that may be? On top of that, the Python version parses the file about 4x faster than my version, and 16x faster than C4D's native importer can read the simplified text file that the Python code outputs. That's understandable without all of the overhead of the C4D software, but it still makes me curious, since I'd like to batch hundreds, possibly thousands, of these files at a time.
I'll paste a snippet of the file below to show the format, and here is the URL to a typical file: http://www.rcsb.org/pdb/files/2sod.pdb This file, however, does not have any of the formatting issues I mentioned, like ATOM near the end of a line or a truncated first line, but it should give you the gist.
Thank you kindly for your help,
Graham

MTRIX3 3 -0.124120 -0.161850 -0.978980 3.34240 1
ATOM 1 N ALA O 1 -20.479 24.715 -21.334 1.00 16.16 N
ATOM 2 CA ALA O 1 -19.117 24.539 -21.395 1.00 15.65 C -
On 03/11/2007 at 00:44, xxxxxxxx wrote:
For a line-by-line parse, best to avoid tokenization based solely on an end-of-line character. Read a line into a String not including the EOL character(s) and tokenize it - thus you don't even deal with the EOL in the tokenization process. When tokenizing a string (representing a line for instance), one always has to deal with multiple delimiters between tokens. For tokenization, always remove leading delimiters/whitespace first. This accounts for multiple EOL characters, blank lines, and whitespace at the front of a line (unless tabs at the front are used to delineate nesting - which is rare and not a good format choice). If the token is the last and there are delimiters/whitespace after it, the tokenization must handle that as well.
Also note that Windows marks the end of a line with two characters (CR+LF, Carriage Return + Line Feed), classic MacOS with one (CR), and Unix with one (LF).
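The delimiter-stripping rule described above can be sketched in Python (an illustrative stand-in for the COFFEE version; the function name is made up): skip runs of delimiters, and collect everything else into tokens. Leading, trailing, and repeated delimiters all fall out of the same loop.

```python
def tokenize(line, delimiters=" \t"):
    """Split a line into tokens, dropping leading, trailing, and
    repeated delimiters, per the rules described above."""
    tokens = []
    current = []
    for ch in line:
        if ch in delimiters:
            if current:                       # close the token in progress
                tokens.append(''.join(current))
                current = []
        else:
            current.append(ch)
    if current:                               # last token, if any
        tokens.append(''.join(current))
    return tokens
```

For example, tokenize("  ATOM      2  CA  ALA ") yields ['ATOM', '2', 'CA', 'ALA'] with no empty tokens, no matter how the delimiters are distributed.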
-
COFFEE ReadString() differs from the C++ version, which doesn't read 'n characters' but reads directly into a String. The COFFEE docs say that ReadString() can read strict ASCII files, so you may be okay here.
-
Well, either way (COFFEE or C++), you don't know how many characters are in a line (up to the End-of-Line). Unfortunately, COFFEE doesn't have a ReadLine() method, and neither does the C++ SDK. You are reading the entire file into a String, which may work for smallish files but may cause issues with large ones. The standard practice is to allocate a CHAR* buffer sufficient for the longest expected line (1024 or 2048 bytes will usually suffice for this type of ASCII file) and read characters into the buffer until you reach an EOL character. Characters are numbers: CR = 13, LF = 10. This is pretty much standard even for Unicode.
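For instance, in Python (used here only to illustrate the character codes), CR and LF really are just the numbers 13 and 10, and a whole-file string can be split on any of the three EOL conventions in one step:

```python
# CR and LF are just character codes: CR = 13, LF = 10
assert ord('\r') == 13 and ord('\n') == 10

# Since the whole file is already in one string, split on any of the
# three EOL conventions at once; splitlines() recognizes LF, CR, and CR+LF.
raw = "HEADER\r\nATOM    1\rATOM    2\nEND"
lines = raw.splitlines()
```

COFFEE has no splitlines() equivalent, so there the same effect comes from comparing each character against the codes 13 and 10 as you copy into the line buffer.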
-
Because you read the entire file into a single String. Actually, since I'm parsing very large text files (sometimes 100-200MB!), I allocate a file buffer of either the file size or about 2MB if the filesize cannot be allocated. This allows me to read, extract lines, and parse these files faster. C4D probably isn't doing this - they should consider it.
As an example of what you should be looking for in creating and parsing a text format, you can look at one of your lines as given (EOL shown as '\n' here) :
ATOM 2 CA ALA O 1 -19.117 24.539 -21.395 1.00 15.65 C\n
Okay, I'll delineate the tokens:
ATOM
2
CA
ALA
O
1
-19.117
24.539
-21.395
1.00
15.65
C

These are the expected tokens for the line if tabs/spaces are used as delimiters. Any delimiters before or after the line (this can include end-of-line characters) should be removed, either from the line or from the first and last tokens. Basically, tokenization is about things of interest (tokens) surrounded by things that separate them (delimiters). You should find the first non-delimiter and read characters into a token String until you find a delimiter. Using strstr() with an advancing 'pos' can work, but watch your values. You do (pos*79), which is what, exactly? If the line is 128 characters and pos is 23, 23*79 is far beyond the line. Addition is better here.
Depending upon how flexible the format is, you may need to loop through the possible keywords as you get tokens. For instance if ATOM and MTRIX3 can appear at any line, you'll want to check for either as the first token of each line.
var pos = 0;
if ((pos = strstr(myString, "ATOM", pos)) != -1)
{
    // this is an ATOM line
}
else if ((pos = strstr(myString, "MTRIX3", pos)) != -1)
{
    // this is a MTRIX3 line
}
// ... etc.

Extract the token (for ATOM, for instance):

var token = strmid(myString, pos, 4);
The problem with this type of tokenization is that you'll need to validate that the found match with strstr() isn't part of something else ("ATOMIC"). If you attempt to use delimiters in the search, you'll end up in a quagmire quickly for anything but a rigorously strict and predictable format - not much flexibility at all and probably fragile. This is why I prefer a tokenization that goes char-by-char as it guarantees that the token was surrounded by delimiters and not just some match which may have surrounding non-delimiters.
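A minimal sketch of that boundary check in Python (the function name is made up): a match counts only when the keyword is the first whitespace-delimited token of the line, so "ATOMIC" no longer passes while a properly delimited "ATOM" still does.

```python
def line_starts_with_keyword(line, keyword):
    """True only when `keyword` is the first whitespace-delimited token
    of the line, so a bare substring match like "ATOMIC" is rejected."""
    stripped = line.lstrip(" \t")
    if not stripped.startswith(keyword):
        return False
    rest = stripped[len(keyword):]
    return rest == "" or rest[0] in " \t"     # keyword must end at a delimiter
```

The same guarantee falls out automatically of a char-by-char tokenizer, since every token it produces is delimiter-bounded by construction.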
-
-
On 06/11/2007 at 04:49, xxxxxxxx wrote:
Hi Robert,
Thank you for all of the feedback. I'm stuck at a conference for the next week, but look forward to applying your suggestions upon my return.
-G -
On 03/12/2007 at 21:19, xxxxxxxx wrote:
Hi Robert,
I finally got around to scripting a generic parser for my file types. A programmer in my lab clued me in to the fact that it works fastest to do all of my splitting into, and within, a series of arrays. Coupling this with the SetPoints command (which the SDK says is faster) allows my script to write a vector for each of the 8500 points in my test file in less than 0.12 seconds. That's a nice improvement over the 16 seconds it took to import via the Structure Manager (and that doesn't include jumping through all the hoops to run the Python script in the first place). By the way, I love the little algorithm timer the SDK recommends under time(). Now I'm off to search for a method that will let me access files over the internet directly, without having to download them first, if at all possible.
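For comparison, the array-first strategy can be sketched in Python (a hypothetical illustration, not the actual script; the column offsets 30, 38, and 46 match those used earlier in the thread): parse every coordinate into one list in a single pass, then hand the whole batch to the geometry at once, SetPoints-style, instead of resizing the point count once per atom.

```python
def parse_pdb_points(text):
    """One pass over the file: collect every ATOM/HETATM coordinate
    into a single list, ready for a batch SetPoints-style call.

    PDB is a fixed-column format; x, y, z occupy the 8-character
    fields starting at offsets 30, 38, and 46."""
    points = []
    for line in text.splitlines():
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            x = float(line[30:38])
            y = float(line[38:46])
            z = -float(line[46:54])   # negate z for C4D's left-handed system
            points.append((x, y, z))
    return points
```

Building the whole list first keeps the per-atom work down to string slicing, which is why the single batch update beats resizing the point object inside the loop.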
Thanks again,
Graham