How to handle C4D Unicode in Python scripting?

Cairyn

Hello; I seem to have stumbled over some interesting quirks in C4D Python that make it impossible to work with strings that are not ASCII.

Before I start: I am aware of the u prefix for unicode strings, the unicode string type, and the unicode character escape notations, so this is not really a Python question but an implementation question. Maybe I'm missing something essential...

Here's a test script:

import c4d
import maxon
from maxon import String

def main():
    print "---------- Start ----------"
    a = "äöü"
    print "Umlaut:", a, len(a), a[0], a[1:3]
    b = u"äöü"
    print "Unicode umlaut:", b, len(b), b[0], b[1:3]
    c = "\u0189\u018B\u01F7"
    print "u escape:", c, len(c), c[0], c[1:3]
    d = u"\u0189\u018B\u01F7"
    print "Unicode u escape:", d, len(d), d[0], d[1:3]
    e = "ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆"
    print "Multibyte characters:", e, len(e), e[0], e[1:3]
    f = u"ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆"
    print "Unicode multibyte characters:", f, len(f), f[0], f[1:3]

    x = op.GetName() # object is called äöü
    print "Object name:", x, len(x), x[0], x[1:3]
    y = String("Bär")
    print "Explicit string:", y, "no len() attribute" #len(y), y[0], y[1:3]
    z = str("Bär")
    print "Python string:", z, len(z), z[0], z[1:3]

    print type(a), type(b)
    print type(c), type(d)
    print type(e), type(f)
    print type(x), type(y), type(z)

    #p = unicode(op.GetName())
    #print "Object name cast to unicode:", p, len(p), p[0], p[1:3]
    q = str(op.GetName())
    print "Object name cast to str:", q, len(q), q[0], q[1:3]

if __name__=='__main__':
    main()

(I do see the unicode characters in the post preview, so this should be visible to everyone. Yes, I meant to write these.)

And here are the results from the console (I did not test everything with a MessageDialog):

---------- Start ----------
Umlaut: äöü 6
Unicode umlaut: Ã¤Ã¶Ã¼ 6 Ã ¤Ã
u escape: \u0189\u018B\u01F7 18 \ u0
Unicode u escape: ƉƋǷ 3 Ɖ ƋǷ
Multibyte characters: ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆ 33
Unicode multibyte characters: ÎÎ©Î¨ Ð©ÐÐ â¡â¦â´ âââ 33 Î Î
Object name: äöü 6
Explicit string: Bär no len() attribute
Python string: Bär 4 B ä
<type 'str'> <type 'unicode'>
<type 'str'> <type 'unicode'>
<type 'str'> <type 'unicode'>
<type 'str'> <class 'maxon.reference.String'> <type 'str'>
Object name cast to str: äöü 6

At the beginning, I am testing a simple sequence of German umlauts in a and b. These are located in my national extended ASCII codepage (so they can be written with one byte). One might expect that these at least should work, but no.
With the str literal, the output is correct, but the length is the byte length of a two-byte-per-character encoding (6) instead of the character length. Neither the single index nor the slice works.
With the unicode literal, the output becomes some explicit encoding (I didn't bother to find out what). The single index and slice work fine - sort of, if you wanted to slice the encoded string.
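The lengths in the console output above are consistent with counting UTF-8 bytes rather than characters. Here is a minimal sketch of that arithmetic. It is written in Python 3 syntax, which is an assumption made for illustration: there, `bytes` plays roughly the role of Python 2's `str`, and `str` the role of `unicode`.

```python
# Python 3 sketch (assumption): bytes ~ Python 2 str, str ~ Python 2 unicode.
umlauts = "\u00e4\u00f6\u00fc"  # the three umlauts from sample a
# The Greek / Cyrillic / currency / math sample, written as escapes.
mixed = "\u039b\u03a9\u03a8 \u0429\u0416\u040a \u20a1\u20a6\u20b4 \u2211\u220f\u2206"

umlaut_bytes = umlauts.encode("utf-8")  # what a UTF-8 byte string would hold
mixed_bytes = mixed.encode("utf-8")

print(len(umlauts), len(umlaut_bytes))  # 3 6
print(len(mixed), len(mixed_bytes))     # 15 33

# Indexing or slicing the byte string yields bytes, not characters.
first_byte = umlaut_bytes[0:1]
print(first_byte)                       # b'\xc3'
```

The single byte `0xc3` is only the first half of the two-byte sequence for the first umlaut, which would explain why the index and slice results print as nothing useful.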

Okay, so perhaps using actual characters beyond plain 7-bit ASCII doesn't work. Next, I try encoding Unicode characters by their escape sequences in c and d.
With the str literal, the escape sequences are not interpreted. That is fine, as this is the behavior from the Python standard.
With the unicode literal, the output and the single index and the slice are all fine! Yay, this seems the way to go. (But read on...)

For the fun of it, I took a text editor and created a few Unicode characters beyond the 1-byte range: Greek, Cyrillic, Currency, Math (all from left-to-right scripts, so we won't run into issues there). The Script Manager doesn't seem to mind. (If I save the script, these characters appear encoded in the source, but upon reloading, they are restored to their Unicode glory.) That's samples e and f.
Here, the same thing happens as with the umlauts. With the str literal, the output is fine but length, index and slice give me the wrong results. With the unicode literal, I see only the encoded characters. Note that this sequence contains many non-printable characters; otherwise you could see that the len, index and slice results are actually correct for the encoded sequence.
What's funny here is that I can write these multibyte characters into a plain str. With äöü, I can argue that these can be represented by a single byte and therefore are fine to use in a str. With Greek and Cyrillic, I can't - but the str in e is the one that gives me the correct output in the console at least. Huh? There must be a good deal of transformations in the usage...

But at least I have found the way to handle C4D strings, right? Not quite...

Next, I try to read a name from an object in the Object Manager into x. This is named äöü, and upon checking that, I get the same results as from the str literal äöü - correct output, wrong length, index, and slice.

Hmm. Maybe the GetName() must be used as maxon.String instead. I create a variable y that is a String. This works with a literal containing an umlaut and gives me the correct output. But there is no len() attribute, nor a GetLength() one. Nor do I get any indices from a String class.
The documentation also is very sparse on String in Python. Obviously, we are supposed to use the Python internal classes str and unicode, which are either wrappers around maxon.String or are converted when used in the API.
Lastly, I create an explicit str object with a literal that contains an umlaut, again. This z behaves like the literal without cast, unsurprisingly. Note that index and slice indicate that the encoding is using varying byte lengths - the B comes out fine, and the ä is okay if you imagine it as 2-byte code.

Writing down the types holds no surprises. The explicit literals all give us str or unicode as expected. The object name results in a str (not a unicode, although C4D strings are supposed to support Unicode!). The explicit constructors do what they should.

Well, now here's the riddle: If an API function like GetName() gives us a str type, but the str type does not work properly with len, index, and slice, then how do we work in Python with C4D names that happen to be Unicode?

I try to cast str to unicode in p (after all, for the unicode literal d the functionality is there), but this attempt crashes with a fat error:

Traceback (most recent call last):
  File "D:\3D\Cinema4D\HomeDir\Cinema4D V21_6F07B783\library\scripts\Test_StringLiteralsAndUnicode.py", line 38, in <module>
    main()
  File "D:\3D\Cinema4D\HomeDir\Cinema4D V21_6F07B783\library\scripts\Test_StringLiteralsAndUnicode.py", line 32, in main
    #p = unicode(op.GetName())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Therefore I have commented it out here. Casting to str in q is (as expected) pointless, as the type is already str.
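The traceback is what one would expect when a UTF-8 byte string is decoded with the ASCII codec, which is the default that Python 2's `unicode()` uses when no encoding is given. A small sketch of the same failure, again in Python 3 syntax as an assumption, with a `bytes` literal standing in for the value returned by GetName():

```python
# Python 3 sketch (assumption): this bytes object stands in for the Python 2
# str that GetName() is reported to return for an object with a three-umlaut name.
name_bytes = b"\xc3\xa4\xc3\xb6\xc3\xbc"

try:
    name_bytes.decode("ascii")  # what an implicit ASCII decode attempts
    ascii_ok = True
except UnicodeDecodeError as err:
    ascii_ok = False
    print(err)

# Naming the codec explicitly succeeds and yields three characters.
name_text = name_bytes.decode("utf-8")
print(len(name_text))           # 3
```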

I do not mind writing unicode escape codes to get the string functionality working - but how do I get a unicode type from the C4D API to work with it, in the first place?

ferdinand

Hi,

with all due respect, digging through all that is a bit much (at least for me). Some points:

It seems to me that you are mixing up Python 2 and 3 features regarding unicode strings. You cannot enforce unicode representation for characters in unicode (generated by the OS, your keyboard, etc.) in Python 2. So print (u"ä ö ü") will not work in Python 2 as it will cause Python to try to interpret that string as a sequence of hex symbols (but it will work in Python 3). print ("ä ö ü") will work both in Python 2 and 3, given your system locale supports it.
Your unicode symbols are weird - at least for the German diacritics. The 16-bit unicode symbols for "äöü" should be u"\u00E4 \u00F6 \u00FC" and work fine for me here.

I hope, this helps. If not, you might want to highlight selected problems.

Cheers
zipit

Cairyn

@zipit said in How to handle C4D Unicode in Python scripting?:

It seems to me that you are mixing up Python 2 and 3 features regarding unicode strings. You cannot enforce unicode representation for characters in unicode (generated by the OS, your keyboard, etc.) in Python 2. So print (u"ä ö ü") will not work in Python 2 as it will cause Python to try to interpret these characters as hex ids. But it will work in Python 3. print ("ä ö ü") will work both in Python 2 and 3, given your system locale supports it.

Thanks, I think I have found the core issue now (and I would say it's a C4D bug). See next post... I'll try to keep this short this time.

I was not trying to mix Python 3 features into this; actually, Python 3 doesn't even have the unicode class... all strings are now unicode. But as C4D is still stuck with Python 2.7 or something, the current implementation is buggy in this respect.

Your unicode symbols are weird - at least for the German diacritics. The 16-bit unicode symbols for "äöü" should be u"\u00E4 \u00F6 \u00FC" and work fine for me here.

Heh, you are right of course. The variables c and d were not supposed to represent äöü though, but some random Unicode characters.

ferdinand

I was not trying to mix Python 3 features into this; actually, Python 3 doesn't even have the unicode class... all strings are now unicode.

Well, that was sort of my point. I do not understand the purpose of u"äöü" in your code then, since Python 2 is expecting an escaped string there [1]. Or am I overlooking something?

[1] Python 2.7 Unicode HOWTO. url: https://docs.python.org/2/howto/unicode.html
Cheers
zipit

Cairyn

Okay, I checked some more, and I think there is a C4D bug at the core of the matter.

I tried using # -*- coding: latin-1 -*- to get at least the German umlauts corrected, also with the coding utf-8. That didn't help though (and anyway, it would only solve the literal issue which is not the core problem).

After digging through the internals of Python's unicode class, I am now convinced that str is wrongly implemented in C4D's Python. str in Python 2.7 should be a one-byte representation that allows characters up to codepoint 255. What we get returned from BaseObject.GetName() and also from the literal construction of a str is actually a UTF-8 encoded string (which is what Python 3 would do, as this version does not have a unicode class any more, and all str objects are UTF-8 unicode).

That doesn't matter for pure 7-bit ASCII as the representation is the same, but it goes haywire for all characters >127 (as far as one-byte representation would be possible), and especially for all Unicode characters beyond the one-byte codepage.

str is actually currently (R21) built as unicode with this encoding. But it does not support the proper len, index, and slice functions - these still treat the characters as if they were one-byte codes. This then extracts partial codes from the multi-byte encodings, which by the nature of UTF-8 are all >127.

I found that the decode function, invoked on such a str, actually reinterprets the string as already UTF-8-encoded, and returns a proper unicode string, which allows me to use len, index and slice in the intended way:

import c4d

def main():
    print "---------- Start ----------"

    a = "Bär"
    print "str literal:", a, len(a), type(a)
    for c in a: print c, " ",
    print
    u = a.decode('utf-8')
    print "decoded as unicode:", u, len(u), type(u)
    for c in u: print c, " ",
    print
    
if __name__=='__main__':
    main()

Result:

---------- Start ----------
str literal: Bär 4 <type 'str'>
B         r
decoded as unicode: Bär 3 <type 'unicode'>
B   ä   r

The same works actually on the str returned by GetName().

(I'm continuing the experiments)

Cairyn

@zipit said in How to handle C4D Unicode in Python scripting?:

Well, that was sort of my point. I do not understand the purpose of u"äöü" in your code then, since Python 2 is expecting an escaped string there [1]. Or am I overlooking something?

[1] Python 2.7 Unicode HOWTO. url: https://docs.python.org/2/howto/unicode.html

If you scroll down on that documentation, you will find the sample code:

#!/usr/bin/env python
# -*- coding: latin-1 -*-

u = u'abcdé'
print ord(u[-1])

which is supposed to work. So, at least if I specify the encoding in this comment notation, I should be able to use these umlauts in a unicode literal.
If the encoding notation is not supported and there is no encoding default either, then using unsupported characters in a literal should raise an error.
Instead, the notation creates a unicode string in which the UTF-8 symbols are contained as characters. A double encoding, so to say.
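One plausible mechanism for this "double encoding" is that each UTF-8 byte of the literal gets read as a separate one-byte character. The sketch below reproduces that effect and its reversal in Python 3 syntax. Both the choice of Latin-1 as the misreading codec and the use of Python 3 are assumptions made for illustration, not a statement about what C4D does internally.

```python
# Python 3 sketch (assumption): Latin-1 is used as the "wrong" one-byte codec.
original = "\u00e4\u00f6\u00fc"        # the three umlauts
utf8_bytes = original.encode("utf-8")  # 6 bytes

# Misreading each UTF-8 byte as one Latin-1 character gives 6 characters.
double_encoded = utf8_bytes.decode("latin-1")
print(len(double_encoded))             # 6

# Reversing the misreading recovers the intended three characters.
repaired = double_encoded.encode("latin-1").decode("utf-8")
print(repaired == original)            # True
```

The six characters produced this way match the garbled "Unicode umlaut" line in the first console output.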

ferdinand

Hi,

well, that example defines the encoding, which is kind of the point of that example. About your other code - the following code:

a = "Bär"
print "str literal:", a, len(a), type(a)

will return, run from a default Python 2.7.8 interpreter:

str literal: Bär 4 <type 'str'>

Now that you mention it, I remember reading about that weird behavior of len and unicode literals in Python 2.7 before. Iterating through that string will cause an exception in Python 2.7.8 because of that.

I do not know what Cinema's Python does behind the curtain, but to me it looks more like a feature than a bug.

Cheers
zipit

Cairyn

@zipit said in How to handle C4D Unicode in Python scripting?:

well, that example defines the encoding, which is kind of the point of that example.

It is a unicode literal that contains a non-ASCII character, so it's the same as u"äöü" in my eyes...

About your other code - the following code:
a = "Bär"
print "str literal:", a, len(a), type(a)
will return, run from a default Python 2.7.8 interpreter:
str literal: Bär 4 <type 'str'>

That is weird. If that is the standard implementation, it makes no sense to me... the str is interpreted as three letters when written to the output, but if used otherwise (be it with a for loop, an index, a slice, or len()) it treats the content like a sequence of single bytes. That means that all functions that rely on cutting up the string or extracting something by index will potentially slash a multi-byte encoded character in half. Seems like a contradiction in handling.
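The "slashed in half" effect can be shown directly. This is a sketch in Python 3 syntax (an assumption for illustration), where `bytes` stands in for the Python 2 `str`:

```python
# Python 3 sketch (assumption): bytes ~ Python 2 str.
word_bytes = "B\u00e4r".encode("utf-8")  # b'B\xc3\xa4r', 4 bytes
half = word_bytes[:2]                    # ends in the middle of the umlaut
print(half)                              # b'B\xc3'

try:
    half.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
print(valid)                             # False

# Decoding first and slicing afterwards keeps characters intact.
safe = word_bytes.decode("utf-8")[:2]
print(len(safe))                         # 2
```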

I don't even want to stick with literals. Getting a string directly from the API as with GetName() causes the same encoded content with the same problems. If I name an object "Bär" and get the name string with GetName() then the len() is still 4. That is simply not what I expect.

If a string contains encoded content, I would assume that all functions that handle this string keep the integrity of the single characters (not bytes), so len("Bär") should be 3. (It gets difficult enough when Unicode uses separate (and potentially multiple) diacritical marks, modifiers, directional codes, or other stuff that makes it difficult to tell the characters apart...)
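The remark about separate diacritical marks can be made concrete with the standard `unicodedata` module: the same visible word can have different lengths depending on its normalization form. A short sketch in Python 3 syntax (the choice of Python 3 is an assumption; `unicodedata.normalize` also exists in Python 2.7):

```python
import unicodedata

composed = "B\u00e4r"  # umlaut as a single code point (NFC form)
# NFD splits the umlaut into "a" plus a combining diaeresis (U+0308).
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))  # 3 4
print(composed == decomposed)          # False

# Normalizing both sides to the same form makes comparison reliable.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)          # True
```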

What I have found working with Unicode is the following (samples):

import c4d

def outputUstr(myUstr):
    print "String:", myUstr
    print "Length:", len(myUstr), "Type:", type(myUstr)
    for c in myUstr: print c, "(" + str(ord(c)) + ") -",
    print

def main():
    print "---------- Start ----------"

    a = "äöü".decode('utf-8')
    outputUstr(a)
    b = u"\u0189\u018B\u01F7"
    outputUstr(b)
    c = "ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆".decode('utf-8')
    outputUstr(c)
    d = op.GetName()
    d = d.decode('utf-8')
    outputUstr(d)
    op.SetName(d + (" a∏ß".decode('utf-8')))
    c4d.EventAdd()

if __name__=='__main__':
    main()

A literal containing Unicode characters needs to be decoded (see variable a).
If the literal is supposed to contain escaped Unicode characters, then it must be an explicit unicode literal (see variable b). This cannot be decoded, and it must contain only 7-bit ASCII characters other than the escaped ones. Inserting Unicode characters directly results in multibyte codes being inserted as multiple characters (not as the intended encoded character).
As variable c shows, decoding works even for multibyte characters that have been copied from some text editor.
Names from the API, as in d, need to be decoded too to yield a unicode string. After that, len, index and slice work fine. You can write that string back as name directly, as SetName() accepts a unicode parameter.
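The decode-before-use rule can be wrapped in a small helper so that calling code does not need to care whether it received bytes or text. The function name `to_text` is hypothetical, and the sketch uses Python 3 syntax as an assumption; in the Python 2 interpreter discussed here, the equivalent check would be against `str` and the call would be `name.decode('utf-8')`.

```python
def to_text(value, encoding="utf-8"):
    """Return text for a byte string or for text (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Stands in for the UTF-8 byte string that GetName() is reported to return.
raw_name = b"B\xc3\xa4r"
name = to_text(raw_name)

# Work on text: append a space and one math symbol.
renamed = name + " \u220f"
print(len(name), len(renamed))  # 3 5
print(to_text(name) is name)    # True: text passes through unchanged
```

Making the helper idempotent means it can be applied at every API boundary without tracking which values were already decoded.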

And now I close shop for today...

ferdinand

Hi,

I am kind of confused about what you are trying to do. You can hard-code your Unicode symbols or just set the encoding of the file.

# -*- coding: utf-8 -*-

string_literals = [
    u"äöüß",
    u"âêôû",
    u"ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆"
]

for literal in string_literals:
    print literal, len(literal)

This will output:

äöüß 4
âêôû 4
ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆ 15

This holds for both the default Python 2.7.8 interpreter and c4d's interpreter (although the console struggles with some characters). You also have the option of loading strings from a resource file.

Cheers
zipit

Cairyn

@zipit said in How to handle C4D Unicode in Python scripting?:

# -*- coding: utf-8 -*-

string_literals = [
    u"äöüß",
    u"âêôû",
    u"ΛΩΨ ЩЖЊ ₡₦₴ ∑∏∆"
]

for literal in string_literals:
    print literal, len(literal)

Well, now I am flabbergasted. I tried the encoding comment before, and it did not work at all. A coding of latin-1 still doesn't, btw. Apparently I must have made a typo then, because all of a sudden it works and gives me the correct strings. (I still have questions about why the non-unicode strings worked before... guess there are some implementation details I don't see...)

That seems to solve the issues of literals for now. The issue of names returned from the API remains - these need to be decoded before use, since str isn't working as expected.

As for what I mean to do - actually nothing. I started writing a course on Python-in-Cinema4D over on Patreon (https://www.patreon.com/cairyn if you bother to look), and as I came to the string chapter, I wanted to check out all the unicode possibilities, as my readers appreciate a thorough overview. Normally I don't write object names in Cyrillic.

So I started out with string literals, string literals in German, string literals in Python unicode, and how all of these are represented in the .py file and in memory. That was when I noticed the weird str behavior, and the same with names I get from the API.

Even with the literal issue solved, I do wonder why I cannot find anything on the issue on the web. If there is a Russian or Greek programmer who found out that his characters aren't resolving without first decoding the str to unicode, I'm sure they would post something somewhere? Perhaps on Russian or Greek forums I am not privy to ... sigh

ferdinand

Hi,

well, that API object names thing is a flaw of Python 2. So working as expected or not as expected is a bit of a question of point of view. If you got the string passed from any other source, the problem would be the same.

On a more productive note: I think that focusing on Unicode strings isn't really that important for Python stuff in c4d, since object names should be something you largely ignore, as they are an unreliable source of identification and are only rarely important in other contexts.

PS: I have already seen your python patreon thingy on c4dcafe
PPS: If you google "python unicode len()" you will find a lot of confused python programmers on StackOverflow

Cheers
zipit

Cairyn

@zipit said in How to handle C4D Unicode in Python scripting?:

well, that API object names thing is a flaw of Python 2. So working as expected or not as expected is a bit of a question of point of view. If you got the string passed from any other source, the problem would be the same.

Right. The main thing is to understand the issue, and then to write the chapter in a way that explains what to watch out for. (I do wonder how third-party modules would do with a name string passed to them from a script that reads them from the API... well, another bridge to cross another day.)

Python 3 clearly is superior in that respect, as there is no unicode class and all str objects are unicode (what they appear to be already in C4D, but with matching len, index, and slice capabilities).

On a more productive note: I think that focusing on Unicode strings isn't really that important for Python stuff in c4d, since object names should be something you largely ignore, as they are an unreliable source of identification and are only rarely important in other contexts.

Hmm, I am not sure whether I would agree to that. Good naming is essential to find your way through complex scenes, and a good naming schema can be built in a way that is friendly to string search and comparison criteria, esp. if you can build your own scripts to perform the search and selection. I just point at the _L _R naming schema for joints that is common in C4D's docs.
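As an illustration of a search-friendly naming schema, the sketch below filters names by the _L suffix and derives the mirrored name. It works on a plain list of already-decoded names; the helper `mirror_name` and the sample names are hypothetical, no C4D API is involved, and Python 3 syntax is assumed.

```python
def mirror_name(name):
    """Swap a trailing _L/_R suffix; return None without one (hypothetical helper)."""
    if name.endswith("_L"):
        return name[:-2] + "_R"
    if name.endswith("_R"):
        return name[:-2] + "_L"
    return None

# Sample names, assumed to be decoded text already (one contains an umlaut).
names = ["Arm_L", "Arm_R", "Spine", "B\u00e4r_L"]

left = [n for n in names if n.endswith("_L")]
print(len(left))             # 2
print(mirror_name("Arm_L"))  # Arm_R
print(mirror_name("Spine"))  # None
```

Because the suffix test runs on decoded text, trimming the last two characters cannot cut a multi-byte character apart.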

Of course, if your objects are all named Cube, Cube.1, Cube.2, Cube.3, then name-based identification may be unhelpful.

Anyway, I am not the person to judge that, as I am only teaching Python to interested users. What they do with it is their own decision; I just have to point out the crucial points so they can apply the code to their own concepts.