I want to represent data in Python the way a spreadsheet would. Thinking “well, someone’s certainly written such a module!” I went to PyPI, where I found Tabular, which wraps NumPy’s recarrays with powerful data manipulation functions. Great! Sadly, it doesn’t seem to act like a spreadsheet at all when it comes to strings.
>>> import tabular as tb
>>> t = tb.tabarray(records=[('bork', 1, 3.5), ('stork', 2, -4.0)], names=['a','b','c'])
>>> t
tabarray([('bork', 1, 3.5), ('stork', 2, -4.0)],
dtype=[('a', '|S5'), ('b', '<i8'), ('c', '<f8')])
>>> t['a'][0] = 'gorkalork, but not mork'
>>> t
tabarray([('gorka', 1, 3.5), ('stork', 2, -4.0)],
dtype=[('a', '|S5'), ('b', '<i8'), ('c', '<f8')])
Um…tabarray! You truncated my string there! Really?! The NumPy dtype ‘|S5’ means a string of 5 or fewer characters, but come on! Update the dtype. Reformat the entire column, if need be. Whatever. But don’t silently throw away my data!
I tried several other approaches, none of which do the trick. E.g., it intuits the data type/size on tabarray creation, but not when adding records:
>>> t.addrecords(('mushapushalussh', 3, 4.44))
tabarray([('gorka', 1, 3.5), ('stork', 2, -4.0), ('musha', 3, 4.44)],
dtype=[('a', '|S5'), ('b', '<i8'), ('c', '<f8')])
I tried slicing out the entire column, changing its type, setting the value, and reassigning it:
>>> firstcol = t['a']
>>> firstcol_long = firstcol.astype('|S15')
>>> firstcol_long
tabarray(['gorka', 'stork'],
dtype='|S15')
>>> firstcol_long[0] = 'morkapork'
>>> firstcol_long
tabarray(['morkapork', 'stork'],
dtype='|S15')
>>> t['a'] = firstcol_long
>>> t
tabarray([('morka', 1, 3.5), ('stork', 2, -4.0)],
dtype=[('a', '|S5'), ('b', '<i8'), ('c', '<f8')])
>>>
It does the value assignment correctly, but the original datatype is still in force, and my previously-correct data is again silently truncated. I even tried an explicit data type setting:
>>> t = tb.tabarray(records=[('bork', 1, 3.5), ('stork', 2, -4.0)], dtype=[('a', str),('b', int),('c', float)])
>>> t
tabarray([('', 1, 3.5), ('', 2, -4.0)],
dtype=[('a', '|S0'), ('b', '<i8'), ('c', '<f8')])
Good Lord! That’s worse! It correctly mapped int and float types, but it guessed that str meant I wanted 0-length strings, and truncated all of the data to nothing. Long story short, not only does tabular not act like a spreadsheet out of the box, I can’t find a way to make it work. Performance is not a huge issue for me. My spreadsheets might have hundreds or thousands of rows, max, and I’d gladly have the system do a bit of data copying to make my code easy. Tabular seems in many other respects to fit the bill very nicely.
I guess I could subclass tabular with something that defaults all strings to something improbably large (1024 or 4096 bytes, say), with a __setitem__ method that raises an exception should a larger string be assigned. Rather sloppy…but are there better alternatives? I rooted around numpy.recarray and such, a bit, and didn’t see a clear way…but I’ll be the first to admit that I’m completely inexpert at NumPy. The reality is that data manipulation programs may increase the length of strings beyond their initial max. Surely high-function modules should accommodate that. The “just truncate it!” approach common in record-oriented databases of 1974 cannot be the right state-of-the-art for Python in 2011!
Thoughts and suggestions?
As one of the designers of tabular … I have to say that I largely think the first answerer sort of hits the nail on the head.
OP, the “truncation” behavior that you deplore is a fundamental issue with NumPy, on which Tabular is based. But it’s not really accurate to say that it’s a “bug” that should be fixed; it’s more a “limitation” that echoes / reinforces the whole point of NumPy (and Tabular) to begin with.
As the first answerer noted, NumPy has an absolute requirement for data structures to be uniform in their size. Once you allocate a numpy array of a given datatype, the array must remain that datatype — or otherwise, a new array with new memory must be initialized. With string datatypes, the length of the string is an integral fixed part of the datatype — you can’t just “convert” an array of length-N strings to an array of length-M strings.
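The behavior described above can be reproduced in plain NumPy, without Tabular. This is a minimal sketch, assuming a current NumPy under Python 3, where string literals become fixed-width unicode (`U`) columns instead of the byte-string `|S5` shown in the question. The variable names are illustrative only.

```python
import numpy as np

# A fixed-width unicode array: the width (5) is inferred from the longest item.
names = np.array(["bork", "stork"])
print(names.dtype)            # a 5-character unicode dtype

# Assigning a longer string silently keeps only the first 5 characters.
names[0] = "gorkalork"
print(names[0])

# astype() does not widen the existing array; it allocates and returns a new one.
wider = names.astype("U15")
wider[0] = "gorkalork"
print(wider[0])
print(np.shares_memory(names, wider))
```

The original `names` array keeps its 5-character width throughout; only the newly allocated `wider` array can hold the longer value.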
Fixed datatypes are critical for the way NumPy achieves huge performance gains over standard Python objects. This is because, with fixed datatypes, NumPy objects know how many bytes have been allocated to each object, and can just “jump” in memory space out to where a given entry “should” be, without having to read and process the contents of all the intervening entries, and without following a pointer to a separately allocated object as a Python list must. Of course, this limits the kinds of objects that can naturally BE numpy arrays … or really, it limits the kinds of operations that can be done to a numpy array. Unlike a Python list, which is completely mutable (e.g. you can replace any element with any other Python object, without disturbing the memory allocation of all the other objects in the list), you can’t mutate a numpy array’s value to an object of a different datatype, because how would byte accounting work then? If suddenly the Nth item gets larger than all the other items in the array, what happens to the data/locations of all the remaining items?
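The byte accounting can be made concrete with a structured array. This is only a sketch: the layout below mirrors the question's three columns but uses `U5` (4 bytes per character) in place of `|S5`, and the names `layout` and `table` are made up for illustration.

```python
import numpy as np

# Hypothetical record layout mirroring the question's table:
# a 5-character string, a 64-bit integer and a 64-bit float.
layout = np.dtype([("a", "U5"), ("b", "i8"), ("c", "f8")])
table = np.array([("bork", 1, 3.5), ("stork", 2, -4.0)], dtype=layout)

# Every record occupies the same number of bytes: 5*4 + 8 + 8.
record_size = table.dtype.itemsize

# Record i therefore starts at byte i * record_size; no scanning is needed.
offset_of_second = 1 * table.strides[0]

print(record_size, offset_of_second)
```

If one record's string were allowed to grow, every later record's offset would change, which is why the width is fixed when the array is created.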
You may not like NumPy’s default behavior for what happens when you TRY to make an “illegal” assignment that breaks the datatype — perhaps you want an error to be issued instead of silent truncation? If so, you should post on the NumPy list about this, since I think it’s more fundamental an issue than Tabular can handle — and regardless of our personal feelings about error handling, we’d want to be consistent with whatever NumPy does here.
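For readers who do want an error rather than truncation, the check can be done in user code before assigning. The helpers `max_chars` and `set_checked` below are hypothetical, not part of NumPy or Tabular, and the sketch assumes plain fixed-width `S` or `U` columns.

```python
import numpy as np

def max_chars(dtype):
    """Hypothetical helper: character capacity of a fixed-width 'S' or 'U' dtype."""
    dtype = np.dtype(dtype)
    if dtype.kind not in ("S", "U"):
        raise TypeError("not a fixed-width string dtype: %r" % (dtype,))
    # Bytes per character: 1 for 'S', 4 for 'U'.
    per_char = np.dtype(dtype.kind + "1").itemsize
    return dtype.itemsize // per_char

def set_checked(column, index, value):
    """Hypothetical helper: assign value, raising instead of truncating."""
    limit = max_chars(column.dtype)
    if len(value) > limit:
        raise ValueError(
            "%r has %d characters; column holds at most %d"
            % (value, len(value), limit)
        )
    column[index] = value

names = np.array(["bork", "stork"])
set_checked(names, 0, "mork")            # fits, so it is stored as usual
try:
    set_checked(names, 0, "gorkalork")   # too long for a 5-character column
except ValueError as exc:
    print("rejected:", exc)
```

This leaves NumPy's own assignment semantics untouched; the check applies only to writes that go through the helper.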
You may also not like how Tabular does datatype inference. In fact, NumPy stays away from dtype inferences and basically always requires the user to explicitly specify datatypes. This is good in the sense that it demands the user think about these issues, but it’s annoying in that it is quite cumbersome at times. Tabular tries to hit the happy medium that is useful most of the time, but sometimes this will fail, in which case the defaults can be overridden by just specifying the same keyword arguments as NumPy constructors.
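As an illustration of overriding the inferred width, here is a sketch in plain NumPy that gives the string column an explicit size up front. The width of 32 is an arbitrary assumption, and whether `tabarray` accepts this exact dtype list unchanged has not been verified here.

```python
import numpy as np

# Hypothetical width of 32 characters, chosen only for illustration.
layout = [("a", "U32"), ("b", "i8"), ("c", "f8")]
table = np.array([("bork", 1, 3.5), ("stork", 2, -4.0)], dtype=layout)

# 23 characters fit in a 32-character column, so nothing is cut off.
table["a"][0] = "gorkalork, but not mork"
print(table["a"][0])
```

Note the explicit width: passing a bare `str`, as in the question, leaves the length unspecified.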
I do think that you’re not quite right when you say that the “approach common in record-oriented databases of 1974 cannot be the right state-of-the-art for Python in 2011”. In fact, the foundations of NumPy memory management are indeed the exact same tools as used in the 1970’s — it may be surprising, but big pieces of optimized NumPy are still built on Fortran! The memory allocation issues of those days are not really avoidable even today, though NumPy does provide a much cleaner and simpler interface most of the time. But it must be said that if you would “gladly have the system do a bit of data copying to make my code easy” — then probably NumPy and Tabular are not for you, since silent data copying, and everything it represents, is explicitly counter to the design intent of these packages.
So the question becomes: what is your objective? If you really need performance with array-like operations, then use NumPy (in which case Tabular provides spreadsheet-like operations), but live within NumPy’s limitations. If you don’t need performance, there’s no point in having array-like objects to begin with, and you can be more flexible. However, Tabular’s spreadsheet-like operations don’t extend to general Python objects, and it’s not even exactly clear how to make that extension.
And, let me add one more (quite important) thing. OP, if performance is not your main issue, but you still want to use Tabular as a source of spreadsheet operations, you could just do all the operations that might change datatypes with new calls to the Tabular array constructor. That is, if in a given operation you might need to make an assignment to a new, larger string datatype, just construct a new tabarray every time. This is obviously not as good for performance, but if that’s not your limitation, then it should be no problem.
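The rebuild-on-demand idea can be sketched with plain NumPy structured arrays. The function `widen_field` is a hypothetical helper, not a Tabular or NumPy API; with Tabular one would presumably pass the result (or the widened records) to the `tabarray` constructor instead.

```python
import numpy as np

def widen_field(table, name, width):
    """Hypothetical helper: copy `table` into a new array whose string
    field `name` can hold `width` characters."""
    new_descr = []
    for field in table.dtype.names:
        field_dtype = table.dtype[field]
        if field == name:
            # Keep the string kind ('S' or 'U') and change only the width.
            field_dtype = np.dtype("%s%d" % (field_dtype.kind, width))
        new_descr.append((field, field_dtype))
    wider = np.empty(table.shape, dtype=new_descr)   # fresh memory
    for field in table.dtype.names:
        wider[field] = table[field]                  # copy column by column
    return wider

table = np.array([("bork", 1, 3.5), ("stork", 2, -4.0)],
                 dtype=[("a", "U5"), ("b", "i8"), ("c", "f8")])
table = widen_field(table, "a", 32)
table["a"][0] = "gorkalork, but not mork"
print(table)
```

The copy is explicit and costs time proportional to the table size, which is acceptable for the hundreds or thousands of rows mentioned in the question.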
The key point here is that Tabular and NumPy set certain standards for what counts as “fast” or “slow”, and then force you to be explicit about operations that are going to be slow. They never allow you to hide very slow operations under the hood (the way Matlab, for example, does). Something that’s easy syntactically should be fast, and if you want to do something that’s going to be slow, you should have to work a bit harder in your code to do it and therefore pay attention to what is going on. As a result, your code ends up being cleaner and better, but still easier to write than if you had been working directly in C or Fortran. In fact, this principle largely applies to all of Python itself as well, though with somewhat different standards for what counts as “fast” or “slow”.
HTH,
D