I have a column of data that contains strings, and I want to create a new column that takes only the first two characters from the corresponding data string.
It seems logical to use the apply function for this, but it doesn’t work as expected. It does not even seem to be consistent with other uses of apply. See below.
In [205]: dfrm_test = pandas.DataFrame({"A":np.repeat("the", 10)})
In [206]: dfrm_test
Out[206]:
A
0 the
1 the
2 the
3 the
4 the
5 the
6 the
7 the
8 the
9 the
In [207]: dfrm_test["A"].apply(lambda x: x+" cat")
Out[207]:
0 the cat
1 the cat
2 the cat
3 the cat
4 the cat
5 the cat
6 the cat
7 the cat
8 the cat
9 the cat
Name: A
In [208]: dfrm_test["A"].apply(lambda x: x[0:2])
Out[208]:
0 the
1 the
Name: A
Based on this, it appears that apply does nothing but perform the NumPy equivalent of whatever is called inside. That is, apply seems to execute the same thing as arr + " cat" in the first example. And if NumPy happens to broadcast that, then it will work. If not, then it won’t.
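To make that concrete, here is a minimal sketch that leaves pandas out and calls the two lambdas directly on a NumPy array. It assumes an object-dtype array is a fair stand-in for the Series values; the names arr, add_cat and first_two are only for illustration.

```python
import numpy as np

# Object-dtype array standing in for the values of dfrm_test["A"] (an assumption).
arr = np.array(["the"] * 10, dtype=object)

add_cat = lambda x: x + " cat"
first_two = lambda x: x[0:2]

# Called on the whole array, "+" is applied element by element,
# so the result happens to look like a per-element apply.
whole_add = add_cat(arr)

# Called on the whole array, the slice picks the first two *elements*,
# not the first two characters of each string.
whole_slice = first_two(arr)

# What a true per-element apply would produce.
per_element = [first_two(x) for x in arr]

print(whole_add)
print(whole_slice)
print(per_element)
```

The second result has length 2, which matches the truncated output in Out[208] above.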
But this seems to break from what apply promises in the docs. Below is the docs' description of what pandas.Series.apply accepts:
Invoke function on values of Series. Can be ufunc or Python function expecting only single values (link)
It says explicitly that it can accept Python functions expecting only single values. And the function that’s not working (lambda x: x[0:2]) definitely satisfies that. It doesn’t say that the single argument must be an array. And given that things like numpy.sqrt are commonly used for single inputs (so not exclusively arrays), it seems natural to expect Pandas to work with any such function.
Is there some way of using apply that I am missing here?
Note: I did write my own extra function below:
def ix2(arr):
    return np.asarray([x[0:2] for x in arr])
and I verified that this version does work with Pandas apply. But this is beside the point. It would be easier to write something that operated externally on top of a Series object than to have to constantly write wrappers that use list comprehensions to effectively loop over the contents of the Series. Isn’t this specifically what apply is supposed to abstract away from the user?
I am using Pandas version 0.7.3, and it is on a workplace shared network, so there’s no way to upgrade to the recent release.
Added:
I was able to confirm that this behavior changes from version 0.7.3 to version 0.8.1. In 0.8.1 it works as expected with no NumPy ufunc wrapper.
My guess is that in the code, someone was trying to use numpy.vectorize or numpy.frompyfunc within a try-except statement. Perhaps it did not work correctly with the particular lambda function I am using, and so in the except part of the code, it defaulted to just relying on generic NumPy broadcasting.
It would be great to get some confirmation on this from a Pandas developer, if possible. But in the meantime, the ufunc workaround should suffice.
One workaround I can think of would be to convert the Python function to a numpy.ufunc with numpy.frompyfunc and then use that in apply.
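A minimal sketch of that workaround, assuming the same dfrm_test frame as in the question. The name first_two and the new column "B" are made up for illustration, and the frame is built with dtype=object here so the strings stay plain Python objects regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Wrap the single-value function as a ufunc: 1 input, 1 output.
first_two = np.frompyfunc(lambda x: x[0:2], 1, 1)

# Same data as in the question; dtype=object is an assumption made for this sketch.
dfrm_test = pd.DataFrame({"A": np.repeat("the", 10)}, dtype=object)

# The ufunc is evaluated element by element, so each string is sliced on its own.
dfrm_test["B"] = dfrm_test["A"].apply(first_two)

print(dfrm_test)
```

Because frompyfunc returns a real ufunc, it still gives per-element results even if apply hands it the whole array at once.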