I’m working on automating some reports via pandas and the Google Analytics API. When I request several dimensions to split the data by, the resulting recordset is well above the default 10k max_results limit imposed by pandas.
To get around this, I’m passing in a large number for the max_results parameter and specifying a chunksize. My intention is to then iterate over the resulting generator to create one large DataFrame which I can do all of my operations on.
from pandas.io import ga
import pandas as pd
max_results = 1000000
chunks = ga.read_ga(metrics=["visits"],
                    dimensions=["date", "browser", "browserVersion",
                                "operatingSystem", "operatingSystemVersion",
                                "isMobile", "mobileDeviceInfo"],
                    start_date="2012-12-01",
                    end_date="2012-12-31",
                    max_results=max_results,
                    chunksize=5000)
stats = pd.concat([chunk for chunk in chunks])
stats.groupby(level="date").sum()
However, it’s clear that some records aren’t being pulled, as the overall daily sum of visits does not match Google Analytics.
I do not run into this issue when selecting only a couple of dimensions. For instance …
test = ga.read_ga(metrics=["visits"], dimensions=["date"],
                  start_date="2012-12-01", end_date="2012-12-31")
test.groupby(level="date").sum()
… produces the same numbers as Google Analytics.
Thanks in advance for the help.
The 10,000-row limit is imposed by the Google Analytics API, which returns at most 10,000 rows per request (https://developers.google.com/analytics/devguides/reporting/core/v3/reference#maxResults).
The pandas code uses start_index to make multiple requests and work around the limit, so the missing rows look like a bug. I marked this as a bug in pandas: https://github.com/pydata/pandas/issues/2805
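To illustrate the paging idea only (this is not the pandas implementation), here is a minimal sketch of how a large result set could be split into requests of at most 10,000 rows each. The helper name page_requests is hypothetical, and it assumes the total row count is already known; the API's start-index is 1-based.

```python
def page_requests(total_rows, page_size=10000):
    """Yield (start_index, max_results) pairs that together cover total_rows rows.

    Hypothetical helper for illustration; page_size defaults to the
    API's per-request cap of 10,000 rows.
    """
    start = 1  # the API's start-index counts from 1, not 0
    while start <= total_rows:
        # The last page may be shorter than page_size.
        yield start, min(page_size, total_rows - start + 1)
        start += page_size


pages = list(page_requests(25000))
```

If any one of these requests is skipped or overlaps another, the concatenated result will have missing or duplicated rows, which would show up as daily sums that don't match.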
I’ll take a look whenever I get a chance. If you could show some expected data vs what you get via pandas that’d be helpful.
As a workaround, I would suggest iterating over each day and making a daily request.
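A rough sketch of that workaround, assuming each single day stays under the 10,000-row cap (if a day exceeds it, rows would still be dropped). The helpers daily_ranges and read_daily are hypothetical names; fetch stands for whatever function makes the request, for example ga.read_ga from the question.

```python
from datetime import date, timedelta


def daily_ranges(start, end):
    """Yield one (start_date, end_date) pair of ISO strings per day, inclusive."""
    day = start
    while day <= end:
        iso = day.isoformat()
        yield iso, iso  # a one-day request starts and ends on the same date
        day += timedelta(days=1)


def read_daily(fetch, start, end, **query):
    """Call fetch once per day and return the per-day results as a list.

    fetch is assumed to accept start_date and end_date keyword arguments
    plus any other query parameters, as ga.read_ga does in the question.
    """
    return [fetch(start_date=s, end_date=e, **query)
            for s, e in daily_ranges(start, end)]
```

With the question's setup this could be used as pd.concat(read_daily(ga.read_ga, date(2012, 12, 1), date(2012, 12, 31), metrics=["visits"], dimensions=[...], max_results=10000)), reusing the dimensions list from the original call and leaving out chunksize so that each call returns a DataFrame.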