I am not a statistics guy but have to deal with quite some data. In most cases these data sets come from an online survey; hence I do have a MySQL database and know how to get some results out of that.
However, now I got a Stata file and I am required to do some analysis. In MySQL I’d know how to do that, but I am stuck in Stata and ask for your help.
I have a not too small table (roughly 50k rows) containing following columns (there are more cols but these are the ones I have to work with):
id – Object ID, unique values
name – Name of object, string value
class – Class of object, integer range 1 – 6
origin – Origin of object, integer range 1 – 2
Within the 50k rows there are only about 7k different names. In Stata I can retrieve all names with list name and could even restrict it to a single class with list name if class == 2.
Now I want a list of all different names along with a count of objects having that name and have the list sorted by count.
In MySQL I’d query SELECT name, COUNT(*) AS cnt FROM objects GROUP BY name ORDER BY cnt DESC. But how would that be done in Stata?
Next steps would be to get such lists for each class or for both origins, i.e. SELECT name, COUNT(*) AS cnt FROM objects WHERE class = 2 GROUP BY name ORDER BY cnt DESC, is that possible with Stata, too?
ps: I don’t know if stackoverflow is the right place as Stata is not really a programming language, is it? But I found some Stata-related questions here; that’s why I posted it here. If there’s a better place to do so, please point me to the right place.
Keep in mind that Stata only works with rectangular tables of fixed length, so you can only add columns that span the whole 50k rows. Within this setup, this is what you can do.
For the first problem (the list of names and frequencies), you can
Note that
collapsewill delete the existing data and replace with the summary. (Usually, I hate this command for this aspect of data management, but it should work here.) If you don’t want this to happen, here’s a more sophisticated trick:(Explanation:
_Nis the number of observations inby-group, and_nis the number of the current observation within theby-group.)You can then subset this to the class of interest with
if class==#, as you already know.