The Popularity of Data Analysis Software

Written by Keyboard Item Team on Friday, October 21, 2011 | 19:33


Abstract: This page presents various ways of measuring the popularity or market share of BMDP, JMP, Minitab, R, R-PLUS, Revolution R, S-PLUS, SAS, SPSS, Stata, Statistica, and Systat, as well as two implementations of the SAS Language, Carolina and WPS. I update this paper several times a year at http://r4stats.com to provide an ongoing view of the software. Recent updates include adding the number of books on each package, adding the KDnuggets poll on languages and adding 2011 data to Figure 1 (10/13/2011), deleting the discussion of Google Insights (10/10/2011) due to its variability, adding a brief discussion of Quora.com (9/13/2011), Stack Exchange and Stack Overflow.com (7/12/11), updating the blog counts in Table 3 (6/21/11), and replacing Fig. 5 with the latest one (5/26/11).

Introduction
When choosing an analytical tool to use, there are many factors to consider. Does it run natively on your computer? Does the software provide all the methods you use? If not, how extensible is it? Does that extensibility use its own language, or an external one (e.g. Python, R, SQL) that is commonly accessible from many packages? Does it fully support the style (programming vs. point-and-click) that you like? Are its visualization options (e.g. static vs. interactive) adequate for your problems? Does it provide output in the form you prefer (e.g. cut & paste vs. LaTeX integration)? Does it handle large enough data sets? Do your colleagues use it so you can easily share data and programs? Can you afford it?
It can also be helpful to know the size of the software’s market share and whether it is growing or shrinking. Software that is popular and growing probably meets the needs of many people well; however, that certainly doesn't mean it will meet yours. That said, let's examine various ways to estimate popularity and/or market share.

Sales & Downloads
Sales figures reported by some commercial vendors include products that have little to do with analysis, and not all vendors release sales figures. Open source software such as R (Ihaka and Gentleman 1996) could count downloads, but one person can download many copies, inflating the total, and many people can install from a single download, deflating it. Download counts for the R-based Bioconductor project are located at http://www.bioconductor.org/packages/stats/. Similar figures for downloads of Stata add-ons (not Stata itself) are available at http://fmwww.bc.edu/fmrc/reports/report.ssc.html. A list of Stata repositories is available at http://stata.com/links/resources2.html. The many sources of downloads, both in repositories and on individuals' web sites, make counting downloads a very difficult task.

Language Popularity Measures
The TIOBE Programming Community Index ranks the popularity of programming languages, but from a programming language perspective rather than as analytical software (http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html). In March 2011, it ranked R in 24th place and SAS in 26th. No other data analysis languages covered by this article even made the top 100.
Langpop.com also ranks programming languages (http://langpop.com/) in a variety of interesting ways, but unfortunately their focus excludes statistical software. I have tried to emulate their displays of the number of books on each software, but with R this is not possible. The letter "R" appears in many titles, author names and even keywords that have nothing to do with the R software. In addition, many older versions of manuals are still carried at stores like Amazon.com. It is also difficult to count books available for commercial packages since the same manual may have been printed in every version for over 30 years.

Internet Discussion
There are some stable and objective measures regarding analytic software. Schwartz (2009) suggested estimating relative popularity by plotting the amount of email discussion devoted to each. The most widely used packages all have discussion lists, or "listservs", devoted to them. The less popular ones either do not have such discussions or, like the list for Minitab, may have only a dozen or so emails per year. Some software packages have multiple discussion lists. For example, there are 21 devoted to using R for various focused areas such as graphics, mapping, ecology, epidemiology, etc. (http://www.r-project.org/mail.html). A broader list, including a version of R-Help in Spanish, lists 49 discussions (https://stat.ethz.ch/mailman/listinfo).
There are other discussion forums besides listservs. The SAS newsgroup available at http://groups.google.com/group/comp.soft-sys.sas/topics is still quite active, often with over 1,000 entries per month. SAS Institute also offers 17 support forums on topics such as general procedures, statistical procedures and graphics. However, when this was written, SAS-L seemed to be the most popular single source of discussion about SAS. SPSS also has a newsgroup discussion at http://groups.google.com/group/comp.soft-sys.stat.spss/topics, but it only has around 150 entries per month. IBM/SPSS also offers corporate forums. Ideally we could combine the data from all discussion lists and forums. Unfortunately that would be too time consuming. Therefore, Figure 1 shows the level of activity on only the main discussion listserv in a typical month (i.e. corporate forums, news groups and Google groups are excluded). Each point represents the mean of the 12 monthly counts that occurred in that year. This plot contains data through the end of June, 2011.
Figure 1. Plot of listserv discussion traffic by year (through 7/31/2011).
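The yearly points plotted in Figure 1 can be computed along these lines. This is a minimal sketch, assuming the monthly message counts have already been collected as (year, month, count) records. The function name is hypothetical and the numbers are made-up placeholders, not the actual listserv data.

```python
from collections import defaultdict
from statistics import mean


def yearly_mean_traffic(monthly_counts):
    """Average the monthly message counts within each year.

    monthly_counts: iterable of (year, month, count) tuples.
    Returns a dict mapping year -> mean messages per month.
    """
    by_year = defaultdict(list)
    for year, _month, count in monthly_counts:
        by_year[year].append(count)
    return {year: mean(counts) for year, counts in sorted(by_year.items())}


# Placeholder data for illustration only (not real listserv counts):
# a full year of 12 months, then a partial year of 6 months.
sample = [(2009, m, 100) for m in range(1, 13)]
sample += [(2010, m, 100 + 10 * m) for m in range(1, 7)]

print(yearly_mean_traffic(sample))
```

Note that a partial year is averaged over only the months present, so its point stays comparable with the full years.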
We can see that R is the most discussed software by almost a two-to-one margin, followed by Stata then SAS. Keep in mind that both R and SAS have substantial amounts of discussion in other areas which, if included, would raise both of their lines substantially.
SAS saw growth in its discussion until 2006 when it leveled off and then declined. That decline could be the result of at least three factors: 1) migration to the SAS corporate forum, 2) the introduction of the Enterprise Guide user interface which may generate fewer questions than programming the SAS language and 3) competition from the increased popularity of R and Stata. Note that in the last few years, R also saw the introduction of easy-to-use user interfaces and just in 2011 started to decline slightly.
Stata has seen substantial growth in the amount of discussion devoted to it, finally surpassing that of SAS in 2010 on this single list.
SPSS has had a relatively low and consistent amount of discussion over the years. SPSS’ traditional user base is in the social sciences where, in my experience, people are less interested in programming and more interested in the product’s easy-to-use graphical user interface. It had that interface for the whole of the period shown. 
R and S-PLUS are both implementations of the S language and so are in the most direct competition. From the view of Internet discussion, S-PLUS is experiencing a significant decline. In some months, a third of the entries in its discussion list are actually announcements about R.
Could the numbers in Figure 1 be the result of a few people doing a lot of talking? When you follow any of these discussion lists, it quickly becomes obvious that a core group of people really keep the lists humming. However, the number of people who subscribe to each list shows a similar pattern, with R-Help dominating the scene (see Table 1). An early version of this table failed to include subscribers to Statalist's digest version, and so undercounted the total by about half.
Discussion List   Subscribers
R-Help                 10,379
Statalist               3,692
SAS-L                   3,253
SPSSX-L                 2,105
Table 1. Number of subscribers for each Internet discussion list on June 20, 2010.
It would be interesting to see what topics were most discussed on each list. The only such analysis of which I am aware was done by Arthur Tabachneck (2010) for the SAS list. The most popular topic in 2009 turned out to be...R! You can read his full analysis here under slides from the 2010 session.
Another way people help one another is through a pair of related sites. The site Cross Validated (http://stats.stackexchange.com/) is for statistical topics while Stack Overflow (http://stackoverflow.com) is for programming in general. At both sites users tag their topics, making it particularly easy to focus searches. Quora.com is a site that provides similar programming advice. R dominates by a wide margin (see Table 2) for both topics. I did search for the other software but found no topics discussed for them.
Software     Cross Validated   Stack Overflow   Quora.com
             Discussions       Discussions
R                        818            5,481       6,557
SAS                       35              339         367
SPSS                      79               53          64
Stata                     32               30          42
All others                 0                0           0
Table 2. Cumulative number of topics for each software at two support web sites on July 12, 2011. Quora.com data was added 9/13/2011.

Blogs
On Internet web logs known as blogs, people write about software that interests them, showing how to solve problems and interpreting events in the field. The more popular a software package is, the more bloggers there are writing about it. Blog consolidators like Tal Galili's R-Bloggers.com and SAS-X.com, and sasCommunity.org Planet combine various blogs into a single location. While any particular blogger may write only an article every week or so, by combining them, the consolidators essentially provide a daily newspaper on various packages. So far only R and SAS are popular enough to have consolidated versions of their blogs (see Table 3).
Software   Number of Blogs
R                      209
SAS                     34
Stata                    7
Others                 0-3
Table 3. Number of blogs devoted to each software package on June 21, 2011.

R's 209 blogs put it way out in front of the pack, with SAS coming in second place with 34. Stata has 7, which are listed here. Each of the other packages has either none or just a few.

Web Site Popularity
Another measure of software popularity is the number of other web pages that contain links that point to the software’s main web site. Figure 2 provides those numbers, recorded using Google on March 19, 2011.
Figure 2. The number of web site links that point to the main web site of each software package on March 19, 2011.
As in Figure 1, we see R dominating the plot by over a two-to-one margin. The other software follows in the order that I suspect is reflective of their respective market shares. It’s interesting to note that the Stata web site has fewer than half as many incoming links as the SAS web site, even though Stata appeared to dominate SAS in Figure 1. As mentioned, the many other sources of SAS discussion are not reflected in Figure 1.
Revolution R and R-PLUS are both commercial versions of R that are relatively new to the market. WPS is an implementation of the SAS Language and Carolina is a SAS-to-Java compiler.
The number of incoming links is an important part of Google’s famous PageRank algorithm (http://en.wikipedia.org/wiki/PageRank). PageRank is made more useful for searching by (among other things) weighting the importance of each link. Links from major sites like Wikipedia would carry far more weight than would a link from a professor’s course syllabus. The practical range of PageRank is from 1 to 10. Figure 3 plots this data. Here R shows up only slightly higher than the major commercial packages.
Figure 3. The Google PageRank figures of each web site on June 19, 2010.
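To illustrate why the weighting of links matters, here is a minimal sketch of the textbook PageRank iteration on a tiny made-up link graph. It is a simplified illustration, not Google's production algorithm: the page names, the damping factor of 0.85 and the iteration count are all assumptions, and the raw scores are probabilities rather than the 1-to-10 scale plotted in Figure 3.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Every page must appear as a key. Pages with no outgoing links
    spread their rank evenly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: two packages with exactly one incoming link each,
# one from a well-linked "encyclopedia", one from an unlinked "syllabus".
toy_links = {
    "encyclopedia": ["package_a"],
    "syllabus": ["package_b"],
    "blog_1": ["encyclopedia"],
    "blog_2": ["encyclopedia"],
    "package_a": [],
    "package_b": [],
}

ranks = pagerank(toy_links)
print(ranks["package_a"] > ranks["package_b"])
```

Both packages have a single incoming link, yet the one linked from the better-connected page ends up with the higher score, which is the effect that a simple count of links (as in Figure 2) cannot capture.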

Surveys of Use
One way to estimate the relative popularity of data analysis software is through a survey. Rexer Analytics does a survey each year asking about tools used for data mining. The difference between software for classical data analysis and software for data mining seems like more of a marketing concept than one based on any actual difference in analytic need. Figure 4 shows the results of just one "check all that apply" type question about the tools that respondents reported using in 2009 (the survey was taken in 2010).
Figure 4. Data mining/analytic tools reported in use on Rexer Analytics survey during 2009.
We see that R comes out on top, followed by SAS and SPSS. The entire report contained over 40 questions on topics such as algorithms used, fields, challenges, data, impact of the economy on the field, and more. More comprehensive results are available here. It's interesting to note that SPSS and SAS are used more often than their more expensive products aimed specifically at data mining, IBM SPSS Modeler (formerly Clementine) and SAS Enterprise Miner.
The results of a similar survey done by the data mining web site KDnuggets in 2011 are shown in Figure 5. This one shows RapidMiner in first place, followed by R and Excel. It's interesting to see that all of those packages showed a decline in use since the 2010 survey, while SAS, SAS Enterprise Miner and IBM SPSS Modeler all showed slight increases. Salford and Revolution Analytics (shown under its previous name Revolution Computing) showed substantial increases, while JMP, Mathematica, Tableau and 11 Ants Analytics appeared in the poll for the first time. You can see the full results and read about the survey's details here.

Figure 5. Results of the 2011 KDnuggets poll on data mining software.


The KDnuggets site conducted a similar poll, this time asking, "What programming languages you used for data mining / data analysis in the past 12 months?" R dominated this poll, as shown in Figure 6.

Figure 6. Languages used in data mining or analysis.

Books
The number of books published on each software reflects their relative popularity. Amazon.com offers an advanced search method which works well for all the software except R. I configured it with the following parameters:
Title: SAS -excerpt -chapter -changes   [using SAS as an example]
Subject: Computers & Internet
Condition: New
Format: All formats
Publication Date: After September, 2001  [i.e. 10 years before the search on 10/13/2011]
Since it's difficult to determine how many books use a particular software in its examples, I searched for books that included the software in the title. SAS has many manuals for sale as individual chapters or excerpts. Luckily, they contain "chapter" or "excerpt" in their title so I excluded them using the minus sign, e.g. "-excerpt". SAS also has short "changes and enhancements" booklets that the other packages release only in the form of flyers and/or web pages so I excluded "changes" as well. 
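The effect of those exclusion terms can be sketched as a simple title filter. This is an illustration only: it does not call Amazon's search, and the function name and sample titles are hypothetical.

```python
def matches_title_query(title, required, excluded):
    """Return True if the title contains the required word and none of
    the excluded words (case-insensitive, whole-word comparison)."""
    words = {w.strip(".,:;()[]\"'").lower() for w in title.split()}
    if required.lower() not in words:
        return False
    return not any(term.lower() in words for term in excluded)


# Hypothetical titles for illustration only.
titles = [
    "A Handbook of Analysis Using SAS",
    "SAS Procedures Guide: Chapter 12",
    "SAS Changes and Enhancements",
    "Learning Statistics (Excerpt) with SAS",
    "Airline Management",
]
excluded = ["excerpt", "chapter", "changes"]

kept = [t for t in titles if matches_title_query(t, "SAS", excluded)]
print(kept)
```

Only the first title survives: the next three are dropped by the minus-sign terms and the last never mentions the software at all.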
SAS and SPSS both have many versions of the same book or manual still for sale. For example, Marija Norusis' 3 books on SPSS appear 20 times for various versions of SPSS released in the last 10 years. The SAS and SPSS numbers are both somewhat inflated as a result. Limiting the search to books published in the last 10 years mitigated this problem somewhat, but the SAS and SPSS figures are probably both still somewhat exaggerated.
The count of R books came from http://www.r-project.org/doc/bib/R-books.html. This list does contain seven books on S that are older but still relevant. Version numbers do not appear in any book titles so R avoids the over-counting problem that plagued my count of SAS and SPSS manuals. The most surprising aspect of the result (Figure 7) was how extremely dominant the top few packages are and that three well known packages had no books at all written about them (BMDP, Statistica, Systat). Revolution R and R-PLUS have no books with their names in the titles, but of course the books on R apply to them as well.

Figure 7. The number of books that contain the name of each software package in their titles on October 13, 2011.

Impact on Scholarly Activity
While Internet search engines make it very easy to locate information about software, their inclusive nature makes it difficult to narrow the search enough to determine the prevalence of various packages. For example, searching for the term “SAS” quickly locates the main web site for the SAS Institute, but it also ends up including many hits regarding a shoe company, an airline and the British commando group. Even in the realm of scholarly journal articles, S.A.S. stands for over a dozen terms such as Synthetic Aperture Sonar.
The more popular a software package is, the more likely it will appear in scholarly publications as a topic and as a method of analysis. Google Scholar offers a convenient way to measure such activity. No search of this magnitude is perfect; each will include some irrelevant articles and reject some relevant ones. However, after testing various search terms and their combinations, these seemed to work well:
"R Project"
"S-PLUS"
"SAS Institute"
SPSS
Stata 
Statsoft Statistica
Figure 8 shows the results of the search on these terms from 1995 through 2010.
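To see why a quoted phrase such as "SAS Institute" is more selective than the bare term, consider this small sketch. It does not query Google Scholar; it simply counts matches in a few made-up text snippets, and the function name is hypothetical.

```python
import re


def count_hits(snippets, term):
    """Count the snippets that contain the term as a whole word or
    exact phrase, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return sum(1 for s in snippets if pattern.search(s))


# Made-up snippets for illustration only.
snippets = [
    "Analyses were run in SAS 9.2 (SAS Institute, Cary, NC).",
    "Synthetic aperture sonar (SAS) imaging of the seabed.",
    "SAS flight schedules for the winter season.",
    "Models were fit with the R Project for Statistical Computing.",
]

print(count_hits(snippets, "SAS"))            # 3: two of them are false positives
print(count_hits(snippets, "SAS Institute"))  # 1: only the statistical software
```

The longer phrase trades recall for precision: it would miss an article that mentions SAS without naming the vendor, which is the kind of relevant article a search like this rejects.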

Figure 8. Impact of data analysis software on academic publications as measured by hits on Google Scholar.
Articles using SPSS show an extreme level of dominance for many of the years, peaking in 2005. Articles using SAS were in second place for much of the period, peaking in 2004. The use of Stata showed strong growth from around 2000 until it peaked in 2008. The fact that SPSS, Stata and SAS all declined from 2008 onward may be a result of the Great Recession limiting grant funds. The use of R didn't start picking up until around 2005 but it managed to continue growing during the recession years. The use of Statistica is low and increasing very slowly. The use of S-PLUS is also low. It peaked around 2005 and has declined slightly since.

Growth in Capability
The capability of all the software we are examining has grown significantly over the years. It would be helpful to be able to plot the growth of each software package’s capabilities, but such data is hard to obtain. John Fox (2009) acquired it by year for R’s main distribution site, http://cran.r-project.org/. I collected the later years following his method. Figure 9 displays the data with a smoothed fit. Each point represents the number of packages at CRAN when the major versions of R (e.g. 2.10, 2.11) were released. A package in R is similar to a SAS or SPSS add-on module. Each focuses on a particular topic (e.g. time series) and includes around 20 functions (procedures, commands).
Figure 9. The number of R add-on packages from R's main software repository.
R’s capability is clearly growing at a very rapid rate and is a major factor in the rapid increase in R's popularity. R does have eight other main software repositories, such as the one at http://www.bioconductor.org/, that are not included in this graph. A program run on 3/24/2011 counted 4,338 R packages at all major repositories, 2,849 of which were at CRAN. So about a third of all packages are outside CRAN, and the growth curve for the software at all repositories would be roughly 52% higher on the y-axis than the one shown in Figure 9. As with any analysis software, individuals also maintain their own separate collections, typically available on their web sites.
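The relationship between the two package counts is simple arithmetic, sketched here using only the two figures quoted above. The variable names are illustrative.

```python
total_packages = 4338  # all major repositories, counted 3/24/2011
cran_packages = 2849   # CRAN only

other_packages = total_packages - cran_packages

# How much higher the combined curve sits relative to the CRAN curve.
increase_over_cran = other_packages / cran_packages
# What fraction of all packages is hosted somewhere other than CRAN.
share_outside_cran = other_packages / total_packages

print(f"{other_packages} packages outside CRAN")
print(f"combined total is {increase_over_cran:.0%} higher than CRAN alone")
print(f"{share_outside_cran:.0%} of all packages are outside CRAN")
```

The two percentages answer different questions: the non-CRAN packages make up about a third of the total, but adding them raises the CRAN count by about half.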
If this type of data becomes as easily available for the other software, I will include it in a future edition.

IT Research Firms
IT research firms study software products and corporate strategies and provide their opinions on each in reports they sell to their clients. Two such reports that focus on data mining tools are here:
Both firms rank SAS and SPSS as the top two and also predict greater than 100% annual growth for open source business intelligence software.

Job Market

Employment is important to us all, so what software skills are employers seeking? A thorough answer to this question would require a time-consuming content analysis of job descriptions. However, we can get a rough idea by searching on job advertising sites. Monster.com is the largest job advertising site in the world, so I went there and searched for jobs that listed data analysis software in their requirements, searching for keywords such as "SAS" or "SPSS". The data is presented in Figure 10.

Figure 10. Number of jobs listing each software package in its requirements on June 27, 2010. The maximum they will display is 1,000.
Monster.com will only count jobs up to "greater than 1,000", a level met only by SAS and SPSS. Interestingly, Minitab showed up in third place with 178 jobs. That is its highest showing by far of all the measures discussed in this paper. Stata and JMP followed with 69 and 58 jobs respectively. Of S-PLUS' 22 jobs, 14 listed it with R as an option. In this database, R was so difficult to search for that the 14 jobs here were found only as a subset of the S-PLUS jobs. Phrases that helped for other searches, such as "R Project" or "R graph", were useless here. Even the search for "S-PLUS" yielded more than 50% bad hits, including mismatches like "women s plus sizes". The correct ones were sorted out manually. BMDP, Systat and Statistica all had zero jobs that listed them as requirements. The newest software here, Revolution R and R-PLUS, are variations of R itself. Although no jobs listed them as a requirement, knowing R would cover most of their capabilities.
Job sites are rumored to have fake advertisements that are near-duplicates of real jobs listed by fake employers as a way to get people's personal information. However, I saw no sign of that here, with the great majority of employers being well known. The S-PLUS advertisements required manual selection of valid hits, so I had a chance to look at all 46 of them (i.e. the 22 valid ones and the 24 invalid) and noticed no such duplication. However, if this problem is happening on Monster.com, we can hope that people are choosing the jobs to duplicate at random, maintaining at least their relative positions, if not their actual values. Given this assumption, it looks like a data analyst would do well to know SAS or SPSS unless he or she were training for a field in which one of the other packages is dominant.
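Because the site stops counting at 1,000, the top two values are lower bounds rather than exact counts. The following sketch, which uses only the job counts reported above, shows one way to keep that distinction visible when tabulating the results. Storing capped results at the cap value is my own convention for this illustration, and the function name is hypothetical.

```python
CAP = 1000  # larger results are reported only as "greater than 1,000"

# Counts as reported above; capped results are stored at the cap value.
# The 14 R jobs are a subset of the 22 S-PLUS jobs.
job_counts = {
    "SAS": CAP, "SPSS": CAP, "Minitab": 178, "Stata": 69, "JMP": 58,
    "S-PLUS": 22, "R": 14, "BMDP": 0, "Systat": 0, "Statistica": 0,
}


def describe(count, cap=CAP):
    """Format a count, flagging values that reached the display cap."""
    return f">{cap:,}" if count >= cap else f"{count:,}"


# sorted() is stable, so tied packages keep their original order.
for name, count in sorted(job_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:<10} {describe(count):>7}")
```

One consequence of the cap is that SAS and SPSS cannot be ranked against each other from this data; all we know is that both exceed every other package by a wide margin.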

What's Missing?
The most frequent question I receive about this paper is why I don't collect data on MATLAB, Mathematica, or similar open source software such as Octave, Scilab and Sage. They are, of course, quite capable of doing data analysis. However, I did not collect data on them because their use is more popular in the fields of general science and engineering, not data analysis in the statistical or predictive analytics sense. Graphs from other sources however occasionally do include them (e.g. Figures 4 & 5). 
The other thing missing is the discussion I previously included on Google Trends. That site tracks not what's actually on the Internet via searches, but rather the keywords and phrases that people are entering into their Google searches. That ended up being so variable as to be essentially worthless. For an interesting discussion of this topic, see this article by Rick Wicklin.

Conclusion
By most of the measures discussed here, R is competing well with the commercial software vendors. However, I advise against overgeneralizing from this data. SAS and SPSS continue to dominate the corporate world and Stata is doing quite well in the scholarly arena. Each of these packages is dominant in one market or another. I'm interested in other ways to measure software popularity. If you have any ideas on the subject, please contact me at muenchen.bob@gmail.com.
If you are a SAS or SPSS user interested in learning more about R, you might consider my book, R for SAS and SPSS Users. Stata users might want to consider reading R for Stata Users, written with Stata guru Joe Hilbe.

Acknowledgments
I am grateful to John Fox (2009) for the data on R package growth and to Marc Schwartz (2009) for the idea of plotting the amount of activity on e-mail discussion lists. Thanks to Duncan Murdoch for clarifying the pitfalls of counting downloads. Thanks to Martin Weiss for pointing out how to query Statalist for its number of subscribers. Thanks to Christopher Baum for information regarding counting Stata downloads. Thanks to John (Jiangtang) HU for suggesting I add more detail from the TIOBE index. Thanks to Andre Wielki for suggesting the addition of SAS Institute's support forums. Thanks to Kjetil Halvorsen for the location of the expanded list of Internet R discussions. Thanks to both Dario Solari and Joris Meys for their suggestions on how to improve Google Insight searches. Thanks to Keo Ormsby for his suggestions regarding Google Scholar. Thanks to Karl Rexer for the use of his data mining survey data. Thanks to Gregory Piatetsky-Shapiro for the use of his KDnuggets data mining poll. Thanks to Tal Galili for advice on blogs and consolidation, as well as Stack Exchange and Stack Overflow. Thanks to Patrick Burns for his advice. Thanks to Nick Cox for advice to clarify the role of Stata's software repositories and of popularity itself. Thanks to Stas Kolenikov for the link to known Stata repositories. Thanks to Rick Wicklin for convincing me to stop trying to get anything useful out of Google Insights.

Bibliography
J. Fox. Aspects of the Social Organization and Trajectory of the R Project. R Journal, 2009. http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Fox.pdf
R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5:299–314, 1996.

R. Muenchen, R for SAS and SPSS Users, Springer, 2009

R. Muenchen, J. Hilbe, R for Stata Users, Springer, 2010

Trademarks
BMDP, Carolina, JMP, Minitab, R-PLUS, Revolution R, SAS, SAS Enterprise Miner, IBM SPSS Modeler, IBM SPSS Statistics, Stata, Statistica, Systat and WPS are registered trademarks of their respective companies.
