Posts about data

H-index Rank for Countries for Science Publications

The SCImago Journal and Country Rank provides journal and country scientific indicators. As stated in previous posts, these types of rankings have limitations, but they are also interesting. The table shows the top 6 countries by h-index and then some others I chose to list (the top 6 are the same as in my 2008 post, Country H-index Rank for Science Publications). The h-index provides a numeric indication of scientific production and significance, based on the citations papers receive from other papers. Read more about the h-index (Hirsch index).
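The calculation itself is simple: a body of work has index h when h of its papers have been cited at least h times each. A minimal sketch in Python (the citation counts below are invented for illustration):

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for five papers (illustration only).
print(h_index([10, 8, 5, 4, 3]))  # 4: four papers have 4 or more citations
```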

Country             h-index   h-index (2007)   % of World Population   Total Cites
USA                   1,139        793                 4.5%            87,296,701
United Kingdom          689        465                 0.9%            21,030,171
Germany                 607        408                 1.2%            17,576,464
France                  554        376                 1.0%            12,168,898
Canada                  536        370                 0.5%            10,375,245
Japan                   527        372                 1.8%            14,341,252

Additional countries of interest:
18) China               279        161                19.4%             5,614,294
21) South Korea         258        161                 0.7%             2,710,566
22) Brazil              239        148                 2.8%             1,970,704
25) India               227        146                17.5%             2,590,791
31) Singapore           196                            0.01%              871,512

Related: Top Countries for Science and Math Education: Finland, Hong Kong and Korea - Worldwide Science and Engineering Doctoral Degree Data - Top 15 Manufacturing Countries in 2009 - Science and Engineering Doctoral Degrees Worldwide - Ranking Universities Worldwide (2008) - Government Debt as Percentage of GDP 1990-2009: USA, Japan, Germany, China…

Engineering Again Dominates The Highest Paying College Degree Programs

As usual, most of the highest paying undergraduate college degrees in the USA are in engineering. Based on data from PayScale, all of the top 10 highest paying fields are in engineering. The highest non-engineering fields are applied mathematics and computer science. Petroleum Engineering salaries have exploded over the last few years to a starting median salary of $93,000, more than $28,000 above the next highest paying degree (Chemical Engineering).

Mid-career median salaries show the same dominance of engineering degrees, though in this case 3 of the top 10 salaries (15 years into a career) go to those with non-engineering degrees: applied mathematics, physics and economics.

Highest Paid Undergrad College Degrees
Degree                             Starting Median Salary   Mid-Career Median Salary   2009 Starting Salary
Petroleum Engineering                    $93,000                  $157,000
Chemical Engineering                     $64,800                  $108,000                  $65,700
Nuclear Engineering                      $63,900                  $104,000
Computer Engineering                     $61,200                   $99,500                  $61,700
Electrical Engineering                   $60,800                  $104,000                  $60,200
Aerospace Engineering                    $59,400                  $108,000                  $59,600
Material Science and Engineering         $59,400                   $93,600
Industrial Engineering                   $58,200                   $97,400                  $57,100
Mechanical Engineering                   $58,300                   $97,400                  $58,900
Software Engineering                     $56,700                   $91,300
Applied Mathematics                      $56,400                  $101,000
Computer Science                         $56,200                   $97,700                  $56,400
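As a rough check on the table, the mid-career medians work out to roughly 65% to 80% growth over starting pay. A quick Python calculation using three rows from the table above:

```python
# Starting and mid-career median salaries, taken from the table above.
degrees = {
    "Petroleum Engineering": (93_000, 157_000),
    "Chemical Engineering": (64_800, 108_000),
    "Applied Mathematics": (56_400, 101_000),
}

for name, (start, mid) in degrees.items():
    growth = (mid - start) / start * 100
    print(f"{name}: {growth:.0f}% growth from starting to mid-career")
```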

Related: PayScale Survey Shows Engineering Degree Results in the Highest Pay (2009) - Engineering Majors Hold 8 of Top 10 Highest Paid Majors (2010) - Engineering Graduates Get Top Salary Offers in 2006 - Shortage of Petroleum Engineers (2006) - 10 Jobs That Provide a Great Return on Investment

More degrees are shown in the following table, but it doesn’t include every degree; it shows a sample of the rest.
Continue reading

Wind Power Capacity Up 170% Worldwide from 2005-2009

Chart showing global installed wind energy capacity, 2005-2009, by Curious Cat Science and Engineering Blog, Creative Commons Attribution. Data from the World Wind Energy Association: installed megawatts of global wind power capacity.


Globally, 38,025 MW of capacity were added in 2009, bringing the total to 159,213 MW, a 31% increase. The graph shows 8 of the top 10 producers (Denmark and Portugal are omitted) plus Japan (which is 13th).

Wind power now generates 2% of global electricity demand, according to the World Wind Energy Association. The countries with the highest shares of wind-generated electricity: Denmark 20%, Portugal 15%, Spain 14%, Germany 9%. Wind power employed 550,000 people in 2009 and is expected to employ 1,000,000 by 2012.

From 2005 to 2009 global installed wind power capacity increased 170%, from 59,033 megawatts to 159,213 megawatts. The share of global capacity held by the 9 countries in the graph has stayed remarkably consistent: 81% in 2005, growing slowly to 83% in 2009.

Over the 4 year period the capacity in the USA increased 284% and in China increased 1,954%. China grew 113% in 2009, the 4th year in a row it more than doubled capacity. In 2007, Europe accounted for 61% of installed capacity and the USA 18%. At the end of 2009 Europe had 48% of installed capacity, Asia 25% and North America 24%.
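These figures are plain percent changes; a short Python check of the growth numbers quoted above:

```python
def pct_increase(old, new):
    """Percent increase going from old to new."""
    return (new - old) / old * 100

# Global installed wind capacity in MW (World Wind Energy Association data).
print(pct_increase(59_033, 159_213))            # ~170%: 2005 to 2009
print(pct_increase(159_213 - 38_025, 159_213))  # ~31%: growth during 2009
```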

Related: Wind Power Provided Over 1% of Global Electricity in 2007 - USA Wind Power Installed Capacity 1981 to 2005 - Wind Power has the Potential to Produce 20% of Electricity by 2030

Hans Rosling on Global Population Growth

Hans Rosling provides another interesting TED talk. As he mentions, economics plays a huge role in whether we will slow population growth. Economic conditions play a huge role in child survival, which he calls “the new green,” meaning the fate of the environment is tied to increasing child survival (and decreasing poverty). There are many important factors that will impact the fate of the environment, but a big one is world population.

Related: Data Visualization Example - Statistics Insights for Scientists and Engineers - Very Cool Wearable Computing Gadget from MIT - Understanding the Nature of Compounding - Population Action

Google Prediction API

This looks very cool.

The Prediction API enables access to Google’s machine learning algorithms to analyze your historic data and predict likely future outcomes. Upload your data to Google Storage for Developers, then use the Prediction API to make real-time decisions in your applications. The Prediction API implements supervised learning algorithms as a RESTful web service to let you leverage patterns in your data, providing more relevant information to your users. Run your predictions on Google’s infrastructure and scale effortlessly as your data grows in size and complexity.

Accessible from many platforms: Google App Engine, Apps Script (Google Spreadsheets), web & desktop apps, and command line.

The Prediction API supports CSV formatted training data, up to 100M in size. Numeric or unstructured text can be sent as input features, and discrete categories (up to a few hundred different ones) can be provided as output labels.
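As a concrete illustration, training rows for a task like the language identification use listed below would pair an output label with input text, one example per CSV row. A hypothetical sketch in Python (the file name and example rows are invented, not taken from Google’s documentation):

```python
import csv

# Hypothetical training data: output label first, then the input feature
# (unstructured text). Rows are invented for illustration only.
rows = [
    ("english", "The quick brown fox jumps over the lazy dog"),
    ("spanish", "El rapido zorro marron salta sobre el perro perezoso"),
    ("french", "Le renard brun rapide saute par-dessus le chien paresseux"),
]

with open("language_training.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```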

Uses:
Language identification
Customer sentiment analysis
Product recommendations & upsell opportunities
Diagnostics
Document and email classification

Related: The Second 5,000 Days of the Web - Robot Independently Applies the Scientific Method - Controlled Experiments for Software Solutions - Statistical Learning as the Ultimate Agile Development Tool by Peter Norvig

Statistics Insights for Scientists and Engineers

My father was an engineer and statistician. Along with George Box and Stu Hunter (no relation), he wrote Statistics for Experimenters (one of the potential titles had been Statistics for Engineers). They had an interest in bringing applied statistics to the work of scientists and engineers, and I share that interest. To me the key trait of applied statistics is that it helps experimenters learn quickly: it is an aid in the discovery process. It should not be a passive tool for analysis (which is how people often think of statistics).

José Ramírez studied applied and industrial statistics at the University of Wisconsin – Madison with my father and George Box. He now has a book and blog on bringing statistics to engineers and scientists:

The book is primarily written for engineers and scientists who need to use statistics and JMP to make sense of data and make sound decisions based on their analyses. This includes, for example, people working in semiconductor, automotive, chemical and aerospace industries. Other professionals in these industries who will find it valuable include quality engineers, reliability engineers, Six Sigma Black Belts and statisticians.

For those who want a reference for how to solve common problems using statistics and JMP, we walk through different case studies using a seven-step problem-solving framework, with heavy emphasis on the problem setup, interpretation, and translation of the results in the context of the problem.

For those who want to learn more about the statistical techniques and concepts, we provide a practical overview of the underpinnings and provide appropriate references. Finally, for those who want to learn how to benefit from the power of JMP, we have loaded the book with many step-by-step instructions and tips and tricks.

Related: Highlights from George Box Speech at JMP conference Nov 2009 - Controlled Experiments for Software Solutions - Mistakes in Experimental Design and Interpretation - Florence Nightingale: The passionate statistician

Stat Insights is a blog by José and Brenda Ramírez.

Analyzing and Interpreting Continuous Data Using JMP by José and Brenda Ramírez. View chapter 1 online.

[We] have focused on making statistics both accessible and effective in helping to solve common problems found in an industrial setting. Statistical techniques are introduced not as a collection of formulas to be followed, but as a catalyst to enhance and speed up the engineering and scientific problem-solving process. Each chapter uses a 7-step problem-solving framework to make sure that the right problem is being solved with an appropriate selection of tools.

Florence Nightingale: The passionate statistician

Florence Nightingale: The passionate statistician

She brought about fundamental change in the British military medical system, preventing any such future calamities. To do it, she pioneered a brand-new method for bringing about social change: applied statistics.

The statistics changed Nightingale’s understanding of the problems in Turkey. Lack of sanitation, she realized, had been the principal reason for most of the deaths, not inadequate food and supplies as she had previously thought.

As impressive as her statistics were, Nightingale worried that Queen Victoria’s eyes would glaze over as she scanned the tables. So Nightingale devised clever ways of presenting the information in charts. Statistics had been presented using graphics only a few times previously, and perhaps never to persuade people of the need for social change.

Applied statistics is a tool available to all for achieving great improvement. Unfortunately it is still very underused. As George Box says: applied statistics is not about proving a theorem, it’s about being curious about things. The goal of design of experiments is to learn, refine your experiment based on the knowledge you gain, and experiment again. It is a process of discovery.

Related: articles on applied statistics - The Value of Displaying Data Well - Statistics for Experimenters - Playing Dice and Children’s Numeracy - Quality, SPC and Your Career - Great Charts

Learning Design of Experiments with Paper Helicopters

Photo showing the paper helicopter test track (a stairwell drop) by Brad.

Dr. George E.P. Box wrote a great paper, Teaching Engineers Experimental Design With a Paper Helicopter, that can be used to learn principles of experimental design, including conditions for the validity of experimentation, randomization, blocking, the use of factorial and fractional factorial designs, and the management of experimentation.

I ran across an interesting blog post on a class learning these principles today – Brad’s Hella-Copter:

For our statistics class, we have been working hard on a Design of Experiments project that optimizes a paper helicopter with respect to hang time and accuracy of a descent down a stairwell.

We were to design a helicopter that would drop 3 stories down within the 2ft gap between flights of stairs.

[design of experiments is] very powerful when you have lots of variables (i.e. paper type, helicopter blade length, blade width, body height, body width, paperclip weights, etc.) and not a lot of time to vary each one individually. If we were to individually change each variable one at a time, we would have made over 256 different helicopters. Instead we built 16, tested them, and got a feel for which variables were most important. We then focused on these important variables for design improvement through further testing and optimization.
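The run counts in the quote come from two-level factorial arithmetic: 8 factors at 2 levels each give 2^8 = 256 full-factorial combinations, while a 2^(8-4) fractional factorial covers all 8 factors in 16 runs by constructing 4 of the columns from the other 4. A sketch in Python (the generator columns are one standard resolution IV choice, assumed for illustration rather than taken from the class project):

```python
from itertools import product

# 2^(8-4) fractional factorial: 16 runs for 8 two-level factors, versus
# 2^8 = 256 runs for the full factorial. Factors are coded -1 (low) and
# +1 (high); E-H are built from the generators E=BCD, F=ACD, G=ABC, H=ABD
# (an assumed, standard choice).
runs = []
for a, b, c, d in product((-1, 1), repeat=4):
    e, f, g, h = b * c * d, a * c * d, a * b * c, a * b * d
    runs.append((a, b, c, d, e, f, g, h))

print(len(runs))  # 16 runs covering all 8 factors
for run in runs:
    print(run)
```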

Related: 101 Ways to Design an Experiment, or Some Ideas About Teaching Design of Experiments by William G. Hunter (my father) - posts on design of experiments - George Box on quality improvement - Designed Experiments - Autonomous Helicopters Teach Themselves to Fly - Statistics for Experimenters

The Value of Displaying Data Well


Anscombe’s quartet: all four sets are identical when examined statistically, but vary considerably when graphed. Image via Wikipedia.

Anscombe’s quartet comprises four datasets that have identical simple statistical properties, yet are revealed to be very different when inspected graphically. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician F.J. Anscombe to demonstrate the importance of graphing data before analyzing it, and of the effect of outliers on the statistical properties of a dataset.

Of course we also have to be careful of drawing incorrect conclusions from visual displays.

For all four datasets:

Property                                      Value
Mean of each x variable                       9.0
Variance of each x variable                   10.0
Mean of each y variable                       7.5
Variance of each y variable                   3.75
Correlation between each x and y variable     0.816
Linear regression line                        y = 3 + 0.5x
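Those properties are easy to verify in code. A short check using dataset I of the quartet (the variances quoted above are population variances; statistics.correlation and statistics.linear_regression require Python 3.10+):

```python
from statistics import mean, pvariance, correlation, linear_regression

# Dataset I of Anscombe's quartet; the other three give the same summary values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

print(mean(x), pvariance(x))                      # mean 9.0, variance 10.0
print(round(mean(y), 2), round(pvariance(y), 2))  # mean 7.5, variance ~3.75
print(round(correlation(x, y), 3))                # 0.816
print(linear_regression(x, y))                    # slope ~0.5, intercept ~3.0
```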

Edward Tufte uses the quartet to emphasize the importance of looking at one’s data before analyzing it, on the first page of the first chapter of his book, The Visual Display of Quantitative Information.

Related: Edward Tufte’s Beautiful Evidence - Simpson’s Paradox - Correlation is Not Causation - Seeing Patterns Where None Exists - Great Charts - Playing Dice and Children’s Numeracy - Theory of Knowledge

Controlled Experiments for Software Solutions

by Justin Hunter

Jeff Fry linked to a great webcast in Controlled Experiments To Test For Bugs In Our Mental Models.

I firmly believe that applied statistics-based experiments are under-appreciated by businesses (and, for that matter, business schools). Few people who understand them are as articulate and concise as Kohavi. Admittedly, I could be accused of being biased as: (a) I am the son of a prominent applied statistician and (b) I am the founder of a software testing tools company that uses applied statistics-based methods and algorithms to make our tool work.

Summary of the webcast, on Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO – a presentation by Ron Kohavi with Microsoft Research.

1:00 Amazon: in 2000, Greg Linden wanted to add recommendations in shopping carts during the check out process. The “HiPPO” (meaning the Highest Paid Person’s Opinion) was against it, on the grounds that recommendations would confuse and/or distract people. Amazon, a company with a good culture of experimentation, decided to run a small experiment anyway, “just to get the data” – it was wildly successful and is in widespread use today at Amazon and other firms.

3:00 Dr. Footcare example: Including a coupon code above the total price to be paid had a dramatic impact on abandonment rates.

4:00 “Was this answer useful?” Dramatic differences occur when Y/N is replaced with 5 stars, and depending on whether an empty text box is shown initially or only triggered after a user clicks to give their initial response.

6:00 Sewing machines: experimenting with a sales promotion strategy led to an extremely counter-intuitive pricing choice.

7:00 “We are really, really bad at understanding what is going to work with customers…”
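The experiments described are classic controlled A/B tests: split traffic between the current design and the variant, then ask whether the observed difference in conversion rate is larger than chance alone would explain. A minimal two-proportion z-test sketch in Python (the visitor and conversion counts are invented for illustration):

```python
from math import erf, sqrt

def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Normal CDF via erf: P(Z <= z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical: control converts 200 of 10,000 visitors, variant 260 of 10,000.
print(ab_test_p_value(200, 10_000, 260, 10_000))  # ~0.005: unlikely to be chance
```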
Continue reading

Data Analysts Captivated by R’s Power

Data Analysts Captivated by R’s Power

data mining has entered a golden age, whether being used to set ad prices, find new drugs more quickly or fine-tune financial models. Companies as diverse as Google, Pfizer, Merck, Bank of America, the InterContinental Hotels Group and Shell use it.

Close to 1,600 different packages reside on just one of the many Web sites devoted to R, and the number of packages has grown exponentially. One package, called BiodiversityR, offers a graphical interface aimed at making calculations of environmental trends easier.

Another package, called Emu, analyzes speech patterns, while GenABEL is used to study the human genome. The financial services community has demonstrated a particular affinity for R; dozens of packages exist for derivatives analysis alone. “The great beauty of R is that you can modify it to do all sorts of things,” said Hal Varian, chief economist at Google. “And you have a lot of prepackaged stuff that’s already available, so you’re standing on the shoulders of giants.”

R first appeared in 1996, when the statistics professors Ross Ihaka and Robert Gentleman of the University of Auckland in New Zealand released the code as a free software package. According to them, the notion of devising something like R sprang up during a hallway conversation. They both wanted technology better suited for their statistics students, who needed to analyze data and produce graphical models of the information. Most comparable software had been designed by computer scientists and proved hard to use.

R is another example of great, free, open source software. See R packages for Statistics for Experimenters.

via: R in the news

Related: Mistakes in Experimental Design and Interpretation - Data Based Decision Making at Google - Freeware Math Programs - How Large Quantities of Information Change Everything