2018 Kaggle Machine Learning & Data Science Survey
- by user1
- 17 February, 2022
The most comprehensive dataset available on the state of ML and data science
License: CC BY-SA 4.0
Tags: business, earth and nature, computer science, education, games, and 1 more
Overview
Welcome to Kaggle’s second annual Machine Learning and Data Science Survey ― and our first-ever survey data challenge.
This year, as last year, we set out to conduct an industry-wide survey that presents a truly comprehensive view of the state of data science and machine learning. The survey was live for one week in October, and after cleaning the data we finished with 23,859 responses, a 49% increase over last year!
There’s a lot to explore here. The results include raw numbers about who is working with data, what’s happening with machine learning in different industries, and the best ways for new data scientists to break into the field. We’ve published the data in as raw a format as possible without compromising anonymization, which makes it an unusual example of a survey dataset.
Challenge
This year Kaggle is launching the first Data Science Survey Challenge, where we will award a prize pool of $28,000 to kernel authors who tell a rich story about a subset of the data science and machine learning community.
In our second year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry. This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities represented in the survey. For that reason, we’re inviting the Kaggle community to dive deep into the survey datasets and help us tell the diverse stories of data scientists from around the world.
The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!
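As a minimal sketch of what "defining a group" can look like in practice, the snippet below filters a tiny in-memory table down to respondents who code mostly in Python. The column names (`respondent`, `primary_language`, `is_student`) and the sample rows are hypothetical stand-ins for illustration only; they are not the survey's actual fields.

```python
import csv
import io

# Hypothetical sample data; the real survey columns and answers differ.
SAMPLE = """respondent,primary_language,is_student
1,Python,yes
2,R,no
3,Python,no
4,SQL,yes
"""


def select_subgroup(rows, column, value):
    """Return only the rows whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]


rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Macro-level group: anyone whose primary language is Python.
python_users = select_subgroup(rows, "primary_language", "Python")

# Micro-level group: narrow the same group further to students.
python_students = select_subgroup(python_users, "is_student", "yes")

share = len(python_users) / len(rows)
print(f"{len(python_users)} of {len(rows)} respondents ({share:.0%})")
```

Chaining filters this way keeps each narrowing step explicit, which makes it easier to document how a group was defined.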
Submissions will be evaluated on the following:
- Composition – Is there a clear narrative thread to the story that’s articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
- Originality – Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought-provoking, and fresh all at the same time.
- Documentation – Are your code, kernel, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high-quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.
To be valid, a submission must be contained in one kernel, made public on or before the submission deadline. Participants are free to use any datasets in addition to the Kaggle Data Science survey, but those datasets must also be publicly available on Kaggle by the deadline for a submission to be valid.
Prizes
There will be 6 prizes for the best data storytelling submissions:
- 1st place: $8,000
- 2nd place: $5,000
- 3rd place: $3,000
- 4th place: $2,000
- 5th place: $2,000
- 6th place: $2,000
Winning submissions will also be published on Kaggle’s blog, No Free Hunch, and shared in the Kaggle Newsletter (which goes out to nearly one million Kagglers!).
While the challenge is running, Kaggle will also give a Weekly Kernel Award of $1,500 to recognize excellent kernels that are public analyses of the survey. Weekly Kernel Awards will be announced every Friday between 11/9 and 11/30.
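The $28,000 prize pool can be reconciled with the amounts listed above. The sketch below assumes exactly one Weekly Kernel Award per announcement Friday and takes the year 2018 from the survey title; both are assumptions rather than stated rules.

```python
from datetime import date, timedelta

# Placement prizes as listed under "Prizes" (1st through 6th place).
placement_prizes = [8000, 5000, 3000, 2000, 2000, 2000]

# Weekly Kernel Award amount; assumes one award per announcement date.
weekly_award = 1500

# Announcement Fridays from 11/9 to 11/30 (year 2018 assumed from the title).
first, last = date(2018, 11, 9), date(2018, 11, 30)
fridays = []
day = first
while day <= last:
    fridays.append(day)
    day += timedelta(days=7)

placement_total = sum(placement_prizes)
weekly_total = weekly_award * len(fridays)
prize_pool = placement_total + weekly_total

print(placement_total, weekly_total, prize_pool)  # 22000 6000 28000
```

Under these assumptions the six placement prizes ($22,000) plus four weekly awards ($6,000) match the stated pool.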
How to Participate
To make a submission, complete the submission form. Only one submission will be judged per participant, so if you make multiple submissions we will review the last (most recent) entry.
No submission is necessary for the Weekly Kernels Awards. To be eligible, a kernel must be public and use the 2018 Data Science Survey as a data source.
Timeline
All deadlines are at 11:59 PM UTC.
- Submission deadline: December 3rd
- Winners announced: December 10th
- Weekly Kernels Award prize winners announcements: November 9th, 16th, 23rd, and 30th
All kernels are evaluated after the deadline.
Rules
To be eligible to win a prize in either of the above prize tracks, you must be:
- a registered account holder at Kaggle.com;
- the older of 18 years old or the age of majority in your jurisdiction of residence; and
- not a resident of Crimea, Cuba, Iran, Syria, North Korea, or Sudan.
Your kernels will only be eligible to win if they have been made public on kaggle.com by the above deadline. All prizes are awarded at the discretion of Kaggle. Kaggle reserves the right to cancel or modify prize criteria.
Unfortunately, employees, interns, contractors, officers and directors of Kaggle Inc., and their parent companies, are not eligible to win any prizes.
Survey Methodology
- This survey received 23,859 usable respondents from 147 countries and territories. If a country or territory received less than 50 respondents, we grouped them into a group named “Other” for anonymity.
- We excluded respondents who were flagged by our survey system as “Spam”.
- Most of our respondents were found primarily through Kaggle channels, like our email list, discussion forums and social media channels.
- The survey was live from October 22nd to October 29th. We allowed respondents to complete the survey at any time during that window. The median response time for those who participated in the survey was 15-20 minutes.
- Not every question was shown to every respondent. You can learn more about the different segments we used in the schema.csv file.
- To protect the respondents’ identity, the answers to multiple choice questions have been separated into a separate data file from the open-ended responses. We do not provide a key to match up the multiple choice and free form responses. Further, the free form responses have been randomized column-wise such that the responses that appear on the same row did not necessarily come from the same survey-taker.
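Two of the anonymization steps above can be illustrated with a short sketch: grouping low-count countries into “Other” and shuffling each free-form column independently. This is an illustration of the general idea under stated assumptions, not Kaggle's actual pipeline; the function names, the `seed` parameter, and the sample values are all hypothetical.

```python
import random
from collections import Counter


def group_small_countries(countries, min_count=50, other_label="Other"):
    """Replace any country with fewer than `min_count` respondents."""
    counts = Counter(countries)
    return [c if counts[c] >= min_count else other_label for c in countries]


def shuffle_columns_independently(rows, seed=None):
    """Shuffle every column on its own so rows no longer line up.

    Each column keeps exactly the same set of values, but values on the
    same output row need not come from the same original row.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*rows)]
    for col in columns:
        rng.shuffle(col)
    return [list(row) for row in zip(*columns)]


# Hypothetical data: 60 respondents from "A", 3 from "B".
countries = ["A"] * 60 + ["B"] * 3
grouped = group_small_countries(countries)
print(Counter(grouped))

# Hypothetical free-form answers: one row per respondent, one column per question.
free_text = [["a1", "b1"], ["a2", "b2"], ["a3", "b3"]]
shuffled = shuffle_columns_independently(free_text, seed=0)
print(shuffled)
```

For analysis, the practical consequence is that free-form columns should be studied one column at a time; cross-column comparisons within a row are not meaningful after this kind of shuffle.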
Size: 4302 KB Price: Free Author: Kaggle and 4 collaborators Data source: kaggle.com