UCI ML Drug Review dataset

user1

3 years ago

Over 200,000 patient drug reviews

Tags: earth and nature, computer science, education, health, software, and 6 more

This dataset was used for the Winter 2018 Kaggle University Club Hackathon and is now publicly available. See the Acknowledgments section for citation and licensing.

Welcome to the Kaggle University Club Hackathon!

If you are interested in joining Kaggle University Club, please e-mail Jessica Li at lijessica@google.com

This Hackathon is open to all undergraduate, master's, and PhD students who are part of the Kaggle University Club program. The Hackathon gives students a chance to build capacity through hands-on ML, learn from one another, and engage in a self-defined project that is meaningful to their careers.

Teams must register via Google Form to be eligible for the Hackathon. The Hackathon starts on Monday, November 12, 2018 and ends on Monday, December 10, 2018, so teams have about one month to work on a team submission. Teams must do all work within the Kernel editor and keep their Kernel(s) public at all times.

Prompt

The freestyle format of hackathons has time and again stimulated groundbreaking and innovative data insights and technologies. The Kaggle University Club Hackathon recreates this environment virtually on our platform. We challenge you to build a meaningful project around the UCI Machine Learning – Drug Review Dataset. Teams are free to let their creativity run and propose methods to analyze this dataset and form interesting machine learning models.

Machine learning has permeated nearly all fields and disciplines of study. One hot topic is using natural language processing and sentiment analysis to identify, extract, and make use of subjective information. The UCI ML Drug Review dataset provides patient reviews on specific drugs along with related conditions and a 10-star patient rating system reflecting overall patient satisfaction. The data was obtained by crawling online pharmaceutical review sites. It was published in a study on sentiment analysis of drug experience across multiple facets, e.g., sentiments learned on specific aspects such as effectiveness and side effects (see the Acknowledgments section to learn more).

The sky’s the limit here in terms of what your team can do! Teams are free to add supplementary datasets in conjunction with the drug review dataset in their Kernel. Discussion is highly encouraged within the forum and Slack so everyone can learn from their peers.

Here are just a couple of ideas for what you could do with the data:

Classification: Can you predict the patient’s condition based on the review?
Regression: Can you predict the rating of the drug based on the review?
Sentiment analysis: What elements of a review make it more helpful to others? Which patients tend to have more negative reviews? Can you determine if a review is positive, neutral, or negative?
Data visualizations: What kind of drugs are there? What sorts of conditions do these patients have?
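As a starting point for the sentiment question above, here is a minimal sketch of turning the 10-star rating into a positive, neutral, or negative label. The toy rows, the field names `review` and `rating`, and the cut-off values are all assumptions made for illustration; they are not taken from the dataset files, and a team may well choose different thresholds.

```python
from collections import Counter

# Hypothetical toy rows. Field names and values are illustrative only
# and are not copied from the actual dataset.
REVIEWS = [
    {"review": "Worked well, no side effects", "rating": 9},
    {"review": "Made me dizzy and did not help", "rating": 2},
    {"review": "Some relief but mild nausea", "rating": 5},
]


def rating_to_sentiment(rating, low=4, high=7):
    """Map a 1-10 rating to a coarse sentiment label.

    The thresholds are assumed: ratings <= low are negative,
    ratings >= high are positive, anything in between is neutral.
    """
    if rating <= low:
        return "negative"
    if rating >= high:
        return "positive"
    return "neutral"


def label_counts(rows):
    """Count how many rows fall under each sentiment label."""
    return Counter(rating_to_sentiment(row["rating"]) for row in rows)


counts = label_counts(REVIEWS)
print(dict(counts))
```

Labels derived this way could serve as targets for a text classifier trained on the review text, while the raw rating could serve as the target for the regression idea.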

Top Submissions

There is no one correct answer to this Hackathon, and teams are free to define the direction of their own project. That being said, there are certain core elements generally found across all outstanding Kernels on the Kaggle platform. The best Kernels are:

Complex: How many domains of analysis and topics does this Kernel cover? Does it attempt machine learning methods? Does the Kernel offer a variety of unique analyses and interesting conclusions or solutions?
Original: What is the subject matter of this Kernel? Does it have a well-defined and interesting project scope, narrative or problem? Could the results make an impact? Is it thought provoking?
Approachable: How easy is it to understand this Kernel? Are all thought processes clear? Is the code clean, with useful comments? Are visualizations and processes articulated and self-explanatory?

Teams with top submissions have a chance to receive exclusive Kaggle University Club swag and be featured on our official blog and across social media.

IMPORTANT: Teams must set all Kernels to public at all times. This is so we can track each team’s progression, but more importantly it encourages collaboration, productive discussion, and healthy inspiration to all teams. It is not so that teams can simply copycat good ideas. If a team’s Kernel isn’t their own organic work, it will not be considered a top submission. Teams must come up with a project on their own.

Submission Styling

The final Kernel submission for the Hackathon must contain the following information:

All team members added as collaborators to the Kernel
All team member names, university name, club name, and team name (as specified on the team registration form), placed somewhere near the top of the Kernel
An engaging title that summarizes your project
A team picture (smile!)

Timeline

November 12, 2018 (Monday): Hackathon starts
November 19, 2018 (Monday): Teams have at least one public Kernel with a topic proposal
December 10, 2018 (Monday): Hackathon submissions due (by 11:59pm EST)
December 14, 2018 (Friday): Top submissions announced

Rules

All coding must take place within the Kernel editor, not locally. This is so we can track each team’s progression.
Kernels must be public at all times for progression tracking purposes.
Teams are encouraged to be inspired by each other’s work via public discussions and Kernels. However, if a team’s Kernel is simply a “copycat” project combining other people’s discoveries and is not the team’s own organic work, it will not be considered a top submission. Teams must come up with a project on their own.
Teams are allowed to work on multiple Kernels, but the final submission for the Hackathon must be in the form of one Kaggle Kernel. Teams must specify which Kernel they would like to count as the final submission.
Teams may use any number of supplementary datasets as seen fit for the Hackathon.
This Hackathon and your project are as great as you make them. Give it your best and have fun with your teammates and peers!

Acknowledgments

The dataset was originally published on the UCI Machine Learning repository. Citation:

Felix Gräßer, Surya Kallumadi, Hagen Malberg, and Sebastian Zaunseder. 2018. Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning. In Proceedings of the 2018 International Conference on Digital Health (DH ’18). ACM, New York, NY, USA, 121-125.

When using this dataset, you agree that you:

Only use the data for research purposes
Don’t use the data for any commercial purposes
Don’t distribute the data to anyone else