Machine Learning: Bitcoin Market Analysis
Investigating the manipulative effect of tweets on Bitcoin price using Natural Language Processing (BERT) and Multi-Layer Perceptron (MLP).
Project Overview (Abstract)
Social media is one of the most important communication tools of our time, but can it also be used to manipulate? Hundreds of thousands of tweets are posted every day on Twitter, one of the largest and most interactive social media platforms, and these tweets may influence people at the moment they make decisions. This project investigates whether tweets can have a manipulative effect on the price of Bitcoin. To do so, we aimed to develop a model that combines BERT, a Transformer-based language model, with a Multi-Layer Perceptron (MLP).
Methodology & Technologies
The project involved several key stages, from data acquisition and preprocessing to model development and evaluation:
- Initial Data Exploration & Challenges: The initial goal was to analyze the relationship between Google Trends, daily Bitcoin tweet volume, and Bitcoin price. However, significant gaps in available Twitter datasets (from Kaggle) led to a pivot. Various imputation methods (Spline Interpolation, Linear Interpolation, Moving Average, STL Imputation) were explored to address missing data, but results were unsatisfactory for the original goal, with some methods even producing impossible results like negative tweet counts.
- Revised Focus: The project shifted to examining the direct relationship between the content of tweets and subsequent Bitcoin price movements (specifically, whether the price would increase or decrease one week after the tweets).
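The negative tweet counts mentioned under Initial Data Exploration & Challenges are easy to reproduce on a toy series. The sketch below is not the project's code: the daily counts are made up, and a single quadratic fitted through three known points serves as a simplified stand-in for the spline methods named above. It shows how a curved interpolant can dip below zero inside a gap, while linear interpolation stays between its neighbouring values.

```python
import numpy as np

# Hypothetical daily tweet counts; days 2 and 3 are missing.
known_days = np.array([0.0, 1.0, 4.0])
known_counts = np.array([50.0, 5.0, 50.0])
missing_days = np.array([2.0, 3.0])

# Linear interpolation: each filled value lies between its two neighbours.
linear_fill = np.interp(missing_days, known_days, known_counts)

# Quadratic through the three known points (exact fit: degree = points - 1).
coeffs = np.polyfit(known_days, known_counts, deg=2)
curved_fill = np.polyval(coeffs, missing_days)

print("linear:", linear_fill)  # both values stay non-negative
print("curved:", curved_fill)  # the first value falls below zero
```

For count data, one common workaround is to clip imputed values at zero, although that hides the poor fit rather than fixing it.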
Dataset Preparation & Filtering:
- Sourced Bitcoin tweets dataset from Kaggle (initially 4,689,288 rows).
- Filtered relevant columns: user name, follower count, tweet content, and date.
- Applied filters to enhance data quality and relevance: removed tweets from crypto exchange accounts, kept only accounts with at least 10,000 followers (to focus on influential accounts), removed Telegram advertisements, and selected tweets containing predictive keywords (e.g., "buy", "sell"). This reduced the dataset to 54,460 relevant tweets.
- Created a corresponding Bitcoin price dataset using the Yahoo Finance API, aligned with the tweet dates. A "result" column was generated to indicate price direction (1 if the price was higher one week later, 0 otherwise).
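The follower and keyword filters and the "result" label can be sketched in pandas as follows. This is a minimal illustration, not the project's code: the column names (`user_name`, `user_followers`, `text`), the helper names, and the toy data are assumptions, the keyword list contains only the two examples given above, and the exchange-account and Telegram filters are left out. The label assumes one closing price per calendar day.

```python
import pandas as pd

MIN_FOLLOWERS = 10_000
KEYWORDS = ("buy", "sell")  # only the examples named in the text; full list unknown


def filter_tweets(tweets: pd.DataFrame) -> pd.DataFrame:
    """Keep tweets from large accounts that mention a predictive keyword."""
    pattern = r"\b(?:" + "|".join(KEYWORDS) + r")\b"
    influential = tweets["user_followers"] >= MIN_FOLLOWERS
    predictive = tweets["text"].str.contains(pattern, case=False, regex=True, na=False)
    return tweets[influential & predictive]


def label_price_direction(close: pd.Series, horizon_days: int = 7) -> pd.DataFrame:
    """Label each day 1 if the close `horizon_days` later is higher, else 0."""
    future = close.shift(-horizon_days)
    # Days without a known future price cannot be labelled, so drop them.
    labelled = pd.DataFrame({"close": close, "future_close": future}).dropna()
    labelled["result"] = (labelled["future_close"] > labelled["close"]).astype(int)
    return labelled


# Toy data (hypothetical values).
tweets = pd.DataFrame({
    "user_name": ["a", "b", "c"],
    "user_followers": [25_000, 500, 40_000],
    "text": ["Time to BUY bitcoin", "sell now", "gm everyone"],
})
filtered = filter_tweets(tweets)

close = pd.Series(
    [100, 101, 99, 102, 103, 98, 97, 105, 95, 110],
    index=pd.date_range("2021-02-01", periods=10, freq="D"),
)
labelled = label_price_direction(close)
print(filtered["user_name"].tolist(), labelled["result"].tolist())
```

Dropping the last `horizon_days` rows matters: without it, days with no known future price would silently receive the label 0.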
Data Merging & Preprocessing for NLP:
- Merged the filtered tweet dataset with the Bitcoin price dataset.
- Focused on "text" (tweet content) and "result" (price direction) columns.
- To manage the computational cost of the BERT transformation, tweets sent on the same day were combined into a single text, reducing the dataset to 219 rows (one per day).
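One way to express the merge and daily aggregation in pandas is shown below. It is a sketch under assumptions: the column names `date`, `day`, `text`, and `result`, the helper name `aggregate_daily`, and the choice to join tweets with a single space are all illustrative, since the source does not say how same-day tweets were combined.

```python
import pandas as pd


def aggregate_daily(tweets: pd.DataFrame, labels: pd.DataFrame) -> pd.DataFrame:
    """Join all tweets from one calendar day and attach that day's label."""
    # Strip the time of day so that every tweet maps to its calendar date.
    day = pd.to_datetime(tweets["date"]).dt.normalize()
    daily_text = (
        tweets.assign(day=day)
        .groupby("day")["text"]
        .agg(" ".join)
        .reset_index()
    )
    # Inner join: keep only days that have both tweets and a price label.
    return daily_text.merge(labels[["day", "result"]], on="day", how="inner")


# Toy data (hypothetical values).
tweets = pd.DataFrame({
    "date": ["2021-02-05 10:00:00", "2021-02-05 18:30:00", "2021-02-06 09:00:00"],
    "text": ["buy now", "sell later", "time to buy"],
})
labels = pd.DataFrame({
    "day": pd.to_datetime(["2021-02-05", "2021-02-06", "2021-02-07"]),
    "result": [1, 0, 1],
})
daily = aggregate_daily(tweets, labels)
print(daily)
```

Aggregating by day reduces the number of texts BERT has to process, but it also makes each text much longer, which matters for models with a fixed input length.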
Natural Language Processing (NLP):
- Utilized the BERT (Bidirectional Encoder Representations from Transformers) model to convert tweet text into numerical vector representations suitable for machine learning.
- The dataset was split into 80% training and 20% testing sets before BERT transformation.
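The split and the text-to-vector step might look like the sketch below. Several details are assumptions rather than facts from the project: the `bert-base-uncased` checkpoint, the use of the `[CLS]` token vector as the text representation, truncation to 512 tokens, the batch size, the random seed, and the helper names `split_rows` and `embed_texts`. The embedding function downloads a pretrained model on first use, so only the split is exercised in the demo.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_rows(texts, labels, test_size=0.2, random_state=42):
    """Split texts and labels into 80% training and 20% testing rows."""
    return train_test_split(texts, labels, test_size=test_size, random_state=random_state)


def embed_texts(texts, model_name="bert-base-uncased", batch_size=16):
    """Return one [CLS] vector per text (shape: n_texts x hidden_size)."""
    # Imported lazily so the split helper works without these heavy packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # BERT accepts at most 512 tokens, so longer daily texts are truncated.
        encoded = tokenizer(batch, padding=True, truncation=True,
                            max_length=512, return_tensors="pt")
        with torch.no_grad():
            output = model(**encoded)
        # Position 0 of the last hidden state is the [CLS] token.
        vectors.append(output.last_hidden_state[:, 0, :].numpy())
    return np.vstack(vectors)


# Placeholder rows standing in for the 219 daily texts and labels.
texts = [f"daily tweets {i}" for i in range(219)]
labels = [i % 2 for i in range(219)]
train_texts, test_texts, y_train, y_test = split_rows(texts, labels)
print(len(train_texts), len(test_texts))
```

With 219 rows, scikit-learn's `train_test_split` at `test_size=0.2` yields 175 training rows and 44 testing rows. If the project used this kind of split, each test prediction shifts accuracy by about 2.3 percentage points, and 32 correct out of 44 gives 72.73%.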
Machine Learning Model:
- Employed a Multi-Layer Perceptron (MLP) classifier from the Scikit-learn (sklearn) library.
- Trained the MLP model using the BERT-transformed tweet data as input and the price direction ("result") as the output.
- Experimented with hidden layer sizes from 1 to 7 neurons in each of the two hidden layers and with maximum iteration counts from 1,000 to 49,000 to optimize model performance, tracking accuracy for each of the 2,401 combinations (7 × 7 × 49, which is consistent with iteration counts in steps of 1,000).
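The search over layer sizes and iteration limits can be sketched with `itertools.product` and scikit-learn's `MLPClassifier`. The step of 1,000 between iteration limits is inferred from the count of 2,401 combinations rather than stated in the source, and the helper names, random seed, and synthetic data are illustrative. To keep the demo quick, it trains only seven of the combinations on random features instead of real BERT vectors.

```python
from itertools import product

import numpy as np
from sklearn.neural_network import MLPClassifier


def build_grid(layer_sizes=range(1, 8), max_iters=range(1000, 50000, 1000)):
    """All (first layer, second layer, max_iter) combinations."""
    return list(product(layer_sizes, layer_sizes, max_iters))


def evaluate(X_train, y_train, X_test, y_test, grid, random_state=0):
    """Train one MLP per grid entry and record its accuracy on the test rows."""
    scores = {}
    for first, second, max_iter in grid:
        clf = MLPClassifier(hidden_layer_sizes=(first, second),
                            max_iter=max_iter, random_state=random_state)
        clf.fit(X_train, y_train)
        scores[(first, second, max_iter)] = clf.score(X_test, y_test)
    return scores


grid = build_grid()
print(len(grid))  # 7 * 7 * 49 combinations

# Synthetic stand-in for the BERT vectors and labels (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = (X[:, 0] > 0).astype(int)
subset = [g for g in grid if g[2] == 1000 and g[0] == g[1]]
scores = evaluate(X[:48], y[:48], X[48:], y[48:], subset)
best = max(scores, key=scores.get)
print(best, scores[best])
```

If the tracked accuracy is measured on the same held-out rows that are used to pick the best configuration, the winning score tends to be optimistic. A separate validation split or cross-validation gives a less biased estimate.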
- Key Technologies: Python, pandas (for data manipulation), scikit-learn (for the MLP), Hugging Face Transformers (for BERT), Yahoo Finance API (for price data), Kaggle (for the tweet dataset).
Full Project Report
The detailed methodology, data processing steps, model configurations, and results are documented in the full project report.
Conclusion & Outcomes
Despite initial challenges with data completeness for the original research question, the project pivoted successfully to modelling the potential manipulative effect of tweet content on Bitcoin's price. The final model, which used BERT for text feature extraction and an MLP classifier (5 neurons in the first hidden layer, 6 in the second, and a maximum of 10,000 iterations), achieved an accuracy score of approximately 72.73%. This result suggests that, with appropriate filtering and NLP techniques, a model can pick up patterns linking tweet content to short-term Bitcoin price movements. The project provided valuable experience in data cleaning, imputation strategies (and their limitations), applying NLP, and iterative machine learning model development.