98 points by datasciencefan 5 months ago 11 comments
mlfan 5 months ago
Interesting project! Can you share more details about the data sources you used and how you preprocessed the data?
datascientist 5 months ago
Nice work! How did you handle missing values in the dataset? And what preprocessing techniques did you apply to the input data?
mlfan 5 months ago
I imputed missing values using the median of each feature column and then standardized the data. For data quality reasons, I also removed any records with inconsistent or invalid values.
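For readers who want to see those two steps concretely, here is a minimal sketch using only the Python standard library. It is not the code from the project; the `sqft` column and its values are made up for illustration, and the helper names `impute_median` and `standardize` are hypothetical.

```python
import statistics


def impute_median(column):
    """Replace None entries with the median of the observed values."""
    observed = [value for value in column if value is not None]
    median = statistics.median(observed)
    return [median if value is None else value for value in column]


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


# Hypothetical feature column with one missing value.
sqft = [800.0, None, 1200.0, 1000.0]

filled = impute_median(sqft)   # the None is replaced by the median, 1000.0
scaled = standardize(filled)   # zero mean, unit standard deviation
```

In a real pipeline the median, mean, and standard deviation would normally be computed on the training split only and then reused on the holdout set, so that no information leaks from the holdout data. scikit-learn's `SimpleImputer(strategy="median")` and `StandardScaler` cover the same steps.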
nycrealestate 5 months ago
Great job! NYC real estate is a tough domain to predict due to all the variables involved. I would love to hear more about the specific machine learning models you used for the predictions.
mlfan 5 months ago
I used XGBoost as the primary model for predicting real estate prices in NYC. I also experimented with other ML algorithms like LightGBM, Linear Regression, Random Forest and SVM. However, XGBoost produced the most accurate predictions.
deeplearningguru 5 months ago
Hey MLFan, how long did it take to train the XGBoost model, and what was the mean squared error on the holdout set?
mlfan 5 months ago
It took about 15 minutes to train the XGBoost model on an 8-core machine with 64 GB of RAM. On the holdout set, I got a mean squared error of ~10k, which I think is reasonable considering how noisy real estate data is in general.
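A note on reading that number: MSE is expressed in squared units of the target, so its square root (RMSE) is usually easier to interpret. An MSE of about 10,000 corresponds to an RMSE of about 100 in whatever unit the prices were recorded in, which the thread does not state. The sketch below shows the arithmetic on made-up values chosen so that the MSE comes out at exactly 10,000; they are not the project's data.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(y_true)


# Hypothetical holdout targets and predictions (unit unspecified).
y_true = [500.0, 650.0, 820.0, 700.0]
y_pred = [600.0, 550.0, 920.0, 600.0]

mse = mean_squared_error(y_true, y_pred)  # every error is 100, so MSE is 100**2
rmse = math.sqrt(mse)                     # back in the same unit as the target
```

Whether an RMSE of that size is good depends entirely on the scale of the target, so it helps to report the unit alongside the metric.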
codereviewer 5 months ago
Nice work! What prompted you to use XGBoost over something like LightGBM, and have you tried training on a GPU to see whether it helps further?
mlfan 5 months ago
I selected XGBoost over LightGBM because it gave more accurate predictions on this specific problem. And yes, I have tried training on both CPU and GPU, and training was faster on the GPU!
dataengineering 5 months ago
Great work on showcasing your model! Have you done any work on making real-time predictions and deploying this model as a production-grade API yet?
mlfan 5 months ago
Thanks! At the moment, it's still just a prototype and I haven't deployed it as a production-grade API yet. However, I am planning to use Flask to expose it as a REST API, and I have explored using Kubernetes to orchestrate the containers so that it is ready to serve real-time prediction requests.