What do you do as a Data Scientist?
- clean the data and prepare it for analysis
- analyze data sets of various sizes from a data quality assurance perspective (missing values, errors, summary statistics)
- visualize the data (distribution charts, tables, relationships…)
- model the data
- transform the data
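
To make the tasks above a bit more concrete, here is a minimal sketch in Python with Pandas. The data set, its column names and the "age above 120 is an error" rule are all made up for illustration; a real project would use its own rules.

```python
import pandas as pd

# Hypothetical toy data set; column names and values are invented for illustration.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3, 4, 5],
    "age": [34, 41, 41, None, 29, 410],   # one missing value and one obvious error (410)
    "country": ["CZ", "DE", "DE", "CZ", None, "CZ"],
})

# Clean: drop exact duplicate rows (user 2 appears twice).
clean = raw.drop_duplicates().reset_index(drop=True)

# Quality assurance: count missing values per column and flag implausible ages.
missing_per_column = clean.isna().sum()
implausible_age = clean[(clean["age"] < 0) | (clean["age"] > 120)]

# Summary statistics for the numeric columns.
summary = clean.describe()

# Transform: treat the implausible age as missing, then impute with the median.
clean.loc[implausible_age.index, "age"] = float("nan")
clean["age"] = clean["age"].fillna(clean["age"].median())

print(missing_per_column)
print(summary)
print(clean)
```

Median imputation is only one possible choice here; whether it is acceptable depends on the domain and should be thought through like any other transformation.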
How can somebody become a Data Scientist?
- learn an open source stack for working with tabular data: Python (with the Pandas library) is preferred, or R; beyond that, it depends on the size of the data sets and the domain
- have strong domain knowledge about the problem (or work with someone who has it)
- knowing maths and statistics helps
What are the traits of an excellent Data Scientist?
- like every engineer, focuses on the product and knows what the goal is
- his work is 100% reproducible and transparent
- considers data sacred: never applies a weird transformation without thinking through the possible side effects
- is never dishonest about the analysis, never hides anything
- never trusts his intuition alone, always validates it against reality
- is able to handle whatever data source he encounters
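
Two of the traits above, reproducibility and validating intuition against reality, can be sketched in a few lines. This is only an illustration on synthetic data: the function name `holdout_error`, the seed and the linear trend are assumptions made for the example, not a prescribed method.

```python
import numpy as np

def holdout_error(seed: int) -> float:
    """Fit a line on training data and return the RMSE on held-out data."""
    rng = np.random.default_rng(seed)  # fixed seed makes the run repeatable

    # Synthetic data: a known linear trend plus noise (stands in for real data).
    x = rng.uniform(0, 10, size=200)
    y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

    # Split into a training part and a hold-out part the model never sees.
    idx = rng.permutation(200)
    train, test = idx[:150], idx[150:]

    # "Intuition": the relationship is linear. Fit on the training data only.
    slope, intercept = np.polyfit(x[train], y[train], deg=1)

    # "Reality": measure the error on the hold-out data.
    pred = slope * x[test] + intercept
    return float(np.sqrt(np.mean((y[test] - pred) ** 2)))

# Same seed, same result: anyone can rerun the analysis and get the same number.
print(holdout_error(42))
print(holdout_error(42))
```

If the hold-out error were much larger than the noise level, the "it is linear" intuition would need to be revisited.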
Do I have to know some programming language to become a Data Scientist?
- yes, Python and R rule the world
- Excel, SPSS, Tableau etc. will sooner or later limit what you can do (automation, running on a headless Linux server, size of the data set, …). Plus, they are expensive for no additional value
What kinds of companies have Data Scientist positions?
- tech companies, not surprisingly… Any company that collects data and is big enough to hire a dedicated data scientist (more than 40 people, in my humble guess). The data comes either from non-data products (user behavior, acquisition, various performance metrics, pricing models…) or from data products such as data research (e.g. surveys), evaluating the performance of processes, web analytics, …
What is the best way to get an entry-level data science job?
- know an open source stack for the job (so the company doesn’t have to invest in you)
- have hands-on experience with real data, e.g. through a Kaggle competition
- if you can program (an advantage of knowing Python), it’s usually much easier, since you can also act on the analysis
What are the main challenges of a Data Science job?
- having the data: real-world data sets are small, messy, full of empty values and errors, and lack proper description or documentation. It’s often more important to propose a better data acquisition mechanism and provide a simple analysis than to run a robust analysis on garbage data (the saying “Garbage in, garbage out” is true). It’s often not necessary to collect everything; a good subset to concentrate on is enough
- predicting how long the work will take: it’s a dark art in DS, yet all of us must meet deadlines
What do you like most about the Data Science job?
- its variety (non-routine): no two data sets are the same, there are always surprises, you never know what’s going to come
- infinite number of approaches to solve a given problem
- rapidly developing tech stack
What tools does a Data Scientist use?
- a PC/laptop with enough memory is all you need from a hardware perspective (depending on the size of the data set)
- a programming stack of his choice
- various input sources of the data/storage (e.g. SQL, CSV, HDF, Hadoop…)
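
As a rough illustration of handling more than one input source, the sketch below loads a CSV and a SQL table into Pandas and combines them. The in-memory CSV string, the SQLite database and the table and column names are all stand-ins invented for the example; real sources would be files and database servers.

```python
import io
import sqlite3

import pandas as pd

# CSV source: an in-memory string stands in for a file on disk.
csv_text = "order_id,amount\n1,10.5\n2,20.0\n3,7.25\n"
orders_csv = pd.read_csv(io.StringIO(csv_text))

# SQL source: an in-memory SQLite database stands in for a real server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (order_id INTEGER, customer TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "alice")],
)
conn.commit()
customers_sql = pd.read_sql_query("SELECT order_id, customer FROM customers", conn)
conn.close()

# Once loaded, both sources are ordinary DataFrames and can be combined.
orders = orders_csv.merge(customers_sql, on="order_id", how="left")
total_per_customer = orders.groupby("customer")["amount"].sum()
print(total_per_customer)
```

The point of the design is that the source-specific code stays at the edges; everything after loading works on the same DataFrame abstraction.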
What, in your opinion, is the future of data science?
- automation of data insights – data science as an automated service
Are there any good data science courses?
Are there any data science certifications?