We’re team Vectorspace AI, and we’re here to talk about datasets based on human language and how they can contribute to scientific discovery.



What do we do?



In general terms, we add structure to unstructured data for unsupervised Machine Learning (ML) systems. Not very glamorous or even interesting to many, but you might liken it to the glue that binds data and semi-intelligent systems.



More specifically, we build datasets and augment existing datasets with additional 'signal' for the purpose of minimizing a loss function. We do this by generating context-controlled correlation matrices. The correlation scores are derived from machine & human language processed in vector space via labeled embeddings (LBNL 2005, Google 2010).
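As a rough illustration of what a correlation matrix built from labeled embeddings can look like, here is a toy sketch, not the actual pipeline. The labels, the three-dimensional vectors, and the idea of expressing "context" as per-dimension weights are all assumptions made for the example; cosine similarity stands in for whatever scoring is really used.

```python
import math

# Hypothetical labeled embeddings: each label maps to a small made-up vector.
EMBEDDINGS = {
    "gene_a": [0.9, 0.1, 0.3],
    "gene_b": [0.8, 0.2, 0.4],
    "disease_x": [0.1, 0.9, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def correlation_matrix(embeddings, context=None):
    """Pairwise cosine scores between all labels.

    `context` is an optional list of per-dimension weights. Treating context
    control as dimension weighting is an assumption of this sketch.
    """
    labels = sorted(embeddings)
    if context is None:
        context = [1.0] * len(embeddings[labels[0]])
    weighted = {k: [x * w for x, w in zip(embeddings[k], context)] for k in labels}
    matrix = [[cosine(weighted[r], weighted[c]) for c in labels] for r in labels]
    return labels, matrix


labels, matrix = correlation_matrix(EMBEDDINGS)
```

Passing a different `context` list re-weights the dimensions before scoring, so the same entities can produce a different matrix under a different context.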



Why are we doing this?



We can help data, ML and Natural Language Processing/Understanding/Interpretation/Generation (NLP/NLU/NLI/NLG) engineers and scientists save time by testing hypotheses or running experiments a bit faster, and by giving them additional ways to interpret data. From improving music and movie recommendation systems to helping a researcher discover a hidden connection in nature, this can increase the speed of innovation and, better yet, of novel scientific breakthroughs and discoveries.



We are particularly interested in how we can get machines to trade information with one another or exchange and transact data in a way that minimizes a selected loss function.



Today we continue to work in the life sciences and the financial markets with groups including Lawrence Berkeley National Laboratory, a few internal groups at Google and a couple of hedge funds, analyzing global trends in news, research and on Reddit, similar to approaches like this [minute 39:35].



Who are we?



We started as a team of amateur epigraphists, musicians, data engineers, visualization designers, dataminers, bioinformaticians and scientists working to detect hidden relationships between objects in the domain of life sciences including relationships between human genes, diseases and therapeutics.



After spending time at Genentech and a few bioinformatics startups, we found ourselves in 2005 in the Life Sciences (since renamed Biosciences) division at Lawrence Berkeley National Lab/DOE. Our work was applied specifically to understanding how to use scientific literature to analyze genomic pathways in breast cancer, to study the effects of radiation (gamma rays found in space) on human chromosomes, and to detect hidden relationships between genes that were found to extend the lifespan of C. elegans (nematodes) to the equivalent of about 300 years in human lifespan. Google, the Buck Institute on Aging, SENS Research Foundation, Vitalik Buterin/Ethereum, Human Longevity and a few others advance research in this area. Applications like these were of interest because they connect to our ability to do things like travel in deep space for extended periods of time on our way to what may be habitable planets.



Meanwhile, back at the ranch, on Earth, we were offered another option by the US Navy's SPAWAR division, but we were booked by the DOE at the time. It was also a rabbit hole we did not want to go down. Today we work on generating what we internally like to call 'supercolumns': mini-datasets that can be appended to other datasets to form what some call alternative datasets.
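To make the idea of appending a 'supercolumn' concrete, here is a minimal sketch under stated assumptions: the tickers, the `corr_lithium` column name, the score, and the `append_supercolumn` helper are all hypothetical and only illustrate joining one extra scored column onto an existing table by key.

```python
def append_supercolumn(rows, key, column_name, scores, default=None):
    """Return new rows with one extra column looked up by `key`.

    Rows without a score get `default`. The input rows are not modified.
    """
    return [{**row, column_name: scores.get(row[key], default)} for row in rows]


# Hypothetical base dataset: one row per public company.
base = [
    {"ticker": "AAA", "close": 10.5},
    {"ticker": "BBB", "close": 22.0},
]

# Hypothetical supercolumn: a correlation score per ticker for one context.
lithium_scores = {"AAA": 0.71}

augmented = append_supercolumn(base, "ticker", "corr_lithium", lithium_scores)
```

Each supercolumn stays a separate, small mapping, so several can be appended to the same base dataset one after another.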



The datasets are correlation matrices engineered with ‘feature attributes’, or continuously valued distance calculations. These 'feature vectors' are related to today’s ‘word embedding’ methods used in Natural Language Processing (NLP). The resulting feature vectors consist of scored and ranked relationships between entities and the surrounding human language that defines their contexts. What we do is not magic; it is basic and fundamental to any machine learning effort. Still, as Andrew Ng put it, “Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.” - Machine Learning and AI via Brain simulations [via Wikipedia].



We’ve had the opportunity to operate in different knowledge domains such as finance, where we’ve developed datasets that allow for context-controlled correlations of sympathetic, parasitic and symbiotic relationships, both known and hidden, between entities such as public companies and the periodic table of elements, cryptocurrencies, human genes, diseases, phytochemicals, pharmaceuticals and other entities. So far we've received about 350 custom dataset requests in this area.



This led us down a path of developing an automated feature engineering pipeline that generates the vectors and resulting datasets described above, which can be used for unsupervised learning in ML or NLP for context-controlled clustering or possible extraction of additional insights or signals. Context control has always been central to our vision in the area of life sciences. ‘Context Adaptation’, or the ability to control the context in which data is processed by machines, has recently been called the 'third wave of AI' by groups like DARPA. An example of context control applied to content summarization algorithms can be viewed here.
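One simple way to picture context-controlled clustering is sketched below. It is an illustrative stand-in, not the pipeline described above: the scores are made up, the threshold is arbitrary, and grouping by connected components is just one easy clustering choice. The "context" lives in the score matrix itself, so scoring the same entities under a different context would yield different clusters.

```python
def cluster_by_threshold(labels, matrix, threshold):
    """Group labels whose pairwise score meets `threshold`.

    Uses union-find to build connected components over the score matrix.
    """
    parent = list(range(len(labels)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if matrix[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(find(i), []).append(label)
    return sorted(groups.values())


# Hypothetical scores for three entities under one chosen context.
labels = ["gene_a", "gene_b", "disease_x"]
scores_in_context = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]

clusters = cluster_by_threshold(labels, scores_in_context, threshold=0.8)
```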



This is important to us because extracting new patterns or signals from data can lead to hidden relationship detection, new visualizations, insights, interpretations, hypotheses and hopefully new discoveries in Life Sciences and beyond.



Our services are free for all academic use and for tier-1 commercial use. Resources and funding come from licensing, consulting and investment, with a portion donated to life sciences research and to organizations such as Refunite.org.



We’re here to answer questions related to datasets and their connection to our work in the past, present and future. We'd also like to introduce our approach to automated feature engineering of vectors for advancing datasets, and we'd like to know how we can advance this platform in ways that would be valuable for today’s data engineers, scientists and of course our friends and colleagues at r/datasets. Please feel free to ask us anything you’d like about our methods, approach or applications. If you just want to shoot the research breeze, that’s fine too.





Thanks for having us, r/askscience!



Team Vectorspace AI & Vectorspace Life Sciences



Edit: Thanks for all your great questions! Feel free to contact us anytime with follow-up questions at info@vectorspace.ai