MongoDB
MongoDB can query values inside arrays, not only match strings or integers.
Projection
Describes the shape we want the returned documents to take, i.e. which fields to include or exclude.
query = {"manufacturer": "Toyota"}
projection = {"_id": 0, "name": 1}
autos = db.autos.find(query, projection)
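A minimal runnable sketch of the same query with pymongo (the database name examples and the connection string are assumptions taken from the shell examples later in these notes):
from pymongo import MongoClient

client = MongoClient('localhost:27017')
db = client.examples  # assumed database name

query = {"manufacturer": "Toyota"}
projection = {"_id": 0, "name": 1}  # exclude _id, include only name
for auto in db.autos.find(query, projection):
    print(auto)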
mongoimport
mongoimport -d dbname -c collectionname --file input-file.json
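For example, to load a cities file into the examples database (the file name here is hypothetical):
mongoimport -d examples -c cities --file cities.json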
Operators, starting with "$"
Inequality operators:
* ($gt, $lt, $gte, $lte, $ne)
* example:
* query = {"population" : {"$gt": 5000}}
* query = {"population" : {"$gt": 5000, "$lt": 1000}}
* City names starting with X: query = {"name" : {"$gte": 'X', "$lt": 'Y'}}
* Date between: query = {"founding date" : {"$gt": datetime(1890,1,1), "$lte": datetime(1900,1,1)}}
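A runnable sketch of these queries (assuming the same cities collection; note that datetime must be imported for date comparisons, and the founding-date field name is an assumption carried over from the note above):
from datetime import datetime
from pymongo import MongoClient

db = MongoClient('localhost:27017').examples  # assumed database name

# Cities with population between 5000 and 10000
for city in db.cities.find({"population": {"$gt": 5000, "$lt": 10000}}):
    print(city.get("name"))

# Cities founded between 1890 and 1900 (field name assumed)
query = {"founding date": {"$gt": datetime(1890, 1, 1), "$lte": datetime(1900, 1, 1)}}
print(db.cities.count_documents(query))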
Regex operator
- Based on PCRE (Perl Compatible Regular Expressions library)
- Regular expression queries
db.cities.find({"motto": {"$regex" : "friendship"}}).count()
db.cities.find({"motto": {"$regex" : "[Ff]riendship"}}).count()
db.cities.find({"motto": {"$regex" : "[Ff]riendship|[Hh]appiness"}}).count()
$in operator
Allows specifying an array of possible values; a document matches if the field equals (or, for an array field, contains) any of them:
db.autos.find({"modelyears": {"$in": [1958, 1966, 1967]}})
$all operator
The field's array must contain all of these values for the document to be retrieved:
db.autos.find({"modelyears": {"$all": [1958, 1966, 1967]}})
Dot notation
You can reach into nested documents using dot notation.
db.tweets.find({"entities.hashtags": {"$ne": []}}, {"entities.hashtags.text": 1, "_id": 0})
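The same query from pymongo, printing each hashtag's text (a sketch, assuming the twitter database used in the exercises below):
from pymongo import MongoClient

db = MongoClient('localhost:27017').twitter  # assumed database name

cursor = db.tweets.find({"entities.hashtags": {"$ne": []}},
                        {"entities.hashtags.text": 1, "_id": 0})
for tweet in cursor:
    for hashtag in tweet["entities"]["hashtags"]:
        print(hashtag["text"])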
Updates
city = db.cities.findOne({"name": "munchen", "country": "germany"})
city['isoCountryCode'] = 'DEU'
db.cities.save(city)
- update expects a query document as its first parameter and an update document as its second parameter
- if we don't use $set or $unset, the matched document will be replaced entirely.
$set: sets the field to the given value; if the document does not already contain the field, it is added.
city = db.cities.update({"name" : "munchen", "country" : "germany"}, {"$set" : {"isoCountryCode": "DEU"}})
$unset: for whatever documents are matched by the query, if they contain the field, remove it.
city = db.cities.update({"name" : "munchen", "country" : "germany"}, {"$unset" : {"isoCountryCode": ""}})
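With current pymongo the deprecated update/save calls are replaced by update_one/update_many; a minimal sketch of the same operations (names assumed from the shell examples):
from pymongo import MongoClient

db = MongoClient('localhost:27017').examples  # assumed database name

# Add or overwrite the field on one matching document
db.cities.update_one({"name": "munchen", "country": "germany"},
                     {"$set": {"isoCountryCode": "DEU"}})

# Remove the field from all matching documents
db.cities.update_many({"name": "munchen", "country": "germany"},
                      {"$unset": {"isoCountryCode": ""}})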
Remove documents
drop: removes the entire collection: db.cities.drop()
remove:
remove the cities named Chicago: db.cities.remove({"name": "Chicago"})
remove cities that don't have a name: db.cities.remove({"name": {"$exists": 0}})
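In current pymongo, remove is replaced by delete_one/delete_many; a small sketch under the same assumptions:
from pymongo import MongoClient

db = MongoClient('localhost:27017').examples  # assumed database name

db.cities.delete_many({"name": "Chicago"})        # remove cities named Chicago
db.cities.delete_many({"name": {"$exists": 0}})   # remove cities without a name
# db.cities.drop()                                # drop the whole collection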
Shell
To start the shell:
mongo
use examples
db.cities.find()
db.cities.find({"gouvermentType": {"$exists" : 1}})
db.cities.find({"gouvermentType": {"$exists" : 1}}).count()
db.cities.find({"gouvermentType": {"$exists" : 1}}).pretty()
Aggregation framework
Provides built-in data analysis tools within MongoDB.
- Example: group tweets by user and count the number of tweets for each user, using the $sum accumulator to add 1 for each document.
results = db.tweets.aggregate([
    {"$group": {"_id": "$user.screen_name", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}}])
Example: Twitter tweet sources
#!/usr/bin/env python
"""
The tweets in our twitter collection have a field called "source". This field describes the application
that was used to create the tweet. Following the examples for using the $group operator, your task is
to modify the 'make_pipeline' function to identify the most used applications for creating tweets.
As a check on your query, 'web' is listed as the most frequently used application.
'Ubertwitter' is the second most used. The number of counts should be stored in a field named 'count'
(see the assertion at the end of the script).
Please modify only the 'make_pipeline' function so that it creates and returns an aggregation pipeline
that can be passed to the MongoDB aggregate function. As in our examples in this lesson, the aggregation
pipeline should be a list of one or more dictionary objects.
Please review the lesson examples if you are unsure of the syntax.
Your code will be run against a MongoDB instance that we have provided.
If you want to run this code locally on your machine, you have to install MongoDB,
download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
Please note that the dataset you are using here is a smaller version of the twitter dataset
used in examples in this lesson.
If you attempt some of the same queries that we looked at in the lesson examples,
your results will be different.
"""
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [{"$group": {"_id": "$source", "count": {"$sum": 1}}},
                {"$sort": {"count": -1}}]
    return pipeline

def tweet_sources(db, pipeline):
    return [doc for doc in db.tweets.aggregate(pipeline)]

if __name__ == '__main__':
    db = get_db('twitter')
    pipeline = make_pipeline()
    result = tweet_sources(db, pipeline)
    import pprint
    pprint.pprint(result[0])
    assert result[0] == {u'count': 868, u'_id': u'web'}
Operators
- "$project": select the fields that we want in the next stage of the pipeline
  - include fields from the original document
  - insert computed fields
  - rename fields
  - create fields that hold subdocuments
- "$match": filter, keeping only the subset of documents that match the given criteria
- "$group": group documents by a key and compute aggregate values (e.g. $sum) per group
- "$sort": sort the documents by one or more fields
- "$skip": skip a number of documents at the beginning of the results
- "$limit": limit the number of documents passed on
- "$unwind": break up a field whose value is an array, producing one copy of the document per array element (see the sketch after the ratio example below)
results = db.tweets.aggregate([
    {"$match": {"user.friends_count": {"$gt": 0},
                "user.followers_count": {"$gt": 0}}},
    {"$project": {"ratio": {"$divide": ["$user.followers_count", "$user.friends_count"]},
                  "screen_name": "$user.screen_name"}},
    {"$sort": {"ratio": -1}},
    {"$limit": 1}])
Find the person in the Brasilia time zone with 100 or more tweets and the most followers
#!/usr/bin/env python
"""
Write an aggregation query to answer this question:
Of the users in the "Brasilia" timezone who have tweeted 100 times or more,
who has the largest number of followers?
The following hints will help you solve this problem:
- Time zone is found in the "time_zone" field of the user object in each tweet.
- The number of tweets for each user is found in the "statuses_count" field.
To access these fields you will need to use dot notation (from Lesson 4)
- Your aggregation query should return something like the following:
{u'ok': 1.0,
u'result': [{u'_id': ObjectId('52fd2490bac3fa1975477702'),
u'followers': 2597,
u'screen_name': u'marbles',
u'tweets': 12334}]}
Note that you will need to create the fields 'followers', 'screen_name' and 'tweets'.
Please modify only the 'make_pipeline' function so that it creates and returns an aggregation
pipeline that can be passed to the MongoDB aggregate function. As in our examples in this lesson,
the aggregation pipeline should be a list of one or more dictionary objects.
Please review the lesson examples if you are unsure of the syntax.
Your code will be run against a MongoDB instance that we have provided. If you want to run this code
locally on your machine, you have to install MongoDB, download and insert the dataset.
For instructions related to MongoDB setup and datasets please see Course Materials.
Please note that the dataset you are using here is a smaller version of the twitter dataset used
in examples in this lesson. If you attempt some of the same queries that we looked at in the lesson
examples, your results will be different.
"""
def get_db(db_name):
    from pymongo import MongoClient
    client = MongoClient('localhost:27017')
    db = client[db_name]
    return db

def make_pipeline():
    # complete the aggregation pipeline
    pipeline = [
        {"$match": {"user.time_zone": "Brasilia",
                    "user.statuses_count": {"$gte": 100}}},
        {"$project": {"_id": 1, "screen_name": "$user.screen_name",
                      "tweets": "$user.statuses_count",
                      "followers": "$user.followers_count"}},
        {"$sort": {"followers": -1}},
        {"$limit": 1}
    ]
    #print(pipeline)
    return pipeline

def aggregate(db, pipeline):
    return [doc for doc in db.tweets.aggregate(pipeline)]

if __name__ == '__main__':
    db = get_db('twitter')
    pipeline = make_pipeline()
    result = aggregate(db, pipeline)
    #print(result)
    import pprint
    pprint.pprint(result)
    assert len(result) == 1
    assert result[0]["followers"] == 17209
Indexes
Indexes make queries faster:
db.osm.ensureIndex({"tg": 1})
For OpenStreetMap data, create a geospatial index on the coordinates (latitude and longitude), then use the $near operator to search for locations near a given point.
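A minimal pymongo sketch of that idea, assuming the location is stored as a [longitude, latitude] pair in a field called "pos" (the field and collection names are assumptions, not necessarily the course dataset's actual schema):
from pymongo import MongoClient

db = MongoClient('localhost:27017').examples  # assumed database name

# Legacy 2d index on the coordinate pair (field name "pos" is assumed)
db.osm.create_index([("pos", "2d")])

# Documents nearest to the given [longitude, latitude] point
for doc in db.osm.find({"pos": {"$near": [11.57, 48.13]}}).limit(5):
    print(doc)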